## Testing for data analysis

In a data analysis context, we want to test not only our code, as usual, but also our data (the expected schema, e.g., data types) and our statistics (expected properties of distributions, e.g., value ranges). We take a defensive programming approach and run expectation checks.

In [12]:
import pandas as pd

In [13]:
df = pd.read_csv("./Pandas Dataset - I/tidy_who.csv")

In [14]:
df.sample(10)

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
222835,Argentina,AMR,1981,,ep,m,1524
304059,Netherlands,EUR,2014,,ep,f,4554
74684,Romania,EUR,2001,825.0,sp,f,2534
27035,Madagascar,AFR,1985,,sp,m,3544
378912,Guam,WPR,2010,,rel,f,14
98083,Serbia,EUR,2007,26.0,sp,f,5564
3189,India,SEA,1989,,sp,m,14
73685,Myanmar,SEA,2004,2622.0,sp,f,2534
361310,Barbados,AMR,2000,,rel,m,5564
347318,Democratic People's Republic of Korea,SEA,2012,,rel,m,3544


## Testing code

As far as code is concerned (when we implement operations to transform data), please refer to the lesson on testing, debugging, and profiling.

In the first notebook, we came across `pd.testing.assert_frame_equal()`. Be aware that the following are also available:
- `pd.testing.assert_series_equal()`
- `pd.testing.assert_index_equal()`

In [15]:
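# Trivial illustration: an index always equals itself; in practice, compare against an expected index
pd.testing.assert_index_equal(df.index, df.index)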
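
The three `pd.testing` helpers are most useful when the result of a transformation is compared against a small, hand-written expectation. The following is a minimal sketch under stated assumptions: `add_decade` is a hypothetical transformation (it is not part of the lesson), and the two-row frame is invented for the test.

```python
import pandas as pd


def add_decade(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: return a copy with an extra 'decade' column."""
    out = frame.copy()
    out["decade"] = (out["year"] // 10) * 10
    return out


# Small hand-made input (values chosen for illustration only)
toy = pd.DataFrame({"country": ["Greece", "Romania"], "year": [2015, 2001]})

# What we expect the transformation to produce
expected = pd.DataFrame(
    {
        "country": ["Greece", "Romania"],
        "year": [2015, 2001],
        "decade": [2010, 2000],
    }
)

result = add_decade(toy)

# Whole-frame comparison: values, dtypes, column names, and index
pd.testing.assert_frame_equal(result, expected)

# Narrower checks: one column and the row index are left untouched
pd.testing.assert_series_equal(result["year"], toy["year"])
pd.testing.assert_index_equal(result.index, toy.index)
```

Each helper raises an `AssertionError` describing the first difference it finds, which tends to be more informative than a bare `assert result.equals(expected)`.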

## Testing data

In [16]:
df['year'].dtype

dtype('int64')

In [26]:
assert df['year'].dtype == 'int64'

In [20]:
df['sex'].dtype

dtype('O')

In [21]:
assert df['sex'].dtype == 'object'
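
The per-column dtype assertions above can be gathered into a single schema check. This is only a sketch: `check_schema` and `expected_schema` are hypothetical names, and the toy frame stands in for the WHO data. It uses the predicates in `pandas.api.types`, which may be less brittle than comparing against an exact dtype string.

```python
import pandas as pd
from pandas.api import types as ptypes

# Toy frame mirroring a few columns of the WHO data (illustrative rows)
toy = pd.DataFrame(
    {
        "country": ["Greece", "Greece"],
        "year": [2015, 2015],
        "cases": [86.0, None],
        "sex": ["m", "f"],
    }
)

# Hypothetical schema: column name -> predicate that the column must satisfy
expected_schema = {
    "country": ptypes.is_string_dtype,
    "year": ptypes.is_integer_dtype,
    "cases": ptypes.is_float_dtype,
    "sex": ptypes.is_string_dtype,
}


def check_schema(frame: pd.DataFrame, schema: dict) -> None:
    """Raise AssertionError if a column is absent or has an unexpected dtype."""
    missing = [column for column in schema if column not in frame.columns]
    assert not missing, f"missing columns: {missing}"
    for column, predicate in schema.items():
        assert predicate(frame[column]), (
            f"unexpected dtype for {column!r}: {frame[column].dtype}"
        )


check_schema(toy, expected_schema)
```

Running the check right after loading the data makes a changed input file fail early, before any statistics are computed.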

## Testing statistics

In [27]:
assert df['year'].max() <= 2017

In [33]:
df['year']

0         1980
1         1981
2         1982
3         1983
4         1984
          ... 
429739    2011
429740    2012
429741    2013
429742    2014
429743    2015
Name: year, Length: 429744, dtype: int64

In [31]:
assert df['cases'].min() == 0

In [32]:
df['cases'].min()

0.0

In [35]:
df['cases']

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
          ...  
429739      NaN
429740      NaN
429741    725.0
429742    718.0
429743    629.0
Name: cases, Length: 429744, dtype: float64

When datasets are large, it might be difficult to carry out exact tests (for example, using `pd.testing.assert_series_equal()`). It might then be reasonable to test for properties of a series, rather than element-wise equality.

In [34]:
df['cases'].describe()

count     81381.000000
mean        667.482496
std        4490.566875
min           0.000000
25%           3.000000
50%          28.000000
75%         200.000000
max      250051.000000
Name: cases, dtype: float64
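
Property checks of this kind can be bundled into one function. The sketch below assumes that `cases` is a count (non-negative, whole numbers) and that a large share of missing values is acceptable; `check_cases` and the threshold `max_missing_fraction` are hypothetical, and the series is a toy stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['cases'] (values chosen for illustration only)
cases = pd.Series([np.nan, 0.0, 3.0, 28.0, 200.0, np.nan], name="cases")


def check_cases(series: pd.Series, max_missing_fraction: float = 0.9) -> None:
    """Hypothetical expectation checks for a count-like column."""
    observed = series.dropna()
    assert len(observed) > 0, "no observed values"
    assert (observed >= 0).all(), "negative counts"
    assert (observed == observed.round()).all(), "non-integer counts"
    # isna().mean() is the fraction of missing entries
    assert series.isna().mean() <= max_missing_fraction, "too many missing values"


check_cases(cases)
```

None of these checks depends on the exact values, so they remain valid when new rows are appended to the dataset.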

Make use of visual checks too: For example, it is generally a lot more straightforward to spot outliers if you plot your data!

In [36]:
assert df['sex'].nunique() > 1

In [37]:
df['sex']

0         m
1         m
2         m
3         m
4         m
         ..
429739    f
429740    f
429741    f
429742    f
429743    f
Name: sex, Length: 429744, dtype: object

In [38]:
df['sex'].nunique()

2

## Handling missing data

Some data are missing, either because they exist but were not collected or because they never existed. How can we detect missing data (null values)?

In [39]:
df_sub = df[(df.country == 'Greece') & (df.year > 2014) & (df.age_range == 65)]
df_sub

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
48827,Greece,EUR,2015,,sp,m,65
102545,Greece,EUR,2015,,sp,f,65
156263,Greece,EUR,2015,,sn,m,65
209981,Greece,EUR,2015,,sn,f,65
263699,Greece,EUR,2015,,ep,m,65
317417,Greece,EUR,2015,,ep,f,65
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65


In [40]:
df_sub['cases'].isnull()

48827      True
102545     True
156263     True
209981     True
263699     True
317417     True
371135    False
424853    False
Name: cases, dtype: bool

In [41]:
df_sub['cases'].notnull()

48827     False
102545    False
156263    False
209981    False
263699    False
317417    False
371135     True
424853     True
Name: cases, dtype: bool

In [43]:
df_sub['cases'].notnull().value_counts()

False    6
True     2
Name: cases, dtype: int64
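
The same boolean masks can summarize missing values for a whole frame at once. A minimal sketch on a toy frame (the rows are illustrative, not taken from the full dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in one column
toy = pd.DataFrame(
    {
        "country": ["Greece"] * 4,
        "cases": [np.nan, np.nan, 86.0, 42.0],
        "type": ["sp", "sn", "rel", "rel"],
    }
)

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = toy.isna().sum()

# The mean of the mask is the fraction of missing values per column
missing_share = toy.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
```

`isna()` and `isnull()` are aliases in pandas, as are `notna()` and `notnull()`.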

In [44]:
df_sub['cases'].sum()

128.0
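
The sum above is 128.0 because pandas reductions skip missing values by default. The sketch below, on a toy series with the same pattern of values as `df_sub['cases']`, shows how the `skipna` and `min_count` parameters of `Series.sum()` change that behaviour.

```python
import numpy as np
import pandas as pd

# Same pattern as df_sub['cases']: six missing values, then 86 and 42
cases = pd.Series([np.nan] * 6 + [86.0, 42.0], name="cases")

# Default: missing values are skipped (skipna=True)
total = cases.sum()

# With skipna=False, a single missing value makes the result NaN
strict_total = cases.sum(skipna=False)

# Summing only missing values gives 0.0 by default ...
all_missing_total = cases.iloc[:6].sum()

# ... unless at least one observed value is required via min_count
guarded_total = cases.iloc[:6].sum(min_count=1)
```

Which behaviour is appropriate depends on whether a missing value means "zero cases" or "unknown"; the defaults silently assume the former for sums.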

In [45]:
df_sub.fillna('NA')

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
48827,Greece,EUR,2015,NA,sp,m,65
102545,Greece,EUR,2015,NA,sp,f,65
156263,Greece,EUR,2015,NA,sn,m,65
209981,Greece,EUR,2015,NA,sn,f,65
263699,Greece,EUR,2015,NA,ep,m,65
317417,Greece,EUR,2015,NA,ep,f,65
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65


In [46]:
# Fill with the number 0 rather than the string '0', so that 'cases' stays numeric
df_sub.fillna(0)

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
48827,Greece,EUR,2015,0.0,sp,m,65
102545,Greece,EUR,2015,0.0,sp,f,65
156263,Greece,EUR,2015,0.0,sn,m,65
209981,Greece,EUR,2015,0.0,sn,f,65
263699,Greece,EUR,2015,0.0,ep,m,65
317417,Greece,EUR,2015,0.0,ep,f,65
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65


In [47]:
df_sub.dropna()

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65
