
# Testing for data analysis

In data analysis, we test not only our code but also our data (its expected schema, e.g., data types) and our statistics (expected properties of distributions, e.g., value ranges). We take a defensive programming approach: we run expectation checks that fail as soon as an assumption about the data is violated.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/content/tidy_who.csv')

In [3]:
df.sample(5)

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
280246,Luxembourg,EUR,1990,,ep,f,1524
202454,Guinea,AFR,1982,,sn,f,5564
133518,Haiti,AMR,2004,,sn,m,3544
395563,Malta,EUR,1981,,rel,f,2534
196091,Marshall Islands,WPR,1997,,sn,f,4554


## Testing code

For testing the code itself (the operations we implement to transform data), please refer to the lesson on testing, debugging, and profiling.

In the first notebook, we came across `pd.testing.assert_frame_equal()`; be aware that `pd.testing.assert_series_equal()` and `pd.testing.assert_index_equal()` are also available.

In [4]:
pd.testing.assert_index_equal(df.index, df.index)
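
The check above compares an index with itself, so it always passes. As a minimal sketch of how these helpers behave on a mismatch, the example below uses a small toy frame (the values are made up for illustration and are not taken from `tidy_who.csv`).

```python
import pandas as pd

# Toy data: hypothetical values, used only to illustrate the helpers
left = pd.DataFrame({"year": [2015, 2016], "cases": [86.0, 42.0]})
right = left.copy()

# Both calls return None and raise nothing when the objects match
pd.testing.assert_series_equal(left["cases"], right["cases"])
pd.testing.assert_index_equal(left.index, right.index)

# Change one value: the same check now raises an AssertionError
right.loc[0, "cases"] = 87.0
try:
    pd.testing.assert_series_equal(left["cases"], right["cases"])
    mismatch_detected = False
except AssertionError as err:
    mismatch_detected = True
    print(err)  # the message describes where the two series differ
```

The error message is usually more informative than a bare `assert left.equals(right)`, which only reports that the comparison failed.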

## Testing data

In [5]:
df['year'].dtype

dtype('int64')

In [6]:
# 'int' can resolve to a platform-dependent width, so name the width explicitly
assert df['year'].dtype == 'int64'

In [7]:
df['sex'].dtype

dtype('O')

In [8]:
assert df['sex'].dtype == 'object'
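
Comparing a dtype against a string ties the check to one exact type. A more tolerant alternative, sketched here under the assumption that any integer width is acceptable for `year`, is to use the predicates in `pandas.api.types`. The frame `df_toy` and the set `expected_columns` are hypothetical stand-ins for the real data.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical miniature of the dataset, with one missing value in cases
df_toy = pd.DataFrame({
    "country": ["Greece", "Malta"],
    "year": [2015, 1981],
    "cases": [86.0, None],
    "sex": ["m", "f"],
})

# Check that every expected column is present
expected_columns = {"country", "year", "cases", "sex"}
missing_columns = expected_columns - set(df_toy.columns)
assert not missing_columns, f"missing columns: {missing_columns}"

# Check column types by family rather than by exact dtype name
assert ptypes.is_integer_dtype(df_toy["year"])
assert ptypes.is_float_dtype(df_toy["cases"])
# Text columns may be stored as object or as a dedicated string dtype
assert ptypes.is_object_dtype(df_toy["sex"]) or ptypes.is_string_dtype(df_toy["sex"])
```

Note that `cases` is a float column even though it holds counts: a NumPy integer column cannot represent missing values, so pandas stores it as float.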

## Testing statistics

In [9]:
assert df['year'].max() <= 2017

In [10]:
assert df['cases'].min() == 0

When datasets are large, it might be difficult to carry out exact tests (for example, using `pd.testing.assert_series_equal()`). It might then be reasonable to test for properties of a series, rather than element-wise equality.
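
As a rough sketch of what testing properties can look like, the checks below run on a short hypothetical series standing in for `df['cases']`. The ceiling `UPPER_BOUND` is an assumed value for illustration; choose one that makes sense for your data.

```python
import pandas as pd

# Hypothetical stand-in for df['cases']; values are illustrative only
cases = pd.Series([0.0, 3.0, 28.0, 200.0, None], name="cases")

# Work on the observed (non-missing) values
observed = cases.dropna()

# Counts cannot be negative
assert (observed >= 0).all()

# Counts should be whole numbers, even though the dtype is float
assert (observed % 1 == 0).all()

# Assumed plausibility ceiling (hypothetical)
UPPER_BOUND = 1_000_000
assert observed.max() <= UPPER_BOUND

# A right-skewed distribution has its median at or below its mean
assert observed.median() <= observed.mean()
```

Each of these runs in a single pass over the column and does not need a reference copy of the data to compare against.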

In [11]:
df['cases'].describe()

count     81381.000000
mean        667.482496
std        4490.566875
min           0.000000
25%           3.000000
50%          28.000000
75%         200.000000
max      250051.000000
Name: cases, dtype: float64

Make use of visual checks too: for example, outliers are generally much easier to spot when you plot your data.

In [12]:
assert df['sex'].nunique() > 1

## Handling missing data

Some data are missing, either because they exist but were not collected or because they never existed. How can we detect missing data (null values)?

In [13]:
df_sub = df[(df.country == 'Greece') & (df.year > 2014) & (df.age_range == 65)]
df_sub

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
48827,Greece,EUR,2015,,sp,m,65
102545,Greece,EUR,2015,,sp,f,65
156263,Greece,EUR,2015,,sn,m,65
209981,Greece,EUR,2015,,sn,f,65
263699,Greece,EUR,2015,,ep,m,65
317417,Greece,EUR,2015,,ep,f,65
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65


In [14]:
df_sub['cases'].isnull()

48827      True
102545     True
156263     True
209981     True
263699     True
317417     True
371135    False
424853    False
Name: cases, dtype: bool

In [15]:
df_sub['cases'].notnull()

48827     False
102545    False
156263    False
209981    False
263699    False
317417    False
371135     True
424853     True
Name: cases, dtype: bool

In [16]:
df_sub['cases'].isnull().value_counts()

True     6
False    2
Name: cases, dtype: int64

When summing data, null (missing) values are skipped by default (`skipna=True`), which has the same effect as treating them as zero.

In [17]:
df_sub['cases'].sum()

128.0
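
The default can hide the fact that values were missing at all. The sketch below, on hypothetical toy series, shows the `skipna` and `min_count` parameters of `Series.sum()`, which make missing values visible in the result.

```python
import numpy as np
import pandas as pd

# Toy series mirroring the Greek subset: mostly missing, two observed values
cases = pd.Series([np.nan, np.nan, 86.0, 42.0], name="cases")

total_default = cases.sum()             # missing values are skipped
total_strict = cases.sum(skipna=False)  # any missing value makes the sum NaN

# A series with no observed values at all
all_missing = pd.Series([np.nan, np.nan])
empty_total = all_missing.sum()               # sums to 0.0 by default
guarded_total = all_missing.sum(min_count=1)  # NaN unless at least 1 value is observed

print(total_default, total_strict, empty_total, guarded_total)
```

Using `min_count=1` is one way to distinguish "zero cases reported" from "nothing reported" when aggregating.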

In [18]:
df_sub.fillna('NA')

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
48827,Greece,EUR,2015,NA,sp,m,65
102545,Greece,EUR,2015,NA,sp,f,65
156263,Greece,EUR,2015,NA,sn,m,65
209981,Greece,EUR,2015,NA,sn,f,65
263699,Greece,EUR,2015,NA,ep,m,65
317417,Greece,EUR,2015,NA,ep,f,65
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65


In [19]:
df_sub['cases'].fillna('0')

48827        0
102545       0
156263       0
209981       0
263699       0
317417       0
371135    86.0
424853    42.0
Name: cases, dtype: object
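
Note the `dtype: object` in the output above: filling a float column with the string `'0'` mixes strings and numbers, so the column is no longer numeric. The sketch below, on a hypothetical toy series, contrasts this with filling with the number `0`.

```python
import numpy as np
import pandas as pd

# Toy series: one missing value and two observed values (hypothetical)
cases = pd.Series([np.nan, 86.0, 42.0], name="cases")

# Filling with a string mixes types, so the result is not numeric
filled_with_string = cases.fillna("0")

# Filling with a number keeps the float dtype
filled_with_number = cases.fillna(0)

print(filled_with_string.dtype)
print(filled_with_number.dtype)

# Arithmetic still works on the numeric version
total = filled_with_number.sum()
```

Whether zero is the right replacement is a modelling decision: it asserts that no cases occurred, which is not the same as no cases having been recorded.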

In [20]:
df_sub.dropna()

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
371135,Greece,EUR,2015,86.0,rel,m,65
424853,Greece,EUR,2015,42.0,rel,f,65
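
By default, `dropna()` removes a row if any of its columns is missing. A minimal sketch of restricting the check to selected columns with the `subset` parameter, using a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in different columns (hypothetical)
df_toy = pd.DataFrame({
    "country": ["Greece", "Greece", None],
    "cases": [np.nan, 86.0, 42.0],
})

# Drop rows with a missing value in any column
dropped_any = df_toy.dropna()

# Drop rows only when cases is missing; a missing country is tolerated
dropped_on_cases = df_toy.dropna(subset=["cases"])

print(len(df_toy), len(dropped_any), len(dropped_on_cases))
```

`dropna()` returns a new frame and leaves `df_toy` unchanged, so the original rows remain available for inspection.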


### Hands-on exercises

1. What type would you expect the variable `cases` to be?
2. Write an expectation check to ensure that the number of missing values for `cases` is less than the total number of observations. 
3. What is the ratio of non-null values for `cases` in regions `EUR` and `AFR` (together)?

## Reference

* Tutorial on "Best Testing Practices for Data Science" by Eric J. Ma at PyCon 2017:
https://www.youtube.com/watch?v=yACtdj1_IxE