# Unit Testing

<img src="images/test.jpg"/>

### Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100) # Show all columns when looking at dataframe

### Importing Data

In [2]:
# Load the NHANES 2015-2016 data from a local CSV file
df = pd.read_csv("data/nhanes_2015_2016.csv")
df.head()

,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,DMDMARTL,DMDHHSIZ,WTINT2YR,SDMVPSU,SDMVSTRA,INDFMPIR,BPXSY1,BPXDI1,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,1.0,2,134671.37,1,125,4.39,128.0,70.0,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,3.0,1,24328.56,1,125,1.32,146.0,88.0,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,1.0,2,12400.01,1,131,1.51,138.0,46.0,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,6.0,1,102718.0,1,131,5.0,132.0,72.0,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,3.0,5,17627.67,2,126,1.23,100.0,70.0,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [3]:
df.index = range(1,df.shape[0]+1)
df.index

RangeIndex(start=1, stop=5736, step=1)

### Goal
**We want to find the mean of 'BPXSY1' over the first 100 rows where 'RIDAGEYR' > 60.**

In [4]:
condition = df['RIDAGEYR'] > 60

df[condition].loc[0:99,'BPXSY1'].mean()

139.57142857142858

In [5]:
df[df['RIDAGEYR'] > 60].iloc[range(0,100), 16].mean()

136.29166666666666

**Which result is wrong? Let's test our code on a new DataFrame with only ten rows, so we can easily check by hand.**

In [6]:
test = pd.DataFrame({'col1': np.repeat([3,1],5), 'col2': range(3,13)}, index=range(1,11))
test

,col1,col2
1,3,3
2,3,4
3,3,5
4,3,6
5,3,7
6,1,8
7,1,9
8,1,10
9,1,11
10,1,12


**We want to find the mean of 'col2' over the first 5 rows where 'col1' > 2.**

In [7]:
condition2 = test.col1>2

test[condition2].loc[0:4,'col2'].mean()

4.5

In [8]:
test[test.col1>2].iloc[range(0,5), 1].mean()

5.0

**Both should return 5!**

In [9]:
test[condition2].loc[0:4,'col2']

1    3
2    4
3    5
4    6
Name: col2, dtype: int64

In [10]:
test[test.col1>2].iloc[range(0,5), 1]

1    3
2    4
3    5
4    6
5    7
Name: col2, dtype: int64

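Label 0 is not in the row index, because the index of `test` starts at 1. `.loc` slices by label and includes both endpoints, so `.loc[0:4]` returns only the rows labelled 1 to 4: four rows instead of five. `.iloc` slices by position, so `range(0,5)` returns the first five filtered rows, which is what we want.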
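To see label-based and position-based selection side by side, here is a minimal sketch that rebuilds the toy DataFrame from above. The variable names `filtered`, `by_label` and `by_position` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Same toy frame as above: index labels run from 1 to 10, not 0 to 9
test = pd.DataFrame({'col1': np.repeat([3, 1], 5), 'col2': range(3, 13)},
                    index=range(1, 11))
filtered = test[test.col1 > 2]

# .loc slices by label and includes both endpoints; label 0 does not exist,
# so the slice covers labels 1 to 4 only
by_label = filtered.loc[0:4, 'col2']

# .iloc slices by position and excludes the stop, so 0:5 is the first 5 rows
by_position = filtered.iloc[0:5, filtered.columns.get_loc('col2')]

print(len(by_label), by_label.mean())        # 4 rows
print(len(by_position), by_position.mean())  # 5 rows
```

Looking up the column position with `get_loc` avoids hard-coding a column number that would silently change if the columns were reordered.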

**Now let's compare with the rows of our real data:**

In [11]:
# Takes the first 100 rows (by position) among those with RIDAGEYR > 60
df[df.RIDAGEYR > 60].iloc[range(0,100), 16]

1      128.0
3      138.0
6      116.0
14     124.0
15     132.0
       ...  
331    126.0
334    160.0
339    128.0
343    144.0
345    106.0
Name: BPXSY1, Length: 100, dtype: float64

In [12]:
# Takes the rows with RIDAGEYR > 60 whose index label is between 1 and 100
df[df['RIDAGEYR'] > 60].loc[1:100,'BPXSY1']

1     128.0
3     138.0
6     116.0
14    124.0
15    132.0
22    148.0
23    140.0
30    122.0
31    146.0
32    160.0
34    120.0
36    150.0
40    124.0
44    158.0
45    144.0
46    168.0
50    134.0
54    146.0
57    196.0
61    132.0
72    138.0
75    134.0
78    164.0
79    106.0
81    150.0
82    114.0
84    142.0
90    134.0
Name: BPXSY1, dtype: float64
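A cheap guard against this kind of mistake is to check how many rows a selection actually returns before averaging it. The sketch below uses a small synthetic frame as a stand-in for the NHANES data; the frame `demo` and its values are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NHANES frame: 500 rows, index starting at 1.
# Ages repeat in the pattern 30, 45, 65, 70, so exactly half are above 60.
demo = pd.DataFrame({
    'RIDAGEYR': np.tile([30, 45, 65, 70], 125),
    'BPXSY1': np.linspace(100, 160, 500),
}, index=range(1, 501))

older = demo[demo['RIDAGEYR'] > 60]

# Label-based: filtered rows whose label lies between 1 and 100
by_label = older.loc[1:100, 'BPXSY1']

# Position-based: the first 100 filtered rows
by_position = older.iloc[:100, older.columns.get_loc('BPXSY1')]

print(len(by_label), len(by_position))

# Fails loudly if the selection is not the 100 rows we asked for
assert len(by_position) == 100
```

On this synthetic frame the label-based slice returns only half as many rows as requested, which is the same effect seen in the two outputs above.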

**So the correct version is:**

In [13]:
df[df.RIDAGEYR > 60].iloc[range(0,100), 16].mean()

136.29166666666666

**Another way, calling `pd.Series.mean` directly:**

In [14]:
pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), 16])

136.29166666666666

In [15]:
pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), df.columns.get_loc('BPXSY1')])

136.29166666666666
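The manual check on the ten-row DataFrame can be turned into a repeatable unit test. The following is a minimal sketch, not the only way to do it: the helper `mean_of_first_n` and the test function are hypothetical names introduced here, and the helper uses `head(n)` as a position-based alternative to `iloc`.

```python
import numpy as np
import pandas as pd

def mean_of_first_n(frame, mask, column, n):
    """Mean of `column` over the first `n` rows (by position) where `mask` is True."""
    return frame.loc[mask, column].head(n).mean()

def test_mean_of_first_n():
    toy = pd.DataFrame({'col1': np.repeat([3, 1], 5), 'col2': range(3, 13)},
                       index=range(1, 11))
    # First five rows with col1 > 2 hold 3, 4, 5, 6, 7
    assert mean_of_first_n(toy, toy.col1 > 2, 'col2', 5) == 5.0

    # The result must not depend on what the index labels happen to be
    shifted = toy.copy()
    shifted.index = range(100, 110)
    assert mean_of_first_n(shifted, shifted.col1 > 2, 'col2', 5) == 5.0

test_mean_of_first_n()
```

With the real data, the same helper could be called as `mean_of_first_n(df, df.RIDAGEYR > 60, 'BPXSY1', 100)`. A test runner such as pytest would collect `test_mean_of_first_n` automatically; calling it directly, as here, works in a notebook cell.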