# Book 5: Statistical tests using Pandas and Scipy

Here we will go over how to use ```pandas``` to read several ```.csv``` files, **concatenate** them into a single table, and then use ```scipy``` for statistical tests.

## Load the data
Let us use what we learned before to load the ```.csv``` tables into data frames, tagging each one with the student it came from.

In [None]:
import pandas as pd

df1 = pd.read_csv('./data/Results_01.csv')
df1['Student'] = '01'

df2 = pd.read_csv('./data/Results_02.csv')
df2['Student'] = '02'

# concatenate; ignore_index=True gives the combined table a fresh 0..n-1 index
df = pd.concat([df1, df2], ignore_index=True)
print(df.head())

# simple box plot, to visually inspect whether there are big differences between the two groups
df.boxplot(column="Area",by="Student")

## Using Scipy for statistical testing

When we do a **statistical test** we assume a **null hypothesis (H0)** and our objective is to try to **falsify** that hypothesis. In most cases **H0** states that there is no relationship (or difference) between the **distributions** being tested. To decide whether to reject **H0** we use the now famous **p-value**. The **p-value** is the probability of seeing a difference between the two sets at least as large as the one observed, if the null hypothesis were true. Therefore, a p-value of 0.04 does not mean that there is a 4% chance that **H0** is true. It means that, if **H0** were true, a difference this large would show up in only about 4% of experiments. Because 0.04 is below the conventional threshold of 0.05, we reject **H0** and report a statistically significant relationship (or difference) between the distributions being compared.
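One way to build intuition for what a p-value measures is to simulate many experiments in which H0 is true by construction and count how often the test still reports p < 0.05. The sketch below is only an illustration: the sample sizes, means and seed are made-up assumptions, and it uses stats.ttest_ind from SciPy, the same function used for the t-test at the end of this book.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_experiments = 2000
alpha = 0.05  # conventional significance threshold
p_values = np.empty(n_experiments)

for i in range(n_experiments):
    # Both samples come from the SAME normal distribution, so H0 is true
    a = rng.normal(loc=10.0, scale=2.0, size=30)
    b = rng.normal(loc=10.0, scale=2.0, size=30)
    _, p_values[i] = stats.ttest_ind(a, b)

# Fraction of experiments that look "significant" even though H0 is true
false_positive_rate = np.mean(p_values < alpha)
print(f"Fraction of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

The printed fraction should land close to 0.05: when H0 is true, about 5% of experiments cross the threshold by chance alone. That is what the threshold controls, not the probability that H0 itself is true.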

## Scipy
[Scipy](https://scipy.org/) is a big package for scientific computing commonly used in Python. It should already be included in our virtual environment as a dependency of the packages we installed previously. To check that this is the case, try to run the block below. If you run into problems, [install](https://anaconda.org/conda-forge/scipy) the package with: ```conda install -c conda-forge scipy```

In [None]:
from scipy import stats 
# if this runs without errors, we are good to continue

For clarity, let us pull the numbers of interest out of the data table by using the ```query``` method of pandas data frames. ```query``` allows us to extract values from one column while filtering based on the values of another. Below, I ```query``` for the **Area** values that belong to **Student 01** and **02** respectively.

In [None]:
Area_s1=df.query("Student == '01'")['Area']
Area_s2=df.query("Student == '02'")['Area']
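To make the selection above more concrete, here is a minimal sketch on a toy table. The values are made up for illustration and are not the course data. It shows that the query call is equivalent to the more common boolean-mask selection with .loc.

```python
import pandas as pd

# Made-up values, only for illustration (not the course data)
toy = pd.DataFrame({
    "Area": [10.5, 12.1, 9.8, 11.3],
    "Student": ["01", "01", "02", "02"],
})

# query: filter rows with a string expression, then pick a column
area_query = toy.query("Student == '01'")["Area"]

# Boolean mask: the same selection written with .loc
area_mask = toy.loc[toy["Student"] == "01", "Area"]

print(area_query.equals(area_mask))  # True
print(area_query.tolist())           # [10.5, 12.1]
```

Both forms return a pandas Series, which can be passed directly to the SciPy functions used in the rest of this book.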

One of the most common mistakes in statistical testing is to use a particular method, e.g. the t-test, without first checking that the underlying assumptions of the test hold. In general we must check the following to know which tests we can use:

* Independence: each measurement of the variables is independent of the others.

* Normality: the data follows a normal distribution (Gaussian distribution).

* Homogeneous variance: the variance (spread around the mean) within each group being compared is similar.


## Homogeneity
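Let us first test for homogeneous variance with Levene's test. Its null hypothesis is that the groups have equal variances, so a large p-value (above 0.05, say) means we found no evidence against homogeneity.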

In [None]:
# Levene's test for homogeneity of variance
stats.levene(Area_s1, Area_s2)
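The call above returns a test statistic and a p-value. As a hedged sketch of how one might read that result, the block below runs Levene's test on two synthetic groups whose spreads differ on purpose. The group names, sizes and the 0.05 threshold are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Synthetic groups (illustrative only): same mean, very different spread
narrow = rng.normal(loc=50.0, scale=1.0, size=200)
wide = rng.normal(loc=50.0, scale=5.0, size=200)

alpha = 0.05  # conventional threshold, an assumption of this sketch

# The result unpacks into (statistic, p-value)
w_stat, p_levene = stats.levene(narrow, wide)
print(f"W = {w_stat:.2f}, p = {p_levene:.3g}")

if p_levene < alpha:
    print("Evidence that the variances differ")
else:
    print("No evidence against equal variances")
```

Here the small p-value flags unequal variances. For the t-test we hope for the opposite outcome, a p-value above the threshold.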

## Normality

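Now we check whether the distributions being tested are normal (aka Gaussian). The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a small p-value is evidence against normality.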

In [None]:
# Shapiro-Wilk test for normality, Student 01
stats.shapiro(Area_s1)

In [None]:
# Shapiro-Wilk test for normality, Student 02
stats.shapiro(Area_s2)
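To see how the Shapiro-Wilk output behaves, the following sketch applies it to one sample drawn from a normal distribution and one drawn from a clearly skewed (exponential) distribution. Both samples are synthetic, and the names and the 0.05 threshold are assumptions of this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed for reproducibility

# Synthetic samples (illustrative only, not the course data)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=200),
    "skewed": rng.exponential(scale=1.0, size=200),
}

alpha = 0.05  # conventional threshold, an assumption of this sketch
shapiro_p = {}
for name, sample in samples.items():
    # The result unpacks into (statistic, p-value)
    w_stat, p_value = stats.shapiro(sample)
    shapiro_p[name] = p_value
    if p_value < alpha:
        verdict = "evidence against normality"
    else:
        verdict = "no evidence against normality"
    print(f"{name}: W = {w_stat:.3f}, p = {p_value:.3g} -> {verdict}")
```

The skewed sample gets a very small p-value. A sample that passes the test is not proven normal; we have only failed to find evidence against normality.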

## The actual t-test

Now that the assumptions seem to hold, we can run the t-test.

In [None]:
# Independent t-test
stats.ttest_ind(Area_s1, Area_s2)
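As a final hedged sketch, the block below shows one way to unpack and interpret the t-test result, using synthetic stand-ins for the two groups (the values, group names and threshold are made up for illustration). It also runs Welch's variant, selected with equal_var=False, which is a common alternative when Levene's test suggests the variances differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed for reproducibility

# Synthetic stand-ins for the two groups (illustrative values only)
group_1 = rng.normal(loc=100.0, scale=10.0, size=50)
group_2 = rng.normal(loc=120.0, scale=10.0, size=50)

alpha = 0.05  # conventional threshold, an assumption of this sketch

# Student's t-test: assumes equal variances (the default, equal_var=True)
t_stat, p_value = stats.ttest_ind(group_1, group_2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_1, group_2, equal_var=False)

print(f"Student: t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.3g}")

if p_value < alpha:
    print("Reject H0: the group means differ significantly")
else:
    print("Fail to reject H0: no significant difference found")
```

The sign of the t statistic follows the order of the arguments: it is negative here because the first group has the smaller mean.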