# Basic Statistical Testing

In [1]:
# let's import our usual libraries.
import pandas as pd
import numpy as np
# our new guest
from scipy import stats

__When we do hypothesis testing, we actually have two statements of interest:__

- the first is our actual explanation, which we call the alternative hypothesis.

- the second is that the explanation we have is not sufficient, and we call this the null hypothesis.

``Our actual testing method is to determine whether the data give us enough evidence to reject the null hypothesis. If we find a difference between groups that is unlikely to arise by chance alone, then we can reject the null hypothesis and we accept our alternative.``

In [11]:
# let's examine the idea by example:
df = pd.read_csv('data/grades.csv')
df.head(3)

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000


In [12]:
# let's see the shape of df
# we have 2315 students and 13 columns: 'student_id' plus a grade and a
# submission timestamp for each of the 6 assignments
df.shape

(2315, 13)

We'll segment those students according to the submission date:
- those who finished the first assignment by the end of December 2015 are `early finishers`.
- those who finished it some time after that are `late finishers`.

In [18]:
# first, let's convert the `assignment1_submission` column to datetime.
df['assignment1_submission'] = pd.to_datetime(df['assignment1_submission'])

# then we'll segment according to this column
early_finishers = df[df['assignment1_submission'] < '2016']

# Instead of repeating the same approach with the comparison flipped, let's
# try another one: since no student can be in both DataFrames, we'll take
# all students in `df` that are not in `early_finishers`, using the default
# index.
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head(3)

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000


There are lots of other ways to do this:
- For instance, you could just copy and paste the first projection and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to change the date down the road you have to remember to change it in two places.
____
- You could also do a join of the dataframe df with early_finishers. A left join keeps every row of the left dataframe, so the rows of df that find no match in early_finishers are the late finishers; this would have been a good answer.
____
- You also could have written a function that determines if someone is early or late, and then called .apply() on the dataframe and added a new column to the dataframe. This is a pretty reasonable answer as well.
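
As a rough sketch of the last two alternatives, here is a toy version with made-up rows (the `toy` frame, the `cutoff` variable and the `label_finisher` helper are hypothetical names for illustration, not part of the grades data). It assumes a left merge with `indicator=True` is an acceptable way to express the join idea:

```python
import pandas as pd

# Toy stand-in for the grades data (values are made up for illustration).
toy = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': pd.to_datetime(
        ['2015-11-02', '2015-12-30', '2016-01-09', '2016-04-30']),
})
cutoff = pd.Timestamp('2016-01-01')  # the date lives in one place only

# Option 1: label every row with a function and .apply().
def label_finisher(row):
    return 'early' if row['assignment1_submission'] < cutoff else 'late'

toy['finisher'] = toy.apply(label_finisher, axis=1)

# Option 2: left join with indicator=True, then keep the rows that were
# found only in the left frame (i.e. not among the early finishers).
early = toy[toy['assignment1_submission'] < cutoff]
merged = toy.merge(early[['student_id']], on='student_id',
                   how='left', indicator=True)
late_via_join = merged[merged['_merge'] == 'left_only']

print(toy['finisher'].tolist())
print(late_via_join['student_id'].tolist())
```

Both options keep the cutoff date in a single variable, which avoids the "change it in two places" problem mentioned above.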

In [19]:
# Now, let's examine whether early finishers have a higher average grade than late finishers (assignment 1)
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


OK, these look pretty similar. But are they really the same? This is where the t-test comes in. It allows us to form the alternative hypothesis `these are different` as well as the null hypothesis `these are the same`.

- When we're doing hypothesis testing, we have to choose a significance level as the threshold we're willing to accept, which is called alpha. Let's use 5% (0.05); this is a commonly used value, but it is only a convention and may not be the right choice for every question.

The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing in Python:
- we're going to use the `ttest_ind()` function which does an independent t-test (meaning the populations are not related to one another).
___
- The results of ttest_ind() are the t-statistic and a `p-value`. It's this latter value, the probability, which is most important to us, as `it indicates the chance (between 0 and 1) of seeing a difference at least as large as the one we observed if the null hypothesis were true`.
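
To make the two returned values less of a black box, here is a small sketch that recomputes them by hand and compares against `ttest_ind()`. The samples `a` and `b` are made up (seeded random draws, not the grades data), and the sketch assumes the default equal-variance form of the test:

```python
import numpy as np
from scipy import stats

# Made-up samples, seeded so the example is reproducible.
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=300)

# Pooled (equal-variance) estimate of the variance.
n1, n2 = len(a), len(b)
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: probability of a |t| at least this large under the null.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

If the two groups may have quite different variances, `ttest_ind()` also accepts `equal_var=False`, which switches to Welch's version of the test.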

In [23]:
# let's bring in our ttest_ind() from scipy and pass in our two groups.
from scipy.stats import ttest_ind 
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)

- The p-value is 0.18, which exceeds our alpha (the threshold below which we would reject the null) of 0.05.
- This means that we cannot reject the null hypothesis, which says that the two groups' average assignment 1 grades are the same.
- We don't have enough certainty in our evidence to conclude otherwise, but this doesn't mean that we've proven that the two populations are the same.

In [24]:
# why don't we check the remaining assignments for both groups?
for i in range(2, 7):
    col = f'assignment{i}_grade'
    print(ttest_ind(early_finishers[col], late_finishers[col]))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


OK, so it looks like in this data we do not have enough evidence to suggest the populations differ with respect to grade. Let's take a look at those p-values for a moment though, because they are saying things that can inform experimental design down the road:

- For instance, one of the assignments, assignment 3, has a p-value of about 0.11. This means that if we had accepted an alpha of 11%, this would have been considered statistically significant.

As a researcher, this would suggest to me that there is something here worth following up on:

- For instance, if we had a small number of participants (we don't) or if there was something unique about this assignment as it relates to our experiment (whatever it was) then there may be follow up experiments we could run.

P-values have come under fire recently for being insufficient for telling us enough about the interactions which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used more regularly:
- One issue with p-values is that as you run more tests you are likely to get a value which is statistically significant just by chance.
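
A quick back-of-the-envelope sketch of why this happens, assuming the tests are independent and every null hypothesis is actually true (the helper name `chance_of_false_positive` is made up for this example). The chance of at least one "significant" result among `m` tests is then `1 - (1 - alpha) ** m`:

```python
def chance_of_false_positive(alpha, m):
    """Probability of at least one p-value <= alpha among m independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

alpha = 0.05
for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: {chance_of_false_positive(alpha, m):.3f}")
```

With a single test the chance is just alpha, but by 20 tests at alpha = 0.05 it is already well over one half.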

In [32]:
# let's see a simulation of this, creating 2 DataFrames, each with 100 columns
# of 100 random numbers.
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])

# The list comprehension builds an array of 100 random numbers and repeats
# this 100 times, giving 100 arrays of 100 random numbers each; these
# arrays then become the rows of the DataFrame.
df1.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.894592,0.049842,0.253985,0.125167,0.697074,0.381103,0.405503,0.603949,0.622662,0.500422,...,0.461025,0.275951,0.879728,0.635666,0.930484,0.294695,0.71195,0.752545,0.574683,0.575228


Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the same as the matching column inside df2?

- Let's say our critical value is 0.1, or alpha of 10%. And we're going to compare each column in df1 to the same numbered column in df2. And we'll report when the p-value is at most 10%, which means that we have sufficient evidence to reject the null and say that the columns are different.

In [57]:
# let's create a function to do this report:
def test_columns(alpha=0.1):
    total_num_col_to_reject_the_null = 0
    # keep in mind that the 2 DataFrames have the same column index, so we can
    # iterate over the columns of both using one for loop.
    for col in df1.columns:
        # As ttest_ind() returns 2 values, let's do tuple unpacking.
        teststat, pval = ttest_ind(df1[col], df2[col])
        if pval <= alpha:
            print('col {} is statistically significant at alpha {} with pval {}'.format(col, alpha, pval))
            total_num_col_to_reject_the_null += 1
    print('\t')
    print('Total number of columns to reject the NuLL {}, which is {}%'.format(
        total_num_col_to_reject_the_null,
        total_num_col_to_reject_the_null / len(df1.columns) * 100))
test_columns()

col 8 is statistically significant at alpha 0.1 with pval 0.011201275622828858
col 14 is statistically significant at alpha 0.1 with pval 0.05892081458812326
col 16 is statistically significant at alpha 0.1 with pval 0.030765564152336108
col 31 is statistically significant at alpha 0.1 with pval 0.045815940651396984
col 40 is statistically significant at alpha 0.1 with pval 0.01008519508643523
col 41 is statistically significant at alpha 0.1 with pval 0.0327183792538498
col 44 is statistically significant at alpha 0.1 with pval 0.02750254769128424
col 47 is statistically significant at alpha 0.1 with pval 0.09794875112250545
col 48 is statistically significant at alpha 0.1 with pval 0.07477654456249958
col 74 is statistically significant at alpha 0.1 with pval 0.06601385578943071
col 75 is statistically significant at alpha 0.1 with pval 0.09918760239382138
col 82 is statistically significant at alpha 0.1 with pval 0.043728057724814175
col 87 is statistically significant at alpha 0.1 w

Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same?

- `Remember that all the t-test does is check whether two sets differ by more than chance would explain, given some significance level,` in our case, 10%.
- The more random comparisons you do, the more will just happen to look different by chance. In this example, we checked 100 columns, so we would expect roughly 10 of them to be flagged if our alpha was 0.1.
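
To see that "roughly alpha" share more clearly, here is a sketch of the same experiment with many more columns. It is not the notebook's own code: it uses a seeded NumPy generator and plain arrays instead of DataFrames (an assumption made so the run is repeatable), and relies on the `axis` argument of `ttest_ind()` to test all columns in one call:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)   # seeded so the run is repeatable
n_rows, n_cols, alpha = 100, 2000, 0.1

# Both arrays come from the same uniform distribution, so every null is true.
a = rng.random((n_rows, n_cols))
b = rng.random((n_rows, n_cols))

# axis=0 runs one t-test per column in a single vectorized call.
_, pvals = ttest_ind(a, b, axis=0)

false_positive_rate = (pvals <= alpha).mean()
print(f"share of columns flagged at alpha={alpha}: {false_positive_rate:.3f}")
```

With more columns the flagged share settles closer to alpha, which is exactly what the significance level promises when there is no real difference.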

In [58]:
# what if we change the alpha:
test_columns(0.05)

col 8 is statistically significant at alpha 0.05 with pval 0.011201275622828858
col 16 is statistically significant at alpha 0.05 with pval 0.030765564152336108
col 31 is statistically significant at alpha 0.05 with pval 0.045815940651396984
col 40 is statistically significant at alpha 0.05 with pval 0.01008519508643523
col 41 is statistically significant at alpha 0.05 with pval 0.0327183792538498
col 44 is statistically significant at alpha 0.05 with pval 0.02750254769128424
col 82 is statistically significant at alpha 0.05 with pval 0.043728057724814175
	
Total number of columns to reject the NuLL 7, which is 7.000000000000001%


So, keep this in mind when you are doing statistical tests like the t-test which has a p-value. Understand that this p-value isn't magic:
- comparing it against alpha is just a threshold for you when reporting results and trying to answer your hypothesis.
- What's a reasonable threshold? Depends on your question, and you need to engage domain experts to better understand what they would consider significant.

In [59]:
# Just for fun, let's recreate that second dataframe using a different distribution; I'll arbitrarily choose
# chi squared
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()

# now we can see that most of the columns are statistically significant.

col 0 is statistically significant at alpha 0.1 with pval 0.003027205050417967
col 1 is statistically significant at alpha 0.1 with pval 0.0014409394779935007
col 2 is statistically significant at alpha 0.1 with pval 1.0838195758967829e-06
col 3 is statistically significant at alpha 0.1 with pval 1.5260901684106276e-05
col 4 is statistically significant at alpha 0.1 with pval 0.00022331668791731346
col 5 is statistically significant at alpha 0.1 with pval 5.608306497295161e-06
col 6 is statistically significant at alpha 0.1 with pval 0.0005365096125164064
col 7 is statistically significant at alpha 0.1 with pval 0.00033698346183002063
col 8 is statistically significant at alpha 0.1 with pval 0.0003463053453323207
col 9 is statistically significant at alpha 0.1 with pval 0.001075109014313189
col 10 is statistically significant at alpha 0.1 with pval 0.0007519321497065969
col 11 is statistically significant at alpha 0.1 with pval 9.07153817408264e-06
col 12 is statistically significant a

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you
to the SciPy library, which you can use for Student's t-test. We've discussed some of the practical
issues which arise from looking for statistical significance. There's much more to learn about hypothesis
testing: for instance, there are different tests used depending on the shape of your data, and different
ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this
should give you a basic idea of where to start when comparing two populations for differences, which is a
common task for data scientists.