In this lecture, we're going to review some of the basics of statistical testing in Python.

We're going to talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-test. We use statistics in many different ways in data science, and in this lecture we want to refresh our knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of hypothesis testing is to determine whether, for instance, two different conditions in an experiment have led to different outcomes.

In [11]:
import numpy as np
import pandas as pd
from scipy import stats

SciPy sits at the center of an ecosystem of Python libraries for data science, and we'll use most or perhaps all of them. That ecosystem includes NumPy and pandas, plotting libraries such as Matplotlib, and a number of other scientific libraries. The scipy package itself provides the statistical functions we imported above as stats.

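In a hypothesis test we hold two competing statements. Ha is our actual explanation, which we call the alternative hypothesis. H0 is that the explanation we have is not sufficient, for example that there is no difference between the groups, and we call this the null hypothesis. Our testing method asks whether the data give us enough evidence to reject the null hypothesis. If we find a statistically significant difference between groups, we reject the null hypothesis in favor of our alternative; otherwise we fail to reject it, which is not the same as showing that it is true.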

In [12]:
df = pd.read_csv('resources/week-4/datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [13]:
# there are 6 assignments in the dataframe; let's start by checking how many rows and columns it has
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [14]:
# Let's segment this population into two pieces.
# Those who finish the first assignment by the end of December 2015 we'll call early finishers,
# and those who finish sometime after that we'll call late finishers.

early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [15]:
# same approach as the previous cell, with the comparison reversed
late_finishers = df[pd.to_datetime(df['assignment1_submission']) >= '2016']
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [16]:
# a smarter solution! The dataframe df and early_finishers share index values,
# so we just want everything in df which is not in early_finishers
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


We could also do a join of the DataFrame df with early_finishers. If we do a left join, we keep only the items in the left DataFrame, so this would have been a good answer too. We could also have written a function that determines whether someone is early or late, called .apply() on the DataFrame with it, and added the result as a new column.
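The two alternatives just described can be sketched roughly as follows. This is only an illustration on a small made-up DataFrame: the `toy` frame, the `label_finisher` helper, and the `finisher` column are hypothetical names, not part of the lecture's dataset or code.

```python
import pandas as pd

# Hypothetical toy data standing in for grades.csv
toy = pd.DataFrame({
    "student_id": ["a", "b", "c", "d"],
    "assignment1_submission": [
        "2015-11-02 06:55:34",
        "2016-01-09 05:36:02",
        "2015-12-13 17:06:10",
        "2016-04-30 06:50:39",
    ],
})

cutoff = pd.Timestamp("2016-01-01")

def label_finisher(row):
    """Return 'early' if assignment 1 was submitted before the cutoff, else 'late'."""
    submitted = pd.to_datetime(row["assignment1_submission"])
    return "early" if submitted < cutoff else "late"

# Option 1: apply the function row by row and store the label in a new column
toy["finisher"] = toy.apply(label_finisher, axis=1)

# Option 2: left join against the early finishers and keep the rows with no match
early = toy[toy["finisher"] == "early"]
merged = toy.merge(early[["student_id"]], on="student_id", how="left", indicator=True)
late = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(toy["finisher"].tolist())
print(late["student_id"].tolist())
```

The `indicator=True` argument adds a `_merge` column that records whether each row was found in both frames or only in the left one, which is what lets the left join pick out the late finishers.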

In [17]:
# The pandas DataFrame object has a variety of statistical functions associated with it.
# Calling mean() on a DataFrame gives the mean of each numeric column; here we call it on a single column.
# Let's compare the assignment 1 means for our two populations.

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


The SciPy library contains a number of statistical tests and forms a basis for hypothesis testing in Python. We're going to use ttest_ind(), which runs an independent t-test, meaning that the populations in the two groups are not related to one another. The result of ttest_ind() is a t-statistic and a p-value. It's this latter value, a probability between zero and one, which is most important to us: it tells us how likely we would be to see a difference at least this large if the null hypothesis were true. It is not the probability that the null hypothesis itself is true.
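To make the two numbers less of a black box, here is a minimal sketch that recomputes them by hand and compares against ttest_ind(). It assumes the default equal-variance (pooled) form of the test and uses synthetic normal samples; the names `a`, `b`, `t_manual`, and `p_manual` are illustrative and not from the lecture.

```python
import numpy as np
from scipy import stats

# Synthetic samples (an assumption for illustration, not the grades data)
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=50)
b = rng.normal(loc=74.0, scale=10.0, size=60)

n1, n2 = len(a), len(b)
dof = n1 + n2 - 2

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

result = stats.ttest_ind(a, b)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The hand-computed values should agree with SciPy's up to floating-point rounding, which shows that the p-value is a tail probability of the t distribution computed under the assumption that H0 holds.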

In [18]:
# H0: mu1 = mu2
# Ha: mu1 != mu2, alpha = 0.05

from scipy.stats import ttest_ind
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)

P-value = 0.186 > alpha = 0.05, so we do not reject H0. We can't say that the two population means are different, although failing to reject H0 does not prove that they are the same.

In [19]:
# let's check the other assignment grades
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


It looks like, in this data, we do not have enough evidence to suggest the populations differ with respect to grade.

Now, p-values have come under fire recently for not telling us enough about the interactions which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used more regularly.
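As a rough illustration of the first of those alternatives, the sketch below computes a 95% confidence interval for a difference in means. It assumes the same pooled-variance setup as the default independent t-test and uses synthetic data; `group_a`, `group_b`, and the other names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic samples (an assumption for illustration, not the grades data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=75.0, scale=10.0, size=200)
group_b = rng.normal(loc=74.0, scale=10.0, size=250)

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2

# Standard error of the difference in means, using the pooled variance
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

diff = group_a.mean() - group_b.mean()

# Critical t value leaving 2.5% in each tail
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * std_err, diff + t_crit * std_err

print("difference in means: {:.3f}".format(diff))
print("95% CI: ({:.3f}, {:.3f})".format(ci_low, ci_high))
```

Unlike a bare p-value, the interval reports both the estimated size of the difference and how uncertain that estimate is. For this pooled setup, the interval contains zero exactly when the two-sided t-test would not reject H0 at alpha = 0.05.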

One issue with p-values is that as you run more tests, you become more likely to get a result which is statistically significant just by chance. So let's see a little simulation of this.

In [20]:
# first, we create a dataframe of 100 columns, each with 100 numbers
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.599656,0.570508,0.985211,0.534333,0.590798,0.803496,0.873016,0.555947,0.654152,0.623992,...,0.029785,0.036335,0.813467,0.742804,0.037351,0.362541,0.741766,0.834536,0.145446,0.299138
1,0.328879,0.122905,0.485979,0.021392,0.995932,0.038664,0.381615,0.082461,0.348496,0.039397,...,0.347155,0.698234,0.625649,0.313903,0.187861,0.760679,0.975424,0.965591,0.231981,0.527296
2,0.869569,0.83382,0.267019,0.124396,0.538469,0.70107,0.886995,0.570527,0.092496,0.257245,...,0.997308,0.433073,0.730254,0.563633,0.666215,0.909956,0.66278,0.004443,0.45977,0.649046
3,0.303188,0.588504,0.817367,0.309785,0.774866,0.362113,0.303857,0.654966,0.924915,0.553219,...,0.895148,0.375505,0.126935,0.366783,0.284873,0.685676,0.186241,0.485114,0.451445,0.115025
4,0.280696,0.912471,0.650809,0.421407,0.397565,0.062493,0.44654,0.292049,0.698762,0.588787,...,0.783643,0.912917,0.893518,0.79345,0.582358,0.822629,0.343656,0.103811,0.128414,0.38869


In [21]:
# let's create the second dataframe
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])

# this statement means we generate 100 random numbers between 0 and 1, and repeat it for 100 times

Are these two DataFrames the same? Maybe a better question is: for a given column inside of df1, is it the same as that same column inside of df2?

Let's take a look. Let's say our significance level here is 0.1, or an alpha of 10 percent. We're going to compare each column in df1 to the same numbered column in df2, and we'll report when the p-value is less than or equal to 10 percent, which we would normally read as sufficient evidence that the columns are different.

In [23]:
# let's write a function called test_columns

def test_columns(alpha=0.1):
    num_diff = 0  # keeping track of how many differ
    for col in df1.columns:
        teststat,pval=ttest_ind(df1[col], df2[col])  # running ttest_ind between the 2 dataframes
        
        if pval <= alpha:
            print("Col {} is statistically significant different at alpha={}, pval={}".format(col, alpha, pval))
            num_diff = num_diff + 1
        
    # printing out some summary stats
    print(("Total number different was {}, which is {}%".format(num_diff, float(num_diff)/len(df1.columns)*100)))
        
test_columns()

Col 2 is statistically significant different at alpha=0.1, pval=0.09969633695485941
Col 3 is statistically significant different at alpha=0.1, pval=0.06325772577048468
Col 6 is statistically significant different at alpha=0.1, pval=0.0017544682178023518
Col 17 is statistically significant different at alpha=0.1, pval=0.04392519775135556
Col 24 is statistically significant different at alpha=0.1, pval=0.06633325846915858
Col 45 is statistically significant different at alpha=0.1, pval=0.09115896653298396
Col 82 is statistically significant different at alpha=0.1, pval=0.06311677530878561
Col 99 is statistically significant different at alpha=0.1, pval=0.08996420340221147
Total number different was 8, which is 8.0%


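We see there are a bunch of columns that are flagged as different, even though both DataFrames were drawn from the same distribution! In fact, the share of flagged columns looks a lot like the alpha value we chose.

Remember that all the t-test does is check whether two sets are similar given some level of confidence. Since the columns really do come from the same distribution, every flagged column here is a false positive, and because we checked 100 columns with an alpha of 0.1, we would expect roughly 10 of them.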
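The arithmetic behind that expectation can be sketched as follows. It assumes the 100 tests are independent and that, when the null hypothesis is true, each test is flagged with probability approximately alpha; the variable names are illustrative.

```python
import math
from scipy import stats

n_tests = 100
alpha = 0.1

# Expected number of columns flagged purely by chance
expected_false_positives = n_tests * alpha

# Probability that at least one of the tests is flagged by chance
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Central 95% range for the number of chance flags, modeled as Binomial(n_tests, alpha)
low, high = stats.binom.interval(0.95, n_tests, alpha)

print("expected false positives:", expected_false_positives)
print("P(at least one false positive): {:.5f}".format(prob_at_least_one))
print("typical range of false positives:", low, "to", high)
```

The count we observed sits comfortably inside that range, which is why it should be read as chance rather than as evidence of a real difference.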

In [24]:
# we can try other alpha values as well
test_columns(0.05)

Col 6 is statistically significant different at alpha=0.05, pval=0.0017544682178023518
Col 17 is statistically significant different at alpha=0.05, pval=0.04392519775135556
Total number different was 2, which is 2.0%


Keep in mind that the alpha value isn't magic; it's a threshold we choose for reporting results and trying to answer our hypothesis. What's a reasonable threshold? That depends on our question, and we need to engage domain experts to better understand what they would consider significant.

In [25]:
# just for fun, let's recreate that second dataframe from a different distribution; we'll arbitrarily choose chi-squared

df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])
test_columns()

Col 0 is statistically significant different at alpha=0.1, pval=0.001265625162285096
Col 1 is statistically significant different at alpha=0.1, pval=0.0008547622074273447
Col 2 is statistically significant different at alpha=0.1, pval=0.06796241246954407
Col 3 is statistically significant different at alpha=0.1, pval=0.003404428736799194
Col 4 is statistically significant different at alpha=0.1, pval=0.0027439175925184542
Col 5 is statistically significant different at alpha=0.1, pval=4.719753791592969e-06
Col 6 is statistically significant different at alpha=0.1, pval=4.808970715071064e-06
Col 7 is statistically significant different at alpha=0.1, pval=0.03580165476156797
Col 8 is statistically significant different at alpha=0.1, pval=0.003364652381062284
Col 9 is statistically significant different at alpha=0.1, pval=0.016950545893634105
Col 10 is statistically significant different at alpha=0.1, pval=0.005893108928167432
Col 11 is statistically significant different at alpha=0.1, pv

We see that all or most columns test as statistically significant at the 10% level, which is what we would expect when the two DataFrames really are drawn from different distributions.
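A quick way to see why nearly every column is flagged: the t-test compares means, and a uniform distribution on [0, 1) has mean 0.5 while a chi-squared distribution with one degree of freedom has mean 1. The sketch below checks this empirically with large samples; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Large samples so that the sample means sit close to the true means
uniform_sample = rng.random(100_000)
chisq_sample = rng.chisquare(df=1, size=100_000)

print("uniform mean:     {:.3f}".format(uniform_sample.mean()))  # close to 0.5
print("chi-squared mean: {:.3f}".format(chisq_sample.mean()))    # close to 1.0
```

With a true gap of about 0.5 between the means, a t-test on 100 values per column has a good chance of detecting the difference, so these flags are mostly true positives rather than chance.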