# Basic Statistical Testing

Hypothesis testing, statistical significance, and using SciPy to run Student's t-tests

In [1]:
# Hypothesis testing is a core data analysis activity behind experimentation.
# The goal of hypothesis testing is to determine if, for instance, the two different conditions
# we have in an experiment have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Let's bring some new libraries from scipy
from scipy import stats

In [2]:
# SciPy is a collection of scientific computing modules (statistics, optimization, linear
# algebra, signal processing and more) built on top of NumPy. It belongs to a broader
# ecosystem that also includes pandas and plotting libraries such as matplotlib.
# Here we only need its stats module.

In [3]:
# When we do hypothesis testing, we actually have two statements of interest:
# 1) Our actual explanation (alternative hypothesis)
# 2) The explanation we have is not sufficient (null hypothesis)
# Our testing method asks whether the data give us enough evidence to reject the null
# hypothesis. If we find a difference between groups that would be unlikely under the null
# hypothesis, then we reject it in favor of our alternative

# Example
df = pd.read_csv('Course1_Resources/resources/week-4/datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [5]:
# If we take a look at the dataframe, we see we have six different assignments. Let's check
# how many rows and columns this DataFrame has
print('There are {} rows and {} columns'.format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [7]:
# Let's segment this population into two pieces:
# 1) Those who finished the first assignment by the end of December 2015 -> early finishers
# 2) Those who finished it sometime after that -> late finishers


# Early finishers
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [8]:
# Late finishers

# The dataframes df and early_finishers share index values, so we really just want everything
# in the df which is not in early_finishers
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [9]:
# Other ways we could have pulled late_finishers data:

# 1) We could just copy and paste the first projection (for early_finishers) and change the sign
# from less than to greater than or equal to.
# This is ok, but if we decide we want to change the condition down the road, we have to
# remember to change it in two places.

# 2) We could also do a join of the dataframe df with early_finishers. A plain left join keeps
# every row of the left dataframe, so we would also have to mark which rows found a match in
# early_finishers and keep only the rows that did not (an anti-join).

# 3) We could also have written a function that determines if someone is early or late, and
# then called .apply() on the dataframe and added a new column to the dataframe. This is a pretty
# reasonable answer as well
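
The three alternatives listed above can be sketched on a tiny, made-up table. The `toy` DataFrame, its four rows, and the `label_finish` helper are hypothetical stand-ins for illustration, not part of the course data. All three routes should select the same late rows.

```python
import pandas as pd

# Hypothetical miniature stand-in for the grades data (values are made up)
toy = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': ['2015-11-02 06:55:34', '2016-01-09 05:36:02',
                               '2015-12-13 17:06:10', '2016-04-30 06:50:39'],
})
submitted = pd.to_datetime(toy['assignment1_submission'])
cutoff = pd.Timestamp('2016-01-01')
early = toy[submitted < cutoff]

# 1) Flip the comparison used for the early group
late_1 = toy[submitted >= cutoff]

# 2) Anti-join: left merge with an indicator column, then keep unmatched rows
merged = toy.merge(early[['student_id']], on='student_id', how='left', indicator=True)
late_2 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# 3) Label every row with a function and .apply(), then filter on the label
def label_finish(timestamp):
    return 'early' if timestamp < cutoff else 'late'

toy['finish'] = submitted.apply(label_finish)
late_3 = toy[toy['finish'] == 'late']

print(list(late_1['student_id']), list(late_2['student_id']), list(late_3['student_id']))
```

Option 3 keeps a single definition of the cutoff logic, which is the main reason it can be easier to maintain than copying the filter.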

In [14]:
# As we've seen, the pandas dataframe object has a variety of statistical functions associated
# with it. If we call the mean function directly on the dataframe, we see that the mean of
# each assignment is calculated.

# Let's compare the means for our two populations
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


In [15]:
# They look pretty similar, but are they the same?
# This is where the Student's t-test comes in. It allows us to form the alternative hypothesis
# ('These are different') as well as the null hypothesis ('These are the same') and then test
# that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how
# much of a chance we're willing to accept. This significance level is typically called alpha.
# For this example we'll use a threshold of 0.05 for our alpha or 5%.

# The SciPy library contains a number of different statistical tests and forms a basis for
# hypothesis testing in Python. We're going to use the ttest_ind() function, which does an
# independent t-test (meaning the populations are not related to one another). The result of
# ttest_ind() is the t-statistic and a p-value. The latter is the most important to us: it is
# the probability (between 0 and 1) of seeing a difference at least as large as the one we
# observed if the null hypothesis were true

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)
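
To make the two numbers in that result less opaque, here is a minimal sketch that recomputes them by hand. It assumes the default settings of `ttest_ind` (pooled variance, two-sided test) and uses synthetic samples `a` and `b` rather than the grades data. The sample sizes and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two groups of grades (assumed values, not course data)
rng = np.random.default_rng(0)
a = rng.normal(75, 10, size=200)
b = rng.normal(74, 10, size=300)

n1, n2 = len(a), len(b)
# Pooled sample variance, which assumes both groups share the same variance
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

If the equal-variance assumption is doubtful, `ttest_ind` also accepts `equal_var=False`, which runs Welch's t-test instead.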

In [16]:
# So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This
# means that we cannot reject the null hypothesis.
# The null hypothesis was that the two populations are the same, and we don't have enough
# certainty in our evidence (because the p-value is greater than alpha) to come to a conclusion
# to the contrary. This doesn't mean that we have proven the populations are the same.

In [19]:
# Let's do the same for the other assignments
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
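
The five near-identical calls can also be written as a loop over column names. The sketch below uses synthetic frames `early_toy` and `late_toy` (hypothetical names with made-up values) so that it runs on its own. With the real data, the loop body would use `early_finishers` and `late_finishers` instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (assumed values, not course data)
rng = np.random.default_rng(1)
cols = ['assignment{}_grade'.format(i) for i in range(1, 7)]
early_toy = pd.DataFrame(rng.normal(70, 10, size=(50, 6)), columns=cols)
late_toy = pd.DataFrame(rng.normal(70, 10, size=(60, 6)), columns=cols)

# Run one independent t-test per assignment column and collect the results
results = {}
for col in cols:
    stat, pval = ttest_ind(early_toy[col], late_toy[col])
    results[col] = (stat, pval)
    print('{}: statistic={:.3f}, pvalue={:.3f}'.format(col, stat, pval))
```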


In [20]:
# It looks like in this data we do not have enough evidence to suggest the populations differ
# with respect to grade.
# Let's take a look at those p-values for a moment though, because they are saying things that
# can inform experimental design down the road. For instance, one of the assignments, assignment3,
# has a p-value around 0.1. This means that if we had accepted an alpha of 11%, this result
# would have been considered statistically significant. As a researcher, this would suggest to
# us that there is something here worth considering following up on. For instance, if we had a
# small number of participants (we don't) or if there was something unique about this assignment
# as it relates to our experiment (whatever it was) then there may be follow-up experiments we
# could run

In [21]:
# P-values have come under fire recently for being insufficient for telling us enough about the
# interactions which are happening, and two other techniques are used more regularly:
# 1) Confidence intervals
# 2) Bayesian analyses
# One issue with p-values is that as we run more tests we are more likely to get a value which
# is statistically significant just by chance.
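
A short calculation shows how quickly this problem grows. Assuming the tests are independent and every null hypothesis is true, the chance of at least one false positive among m tests is 1 - (1 - alpha)^m. The Bonferroni threshold in the last column (alpha / m) is one common correction. It is not used elsewhere in this notebook and is shown only for comparison.

```python
alpha = 0.05

# Chance of at least one false positive among m independent tests
family_wise_error = {}
for m in (1, 5, 20, 100):
    family_wise_error[m] = 1 - (1 - alpha) ** m
    # Bonferroni: a stricter per-test threshold that keeps the overall rate near alpha
    bonferroni_alpha = alpha / m
    print('m={:>3}  P(at least one false positive)={:.3f}  Bonferroni alpha={:.4f}'
          .format(m, family_wise_error[m], bonferroni_alpha))
```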

In [22]:
# Let's do a simulation

# First, let's create a dataframe of 100 columns, each with 100 numbers
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.574578,0.486894,0.614057,0.871077,0.595803,0.140474,0.755176,0.947483,0.797821,0.249667,...,0.188646,0.618,0.851562,0.281242,0.962265,0.87135,0.149782,0.413393,0.13476,0.913617
1,0.660848,0.539889,0.69609,0.758549,0.76498,0.675419,0.843624,0.227383,0.780584,0.398803,...,0.070447,0.529838,0.178456,0.564001,0.921022,0.114015,0.53716,0.039685,0.008441,0.429835
2,0.171825,0.55696,0.560651,0.770705,0.887793,0.012007,0.446093,0.219365,0.381416,0.843338,...,0.896627,0.731103,0.103085,0.674335,0.417921,0.838006,0.043738,0.520789,0.713151,0.257989
3,0.214486,0.259801,0.953401,0.154619,0.436209,0.303767,0.828925,0.234883,0.212205,0.960843,...,0.962075,0.423579,0.485904,0.674097,0.769102,0.05837,0.666614,0.466591,0.781447,0.792416
4,0.169171,0.564308,0.107787,0.367551,0.231512,0.900281,0.916403,0.004906,0.334245,0.175155,...,0.189882,0.110647,0.949153,0.136265,0.393109,0.135044,0.527724,0.076202,0.964158,0.59173


In [24]:
# Let's create a second dataframe
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.306694,0.856214,0.23387,0.09539,0.268243,0.602391,0.514826,0.665825,0.324226,0.093358,...,0.091902,0.163168,0.507776,0.877632,0.511004,0.061386,0.203383,0.245351,0.817687,0.604729
1,0.084745,0.822845,0.228898,0.759191,0.720582,0.881199,0.452223,0.70734,0.627031,0.881582,...,0.924154,0.293461,0.209229,0.061855,0.637155,0.150724,0.679581,0.774905,0.640804,0.628542
2,0.17401,0.614159,0.606195,0.750629,0.60172,0.787631,0.944687,0.200345,0.376063,0.375533,...,0.029669,0.994324,0.582281,0.492855,0.90348,0.285502,0.869681,0.988094,0.448714,0.235352
3,0.243558,0.043569,0.597477,0.895024,0.97344,0.272478,0.146766,0.08246,0.142003,0.551292,...,0.089192,0.01644,0.912215,0.731795,0.155218,0.297399,0.089658,0.792494,0.386667,0.093219
4,0.70599,0.34498,0.392427,0.671091,0.171971,0.835246,0.480922,0.921043,0.484116,0.706116,...,0.263467,0.692288,0.439943,0.419389,0.306972,0.632332,0.607314,0.770004,0.049712,0.214919


In [33]:
# Are these two DataFrames the same?
# Better question: For a given column inside of df1, is it the same as the column inside df2?

# Let's take a look
# Let's set our critical value as 0.1, or an alpha of 10%. We are going to compare each column
# in df1 to the same numbered column in df2. And we'll report when the p-value is less than or
# equal to 10%, which means that we have sufficient evidence to say that the columns are different.

# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # Keep track of how many columns differ
    num_diff = 0
    # Iterate over the columns
    for col in df1.columns:
        # Run ttest_ind on the matching column of each dataframe
        teststat, pval = ttest_ind(df1[col], df2[col])
        # Check the p-value against alpha
        if pval <= alpha:
            print('Col {} is statistically significantly different at alpha = {}, pval = {}'
                  .format(col, alpha, pval))
            num_diff += 1
    # Print out the summary once all columns have been tested
    print('Total number different was {}, which is {}%'
          .format(num_diff, (float(num_diff) / len(df1.columns)) * 100))

# Let's run it
test_columns()

Col 1 is statistically significantly different at alpha = 0.1, pval = 0.02495000306320323
Col 10 is statistically significantly different at alpha = 0.1, pval = 0.012125140376841884
Col 15 is statistically significantly different at alpha = 0.1, pval = 0.0856130193696212
Col 19 is statistically significantly different at alpha = 0.1, pval = 0.0179357084070721
Col 34 is statistically significantly different at alpha = 0.1, pval = 0.0851482007601295
Col 47 is statistically significantly different at alpha = 0.1, pval = 0.07202591402735758
Col 74 is statistically significantly different at alpha = 0.1, pval = 0.006017309309742663
Col 75 is statistically significantly different at alpha = 0.1, pval = 0.014436121403061636
Col 86 is statistically significantly different at alpha = 0.1, pval = 0.0705872778804611
Col 87 is statistically significantly different at alpha = 0.1, pval = 0.08639196202681966
Col 99 is statistically significantly different at alpha = 0.1, pval = 0.03869214438642864
T

In [34]:
# Interesting, so we see that there are a bunch of columns that are different!
# In fact, that number looks a lot like the alpha value we chose. So what's going on? Shouldn't
# all of the columns be the same?
# Remember that all the t-test does is check whether two sets differ by more than we would
# expect from chance alone, given some significance level, in our case 10%.
# The more random comparisons we do, the more will just happen to look different by chance.
# In this case we checked 100 columns, so we would expect roughly 10 of them to be flagged if
# our alpha was 0.1

# Testing other values of alpha
test_columns(alpha = 0.05)

Col 1 is statistically significantly different at alpha = 0.05, pval = 0.02495000306320323
Col 10 is statistically significantly different at alpha = 0.05, pval = 0.012125140376841884
Col 19 is statistically significantly different at alpha = 0.05, pval = 0.0179357084070721
Col 74 is statistically significantly different at alpha = 0.05, pval = 0.006017309309742663
Col 75 is statistically significantly different at alpha = 0.05, pval = 0.014436121403061636
Col 99 is statistically significantly different at alpha = 0.05, pval = 0.03869214438642864
Total number different was 6, which is 6.0%
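
The same experiment can be repeated at a larger scale to check that the share of flagged columns tracks alpha. This is a minimal sketch, assuming a seeded NumPy generator and 10,000 column pairs instead of 100. The names `x`, `y`, and `flagged_fraction` are illustrative. Passing `axis=0` makes `ttest_ind` run one test per column in a single call.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_tests = 10000

# Two independent tables drawn from the same uniform distribution,
# so every "significant" column is a false positive
x = rng.random((100, n_tests))
y = rng.random((100, n_tests))

# One t-test per column pair
_, pvals = ttest_ind(x, y, axis=0)

flagged_fraction = {}
for alpha in (0.1, 0.05, 0.01):
    flagged_fraction[alpha] = (pvals <= alpha).mean()
    print('alpha={}: fraction flagged={:.4f}'.format(alpha, flagged_fraction[alpha]))
```

With this many tests, each flagged fraction should sit close to its alpha, which is exactly what a false positive rate of alpha means.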


In [35]:
# We have to keep this in mind when we are doing statistical tests like the t-test which has a
# p-value.
# This p-value isn't magic; alpha is just a threshold for us when reporting results and trying
# to answer our hypothesis.
# What's a reasonable threshold? It depends on our question, and we need to engage domain experts
# to better understand what they would consider significant

# Just for fun, let's recreate that second dataframe using a different distribution (df1 was
# drawn from a uniform distribution). We'll arbitrarily use chi squared
df2 = pd.DataFrame([np.random.chisquare(df = 1, size = 100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha = 0.1, pval = 0.0052630813025101395
Col 1 is statistically significantly different at alpha = 0.1, pval = 0.0006474969848327646
Col 2 is statistically significantly different at alpha = 0.1, pval = 0.0011769999716757654
Col 3 is statistically significantly different at alpha = 0.1, pval = 0.0020452616837123272
Col 4 is statistically significantly different at alpha = 0.1, pval = 0.00032276791981351676
Col 5 is statistically significantly different at alpha = 0.1, pval = 0.02908677240933548
Col 6 is statistically significantly different at alpha = 0.1, pval = 0.0005147933380455382
Col 7 is statistically significantly different at alpha = 0.1, pval = 0.012794847196842618
Col 8 is statistically significantly different at alpha = 0.1, pval = 3.375246468466319e-05
Col 9 is statistically significantly different at alpha = 0.1, pval = 0.0030308361985538824
Col 10 is statistically significantly different at alpha = 0.1, pval = 0.000193060

In [None]:
# We see that all or most columns test as statistically significantly different at the 10% level
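
One way to see why nearly every column is flagged is to compare the means of the two distributions, since the t-test compares means. The sketch below assumes large seeded samples, and the variable names are illustrative. A uniform distribution on [0, 1) has mean 0.5, while a chi-squared distribution with one degree of freedom has mean 1.

```python
import numpy as np

rng = np.random.default_rng(7)

# Large samples from the two distributions used for df1 and the new df2
uniform_sample = rng.random(100_000)
chisq_sample = rng.chisquare(df=1, size=100_000)

uniform_mean = uniform_sample.mean()
chisq_mean = chisq_sample.mean()
print('uniform mean: {:.3f}'.format(uniform_mean))
print('chi-squared (df=1) mean: {:.3f}'.format(chisq_mean))
```

Because the underlying means really do differ, rejecting the null hypothesis here is the correct outcome and not a chance artifact.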