---
#Basic Statistical Testing
---

In this lecture we're going to review some of the basics of statistical testing in Python. We're going to talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

##Hypothesis Testing

We use statistics in a lot of different ways in data science. In this lecture, I want to refresh your knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of hypothesis testing is to determine whether two different conditions in an experiment lead to different outcomes.

In [1]:
# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats

Now, the SciPy ecosystem is an interesting collection of libraries for data science, and you'll use most or perhaps all of them. It includes NumPy and pandas, plotting libraries such as Matplotlib (which we'll use next), and the SciPy library itself, which provides a number of scientific functions.

When we do hypothesis testing, we actually have two statements of interest: the first is our actual explanation, which we call the alternative hypothesis ($H_1$), and the second is that the explanation we have is not sufficient, which we call the null hypothesis ($H_0$). Our actual testing method is to determine whether the data give us enough evidence to reject the null hypothesis. If we find a difference between groups that is unlikely to have arisen by chance, then we reject the null hypothesis in favor of our alternative.

Let's see an example of this using some grade data:

In [None]:
# Mount the drive.
# from google.colab import drive
# drive.mount('/content/drive')

# Commenting these lines out since I run the notebook locally.

In [2]:
df = pd.read_csv("data/grades.csv")
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


If we take a look at the DataFrame's content, we see that we have six different assignments. Let's look at some summary statistics of this DataFrame.

In [3]:
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


For the purpose of this lecture, let's divide this population into two groups:

1.   "early finishers": those who finished the first assignment by the end of December 2015
2.   "late finishers": those who finished the first assignment after December 2015.

To do this, it is useful to convert `assignment1_submission` from a regular string to a datetime value, which we can do with pandas' [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.
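
As a small illustration of the conversion, here is a sketch on a made-up three-row DataFrame. The `toy` frame, its values, and the `cutoff` name are hypothetical and are not taken from `grades.csv`. It shows the strings being converted and then compared against an explicit cutoff date.

```python
import pandas as pd

# Hypothetical toy data that mimics the format of assignment1_submission
toy = pd.DataFrame({
    "assignment1_submission": [
        "2015-11-02 06:55:34.282000000",
        "2015-12-31 23:59:59.000000000",
        "2016-01-09 05:36:02.389000000",
    ]
})

# Convert the strings to pandas datetime values
submitted = pd.to_datetime(toy["assignment1_submission"])

# Compare against an explicit cutoff: midnight on 1 January 2016
cutoff = pd.Timestamp("2016-01-01")
is_early = submitted < cutoff

print(submitted.dtype)
print(is_early.tolist())
```

Using a named `pd.Timestamp` rather than a bare string makes the boundary explicit, which helps if the same date is needed in more than one place.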



In [4]:
early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


Now that you're skilled with pandas, how would you go about getting the late_finishers dataframe?

Here's my solution. We know that the dataframe `df` and the `early_finishers` share index values, so we really just want everything in the `df` which is not in `early_finishers`.

In [5]:
late_finishers=df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


Let's take a minute and think of all the other ways we could have also done this.

1. You could just copy and paste the first projection and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to change the date down the road you have to remember to change it in two places.
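2. You could also join the dataframe `df` with `early_finishers`. A plain left join keeps every row of the left dataframe, so you would also need a way to tell which rows found no match (for example, the `indicator` option of `merge()`) and keep only those. This would have been a good answer.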
3. You also could have written a function that determines if someone is early or late, and then called `.apply()` on the dataframe and added a new column to the dataframe. This is a pretty reasonable answer as well.
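
To make the second and third alternatives concrete, here is a minimal sketch on hypothetical toy data. The `toy` frame, the `cutoff` variable, the `finish_label` function and the `finisher` column are all made up for illustration and are not part of the lecture's notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the grades DataFrame
toy = pd.DataFrame({
    "student_id": ["a", "b", "c", "d"],
    "assignment1_submission": pd.to_datetime(
        ["2015-11-02", "2016-01-09", "2015-12-13", "2016-04-30"]
    ),
})
cutoff = pd.Timestamp("2016-01-01")  # assumed single place where the date lives
early = toy[toy["assignment1_submission"] < cutoff]

# Option 2: left join with an indicator column, then keep rows found only in `toy`
merged = toy.merge(early[["student_id"]], on="student_id", how="left", indicator=True)
late_by_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Option 3: label every row with a function, then filter on the new column
def finish_label(submitted):
    return "early" if submitted < cutoff else "late"

labeled = toy.assign(finisher=toy["assignment1_submission"].apply(finish_label))
late_by_apply = labeled[labeled["finisher"] == "late"]

print(late_by_join["student_id"].tolist())
print(late_by_apply["student_id"].tolist())
```

Both approaches select the same students here. The labeled column has the side benefit of keeping the group membership available for later steps such as a `groupby`.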

As you've seen, the pandas DataFrame object has a variety of statistical functions associated with it. If we call the `mean()` function directly on the dataframe, we see that the mean of each assignment is calculated. Let's compare the means for our two populations.

In [6]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065
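
For reference, here is a sketch of the DataFrame-level `mean()` mentioned above, on hypothetical toy data. One assumption worth flagging: when a frame also holds string columns (such as IDs or submission dates stored as text), recent pandas versions need `numeric_only=True` so that only the numeric columns are averaged.

```python
import pandas as pd

# Hypothetical toy frame with one string column and two numeric columns
toy = pd.DataFrame({
    "student_id": ["a", "b", "c"],
    "assignment1_grade": [90.0, 80.0, 70.0],
    "assignment2_grade": [85.0, 75.0, 65.0],
})

# Mean of every numeric column at once
column_means = toy.mean(numeric_only=True)
print(column_means)

# Mean of a single column, as in the cell above
single_mean = toy["assignment1_grade"].mean()
print(single_mean)
```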


Ok, these look pretty similar. But are they the same? Let's say that we have a hypothesis that early finishers perform better than late finishers. In order to test it, we need to see whether the two means are "significantly" different or not.

This is where the **Student's t-test** comes in. It allows us to form the alternative hypothesis ($H_1$="The two means are different") as well as the null hypothesis ($H_0$="The two means are the same") and then *test the null hypothesis*. As such, we can come to two conclusions:

1. *reject the null hypothesis*, i.e. conclude that we have enough evidence that the two means are different
2. *fail to reject the null hypothesis*.

When doing hypothesis testing, we also have to choose a significance level as a threshold for how much of a chance we're willing to accept. That is, what probability of rejecting the null hypothesis when it is actually true are we willing to tolerate? This significance level is typically called alpha. For this example, let's use an alpha of 0.05, or 5%, which is the most commonly used value, though it's really quite arbitrary.
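
Written out for this lecture's example, and assuming we read "the two means" as the population mean grades of the two groups (the symbols $\mu_{\text{early}}$ and $\mu_{\text{late}}$ are notation introduced here for illustration), the hypotheses and the significance level look like this:

$$
H_0: \mu_{\text{early}} = \mu_{\text{late}} \qquad H_1: \mu_{\text{early}} \neq \mu_{\text{late}}
$$

$$
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) = 0.05
$$

Note that $H_1$ as stated here is two-sided: it says the means differ, without saying which one is larger.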

The SciPy library contains a number of different [statistical tests](https://docs.scipy.org/doc/scipy/reference/stats.html), some of which form a basis for hypothesis testing. In order to do an independent t-test, we use SciPy's [`ttest_ind()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) function. The `ttest_ind()` function returns the t-statistic and a p-value. It's this latter value which is most important to us: it is the probability (between 0 and 1) of seeing a difference at least as extreme as the one we observed if the null hypothesis were true. It is not the probability that the null hypothesis itself is true.
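
To show where the two returned numbers come from, here is a sketch that recomputes them by hand on synthetic data and compares them with `ttest_ind()`. The samples, seed, and variable names are arbitrary assumptions for illustration. The calculation assumes the function's default settings (a two-sided test with equal variances in both groups).

```python
import numpy as np
from scipy import stats

# Hypothetical samples; the seed, means and sizes are arbitrary choices
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=300)

# Library result (two-sided, equal variances assumed by default)
result = stats.ttest_ind(a, b)

# Manual computation of the pooled-variance t-statistic
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

print(result.statistic, t_manual)
print(result.pvalue, p_manual)
```

If the two groups may have different variances, passing `equal_var=False` to `ttest_ind()` runs Welch's t-test instead, which drops the pooled-variance assumption.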

In [7]:
# Let's bring in the ttest_ind function
from scipy import stats

# Let's run this function with our two populations, looking at the assignment 1 grades
stats.ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721598, pvalue=0.18618101101713855)

So here we see that the p-value is 0.19, which is above our alpha value of 0.05. As such, we *fail to reject the null hypothesis*.

The null hypothesis was that the two populations have the same mean, and we don't have strong enough evidence (because the p-value is greater than alpha) to come to a conclusion to the contrary. *Note that this also doesn't mean that we have proven the populations are the same.*

In [8]:
# Why don't we check the other assignment grades?
for n in range(2, 7):
    # Build the column name for assignments 2 through 6
    col = 'assignment{}_grade'.format(n)
    print(stats.ttest_ind(early_finishers[col], late_finishers[col]))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227867)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with respect to grade. Let's take a look at those p-values for a moment though, because they are saying things that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a p-value around 0.1. This means that if we had chosen an alpha of 0.11 instead of 0.05, this result would have been considered statistically significant. As a researcher, this would suggest to me that there is something here worth following up on. For instance, if we had a small number of participants (we don't) or if there was something unique about this assignment as it relates to our experiment (whatever it was), then there may be follow-up experiments we could run.
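
The decision rule itself is just a comparison of each p-value with alpha. The sketch below applies it to the six p-values reported in the outputs above. The `decide` helper and the `reported_pvalues` dictionary are hypothetical names introduced for illustration.

```python
def decide(pvalue, alpha=0.05):
    """Return the conclusion of a test given its p-value and alpha."""
    if pvalue <= alpha:
        return "reject H0"
    return "fail to reject H0"

# p-values copied from the outputs above, keyed by assignment number
reported_pvalues = {
    1: 0.18618101101713855,
    2: 0.2108889627004424,
    3: 0.10679998102227867,
    4: 0.960388729789337,
    5: 0.9579012739746492,
    6: 0.9075854011989656,
}

# With alpha = 0.05, and then with the looser alpha = 0.11 discussed above
at_005 = {n: decide(p) for n, p in reported_pvalues.items()}
at_011 = {n: decide(p, alpha=0.11) for n, p in reported_pvalues.items()}

print(at_005)
print(at_011)
```

Only assignment 3 changes its conclusion between the two thresholds, which is why it stands out as a candidate for follow-up.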

P-values have come under fire recently for not telling us enough about the effects we're studying, and two other techniques, confidence intervals and Bayesian analyses, are being used more regularly. One issue with p-values is that as you run more tests you become more likely to get a result which is statistically significant just by chance.

In [9]:
# Let's see a simulation of this. First, let's create a dataframe of 100 columns, each with 100 numbers between 0 and 1
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.5671,0.781043,0.393838,0.581919,0.763398,0.269889,0.360209,0.844134,0.94774,0.37495,...,0.886204,0.82725,0.27904,0.200217,0.978078,0.018792,0.885409,0.808276,0.905506,0.046967
1,0.110776,0.796641,0.299574,0.937844,0.778204,0.040028,0.084418,0.094288,0.135638,0.3245,...,0.781789,0.476162,0.56276,0.352403,0.66663,0.012462,0.57924,0.99616,0.46526,0.751418
2,0.445499,0.328784,0.555929,0.314703,0.902298,0.685417,0.222744,0.28609,0.240504,0.160481,...,0.89305,0.504484,0.56481,0.958745,0.58963,0.388107,0.675017,0.188406,0.06809,0.314795
3,0.186999,0.211178,0.150249,0.831502,0.757596,0.461316,0.701185,0.028163,0.571327,0.503119,...,0.566458,0.044573,0.584139,0.505537,0.652345,0.102802,0.091165,0.855625,0.565393,0.973469
4,0.168697,0.843332,0.814919,0.199094,0.901533,0.50064,0.142504,0.467241,0.339646,0.417728,...,0.385902,0.105018,0.712415,0.035936,0.739278,0.497347,0.729172,0.462999,0.833294,0.437451


Pause for a minute and reflect -- do you understand the list comprehension and how I created this DataFrame?

You don't have to use a list comprehension to do this, but you should be able to read this and figure out how it works as this is a commonly used approach on web forums.
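
If the list comprehension is hard to read, here is a sketch of two equivalent ways to build the same kind of DataFrame. The names `rows`, `df_loop` and `df_vectorized` are made up for illustration. The first version spells the comprehension out as an ordinary loop, and the second asks NumPy for the whole block of numbers in one call.

```python
import numpy as np
import pandas as pd

# The comprehension builds a list of 100 arrays; each array becomes one row
rows = []
for _ in range(100):
    rows.append(np.random.random(100))  # 100 uniform values in [0, 1)
df_loop = pd.DataFrame(rows)

# A vectorized alternative: ask NumPy for the whole 100 x 100 block at once
df_vectorized = pd.DataFrame(np.random.random((100, 100)))

print(df_loop.shape, df_vectorized.shape)
```

Because the values are random, the two frames will not hold the same numbers, but they have the same shape and are drawn from the same distribution.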

In [10]:
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.467375,0.263148,0.459955,0.787021,0.529175,0.554542,0.025789,0.036341,0.306165,0.242293,...,0.303917,0.347498,0.867375,0.18073,0.354352,0.071389,0.949731,0.925626,0.09102,0.955486
1,0.83,0.645734,0.232797,0.200017,0.550347,0.615246,0.306348,0.897222,0.593727,0.56878,...,0.446213,2.2e-05,0.161915,0.119789,0.810554,0.69557,0.043975,0.508048,0.03329,0.998085
2,0.649112,0.353875,0.3362,0.41106,0.031867,0.551017,0.452063,0.631763,0.114834,0.691712,...,0.260139,0.344919,0.507921,0.397932,0.774288,0.092614,0.455228,0.036959,0.561909,0.856593
3,0.394503,0.569563,0.84214,0.779259,0.882388,0.496112,0.839899,0.03995,0.295744,0.498701,...,0.699278,0.121214,0.903396,0.606076,0.98692,0.879012,0.339516,0.836628,0.039606,0.405938
4,0.336656,0.233226,0.459436,0.738735,0.463224,0.985554,0.563798,0.664995,0.097652,0.513295,...,0.872963,0.732382,0.190057,0.515701,0.479545,0.811149,0.450356,0.698104,0.264875,0.545774


Are these two DataFrames the same? Maybe a better question is, is a given column inside of df1 the same as the column inside df2?

Let's take a look. Let's set our significance level, alpha, to 0.1 (10%). Then let's compare each column in `df1` to the same numbered column in `df2`. We'll report when the p-value is less than or equal to 10%, which the test treats as sufficient evidence that the columns are different.

In [11]:
# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=stats.ttest_ind(df1[col],df2[col])
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()

Col 10 is statistically significantly different at alpha=0.1, pval=0.009589711358292874
Col 16 is statistically significantly different at alpha=0.1, pval=0.07816130123296282
Col 17 is statistically significantly different at alpha=0.1, pval=0.04931347947833014
Col 21 is statistically significantly different at alpha=0.1, pval=0.058010577695205484
Col 41 is statistically significantly different at alpha=0.1, pval=0.07591646613002086
Col 53 is statistically significantly different at alpha=0.1, pval=0.02251353345024437
Col 58 is statistically significantly different at alpha=0.1, pval=0.006596802898284635
Col 81 is statistically significantly different at alpha=0.1, pval=0.01786751621090758
Col 85 is statistically significantly different at alpha=0.1, pval=0.02765595706980199
Total number different was 9, which is 9.0%


Interesting, so we see that there are a bunch of columns that are different! In fact, the percentage looks a lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember that all the t-test does is check whether two samples differ by more than we would expect from chance, given some significance level, in our case 10%. The more random comparisons you do, the more will just happen to look different by chance. In this example, we checked 100 columns, so with an alpha of 0.1 we would expect roughly 10 of them to be flagged.
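
The arithmetic behind that expectation can be sketched as follows, under the assumption that the tests are independent of each other. The simulation part uses an arbitrary seed and 2000 pairs of columns (both numbers are choices made for this illustration) to check that the fraction of flagged columns hovers near alpha when both samples come from the same distribution.

```python
import numpy as np
from scipy import stats

alpha = 0.1
num_tests = 100

# Expected number of false positives when every null hypothesis is true
expected_false_positives = alpha * num_tests

# Chance of at least one false positive, assuming the tests are independent
prob_at_least_one = 1 - (1 - alpha) ** num_tests

# Simulation: many pairs of samples drawn from the same uniform distribution
rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
x = rng.random((100, 2000))      # 2000 columns of 100 values each
y = rng.random((100, 2000))
pvalues = stats.ttest_ind(x, y, axis=0).pvalue
false_positive_rate = np.mean(pvalues <= alpha)

print(expected_false_positives, prob_at_least_one, false_positive_rate)
```

With 100 tests at alpha 0.1, seeing at least one "significant" column is close to certain even though nothing real is going on.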

In [14]:
# We can test some other alpha values as well
test_columns(0.05)

Col 10 is statistically significantly different at alpha=0.05, pval=0.009589711358292874
Col 17 is statistically significantly different at alpha=0.05, pval=0.04931347947833014
Col 53 is statistically significantly different at alpha=0.05, pval=0.02251353345024437
Col 58 is statistically significantly different at alpha=0.05, pval=0.006596802898284635
Col 81 is statistically significantly different at alpha=0.05, pval=0.01786751621090758
Col 85 is statistically significantly different at alpha=0.05, pval=0.02765595706980199
Total number different was 6, which is 6.0%


So, keep this in mind when you are doing statistical tests like the t-test which report a p-value. Understand that this p-value isn't magic: it is a number you compare against a threshold (alpha) that you choose when reporting results and trying to answer your hypothesis. What's a reasonable threshold? That depends on your question, and you need to engage domain experts to better understand what they would consider significant.

In [15]:
# Just for fun, let's recreate that second dataframe using a non-normal distribution; I'll arbitrarily choose
# chi squared
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=0.0015613074813677088
Col 1 is statistically significantly different at alpha=0.1, pval=0.0003509666027877339
Col 2 is statistically significantly different at alpha=0.1, pval=0.00016195511309266658
Col 3 is statistically significantly different at alpha=0.1, pval=0.0001633105957447988
Col 4 is statistically significantly different at alpha=0.1, pval=0.016341496210788824
Col 5 is statistically significantly different at alpha=0.1, pval=0.004731614455866453
Col 6 is statistically significantly different at alpha=0.1, pval=1.4510323961708793e-05
Col 7 is statistically significantly different at alpha=0.1, pval=0.03208755170966381
Col 8 is statistically significantly different at alpha=0.1, pval=0.012191235751458156
Col 9 is statistically significantly different at alpha=0.1, pval=0.00010596992941577062
Col 10 is statistically significantly different at alpha=0.1, pval=5.395781066274648e-06
Col 11 is statistically significa

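Now we see that all or most columns test to be statistically significant at the 10% level. That makes sense: the uniform values in `df1` have an expected mean of 0.5, while a chi-squared distribution with one degree of freedom has an expected mean of 1, so the columns really do differ.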
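
A quick way to see why the test flags these columns is to compare the sample means of the two distributions directly. This is a sketch with an arbitrary seed and sample size, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability

# Large samples from each distribution used in the lecture
uniform_sample = rng.random(100_000)              # uniform on [0, 1)
chisq_sample = rng.chisquare(df=1, size=100_000)  # chi-squared, 1 degree of freedom

print(uniform_sample.mean(), chisq_sample.mean())
```

The uniform sample averages close to 0.5 and the chi-squared sample close to 1, so here the t-test is picking up a real difference in means rather than a chance one.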

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical issues which arise from looking for statistical significance. There's much more to learn about hypothesis testing: for instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this should give you a basic idea of where to start when comparing two populations for differences, which is a common task for data scientists.