# Distributions in Pandas

In [1]:
import pandas as pd
import numpy as np

Here we ask for a number from the NumPy binomial distribution. We have two parameters to pass in. The first is the number of trials we want to run. The second is the probability that each trial comes up one, which we will use to represent heads here (a zero represents tails). Let's run one round of this simulation.

In [2]:
np.random.binomial(1, 0.5)

0

What if we run the simulation a thousand times and divide the result by a thousand? We see a number pretty close to 0.5, which means that about half of the time we had heads and half of the time we had tails.

In [None]:
np.random.binomial(1000, 0.5)/1000
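To make the link between the single call and a thousand individual flips explicit, here is a small sketch. It assumes a seeded `np.random.default_rng` generator, which the cells above do not use, so that the result is repeatable; the variable names are illustrative only.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (an assumption, not part of the lecture)
rng = np.random.default_rng(42)

# One call that counts the ones across 1,000 fair flips
heads_count = rng.binomial(1000, 0.5)

# The same experiment written as 1,000 separate single-flip trials
flips = rng.binomial(1, 0.5, size=1000)

print(heads_count / 1000)  # fraction of ones from the aggregated call
print(flips.mean())        # fraction of ones from the individual flips
```

Both printed fractions should land near 0.5, though they come from separate random draws and will not match each other exactly.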

Of course, an evenly weighted binomial distribution is only one simple example. We can also have unevenly weighted binomial distributions. For instance, what's the chance that we'll have a tornado today while I'm filming? It's pretty low, even though we do get tornadoes here. So maybe there's a hundredth of a percent chance. We can put this into a binomial distribution as a weighting in NumPy. If we run this 100,000 times, we see there are pretty minimal tornado events.

In [None]:
chance_of_tornado = 0.01/100
np.random.binomial(100000, chance_of_tornado)

So what's the chance of this happening two days in a row? 

In [None]:
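chance_of_tornado = 0.01  # note: 1% per day here, not the 0.01% used above

# One draw per day: 1 means a tornado occurred that day
tornado_events = np.random.binomial(1, chance_of_tornado, 1000000)

two_days_in_a_row = 0
# Start at 1 so each day is compared with the day before, up to and including the last day
for j in range(1, len(tornado_events)):
    if tornado_events[j]==1 and tornado_events[j-1]==1:
        two_days_in_a_row+=1

print('{} tornadoes back to back in {:.0f} years'.format(two_days_in_a_row, 1000000/365))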
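As a cross-check on the loop, the same count can be done with array slicing and compared against a back-of-the-envelope expectation. This is only a sketch: it assumes days are independent, so each adjacent pair of days is a double-tornado pair with probability p squared, and it uses a seeded generator that the cell above does not.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability (assumption)
p = 0.01
n_days = 1_000_000
events = rng.binomial(1, p, n_days)

# Pair each day with the previous one and count pairs where both are 1
vectorized = int(np.sum((events[1:] == 1) & (events[:-1] == 1)))

# Same count with an explicit loop, for comparison
looped = 0
for j in range(1, len(events)):
    if events[j] == 1 and events[j - 1] == 1:
        looped += 1

# With independent days there are n_days - 1 adjacent pairs, each hitting with p**2
expected = (n_days - 1) * p ** 2

print(vectorized, looped, round(expected, 1))
```

The two counts should agree exactly, and both should sit in the neighborhood of the expected value of roughly 100.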

The point here, though, is that modern computational power allows us to very quickly simulate the effects of different parameters in a distribution, leading to a new way of understanding the problem. You don't have to work out all the math; quite often you can simulate the problem instead and observe the results.

In [None]:
np.random.uniform(0, 1)

In [None]:
np.random.normal(0.75)

Formula for standard deviation
$$\sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \overline{x})^2}$$

Let's just walk through how we would write this up. Let's draw 1,000 samples from a normal distribution with an expected value of 0.75 and a standard deviation of 1. Then we calculate the actual mean using NumPy's `mean` function. The part inside the summation says $x_i - \overline{x}$, where $x_i$ is the current item in the list and $\overline{x}$ is the mean. So we calculate the difference, square the result, and sum all of these. Finally, we divide by the number of items $N$ and take the square root.

In [None]:
distribution = np.random.normal(0.75,size=1000)

np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))

Now we don't normally have to do all this work ourselves, but I wanted to show you how you can sample from the distribution, create a precise programmatic description of a formula, and apply it to your data. But for standard deviation, which is just one particular measure of variability, NumPy has a built-in function that you can apply, called `std`.

In [None]:
np.std(distribution)
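One detail worth checking is which version of the formula `np.std` uses. By default it divides by N, matching the formula above; passing `ddof=1` divides by N - 1 instead, which gives the sample standard deviation. A short sketch, using a seeded generator as an assumption for repeatability:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(0.75, 1.0, size=1000)

# The formula written out by hand: divide the summed squares by N
manual = np.sqrt(np.sum((sample - np.mean(sample)) ** 2) / len(sample))

population_std = np.std(sample)       # ddof=0 by default: divide by N
sample_std = np.std(sample, ddof=1)   # divide by N - 1 instead

print(manual, population_std, sample_std)
```

With 1,000 values the two conventions differ only slightly, but the gap grows as the sample gets smaller.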

There are a couple more measures of a distribution that are interesting to talk about. One of these is the shape of the tails of the distribution, and this is called the kurtosis. We can measure the kurtosis using the statistics functions in the SciPy package. A negative value means the curve is slightly more flat than a normal distribution, and a positive value means the curve is slightly more peaky than a normal distribution. Remember that we aren't measuring the kurtosis of the distribution per se, but of the thousand values which we sampled out of the distribution. This is a subtle but important distinction.

In [None]:
import scipy.stats as stats
stats.kurtosis(distribution)
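To see the difference between a distribution and a sample drawn from it, the sketch below measures the kurtosis of several independent samples. `stats.kurtosis` reports excess kurtosis by default (Fisher's definition, where a normal distribution scores 0), so the values scatter around zero rather than landing on it. The seed and the number of samples are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)

# Five independent samples of 1,000 values from the same normal distribution
samples = [rng.normal(0.75, 1.0, size=1000) for _ in range(5)]
excess = [stats.kurtosis(s) for s in samples]

# fisher=False reports Pearson's definition, where a normal distribution scores 3
pearson = [stats.kurtosis(s, fisher=False) for s in samples]

for e, p in zip(excess, pearson):
    print(round(e, 3), round(p, 3))
```

Each sample gives a slightly different value, and the two definitions always differ by exactly 3.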

We could also move away from the normal distribution and push the peak of the curve one way or the other. This is called the skew.

In [None]:
stats.skew(distribution)

The Chi Squared distribution has only one parameter, called the degrees of freedom. The degrees of freedom is closely related to the number of samples that you take from a normal population. It's important for significance testing. But what I would like you to observe is that as the degrees of freedom increases, the shape of the Chi Squared distribution changes. In particular, the peak, which starts out on the left with a long tail to the right, begins to move towards the center. We can observe this through simulation.

First we'll sample 10,000 values from a Chi Squared distribution with 2 degrees of freedom. We can see that the skew is quite large.

In [None]:
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)

If we resample with 5 degrees of freedom, we see that the skew has decreased significantly.

In [None]:
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
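The simulated values can be compared with the known closed form: a Chi Squared distribution with k degrees of freedom has skewness equal to the square root of 8/k. The sketch below is an illustration only; the extra case of 50 degrees of freedom, the larger sample size, and the seed are all choices made for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)

results = {}
for k in (2, 5, 50):
    sample = rng.chisquare(k, size=100_000)
    theoretical = np.sqrt(8 / k)      # closed-form skewness for k degrees of freedom
    simulated = stats.skew(sample)    # skewness measured on the sample
    results[k] = (theoretical, simulated)
    print(k, round(theoretical, 3), round(simulated, 3))
```

The simulated skew should track the theoretical value and shrink towards zero as the degrees of freedom grow.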

In [3]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

output = plt.hist([chi_squared_df2,chi_squared_df5], bins=50, histtype='step', 
                  label=['2 degrees of freedom','5 degrees of freedom'])
plt.legend(loc='upper right')





# Hypothesis Testing

We can load a file called grades.csv. If we take a look at the data frame inside, we see we have six different assignments, each with a submission time, and it looks like there are just under 3,000 entries in this data file.

In [None]:
df = pd.read_csv('grades.csv')

In [None]:
df.head()

In [None]:
len(df)

For the purpose of this lecture, let's segment this population into two pieces: those who finish the first assignment by the end of December 2015 and those who finish it sometime after that.

In [None]:
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
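The comparison above works on plain strings, which sorts correctly as long as the submission column holds ISO-style timestamps. One subtlety: if the values include a time of day, a submission made on 31 December itself sorts after the bare string '2015-12-31' and so falls into the late group. The sketch below shows an alternative using parsed timestamps and an explicit cutoff. The rows are hypothetical stand-ins, not values from grades.csv, and the `_demo` names are invented for this example.

```python
import pandas as pd

# Hypothetical stand-in rows; the real grades.csv is not reproduced here
df_demo = pd.DataFrame({
    'assignment1_submission': [
        '2015-11-02 06:55:34', '2015-12-31 23:10:00',
        '2016-01-01 00:05:12', '2016-02-15 14:30:45',
    ],
    'assignment1_grade': [92.7, 86.8, 85.5, 71.6],
})

# Parse the strings into real timestamps
submitted = pd.to_datetime(df_demo['assignment1_submission'])

# Everything before midnight on 1 January 2016 counts as "by the end of December 2015"
cutoff = pd.Timestamp('2016-01-01')
early_demo = df_demo[submitted < cutoff]
late_demo = df_demo[submitted >= cutoff]

print(len(early_demo), len(late_demo))
```

Using a strict cutoff at the start of January keeps late-evening submissions on 31 December in the early group.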

In [None]:
early.mean(numeric_only=True)  # skip the text columns such as submission times

In [None]:
late.mean(numeric_only=True)  # skip the text columns such as submission times

There are slight differences, though. It looks like by the end of the six assignments, the early users are doing better by about a percentage point.

The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing in Python. A T test is one way to compare the means of two different populations. In the SciPy library, the `ttest_ind` function will compare two independent samples to see if they have different means.

We want to compare the grades for the first assignment between the two populations. We can run a T test by passing these two series into the `ttest_ind` function. The result is a tuple with a test statistic and a p-value. The p-value here is much larger than our alpha of 0.05, so we cannot reject the null hypothesis, which is that the two populations have the same mean.

In more lay terms, we would say that there's no statistically significant difference between these two sample means.

In [None]:
from scipy import stats
stats.ttest_ind?

In [None]:
stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])
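To show how the returned pair is typically used, here is a minimal sketch on synthetic data. The two groups are hypothetical samples drawn from the same normal distribution, not the grades from the file, and the group sizes and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical grade samples drawn from the same distribution
group_a = rng.normal(75, 10, size=1200)
group_b = rng.normal(75, 10, size=1100)

result = stats.ttest_ind(group_a, group_b)
statistic, pvalue = result  # the result unpacks into the test statistic and the p-value

alpha = 0.05
reject_null = pvalue < alpha
print(statistic, pvalue, reject_null)
```

By default `ttest_ind` assumes the two groups have equal variance; passing `equal_var=False` runs Welch's version of the test instead.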

Let's check with assignment two grade. 

In [None]:
stats.ttest_ind(early['assignment2_grade'], late['assignment2_grade'])

No, that's much larger than 0.05 too. How about with assignment three?

In [None]:
stats.ttest_ind(early['assignment3_grade'], late['assignment3_grade'])

Well, that's much closer, but still beyond our threshold value. It's important to stop here and talk about a serious process problem with how we're handling this investigation of the difference between these two populations. When we set alpha to be 0.05, we're saying that we expect a positive result 5% of the time just due to chance, even when there is no real difference.

As we run more and more T tests, we're more likely to find a positive result just because of the number of T tests we have run. 
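This effect is easy to see by simulation. The sketch below repeatedly runs a T test on two samples drawn from the same distribution, so every significant result is a false positive, and then computes the chance of at least one false positive across k independent tests as 1 - (1 - alpha)^k. The sample sizes, the number of repetitions, and the seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha = 0.05
n_tests = 2000

# Both samples come from the same distribution, so any "hit" is a false positive
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, size=100)
    b = rng.normal(0, 1, size=100)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_tests
print(false_positive_rate)

# Chance of at least one false positive across k independent tests
family_wise = {k: 1 - (1 - alpha) ** k for k in (1, 3, 6, 20)}
for k, prob in family_wise.items():
    print(k, round(prob, 3))
```

The measured rate should sit near 5%, while the chance of at least one false positive climbs quickly as more tests are run.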

When a data scientist runs many tests in this way, it's called p-hacking or dredging, and it's a serious methodological issue. P-hacking results in spurious correlations instead of generalizable results. There are a couple of different ways you can deal with p-hacking. The first is called the Bonferroni correction. In this case, you simply tighten your alpha value, the threshold of significance, based on the number of tests you're running. So if you choose 0.05 with 1 test and you want to run 3 tests, you reduce alpha by multiplying 0.05 by one-third to get a new value of about 0.0167. I personally find this approach to be very conservative. Another option is to hold out some of your data for testing to see how generalizable your result is. In this case, we might take half of our data for each of the two data frames and run our T tests with that, form specific hypotheses based on the results of these tests, then run very limited tests on the rest of the data.
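The Bonferroni arithmetic can be sketched in a few lines. The p-values below are invented purely for illustration and are not results from the grades data; the test names are placeholders.

```python
n_tests = 3
alpha = 0.05

# Bonferroni: divide the significance threshold by the number of tests
corrected_alpha = alpha / n_tests
print(round(corrected_alpha, 4))

# Hypothetical p-values, invented purely for illustration
p_values = {'test_a': 0.30, 'test_b': 0.04, 'test_c': 0.01}

# Compare each p-value against the original and the corrected threshold
decisions = {
    name: (p < alpha, p < corrected_alpha)
    for name, p in p_values.items()
}
for name, (raw, corrected) in decisions.items():
    print(name, raw, corrected)
```

A p-value such as 0.04 passes the uncorrected threshold but not the corrected one, which is the sense in which the correction is conservative.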

More Information [here](https://fivethirtyeight.com/features/science-isnt-broken/).

A nice German-language article on "p-faith" (blind belief in the p-value) can be found
[here](http://www.sueddeutsche.de/wissen/wissenschaft-das-magische-p-1.3676252).