<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>

<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Lesson 1*

# Statistics, Probability and Inference

## Learning Objectives
* [Part 1](#p1): Normal Distribution Revisited
* [Part 2](#p2): Student's T Test
* [Part 3](#p3): Hypothesis Test & Doing it Live

## What is Descriptive Statistics?

<https://statistics.laerd.com/statistical-guides/descriptive-inferential-statistics.php>

In [None]:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,4,6,8,10]})
df.head()

In [None]:
# How can we quickly look at some descriptive statistics of the above dataframe?
df.describe()

## What is Inferential Statistics?

![stats](https://slideplayer.com/slide/5130463/16/images/2/Statistical+Inference.jpg)

## Hypothesis Testing (T-Tests)



Ever thought about how long it takes to make a pancake? Have you ever compared the cooking time of a pancake on each eye (burner) of your stove? Is the cooking time different between the different eyes? Now, we can run an experiment and collect a sample of 1,000 pancakes on one eye and another 800 pancakes on the other eye. Assume we used the same pan, batter, and technique on both eyes. Our average cooking times were 180 (5 std) and 178.5 (4.25 std) seconds, respectively. Now, we can tell those numbers are not identical, but how confident are we that those numbers are practically the same? How do we know the slight difference isn't caused by some external randomness?

Yes, today's lesson will help you figure out how long to cook your pancakes (*theoretically*). Experimentation is up to you; otherwise, you have to accept my data as true. How are we going to accomplish this? With probability, statistics, inference and maple syrup (optional).

<img src="https://images.unsplash.com/photo-1541288097308-7b8e3f58c4c6?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=3300&q=80" width=400>



## Normal Distribution Revisited

What is the Normal distribution? It is a probability distribution of a continuous, real-valued random variable. Its properties are what make the *Central Limit Theorem* so useful: the theorem says that the means of sufficiently large samples are approximately normally distributed, and if we can assume a variable follows the normal distribution, we can draw conclusions based on probabilities.
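As a rough illustration of the Central Limit Theorem, the sketch below draws many samples from a skewed (exponential) distribution and looks at how their means behave. The distribution, sample size, and seed are arbitrary choices made for this demonstration and are not part of the lesson's data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

n = 50              # size of each sample (assumed for illustration)
n_samples = 10_000  # number of samples drawn

# Exponential(scale=1) is strongly skewed, with mean 1 and standard deviation 1
draws = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# The sample means cluster around the population mean,
# with a spread close to sigma / sqrt(n)
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std(ddof=1):.3f}")
print(f"sigma / sqrt(n):      {1 / np.sqrt(n):.3f}")
```

Plotting `sample_means` with a histogram should show a roughly bell-shaped curve even though the underlying data are far from normal.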

In [None]:
import numpy as np

mu = 0 # mean
sigma = 1 # standard deviation
sample = np.random.normal(mu, sigma, 1000)

In [None]:
import seaborn as sns
from matplotlib import style

style.use('fivethirtyeight')

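# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest equivalent
ax = sns.histplot(sample, color='r', kde=True, stat='density')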
ax.axvline(np.percentile(sample,97.5),0)
ax.axvline(np.percentile(sample,2.5),0);

![The Normal Distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Empirical_Rule.PNG/350px-Empirical_Rule.png)
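The percentages in the empirical rule can be checked directly against the normal CDF. This is a small sketch using `scipy.stats.norm`; it assumes a standard normal (mean 0, standard deviation 1), but the proportions hold for any normal distribution.

```python
from scipy.stats import norm

# Probability mass of a standard normal within k standard deviations of the mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sigma: {p:.4f}")

# The central 95% is bounded by roughly +/- 1.96 standard deviations
z_crit = norm.ppf(0.975)
print(f"z for central 95%: {z_crit:.3f}")
```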

When we talk about **population parameters**, we use $\mu$ and $\sigma$ for the mean and standard deviation.

When we talk about **sample statistics**, we use $\bar{x}$ and $s$.

    


#### Our first 2-sample t-test (pancake example)
1) Null Hypothesis: (boring hypothesis)

$\bar{x}_1 = \bar{x}_2$

Or that the average cooking time between the two burners is the same.

2) Alternative Hypothesis: (the opposite of the null)

$\bar{x}_1 \neq \bar{x}_2$

$\bar{x}_1 - \bar{x}_2 \neq 0$

3) Confidence Level: The probability of seeing a true result in spite of random variability. (How often do I want to make sure that I'm right?) Typically: 95%, 99%, 99.9%

In [None]:
from scipy.stats import ttest_ind

In [None]:
# sample 1
mu1 = 180 # mean of the distribution we draw sample 1 from
sigma1 = 5 # standard deviation
sample1 = np.random.normal(mu1, sigma1, 1000)
sample1[:5]

In [None]:
# sample 2
mu2 = 178.5 # mean of the distribution we draw sample 2 from
sigma2 = 4.25 # standard deviation
sample2 = np.random.normal(mu2, sigma2, 800)
sample2[:5]

In [None]:
# compare
results=ttest_ind(sample1, sample2)
results

In [None]:
# what is the p-value?
print(results[1])
print('{:f}'.format(results[1]))
print('{:.15f}'.format(results[1]))

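4) T Statistic: A value that falls along a t-distribution.

Think of it as a vertical bar at a specific position on our t-distribution.

5) P-value: The probability of getting a test result (t-statistic) at least as extreme as ours due to random chance alone, assuming the null hypothesis is true.

The threshold we compare the p-value against is the significance level $\alpha$ = (1 - Confidence Level), or in our case: .05

Note that the p-value is not the probability of our null hypothesis being true.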
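To make the t-statistic and p-value less of a black box, here is a minimal sketch that computes both by hand for a pooled-variance two-sample test and compares them with `ttest_ind`. The data are simulated with a fixed seed (an assumption for repeatability), so the numbers will not match the cells above exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(180, 5, 1000)      # simulated burner 1 cooking times
b = rng.normal(178.5, 4.25, 800)  # simulated burner 2 cooking times

n1, n2 = len(a), len(b)
# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))  # standard error of the difference in means

t_manual = (a.mean() - b.mean()) / se
dof = n1 + n2 - 2
# Two-sided p-value: area in both tails of the t-distribution beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```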

6) Conclusions:

Due to observing a t-statistic of 8.9 and a resulting p-value of .00000000000000000109 (your exact values will vary with the random draw), we reject the null hypothesis that the cooking times of these two burners are the same, and suggest the alternative hypothesis that they are different.

(Because our p-value was less than .05, we reject the null hypothesis).
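Since the pancake example is stated in terms of summary statistics, the same test can also be run without simulating any data. The sketch below uses `ttest_ind_from_stats` with the means, standard deviations, and sample sizes quoted in the story. Because the simulated samples are random draws, the t-statistic here will differ somewhat from the one obtained from `sample1` and `sample2`.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics quoted in the pancake example
result = ttest_ind_from_stats(
    mean1=180.0, std1=5.0, nobs1=1000,
    mean2=178.5, std2=4.25, nobs2=800,
)
print(result.statistic, result.pvalue)

alpha = 0.05  # 1 - confidence level
reject_null = result.pvalue < alpha
print("reject null:", reject_null)
```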

# Why do we use the t-distribution in hypothesis tests?

![t-distribution-low-dof](https://lambdachops.com/img/t-distribution-low-dof.png)

![t-distribution-approximates-normal](https://github.com/ryanallredblog/ryanallredblog.github.io/blob/master/img/t-distribution-approximates-normal.png?raw=true)

### Helpful video on why we use the t-distribution

<https://www.youtube.com/watch?v=Uv6nGIgZMVw>

However, in order to understand it you'll need to understand what a z-score is:

A z-score measures the number of standard deviations an observation lies from the population mean. The problem is that in real-world situations we don't know what the population mean is, so we have to use the sample mean to estimate it. Because the sample mean is computed from a sample and estimates the population mean with some level of uncertainty, it also has its own distribution and spread. This means that for small sample sizes our estimates of both the population mean and the population standard deviation are not very precise; they're kind of spread out. It's this spread that makes the t-distribution wider than the normal distribution for small sample sizes. However, the larger the sample size, the more closely the t-distribution approximates the normal distribution.
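One way to see the wider tails is to compare critical values. This sketch prints the two-sided 95% cutoff for the normal distribution and for t-distributions with a few degrees of freedom; the particular list of degrees of freedom is an arbitrary choice for illustration.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-sided 95% critical value for the normal
print(f"normal:       {z_crit:.3f}")

dofs = [2, 5, 10, 30, 100, 1000]
t_crits = [t.ppf(0.975, dof) for dof in dofs]
for dof, crit in zip(dofs, t_crits):
    print(f"t (df={dof:>4}): {crit:.3f}")
```

The t cutoffs are always larger than the normal one and shrink toward it as the degrees of freedom grow.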


## Student's T Test

>Assuming data come from a Normal distribution, the t test provides a way to test whether the sample mean (that is the mean calculated from the data) is a good estimate of the population mean. 

The derivation of the t-distribution was first published in 1908 by William Gosset while working for the Guinness Brewery in Dublin. Due to proprietary issues, he had to publish under a pseudonym, and so he used the name Student.

The t-distribution is essentially a distribution of means of normally distributed data. When we use a t-statistic, we are checking that a mean falls within a certain $\alpha$ probability of the mean of means.

In [None]:
t_df10 = np.random.standard_t(df=10, size=10)
t_df100 = np.random.standard_t(df=100, size=100)
t_df1000 = np.random.standard_t(df=1000, size=1000)

In [None]:
sns.kdeplot(t_df10, color='r');
sns.kdeplot(t_df100, color='y');
sns.kdeplot(t_df1000, color='b');

In [None]:
i = 10
for sample in [t_df10, t_df100, t_df1000]:
    print(f"t - distribution with {i} degrees of freedom")
    print("---" * 10)
    print(f"Mean: {sample.mean()}")
    print(f"Standard Deviation: {sample.std()}")
    print(f"Variance: {sample.var()}")
    i = i*10

Why is it different from normal? To better reflect the tendencies of small data and situations with unknown population standard deviation. In other words, the normal distribution is still the nice pure ideal (thanks to the central limit theorem), but the t-distribution is much more useful in many real-world situations.
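The sample statistics printed above are noisy, especially for the 10-point sample, so it can help to look at the theoretical values too. For more than 2 degrees of freedom the variance of a t-distribution is df / (df - 2), which this short sketch confirms with `scipy.stats.t`.

```python
from scipy.stats import t

for dof in (10, 100, 1000):
    theoretical_var = t.var(dof)  # equals dof / (dof - 2) when dof > 2
    print(f"df={dof:>4}: variance={theoretical_var:.4f}, "
          f"std={theoretical_var ** 0.5:.4f}")
```

The variance is always above 1 (the variance of the standard normal) and approaches 1 as the degrees of freedom increase.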

## Live Lecture - let's perform and interpret a t-test

We'll generate our own data, so we can know and alter the "ground truth" that the t-test should find. We will learn about p-values and how to interpret "statistical significance" based on the output of a hypothesis test. We will also dig a bit deeper into how the test statistic is calculated based on the sample error, and visually what it looks like to have 1 or 2 "tailed" t-tests.
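As a preview of 1-tailed versus 2-tailed tests, the sketch below runs the same comparison three ways using the `alternative` argument of `ttest_ind` (available in SciPy 1.6 and later). The two groups are hypothetical simulated data, not the lesson's dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(0.3, 1.0, 40)  # hypothetical group with a slightly higher mean
b = rng.normal(0.0, 1.0, 40)  # hypothetical comparison group

two_sided = ttest_ind(a, b, alternative='two-sided')  # H1: means differ
greater = ttest_ind(a, b, alternative='greater')      # H1: mean(a) > mean(b)
less = ttest_ind(a, b, alternative='less')            # H1: mean(a) < mean(b)

print("two-sided:", two_sided.pvalue)
print("greater:  ", greater.pvalue)
print("less:     ", less.pvalue)
```

The t-statistic is identical in all three cases; only the tail area counted toward the p-value changes, so the two-sided p-value is twice the smaller one-sided p-value.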

#### Get and prepare the data

In [None]:
# imports
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [None]:
# get the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

In [None]:
# make into a dataframe
column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

df = pd.read_csv('house-votes-84.data', 
                 header=None, 
                 names=column_headers,
                 na_values="?")

df.head()

In [None]:
# recode votes as numeric
df = df.replace({'y': 1, 'n': 0})
df.head()

In [None]:
# how many from each party?
df['party'].value_counts()

In [None]:
# how did Republicans vote?
rep = df[df['party']=='republican']
rep.head()

In [None]:
# how did Democrats vote?
dem = df[df['party']=='democrat']
dem.head()

In [None]:
# the percentage of republicans who voted "yes" (1) 
# on the handicapped-infants bill

rep['handicapped-infants'].sum()/len(rep)

# len() is counting NaN values too!

In [None]:
# Remove NaN values from this column

col = rep['handicapped-infants']

np.isnan(col)

handicapped_infants_no_nans = col[~np.isnan(col)]

# The same column as before, but I've dropped the NaN values
handicapped_infants_no_nans

handicapped_infants_no_nans.sum()/len(handicapped_infants_no_nans)

In [None]:
# Average rate of voting 'yes' on the handicapped-infants
rep['handicapped-infants'].mean()
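The difference between the approaches above comes down to how missing values are counted. Here is a tiny sketch on a made-up Series (not the voting data) showing that `len()` includes NaN rows while `count()` and `mean()` skip them.

```python
import numpy as np
import pandas as pd

votes = pd.Series([1, 0, 1, np.nan])  # hypothetical votes: yes, no, yes, missing

naive_rate = votes.sum() / len(votes)     # len() counts the NaN row
count_rate = votes.sum() / votes.count()  # count() ignores NaN
mean_rate = votes.mean()                  # mean() skips NaN by default

print(naive_rate, count_rate, mean_rate)
```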

#### water project bill (two-sample t-test)

1) Null Hypothesis: There is no difference between average voting rates (levels of support) for the water-project bill between democrats and republicans in the house of representatives. (support is equal)

$\bar{x}_1 = \bar{x}_2$

Where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 \neq \bar{x}_2$

Levels of support between the two parties will differ.

3) 95% Confidence Level

In [None]:
from scipy.stats import ttest_ind

In [None]:
# What is the mean support of Republicans?
rep['water-project'].mean()

In [None]:
# what is the mean support of Democrats?
dem['water-project'].mean()

In [None]:
# compare with a t-test:
ttest_ind(rep['water-project'], dem['water-project'])

In [None]:
# account for NaN's
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

In [None]:
# You could also remove NaN values from this column

col = rep['water-project']
rep_water_project_no_nans = col[~np.isnan(col)]

col = dem['water-project']
dem_water_project_no_nans = col[~np.isnan(col)]

# My sample sizes for the two samples:
print(len(rep_water_project_no_nans))
print(len(dem_water_project_no_nans))

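When I have two samples (using a 2-sample t-test), a conservative rule of thumb is to use the smaller of the two samples to determine my degrees of freedom.

So in this case, df = 148-1 = 147. (Note that `ttest_ind` with its default `equal_var=True` uses the pooled degrees of freedom, n1 + n2 - 2, instead.)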
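There is more than one convention for the degrees of freedom of a 2-sample test. The sketch below compares three of them on hypothetical yes/no data: the conservative "smaller sample minus one" rule, the pooled value n1 + n2 - 2, and the Welch-Satterthwaite approximation used for unequal variances. The second group size and the vote probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical yes/no votes for two groups of different sizes
g1 = rng.binomial(1, 0.5, 148).astype(float)
g2 = rng.binomial(1, 0.5, 200).astype(float)

n1, n2 = len(g1), len(g2)
# Squared standard error contributed by each group
v1, v2 = g1.var(ddof=1) / n1, g2.var(ddof=1) / n2

df_conservative = min(n1, n2) - 1  # smaller sample minus one
df_pooled = n1 + n2 - 2            # equal-variance (pooled) t-test
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(df_conservative, round(df_welch, 1), df_pooled)
```

The Welch value always falls between the other two, and with samples this large the choice makes little practical difference to the p-value.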

4) T-statistic: .089

5) P-value: .929

I want to reject the null hypothesis if my p-value is < .05 or if my p-value is less than (1-confidence_level)

Conclusion: due to a p-value of .929 I fail to reject the null hypothesis that republican and democrat support for the water-project bill is the same.

I never say that I "accept" the null hypothesis, I just say that I "fail to reject"
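The .05 threshold has a concrete meaning: when the null hypothesis really is true, about 5% of experiments will still produce p < .05 by chance. This is a small simulation sketch of that idea; the number of trials, sample size, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(123)
n_trials, n = 2000, 50

# Both groups come from the same distribution, so the null hypothesis is true
a = rng.normal(0, 1, size=(n_trials, n))
b = rng.normal(0, 1, size=(n_trials, n))

# One t-test per row (axis=1 runs all trials at once)
pvalues = ttest_ind(a, b, axis=1).pvalue
false_positive_rate = (pvalues < 0.05).mean()
print(f"share of trials with p < .05: {false_positive_rate:.3f}")
```

This is also why a large p-value only lets us "fail to reject": a single non-significant result does not prove the two groups are identical.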

#### 1-sample T-test example

I'm only using one sample, and my null hypothesis will be different.

We're looking at Democrat support of the South-Africa bill

1a) Null Hypothesis:

$\bar{x}_1$ (average dem support for SA bill) $= 1$

This says that 100% of democrats support this bill. Full support.

1b) Null Hypothesis:

$\bar{x}_1$ (average dem support for SA bill) $= .5$

This says that 50% of democrats support this bill. The party is split.

1c) Null Hypothesis:

$\bar{x}_1$ (average dem support for SA bill) $= 0$

This says that 0% of democrats support this bill. The party is against the bill.

1d) Null Hypothesis:

$\bar{x}_1$ (average dem support for SA bill) $= .78245$

This says that 78.245% of democrats support this bill.

**With 1-sample t-tests I can frame the question I'm asking through my choice of null hypothesis**

1) Null Hypothesis: $\bar{x}_1$ (average dem support for SA bill) $= .5$

This says that 50% of democrats support this bill. The party is split.

2) Alternative Hypothesis: Support is not equal to .5 or 50%

$\bar{x}_1$ (average dem support for SA bill) $\neq .5$

This says nothing about if support is greater than or less than 50%, it's just saying that it's not 50% - it's different, it's something other than 50%.

3) Confidence Level: 95%

In [None]:
# import
from scipy.stats import ttest_1samp

In [None]:
# conduct the t-test
ttest_1samp(dem['south-africa'], .5, nan_policy='omit')

Due to a p-value of basically 0, we reject the null hypothesis that democrat support for the South Africa bill is .5 (split party) and conclude that it is something different.
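The 1-sample t-statistic is simple enough to compute by hand: it is the distance between the sample mean and the hypothesized value, measured in standard errors. The following sketch checks a manual calculation against `ttest_1samp` on hypothetical simulated votes (not the real voting records).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
votes = rng.binomial(1, 0.8, 200).astype(float)  # hypothetical yes/no votes
mu0 = 0.5  # value claimed by the null hypothesis

n = len(votes)
se = votes.std(ddof=1) / np.sqrt(n)  # standard error of the mean
t_manual = (votes.mean() - mu0) / se
# Two-sided p-value from a t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(votes, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```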


In [None]:
# what is the average support among Democrats?
dem['south-africa'].mean()

In [None]:
# is it significantly different from 90%?
ttest_1samp(dem['south-africa'], .9, nan_policy='omit')

Fail to reject the null hypothesis:

I conclude that democrat support for the South Africa bill is not significantly different from 90%.

In [None]:
# what about 89.9?
ttest_1samp(dem['south-africa'], .899, nan_policy='omit')

Due to a p-value of .048, I reject the null hypothesis that democrat support for this bill is 89.9% and suggest the alternative that it is different from 89.9%
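Testing one hypothesized value after another, as with 90% and 89.9%, is closely related to building a confidence interval: a two-sided test at the .05 level rejects exactly those values that fall outside the 95% confidence interval for the mean. The sketch below illustrates this on hypothetical simulated votes; the data, seed, and the 0.001 offsets are assumptions for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
votes = rng.binomial(1, 0.9, 250).astype(float)  # hypothetical yes/no votes

n = len(votes)
se = stats.sem(votes)  # standard error of the mean (uses ddof=1)
low, high = stats.t.interval(0.95, n - 1, loc=votes.mean(), scale=se)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")

inside = low + 0.001   # hypothesized value just inside the interval
outside = low - 0.001  # hypothesized value just outside the interval
p_inside = stats.ttest_1samp(votes, inside).pvalue
p_outside = stats.ttest_1samp(votes, outside).pvalue
print(f"p (inside):  {p_inside:.4f}")
print(f"p (outside): {p_outside:.4f}")
```

Seen this way, the jump from "fail to reject" to "reject" between two nearly identical values simply reflects where the edge of the interval happens to fall.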

# Resources

- https://homepage.divms.uiowa.edu/~mbognar/applets/t.html
- https://rpsychologist.com/d3/tdist/
- https://gallery.shinyapps.io/tdist/
- https://en.wikipedia.org/wiki/Standard_deviation#Sample_standard_deviation_of_metabolic_rate_of_northern_fulmars
- https://www.khanacademy.org/math/ap-statistics/two-sample-inference/two-sample-t-test-means/v/two-sample-t-test-for-difference-of-means