

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)
5. Practice 1-sample t-tests

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
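
As a rough illustration of the two-sample setup, the sketch below compares two small groups of 0/1 votes. The arrays `group_a` and `group_b` are made up for this example and are not drawn from the dataset. A positive statistic means the first group's mean is higher than the second's.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two groups (illustrative values only)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# Two-sample t-test: compares the mean of group_a with the mean of group_b
t_stat, p_value = ttest_ind(group_a, group_b)

# The sign of the statistic follows the sign of (mean_a - mean_b)
mean_diff = group_a.mean() - group_b.mean()
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```

Swapping the argument order flips the sign of the statistic but leaves the p-value unchanged.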

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
3. Add visuals

In [0]:
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-09-16 18:59:59--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-09-16 18:59:59 (127 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
# Load the data with explicit column names; '?' marks a missing vote and becomes NaN.
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 na_values='?',
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
# Encode votes as numbers: 'y' -> 1, 'n' -> 0 (missing votes stay NaN, so the columns become float)
df = df.replace({'y': 1, 'n': 0})
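
One possible alternative, sketched here on a tiny made-up frame, is to encode only the vote columns with `Series.map` so the `party` column is never touched. The frame `toy` and the names `vote_cols` and `encoded` are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the voting data (values invented for this example)
toy = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat'],
    'budget': ['n', 'y', np.nan],
    'crime': ['y', np.nan, 'n'],
})

# Every column except 'party' holds votes
vote_cols = [c for c in toy.columns if c != 'party']

encoded = toy.copy()
for col in vote_cols:
    # 'y' -> 1.0, 'n' -> 0.0; missing votes stay NaN
    encoded[col] = toy[col].map({'y': 1.0, 'n': 0.0})

print(encoded)
```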

In [0]:
#Separate into our 'Samples'
dems = df[df['party'] == 'democrat']
reps = df[df['party'] == 'republican']
cols = ['handicapped-infants', 'water-project',
        'budget', 'physician-fee-freeze', 'el-salvador-aid',
        'religious-groups', 'anti-satellite-ban',
        'aid-to-contras', 'mx-missile', 'immigration',
        'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free',
        'south-africa']

In [0]:
dems.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
reps.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
# Practice: run one two-sample t-test and inspect the output.
ttest_ind(reps['budget'], dems['budget'], nan_policy='omit')

Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
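
To see what `nan_policy='omit'` does with the missing votes, the sketch below checks on invented data that it should give the same result as dropping the NaN values from each sample by hand before testing. The arrays `a` and `b` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes with missing entries (illustrative values only)
a = np.array([1.0, 0.0, np.nan, 1.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, np.nan, 0.0, 0.0, 1.0])

# Let SciPy ignore the NaN values
omit = ttest_ind(a, b, nan_policy='omit')

# Drop the NaN values manually, independently for each sample
dropped = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

# float() is a defensive cast in case a SciPy version returns array-like scalars
same_stat = np.isclose(float(omit.statistic), float(dropped.statistic))
same_p = np.isclose(float(omit.pvalue), float(dropped.pvalue))
print(same_stat, same_p)
```

Each sample loses only its own missing rows, so the two groups can end up with different sizes.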

In [0]:
# Use a dict comprehension to run the same test for every issue.
results = {column : ttest_ind(reps[column], dems[column], nan_policy='omit') for column in cols}
results

{'aid-to-contras': Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54),
 'anti-satellite-ban': Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31),
 'budget': Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77),
 'crime': Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47),
 'duty-free': Ttest_indResult(statistic=-12.853146132542978, pvalue=5.997697174347365e-32),
 'education': Ttest_indResult(statistic=20.500685724563073, pvalue=1.8834203990450192e-64),
 'el-salvador-aid': Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68),
 'handicapped-infants': Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18),
 'immigration': Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066),
 'mx-missile': Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47),
 'physician-fee-freeze': Ttest_indResult(statistic=49.367081573

In [0]:
# A more general version, wrapped in a function.
def get_test_scores(columns_as_list, DataFrame1, DataFrame2, nan_policy):
    """Run a two-sample t-test of DataFrame1 vs. DataFrame2 for each listed column."""
    return {col: ttest_ind(DataFrame1[col], DataFrame2[col], nan_policy=nan_policy) for col in columns_as_list}
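
A dictionary of results can be hard to scan. As one possible follow-up, the sketch below collects per-column results into a DataFrame sorted by p-value. The frames `reps_toy` and `dems_toy` and their column names are invented for this example, and NaN values are dropped by hand rather than through `nan_policy`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical miniature samples (values invented for this example)
reps_toy = pd.DataFrame({'issue-a': [1, 1, 1, 0, 1, np.nan],
                         'issue-b': [1, 0, 1, 0, np.nan, 1]})
dems_toy = pd.DataFrame({'issue-a': [0, 0, 1, 0, 0, 0],
                         'issue-b': [1, 0, 0, 1, 1, 0]})

rows = []
for col in reps_toy.columns:
    # Drop missing votes from each sample independently
    a = reps_toy[col].dropna().to_numpy(dtype=float)
    b = dems_toy[col].dropna().to_numpy(dtype=float)
    stat, p = ttest_ind(a, b)
    rows.append({'issue': col, 'statistic': float(stat), 'pvalue': float(p)})

# Smallest p-values (strongest evidence of a difference) come first
summary = pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)
print(summary)
```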

In [0]:
get_test_scores(cols, reps, dems, 'omit')

{'aid-to-contras': Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54),
 'anti-satellite-ban': Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31),
 'budget': Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77),
 'crime': Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47),
 'duty-free': Ttest_indResult(statistic=-12.853146132542978, pvalue=5.997697174347365e-32),
 'education': Ttest_indResult(statistic=20.500685724563073, pvalue=1.8834203990450192e-64),
 'el-salvador-aid': Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68),
 'handicapped-infants': Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18),
 'immigration': Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066),
 'mx-missile': Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47),
 'physician-fee-freeze': Ttest_indResult(statistic=49.367081573

In [0]:
# Democrats support more (negative statistic: the Republican mean is lower)
print(f"Democrats support 'budget' more than Republicans with a score of {results['budget']}")
# Republicans support more
print(f"Republicans support 'crime' more than Democrats with a score of {results['crime']}")
# No significant difference in support
print(f"Democrats and Republicans show no significant difference on 'water-project' with a score of {results['water-project']}")

Democrats support 'budget' more than Republicans with a score of Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
Republicans support 'crime' more than Democrats with a score of Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)
Democrats and Republicans show no significant difference on 'water-project' with a score of Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
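
The three goals can also be checked mechanically. The sketch below is one way to label an issue from its statistic and p-value, assuming the Republican sample was passed first to `ttest_ind` (so a positive statistic means higher Republican support). The helper `classify` and its threshold names are hypothetical; the four (statistic, p-value) pairs are copied from the results above.

```python
def classify(statistic, pvalue, alpha=0.01, no_diff=0.1):
    """Label an issue, assuming Republicans were the first sample in the test."""
    if pvalue < alpha:
        # Significant difference: the sign says which party supports it more
        if statistic > 0:
            return 'republicans support more'
        return 'democrats support more'
    if pvalue > no_diff:
        return 'no clear difference'
    # Between the two thresholds: neither goal is met
    return 'inconclusive'


# (statistic, p-value) pairs taken from the notebook output
observed = {
    'budget': (-23.21277691701378, 2.0703402795404463e-77),
    'crime': (16.342085656197696, 9.952342705606092e-47),
    'water-project': (0.08896538137868286, 0.9291556823993485),
    'immigration': (1.7359117329695164, 0.08330248490425066),
}

labels = {issue: classify(stat, p) for issue, (stat, p) in observed.items()}
for issue, label in labels.items():
    print(f"{issue}: {label}")
```

Note that 'immigration' (p of about 0.083) falls between the two thresholds, so it satisfies neither the p < 0.01 goals nor the p > 0.1 goal.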


1) Null Hypothesis:

In 1-sample t-tests YOU GET TO CHOOSE YOUR NULL HYPOTHESIS

$H_0$: $\bar{x} = 0$ - There is ZERO Republican support for the duty-free bill.

2) Alternative Hypothesis

$H_a$: $\bar{x} \neq 0$ - There is non-zero support for the duty-free bill among Republicans.

3) Confidence Level: 95% or .95

In [0]:
ttest_1samp(reps['duty-free'], 0, nan_policy='omit')

Ttest_1sampResult(statistic=3.9091802389817065, pvalue=0.00013809783981978333)

4) t-statistic: 3.9

5) p-value of .00013809

--- 

Conclusion: Because the p-value (about 0.00014) is well below 0.05, I reject the null hypothesis that Republican support is zero and conclude that Republican support is non-zero.
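
For intuition about where the t-statistic comes from, the sketch below recomputes a one-sample statistic by hand as (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)) and compares it with `ttest_1samp`. The array `votes` is made up for this example and is not the Republican duty-free column.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical 0/1 votes (illustrative values only)
votes = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
hypothesized_mean = 0.0

n = votes.size
# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = votes.std(ddof=1) / np.sqrt(n)
t_manual = (votes.mean() - hypothesized_mean) / standard_error

t_scipy, p_scipy = ttest_1samp(votes, hypothesized_mean)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```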

1) Null Hypothesis:

In 1-sample t-tests YOU GET TO CHOOSE YOUR NULL HYPOTHESIS

$H_0$: $\bar{x} = 0$ - There is ZERO Democratic support for the crime bill.

2) Alternative Hypothesis

$H_a$: $\bar{x} \neq 0$ - There is non-zero support for the crime bill among Democrats.

3) Confidence Level: 95% or .95

In [0]:
ttest_1samp(dems['crime'],0, nan_policy='omit')

Ttest_1sampResult(statistic=11.745810821577512, pvalue=9.087409645908879e-26)

4) t-statistic: 11.74

5) p-value of 9.087409645908879e-26

---
Conclusion: Because the p-value is near zero, I reject the null hypothesis that Democratic support is zero and conclude that Democratic support is non-zero.
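
Since the null hypothesis is a free choice in a one-sample test, it can be instructive to compare two choices on the same data. The sketch below, on invented votes, tests against 0 (no support at all) and against 0.5 (an even split). The value 0.5 is only an illustrative alternative, not something the assignment asks for.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical 0/1 votes: 7 of 10 vote yes (illustrative values only)
votes = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float)

# Null hypothesis 1: mean support is 0
t_zero, p_zero = ttest_1samp(votes, 0)

# Null hypothesis 2: mean support is 0.5 (an even split)
t_half, p_half = ttest_1samp(votes, 0.5)

print(f"vs 0.0: t = {t_zero:.3f}, p = {p_zero:.4f}")
print(f"vs 0.5: t = {t_half:.3f}, p = {p_half:.4f}")
```

With 0/1 data, a null of 0 is rejected as soon as there is a reasonable number of yes votes, so a null such as 0.5 may say more about whether a group leans for or against a bill.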