
## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
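
To make the two-sample idea concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand and compares it with `ttest_ind`. The vote lists are made-up toy values, not rows from the dataset, and `pooled_t_statistic` is a hypothetical helper name used only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic using a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pool the unbiased per-group variances, weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Toy yes/no votes (1 = yes, 0 = no); illustrative values only.
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, 1, 0, 0]

manual_t = pooled_t_statistic(group_a, group_b)
scipy_result = ttest_ind(group_a, group_b)  # equal_var=True by default
print(manual_t, scipy_result.statistic, scipy_result.pvalue)
```

The two statistics should agree because `ttest_ind` assumes equal variances unless `equal_var=False` is passed.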

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

In [0]:
df = pd.read_csv('house-votes-84.data', delimiter=',', header=None)

In [0]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
cols = ['Class Name', 'handicapped-infants', 'water-project-cost-sharing', 'adoption-of-the-budget-resolution', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups-in-schools', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile', 'immigration', 'synfuels-corporation-cutback', 'education-spending', 'superfund-right-to-sue', 'crime', 'duty-free-exports', 'export-administration-act-south-africa']
df.columns = cols

In [0]:
df = df.replace({'?': np.nan, 'y': 1, 'n': 0})  # np.NaN was removed in NumPy 2.0
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
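
Since `nan_policy='omit'` silently drops missing votes, it can help to see how many there are per issue and party before testing. The sketch below is one possible way to count them. It uses only the five rows shown above, restricted to two columns, and the names `sample` and `missing_by_party` are illustrative.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output above, two issue columns only.
sample = pd.DataFrame({
    'Class Name': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'handicapped-infants': [0.0, 0.0, np.nan, 0.0, 1.0],
    'physician-fee-freeze': [1.0, 1.0, np.nan, 0.0, 0.0],
})

issue_cols = sample.columns.drop('Class Name')
# isna() marks missing votes; summing the booleans per party counts them.
missing_by_party = sample[issue_cols].isna().groupby(sample['Class Name']).sum()
print(missing_by_party)
```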


In [0]:
gop_votes = df[df['Class Name'] == 'republican']
dem_votes = df[df['Class Name'] == 'democrat']

In [0]:
gop_crime_votes = gop_votes['crime']
dem_crime_votes = dem_votes['crime']


In [0]:
df_grouped = df.groupby(df['Class Name']).mean()
df_grouped.T

Class Name,democrat,republican
handicapped-infants,0.604651,0.187879
water-project-cost-sharing,0.502092,0.506757
adoption-of-the-budget-resolution,0.888462,0.134146
physician-fee-freeze,0.054054,0.987879
el-salvador-aid,0.215686,0.951515
religious-groups-in-schools,0.476744,0.89759
anti-satellite-test-ban,0.772201,0.240741
aid-to-nicaraguan-contras,0.828897,0.152866
mx-missile,0.758065,0.115152
immigration,0.471483,0.557576


### Get ttest on the Crime Bill

In [0]:
ttest_ind(gop_crime_votes, dem_crime_votes, nan_policy='omit')

Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)

With a p-value of 9.95 x 10^-47 and a positive t statistic, we can reject the null hypothesis and conclude that Republicans support the crime bill more than Democrats, and that the difference is statistically significant.

### Get ttest on the MX missile bill

In [0]:
gop_missile_votes = gop_votes['mx-missile']
dem_missile_votes = dem_votes['mx-missile']

ttest_ind(dem_missile_votes, gop_missile_votes, nan_policy='omit')

Ttest_indResult(statistic=16.437503268542994, pvalue=5.03079265310811e-47)

With a p-value of 5.03 x 10^-47 and a positive t statistic (Democrats passed first), we can reject the null hypothesis and conclude that Democrats support the MX missile bill more than Republicans, and that the difference is statistically significant.
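
The tests above are two-sided, so the direction of the difference is read from the sign of the t statistic. If a directional claim such as "Democrats support this more" is wanted directly, a one-sided test is one option. The sketch below assumes SciPy 1.6 or later (for the `alternative` parameter) and uses made-up toy votes rather than the real columns. Missing votes are dropped explicitly with `dropna()`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Toy vote series with missing values; illustrative only.
dem = pd.Series([1, 1, 1, np.nan, 1, 0, 1, 1])
gop = pd.Series([0, 0, np.nan, 0, 1, 0, 0, 0])

# Alternative hypothesis: mean(dem) > mean(gop).
one_sided = ttest_ind(dem.dropna(), gop.dropna(), alternative='greater')
two_sided = ttest_ind(dem.dropna(), gop.dropna())
print(one_sided.pvalue, two_sided.pvalue)
```

When the statistic points in the hypothesized direction, the one-sided p-value is half the two-sided one.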

### Get ttest on the immigration bill


In [0]:
gop_immigration_votes = gop_votes['immigration']
dem_immigration_votes = dem_votes['immigration']

ttest_ind(dem_immigration_votes, gop_immigration_votes, nan_policy='omit')

Ttest_indResult(statistic=-1.7359117329695164, pvalue=0.08330248490425066)

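With a p-value of 0.083, we cannot reject the null hypothesis at the 0.01 level, so we cannot conclude that a significant difference exists between Democrats and Republicans on this issue. Note that 0.083 falls short of the p > 0.1 target in goal 4; water-project-cost-sharing, where the party means are nearly identical (0.502 vs. 0.507), is likely a better candidate.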

## Stretch Goal - Refactor into a Function




In [0]:
# An earlier version reused gop_votes and dem_votes from the global scope.
# I was made aware that relying on globals like this can introduce bugs
# (and, I was told, potential security vulnerabilities), so the function
# below builds its own filtered frames from the dataframe it is given.

# frame is the voting dataframe; x is the bill's column name in that dataframe.
# The goal is to accept a dataframe, clean it, filter it, and run a t-test on it.

def get_ttest(frame, x):
    # Recode votes; this is a no-op if the frame has already been cleaned.
    frame = frame.replace({'?': np.nan, 'y': 1, 'n': 0})
    gop_votes = frame[frame['Class Name'] == 'republican']
    dem_votes = frame[frame['Class Name'] == 'democrat']
    gop_bill_votes = gop_votes[x]
    dem_bill_votes = dem_votes[x]
    # Republicans go first, so a positive statistic means more GOP support.
    return ttest_ind(gop_bill_votes, dem_bill_votes, nan_policy='omit')

In [0]:
get_ttest(df, 'immigration')

Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)
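
To rerun the test across every issue at once, the same idea can be wrapped in a loop. This is a minimal sketch, not part of the original assignment code: `ttest_all_issues` is a hypothetical helper, and the `toy` frame is synthetic stand-in data (with invented column names) so the block runs without `house-votes-84.data`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def ttest_all_issues(frame, party_col='Class Name'):
    """Run a two-sample t-test per issue column; return results sorted by p-value."""
    gop = frame[frame[party_col] == 'republican']
    dem = frame[frame[party_col] == 'democrat']
    rows = []
    for issue in frame.columns.drop(party_col):
        result = ttest_ind(gop[issue], dem[issue], nan_policy='omit')
        rows.append({'issue': issue,
                     'statistic': float(result.statistic),
                     'pvalue': float(result.pvalue)})
    return pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)


# Synthetic stand-in data: one strongly partisan issue, one coin-flip issue.
rng = np.random.default_rng(0)
n = 200
party = np.where(np.arange(n) < 100, 'republican', 'democrat')
toy = pd.DataFrame({
    'Class Name': party,
    'split-issue': np.where(party == 'republican',
                            rng.binomial(1, 0.9, n),
                            rng.binomial(1, 0.1, n)).astype(float),
    'shared-issue': rng.binomial(1, 0.5, n).astype(float),
})
toy.loc[::25, 'split-issue'] = np.nan  # sprinkle in some missing votes

summary = ttest_all_issues(toy)
print(summary)
```

With the real data, calling `ttest_all_issues(df)` on the cleaned dataframe would give one row per bill, which makes it easy to pick out issues with p < 0.01 in either direction or with p > 0.1.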