

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed value.
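
As a minimal sketch of that distinction, the snippet below runs both kinds of test on made-up 0/1 votes. The arrays `group_a` and `group_b` are illustrative assumptions, not values from the voting dataset. `ttest_1samp` compares one group's mean against a fixed value, while `ttest_ind` compares the means of two independent groups.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

# Hypothetical yes(1)/no(0) votes for two groups; not taken from the real data.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

# One-sample test: is group_a's mean different from a fixed value of 0.5?
one_sample = ttest_1samp(group_a, popmean=0.5)

# Two-sample test: do group_a and group_b have different means?
two_sample = ttest_ind(group_a, group_b)

print(one_sample.statistic, one_sample.pvalue)
print(two_sample.statistic, two_sample.pvalue)
```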

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [26]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-11-13 22:03:13--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-11-13 22:03:18 (284 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [37]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import ttest_ind, ttest_rel

column_names= ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

df= pd.read_csv('house-votes-84.data', header = None, names=column_names, na_values='?')
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [38]:
# change to numeric
df = df.replace({'y':1, 'n':0})
df.head(2)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,


In [46]:
df.columns

Index(['party', 'handicapped-infants', 'water-project', 'budget',
       'physician-fee-freeze', 'el-salvador-aid', 'religious-groups',
       'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration',
       'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free',
       'south-africa'],
      dtype='object')

In [39]:
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [40]:
rep = df[df['party']=='republican']
len(rep)

168

In [41]:
dem = df[df['party']=='democrat']
len(dem)

267

Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

1) Null Hypothesis: There is no difference between average voting rates (levels of support) for the handicapped-infants bill between democrats and republicans in the house of representatives. (support is equal)

$\bar{x}_1 = \bar{x}_2$
Where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 \neq \bar{x}_2$
Levels of support between the two parties will differ. I reject the null hypothesis if the p-value is < .01.

3) 99% Confidence Level

In [68]:

print(rep['handicapped-infants'].sum()/len(rep))
print(dem['handicapped-infants'].sum()/len(dem))

0.18452380952380953
0.5842696629213483


In [64]:
print(rep['handicapped-infants'].mean())
print(dem['handicapped-infants'].mean())

0.18787878787878787
0.6046511627906976
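
The two cells above give slightly different rates because they use different denominators: `sum()/len()` divides the yes votes by every member, including those with a missing vote, while `.mean()` skips `NaN` and divides by recorded votes only. A small sketch on a made-up series (the values are illustrative assumptions, not real votes) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical votes: 1 = yes, 0 = no, NaN = no recorded vote.
votes = pd.Series([1.0, 0.0, 1.0, np.nan, 0.0])

# Denominator counts all 5 members, so the missing vote acts like a "no".
rate_all_members = votes.sum() / len(votes)

# Denominator counts only the 4 recorded votes.
rate_recorded = votes.mean()

print(rate_all_members)  # 0.4
print(rate_recorded)     # 0.5
```

The t-tests in this notebook use `nan_policy='omit'`, which matches the second convention.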


In [0]:
# keep only recorded votes (drop NaN)
rep_handicapped_infants_no_nans = rep['handicapped-infants'].dropna()
dem_handicapped_infants_no_nans = dem['handicapped-infants'].dropna()

In [66]:

print(len(rep_handicapped_infants_no_nans))
print(len(dem_handicapped_infants_no_nans))

165
258


In [92]:
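# one-sided test, reject if p-value < .01
# democrats support handicapped infants bill more (negative statistic: rep mean < dem mean)
print(ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit'))
# halve the two-sided p-value to get the one-sided p-value
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit').pvalue/2 < .01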

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)


True

4) T-statistic: -9.21

5) P-value: 1.61e-18

I want to reject the null hypothesis if my p-value is < .01 

Conclusion: due to a p-value of 1.61e-18 (far below .01), I reject the null hypothesis that republican and democrat support for the handicapped-infants bill is the same.
The negative t-statistic shows that democrats supported the bill more than republicans.
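
The `pvalue/2` step in the test cell converts SciPy's two-sided p-value into a one-sided one. That is only valid when the t-statistic has the sign the directional hypothesis predicts. A hedged sketch of the check follows; `one_sided_p` is a hypothetical helper and the vote arrays are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def one_sided_p(a, b):
    """P-value for the alternative mean(a) < mean(b) (hypothetical helper)."""
    result = ttest_ind(a, b, nan_policy='omit')
    half = float(result.pvalue) / 2
    # Halving is right only if the statistic points in the predicted direction;
    # otherwise the one-sided p-value is the complement.
    return half if float(result.statistic) < 0 else 1 - half

# Made-up votes with missing values, not the real dataset.
low = np.array([0, 0, 1, 0, np.nan, 0, 0, 1, 0, 0])
high = np.array([1, 1, 1, 0, 1, np.nan, 1, 1, 1, 1])

p_expected = one_sided_p(low, high)   # small: the first group supports less
p_reversed = one_sided_p(high, low)   # close to 1: the direction is wrong
print(p_expected, p_reversed)
```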

Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

1) Null Hypothesis: There is no difference between average voting rates (levels of support) for the water-project bill between democrats and republicans in the house of representatives. (support is equal)

$\bar{x}_1 = \bar{x}_2$
Where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 \neq \bar{x}_2$
Levels of support between the two parties will differ.

3) 99% Confidence Level

In [47]:

rep['water-project'].mean()

0.5067567567567568

In [48]:
dem['water-project'].mean()

0.502092050209205

In [0]:
# keep only recorded votes (drop NaN)
rep_water_project_no_nans = rep['water-project'].dropna()
dem_water_project_no_nans = dem['water-project'].dropna()

In [54]:

print(len(rep_water_project_no_nans))
print(len(dem_water_project_no_nans))

148
239


In [88]:
# not much difference: two-sided p > .1
print(ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit'))
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit').pvalue > .1

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)


True

4) T-statistic: .089

5) P-value: .929

I want to reject the null hypothesis if my p-value is < .01 

Conclusion: due to a p-value of .929 (well above .1), I fail to reject the null hypothesis that republican and democrat support for the water-project bill is the same.

*pvalue > .1

Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

1) Null Hypothesis: There is no difference between average voting rates (levels of support) for the physician-fee-freeze bill between democrats and republicans in the house of representatives. (support is equal)

$\bar{x}_1 = \bar{x}_2$
Where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 \neq \bar{x}_2$
Levels of support between the two parties will differ.

3) 99% Confidence Level

In [0]:
# keep only recorded votes (drop NaN)
rep_physician_fee_freeze_no_nans = rep['physician-fee-freeze'].dropna()
dem_physician_fee_freeze_no_nans = dem['physician-fee-freeze'].dropna()



In [90]:
# republicans support physician_fee_freeze bill more 
print(len(rep_physician_fee_freeze_no_nans))
print(len(dem_physician_fee_freeze_no_nans))

165
259


In [74]:
print(rep['physician-fee-freeze'].mean())
print(dem['physician-fee-freeze'].mean())

0.9878787878787879
0.05405405405405406


In [94]:
# one-sided test, reject if p-value < .01 (positive statistic: rep mean > dem mean)
print(ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit'))
ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit').pvalue/2 < .01

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)


True

4) T-statistic: 49.37

5) P-value: 1.99e-177

I want to reject the null hypothesis if my p-value is < .01 

Conclusion: due to a p-value of 1.99e-177 (far below .01), I reject the null hypothesis that republican and democrat support for the physician-fee-freeze bill is the same.
The positive t-statistic shows that republicans supported the bill more than democrats.
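
The three tests above repeat the same steps, so stretch goal 1 could be approached with a single function. The sketch below is an assumption about how that refactor might look, not the notebook's own code: `compare_parties` is a hypothetical helper, and the tiny `demo` frame stands in for the real `df` so the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(data, issue, group_col='party'):
    """Two-sample t-test of republican vs. democrat support for one issue."""
    # Drop missing votes per issue, mirroring nan_policy='omit'.
    rep_votes = data.loc[data[group_col] == 'republican', issue].dropna()
    dem_votes = data.loc[data[group_col] == 'democrat', issue].dropna()
    result = ttest_ind(rep_votes, dem_votes)
    return {
        'issue': issue,
        'rep_mean': rep_votes.mean(),
        'dem_mean': dem_votes.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),  # two-sided
    }

# Made-up stand-in for the cleaned voting DataFrame (1 = yes, 0 = no).
demo = pd.DataFrame({
    'party': ['republican'] * 6 + ['democrat'] * 6,
    'issue-a': [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    'issue-b': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})

summary = pd.DataFrame([compare_parties(demo, col) for col in ['issue-a', 'issue-b']])
print(summary)
```

With the real data, the same call would take `df` and any issue name from `column_names[1:]`; a positive statistic means higher republican support and a negative one means higher democrat support.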