<a href="https://colab.research.google.com/github/CJRicciardi/DS-Unit-1-Sprint-2-Statistics/blob/master/DS-Unit-1-Sprint-2-Statistics/module1/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
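As a minimal sketch of what a 2 sample t-test looks like in code, the example below compares two small made-up groups of yes/no votes. The arrays `group_a` and `group_b` are hypothetical and are not taken from the voting data; `nan_policy='omit'` tells SciPy to skip missing values instead of returning NaN.

```python
import numpy as np
from scipy import stats

# Hypothetical votes: 1 = yes, 0 = no, NaN = missing. Not real data.
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# Two-sample (independent) t-test; missing votes are dropped per group.
result = stats.ttest_ind(group_a, group_b, nan_policy='omit')
t_stat = float(result.statistic)
p_value = float(result.pvalue)

print('t statistic:', t_stat)
print('p-value:', p_value)
```

A positive t statistic means the first group's mean is higher than the second group's, so the order of the arguments determines the sign.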

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import scipy.stats as stat
import seaborn as sns

In [0]:
### Load the data file

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-20 20:33:07--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-20 20:33:07 (127 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
#inspect datafile head

!head house-votes-84.data

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [0]:
#inspect data file tail

!tail house-votes-84.data

democrat,n,n,y,n,n,n,y,y,n,y,y,n,n,n,y,?
democrat,y,n,y,n,n,n,y,y,y,y,n,n,n,n,y,y
republican,n,n,n,y,y,y,y,y,n,y,n,y,y,y,n,y
democrat,?,?,?,n,n,n,y,y,y,y,n,n,y,n,y,y
democrat,y,n,y,n,?,n,y,y,y,y,n,y,n,?,y,y
republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,?,n


In [0]:
#create initial data frame with column titles

df = pd.read_csv('house-votes-84.data', names=['party', 'handicapped-infants', 'water-project', 
'budget', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups', 
'anti-satellite-test-ban', 'aid-to-nic-contras', 'mx-missile', 'immigration', 
'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free', 
'south-africa'])

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-nic-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
#replace y w/ 1, n with 0, and ? w/ NaN

df = df.replace('?',np.nan).replace('n',0).replace('y',1)

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-nic-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
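An alternative to chaining `replace` calls is a single explicit mapping applied only to the vote columns. The sketch below uses a hypothetical three-row frame named `sample` in the same shape as the voting data; the names `vote_map`, `issue_cols`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same format as house-votes-84.data.
sample = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat'],
    'budget': ['n', 'y', '?'],
    'crime': ['y', '?', 'n'],
})

# One mapping for every vote value; '?' becomes NaN.
vote_map = {'y': 1.0, 'n': 0.0, '?': np.nan}

# Convert only the issue columns and leave the party label untouched.
issue_cols = sample.columns.drop('party')
cleaned = sample.copy()
cleaned[issue_cols] = sample[issue_cols].apply(lambda col: col.map(vote_map))

print(cleaned)
print(cleaned[issue_cols].isna().sum())  # missing votes per issue
```

Counting the missing votes per issue right after cleaning is a quick way to see how many observations each t-test will drop.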


In [0]:
# create republican filter

isrep = df['party']=='republican'

isrep.head()

0     True
1     True
2    False
3    False
4    False
Name: party, dtype: bool

In [0]:
# create republican df

rep = df[isrep]

rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-nic-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
# create democrat filter  

isdem = df['party']=='democrat'

isdem.head()

0    False
1    False
2     True
3     True
4     True
Name: party, dtype: bool

In [0]:
# create democrat df

dem = df[isdem]

dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-nic-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


Hypothesis for the Water Project Bill.

1) Null Hypothesis: There is no difference in support for the Water Project Bill between Democrats and Republicans in the House of Representatives.

$\bar{x}_{1} = \bar{x}_{2}$

2) Alternative Hypothesis: There is a difference between Democrat and Republican support for the Water Project Bill in the House of Representatives.

$\bar{x}_{1} \neq \bar{x}_{2}$

Levels of support between the parties will differ.

3) 95% confidence level (significance level of 0.05)

In [0]:
print('Rep Mean:', rep['water-project'].mean())
print('Dem Mean:', dem['water-project'].mean())

stat.ttest_ind(dem['water-project'], rep['water-project'], nan_policy='omit')

Rep Mean: 0.5067567567567568
Dem Mean: 0.502092050209205


Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

Water Project Bill

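4) T Statistic = -0.089

5) P Value = 0.929

I want to reject the null hypothesis if my p-value is < .05.

Conclusion: With a p-value of 0.929, well above .05, we fail to reject the null hypothesis. The data show no statistically significant difference in support between the parties, which also meets goal 4 (p > 0.1).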
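The decision rule stated above can be written out as a small check. This is only a sketch: `alpha` and `decision` are illustrative names, and the p-value is copied from the `ttest_ind` output for the water-project vote.

```python
# Significance level matching the 95% confidence level chosen above.
alpha = 0.05

# p-value reported by ttest_ind for 'water-project' in the cell above.
p_value = 0.9291556823993485

# Reject the null hypothesis only when the p-value falls below alpha.
if p_value < alpha:
    decision = 'reject the null hypothesis'
else:
    decision = 'fail to reject the null hypothesis'

print('Decision:', decision)
```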

Hypothesis for the Budget Bill.

1) Null Hypothesis: There is no difference in support for the Budget Bill between Democrats and Republicans in the House of Representatives.

$\bar{x}_{1} = \bar{x}_{2}$

2) Alternative Hypothesis: There is a difference between Democrat and Republican support for the Budget Bill in the House of Representatives.

$\bar{x}_{1} \neq \bar{x}_{2}$

Levels of support between the parties will differ.

3) 95% confidence level (significance level of 0.05)

In [0]:
print('Rep Mean:', rep['budget'].mean())
print('Dem Mean:', dem['budget'].mean())

stat.ttest_ind(dem['budget'], rep['budget'], nan_policy='omit')

Rep Mean: 0.13414634146341464
Dem Mean: 0.8884615384615384


Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

Budget

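4) T Statistic = 23.213

5) P Value = 2.07e-77 (effectively 0)

I want to reject the null hypothesis if my p-value is < .05.

Conclusion: We reject the null hypothesis. Democrats supported the Budget Bill at a much higher rate than Republicans (mean 0.888 vs. 0.134), and the p-value is far below 0.01, which meets goal 2.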

Hypothesis for the Physician Fee Freeze Bill.

1) Null Hypothesis: There is no difference in support for the Physician Fee Freeze Bill between Democrats and Republicans in the House of Representatives.

$\bar{x}_{1} = \bar{x}_{2}$

2) Alternative Hypothesis: There is a difference between Democrat and Republican support for the Physician Fee Freeze Bill in the House of Representatives.

$\bar{x}_{1} \neq \bar{x}_{2}$

Levels of support between the parties will differ.

3) 95% confidence level (significance level of 0.05)

In [0]:
print('Rep Mean:', rep['physician-fee-freeze'].mean())
print('Dem Mean:', dem['physician-fee-freeze'].mean())

stat.ttest_ind(dem['physician-fee-freeze'], rep['physician-fee-freeze'], nan_policy='omit')

Rep Mean: 0.9878787878787879
Dem Mean: 0.05405405405405406


Ttest_indResult(statistic=-49.36708157301406, pvalue=1.994262314074344e-177)

Physician Fee Freeze

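4) T Statistic = -49.367

5) P Value = 1.99e-177 (effectively 0)

I want to reject the null hypothesis if my p-value is < .05.

Conclusion: We reject the null hypothesis. Republicans supported the Physician Fee Freeze Bill at a much higher rate than Democrats (mean 0.988 vs. 0.054), and the p-value is far below 0.01, which meets goal 3.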
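For the first stretch goal, the repeated steps above (split by party, run `ttest_ind`, compare the p-value to a threshold) could be wrapped in a function. The sketch below is one possible shape, not the notebook's own code: `compare_parties` is a hypothetical helper, and `toy` is a made-up frame that stands in for the cleaned `df` so the example runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_parties(df, issue, alpha=0.05):
    """Run a two-sample t-test on one issue, Democrats vs. Republicans."""
    dem_votes = df.loc[df['party'] == 'democrat', issue]
    rep_votes = df.loc[df['party'] == 'republican', issue]
    result = stats.ttest_ind(dem_votes, rep_votes, nan_policy='omit')
    return {
        'issue': issue,
        'dem_mean': dem_votes.mean(),   # mean() skips NaN by default
        'rep_mean': rep_votes.mean(),
        't_stat': float(result.statistic),
        'p_value': float(result.pvalue),
        'reject_null': bool(result.pvalue < alpha),
    }

# Made-up data for illustration only (1 = yes, 0 = no, NaN = missing).
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue-a': [1, 1, 1, 1, 1, np.nan, 0, 0, 0, 1, 0, 0],
    'issue-b': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})

# One row of results per issue column.
summary = pd.DataFrame(
    [compare_parties(toy, col) for col in toy.columns.drop('party')]
)
print(summary)
```

With the real data, the same list comprehension over `df.columns.drop('party')` would give all 16 issues in one table, which could then be sorted by `p_value`.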