

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class label (democrat or republican), and 16 binary attributes (a yes or no vote on each issue). Be aware that there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
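
As a quick illustration of a two-sample t-test, here is a minimal sketch using `scipy.stats.ttest_ind`. The two vote arrays are hypothetical toy data (1 = yes, 0 = no), not values from the congressional dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes (1 = yes, 0 = no); not taken from the dataset.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

# Two-sample t-test: compares the mean of group_a with the mean of group_b.
# The sign of the t statistic follows the argument order (a minus b).
tstat, pvalue = ttest_ind(group_a, group_b)

print(f"mean a = {group_a.mean():.2f}, mean b = {group_b.mean():.2f}")
print(f"t = {tstat:.3f}, p = {pvalue:.4f}")
```

Because the votes are coded as 0 and 1, each group's mean is simply its share of yes votes, so the test compares support rates between the two groups.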

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

## Read in the data

In [0]:
import pandas as pd
import numpy as np

In [2]:
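# First attempt: without header=None, pandas uses the first record as column names,
# so the frame has 434 rows instead of 435 (corrected in the next cell).
df = pd.read_csv('house-votes-84.data')
print(df.shape)
df.head()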

(434, 17)


Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


In [3]:
col_headers = ['Party', 'Handicapped Infants', 'Water Project Cost Sharing', 'Adoption of Budget Resolution', 
               'Physician Fee Freeze', 'El Salvador Aid', 'Religious Groups in Schools', 'Anti-Satellite Test Ban', 
               'Aid to Nicaraguan Contras', 'Mx Missile', 'Immigration', 'Synfuels Corporation Cutback', 
               'Education Spending', 'Superfund Right to Sue', 'Crime', 'Duty Free Exports', 
               'Export Administration Act South Africa']

df = pd.read_csv('house-votes-84.data', header=None, names=col_headers, na_values='?')
df.head()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports,Export Administration Act South Africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [4]:
df.describe()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports,Export Administration Act South Africa
count,435,423,387,424,424,420,424,421,420,413,428,414,404,410,418,407,331
unique,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
top,democrat,n,y,y,n,y,y,y,y,y,y,n,n,y,y,n,y
freq,267,236,195,253,247,212,272,239,242,207,216,264,233,209,248,233,269


In [5]:
df.isna().sum()

Party                                       0
Handicapped Infants                        12
Water Project Cost Sharing                 48
Adoption of Budget Resolution              11
Physician Fee Freeze                       11
El Salvador Aid                            15
Religious Groups in Schools                11
Anti-Satellite Test Ban                    14
Aid to Nicaraguan Contras                  15
Mx Missile                                 22
Immigration                                 7
Synfuels Corporation Cutback               21
Education Spending                         31
Superfund Right to Sue                     25
Crime                                      17
Duty Free Exports                          28
Export Administration Act South Africa    104
dtype: int64

In [6]:
df = df.drop(columns='Export Administration Act South Africa')  # Too many missing values
df.head()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y


In [7]:
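# Impute each missing vote with the next non-missing vote in the same column.
# fillna(method='bfill') is deprecated in recent pandas; DataFrame.bfill() is equivalent.
df = df.bfill()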
df.head()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports
0,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n
2,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n
3,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,n,n
4,democrat,y,y,y,n,y,y,n,n,n,n,y,n,y,y,y


In [8]:
# bfill leaves any trailing missing values unfilled, so forward-fill this column too.
df['Duty Free Exports'] = df['Duty Free Exports'].ffill()
df.tail()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n
432,republican,n,n,n,y,y,y,n,n,n,n,y,y,y,y,n
433,republican,n,n,n,y,y,y,n,n,n,y,n,y,y,y,n
434,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n
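
Back- and forward-filling copies a neighbouring legislator's vote into each gap, which may blur real party differences. One alternative, sketched here on a hypothetical toy frame (the `toy` data and the `Issue` column are made up for illustration), is to keep missing votes as NaN and drop them per issue at test time, either with `dropna()` or with the `nan_policy='omit'` argument of `scipy.stats.ttest_ind`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical miniature frame using the notebook's encoding ('y' / 'n' / NaN).
toy = pd.DataFrame({
    'Party': ['democrat'] * 6 + ['republican'] * 6,
    'Issue': ['y', 'y', np.nan, 'y', 'n', 'y',
              'n', 'n', 'y', np.nan, 'n', 'n'],
})

# Map votes to numbers; missing votes stay NaN.
toy['Issue'] = toy['Issue'].map({'y': 1, 'n': 0})

dem_votes = toy.loc[toy['Party'] == 'democrat', 'Issue'].to_numpy()
rep_votes = toy.loc[toy['Party'] == 'republican', 'Issue'].to_numpy()

# Option 1: let SciPy ignore the missing votes.
tstat, pvalue = ttest_ind(dem_votes, rep_votes, nan_policy='omit')

# Option 2: drop the missing votes explicitly before testing.
tstat_drop, pvalue_drop = ttest_ind(dem_votes[~np.isnan(dem_votes)],
                                    rep_votes[~np.isnan(rep_votes)])

print(float(tstat), float(pvalue))
print(float(tstat_drop), float(pvalue_drop))
```

Both options use only the votes that were actually cast, so the sample size can differ from issue to issue.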


## Hypothesis 1:
### H<sub>0</sub>: μ (Republican support of Synfuels Corporation Cutback Bill) == μ (Democratic support of Synfuels Corporation Cutback Bill)

### H<sub>1</sub>: μ (Republican support of Synfuels Corporation Cutback Bill) < μ (Democratic support of Synfuels Corporation Cutback Bill)




In [0]:
# columns = ['Handicapped Infants', 'Water Project Cost Sharing', 'Adoption of Budget Resolution', 
#                'Physician Fee Freeze', 'El Salvador Aid', 'Religious Groups in Schools', 'Anti-Satellite Test Ban', 
#                'Aid to Nicaraguan Contras', 'Mx Missile', 'Immigration', 'Synfuels Corporation Cutback', 
#                'Education Spending', 'Superfund Right to Sue', 'Crime', 'Duty Free Exports']

# df_new = []

# def give_numbers():
#   for _ in columns:
#     df_new.append(df[_].replace({'y': 1, 'n': 0}))
    
# give_numbers()

In [0]:
# df_new.head()
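
The commented-out attempt above collects the converted columns in a plain list, which has no `.head()` method. A minimal sketch of converting every vote column at once is shown here on a hypothetical toy frame; the names `toy`, `toy_numeric`, and `vote_map` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the notebook's 'y' / 'n' encoding.
toy = pd.DataFrame({
    'Party': ['democrat', 'republican', 'democrat'],
    'Crime': ['y', 'n', np.nan],
    'Immigration': ['n', 'y', 'y'],
})

vote_map = {'y': 1, 'n': 0}
vote_columns = [col for col in toy.columns if col != 'Party']

# Work on a copy so the original frame keeps its string labels.
toy_numeric = toy.copy()
for col in vote_columns:
    # Series.map leaves missing values as NaN.
    toy_numeric[col] = toy_numeric[col].map(vote_map)

print(toy_numeric)
```

Keeping the result in a DataFrame, rather than a list of Series, means `.head()`, `pd.crosstab`, and boolean filtering by party continue to work.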

In [0]:
df['Synfuels Corporation Cutback'] = df['Synfuels Corporation Cutback'].replace({'y': 1, 'n': 0})
df['Religious Groups in Schools'] = df['Religious Groups in Schools'].replace({'y': 1, 'n': 0})
df['Immigration'] = df['Immigration'].replace({'y': 1, 'n': 0})

In [12]:
df['Synfuels Corporation Cutback'].head(10)

0    0
1    0
2    1
3    1
4    1
5    0
6    0
7    0
8    0
9    0
Name: Synfuels Corporation Cutback, dtype: int64

In [13]:
pd.crosstab(df['Party'], df['Synfuels Corporation Cutback'])

Party,Synfuels Corporation Cutback = 0,Synfuels Corporation Cutback = 1
democrat,133,134
republican,143,25
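
Raw counts are hard to compare when the parties differ in size. As a sketch, the vote column can be rebuilt from the counts in the crosstab above and passed to `pd.crosstab` with `normalize='index'` to get per-party shares. The frame `toy` is reconstructed only from those counts, not loaded from the data file.

```python
import pandas as pd

# Rebuild the column from the crosstab counts:
# democrat: 133 no, 134 yes; republican: 143 no, 25 yes.
toy = pd.DataFrame({
    'Party': ['democrat'] * 267 + ['republican'] * 168,
    'Synfuels Corporation Cutback': [0] * 133 + [1] * 134 + [0] * 143 + [1] * 25,
})

# normalize='index' divides each row by its total, giving vote shares per party.
shares = pd.crosstab(toy['Party'], toy['Synfuels Corporation Cutback'],
                     normalize='index')
print(shares.round(3))
```

The yes share is about 50% for democrats and about 15% for republicans, which is the gap the t-test below evaluates.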


In [14]:
rep = df[df['Party'] == 'republican']
dem = df[df['Party'] == 'democrat']
rep.head()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports
0,republican,n,y,n,y,y,1,n,n,n,1,0,y,y,y,n
1,republican,n,y,n,y,y,1,n,n,n,0,0,y,y,y,n
7,republican,n,y,n,y,y,1,n,n,n,0,0,n,y,y,n
8,republican,n,y,n,y,y,1,n,n,n,0,0,y,y,y,n
10,republican,n,y,n,y,y,0,n,n,n,0,1,n,y,y,n


In [15]:
dem.head()

Unnamed: 0,Party,Handicapped Infants,Water Project Cost Sharing,Adoption of Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups in Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,Mx Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports
2,democrat,n,y,y,n,y,1,n,n,n,0,1,n,y,y,n
3,democrat,n,y,y,n,y,1,n,n,n,0,1,n,y,n,n
4,democrat,y,y,y,n,y,1,n,n,n,0,1,n,y,y,y
5,democrat,n,y,y,n,y,1,n,n,n,0,0,n,y,y,y
6,democrat,n,y,n,y,y,1,n,n,n,0,0,n,y,y,y


In [0]:
from scipy.stats import ttest_ind  # only the independent two-sample test is used below

In [17]:
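# Democrats are passed first, so a positive t statistic means higher Democratic support.
# ttest_ind returns a two-sided p-value by default.
tstat, pvalue = ttest_ind(dem['Synfuels Corporation Cutback'], rep['Synfuels Corporation Cutback'])
print(tstat)
print(pvalue)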

7.951508196691867
1.618715709606326e-14


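### Reject the null hypothesis (H<sub>0</sub>)

With t ≈ 7.95 and a two-sided p ≈ 1.6e-14 (well below 0.01), Democrats supported the Synfuels Corporation Cutback bill significantly more than Republicans. The t statistic is positive because the Democratic sample is passed to `ttest_ind` first.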
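
H<sub>1</sub> for this hypothesis is one-sided, while `ttest_ind` reports a two-sided p-value by default. As a sketch, the test can be rerun with the `alternative` argument (available in SciPy 1.6 and later, so this assumes a newer SciPy than the notebook may have used). The arrays are rebuilt from the crosstab counts shown earlier rather than loaded from the data file.

```python
import numpy as np
from scipy.stats import ttest_ind

# Rebuilt from the Synfuels crosstab: democrat 133 no / 134 yes, republican 143 no / 25 yes.
dem_votes = np.array([0] * 133 + [1] * 134)
rep_votes = np.array([0] * 143 + [1] * 25)

# Default: two-sided test (H1: the means differ).
t_two, p_two = ttest_ind(dem_votes, rep_votes)

# One-sided test (H1: Democratic mean is greater than Republican mean).
t_one, p_one = ttest_ind(dem_votes, rep_votes, alternative='greater')

print(f"two-sided: t = {t_two:.4f}, p = {p_two:.3g}")
print(f"one-sided: t = {t_one:.4f}, p = {p_one:.3g}")
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one, so the conclusion here is the same under either test.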

## Hypothesis 2:
### H<sub>0</sub>: μ (Republican support of Religious Groups in Schools Bill) == μ (Democratic support of Religious Groups in Schools Bill)

### H<sub>1</sub>: μ (Republican support of Religious Groups in Schools Bill) > μ (Democratic support of Religious Groups in Schools Bill)

In [0]:
# rep['Religious Groups in Schools'] = rep['Religious Groups in Schools'].replace({'y': 1, 'n': 0})
# dem['Religious Groups in Schools'] = dem['Religious Groups in Schools'].replace({'y': 1, 'n': 0})

In [19]:
rep['Religious Groups in Schools'].head()

0     1
1     1
7     1
8     1
10    0
Name: Religious Groups in Schools, dtype: int64

In [20]:
dem['Religious Groups in Schools'].head()

2    1
3    1
4    1
5    1
6    1
Name: Religious Groups in Schools, dtype: int64

In [21]:
pd.crosstab(df['Party'], df['Religious Groups in Schools'])

Party,Religious Groups in Schools = 0,Religious Groups in Schools = 1
democrat,141,126
republican,19,149


In [22]:
# Democrats are passed first, so a negative t statistic means higher Republican support.
tstat, pvalue = ttest_ind(dem['Religious Groups in Schools'], rep['Religious Groups in Schools'])
print(tstat)
print(pvalue)

-9.602708758468282
6.365428881925945e-20


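### Reject the null hypothesis (H<sub>0</sub>)

With t ≈ -9.60 and a two-sided p ≈ 6.4e-20 (well below 0.01), Republicans supported the Religious Groups in Schools bill significantly more than Democrats. The t statistic is negative because the Democratic sample is passed to `ttest_ind` first.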

## Hypothesis 3:
### H<sub>0</sub>: μ (Republican support of Immigration Bill) == μ (Democratic support of Immigration Bill)

### H<sub>1</sub>: μ (Republican support of Immigration Bill) != μ (Democratic support of Immigration Bill)

In [0]:
# rep['Immigration'] = rep['Immigration'].replace({'y': 1, 'n': 0})
# dem['Immigration'] = dem['Immigration'].replace({'y': 1, 'n': 0})

In [24]:
rep['Immigration'].head()

0     1
1     0
7     0
8     0
10    0
Name: Immigration, dtype: int64

In [25]:
dem['Immigration'].head()

2    0
3    0
4    0
5    0
6    0
Name: Immigration, dtype: int64

In [26]:
pd.crosstab(df['Party'], df['Immigration'])

Party,Immigration = 0,Immigration = 1
democrat,140,127
republican,75,93


In [27]:
# Two-sided test: H1 only states that the party means differ.
tstat, pvalue = ttest_ind(dem['Immigration'], rep['Immigration'])
print(tstat)
print(pvalue)

-1.5834489375006358
0.11404925251581424


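### Fail to reject the null hypothesis (H<sub>0</sub>)

With t ≈ -1.58 and p ≈ 0.114 (above 0.1), the data do not show a statistically significant difference between the parties on the Immigration bill.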
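
Stretch goal 1 asks for reusable functions. A minimal sketch follows, assuming a hypothetical helper named `party_ttest` (not part of the notebook) that splits a frame by party, drops missing votes, and runs the test for one issue. The demo frame is rebuilt from the crosstab counts above; how votes on different issues are paired within a row is arbitrary, which does not affect per-issue tests.

```python
import pandas as pd
from scipy.stats import ttest_ind


def party_ttest(df, issue, party_col='Party'):
    """Two-sample t-test of democrat vs. republican support for one issue.

    Expects votes coded as 1 (yes) / 0 (no); missing votes are dropped.
    Returns (t statistic, two-sided p-value), with democrats passed first.
    """
    dem = df.loc[df[party_col] == 'democrat', issue].dropna()
    rep = df.loc[df[party_col] == 'republican', issue].dropna()
    tstat, pvalue = ttest_ind(dem, rep)
    return tstat, pvalue


def votes(n_no, n_yes):
    """Build a list of n_no zeros followed by n_yes ones."""
    return [0] * n_no + [1] * n_yes


# Rebuilt from the crosstabs above (democrat counts first, then republican).
frame = pd.DataFrame({
    'Party': ['democrat'] * 267 + ['republican'] * 168,
    'Immigration': votes(140, 127) + votes(75, 93),
    'Religious Groups in Schools': votes(141, 126) + votes(19, 149),
})

results = {}
for issue in ['Immigration', 'Religious Groups in Schools']:
    results[issue] = party_ttest(frame, issue)
    tstat, pvalue = results[issue]
    print(f"{issue}: t = {tstat:.3f}, p = {pvalue:.3g}")
```

Looping over all issue columns with a helper like this makes it easy to rerun the analysis after changing how missing values are handled.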