<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
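As a minimal sketch of what a two-sample t-test looks like in code, the example below compares two small, made-up groups of 0/1 votes with `scipy.stats.ttest_ind`. The arrays `group_a` and `group_b` are hypothetical and are not taken from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two small groups (made-up data, not from the dataset)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# Two-sample (independent) t-test: are the two group means different?
tstat, pvalue = ttest_ind(group_a, group_b)

print(f"mean a = {group_a.mean():.3f}, mean b = {group_b.mean():.3f}")
print(f"t = {tstat:.3f}, p = {pvalue:.4f}")
```

The sign of the t-statistic follows the argument order: it is positive when the first group has the higher mean.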

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import style
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [7]:
# Load data
voting_data = pd.read_csv('https://raw.githubusercontent.com/JimKing100/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/master/house-votes-84.data',
                          header=None)
voting_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [8]:
voting_data = voting_data.rename(columns={0: 'party'})
voting_data.head()

Unnamed: 0,party,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [9]:
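# Treat missing votes ('?') as 'no' votes, then encode n -> 0 and y -> 1.
# Note: counting '?' as 'no' lowers the yes-rate on issues with many missing votes.
voting_data[voting_data=='?'] = 'n'
voting_data[voting_data=='n'] = 0
voting_data[voting_data=='y'] = 1
voting_data.head()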

Unnamed: 0,party,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
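The cleaning step above counts every missing vote as a 'no'. An alternative, sketched here on a tiny made-up frame, is to encode `'?'` as `NaN` and let the test drop those observations per issue. The frame `sample` and its values are hypothetical; `nan_policy='omit'` is an existing option of `scipy.stats.ttest_ind`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Tiny made-up frame shaped like the voting data (party plus one issue column)
sample = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    1: ['n', '?', 'n', 'y', 'n', 'y', 'y', '?', 'n', 'y'],
})

# Encode votes as numbers; '?' becomes NaN instead of being counted as 'no'
encoded = sample.copy()
encoded[1] = encoded[1].map({'y': 1.0, 'n': 0.0, '?': np.nan})

rep = encoded.loc[encoded['party'] == 'republican', 1]
dem = encoded.loc[encoded['party'] == 'democrat', 1]

# nan_policy='omit' ignores the NaN entries when computing the test
tstat, pvalue = ttest_ind(rep, dem, nan_policy='omit')

print('votes used:', rep.count(), 'republican,', dem.count(), 'democrat')
print('t =', float(tstat), 'p =', float(pvalue))
```

With this approach each issue is tested only on the members who actually voted, so the group sizes can differ from issue to issue.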


In [10]:
# For each of the 16 issues, compare republican and democrat yes-rates with a
# two-sample t-test. A positive t-statistic means republicans voted 'yes' more
# often; a negative one means democrats did.
for i in range(1, 17):
  rep_data = voting_data[voting_data['party'] == 'republican'][i]
  dem_data = voting_data[voting_data['party'] == 'democrat'][i]
  rep_mean = rep_data.mean()
  dem_mean = dem_data.mean()

  tstat, pvalue = ttest_ind(rep_data, dem_data)
  print('rep mean = ', rep_mean)
  print('dem mean = ', dem_mean)
  print('tstat issue %d - ' %i, tstat)
  print('pvalue issue %d - ' %i,pvalue)
  print('\n')

rep mean =  0.18452380952380953
dem mean =  0.5842696629213483
tstat issue 1 -  -8.8971307386929
pvalue issue 1 -  1.5743382054892986e-17


rep mean =  0.44642857142857145
dem mean =  0.449438202247191
tstat issue 2 -  -0.061312129984723324
pvalue issue 2 -  0.9511389201775375


rep mean =  0.13095238095238096
dem mean =  0.8651685393258427
tstat issue 3 -  -21.882663845792216
pvalue issue 3 -  4.994834925373301e-72


rep mean =  0.9702380952380952
dem mean =  0.052434456928838954
tstat issue 4 -  45.5632745656491
pvalue issue 4 -  2.701847998042241e-167


rep mean =  0.9345238095238095
dem mean =  0.20599250936329588
tstat issue 5 -  20.95855719303714
pvalue issue 5 -  7.604798956400169e-68


rep mean =  0.8869047619047619
dem mean =  0.4606741573033708
tstat issue 6 -  9.874683331966061
pvalue issue 6 -  7.088016917558158e-21


rep mean =  0.23214285714285715
dem mean =  0.7490636704119851
tstat issue 7 -  -12.20185970394008
pvalue issue 7 -  1.2271674100822144e-29


rep mean =  0.14

1)  The data was loaded and cleaned; missing values ('?') were replaced with 'n'.

2)  Democrats support the following issues more than republicans, with p < .01:

-  1 handicapped infants
-  3 adoption of the budget
-  7 anti-satellite test ban
-  8 aid to contras
-  9 mx-missile
-  11 synfuels
-  15 duty free exports

3)  Republicans support the following issues more than democrats, with p < .01:

-  4 physician fee freeze
-  5 el salvador aid
-  6 religious groups in school
-  12 education spending
-  13 superfund
-  14 crime

4)  The difference between republicans and democrats has p > .1 on the following issues:

-  2 water project cost sharing
-  10 immigration
-  16 export admin act
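For the first stretch goal, the per-issue comparison and the grouping used in the summary above could be refactored along these lines. This is only a sketch: the function names `compare_parties` and `classify`, their parameters, and the demo frame are hypothetical, and the thresholds simply mirror the assignment's p < 0.01 and p > 0.1 cutoffs.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(df, issue, group_col='party'):
    """Return (rep_mean, dem_mean, tstat, pvalue) for one 0/1 issue column."""
    rep = df.loc[df[group_col] == 'republican', issue].astype(float)
    dem = df.loc[df[group_col] == 'democrat', issue].astype(float)
    tstat, pvalue = ttest_ind(rep, dem)
    return rep.mean(), dem.mean(), tstat, pvalue

def classify(tstat, pvalue, alpha=0.01, no_diff=0.1):
    """Label an issue by which party supports it more, if the test is clear."""
    if pvalue < alpha:
        # Republicans are the first sample, so a positive t favors them
        return 'republican' if tstat > 0 else 'democrat'
    if pvalue > no_diff:
        return 'no clear difference'
    return 'inconclusive'

# Made-up demo data with a single issue column named 1
demo = pd.DataFrame({
    'party': ['republican'] * 4 + ['democrat'] * 4,
    1: [1, 1, 1, 0, 0, 0, 1, 0],
})

rep_mean, dem_mean, tstat, pvalue = compare_parties(demo, 1)
print(rep_mean, dem_mean, classify(tstat, pvalue))
```

Applied to the cleaned `voting_data`, the same two functions could be called in a loop over columns 1 through 16 to rebuild the lists above.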
