<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import style
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [6]:
# Load data
voting_data = pd.read_csv('https://raw.githubusercontent.com/JimKing100/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/master/house-votes-84.data',
                          header=None)
voting_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
voting_data = voting_data.rename(columns={0: 'party'})

In [0]:
# Calculate the percentage of yes votes for each party
def yes_vote_percentages(vdata, issue):

  # Republican yes votes
  rep_total = vdata[(vdata['party'] == 'republican') & (vdata[issue] == 'y')].count()
  rep_yes_votes = rep_total[issue]
  print('r yes', rep_yes_votes)
  
  # Democrat yes votes
  dem_total = vdata[(vdata['party'] == 'democrat') & (vdata[issue] == 'y')].count()
  dem_yes_votes = dem_total[issue]
  print('d yes', dem_yes_votes)
  
  # Republican no votes
  rep_total = vdata[(vdata['party'] == 'republican') & (vdata[issue] == 'n')].count()
  rep_no_votes = rep_total[issue]
  print('r no', rep_no_votes)
  
  # Democrat no votes
  dem_total = vdata[(vdata['party'] == 'democrat') & (vdata[issue] == 'n')].count()
  dem_no_votes = dem_total[issue]
  print('d no', dem_no_votes)
  
  # Total yes votes
  yes_vote_total = vdata[(vdata[issue] == 'y')].count()
  total_yes_votes = yes_vote_total[issue]
  
  # Total no votes
  no_vote_total = vdata[(vdata[issue] == 'n')].count()
  total_no_votes = no_vote_total[issue]
  
  # Total votes
  total_votes = total_yes_votes + total_no_votes 
  
  # Calculate percentages and totals
  total_rep_votes = rep_yes_votes + rep_no_votes
  total_dem_votes = dem_yes_votes + dem_no_votes
  rpercent = rep_yes_votes / total_votes
  dpercent = dem_yes_votes / total_votes

  return dpercent, rpercent

In [9]:
for i in range(1, 17):
  dem_percent, rep_percent = yes_vote_percentages(voting_data, i)

  dem_mu = dem_percent
  rep_mu = rep_percent

  dem_var = dem_mu * (1-dem_mu)
  rep_var = rep_mu * (1-rep_mu)

  testa = np.random.normal(dem_percent, np.sqrt(dem_var), 50)
  testb = np.random.normal(rep_percent, np.sqrt(rep_var), 50)

  tstat, pvalue = ttest_ind(testa, testb)
  print('tstat issue %d - ' %i, tstat)
  print('pvalue issue %d - ' %i,pvalue)
  

r yes 31
d yes 156
r no 134
d no 102
tstat issue 1 -  4.70091504185835
pvalue issue 1 -  8.47243572322949e-06
r yes 75
d yes 120
r no 73
d no 119
tstat issue 2 -  0.7731082204850803
pvalue issue 2 -  0.4413190887656372
r yes 22
d yes 231
r no 142
d no 29
tstat issue 3 -  6.204936497172127
pvalue issue 3 -  1.3121896983852103e-08
r yes 163
d yes 14
r no 2
d no 245
tstat issue 4 -  -4.667212613442862
pvalue issue 4 -  9.69052465212431e-06
r yes 157
d yes 55
r no 8
d no 200
tstat issue 5 -  -1.5981223588913895
pvalue issue 5 -  0.11323517231948124
r yes 149
d yes 123
r no 17
d no 135
tstat issue 6 -  -0.2506827675456463
pvalue issue 6 -  0.8025842087910496
r yes 39
d yes 200
r no 123
d no 59
tstat issue 7 -  4.578894114518229
pvalue issue 7 -  1.3743628241996568e-05
r yes 24
d yes 218
r no 133
d no 45
tstat issue 8 -  3.9554516048852153
pvalue issue 8 -  0.00014450079762554397
r yes 19
d yes 188
r no 146
d no 60
tstat issue 9 -  4.120171736511351
pvalue issue 9 -  7.919913621585878e-05
r 