
## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
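To make the two-sample idea concrete, here is a minimal sketch on made-up votes. The arrays `group_a` and `group_b` are hypothetical and are not taken from the dataset; they only show how `ttest_ind` compares two groups while skipping missing values.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes (1 = yes, 0 = no, NaN = missing) for two groups
group_a = np.array([1, 1, 0, 1, np.nan, 1])
group_b = np.array([0, 0, 1, 0, 0, np.nan])

# nan_policy='omit' drops the missing votes before computing the test
result = ttest_ind(group_a, group_b, nan_policy='omit')
print('t statistic:', result.statistic)
print('p value:', result.pvalue)
```

A positive statistic here means the first group's mean is higher than the second group's, so the order of the arguments matters when reading the sign.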

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### the data scientist trio starter pack + ttest imports

import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind


In [2]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header=None)

print(df.shape)
df.head(10)

(435, 17)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [0]:
# made a columns variable just in case I need to call upon it for the stretch goal
columns = ['party', 'handicapped-infants', 'water-project', 
                     'budget', 'physician-fee-freeze', 'el-salvador-aid',
                     'religious-groups', 'anti-satellite-ban', 'aid-to-contras',
                     'mx-missile', 'immigration', 'synfuels', 'education',
                     'right-to-sue', 'crime', 'duty-free', 'south-africa']

# renaming the columns, reusing the list above instead of typing it twice
df.columns = columns


In [4]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [5]:
# converting every 'n', 'y', and '?' into its numeric value: 0, 1, and NaN
df = df.replace(to_replace=['n', 'y', '?'], value=[0, 1, np.nan])
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


This satisfies the first requirement of the assignment: all the data is clean at this point and appears to be identical to the example we saw in lecture today.

In [6]:
# checking to make sure the values are numeric and not strings
df.dtypes

party                    object
handicapped-infants     float64
water-project           float64
budget                  float64
physician-fee-freeze    float64
el-salvador-aid         float64
religious-groups        float64
anti-satellite-ban      float64
aid-to-contras          float64
mx-missile              float64
immigration             float64
synfuels                float64
education               float64
right-to-sue            float64
crime                   float64
duty-free               float64
south-africa            float64
dtype: object
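Since the assignment warns about missing values, it can help to count them per party before testing. The sketch below is one possible way to do that, shown on a hypothetical toy frame (`toy`, `issue-a`, and `issue-b` are made-up names). It also uses a per-column `map` as an alternative to `df.replace`, which is an assumption about style rather than what the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the voting data: 'y', 'n', and '?' for missing
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'issue-a': ['y', 'n', '?', 'y'],
    'issue-b': ['n', '?', '?', 'y'],
})

# Map votes to numbers column by column; '?' becomes NaN
vote_map = {'y': 1, 'n': 0, '?': np.nan}
issue_cols = [col for col in toy.columns if col != 'party']
toy[issue_cols] = toy[issue_cols].apply(lambda col: col.map(vote_map))

# Count the missing votes for each issue, split by party
missing_by_party = toy[issue_cols].isna().groupby(toy['party']).sum()
print(missing_by_party)
```

If one party has many more missing votes on an issue than the other, that is worth keeping in mind when interpreting the test for that issue.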

In [7]:
# making our democrat variable to call on for each column
dem = df[df['party'] == 'democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [8]:
# making our republican variable to call on for each column
rep = df[df['party'] == 'republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [9]:
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [10]:
# Handicapped-infants mean for Democrats

print(dem['handicapped-infants'].value_counts(dropna=False))
print('Democrat Support:', dem['handicapped-infants'].mean())

1.0    156
0.0    102
NaN      9
Name: handicapped-infants, dtype: int64
Democrat Support: 0.6046511627906976


In [11]:
# Handicapped-infants mean for Republicans
print(rep['handicapped-infants'].value_counts(dropna=False))
print('Republican support:', rep['handicapped-infants'].mean())

0.0    134
1.0     31
NaN      3
Name: handicapped-infants, dtype: int64
Republican support: 0.18787878787878787


In [32]:
# 1Sample T Test for Handicapped-infants for Democrats
print('Split: ',ttest_1samp(dem['handicapped-infants'], .5, nan_policy='omit'))
print('All Against: ',ttest_1samp(dem['handicapped-infants'], 0, nan_policy='omit'))
print('All In Favor: ', ttest_1samp(dem['handicapped-infants'], 1, nan_policy='omit'))
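# NOTE: the popmean here is 6, not .6, so this line does not really test 60% support
# (the water-project cells use .6); the output is kept as originally run
print('60%: ', ttest_1samp(dem['handicapped-infants'], 6, nan_policy='omit'))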

Split:  Ttest_1sampResult(statistic=3.431373087696574, pvalue=0.000699612317167372)
All Against:  Ttest_1sampResult(statistic=19.825711173357988, pvalue=1.0391992873567661e-53)
All In Favor:  Ttest_1sampResult(statistic=-12.96296499796484, pvalue=6.590394568934029e-30)
60%:  Ttest_1sampResult(statistic=-176.90634585457897, pvalue=1.7711329008626174e-270)


In [33]:
# 1Sample T Test for Handicapped-infants for Republicans
print('Split: ', ttest_1samp(rep['handicapped-infants'], .5, nan_policy='omit'))
print('All Against: ', ttest_1samp(rep['handicapped-infants'], 0, nan_policy='omit'))
print('All in favor: ', ttest_1samp(rep['handicapped-infants'], 1, nan_policy='omit'))
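# NOTE: the popmean here is 6, not .6, so this line does not really test 60% support
# (the water-project cells use .6); the output is kept as originally run
print('60%: ', ttest_1samp(rep['handicapped-infants'], 6, nan_policy='omit'))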

Split:  Ttest_1sampResult(statistic=-10.232833482397659, pvalue=2.572179359890009e-19)
All Against:  Ttest_1sampResult(statistic=6.159569669016066, pvalue=5.434587970316366e-09)
All in favor:  Ttest_1sampResult(statistic=-26.625236633811387, pvalue=1.978873197183477e-61)
60%:  Ttest_1sampResult(statistic=-190.54926814794865, pvalue=2.1396044378365313e-194)
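The one-sample statistics can be checked by hand with t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)). The sketch below rebuilds the Democrat handicapped-infants votes from the value counts printed earlier (156 yes, 102 no, NaNs left out) and compares the manual result with `ttest_1samp`. The variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Democrat votes on handicapped-infants, rebuilt from the printed counts:
# 156 yes (1.0) and 102 no (0.0); the 9 missing votes are simply left out
votes = np.array([1.0] * 156 + [0.0] * 102)

popmean = 0.5  # the "Split" hypothesis: support is 50/50
n = len(votes)

# ddof=1 gives the sample standard deviation, which the t-test uses
standard_error = votes.std(ddof=1) / np.sqrt(n)
t_manual = (votes.mean() - popmean) / standard_error

t_scipy = ttest_1samp(votes, popmean).statistic
print('manual:', t_manual)
print('scipy: ', t_scipy)
```

Both values should agree with the "Split" statistic of roughly 3.43 reported for Democrats above.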


In [14]:
# Water Project Mean for Democrats
print(dem['water-project'].value_counts(dropna=False))
print('Democrat Support:', dem['water-project'].mean())

1.0    120
0.0    119
NaN     28
Name: water-project, dtype: int64
Democrat Support: 0.502092050209205


In [15]:
# Water project Mean for Republicans 
print(rep['water-project'].value_counts(dropna=False))
print('Republican Support:', rep['water-project'].mean())

1.0    75
0.0    73
NaN    20
Name: water-project, dtype: int64
Republican Support: 0.5067567567567568


In [31]:
#1 Sample Ttest for democrats on Water-Project
print('Split: ',ttest_1samp(dem['water-project'], .5, nan_policy='omit'))
print('All Against: ', ttest_1samp(dem['water-project'], 0, nan_policy='omit'))
print('All In Favor: ', ttest_1samp(dem['water-project'], 1, nan_policy='omit'))
print('60%:  ',ttest_1samp(dem['water-project'], .6, nan_policy='omit'))


Split:  Ttest_1sampResult(statistic=0.06454972243678961, pvalue=0.9485867005339235)
All Against:  Ttest_1sampResult(statistic=15.49193338482967, pvalue=6.633846650320544e-38)
All In Favor:  Ttest_1sampResult(statistic=-15.36283393995609, pvalue=1.8031537722768159e-37)
60%:   Ttest_1sampResult(statistic=-3.0209270100417855, pvalue=0.002795053715081168)


In [34]:
# 1 Sample Ttest for Republicans on Water-Project
print('Split: ', ttest_1samp(rep['water-project'], .5, nan_policy='omit'))
print('All Against: ', ttest_1samp(rep['water-project'], 0, nan_policy='omit'))
print('All In Favor: ', ttest_1samp(rep['water-project'], 1, nan_policy='omit'))
print('60%:  ',ttest_1samp(rep['water-project'], .6, nan_policy='omit'))


Split:  Ttest_1sampResult(statistic=0.16385760607458383, pvalue=0.8700683158522193)
All Against:  Ttest_1sampResult(statistic=12.28932045559371, pvalue=2.525482675130834e-24)
All In Favor:  Ttest_1sampResult(statistic=-11.961605243444543, pvalue=1.8656648229239887e-23)
60%:   Ttest_1sampResult(statistic=-2.2612349638292413, pvalue=0.025212260772844)


# Two Sample T-Tests

In [53]:
print('Democrat Support: ', dem['handicapped-infants'].mean())
print('Republican Support: ', rep['handicapped-infants'].mean())

Democrat Support:  0.6046511627906976
Republican Support:  0.18787878787878787


In [56]:
ttest_ind(dem['handicapped-infants'], rep['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=9.205264294809222, pvalue=1.613440327937243e-18)

Null Hypothesis: There is equal support (%) from both parties on this bill.

Given the results above, a large t statistic (about 9.2) and a very small p value (about 1.6e-18), I REJECT the null hypothesis that there is the same level of support for this bill among the two parties. A gap this large between the two means would be extremely unlikely if it were only random chance.

This also satisfies part 2 of the assignment: finding an issue that democrats support more than republicans with a p value < .01. In this case the p value comes in at a nice 1.6e-18.
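By default `ttest_ind` assumes both groups have equal variance. Because the two parties differ in size and in spread on this issue, one possible sanity check is Welch's version (`equal_var=False`). The sketch below rebuilds both groups from the handicapped-infants value counts printed earlier (Democrats 156 yes / 102 no, Republicans 31 yes / 134 no). This is an added check, not part of the original analysis.

```python
import numpy as np
from scipy.stats import ttest_ind

# Votes rebuilt from the printed value counts, with NaNs already left out
dem_votes = np.array([1.0] * 156 + [0.0] * 102)
rep_votes = np.array([1.0] * 31 + [0.0] * 134)

# Default: pooled-variance (Student's) two-sample t-test
pooled = ttest_ind(dem_votes, rep_votes)

# Welch's t-test: does not assume the two groups share a variance
welch = ttest_ind(dem_votes, rep_votes, equal_var=False)

print('pooled:', pooled.statistic, pooled.pvalue)
print('welch: ', welch.statistic, welch.pvalue)
```

The pooled statistic should match the roughly 9.21 reported above. If Welch's version also gives a p value far below .01, the conclusion does not depend on the equal-variance assumption.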

In [63]:
print('Democrat Support: ', dem['water-project'].mean())
print('Republican Support: ', rep['water-project'].mean())

Democrat Support:  0.502092050209205
Republican Support:  0.5067567567567568


In [62]:
ttest_ind(dem['water-project'], rep['water-project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

Null Hypothesis: There is equal support from both parties on this bill.

Given the results: the means are almost identical, right at the halfway point. The t statistic is close to zero (about -0.09) and the p value is quite large (about 0.93, well above .1), so I FAIL TO REJECT the null hypothesis in this case.

This satisfies condition 4 of the assignment: the p value is > .1, and support within each party is split right down the middle.

In [64]:
print('Democrat Support: ', dem['el-salvador-aid'].mean())
print('Republican Support: ', rep['el-salvador-aid'].mean())

Democrat Support:  0.21568627450980393
Republican Support:  0.9515151515151515


In [65]:
ttest_ind(dem['el-salvador-aid'], rep['el-salvador-aid'], nan_policy='omit')

Ttest_indResult(statistic=-21.13669261173219, pvalue=5.600520111729011e-68)

Null Hypothesis: There is equal support from both parties on this bill.

Given the results: a quick look at the means for both parties shows an overwhelming majority (% wise) of republicans in favor vs democrats (about 95% vs about 22%). With a t statistic that is very large in magnitude (about -21.1) and an incredibly small p value, I REJECT the null hypothesis. It is clearly evident that one party favors this issue far more than the other, and a gap this size would be a stunning result if support were really equal.

This also satisfies part 3 of the assignment instructions: I found an issue that republicans supported more than democrats with a p value that is very tiny, on the order of 1e-68. Although I totally admit, I come from a family with a lot of interest in politics, so I was able to look at the title of the issue and make an assumption in this case. I totally acknowledge that in fields I may not know a lot about (for example, I saw a kaggle dataset on pulsar stars), I would have no sort of bias or knowledge going in, lol.

# Stretch Goal: Functions

In [50]:
# looping over every issue column to run the same test over and over (republicans):

for columnName, columnData in rep.drop(columns='party').items():
  print(columnName, 'Split: ', ttest_1samp(columnData, .5, nan_policy='omit'))

In [51]:
for (columnName, columnData) in df.items():
  print('column name:', columnName)


column name: party
column name: handicapped-infants
column name: water-project
column name: budget
column name: physician-fee-freeze
column name: el-salvador-aid
column name: religious-groups
column name: anti-satellite-ban
column name: aid-to-contras
column name: mx-missile
column name: immigration
column name: synfuels
column name: education
column name: right-to-sue
column name: crime
column name: duty-free
column name: south-africa


In [66]:
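# two-sample t-test for every issue column (a Python version of the R lapply / t.test idea)
{col: ttest_ind(dem[col], rep[col], nan_policy='omit') for col in columns[1:]}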
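For the first stretch goal, here is a minimal sketch of how the two-sample test could be wrapped in functions and rerun for every issue. The helper names `compare_parties` and `compare_all_issues` and the toy frame are hypothetical. It assumes a frame shaped like `df` above: a `party` column plus numeric 0/1/NaN issue columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(data, issue, party_col='party'):
    """Two-sample t-test (democrat vs republican) for one issue column."""
    dem_votes = data.loc[data[party_col] == 'democrat', issue].dropna()
    rep_votes = data.loc[data[party_col] == 'republican', issue].dropna()
    result = ttest_ind(dem_votes, rep_votes)
    return {
        'issue': issue,
        'dem_support': dem_votes.mean(),
        'rep_support': rep_votes.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }


def compare_all_issues(data, party_col='party'):
    """Run compare_parties on every column except the party column."""
    issues = [col for col in data.columns if col != party_col]
    rows = [compare_parties(data, issue, party_col) for issue in issues]
    return pd.DataFrame(rows).set_index('issue')


# Hypothetical toy data, only to show the functions running end to end
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue-a': [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    'issue-b': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})
summary = compare_all_issues(toy)
print(summary)
```

On the real data, something like `summary[(summary['pvalue'] < 0.01) & (summary['statistic'] > 0)]` would list issues that democrats support more (goal 2). Flipping the sign check gives goal 3, and `summary[summary['pvalue'] > 0.1]` gives goal 4.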