## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
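
To make the two-sample test concrete, here is a minimal sketch of the pooled-variance (Student's) t statistic computed by hand. It assumes equal variances in the two groups, which is also the default assumption of `scipy.stats.ttest_ind`. The helper name `pooled_t_statistic` and the toy votes are illustrative and not part of the assignment.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for a pooled two-sample t-test.

    NaN entries are dropped first, mirroring nan_policy='omit'.
    """
    a = [x for x in sample_a if not math.isnan(x)]
    b = [x for x in sample_b if not math.isnan(x)]
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, dof


# Toy data (assumed): 1 = yes, 0 = no, NaN = missing vote
group_a = [1.0, 1.0, 0.0, 1.0, float("nan")]
group_b = [0.0, 0.0, 1.0, 0.0]
t_stat, dof = pooled_t_statistic(group_a, group_b)
print(t_stat, dof)
```

The sign of the statistic follows the argument order: a positive value means the first group has the higher mean.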

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Imports
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import seaborn as sns

In [0]:
# get raw data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-29 22:14:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-29 22:14:27 (132 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
# create column headers
column_headers =  ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

In [0]:
# Read data into a dataframe and look at the top five rows
df = pd.read_csv('house-votes-84.data', header=None, names=column_headers, na_values='?')
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
# Look at the dataframe's shape
df.shape

(435, 17)

In [0]:
# Recode votes as numeric
df = df.replace({'y':1, 'n':0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
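
Since the data has missing values, it may help to count them per issue and per party before testing. The sketch below uses a small made-up frame (`toy`) in place of `df`; the same expression should work on the real dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the recoded vote table (1 = yes, 0 = no, NaN = missing)
toy = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "budget": [0.0, np.nan, 1.0, 1.0],
    "crime": [1.0, 1.0, np.nan, np.nan],
})

# Count missing votes per issue, split by party
missing_by_party = toy.drop(columns="party").isna().groupby(toy["party"]).sum()
print(missing_by_party)
```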


In [0]:
# Check how many representatives each party has
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [0]:
# Look into how republicans voted
rep = df[df['party']=='republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
# Look into how the democrats voted
dem = df[df['party']=='democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
# We did the handicapped-infants and the water-project columns in class, so let's 
# check out the budget column
# First let's see the average rate of republicans voting 'yes'
rep['budget'].mean() 

0.13414634146341464

In [0]:
# Now let's see the average rate of democrats voting 'yes'
dem['budget'].mean()

0.8884615384615384

In [0]:
# The democrats voted 'yes' far more often than the republicans did (about 89% vs. 13%).
# Let's run a two-sample t-test to check whether a gap this large could plausibly
# arise by chance; there are NaNs, so we will use the 'omit' nan policy
ttest_ind(rep['budget'], dem['budget'], nan_policy='omit')

Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
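
As a sanity check on `nan_policy='omit'`, the sketch below compares it with dropping the NaNs by hand before testing. The two toy arrays are assumed for illustration; the two approaches are expected to agree.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy votes (assumed for illustration): 1 = yes, 0 = no, NaN = missing
group_a = np.array([0, 0, 1, 0, np.nan, 0], dtype=float)
group_b = np.array([1, 1, np.nan, 1, 0, 1, 1], dtype=float)

# Let SciPy skip the NaNs ...
omitted = ttest_ind(group_a, group_b, nan_policy="omit")

# ... or drop them by hand before testing
dropped = ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])

print(float(omitted.pvalue), float(dropped.pvalue))
```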

In [0]:
# Let's print that pvalue in fixed-point notation to see just how small it is
print('{:.80f}'.format(ttest_ind(rep['budget'], dem['budget'], nan_policy='omit').pvalue))

0.00000000000000000000000000000000000000000000000000000000000000000000000000002070
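
Eighty decimal places are hard to scan, so a short scientific-notation format may be easier to read. The value below is copied from the t-test output above.

```python
p_value = 2.0703402795404463e-77  # value reported by ttest_ind above

# Three significant figures in scientific notation
print("{:.2e}".format(p_value))
```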


In [0]:
# With p far below 0.01 we reject the null hypothesis of equal support:
# democrats supported the budget measure significantly more than republicans

In [0]:
# Let's take a look at another column now!
# But first, let's throw together some functions to make things a little easier.

## Function Time

In [0]:
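# Function that returns the sample t statistic, the pvalue, or both
def sample_tester(data1, data2, value='pvalue'):
  # Two-sample t-test that skips missing votes
  test = ttest_ind(data1, data2, nan_policy='omit')

  if value == 'pvalue':
    return test.pvalue

  if value == 'statistic':
    return test.statistic

  if value == 'both':
    return test

  # Fail loudly on an unknown option instead of silently returning None
  raise ValueError("value must be 'pvalue', 'statistic', or 'both'")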


In [0]:
# Now test our function against the example above first returning the default value (pvalue)
sample_tester(rep['budget'], dem['budget'])

2.0703402795404463e-77

In [0]:
# Now returning the t statistic
sample_tester(rep['budget'], dem['budget'], 'statistic')

-23.21277691701378

In [0]:
# Finally returning both
sample_tester(rep['budget'], dem['budget'], 'both')

Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
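
Going one step further toward the stretch goal, the sketch below loops the same test over every issue column and collects the results in one table. The function name `compare_all_issues`, its parameters, and the toy frame are assumptions made for illustration; on the real data one would pass `df` instead of `toy`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_all_issues(frame, group_col="party", groups=("republican", "democrat")):
    """Run a two-sample t-test on every column except group_col.

    Returns one row per issue with both group means and the p-value,
    sorted from smallest to largest p-value.
    """
    first = frame[frame[group_col] == groups[0]]
    second = frame[frame[group_col] == groups[1]]
    rows = []
    for issue in frame.columns.drop(group_col):
        result = ttest_ind(
            first[issue].to_numpy(dtype=float),
            second[issue].to_numpy(dtype=float),
            nan_policy="omit",
        )
        rows.append({
            "issue": issue,
            groups[0]: first[issue].mean(),
            groups[1]: second[issue].mean(),
            "pvalue": float(result.pvalue),
        })
    return pd.DataFrame(rows).sort_values("pvalue").reset_index(drop=True)


# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "party": ["republican"] * 5 + ["democrat"] * 5,
    "issue-a": [1, 1, 1, 1, 0, 0, 0, 0, np.nan, 1],
    "issue-b": [1, 0, 1, 0, np.nan, 1, 0, 1, 0, 1],
})
summary = compare_all_issues(toy)
print(summary)
```

Keeping the group means next to each p-value makes the direction of a difference visible, which the p-value alone does not show.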

## End of Function Time

In [0]:
# Let's start looking into other columns now
# First let's refresh our memory on what columns are available
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# Find the mean of each party for the physician-fee-freeze column
print(rep['physician-fee-freeze'].mean())
print(dem['physician-fee-freeze'].mean())

0.9878787878787879
0.05405405405405406


In [0]:
# Run the t-test to see whether the gap between the parties is statistically significant
sample_tester(rep['physician-fee-freeze'], dem['physician-fee-freeze'])

1.994262314074344e-177

In [0]:
# Our pvalue is far below 0.01, so republicans support the 
# physician fee freeze bill significantly more than the democrats: nearly 99% of 
# republicans voted yes, compared with only about 5% of democrats.

In [0]:
# Let's look at el-salvador-aid now
print(rep['el-salvador-aid'].mean())
print(dem['el-salvador-aid'].mean())

0.9515151515151515
0.21568627450980393


In [0]:
# Check ttest of our sample
sample_tester(rep['el-salvador-aid'], dem['el-salvador-aid'])

5.600520111729011e-68

In [0]:
# Our pvalue shows that republicans support the 
# el-salvador-aid bill significantly more than the democrats, with 95% of 
# republicans supporting it and only about 22% of democrats supporting it.

In [0]:
# Look into religious groups
print(rep['religious-groups'].mean())
print(dem['religious-groups'].mean())

0.8975903614457831
0.47674418604651164


In [0]:
# Now check our ttest
sample_tester(rep['religious-groups'], dem['religious-groups'])

2.3936722520597287e-20

In [0]:
# Our pvalue shows that republicans support the 
# religious groups bill more than the democrats, with nearly 90% of 
# republicans supporting it and nearly 48% of democrats supporting it.

In [0]:
# Find mean of both parties for anti-satellite ban
print(rep['anti-satellite-ban'].mean())
print(dem['anti-satellite-ban'].mean())

0.24074074074074073
0.7722007722007722


In [0]:
# Check our ttest
sample_tester(rep['anti-satellite-ban'], dem['anti-satellite-ban'])

8.521033017443867e-31

In [0]:
# Our pvalue shows that democrats support the 
# anti satellite ban bill more than the republicans, with 77% of 
# democrats supporting it and 24% of republicans supporting it.

In [0]:
# Find means for aid-to-contras bill
print(rep['aid-to-contras'].mean())
print(dem['aid-to-contras'].mean())

0.15286624203821655
0.8288973384030418


In [0]:
# Check ttest
sample_tester(rep['aid-to-contras'], dem['aid-to-contras'])

2.82471841372357e-54

In [0]:
# Our pvalue above shows that democrats support the 
# aid-to-contras bill significantly more than the republicans, with nearly 83% of 
# democrats supporting it and only 15% of republicans supporting it.

In [0]:
# Find means for mx-missile bill
print(rep['mx-missile'].mean())
print(dem['mx-missile'].mean())

0.11515151515151516
0.7580645161290323


In [0]:
# Check our ttest
sample_tester(rep['mx-missile'], dem['mx-missile'])

5.03079265310811e-47

In [0]:
# Our pvalue shows that democrats support the 
# mx missile bill significantly more than the republicans, with nearly 76% of 
# democrats supporting it and only 12% of republicans supporting it.

In [0]:
# Find means for immigration column
print(rep['immigration'].mean())
print(dem['immigration'].mean())

0.5575757575757576
0.4714828897338403


In [0]:
sample_tester(rep['immigration'], dem['immigration'])

0.08330248490425066

In [0]:
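# With p of about 0.083 the difference is not significant at the 0.01 or 0.05 level,
# so support for the immigration bill looks similar across parties (56% vs. 47%).
# Note that p is still below 0.1, so this issue does not meet goal 4's p > 0.1 cut-off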
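
The assignment's goals use two cut-offs, p < 0.01 and p > 0.1, and the immigration p-value of about 0.083 falls between them. A small helper can make that three-way split explicit; the name `describe_p_value` and the label wording are illustrative assumptions.

```python
def describe_p_value(p_value, strong=0.01, weak=0.1):
    """Label a p-value using the assignment's two cut-offs."""
    if p_value < strong:
        return f"significant difference (p < {strong})"
    if p_value > weak:
        return f"no clear difference (p > {weak})"
    return f"inconclusive ({strong} <= p <= {weak})"


# p-values copied from the budget and immigration tests above
print(describe_p_value(2.0703402795404463e-77))
print(describe_p_value(0.08330248490425066))
```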

In [0]:
# Find means for synfuels column
print(rep['synfuels'].mean())
print(dem['synfuels'].mean())

0.1320754716981132
0.5058823529411764


In [0]:
# Check our ttest
sample_tester(rep['synfuels'], dem['synfuels'])

1.5759322301054064e-15

In [0]:
# Our pvalue shows that democrats support the 
# synfuels bill more than the republicans, although only about half of democrats 
# supported it, compared with 13% of republicans.

In [0]:
# Now education
print(rep['education'].mean())
print(dem['education'].mean())

0.8709677419354839
0.14457831325301204


In [0]:
# ttest
sample_tester(rep['education'], dem['education'])

1.8834203990450192e-64

In [0]:
# Our pvalue shows that republicans support the 
# education bill significantly more than the democrats, with 87% of 
# republicans supporting it and only 14% of democrats supporting it.

In [0]:
# Find means for right-to-sue 
print(rep['right-to-sue'].mean())
print(dem['right-to-sue'].mean())

0.8607594936708861
0.2896825396825397


In [0]:
# Now ttest
sample_tester(rep['right-to-sue'], dem['right-to-sue'])

1.2278581709672758e-34

In [0]:
# Our pvalue shows that republicans support the 
# right to sue bill more than the democrats, with 86% of 
# republicans supporting it and about 29% of democrats supporting it.

In [0]:
# Find means for crime column
print(rep['crime'].mean())
print(dem['crime'].mean())

0.9813664596273292
0.35019455252918286


In [0]:
# ttest
sample_tester(rep['crime'], dem['crime'])

9.952342705606092e-47

In [0]:
# Our pvalue shows that republicans support the 
# crime bill more than the democrats, with an overwhelming 98% of 
# republicans supporting it and 35% of democrats supporting it.

In [0]:
# Find means for duty-free bill
print(rep['duty-free'].mean())
print(dem['duty-free'].mean())

0.08974358974358974
0.6374501992031872


In [0]:
# Check ttest
sample_tester(rep['duty-free'], dem['duty-free'])

5.997697174347365e-32

In [0]:
# Our pvalue shows that democrats support the 
# duty free bill more than the republicans, with a majority 64% of 
# democrats supporting it and only about 9% of republicans supporting it.

In [0]:
# find means for south-africa
print(rep['south-africa'].mean())
print(dem['south-africa'].mean())

0.6575342465753424
0.9351351351351351


In [0]:
# ttest
sample_tester(rep['south-africa'], dem['south-africa'])

3.652674361672226e-11

In [0]:
# Our pvalue shows that democrats support the 
# south africa bill more than the republicans, with nearly 94% of 
# democrats supporting it but also a majority 66% of republicans supporting it.