

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
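To make the two-sample idea concrete, here is a minimal sketch that computes the pooled-variance t-statistic by hand and compares it with `ttest_ind`. The helper name `pooled_t_statistic` and the toy vote lists are illustrative assumptions, not part of the assignment data; the sketch assumes equal variances, which is the `ttest_ind` default.

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(sample_a, sample_b):
    """Hypothetical helper: two-sample t-test with pooled variance, skipping NaNs."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a = a[~np.isnan(a)]  # drop missing votes in group A
    b = b[~np.isnan(b)]  # drop missing votes in group B
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
    return t_stat, p_value

# Toy 0/1 votes with one missing value per group (made up for illustration).
group_a = [1, 1, 0, 1, np.nan, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, np.nan, 0, 1]

t_manual, p_manual = pooled_t_statistic(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, nan_policy='omit')
print(t_manual, p_manual)
print(float(t_scipy), float(p_scipy))
```

The two printed lines should agree up to floating-point error, which shows that the test compares the difference in group means against the pooled spread of both groups.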

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel
import seaborn as sns


In [2]:
columns = [
    'party', 'hcapped-infants',
    'water-project', 'budget-resolution',
    'phys-fee-freeze', 'el-salvador-aid',
    'relig-groups', 'anti-satellite',
    'aid-to-nic', 'mx-missile',
    'immigration','synfuels-cutback',
    'edu-spending', 'superfund-sue',
    'crime', 'duty-exports',
    'export-africa'
]

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', names=columns)
df.head()

Unnamed: 0,party,hcapped-infants,water-project,budget-resolution,phys-fee-freeze,el-salvador-aid,relig-groups,anti-satellite,aid-to-nic,mx-missile,immigration,synfuels-cutback,edu-spending,superfund-sue,crime,duty-exports,export-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
# Encode votes as 0/1 and mark missing votes ('?') as NaN
df = df.replace({'n': 0, 'y': 1, '?': np.nan})

In [4]:
df.head()

Unnamed: 0,party,hcapped-infants,water-project,budget-resolution,phys-fee-freeze,el-salvador-aid,relig-groups,anti-satellite,aid-to-nic,mx-missile,immigration,synfuels-cutback,edu-spending,superfund-sue,crime,duty-exports,export-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
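Because the missing votes are scattered across columns, dropping every row that contains a NaN would throw away far more data than dropping NaNs one issue at a time. The sketch below illustrates the difference on a small made-up frame; the column names `issue-a` and `issue-b` are hypothetical, and `Series.map` is used here as one possible encoding approach rather than the method the notebook uses.

```python
import pandas as pd

# Hypothetical toy data in the same 'y' / 'n' / '?' encoding as the voting file.
toy = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'issue-a': ['n', 'y', '?', 'n', 'y'],
    'issue-b': ['y', '?', 'y', 'y', 'n'],
})
issue_cols = ['issue-a', 'issue-b']

# Series.map sends any value not in the dict ('?') to NaN, giving float columns.
toy[issue_cols] = toy[issue_cols].apply(lambda col: col.map({'n': 0.0, 'y': 1.0}))

missing_per_issue = toy[issue_cols].isna().sum()   # NaN count per issue
usable_per_issue = toy[issue_cols].notna().sum()   # votes kept when dropping per issue
complete_rows = toy.dropna()                       # rows kept when dropping whole rows

print(missing_per_issue)
print(usable_per_issue)
print(len(complete_rows), "of", len(toy), "rows have no missing votes")
```

Dropping per issue keeps four votes for each column here, while dropping whole rows keeps only three members. Passing `nan_policy='omit'` to the t-test functions corresponds to the per-issue approach.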


In [5]:
# Split the data into one frame per party
rep = df[df['party'] == 'republican']
dem = df[df['party'] == 'democrat']
rep

Unnamed: 0,party,hcapped-infants,water-project,budget-resolution,phys-fee-freeze,el-salvador-aid,relig-groups,anti-satellite,aid-to-nic,mx-missile,immigration,synfuels-cutback,edu-spending,superfund-sue,crime,duty-exports,export-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0
11,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,,
14,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,0.0,
15,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,,0.0,
18,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,0.0,0.0
28,republican,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0


## Handicapped Infants (Democrats Support More)

In [6]:
print("Democrat Support: ", dem['hcapped-infants'].mean())
print("Republican Support: ", rep['hcapped-infants'].mean())
ttest_ind(rep['hcapped-infants'], dem['hcapped-infants'], nan_policy='omit')

Democrat Support:  0.6046511627906976
Republican Support:  0.18787878787878787


Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)

1. Null hypothesis: Democrats and Republicans support this issue at the same rate (the group means are equal).

2. Alternative hypothesis: The two parties support this issue at different rates.

3. p-value = 1.61e-18 < 0.01 --> Reject the null hypothesis

4. t-statistic = -9.21

5. p-value = 1.61e-18

Interpretation: With a t-statistic of -9.21 and a p-value of 1.6e-18, we reject the null hypothesis that party has no effect on voting for this issue, in favor of the alternative. The t-statistic is negative because the Republican sample was passed first, so Democrats (mean support 0.60) backed this issue more than Republicans (0.19).
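The goal is phrased directionally ("support more than"), while `ttest_ind` reports a two-sided p-value by default. One common way to get a one-sided p-value is to halve the two-sided value when the t-statistic has the expected sign. The sketch below shows this under stated assumptions: the helper `one_sided_p_greater` and the toy vote lists are hypothetical and not taken from the notebook.

```python
import numpy as np
from scipy.stats import ttest_ind

def one_sided_p_greater(sample_a, sample_b):
    """Hypothetical helper: p-value for H1 'mean of A > mean of B', skipping NaNs."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    t_stat, p_two_sided = ttest_ind(a, b)
    # Halve the two-sided p-value when the sign matches the alternative;
    # otherwise the one-sided p-value is the complement.
    if t_stat > 0:
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    return float(t_stat), float(p_one_sided)

# Made-up 0/1 votes, one missing value per group.
dem_votes = [1, 1, 1, 0, 1, 1, np.nan, 1]
rep_votes = [0, 0, 1, 0, np.nan, 0, 0, 0]

t_dr, p_dr = one_sided_p_greater(dem_votes, rep_votes)
t_rd, p_rd = one_sided_p_greater(rep_votes, dem_votes)
print("Democrats > Republicans:", t_dr, p_dr)
print("Republicans > Democrats:", t_rd, p_rd)
```

Swapping the argument order flips the sign of the t-statistic, and the two one-sided p-values sum to 1, so the order of the samples determines which direction is being tested.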

## Physician Fee Freeze (Republicans Support More)

In [7]:
print("Democrat Support: ", dem['phys-fee-freeze'].mean())
print("Republican Support: ", rep['phys-fee-freeze'].mean())
ttest_ind(rep['phys-fee-freeze'], dem['phys-fee-freeze'], nan_policy='omit')

Democrat Support:  0.05405405405405406
Republican Support:  0.9878787878787879


Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

1. Null hypothesis: Democrats and Republicans support this issue at the same rate (the group means are equal).

2. Alternative hypothesis: The two parties support this issue at different rates.

3. p-value = 1.99e-177 < 0.01 --> Reject the null hypothesis

4. t-statistic = 49.37

5. p-value = 1.99e-177

Interpretation: With a t-statistic of 49.37 and a p-value of 1.99e-177, we reject the null hypothesis that party has no effect on voting for this issue, in favor of the alternative. The t-statistic is positive with the Republican sample passed first, so Republicans (mean support 0.99) backed this issue more than Democrats (0.05).

## Water Project Cost Sharing (Little Party Difference)

In [8]:
print("Democrat Support: ", dem['water-project'].mean())
print("Republican Support: ", rep['water-project'].mean())
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

Democrat Support:  0.502092050209205
Republican Support:  0.5067567567567568


Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

1. Null hypothesis: Democrats and Republicans support this issue at the same rate (the group means are equal).

2. Alternative hypothesis: The two parties support this issue at different rates.

3. p-value = 0.929 > 0.1 --> Fail to reject the null hypothesis

4. t-statistic = 0.089

5. p-value = 0.929

Interpretation: With a t-statistic of 0.089 and a p-value of 0.929, we fail to reject the null hypothesis that party has no effect on voting for this issue. The two support rates are nearly identical (0.50 for Democrats, 0.51 for Republicans).

In [0]:
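def two_sample_t(sample1, sample2):
  """Two-sample t-test between two groups of votes, ignoring NaNs."""
  return ttest_ind(sample1, sample2, nan_policy='omit')

def one_sample_t(sample, null_mean):
  """One-sample t-test of a group's votes against a hypothesized mean, ignoring NaNs."""
  return ttest_1samp(sample, null_mean, nan_policy='omit')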
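As a possible next step for the refactoring stretch goal, the sketch below loops over every issue column and labels each one using the assignment's thresholds (p < 0.01 and p > 0.1). The function `summarize_issues`, the verdict labels, and the toy frame are all hypothetical. Note that Democrats are passed as the first sample here, so a positive t-statistic means Democrats support the issue more.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def summarize_issues(df, issues, party_col='party'):
    """Hypothetical helper: run a two-sample t-test per issue and label the result."""
    rows = []
    for issue in issues:
        # Drop missing votes per issue rather than dropping whole rows.
        dem_votes = df.loc[df[party_col] == 'democrat', issue].dropna()
        rep_votes = df.loc[df[party_col] == 'republican', issue].dropna()
        t_stat, p_value = ttest_ind(dem_votes, rep_votes)
        if p_value < 0.01 and t_stat > 0:
            verdict = 'democrats support more'
        elif p_value < 0.01 and t_stat < 0:
            verdict = 'republicans support more'
        elif p_value > 0.1:
            verdict = 'little difference'
        else:
            verdict = 'inconclusive'
        rows.append({
            'issue': issue,
            'dem_mean': dem_votes.mean(),
            'rep_mean': rep_votes.mean(),
            't_stat': float(t_stat),
            'p_value': float(p_value),
            'verdict': verdict,
        })
    return pd.DataFrame(rows).set_index('issue')

# Made-up votes: 20 members per party, with one missing vote on each issue.
toy = pd.DataFrame({
    'party': ['democrat'] * 20 + ['republican'] * 20,
    'issue-x': [np.nan] + [1.0] * 17 + [0.0] * 2 + [1.0] * 2 + [0.0] * 18,
    'issue-y': [1.0] * 10 + [0.0] * 10 + [1.0] * 10 + [0.0] * 9 + [np.nan],
})

summary = summarize_issues(toy, ['issue-x', 'issue-y'])
print(summary)
```

With the real data, the same call could be made as `summarize_issues(df, columns[1:])`, assuming `df` has already been encoded as 0/1 with NaN for missing votes.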