<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
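As a minimal sketch of what a two-sample test looks like in code, the example below runs `scipy.stats.ttest_ind` on two small made-up vote arrays. The arrays `group_a` and `group_b` are hypothetical and are not taken from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes (1 = yes, 0 = no) for two hypothetical groups.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# Two-sample test: compares the mean of group_a with the mean of group_b.
result = ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)
```

A positive statistic here means the first group's mean is higher than the second group's.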

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [1]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-09-16 20:38:57--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-09-16 20:38:58 (610 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [145]:
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load Data
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [146]:
# Our TL advised encoding null as 0, no as -1, and yes as 1.
# This suggestion solved my dilemma about what to do with null values.
# However, I would still like to find out whether null values on certain issues are correlated,
# so I will keep some of my earlier null-count work to show which columns I want to work with.
# Note: after this replacement the frame holds no NaN values, so nan_policy='omit' drops nothing
# and abstentions enter the t-tests as 0.
df = df.replace({'?':0, 'n':-1, 'y':1})

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,-1,1,-1,1,1,1,-1,-1,-1,1,0,1,1,1,-1,1
1,republican,-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,0
2,democrat,0,1,1,0,1,1,-1,-1,-1,-1,1,-1,1,1,-1,-1
3,democrat,-1,1,1,-1,0,1,-1,-1,-1,-1,1,-1,1,-1,-1,1
4,democrat,1,1,1,-1,1,1,-1,-1,-1,-1,1,0,1,1,1,1
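An alternative way to handle the missing values, shown here only as a sketch, is to read `?` as NaN and encode yes/no as 1/0, so that missing votes are dropped from each test instead of being counted as 0. The three rows and the column names `issue_a` and `issue_b` below are made up for illustration.

```python
import io
import pandas as pd

# Three made-up rows in the same layout as house-votes-84.data (two issues only).
raw = io.StringIO("republican,n,y\ndemocrat,?,y\ndemocrat,y,n\n")
votes = pd.read_csv(raw, header=None,
                    names=['party', 'issue_a', 'issue_b'],
                    na_values='?')

# Map y/n to 1/0; '?' was already read as NaN and stays NaN.
issue_cols = ['issue_a', 'issue_b']
votes[issue_cols] = votes[issue_cols].apply(lambda col: col.map({'y': 1, 'n': 0}))

print(votes)
print(votes[issue_cols].isnull().sum())
```

With this encoding a column mean is the share of yes votes among members who actually voted, and `nan_policy='omit'` has real NaN values to skip.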


In [0]:
dem = df[df['party'] == 'democrat']
rep = df[df['party'] == 'republican']

In [0]:
df.isnull().sum()

In [31]:
(dem.isnull().sum())

party                    0
handicapped-infants      9
water-project           28
budget                   7
physician-fee-freeze     8
el-salvador-aid         12
religious-groups         9
anti-satellite-ban       8
aid-to-contras           4
mx-missile              19
immigration              4
synfuels                12
education               18
right-to-sue            15
crime                   10
duty-free               16
south-africa            82
dtype: int64

In [32]:
(rep.isnull().sum())

party                    0
handicapped-infants      3
water-project           20
budget                   4
physician-fee-freeze     3
el-salvador-aid          3
religious-groups         2
anti-satellite-ban       6
aid-to-contras          11
mx-missile               3
immigration              3
synfuels                 9
education               13
right-to-sue            10
crime                    7
duty-free               12
south-africa            22
dtype: int64

In [114]:
print(df.shape)
print(dem.shape)
print(rep.shape)

(435, 17)
(267, 17)
(168, 17)


In [148]:
dem.shape[0]/rep.shape[0]

1.5892857142857142

In [149]:
dem.shape[0]/df.shape[0]

0.6137931034482759

In [36]:
# I'm trying to figure out what to do about null values here.
# Democrats have a greater number of NaN values on almost every issue.
# I will try to stick to the issues where the difference in NaN counts is closest to 0,
# and test what the difference might be for the ones that are furthest from 0.
# There are about 1.59 democrats for every republican (267 vs. 168),
# so democrats are expected to have more missing votes in absolute terms.
# "aid-to-contras" is the only issue where republicans have more NaN values (-7),
# so we will look at it a little more closely as well.
(dem.isnull().sum())-(rep.isnull().sum())

party                    0
handicapped-infants      6
water-project            8
budget                   3
physician-fee-freeze     5
el-salvador-aid          9
religious-groups         7
anti-satellite-ban       2
aid-to-contras          -7
mx-missile              16
immigration              1
synfuels                 3
education                5
right-to-sue             5
crime                    3
duty-free                4
south-africa            60
dtype: int64
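Because the two parties differ in size, raw NaN counts are hard to compare directly (for example 82 of 267 democrats versus 22 of 168 republicans on south-africa). A sketch of comparing missing-vote rates instead, on a made-up frame called `toy`:

```python
import numpy as np
import pandas as pd

# Made-up frame: NaN marks a missing vote.
toy = pd.DataFrame({
    'party': ['democrat'] * 4 + ['republican'] * 2,
    'issue_a': [1, np.nan, 0, 1, np.nan, 1],
})

# Fraction of missing votes within each party (a rate, not a count).
missing_rate = toy.drop(columns='party').isnull().groupby(toy['party']).mean()
print(missing_rate)
```

In this toy data both parties have one missing vote each, yet the rates differ (0.25 vs. 0.5), which a difference of raw counts would hide.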

In [0]:
def test_t_for_dem_to_rep(dem_data, rep_data, issue):
    """Print the two-sample t-test for `issue`, democrats first."""
    test = ttest_ind(dem_data[issue], rep_data[issue], nan_policy='omit')
    print(test)

def test_t_for_rep_to_dem(rep_data, dem_data, issue):
    """Print and return the two-sample t-test for `issue`, republicans first."""
    test = ttest_ind(rep_data[issue], dem_data[issue], nan_policy='omit')
    print(test)
    return test

In [123]:
# This is what we went over in class.
# I wanted to test whether the functions work correctly.
# I can also see the effect of putting one party or the other first:
# the t-statistic is negative when republicans are passed first,
# meaning the first group's mean (republicans) is lower than the second group's (democrats).
test_t_for_dem_to_rep(dem, rep, 'budget')
test12 = test_t_for_rep_to_dem(rep, dem, 'budget');

Ttest_indResult(statistic=22.8216930438848, pvalue=2.8721153143958906e-76)
Ttest_indResult(statistic=-22.8216930438848, pvalue=2.8721153143958906e-76)
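To make the sign flip concrete, here is a sketch that computes the pooled-variance t-statistic by hand and checks it against `ttest_ind` in both argument orders. The arrays `a` and `b` and the helper `pooled_t` are hypothetical names introduced for this illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes using the notebook's encoding (1 = yes, -1 = no, 0 = null).
a = np.array([1, 1, -1, 1, 0, 1])
b = np.array([-1, -1, 1, -1, 0, -1])

def pooled_t(x, y):
    """Student's t-statistic with pooled variance (ttest_ind's default, equal_var=True)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / nx + 1 / ny))

forward = ttest_ind(a, b)    # a first
backward = ttest_ind(b, a)   # b first

print(pooled_t(a, b), forward.statistic, backward.statistic)
print(forward.pvalue, backward.pvalue)
```

Swapping the arguments only changes the sign of the statistic; the two-sided p-value stays the same.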


2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

---

## Two Sample T-test

1) Null Hypothesis (boring hypothesis) default state

There is no difference between democrats' and republicans' mean vote on 'anti-satellite-ban'.

$\bar{x}_1 = \bar{x}_2$

2) Alternative Hypothesis (interesting hypothesis)

The mean voting on 'anti-satellite-ban' is different between democrats and republicans

$\bar{x}_1 \neq \bar{x}_2$

3) Confidence Level (the probability of not rejecting the null hypothesis when it is true)

99%, i.e. a significance level of $\alpha = 0.01$

In [150]:
test_t_for_dem_to_rep(dem, rep, 'anti-satellite-ban')

Ttest_indResult(statistic=12.448556296273836, pvalue=1.2736295885307941e-30)


4) t-statistic: 12.449

5) p-value: 1.274e-30

Conclusion:

With a t-statistic of 12.449 and a p-value of 1.274e-30, far below the 0.01 significance level, we reject the null hypothesis that Democrats and Republicans vote at a similar rate on 'anti-satellite-ban'. Because democrats were passed first and the t-statistic is positive, democrats supported this issue more than republicans.
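The goals ask for a direction ("support more than"), which can also be tested with a one-sided alternative. The sketch below assumes SciPy 1.6 or newer, where `ttest_ind` accepts an `alternative` argument, and uses made-up arrays `dem_votes` and `rep_votes` rather than the real data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes (1 = yes, 0 = no); not the congressional data.
dem_votes = np.array([1, 1, 1, 0, 1, 1, 0, 1])
rep_votes = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# Two-sided: are the means different?
two_sided = ttest_ind(dem_votes, rep_votes)

# One-sided: is the first group's mean greater than the second's?
one_sided = ttest_ind(dem_votes, rep_votes, alternative='greater')

print(two_sided.pvalue, one_sided.pvalue)
```

When the statistic is positive, the one-sided p-value is half the two-sided one, so a two-sided result with p < 0.01 in the expected direction also passes the one-sided test.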

---

---

## Two Sample T-test

1) Null Hypothesis (boring hypothesis) default state

There is no difference between democrats' and republicans' mean vote on "crime".

$\bar{x}_1 = \bar{x}_2$

2) Alternative Hypothesis (interesting hypothesis)

The mean voting on "crime" is different between democrats and republicans

$\bar{x}_1 \neq \bar{x}_2$

3) Confidence Level (the probability of not rejecting the null hypothesis when it is true)

99%, i.e. a significance level of $\alpha = 0.01$

In [152]:
test_t_for_dem_to_rep(dem, rep, "crime")

Ttest_indResult(statistic=-16.09453857734642, pvalue=5.09559045451787e-46)


4) t-statistic: -16.095

5) p-value: 5.096e-46

Conclusion:

With a t-statistic of -16.095 and a p-value of 5.096e-46, far below the 0.01 significance level, we reject the null hypothesis that Democrats and Republicans vote at a similar rate on "crime". Because democrats were passed first and the t-statistic is negative, republicans supported this issue more than democrats.

---

---

## Two Sample T-test

1) Null Hypothesis (boring hypothesis) default state

There is no difference between democrats' and republicans' mean vote on immigration.

$\bar{x}_1 = \bar{x}_2$

2) Alternative Hypothesis (interesting hypothesis)

The mean voting on Immigration is different between democrats and republicans

$\bar{x}_1 \neq \bar{x}_2$

3) Confidence Level (the probability of not rejecting the null hypothesis when it is true)

99%, i.e. a significance level of $\alpha = 0.01$

In [124]:
test_t_for_dem_to_rep(dem, rep, 'immigration')

Ttest_indResult(statistic=-1.735016635686661, pvalue=0.08344939720307322)


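4) t-statistic: -1.735

5) p-value: 0.083

Conclusion:

With a t-statistic of -1.735 and a p-value of 0.083, above the 0.01 significance level, we fail to reject the null hypothesis that Democrats and Republicans vote at a similar rate on immigration. This does not prove that the parties vote alike; it only means the data show no significant difference at this level. Since 0.083 is below 0.1, immigration does not meet the p > 0.1 target of goal 4; the 'water-project' test further down (p = 0.93) does.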

---
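The three write-ups above follow the same steps, so they can be refactored into one loop (stretch goal 1). This is a sketch under assumptions: the helper `classify_issues`, its thresholds, and the toy frames are hypothetical, and the labels simply mirror goals 2 to 4.

```python
import pandas as pd
from scipy.stats import ttest_ind

def classify_issues(dem_data, rep_data, issues, strong=0.01, weak=0.1):
    """Run a two-sample t-test per issue (democrats first) and label the result."""
    rows = []
    for issue in issues:
        res = ttest_ind(dem_data[issue], rep_data[issue], nan_policy='omit')
        stat, p = float(res.statistic), float(res.pvalue)
        if p < strong:
            label = 'democrats support more' if stat > 0 else 'republicans support more'
        elif p > weak:
            label = 'no clear difference'
        else:
            label = 'inconclusive'
        rows.append({'issue': issue, 't': stat, 'p': p, 'label': label})
    return pd.DataFrame(rows).set_index('issue')

# Made-up data (1 = yes, 0 = no), ten members per party.
toy_dem = pd.DataFrame({'issue_x': [1] * 9 + [0], 'issue_y': [0] * 9 + [1], 'issue_z': [1, 0] * 5})
toy_rep = pd.DataFrame({'issue_x': [0] * 9 + [1], 'issue_y': [1] * 9 + [0], 'issue_z': [1, 0] * 5})

summary = classify_issues(toy_dem, toy_rep, ['issue_x', 'issue_y', 'issue_z'])
print(summary)
```

On the real data, the same function could be called with `dem`, `rep`, and the list of issue columns from `df`.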

In [153]:
#The other value that stood out to me from looking at null values is the one from:
# aid-to-contras
#So I also wanted to test this column as well

test_t_for_dem_to_rep(dem, rep, 'aid-to-contras')

Ttest_indResult(statistic=17.791848422270405, pvalue=1.4948014750035628e-53)


In [0]:
# Democrats and republicans disagree on this issue as well.
# The t-statistic is large and positive and the p-value is very small, so democrats supported it more.
# The gap is larger than for anti-satellite-ban and crime, though smaller than for budget.

In [158]:
test_t_for_dem_to_rep(dem, rep, 'water-project')

Ttest_indResult(statistic=-0.08764559884421932, pvalue=0.9301988772663677)


In [0]:
# With p = 0.93 this test shows no significant difference between democrats and republicans,
# so 'water-project' meets the p > 0.1 target of goal 4.
# I printed this one out of curiosity.