

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
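
As a minimal sketch of what a 2-sample t-test looks like in code, the example below runs `ttest_ind` on two small arrays of made-up 0/1 votes. The arrays `group_a` and `group_b` are hypothetical toy data, not values from the congressional data set.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes (1 = yes, 0 = no); not taken from the real data set.
group_a = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# Two-sample (independent) t-test: compares the means of the two groups.
t_stat, p_value = ttest_ind(group_a, group_b)

print(f"mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The sign of the t-statistic follows the argument order: it is positive when the first group has the higher mean.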

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
from scipy.stats import ttest_ind
import numpy as np  
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

In [0]:
#get the raw data
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-30 01:58:58--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-30 01:58:58 (595 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
#make it into a dataframe 

column_headers = ['party','handicapped-infants', 'water-project', 'budget', 'physician-fee-freeze',
               'el-salvador-aid','religious-groups', 'anti-satellite-test-ban', 'aid-to-contras',
               'mx-missile', 'immigration', 'synfuels', 'education', 'right-to-sue', 'crime',
               'duty-free', 'south-africa']



In [128]:
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=column_headers,
                 na_values='?')
df.shape

(435, 17)

In [0]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
#recode votes as numeric
df = df.replace({'y':1, 'n':0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
#how many from each party
df['party'].value_counts().sort_index()

democrat      267
republican    168
Name: party, dtype: int64

In [0]:
#how did republicans vote? 
rep = df[df['party'] == 'republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
#how did democrats vote?
dem = df[df['party'] == 'democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
# fraction of Republicans who voted "yes" (1) on the handicapped-infants bill,
# with NaNs still counted in the denominator
rep['handicapped-infants'].sum()/len(rep)

# len() counts the NaN rows too, so this understates support


0.18452380952380953

In [0]:
# find the NaN values in this column
col = rep['handicapped-infants']
np.isnan(col)  # True where the vote is missing

0      False
1      False
7      False
8      False
10     False
       ...  
427    False
430    False
432    False
433    False
434    False
Name: handicapped-infants, Length: 168, dtype: bool

In [0]:
hi_no_nans = col[~np.isnan(col)]  # ~ inverts the boolean mask, keeping the non-missing votes
hi_no_nans
# the length drops from 168 to 165: the 3 missing values are gone

0      0.0
1      0.0
7      0.0
8      0.0
10     0.0
      ... 
427    0.0
430    0.0
432    0.0
433    0.0
434    0.0
Name: handicapped-infants, Length: 165, dtype: float64

In [0]:
hi_no_nans.sum()/len(hi_no_nans)  # about 19% of Republicans voted in favor of the handicapped-infants bill
# slightly higher than before, because the NaNs no longer inflate the denominator

0.18787878787878787

In [0]:
# or use the pandas built-in .mean(), which skips NaNs by default
rep['handicapped-infants'].mean()  # about 19% of Republicans voted in favor of the handicapped-infants bill

0.18787878787878787

Masking NaNs with NumPy is the long way; the pandas built-in `.mean()` gives the same result.
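
The three ways of computing the share of "yes" votes can be compared side by side on a tiny example. This is only a sketch: the Series `votes` is hypothetical toy data with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy votes: two "yes", three "no", one missing.
votes = pd.Series([1.0, 0.0, np.nan, 1.0, 0.0, 0.0])

naive = votes.sum() / len(votes)          # len() counts the NaN row: 2 / 6
masked = votes[~np.isnan(votes)].mean()   # drop NaN first: 2 / 5
builtin = votes.mean()                    # pandas skips NaN by default: 2 / 5

print(naive, masked, builtin)
```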

In [0]:
# t-test setup

# 1) Null hypothesis: there is no difference in mean support between Democrats and Republicans (support is equal)
# 2) Alternative hypothesis: mean support differs between the two parties
# 3) Confidence level: 99% (significance level alpha = 0.01)


In [0]:
#What is the mean support of Republicans?
rep['water-project'].mean()

0.5067567567567568

In [0]:
#What is the mean support of Democrats?
dem['water-project'].mean()

0.502092050209205

In [0]:
#compare with a t-test:
ttest_ind(rep['water-project'], dem['water-project'])

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# account for NaNs
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')
# a high p-value means a difference this small is easily explained by chance,
# so it gives no evidence of a real difference between the parties


Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
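
The effect of `nan_policy` can be checked on a small example. In this sketch the arrays `a` and `b` are hypothetical toy data; it shows that the default propagates NaN into the result, while `nan_policy='omit'` matches dropping the NaNs by hand.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes with one missing value in each group.
a = np.array([1.0, 1.0, 0.0, 1.0, np.nan, 1.0])
b = np.array([0.0, np.nan, 0.0, 1.0, 0.0, 0.0])

default = ttest_ind(a, b)                        # NaN in, NaN out
omitted = ttest_ind(a, b, nan_policy='omit')     # NaNs ignored by scipy
manual = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])  # NaNs dropped by hand

print(default.pvalue, float(omitted.pvalue), manual.pvalue)
```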

In [0]:
# drop the NaN values from this column for each party
col = rep['water-project']
rep_wp_no_nans = col[~np.isnan(col)]
col = dem['water-project']
dem_wp_no_nans = col[~np.isnan(col)]


In [120]:
#My sample size for the two samples
print(len(rep_wp_no_nans))
print(len(dem_wp_no_nans))

148
239


In [0]:
# A conservative rule of thumb for a 2-sample test takes the degrees of freedom
# from the smaller sample: 148 < 239 ==> 148 - 1 = 147
# ttest_ind with its default equal_var=True pools the variances instead,
# so the test above uses 148 + 239 - 2 = 385 degrees of freedom
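
The degrees of freedom for the water-project comparison can be worked out under different conventions. The sketch below uses the sample sizes and means reported in this notebook; the helper names `pooled_df`, `welch_df`, and `binary_sample_var` are made up for illustration. It compares the pooled-variance value with the Welch-Satterthwaite approximation, which always lies between the "smaller sample minus one" value and the pooled value.

```python
def pooled_df(n1, n2):
    """Degrees of freedom for the pooled-variance (equal variance) t-test."""
    return n1 + n2 - 2


def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximation for unequal variances."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


def binary_sample_var(p, n):
    """Sample variance (ddof=1) of n zero/one votes with a share p of ones."""
    return p * (1 - p) * n / (n - 1)


# Sample sizes and mean support for 'water-project' from the cells above.
n_rep, n_dem = 148, 239
p_rep, p_dem = 0.5067567567567568, 0.502092050209205

df_conservative = min(n_rep, n_dem) - 1
df_pooled = pooled_df(n_rep, n_dem)
df_welch = welch_df(binary_sample_var(p_rep, n_rep), n_rep,
                    binary_sample_var(p_dem, n_dem), n_dem)

print(df_conservative, df_pooled, round(df_welch, 1))
```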


Conclusion: 
T-statistic: 0.089
P-value: 0.929

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Due to a p-value of 0.929, I fail to reject the null hypothesis that Republican and Democratic support for the water-project bill is equal.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

I never say that I "accept" the null hypothesis; I just say "I fail to reject" it.
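
The decision rule above can be written as a small helper. This is a sketch only: the function name `decide` is hypothetical, and the two p-values passed in are the ones reported in this notebook for the water-project and handicapped-infants bills.

```python
def decide(p_value, alpha=0.01):
    """Apply the decision rule: reject the null only when p < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.9291556823993485))      # water-project
print(decide(1.613440327937243e-18))   # handicapped-infants
```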

In [121]:
print(rep['handicapped-infants'].mean()) # about 19% of republicans voted in favor of handicapped infants bill
dem['handicapped-infants'].mean() # about 60% of democrats voted in favor of handicapped infants bill

0.18787878787878787


0.6046511627906976

In [147]:
ttest_ind(dem['handicapped-infants'], rep['handicapped-infants'])


Ttest_indResult(statistic=nan, pvalue=nan)

In [148]:
ttest_ind(dem['handicapped-infants'], rep['handicapped-infants'], nan_policy='omit')


Ttest_indResult(statistic=9.205264294809222, pvalue=1.613440327937243e-18)

Conclusion: T-statistic: 9.205 P-value: 1.61e-18

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 1.61e-18, I **reject** the null hypothesis that Republican and Democratic support for the handicapped-infants bill is equal.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.
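
Goals 2 and 3 ask which party supports an issue *more*, which is a one-sided question, while `ttest_ind` reports a two-sided p-value by default. A minimal sketch of the conversion follows, assuming the helper name `one_sided_p` (hypothetical) and relying on the symmetry of the t-distribution: when the t-statistic has the expected sign, the one-sided p-value is half the two-sided one.

```python
def one_sided_p(t_stat, p_two_sided):
    """One-sided p-value for the alternative 'first group mean is greater'.

    Relies on the symmetry of the t-distribution around zero.
    """
    if t_stat > 0:
        return p_two_sided / 2
    return 1 - p_two_sided / 2


# Values reported above for ttest_ind(dem, rep) on 'handicapped-infants'.
p_greater = one_sided_p(9.205264294809222, 1.613440327937243e-18)
print(p_greater < 0.01)
```

Because the Democrats were passed as the first argument, a small one-sided p-value here supports "Democrats support this bill more than Republicans".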


In [129]:
print(rep['budget'].mean()) # about 13% of republicans voted in favor of the budget bill
dem['budget'].mean() # about 89% of democrats voted in favor of the budget bill

0.13414634146341464


0.8884615384615384

In [149]:
ttest_ind(dem['budget'], rep['budget'], nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

Conclusion: T-statistic: 23.213 P-value: 2.07e-77

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 2.07e-77, I reject the null hypothesis that Republican and Democratic support for the budget bill is equal.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

In [134]:
print(rep['physician-fee-freeze'].mean()) # about 99% of republicans voted in favor of physician fee freeze bill
dem['physician-fee-freeze'].mean() # about 5% of democrats voted in favor of physician fee freeze bill

0.9878787878787879


0.05405405405405406

In [137]:
ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

T-statistic: 49.367 P-value: 1.99e-177

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 1.99e-177, I reject the null hypothesis that Republican and Democratic support for the physician fee freeze is equal.
A low p-value means a difference this large would be very unlikely if the two parties' true support were equal, so it likely reflects a real difference in the population.
A high p-value (> 0.01) means the observed difference is consistent with chance and gives no evidence of a real difference in the population.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

In [138]:
print(rep['el-salvador-aid'].mean()) # about 95% of republicans voted in favor of 'el-salvador-aid' bill
dem['el-salvador-aid'].mean() # about 22% of democrats voted in favor of 'el-salvador-aid' bill

0.9515151515151515


0.21568627450980393

In [139]:
ttest_ind(rep['el-salvador-aid'], dem['el-salvador-aid'], nan_policy='omit')

Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)

T-statistic: 21.137 P-value: 5.60e-68

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 5.60e-68, I reject the null hypothesis that Republican and Democratic support for the el-salvador-aid bill is equal. A low p-value means a difference this large would be very unlikely if the two parties' true support were equal, so it likely reflects a real difference in the population. A high p-value (> 0.01) means the observed difference is consistent with chance and gives no evidence of a real difference in the population.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.



In [140]:
print(rep['religious-groups'].mean()) # about 90% of republicans voted in favor of 'religious-groups' bill
dem['religious-groups'].mean() # about 48% of democrats voted in favor of 'religious-groups' bill

0.8975903614457831


0.47674418604651164

In [141]:
ttest_ind(rep['religious-groups'], dem['religious-groups'], nan_policy='omit')

Ttest_indResult(statistic=9.737575825219457, pvalue=2.3936722520597287e-20)

T-statistic: 9.738 P-value: 2.39e-20

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 2.39e-20, I reject the null hypothesis that Republican and Democratic support for the religious-groups bill is equal. A low p-value means a difference this large would be very unlikely if the two parties' true support were equal, so it likely reflects a real difference in the population. A high p-value (> 0.01) means the observed difference is consistent with chance and gives no evidence of a real difference in the population.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

In [142]:
print(rep['anti-satellite-test-ban'].mean()) # about 24% of republicans voted in favor of 'anti-satellite-test-ban' bill
dem['anti-satellite-test-ban'].mean() # about 77% of democrats voted in favor of 'anti-satellite-test-ban' bill

0.24074074074074073


0.7722007722007722

In [150]:
ttest_ind(dem['anti-satellite-test-ban'], rep['anti-satellite-test-ban'], nan_policy='omit')

Ttest_indResult(statistic=12.526187929077842, pvalue=8.521033017443867e-31)

T-statistic: 12.526 P-value: 8.52e-31

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 8.52e-31, I reject the null hypothesis that Republican and Democratic support for the anti-satellite-test-ban bill is equal. A low p-value means a difference this large would be very unlikely if the two parties' true support were equal, so it likely reflects a real difference in the population. A high p-value (> 0.01) means the observed difference is consistent with chance and gives no evidence of a real difference in the population.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

In [152]:
print(rep['aid-to-contras'].mean()) # about 15% of republicans voted in favor of 'aid-to-contras' bill
dem['aid-to-contras'].mean() # about 83% of democrats voted in favor of 'aid-to-contras' bill

0.15286624203821655


0.8288973384030418

In [153]:
ttest_ind(dem['aid-to-contras'], rep['aid-to-contras'], nan_policy='omit')

Ttest_indResult(statistic=18.052093200819733, pvalue=2.82471841372357e-54)

T-statistic: 18.052 P-value: 2.82e-54

I reject the null hypothesis if my p-value is < 0.01, that is, if it is less than (1 - confidence_level). The confidence level in this case is 99%.

Conclusion: Due to a p-value of 2.82e-54, I reject the null hypothesis that Republican and Democratic support for the aid-to-contras bill is equal. A low p-value means a difference this large would be very unlikely if the two parties' true support were equal, so it likely reflects a real difference in the population. A high p-value (> 0.01) means the observed difference is consistent with chance and gives no evidence of a real difference in the population.

Alternative hypothesis: support differs between the two parties. This says nothing about which party's support is higher; it only says the two are different.

In [154]:
print(rep['mx-missile'].mean()) # about 12% of republicans voted in favor of the mx-missile bill
dem['mx-missile'].mean() # about 76% of democrats voted in favor of the mx-missile bill

0.11515151515151516


0.7580645161290323

In [155]:
ttest_ind(dem['mx-missile'], rep['mx-missile'], nan_policy='omit')

Ttest_indResult(statistic=16.437503268542994, pvalue=5.03079265310811e-47)

In [0]:
print(rep['handicapped-infants'].mean()) # about 19% of republicans voted in favor of handicapped infants bill
print(dem['handicapped-infants'].mean()) # about 60% of democrats voted in favor of handicapped infants bill
ttest_ind(dem['handicapped-infants'], rep['handicapped-infants'], nan_policy='omit')

In [0]:
# mean Republican support for each issue (every column except 'party')
for issue in column_headers[1:]:
    print(issue, rep[issue].mean())

In [0]:
column_headers

['party',
 'handicapped-infants',
 'water-project',
 'budget',
 'physician-fee-freeze',
 'el-salvador-aid',
 'religious-groups',
 'anti-satellite-test-ban',
 'aid-to-contras',
 'mx-missile',
 'immigration',
 'synfuels',
 'education',
 'right-to-sue',
 'crime',
 'duty-free',
 'south-africa']

In [0]:
columns = ['handicapped-infants', 'water-project', 'budget', 'physician-fee-freeze',
               'el-salvador-aid','religious-groups', 'anti-satellite-test-ban', 'aid-to-contras',
               'mx-missile', 'immigration', 'synfuels', 'education', 'right-to-sue', 'crime',
               'duty-free', 'south-africa']
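def myfunc(answers):
  """Return the mean Republican support for each issue in answers."""
  # collect the means in a Series instead of computing and discarding them
  return pd.Series({x: rep[x].mean() for x in answers})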

  

In [0]:
myfunc(columns)

In [0]:
# mean support per party for every issue
df.groupby('party')[columns].mean()
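
For the first stretch goal, one possible refactor wraps the per-issue comparison in a single function. This is a sketch under stated assumptions: the name `compare_parties`, its parameters, and the `toy` DataFrame are all hypothetical, and the toy data only mimics the shape of the recoded voting data (party labels plus 0/1 columns with NaNs).

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(frame, issues, group_col="party",
                    a="democrat", b="republican"):
    """Run a 2-sample t-test of group a vs. group b for each issue."""
    rows = []
    for issue in issues:
        # Drop missing votes separately within each group.
        x = frame.loc[frame[group_col] == a, issue].dropna()
        y = frame.loc[frame[group_col] == b, issue].dropna()
        t_stat, p_value = ttest_ind(x, y)
        rows.append({"issue": issue, "mean_a": x.mean(),
                     "mean_b": y.mean(), "t": t_stat, "p": p_value})
    return pd.DataFrame(rows).set_index("issue")


# Hypothetical toy data, not the real voting records.
toy = pd.DataFrame({
    "party": ["democrat"] * 6 + ["republican"] * 6,
    "issue-1": [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    "issue-2": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})

result = compare_parties(toy, ["issue-1", "issue-2"])
print(result)
```

On the real data, a call such as `compare_parties(df, columns)` would give one row per issue, which could then be filtered on `p` and on the sign of `t` to answer the three goals.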