<a href="https://colab.research.google.com/github/MrT3313/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/module1-statistics-probability-and-inference/LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one per congressperson), each with a class (democrat or republican) and 16 binary attributes (a yes or no vote on each issue). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
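
To make the distinction concrete, here is a minimal sketch on made-up 0/1 votes (the lists `group_a` and `group_b` are illustrative and not taken from the dataset). `ttest_ind` compares the means of two independent groups, while `ttest_1samp` compares one group's mean against a fixed value.

```python
from scipy.stats import ttest_1samp, ttest_ind

# Made-up votes (1 = yes, 0 = no) for two hypothetical groups
group_a = [1, 1, 0, 1, 1, 0, 1, 1]
group_b = [0, 0, 1, 0, 0, 0, 1, 0]

# Two-sample test: is the mean of group_a different from the mean of group_b?
two_sample = ttest_ind(group_a, group_b)

# One-sample test: is the mean of group_a different from a fixed value of 0.5?
one_sample = ttest_1samp(group_a, popmean=0.5)

print(two_sample)
print(one_sample)
```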

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

# NOTEBOOK IMPORTS

In [0]:
# IMPORTS
# -

import pandas as pd
import seaborn as sns
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

# DATA IMPORTS

In [0]:
# DATA URLs
# -  

voting_records_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
# !curl https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

In [3]:
# CREATE DATA FRAME
# -

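# Without names= or header=None, pandas uses the first data row as the header,
# which is why the shape below is (434, 17) instead of (435, 17).
voting_data = pd.read_csv(voting_records_url)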

print(voting_data.shape)
voting_data.head()

(434, 17)


Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


# INITIAL DATA MANIPULATION

In [4]:
# CHANGE HEADERS
# -

column_headers = [
    'party',
    'handicapped-infants',
    'water-project-cost-sharing',
    'adoption-of-the-budget-resolution',
    'physician-fee-freeze',
    'el-salvador-aid',
    'religious-groups-in-schools',
    'anti-satellite-test-ban',
    'aid-to-nicaraguan-contras',
    'mx-missle',
    'immigration',
    'synfules-corporation-cutback',
    'education-spending',
    'superfund-right-to-sue',
    'crime',
    'duty-free-exports',
    'export-administration-act-south-africa'
    
]

voting_data = pd.read_csv(voting_records_url, names=column_headers)

print(voting_data.shape)
voting_data.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missle,immigration,synfules-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [5]:
# CHECK NULL VALUES
# Note: isna() reports 0 for every column because missing votes are stored
#       as the string '?', which pandas does not treat as NaN.
# ToDo: use the 'na_values' argument of 'pd.read_csv' to read every '?' as NaN.
# -

# voting_data = pd.read_csv(voting_records_url, names=column_headers, na_values=['?'])
print(voting_data.shape)
voting_data.isna().sum()

(435, 17)


party                                     0
handicapped-infants                       0
water-project-cost-sharing                0
adoption-of-the-budget-resolution         0
physician-fee-freeze                      0
el-salvador-aid                           0
religious-groups-in-schools               0
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 0
mx-missle                                 0
immigration                               0
synfules-corporation-cutback              0
education-spending                        0
superfund-right-to-sue                    0
crime                                     0
duty-free-exports                         0
export-administration-act-south-africa    0
dtype: int64
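
The effect of the `na_values` argument mentioned in the ToDo can be seen on a small inline sample. This is only a sketch: `sample_csv` and `sample_columns` are made-up stand-ins for the real file and headers, so the cell runs without a download.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the voting file: party, then y / n / ? votes
sample_csv = "republican,n,y,?\ndemocrat,?,y,y\ndemocrat,y,?,n\n"
sample_columns = ['party', 'issue-a', 'issue-b', 'issue-c']

# Default parsing keeps '?' as an ordinary string, so nothing counts as missing
with_question_marks = pd.read_csv(io.StringIO(sample_csv), names=sample_columns)
missing_before = with_question_marks.isna().sum().sum()

# With na_values=['?'], every '?' is read as NaN and isna() can see it
with_nan = pd.read_csv(io.StringIO(sample_csv), names=sample_columns, na_values=['?'])
missing_after = with_nan.isna().sum()

print(missing_before)
print(missing_after)
```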

In [6]:
voting_data.dtypes

party                                     object
handicapped-infants                       object
water-project-cost-sharing                object
adoption-of-the-budget-resolution         object
physician-fee-freeze                      object
el-salvador-aid                           object
religious-groups-in-schools               object
anti-satellite-test-ban                   object
aid-to-nicaraguan-contras                 object
mx-missle                                 object
immigration                               object
synfules-corporation-cutback              object
education-spending                        object
superfund-right-to-sue                    object
crime                                     object
duty-free-exports                         object
export-administration-act-south-africa    object
dtype: object

In [7]:
print(voting_data.shape)
voting_data.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missle,immigration,synfules-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


# ISSUE ANALYSIS

## Issue #1 

TITLE: handicapped_infants


### Subset #1

In [8]:
# Subset_1 = Individual Vote
# -

columns = ['party', 'handicapped-infants']

handicapped_infants = voting_data[columns]

print(handicapped_infants.shape)
print(handicapped_infants.head())
handicapped_infants.describe()

(435, 2)
        party handicapped-infants
0  republican                   n
1  republican                   n
2    democrat                   ?
3    democrat                   n
4    democrat                   y


Unnamed: 0,party,handicapped-infants
count,435,435
unique,2,3
top,democrat,n
freq,267,236


### Feature Engineering

N / ? / Y --> -1 / 0 / 1 

In [0]:
# Convert to number code
def convertTo_numberCode(item):
  """Map a raw vote to a number: 'n' -> -1, '?' -> 0, 'y' -> 1."""
  if item == 'n':
    return -1
  elif item == '?':
    return 0
  elif item == 'y':
    return 1

In [10]:
# Take an explicit copy first so the new column is not written to a slice of voting_data
handicapped_infants = handicapped_infants.copy()
handicapped_infants['voteCode'] = handicapped_infants['handicapped-infants'].apply(convertTo_numberCode)


In [11]:
handicapped_infants.head()

Unnamed: 0,party,handicapped-infants,voteCode
0,republican,n,-1
1,republican,n,-1
2,democrat,?,0
3,democrat,n,-1
4,democrat,y,1
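
The -1 / 0 / 1 coding keeps every row by treating a missing vote as a midpoint between no and yes. An alternative, sketched here on made-up data and not used elsewhere in this notebook, is to code yes as 1 and no as 0, leave `?` as NaN, and drop the missing votes before testing. Each group's mean is then its share of yes votes among members who voted. The names `vote_map` and `toy` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Alternative coding (an assumption, not the notebook's): '?' becomes NaN
vote_map = {'y': 1.0, 'n': 0.0, '?': np.nan}

# Made-up votes for eight hypothetical members
toy = pd.DataFrame({
    'party': ['republican'] * 4 + ['democrat'] * 4,
    'vote': ['n', 'n', '?', 'y', 'y', 'y', 'n', '?'],
})
toy['voteCode'] = toy['vote'].map(vote_map)

rep = toy.loc[toy['party'] == 'republican', 'voteCode']
dem = toy.loc[toy['party'] == 'democrat', 'voteCode']

# Series.mean() skips NaN, so these are yes-shares among recorded votes
yes_share_rep = rep.mean()
yes_share_dem = dem.mean()

# Drop the missing votes explicitly before running the two-sample test
result = ttest_ind(rep.dropna(), dem.dropna())

print(yes_share_rep, yes_share_dem)
print(result)
```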


### Subset #2 


In [12]:
# Subset_2 = party
handicapped_infants_R = handicapped_infants[handicapped_infants['party'] == 'republican']
print(handicapped_infants_R.shape)

handicapped_infants_D = handicapped_infants[handicapped_infants['party'] == 'democrat']
print(handicapped_infants_D.shape)


(168, 3)
(267, 3)


In [13]:
print(handicapped_infants_D.shape)
print(handicapped_infants_D.describe())
print(handicapped_infants_D.describe(exclude='number'))

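# Summary statistics for democrats, read from the data instead of being hard-coded
mu_handicappedInfants_D = handicapped_infants_D['voteCode'].mean()
std_handicappedInfants_D = handicapped_infants_D['voteCode'].std()
sample_handicappedInfants_D = len(handicapped_infants_D)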

(267, 3)
         voteCode
count  267.000000
mean     0.202247
std      0.963778
min     -1.000000
25%     -1.000000
50%      1.000000
75%      1.000000
max      1.000000
           party handicapped-infants
count        267                 267
unique         1                   3
top     democrat                   y
freq         267                 156


In [15]:
print(handicapped_infants_R.shape)
print(handicapped_infants_R.describe())
print(handicapped_infants_R.describe(exclude='number'))

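# Summary statistics for republicans, read from the data instead of being hard-coded
mu_handicappedInfants_R = handicapped_infants_R['voteCode'].mean()
std_handicappedInfants_R = handicapped_infants_R['voteCode'].std()
sample_handicappedInfants_R = len(handicapped_infants_R)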

(168, 3)
         voteCode
count  168.000000
mean    -0.613095
std      0.780953
min     -1.000000
25%     -1.000000
50%     -1.000000
75%     -1.000000
max      1.000000
             party handicapped-infants
count          168                 168
unique           1                   3
top     republican                   n
freq           168                 134


### T Tests

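1) Null Hypothesis: There is no difference in mean vote between republicans and democrats on this issue

2) Alt Hypothesis: There is a difference in mean vote between republicans and democrats on this issue

``` 
Notes: 
- T Stat = difference between the two group means, measured in standard errors of that difference
- P Value = probability of seeing a T Stat at least this extreme if the null hypothesis were true

T Stat = -9.22317772154614
P Value = 1.2761169357253626e-18
 ```

3) The P Value is far below 0.01, so we reject the null hypothesis: the two parties voted differently on this issue

4) The T Stat is negative with republicans passed as the first sample, so republicans have the lower mean voteCode (-0.613095 vs 0.202247): democrats supported this issue more than republicans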

In [16]:
ttest_ind(handicapped_infants_R['voteCode'], handicapped_infants_D['voteCode'])

Ttest_indResult(statistic=-9.22317772154614, pvalue=1.2761169357253626e-18)
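
As a cross-check, the same statistic can be reproduced from the summary statistics printed by `describe()` above, using `ttest_ind_from_stats` (already imported at the top of the notebook). This sketch assumes the default pooled-variance test that `ttest_ind` runs, and it uses the rounded means and standard deviations, so the result should only match to a few decimal places.

```python
from scipy.stats import ttest_ind_from_stats

# Rounded values copied from the describe() output for each party
check = ttest_ind_from_stats(
    mean1=-0.613095, std1=0.780953, nobs1=168,  # republicans
    mean2=0.202247, std2=0.963778, nobs2=267,   # democrats
)

print(check.statistic, check.pvalue)
```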

## Issue #2

TITLE: immigration

In [17]:
# Subset_1 = Individual Vote
# -

columns = ['party', 'immigration']

immigration = voting_data[columns]

print(immigration.shape)
print(immigration.head())
immigration.describe()

(435, 2)
        party immigration
0  republican           y
1  republican           n
2    democrat           n
3    democrat           n
4    democrat           n


Unnamed: 0,party,immigration
count,435,435
unique,2,3
top,democrat,y
freq,267,216


In [18]:
# Take an explicit copy first so the new column is not written to a slice of voting_data
immigration = immigration.copy()
immigration['voteCode'] = immigration['immigration'].apply(convertTo_numberCode)


In [20]:
immigration.head()

Unnamed: 0,party,immigration,voteCode
0,republican,y,1
1,republican,n,-1
2,democrat,n,-1
3,democrat,n,-1
4,democrat,n,-1


In [21]:
# Subset_2 = party
immigration_R = immigration[immigration['party'] == 'republican']
print(immigration_R.shape)

immigration_D = immigration[immigration['party'] == 'democrat']
print(immigration_D.shape)

(168, 3)
(267, 3)


In [22]:
ttest_ind(immigration_R['voteCode'], immigration_D['voteCode'])

Ttest_indResult(statistic=1.735016635686661, pvalue=0.08344939720307322)
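
The steps repeated for each issue (select the column, convert votes to codes, split by party, run the test) can be folded into one function, in the spirit of the first stretch goal. This is a minimal sketch: `compare_parties` and `VOTE_CODES` are hypothetical names, the coding matches `convertTo_numberCode`, and the demo runs on a made-up frame so it does not depend on the download.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Same coding as convertTo_numberCode: no / missing / yes
VOTE_CODES = {'n': -1, '?': 0, 'y': 1}


def compare_parties(df, issue, party_col='party'):
    """Hypothetical helper: two-sample t-test of republicans vs democrats on one issue."""
    codes = df[issue].map(VOTE_CODES)
    rep = codes[df[party_col] == 'republican']
    dem = codes[df[party_col] == 'democrat']
    # Republicans are passed first, so a positive statistic means
    # republicans have the higher mean vote code
    return ttest_ind(rep, dem)


# Made-up votes for eight hypothetical members
toy_votes = pd.DataFrame({
    'party': ['republican'] * 4 + ['democrat'] * 4,
    'toy-issue': ['y', 'y', 'y', 'n', 'n', 'n', '?', 'y'],
})

toy_result = compare_parties(toy_votes, 'toy-issue')
print(toy_result)
```

On the real data, a call such as `compare_parties(voting_data, 'immigration')` should give the same result as the cell above, since it performs the same computation.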