
# Party Affiliation in the United States
#### By Aaron Wright - 1912626
---

The purpose of this analysis is to implement two distinct AI algorithms on a selected dataset. Throughout this report I will apply classification techniques to assess voting habits and predict whether an individual votes **Democrat** or **Republican**.

## Dataset

This dataset was taken from the UCI Machine Learning Repository ([source](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)).

### Title

1984 United States Congressional Voting Records

### Description


This dataset contains the individual votes of each member of the US House of Representatives on 16 key votes identified by the Congressional Quarterly Almanac (CQA).

There are 9 different kinds of votes:

* 3 are "voted for", "paired for", and "announced for". These are simplified to "y" in the dataset.

* 3 are "voted against", "paired against", and "announced against". These are simplified to "n" in the dataset.

* 3 are "voted present", "voted present to avoid conflict of interest", "did not vote or make position known". These are simplified to "?" in the dataset.
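
The simplification described in the list above can be sketched as a lookup table. The names `VOTE_CODES` and `simplify_vote` are hypothetical and introduced only for illustration; the published file already contains the simplified symbols, so this step is not needed to use the data.

```python
# Hypothetical mapping from the nine vote types to the three dataset symbols.
VOTE_CODES = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or make position known": "?",
}


def simplify_vote(vote_type):
    """Return the dataset symbol for a raw vote type (case-insensitive)."""
    return VOTE_CODES[vote_type.strip().lower()]


print(simplify_vote("Paired for"))
```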

### Attributes

1. party
1. handicapped-infants
1. water-project-cost-sharing
1. adoption-of-the-budget-resolution
1. physician-fee-freeze
1. el-salvador-aid
1. religious-groups-in-schools
1. anti-satellite-test-ban
1. aid-to-nicaraguan-contras
1. mx-missile
1. immigration
1. synfuels-corporation-cutback
1. education-spending
1. superfund-right-to-sue
1. crime
1. duty-free-exports
1. export-administration-act-south-africa

## Imports

In [64]:
import pandas as pd
import plotly.express as px
import numpy as np

## Data Loading

In [2]:
# URL of the raw CSV file in the GitHub repository
data = 'https://raw.githubusercontent.com/ABZ-Aaron/PartyAffiliation/master/votes_data.csv?token=AND25LFLHJXHZZMCYHIFOMTBVQHBE'

In [3]:
# Column names
cols = ["party", 
        "hc-infants", 
        "water-proj", 
        "budget-reso", 
        "fee-freeze", 
        "salvador-aid", 
        "relig-groups", 
        "anti-satell", 
        "nicar-contras",
        "mx-missile",
        "immigration",
        "syn-cutback",
        "edu-spending",
        "right-to-sue",
        "crime",
        "dutyfree-expo",
        "admin-act-sou"]

In [12]:
# Read in our data
votes = pd.read_csv(data, names = cols)
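
Passing `names` matters when a file has no header row, which is assumed to be the case here given that the column names are supplied by hand. The sketch below uses two made-up rows and a shortened, hypothetical column list (`sample_cols`) to show the effect without downloading anything.

```python
import io

import pandas as pd

# Two made-up rows in a headerless layout, cut down to three columns.
sample_csv = io.StringIO("republican,n,y\ndemocrat,?,y\n")
sample_cols = ["party", "hc-infants", "water-proj"]

# With `names` given, pandas treats every line as data rather than
# consuming the first record as the header.
sample = pd.read_csv(sample_csv, names=sample_cols)
print(sample)
```

If `names` were omitted, the first record would be used as the column labels and the frame would hold one row fewer.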

## Data Exploration

In [20]:
# Check the first 5 records
votes.head()

Unnamed: 0,party,hc-infants,water-proj,budget-reso,fee-freeze,salvador-aid,relig-groups,anti-satell,nicar-contras,mx-missile,immigration,syn-cutback,edu-spending,right-to-sue,crime,dutyfree-expo,admin-act-sou
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [21]:
# Check the last 5 records
votes.tail()

Unnamed: 0,party,hc-infants,water-proj,budget-reso,fee-freeze,salvador-aid,relig-groups,anti-satell,nicar-contras,mx-missile,immigration,syn-cutback,edu-spending,right-to-sue,crime,dutyfree-expo,admin-act-sou
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y
434,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,?,n


In [14]:
# Show information associated with dataframe
votes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   party          435 non-null    object
 1   hc-infants     435 non-null    object
 2   water-proj     435 non-null    object
 3   budget-reso    435 non-null    object
 4   fee-freeze     435 non-null    object
 5   salvador-aid   435 non-null    object
 6   relig-groups   435 non-null    object
 7   anti-satell    435 non-null    object
 8   nicar-contras  435 non-null    object
 9   mx-missile     435 non-null    object
 10  immigration    435 non-null    object
 11  syn-cutback    435 non-null    object
 12  edu-spending   435 non-null    object
 13  right-to-sue   435 non-null    object
 14  crime          435 non-null    object
 15  dutyfree-expo  435 non-null    object
 16  admin-act-sou  435 non-null    object
dtypes: object(17)
memory usage: 57.9+ KB


In [24]:
# Return number of unique values by column
votes.nunique()

party            2
hc-infants       3
water-proj       3
budget-reso      3
fee-freeze       3
salvador-aid     3
relig-groups     3
anti-satell      3
nicar-contras    3
mx-missile       3
immigration      3
syn-cutback      3
edu-spending     3
right-to-sue     3
crime            3
dutyfree-expo    3
admin-act-sou    3
dtype: int64

In [19]:
# Return unique values for target variable
votes['party'].unique()

array(['republican', 'democrat'], dtype=object)

In [26]:
# Return unique values from one feature variable
votes['hc-infants'].unique()

array(['n', '?', 'y'], dtype=object)

In [31]:
# Count records per party, including any missing values
votes['party'].value_counts(dropna=False)

democrat      267
republican    168
Name: party, dtype: int64

In [33]:
# Share of records per party
votes['party'].value_counts(normalize=True)

democrat      0.613793
republican    0.386207
Name: party, dtype: float64

In [53]:
# Pie chart of the party split
fig = px.pie(votes, names="party", title="Rep vs Dem Count")
fig.show()

### Dataset Details

* **17** columns 
* **1** target variable
* **16** features
* **435** records
* No **NULL** values
* All of **object** datatype
* Target variable consists of two entries: **republican** and **democrat**
* Feature variables consist of three entries: **y**, **n** and **?**
* The **?** represents a vote that is neither yes nor no (present, or position not known).
* More Democrats (**61%**) than Republicans (**39%**) in the dataset.
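
The percentages in the last bullet follow directly from the class counts reported by `value_counts()`. A minimal check of that arithmetic, with the counts typed in by hand rather than read from `votes`:

```python
# Class counts copied from the value_counts() output.
counts = {"democrat": 267, "republican": 168}
total = sum(counts.values())  # 435 records

# Share of each party in the dataset.
shares = {party: n / total for party, n in counts.items()}
for party, share in shares.items():
    print(f"{party}: {share:.1%}")
```

The imbalance is worth keeping in mind when judging a classifier: one that always predicts **democrat** would already be right for about 61% of the records.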

## Data Cleaning

Let's first convert our feature variables to a numerical type. We'll need this if we run a k-nearest neighbours (KNN) algorithm.

In [27]:
# Replace y and n with numerical values (note: y becomes 0 and n becomes 1)
votes.replace({"y": 0, "n": 1}, inplace=True)
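
To illustrate why KNN needs numbers: the algorithm compares records by a distance, and a distance can only be computed on numeric vectors. The sketch below uses two made-up, four-vote records and a hypothetical helper `euclidean`; it is not the distance code used later in the analysis, only an illustration of the idea.

```python
import math

# Two made-up voting records, encoded with y -> 0 and n -> 1.
rep_a = [1, 0, 1, 0]
rep_b = [1, 1, 0, 0]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The records disagree on two votes, so the distance is sqrt(2).
dist = euclidean(rep_a, rep_b)
print(dist)
```

With a 0/1 encoding the squared distance is simply the number of votes on which two members disagree, so which label gets 0 and which gets 1 does not change the result.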

There are a few ways we could deal with our **?** values:

1. Remove rows with ? values
1. Remove columns with ? values
1. Replace ? values with something

We don't want to remove columns: every feature has three unique values, so each one contains at least one ?, and this would remove all 16 features. Replacing could be an option. However, the simplest option would be to remove records containing ?.

Let's see how many records we have remaining after removing records containing ?.

In [73]:
# Count records per party after dropping rows that contain ?
votes.replace('?', np.nan).dropna()['party'].value_counts()

democrat      124
republican    108
Name: party, dtype: int64

Only 232 of the 435 records remain, so this approach discards almost half of the data. The best approach therefore would be to replace the ? values with something.
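
One possible replacement strategy is to fill each ? with the most frequent value in its column. This is a sketch under that assumption, not necessarily the strategy this analysis goes on to use; the frame `toy` is made up and the variable names are hypothetical.

```python
import pandas as pd

# Made-up frame in the same encoding as `votes` after the replace step.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat", "democrat"],
    "hc-infants": [1, "?", 0, 0],
    "water-proj": [0, 0, "?", 1],
})

features = toy.columns.drop("party")
filled = toy.copy()

for col in features:
    # "?" cannot be parsed as a number, so it becomes NaN.
    numeric = pd.to_numeric(filled[col], errors="coerce")
    # Fill NaN with the column's most frequent value
    # (on a tie, mode() lists the smaller value first).
    filled[col] = numeric.fillna(numeric.mode()[0]).astype(int)

print(filled)
```

Mode imputation keeps every record but nudges ambiguous votes toward the majority position. An alternative would be to encode ? as a third numeric value so that abstentions stay visible to the model.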

In [9]:
votes.head()

Unnamed: 0,party,hc-infants,water-proj,budget-reso,fee-freeze,salvador-aid,relig-groups,anti-satell,nicar-contras,mx-missile,immigration,syn-cutback,edu-spending,right-to-sue,crime,dutyfree-expo,admin-act-sou
0,republican,1,0,1,0,0,0,1,1,1,0,?,0,0,0,1,0
1,republican,1,0,1,0,0,0,1,1,1,1,1,0,0,0,1,?
2,democrat,?,0,0,?,0,0,1,1,1,1,0,1,0,0,1,1
3,democrat,1,0,0,1,?,0,1,1,1,1,0,1,0,1,1,0
4,democrat,0,0,0,1,0,0,1,1,1,1,0,?,0,0,0,0
