## A Hands-on Workshop series in Machine Learning
### Session 2: Predicting election results using ANES (American National Election Study) data
#### Instructor: Aashita Kesarwani

In this session, you will use data from [ANES (American National Election Study)](https://electionstudies.org/data-center/) to build prediction models using decision trees and random forests.

In [None]:
import numpy as np
import pandas as pd

import warnings
warnings.simplefilter('ignore') 

import matplotlib.pyplot as plt
import seaborn as sns # Comment this if seaborn is not installed
%matplotlib inline

path = 'data/'
df = pd.read_csv(path + 'anes.csv')

df.head()

In [None]:
df.shape

Let us check whether there are any missing values in the dataset by chaining the [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) and [`sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) methods one after the other.

In [None]:
df.isnull().sum()

No missing values in the dataset!

Let us check the different columns in the dataset.

In [None]:
df.columns

* race : Race-ethnicity summary, 7 categories
* incgroup : DEMOGRAPHICS: Respondent Family - Income Group
* education : DEMOGRAPHICS: Respondent - Education, 7-categories
* classpercep : DEMOGRAPHICS: Respondent - Average or Upper Middle/Working Class 
* votenat : ELECTION: Did Respondent Vote in the National Elections
* votepresid : ELECTION: Vote for President- Major Parties and Other
* votecong : ELECTION: Vote for Congressman
* voteincumb : ELECTION: Did Respondent Vote for Incumbent U.S. House Candidate
* prevote : ELECTION: Respondent Pre-election Intent for Vote for President
* voteintactual : ELECTION: Intended Presidential Vote versus Actual Presidential Vote 
* voterpref : ELECTION: Voter Strength of Preference - Presidential Cand 
* novoterpref : ELECTION: Nonvoter Strength of Preference- Presidential Cand 
* mobiliz : MOBILIZATION: Respondent Try to Influence the Vote of Others During the Campaign
* poldiscuss : POLITICAL ENGAGEMENT: Respondent Discuss Politics with Family and Friends
* jobscale : ISSUES: Guaranteed Jobs and Income Scale
* numcandidat : ELECTION/RACE DESCRIPTION: Number of Candidates in U.S. House Race
* close : POLITICAL ENGAGEMENT: Which Presidential Race in State Be Close
* staterace : ELECTION/RACE DESCRIPTION: Senate Race in State

Let us see how many respondents voted in the national elections.

In [None]:
df['votenat'].value_counts()

This can be represented as a pie chart:

In [None]:
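counts = df['votenat'].value_counts()  # sorted by frequency, not by category code
plt.axis('equal')  # draw the pie as a circle
plt.title("Did respondents vote in the national elections?")
plt.pie(counts, labels=counts.index);  # label each wedge with its own category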

Let us check all the values in the *race* column using the `unique()` function.

In [None]:
df['race'].unique()

In [None]:
df['race'].value_counts()

In [None]:
for col in df.columns:
    print("Column:", col)
    print(df[col].value_counts(), sep="\n")
    print()

We now print out the unique values in each column:

In [None]:
for col in df.columns:
    print("Column:", col)
    print(*sorted(df[col].unique()), sep="\n")
    print()

Note: 
* R stands for Respondent
* DK stands for Don't Know
* NA stands for Not Applicable or Not Available
* RF stands for refused to say
* Pre IW stands for Pre-election interviews (two months prior to elections)
* Post IW stands for Post-election reinterviews

Please take a close look at the columns. 

Notice that every category is represented by a single-digit numeral. Let us use the regular expressions package `re` to extract the numerical categories for the columns.

In [None]:
import re

First, we need to figure out the pattern to extract the categories. Since all the columns in our dataframe need exactly the same processing, we can work out the pattern on the `race` column. To do that, we pick the first value in the `race` column, call it `x`, and then find the pattern for it.

In [None]:
x = df.loc[0, 'race']
x

Use `re.findall()` on `x` to extract `6`. Hint: `\d` matches any single digit. Please refer to Section 3 of the *Data manipulation with pandas* notebook.
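As a minimal illustration of how `re.findall()` behaves, the sketch below runs it on a made-up label. The string `sample` is an assumption; the actual entries in `anes.csv` may be formatted differently.

```python
import re

# Hypothetical label, made up for illustration; real entries may differ.
sample = "6. Hispanic (1966-2012)"

# \d matches a single digit, so findall returns every digit in the string, in order.
digits = re.findall(r"\d", sample)
print(digits)

# If the category code comes first in the label, it is the first digit found.
print(digits[0])
```

Note that `findall` returns a list of strings, so the extracted code is the string `'6'` rather than the integer `6`.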

Fill in the function below so that it extracts the category from a value and ***returns the extracted category***. It will be applied to all the columns.

In [None]:
def extract_category(x):
    # Fill in below

    return cat

Use the [`map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) function to apply `extract_category` to the race column. 

Hint: 
* The syntax is `df['Relevant_column'] = df['Relevant_column'].map(function_name)`. 
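The `map()` pattern in the hint can be tried on a toy Series first. In this sketch, the Series `s` and the helper `first_char` are hypothetical stand-ins, not the workshop data or the intended `extract_category`.

```python
import pandas as pd

# Toy Series standing in for a dataframe column (made-up values).
s = pd.Series(["6. Hispanic", "1. White", "1. White"])

def first_char(x):
    # Hypothetical helper: return the first character of the string.
    return x[0]

# map() calls the function on every element and returns a new Series.
mapped = s.map(first_char)
print(mapped.tolist())
```

Because `map()` returns a new Series and leaves `s` untouched, the result has to be assigned back to the column for the change to persist.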

Let us check whether the *race* column is truly modified.

In [None]:
df['race'].head()

You should get the following output:

```
0    6
1    1
2    1
3    1
4    1
Name: race, dtype: object
```

Let us check the datatypes of all the columns using [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
df.dtypes

The `object` datatype is not suitable for the [`scikit-learn`](https://scikit-learn.org/stable/) models that we will use below. Let us change it into the `category` datatype using [`astype`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html).

In [None]:
df['race'] = df['race'].astype('category')

In [None]:
df['race'].head()

The change in the data type should be reflected in the above output as:   
`Name: race, dtype: category
Categories (7, object): [1, 2, 3, 4, 5, 6, 9]`

So far, only the *race* column has changed:

In [None]:
df.head(3)

Let us first copy the dataframe in its current state as `old_df` so that we can refer back to it later if needed.

In [None]:
old_df = df.copy() 

Use a for loop to apply the function `extract_category` to each column with `map()`, and then change the column's datatype with `astype('category')`, as tested above. 
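One way the loop could look is sketched below on a toy dataframe. The frame `toy` and the helper `leading_digit` are assumptions made for illustration; in the notebook you would loop over `df.columns` with your own `extract_category`.

```python
import re
import pandas as pd

# Toy dataframe with made-up labels in a "digit. description" style.
toy = pd.DataFrame({
    "race": ["6. Hispanic", "1. White"],
    "votenat": ["2. Yes, voted", "1. No, did not vote"],
})

def leading_digit(x):
    # Hypothetical stand-in for extract_category: first digit in the string.
    return re.findall(r"\d", str(x))[0]

for col in toy.columns:
    # Replace each label by its code, then store the column as categorical.
    toy[col] = toy[col].map(leading_digit).astype("category")

print(toy.dtypes)
```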

Check again that all the columns are now converted to categories.

In [None]:
df.head()

Let us check datatypes again.

In [None]:
df.dtypes

Now that the data is cleaned up, you should proceed to work on this dataset on your own. The following ideas are meant to guide you in the process. Please feel free to reach out to the instructor or TAs for help or discussion.

Exploratory data analysis of the features using graphs and other means:
* Intended vs actual vote for presidential candidate 
* How do race, education, income group, etc. affect the vote?
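For the first idea, a cross-tabulation is one simple starting point. The sketch below uses `pd.crosstab` on a made-up frame; the codes in `toy` are invented and do not reflect the actual ANES coding.

```python
import pandas as pd

# Made-up category codes for intended and actual presidential vote.
toy = pd.DataFrame({
    "prevote":    ["1", "1", "2", "2", "1"],
    "votepresid": ["1", "2", "2", "2", "1"],
})

# Rows: intended vote; columns: actual vote; cells: number of respondents.
table = pd.crosstab(toy["prevote"], toy["votepresid"])
print(table)

# Share of respondents whose actual vote matched their stated intent.
match_rate = (toy["prevote"] == toy["votepresid"]).mean()
print(match_rate)
```

The same call with a demographic column such as `race` or `education` in place of `prevote` gives a first look at the second idea.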

Build models (using Decision Trees and/or Random Forest) to predict:
* whether a respondent voted or not in the national election (aim for at least 75% accuracy on the validation set)
* vote for president (aim for at least 70% accuracy on the validation set)
* vote for congressman (optional)
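A bare-bones modelling workflow might look like the sketch below. It runs on randomly generated stand-in data rather than the ANES file, so the printed accuracy says nothing about the targets above. The synthetic columns, `max_depth=3`, and the 75/25 split are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data: two coded features and a binary target (not ANES data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "education": rng.integers(1, 8, size=200),
    "incgroup": rng.integers(1, 6, size=200),
})
toy["votenat"] = (toy["education"] + rng.integers(0, 4, size=200) > 5).astype(int)

X = toy[["education", "incgroup"]]
y = toy["votenat"]

# Hold out a quarter of the rows as a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree is a reasonable first baseline.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
accuracy = model.score(X_valid, y_valid)
print(accuracy)
```

`RandomForestClassifier` from `sklearn.ensemble` can be swapped in with the same `fit` and `score` calls.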

Tips:
* Start with a basic model with minimal features 
* Try adding/removing features to see how it affects the model. 
* Try removing rows with certain conditions. For example, while building the prediction model for presidential election:
    * you can use `df = df[df['votenat']=='2']` to filter only the respondents who voted in the elections. 
    * you can simplify the model by restricting to respondents that voted either Republican or Democratic.
* Use creativity in feature engineering 
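The row-filtering tips can be sketched on a toy frame as follows. The party codes used here (`'1'` and `'2'`) are assumptions for illustration; check the unique values printed earlier for the actual coding.

```python
import pandas as pd

# Made-up coded responses (not the real ANES rows).
toy = pd.DataFrame({
    "votenat":    ["2", "1", "2", "2"],
    "votepresid": ["1", "0", "2", "3"],
})

# Keep only respondents who voted.
voters = toy[toy["votenat"] == "2"]

# Restrict further to two assumed major-party codes.
major = voters[voters["votepresid"].isin(["1", "2"])]
print(major)
```

Filtering on a copy rather than overwriting `df` keeps the full dataset available for the other prediction tasks.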

***Important note: Beware of Data Leakage while building models:***
* Do not use a feature that inadvertently reveals information about the target variable that would not be known at prediction time. For example, when predicting the target variable `votepresid`, you cannot use `votecong`, and vice versa.  
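One way to guard against this is to list the columns that depend on the outcome and drop them before building the feature matrix. In the sketch below, the helper `make_features` and the choice of `leaky` columns are hypothetical; deciding which columns leak is a judgment you need to make for each target.

```python
import pandas as pd

def make_features(df, target, leaky):
    # Hypothetical helper: drop the target and any columns judged to leak it.
    X = df.drop(columns=[target] + list(leaky))
    y = df[target]
    return X, y

# Toy frame with made-up codes.
toy = pd.DataFrame({
    "education":  ["3", "5", "2"],
    "votecong":   ["1", "2", "1"],
    "votepresid": ["1", "2", "1"],
})

X, y = make_features(toy, target="votepresid", leaky=["votecong"])
print(list(X.columns))
```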

At the end, make a copy of the notebook and clean it up to present the analysis and model in a clear and coherent manner. It would be a great idea to share your work as a blog using [GitHub Pages](https://help.github.com/en/articles/what-is-github-pages).

For your reference:
* race : Race-ethnicity summary, 7 categories
* incgroup : DEMOGRAPHICS: Respondent Family - Income Group
* education : DEMOGRAPHICS: Respondent - Education, 7-categories
* classpercep : DEMOGRAPHICS: Respondent - Average or Upper Middle/Working Class 
* votenat : ELECTION: Did Respondent Vote in the National Elections
* votepresid : ELECTION: Vote for President- Major Parties and Other
* votecong : ELECTION: Vote for Congressman
* voteincumb : ELECTION: Did Respondent Vote for Incumbent U.S. House Candidate
* prevote : ELECTION: Respondent Pre-election Intent for Vote for President
* voteintactual : ELECTION: Intended Presidential Vote versus Actual Presidential Vote 
* voterpref : ELECTION: Voter Strength of Preference - Presidential Cand 
* novoterpref : ELECTION: Nonvoter Strength of Preference- Presidential Cand 
* mobiliz : MOBILIZATION: Respondent Try to Influence the Vote of Others During the Campaign
* poldiscuss : POLITICAL ENGAGEMENT: Respondent Discuss Politics with Family and Friends
* jobscale : ISSUES: Guaranteed Jobs and Income Scale
* numcandidat : ELECTION/RACE DESCRIPTION: Number of Candidates in U.S. House Race
* close : POLITICAL ENGAGEMENT: Which Presidential Race in State Be Close
* staterace : ELECTION/RACE DESCRIPTION: Senate Race in State

Note: 
* R stands for Respondent
* DK stands for Don't Know
* NA stands for Not Applicable or Not Available
* RF stands for refused to say
* Pre IW stands for Pre-election interviews (two months prior to elections)
* Post IW stands for Post-election reinterviews

Note: You will briefly revisit this dataset in the next session, when you will learn other machine learning algorithms. **Please make sure to write your code and analysis clearly so that you can quickly pick up where you left off today.**

#### Acknowledgment:
* The dataset used in this project is taken from [ANES (American National Election Study)](https://electionstudies.org/data-center/).