<a href="https://colab.research.google.com/github/Tam1979/TATA-ML/blob/master/w2_2b_rforest_election.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forests: Presidential Contributions

Let's look at a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution records compiled from publicly available information.

The goal here is to classify which candidate a contributor gives to.

Here are the feature columns we will use:
1. Last Name (converted from Contributor Name)
2. First Name (converted from Contributor Name)
3. State 
4. Latitude (converted from ZIP code)
5. Longitude (converted from ZIP code)
6. Employer
7. Occupation

### Notes

This is going to be a very difficult dataset to get high accuracy on, because none of our features is strongly correlated with the outcome. Part of our analysis is to see which features prove to be the most useful.

One might suspect that a feature like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans might contribute to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, since each tree is trained on a "bagged" (bootstrap) sample and considers a random subset of features at each split, we can empirically see which variables have the most predictive power. This is helpful for analytical reasons.
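To make the idea of feature importance concrete, here is a minimal sketch on synthetic data (not the contributions dataset). The column names `signal` and `noise` and the variable names are made up for illustration. Only `signal` determines the label, so the forest should give it most of the importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Hypothetical toy features: 'signal' drives the label, 'noise' is unrelated.
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_label = (toy["signal"] > 0).astype(int)

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0)
toy_rf.fit(toy, toy_label)

# Importances are normalized so that they sum to 1 across features.
toy_importances = pd.Series(toy_rf.feature_importances_, index=toy.columns)
print(toy_importances.sort_values(ascending=False))
```

The same `feature_importances_` attribute is what we inspect for the real model at the end of this notebook.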



In [0]:
%matplotlib inline
import time
import pandas as pd

## Step 1: Load the data

In [0]:
t1 = time.perf_counter()
dataset = pd.read_csv("https://s3.amazonaws.com/elephantscale-public/data/presidential_election_contribs/2016/2016-medium-clean.csv",)
t2 = time.perf_counter()

print("read {:,} records in {:,.2f} ms".format(len(dataset), (t2-t1)*1000))

read 9,221 records in 733.89 ms


In [0]:
prediction_column = ['CAND_NM']
numeric_columns = ['LAT', 'LNG']
feature_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'LAT', 'LNG', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_index = ['FIRSTNAME_index', 'LASTNAME_index', 'CONTBR_ST_index', 'CONTBR_EMPLOYER_index', 
                     "CONTBR_OCCUPATION_index"]

### Print out a contribution count broken down by candidate
**=> Q: Which candidates got the most donations (counted by number of contributions)?**

In [0]:
## TODO : print out per candidate breakdown
## Hint : What column represents Candidate name
# CAND_NM holds the candidate name, so group on it rather than on the contributor's name
dataset.groupby('CAND_NM').size()

In [0]:
## TODO : sort the output by number of contributions
# Count contributions per candidate, largest first
dataset.groupby('CAND_NM').size().sort_values(ascending=False)

## Step 2: Build Indexers and feature vector

Let's index all the categorical columns (map each distinct value to an integer code), then build the feature matrix and the label.

In [0]:
for col in categorical_columns:
    dataset[col + '_index'] = pd.factorize(dataset[col])[0]
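
As a quick illustration of what `pd.factorize` returns, here is a small sketch on made-up values (the `states` series is hypothetical, not taken from the dataset). Each distinct value gets an integer code in order of first appearance, and missing values are coded as -1.

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
states = pd.Series(["NY", "TX", "NY", None, "CA"])
codes, uniques = pd.factorize(states)

print(codes)    # one integer code per row; the missing value becomes -1
print(uniques)  # the distinct values, in order of first appearance
```

Note that these codes carry no meaningful order. Tree-based models can still split on them, but the numeric distance between two codes says nothing about how similar the original values are.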

In [0]:
features = dataset[numeric_columns + categorical_index]  # numeric columns plus our *_index columns
label = dataset[prediction_column[0]]  # the candidate name is what we are trying to predict

## Step 4: Split data into training and test

**=> TODO: build training and test datasets 70%/30%**


In [0]:
## TODO : split 70% training and 30% testing
## Hint : 0.7 ,  0.3

from sklearn.model_selection import train_test_split
# random_state is fixed only so the split is reproducible; the value is arbitrary
train_x, test_x, train_y, test_y = train_test_split(
    features, label, train_size=0.7, test_size=0.3, random_state=42)
print("training set = " , len(train_x))
print("testing set = " , len(test_x))

## Step 5: Train the Model


In [0]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=20)

In [0]:
print("Starting model training....this will take some time")
t1 = time.perf_counter()
## TODO : train the model with our training set
## Hint : training
rf.fit(train_x,train_y)
t2 = time.perf_counter()
print("trained on {:,} records  in {:,.2f} ms".\
      format(len(train_x),  (t2-t1)*1000))

Starting model training....this will take some time



In [0]:
## TODO : predict with our test data
## Hint : test_x

t1 = time.perf_counter()
predictions = rf.predict(test_x)
t2 = time.perf_counter()
print("prediction on {:,} records  in {:,.2f} ms".\
      format(len(test_x),  (t2-t1)*1000))

predictions

## Step 6: Evaluate the model

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)


**=> TODO: Think about the test error (1 - accuracy) here. Does it seem high? What does that say about our model?**

**=> How do we define model success?**
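
One possible way to frame model success is to compare against a naive baseline that always predicts the most common class. With many candidates and skewed donation counts, a raw accuracy number is hard to judge in isolation. The sketch below uses made-up labels (`toy_y`) and scikit-learn's `DummyClassifier`; it is an illustration, not a result from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical, imbalanced labels standing in for candidate names.
toy_y = np.array(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
toy_X = np.zeros((len(toy_y), 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

baseline_accuracy = accuracy_score(toy_y, baseline.predict(toy_X))
print("baseline accuracy:", baseline_accuracy)
```

A model is only adding value if its test accuracy is clearly above what this kind of baseline achieves on the same test set.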

In [0]:
import numpy as np
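# Use every label seen in either the true or the predicted values, so that
# candidates the model never predicts still get a row in the confusion matrix.
names = np.union1d(test_y, predictions)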

In [0]:
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(test_y, predictions, labels=names), index=names, columns=names)


Rows are the actual candidates and columns are the predicted candidates.

**=> What can you conclude from the confusion matrix?**

Is our model better at predicting candidates with many donations (Clinton, Sanders), or few donations?

What can you say about our model performance?
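
To answer the question about frequent versus rare classes, per-class recall is often easier to read than the full matrix. The sketch below uses made-up labels (`toy_true`, `toy_pred`) where class `A` is common and class `C` is rare; the values are hypothetical and only illustrate the calls.

```python
from sklearn.metrics import classification_report, recall_score

# Hypothetical true and predicted labels, for illustration only.
toy_true = ["A", "A", "A", "B", "B", "C"]
toy_pred = ["A", "A", "B", "B", "A", "A"]

# Recall per class: of the rows that truly belong to a class,
# what fraction did the model label correctly?
toy_recall = recall_score(toy_true, toy_pred, average=None,
                          labels=["A", "B", "C"])
print(toy_recall)

# zero_division=0 avoids a warning for classes that are never predicted.
print(classification_report(toy_true, toy_pred, zero_division=0))
```

In the notebook, the same calls could be applied to `test_y` and `predictions`.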

## Step 8: Print the feature importances

In [0]:
import pandas as pd
rf.feature_importances_


**=> TODO: Compare the relative weights of the feature importances.**



### Visualize feature importances

TODO: Create a visualization of the feature importances.
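
One possible approach is a sorted horizontal bar chart. The helper `plot_importances` below is a hypothetical name introduced for this sketch, and the sample values are made up. In the notebook you would pass `rf.feature_importances_` and the matching feature column names instead.

```python
import pandas as pd

def plot_importances(importances, names):
    """Draw a horizontal bar chart of importances, largest bar at the top."""
    # barh draws the first row at the bottom, so sort ascending.
    series = pd.Series(importances, index=names).sort_values()
    ax = series.plot.barh(title="Feature importances")
    ax.set_xlabel("importance")
    return series

# Hypothetical values, for illustration only.
ranked = plot_importances([0.5, 0.1, 0.4], ["f1", "f2", "f3"])
```

Sorting before plotting makes it easy to see at a glance which features dominate and which contribute almost nothing.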


Why do you think that the lat/long and other fields did not contribute?

**=> BONUS: Compute a Pearson correlation matrix of the variables against the outcome to see how they correlate.**
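
Because the outcome is a category rather than a number, one way to attempt this (an assumption, not the only option) is to turn the candidate into one indicator column per candidate and correlate the numeric features with each indicator. The toy frame and its column names `x1`, `x2`, and `cand` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; 'x1' and 'x2' stand in for numeric feature columns.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "cand": ["A", "A", "A", "B", "B", "B"],
})

# One 0/1 indicator column per candidate.
indicators = pd.get_dummies(toy["cand"], prefix="is", dtype=float)

# Pearson correlation of every feature with every candidate indicator.
corr = pd.concat([toy[["x1", "x2"]], indicators], axis=1).corr(method="pearson")
feature_vs_outcome = corr.loc[["x1", "x2"], indicators.columns]
print(feature_vs_outcome)
```

Keep in mind that the `*_index` columns are arbitrary integer codes, so a Pearson coefficient computed on them is hard to interpret; it is most meaningful for genuinely numeric columns such as `LAT` and `LNG`.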

