<h3 align ='center'><font color='purple'>Creating Benchmark Models </font></h3>

### Motivation:

----

Now that I have a few basic ideas about which variables to include in or exclude from the dataset, I want to compare the quality of a data exclusion technique to that of PCA. I'll run both techniques through a random forest classifier and compare the Matthews correlation coefficient (MCC) of the two benchmarks. Maybe I'll take a look at a combination of the two as well.

Note: If I end up excluding predictors I need to determine, somehow, whether I want fewer predictors with more data or more predictors with less data. Hopefully I'll be able to import all of the 'good' predictors and the decision will be easy.

In [3]:
#import modules
import pandas as pd
import numpy as np
import os
import random
import csv
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier

#change working directory to current directory
curdir = os.getcwd()
os.chdir(curdir)

### Data Exclusion Model:

----

To help decide whether more data with fewer predictors is better than more predictors with less data, I'll try two exclusion techniques. The first, the one I'm betting on, is to exclude predictors in favor of more data, but I will also try retaining all predictors and sacrificing the number of data points I import.

In [2]:
# Predictor names to keep (from data exploration notebook)
numericalPredictors = ['L0_S12_F344', 'L0_S21_F522', 'L0_S12_F330', 'L0_S23_F663',
       'L1_S25_F2164', 'L0_S21_F517', 'L1_S24_F1743', 'L0_S12_F342',
       'L1_S24_F1798', 'L0_S22_F546', 'L3_S30_F3779', 'L0_S15_F406',
       'L1_S24_F1846', 'L3_S36_F3938', 'L0_S12_F334', 'L0_S21_F507',
       'L3_S36_F3930', 'L1_S24_F1844', 'L0_S22_F571', 'L1_S25_F2484',
       'L0_S15_F415', 'L0_S23_F627', 'L3_S35_F3903', 'L0_S23_F619',
       'L2_S27_F3144', 'L0_S3_F76', 'L0_S23_F655', 'L1_S25_F2021',
       'L2_S27_F3206', 'L0_S15_F397', 'L1_S24_F1831', 'L0_S17_F433',
       'L0_S17_F431', 'L1_S24_F1773', 'L0_S18_F449', 'L0_S3_F68',
       'L1_S24_F1647', 'L0_S7_F136', 'L1_S24_F872', 'L0_S21_F532',
       'L0_S9_F175', 'L0_S12_F338', 'L0_S21_F502', 'L1_S24_F1778',
       'L0_S12_F336', 'L0_S10_F234', 'L0_S2_F56', 'L1_S25_F2051',
       'L0_S10_F239', 'L2_S26_F3040', 'L0_S22_F551', 'L0_S14_F390',
       'L0_S21_F527', 'L0_S14_F374', 'L1_S24_F1516', 'L0_S12_F332',
       'L1_S24_F1824', 'L0_S21_F537', 'L0_S2_F40', 'L1_S24_F1763',
       'L1_S24_F1812', 'L0_S23_F667', 'L1_S25_F2960', 'L1_S24_F1518',
       'L0_S14_F386', 'L1_S24_F683', 'L3_S36_F3926', 'L2_S27_F3166',
       'L1_S24_F1520', 'L0_S3_F92', 'L1_S24_F1808', 'L0_S15_F403',
       'L1_S25_F2828', 'L0_S12_F340', 'L1_S24_F1667', 'L0_S11_F298',
       'L3_S29_F3407', 'L0_S11_F310', 'L0_S19_F459', 'L0_S15_F418',
       'L0_S13_F356', 'L0_S10_F244', 'L1_S24_F1758', 'L0_S10_F249',
       'L0_S11_F314', 'L0_S14_F362', 'L0_S19_F455', 'L0_S22_F561',
       'L0_S9_F190', 'L0_S0_F14', 'L1_S24_F1512', 'L1_S24_F1829',
       'L0_S10_F274', 'L1_S24_F1265', 'L0_S9_F165', 'L1_S24_F1569',
       'L0_S10_F224', 'L1_S24_F1000', 'L0_S9_F200', 'L0_S11_F306',
       'L1_S24_F1010', 'L3_S36_F3934', 'L0_S11_F318', 'L0_S12_F350',
       'L0_S9_F170', 'L1_S24_F1637', 'L0_S10_F229', 'L1_S24_F1728',
       'L3_S30_F3709', 'L1_S24_F1498', 'L1_S24_F1733', 'L1_S24_F1848',
       'L1_S24_F1573', 'L0_S9_F180', 'L3_S35_F3894', 'L0_S10_F259',
       'L3_S30_F3684', 'L0_S9_F185', 'L3_S44_F4121', 'L0_S12_F352',
       'L3_S34_F3882', 'L0_S5_F114', 'L0_S0_F6', 'L2_S26_F3036',
       'L0_S0_F12', 'L3_S33_F3873', 'L0_S2_F48', 'L0_S6_F132',
       'L0_S11_F282', 'L1_S24_F1685', 'L0_S11_F326', 'L0_S3_F84',
       'L2_S26_F3073', 'L2_S26_F3106', 'L3_S30_F3624', 'L0_S0_F10',
       'L0_S12_F348', 'L3_S36_F3918', 'L2_S26_F3113', 'L1_S24_F1690',
       'L0_S4_F104', 'L1_S24_F1850', 'L3_S36_F3922', 'L1_S25_F2767',
       'L0_S9_F195', 'L1_S24_F1695', 'L3_S34_F3880', 'L3_S30_F3664',
       'L1_S25_F2007', 'L0_S3_F100', 'L0_S11_F286', 'L0_S0_F4',
       'L2_S26_F3047', 'L3_S33_F3869', 'L3_S29_F3412', 'L0_S0_F2',
       'L3_S29_F3404', 'L2_S26_F3121', 'L0_S6_F122', 'L0_S9_F210',
       'L2_S26_F3062', 'L2_S26_F3117', 'L2_S26_F3051', 'L3_S29_F3461',
       'L0_S9_F160', 'L0_S14_F370', 'L3_S33_F3867', 'L1_S24_F1571',
       'L0_S22_F556', 'L3_S34_F3878', 'L0_S11_F294', 'L3_S30_F3584',
       'L0_S10_F264', 'L0_S0_F0', 'L0_S10_F219', 'L3_S29_F3476',
       'L0_S16_F421','Response','Id']

categoricalPredictors = ['L0_S9_F204', 'L0_S9_F179', 'L0_S9_F159', 'L0_S9_F199', 'L1_S24_F710',
       'L2_S26_F3099', 'L1_S24_F705', 'L1_S24_F675', 'L3_S32_F3851',
       'L3_S32_F3854','Id']

In [3]:
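#import data
n_rows = 50000
#count data rows (excluding the header) without leaving the file open
with open('train_numeric.csv') as f:
    n_rows_in_file = sum(1 for row in f) - 1
#randomly pick data rows to skip so that n_rows remain; row 0 (the header) is never skipped
skips = random.sample(range(1, n_rows_in_file + 1), n_rows_in_file - n_rows)
#the same skips are used for both files so that their rows line up
train_numeric = pd.read_csv('train_numeric.csv', usecols=numericalPredictors, skiprows=skips)
train_categorical = pd.read_csv('train_categorical.csv', usecols=categoricalPredictors, skiprows=skips, low_memory=False)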
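The row-sampling step above can be pulled out into a small helper and checked without touching the CSV files. This is only a sketch: the name `sample_skip_rows` and its `seed` parameter are hypothetical, added here so the sample can be reproduced between runs.

```python
import random


def sample_skip_rows(n_rows_in_file, n_rows, seed=None):
    """Return the sorted data-row indices to skip so that n_rows remain.

    Row 0 is the CSV header and is never skipped; data rows are 1..n_rows_in_file.
    """
    if n_rows > n_rows_in_file:
        raise ValueError("cannot keep more rows than the file contains")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    n_to_skip = n_rows_in_file - n_rows
    return sorted(rng.sample(range(1, n_rows_in_file + 1), n_to_skip))


# Hypothetical file with 1000 data rows, of which 50 are kept
demo_skips = sample_skip_rows(n_rows_in_file=1000, n_rows=50, seed=0)
demo_kept = set(range(1, 1001)) - set(demo_skips)
print(len(demo_skips), len(demo_kept))
```

The returned list can be passed directly as `skiprows` to `pd.read_csv`. Reusing the same list for both files is what keeps the numeric and categorical rows aligned.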

In [4]:
#convert categorical variables to integer codes (missing values get the code -1)
for predictor in train_categorical.columns:
    if predictor == 'Id':
        continue  #keep the real Id so it can be used as the join key
    train_categorical[predictor] = pd.factorize(train_categorical[predictor])[0]

In [5]:
#join dataframes
train_numeric['Id'] = train_categorical['Id']
train_numeric = train_numeric.merge(train_categorical,how='inner',on='Id')

In [6]:
#fill in nan values
train_numeric = train_numeric.fillna(train_numeric.mean())

#format data for sklearn (note: 'Id' stays in X as a predictor here)
X = train_numeric.drop('Response', axis=1)
Y = train_numeric['Response']
#create training and test sets
numTestRows = 10000
#shuffle rows
shuffle = np.random.permutation(len(X))
X = X.iloc[shuffle]
Y = Y.iloc[shuffle]
Xt = X[len(X)-numTestRows:]
Yt = Y[len(Y)-numTestRows:]
X = X[:len(X)-numTestRows]
Y = Y[:len(Y)-numTestRows]
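The manual shuffle-and-slice split above could also be done with scikit-learn's `train_test_split`. The sketch below runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the real dataset). It assumes a stratified split is wanted, so that the share of positive responses is the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1000 rows, 4 features, 2% positive labels
X_demo = rng.normal(size=(1000, 4))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# Hold out 200 rows; stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=200, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())
```

Whether stratifying matters here depends on how rare the positive class is in the sample. With very few positives, an unstratified split can leave the test set with almost none of them.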

In [7]:
#train random forest
excludedForest = RandomForestClassifier(n_estimators=100)
excludedForest.fit(X,Y)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [10]:
#Evaluate forest
from sklearn.metrics import matthews_corrcoef
predictions = excludedForest.predict(Xt)
matthew = matthews_corrcoef(Yt,predictions)

In [32]:
matthew

0

An MCC of about .21... not bad.
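For reference, the Matthews correlation coefficient for a binary classifier is computed from the confusion-matrix counts (true positives $TP$, true negatives $TN$, false positives $FP$, false negatives $FN$):

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

It ranges from -1 (total disagreement) to 1 (perfect prediction), and 0 means the predictions are no better than random guessing.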

### PCA Model:

----

In [66]:
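#import data
n_rows = 10000
#count data rows (excluding the header) without leaving the file open
with open('train_numeric.csv') as f:
    n_rows_in_file = sum(1 for row in f) - 1
#randomly pick data rows to skip so that n_rows remain; row 0 (the header) is never skipped
skips = random.sample(range(1, n_rows_in_file + 1), n_rows_in_file - n_rows)
#the same skips are used for both files so that their rows line up
train_numeric = pd.read_csv('train_numeric.csv', skiprows=skips)
train_categorical = pd.read_csv('train_categorical.csv', skiprows=skips, low_memory=False)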

In [67]:
#convert categorical variables to integer codes (missing values get the code -1)
for predictor in train_categorical.columns:
    if predictor == 'Id':
        continue  #keep the real Id so it can be used as the join key
    train_categorical[predictor] = pd.factorize(train_categorical[predictor])[0]

In [68]:
#join dataframes
train_numeric['Id'] = train_categorical['Id']
train_numeric = train_numeric.merge(train_categorical,how='inner',on='Id')

In [69]:
#fill in nan values
train_numeric = train_numeric.fillna(train_numeric.mean())
train_numeric = train_numeric.drop('Id', axis=1)
#format data for sklearn
X = train_numeric.drop('Response', axis=1)
Y = train_numeric['Response']

In [70]:
from sklearn.decomposition import PCA
analysis = PCA(n_components=50)
analysis.fit(X)

PCA(copy=True, n_components=50, whiten=False)

In [74]:
newX = analysis.transform(X)
newX = pd.DataFrame(newX)
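The PCA above is fit on unscaled columns and on all rows, including the ones later held out for testing. Since PCA follows variance, a single column on a large scale (such as an integer-coded categorical) can dominate the components. Below is a minimal sketch of one alternative, on synthetic data (`X_demo` is a hypothetical stand-in): standardize first, and fit the scaler and PCA on the training rows only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 independent features, one on a much larger scale
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] *= 1000.0

X_train, X_test = X_demo[:150], X_demo[150:]

# PCA on raw values: the large-scale column takes nearly all the variance
unscaled = PCA(n_components=2).fit(X_train)

# Standardize, then PCA; both steps are fit on the training rows only
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_train)

unscaled_ratio = unscaled.explained_variance_ratio_
scaled_ratio = scaled.named_steps['pca'].explained_variance_ratio_

# The held-out rows are only transformed, never used for fitting
X_test_reduced = scaled.transform(X_test)

print(unscaled_ratio)
print(scaled_ratio)
```

Comparing `explained_variance_ratio_` before and after scaling is a quick way to see whether a few large-scale columns are driving the components.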

In [75]:
#create training and testSets
numTestRows = 2000
#shuffle rows
shuffle = np.random.permutation(len(newX))
newX = newX.iloc[shuffle]
Y = Y.iloc[shuffle]

newXT = newX[len(newX)-numTestRows:]
Yt = Y[len(Y)-numTestRows:]
newX = newX[:len(newX)-numTestRows]
Y = Y[:len(Y)-numTestRows]

In [76]:
#train random forest
excludedForest = RandomForestClassifier(n_estimators=100)
excludedForest.fit(newX,Y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [77]:
from sklearn.metrics import matthews_corrcoef
predictions = excludedForest.predict(newXT)
matthew = matthews_corrcoef(Yt,predictions)

In [78]:
matthew

0.0

Hm... this seems too bad to be accurate. Maybe I'm doing something wrong. I'll have to come back and re-examine this idea later.
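One possible explanation worth ruling out is that the forest predicts the same class for every test row. If the positive class is rare in the sample, that can easily happen, and the MCC is then reported as exactly 0.0 because its denominator vanishes. The pure-Python sketch below uses hypothetical labels to show that degenerate case. The helper names are illustrative, and the zero-denominator convention matches what scikit-learn's `matthews_corrcoef` returns.

```python
from collections import Counter
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(1, 1)], counts[(0, 0)], counts[(0, 1)], counts[(1, 0)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical test set: 3 positives among 1000 rows
y_true_demo = [1] * 3 + [0] * 997
# A classifier that never predicts the positive class
y_all_negative = [0] * 1000

print(Counter(y_all_negative))                         # how many rows fall in each predicted class
print(confusion_counts(y_true_demo, y_all_negative))   # tp, tn, fp, fn
print(mcc(y_true_demo, y_all_negative))
```

On the real data, the equivalent check would be to count the distinct values in `predictions` and in `Yt`. If either contains only one class, the 0.0 says nothing about how informative the PCA components are.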