<h3 align='center'><font color='purple'>Improving on Benchmark</font></h3>

### Ideas:

----

A score of .21 is decent, but I need a .229 to reach the bronze range, and to sit safely in that range I need something closer to .25. So I need some ideas for improving the model.

1) Explore optimizing the tradeoff between # of datapoints to train on and # of predictors

2) Integrate PCA into the model, increase number of predictors.

3) Explore alternate techniques for filling NaN values.

4) Do some more detailed data exploration now that I have fewer predictors.
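
For idea 3, one alternative to plain mean filling is to fill with the median and keep a 0/1 flag recording which values were missing, so the model can still see the "was empty" signal. This is only a sketch on a toy frame: `fill_with_indicator` and the column names `f1`/`f2` are made up for illustration and are not from my earlier notebooks.

```python
import numpy as np
import pandas as pd

def fill_with_indicator(df, columns):
    """Median-fill each column and add a <col>_missing 0/1 flag column.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in columns:
        # Record missingness before filling, otherwise the information is lost.
        out[col + "_missing"] = out[col].isna().astype(np.int8)
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data standing in for a couple of numeric predictors.
toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, np.nan],
    "f2": [1.0, 2.0, np.nan, 4.0],
})
filled = fill_with_indicator(toy, ["f1", "f2"])
print(filled)
```

Whether the extra flag columns are worth the memory is something I would have to test, since they double the width of the frame.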

### Data Exploration Cont.

----

There are still too many numerical predictors for me to go through each one, think about it, analyze it, etc. Luckily there are only a few categorical variables in my model now, so doing some exploration on those is very doable.

In [55]:
#import packages
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import os

#change working directory to current directory
curDir = os.getcwd()
os.chdir(curDir)

#which columns I want to import
categoricalPredictors = ['L0_S9_F204', 'L0_S9_F179', 'L0_S9_F159', 'L0_S9_F199', 'L1_S24_F710',
       'L2_S26_F3099', 'L1_S24_F705', 'L1_S24_F675', 'L3_S32_F3851',
       'L3_S32_F3854','Id']

#import a random subset of our data
nrows = 30000
with open('train_categorical.csv') as f:
    n_rows_in_file = sum(1 for row in f) - 1  #number of data rows (excludes header)
#row 0 is the header, so sample the rows to skip from 1..n_rows_in_file
skips = random.sample(range(1, n_rows_in_file + 1), n_rows_in_file - nrows)
train_categorical = pd.read_csv('train_categorical.csv', usecols=categoricalPredictors, skiprows=skips, low_memory=False)

In [11]:
train_categorical.isnull().sum()

Id                  0
L0_S9_F159      29990
L0_S9_F179      29990
L0_S9_F199      29990
L0_S9_F204      29990
L1_S24_F675     29445
L1_S24_F705     29445
L1_S24_F710     29445
L2_S26_F3099    24165
L3_S32_F3851    29443
L3_S32_F3854    29443
dtype: int64

Okay, so these columns are empty for the vast majority of rows, despite being strong predictors.
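
To put numbers on "vast majority", the null counts in the output above can be divided by the sample size. A small sketch, assuming the frame really holds the 30,000 rows requested at import (the counts are copied by hand from that output):

```python
# Null counts copied from the isnull().sum() output above.
# nrows is the sample size requested at import (assumed to match the frame).
nrows = 30000
null_counts = {
    "L0_S9_F159": 29990,
    "L0_S9_F179": 29990,
    "L0_S9_F199": 29990,
    "L0_S9_F204": 29990,
    "L1_S24_F675": 29445,
    "L1_S24_F705": 29445,
    "L1_S24_F710": 29445,
    "L2_S26_F3099": 24165,
    "L3_S32_F3851": 29443,
    "L3_S32_F3854": 29443,
}

# Fraction of rows that are empty for each categorical column.
missing_fraction = {col: count / nrows for col, count in null_counts.items()}

# Print from least to most empty.
for col, frac in sorted(missing_fraction.items(), key=lambda kv: kv[1]):
    print(f"{col}: {frac:.2%} missing")
```

On the live frame the same fractions should come straight from `train_categorical.isnull().mean()`.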

These numbers led me to look back at my benchmarks, and I realized I do not have a large enough test set to get reliable results. I need to reimport my benchmark model and test it properly.
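
Why test-set size matters so much here: with a very rare positive class, a small test set contains only a handful of positives, so the MCC swings a lot from one random split to the next. The simulation below is a rough sketch of that effect, not a measurement of my model. The positive rate of 0.005 is an assumption, and the true/false positive rates (`tpr`, `fpr`) are made-up values for a mediocre classifier.

```python
import numpy as np

def mcc_from_counts(tp, fp, fn, tn):
    """MCC from confusion-matrix counts; 0.0 when the denominator is zero."""
    denom = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def simulate_mcc(n_test, pos_rate=0.005, tpr=0.10, fpr=0.001,
                 n_trials=200, seed=0):
    """Draw n_trials random test sets of size n_test and return their MCCs.

    pos_rate, tpr and fpr are illustrative assumptions, not fitted values.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        n_pos = rng.binomial(n_test, pos_rate)   # positives in this test set
        n_neg = n_test - n_pos
        tp = rng.binomial(n_pos, tpr)            # positives the model catches
        fp = rng.binomial(n_neg, fpr)            # negatives it flags by mistake
        scores.append(mcc_from_counts(tp, fp, n_pos - tp, n_neg - fp))
    return np.array(scores)

small = simulate_mcc(10_000)
large = simulate_mcc(400_000)
print("std of MCC with 10k test rows: ", small.std())
print("std of MCC with 400k test rows:", large.std())
```

Under these assumptions the spread of the MCC shrinks as the test set grows, which is the reason for holding out a much larger test set.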

Since these categorical variables are empty for almost all rows, I'll just use the numeric predictors, since I need the extra memory to get reliable evaluation metrics.

In [46]:
numericalPredictors = ['L0_S12_F344', 'L0_S21_F522', 'L0_S12_F330', 'L0_S23_F663',
       'L1_S25_F2164', 'L0_S21_F517', 'L1_S24_F1743', 'L0_S12_F342',
       'L1_S24_F1798', 'L0_S22_F546', 'L3_S30_F3779', 'L0_S15_F406',
       'L1_S24_F1846', 'L3_S36_F3938', 'L0_S12_F334', 'L0_S21_F507',
       'L3_S36_F3930', 'L1_S24_F1844', 'L0_S22_F571', 'L1_S25_F2484',
       'L0_S15_F415', 'L0_S23_F627', 'L3_S35_F3903', 'L0_S23_F619',
       'L2_S27_F3144', 'L0_S3_F76', 'L0_S23_F655', 'L1_S25_F2021',
       'L2_S27_F3206', 'L0_S15_F397', 'L1_S24_F1831', 'L0_S17_F433',
       'L0_S17_F431', 'L1_S24_F1773', 'L0_S18_F449', 'L0_S3_F68',
       'L1_S24_F1647', 'L0_S7_F136', 'L1_S24_F872', 'L0_S21_F532',
       'L0_S9_F175', 'L0_S12_F338', 'L0_S21_F502', 'L1_S24_F1778',
       'L0_S12_F336', 'L0_S10_F234', 'L0_S2_F56', 'L1_S25_F2051',
       'L0_S10_F239', 'L2_S26_F3040', 'L0_S22_F551', 'L0_S14_F390',
       'L0_S21_F527', 'L0_S14_F374', 'L1_S24_F1516', 'L0_S12_F332',
       'L1_S24_F1824', 'L0_S21_F537', 'L0_S2_F40', 'L1_S24_F1763',
       'L1_S24_F1812', 'L0_S23_F667', 'L1_S25_F2960', 'L1_S24_F1518',
       'L0_S14_F386', 'L1_S24_F683', 'L3_S36_F3926', 'L2_S27_F3166',
       'L1_S24_F1520', 'L0_S3_F92', 'L1_S24_F1808', 'L0_S15_F403',
       'L1_S25_F2828', 'L0_S12_F340', 'L1_S24_F1667', 'L0_S11_F298',
       'L3_S29_F3407', 'L0_S11_F310', 'L0_S19_F459', 'L0_S15_F418',
       'L0_S13_F356', 'L0_S10_F244', 'L1_S24_F1758', 'L0_S10_F249',
       'L0_S11_F314', 'L0_S14_F362', 'L0_S19_F455', 'L0_S22_F561',
       'L0_S9_F190', 'L0_S0_F14', 'L1_S24_F1512', 'L1_S24_F1829',
       'L0_S10_F274', 'L1_S24_F1265', 'L0_S9_F165', 'L1_S24_F1569',
       'L0_S10_F224', 'L1_S24_F1000', 'L0_S9_F200', 'L0_S11_F306',
       'L1_S24_F1010', 'L3_S36_F3934', 'L0_S11_F318', 'L0_S12_F350',
       'L0_S9_F170', 'L1_S24_F1637', 'L0_S10_F229', 'L1_S24_F1728',
       'L3_S30_F3709', 'L1_S24_F1498', 'L1_S24_F1733', 'L1_S24_F1848',
       'L1_S24_F1573', 'L0_S9_F180', 'L3_S35_F3894', 'L0_S10_F259',
       'L3_S30_F3684', 'L0_S9_F185', 'L3_S44_F4121', 'L0_S12_F352',
       'L3_S34_F3882', 'L0_S5_F114', 'L0_S0_F6', 'L2_S26_F3036',
       'L0_S0_F12', 'L3_S33_F3873', 'L0_S2_F48', 'L0_S6_F132',
       'L0_S11_F282', 'L1_S24_F1685', 'L0_S11_F326', 'L0_S3_F84',
       'L2_S26_F3073', 'L2_S26_F3106', 'L3_S30_F3624', 'L0_S0_F10',
       'L0_S12_F348', 'L3_S36_F3918', 'L2_S26_F3113', 'L1_S24_F1690',
       'L0_S4_F104', 'L1_S24_F1850', 'L3_S36_F3922', 'L1_S25_F2767',
       'L0_S9_F195', 'L1_S24_F1695', 'L3_S34_F3880', 'L3_S30_F3664',
       'L1_S25_F2007', 'L0_S3_F100', 'L0_S11_F286', 'L0_S0_F4',
       'L2_S26_F3047', 'L3_S33_F3869', 'L3_S29_F3412', 'L0_S0_F2',
       'L3_S29_F3404', 'L2_S26_F3121', 'L0_S6_F122', 'L0_S9_F210',
       'L2_S26_F3062', 'L2_S26_F3117', 'L2_S26_F3051', 'L3_S29_F3461',
       'L0_S9_F160', 'L0_S14_F370', 'L3_S33_F3867', 'L1_S24_F1571',
       'L0_S22_F556', 'L3_S34_F3878', 'L0_S11_F294', 'L3_S30_F3584',
       'L0_S10_F264', 'L0_S0_F0', 'L0_S10_F219', 'L3_S29_F3476',
       'L0_S16_F421','Response','Id']

In [47]:
import pandas as pd
import numpy as np
import random
with open('train_numeric.csv') as f:
    n = sum(1 for line in f) - 1 #number of records in file (excludes header)
s = 500000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #row 0 is the header, so it is never in the skip list
#pass the skip list so that only the s sampled rows are loaded
train_numeric = pd.read_csv('train_numeric.csv',usecols = numericalPredictors,skiprows = skip)

In [48]:
#fill in nan values with the column means
train_numeric = train_numeric.fillna(train_numeric.mean())

#format data for sklearn
X = train_numeric.drop('Response', axis=1)
Y = train_numeric['Response']

#create training and test sets
numTestRows = 400000
#shuffle rows
shuffle = np.random.permutation(len(X))
X = X.iloc[shuffle]
Y = Y.iloc[shuffle]
#hold out the last numTestRows shuffled rows for testing, train on the rest
Xt = X[len(X)-numTestRows:]
Yt = Y[len(Y)-numTestRows:]
X = X[:len(X)-numTestRows]
Y = Y[:len(Y)-numTestRows]

In [51]:
#train random forest

from sklearn.ensemble import RandomForestClassifier
excludedForest = RandomForestClassifier(n_estimators=100,max_depth = 8,n_jobs=-1)
excludedForest.fit(X,Y)

#Evaluate forest
from sklearn.metrics import matthews_corrcoef
predictions = excludedForest.predict(Xt)
matthew= matthews_corrcoef(Yt,predictions)

In [52]:
matthew

0.049423437741575406

Okay, with 400k test rows we get a much more stable MCC. I wasn't able to make big improvements to my forest via tuning. I feel like I'm losing a lot of data by training on less than a tenth of what I have. If I had something like twenty predictors I could train on all one million datapoints. Let's see if I really can't make PCA work.
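
The rows-versus-predictors tradeoff is mostly arithmetic on the size of the dense training matrix. A back-of-the-envelope sketch, assuming every value is stored as an 8-byte float64 and ignoring the temporary copies pandas and sklearn make along the way; the figure of 180 columns is an approximate count of my current predictor list, and `dense_size_bytes` is a hypothetical helper:

```python
def dense_size_bytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense numeric matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value

# Roughly the current setup: ~100k training rows by ~180 columns (approximate).
current = dense_size_bytes(100_000, 180)

# The alternative: all one million rows, but only twenty predictors.
reduced = dense_size_bytes(1_000_000, 20)

print(f"current: {current / 1e6:.0f} MB, reduced: {reduced / 1e6:.0f} MB")
```

Under these assumptions the two layouts land in the same memory ballpark, which is why cutting down to around twenty predictors would let me train on every row.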

From my data exploration notebook I know I was able to capture roughly 80% of the total variance in the dataset with 40 components, so 40 components is the natural starting point (note that the run below actually uses 20).
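
For reference, the "how many components for 80% of the variance" number can be read off the cumulative explained-variance curve. The sketch below does this with a plain NumPy SVD on synthetic data; `components_for_variance` is a hypothetical helper, and the toy matrix is not the Bosch data.

```python
import numpy as np

def components_for_variance(X, target=0.80):
    """Return (k, cumulative): k is the smallest number of principal
    components whose cumulative explained-variance ratio reaches target."""
    Xc = X - X.mean(axis=0)                    # PCA works on mean-centered data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
    ratio = s ** 2 / np.sum(s ** 2)            # explained-variance ratio per component
    cumulative = np.cumsum(ratio)
    # First index where the running total reaches the target, as a 1-based count.
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))
    return k, cumulative

# Synthetic stand-in: 30 features driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 30))
X_toy = latent @ mixing + 0.5 * rng.normal(size=(2000, 30))

k, cumulative = components_for_variance(X_toy)
print("components needed:", k)
print("variance captured:", cumulative[k - 1])
```

With sklearn, the same curve comes from `np.cumsum(pca.explained_variance_ratio_)` on a fitted `PCA` object.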

In [35]:
import pandas as pd
import numpy as np
import random

train_numeric = pd.read_csv('train_numeric.csv', usecols = numericalPredictors)

In [36]:
#mean-center my data first (fill NaNs with the column means, then subtract the means)
train_numeric = train_numeric.fillna(train_numeric.mean())
Y = train_numeric['Response']
X = train_numeric.drop('Response', axis=1)
X = X - X.mean()


In [37]:
from sklearn.decomposition import PCA
pca_analysis = PCA(n_components = 20)
pca_analysis.fit(X)
X = pca_analysis.transform(X)

In [38]:
#create training and testSets
X = pd.DataFrame(X)
numTestRows = 400000
#shuffle rows
shuffle = np.random.permutation(len(X))
X = X.iloc[shuffle]
Y = Y.iloc[shuffle]
Xt = X[len(X)-numTestRows:]
Yt = Y[len(Y)-numTestRows:]
X = X[:len(X)-numTestRows]
Y = Y[:len(Y)-numTestRows]

In [40]:
#train one random forest per max_depth (1 through 9) and record each test MCC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

matthew = []
for i in range(1,10):
    excludedForest = RandomForestClassifier(n_estimators=10,max_depth = i,n_jobs=-1)
    excludedForest.fit(X,Y)

    #evaluate forest on the held-out rows
    predictions = excludedForest.predict(Xt)
    matthew.append(matthews_corrcoef(Yt,predictions))

In [41]:
matthew

[0.0,
 0.0,
 0.0,
 0.0,
 -0.00012074064747062272,
 0.0,
 -0.00012074064747061976,
 -0.00012074064747062272,
 0.023839104985199817]

These scores are pretty terrible. PCA doesn't seem to be the way to go; maybe there is not a lot of multicollinearity among these variables. Let's try xgboost, since it may tolerate the missing values much better than my model.

In [26]:
import pandas as pd
import numpy as np
import random

train_numeric = pd.read_csv('train_numeric.csv',usecols = numericalPredictors)

In [27]:

#no NaN filling this time: xgboost is given the missing values as-is
X = train_numeric.drop('Response', axis=1)
Y = train_numeric['Response']
#create training and test sets
numTestRows = 400000
#shuffle rows
shuffle = np.random.permutation(len(X))
X = X.iloc[shuffle]
Y = Y.iloc[shuffle]
#hold out the last numTestRows shuffled rows for testing, train on the rest
Xt = X[len(X)-numTestRows:]
Yt = Y[len(Y)-numTestRows:]
X = X[:len(X)-numTestRows]
Y = Y[:len(Y)-numTestRows]

In [53]:
#train xgboost
matthew = []

from xgboost import XGBClassifier

excludedForest = XGBClassifier(max_depth=6, base_score=0.005, n_estimators = 40)
excludedForest.fit(X.values,Y.values)

#Evaluate the model: label a row positive when its predicted probability exceeds .18
from sklearn.metrics import matthews_corrcoef
predictions = (excludedForest.predict_proba(Xt.values)[:,1] > .18 ).astype(np.int8)
matthew.append(matthews_corrcoef(Yt,predictions))

In [54]:
matthew

[0.19094652866593009]

Okay, not bad. xgboost is doing way better, and I'm definitely homing in on the level of accuracy that can be achieved via public kernels. Too bad I don't have the parallel implementation of xgboost installed.
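
The .18 probability cutoff used above matters a lot for MCC when positives are this rare. One way to pick such a cutoff is to sweep a grid of thresholds and keep the one with the best MCC. The sketch below does that on synthetic probabilities; `mcc` and `best_threshold` are hypothetical helpers, and the data is made up rather than taken from the model. Ideally the sweep runs on a validation split kept separate from the final test rows, otherwise the reported score is optimistic.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels; 0.0 if undefined."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_threshold(y_true, proba, thresholds):
    """Return (threshold, score) for the cutoff with the highest MCC."""
    scores = [mcc(y_true, proba > t) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Synthetic validation data: ~0.5% positives, which tend to score higher.
rng = np.random.default_rng(0)
y_val = (rng.random(50_000) < 0.005).astype(int)
proba_val = 0.01 + 0.3 * y_val * rng.random(50_000) + 0.02 * rng.random(50_000)

thresholds = np.linspace(0.01, 0.5, 50)
t, score = best_threshold(y_val, proba_val, thresholds)
print("best threshold:", t, "MCC:", score)
```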