# EE397K: Data Science Lab
### Kaggle Competition, October 2017
### Rachel Chen (rjc2737)

This is a story of naivety: getting lucky, tuning parameters, trying things the proper way, and then being surprised by the truth in the end.


# Getting Lucky

### The First Attempt: Straight Jump into XGBoost

The first thing I did was look up a [tutorial](https://www.kaggle.com/sudosudoohio/stratified-kfold-xgboost-eda-tutorial-0-281/notebook) that provided a quick and dirty solution to my problem. I wanted to shotgun to XGBoost and see what it would do. The tutorial uses the gini-metric as an evaluation metric, which is closely related to AUC.
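To make the link to AUC concrete: for binary labels and untied predictions, the normalized Gini coefficient works out to 2 * AUC - 1. The sketch below is a plain-Python re-implementation of the Gini function used in this notebook next to a brute-force AUC, so it is an illustration and not the tutorial's code. The labels and scores are made-up toy values.

```python
def gini(actual, pred):
    """Unnormalized Gini, following the same steps as the NumPy version."""
    n = len(actual)
    # Sort by prediction, highest first; ties keep their original order
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    total = float(sum(actual))
    running, cum_sum = 0.0, 0.0
    for i in order:
        running += actual[i]
        cum_sum += running
    return (cum_sum / total - (n + 1) / 2.0) / n


def gini_normalized(actual, pred):
    """Scale by the Gini of a perfect ranking."""
    return gini(actual, pred) / gini(actual, actual)


def auc(actual, pred):
    """Brute-force AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [p for a, p in zip(actual, pred) if a == 1]
    neg = [p for a, p in zip(actual, pred) if a == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the competition
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

g = gini_normalized(labels, scores)
a = auc(labels, scores)
print(g, 2 * a - 1)
```

Because the two metrics are a linear rescaling of each other, a model that ranks better under one ranks better under the other.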

In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb

#Read csv into pandas
train = pd.read_csv("train_final.csv")
test = pd.read_csv("test_final.csv")

#separate the data into variables
X = train.drop(['id', 'Y'], axis=1).values
y = train.Y
test_id = test.id.values
test = test.drop('id', axis=1)

# Define the gini metric - from https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703#5897
def gini(actual, pred, cmpcol = 0, sortcol = 1):
    assert( len(actual) == len(pred) )
    all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
    all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]
    totalLosses = all[:,0].sum()
    giniSum = all[:,0].cumsum().sum() / totalLosses
    
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)
 
def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score

#Stratified KFold - used to keep the distribution of each label consistent for each training batch.
kfold = 5
skf = StratifiedKFold(n_splits=kfold, random_state=42)

#create submission file
sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = np.zeros_like(test_id)

In [6]:
#set parameters
params = {
    'min_child_weight': 10.0,
    'objective': 'binary:logistic',
    'max_depth': 7,
    'max_delta_step': 1.8,
    'colsample_bytree': 0.4,
    'subsample': 0.8,
    'eta': 0.025,
    'gamma': 0.65,
    'num_boost_round' : 700  # has no effect here: xgb.train takes the round count as its own argument (1600 below)
    }

In [8]:
#XGBoost by stratified kfold 
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print('[Fold %d/%d]' % (i + 1, kfold))
    X_train, X_valid = X[train_index], X[test_index]
    y_train, y_valid = y[train_index], y[test_index]
    # Convert our data into XGBoost format
    d_train = xgb.DMatrix(X_train, y_train)
    d_valid = xgb.DMatrix(X_valid, y_valid)
    d_test = xgb.DMatrix(test.values)
    watchlist = [(d_train, 'train'), (d_valid, 'valid')]

    # Train the model! We pass in a max of 1,600 rounds (with early stopping after 70)
    # and the custom metric (maximize=True tells xgb that higher metric is better)
    mdl = xgb.train(params, d_train, 1600, watchlist, early_stopping_rounds=70, feval=gini_xgb, maximize=True, verbose_eval=100)

    print('[Fold %d/%d Prediction:]' % (i + 1, kfold))
    
    print(mdl.best_ntree_limit)
    # Predict on our test data
    #p_test = mdl.predict(d_test)
    p_test = mdl.predict(d_test, ntree_limit=mdl.best_ntree_limit)
    sub['Y'] += p_test/kfold

[Fold 1/5]
[0]	train-error:0.062555	valid-error:0.065393	train-gini:0.621462	valid-gini:0.566921
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 70 rounds.
[100]	train-error:0.06073	valid-error:0.064594	train-gini:0.773729	valid-gini:0.699121
[200]	train-error:0.056729	valid-error:0.064294	train-gini:0.796966	valid-gini:0.700259
Stopping. Best iteration:
[188]	train-error:0.057054	valid-error:0.064894	train-gini:0.794211	valid-gini:0.700791

[Fold 1/5 Prediction:]
189
[Fold 2/5]
[0]	train-error:0.065053	valid-error:0.0665	train-gini:0.612306	valid-gini:0.614263
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 70 rounds.
[100]	train-error:0.061728	valid-error:0.0642	train-gini:0.769528	valid-gini:0.725185
Stopping. Best iteration:
[42]	train-error:0.064978	valid-error:0.0652	train-gini:0.751387	valid-gini:0.726558

[Fo

In [9]:
#write to CSV
sub.to_csv('Submission1.csv', index=False)

### The result? 
Amazing. Without preprocessing the data and using the parameters from the tutorial, I had achieved a score of **0.86793** on the public leaderboard! Man, had I found the *right* tutorial to use. To be honest, I didn't completely understand what was going on, but I knew that the key principle was to tune the parameters to better my model.

# Tuning Parameters

Because I didn't know how to use grid-search yet, I just changed the parameters by hand. [This](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) guide was helpful in understanding the meaning behind the different parameters. 

My initial attempts consisted of nudging random parameters up or down by random increments, but I realized that without a method I could not tell what had actually improved my score. I started to document everything I did to see how each parameter affected my public leaderboard score.

### My method:
1. See the results from the previous submission and pick a parameter to tune.
2. Change the number slightly. If past results showed that making this parameter bigger yielded worse results, make the parameter smaller and see what happens.
3. Plug into the pipeline above to generate a submission file.
4. Submit and see what the score would be.
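The record-keeping behind these steps can be sketched as a small log that stores each parameter set with the score it produced. The function names and the scores below are hypothetical placeholders, not my actual leaderboard results.

```python
# Hypothetical hand-tuning log: one entry per submission.
log = []


def record(params, public_score):
    """Store a copy of the parameters with the score they produced."""
    log.append({"params": dict(params), "score": public_score})


def best_entry():
    """Return the logged entry with the highest public score."""
    return max(log, key=lambda entry: entry["score"])


base = {"min_child_weight": 10.0, "max_depth": 7}

# Placeholder scores, for illustration only
record(base, 0.80)
record(dict(base, min_child_weight=9.5), 0.81)
record(dict(base, min_child_weight=9.3), 0.82)

best = best_entry()
print(best["params"], best["score"])
```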

![](http://i.markdownnotes.com/image_WbP04sd.jpg)
*tuning parameters randomly and seeing what would happen*

The tutorial recommended tuning min_child_weight and max_depth first because these parameters make the biggest difference. I saw the results of this when I was randomly tuning parameters. So instead of tuning a couple of parameters at the same time, I started on min_child_weight.

![](http://i.markdownnotes.com/image_dq5dOLO.jpg)

### Tuning
Observation:
* It was interesting for me to observe that the **score was not a linear (or even monotonic) function of min_child_weight**. My first attempt at 9.5 was good, then I tried 9.3, which yielded better results, but 9.2 was not better than 9.3. In order from best to worst performance based on min_child_weight with all other factors held constant: 9.3, 9.5, 9.2, 9.4, 9.35.

Takeaway:
* Though I had not used grid-search yet, my hand implementation showed me how much the parameters available in a grid-search can matter. The score has local minima and maxima as a function of min_child_weight, so to *really* tune parameters, being precise can make a big difference in the end score.

Other Parameters:

After I found that 9.3 was the sweet spot for min_child_weight, I moved on to tune max_depth, and then messed with eta, gamma, and subsample. To my disappointment, changing these parameters from what I originally used did not improve the score, so I stayed with the parameters from the tutorial. Later I tried to account for overfitting by playing around with the lambda value, but this did not improve my score on the public leaderboard either.

Tuning colsample_bytree started to make an improvement. Just like min_child_weight, the score was not a linear function of colsample_bytree. 0.55 yielded the best results, followed in decreasing order of score by 0.5, 0.58, 0.48, and 0.6. I tracked this in my record, as noted by the highlighted boxes in the table below.

### Blending...ish
Another technique that I used was taking the average of the top submissions and seeing how that fared on the public leaderboard. I used the average of submissions [25,36,37,38,24,27,28], and this placed right in the middle of the submissions I had averaged. Nothing too exciting.
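The averaging itself is simple. Here is a minimal sketch, assuming each submission is held as a mapping from id to predicted probability; the three toy submissions are invented for illustration.

```python
def blend(submissions):
    """Average the predicted probability for each id across submissions.

    `submissions` is a list of dicts mapping id -> predicted probability;
    every dict is assumed to cover the same ids.
    """
    count = float(len(submissions))
    ids = submissions[0].keys()
    return {i: sum(s[i] for s in submissions) / count for i in ids}


# Toy predictions for three hypothetical submissions
sub_a = {1: 0.10, 2: 0.80, 3: 0.40}
sub_b = {1: 0.20, 2: 0.70, 3: 0.50}
sub_c = {1: 0.30, 2: 0.90, 3: 0.30}

blended = blend([sub_a, sub_b, sub_c])
print(blended)
```

Averaging submissions that all come from near-identical XGBoost settings mostly smooths out small differences, which is consistent with the blend landing in the middle of the pack.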

![](http://i.markdownnotes.com/image_1a2E30f.jpg)

### Getting Around the Submissions Limit
After submitting a collection of submissions, I pulled up the numbers of the submissions that had fared the best and compared them with other submissions that had not done as well. Below: the left cluster has better scores than the right cluster of columns.
![](http://i.markdownnotes.com/image_Kzb0TDC.png)
I noted that there was a general area in which the entries would fall for each sample. As I tested my parameters, I would check those results against the results of entries that got top scores. If they were too different, I would not bother submitting them to Kaggle. This strategy helped me test parameters without wasting too many submissions, but it also surely led to overfitting, as I would soon discover.
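I did this check by eye, but the same idea can be sketched in code: compare a candidate's predictions to the average of the top submissions and only submit if the gap is small. The function names, the 0.05 tolerance, and the toy numbers are all assumptions made for this illustration.

```python
def mean_abs_diff(candidate, reference):
    """Average absolute gap between two equal-length prediction lists."""
    assert len(candidate) == len(reference)
    total = sum(abs(c - r) for c, r in zip(candidate, reference))
    return total / float(len(candidate))


def worth_submitting(candidate, top_submissions, tolerance=0.05):
    """True if the candidate stays close to the average of the top submissions.

    `tolerance` is an arbitrary placeholder threshold.
    """
    count = float(len(top_submissions))
    # Per-sample average over the top submissions
    reference = [sum(column) / count for column in zip(*top_submissions)]
    return mean_abs_diff(candidate, reference) <= tolerance


# Toy predictions for three samples
top = [[0.10, 0.80, 0.40], [0.12, 0.78, 0.42]]
close = [0.11, 0.79, 0.43]
far = [0.50, 0.30, 0.90]

print(worth_submitting(close, top), worth_submitting(far, top))
```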

### Takeaways from Hand-Tuning
Hand-tuning, in retrospect, was not the most efficient system for long-term results. Essentially I was doing my own version of grid-search.

Pros:
* I was able to get a pretty immediate score on Kaggle and move up the leaderboard from my hand-tuning attempts. I did pretty well on the public leaderboard with this method.

Cons:
* This was a pretty laborious process that took most of the week, as I just went from (budding) intuition and hand-tuned randomly until I could identify a pattern.
* By hand-tuning one parameter at a time, I could be missing out on how parameters work together. For example, say that with B held fixed, A = 0.6 beats A = 0.5 (a score of 0.9 versus 0.8), and with A held fixed, B = 0.6 beats B = 0.5 (0.8 versus 0.7). One-at-a-time tuning would settle on 0.6 for both, yet A = 0.5 and B = 0.5 together might yield 0.95, a combination I would never have tried.
* Since I did not calculate the local AUC score and banked on the result I got from the public leaderboard, I was in danger of overfitting when I adjusted my parameters based on responses from Kaggle.
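The interaction problem in the second point can be shown with a toy score table. The numbers are invented to illustrate the idea: tuning A and then B from a starting point of (0.6, 0.6) never visits the combination that a joint grid finds.

```python
from itertools import product

# Invented scores for each (A, B) combination
SCORES = {
    (0.5, 0.5): 0.95,
    (0.5, 0.6): 0.80,
    (0.6, 0.5): 0.70,
    (0.6, 0.6): 0.90,
}
A_VALUES = [0.5, 0.6]
B_VALUES = [0.5, 0.6]


def one_at_a_time(start_b):
    """Tune A with B fixed at its start value, then tune B with A fixed."""
    best_a = max(A_VALUES, key=lambda a: SCORES[(a, start_b)])
    best_b = max(B_VALUES, key=lambda b: SCORES[(best_a, b)])
    return (best_a, best_b), SCORES[(best_a, best_b)]


def joint_grid():
    """Score every (A, B) combination and keep the best."""
    best = max(product(A_VALUES, B_VALUES), key=lambda ab: SCORES[ab])
    return best, SCORES[best]


coordinate_result = one_at_a_time(start_b=0.6)
grid_result = joint_grid()
print(coordinate_result, grid_result)
```

The trade-off is cost: the joint grid grows multiplicatively with every parameter added, while one-at-a-time tuning grows additively.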

### Let's Try Grid-Search

I found a [tutorial](https://cambridgespark.com/content/tutorials/hyperparameter-tuning-in-xgboost/index.html) that had some code for grid-search. I tested it out to see what would happen, using the gini-metric for consistency with my previous method.


In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb

#Read csv into pandas
train = pd.read_csv("train_final.csv")
test = pd.read_csv("test_final.csv")
#Verify that it is correct
train.head()
test.head()

X = train.drop(['id', 'Y'], axis=1).values
y = train.Y
test_id = test.id.values
test = test.drop('id', axis=1)

# Define the gini metric - from https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703#5897
def gini(actual, pred, cmpcol = 0, sortcol = 1):
    assert( len(actual) == len(pred) )
    all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
    all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]
    totalLosses = all[:,0].sum()
    giniSum = all[:,0].cumsum().sum() / totalLosses
    
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)
 
def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.1, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'min_child_weight': 9.3,
    'objective': 'binary:logistic',
    'max_depth': 8,
    'max_delta_step': 2,
    'colsample_bytree': 0.4,
    'subsample': 0.8,
    'eta': 0.02,
    'gamma': 0.65,
    'num_boost_round' : 700
    }

gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(5,12,1)
    for min_child_weight in range(5,12,1)
]

In [14]:
# Define initial best params and gini
min_gini = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))

    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        seed=42,
        nfold=5,
        feval=gini_xgb,
        early_stopping_rounds=10
    )

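    # Update best gini
    # (note: Gini is higher-is-better, but min()/argmin() and the `<` test
    # below keep the lowest value seen, as one would for an error metric)
    mean_gini = cv_results['test-gini-mean'].min()
    boost_rounds = cv_results['test-gini-mean'].argmin()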
    print("\tGini {} for {} rounds".format(mean_gini, boost_rounds))
    if mean_gini < min_gini:
        min_gini = mean_gini
        best_params = (max_depth,min_child_weight)

print("Best params: {}, {}, Gini: {}".format(best_params[0], best_params[1], min_gini))

CV with max_depth=5, min_child_weight=5
	Gini 0.6434962 for 0 rounds
CV with max_depth=5, min_child_weight=6
	Gini 0.6440618 for 0 rounds
CV with max_depth=5, min_child_weight=7
	Gini 0.644752 for 0 rounds
CV with max_depth=5, min_child_weight=8
	Gini 0.6445916 for 0 rounds
CV with max_depth=5, min_child_weight=9
	Gini 0.6446492 for 0 rounds
CV with max_depth=5, min_child_weight=10
	Gini 0.6448934 for 0 rounds
CV with max_depth=5, min_child_weight=11
	Gini 0.645039 for 0 rounds
CV with max_depth=6, min_child_weight=5
	Gini 0.6439912 for 0 rounds
CV with max_depth=6, min_child_weight=6
	Gini 0.643363 for 0 rounds
CV with max_depth=6, min_child_weight=7
	Gini 0.6436786 for 0 rounds
CV with max_depth=6, min_child_weight=8
	Gini 0.6433024 for 0 rounds
CV with max_depth=6, min_child_weight=9
	Gini 0.6432454 for 0 rounds
CV with max_depth=6, min_child_weight=10
	Gini 0.6441316 for 0 rounds
CV with max_depth=6, min_child_weight=11
	Gini 0.6445514 for 0 rounds
CV with max_depth=7, min_child_we

The results were not as successful as I had hoped. When I plugged in the parameters from the grid-search, I did not get a better score; instead, it fell within the bottom half of my total results. This was still the case when I tried different seeds for splitting up my data. I got different results each time, so I averaged the numbers for min_child_weight and max_depth, and the results were subpar. This incentivized me to keep hand-tuning because those results were faring much better.
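One thing worth checking in the loop above: Gini (like AUC) is a higher-is-better metric, while the loop keeps the smallest value through min, argmin, and the less-than comparison, a pattern that suits error metrics such as MAE. That may be part of why the suggested parameters underperformed, though this is a guess and not something the notebook confirms. The sketch below uses only the values visible in the truncated output, so it covers max_depth 5 and 6 only.

```python
# Mean CV Gini values copied from the printed output, keyed by
# (max_depth, min_child_weight).
cv_gini = {
    (5, 5): 0.6434962, (5, 6): 0.6440618, (5, 7): 0.644752,
    (5, 8): 0.6445916, (5, 9): 0.6446492, (5, 10): 0.6448934,
    (5, 11): 0.645039,
    (6, 5): 0.6439912, (6, 6): 0.643363, (6, 7): 0.6436786,
    (6, 8): 0.6433024, (6, 9): 0.6432454, (6, 10): 0.6441316,
    (6, 11): 0.6445514,
}

# What a "keep the smallest" rule selects (appropriate for error metrics)
lowest = min(cv_gini, key=cv_gini.get)

# What a "keep the largest" rule selects (appropriate for Gini or AUC)
highest = max(cv_gini, key=cv_gini.get)

print("min rule picks", lowest, "max rule picks", highest)
```

In xgb.cv itself, the analogous change would be to pass maximize=True alongside the custom metric and to read the results with max and argmax. The "for 0 rounds" lines in the output are consistent with the minimum sitting at the first boosting round, and since no num_boost_round is passed to xgb.cv, it runs only its default of 10 rounds.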

In [17]:
# Define initial best params and auc
min_auc = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))

    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        seed=42,
        nfold=5,
        metrics='auc',
        early_stopping_rounds=10
    )

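    # Update best auc
    # (note: AUC is higher-is-better, but min()/argmin() and the `<` test
    # below keep the lowest value seen)
    mean_auc = cv_results['test-auc-mean'].min()
    boost_rounds = cv_results['test-auc-mean'].argmin()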
    print("\tAUC {} for {} rounds".format(mean_auc, boost_rounds))
    if mean_auc < min_auc:
        min_auc = mean_auc
        best_params = (max_depth,min_child_weight)

print("Best params: {}, {}, AUC: {}".format(best_params[0], best_params[1], min_auc))

CV with max_depth=5, min_child_weight=5
	AUC 0.8237166 for 0 rounds
CV with max_depth=5, min_child_weight=6
	AUC 0.8240056 for 0 rounds
CV with max_depth=5, min_child_weight=7
	AUC 0.8243488 for 0 rounds
CV with max_depth=5, min_child_weight=8
	AUC 0.824265 for 0 rounds
CV with max_depth=5, min_child_weight=9
	AUC 0.8243008 for 0 rounds
CV with max_depth=5, min_child_weight=10
	AUC 0.8244192 for 0 rounds
CV with max_depth=5, min_child_weight=11
	AUC 0.8244942 for 0 rounds
CV with max_depth=6, min_child_weight=5
	AUC 0.8239604 for 0 rounds
CV with max_depth=6, min_child_weight=6
	AUC 0.8236588 for 0 rounds
CV with max_depth=6, min_child_weight=7
	AUC 0.8237926 for 0 rounds
CV with max_depth=6, min_child_weight=8
	AUC 0.8235952 for 0 rounds
CV with max_depth=6, min_child_weight=9
	AUC 0.8235738 for 0 rounds
CV with max_depth=6, min_child_weight=10
	AUC 0.824039 for 0 rounds
CV with max_depth=6, min_child_weight=11
	AUC 0.8242538 for 0 rounds
CV with max_depth=7, min_child_weight=5
	AUC 0

In testing with AUC instead of the gini-metric, the results were about the same for min_child_weight and max_depth.

# Trying Things The Proper Way

I knew there was a better way to do things than to just plug numbers into the parameters of a single XGBoost model. My plan for the next steps was this:

1. Data preprocessing - inspect and clean the data.
2. Use other models like Logistic Regression and Random Forest Classifier and tune parameters with grid-search.
3. Ensemble the results from these models (include XGBoost) and run another XGBoost model on top of that.

## Data Preprocessing

In [18]:
import numpy as np
import pandas as pd
from scipy.stats import skew

#Read csv into pandas
train = pd.read_csv("train_final.csv")
test = pd.read_csv("test_final.csv")
#Verify that it is correct
train.head()
test.head()

X = train.drop(['id', 'Y'], axis=1)
y = train.Y
train_id = train.id.values
test_id = test.id.values
test = test.drop('id', axis=1)

X_ = X.copy()
y_ = test.copy()  # despite the name, y_ holds the test-set features, not labels

#### Find the empty values

I knew that some cells were missing, but which ones?

In [None]:
print(X_.isnull().sum())
print(y_.isnull().sum())

Interesting, just two columns, and thankfully the same columns for each data set.

#### Check what kind of data each column contains. If categorical, replace with mode value. If numeric, replace with mean.

In [None]:
print(X_.F5.value_counts())
print(X_.F19.value_counts())

X_.F5 = X_.F5.replace(r'\s+', np.nan, regex=True).fillna(0)
X_.F19 = X_.F19.replace(r'\s+', np.nan, regex=True).fillna(X_.F19.mean())

y_.F5 = y_.F5.replace(r'\s+', np.nan, regex=True).fillna(0)
y_.F19 = y_.F19.replace(r'\s+', np.nan, regex=True).fillna(y_.F19.mean())

#### Then inspect each feature and determine if it is categorical or numeric data. 
I inspected the value_counts list of each feature. If a feature took 15 or fewer distinct integer values, each falling into a clear bucket of counts, I treated it as categorical. If the values were not whole numbers and/or nearly every value appeared only once, I treated it as numeric.
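I applied this rule by eye, but it can be written down as a small heuristic. The function name and the max_levels default are assumptions for this sketch; the F1 and F3 values are a few of those shown in the value_counts output in this section.

```python
def looks_categorical(values, max_levels=15):
    """Heuristic from the text: few distinct values, all whole numbers."""
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    return all_whole and len(distinct) <= max_levels


# Distinct values of F1, and a handful of F3 values
f1_values = [1, 2, 3, 4, 5, 6, 18, 12, 8]
f3_values = [0.281078, -0.006354, 0.690687, -0.139957]

print(looks_categorical(f1_values), looks_categorical(f3_values))
```

A whole-number feature with many distinct levels fails the check and is treated as numeric.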

In [20]:
for i in range(len(X_.columns)):
    print(X['F'+str(i+1)].value_counts())

1     48227
2      1484
3       218
4        43
5        15
6         8
18        1
12        1
8         1
Name: F1, dtype: int64
0     47470
1      1905
2       379
3       104
98       87
4        32
5         9
6         6
96        3
11        1
9         1
7         1
Name: F2, dtype: int64
 0.281078    1
-0.006354    1
 0.690687    1
-0.139957    1
 0.503359    1
 0.111420    1
-0.102351    1
-0.054156    1
 0.035681    1
-0.138120    1
 0.105642    1
-0.079776    1
 0.266861    1
 1.373554    1
 0.248705    1
 0.811418    1
 0.011395    1
 0.225882    1
 0.912682    1
-0.036151    1
 0.153582    1
 0.362630    1
 0.974042    1
 0.188194    1
 0.053269    1
 0.738982    1
 0.168041    1
 0.881221    1
 0.629370    1
 0.225687    1
            ..
 0.013587    1
 0.012951    1
 0.200457    1
 1.473349    1
 0.120619    1
-0.017810    1
 0.230897    1
 0.315938    1
 0.960116    1
 0.062209    1
 0.373453    1
 0.924025    1
 0.456454    1
-0.086199    1
 0.365165    1
 0.562710   

#### Feature Breakdown: categorical/numeric
F1: categorical <br />
F2: categorical <br />
F3: numeric <br />
F4: categorical <br />
F5: categorical <br />
F6: numeric <br />
F7: categorical <br />
F8: categorical <br />
F9: numeric <br />
F10: categorical <br />
F11: numeric <br />
F12: categorical <br />
F13: categorical <br />
F14: categorical <br />
F15: categorical <br />
F16: numeric <br />
F17: categorical <br />
F18: numeric <br />
F19: numeric <br />
F20: categorical <br />
F21: numeric <br />
F22: numeric <br />
F23: numeric <br />
F24: categorical <br />
F25: categorical <br />
F26: numeric <br />
F27: numeric <br />

#### With the categorical/numeric distinctions, I one-hot encoded categorical data, and log transformed numeric data if it was skewed.
To note, I had to combine the test and train sets into one data set to one-hot encode the categorical features. On my first attempt, when I encoded the test and train sets separately, the sizes of the resulting one-hot encoded tables did not match. One set can hold categories that the other lacks, so the two sets have to be encoded together.

For numeric data I checked whether the skewness was greater than 0.75. If so, I reduced the skew by applying a log transform, log(1 + x).
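The column mismatch is easy to reproduce on a toy example. The sketch below is plain Python with invented values, not the pandas code used in this notebook: encoding each set against its own categories gives tables of different widths, while encoding both against the combined category list keeps them aligned.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 row with one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Toy column: the test set contains a level (18) the train set never saw
train_col = [1, 2, 1, 3]
test_col = [2, 18]

# Encoding separately: 3 columns for train, 2 for test
train_only = one_hot(train_col, sorted(set(train_col)))
test_only = one_hot(test_col, sorted(set(test_col)))

# Encoding against the combined category list: 4 columns for both
shared = sorted(set(train_col) | set(test_col))
train_enc = one_hot(train_col, shared)
test_enc = one_hot(test_col, shared)

print(len(train_only[0]), len(test_only[0]), len(train_enc[0]), len(test_enc[0]))
```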

In [22]:
all_data = pd.concat([X_, y_], ignore_index=True)

categorical_array_ = [2, 4, 5, 7, 8, 10, 12, 13, 14, 15, 17, 20, 24, 25] # F1 is handled separately just below

F1_OH = pd.get_dummies(all_data['F1'], prefix='F1')
all_data = all_data.join(F1_OH).drop(['F1'], axis = 1)

for feat in categorical_array_:
    term = str('F'+str(feat))
    dummies = pd.get_dummies(all_data[term], prefix=term)
    all_data = all_data.join(dummies).drop([term], axis = 1)

processed_X = all_data.iloc[:49998, :]
processed_y = all_data.iloc[49998:, :]

processed_X['idx'] = train_id
processed_X = processed_X.set_index('idx')
processed_X.index.name = None
processed_y['idx'] = test_id
processed_y = processed_y.set_index('idx')
processed_y.index.name = None

# log transform skewed numeric features
# https://www.kaggle.com/apapiu/regularized-linear-models
numeric_array = [3, 6, 9, 11, 16, 18, 19, 21, 22, 23, 26, 27]

for feat in numeric_array:
    term = str('F'+str(feat))
    if processed_X[term].skew() > 0.75:
        processed_X[term] = np.log1p(processed_X[term])
    if processed_y[term].skew() > 0.75:
        processed_y[term] = np.log1p(processed_y[term])
        
print(processed_X)
print(processed_y)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


             F3        F6        F9  F11       F16  F18        F19       F21  \
1      0.102174  0.693147  0.693147   32  1.018847  104  10.039023  0.000000   
2      0.133093  2.079442  0.693147   44  1.842136  144   9.341456  0.000000   
3      0.400330  8.344267  1.386294   32  1.018847  112   8.483430  0.000000   
4     -0.054486  1.098612  1.791759   46  1.512927  127   8.086718  0.693147   
5      0.548582  2.484907  0.693147   35  1.018847  148   8.294300  0.693147   
6      0.332169  4.356709  0.693147   44  1.842136  122   9.201905  1.386294   
7     -0.139516  1.945910  1.098612   40  1.018847  120   9.251194  1.386294   
8      0.737748  1.945910  0.693147   45  2.287471  113   8.732466  0.000000   
9      0.056452  2.397895  0.693147   42  1.018847  110   8.006701  0.000000   
10     0.751011  0.693147  3.178054   45  1.842136  127   8.804793  0.000000   
11     0.178506  1.386294  1.386294   34  1.018847  126   8.483430  0.000000   
12    -0.044231  0.693147  1.098612   49

## Feature Reduction

Now with 207 columns I thought about applying feature reduction with PCA or forward/backward feature selection. I realize that reducing and eliminating highly correlated features may have improved the results; however, 207 columns is not that many to process, so I decided to skip this step.

## Other Models: Logistic Regression, Random Forests

I wanted to try other models besides XGBoost to see how they would perform.

In [43]:
from sklearn.grid_search import GridSearchCV

from xgboost.sklearn import XGBClassifier

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score  
from sklearn.model_selection import cross_val_predict


### Logistic Regression

Using [this](https://stackoverflow.com/questions/40667856/why-doesnt-gridsearchcv-give-c-with-highest-auc-when-scoring-roc-auc-in-logisti) example, I implemented Logistic Regression. First I did a grid-search to find the best C parameter. The value that performed best was 0.1.

In [24]:
lr = LogisticRegression(penalty = 'l1')
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf = GridSearchCV(lr, parameters, scoring='roc_auc', cv = 5, n_jobs=-1, verbose=5)
clf.fit(processed_X, y)
clf.grid_scores_,clf.best_params_, clf.best_score_

Fitting 5 folds for each of 7 candidates, totalling 35 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   30.7s
[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:  6.5min finished


([mean: 0.63363, std: 0.00874, params: {'C': 0.001},
  mean: 0.82948, std: 0.00818, params: {'C': 0.01},
  mean: 0.83590, std: 0.00853, params: {'C': 0.1},
  mean: 0.83521, std: 0.00907, params: {'C': 1},
  mean: 0.83263, std: 0.00895, params: {'C': 10},
  mean: 0.83088, std: 0.00898, params: {'C': 100},
  mean: 0.83030, std: 0.00834, params: {'C': 1000}],
 {'C': 0.1},
 0.8358992195351744)

In [27]:
lr = LogisticRegression(penalty = 'l1', C=0.1)
lr.fit(processed_X, y)
ypred_test = lr.predict_proba(processed_y)

sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = ypred_test[:,1]
sub.to_csv('SubLog1.csv', index=False)

#### Results:
Using that value in my model, I trained it on the train data and predicted the results on the test set. This yielded a score of **0.84145** AUC from Kaggle, much lower than my XGBoost attempts.

### Random Forest

Similarly, I used [this](https://stackoverflow.com/questions/30102973/how-to-get-best-estimator-on-gridsearchcv-random-forest-classifier-scikit) and [this](https://www.kaggle.com/giovannibruner/randomforest-with-gridsearchcv) to help me apply the Random Forest model to my train and test sets. I used grid-search to find the optimal max_depth, min_samples_split, n_estimators, min_samples_leaf, max_features, and criterion parameters.
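It helps to count how much work a grid like this implies before launching it. The sketch below multiplies out the parameter grid used in this section; it relies only on itertools and does not touch scikit-learn.

```python
from itertools import product

# The grid used for the Random Forest search in this section
parameters = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "min_samples_split": [2, 3, 4, 5, 6],
    "n_estimators": [10],
    "min_samples_leaf": [1, 2, 3, 4, 5],
    "max_features": (4, 5, 6, "sqrt"),
    "criterion": ("gini", "entropy"),
}
cv_folds = 5

# Every combination of values is one candidate; each is fit once per fold
n_candidates = len(list(product(*parameters.values())))
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

The counts match the "Fitting 5 folds for each of 2200 candidates, totalling 11000 fits" line in the search output, which is why the search was run with only 10 trees per forest.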

In [29]:
parameters = {"max_depth": [2,3,4,5,6,7,8,9,10,11,12]
            ,"min_samples_split" :[2,3,4,5,6]
            ,"n_estimators" : [10]
            ,"min_samples_leaf": [1,2,3,4,5]
            ,"max_features": (4,5,6,"sqrt")
            ,"criterion": ('gini','entropy')}

rf_regr = RandomForestClassifier()
model = GridSearchCV(rf_regr, parameters, scoring='roc_auc', cv = 5, n_jobs=-1, verbose=5)
model_fit = model.fit(processed_X,y)

learned_parameters = model_fit.best_params_

Fitting 5 folds for each of 2200 candidates, totalling 11000 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   33.6s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:   50.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 2170 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 3034 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 3520 tasks      | elapsed:  9.3min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 10.9min
[Parallel(n_jobs=-1)]: Done 4600 tasks      | elapsed: 12.6min

In [30]:
rfc = RandomForestClassifier(max_depth = learned_parameters["max_depth"]
                            ,max_features = learned_parameters['max_features']
                            ,min_samples_leaf = learned_parameters['min_samples_leaf']
                            ,min_samples_split = learned_parameters['min_samples_split']
                            ,criterion = learned_parameters['criterion']
                            ,n_estimators = 5000
                            ,n_jobs = -1)
rfc.fit(processed_X,y)
ypred_test = rfc.predict_proba(processed_y)

sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = ypred_test[:,1]
sub.to_csv('subRF1a.csv', index=False)

#### Results
Plugging these parameters into the model, then training and predicting, yielded a **0.86066** AUC score from Kaggle. Better than Logistic Regression, but worse than my XGBoost model.

### XGBoost
Then on this new preprocessed train and test set I applied XGBoost referencing [this](https://www.kaggle.com/phunter/xgboost-with-gridsearchcv) and [this](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/). I applied grid-search piece by piece, looking for the best max_depth and min_child_weight, then gamma and learning_rate.

In [31]:
#tuning max_depth and min_child
param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5, verbose=5)
gsearch1.fit(processed_X, y)
gsearch1.best_params_, gsearch1.best_score_ 

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed: 20.8min finished


({'max_depth': 3, 'min_child_weight': 5}, 0.8597534544340567)

In [32]:
xgb1 = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=5, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
xgb1.fit(processed_X,y)
ypred_test = xgb1.predict_proba(processed_y)

sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = ypred_test[:,1]
sub.to_csv('subXGB1.csv', index=False)

##### First Results
I was curious about the results of this grid-search so I submitted it onto Kaggle -- **0.86493** AUC score. Not too bad, better than Logistic Regression and Random Forest, but not as good as when I had meticulously hand-tuned my first XGBoost submissions. Maybe more tuning will help. I moved onto tuning gamma and learning_rate.

In [33]:
param_test3 = {
 'gamma':[i/10.0 for i in range(0,7)],
 'learning_rate':[0.01, 0.02, 0.025, 0.03, 0.1, 0.15, 0.2]

}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=5, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=-1, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5, verbose=5)
gsearch3.fit(processed_X, y)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

Fitting 5 folds for each of 49 candidates, totalling 245 fits


[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed: 11.9min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed: 28.4min
[Parallel(n_jobs=4)]: Done 245 out of 245 | elapsed: 44.7min finished


([mean: 0.84589, std: 0.00473, params: {'learning_rate': 0.01, 'gamma': 0.0},
  mean: 0.85140, std: 0.00613, params: {'learning_rate': 0.02, 'gamma': 0.0},
  mean: 0.85416, std: 0.00615, params: {'learning_rate': 0.025, 'gamma': 0.0},
  mean: 0.85551, std: 0.00672, params: {'learning_rate': 0.03, 'gamma': 0.0},
  mean: 0.85975, std: 0.00770, params: {'learning_rate': 0.1, 'gamma': 0.0},
  mean: 0.85803, std: 0.00761, params: {'learning_rate': 0.15, 'gamma': 0.0},
  mean: 0.85647, std: 0.00733, params: {'learning_rate': 0.2, 'gamma': 0.0},
  mean: 0.84589, std: 0.00473, params: {'learning_rate': 0.01, 'gamma': 0.1},
  mean: 0.85140, std: 0.00613, params: {'learning_rate': 0.02, 'gamma': 0.1},
  mean: 0.85416, std: 0.00615, params: {'learning_rate': 0.025, 'gamma': 0.1},
  mean: 0.85551, std: 0.00672, params: {'learning_rate': 0.03, 'gamma': 0.1},
  mean: 0.85975, std: 0.00770, params: {'learning_rate': 0.1, 'gamma': 0.1},
  mean: 0.85804, std: 0.00762, params: {'learning_rate': 0.15, 'g

In [48]:
xgb3 = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=5, gamma=0.2, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=-1, scale_pos_weight=1,seed=27)
xgb3.fit(processed_X, y)
ypred_test = xgb3.predict_proba(processed_y)

sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = ypred_test[:,1]
sub.to_csv('subsXGB3.csv', index=False)

##### Secondary Results
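With these parameters, the submission yielded a **0.86493** AUC score from Kaggle. This was the same as the other submission! Interesting, though perhaps not surprising: the model kept learning_rate at 0.1 and only changed gamma, which barely moved the cross-validated scores in the grid output. I could see the power behind grid search, but it takes so dang long. At this point in the competition I didn't have much time left for fine-tuning, so I moved on to try another technique.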
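To get a feel for the cost before launching a search like the one above, the number of fits can be counted up front. This is a minimal sketch that reuses the grid from `param_test3` and the 5-fold setting. `ParameterGrid` is scikit-learn's helper for enumerating a grid, and the variable names are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as param_test3 above: 7 gamma values x 7 learning rates.
param_grid = {
    'gamma': [i / 10.0 for i in range(0, 7)],
    'learning_rate': [0.01, 0.02, 0.025, 0.03, 0.1, 0.15, 0.2],
}
n_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination in the grid is one candidate; each candidate is fit once per fold.
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_folds
print("%d candidates x %d folds = %d fits" % (n_candidates, n_folds, n_fits))
```

At the 44.7 minutes logged above, that works out to roughly 11 seconds of wall-clock time per fit with `n_jobs=4`.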

#### Ensembling
Taking the predicted results from the three models, I wanted to apply XGBoost on top of this ensemble. By combining different models, I hoped the strengths of each would make up for the weaknesses of the others and produce a more successful result. First, however, I needed to create a new set of training data. I used stratified k-fold to generate folds within my preprocessed data; in each fold I trained the Logistic Regression, Random Forest, and XGBoost models from above on the training part and predicted on the held-out part, so that every training row ends up with a prediction from each of the three models.

##### Creating the Training and Testing Sets

In [35]:
from sklearn.model_selection import StratifiedKFold

kfold = 5
skf = StratifiedKFold(n_splits=kfold, random_state=42)

In [36]:
# create dataframe to hold new training set
ensem = pd.DataFrame()

ensem['logr'] = np.zeros_like(train_id)
ensem['ranf'] = np.zeros_like(train_id)
ensem['xgb'] = np.zeros_like(train_id)

ensem['idx'] = train_id
ensem = ensem.set_index('idx')

In [37]:
# split data and test and train based on the models used above
for (train, test) in skf.split (processed_X, y):
    xtrain = processed_X.iloc[train]
    xtest = processed_X.iloc[test]
    ytrain = y[train]
    
    ensem_lr = LogisticRegression(penalty = 'l1', C=0.1, n_jobs=-1)

    ensem_rfc = RandomForestClassifier(max_depth = 10
                            ,max_features = 'sqrt'
                            ,min_samples_leaf = 2
                            ,min_samples_split = 2
                            ,criterion = 'entropy'
                            ,n_estimators = 10
                            ,n_jobs = -1)
    
    ensem_xgb3 = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
    min_child_weight=5, gamma=0.2, subsample=0.8, colsample_bytree=0.8,
    objective= 'binary:logistic', nthread=-1, scale_pos_weight=1,seed=27)
    
    ensem_lr.fit(xtrain, ytrain)
    ensem_rfc.fit(xtrain, ytrain)
    ensem_xgb3.fit(xtrain, ytrain)
    
    # fill the held-out predictions into the new training dataframe
    ensem.iloc[test, 0] = (ensem_lr.predict_proba(xtest)[:,1])
    ensem.iloc[test, 1] = (ensem_rfc.predict_proba(xtest)[:,1])
    ensem.iloc[test, 2] = (ensem_xgb3.predict_proba(xtest)[:,1])

In [38]:
# put together test data results
# (ensem_lr, ensem_rfc, and ensem_xgb3 here are the models fitted in the last fold)
ensem_y = pd.DataFrame()

ensem_y['logr'] = ensem_lr.predict_proba(processed_y)[:,1]
ensem_y['ranf'] = ensem_rfc.predict_proba(processed_y)[:,1]
ensem_y['xgb'] = ensem_xgb3.predict_proba(processed_y)[:,1]

ensem_y['idx'] = test_id
ensem_y = ensem_y.set_index('idx')
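One caveat with the cell above: after the loop finishes, `ensem_lr`, `ensem_rfc`, and `ensem_xgb3` are the models from the final fold, so the test features come from models trained on four fifths of the data. A common alternative is to average each fold's test predictions. The sketch below shows that idea on synthetic data with a single logistic regression; the data, the variable names, and the choice of `shuffle=True` are all assumptions for illustration, not what was run for the competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (hypothetical); not the competition data.
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, y_tr = X_all[:500], y_all[:500]
X_te = X_all[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_pred = np.zeros(len(X_tr))   # out-of-fold predictions: meta-training feature
test_pred = np.zeros(len(X_te))  # fold-averaged predictions: meta-test feature

for fit_idx, holdout_idx in skf.split(X_tr, y_tr):
    model = LogisticRegression(solver='liblinear')
    model.fit(X_tr[fit_idx], y_tr[fit_idx])
    # Each training row is predicted by a model that never saw it.
    oof_pred[holdout_idx] = model.predict_proba(X_tr[holdout_idx])[:, 1]
    # Each fold's model contributes an equal share of the test prediction.
    test_pred += model.predict_proba(X_te)[:, 1] / skf.get_n_splits()
```

With three base models, the same loop would fill three columns instead of one; refitting each base model on all of the training data before predicting the test set is another option.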

In [39]:
ensem.head()

idx,logr,ranf,xgb
1,0.04896,0.023159,0.0278
2,0.016549,0.012434,0.004198
3,0.102611,0.101756,0.120191
4,0.029428,0.025134,0.021824
5,0.020677,0.050055,0.042095


In [40]:
ensem_y.head()

idx,logr,ranf,xgb
49999,0.149194,0.212361,0.149641
50000,0.021118,0.009416,0.006431
50001,0.016782,0.005733,0.006382
50002,0.019481,0.02456,0.037562
50003,0.045361,0.09643,0.065868


#### XGBoost and Grid-Search on the Ensemble
I applied the same techniques as above to the ensemble to find the parameters for the XGBoost model, using my new training and test sets. This time, I also calculated the local AUC score and examined its variance across folds before submitting to Kaggle.

First, tuning for max_depth and min_child_weight:

In [41]:
param_test4 = {
 'max_depth':range(3,10,1),
 'min_child_weight':range(1,6,1)
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=-1,iid=False, cv=5, verbose=5)
gsearch4.fit(ensem, y)
gsearch4.best_params_, gsearch4.best_score_ 

Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   57.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 175 out of 175 | elapsed:  3.4min finished


({'max_depth': 3, 'min_child_weight': 3}, 0.8585557045305536)

In [44]:
# Note: the grid search above picked min_child_weight=3; this check uses 2.
var = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
results = cross_val_score(var, ensem, y, scoring ='roc_auc', cv=5)
print results.mean(), results.std()

0.858287635941 0.00786573295246


Tuning max_depth and min_child_weight gave a local score (about 0.8583) that was lower than the public leaderboard score (0.86493) I got from the first grid-searched XGBoost on the processed, unensembled data. Perhaps more grid-search tuning would help.

In [45]:
param_test5 = {
 'gamma':[i/10.0 for i in range(0,5)],
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test5, scoring='roc_auc',n_jobs=-1,iid=False, cv=5, verbose=5)
gsearch5.fit(ensem, y)
gsearch5.best_params_, gsearch5.best_score_ 

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   51.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:  4.7min finished


({'colsample_bytree': 0.7, 'gamma': 0.3, 'subsample': 0.7}, 0.8584540722729062)

In [46]:
# Note: the grid search above picked gamma=0.3; this check uses 0.1.
var = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=2, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
results = cross_val_score(var, ensem, y, scoring ='roc_auc', cv=5)
print results.mean(), results.std()

0.858328381414 0.00736395846966


Tuning gamma, subsample, and colsample_bytree improved the local score a bit, but not by much (0.85829 to 0.85833). Let's just plug it into Kaggle and see what the public score says.

In [47]:
xgb4 = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=3,
 min_child_weight=2, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
xgb4.fit(ensem,y)
ypred_test = xgb4.predict_proba(ensem_y)

sub = pd.DataFrame()
sub['id'] = test_id
sub['Y'] = ypred_test[:,1]
sub.to_csv('XGB4.csv', index=False)

#### End Result

Ensembling the three models with XGBoost had a local AUC estimate of 0.858328, but it actually performed better on Kaggle with a score of **0.86347**.

# Final Thoughts

* My first attempts with hand-tuning the crap out of a single XGBoost model based on Kaggle responses gave me the best results (and also overconfidence and a false sense of security). I was able to get into second position on the public leaderboard with this method.
* After preprocessing, using grid-search, logistic regression and random forest models, and ensembling, the AUC scores did not improve.

After the private leaderboard became available, my position dropped from 2 to 18, much to my dismay. I realized I had probably overfitted to the public leaderboard, because my first method relied so much on using the public scores from Kaggle to tune my XGBoost model.

If I were to do this again, I would spend more time meticulously tuning the parameters of the logistic regression, random forest, and XGBoost models I used in my ensemble, and then more time tuning the parameters of the XGBoost I used on top of the ensemble. I would also check the local AUC score of each model and analyze how the local AUC scores differ from the public Kaggle scores.
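A minimal sketch of how that local check could look: set aside a holdout split as a stand-in for the leaderboard, then compare the cross-validated AUC with the holdout AUC. The synthetic data, the 30% split, and the use of scikit-learn's `GradientBoostingClassifier` in place of XGBoost are all assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical); not the competition data.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Keep 30% aside and never tune against it, like a private leaderboard.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=27)

# Local estimate: 5-fold cross-validated AUC on the development split.
cv_scores = cross_val_score(model, X_dev, y_dev, scoring='roc_auc', cv=5)

# "Leaderboard" estimate: fit on the development split, score the holdout.
model.fit(X_dev, y_dev)
holdout_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

print("CV AUC: %.4f +/- %.4f" % (cv_scores.mean(), cv_scores.std()))
print("Holdout AUC: %.4f" % holdout_auc)
```

If the holdout AUC lands within a standard deviation or so of the cross-validated mean, the local estimate is probably trustworthy; a large gap hints at overfitting to whatever was used for tuning.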

All in all, I am proud of my progress and how much I learned through this assignment. Though it was stressful at times, it was also fun!