## Overview

Long ago, in the distant, fragrant mists of time, there was a competition...
It was not just any competition.

It was a competition that challenged mere mortals to model a 20,000x200 matrix of continuous variables using only 250 training samples... without overfitting.

Data scientists ― including Kaggle's very own Will Cukierski ― competed by the hundreds. Legends were made. (Will took 5th place, and eventually ended up working at Kaggle!) People overfit like crazy. It was a Kaggle-y, data science-y madhouse.
So... we're doing it again.

### Don't Overfit II: The Overfittening
This is the next logical step in the evolution of weird competitions. Once again we have 20,000 rows of continuous variables, and a mere handful of training samples. Once again, we challenge you not to overfit. Do your best, model without overfitting, and add, perhaps, to your own legend.

In addition to bragging rights, the winner also gets swag. Enjoy!

## Data and Evaluation

### What am I predicting?
You are predicting the binary target associated with each row, without overfitting to the minimal set of training examples provided.
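To make the overfitting risk concrete, here is a minimal sketch on synthetic data. It assumes a 250 x 300 matrix of random noise with a random binary target, standing in for the competition's training set; none of these values come from the real files. A weakly regularized logistic regression fits the training rows almost perfectly, while its cross-validated AUC stays near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (assumption): 250 rows, 300 features,
# and a target that is pure noise, so there is no real signal to learn.
rng = np.random.RandomState(0)
X = rng.normal(size=(250, 300))
y = rng.randint(0, 2, size=250)

# Weak regularization (large C) lets the model memorize the training rows.
model = LogisticRegression(C=1e4, solver="liblinear")
model.fit(X, y)
train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# 5-fold cross-validated AUC estimates performance on unseen rows.
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

print(f"train AUC: {train_auc:.3f}, cross-validated AUC: {cv_auc:.3f}")
```

The gap between the two numbers is the thing to watch: with more features than rows, a high training score by itself says very little.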

### Files
- train.csv - the training set. 250 rows.
- test.csv - the test set. 19,750 rows.
- sample_submission.csv - a sample submission file in the correct format

### Columns
- id - sample id
- target - a binary target of mysterious origin.
- 0-299 - continuous variables.

*Submissions are evaluated using AUC ROC (area under the ROC curve) between the predicted target and the actual target value.*
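As a rough illustration of the metric, the sketch below computes ROC AUC with scikit-learn's `roc_auc_score` on a tiny set of hypothetical labels and scores (the arrays are made up for the example). It also shows that AUC depends only on how the scores rank the rows, not on their scale.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# AUC depends only on the ranking of the scores, so a strictly
# increasing transformation of them leaves it unchanged.
auc_rescaled = roc_auc_score(y_true, 10 * y_score - 3)

print(auc, auc_rescaled)
```

Because only the ranking matters, continuous scores such as `predict_proba(...)[:, 1]` generally serve this metric better than hard 0/1 labels, which collapse the ranking into two levels.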

In [668]:
# initialization

%reset -f

import sys

import numpy as np, pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

# ignore warnings (only if you are the kind that would code when the world is burning)
import warnings
warnings.filterwarnings('ignore')

# some options
MAX_EVALS=5
randomseed = 1 # the value for the random state used at various points in the pipeline
pd.options.display.max_rows = 1000 # specify if you want the full output in cells rather than the truncated list
pd.options.display.max_columns = 200

# to display multiple outputs in a cell without using print/display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display wd files
import os
print('folder files: ', os.listdir(), '\n')
print('envir variables: ')
%who

folder files:  ['.ipynb_checkpoints', 'main.ipynb', 'sample_submission.csv', 'submission_1.csv', 'test.csv', 'train.csv'] 

envir variables: 
InteractiveShell	 MAX_EVALS	 np	 os	 pd	 randomseed	 sklearn	 sys	 train_test_split	 


In [669]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

ytrain = train.target
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)

In [670]:
train.shape, test.shape
ytrain.value_counts()

((250, 300), (19750, 300))

1.0    160
0.0     90
Name: target, dtype: int64

In [671]:
X_train, X_test, y_train, y_test = train_test_split(train, ytrain, test_size=0.2, random_state=1, stratify=ytrain)
X_train.shape, X_test.shape

((200, 300), (50, 300))

In [672]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
#scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

train = scaler.fit_transform(train)
test = scaler.transform(test)

In [673]:
# ## pca

# from sklearn.decomposition import PCA

# pca = PCA(n_components=0.95)
# pca.fit(X_train)
# pca_train = pca.transform(X_train)
# pca_valid = pca.transform(X_test)

In [579]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

logmod = LogisticRegressionCV(cv=5, penalty='l1', class_weight='balanced', solver='liblinear', 
                              Cs=[0.07, 0.08, 0.09, 0.095, 0.099, 0.1, 0.11, 0.12])
# logmod.fit(X=X_train, y=y_train)
# logmod.score(X=X_test, y=y_test)
# logmod.C_

# pca_train / pca_valid come from the PCA cell above, which must be uncommented and run first
logmod.fit(X=pca_train, y=y_train)
logmod.score(X=pca_valid, y=y_test)
logmod.C_

LogisticRegressionCV(Cs=[0.07, 0.08, 0.09, 0.095, 0.099, 0.1, 0.11, 0.12],
           class_weight='balanced', cv=5, dual=False, fit_intercept=True,
           intercept_scaling=1.0, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l1', random_state=None, refit=True,
           scoring=None, solver='liblinear', tol=0.0001, verbose=0)

0.7

array([0.095])

In [591]:
logmodfull = LogisticRegression(C=0.1, class_weight='balanced', n_jobs=-1, random_state=5, solver='liblinear', penalty='l1')
logmodfull.fit(X_train, y_train)
logmodfull.score(X_test, y_test)
logmodfull.fit(train, ytrain)
pred=logmodfull.predict_proba(test)

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=-1, penalty='l1', random_state=5,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

0.72

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=-1, penalty='l1', random_state=5,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [674]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.utils import np_utils
from keras.utils.np_utils import to_categorical

In [676]:
X = np.array(X_train)
XV = np.array(X_test)
XX = np.array(train)
XXV = np.array(test)

Y = np_utils.to_categorical(y_train)
YV = np_utils.to_categorical(y_test)
YY = np_utils.to_categorical(ytrain)

In [686]:
from sklearn.model_selection import StratifiedKFold

seed = 7
np.random.seed(seed)
class_weight = {0: 1., 1: 1.}
input_dim = X_train.shape[1]

# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
cvscores = []
# use train_idx/test_idx so the loop does not overwrite the `train` and `test` arrays
for train_idx, test_idx in kfold.split(XX, ytrain):
    # create model
    model = Sequential()
    model.add(Dense(300, input_dim = input_dim , activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dense(1000, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(100, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(20, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(5, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dense(1, activation = 'sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model on the training fold (positional indexing into the target Series)
    model.fit(XX[train_idx], ytrain.iloc[train_idx], epochs=10, batch_size=10, verbose=1, class_weight=class_weight, shuffle=True)
    # evaluate the model on the held-out fold
    scores = model.evaluate(XX[test_idx], ytrain.iloc[test_idx], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

acc: 62.00%
acc: 64.00%
acc: 66.00%
acc: 62.00%
acc: 60.00%
62.80% (+/- 2.04%)


In [688]:
cvscores
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

[61.99999976158143,
 64.00000095367432,
 66.00000011920929,
 61.99999976158143,
 59.99999976158142]

62.80% (+/- 2.04%)


In [667]:
pred=model.predict_proba(XV)
sklearn.metrics.roc_auc_score(y_true=y_test, y_score=pred)


acc: 64.00%


0.6840277777777777

In [None]:
# random class

class rand_mod():
    
    def __init__(self, train, ytrain, test, iter=10):
        self.train = train
        self.test = test
        self.ytrain = ytrain
        self.iter = iter
        
        self.main()
        
    def main(self):
        
        self.split()  # split() takes no arguments; it reads self.train and self.ytrain
        
        return None
    
    def split(self):
        grid = {}
        mod_grid = {}
        
        for i in range(self.iter):
            x_train, x_test, y_train, y_test = train_test_split(self.train, self.ytrain)

In [166]:
# model imports

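# note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import needs an older scikit-learn release
from sklearn.linear_model import RandomizedLogisticRegression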
import xgboost as xgb

In [152]:
logitmod = RandomizedLogisticRegression(n_jobs=-1, random_state=1, selection_threshold=0.01, 
                                        sample_fraction=0.8, n_resampling=500)
logitmod.fit(X=X_train, y=y_train)

RandomizedLogisticRegression(C=1, fit_intercept=True, memory=None, n_jobs=-1,
               n_resampling=500, normalize=True, pre_dispatch='3*n_jobs',
               random_state=1, sample_fraction=0.8, scaling=0.5,
               selection_threshold=0.01, tol=0.001, verbose=False)

In [153]:
filtered_cols = logitmod.get_support(indices=True)
filtered_cols

array([  0,   4,  16,  26,  33,  39,  43,  52,  53,  63,  65,  73,  80,
        82,  89,  90,  91, 105, 108, 117, 119, 127, 129, 150, 151, 156,
       164, 165, 168, 170, 176, 180, 189, 201, 209, 217, 220, 221, 228,
       230, 237, 239, 240, 253, 272, 285, 295], dtype=int64)

In [154]:
# .ix is deprecated in pandas; filtered_cols holds integer positions, so use .iloc
XX_train = X_train.iloc[:, filtered_cols]
XX_test = X_test.iloc[:, filtered_cols]

In [157]:
mod = xgb.XGBClassifier(learning_rate=0.001, n_estimators=1000, colsample_bytree=0.5, max_depth=5, subsample=0.9,
                       eval_metric='auc', random_state=5)
mod.fit(XX_train, y_train)
mod.score(XX_test, y_test)
pred=mod.predict_proba(XX_test)
sklearn.metrics.roc_auc_score(y_true=y_test, y_score=pred[:,1])

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5, eval_metric='auc', gamma=0,
       learning_rate=0.001, max_delta_step=0, max_depth=5,
       min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=5,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.9)

0.6507936507936508

0.6565217391304348

In [172]:
mod = xgb.XGBClassifier(learning_rate=0.001, n_estimators=1000, colsample_bytree=0.2, max_depth=10, subsample=0.8,
                       eval_metric='auc', random_state=5)
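# pca_train / pca_valid come from the PCA cell near the top of the notebook (uncomment and run it first)
mod.fit(pca_train, y_train)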
mod.score(pca_valid, y_test)
pred=mod.predict_proba(pca_valid)
sklearn.metrics.roc_auc_score(y_true=y_test, y_score=pred[:,1])

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.2, eval_metric='auc', gamma=0,
       learning_rate=0.001, max_delta_step=0, max_depth=10,
       min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=5,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.8)

0.6349206349206349

0.7380434782608696

In [156]:
import xgboost as xgb

mod = xgb.XGBClassifier(learning_rate=0.01, n_estimators=1000, colsample_bytree=1, max_depth=2, subsample=0.07,
                       eval_metric='auc', random_state=5)
mod.fit(train, ytrain)
pred=mod.predict_proba(test)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, eval_metric='auc', gamma=0, learning_rate=0.01,
       max_delta_step=0, max_depth=2, min_child_weight=1, missing=None,
       n_estimators=1000, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=5, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.07)

In [458]:
submission = pd.read_csv('sample_submission.csv')
submission['target'] = pred[:,1]
submission.to_csv('submission_1.csv', index=False)