# Don't overfit 2!

This is my approach to Kaggle's Don't Overfit! II competition. It is an interesting problem: we are given only 250 training samples but 19,750 test samples. The goal is to build a model robust enough that it does not overfit the training samples.

While the 'traditional' ML models perform very well, with logistic regression the best so far at a little above 0.8, I try to push Hinton's CapsNet to see whether it can perform as well or even better. For the best performance I expect an improvement from the right blend of CapsNet, logistic regression, and perhaps a few more models. Here we go!

# What is happening so far
* The best model so far is logistic regression.
* Permutation importance gives a good idea of which features matter, but fitting only on the most important features leads to fast overfitting.
* Bootstrapping a small number of pseudo-labelled test rows into the training set helps the model generalize, but adding more than about 20 makes the results too unstable.

Current Plan:
* CapsNet
* blending of models other than log reg
* stacknet?

In [77]:
import matplotlib.pyplot as plt
import cv2
import pandas as pd
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.layers import Layer
from keras.layers import *
from keras.metrics import *
from keras.models import Model
from keras.callbacks import *
from keras.optimizers import *
from keras.applications import *
from keras import activations
from keras import utils
from keras.regularizers import l2

from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score, GridSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection, linear_model
from sklearn.metrics import accuracy_score, roc_auc_score
import gc
gc.enable()
gc.collect()

445

In [125]:
train_df = pd.read_csv("/Users/JoonH/dont-overfit-ii/train.csv")
test_df = pd.read_csv("/Users/JoonH/dont-overfit-ii/test.csv")

In [126]:
train_df.head()

Unnamed: 0,id,target,0,1,2,3,4,5,6,7,...,290,291,292,293,294,295,296,297,298,299
0,0,1.0,-0.098,2.165,0.681,-0.614,1.309,-0.455,-0.236,0.276,...,0.867,1.347,0.504,-0.649,0.672,-2.097,1.051,-0.414,1.038,-1.065
1,1,0.0,1.081,-0.973,-0.383,0.326,-0.428,0.317,1.172,0.352,...,-0.165,-1.695,-1.257,1.359,-0.808,-1.624,-0.458,-1.099,-0.936,0.973
2,2,1.0,-0.523,-0.089,-0.348,0.148,-0.022,0.404,-0.023,-0.172,...,0.013,0.263,-1.222,0.726,1.444,-1.165,-1.544,0.004,0.8,-1.211
3,3,1.0,0.067,-0.021,0.392,-1.637,-0.446,-0.725,-1.035,0.834,...,-0.404,0.64,-0.595,-0.966,0.9,0.467,-0.562,-0.254,-0.533,0.238
4,4,1.0,2.347,-0.831,0.511,-0.021,1.225,1.594,0.585,1.509,...,0.898,0.134,2.415,-0.996,-1.006,1.378,1.246,1.478,0.428,0.253


In [127]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Columns: 302 entries, id to 299
dtypes: float64(301), int64(1)
memory usage: 589.9 KB


In [250]:
test_df.head()

Unnamed: 0,id,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,250,0.5,-1.033,-1.595,0.309,-0.714,0.502,0.535,-0.129,-0.687,...,-0.088,-2.628,-0.845,2.078,-0.277,2.132,0.609,-0.104,0.312,0.979
1,251,0.776,0.914,-0.494,1.347,-0.867,0.48,0.578,-0.313,0.203,...,-0.683,-0.066,0.025,0.606,-0.353,-1.133,-3.138,0.281,-0.625,-0.761
2,252,1.75,0.509,-0.057,0.835,-0.476,1.428,-0.701,-2.009,-1.378,...,-0.094,0.351,-0.607,-0.737,-0.031,0.701,0.976,0.135,-1.327,2.463
3,253,-0.556,-1.855,-0.682,0.578,1.592,0.512,-1.419,0.722,0.511,...,-0.336,-0.787,0.255,-0.031,-0.836,0.916,2.411,1.053,-1.601,-1.529
4,254,0.754,-0.245,1.173,-1.623,0.009,0.37,0.781,-1.763,-1.432,...,2.184,-1.09,0.216,1.186,-0.143,0.322,-0.068,-0.156,-1.153,0.825


Our data has 300 continuous variables. Let's see if we can get any understanding of it through some EDA.

# Some EDA
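
As a starting point for this section, here is a minimal sketch of the kind of summary one might compute. It runs on synthetic standard-normal data as a stand-in for `train.csv`; the arrays `X` and `y` are placeholders, not the competition data, and nothing printed here is a result about the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in with the same shape as the training set: 250 rows, 300 features.
X = rng.normal(size=(250, 300))
y = (rng.random(250) < 0.5).astype(float)

# Per-feature mean and standard deviation.
feature_means = X.mean(axis=0)
feature_stds = X.std(axis=0)

# Pearson correlation of every feature with the target.
X_centered = X - feature_means
y_centered = y - y.mean()
corr = (X_centered * y_centered[:, None]).sum(axis=0) / (
    np.sqrt((X_centered ** 2).sum(axis=0)) * np.sqrt((y_centered ** 2).sum())
)

# The ten features most correlated (in absolute value) with the target.
top10 = np.argsort(-np.abs(corr))[:10]
print("mean of means:", feature_means.mean(), "mean of stds:", feature_stds.mean())
print("top features:", top10)
print("their correlations:", corr[top10])
```

Because the target here is pure noise, whatever correlations show up in the top ten are chance effects. With only 250 rows, that is a useful reminder to treat any such ranking on the real data with caution.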

# Our model(s)

We will use a CapsNet-inspired NN and a logistic regression model. For the NN we will also try the idea of self-normalizing networks (SNN), and then blend its output probabilities with those of the logreg model.
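
One way the blending step could look is sketched below. Since AUC depends only on the ordering of predictions, both models' outputs are first converted to ranks so they sit on the same scale. The helper names `rank_normalize` and `blend`, the 0.7 weight, and the example vectors are all assumptions made for illustration.

```python
import numpy as np

def rank_normalize(p):
    """Map predictions to their ranks, scaled into [0, 1]."""
    ranks = np.argsort(np.argsort(p))
    return ranks / (len(p) - 1)

def blend(pred_a, pred_b, weight_a=0.7):
    """Weighted average of two rank-normalized prediction vectors."""
    return weight_a * rank_normalize(pred_a) + (1.0 - weight_a) * rank_normalize(pred_b)

# Hypothetical outputs of the two models for five test rows.
p_logreg = np.array([0.38, 0.33, 0.29, 0.43, 0.25])
p_capsnet = np.array([0.90, 0.20, 0.40, 0.80, 0.10])

blended = blend(p_logreg, p_capsnet, weight_a=0.7)
print(blended)
```

The blended values are no longer probabilities, only scores whose order matters, which is enough for an AUC-scored submission.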

In [109]:
y_train = train_df['target']
x_train = train_df.drop(['target', 'id'], axis = 1)
x_test = test_df.drop(['id'], axis = 1)

In [114]:
x_train_nn = x_train.astype('float32')
# expand the float32 copy (not x_train), otherwise the cast above is discarded
x_train_nn = np.expand_dims(x_train_nn.values, axis = -1)

In [115]:
#250 samples with each sample containing 300 variables, we expand dims such that it will fit our NN model
x_train_nn.shape

(250, 300, 1)
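
The reshaping step can be checked in isolation. This small sketch uses a zero-filled placeholder array (`features` is a hypothetical name) to show the cast to `float32` followed by the added channel axis that `Conv1D` expects.

```python
import numpy as np

# Stand-in for the (250, 300) feature table.
features = np.zeros((250, 300), dtype=np.float64)

# Cast first, then add a trailing channel axis so Conv1D sees (steps, channels) = (300, 1).
features_nn = np.expand_dims(features.astype("float32"), axis=-1)
print(features_nn.shape, features_nn.dtype)
```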

In [172]:
#Training data for the logreg model
x_train.shape

(250, 300)

In [117]:
# We will use basic capsule implementation provided by Keras

# The squashing function.
# We use 0.5 instead of the 1 in Hinton's paper, so an input of norm n is
# mapped to an output of norm n^2 / (0.5 + n^2) (up to K.epsilon()).
# That is always below 1 and, for any nonzero n, larger than the
# n^2 / (1 + n^2) of the original formulation.
def squash(x, axis=-1):
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    scale = K.sqrt(s_squared_norm) / (0.5 + s_squared_norm)
    return scale * x


# define our own softmax function instead of K.softmax
# because K.softmax can not specify axis.
def softmax(x, axis=-1):
    ex = K.exp(x - K.max(x, axis=axis, keepdims=True))
    return ex / K.sum(ex, axis=axis, keepdims=True)


# define the margin loss like hinge loss
def margin_loss(y_true, y_pred):
    lamb, margin = 0.5, 0.1
    return K.sum(y_true * K.square(K.relu(1 - margin - y_pred)) + lamb * (
        1 - y_true) * K.square(K.relu(y_pred - margin)), axis=-1)


class Capsule(Layer):
    """A Capsule Implement with Pure Keras
    There are two versions of Capsule.
    One is like dense layer (for the fixed-shape input),
    and the other is like timedistributed dense (for various length input).

    The input shape of Capsule must be (batch_size,
                                        input_num_capsule,
                                        input_dim_capsule
                                       )
    and the output shape is (batch_size,
                             num_capsule,
                             dim_capsule
                            )

    Capsule Implement is from https://github.com/bojone/Capsule/
    Capsule Paper: https://arxiv.org/abs/1710.09829
    """

    def __init__(self,
                 num_capsule,
                 dim_capsule,
                 routings=3,
                 share_weights=True,
                 activation='squash',
                 **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.share_weights = share_weights
        if activation == 'squash':
            self.activation = squash
        else:
            self.activation = activations.get(activation)

    def build(self, input_shape):
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.kernel = self.add_weight(
                name='capsule_kernel',
                shape=(1, input_dim_capsule,
                       self.num_capsule * self.dim_capsule),
                initializer='glorot_uniform',
                trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.kernel = self.add_weight(
                name='capsule_kernel',
                shape=(input_num_capsule, input_dim_capsule,
                       self.num_capsule * self.dim_capsule),
                initializer='glorot_uniform',
                trainable=True)

    def call(self, inputs):
        """Following the routing algorithm from Hinton's paper,
        but replace b = b + <u,v> with b = <u,v>.

        This change can improve the feature representation of Capsule.

        However, you can replace
            b = K.batch_dot(outputs, hat_inputs, [2, 3])
        with
            b += K.batch_dot(outputs, hat_inputs, [2, 3])
        to realize a standard routing.
        """

        if self.share_weights:
            hat_inputs = K.conv1d(inputs, self.kernel)
        else:
            hat_inputs = K.local_conv1d(inputs, self.kernel, [1], [1])

        batch_size = K.shape(inputs)[0]
        input_num_capsule = K.shape(inputs)[1]
        hat_inputs = K.reshape(hat_inputs,
                               (batch_size, input_num_capsule,
                                self.num_capsule, self.dim_capsule))
        hat_inputs = K.permute_dimensions(hat_inputs, (0, 2, 1, 3))

        b = K.zeros_like(hat_inputs[:, :, :, 0])
        for i in range(self.routings):
            c = softmax(b, 1)
            o = self.activation(K.batch_dot(c, hat_inputs, [2, 2]))
            if i < self.routings - 1:
                b = K.batch_dot(o, hat_inputs, [2, 3])
                if K.backend() == 'theano':
                    o = K.sum(o, axis=1)

        return o

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)
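
To make the routing loop easier to follow, here is a NumPy re-statement of what `call` computes once `hat_inputs` has shape `(batch, num_capsule, input_num_capsule, dim_capsule)`. It is a sketch for inspection only, not a replacement for the Keras layer; the function name `route` and the `accumulate` flag (which switches to the standard `b += <u,v>` update) are assumptions introduced here.

```python
import numpy as np

def squash(x, axis=-1, eps=1e-7):
    """Same squashing as above: scale = ||x|| / (0.5 + ||x||^2)."""
    sq = np.sum(np.square(x), axis=axis, keepdims=True) + eps
    return np.sqrt(sq) / (0.5 + sq) * x

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def route(hat_inputs, routings=3, accumulate=False):
    """Routing by agreement over predictions of shape (batch, n_out, n_in, dim)."""
    b = np.zeros(hat_inputs.shape[:3])                 # routing logits (batch, n_out, n_in)
    for i in range(routings):
        c = softmax(b, axis=1)                         # each input spreads itself over outputs
        o = squash(np.einsum("bni,bnid->bnd", c, hat_inputs))
        if i < routings - 1:
            agreement = np.einsum("bnd,bnid->bni", o, hat_inputs)
            b = b + agreement if accumulate else agreement
    return o

rng = np.random.default_rng(0)
hat = rng.normal(size=(4, 2, 278, 32))                 # 2 output capsules for the demo
out = route(hat)
norms = np.linalg.norm(out, axis=-1)
print(out.shape, norms.max())

# With a single output capsule the softmax over axis 1 is identically 1,
# so the number of routing iterations does not change the result.
hat_single = hat[:, :1]
print(np.allclose(route(hat_single, routings=1), route(hat_single, routings=3)))
```

One consequence worth noting: when there is only one output capsule, as in `Capsule(1, 32, 3, True)` used later in this notebook, the coupling coefficients are all 1 and the layer reduces to a squashed sum of the input predictions.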

In [118]:
class Mask(Layer):
    """
    Mask a Tensor with shape=[None, num_capsule, dim_vector] either by the capsule with max length or by an additional 
    input mask. Except the max-length capsule (or specified capsule), all vectors are masked to zeros. Then flatten the
    masked Tensor.
    For example:
        ```
        x = keras.layers.Input(shape=[8, 3, 2])  # batch_size=8, each sample contains 3 capsules with dim_vector=2
        y = keras.layers.Input(shape=[8, 3])  # True labels. 8 samples, 3 classes, one-hot coding.
        out = Mask()(x)  # out.shape=[8, 6]
        # or
        out2 = Mask()([x, y])  # out2.shape=[8,6]. Masked with true labels y. Of course y can also be manipulated.
        ```
    """
    def call(self, inputs, **kwargs):
        if type(inputs) is list:  # true label is provided with shape = [None, n_classes], i.e. one-hot code.
            assert len(inputs) == 2
            inputs, mask = inputs
        else:  # if no true label, mask by the max length of capsules. Mainly used for prediction
            # compute lengths of capsules
            x = K.sqrt(K.sum(K.square(inputs), -1))
            # generate the mask which is a one-hot code.
            # mask.shape=[None, n_classes]=[None, num_capsule]
            mask = K.one_hot(indices=K.argmax(x, 1), num_classes=x.get_shape().as_list()[1])

        # inputs.shape=[None, num_capsule, dim_capsule]
        # mask.shape=[None, num_capsule]
        # masked.shape=[None, num_capsule * dim_capsule]
        masked = K.batch_flatten(inputs * K.expand_dims(mask, -1))
        return masked

    def compute_output_shape(self, input_shape):
        if type(input_shape[0]) is tuple:  # true label provided
            return tuple([None, input_shape[0][1] * input_shape[0][2]])
        else:  # no true label provided
            return tuple([None, input_shape[1] * input_shape[2]])

    def get_config(self):
        config = super(Mask, self).get_config()
        return config
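
The masking rule used when no true label is supplied can be illustrated with plain NumPy. This is a sketch of the same idea, not the layer itself; `mask_by_longest` is a hypothetical helper name.

```python
import numpy as np

def mask_by_longest(capsules):
    """Zero out every capsule except the longest one per sample, then flatten."""
    lengths = np.sqrt(np.sum(np.square(capsules), axis=-1))           # (batch, num_capsule)
    one_hot = np.eye(capsules.shape[1])[np.argmax(lengths, axis=1)]   # (batch, num_capsule)
    return (capsules * one_hot[:, :, None]).reshape(len(capsules), -1)

# One sample with 3 capsules of dimension 2; the middle capsule has length 1.0.
caps = np.array([[[0.1, 0.0], [0.6, 0.8], [0.3, 0.0]]])
masked = mask_by_longest(caps)
print(masked)
```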

In [192]:
reg = l2(0.5)

inputs = Input(shape = (300,1))
x = Conv1D(256, (10), activation='elu', kernel_initializer = 'glorot_normal', kernel_regularizer=reg)(inputs)
x = Conv1D(128, (10), activation='elu', kernel_initializer = 'glorot_normal', kernel_regularizer=reg)(x)
#x = AveragePooling2D((2, 2))(x)
#x = Conv1D(96, (5), activation='elu', kernel_initializer = 'glorot_normal', kernel_regularizer=reg)(x)
x = Conv1D(64, (5), activation='elu', kernel_initializer = 'glorot_normal', kernel_regularizer=reg)(x)


capsule = Capsule(1, 32, 3, True)(x)
cap = Lambda(lambda z: K.sqrt(K.sum(K.square(z), 2)))(capsule)

#decoder
#y = Input(shape=(2,))
#masked_by_y = Mask()([capsule,y])
#masked = Mask()(capsule)

#decoder = models.Sequential(name = 'decoder')
#decoder.add(Dense(128, activation='relu', input_dim = 16*2))
#decoder.add(Dense(256, activation='relu'))
#decoder.add(Dense(300, activation = 'sigmoid'))

In [193]:
model = Model(inputs, cap)
#model = models.Model([inputs,y],[cap, decoder(masked_by_y)]) #model for training

#eval_model = models.Model(inputs, [cap, decoder(masked)]) #eval_model for prediction

#cap = Flatten()(cap)
#drop = DropConnect(Dense(32, activation="elu"), prob=0.5)(cap)
#output = Dense(1, activation = 'sigmoid') (drop)


# we use a margin loss
model.compile(loss=[margin_loss], optimizer='adamax', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_45 (InputLayer)        (None, 300, 1)            0         
_________________________________________________________________
conv1d_114 (Conv1D)          (None, 291, 256)          2816      
_________________________________________________________________
conv1d_115 (Conv1D)          (None, 282, 128)          327808    
_________________________________________________________________
conv1d_116 (Conv1D)          (None, 278, 64)           41024     
_________________________________________________________________
capsule_36 (Capsule)         (None, 1, 32)             2048      
_________________________________________________________________
lambda_36 (Lambda)           (None, 1)                 0         
Total params: 373,696
Trainable params: 373,696
Non-trainable params: 0
_________________________________________________________________
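
The shapes and parameter counts in the summary can be reproduced by hand, which is a quick check that the layers are wired as intended. The sketch below assumes stride 1 and `'valid'` padding (the Keras `Conv1D` defaults); the helper names are made up for this example.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 'valid' Conv1D with stride 1."""
    return length - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

layers_spec = [(10, 1, 256), (10, 256, 128), (5, 128, 64)]  # (kernel, in_channels, filters)
length, total = 300, 0
for kernel_size, in_channels, filters in layers_spec:
    length = conv1d_out_len(length, kernel_size)
    n_params = conv1d_params(kernel_size, in_channels, filters)
    total += n_params
    print(length, filters, n_params)

# Shared-weight capsule kernel: (1, input_dim_capsule, num_capsule * dim_capsule), no bias.
capsule_params = 1 * 64 * (1 * 32)
total += capsule_params
print(capsule_params, total)
```

That is roughly 374K trainable parameters fitted on 250 samples, which is why the heavy L2 regularization matters here.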


In [144]:
# NOTE: eval_model only exists when the commented-out decoder lines above are enabled.
eval_model.compile(loss=[margin_loss, 'mse'], optimizer='adamax', metrics=['accuracy'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_42 (InputLayer)           (None, 300, 1)       0                                            
__________________________________________________________________________________________________
conv1d_108 (Conv1D)             (None, 291, 256)     2816        input_42[0][0]                   
__________________________________________________________________________________________________
conv1d_109 (Conv1D)             (None, 282, 128)     327808      conv1d_108[0][0]                 
__________________________________________________________________________________________________
conv1d_110 (Conv1D)             (None, 278, 64)      41024       conv1d_109[0][0]                 
__________________________________________________________________________________________________
capsule_34

In [145]:
#CapsNet-Keras https://github.com/XifengGuo/CapsNet-Keras

# Stratified K fold

In [113]:
from sklearn.model_selection import StratifiedKFold

n_fold = 20
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=42)
repeated_folds = RepeatedStratifiedKFold(n_splits=n_fold, n_repeats=20, random_state=42)
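
To see what these splitters produce on a dataset of this size, the sketch below runs them on synthetic labels. The 160/90 class split is an assumption chosen for the demo, not the real class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold

# Synthetic labels: 250 samples, 160 positives (the ratio is an assumption).
y_demo = np.array([1] * 160 + [0] * 90)
X_demo = np.zeros((250, 1))

skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
valid_sizes, valid_pos = [], []
for _, valid_idx in skf.split(X_demo, y_demo):
    valid_sizes.append(len(valid_idx))
    valid_pos.append(int(y_demo[valid_idx].sum()))
print(valid_sizes)
print(valid_pos)

# 20 folds repeated 20 times gives 400 train/validation splits.
rskf = RepeatedStratifiedKFold(n_splits=20, n_repeats=20, random_state=42)
n_total_splits = rskf.get_n_splits(X_demo, y_demo)
print(n_total_splits)
```

With 20 folds each validation set holds only about a dozen rows, so per-fold AUC is noisy and the standard deviation across folds should be expected to be large.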

In [111]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Training models

In [84]:
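def train_model(X, X_test, y, params, folds=folds, model_type='lgb', plot_feature_importance=False, averaging='usual', model=None):
    # Only the 'sklearn' branch is exercised in this notebook. The 'lgb', 'xgb', 'cat' and 'glm'
    # branches need lightgbm (lgb), xgboost (xgb), catboost and statsmodels (sm) imported first.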
    oof = np.zeros(len(X))
    prediction = np.zeros(len(X_test))
    scores = []
    feature_importance = pd.DataFrame()
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
        # print('Fold', fold_n, 'started at', time.ctime())
        X_train, X_valid = X[train_index], X[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        
        if model_type == 'lgb':
            train_data = lgb.Dataset(X_train, label=y_train)
            valid_data = lgb.Dataset(X_valid, label=y_valid)
            
            model = lgb.train(params,
                    train_data,
                    num_boost_round=2000,
                    valid_sets = [train_data, valid_data],
                    verbose_eval=500,
                    early_stopping_rounds = 200)
            
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test, num_iteration=model.best_iteration)
            
        if model_type == 'xgb':
            train_data = xgb.DMatrix(data=X_train, label=y_train, feature_names=X_tr.columns)
            valid_data = xgb.DMatrix(data=X_valid, label=y_valid, feature_names=X_tr.columns)

            watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
            model = xgb.train(dtrain=train_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=500, params=params)
            y_pred_valid = model.predict(xgb.DMatrix(X_valid, feature_names=X_tr.columns), ntree_limit=model.best_ntree_limit)
            y_pred = model.predict(xgb.DMatrix(X_test, feature_names=X_tr.columns), ntree_limit=model.best_ntree_limit)
        
        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000,  eval_metric='AUC', **params)
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True, verbose=False)

            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test)
            
            
        if model_type == 'sklearn':
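            model.fit(X_train, y_train)
            # NOTE: predict() returns hard 0/1 labels, so the fold AUC below is computed on labels;
            # predict_proba(X_valid)[:, 1] would score the ranking instead.
            y_pred_valid = model.predict(X_valid).reshape(-1,)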
            score = roc_auc_score(y_valid, y_pred_valid)
            # print(f'Fold {fold_n}. AUC: {score:.4f}.')
            # print('')
            
            y_pred = model.predict_proba(X_test)[:, 1]
            
        if model_type == 'glm':
            model = sm.GLM(y_train, X_train, family=sm.families.Binomial())
            model_results = model.fit()
            model_results.predict(X_test)
            y_pred_valid = model_results.predict(X_valid).reshape(-1,)
            score = roc_auc_score(y_valid, y_pred_valid)
            
            y_pred = model_results.predict(X_test)
        
        oof[valid_index] = y_pred_valid.reshape(-1,)
        scores.append(roc_auc_score(y_valid, y_pred_valid))

        if averaging == 'usual':
            prediction += y_pred
        elif averaging == 'rank':
            prediction += pd.Series(y_pred).rank().values  
        
        if model_type == 'lgb':
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = X.columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

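    # Average over the number of splits actually used (the global n_fold can differ from folds).
    prediction /= folds.get_n_splits()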
    
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    
    if model_type == 'lgb':
        feature_importance["importance"] /= n_fold
        if plot_feature_importance:
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index

            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

            plt.figure(figsize=(16, 12));
            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False));
            plt.title('LGB Features (avg over folds)');
        
            return oof, prediction, feature_importance
        return oof, prediction, scores
    
    else:
        return oof, prediction, scores
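
In the `'sklearn'` branch the validation AUC is computed from `model.predict`, which returns hard labels. The sketch below, on synthetic data from `make_classification` (an assumption, not the competition data), shows that AUC on hard labels is exactly the balanced accuracy at the default threshold, whereas AUC on probabilities scores the full ranking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

labels = clf.predict(X_val)                 # hard 0/1 predictions
proba = clf.predict_proba(X_val)[:, 1]      # continuous scores

auc_labels = roc_auc_score(y_val, labels)
auc_proba = roc_auc_score(y_val, proba)
bal_acc = balanced_accuracy_score(y_val, labels)
print(auc_labels, bal_acc, auc_proba)
```

This likely explains part of the gap between the `GridSearchCV` score (which uses `scoring='roc_auc'` on continuous scores) and the CV mean printed by `train_model`.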

In [370]:
model = linear_model.LogisticRegression(max_iter = 10000)
parameter_grid = {'solver': ['liblinear'],
                  'penalty': ['l1'],
                  'C': [0.1,0.105,0.11,0.115],
                  'class_weight': ['balanced', None]
                 }

grid_search = GridSearchCV(model, param_grid=parameter_grid, cv=folds, scoring='roc_auc', n_jobs=-1)
grid_search.fit(x_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.8035238095238096
Best parameters: {'C': 0.115, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
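
The small `C` values in the grid go together with the L1 penalty: stronger regularization drives most coefficients to exactly zero. The sketch below illustrates this on synthetic data in which only the first five of 300 features carry signal; that data-generating setup is an assumption made for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 300))
# Assumption for the demo: only the first 5 features carry signal.
logits = X_demo[:, :5] @ np.array([1.5, -1.5, 1.0, -1.0, 1.0])
y_demo = (logits + rng.normal(scale=0.5, size=250) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X_demo)
nonzero_counts = {}
for C in (0.05, 0.115, 1.0):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", max_iter=10000)
    clf.fit(X_scaled, y_demo)
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))
    print(C, nonzero_counts[C])
```

Counting the nonzero entries of `coef_` on the real data would show how many of the 300 features the chosen model actually uses.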




In [383]:
model = linear_model.LogisticRegression(class_weight='balanced', penalty='l1', C=0.11, solver='liblinear')
oof_lr, prediction_lr, scores = train_model(x_train, x_test, y_train, params=None, model_type='sklearn', model=model)

CV mean score: 0.7382, std: 0.0563.


In [194]:
model.fit(x_train_nn, y_train, batch_size = 50, epochs = 100, verbose = 1, validation_split = 0.2, shuffle = True)

Train on 200 samples, validate on 50 samples
Epoch 1/100 ... Epoch 100/100 (per-epoch metrics were not captured in this export)


<keras.callbacks.History at 0x135d0eb66d8>

In [None]:
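# Leftover from the CapsNet-Keras example: test(), eval_model, y_test and args are not defined in this notebook.
# test(model=eval_model, data=(x_test, y_test), args=args)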

In [189]:
test_id = test_df[['id']]
x_test = test_df.drop(['id'], axis = 1)
x_test = x_test.astype('float32')
x_test_nn = np.expand_dims(x_test, axis = -1)

In [197]:
results = model.predict(x_test_nn, batch_size=30, verbose=1)



In [66]:
test_id.size

19750

In [470]:
results = prediction_lr
predictions = pd.DataFrame(results, columns = ['target'])

ids = pd.DataFrame(test_id, columns = ['id'])
predictions = pd.concat([ids, predictions], axis = 1, sort=False)
predictions

Unnamed: 0,id,target
0,250,0.383895
1,251,0.325390
2,252,0.288592
3,253,0.427592
4,254,0.251649
5,255,0.156945
6,256,0.174616
7,257,0.102915
8,258,0.372323
9,259,0.141676


In [471]:
predictions.to_csv('dont_overfit_2_logreg_bootstrap.csv',index = False)

# Bootstrapping

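Add pseudo-labelled test rows to the training data. The file loaded below is read with `nrows = 254`; its first rows match `train.csv`, so it appears to hold the 250 original training rows followed by a few pseudo-labelled test rows.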
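
How `bootstrap_data.csv` was built is not shown in this notebook. One common way to produce such a file is sketched below: fit on the labelled rows, pick the test rows the model is most confident about, and append them with their predicted labels. The function `add_pseudo_labels`, the confidence rule, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def add_pseudo_labels(X_train, y_train, X_test, n_add=20, **model_kwargs):
    """Append the n_add most confidently predicted test rows to the training set."""
    clf = LogisticRegression(**model_kwargs).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    confidence = np.abs(proba - 0.5)              # distance from the decision boundary
    chosen = np.argsort(-confidence)[:n_add]      # indices of the most confident rows
    pseudo_y = (proba[chosen] > 0.5).astype(int)
    X_aug = np.vstack([X_train, X_test[chosen]])
    y_aug = np.concatenate([y_train, pseudo_y])
    return X_aug, y_aug, chosen

rng = np.random.default_rng(0)
X_all = rng.normal(size=(1250, 30))
y_all = (X_all[:, 0] - X_all[:, 1] > 0).astype(int)
X_lab, y_lab, X_unlab = X_all[:250], y_all[:250], X_all[250:]

X_aug, y_aug, chosen = add_pseudo_labels(
    X_lab, y_lab, X_unlab, n_add=20, C=0.1, solver="liblinear"
)
print(X_aug.shape, y_aug.shape)
```

Because the pseudo-labels come from the model itself, every added row reinforces its current beliefs, which is consistent with the instability seen when too many rows are added.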

In [115]:
p_df = pd.read_csv('/Users/JoonH/dont-overfit-ii/bootstrap_data.csv', nrows = 254)
p_df.head()

Unnamed: 0,target,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,1,-0.098,2.165,0.681,-0.614,1.309,-0.455,-0.236,0.276,-2.246,...,0.867,1.347,0.504,-0.649,0.672,-2.097,1.051,-0.414,1.038,-1.065
1,0,1.081,-0.973,-0.383,0.326,-0.428,0.317,1.172,0.352,0.004,...,-0.165,-1.695,-1.257,1.359,-0.808,-1.624,-0.458,-1.099,-0.936,0.973
2,1,-0.523,-0.089,-0.348,0.148,-0.022,0.404,-0.023,-0.172,0.137,...,0.013,0.263,-1.222,0.726,1.444,-1.165,-1.544,0.004,0.8,-1.211
3,1,0.067,-0.021,0.392,-1.637,-0.446,-0.725,-1.035,0.834,0.503,...,-0.404,0.64,-0.595,-0.966,0.9,0.467,-0.562,-0.254,-0.533,0.238
4,1,2.347,-0.831,0.511,-0.021,1.225,1.594,0.585,1.509,-0.012,...,0.898,0.134,2.415,-0.996,-1.006,1.378,1.246,1.478,0.428,0.253


In [116]:
x_train = p_df.drop(['target'], axis = 1)
x_train.shape

(254, 300)

In [63]:
import featuretools as ft

# initialize entityset
es = ft.EntitySet('data')
es2 = ft.EntitySet('test')

# add entities (application table itself)
es.entity_from_dataframe(
    entity_id='main', # define entity id
    dataframe=p_df.drop(['target'], axis=1), # select underlying data
    index='id', # define unique index column
    # specify some datatypes manually (if needed)
    variable_types={
        f: ft.variable_types.Categorical 
        for f in train_df.columns if f.startswith('FLAG_')
    }
)

es2.entity_from_dataframe(
    entity_id='test', # define entity id
    dataframe=test_df, # select underlying data
    index='id', # define unique index column
    # specify some datatypes manually (if needed)
    variable_types={
        f: ft.variable_types.Categorical 
        for f in train_df.columns if f.startswith('FLAG_')
    }
)



Entityset: test
  Entities:
    test [Rows: 19750, Columns: 301]
  Relationships:
    No relationships

In [64]:
# The actual feature construction.
# features_only=False computes the full feature matrix; set it to True to get only
# the feature definitions without any computation (useful for faster prototyping).
fm_train,feature_defs = ft.dfs(
    entityset=es, 
    target_entity="main", 
    features_only=False,
    agg_primitives=[
        "mean",
        "mode", 
        "max", 
        "min", 
        "sum", 
        "std"
        
    ],
    trans_primitives=[
        "not",
        "diff",
        "not",
        "percentile",
        "cum_sum"
    ],
    max_depth=2,
    #cutoff_time=cutoff_times,
    #training_window=ft.Timedelta(60, "d"), # use only last X days in computations
    max_features=1000,
    chunk_size=5000,
    verbose=True,
)

fm_test,feature_defs = ft.dfs(
    entityset=es2, 
    target_entity="test", 
    features_only=False,
    agg_primitives=[
        "mean",
        "mode", 
        "max", 
        "min", 
        "sum", 
        "std"
        
    ],
    trans_primitives=[
        "not",
        "diff",
        "not",
        "percentile",
        "cum_sum"
    ],
    max_depth=2,
    #cutoff_time=cutoff_times,
    #training_window=ft.Timedelta(60, "d"), # use only last X days in computations
    max_features=1000,
    chunk_size=10000,
    verbose=True,
)

Built 600 features




Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|████████████████████████████████████████████| Calculated: 1/1 chunks
Built 600 features
Elapsed: 00:02 | Remaining: 00:00 | Progress: 100%|████████████████████████████████████████████| Calculated: 2/2 chunks


In [120]:
#folds = StratifiedKFold(n_splits=25, shuffle=True, random_state=42)
folds = KFold(n_splits=25, shuffle=True, random_state=42)
repeated_folds = RepeatedStratifiedKFold(n_splits=25, n_repeats=20, random_state=42)
scaler = StandardScaler()
y_train = p_df['target']
#x_train = fm_train
x_train = p_df.drop(['target'], axis = 1)
x_test = test_df.drop(['id'], axis = 1)
x_train = scaler.fit_transform(x_train)
#x_test = fm_test
x_test = scaler.transform(x_test)

In [124]:
model = linear_model.LogisticRegression(class_weight='balanced', penalty='l2', C=0.09, solver='liblinear', max_iter = 100000)
oof_lr, prediction_lr, scores = train_model(x_train, x_test, y_train, params=None, model_type='sklearn', model=model)

CV mean score: 0.6778, std: 0.2795.


In [74]:
fm_train = fm_train.drop_duplicates()
fm_test = fm_test.drop_duplicates()

from boruta import BorutaPy
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 100, n_jobs = -1, class_weight = 'balanced')
boruta_selector = BorutaPy(rfc, n_estimators = 'auto', verbose = 0)
boruta_selector.fit(x_train,y_train)

feature_df = pd.DataFrame(fm_train.columns.tolist(),columns = ['features'])
feature_df['rank'] = boruta_selector.ranking_
feature_df = feature_df.sort_values('rank',ascending=True).reset_index(drop=True)
columns_to_keep = feature_df.features[0:400]
boruta_train = fm_train[columns_to_keep]
boruta_test = fm_test[columns_to_keep]

n_fold = 80
folds = StratifiedKFold(n_splits=250, shuffle=True, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(boruta_train)
X_test = scaler.transform(boruta_test)

In [76]:
from sklearn import linear_model

model = linear_model.LogisticRegression(class_weight='balanced', penalty='l2', C=0.15, solver='liblinear', max_iter = 100000)
oof_lr, prediction_lr, scores = train_model(X_train, X_test, y_train, params=None, model_type='sklearn', model=model)

CV mean score: 0.7655, std: 0.1368.


In [28]:
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, random_state=1).fit(X_train, y_train)
#eli5.show_weights(perm, top=50)

In [29]:
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(perm,threshold=0.001, prefit=True)
X_trans = sel.transform(x_train)
X_test_trans = sel.transform(x_test)
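
For readers without `eli5`, a similar selection step can be sketched with scikit-learn's own `permutation_importance`. This swaps in a different implementation from the one used above, and the synthetic data (two informative features out of 40) is an assumption.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 40))
noise = rng.normal(scale=0.5, size=250)
y_demo = (2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + noise > 0).astype(int)

clf = LogisticRegression(C=0.1, penalty="l1", solver="liblinear").fit(X_demo, y_demo)
result = permutation_importance(
    clf, X_demo, y_demo, scoring="roc_auc", n_repeats=10, random_state=1
)

# Keep features whose mean importance exceeds a threshold (0.001 mirrors the value used above).
keep = np.flatnonzero(result.importances_mean > 0.001)
X_selected = X_demo[:, keep]
print(keep, X_selected.shape)
```

Importance is measured here on the same rows the model was fitted on, so selecting features this way and then refitting can overfit quickly on a 250-row dataset.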

In [42]:
model = linear_model.LogisticRegression(class_weight='balanced', penalty='l1', C=0.17, solver='liblinear', max_iter = 50000)
oof_lr, prediction_lr, _ = train_model(X_trans, X_test_trans, y_train, params=None, model_type='sklearn', model=model)

CV mean score: 0.6186, std: 0.1403.


In [504]:
from sklearn.svm import SVC
svc = SVC(C = 8.0, kernel='rbf', probability = True, gamma = 'auto')
oof_lr_svm, prediction_lr_svm, _ = train_model(X_trans, X_test_trans, y_train, params=None, model_type='sklearn', model=svc)

CV mean score: 0.8026, std: 0.1070.


In [70]:
#results = prediction_lr_svm * 0.3 + prediction_lr * 0.7
results = prediction_lr
predictions = pd.DataFrame(results, columns = ['target'])

ids = test_df['id']
predictions = pd.concat([ids, predictions], axis = 1, sort=False)
predictions.to_csv('dont_overfit_2_bootstrap2.csv',index = False)