<h2>Don't Overfit </h2>

<h2>Problem Statement </h2>

- Donot Overfit 2 is a unique problem statement where we are provided with only 250 training samples and 19750 test samples.
- The Objective of the problem is not to overfit with this train data and generalize well with our test data samples.
- The data set consists of 300 continuous random variables each standardized with mean centered to zero and variance 1.

<h2> Performance Metrics Used </h2>

- The Problem uses ROC AUC SCORE as the metric to measure the model performance

In [95]:
import numpy as np
import pandas as pd
from scipy import stats
import sklearn
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
pd.options.mode.chained_assignment = None   # default='warn'
from mlxtend.classifier import StackingClassifier
from sklearn.linear_model import Lasso
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from scipy import stats
#import xgboost as xgb
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier

<h1>Getting the Data into Data Frame </h1>

In [81]:
train_data=pd.read_csv("train.csv")
test_data=pd.read_csv("test.csv")

<h3>Simple First-Cut Solution using Logistic Regression and RepeatedStratifiedKFold</h3>

<h5>Data Split</h5>

In [82]:
train_data["target"]=train_data["target"].astype(int)
train_data=train_data.drop("id",axis=1)
X=train_data.drop("target",axis=1)
Y=train_data['target']
test=test_data.drop("id",axis=1)

In [86]:
<h3>Let us add simple features like mean,std and sum etc </h3>

SyntaxError: ignored

In [87]:
def simple_feature_engg(data):
  data["mean"]=np.mean(data,axis=1)
  data["sum"]=np.sum(data,axis=1)
  data["std_dev"]=np.std(data,axis=1)
  return data

In [88]:
X_fe=simple_feature_engg(X)
test_fe=simple_feature_engg(test)

In [89]:
from sklearn.preprocessing import StandardScaler
def standardize(data,test_data):
  scaler=StandardScaler()
  scaler=scaler.fit(data)
  trans_data=scaler.transform(data)
  test_trans_data=scaler.transform(test_data)
  return((pd.DataFrame(trans_data,columns=data.columns),pd.DataFrame(test_trans_data,columns=data.columns)))

In [90]:
X_scaled,test_scaled=standardize(X_fe,test_fe)

In [1]:
def model_with_parameters(model_name):

  if model_name=="KNN":
    model=KNeighborsClassifier()
    param_= {'n_neighbors':np.arange(3,45,2).tolist(),'algorithm':['kdtree','brute']}

  elif model_name=="LR":
    model=SGDClassifier(loss='log',class_weight='balanced',n_jobs=-1)
    param_={"penalty":["l1","l2","elasticnet"],
        "alpha":np.arange(0.1,0.9,0.01),
        "l1_ratio":np.arange(0.1,0.9,0.05)
      }

  elif model_name== "SVM":
    model=SGDClassifier(loss='hinge',class_weight='balanced',n_jobs=-1)
    param_={"penalty":["l1","l2","elasticnet"],
        "alpha":np.arange(0.1,0.9,0.01),
        "l1_ratio":np.arange(0.1,0.9,0.05)}
      
  elif model_name == "GNB":
    model=GaussianNB()
    param_=""

  elif model_name == "RF":
    model= RandomForestClassifier(random_state=42)
    param_={'n_estimators':[10,20,30,40,50,100,200,300,400],'max_depth':[2,3,5,7]}

  elif model_name == "XGB":
    model=XGBClassifier(scale_pos_weight=0.5)
    param_={"max_depth":[2,3,5,7],'n_estimators':[10,20,50,100,200,300]}

  return (model,param_) 

<h4>Random Search and Stratified-K-Fold Strategy </h4>

In [65]:
def rskf_func(n_splits,n_repeats): #Creating a function for repeated stratified K-Fold Object
    rskf_var=RepeatedStratifiedKFold(n_splits=n_splits,n_repeats=n_repeats)
    return rskf_var

In [110]:
def rskf_train(data,train_labels,test_data,model_name):

  test_val=np.zeros(len(test_data))
  cnt=0
  rskf=rskf_func(10,10)
  for train_index,valid_index in rskf.split(data,train_labels):
      #Gets 19 chunks of splitted data for train and 1 chunk for the validation out of 20 chunks
      #Each chunk can be used for validation once.Hence 20 iterations for 20 chunks
      #And we are repeating the process for 20 times.So 400 iterations in total
      X_train,X_valid=data.loc[train_index],X.loc[valid_index]
      Y_train,Y_valid=train_labels.loc[train_index],Y.loc[valid_index]
      model,param=model_with_parameters(model_name)
      if model_name == "GNB":
        model.fit(X_train,Y_train)
        clf_calib=CalibratedClassifierCV(model,cv=20,method='sigmoid')
        clf_calib.fit(X_train,Y_train)
      
      else:  
        grid_model=RandomizedSearchCV(model,param,cv=3,scoring='roc_auc',n_jobs=-1,verbose=0)
        grid_model.fit(X_train,Y_train)
        clf_calib=CalibratedClassifierCV(grid_model.best_estimator_,cv=4,method='sigmoid')
        clf_calib.fit(X_train,Y_train)

      if model_name == "stack":
        score=roc_auc_score(Y_valid.values,sclf.predict_proba(X_valid)[:,1])
        if(score > 0.75):
          print("Model ok for iteration")
        else:
          print("Skip model for this iteration")
      else:
        valid_roc=roc_auc_score(Y_valid.values,clf_calib.predict_proba(X_valid)[:,1])
        if( valid_roc > 0.75):
          print("<---Model ok for iteration")
          test_val+=clf_calib.predict_proba(test_data)[:,1]
          cnt+=1
        else:
          print("Skipping Model for this Iteration")
  
  
  final_prediction=test_val* (1./cnt)

  return final_prediction


In [114]:
model_list=["KNN","LR","SVM","GNB","RF","XGB"]
pred_dict={}
for model in tqdm(model_list):
  prediction=rskf_train(X_scaled,Y,test_scaled,model)
  pred_dict[model]=prediction








  0%|          | 0/6 [00:00<?, ?it/s][A[A[A[A[A[A

Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
<---Model ok for iteration


  0%|          | 0/6 [00:08<?, ?it/s]


KeyboardInterrupt: ignored

In [102]:
pred_xgb=rskf_train(X_scaled,Y,test_scaled,"XGB")
pred_dict["XGB"]=pred_xgb

Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
Skipping Model for this Iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteration
<---Model ok for iteratio

In [126]:
test_scaled.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,mean,sum,std_dev
0,0.478452,-0.998843,-1.728417,0.304138,-0.692502,0.533981,0.500624,-0.221157,-0.675928,1.233779,0.472827,-0.413123,1.74456,-0.225866,-0.349993,-0.011904,0.7686,0.861594,1.933633,0.289704,1.221403,-0.543319,-1.070748,-0.488631,0.312807,-0.829138,-1.344914,0.118911,0.321268,0.97885,-0.946246,0.857462,-1.099893,2.094126,0.992371,-2.137897,-0.869637,0.120537,0.539353,0.235924,...,1.472208,0.594182,0.620882,-1.076035,1.121284,1.020632,0.171795,-1.594102,1.162126,0.021879,-0.804297,0.179067,0.13728,-0.714548,0.679659,-0.285295,-1.594176,0.267015,0.991283,0.373836,-0.680804,1.135573,0.793785,0.394572,1.703194,-0.725889,0.957907,-0.131418,-2.837717,-0.906667,2.180115,-0.200209,2.159692,0.582494,-0.110786,0.300799,1.143073,2.47992,2.47992,3.225931


In [148]:
def stacking_classifier(data,te_scaled,Y):
  test_val=np.zeros(len(te_scaled))
  cnt=0
  rskf=rskf_func(10,10)
  clf1,param1=model_with_parameters("LR")
  clf2,param2=model_with_parameters("SVM")
  clf3,param3=model_with_parameters("RF")
  clf4,param4=model_with_parameters("XGB")
  for train_index,valid_index in rskf.split(data,Y):
      #Gets 19 chunks of splitted data for train and 1 chunk for the validation out of 20 chunks
      #Each chunk can be used for validation once.Hence 20 iterations for 20 chunks
      #And we are repeating the process for 20 times.So 400 iterations in total
      X_train,X_valid=data.loc[train_index],X.loc[valid_index]
      Y_train,Y_valid=Y.loc[train_index],Y.loc[valid_index]

      grid_model_1=RandomizedSearchCV(clf1,param1,cv=3,scoring='roc_auc',n_jobs=-1,verbose=0)
      grid_model_1.fit(X_train,Y_train)
      clf_1_calib=CalibratedClassifierCV(grid_model_1.best_estimator_,cv=3,method='sigmoid')
      clf_1_calib.fit(X_train,Y_train)

      grid_model_2=RandomizedSearchCV(clf2,param2,cv=3,scoring='roc_auc',n_jobs=-1,verbose=0)
      grid_model_2.fit(X_train,Y_train)
      clf_2_calib=CalibratedClassifierCV(grid_model_2.best_estimator_,cv=3,method='sigmoid')
      clf_2_calib.fit(X_train,Y_train)

      grid_model_3=RandomizedSearchCV(clf3,param3,cv=3,scoring='roc_auc',n_jobs=-1,verbose=0)
      grid_model_3.fit(X_train,Y_train)
      clf_3_calib=CalibratedClassifierCV(grid_model_3.best_estimator_,cv=3,method='sigmoid')
      clf_3_calib.fit(X_train,Y_train)

      grid_model_4=RandomizedSearchCV(clf4,param4,cv=3,scoring='roc_auc',n_jobs=-1,verbose=0)
      grid_model_4.fit(X_train,Y_train)
      clf_4_calib=CalibratedClassifierCV(grid_model_4.best_estimator_,cv=3,method='sigmoid')
      clf_4_calib.fit(X_train,Y_train)

      sclf=StackingClassifier(classifiers=[clf_1_calib,clf_2_calib,clf_3_calib,clf_4_calib],meta_classifier=clf_1_calib,use_probas=True)

      sclf.fit(X_train,Y_train)

      score=roc_auc_score(Y_valid.values,sclf.predict_proba(X_valid)[:,1])

      if score > 0.75:
        print("Model is ok for this iteration")
        test_val+=sclf.predict_proba(te_scaled)[:,1]
        cnt+=1
      else:
        print("Skipping this iteration")

  final_stack_prediction=test_val* (1./cnt)

  return final_stack_prediction
      



In [123]:
X_scaled.shape

(250, 303)

In [149]:
stack_prediction=stacking_classifier(X_scaled,test_scaled,Y)

Skipping this iteration
Model is ok for this iteration
Skipping this iteration
Model is ok for this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Skipping this iteration
Skipping this iteration
Model is ok for this iteration
Model is ok for this iteration
Skipping this iteration
Model is ok for this iteration
Skipping this iteration
Skipping this iteration
Skipping this iteration
Model is ok for this iteration
Skipping this iteration
Ski

In [151]:
stk_dict={}
stk_dict["stack"]=stack_prediction

In [103]:
pred_dict.keys()

dict_keys(['KNN', 'LR', 'SVM', 'GNB', 'RF', 'XGB'])

In [104]:
def final_submission_csv(name): #Getting csv for kaggle submission
    sub_df=pd.read_csv("sample_submission.csv")
    for k,v in name.items():
      final_df=pd.concat([sub_df["id"],pd.DataFrame(v)],axis=1)
      final_df.columns=["id","target"]
      final_df.to_csv(k + ".csv",index=False)
    

In [105]:
final_submission_csv(pred_dict)

In [152]:
final_submission_csv(stk_dict)

<h2>Final Score </h2>

<b>

- With Stacked Model we get a Private Score of 0.825 and Public Score of 0.838

</b>

<h2>Conclusion </h2>

- Simple Logistic Regression with K-Fold Cross Validation we got 80 % ROC 
- More Complex Stacked Classifer and Feature Selection Techniques help us achieve 83% ROC
- Without LB Probing we could get a good classifier which separates the data well in  Private LB 

<h2>References </h2>

- https://www.kaggle.com/featureblind/robust-lasso-patches-with-rfe-gs
- https://www.kaggle.com/rafjaa/dealing-with-very-small-datasets
- https://www.kaggle.com/iavinas/simple-short-solution-don-t-overfit-0-848
- https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/3096/stacking-models/4/module-4-machine-learning-ii-supervised-learning-models