## Targeting the imbalance problem with Imbalance-XGBoost
Until now, most of my efforts to counter the class imbalance in the data have been in vain. I have not yet found a solution that optimizes both the accuracy and the recall of the minority class (0 in this case). Efforts with balanced bagging and SMOTE must continue, as they are promising. However, I think it is worth testing this new class:
https://github.com/jhwjhw0123/Imbalance-XGBoost
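
The linked package is built around weighted and focal loss objectives for XGBoost. As a rough illustration of why a focal loss might help with the minority class, the sketch below computes the standard binary focal loss for a single prediction in plain Python. It is not the package's own code; the function name `focal_loss` and the default `gamma=2.0` are assumptions chosen for the example.

```python
import math

def focal_loss(y_true, p, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample; gamma=0 reduces to plain cross-entropy."""
    # Clip the predicted probability of class 1 to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = p if y_true == 1 else 1.0 - p
    # The (1 - p_t)**gamma factor shrinks the loss of well-classified samples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A confident, correct prediction for class 1 versus the same
# probability given to a sample that is really class 0.
easy = focal_loss(1, 0.95)
hard = focal_loss(0, 0.95)
print(easy, hard)
```

With `gamma > 0`, samples the model already classifies confidently contribute very little, so training effort shifts toward the hard (here, mostly minority) samples.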

## Best solution run


In [27]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from datetime import datetime
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier

In [2]:
def average(df):
    # Mean of 'f1', ignoring rows where -1 marks a missing value.
    valid = df.loc[df['f1'] != -1, 'f1']
    return valid.mean()


In [3]:
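def results(model, X_train, X_valid, y_train, y_valid):
    # Validation ROC AUC from the predicted probability of class 1.
    soft_probs = model.predict_proba(X_valid)
    valid_auc = roc_auc_score(y_valid, soft_probs[:, 1])
    print(valid_auc)
    # Training accuracy (a large gap to validation accuracy suggests overfitting).
    train_preds = model.predict(X_train)
    acc = accuracy_score(y_train, train_preds)
    print(acc)
    # Validation accuracy, then per-class precision, recall and F1.
    preds = model.predict(X_valid)
    acc = accuracy_score(y_valid, preds)
    print(acc)
    print(classification_report(y_valid, preds))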
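
`results` scores the labels returned by `model.predict`, which for the binary classifiers used here amounts to a 0.5 cutoff on the predicted probability. Since the goal is better recall on class 0, one option is to sweep that cutoff. The sketch below is a plain-Python illustration on made-up numbers; the helper name `minority_recall_precision` and the toy data are assumptions, not part of the notebook.

```python
def minority_recall_precision(y_true, prob_pos, threshold):
    """Recall and precision for class 0 when P(class 1) < threshold means 'predict 0'."""
    preds = [1 if p >= threshold else 0 for p in prob_pos]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy labels and probabilities (made up for illustration, not from the notebook).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
prob_pos = [0.2, 0.6, 0.7, 0.65, 0.8, 0.9, 0.95, 0.99]

for threshold in (0.5, 0.75):
    r, p = minority_recall_precision(y_true, prob_pos, threshold)
    print(f"threshold={threshold}: recall_0={r:.2f}, precision_0={p:.2f}")
```

In the notebook, `prob_pos` would be `model.predict_proba(X_valid)[:, 1]`. Raising the threshold trades precision on class 0 for recall, so it should be chosen on validation data rather than on the test set.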

In [4]:
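df = pd.read_csv("train_final.csv")
df.describe()
# Replace the -1 placeholder in 'f1' with the mean of the remaining values.
# Assigning the result back avoids the chained inplace call, which newer
# pandas versions warn about and may not apply to the DataFrame.
f1_avg = average(df)
df['f1'] = df['f1'].replace(-1, f1_avg)
df.describe()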

| | Id | Y | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f15 | f16 | f17 | f18 | f19 | f20 | f21 | f22 | f23 | f24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | ... | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 | 16383.0 |
| mean | 8192.0 | 0.942135 | 43031.41572 | 1.044375 | 11.770938 | 118323.581456 | 1.044436 | 0.050052 | 117089.674113 | 169730.1786 | ... | 25894.316914 | 119045.099005 | 184622.040835 | 1.047305 | 125959.667765 | 1.044558 | 1.045718 | 1.041934 | 32718.9 | 1.043948 |
| std | 4729.509065 | 0.233495 | 33596.053696 | 0.264806 | 353.187115 | 4518.059755 | 0.265601 | 0.293892 | 10261.29297 | 69396.677853 | ... | 36086.993946 | 18321.987129 | 100590.811845 | 0.306239 | 31091.344158 | 0.262576 | 0.266874 | 0.246597 | 3184929.0 | 0.25964 |
| min | 1.0 | 0.0 | 37.0 | 1.0 | 1.77 | 23779.0 | 1.0 | 0.0 | 4292.0 | 4673.0 | ... | 25.0 | 4674.0 | 3130.0 | 1.0 | 117879.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 4096.5 | 1.0 | 20331.0 | 1.0 | 1.77 | 118096.0 | 1.0 | 0.0 | 117961.0 | 117906.0 | ... | 4554.0 | 118395.0 | 118398.0 | 1.0 | 118274.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 50% | 8192.0 | 1.0 | 35530.0 | 1.0 | 1.77 | 118300.0 | 1.0 | 0.0 | 117961.0 | 128130.0 | ... | 13234.0 | 118929.0 | 119095.0 | 1.0 | 118568.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 |
| 75% | 12287.5 | 1.0 | 74240.5 | 1.0 | 3.54 | 118386.0 | 1.0 | 0.0 | 117961.0 | 234498.5 | ... | 38902.0 | 120539.0 | 290919.0 | 1.0 | 120006.0 | 1.0 | 1.0 | 1.0 | 9.0 | 1.0 |
| max | 16383.0 | 1.0 | 312152.0 | 7.0 | 43910.16 | 286791.0 | 9.0 | 10.0 | 311178.0 | 311867.0 | ... | 311696.0 | 286792.0 | 308574.0 | 18.0 | 311867.0 | 8.0 | 8.0 | 7.0 | 404288600.0 | 8.0 |


In [5]:
y = df.loc[:,'Y']
X = df.loc[:,'f1':'f24']
X_train, X_valid, y_train, y_valid = train_test_split(X,y,train_size = 0.75, test_size = 0.25,random_state = 42, shuffle = True)

Best submitted params:

- `colsample_bytree=0.3, subsample=0.7, max_depth=8, n_estimators=1550, learning_rate=0.011, colsample_bylevel=0.5, n_jobs=-1, base_score=0.55, random_state=42`
- `colsample_bytree=0.4, subsample=1, max_depth=7, n_estimators=1350, learning_rate=0.012, colsample_bylevel=0.6, n_jobs=-1, base_score=0.55, random_state=42`
- `colsample_bytree=0.395, subsample=0.8667, max_depth=10, n_estimators=1550, learning_rate=0.0108, colsample_bylevel=0.65, n_jobs=-1, base_score=0.475, random_state=42`
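
None of these parameter sets touches class weighting. One option that was not tried in this notebook is XGBoost's `scale_pos_weight`, for which the XGBoost documentation suggests `count(negative) / count(positive)` as a typical value. The sketch below only computes that ratio in plain Python; the helper name `scale_pos_weight_for` and the toy labels are assumptions for illustration.

```python
def scale_pos_weight_for(labels):
    """Heuristic ratio count(negative) / count(positive) for binary 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples")
    return n_neg / n_pos

# Toy labels with roughly the notebook's class ratio (about 6% zeros).
labels = [0] * 6 + [1] * 94
weight = scale_pos_weight_for(labels)
print(weight)
# In the notebook this could be passed as XGBClassifier(scale_pos_weight=weight, ...).
```

Because the minority here is class 0, the ratio comes out below 1, which down-weights the majority (positive) class rather than up-weighting the minority. Whether this helps on this dataset is untested.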

In [20]:
model_trial = XGBClassifier(colsample_bytree=0.4,subsample = 0.8667,max_depth=8,
                            n_estimators=1650, learning_rate =0.0119,
                            colsample_bylevel=0.65,n_jobs=-1,base_score = 0.55,
                            random_state=42)
                            
model_trial.fit(X_train, y_train)
results(model_trial,X_train, X_valid, y_train, y_valid)

0.8821691447756789
0.9894197118906161
0.96142578125
              precision    recall  f1-score   support

           0       0.95      0.31      0.47       224
           1       0.96      1.00      0.98      3872

    accuracy                           0.96      4096
   macro avg       0.95      0.66      0.72      4096
weighted avg       0.96      0.96      0.95      4096
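
The 0.96 validation accuracy is easier to judge against a trivial baseline. Using only the support counts from the report above, the short calculation below gives the accuracy of a model that always predicts the majority class; the variable names are chosen for the example.

```python
# Support counts taken from the classification report above.
n_minority = 224   # class 0
n_majority = 3872  # class 1

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = n_majority / (n_minority + n_majority)
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline comes out at about 0.945, so the tuned model's 0.961 is only a small gain in accuracy, and the recall of 0.31 on class 0 is the more informative number.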



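## Balanced Bagging
Balanced bagging improves recall on the minority class (from 0.31 to 0.64 on class 0 in the runs shown here), at the cost of precision on that class (0.95 to 0.23) and of overall validation accuracy (0.96 to 0.87).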

In [24]:
model_bagging = XGBClassifier(colsample_bytree=0.395,subsample = 0.8667,max_depth=11,
                            n_estimators=1538, learning_rate =0.0108,colsample_bynode = 0.87,
                            colsample_bylevel=0.65,n_jobs=-1,base_score = 0.475,
                            random_state=42)

In [25]:
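# Each of the 20 base models is trained on a class-balanced resample.
# Note: newer scikit-learn and imbalanced-learn releases name the
# `base_estimator` argument `estimator`.
balanced_bagging = BalancedBaggingClassifier(base_estimator=model_bagging,
                                             sampling_strategy='auto', n_estimators=20,
                                             replacement=True,
                                             random_state=42)
model_bagged = balanced_bagging.fit(X_train, y_train)
results(model_bagged, X_train, X_valid, y_train, y_valid)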

0.8636617288961038
0.8943598925693823
0.865478515625
              precision    recall  f1-score   support

           0       0.23      0.64      0.34       224
           1       0.98      0.88      0.93      3872

    accuracy                           0.87      4096
   macro avg       0.61      0.76      0.63      4096
weighted avg       0.94      0.87      0.89      4096
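
To make the recall and precision trade-off above more concrete, here is a simplified plain-Python sketch of the resampling idea behind balanced bagging: the majority class is randomly undersampled to the size of the minority class before each base model is trained. It is not imbalanced-learn's implementation and it leaves out the bootstrap step; the helper name `balanced_sample_indices` and the toy labels are assumptions.

```python
import random
from collections import Counter

def balanced_sample_indices(labels, rng, replacement=True):
    """Keep the minority class as-is and undersample every other class to its size."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    chosen = []
    for idx in by_class.values():
        if len(idx) == n_min:
            chosen.extend(idx)                        # minority class: keep all rows
        elif replacement:
            chosen.extend(rng.choices(idx, k=n_min))  # sample with replacement
        else:
            chosen.extend(rng.sample(idx, n_min))     # sample without replacement
    return chosen

# Toy labels: 5 minority (0) and 95 majority (1) samples.
labels = [0] * 5 + [1] * 95
rng = random.Random(42)
sample = balanced_sample_indices(labels, rng)
print(Counter(labels[i] for i in sample))
```

Each base model then sees a roughly 50/50 class mix, so it predicts class 0 far more readily than a model trained on the original distribution. That is consistent with the higher recall and lower precision on class 0 in the report.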



In [29]:
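# AdaBoost with the same XGBoost model as its base estimator.
# (Newer scikit-learn releases name this argument `estimator`.)
model_adaboost = AdaBoostClassifier(base_estimator = model_bagging)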
model_adaboost.fit(X_train, y_train)
results(model_adaboost,X_train, X_valid, y_train, y_valid)

0.5
0.05892406608610727
0.0546875
              precision    recall  f1-score   support

           0       0.05      1.00      0.10       224
           1       0.00      0.00      0.00      3872

    accuracy                           0.05      4096
   macro avg       0.03      0.50      0.05      4096
weighted avg       0.00      0.05      0.01      4096



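
The AdaBoost report above has recall 1.00 for class 0, recall 0.00 for class 1 and an AUC of 0.5, which is the pattern a model produces when it predicts class 0 for every row. The accuracy agrees: 224 / 4096 = 0.0546875. A small guard like the one below can flag such runs before their metrics are read; `is_single_class` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def is_single_class(preds):
    """True when the model predicts the same label for every sample."""
    return len(set(preds)) <= 1

# Accuracy of predicting class 0 everywhere, from the report's support counts.
all_zero_accuracy = 224 / (224 + 3872)
print(all_zero_accuracy)

# Toy predictions mimicking the degenerate pattern (made up for illustration).
preds = [0] * 10
print(Counter(preds), is_single_class(preds))
```

In the notebook the check would be applied to `model.predict(X_valid)`; a `True` result means the precision and recall figures describe a constant predictor rather than a useful model.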
