# TODO:
https://www.kaggle.com/competitions/playground-series-s3e2/discussion/378795
https://www.kaggle.com/competitions/playground-series-s3e2/discussion/378780

These discussions show that we should incorporate the original data, but when validating with K-fold methods we should score only on rows from the competition dataset, not on the original dataset. So implement this technique for this competition.
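A minimal sketch of that validation scheme, on synthetic stand-in data rather than the competition files. The helper name `cv_auc_comp_only`, the logistic regression model, and the 600/400 split are assumptions made only for illustration; the point is that original rows join every training fold but never a validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cv_auc_comp_only(model, X_comp, y_comp, X_orig, y_orig, n_splits=5, seed=1337):
    """Return per-fold AUCs; validation folds contain competition rows only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X_comp, y_comp):
        # Original rows are appended to the training part of every fold...
        X_tr = np.vstack([X_comp[train_idx], X_orig])
        y_tr = np.concatenate([y_comp[train_idx], y_orig])
        model.fit(X_tr, y_tr)
        # ...but never show up in the validation part.
        y_pred = model.predict_proba(X_comp[val_idx])[:, 1]
        scores.append(roc_auc_score(y_comp[val_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in: the first 600 rows play the competition set,
# the remaining 400 play the original dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_comp, y_comp, X_orig, y_orig = X[:600], y[:600], X[600:], y[600:]

scores = cv_auc_comp_only(LogisticRegression(max_iter=1000),
                          X_comp, y_comp, X_orig, y_orig)
print(scores.round(3), round(scores.mean(), 3))
```

Splitting only the competition rows keeps the CV score comparable to the leaderboard metric, which is also computed on competition-style data.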

### Also, bagging with a simple mean scored a lot better in the last competition: though it didn't do much better on the public LB, it ranked up to 60th position on the final private LB. So,
## Remember to trust your CV over the public LB

# A Few more TODOs:
* Treat features with 10 or fewer unique values as categorical features, instead of the current 20, and see if it improves the score
* Try target encoding, weight of evidence AND leave-one-out encoding, and see which one performs better
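To make the difference between the three encodings concrete, here is a hand-computed sketch in plain pandas on a made-up six-row frame. These are the textbook, unsmoothed formulas (and one common sign convention for weight of evidence); the `category_encoders` classes add options such as smoothing, regularization, or noise, so their numbers will not match exactly.

```python
import numpy as np
import pandas as pd

# Made-up toy data: one categorical column and a binary target.
toy = pd.DataFrame({
    "cat": ["a", "a", "a", "b", "b", "b"],
    "y":   [1,   0,   1,   0,   0,   1],
})
grp = toy.groupby("cat")["y"]
cat_sum = grp.transform("sum")      # positives per category, aligned to rows
cat_count = grp.transform("count")  # rows per category, aligned to rows

# Target (mean) encoding: category -> mean of the target in that category.
toy["target_enc"] = cat_sum / cat_count

# Leave-one-out: the same mean, but excluding the current row's own target.
toy["loo_enc"] = (cat_sum - toy["y"]) / (cat_count - 1)

# Weight of evidence: log of (category's share of all positives)
# over (category's share of all negatives).
pos_share = cat_sum / toy["y"].sum()
neg_share = (cat_count - cat_sum) / (1 - toy["y"]).sum()
toy["woe_enc"] = np.log(pos_share / neg_share)

print(toy)
```

Note how leave-one-out gives different values to rows of the same category depending on their own target, which is what makes it less prone to leaking the label than the plain mean.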

# Imports

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score
from IPython.display import display
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import optuna
from sklearn.preprocessing import StandardScaler

from category_encoders import TargetEncoder, LeaveOneOutEncoder, WOEEncoder

In [3]:
import warnings
warnings.filterwarnings('ignore')

# Loading Data

In [4]:
BASE_PATH = Path('../input/playground-series-s3e3')

# id is not going to be an informative feature, so we drop it from train;
# we still need the test set's ids for the submission file, so we save them in a separate variable before dropping
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns="id")
test = pd.read_csv(BASE_PATH / "test.csv")
test_idx = test.id
test = test.drop(columns="id")

# It's been shown that incorporating the original data improves scores, at least on the public leaderboard. So let's do that!
original = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')

train.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EnvironmentSatisfaction,Gender,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition
0,36,Travel_Frequently,599,Research & Development,24,3,Medical,1,4,Male,...,80,1,10,2,3,10,0,7,8,0
1,35,Travel_Rarely,921,Sales,8,3,Other,1,1,Male,...,80,1,4,3,3,4,2,0,3,0
2,32,Travel_Rarely,718,Sales,26,3,Marketing,1,3,Male,...,80,2,4,3,3,3,2,1,2,0
3,38,Travel_Rarely,1488,Research & Development,2,3,Medical,1,3,Female,...,80,0,15,1,1,6,0,0,2,0
4,50,Travel_Rarely,1017,Research & Development,5,4,Medical,1,2,Female,...,80,0,31,0,3,31,14,4,10,1


In [5]:
original.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


# Pre-Pre-Processing

### Let's make the feature names and order consistent between the competition dataset and the original dataset before we concatenate

In [6]:
original['Attrition'] = (original['Attrition'] == 'Yes').astype(np.int64)

# in the original data the id column is called "EmployeeNumber", so let's drop it
original.drop(columns="EmployeeNumber", inplace=True)

In [7]:
# now reordering the features in original dataset
original = original[list(train.columns)]

### Someone in the first competition showed that adding a source feature, i.e. a feature that indicates whether a given record comes from the original dataset or the synthetic one, improves performance, so let's do that!

In [8]:
original["is_original"] = 1
train["is_original"] = 0
test["is_original"] = 0

### Let's finally concatenate

In [9]:
train_extended = pd.concat([train, original]).reset_index(drop=True)
len(train_extended)

3147

### checking for null values

In [10]:
pd.concat([train_extended.isnull().sum().rename("Missing in Train"),
           test.isnull().sum().rename("Missing in Test")], axis=1).sort_values(by="Missing in Train")

Unnamed: 0,Missing in Train,Missing in Test
Age,0,0.0
Over18,0,0.0
OverTime,0,0.0
PercentSalaryHike,0,0.0
PerformanceRating,0,0.0
RelationshipSatisfaction,0,0.0
StandardHours,0,0.0
NumCompaniesWorked,0,0.0
StockOptionLevel,0,0.0
TrainingTimesLastYear,0,0.0


#### Insights: No missing values! Something to celebrate! :p

## Let's also concatenate test data to train

In [11]:
y = train_extended.Attrition
y

0       0
1       0
2       0
3       0
4       1
       ..
3142    0
3143    0
3144    0
3145    0
3146    0
Name: Attrition, Length: 3147, dtype: int64

In [12]:
df = pd.concat([train_extended.drop(columns="Attrition"), test])

# Preprocessing

### Identifying Categorical Features

In [13]:
df.dtypes.sort_values()

Age                          int64
YearsSinceLastPromotion      int64
YearsInCurrentRole           int64
YearsAtCompany               int64
WorkLifeBalance              int64
TrainingTimesLastYear        int64
TotalWorkingYears            int64
StockOptionLevel             int64
StandardHours                int64
RelationshipSatisfaction     int64
PerformanceRating            int64
PercentSalaryHike            int64
NumCompaniesWorked           int64
MonthlyRate                  int64
YearsWithCurrManager         int64
MonthlyIncome                int64
JobSatisfaction              int64
DailyRate                    int64
DistanceFromHome             int64
Education                    int64
EmployeeCount                int64
HourlyRate                   int64
EnvironmentSatisfaction      int64
JobLevel                     int64
JobInvolvement               int64
is_original                  int64
Gender                      object
MaritalStatus               object
OverTime            

### Remember, being of type int doesn't mean that a feature cannot be categorical.
#### Let's check for unique values in each column

In [14]:
df.nunique().sort_values()

StandardHours                  1
EmployeeCount                  1
Over18                         1
is_original                    2
PerformanceRating              2
OverTime                       2
Gender                         2
BusinessTravel                 3
Department                     3
MaritalStatus                  3
RelationshipSatisfaction       4
JobSatisfaction                4
WorkLifeBalance                4
StockOptionLevel               5
JobInvolvement                 5
EnvironmentSatisfaction        5
Education                      6
JobLevel                       6
EducationField                 6
TrainingTimesLastYear          7
JobRole                        9
NumCompaniesWorked            11
PercentSalaryHike             15
YearsSinceLastPromotion       16
YearsWithCurrManager          18
YearsInCurrentRole            19
DistanceFromHome              29
YearsAtCompany                38
TotalWorkingYears             41
Age                           43
HourlyRate

#### INSIGHTS: A quick look at the number of unique values per feature suggests that around 20 unique values would be a safe cut-off for what counts as a categorical feature (the cell below tries the stricter cut-off of 10 from the TODO list)
#### We'll drop columns with only one value, as they bring nothing to the table

#### But feel free to use your own intuition and trial and error to figure out what works best in terms of threshold and features
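To compare cut-offs quickly, the split can be wrapped in a small helper. This is only a sketch: `split_by_cardinality` and the toy frame are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd


def split_by_cardinality(df, threshold=10):
    """Split column names into constant, categorical and numerical by unique count."""
    n_unique = df.nunique()
    constant = n_unique[n_unique == 1].index.tolist()
    categorical = n_unique[(n_unique > 1) & (n_unique <= threshold)].index.tolist()
    numerical = n_unique[n_unique > threshold].index.tolist()
    return constant, categorical, numerical


# Made-up frame with one constant, two low-cardinality and one high-cardinality column.
toy = pd.DataFrame({
    "StandardHours": [80] * 12,
    "Gender": ["Male", "Female"] * 6,
    "JobLevel": [1, 2, 3, 4] * 3,
    "Age": list(range(30, 42)),
})

for threshold in (3, 10):
    print(threshold, split_by_cardinality(toy, threshold))
```

With the threshold at 3, `JobLevel` lands in the numerical group; at 10 it is treated as categorical, which is exactly the kind of shift worth checking against the CV score.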

In [25]:
feats_to_drop = [col for col in df.columns if df[col].nunique()==1]
cat_features = [col for col in df.columns if df[col].nunique() <= 10 and df[col].nunique() > 1]

In [16]:
df.drop(columns=feats_to_drop, inplace=True)

#### We won't use a one-hot encoder here: we already have a large ratio of features to rows, and one-hot encoding would increase that ratio considerably, which invites severe overfitting
#### Rather, we'll use an encoding that keeps one column per feature: ordinal/label encoding (they're basically the same thing) or, as in the cells below, a target-based encoder
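A quick way to see the column blow-up is to count columns before and after encoding. The frame below is made up for illustration; `pd.get_dummies` stands in for a one-hot encoder and pandas category codes stand in for an ordinal encoder.

```python
import pandas as pd

# Made-up frame: two categorical columns with three levels each, one numeric column.
toy = pd.DataFrame({
    "JobRole": ["Manager", "Sales Executive", "Research Scientist", "Manager"],
    "Department": ["Sales", "Sales", "Research & Development", "Human Resources"],
    "Age": [41, 35, 29, 50],
})
cat_cols = ["JobRole", "Department"]

# One-hot: every level of every categorical column becomes its own column.
one_hot = pd.get_dummies(toy, columns=cat_cols)

# Ordinal: every categorical column stays a single integer-coded column.
ordinal = toy.copy()
for col in cat_cols:
    ordinal[col] = ordinal[col].astype("category").cat.codes

print(toy.shape[1], one_hot.shape[1], ordinal.shape[1])
```

Here three columns turn into seven under one-hot encoding, while the ordinal version keeps all three; the gap grows with the number of levels per feature.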

In [26]:
len(df), len(y)

(4266, 3147)

In [92]:
# but first let's separate test and train_extended
# (.copy() so the assignments below modify real frames, not views of df)
X_train = df.iloc[:-len(test), :].copy()
X_test = df.iloc[-len(test): , :].copy()

In [94]:
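# three candidate encoders; only one is applied per run (leave-one-out here)
target_enc = TargetEncoder()
loo_enc = LeaveOneOutEncoder(sigma=0.05)
woe_enc = WOEEncoder(sigma=0.05)

# cols is not given, so category_encoders only encodes the string (object) columns;
# integer-coded features such as Education pass through unchanged
loo_enc.fit(X_train[cat_features], y)

# note: transform() without y applies the plain per-category target mean to every row
X_train[cat_features] = loo_enc.transform(X_train[cat_features])
X_test[cat_features] = loo_enc.transform(X_test[cat_features])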

X_train.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,is_original
0,36,0.223048,599,0.121711,24,3,0.124383,4,0.147996,42,...,2,1,10,2,3,10,0,7,8,0
1,35,0.12859,921,0.173391,8,3,0.103659,1,0.147996,46,...,4,1,4,3,3,4,2,0,3,0
2,32,0.12859,718,0.173391,26,3,0.196141,3,0.147996,80,...,4,2,4,3,3,3,2,1,2,0
3,38,0.12859,1488,0.121711,2,3,0.124383,3,0.124063,40,...,3,0,15,1,1,6,0,0,2,0
4,50,0.12859,1017,0.121711,5,4,0.124383,2,0.124063,37,...,3,0,31,0,3,31,14,4,10,0


In [95]:
numerical_feats = list(set(df.columns) - set(cat_features))

(len(numerical_feats) + len(cat_features)) == len(df.columns)

True

In [96]:
cat_features

['BusinessTravel',
 'Department',
 'Education',
 'EducationField',
 'EnvironmentSatisfaction',
 'Gender',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'OverTime',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StockOptionLevel',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'is_original']

In [97]:
numerical_feats

['NumCompaniesWorked',
 'DistanceFromHome',
 'DailyRate',
 'HourlyRate',
 'TotalWorkingYears',
 'Age',
 'PercentSalaryHike',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'MonthlyRate',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager',
 'MonthlyIncome']

## Scaling the numerical features (tree-based models don't need it, but it does no harm)

In [98]:
sc = StandardScaler()
X_train[numerical_feats] = sc.fit_transform(X_train[numerical_feats])
X_test[numerical_feats] = sc.transform(X_test[numerical_feats])

### Let's separate comp and original sets

In [99]:
# split the training rows back into competition and original data
X_comp = X_train[X_train.is_original==0]
y_comp = y[X_comp.index]

X_original = X_train[X_train.is_original==1]
y_original = y[X_original.index].reset_index(drop=True)
X_original = X_original.reset_index(drop=True)

# Modelling

### But first, let's setup cross validation

In [23]:
# for i, (x, y) in enumerate(zip([1,2,3], [4,5,6])):
#     print(f"{'*'*10} {i}")
#     print(f"X: {x}")
#     print(f"Y: {y}")    

In [24]:
# a = np.array([1,2,3])
# b = np.array([4,5,6])

# np.append(a, b)

In [100]:
# # we train on the combined dataset, but calculate the validation score only on comp data

# # N_FOLDS = 10

# def cross_validate(X, y, model, model_verbose=None, verbose=None, X_original=None, y_original=None):
#     N_FOLDS = 5
#     all_scores = np.zeros(N_FOLDS)

#     skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)

#     for fold_id, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        
#         X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
#         y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
#         # for training we'll use data from both datasets
#         if X_original is not None:
#             X_tr = pd.concat([X_tr, X_original], axis=0)
#             y_tr = pd.concat([y_tr, y_original], axis=0)
               
#         model.fit(X_tr, y_tr, 
#                   eval_set=[(X_val, y_val)],
#                   early_stopping_rounds=50,
#                  verbose=model_verbose)
        
#         y_pred = model.predict_proba(X_val)[:, 1]
        
#         auc = roc_auc_score(y_val, y_pred)
        
#         print(f"Fold {fold_id} \t auc: {auc}")
        
#         all_scores[fold_id] = (auc)
    
#     avg_auc = np.mean(all_scores)
    
#     print(f"Avg AUC: {avg_auc}")

In [77]:
# # random params values - make sure to tune yours
# xgb_params = {'n_estimators': 150,
#                  'max_depth': 3,
#                  'learning_rate': 0.1,
#                  'min_child_weight': 4,
#                  'subsample': 0.7,
#                  'colsample_bytree': 0.3
#              }


# xgb_clf = xgb.XGBClassifier(**xgb_params)

# cross_validate(X_comp, y_comp, xgb_clf, model_verbose=False,
#                            X_original=X_original, y_original=y_original)

# # xgb_clf.fit(X_train, y, verbose=0)

Fold 0 	 auc: 0.8380912162162162
Fold 1 	 auc: 0.9057432432432433
Fold 2 	 auc: 0.819322033898305
Fold 3 	 auc: 0.8640677966101694
Fold 4 	 auc: 0.9020338983050847
Avg AUC: 0.8658516376546037


## INSIGHTS:
Let's use this method of cross validation to
* Tune all our models
* Select the top k
* Average their predictions
* Submit
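The select-and-average steps can be sketched as follows. The CV scores and prediction arrays are made-up placeholders, and `average_top_k` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np


def average_top_k(cv_scores, test_preds, k=2):
    """Pick the k models with the highest CV score and average their predictions."""
    top = sorted(cv_scores, key=cv_scores.get, reverse=True)[:k]
    blend = np.mean([test_preds[name] for name in top], axis=0)
    return top, blend


# Placeholder numbers, for illustration only.
cv_scores = {"xgb": 0.86, "lgbm": 0.85, "cat": 0.87}
test_preds = {
    "xgb": np.array([0.2, 0.8, 0.4]),
    "lgbm": np.array([0.1, 0.9, 0.5]),
    "cat": np.array([0.4, 0.6, 0.2]),
}

top, blend = average_top_k(cv_scores, test_preds, k=2)
print(top, blend)
```

Ranking by CV score rather than by public leaderboard score keeps the selection in line with the "trust your CV" rule.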

In [75]:
# np.random.randint(1, 10, size=(2,3))

array([[7, 2, 9],
       [7, 1, 7]])

In [83]:
# some_X = pd.DataFrame(data=np.random.randint(1, 10, size=(2,3)))
# pd.concat([some_X, some_X], axis=0).reset_index(drop=True)

Unnamed: 0,0,1,2
0,6,1,9
1,3,3,5
2,6,1,9
3,3,3,5


# Hyperparameters Tuning

## XGBoost

In [102]:
# def objective_xgb(trial, X, y, X_original, y_original):
#     params = {
#         'tree_method': "gpu_hist",
#         'n_estimators': trial.suggest_int('n_estimators', 50, 400),
#         'max_depth': trial.suggest_int('max_depth', 2, 10),
#         'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.3),
#         'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
#         'gamma': trial.suggest_loguniform('gamma', 0.00001, 0.3),
#         'subsample': trial.suggest_float('subsample', 0.2, 1.0, step=0.05),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 1.0, step=0.05),
#         'early_stopping_rounds': trial.suggest_int("early_stoppig_rounds", 40, 100)
#     }
#     # we train on the combined dataset, but calculate the validation score only on comp data

#     N_FOLDS = 5
#     all_scores = np.zeros(N_FOLDS)

#     skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)

#     for fold_id, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        
#         X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
#         y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
#         # for training we'll use data from both datasets
#         if X_original is not None:
#             X_tr = pd.concat([X_tr, X_original], axis=0)
#             y_tr = pd.concat([y_tr, y_original], axis=0)
        
#         model = xgb.XGBClassifier(**params)
#         model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        
#         y_pred = model.predict_proba(X_val)[:, 1]
                
#         auc = roc_auc_score(y_val, y_pred)        
#         all_scores[fold_id] = auc
    
#     avg_auc = np.mean(all_scores)
    
#     print(f"Avg AUC: {avg_auc}")
    
#     return avg_auc

In [None]:
# study_xgb = optuna.create_study(study_name="xgboost_tuning", direction="maximize")
# func = lambda trial: objective_xgb(trial, X_comp, y_comp, X_original, y_original)
# study_xgb.optimize(func, n_trials=100)

In [104]:
# study_xgb.best_value

0.8723964154832797

In [105]:
# study_xgb.best_params

{'n_estimators': 195,
 'max_depth': 4,
 'learning_rate': 0.1562142569601105,
 'min_child_weight': 9,
 'gamma': 0.062380752916410806,
 'subsample': 0.9000000000000001,
 'colsample_bytree': 0.2,
 'early_stoppig_rounds': 63}

## INSIGHTS:
BEST VALUES:
* leave_one_out_encoding: 0.87239
* weight of evidence: 0.87103

Although the public LB is little more than a luck-based casino game at this point, we'll still submit with the best params to check that this cross-validation technique is on the right track. The last time I tried it, it resulted in severe overfitting, most likely because I messed something up somewhere.

# Tuning LGBM

In [110]:
# from optuna.integration import LightGBMPruningCallback

# def objective_lgbm(trial, X, y, X_original, y_original):
#     param_grid = {
#         "device_type": "gpu",
#         "n_estimators": trial.suggest_int("n_estimators", 100, 2000),
#         "num_rounds": trial.suggest_int("num_rounds", 100, 500),
#         "learning_rate": trial.suggest_float("learning_rate", 0.0001, 0.3),
#         "num_leaves": trial.suggest_int("num_leaves", 20, 300),
#         "max_depth": trial.suggest_int("max_depth", 2, 12),
#         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 1000),
#         "lambda_l1": trial.suggest_loguniform('lambda_l1', 0.00001, 1.0),
#         "lambda_l2": trial.suggest_loguniform('lambda_l2', 0.00001, 1.0),
#         "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
#         "bagging_fraction":  trial.suggest_loguniform('bagging_fraction', 0.2, 1.0),
#         "feature_fraction": trial.suggest_loguniform('feature_fraction', 0.2, 1.0),
#         "early_stopping_rounds": trial.suggest_int("early_stopping_rounds", 50, 200),
#         "verbose": -1,
#     }

#     N_FOLDS = 5
#     all_scores = np.zeros(N_FOLDS)

#     skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)

#     for fold_id, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        
#         X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
#         y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
#         # for training we'll use data from both datasets
#         if X_original is not None:
#             X_tr = pd.concat([X_tr, X_original], axis=0)
#             y_tr = pd.concat([y_tr, y_original], axis=0)

            
#         model = lgbm.LGBMClassifier(objective="binary", is_unbalance=True, **param_grid)
#         model.fit(
#             X_tr,
#             y_tr,
#             eval_set=[(X_val, y_val)],
#             eval_metric="auc",
#             verbose=-1,
#         )
#         y_preds = model.predict_proba(X_val)[:, 1]
#         all_scores[fold_id] = roc_auc_score(y_val, y_preds)
    
#     auc = np.mean(all_scores)
#     print(f"AVG CV AUC: \t {auc}")
#     return auc

In [None]:
# study_lgbm = optuna.create_study(direction="maximize", study_name="LGBM Tuning")
# func = lambda trial: objective_lgbm(trial, X_comp, y_comp, X_original, y_original)
# study_lgbm.optimize(func, n_trials=100, show_progress_bar=True)

In [112]:
# study_lgbm.best_value

0.8695173499770957

In [109]:
# study_lgbm.best_params

{'n_estimators': 289,
 'num_rounds': 100,
 'learning_rate': 0.20387218552865483,
 'num_leaves': 49,
 'max_depth': 2,
 'min_data_in_leaf': 180,
 'lambda_l1': 0.29454856381940814,
 'lambda_l2': 0.04768773451967244,
 'min_gain_to_split': 2.4953566257592468,
 'bagging_fraction': 0.42646008454113976,
 'feature_fraction': 0.44305864350467467,
 'early_stopping_rounds': 117}

## INSIGHTS:
* 0.86737 with LeaveOneOutEncoder
* 0.86708 with WOE. almost the same tbh

# Tuning CatBoost

In [88]:
# def objective_cat(trial, X, y, X_original, y_original):
#     param = {
#         "n_estimators": trial.suggest_int("n_estimators", 100, 2000),
#         "loss_function": trial.suggest_categorical("loss_function", ["CrossEntropy"]),
#         "learning_rate": trial.suggest_loguniform("learning_rate", 1e-5, 1e0),
#         "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-2, 1e0),
#         "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
#         "depth": trial.suggest_int("depth", 1, 10),
#         "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
#         "bootstrap_type": trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]),
#         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 2, 20),
#         "one_hot_max_size": trial.suggest_int("one_hot_max_size", 2, 20),
#         "early_stopping_rounds": trial.suggest_int("early_stopping_rounds", 50, 200)
#     }
#     # Conditional Hyper-Parameters
#     if param["bootstrap_type"] == "Bayesian":
#         param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
#     elif param["bootstrap_type"] == "Bernoulli":
#         param["subsample"] = trial.suggest_float("subsample", 0.1, 1)
    
#     N_FOLDS = 5
#     all_scores = np.zeros(N_FOLDS)

#     skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)

#     for fold_id, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        
#         X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
#         y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
#         # for training we'll use data from both datasets
#         if X_original is not None:
#             X_tr = pd.concat([X_tr, X_original], axis=0)
#             y_tr = pd.concat([y_tr, y_original], axis=0)

#         cat_model = catboost.CatBoostClassifier(**param)
#         cat_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        
#         y_preds = cat_model.predict_proba(X_val)[:, 1]
#         all_scores[fold_id] = roc_auc_score(y_val, y_preds)
    
#     auc = np.mean(all_scores)
#     print(f"AVG CV AUC: \t {auc}")
#     return auc

In [89]:
# study_cat = optuna.create_study(direction="maximize", study_name="CatBoost Tuning")
# func = lambda trial: objective_cat(trial, X_comp, y_comp, X_original, y_original)
# study_cat.optimize(func, n_trials=100, show_progress_bar=True)

[I 2023-01-21 09:38:12,857] A new study created in memory with name: CatBoost Tuning

AVG CV AUC: 	 0.8692959230416857
[I 2023-01-21 09:38:29,626] Trial 0 finished with value: 0.8692959230416857 and parameters: {'n_estimators': 967, 'loss_function': 'CrossEntropy', 'learning_rate': 0.06906434100010994, 'l2_leaf_reg': 0.09116141085889043, 'colsample_bylevel': 0.048959799203239125, 'depth': 3, 'boosting_type': 'Ordered', 'bootstrap_type': 'Bayesian', 'min_data_in_leaf': 4, 'one_hot_max_size': 12, 'early_stopping_rounds': 106, 'bagging_temperature': 2.13069839116252}. Best is trial 0 with value: 0.8692959230416857.
AVG CV AUC: 	 0.8582826958314247
[I 2023-01-21 09:38:41,458] Trial 1 finished with value: 0.8582826958314247 and parameters: {'n_estimators': 1320, 'loss_function': 'CrossEntropy', 'learning_rate': 0.10643359036486615, 'l2_leaf_reg': 0.4709594902098641, 'colsample_bylevel': 0.04160212144114572, 'depth': 10, 'boosting_type': 'Ordered', 'bootstrap_type': 'MVS', 'min_data_in_leaf': 7, 'one_hot_max_size': 4, 'early_stopping_rounds': 167}. Best 

In [90]:
# study_cat.best_value

0.8747553825011452

In [91]:
# study_cat.best_params

{'n_estimators': 324,
 'loss_function': 'CrossEntropy',
 'learning_rate': 0.3808994381813513,
 'l2_leaf_reg': 0.524131545356297,
 'colsample_bylevel': 0.07453454565627973,
 'depth': 2,
 'boosting_type': 'Ordered',
 'bootstrap_type': 'MVS',
 'min_data_in_leaf': 14,
 'one_hot_max_size': 18,
 'early_stopping_rounds': 183}

## INSIGHTS:
* 0.878423 with Leave One Out Encoding
* 0.87475 with WOE, noticeably worse. So let's stick to leave one out

In [39]:
# xgb_params = {'n_estimators': 177,
#              'max_depth': 3,
#              'learning_rate': 0.2814,
#              'min_child_weight': 8,
#              'gamma': 0.0001,
#              'subsample': 0.75,
#              'colsample_bytree': 0.2,
#              'early_stoppig_rounds': 79}

In [122]:
X_train_fr, X_val, y_train_fr, y_val = train_test_split(X_comp, y_comp, test_size=0.1, shuffle=True, random_state=1337,
                                                        stratify=y_comp)


X_train_fr = pd.concat([X_train_fr, X_original])
y_train_fr = pd.concat([y_train_fr, y_original])

In [None]:
# xgb_tuned_clf = xgb.XGBClassifier(**xgb_params)
# xgb_tuned_clf.fit(X_train_fr, y_train_fr, eval_set=[(X_val, y_val)], verbose=False)

In [44]:
# xgb_tuned_preds = xgb_tuned_clf.predict_proba(X_test)[:, 1]

In [121]:
# best xgb params
xgb_params = {'n_estimators': 195,
                 'max_depth': 4,
                 'learning_rate': 0.1562142569601105,
                 'min_child_weight': 9,
                 'gamma': 0.062380752916410806,
                 'subsample': 0.9000000000000001,
                 'colsample_bytree': 0.2,
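                 # NOTE: misspelled key (copied from the Optuna trial name); XGBoost ignores it,
                 # so early stopping is not actually active for this model
                 'early_stoppig_rounds': 63}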


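# best lgbm params
# (note: 'n_estimators' and 'num_rounds' are both aliases of num_iterations in LightGBM,
#  so only one of them takes effect)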
lgbm_params = {'n_estimators': 289,
                 'num_rounds': 100,
                 'learning_rate': 0.20387218552865483,
                 'num_leaves': 49,
                 'max_depth': 2,
                 'min_data_in_leaf': 180,
                 'lambda_l1': 0.29454856381940814,
                 'lambda_l2': 0.04768773451967244,
                 'min_gain_to_split': 2.4953566257592468,
                 'bagging_fraction': 0.42646008454113976,
                 'feature_fraction': 0.44305864350467467,
                 'early_stopping_rounds': 117}


# best catboost params
cat_params = {'n_estimators': 1054,
                 'loss_function': 'CrossEntropy',
                 'learning_rate': 0.28958661851562734,
                 'l2_leaf_reg': 0.03231273388976541,
                 'colsample_bylevel': 0.08854889705957293,
                 'depth': 1,
                 'boosting_type': 'Plain',
                 'bootstrap_type': 'MVS',
                 'min_data_in_leaf': 8,
                 'one_hot_max_size': 18,
                 'early_stopping_rounds': 181}

In [124]:
# Okay, let's try submitting the simple average of the best models, starting with XGBoost
xgb_model = xgb.XGBClassifier(**xgb_params)
xgb_model.fit(X_train_fr, y_train_fr, eval_set=[(X_val, y_val)], verbose=False)

xgb_preds = xgb_model.predict_proba(X_test)[:, 1]

Parameters: { "early_stoppig_rounds" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.
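The warning above comes from a misspelled key in the params dict. A small check like the following could catch such typos before fitting. It is a sketch only: `check_param_names` is a hypothetical helper and the list of valid names is hand-written here; for scikit-learn style estimators it could in principle come from `estimator.get_params()`.

```python
import difflib


def check_param_names(params, valid_names):
    """Map each unknown key to its closest valid name (or None if nothing is close)."""
    problems = {}
    for name in params:
        if name not in valid_names:
            guess = difflib.get_close_matches(name, valid_names, n=1)
            problems[name] = guess[0] if guess else None
    return problems


# Hand-written list of expected parameter names (an assumption for this sketch).
valid = ["n_estimators", "max_depth", "learning_rate", "min_child_weight",
         "gamma", "subsample", "colsample_bytree", "early_stopping_rounds"]

params = {"n_estimators": 195, "max_depth": 4, "early_stoppig_rounds": 63}

problems = check_param_names(params, valid)
print(problems)
```

An empty result means every key is known; anything else is worth fixing before training, since silently ignored parameters change what the model actually does.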




In [125]:
# LightGBM with its tuned params
lgbm_model = lgbm.LGBMClassifier(objective="binary", is_unbalance=True, **lgbm_params)
lgbm_model.fit(X_train_fr, y_train_fr, eval_set=[(X_val, y_val)], verbose=-1)

lgbm_preds = lgbm_model.predict_proba(X_test)[:, 1]



In [127]:
# CatBoost with its tuned params
cat_model = catboost.CatBoostClassifier(**cat_params)
cat_model.fit(X_train_fr, y_train_fr, eval_set=[(X_val, y_val)], verbose=False)

cat_preds = cat_model.predict_proba(X_test)[:, 1]

In [132]:
y_final = np.stack([xgb_preds, lgbm_preds, cat_preds]).mean(axis=0)

In [133]:
# build the submission from the averaged predictions
submission = pd.DataFrame({"id": test_idx, "Attrition": y_final})
submission.head()

Unnamed: 0,id,Attrition
0,1677,0.203941
1,1678,0.198539
2,1679,0.093485
3,1680,0.088907
4,1681,0.478767


In [46]:
# # non-overfitting predictions
# submission = pd.DataFrame({"id": test_idx, "Attrition": y_final})
# submission.head()

Unnamed: 0,id,Attrition
0,1677,0.093922
1,1678,0.051655
2,1679,0.040405
3,1680,0.133737
4,1681,0.819291


In [86]:
# # non-overfitting predictions
# submission = pd.DataFrame({"id": test_idx, "Attrition": xgb_tuned_preds})
# submission.head()

Unnamed: 0,id,Attrition
0,1677,0.196305
1,1678,0.043251
2,1679,0.010886
3,1680,0.082801
4,1681,0.132661


In [41]:
# # non-overfitting predictions
# submission = pd.DataFrame({"id": test_idx, "Attrition": xgb_tuned_preds})
# submission.head()

Unnamed: 0,id,Attrition
0,1677,0.40117
1,1678,0.117693
2,1679,0.000777
3,1680,0.015711
4,1681,0.295577


In [42]:
# OVERFITTED PREDICTIONS
# submission = pd.DataFrame({"id": test_idx, "Attrition": xgb_tuned_preds})
# submission.head()

Unnamed: 0,id,Attrition
0,1677,0.021201
1,1678,0.001126
2,1679,1.7e-05
3,1680,0.000405
4,1681,0.947035


In [134]:
submission.to_csv("submission.csv", index=False)