# Employee Attrition Prediction

## Problem Statement

Predict employee attrition using CatBoost and XGBoost.

## Objectives

* explore the employee attrition dataset
* apply CatBoost and XGBoost on the dataset
* tune the model hyperparameters to improve accuracy
* evaluate the model using suitable metrics


## Dataset

The dataset used for this mini-project is the [HR Employee Attrition dataset](https://data.world/aaizemberg/hr-employee-attrition). It was synthetically created by IBM data scientists and contains 35 features and 1,470 records.

### Download Dataset

### Install CatBoost

In [None]:
!pip -qq install catboost

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, f1_score, classification_report, roc_curve
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier, metrics
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight') 
pd.set_option('display.max_columns', 100)
%matplotlib inline

## Load the Dataset

**Read the dataset**

In [None]:
# read the dataset
df = pd.read_csv('/content/wa_fn_usec_hr_employee_attrition_tsv.csv')

In [None]:
test_df = pd.read_csv('/content/hr_employee_attrition_test.csv')

In [None]:
# Check the shape of dataframe. 
df.shape

In [None]:
df.head()

## Data Exploration

- Check for missing values
- Check that each feature has a consistent data type
- Check for outliers or inconsistencies in the data columns
- Check for correlated features
- Check whether the target label is imbalanced
- Examine how the independent variables are distributed relative to the target label
- Look for features with strong linear or monotonic relationships; a correlation heatmap makes possible collinearity easy to spot

**Create a `List` of numerical and categorical columns. Display a statistical description of the dataset. Remove missing values**

In [None]:
df.info()

In [None]:
categorical = [col for col in df.columns if df[col].dtypes == 'object']
numerical = [col for col in df.columns if df[col].dtypes != 'object']

categorical.remove('attrition')

In [None]:
df[categorical] = df[categorical].astype('category')

In [None]:
from sklearn.feature_selection import VarianceThreshold
var_thres = VarianceThreshold(threshold=0.2)

var_thres.fit(df[numerical])

constant_columns = [column for column in df[numerical].columns if column not in df[numerical].columns[var_thres.get_support()]]
constant_columns
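
The variance filter above can be illustrated on a small made-up DataFrame. This is a minimal sketch: the column names and values are hypothetical and not taken from the HR dataset. Note that with `threshold=0.2` the filter removes low-variance columns as well as strictly constant ones.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): one constant, one low-variance, one informative column
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],
    "low_var": [0, 0, 0, 0, 1],   # population variance = 0.16
    "age": [25, 32, 41, 29, 50],
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

# get_support() is True for columns whose variance exceeds the threshold
kept = list(toy.columns[selector.get_support()])
dropped = [c for c in toy.columns if c not in kept]
print("kept:", kept)
print("dropped:", dropped)
```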

In [None]:
for i in constant_columns:
  numerical.remove(i)

In [None]:
df.drop(columns=constant_columns, axis=1, inplace=True)

In [None]:
features = list(df.columns)
features.remove('attrition')

In [None]:
categorical_features_indices = [features.index(cat) for cat in categorical]

In [None]:
categorical_features_indices

In [None]:
df.isna().sum()

### Check for outliers

**Create a box plot to check for outliers**

In [None]:
# Check for outliers
for num in numerical:
  fig, ax = plt.subplots(figsize=(4,4))
  sns.boxplot(data=df, x=num, hue='attrition', ax=ax)

### Handling outliers

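**Clip each numerical feature to its interquartile range: values below the 25th percentile are set to the 25th percentile, and values above the 75th percentile are set to the 75th percentile**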

In [None]:
df.describe().T

In [None]:
# Clip a numerical column to the range [Q1, Q3]
def handle_outlier(df, col):
  q1 = df[col].quantile(0.25)
  q3 = df[col].quantile(0.75)
  # Values above Q3 are set to Q3; values below Q1 are set to Q1
  return np.where(df[col] > q3, q3, np.where(df[col] < q1, q1, df[col]))

In [None]:
for num in numerical:
  df[num] = handle_outlier(df, num)
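
To see what this clipping rule does, here is a minimal sketch on a made-up series (the income values are hypothetical). It applies the same rule as `handle_outlier` and shows that every value outside the middle 50% is changed, not only the extreme one.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly incomes with one extreme value
income = pd.Series([2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 60000])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)

# Same rule as handle_outlier: cap at Q3 from above and at Q1 from below
clipped = np.where(income > q3, q3, np.where(income < q1, q1, income))

print("Q1:", q1, "Q3:", q3)
print(clipped)
```

Because the bounds are the quartiles themselves, roughly half of the values in each column are altered, which is worth keeping in mind when reading the box plots after this step.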

In [None]:
# Recheck for outliers
df.describe().T

In [None]:
# Check for outliers
for num in numerical:
  fig, ax = plt.subplots(figsize=(4,4))
  sns.boxplot(data=df, x=num, hue='attrition', ax=ax)

### Target label imbalance

**Check if there is an imbalance in target label**

In [None]:
# Count of unique values in Attrition column
df['attrition'].value_counts()

In [None]:
# Plot barplot to visualize balance/imbalance
fig, ax = plt.subplots(figsize=(8,8))
sns.countplot(x=df['attrition'], ax=ax)
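
The models later in this notebook pass a `scale_pos_weight` parameter to compensate for class imbalance. A common heuristic is to start from the ratio of negative to positive samples. The sketch below shows that calculation on made-up labels; the counts are hypothetical and are not the real dataset counts.

```python
import pandas as pd

# Hypothetical labels with a 5:1 imbalance
labels = pd.Series(["No"] * 50 + ["Yes"] * 10, name="attrition")

counts = labels.value_counts()
share = labels.value_counts(normalize=True)

# Heuristic starting point for scale_pos_weight: negatives / positives
pos_weight = counts["No"] / counts["Yes"]

print(counts.to_dict())
print(share.round(3).to_dict())
print("suggested scale_pos_weight:", pos_weight)
```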

In [None]:
X = df.drop(columns=['attrition'])
y = df[['attrition']]

### Plot pairplot

**Visualize the pairwise relationships among the numerical predictor variables using a pairplot**

In [None]:
# Visualize a pairplot with relevant features
sns.pairplot(df[numerical])

### Explore Correlation

- Plotting the Heatmap

**Visualize the correlation among IBM employee attrition numerical features using a heatmap**

In [None]:
# Visualize heatmap
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(df[numerical].corr(), annot=True, ax=ax)
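
A heatmap is convenient for scanning, but a short list of highly correlated pairs can be easier to act on. The following is a minimal sketch on toy data; the column names, values, and the 0.9 cutoff are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy numerical data (hypothetical column names and values)
toy = pd.DataFrame({
    "years_at_company": [1, 3, 5, 7, 9, 11],
    "years_in_role":    [1, 2, 4, 6, 8, 10],
    "distance":         [7, 2, 9, 1, 8, 3],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds the illustrative cutoff
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```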

### Preparing the test feature space
* Remove outliers if any
* Handle the categorical feature if required
* Other processing steps can also be followed.

In [None]:
test_df.columns

In [None]:
test_df.drop(columns=['id'], axis=1, inplace=True)

In [None]:
test_df.drop(columns=constant_columns, axis=1, inplace=True)

In [None]:
for num in numerical:
  test_df[num] = handle_outlier(test_df, num)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
from sklearn.model_selection import StratifiedKFold
df['kfold'] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_indices, valid_indices) in enumerate(skf.split(df.drop(columns=['attrition']), df[['attrition']])):
  df.loc[valid_indices, 'kfold'] = fold

In [None]:
df['kfold'].value_counts()
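
Besides the fold sizes, it can be useful to confirm that stratification keeps the positive rate the same in every fold. This is a minimal sketch on a toy frame (the data is made up; only the fold-assignment pattern mirrors the cell above).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame (hypothetical): 50 rows, 20% positive class
toy = pd.DataFrame({
    "feature": np.arange(50),
    "attrition": [1] * 10 + [0] * 40,
})

toy["kfold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, valid_idx) in enumerate(skf.split(toy[["feature"]], toy["attrition"])):
    # valid_idx is positional; it matches the labels here because of the default RangeIndex
    toy.loc[valid_idx, "kfold"] = fold

# Positive rate per fold should match the overall rate
fold_rates = toy.groupby("kfold")["attrition"].mean()
print(fold_rates)
```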

In [None]:
df['attrition'] = np.where(df['attrition'] == 'Yes', 1, np.where(df['attrition'] == 'No', 0, df['attrition']))

In this notebook, data processing is done separately for each model, because different models may require the data in different formats and therefore different processing steps.

If the processing steps are the same for all models, the data can be processed once.

## Apply CatBoost

### Data Processing for CatBoost

**Data processing for CatBoost**
* **Copy the dataframe that was created after removing the outliers**
* **Handle the categorical features if required**
* **Create target column and feature space**

In [None]:
# Copy the data
catb_df = df.copy()

In [None]:
# Target Column
y = catb_df[['attrition']]

In [None]:
# Feature Space
X = catb_df.drop(columns=['attrition'])

In [None]:
categorical_features_indices

In [None]:
features

### Model Definition

**Define, train the model and display the results**

In [None]:
# Create CatBoost model
from catboost import CatBoostClassifier

from sklearn.metrics import f1_score,roc_auc_score

prediction = []
score = []

for fold in range(5):
    X_train = catb_df[catb_df.kfold != fold].reset_index(drop=True) 
    X_val = catb_df[catb_df.kfold == fold].reset_index(drop=True) 
    X_test = test_df.copy()
  
    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[features]
    X_val = X_val[features]

    # catboost modelling 
    model=CatBoostClassifier(n_estimators=500, max_depth=5, learning_rate=0.01, early_stopping_rounds=5, scale_pos_weight=5)
    model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_val, y_val), verbose=False)

    preds_valid = model.predict(X_val)

    # Apply the trained model to the test data and predict the output
    test_predict = model.predict(X_test)
    prediction.append(test_predict)
    f1= f1_score(y_val,preds_valid, average='weighted')

    #Score 
    score.append(f1)
    print(f"fold:{fold},f1 score:{f1}")

print(np.mean(score),np.std(score))

In [None]:
!pip install optuna

In [None]:
import optuna 

def hyp_optimizer(trial):
    fold = 0
    # hyperparameters for CatBoost
    param = {}
    param['learning_rate'] = trial.suggest_float("learning_rate", 1e-2, 0.25, log=True)
    param['depth'] = trial.suggest_int('depth', 1, 11)
    param['scale_pos_weight'] = trial.suggest_int('scale_pos_weight', 1, 10)
    param['l2_leaf_reg'] = trial.suggest_float('l2_leaf_reg', 0.0001, 1.0, log = True)
    param['min_child_samples'] = trial.suggest_categorical('min_child_samples', [1, 4, 8, 16, 32])
    param['subsample'] = trial.suggest_float('subsample', 0.1, 0.8)
    param['n_estimators'] = trial.suggest_int('n_estimators', 500, 8000)     
    param['early_stopping_rounds'] = trial.suggest_int('early_stopping_rounds', 5, 100)                                  
    param['grow_policy'] = 'Depthwise'
    param['use_best_model'] = True
    param['eval_metric'] = 'F1'
    param['od_type'] = 'Iter'
    param['od_wait'] = 50
    param['random_state'] = 42
    param['logging_level'] = 'Silent'


    X_train = catb_df[catb_df.kfold != fold].reset_index(drop=True)
    X_val = catb_df[catb_df.kfold == fold].reset_index(drop=True)
    # X_test = test_df.copy()

    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[features]
    X_val = X_val[features]

    # catboost modelling 
    model=CatBoostClassifier(**param)
    model.fit(X_train, y_train,cat_features=categorical_features_indices,eval_set=(X_val, y_val),plot=True)

    preds_valid = model.predict(X_val)

    #Training model apply the test data and predict the output
    # test_predict = model.predict(X_test)
    # prediction.append(test_predict)
    f1= f1_score(y_val,preds_valid, average='weighted')

    #Score 
    # score.append(roc1)
    # print(f"fold:{fold},roc:{roc1}")

    return f1

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(hyp_optimizer, n_trials=100)

In [None]:
print(study.best_params)
print(study.best_trial) 

In [None]:
# Create CatBoost model
from catboost import CatBoostClassifier

from sklearn.metrics import f1_score,roc_auc_score

prediction = []
score = []

for fold in range (5):
    X_train = catb_df[catb_df.kfold != fold].reset_index(drop=True)
    X_val = catb_df[catb_df.kfold == fold].reset_index(drop=True)
    X_test = test_df.copy()

    param = {}
    param['learning_rate'] = 0.07696560775064637
    param['depth'] = 3
    param['scale_pos_weight'] = 3
    param['l2_leaf_reg'] = 0.004017265423496804
    param['min_child_samples'] = 8
    param['subsample'] = 0.3799643651983675
    param['n_estimators'] = 3731                                    
    param['grow_policy'] = 'Depthwise'
    param['use_best_model'] = True
    param['eval_metric'] = 'F1'
    param['od_type'] = 'Iter'
    param['od_wait'] = 50
    param['random_state'] = 42
    param['logging_level'] = 'Silent'
  
    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[features]
    X_val = X_val[features]

    # catboost modelling 
    model=CatBoostClassifier(**param)
    model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_val, y_val), verbose=False)

    preds_valid = model.predict(X_val)

    # Apply the trained model to the test data and predict the output
    test_predict = model.predict(X_test)
    prediction.append(test_predict)
    f1= f1_score(y_val,preds_valid, average='weighted')

    #Score 
    score.append(f1)
    print(f"fold:{fold},f1 score:{f1}")

print(np.mean(score),np.std(score))

In [None]:
prediction

In [None]:
final_predict = stats.mode(np.column_stack(prediction),axis=1, keepdims=True)[0]
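
The cell above takes the most frequent prediction across the five fold models for each test row. For binary labels and an odd number of folds, the same majority vote can be written with NumPy alone. This is a minimal sketch with made-up fold predictions.

```python
import numpy as np

# Hypothetical 0/1 test predictions from 5 fold models for 4 test rows
fold_preds = [
    np.array([1, 0, 0, 1]),
    np.array([1, 0, 1, 0]),
    np.array([1, 0, 0, 0]),
    np.array([0, 0, 1, 1]),
    np.array([1, 1, 1, 0]),
]

# Shape (n_rows, n_folds): one column per fold model
votes = np.column_stack(fold_preds)

# A row is labelled 1 when more than half of the fold models vote 1
majority = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)
print(majority)
```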

In [None]:
test_df.index += 1

In [None]:
test_df.index.name = 'id'
test_df['label'] = final_predict

In [None]:
test_df

In [None]:
test_df[['label']].to_csv('final_submission_df.csv')

## Apply XGBoost

### Data Processing for XGBoost


**Data Processing for XGBoost**
* **Copy the dataframe after the outliers were removed.**
* **Handle the categorical features if required**
* **Create target column and feature space**

In [None]:
# Copy dataframe
xgb_df = df.copy()

In [None]:
new_test_df = test_df.copy()

In [None]:
# Handling categorical features
categorical_df = pd.get_dummies(xgb_df[categorical])

In [None]:
# Concat the dummy variables to actual dataframe and remove initial categorical columns
xgb_df.drop(columns=categorical, inplace=True)

xgb_df = pd.concat([xgb_df, categorical_df], axis=1)

xgb_df.columns

In [None]:
# Feature Space
X = xgb_df.drop(columns=['attrition'])

# Target label
y = xgb_df[['attrition']]

In [None]:
new_features = list(xgb_df.columns)
new_features.remove('attrition')
new_features.remove('kfold')

In [None]:
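test_categorical_df = pd.get_dummies(test_df[categorical])

new_test_df.drop(columns=categorical, inplace=True)

new_test_df = pd.concat([new_test_df, test_categorical_df], axis=1)

# Align the test columns with the training feature list: this drops extra columns
# (such as a previously added 'label') and adds any missing dummy columns as zeros
new_test_df = new_test_df.reindex(columns=new_features, fill_value=0)

new_test_df.columns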
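
One-hot encoding the train and test sets separately can produce different columns when a category is absent from one of them. One way to guard against this is to reindex the test frame to the training columns. A minimal sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical train/test frames; the test set lacks the "Sales" category
train = pd.DataFrame({"age": [30, 41, 25], "department": ["Sales", "R&D", "HR"]})
test = pd.DataFrame({"age": [35, 28], "department": ["R&D", "HR"]})

train_enc = pd.get_dummies(train, columns=["department"])
test_enc = pd.get_dummies(test, columns=["department"])

# Reindex to the training columns; categories unseen in test become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```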

### Model Definition

**Define, train the model and display the results**

In [None]:
# Create XGBoost classifier model
from xgboost import XGBClassifier

prediction = []
score = []

for fold in range (5):
    X_train = xgb_df[xgb_df.kfold != fold].reset_index(drop=True)
    X_val = xgb_df[xgb_df.kfold == fold].reset_index(drop=True)
    X_test = new_test_df.copy()
  
    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[new_features]
    X_val = X_val[new_features]

    # xgboost modelling 
    # Recent XGBoost releases take early_stopping_rounds on the estimator rather than in fit()
    model = XGBClassifier(early_stopping_rounds=100)
    model.fit(X_train,y_train,eval_set=[(X_val,y_val)],verbose=False)

    preds_valid = model.predict(X_val)

    # Apply the trained model to the test data and predict the output
    test_predict = model.predict(X_test)
    prediction.append(test_predict)
    f1= f1_score(y_val,preds_valid, average='weighted')
    # ROC AUC is computed from predicted probabilities rather than hard labels
    roc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    acc = accuracy_score(y_val, preds_valid)

    #Score 
    score.append(f1)
    print(f"fold:{fold},f1 score:{f1}")
    print(f"fold:{fold},roc score:{roc}")
    print(f"fold:{fold},accuracy score:{acc}")
    print('-'*15)


print(np.mean(score),np.std(score))

In [None]:
!pip install optuna

In [None]:
import optuna

### Hyperparameter Tuning - XGBoost

In [None]:
def hyp_optimizer_xgb(trial):
    fold = 2
    # hyperparameters for XGBoost
    param = {}

    param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 1, log=True)
    param['n_estimators'] = trial.suggest_int('n_estimators', 100, 8000)
    param['scale_pos_weight'] = trial.suggest_int('scale_pos_weight', 1, 10)
    param['reg_lambda'] = trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True)
    param['reg_alpha'] = trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True)
    param['subsample'] = trial.suggest_float("subsample", 0.1, 1.0)
    param['colsample_bytree'] = trial.suggest_float("colsample_bytree", 0.1, 1.0)
    param['max_depth'] = trial.suggest_int("max_depth", 1,20)
    param['eval_metric'] = 'auc'

    X_train = xgb_df[xgb_df.kfold != fold].reset_index(drop=True)
    X_val = xgb_df[xgb_df.kfold == fold].reset_index(drop=True)
    # X_test = test_df.copy()

    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[new_features]
    X_val = X_val[new_features]

    # XGBClassifier modelling (recent XGBoost releases take early_stopping_rounds on the estimator)
    model = XGBClassifier(**param, early_stopping_rounds=100)

    # pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "validation-auc")

    model.fit(X_train,y_train,eval_set=[(X_val,y_val)], verbose=False)

    preds_valid = model.predict(X_val)

    #Training model apply the test data and predict the output
    # test_predict = model.predict(X_test)
    # prediction.append(test_predict)
    f1= f1_score(y_val,preds_valid, average='weighted')

    #Score 
    # score.append(roc1)
    # print(f"fold:{fold},roc:{roc1}")

    return f1

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(hyp_optimizer_xgb, n_trials=100)

In [None]:
print(study.best_params)
print(study.best_trial) 

In [None]:
# Create XGBoost classifier model with optimal hyperparameters
from xgboost import XGBClassifier

prediction = []
score = []

for fold in range(5):
    X_train = xgb_df[xgb_df.kfold != fold].reset_index(drop=True)
    X_val = xgb_df[xgb_df.kfold == fold].reset_index(drop=True)
    
    X_test = new_test_df.copy()

    param = {}
    param['learning_rate'] = study.best_params['learning_rate']
    param['n_estimators'] = study.best_params['n_estimators']
    param['scale_pos_weight'] = study.best_params['scale_pos_weight']
    param['reg_lambda'] = study.best_params['reg_lambda']
    param['reg_alpha'] = study.best_params['reg_alpha']
    param['subsample'] = study.best_params['subsample']
    param['colsample_bytree'] = study.best_params['colsample_bytree']
    param['max_depth'] = study.best_params['max_depth']
    param['eval_metric'] = 'auc'

  
    # dependent variables 
    y_train = X_train['attrition'].astype(int)
    y_val = X_val['attrition'].astype(int)

    # independent variables
    X_train = X_train[new_features]
    X_val = X_val[new_features]

    # xgboost modelling 
    # Recent XGBoost releases take early_stopping_rounds on the estimator rather than in fit()
    model = XGBClassifier(**param, early_stopping_rounds=100)
    model.fit(X_train,y_train,eval_set=[(X_val,y_val)],verbose=False)

    y_pred = model.predict(X_val)

    # preds_proba = model.predict_proba(X_val)[:, 1]
  
    # fpr1, tpr1, thresh1 = roc_curve(y_val, preds_proba)

    # gmeans = np.sqrt(tpr1 * (1-fpr1))
    # ix = np.argmax(gmeans)

    # y_pred = np.where(preds_proba >=0.45, 1, 0)
    # preds_valid = model.predict(X_val)

    # Apply the trained model to the test data and predict the output
    test_predict = model.predict(X_test)
    # test_predict_proba = model.predict_proba(X_test)
    # test_predict = np.where(test_predict_proba >=0.45, 1, 0)
    prediction.append(test_predict)

    f1= f1_score(y_val,y_pred, average='weighted')
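    # ROC AUC is computed from predicted probabilities rather than hard labels
    roc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])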
    acc = accuracy_score(y_val, y_pred)

    #Score 
    score.append(f1)
    print(f"fold:{fold},f1 score:{f1}")
    print(f"fold:{fold},roc score:{roc}")
    print(f"fold:{fold},accuracy score:{acc}")
    print('-'*15)

print(np.mean(score),np.std(score))
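
The commented-out lines in the cell above hint at choosing a probability threshold from the ROC curve using the geometric mean of sensitivity and specificity. The sketch below works through that idea on made-up validation labels and probabilities (all values are illustrative). ROC AUC is computed here from the probabilities rather than from hard labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical validation labels and predicted probabilities for the positive class
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70, 0.90, 0.15])

fpr, tpr, thresholds = roc_curve(y_val, proba)

# G-mean balances sensitivity (tpr) and specificity (1 - fpr)
gmeans = np.sqrt(tpr * (1 - fpr))
best = np.argmax(gmeans)
threshold = thresholds[best]

# Label a row positive when its probability reaches the selected threshold
y_pred = np.where(proba >= threshold, 1, 0)

auc = roc_auc_score(y_val, proba)
print("AUC:", auc)
print("threshold:", threshold)
print("predictions:", y_pred)
```

A threshold chosen this way is tuned on the validation fold, so it is best checked on data that was not used to select it.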

### Model Performance

In [None]:
final_predict_mean = np.mean(np.column_stack(prediction),axis=1)

In [None]:
final_predict_mean

In [None]:
final_predict_values = np.where(final_predict_mean>0.4, 1, 0)

In [None]:
final_predict = stats.mode(np.column_stack(prediction),axis=1, keepdims=True)[0]

In [None]:
test_df.index -= 1

In [None]:
test_df.index.name = 'id'

In [None]:
test_df['label'] = final_predict_values
test_df[['label']].to_csv('final_submission_df10.csv')

## Apply LightGBM

### Feature Engineering for LightGBM

In [None]:
## Following the same procedure as followed in XGBoost

# Copy the dataframe
lgbm_df = df.copy()  

# Handling categorical features
categorical_df = pd.get_dummies(lgbm_df[categorical])

# Concat the dummy variables to actual dataframe and remove initial categorical columns
lgbm_df.drop(columns=categorical, inplace=True)
lgbm_df = pd.concat([lgbm_df, categorical_df], axis=1)
 
# Feature Space
X = lgbm_df.drop(columns=['attrition'])

# Target label
y = lgbm_df[['attrition']]   

new_features = list(lgbm_df.columns)
new_features.remove('attrition')
new_features.remove('kfold')

### Model Definition

In [None]:
# Create LightGBM classifier model
prediction = []
score = []

for fold in range (5):
  X_train = lgbm_df[lgbm_df.kfold != fold].reset_index(drop=True)
  X_val = lgbm_df[lgbm_df.kfold == fold].reset_index(drop=True)
  X_test = new_test_df.copy()

  # dependent variables 
  y_train = X_train['attrition'].astype(int)
  y_val = X_val['attrition'].astype(int)

  # independent variables
  X_train = X_train[new_features]
  X_val = X_val[new_features]

  # lgbm modelling
  model = LGBMClassifier(learning_rate = 0.1)
  model.fit(X_train, y_train, early_stopping_rounds=100, eval_set=[(X_val,y_val)], verbose=False)

  preds_valid = model.predict(X_val)

  # Apply the trained model to the test data and predict the output
  test_predict = model.predict(X_test)
  prediction.append(test_predict)
  f1= f1_score(y_val,preds_valid, average='weighted')

  #Score 
  score.append(f1)
  print(f"fold:{fold},f1 score:{f1}")

print(np.mean(score),np.std(score))

In [None]:
lgbm_df

In [None]:
# Create LightGBM classifier model
import optuna

def hyp_optimizer_lgbm(trial):
  prediction = []
  score = []

  fold = 0
  X_train = lgbm_df[lgbm_df.kfold != fold].reset_index(drop=True)
  X_val = lgbm_df[lgbm_df.kfold == fold].reset_index(drop=True)
  X_test = new_test_df.copy()

  param = {}

  param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 1, log=True)
  param['n_estimators'] = trial.suggest_int('n_estimators', 100, 8000)
  param['scale_pos_weight'] = trial.suggest_int('scale_pos_weight', 1, 10)
  param['reg_lambda'] = trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True)
  param['reg_alpha'] = trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True)
  param['subsample'] = trial.suggest_float("subsample", 0.1, 1.0)
  param['colsample_bytree'] = trial.suggest_float("colsample_bytree", 0.1, 1.0)
  param['max_depth'] = trial.suggest_int("max_depth", 1,20)
  # param['num_leaves'] = trial.suggest_int('num_leaves', 1, 1000)
  # param['min_child_samples'] = trial.suggest_int('min_child_samples', 1, 300)
  # param['cat_smooth'] = trial.suggest_int('min_data_per_groups', 1, 100)


  # dependent variables 
  y_train = X_train['attrition'].astype(int)
  y_val = X_val['attrition'].astype(int)

  # independent variables
  X_train = X_train[new_features]
  X_val = X_val[new_features]

  # lgbm modelling
  model = LGBMClassifier(**param)
  model.fit(X_train, y_train, early_stopping_rounds=100, eval_set=[(X_val,y_val)], verbose=False)

  preds_valid = model.predict(X_val)

  #Training model apply the test data and predict the output
  # test_predict = model.predict(X_test)
  # prediction.append(test_predict)
  f1 = f1_score(y_val,preds_valid, average='weighted')

  #Score 
  # score.append(roc1)
  # print(f"fold:{fold},roc:{roc1}")

  return f1

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(hyp_optimizer_lgbm, n_trials=100)

In [None]:
print(study.best_params)
print(study.best_trial) 

In [None]:
# Create LightGBM classifier model
prediction = []
score = []

for fold in range (5):
  X_train = lgbm_df[lgbm_df.kfold != fold].reset_index(drop=True)
  X_val = lgbm_df[lgbm_df.kfold == fold].reset_index(drop=True)
  X_test = new_test_df.copy()

  # dependent variables 
  y_train = X_train['attrition'].astype(int)
  y_val = X_val['attrition'].astype(int)

  # independent variables
  X_train = X_train[new_features]
  X_val = X_val[new_features]

  param = {}

  param['learning_rate'] = 0.08670857337862124
  param['n_estimators'] = 2102
  param['scale_pos_weight'] = 4
  param['reg_lambda'] = 0.000978358022453289
  param['reg_alpha'] = 1.052363669709034e-08
  param['subsample'] = 0.9158804914976686
  param['colsample_bytree'] = 0.9135279801031584
  param['max_depth'] = 20

  # lgbm modelling
  model = LGBMClassifier(**param)
  model.fit(X_train, y_train, early_stopping_rounds=100, eval_set=[(X_val,y_val)], verbose=False)

  preds_valid = model.predict(X_val)

  # Apply the trained model to the test data and predict the output
  test_predict = model.predict(X_test)
  prediction.append(test_predict)
  f1= f1_score(y_val,preds_valid, average='weighted')
  # ROC AUC is computed from predicted probabilities rather than hard labels
  roc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
  acc = accuracy_score(y_val,preds_valid)

  #Score 
  score.append(f1)
  print(f"fold:{fold},f1 score:{f1}")
  print(f"fold:{fold},roc:{roc}")
  print(f"fold:{fold},accuracy:{acc}")
  print('-'*10)

print(np.mean(score),np.std(score))

### Model performance

In [None]:
final_predict = stats.mode(np.column_stack(prediction),axis=1, keepdims=True)[0]

In [None]:
test_df['label'] = final_predict
test_df[['label']].to_csv('final_submission_df3.csv')