# Bluechip-Summit "Employee Attrition Prediction" Hackathon

### Table of Contents
1. Importing Libraries
2. Dataset Loading
3. Data Preprocessing
4. Feature Engineering
   - Dropping Unnecessary Features
   - Centering and Scaling of Numerical Features
   - Winsorization
   - Encoding Categorical Features
   - Creating More Informative Features
   - Feature Scaling
5. Cross-Validation
6. Training and Prediction
7. Future Work

### Acknowledgments
I want to express my gratitude for the insightful guidance provided by the following notebooks:

1. [Starting Strong - XGBoost, LightGBM, CatBoost](https://www.kaggle.com/code/khawajaabaidullah/starting-strong-xgboost-lightgbm-catboost): This notebook served as a solid foundation, offering valuable techniques in feature engineering and modeling. The insights gained significantly contributed to achieving a good Data Analytics (DA) score in my local cross-validation setup.

2. [HR Analytics Final - Basic to Advanced EDA & ML](https://www.kaggle.com/code/ducminh0401/hr-analytics-final-basic-to-advanced-eda-ml): This notebook inspired various exploratory data analysis (EDA) approaches, enhancing my understanding of the data. It played a crucial role in shaping my EDA strategies.

### Additional Note
The incorporation of the original dataset, as allowed by the competition, also played a pivotal role in boosting my overall score.

This notebook is a reflection of an enriching learning journey, and I am grateful for the valuable guidance offered by these resources.


### 1. Importing libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import optuna
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

### 2. Dataset Loading

In [None]:
root_dir = Path('../input/bluechip-dataset')

# id is not an informative feature, so we drop it from train.
# The test ids are needed for the submission file, so we save them in a separate variable before dropping.
train = pd.read_csv(root_dir / "train.csv").drop(columns="id")
test = pd.read_csv(root_dir / "test.csv")
test_idx = test.id #keep test ids for later submission
test = test.drop(columns="id")

# Adding the original dataset improved the score on the public leaderboard, so we load it as well.
original = pd.read_csv(root_dir / "WA_Fn-UseC_-HR-Employee-Attrition.csv")
train.head()

In [None]:
#check number of samples in each dataset
print(f"There are {train.shape[0]} samples in train data")
print(f"There are {test.shape[0]} samples in test data")
print(f"There are {original.shape[0]} samples in original data")

### 3. Data Preprocessing

In [None]:
# encode the target in the original data as 0/1 so that it matches train
original['Attrition'] = (original['Attrition'] == 'Yes').astype(np.int64)

# in the original data the id column is called "EmployeeNumber", so let's drop it
original.drop(columns="EmployeeNumber", inplace=True)
# reorder the columns of the original dataset to match train
original = original[list(train.columns)]
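
The alignment step above can be sketched on toy data. This is a minimal illustration, assuming two small hypothetical tables (`toy_train`, `toy_original`) that mimic the competition and original schemas; the values are made up and only the column handling mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the competition and original tables (hypothetical values).
toy_train = pd.DataFrame({"Age": [30, 41], "OverTime": ["No", "Yes"], "Attrition": [0, 1]})
toy_original = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["Yes", "No", "No"],
    "Age": [25, 52, 38],
})

# Encode the target as 0/1, drop the identifier, and match the column order of train.
toy_original["Attrition"] = (toy_original["Attrition"] == "Yes").astype(np.int64)
toy_original = toy_original.drop(columns="EmployeeNumber")[list(toy_train.columns)]

# Fail early if the two schemas diverge before concatenating.
assert list(toy_original.columns) == list(toy_train.columns)
toy_combined = pd.concat([toy_train, toy_original], ignore_index=True)
print(toy_combined)
```

Checking the column lists before `pd.concat` is a cheap guard: concatenating frames with mismatched columns does not raise, it silently fills the gaps with NaN.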

In [None]:
#concatenating train and original data
train_added = pd.concat([train, original]).reset_index(drop=True)
len(train_added)

In [None]:
#checking for missing data
print(f"Total missing data in train added with original data is: {train_added.isnull().sum().sum()}")
print(f"Total missing data in test is: {test.isnull().sum().sum()}")

In [None]:
#concatenating train and test
y = train_added.Attrition
df = pd.concat([train_added.drop(columns="Attrition"), test])

### 4. Feature Engineering

#### Dropping unnecessary features

In [None]:
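# constant columns carry no information, so they are dropped
feats_to_drop = [col for col in df.columns if df[col].nunique() == 1]
# low-cardinality columns (2 to 20 unique values) are treated as categorical
cat_features = [col for col in df.columns if 1 < df[col].nunique() <= 20]
df.drop(columns=feats_to_drop, inplace=True)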
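
As a small sketch of the constant-column filter, assuming a hypothetical three-column frame (`toy`) in which one column never varies:

```python
import pandas as pd

toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],               # constant, so it carries no signal
    "Department": ["HR", "R&D", "R&D", "Sales"],
    "MonthlyIncome": [2500, 4100, 3900, 6100],
})

# nunique() is computed once and reused for the filter.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```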

#### Centering and scaling of numerical features

In [None]:
# center and scale
CON_FEATURES = ['MonthlyRate', 'MonthlyIncome', 'DailyRate', 'HourlyRate', 'Age',
                'DistanceFromHome', 'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole',
                'YearsWithCurrManager', 'PercentSalaryHike', 'NumCompaniesWorked', 'TrainingTimesLastYear', 
                'YearsSinceLastPromotion']
for feature in CON_FEATURES:
    mu = np.mean(df[feature])
    sigma = np.std(df[feature])
    df[feature] = (df[feature] - mu) / sigma
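
The manual z-score above should give the same result as scikit-learn's `StandardScaler`, since `np.std` uses the population standard deviation (`ddof=0`) by default, whereas pandas' own `Series.std()` defaults to `ddof=1`. A quick check on made-up values (the `values` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[2000.0], [3500.0], [5000.0], [9500.0]])  # hypothetical incomes

# Manual z-score: ndarray.std() defaults to the population standard deviation (ddof=0).
manual = (values - values.mean()) / values.std()

# StandardScaler also divides by the population standard deviation.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```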

#### Winsorization

In [None]:
df.loc[df['Education'] == 15, 'Education'] = 5 # 5 is the max possible value based on the original data
df.loc[df['JobLevel'] == 7, 'JobLevel'] = 5   # 5 is the max possible value based on the original data
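
The cell above caps two specific out-of-range values by hand. A more general variant, shown here only as a sketch on a hypothetical `job_level` series, is to clip at a known domain maximum or at chosen quantiles (the 5% and 95% cut-offs are illustrative, not values used in this notebook):

```python
import pandas as pd

job_level = pd.Series([1, 2, 2, 3, 1, 5, 4, 2, 3, 7])  # 7 lies outside the valid 1-5 range

# Domain-based capping: anything above the known maximum is pulled down to it.
capped = job_level.clip(upper=5)

# Percentile-based winsorization: clip both tails at chosen quantiles.
lower, upper = job_level.quantile([0.05, 0.95])
winsorized = job_level.clip(lower=lower, upper=upper)

print(capped.tolist())
print(winsorized.tolist())
```

Domain-based capping is the closer match to what the notebook does, because the valid range is known from the original data; quantile clipping is the usual choice when no such range is available.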

#### Encoding categorical features

In [None]:
#encode categorical features with ordinal encoder
ord_enc = OrdinalEncoder()

ord_enc.fit(df[cat_features])

df[cat_features] = ord_enc.transform(df[cat_features])
df.head()
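
It can help to inspect which integer each label receives. By default `OrdinalEncoder` sorts the categories of each column, so string labels are coded in alphabetical order. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely"],
})

enc = OrdinalEncoder()
encoded = enc.fit_transform(toy)

# categories_ holds the sorted labels per column; a label's position is its code.
for col, cats in zip(toy.columns, enc.categories_):
    print(col, {cat: code for code, cat in enumerate(cats)})
```

After encoding, comparisons against the original strings (for example `== 'Yes'`) no longer match anything, so later cells have to compare against the numeric codes instead.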

#### Feature creation

In [None]:
#creating more informative features such as risk factors
def add_features(df):
    # NOTE: the thresholds below are expressed in raw units (years, rates, income),
    # while the CON_FEATURES columns were already standardized earlier in this notebook
    df['MonthlyIncome/Age'] = df['MonthlyIncome'] / df['Age']

    df["Age_risk"] = (df["Age"] < 34).astype(int)
    df["HourlyRate_risk"] = (df["HourlyRate"] < 60).astype(int)
    df["Distance_risk"] = (df["DistanceFromHome"] >= 20).astype(int)
    df["YearsAtCo_risk"] = (df["YearsAtCompany"] < 4).astype(int)

    # avoid division by zero when computing the average tenure
    df['NumCompaniesWorked'] = df['NumCompaniesWorked'].replace(0, 1)
    df['AverageTenure'] = df["TotalWorkingYears"] / df["NumCompaniesWorked"]
    # df['YearsAboveAvgTenure'] = df['YearsAtCompany'] - df['AverageTenure']

    df['JobHopper'] = ((df["NumCompaniesWorked"] > 2) & (df["AverageTenure"] < 2.0)).astype(int)

    # sum of the individual risk flags
    df["AttritionRisk"] = df["Age_risk"] + df["HourlyRate_risk"] + df["Distance_risk"] + df["YearsAtCo_risk"] + df['JobHopper']

    # More feature engineering ideas for modelling
    df['feature_1'] = np.where(((df['StockOptionLevel'] >= 1) &
                                (df['YearsAtCompany'] >= 3) &
                                (df['YearsWithCurrManager'] >= 1)), 1, 0)
    # OverTime was ordinal-encoded above ('No' -> 0, 'Yes' -> 1), so compare with 1
    df['feature_2'] = np.where(((df['StockOptionLevel'] < 1) &
                                (df['MonthlyIncome'] > 2700) &
                                (df['OverTime'] == 1)), 1, 0)
    return df
df = add_features(df)
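
Threshold features such as `Age < 34` are defined in raw units, so the order of the steps matters. The sketch below uses a hypothetical `age` series to compare the same flag computed on raw values and on standardized values:

```python
import pandas as pd

age = pd.Series([22, 29, 35, 41, 58], name="Age")  # hypothetical raw ages

# Flag computed on raw values: the threshold is in years.
risk_raw = (age < 34).astype(int)

# Same comparison after standardization: every z-score is far below 34.
age_z = (age - age.mean()) / age.std(ddof=0)
risk_scaled = (age_z < 34).astype(int)

print(risk_raw.tolist())     # [1, 1, 0, 0, 0]
print(risk_scaled.tolist())  # [1, 1, 1, 1, 1]
```

One way to keep such flags informative would be to create them before any centering or scaling; this is a suggestion for experimentation, not something evaluated in this notebook.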

#### Feature transformation of numerical features by scaling

In [None]:
scaler = StandardScaler()
# StandardScaler standardizes each column independently, so one call covers all features
df[CON_FEATURES] = scaler.fit_transform(df[CON_FEATURES])
df

### 5. Cross validation

In [None]:
#splitting the data back into train and test sets
X_train = df.iloc[:len(train_added), :]
X_test = df.iloc[len(train_added): , :]
len(X_test)

In [None]:
#define the models to compare and store them in a dictionary keyed by name

#xgboost
xgb_params = {'n_estimators': 150,
              'random_state': 0,
              'max_depth': 3,
              'learning_rate': 0.1,
              'min_child_weight': 4,
              'subsample': 0.7,
              'colsample_bytree': 0.3,
              'verbosity': 0}  # XGBoost's logging parameter is `verbosity`, not `verbose`

xgb_clf = xgb.XGBClassifier(**xgb_params)

#lightgbm
# note: 'n_estimators' and 'num_rounds' are both aliases of LightGBM's num_iterations
lgbm_params = {'n_estimators': 407,
               'random_state': 0,
               'num_rounds': 274,
               'learning_rate': 0.1,
               'num_leaves': 195,
               'max_depth': 9,
               'min_data_in_leaf': 46,
               'lambda_l1': 0.01,
               'lambda_l2': 0.6,
               'min_gain_to_split': 1.42,
               'bagging_fraction': 0.45,
               'feature_fraction': 0.3,
               'verbosity': -1}
lgbm_clf = lgbm.LGBMClassifier(**lgbm_params)

#catboost
catboost_params = {'loss_function': 'CrossEntropy',
                   'learning_rate': 0.76,
                   'random_state': 0,
                   'l2_leaf_reg': 0.014,
                   'colsample_bylevel': 0.06,
                   'depth': 1,
                   'boosting_type': 'Plain',
                   'bootstrap_type': 'Bernoulli',
                   'min_data_in_leaf': 18,
                   'one_hot_max_size': 14,
                   'subsample': 0.99,
                   'verbose': 0}

catboost_clf = catboost.CatBoostClassifier(**catboost_params)
models_dict = {"xgboost": xgb_clf, "lightgbm": lgbm_clf, "catboost": catboost_clf}


In [None]:
def cross_validate(X, y, models_dict):
    """Run 10-fold stratified CV for each model, plot the mean AUCs and return them."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1337)
    all_scores = {name: [] for name in models_dict}
    for name, model in models_dict.items():
        print(f"Cross Validation <-> {name}")
        print("-------------------------")

        for fold_id, (train_idx, val_idx) in enumerate(skf.split(X, y)):
            X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

            model.fit(X_tr, y_tr)

            # probability of the positive class (attrition)
            y_pred = model.predict_proba(X_val)[:, 1]

            auc = roc_auc_score(y_val, y_pred)
            print(f"Fold {fold_id} \t auc: {auc}")
            all_scores[name].append(auc)

        avg_auc = np.mean(all_scores[name])
        print(f"Avg AUC for {name} is : {avg_auc}")
        print("<--------------------------------->")

    # collect the mean AUC per model, best first
    results = pd.DataFrame({
        "Model": list(all_scores.keys()),
        "AUC_Score": [np.mean(scores) for scores in all_scores.values()],
    }).sort_values(by="AUC_Score", ascending=False)

    # bar chart of the mean AUC per model
    plt.figure(figsize=(10, 6))
    bars = plt.bar(results['Model'], results['AUC_Score'], color='skyblue')
    plt.xlabel('Model')
    plt.ylabel('Mean AUC')
    plt.title('Model Scores in Descending Order')
    plt.xticks(rotation=45, ha='right')  # rotate labels for readability

    # add the score just above each bar (AUC is at most 1, so the offset must be small)
    for bar, score in zip(bars, results['AUC_Score']):
        plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                 f'{score:.4f}', ha='center')

    plt.show()
    return results

In [None]:
cross_validate(
    X = X_train,
    y=y,
    models_dict=models_dict
)
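
To illustrate what the stratified split contributes, the sketch below runs a 5-fold `StratifiedKFold` on synthetic, imbalanced data and records the positive rate of every validation fold together with out-of-fold predictions. The data from `make_classification` and the `LogisticRegression` model are stand-ins chosen to keep the example light; they are not the data or models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced problem (roughly 15% positives), standing in for attrition data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   weights=[0.85, 0.15], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1337)
oof_pred = np.zeros(len(y_toy))   # out-of-fold probability for every row
fold_pos_rates = []               # share of positives in each validation fold

for train_idx, val_idx in skf.split(X_toy, y_toy):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    oof_pred[val_idx] = model.predict_proba(X_toy[val_idx])[:, 1]
    fold_pos_rates.append(y_toy[val_idx].mean())

print("positive rate per fold:", np.round(fold_pos_rates, 3))
print("out-of-fold AUC:", roc_auc_score(y_toy, oof_pred))
```

Because every row is predicted exactly once by a model that never saw it, the out-of-fold vector can also be reused later, for example to tune blending weights without touching the test set.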

### 6. Training and making prediction

In [None]:
#train XGBoost on the full training data
xgb_clf = xgb.XGBClassifier(**xgb_params)
xgb_clf.fit(X_train, y, verbose=0)

#train LightGBM (logging is already silenced via 'verbosity': -1 in lgbm_params;
#recent LightGBM releases no longer accept a `verbose` argument in fit)
lgbm_clf = lgbm.LGBMClassifier(**lgbm_params)
lgbm_clf.fit(X_train, y)

#train CatBoost
catboost_clf = catboost.CatBoostClassifier(**catboost_params)
catboost_clf.fit(X_train, y, verbose=False)

In [None]:
#make predictions on test data for each cross validated model
xgb_preds = xgb_clf.predict_proba(X_test)[:, 1]
lgbm_preds = lgbm_clf.predict_proba(X_test)[:, 1]
cat_preds = catboost_clf.predict_proba(X_test)[:, 1]

Submission 1 - xgboost

In [None]:
# Submission using the XGBoost predictions only
submission = pd.DataFrame({"id": test_idx, "Attrition": xgb_preds})
submission.to_csv("Submission_1_XGB.csv", index=False)
submission.head()


Submission 2 - Blending xgboost and catboost

In [None]:
# Equal-weight blend: average the XGBoost and CatBoost probabilities
final_preds = np.column_stack([xgb_preds,
                               cat_preds]).mean(axis=1)
submission = pd.DataFrame({"id": test_idx, "Attrition": final_preds})
submission.to_csv("Submission_2_Blended_xg_cat.csv", index=False)
submission.head()
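
The blend above takes a plain mean, so both models contribute equally. If one model should count more, `np.average` accepts a `weights` argument. The sketch below uses hypothetical probability arrays (`preds_a`, `preds_b`) and illustrative weights of 0.7 and 0.3 that have not been tuned on this competition:

```python
import numpy as np

# Hypothetical predicted probabilities from two models for four test rows.
preds_a = np.array([0.10, 0.80, 0.35, 0.60])
preds_b = np.array([0.20, 0.60, 0.45, 0.90])
stacked = np.column_stack([preds_a, preds_b])

# Equal-weight blend: identical to stacked.mean(axis=1).
equal_blend = np.average(stacked, axis=1)

# Weighted blend: the weights are illustrative; np.average divides by their sum.
weights = [0.7, 0.3]
weighted_blend = np.average(stacked, axis=1, weights=weights)

print(equal_blend)
print(weighted_blend)
```

In practice the weights would be chosen on validation or out-of-fold predictions rather than on the public leaderboard, to limit overfitting to it.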