# **Modelling**
**(Machine Learning)**

The [IBM HR Analytics Employee Attrition & Performance dataset](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset) is a fictional dataset created by IBM data scientists to simulate real-world HR data. It contains information about employees’ demographics, job roles, satisfaction levels, performance, and employment history. The dataset has 1,470 rows (employees) and 35 columns, including both categorical and numerical variables, and is used to explore the factors that influence employee attrition and performance. The main feature categories are:

- **Demographics:** Age, Gender, MaritalStatus, Education, EducationField

- **Job Details:** Department, JobRole, JobLevel, JobInvolvement, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager

- **Compensation:** MonthlyIncome, MonthlyRate, DailyRate, HourlyRate, PercentSalaryHike, StockOptionLevel

- **Satisfaction Metrics:** JobSatisfaction, EnvironmentSatisfaction, RelationshipSatisfaction, WorkLifeBalance

- **Performance & Experience:** PerformanceRating, TotalWorkingYears, NumCompaniesWorked, TrainingTimesLastYear, YearsSinceLastPromotion

- **Other Attributes:** DistanceFromHome, BusinessTravel, OverTime, StandardHours

## Objectives


## Inputs
The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset).

## Outputs
The cleaned CSV file, available [here](../data_set/processed/cleaned_employee_attrition.csv).

# ML Workflow

- Load the dataset & check values
- Encode data types if needed
- Create train and test sets
- Set up Grid with different algorithms (minimal params)
- Visualise Results (compare train and test for overfitting etc.)
- Select best pipeline
- Tune the best pipeline
- Explanatory Visuals & Conclusions

---

# Change working directory
Change the working directory from its current folder to its parent folder, as the notebooks are stored in a subfolder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\employee-turnover-prediction\\jupyter_notebooks'

Make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\employee-turnover-prediction'
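
One caveat with the cells above is that re-running them moves the working directory up one more level each time. The sketch below shows one possible guard: it only moves up when the current folder is still the notebooks folder. The helper name `move_to_project_root` is hypothetical, and the default folder name `jupyter_notebooks` is taken from the path printed above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only if we are still inside the notebooks folder.

    Hypothetical helper: calling it twice has the same effect as calling it once.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to re-run: the second call finds we have already moved and does nothing
project_root = move_to_project_root()
print(project_root)
```

With this guard the cell can be executed repeatedly without ending up above the project folder.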

Set the paths to the raw and processed data folders

In [4]:
# Path to the raw data folder
raw_data_dir = os.path.join(current_dir, 'data_set', 'raw')

# Path to the processed data folder
processed_data_dir = os.path.join(current_dir, 'data_set', 'processed')


---

# Import packages

In [5]:
import numpy as np #import numpy
import pandas as pd #import pandas
import matplotlib.pyplot as plt #import matplotlib
import seaborn as sns #import seaborn
import plotly.express as px # import plotly
sns.set_style('whitegrid') #set style for visuals

---

# Load the cleaned dataset

In [6]:
#load the cleaned dataset
df = pd.read_csv(os.path.join(processed_data_dir, 'cleaned_employee_attrition.csv'))


The cleaned dataset (the output of the ETL process) is loaded with pandas.

---

# Understand the dataset structure and content

In [7]:
#display the first 3 rows of the dataset, transposed
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| Age | 41 | 49 | 37 |
| Attrition | Yes | No | Yes |
| DistanceFromHome | 1 | 8 | 2 |
| JobLevel | 2 | 2 | 1 |
| JobRole | Sales Executive | Research Scientist | Laboratory Technician |
| JobSatisfaction | 4 | 2 | 3 |
| MonthlyIncome | 5993 | 5130 | 2090 |
| NumCompaniesWorked | 8 | 1 | 6 |
| OverTime | Yes | No | Yes |
| WorkLifeBalance | 1 | 3 | 3 |


In [8]:
#dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age                      1470 non-null   int64 
 1   Attrition                1470 non-null   object
 2   DistanceFromHome         1470 non-null   int64 
 3   JobLevel                 1470 non-null   int64 
 4   JobRole                  1470 non-null   object
 5   JobSatisfaction          1470 non-null   int64 
 6   MonthlyIncome            1470 non-null   int64 
 7   NumCompaniesWorked       1470 non-null   int64 
 8   OverTime                 1470 non-null   object
 9   WorkLifeBalance          1470 non-null   int64 
 10  YearsSinceLastPromotion  1470 non-null   int64 
 11  YearsWithCurrManager     1470 non-null   int64 
 12  Attrition_encoded        1470 non-null   int64 
 13  OverTime_encoded         1470 non-null   int64 
dtypes: int64(11), object(3)
memory usage: 16

---

## Encode Features

**Features To Encode**

`OverTime` and `Attrition` already have encoded versions (`OverTime_encoded` and `Attrition_encoded`).
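
Since the text columns are dropped later in favour of the encoded ones, it may be worth confirming that the two agree. The following is a minimal sketch on a toy frame standing in for `df`. The mapping `Yes -> 1`, `No -> 0` is an assumption (the encoded values are not shown in this notebook), and `encoding_is_consistent` is a hypothetical helper.

```python
import pandas as pd

# Toy frame standing in for df; the Yes -> 1 / No -> 0 mapping is an assumption
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No"],
    "Attrition_encoded": [1, 0, 1, 0],
    "OverTime": ["Yes", "No", "No", "Yes"],
    "OverTime_encoded": [1, 0, 0, 1],
})


def encoding_is_consistent(frame, text_col, encoded_col, mapping):
    """Return True when every encoded value equals the mapped text label."""
    expected = frame[text_col].map(mapping)  # unmapped labels become NaN
    return bool(expected.eq(frame[encoded_col]).all())


yes_no = {"Yes": 1, "No": 0}
for text_col in ["Attrition", "OverTime"]:
    ok = encoding_is_consistent(toy, text_col, f"{text_col}_encoded", yes_no)
    print(text_col, "consistent:", ok)
```

The same helper could be called on `df` itself once the actual mapping used during cleaning is known.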

`JobRole` is also an object column, with several distinct labels. To preserve any hierarchy among the labels we will encode it separately. Let us check the values for `JobRole`.

In [9]:
df["JobRole"].value_counts()

JobRole
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: count, dtype: int64

We have nine different roles across departments. The `Department` column of the original data was dropped because it is not relevant to our hypotheses. We can either apply basic label encoding, which maps each role to an arbitrary numeric value, or encode the roles as a hierarchy. The second approach adds meaning to the encoding and may keep the model simpler, so we explore it first. Because the data also includes `JobLevel`, we can use the relationship between `JobLevel` and `JobRole` to order the hierarchy. (If this shows a simple relationship between the two, we can probably drop `JobRole` instead.)

In [10]:
# 1) Inspect distributions of JobLevel within each JobRole
#import display for better formatting of output
from IPython.display import display
display(
    df.groupby("JobRole")["JobLevel"]
      .agg(['count','min','median','mean','max'])
      .sort_values(['median','mean'], ascending=[True, True])
)

# 2) Build an order based on JobLevel (median, then mean as tiebreaker)
role_stats = (
    df.groupby("JobRole")["JobLevel"]
      .agg(median='median', mean='mean')
      .sort_values(['median','mean'], ascending=[True, True])
      .reset_index()
)
role_stats["role_rank"] = np.arange(1, len(role_stats)+1)  # 1 = most junior, larger = more senior

| JobRole | count | min | median | mean | max |
|---|---|---|---|---|---|
| Sales Representative | 83 | 1 | 1.0 | 1.084337 | 2 |
| Research Scientist | 292 | 1 | 1.0 | 1.202055 | 3 |
| Laboratory Technician | 259 | 1 | 1.0 | 1.239382 | 3 |
| Human Resources | 52 | 1 | 1.0 | 1.480769 | 3 |
| Sales Executive | 326 | 2 | 2.0 | 2.328221 | 4 |
| Manufacturing Director | 145 | 2 | 2.0 | 2.448276 | 4 |
| Healthcare Representative | 131 | 2 | 2.0 | 2.473282 | 4 |
| Research Director | 80 | 3 | 4.0 | 3.975 | 5 |
| Manager | 102 | 3 | 4.0 | 4.303922 | 5 |


The order is reasonably clear, though Manager having the highest rank seems like an anomaly (it is described as middle management in the original dataset).
* Conclusion: `JobLevel` may be a better indicator and less arbitrary than `JobRole`
* We can also check monthly income to find a more definitive hierarchy

In [11]:
##Rank Job Role by MonthlyIncome
role_stats = (
    df.groupby("JobRole")["MonthlyIncome"]
      .agg(median='median', mean='mean')
      .sort_values(['median','mean'], ascending=[True, True])
      .reset_index()
)
role_stats["role_rank"] = np.arange(1, len(role_stats)+1)  # 1 = most junior, larger = more senior
role_stats

| | JobRole | median | mean | role_rank |
|---|---|---|---|---|
| 0 | Sales Representative | 2579.0 | 2626.0 | 1 |
| 1 | Laboratory Technician | 2886.0 | 3237.169884 | 2 |
| 2 | Research Scientist | 2887.5 | 3239.972603 | 3 |
| 3 | Human Resources | 3093.0 | 4235.75 | 4 |
| 4 | Sales Executive | 6231.0 | 6924.279141 | 5 |
| 5 | Manufacturing Director | 6447.0 | 7295.137931 | 6 |
| 6 | Healthcare Representative | 6811.0 | 7528.763359 | 7 |
| 7 | Research Director | 16510.0 | 16033.55 | 8 |
| 8 | Manager | 17454.5 | 17181.676471 | 9 |


This gives a clearer ranking: we now have two measures that both seem to rank job roles effectively. Including `JobRole` may introduce noise and overcomplicate the model, but it may also enrich it and reveal a more interesting picture (i.e. factors not necessarily related to income or seniority). For this reason we will keep it in our model for testing and use label encoding.
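
For comparison with plain label encoding, the sketch below shows how an income-based ranking could be mapped back onto the rows as an ordered encoding. It is not the approach taken in this notebook. The toy values and the column name `JobRole_rank` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "JobRole": ["Manager", "Sales Representative", "Manager",
                "Research Scientist", "Sales Representative", "Research Scientist"],
    "MonthlyIncome": [17000, 2500, 18000, 2900, 2700, 3100],
})

# Order roles by median income, using the mean as a tiebreaker
role_order = (
    toy.groupby("JobRole")["MonthlyIncome"]
       .agg(median="median", mean="mean")
       .sort_values(["median", "mean"])
       .index
)

# 1 = lowest-paid role, larger = higher-paid role
role_to_rank = {role: rank for rank, role in enumerate(role_order, start=1)}
toy["JobRole_rank"] = toy["JobRole"].map(role_to_rank)
print(toy)
```

A mapping built this way should be derived from the training rows only if it is used for modelling, so that no information from the test set leaks into the encoding.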

## Create Dataset for training

Next we will create our dataset for training. We can drop columns we have encoded versions of, `Attrition`, `OverTime`.

In [12]:
#create our dataset for training. We can drop columns we have encoded versions of, `Attrition`, `OverTime` 
df_model = df.drop(columns=['Attrition', 'OverTime'])
df_model.head(3).T


| | 0 | 1 | 2 |
|---|---|---|---|
| Age | 41 | 49 | 37 |
| DistanceFromHome | 1 | 8 | 2 |
| JobLevel | 2 | 2 | 1 |
| JobRole | Sales Executive | Research Scientist | Laboratory Technician |
| JobSatisfaction | 4 | 2 | 3 |
| MonthlyIncome | 5993 | 5130 | 2090 |
| NumCompaniesWorked | 8 | 1 | 6 |
| WorkLifeBalance | 1 | 3 | 3 |
| YearsSinceLastPromotion | 0 | 1 | 0 |
| YearsWithCurrManager | 5 | 7 | 0 |


---

## Create Train & Test sets

Next we split our dataset into training & test sets.

In [13]:
#split our dataset into training & test sets.
from sklearn.model_selection import train_test_split
X = df_model.drop('Attrition_encoded', axis=1)  # Features
y = df_model['Attrition_encoded']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((1176, 11), (294, 11), (1176,), (294,))
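
The split above is purely random. When the target classes are imbalanced, a stratified split is a common alternative because it keeps the class proportions the same in both sets. The sketch below illustrates this on synthetic data. The 16% positive rate is an assumption chosen for the example, not a figure taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 rows, 16% positives (assumed for illustration)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 160 + [0] * 840)

# stratify=y_toy keeps the share of positives equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Applied here, this would mean passing `stratify=y` to the existing `train_test_split` call.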

---

## GridSearch with Tree Models

### Import packages

In [14]:
# Imports for the grid search over tree-based models
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#suppress warnings
import warnings
warnings.filterwarnings('ignore')


**Create pipeline function**

First we create lists of features depending on type - Ordinal etc., then create a function to create pipelines.

In [18]:
#check numeric columns for unique values
X.nunique(), X.shape

(Age                          43
 DistanceFromHome             29
 JobLevel                      5
 JobRole                       9
 JobSatisfaction               4
 MonthlyIncome              1349
 NumCompaniesWorked           10
 WorkLifeBalance               4
 YearsSinceLastPromotion      16
 YearsWithCurrManager         18
 OverTime_encoded              2
 dtype: int64,
 (1470, 11))

By examining the unique-value counts we can see which variables are continuous and which are ordinal (i.e. have a small set of ordered levels).
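
As a rough aid to that inspection, columns can be bucketed by their number of unique values. This is only a sketch: the cut-off `MAX_ORDINAL_LEVELS` is a hypothetical threshold and the toy frame stands in for `X`. A low count does not prove a column is ordinal (a binary flag or a nominal category would also land in that bucket), so the result still needs a manual check.

```python
import pandas as pd

# Toy frame standing in for X (illustrative values only)
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32, 59, 30],
    "JobLevel": [2, 2, 1, 1, 1, 1, 1, 1],
    "JobSatisfaction": [4, 2, 3, 3, 2, 4, 1, 3],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693],
})

MAX_ORDINAL_LEVELS = 5  # hypothetical cut-off, tune to the data

n_unique = toy.nunique()
low_cardinality = n_unique[n_unique <= MAX_ORDINAL_LEVELS].index.tolist()
high_cardinality = n_unique[n_unique > MAX_ORDINAL_LEVELS].index.tolist()

print("candidate ordinal columns:   ", low_cardinality)
print("candidate continuous columns:", high_cardinality)
```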

In [21]:
### Preprocessing & Feature Selection imports
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import OrdinalEncoder
from sklearn.feature_selection import SelectFromModel

##separate features by type
ordinal_features = ['JobSatisfaction', 'JobLevel','WorkLifeBalance']
numerical_features = ['Age', 'DistanceFromHome', 'MonthlyIncome', 'NumCompaniesWorked','YearsSinceLastPromotion','YearsWithCurrManager']
#Overtime is already encoded as binary, JobRole will be label encoded
ordinal_features, numerical_features

(['JobSatisfaction', 'JobLevel', 'WorkLifeBalance'],
 ['Age',
  'DistanceFromHome',
  'MonthlyIncome',
  'NumCompaniesWorked',
  'YearsSinceLastPromotion',
  'YearsWithCurrManager'])

The next step is to create our pipeline optimisation function.

In [22]:
##Create function to build  model pipelines

# No scaling needed for tree-based models
def PipelineOptimization(model):
    """Create a pipeline with imputation, encoding, feature selection, and model steps."""
    pipeline_base = Pipeline(steps=[
        # 1) Impute numeric features (median is robust to skew)
        ("num_imputer", MeanMedianImputer(imputation_method="median", variables=numerical_features)),

        # 2) Impute ordinal features with the most frequent value
        #    (ignore_format=True because these columns are stored as integers)
        ("cat_imputer", CategoricalImputer(imputation_method="frequent", variables=ordinal_features,
                                           ignore_format=True)),

        # 3) Re-encode ordinal features. Note: "arbitrary" numbers the levels arbitrarily,
        #    so the original ranking is not guaranteed to be preserved
        ("ord_encoder", OrdinalEncoder(encoding_method="arbitrary", variables=ordinal_features,
                                       ignore_format=True)),

        # 4) Label-encode JobRole (arbitrary, no rank meaning)
        ("role_encoder", OrdinalEncoder(encoding_method="arbitrary", variables=["JobRole"])),

        # 5) Feature selection from model (uses model’s importances)
        ("feat_selection", SelectFromModel(model)),

        # 6) Final estimator
        ("model", model),
    ])
    return pipeline_base
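
To illustrate how the last two steps interact, here is a minimal sketch using only scikit-learn on synthetic data (the imputation and encoding steps are left out, and the data and settings are assumptions for the example). `SelectFromModel` fits its own clone of the estimator, keeps the features whose importance passes its threshold, and the final model is then trained on the reduced feature set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, only 3 of them informative (illustrative setup)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
pipe = Pipeline(steps=[
    ("feat_selection", SelectFromModel(model)),  # fits a clone of `model`
    ("model", model),                            # trained on the selected features
])
pipe.fit(X_toy, y_toy)

# Boolean mask over the input features: True = kept by the selector
selected_mask = pipe.named_steps["feat_selection"].get_support()
print("features kept:", int(selected_mask.sum()), "of", X_toy.shape[1])
```

The same `get_support()` call can be used on a fitted pipeline to see which columns survived feature selection.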

In [17]:
##Gridsearch setup pipeline
# Define models and their hyperparameters
#minimal parameters to reduce computation time
models = {
    'DecisionTree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': [3, 5, 7, 10, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7, 10, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(),
        'params': {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7]
        }
    },
    'ExtraTrees': {
        'model': ExtraTreesClassifier(),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7, 10, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    }
}
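
When these grids are passed to `GridSearchCV` around a pipeline, scikit-learn expects each parameter name to be prefixed with the step name and a double underscore (for example `model__max_depth`). The sketch below shows one way to add that prefix and to count how many combinations a grid contains. The helper names `prefix_params` and `grid_size` are hypothetical, and it assumes the estimator sits in a step called `model`, as in the pipeline function above.

```python
from math import prod


def prefix_params(params: dict, step: str = "model") -> dict:
    """Prefix each parameter name so it targets a named pipeline step."""
    return {f"{step}__{name}": values for name, values in params.items()}


def grid_size(params: dict) -> int:
    """Number of parameter combinations an exhaustive grid search would try."""
    return prod(len(values) for values in params.values())


# Same grid as the DecisionTree entry above
decision_tree_params = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

pipeline_grid = prefix_params(decision_tree_params)
print(sorted(pipeline_grid))
print("combinations:", grid_size(decision_tree_params))
```

The total number of model fits is the number of combinations multiplied by the number of cross-validation folds, which is worth checking before launching the larger grids.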

