#### Import libraries

In [28]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import joblib

#### Load Dataset

In [29]:
df = pd.read_csv("../data/employee data_raw.csv")
df.head()

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
0,1,IT,Male,55,Specialist,,2.0,High School,5.0,6750.0,33,32.0,22.0,2,0.0,14.0,66.0,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18 08:03:05.556036,0.0,High School,5.0,7500.0,34,34.0,13.0,14,100.0,12.0,61.0,2,1.72,False
2,3,Finance,Male,55,Specialist,,8.0,High School,3.0,5850.0,37,27.0,6.0,3,50.0,10.0,1.0,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22 08:03:05.556036,7.0,Bachelor,2.0,4800.0,52,10.0,28.0,12,100.0,10.0,0.0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23 08:03:05.556036,3.0,Bachelor,2.0,4800.0,38,,29.0,13,100.0,15.0,9.0,1,1.25,False


In [30]:
print("Initial shape:", df.shape)

Initial shape: (100000, 20)


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Employee_ID                  100000 non-null  int64  
 1   Department                   100000 non-null  object 
 2   Gender                       100000 non-null  object 
 3   Age                          100000 non-null  int64  
 4   Job_Title                    100000 non-null  object 
 5   Hire_Date                    95871 non-null   object 
 6   Years_At_Company             97614 non-null   float64
 7   Education_Level              96878 non-null   object 
 8   Performance_Score            95745 non-null   float64
 9   Monthly_Salary               100000 non-null  float64
 10  Work_Hours_Per_Week          100000 non-null  int64  
 11  Projects_Handled             94629 non-null   float64
 12  Overtime_Hours               97603 non-null   float64
 13  

#### Drop Duplicates


In [32]:
# check for duplicates
print("Duplicate rows before:", df.duplicated().sum())

Duplicate rows before: 0


There are no duplicate records, so nothing needs to be dropped.
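Had the check found duplicates, they could be removed with `drop_duplicates`. A minimal sketch on a hypothetical toy frame (the data below is illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame containing one exact duplicate row.
toy = pd.DataFrame({
    "Employee_ID": [1, 2, 2, 3],
    "Department": ["IT", "HR", "HR", "IT"],
})

# Count exact duplicates, then keep the first occurrence of each row.
n_dupes = toy.duplicated().sum()
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows before:", n_dupes)
print("Shape after dropping:", deduped.shape)
```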

#### Drop rows with missing target

In [33]:
df = df.dropna(subset=["Resigned"])
print("After dropping missing target rows:", df.shape)

After dropping missing target rows: (98690, 20)


#### Convert target to numeric (True/False/Yes/No → 1/0)

In [34]:
df["Resigned"] = (
    df["Resigned"]
    .astype(str)
    .str.strip()
    .str.lower()
    .map({"yes": 1, "true": 1, "1": 1, "no": 0, "false": 0, "0": 0})
)

# Drop rows where mapping failed
df = df.dropna(subset=["Resigned"])

# Convert to int
df["Resigned"] = df["Resigned"].astype(int)
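To illustrate what the mapping above does with mixed label spellings, here is a small sketch on a hypothetical toy series (the values are made up for demonstration):

```python
import pandas as pd

# Hypothetical labels mixing booleans, strings, stray whitespace and a bad value.
raw = pd.Series([True, False, "Yes", " no ", "1", 0, "maybe"])

label_map = {"yes": 1, "true": 1, "1": 1, "no": 0, "false": 0, "0": 0}

# Same normalization chain as in the cell above: stringify, trim, lowercase, map.
mapped = raw.astype(str).str.strip().str.lower().map(label_map)

# Anything outside the mapping becomes NaN and is dropped before the int cast.
clean = mapped.dropna().astype(int)

print("Unmapped values:", mapped.isna().sum())
print(clean.tolist())
```

One caveat: a float-typed target would stringify as `"1.0"` / `"0.0"`, which is not covered by this mapping, so such rows would be dropped.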

#### Drop irrelevant columns

In [35]:
df.drop(columns=["Employee_ID", "Hire_Date"], inplace=True, errors="ignore")


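We drop two columns:

- `Employee_ID` is a unique identifier and carries no predictive signal.
- `Hire_Date` is redundant, presumably because tenure is already captured by `Years_At_Company`.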

In [36]:
df.shape

(98690, 18)

#### Handle missing values

In [37]:
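for col in df.columns:
    if col != "Resigned":  # never impute the target
        if df[col].dtype in ["int64", "float64"]:
            # numeric column: fill gaps with the median
            df[col] = df[col].fillna(df[col].median())
        else:
            # categorical column: fill gaps with the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])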


In this step, we impute missing values based on the type of feature:

- **Numerical features** → filled with the **median** of the column.  
  - The median is chosen instead of the mean because it is more robust to outliers.  
  - Example: If the `Age` column has missing values, they will be replaced with the median employee age.

- **Categorical features** → filled with the **mode** (most frequently occurring value) of the column.  
  - This ensures missing entries are replaced with the most common category.  
  - Example: If the `Department` column has missing values, they will be filled with the department that appears most often (e.g., `"Sales"`).

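This strategy preserves the dataset size because no rows are dropped. Two caveats apply: filling with a single central value shrinks the variance of the imputed columns, and the medians and modes here are computed on the full dataset before the train/test split, so a small amount of test-set information leaks into the fill values.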
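One way to keep test rows out of the fill values is to learn the median and mode from the training rows only and then apply them to both splits. The sketch below is not the notebook's method; the frames `train` and `test` and the `fill_values` dict are hypothetical names used for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits with gaps in a numeric and a categorical column.
train = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Department": ["IT", "IT", np.nan, "HR"],
})
test = pd.DataFrame({
    "Age": [np.nan, 52.0],
    "Department": [np.nan, "Finance"],
})

# Learn one fill value per column from the training rows only.
fill_values = {}
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        fill_values[col] = train[col].median()
    else:
        fill_values[col] = train[col].mode()[0]

# Apply the same training-derived values to both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

In a scikit-learn workflow, the same effect is usually obtained by placing an imputer inside the preprocessing pipeline so that it is fitted together with the scaler and encoder.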


In [38]:
print("\nMissing values per column:\n", df.isna().sum())


Missing values per column:
 Department                     0
Gender                         0
Age                            0
Job_Title                      0
Years_At_Company               0
Education_Level                0
Performance_Score              0
Monthly_Salary                 0
Work_Hours_Per_Week            0
Projects_Handled               0
Overtime_Hours                 0
Sick_Days                      0
Remote_Work_Frequency          0
Team_Size                      0
Training_Hours                 0
Promotions                     0
Employee_Satisfaction_Score    0
Resigned                       0
dtype: int64


All missing values have now been handled.

#### Split the data into features and target

In [39]:
X = df.drop(columns=["Resigned"])
y = df["Resigned"]

In [40]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [41]:
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

Train shape: (78952, 17) Test shape: (19738, 17)
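The `stratify=y` argument keeps the class ratio of the target the same in both splits. A quick way to check this, sketched on hypothetical toy data with a 10% positive rate:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 10 positives out of 100 rows.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits should keep roughly the original ratio.
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

The same check can be run on the real splits by comparing `y_train.mean()` and `y_test.mean()`.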


#### Define preprocessing

In [42]:
categorical_features = X.select_dtypes(include=["object"]).columns.tolist() 
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

In [43]:
print("Categorical features:", categorical_features) 
print("Numerical features:", numerical_features)

Categorical features: ['Department', 'Gender', 'Job_Title', 'Education_Level']
Numerical features: ['Age', 'Years_At_Company', 'Performance_Score', 'Monthly_Salary', 'Work_Hours_Per_Week', 'Projects_Handled', 'Overtime_Hours', 'Sick_Days', 'Remote_Work_Frequency', 'Team_Size', 'Training_Hours', 'Promotions', 'Employee_Satisfaction_Score']


##### Feature Transformation

In [44]:
categorical_transformer = OneHotEncoder(handle_unknown="ignore") 
numerical_transformer = StandardScaler()


- **Categorical features** → encoded using `OneHotEncoder(handle_unknown="ignore")`.  
  - Converts text categories (e.g., "HR", "Finance") into binary columns.  
  - `handle_unknown="ignore"` means categories not seen during fitting are encoded as all zeros at transform time instead of raising an error.

- **Numerical features** → scaled using `StandardScaler()`.  
  - Transforms values to have mean = 0 and standard deviation = 1.  
  - Ensures all numeric features are on the same scale, which helps many models converge better.
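The behaviour of the two transformers described above can be seen on a small example. This is a sketch on hypothetical toy values, not on the notebook's data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Fit the encoder on three known departments (toy values).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["HR"], ["Finance"], ["IT"]], dtype=object))

# "Legal" was never seen during fit, so its row is encoded as all zeros.
encoded = encoder.transform(
    np.array([["Finance"], ["Legal"]], dtype=object)
).toarray()
print(encoder.categories_)
print(encoded)

# Standardize a toy salary column to zero mean and unit standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(np.array([[4800.0], [5850.0], [6750.0], [7500.0]]))
print(scaled.mean(), scaled.std())
```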


In [45]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

The `ColumnTransformer` applies different transformations to different feature types:

- **`("num", numerical_transformer, numerical_features)`**  
  → Applies `StandardScaler` to all numerical columns.

- **`("cat", categorical_transformer, categorical_features)`**  
  → Applies `OneHotEncoder` to all categorical columns.

#### Save artifacts

In [46]:
joblib.dump(preprocessor, "preprocessor.pkl") 
joblib.dump((X_train, X_test, y_train, y_test), "splits.pkl")

['splits.pkl']


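- **preprocessor.pkl** → stores the preprocessing definition (scaling + encoding). As saved here it has not been fitted, so it must be fitted on `X_train` after loading.
- **splits.pkl** → stores the raw, untransformed train/test splits (`X_train`, `X_test`, `y_train`, `y_test`).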
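Because the transformer is dumped before any call to `fit`, whoever loads it has to fit it first. The sketch below shows this round trip with a lone `StandardScaler` and a temporary directory; the file name `scaler.pkl` and the toy frame are hypothetical stand-ins for the real artifacts.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")  # hypothetical file name

    # Dump an unfitted transformer, mirroring what the cell above does.
    joblib.dump(StandardScaler(), path)
    loaded = joblib.load(path)

    # An unfitted scaler has no learned statistics yet.
    was_fitted = hasattr(loaded, "mean_")

    # Fit on (toy) training data only; afterwards the statistics exist.
    train = pd.DataFrame({"Age": [25.0, 35.0, 45.0]})
    loaded.fit(train)
    now_fitted = hasattr(loaded, "mean_")

print("Fitted right after loading:", was_fitted)
print("Fitted after calling fit:", now_fitted)
```

An alternative design is to call `fit` on `X_train` before dumping, so that the saved file already contains the learned means, scales, and categories.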
