# 1. Simplified EDA :

In [12]:
import pandas as pd

data = pd.read_csv("attrition_availabledata_04.csv")

# Display the structure of the dataset
print(data.info())
print(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   hrs                      2940 non-null   float64
 1   absences                 2940 non-null   float64
 2   JobInvolvement           2940 non-null   float64
 3   PerformanceRating        2940 non-null   float64
 4   EnvironmentSatisfaction  2940 non-null   float64
 5   JobSatisfaction          2940 non-null   float64
 6   WorkLifeBalance          2940 non-null   float64
 7   Age                      2940 non-null   float64
 8   BusinessTravel           2940 non-null   object 
 9   Department               2940 non-null   object 
 10  DistanceFromHome         2940 non-null   float64
 11  Education                2940 non-null   float64
 12  EducationField           2940 non-null   object 
 13  EmployeeCount            2940 non-null   float64
 14  EmployeeID              

In [16]:
# Basic information about the data
print("Shape of the dataset:", data.shape)
print("Data Types:\n", data.dtypes.value_counts())
print("Missing Values:\n", data.isnull().sum())

# Identify constant columns
constant_cols = [col for col in data.columns if data[col].nunique() == 1]
print("Constant Columns:", constant_cols)

# Check the target variable distribution
attrition_dist = data["Attrition"].value_counts(normalize=True) * 100
print("Attrition Class Distribution:\n", attrition_dist)


Shape of the dataset: (2940, 31)
Data Types:
 float64    23
object      8
Name: count, dtype: int64
Missing Values:
 hrs                        0
absences                   0
JobInvolvement             0
PerformanceRating          0
EnvironmentSatisfaction    0
JobSatisfaction            0
WorkLifeBalance            0
Age                        0
BusinessTravel             0
Department                 0
DistanceFromHome           0
Education                  0
EducationField             0
EmployeeCount              0
EmployeeID                 0
Gender                     0
JobLevel                   0
JobRole                    0
MaritalStatus              0
MonthlyIncome              0
NumCompaniesWorked         0
Over18                     0
PercentSalaryHike          0
StandardHours              0
StockOptionLevel           0
TotalWorkingYears          0
TrainingTimesLastYear      0
YearsAtCompany             0
YearsSinceLastPromotion    0
YearsWithCurrManager       0
Attrition    

In [18]:
# Drop constant columns
data_cleaned = data.drop(columns=["EmployeeCount", "Over18", "StandardHours"])

# Separate features and target variable
X = data_cleaned.drop(columns=["Attrition"])
y = data_cleaned["Attrition"].map({"Yes": 1, "No": 0})

The data summary shows `2940` instances and `31` columns (30 features plus the target `Attrition`), with no missing values in any column.
## 1.1 Data type
### Numerical variables :
There are 23 numerical variables in total, all stored as `float64` :
- `hrs`
- `absences`
- `JobInvolvement`
- `PerformanceRating`
- `EnvironmentSatisfaction`
- `JobSatisfaction`
- `WorkLifeBalance`
- `Age`
- `DistanceFromHome`
- `Education`
- `EmployeeCount` 
- `EmployeeID` 
- `JobLevel`
- `MonthlyIncome`
- `NumCompaniesWorked`
- `PercentSalaryHike`
- `StandardHours` 
- `StockOptionLevel`
- `TotalWorkingYears`
- `TrainingTimesLastYear`
- `YearsAtCompany`
- `YearsSinceLastPromotion`
- `YearsWithCurrManager`
### Categorical variables :
There are 8 categorical variables (`object` dtype), including the target `Attrition` :
- `BusinessTravel`
- `Department`
- `EducationField`
- `Gender`
- `JobRole`
- `MaritalStatus`
- `Over18`
- `Attrition` 

## 1.2 Categorical variables with high cardinality :
Two variables have a relatively large number of unique categories, which increases the number of one-hot encoded columns and makes modeling more complex :
- `JobRole` : 9 categories
- `EducationField` : 6 categories
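
Counts like these can be checked by tabulating the number of distinct values per categorical column. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the attrition data). It selects non-numeric columns with `exclude=["number"]` rather than by `object` dtype, which is an assumption made for portability across pandas versions.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "JobRole": ["Manager", "Sales Executive", "Research Scientist", "Manager"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [41.0, 29.0, 35.0, 50.0],
})

# Count distinct values in each non-numeric column, largest first.
cardinality = (
    toy.select_dtypes(exclude=["number"])
    .nunique()
    .sort_values(ascending=False)
)
print(cardinality)
```

Applied to the real `data` frame, the same expression would list every categorical column with its number of categories, which makes it easy to decide where one-hot encoding becomes expensive.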

## 1.3 Features with missing values :
None

## 1.4 Constant columns or ID columns :
### Constant columns :
Three columns hold a single value and can be removed from the data :
- `Over18` 
- `EmployeeCount`
- `StandardHours`
### ID columns :
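- `EmployeeID` : a unique identifier that should be excluded from modeling, since it carries no information about the target. Note that the cleaning cell above drops only the three constant columns, so `EmployeeID` is still among the 27 features used in the following sections.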
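
One possible way to flag both kinds of column at once is to compare each column's number of distinct values with 1 (constant) and with the number of rows (ID-like). This is a sketch on a hypothetical toy frame, not the notebook's actual cleaning step. A continuous feature can also be unique on every row, so the ID-like list should be treated as candidates to review, not as an automatic drop list.

```python
import pandas as pd

# Toy frame (made-up values) with one ID column, one constant column, one real feature.
toy = pd.DataFrame({
    "EmployeeID": [1.0, 2.0, 3.0, 4.0],
    "StandardHours": [8.0, 8.0, 8.0, 8.0],
    "Age": [41.0, 29.0, 41.0, 50.0],
})

n_rows = len(toy)
unique_counts = toy.nunique()

# Constant columns have exactly one distinct value.
constant_cols = unique_counts[unique_counts == 1].index.tolist()
# ID-like columns have a different value on every row.
id_like_cols = unique_counts[unique_counts == n_rows].index.tolist()

toy_cleaned = toy.drop(columns=constant_cols + id_like_cols)
print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
print("Remaining:", toy_cleaned.columns.tolist())
```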

## 1.5 Problem type and imbalance :
This is a binary classification problem with `Attrition` as the target variable. The classes are imbalanced: `83.88%` "No" and `16.12%` "Yes".

## 1.6 Additional considerations :
- High-cardinality variables should be encoded with care, as they may lead to overfitting.
- The constant columns are redundant and are dropped to reduce noise in the data.
- Resampling or class-weight adjustment will most likely be needed to manage the imbalance in the target.
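
To make the class-weight option concrete, the sketch below computes "balanced" weights from the class shares reported in Section 1.5, using the formula `n_samples / (n_classes * n_c)`, which simplifies to `1 / (n_classes * share_c)`. This is the formula scikit-learn applies for `class_weight="balanced"`. The variable names are illustrative only.

```python
# Class shares reported in Section 1.5, expressed as fractions.
class_share = {"No": 0.8388, "Yes": 0.1612}
n_classes = len(class_share)

# Balanced weighting: weight_c = n_samples / (n_classes * n_c)
#                              = 1 / (n_classes * share_c)
class_weight = {
    label: 1.0 / (n_classes * share) for label, share in class_share.items()
}

for label, weight in class_weight.items():
    print(f"{label}: {weight:.3f}")
# No: 0.596
# Yes: 3.102
```

Each "Yes" example therefore counts roughly five times as much as a "No" example. `DecisionTreeClassifier` accepts `class_weight="balanced"` directly, whereas `KNeighborsClassifier` has no `class_weight` parameter, so resampling would be the option to consider there.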

# 2. Data setup :

## 2.1 Data split 

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Training set shape: (2352, 27)
Test set shape: (588, 27)


## 2.2 Step for inner evaluation :

- For inner evaluation, we use 3-fold cross-validation for computational efficiency: the training set is divided into 3 folds and, in each iteration, 2 folds are used for training and the third for validation.
- For consistency, the same scheme is applied across all models and hyperparameter tuning tasks.
- Averaging over the folds gives a more reliable performance estimate than a single validation split and reduces the risk of overfitting to it.
- Finally, the metric should suit the problem. Because the dataset is imbalanced, plain accuracy is misleading; F1-score is a suitable choice, and the tuning code in Section 3 uses balanced accuracy.
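
The inner evaluation described above can be sketched as follows. The example runs on synthetic data from `make_classification` with roughly the same class ratio (an assumption for illustration, not the attrition data), and uses `StratifiedKFold` so that each fold keeps that ratio. It reports both F1 and balanced accuracy per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 84% / 16%), standing in for the training set.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=0
)

# 3 folds: in each iteration, 2 folds train the model and 1 fold validates it.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["f1", "balanced_accuracy"],
)

print("F1 per fold:", scores["test_f1"])
print("Balanced accuracy per fold:", scores["test_balanced_accuracy"])
print("Mean F1:", scores["test_f1"].mean())
```

Passing the same `cv` object to every model and to `GridSearchCV` keeps the folds identical across comparisons.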


# 3. BASIC METHODS: TREES AND KNN 

## 3.1 Train, Evaluate, and Compare Two Basic Methods with Default Hyperparameters
- We compare two basic methods (decision tree and KNN) with default hyperparameters against a `DummyClassifier` baseline.
- For KNN, we compare two scaling methods (`StandardScaler` vs `MinMaxScaler`).
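
Scaling matters for KNN because it ranks neighbors by distance, so a feature with a large numeric range can dominate the result. The sketch below uses three hypothetical employees (made-up values) and applies min-max scaling by hand, which is the per-column transformation `MinMaxScaler` performs with its default range.

```python
import numpy as np

# Hypothetical employees described by [MonthlyIncome, JobSatisfaction].
points = np.array([
    [50000.0, 1.0],   # a: reference employee
    [50500.0, 4.0],   # b: similar income, opposite satisfaction
    [52000.0, 1.0],   # c: higher income, same satisfaction
])

def distances_from_first(rows):
    """Euclidean distance from the first row to each of the other rows."""
    return np.linalg.norm(rows[1:] - rows[0], axis=1)

# Raw features: the income column dominates the distance.
raw_ab, raw_ac = distances_from_first(points)

# Min-max scaling maps every column to [0, 1].
col_min = points.min(axis=0)
col_max = points.max(axis=0)
scaled = (points - col_min) / (col_max - col_min)
scaled_ab, scaled_ac = distances_from_first(scaled)

print(f"raw:    d(a,b)={raw_ab:.2f}, d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}, d(a,c)={scaled_ac:.2f}")
```

On the raw features, b is the nearest neighbor of a; after scaling, c is. Decision trees split on one feature at a time, so their results do not depend on the scaler, which is consistent with the identical tree scores in the baseline table.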

In [39]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, balanced_accuracy_score
import pandas as pd
import numpy as np
import time

# Re-split with stratification (replaces the unstratified split from Section 2.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Preprocessing
categorical_cols = X_train.select_dtypes(include=["object"]).columns
numerical_cols = X_train.select_dtypes(exclude=["object"]).columns

# Define preprocessors
scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler()
}

# Models
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Dummy": DummyClassifier(strategy="most_frequent"),
}

# Baseline evaluation with default hyperparameters
results = []
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        preprocessor = ColumnTransformer(
            transformers=[
                ("num", scaler, numerical_cols),
                ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ]
        )
        pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
        
        # Measure training time
        start_time = time.time()
        pipeline.fit(X_train, y_train)
        training_time = time.time() - start_time
        
        # Evaluate
        y_pred = pipeline.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        bal_acc = balanced_accuracy_score(y_test, y_pred)
        
        # Save results
        results.append({
            "Scaler": scaler_name,
            "Model": model_name,
            "Accuracy": acc,
            "Balanced Accuracy": bal_acc,
            "Training Time (s)": training_time,
        })

results_df = pd.DataFrame(results)
print("Baseline Results:")
print(results_df)

Baseline Results:
           Scaler          Model  Accuracy  Balanced Accuracy  \
0  StandardScaler  Decision Tree  0.935374           0.880730   
1  StandardScaler            KNN  0.863946           0.651180   
2  StandardScaler          Dummy  0.838435           0.500000   
3    MinMaxScaler  Decision Tree  0.935374           0.880730   
4    MinMaxScaler            KNN  0.826531           0.590627   
5    MinMaxScaler          Dummy  0.838435           0.500000   

   Training Time (s)  
0           0.042549  
1           0.007462  
2           0.007779  
3           0.030499  
4           0.006964  
5           0.006824  


In [43]:
# Hyperparameter tuning (HPO)
param_grid_knn = {"model__n_neighbors": [3, 5, 7, 9]}
param_grid_tree = {"model__max_depth": [3, 5, 10, None]}

# GridSearchCV for KNN
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
knn_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", KNeighborsClassifier())])
grid_knn = GridSearchCV(knn_pipeline, param_grid_knn, cv=3, scoring="balanced_accuracy", n_jobs=-1)
grid_knn.fit(X_train, y_train)

# GridSearchCV for Decision Tree
tree_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", DecisionTreeClassifier(random_state=42))])
grid_tree = GridSearchCV(tree_pipeline, param_grid_tree, cv=3, scoring="balanced_accuracy", n_jobs=-1)
grid_tree.fit(X_train, y_train)

# Compare HPO results
print("Best Parameters for KNN:", grid_knn.best_params_)
print("Best Parameters for Decision Tree:", grid_tree.best_params_)
print("Best Balanced Accuracy (KNN):", grid_knn.best_score_)
print("Best Balanced Accuracy (Decision Tree):", grid_tree.best_score_)

# Evaluate the best models on the test set
final_knn = grid_knn.best_estimator_
final_tree = grid_tree.best_estimator_

knn_test_acc = balanced_accuracy_score(y_test, final_knn.predict(X_test))
tree_test_acc = balanced_accuracy_score(y_test, final_tree.predict(X_test))

print("Final KNN Test Balanced Accuracy:", knn_test_acc)
print("Final Decision Tree Test Balanced Accuracy:", tree_test_acc)

Best Parameters for KNN: {'model__n_neighbors': 3}
Best Parameters for Decision Tree: {'model__max_depth': None}
Best Balanced Accuracy (KNN): 0.6600106080687257
Best Balanced Accuracy (Decision Tree): 0.7737843581511896
Final KNN Test Balanced Accuracy: 0.7608839543076759
Final Decision Tree Test Balanced Accuracy: 0.8807302231237323
