<a href="https://colab.research.google.com/github/EmilyJarecki/IBM-Employee-Attrition/blob/main/Project_HR_IBM_Attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pycaret

In [None]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from sklearn import set_config

#PyCaret
from pycaret.classification import *

%matplotlib inline

In [None]:
from google.colab import files
uploaded = files.upload()


### 1. Import Dataset

In [None]:
import io

# files.upload() returns {filename: bytes}, so read the workbook straight from memory
data = pd.read_excel(
    io.BytesIO(uploaded['IBM HR Employe Attrition Sample Data.xlsx']), header=0)

data.head()

### 2. Set up the PyCaret environment

In [None]:
# remove unnecessary columns

data = data.drop(columns=['EmployeeNumber', 'EmployeeCount',
                          'Over18', 'StandardHours', 'DailyRate',
                          'HourlyRate', 'MonthlyRate'])


clf1 = setup(data, target = 'Attrition', session_id=786)


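# setup() builds the preprocessing pipeline: it imputes missing values
# and encodes categorical features
# By default it also holds out a test set using a 70/30 train/test split (train_size=0.7)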


#### Data Exploration

In [None]:
#1 Attrition Count
sns.countplot(data, x='Attrition', palette='winter')
plt.title('Attrition Count')
plt.show()

data['Attrition'].value_counts(normalize=True) * 100


In [None]:
# 2 Monthly Income by Attrition
sns.boxplot(data, x='Attrition', y='MonthlyIncome', palette='magma')
plt.title('Monthly Income by Attrition')
plt.show()

In [None]:
#3 Distribution of Age by Attrition
sns.histplot(data, x='Age', hue='Attrition', kde=True, palette='mako')
plt.title('Distribution of Age by Attrition')
plt.show()

In [None]:
#4 Job Level vs. Attrition
sns.countplot(data, x='JobLevel', hue='Attrition', palette='rocket')
plt.title('Attrition by Job Level')
plt.show()

# Group by JobLevel and Attrition and count
counts = data.groupby(['JobLevel', 'Attrition']).size().reset_index(name='Count')

# Calculate percentage within each JobLevel
# transform returns a result aligned with the rows of counts, so it can be assigned directly
counts['Percentage'] = counts.groupby('JobLevel')['Count'].transform(lambda x: x / x.sum() * 100)
print(counts)

In [None]:
#5 Department vs Attrition
sns.countplot(data, x='Department', hue='Attrition', palette='tab10')
plt.title('Attrition by Department')
plt.xticks(rotation=20)
plt.show()

dept_attrition_matrix = pd.crosstab(data['Department'], data['Attrition'])

print()
print("------------WHOLE NUMBERS-----------")
print(dept_attrition_matrix)
print()
print()

print("-------------PERCENTAGE-------------")
dept_attrition_pct = pd.crosstab(data['Department'], data['Attrition'], normalize='index') * 100

print(dept_attrition_pct.round(2))

In [None]:
# Encode the target as 0/1 so it is included in the numeric correlation matrix
data['Attrition_Encoded'] = data['Attrition'].map({'Yes': 1, 'No': 0})

# Heatmap
plt.figure(figsize=(20, 12))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


### 3. Compare Baseline

In [None]:
compare_models()

### 4. Create Model

In [None]:
lr_model = create_model('lr')

### 5. Tune Model

In [None]:
tuned_lr = tune_model(lr_model, n_iter=50, optimize = 'AUC')

In [None]:
print("Logistic Regression model:")
print(lr_model)
print()
print()
print("Tuned Logistic Regression Model: ")
print(tuned_lr)

### 6. Ensemble Model
Ensembling is the process of combining predictions from multiple machine learning models to create a stronger, more robust model.

Think of it like this: instead of relying on one "expert" (a single model), you ask several experts and average or vote on their answers. This tends to improve accuracy and reduce overfitting.

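The ensemble model types are:
* Bagging: train copies of one model on bootstrap samples of the training data and aggregate their predictions
* Boosting: train models sequentially, each one focusing on the errors of the previous ones
* Blending/Stacking: combine the predictions of different model types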
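
The "average or vote" idea can be sketched in a few lines of plain Python. This is only an illustration with made-up predictions from three hypothetical models for a single employee; it is not how PyCaret implements `ensemble_model` internally.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote: return the label predicted by most models."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, threshold=0.5):
    """Average the predicted probabilities of 'Yes', then apply a threshold."""
    mean_prob = sum(probabilities) / len(probabilities)
    label = "Yes" if mean_prob >= threshold else "No"
    return label, mean_prob

# Hypothetical outputs of three models for the same employee
predicted_labels = ["Yes", "No", "No"]   # each model's class prediction
predicted_probs = [0.90, 0.40, 0.45]     # each model's P(Attrition = Yes)

print("Hard vote:", hard_vote(predicted_labels))
print("Soft vote:", soft_vote(predicted_probs))
```

With these numbers the two rules disagree: two of three models vote "No", but the one confident "Yes" pulls the averaged probability above 0.5. That is why the choice between voting and averaging matters.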

In [None]:
bagged_lr = ensemble_model(tuned_lr, n_estimators=50)

print(bagged_lr)

In [None]:
boosted_lr = ensemble_model(tuned_lr, method = 'Boosting')

### 7. Blend Models
To get the most value out of blending, mix diverse models: Random Forest and gradient boosting (XGBoost here) capture complex feature interactions, Logistic Regression focuses on linear relationships, and K-Nearest Neighbors predicts from proximity to other data points.
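
A rough way to see why diversity matters: if several models are each right 70% of the time and their mistakes are independent, a majority vote is right more often than any single model. The sketch below computes that probability from the binomial distribution. The 70% accuracy and the independence of errors are assumptions made for illustration; real models trained on the same data have correlated errors, so the actual gain is smaller.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # smallest number of correct models that wins the vote
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

single_model_accuracy = 0.70  # hypothetical accuracy of each individual model

for n in (1, 3, 5):
    acc = majority_vote_accuracy(n, single_model_accuracy)
    print(f"{n} model(s): majority vote correct with probability {acc:.3f}")
```

Under these assumptions three models reach roughly 78% and five roughly 84%. If all models made exactly the same mistakes, the vote would stay at 70%, which is the argument for blending models that err in different ways.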

In [None]:
rf_model = create_model('rf')
xgb_model = create_model('xgboost')
knn_model = create_model('knn')

blended = blend_models(estimator_list = [tuned_lr, rf_model, xgb_model, knn_model])

### 8. Analyze Model

In [None]:
data.corr(numeric_only=True)['MonthlyIncome'].sort_values(ascending=False)


In [None]:
evaluate_model(tuned_lr)

In [None]:
plot_model(tuned_lr, plot = 'auc')

In [None]:
plot_model(tuned_lr, plot = 'confusion_matrix')

### 9. Interpret Model

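My notes: `interpret_model` uses SHAP (SHapley Additive exPlanations) values, which work well with tree-based models such as XGBoost. The default summary plot shows how strongly each feature pushes the model's prediction for the target variable up or down.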
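
To make "SHAP value" concrete, here is a minimal brute-force sketch that computes exact Shapley values for one observation of a toy scoring function. The function `toy_risk`, its coefficients, and the feature values are all hypothetical. The `shap` library uses much faster algorithms for tree models, so this only illustrates the definition.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features can be switched from baseline to x."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]            # reveal feature i's actual value
            updated = predict(current)
            totals[i] += updated - previous
            previous = updated
    return [t / len(orderings) for t in totals]

def toy_risk(features):
    """Hypothetical attrition score from an overtime flag and income in thousands."""
    overtime, income_k = features
    return 0.2 + 0.3 * overtime - 0.02 * income_k + 0.01 * overtime * income_k

employee = [1, 5]    # works overtime, income of 5k (made-up values)
baseline = [0, 10]   # reference employee: no overtime, income of 10k

phi = shapley_values(toy_risk, employee, baseline)
print("Contributions (overtime, income):", phi)
print("Sum of contributions:", sum(phi))
print("Prediction minus baseline:", toy_risk(employee) - toy_risk(baseline))
```

The contributions add up to the difference between the employee's score and the baseline score. That additivity is what lets the SHAP plots split a prediction into per-feature effects.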

In [None]:
interpret_model(xgb_model)



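# interpret_model only supports tree-based models, so XGBoost is used here instead of the tuned logistic regression
# Tree-based models are also less sensitive to multicollinearity than logistic regression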

In [None]:
interpret_model(xgb_model, plot = 'correlation')

In [None]:
interpret_model(xgb_model, plot = 'reason', observation=1)

### 10. Predict Model

Finally, test the tuned model on the hold-out data. Calling `predict_model` without a `data` argument scores the test split that `setup()` held out.
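
When reading the hold-out results, accuracy alone can be misleading because far fewer employees leave than stay. The sketch below uses plain Python and made-up labels (not the notebook's actual predictions) to show how accuracy, precision and recall are derived from actual and predicted labels.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary target."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def summarize(counts):
    """Return accuracy, precision and recall, guarding against division by zero."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }

# Hypothetical hold-out labels for six employees
actual = ["Yes", "No", "No", "Yes", "No", "No"]
predicted = ["Yes", "No", "No", "No", "No", "Yes"]

counts = confusion_counts(actual, predicted)
print(dict(counts))
print(summarize(counts))
```

In this toy case the model is right four times out of six, yet it finds only one of the two employees who actually left. For an attrition problem, recall on the "Yes" class is usually the number worth watching.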

In [None]:
pred_holdouts = predict_model(tuned_lr)
pred_holdouts.head()