## A CLASSIFICATION PROJECT - PREDICTING EMPLOYEE ATTRITION

#### BUSINESS UNDERSTANDING

Employee attrition refers to the process where employees leave an organization, either voluntarily or involuntarily. High attrition rates can be costly for businesses, as they impact productivity, morale, recruitment costs, and training expenses. 
The primary objective of predicting employee attrition is to identify employees who are at risk of leaving the organization in the near future. By doing so, companies can take proactive measures to improve retention, enhance employee satisfaction, and reduce the overall cost of turnover.

##### PROJECT GOAL
The goal of this project is to develop a robust machine learning pipeline to predict whether specific employees are likely to leave the company. The predictive modeling will be conducted following an in-depth analysis of the dataset obtained. 

##### ANALYTICAL QUESTIONS
1. What is the percentage of Attrition?
2. How satisfied are employees after 3 years at the company?
3. Does marital status affect attrition rate?

#### DATA UNDERSTANDING

#### Data Features
* Age: Age of employees

* Attrition: Employee attrition status

* Department: Department of employees

* Education Level: 1-Bachelor's degree; 2-Master's degree

* EducationField

* Environment Satisfaction: 1-Low; 2-Medium; 3-High; 4-Very High;

* Job Satisfaction: 1-Low; 2-Medium; 3-High; 4-Very High;

* MaritalStatus

* Gross Salary

* Work Life Balance: 1-Bad; 2-Good; 3-Better; 4-Best;

* Length of Service: number of years with employer


#### Loading the Necessary Libraries

In [172]:
#Core libraries
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

#Data preparation
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler, StandardScaler

#Models
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#Evaluation
from sklearn.metrics import accuracy_score, auc, classification_report, confusion_matrix, f1_score, roc_curve

warnings.filterwarnings('ignore')

In [None]:
#Load the dataset
data = pd.read_excel('Data/Attrition Dataset.xlsx')
data.head()

In [None]:
#Check column data types and non-null counts
data.info()

* All the columns have the correct data type.

In [None]:
#Check for null values
data.isna().sum()

#### Observations
* Most of the columns in the dataset do not have missing values.
* The column "Gross Salary" has 3 missing values, meaning that there are 3 records where the "Gross Salary" information is not provided. These will be handled using imputation.

In [85]:
#Remove the Employee ID column
data = data.drop('Employee ID', axis=1)

In [None]:
data.describe().T

In [None]:
data.describe(include='object').T

In [None]:
#Check for duplicates
data.duplicated().sum()

* This implies there are no duplicates in the dataset

In [None]:
# Print unique values for each column
for column in data.columns:
    print(f'Column Name: {column}\n')
    print(f'Number of Unique Values: {data[column].unique().size}\n')
    print(f'{data[column].unique()}')
    print('=' * 80)

In [None]:
#Convert column names to lower case
data.columns = data.columns.str.lower()

#Check the columns to confirm
column_names = data.columns
print(column_names)

In [None]:
#Replace spaces in column names with underscores
data.columns = data.columns.str.replace(' ', '_')

# Check the new column names
print(data.columns)

In [None]:
#Rename educationfield column
data.rename(columns={'educationfield': 'education_field'}, inplace=True)

# Check the updated column names
print(data.columns)

#### Univariate Analysis

In [None]:
#Set the color palette for the project (set_palette applies it; color_palette only returns it)
sns.set_palette("pastel")
data.hist(figsize=(10, 10), grid=False, color='skyblue')
plt.show()

* Add KDE plots to see a smoother representation of the distribution of the features

In [None]:
for column in data.select_dtypes('number').columns:
    data[column].plot(kind='kde')
    plt.title(f'KDE for {column}')
    plt.show()

* Add a Boxplot to detect outliers and the scale of the data

In [None]:
sns.boxplot(data=data, orient='h')

* Points above the upper whisker indicate salaries much higher than typical, suggesting significant variation in gross salaries within the dataset.
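
The points beyond the whisker can also be counted rather than read off the plot. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in seaborn's boxplot. The helper name `iqr_outlier_mask` and the salary values are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical salaries: nine typical values and one very high one
salaries = pd.Series([3000, 3200, 3100, 2900, 3300, 3050, 3150, 2950, 3250, 12000])
mask = iqr_outlier_mask(salaries)

# Only the 12000 salary falls above the upper fence
print(salaries[mask].tolist())
```

On the project data the same helper could be called as `iqr_outlier_mask(data['gross_salary'])`, assuming the column has already been renamed as above.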

#### Bi-variate Analysis

In [None]:
correlation = data.corr(numeric_only=True)
correlation

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(data=correlation, annot=True, cmap='coolwarm')

#### Multivariate Analysis

In [None]:
df = data[['attrition', 'length_of_service', 'age']]
#pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df, hue='attrition', palette={'Yes': 'skyblue', 'No': 'lightgreen'})
plt.show()

#### Answering Analytical Questions

#### What is the percentage of Attrition?

In [None]:
plt.figure(figsize=(10, 10))
data['attrition'].value_counts().plot.pie(startangle=90, colors=['pink', 'skyblue'], autopct='%1.1f%%', explode=(0.01, 0.05), pctdistance=0.85)
plt.title('The Attrition distribution')

#### How satisfied are employees after 3 years at the company?

In [None]:
data.columns

In [None]:
data.groupby(['job_satisfaction', 'attrition'])['department'].count().rename('Total').reset_index()

In [None]:
after_three_years = data[data['length_of_service'] > 3]
after_three_years.shape

In [None]:
after_three_years[['job_satisfaction', 'attrition']].groupby(['job_satisfaction', 'attrition'])['attrition'].count().rename('Count').reset_index()

#### Does marital status affect attrition rate?

In [None]:
# Calculate the total count and attrition count by marital status
marital_status_summary = data.groupby('marital_status')['attrition'].value_counts().unstack().fillna(0)

# Calculate the attrition rate
marital_status_summary['attrition_rate'] = marital_status_summary['Yes'] / (marital_status_summary['Yes'] + marital_status_summary['No'])

# Reset index for better readability
marital_status_summary = marital_status_summary.reset_index()

print(marital_status_summary)

In [None]:
# Calculate the total count of attrition (Yes + No) for each marital status
marital_status_summary['Total'] = marital_status_summary['Yes'] + marital_status_summary['No']

# Calculate the percentage of Yes and No attrition
marital_status_summary['Yes_pct'] = marital_status_summary['Yes'] / marital_status_summary['Total'] * 100
marital_status_summary['No_pct'] = marital_status_summary['No'] / marital_status_summary['Total'] * 100

# Create a stacked bar chart to visualize the percentages of 'Yes' and 'No' attrition by marital status
# (DataFrame.plot opens its own figure, so no separate plt.figure() call is needed)
marital_status_summary[['Yes_pct', 'No_pct']].plot(kind='bar', stacked=True, color=['skyblue', 'lightgreen'], figsize=(10, 6))

# Add percentage labels on each bar
for i in range(len(marital_status_summary)):
    plt.text(i, marital_status_summary['Yes_pct'].iloc[i] / 2, f"{marital_status_summary['Yes_pct'].iloc[i]:.1f}%", ha='center', color='white', fontweight='bold')
    plt.text(i, marital_status_summary['Yes_pct'].iloc[i] + marital_status_summary['No_pct'].iloc[i] / 2, f"{marital_status_summary['No_pct'].iloc[i]:.1f}%", ha='center', color='white', fontweight='bold')

plt.title('Percentage of Attrition by Marital Status')
plt.xlabel('marital_status')
plt.ylabel('Percentage')
plt.xticks(ticks=range(len(marital_status_summary)), labels=marital_status_summary['marital_status'], rotation=0)
plt.legend(['Attrition: Yes', 'Attrition: No'])
plt.tight_layout()
plt.show()
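
Comparing attrition rates shows the size of the difference between groups, but not whether it could be due to chance. One common way to check this is a chi-square test of independence on the contingency table. This test is not part of the original analysis; the sketch below is an assumption about how it could be added, and the counts are hypothetical rather than taken from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, not taken from the project dataset
toy = pd.DataFrame({
    'marital_status': ['Single'] * 100 + ['Married'] * 100 + ['Divorced'] * 100,
    'attrition': (['Yes'] * 40 + ['No'] * 60      # Single
                  + ['Yes'] * 20 + ['No'] * 80    # Married
                  + ['Yes'] * 25 + ['No'] * 75),  # Divorced
})

# Rows: marital status, columns: attrition (No / Yes)
contingency = pd.crosstab(toy['marital_status'], toy['attrition'])

# Null hypothesis: attrition is independent of marital status
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}')
```

A small p-value (for example below 0.05) would suggest that attrition and marital status are not independent. On the project data the table could be built with `pd.crosstab(data['marital_status'], data['attrition'])`.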

#### Data Preparation

In [None]:
#Check dataframe
data.head()

In [None]:
#Check column data types and non-null counts
data.info()

* The missing values in the gross_salary column will be replaced with the median of the column.

In [None]:
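# Fill missing values in 'gross_salary' with the median
# (assign the result back; in-place fillna on a selected column is deprecated in recent pandas)
median_gross_salary = data['gross_salary'].median()
data['gross_salary'] = data['gross_salary'].fillna(median_gross_salary)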

# Check if the missing values are filled
print(data['gross_salary'].isnull().sum())

In [None]:
#Confirm that no column has missing values after imputation
data.info()

#### Check if Dataset is Balanced
To check whether the dataset is balanced, we set a threshold equal to 5% of the total number of rows. If the absolute difference between the counts of the two classes is below this threshold, the dataset is considered balanced; otherwise, it is considered imbalanced.

In [None]:
# Count the occurrences of each class
class_counts = data['attrition'].value_counts()

# Set the threshold for imbalance (5% of the total number of rows)
threshold = len(data) * 0.05 

# Check if the dataset is balanced using .iloc to access by position
is_balanced = abs(class_counts.iloc[0] - class_counts.iloc[1]) < threshold

if is_balanced:
    print("The dataset is balanced.")
else:
    print("The dataset is imbalanced.")


In [None]:
#Count the occurrences of each class
class_counts = data['attrition'].value_counts()

#Plot the distribution of the target variable
plt.figure(figsize=(6, 4))
bars = class_counts.plot(kind='bar', color=['skyblue', 'orange'])
plt.title('Distribution of Attrition')
plt.xlabel('Attrition')
plt.ylabel('Count')
plt.xticks(rotation=0)

#Annotate the bars with attrition counts
for i, count in enumerate(class_counts):
    plt.text(i, count + 10, str(count), ha='center', va='bottom')

plt.show()

* The visual also confirms the dataset is balanced, since the absolute difference between the class counts is below the threshold defined above.

#### MODELING AND EVALUATION

#### Training the Balanced Dataset
The data is split as follows:

* X contains all the features except the target variable (attrition).
* y contains only the target variable (attrition).

We use train_test_split to split the data into training and evaluation sets, with test_size set to 0.3 so that 30% of the data is held out for evaluation and the remaining 70% is used for training.

X_train and y_train contain the training features and target variable respectively. X_eval and y_eval contain the evaluation features and target variable respectively.

In [117]:
#Define features (X) and target variable (y)
X = data.drop('attrition', axis=1) 
y = data['attrition']  

#Split the dataset into training and evaluation sets
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
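
The split above passes `stratify=y`, which keeps the share of each attrition class the same in the training and evaluation sets. A minimal sketch of that effect, using a hypothetical target with 20% positives (the toy names `X_toy` and `y_toy` are not part of the project):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target with 20% positives
y_toy = pd.Series(['Yes'] * 20 + ['No'] * 80, name='attrition')
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)

# Both splits keep the 20% / 80% class proportions of the full data
print(y_tr.value_counts(normalize=True).round(2).to_dict())
print(y_ev.value_counts(normalize=True).round(2).to_dict())
```

Without stratification, a random split of a small dataset can leave the evaluation set with noticeably fewer or more leavers than the training set, which makes the metrics harder to compare.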

In [118]:
#Instantiate encoder
encoder = LabelEncoder()

#Encode y_train
y_train_encoded = encoder.fit_transform(y_train)

#Encode y_eval
y_eval_encoded = encoder.transform(y_eval)

In [119]:
#Get categorical columns
categorical_columns = X.select_dtypes('object').columns

#Get numerical columns
numerical_columns = X.select_dtypes('number').columns

In [120]:
#Prepare numerical pipeline
numerical_pipeline=Pipeline(steps=[
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

#Prepare categorical pipeline
categorical_pipeline=Pipeline(steps=[
    ('categorical_imputer',SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))

])

#Column transformer preparation
preprocessor=ColumnTransformer(transformers=[
    ('numerical_pipeline', numerical_pipeline,numerical_columns),
    ('categorical_pipeline', categorical_pipeline, categorical_columns)
])

In [167]:
#Define the models
model = [
    ('K-Nearest_Neighbors', KNeighborsClassifier(n_neighbors=5)),  
    ('Logistic_Regression', LogisticRegression(random_state=42)),  
    ('Support_Vector_Machine', SVC(random_state=42)),  
    ('Decision_Tree', DecisionTreeClassifier(random_state=42)),  
    ('Random_Forest', RandomForestClassifier(random_state=42)),  
    ('Gradient_Boosting', GradientBoostingClassifier(random_state=42)),  
]

# Initialize an empty dictionary to store pipelines
all_pipelines = {}

#Create a DataFrame for the metrics
metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

#Train and evaluate each model
for model_name, classifier in model:
    
    #Create pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('feature_selection', SelectKBest(mutual_info_classif, k='all')),
        ('classifier', classifier),
    ])
    
    #Fit data to pipeline
    pipeline.fit(X_train, y_train_encoded)
    all_pipelines[model_name] = pipeline

    #Make predictions on the evaluation set
    y_pred = pipeline.predict(X_eval)

    #Generate classification report for each model
    metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    #Evaluate the model (the name f1 avoids shadowing the imported f1_score function)
    accuracy = metrics['accuracy']
    precision = metrics['weighted avg']['precision']
    recall = metrics['weighted avg']['recall']
    f1 = metrics['weighted avg']['f1-score']

    #Add metrics to metrics_output
    metrics_output.loc[len(metrics_output)] = [model_name, accuracy, precision, recall, f1]

In [None]:
#Display the metrics_output
metrics_output.sort_values(ascending=False, by='f1_score')

#### Observation
* Support Vector Machine (SVM) shows the best performance in terms of all metrics, making it a solid candidate for the highest-performing model in the set.

* Logistic Regression and Gradient Boosting perform similarly, providing reasonable accuracy with slightly lower scores than SVM, but may still be viable options. 

* Decision Tree and Random Forest have the lowest accuracy and other metrics, which indicates they may need tuning or might not be well-suited for this dataset.

* The F1-score for all models is generally close to the accuracy, indicating balanced precision and recall for most models, with SVM having the best balance.
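
One point worth keeping in mind when reading the table: the metrics are weighted averages over both classes, and weighted-average recall is always equal to accuracy. Recall for the leavers alone (label 1 after encoding) can therefore look quite different. The sketch below illustrates this with hypothetical labels; it is not output from the models above.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 1 = employee left, 0 = employee stayed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)
accuracy = accuracy_score(y_true, y_pred)
weighted_recall = report['weighted avg']['recall']

# Recall restricted to the positive class (the leavers)
leaver_recall = recall_score(y_true, y_pred, pos_label=1)

# Accuracy and weighted recall are both 0.8, while recall for the leavers is 0.5
print(accuracy, weighted_recall, leaver_recall)
```

If missed leavers are the main concern, reporting `report['1']['recall']` next to the weighted averages may give a clearer picture.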

#### Generate a Confusion Matrix for the Models

In [123]:
#Define the models
model = [
    ('K-Nearest_Neighbors', KNeighborsClassifier(n_neighbors=5)),  
    ('Logistic_Regression', LogisticRegression(random_state=42)),  
    ('Support_Vector_Machine', SVC(random_state=42)),  
    ('Decision_Tree', DecisionTreeClassifier(random_state=42)),  
    ('Random_Forest', RandomForestClassifier(random_state=42)),  
    ('Gradient_Boosting', GradientBoostingClassifier(random_state=42)),  
]

# Initialize an empty dictionary to store pipelines
all_pipelines_c = {}

# All confusion matrix
all_confusion_matrix =  {}

#Create a DataFrame for the metrics
c_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

#Train and evaluate each model
for model_name, classifier in model:
    
    #Create pipeline
    c_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier),
    ])
    
    #Fit data to pipeline
    c_pipeline.fit(X_train, y_train_encoded)
    all_pipelines_c[model_name] = c_pipeline

    #Make predictions on the test set
    y_pred = c_pipeline.predict(X_eval)

    #Generate classification report for each model
    c_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    #Evaluate the model
    c_accuracy = c_metrics['accuracy']
    c_precision = c_metrics['weighted avg']['precision']
    c_recall = c_metrics['weighted avg']['recall']
    c_f1_score= c_metrics['weighted avg']['f1-score']

    #Add metrics to metrics_output
    c_metrics_output.loc[len(c_metrics_output)] = [model_name, c_accuracy, c_precision, c_recall, c_f1_score]

    # Compute the confusion matrix and store it
    c_matrix = confusion_matrix(y_eval_encoded, y_pred)
    all_confusion_matrix[model_name] = c_matrix

In [None]:
# Iterate over the keys (model names) in the all_confusion_matrix dictionary
for model_name, c_matrix in all_confusion_matrix.items():
    print(f"Confusion Matrix for {model_name}:")
    print(c_matrix)

In [None]:
# Plot confusion matrices using heatmaps
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# The loop variable is called model_name so that it does not overwrite the `model` list
for i, (model_name, c_matrix) in enumerate(all_confusion_matrix.items()):
    ax = axes[i // 3, i % 3]
    sns.heatmap(c_matrix, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(f"Confusion Matrix for {model_name}")
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')

plt.tight_layout()
plt.show()

#### Observation
* K-Nearest Neighbors and Support Vector Machine both show relatively high false negatives, meaning they are more likely to miss true positives.

* Logistic Regression and Gradient Boosting have a more balanced error rate between false positives and false negatives, performing slightly better overall.

* Random Forest and Decision Tree also perform similarly but tend to have higher false negatives, especially Random Forest.
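
The four cells of a binary confusion matrix can be unpacked by name, which makes statements about false negatives easier to check. The sketch below uses hypothetical labels; with LabelEncoder, 'No' is encoded as 0 and 'Yes' as 1 because the classes are sorted alphabetically.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = attrition 'Yes', 0 = attrition 'No'
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels, so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)        # share of actual leavers that were caught
precision = tp / (tp + fp)     # share of flagged employees who actually left

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'recall={recall:.2f}, precision={precision:.2f}')
```

The same unpacking could be applied to each matrix in `all_confusion_matrix` to compare false negatives across models in a single table.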

#### Visualize Evaluation Using ROC-AUC

In [None]:
# Define the models
model = [
    ('K-Nearest_Neighbors', KNeighborsClassifier(n_neighbors=5)),  
    ('Logistic_Regression', LogisticRegression(random_state=42)),  
    ('Support_Vector_Machine', SVC(random_state=42, probability=True)),  # Set probability=True
    ('Decision_Tree', DecisionTreeClassifier(random_state=42)),  
    ('Random_Forest', RandomForestClassifier(random_state=42)),  
    ('Gradient_Boosting', GradientBoostingClassifier(random_state=42)),  
]

all_pipelines_c = {}

# Create a DataFrame for the metrics
c_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

# Train and evaluate each model
for model_name, classifier in model:
    
    # Create pipeline with feature selection
    c_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('feature_selection', SelectKBest(score_func=f_classif, k=5)),
        ('classifier', classifier)
    ])
    
    # Fit data to pipeline
    c_pipeline.fit(X_train, y_train_encoded)
    all_pipelines_c[model_name] = c_pipeline

    # Make predictions on the test set
    y_pred = c_pipeline.predict(X_eval)

    # Generate classification report for each model
    c_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    # Evaluate the model
    c_accuracy = c_metrics['accuracy']
    c_precision = c_metrics['weighted avg']['precision']
    c_recall = c_metrics['weighted avg']['recall']
    c_f1_score= c_metrics['weighted avg']['f1-score']

    # Add metrics to metrics_output
    c_metrics_output.loc[len(c_metrics_output)] = [model_name, c_accuracy, c_precision, c_recall, c_f1_score]

# Plot roc_curves
fig, ax = plt.subplots(figsize=(10, 5))
all_roc_data = {}

for model_name, c_pipeline in all_pipelines_c.items():

    # Predicted probability of the positive class (attrition = Yes)
    y_score = c_pipeline.predict_proba(X_eval)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_eval_encoded, y_score)
    roc_auc = auc(fpr, tpr)

    # Keep the ROC points so the thresholds can be inspected later
    roc_data = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr, 'Threshold': thresholds})
    all_roc_data[model_name] = roc_data

    ax.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')

# Diagonal reference line for a random classifier, drawn once
ax.plot([0, 1], [0, 1], linestyle='--', color='grey')

# ROC convention: false positive rate on the x-axis, true positive rate on the y-axis
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve Plot for all Pipelines')
plt.legend()
plt.show()

* Based on the AUC (Area Under the ROC Curve) values, the best performing models are:
1. Gradient Boosting
2. Logistic Regression
3. Random Forest

* These models have higher AUC values, indicating better overall performance in the trade-off between true positive rate and false positive rate.
* We will select Gradient Boosting and Logistic Regression as the best performing models and perform threshold optimization as well as hyperparameter tuning on these two models.
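
A threshold can be chosen by inspecting the ROC table, as in this notebook, or derived from it with a rule. One such rule is Youden's J statistic (TPR minus FPR), which picks the point furthest above the diagonal. This rule is a substitute shown for illustration, not the method used in this project, and the labels and scores below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_scores = tpr - fpr
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds[best_idx]

# roc_curve treats a score >= threshold as a positive prediction
predictions = (y_score >= best_threshold).astype(int)

print(f'best threshold = {best_threshold:.2f}, J = {j_scores[best_idx]:.2f}')
```

Youden's J weights false positives and false negatives equally. When missed leavers are costlier than false alarms, a lower threshold than the one it suggests may still be preferable.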

#### Threshold Optimization

In [None]:
# Preview of the threshold for Gradient Boosting
all_roc_data['Gradient_Boosting'].loc[10:,]

In [None]:
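#Preview the Gradient Boosting pipeline stored in all_pipelines (the run with k='all');
#note that all_roc_data above was computed from all_pipelines_c (the run with k=5)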
GradientBoostingClassifier_pipeline = all_pipelines['Gradient_Boosting']
GradientBoostingClassifier_pipeline

In [None]:
# Best threshold
gradient_threshold = 0.2

# Predict probabilities
y_pred_proba = GradientBoostingClassifier_pipeline.predict_proba(X_eval)[:, 1]

# Make predictions based on the threshold
predictions = (y_pred_proba > gradient_threshold).astype(int)

# Compute confusion matrix
gradient_threshold_matrix = confusion_matrix(y_eval_encoded, predictions)

# Saving the best model and threshold in variables
best_gradient_boosting_model = GradientBoostingClassifier_pipeline
best_gradient_threshold = gradient_threshold

# Print the confusion matrix
print(gradient_threshold_matrix)

In [None]:
# Plot confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(gradient_threshold_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Gradient Boosting')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# Preview of the threshold for Logistic Regression
all_roc_data['Logistic_Regression'].loc[10:,]

In [None]:
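#Preview the Logistic Regression pipeline stored in all_pipelines (the run with k='all');
#note that all_roc_data above was computed from all_pipelines_c (the run with k=5)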
LogisticRegression_pipeline = all_pipelines['Logistic_Regression']
LogisticRegression_pipeline

In [None]:
# Best threshold
LR_threshold = 0.2

# Predict probabilities
y_pred_proba = LogisticRegression_pipeline.predict_proba(X_eval)[:, 1]

# Make predictions based on the threshold
predictions = (y_pred_proba > LR_threshold).astype(int)

# Compute confusion matrix
LR_threshold_matrix = confusion_matrix(y_eval_encoded, predictions)

# Saving the best model and threshold in variables
best_logistic_regression_model = LogisticRegression_pipeline
best_LR_threshold = LR_threshold
LR_threshold_matrix

In [None]:
# Plot confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(LR_threshold_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
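
The effect of a threshold such as 0.2 is easier to judge when several thresholds are compared side by side. The sketch below tabulates precision and recall for three thresholds on hypothetical probabilities; the values are made up and do not come from the fitted pipelines.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

rows = []
for threshold in [0.2, 0.5, 0.8]:
    # Same rule as in the cells above: flag a case when the probability exceeds the threshold
    y_pred = (y_proba > threshold).astype(int)
    rows.append({
        'threshold': threshold,
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred),
    })

sweep = pd.DataFrame(rows)
print(sweep)
```

In this toy example, lowering the threshold raises recall (fewer missed leavers) at the cost of precision (more false alarms), which is the trade-off the 0.2 threshold accepts.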

#### HYPERPARAMETER TUNING TO IMPROVE MODEL PERFORMANCE

We will perform hyperparameter tuning on the two best performing models, Gradient Boosting and Logistic Regression, to try to improve their performance.

#### Hypertuning the Gradient Boosting Model

In [None]:
#Inspect the current parameters of the Gradient Boosting pipeline
current_params = best_gradient_boosting_model.get_params()
current_params

In [None]:
# Define the parameter grid for hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 200, 300],   # Number of trees
    'classifier__learning_rate': [0.01, 0.1, 0.05],  # Learning rate
    'classifier__max_depth': [3, 5, 7],  # Depth of trees
    'classifier__min_samples_split': [2, 5, 10],  # Minimum samples for node split
    'classifier__min_samples_leaf': [1, 2, 4],  # Minimum samples at leaf
    'classifier__subsample': [0.8, 1.0],  # Fraction of samples used for training each tree
}

# Create the pipeline (same steps as the earlier Gradient Boosting pipeline)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # ColumnTransformer for preprocessing
    ('feature_selection', SelectKBest(score_func=mutual_info_classif, k='all')),  # Feature selection
    ('classifier', GradientBoostingClassifier(random_state=42))  # Classifier
])

# Initialize GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(estimator=pipeline, 
                           param_grid=param_grid, 
                           cv=5,  # 5-fold cross-validation
                           scoring='accuracy',  # Evaluation metric
                           verbose=2,  # Print detailed progress
                           n_jobs=-1)  # Use all available cores

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train_encoded)

# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Hyperparameters: ", best_params)

In [None]:
# Initialize a DataFrame to store the results
tuned_models_df = pd.DataFrame(columns=['Model name', 'Accuracy', 'Precision', 'Recall', 'F1-Score'])

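# Name the tuned model explicitly instead of relying on a loop variable left over from earlier cells
model_name = 'Gradient_Boosting'
all_pipelines_c[model_name] = best_model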

# Evaluate the best model on the test set
y_pred = best_model.predict(X_eval)

# Store classification report values as a dictionary
tuned_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)

# Grab values from the metric dictionary
accuracy = tuned_metrics['accuracy']
precision = tuned_metrics['weighted avg']['precision']
recall = tuned_metrics['weighted avg']['recall']
f1 = tuned_metrics['weighted avg']['f1-score']
    
# Add these values to the table
tuned_models_df.loc[len(tuned_models_df)] = [model_name, accuracy, precision, recall, f1]

# Sort table to have highest f1 on top
tuned_models_df.sort_values(by='F1-Score', ascending=False, inplace=True)

#Display the results
print(tuned_models_df)

#### Use the tuned Gradient Boosting model to determine the trade-offs


In [None]:
# Best threshold
gradient_threshold = 0.2

# Predict probabilities
y_pred_proba = best_model.predict_proba(X_eval)[:, 1]

# Make predictions based on the threshold
predictions = (y_pred_proba > gradient_threshold).astype(int)

# Compute confusion matrix
gradient_threshold_matrix = confusion_matrix(y_eval_encoded, predictions)

# Saving the best model and threshold in variables
best_gradient_boosting_model = best_model
best_gradient_threshold = gradient_threshold

# Print the confusion matrix
print(gradient_threshold_matrix)

In [None]:
# Plot confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(gradient_threshold_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Gradient Boosting')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### Observation
* The original model generally performs better than the tuned model on accuracy, precision, and F1-score (weighted averages at the default threshold), while recall is identical for both. Hyperparameter tuning therefore did not improve performance in this case, although the trade-offs between these metrics should still be weighed against the project's objectives.

* At the 0.2 threshold, the confusion matrix for the original Gradient Boosting model shows the best overall performance, with the highest accuracy, precision, recall, and F1-score. Its higher recall means it is better at identifying positive cases (fewer false negatives).

* At the same threshold, the confusion matrix for the tuned model shows good recall (75%), meaning it captures most of the actual positives, but lower precision (53%), meaning many of the predicted positives are false positives.
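
The grid search above selects parameters by accuracy. When the priority is catching leavers, the search could be scored on recall of the positive class instead. The sketch below shows that variation on synthetic data with a deliberately small grid; the data, the grid values, and the choice of `scoring='recall'` are assumptions for illustration, not the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data standing in for the attrition set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid={'n_estimators': [25, 50], 'max_depth': [2, 3]},
    scoring='recall',  # recall of the positive class (label 1)
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

With the project pipeline, the parameter names would keep their `classifier__` prefix and only the `scoring` argument would change.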

#### Hypertuning the Logistic Regression Model

In [None]:

#Column transformer preparation
preprocessor=ColumnTransformer(transformers=[
    ('numerical_pipeline', numerical_pipeline,numerical_columns),
    ('categorical_pipeline', categorical_pipeline, categorical_columns)
])

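# Define the parameter distributions for hyperparameter tuning.
# Each dict pairs solvers with penalties they support, so no search iteration is
# spent on an invalid combination (e.g. lbfgs with l1). 'elasticnet' (which also
# needs l1_ratio) and 'none' (removed in recent scikit-learn) are left out.
common = {'classifier__C': np.logspace(-4, 4, 20),
          'classifier__max_iter': [100, 200, 300]}
param_dist = [
    {**common, 'classifier__solver': ['lbfgs'], 'classifier__penalty': ['l2']},
    {**common, 'classifier__solver': ['liblinear', 'saga'], 'classifier__penalty': ['l1', 'l2']},
]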

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Initialize RandomizedSearchCV with the pipeline and parameter distribution
random_search = RandomizedSearchCV(estimator=pipeline,
                                   param_distributions=param_dist,
                                   n_iter=50,  
                                   cv=5,  
                                   scoring='accuracy',  
                                   verbose=2, 
                                   n_jobs=-1, 
                                   random_state=42)

# Fit RandomizedSearchCV to the training data
random_search.fit(X_train, y_train_encoded)

# Get the best parameters and best model
best_params_l = random_search.best_params_
best_model_l = random_search.best_estimator_

print("Best Hyperparameters: ", best_params_l)

In [None]:
# Initialize a DataFrame to store the results
tuned_models_df_l = pd.DataFrame(columns=['Model name', 'Accuracy', 'Precision', 'Recall', 'F1-Score'])

# Store the model name
model_name = "Logistic Regression"  

best_model_l = random_search.best_estimator_  

# Evaluate the best model on the test set
y_pred = best_model_l.predict(X_eval)

# Store classification report values as a dictionary
tuned_metrics_l = classification_report(y_eval_encoded, y_pred, output_dict=True)

# Grab values from the metric dictionary
accuracy = tuned_metrics_l['accuracy']
precision = tuned_metrics_l['weighted avg']['precision']
recall = tuned_metrics_l['weighted avg']['recall']
f1 = tuned_metrics_l['weighted avg']['f1-score']
    
# Add these values to the DataFrame
tuned_models_df_l.loc[len(tuned_models_df_l)] = [model_name, accuracy, precision, recall, f1]

# Sort table to have highest F1 on top
tuned_models_df_l.sort_values(by='F1-Score', ascending=False, inplace=True)

# Display the results
print(tuned_models_df_l)

#### Use the tuned Logistic Regression Model to determine the trade-offs


In [None]:
# Best threshold
LR_threshold = 0.2

# Predict probabilities
y_pred_proba = best_model_l.predict_proba(X_eval)[:, 1]

# Make predictions based on the threshold
predictions = (y_pred_proba > LR_threshold).astype(int)

# Compute confusion matrix
LR_threshold_matrix = confusion_matrix(y_eval_encoded, predictions)

# Saving the best model and threshold in variables
best_logistic_regression_model = best_model_l
best_LR_threshold = LR_threshold
LR_threshold_matrix

In [None]:
# Plot confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(LR_threshold_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')  
plt.show()

#### Observation
* The original Logistic Regression model performs better than the tuned model across all metrics (accuracy, precision, recall, and F1-score), making it preferable on these results.

* However, the confusion matrix for the tuned model shows that it is better at identifying positive cases (higher TP) and at reducing false negatives (FN), which is beneficial to this project.

* The confusion matrix for the original model shows that it is better at identifying negative cases (higher TN) and at reducing false positives (FP).

#### PROJECT IMPACT ASSESSMENT
##### Project Objective
The main objective of the project is to create a predictive model that maximizes true positives and minimizes false negatives. In employee attrition prediction, a false negative occurs when the model predicts that an employee will stay when they in fact leave. This error could lead to missed chances to take action and retain valuable employees.

##### Optimal Model Selection
Because reducing false negatives is paramount, the best model is the one with the fewest false negatives. In our assessment, the Gradient Boosting model excels in this respect. The decrease in false negatives indicates an enhanced ability to recognize employees likely to leave, aligning closely with our key project objectives.

##### Conclusion
In conclusion, the Gradient Boosting model emerges as the top choice for our attrition prediction task, emphasizing the minimization of false negatives. Adopting this model empowers businesses to preemptively pinpoint and retain employees prone to exiting, ultimately bolstering employee retention rates.

In [175]:
#Save the cleaned data (to_csv returns None when given a path, so the result is not assigned)
data.to_csv('cleaned_data.csv', index=False)

#### SAVE THE MODEL


In [None]:
#Save the best model and threshold using joblib
joblib.dump((best_gradient_boosting_model, best_gradient_threshold), 'best_gb_model_and_threshold.pkl')

In [None]:
#Save the encoder
joblib.dump(encoder, 'Model/encoder.joblib')
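
Since the model and its threshold are saved together as a tuple, they need to be unpacked in the same order when loaded. The sketch below shows the round trip with a small stand-in model and a temporary file; the toy model, data, and file name are hypothetical and only mirror the structure of the saved artifact.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data (not the project pipeline)
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model_and_threshold.pkl')

    # Save the (model, threshold) tuple, then load it back in the same order
    joblib.dump((toy_model, 0.2), path)
    loaded_model, loaded_threshold = joblib.load(path)

# Apply the stored threshold to the predicted probability of the positive class
proba = loaded_model.predict_proba(X_toy)[:, 1]
flagged = (proba > loaded_threshold).astype(int)

print(flagged.sum(), 'of', len(flagged), 'rows flagged as likely to leave')
```

For the project artifacts, the saved encoder could then map predictions back to labels with `encoder.inverse_transform(flagged)`.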