# Employee Attrition Prediction Project
This project aims to predict whether an employee is at risk of leaving the company based on their attributes. By predicting employee attrition, we can help management take proactive measures to reduce turnover, ensuring talent retention and improved company culture. We use a dataset from IBM HR Analytics, applying machine learning techniques to build a classification model that provides actionable insights for HR strategy.

## Project Objectives
1. Load and explore the dataset.
2. Perform data cleaning and preprocessing.
3. Build and evaluate a machine learning model.
4. Visualise the results to understand key factors affecting attrition.

## Step 1: Importing Libraries
We begin by importing the libraries needed for data manipulation, visualisation, and model building. pandas and scikit-learn handle the data and train the model, while Seaborn and Matplotlib produce the visualisations.

In [20]:
# Import necessary libraries for data handling, visualization, and machine learning

import pandas as pd  # For data handling and manipulation
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For more attractive visualizations
from sklearn.model_selection import train_test_split  # For splitting the dataset into training and test sets
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from sklearn.linear_model import LogisticRegression  # For building the classification model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve  # For model evaluation

# Set some default styling options for plots
sns.set(style="whitegrid")

## Step 2: Loading the Dataset
In this step, we load the IBM HR Analytics Employee Attrition dataset into a pandas DataFrame. This dataset includes information such as job role, satisfaction level, salary, and other features that could influence an employee's decision to stay or leave the company. By understanding these attributes, we can predict employee attrition effectively.

In [None]:
# Load the dataset into a DataFrame

# Note: The dataset file must be uploaded first.
df = pd.read_csv('/content/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Display the first five rows of the dataset
df.head()

## Step 3: Exploratory Data Analysis (EDA)
To understand the dataset thoroughly, we begin by checking its structure, identifying any missing values, and analysing descriptive statistics of the numerical columns. We also visualise the distribution of attrition to understand the balance between employees who stayed and those who left.

In [None]:
# Check the structure of the dataset
print("Dataset Information:\n")
df.info()

# Check for any missing values in the dataset
print("\nMissing Values Count:\n")
print(df.isnull().sum())

# Display a statistical summary of the numerical columns
print("\nStatistical Summary:\n")
print(df.describe())

# Count of employees who left the company vs. stayed
sns.countplot(x='Attrition', data=df)
plt.title("Employee Attrition Count")
plt.show()
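
The count plot shows the class balance visually; the same information can be read off as numbers. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the IBM dataset), assuming the target column holds `Yes`/`No` strings as it does before encoding.

```python
import pandas as pd

# Toy stand-in for df['Attrition']; these values are invented for illustration
toy = pd.DataFrame({"Attrition": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# Absolute counts per class
counts = toy["Attrition"].value_counts()

# Relative share per class (fractions that sum to 1)
shares = toy["Attrition"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real DataFrame, the same two calls on `df['Attrition']` would show how skewed the target is, which matters when interpreting accuracy later.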

## Step 4: Data Cleaning and Preprocessing
Before building a machine learning model, the categorical columns must be converted to numeric form. We do this with scikit-learn's `LabelEncoder`, which maps each category to an integer code. The codes are assigned in sorted order of the category labels, so for nominal columns such as `JobRole` the numeric order carries no real meaning.

In [None]:
# Encode categorical variables to numerical ones using LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# List of categorical columns to encode
categorical_cols = ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']

# Apply encoding to each column
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])

# Display the updated dataset
df.head()
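
Reusing one `LabelEncoder` across columns works, but after each `fit_transform` the encoder only remembers the most recent column. A possible variation, sketched here on a toy frame with invented values, keeps one encoder per column so the integer codes can be mapped back to labels. The `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with two categorical columns (values invented for illustration)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "OverTime": ["Yes", "No", "No", "Yes"],
})

# Keep a separate fitted encoder for every column (hypothetical helper dict)
encoders = {}
for col in ["Gender", "OverTime"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels; position i corresponds to code i
print(list(encoders["Gender"].classes_))

# Recover the original labels from the integer codes
gender_labels = encoders["Gender"].inverse_transform(toy["Gender"])
print(list(gender_labels))
```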

## Step 5: Feature Selection and Data Splitting
In this step, we define the target variable (`Attrition`) and split the dataset into training and test sets. This allows us to train the model on one part of the data and evaluate it on unseen data, which gives an estimate of how well the model generalises. We also remove the `Over18` column, which is not informative.

In [None]:
# Define the target variable 'Attrition' and the feature variables 'X'
target = 'Attrition'
X = df.drop(columns=[target])
y = df[target]

# Split the dataset into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Set: {X_train.shape}, Test Set: {X_test.shape}")

# Drop the Over18 column since it is not informative
X_train = X_train.drop(columns=['Over18'])
X_test = X_test.drop(columns=['Over18'])

# Identify columns with object (string) data types
categorical_cols = X_train.select_dtypes(include=['object']).columns
print(f"Categorical columns: {list(categorical_cols)}")
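
When the target classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. One option, shown here as a sketch on synthetic data rather than as the notebook's method, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 20 positives and 80 negatives (invented for illustration)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 20 + [0] * 80)

# stratify=y_toy keeps the 20/80 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```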

## Step 6: One-Hot Encoding and Scaling
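To prepare any remaining categorical variables for modelling, we apply One-Hot Encoding with `pd.get_dummies` and then align the training and test columns so both sets share the same layout. Columns that were already label-encoded in Step 4 are numeric and pass through unchanged. We then scale the features using `StandardScaler`, fitted on the training set only. Scaling puts the features on a comparable scale, which matters for regularised models like Logistic Regression and helps the solver converge.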

In [None]:
# One-Hot Encode the categorical columns in X_train and X_test
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)

# Ensure X_train and X_test have the same columns after encoding
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

# Display a summary to ensure encoding worked
print("Shape of X_train after encoding:", X_train_encoded.shape)
print("Shape of X_test after encoding:", X_test_encoded.shape)

# Save the column names for later use (since scaling removes them)
original_columns = X_train_encoded.columns

# Scale the features
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and test sets
X_train_encoded = scaler.fit_transform(X_train_encoded)
X_test_encoded = scaler.transform(X_test_encoded)
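
The encode, align, and scale steps can also be expressed with scikit-learn's own transformers. The sketch below is an alternative, not what this notebook does: it assumes a `ColumnTransformer` with `OneHotEncoder(handle_unknown="ignore")`, which learns the categories from the training data and encodes unseen test categories as all zeros. The toy column names and values are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test frames; "Legal" appears only in the test set
train = pd.DataFrame({"Department": ["Sales", "R&D", "Sales", "HR"],
                      "Age": [30, 41, 25, 38]})
test = pd.DataFrame({"Department": ["R&D", "Legal"],
                     "Age": [35, 29]})

# One-hot encode the categorical column and standardise the numeric one.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
        ("num", StandardScaler(), ["Age"]),
    ],
    sparse_threshold=0,
)

# Fit on the training data only, then apply the same mapping to the test data
train_arr = preprocess.fit_transform(train)
test_arr = preprocess.transform(test)

print(train_arr.shape, test_arr.shape)
print(test_arr[1, :3])  # one-hot part of the unseen "Legal" row
```

Because the transformer is fitted once on the training set, the test set always ends up with the same columns, so no separate alignment step is needed.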

## Step 7: Building the Logistic Regression Model
We will build a Logistic Regression model to predict employee attrition. Logistic Regression is a simple yet effective baseline model for binary classification tasks.

In [30]:
# Model Building - Logistic Regression
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model with the liblinear solver and a raised iteration cap
model = LogisticRegression(max_iter=5000, solver='liblinear')

# Fit the model to the training data
model.fit(X_train_encoded, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test_encoded)

## Step 8: Model Evaluation
After training the model, we evaluate its performance using accuracy, a confusion matrix, and a classification report. The confusion matrix shows the counts of true and false positives and negatives, while the classification report gives precision, recall, and F1-score for each class. If the classes are imbalanced, accuracy alone can be misleading, so precision and recall for the attrition class deserve particular attention. Together, these metrics indicate how well the model identifies employees at risk of attrition.

In [None]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))
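
Step 1 imports `roc_auc_score` and `roc_curve`, which work on predicted probabilities rather than hard class labels. The following sketch shows how they might be used; it trains a small model on synthetic data (invented for illustration) so that it runs on its own. In the notebook, the corresponding inputs would be `y_test` and `model.predict_proba(X_test_encoded)[:, 1]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic binary problem: the first feature drives the label, plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Train on the first 150 rows, evaluate on the remaining 50
clf = LogisticRegression(solver="liblinear").fit(X_toy[:150], y_toy[:150])

# Probability of the positive class for each held-out row
proba = clf.predict_proba(X_toy[150:])[:, 1]

# Area under the ROC curve: 0.5 is chance level, 1.0 is perfect ranking
auc = roc_auc_score(y_toy[150:], proba)

# False/true positive rates at each decision threshold, ready for plotting
fpr, tpr, thresholds = roc_curve(y_toy[150:], proba)

print(f"ROC AUC: {auc:.2f}")
```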

## Step 9: Feature Importance
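After training the model, we analyse which features had the most impact on predicting employee attrition. Because the features were standardised, the absolute value of each coefficient gives a rough measure of that feature's influence on the prediction. Understanding feature importance helps us identify key drivers behind an employee's decision to leave, which gives HR a starting point for improving retention strategies.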

In [None]:
# Visualizing the importance of features using the model coefficients
feature_importance = pd.Series(abs(model.coef_[0]), index=original_columns)
feature_importance = feature_importance.sort_values(ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importance.head(10), y=feature_importance.head(10).index)
plt.title("Top 10 Important Features Affecting Employee Attrition")
plt.xlabel("Absolute Coefficient Value (Importance)")
plt.ylabel("Feature")
plt.show()
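
Taking the absolute value hides the direction of each effect. As a sketch on synthetic data (the feature names `raises_risk` and `lowers_risk` are hypothetical), the signed coefficients can be ranked by magnitude while keeping their sign: a positive coefficient pushes the prediction towards the positive class, a negative one away from it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature increases the odds of class 1, the other decreases them
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 2)),
                     columns=["raises_risk", "lowers_risk"])
signal = 2.0 * X_toy["raises_risk"] - 2.0 * X_toy["lowers_risk"]
y_toy = (signal + rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# Signed coefficients, labelled with the feature names
coefs = pd.Series(clf.coef_[0], index=X_toy.columns)

# Order by magnitude but keep the sign so the direction stays visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```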

## Summary and Conclusion

In this project, we have built a machine learning model to predict employee attrition using the IBM HR Analytics dataset. We explored the dataset, cleaned it, encoded categorical variables, and built a logistic regression model to predict whether an employee might leave.

Our model provided insights into the key factors that affect employee attrition, which can help HR departments make informed decisions to reduce turnover.

### Future Work
- **Model Improvement**: We could try other machine learning models, such as Decision Trees or ensemble methods like Random Forest, to see if we can achieve better accuracy.
- **Feature Engineering**: We could create new features to provide deeper insights and improve model accuracy.
- **Deployment**: The final model could be deployed as a web service to provide real-time predictions for employee attrition.
