# Employee Attrition Prediction

## 1. Introduction
This notebook analyzes the IBM HR Analytics dataset to predict employee attrition. The project involves loading and cleaning the data, performing exploratory data analysis to understand the key drivers of attrition, and building a classification model to predict whether an employee will leave the company.

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('employee_attrition.csv')

# Display the first few rows
df.head()

In [None]:
# Get a summary of the dataframe
df.info()

## 3. Data Cleaning and Preprocessing

In [None]:
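# Check for missing values; print the total, since a bare expression
# is only displayed when it is the last statement in a cell
print('Total missing values:', df.isnull().sum().sum())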

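# Convert categorical (object) columns to integer codes.
# LabelEncoder numbers the labels in sorted order, so the codes of nominal
# columns carry no real ordering. This also encodes the 'Attrition' target.
from sklearn.preprocessing import LabelEncoder
for column in df.columns:
    if df[column].dtype == 'object':
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])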

df.head()
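
The label encoding above gives every text column integer codes, which implies an ordering that nominal columns do not have. A possible alternative, sketched here on a toy frame, is to map the target explicitly and one-hot encode the remaining text columns. The toy column names and values (`Department`, `Age`) and the helper `encode_frame` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; columns and values are illustrative.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "R&D", "HR", "R&D"],
    "Age": [29, 41, 35, 52],
})


def encode_frame(frame, target="Attrition"):
    """Map a Yes/No target to 1/0 and one-hot encode the other non-numeric columns."""
    out = frame.copy()
    out[target] = out[target].map({"No": 0, "Yes": 1})
    # After the mapping the target is numeric, so only feature columns remain here.
    nominal = out.select_dtypes(exclude="number").columns.tolist()
    return pd.get_dummies(out, columns=nominal, dtype=int)


encoded = encode_frame(toy)
print(encoded.columns.tolist())
```

One-hot encoding widens the frame by one column per category, so it suits columns with few distinct values.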

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Attrition distribution
sns.countplot(x='Attrition', data=df)
plt.title('Distribution of Employee Attrition')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(20, 18))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
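
A full heatmap with this many columns can be hard to read. One way to focus on the target is to rank features by the absolute value of their correlation with `Attrition`. The sketch below uses synthetic data, and the helper `rank_by_target_correlation` and the columns `signal` and `noise` are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic numeric frame: 'signal' drives the target, 'noise' does not.
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=n),
    "Attrition": (signal + 0.3 * rng.normal(size=n) > 0).astype(int),
})


def rank_by_target_correlation(frame, target="Attrition"):
    """Return each feature's correlation with the target, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranked = rank_by_target_correlation(toy)
print(ranked)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature being useful to a tree-based model.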

## 5. Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define features and target
X = df.drop('Attrition', axis=1)
y = df['Attrition']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
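
Since the goal includes understanding the drivers of attrition, a fitted random forest can be inspected through its `feature_importances_` attribute. The sketch below trains on synthetic data; `X_toy`, `y_toy`, and the column names are assumptions made for illustration. On the real data, `model` and `X_train.columns` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic features: the target depends on 'signal' only.
X_toy = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_toy, y_toy)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure.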

## 6. Model Evaluation

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('\nClassification Report:')
print(classification_report(y_test, y_pred))
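print('\nConfusion Matrix:')
# Rows are true labels and columns are predicted labels
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()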
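
On an imbalanced target, accuracy is easier to interpret next to a majority-class baseline and threshold-independent scores. The following sketch computes these on a synthetic imbalanced dataset from `make_classification`. The data, the 85/15 class split, and the variable names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 85% negative, 15% positive).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Accuracy of always predicting the more frequent class.
positive_share = y_te.mean()
baseline_accuracy = max(positive_share, 1 - positive_share)

# Probability of the positive class, used by the ranking metrics.
proba = clf.predict_proba(X_te)[:, 1]

metrics = {
    "baseline_accuracy": baseline_accuracy,
    "accuracy": accuracy_score(y_te, clf.predict(X_te)),
    "roc_auc": roc_auc_score(y_te, proba),
    "average_precision": average_precision_score(y_te, proba),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

If the model's accuracy is only slightly above the baseline, the ROC AUC and average precision give a clearer picture of how well it separates the two classes.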

## 7. Conclusion
The Random Forest model achieves high overall accuracy on the held-out test set, but accuracy alone can overstate performance on this dataset: because far fewer employees leave than stay, a model can score well by mostly predicting the majority class. The classification report reflects this, showing that the model identifies employees who stay more reliably than those who leave. Further improvements could come from handling the class imbalance explicitly, for example by oversampling the minority class with SMOTE.
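
SMOTE is provided by the separate imbalanced-learn package. As a simpler stand-in that needs only scikit-learn, the sketch below uses plain random oversampling, which duplicates minority rows instead of synthesizing new ones as SMOTE does. The helper `oversample_minority` and the toy data are hypothetical. Any resampling of this kind should be applied to the training split only, so that the test set keeps the original class balance.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample


def oversample_minority(X, y, random_state=42):
    """Duplicate minority-class rows at random until both classes are the same size."""
    counts = y.value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_extra = int(counts[majority] - counts[minority])
    if n_extra == 0:
        return X, y  # already balanced
    extra = resample(
        X[y == minority], replace=True, n_samples=n_extra, random_state=random_state
    )
    X_bal = pd.concat([X, extra])
    y_bal = pd.concat([y, pd.Series(minority, index=extra.index, name=y.name)])
    return X_bal, y_bal


# Toy training data: 8 negatives and 2 positives (illustrative values only).
X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series([0] * 8 + [1] * 2, name="Attrition")

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(y_bal.value_counts().to_dict())
```

The balanced frames can then be passed to `model.fit` in place of the original training split.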