## Part 1: Preprocessing

In [1]:
# Import dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the data
attrition_df = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv")

# Display the first 5 rows of the DataFrame
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Check the info of the DataFrame
attrition_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   Department                1470 non-null   object
 4   DistanceFromHome          1470 non-null   int64 
 5   Education                 1470 non-null   int64 
 6   EducationField            1470 non-null   object
 7   EnvironmentSatisfaction   1470 non-null   int64 
 8   HourlyRate                1470 non-null   int64 
 9   JobInvolvement            1470 non-null   int64 
 10  JobLevel                  1470 non-null   int64 
 11  JobRole                   1470 non-null   object
 12  JobSatisfaction           1470 non-null   int64 
 13  MaritalStatus             1470 non-null   object
 14  NumCompaniesWorked      

In [3]:
# Check the distribution of the target variable
attrition_df['Attrition'].value_counts()

Attrition
No     1233
Yes     237
Name: count, dtype: int64
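
The counts above mean that only about 16% of employees left, so the target is imbalanced. As a rough sketch, the proportions and the accuracy of an "always predict No" baseline can be computed as follows. The Series here is rebuilt from the printed counts as a stand-in for `attrition_df['Attrition']`, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild a Series with the same class counts as the output above
# (an illustrative stand-in for attrition_df['Attrition']).
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")

# normalize=True returns class proportions instead of raw counts
proportions = attrition.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a baseline that always predicts the majority class
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy figure reported later is easier to judge against this baseline.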

In [4]:
# Define the features set (X) and target set (y)
X = attrition_df.drop('Attrition', axis=1)
y = attrition_df['Attrition']

# Convert target to binary values
y = y.map({'Yes': 1, 'No': 0})

In [5]:
# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [6]:
# Create a preprocessor for the features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

In [7]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
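
The split above is not stratified, so the share of leavers in the test set can drift from the overall share (the classification report in Part 3 shows 39 leavers among 294 test rows, about 13%, against about 16% overall). A minimal sketch of a stratified alternative, using synthetic labels with the same class counts rather than the real data, might look like this. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (237 positives)
y_demo = np.array([1] * 237 + [0] * 1233)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall:  {y_demo.mean():.3f}")
print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```

Applying `stratify=y` to the real split would change the test set, so the metrics reported below would no longer match.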

## Part 2: Create and Train the Model

In [8]:
# Create a pipeline with the preprocessor and a Random Forest classifier
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [9]:
# Train the model
model.fit(X_train, y_train)

## Part 3: Evaluate the Model

In [10]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [11]:
# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.99      0.93       255
           1       0.67      0.10      0.18        39

    accuracy                           0.87       294
   macro avg       0.77      0.55      0.55       294
weighted avg       0.85      0.87      0.83       294
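
To make the report more concrete: the rounded figures above are consistent with a confusion matrix of 253 true negatives, 2 false positives, 35 false negatives, and 4 true positives. This is a reconstruction from the rounded metrics, not notebook output. The sketch below rebuilds label arrays with those counts and recomputes precision and recall for class 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Labels reconstructed to match the rounded report (an assumption, not notebook output):
# 253 true negatives, 2 false positives, 35 false negatives, 4 true positives.
y_true_demo = np.array([0] * 255 + [1] * 39)
y_pred_demo = np.array([0] * 253 + [1] * 2 + [0] * 35 + [1] * 4)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

precision_1 = precision_score(y_true_demo, y_pred_demo)
recall_1 = recall_score(y_true_demo, y_pred_demo)
print(f"Precision (class 1): {precision_1:.2f}")
print(f"Recall (class 1):    {recall_1:.2f}")
```

On the real predictions, `confusion_matrix(y_test, y_pred)` would give the exact counts.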



## Part 4: Feature Importance

In [12]:
# Get feature importances from the model
importances = model.named_steps['classifier'].feature_importances_

# Print feature importances
print("Feature Importances:")
for i, importance in enumerate(importances):
    print(f"Feature {i}: {importance:.4f}")

Feature Importances:
Feature 0: 0.0670
Feature 1: 0.0547
Feature 2: 0.0253
Feature 3: 0.0308
Feature 4: 0.0586
Feature 5: 0.0297
Feature 6: 0.0325
Feature 7: 0.0314
Feature 8: 0.0467
Feature 9: 0.0443
Feature 10: 0.0047
Feature 11: 0.0273
Feature 12: 0.0363
Feature 13: 0.0582
Feature 14: 0.0306
Feature 15: 0.0284
Feature 16: 0.0470
Feature 17: 0.0385
Feature 18: 0.0330
Feature 19: 0.0358
Feature 20: 0.0172
Feature 21: 0.0107
Feature 22: 0.0102
Feature 23: 0.0095
Feature 24: 0.0095
Feature 25: 0.0091
Feature 26: 0.0106
Feature 27: 0.0046
Feature 28: 0.0084
Feature 29: 0.0042
Feature 30: 0.0130
Feature 31: 0.0023
Feature 32: 0.0028
Feature 33: 0.0014
Feature 34: 0.0069
Feature 35: 0.0087
Feature 36: 0.0143
Feature 37: 0.0089
Feature 38: 0.0242
Feature 39: 0.0629


## Part 5: Conclusion

Based on the classification report, the Random Forest model reached an accuracy of 87% on the test set. Because 255 of the 294 test rows belong to class 0, always predicting that an employee stays would also score about 87%, so accuracy alone says little here. The model identifies employees who stay (class 0) well, with a precision of 88% and a recall of 99%. It struggles to identify employees who leave (class 1): precision is 67%, but recall is only 10%, which means it catches roughly 4 of the 39 leavers in the test set.

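The feature importances are indexed by position in the preprocessor output, which lists the scaled numerical columns first and the one-hot encoded columns after them. Feature 0 (Age) has the highest importance (0.0670), followed by Feature 39 (0.0629), Feature 4 (HourlyRate, 0.0586), and Feature 13 (0.0582). Feature 39 is the last one-hot encoded column, which is likely OverTime=Yes, and Feature 13 is likely TotalWorkingYears. The truncated `info()` output does not show every column, so these two names should be confirmed with `get_feature_names_out()` on the fitted preprocessor.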

To improve the model, we could:
1. Address the class imbalance using techniques like SMOTE or class weights
2. Try different algorithms or ensemble methods
3. Perform hyperparameter tuning
4. Consider adding more features or engineering new features from the existing data
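
As a rough illustration of items 1 and 3, the sketch below combines `class_weight="balanced"` with a small grid search scored on recall. It runs on synthetic data from `make_classification` with roughly the same class ratio, not on the attrition data, and the parameter grid is an arbitrary example rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 16% positives), standing in for the attrition set
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.84, 0.16], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# class_weight="balanced" reweights classes inversely to their frequency;
# scoring="recall" makes the search optimize recall on the positive class
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    scoring="recall",
    cv=3,
)
search.fit(X_tr, y_tr)

test_recall = recall_score(y_te, search.predict(X_te))
print("Best parameters:", search.best_params_)
print(f"Test recall (class 1): {test_recall:.2f}")
```

In the notebook's pipeline, the same idea would apply to the `classifier` step, with grid keys prefixed as `classifier__n_estimators` and `classifier__max_depth`. Optimizing recall usually costs some precision, so both should be checked in the final report.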