In human resources (HR), attrition is a reduction in the workforce caused by retirement or resignation. It is a serious problem for organizations around the world because it is economically damaging: replacement employees must be hired and then trained, both at a cost. High attrition rates also damage a company's brand value.

The dataset belongs to a very fast-growing company that has seen several employees leave in the last 3 years. The company’s HR team has always been reactive to attrition, but it now wants to be proactive and predict employee attrition using the data it has in hand.

Goal: Predict whether an employee will leave the company based on the variables given in the dataset.

In [31]:
import pandas as pd

# Load the datasets
train_data = pd.read_csv('dataset/Train_MLB.csv')
test_data = pd.read_csv('dataset/Test_MLB.csv')

# Display the first few rows of each dataset to understand their structure
train_data_head = train_data.head()
test_data_head = test_data.head()

# Print general information such as data types and missing values.
# Note: DataFrame.info() prints its summary and returns None, so these two names hold None.
train_data_info = train_data.info()
test_data_info = test_data.info()

train_data_head, test_data_head, train_data_info, test_data_info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4180 entries, 0 to 4179
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   EmployeeID          4180 non-null   int64  
 1   Attrition           4180 non-null   int64  
 2   Age                 3924 non-null   float64
 3   TravelProfile       4180 non-null   object 
 4   Department          4076 non-null   object 
 5   HomeToWork          3965 non-null   float64
 6   EducationField      4180 non-null   object 
 7   Gender              4146 non-null   object 
 8   HourlnWeek          3955 non-null   float64
 9   Involvement         4180 non-null   int64  
 10  WorkLifeBalance     4180 non-null   int64  
 11  Designation         4144 non-null   object 
 12  JobSatisfaction     4180 non-null   int64  
 13  ESOPs               4180 non-null   int64  
 14  NumCompaniesWorked  4180 non-null   int64  
 15  OverTime            4180 non-null   int64  
 16  Salary

(   EmployeeID  Attrition   Age TravelProfile Department  HomeToWork  \
 0     5110001          0  35.0        Rarely  Analytics         5.0   
 1     5110002          1  32.0           Yes      Sales         5.0   
 2     5110003          0  31.0        Rarely  Analytics         5.0   
 3     5110004          0  34.0           Yes      Sales        10.0   
 4     5110005          0  37.0            No  Analytics        27.0   
 
   EducationField  Gender  HourlnWeek  Involvement  ...  JobSatisfaction ESOPs  \
 0             CA    Male        69.0            1  ...                1     1   
 1     Statistics  Female        62.0            4  ...                2     0   
 2     Statistics       F        45.0            5  ...                2     1   
 3     Statistics  Female        32.0            3  ...                4     1   
 4     Statistics  Female        49.0            3  ...                4     1   
 
    NumCompaniesWorked  OverTime  SalaryHikelastYear  WorkExperience  \
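
The summary above shows missing values in several columns (for example Age, Department, HomeToWork and Gender), and the preview shows Gender recorded as both "Female" and "F". A minimal sketch of how these two issues could be checked and cleaned is given below. The miniature frame and the `gender_map` dictionary are illustrative assumptions, not part of the dataset; in particular, treating "F" as "Female" is a guess that should be confirmed against the full data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mirroring a few columns from the output above
sample = pd.DataFrame({
    "Age": [35.0, np.nan, 31.0, 34.0],
    "Gender": ["Male", "Female", "F", None],
    "HomeToWork": [5.0, 5.0, np.nan, 10.0],
})

# Count missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# Assumed mapping from abbreviated labels to the full labels
gender_map = {"F": "Female", "M": "Male"}
sample["Gender"] = sample["Gender"].replace(gender_map)

# Remaining distinct labels, ignoring missing entries
gender_levels = sorted(sample["Gender"].dropna().unique())
print(gender_levels)
```

Without this kind of normalization, one-hot encoding would treat "F" and "Female" as two separate categories.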


In [32]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# Separate target from training data
X = train_data.drop(columns=['EmployeeID','Attrition'])
y = train_data['Attrition']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns

# Preprocessing for numerical data: Impute missing values and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data: Impute missing values and one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a pipeline with the preprocessor and a classifier
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=42))])

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred)

print("Accuracy :",accuracy)
print("Classification Report :\n",report)

Accuracy : 0.9772727272727273
Classification Report :
               precision    recall  f1-score   support

           0       0.97      1.00      0.98       626
           1       0.98      0.92      0.95       210

    accuracy                           0.98       836
   macro avg       0.98      0.96      0.97       836
weighted avg       0.98      0.98      0.98       836
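
The support column shows 626 employees in class 0 and 210 in class 1, so a model that always predicts 0 would already reach about 0.75 accuracy (626 / 836) on this split. The sketch below shows one way to put the score in context: a majority-class baseline plus stratified 5-fold cross-validation scored with F1. It runs on synthetic data from `make_classification` as a stand-in (an assumption), and the names `X_demo`, `y_demo` and `clf` are illustrative; with the real data, the full preprocessing pipeline would take the place of the bare classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): roughly 75% class 0, 25% class 1
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, weights=[0.75, 0.25], random_state=42
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
f1_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")

print("Majority-class baseline accuracy:", round(baseline_accuracy, 3))
print("F1 per fold:", np.round(f1_scores, 3))
print("Mean F1:", round(f1_scores.mean(), 3))
```

Averaging over several folds gives a less split-dependent estimate than a single 80/20 hold-out, and F1 on the positive class is less flattered by class imbalance than accuracy.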



In [33]:
# Prepare test data by dropping 'EmployeeID' for prediction and saving it for the final output
test_employee_ids = test_data['EmployeeID']
X_test = test_data.drop(columns=['EmployeeID'])

# Make predictions on the test set
test_predictions = model.predict(X_test)

# Create a DataFrame in the required format
submission = pd.DataFrame({
    'EmployeeID': test_employee_ids,
    'Attrition': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)
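
Before a submission file is shared, it can help to run a few sanity checks on it. The sketch below is one possible approach under stated assumptions: `check_submission` is a hypothetical helper, not part of pandas or of this notebook, and the expected format (two columns, one row per test employee, labels 0 or 1) is inferred from the code above rather than from an official specification.

```python
import pandas as pd

def check_submission(df, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(df.columns) != ["EmployeeID", "Attrition"]:
        problems.append("unexpected columns")
        return problems
    if df["EmployeeID"].duplicated().any():
        problems.append("duplicate EmployeeID values")
    if set(df["EmployeeID"]) != set(expected_ids):
        problems.append("EmployeeID values do not match the test set")
    if df["Attrition"].isna().any():
        problems.append("missing predictions")
    if not set(df["Attrition"].dropna().unique()) <= {0, 1}:
        problems.append("labels outside {0, 1}")
    return problems

# Small illustrative frames; the IDs are only examples
ids = [5114181, 5114182, 5114183]
good = pd.DataFrame({"EmployeeID": ids, "Attrition": [0, 0, 1]})
bad = pd.DataFrame({"EmployeeID": [5114181, 5114181, 5114183], "Attrition": [0, 2, 1]})

print(check_submission(good, ids))  # an empty list means no problems were found
print(check_submission(bad, ids))
```

In the notebook, the call would look like `check_submission(submission, test_data['EmployeeID'])`.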

In [34]:
# Reuse the submission DataFrame built in the previous cell and print it to verify the format
# (pandas abbreviates long frames, showing only the first and last rows)
print(submission)

     EmployeeID  Attrition
0       5114181          0
1       5114182          0
2       5114183          1
3       5114184          0
4       5114185          0
..          ...        ...
995     5115176          0
996     5115177          0
997     5115178          0
998     5115179          0
999     5115180          0

[1000 rows x 2 columns]
