In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
import joblib


In the above cell, we import all the models and dependencies required for this analysis.

In [48]:
df = pd.read_csv('HR.csv')  # read_csv already returns a DataFrame
df.describe().T  # not rendered: only a cell's last expression is displayed
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   sales                  14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


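In this cell, we load the data from a CSV file and inspect it with df.info(). Every column has 14999 non-null values, but 'sales' and 'salary' have the object dtype rather than float or integer, so they must be encoded before a scikit-learn model can be fitted on them.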
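As a rough illustration of that check, the non-numeric columns can also be listed programmatically instead of being read off the df.info() output. The sketch below uses a tiny hand-made frame named toy; its values are invented for illustration and are not taken from HR.csv, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration only.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "number_project": [2, 5, 7],
    "sales": ["sales", "IT", "hr"],
    "salary": ["low", "medium", "high"],
})

# Columns that are not numeric need encoding before model fitting.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Count missing values per column and show only the columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])
```

On the real data, the same two calls could be applied to df directly.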

In [49]:
df = pd.get_dummies(df, columns=['sales','salary'], drop_first=True)
X = df.drop('left', axis=1)
y = df['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


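In this cell, we one-hot encode the object columns 'sales' and 'salary' with pd.get_dummies(); drop_first=True drops the first category of each column, since it is implied by the remaining ones. We then separate the target column 'left' from the features and split the data into training and test sets in a 70:30 ratio using the train_test_split() function.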
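To make the effect of drop_first=True and the 70:30 split concrete, here is a minimal sketch on a ten-row toy frame (invented values, not the HR data). The random_state and stratify arguments are assumptions added here for reproducibility and balanced class proportions; the notebook's own call uses neither, so its split changes from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42],
    "salary": ["low", "medium", "high", "low", "medium",
               "high", "low", "medium", "high", "low"],
    "left": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Categories are sorted (high, low, medium); drop_first removes 'high',
# which is then represented by salary_low and salary_medium both being off.
encoded = pd.get_dummies(toy, columns=["salary"], drop_first=True)
print(encoded.columns.tolist())

X = encoded.drop("left", axis=1)
y = encoded["left"]

# random_state and stratify are assumptions, not part of the original cell.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```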

In [50]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", model_accuracy)


Accuracy: 0.9917777777777778


In this cell, we train the Random Forest Classifier on the training data with its default settings. We then predict the outcomes for the test data and compare them with the actual results, which gives an accuracy of about 99.2% on this split.
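The first cell imports classification_report, but the notebook never calls it. A single accuracy figure can hide how the model does on each class, so a per-class report may be a useful complement. The sketch below shows the call on synthetic data from make_classification, which is an assumption made so the example runs on its own; in the notebook the same call would take y_test and y_pred from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the HR data), so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Precision, recall and F1 for each class, plus averages.
report = classification_report(y_test, y_pred)
print(report)
```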

In [51]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
logregAccuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", logregAccuracy)

Accuracy: 0.7844444444444445


In this cell, we train the Logistic Regression model on the training data, with max_iter raised to 1000. We then predict the outcomes for the test data and compare them with the actual results, which gives an accuracy of about 78.4% on this split.
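One possible variation, which the notebook does not do, is to standardize the features before fitting the logistic regression; this often helps the solver converge when columns are on different scales. The sketch below is a minimal example on synthetic data, and the name scaled_logreg is hypothetical. Whether it would change the accuracy on the HR data is not something this example shows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the HR data), so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)

scaled_accuracy = scaled_logreg.score(X_test, y_test)
print("Accuracy:", scaled_accuracy)
```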

In [52]:
if model_accuracy>=logregAccuracy:
    joblib.dump(model,"EmployeeTurnoverModle.joblib")
else:
    joblib.dump(logreg,"EmployeeTurnoverModle.joblib")



In this cell, we compare the accuracy of the two models and save the more accurate one (the Random Forest on a tie) so that it does not have to be retrained. The selected model is saved as the file EmployeeTurnoverModle.joblib for direct loading in the future.
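If more than two models were compared, the if/else could be replaced by a lookup over a dictionary of scores. The sketch below is one way to do that, using the two accuracies printed earlier in this notebook; the dictionary and its keys are hypothetical names introduced for illustration.

```python
# Accuracies as printed by the two training cells of this notebook.
scores = {
    "random_forest": 0.9917777777777778,
    "logistic_regression": 0.7844444444444445,
}

# max() returns the first key with the highest score, so on a tie the
# earlier entry wins, matching the >= comparison in the cell above.
best_name = max(scores, key=scores.get)
print(best_name)
```

The winning name could then be used to pick the fitted estimator to pass to joblib.dump().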

In [53]:
model = joblib.load("EmployeeTurnoverModle.joblib")
y_pred = model.predict(X_test)
mod_accuracy=accuracy_score(y_test, y_pred)
print("Accuracy:", mod_accuracy)

Accuracy: 0.9917777777777778


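In this cell, we load the EmployeeTurnoverModle.joblib file saved above, predict on the test data again, and print the model's accuracy. It matches the Random Forest accuracy printed earlier, as expected for a model restored from disk.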
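The save and load round trip can also be checked directly by comparing predictions before and after. The sketch below does this on synthetic data and writes to a temporary directory; the file name demo_model.joblib and the smaller n_estimators value are assumptions chosen to keep the example quick and self-contained.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the HR data), so the sketch is self-contained.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
original = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(original, path)
    restored = joblib.load(path)

# A restored model should give exactly the same predictions as the original.
same_predictions = np.array_equal(original.predict(X), restored.predict(X))
print(same_predictions)
```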