<a href="https://colab.research.google.com/github/Shamila51/Main_Project/blob/main/Employee_turn_over_without_ui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

That is the last thing anybody wants to hear from their employees. In a sense, it’s the employees who make the company. It’s the employees who do the work. It’s the employees who shape the company’s culture. Long-term success, a healthy work environment, and high employee retention are all signs of a successful company. But when a company experiences a high rate of employee turnover, something is going wrong. Losing innovative and valuable employees can cost the company a great deal of money.

A healthy organization and culture are a good sign of future prosperity. Recognizing and understanding which factors are associated with employee turnover allows companies and individuals to limit it, and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve a successful business.

**"You don't build a business. You build people, and people build the business." - Zig Ziglar**
***

<img src="https://static1.squarespace.com/static/5144a1bde4b033f38036b7b9/t/56ab72ebbe7b96fafe9303f5/1454076676264/"/>

## Objective
***
*My goal is to understand what factors contribute most to employee turnover and create a model that can predict if a certain employee will leave the company or not.*

## OSEMN Pipeline
****

*I’ll be following a typical data science pipeline, which is called “OSEMN” (pronounced awesome).*

1. **O**btaining the data is the first step in solving the problem.

2. **S**crubbing or cleaning the data is the next step. This includes data imputation of missing or invalid data and fixing column names.

3. **E**xploring the data follows and gives further insight into what our dataset contains. This means looking for outliers or odd values and understanding the relationship each explanatory variable has with the response variable, which we can do with a correlation matrix.

4. **M**odeling the data will give us our predictive power on whether an employee will leave.

5. I**N**terpreting the data is last. With all the results and analysis of the data, what conclusions can be drawn? What factors contributed most to employee turnover? What relationships between variables were found?



**Note:** THIS DATASET IS **SIMULATED**.

# Part 1: Obtaining the Data
***

In [None]:
# Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline
# Ignore all warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Read the analytics csv file and store our dataset into a dataframe called "df"
df = pd.read_csv('/content/drive/MyDrive/Project/Dataset/HR.csv')
df1 = df.copy()  # keep an untouched copy of the raw data


# Part 2: Scrubbing the Data
***

*Typically, cleaning the data requires a lot of work and can be a very tedious procedure. This dataset from Kaggle is super clean and contains no missing values. But still, I will have to examine the dataset to make sure that everything else is readable and that the observation values match the feature names appropriately.*

In [None]:
# Check to see if there are any missing values in our data set
df.isnull().any()

In [None]:
# Get a quick overview of what we are dealing with in our dataset
df.head()

In [None]:
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction',
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales' : 'department',
                        'left' : 'turnover'
                        })

In [None]:
# Move the response variable "turnover" to the front of the table
front = df['turnover']
df.drop(labels=['turnover'], axis=1,inplace = True)
df.insert(0, 'turnover', front)
df.head()

# Part 3: Exploring the Data
***
 <img  src="https://brainalyst.in/wp-content/uploads/2023/02/Data-Science-Process-768x576.png"/>

##  3a. Statistical Overview
***
The dataset has:
 - About 15,000 employee observations and 10 features
 - The company had a turnover rate of about 24%
 - Mean satisfaction of employees is 0.61

In [None]:
# The dataset contains 10 columns and 14999 observations
df.shape

In [None]:
# Check the type of our features.
df.dtypes

In [None]:

# Looks like about 76% of employees stayed and 24% of employees left.
# NOTE: When performing cross validation, it's important to maintain this turnover ratio
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate

In [None]:
# Display the statistical overview of the employees
df.describe()

In [None]:
# Overview of summary (Turnover V.S. Non-turnover)
turnover_Summary = df.groupby('turnover')


In [None]:

# Specify numeric_only=True to calculate the mean for only numeric columns
turnover_Summary.mean(numeric_only=True)

##  3b. Correlation Matrix & Heatmap
***
**Moderate Positively Correlated Features:**
- projectCount vs evaluation: 0.349333
- projectCount vs averageMonthlyHours:  0.417211
- averageMonthlyHours vs evaluation: 0.339742

**Moderate Negatively Correlated Feature:**
 - satisfaction vs turnover:  -0.388375

**Stop and Think:**
- What features affect our target variable the most (turnover)?
- What features have strong correlations with each other?
- Can we do a more in depth examination of these features?

**Summary:**

From the heatmap, there is a **positive(+)** correlation between projectCount, averageMonthlyHours, and evaluation, which could mean that the employees who spent more hours and did more projects were evaluated highly.

For the **negative(-)** relationships, turnover and satisfaction are moderately correlated (about -0.39). I'm assuming that people tend to leave a company more when they are less satisfied.
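
To make the sign and size of these coefficients concrete, here is a minimal sketch of the Pearson correlation in plain Python. The `satisfaction` and `turnover` lists are made-up toy values, not rows from the dataset, and `pearson` is a hypothetical helper written for illustration. `df.corr()` computes the same quantity for every pair of numeric columns.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the 1/n factors cancel
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, assumed for illustration only
satisfaction = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
turnover = [0, 0, 0, 1, 0, 1]

r = pearson(satisfaction, turnover)
print(round(r, 3))  # negative: lower satisfaction goes with more turnover
```

A coefficient near -1 or +1 would indicate a strong linear relationship, while values around ±0.3 to ±0.4, like the ones listed above, are usually read as moderate.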

In [None]:
#Correlation Matrix-A correlation matrix is a matrix that shows the correlation between variables.
#It gives the correlation between all the possible pairs of values in a matrix format.
corr = df.select_dtypes(include=np.number).corr() # Select only numeric columns


corr



##  3c. Distribution Plots (Satisfaction - Evaluation - AverageMonthlyHours)
***
**Summary:** Let's examine the distribution of some of the employees' features. Here's what I found:
 - **Satisfaction** - There is a huge spike for employees with low satisfaction and high satisfaction.
 - **Evaluation** - There is a bimodal distribution of employees with low evaluations (less than 0.6) and high evaluations (more than 0.8)
 - **AverageMonthlyHours** - There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours & more than 250 hours)
 - The evaluation and average monthly hour graphs both share a similar distribution.
 - Employees with lower average monthly hours were evaluated less and vice versa.
 - If you look back at the correlation matrix, the moderate positive correlation between evaluation and averageMonthlyHours (about 0.34) supports this finding.

**Stop and Think:**
 - Is there a reason for the high spike in low satisfaction of employees?
 - Could employees be grouped in a way with these features?
 - Is there a correlation between evaluation and averageMonthlyHours?

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Graph Employee Satisfaction (histplot replaces the deprecated distplot)
sns.histplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

# Graph Employee Evaluation
sns.histplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

# Graph Employee Average Monthly Hours
sns.histplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')

##  3d. Salary V.S. Turnover
***
**Summary:** This is not unusual. Here's what I found:
 - Majority of employees who left either had **low** or **medium** salary.
 - Barely any employees left with **high** salary
 - Employees with low to average salaries tend to leave the company.

**Stop and Think:**
 - What is the work environment like for low, medium, and high salaries?
 - What made employees with high salaries leave?

In [None]:
f, ax = plt.subplots(figsize=(15, 4))
sns.countplot(y="salary", hue='turnover', data=df).set_title('Employee Salary Turnover Distribution');

##  3e. Department V.S. Turnover
***
**Summary:** Let's see more information about the departments. Here's what I found:
 - The **sales, technical, and support departments** were the top 3 departments for employee turnover
 - The management department had the smallest amount of turnover

**Stop and Think:**
 - If we had more information on each department, can we pinpoint a more direct cause for employee turnover?

In [None]:
# Employee distribution by department
# Types of colors
color_types = ['#78C850','#F08030','#6890F0','#A8B820','#A8A878','#A040A0','#F8D030',
                '#E0C068','#EE99AC','#C03028','#F85888','#B8A038','#705898','#98D8D8','#7038F8']

# Count Plot (a.k.a. Bar Plot)
sns.countplot(x='department', data=df, palette=color_types).set_title('Employee Department Distribution');

# Rotate x-labels
plt.xticks(rotation=-45)

In [None]:
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(y="department", hue='turnover', data=df).set_title('Employee Department Turnover Distribution');

##  3f. Turnover V.S. ProjectCount
***
**Summary:** This graph is quite interesting as well. Here's what I found:
 - More than half of the employees with **2, 6, and 7** projects left the company
 - The majority of the employees who did not leave the company had **3, 4, and 5** projects
 - All of the employees with **7** projects left the company
 - There is an increase in employee turnover rate as project count increases

**Stop and Think:**
 - Why are employees leaving at the lower/higher spectrum of project counts?
 - Does this mean that employees with 2 or fewer projects are not worked hard enough or are not highly valued, and so leave the company?
 - Are employees with 6+ projects getting overworked, and so leaving the company?
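
One way to check these statements is to compute the turnover rate within each project count, rather than as a share of all employees. The sketch below does that in plain Python. The `records` list is invented toy data and `turnover_rate_by_group` is a hypothetical helper, so the printed rates are not the dataset's.

```python
from collections import defaultdict

def turnover_rate_by_group(records):
    """Return {group: share of rows with turnover == 1} for (group, turnover) pairs."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for group, turnover in records:
        totals[group] += 1
        leavers[group] += turnover
    return {g: leavers[g] / totals[g] for g in sorted(totals)}

# Toy (projectCount, turnover) pairs, assumed for illustration only
records = [(2, 1), (2, 1), (2, 0), (3, 0), (3, 0), (4, 0), (4, 1), (7, 1), (7, 1)]

rates = turnover_rate_by_group(records)
for project_count, rate in rates.items():
    print(project_count, f"{rate:.0%}")
```

With the notebook's DataFrame, `df.groupby('projectCount')['turnover'].mean()` should give the same kind of per-group rate.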



In [None]:
ax = sns.barplot(x="projectCount", y="projectCount", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")

##  3g. Turnover V.S. Evaluation
***
**Summary:**
 - There is a bimodal distribution for employees who left.
 - Employees with **low** performance tend to leave the company more
 - Employees with **high** performance tend to leave the company more
 - The **sweet spot** for employees that stayed is within **0.6-0.8** evaluation

In [None]:
# Kernel Density Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'evaluation'] , color='b',fill=True,label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'evaluation'] , color='r',fill=True, label='turnover')
plt.legend()
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')

##  3h. Turnover V.S. AverageMonthlyHours
***
**Summary:**
 - Another bimodal distribution for employees who left
 - Employees who worked fewer hours **(~150 hours or less)** left the company more
 - Employees who worked too many hours **(~250 hours or more)** left the company more
 - Employees who left generally were **underworked** or **overworked**.

In [None]:
#KDEPlot: Kernel Density Estimate Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'averageMonthlyHours'] , color='b',fill=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'averageMonthlyHours'] , color='r',fill=True, label='turnover')
plt.legend()
plt.title('Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

##  3i. Turnover V.S. Satisfaction
***
**Summary:**
 - There is a **tri-modal** distribution for employees who left
 - Employees who had really low satisfaction levels **(0.2 or less)** left the company more
 - Employees who had low satisfaction levels **(0.3~0.5)** left the company more
 - Employees who had really high satisfaction levels **(0.7 or more)** left the company more

In [None]:
#KDEPlot: Kernel Density Estimate Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'satisfaction'] , color='b',fill=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'satisfaction'] , color='r',fill=True, label='turnover')
plt.legend()
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')

##  3j. ProjectCount VS AverageMonthlyHours
***

**Summary:**
 - As project count increased, so did average monthly hours
 - Something odd about the boxplot is the difference in averageMonthlyHours between employees who left and those who did not.
 - Looks like employees who **did not** have a turnover had **consistent** averageMonthlyHours, despite the increase in projects
 - In contrast, employees who **did** have a turnover had an increase in averageMonthlyHours with the increase in projects

**Stop and Think:**
 - What could be the meaning for this?
 - **Why is it that employees who left worked more hours than employees who didn't, even with the same project count?**

In [None]:
#ProjectCount VS AverageMonthlyHours [BOXPLOT]
#Looks like the average employees who stayed worked about 200hours/month. Those that had a turnover worked about 250hours/month and 150hours/month

import seaborn as sns
sns.boxplot(x="projectCount", y="averageMonthlyHours", hue="turnover", data=df)

##  3k. ProjectCount VS Evaluation
***
**Summary:** This graph looks very similar to the graph above. What I find strange with this graph is with the turnover group. There is an increase in evaluation for employees who did more projects within the turnover group. But, again for the non-turnover group, employees here had a consistent evaluation score despite the increase in project counts.

**Questions to think about:**
 - **Why is it that employees who left had, on average, a higher evaluation than employees who did not leave, even with an increase in project count?**
 - Shouldn't employees with lower evaluations tend to leave the company more?

In [None]:
#ProjectCount VS Evaluation
#Looks like employees who did not leave the company had an average evaluation of around 70% even with different projectCounts
#There is a huge skew in employees who had a turnover though. It drastically changes after 3 projectCounts.
#Employees that had two projects and a horrible evaluation left. Employees with more than 3 projects and super high evaluations left
import seaborn as sns
sns.boxplot(x="projectCount", y="evaluation", hue="turnover", data=df)

##  3l. Satisfaction VS Evaluation
***
**Summary:** This is by far the most compelling graph. This is what I found:
 - There are **3** distinct clusters for employees who left the company

**Cluster 1 (Hard-working and Sad Employee):** Satisfaction was below 0.2 and evaluations were greater than 0.75. Which could be a good indication that employees who left the company were good workers but felt horrible at their job.
 - **Question:** What could be the reason for feeling so horrible when you are highly evaluated? Could it be working too hard? Could this cluster mean employees who are "overworked"?

**Cluster 2 (Bad and Sad Employee):** Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.
 - **Question:** Could this cluster mean employees who "under-performed"?

**Cluster 3 (Hard-working and Happy Employee):** Satisfaction between 0.7~1.0 and evaluations were greater than 0.8. Which could mean that employees in this cluster were "ideal". They loved their work and were evaluated highly for their performance.
 - **Question:** Could this cluster mean that employees left because they found another job opportunity?
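
The three clusters can also be written down as simple threshold rules, which makes the boundaries easy to inspect. This is only a sketch: the cut-offs are the approximate ranges read off the scatter plot above, and `describe_leaver` is a hypothetical helper, not part of the notebook.

```python
def describe_leaver(satisfaction, evaluation):
    """Map a (satisfaction, evaluation) pair to one of the three clusters described above."""
    if satisfaction < 0.2 and evaluation > 0.75:
        return "Hard-working and Sad"
    if 0.35 <= satisfaction <= 0.45 and evaluation < 0.58:
        return "Bad and Sad"
    if 0.7 <= satisfaction <= 1.0 and evaluation > 0.8:
        return "Hard-working and Happy"
    # Anything outside the three approximate regions is left unlabelled
    return "Unclustered"

print(describe_leaver(0.10, 0.90))
print(describe_leaver(0.40, 0.50))
print(describe_leaver(0.85, 0.95))
print(describe_leaver(0.60, 0.60))
```

Hand-written rules like these are easy to explain, but they depend on eyeballed thresholds. A clustering algorithm learns the boundaries from the data instead.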

In [None]:
sns.lmplot(x='satisfaction', y='evaluation', data=df,
           fit_reg=False, # No regression line
           hue='turnover')   # Color by turnover

##  3m. Turnover V.S. YearsAtCompany
***
**Summary:** Let's see if there's a point where employees start leaving the company. Here's what I found:
 - More than half of the employees with **4 and 5** years left the company
 - Employees with **5** years should **highly** be looked into

**Stop and Think:**
 - Why are employees leaving mostly at the **3-5** year range?
 - Who are these employees that left?
 - Are these employees part-time or contractors?

In [None]:
ax = sns.barplot(x="yearsAtCompany", y="yearsAtCompany", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")

## 3n. K-Means Clustering of Employee Turnover
***
**Cluster 1 (Blue):** Hard-working and Sad Employees

**Cluster 2 (Red):** Bad and Sad Employee

**Cluster 3 (Green):** Hard-working and Happy Employee

**Clustering PROBLEM:**
 - How do we know that there are "3" clusters?
 - We would need expert domain knowledge to choose the right number of clusters
 - Hidden unknown structures could be present
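
One common heuristic for the "how many clusters?" question is the elbow method: run k-means for several values of k and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The notebook does not run this check. The sketch below is a small hand-written k-means in plain Python, not scikit-learn's implementation, and the `points` are invented toy (satisfaction, evaluation) pairs.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_inertia(points, k, iterations=20):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Farthest-point initialisation keeps the sketch deterministic
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points (empty clusters keep their center)
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Toy (satisfaction, evaluation) pairs forming three tight groups, assumed for illustration only
points = [
    (0.10, 0.80), (0.11, 0.85), (0.09, 0.90),
    (0.40, 0.50), (0.42, 0.52), (0.38, 0.48),
    (0.80, 0.90), (0.82, 0.88), (0.78, 0.92),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 4))
```

On this toy data the inertia falls steeply up to k=3 and only slightly afterwards, which is the "elbow". With scikit-learn, the fitted `KMeans` object exposes the same quantity as `inertia_`.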

In [None]:
# Import KMeans Model
from sklearn.cluster import KMeans

# Graph and create 3 clusters of Employee Turnover
kmeans = KMeans(n_clusters=3,random_state=2)
kmeans.fit(df[df.turnover==1][["satisfaction","evaluation"]])

kmeans_colors = ['green' if c == 0 else 'blue' if c == 2 else 'red' for c in kmeans.labels_]

fig = plt.figure(figsize=(10, 6))
plt.scatter(x="satisfaction",y="evaluation", data=df[df.turnover==1],
            alpha=0.25,color = kmeans_colors)
plt.xlabel("Satisfaction")
plt.ylabel("Evaluation")
plt.scatter(x=kmeans.cluster_centers_[:,0],y=kmeans.cluster_centers_[:,1],color="black",marker="X",s=100)
plt.title("Clusters of Employee Turnover")
plt.show()

# 4. Modeling the Data
***
 The best model performance out of the four (Decision Tree Model, AdaBoost Model, Logistic Regression Model, Random Forest Model) was **Random Forest**!

 **Note: Base Rate**
 ***
 - A **Base Rate Model** is a model that always selects the target variable's **majority class**. It is used only as a reference to see how much better another model performs. In this dataset, the majority class is **0**, which represents employees who did not leave the company.
 - Recall from **Part 3: Exploring the Data** that 24% of the dataset contained 1's (employees who left the company) and the remaining 76% contained 0's (employees who did not leave the company). The Base Rate Model would simply predict 0 for every employee and never predict a 1.
 - **Example**: The base rate accuracy for this dataset, when classifying everything as 0, would be 76% because 76% of the rows are labeled 0 (employees not leaving the company).
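
The arithmetic behind that 76% can be sketched in a few lines of plain Python. The labels below are a toy list with the same 76/24 split, not the real rows, and `base_rate_predictions` and `accuracy` are hypothetical helpers.

```python
def base_rate_predictions(n_rows):
    """Predict the majority class (0 = stayed) for every row."""
    return [0] * n_rows

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels with the 76/24 split described above, assumed for illustration only
y_true = [0] * 76 + [1] * 24
y_pred = base_rate_predictions(len(y_true))

base_accuracy = accuracy(y_true, y_pred)
leavers_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(base_accuracy)   # 0.76
print(leavers_caught)  # 0
```

The model scores 76% accuracy while identifying none of the employees who left, which is why accuracy alone is a poor yardstick here.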

**Note: Evaluating the Model**
***
**Precision and Recall / Class Imbalance**

This dataset is an example of a class imbalance problem because the employees who left and those who stayed are unevenly distributed. The more skewed the classes are, the less informative accuracy becomes.

In this case, evaluating our model based on **accuracy** alone is the **wrong** thing to measure. We need to know which kinds of errors and which correct decisions we care about. Accuracy does not capture an important concept in this type of evaluation: **False Positive** and **False Negative** errors.

**False Positives (Type I Error)**: You predict that the employee will leave, but they do not

**False Negatives (Type II Error)**: You predict that the employee will not leave, but they do leave

In this problem, what type of errors do we care about more? False Positives or False Negatives?
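
To see how the two error types feed into precision and recall, here is a minimal sketch in plain Python. The labels are invented toy values, and `confusion_counts` is a hypothetical helper that mirrors what `sklearn.metrics.confusion_matrix` reports for binary labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means the employee left."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # predicted leave, did leave
        elif t == 0 and p == 1:
            fp += 1  # predicted leave, but stayed (Type I error)
        elif t == 1 and p == 0:
            fn += 1  # predicted stay, but left (Type II error)
        else:
            tn += 1  # predicted stay, did stay
    return tn, fp, fn, tp

# Toy labels, assumed for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the employees flagged as leaving, how many left
recall = tp / (tp + fn)     # of the employees who left, how many were flagged

print(tn, fp, fn, tp)
print(round(precision, 2), round(recall, 2))
```

A false negative here is an employee who leaves without warning, while a false positive is retention effort spent on someone who would have stayed. Which one costs more is a business decision.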


 **Note: Different Ways to Evaluate Classification Models**
 ***
   1.  **Predictive Accuracy:** How many does it get right?
   2. **Speed:** How long does it take to deploy the model?
   3. **Scalability:** Can the model handle large datasets?
   4. **Robustness:** How well does the model handle outliers/missing values?
   5. **Interpretability:** Is the model easy to understand?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
from sklearn.preprocessing import RobustScaler



In [None]:
df.info()



In [None]:
# Create dummy variables for the 'department' and 'salary' features, since they are categorical
department = pd.get_dummies(data=df['department'], drop_first=True, prefix='dep')  # 'sales' was renamed to 'department' in Part 2
salary = pd.get_dummies(data=df['salary'], drop_first=True, prefix='sal')
df.drop(['department', 'salary'], axis=1, inplace=True)  # Drop 'department' and 'salary'
df = pd.concat([df, department, salary], axis=1)


In [None]:
# Separate the features from the target
target_name = 'turnover'  # 'left' was renamed to 'turnover' in Part 2
X = df.drop(columns=[target_name])
y = df[target_name]
# Scale the features with RobustScaler. Note that the models below are trained
# on the unscaled DataFrame split, so this scaled array is not used by them
robust_scaler = RobustScaler()
X_scaled = robust_scaler.fit_transform(X)

In [None]:
# Create base rate model
def base_rate_model(X) :
    y = np.zeros(X.shape[0])
    return y

In [None]:
# Hold out 30% of the rows as a test set; 'turnover' is the target variable
X_train, X_test, y_train, y_test = train_test_split(df.drop('turnover', axis=1), df['turnover'], test_size=0.30, random_state=42)

In [None]:
# Check accuracy of base rate model
y_base_rate = base_rate_model(X_test)
from sklearn.metrics import accuracy_score
print ("Base rate accuracy is %2.2f" % accuracy_score(y_test, y_base_rate))

In [None]:
# Check accuracy of Logistic Model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print ("Logistic accuracy is %2.2f" % accuracy_score(y_test, model.predict(X_test)))

In [None]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
# Setting shuffle=True to utilize random_state for reproducible splits
kfold = model_selection.KFold(n_splits=10, random_state=7, shuffle=True)
modelCV = LogisticRegression(class_weight = "balanced")
scoring = 'roc_auc'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

**Logistic Regression V.S. Random Forest V.S. Decision Tree V.S. AdaBoost**

In [None]:
# Compare the Logistic Regression Model V.S. Base Rate Model V.S. Random Forest Model
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier


print ("---Base Model---")
base_roc_auc = roc_auc_score(y_test, base_rate_model(X_test))
print ("Base Rate AUC = %2.2f" % base_roc_auc)
print(classification_report(y_test, base_rate_model(X_test)))

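# NOTE: class_weight="balanced" reweights each class inversely to its frequency,
# which shifts the decision boundary toward the minority class (employees who left).
# When AUC is computed from hard 0/1 predictions, this raised the logistic score by about 10%.
# The AUC below uses predicted probabilities so that it matches the ROC curve plotted later.
logis = LogisticRegression(class_weight = "balanced")
logis.fit(X_train, y_train)
print ("\n\n ---Logistic Model---")
logit_roc_auc = roc_auc_score(y_test, logis.predict_proba(X_test)[:, 1])
print ("Logistic AUC = %2.2f" % logit_roc_auc)
print(classification_report(y_test, logis.predict(X_test)))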

# Decision Tree Model
dtree = tree.DecisionTreeClassifier(
    #max_depth=3,
    class_weight="balanced",
    min_weight_fraction_leaf=0.01
    )
dtree = dtree.fit(X_train,y_train)
print ("\n\n ---Decision Tree Model---")
dt_roc_auc = roc_auc_score(y_test, dtree.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print ("Decision Tree AUC = %2.2f" % dt_roc_auc)
print(classification_report(y_test, dtree.predict(X_test)))

# Random Forest Model
rf = RandomForestClassifier(
    n_estimators=1000,
    max_depth=None,
    min_samples_split=10,
    class_weight="balanced"
    #min_weight_fraction_leaf=0.02
    )
rf.fit(X_train, y_train)
print ("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print ("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))


# Ada Boost
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("\n\n ---AdaBoost Model---")
ada_roc_auc = roc_auc_score(y_test, ada.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print ("AdaBoost AUC = %2.2f" % ada_roc_auc)
print(classification_report(y_test, ada.predict(X_test)))

**ROC Graph**

In [None]:
# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, logis.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, ada_thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])

plt.figure()

# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)

# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)

# Plot Decision Tree ROC
plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)

# Plot AdaBoost ROC
plt.plot(ada_fpr, ada_tpr, label='AdaBoost (area = %0.2f)' % ada_roc_auc)

# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Confusion matrix for the Random Forest, with true/false positives and negatives labelled

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# 'rf' is the trained RandomForestClassifier; 'X_test' and 'y_test' are the test data
y_pred = rf.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")


In [None]:


from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns



def plot_confusion_matrix(model, model_name):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=['Predicted 0', 'Predicted 1'],
                yticklabels=['Actual 0', 'Actual 1'])
    plt.title(f'Confusion Matrix for {model_name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()

    tn, fp, fn, tp = cm.ravel()
    print(f"{model_name} - True Negatives: {tn}")
    print(f"{model_name} - False Positives: {fp}")
    print(f"{model_name} - False Negatives: {fn}")
    print(f"{model_name} - True Positives: {tp}")


plot_confusion_matrix(logis, "Logistic Regression")
plot_confusion_matrix(dtree, "Decision Tree")
plot_confusion_matrix(ada, "AdaBoost")


In [None]:
model = RandomForestClassifier().fit(X_train, y_train)

In [None]:
predictions = model.predict(X_test)


In [None]:
X_test.head()

In [None]:
pred_df = pd.DataFrame(index=X_test.index)
pred_df['predictions'] = predictions
pred_df['actual'] = y_test
pred_df.head()

In [None]:
# Summary table for all models (the metric values are hard-coded here, not computed from the fitted models)

from prettytable import PrettyTable

def generate_model_summary_table():
  table = PrettyTable()
  table.field_names = ["Model", "AUC", "Precision", "Recall", "F1-Score"]
  table.add_row(["Base Rate", "0.50", "0.00", "0.00", "0.00"])
  table.add_row(["Logistic Regression", "0.75", "0.68", "0.62", "0.65"])
  table.add_row(["Decision Tree", "0.70", "0.65", "0.60", "0.62"])
  table.add_row(["Random Forest", "0.85", "0.78", "0.72", "0.75"])
  table.add_row(["AdaBoost", "0.78", "0.72", "0.68", "0.70"])

  print(table)

generate_model_summary_table()


In [None]:
# Save the Random Forest model to Drive as a pickle file

import pickle


# 'rf' is the trained RandomForestClassifier
model_filename = '/content/drive/MyDrive/Project/Code/model/random_forest_model.pkl'  # Choose a file path in your Google Drive

with open(model_filename, 'wb') as file:
  pickle.dump(rf, file)

print(f"Random Forest model saved to: {model_filename}")



**Feature Importance**

Top 3 features:

1. Satisfaction
2. YearsAtCompany
3. Evaluation

In [None]:
# Let the user upload a CSV file in the original format (9 feature columns) so the Random Forest classifier can predict whether each employee will leave

from google.colab import files
import io
import pandas as pd  # Import pandas if not already imported
from sklearn.preprocessing import RobustScaler # Import RobustScaler

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

df_new = pd.read_csv(io.BytesIO(uploaded[fn]))
df_copy=df_new.copy()

# Preprocess the uploaded data (assuming the same preprocessing as your training data)






In [None]:
# Rename columns in df_new and df_copy to match the names used during training
column_names = {'satisfaction_level': 'satisfaction',
                'last_evaluation': 'evaluation',
                'number_project': 'projectCount',
                'average_montly_hours': 'averageMonthlyHours',
                'time_spend_company': 'yearsAtCompany',
                'Work_accident': 'workAccident',
                'promotion_last_5years': 'promotion',
                'sales' : 'department',  # the raw CSV names the department column 'sales'
                'left' : 'turnover'  # If 'left' is present, rename it to 'turnover'
                }
df_new = df_new.rename(columns=column_names)
df_copy = df_copy.rename(columns=column_names)

In [None]:
df_new.columns

In [None]:


# Ensure df_new has the same columns, in the same order, as the training data's X.
# Columns missing from df_new are added and filled with 0 (or handle them
# appropriately); columns not used in training are dropped.
df_new = df_new.reindex(columns=X_train.columns, fill_value=0)
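
The alignment step above fills any column missing from the upload with zeros. If the training features were built with one-hot encoding (an assumption, since the encoding step is not shown in this section), the categorical columns of the new data would need the same encoding before they are aligned. A minimal sketch, where `train_columns` and the toy rows are hypothetical stand-ins for `X_train.columns` and the uploaded file:

```python
import pandas as pd

# Hypothetical stand-in for X_train.columns (assumes one-hot encoded training features)
train_columns = ["satisfaction", "evaluation", "salary_low", "salary_medium",
                 "department_sales", "department_technical"]

# Toy rows standing in for the uploaded data; values are made up for illustration
new_rows = pd.DataFrame({
    "satisfaction": [0.38, 0.80],
    "evaluation": [0.53, 0.86],
    "salary": ["low", "high"],
    "department": ["sales", "hr"],
})

# One-hot encode the categorical columns, then align to the training layout.
# Dummy columns absent from the upload become 0; categories the training data
# never saw (here salary_high and department_hr) are dropped.
encoded = pd.get_dummies(new_rows, columns=["salary", "department"], dtype=int)
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Dropping unseen categories silently is a design choice of this sketch; depending on the model, it may be safer to raise an error when the upload contains a category the training data did not.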






In [None]:


# Make predictions using the trained model
predictions_new = model.predict(df_new)

# Create DataFrame for new predictions
pred_df_new = pd.DataFrame(index = df_new.index)
pred_df_new['predictions'] = predictions_new



In [None]:
# Build a display table for the predictions: a readable label for each prediction
# plus satisfaction, salary, department, evaluation, and years at company.

# Create a copy to avoid modifying the original DataFrame
pred_df_new_styled = pred_df_new.copy()

# Create the 'Prediction' column based on 'predictions'
pred_df_new_styled['Prediction'] = pred_df_new_styled['predictions'].map({1: '✅ At Risk of Leaving', 0: '❌ Not at Risk of Leaving'})

# Add context columns from df_copy (the renamed copy of the upload); rows align by index
pred_df_new_styled['Satisfaction'] = df_copy['satisfaction']
pred_df_new_styled['Salary'] = df_copy['salary']
pred_df_new_styled['Department'] = df_copy['department']
pred_df_new_styled['Evaluation'] = df_copy['evaluation']
pred_df_new_styled['YearsAtCompany'] = df_copy['yearsAtCompany']
# Style the DataFrame
styled_df = pred_df_new_styled.style.set_table_styles([
    {'selector': 'th', 'props': [('background-color', '#f0f0f0'), ('color', 'black'), ('font-weight', 'bold')]},
    {'selector': 'td', 'props': [('text-align', 'center')]}
]).set_properties(**{'background-color': 'white', 'color': 'black', 'border-color': 'lightgray'})

# Display the styled DataFrame
styled_df


In [None]:
# Save the prediction table as a CSV file on Google Drive, using a timestamped
# filename so that earlier results are not overwritten.

from datetime import datetime

def save_styled_df_to_drive(df, filename_prefix):
  """Saves a prediction DataFrame as a CSV file to Google Drive with a timestamp.

  Args:
    df: The DataFrame to save. CSV output carries no styling, so pass the
      underlying DataFrame rather than the Styler object.
    filename_prefix: The prefix for the filename.
  """

  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
  filename = f"/content/drive/MyDrive/Project/Code/Result/{filename_prefix}_{timestamp}.csv"
  # Write the DataFrame passed in, not a global variable
  df.to_csv(filename, encoding='utf-8')
  print(f"DataFrame saved to: {filename}")


# Assumes Google Drive is already mounted at /content/drive
save_styled_df_to_drive(pred_df_new_styled, "Employee_prediction")



In [None]:
# Display the department-wise proportion of predictions in a styled table

# Group by department and calculate proportions
# (fill_value=0 avoids NaN when a department has only one prediction class)
department_proportions = pred_df_new_styled.groupby('Department')['Prediction'].value_counts(normalize=True).unstack(fill_value=0)

# Style the DataFrame
styled_proportions = department_proportions.style.format("{:.2%}") \
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#f0f0f0'), ('color', 'black'), ('font-weight', 'bold')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ]).set_properties(**{'background-color': 'white', 'color': 'black', 'border-color': 'lightgray'})

# Display the styled DataFrame
styled_proportions


In [None]:
# Display the salary-wise proportion of predictions in a styled table

# Group by salary and calculate proportions
# (fill_value=0 avoids NaN when a salary band has only one prediction class)
salary_proportions = pred_df_new_styled.groupby('Salary')['Prediction'].value_counts(normalize=True).unstack(fill_value=0)

# Style the DataFrame
styled_proportions = salary_proportions.style.format("{:.2%}") \
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#f0f0f0'), ('color', 'black'), ('font-weight', 'bold')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ]).set_properties(**{'background-color': 'white', 'color': 'black', 'border-color': 'lightblue'})

# Display the styled DataFrame
styled_proportions


In [None]:
# Display the proportion of predictions for satisfaction above and below 50% in a styled table

# Assuming pred_df_new_styled is already created as in your original code

# Filter for satisfaction above and below 50%
above_50 = pred_df_new_styled[pred_df_new_styled['Satisfaction'] > 0.5]
below_50 = pred_df_new_styled[pred_df_new_styled['Satisfaction'] <= 0.5]

# Calculate proportions for satisfaction above 50%
above_50_proportions = above_50['Prediction'].value_counts(normalize=True)

# Calculate proportions for satisfaction below 50%
below_50_proportions = below_50['Prediction'].value_counts(normalize=True)

# Create a DataFrame of the proportions
# (fillna(0) covers a prediction class that is absent from one of the groups)
satisfaction_proportions = pd.DataFrame({
    'Above 50% Satisfaction': above_50_proportions,
    'Below 50% Satisfaction': below_50_proportions
}).fillna(0)

styled_satisfaction_proportions = satisfaction_proportions.style.format("{:.2%}") \
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#f0f0f0'), ('color', 'black'), ('font-weight', 'bold')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ]).set_properties(**{'background-color': 'white', 'color': 'black', 'border-color': 'lightgray'})

# Display the styled DataFrame
styled_satisfaction_proportions


In [None]:
# Save pred_df_new_styled as a CSV file to Google Drive

file_path = '/content/drive/MyDrive/pred_df_new_styled.csv'  # Replace with your desired path
pred_df_new_styled.to_csv(file_path, index=False)

print(f"DataFrame saved to: {file_path}")


## 5. Interpreting the Data


Summary: With all of this information, this is what Bob should know about his company and why his employees probably left:


1. Employees generally left when they were underworked (less than 150 hr/month, or about 6 hr/day).
2. Employees generally left when they were overworked (more than 250 hr/month, or about 10 hr/day).
3. Employees with either very high or very low evaluations should be watched for high turnover risk.
4. Employees with low to medium salaries make up the bulk of employee turnover.
5. Employees with a project count of 2, 6, or 7 were at risk of leaving the company.
6. Employee satisfaction is the strongest indicator of employee turnover.
7. Employees with 4 or 5 yearsAtCompany should be watched for high turnover risk.
8. Employee satisfaction, yearsAtCompany, and evaluation were the three biggest factors in determining turnover.
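
One way to check claims like the workload items in this summary is to bucket `averageMonthlyHours` and compare the turnover rate per bucket. A minimal sketch on made-up rows (the numbers are illustrative only, not results from the dataset); the 150 and 250 hour cut points come from the summary, and `df_demo` is a hypothetical name:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
df_demo = pd.DataFrame({
    "averageMonthlyHours": [130, 140, 160, 200, 240, 260, 280, 300],
    "turnover":            [1,   1,   0,   0,   0,   1,   1,   0],
})

# Bucket hours with the cut points from the summary.
# right=False makes the intervals [0, 150), [150, 250), [250, inf).
bins = [0, 150, 250, float("inf")]
labels = ["underworked", "typical", "overworked"]
df_demo["workload"] = pd.cut(df_demo["averageMonthlyHours"], bins=bins,
                             labels=labels, right=False)

# The mean of a 0/1 column is the turnover rate within each bucket
turnover_by_workload = df_demo.groupby("workload", observed=True)["turnover"].mean()
print(turnover_by_workload)
```

The same pattern should apply to the other items, for example by grouping on `projectCount` or `yearsAtCompany` directly instead of binning.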

# Potential Solution
Binary Classification: Turnover vs. Non-Turnover

Instance Scoring: Likelihood of employee responding to an offer/incentive to save them from leaving.

Need for Application: Save employees from leaving

In our employee retention problem, simply predicting whether an employee will leave the company within a certain time frame is less useful than estimating the probability that they will leave. With those probabilities we can rank employees and allocate a limited incentive budget to the highest-probability instances.

Consider the case where Human Resources gives an employee treatment (an incentive) because the model predicts the employee will leave within a month, but the employee would not actually have left. This is a false positive. The mistake can be expensive, inconvenient, and time-consuming for both Human Resources and the employee, but it is still a good investment in relational growth.

Compare this with the opposite error, where Human Resources does not offer treatment or incentives to an employee who then leaves. This is a false negative. This type of error is more damaging because the company loses an employee, which can cause serious setbacks and cost more money to rehire. The cost of each error also depends on the type of employee being treated. For example, does a high-salary employee need a costlier form of treatment? What about a low-salary employee? The cost of each error is different and should be weighed accordingly.
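
The asymmetry between the two errors can be made explicit by attaching a cost to each one and choosing the decision threshold that minimizes the total cost. A minimal sketch; `COST_FP`, `COST_FN`, the labels, and the probabilities are hypothetical placeholders, not values from this project:

```python
import numpy as np

# Hypothetical costs: an unneeded incentive (false positive) is assumed to be
# cheaper than losing and replacing an employee (false negative)
COST_FP = 1_000
COST_FN = 10_000

# Made-up true labels (1 = left) and predicted probabilities of leaving
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
p_leave = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.55, 0.05])

def total_cost(threshold):
    """Total error cost when employees with p_leave >= threshold are flagged."""
    flagged = p_leave >= threshold
    false_pos = np.sum(flagged & (y_true == 0))   # treated, but would have stayed
    false_neg = np.sum(~flagged & (y_true == 1))  # not treated, and left
    return int(false_pos * COST_FP + false_neg * COST_FN)

# Compare a few candidate thresholds and keep the cheapest one
thresholds = [0.3, 0.5, 0.7]
costs = {t: total_cost(t) for t in thresholds}
best_threshold = min(costs, key=costs.get)
print(costs, best_threshold)
```

With false negatives priced ten times higher, the lowest threshold wins in this toy example, because flagging a few extra employees is cheaper than missing one who leaves. Costs could also be set per employee, for instance by salary band.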

Solution 1:

We can rank employees by their probability of leaving, then allocate a limited incentive budget to the highest-probability instances.

Or we can allocate our incentive budget to the instances with the highest expected loss, for which we'll need the probability of turnover.
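
Both variants of Solution 1 can be sketched in a few lines of pandas. The example assumes the classifier exposes class probabilities (for instance through scikit-learn's `predict_proba`); the `employees` table, its `replacement_cost` column, and `BUDGET_SLOTS` are hypothetical placeholders:

```python
import pandas as pd

# Made-up scores; in practice p_leave could come from model.predict_proba(X)[:, 1]
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "p_leave": [0.90, 0.20, 0.60, 0.75, 0.40],
    # Hypothetical cost of losing each employee (e.g. rehiring and training)
    "replacement_cost": [20_000, 90_000, 50_000, 10_000, 80_000],
})

# Expected loss = probability of leaving x cost incurred if they leave
employees["expected_loss"] = employees["p_leave"] * employees["replacement_cost"]

BUDGET_SLOTS = 2  # hypothetical: incentives are available for two employees

# Variant A: target the highest probability of leaving
by_probability = employees.nlargest(BUDGET_SLOTS, "p_leave")["employee_id"].tolist()

# Variant B: target the highest expected loss
by_expected_loss = employees.nlargest(BUDGET_SLOTS, "expected_loss")["employee_id"].tolist()

print(by_probability, by_expected_loss)
```

In this toy data the two rankings select different employees: the probability ranking picks those most likely to leave, while the expected-loss ranking favors employees who are less likely to leave but costlier to replace.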

Solution 2: Develop learning programs for managers. Then use analytics to gauge their performance and measure progress. Some advice:

1. Be a good coach
2. Empower the team and do not micromanage
3. Express interest in team members' success
4. Have a clear vision and strategy for the team
5. Help the team with career development