<a href="https://colab.research.google.com/github/TMQ5/Business-Analytics-Nanodegree-Program/blob/main/People%20Analytics/MNC%20Comany/HR_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Employee Data Analysis (People Analytics)

## Introduction

In this project, we will analyze employee data using the **HR Employee Analytics** dataset from [Kaggle](https://www.kaggle.com/datasets/kmldas/hr-employee-data-descriptive-analytics). We aim to answer two questions: one about employee performance and one about employee retention.


## Dataset Columns
The dataset contains the following columns:

- **Emp_Id**:
  - Employee ID, a unique identifier for each employee.

- **satisfaction_level**:
  - Satisfaction Level, ranges between 0 and 1, representing the employee's satisfaction with their job.

- **last_evaluation**:
  - Last Evaluation, the score of the last evaluation for the employee, ranging between 0 and 1.

- **number_project**:
  - Number of Projects, the number of projects the employee has worked on.

- **average_montly_hours**:
  - Average Monthly Hours, the average number of hours the employee worked per month.

- **time_spend_company**:
  - Time Spent in Company, the number of years the employee has been with the company.

- **Work_accident**:
  - Work Accident, a binary value (0 or 1) indicating whether the employee had a work accident.

- **left**:
  - Left, a binary value (0 or 1) indicating whether the employee left the company.

- **promotion_last_5years**:
  - Promotion in Last 5 Years, a binary value (0 or 1) indicating whether the employee was promoted in the last 5 years.

- **Department**:
  - Department, the department in which the employee works.

- **salary**:
  - Salary, the salary level (Low, Medium, High).




## The Six Steps of Data Analysis According to the Google Data Analytics Professional Certificate Methodology

1. **Ask:** Define the key questions we want to answer through the analysis.
2. **Prepare:** Gather and clean the data to ensure it is ready for analysis.
3. **Process:** Process the data to remove missing and duplicate values and convert textual data to numerical formats.
4. **Analyze:** Use various analytical techniques to answer the posed questions.
5. **Share:** Present the results through reports, dashboards, and presentations.
6. **Act:** Implement recommendations based on the analysis and monitor their impact.


# *1st Step: ASK*

This project answers two questions:
1. **Employee Performance Analysis:** What are the key factors affecting employee performance, and how can we enhance these factors to improve productivity?
2. **Improving Employee Retention Rate:** What are the main factors affecting employee retention in the company, and how can we improve these rates?


# *2nd Step: PREPARE*

## Importing Libraries and Loading Data

Let's start by importing the necessary libraries and loading the data.


In [113]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


In [114]:
# Load the data
file_url = 'https://github.com/TMQ5/my_projects/raw/main/People%20Analytics/MNC%20Comany/HR_Employee_Data.xlsx'
data = pd.read_excel(file_url)

In [115]:
# Display the initial data
data.head()

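|   | Emp_Id | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|----------|------|------|---|-----|---|---|---|---|-------|--------|
| 0 | IND02438 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | IND28133 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | IND07164 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | IND30478 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | IND24003 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |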


# *3rd Step: PROCESS*

In [116]:
# Display summary of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp_Id                 14999 non-null  object 
 1   satisfaction_level     14999 non-null  float64
 2   last_evaluation        14999 non-null  float64
 3   number_project         14999 non-null  int64  
 4   average_montly_hours   14999 non-null  int64  
 5   time_spend_company     14999 non-null  int64  
 6   Work_accident          14999 non-null  int64  
 7   left                   14999 non-null  int64  
 8   promotion_last_5years  14999 non-null  int64  
 9   Department             14999 non-null  object 
 10  salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(3)
memory usage: 1.3+ MB


In [117]:
# Check for missing values
data.isnull().sum()

Emp_Id                   0
satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
Department               0
salary                   0
dtype: int64
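The Process step also mentions duplicate values, which the missing-value check does not cover. A minimal sketch of a duplicate check follows, assuming a small hypothetical table (`sample`) in place of the real `data` frame; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the HR table; the last row repeats the first.
sample = pd.DataFrame({
    "Emp_Id": ["IND001", "IND002", "IND003", "IND001"],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.38],
    "left": [1, 1, 1, 1],
})

# Rows that repeat an earlier row in every column
full_row_duplicates = int(sample.duplicated().sum())

# Employee IDs that appear more than once
repeated_ids = int(sample["Emp_Id"].duplicated().sum())

# Keep the first occurrence of each fully duplicated row
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(full_row_duplicates, repeated_ids, len(deduplicated))
```

Running the same two checks on `data` would show whether any rows or employee IDs are repeated before the analysis continues.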

## Data Overview

The dataset consists of 14,999 rows and 11 columns. There are no missing values in the dataset, and all data types are correct.

| Column                   | Data Type | Description                                    |
|--------------------------|-----------|------------------------------------------------|
| `Emp_Id`                 | object    | Employee ID (text)                             |
| `satisfaction_level`     | float64   | Employee satisfaction level (numeric)          |
| `last_evaluation`        | float64   | Last evaluation score (numeric)                |
| `number_project`         | int64     | Number of projects worked on (integer)         |
| `average_montly_hours`   | int64     | Average monthly hours worked (integer)         |
| `time_spend_company`     | int64     | Number of years spent in the company (integer) |
| `Work_accident`          | int64     | Whether the employee had a work accident (integer, 0 or 1) |
| `left`                   | int64     | Whether the employee left the company (integer, 0 or 1) |
| `promotion_last_5years`  | int64     | Whether the employee was promoted in the last 5 years (integer, 0 or 1) |
| `Department`             | object    | Department name (text)                         |
| `salary`                 | object    | Salary level ('low', 'medium', 'high')         |

The `salary` column is correctly stored as text. To facilitate analysis, we will keep the original text values and add a new column, `salary_numeric`, that holds an ordinal numeric encoding. This lets us use either the text or the numeric values, depending on the analysis.

- `salary`: Original text values ('low', 'medium', 'high').
- `salary_numeric`: Encoded numeric values (1 for 'low', 2 for 'medium', 3 for 'high').

This dual-column approach provides flexibility, enabling us to leverage the clarity of text values in descriptive analyses and the computational efficiency of numeric values in statistical modeling and machine learning algorithms.


## Retaining Original Text Column and Creating Encoded Column




In [118]:
# Create an encoded version of the salary column
data['salary_numeric'] = data['salary'].map({'low': 1, 'medium': 2, 'high': 3})
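One thing worth checking with `map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below illustrates this on a hypothetical series (`salary`) that contains one differently cased label; the normalization step is an assumption about how such labels might be handled, not something the dataset is known to need.

```python
import pandas as pd

# Hypothetical salary labels, including one the mapping does not cover
salary = pd.Series(["low", "medium", "high", "Medium"])

salary_levels = {"low": 1, "medium": 2, "high": 3}
encoded = salary.map(salary_levels)

# Labels missing from the dictionary become NaN
unmapped = salary[encoded.isna()].unique().tolist()
print(unmapped)

# Normalizing whitespace and case before mapping avoids the gap
encoded_clean = salary.str.strip().str.lower().map(salary_levels)
print(encoded_clean.tolist())
```

If the list of unmapped labels is empty for the real column, the encoded column is complete and stays an integer type.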

In [119]:
# Display summary of the data again to verify changes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp_Id                 14999 non-null  object 
 1   satisfaction_level     14999 non-null  float64
 2   last_evaluation        14999 non-null  float64
 3   number_project         14999 non-null  int64  
 4   average_montly_hours   14999 non-null  int64  
 5   time_spend_company     14999 non-null  int64  
 6   Work_accident          14999 non-null  int64  
 7   left                   14999 non-null  int64  
 8   promotion_last_5years  14999 non-null  int64  
 9   Department             14999 non-null  object 
 10  salary                 14999 non-null  object 
 11  salary_numeric         14999 non-null  int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 1.4+ MB


In [120]:
# Display some descriptive statistics
data.describe()

|       | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_numeric |
|-------|----------|----------|----------|------------|----------|----------|----------|----------|----------|
| count | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.14461 | 0.238083 | 0.021268 | 1.594706 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 | 0.637183 |
| min | 0.09 | 0.36 | 2.0 | 96.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.44 | 0.56 | 3.0 | 156.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 0.64 | 0.72 | 4.0 | 200.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 75% | 0.82 | 0.87 | 5.0 | 245.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| max | 1.0 | 1.0 | 7.0 | 310.0 | 10.0 | 1.0 | 1.0 | 1.0 | 3.0 |



### Observations from the Summary Statistics:

- **satisfaction_level**: The satisfaction level ranges between 9% and 100%, with an average satisfaction level of 61%. This indicates a moderate level of overall employee satisfaction. Higher satisfaction levels are typically associated with increased productivity, lower turnover rates, and better overall workplace morale.
- **last_evaluation**: The last evaluation score ranges between 36% and 100%, with an average score of 72%. This indicates that, on average, employees perform well according to their evaluations. High evaluation scores could suggest effective performance management and a high level of employee competence.
- **number_project**: The number of projects worked on ranges between 2 and 7, with an average of 3.8 projects. This suggests that employees typically handle about 4 projects on average, indicating a moderate workload. Managing multiple projects can impact employee performance and satisfaction, depending on the support and resources available.
- **average_montly_hours**: The average monthly hours worked range between 96 and 310, with an average of 201 hours. This suggests that employees, on average, work more than the typical full-time hours. High average monthly hours could indicate a demanding work environment, which might affect employee satisfaction and work-life balance.
- **time_spend_company**: The time spent in the company ranges between 2 and 10 years, with an average of 3.5 years. This means that the average tenure recorded in the dataset is 3.5 years. A longer average tenure can indicate higher employee satisfaction and stability, while a shorter tenure may suggest issues with employee retention or job satisfaction.
- **Work_accident**: This is a binary indicator (0 or 1), so its mean of 0.14 is the share of employees with a recorded work accident: about 14%. A high accident rate may indicate the need for improved safety measures and training within the company to ensure a safer work environment.
- **left**: This is a binary indicator (0 or 1), so its mean of 0.24 means that about 24% of the employees in the dataset left the company. A high turnover rate can be indicative of issues within the company such as low job satisfaction, limited career advancement opportunities, or unfavorable working conditions.
- **promotion_last_5years**: This is a binary indicator (0 or 1), so its mean of 0.02 means that only about 2% of employees were promoted in the last 5 years. Promotions are therefore rare within the company, which could negatively affect employee motivation and retention if employees feel there are limited opportunities for advancement.
- **salary_numeric**: The salary level ranges between 1 and 3, with an average level of 1.6. This average indicates that the general salary level is closer to the lower end of the scale, with most employees having a salary level of either 1 (low) or 2 (medium).
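For the 0/1 columns, the mean reported by `describe()` is simply the proportion of ones. A small sketch on a hypothetical indicator (`left` here is made-up data, not the real column) shows the equivalence:

```python
import pandas as pd

# Hypothetical 0/1 indicator for ten employees (1 = left the company)
left = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

share_from_mean = left.mean()                      # mean of the 0/1 values
share_from_counts = (left == 1).sum() / len(left)  # explicit proportion

print(share_from_mean, share_from_counts)
```

This is why the means of `Work_accident`, `left`, and `promotion_last_5years` can be read directly as rates.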

# *4th Step: ANALYZE*


# Exploratory Data Analysis (EDA)

In [121]:
# Distribution of Satisfaction Level
fig = px.histogram(data, x='satisfaction_level', nbins=30, title='Distribution of Satisfaction Level')
fig.show()

The histogram of satisfaction level appears to have a bimodal distribution. This means there are two distinct peaks in the data, suggesting the presence of two different groups of employees with varying satisfaction levels.

- **First Peak**: A large number of employees with very low satisfaction levels, around 0.1 (10%), which may indicate issues in the work environment or company policies that lead to dissatisfaction.
- **Second Peak**: Another significant portion of employees has moderate satisfaction levels, around 0.6 (60%), indicating a group of employees who are moderately satisfied with their jobs.
- The number of employees gradually decreases after the satisfaction level of 0.7 (70%).

The bimodal distribution can provide valuable insights for the company about the presence of distinct employee groups and the areas that need improvement in the work environment to enhance overall employee satisfaction.


In [122]:
# Distribution of Last Evaluation
fig = px.histogram(data, x='last_evaluation', nbins=30, title='Distribution of Last Evaluation')
fig.show()

The histogram of last evaluation scores appears to have a bimodal distribution, indicating the presence of two distinct groups of employees based on their evaluation scores.

- **Low Scores (below 0.6)**: There are a few employees with low evaluation scores (below 60%).
- **First Peak (around 0.5)**: A noticeable increase in the number of employees with evaluation scores around 50%.
- **Medium Scores (0.6 to 0.8)**: The number of employees gradually decreases in this range.
- **Second Peak (around 0.9)**: A significant number of employees have high evaluation scores (around 90%).
- **High Scores (above 0.9)**: The number of employees with evaluation scores above 90% decreases after this point.

This bimodal distribution can provide insights into the performance levels within the company, suggesting the presence of two distinct groups: one with low to medium performance and another with high performance. The lower number of employees in the medium score range (0.6 to 0.8) may warrant further investigation to understand the underlying causes.


In [123]:
# Number of Projects Worked On
fig = px.histogram(data, x='number_project', nbins=6, title='Number of Projects Worked On')
fig.show()

The histogram of the number of projects worked on shows the distribution of how many projects employees have worked on.

- **Low Number of Projects (2 to 3 projects)**: A significant number of employees have worked on 2 or 3 projects, with a peak at 3 projects.
- **Medium Number of Projects (4 to 5 projects)**: The number of employees gradually decreases for those who have worked on 4 or 5 projects.
- **High Number of Projects (6 to 7 projects)**: Very few employees have worked on 6 or 7 projects.

This distribution suggests that many employees handle a low number of projects (2 to 3 projects), which may indicate a moderate workload or a company preference to keep the number of projects per employee low. The few employees who have worked on 6 or 7 projects may be those with higher experience or expertise.
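Because `number_project` takes only a few integer values, an exact count per value can complement the histogram, whose bars depend on how the bins are chosen. The summary statistics earlier report a median of 4 projects, so a table like this helps confirm where the peak actually lies. The sketch uses a hypothetical series (`projects`) rather than the real column.

```python
import pandas as pd

# Hypothetical project counts for a handful of employees
projects = pd.Series([2, 3, 3, 4, 4, 4, 5, 6, 7, 4])

counts = projects.value_counts().sort_index()  # employees per project count
shares = (counts / counts.sum()).round(2)      # same, as a fraction of all employees
most_common = counts.idxmax()                  # project count with the most employees

print(pd.DataFrame({"employees": counts, "share": shares}))
print("Most common number of projects:", most_common)
```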


In [124]:
# Distribution of Average Monthly Hours
fig = px.histogram(data, x='average_montly_hours', nbins=30, title='Distribution of Average Monthly Hours')
fig.show()

The histogram of average monthly hours worked shows the distribution of how many hours employees work on average per month.

- **Low Hours (below 150 hours)**: There are a few employees who work less than 150 hours per month. This may indicate part-time work or extended leave periods.
- **First Peak (around 150 hours)**: A noticeable increase in the number of employees working around 150 hours per month.
- **Medium Hours (150 to 250 hours)**: The number of employees gradually decreases in this range with another peak around 200 hours.
- **Second Peak (around 250 hours)**: Another significant number of employees work around 250 hours per month.
- **High Hours (above 250 hours)**: The number of employees working more than 250 hours per month decreases after this point.

This distribution suggests that there are clusters of employees working around 150 and 250 hours per month, indicating two distinct groups with different average work hours. The few employees working more than 250 hours per month may be experiencing high work pressure or intensive work cycles.


In [125]:
# Distribution of Time Spent in Company (years)
fig = px.histogram(data, x='time_spend_company', nbins=10, title='Distribution of Time Spent in Company (years)')
fig.show()

The histogram of time spent in the company shows a right-skewed distribution with an outlier.

- **High initial peak**: A significant number of employees have spent between 2 and 3 years in the company, forming a peak at the beginning.
- **Gradual decline**: The number of employees decreases gradually as the tenure increases.
- **Long right tail**: There are few employees who have spent a longer time in the company, creating a tail that extends to the right.
- **Outlier**: There is a noticeable outlier at 10 years, where a very small number of employees have spent this exceptionally long period in the company compared to other tenures.

This right-skewed distribution with an outlier indicates that most employees have shorter tenures, with a few employees having much longer tenures, including the outlier at 10 years.

### How to Handle the Outlier

Assuming the outlier at 10 years is genuine and reflects long-term loyal employees, we will retain it and highlight its impact on the analysis.

1. **Investigate the Outlier**: Check employee records to confirm if the 10-year tenure reflects actual continuous employment.
2. **Retain the Outlier**: Since it reflects a genuine case, we will retain it and emphasize its impact on the analysis.

### Impact of the Outlier

- **Employee Loyalty**: The outlier at 10 years may indicate a small group of highly loyal employees who have remained with the company for an exceptionally long period.
- **Analysis Consideration**: While analyzing the data, it's important to consider the influence of these long-tenured employees on overall metrics and trends.
- **Strategic Insights**: Understanding the characteristics and reasons behind the long tenure of these employees can provide valuable insights for employee retention strategies.

By retaining the outlier and acknowledging its impact, we ensure a more comprehensive and accurate analysis of employee tenure in the company.
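The outlier above was identified visually. As an alternative sketch, the common 1.5 × IQR rule can flag unusually long tenures numerically. This rule is an assumption introduced here, not the method used in this notebook, and `tenure` is hypothetical data.

```python
import pandas as pd

# Hypothetical tenures in years, including one long-tenured employee
tenure = pd.Series([2, 2, 3, 3, 3, 3, 4, 4, 5, 10])

q1, q3 = tenure.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged

flagged = tenure[tenure > upper_fence]
print(upper_fence, flagged.tolist())
```

With the quartiles from the summary statistics (3 and 4 years), the same rule gives an upper fence of 5.5 years, so it would flag every tenure of 6 years or more rather than only the 10-year group. That is one more reason to treat these values as genuine long tenures and not as errors.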


In [126]:
# Distribution of Work Accidents
fig = px.histogram(data, x='Work_accident', nbins=2, title='Distribution of Work Accidents')
fig.show()

The histogram shows the distribution of work accidents, which is a binary variable (it takes only the values 0 and 1).

- **X-axis**: Represents the binary values for work accidents:
  - 0: No work accident
  - 1: Work accident occurred
- **Y-axis**: Represents the count of employees for each category.

### Interpretation:

- **Value 0**: The large number of employees who did not experience any work accidents. Most employees fall into this category.
- **Value 1**: The smaller number of employees who experienced work accidents. This category has significantly fewer employees compared to the first category.

This binary distribution indicates that the majority of employees did not experience work accidents, while a minority did.


In [127]:
# Employee Turnover
fig = px.histogram(data, x='left', nbins=2, title='Employee Turnover')
fig.show()


The histogram shows the distribution of employee turnover, which is a binary variable (it takes only the values 0 and 1).

- **X-axis**: Represents the binary values for employee turnover:
  - 0: Employee did not leave the company
  - 1: Employee left the company
- **Y-axis**: Represents the count of employees for each category.

### Interpretation:

- **Value 0**: The large number of employees who did not leave the company. Most employees fall into this category.
- **Value 1**: The smaller number of employees who left the company. This category has significantly fewer employees compared to the first category.

This binary distribution indicates that the majority of employees did not leave the company, while a minority did.


In [128]:
# Promotions in Last 5 Years
fig = px.histogram(data, x='promotion_last_5years', nbins=2, title='Promotions in Last 5 Years')
fig.show()


The histogram shows the distribution of promotions in the last 5 years, which is a binary distribution.

- **X-axis**: Represents the binary values for promotions:
  - 0: Employee did not receive a promotion in the last 5 years
  - 1: Employee received a promotion in the last 5 years
- **Y-axis**: Represents the count of employees for each category.

### Interpretation:

- **Value 0**: The large number of employees who did not receive any promotions in the last 5 years. Most employees fall into this category.
- **Value 1**: The smaller number of employees who received promotions in the last 5 years. This category has significantly fewer employees compared to the first category.

This binary distribution indicates that the majority of employees did not receive promotions in the last 5 years, while a minority did.


In [129]:
# Distribution of Salary Levels
fig = px.histogram(data, x='salary_numeric', nbins=3, title='Distribution of Salary Levels')
fig.show()


The histogram shows the distribution of salary levels, which is a categorical distribution.

- **X-axis**: Represents the numeric values for salary levels:
  - 1: Low salary
  - 2: Medium salary
  - 3: High salary
- **Y-axis**: Represents the count of employees for each category.

### Interpretation:

- **Low Salary (1)**: The largest category. More employees receive low salaries than medium or high salaries.
- **Medium Salary (2)**: Fewer employees than in the low category.
- **High Salary (3)**: The very small number of employees who receive high salaries. This category has the fewest employees among the three.

This categorical distribution indicates that low is the most common salary level and high the least common. Note that the median of `salary_numeric` in the summary statistics is 2, so employees with low salaries make up fewer than half of the workforce rather than an outright majority.


In [130]:
# Importing machine learning libraries and functions
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score

# Answering the Questions


We will use the correlation matrix to see how the variables relate to one another and to identify the factors relevant to the two business questions posed at the beginning.

## 1. Job Performance Analysis:
#### **Business Question:** What factors most influence employee performance, and how can we enhance these factors to improve productivity?


In [131]:
# Calculate and print the average performance score
avg_performance = data['last_evaluation'].mean() * 100
print(f"Average Performance Score = {avg_performance:.2f}%")

Average Performance Score = 71.61%




The average performance score for the employees, based on the last evaluation, is calculated to be 71.61%. This indicates that, on average, employees have received a performance evaluation score of 71.61% out of 100%.


In [132]:
# Calculate the mean performance by department
mean_performance_by_department = data.groupby('Department')['last_evaluation'].mean().reset_index()

# Create a bar chart to visualize the average performance by department

fig = px.bar(mean_performance_by_department, x='Department', y='last_evaluation',
             labels={'Department': 'Department', 'last_evaluation': 'Average Performance'},
             title='Average Performance by Department')

# Update the layout of the figure
fig.update_layout(
    title_text='Average Performance by Department',
    xaxis_title='Department',
    yaxis_title='Average Performance'
)

# Display the figure
fig.show()



The bar chart shows the average employee performance by department. The horizontal axis (X-axis) represents the different departments in the company, while the vertical axis (Y-axis) represents the average last evaluation score (Average Performance).

***Conclusion:***

1. **Similarity in Average Performance:** The chart shows that the average performance is very similar across all departments, with every department close to the overall average of about 0.72.

2. **No Significant Differences:** This similarity indicates that there are no significant differences in the average performance of employees across different departments.

3. **Impact of Department on Performance:** Based on this chart, we can conclude that the department in which an employee works is not a major factor in determining their performance level. Employee performance is similar regardless of the department they belong to.

***Summary:***

This analysis suggests that other factors such as the number of projects, average monthly working hours, time spent at the company, and personal satisfaction level might have a greater impact on employee performance compared to the department. Focusing on these factors would be more effective in improving overall employee performance.
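To put a number on how similar the departments are, the spread between the highest and lowest department mean can be computed alongside the overall mean. The sketch below uses a hypothetical frame (`df`) with the notebook's column names; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical evaluations for three departments
df = pd.DataFrame({
    "Department": ["sales", "sales", "IT", "IT", "hr", "hr"],
    "last_evaluation": [0.70, 0.76, 0.71, 0.73, 0.69, 0.73],
})

dept_means = df.groupby("Department")["last_evaluation"].mean()
spread = dept_means.max() - dept_means.min()  # gap between best and worst department
overall = df["last_evaluation"].mean()        # always lies within the department range

print(dept_means.round(3).to_dict(), round(spread, 3), round(overall, 3))
```

A small spread relative to the overall standard deviation of `last_evaluation` supports the reading that department is not a major driver of evaluation scores.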

In [133]:
# Exclude non-numeric text columns
numeric_data = data.select_dtypes(include=[np.number])

# Correlation Matrix
correlation_matrix = numeric_data.corr().round(2)

# Display the correlation matrix using Plotly
fig = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix')
fig.show()


To identify the factors that most influence employee performance, we use the `last_evaluation` variable as the performance indicator. In the correlation matrix, the variables most strongly correlated with `last_evaluation` are:

**1. Correlation between last evaluation and number of projects (0.35):**
> This positive correlation suggests that employees who are involved in a higher number of projects tend to receive higher performance evaluations. It indicates that being assigned more projects could be a sign of trust in the employee's capabilities and may provide more opportunities to demonstrate skills and achieve high performance.

**2. Correlation between last evaluation and average monthly hours (0.34):**
> This positive correlation means that employees who work more hours on average per month are likely to receive higher performance evaluations. This can be interpreted as those who are putting in more hours are perceived to be more dedicated or productive, thus receiving better evaluations.

**3. Correlation between last evaluation and time spent at the company (0.13):**
> This weaker positive correlation indicates that employees who have been with the company for a longer period tend to have slightly better performance evaluations. This can be due to their increased experience, familiarity with company processes, and possibly stronger relationships within the company, all contributing to better performance evaluations over time.





Based on the correlation matrix, we can derive insights into the key factors affecting employee performance. While the correlation values provide a starting point, further analysis is necessary to gain a deeper understanding of these relationships and develop actionable recommendations.
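Reading one row of the heatmap by eye can be replaced with a sorted list of correlations against the target. This is a minimal sketch on a hypothetical frame (`df`); the column names follow the dataset, the numbers do not.

```python
import pandas as pd

# Hypothetical numeric columns using the notebook's column names
df = pd.DataFrame({
    "last_evaluation":      [0.50, 0.55, 0.70, 0.85, 0.90, 0.95],
    "number_project":       [2, 2, 4, 5, 6, 6],
    "average_montly_hours": [150, 160, 200, 240, 260, 255],
    "satisfaction_level":   [0.40, 0.90, 0.60, 0.80, 0.10, 0.70],
})

target = "last_evaluation"

# Correlation of every other column with the target, strongest first (by absolute value)
corr_with_target = (
    df.corr()[target]
      .drop(target)
      .sort_values(key=abs, ascending=False)
)
print(corr_with_target.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.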




In [134]:
# Calculate the mean performance by the number of projects
mean_performance_by_project = data.groupby('number_project')['last_evaluation'].mean().reset_index()

# Create a bar chart to visualize the average performance by the number of projects
fig = px.bar(mean_performance_by_project, x='number_project', y='last_evaluation',
             labels={'number_project': 'Number of Projects', 'last_evaluation': 'Average Performance'},
             title='Average Performance by Number of Projects')

# Update the layout of the figure
fig.update_layout(
    title_text='Average Performance by Number of Projects',
    xaxis_title='Number of Projects',
    yaxis_title='Average Performance'
)

# Display the figure
fig.show()



The bar chart illustrates the average performance scores based on the number of projects worked on by employees. It shows a clear trend indicating the relationship between the number of projects and the average performance score.

***Key Points:***

- **Positive Correlation:** As the number of projects increases, the average performance score also tends to increase. This suggests a positive correlation between the number of projects and employee performance.
- **Trust and Capability:** Employees who handle more projects tend to receive higher performance evaluations. This could indicate that being assigned more projects is a sign of trust in the employee's capabilities and provides more opportunities to demonstrate skills and achieve high performance.

***Conclusion:***

The analysis indicates that the number of projects an employee works on is positively associated with their performance evaluation. Employees involved in a higher number of projects tend to have better performance scores, suggesting that increased workload, within reasonable limits, can be associated with higher performance. Because this is a correlation, it does not show that assigning more projects causes better evaluations. With that caveat, the insight can help in making informed decisions regarding project assignments and workload distribution to optimize employee performance.

In [135]:
# Calculate the mean performance for each value of average monthly hours
mean_hours_performance = data.groupby('average_montly_hours')['last_evaluation'].mean().reset_index()

# Create the scatter plot
fig = px.scatter(mean_hours_performance, x='average_montly_hours', y='last_evaluation',
                 labels={'average_montly_hours': 'Average Monthly Hours', 'last_evaluation': 'Average Performance'},
                 title='Average Performance by Average Monthly Hours')

# Customize the layout
fig.update_layout(
    title_text='Average Performance by Average Monthly Hours',
    xaxis_title='Average Monthly Hours',
    yaxis_title='Average Performance'
)

# Show the plot
fig.show()



The scatter plot shows the relationship between average performance and average monthly hours worked by employees.

***Key Points:***

- **Positive Trend:** There is a visible positive trend, indicating that as the average monthly hours increase, the average performance of employees also tends to increase. This suggests that employees who work more hours generally receive higher performance evaluations.
- **Variability:** Although there is a general upward trend, the data points show some variability, indicating that other factors may also be influencing performance.

***Conclusion:***

The analysis indicates that average monthly hours worked is positively correlated with employee performance. Employees who work more hours tend to have better performance scores. This insight suggests that increased working hours can be associated with higher performance, but it is also essential to consider other factors that might contribute to performance to get a more comprehensive understanding.
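Note that the scatter plot uses one averaged point per hours value, which smooths out differences between individual employees. The trend in such a plot can therefore look tighter than the row-level correlation of 0.34 reported in the correlation matrix. The sketch below illustrates the effect on hypothetical data (`df`).

```python
import pandas as pd

# Hypothetical rows: several employees share the same monthly hours
df = pd.DataFrame({
    "average_montly_hours": [150, 150, 150, 150, 200, 250],
    "last_evaluation":      [0.40, 0.60, 0.45, 0.55, 0.70, 0.90],
})

# Correlation across individual employees
row_level_r = df["average_montly_hours"].corr(df["last_evaluation"])

# Correlation across group means, as plotted in the scatter chart
grouped = df.groupby("average_montly_hours")["last_evaluation"].mean().reset_index()
grouped_r = grouped["average_montly_hours"].corr(grouped["last_evaluation"])

print(round(row_level_r, 3), round(grouped_r, 3))
```

In this toy example the grouped correlation is higher than the row-level one, so the averaged plot is best read as showing the direction of the trend, not its strength.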

In [136]:
# Group the data by time spent at the company and calculate the mean performance evaluation
mean_performance_by_time = data.groupby('time_spend_company')['last_evaluation'].mean().reset_index()

# Create a bar chart to show the average performance by time spent at the company
fig = px.bar(mean_performance_by_time, x='time_spend_company', y='last_evaluation',
             labels={'time_spend_company': 'Time Spent at the Company (Years)', 'last_evaluation': 'Average Performance'},
             title='Average Performance by Time Spent at the Company')

# Customize the layout of the chart
fig.update_layout(
    title_text='Average Performance by Time Spent at the Company',
    xaxis_title='Time Spent at the Company (Years)',
    yaxis_title='Average Performance'
)

# Display the chart
fig.show()



The bar chart shows the relationship between average performance and the time spent at the company (in years).

***Key Points:***

- **Fluctuating Performance:** The average performance fluctuates across different years spent at the company. It appears that employees with 3, 5, and 10 years of experience have relatively higher performance evaluations compared to those with 2, 4, and 6-9 years of experience.
- **Highest Performance:** Employees with 5 years of experience have the highest average performance.
- **General Insight:** The data does not show a clear linear trend indicating that more years spent at the company consistently lead to higher performance. Instead, performance peaks at certain points and drops at others.

***Conclusion:***

The time spent at the company is not a consistent predictor of employee performance. While certain years (like 5 years) show higher average performance, the trend is not linear. This suggests that other factors, beyond just the number of years spent at the company, significantly influence performance. It may be beneficial to consider additional variables and conduct further analysis to understand what drives these fluctuations in performance.
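Fluctuations between tenure groups can partly reflect group size, since a mean based on few employees is less stable. A minimal sketch, assuming a small made-up frame `df` in place of the notebook's `data`, shows how to report the number of employees next to each mean:

```python
import pandas as pd

# Tiny made-up frame standing in for the notebook's `data`.
df = pd.DataFrame({
    "time_spend_company": [2, 2, 2, 3, 3, 5, 5, 5, 5, 10],
    "last_evaluation":    [0.55, 0.60, 0.70, 0.80, 0.75, 0.90, 0.85, 0.88, 0.80, 0.72],
})

# Mean evaluation together with the number of employees behind each mean.
summary = (
    df.groupby("time_spend_company")["last_evaluation"]
      .agg(mean_evaluation="mean", n_employees="count")
      .reset_index()
)
print(summary)
```

A tenure group with very few employees (here the 10-year group has one) should be read with more caution than a large one.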

### Multiple Regression Analysis


**Chosen Model:** Linear Regression

#### Reason:
- **Linear regression** is suitable for modeling the relationship between a continuous dependent variable (in this case, employee performance represented by `last_evaluation`) and a set of numeric independent variables.
- **Multiple regression analysis** helps determine the extent to which each factor is associated with employee performance, through the regression coefficient estimated for each independent variable.




In [137]:
# Defining the feature variables (X) and the target variable (y)
X = data[['number_project', 'average_montly_hours', 'time_spend_company']]
y = data['last_evaluation']

In [138]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [139]:
# Training the linear regression model and making predictions on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [140]:
# Printing the model coefficients, intercept, mean squared error, and R^2 score
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f' % mean_squared_error(y_test, y_pred))
print('Coefficient of determination (R^2): %.2f' % r2_score(y_test, y_pred))

Coefficients: [0.03263722 0.00080914 0.00704876]
Intercept: 0.4050132193136506
Mean squared error (MSE): 0.02
Coefficient of determination (R^2): 0.17



**Coefficients:**
- number_project: 0.03263722
- average_montly_hours: 0.00080914
- time_spend_company: 0.00704876

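These coefficients estimate how much the last evaluation score changes for a one-unit increase in each factor. For instance, each additional project is associated with an increase of approximately 0.0326 in the last evaluation score, assuming other factors remain constant. Because the factors are measured in different units (projects, hours, years), the sizes of the coefficients are not directly comparable.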

**Coefficient of Determination (R^2):**
- 0.17

The R^2 value measures the proportion of the variance in the dependent variable (last evaluation score) that is predictable from the independent variables. In this case, the model explains 17% of the variance in the performance evaluation, which is relatively low.
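The definition above can be written as $R^2 = 1 - SS_{res} / SS_{tot}$. The following sketch computes it by hand on a few made-up scores; `r_squared` is a hypothetical helper written for illustration, not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([0.5, 0.6, 0.7, 0.8, 0.9])          # made-up evaluation scores
perfect = r_squared(y, y)                         # predictions equal to the truth
mean_only = r_squared(y, np.full_like(y, y.mean()))  # always predicting the mean
print(perfect, mean_only)
```

A model that always predicts the mean scores 0, and a perfect model scores 1, so a value of 0.17 sits much closer to the mean-only baseline.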

***Conclusion***

From the above analysis, we can conclude that the current model has weak explanatory power (low R^2) for predicting the last evaluation score. Together, the chosen factors (number of projects, average monthly hours, and time spent at the company) account for only a small share of the variation in performance evaluations, so other unaccounted factors might have a more significant impact. It is recommended to conduct further analysis and include additional variables that could have a greater influence on performance.


In [141]:
# Defining the feature variables (X) and the target variable (y)
X_performance_all = data[['satisfaction_level', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'salary_numeric']]
y_performance_all = data['last_evaluation']

In [142]:
# Splitting the dataset into training and testing sets
X_train_p_all, X_test_p_all, y_train_p_all, y_test_p_all = train_test_split(X_performance_all, y_performance_all, test_size=0.3, random_state=42)

In [143]:
# Training the linear regression model and making predictions on the test set
linreg_all = LinearRegression()
linreg_all.fit(X_train_p_all, y_train_p_all)
y_pred_p_all = linreg_all.predict(X_test_p_all)

In [144]:
# Getting the coefficients, intercept, mean squared error, and R^2 score for the linear regression model
coefficients_all = linreg_all.coef_
intercept_all = linreg_all.intercept_
mse_all = mean_squared_error(y_test_p_all, y_pred_p_all)
r2_all = r2_score(y_test_p_all, y_pred_p_all)

In [145]:
# Printing the model coefficients, intercept, mean squared error, and R^2 score
print('Coefficients:', coefficients_all)
print('Intercept:', intercept_all)
print('Mean squared error (MSE): %.2f' % mse_all)
print('Coefficient of determination (R²): %.2f' % r2_all)

Coefficients: [ 0.10730053  0.03567643  0.00078736  0.00862515 -0.00689286 -0.00687384
 -0.00593261]
Intercept: 0.337078282677726
Mean squared error (MSE): 0.02
Coefficient of determination (R²): 0.20



**Coefficients:**
- satisfaction_level: 0.10730053
- number_project: 0.03567643
- average_montly_hours: 0.00078736
- time_spend_company: 0.00862515
- Work_accident: -0.00689286
- promotion_last_5years: -0.00687384
- salary_numeric: -0.00593261

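These coefficients estimate how much the last evaluation score changes for a one-unit increase in each factor, assuming other factors remain constant. Since satisfaction level ranges between 0 and 1, a one-unit increase spans its entire range: moving from the lowest to the highest satisfaction level is associated with an increase of approximately 0.1073 in the last evaluation score.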
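Raw coefficients depend on each variable's units, so a small coefficient on a large-scale variable (such as monthly hours) can matter more than a larger coefficient on a 0 to 1 variable. One common way to compare them is to standardize the variables first. The sketch below uses synthetic data and a hypothetical helper `ols_coefficients` built on NumPy least squares; it is not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
hours = rng.normal(200, 50, size=n)        # large-scale feature (synthetic)
satisfaction = rng.uniform(0, 1, size=n)   # feature on a 0-1 scale (synthetic)
y = 0.4 + 0.001 * hours + 0.1 * satisfaction + rng.normal(0, 0.05, size=n)

X = np.column_stack([hours, satisfaction])

def ols_coefficients(X, y):
    """Least-squares slopes; an intercept is fitted but not returned."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

raw = ols_coefficients(X, y)

# z-score features and target so each slope is in standard-deviation units.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
standardized = ols_coefficients(Xz, yz)

print("raw:", raw)
print("standardized:", standardized)
```

In this synthetic example the raw coefficient for hours is the smaller of the two, yet its standardized coefficient is the larger one.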


**Coefficient of Determination (R^2):**
- 0.20

The R^2 value measures the proportion of the variance in the dependent variable (last evaluation score) that is predictable from the independent variables. In this case, the model explains 20% of the variance in the performance evaluation, which is relatively low.

***Conclusion***

From the above analysis, we can conclude that the extended model still has weak explanatory power for predicting the last evaluation score: adding satisfaction level, work accident, promotion in the last 5 years, and salary raised R^2 only from 0.17 to 0.20. Together, the chosen factors account for only a small share of the variation in performance evaluations, so other unaccounted factors might have a more significant impact. It is recommended to conduct further analysis and include additional variables that could have a greater influence on performance.


In [146]:
# Calculate the average performance based on satisfaction level
mean_satisfaction_performance = data.groupby('satisfaction_level')['last_evaluation'].mean().reset_index()

# Create a scatter plot to show the relationship between satisfaction level and average performance
fig = px.scatter(mean_satisfaction_performance, x='satisfaction_level', y='last_evaluation',
                 labels={'satisfaction_level': 'Satisfaction Level', 'last_evaluation': 'Average Performance'},
                 title='Scatter Plot of Satisfaction Level vs. Average Performance')

# Customize the layout of the plot
fig.update_layout(
    title_text='Scatter Plot of Satisfaction Level vs. Average Performance',
    xaxis_title='Satisfaction Level',
    yaxis_title='Average Performance'
)

# Display the plot
fig.show()



The scatter plot shows the relationship between satisfaction level and average performance.

***Key Points:***

- **Positive Correlation:** There is a noticeable positive correlation between satisfaction level and average performance. As the satisfaction level increases, the average performance also tends to increase.
- **Low Satisfaction and Performance:** Employees with lower satisfaction levels (below 0.4) generally have lower average performance scores, mostly between 0.5 and 0.75.
- **High Satisfaction and Performance:** Employees with higher satisfaction levels (above 0.8) tend to have higher average performance scores, often exceeding 0.75 and reaching up to 0.9.

***Conclusion:***

Satisfaction level is positively associated with employee performance: higher satisfaction levels go with higher average performance evaluations. The regression above, however, explained only 20% of the variance even with satisfaction included, so satisfaction on its own is a modest predictor for individual employees. Improving employee satisfaction may still support better performance outcomes, and companies can consider strategies to enhance it.

### Hypothesis Testing



To further investigate the variables that showed strong relationships with performance in the previous analysis, we will conduct hypothesis testing. We aim to understand the impact of these variables on employee performance evaluations.

###  *1- Hypothesis Testing for the Impact of Number of Projects on Performance*

First, we will test the hypothesis regarding the effect of the number of projects on performance evaluations.

### Group Definitions:
1. **Group 1 (Low Projects Group)**: Employees who work on a number of projects less than or equal to the average number of projects.
2. **Group 2 (High Projects Group)**: Employees who work on a number of projects greater than the average number of projects.

### Hypothesis:
- **Null Hypothesis (H0)**: There is no statistically significant difference in performance evaluations between employees who work on a number of projects less than or equal to the average and those who work on a number of projects greater than the average.
- **Alternative Hypothesis (H1)**: There is a statistically significant difference in performance evaluations between employees who work on a number of projects less than or equal to the average and those who work on a number of projects greater than the average.

### Formula for Hypothesis Testing:
We will use the two-sample t-test to compare the means of the two groups:

$$
\begin{aligned}
H_0 &: \mu_1 = \mu_2 \\
H_1 &: \mu_1 \neq \mu_2
\end{aligned}
$$

The t-test will help determine if the observed differences in the performance evaluation scores between the two groups are statistically significant.
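For reference, `stats.ttest_ind` with its default setting (`equal_var=True`) computes the pooled-variance (Student) statistic. The symbols $\bar{x}_i$, $s_i^2$, and $n_i$ are notation introduced here for the sample mean, sample variance, and size of group $i$:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
$$

The sign of $t$ follows the order of the arguments: a negative value means the first group has the lower mean.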



In [147]:
# Import the scipy.stats library for statistical functions
import scipy.stats as stats

In [148]:
# Calculate the mean number of projects
mean_projects = data['number_project'].mean()

In [149]:
# Define the low and high projects groups based on the mean number of projects
low_projects_group = data[data['number_project'] <= mean_projects]['last_evaluation']
high_projects_group = data[data['number_project'] > mean_projects]['last_evaluation']

In [150]:
# Perform a t-test to compare the performance evaluations of the two groups
t_stat, p_value = stats.ttest_ind(low_projects_group, high_projects_group)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -35.14401107447858
P-value: 4.4338579925292734e-260


In [151]:
# Set the significance level
alpha = 0.05

# Determine if we should reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference between the two groups.")

Reject the null hypothesis. There is a statistically significant difference between the two groups.



Given that the p-value is significantly less than the alpha level of 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in performance evaluations between employees who work on a number of projects less than or equal to the average and those who work on a number of projects greater than the average.

***Conclusion:***

The hypothesis test shows a statistically significant difference in performance evaluations between the two groups. Because the low-projects group was passed first to `ttest_ind`, the negative t-statistic means that employees who handle more projects than average have higher mean evaluations than those who handle fewer. This suggests that project load is associated with performance evaluations, although the test alone does not establish that project load causes the difference.
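With samples of this size, even small differences produce tiny p-values, so it can help to report an effect size alongside the test. The sketch below runs on made-up scores, not the notebook's groups. It swaps in Welch's variant (`equal_var=False`), which does not assume equal variances, and adds a hypothetical helper `cohens_d`; both are assumptions rather than the notebook's method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
low_group = rng.normal(0.68, 0.17, size=4000)    # made-up evaluation scores
high_group = rng.normal(0.77, 0.16, size=3000)   # made-up evaluation scores

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(low_group, high_group, equal_var=False)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d(low_group, high_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, Cohen's d = {d:.2f}")
```

Cohen's d expresses the gap between the group means in units of standard deviation, which stays interpretable regardless of sample size.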









###  *2- Hypothesis Testing for the Impact of Average Monthly Hours on Performance*

Second, we will test the hypothesis regarding the effect of the average monthly hours on performance evaluations.

### Group Definitions:
1. **Group 1 (Low Hours Group)**: Employees whose average monthly hours are less than or equal to the overall average.
2. **Group 2 (High Hours Group)**: Employees whose average monthly hours are greater than the overall average.

### Hypothesis:
- **Null Hypothesis (H0)**: There is no statistically significant difference in performance evaluations between employees whose average monthly hours are at or below the overall average and those whose average monthly hours are above it.
- **Alternative Hypothesis (H1)**: There is a statistically significant difference in performance evaluations between employees whose average monthly hours are at or below the overall average and those whose average monthly hours are above it.


In [152]:
# Calculate the mean of the 'average_montly_hours' column
mean_hours = data['average_montly_hours'].mean()

In [153]:
# Group the data based on the average monthly hours
low_hours_group = data[data['average_montly_hours'] <= mean_hours]['last_evaluation']
high_hours_group = data[data['average_montly_hours'] > mean_hours]['last_evaluation']

In [154]:
# Perform a t-test to compare the performance evaluations of the two groups
t_stat, p_value = stats.ttest_ind(low_hours_group, high_hours_group)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -37.15653700977242
P-value: 3.6510982507045833e-289


In [155]:
# Set the significance level
alpha = 0.05

# Determine if we should reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference between the two groups.")


Reject the null hypothesis. There is a statistically significant difference between the two groups.



Given that the p-value is far below the alpha level of 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in performance evaluations between employees whose average monthly hours are at or below the overall average and those whose average monthly hours are above it.

***Conclusion***

The hypothesis test shows a statistically significant difference in performance evaluations between the two groups. Because the low-hours group was passed first to `ttest_ind`, the negative t-statistic means that employees who work more hours than average have higher mean evaluations than those who work fewer hours. This suggests that monthly work hours are associated with performance evaluations, although the test alone does not establish a causal effect.



###  *3- Hypothesis Testing for the Impact of Time Spent at the Company on Performance*

Third, we will test the hypothesis regarding the effect of the time spent at the company on performance evaluations.

### Group Definitions:
1. **Group 1 (Low Time Spent Group)**: Employees who have spent less than or equal to the average time at the company.
2. **Group 2 (High Time Spent Group)**: Employees who have spent more than the average time at the company.

### Hypothesis:
- **Null Hypothesis (H0)**: There is no statistically significant difference in performance evaluations between employees who have spent less than or equal to the average time at the company and those who have spent more than the average time at the company.
- **Alternative Hypothesis (H1)**: There is a statistically significant difference in performance evaluations between employees who have spent less than or equal to the average time at the company and those who have spent more than the average time at the company.


In [156]:
# Calculate the mean of the 'time_spend_company' column
mean_time_spent = data['time_spend_company'].mean()

In [157]:
# Group the data based on the mean time spent at the company
low_time_group = data[data['time_spend_company'] <= mean_time_spent]['last_evaluation']
high_time_group = data[data['time_spend_company'] > mean_time_spent]['last_evaluation']

In [158]:
# Perform a t-test to compare the performance evaluations of the two groups
t_stat, p_value = stats.ttest_ind(low_time_group, high_time_group)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -30.903910720617834
P-value: 2.3654453912393553e-203


In [159]:
# Set the significance level
alpha = 0.05

# Determine if we should reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference between the two groups.")

Reject the null hypothesis. There is a statistically significant difference between the two groups.



Given that the p-value is significantly less than the alpha level of 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in performance evaluations between employees who have spent less than or equal to the average time at the company and those who have spent more than the average time at the company.

***Conclusion***

The hypothesis test shows a statistically significant difference in performance evaluations between the two groups. Because the low-tenure group was passed first to `ttest_ind`, the negative t-statistic means that employees with above-average tenure have higher mean evaluations than those with less time at the company. This suggests that tenure is associated with performance evaluations, although the bar chart of average performance by tenure showed that the relationship is not consistent from year to year.


###  *4- Hypothesis Testing for the Impact of Satisfaction Level on Performance*

Fourth, we will test the hypothesis regarding the effect of the satisfaction level on performance evaluations.

### Group Definitions:
1. **Group 1 (Low Satisfaction Group)**: Employees who have a satisfaction level less than or equal to the average satisfaction level.
2. **Group 2 (High Satisfaction Group)**: Employees who have a satisfaction level greater than the average satisfaction level.

### Hypothesis:
- **Null Hypothesis (H0)**: There is no statistically significant difference in performance evaluations between employees with a satisfaction level less than or equal to the average and those with a satisfaction level greater than the average.
- **Alternative Hypothesis (H1)**: There is a statistically significant difference in performance evaluations between employees with a satisfaction level less than or equal to the average and those with a satisfaction level greater than the average.



In [160]:
# Calculate the mean of the 'satisfaction_level' column
mean_satisfaction = data['satisfaction_level'].mean()

In [161]:
# Group the data based on the mean satisfaction level
low_satisfaction_group = data[data['satisfaction_level'] <= mean_satisfaction]['last_evaluation']
high_satisfaction_group = data[data['satisfaction_level'] > mean_satisfaction]['last_evaluation']

In [162]:
# Perform a t-test to compare the performance evaluations of the two groups
t_stat, p_value = stats.ttest_ind(low_satisfaction_group, high_satisfaction_group)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -25.248164832690165
P-value: 8.798889815807712e-138


In [163]:
# Set the significance level
alpha = 0.05

# Determine if we should reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference between the two groups.")

Reject the null hypothesis. There is a statistically significant difference between the two groups.



Given that the p-value is significantly less than the alpha level of 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in performance evaluations between employees who have a satisfaction level less than or equal to the average and those who have a satisfaction level greater than the average.

***Conclusion***

The hypothesis test shows a statistically significant difference in performance evaluations between the two groups. Because the low-satisfaction group was passed first to `ttest_ind`, the negative t-statistic means that employees with above-average satisfaction have higher mean evaluations than those with lower satisfaction. This suggests that employee satisfaction is associated with performance evaluations, although the test alone does not establish a causal effect.
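The four tests above follow the same steps: split at the mean of a feature, then compare `last_evaluation` between the two halves. As a sketch, those steps could be wrapped in one function. The name `mean_split_ttest` and the `demo` frame are hypothetical and use made-up data; the notebook itself repeats the cells instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

def mean_split_ttest(df, feature, target="last_evaluation", alpha=0.05):
    """Split `df` at the mean of `feature` and t-test `target` between the halves."""
    cutoff = df[feature].mean()
    low = df.loc[df[feature] <= cutoff, target]
    high = df.loc[df[feature] > cutoff, target]
    t_stat, p_value = stats.ttest_ind(low, high)
    return {
        "feature": feature,
        "mean_low": low.mean(),
        "mean_high": high.mean(),
        "t_stat": t_stat,
        "p_value": p_value,
        "reject_h0": bool(p_value < alpha),
    }

# Made-up data standing in for the notebook's `data`.
rng = np.random.default_rng(7)
demo = pd.DataFrame({"number_project": rng.integers(2, 8, size=1000)})
demo["last_evaluation"] = 0.5 + 0.04 * demo["number_project"] + rng.normal(0, 0.15, size=1000)

result = mean_split_ttest(demo, "number_project")
print(result)
```

Returning the two group means alongside the t-statistic makes the direction of the difference explicit, which the p-value alone does not show.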



###  Question Answer

#### Findings:

1. **Number of Projects:**
   - **Correlation**: A positive correlation (0.35) was found between the number of projects and employee performance. Employees involved in a higher number of projects tend to receive higher performance evaluations.
   - **Hypothesis Testing**: The hypothesis test revealed a statistically significant difference in performance evaluations between employees who work on fewer projects versus those who work on more projects (T-statistic: -35.14, P-value: 4.43e-260).
   - **Conclusion**: Employees with more projects tend to receive higher evaluations. Assigning more projects could enhance performance only if this relationship is causal and the workload remains manageable.

2. **Average Monthly Hours:**
   - **Correlation**: There is a positive correlation between average monthly hours worked and employee performance.
   - **Hypothesis Testing**: The hypothesis test showed a statistically significant difference in performance evaluations between employees who work fewer monthly hours versus those who work more (T-statistic: -37.16, P-value: 3.65e-289).
   - **Conclusion**: Employees who work more hours on average tend to have better performance evaluations. Ensuring employees are engaged for sufficient hours without overworking them can improve performance.

3. **Time Spent at the Company:**
   - **Correlation**: The bar chart of average performance by tenure shows no clear linear trend; average performance peaks at certain tenures (such as 5 years) and drops at others.
   - **Hypothesis Testing**: The hypothesis test indicated a statistically significant difference in performance evaluations between employees with shorter tenures versus those with longer tenures at the company (T-statistic: -30.90, P-value: 2.37e-203).
   - **Conclusion**: Employees with above-average tenure have higher mean evaluations, but tenure is not a consistent predictor of performance year by year. Strategies to retain employees longer may support performance, though the effect is not guaranteed.

4. **Satisfaction Level:**
   - **Correlation**: There is a positive correlation between employee satisfaction levels and their performance evaluations.
   - **Hypothesis Testing**: The hypothesis test showed a statistically significant difference in performance evaluations between employees with lower satisfaction levels versus those with higher satisfaction levels (T-statistic: -25.25, P-value: 8.80e-138).
   - **Conclusion**: Higher satisfaction levels among employees are associated with better performance. Efforts to improve job satisfaction can lead to enhanced employee performance.

#### Recommendations:
- **Project Management**: Ensure a balanced and manageable number of projects per employee to optimize performance.
- **Work Hours**: Monitor and adjust work hours to ensure employees are engaged but not overworked, to maintain high performance.
- **Retention Strategies**: Develop and implement retention strategies to encourage longer tenures, which positively impact performance.
- **Satisfaction Programs**: Invest in programs and initiatives to enhance employee satisfaction, leading to better performance outcomes.

## 2. Improve Employee Retention Rate:

#### **Business Question:** What are the key factors that affect employee retention rates in the company, and how can we improve these rates?




In [164]:
# Exclude non-numeric text columns
numeric_data = data.select_dtypes(include=[np.number])

# Correlation Matrix
correlation_matrix = numeric_data.corr().round(2)

# Display the correlation matrix using Plotly
fig = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix')
fig.show()

To identify the factors associated with employee retention, we look at the correlations with `left`, the variable that records whether an employee left the company. The variables most correlated with `left` are:

>**1. Correlation between Turnover and satisfaction_level (-0.39):**
This moderate negative correlation, the strongest of the four listed here, indicates that employees with higher satisfaction levels are less likely to leave the company. Higher satisfaction levels are associated with lower turnover rates, suggesting that employee satisfaction plays an important role in retention.

>**2. Correlation between Turnover and Work_accident (-0.15):**
This weak negative correlation means that employees who had a work accident were slightly less likely to leave the company, not more. The correlation alone does not explain why, so it should not be read as evidence that accidents improve retention.

>**3. Correlation between Turnover and salary_numeric (-0.16):**
This weak negative correlation indicates that salary levels might slightly influence employees' decisions to stay or leave. Lower salary levels may drive employees to seek better opportunities elsewhere.

>**4. Correlation between Turnover and time_spend_company (0.14):**
This weak positive correlation means that employees who have been with the company longer were slightly more likely to leave. Longer tenure on its own is therefore not associated with better retention in this data.
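One way to check how the sign of a correlation with `left` should be read is to compare leave rates across groups. Since `left` is coded 0 or 1, its mean within a group is the share of that group who left. The sketch below uses a handful of made-up rows in a hypothetical `demo` frame, not the notebook's `data`.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `data`.
demo = pd.DataFrame({
    "Work_accident": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "left":          [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, here the share who left.
leave_rate = demo.groupby("Work_accident")["left"].mean()
print(leave_rate)

# A negative correlation corresponds to a lower leave rate in the group coded 1.
corr = demo["Work_accident"].corr(demo["left"])
print(f"correlation = {corr:.2f}")
```

The same pattern applies to any binary or ordinal column: a negative correlation with `left` goes with a lower leave rate at higher values of that column.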


**Chosen Model:** Logistic Regression

#### Reason:
- **Logistic regression** is suitable for modeling binary variables (in this case, employee turnover represented by `left`).
- **Retention analysis** requires understanding the factors that affect the likelihood of employees leaving the company, and logistic regression can estimate these probabilities based on a set of independent variables.

In [165]:
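# Define the feature variables (X) and the target variable (y) for the retention model
X_retention = data[['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'salary_numeric']]
y_retention = data['left']

In [166]:
# Split the dataset into training (70%) and testing (30%) sets
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_retention, y_retention, test_size=0.3, random_state=42)

In [167]:
# Train the logistic regression model (max_iter raised above the default of 100 to give the solver more iterations)
logreg_r = LogisticRegression(max_iter=1000)
logreg_r.fit(X_train_r, y_train_r)

In [168]:
# Predict turnover (0 = stayed, 1 = left) on the test set
y_pred_r = logreg_r.predict(X_test_r)

In [169]:
# Print the confusion matrix, the classification report, and the overall accuracy
print(confusion_matrix(y_test_r, y_pred_r))
print(classification_report(y_test_r, y_pred_r))
print('Accuracy:', accuracy_score(y_test_r, y_pred_r))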

[[3172  256]
 [ 711  361]]
              precision    recall  f1-score   support

           0       0.82      0.93      0.87      3428
           1       0.59      0.34      0.43      1072

    accuracy                           0.79      4500
   macro avg       0.70      0.63      0.65      4500
weighted avg       0.76      0.79      0.76      4500

Accuracy: 0.7851111111111111
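The figures in the report can be recomputed from the confusion matrix, which helps when reading them. The sketch below uses only the four counts printed above. The comparison against a baseline that always predicts "stayed" is an added check, not part of the original analysis.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted.
tn, fp = 3172, 256   # actual 0 (stayed): predicted stayed / predicted left
fn, tp = 711, 361    # actual 1 (left):   predicted stayed / predicted left
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_left = tp / (tp + fp)          # of those predicted to leave, share who did
recall_left = tp / (tp + fn)             # of those who left, share the model caught
baseline_accuracy = (tn + fp) / total    # accuracy of always predicting "stayed"

print(f"accuracy          = {accuracy:.3f}")
print(f"precision (left)  = {precision_left:.2f}")
print(f"recall (left)     = {recall_left:.2f}")
print(f"baseline accuracy = {baseline_accuracy:.3f}")
```

Read this way, the model identifies roughly a third of the employees who actually left, and its overall accuracy is only a few points above the always-"stayed" baseline, so accuracy alone overstates how well it detects turnover.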
