<a href="https://colab.research.google.com/github/TMQ5/my_projects/blob/main/People%20Analytics/MNC%20Comany/HR_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Employee Data Analysis (People Analytics)

## Introduction

In this project, we will analyze employee data using the **HR Employee Analytics** dataset from [Kaggle](https://www.kaggle.com/datasets/kmldas/hr-employee-data-descriptive-analytics). We aim to answer a set of questions related to employee performance, recruitment efficiency, employee satisfaction, and employee retention.

## Objectives

1. **Employee Performance Analysis:** What are the key factors affecting employee performance, and how can we enhance these factors to improve productivity?
2. **Improving Recruitment Efficiency:** How can we improve the efficiency of the recruitment process to reduce time and cost while increasing the quality of accepted candidates?
3. **Increasing Employee Satisfaction:** What are the key factors affecting employee satisfaction, and how can we enhance these factors to improve the work environment?
4. **Improving Employee Retention Rate:** What are the main factors affecting employee retention in the company, and how can we improve these rates?

## The Six Steps of Data Analysis According to the Google Data Analytics Professional Certificate Methodology

1. **Ask:** Define the key questions we want to answer through the analysis.
2. **Prepare:** Gather the data and confirm that it is suitable for the analysis.
3. **Process:** Clean the data: handle missing and duplicate values and convert text data to numeric formats where needed.
4. **Analyze:** Use various analytical techniques to answer the posed questions.
5. **Share:** Present the results through reports, dashboards, and presentations.
6. **Act:** Implement recommendations based on the analysis and monitor their impact.




## Importing Libraries and Loading Data

Let's start by importing the necessary libraries and loading the data.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [5]:
# Load the data
file_url = 'https://github.com/TMQ5/my_projects/raw/main/People%20Analytics/MNC%20Comany/HR_Employee_Data.xlsx'
data = pd.read_excel(file_url)

In [6]:
# Display the initial data
data.head()

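|   | Emp_Id   | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|----------|--------------------|-----------------|----------------|----------------------|--------------------|---------------|------|-----------------------|------------|--------|
| 0 | IND02438 | 0.38               | 0.53            | 2              | 157                  | 3                  | 0             | 1    | 0                     | sales      | low    |
| 1 | IND28133 | 0.8                | 0.86            | 5              | 262                  | 6                  | 0             | 1    | 0                     | sales      | medium |
| 2 | IND07164 | 0.11               | 0.88            | 7              | 272                  | 4                  | 0             | 1    | 0                     | sales      | medium |
| 3 | IND30478 | 0.72               | 0.87            | 5              | 223                  | 5                  | 0             | 1    | 0                     | sales      | low    |
| 4 | IND24003 | 0.37               | 0.52            | 2              | 159                  | 3                  | 0             | 1    | 0                     | sales      | low    |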


In [7]:
# Display summary of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp_Id                 14999 non-null  object 
 1   satisfaction_level     14999 non-null  float64
 2   last_evaluation        14999 non-null  float64
 3   number_project         14999 non-null  int64  
 4   average_montly_hours   14999 non-null  int64  
 5   time_spend_company     14999 non-null  int64  
 6   Work_accident          14999 non-null  int64  
 7   left                   14999 non-null  int64  
 8   promotion_last_5years  14999 non-null  int64  
 9   Department             14999 non-null  object 
 10  salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(3)
memory usage: 1.3+ MB


In [9]:
# Check for missing values
data.isnull().sum()

Emp_Id                   0
satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
Department               0
salary                   0
dtype: int64

## Data Overview

The dataset consists of 14,999 rows and 11 columns. No column contains missing values, and each column's data type matches its content, as summarized below.

| Column                   | Data Type | Description                                    |
|--------------------------|-----------|------------------------------------------------|
| `Emp_Id`                 | object    | Employee ID (text)                             |
| `satisfaction_level`     | float64   | Employee satisfaction level (numeric)          |
| `last_evaluation`        | float64   | Last evaluation score (numeric)                |
| `number_project`         | int64     | Number of projects worked on (integer)         |
| `average_montly_hours`   | int64     | Average monthly hours worked (integer)         |
| `time_spend_company`     | int64     | Number of years spent in the company (integer) |
| `Work_accident`          | int64     | Whether the employee had a work accident (integer, 0 or 1) |
| `left`                   | int64     | Whether the employee left the company (integer, 0 or 1) |
| `promotion_last_5years`  | int64     | Whether the employee was promoted in the last 5 years (integer, 0 or 1) |
| `Department`             | object    | Department name (text)                         |
| `salary`                 | object    | Salary level ('low', 'medium', 'high')         |
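The overview above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: it runs on a small stand-in DataFrame with illustrative values, and the helper name `overview_checks` and its parameters are hypothetical. It counts missing values, fully duplicated rows, repeated employee IDs, and values outside 0/1 in the indicator columns.

```python
import pandas as pd

# Small stand-in frame with the same kinds of columns as the real data.
# The values are illustrative only; they are not taken from the dataset.
sample = pd.DataFrame({
    "Emp_Id": ["A1", "A2", "A3", "A3"],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.11],
    "left": [1, 1, 0, 0],
    "salary": ["low", "medium", "high", "high"],
})


def overview_checks(df, id_col="Emp_Id", binary_cols=("left",)):
    """Return basic data-quality counts for df (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns
        "missing_values": int(df.isnull().sum().sum()),
        # Rows that repeat an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # IDs that repeat an earlier ID
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        # Cells in the indicator columns that are neither 0 nor 1
        "non_binary_values": int(
            sum((~df[col].isin([0, 1])).sum() for col in binary_cols)
        ),
    }


checks = overview_checks(sample)
print(checks)
```

On the real data, the same function could be called as `overview_checks(data, binary_cols=("Work_accident", "left", "promotion_last_5years"))`; any non-zero count would be worth inspecting before moving on.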

The `salary` column is correctly stored as text. To facilitate analysis, we will keep the original text values in `salary` and add a new column, `salary_numeric`, that holds an ordinal numeric encoding of the same levels. This lets us use either the text or the numeric values, depending on the analysis.

- `salary`: Original text values ('low', 'medium', 'high').
- `salary_numeric`: Encoded numeric values (1 for 'low', 2 for 'medium', 3 for 'high').

This dual-column approach provides flexibility, enabling us to leverage the clarity of text values in descriptive analyses and the computational efficiency of numeric values in statistical modeling and machine learning algorithms.


## Retaining Original Text Column and Creating Encoded Column




In [10]:
# Create an encoded version of the salary column
data['salary_numeric'] = data['salary'].map({'low': 1, 'medium': 2, 'high': 3})
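`Series.map` with a dictionary returns `NaN` for any label that is not a key in the dictionary, so a misspelled or unexpected salary level would silently become a missing value. A minimal sketch of one way to guard against that is shown below; the function name `encode_salary` and the demo series are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Same ordinal mapping as used for salary_numeric
salary_levels = {"low": 1, "medium": 2, "high": 3}


def encode_salary(series, levels=salary_levels):
    """Map salary labels to ordinal codes and fail loudly on unknown labels."""
    encoded = series.map(levels)
    # Labels missing from the mapping come back as NaN
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped salary labels: {list(unknown)}")
    return encoded.astype("int64")


demo = pd.Series(["low", "medium", "high", "low"])
codes = encode_salary(demo)
print(codes.tolist())
```

In this dataset the mapping covers every value, which the `data.info()` output confirms: `salary_numeric` has 14,999 non-null entries with dtype `int64`.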

In [11]:
# Display summary of the data again to verify changes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp_Id                 14999 non-null  object 
 1   satisfaction_level     14999 non-null  float64
 2   last_evaluation        14999 non-null  float64
 3   number_project         14999 non-null  int64  
 4   average_montly_hours   14999 non-null  int64  
 5   time_spend_company     14999 non-null  int64  
 6   Work_accident          14999 non-null  int64  
 7   left                   14999 non-null  int64  
 8   promotion_last_5years  14999 non-null  int64  
 9   Department             14999 non-null  object 
 10  salary                 14999 non-null  object 
 11  salary_numeric         14999 non-null  int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 1.4+ MB


In [12]:
# Display some descriptive statistics
data.describe()

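|       | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left     | promotion_last_5years | salary_numeric |
|-------|--------------------|-----------------|----------------|----------------------|--------------------|---------------|----------|-----------------------|----------------|
| count | 14999.0            | 14999.0         | 14999.0        | 14999.0              | 14999.0            | 14999.0       | 14999.0  | 14999.0               | 14999.0        |
| mean  | 0.612834           | 0.716102        | 3.803054       | 201.050337           | 3.498233           | 0.14461       | 0.238083 | 0.021268              | 1.594706       |
| std   | 0.248631           | 0.171169        | 1.232592       | 49.943099            | 1.460136           | 0.351719      | 0.425924 | 0.144281              | 0.637183       |
| min   | 0.09               | 0.36            | 2.0            | 96.0                 | 2.0                | 0.0           | 0.0      | 0.0                   | 1.0            |
| 25%   | 0.44               | 0.56            | 3.0            | 156.0                | 3.0                | 0.0           | 0.0      | 0.0                   | 1.0            |
| 50%   | 0.64               | 0.72            | 4.0            | 200.0                | 3.0                | 0.0           | 0.0      | 0.0                   | 2.0            |
| 75%   | 0.82               | 0.87            | 5.0            | 245.0                | 4.0                | 0.0           | 0.0      | 0.0                   | 2.0            |
| max   | 1.0                | 1.0             | 7.0            | 310.0                | 10.0               | 1.0           | 1.0      | 1.0                   | 3.0            |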



### Observations from the Summary Statistics

- **satisfaction_level**: The satisfaction level ranges between 9% and 100%, with an average satisfaction level of 61%.
- **last_evaluation**: The last evaluation score ranges between 36% and 100%, with an average score of 72%.
- **number_project**: The number of projects worked on ranges between 2 and 7, with an average of 3.8 projects.
- **average_montly_hours**: The average monthly hours worked range between 96 and 310, with an average of 201 hours.
- **time_spend_company**: The time spent in the company ranges between 2 and 10 years, with an average of 3.5 years.
- **Work_accident**: This is a binary indicator (0 or 1), so its mean of 0.14 is the share of employees who experienced a work-related accident: about 14%. A high accident rate may indicate the need for improved safety measures and training to ensure a safer work environment.
- **left**: This is also a binary indicator (0 or 1). Its mean of 0.24 means that about 24% of the employees in the dataset left the company. A high turnover rate can indicate issues such as low job satisfaction, limited career advancement opportunities, or unfavorable working conditions.
- **promotion_last_5years**: This is also a binary indicator (0 or 1). Its mean of 0.02 means that only about 2% of employees were promoted in the last 5 years. Promotions are therefore rare within the company, which could hurt motivation and retention if employees feel that opportunities for advancement are limited.
- **salary_numeric**: The salary level ranges between 1 and 3, with an average level of 1.6. This average indicates that the general salary level is closer to the lower end of the scale, with most employees having a salary level of either 1 (low) or 2 (medium).
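The readings of the indicator columns above rest on a simple fact: the mean of a 0/1 column equals the share of rows equal to 1. A minimal sketch on a toy DataFrame (illustrative values, not taken from the dataset) shows this, together with `value_counts(normalize=True)` as a more direct way to describe the distribution of salary levels than the mean of the encoded column.

```python
import pandas as pd

# Toy data: 4 employees, 1 of whom left and 1 of whom was promoted
toy = pd.DataFrame({
    "left": [1, 0, 0, 0],
    "promotion_last_5years": [0, 0, 0, 1],
    "salary": ["low", "low", "medium", "high"],
})

# Mean of a 0/1 indicator = proportion of rows equal to 1
rates = toy[["left", "promotion_last_5years"]].mean()

# Share of employees at each salary level
salary_share = toy["salary"].value_counts(normalize=True)

print(rates)
print(salary_share)
```

Applied to the real data, `data[['Work_accident', 'left', 'promotion_last_5years']].mean()` would give the rates quoted in the list, and `data['salary'].value_counts(normalize=True)` would show how employees are split across the three salary levels.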