# 1. Introduction

HumanForYou is a pharmaceutical company based in India, employing around 4,000 individuals. The company experiences an annual employee turnover rate of approximately 15%, which poses significant operational challenges. The management seeks to leverage AI and data analytics to better understand and mitigate employee attrition.

# 2. Business Problem

The high turnover rate negatively impacts the company in the following ways:

- **Project Delays**: Departing employees cause disruptions in ongoing projects, leading to delays and affecting the company’s reputation.
- **Increased HR Costs**: A substantial HR team is required to continuously recruit and train new employees.
- **Productivity Loss**: New employees require training and adaptation time before becoming fully operational.

To address these challenges, the company has enlisted a data analytics team to analyze key factors influencing employee attrition and recommend strategies to improve retention.


# 3. Data provided by the Human Resources department

## 3.1 Human Resources department information on each employee

- **Age**: The employee’s age in 2015.
- **Attrition**: The subject of our study - did the employee leave the company during 2016?
- **BusinessTravel**: How often did the employee have to travel as part of their job in 2015?  
  (Non-Travel = never, Travel_Rarely = rarely, Travel_Frequently = frequently)
- **Department**: The department the employee works in.
- **DistanceFromHome**: Distance in km between the employee’s home and the company.
- **Education**: Education level:  
  1 = Before College (Bac level equivalent)  
  2 = College (Bac+2 equivalent)  
  3 = Bachelor (Bac+3)  
  4 = Master (Bac+5)  
  5 = PhD (Doctoral thesis)
- **EducationField**: Study field, main subject.
- **EmployeeCount**: Boolean set to 1 if the employee was counted in the workforce in 2015.
- **EmployeeID**: Employee identifier.
- **Gender**: Employee’s gender.
- **JobLevel**: Hierarchical level in the company from 1 to 5.
- **JobRole**: Job role in the company.
- **MaritalStatus**: Marital status of the employee (Single, Married, or Divorced).
- **MonthlyIncome**: Gross wage per month, in rupees.
- **NumCompaniesWorked**: Number of companies the employee has worked for before joining HumanForYou.
- **Over18**: Is the employee over 18 years old or not?
- **PercentSalaryHike**: Salary increase % in 2015.
- **StandardHours**: Number of working hours per day in the employee’s contract.
- **StockOptionLevel**: Level of investment in company shares by the employee.
- **TotalWorkingYears**: Number of years the employee has worked for the company in the same type of position.
- **TrainingTimesLastYear**: Number of training days in 2015.
- **YearsAtCompany**: Seniority in the company.
- **YearsSinceLastPromotion**: Number of years since the last individual salary raise.
- **YearsWithCurrentManager**: Number of years the employee has worked under their current manager’s responsibility.
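
The integer codes in this data dictionary are easier to read in tables and plots once they are mapped to their labels. The following is a minimal sketch of that mapping for `Education`, assuming pandas; the `education_labels` dictionary and the `sample` frame are illustrative names with made-up rows, not part of the dataset.

```python
import pandas as pd

# Labels taken from the Education scale described above
education_labels = {
    1: "Before College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "PhD",
}

# Made-up sample rows, for illustration only
sample = pd.DataFrame({"EmployeeID": [1, 2, 3], "Education": [2, 4, 5]})

# Add a readable column while keeping the numeric code for modelling
sample["EducationLabel"] = sample["Education"].map(education_labels)
print(sample)
```

Keeping the numeric column alongside the label preserves the ordering of the scale, which a model can use directly.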


Columns that need to be changed:

- Over18
- StandardHours

In [7]:
# general_data is loaded from data/general_data.csv in section 3.4; run that cell before this one
general_data

In [None]:
manager_survey_data.info()

In [None]:
# Columns of general_data with dtype object, as reported by general_data.info():
#  1   Attrition                4410 non-null   object
#  2   BusinessTravel           4410 non-null   object
#  3   Department               4410 non-null   object
#  6   EducationField           4410 non-null   object
#  9   Gender                   4410 non-null   object
#  11  JobRole                  4410 non-null   object
#  12  MaritalStatus            4410 non-null   object
#  15  Over18                   4410 non-null   object

In [None]:
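# Object columns kept for encoding.
# Ordered categories (ordinal encoding):
#  1   Attrition                4410 non-null   object
#  2   BusinessTravel           4410 non-null   object
# Unordered categories (one-hot encoding):
#  3   Department               4410 non-null   object
#  6   EducationField           4410 non-null   object
#  11  JobRole                  4410 non-null   object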

In [None]:
print([i for i in general_data.columns if general_data[i].nunique()==1 ])
    

In [None]:
matrix_general_data = general_data_encoded.corr(numeric_only = True)
for i in general_data_encoded.columns:
    print(matrix_general_data[i])

In [None]:
general_data["EducationField"].value_counts()

# Data that does not need to be taken into account

## For Optimization and Relevant Data

- **EmployeeCount**: This column has only one unique value (`1`), meaning it doesn’t provide any useful variation. Since it is constant for all employees, it can be removed to optimize the dataset.  
- **Over18**: Since all employees are over 18 years old (`Yes` for everyone), this column does not provide any distinguishing information. Removing it helps avoid redundant data.  
- **StandardHours**: The standard working hours for all employees in their contracts are fixed at **8 hours**. Since there is no variation, this column does not contribute to any meaningful analysis and can be dropped for efficiency.  

## For Ethical Issues

- **Gender**: Including gender in the analysis could introduce **bias in decision-making**. Since an employee’s gender does not determine their performance, attrition, or job satisfaction, it is best to remove this column to ensure **fairness and avoid discrimination**.  
- **MaritalStatus**: Personal information such as marital status should not influence professional evaluations. Keeping this data could lead to **biased assumptions** about employees' work-life balance, commitment, or performance. Removing it promotes **equal treatment** of all employees, regardless of their personal situation.  



In [None]:
columns_to_drop = ["EmployeeCount", "Over18", "StandardHours", "Gender", "MaritalStatus"]

existing_columns = [col for col in columns_to_drop if col in general_data.columns]

if existing_columns:
    general_data = general_data.drop(columns=existing_columns)
    print(f"Dropped columns: {existing_columns}")
else:
    print("No columns were dropped. They were not found or already deleted")

general_data

In [None]:
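# StandardHours was dropped above, so indexing it would raise a KeyError; check for its presence instead
"StandardHours" in general_data.columns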

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Fix the category order explicitly so that "No" -> 0 and "Yes" -> 1
ordinal_encoder = OrdinalEncoder(categories=[["No", "Yes"]])

general_data_attrition = general_data[["Attrition"]]
general_data_attrition = ordinal_encoder.fit_transform(general_data_attrition)
# fit_transform returns a 2D array; flatten it before assigning it to the column
general_data["Attrition"] = general_data_attrition.ravel()

In [None]:
# JobRole is one-hot encoded in general_data_encoded, so read it from general_data
general_data_at = general_data[["JobRole"]]
general_data_at.head(10)

In [None]:
general_data_at.value_counts()

In [None]:
general_data_at = general_data[["Department"]] 

In [None]:
# Distinct values of BusinessTravel
categorical_business_travel = general_data["BusinessTravel"].unique().tolist()
categorical_business_travel

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
import pandas as pd


### 🔹 Ordinal Encoding (For Ordered Categories)
ordinal_mappings = {
    "Attrition": [["No", "Yes"]],
    "BusinessTravel": [["Non-Travel", "Travel_Rarely", "Travel_Frequently"]]
}

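# Apply Ordinal Encoding (skip columns already encoded by an earlier cell)
for col, categories in ordinal_mappings.items():
    if not pd.api.types.is_numeric_dtype(general_data[col]):
        encoder = OrdinalEncoder(categories=categories)
        general_data[col] = encoder.fit_transform(general_data[[col]]).ravel()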

### 🔹 OneHot Encoding (For Categorical Variables)
categorical_columns = ["Department", "EducationField", "JobRole"]

# ✅ Use sparse_output=False instead of sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False, drop="first")  # drop='first' avoids collinearity

# Fit and transform categorical columns
encoded_array = onehot_encoder.fit_transform(general_data[categorical_columns])

# Convert encoded array into a DataFrame, reusing general_data's index so the concat below aligns row by row
encoded_df = pd.DataFrame(encoded_array, columns=onehot_encoder.get_feature_names_out(categorical_columns), index=general_data.index)

# Merge with the original dataset (Dropping old categorical columns)
general_data_encoded = pd.concat([general_data.drop(columns=categorical_columns), encoded_df], axis=1)

### ✅ Display first few rows
print(general_data_encoded.head())

### ✅ Save Encoded Data (Optional)
# general_data_encoded.to_csv("data/general_data_encoded.csv", index=False)
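
The comment on `drop="first"` above mentions collinearity. The following is a small sketch of what that means, using `pd.get_dummies` rather than scikit-learn's `OneHotEncoder` so that it runs on a few made-up rows without fitting anything; the `sample`, `full`, and `reduced` names are illustrative only.

```python
import pandas as pd

# Made-up sample with three departments
sample = pd.DataFrame(
    {"Department": ["Sales", "Research & Development", "Human Resources", "Sales"]}
)

# One indicator column per category
full = pd.get_dummies(sample, columns=["Department"], dtype=int)
# Same encoding, but the first category (alphabetical order) is dropped
reduced = pd.get_dummies(sample, columns=["Department"], drop_first=True, dtype=int)

# With all three indicator columns every row sums to 1, so any one column
# is fully determined by the other two (perfect collinearity)
print(full.sum(axis=1).tolist())
# Dropping one category removes that redundancy without losing information
print(reduced.columns.tolist())
```

A row of zeros in the reduced encoding stands for the dropped category, here `Human Resources`.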


In [None]:
general_data_encoded.columns

In [None]:
general_data

In [None]:
general_data_encoded.to_csv('encoded.csv', index=False)  

In [None]:
categorical_columns = ["Department","EducationField","JobRole"]
categorical_department = general_data["Department"].unique().tolist()
categorical_education_field = general_data["EducationField"].unique().tolist()
categorical_job_role = general_data["JobRole"].unique().tolist()



In [None]:
categorical_department

In [None]:
categorical_job_role

In [None]:
general_data["JobRole"]

In [None]:
ordinal_encoder.categories_

In [None]:
general_data_at = general_data[["BusinessTravel"]] 
general_data_at.head(10)

## 3.2 Last assessment provided by the manager

- **EmployeeID**: Employee identifier.
- **JobInvolvement**: An assessment of the employee’s involvement in their job, graded:  
  1 = Poor  
  2 = Average  
  3 = Significant  
  4 = Very significant
- **PerformanceRating**: An assessment of the employee’s annual performance level for the company, graded:  
  1 = Poor  
  2 = Good  
  3 = Excellent  
  4 = Beyond expectations


## 3.3 Survey on the quality of life at work

- **EnvironmentSatisfaction**: The work environment, graded:  
  1 = Low  
  2 = Medium  
  3 = High  
  4 = Very high
- **JobSatisfaction**: Satisfaction with their job, graded from 1 to 4 as above.
- **WorkLifeBalance**: Their work/private life balance, graded:  
  1 = Bad  
  2 = Satisfactory  
  3 = Very satisfactory  
  4 = Excellent
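
The HR, manager, and survey tables describe the same employees, so they can be combined on the employee identifier. The following is a minimal sketch with made-up rows, assuming each table has an `EmployeeID` column; `how="left"` is an assumption chosen here to keep employees whose survey answers are missing, and the frame names are illustrative.

```python
import pandas as pd

# Made-up rows standing in for the three tables; only EmployeeID is shared
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [51, 31, 32]})
manager = pd.DataFrame({"EmployeeID": [1, 2, 3], "JobInvolvement": [3, 2, 3]})
survey = pd.DataFrame({"EmployeeID": [1, 2], "JobSatisfaction": [4.0, 2.0]})

# how="left" keeps every employee from the general table, even when a
# survey answer is missing (employee 3 gets NaN for JobSatisfaction)
merged = general.merge(manager, on="EmployeeID", how="left").merge(
    survey, on="EmployeeID", how="left"
)
print(merged)
```

With `how="inner"` instead, employee 3 would be dropped from the result, which silently shrinks the dataset when a table is incomplete.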


## 3.4 Working hours

In [None]:
# Library installation

!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install scikit-learn

In [None]:
# Library imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
general_data = pd.read_csv('data/general_data.csv')

#Last assessment provided by the manager
manager_survey_data = pd.read_csv('data/manager_survey_data.csv')

#Survey on the quality of life at work
employee_survey_data = pd.read_csv('data/employee_survey_data.csv')

#Working hours
out_time_data = pd.read_csv('data/out_time.csv')
in_time_data = pd.read_csv('data/in_time.csv')


In [None]:
general_data

In [None]:
column_names = manager_survey_data.columns
column_names

In [None]:
cleaned_general_data = general_data.dropna()

#Last assessment provided by the manager
cleaned_manager_survey_data = manager_survey_data.dropna()

#Survey on the quality of life at work
cleaned_employee_survey_data = employee_survey_data.dropna()

#Working hours
cleaned_out_time_data = out_time_data.dropna()
cleaned_in_time_data = in_time_data.dropna()

In [None]:
# Report how many rows dropna() removed from each table
for name, raw, cleaned in [
    ("general_data", general_data, cleaned_general_data),
    ("manager_survey_data", manager_survey_data, cleaned_manager_survey_data),
    ("employee_survey_data", employee_survey_data, cleaned_employee_survey_data),
]:
    lost = raw.shape[0] - cleaned.shape[0]
    print(f"{name} loss: {lost} rows ({lost / raw.shape[0]:.2%})")

In [None]:
general_data.info()


In [None]:
manager_survey_data

In [None]:
manager_survey_data.info()

In [None]:
employee_survey_data.info()

In [None]:
merged_general_survey = pd.merge(general_data, manager_survey_data, on="EmployeeID", how="inner")

In [None]:
merged_general_survey.info()

In [None]:
general_data.hist(bins=50, figsize=(16, 10))

In [None]:
in_time_data

In [None]:
out_time_data

In [None]:
general_data["TotalWorkingYears"].median()

In [None]:
general_data["TotalWorkingYears"].mean()

In [None]:
general_data["TotalWorkingYears"]

In [None]:
general_data

In [None]:
general_data.info()

In [None]:
matrix_general_data = general_data.corr(numeric_only = True)
matrix_general_data["TotalWorkingYears"]

In [None]:
general_data

In [None]:
matrix_general_data = general_data.corr(numeric_only = True)
matrix_general_data["Attrition"]
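
To read the correlation column more easily, the features can be ranked by the strength of their linear relationship with the target. The following is a sketch on made-up numbers, assuming `Attrition` has already been encoded as 0/1; `sample`, `target_corr`, and `ranked` are illustrative names.

```python
import pandas as pd

# Made-up numeric sample; Attrition is already encoded as 0/1
sample = pd.DataFrame(
    {
        "Attrition": [0, 0, 1, 1, 0, 1],
        "Age": [45, 50, 25, 28, 41, 30],
        "DistanceFromHome": [5, 9, 7, 3, 8, 6],
    }
)

# Correlation of every feature with the target, excluding the target itself
target_corr = sample.corr(numeric_only=True)["Attrition"].drop("Attrition")

# Order by absolute value so strong negative correlations are not buried
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation with a binary target only captures linear effects, so a low value does not rule out a feature being useful.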


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Assumes cleaned_general_data is already a DataFrame


X = cleaned_general_data[["Age"]]  # X must be a DataFrame (2D)
y = cleaned_general_data["TotalWorkingYears"]  # y is a Series (1D)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

plt.scatter(X_test, y_test, color="blue", label="Actual data")
plt.plot(X_test, y_pred, color="red", linewidth=2, label="Prediction")
plt.xlabel("Age")
plt.ylabel("Total Working Years")
plt.legend()
plt.title("Linear regression: TotalWorkingYears vs Age")
plt.show()


In [None]:
age = 40
# Use a DataFrame with the same feature name as the training data
age_input = pd.DataFrame({"Age": [age]})
predicted_working_years = model.predict(age_input)
print(f"Predicted Total Working Years for Age {age}: {round(predicted_working_years[0])}")


In [None]:
manager_survey_data.info()

In [None]:
employee_survey_data.info()

In [None]:
cleaned_out_time_data = out_time_data.dropna(axis=1,how='all')
cleaned_in_time_data = in_time_data.dropna(axis=1, how='all')

print(cleaned_in_time_data.shape,"in time;",cleaned_out_time_data.shape,"out time;")

In [None]:
cleaned_in_time_data

In [None]:
from pandas.plotting import scatter_matrix

# Plot pairwise relationships between the numeric columns of general_data
scatter_matrix(general_data.select_dtypes(include="number"), figsize=(30, 15))


In [None]:
general_data.columns

In [None]:
sample_incomplete_rows = general_data[general_data.isnull().any(axis=1)].head() 
sample_incomplete_rows

In [None]:
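from sklearn.preprocessing import OrdinalEncoder

# Assumption: Department is used here as an example categorical column
general_data_cat = general_data[["Department"]]

ordinal_encoder = OrdinalEncoder()
general_data_cat_encoded = ordinal_encoder.fit_transform(general_data_cat)
general_data_cat_encoded[:10]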

In [None]:
ordinal_encoder.categories_

In [None]:
from sklearn.preprocessing import OneHotEncoder


# PREPROCESSING THE DATA IN IN_TIME_DATA AND OUT_TIME_DATA
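
One possible way to preprocess the working-hours tables is to turn them into an average number of hours worked per day for each employee. The following is a sketch under stated assumptions: each table has one row per employee and one column per calendar day, timestamps are stored as strings, and a missing value means the employee was absent. The frames below are made-up stand-ins for `in_time_data` and `out_time_data`, and the `EmployeeID` index and `MeanDailyHours` name are illustrative.

```python
import pandas as pd

# Made-up stand-ins for in_time_data / out_time_data: one row per employee,
# one column per day, timestamps stored as strings, None for a day off
in_time = pd.DataFrame(
    {
        "EmployeeID": [1, 2],
        "2015-01-02": ["2015-01-02 09:30:00", "2015-01-02 10:00:00"],
        "2015-01-05": ["2015-01-05 10:00:00", None],
    }
).set_index("EmployeeID")
out_time = pd.DataFrame(
    {
        "EmployeeID": [1, 2],
        "2015-01-02": ["2015-01-02 17:00:00", "2015-01-02 18:00:00"],
        "2015-01-05": ["2015-01-05 17:30:00", None],
    }
).set_index("EmployeeID")

# Parse every column to datetime; unparsable or missing values become NaT
in_dt = in_time.apply(pd.to_datetime, errors="coerce")
out_dt = out_time.apply(pd.to_datetime, errors="coerce")

# Hours worked per day, then the mean over the days actually worked
daily_hours = (out_dt - in_dt).apply(lambda col: col.dt.total_seconds() / 3600)
mean_hours = daily_hours.mean(axis=1).rename("MeanDailyHours")
print(mean_hours)
```

The resulting series has one value per employee and could be joined to the other tables on the employee identifier, for example to compare actual hours with `StandardHours`.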