# FINAL REPORT WITH OUR ENTIRE APPROACH

### Table of contents


* 1. [ETHICS](#ethics)
* 2. [RAW DATASET DESCRIPTION](#raw-dataset-description)
* 3. [DATA WRANGLING](#data-wrangling)
* 4. [CHOICE OF AI MODEL](#choice-of-ai-model)
* 5. [MODEL APPLICATION](#model-application)
* 6. [RESULT ANALYSIS](#result-analysis)
* 7. [OPTIMISATION STEPS](#optimisation-steps)
* 8. [SELECTED MODEL AFTER COMPARING](#selected-model-after-comparing)
* 9. [FINAL MODEL AND JUSTIFICATION](#final-model-and-justification)

## 1. ETHICS

Based on the ethical principles outlined and following the European Commission's guidelines, we have made a deliberate decision to exclude certain data attributes from our predictive model for employee turnover at HumanForYou. This decision is rooted in our commitment to fairness, non-discrimination, and the protection of employee privacy.

**Attributes Removed and Ethical Justification**
- **Age**: this attribute has been excluded to prevent age discrimination. Using age in a predictive model could unfairly bias outcomes against younger or older employees, which is illegal under employment equality laws and violates the core ethical principle of non-discrimination.
- **Gender**: this attribute has been excluded to prevent gender bias. Including gender risks perpetuating existing societal inequalities and could lead to discriminatory outcomes, regardless of the model's statistical findings. Its removal is essential for upholding fairness.
- **MaritalStatus_Married / MaritalStatus_Single**: these attributes have been excluded to protect employee privacy and prevent familial status discrimination. Marital status is a private matter with no legitimate, direct correlation to job performance or turnover risk. Using it could disadvantage certain social groups and is considered a protected characteristic under many fairness regulations.
- **WorkLifeBalance**: this attribute has been excluded due to significant privacy and ethical interpretation risks. It conflates personal circumstances with workplace demands, and a low score could be misinterpreted as an employee's lack of resilience rather than a failure of company policy. Applying it algorithmically risks blaming the employee for systemic issues and invading their private life management.
- **AvgWorkingHours**: this attribute has been excluded to prevent unfair surveillance and punitive use of operational data. Although it is a factual metric, using individual working hour patterns to predict turnover creates a surveillance culture. It could penalize employees for the company's failure to manage workloads sustainably. The ethical response to high average hours is to review team staffing and project planning at a managerial level, not to flag individuals in a predictive system.
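
The exclusions listed above could be applied to a prepared DataFrame along the following lines. This is a minimal sketch, not the project's actual preprocessing code: the helper name `drop_sensitive_columns` and the constant `SENSITIVE_COLUMNS` are hypothetical, and the `MaritalStatus_*` names assume that one-hot encoding has already been applied.

```python
import pandas as pd

# Attributes named in the list above. The MaritalStatus_* names assume
# one-hot encoding was already applied (an assumption of this sketch).
SENSITIVE_COLUMNS = [
    "Age",
    "Gender",
    "MaritalStatus_Married",
    "MaritalStatus_Single",
    "WorkLifeBalance",
    "AvgWorkingHours",
]


def drop_sensitive_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the ethically excluded attributes."""
    # errors="ignore" skips any listed name that is absent from df,
    # so the helper also works before some columns have been created.
    return df.drop(columns=SENSITIVE_COLUMNS, errors="ignore")


# Toy frame with made-up values, only to illustrate the call.
example = pd.DataFrame(
    {"Age": [51, 31], "Gender": ["Female", "Female"], "JobLevel": [1, 1]}
)
cleaned = drop_sensitive_columns(example)
print(list(cleaned.columns))
```

Keeping the excluded names in a single list makes the ethical decision easy to audit and to reuse consistently across the training and prediction steps.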

**CONCLUSION**

The removal of these attributes is a proactive ethical decision. It ensures our model for HumanForYou builds trust, complies with the spirit of data protection laws like GDPR, and focuses its analysis on truly actionable and non-discriminatory factors within the company's direct control. By excluding these sensitive and ethically charged variables, we focus the solution on improving managerial practices, compensation fairness, and career development paths, creating sustainable improvements for the entire organization.



## 2. RAW DATASET DESCRIPTION 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 

In [5]:
extract_path = 'dataset'

# The dataset is expected as 'general_data.csv' inside the 'dataset' directory
csv_file_path = os.path.join(extract_path, 'general_data.csv')

# Load the dataset
general_data = pd.read_csv(csv_file_path)

In [7]:
general_data.head()


|  | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeID | Gender | ... | NumCompaniesWorked | Over18 | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | No | Travel_Rarely | Sales | 6 | 2 | Life Sciences | 1 | 1 | Female | ... | 1.0 | Y | 11 | 8 | 0 | 1.0 | 6 | 1 | 0 | 0 |
| 1 | 31 | Yes | Travel_Frequently | Research & Development | 10 | 1 | Life Sciences | 1 | 2 | Female | ... | 0.0 | Y | 23 | 8 | 1 | 6.0 | 3 | 5 | 1 | 4 |
| 2 | 32 | No | Travel_Frequently | Research & Development | 17 | 4 | Other | 1 | 3 | Male | ... | 1.0 | Y | 15 | 8 | 3 | 5.0 | 2 | 5 | 0 | 3 |
| 3 | 38 | No | Non-Travel | Research & Development | 2 | 5 | Life Sciences | 1 | 4 | Male | ... | 3.0 | Y | 11 | 8 | 3 | 13.0 | 5 | 8 | 7 | 5 |
| 4 | 32 | No | Travel_Rarely | Research & Development | 10 | 1 | Medical | 1 | 5 | Male | ... | 4.0 | Y | 12 | 8 | 2 | 9.0 | 2 | 6 | 0 | 4 |


In [8]:
general_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   Department               4410 non-null   object 
 4   DistanceFromHome         4410 non-null   int64  
 5   Education                4410 non-null   int64  
 6   EducationField           4410 non-null   object 
 7   EmployeeCount            4410 non-null   int64  
 8   EmployeeID               4410 non-null   int64  
 9   Gender                   4410 non-null   object 
 10  JobLevel                 4410 non-null   int64  
 11  JobRole                  4410 non-null   object 
 12  MaritalStatus            4410 non-null   object 
 13  MonthlyIncome            4410 non-null   int64  
 14  NumCompaniesWorked      

In [9]:
general_data.describe()

|  | Age | DistanceFromHome | Education | EmployeeCount | EmployeeID | JobLevel | MonthlyIncome | NumCompaniesWorked | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 | 4391.0 | 4410.0 | 4410.0 | 4410.0 | 4401.0 | 4410.0 | 4410.0 | 4410.0 | 4410.0 |
| mean | 36.92381 | 9.192517 | 2.912925 | 1.0 | 2205.5 | 2.063946 | 65029.312925 | 2.69483 | 15.209524 | 8.0 | 0.793878 | 11.279936 | 2.79932 | 7.008163 | 2.187755 | 4.123129 |
| std | 9.133301 | 8.105026 | 1.023933 | 0.0 | 1273.201673 | 1.106689 | 47068.888559 | 2.498887 | 3.659108 | 0.0 | 0.851883 | 7.782222 | 1.288978 | 6.125135 | 3.221699 | 3.567327 |
| min | 18.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10090.0 | 0.0 | 11.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 2.0 | 2.0 | 1.0 | 1103.25 | 1.0 | 29110.0 | 1.0 | 12.0 | 8.0 | 0.0 | 6.0 | 2.0 | 3.0 | 0.0 | 2.0 |
| 50% | 36.0 | 7.0 | 3.0 | 1.0 | 2205.5 | 2.0 | 49190.0 | 2.0 | 14.0 | 8.0 | 1.0 | 10.0 | 3.0 | 5.0 | 1.0 | 3.0 |
| 75% | 43.0 | 14.0 | 4.0 | 1.0 | 3307.75 | 3.0 | 83800.0 | 4.0 | 18.0 | 8.0 | 1.0 | 15.0 | 3.0 | 9.0 | 3.0 | 7.0 |
| max | 60.0 | 29.0 | 5.0 | 1.0 | 4410.0 | 5.0 | 199990.0 | 9.0 | 25.0 | 8.0 | 3.0 | 40.0 | 6.0 | 40.0 | 15.0 | 17.0 |
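
In the summary statistics, `NumCompaniesWorked` (4391) and `TotalWorkingYears` (4401) have counts below the 4410 rows, which indicates missing values, while `EmployeeCount` and `StandardHours` have a standard deviation of 0.0 and so carry no information. A check of this kind could be automated roughly as follows. This is only a sketch: the helper name `summarize_quality` is hypothetical and the toy frame uses made-up values rather than the real dataset.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value count and a flag for constant columns."""
    return pd.DataFrame(
        {
            # Number of NaN entries in each column.
            "missing": df.isna().sum(),
            # True when a column holds at most one distinct non-null value.
            "constant": df.nunique(dropna=True) <= 1,
        }
    )


# Toy frame with made-up values, only to illustrate the call.
toy = pd.DataFrame(
    {
        "NumCompaniesWorked": [1.0, None, 3.0],
        "StandardHours": [8, 8, 8],
    }
)
report = summarize_quality(toy)
print(report)
```

On the real data, the same helper would be called as `summarize_quality(general_data)`.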


## 3. DATA WRANGLING

## 4. CHOICE OF AI MODEL

## 5. MODEL APPLICATION

## 6. RESULT ANALYSIS

## 7. OPTIMISATION STEPS

## 8. SELECTED MODEL AFTER COMPARING

## 9. FINAL MODEL AND JUSTIFICATION