# Project Statement:
### Portobello Tech is an app innovator that has devised an intelligent way of predicting employee turnover within the company. It periodically evaluates employees' work details, including the number of projects they worked on, average monthly working hours, time spent in the company, promotions in the last five years, and salary level.
### Data from prior evaluations shows the employees’ satisfaction in the workplace. The data could be used to identify patterns in work style and their interest in continuing to work for the company.
### The HR Department owns the data and uses it to predict employee turnover. Employee turnover refers to the total number of workers who leave a company over time.
### As the ML developers assigned to the HR Department, we have been asked to create ML programs to:
### 1. Perform data quality checks by checking for missing values, if any.
### 2. Understand what factors contributed most to employee turnover through exploratory data analysis (EDA).
### 3. Perform clustering of employees who left based on their satisfaction and evaluation.
### 4. Handle the class imbalance in the 'left' column using the SMOTE technique.
### 5. Perform k-fold cross-validation model training and evaluate performance.
### 6. Identify the best model and justify the evaluation metrics used.
### 7. Suggest various retention strategies for targeted employees.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

### Load the data into a data frame and check the first 5 rows

In [12]:
# open hr_comma_sep.csv
df = pd.read_csv('hr_comma_sep.csv')

# check head of the data
print(df.head())


   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157   
1                0.80             0.86               5                   262   
2                0.11             0.88               7                   272   
3                0.72             0.87               5                   223   
4                0.37             0.52               2                   159   

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales   
1                   6              0     1                      0  sales   
2                   4              0     1                      0  sales   
3                   5              0     1                      0  sales   
4                   3              0     1                      0  sales   

   salary  
0     low  
1  medium  
2  medium  
3     low  
4 

### The dataset has 10 columns. The 9th column is labeled 'sales', but its values appear to be the department to which each employee belongs, so the label looks wrong

In [13]:
# check the unique values of the 'sales' column
print(df['sales'].unique())

['sales' 'accounting' 'hr' 'technical' 'support' 'management' 'IT'
 'product_mng' 'marketing' 'RandD']
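
The same check can be applied to every non-numeric column at once, which makes a mislabeled column easier to spot. The sketch below is only illustrative: `categorical_summary` is a hypothetical helper that is not part of the original notebook, and `sample` is a small made-up frame standing in for `df`.

```python
import pandas as pd

# Small illustrative frame (made-up rows, NOT the HR data set).
sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "sales": ["sales", "accounting", "hr"],
    "salary": ["low", "medium", "medium"],
})

def categorical_summary(frame):
    """Return {column: sorted unique values} for every non-numeric column.

    Hypothetical helper for illustration only.
    """
    non_numeric = frame.select_dtypes(exclude="number").columns
    return {col: sorted(frame[col].dropna().unique()) for col in non_numeric}

summary = categorical_summary(sample)
for col, values in summary.items():
    print(f"{col}: {values}")
```

Calling `categorical_summary(df)` on the real data frame should list the values of both text columns, assuming `df` has been loaded as in the cell above.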


### We will change the 'sales' column name to 'department'

In [14]:
# Change the name of the column 'sales' to 'department'
df = df.rename(columns={'sales': 'department'})

# print column names
print(df.columns)

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')


### Print information about the data frame to identify any missing values or other anomalies

In [16]:
# Summarize dtypes and non-null counts.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
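
Every column reports 14999 non-null values out of 14999 entries, so the summary above suggests there are no missing values. A more explicit check counts missing values per column. The sketch below is a minimal example under stated assumptions: `missing_value_report` is a hypothetical helper that is not part of the original notebook, and `sample` is a made-up frame with gaps injected on purpose so that the report has something to show.

```python
import numpy as np
import pandas as pd

# Illustrative frame with deliberately injected gaps (NOT the HR data set).
sample = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "salary": ["low", "medium", None, "low"],
})

def missing_value_report(frame):
    """Return the count and percentage of missing values per column.

    Hypothetical helper for illustration only.
    """
    counts = frame.isna().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })

report = missing_value_report(sample)
print(report)
```

Running `missing_value_report(df)` on the real data frame would be expected to show zero in the `missing` column for all ten columns, given the non-null counts printed by `df.info()`.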
