### Competition Link:
[click here](https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-predict-employee-attrition-rate/machine-learning/predict-the-employee-attrition-rate-in-organizations-1d700a97/) 

### Problem statement
Employees are the most important part of an organization. Successful employees meet deadlines, make sales, and build the brand through positive customer interactions.

Employee attrition is a major cost to an organization and predicting such attritions is the most important requirement of the Human Resources department in many organizations. In this problem, your task is to predict the attrition rate of employees of an organization. 

### Data
* Train.csv
* Test.csv

### Submission format
You are required to write your predictions to a .csv file that contains the following columns:
* Employee_ID
* Attrition_rate

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px

%matplotlib inline

## Load the dataset as training data and submission data

In [2]:

data = pd.read_csv('../input/hackerearth-employee-attrition-rate/Train.csv')
submission_data = pd.read_csv('../input/hackerearth-employee-attrition-rate/Test.csv')

In [3]:
data.shape

(7000, 24)

In [4]:
submission_data.shape

(3000, 23)

In [5]:
data.head()

,Employee_ID,Gender,Age,Education_Level,Relationship_Status,Hometown,Unit,Decision_skill_possess,Time_of_service,Time_since_promotion,...,Compensation_and_Benefits,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,Attrition_rate
0,EID_23371,F,42.0,4,Married,Franklin,IT,Conceptual,4.0,4,...,type2,3.0,4,0.7516,1.8688,2.0,4,5,3,0.1841
1,EID_18000,M,24.0,3,Single,Springfield,Logistics,Analytical,5.0,4,...,type2,4.0,3,-0.9612,-0.4537,2.0,3,5,3,0.067
2,EID_3891,F,58.0,3,Married,Clinton,Quality,Conceptual,27.0,3,...,type2,1.0,4,-0.9612,-0.4537,3.0,3,8,3,0.0851
3,EID_17492,F,26.0,3,Single,Lebanon,Human Resource Management,Behavioral,4.0,3,...,type2,1.0,3,-1.8176,-0.4537,,3,7,3,0.0668
4,EID_22534,F,31.0,1,Married,Springfield,Logistics,Conceptual,5.0,4,...,type3,3.0,1,0.7516,-0.4537,2.0,2,8,2,0.1827


In [6]:
data.describe()

,Age,Education_Level,Time_of_service,Time_since_promotion,growth_rate,Travel_Rate,Post_Level,Pay_Scale,Work_Life_balance,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,Attrition_rate
count,6588.0,7000.0,6856.0,7000.0,7000.0,7000.0,7000.0,6991.0,6989.0,7000.0,6423.0,7000.0,6344.0,7000.0,7000.0,7000.0,7000.0
mean,39.622799,3.187857,13.385064,2.367143,47.064286,0.817857,2.798,6.006294,2.387895,3.098571,-0.008126,-0.013606,1.891078,2.834143,7.101286,3.257,0.189376
std,13.60692,1.065102,10.364188,1.149395,15.761406,0.648205,1.163721,2.058435,1.122786,0.836377,0.98985,0.986933,0.529403,0.938945,1.164262,0.925319,0.185753
min,19.0,1.0,0.0,0.0,20.0,0.0,1.0,1.0,1.0,1.0,-1.8176,-2.7762,1.0,1.0,5.0,1.0,0.0
25%,27.0,3.0,5.0,1.0,33.0,0.0,2.0,5.0,1.0,3.0,-0.9612,-0.4537,2.0,2.0,6.0,3.0,0.0704
50%,37.0,3.0,10.0,2.0,47.0,1.0,3.0,6.0,2.0,3.0,-0.1048,-0.4537,2.0,3.0,7.0,3.0,0.14265
75%,52.0,4.0,21.0,3.0,61.0,1.0,3.0,8.0,3.0,3.0,0.7516,0.7075,2.0,3.0,8.0,4.0,0.235
max,65.0,5.0,43.0,4.0,74.0,2.0,5.0,10.0,5.0,5.0,1.6081,1.8688,3.0,5.0,9.0,5.0,0.9959


In [7]:
data.columns

Index(['Employee_ID', 'Gender', 'Age', 'Education_Level',
       'Relationship_Status', 'Hometown', 'Unit', 'Decision_skill_possess',
       'Time_of_service', 'Time_since_promotion', 'growth_rate', 'Travel_Rate',
       'Post_Level', 'Pay_Scale', 'Compensation_and_Benefits',
       'Work_Life_balance', 'VAR1', 'VAR2', 'VAR3', 'VAR4', 'VAR5', 'VAR6',
       'VAR7', 'Attrition_rate'],
      dtype='object')

## List the important features through XGBoost feature importance
### Tutorials
1. [DataCamp](https://www.datacamp.com/community/tutorials/xgboost-in-python)
2. [machinelearningmastery](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/#:~:text=XGBoost%20is%20an%20implementation%20of,first%20XGBoost%20model%20in%20Python.)

I did this in a separate notebook; the `features` list below is the result.
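
The feature-importance notebook itself is not shown here. As a rough illustration of the idea only, the sketch below ranks features by importance using scikit-learn's `GradientBoostingRegressor`, which is a substitute for XGBoost and not the model used for the list that follows. The data is synthetic, and `X_demo`, `y_demo`, and `demo_names` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: only the first column actually drives the target.
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

booster = GradientBoostingRegressor(n_estimators=50, random_state=0)
booster.fit(X_demo, y_demo)

# Pair each (hypothetical) feature name with its importance and sort descending.
demo_names = ['informative', 'noise_1', 'noise_2']
ranking = sorted(
    zip(demo_names, booster.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f'{name}: {importance:.3f}')
```

On the real data, the same pattern would apply to the encoded training columns, keeping the top-ranked names as the feature list.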

In [11]:
features = ['Age', 'Compensation_and_Benefits', 'Work_Life_balance', 'Post_Level', 'growth_rate', 'Time_of_service', 'Pay_Scale', 'Hometown', 'Education_Level']

In [12]:
data[features].isna().sum()

Age                          412
Compensation_and_Benefits      0
Work_Life_balance             11
Post_Level                     0
growth_rate                    0
Time_of_service              144
Pay_Scale                      9
Hometown                       0
Education_Level                0
dtype: int64

In [13]:
# Convert the categorical data into numeric values

for feature in features:
    if data[feature].dtype == 'object':
        data[feature] = data[feature].astype('category')
        data[feature] = data[feature].cat.codes
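
The loop above converts only the training frame. Here is a minimal sketch of how `cat.codes` assigns integers, and how the training categories could be reused so that a second frame gets the same codes. The values `type0` and `type4` are made up for illustration, and `train_col` and `test_col` are hypothetical names.

```python
import pandas as pd

train_col = pd.Series(['type2', 'type3', 'type0', 'type2'])
test_col = pd.Series(['type3', 'type4', 'type0'])  # 'type4' never appears in train_col

# Categories are sorted, so codes follow alphabetical order of the labels.
train_cat = train_col.astype('category')
train_codes = train_cat.cat.codes

# Reuse the training categories so equal labels map to equal codes.
# Labels missing from the training categories are coded as -1.
test_codes = pd.Categorical(test_col, categories=train_cat.cat.categories).codes

print(list(train_cat.cat.categories))
print(list(train_codes))
print(list(test_codes))
```

A code of -1 flags a label that the training data never contained, which is easier to spot than a silently shifted mapping.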

## Observations after many iterations
* mean value -> (replacing null values with the mean gives a score of 81.208%)
* 75% value -> (replacing null values with the 75% value gives a score of 81.283%)
* Highest score is 81.668%
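
The two fill strategies compared above can be computed from the data instead of being typed in by hand. A minimal sketch on a toy column (the `toy` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0, 50.0]})

# Both statistics skip missing values by default.
mean_fill = toy['Age'].mean()
q75_fill = toy['Age'].quantile(0.75)

filled_mean = toy['Age'].fillna(mean_fill)
filled_q75 = toy['Age'].fillna(q75_fill)

print(mean_fill, q75_fill)
```

The 75% row of `data.describe()` reports the same quantile for each numeric column.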

In [14]:
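# Fill missing values with the 75% values from data.describe() above.
# Assigning the result back avoids calling inplace=True on a column selection.
data['Age'] = data['Age'].fillna(52)
data['Work_Life_balance'] = data['Work_Life_balance'].fillna(3)
data['Time_of_service'] = data['Time_of_service'].fillna(21)
data['Pay_Scale'] = data['Pay_Scale'].fillna(8)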

In [16]:
data[features].isna().sum()

Age                          0
Compensation_and_Benefits    0
Work_Life_balance            0
Post_Level                   0
growth_rate                  0
Time_of_service              0
Pay_Scale                    0
Hometown                     0
Education_Level              0
dtype: int64

In [17]:
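# Apply the same fill values to the submission data.
submission_data['Age'] = submission_data['Age'].fillna(52)
submission_data['Work_Life_balance'] = submission_data['Work_Life_balance'].fillna(3)
submission_data['Time_of_service'] = submission_data['Time_of_service'].fillna(21)
submission_data['Pay_Scale'] = submission_data['Pay_Scale'].fillna(8)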

In [18]:
submission_data[features].isna().sum()

Age                          0
Compensation_and_Benefits    0
Work_Life_balance            0
Post_Level                   0
growth_rate                  0
Time_of_service              0
Pay_Scale                    0
Hometown                     0
Education_Level              0
dtype: int64

In [23]:
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [24]:
X, y = data[features].values, data['Attrition_rate'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

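def prepare_inputs(X_train, X_test):
    # Fit the ordinal encoder on the training split only, then apply it to both splits.
    encoder = OrdinalEncoder()
    encoder.fit(X_train)
    X_train_enc = encoder.transform(X_train)
    X_test_enc = encoder.transform(X_test)
    return X_train_enc, X_test_enc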

X_train, X_test = prepare_inputs(X_train, X_test)

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics


model = LinearRegression()
model.fit(X_train, y_train)

print('Intercept: \n', model.intercept_)
print('Coefficients: \n', model.coef_)

Intercept: 
 0.20702007867602656
Coefficients: 
 [-1.01373785e-04 -7.56724859e-03  2.63282869e-03  2.26938309e-03
  1.97094785e-04 -5.01009982e-05 -1.91362267e-03  1.50250847e-03
 -8.52149781e-04]
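
The coefficient array is easier to read when paired with the feature names. The sketch below copies the printed coefficients and the `features` list from the cells above and sorts them by magnitude. Because the inputs were ordinal-encoded, each coefficient is a change per encoded step, not per original unit.

```python
import pandas as pd

features = ['Age', 'Compensation_and_Benefits', 'Work_Life_balance', 'Post_Level',
            'growth_rate', 'Time_of_service', 'Pay_Scale', 'Hometown', 'Education_Level']

# Values copied from the printed model.coef_ output above.
coefficients = [-1.01373785e-04, -7.56724859e-03, 2.63282869e-03, 2.26938309e-03,
                1.97094785e-04, -5.01009982e-05, -1.91362267e-03, 1.50250847e-03,
                -8.52149781e-04]

coef_by_feature = pd.Series(coefficients, index=features)

# Order by absolute size, largest first, while keeping the sign.
order = coef_by_feature.abs().sort_values(ascending=False).index
ordered = coef_by_feature.reindex(order)
print(ordered)
```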


In [26]:
output = model.predict(X_test)

df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': output.flatten()})
df

,Actual,Predicted
0,0.1642,0.176026
1,0.0760,0.188350
2,0.2246,0.189556
3,0.3232,0.176016
4,0.1808,0.192105
...,...,...
1395,0.3332,0.201107
1396,0.0375,0.195420
1397,0.1613,0.186895
1398,0.1645,0.199594


In [27]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, output))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, output))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, output)))

Mean Absolute Error: 0.12111537844227241
Mean Squared Error: 0.030281866424668746
Root Mean Squared Error: 0.1740168567256309
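
The RMSE of about 0.174 is close to the standard deviation of `Attrition_rate` reported by `data.describe()` (about 0.186), so it may be worth comparing the model against a predictor that always returns the training mean. A minimal sketch of that baseline on toy numbers (`rmse`, `mean_baseline_rmse`, and the toy lists are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_baseline_rmse(y_train, y_test):
    """RMSE of always predicting the mean of the training targets."""
    constant = float(np.mean(y_train))
    return rmse(y_test, np.full(len(y_test), constant))

toy_train = [0.1, 0.2, 0.3]
toy_test = [0.1, 0.3]
baseline = mean_baseline_rmse(toy_train, toy_test)
print(baseline)
```

In the notebook, calling `mean_baseline_rmse(y_train, y_test)` would give the number to compare with the model's RMSE.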


In [29]:
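XX = submission_data[features].values

# Note: this fits a separate encoder on the submission data. Its integer codes
# line up with the training codes only if each column has the same set of
# distinct values in both files.
encoder = OrdinalEncoder()
encoder.fit(XX)
XX = encoder.transform(XX)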
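
One way to keep the encoding consistent is to fit a single encoder on the training rows and reuse it for any later data. The sketch below is an illustration on made-up rows, not the code used for the submission. It assumes scikit-learn 0.24 or newer for `handle_unknown='use_encoded_value'`, and `train_rows`, `new_rows`, and `shared_encoder` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_rows = np.array([['type2', 'Franklin'],
                       ['type3', 'Lebanon'],
                       ['type2', 'Clinton']], dtype=object)
# 'Springfield' does not occur in train_rows.
new_rows = np.array([['type3', 'Franklin'],
                     ['type2', 'Springfield']], dtype=object)

# Unseen values are mapped to -1 instead of raising an error.
shared_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
shared_encoder.fit(train_rows)

train_enc = shared_encoder.transform(train_rows)
new_enc = shared_encoder.transform(new_rows)

print(train_enc.tolist())
print(new_enc.tolist())
```

With this pattern, `prepare_inputs` would need to return the fitted encoder as well, so the same object can transform the submission data.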

In [31]:
y_predict = model.predict(XX)

# Write the predictions in the required format: Employee_ID, Attrition_rate.
# Building a DataFrame avoids hard-coding the number of rows.
submission = pd.DataFrame({
    'Employee_ID': submission_data['Employee_ID'],
    'Attrition_rate': y_predict,
})
submission.to_csv('5th_submission.csv', index=False)