<h1><center>Employee Attrition Rate Prediction</center></h1>

<img src="https://www.electronicsb2b.com/wp-content/uploads/2019/01/1-8.jpg" />

## Table of Contents
<ul>
    <li>Importing Files</li>
    <li>Feature Engineering</li>
    <li>Exploratory Data Analysis</li>
    <li>Model Creation</li>
</ul>


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## About Dataset

Variable : Description

* Employee_ID : Unique ID of each employee
* Age : Age of each employee
* Unit : Department in which the employee works
* Education : Qualification rating of the employee (1-5)
* Gender : Male (0) or Female (1)
* Decision_skill_possess : Decision skill that the employee possesses
* Post_Level : Level of the post in the organization (1-5)
* Relationship_Status : Categorical, Married or Single
* Pay_Scale : Rating between 1 and 10
* Time_of_service : Years in the organization
* growth_rate : Growth rate of the employee, in percent
* Time_since_promotion : Time in years since the last promotion
* Work_Life_balance : Rating for work-life balance given by the employee
* Travel_Rate : Rating based on travel history (1-3)
* Hometown : Name of the city
* Compensation_and_Benefits : Categorical variable
* VAR1 - VAR5 : Anonymised variables
* Attrition_rate (target variable) : Attrition rate of each employee

## Loading train and test data

In [None]:
train = pd.read_csv("/kaggle/input/hackerearth-employee-attrition/Train.csv")
test = pd.read_csv("/kaggle/input/hackerearth-employee-attrition/Test.csv")

In [None]:
train.head()

## Understanding the data 

In [None]:
train.shape # shape of the train data

In [None]:
train.info()

## Unique Values

In [None]:
for col in train.columns:
    print(col,":",len(train[col].unique()))

In [None]:
# credit: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

# Feature Engineering

<h3><b>Handling Missing Values</b></h3>

In [None]:
missing= missing_values_table(train)
missing

In [None]:
from sklearn.impute import SimpleImputer
# 'most_frequent' works for both numeric and string columns;
# 'mean' and 'median' are also available, but only for numeric columns
imputer = SimpleImputer(strategy='most_frequent')
train.iloc[:, :] = imputer.fit_transform(train)

In [None]:
train.isna().sum()
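
The comment in the imputation cell mentions `mean` and `median` as alternative strategies. Those two only apply to numeric columns, so a frame with mixed types would need its columns split by type first. A minimal sketch of that idea on a hypothetical two-column frame (the data and the names `demo`, `num_cols` and `cat_cols` are made up for illustration and are not part of the competition data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numeric and one string column
demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "unit": pd.Series(["IT", "Sales", np.nan, "IT"], dtype=object),
})

# Split the columns by type
num_cols = demo.select_dtypes(include="number").columns
cat_cols = demo.columns.difference(num_cols)

# Median for numeric columns, most frequent value for string columns
demo[num_cols] = SimpleImputer(strategy="median").fit_transform(demo[num_cols])
demo[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(demo[cat_cols])

print(demo)
```

Whether the median is a better choice than the most frequent value depends on the column, so this is an option to try rather than a recommendation.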


**Correlation**
* Correlation measures both the strength and the direction of the linear relationship between two variables.
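
As a small illustration of that definition, the sketch below computes the Pearson correlation on a made-up three-column frame (the column names and values are hypothetical). A value near +1 or -1 indicates a strong linear relationship, and the sign gives its direction.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "output": [2, 4, 6, 8, 10],   # rises in step with hours
    "errors": [10, 8, 6, 4, 2],   # falls as hours rise
})

# DataFrame.corr uses the Pearson coefficient by default
toy_corr = toy.corr()
print(toy_corr.round(2))
```

Here `hours` and `output` are perfectly positively correlated, while `hours` and `errors` are perfectly negatively correlated.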

In [None]:
plt.figure(figsize=(16,8))
# numeric_only=True skips the string columns (requires pandas 1.5 or later)
corr = train.corr(numeric_only=True)
sns.heatmap(corr,annot=True,cmap='coolwarm',robust=True,fmt=".2f")
plt.show()

### Handling categorical variables

In [None]:
df1=pd.get_dummies(train['Unit'],drop_first=True)
df2=pd.get_dummies(train['Decision_skill_possess'],drop_first=True)
df3=pd.get_dummies(train['Hometown'],drop_first=True)
df4=pd.get_dummies(train['Compensation_and_Benefits'],drop_first=True)
dfg=pd.get_dummies(train['Gender'],drop_first=True)
dfs=pd.get_dummies(train['Relationship_Status'],drop_first=True)

In [None]:
train=pd.concat([df1,df2,df3,df4,dfg,dfs,train],axis=1)

In [None]:
# Drop the original categorical columns now that they are one-hot encoded

columns = ['Unit','Decision_skill_possess','Hometown','Compensation_and_Benefits','Gender','Relationship_Status']

train.drop(columns,axis=1,inplace=True)

In [None]:
train.head()

In [None]:
train.drop('Employee_ID',axis=1,inplace=True)

# Exploratory Data Analysis

In [None]:
!pip install pandas-profiling

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [None]:
%%capture
import pandas_profiling as pp

In [None]:
pp.ProfileReport(train)

# Model creation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,mean_squared_error

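In [None]:
# Separate the features (x) from the target (y) before splitting
x, y = train.drop('Attrition_rate', axis=1), train['Attrition_rate']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)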

In [None]:
print("x_train values count:",len(x_train))
print("y_train values count:",len(y_train))
print("x_test values count:",len(x_test))
print("y_test values count:",len(y_test))

In [None]:
import xgboost as xgb

model=xgb.XGBRegressor()

model.fit(x_train,y_train)

## Let's predict it!

In [None]:
y_pred=model.predict(x_test)

In [None]:
print("R^2 score:",r2_score(y_test,y_pred))
print("Mean squared error:",mean_squared_error(y_test,y_pred))

## Hyperparameter tuning

In [None]:
# Hyperparameter search space

params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,0.50 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV
xgb_model = xgb.XGBRegressor()

random_search=RandomizedSearchCV(xgb_model,param_distributions=params,n_iter=5,n_jobs=-1,cv=5,verbose=3)
random_search.fit(x_train,y_train)

In [None]:
random_search.best_params_ #printing best parameters

#### The randomized search gives us the best estimator; we use its parameters in our model to try to improve the score

In [None]:
random_search.best_estimator_  #best estimator 
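
With scikit-learn's default `refit=True`, the estimator stored in `best_estimator_` has already been refit on the full data passed to `fit`, so it can be used for prediction directly instead of copying its parameters by hand. A minimal sketch of that pattern follows. It uses a `DecisionTreeRegressor` and synthetic data as stand-ins (both are assumptions chosen to keep the example small, not the notebook's model or data):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={"max_depth": [2, 3, 4, 5],
                         "min_samples_leaf": [1, 5, 10]},
    n_iter=5,
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

# Already refit on all of X_demo with the best parameters found
best_model = search.best_estimator_
demo_pred = best_model.predict(X_demo[:5])
print(search.best_params_)
```

Applied to this notebook, the same idea would amount to `model = random_search.best_estimator_`, which avoids retyping parameters that change from run to run.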

In [None]:
model=xgb.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0.4, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=12,
             min_child_weight=5, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
model.fit(x_train,y_train)

## It's time for prediction

In [None]:
y_pred=model.predict(x_test)

## Evaluate the model

In [None]:
print("R^2 score:",r2_score(y_test,y_pred))
print("Mean squared error:",mean_squared_error(y_test,y_pred))

### Our model was successfully created.

## Repeat the steps for the test data

In [None]:
# Repeat the same preprocessing steps that were applied to the train data

test.isna().sum()

In [None]:
from sklearn.impute import SimpleImputer
# 'most_frequent' works for both numeric and string columns;
# 'mean' and 'median' are also available, but only for numeric columns
imputer = SimpleImputer(strategy='most_frequent')
test.iloc[:, :] = imputer.fit_transform(test)

In [None]:
df1=pd.get_dummies(test['Unit'],drop_first=True)
df2=pd.get_dummies(test['Decision_skill_possess'],drop_first=True)
df3=pd.get_dummies(test['Hometown'],drop_first=True)
df4=pd.get_dummies(test['Compensation_and_Benefits'],drop_first=True)
dfg=pd.get_dummies(test['Gender'],drop_first=True)
dfs=pd.get_dummies(test['Relationship_Status'],drop_first=True)

#concatenating all the dataframes into test data
test=pd.concat([df1,df2,df3,df4,dfg,dfs,test],axis=1)

#dropping the columns 
columns = ['Unit','Decision_skill_possess','Hometown','Compensation_and_Benefits','Gender','Relationship_Status','Employee_ID']

test.drop(columns,axis=1,inplace=True)
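
Because `pd.get_dummies` is run separately on the train and test frames, the two can end up with different dummy columns if a category appears in only one of them. One way to guard against that is to reindex the encoded test frame to the train columns. A minimal sketch on hypothetical toy frames (the names `train_demo`, `test_demo` and the `city` column are made up for illustration):

```python
import pandas as pd

# Hypothetical frames: category "C" appears only in the train data
train_demo = pd.DataFrame({"city": ["A", "B", "C"], "age": [30, 40, 50]})
test_demo = pd.DataFrame({"city": ["A", "B"], "age": [35, 45]})

train_enc = pd.get_dummies(train_demo, columns=["city"], drop_first=True)
test_enc = pd.get_dummies(test_demo, columns=["city"], drop_first=True)

# Match the train column set and order; dummy columns missing
# from the test frame are created and filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

In this notebook the equivalent call would reindex `test` to the columns of `x_train` before predicting, assuming the train frame's feature columns are the reference.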

In [None]:
y_pred = model.predict(test)

In [None]:
submission = pd.read_csv("/kaggle/input/hackerearth-employee-attrition/sample_submission.csv")
submission.head()

Saving the predictions (`y_pred`)

In [None]:
prediction = pd.DataFrame(y_pred, columns=['Attrition_rate'])  # column names must be a list, not a set

In [None]:
submission.drop('Attrition_rate',axis=1,inplace=True) #dropping default column

In [None]:
submission=pd.concat([submission,prediction],axis=1)
submission.head()
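
The cells above assemble the submission frame; writing it to disk takes one more call. A minimal sketch of that step, using a hypothetical two-row frame and a temporary directory so that it runs on its own (the `Employee_ID` values and the file name `submission.csv` are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real submission frame
demo_submission = pd.DataFrame({
    "Employee_ID": ["EID_1", "EID_2"],
    "Attrition_rate": [0.12, 0.34],
})

out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# index=False keeps the row index out of the file
demo_submission.to_csv(out_path, index=False)

# Read the file back to confirm what was written
saved = pd.read_csv(out_path)
print(saved)
```

In the notebook itself the call would be made on `submission`, for example `submission.to_csv("submission.csv", index=False)`.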


<img src="https://www.thebalancecareers.com/thmb/RyLF9TrC_n00gRUulM7aAfmvitE=/2122x1194/smart/filters:no_upscale()/185002046-56b0974c3df78cf772cfe3c5.jpg" />