# Decision Tree and Random Forest

# Problem Statement

## Business Case: Based on the given features, we need to predict whether an employee will leave the company.

# Import Libraries

In [1]:
## Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Load Data

In [2]:
## Loading the data
data=pd.read_csv('HR-Employee-Attrition.csv')

In [3]:
data

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8


# Domain Analysis

In [8]:
data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

This is an HR problem: we need to predict employee attrition from the columns listed above.

### Attrition (target variable, y)

This column indicates whether the employee leaves the company (Yes/No).
It is a binary categorical (nominal) variable.
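
Because the target is a Yes/No label, it can help to look at how the two classes are distributed before any modeling. The sketch below uses a small hand-made Series as a stand-in for `data['Attrition']` (the values are illustrative and not taken from the dataset) and shows one possible way to count the classes and map them to 0/1.

```python
import pandas as pd

# Toy stand-in for data['Attrition']; these values are illustrative only.
attrition = pd.Series(['Yes', 'No', 'No', 'No', 'Yes', 'No'], name='Attrition')

# Absolute and relative class frequencies.
class_counts = attrition.value_counts()
class_share = attrition.value_counts(normalize=True)

# Binary encoding of the target: Yes -> 1 (leaves), No -> 0 (stays).
target = attrition.map({'Yes': 1, 'No': 0})

print(class_counts)
print(class_share.round(2))
```

If one class is much rarer than the other, plain accuracy becomes a weak metric, which is the usual motivation for balancing the training data later on.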

### Age

This column represents the age of the employee.
It is a numerical variable.

### BusinessTravel

This column indicates how often the employee travels for work (e.g., Travel_Rarely, Travel_Frequently).
It is a categorical variable.

### DailyRate

This column represents the daily salary rate of the employee.
It is a numerical variable.

### Department

This column represents the department of the employee (e.g., Sales, Research & Development).
It is a categorical variable.

### DistanceFromHome

This column shows the distance between the employee’s home and workplace.
It is a numerical variable.

### Education

This column shows the education level of the employee (e.g., 1 to 5 scale).
It is an ordinal categorical variable.

### EducationField

This column represents the field of education (e.g., Life Sciences, Medical, Technical Degree).
It is a categorical variable.
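
Education and EducationField illustrate the difference between an ordinal and a nominal column, and that difference matters when the data is encoded for a model. The sketch below is one possible treatment, on a toy frame with illustrative values: the ordinal column is kept as integers (its order is meaningful), while the nominal column is one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Toy frame; the values are illustrative, not rows from the dataset.
toy = pd.DataFrame({
    'Education': [2, 1, 4, 3],  # ordinal: already an ordered integer scale
    'EducationField': ['Life Sciences', 'Medical', 'Other', 'Medical'],  # nominal
})

# One-hot encode only the nominal column; the ordinal column passes through unchanged.
encoded = pd.get_dummies(toy, columns=['EducationField'], dtype=int)

print(encoded)
```

Tree-based models split on thresholds, so integer-coded ordinal columns can usually be used as they are, whereas integer-coding a nominal column would impose an order that does not exist.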

### EmployeeCount

This column has the same value (1) in every row shown above.
It is a numerical variable, but a constant column is not useful for prediction.

### EmployeeNumber

This is a unique ID assigned to each employee.
It is a numerical identifier (should not be used for prediction).

### EnvironmentSatisfaction

This column shows satisfaction level with the workplace environment (1 to 4).
It is an ordinal categorical variable.

### Gender

This column represents the gender of the employee (Male/Female).
It is a categorical variable.

### HourlyRate

This column gives the hourly salary rate.
It is a numerical variable.

### JobInvolvement

This column indicates how involved the employee is in their job (1 to 4 scale).
It is an ordinal categorical variable.

### JobLevel

This column represents the job level or seniority of the employee.
It is an ordinal categorical variable.

### JobRole

This column indicates the specific role of the employee (e.g., Sales Executive, Research Scientist).
It is a categorical variable.

### JobSatisfaction

This column shows the employee’s job satisfaction level (1 to 4).
It is an ordinal categorical variable.

### MaritalStatus

This column indicates marital status (Single, Married, Divorced).
It is a categorical variable.

### MonthlyIncome

This column shows the monthly salary of the employee.
It is a numerical variable.

### MonthlyRate

This column shows the monthly billing rate.
It is a numerical variable.

### NumCompaniesWorked

This column shows how many companies the employee has worked for previously.
It is a numerical variable.

### Over18

This column indicates whether the employee is above 18 years of age.
It is a categorical variable but not useful (usually constant).

### OverTime

This column shows whether the employee works overtime (Yes/No).
It is a binary categorical variable.

### PercentSalaryHike

This column represents the percentage salary increase.
It is a numerical variable.

### PerformanceRating

This column shows the performance rating (1 to 4).
It is an ordinal categorical variable.

### RelationshipSatisfaction

This column indicates the satisfaction level with relationships at work (1 to 4).
It is an ordinal categorical variable.

### StandardHours

This column has the same value (80) in every row shown above.
It is a numerical variable, but a constant column is not useful for prediction.

### StockOptionLevel

This column shows the stock option level awarded to the employee (0 to 3).
It is an ordinal categorical variable.

### TotalWorkingYears

This column shows the total years of working experience.
It is a numerical variable.

### TrainingTimesLastYear

This column shows the number of training sessions attended last year.
It is a numerical variable.

### WorkLifeBalance

This column represents work-life balance satisfaction (1 to 4).
It is an ordinal categorical variable.

### YearsAtCompany

This column represents the number of years the employee has been with the company.
It is a numerical variable.

### YearsInCurrentRole

This column represents how long the employee has been in their current role.
It is a numerical variable.

### YearsSinceLastPromotion

This column indicates how many years have passed since the employee’s last promotion.
It is a numerical variable.

### YearsWithCurrManager

This column shows the number of years the employee has worked with their current manager.
It is a numerical variable.
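
The descriptions above flag EmployeeCount, Over18 and StandardHours as constant and EmployeeNumber as an identifier. A minimal sketch of how such columns could be detected with `nunique` follows; the toy frame and its values are assumptions made for illustration, and on the real data the same logic would be applied to `data`.

```python
import pandas as pd

# Toy frame mimicking a few columns; the values are illustrative only.
toy = pd.DataFrame({
    'Age': [41, 49, 37, 41],
    'EmployeeCount': [1, 1, 1, 1],
    'EmployeeNumber': [1, 2, 4, 5],
    'Over18': ['Y', 'Y', 'Y', 'Y'],
    'StandardHours': [80, 80, 80, 80],
})

n_unique = toy.nunique()

# Columns with a single distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

reduced = toy.drop(columns=constant_cols + id_like_cols)
print(constant_cols, id_like_cols, list(reduced.columns))
```

The "one distinct value per row" rule is only a heuristic: a continuous feature can also be unique in every row, so flagged columns should be reviewed before they are dropped.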

In [6]:
# Plot how each numerical feature is distributed across the two "Attrition" classes
# Assumption: data2 is not defined in the cells shown; here it holds the numerical columns of data
data2 = data.select_dtypes(include='number')
plt.figure(figsize=(20,25), facecolor='white')  # canvas size
plotnumber = 1  # counter for the current subplot

for column in data2:  # iterate over the column names of data2
    if plotnumber <= 16:  # the 4x4 grid holds at most 16 plots
        ax = plt.subplot(4, 4, plotnumber)  # next cell of the 4x4 grid
        sns.histplot(x=data2[column], hue=data.Attrition)  # histogram split by target class
        plt.xlabel(column, fontsize=20)  # x-axis label with a larger font
        plt.ylabel('Count', fontsize=20)  # y-axis label (a histogram shows counts)
    plotnumber += 1  # advance the counter
plt.tight_layout()

In [7]:
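## Balancing the data
from collections import Counter  # counts the samples in each class
from imblearn.over_sampling import SMOTE  # oversamples the minority class
from sklearn.model_selection import train_test_split
# Assumption: the original split is not shown, so the features are one-hot encoded and split 80/20 here
x, y = pd.get_dummies(data.drop(columns='Attrition'), dtype=int), data['Attrition']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)
sm = SMOTE(random_state=42)
x_smote, y_smote = sm.fit_resample(x_train, y_train)  # resample the training set only

In [8]:
print("actual", Counter(y_train))
print("after smote", Counter(y_smote))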
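
To make the balancing step less of a black box, the sketch below illustrates the core idea behind SMOTE: a synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. This is a simplified NumPy illustration, not imbalanced-learn's implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))           # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its neighbours
        gap = rng.random()                           # position on the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Toy minority-class points in two dimensions (illustrative values).
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the minority class. Resampling is applied to the training set only, so that the test set still reflects the original class distribution.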

In [9]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_smote, y_smote)  # train on the balanced training data
# predictions on the held-out test set
y_pred = model.predict(x_test)
# predictions on the original (unbalanced) training set
y_train_predict = model.predict(x_train)
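
The cell above produces predictions for both the training and the test set, which is what a train-versus-test comparison needs. The sketch below shows one possible way to make that comparison for a decision tree and, since the title also mentions it, a random forest. It runs on a synthetic dataset from `make_classification` as a stand-in for the HR data; the sample size, class weights and hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the HR data (roughly 85% / 15%).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    test_pred = clf.predict(X_test)
    scores[name] = {
        'train_accuracy': accuracy_score(y_train, clf.predict(X_train)),
        'test_accuracy': accuracy_score(y_test, test_pred),
        'test_f1': f1_score(y_test, test_pred),  # F1 of the minority class (label 1)
    }

for name, result in scores.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

A large gap between training and test accuracy suggests overfitting, which an unrestricted decision tree is prone to. On imbalanced data the F1 score of the minority class is usually more informative than accuracy alone.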