# **Machine Learning Assignment**

**Professor : Mohammad Mahdavi
Module : M606
Student Name : Muhammad Safeer Raza
Student Id : GH1024697**



GitHub Link: https://github.com/Safeer46/Machine-Learning
Dataset Link: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset

**The purpose of this project is to predict whether an employee will leave the company. This is a binary classification problem in which several employee attributes determine the outcome.**

In [30]:
import pandas as pd
from google.colab import drive
from sklearn.model_selection import train_test_split as TTS
from sklearn.preprocessing import OneHotEncoder as OHE
from sklearn.preprocessing import StandardScaler as SS
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import confusion_matrix as CM
from sklearn.metrics import precision_recall_fscore_support as prfs
from sklearn.metrics import accuracy_score as AC  # used as AC(...) in the evaluation cell

In [2]:
epd = pd.read_csv('/content/Employee.csv')
epd.head()


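| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |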


# Data Exploration

**Splitting the dataset into two parts: training data and test data**



In [3]:
tn_data, tt_data = TTS (epd)
print("total size",epd.shape)

print("trainingdata_size",tn_data.shape)

print("testingdata_size",tt_data.shape)

total size (4653, 9)
trainingdata_size (3489, 9)
testingdata_size (1164, 9)
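
The split above uses the defaults of `train_test_split` (25% test data, no fixed seed), so the exact rows in each part change on every run. A minimal sketch of a repeatable, class-balanced split is shown below. The toy table, the 42 seed, and the names `toy`, `train_part`, and `test_part` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the employee table (values are made up for illustration).
toy = pd.DataFrame({
    "Age": list(range(20, 60)),
    "LeaveOrNot": [0] * 28 + [1] * 12,
})

# random_state makes the split repeatable; stratify keeps the share of
# LeaveOrNot == 1 the same in the training and test parts.
train_part, test_part = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["LeaveOrNot"]
)

print("train:", train_part.shape, "test:", test_part.shape)
print("share of leavers in train:", train_part["LeaveOrNot"].mean())
print("share of leavers in test:", test_part["LeaveOrNot"].mean())
```

Stratifying is optional here, but with roughly one third of employees leaving it keeps the class ratio of the test set close to that of the training set.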


**Checking the class balance, the unique category values, and missing values**

In [4]:
tn_data["LeaveOrNot"].value_counts()

0    2298
1    1191
Name: LeaveOrNot, dtype: int64
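
The counts above show that the classes are imbalanced, which gives a useful reference point for the accuracies reported later. The short calculation below, a sketch that uses only the two counts printed above, shows the accuracy a model would reach by always predicting the majority class. The variable names are illustrative.

```python
# Training-set class counts copied from the value_counts() output above.
counts = {0: 2298, 1: 1191}

total = sum(counts.values())
majority_share = max(counts.values()) / total

# A classifier that always predicts "stays" (class 0) would be right
# for this share of the training rows.
print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_share:.3f}")
```

Any tuned model should be judged against this baseline of about 66% rather than against 50%.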

In [5]:
tn_data['Education'].unique()

array(['Bachelors', 'Masters', 'PHD'], dtype=object)

In [6]:
tn_data.isnull().sum()

Education                    0
JoiningYear                  0
City                         0
PaymentTier                  0
Age                          0
Gender                       0
EverBenched                  0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
dtype: int64

**Preprocessing the Data**

In [7]:
a_tn = tn_data.drop(["LeaveOrNot"], axis=1)
b_tn = tn_data["LeaveOrNot"]
a_tt = tt_data.drop(["LeaveOrNot"], axis=1)
b_tt = tt_data["LeaveOrNot"]

print("Train features:",a_tn.shape)
print("Test features:",a_tt.shape)

Train features: (3489, 8)
Test features: (1164, 8)


**Feature Engineering**

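**The training and test features are one-hot encoded because the dataset mixes categorical and numerical columns. The encoder is fitted on the training data only and is applied to all eight columns, numerical ones included, which yields 48 binary features.**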

In [8]:
encoder = OHE()
encoder.fit(a_tn)
a_tn = encoder.transform(a_tn)
a_tt = encoder.transform(a_tt)

print("a_tn size", a_tn.shape)
print("a_tt size", a_tt.shape)


a_tn size (3489, 48)
a_tt size (1164, 48)
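
Because the encoder above is fitted on every column, numeric columns such as `Age` are also turned into one indicator per distinct value, and a value that appears only in the test data would raise an error. One possible alternative, sketched below, encodes only the categorical columns and scales the numeric ones. This is not the approach used in this notebook: the rows are made up, `"Mumbai"` is a hypothetical unseen category, and the `categorical` / `numeric` column choice is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names follow the table shown earlier; the rows are made up.
train = pd.DataFrame({
    "Education": ["Bachelors", "Masters", "Bachelors", "PHD"],
    "City": ["Pune", "Bangalore", "New Delhi", "Pune"],
    "Age": [28, 34, 38, 27],
    "JoiningYear": [2013, 2017, 2014, 2016],
})
test = pd.DataFrame({
    "Education": ["Masters"],
    "City": ["Mumbai"],  # hypothetical category not seen during fit
    "Age": [30],
    "JoiningYear": [2015],
})

categorical = ["Education", "City"]
numeric = ["Age", "JoiningYear"]

preprocess = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros
    # instead of raising an error at transform time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

train_enc = preprocess.fit_transform(train)  # fit on training rows only
test_enc = preprocess.transform(test)        # reuse the fitted encoder
print(train_enc.shape, test_enc.shape)
```

Here the three education levels and three cities give six indicator columns, plus the two scaled numeric columns.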


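**The encoded features are scaled with StandardScaler so that each column has unit variance. `with_mean=False` is required because the one-hot encoder returns a sparse matrix, which cannot be centered.**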

In [9]:
scaler = SS(with_mean=False)
scaler.fit(a_tn)
a_tn = scaler.transform(a_tn)
a_tt = scaler.transform(a_tt)

print("a_tn size:", a_tn.shape)
print("a_tt size:", a_tt.shape)

a_tn size: (3489, 48)
a_tt size: (1164, 48)
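
The effect of `with_mean=False` can be seen on a small sparse matrix. The sketch below uses a made-up 4x2 indicator matrix (the name `X` and its values are illustrative) and shows that scaling keeps the matrix sparse, while asking for centering is rejected.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up indicator matrix in the sparse format OneHotEncoder returns by default.
X = sparse.csr_matrix(np.array([[1.0, 0.0],
                                [0.0, 1.0],
                                [1.0, 0.0],
                                [1.0, 0.0]]))

# Without centering, each column is only divided by its standard deviation,
# so zero entries stay zero and the matrix stays sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
print("still sparse:", sparse.issparse(X_scaled))
print(X_scaled.toarray())

# Centering would turn the zeros into non-zero values, so scikit-learn
# refuses to do it on sparse input.
try:
    StandardScaler(with_mean=True).fit(X)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("with_mean=True failed:", err)
```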


# Model Testing


**Choosing the model & Tuning the hyperparameter**

**Algorithm 01 - Decision Tree Classifier**

In [20]:
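DTC_grid = {
    "min_samples_leaf": range(1, 10),
    # "auto" behaves like "sqrt" for classifiers in older scikit-learn;
    # it was deprecated in 1.1 and removed in 1.3, so drop it on newer versions.
    "max_features": ["auto", "sqrt", "log2"],
}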
model_A = GridSearchCV(DecisionTreeClassifier(),
                       DTC_grid, cv=5, scoring="accuracy", n_jobs=-1
)
model_A.fit(a_tn, b_tn)
print("Best cross-validation accuracy with the decision tree classifier = {:.2f}".format(model_A.best_score_))
print("Best hyperparameters found for the decision tree classifier = {}".format(model_A.best_params_))


Best cross-validation accuracy with the decision tree classifier = 0.82
Best hyperparameters found for the decision tree classifier = {'max_features': 'sqrt', 'min_samples_leaf': 2}


**Algorithm 02 - Support Vector Machine**

In [19]:
SVM_grid = {

    'gamma':[0.01, 0.15],
    'C' : [1, 3]
}

model_B = GridSearchCV(SVC(),SVM_grid,cv=5, scoring="accuracy")

model_B.fit(a_tn, b_tn)
print("Best cross-validation accuracy with the SVM classifier = {:.2f}".format(model_B.best_score_))
print("Best hyperparameters found for the SVM classifier = {}".format(model_B.best_params_))

Best cross-validation accuracy with the SVM classifier = 0.84
Best hyperparameters found for the SVM classifier = {'C': 3, 'gamma': 0.01}
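
In the cells above the scaler is fitted once on the whole training set before cross-validation. A common variant is to place the scaler and the classifier in a `Pipeline`, so that the scaler is re-fitted inside each fold. The sketch below illustrates that idea on synthetic data from `make_classification`. The data, the step names `scale` and `svc`, and the variable names are assumptions; only the `C` and `gamma` values are taken from the grid above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the employee dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each cross-validation fold fits the scaler on its own training portion.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = {"svc__C": [1, 3], "svc__gamma": [0.01, 0.15]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy: {:.2f}".format(search.best_score_))
```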


**Algorithm 03 - Random Forest Classifier**

In [18]:
rfc_grid = {
    'criterion': ['entropy', 'gini'],
    'n_estimators':[20, 40, 80, 150]
}
model_c = GridSearchCV(RFC(),rfc_grid,cv=5, scoring="accuracy")

model_c.fit(a_tn, b_tn)
print("Best cross-validation accuracy with the random forest classifier = {:.2f}".format(model_c.best_score_))
print("Best hyperparameters found for the random forest classifier = {}".format(model_c.best_params_))

Best cross-validation accuracy with the random forest classifier = 0.82
Best hyperparameters found for the random forest classifier = {'criterion': 'gini', 'n_estimators': 40}


**After tuning, the best model (the SVM, `model_B`) is evaluated on the held-out test data**

In [35]:
final_prediction = model_B.predict(a_tt)
accuracy = AC (b_tt, final_prediction)
cm =  CM (b_tt, final_prediction)
precision, recall, f1, support = prfs(b_tt, final_prediction)

print("Accuracy=", accuracy)
print("Precision=", precision)
print("Recall =", recall)
print("F1-score =", f1)
print("Confusion Matrix:\n", cm)

Accuracy= 0.8625429553264605
Precision= [0.84958872 0.89776358]
Recall = [0.95761589 0.68704156]
F1-score = [0.9003736  0.77839335]
Confusion Matrix:
 [[723  32]
 [128 281]]


**Conclusion and Discussion**



**Three models were tuned with 5-fold cross-validation on the training data, with the following best accuracies:
Decision Tree: 82%
Random Forest: 82%
SVM: 84%
The Support Vector Machine was the best model, with a cross-validation accuracy of 84%. On the test data it reaches 86% accuracy, although its recall for employees who leave (class 1) is only 69%.**

In [40]:
!jupyter nbconvert --to html Employee_dataset.ipynb

[NbConvertApp] Converting notebook Employee_dataset.ipynb to html
[NbConvertApp] Writing 832440 bytes to Employee_dataset.html
