**COMPANY:** INX FUTURE INC | EMPLOYEE PERFORMANCE ANALYSIS

**BUSINESS CASE:** THE AIM IS TO PREDICT THE PERFORMANCE RATING OF EMPLOYEES BASED ON THE PROVIDED DATASET FEATURES


#### MODEL CREATION & EVALUATION SUMMARY:
* Load the preprocessed data
* Define dependent & independent features
* Balancing the target feature
* Split training and testing data
* Model creation, prediction & evaluation
* Save the Model

### IMPORT NECESSARY LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy import stats
import xgboost as xgb
from sklearn.svm import SVC
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV  
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix

import warnings # Used to suppress warnings in the notebook output
warnings.filterwarnings('ignore')

### LOADING PREPROCESSED DATA

In [2]:
path2 = "/Users/mac/Documents/datascienceonecampus/DatamitesTraining/IABAC PROJECT ASSIGNMENT/Employee Performance Analysis/Data/preprocessed_employee_performance_analysis_data.csv"
data = pd.read_csv(path2)
pd.set_option('display.max_columns',None) # Display all columns
data.drop('Unnamed: 0',axis=1,inplace=True) # Drop the unwanted 'Unnamed: 0' column
data.head()

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca6_,pca_7,pca_8,pca_9,pca_10,pca_11,pca_12,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,pca_21,pca_22,pca_23,pca_24,pca_25,PerformanceRating
0,-4.477633,-1.642226,1.173185,0.937445,-0.928846,1.086699,-0.725968,-1.463533,0.533238,0.398929,-1.235572,0.060502,-0.956449,-0.271149,1.382043,-0.693555,0.89937,0.135094,-0.433316,-0.201212,-0.033544,-0.252475,-0.519519,-0.270948,-0.175873,3
1,-4.359432,-0.065218,2.224724,1.516933,0.570524,-0.387658,-1.82813,0.085665,0.892809,0.860179,-1.533463,1.367427,0.224679,0.22674,0.33966,-0.273632,0.42023,-0.593152,1.136807,-0.840936,0.170133,-0.422335,-0.631238,0.001879,-0.656028,3
2,-4.248609,2.515834,4.840761,-0.186649,-1.665545,-0.245296,-0.489322,1.354694,0.514996,1.952407,0.53355,2.272418,-1.206544,0.033911,0.813391,-1.335221,-0.61601,0.483394,0.5603,0.271008,0.147982,0.277786,-0.234783,0.510924,0.738257,4
3,3.00894,0.739654,2.492356,3.290652,2.444091,1.796594,1.210603,-0.22146,-0.13419,-0.157597,-0.245363,-1.450949,-0.393031,1.304687,0.749247,1.469223,-0.266048,0.889484,-1.077229,-0.835094,1.254525,-0.003715,-0.506591,-0.013664,-0.437799,3
4,-4.246328,5.990469,-0.153349,0.784594,2.264407,-1.749641,0.507838,-0.524333,0.672134,1.158246,-1.54778,0.217248,-0.734415,-0.048797,-1.487128,0.656833,0.616331,-0.407177,1.018593,-0.12741,-1.252362,0.014501,-0.437993,-0.248946,-0.108599,3


### SPLIT THE DATASET INTO INDEPENDENT AND TARGET FEATURES

In [3]:
X = data.iloc[:,:-1]
y = data.PerformanceRating

In [4]:
X.head()

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca6_,pca_7,pca_8,pca_9,pca_10,pca_11,pca_12,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,pca_21,pca_22,pca_23,pca_24,pca_25
0,-4.477633,-1.642226,1.173185,0.937445,-0.928846,1.086699,-0.725968,-1.463533,0.533238,0.398929,-1.235572,0.060502,-0.956449,-0.271149,1.382043,-0.693555,0.89937,0.135094,-0.433316,-0.201212,-0.033544,-0.252475,-0.519519,-0.270948,-0.175873
1,-4.359432,-0.065218,2.224724,1.516933,0.570524,-0.387658,-1.82813,0.085665,0.892809,0.860179,-1.533463,1.367427,0.224679,0.22674,0.33966,-0.273632,0.42023,-0.593152,1.136807,-0.840936,0.170133,-0.422335,-0.631238,0.001879,-0.656028
2,-4.248609,2.515834,4.840761,-0.186649,-1.665545,-0.245296,-0.489322,1.354694,0.514996,1.952407,0.53355,2.272418,-1.206544,0.033911,0.813391,-1.335221,-0.61601,0.483394,0.5603,0.271008,0.147982,0.277786,-0.234783,0.510924,0.738257
3,3.00894,0.739654,2.492356,3.290652,2.444091,1.796594,1.210603,-0.22146,-0.13419,-0.157597,-0.245363,-1.450949,-0.393031,1.304687,0.749247,1.469223,-0.266048,0.889484,-1.077229,-0.835094,1.254525,-0.003715,-0.506591,-0.013664,-0.437799
4,-4.246328,5.990469,-0.153349,0.784594,2.264407,-1.749641,0.507838,-0.524333,0.672134,1.158246,-1.54778,0.217248,-0.734415,-0.048797,-1.487128,0.656833,0.616331,-0.407177,1.018593,-0.12741,-1.252362,0.014501,-0.437993,-0.248946,-0.108599


In [5]:
y.head()

0    3
1    3
2    4
3    3
4    3
Name: PerformanceRating, dtype: int64

In [6]:
X.shape

(1200, 25)

In [7]:
y.shape

(1200,)

### BALANCING THE TARGET FEATURE

SMOTE: SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for the class imbalance problem. Rather than replicating existing minority examples, it balances the class distribution by synthesising new minority instances that lie between an existing minority instance and one of its nearest minority-class neighbours.
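The interpolation idea can be illustrated with a small sketch. This is not the `imblearn` implementation: the function name `smote_like_sample` and the two sample points are hypothetical, and the nearest-neighbour search that SMOTE performs is skipped by supplying the neighbour directly.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)
minority_a = [1.0, 2.0]  # hypothetical minority-class sample
minority_b = [3.0, 6.0]  # hypothetical nearest minority-class neighbour
synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Every synthetic point lies on the straight line between the two originals, which is why SMOTE adds new information only within the region already occupied by the minority class.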


In [8]:
sm = SMOTE() # object creation
print("unbalanced data   :  ",Counter(y))
X_sm,y_sm = sm.fit_resample(X,y)
print("balanced data:    :",Counter(y_sm))

unbalanced data   :   Counter({3: 874, 2: 194, 4: 132})
balanced data:    : Counter({3: 874, 4: 874, 2: 874})


* The target feature is now balanced (874 samples per class).

### SPLIT TRAINING AND TESTING DATA

In [9]:
X_train,X_test,y_train,y_test=train_test_split(X_sm,y_sm,random_state=42,test_size=0.20) # 20% data given to testing

In [10]:
# Check shape of train and test
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2097, 25), (525, 25), (2097,), (525,))
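One caveat on ordering: here the data is oversampled before it is split, so synthetic points derived from a sample can land on the other side of the split from that sample, which may make test scores optimistic. A common alternative is to split first (stratified by class) and oversample only the training portion. The sketch below shows that ordering on random stand-in data with the same class counts; it uses plain random oversampling with NumPy as a stand-in for SMOTE, and all variable names are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data with the class counts reported above (not the real dataset)
y = np.array([3] * 874 + [2] * 194 + [4] * 132)
X = rng.normal(size=(len(y), 25))

# 1) Split first, keeping class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# 2) Oversample the training part only (random oversampling as a SMOTE stand-in)
counts = Counter(y_tr.tolist())
target = max(counts.values())
idx_parts = []
for label, n in counts.items():
    idx = np.flatnonzero(y_tr == label)
    extra = rng.choice(idx, size=target - n, replace=True)
    idx_parts.append(np.concatenate([idx, extra]))
idx_bal = np.concatenate(idx_parts)
X_bal, y_bal = X_tr[idx_bal], y_tr[idx_bal]

print("train (balanced):", Counter(y_bal.tolist()))
print("test (untouched):", Counter(y_te.tolist()))
```

With this ordering the test set keeps the original class imbalance, so its scores reflect the distribution the model would see in practice.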

### MODEL CREATION

#### AIM 
* Create a sweet spot model (Low bias, Low variance)

#### HERE WE WILL BE EXPERIMENTING WITH SIX ALGORITHMS
* Support Vector Machine => Classifier
* Logistic Regression
* Random Forest
* Artificial Neural Network [MLP Classifier]
* Naive Bayes Bernoulli
* K-Nearest Neighbor

### 1.Support Vector Machine

In [11]:

# Object creation
svc = SVC()

# Fitting the training data
svc.fit(X_train,y_train)

# Prediction on train data
svc_train_predict = svc.predict(X_train)

# Prediction on test data
svc_test_predict = svc.predict(X_test)

#### TRAINING ACCURACY

In [12]:
svc_train_accuracy = accuracy_score(svc_train_predict,y_train)
print("Training accuracy of support vector classifier model",svc_train_accuracy*100)
print("support vector classifier Classification report: \n",classification_report(svc_train_predict,y_train))

Training accuracy of support vector classifier model 97.04339532665713
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       1.00      0.96      0.98       719
           3       0.93      0.98      0.95       667
           4       0.98      0.98      0.98       711

    accuracy                           0.97      2097
   macro avg       0.97      0.97      0.97      2097
weighted avg       0.97      0.97      0.97      2097



* The support vector classifier performs well on the training data.

#### TESTING ACCURACY

In [13]:
svc_test_accuracy = accuracy_score(svc_test_predict,y_test)
print("Testing accuracy of support vector classifier model",svc_test_accuracy*100)
print("support vector classifier Classification report: \n",classification_report(svc_test_predict,y_test))

Testing accuracy of support vector classifier model 94.0952380952381
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.98      0.93      0.95       194
           3       0.86      0.96      0.91       156
           4       0.98      0.94      0.96       175

    accuracy                           0.94       525
   macro avg       0.94      0.94      0.94       525
weighted avg       0.94      0.94      0.94       525



In [40]:
confusion_matrix(y_test,svc_test_predict)

array([[180,   4,   0],
       [ 14, 149,  10],
       [  0,   3, 165]])
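Per-class precision and recall can be recomputed directly from this confusion matrix, whose rows are the true classes and whose columns are the predicted classes. Note that the `classification_report` calls in this notebook pass the predictions as the first argument, where scikit-learn expects the true labels, so the precision and recall columns in the printed reports appear swapped and `support` counts predicted labels. The sketch below is plain Python; the helper name `per_class_metrics` is an illustration, not part of scikit-learn.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = [[180, 4, 0],
      [14, 149, 10],
      [0, 3, 165]]
labels = [2, 3, 4]


def per_class_metrics(cm, labels):
    """Compute precision, recall and support for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        actual = sum(cm[i])                    # row sum: samples truly in this class
        predicted = sum(row[i] for row in cm)  # column sum: samples predicted as this class
        metrics[label] = {
            "precision": tp / predicted,
            "recall": tp / actual,
            "support": actual,
        }
    return metrics


svc_metrics = per_class_metrics(cm, labels)
for label, m in svc_metrics.items():
    print(label, round(m["precision"], 2), round(m["recall"], 2), m["support"])

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print("accuracy:", round(accuracy * 100, 2))
```

Accuracy is unaffected by the argument order, which is why the accuracy figures in the reports agree with the confusion matrices.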

* The test score still lags behind the training score, so we tune the hyperparameters with grid search CV.

In [14]:
param_grid = {'C':[0.1,0.5,10,50,60,70,80],
             'gamma':[1,0.1,0.001,0.0001,0.00001],
             'random_state':(list(range(1,20)))}
model = SVC() # Object creation
grid = GridSearchCV(model,param_grid,refit=True,verbose=2,scoring='f1',cv=5)

# Fitting the model for grid search
grid.fit(X,y)

Fitting 5 folds for each of 665 candidates, totalling 3325 fits
[CV] END .....................C=0.1, gamma=1, random_state=1; total time=   0.2s
[CV] END .....................C=0.1, gamma=1, random_state=1; total time=   0.2s
[CV] END .....................C=0.1, gamma=1, random_state=1; total time=   0.2s
[CV] END .....................C=0.1, gamma=1, random_state=1; total time=   0.2s
[CV] END .....................C=0.1, gamma=1, random_state=1; total time=   0.2s

KeyboardInterrupt: 
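The candidate count in the log above follows from multiplying the sizes of the parameter lists. The sketch below reproduces it in plain Python and, as a suggestion only, shows how much smaller the search would be without `random_state` in the grid; scikit-learn documents that `SVC` ignores `random_state` unless `probability=True`. The helper `n_candidates` is illustrative.

```python
from itertools import product

param_grid = {'C': [0.1, 0.5, 10, 50, 60, 70, 80],
              'gamma': [1, 0.1, 0.001, 0.0001, 0.00001],
              'random_state': list(range(1, 20))}


def n_candidates(grid):
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return len(list(product(*grid.values())))


cv_folds = 5
full = n_candidates(param_grid)
print(full, "candidates,", full * cv_folds, "fits")

# Same grid without random_state (an assumption about what could be dropped)
reduced_grid = {k: v for k, v in param_grid.items() if k != 'random_state'}
reduced = n_candidates(reduced_grid)
print(reduced, "candidates,", reduced * cv_folds, "fits")
```

Cutting the grid this way would reduce the run from 3325 fits to 175, which makes it far more likely that the search finishes without being interrupted.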

In [15]:
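# Set the chosen hyperparameters (C=0.6 is not one of the grid values above; the grid search was interrupted before finishing)
clf =SVC(C=0.6,gamma=0.1,random_state=1)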

# fit the model
clf.fit(X_train,y_train)

# Predict the x test
y_hat_clf = clf.predict(X_test)

#### TESTING ACCURACY AFTER HYPERPARAMETER TUNING

In [16]:
test_accuracy = accuracy_score(y_hat_clf,y_test)
print("Testing accuracy of support vector classifier model",test_accuracy*100)
print("support vector classifier Classification report: \n",classification_report(y_hat_clf,y_test))

Testing accuracy of support vector classifier model 96.76190476190476
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.94      0.99      0.97       174
           3       0.98      0.93      0.95       182
           4       0.99      0.98      0.99       169

    accuracy                           0.97       525
   macro avg       0.97      0.97      0.97       525
weighted avg       0.97      0.97      0.97       525



* After hyperparameter tuning, the test accuracy increases from 94.10% to 96.76%.

### 2.Logistic Regression

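**Logistic Regression** is commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (*called the positive class, labeled '1'*); otherwise it predicts that it does not (*that is, it belongs to the negative class, labeled '0'*). This makes it a binary classifier; for a target with more than two classes, as here, scikit-learn extends the same idea to multiple classes.

**Equation of Logistic Regression:**

$$z = W^T x + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

i.e. the weight vector $W$ is transposed ($W^T$ denotes the transpose, not a power) and multiplied with the feature vector $x$, the bias $b$ is added, and the sigmoid $\sigma$ maps the result $z$ to a probability between 0 and 1.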
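As a minimal sketch of the binary case, the snippet below computes the linear score and passes it through the sigmoid. The weights, bias and input are made-up numbers for illustration, not values learned by the model in this notebook.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, x, bias):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


weights = [0.5, -0.25]  # hypothetical learned weights
bias = 0.1              # hypothetical learned bias
x = [2.0, 4.0]          # hypothetical feature vector

p = predict_probability(weights, x, bias)
label = 1 if p > 0.5 else 0
print(round(p, 4), label)
```

A score of exactly zero maps to a probability of 0.5, which is the decision boundary described above.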

In [17]:
clf=LogisticRegression()

clf.fit(X_train,y_train)  ## training

# Prediction on train data
lg_train_predict = clf.predict(X_train)

# Prediction on test data
lg_test_predict = clf.predict(X_test)


#### TRAINING ACCURACY

In [18]:
lg_train_accuracy = accuracy_score(lg_train_predict,y_train)
print("Training accuracy of Logistic Regression model",lg_train_accuracy*100)
print("Logistic Regression Classification report: \n",classification_report(lg_train_predict,y_train))

Training accuracy of Logistic Regression model 91.17787315212207
Logistic Regression Classification report: 
               precision    recall  f1-score   support

           2       0.96      0.90      0.93       734
           3       0.83      0.90      0.86       646
           4       0.95      0.93      0.94       717

    accuracy                           0.91      2097
   macro avg       0.91      0.91      0.91      2097
weighted avg       0.92      0.91      0.91      2097



* Logistic Regression performs reasonably well on the training data (91.18% accuracy).

#### TESTING ACCURACY

In [19]:
lg_test_accuracy = accuracy_score(lg_test_predict,y_test)
print("Testing accuracy of Logistic Regression model",lg_test_accuracy*100)
print("Logistic Regression Classification report: \n",classification_report(lg_test_predict,y_test))

Testing accuracy of Logistic Regression model 90.47619047619048
Logistic Regression Classification report: 
               precision    recall  f1-score   support

           2       0.94      0.90      0.92       192
           3       0.82      0.89      0.85       158
           4       0.96      0.92      0.94       175

    accuracy                           0.90       525
   macro avg       0.90      0.90      0.90       525
weighted avg       0.91      0.90      0.91       525



In [41]:
confusion_matrix(y_test,lg_test_predict)

array([[173,  10,   1],
       [ 19, 141,  13],
       [  0,   7, 161]])

### 3.Random Forest

In [20]:
rf = RandomForestClassifier(n_estimators=100) # 100 decision trees

# fitting training data
rf.fit(X_train,y_train)

# Prediction on testing data
rf_test_predict = rf.predict(X_test)

# Prediction on training data
rf_train_predict = rf.predict(X_train)


#### TRAINING ACCURACY

In [21]:
rf_train_accuracy = accuracy_score(rf_train_predict,y_train)
print("Training accuracy of random forest",rf_train_accuracy)
print("Classification report of training: \n",classification_report(rf_train_predict,y_train))

Training accuracy of random forest 1.0
Classification report of training: 
               precision    recall  f1-score   support

           2       1.00      1.00      1.00       690
           3       1.00      1.00      1.00       701
           4       1.00      1.00      1.00       706

    accuracy                           1.00      2097
   macro avg       1.00      1.00      1.00      2097
weighted avg       1.00      1.00      1.00      2097



* The random forest classifier fits the training data perfectly (100% accuracy), which can be a sign of overfitting.

#### TESTING ACCURACY

In [22]:
rf_test_accuracy = accuracy_score(rf_test_predict,y_test)
print("Testing accuracy of random forest",rf_test_accuracy*100)
print("Classification report of testing: \n",classification_report(rf_test_predict,y_test))

Testing accuracy of random forest 92.95238095238095
Classification report of testing: 
               precision    recall  f1-score   support

           2       0.95      0.92      0.94       190
           3       0.86      0.92      0.89       162
           4       0.98      0.95      0.96       173

    accuracy                           0.93       525
   macro avg       0.93      0.93      0.93       525
weighted avg       0.93      0.93      0.93       525



In [42]:
confusion_matrix(y_test,rf_test_predict)

array([[175,   9,   0],
       [ 15, 149,   9],
       [  0,   4, 164]])

#### HYPERPARAMETER TUNING WITH RANDOMIZED SEARCH CV

In [23]:
# Randomized search is used for the random forest instead of grid search to limit memory and compute cost.

n_estimators = [int(x) for x in np.linspace(start=100 ,stop=2000, num=10)] # Number of decision trees in the forest
max_features = ['auto', 'sqrt'] # Maximum number of features considered at each split
max_depth    = [int(x) for x in np.linspace(10,100,num=11)] # Maximum number of levels in each decision tree
max_depth.append(None)
min_samples_split = [2,3,5,8] # Minimum number of samples required in a node before it is split
min_samples_leaf  = [1,2,3,4]  # Minimum number of samples allowed in a leaf node

# Creating dictionary of parameters
random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# Object creation
rf_clf = RandomForestClassifier(random_state=42) # Fixed random state, because rows and features are sampled randomly

# Create Random search CV with parameter
rf_cv = RandomizedSearchCV(estimator=rf_clf,scoring='f1',param_distributions=random_grid,
                           n_iter=10,cv=2,verbose=2,random_state=1,n_jobs=-1)

# Fitting the training data
rf_cv.fit(X_train,y_train)

# Get best parameter
rf_best_params = rf_cv.best_params_
print(f"Best parameter: {rf_best_params}")


Fitting 2 folds for each of 10 candidates, totalling 20 fits


Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1123, in f1_score
    return fbeta_score(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1261, in fbeta_score
    _, _, f, _ = precision_recall_fscore_support(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_l

Best parameter: {'n_estimators': 311, 'min_samples_split': 5, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 37}
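The tracebacks above end inside `f1_score`, which is consistent with `scoring='f1'` being used on a three-class target: that scorer uses binary averaging, which scikit-learn rejects for multiclass labels. When scoring fails for every candidate, the reported "best" parameters are not backed by usable scores. The sketch below reproduces the behaviour on a few made-up labels and shows that an explicit averaging mode works; in a search, the corresponding scorer names would be `'f1_macro'` or `'f1_weighted'`.

```python
from sklearn.metrics import f1_score

# Made-up three-class labels for illustration
y_true = [2, 3, 4, 3, 2, 4]
y_pred = [2, 3, 4, 2, 2, 4]

# Default average='binary' is rejected for a multiclass target
try:
    f1_score(y_true, y_pred)
    binary_error = None
except ValueError as exc:
    binary_error = str(exc)
print("binary averaging error:", binary_error)

# Macro averaging: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("macro F1:", round(macro_f1, 4))
```

Macro averaging treats the three rating classes equally, while weighted averaging weights each class by its frequency; which one is preferable depends on how much the minority ratings matter.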


In [24]:
# Create the object with the best parameters
rf_clf1 = RandomForestClassifier(**rf_best_params)

# Fitting the training data
rf_clf1.fit(X_train,y_train)

# Prediction on test data
rf_clf1_predict = rf_clf1.predict(X_test)

#### TEST ACCURACY AFTER HYPERPARAMETER TUNING

In [25]:
rf_accuracy = accuracy_score(rf_clf1_predict,y_test)
print("Accuracy after hyperparameter tuning",rf_accuracy*100)
print("Classification report: \n",classification_report(rf_clf1_predict,y_test))

Accuracy after hyperparameter tuning 93.52380952380952
Classification report: 
               precision    recall  f1-score   support

           2       0.93      0.96      0.95       179
           3       0.91      0.90      0.90       175
           4       0.96      0.95      0.96       171

    accuracy                           0.94       525
   macro avg       0.94      0.94      0.94       525
weighted avg       0.94      0.94      0.94       525



* After hyperparameter tuning, the test accuracy improves only marginally (from 92.95% to 93.52%).

### 4.Artificial Neural Network [MLP Classifier]

In [26]:
model = MLPClassifier(hidden_layer_sizes=(60,3),
                      learning_rate='constant',
                      max_iter=250,
                      random_state=42)

In [27]:
# Fitting the training data
model.fit(X_train,y_train)

MLPClassifier(hidden_layer_sizes=(60, 3), max_iter=250, random_state=42)

In [28]:
# Predicting the probability
mlp_prdict_probability = model.predict_proba(X_test)
mlp_prdict_probability

array([[2.67157403e-05, 9.99962021e-01, 1.12634313e-05],
       [3.55533356e-06, 9.99745776e-01, 2.50668450e-04],
       [9.86043464e-01, 1.39336707e-02, 2.28655609e-05],
       ...,
       [9.90814054e-01, 9.17978828e-03, 6.15742517e-06],
       [9.34316927e-01, 6.52083511e-02, 4.74721978e-04],
       [1.43185133e-18, 1.66202646e-08, 9.99999983e-01]])
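Each row of this array holds one probability per class, and the predicted label is the class with the highest probability. The sketch below applies that rule to three of the rows shown above. It assumes the columns follow the sorted class labels `[2, 3, 4]`, which is how scikit-learn orders `classes_`; the helper `to_label` is illustrative.

```python
classes = [2, 3, 4]  # assumed column order (sorted class labels)

# Three rows copied from the predict_proba output above
proba_rows = [
    [2.67157403e-05, 9.99962021e-01, 1.12634313e-05],
    [9.86043464e-01, 1.39336707e-02, 2.28655609e-05],
    [1.43185133e-18, 1.66202646e-08, 9.99999983e-01],
]


def to_label(row, classes):
    """Return the class whose probability is largest in this row."""
    best = max(range(len(row)), key=row.__getitem__)
    return classes[best]


predicted = [to_label(row, classes) for row in proba_rows]
print(predicted)
```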

In [29]:
# Prediction on test data
mlp_test_predict = model.predict(X_test)

# Prediction on training data
mlp_train_predict = model.predict(X_train)

#### TRAINING ACCURACY

In [30]:
mlp_train_accuracy = accuracy_score(mlp_train_predict,y_train)
print("Training accuracy of MLP model is:",mlp_train_accuracy*100)
print("Classification report of training:"'\n',classification_report(mlp_train_predict,y_train))

Training accuracy of MLP model is: 99.4277539341917
Classification report of training:
               precision    recall  f1-score   support

           2       1.00      0.98      0.99       702
           3       0.98      1.00      0.99       689
           4       1.00      1.00      1.00       706

    accuracy                           0.99      2097
   macro avg       0.99      0.99      0.99      2097
weighted avg       0.99      0.99      0.99      2097



* The multilayer perceptron performs well on the training data.

#### TESTING ACCURACY

In [31]:
mlp_test_accuracy = accuracy_score(mlp_test_predict,y_test)
print("Testing accuracy of MLP model is:",mlp_test_accuracy*100)
print("Classification report of testing:"'\n',classification_report(mlp_test_predict,y_test))

Testing accuracy of MLP model is: 95.80952380952381
Classification report of testing:
               precision    recall  f1-score   support

           2       0.98      0.95      0.97       191
           3       0.89      0.98      0.93       157
           4       1.00      0.95      0.97       177

    accuracy                           0.96       525
   macro avg       0.96      0.96      0.96       525
weighted avg       0.96      0.96      0.96       525

In [43]:
confusion_matrix(y_test,mlp_test_predict)

array([[181,   3,   0],
       [ 10, 154,   9],
       [  0,   0, 168]])

* The multilayer perceptron performs well on the testing data.

### 5.Naive Bayes Bernoulli

In [32]:
# Training the model
from sklearn.naive_bayes import BernoulliNB
model_nb = BernoulliNB()
model_nb.fit(X_train,y_train)

BernoulliNB()
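A note on what this model sees: `BernoulliNB` is designed for binary features and, with its default `binarize=0.0`, thresholds every input at zero before fitting. For continuous PCA components this keeps only whether each value is positive. The sketch below mimics that thresholding in plain Python on the first five values of the first data row; the helper `binarize_row` is illustrative, not the scikit-learn function.

```python
def binarize_row(row, threshold=0.0):
    """Return 1 for values above the threshold and 0 otherwise."""
    return [1 if value > threshold else 0 for value in row]


# First five components of the first row shown in the data preview
first_row = [-4.477633, -1.642226, 1.173185, 0.937445, -0.928846]

binary_row = binarize_row(first_row)
print(binary_row)
```

The magnitude of each component is discarded, which may partly explain the lower accuracy of this model; a Gaussian variant of Naive Bayes could be a closer fit for continuous features, though it is not evaluated here.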

In [56]:

# Prediction on train data
train_predict_nb = model_nb.predict(X_train)

# Prediction on test data
y_predict_nb = model_nb.predict(X_test)

#### TRAINING ACCURACY

In [57]:
nb_train_accuracy = accuracy_score(train_predict_nb,y_train)
print("Training accuracy of Naive Bayes Bernoulli model",nb_train_accuracy*100)
print("Naive Bayes Bernoulli Classification report: \n",classification_report(train_predict_nb,y_train))

Training accuracy of Naive Bayes Bernoulli model 80.92513113972342
Naive Bayes Bernoulli Classification report: 
               precision    recall  f1-score   support

           2       0.83      0.82      0.82       705
           3       0.70      0.77      0.73       636
           4       0.90      0.84      0.87       756

    accuracy                           0.81      2097
   macro avg       0.81      0.81      0.81      2097
weighted avg       0.82      0.81      0.81      2097



#### TESTING ACCURACY

In [59]:
nb_test_accuracy = accuracy_score(y_predict_nb,y_test)
print("Testing accuracy of Naive Bayes Bernoulli model",nb_test_accuracy*100)
print("Naive Bayes Bernoulli Classification report: \n",classification_report(y_predict_nb,y_test))

Testing accuracy of Naive Bayes Bernoulli model 75.42857142857143
Naive Bayes Bernoulli Classification report: 
               precision    recall  f1-score   support

           2       0.77      0.79      0.78       179
           3       0.63      0.69      0.66       158
           4       0.86      0.77      0.81       188

    accuracy                           0.75       525
   macro avg       0.75      0.75      0.75       525
weighted avg       0.76      0.75      0.76       525



In [58]:
confusion_matrix(y_test,y_predict_nb)

array([[142,  31,  11],
       [ 32, 109,  32],
       [  5,  18, 145]])

### 6.K-Nearest Neighbor

In [36]:
# Training the model
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier(n_neighbors=10,metric='euclidean') # Maximum accuracy for n=10
model_knn.fit(X_train,y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=10)

In [60]:
# Prediction on train data
train_predict_knn = model_knn.predict(X_train)

# Prediction on test data
y_predict_knn = model_knn.predict(X_test)

#### TRAINING ACCURACY

In [63]:
knn_train_accuracy = accuracy_score(train_predict_knn,y_train)
print("Training accuracy of K-Nearest Neighbor model",knn_train_accuracy*100)
print("K-Nearest Neighbor Classification report: \n",classification_report(train_predict_knn,y_train))

Training accuracy of K-Nearest Neighbor model 84.16785884597043
K-Nearest Neighbor Classification report: 
               precision    recall  f1-score   support

           2       0.99      0.78      0.88       876
           3       0.54      0.99      0.70       383
           4       0.99      0.84      0.91       838

    accuracy                           0.84      2097
   macro avg       0.84      0.87      0.83      2097
weighted avg       0.91      0.84      0.86      2097



#### TESTING ACCURACY

In [64]:
knn_test_accuracy = accuracy_score(y_predict_knn,y_test)
print("Testing accuracy of K-Nearest Neighbor model",knn_test_accuracy*100)
print("K-Nearest Neighbor Classification report: \n",classification_report(y_predict_knn,y_test))

Testing accuracy of K-Nearest Neighbor model 79.42857142857143
K-Nearest Neighbor Classification report: 
               precision    recall  f1-score   support

           2       0.99      0.76      0.86       241
           3       0.38      1.00      0.55        66
           4       1.00      0.77      0.87       218

    accuracy                           0.79       525
   macro avg       0.79      0.84      0.76       525
weighted avg       0.92      0.79      0.83       525



In [65]:
confusion_matrix(y_test,y_predict_knn)

array([[183,   0,   1],
       [ 58,  66,  49],
       [  0,   0, 168]])

**Conclusion:**
* The Support Vector Machine performs well on the training data with 97.04% accuracy, and the test accuracy is 94.10%. After hyperparameter tuning, the test accuracy rises to 96.76%, so the model performs well.

* Logistic Regression reaches 91.18% accuracy on the training data, and the test accuracy drops slightly to 90.48%.

* Random Forest fits the training data with 100% accuracy, but the test accuracy is 92.95%; after hyperparameter tuning the test accuracy increases slightly to 93.52%.

* The Artificial Neural Network [Multilayer Perceptron] performs very well on the training data with 99.43% accuracy, and the test accuracy is 95.81%.

* The Naive Bayes Bernoulli model has a training accuracy of 80.93%, which decreases to 75.43% in testing.

* The K-Nearest Neighbor model has a training accuracy of 84.17%, which drops to 79.43% in testing.

* So we select the Artificial Neural Network [Multilayer Perceptron] model.
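For a side-by-side view, the accuracies reported in this notebook can be collected in one structure and ranked, together with the train/test gap as a rough indicator of variance. The numbers below are copied from the cell outputs above (rounded to two decimals); training accuracy was not reported for the tuned SVC and tuned random forest, so those entries are `None`. The variable names are illustrative.

```python
# (training accuracy %, testing accuracy %) as reported in the outputs above
results = {
    "SVC (default)": (97.04, 94.10),
    "SVC (tuned)": (None, 96.76),
    "Logistic Regression": (91.18, 90.48),
    "Random Forest (default)": (100.00, 92.95),
    "Random Forest (tuned)": (None, 93.52),
    "MLP Classifier": (99.43, 95.81),
    "Naive Bayes Bernoulli": (80.93, 75.43),
    "K-Nearest Neighbor": (84.17, 79.43),
}

# Rank models by testing accuracy, highest first
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (train_acc, test_acc) in ranked:
    gap = None if train_acc is None else round(train_acc - test_acc, 2)
    print(f"{name:<26} test={test_acc:6.2f}  train-test gap={gap}")
```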

### SAVE THE MODEL

In [66]:
# Save the model with pickle
import pickle

with open('mlp_classifier_model.pkl','wb') as file: # the context manager closes the file after writing
    pickle.dump(model,file)
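A saved model is loaded back with `pickle.load`. The sketch below shows the round trip on a small stand-in object in a temporary directory, so it runs without the trained classifier; with the real file, the loaded object would be the fitted `MLPClassifier` and could be used for `predict` directly. The stand-in dictionary and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (the real object would be the fitted classifier)
stand_in_model = {"name": "mlp_classifier", "hidden_layer_sizes": (60, 3)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mlp_classifier_model.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, and a scikit-learn model is best loaded with the same library version that was used to save it.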