##### Project: Evaluate Car Condition Based on Given Features

Step 1: Import Dependencies

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore', category=Warning)

Step 2: Load Dataset

In [2]:
df = pd.read_csv(r'cars.csv',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


Naming the columns

In [3]:
df.columns=['buy','maint','doors','persons','lug_boot','safety','classes']
df.head()

Unnamed: 0,buy,maint,doors,persons,lug_boot,safety,classes
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
df.shape

(1728, 7)

This data contains 7 columns and 1728 rows

In [5]:
# checking for category level of all columns
for col in df.columns:
    print('-----------{}-----------'.format(col))
    print(df[col].value_counts())
    print()

-----------buy-----------
vhigh    432
high     432
med      432
low      432
Name: buy, dtype: int64

-----------maint-----------
vhigh    432
high     432
med      432
low      432
Name: maint, dtype: int64

-----------doors-----------
2        432
3        432
4        432
5more    432
Name: doors, dtype: int64

-----------persons-----------
2       576
4       576
more    576
Name: persons, dtype: int64

-----------lug_boot-----------
small    576
med      576
big      576
Name: lug_boot, dtype: int64

-----------safety-----------
low     576
med     576
high    576
Name: safety, dtype: int64

-----------classes-----------
unacc    1210
acc       384
good       69
vgood      65
Name: classes, dtype: int64



Every column is categorical. Even doors and persons mix digits with labels such as 5more and more, so all columns must be converted to numbers before modelling.

Step 3: Data Preprocessing

In [6]:
#Checking for missing values
df.isnull().sum()

buy         0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
classes     0
dtype: int64

This data does not contain any null values

In [7]:
# Checking the basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buy       1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   classes   1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


As noted above, every column has dtype object, so we have to convert them to numeric values.

In [8]:
#Applying labelencoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for i in df.columns:
    df[i]=le.fit_transform(df[i])

The object columns are now numeric. To check the conversion, the next code block uses zip and dict to print the label-to-number mapping.

In [9]:
for i in df.columns:
    le_name = dict(zip(le.classes_, le.transform(le.classes_)))
    print('------------------------')
    print('Feature: ',i)
    print('Mapping: ',le_name)

------------------------
Feature:  buy
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  maint
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  doors
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  persons
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  lug_boot
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  safety
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
------------------------
Feature:  classes
Mapping:  {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}


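All the columns have been transformed. Note that the mapping printed above is identical for every feature: the single le object keeps only the labels of the last column it was fitted on (classes), so this loop does not show the real mapping of the other features.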
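A minimal sketch of printing a separate mapping for each feature, assuming one LabelEncoder is kept per column. The toy frame and the encoders dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with string categories like those in the car data (illustrative values only).
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "low"],
    "classes": ["unacc", "acc", "vgood", "good"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each encoder remembers its own labels, so the mapping is per feature.
# LabelEncoder assigns codes in sorted label order, so the code is the index in classes_.
mappings = {
    col: {label: code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
for col, mapping in mappings.items():
    print(col, mapping)
```

Keeping the fitted encoders also makes it possible to call inverse_transform later and turn predicted codes back into labels.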

In [10]:
df.head()

Unnamed: 0,buy,maint,doors,persons,lug_boot,safety,classes
0,3,3,0,0,2,1,2
1,3,3,0,0,2,2,2
2,3,3,0,0,2,0,2
3,3,3,0,0,1,1,2
4,3,3,0,0,1,2,2


Step 4: Separate X and Y

In [11]:
X = df.iloc[:,:-1]
Y = df.iloc[:,-1]

The data is separated into the independent variables (X) and the dependent variable (Y).

Step 5: Data splitting

In [12]:
# Splitting the data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,test_size=0.3,random_state=10)



The data is split into train and test sets, with 70% of the rows used for training and 30% for testing.
To verify this, we print the shape of each variable.

In [13]:
print('Total data: ',df.shape)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
print('Y_train: ',Y_train.shape)
print('Y_test: ',Y_test.shape)

Total data:  (1728, 7)
X_train:  (1209, 6)
X_test:  (519, 6)
Y_train:  (1209,)
Y_test:  (519,)
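The printed shapes are consistent with the test share being rounded up and the training set taking the remaining rows. A small arithmetic sketch, assuming that rounding rule:

```python
import math

n_samples = 1728
test_size = 0.3

# Assumption: the test count is rounded up and the train set gets what is left.
n_test = math.ceil(test_size * n_samples)   # 518.4 rounds up to 519
n_train = n_samples - n_test

print(n_train, n_test)
```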


Step 6: Data Normalization

In [14]:
#perform data normalisation
from sklearn.preprocessing import MinMaxScaler
# create object for scaler
scaler = MinMaxScaler()

# fit the train data on scaler object 
scaler.fit(X_train)

X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test)

Note: After label encoding, every column holds small integer codes from 0 to 3, so there are no extreme high or low values and standardisation or normalisation is not strictly needed for this dataset. The models below are trained on the unscaled X_train; the scaled arrays are created only for comparison.

In [15]:
# compare normal and scale data
print(X_train.iloc[0])
print(X_train_scale[0])

buy         0
maint       0
doors       1
persons     2
lug_boot    0
safety      0
Name: 593, dtype: int32
[0.         0.         0.33333333 1.         0.         0.        ]
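The scaled row can be reproduced by hand with the min-max formula (x - min) / (max - min). The sketch below assumes the training minimum and maximum are 0 and 3 for doors and 0 and 2 for persons, which matches the encoded category counts shown earlier. The function name is illustrative.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale a value to the range [0, 1] using the column minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# Row 593 has doors = 1 and persons = 2 before scaling.
doors_scaled = min_max_scale(1, col_min=0, col_max=3)
persons_scaled = min_max_scale(2, col_min=0, col_max=2)

print(round(doors_scaled, 8), persons_scaled)
```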


Step 7: Build the model

1. Decision Tree

In [16]:
from sklearn.tree import DecisionTreeClassifier
#create model object with default parameter
model_dt = DecisionTreeClassifier()

#create model object with custom parameter
# model_dt = DecisionTreeClassifier(criterion='gini',
#                                    random_state=10, 
#                                    min_samples_leaf=5, 
#                                    min_samples_split=20,
#                                    max_leaf_nodes=15,
#                                    max_depth=6)

model_dt.fit(X_train, Y_train)

DecisionTreeClassifier()

In [17]:
model_dt.get_depth()

14

In [18]:
#prediction on Test data

y_pred = model_dt.predict(X_test)


To compare the predicted values with the actual values, we print them side by side as (actual, predicted) pairs.

In [19]:
print(list(zip(Y_test,y_pred)))

[(2, 2), (2, 2), (2, 2), (2, 2), (1, 1), (2, 2), (0, 0), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (3, 3), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (1, 1), (3, 3), (1, 1), (2, 2), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (3, 3), (2, 2), (2, 2), (0, 0), (0, 0), (2, 2), (2, 2), (3, 3), (2, 2), (2, 2), (2, 2), (1, 1), (2, 2), (0, 0), (3, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (1, 1), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0),

To check how well the model performs, we will use confusion_matrix, accuracy_score, and classification_report.

In [20]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

print('-----------------Confusion Matrix-----------------')
print(confusion_matrix(Y_test,y_pred))
print()
print('-----------------Accuracy Score-----------------')
print(accuracy_score(Y_test,y_pred))
print()
print('-----------------Classification Report-----------------')
print(classification_report(Y_test,y_pred))

-----------------Confusion Matrix-----------------
[[100   1   1   0]
 [  3  18   0   0]
 [  0   0 371   0]
 [  1   0   0  24]]

-----------------Accuracy Score-----------------
0.9884393063583815

-----------------Classification Report-----------------
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       102
           1       0.95      0.86      0.90        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

    accuracy                           0.99       519
   macro avg       0.98      0.95      0.96       519
weighted avg       0.99      0.99      0.99       519



The base decision tree model already scores well, with an accuracy of about 98.8%: 513 of the 519 test rows are predicted correctly.
We will still try other machine learning models for comparison.
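The accuracy, precision, and recall in the report follow directly from the confusion matrix, where rows are actual classes and columns are predicted classes. A short sketch that recomputes them from the decision tree matrix printed above:

```python
# Decision tree confusion matrix copied from the output above.
cm = [
    [100, 1, 1, 0],
    [3, 18, 0, 0],
    [0, 0, 371, 0],
    [1, 0, 0, 24],
]
n_classes = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))   # diagonal = correct predictions
accuracy = correct / total

# Recall: correct predictions divided by the row total (actual count of the class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: correct predictions divided by the column total (predicted count of the class).
precision = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(accuracy)
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
```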

In [21]:

# Feature importance for each input column; the values sum to 1
print(list(zip(X.columns, model_dt.feature_importances_)))

[('buy', 0.1510848831946676), ('maint', 0.2524437370839914), ('doors', 0.06032514597076629), ('persons', 0.19325825727478224), ('lug_boot', 0.09713332412056555), ('safety', 0.24575465235522678)]
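The importances are easier to read when sorted. A small sketch using the values printed above, which also checks that they sum to 1:

```python
# Values copied from the feature importance output above.
importances = [
    ("buy", 0.1510848831946676),
    ("maint", 0.2524437370839914),
    ("doors", 0.06032514597076629),
    ("persons", 0.19325825727478224),
    ("lug_boot", 0.09713332412056555),
    ("safety", 0.24575465235522678),
]

# Sort from most to least important.
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
total_importance = sum(value for _, value in importances)

for name, value in ranked:
    print(f"{name:<10}{value:.3f}")
```

For this particular tree, maint and safety carry the most weight and doors the least. Because model_dt was created without a random_state, the exact values may differ between runs.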


2. Using SVM

In [24]:
#Creating SVM 
from sklearn import svm
svc_model = svm.SVC()

#training svm model with train dataset
svc_model.fit(X_train,Y_train)

SVC()

Predict with test dataset

In [25]:
y_pred = svc_model.predict(X_test)

Evaluation on y_pred

In [26]:
print("---------Confusion Matrix------------")
print(confusion_matrix(Y_test, y_pred))
print()
print("---------Accuracy Score------------")
print(accuracy_score(Y_test,y_pred))
print()
print("---------Classification Report------------")
print(classification_report(Y_test,y_pred))

---------Confusion Matrix------------
[[ 74   0  28   0]
 [ 14   5   0   2]
 [  4   0 367   0]
 [  6   0   0  19]]

---------Accuracy Score------------
0.8959537572254336

---------Classification Report------------
              precision    recall  f1-score   support

           0       0.76      0.73      0.74       102
           1       1.00      0.24      0.38        21
           2       0.93      0.99      0.96       371
           3       0.90      0.76      0.83        25

    accuracy                           0.90       519
   macro avg       0.90      0.68      0.73       519
weighted avg       0.90      0.90      0.89       519



SVM reaches an accuracy of about 89.6%, noticeably lower than the decision tree, and its recall for class 1 (good) is only 0.24.
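The SVM above is trained on the unscaled X_train, although scaled arrays were prepared in Step 6. A minimal sketch of how a scaler and an SVC could be chained in one pipeline, so the scaler is fitted on the training rows only. The data here is synthetic and the names X_demo, y_demo, and svc_scaled are illustrative; no claim is made about how this would change the score on the car data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded features: 6 columns of codes 0 to 3.
rng = np.random.default_rng(10)
X_demo = rng.integers(0, 4, size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 5] > 3).astype(int)

# The pipeline fits the scaler on the training rows and reuses it at predict time.
svc_scaled = make_pipeline(MinMaxScaler(), SVC())
svc_scaled.fit(X_demo[:150], y_demo[:150])

pred_demo = svc_scaled.predict(X_demo[150:])
print(pred_demo.shape)
```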

Now we will build the Ensemble model

3. Bagging Approach

In [27]:
#Building the model using Bagging_classifier
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier(n_estimators=100, random_state=10)

model = model.fit(X_train,Y_train)

Predicting with the ensemble model

In [28]:
y_pred=model.predict(X_test)

In [29]:
print("---------Confusion Matrix------------")
print(confusion_matrix(Y_test, y_pred))
print()
print("---------Accuracy Score------------")
print(accuracy_score(Y_test,y_pred))
print()
print("---------Classification Report------------")
print(classification_report(Y_test,y_pred))

---------Confusion Matrix------------
[[102   0   0   0]
 [  1  20   0   0]
 [  2   0 369   0]
 [  1   0   0  24]]

---------Accuracy Score------------
0.9922928709055877

---------Classification Report------------
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       102
           1       1.00      0.95      0.98        21
           2       1.00      0.99      1.00       371
           3       1.00      0.96      0.98        25

    accuracy                           0.99       519
   macro avg       0.99      0.98      0.98       519
weighted avg       0.99      0.99      0.99       519



BaggingClassifier gives the best result so far, with an accuracy of about 99.2%.

4. Random Forest Model

In [30]:
#Building the model using Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=5000,
                                  random_state=10,
                                  verbose=1,
                                  n_jobs=-1)

model_rf.fit(X_train,Y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:    6.3s finished


RandomForestClassifier(n_estimators=5000, n_jobs=-1, random_state=10, verbose=1)

In [31]:
y_pred = model_rf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 1784 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done 2434 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 3184 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]: Done 4034 tasks      | elapsed:    0.9s
[Parallel(n_jobs=8)]: Done 4984 tasks      | elapsed:    1.2s
[Parallel(n_jobs=8)]: Done 5000 out of 5000 | elapsed:    1.2s finished


In [32]:
print("---------Confusion Matrix------------")
print(confusion_matrix(Y_test, y_pred))
print()
print("---------Accuracy Score------------")
print(accuracy_score(Y_test,y_pred))
print()
print("---------Classification Report------------")
print(classification_report(Y_test,y_pred))

---------Confusion Matrix------------
[[ 99   3   0   0]
 [  2  19   0   0]
 [  1   0 370   0]
 [  1   0   0  24]]

---------Accuracy Score------------
0.9865125240847784

---------Classification Report------------
              precision    recall  f1-score   support

           0       0.96      0.97      0.97       102
           1       0.86      0.90      0.88        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

    accuracy                           0.99       519
   macro avg       0.96      0.96      0.96       519
weighted avg       0.99      0.99      0.99       519



The Random Forest model also performs well, with an accuracy of about 98.7%.

5. Boosting Classifier

In [33]:
#Building the model using AdaBoost_Classifier
from sklearn.ensemble import AdaBoostClassifier

# Note: scikit-learn 1.2 renamed base_estimator to estimator; newer versions need the new name
model_AdaBoost = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(random_state=10),
    n_estimators=100,random_state=10)

model_AdaBoost.fit(X_train,Y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=10),
                   n_estimators=100, random_state=10)

In [34]:
y_pred = model_AdaBoost.predict(X_test)

In [35]:
print("---------Confusion Matrix------------")
print(confusion_matrix(Y_test, y_pred))
print()
print("---------Accuracy Score------------")
print(accuracy_score(Y_test,y_pred))
print()
print("---------Classification Report------------")
print(classification_report(Y_test,y_pred))

---------Confusion Matrix------------
[[ 99   2   1   0]
 [  4  17   0   0]
 [  0   0 371   0]
 [  1   0   0  24]]

---------Accuracy Score------------
0.9845857418111753

---------Classification Report------------
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       102
           1       0.89      0.81      0.85        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

    accuracy                           0.98       519
   macro avg       0.96      0.94      0.95       519
weighted avg       0.98      0.98      0.98       519



AdaBoost gives an accuracy of about 98.5%.

6. Gradient Boosting

In [37]:
#Building the model using Gradient_Boosting_Classifier
from sklearn.ensemble import GradientBoostingClassifier

model_GradientBoosting = GradientBoostingClassifier(n_estimators=200,
                                                    random_state=10)

model_GradientBoosting.fit(X_train,Y_train)

GradientBoostingClassifier(n_estimators=200, random_state=10)

In [38]:
y_pred=model_GradientBoosting.predict(X_test)

In [39]:
print("---------Confusion Matrix------------")
print(confusion_matrix(Y_test, y_pred))
print()
print("---------Accuracy Score------------")
print(accuracy_score(Y_test,y_pred))
print()
print("---------Classification Report------------")
print(classification_report(Y_test,y_pred))

---------Confusion Matrix------------
[[101   1   0   0]
 [  0  21   0   0]
 [  0   0 371   0]
 [  0   0   0  25]]

---------Accuracy Score------------
0.9980732177263969

---------Classification Report------------
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       102
           1       0.95      1.00      0.98        21
           2       1.00      1.00      1.00       371
           3       1.00      1.00      1.00        25

    accuracy                           1.00       519
   macro avg       0.99      1.00      0.99       519
weighted avg       1.00      1.00      1.00       519



Gradient Boosting gives the best result, with an accuracy of about 99.8%.

Now we will select the model with the highest accuracy score.
1. Decision_Tree_Classifier = 98.84%
2. SVC = 89.60%
3. Bagging_Classifier = 99.23%
4. Random_Forest_Classifier = 98.65%
5. AdaBoost_Classifier = 98.46%
6. Gradient_Boosting_Classifier = 99.81%

Every model scores above 89% accuracy, and all except SVC score above 98%.
Bagging_Classifier and Gradient_Boosting_Classifier both exceed 99%.
Gradient_Boosting_Classifier is the highest at 99.81%, with a single misclassified row out of the 519 test rows.

So the best model is Gradient_Boosting_Classifier.
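The comparison can be sketched in code by ranking the accuracy scores printed in the cells above and converting each score back into a number of misclassified test rows. The dictionary and variable names are illustrative.

```python
TEST_ROWS = 519  # size of the test set from Step 5

# Accuracy scores copied from the outputs above.
accuracy = {
    "Decision_Tree_Classifier": 0.9884393063583815,
    "SVC": 0.8959537572254336,
    "Bagging_Classifier": 0.9922928709055877,
    "Random_Forest_Classifier": 0.9865125240847784,
    "AdaBoost_Classifier": 0.9845857418111753,
    "Gradient_Boosting_Classifier": 0.9980732177263969,
}

# Rank models from highest to lowest accuracy.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
# Misclassified rows = (1 - accuracy) * number of test rows.
errors = {name: round((1 - acc) * TEST_ROWS) for name, acc in accuracy.items()}
best_model = ranking[0][0]

for name, acc in ranking:
    print(f"{name:<30}{acc:.2%}  errors: {errors[name]}")
```

Seen as error counts, the top tree-based models differ by only a handful of test rows, so the ranking among them depends on this particular train and test split.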