# Aim

The aim of this project is to predict the evaluation class of a car. I build a Decision Tree Classifier with Python and Scikit-Learn for this purpose. Note that safety is one of the six input attributes; the target variable is class.

The dataset used here is the Car Evaluation Data Set, downloaded from the UCI Machine Learning Repository website.


# Dataset description

The data set can be found at the following URL:

http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for an expert system for decision making. The database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

It was donated by Marko Bohanec.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

# Import Dataset

In [6]:
df = pd.read_csv('car.data.csv', header=None)

# Exploratory Data Analysis

In [8]:
# To check the dimensions of the dataset

df.shape

(1728, 7)

We can see that there are 1728 instances and 7 variables in the data set.

In [9]:
# preview the dataset

df.head()

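|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |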


### Rename column names

We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2 and so on. We should give them proper names.

In [10]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']


df.columns = col_names
df.columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

In [11]:
# let's again preview the dataset

df.head()

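|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |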


We can see that the columns have been renamed and now have meaningful names.
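
The loading and renaming steps can also be combined. The following is a minimal sketch, assuming the file has no header row; the three inline rows are copied from the preview above and stand in for `car.data.csv` so that the snippet runs on its own. `df_named` is a name introduced here for illustration.

```python
import io
import pandas as pd

# Column order as used in this notebook
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Three sample rows standing in for car.data.csv (assumption: same layout, no header row)
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

# header=None tells pandas the first line is data; names= labels the columns on load
df_named = pd.read_csv(sample, header=None, names=col_names)
print(df_named.columns.tolist())
```

With the real file, replacing `sample` by the file path should give the same column labels without a separate `df.columns = col_names` step.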

### View Summary of Dataset

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


### Frequency distribution of values in variables

In [13]:
for i in col_names:
    
    print(df[i].value_counts())

med      432
vhigh    432
low      432
high     432
Name: buying, dtype: int64
med      432
vhigh    432
low      432
high     432
Name: maint, dtype: int64
2        432
4        432
3        432
5more    432
Name: doors, dtype: int64
2       576
4       576
more    576
Name: persons, dtype: int64
med      576
big      576
small    576
Name: lug_boot, dtype: int64
med     576
low     576
high    576
Name: safety, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64


We can see that doors and persons contain non-numeric levels (5more and more), so they are categorical in nature. I will therefore treat them as categorical variables.

### Summary of variables

* There are 7 variables in the dataset. All the variables are of categorical data type.

* These are given by buying, maint, doors, persons, lug_boot, safety and class.

* class is the target variable.
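
Because class is the target variable, its frequency distribution gives a useful reference point for later accuracy scores. The sketch below uses only the counts printed above and computes the share of each class, plus the accuracy of a trivial model that always predicts the most frequent class. The names `class_counts` and `baseline_accuracy` are introduced here for illustration, and the figure refers to the full dataset, not to the test split created later.

```python
# Class counts copied from the value_counts() output above
class_counts = {'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

for label, share in shares.items():
    print(f"{label:>6}: {share:.3f}")

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total
print(f"always predicting '{majority_label}' gives accuracy {baseline_accuracy:.3f}")
```

The classes are clearly imbalanced, so an accuracy score should be read against this baseline of roughly 0.70.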

### Missing values in variables

In [14]:
# check missing values in variables

df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

We can see that there are no missing values in the dataset. The frequency distributions checked previously show only valid category labels, which also confirms that there are no missing values.
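
Note that `isnull()` only detects true missing values; a placeholder string such as `?` would pass unnoticed. The sketch below makes the frequency-distribution check explicit by comparing each column against the categories seen above. The helper `unexpected_values` and the two-row `demo` frame are hypothetical and exist only for this illustration; the `?` is planted to show what a detection looks like.

```python
import pandas as pd

# Expected categories per column, taken from the value_counts() output above
expected = {
    'buying': {'vhigh', 'high', 'med', 'low'},
    'maint': {'vhigh', 'high', 'med', 'low'},
    'doors': {'2', '3', '4', '5more'},
    'persons': {'2', '4', 'more'},
    'lug_boot': {'small', 'med', 'big'},
    'safety': {'low', 'med', 'high'},
    'class': {'unacc', 'acc', 'good', 'vgood'},
}

def unexpected_values(frame, expected):
    """Return {column: set of values not in the expected categories}."""
    problems = {}
    for col, allowed in expected.items():
        found = set(frame[col].astype(str).unique()) - allowed
        if found:
            problems[col] = found
    return problems

# Tiny illustrative frame (hypothetical rows); the '?' is planted to show detection
demo = pd.DataFrame({
    'buying': ['vhigh', 'low'], 'maint': ['med', '?'], 'doors': ['2', '5more'],
    'persons': ['4', 'more'], 'lug_boot': ['small', 'big'],
    'safety': ['low', 'high'], 'class': ['unacc', 'acc'],
})
print(unexpected_values(demo, expected))
```

Calling `unexpected_values(df, expected)` on the real data should return an empty dictionary if every value is one of the listed categories.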

# Declare feature vector and target variable

In [15]:
x = df.drop(['class'], axis=1)

y = df['class']

# Splitting data into separate training and test dataset

In [16]:
# split x and y into training and testing sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 42)

In [17]:
# check the shape of x_train and x_test

x_train.shape, x_test.shape

((1157, 6), (571, 6))
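
The split above is purely random, so the rare classes (good and vgood) may end up under- or over-represented in the test set. One option is to pass `stratify` to `train_test_split`. The sketch below is an illustration only: it rebuilds a label list from the class counts printed earlier and uses row ids as a stand-in for the features, so it does not reproduce the split used in the rest of this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Rebuild a label vector with the class counts shown earlier (order is arbitrary)
class_counts = {'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65}
labels = [label for label, n in class_counts.items() for _ in range(n)]
row_ids = list(range(len(labels)))  # stand-in for the feature rows

# stratify=labels keeps each class's share roughly equal in both splits
ids_train, ids_test, lab_train, lab_test = train_test_split(
    row_ids, labels, test_size=0.33, random_state=42, stratify=labels
)

test_counts = Counter(lab_test)
for label, n in class_counts.items():
    overall = n / len(labels)
    in_test = test_counts[label] / len(lab_test)
    print(f"{label:>6}: overall {overall:.3f}, test {in_test:.3f}")
```

In the notebook itself this would correspond to adding `stratify=y` to the existing call, which would change the exact rows in each split and therefore the scores reported below.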

# Feature Engineering

In [19]:
# check data types in x_train

x_train.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
dtype: object

### Encode categorical variables

In [20]:
x_train.head()

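|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | vhigh | vhigh | 3 | more | med | low |
| 468 | high | vhigh | 3 | 4 | small | low |
| 155 | vhigh | high | 3 | more | small | high |
| 1721 | low | low | 5more | more | small | high |
| 1208 | med | low | 2 | more | small | high |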


In [21]:
# import category encoders

import category_encoders as ce

In [22]:
# encode variables with ordinal encoding

encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])


x_train = encoder.fit_transform(x_train)

x_test = encoder.transform(x_test)

In [23]:
x_train.head()

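|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | 1 | 1 | 1 | 1 | 1 | 1 |
| 468 | 2 | 1 | 1 | 2 | 2 | 1 |
| 155 | 1 | 2 | 1 | 1 | 2 | 2 |
| 1721 | 3 | 3 | 2 | 1 | 2 | 2 |
| 1208 | 4 | 3 | 3 | 1 | 2 | 2 |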


In [24]:
x_test.head()

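|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 599 | 2 | 2 | 4 | 3 | 1 | 2 |
| 1201 | 4 | 3 | 3 | 2 | 1 | 3 |
| 628 | 2 | 2 | 2 | 3 | 3 | 3 |
| 1498 | 3 | 2 | 2 | 2 | 1 | 3 |
| 1263 | 4 | 3 | 4 | 1 | 1 | 1 |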
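
Comparing the encoded rows with the original ones shows that the integer codes follow the order in which values first appear, not their meaning: for buying, vhigh became 1, high became 2, low became 3 and med became 4. A decision tree can still split on such codes, but an encoding that respects the natural order may allow simpler splits. The sketch below shows one way to do this with plain pandas. The rankings in `ordinal_maps` are my own assumption about the natural order of each attribute, and `encode_ordinal` is a hypothetical helper, not part of any library.

```python
import pandas as pd

# Assumed semantic orderings (low -> high); these rankings are my reading of the labels
ordinal_maps = {
    'buying':   {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4},
    'maint':    {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4},
    'doors':    {'2': 1, '3': 2, '4': 3, '5more': 4},
    'persons':  {'2': 1, '4': 2, 'more': 3},
    'lug_boot': {'small': 1, 'med': 2, 'big': 3},
    'safety':   {'low': 1, 'med': 2, 'high': 3},
}

def encode_ordinal(frame, maps):
    """Return a copy of frame with each mapped column replaced by its integer rank."""
    encoded = frame.copy()
    for col, mapping in maps.items():
        encoded[col] = encoded[col].astype(str).map(mapping)
    return encoded

# First two rows of x_train (before encoding), used as a small demonstration
demo = pd.DataFrame(
    {'buying': ['vhigh', 'high'], 'maint': ['vhigh', 'vhigh'], 'doors': ['3', '3'],
     'persons': ['more', '4'], 'lug_boot': ['med', 'small'], 'safety': ['low', 'low']},
    index=[48, 468],
)
print(encode_ordinal(demo, ordinal_maps))
```

The same mapping would have to be applied to both the training and the test features, just as `fit_transform` and `transform` are paired above.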


# Decision Tree Classifier with criterion gini index

In [26]:
# import DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

In [27]:
# instantiate the DecisionTreeClassifier model with criterion gini index

clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)


# fit the model
clf_gini.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
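
To make the gini criterion concrete, the sketch below computes the Gini impurity of a node from its class counts, using the formula 1 minus the sum of squared class proportions. As an example node I use the class counts of the full dataset; the actual root node of `clf_gini` is built from the training split only, so its value will differ slightly. `gini_impurity` is a helper written for this illustration.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts of the full dataset (unacc, acc, good, vgood), used as an example node
root_counts = [1210, 384, 69, 65]
print(f"Gini at example root: {gini_impurity(root_counts):.4f}")

# A pure node has impurity 0; an even two-class node has impurity 0.5
print(gini_impurity([10, 0]), gini_impurity([5, 5]))
```

At each node the tree picks the split that reduces this impurity the most, weighted by the number of samples in each child.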

### Predict the Test set results with criterion gini index

In [29]:
y_pred_gini = clf_gini.predict(x_test)

### Check accuracy score with criterion gini index

In [32]:
from sklearn.metrics import accuracy_score

acc_score_test = accuracy_score(y_test, y_pred_gini)
acc_score_test

0.8021015761821366

Here, y_test are the true class labels and y_pred_gini are the predicted class labels in the test-set.


### Compare the train-set and test-set accuracy

Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [33]:
y_pred_train_gini = clf_gini.predict(x_train)

y_pred_train_gini

array(['unacc', 'unacc', 'unacc', ..., 'unacc', 'unacc', 'acc'],
      dtype=object)

In [34]:
acc_score_train = accuracy_score(y_train, y_pred_train_gini)
acc_score_train

0.7865168539325843

In [36]:
# print the scores on training and test set

score_train = clf_gini.score(x_train, y_train)
print(score_train)

score_test = clf_gini.score(x_test, y_test)
print(score_test)

0.7865168539325843
0.8021015761821366


Here, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
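
A single pair of scores says little about how the choice of `max_depth` affects the fit. One way to explore this is to repeat the comparison for several depths. The sketch below does so on synthetic stand-in data: the features are every combination of six integer-coded attributes, and `toy_label` is a made-up rule, not the real car evaluation model. The numbers it prints therefore say nothing about the car data; only the procedure carries over.

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: every combination of six integer-coded features
# (4, 4, 4, 3, 3, 3 levels -> 1728 rows, the same shape as the car data)
X = [list(row) for row in product(range(4), range(4), range(4),
                                  range(3), range(3), range(3))]

def toy_label(row):
    """Hypothetical labelling rule, NOT the real car-evaluation model."""
    buying, maint, doors, persons, lug_boot, safety = row
    if safety == 0 or persons == 0:
        return 'unacc'
    return 'acc' if buying + maint >= 3 else 'good'

y = [toy_label(row) for row in X]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

scores = {}
for depth in (1, 3, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

A training score that keeps rising while the test score stalls or drops would point to overfitting; two similar but low scores would point to underfitting.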

# Decision Tree Classifier with criterion entropy

In [37]:
# instantiate the DecisionTreeClassifier model with criterion entropy

clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)


# fit the model
clf_en.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
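
The entropy criterion measures node impurity in bits instead of squared proportions. The sketch below computes it from class counts, again using the class counts of the full dataset as an example node (the fitted tree's root uses the training split only). `entropy` is a helper written for this illustration.

```python
import math

def entropy(counts):
    """Shannon entropy in bits, sum(p_k * log2(1 / p_k)), skipping empty classes."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# Class counts of the full dataset (unacc, acc, good, vgood), used as an example node
root_counts = [1210, 384, 69, 65]
print(f"Entropy at example root: {entropy(root_counts):.4f} bits")

# A pure node has entropy 0; an even two-class node has entropy 1 bit
print(entropy([10, 0]), entropy([5, 5]))
```

Both criteria reward splits that produce purer child nodes, which is one reason they often select similar splits.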

### Predict the Test set results with criterion entropy

In [38]:
y_pred_en = clf_en.predict(x_test)

### Check accuracy score with criterion entropy

In [39]:
from sklearn.metrics import accuracy_score

acc_score_test_en = accuracy_score(y_test, y_pred_en)

acc_score_test_en

0.8021015761821366

### Compare the train-set and test-set accuracy

Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [40]:
y_pred_train_en = clf_en.predict(x_train)

y_pred_train_en

array(['unacc', 'unacc', 'unacc', ..., 'unacc', 'unacc', 'acc'],
      dtype=object)

In [41]:
acc_score_train_en = accuracy_score(y_train, y_pred_train_en)

acc_score_train_en

0.7865168539325843

### Check for overfitting and underfitting

In [42]:
# print the scores on training and test set

score_train_en = clf_en.score(x_train, y_train)
print(score_train_en)

score_test_en = clf_en.score(x_test, y_test)
print(score_test_en)

0.7865168539325843
0.8021015761821366


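We can see that the training-set and test-set scores are the same as with the gini criterion: the training-set accuracy is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting. Since both scores are moderate and the tree is limited to max_depth=3, underfitting may be the more relevant concern.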

Based on the above analysis, the model reaches a test-set accuracy of about 0.80. Since about 70% of all samples belong to the unacc class (1210 of 1728), this is a moderate improvement over always predicting the majority class.

But accuracy alone does not reveal the underlying distribution of values, and it tells us nothing about the types of errors our classifier is making. To address this, we will print the confusion matrix.

# Confusion Matrix

In [43]:
# Print the confusion matrix (rows are true classes, columns are predicted classes)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_en)
cm

array([[ 73,   0,  56,   0],
       [ 20,   0,   0,   0],
       [ 12,   0, 385,   0],
       [ 25,   0,   0,   0]], dtype=int64)
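
The matrix is easier to read with class names attached. scikit-learn orders the rows and columns by sorted label, which here gives acc, good, unacc, vgood. The sketch below copies the matrix from the output above and derives the support, precision and recall of each class from it; `per_class` is a name introduced for this illustration.

```python
# Confusion matrix copied from the output above; rows are true classes, columns are predictions
labels = ['acc', 'good', 'unacc', 'vgood']  # sorted label order used by scikit-learn
cm = [
    [73, 0, 56, 0],
    [20, 0, 0, 0],
    [12, 0, 385, 0],
    [25, 0, 0, 0],
]

per_class = {}
for i, label in enumerate(labels):
    true_total = sum(cm[i])                  # support: samples whose true class is this one
    pred_total = sum(row[i] for row in cm)   # how often this class was predicted
    recall = cm[i][i] / true_total
    precision = cm[i][i] / pred_total if pred_total else 0.0
    per_class[label] = (precision, recall)
    print(f"{label:>6}: support={true_total:3d} predicted={pred_total:3d} "
          f"precision={precision:.2f} recall={recall:.2f}")

accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```

The good and vgood columns contain only zeros, so the model never predicts either class: all 20 good and all 25 vgood test samples are classified as acc.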

# Classification Report

In [45]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_en))

              precision    recall  f1-score   support

         acc       0.56      0.57      0.56       129
        good       0.00      0.00      0.00        20
       unacc       0.87      0.97      0.92       397
       vgood       0.00      0.00      0.00        25

    accuracy                           0.80       571
   macro avg       0.36      0.38      0.37       571
weighted avg       0.73      0.80      0.77       571
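
The gap between the macro and weighted averages in the report comes from how the per-class values are combined. The sketch below reproduces both averages for recall, using the per-class counts from the confusion matrix above. The variable names are introduced here for illustration.

```python
# Per-class recall and support, derived from the confusion matrix above
recall = {'acc': 73 / 129, 'good': 0 / 20, 'unacc': 385 / 397, 'vgood': 0 / 25}
support = {'acc': 129, 'good': 20, 'unacc': 397, 'vgood': 25}

# Macro average: every class counts equally, so the two unpredicted classes pull it down
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: classes count in proportion to their support
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall    = {macro_recall:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
```

The weighted recall equals the overall accuracy, so it is dominated by the large unacc class, while the macro average exposes the two classes the model misses entirely.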



# Results and conclusion

1. In this project, I built a Decision Tree Classifier model to predict the evaluation class of the car. I built two models, one with criterion gini index and another with criterion entropy. Both models reach a test-set accuracy of 0.8021, compared with about 0.70 for always predicting the majority class unacc (1210 of 1728 samples).

2. In the model with criterion gini index, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.

3. Similarly, in the model with criterion entropy, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These are the same values as with criterion gini, so again there is no sign of overfitting.

4. Both criteria give identical training-set and test-set accuracy scores. This may be because the dataset is small, and possibly also because both trees are limited to a depth of 3.

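5. The confusion matrix and classification report show that the accuracy comes entirely from the unacc and acc classes. The model never predicts good or vgood, so precision and recall for these two classes are 0.00 and the macro-averaged F1 score is only 0.37.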