### The problem statement

* The problem is to predict the safety of a car, represented here by the class variable of the dataset.
* In this project, I build a Decision Tree Classifier to make that prediction.
* I implement Decision Tree Classification with Python and Scikit-Learn.

### About the Dataset

* I used the Car Evaluation Data Set for this project, downloaded from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

### Import Libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [10]:
import warnings

warnings.filterwarnings('ignore')

### Import dataset


In [15]:
# car.data has no header row, so the columns are numbered 0 to 6 on load
df = pd.read_csv('car.data', header=None)

### EDA

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [17]:
df.shape

(1728, 7)

In [16]:
df.head()

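| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |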


In [18]:
df.columns

Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

In [19]:
# rename columns
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

In [20]:
df.head()

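| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |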


In [25]:
# Frequency distribution of values in variables
for col in df.columns:
    print(df[col].value_counts())

buying
vhigh    432
high     432
med      432
low      432
Name: count, dtype: int64
maint
vhigh    432
high     432
med      432
low      432
Name: count, dtype: int64
doors
2        432
3        432
4        432
5more    432
Name: count, dtype: int64
persons
2       576
4       576
more    576
Name: count, dtype: int64
lug_boot
small    576
med      576
big      576
Name: count, dtype: int64
safety
low     576
med     576
high    576
Name: count, dtype: int64
class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64


#### Summary of variables

* There are 7 variables in the dataset. All of them are categorical.
* They are buying, maint, doors, persons, lug_boot, safety and class.
* class is the target variable.

In [27]:
# Explore class variable

df['class'].value_counts()


class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64
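
The counts above show that the classes are far from balanced. As a rough aid for reading the accuracy scores later on, the following sketch turns those counts into proportions and into the accuracy of a hypothetical baseline that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
class_counts = {'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65}

total = sum(class_counts.values())

# Share of each class in the full dataset
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print('{}: {:.4f}'.format(label, share))

# Accuracy of a baseline that always predicts the most frequent class
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print('Majority class: {}, baseline accuracy: {:.4f}'.format(majority_label, baseline_accuracy))
```

Any model accuracy on this data is easier to judge next to that baseline, although the exact baseline on a given test split depends on how many unacc rows the split contains.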

In [28]:
# check missing values in variables

df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

### Declare feature vector and target variable


In [29]:
X = df.drop(['class'], axis=1)

y = df['class']

### Split data into separate training and test sets


In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [31]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

((1157, 6), (571, 6))
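
Because class is imbalanced, a plain random split can leave the rare classes unevenly represented in the two sets. One option, not used in this notebook, is the stratify argument of train_test_split. A minimal sketch, assuming stand-in labels rebuilt from the class counts shown earlier (y_standin and the dummy feature matrix X_standin are hypothetical and are not the real rows of car.data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels rebuilt from the class counts (not the real rows of car.data)
y_standin = np.array(['unacc'] * 1210 + ['acc'] * 384 + ['good'] * 69 + ['vgood'] * 65)

# Dummy single-column feature matrix, only needed so there is something to split
X_standin = np.arange(len(y_standin)).reshape(-1, 1)

# stratify=y_standin keeps the class proportions roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_standin, y_standin, test_size=0.33, random_state=42, stratify=y_standin
)

# Print the size and the class shares of each part
for name, part in [('train', y_tr), ('test', y_te)]:
    labels, counts = np.unique(part, return_counts=True)
    shares = {str(label): round(float(count) / len(part), 3)
              for label, count in zip(labels, counts)}
    print(name, len(part), shares)
```

In the notebook itself this would amount to adding stratify=y to the existing train_test_split call; the resulting accuracy scores would then differ from the ones reported here.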

### Feature Engineering

In [32]:
# check data types in X_train

X_train.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
dtype: object

In [33]:
X_train.head()

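| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | vhigh | vhigh | 3 | more | med | low |
| 468 | high | vhigh | 3 | 4 | small | low |
| 155 | vhigh | high | 3 | more | small | high |
| 1721 | low | low | 5more | more | small | high |
| 1208 | med | low | 2 | more | small | high |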


#### Encode categorical variables

In [36]:
# import category encoders

import category_encoders as ce

In [37]:
# encode variables with ordinal encoding

encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])


X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [38]:

X_train.head()

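| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | 1 | 1 | 1 | 1 | 1 | 1 |
| 468 | 2 | 1 | 1 | 2 | 2 | 1 |
| 155 | 1 | 2 | 1 | 1 | 2 | 2 |
| 1721 | 3 | 3 | 2 | 1 | 2 | 2 |
| 1208 | 4 | 3 | 3 | 1 | 2 | 2 |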


In [39]:
X_test.head()

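| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 599 | 2 | 2 | 4 | 3 | 1 | 2 |
| 1201 | 4 | 3 | 3 | 2 | 1 | 3 |
| 628 | 2 | 2 | 2 | 3 | 3 | 3 |
| 1498 | 3 | 2 | 2 | 2 | 1 | 3 |
| 1263 | 4 | 3 | 4 | 1 | 1 | 1 |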
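
In the encoded X_train shown earlier, buying maps vhigh to 1, high to 2, low to 3 and med to 4, which suggests that the integers follow the order in which categories first appear rather than their natural ranking. If the natural order matters, one alternative is to map each column explicitly. The following is a minimal sketch with plain pandas on a few rows copied from the X_train.head() output; the orderings in ordered_levels are my assumption about the intended ranking, and the snippet is not a drop-in replacement for ce.OrdinalEncoder.

```python
import pandas as pd

# A few rows copied from the X_train.head() output (before encoding)
sample = pd.DataFrame(
    {
        'buying': ['vhigh', 'high', 'vhigh', 'low', 'med'],
        'maint': ['vhigh', 'vhigh', 'high', 'low', 'low'],
        'doors': ['3', '3', '3', '5more', '2'],
        'persons': ['more', '4', 'more', 'more', 'more'],
        'lug_boot': ['med', 'small', 'small', 'small', 'small'],
        'safety': ['low', 'low', 'high', 'high', 'high'],
    },
    index=[48, 468, 155, 1721, 1208],
)

# Assumed natural ordering of each variable, from lowest to highest
ordered_levels = {
    'buying': ['low', 'med', 'high', 'vhigh'],
    'maint': ['low', 'med', 'high', 'vhigh'],
    'doors': ['2', '3', '4', '5more'],
    'persons': ['2', '4', 'more'],
    'lug_boot': ['small', 'med', 'big'],
    'safety': ['low', 'med', 'high'],
}

# Replace each category by its 1-based rank in the assumed ordering
encoded = sample.copy()
for col, levels in ordered_levels.items():
    mapping = {level: rank for rank, level in enumerate(levels, start=1)}
    encoded[col] = sample[col].map(mapping)

print(encoded)
```

With this mapping a threshold such as buying <= 2.5 separates low and med from high and vhigh, which is easier to interpret than a split on first-appearance codes.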


### Model Training

#### Decision Tree Classifier with criterion gini index

In [40]:
# import DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

In [41]:
# instantiate the DecisionTreeClassifier model with criterion gini index

model_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)


# fit the model
model_gini.fit(X_train, y_train)
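
A fitted tree of depth 3 is small enough to read as a set of rules. One way to do that is sklearn.tree.export_text. The sketch below fits a tree on a tiny hypothetical encoded dataset (X_toy, y_toy and the two feature names are stand-ins, not the car data) and prints its rules; with the notebook's objects the call would be export_text(model_gini, feature_names=list(X_train.columns)).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny hypothetical dataset with two ordinal-encoded features
X_toy = pd.DataFrame({
    'safety':  [1, 1, 2, 2, 3, 3, 1, 3],
    'persons': [1, 2, 1, 2, 1, 2, 2, 2],
})

# Made-up rule: a car is 'acc' only if both safety and persons are above 1
y_toy = ['acc' if s > 1 and p > 1 else 'unacc'
         for s, p in zip(X_toy['safety'], X_toy['persons'])]

# Same hyperparameters as model_gini in the notebook
toy_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Render the fitted tree as indented if/else rules
rules = export_text(toy_tree, feature_names=list(X_toy.columns))
print(rules)
```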

In [45]:
# Predict the Test set results with criterion gini index

y_pred_gini = model_gini.predict(X_test)

In [47]:
# Check accuracy score with criterion gini index
from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))

Model accuracy score with criterion gini index: 0.8021


In [49]:
# Compare the train-set and test-set accuracy

y_pred_train_gini = model_gini.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_gini)))

Training-set accuracy score: 0.7865


In [51]:
# Check for overfitting and underfitting


print('Training set score: {:.4f}'.format(model_gini.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(model_gini.score(X_test, y_test)))

Training set score: 0.7865
Test set score: 0.8021


* Here, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
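
Accuracy alone can hide how the model treats the rare classes. A confusion matrix and a per-class report are one way to look closer. The sketch below uses sklearn.metrics.confusion_matrix and classification_report on short hypothetical label lists (y_true_demo and y_pred_demo are made up for illustration); in the notebook, y_test and y_pred_gini would take their place.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Fix the class order so rows and columns of the matrix are easy to read
labels = ['unacc', 'acc', 'good', 'vgood']

# Hypothetical true and predicted labels, not taken from the notebook
y_true_demo = ['unacc', 'unacc', 'unacc', 'acc', 'acc', 'good', 'vgood', 'unacc']
y_pred_demo = ['unacc', 'unacc', 'acc', 'acc', 'unacc', 'acc', 'acc', 'unacc']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Precision, recall and F1 per class; zero_division=0 avoids warnings
# for classes that are never predicted
print(classification_report(y_true_demo, y_pred_demo, labels=labels, zero_division=0))
```

In this made-up example the classes good and vgood are never predicted, which the overall accuracy would not reveal on its own.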

####  Decision Tree Classifier with criterion entropy


In [53]:
# instantiate the DecisionTreeClassifier model with criterion entropy

model_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

# fit the model
model_entropy.fit(X_train, y_train)

In [54]:
# Predict the Test set results with criterion entropy

y_pred_entropy = model_entropy.predict(X_test)

In [55]:
# Check accuracy score with criterion entropy

from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion entropy: {0:0.4f}'.format(accuracy_score(y_test, y_pred_entropy)))

Model accuracy score with criterion entropy: 0.8021


In [59]:
# Compare the train-set and test-set accuracy

y_pred_train_entropy = model_entropy.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_entropy)))


Training-set accuracy score: 0.7865


In [61]:
# Check for overfitting and underfitting

print('Training set score: {:.4f}'.format(model_entropy.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(model_entropy.score(X_test, y_test)))

Training set score: 0.7865
Test set score: 0.8021


* The training-set and test-set scores are the same as with criterion gini: the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
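
Both criteria measure how mixed the classes in a node are, which is one reason they often lead to similar trees. As a worked illustration, the sketch below evaluates the standard Gini impurity, 1 - sum(p_i^2), and the entropy, -sum(p_i * log2(p_i)), on the class counts of the full dataset. The helper names gini_impurity and entropy are mine, and the counts cover all 1728 rows rather than the training split, so the numbers will differ slightly from what the fitted trees see at their root.

```python
import math


def gini_impurity(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing to the sum
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Class counts of the full dataset: unacc, acc, good, vgood
class_counts = [1210, 384, 69, 65]

root_gini = gini_impurity(class_counts)
root_entropy = entropy(class_counts)

print('Gini impurity: {:.4f}'.format(root_gini))
print('Entropy: {:.4f}'.format(root_entropy))
```

Both measures are 0 for a pure node and grow as the classes become more evenly mixed; a split is chosen to reduce the weighted impurity of the child nodes.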

### Results and conclusion

1. Both models reach a test-set accuracy of 0.8021. For comparison, always predicting the majority class unacc would score about 0.70 on the full dataset (1210 of 1728 rows).
2. In the model with criterion gini index, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
3. Similarly, in the model with criterion entropy, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021, the same values as with criterion gini. Again, there is no sign of overfitting.
4. Both criteria give identical training-set and test-set accuracy scores. This may be because the dataset is small and both trees are limited to max_depth=3.
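
Since a single split of a small dataset gives a noisy estimate, k-fold cross-validation is one way to check how stable such scores are and whether max_depth=3 is limiting the model. The sketch below is only an illustration under strong assumptions: it uses make_classification to generate stand-in data with the same number of rows, features and classes, so its scores say nothing about the car data. With the notebook's data, the encoded features and the class column would take the place of X_syn and y_syn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 1728 rows, 6 features, 4 classes (not the car data)
X_syn, y_syn = make_classification(
    n_samples=1728, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0
)

# 5-fold cross-validated accuracy for a few tree depths (None = unlimited)
cv_results = {}
for depth in [3, 5, None]:
    tree = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_syn, y_syn, cv=5, scoring='accuracy')
    cv_results[depth] = scores
    print('max_depth={}: mean accuracy {:.4f} (std {:.4f})'.format(
        depth, scores.mean(), scores.std()))
```

The spread across folds gives a sense of how much a single train/test split can move the reported accuracy.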


