# CE880: An Approachable Introduction to Data Science
### Prepared by: Haider Raza (h.raza@essex.ac.uk)
### Approximate time: 120 minutes

***
## Learning Outcome

* Decision Tree
***
# Introduction to Decision Tree algorithm

The Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a tree-like structure of decisions and their possible outcomes to solve a particular problem. It belongs to the class of supervised learning algorithms and can be used for both classification and regression.

A decision tree is a structure that includes a root node, internal (decision) nodes, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes an outcome of that test, and each leaf node holds a class label. The topmost node in the tree is the root node.
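The structure described above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn stores its trees: the `Node` and `Leaf` classes, the `predict` helper, and the toy tree itself are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class label held by a leaf node


@dataclass
class Node:
    attribute: str                            # attribute tested at this node
    branches: Dict[str, Union["Node", Leaf]]  # test outcome -> subtree


def predict(tree, sample):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[sample[tree.attribute]]
    return tree.label


# Hypothetical toy tree: the root tests `safety`, one internal node tests `persons`.
toy_tree = Node("safety", {
    "low": Leaf("unacc"),
    "high": Node("persons", {"2": Leaf("unacc"), "4": Leaf("acc")}),
})

print(predict(toy_tree, {"safety": "high", "persons": "4"}))  # acc
```

Prediction starts at the root and moves down one branch per test, so the number of tests for a sample is at most the depth of the tree.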

### Decision-Tree terminology

![Decision-Tree terminology](https://gdcoder.com/content/images/2019/05/Screen-Shot-2019-05-18-at-03.40.41.png)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
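# path to the dataset; the file must exist at this location
# (in Colab, upload car_evaluation.csv to /content before running this cell)
data = '/content/car_evaluation.csv'
# the file has no header row, so columns are numbered 0, 1, 2, ...
df = pd.read_csv(data, header=None)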

In [None]:
# view dimensions of dataset
df.shape

We can see that there are 1728 instances and 7 variables in the data set.

### View top 5 rows of dataset

In [None]:
# preview the dataset
df.head()

### Rename column names

We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2, and so on. We should give the columns proper names. I will do it as follows:

In [None]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.columns = col_names
col_names

In [None]:
# let's again preview the dataset
df.head()

We can see that the column names are renamed. Now, the columns have meaningful names.

### View summary of dataset

In [None]:
df.info()

### Frequency distribution of values in variables

Now, I will check the frequency counts of categorical variables.

In [None]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
for col in col_names:
    print(df[col].value_counts())


We can see that `doors` and `persons` are categorical in nature, so I will treat them as categorical variables.

### Summary of variables


- There are 7 variables in the dataset. All the variables are of categorical data type.


- These are given by `buying`, `maint`, `doors`, `persons`, `lug_boot`, `safety` and `class`.


- `class` is the target variable.

### Explore `class` variable

In [None]:
df['class'].value_counts()

The `class` target variable is ordinal in nature.

### Missing values in variables

In [None]:
# check missing values in variables

df.isnull().sum()

We can see that there are no missing values in the dataset. I have checked the frequency distribution of values previously. It also confirms that there are no missing values in the dataset.

# Declare feature vector and target variable

In [None]:
X = df.drop(['class'], axis=1)
y = df['class']

# Split data into separate training and test set

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# check the shape of X_train and X_test
X_train.shape, X_test.shape
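The shapes printed by the cell above can be anticipated with a little arithmetic. The sketch below assumes that scikit-learn rounds the test count up when only `test_size` is given and assigns the remaining rows to training, which is my understanding of its behaviour rather than something stated in this notebook.

```python
import math

n_samples = 1728   # rows reported by df.shape earlier in the notebook
test_size = 0.33   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # assumed rule: test count is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)  # 1157 571
```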

# Feature Engineering

**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.

First, I will check the data types of variables again.

In [None]:
# check data types in X_train
X_train.dtypes

### Encode categorical variables


Now, I will encode the categorical variables.

In [None]:
X_train.head()

We can see that all the variables are of ordinal categorical data type.

In [None]:
# import category encoders
!pip install category-encoders
import category_encoders as ce

In [None]:
# encode variables with ordinal encoding
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
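Because the variables are ordinal, it can help to make the order explicit instead of relying on whatever integers the encoder assigns by default, which may not follow the natural ranking of the categories. The following is a minimal sketch in plain Python. The category orders in `ORDERS` are an assumption based on the UCI Car Evaluation description and should be checked against the `value_counts()` output above; `encode_row` is a hypothetical helper, not part of `category_encoders`.

```python
# Assumed category orders (lowest to highest); verify against the dataset.
ORDERS = {
    "buying":   ["low", "med", "high", "vhigh"],
    "maint":    ["low", "med", "high", "vhigh"],
    "doors":    ["2", "3", "4", "5more"],
    "persons":  ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety":   ["low", "med", "high"],
}


def encode_row(row):
    """Map each categorical value to its 1-based rank in the assumed order."""
    return {col: ORDERS[col].index(str(value)) + 1 for col, value in row.items()}


sample = {"buying": "vhigh", "maint": "low", "doors": "2",
          "persons": "more", "lug_boot": "med", "safety": "high"}
print(encode_row(sample))
# {'buying': 4, 'maint': 1, 'doors': 1, 'persons': 3, 'lug_boot': 2, 'safety': 3}
```

If this ordering matters for the model, the same ranks could be supplied to the encoder through its `mapping` argument rather than applied by hand.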

In [None]:
X_train.head()

In [None]:
X_test.head()

We now have training and test set ready for model building.

## Decision Tree Classifier with criterion gini index

In [None]:
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# instantiate the DecisionTreeClassifier model with criterion gini index
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# fit the model
clf_gini.fit(X_train, y_train)
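To make the `gini` criterion concrete, here is a small sketch of the Gini impurity, computed as one minus the sum of squared class proportions, together with the size-weighted impurity of a binary split. The function names and the label lists are hypothetical and serve only as an illustration; scikit-learn's internal implementation is more involved.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a binary split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical labels, for illustration only
parent = ["acc", "acc", "unacc", "unacc"]
print(gini(parent))                                    # 0.5
print(split_gini(["acc", "acc"], ["unacc", "unacc"]))  # 0.0
```

A split is attractive when its weighted impurity is well below the impurity of the parent node; here a perfectly mixed node (0.5) is separated into two pure children (0.0).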

### Predict the Test set results with criterion gini index

In [None]:
y_pred_gini = clf_gini.predict(X_test)

### Check accuracy score with criterion gini index

In [None]:
from sklearn.metrics import accuracy_score
print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))

Here, **y_test** are the true class labels and **y_pred_gini** are the predicted class labels in the test-set.
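As a reminder of what the accuracy score measures, it is simply the fraction of predictions that match the true labels. The sketch below uses a hypothetical `accuracy` helper and made-up labels; it is not a replacement for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only
y_true = ["unacc", "acc", "unacc", "good"]
y_pred = ["unacc", "unacc", "unacc", "good"]
print(accuracy(y_true, y_pred))  # 0.75
```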

### Compare the train-set and test-set accuracy


Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [None]:
y_pred_train_gini = clf_gini.predict(X_train)
y_pred_train_gini

In [None]:
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_gini)))

### Check for overfitting and underfitting

In [None]:
# print the scores on training and test set
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

Here, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.


### Visualize decision-trees

In [None]:
from sklearn import tree
plt.figure(figsize=(12,8))
# plot the already-fitted tree; there is no need to refit the model here
tree.plot_tree(clf_gini)

### Visualize decision-trees with graphviz

In [None]:
import graphviz
dot_data = tree.export_graphviz(clf_gini, out_file=None,
                              feature_names=list(X_train.columns),
                              # one name per class, in the order the classifier uses
                              class_names=list(clf_gini.classes_),
                              filled=True, rounded=True,
                              special_characters=True)

graph = graphviz.Source(dot_data)
graph