<a href="https://colab.research.google.com/github/Priyo-prog/Machine-Learning/blob/main/Decision%20Tree/Decision_Tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree Classification

* In this lesson we are going to use **scikit-learn** and **Cost-Complexity Pruning** to build a classification tree.
* **Classification Trees** are an exceptionally useful machine learning method when you need to know how the decisions are being made.


## Import libraries and modules

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay instead

## Import the data

* We are going to load the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Specifically, we are going to use the [Heart Disease Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data). This dataset will allow us to predict whether someone has heart disease based on their age, sex, blood pressure and a variety of other metrics.

* We need to replace the column numbers with the following column names:
 * age
 * sex
 * cp, chest pain
 * restbp, resting blood pressure (in mm Hg)
 * chol, serum cholesterol in mg/dl
 * fbs, fasting blood sugar
 * restecg, resting electrocardiographic results
 * thalach, maximum heart rate achieved
 * exang, exercise induced angina
 * oldpeak, ST depression induced by exercise relative to rest
 * slope, the slope of the peak exercise ST segment
 * ca, number of major vessels (0-3) colored by fluoroscopy
 * thal, short for thallium heart scan
 * hd, diagnosis of heart disease, the predicted attribute

In [16]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [17]:
## Change the column numbers to column names
df.columns = ['age',
              'sex',
              'cp',
              'restbp',
              'chol',
              'fbs',
              'restecg',
              'thalach',
              'exang',
              'oldpeak',
              'slope',
              'ca',
              'thal',
              'hd']
df.head()              

Unnamed: 0,age,sex,cp,restbp,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,hd
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


## Missing Data Part 1 : Identifying missing data

This part will involve the process of identifying and dealing with **Missing Data**.

**Missing Data** is simply a blank space, or a surrogate value like **NA**, that indicates that we failed to collect data for one of the features.

There are two main ways to deal with missing data:
1. We can remove the rows that contain missing data from the dataset. This is relatively easy to do, but it wastes all other values that we collected.

2. We can **impute** the values that are missing. In this context **impute** is just a fancy way of saying "we can make an educated guess about what the value should be".
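
The two options can be sketched on a small, made-up dataframe. The name `toy` and its values are illustrative only and are not taken from the heart disease data. The sketch assumes missing values are stored as `NaN` and uses the column median as the "educated guess"; other imputation strategies are equally possible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'chol'
toy = pd.DataFrame({'age': [63.0, 67.0, 37.0, 41.0],
                    'chol': [233.0, np.nan, 250.0, 204.0]})

# Option 1: remove every row that contains a missing value
toy_dropped = toy.dropna()

# Option 2: impute the missing value, here with the column median
toy_imputed = toy.fillna({'chol': toy['chol'].median()})

print(len(toy_dropped))             # rows left after dropping
print(toy_imputed.loc[1, 'chol'])   # the value filled in for row 1
```

Dropping loses the `age` value recorded in row 1, while imputing keeps that row at the cost of inventing a `chol` value for it.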


In [18]:
## dtypes tell us the "data type" for each column
df.dtypes

age        float64
sex        float64
cp         float64
restbp     float64
chol       float64
fbs        float64
restecg    float64
thalach    float64
exang      float64
oldpeak    float64
slope      float64
ca          object
thal        object
hd           int64
dtype: object

We can see that almost all of the columns are float64. However, two columns, **ca** and **thal**, have the object type, and one column, **hd**, has int64.

Object datatypes are used when a column holds a mixture of things, like a mixture of numbers and letters.
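
As a small illustration of this behaviour, the sketch below builds a hypothetical series (the name `mixed` and its values are made up) that mixes numbers with a `'?'` placeholder. It also shows `pd.to_numeric` with `errors='coerce'`, one possible way to turn unparseable entries into `NaN`; the lesson itself does not rely on it.

```python
import pandas as pd

# Hypothetical column: numbers mixed with a '?' placeholder
mixed = pd.Series([0.0, 3.0, '?', 1.0])
print(mixed.dtype)   # pandas falls back to the generic object dtype

# errors='coerce' converts anything that cannot be parsed into NaN
as_numbers = pd.to_numeric(mixed, errors='coerce')
print(as_numbers.dtype)
print(as_numbers.isna().sum())   # how many entries could not be parsed
```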

In [19]:
# print out unique values in the column called 'ca'
df['ca'].unique()

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

## Missing Data Part 2 : Dealing with missing data

Since scikit-learn classification trees do not support datasets with missing values, we need to figure out what to do with these question marks. We can either delete these patients from the training dataset, or impute values for the missing data.

In [20]:
## print the number of rows that contain missing values
##
## loc[], short for "location", lets us specify which rows we want...
## and so we say we want any row with '?' in column 'ca'
## OR
## any row with '?' in column 'thal'
##
## len(), short for "length", returns the number of rows
len(df.loc[(df['ca']=='?')
            |
           (df['thal']=='?')])

6

Since only 6 rows have missing values, let's look at them.

In [21]:
df.loc[(df['ca']=='?')
            |
           (df['thal']=='?')]

Unnamed: 0,age,sex,cp,restbp,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,hd
87,53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
166,52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0
192,43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1
266,52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,2
287,58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0


Let's count the number of rows in the full dataset.

In [22]:
len(df)

303

So 6 of the **303** rows, or **2%**, contain missing values. Since **303 - 6 = 297**, and **297** is plenty of data to build a classification tree, we will remove the rows with missing values rather than try to impute them.
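
The arithmetic can be checked directly. The variable names below are illustrative; the counts are the ones reported above.

```python
n_total = 303    # rows in the full dataset
n_missing = 6    # rows with '?' in 'ca' or 'thal'

n_complete = n_total - n_missing
pct_missing = 100 * n_missing / n_total

print(n_complete)
print(round(pct_missing, 2))   # percentage of rows with missing values
```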

In [23]:
## use loc[] to select all the rows that do not contain missing values
## and save them in a new dataframe called "df_no_missing"
df_no_missing = df.loc[(df['ca'] != '?')
                       & 
                       (df['thal'] != '?')]

In [24]:
len(df_no_missing)

297

In [25]:
df_no_missing['ca'].unique(), df_no_missing['thal'].unique()

(array(['0.0', '3.0', '2.0', '1.0'], dtype=object),
 array(['6.0', '3.0', '7.0'], dtype=object))
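
The boolean-mask filter is not the only way to drop these rows. A possible alternative, sketched here on made-up rows in the same style as `ca` and `thal` (the dataframe `toy` is hypothetical), is to turn the `'?'` placeholder into `NaN` first and then call `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: strings, with '?' marking a missing value
toy = pd.DataFrame({'ca': ['0.0', '?', '2.0', '1.0'],
                    'thal': ['6.0', '3.0', '?', '7.0'],
                    'hd': [0, 2, 1, 0]})

# Replace the placeholder with NaN, then drop rows missing either column
toy_no_missing = toy.replace('?', np.nan).dropna(subset=['ca', 'thal'])

# The boolean-mask approach keeps the same rows
mask_version = toy.loc[(toy['ca'] != '?') & (toy['thal'] != '?')]
print(toy_no_missing.index.tolist() == mask_version.index.tolist())
```

Either route keeps the original row labels, so the index of the result has gaps where rows were removed.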

## Format Data Part 1 : Split the data into features and labels

In [26]:
## make a new copy of the columns used to make predictions
X = df_no_missing.drop('hd', axis=1).copy() # alternatively df_no_missing.iloc[:, :-1]
X.head()

Unnamed: 0,age,sex,cp,restbp,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0


In [27]:
## Make a new column of the data we want to predict
y = df_no_missing['hd'].copy()
y.head()

0    0
1    2
2    1
3    0
4    0
Name: hd, dtype: int64

## Format the Data Part 2 : One-Hot Encoding

Now that we have split the dataframe into two pieces, X, which contains the data we will use to predict classifications, and y, which contains the known classifications in our training dataset, we need to take a closer look at the variables in X. The list below tells us what each variable represents and the type of data (float or categorical) it should contain:

* **age**, **Float**

* **sex**, **Category**
  * 0 = female
  * 1 = male

* **cp**, chest pain, **Category**
  * 1 = typical angina
  * 2 = atypical angina
  * 3 = non-anginal pain
  * 4 = asymptomatic

* **restbp**, resting blood pressure (in mm Hg), **Float**

* **chol**, serum cholesterol in mg/dl, **Float**

* **fbs**, fasting blood sugar, **Category**
  * 0 = <= 120 mg/dl
  * 1 = > 120 mg/dl

* **restecg**, resting electrocardiographic results, **Category**






In [28]:
X.dtypes

age        float64
sex        float64
cp         float64
restbp     float64
chol       float64
fbs        float64
restecg    float64
thalach    float64
exang      float64
oldpeak    float64
slope      float64
ca          object
thal        object
dtype: object

At this point you may be wondering, "What is wrong with treating categorical data like continuous data?" To answer that question, let's look at an example.

For **cp**(chest pain) column we have 4 options:
1. typical angina
2. atypical angina
3. non-anginal pain
4. asymptomatic

If we treated these values, 1, 2, 3 and 4, like continuous data, then we would assume that 4, which means "asymptomatic", is more similar to 3, which means "non-anginal pain", than it is to 1 or 2. That means the decision tree would be more likely to cluster the patients with 4s and 3s together than the patients with 4s and 1s together. In contrast, if we treat these numbers like categorical data, then we treat each one as a separate category that is no more or less similar to any of the other categories. Thus the likelihood of clustering patients with 4s with 3s is the same as clustering 4s with 1s, and that approach is more reasonable.
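
A minimal sketch of this difference, using only the four chest-pain codes listed above. It assumes a tree splits a numeric feature on a threshold, which is how scikit-learn trees handle numeric columns; the thresholds shown are simply the midpoints between adjacent codes.

```python
import pandas as pd

# The four chest-pain codes
cp = pd.Series([1.0, 2.0, 3.0, 4.0], name='cp')

# Treated as continuous: every split is a threshold, so it can only
# separate lower codes from higher ones (4 always lands next to 3)
for threshold in [1.5, 2.5, 3.5]:
    left = cp[cp <= threshold].tolist()
    right = cp[cp > threshold].tolist()
    print(threshold, left, right)

# Treated as categorical: one indicator column per code, so a single
# split can isolate any one category regardless of its numeric code
cp_encoded = pd.get_dummies(cp, prefix='cp', dtype=int)
print(cp_encoded)
```

With the threshold view, no single split puts 1 and 4 on one side and 2 and 3 on the other; with indicator columns, each category gets its own column to split on.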

In [29]:
X['cp'].unique()

array([1., 4., 3., 2.])

In [30]:
## For this use case, we will use get_dummies() to do One-hot Encoding

pd.get_dummies(X, columns=['cp']).head()

Unnamed: 0,age,sex,restbp,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,cp_1.0,cp_2.0,cp_3.0,cp_4.0
0,63.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,1,0,0,0
1,67.0,1.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,0,0,0,1
2,67.0,1.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,0,0,0,1
3,37.0,1.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,0,1,0
4,41.0,0.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,1,0,0


In [31]:
X_encoded = pd.get_dummies(X, columns=['cp',
                                       'restecg',
                                       'slope',
                                       'thal'])
X_encoded.head()

Unnamed: 0,age,sex,restbp,chol,fbs,thalach,exang,oldpeak,ca,cp_1.0,...,cp_4.0,restecg_0.0,restecg_1.0,restecg_2.0,slope_1.0,slope_2.0,slope_3.0,thal_3.0,thal_6.0,thal_7.0
0,63.0,1.0,145.0,233.0,1.0,150.0,0.0,2.3,0.0,1,...,0,0,0,1,0,0,1,0,1,0
1,67.0,1.0,160.0,286.0,0.0,108.0,1.0,1.5,3.0,0,...,1,0,0,1,0,1,0,1,0,0
2,67.0,1.0,120.0,229.0,0.0,129.0,1.0,2.6,2.0,0,...,1,0,0,1,0,1,0,0,0,1
3,37.0,1.0,130.0,250.0,0.0,187.0,0.0,3.5,0.0,0,...,0,1,0,0,0,0,1,1,0,0
4,41.0,0.0,130.0,204.0,0.0,172.0,0.0,1.4,0.0,0,...,0,0,0,1,1,0,0,1,0,0
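
Depending on the installed pandas version, `get_dummies()` may return `True`/`False` columns rather than the 0/1 integers shown above, because pandas 2.0 changed the default output type to `bool`. The sketch below, on a made-up dataframe named `toy`, passes `dtype=int` to request 0/1 columns explicitly. Whether this is needed depends on your environment, and a decision tree accepts either form.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({'age': [63.0, 67.0, 37.0],
                    'slope': [3.0, 2.0, 3.0]})

# dtype=int asks for 0/1 integer indicator columns
toy_encoded = pd.get_dummies(toy, columns=['slope'], dtype=int)
print(toy_encoded)
```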
