# Classification With Decision Trees, From Start to Finish

Author: ***Soroush Ghaderi***

In this lesson we will use **scikit-learn** and **Cost Complexity Pruning** to build a **Classification Tree** that uses continuous and categorical data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) to predict whether or not a patient has [heart disease](https://archive.ics.uci.edu/ml/datasets/heart+disease).

**Classification Trees** are an exceptionally useful machine learning method when you need to know how the decisions are being made. For example, if you have to justify the predictions to your boss, classification trees are a good method because each step in the decision-making process is easy to understand.

In this notebook we will cover:

- **Importing Data**


- **Missing Data**

    - Identifying Missing Data
    - Dealing with Missing Data

- **Formatting the Data for Decision Trees**

    - Splitting data into Dependent and Independent Variables
    - One-Hot Encoding
    
- **Building a Preliminary Classification Tree**


- **Optimizing the tree with Cost Complexity Pruning**

    - Visualizing Alpha
    - Using Cross Validation to find the best value for Alpha
    
- **Building, Drawing, Interpreting and Evaluating the Final Classification Tree**

The very first thing we do is load a bunch of modules. Python itself just gives us a basic programming language; these modules give us the extra functionality to import the data, clean it up and format it, and then build, evaluate and draw the classification tree.

**NOTE:** If your version of `scikit-learn` is older than 0.22.1, then the easiest thing to do is update all of your **Anaconda** packages with the following command: `conda update --all`. However, if you only want to update `scikit-learn`, you can run this command: `conda install scikit-learn=0.22.1`.
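
If you are not sure which version you have, a check along these lines can help. It is only a minimal sketch: `version_tuple` and `MINIMUM` are hypothetical names introduced here, and the parsing handles only simple dotted version strings.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (0, 22, 1)  # the version named in the note above


def version_tuple(text):
    """Convert a version string such as '0.22.1' to a tuple of ints.

    Hypothetical helper: only the leading digits of the first three
    dotted parts are kept, so '1.4.0rc1' becomes (1, 4, 0).
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


try:
    installed = version("scikit-learn")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("scikit-learn is not installed")
elif version_tuple(installed) < MINIMUM:
    print(f"scikit-learn {installed} is older than 0.22.1; consider updating")
else:
    print(f"scikit-learn {installed} is recent enough for this lesson")
```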

In [2]:
import pandas as pd # to load and manipulate data and for one-hot encoding
import numpy as np # to calculate the mean and standard deviation
import matplotlib.pyplot as plt # to draw graphs
from sklearn.tree import DecisionTreeClassifier # to build a classification tree
from sklearn.tree import plot_tree # to draw a classification tree
from sklearn.model_selection import train_test_split # to split data into training and testing sets
from sklearn.model_selection import cross_val_score # for cross validation
from sklearn.metrics import confusion_matrix # to create a confusion matrix
from sklearn.metrics import plot_confusion_matrix # to draw a confusion matrix (removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay)

In [6]:
# the data file is already downloaded, so we can load it from disk
heart = pd.read_csv("data/processed.cleveland.data", header=None)
# read_csv also accepts a URL, so the data set could be loaded straight from the web

Now that the data is in the `heart` DataFrame, let's look at the first five rows using the `head()` method.

In [8]:
# print the first 5 rows
heart.head()

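|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |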
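
The integer labels across the top come from passing `header=None` to `read_csv`. As a small illustration, the sketch below reads the first two rows shown above from an in-memory string instead of the real file; `sample` and `preview` are names made up for this example.

```python
import io

import pandas as pd

# First two rows of the output above, written as comma-separated text.
sample = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)

# header=None tells pandas that the first line is data, not column names,
# so the columns are labelled with integers starting at 0.
preview = pd.read_csv(io.StringIO(sample), header=None)

print(preview.columns.tolist())
print(preview.shape)
```

Without `header=None`, pandas would treat the first patient's values as column names and that row would be lost from the data.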


We see that instead of nice column names, we just have column numbers. Since nice column names would make it easier to know how to format the data, let's replace the column numbers with the following column names:
* **age**
* **sex**
* **cp**, chest pain
* **restbp**, resting blood pressure (in mm Hg)
* **chol**, serum cholesterol in mg/dl
* **fbs**, fasting blood sugar
* **restecg**, resting electrocardiographic results
* **thalach**, maximum heart rate achieved
* **exang**, exercise induced angina
* **oldpeak**, ST depression induced by exercise relative to rest
* **slope**, the slope of the peak exercise ST segment
* **ca**, number of major vessels (0-3) colored by fluoroscopy
* **thal**, short for thallium heart scan
* **hd**, diagnosis of heart disease, the predicted attribute
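
One way to apply these names is to assign a list to the DataFrame's `columns` attribute. The sketch below assumes the names are given in the same order as the list above. So that it runs on its own, it rebuilds a two-row stand-in for `heart` from the rows printed earlier rather than reading the real file; in the notebook you would only need the assignment to `heart.columns`.

```python
import io

import pandas as pd

# Two-row stand-in for the real data, copied from the output shown earlier.
sample_rows = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)
heart = pd.read_csv(io.StringIO(sample_rows), header=None)

# Replace the integer column labels with descriptive names.
# The list must have exactly one name per column (14 here).
heart.columns = [
    "age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd",
]

print(heart.head())
```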