# Decision Trees

In this notebook, you will explore a machine learning algorithm called the decision tree. You will use this classification algorithm to build a model from historical patient data, including each patient's response to different medications. After training the decision tree, you will use the model to predict the appropriate drug class for a new patient.


## Overview of Decision Trees
Decision trees are a type of supervised learning algorithm used for classification and regression tasks. They work by splitting the data into subsets based on the values of input features. This process continues recursively, producing a tree structure in which each internal node represents a decision based on a feature and each leaf node represents a class label (for classification) or a continuous value (for regression).
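
To make the idea of choosing a split more concrete, here is a minimal sketch of entropy and information gain, the quantities behind an entropy-based split criterion. The helper names `entropy` and `information_gain` are hypothetical, and the toy values are rounded from the first rows displayed later in this notebook, so treat the numbers as illustrative only.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` into mask / ~mask."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy values rounded from the first eight rows shown in this notebook.
na_to_k = np.array([25.4, 13.1, 10.1, 7.8, 18.0, 8.6, 16.3, 11.0])
drug = np.array(["Y", "C", "C", "X", "Y", "X", "Y", "C"])

# Candidate split: Na_to_K above 15 versus the rest (threshold chosen by hand).
gain = information_gain(drug, na_to_k > 15.0)
print(f"parent entropy: {entropy(drug):.3f} bits")
print(f"gain for Na_to_K > 15: {gain:.3f} bits")
```

A tree builder evaluates many candidate splits like this one and keeps the split with the largest gain, then repeats the process on each resulting subset.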

### Steps:
1. ***Loading and Exploring the Dataset:*** Load the dataset containing historical patient data, including features like age, sex, blood pressure, cholesterol levels, and drug response.

2. ***Data Preprocessing:*** Clean and preprocess the data: handle missing values, encode categorical variables, and prepare the data for training.

3. ***Splitting the Data:*** Split the dataset into training and testing sets. The training set is used to train the decision tree model, while the testing set is used to evaluate its performance.

4. ***Building and Training the Decision Tree Model:*** Use scikit-learn to create and train a decision tree classifier on the training data. The model learns to classify patients based on their features and drug responses.

5. ***Model Evaluation:*** Evaluate the performance of the decision tree model using metrics such as accuracy, precision, recall, and F1-score. These metrics help assess how well the model predicts the correct drug class for new patients.

6. ***Prediction:*** Use the trained decision tree model to make predictions on new patient data. This helps identify the appropriate drug for a new patient based on their features.

### Implementation in Python (using scikit-learn)
The rest of this notebook implements a decision tree classifier in Python using scikit-learn.

We first load the required libraries.

In [40]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

### Dataset Description

Imagine you have collected data on a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.

Your task is to build a model that determines which drug might be appropriate for a future patient with the same illness. The features are the patient's Age, Sex, Blood Pressure (BP), Cholesterol level, and sodium-to-potassium ratio (Na_to_K), while the target variable is the drug to which the patient responded.


# Load the dataset


In [41]:
# Path to the CSV file (adjust to wherever drug.csv is stored)
drugfile = os.path.join('/content', 'drug.csv')
df = pd.read_csv(drugfile)
df.head(20)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
5,22,F,NORMAL,HIGH,8.607,drugX
6,49,F,NORMAL,HIGH,16.275,drugY
7,41,M,LOW,HIGH,11.037,drugC
8,60,M,NORMAL,HIGH,15.171,drugY
9,43,M,LOW,NORMAL,19.368,drugY


# Data Visualization and Analysis
Let’s explore how many samples of each class are in the dataset.

In [42]:
df['Drug'].value_counts()

Drug
drugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: count, dtype: int64
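
The counts above show that the classes are imbalanced, which is worth keeping in mind when reading an accuracy score later. As a rough sketch, the snippet below rebuilds the counts by hand (copied from the output above, not read from the CSV) and derives the class proportions and the accuracy of a trivial model that always predicts the most common drug. The variable names are hypothetical.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {"drugY": 91, "drugX": 54, "drugA": 23, "drugC": 16, "drugB": 16},
    name="count",
)

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy of always predicting the most frequent class.
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained classifier should comfortably beat this baseline to be considered useful.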

# Feature set
Let's define the feature set, X:

In [43]:
df.columns

Index(['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K', 'Drug'], dtype='object')

To prepare the data for scikit-learn, we convert the selected columns of the pandas DataFrame to a NumPy array:

In [44]:
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043],
       [22, 'F', 'NORMAL', 'HIGH', 8.607],
       [49, 'F', 'NORMAL', 'HIGH', 16.275],
       [41, 'M', 'LOW', 'HIGH', 11.037],
       [60, 'M', 'NORMAL', 'HIGH', 15.171],
       [43, 'M', 'LOW', 'NORMAL', 19.368],
       [47, 'F', 'LOW', 'HIGH', 11.767],
       [34, 'F', 'HIGH', 'NORMAL', 19.199],
       [43, 'M', 'LOW', 'HIGH', 15.376],
       [74, 'F', 'LOW', 'HIGH', 20.942],
       [50, 'F', 'NORMAL', 'HIGH', 12.703],
       [16, 'F', 'HIGH', 'NORMAL', 15.516],
       [69, 'M', 'LOW', 'NORMAL', 11.455],
       [43, 'M', 'HIGH', 'HIGH', 13.972],
       [23, 'M', 'LOW', 'HIGH', 7.298],
       [32, 'F', 'HIGH', 'NORMAL', 25.974],
       [57, 'M', 'LOW', 'NORMAL', 19.128],
       [63, 'M', 'NORMAL', 'HIGH', 25.917],
       [47, 'M', 'LOW', 'NORMAL', 30.568],
       [48, 'F', 'LOW',

# Handling Categorical Features

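As you may notice, some features in this dataset are categorical, such as Sex or Blood Pressure (BP). Unfortunately, scikit-learn's decision trees do not handle categorical variables directly.

However, we can convert these features to numerical values with a LabelEncoder, which assigns integer codes to the categories in alphabetical order (for BP: HIGH becomes 0, LOW becomes 1, NORMAL becomes 2).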



In [45]:
from sklearn import preprocessing

# LabelEncoder assigns integer codes in sorted (alphabetical) order.
# Sex: F=0, M=1
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

# BP: HIGH=0, LOW=1, NORMAL=2
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

# Cholesterol: HIGH=0, NORMAL=1
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043],
       [22, 0, 2, 0, 8.607],
       [49, 0, 2, 0, 16.275],
       [41, 1, 1, 0, 11.037],
       [60, 1, 2, 0, 15.171],
       [43, 1, 1, 1, 19.368],
       [47, 0, 1, 0, 11.767],
       [34, 0, 0, 1, 19.199],
       [43, 1, 1, 0, 15.376],
       [74, 0, 1, 0, 20.942],
       [50, 0, 2, 0, 12.703],
       [16, 0, 0, 1, 15.516],
       [69, 1, 1, 1, 11.455],
       [43, 1, 0, 0, 13.972],
       [23, 1, 1, 0, 7.298],
       [32, 0, 0, 1, 25.974],
       [57, 1, 1, 1, 19.128],
       [63, 1, 2, 0, 25.917],
       [47, 1, 1, 1, 30.568],
       [48, 0, 1, 0, 15.036],
       [33, 0, 1, 0, 33.486],
       [28, 0, 0, 1, 18.809],
       [31, 1, 0, 0, 30.366],
       [49, 0, 2, 1, 9.381],
       [39, 0, 1, 1, 22.697],
       [45, 1, 1, 0, 17.951],
       [18, 0, 2, 1, 8.75],
       [74, 1, 0, 0, 9.567],
       [49, 1, 1, 1, 11.014],
       [65, 0, 0,
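
Note that the codes produced above do not follow the natural order of the categories: LOW, NORMAL, HIGH is encoded as 1, 2, 0. If an ordered encoding is preferred, one possible alternative (not what this notebook uses) is scikit-learn's `OrdinalEncoder` with an explicit category order. The sketch below compares the two on a small hand-made array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Small hand-made sample of blood pressure values (illustrative only).
bp = np.array(["HIGH", "LOW", "NORMAL", "LOW"])

# LabelEncoder sorts the classes, so the codes follow alphabetical order.
le = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
label_codes = le.transform(bp)

# OrdinalEncoder accepts an explicit category order per feature column.
oe = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
ordinal_codes = oe.fit_transform(bp.reshape(-1, 1)).ravel()

print(label_codes)    # alphabetical codes
print(ordinal_codes)  # codes that follow LOW < NORMAL < HIGH
```

With ordered codes, a single threshold such as "BP code <= 0.5" isolates LOW from the rest, whereas the alphabetical codes may need more than one split to express the same rule.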

# Class Labels

In [46]:
y = df["Drug"]
y

0      drugY
1      drugC
2      drugC
3      drugX
4      drugY
       ...  
195    drugC
196    drugC
197    drugX
198    drugX
199    drugX
Name: Drug, Length: 200, dtype: object

# Train Test Split
Test accuracy is the percentage of correct predictions the model makes on data it was not trained on. Evaluating a model on the same data it was trained on usually gives an overly optimistic score, because an overfit model can memorize the training examples while generalizing poorly.

High test accuracy matters because the purpose of a model is to make accurate predictions on new, unseen data. A reliable way to estimate it is the train/test split approach.

This method splits the dataset into separate training and testing sets, so that no data point appears in both.

Here’s how it works:

* ***Split the Dataset:*** Divide the dataset into a training set and a testing set.
* ***Train the Model:*** Use the training set to train the model.
* ***Test the Model:*** Use the testing set to evaluate the model.


This approach provides a more accurate assessment of out-of-sample accuracy because the testing dataset is not used during training, making it more representative of real-world scenarios.

Here is an example using Python and the train_test_split function from *sklearn*:

**Note:** **random_state** ensures that we obtain the same split each time the code is run, which is useful for reproducibility.

In [47]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)


Train set: (140, 5) (140,)
Test set: (60, 5) (60,)
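
To illustrate why a held-out set matters, the sketch below fits an unrestricted decision tree on synthetic, deliberately noisy data and compares its accuracy on the training split with its accuracy on the test split. The data comes from `make_classification`, which is an assumption made for illustration and unrelated to the drug dataset, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (not the drug dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorize the training set, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = deep_tree.score(X_tr, y_tr)
test_acc = deep_tree.score(X_te, y_te)

print(f"accuracy on training data: {train_acc:.3f}")
print(f"accuracy on held-out data: {test_acc:.3f}")
```

The score on the training data says little about generalization; only the held-out score reflects how the model might behave on unseen patients.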


# Classification

In [48]:
from sklearn.tree import DecisionTreeClassifier
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_train, y_train)
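
Once a tree is fitted, its learned rules can be printed as text, which is a quick way to check what the model actually does. The sketch below uses scikit-learn's `export_text` on the bundled Iris dataset as a stand-in, since the drug CSV is not shipped with scikit-learn. The name `demo_tree` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a stand-in dataset here, used only to keep the example self-contained.
iris = load_iris()
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style rules.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
print("depth:", demo_tree.get_depth(), "leaves:", demo_tree.get_n_leaves())
```

For the model in this notebook, a call along the lines of `export_text(drugTree, feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'])` should produce the corresponding rules.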

# Model Prediction

Let's make some predictions on the testing dataset and store them in a variable called predTree.

In [49]:
predTree = drugTree.predict(X_test)

Let's build a table comparing the values predicted by the model with the actual values:

In [50]:
# Define the data for the two columns
data = {
    'Predicted Values': predTree,
    'Actual Values': y_test
}

# Create the comparison DataFrame (a new name, so the original df is not overwritten)
comparison = pd.DataFrame(data)

# Display the DataFrame
print(comparison)


    Predicted Values Actual Values
40             drugY         drugY
51             drugX         drugX
139            drugX         drugX
197            drugX         drugX
170            drugX         drugX
82             drugC         drugC
183            drugY         drugY
46             drugA         drugA
70             drugB         drugB
100            drugA         drugA
179            drugY         drugY
83             drugA         drugA
25             drugY         drugY
190            drugY         drugY
159            drugX         drugX
173            drugY         drugY
95             drugX         drugX
3              drugX         drugX
41             drugB         drugB
58             drugX         drugX
14             drugX         drugX
143            drugY         drugY
12             drugY         drugY
6              drugY         drugY
182            drugX         drugX
161            drugB         drugB
128            drugY         drugY
122            drugY         drugY
101           
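
The predictions above are for the test split. Scoring a single new patient works the same way, provided the patient's categorical values are encoded with the same codes used for training. The sketch below is a self-contained illustration: it trains a small tree on only the ten encoded rows displayed earlier in this notebook, so its output is illustrative and not a substitute for `drugTree`. The new patient and the names `small_tree`, `sex_code`, `bp_code`, and `chol_code` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The ten encoded rows and labels displayed earlier in this notebook.
# Columns: Age, Sex, BP, Cholesterol, Na_to_K
X_small = np.array([
    [23, 0, 0, 0, 25.355],
    [47, 1, 1, 0, 13.093],
    [47, 1, 1, 0, 10.114],
    [28, 0, 2, 0, 7.798],
    [61, 0, 1, 0, 18.043],
    [22, 0, 2, 0, 8.607],
    [49, 0, 2, 0, 16.275],
    [41, 1, 1, 0, 11.037],
    [60, 1, 2, 0, 15.171],
    [43, 1, 1, 1, 19.368],
])
y_small = np.array(["drugY", "drugC", "drugC", "drugX", "drugY",
                    "drugX", "drugY", "drugC", "drugY", "drugY"])

# Same codes as the LabelEncoders above produce (alphabetical order).
sex_code = {"F": 0, "M": 1}
bp_code = {"HIGH": 0, "LOW": 1, "NORMAL": 2}
chol_code = {"HIGH": 0, "NORMAL": 1}

small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
small_tree.fit(X_small, y_small)

# Hypothetical new patient: 25-year-old female, normal BP, high cholesterol.
new_patient = np.array(
    [[25, sex_code["F"], bp_code["NORMAL"], chol_code["HIGH"], 8.0]]
)
prediction = small_tree.predict(new_patient)[0]
print(prediction)
```

In the notebook itself, the equivalent step would be to encode the new patient's values with `le_sex`, `le_BP`, and `le_Chol` and pass the resulting row to `drugTree.predict`.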

# Model Evaluation

In [51]:
from sklearn import metrics

print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test, predTree))

Decision Tree's Accuracy:  0.9833333333333333
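
Accuracy is a single number and can hide which classes are confused with each other, especially when classes are imbalanced. The steps at the top of this notebook also mention precision, recall, and F1-score. A minimal sketch of how these could be computed with scikit-learn is shown below, using hypothetical label lists rather than this notebook's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration; these are not the notebook's predictions.
y_true = ["drugY", "drugX", "drugX", "drugA", "drugB", "drugC", "drugY", "drugA"]
y_pred = ["drugY", "drugX", "drugY", "drugA", "drugB", "drugC", "drugY", "drugA"]

labels = ["drugA", "drugB", "drugC", "drugX", "drugY"]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

For the model in this notebook, the same calls could be applied to `y_test` and `predTree`.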
