<h1>Decision Trees</h1>
<p>In this notebook, you will explore a machine learning algorithm called the decision tree. You will use this classification algorithm to build a model from historical patient data, including each patient's response to different medications. After training the decision tree, you will use the model to predict the class of an unknown patient, that is, to identify the appropriate drug for a new patient.
</p>

<h3>Overview of Decision Trees</h3>
Decision trees are supervised learning algorithms used for classification and regression. They work by recursively splitting the data into subsets based on the values of input features. The result is a tree in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds a class label (for classification) or a continuous value (for regression).
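
<p>To make the splitting idea concrete, here is a minimal sketch of how an entropy-based split can be scored. It uses eight <code>Na_to_K</code> values and drug labels copied from the dataset preview shown later in this notebook. The helper names <code>entropy</code> and <code>information_gain</code> and the threshold <code>14.8</code> are assumptions made for illustration; scikit-learn's internal implementation is more elaborate.</p>

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` into mask / ~mask."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    left, right = labels[mask], labels[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Eight rows taken from the dataset preview (Na_to_K and Drug columns).
na_to_k = np.array([25.355, 13.093, 10.114, 7.798, 18.043, 8.607, 16.275, 11.037])
drug = np.array(["drugY", "drugC", "drugC", "drugX", "drugY", "drugX", "drugY", "drugC"])

# Hypothetical candidate split: Na_to_K above 14.8 versus the rest.
gain = information_gain(drug, na_to_k > 14.8)
print(f"Entropy before split: {entropy(drug):.3f} bits")
print(f"Information gain:     {gain:.3f} bits")
```

<p>A tree learner evaluates many such candidate splits at each node and keeps the one with the highest gain, then repeats the process on each resulting subset.</p>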

<h3>Steps</h3>
<ol>
    <li>Loading and Exploring the Dataset:<br/>
        Loading the dataset containing historical patient data, including features like age, sex, blood pressure, cholesterol levels, and drug response.
    </li>
    <li>Data Preprocessing:<br/>
        Cleaning and preprocessing the data, handling missing values, encoding categorical variables, and preparing the data for training.
    </li>
    <li>Splitting the Data:<br/>
        Splitting the dataset into training and testing sets. The training set will be used to train the decision tree model, while the testing set will be used to evaluate its performance.
    </li>
    <li>Building and Training the Decision Tree Model:<br/>
        Using scikit-learn to create and train a decision tree classifier on the training data. The model will learn to classify patients based on their features and drug responses.
    </li>
    <li>Model Evaluation:<br/>
        Evaluating the performance of the decision tree model using various metrics such as accuracy, precision, recall, and F1-score. These metrics will help assess how well the model predicts the correct drug class for new patients.
    </li>
    <li>Prediction:<br/>
    Using the newly trained decision tree model to make predictions on new patient data. This will help in identifying the appropriate drug for a new patient based on their features.
    </li>
</ol>

<h3>Importing the needed packages</h3>

In [1]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

%matplotlib inline

<h3>Importing the dataset</h3>

In [2]:
drugfile = os.path.join('Storage', 'drug.csv')
df = pd.read_csv(drugfile)
display(df.head(20))

,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
5,22,F,NORMAL,HIGH,8.607,drugX
6,49,F,NORMAL,HIGH,16.275,drugY
7,41,M,LOW,HIGH,11.037,drugC
8,60,M,NORMAL,HIGH,15.171,drugY
9,43,M,LOW,NORMAL,19.368,drugY
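
<p>The preprocessing step listed earlier mentions handling missing values, so it can help to check for them before encoding. The sketch below re-enters the first five preview rows by hand so that it runs on its own; in the notebook the same calls would be applied to <code>df</code>. The variable names <code>sample</code>, <code>missing_per_column</code>, and <code>categorical_levels</code> are illustrative.</p>

```python
import pandas as pd

# First five rows of the preview above, re-entered so the snippet is self-contained.
sample = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
    "Drug": ["drugY", "drugC", "drugC", "drugX", "drugY"],
})

# Count missing entries per column; any non-zero count needs attention before training.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# List the distinct levels of each categorical column to confirm what must be encoded.
categorical_levels = {
    col: sorted(sample[col].unique()) for col in ["Sex", "BP", "Cholesterol"]
}
print(categorical_levels)
```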


<h3>Dataset Description</h3>
<p>
  Imagine you have collected data on a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.
</p>
<p>
   Your task is to build a model to determine which drug might be appropriate for future patients with the same illness. The dataset's features are the Age, Sex, Blood Pressure (BP), Cholesterol level, and sodium-to-potassium ratio (Na_to_K) of each patient, while the target variable is the drug to which each patient responded.
</p>

<h3>Visualization & Analysis of dataset</h3>

<h5>The distribution of target classes in the dataset</h5>

In [3]:
display(df['Drug'].value_counts())

Drug
drugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: count, dtype: int64

<h5>The features in the dataset</h5>

In [4]:
print(df.columns)

Index(['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K', 'Drug'], dtype='object')


<h3>Extracting dataset features</h3>

In [5]:
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']]
display(X)

X = X.values

,Age,Sex,BP,Cholesterol,Na_to_K
0,23,F,HIGH,HIGH,25.355
1,47,M,LOW,HIGH,13.093
2,47,M,LOW,HIGH,10.114
3,28,F,NORMAL,HIGH,7.798
4,61,F,LOW,HIGH,18.043
...,...,...,...,...,...
195,56,F,LOW,HIGH,11.567
196,16,M,LOW,HIGH,12.006
197,52,M,NORMAL,HIGH,9.894
198,23,M,NORMAL,NORMAL,14.020


<h3>Encoding categorical features</h3>

In [6]:
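# LabelEncoder stores its classes in sorted order, so the codes are alphabetical
# regardless of the order passed to fit(): F=0, M=1
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

# HIGH=0, LOW=1, NORMAL=2
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

# HIGH=0, NORMAL=1
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

display(X)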

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043],
       [22, 0, 2, 0, 8.607],
       [49, 0, 2, 0, 16.275],
       [41, 1, 1, 0, 11.037],
       [60, 1, 2, 0, 15.171],
       [43, 1, 1, 1, 19.368],
       [47, 0, 1, 0, 11.767],
       [34, 0, 0, 1, 19.199],
       [43, 1, 1, 0, 15.376],
       [74, 0, 1, 0, 20.942],
       [50, 0, 2, 0, 12.703],
       [16, 0, 0, 1, 15.516],
       [69, 1, 1, 1, 11.455],
       [43, 1, 0, 0, 13.972],
       [23, 1, 1, 0, 7.298],
       [32, 0, 0, 1, 25.974],
       [57, 1, 1, 1, 19.128],
       [63, 1, 2, 0, 25.917],
       [47, 1, 1, 1, 30.568],
       [48, 0, 1, 0, 15.036],
       [33, 0, 1, 0, 33.486],
       [28, 0, 0, 1, 18.809],
       [31, 1, 0, 0, 30.366],
       [49, 0, 2, 1, 9.381],
       [39, 0, 1, 1, 22.697],
       [45, 1, 1, 0, 17.951],
       [18, 0, 2, 1, 8.75],
       [74, 1, 0, 0, 9.567],
       [49, 1, 1, 1, 11.014],
       [65, 0, 0,
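
<p>In the output above, <code>HIGH</code> blood pressure is encoded as 0, <code>LOW</code> as 1, and <code>NORMAL</code> as 2, because <code>LabelEncoder</code> sorts its classes alphabetically. If a natural ordering (LOW &lt; NORMAL &lt; HIGH) is preferred, one possible alternative is <code>OrdinalEncoder</code> with an explicit category list. This is a sketch of that option, not the approach used in this notebook; whether it helps the resulting tree is an assumption that would need to be tested.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# BP values of the first five patients in the dataset preview.
bp = np.array(["HIGH", "LOW", "LOW", "NORMAL", "LOW"])

# LabelEncoder: classes are stored sorted, regardless of the order given to fit().
le = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
label_codes = le.transform(bp)
print(list(le.classes_), label_codes)

# OrdinalEncoder: the category list fixes the code order explicitly.
oe = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
ordinal_codes = oe.fit_transform(bp.reshape(-1, 1)).ravel()
print(ordinal_codes)
```

<p>With the explicit ordering, a single threshold on the encoded column separates lower from higher blood pressure levels, which matches how a decision tree splits numeric features.</p>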

<h3>Extracting target classes</h3>

In [7]:
y = df["Drug"]
display(y)

0      drugY
1      drugC
2      drugC
3      drugX
4      drugY
       ...  
195    drugC
196    drugC
197    drugX
198    drugX
199    drugX
Name: Drug, Length: 200, dtype: object

<h3>Splitting the dataset into training and testing sets</h3>
<ol>
    <li>Train the model on the training set.</li>
    <li>Evaluate the model on the testing set.</li>
</ol>

<p>A data point cannot appear in both the training and the testing set, so the model is always evaluated on data it has not seen during training.</p>

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (140, 5) (140,)
Test set: (60, 5) (60,)
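
<p>The class counts reported earlier are imbalanced (91 patients on drugY but only 16 each on drugB and drugC), so a purely random split may leave a rare class under-represented in the test set. One option is the <code>stratify</code> parameter of <code>train_test_split</code>, which keeps class proportions roughly equal in both sets. The sketch below uses synthetic labels with the same class counts and a placeholder feature column; <code>X_demo</code> and <code>y_demo</code> are illustrative names, not part of the notebook.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels reproducing the class counts reported for the Drug column.
counts = {"drugY": 91, "drugX": 54, "drugA": 23, "drugC": 16, "drugB": 16}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature: the row id

# stratify=y_demo preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo
)

labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, test_counts)))
```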


<h3>Training the decision tree classifier</h3>

In [9]:
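from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" chooses splits by information gain;
# max_depth=4 caps the depth of the tree to limit overfitting
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_train, y_train)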
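
<p>Once a tree is fitted, its learned rules can be printed with <code>sklearn.tree.export_text</code>. The sketch below fits a separate small tree on only the first ten encoded rows shown in the outputs above, so that it runs on its own; the rules it prints are therefore not those of <code>drugTree</code>. In the notebook, the same call could be applied to <code>drugTree</code> directly. The names <code>X_small</code>, <code>y_small</code>, and <code>small_tree</code> are illustrative.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]

# First ten encoded feature rows and their drug labels, copied from the outputs above.
X_small = np.array([
    [23, 0, 0, 0, 25.355], [47, 1, 1, 0, 13.093], [47, 1, 1, 0, 10.114],
    [28, 0, 2, 0, 7.798],  [61, 0, 1, 0, 18.043], [22, 0, 2, 0, 8.607],
    [49, 0, 2, 0, 16.275], [41, 1, 1, 0, 11.037], [60, 1, 2, 0, 15.171],
    [43, 1, 1, 1, 19.368],
])
y_small = np.array(["drugY", "drugC", "drugC", "drugX", "drugY",
                    "drugX", "drugY", "drugC", "drugY", "drugY"])

# Same hyperparameters as the notebook; random_state is added only for repeatability.
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
small_tree.fit(X_small, y_small)

# Print the decision rules as indented text, one line per node.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```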

<h3>Predicting the classes in the testing set</h3>

In [10]:
predTree = drugTree.predict(X_test)
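
<p>The same <code>predict</code> call works for a single new patient, provided the raw record is encoded exactly as the training data was. The following is a sketch under stated assumptions: the patient record is hypothetical, and the model is a stand-in fitted on the first ten encoded rows shown earlier so that the snippet runs on its own. In the notebook, <code>drugTree</code> and the existing encoders would be used instead.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encoders fitted on the same category lists as in the encoding cell.
le_sex = LabelEncoder().fit(["F", "M"])
le_BP = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
le_Chol = LabelEncoder().fit(["NORMAL", "HIGH"])

# Stand-in model: first ten encoded rows and labels from the outputs above.
X_small = np.array([
    [23, 0, 0, 0, 25.355], [47, 1, 1, 0, 13.093], [47, 1, 1, 0, 10.114],
    [28, 0, 2, 0, 7.798],  [61, 0, 1, 0, 18.043], [22, 0, 2, 0, 8.607],
    [49, 0, 2, 0, 16.275], [41, 1, 1, 0, 11.037], [60, 1, 2, 0, 15.171],
    [43, 1, 1, 1, 19.368],
])
y_small = ["drugY", "drugC", "drugC", "drugX", "drugY",
           "drugX", "drugY", "drugC", "drugY", "drugY"]
model = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
model.fit(X_small, y_small)

# Hypothetical new patient (values invented for illustration).
new_patient = {"Age": 35, "Sex": "F", "BP": "HIGH", "Cholesterol": "NORMAL", "Na_to_K": 22.0}

# Encode the record in the same column order used for training.
row = np.array([[
    new_patient["Age"],
    le_sex.transform([new_patient["Sex"]])[0],
    le_BP.transform([new_patient["BP"]])[0],
    le_Chol.transform([new_patient["Cholesterol"]])[0],
    new_patient["Na_to_K"],
]], dtype=float)

predicted = model.predict(row)[0]
print("Encoded row:", row.tolist())
print("Predicted drug:", predicted)
```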

In [11]:
# Collect predictions and ground truth side by side.
# A new name is used so the original dataset in `df` is not overwritten.
results = pd.DataFrame({
    'Predicted Values': predTree,
    'Actual Values': y_test
})

# Display the comparison table
display(results)

,Predicted Values,Actual Values
40,drugY,drugY
51,drugX,drugX
139,drugX,drugX
197,drugX,drugX
170,drugX,drugX
82,drugC,drugC
183,drugY,drugY
46,drugA,drugA
70,drugB,drugB
100,drugA,drugA


<h3>Evaluating the model</h3>

In [12]:
from sklearn import metrics

# average='macro' takes the unweighted mean of the per-class scores
print("Decision tree accuracy:", metrics.accuracy_score(y_test, predTree))
print("Decision tree precision:", metrics.precision_score(y_test, predTree, average='macro'))
print("Decision tree recall:", metrics.recall_score(y_test, predTree, average='macro'))
print("Decision tree F1-score:", metrics.f1_score(y_test, predTree, average='macro'))

Decision tree accuracy: 0.9833333333333333
Decision tree precision: 0.9913043478260869
Decision tree recall: 0.9904761904761905
Decision tree F1-score: 0.9906775067750677
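
<p>Macro-averaged scores weight every drug class equally, which matters here because the classes are imbalanced. To see where errors occur and how the macro average is formed, a confusion matrix and per-class precision can be inspected. The sketch below uses a small set of hypothetical labels, not the notebook's actual predictions; in the notebook, <code>y_test</code> and <code>predTree</code> would take the place of <code>y_true</code> and <code>y_pred</code>.</p>

```python
from sklearn.metrics import confusion_matrix, precision_score

labels = ["drugA", "drugB", "drugC", "drugX", "drugY"]

# Hypothetical ground truth and predictions with one mistake (a drugA patient predicted as drugB).
y_true = ["drugA", "drugA", "drugB", "drugC", "drugX", "drugX", "drugY", "drugY", "drugY", "drugY"]
y_pred = ["drugA", "drugB", "drugB", "drugC", "drugX", "drugX", "drugY", "drugY", "drugY", "drugY"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, then the macro average (their unweighted mean).
per_class = precision_score(y_true, y_pred, labels=labels, average=None)
macro = precision_score(y_true, y_pred, labels=labels, average="macro")
print(dict(zip(labels, per_class)))
print("Macro precision:", macro)
```

<p>In this toy example a single error halves the precision of drugB, and the macro average drops accordingly even though nine of ten predictions are correct.</p>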
