# Decision Trees

This notebook uses a decision tree (a classification algorithm) to build a model from historical data on patients and their responses to different medications. The trained tree is then used to predict the class of an unknown patient, that is, to find a suitable drug for a new patient.

In [13]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

### About the dataset

As a medical researcher, you are analyzing data from a group of patients who all suffered from the same illness but responded differently to various treatments. Each patient was treated with one of five drugs: Drug A, Drug B, Drug C, Drug X, or Drug Y. Your goal is to develop a model that can predict which drug would be most effective for a new patient with the same illness.

The dataset you are working with includes features such as Age, Sex, Blood Pressure, and Cholesterol levels, with the target variable being the specific drug each patient responded to. By using a decision tree classifier, you can train the model on this data to identify patterns and make accurate predictions for future patients, helping guide treatment decisions.

This is a multiclass classification task (five possible drugs, not two): the model learns from the training data to classify new, unseen patients into the appropriate drug category based on their individual characteristics.

In [15]:
my_data = pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv", delimiter=",")
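# Save a local copy. The row index is written as well, which is why an
# 'Unnamed: 0' column appears when the file is read back.
my_data.to_csv('drug200.csv')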

In [16]:
df = pd.read_csv('drug200.csv')
df

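|     | Unnamed: 0 | Age | Sex | BP     | Cholesterol | Na_to_K | Drug  |
|-----|------------|-----|-----|--------|-------------|---------|-------|
| 0   | 0          | 23  | F   | HIGH   | HIGH        | 25.355  | drugY |
| 1   | 1          | 47  | M   | LOW    | HIGH        | 13.093  | drugC |
| 2   | 2          | 47  | M   | LOW    | HIGH        | 10.114  | drugC |
| 3   | 3          | 28  | F   | NORMAL | HIGH        | 7.798   | drugX |
| 4   | 4          | 61  | F   | LOW    | HIGH        | 18.043  | drugY |
| ... | ...        | ... | ... | ...    | ...         | ...     | ...   |
| 195 | 195        | 56  | F   | LOW    | HIGH        | 11.567  | drugC |
| 196 | 196        | 16  | M   | LOW    | HIGH        | 12.006  | drugC |
| 197 | 197        | 52  | M   | NORMAL | HIGH        | 9.894   | drugX |
| 198 | 198        | 23  | M   | NORMAL | NORMAL      | 14.020  | drugX |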


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   200 non-null    int64  
 1   Age          200 non-null    int64  
 2   Sex          200 non-null    object 
 3   BP           200 non-null    object 
 4   Cholesterol  200 non-null    object 
 5   Na_to_K      200 non-null    float64
 6   Drug         200 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 11.1+ KB


In [18]:
df.drop('Unnamed: 0', axis = 1, inplace = True)
df.head()

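|   | Age | Sex | BP     | Cholesterol | Na_to_K | Drug  |
|---|-----|-----|--------|-------------|---------|-------|
| 0 | 23  | F   | HIGH   | HIGH        | 25.355  | drugY |
| 1 | 47  | M   | LOW    | HIGH        | 13.093  | drugC |
| 2 | 47  | M   | LOW    | HIGH        | 10.114  | drugC |
| 3 | 28  | F   | NORMAL | HIGH        | 7.798   | drugX |
| 4 | 61  | F   | LOW    | HIGH        | 18.043  | drugY |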


In [19]:
df.nunique()

Age             57
Sex              2
BP               3
Cholesterol      2
Na_to_K        198
Drug             5
dtype: int64

### Pre-processing

In [20]:
# Creating the independent variables (features) X; the dependent variable y follows in the next cell
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [21]:
y = df['Drug']
y

0      drugY
1      drugC
2      drugC
3      drugX
4      drugY
       ...  
195    drugC
196    drugC
197    drugX
198    drugX
199    drugX
Name: Drug, Length: 200, dtype: object

In [22]:
df.columns

Index(['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K', 'Drug'], dtype='object')

As seen, some columns such as 'Sex', 'BP' and 'Cholesterol' are categorical, and sklearn decision trees cannot work with string categories directly. These columns must therefore be converted to numerical values. The next cell does this with LabelEncoder from sklearn; the get_dummies() function from the pandas library is an alternative that produces one indicator column per category.
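The two encoding options can be compared on a toy column. The sketch below uses made-up rows rather than the drug dataset, and the names `toy`, `bp_codes` and `bp_dummies` are illustrative only. It also shows that LabelEncoder assigns integer codes in sorted (alphabetical) order, so the codes do not reflect a LOW < NORMAL < HIGH ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows that mimic the BP column (not the real dataset)
toy = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL", "LOW"]})

# Label encoding: one integer per category, assigned in sorted order
le = LabelEncoder()
bp_codes = le.fit_transform(toy["BP"])
print(list(le.classes_))  # position in this list is the integer code
print(bp_codes)

# One-hot encoding: one indicator column per category
bp_dummies = pd.get_dummies(toy["BP"], prefix="BP")
print(bp_dummies.columns.tolist())
```

A decision tree can split on either representation; label encoding keeps the feature matrix narrow, while one-hot encoding avoids implying an order between categories.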

In [23]:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])


le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])


le_chol = preprocessing.LabelEncoder()
le_chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_chol.transform(X[:, 3])

In [24]:
y[0 : 5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

### Setting up the Decision Tree

In [25]:
from sklearn.model_selection import train_test_split

train_test_split takes the following arguments:
- X, y, test_size = 0.3, and random_state = 3.


X and y are the arrays to split, test_size is the fraction of the dataset held out for testing (0.3 means 30%), and random_state fixes the shuffle so that the same split is obtained on every run.
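The effect of random_state can be checked directly. This is a minimal sketch on synthetic stand-in arrays (`X_demo` and `y_demo` are invented here, not the drug data): two calls with the same seed should give identical partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the drug dataset)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce identical partitions
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)

print(a_train.shape, a_test.shape)     # 7 training rows, 3 test rows
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```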

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 3)

### Practice

In [28]:
# Printing the shape of X_train and y_train ensuring that their dimensions match
print('Shape of X_train: ', X_train.shape)
print('Shape of y_train: ', y_train.shape)

Shape of X_train:  (140, 5)
Shape of y_train:  (140,)


In [29]:
# Printing the shape of X_test and y_test ensuring that their dimensions match
print('Shape of X_test: ', X_test.shape)
print('Shape of y_test: ', y_test.shape)

Shape of X_test:  (60, 5)
Shape of y_test:  (60,)


### Modeling

In [30]:
# Creating a DecisionTreeClassifier called drugtree.
drugtree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
drugtree

In [31]:
drugtree.fit(X_train, y_train)

### Prediction

In [32]:
# Predicting on the test set and comparing the first few predictions with the true labels
predtree = drugtree.predict(X_test)
print(predtree[0:5])
print(y_test[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object


### Evaluation

In [33]:
# Importing metrics from sklearn to check the accuracy of the model
from sklearn import metrics
import matplotlib.pyplot as plt
accuracy = metrics.accuracy_score(y_test, predtree)
print(f"DecisionTree's Accuracy: {accuracy:.2f}")

DecisionTree's Accuracy: 0.98


__Accuracy classification score__: for a single-label problem such as this one, it is the fraction of test samples whose predicted label equals the true label in y_true.

In multilabel classification, the function returns the subset accuracy instead. If the entire set of predicted labels for a sample strictly matches the true set of labels, that sample counts as 1.0; otherwise it counts as 0.0.
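Both cases can be reproduced by hand. The sketch below uses made-up label arrays (not the notebook's predictions) to show that accuracy_score agrees with a plain element-wise comparison in the single-label case, and that a single wrong label makes a whole sample count as a miss in the multilabel case.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up single-label example: 4 of 5 predictions are correct
y_true = np.array(["drugY", "drugX", "drugX", "drugC", "drugA"])
y_pred = np.array(["drugY", "drugX", "drugC", "drugC", "drugA"])

manual = np.mean(y_true == y_pred)       # fraction of exact matches
library = accuracy_score(y_true, y_pred)
print(manual, library)                   # both 0.8

# Made-up multilabel example (binary indicator rows):
# the second sample has one wrong label, so the whole row is a miss
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 1], [0, 1, 1]])
subset = accuracy_score(Y_true, Y_pred)
print(subset)                            # 0.5
```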

### Visualization

In [41]:
from six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

In [None]:
from io import StringIO  # standard-library replacement for six.StringIO
from sklearn.tree import export_graphviz
import pydotplus
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np

dot_data = StringIO()
filename = 'drugstree.png'
# The first five columns are the features; 'Drug' is the target
featureNames = list(df.columns[0:5])

# Write the fitted tree in DOT format into the in-memory buffer
export_graphviz(drugtree, out_file=dot_data, feature_names=featureNames,
                class_names=np.unique(y_train), filled=True,
                special_characters=True, rotate=False)

# Render the DOT description to a PNG file (requires Graphviz), then display it
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)  # was misspelled as 'mping', which raises a NameError
plt.figure(figsize = (100, 200))
plt.imshow(img, interpolation = 'nearest')
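If Graphviz and pydotplus are not installed, a plain-text rendering of the tree is one possible alternative. The following is a minimal sketch using sklearn's export_text on a tiny made-up dataset; `X_toy`, `y_toy` and `toy_tree` are hypothetical names and values, not part of this notebook's data or results.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset with two features: [Age, Na_to_K]
X_toy = [[23, 25.4], [47, 13.1], [47, 10.1], [28, 7.8], [61, 18.0], [35, 20.2]]
y_toy = ["drugY", "drugX", "drugX", "drugX", "drugY", "drugY"]

# Same settings as the notebook's classifier, with a fixed seed for repeatability
toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Render the learned splits as indented text rules
rules = export_text(toy_tree, feature_names=["Age", "Na_to_K"])
print(rules)
```

The same call could be applied to the fitted drugtree by passing the list of feature column names.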


