# Decision Tree

This notebook implements a decision tree classification algorithm. Historical data on patients and the drugs recommended for them is used to train the model.

The trained decision tree model can then be used to predict a drug for a new patient.


### Dataset 

Data has been collected from a group of patients who suffered from the same illness. During the course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X and Drug Y.

The objective is to build a model that predicts which drug is appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol and Na_to_K values of each patient. The target is the drug that each patient responded to.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Read Dataset

In [2]:
data=pd.read_csv('drug200.csv',delimiter=',')

data.head()

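| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |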


In [3]:
data.shape

(200, 6)

# Preprocessing Data

Using __data__, the __drug200.csv__ data read by pandas, declare the following variables:

 1. __X__ as feature matrix
 2. __Y__ as target variable

In [4]:
X=data[['Age','Sex','BP','Cholesterol','Na_to_K']].values
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.113999999999999],
       [28, 'F', 'NORMAL', 'HIGH', 7.797999999999999],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

Some features in this dataset are categorical, such as __Sex__, __BP__ and __Cholesterol__. Sklearn decision trees do not handle categorical variables directly, so these features need to be converted to numerical values.

__pandas.get_dummies()__ converts categorical variables into dummy/indicator variables. The cell below instead uses __LabelEncoder__, which maps each category to an integer code.

In [5]:
# Encode categorical variables as integer codes with LabelEncoder

from sklearn import preprocessing

le_sex=preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1]=le_sex.transform(X[:,1])

le_BP=preprocessing.LabelEncoder()
le_BP.fit(['LOW','NORMAL','HIGH'])
X[:,2]=le_BP.transform(X[:,2])

le_Cholesterol=preprocessing.LabelEncoder()
le_Cholesterol.fit(['NORMAL','HIGH'])
X[:,3]=le_Cholesterol.transform(X[:,3])

X[0:5]


array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)
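
To make the difference between the two encodings mentioned above concrete, here is a minimal sketch on a toy column. The `toy` frame and its rows are made up for illustration; only the column name and category values mirror the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows; only the column name and categories mirror the notebook.
toy = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL", "LOW"]})

# LabelEncoder sorts the categories and assigns the codes 0, 1, 2, ... in that order.
le_bp = LabelEncoder().fit(toy["BP"])
codes = le_bp.transform(toy["BP"])
print(list(le_bp.classes_))  # category order used for the codes
print(codes)

# get_dummies creates one indicator column per category instead of a single code.
dummies = pd.get_dummies(toy["BP"], prefix="BP")
print(dummies)
```

Because the classes are sorted, `HIGH` is encoded as 0, `LOW` as 1 and `NORMAL` as 2, which matches the encoded output above. The integer codes imply an ordering that does not reflect the real ordering of blood pressure levels, which is one reason indicator columns are sometimes preferred.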

In [6]:
Y=data[['Drug']]
Y[0:5]

| | Drug |
|---|---|
| 0 | drugY |
| 1 | drugC |
| 2 | drugC |
| 3 | drugX |
| 4 | drugY |


# Outline Model

### Split Data into Train, Test set

In [7]:
from sklearn.model_selection import train_test_split

__train_test_split__ returns four objects. We will name them:

X_trainset, X_testset, y_trainset, y_testset

__train_test_split__ is called with the arguments:

X, Y, test_size=0.3, and random_state=3.

__X__ and __Y__ are the feature matrix and target to be split, __test_size__ is the fraction of the data held out for testing, and __random_state__ fixes the shuffling so that the same split is obtained on every run.

In [8]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, Y, test_size=0.3, random_state=3)

print(X_trainset.shape)
print(y_trainset.shape)

(140, 5)
(140, 1)


In [9]:
print(X_testset.shape)
print(y_testset.shape)

(60, 5)
(60, 1)
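
The effect of `test_size` and `random_state` can be checked with a small sketch. The arrays `X_demo` and `y_demo` are stand-ins that only match the row and feature counts of the dataset; they are not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 200 rows and 5 columns, matching only the shape of the dataset.
X_demo = np.arange(200 * 5).reshape(200, 5)
y_demo = np.arange(200)

# Two independent calls with the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# With the same random_state, both calls produce identical splits.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

With 200 rows and `test_size=0.3`, 60 rows go to the test set and 140 to the training set, as in the shapes printed above.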


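First, an instance of __DecisionTreeClassifier__ is created.

Inside the classifier, criterion="entropy" makes the tree choose each split by its information gain, that is, the reduction in entropy from a node to its children. max_depth=4 limits the depth of the tree.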

In [10]:
Model_DT=DecisionTreeClassifier(criterion="entropy", max_depth = 4)

Model_DT

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
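
The information gain behind `criterion="entropy"` can be illustrated with a short sketch. The helper functions `entropy` and `information_gain` and the labels below are hypothetical and written for this example; they are not part of scikit-learn's API and do not reproduce its internal implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels for illustration only.
parent = ["drugX", "drugX", "drugY", "drugY"]
left, right = ["drugX", "drugX"], ["drugY", "drugY"]

print(entropy(parent))                        # 1.0 for a 50/50 mix of two classes
print(information_gain(parent, left, right))  # 1.0 because both children are pure
```

A split that leaves the class mix unchanged in both children would have a gain of 0, so the tree prefers splits like the one above.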

# Train Model

In [11]:
Model_DT.fit(X_trainset,y_trainset)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

# Test Model

### Predict

In [12]:
pred_y=Model_DT.predict(X_testset)

print("1. Predicted Target ")

print(pred_y[0:5])

print("2. Original Target ")

print(y_testset[0:5])

1. Predicted Target 
['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
2. Original Target 
      Drug
40   drugY
51   drugX
139  drugX
197  drugX
170  drugX
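
The introduction mentions predicting a drug for a new patient. A minimal sketch of that step follows, assuming the same encoders as in the preprocessing cell. To keep it self-contained, a stand-in tree is trained on only the five rows shown by `data.head()`; in the notebook the already fitted `Model_DT` would be used instead. The helper `encode_patient` and the new patient's values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encoders fitted on the same category lists as in the preprocessing cell.
le_sex = LabelEncoder().fit(["F", "M"])
le_bp = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
le_chol = LabelEncoder().fit(["NORMAL", "HIGH"])

def encode_patient(age, sex, bp, cholesterol, na_to_k):
    """Return one feature row in the order Age, Sex, BP, Cholesterol, Na_to_K."""
    return [
        age,
        le_sex.transform([sex])[0],
        le_bp.transform([bp])[0],
        le_chol.transform([cholesterol])[0],
        na_to_k,
    ]

# Stand-in training data: the five rows printed by data.head().
head_rows = [
    (23, "F", "HIGH", "HIGH", 25.355, "drugY"),
    (47, "M", "LOW", "HIGH", 13.093, "drugC"),
    (47, "M", "LOW", "HIGH", 10.114, "drugC"),
    (28, "F", "NORMAL", "HIGH", 7.798, "drugX"),
    (61, "F", "LOW", "HIGH", 18.043, "drugY"),
]
X_demo = np.array([encode_patient(*row[:5]) for row in head_rows], dtype=float)
y_demo = np.array([row[5] for row in head_rows])

# Stand-in model; five rows are far too few for a meaningful prediction.
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Hypothetical new patient, encoded exactly like the training rows.
new_patient = np.array([encode_patient(50, "M", "LOW", "HIGH", 12.0)], dtype=float)
prediction = demo_tree.predict(new_patient)[0]
print(prediction)
```

The important point is that a new patient must be encoded with the same encoders and the same column order as the training data before calling `predict`.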


### Evaluate Model Performance

In [15]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("Decision Tree Accuracy :  " , metrics.accuracy_score(y_testset,pred_y))

Decision Tree Accuracy :   0.9833333333333333
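
Accuracy is the fraction of test samples whose predicted label matches the true label; with 60 test rows, the score above corresponds to 59 correct predictions. The sketch below checks that definition against `accuracy_score` and adds a confusion matrix for a per-class view. The label arrays are made up for illustration and are not taken from the notebook's test set.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration; not taken from the notebook's test set.
y_true = np.array(["drugY", "drugX", "drugX", "drugC", "drugA"])
y_pred = np.array(["drugY", "drugX", "drugC", "drugC", "drugA"])

# Accuracy computed by hand: fraction of positions where the labels match.
manual = float(np.mean(y_true == y_pred))
library = metrics.accuracy_score(y_true, y_pred)
print(manual, library)

# Confusion matrix: rows are true labels, columns are predicted labels.
labels = ["drugA", "drugC", "drugX", "drugY"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions per class, so its sum divided by the number of samples equals the accuracy.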
