### Predict the class of an unknown patient and suggest a suitable drug - Decision tree model
##### The drug dataset is taken from Kaggle. It records the features of each patient together with the drug that patient responded to. We will use a classification algorithm to build a model from this historical data, and then use the trained decision tree to predict the class of an unknown patient, that is, to suggest a suitable drug for a new patient.
##### The features of this dataset are Age, Sex, Blood Pressure (BP), Cholesterol, and the sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to. This is a multiclass classification problem: we will use the training part of the dataset to build a decision tree, and then evaluate its predictions on the held-out test part.

### Importing the libraries and the dataset

In [1]:
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('drug200.csv')
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY


### Setting up features variable and target variable

In [3]:
x = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
x[:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [4]:
y = df[['Drug']].values
y[:5]

array([['DrugY'],
       ['drugC'],
       ['drugC'],
       ['drugX'],
       ['DrugY']], dtype=object)

#### As we can see, some features in this dataset are categorical: Sex, BP, and Cholesterol. Scikit-learn decision trees do not handle string-valued categorical variables, so we need to convert these features to numerical values. Here we use preprocessing.LabelEncoder to map each category to an integer code; pandas.get_dummies(), which converts a categorical variable into dummy/indicator variables, is an alternative.

In [5]:
from sklearn import preprocessing
sex = preprocessing.LabelEncoder().fit(['F','M'])
x[:,1] = sex.transform(x[:,1])

In [6]:
x[:5]

array([[23, 0, 'HIGH', 'HIGH', 25.355],
       [47, 1, 'LOW', 'HIGH', 13.093],
       [47, 1, 'LOW', 'HIGH', 10.114],
       [28, 0, 'NORMAL', 'HIGH', 7.798],
       [61, 0, 'LOW', 'HIGH', 18.043]], dtype=object)

In [7]:
bp = preprocessing.LabelEncoder().fit(['HIGH', 'NORMAL', 'LOW'])
x[:,2] = bp.transform(x[:,2])

In [8]:
x[:5]

array([[23, 0, 0, 'HIGH', 25.355],
       [47, 1, 1, 'HIGH', 13.093],
       [47, 1, 1, 'HIGH', 10.114],
       [28, 0, 2, 'HIGH', 7.798],
       [61, 0, 1, 'HIGH', 18.043]], dtype=object)
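
Note that the codes follow the alphabetical order of the labels, not the order passed to `fit`, which is why `LOW` becomes 1 and `NORMAL` becomes 2 in the output above. A minimal sketch that makes the mapping explicit (the names `bp_encoder` and `mapping` are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the labels it sees, so the order passed to fit() does not matter
bp_encoder = LabelEncoder().fit(['HIGH', 'NORMAL', 'LOW'])

# classes_ holds the sorted labels; the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(bp_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(bp_encoder.inverse_transform([0, 1, 2]))
```

The numeric order of these codes carries no clinical meaning (`LOW` sits between `HIGH` and `NORMAL`); the tree can still separate the categories with thresholds, but it may need more than one split to isolate a single category.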

In [9]:
df['Cholesterol'].value_counts()

HIGH      103
NORMAL     97
Name: Cholesterol, dtype: int64

In [10]:
chole = preprocessing.LabelEncoder().fit(['HIGH', 'NORMAL'])
x[:,3] = chole.transform(x[:,3])

In [11]:
x[:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
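
An alternative to integer codes is one-hot encoding with `pandas.get_dummies()`. The sketch below is not part of the workflow in this notebook; it rebuilds by hand the five rows shown by `df.head()` (the `sample` and `encoded` names are illustrative) and shows the indicator columns that would result.

```python
import pandas as pd

# The five rows printed by df.head() above, rebuilt by hand for illustration
sample = pd.DataFrame({
    'Age': [23, 47, 47, 28, 61],
    'Sex': ['F', 'M', 'M', 'F', 'F'],
    'BP': ['HIGH', 'LOW', 'LOW', 'NORMAL', 'LOW'],
    'Cholesterol': ['HIGH', 'HIGH', 'HIGH', 'HIGH', 'HIGH'],
    'Na_to_K': [25.355, 13.093, 10.114, 7.798, 18.043],
})

# One indicator column per category value; numeric columns pass through unchanged
encoded = pd.get_dummies(sample, columns=['Sex', 'BP', 'Cholesterol'])
print(encoded.columns.tolist())
```

Only a `Cholesterol_HIGH` column appears here because these five rows contain no `NORMAL` cholesterol value; on the full dataset both values occur, so both columns would be created.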

### Train and test data split

In [12]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.3, random_state = 3)
print('Training dataset shape:    ',train_x.shape, train_y.shape)
print('Testing dataset shape:    ',test_x.shape, test_y.shape)

Training dataset shape:     (140, 5) (140, 1)
Testing dataset shape:     (60, 5) (60, 1)
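
With a multiclass target, a purely random split can leave a rare class under-represented in the test set. A hedged sketch of a stratified split using the `stratify` parameter of `train_test_split`, run on made-up labels (the `toy_x` and `toy_y` arrays are hypothetical, not the drug data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 of class 'a', 60 of 'b', 40 of 'c'
toy_y = np.array(['a'] * 100 + ['b'] * 60 + ['c'] * 40)
toy_x = np.arange(len(toy_y)).reshape(-1, 1)

# stratify=toy_y keeps the class proportions the same in both parts
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=3, stratify=toy_y
)

# Count how many samples of each class ended up in the test part
labels, counts = np.unique(te_y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Whether stratification changes the result on the drug data was not checked here; it is simply an option to consider when some drugs are much rarer than others.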


### Building the model

In [13]:
treee = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
treee

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [14]:
treee.fit(train_x, train_y)

DecisionTreeClassifier(criterion='entropy', max_depth=4)
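
Setting `criterion='entropy'` asks the tree to choose, at each node, the split with the largest information gain, that is, the largest drop in Shannon entropy of the class labels. A minimal sketch of that calculation on the five `Drug` labels shown earlier; the `entropy` and `information_gain` helpers are illustrative names, not scikit-learn functions, and the threshold of 15 is chosen by hand:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Drug labels from the first five rows of the dataset
drugs = ['DrugY', 'drugC', 'drugC', 'drugX', 'DrugY']

# Candidate split Na_to_K <= 15: the rows with 13.093, 10.114 and 7.798 go left
left = ['drugC', 'drugC', 'drugX']
right = ['DrugY', 'DrugY']

print(round(entropy(drugs), 3))
print(round(information_gain(drugs, left, right), 3))
```

The right-hand child contains only `DrugY`, so its entropy is zero; a split that produces pure children like this is what the entropy criterion rewards.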

In [15]:
yhat = treee.predict(test_x)
yhat

array(['DrugY', 'drugX', 'drugX', 'drugX', 'drugX', 'drugC', 'DrugY',
       'drugA', 'drugB', 'drugA', 'DrugY', 'drugA', 'DrugY', 'DrugY',
       'drugX', 'DrugY', 'drugX', 'drugX', 'drugB', 'drugX', 'drugX',
       'DrugY', 'DrugY', 'DrugY', 'drugX', 'drugB', 'DrugY', 'DrugY',
       'drugA', 'drugX', 'drugB', 'drugC', 'drugC', 'drugX', 'drugX',
       'drugC', 'DrugY', 'drugX', 'drugX', 'drugX', 'drugA', 'DrugY',
       'drugC', 'DrugY', 'drugA', 'DrugY', 'DrugY', 'DrugY', 'DrugY',
       'DrugY', 'drugB', 'drugX', 'DrugY', 'drugX', 'DrugY', 'DrugY',
       'drugA', 'drugX', 'DrugY', 'drugX'], dtype=object)
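
To suggest a drug for a single new patient, the same encoders have to be applied to that patient's raw values before calling `predict`. A self-contained sketch, assuming a tree trained only on the five rows shown by `df.head()` and a hypothetical patient whose values are invented; with so little training data the predicted label is illustrative only. The names `encode`, `demo_tree`, and `new_patient` are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encoders fitted on the same label sets as in the cells above
sex_enc = LabelEncoder().fit(['F', 'M'])
bp_enc = LabelEncoder().fit(['HIGH', 'NORMAL', 'LOW'])
chol_enc = LabelEncoder().fit(['HIGH', 'NORMAL'])

def encode(age, sex, bp, chol, na_to_k):
    """Turn one patient's raw values into the numeric row the tree expects."""
    return [
        age,
        int(sex_enc.transform([sex])[0]),
        int(bp_enc.transform([bp])[0]),
        int(chol_enc.transform([chol])[0]),
        na_to_k,
    ]

# The five rows from df.head(), used here as a tiny stand-in training set
rows = [
    (23, 'F', 'HIGH', 'HIGH', 25.355),
    (47, 'M', 'LOW', 'HIGH', 13.093),
    (47, 'M', 'LOW', 'HIGH', 10.114),
    (28, 'F', 'NORMAL', 'HIGH', 7.798),
    (61, 'F', 'LOW', 'HIGH', 18.043),
]
targets = ['DrugY', 'drugC', 'drugC', 'drugX', 'DrugY']

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
demo_tree.fit([encode(*r) for r in rows], targets)

# Hypothetical new patient (values invented for illustration)
new_patient = encode(35, 'M', 'NORMAL', 'HIGH', 9.5)
print(demo_tree.predict([new_patient])[0])
```

In the notebook itself the already fitted `sex`, `bp`, and `chole` encoders and the `treee` model would take the place of the stand-ins used here.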

### Checking the accuracy

In [16]:
from sklearn import metrics
print('Accuracy of the decision tree is:    ', metrics.accuracy_score(test_y, yhat))

Accuracy of the decision tree is:     0.9833333333333333
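
Accuracy is a single number; a confusion matrix additionally shows which classes get mistaken for which. A small sketch on made-up labels (the `true_labels` and `pred_labels` lists are hypothetical, not the model's actual output):

```python
from sklearn import metrics

# Hypothetical true and predicted labels; one drugX patient is predicted as drugC
true_labels = ['DrugY', 'DrugY', 'drugX', 'drugX', 'drugC', 'drugA']
pred_labels = ['DrugY', 'DrugY', 'drugX', 'drugC', 'drugC', 'drugA']

# accuracy_score is the fraction of positions where the two lists agree
acc = metrics.accuracy_score(true_labels, pred_labels)
print(acc)

# Rows are true classes, columns are predicted classes, in the order given by labels
classes = ['DrugY', 'drugA', 'drugC', 'drugX']
cm = metrics.confusion_matrix(true_labels, pred_labels, labels=classes)
print(cm)
```

On the notebook's own data, the equivalent call should be `metrics.confusion_matrix(test_y.ravel(), yhat)`.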


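So, the model reaches an accuracy of about 98.3% on the 60 held-out test patients, which corresponds to 59 correct predictions out of 60. On this split, the predicted drug matches the recorded drug for almost every patient.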