#### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#### In this scenario, we look at a medical research problem: using data to choose the most suitable medicine for each patient.
#### Our dataset contains records of 200 past patients, covering their demographics and health status, namely:
#### Age, Sex, Blood Pressure, Cholesterol Level, sodium-to-potassium ratio (Na_to_K), and the medicine prescribed ("Drug").

#### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### First let's import the required libraries

In [11]:
import pandas as pd # To handle the data with ease (using the DataFrame structure)
import numpy as np # To treat the data as arrays
from sklearn import preprocessing # To preprocess the data before building the model
from sklearn.model_selection import train_test_split # For splitting the data into train & test sets
from sklearn import metrics # For accuracy evaluation
from sklearn.tree import DecisionTreeClassifier # The decision tree model itself

### Importing the data from the internet

#### In this example, our objective is to build a model that predicts which drug is likely to be the best choice for a future patient with the same illness.

In [2]:
!wget -O drugList.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv

--2020-05-04 12:04:10--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6027 (5.9K) [text/csv]
Saving to: ‘drugList.csv’


2020-05-04 12:04:11 (655 MB/s) - ‘drugList.csv’ saved [6027/6027]



In [3]:
df = pd.read_csv("drugList.csv", delimiter=",")
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


### Data Preprocessing

In [4]:
x = df[["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]].values
y = df["Drug"]

#### As can be seen above, some columns (Sex, BP, Cholesterol) contain non-numerical values. This is a problem because scikit-learn's decision tree
#### implementation only works with numerical features. So, we have to convert them into numerical values.

In [5]:
# To do so, we use LabelEncoder, which maps each category to an integer code (assigned in sorted label order)
sex = preprocessing.LabelEncoder()
sex.fit(['F','M'])
x[:,1] = sex.transform(x[:,1])

bp = preprocessing.LabelEncoder()
bp.fit(['LOW','NORMAL', 'HIGH'])
x[:,2] = bp.transform(x[:,2])

chol = preprocessing.LabelEncoder()
chol.fit(['NORMAL','HIGH'])
x[:,3] = chol.transform(x[:,3])
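
#### A detail worth checking when using LabelEncoder: the integer codes follow the sorted order of the labels, not the order passed to fit().
#### The minimal sketch below illustrates this for the blood pressure categories. The names bp_demo, bp_classes and bp_codes are introduced here only for illustration.

```python
from sklearn import preprocessing

# Fit an encoder on the same three labels used for the BP column
bp_demo = preprocessing.LabelEncoder()
bp_demo.fit(['LOW', 'NORMAL', 'HIGH'])

# classes_ stores the labels in sorted order; a label's position is its code
bp_classes = [str(label) for label in bp_demo.classes_]
bp_codes = {label: int(code) for label, code in zip(bp_classes, bp_demo.transform(bp_classes))}

print(bp_classes)
print(bp_codes)
```

#### So HIGH is encoded as 0, LOW as 1 and NORMAL as 2. The codes are identifiers only and do not reflect the natural LOW < NORMAL < HIGH ordering.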

### Splitting our dataset into Train / Test

In [6]:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.3, random_state=3)
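
#### With test_size=0.3 and 200 rows, we would expect 140 training rows and 60 test rows. The sketch below checks this on placeholder arrays of the same shape
#### as our data (x_demo and y_demo are made-up stand-ins, not the real patient records).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows with 5 feature columns, and 200 dummy labels
x_demo = np.arange(200 * 5).reshape(200, 5)
y_demo = np.arange(200) % 5

# Same split settings as in the cell above
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3, random_state=3)

print(x_tr.shape, x_te.shape)
print(y_tr.shape, y_te.shape)
```

#### Fixing random_state makes the split reproducible, so repeated runs give the same train and test rows.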

### Modeling

In [7]:
decTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
decTree.fit(xTrain, yTrain)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
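
#### With criterion="entropy", the tree picks splits that reduce the entropy of the class labels the most. As a rough illustration of what is being measured,
#### the sketch below computes Shannon entropy for two small label lists. The helper label_entropy is written for this example and is not part of scikit-learn.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

# A node containing a single drug is perfectly pure
pure = label_entropy(['drugX', 'drugX', 'drugX', 'drugX'])

# A node split evenly between two drugs is maximally mixed for two classes
mixed = label_entropy(['drugX', 'drugX', 'drugY', 'drugY'])

print(pure, mixed)
```

#### A pure node has entropy 0 and an evenly mixed two-class node has entropy 1, so lower entropy after a split means the split separates the drugs better.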

In [18]:
yHat = decTree.predict(xTest)
print(yHat[0:5])
print(yTest[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object


### Evaluation

In [13]:
print("Accuracy of the Decision Tree model: ", metrics.accuracy_score(yTest, yHat))

Accuracy of the Decision Tree model:  0.9833333333333333
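
#### Accuracy is simply the fraction of test rows where the prediction matches the true label. With 60 test rows, a score of 0.9833 corresponds to
#### 59 correct predictions. The sketch below computes the same quantity by hand on a small made-up example (y_true_demo and y_pred_demo are hypothetical values).

```python
import numpy as np

# Hypothetical true labels and predictions for five patients
y_true_demo = np.array(['drugY', 'drugX', 'drugX', 'drugX', 'drugC'])
y_pred_demo = np.array(['drugY', 'drugX', 'drugX', 'drugX', 'drugX'])

# Element-wise comparison gives booleans; their mean is the share of matches
matches = y_true_demo == y_pred_demo
manual_accuracy = float(np.mean(matches))

print("Correct predictions:", int(matches.sum()), "out of", len(matches))
print("Manual accuracy:", manual_accuracy)
```

#### Here four of the five predictions match, giving an accuracy of 0.8.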
