# <center>Decision Trees</center>

###### In this notebook, you will learn a popular machine learning algorithm, the decision tree. You will use this classification algorithm to build a model from historical data on patients and their responses to different medications. You will then use the trained decision tree to predict the class of an unknown patient, that is, to find a suitable drug for a new patient.

#### Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

#### ML Pipeline

### About dataset
Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure (BP), Cholesterol, and Na_to_K (sodium-to-potassium ratio) of each patient, and the target is the drug that each patient responded to.

This is an example of a multiclass classifier, since the target has 5 possible drugs. You can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient, that is, to prescribe a drug to a new patient.


#### 1. Data Collection

In [2]:
df = pd.read_csv('drug200.csv', sep=',')

In [3]:
df.shape

(200, 6)

In [4]:
df['Cholesterol'].value_counts()

HIGH      103
NORMAL     97
Name: Cholesterol, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
Age            200 non-null int64
Sex            200 non-null object
BP             200 non-null object
Cholesterol    200 non-null object
Na_to_K        200 non-null float64
Drug           200 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.3+ KB


In [6]:
print(df.shape)
X = df.values[:,:-1]
y = df.values[:,-1]

(200, 6)


In [7]:
type(X)

numpy.ndarray

In [8]:
type(y)

numpy.ndarray

#### 2. Data Preprocessing

As you may have noticed, some features in this dataset are categorical, such as __Sex__ or __BP__. Unfortunately, scikit-learn decision trees do not handle categorical variables directly, but we can still convert these features to numerical values. The cells below use __LabelEncoder__ to map each category to an integer code. An alternative is __pandas.get_dummies()__, which converts a categorical variable into dummy/indicator variables.

In [9]:
le_sex = LabelEncoder()

In [10]:
le_sex = le_sex.fit(['F','M'])

In [11]:
X[:,1] = le_sex.transform(X[:,1])

In [12]:
le_BP = LabelEncoder()
le_BP = le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])

In [13]:
le_chol = LabelEncoder()
le_chol = le_chol.fit(['NORMAL','HIGH'])
X[:,3] = le_chol.transform(X[:,3])
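One detail worth checking is which integer each category actually receives. The sketch below uses a tiny made-up `toy` frame (not the drug dataset) to show that `LabelEncoder` assigns codes in alphabetical order regardless of the order passed to `fit`, and compares it with two alternatives: `OrdinalEncoder` with an explicit category order, and one-hot encoding via `pandas.get_dummies()`. The variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical mini-sample that reuses the category labels of the BP column.
toy = pd.DataFrame({"BP": ["LOW", "NORMAL", "HIGH", "LOW"]})

# LabelEncoder sorts its classes alphabetically, whatever order fit() receives.
le = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
label_codes = dict(zip(le.classes_.tolist(), le.transform(le.classes_).tolist()))
print(label_codes)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}

# OrdinalEncoder with an explicit category list keeps LOW < NORMAL < HIGH.
oe = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
ordinal_codes = oe.fit_transform(toy[["BP"]]).ravel().tolist()
print(ordinal_codes)  # [0.0, 1.0, 2.0, 0.0]

# One-hot encoding avoids implying any order between the categories.
one_hot = pd.get_dummies(toy["BP"], prefix="BP")
print(list(one_hot.columns))  # ['BP_HIGH', 'BP_LOW', 'BP_NORMAL']
```

A decision tree can still split on the alphabetical codes, but it may need more splits to isolate a category than it would with an ordered or one-hot encoding.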

In [14]:
print('X : ',X.shape)
print('y : ',y.shape)

X :  (200, 5)
y :  (200,)


#### 3. Data Preparation

In [15]:
X_orig,X_test,y_orig,y_test = train_test_split(X, y, test_size = 0.3, random_state = 3)

In [16]:
X_train, X_val, y_train, y_val = train_test_split(X_orig, y_orig, test_size = 0.3, random_state = 3)

In [17]:
print('X DF shape is : ', X.shape)
print('y DF shape is : ', y.shape)
print('-----------------')
print('X_orig shape is : ', X_orig.shape)
print('y_orig shape is : ', y_orig.shape)
print('-----------------')
print('X_train shape is : ', X_train.shape)
print('y_train shape is : ', y_train.shape)
print('-----------------')
print('X_val shape is : ', X_val.shape)
print('y_val shape is : ', y_val.shape)
print('-----------------')
print('X_test shape is : ', X_test.shape)
print('y_test shape is : ', y_test.shape)
print('-----------------')

X DF shape is :  (200, 5)
y DF shape is :  (200,)
-----------------
X_orig shape is :  (140, 5)
y_orig shape is :  (140,)
-----------------
X_train shape is :  (98, 5)
y_train shape is :  (98,)
-----------------
X_val shape is :  (42, 5)
y_val shape is :  (42,)
-----------------
X_test shape is :  (60, 5)
y_test shape is :  (60,)
-----------------
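The two-stage split above can be reproduced on stand-in data to see where the 98 / 42 / 60 row counts come from. The sketch below assumes synthetic features and made-up class labels, and it additionally passes `stratify`, which the notebook does not use, so that each subset keeps the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the dataset: 200 rows, 5 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.repeat(["A", "B", "C", "X", "Y"], 40)

# First split: hold out 30% of all rows as the test set (200 -> 140 + 60).
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo)

# Second split: 30% of the remainder becomes the validation set (140 -> 98 + 42).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=3, stratify=y_rest)

print(len(X_tr), len(X_va), len(X_te))  # 98 42 60
```

Stratifying is optional, but with only 200 rows and 5 classes it reduces the chance that a rare class ends up missing from the validation or test set.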


#### 4. Build Model

In [18]:
model_dtree = DecisionTreeClassifier(criterion='entropy', max_depth=4)
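With `criterion='entropy'`, the tree chooses splits that reduce the entropy of the class labels the most. The following is a minimal NumPy sketch of that idea, not scikit-learn's internal implementation; the helper names `entropy` and `information_gain`, the toy labels, and the threshold of 15 are all assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into mask / ~mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical node: 4 patients on drug "Y" and 4 on drug "X".
y_node = np.array(["Y"] * 4 + ["X"] * 4)
na_to_k = np.array([20.0, 25.0, 18.0, 30.0, 9.0, 11.0, 12.0, 8.0])

print(entropy(y_node))                           # 1.0 (a 50/50 node)
print(information_gain(y_node, na_to_k > 15.0))  # 1.0 (both children are pure)
```

A gain of 1 bit here means the split separates the two classes perfectly. `max_depth=4` simply limits how many such splits can be stacked along any path from the root.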

#### 5. Train Model

In [19]:
model_dtree = model_dtree.fit(X_train,y_train)

#### 6. Validate Model

In [20]:
pred_dtree = model_dtree.predict(X_val)

#### Apply cross validation

In [21]:
accuracies = cross_val_score(estimator=model_dtree, X=X_train,y=y_train, cv =10)
print('accuracies : ', accuracies)
print('accuracies mean : ', accuracies.mean())
print('accuracies std : ', accuracies.std())

accuracies :  [0.9 1.  1.  1.  1.  1.  1.  1.  1.  1. ]
accuracies mean :  0.99
accuracies std :  0.029999999999999992
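To make clear what `cross_val_score` does with `cv=10`, the sketch below repeats the procedure by hand on synthetic stand-in data (generated with `make_classification`, not the drug dataset). For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold`. The fixed `random_state` is an assumption added so that both runs build identical trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 100 rows, 5 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

folds = StratifiedKFold(n_splits=10)

# Manual loop: fit on 9 folds, measure accuracy on the held-out fold.
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    fold_model = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# Library call: the same folds and the same estimator settings.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
library_scores = cross_val_score(tree, X_demo, y_demo, cv=folds)

print(np.allclose(manual_scores, library_scores))  # True
```

Each of the 10 scores comes from a model that never saw its held-out fold, which is why the mean is a more stable estimate than a single validation split.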




#### 7. Model selection based on Validation Accuracy score

In [22]:
print(' Val Accuracy score : ', accuracy_score(y_val,pred_dtree))

 Val Accuracy score :  1.0


#### Grid Search

In [23]:
# parameters = [{'criterion' : ['entropy','gini']},
#               {'max_depth' : [2, 4, 6, 8]}]
# grid_search = GridSearchCV(estimator=model_dtree, param_grid=parameters, scoring='accuracy', cv = 10, n_jobs = -1)
# grid_search = grid_search.fit(X_train, y_train)  # best_* attributes exist only after fit
# best_accuracy = grid_search.best_score_
# best_parameters = grid_search.best_params_
# print('best accuracy : ', best_accuracy)
# print('best params : ', best_parameters)
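For reference, here is a small runnable grid search on synthetic stand-in data. It assumes the goal is to search `criterion` and `max_depth` jointly, so both go into a single dict; a list of two dicts, as in the commented-out cell, is treated by `GridSearchCV` as two separate grids (2 + 4 = 6 candidates) rather than the 2 x 4 = 8 combinations. The data, `cv=5`, and `random_state` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the drug dataset).
X_demo, y_demo = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

# One dict searches the full cross-product: 2 criteria x 4 depths = 8 candidates.
param_grid = {"criterion": ["entropy", "gini"], "max_depth": [2, 4, 6, 8]}

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid, scoring="accuracy", cv=5)
grid_search.fit(X_demo, y_demo)  # best_params_ / best_score_ are set by fit()

print(len(grid_search.cv_results_["params"]))  # 8
print("best params :", grid_search.best_params_)
print("best accuracy :", grid_search.best_score_)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `grid_search.best_estimator_` can be used directly for prediction afterwards.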

#### 8. Test model and Report Accuracy

In [24]:
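# Refit on train + validation data (X_orig, y_orig) before scoring on the held-out test set.
# Note: this uses default hyperparameters, not criterion='entropy' / max_depth=4 from the validated model.
model = DecisionTreeClassifier()
model = model.fit(X_orig,y_orig)
pred = model.predict(X_test)
print('Test Accuracy Score is : ', accuracy_score(y_test,pred))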

Test Accuracy Score is :  0.9833333333333333
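Once a model is trained, predicting a drug for a new patient means encoding that patient with the same encoders used for training. The sketch below is a self-contained illustration: the eight training rows, the drug labels "X" and "Y", the helper `encode_patient`, and the new patient are all invented so that the code runs, and they do not come from `drug200.csv`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encoders fitted on the same category labels as in the preprocessing step.
le_sex = LabelEncoder().fit(["F", "M"])
le_BP = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
le_chol = LabelEncoder().fit(["NORMAL", "HIGH"])

def encode_patient(age, sex, bp, chol, na_to_k):
    """Return one feature row in the order Age, Sex, BP, Cholesterol, Na_to_K."""
    return [age,
            le_sex.transform([sex])[0],
            le_BP.transform([bp])[0],
            le_chol.transform([chol])[0],
            na_to_k]

# Hypothetical training rows: only Na_to_K separates the two drugs cleanly.
toy_rows = [
    (23, "F", "HIGH", "HIGH", 25.3, "Y"),
    (47, "M", "LOW", "HIGH", 13.1, "X"),
    (47, "M", "LOW", "NORMAL", 10.1, "X"),
    (28, "F", "NORMAL", "HIGH", 7.8, "X"),
    (61, "F", "LOW", "NORMAL", 18.0, "Y"),
    (22, "F", "NORMAL", "NORMAL", 8.6, "X"),
    (49, "F", "NORMAL", "HIGH", 16.3, "Y"),
    (41, "M", "LOW", "HIGH", 11.0, "X"),
]
X_toy = np.array([encode_patient(*row[:5]) for row in toy_rows], dtype=float)
y_toy = np.array([row[5] for row in toy_rows])

toy_model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_toy, y_toy)

# A new, unseen patient must go through the same encoding before predict().
new_patient = np.array([encode_patient(35, "M", "NORMAL", "HIGH", 22.0)], dtype=float)
predicted = toy_model.predict(new_patient)[0]
print(predicted)  # Y
```

In the notebook itself, the same pattern would apply to `model` and the fitted `le_sex`, `le_BP`, and `le_chol` objects.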


In [25]:
# from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
# import pydotplus
# import matplotlib.image as mpimg
# from sklearn import tree
# %matplotlib inline 



In [26]:
# pip install pydotplus



In [28]:
# dot_data = StringIO()
# filename = "drugtree.png"
# featureNames = df.columns[0:5]
# targetNames = df["Drug"].unique().tolist()
# # Use the fitted `model` and its training labels `y_orig` (the names drugTree and y_trainset are not defined in this notebook)
# out=tree.export_graphviz(model,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_orig), filled=True,  special_characters=True,rotate=False)  
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# graph.write_png(filename)
# img = mpimg.imread(filename)
# plt.figure(figsize=(100, 200))
# plt.imshow(img,interpolation='nearest')
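If installing pydotplus and Graphviz is not convenient, scikit-learn can also print a fitted tree as plain text with `sklearn.tree.export_text`. The sketch below trains a tree on synthetic stand-in data. The feature names are borrowed from the dataset's columns only to label the output, and the resulting rules say nothing about the real drug data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with 5 numeric features (not the drug dataset).
X_demo, y_demo = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

# Column names reused purely as display labels for the 5 synthetic features.
feature_names = ["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]

demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per node: "|--- feature <= threshold" for splits, "class: ..." for leaves.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

For a graphical view without extra packages, `sklearn.tree.plot_tree` draws the same structure with matplotlib.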