# Decision Tree for Medical Drug Prediction
#### Using previous medical data to predict what drug an unknown patient would need. 
##### by Scott Schmidt
The dataset CSV can be found here: https://www.kaggle.com/prathamtripathi/drug-classification.
This project is based on the IBM Decision Trees lab from the course Machine Learning with Python. The original dataset columns and first five rows can be viewed below:

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data=pd.read_csv(r'/kaggle/input/drug-classification/drug200.csv')
print(data.head())

### Find missing data

In [3]:
print("Missing data by column:")
findNA=data.isnull().sum().sort_values(ascending=False)/len(data)
print(findNA) #There are no missing values

## Split Data

In [4]:
X=data.drop('Drug', axis=1)
y=data['Drug']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

### Feature Engineering
Encode Categorical Variables

In [5]:
import category_encoders as ce

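# Encode variables with ordinal encoding.
# Note: cols=X.columns also treats any numeric columns as categories,
# so values not seen during fit are encoded as -1 in the test set.
encoder = ce.OrdinalEncoder(cols=X.columns)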

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train.head()
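
As a point of comparison, here is a minimal sketch of ordinal encoding applied only to the categorical columns, leaving numeric columns untouched. It is written in plain Python so it runs without the dataset. The column names, the category values, the chosen orderings, and the helper `encode_row` are all assumptions for illustration and should be checked against `data.head()`.

```python
# Made-up rows shaped like the dataset (column names and values are assumptions).
rows = [
    {"Age": 30, "Sex": "F", "BP": "HIGH", "Cholesterol": "HIGH", "Na_to_K": 20.5},
    {"Age": 45, "Sex": "M", "BP": "LOW", "Cholesterol": "NORMAL", "Na_to_K": 12.0},
    {"Age": 60, "Sex": "F", "BP": "NORMAL", "Cholesterol": "HIGH", "Na_to_K": 8.25},
]

# Explicit category -> integer maps, so the ordering is chosen deliberately
# rather than by order of first appearance.
ORDINAL_MAPS = {
    "Sex": {"F": 0, "M": 1},
    "BP": {"LOW": 0, "NORMAL": 1, "HIGH": 2},
    "Cholesterol": {"NORMAL": 0, "HIGH": 1},
}


def encode_row(row, maps=ORDINAL_MAPS, unknown=-1):
    """Return a copy of `row` with categorical columns mapped to integers.

    Columns without a map (the numeric ones) are left unchanged.
    Categories missing from a map are encoded as `unknown`.
    """
    encoded = dict(row)
    for col, mapping in maps.items():
        encoded[col] = mapping.get(row[col], unknown)
    return encoded


encoded_rows = [encode_row(r) for r in rows]
for r in encoded_rows:
    print(r)
```

Keeping numeric columns numeric lets the tree split on thresholds instead of on arbitrary category codes.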

### DecisionTreeClassifier
Model accuracy score with criterion gini index: 0.5667

In [6]:
from sklearn.tree import DecisionTreeClassifier

clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
clf_gini.fit(X_train, y_train)
y_pred_gini = clf_gini.predict(X_test)

# Check accuracy score with criterion gini index
from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))
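
To make `criterion='gini'` concrete, the sketch below computes Gini impurity for a set of labels and the weighted impurity of a candidate split. It is a simplified illustration of the idea, not scikit-learn's implementation. The function names and the toy labels are made up for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (hypothetical): three samples each of two classes.
parent = ["A", "A", "A", "B", "B", "B"]

# A split that separates the classes perfectly.
pure_split = weighted_gini(["A", "A", "A"], ["B", "B", "B"])

# A split that leaves both children mixed.
mixed_split = weighted_gini(["A", "A", "B"], ["A", "B", "B"])

print("parent impurity:", gini_impurity(parent))
print("pure split:", pure_split)
print("mixed split:", mixed_split)
```

A tree grown with the Gini criterion prefers the split with the lower weighted impurity, which here is the pure split.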

### Overfitting and underfitting

In [7]:
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

Because the test score is about 10% lower than the training score, there is some indication that the model is slightly overfit: it performs better on the data it was trained on than on unseen data. Overall, though, the two scores are similar.

### Visualize decision tree

In [16]:
from sklearn import tree

plt.figure(figsize=(12,8))
# clf_gini is already fitted, so plot it directly instead of refitting
tree.plot_tree(clf_gini)

In [9]:
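import graphviz
# class_names must list the class labels in sorted order, which is what
# clf_gini.classes_ holds; passing y_train would give one entry per sample.
dot_data = tree.export_graphviz(clf_gini, out_file=None,
                              feature_names=list(X_train.columns),
                              class_names=list(clf_gini.classes_),
                              filled=True, rounded=True,
                              special_characters=True)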

graph = graphviz.Source(dot_data) 
graph

### Confusion Matrix

In [15]:
# Print the confusion matrix (one row and one column per drug class)
from sklearn.metrics import confusion_matrix

# Use the predictions from the gini model fitted above
cm = confusion_matrix(y_test, y_pred_gini)

print('Confusion matrix\n\n', cm)
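
For readers unfamiliar with the layout, here is a small plain-Python sketch of how a multiclass confusion matrix is tallied. It follows the same convention as scikit-learn (rows are true labels, columns are predicted labels, labels in sorted order). The helper `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels=None):
    """Tally a confusion matrix: rows are true labels, columns are predictions."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


# Toy example with three made-up classes.
y_true_toy = ["a", "a", "b", "b", "c", "c"]
y_pred_toy = ["a", "b", "b", "b", "c", "a"]

labels_toy, matrix_toy = confusion_counts(y_true_toy, y_pred_toy)
print(labels_toy)
for row in matrix_toy:
    print(row)
```

Entries on the diagonal are correct predictions. Every off-diagonal entry counts samples of the row's class that were predicted as the column's class.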

### Classification Report

In [14]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_gini))
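
The per-class figures in a classification report can be derived from a confusion matrix. The sketch below shows this for a made-up 3-class matrix (rows are true labels, columns are predictions). The function name `per_class_metrics` and the numbers are illustrative assumptions, not results from this model.

```python
def per_class_metrics(labels, matrix):
    """Compute precision, recall, and F1 per class from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: predicted as this class
        actual = sum(matrix[i])                    # row sum: truly this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical confusion matrix for classes "a", "b", "c".
toy_labels = ["a", "b", "c"]
toy_matrix = [
    [1, 1, 0],
    [0, 2, 0],
    [1, 0, 1],
]

report = per_class_metrics(toy_labels, toy_matrix)
for label, m in report.items():
    print(label, {k: round(v, 3) for k, v in m.items()})
```

Precision asks how many predictions of a class were right, while recall asks how many true members of a class were found.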

### References:
* https://www.kaggle.com/prashant111/decision-tree-classifier-tutorial
* https://www.datacamp.com/community/tutorials/decision-tree-classification-python