## Decision Tree
Decision trees work by testing an attribute and branching on the result of the test. Each internal node corresponds to a test, each branch corresponds to a result of that test, and each leaf node assigns a sample (here, a patient) to a class.

A decision tree can be constructed by considering the attributes one by one.
1) Choose an attribute from the dataset.
2) Calculate the significance of the attribute in splitting the data.
3) Split the data based on the value of the best attribute.
4) Go to each branch and repeat the process for the remaining attributes.

**Best attribute: higher predictiveness, lower entropy, and less impurity**
A node in the tree is considered “pure” if, in 100% of the cases, the samples in it fall into a single category of the target field.
Entropy is the amount of information disorder, or the amount of randomness, in the data. It is used to measure the homogeneity of the samples in a node. If the samples are completely homogeneous, the entropy is zero; if the samples are equally divided between two classes, the entropy is one (using a base-2 logarithm).

**Entropy = -p(A) log2(p(A)) - p(B) log2(p(B))**, where p is the proportion of each category in the node.
Choose an independent variable, then calculate the entropy of the target variable after the split. For example, if the target variable takes the values drugA and drugB and the independent variable is Sex, calculate the entropy of the target for the male group and for the female group, then calculate the information gain.
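
As a minimal sketch of the entropy calculation described above, the snippet below computes the entropy of the target before the split and within each Sex group. The eight samples are hypothetical toy data, not rows from `drug200.csv`, and the helper name `entropy` is an assumption made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    # p * log2(1/p) is the same quantity as -p * log2(p)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

# Hypothetical toy samples as (Sex, Drug) pairs
samples = [
    ("F", "drugA"), ("F", "drugA"), ("F", "drugA"), ("F", "drugB"),
    ("M", "drugA"), ("M", "drugB"), ("M", "drugB"), ("M", "drugB"),
]

entropy_all = entropy([drug for _, drug in samples])
entropy_female = entropy([drug for sex, drug in samples if sex == "F"])
entropy_male = entropy([drug for sex, drug in samples if sex == "M"])

print("before split:", entropy_all)         # 4 drugA vs 4 drugB, equally divided
print("female group:", round(entropy_female, 4))
print("male group:  ", round(entropy_male, 4))
```

With four drugA and four drugB samples the entropy before the split is one, and each Sex group (a 3-to-1 mix) has an entropy of roughly 0.81.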

**Choose the independent variable with the highest information gain after the split**
Information gain is the information that can increase the level of certainty after splitting. **It is the entropy of a tree before the split minus the weighted entropy after the split by an attribute**

As entropy, or the amount of randomness, decreases, the information gain, or amount of certainty, increases, and vice-versa.
To compute the entropy after a split, we consider the entropy over the distribution of samples falling under each resulting leaf node and take a weighted average of those entropies, weighted by the proportion of samples falling under each leaf.
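
The weighted-average calculation can be sketched as follows. This is an illustrative example under stated assumptions: the rows are hypothetical toy data, and `entropy` and `information_gain` are helper names introduced here, not functions from scikit-learn.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Weight each group's entropy by the share of samples that fall into it
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical toy rows (not taken from drug200.csv)
rows = [
    {"Sex": "F", "Cholesterol": "HIGH",   "Drug": "drugA"},
    {"Sex": "F", "Cholesterol": "HIGH",   "Drug": "drugA"},
    {"Sex": "F", "Cholesterol": "HIGH",   "Drug": "drugA"},
    {"Sex": "F", "Cholesterol": "NORMAL", "Drug": "drugB"},
    {"Sex": "M", "Cholesterol": "HIGH",   "Drug": "drugA"},
    {"Sex": "M", "Cholesterol": "NORMAL", "Drug": "drugB"},
    {"Sex": "M", "Cholesterol": "NORMAL", "Drug": "drugB"},
    {"Sex": "M", "Cholesterol": "NORMAL", "Drug": "drugB"},
]

gains = {attr: information_gain(rows, attr, "Drug") for attr in ("Sex", "Cholesterol")}
best = max(gains, key=gains.get)
print(gains)
print("best attribute:", best)
```

In this toy data Cholesterol separates the two drugs perfectly, so its information gain equals the full entropy before the split, while Sex leaves both groups mixed and gains much less.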

In [50]:
## importing libraries 
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [51]:
## reading the dataset 
df = pd.read_csv('drug200.csv')
print(df.shape)
df.head()

(200, 6)


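|   | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|-----|-----|----|-------------|---------|------|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |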


In [52]:
## selecting independent vars to array x 
x = df.iloc[:,:-1].values

## selecting the target variable to array y
y = df.iloc[:,-1].values

## checking distribution for target variable
df.iloc[:,-1].value_counts()

drugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: Drug, dtype: int64

Some features in this dataset are categorical, such as **Sex** or **BP**. Unfortunately, Sklearn Decision Trees do not handle categorical variables directly, so these features must first be converted to numerical values. This notebook encodes them with **LabelEncoder**, which maps each category to an integer code.
An alternative is **pandas.get_dummies()**, which converts a categorical variable into dummy/indicator variables.
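
For comparison, here is a small sketch of the `pandas.get_dummies()` alternative. The three-row frame is a toy sample built by hand for illustration; the notebook itself does not use this encoding.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "Age": [23, 47, 28],
    "Sex": ["F", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL"],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["Sex", "BP"])
print(encoded.columns.tolist())
print(encoded)
```

Unlike integer label codes, indicator columns do not impose an ordering on the categories, at the cost of adding one column per category.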

In [53]:
# LabelEncoder assigns integer codes in sorted (alphabetical) order, e.g. HIGH=0, LOW=1, NORMAL=2; OneHotEncoder is the usual choice for unordered sets
from sklearn import preprocessing
n_sex = preprocessing.LabelEncoder()
n_sex.fit(['F','M'])
x[:,1] = n_sex.transform(x[:,1])

n_bp = preprocessing.LabelEncoder()
n_bp.fit(['LOW', 'NORMAL', 'HIGH'])
x[:,2] = n_bp.transform(x[:,2])

n_chol = preprocessing.LabelEncoder()
n_chol.fit(['NORMAL', 'HIGH'])
x[:,3] = n_chol.transform(x[:,3])

x[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)

Now <b>train_test_split</b> will return 4 arrays. We will name them:<br>
x_train, x_test, y_train, y_test <br> <br>
The <b>train_test_split</b> call needs the parameters: <br>
x, y, test_size=0.3, and random_state=3. <br> <br>
The <b>x</b> and <b>y</b> are the arrays to be split, the <b>test_size</b> is the fraction of the data held out for testing, and the <b>random_state</b> ensures that we obtain the same split on every run.

In [54]:
## train_test_split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 3)

#checking train set dimensions 
print(x_train.shape)
print(y_train.shape)

#checking test set dimensions
print(x_test.shape)
print(y_test.shape)


(140, 5)
(140,)
(60, 5)
(60,)


#### Model Building 

In [55]:
## creating instance drugtree with criterion as entropy 
drugtree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
drugtree #shows all default parameters 

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [58]:
# training the model and predicting on the test set
drugtree.fit(x_train, y_train)
yhat = drugtree.predict(x_test)

# printing prediction 
print(yhat[0:5])
print(y_test[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
['drugY' 'drugX' 'drugX' 'drugX' 'drugX']


#### Evaluation 

In [61]:
from sklearn import metrics

print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test, yhat))

Decision Tree's Accuracy:  0.9833333333333333
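
For single-label classification, accuracy is the fraction of test samples whose predicted label exactly matches the true label. The sketch below reproduces that calculation by hand with NumPy; the two label arrays are hypothetical and are not the notebook's `y_test` and `yhat`.

```python
import numpy as np

# Hypothetical true and predicted labels
y_true = np.array(["drugY", "drugX", "drugX", "drugA", "drugC"])
y_pred = np.array(["drugY", "drugX", "drugX", "drugA", "drugB"])

# Element-wise comparison gives booleans; their mean is the share of matches
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 4 of 5 labels match
```

By the same arithmetic, the score of 0.9833 reported above on 60 test samples corresponds to 59 correct predictions.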
