# Week 4: ML - Decision Trees


The goal of this practical/lab is to get you acquainted with how decision trees work.   
We will look at how to calculate entropy and information gain, then write functions that find the feature maximizing information gain and use it to split our dataset.   
We will then load a naive implementation of the ID3 algorithm, train it on our sample dataset, and compare its performance to that of scikit-learn's tree classifier.  
(Students interested in a more challenging exercise can provide their own implementation of ID3; this should be slightly easier than usual, given that you will already have implemented the building blocks of the algorithm.)  
We'll start by importing the necessary libraries for the exercises.  


In [9]:
#Necessary imports
from sklearn.utils import shuffle
import numpy as np
import pandas as pd

Let's then load the same dataset used in this week's tutorial (from exercises 1 and 2).

In [10]:
#Reading the dataset from week 4 tutorial
dataset=pd.read_csv("dataset.csv")
dataset=shuffle(dataset)
dataset.head()


Unnamed: 0,A,B,C,D,variety
29,a3,b1,c1,d1,y
0,a1,b1,c1,d1,y
27,a3,b1,c1,d1,y
9,a1,b1,c1,d1,y
61,a1,b1,c2,d1,n


## Exercise 1: Entropy 

In this exercise, we will look at how to measure entropy in a set/subset, which we will later reuse to calculate the information gain of each single feature.  
In the lecture we defined entropy as a measure of uncertainty or chaos. It is given by the equation below, where $P(e)$ is the proportion of examples belonging to class $e$ and $n$ is the number of classes.
## $ Entropy = -\sum_{e=1}^{n} P(e)\,\log_2 P(e) $  
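
As a quick illustration of the formula, the sketch below evaluates it directly on a few probability distributions. The helper name `entropy_from_probabilities` is hypothetical (it is not part of the lab code), and it assumes the probabilities passed in sum to 1.

```python
import numpy as np

def entropy_from_probabilities(probabilities):
    """Entropy in bits of a discrete distribution (hypothetical helper)."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # terms with P(e) = 0 contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())

fair_coin = entropy_from_probabilities([0.5, 0.5])        # two equally likely classes
pure_set = entropy_from_probabilities([1.0, 0.0])         # a single class only
uneven_split = entropy_from_probabilities([50/90, 40/90])  # a 50/40 split between two classes
print(fair_coin, pure_set, uneven_split)
```

Two equally likely classes give the maximum of 1 bit, a pure set gives 0, and a slightly uneven split lands just below 1.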

**Question 1:**   
Write a function that takes as a parameter a label column (in our example that's "variety") and returns the entropy of the whole dataset.

In [11]:

def entropy(label):
    #Insert your code below

    # Relative frequency of each class in the label column
    probabilities = label.value_counts(normalize=True)
    entropy_value = 0
    for p in probabilities:
        if p > 0:  # log2(0) is undefined, and such terms contribute nothing
            entropy_value += -p*np.log2(p)

    return entropy_value

print(dataset.variety.value_counts()) # tests
print(pd.unique(dataset.variety)) # tests

y    50
n    40
Name: variety, dtype: int64
['y' 'n']


**Question 2:**  
Use the function you defined above to confirm the values you calculated in exercise 1 of the tutorial.

In [12]:
#Insert your code here
#0.9910760598382222

print(entropy(dataset.variety))

0.9910760598382222


## Exercise 2: Information Gain

As discussed in the lecture, information gain measures the decrease in entropy after a dataset is split on a feature.
Let's proceed now to calculate the information gain of a dataset when splitting using a split_feature.

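## $ information\_gain(S,A) = Entropy(S)-\sum_{i=1}^{n} \frac{|S_i|}{|S|}\,Entropy(S_i) $
Here $S_i$ is the subset of $S$ in which feature $A$ takes its $i$-th value, and $n$ is the number of distinct values of $A$.  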
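
To make the weighted sum concrete, here is a minimal sketch on a four-row toy table. The names `toy_entropy`, `toy_information_gain` and the toy columns `X`, `Z`, `label` are all hypothetical, and the sketch uses `groupby` to form the subsets, which is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

def toy_entropy(labels):
    """Entropy in bits of a label column (hypothetical helper)."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def toy_information_gain(df, feature_col, label_col):
    """Entropy of the labels minus the size-weighted entropy of each subset."""
    weighted = sum(
        len(subset) / len(df) * toy_entropy(subset[label_col])
        for _, subset in df.groupby(feature_col)
    )
    return toy_entropy(df[label_col]) - weighted

toy = pd.DataFrame({
    "X": ["x1", "x1", "x2", "x2"],   # determines the label completely
    "Z": ["z1", "z2", "z1", "z2"],   # tells us nothing about the label
    "label": ["y", "y", "n", "n"],
})

gain_x = toy_information_gain(toy, "X", "label")
gain_z = toy_information_gain(toy, "Z", "label")
print(gain_x, gain_z)
```

Splitting on `X` removes all uncertainty (gain of 1 bit), while splitting on `Z` leaves every subset as mixed as the original set (gain of 0).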
**Question 1:**  
Write a function that takes as parameters:

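1. dataset (the DataFrame being split)
2. label (the label column, e.g. dataset.variety)
3. feature (the feature column to split on, e.g. dataset.A)

Your function **information_gain** should return the information gain obtained by splitting on the feature, and should make use of the function **entropy** which you defined in exercise 1.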

In [13]:
def information_gain(dataset, label, feature):
    """
    Insert your code here
    """
    # Weighted average of the label entropy within each value of the feature.
    # (dataset is kept in the signature as the exercise asks, but the label
    # and feature columns are all that is needed here.)
    feature_entropy = 0
    for value in pd.unique(feature):
        mask = feature == value
        weight = mask.sum()/len(feature)  # |S_i| / |S|
        # Use the label column passed in rather than hard-coding dataset.variety
        feature_entropy += weight*entropy(label[mask])

    # Entropy before the split minus the weighted entropy after it
    return entropy(label) - feature_entropy

**Question 2:**  
Use the function you defined above to calculate the information gain of feature/attribute A.

In [14]:
#Insert your code here
#0.15809587018183202
label = dataset.variety
fA = dataset.A
info_gain_VA = information_gain(dataset,label,fA)
print(info_gain_VA)

0.15809587018183202


## Exercise 3: Splitting Datasets Based on Information Gain
We split a dataset based on the feature which brings the highest information gain. Your task in this exercise is to first find the attribute that maximizes the information gain, then split the dataset based on it.   
**Question 1:**  
Write a function **feature_to_split_on(dataset)** which takes an argument **dataset** and returns the name of the attribute upon which the dataset will be split.

In [15]:
def feature_to_split_on(dataset):
    '''
    Insert your code here
    '''
    # Candidate features: every column except the label
    features = dataset.columns[dataset.columns != 'variety']

    # Information gain obtained by splitting on each candidate feature
    info_gains = [information_gain(dataset, dataset.variety, dataset[feature])
                  for feature in features]

    # Index into the same list of features the gains were computed for,
    # so the result does not depend on where the label column sits
    return features[int(np.argmax(info_gains))]


**Question 2:**  
Check that your function is working correctly by trying it on the dataset loaded earlier.

In [16]:
#Insert your code here
#A
print(feature_to_split_on(dataset))

A


**Question 3:**  
Finally write a function **split_dataset(dataset,feature)** which takes as parameters:  

1. dataset
2. feature (the feature upon which we will split the dataset)  

Your function should return a list of subdatasets (in the form of dataframes).  


In [17]:
def split_dataset(dataset, feature):
    '''
    Insert your code here
    
    '''
    # groupby yields one (feature value, sub-DataFrame) pair per distinct value
    return list(dataset.groupby(feature))
    

**Question 4:**  
Check that your function is working by running it on dataset and feature "A". This should return 3 subdatasets whose values for "A" do not overlap. If you split on "B" you should get only 2 subdatasets.
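
The check described above can be sketched on a small toy table as follows. The toy data and the names `toy_subsets` and `toy_values` are hypothetical. The sketch also shows one way to keep only the DataFrames when `groupby` returns (value, DataFrame) pairs.

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["a1", "a2", "a1", "a3"],
    "variety": ["y", "n", "n", "y"],
})

# groupby yields (value, sub-DataFrame) pairs; keep only the frames.
toy_subsets = [subset for _, subset in toy.groupby("A")]

# Each subset should contain exactly one distinct value of "A".
toy_values = [subset["A"].unique().tolist() for subset in toy_subsets]
print(len(toy_subsets), toy_values)
```

Since every row lands in exactly one subset, the subset sizes should add up to the size of the original table.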

In [18]:
#Insert code here

split_dataset(dataset,'A')

[('a1',
       A   B   C   D variety
  0   a1  b1  c1  d1       y
  9   a1  b1  c1  d1       y
  61  a1  b1  c2  d1       n
  74  a1  b2  c2  d2       n
  59  a1  b1  c2  d1       n
  10  a1  b1  c1  d1       y
  37  a1  b2  c1  d2       n
  8   a1  b1  c1  d1       y
  34  a1  b2  c1  d2       n
  58  a1  b1  c2  d1       n
  36  a1  b2  c1  d2       n
  41  a1  b2  c1  d2       n
  43  a1  b2  c1  d2       n
  39  a1  b2  c1  d2       n
  6   a1  b1  c1  d1       y
  11  a1  b1  c1  d1       y
  12  a1  b1  c1  d1       n
  33  a1  b2  c1  d2       n
  40  a1  b2  c1  d2       n
  14  a1  b1  c1  d1       n
  76  a1  b2  c2  d2       n
  57  a1  b1  c2  d1       n
  5   a1  b1  c1  d1       y
  7   a1  b1  c1  d1       y
  15  a1  b1  c1  d1       n
  56  a1  b1  c2  d1       n
  35  a1  b2  c1  d2       n
  1   a1  b1  c1  d1       y
  75  a1  b2  c2  d2       n
  4   a1  b1  c1  d1       y
  38  a1  b2  c1  d2       n
  60  a1  b1  c2  d1       n
  32  a1  b2  c1  d2       n
  42  

## Exercise 4: Iterative Dichotomiser 3 (ID3)

ID3 is one of the simplest algorithms used to learn decision trees from data. 
Study the pseudo-code below (from Wikipedia):

> 
    ID3 (Examples, Target_Attribute, Attributes)  
        Create a root node for the tree  
        If all examples are positive, Return the single-node tree Root, with label = +.  
        If all examples are negative, Return the single-node tree Root, with label = -.  
        If number of predicting attributes is empty, then Return the single node tree Root,  
        with label = most common value of the target attribute in the examples.  
        Otherwise Begin  
            A ← The Attribute that best classifies examples.  
            Decision Tree attribute for Root = A.  
            For each possible value, vi, of A,  
                Add a new tree branch below Root, corresponding to the test A = vi.  
                Let Examples(vi) be the subset of examples that have the value vi for A  
                If Examples(vi) is empty  
                    Then below this new branch add a leaf node with label = most common target value in the examples  
                Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})  
        End  
        Return Root  
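
The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in Iterative_Dichotomiser3: the names `id3_sketch`, `_entropy` and `_gain` are hypothetical, the tree is represented as nested dictionaries, and the "Examples(vi) is empty" branch is omitted because `groupby` only produces values that actually occur in the data.

```python
import numpy as np
import pandas as pd

def _entropy(labels):
    """Entropy in bits of a label column."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def _gain(df, feature, target):
    """Information gain of splitting df on feature."""
    weighted = sum(
        len(sub) / len(df) * _entropy(sub[target])
        for _, sub in df.groupby(feature)
    )
    return _entropy(df[target]) - weighted

def id3_sketch(df, target, features):
    """Return a leaf label or a nested dict {feature: {value: subtree}}."""
    labels = df[target]
    # All examples share one label: return it as a leaf.
    if labels.nunique() == 1:
        return labels.iloc[0]
    # No attributes left: fall back to the most common label.
    if not features:
        return labels.mode().iloc[0]
    # Otherwise split on the attribute with the highest information gain.
    best = max(features, key=lambda f: _gain(df, f, target))
    remaining = [f for f in features if f != best]
    return {best: {value: id3_sketch(sub, target, remaining)
                   for value, sub in df.groupby(best)}}

toy = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2"],
    "B": ["b1", "b2", "b1", "b2"],
    "variety": ["y", "y", "n", "n"],
})
toy_tree = id3_sketch(toy, "variety", ["A", "B"])
print(toy_tree)
```

On this toy table the label depends only on `A`, so the sketch stops after a single split on `A`.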


You can, optionally, implement your own function **decision_tree_id3(subdataset,dataset,label,features,parent_node)** which requires the following parameters:  
1. subdataset (in your first iteration, dataset and subdataset are equal)
2. dataset
3. label (what you're trying to predict)
4. features (what you'll be splitting your dataset/subdataset on)
5. parent_node  

Alternatively, we can use a naive implementation which we import from Iterative_Dichotomiser3.   
As you can see below, the algorithm is trained on the entire dataset, leaving no held-out data with which to assess how it performs.  


In [19]:
import Iterative_Dichotomiser3 as ID

tree=ID.ID3(dataset,dataset,dataset.columns[:-1])


**Question 1:**  
Write a function **train_test_split(dataset, ratio)** which takes a dataset as input and returns two datasets, one for training and another for testing. Here ratio is the fraction of rows assigned to the test set.
For our example dataset, we have 90 rows, so calling your function with the parameters (dataset, 0.1) will return a training set with 81 rows and a testing set with 9 rows.

In [20]:
def train_test_split(dataset, ratio):
    '''
    Insert your code here
    '''
    training_data = dataset.sample(frac=1-ratio)
    testing_data = dataset.drop(training_data.index)
    
#     row_count = dataset.shape[0]
#     split_point = int(row_count*(1-ratio))
#     training_data,testing_data = dataset[:split_point], dataset[split_point:]
    return training_data,testing_data


**Question 2:**  
Test whether your function produces the two subdatasets correctly. Print the shape of the test set below.

In [21]:
#Insert code here
#(9,5)
# print(train_test_split(dataset,0.1)[0])
# print(train_test_split(dataset,0.1)[1])
print(train_test_split(dataset,0.1)[1].shape)

(9, 5)
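
Beyond the shape, it is worth checking that the two subsets share no rows. The sketch below does this on a synthetic 90-row stand-in for the lab dataset. The frame `demo` and its column names are hypothetical, and `random_state` is set only to make the example repeatable.

```python
import pandas as pd

# Synthetic stand-in for the lab dataset: 90 rows, hypothetical columns.
demo = pd.DataFrame({"feature": range(90), "variety": ["y", "n"] * 45})

demo_train = demo.sample(frac=0.9, random_state=0)  # sample 90% of the rows
demo_test = demo.drop(demo_train.index)             # the remaining rows

# Row labels present in both subsets; this should be empty.
overlap = demo_train.index.intersection(demo_test.index)
print(len(demo_train), len(demo_test), len(overlap))
```

An empty overlap confirms that no test row was also used for training.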


Let's finally train the ID3 tree classifier on the training set and evaluate it on the test set.

In [29]:

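# Split once so that the training and test sets are disjoint
# (two separate calls draw two different random splits, which can overlap)
train, test = train_test_split(dataset,0.1)
tree = ID.ID3(train,train,train.columns[:-1])
print(test.variety)
print(tree)
# NOTE: this call raised the IndexingError shown below. A likely cause (an assumption,
# since the module's source is not shown here) is that ID.test expects the test
# DataFrame rather than only its label column.
ID.test(test.variety,tree)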

#The prediction accuracy is:  77.77777777777779 %

26    y
83    y
80    y
51    n
75    n
28    y
46    y
70    y
54    n
Name: variety, dtype: object
{'A': {'a1': {'B': {'b1': {'C': {'c1': {'D': {'d1': 'y'}}, 'c2': 'n'}}, 'b2': 'n'}}, 'a2': {'C': {'c1': {'B': {'b1': {'D': {'d2': 'n'}}, 'b2': {'D': {'d1': 'y'}}}}, 'c2': 'y'}}, 'a3': {'B': {'b1': 'y', 'b2': {'C': {'c1': 'n', 'c2': 'y'}}}}}}


IndexingError: Too many indexers

## Exercise 5: Comparing your Decision Tree with scikit-learn's


**Question 1:**  
  
Use the **DecisionTreeClassifier** available in the **sklearn** library to train a decision tree model on the example dataset we've been working on so far.    

Fine-tune the tree classifier as you see fit and make sure it uses **entropy** not **gini**.       
**Some data preprocessing might be required before you can run the classifier on the example dataset.**    
Check the accuracy of the sklearn classifier and compare it to the one we imported from Iterative_Dichotomiser3 (which we partly implemented).    



In [30]:
#Insert your code here
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

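train_features = pd.get_dummies(train.drop('variety',axis=1))
train_target = train['variety']
# Align the test columns with the training columns, in case a category
# happens to be missing from the small test set
test_features = pd.get_dummies(test.drop('variety',axis=1)).reindex(columns=train_features.columns, fill_value=0)
test_target = test['variety']

tree = DecisionTreeClassifier(criterion = 'entropy',max_depth=3 ).fit(train_features,train_target) 
predict = tree.predict(test_features)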

print("The prediction accuracy is: ",accuracy_score(test_target, predict)*100,"%")


The prediction accuracy is:  100.0 %
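
An alternative way to handle the preprocessing is to let scikit-learn do the one-hot encoding inside a pipeline. The sketch below is an illustration on hypothetical toy data, not the approach used in the cell above. `OneHotEncoder(handle_unknown="ignore")` encodes a category that was never seen during training as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label depends only on feature A.
toy_train = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2"],
    "B": ["b1", "b2", "b1", "b2"],
})
toy_labels = pd.Series(["y", "y", "n", "n"])
# "b3" never appears in the training data.
toy_test = pd.DataFrame({"A": ["a1", "a2"], "B": ["b3", "b1"]})

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(criterion="entropy", random_state=0),
)
model.fit(toy_train, toy_labels)
toy_predictions = model.predict(toy_test)
print(list(toy_predictions))
```

Because the encoder is fitted on the training data only, the test set is always transformed into the same columns the tree was trained on.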


**Question 2:**  
Give an interpretation of why your classifier underperforms (or outperforms if you didn't fine-tune the sklearn classifier well) compared to the sklearn tree classifier.

In [None]:
#Enter your answer here
# splitter (string, optional, default="best"): our dataset has only a few features and this choice is unlikely to cause overfitting, so keeping the default is the safer option.
# max_depth (int or None, optional, default=None): if the tree overfits the training data, test accuracy may be low, so decreasing max_depth can improve it.