#### This assignment may be worked individually or in pairs. Enter your name/names here:
    

In [1]:
#names here

# Assignment 1: Decision Trees

In this assignment we'll implement the Decision Tree algorithm to classify patients as either having or not having diabetic retinopathy. For this task we'll be using the Diabetic Retinopathy data set, which contains features from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. This dataset has `1151` instances and `20` attributes (some categorical, some continuous). You can find additional details about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set).

Attribute Information:

0) The binary result of quality assessment. 0 = bad quality 1 = sufficient quality.

1) The binary result of pre-screening, where 1 indicates severe retinal abnormality and 0 its lack. 

2-7) The results of MA detection. Each feature value stand for the number of MAs found at the confidence levels alpha = 0.5, . . . , 1, respectively. 

8-15) contain the same information as 2-7) for exudates. However, as exudates are represented by a set of points rather than the number of pixels constructing the lesions, these features are normalized by dividing the 
number of lesions with the diameter of the ROI to compensate different image sizes. 

16) The euclidean distance of the center of the macula and the center of the optic disc to provide important information regarding the patient's condition. This feature is also normalized with the diameter of the ROI.

17) The diameter of the optic disc. 

18) The binary result of the AM/FM-based classification.

19) Class label. 1 = contains signs of Diabetic Retinopathy (Accumulative label for the Messidor classes 1, 2, 3), 0 = no signs of Diabetic Retinopathy.


A few function prototypes are already given to you, please don't change those. You can add additional helper functions for your convenience. *Suggestion:* The dataset is substantially big, for the purpose of easy debugging work with a subset of the data and test your decision tree implementation on that.

#### Implementation: 
A few function prototypes are already given to you, please don't change those. You can add additional helper functions for your convenience. 

*Suggestion:* The dataset is substantially big, for the purpose of easy debugging, work with a subset of the data and test your decision tree implementation on that.

#### Notes:
Parts of this assignment will be **autograded** so a couple of caveats :-
- Entropy is calculated using log with base 2, `math.log2(x)`.
- For continuous features ensure that the threshold value lies exactly between 2 buckets. For example, if for feature 2 the best split occurs between 10 and 15 then the threshold value will be set as 12.5.
- For binary features [0/1] the threshold value will be 0.5. All values < `thresh_val` go to the left child and all values >= `thresh_val` go to the right child.

In [2]:
# Standard Headers
# You are welcome to add additional headers if you wish
# EXCEPT for scikit-learn... You may NOT use scikit-learn for this assignment!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import log
from random import shuffle

In [3]:
class DataPoint:
    def __str__(self):
        return "< " + str(self.label) + ": " + str(self.features) + " >"
    def __init__(self, label, features):
        self.label = label # the classification label of this data point
        self.features = features

Q1. Read data from a CSV file. Put it into a list of `DataPoints`.

In [4]:
def get_data(filename):
    data = []
#     your code goes here
    data_frame = pd.read_csv(filename, header=None)
    for row in data_frame.iterrows():
        data.append(DataPoint(row[1][19], row[1][:19]))
    return data

In [5]:
class TreeNode:
    is_leaf = True          # boolean variable to check if the node is a leaf
    feature_idx = None      # index that identifies the feature
    thresh_val = None       # threshold value that splits the node
    prediction = None       # prediction class (only valid for leaf nodes)
    left_child = None       # left TreeNode (all values < thresh_val)
    right_child = None      # right TreeNode (all values >= thresh_val)
    
    def printTree(self):    # for debugging purposes
        if self.is_leaf:
            print ('Leaf Node:      predicts ' + str(self.prediction))
        else:
            print ('Internal Node:  splits on feature ' 
                   + str(self.feature_idx) + ' with threshold ' + str(self.thresh_val))
            self.left_child.printTree()
            self.right_child.printTree()

Q2. Implement the function `make_prediction` that takes the decision tree root and a `DataPoint` instance and returns the prediction label.

In [6]:
def make_prediction(tree_root, data_point):
#     your code goes here
    if tree_root.is_leaf:
        return tree_root.prediction
    else:
        data_point_val = data_point.features[tree_root.feature_idx]
        if data_point_val < tree_root.thresh_val:
            return make_prediction(tree_root.left_child, data_point)
        else:
            return make_prediction(tree_root.right_child, data_point)

Q3. Implement the function `split_dataset` given an input data set, a `feature_idx` and the `threshold` for the feature. `left_split` will have all values < `threshold` and `right_split` will have all values >= `threshold`.

In [7]:
def split_dataset(data, feature_idx, threshold):
    left_split = []
    right_split = []
#     your code goes here
    for data_point in data:
        if data_point.features[feature_idx] < threshold:
            left_split.append(data_point)
        else:
            right_split.append(data_point)
    return (left_split, right_split)

Q4. Implement the function `calc_entropy` to return the entropy of the input dataset.

In [8]:
def calc_entropy(data):
    entropy = 0.0
#     your code goes here
    if not data:
        return entropy
    positive = 0
    negative = 0
    for data_point in data:
        if data_point.label:
            positive += 1
        else:
            negative += 1
    prop_pos = positive / (positive + negative)
    prop_neg = negative / (positive + negative)
    return calc_entropy_props(prop_pos, prop_neg)

def calc_entropy_props(prop_pos, prop_neg):
    entropy_pos = 0
    if prop_pos:
        entropy_pos = -prop_pos * log(prop_pos, 2)
    entropy_neg = 0
    if prop_neg:
        entropy_neg = - prop_neg * log(prop_neg, 2)
    return entropy_pos + entropy_neg

Q5. Implement the function `calc_best_threshold` which returns the best information gain and the corresponding threshold value for one feature at `feature_idx`.

In [9]:
def calc_best_threshold(data, feature_idx):
    best_info_gain = 0.0
    best_thresh = None
    binary_features = (0, 1, 18)
    data_entropy = calc_entropy(data)
#     your code goes here
    if feature_idx in binary_features:
        left, right = split_dataset(data, feature_idx, 0.5)
        len_left = len(left)
        len_right = len(right)
        prop_left =  len_left/ (len_left + len_right)
        prop_right = len_right / (len_left + len_right)
        best_thresh = 0.5
        best_info_gain = data_entropy - (prop_left * calc_entropy(left) + prop_right * calc_entropy(right))
    else:
        sorted_data = sorted(data, key=lambda x: x.features[feature_idx])
        left_pos = 0
        left_tot = 0
        right_pos = 0
        right_tot = len(sorted_data)
        for data_point in data:
            if data_point.label:
                right_pos += 1
        
        for i in range(len(sorted_data) - 1):
            cur_point = sorted_data[i]
            next_point = sorted_data[i + 1]
            if cur_point.label:
                left_pos += 1
                right_pos -= 1
            left_tot += 1
            right_tot -= 1
            
            if cur_point.label != next_point.label:
                left_entropy = calc_entropy_props(left_pos / left_tot, (left_tot - left_pos) / left_tot)
                right_entropy = calc_entropy_props(right_pos / right_tot, (right_tot - right_pos) / right_tot)
                left_prop = left_tot / (left_tot + right_tot)
                right_prop = 1 - left_prop
                info_gain = data_entropy - left_prop * left_entropy - right_prop * right_entropy
                if info_gain > best_info_gain:
                    best_info_gain = info_gain
                    best_thresh = (cur_point.features[feature_idx] + next_point.features[feature_idx]) / 2
            
    return (best_info_gain, best_thresh)

Q6. Implement the function `identify_best_split` which returns the best feature to split on for an input dataset, and also returns the corresponding threshold value.

In [10]:
def identify_best_split(data):
    if len(data) < 2:
        return (None, None)
    best_feature = None
    best_thresh = None
    best_gain = 0.0
    for i in range(len(data[0].features)):
        gain, thresh = calc_best_threshold(data, i)
        if best_gain < gain:
            best_feature = i
            best_thresh = thresh
            best_gain = gain
    return (best_feature, best_thresh)

Q7. Implement the function `createLeafNode` which returns a `TreeNode` with `is_leaf=True` and `prediction` set to whichever classification occurs most in the dataset at this node.

In [11]:
def createLeafNode(data):
    leaf = TreeNode()
    pos = 0
    
    for data_point in data:
        if data_point.label:
            pos += 1
            
    if pos > len(data) / 2:
        leaf.prediction = 1.0
    else:
        leaf.prediction = 0.0
        
    return leaf

Q8. Implement the `createDecisionTree` function. `max_levels` denotes the maximum height of the tree (for example if `max_levels = 1` then the decision tree will only contain the leaf node at the root. [Hint: this is where the recursion happens.]

In [12]:
def createDecisionTree(data, max_levels):
#     your code goes here
    if max_levels == 1:
        return createLeafNode(data)
    else:
        best_feature, best_thresh = identify_best_split(data)
        if best_feature is None:
            return createLeafNode(data)
        left, right = split_dataset(data, best_feature, best_thresh)
        if not left or not right:
            return createLeafNode(data)
        cur_node = TreeNode()
        cur_node.left_child = createDecisionTree(left, max_levels - 1)
        cur_node.right_child = createDecisionTree(right, max_levels - 1)
        cur_node.is_leaf = False
        cur_node.feature_idx = best_feature
        cur_node.thresh_val = best_thresh
        return cur_node

Q9. Given a test set, the function `calcAccuracy` returns the accuracy of the classifier. You'll use the `makePrediction` function for this.

In [13]:
def calcAccuracy(tree_root, data):
    total = 0
    num_correct = 0
    for data_point in data:
        if make_prediction(tree_root, data_point) == data_point.label:
            num_correct += 1
        total += 1
    
    return num_correct / total

Q10. Keeping the `max_levels` parameter as 10, use 5-fold cross validation to measure the accuracy of the model. Print the recall and precision of the model. Also display the confusion matrix.

In [14]:
# edit the code here - this is just a sample to get you started
import time

d = get_data("messidor_features.txt")
test_set_len = len(d) // 5
accuracies = []
for i in range(5):
# partition data into train_set and test_set
    train_set = d[0:i * test_set_len] + d[(i + 1) * test_set_len:len(d)]
    test_set = d[i * test_set_len:(i + 1) * test_set_len]

    print ('Training set size:', len(train_set))
    print ('Test set size    :', len(test_set))

    # create the decision tree
    start = time.time()
    tree = createDecisionTree(train_set, 10)
    end = time.time()
    print ('Time taken:', end - start)

    # calculate the accuracy of the tree
    accuracy = calcAccuracy(tree, test_set)
    accuracies.append(accuracy)
    print ('The accuracy on the test set ', i, ' is ', str(accuracy * 100.0))
avg_accuracy = sum(accuracies) / len(accuracies)
print ('The average accuracy on the data set is: ', str(avg_accuracy)
#t.printTree()

Training set size: 921
Test set size    : 230
Time taken: 1.0745725631713867
The accuracy on the test set  0  is  63.04347826086957
Training set size: 921
Test set size    : 230
Time taken: 1.0341777801513672
The accuracy on the test set  1  is  65.21739130434783
Training set size: 921
Test set size    : 230
Time taken: 0.9987263679504395
The accuracy on the test set  2  is  68.26086956521739
Training set size: 921
Test set size    : 230
Time taken: 0.991776704788208
The accuracy on the test set  3  is  65.65217391304347
Training set size: 921
Test set size    : 230
Time taken: 0.8748905658721924
The accuracy on the test set  4  is  65.21739130434783
The average accuracy on the data set is:  0.6547826086956521
