# Assignment 1 - Decision Trees and Clustering Techniques

## *Aprendizagem Computacional - MEI | Computação Neuronal e Sistemas Difusos - MIEB*

### by Catarina Silva and Marco Simões

_

This assignment will assess the students' knowledge of the following Machine Learning topics:
- Decision Trees
- Clustering Techniques

The assignment is split into two sub-assignments: 1-a) Decision Trees (first week) and 1-b) Clustering Techniques (second week).

Students should implement their solutions and answer the questions directly in the notebooks, and submit both files together in Inforestudante before the deadline: *06/10/2021*

## Conditions: 
- *Groups:* two elements of the same PL class
- *Duration:* 2 weeks
- *Workload:* 8h per student
 


# Assignment 1 - a) Decision Trees

Consider the depression dataset, from Agresti, A. (2019). _An introduction to categorical data analysis (2nd ed.). John Wiley & Sons._ This dataset consists of evaluations of 335 patients during a three-phase treatment. We want to learn a decision tree that, given the attributes A - Diagnosis Severity (0: Mild, 1: Severe), B - Treatment Type (0: Standard, 1: New drug) and C - Follow Up Time (0: 1 week, 1: 2 weeks, 2: 4 weeks), predicts D - Depression Outcome (0: Normal, 1: Abnormal).




In [None]:
import pandas as pd
import numpy as np
... # TODO add extra imports if needed


# load data
data = pd.read_csv('depression.csv')


***
### Ex. 1
Create a function `attr_probs( data, attr )` that, given the dataset (`data`) and an attribute id (`attr`), computes the percentage of cases with Abnormal treatment outcome (D) for each attribute *value*. The function should return a dictionary with the different attribute values as keys and the corresponding percentages as values. Example: `attr_probs( data, 'A')` -> returns `{0: 0.30, 1: 0.23}`

In [5]:
OUTCOME = 'D'

def attr_probs( data, attr):
    # Keep only the attribute column and the outcome column
    df = data[[attr, OUTCOME]]
    total = len(df)
    probs = {}

    # For each attribute value, count the Abnormal cases (outcome == 1)
    # and divide by the total number of cases, matching the example above
    for value in sorted(df[attr].unique()):
        abnormal = ((df[attr] == value) & (df[OUTCOME] == 1)).sum()
        # int() assumes the attribute values are integer codes (0, 1, 2)
        probs[int(value)] = float(abnormal / total)

    return probs

# attr_probs(data, 'A') -> {0: 0.296078431372549, 1: 0.22647058823529412}

***
### Ex. 2
Create a function `entropy( probs )` that, given a list of probability values, returns the corresponding **entropy** value.
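
For reference, the quantity asked for here is the base-2 Shannon entropy of a discrete distribution. In the note below, $p_1, \dots, p_n$ stand for the entries of `probs` and $H$ is only a label chosen for this note; terms with zero probability are taken to contribute nothing:

$$
H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad 0 \cdot \log_2 0 := 0
$$

As a quick check, for the probabilities $\left(\tfrac{2}{8}, \tfrac{0}{8}, \tfrac{4}{8}, \tfrac{2}{8}\right)$:

$$
H = \tfrac{1}{4} \cdot 2 + 0 + \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 = 1.5
$$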

In [None]:
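def calc_log2(value):
    # By convention 0 * log2(0) = 0, so return 0 for non-positive values
    if value > 0:
        return np.log2(value)
    return 0

def entropy( probs ):
    entropy_value = 0

    # Sum -p * log2(p) over the given probabilities
    for value in probs:
        entropy_value -= value * calc_log2(value)

    return round(entropy_value, 3)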

    
    

In [None]:
# example
print(entropy([2/8, 0/8, 4/8, 2/8])) # should print 1.5

***
### Ex. 3 
Create a function `gain( data, attr )` to compute the gain of an attribute. Make use of the functions developed in the previous exercises.
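
As a reminder of what is being computed, the information gain of an attribute is usually written as the entropy of the whole set minus the weighted entropies of the subsets induced by each attribute value. The symbols below are notation assumed for this note only: $S$ is the dataset, $X$ the attribute, $S_v$ the subset of cases with $X = v$, and $H(\cdot)$ the entropy of the outcome D within a set:

$$
\mathrm{Gain}(S, X) = H(S) - \sum_{v \in \mathrm{values}(X)} \frac{|S_v|}{|S|} \, H(S_v)
$$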

In [None]:
def unique(df):
    unique_list = np.unique(np.array(df))
    return unique_list

def gain(data, attr):
    # Keep only the attribute column and the outcome column
    df = data[[attr, 'D']]

    # Counts and entropy of the whole set
    set_total = len(df)
    set_positive = int((df['D'] == 1).sum())
    set_negative = set_total - set_positive
    entropy_set = entropy([set_positive / set_total, set_negative / set_total])

    # The entropy of the whole set is the same for every attribute, so print it once
    if attr == 'A':
        print(f'- Entropy Value: {entropy_set} - Positive: {set_positive}/{set_total} - Negative: {set_negative}/{set_total}')

    print(f'- Set {attr}:')
    gain_value = entropy_set

    # Subtract the weighted entropy of each subset (one subset per attribute value)
    for value in unique(df[attr]):
        subset = df[df[attr] == value]
        subset_total = len(subset)
        subset_positive = int((subset['D'] == 1).sum())
        subset_negative = subset_total - subset_positive
        entropy_subset = entropy([subset_positive / subset_total, subset_negative / subset_total])

        print(f'    - Subset {value}:')
        print(f'        - Positive: {subset_positive}/{subset_total} - Negative: {subset_negative}/{subset_total} - Entropy: {entropy_subset}')

        gain_value -= (subset_total / set_total) * entropy_subset

    print(f'\nSET {attr} GAIN:  {round(gain_value, 3)}\n')
    return round(gain_value, 3)

    
    

***
### Ex. 4 

Run the following code to compute the gain for the different attributes. How do these results influence the design of the decision tree?

In [None]:
ATTRS = ['A', 'B', 'C']
for attr in ATTRS:
    print('Gain {attr}: {gain:.2f}'.format(attr=attr, gain=gain(data, attr)))


**Answer:**

- Entropy Value: 0.999 - Positive: 533/1020 - Negative: 487/1020
- Set A:
    - Subset 0:
        - Positive: 302/450 - Negative: 148/450 - Entropy: 0.914
    - Subset 1:
        - Positive: 231/570 - Negative: 339/570 - Entropy: 0.974

SET A GAIN:  0.051

- Set B:
    - Subset 0:
        - Positive: 237/540 - Negative: 303/540 - Entropy: 0.989
    - Subset 1:
        - Positive: 296/480 - Negative: 184/480 - Entropy: 0.96

SET B GAIN:  0.024

- Set C:
    - Subset 0:
        - Positive: 115/340 - Negative: 225/340 - Entropy: 0.923
    - Subset 1:
        - Positive: 175/340 - Negative: 165/340 - Entropy: 0.999
    - Subset 2:
        - Positive: 243/340 - Negative: 97/340 - Entropy: 0.863

SET C GAIN:  0.071


'C' has the highest gain (0.071, against 0.051 for 'A' and 0.024 for 'B'), so it is the attribute chosen for the root of the tree.
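
The reported gains can be re-derived by hand from the printed counts and (rounded) entropies. A sketch of the arithmetic for A and C, writing $\mathrm{Gain}(\cdot)$ as shorthand for the value returned by `gain`:

$$
\mathrm{Gain}(A) = 0.999 - \frac{450}{1020} \cdot 0.914 - \frac{570}{1020} \cdot 0.974 \approx 0.051
$$

$$
\mathrm{Gain}(C) = 0.999 - \frac{340}{1020} \left( 0.923 + 0.999 + 0.863 \right) \approx 0.071
$$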

***
### Ex. 5

Split the dataset into two sets (train set and test set), assigning randomly $70\%$ of the cases to the train set and the remaining $30\%$ to the test set. Use the `train_test_split` method from the `sklearn.model_selection` module, specifying the `random_state` with a value of $7$ for reproducibility purposes.

Train a `DecisionTreeClassifier` (from the `sklearn.tree` module) using the training data. Enforce the use of the `entropy` criterion instead of the `gini` criterion. 

Use the function `export_text` from the `sklearn.tree` module to visualize the structure of the resulting tree. Are the results of **Ex. 4** consistent with the tree obtained here? Justify.



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

csv = pd.read_csv('depression.csv')

data = csv.iloc[:, :-1]
labels = csv.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=7)

classifier = DecisionTreeClassifier(criterion='entropy')

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

plot = export_text(classifier, feature_names=['A', 'B', 'C'])
print(plot)



**Answer:**

|--- C <= 1.50
|   |--- A <= 0.50
|   |   |--- B <= 0.50
|   |   |   |--- C <= 0.50
|   |   |   |   |--- class: 1
|   |   |   |--- C >  0.50
|   |   |   |   |--- class: 1
|   |   |--- B >  0.50
|   |   |   |--- C <= 0.50
|   |   |   |   |--- class: 1
|   |   |   |--- C >  0.50
|   |   |   |   |--- class: 1
|   |--- A >  0.50
|   |   |--- C <= 0.50
|   |   |   |--- B <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- B >  0.50
|   |   |   |   |--- class: 0
|   |   |--- C >  0.50
|   |   |   |--- B <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- B >  0.50
|   |   |   |   |--- class: 1
|--- C >  1.50
|   |--- B <= 0.50
|   |   |--- A <= 0.50
|   |   |   |--- class: 1
|   |   |--- A >  0.50
|   |   |   |--- class: 0
|   |--- B >  0.50
|   |   |--- A <= 0.50
|   |   |   |--- class: 1
|   |   |--- A >  0.50
|   |   |   |--- class: 1


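Yes. The root node of the tree splits on C (`C <= 1.50`), which is the attribute with the highest gain in **Ex. 4** (0.071). Note that the gains in Ex. 4 were computed on the full dataset, while this tree was trained on the 70% training split only.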

***
### Ex. 6

Looking at the structure of the printed tree, evaluate the following cases (by hand) and provide the outcome class for each case, as well as the path from the root to the leaf (that is, the conditions that evaluated as true to reach that class).

**Cases:**<p>
c1 = (A=1, B=0, C=2)<p>
c2 = (A=0, B=0, C=0)<p>
c3 = (A=0, B=0, C=1)<p>
c4 = (A=1, B=1, C=0)<p>


**Example:**<p>
case: cx = (A=1, B=1, C=1)<p>
path: (C <= 1.5) --> (A > 0.5) --> (C > 0.5) --> (B > 0.5) --> class 1<p>


**Answer:**

case: c1 = (A=1, B=0, C=2)<p>
path: `(C > 1.5) --> (B <= 0.5) --> (A > 0.5) --> class 0`<p>
_

case: c2 = (A=0, B=0, C=0)<p>
path: `(C <= 1.5) --> (A <= 0.5) --> (B <= 0.5) --> (C <= 0.5) --> class 1`<p>
_

case: c3 = (A=0, B=0, C=1)<p>
path: `(C <= 1.5) --> (A <= 0.5) --> (B <= 0.5) --> (C > 0.5) --> class 1`<p>
_

case: c4 = (A=1, B=1, C=0)<p>
path: `(C <= 1.5) --> (A > 0.5) --> (C <= 0.5) --> (B > 0.5) --> class 0`<p>
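
The by-hand traversal can be cross-checked with a small sketch that transcribes the tree printed in Ex. 5 into nested conditions. The function name `classify` and the helper `test` are hypothetical, introduced only for this illustration; the thresholds and leaf classes are copied from the `export_text` output above, not taken from the fitted classifier.

```python
def classify(a, b, c):
    """Follow the printed tree for one case; return (path, predicted_class)."""
    path = []

    def test(name, value, threshold):
        # Record which branch is taken and report whether value <= threshold
        went_left = value <= threshold
        path.append(f"({name} {'<=' if went_left else '>'} {threshold})")
        return went_left

    if test("C", c, 1.5):
        if test("A", a, 0.5):
            # In the printed tree, every leaf under this branch is class 1
            test("B", b, 0.5)
            test("C", c, 0.5)
            label = 1
        elif test("C", c, 0.5):
            # Both B leaves under this branch are class 0
            test("B", b, 0.5)
            label = 0
        else:
            label = 0 if test("B", b, 0.5) else 1
    elif test("B", b, 0.5):
        label = 1 if test("A", a, 0.5) else 0
    else:
        # Both A leaves under this branch are class 1
        test("A", a, 0.5)
        label = 1

    return " --> ".join(path + [f"class {label}"]), label


# The four cases of this exercise, as (A, B, C) tuples
cases = {"c1": (1, 0, 2), "c2": (0, 0, 0), "c3": (0, 0, 1), "c4": (1, 1, 0)}
for name, (a, b, c) in cases.items():
    print(name, classify(a, b, c)[0])
```

For the example case `cx = (A=1, B=1, C=1)` this reproduces the path given in the exercise statement, which is a useful sanity check on the transcription.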



***
### Ex. 7

Apply the decision tree trained in the previous exercise to the test data. Compare the predicted labels to the true labels, generating a confusion matrix (you can use the `confusion_matrix` function of the `sklearn.metrics` module for that). Report the **percentage** of `True Positives, True Negatives, False Positives and False Negatives`, as well as the metrics `accuracy, precision, recall and f1-score`.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

csv = pd.read_csv('depression.csv')

data = csv.iloc[:, :-1]
labels = csv.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=7)

classifier = DecisionTreeClassifier(criterion='entropy')

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
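cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
plot_cm = ConfusionMatrixDisplay(confusion_matrix=cm)
# For labels [0, 1], sklearn returns [[TN, FP], [FN, TP]] (rows: true, columns: predicted),
# taking 1 (Abnormal) as the positive class
TN, FP, FN, TP = cm.ravel()
total = cm.sum()

rules = export_text(classifier, feature_names=['A', 'B', 'C'])
print(rules)
plot_cm.plot()
plt.show()
print(classification_report(y_true=y_test, y_pred=y_pred))
# Report each cell as a percentage of the test cases
print(f'- True Positive : {TP / total:.1%}\
        - True Negative : {TN / total:.1%}\
        - False Positive : {FP / total:.1%}\
        - False Negative : {FN / total:.1%}')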
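
To make the link between the four confusion-matrix counts and the requested metrics explicit, here is a minimal sketch in plain Python. The function name `metrics_from_counts` and the example counts are hypothetical and chosen only to illustrate the arithmetic; they are not results from the depression dataset.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive percentages and standard metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Guard against empty denominators (e.g. no positive predictions)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "TP %": 100 * tp / total,
        "TN %": 100 * tn / total,
        "FP %": 100 * fp / total,
        "FN %": 100 * fn / total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, used only to show how the function is called
example = metrics_from_counts(tp=40, tn=30, fp=20, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

These values refer to the positive class only (here, 1: Abnormal), whereas `classification_report` lists precision, recall and f1-score for each class separately.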




***
### Ex. 8
Repeat the process of splitting the data, training the classifier and testing the classifier 100 times (use the values from 0 to 99 as `random_state` for the `train_test_split` function). Plot the accuracy across the 100 repetitions, also reporting its mean value and standard deviation.


In [None]:
# TODO CODE HERE
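
One possible way to structure this exercise is sketched below; it is an illustration under stated assumptions, not the expected solution. The function names `repeated_accuracies` and `plot_accuracies` are hypothetical. The sketch assumes features and labels are passed in the same form as in Ex. 7, and it uses the classifier's `score` method for the accuracy.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def repeated_accuracies(X, y, n_runs=100, test_size=0.3):
    """Split, train and test once per random_state in 0..n_runs-1; return the accuracies."""
    accuracies = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        classifier = DecisionTreeClassifier(criterion='entropy')
        classifier.fit(X_train, y_train)
        # score() returns the mean accuracy on the given test data
        accuracies.append(classifier.score(X_test, y_test))
    return np.array(accuracies)


def plot_accuracies(accuracies):
    """Plot the accuracy of each repetition, with the mean as a dashed line."""
    import matplotlib.pyplot as plt  # imported here so the loop above runs without it

    mean, std = accuracies.mean(), accuracies.std()
    plt.plot(accuracies, marker='.')
    plt.axhline(mean, linestyle='--', color='gray')
    plt.xlabel('random_state')
    plt.ylabel('accuracy')
    plt.title(f'Accuracy: mean = {mean:.3f}, std = {std:.3f}')
    plt.show()
```

With the `data` and `labels` variables from Ex. 7, this could be called as `accuracies = repeated_accuracies(data, labels)` followed by `plot_accuracies(accuracies)`. The mean and standard deviation are then available as `accuracies.mean()` and `accuracies.std()`.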

