###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes


**Student ID(s):**     Mihai Blaga (1085020) Kai Stevens-Noguchi (963632)


This Python notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions to help respond to the questions here, but your formal answer should be submitted separately as a PDF.

### Naive Bayes classifier

In [None]:
import scipy.stats
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple

# Forward slashes work on Windows, macOS and Linux alike
TEST_LOCATION = "COMP30027_2021_assignment1_data/test.csv"
TRAIN_LOCATION = "COMP30027_2021_assignment1_data/train.csv"


In [None]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

#input location of the data

def preprocess(csv_location: str) -> pd.DataFrame:
    """
    Takes a CSV location as input and returns a DataFrame with missing values (coded as 9999) converted to NaN and appropriate column names.
    Select a column: data[column_name]
    Select a row: data.loc[row_no]
    Select an element: data.loc[row_no, column_name]
    """
    columns = ['label', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'y1', 'y2', 'y3', 'y4', 'y5', 'y6', 'y7', 'y8', 'y9','y10', 'y11']
    # Change 9999 to np.nan when reading in the data
    data = pd.read_csv(csv_location, header = None, na_values = 9999, names = columns) 

    return data

In [None]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

#input dataframe of training data
#output dataframe with mean, dataframe with std and log prior probabilities for each label

def train(data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, dict]:
     """
     Takes a pandas DataFrame and returns two DataFrames holding the mean and standard deviation of each feature for each label, plus a dict of log prior probabilities keyed by label.
     Example layout of mean_df:
          |  x1  |  x2  |  x3  | ...
     label| 2.31   21.4   3.21
     label| 31.2   10.9   45.2 
     label| 
     """
     # groupby each label and compute the mean and std for each label
     grouped = data.groupby("label")
     # use the built-in groupby reductions; std() is the sample standard deviation (ddof=1)
     mean_df: pd.DataFrame = grouped.mean()
     std_df: pd.DataFrame = grouped.std()
     # calculate logged prior probabilities
     total_size: int = len(data.index)
     group_sizes: pd.Series = grouped.size()
     probabilities: dict = {label: np.log(group_sizes[label] / total_size) for label in group_sizes.index}
     return (mean_df, std_df, probabilities)

In [None]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

#input Dataframe and train output
#output list of labels

def predict(train_df: pd.DataFrame, test_df: pd.DataFrame, use_missing=False) -> list:
    """
    Takes train data and test data as input, builds a naive Bayes classifier and returns a list containing the predicted labels for the test data.
    When use_missing is set to True, the implementation for Q5 is used.
    """
    if use_missing:
        missing_probabilities = calculate_missing(train_df)

    # get mean, std and prior probabilities
    mean_df, std_df, prior_probabilities = train(train_df)
    prediction: list = list()

    # iterate over all points in the test set and accumulate the log probability of each label
    for _, row in test_df.iterrows():
        probabilities = prior_probabilities.copy()
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for feature, value in row.iloc[1:].items():
            if np.isnan(value) and not use_missing:
                continue

            for label in prior_probabilities.keys():
                # if a point is missing use probability of point missing
                if np.isnan(value):
                    likelihood = missing_probabilities[feature][label]
                else:
                    likelihood = scipy.stats.norm(mean_df[feature][label], std_df[feature][label]).pdf(value)
                # replace a zero likelihood with epsilon so that its log is defined
                if likelihood == 0:
                    likelihood = sys.float_info.epsilon
                probabilities[label] += np.log(likelihood)

        # pick the label with highest probability
        prediction.append(max(probabilities.items(), key=lambda x: x[1])[0])
        
    return prediction

In [None]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

#theoretically run predict(test.csv) and compare that to test["labels"]
#input list of labels and list of true values
#output score and individual label score.

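def evaluate(prediction: list, ground_truth: list) -> Tuple[dict, pd.DataFrame]: 
    """
    Compares predicted labels to ground-truth labels.
    Returns a dict of macro, micro and weighted-average precision, recall and F1,
    and a DataFrame of per-label precision, recall, F1 and instance count.
    """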
    data = pd.DataFrame({"ground truth": ground_truth, "prediction": prediction})
    
    num_inst = len(data)
    num_classes = len(set(data["ground truth"].values))

    #initial counters
    micro_tp = 0
    micro_fp = 0
    micro_fn = 0

    macro_prec = 0
    macro_recall = 0
    macro_f1 = 0

    wa_prec = 0
    wa_recall = 0
    wa_f1 = 0

    output = pd.DataFrame(columns = ["precision", "recall", "f1", "num_label"])
    indexes = []

    for label in set(data["ground truth"].values):
        tp = len(data.loc[(data["ground truth"] == label) & (data["prediction"] == label)])
        fp = len(data.loc[(data["ground truth"] != label) & (data["prediction"] == label)])
        fn = len(data.loc[(data["ground truth"] == label) & (data["prediction"] != label)])
        num_label = len(data.loc[data["ground truth"] == label])

        micro_tp = micro_tp + tp
        micro_fp = micro_fp + fp
        micro_fn = micro_fn + fn

        #conditionals to avoid division by 0 errors.
        if (tp > 0):
            prec = tp / (tp + fp)
            recall = tp / (tp + fn)
        else:
            prec = 0
            recall = 0

        if (prec + recall > 0):
            f1 = 2*prec*recall / (prec + recall)
        else:
            f1 = 0

        #macro calculations
        macro_prec = macro_prec + prec
        macro_recall = macro_recall + recall
        macro_f1 = macro_f1 + f1

        #weighted average calculations
        wa_prec = wa_prec + (num_label/num_inst)*prec
        wa_recall = wa_recall + (num_label/num_inst)*recall
        wa_f1 = wa_f1 + (num_label/num_inst)*f1

        #for broken down calculations of precision, recall and f1
        temp = pd.DataFrame([[prec, recall, f1, num_label]], columns = ["precision", "recall", "f1", "num_label"])
        indexes = indexes + [label]
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        output = pd.concat([output, temp])

    #final macro division
    macro_prec = macro_prec / num_classes
    macro_recall = macro_recall / num_classes
    macro_f1 = macro_f1 / num_classes

    #micro calculations
    micro_prec = micro_tp / (micro_tp + micro_fp)
    micro_recall = micro_tp / (micro_tp + micro_fn)
    micro_f1 = 2*micro_prec*micro_recall / (micro_prec + micro_recall)

    #table of broken down calculations used by Q1
    output.index = indexes
    # print("Output table for Q1")
    print(output.sort_values("precision"))

    scores = {"macro precision": macro_prec, "macro recall": macro_recall, "macro f1": macro_f1, "micro precision": micro_prec, "micro recall": micro_recall, "micro f1": micro_f1, "weighted average precision": wa_prec, "weighted average recall": wa_recall, "weighted average f1": wa_f1}
    return (scores, output)

#testing
#evaluate(result, true_labels)

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)
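As a toy illustration of why the choice of averaging matters, the sketch below computes macro- and micro-averaged precision on a tiny, made-up set of labels (not assignment data). The function name `macro_micro_precision` is hypothetical and is not one of the required functions. Like `evaluate`, it only loops over labels that occur in the ground truth.

```python
from collections import Counter

def macro_micro_precision(prediction, ground_truth):
    """Return (macro, micro) precision for two equal-length label lists."""
    labels = sorted(set(ground_truth))
    tp = Counter()  # true positives per predicted label
    fp = Counter()  # false positives per predicted label
    for pred, truth in zip(prediction, ground_truth):
        if pred == truth:
            tp[pred] += 1
        else:
            fp[pred] += 1
    # Macro: average the per-class precision, so every class counts equally
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] > 0 else 0.0 for c in labels]
    macro = sum(per_class) / len(labels)
    # Micro: pool the counts over all classes before dividing
    total_tp = sum(tp[c] for c in labels)
    total_fp = sum(fp[c] for c in labels)
    micro = total_tp / (total_tp + total_fp) if total_tp + total_fp > 0 else 0.0
    return macro, micro

# Made-up, imbalanced example: the rare class is never predicted
toy_truth = ["tree", "tree", "tree", "tree", "bridge"]
toy_pred = ["tree", "tree", "tree", "tree", "tree"]
toy_macro, toy_micro = macro_micro_precision(toy_pred, toy_truth)
print(toy_macro, toy_micro)  # macro = 0.4, micro = 0.8
```

The macro score is pulled down by the class that is never predicted, while the micro score is dominated by the frequent class.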

In [None]:
# Running naive bayes

train_df = preprocess("COMP30027_2021_assignment1_data/train.csv")
test_df = preprocess("COMP30027_2021_assignment1_data/test.csv")

result = predict(train_df, test_df)
true_labels = list(test_df.label)
overall, indiv = evaluate(result, true_labels)
overall

### Q2
The Gaussian naive Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.
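Histograms can be complemented with a numeric check. The sketch below is one possible approach, not part of the required implementation: it reports skewness, excess kurtosis and the Shapiro-Wilk p-value for a single feature within a single class. The helper name `gaussian_check` is hypothetical, and the samples are synthetic stand-ins rather than assignment data.

```python
import numpy as np
import scipy.stats

def gaussian_check(values):
    """Summarise how far a 1-D sample is from a normal distribution."""
    sample = np.asarray(values, dtype=float)
    sample = sample[~np.isnan(sample)]        # ignore missing keypoints
    _, p_value = scipy.stats.shapiro(sample)  # null hypothesis: the sample is normal
    return {
        "skew": float(scipy.stats.skew(sample)),
        "excess_kurtosis": float(scipy.stats.kurtosis(sample)),
        "shapiro_p": float(p_value),
    }

rng = np.random.default_rng(0)
# Synthetic stand-ins, NOT assignment data
skewed_report = gaussian_check(rng.exponential(scale=10.0, size=500))
symmetric_report = gaussian_check(rng.normal(loc=0.0, scale=10.0, size=500))
print(skewed_report)
print(symmetric_report)
```

On real data the same call could be made with, for example, `target[col_x]` from the plotting cell. A small p-value suggests the Gaussian assumption is questionable for that feature and class, although with many instances even mild deviations become statistically significant.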

In [None]:
print("Output for Q2")
fig, axs = plt.subplots(22, 3, figsize=(20, 100))

n_bins = 20
train_data = preprocess(TRAIN_LOCATION)
classes = ["tree", "bridge"]
colours = ["royalblue", "maroon"]

for idx, target_class in enumerate(classes):
    target = train_data.groupby("label").get_group(target_class) 
    
    for col_x, col_y, ax in zip(train_data.columns[1:12], train_data.columns[12:], axs[idx*11:]):
        ax[0].hist(x=target[col_x].dropna(), bins = n_bins, color = colours[idx], ec=colours[idx])
        ax[0].set_title("{} Histogram for class {}".format(col_x, target_class))
        ax[1].hist(x=target[col_y].dropna(), bins = n_bins, color = colours[idx], ec=colours[idx])
        ax[1].set_title("{} Histogram for class {}".format(col_y, target_class))
        ax[2].scatter(x=target[col_x], y=target[col_y], color = colours[idx], ec=colours[idx])
        ax[2].set_title("({}, {}) Scatter Plot for class {}".format(col_x, col_y,  target_class))
    

### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)
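A minimal sketch of the per-feature likelihood a KDE naive Bayes could use, assuming a Gaussian kernel: the density at a test value is the average of Gaussian bumps centred on the training values of that feature within one class. The name `kde_log_likelihood` and the default bandwidth of 10 are illustrative assumptions. The zero-density floor mirrors the epsilon handling in `predict`.

```python
import numpy as np

def kde_log_likelihood(value, train_values, bandwidth=10.0):
    """Log of the Gaussian-kernel density estimate of `value`.

    `train_values` holds one feature's training values for a single class.
    """
    points = np.asarray(train_values, dtype=float)
    points = points[~np.isnan(points)]  # drop missing training values
    if points.size == 0 or np.isnan(value):
        return 0.0  # adds nothing to the log score, like a skipped missing value
    z = (value - points) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    density = kernels.mean()  # average of one kernel per training point
    # floor the density so that the log is always defined
    return float(np.log(max(density, np.finfo(float).eps)))

# Made-up values: a test point near the training data scores higher than one far away
near = kde_log_likelihood(0.0, [0.0, 100.0], bandwidth=10.0)
far = kde_log_likelihood(50.0, [0.0, 100.0], bandwidth=10.0)
print(near > far)
```

Summing this quantity over features and adding the log prior would give the same kind of per-label score that `predict` maximises.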

### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.
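One way to organise the search is k-fold cross-validation over a list of candidate bandwidths. The sketch below is deliberately generic: `choose_bandwidth` and `score_fn` are hypothetical names, and `score_fn(train_idx, val_idx, bandwidth)` is assumed to train on the rows in `train_idx` and return a score (for example accuracy) on the rows in `val_idx`. The placeholder scorer only exists to make the example runnable.

```python
import numpy as np

def choose_bandwidth(candidates, n_items, score_fn, k=5, seed=0):
    """Return (best_bandwidth, mean_scores) using k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # shuffle row positions once, then split them into k roughly equal folds
    folds = np.array_split(rng.permutation(n_items), k)
    mean_scores = {}
    for bandwidth in candidates:
        fold_scores = []
        for i, val_idx in enumerate(folds):
            # train on every fold except the one held out for validation
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            fold_scores.append(score_fn(train_idx, val_idx, bandwidth))
        mean_scores[bandwidth] = float(np.mean(fold_scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def placeholder_score(train_idx, val_idx, bandwidth):
    """Stand-in scorer that peaks at bandwidth 10; replace with a real evaluation."""
    return -abs(bandwidth - 10)

best_bandwidth, cv_scores = choose_bandwidth([5, 10, 15, 20, 25], 23, placeholder_score)
print(best_bandwidth)
```

With a real scorer, `train_idx` and `val_idx` could be used with `train_df.iloc[...]` so that the test set stays untouched while the bandwidth is tuned.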

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.

In [None]:
def calculate_missing(train: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame (rows: labels, columns: features) holding the fraction of
    training instances of each label for which the feature is missing.
    """
    # isna() marks missing entries; the per-label mean of that mask is the missing fraction.
    # Uses the `train` argument rather than the global train_df.
    missing_mask = train.drop(columns="label").isna()
    missing_probabilities = missing_mask.groupby(train["label"]).mean()

    return missing_probabilities


In [None]:
print("output for Q5")
# run evaluation of predict() with missing value information incorporated 
result = predict(train_df, test_df, use_missing=True)
true_labels = list(test_df.label)
overall_with_missing, indiv_with_missing = evaluate(result, true_labels)
overall_with_missing

In [None]:
# Change in classification results
(indiv_with_missing - indiv)

In [None]:
# Find the fraction of missing values in each pose
train_grouped = train_df.groupby("label")
test_grouped = test_df.groupby("label")
train_missing_ratio = train_grouped.count().rsub(train_grouped.size(), axis=0).sum(axis=1) / (train_grouped.size() * len(train_df.columns[1:]))
test_missing_ratio = test_grouped.count().rsub(test_grouped.size(), axis=0).sum(axis=1) / (test_grouped.size() * len(test_df.columns[1:]))
summary = pd.DataFrame({"train":train_missing_ratio, "test":test_missing_ratio, "# of train": train_grouped.size(), "# of test": test_grouped.size()})
summary


### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.
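As a starting point, distances and angles between pairs of keypoints can be derived directly from the `x1..x11` and `y1..y11` columns. The sketch below uses hypothetical helper names, and the keypoint indices 1 and 2 are arbitrary: which body part each index denotes is not stated in this notebook. Missing keypoints propagate as NaN, so the existing missing-value handling still applies.

```python
import numpy as np
import pandas as pd

def keypoint_distance(data, a, b):
    """Euclidean distance between keypoints a and b for every row."""
    dx = data[f"x{b}"] - data[f"x{a}"]
    dy = data[f"y{b}"] - data[f"y{a}"]
    return np.hypot(dx, dy)

def segment_angle(data, a, b):
    """Angle in degrees of the segment from keypoint a to keypoint b, measured from the x axis."""
    dx = data[f"x{b}"] - data[f"x{a}"]
    dy = data[f"y{b}"] - data[f"y{a}"]
    return np.degrees(np.arctan2(dy, dx))

# Made-up coordinates, NOT assignment data; the second row has a missing keypoint
toy = pd.DataFrame({
    "x1": [0.0, 0.0], "y1": [0.0, 0.0],
    "x2": [3.0, np.nan], "y2": [4.0, 1.0],
})
toy_distance = keypoint_distance(toy, 1, 2)
toy_angle = segment_angle(toy, 1, 2)
print(toy_distance.iloc[0])  # 5.0 for the 3-4-5 triangle in the first row
```

Angles do not change when the whole pose is shifted or uniformly scaled, which is one possible rationale for preferring them to raw (x,y) values. Distances are shift-invariant but still depend on scale.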