###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marks will be awarded for the four functions defined in this notebook and for your responses to the questions at the end of the notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [1]:
import csv
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
pd.options.mode.chained_assignment = None

In [2]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess(train_csv, test_csv):

    # Load in the data
    train_df = pd.read_csv(train_csv, header = None)
    test_df = pd.read_csv(test_csv, header = None)
    
    # Replace missing values with NaN
    train_df.replace(float('9999'), float('NaN'), inplace = True)
    test_df.replace(float('9999'), float('NaN'), inplace = True)   
    
    # Obtaining a list of classes
    labels = train_df[0].unique()

    return [train_df, test_df, labels]

In [3]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

# Calculates the Gaussian probability density function at x
def gaussian_pdf(x, mean, stdev):
    exponent = math.exp((-1/2)*((x-mean)/stdev)**2)
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent


# Calculates the priors for each class
def calculate_priors(df, labels):
    total_instances = len(df)
    priors = {}

    for label in labels:
        df_by_label = df.loc[df[0] == label]
        # Calculate prior
        priors[label] = len(df_by_label)/total_instances

    return priors

# Calculates the mean and standard deviation of each class x feature
def calculate_parameters(df, labels):
    parameters = {}

    for label in labels:
        feature_params = []
        df_by_label = df.loc[df[0] == label]

        for i in range(1,23):
            mean = df_by_label[i].mean()
            stdev = df_by_label[i].std()
    
            feature_params.append([mean, stdev])
        
        parameters[label] = feature_params
    
    return parameters
    
def train(train_df, labels):
    
    priors = calculate_priors(train_df, labels)
    parameters = calculate_parameters(train_df, labels)
    
    return [priors, parameters]


In [4]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

def predict(test_df, labels, priors, parameters):
    predicted_labels = []
    
    # For each instance in our test data
    for index, row in test_df.iterrows():
        probabilities = {}
        
        for label in labels:
            # Prior
            label_prob = math.log(priors[label])
            for i in range(1,23):
                if not np.isnan(row[i]):
                    # Mean and stdev
                    mean = parameters[label][i-1][0]
                    stdev = parameters[label][i-1][1]

                    # Calculate likelihood
                    if (gaussian_pdf(row[i], mean, stdev) != 0):
                        label_prob += math.log(gaussian_pdf(row[i], mean, stdev))
                    else:
                        label_prob += math.log(sys.float_info.epsilon)

            probabilities[label] = label_prob
        
        # Find maximum probability
        best_prob = -100
        best_label = None
        for label, prob in probabilities.items():
            if best_label is None or prob > best_prob:
                best_prob = prob
                best_label = label
        
        predicted_labels.append(best_label)
    
    return predicted_labels

In [5]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate(test_df, predicted_labels):
    correct = 0
    
    for i in range(len(test_df)):
        if predicted_labels[i] == test_df[0][i]:
            correct +=1
    
    return correct/len(test_df)


In [6]:
# Running our implementation of Gaussian Naive Bayes on the given training and test data sets

[train_df, test_df, labels] = preprocess('data/train.csv', 'data/test.csv')

[priors, parameters] = train(train_df, labels)

predicted_labels = predict(test_df, labels, priors, parameters)
accuracy = evaluate(test_df, predicted_labels)

print("---Naive Bayes Classifier---")
print(f'Accuracy: {round(accuracy*100, 2)}%')

# Writing to output.txt
original_stdout = sys.stdout

with open('output.txt', 'w') as f:
    sys.stdout = f
    print("---Naive Bayes Classifier---")
    print(f'Accuracy: {round(accuracy*100, 2)}%')
    sys.stdout = original_stdout

---Naive Bayes Classifier---
Accuracy: 71.55%
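
The `predict` cell above takes the log of `gaussian_pdf` and substitutes `sys.float_info.epsilon` whenever the density underflows to zero. One alternative, sketched here as an assumption rather than as part of the submitted method, is to compute the log-density directly so that no exponentiation is needed. The function name `gaussian_log_pdf` is hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log of the Gaussian density, computed without exponentiating first."""
    z = (x - mean) / stdev
    return -0.5 * z * z - math.log(stdev) - 0.5 * math.log(2 * math.pi)

# Far in the tail the density itself underflows to 0.0,
# but the log-density is still an ordinary finite number.
tail_density = math.exp(-0.5 * 40.0 ** 2)
tail_log_density = gaussian_log_pdf(40.0, 0.0, 1.0)
print(tail_density, tail_log_density)
```

With this form the `!= 0` check and the epsilon fallback would no longer be needed, although a zero or NaN standard deviation would still have to be handled separately.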


## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer should be submitted separately as a PDF.

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)

In [7]:
def compute_precision(class_labels, predicted_labels, label):
    tp = 0
    fp = 0
    
    for i in range(len(class_labels)):
        if predicted_labels[i] == label:
            if predicted_labels[i] == class_labels[i]:
                tp += 1
            else:
                fp += 1
    return tp/(tp+fp)

def compute_recall(class_labels, predicted_labels, label):
    tp = 0
    fn = 0
    
    for i in range(len(class_labels)):
        # Only instances that truly belong to this class count towards recall
        if class_labels[i] == label:
            if predicted_labels[i] == label:
                tp += 1
            else:
                # An instance of this class that was predicted as something else
                fn += 1
    return tp/(tp+fn)

def compute_f_score(precision, recall):
    return 2*precision*recall/(precision+recall)

# Macro Averaging
def compute_macro_average(test_df, predicted_labels, labels):
    test_df['Predicted'] = predicted_labels
    
    precision = 0
    recall = 0
    for label in labels:
        df_by_label = test_df.loc[test_df[0] == label]
        class_labels = list(df_by_label[0])
        predicted_by_label = list(df_by_label['Predicted'])
        precision += compute_precision(class_labels, predicted_by_label, label)
        recall += compute_recall(class_labels, predicted_by_label, label)
    
    c = len(labels)
    return [precision/c, recall/c]

compute_macro_average(test_df, predicted_labels, labels)

# Micro Averaging
def compute_micro_average(df, predicted_labels, labels):
    df['Predicted'] = predicted_labels
    
    tp = 0
    fp = 0
    fn = 0
    for label in labels:
        df_by_label = df.loc[df[0] == label]
        class_labels = list(df_by_label[0])
        predicted_by_label = list(df_by_label['Predicted'])

        for i in range(len(df_by_label)):
            if predicted_by_label[i] == label:
                if predicted_by_label[i] == class_labels[i]:
                    tp += 1
                else:
                    fp += 1
            else:
                if predicted_by_label[i] != class_labels[i]:
                    fn += 1

    return [tp/(tp+fp), tp/(tp+fn)]

macro = compute_macro_average(test_df, predicted_labels, labels)
micro = compute_micro_average(test_df, predicted_labels, labels)

print("---Macro-Averaging---")
print(f'Precision: {macro[0]:.4f}, Recall: {macro[1]:.4f}, F-Score: {compute_f_score(macro[0], macro[1]):.4f}')
print("---Micro-Averaging---")
print(f'Precision: {micro[0]:.4f}, Recall: {micro[1]:.4f}, F-Score: {compute_f_score(micro[0], micro[1]):.4f}')

# Writing to output.txt
with open('output.txt', 'a') as f:
    sys.stdout = f 
    print("\nQUESTION 1")
    print("---Macro-Averaging---")
    print(f'Precision: {macro[0]:.4f}, Recall: {macro[1]:.4f}, F-Score: {compute_f_score(macro[0], macro[1]):.4f}')
    print("---Micro-Averaging---")
    print(f'Precision: {micro[0]:.4f}, Recall: {micro[1]:.4f}, F-Score: {compute_f_score(micro[0], micro[1]):.4f}')
    sys.stdout = original_stdout

---Macro-Averaging---
Precision: 1.0000, Recall: 0.7078, F-Score: 0.8289
---Micro-Averaging---
Precision: 1.0000, Recall: 0.7155, F-Score: 0.8342
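
A precision of exactly 1.0 is what one would expect when each class's counts are taken only from rows whose true label is that class, because a false positive (a row of another class predicted as this one) can never appear in such a subset. The following is a minimal sketch, with hypothetical function names and toy labels, of macro- and micro-averaging where the counts are taken over the full list of instances instead. It is an illustration under those assumptions, not a replacement for the cell above.

```python
def per_class_counts(true_labels, predicted_labels, label):
    """Count TP, FP and FN for one class over all instances."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == label and truth == label:
            tp += 1
        elif pred == label:
            fp += 1      # predicted as this class, but belongs to another
        elif truth == label:
            fn += 1      # belongs to this class, but predicted as another
    return tp, fp, fn

def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a class is never predicted or never occurs."""
    return numerator / denominator if denominator else 0.0

def macro_average(true_labels, predicted_labels, labels):
    """Average the per-class precision and recall, weighting classes equally."""
    precisions, recalls = [], []
    for label in labels:
        tp, fp, fn = per_class_counts(true_labels, predicted_labels, label)
        precisions.append(safe_div(tp, tp + fp))
        recalls.append(safe_div(tp, tp + fn))
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

def micro_average(true_labels, predicted_labels, labels):
    """Pool the counts across classes first, then compute precision and recall."""
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp, fp, fn = per_class_counts(true_labels, predicted_labels, label)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return safe_div(tp_sum, tp_sum + fp_sum), safe_div(tp_sum, tp_sum + fn_sum)

# Toy labels, invented for illustration only
truth = ['a', 'a', 'a', 'b', 'b', 'c']
pred = ['a', 'a', 'b', 'b', 'c', 'c']
classes = ['a', 'b', 'c']
macro_toy = macro_average(truth, pred, classes)
micro_toy = micro_average(truth, pred, classes)
print(macro_toy, micro_toy)
```

In a single-label multiclass problem every wrong prediction is one false positive for the predicted class and one false negative for the true class, so micro-averaged precision and recall both equal accuracy; macro-averaging is where per-class differences show up.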


### Q2
The Gaussian naive Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.

In [8]:
import scipy.stats as stats

# Discretises values into equal-width bins
def discretise(df, n):
    df_discrete = df.copy()
    
    column_max = []
    column_min = []
    for i in range(1,len(df.columns)):
        colmax = math.ceil(df[i].max())
        colmin = math.floor(df[i].min())
        
        step_length = math.floor((colmax - colmin)/n)
        step_intervals = list(range(colmin, colmax, step_length))
        step_intervals.pop(-1)
        step_intervals.append(colmax)
        
        df_discrete[i] = np.digitize(df[i], step_intervals)
    
    return df_discrete

N = 15  # Arbitrary choice of number of bins

[train_df, test_df, labels] = preprocess('data/train.csv', 'data/test.csv')
bins_df = discretise(train_df, N)

x = list(range(1,N+1))

# Plotting each class x feature
for label in labels:

    df_by_class = bins_df.loc[train_df[0] == label]
    df_by_class = df_by_class.loc[:, 1:]

    for i in range(0,22):
        values = list(df_by_class.iloc[:,0+i])
        
        values_freq = []
        for a in x:
            if a in values:
                values_freq.append(values.count(a))
            else:
                values_freq.append(0)
        
#         # This section was used to produce the bar charts for Gaussian assumption analysis
#         plt.figure()
        
#         plt.title(str(label) + " feature: " + str(i+1))
#         plt.xlabel("Bin Number")
#         plt.ylabel("Frequency")
#         plt.xticks(x)
#         plt.bar(x, values_freq)
       
#         plt.show()
        

new_train_df = train_df.copy()

# Removing feature 16 of the trianglepose class from our training dataset
# (a single .loc assignment avoids chained indexing, which may silently fail to write)
new_train_df.loc[new_train_df[0] == 'trianglepose', 16] = np.nan
    
[priors, parameters] = train(new_train_df, labels)
predicted_labels = predict(test_df, labels, priors, parameters)
accuracy_remove = evaluate(test_df, predicted_labels)

print("---Naive Bayes w/ Removal of a class x feature that violates Gaussian assumption---")
print(f'Accuracy: {round(accuracy_remove*100, 2)}%')

# Writing to output.txt
with open('output.txt', 'a') as f:
    sys.stdout = f 
    print("\nQUESTION 2")
    print("---Naive Bayes w/ Removal of a class x feature that violates Gaussian assumption---")
    print(f'Accuracy: {round(accuracy_remove*100, 2)}%')
    sys.stdout = original_stdout 

---Naive Bayes w/ Removal of a class x feature that violates Gaussian assumption---
Accuracy: 68.97%
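
Histograms like the ones produced above are judged by eye. As a complementary check, one could compute sample skewness and excess kurtosis for each class and feature, since both are zero for a Gaussian. The sketch below is an assumption about how that might be done, using simple (biased) moment estimators and a hypothetical function name; it does not say how large a deviation is needed before classification is affected.

```python
import math

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis, ignoring NaNs.

    Both are 0 for a perfectly Gaussian distribution. Returns (nan, nan)
    when there is no data or the values are all identical.
    """
    xs = [v for v in values if not math.isnan(v)]
    n = len(xs)
    if n == 0:
        return float('nan'), float('nan')
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    if m2 == 0:
        return float('nan'), float('nan')
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Toy inputs, invented for illustration only
symmetric = skewness_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = skewness_and_excess_kurtosis([0.0, 0.0, 0.0, 0.0, 10.0, float('nan')])
print(symmetric, right_skewed)
```

In the notebook this could be applied to each `df_by_label[i]` column; a clearly non-zero skewness, or a strongly bimodal histogram, would be candidates for the violations the question asks about.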


### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)

In [9]:
def KDE(train_df, test_df, labels, kernel_bandwidth):
    predicted_labels = []
    total_instances = len(train_df)

    # For each instance in test
    for index, row in test_df.iterrows():
        
        probabilities = {}

        for label in labels:
            df_by_label = train_df.loc[train_df[0] == label]
            
            # Calculate Prior
            label_prob = math.log(len(df_by_label)/total_instances)
            
            for i in range(1,23):
                if not np.isnan(row[i]):
                    x_test = row[i]
                    llh = kde_likelihood(x_test, df_by_label[i], kernel_bandwidth)
                    if llh == 0:
                        llh = sys.float_info.epsilon
                    label_prob += math.log(llh)

            probabilities[label] = label_prob
            
        # Find maximum probability
        best_prob = -1000
        best_label = None
        for label, prob in probabilities.items():
            if best_label is None or prob > best_prob:
                best_prob = prob
                best_label = label
        
        predicted_labels.append(best_label)
            
    return predicted_labels


def kde_likelihood(x_test, x_values, kernel_bandwidth):
    likelihood = 0

    for x_i in x_values:
        if not np.isnan(x_i):
            likelihood += gaussian_pdf(x_test-x_i, 0, kernel_bandwidth)
    
    return likelihood/len(x_values)
    

kernel_bandwidth = 5
predicted_values = KDE(train_df, test_df, labels, kernel_bandwidth)
accuracy_kde_5 = evaluate(test_df, predicted_values)

print("---Naive Bayes Classifier---")
print(f'Accuracy: {round(accuracy*100, 2)}%')
print("\n---KDE---")
print(f'Kernel Bandwidth: {kernel_bandwidth}')
print(f'Accuracy: {round(accuracy_kde_5*100, 2)}%')

kernel_bandwidth = 15
predicted_values = KDE(train_df, test_df, labels, kernel_bandwidth)
accuracy_kde_15 = evaluate(test_df, predicted_values)
print(f'\nKernel Bandwidth: {kernel_bandwidth}')
print(f'Accuracy: {round(accuracy_kde_15*100, 2)}%')

# Writing to output.txt
with open('output.txt', 'a') as f:
    sys.stdout = f
    print("\nQUESTION 3")
    print("---Naive Bayes Classifier---")
    print(f'Accuracy: {round(accuracy*100, 2)}%')
    print("\n---KDE---")
    # kernel_bandwidth holds 15 by this point, so each result is labelled explicitly
    print('Kernel Bandwidth: 5')
    print(f'Accuracy: {round(accuracy_kde_5*100, 2)}%')
    print('\nKernel Bandwidth: 15')
    print(f'Accuracy: {round(accuracy_kde_15*100, 2)}%')
    sys.stdout = original_stdout

---Naive Bayes Classifier---
Accuracy: 71.55%

---KDE---
Kernel Bandwidth: 5
Accuracy: 72.41%

Kernel Bandwidth: 15
Accuracy: 70.69%
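
Two details of `kde_likelihood` above may be worth examining: it skips NaN training values in the sum but still divides by `len(x_values)`, and it falls back to `sys.float_info.epsilon` when every kernel underflows. The sketch below shows one possible variant, under the assumption that the estimate should be normalised by the number of observed values and kept in log space. The function name is hypothetical, and using it would not necessarily reproduce the accuracies printed above.

```python
import math

def kde_log_likelihood(x_test, x_values, bandwidth):
    """Log of a Gaussian-kernel density estimate at x_test, ignoring NaNs.

    Returns None when the class has no observed values for this feature,
    so the caller can skip the feature.
    """
    observed = [x for x in x_values if not math.isnan(x)]
    if not observed:
        return None
    log_norm = -math.log(bandwidth) - 0.5 * math.log(2 * math.pi)
    log_kernels = [-0.5 * ((x_test - x_i) / bandwidth) ** 2 + log_norm
                   for x_i in observed]
    # log-sum-exp: factor out the largest term so exp() cannot underflow to 0
    top = max(log_kernels)
    log_sum = top + math.log(sum(math.exp(lk - top) for lk in log_kernels))
    return log_sum - math.log(len(observed))

# A missing training value does not dilute the estimate
at_centre = kde_log_likelihood(0.0, [0.0, float('nan')], 1.0)
# A test point very far from the data still gets a finite log-likelihood
far_away = kde_log_likelihood(1000.0, [0.0], 5.0)
print(at_centre, far_away)
```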


### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.
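
No implementation for this question is included in the notebook. As a rough illustration of the idea only, the sketch below runs k-fold cross-validation over a list of candidate bandwidths using a small, self-contained KDE naive Bayes written with plain Python lists. All function names, the toy data and the candidate values are assumptions made for the example; they are not taken from the assignment data.

```python
import math
import random

def kde_log_density(x, samples, h):
    """Log Gaussian-kernel density at x, using log-sum-exp for stability."""
    logs = [-0.5 * ((x - s) / h) ** 2 for s in samples]
    top = max(logs)
    total = sum(math.exp(value - top) for value in logs)
    return top + math.log(total) - math.log(len(samples) * h * math.sqrt(2 * math.pi))

def kde_nb_predict(train_X, train_y, row, h):
    """Predict the label of one row with a KDE naive Bayes model."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        score = math.log(len(rows) / len(train_y))      # log prior
        for j, value in enumerate(row):
            if math.isnan(value):
                continue                                  # ignore missing test values
            samples = [r[j] for r in rows if not math.isnan(r[j])]
            if samples:
                score += kde_log_density(value, samples, h)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate_bandwidth(X, y, candidates, k=5, seed=0):
    """Return (best bandwidth, {bandwidth: mean accuracy}); assumes len(X) >= k."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = {}
    for h in candidates:
        fold_accuracies = []
        for fold in folds:
            held_out = set(fold)
            train_X = [X[i] for i in indices if i not in held_out]
            train_y = [y[i] for i in indices if i not in held_out]
            correct = sum(kde_nb_predict(train_X, train_y, X[i], h) == y[i]
                          for i in fold)
            fold_accuracies.append(correct / len(fold))
        scores[h] = sum(fold_accuracies) / k
    return max(scores, key=scores.get), scores

# Toy data: two well-separated 2-D classes, invented for illustration only
rng = random.Random(1)
toy_X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
         + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(30)])
toy_y = ['a'] * 30 + ['b'] * 30
best_h, cv_scores = cross_validate_bandwidth(toy_X, toy_y, [0.5, 1, 2], k=5)
print(best_h, cv_scores)
```

The bandwidth is chosen using training folds only, so the test set stays untouched until the final evaluation; with the pose data the candidates would more plausibly span the recommended 5 to 25 range.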

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.

In [10]:
def predict_use_NaNs(test_df, train_df, labels, priors, parameters):
    predicted_labels = []
    
    # For each instance in our test data
    for index, row in test_df.iterrows():
        probabilities = {}
        
        for label in labels:
            df_by_label = train_df.loc[train_df[0] == label]
            label_prob = math.log(priors[label])
            
            for i in range(1,23):
                if not np.isnan(row[i]):
                    mean = parameters[label][i-1][0]
                    stdev = parameters[label][i-1][1]

                    # Likelihood
                    if (gaussian_pdf(row[i], mean, stdev) != 0):
                        label_prob += math.log(gaussian_pdf(row[i], mean, stdev))
                    else:
                        label_prob += math.log(sys.float_info.epsilon)

                else:
                    nan_count = df_by_label[i].isna().sum()
                    column_len = len(df_by_label[i])
                    if nan_count != 0:
                        label_prob += math.log(nan_count/column_len)
                    else:
                        label_prob += math.log(sys.float_info.epsilon)

            probabilities[label] = label_prob

        # Find maximum probability
        best_prob = -100
        best_label = None
        for label, prob in probabilities.items():
            if best_label is None or prob > best_prob:
                best_prob = prob
                best_label = label
        
        predicted_labels.append(best_label)
    
    return predicted_labels

predicted_labels = predict_use_NaNs(test_df, train_df, labels, priors, parameters)
accuracy_nan = evaluate(test_df, predicted_labels)

print("---Naive Bayes Classifier (Ignoring NaNs)---")
print(f'Accuracy: {round(accuracy*100, 2)}%')
print("\n---Naive Bayes w/ NaNs---")
print(f'Accuracy: {round(accuracy_nan*100, 2)}%')

# Writing to output.txt
with open('output.txt', 'a') as f:
    sys.stdout = f 
    print("\nQUESTION 5")
    print("---Naive Bayes Classifier (Ignoring NaNs)---")
    print(f'Accuracy: {round(accuracy*100, 2)}%')
    print("\n---Naive Bayes w/ NaNs---")
    print(f'Accuracy: {round(accuracy_nan*100, 2)}%')
    sys.stdout = original_stdout

---Naive Bayes Classifier (Ignoring NaNs)---
Accuracy: 71.55%

---Naive Bayes w/ NaNs---
Accuracy: 68.1%
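
In `predict_use_NaNs` above, a class that never has a given feature missing in training is penalised with `math.log(sys.float_info.epsilon)`, roughly -36, each time that feature is missing in a test row, which can outweigh all of the other features combined. A gentler option, sketched here as an assumption and not as the method used above, is to treat missingness as a Bernoulli variable per class and feature and to apply add-alpha (Laplace) smoothing. The function name and the `alpha` parameter are hypothetical.

```python
import math

def missing_log_prob(nan_count, total, is_missing, alpha=1.0):
    """Smoothed log P(missing) or log P(present) for one class and feature.

    nan_count: missing training values for this class and feature
    total:     training instances of this class
    alpha:     pseudo-count added to both outcomes (missing / present)
    """
    p_missing = (nan_count + alpha) / (total + 2 * alpha)
    return math.log(p_missing if is_missing else 1.0 - p_missing)

# A class with 10 training rows and no missing values for the feature
never_missing = missing_log_prob(0, 10, is_missing=True)
usually_present = missing_log_prob(0, 10, is_missing=False)
print(never_missing, usually_present)
```

Scoring the "present" outcome as well keeps the missingness term a proper probability over both cases; whether either change helps on this dataset would need to be checked against the accuracies reported above.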


### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.
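
As a starting point for the kinds of features the question mentions, the sketch below computes a distance between two keypoints and the angle at a joint formed by three keypoints. The function names and the shoulder/elbow/wrist example are hypothetical; which columns of the dataset correspond to which body part is not assumed here and would need to be taken from the assignment specification.

```python
import math

def keypoint_distance(p, q):
    """Euclidean distance between two (x, y) keypoints; NaN if either is missing."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def joint_angle(a, b, c):
    """Angle at b, in degrees within [0, 180], between segments b->a and b->c.

    Returns NaN if any coordinate is NaN, so missing keypoints propagate.
    """
    angle_ba = math.atan2(a[1] - b[1], a[0] - b[0])
    angle_bc = math.atan2(c[1] - b[1], c[0] - b[0])
    diff = abs(math.degrees(angle_ba - angle_bc)) % 360
    return 360 - diff if diff > 180 else diff

# Hypothetical keypoints forming a right angle at the elbow
shoulder, elbow, wrist = (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
forearm_length = keypoint_distance(elbow, wrist)
print(elbow_angle, forearm_length)
```

Angles are unchanged by translating or uniformly scaling the figure, which is one possible rationale for preferring them to raw (x, y) positions; distances would need normalising, for example by torso length, to gain the same property.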