### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes


**Student ID(s):**     1039169, 1044793


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [1]:
# Load library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math 
%matplotlib inline

In [2]:
# This function is used to read data from the csv file
def read_data():
    # Paths to the test and training CSV files
    test_csv = "~/pose-classification-with-naive-bayes/COMP30027_2021_assignment1_data/test.csv"
    train_csv = "~/pose-classification-with-naive-bayes/COMP30027_2021_assignment1_data/train.csv"
    # Load Dataset 
    test_df = pd.read_csv(test_csv, header = None)
    train_df = pd.read_csv(train_csv, header = None)

    # Duplicate Dataset
    new_test_df = test_df.copy()
    new_train_df = train_df.copy()
    return new_test_df, new_train_df

In [3]:
def replace_missing_value(df):
    """Replace the missing-value sentinel (9999) with NaN in every feature column"""
    for column in df.columns[1:]:
        # Mark missing values (9999) as NaN so that later statistics skip them
        df[column] = np.where(df[column] == 9999, np.nan, df[column])
    return df

In [4]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess():
    """ Returns the testing and training DataFrames, plus the features and targets for training and testing """
    new_test_df, new_train_df = read_data()
    # Replace Missing Value for both testing/training set
    new_test_df = replace_missing_value(new_test_df)
    new_train_df = replace_missing_value(new_train_df) 
    # Split the dataset for both training/testing into feature and target 
    feature_train = new_train_df.iloc[:,1:]
    y_train = new_train_df[0]
    
    feature_test = new_test_df.iloc[:,1:]
    y_test = new_test_df[0]
    
    return new_test_df, new_train_df, feature_train, y_train, feature_test, y_test
    

In [5]:
def find_prior_prob(y_train):
    """ Returns a DataFrame of the prior probability of each pose in log form """
    
    target_count = 0
    target_dict = {}
    target_dict['Pose'] = []
    target_dict['Prior'] = []
    for target in np.unique(y_train):
        target_dict['Pose'].append(target)
        target_count = sum(y_train == target)
        target_dict['Prior'].append(target_count / len(y_train))

    prior_df = pd.DataFrame(target_dict)
    prior_df.set_index(['Pose'], inplace = True)
    prior_df['Prior'] = np.log(prior_df['Prior'])
    return prior_df

In [6]:
def find_mean_std(new_train_df):
    """ Returns a DataFrame containing the mean and standard deviation of every feature for each pose """
    
    # Mean of each feature, grouped by pose
    mean_df = new_train_df.groupby([0]).mean()
    mean_df.index.names = ['Pose']
    
    # Standard deviation of each feature, grouped by pose
    std_df = new_train_df.groupby([0]).std()
    std_df.index.names = ['Pose']
    
    # After the merge, columns suffixed _x hold the means and columns suffixed _y hold the standard deviations
    mean_std_df = pd.merge(mean_df, std_df, on = 'Pose', how = 'left')
    
    return mean_std_df

In [7]:
# Gaussian function
import math
def gaussian_pdf(x, mean, std):
    """Return the Gaussian probability density at x for the given mean and standard deviation"""
    return (1/(std * math.sqrt(2*math.pi))*math.exp(-(1/2) * ((x - mean) / std)**2))

When predicting, the class label is unknown, so it cannot be used to pick a single likelihood. Instead, we compute the posterior probability of every class label and predict the label with the largest posterior (the MAP hypothesis). Reference: https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naïve_Bayes
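The MAP rule can be sketched on made-up numbers as follows. This is a minimal illustration rather than the notebook's implementation: `gaussian_log_pdf`, `map_predict`, and the toy `params` and `log_priors` dictionaries are hypothetical names, and missing values are represented by `None` for simplicity. It evaluates the Gaussian log-density directly, which avoids the underflow to zero that `math.exp` can produce for points far from the mean.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log of the Gaussian density at x, computed without exponentiating."""
    return -math.log(std * math.sqrt(2 * math.pi)) - 0.5 * ((x - mean) / std) ** 2

def map_predict(sample, params, log_priors):
    """Return the class with the largest log-prior plus summed log-likelihood.

    sample     : list of feature values (None marks a missing value)
    params     : {class: [(mean, std), ...]}, one pair per feature (toy values)
    log_priors : {class: log P(class)}
    """
    scores = {}
    for label, feature_params in params.items():
        score = log_priors[label]
        for x, (mean, std) in zip(sample, feature_params):
            if x is None:
                continue  # a missing feature contributes nothing to the score
            score += gaussian_log_pdf(x, mean, std)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy two-class, two-feature model (values invented for illustration only)
params = {"bridge": [(0.0, 1.0), (5.0, 2.0)], "child": [(3.0, 1.0), (-1.0, 2.0)]}
log_priors = {"bridge": math.log(0.5), "child": math.log(0.5)}
label, scores = map_predict([0.2, None], params, log_priors)
print(label)
```

Because only the ranking of the scores matters, the evidence term shared by all classes never needs to be computed.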

In [8]:
def find_likelihood(feature_train, mean_std_df):
    """ Returns a DataFrame of log-likelihoods of all 10 poses for every feature value of every sample """
    
    # likelihood_dict stores, for every sample, the log-likelihood of each feature value under each pose
    likelihood_dict = {}
    x = 0
    while x < (len(feature_train)): # Go through each instance
        y = 0
        likelihood_dict['Sample_' + str(x+1)] = []
        while y < (len(feature_train.columns)): # Go through each feature
            pose = 0
            while pose < len(mean_std_df): # Go through each unique pose
                if np.isnan((feature_train.iloc[x, y])):
                    # Missing value: use a constant that is identical for every pose, so it does not change the ranking
                    likelihood = np.finfo(float).eps

                else:
                    # Get the likelihood of each data point using the Gaussian density;
                    # the standard deviation columns follow the mean columns in mean_std_df
                    likelihood = gaussian_pdf(feature_train.iloc[x, y], mean_std_df.iloc[pose, y], mean_std_df.iloc[pose, y + len(feature_train.columns)])

                if (likelihood == 0): # The density underflowed to zero
                    likelihood = np.finfo(float).eps # epsilon so that the logarithm is defined
                
                #Get the Log-likelihood
                likelihood = math.log(likelihood)
                likelihood_dict['Sample_' + str(x+1)].append(likelihood)
                pose += 1
            y += 1
        x += 1
    return pd.DataFrame(likelihood_dict)
    
    


In [9]:
def find_pose_likelihood(feature_train, mean_std_df, likelihood_df):
    """ Returns a DataFrame of log-likelihood of all poses for every sample """
    
    # pose_likelihood stores the likelihood of all 10 poses for every SAMPLE
    pose_likelihood = pd.DataFrame(mean_std_df.index)
    
    y = 0
    num_pose = len(mean_std_df)
    while y < len(likelihood_df.columns): # Go through each sample (one column per sample)
        i = 0
        lst = []
        while i < num_pose: # Go through each unique pose
            sum_likelihood_pose = 0
            x = i
            while x < len(likelihood_df):
                # Sum this pose's log-likelihoods over all features of the sample
                sum_likelihood_pose += likelihood_df.iloc[x, y]
                x += num_pose # rows are ordered feature by feature, with one row per pose
                
            lst.append(sum_likelihood_pose)
            i += 1
        
        # Insert the summed log-likelihood of each unique pose for this sample into the DataFrame
        pose_likelihood['Sample_' + str(y+1)] = lst
        y += 1
    
    pose_likelihood.set_index(['Pose'], inplace = True)
    return pose_likelihood

In [10]:
def find_posterior(pose_likelihood, prior_df):
    """ Returns the unnormalised log-posterior of every pose for every sample """
    
    posterior_df = pose_likelihood.copy()
    # Add the log-prior of each pose to its summed log-likelihood
    for pose in posterior_df.index:
        prior = prior_df.loc[pose, 'Prior']
        posterior_df.loc[pose] +=  float(prior)
        
    posterior = posterior_df.T

    posterior = posterior.reset_index()
    posterior.index.name = ''
    posterior.columns.name = ''

    posterior.rename(columns={'index':'Sample'}, inplace=True)
    # Make 'Sample' column as index
    posterior.set_index(['Sample'], inplace = True)
    
    return posterior, posterior_df

In [11]:
# This function should calculate prior probabilities and likelihoods from the training data and using
# them to build a naive Bayes model

def train(y_train, new_train_df, feature_train):
    # Get the Prior Probability for each pose
    prior_df = find_prior_prob(y_train)
    # Get the Mean and Standard deviation for each feature for each pose
    mean_std_df = find_mean_std(new_train_df)
    # Get the log-likelihood of every feature value under each pose
    likelihood_df = find_likelihood(feature_train, mean_std_df)
    # Sum the log-likelihoods over features for each pose and each sample
    pose_likelihood = find_pose_likelihood(feature_train, mean_std_df, likelihood_df)
    
    return pose_likelihood, prior_df

In [12]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

def predict(y_train, pose_likelihood, prior_df):
    
    posterior, posterior_df = find_posterior(pose_likelihood, prior_df)   # Now we have the posteriors
    # This is the maximum value for each row, we have to find the label for each of these values
    max_post = posterior.max(axis = 1)
    max_post = pd.DataFrame(max_post)
    max_post.rename(columns={0:'Max_Posterior'}, inplace=True)
    
    predict_dict = {}
    predict_dict['Sample'] = list(max_post.index)
    predicted_pose_list = []
    
    # Convert each row's maximum log-posterior into its pose name.
    # idxmax returns exactly one label per row, so ties cannot add extra entries.
    predicted_pose_list = list(posterior.idxmax(axis = 1))
    # Insert the predicted pose name into the dataframe
    predict_dict['Predicted_Pose'] = predicted_pose_list

    predict_df = pd.DataFrame(predict_dict)
    # Insert the true pose name into the dataframe for comparison purpose
    predict_df['True_Pose'] = y_train
    pd.set_option("display.max_rows", None, "display.max_columns", None)
    return predict_df # Predicted pose for each instance

In [13]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate(predict_df):
    num_test_instance = len(predict_df)
    predict_df['Correct_Label'] = predict_df['Predicted_Pose'] == predict_df['True_Pose']
    # Get the total number of correct label
    num_correct_label = len(predict_df[predict_df['Correct_Label'] == True])
    # Get the accuracy score
    accuracy_score = num_correct_label / num_test_instance
    
    return accuracy_score

In [14]:
# Preprocess 
new_test_df, new_train_df, feature_train, y_train, feature_test, y_test = preprocess()

# Train
pose_likelihood, prior_df = train(y_train, new_train_df, feature_train)

# Predict
predict_df = predict(y_train, pose_likelihood, prior_df)

# Evaluate
print(evaluate(predict_df))

0.7402945113788487


## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions to help respond to the questions here, but your formal answer should be submitted separately as a PDF.

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)

Precision = TP / (TP + FP).  
Recall = TP / (TP + FN). 
F1 = 2*Precision*Recall / (Precision + Recall)

In [16]:
tp_df = predict_df[predict_df['True_Pose'] == 'bridge'] # Actual Bridge
tp_fn_list = list(tp_df['Predicted_Pose'] == tp_df['True_Pose']) # Predicted Bridge and Actual Bridge is True
tp = 0
fn = 0
for i in tp_fn_list:
    if i == True:
        tp += 1
    else:
        fn += 1

In [17]:
print(str(tp) + '_' + str(fn))

17_64


In [18]:
fp_df = predict_df[predict_df['True_Pose'] != 'bridge'] # Actual Not Bridge
fp_list = list(fp_df['Predicted_Pose'] == 'bridge') # Predicted Bridge and Actual Not Bridge
fp = 0
for i in fp_list:
    if i == True:
        fp += 1
print(fp)

12


In [19]:
tn = len(predict_df) - tp - fn - fp
tn

654
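As a worked check of the formulas above, the counts printed for the bridge class (17 true positives, 64 false negatives, 12 false positives) can be plugged in directly. This is only an arithmetic illustration, not part of the required functions.

```python
# Counts for the 'bridge' class, taken from the outputs printed above
tp, fn, fp = 17, 64, 12

precision = tp / (tp + fp)   # 17 / 29
recall = tp / (tp + fn)      # 17 / 81
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.586 0.21 0.309
```

The low recall shows that most true bridge instances are assigned to other poses, even though the majority of bridge predictions are correct.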

In [20]:
confusion_dict = {}
for pose in list(predict_df['True_Pose'].unique()):
    confusion_list = []
    #confusion_dict[pose] = confusion_list
    tp_df = predict_df[predict_df['True_Pose'] == pose] # Instances whose true pose is this pose
    tp_fn_list = list(tp_df['Predicted_Pose'] == tp_df['True_Pose']) # True where the prediction is also this pose
    tp = 0
    fn = 0
    for x in tp_fn_list:
        if x == True:
            tp += 1
        else:
            fn += 1
    fp_df = predict_df[predict_df['True_Pose'] != pose] # Instances whose true pose is a different pose
    fp_list = list(fp_df['Predicted_Pose'] == pose) # True where this pose was predicted anyway
    fp = 0
    for i in fp_list:
        if i == True:
            fp += 1
    tn = len(predict_df) - tp - fn - fp
    
    precision = tp / (tp + fp)
    recall = tp / (tp + fn) 
    f1 = (2*precision*recall) / (precision + recall)
    #print('Precision : ' + str(precision))
    #print('recall : ' + str(recall))
    #print('f1 : ' + str(f1))
    
    confusion_list.append(tp)
    confusion_list.append(fn)
    confusion_list.append(fp)
    confusion_list.append(tn)
    confusion_list.append(precision)
    confusion_list.append(recall)
    confusion_list.append(f1)
    
    confusion_dict[pose] = confusion_list
    #print('True Positive : ' + str(tp))
    #print('False Negative : ' + str(fn))
    #print('False Positive : ' + str(fp))
    #print('True Negative : ' + str(tn))

In [21]:
# 1. Macro-averaging
sum_precision = 0
sum_recall = 0
for precision_pose in confusion_dict.keys():
    sum_precision += confusion_dict[precision_pose][4]
    sum_recall += confusion_dict[precision_pose][5]
    
macro_average_precision = sum_precision / len(confusion_dict.keys())
macro_average_recall = sum_recall / len(confusion_dict.keys())
#f1 = (2*macro_average_precision*macro_average_recall) / (macro_average_precision+ macro_average_recall)
print(macro_average_precision)
print(macro_average_recall)

0.7392575055101048
0.7156254548083576
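A macro-averaged F1 can be defined in two ways, and they generally give different numbers. The sketch below shows both on a toy dictionary laid out like `confusion_dict`; `toy_confusion` and its counts are invented for illustration and are not results from the assignment data.

```python
# Toy per-class metrics in the same layout as confusion_dict:
# [tp, fn, fp, tn, precision, recall, f1] (numbers invented for illustration)
toy_confusion = {
    "bridge": [2, 2, 1, 3, 2 / 3, 0.5, 4 / 7],
    "child": [3, 1, 2, 2, 0.6, 0.75, 2 / 3],
}

n_classes = len(toy_confusion)
macro_precision = sum(v[4] for v in toy_confusion.values()) / n_classes
macro_recall = sum(v[5] for v in toy_confusion.values()) / n_classes

# Convention 1: average the per-class F1 scores
macro_f1_of_classes = sum(v[6] for v in toy_confusion.values()) / n_classes
# Convention 2: F1 of the macro-averaged precision and recall
macro_f1_of_averages = (2 * macro_precision * macro_recall
                        / (macro_precision + macro_recall))
print(macro_f1_of_classes, macro_f1_of_averages)
```

Whichever convention is used, the report should state it, since the two values are not interchangeable.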


In [22]:
# 2. Micro-averaging
tp_sum = 0
fp_sum = 0
fn_sum = 0

for precision_pose in confusion_dict.keys():
    tp_sum += confusion_dict[precision_pose][0]
    fn_sum += confusion_dict[precision_pose][1]
    fp_sum += confusion_dict[precision_pose][2]

micro_average_precision = tp_sum / (tp_sum + fp_sum)
micro_average_recall = tp_sum / (tp_sum + fn_sum)
print(micro_average_precision)
print(micro_average_recall)

0.7402945113788487
0.7402945113788487
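Micro-averaged precision and recall are identical here, and both equal the accuracy printed by `evaluate`. In single-label multiclass classification every misclassified instance counts once as a false positive (for the predicted pose) and once as a false negative (for the true pose), so the pooled FP and FN totals are equal. A small sketch on invented labels, with `micro_scores` as a hypothetical helper:

```python
def micro_scores(true_labels, predicted_labels):
    """Micro-averaged precision and recall from pooled per-class counts."""
    classes = set(true_labels) | set(predicted_labels)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(true_labels, predicted_labels):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)

# Toy labels, invented for illustration
y_true = ["bridge", "child", "child", "downwarddog", "bridge"]
y_pred = ["bridge", "child", "downwarddog", "downwarddog", "child"]

precision, recall = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

Macro-averaging, by contrast, weights every pose equally, so poorly recognised minority poses pull the macro scores below the micro scores.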


### Q2
The Gaussian naive Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.

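No.

Evidence: many bridge instances are misclassified as downwarddog, and many seatedforwardbend instances are misclassified as child.

A Gaussian distribution is described by a mean μ and a standard deviation σ; the standard normal distribution is the special case with mean 0 and standard deviation 1.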

In [31]:
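# A normal distribution is also known as a Gaussian distribution
from scipy import stats

for i in feature_train.columns:
    x = feature_train[i].dropna()
    # kstest(x, 'norm') compares against the standard normal N(0, 1); it does not fit a mean or standard deviation
    print('feature ' + str(i) + ': ' + str(stats.kstest(x, 'norm')))
# Every p-value is far below 0.05, so each feature is rejected as standard normal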

feature 1: KstestResult(statistic=0.39424631630152945, pvalue=4.939960373172725e-83)
feature 2: KstestResult(statistic=0.38131250232818703, pvalue=1.151369046625854e-89)
feature 3: KstestResult(statistic=0.746019154009615, pvalue=2.1849940734e-313)
feature 4: KstestResult(statistic=0.6968736264662918, pvalue=9.661078928049572e-252)
feature 5: KstestResult(statistic=0.7200063665734007, pvalue=3.592892236987334e-289)
feature 6: KstestResult(statistic=0.6843096973515523, pvalue=5.614484308392732e-243)
feature 7: KstestResult(statistic=0.3704906967405037, pvalue=6.254988392017793e-81)
feature 8: KstestResult(statistic=0.7944992898607758, pvalue=0.0)
feature 9: KstestResult(statistic=0.74932673397712, pvalue=4.029804521855616e-304)
feature 10: KstestResult(statistic=0.8003199316315173, pvalue=0.0)
feature 11: KstestResult(statistic=0.7418340856173791, pvalue=1.491985106513682e-296)
feature 12: KstestResult(statistic=0.8563727592482392, pvalue=0.0)
feature 13: KstestResult(statistic=0.906158
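The test above pools all poses and compares against N(0, 1), whereas Gaussian naive Bayes only assumes that each feature is Gaussian within each pose, with that pose's own mean and standard deviation. A sketch of a per-class check is given below, using only the standard library; `ks_statistic_fitted` is a hypothetical helper and the data are synthetic stand-ins, not the assignment data. Note that once the parameters are fitted from the same sample, the usual KS p-values no longer apply, so only the statistic is reported.

```python
import math
import random
import statistics

def normal_cdf(x, mean, std):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

def ks_statistic_fitted(values):
    """KS distance between the sample and a Gaussian fitted to that sample."""
    xs = sorted(values)
    n = len(xs)
    mean, std = statistics.mean(xs), statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = normal_cdf(x, mean, std)
        # Compare the fitted CDF with the empirical CDF just before and at x
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

random.seed(0)
# Synthetic stand-ins for one feature within one pose
gaussian_like = [random.gauss(100, 30) for _ in range(500)]
bimodal = ([random.gauss(-50, 10) for _ in range(250)]
           + [random.gauss(50, 10) for _ in range(250)])

print(ks_statistic_fitted(gaussian_like))
print(ks_statistic_fitted(bimodal))
```

A clearly larger statistic for some pose and feature pair, such as the bimodal sample here, would point to a violation of the per-class Gaussian assumption.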

### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)

In [23]:
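from sklearn.neighbors import KernelDensity
# Exploratory fit only: one joint density over all features, using rows with no missing values.
# Note that bandwidth=0.2 is far below the 5-25 range recommended in the question.
X = feature_train.dropna()
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)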
#kde.score_samples(X)
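Since the question asks for a hand-written KDE, one possible shape for it is sketched below. This is an assumption-laden illustration, not the submitted solution: `kde_log_likelihood`, `kde_nb_predict`, and the toy data are hypothetical, and a Gaussian kernel with a bandwidth of 10 (inside the recommended 5-25 range) is assumed. Each class-conditional density is the average of one kernel per training value of that feature in that class.

```python
import math

def kde_log_likelihood(x, train_values, bandwidth):
    """Log of the Gaussian-kernel density estimate at x.

    Averages one Gaussian kernel per non-missing training value.
    """
    points = [v for v in train_values if v is not None and not math.isnan(v)]
    if not points:
        return 0.0  # no evidence for this feature: contribute nothing
    # Log of each kernel value, combined with log-sum-exp for stability
    logs = [-0.5 * ((x - v) / bandwidth) ** 2
            - math.log(bandwidth * math.sqrt(2 * math.pi)) for v in points]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(len(points))

def kde_nb_predict(sample, train_by_class, log_priors, bandwidth=10.0):
    """MAP class under a KDE naive Bayes model.

    train_by_class : {class: [feature_column, ...]}, where each feature_column
                     lists that feature's training values for the class
    """
    scores = {}
    for label, columns in train_by_class.items():
        score = log_priors[label]
        for x, column in zip(sample, columns):
            if x is None or math.isnan(x):
                continue  # skip missing test values
            score += kde_log_likelihood(x, column, bandwidth)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data: one feature, two poses (values invented for illustration)
train_by_class = {"bridge": [[-40.0, -35.0, 30.0, 38.0]],
                  "child": [[0.0, 2.0, -3.0, 1.0]]}
log_priors = {"bridge": math.log(0.5), "child": math.log(0.5)}
print(kde_nb_predict([34.0], train_by_class, log_priors))
```

The toy bridge feature is deliberately bimodal: a single Gaussian would place its peak near zero, where no bridge values lie, while the KDE follows both clusters.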

### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.
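One way to approach this is sketched below under stated assumptions. It scores each candidate bandwidth by the mean log-density of a random hold-out split for a single feature, which is a stand-in criterion; scoring by hold-out classification accuracy of the full classifier would follow the same pattern. `choose_bandwidth`, the candidate list, and the synthetic data are all hypothetical.

```python
import math
import random

def kde_log_density(x, points, bandwidth):
    """Log Gaussian-kernel density estimate at x from the given points."""
    logs = [-0.5 * ((x - p) / bandwidth) ** 2
            - math.log(bandwidth * math.sqrt(2 * math.pi)) for p in points]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(len(points))

def choose_bandwidth(values, candidates, holdout_fraction=0.3, seed=0):
    """Pick the candidate with the highest mean log-density on a random hold-out."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout, fit = shuffled[:n_holdout], shuffled[n_holdout:]
    scores = {}
    for h in candidates:
        scores[h] = sum(kde_log_density(x, fit, h) for x in holdout) / len(holdout)
    return max(scores, key=scores.get), scores

random.seed(1)
# Synthetic single-feature data (not the assignment data)
values = [random.gauss(0, 20) for _ in range(200)]
best, scores = choose_bandwidth(values, candidates=[1, 5, 10, 15, 20, 25])
print(best)
```

A single random split is noisy, so repeating the split or using k-fold cross-validation and averaging the scores would give a more stable choice.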

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.
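One possible method, sketched here as an assumption rather than a prescribed answer, is to treat "is this feature missing?" as an extra binary attribute per feature and add its log-probability to each pose's score. The helper names and toy rows are hypothetical, and `None` stands in for the NaN values used in the notebook.

```python
import math

def missing_log_likelihoods(train_rows, labels):
    """Per-class log-probability that each feature is missing or present.

    Uses add-one (Laplace) smoothing so no probability is exactly 0 or 1.
    Returns {class: [(log P(missing), log P(present)), ...]}.
    """
    table = {}
    for label in set(labels):
        rows = [r for r, l in zip(train_rows, labels) if l == label]
        per_feature = []
        for j in range(len(rows[0])):
            n_missing = sum(1 for r in rows if r[j] is None)
            p_missing = (n_missing + 1) / (len(rows) + 2)
            per_feature.append((math.log(p_missing), math.log(1 - p_missing)))
        table[label] = per_feature
    return table

def missing_score(sample, per_feature):
    """Sum of log-probabilities of the sample's missing/present pattern."""
    return sum(lm if x is None else lp for x, (lm, lp) in zip(sample, per_feature))

# Toy rows: feature 0 is often missing for 'bridge' (values invented)
train_rows = [[None, 1.0], [None, 2.0], [3.0, 1.5],
              [4.0, 2.5], [5.0, None], [4.5, 3.0]]
labels = ["bridge", "bridge", "bridge", "child", "child", "child"]

table = missing_log_likelihoods(train_rows, labels)
sample = [None, 2.0]
print(missing_score(sample, table["bridge"]), missing_score(sample, table["child"]))
```

These scores would be added to the Gaussian log-posterior of each pose; comparing accuracy with and without the extra term shows whether missingness is informative.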

### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.
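Two explainable features of the kind suggested above, a limb angle and a keypoint distance, could be computed as in the sketch below. The function names and the example coordinates are hypothetical; which keypoint pairs to use depends on the dataset's keypoint layout, which is not assumed here.

```python
import math

def limb_angle(x1, y1, x2, y2):
    """Angle in degrees (-180 to 180) of the segment from keypoint 1 to keypoint 2."""
    if any(math.isnan(v) for v in (x1, y1, x2, y2)):
        return math.nan  # either keypoint missing: the feature is missing too
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

def keypoint_distance(x1, y1, x2, y2):
    """Euclidean distance between two keypoints."""
    if any(math.isnan(v) for v in (x1, y1, x2, y2)):
        return math.nan
    return math.hypot(x2 - x1, y2 - y1)

# Hypothetical keypoints at (0, 0) and (30, 40)
print(limb_angle(0.0, 0.0, 30.0, 40.0))
print(keypoint_distance(0.0, 0.0, 30.0, 40.0))
```

Angles wrap around at ±180 degrees, so a Gaussian may fit them poorly for limbs that point close to that boundary; distances from overlapping keypoint pairs are also likely to be correlated, which matters for the conditional independence assumption.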