###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2023 Semester 1

## Assignment 1: Music genre classification with naive Bayes


**Student ID(s):**     `1174154`


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied to the four functions defined in this notebook and to your responses to the questions at the end of this notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [1]:
import pandas as pd
import numpy as np


In [2]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
def preprocess():
    # Load the test split (forward slashes keep the path portable across operating systems)
    test_data_df = pd.read_csv('COMP30027_2023_asst1_data/gztan_test.csv')
    
    # Load the training split
    train_data_df = pd.read_csv('COMP30027_2023_asst1_data/gztan_train.csv')
    
    return test_data_df, train_data_df

In [9]:
# This function should calculate prior probabilities and likelihoods from the training data and using
# them to build a naive Bayes model

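def train():
    # Only the training split is needed to fit the model
    _, train_data_df = preprocess()

    # Class priors P(label), estimated from label frequencies
    prior = calc_prior(train_data_df)

    # Per-feature likelihoods P(feature = value | label)
    likelihood = calc_likelihood(train_data_df)

    # Together, the priors and likelihoods make up the naive Bayes model
    return prior, likelihood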

train()

Likelihood of chroma_stft_var=0.0445552468299866 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0480097904801369 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0480712652206421 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0508087538182735 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0595614984631538 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0603355392813683 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0616306029260159 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0622770376503468 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0639619752764702 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0653657168149948 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0670696869492531 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0671127066016197 given label blues is 0.0000
Likelihood of chroma_stft_var=0.0674524903297424 given label blues is 0.0000

KeyboardInterrupt: 

In [3]:
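def calc_prior(data):
    # prior_prob[label] = P(label), the fraction of training items with that label
    prior_prob = {}

    labels = data['label']
    # Distinct labels and how many times each one occurs
    unique_labels, counts = np.unique(labels, return_counts=True)

    # Total number of training items
    n = counts.sum()

    # Priors are kept unrounded so that they sum to 1
    for label, count in zip(unique_labels, counts):
        prior_prob[label] = count / n

    return prior_prob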

In [8]:
def calc_likelihood(data):
    # likelihood[feature][label][value] = P(feature = value | label)
    likelihood = {}
    unique_labels = np.unique(data['label'])
    # Feature columns: skip the first column and the trailing 'label' column
    features_list = data.columns[1:-1]

    for feature in features_list:
        # Every distinct value observed for this feature in the training data
        feature_values = np.unique(data[feature])

        # Create a nested dictionary for each feature
        likelihood[feature] = {}

        for label in unique_labels:
            # Create a nested dictionary for each label
            likelihood[feature][label] = {}

            # Rows belonging to this class, selected once per label rather than once per value
            label_rows = data[data['label'] == label]
            total_count = len(label_rows)

            for value in feature_values:
                # Relative frequency of this exact value within the class
                count = (label_rows[feature] == value).sum()

                likelihood[feature][label][value] = count / total_count
                print(f"Likelihood of {feature}={value} given label {label} is {likelihood[feature][label][value]:.4f}")

    return likelihood
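
The features in this dataset are continuous (the cell output above shows values such as 0.0445552468299866), so most values are likely to occur only once and a count-based estimate is zero for nearly every (value, label) pair. One common alternative, and the model Q5 refers to as Gaussian naive Bayes, is to summarise each feature within each class by a mean and a standard deviation. The following is a minimal sketch under that assumption, written in plain Python on toy data; `fit_gaussian` and `gaussian_log_pdf` are hypothetical names, not part of the assignment template.

```python
import math
from collections import defaultdict

def fit_gaussian(rows, labels):
    """Estimate a (mean, std) pair per class and per feature.

    rows   -- list of feature vectors (lists of floats); toy input here
    labels -- list of class labels, one per row
    """
    # Group the feature vectors by class label
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    params = {}
    for label, class_rows in by_class.items():
        n = len(class_rows)
        stats = []
        for column in zip(*class_rows):  # iterate feature by feature
            mean = sum(column) / n
            # Population variance; the small floor avoids a zero standard deviation
            var = sum((x - mean) ** 2 for x in column) / n
            stats.append((mean, math.sqrt(max(var, 1e-9))))
        params[label] = stats
    return params

def gaussian_log_pdf(x, mean, std):
    """Log of the normal density N(x; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Toy data: two classes, two features (values are made up for illustration)
rows = [[1.0, 10.0], [3.0, 14.0], [7.0, 2.0], [9.0, 4.0]]
labels = ["pop", "pop", "classical", "classical"]
params = fit_gaussian(rows, labels)
```

Working with log densities rather than raw densities avoids numerical underflow when many per-feature likelihoods are multiplied together.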

In [None]:
# This function should predict classes for new items in a test dataset

def predict():
    return
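
One possible shape for the prediction step is sketched below, assuming a Gaussian naive Bayes model whose per-class parameters are already available as (mean, standard deviation) pairs. The names `log_posterior` and `predict_one`, and the toy parameter values, are assumptions made for illustration only.

```python
import math

def log_posterior(x, prior, class_params):
    """Unnormalised log posterior: log P(c) + sum_i log N(x_i; mean_i, std_i^2)."""
    total = math.log(prior)
    for value, (mean, std) in zip(x, class_params):
        total += -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)
    return total

def predict_one(x, priors, params):
    """Return the class with the highest log posterior for feature vector x."""
    return max(priors, key=lambda c: log_posterior(x, priors[c], params[c]))

# Hypothetical toy model: two classes, two features, (mean, std) per feature
priors = {"pop": 0.5, "classical": 0.5}
params = {
    "pop": [(2.0, 1.0), (12.0, 2.0)],
    "classical": [(8.0, 1.0), (3.0, 1.0)],
}
predictions = [predict_one(x, priors, params) for x in [[1.5, 11.0], [8.5, 2.0]]]
```

The normalising constant P(x) is the same for every class, so it can be left out when only the most probable class is needed.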

In [None]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate():
    return

## Task 1. Pop vs. classical music classification

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

### Q1
Compute and report the accuracy, precision, and recall of your model (treat "classical" as the "positive" class).
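
As a reminder of how the three metrics are defined, here is a small sketch that counts true positives, false positives, and false negatives with "classical" as the positive class. The function name `binary_metrics` and the label lists are made up for illustration and are not results from the model.

```python
def binary_metrics(y_true, y_pred, positive="classical"):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels, purely to exercise the function
y_true = ["classical", "classical", "pop", "pop", "classical"]
y_pred = ["classical", "pop", "pop", "classical", "classical"]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
```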

### Q2
For each of the features X below, plot the probability density functions P(X|Class = pop) and P(X|Class = classical). If you had to classify pop vs. classical music using just one of these three features, which feature would you use and why? Refer to your plots to support your answer.
- spectral centroid mean
- harmony mean
- tempo
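
One way to prepare such plots is to evaluate each class-conditional density on a shared grid of feature values. The sketch below assumes Gaussian class-conditional densities and uses hypothetical means and standard deviations rather than values fitted to the dataset. It also computes the area under the pointwise minimum of the two curves, which is one possible (not prescribed) way to quantify how much the two classes overlap on a feature.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density N(x; mean, std^2)."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def density_curve(mean, std, grid):
    """Evaluate the density at every grid point, ready to be plotted."""
    return [gaussian_pdf(x, mean, std) for x in grid]

def overlap_area(curve_a, curve_b, step):
    """Riemann-sum approximation of the area under min(p, q)."""
    return sum(min(a, b) for a, b in zip(curve_a, curve_b)) * step

step = 0.01
grid = [-6.0 + i * step for i in range(1601)]  # covers [-6, 10]

# Hypothetical class-conditional parameters for a single feature
pop_curve = density_curve(0.0, 1.0, grid)
classical_curve = density_curve(4.0, 1.0, grid)

separated = overlap_area(pop_curve, classical_curve, step)
identical = overlap_area(pop_curve, pop_curve, step)
```

Assuming matplotlib is available, each curve can be drawn with `plt.plot(grid, pop_curve)`. A smaller overlap suggests the feature separates the two classes better on its own.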

## Task 2. 10-way music genre classification

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

### Q3
Compare the performance of the full model to a 0R baseline and a one-attribute baseline. The one-attribute baseline should be the best possible naive Bayes model which uses only a prior and a single attribute. In your write-up, explain how you implemented the 0R and one-attribute baselines.
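
A 0R (zero-R) baseline ignores every attribute and always predicts the most frequent class in the training data. The sketch below shows that idea on made-up label lists; `zero_r` is a hypothetical helper name.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

# Made-up labels for illustration
train_labels = ["rock", "pop", "rock", "jazz", "rock", "pop"]
test_labels = ["rock", "pop", "jazz", "rock"]

majority = zero_r(train_labels)
# Accuracy of always predicting the majority class
baseline_accuracy = sum(label == majority for label in test_labels) / len(test_labels)
```

A one-attribute baseline could be built in a similar spirit by training a naive Bayes model on each attribute separately and keeping the one that performs best.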

### Q4
Train and test your model with a range of training set sizes by setting up your own train/test splits. With each split, use cross-validation so you can report the performance on the entire dataset (1000 items). You may use built-in functions to set up cross-validation splits. In your write-up, evaluate how model performance changes with training set size.
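
In k-fold cross-validation every item appears in exactly one test fold, so predictions can be collected for the whole dataset while each model is trained on a fraction (k - 1) / k of it. The sketch below builds such index splits in plain Python; `k_fold_indices` is a hypothetical helper, and a built-in splitter could be used instead.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    # Shuffle with a fixed seed so that the splits are reproducible
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits

splits = k_fold_indices(n_items=10, k=5)
```

Varying k changes the training set size: for example, k = 2 trains on half of the data in each fold, while k = 10 trains on 90% of it.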

### Q5
Implement a kernel density estimate (KDE) naive Bayes model and compare its performance to your Gaussian naive Bayes model. You may use built-in functions and automatic ("rule of thumb") bandwidth selectors to compute the KDE probabilities, but you should implement the naive Bayes logic yourself. You should give the parameters of the KDE implementation (namely, what bandwidth(s) you used and how they were chosen) in your write-up.
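
To make the idea concrete, the sketch below estimates a density with Gaussian kernels and chooses the bandwidth with the normal-reference rule of thumb, h = 1.06 · σ · n^(-1/5). This is only one of several rule-of-thumb selectors, the function names are hypothetical, and the samples are made up. In a KDE naive Bayes model, `kde_pdf` would take the place of the Gaussian density for each feature and class.

```python
import math
import statistics

def rule_of_thumb_bandwidth(samples):
    """Normal-reference rule of thumb: h = 1.06 * sigma * n^(-1/5)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-1 / 5)

def kde_pdf(x, samples, bandwidth):
    """Gaussian-kernel density estimate at x: the average of kernels centred on each sample."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

# Made-up training values of one feature within one class
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
h = rule_of_thumb_bandwidth(samples)
density_at_3 = kde_pdf(3.0, samples, h)
```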

### Q6
Modify your naive Bayes model to handle missing attributes in the test data. Recall from lecture that you can handle missing attributes at test time by skipping the missing attributes and computing the posterior probability from the non-missing attributes. Randomly delete some attributes from the provided test set to test how robust your model is to missing data. In your write-up, evaluate how your model's performance changes as the amount of missing data increases.
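
The skipping strategy can be sketched as follows, assuming a Gaussian naive Bayes model and missing values encoded as NaN. The function names, the toy parameters, and the `drop_attributes` helper for random deletion are all assumptions made for illustration.

```python
import math
import random

def log_posterior_skip_missing(x, prior, class_params):
    """Log prior plus log likelihoods of the non-missing attributes only."""
    total = math.log(prior)
    for value, (mean, std) in zip(x, class_params):
        if math.isnan(value):
            continue  # a missing attribute contributes nothing to the sum
        total += -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)
    return total

def predict_one(x, priors, params):
    """Return the class with the highest log posterior for feature vector x."""
    return max(priors, key=lambda c: log_posterior_skip_missing(x, priors[c], params[c]))

def drop_attributes(x, rate, rng):
    """Replace each attribute with NaN independently with probability `rate`."""
    return [float("nan") if rng.random() < rate else value for value in x]

# Hypothetical toy model: (mean, std) per feature and class
priors = {"pop": 0.6, "classical": 0.4}
params = {
    "pop": [(2.0, 1.0), (12.0, 2.0)],
    "classical": [(8.0, 1.0), (3.0, 1.0)],
}

partly_missing = predict_one([float("nan"), 2.0], priors, params)
all_missing = predict_one([float("nan"), float("nan")], priors, params)
corrupted = drop_attributes([1.0, 2.0], rate=1.0, rng=random.Random(0))
```

When every attribute is missing, the prediction falls back to the class with the largest prior.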