# Machine Learning Lab 1.2 - Naive Bayes Pt 2

Edit the following to include the names and student numbers of everyone in your group, including you. If you are not in a group, enter only your own details.
- **Student Names:** name 1, name 2, ...
- **Student Numbers:** student no 1, student no 2, ...

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Section 1 - Digit Classification with Naive Bayes

## Data Processing
**Do not change the below cells, but you must run them**


Load data into pandas dataframe object

In [None]:
# Load data (digits) & labels as pandas dataframe object
df = pd.read_csv("smalldigits.csv", header=None)
df = df.sample(frac=1, random_state=42, axis=0)  # Randomise dataframe
df

Create 90/10 train test split and convert to numpy arrays

In [None]:
n_rows = df.shape[0]
# Roughly 90/10 train-test split
train_digits = df.iloc[:int(n_rows * 0.9), :-1].to_numpy()
train_labels = df.iloc[:int(n_rows * 0.9), -1].to_numpy()

print("train_digits: \n", train_digits)
print("\ntrain_labels: \n", train_labels)

In [None]:
test_digits = df.iloc[int(n_rows * 0.9):, :-1].to_numpy()
test_labels = df.iloc[int(n_rows * 0.9):, -1].to_numpy()

print("test_digits: \n", test_digits)
print("\ntest_labels: \n", test_labels)

**Note:**
- `train_digits` - train features
- `train_labels` - train labels
- `test_digits` - test features
- `test_labels` - test labels

### Helper Functions

`vis_digit` can be used to visualise a given digit. This may be useful for debugging and/or your understanding.

In [None]:
def vis_digit(digit):
    plt.imshow(digit.reshape(8, 8), cmap="viridis")

print(f"Label = {train_labels[0]}")
vis_digit(train_digits[0])

## Learning

#### Question 1
Compute the **prior** probabilities for each class. Store these values in the numpy array `priors`, with the prior for label 0 at index 0, the prior for label 1 at index 1, and so on.
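
As a point of reference, a prior is just the fraction of training examples that carry a given label. The sketch below illustrates this on a small made-up label array; `toy_labels`, `n_classes` and `toy_priors` are hypothetical names and are not part of the lab data.

```python
import numpy as np

# Hypothetical toy labels (not the lab data): six examples over three classes.
toy_labels = np.array([0, 2, 1, 2, 2, 0])
n_classes = 3

# Count how often each class occurs, then divide by the number of examples.
class_counts = np.bincount(toy_labels, minlength=n_classes)
toy_priors = class_counts / toy_labels.size

print(toy_priors)
```

Passing `minlength` keeps a zero entry for any class that happens not to appear in the labels, so the result always has one entry per class.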

In [None]:
priors = np.zeros(10)
# TODO

priors

#### Question 2 
Compute the class conditionals with Laplacian smoothing and assign their values to the numpy array `class_conditionals`. Set `k = 1`.
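
The following is a minimal sketch of Laplacian smoothing on made-up data, under the assumption that every feature is binary (0 or 1), so each feature has two possible values. Under that assumption the smoothed estimate is $P(x_j = 1 | c) = \frac{\text{count}(x_j = 1, c) + k}{N_c + 2k}$, where $N_c$ is the number of training examples in class $c$. All names below (`toy_X`, `toy_y`, `toy_cc`) are hypothetical; check the lecture notes for the exact form expected in the lab.

```python
import numpy as np

# Hypothetical toy data: 5 examples, 3 binary features, 2 classes.
toy_X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1],
                  [0, 0, 0]])
toy_y = np.array([0, 0, 1, 1, 1])
k = 1          # smoothing strength
n_values = 2   # assumption: each feature takes one of two values (0 or 1)

toy_cc = np.zeros((2, 3))  # toy_cc[c, j] approximates P(x_j = 1 | C = c)
for c in range(2):
    rows = toy_X[toy_y == c]       # examples belonging to class c
    on_counts = rows.sum(axis=0)   # how often each feature equals 1 in class c
    toy_cc[c] = (on_counts + k) / (rows.shape[0] + k * n_values)

print(toy_cc)
```

Note that feature 0 is never on in class 1, yet its smoothed probability is 1/5 rather than 0, which is the point of the smoothing.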

In [None]:
class_conditionals = np.zeros((10, 64))  # 10 classes, 64 features
k = 1

# TODO

class_conditionals[:2]

##### Visualise class conditionals 
Below, for each class, we plot the probability associated with each pixel (i.e. each feature). If your computation of the class conditionals is correct, each plot should vaguely resemble the digit for its label.

Think about why visualising the class conditional model in this way shows the associated labels.

In [None]:
# Create a figure with a 2 x 5 grid of subplots (one per class)
fig, axes = plt.subplots(2, 5, figsize=(14, 7))  # 2 rows, 5 columns
for i in range(5):
    axes[0][i].imshow(class_conditionals[i].reshape(8, 8), cmap="viridis")
    axes[0][i].set_title(f"label = {i}")

for i in range(5):
    axes[1][i].imshow(class_conditionals[i+5].reshape(8, 8), cmap="viridis")
    axes[1][i].set_title(f"label = {i+5}")

## Inference

#### Question 3
Finish the function `calc_posterior` that computes $P(C|X)$, where $C$=`label` and $X$=`features`.
- `features`: $x$
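
For orientation, Bayes' rule gives $P(C|X) = \frac{P(X|C)P(C)}{\sum_{C'} P(X|C')P(C')}$, and the naive assumption lets $P(X|C)$ factor into a product over features. The sketch below applies this to made-up numbers, again assuming binary features. `toy_priors`, `toy_cc` and `toy_posterior` are hypothetical and stand in for whatever you computed in the earlier questions.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 3 binary features.
toy_priors = np.array([0.4, 0.6])
toy_cc = np.array([[0.75, 0.5, 0.5],
                   [0.2, 0.4, 0.6]])  # toy_cc[c, j] = P(x_j = 1 | C = c)


def toy_posterior(x):
    """Return P(C | x) for every class as a 1-D array."""
    # Use P(x_j = 1 | c) where the feature is on and 1 - P(x_j = 1 | c) where it is off.
    per_feature = np.where(x == 1, toy_cc, 1.0 - toy_cc)
    likelihoods = per_feature.prod(axis=1)  # P(x | c) under the naive assumption
    joint = likelihoods * toy_priors        # P(x | c) * P(c)
    return joint / joint.sum()              # normalise so the posteriors sum to 1


post = toy_posterior(np.array([1, 0, 1]))
print(post)
```

Dividing by `joint.sum()` is what turns the unnormalised scores into probabilities. The most probable class is the same with or without that step.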

In [None]:
def calc_posterior(features):
    # Calc P(X|C) for each C    
    feat_class_conds = np.ones(10)  # feat_class_conds[0] corresponds to P(X|C=0)

    # TODO

    return p_c_x


print(f"Posterior probs for digit = {calc_posterior(test_digits[0])}")

In [None]:
print("test_digits[0]:")
vis_digit(test_digits[0])

#### Question 4

Finish the function `infer_class` that infers/predicts the most probable class for the given data `digit`.

In [None]:
def infer_class(digit):
    # TODO
    pred_label = ...

    return pred_label

infer_ind = 0
print(f"Predicted label = {infer_class(test_digits[infer_ind])}; True label = {test_labels[infer_ind]}")

In [None]:
print(f"test_digits[{infer_ind}]:")
vis_digit(test_digits[infer_ind])

#### Question 5
Create a confusion matrix using the test set
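
A confusion matrix counts how often each true label is predicted as each label. The sketch below builds one from made-up predictions, assuming the convention that rows index the true label and columns index the predicted label (the lab does not state which convention it expects). The names `toy_true`, `toy_pred` and `toy_cm` are hypothetical.

```python
import numpy as np

# Hypothetical true and predicted labels for six examples over three classes.
toy_true = np.array([0, 1, 2, 2, 1, 0])
toy_pred = np.array([0, 2, 2, 2, 1, 1])

toy_cm = np.zeros((3, 3), dtype=int)
for t, p in zip(toy_true, toy_pred):
    toy_cm[t, p] += 1  # assumed convention: row = true label, column = predicted label

print(toy_cm)
```

Correct predictions land on the diagonal, so off-diagonal entries show which pairs of classes the model confuses.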

In [None]:
confusion_matrix = np.zeros((10, 10))

# TODO

# Don't modify
print(confusion_matrix)
plt.imshow(confusion_matrix)  # Plot heatmap of confusion_matrix

#### Question 6
Compute the accuracy
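
Accuracy is the fraction of test examples that are classified correctly. One way to obtain it, sketched here on a made-up matrix, is to divide the sum of the diagonal of the confusion matrix by the sum of all its entries. `toy_cm` and `toy_acc` are hypothetical names.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (correct predictions on the diagonal).
toy_cm = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 2]])

# Correct predictions divided by all predictions.
toy_acc = np.trace(toy_cm) / toy_cm.sum()

print(toy_acc)
```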

In [None]:
acc = ...

print(f"Accuracy = {acc}")

# Section 2 - Naive Bayes with Continuous Features

The file `banknote_authentication.csv` contains 100 examples of genuine (class=1) and forged (class=0) banknotes. These images were analysed with a wavelet transform tool that generated four continuous features: variance, skewness, curtosis and entropy (of each image). For each feature in both classes, you must fit a Gaussian distribution to that feature and use this to make the predictions.

In [None]:
bank_df = pd.read_csv("banknote_authentication.csv", sep=";")
bank_df = bank_df.sample(frac=1, random_state=42)  # Randomise
bank_df

## Data Analysis & Visualisation

### Question 1
**a)**
Plot 8 separate histograms: one for each variable for each class. These plots must be rendered in the provided matplotlib axes (`axs`). The top row should correspond to `class=0` and the bottom row should correspond to `class=1`. For each plot you must set the title of the axis to have the format `class={class}-{feature_name}`. Incorrect formatting will lead to zero marks for this question.

**Tip:** I'd recommend using the plotting library [Seaborn](https://seaborn.pydata.org/index.html) for this as it will make things easier, but you are also welcome to just use matplotlib directly. 
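
The sketch below shows one way to fill a grid of axes with per-class histograms using matplotlib directly. It runs on randomly generated data with two made-up features (`feat_a`, `feat_b`), not on the banknote data, and every name in it is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data: 40 examples, 2 features, 20 examples per class.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(40, 2))
toy_classes = np.repeat([0, 1], 20)
feature_names = ["feat_a", "feat_b"]  # made-up feature names

toy_fig, toy_axs = plt.subplots(2, len(feature_names), figsize=(8, 6))
for row, cls in enumerate([0, 1]):            # one row of axes per class
    subset = toy_features[toy_classes == cls]
    for col, name in enumerate(feature_names):
        ax = toy_axs[row, col]
        ax.hist(subset[:, col], bins=10)
        ax.set_title(f"class={cls}-{name}")   # e.g. "class=0-feat_a"
toy_fig.tight_layout()
```

Indexing the axes as `toy_axs[row, col]` keeps the mapping between row and class explicit, which makes the required title format easy to get right.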

In [None]:
# If using seaborn uncomment the below and run
# !pip install seaborn
# import seaborn as sns
# sns.set()

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(14, 7)) # Don't remove

# TODO

fig.tight_layout()  # Don't remove

**b)** Do these distributions look like Gaussian distributions? How well do you expect this to work?

*This is a Markdown Cell. Double click this text to edit.*

Put your answer below:

...

## Training

#### Train-Test Split

First create an 80-20 train-test split. Note that we first randomised the dataframe so the data is already randomised.

In [None]:
split_index = int(bank_df.shape[0] * 0.8)

print("#########")
print("# TRAIN #")
print("#########")
s2_train_features = bank_df.iloc[:split_index, :-1].to_numpy()
s2_train_labels = bank_df.iloc[:split_index, -1].to_numpy()

print(f"first ten rows of s2_train_features = \n {s2_train_features[:10]}")
print(f"\nfirst ten elements of s2_train_labels = \n {s2_train_labels[:10]}")

print("\n########")
print("# TEST #")
print("########")
s2_test_features = bank_df.iloc[split_index:, :-1].to_numpy()
s2_test_labels = bank_df.iloc[split_index:, -1].to_numpy()

print(f"first ten rows of s2_test_features = \n {s2_test_features[:10]}")
print(f"\nfirst ten elements of s2_test_labels = \n {s2_test_labels[:10]}")

**Note: If you do not train only on the training data you will lose significant marks or get zero for the following questions.**

### Question 2 - Priors
Calculate the class priors and set them to the numpy array `s2_priors`. Element 0 of the array should correspond to class=0.

In [None]:
s2_priors = np.zeros(2)

# TODO

s2_priors

### Question 3 - Class Conditionals
For each feature $x_i$ and class $c$, fit a Gaussian distribution to the associated data and implement the function `s2_class_conditional_fn`. Note that you **must** implement the relevant equations yourself - do not just use built-in methods for computing the mean, variance and so on.

**TIP:** Use the relevant equations found in the lecture notes "Lec 1.2 - More on Naive Bayes". 

**a)** Fit Gaussian distributions to each feature and class $(x_i, c)$. I.e., compute the mean ($\mu_{x_i, c}$) and variance ($\sigma^2_{x_i, c}$) for each $(x_i, c)$. Store these values in the numpy arrays `s2_cc_mean` for the means, and `s2_cc_var` for the variances. The rows of these arrays must correspond to classes, and the columns must correspond to features, so both arrays have shape `(2, 4)`. Note: $x_0$="variance", $x_1$="skewness", $x_2$="curtosis" and $x_3$="entropy". $c_0$=0 and $c_1$=1.

The format of `s2_cc_mean` is as follows:
`s2_cc_mean` = <br>
\[ <br>
&emsp; \[$\mu_{x_0, c_0}$, $\mu_{x_1, c_0}$, $\mu_{x_2, c_0}$, $\mu_{x_3, c_0}$],<br>
&emsp; \[$\mu_{x_0, c_1}$, $\mu_{x_1, c_1}$, $\mu_{x_2, c_1}$, $\mu_{x_3, c_1}$] <br>
]

The format of `s2_cc_var` follows similarly. 
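
As an illustration of fitting per-class Gaussians by hand, the sketch below computes means and variances on made-up data using only sums, following the layout shown above (one row per class, one column per feature). It assumes the maximum-likelihood variance, which divides by $N_c$. The lecture notes may instead divide by $N_c - 1$, so check which form is expected. All names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 4 examples, 2 continuous features, 2 classes.
toy_X = np.array([[1.0, 10.0],
                  [3.0, 14.0],
                  [2.0, 20.0],
                  [6.0, 24.0]])
toy_y = np.array([0, 0, 1, 1])

toy_mean = np.zeros((2, 2))  # rows: classes, columns: features
toy_var = np.zeros((2, 2))
for c in range(2):
    rows = toy_X[toy_y == c]  # training examples of class c
    n_c = rows.shape[0]
    # Mean: sum of the values divided by their count.
    toy_mean[c] = rows.sum(axis=0) / n_c
    # Variance (assumed maximum-likelihood form): mean squared deviation from the mean.
    toy_var[c] = ((rows - toy_mean[c]) ** 2).sum(axis=0) / n_c

print(toy_mean)
print(toy_var)
```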

In [None]:
s2_cc_mean = np.zeros((2, 4))

# TODO

s2_cc_mean

In [None]:
s2_cc_var = np.zeros((2,4))

# TODO

s2_cc_var

**b)** Implement the function `s2_class_conditional_fn` which will compute $P(x_i | c)$. This function takes in the feature, class (class_label), mean and variance (var).
- `feature`: $x_i$
- `class_label`: $c$
- `mean`: mean ($\mu_{x_i, c}$) of the associated Gaussian distribution for $(x_i, c)$
- `var`: variance ($\sigma^2_{x_i, c}$) of the associated Gaussian distribution for $(x_i, c)$
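
For reference, the density of a univariate Gaussian with mean $\mu$ and variance $\sigma^2$ is $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. The sketch below evaluates this formula directly. `toy_gaussian_pdf` is a hypothetical helper, not the required `s2_class_conditional_fn`.

```python
import math


def toy_gaussian_pdf(x, mean, var):
    """Evaluate the univariate Gaussian density at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = -((x - mean) ** 2) / (2.0 * var)
    return coeff * math.exp(exponent)


# Density of a standard normal at its mean.
print(toy_gaussian_pdf(0.0, 0.0, 1.0))
# With a small variance the density at the mean is greater than 1.
print(toy_gaussian_pdf(0.0, 0.0, 0.01))
```

Because this is a density rather than a probability, values above 1 are legitimate when the variance is small. They still work as likelihood terms in Bayes' rule.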

In [None]:
def s2_class_conditional_fn(feature, class_label, mean, var):
    cond_prob = ...   # i.e. P(x_i | c)
    # TODO

    return cond_prob

tmp_feature = s2_train_features[0, 0]
# tmp_class = 0
print(f"P(x_0={tmp_feature}|c={0}) = {s2_class_conditional_fn(tmp_feature, 0, s2_cc_mean[0, 0], s2_cc_var[0, 0])}")
# The arrays have shape (2, 4) and are indexed [class, feature], so class 1 / feature 0 is [1, 0]
print(f"P(x_0={tmp_feature}|c={1}) = {s2_class_conditional_fn(tmp_feature, 1, s2_cc_mean[1, 0], s2_cc_var[1, 0])}")


### Question 4 - Posterior Probability

Implement the function `s2_calc_posterior` that calculates the posterior probability of a given class based off given data. I.e. it should compute $P(c|x)$.
- `feature`: $x$
- `class_label`: $c$
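
With continuous features, the posterior combines the prior with one Gaussian likelihood term per feature: $P(c|x) \propto P(c) \prod_i p(x_i | c)$. The sketch below does this on a made-up two-class model. It sums log-densities instead of multiplying densities, which is a design choice for numerical stability rather than something the lab requires. All names and parameter values are hypothetical.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 2 continuous features.
toy_priors = np.array([0.5, 0.5])
toy_mean = np.array([[0.0, 0.0],
                     [2.0, 2.0]])  # rows: classes, columns: features
toy_var = np.ones((2, 2))


def toy_posterior(class_label, x):
    """Return P(c = class_label | x) under a Gaussian naive Bayes model."""
    # Log of the Gaussian density for every (class, feature) pair.
    log_density = -0.5 * (np.log(2.0 * np.pi * toy_var) + (x - toy_mean) ** 2 / toy_var)
    # Log prior plus the sum of per-feature log densities gives log P(x, c).
    log_joint = np.log(toy_priors) + log_density.sum(axis=1)
    # Subtract the maximum before exponentiating to avoid underflow.
    joint = np.exp(log_joint - log_joint.max())
    return joint[class_label] / joint.sum()


x_mid = np.array([1.0, 1.0])  # equally far from both class means
print(toy_posterior(0, x_mid), toy_posterior(1, x_mid))
```

The point `x_mid` sits halfway between the two class means and the priors are equal, so both classes receive the same posterior.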

In [None]:
def s2_calc_posterior(class_label, feature):
    post_prob = ...
    # TODO

    return post_prob

# Don't change
print(f"P(c=0 | x={s2_test_features[0]}) = {s2_calc_posterior(0, s2_test_features[0])}")
print(f"P(c=1 | x={s2_test_features[0]}) = {s2_calc_posterior(1, s2_test_features[0])}")

## Question 5 - Predict Class
Implement the function `s2_infer_class`, which should return the most probable class for the given data.

In [None]:
def s2_infer_class(feature):
    c = ...
    # TODO
    
    return c

print(f"Inferred class for x={s2_test_features[0]} = {s2_infer_class(s2_test_features[0])}")

## Question 6 - Confusion Matrix & Accuracy
**a)** Compute the confusion matrix using the test set

In [None]:
s2_confusion_matrix = np.zeros((2, 2))

# TODO

# Don't modify
print(s2_confusion_matrix)
plt.imshow(s2_confusion_matrix)  # Plot heatmap of confusion_matrix

**b)** Compute the accuracy

In [None]:
s2_acc = ...

# TODO

s2_acc

**Does this accuracy align with what you expected based off how well (or not well) the data fits normal distributions?** (Note: Don't write the answer)  

# \[Optional for Bonus Marks] Section 3 - Harry Potter Classification

We will now look at a more challenging text-based classification problem: classifying which of the seven Harry Potter books a given page was taken from. The books can be found in the zip file `hp books.zip`. Each book is a text file in which each line is one page of the book. Note that all punctuation and capital letters have been removed from the files, so only the words of each page remain for our model to use.

**Note:** Add and use multiple code cells for each question to improve readability.

## Question 1
Train an NB model using 80% of the data to train and the remaining 20% as test data. Use Laplace smoothing for your model. Report a confusion matrix of your results. Laplace smoothing is a simple way of avoiding 0 values in the class-conditional models (table of likelihoods). However, it may cause problems when many unique, infrequent words are added to the table (when multiplied together low likelihoods may still become 0 but too large a smoothing value will bias the model). In such a case even removing stop words* may not be enough. Thus, we will now smooth the table of likelihoods by adding a set value to each element of the table. The smoothing value used will now become a hyper-parameter for our algorithm, and so we will need to use a validation data set to find the correct value for the hyper-parameter.

*stop words are 'unhelpful' frequent words such as 'and', 'the', 'at' and so on that are often removed from the data to improve performance.
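
The underflow problem mentioned above is easy to reproduce. The sketch below multiplies 500 made-up word likelihoods of $10^{-3}$ each and compares the result with the sum of their logarithms. Working with log-probabilities is a common workaround, offered here as a suggestion rather than a requirement of the lab.

```python
import numpy as np

# Hypothetical page: 500 words, each with a likelihood of 1e-3 under some class.
word_probs = np.full(500, 1e-3)

# The true product is 1e-1500, far below the smallest positive float64.
direct_product = np.prod(word_probs)

# The sum of logs stays finite and can still be compared across classes.
log_sum = np.sum(np.log(word_probs))

print(direct_product)
print(log_sum)
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest probability.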

## Question 2

**a)** Adapt your code to use 80% of the data to train, 10% of the data as validation data and the remaining 10% as test data. Train separate NB classifiers using the values {$1 \times 10^{-1}$, $1 \times 10^{-2}$, $1 \times 10^{-3}$, $1 \times 10^{-4}$, $1 \times 10^{-5}$, $1 \times 10^{-6}$} to smooth the table of likelihoods. Train each model using the training data, and track its performance on the validation data.
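
The sketch below outlines the selection loop on a tiny made-up corpus: train one model per smoothing value on the training split, score each on the validation split, and keep the value with the best validation accuracy. It assumes a bag-of-words (multinomial) model with log-probabilities. The helpers `train_nb`, `predict_nb` and `accuracy` and the toy pages are all hypothetical.

```python
import numpy as np


def train_nb(docs, labels, vocab, n_classes, alpha):
    """Fit a multinomial NB model, adding alpha to every word count."""
    word_index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((n_classes, len(vocab)))
    class_counts = np.zeros(n_classes)
    for doc, c in zip(docs, labels):
        class_counts[c] += 1
        for word in doc.split():
            counts[c, word_index[word]] += 1
    smoothed = counts + alpha  # smoothing applied to the whole table
    log_lik = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    log_prior = np.log(class_counts / class_counts.sum())
    return word_index, log_prior, log_lik


def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    word_index, log_prior, log_lik = model
    scores = log_prior.copy()
    for word in doc.split():
        if word in word_index:  # words never seen in training are ignored
            scores += log_lik[:, word_index[word]]
    return int(np.argmax(scores))


def accuracy(model, docs, labels):
    preds = np.array([predict_nb(model, d) for d in docs])
    return float(np.mean(preds == np.array(labels)))


# Hypothetical toy pages: class 0 and class 1 stand in for two books.
train_docs = ["owl wand owl", "wand spell wand", "dragon egg dragon", "egg cup dragon"]
train_labels = [0, 0, 1, 1]
val_docs = ["owl spell", "dragon cup"]
val_labels = [0, 1]
vocab = sorted({w for d in train_docs for w in d.split()})  # built from training data only

smoothing_values = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
val_acc = {}
for alpha in smoothing_values:
    model = train_nb(train_docs, train_labels, vocab, 2, alpha)
    val_acc[alpha] = accuracy(model, val_docs, val_labels)

best_alpha = max(val_acc, key=val_acc.get)
print(val_acc, best_alpha)
```

The test split plays no part in this loop. It is held back so that the final accuracy is measured on data that influenced neither the training nor the choice of smoothing value.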

**b)** Which model gave the best accuracy on validation data? 

TODO: Your answer here

**c)** Does the choice of smoothing value have a big impact on the performance of the model?

TODO: Your answer here

## Question 3

Use the model which achieved the best validation accuracy and test it using the test data set. Report a confusion matrix of the results, as well as the test accuracy of the model.

## Question 4

Looking at the confusion matrix, which books would you say are most similar to each other (hint: look at which books are often confused with each other)? Do you think JK Rowling's writing style changed over time? Why else do you think certain books are more easily confused with each other?

TODO: Your answer here

# END