# Final Assessment Scratch Pad

## Instructions

1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.
2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.
3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so.

## Report

**[TODO]**

Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to give an overview of your thought process and approach, but concise enough to be easily understood. Also, please follow the guidelines given in `main.ipynb`.

This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow the instructions or if you include too many words here.

**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**

##### Overview
**[TODO]**

##### 1. Descriptive Analysis
The first step is to build an intuition for the kind of images I am dealing with, so I plot the first 10 images. Plotting the raw arrays raised errors, so I first transposed each image from channel-first to channel-last order and cast it to `uint8`.

The next step is to understand the nature of the labels, namely:
1. How many NaNs are there? This affects how the data is processed later on.
2. What share of the whole dataset does each label make up? This determines whether under- or oversampling is needed.
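
Both questions can be answered without explicit loops. The following is a minimal sketch on a small made-up label array (the values are hypothetical and not taken from the dataset); `summarize_labels` is an illustrative helper name, not part of the template.

```python
import numpy as np

def summarize_labels(labels):
    """Return (nan_count, {label: percentage of all samples, NaNs included})."""
    labels = np.asarray(labels, dtype=float)
    nan_mask = np.isnan(labels)
    # Count each distinct non-NaN label
    values, counts = np.unique(labels[~nan_mask], return_counts=True)
    percentages = {int(v): 100.0 * c / labels.size for v, c in zip(values, counts)}
    return int(nan_mask.sum()), percentages

# Hypothetical labels, for illustration only
example = np.array([0, 0, 1, np.nan, 2, 0, np.nan, 1])
nan_count, percentages = summarize_labels(example)
print('NaN count:', nan_count)
print('Label percentages:', percentages)
```

The percentages are computed against the full array length, NaNs included, which matches how the bar chart in the workings is annotated.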

##### 2. Detection and Handling of Missing Values
**[TODO]**

##### 3. Detection and Handling of Outliers
**[TODO]**

##### 4. Detection and Handling of Class Imbalance 
**[TODO]**

##### 5. Understanding Relationship Between Variables
**[TODO]**

##### 6. Data Visualization
**[TODO]**

##### 7. General Preprocessing
**[TODO]**

##### 8. Feature Selection 
**[TODO]**

##### 9. Feature Engineering
**[TODO]**

##### 10. Creating Models
**[TODO]**

##### 11. Model Evaluation
**[TODO]**

##### 12. Hyperparameters Search
**[TODO]**

##### Conclusion
**[TODO]**

---

# Workings (Not Graded)

You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. 

## Import Packages

Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`).

In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

## Load Dataset

The dataset `data/images.npy` is of size $(N, C, H, W)$, where $N$, $C$, $H$, and $W$ correspond to the number of data points, image channels, image height, and image width, respectively.
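
As a rough illustration of this layout, the sketch below builds a synthetic array (random values, not the real dataset) and converts one sample from channel-first `(C, H, W)` to the channel-last `(H, W, C)` order that `plt.imshow` expects. The helper name `to_displayable`, the 16x16 size, and the assumption that pixel values lie in [0, 255] are illustrative.

```python
import numpy as np

def to_displayable(image_chw):
    """Convert one (C, H, W) image, assumed to be in [0, 255], to (H, W, C) uint8."""
    image = np.nan_to_num(image_chw)                  # NaN pixels become 0
    image = np.clip(image, 0, 255).astype(np.uint8)   # keep values in the uint8 range
    return np.transpose(image, (1, 2, 0))             # channels move to the last axis

rng = np.random.default_rng(0)
batch = rng.uniform(0, 255, size=(4, 3, 16, 16))      # synthetic (N, C, H, W) batch
hwc = to_displayable(batch[0])
print(batch.shape, '->', hwc.shape)
```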

A code snippet that loads the data is provided below.

### Load Image Data

In [None]:
with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    images = data['image']
    labels = data['label']
    
print('Shape:', images.shape)

In [None]:
# Create a figure with subplots
plt.figure(figsize=(15, 15))  # Adjust the size as needed

# Loop through the first 10 images
for i in range(10):
    # Access the image
    image = images[i]

    # Convert to uint8
    image = np.array(image, dtype='uint8')

    # Rearrange the axes from [channels, height, width] to [height, width, channels]
    image = np.transpose(image, (1, 2, 0))

    # Plot the image
    plt.subplot(2, 5, i + 1)  # Adjust the layout (rows, columns, index) as needed
    plt.imshow(image)
    plt.title(f"Label: {labels[i]}")
    plt.axis('off')

# Display the plot
plt.show()

In [None]:

# count the number of nans in labels
nan_count = 0
for label in labels:
    if np.isnan(label):
        nan_count += 1

total_count = len(labels)

print('NaN count:', nan_count)
# print nan count percentage
print('NaN count percentage:', nan_count / len(labels) * 100, "%")

# remove nans and plot the label count
# (use a separate variable so `labels` stays aligned with `images`)
valid_labels = labels[~np.isnan(labels)]
label_count = {}
for label in valid_labels:
    if label not in label_count:
        label_count[label] = 0
    label_count[label] += 1

bars = plt.bar(label_count.keys(), label_count.values())

plt.title('Label Count')
plt.xlabel('Label')
plt.ylabel('Count')

# Add the percentage to each bar
for bar in bars:
    height = bar.get_height()
    percentage = f'{100 * height / total_count:.2f}%'
    plt.text(bar.get_x() + bar.get_width() / 2, height, percentage, ha='center', va='bottom')

plt.show()


## Data Exploration & Preparation

### 1. Descriptive Analysis

### 2. Detection and Handling of Missing Values

In [None]:
print('NaN count percentage:', nan_count / len(images) * 100, "%") # around 100%

# remove images where label is nan (compute the mask once so both arrays stay aligned)
label_mask = ~np.isnan(labels)
images = images[label_mask]
labels = labels[label_mask]

print('Shape:', images.shape)

In [None]:
# count the max nan values in an image
max_nan_count = 0
for image in images:
    nan_count = np.isnan(image).sum()
    if nan_count > max_nan_count:
        max_nan_count = nan_count

print('Max NaN count:', max_nan_count)
# replace nan values with 0
images = np.nan_to_num(images)


### 3. Detection and Handling of Outliers

### 4. Detection and Handling of Class Imbalance
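
The oversampling in this section duplicates randomly chosen minority-class samples until each class reaches a target count. A minimal sketch of that idea as a reusable helper is shown here on synthetic data; the name `oversample_to`, its parameters, and the target of 4 are assumptions made for illustration.

```python
import numpy as np

def oversample_to(images, labels, label, target_count, rng=None):
    """Duplicate random samples of `label` until it has `target_count` items."""
    rng = np.random.default_rng() if rng is None else rng
    indices = np.where(labels == label)[0]
    add_count = target_count - len(indices)
    if len(indices) == 0 or add_count <= 0:
        return images, labels  # nothing to sample from, or already large enough
    extra = rng.choice(indices, size=add_count, replace=True)
    return (np.concatenate((images, images[extra])),
            np.concatenate((labels, labels[extra])))

# Synthetic data, for illustration only
demo_images = np.zeros((6, 3, 16, 16))
demo_labels = np.array([0, 0, 0, 0, 1, 2])
for minority in (1, 2):
    demo_images, demo_labels = oversample_to(
        demo_images, demo_labels, minority, target_count=4,
        rng=np.random.default_rng(0))
print(np.unique(demo_labels, return_counts=True))
```

One caveat with this approach: if duplication happens before the train/test split, copies of the same image can end up on both sides of the split, which tends to inflate test scores.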

In [None]:
# oversample labels 1 and 2 (with replacement) up to 300 samples each
for minority_label in (1, 2):
    # get the indices of the images with this label
    label_indices = np.where(labels == minority_label)[0]
    # get the number of images to add (never negative)
    add_count = max(0, 300 - len(label_indices))
    if add_count > 0 and len(label_indices) > 0:
        # get the indices to add
        add_indices = np.random.choice(label_indices, add_count)
        # add the images and labels
        images = np.concatenate((images, images[add_indices]))
        labels = np.concatenate((labels, labels[add_indices]))
print('New shape:', images.shape)

### 5. Understanding Relationship Between Variables

In [None]:
# for each label 0, 1, 2, print out 5 images with that label, side by side
for label in range(3):
    label_indices = np.where(labels == label)[0]
    for i in range(5):
        image = images[label_indices[i]]
        image = np.array(image, dtype='uint8')
        image = np.transpose(image, (1, 2, 0))
        plt.subplot(1, 5, i + 1)
        plt.imshow(image)
        plt.title(f"Label: {label}")
        plt.axis('off')
    plt.show()

### 6. Data Visualization

## Data Preprocessing

### 7. General Preprocessing

### 8. Feature Selection

### 9. Feature Engineering

## Modeling & Evaluation

### 10. Creating models
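
The first `nn.Linear` layer of the model in this section takes 256 input features, which corresponds to a 16x16 input passed through three conv + pool stages ending in 64 channels. The sketch below checks that arithmetic with the usual output-size formula, floor((size + 2 * padding - kernel) / stride) + 1; the function names are illustrative, and it assumes square inputs and no dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(size, channels=(16, 32, 64)):
    """Features entering the first Linear layer for a square input of `size`."""
    for _ in channels:
        size = conv_out(size, kernel=3, stride=1, padding=1)  # Conv2d keeps the size
        size = conv_out(size, kernel=2, stride=2)             # MaxPool2d halves it
    return channels[-1] * size * size

print(flattened_features(16))  # 64 channels * 2 * 2 = 256
```

If the input resolution changes, the `in_features` of the first linear layer has to change with it; for example, a 32x32 input would give 64 * 4 * 4 features under the same layers.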

In [None]:
from torch import nn
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim
from torchvision import transforms, datasets
from PIL import Image

class Model:  
    """
    This class represents an AI model.
    """
    
    def __init__(self):
        """
        Constructor for Model class.
  
        Parameters
        ----------
        self : object
            The instance of the object passed by Python.
        """
        # initialize neural network sequence
        self.cnn = nn.Sequential(
            # convolutional layer and other layers suitable for a 16x16 image
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 3)  
        )

        # initialize hyperparameters
        self.learning_rate = 0.003
        self.batch_size = 32
        self.epochs = 50

    def fit(self, X, y):
        """
        Train the model using the input data.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, channel, height, width)
            Training data.
        y : ndarray of shape (n_samples,)
            Target values.
            
        Returns
        -------
        self : object
            Returns an instance of the trained model.
        """

        X, y = Model.preprocess(X, y)
        X, y = Model.feature_engineer(X, y)

        total_count = len(y)

        class_counts = torch.bincount(torch.tensor(y, dtype=torch.long))

        # if the number of classes is 2, add a 1 to the end of class_counts
        if len(class_counts) == 2:
            class_counts = torch.cat((class_counts, torch.tensor([1])))

        # Increase the weight of the minority classes more significantly
        class_weights = torch.tensor([total_count / (len(class_counts) * class_count) for class_count in class_counts])

        # Optionally, normalize the weights
        class_weights = class_weights / class_weights.sum()
        print('Class counts:', class_counts)

        print('Class weights:', class_weights)

        # print percentage of each label
        unique_labels, counts = np.unique(y, return_counts=True)
        for label, count in zip(unique_labels, counts):
            print(f'Label {label}: {count / total_count * 100:.2f}%')

        X_tensor = torch.tensor(X, dtype=torch.float32)
        y_tensor = torch.tensor(y, dtype=torch.long)

        # Create a dataset and data loader
        dataset = TensorDataset(X_tensor, y_tensor)
        dataloader = DataLoader(dataset, batch_size=len(dataset) if self.batch_size is None else self.batch_size, shuffle=True)

        # Define loss function and optimizer for classification
        criterion = nn.CrossEntropyLoss(weight=class_weights)
        optimizer = optim.Adam(self.cnn.parameters(), lr=self.learning_rate)

        # Train the model
        for epoch in range(self.epochs):
            for inputs, targets in dataloader:
                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                outputs = self.cnn(inputs)
                loss = criterion(outputs, targets)

                # Backward and optimize
                loss.backward()
                optimizer.step()

            print(f'Epoch {epoch+1}/{self.epochs}, Loss: {loss.item()}')

        return self



    
    def predict(self, X):
        """
        Use the trained model to make predictions.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, channel, height, width)
            Input data.
            
        Returns
        -------
        ndarray of shape (n_samples,)
        Predicted target values per element in X.
           
        """
        X = Model.preprocess_predict(X)
        X_tensor = torch.tensor(X, dtype=torch.float32)

        # Inference only: skip gradient tracking to save memory and time
        with torch.no_grad():
            outputs = self.cnn(X_tensor)
            # The predicted class is the index of the largest logit
            predicted = outputs.argmax(dim=1)

        return predicted.numpy()
        
    
    @staticmethod
    def preprocess(images, labels):
        # remove images where label is nan
        images = images[~np.isnan(labels)]
        labels = labels[~np.isnan(labels)]
        
        # replace nan values with 0
        images = np.nan_to_num(images)

        # remove images where label is 2
        images = images[labels != 2]
        labels = labels[labels != 2]

        # balance the dataset by oversampling minority classes (see balance_dataset)
        images, labels = Model.balance_dataset(images, labels)

        # normalize the images
        images = images / 255.0

        return images, labels
    
    @staticmethod
    def preprocess_predict(images):
        # replace nan values with 0
        images = np.nan_to_num(images)
        
        # normalize the images
        images = images / 255.0

        return images

    @staticmethod
    def feature_engineer(images, labels):
        T = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(10),
        ])

        augmented_images = []
        augmented_labels = []

        # get images and labels of the non-majority classes (label != 0)
        images_to_engineer = images[labels != 0]
        labels_to_engineer = labels[labels != 0]

        for image, label in zip(images_to_engineer, labels_to_engineer):
            # images were scaled to [0, 1] in preprocess, so rescale to [0, 255]
            # before the uint8 cast (otherwise every pixel truncates to 0 or 1)
            image = (image.transpose(1, 2, 0) * 255).round().astype('uint8')
            img_pil = Image.fromarray(image, 'RGB')
        
            # Apply transformations
            augmented_img = T(img_pil)

            # Convert back to a (C, H, W) array in [0, 1] and append
            augmented_np = np.asarray(augmented_img).transpose(2, 0, 1) / 255.0

            augmented_images.append(augmented_np)
            augmented_labels.append(label)

        # nothing to append if there were no samples to augment
        if augmented_images:
            images = np.concatenate((images, augmented_images), axis=0)
            labels = np.concatenate((labels, augmented_labels), axis=0)

        return images, labels
    
    @staticmethod
    def balance_dataset(images, labels, min_proportions=(0.1, 0.5)):
        unique_labels, counts = np.unique(labels, return_counts=True)
        total_samples = len(labels)
        
        # Determine minimum count for each label based on proportions
        min_counts = [int(total_samples * p) for p in min_proportions]
        
        # Sort labels by their count (ascending)
        sorted_indices = np.argsort(counts)
        
        for idx, min_count in zip(sorted_indices, min_counts):
            label = unique_labels[idx]
            current_count = counts[idx]
            
            if current_count < min_count:
                # Calculate the number of samples to add
                add_count = min_count - current_count
                
                # Get indices of the current label
                label_indices = np.where(labels == label)[0]
                
                # Randomly select indices to duplicate
                add_indices = np.random.choice(label_indices, add_count)
                
                # Add the images and labels
                images = np.concatenate((images, images[add_indices]))
                labels = np.concatenate((labels, labels[add_indices]))
                
                # Update total samples
                total_samples += add_count

        return images, labels


### 11. Model Evaluation
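
The evaluation in this section reports a macro-averaged F1 score, which weights every class equally regardless of how many samples it has. The sketch below spells out that computation by hand on made-up predictions (not model output); `macro_f1` is an illustrative name, and a class with no true positives is scored as 0.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Hypothetical labels: class 2 exists in the ground truth but is never predicted
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 0, 0]
print(f'Macro F1: {macro_f1(y_true_demo, y_pred_demo, labels=[0, 1, 2]):.4f}')
```

Because each class contributes one equal share, a class that is never predicted contributes an F1 of 0 and lowers the macro score noticeably, even when it is rare in the data.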

In [None]:
# Load data
with open('data.npy', 'rb') as f:
    data = np.load(f, allow_pickle=True).item()
    X = data['image']
    y = data['label']

In [None]:
# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Filter test data that contains no labels
# In Coursemology, the test data is guaranteed to have labels
nan_indices = np.argwhere(np.isnan(y_test)).squeeze()
mask = np.ones(y_test.shape, bool)
mask[nan_indices] = False
X_test = X_test[mask]
y_test = y_test[mask]

# Train and predict
model = Model()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate model prediction
# Learn more: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
print("F1 Score (macro): {0:.2f}".format(f1_score(y_test, y_pred, average='macro'))) # You may encounter errors, you are expected to figure out what's the issue.

In [None]:
# print first 20 predictions beside ground truth
#for i in range(20):
#    print(f'Prediction: {y_pred[i]}, Ground Truth: {y_test[i]}')

# count total number of prediction and ground truth for each label
# (count per label so both columns stay aligned even if a class is never predicted)
for label in np.unique(y_test):
    count_truth = np.count_nonzero(y_test == label)
    count_pred = np.count_nonzero(y_pred == label)
    print(f'Label {label}: {count_truth} ground truth, {count_pred} predictions')

# for the predictions that predict 1, print the percentage where the ground truth is 1
label_1_indices = np.where(y_pred == 1)[0]
label_1_count = len(label_1_indices)
label_1_truth_count = 0
for i in label_1_indices:
    if y_test[i] == 1:
        label_1_truth_count += 1

label_2_indices = np.where(y_pred == 2)[0]
label_2_count = len(label_2_indices)
label_2_truth_count = 0
for i in label_2_indices:
    if y_test[i] == 2:
        label_2_truth_count += 1

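# share of label-1 predictions that are correct (guard against no label-1 predictions)
if label_1_count > 0:
    print(f'Label 1: {label_1_truth_count / label_1_count * 100:.2f}%')
else:
    print('Label 1: no predictions')
# print(f'Label 2: {label_2_truth_count / label_2_count * 100:.2f}%')

# print the ground truth for every sample predicted as 1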
for i in label_1_indices:
    print(f'Prediction: {y_pred[i]}, Ground Truth: {y_test[i]}')

print(f'Label 2: {label_2_count}')

### 12. Hyperparameters Search

In [None]:
# hold out 99% of the data as the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99)

# print the count of y_test where label is 0
print(np.count_nonzero(y_test == 0))
