# Final Assessment Scratch Pad

## Instructions

1. Please use only this Jupyter notebook to work on your model, and **do not use any extra files**. If you need to define helper classes or functions, feel free to do so in this notebook.
2. This template is intended to be general, but it may not cover every use case. The sections are given so that it will be easier for us to grade your submission. If your specific use case isn't addressed, **you may add new Markdown or code blocks to this notebook**. However, please **don't delete any existing blocks**.
3. If you don't think a particular section of this template is necessary for your work, **you may skip it**. Be sure to explain clearly why you decided to do so.

## Report

**[TODO]**

Please provide a summary of the ideas and steps that led you to your final model. Someone reading this summary should understand why you chose to approach the problem in a particular way and be able to replicate your final model at a high level. Please ensure that your summary is detailed enough to provide an overview of your thought process and approach, but also concise enough to be easily understandable. Also, please follow the guidelines given in `main.ipynb`.

This report should not be longer than **1-2 pages of A4 paper (up to around 1,000 words)**. Marks will be deducted if you do not follow the instructions or if you exceed this limit.

**[DELETE EVERYTHING FROM THE PREVIOUS TODO TO HERE BEFORE SUBMISSION]**

##### Overview
**[TODO]**

##### 1. Descriptive Analysis
**[TODO]**

##### 2. Detection and Handling of Missing Values
**[TODO]**

##### 3. Detection and Handling of Outliers
**[TODO]**

##### 4. Detection and Handling of Class Imbalance 
**[TODO]**

##### 5. Understanding Relationship Between Variables
**[TODO]**

##### 6. Data Visualization
**[TODO]**

##### 7. General Preprocessing
**[TODO]**

##### 8. Feature Selection 
**[TODO]**

##### 9. Feature Engineering
**[TODO]**

##### 10. Creating Models
**[TODO]**

##### 11. Model Evaluation
**[TODO]**

##### 12. Hyperparameters Search
**[TODO]**

##### Conclusion
**[TODO]**

---

# Workings (Not Graded)

You will do your working below. Note that anything below this section will not be graded, but we might counter-check what you wrote in the report above with your workings to make sure that you actually did what you claimed to have done. 

## Import Packages

Here, we import some packages necessary to run this notebook. In addition, you may import other packages as well. Do note that when submitting your model, you may only use packages that are available in Coursemology (see `main.ipynb`).

In [None]:
import pandas as pd
import os
import numpy as np
from util import show_images, dict_train_test_split
from sklearn.impute import KNNImputer


## Load Dataset

The dataset provided is multimodal and contains two components: images and tabular data. The tabular dataset `tabular.csv` contains $N$ entries and $F$ columns, including the target feature. The image dataset `images.npy` is of size $(N, H, W)$, where $N$, $H$, and $W$ correspond to the number of samples, image height, and image width, respectively. Each image corresponds to the row at the same index of the tabular dataset. These datasets can be found in the `data/` folder in the given file structure.

A code snippet that loads and displays some of the data is provided below.

### Load Tabular Data

In [None]:
df = pd.read_csv(os.path.join('data', 'tabular.csv'))
cols = ["V9", "V12", "V19", "V20", "V21", "V23", "V24", "V29", "V31", "V36", "V37", "V46", "V47", "V51", "V52", "V54", "V55", "V58"]
df.iloc[:, 50:].tail()

### Load Image Data

In [None]:
with open(os.path.join('data', 'images.npy'), 'rb') as f:
    images = np.load(f)
    
print('Shape:', images.shape)
show_images(images[:10], n_row=2, n_col=5, figsize=[12,5])
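
Since each image is meant to correspond to the tabular row at the same index, it may help to confirm that both modalities agree on $N$ before any rows are dropped. The following is a minimal sketch under that assumption; `check_alignment` is a hypothetical helper (not part of `util`), and the demo arrays are stand-ins for the real files in `data/`.

```python
import numpy as np
import pandas as pd

def check_alignment(tabular: pd.DataFrame, images: np.ndarray) -> tuple:
    """Return (n_samples, height, width) after checking that both inputs agree on N."""
    if images.ndim != 3:
        raise ValueError(f"expected images of shape (N, H, W), got {images.shape}")
    n, h, w = images.shape
    if len(tabular) != n:
        raise ValueError(f"tabular has {len(tabular)} rows but images has {n} entries")
    return n, h, w

# Hypothetical stand-in data, used only to illustrate the check
demo_tabular = pd.DataFrame({"V0": [1.0, 2.0, 3.0], "target": [0.1, 0.2, 0.3]})
demo_images = np.zeros((3, 8, 8))
print(check_alignment(demo_tabular, demo_images))  # (3, 8, 8)
```

With the real data, the call would presumably be `check_alignment(df, images)` right after loading.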

## Data Exploration & Preparation

### 1. Descriptive Analysis

In [None]:
# df.head()
# get summary statistics
df.describe()
# count number of occurrences where V0 has value more than 8315

# df[df['V0'] == 65715].count()

### 2. Detection and Handling of Missing Values

In [None]:
df[df.isna().any(axis=1)]
cols = ["V9", "V12", "V19", "V20", "V21", "V23", "V24", "V29", "V31", "V36", "V37", "V46", "V47", "V51", "V52", "V54", "V55", "V58"]

# Encoding and decoding functions
def encode_categories(column):
    return column.str.replace('C', '').astype(float)
# 
# def decode_categories(column):
#     return 'C' + column.round().astype(int).astype(str)
# 
for col in cols:
    df[col] = encode_categories(df[col])
# 
# imputer = KNNImputer(n_neighbors=2)
# df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
# 
# df_imputed

In [None]:
from sklearn.impute import SimpleImputer

cols_cat = ["V9", "V12", "V19", "V20", "V21", "V23", "V24", "V29", "V31", "V36", "V37", "V46", "V47", "V51", "V52", "V54", "V55", "V58"]
# every remaining column (including the target) is treated as numerical
cols_num = [col for col in df.columns if col not in cols_cat]

# fill missing numerical values with the column mean
imputer = SimpleImputer(strategy="mean")
df[cols_num] = imputer.fit_transform(df[cols_num])

# drop the rows that still contain missing (categorical) values
df.dropna(inplace=True)
df
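
When the same imputation is later reused at prediction time, one option is to learn the column means on the training rows only and apply them unchanged to held-out rows. The sketch below illustrates that idea on made-up data; the frames `train` and `test` are hypothetical and do not come from `tabular.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train/test split
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"a": [np.nan], "b": [np.nan]})

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the means from the training rows ...
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# ... and transform reuses those same means for the held-out rows
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
print(test_filled)
```

Here the held-out row is filled with the training means (2.0 for `a`, 15.0 for `b`) rather than with statistics computed from the held-out data itself.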

### 3. Detection and Handling of Outliers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

print(cols_num)
# For numerical columns, remove outliers that are 1.5 times the interquartile range
for col in cols_num:
    IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
    lower_bound = df[col].quantile(0.25) - (1.5 * IQR)
    upper_bound = df[col].quantile(0.75) + (1.5 * IQR)
    median_value = df[col].median()
    
    # Replace with median
    df[col] = df[col].mask((df[col] < lower_bound) | (df[col] > upper_bound), median_value)
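
To see what the IQR rule in the cell above does, here is a small worked example on a made-up series. `replace_iqr_outliers` is a hypothetical helper that wraps the same logic for a single column; it is a sketch, not the code used for the final model.

```python
import pandas as pd

def replace_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.mask((series < lower) | (series > upper), series.median())

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
# Q1 = 2, Q3 = 4, IQR = 2, so the accepted range is [-1, 7] and the median is 3
cleaned = replace_iqr_outliers(demo)
print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 3.0]
```

Note that the median is computed before the replacement, so the outlier itself still influences the value it is replaced with.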

### 4. Detection and Handling of Class Imbalance

In [None]:
# plot distribution of the target column
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(df['target'], kde=True)

### 5. Understanding Relationship Between Variables

In [None]:

corr_matrix = df.corr()
def custom_annotator(val):
    if abs(val) > 0.7 and val != 1:  # Exclude self-correlation
        return f'{val:.2f}'
    else:
        return ''

# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=np.vectorize(custom_annotator)(corr_matrix), fmt="", cmap='coolwarm')
plt.show()
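
With many columns the annotated heatmap can be hard to read, so a plain table of the strongly correlated pairs may be a useful complement. The sketch below lists each pair once, using the same 0.7 threshold as the heatmap annotation; `high_corr_pairs` is a hypothetical helper and the demo frame is made up.

```python
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """List each pair of distinct columns whose absolute correlation exceeds threshold."""
    corr = frame.corr()
    cols = corr.columns
    rows = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so no self-pairs
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                rows.append((cols[i], cols[j], value))
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    return pd.DataFrame(rows, columns=["feature_a", "feature_b", "correlation"])

# Hypothetical toy data: b is almost a multiple of a, c is only weakly related
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(demo)
print(pairs)
```

On the toy data only the pair (`a`, `b`) is reported. On the real data the call would presumably be `high_corr_pairs(df)`.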


### 6. Data Visualization

In [None]:
# plot correlation of variables with target
plt.figure(figsize=(12, 8))
corr_target = df.corr()['target'].sort_values(ascending=False)
sns.barplot(x=corr_target.index, y=corr_target.values)
plt.xticks(rotation=90)
plt.show()

## Data Preprocessing

### 7. General Preprocessing

In [None]:
# preprocess the data
# removing duplicates
df = df.drop_duplicates()
# drop column V2, V1, V20, V19, V13, V50, V5, V29, V55, V56, V59, V58, V57, V23, V3
df.drop(columns=['V2', 'V1', 'V20', 'V19', 'V13', 'V50', 'V5', 'V29', 'V55', 'V56', 'V59', 'V58', 'V57', 'V23', 'V3'], inplace=True)
df

### 8. Feature Selection

In [None]:
# use PCA to reduce the dimensionality of the features (the target is excluded)
from sklearn.decomposition import PCA
pca = PCA(n_components=30)
pca.fit(df.drop(columns=['target']))
pca.explained_variance_ratio_
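
One way to judge whether 30 components is a reasonable choice is to look at the cumulative explained variance and pick the smallest number of components that reaches a chosen share. The sketch below does this on synthetic data; the 95% threshold and the generated matrix `X` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 10 features driven by only 3 latent directions plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA().fit(X)  # keep all components so the full spectrum is visible
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, cumulative[:n_components])
```

Because the synthetic data has only three underlying directions, at most three components are needed here; on the real features the curve would have to be inspected before settling on a value.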

### 9. Feature Engineering

In [None]:
# remove the images that are not in the tabular data
images = images[df.index]


## Modeling & Evaluation

### 10. Creating models

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from torch.utils.data import DataLoader, Dataset

class CustomDataset(Dataset):
    def __init__(self, tabular_data, image_data, labels):
        self.tabular_data = tabular_data
        self.image_data = image_data
        self.labels = labels

    def __len__(self):
        return len(self.tabular_data)

    def __getitem__(self, idx):
        tabular_sample = self.tabular_data[idx]
        image_sample = self.image_data[idx]
        label_sample = self.labels[idx]

        image_sample = torch.tensor(image_sample, dtype=torch.float32).unsqueeze(0)
        
        return tabular_sample, image_sample, label_sample
    
class Model(nn.Module):  
    """
    This class represents an AI model.
    """
    
    def __init__(self):
        """
        Constructor for Model class.
  
        Parameters
        ----------
        self : object
            The instance of the object passed by Python.
        """
        super(Model, self).__init__()
        # CNN for image data
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Add more CNN layers as needed...
        )
        
        # Determine the size of the flattened CNN output
        with torch.no_grad():
            self.cnn_output_size = self._get_cnn_output_size(torch.zeros((1, 1, 8, 8)))

        # Fully connected network for tabular data
        self.fc_tabular = nn.Linear(30, 128)

        # Combine CNN and tabular pathways
        self.fc_combined = nn.Sequential(
            nn.Linear(self.cnn_output_size + 128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def _get_cnn_output_size(self, sample_input):
        return self.cnn(sample_input).view(sample_input.size(0), -1).shape[1]
    
    def forward(self, tabular_data, image_data):
        # Process image data through CNN
        image_output = self.cnn(image_data)
        image_output = image_output.view(image_output.size(0), -1)  # Flatten the output

        # Process tabular data through fully connected network
        tabular_output = self.fc_tabular(tabular_data)

        # Combine the outputs from both pathways
        combined_output = torch.cat((image_output, tabular_output), dim=1)

        # Pass the combined output through additional layers
        output = self.fc_combined(combined_output)
        return output
    
    def fit(self, X_dict, y):
        """
        Train the model using the input data.
        
        Parameters
        ----------
        X_dict : dictionary with the following entries:
            - tabular: pandas Dataframe of shape (n_samples, n_features)
            - images: ndarray of shape (n_samples, height, width)
            Training data.
        y : pandas Dataframe of shape (n_samples,)
            Target values.
            
        Returns
        -------
        self : object
            Returns an instance of the trained model.
        """
        df = X_dict['tabular']
        images = X_dict['images']
        processed_df, y_aligned = self.process_df(df, y)
        # remove the images that are not in the tabular data
        images = images[processed_df.index]

        # Convert to PyTorch tensors
        X_tensor = torch.tensor(processed_df.values, dtype=torch.float32)
        y_tensor = torch.tensor(y_aligned.values, dtype=torch.float32)


        # print sizes
        print('X_tensor:', X_tensor.shape)
        print('y_tensor:', y_tensor.shape)
        
        # DataLoader
        dataset = CustomDataset(X_tensor, images, y_tensor)
        dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

        # Loss function and optimizer
        criterion = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

        # Training loop
        for epoch in range(50):
            for tabular_inputs, image_inputs, labels in dataloader:
                # Forward pass
                outputs = self(tabular_inputs, image_inputs)
                # drop only the last dimension so a batch of size 1 keeps its batch axis
                outputs = outputs.squeeze(1)
                loss = criterion(outputs, labels)

                # Backward and optimize
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            scheduler.step()
            # Print the loss of the last batch of this epoch
            print(f'Epoch {epoch}, Loss: {loss.item()}')

        return self
    
    def predict(self, X_dict):
        """
        Use the trained model to make predictions.
        
        Parameters
        ----------
        X_dict : dictionary with the following entries:
            - tabular: pandas Dataframe of shape (n_samples, n_features)
            - images: ndarray of shape (n_samples, height, width)
            Input data.
            
        Returns
        -------
        pandas Dataframe of shape (n_samples,)
           Predicted target values per element in X_dict.
           
        """
        processed_df = self.process_df_predict(X_dict['tabular'])
        X_tensor = torch.tensor(processed_df.values, dtype=torch.float32)

        # Process the image data
        images = X_dict['images']
        # Ensure images are in the correct format (as tensors)
        images = [torch.tensor(image, dtype=torch.float32).unsqueeze(0) for image in images]
        images_tensor = torch.stack(images)

        # No gradient calculation needed
        with torch.no_grad():
            outputs = self(X_tensor, images_tensor)
        
        # Convert outputs to a pandas DataFrame
        predictions = pd.DataFrame(outputs.numpy(), columns=['predictions'])
        return predictions

    def save(self, path):
        """
        Save the trained model to a file.
        
        Parameters
        ----------
        path : str
            Path to the file to save the model.
        """
        # Model is itself the nn.Module, so there is no separate self.model attribute
        torch.save(self.state_dict(), path)
    
    def load(self, path):
        """
        Load the trained model from a file.
        
        Parameters
        ----------
        path : str
            Path to the file to load the model from.
        """
        # Model is itself the nn.Module, so there is no separate self.model attribute
        self.load_state_dict(torch.load(path))

    def process_df(self, df, y):
        """
        Process the training dataframe and fit the preprocessing steps.
        
        Parameters
        ----------
        df : pandas Dataframe of shape (n_samples, n_features)
            Input data.
        y : pandas Dataframe of shape (n_samples,)
            Target values.
            
        Returns
        -------
        pandas Dataframe of shape (n_kept, 30)
           Processed data, indexed by each kept row's position in the input.
        pandas Series of shape (n_kept,)
           Target values of the kept rows.
        """
        # work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        # append target column to df
        df['target'] = y
        # use positional row labels so that they can also index the image array
        df = df.reset_index(drop=True)
        # keep the unmodified targets; the 'target' column below is imputed and clipped
        y_raw = df['target'].copy()

        # convert categorical columns to numerical
        cols_cat = ["V9", "V12", "V19", "V20", "V21", "V23", "V24", "V29", "V31", "V36", "V37", "V46", "V47", "V51", "V52", "V54", "V55", "V58"]
        def encode_categories(column):
            return column.str.replace('C', '').astype(float)
        for col in cols_cat:
            df[col] = encode_categories(df[col])
        
        # handle missing values: mean for numerical columns, mode for categorical ones
        # (cols_num includes 'target' here, as in the exploration above)
        cols_num = [col for col in df.columns if col not in cols_cat]
        imputer_mean = SimpleImputer(strategy="mean")
        imputer_freq = SimpleImputer(strategy="most_frequent")
        for col in cols_num:
            df[col] = imputer_mean.fit_transform(df[[col]]).ravel()

        for col in cols_cat:
            df[col] = imputer_freq.fit_transform(df[[col]]).ravel()

        # handle outliers: replace values outside 1.5 * IQR with the column median
        for col in cols_num:
            IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
            lower_bound = df[col].quantile(0.25) - (1.5 * IQR)
            upper_bound = df[col].quantile(0.75) + (1.5 * IQR)
            median_value = df[col].median()
            df[col] = df[col].mask((df[col] < lower_bound) | (df[col] > upper_bound), median_value)
        
        # drop duplicates
        df = df.drop_duplicates()

        # drop the 5 columns least correlated (in absolute value) with the target
        corr_target = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
        self.cols_to_drop = corr_target[-5:].index
        df = df.drop(columns=self.cols_to_drop)

        # Apply PCA, keeping the row labels of the surviving rows
        self.pca = PCA(n_components=30)
        df_pca = self.pca.fit_transform(df.drop(columns=['target']))
        df = pd.DataFrame(df_pca, index=df.index)

        # targets of the rows that survived duplicate removal
        y_aligned = y_raw.loc[df.index]

        return df, y_aligned
    
    def process_df_predict(self, df):
        """
        Process the dataframe.
        
        Parameters
        ----------
        df : pandas Dataframe of shape (n_samples, n_features)
            Input data.
            
        Returns
        -------
        pandas Dataframe of shape (n_samples, n_features)
           Processed data.
        """
        # work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        # convert categorical columns to numerical
        cols_cat = ["V9", "V12", "V19", "V20", "V21", "V23", "V24", "V29", "V31", "V36", "V37", "V46", "V47", "V51", "V52", "V54", "V55", "V58"]
        def encode_categories(column):
            return column.str.replace('C', '').astype(float)
        for col in cols_cat:
            df[col] = encode_categories(df[col])
        
        # handle missing values: mean for numerical columns, mode for categorical ones
        # (note: these statistics are recomputed from the prediction data)
        cols_num = [col for col in df.columns if col not in cols_cat]
        imputer_mean = SimpleImputer(strategy="mean")
        imputer_freq = SimpleImputer(strategy="most_frequent")
        for col in cols_num:
            df[col] = imputer_mean.fit_transform(df[[col]]).ravel()

        for col in cols_cat:
            df[col] = imputer_freq.fit_transform(df[[col]]).ravel()

        # handle outliers: replace values outside 1.5 * IQR with the column median
        for col in cols_num:
            IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
            lower_bound = df[col].quantile(0.25) - (1.5 * IQR)
            upper_bound = df[col].quantile(0.75) + (1.5 * IQR)
            median_value = df[col].median()
            df[col] = df[col].mask((df[col] < lower_bound) | (df[col] > upper_bound), median_value)
        
        # drop the same columns that were dropped during training
        df = df.drop(columns=self.cols_to_drop)

        # Apply the PCA fitted during training (do not refit on the prediction data)
        df_pca = self.pca.transform(df)
        df = pd.DataFrame(df_pca)

        return df
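
The model above measures the flattened CNN output by passing a dummy `(1, 1, 8, 8)` tensor through `self.cnn`. The same number can be checked by hand with the standard output-size formula for convolution and pooling layers. The sketch below assumes 8x8 inputs, as the dummy tensor does; `conv_output_size` is a hypothetical helper, not a PyTorch function.

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

side = 8  # side length of the dummy input used in Model.__init__
side = conv_output_size(side, kernel_size=3, stride=1, padding=1)  # Conv2d keeps 8
side = conv_output_size(side, kernel_size=2, stride=2)             # MaxPool2d halves to 4
flattened = 16 * side * side  # 16 output channels from the Conv2d layer
print(flattened)  # 256
```

Under that assumption, the first layer of `fc_combined` receives 256 + 128 = 384 input features. If the real images are not 8x8, the dummy tensor in `__init__` would need to match their actual size.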
        

### 11. Model Evaluation

In [None]:
# Import packages
import pandas as pd
import numpy as np
import os
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error, r2_score
from util import dict_train_test_split

In [None]:
# Load data
df = pd.read_csv(os.path.join('data', 'tabular.csv'))
with open(os.path.join('data', 'images.npy'), 'rb') as f:
    images = np.load(f)
    
# Exclude target column
X_columns = [col for col in df.columns if col != 'target']

# Create X_dict and y
X_dict = {
    'tabular': df[X_columns],
    'images': images
}
y = df['target']

In [None]:
X_dict_train, y_train, X_dict_test, y_test = dict_train_test_split(X_dict, y, ratio=0.9)
images = X_dict_train['images']

In [None]:
# Split train and test
X_dict_train, y_train, X_dict_test, y_test = dict_train_test_split(X_dict, y, ratio=0.9)

# Train and predict
model = Model()
model.fit(X_dict_train, y_train)
y_pred = model.predict(X_dict_test)

# Evaluate model prediction
# Learn more: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
print("MSE: {0:.2f}".format(mean_squared_error(y_test, y_pred)))

### 12. Hyperparameters Search