# Question A4

In this section, we explore how such a neural network can be used in real-world scenarios.

#### Use the real recording 'record.wav' as a test sample. Preprocess the data using the provided preprocessing script (data_preprocess.ipynb) and prepare the dataset.
Do a model prediction on the sample test dataset and obtain the predicted label using a threshold of 0.5. The model used is the optimized pretrained model using the selected optimal batch size and optimal number of neurons.
Find the features that matter most to the model prediction for the test sample using SHAP. Plot the local feature importance with a force plot and explain your observations. (Refer to the documentation and these three useful references:
https://christophm.github.io/interpretable-ml-book/shap.html#examples-5,
https://towardsdatascience.com/deep-learning-model-interpretation-using-shap-a21786e91d16,  
https://medium.com/mlearning-ai/shap-force-plots-for-classification-d30be430e195)



1. First, import the relevant libraries.

In [1]:
import tqdm
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from scipy.io import wavfile as wav

from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
from common_utils import set_seed

# setting seed
set_seed()

To reduce repeated code, place your

- network (MLP defined in QA1)
- torch datasets (CustomDataset defined in QA1)
- loss function (loss_fn defined in QA1)

in a separate file called common_utils.py.

Import them into this file. You will not be penalised again for any error in QA1 here, as the code in QA1 will not be re-marked.

The following code cell will not be marked.


In [2]:
# YOUR CODE HERE
from common_utils import MLP, CustomDataset, loss_fn

2. Install and import shap

In [3]:
# YOUR CODE HERE
!pip install shap
import shap






3. Read the csv data preprocessed from 'record.wav', using variable name 'df', and fill the size of 'df' in 'size_row' and 'size_column'.

In [5]:
df = 0
size_row = 0
size_column = 0
# YOUR CODE HERE
df = pd.read_csv('new_record.csv')
size_row = df.shape[0]
size_column = df.shape[1]
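As a small illustration of the step above, `DataFrame.shape` returns a `(rows, columns)` tuple that can be unpacked directly. The sketch below uses an in-memory CSV with made-up column names and values (all `demo_*` names are assumptions, not the contents of 'new_record.csv'):

```python
import io

import pandas as pd

# Hypothetical one-row CSV standing in for the preprocessed recording.
demo_csv = io.StringIO("filename,feat_a,feat_b\nrecord.wav,0.1,0.2\n")
demo_df = pd.read_csv(demo_csv)

# shape is a (number_of_rows, number_of_columns) tuple.
demo_rows, demo_cols = demo_df.shape
```

Note that the column count still includes 'filename' at this point, because that column is only dropped in the later preprocessing step.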

4. Preprocess the data to obtain the test data, and save the test data as a NumPy array.

In [6]:

def preprocess(X_train, df):
    """Preprocess the dataset to obtain the test data. Remember to drop the 'filename' column, as in Q1.
    """
    # YOUR CODE HERE
    # drop useless columns for df
    columns_to_drop = ['filename']
    df = df.drop(columns=columns_to_drop)

    # scale the data
    scaler = preprocessing.StandardScaler()

    scaler.fit(X_train)
    X_test_scaled_eg = scaler.transform(df)

    return X_test_scaled_eg

from common_utils import split_dataset

# preprocess old training and testing data
old_df = pd.read_csv('simplified.csv')
old_df['label'] = old_df['filename'].str.split('_').str[-2]
columns_to_drop = ['filename', 'label']

# perform data splitting first
X_train, y_train, X_test, y_test = split_dataset(old_df, columns_to_drop=columns_to_drop, test_size=0.25, random_state=100)

X_test_scaled_eg = preprocess(X_train, df)
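The important detail in this step is that the scaler is fitted on the training features only and then reused, unchanged, on the new recording, so the test sample is expressed in the same units the model saw during training. A minimal sketch with synthetic data (the `*_demo` names and the random values are illustrative assumptions, not the assignment's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 training rows and one new sample, 3 features each.
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_new_demo = rng.normal(loc=5.0, scale=2.0, size=(1, 3))

# Fit on the training data only, so the new sample cannot influence the statistics.
scaler_demo = StandardScaler().fit(X_train_demo)

# Both sets are transformed with the training mean and standard deviation.
X_train_scaled_demo = scaler_demo.transform(X_train_demo)
X_new_scaled_demo = scaler_demo.transform(X_new_demo)
```

Fitting a fresh scaler on a single test row would give a standard deviation of zero for every feature, which is why the training split is passed into `preprocess`.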

5. Do a model prediction on the sample test dataset and obtain the predicted label using a threshold of 0.5. The model used is the optimized pretrained model using the selected optimal batch size and optimal number of neurons. Note: Please define the variable of your final predicted label as 'pred_label'.

In [7]:
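# YOUR CODE HERE
# set path
model_path = 'model.pth'

# model hyperparameters
optimal_combination = [256, 256, 128]
no_features = 77 # feature number
no_labels = 1  # output label number
lr = 0.001  # learning rate

# initialize model
# MLP.__init__() requires no_features, no_hidden and no_labels; calling MLP()
# without them raises a TypeError.
# Assumption: no_hidden accepts the list of hidden-layer sizes. Adjust this call
# (and the add_layer loop below) to match the MLP defined in common_utils.py.
model = MLP(no_features, list(optimal_combination), no_labels)
para_list = []
optimal_combination.insert(0, no_features)
optimal_combination.append(no_labels)
for i in range(len(optimal_combination) - 1):
    para_list.append((optimal_combination[i], optimal_combination[i + 1]))
for j in range(len(para_list)):
    if j == len(para_list) - 1:
        model.add_layer(f"Linear{j}", nn.Linear(*para_list[j]))
        model.add_layer(f"Sigmoid{j}", nn.Sigmoid())
    else:
        model.add_layer(f"Linear{j}", nn.Linear(*para_list[j]))
        model.add_layer(f"ReLU{j}", nn.ReLU())
        model.add_layer(f"Dropout{j}", nn.Dropout(0.2))

# load model
model.load_state_dict(torch.load(model_path))

# predict on the preprocessed sample with dropout disabled
model.eval()
with torch.no_grad():
    pred_prob = model(torch.tensor(X_test_scaled_eg, dtype=torch.float32))

# apply the 0.5 threshold to obtain the predicted label
pred_label = (pred_prob > 0.5).int()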
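Loading a pretrained checkpoint only works when the rebuilt network has the same layer names and shapes as the one that was saved. The sketch below illustrates that round trip and the 0.5 threshold on a toy network; `build_demo_model`, the layer sizes and the temporary checkpoint file are hypothetical stand-ins, not the assignment's MLP or 'model.pth'.

```python
import os
import tempfile

import torch
from torch import nn


def build_demo_model(n_features: int) -> nn.Sequential:
    """Toy stand-in for the assignment's MLP (hypothetical architecture)."""
    return nn.Sequential(
        nn.Linear(n_features, 8),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(8, 1),
        nn.Sigmoid(),
    )


torch.manual_seed(0)
trained = build_demo_model(4)  # pretend this model has already been trained

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "demo_model.pth")
    # Save only the weights (the state dict), not the whole model object.
    torch.save(trained.state_dict(), ckpt_path)

    # The restored model must be built with the same layers before loading.
    restored = build_demo_model(4)
    restored.load_state_dict(torch.load(ckpt_path))

restored.eval()  # disable dropout so predictions are deterministic
demo_x = torch.randn(3, 4)
with torch.no_grad():
    demo_prob = restored(demo_x)  # sigmoid outputs, one probability per row

# Label is 1 when the probability exceeds the 0.5 threshold, otherwise 0.
demo_pred_label = (demo_prob > 0.5).int()
```

If the architecture differed from the saved one (for example a different hidden size), `load_state_dict` would raise an error about missing or mismatched keys, which is a useful check that the optimal configuration was reconstructed correctly.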

6. Find the most important features for the model prediction on your test sample using SHAP. Create an instance of DeepSHAP, called DeepExplainer, using the training dataset: https://shap-lrjball.readthedocs.io/en/latest/generated/shap.DeepExplainer.html.

Plot the local feature importance with a force plot and explain your observations.  (Refer to the documentation and these three useful references:
https://christophm.github.io/interpretable-ml-book/shap.html#examples-5,
https://towardsdatascience.com/deep-learning-model-interpretation-using-shap-a21786e91d16,  
https://medium.com/mlearning-ai/shap-force-plots-for-classification-d30be430e195)


In [None]:
'''
Fit the explainer on a subset of the data (you can try all of it, but it then gets slower).
Return approximate SHAP values for the model applied to the data given by X.
Plot the local feature importance with a force plot and explain your observations.
'''
# YOUR CODE HERE