## AAI3001 Small Project
#### **Group: SP_8**
 - Chua Chen Yi (2302822)
 - Wong Jun Jai (2302765)

## Methodology

#### Our proposed method of improving accuracy over the baseline score is as follows:
### 1. Identify a pre-trained model to be our baseline for fine-tuning
- Before starting our search, we manually re-created the same testing environment as the sample code. This included splitting the dataset into training, validation and testing sets identically to the sample. In addition, Test UA was used to determine overall performance.
    - *Minor note: We have noticed the way the dataset is split for testing is not the same as how it is described in the text. For example, the test set should only contain files from 'Ses01F' and only from the female speaker. However, checking the dataset class showed that this is not true.*
- With a simple evaluation pipeline in place, we randomly chose 10 different models publicly available on Hugging Face and scored them. Their scores ranged from 0.60 to 0.80 on the test dataset.
- We decided to use the same base model as the sample, **"facebook/wav2vec2-base"**, with the goal of fine-tuning it according to the rules to achieve at least 0.70.
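
For reference, the Test UA used for scoring is computed in this notebook's `test_model` as macro-averaged recall (the mean of the per-class recalls). The following is a minimal sketch on made-up labels, not project data, showing how it can differ from plain accuracy:

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels for illustration only (not taken from IEMOCAP)
y_true = ["A", "A", "H", "N", "N", "S"]
y_pred = ["A", "H", "H", "N", "S", "S"]

# UA: every class contributes equally, regardless of how many samples it has
ua = recall_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

print(f"UA (macro recall): {ua:.3f}")
print(f"Accuracy:          {acc:.3f}")
```

Because each class carries equal weight, a model that does well only on the larger classes will score lower on UA than on accuracy.
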
### 2. Perform Data Augmentation On Dataset
- #### Augmentation 1
    - Details
- #### Augmentation 2
    - Details
### 3. Identify Strengths & Weaknesses of Pre-Trained Model
- We started by training the model for a few epochs to get a general idea of how it performs. A sample confusion matrix is shown here:

             A   H    N   S
        A  103   8   35   1
        H    8  79   36   9
        N    2   7  109  53 <- Highest Error
        S    1   8   10  59
        
    In general, the model differentiates angry and happy emotions with a high degree of accuracy. However, it is not good at differentiating between neutral and sad emotions. Its greatest weakness is predicting a neutral emotion as a sad one.
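
The same reading can be reproduced numerically. The sketch below copies the sample confusion matrix above and assumes rows are actual labels and columns are predicted labels (consistent with the "neutral predicted as sad" remark); it derives per-class recall, the resulting UA, and the largest off-diagonal cell:

```python
import numpy as np

labels = ["A", "H", "N", "S"]

# Sample confusion matrix from above.
# Assumption: rows = actual labels, columns = predicted labels.
cm = np.array([
    [103,  8,  35,  1],
    [  8, 79,  36,  9],
    [  2,  7, 109, 53],
    [  1,  8,  10, 59],
])

# Per-class recall = correct predictions / number of actual samples in that class
recall = cm.diagonal() / cm.sum(axis=1)
ua = recall.mean()

# Largest single confusion = biggest off-diagonal cell
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
worst = (labels[i], labels[j], int(off_diag[i, j]))

for name, r in zip(labels, recall):
    print(f"{name}: recall = {r:.3f}")
print(f"UA (mean recall) = {ua:.3f}")
print(f"Largest confusion: actual {worst[0]} predicted as {worst[1]} ({worst[2]} samples)")
```
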
                
### 4. Use Secondary Model With Text Embeddings
- We have decided not to use extracted text embeddings as a feature of our first model, but instead to have a completely separate model transcribe the speech and perform sentiment analysis on the text. The final prediction will be a combination of both models.
    - This method allows us to:
        1. Manage our time better, as work to improve the performance of both models can be done independently.
        2. Change our base model if we find a better one.
        3. Choose more sophisticated speech-to-text and sentiment analysis models
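
Keeping the two models separate means their outputs only need to be joined per audio file afterwards. The sketch below shows one way this could look with pandas; the column names (`conf_A` ... `conf_S`, `sentiment`, `sentiment_score`) and the values are hypothetical and are not the notebook's actual CSV layout:

```python
import pandas as pd

# Hypothetical output of the 1st (audio) model: predicted id plus softmax confidences
audio_preds = pd.DataFrame({
    "ID": ["utt_1", "utt_2", "utt_3"],
    "Predict": [0, 2, 3],  # 0=A, 1=H, 2=N, 3=S, as in LABEL_MAPPING
    "conf_A": [0.70, 0.10, 0.05],
    "conf_H": [0.10, 0.10, 0.05],
    "conf_N": [0.10, 0.45, 0.40],
    "conf_S": [0.10, 0.35, 0.50],
})

# Hypothetical output of the 2nd (text sentiment) model
text_preds = pd.DataFrame({
    "ID": ["utt_1", "utt_2", "utt_3"],
    "sentiment": ["negative", "neutral", "positive"],
    "sentiment_score": [0.91, 0.66, 0.80],
})

# One row per file; validate= raises if an ID is duplicated on either side
merged = audio_preds.merge(text_preds, on="ID", how="left", validate="one_to_one")
print(merged)
```

Since the join key is just the file ID, either model can be swapped out without touching the other, as long as it writes the same columns.
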
### 5. Develop Algorithm To Merge Predictions
- Our final implementation consists of getting the confidence for each label using softmax, in addition to the original prediction. The predictions of the 2nd model are then merged into a single CSV file. A grid-search-like function identifies the optimal parameters *(highlighted in **Bold**)*.
- All models tested have improved scores. In general, models with <0.65 score will see a boost of 2-4%, while models >0.65 score will gain 0-2%. 
    ### Merge Algorithm
    - The algorithm that determines the final predictions is a combination of 3 different strategies:
        1. **Merging Strategy:** When to rely on the 2nd model?
        2. **Prediction Strategy:** How to rely on the 2nd model?
        3. **Mapping Strategy:** How to map sentiment to emotions?
    ### 5A. Merging Strategy
    We have identified 2 possible metrics to decide when to rely on the 2nd model:
    - **Entropy Threshold**
    We calculate the entropy of the 4 confidence scores, on the logic that a lower entropy means the model is more confident in its prediction. We refer to the 2nd model when the entropy is above the ***entrophy_threshold***.
    - **Argmax Threshold**
    We take the highest of the 4 confidence scores, i.e. the confidence of the argmax class. If this value is below the ***argmax_threshold***, we refer to the 2nd model for the final prediction.
    ### 5B. Prediction Strategy
    We have identified 3 possible strategies to decide how to rely on the 2nd model:
    - **Default**
    Prefer the prediction of the 2nd model in all situations.
    - **Ignore**
    The original model is very good at angry and happy emotions, so we always keep the original model's prediction when it detects angry or happy.
    - **Ignore When Match**
    If both models agree on the same prediction, we accept it as correct regardless of how low the confidence is.
    ### 5C. Mapping Strategy
    Because our sentiment analysis outputs 3 classes while we have 4 emotions, we need to map 1 class to 2 emotions: the 'Negative' sentiment is mapped to either 'Angry' or 'Sad'. We have implemented the following methods:
    - **Simple Mapping**
    We decide on a ***sentiment_threshold*** value, where anything above it is 'Sad'. This can also be ***fliped*** around, i.e. anything above it is 'Angry', as there is no defined way to map the negative sentiment.
    - **Reference Mapping**
    When a negative sentiment needs to be mapped, the confidence of 'Sad' and 'Angry' from the original model is looked up, and Argmax is used to return the most likely emotion.
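
The three strategies in 5A to 5C can be sketched together as below. This is an illustrative reconstruction under stated assumptions, not the notebook's actual merge code: the function names are made up, the parameter names mirror the highlighted ones above, positive and neutral sentiment are assumed to map to 'H' and 'N', and `neg_score` is a hypothetical confidence value attached to the sentiment.

```python
import numpy as np

EMOTIONS = ["A", "H", "N", "S"]  # index order follows LABEL_MAPPING


def entropy(probs):
    """Shannon entropy (natural log) of a confidence vector."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


def is_uncertain(probs, entrophy_threshold=None, argmax_threshold=None):
    """5A: True when the 2nd model should be consulted."""
    if entrophy_threshold is not None:
        return entropy(probs) > entrophy_threshold
    if argmax_threshold is not None:
        return float(np.max(probs)) < argmax_threshold
    return False


def map_sentiment(sentiment, neg_score, probs, mode="reference",
                  sentiment_threshold=0.5, fliped=False):
    """5C: map a 3-class sentiment onto one of the 4 emotions."""
    if sentiment == "positive":  # assumption: positive -> Happy
        return "H"
    if sentiment == "neutral":   # assumption: neutral -> Neutral
        return "N"
    if mode == "simple":
        # Above the threshold means 'Sad', unless the mapping is flipped
        above = neg_score > sentiment_threshold
        return "S" if above != fliped else "A"
    # Reference mapping: compare the 1st model's Angry and Sad confidences
    return "A" if probs[0] >= probs[3] else "S"


def merge(probs, sentiment, neg_score, strategy="default", **kwargs):
    """5B: combine both models into a single emotion label."""
    gate = {k: kwargs.pop(k)
            for k in ("entrophy_threshold", "argmax_threshold") if k in kwargs}
    first = EMOTIONS[int(np.argmax(probs))]
    second = map_sentiment(sentiment, neg_score, probs, **kwargs)
    if strategy == "ignore_when_match" and first == second:
        return first  # agreement is accepted regardless of confidence
    if not is_uncertain(probs, **gate):
        return first  # 1st model is confident enough
    if strategy == "ignore" and first in ("A", "H"):
        return first  # trust the 1st model on angry and happy
    return second


confident = [0.90, 0.04, 0.03, 0.03]
unsure = [0.05, 0.05, 0.46, 0.44]
print(merge(confident, "negative", 0.9, argmax_threshold=0.6))  # A (1st model kept)
print(merge(unsure, "negative", 0.9, argmax_threshold=0.6))     # S (2nd model used)
```

A grid-search-like sweep would then call `merge` for every file over a range of threshold values and keep the combination with the highest Test UA.
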
    
    
    
    
    


## All Notebook Settings & Parameters
***Everything that can be set and changed will be in here***

### Settings

In [1]:
# For Colab/Kaggle Notebook.
CLOUD = False # True for Colab/Kaggle, False for local instances

# Model Loading and Output
LOAD_PREVIOUSLY_TRAINED_MODEL = True
PREVIOUSLY_TRAINED_MODEL_PATH = r'C:\Users\ChenYi\Downloads\AAI3001_Project\TransferLearning\models\wav2vec2-large-e5'
PRE_TRAINED_MODEL_NAME = "facebook/wav2vec2-base"
SPEECH_TO_TEXT_MODEL_NAME = ""
SENTIMENT_MODEL_NAME = ""
OUTPUT_MODEL_NAME = "integrated-notebook-test-model"
OUTPUT_FOLDER = "./Training-Output"

# Output Selection
FORMAT_CSV_FOR_KAGGLE = False

# Training Selection
ONLY_RUN_TRAINING_LOOP = False # False runs the entire notebook. True stops the notebook once Test UA is computed
JUST_PRINT_TEST_UA = True      # Prints only the score. False prints out all other metrics
PLOT_TRAINING_GRAPHS = True    # Prints out train and val loss charts

# If you already have a label.csv, can skip predict and directly calculate Test UA
SKIP_PREDICT_AFTER_TRAINING = False

# You can also skip prediction and directly calculate Test UA for a given label.csv
BYPASS_MODEL_PREDICTION = False
PREDICTION_CSV_FILEPATH = '...csv' #

# Prediction Settings
USE_MULTI_MODEL = True

### Cloud/Local Instance Settings

In [2]:
if CLOUD: # Running on kaggle
    TSV = "/kaggle/input/......"
    AUDIO_DIRECTORY = '/kaggle/input/.....'
    REPORT_TO = 'none'
    NUM_WORKERS = 4
    BATCH = 32
    
else: # Running on local Jupyter instance
    TSV = r'C:\Users\ChenYi\Downloads\AAI3001_Project\labels\IEMOCAP_4.tsv'
    AUDIO_DIRECTORY = r'C:\Users\ChenYi\Downloads\AAI3001_Project\small-project\IEMOCAP_full_release_audio'
    REPORT_TO = 'all'
    NUM_WORKERS = 0 # Must set to zero to run
    BATCH = 8 # Adjust to fit model on VRAM

### Training Parameters

In [3]:
# Define label mapping
LABEL_MAPPING = {"A": 0, "H": 1, "N": 2, "S": 3}
#MAX_LEN = 6
EPOCH = 2
LEARNING_RATE = 0.00001
EARLY_STOPPING = 10
SEED = 2024
GRADIENT_ACC_STEPS = 2

## Imports

In [4]:
import os
import torch
import random
import logging
import librosa
import torchaudio
import torch.nn as nn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import Dataset as HFDataset, DatasetDict  # aliased: torch's Dataset (imported below) would otherwise shadow it
from torchaudio import functional as audioF
from torchaudio.transforms import Resample
from torchaudio.compliance import kaldi
from torch.utils.data import Dataset, DataLoader
from transformers import logging as log
from transformers import EarlyStoppingCallback, AdamW, get_scheduler
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import confusion_matrix, classification_report, recall_score, accuracy_score

#log.set_verbosity_error()

### Provided Code

In [5]:
class Pad_trunc_wav(nn.Module):
    def __init__(self, max_len: int = 6*16000):
        super(Pad_trunc_wav, self).__init__()
        self.max_len = max_len
    def forward(self,x):
        shape = x.shape
        length = shape[1]
        if length < self.max_len:
            multiple = self.max_len//length+1
            x_tmp = torch.cat((x,)*multiple, axis=1)
            x_new = x_tmp[:,0:self.max_len]
        else:
            x_new = x[:,0:self.max_len]
        return x_new

In [6]:
def setup_seed(seed=2021):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
setup_seed(SEED)

### Download Required Models

In [7]:
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(PRE_TRAINED_MODEL_NAME)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME,
    num_labels = 4)

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Dataset & Loading

In [8]:
class Mydataset(Dataset):
    def __init__(self, mode='train', max_len=6, seed=42, data_path=TSV, audio_dir=AUDIO_DIRECTORY):
        self.mode = mode
        data_all = pd.read_csv(data_path, sep='\t')
        SpkNames = np.unique(data_all['speaker'])  # ['Ses01F', 'Ses01M', ..., 'Ses05M']
        self.data_info = self.split_dataset(data_all, SpkNames)
        self.get_audio_dir_path = os.path.join(audio_dir)
        self.pad_trunc = Pad_trunc_wav(max_len * 16000)
         
        # Label encoding
        self.label = self.data_info['label'].astype('category').cat.codes.values
        self.ClassNames = np.unique(self.data_info['label'])
        self.NumClasses = len(self.ClassNames)
        if mode == 'train':
            print("Each emotion has the following number of training samples:")
            print([[self.ClassNames[i], (self.label == i).sum()] for i in range(self.NumClasses)])
        self.weight = 1 / torch.tensor([(self.label == i).sum() for i in range(self.NumClasses)]).float()

    def get_classname(self):
        return self.ClassNames

    def split_dataset(self, df_all, speakers):
        test_idx = df_all['speaker'] == speakers[0]  # 'Ses01F' as test set
        val_idx = df_all['speaker'] == speakers[1]   # 'Ses01M' as validation set
        train_idx = ~(test_idx | val_idx)             # Remaining speakers for training
        train_data_info = df_all[train_idx].reset_index(drop=True)
        val_data_info = df_all[val_idx].reset_index(drop=True)
        test_data_info = df_all[test_idx].reset_index(drop=True)

        if self.mode == 'train':
            data_info = train_data_info
        elif self.mode == 'val':
            data_info = val_data_info
        elif self.mode == 'test':
            data_info = test_data_info
        else:
            data_info = df_all
        return data_info

    def pre_process(self, wav):
        wav = self.pad_trunc(wav)
        return wav

    def __len__(self):
        return len(self.data_info)

    def __getitem__(self, idx):
        # Load the raw waveform from file using data_info to get filenames
        wav_path = os.path.join(self.get_audio_dir_path, self.data_info['filename'][idx]) + '.wav'
        wav, sample_rate = torchaudio.load(wav_path)

        # Preprocess the waveform (e.g., pad/truncate if needed)
        wav = self.pre_process(wav)

        # Apply Wav2Vec2 feature extractor
        inputs = feature_extractor(
            wav.squeeze().numpy(),  # Convert PyTorch tensor to numpy array
            sampling_rate=sample_rate,
            return_tensors="pt",  # Return PyTorch tensors
            padding=True  # Optionally pad to a fixed length
        )

        label = self.label[idx]

        # Return the processed input values and the label
        return {
            'input_values': inputs['input_values'].squeeze(0),  # Remove extra batch dimension
            'labels': torch.tensor(label, dtype=torch.long)}

In [9]:
# Instantiate datasets
train_dataset = Mydataset(mode='train', max_len=6)
val_dataset = Mydataset(mode='val', max_len=6)
test_dataset = Mydataset(mode='test', max_len=6)

Each emotion has the following number of training samples:
[['A', 874], ['H', 1358], ['N', 1324], ['S', 890]]


In [10]:
# Put test information into a dataframe for later use
data_info = test_dataset.data_info
test_dataframe = data_info[['filename', 'label']].copy()
test_dataframe['filepath'] = test_dataframe['filename'].apply(
    lambda x: os.path.join(test_dataset.get_audio_dir_path, f"{x}.wav"))

## Model Training Setup

In [11]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    accuracy = np.sum(preds == labels) / len(labels)
    return {"accuracy": accuracy}

# Define the early stopping callback
early_stopping = EarlyStoppingCallback(early_stopping_patience = EARLY_STOPPING)

training_args = TrainingArguments(
    output_dir=OUTPUT_FOLDER,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    per_device_train_batch_size = BATCH,
    per_device_eval_batch_size = BATCH,
    num_train_epochs = EPOCH,
    save_steps = 10,
    save_total_limit = 5,
    logging_dir="./logs",
    fp16 = True,
    dataloader_pin_memory = True,
    load_best_model_at_end = True,
    dataloader_num_workers = NUM_WORKERS,
    report_to = REPORT_TO,
    gradient_accumulation_steps = GRADIENT_ACC_STEPS,
    gradient_checkpointing = True
    #learning_rate = LEARNING_RATE
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping]
)



In [35]:
def test_model(model, test_dataframe):
    
    results = []
    total = test_dataframe.shape[0]
    count = 1
    
    # Run predictions on test dataset
    if not BYPASS_MODEL_PREDICTION:
        for index, row in test_dataframe.iterrows():

            # Display progress
            print(f'File {count} of {total}', end='\r')
            count += 1

            # Load audio file
            filename = row['filename'] + '.wav'
            audio_file = os.path.join(AUDIO_DIRECTORY, filename)
            y_ini, sr_ini = librosa.load(audio_file, sr = 16000)

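            inputs = feature_extractor(y_ini, sampling_rate=16000, return_tensors="pt")
            # Move inputs to the model's device (the model is on the GPU after training)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}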

            # Get the logits from the model
            with torch.no_grad():
                logits = model(**inputs).logits

            # Predict the class with the highest logit value
            predicted_class_id = torch.argmax(logits).item()

            # Append the result to the list
            results.append([row['filename'], predicted_class_id])

        # Format to dataframe
        prediction_dataframe = pd.DataFrame(results, columns=['ID', 'Predict'])
        
    # if user wants to directly load a csv instead of predicting
    else:
    
        prediction_dataframe = pd.read_csv(PREDICTION_CSV_FILEPATH)

    # Load true values
    true_dataframe = pd.read_csv(TSV, sep='\t')
    remap_dict = {
        0: 'A',
        1: 'H',
        2: 'N',
        3: 'S'}

    # Remap predicted values to match TSV
    prediction_dataframe['Predict'] = prediction_dataframe['Predict'].map(remap_dict)

    # Merge DataFrames on 'filename'
    df_merged = pd.merge(true_dataframe[['filename', 'label']],prediction_dataframe[['ID', 'Predict']],
                         left_on='filename',right_on='ID')

    # Extract true labels and predictions
    y_true = df_merged['label']
    y_pred = df_merged['Predict']
    
    # Compute and print UA score
    macro_recall = recall_score(y_true, y_pred, average='macro')
    print(f"Test UA: {macro_recall}")
    
    if not JUST_PRINT_TEST_UA:
        
        # Compute the confusion matrix
        cm = confusion_matrix(y_true, y_pred)

        # Create a DataFrame for the confusion matrix
        labels = sorted(y_true.unique())
        cm_df = pd.DataFrame(cm, index=labels, columns=labels)

        # Plot the confusion matrix
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
        plt.ylabel('Actual Labels')
        plt.xlabel('Predicted Labels')
        plt.title('Confusion Matrix')
        plt.show()

        # Print the confusion matrix
        print("Confusion Matrix:")
        print(cm_df)

        # Compute and print classification report
        report = classification_report(y_true, y_pred, labels=labels)
        print("\nClassification Report:")
        print(report)

## Start Training
***Or load a previously trained model***

In [21]:
if not LOAD_PREVIOUSLY_TRAINED_MODEL:
    trainer.train()
else:
    model = Wav2Vec2ForSequenceClassification.from_pretrained(PREVIOUSLY_TRAINED_MODEL_PATH)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(PREVIOUSLY_TRAINED_MODEL_PATH)
    print(f" Previously trained model loaded from: {PREVIOUSLY_TRAINED_MODEL_PATH}")

 Previously trained model loaded from: C:\Users\ChenYi\Downloads\AAI3001_Project\TransferLearning\models\wav2vec2-large-e5


In [36]:
if not SKIP_PREDICT_AFTER_TRAINING:
    test_model(model, test_dataframe)

Test UA: 0.6808498519024835


In [37]:
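if PLOT_TRAINING_GRAPHS and not LOAD_PREVIOUSLY_TRAINED_MODEL:
    
    # Collect per-epoch losses from the Trainer log history
    # (rebuilt here so this cell does not depend on variables from other cells)
    train_losses, val_losses = {}, {}
    for entry in trainer.state.log_history:
        if 'loss' in entry:
            train_losses[entry['epoch']] = entry['loss']
        if 'eval_loss' in entry:
            val_losses[entry['epoch']] = entry['eval_loss']
    epochs = sorted(set(train_losses) | set(val_losses))

    train_loss_list = [train_losses.get(epoch, None) for epoch in epochs]
    val_loss_list = [val_losses.get(epoch, None) for epoch in epochs]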

    # Plot only the epochs that have both losses
    plt.figure(figsize=(10, 5))
    plt.plot(epochs, train_loss_list, label='Training Loss', marker='o')
    plt.plot(epochs, val_loss_list, label='Validation Loss', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss per Epoch')
    plt.legend()
    plt.grid(True)
    plt.show()