# HW 2 Multimodal Machine Learning for Emotion Recognition

- main (this notebook) with sub notebooks
    1. audio (acoustic)
    2. text (lexical)
    3. visual
- IEMOCAP (Interactive Emotional Dyadic Motion Capture) database

# TODOs
In the file dataset.csv, you are provided with the relative paths to the audio, visual, and text feature files along with their corresponding emotion labels. There are 5 sessions, and each session has one male and one female speaker.

1. You can use different pooling methods (e.g., max pooling, mean pooling) to reduce the temporal dimension of the audio and visual files, or use your preferred temporal modeling (e.g., RNN, GRU, LSTM) to obtain one feature vector per data point.

2. Perform a 4-class emotion classification using your preferred classifier with the obtained feature vectors. Select the parameters using Grid Search (search over a range for hyper-parameters). Perform any additional steps you see fit to obtain the best results.

3. Report your classification results on individual modalities (vision, speech, and text) using the F1-micro metric on a 10-fold subject-independent cross-validation.

4. How do you handle the problem of class imbalance? Plot the confusion matrix for the 4 classes.

5. Use both early fusion (concatenate features from different modalities) and late fusion (majority vote over the outputs of the unimodal models) to obtain multimodal classification results. Report and compare the results for both fusion techniques.

6. Provide an interpretation of your results from the performed unimodal and multimodal classification tasks. Which one performs best and why?
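
Items 2 to 4 can be prototyped together before touching the real features. The following is a minimal sketch, not the assignment's reference solution. It assumes scikit-learn is installed, that the 10 folds correspond to the 10 speakers (5 sessions with 2 speakers each), and that class imbalance is handled with `class_weight="balanced"`. The data, the speaker ids, the linear SVM, and the `C` grid are all synthetic or hypothetical stand-ins.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 200 utterances, 16 features, 4 classes, 10 speakers.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 16
y = rng.integers(0, 4, size=n_samples)
X = rng.normal(size=(n_samples, n_features)) + y[:, None]  # class-dependent shift
speakers = np.arange(n_samples) % 10  # hypothetical speaker ids, 20 utterances each

outer_cv = GroupKFold(n_splits=10)  # with 10 equal-sized groups, each fold holds out one speaker
param_grid = {"svc__C": [0.1, 1.0, 10.0]}  # assumed search range
y_pred = np.empty_like(y)

for train_idx, test_idx in outer_cv.split(X, y, groups=speakers):
    # class_weight="balanced" reweights classes inversely to their frequency.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))
    # The inner search also splits by speaker, so tuning never mixes subjects across folds.
    search = GridSearchCV(model, param_grid, scoring="f1_micro", cv=GroupKFold(n_splits=3))
    search.fit(X[train_idx], y[train_idx], groups=speakers[train_idx])
    y_pred[test_idx] = search.predict(X[test_idx])

# Pooled out-of-fold predictions give one F1-micro score and one 4x4 confusion matrix.
f1_micro = f1_score(y, y_pred, average="micro")
cm = confusion_matrix(y, y_pred, labels=[0, 1, 2, 3])
```

For real data, `X` would be the pooled feature matrix of one modality, `y` the `emotion_labels` column, and `speakers` the `speakers` column of the CSV. For single-label multiclass problems, F1-micro equals accuracy.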

*Note*: You are only allowed to use the features and labels provided by us with this assignment. Please refrain from using the original data; assignments submitted with any other labels or data will not be graded.
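
The two fusion strategies from item 5 can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions: `early_fusion` and `late_fusion_majority` are hypothetical helper names, each modality is assumed to be already reduced to one row per utterance, and ties in the vote are broken toward the lowest label, which is only one of several reasonable choices.

```python
import numpy as np

def early_fusion(*modalities):
    """Concatenate per-utterance feature matrices along the feature axis."""
    return np.concatenate(modalities, axis=1)

def late_fusion_majority(*predictions, n_classes=4):
    """Majority vote over unimodal label predictions (ties go to the lowest label)."""
    stacked = np.stack(predictions, axis=0)  # shape: (n_models, n_samples)
    n_samples = stacked.shape[1]
    votes = np.zeros((n_samples, n_classes), dtype=int)
    for model_preds in stacked:
        # Each model adds one vote per utterance for its predicted class.
        votes[np.arange(n_samples), model_preds] += 1
    return votes.argmax(axis=1)

# Toy features: 3 utterances with 2, 4, and 5 dimensions per modality.
audio = np.ones((3, 2))
text = np.zeros((3, 4))
visual = np.ones((3, 5))
fused = early_fusion(audio, text, visual)  # shape (3, 11)

# Toy unimodal predictions for the same 3 utterances.
audio_pred = np.array([0, 1, 2])
text_pred = np.array([0, 3, 2])
visual_pred = np.array([1, 3, 0])
voted = late_fusion_majority(audio_pred, text_pred, visual_pred)  # array([0, 3, 2])
```

With early fusion a single classifier is trained on `fused`; with late fusion one classifier per modality is trained first and only their predicted labels are combined.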

# Imports + Load Data

In [1]:
# !pip install ipympl

In [2]:
# for all subnotebooks
import pandas as pd
import numpy as np
import librosa, librosa.display
import cv2

import os


from matplotlib import pyplot as plt

# for audio - maybe store in specific notebook?
import skimage.measure

# for text

# for visual
import ipympl as mpl # to show (image) plots

In [3]:
BASE = "/Users/brinkley97/Documents/development/"
CLASS_PATH = "classes/csci_535_multimodal_probabilistic_learning/"
DATASET_PATH = "datasets/hw_2"

SESSION_1 = "Session1/"
SESSION_2 = "Session2/"
SESSION_3 = "Session3/"
SESSION_4 = "Session4/"
SESSION_5 = "Session5/"

SES_01F = "Ses01F_impro01/"

FILE = "/iemocapRelativeAddressForFiles.csv"
file_paths = BASE + CLASS_PATH + DATASET_PATH + FILE

In [4]:
def load_data(file):
    """Read the CSV of relative feature paths and emotion labels into a DataFrame."""
    # pd.read_csv already returns a fresh DataFrame, so no extra copy is needed.
    return pd.read_csv(file)

In [5]:
# 4 classes: anger (0), sadness (1), happiness (2), and neutral (3)
dataset_paths_copy = load_data(file_paths)
dataset_paths_copy

| Unnamed: 0 | file_name_list | speakers | visual_features | acoustic_features | lexical_features | emotion_labels |
|---|---|---|---|---|---|---|
| 0 | Ses01F_impro01_F001 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 1 | Ses01F_impro01_M011 | M01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 0 |
| 2 | Ses01F_impro02_F002 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| 3 | Ses01F_impro02_F003 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 4 | Ses01F_impro02_F004 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 1331 | Ses05M_script03_2_M029 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1332 | Ses05M_script03_2_M039 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1333 | Ses05M_script03_2_M041 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1334 | Ses05M_script03_2_M042 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |


# Preprocessing Files

- [x] Build paths to specific files
- [ ] Reduce the time (temporal) dimension
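
One way to tick off the second item is to collapse the time axis with a pooling statistic. The following is a minimal sketch, assuming each feature file holds a 2-D array of shape `(T, D)` with time on the first axis. `pool_over_time` is a hypothetical helper, not part of the provided code.

```python
import numpy as np

def pool_over_time(features, method="mean"):
    """Collapse a (T, D) array into one fixed-length vector."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        # Already a single vector (assumed: one embedding per utterance).
        return features
    if method == "mean":
        return features.mean(axis=0)
    if method == "max":
        return features.max(axis=0)
    if method == "mean_max":
        # Concatenating both statistics doubles the output size to 2 * D.
        return np.concatenate([features.mean(axis=0), features.max(axis=0)])
    raise ValueError(f"Unknown pooling method: {method}")

# Toy utterance: 3 time steps, 2 feature dimensions.
toy = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]])
mean_vec = pool_over_time(toy, "mean")  # array([2., 6.])
max_vec = pool_over_time(toy, "max")    # array([3., 9.])
```

Mean pooling smooths over the whole utterance, while max pooling keeps the strongest activation per dimension; which one works better here is an empirical question.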

In [6]:
def build_paths_to_file(df_with_paths, specific_feature):
    """Build absolute paths to the feature files of one modality.

    Parameters:
    df_with_paths -- pd.DataFrame with one column of relative paths per modality
    specific_feature -- str (either visual_features, acoustic_features, or lexical_features)

    Return:
    list of str -- one absolute path per row of df_with_paths
    """

    # The relative paths in the CSV start with "/", so plain string concatenation is enough.
    relative_paths = df_with_paths[specific_feature]
    return list(BASE + CLASS_PATH + DATASET_PATH + relative_paths)

In [7]:
specific_feature = 'acoustic_features'
audio_features_paths = build_paths_to_file(dataset_paths_copy, specific_feature)

specific_feature = 'lexical_features'
text_features_paths = build_paths_to_file(dataset_paths_copy, specific_feature)


specific_feature = 'visual_features'
visual_features_paths = build_paths_to_file(dataset_paths_copy, specific_feature)
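
With the three path lists in hand, the per-utterance arrays can be loaded and stacked into one matrix per modality. The following is a minimal sketch, assuming every path points to a `.npy` file readable with `np.load` and that mean pooling over time is acceptable. `load_feature_matrix` is a hypothetical helper, and the demo writes temporary files so that it runs without the dataset.

```python
import os
import tempfile

import numpy as np

def load_feature_matrix(paths):
    """Load each .npy file and reduce it to one row per utterance."""
    rows = []
    for path in paths:
        arr = np.load(path)
        # Assumption: 2-D arrays are (T, D) with time first; 1-D arrays are already vectors.
        rows.append(arr.mean(axis=0) if arr.ndim == 2 else arr)
    return np.vstack(rows)

# Demo with two temporary files of different lengths (3 and 5 frames, 4 dimensions).
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i, n_frames in enumerate([3, 5]):
        path = os.path.join(tmp, f"utt_{i}.npy")
        np.save(path, np.full((n_frames, 4), float(i)))
        demo_paths.append(path)
    demo_matrix = load_feature_matrix(demo_paths)  # shape (2, 4)
```

On the real data the call would look like `load_feature_matrix(audio_features_paths)`, provided the files exist at the constructed locations.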

## Load Audio (Acoustic) features
- The acoustic features are embeddings from VGGish, a deep convolutional neural network pre-trained on audio spectrograms extracted from a large database of videos to recognize a wide variety of audio event categories [3]. VGGish generates the 128-dimensional embeddings after dimensionality reduction with Principal Component Analysis (PCA).

In [8]:
# audio_features_path = BASE + CLASS_PATH + DATASET_PATH + '/features/acoustic_features/' + SESSION_1 + SES_01F
# # audio_features
# female_s1 = audio_features_path + 'Ses01F_impro01_F001.npy'
# male_s1 = audio_features_path + 'Ses01F_impro01_M011.npy'

In [9]:
# female_s1_load = np.load(female_s1)
# male_s1_load = np.load(male_s1)

In [10]:
# audio_features_paths

In [20]:
# resampled_audio_features_path = BASE + CLASS_PATH + DATASET_PATH + '/resampled_features/test_save_features/' + SESSION_1 + SES_01F
resampled_audio_features_path = BASE + CLASS_PATH + DATASET_PATH + '/resampled_features/test_save_features/'
# resampled_audio_features_path

## Load Text (Lexical) features

In [10]:
pd.set_option('max_colwidth', 200)

In [11]:
# text_features

## Load Visual features

- The visual features are face embeddings obtained from a ResNet model [4] pre-trained on ImageNet. Each utterance has a $T \times 2048$ matrix, where $T$ denotes the number of frames.
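
If a recurrent model is used instead of pooling, the varying number of frames $T$ has to be handled explicitly. The following is a minimal sketch, assuming zero-padding to the longest utterance in a batch plus a boolean mask that marks real frames. `pad_sequences` is a hypothetical helper, and the toy arrays use 8 dimensions in place of 2048 to stay small.

```python
import numpy as np

def pad_sequences(sequences):
    """Zero-pad a list of (T_i, D) arrays to (N, T_max, D) and return a validity mask."""
    max_len = max(seq.shape[0] for seq in sequences)
    dim = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, : seq.shape[0]] = seq
        mask[i, : seq.shape[0]] = True  # True marks real frames, False marks padding
    return batch, mask

# Toy batch: three utterances with 3, 5, and 2 frames.
rng = np.random.default_rng(0)
toy_sequences = [rng.normal(size=(t, 8)) for t in (3, 5, 2)]
batch, mask = pad_sequences(toy_sequences)

# The mask lets downstream code ignore padding, e.g., for a masked mean over time.
masked_mean = (batch * mask[..., None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
```

The padded `batch` and `mask` can then be handed to whichever sequence model is chosen; how the mask is consumed depends on that framework.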

In [19]:
visual_features_paths