# HW 2 Multimodal Machine Learning for Emotion Recognition

- main (this notebook) with sub notebooks
    1. audio (acoustic)
    2. text (lexical)
    3. visual
- IEMOCAP (Interactive Emotional Dyadic Motion Capture) database

In [1]:
visual_main_notebook = 'visual-main.ipynb'
audio_main_notebook = 'audio-main.ipynb'
text_main_notebook = 'text-main.ipynb'

In [2]:
# %load visual_main_notebook

In [3]:
# %run 'visual-main.ipynb'

# TODOs


# Imports + Load Data

In [4]:
# !pip install ipympl

In [5]:
# for all subnotebooks
import pandas as pd
import numpy as np
import seaborn as sns
import librosa, librosa.display
import cv2

import os


from matplotlib import pyplot as plt

# for audio - maybe store in specific notebook?
# import skimage.measure


# for text

# for visual
import ipympl as mpl # to show (image) plots
from sklearn import svm, datasets # per GridSearchCV documentation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

In [6]:
BASE = "/Users/brinkley97/Documents/development/"
CLASS_PATH = "classes/csci_535_multimodal_probabilistic_learning/"
DATASET_PATH = "datasets/hw_2"

# SESSION_1 = "Session1/"
# SESSION_2 = "Session2"
# SESSION_3 = "Session3/"
# SESSION_4 = "Session4/"
# SESSION_5 = "Session5/"

# SES_01F = "Ses01F_impro01/"

FILE = "/iemocapRelativeAddressForFiles.csv"
file_paths = BASE + CLASS_PATH + DATASET_PATH + FILE

In [7]:
def load_data(file):
    """Read the CSV at `file` and return a copy of the resulting DataFrame."""
    original_data = pd.read_csv(file)
    copy_of_data = original_data.copy()
    return copy_of_data

In [8]:
# 4 classes: anger (0), sadness (1), happiness (2), and neutral (3)
dataset_paths_copy = load_data(file_paths)
dataset_paths_copy

| Unnamed: 0 | file_name_list | speakers | visual_features | acoustic_features | lexical_features | emotion_labels |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Ses01F_impro01_F001 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 1 | Ses01F_impro01_M011 | M01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 0 |
| 2 | Ses01F_impro02_F002 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| 3 | Ses01F_impro02_F003 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 4 | Ses01F_impro02_F004 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 1331 | Ses05M_script03_2_M029 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1332 | Ses05M_script03_2_M039 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1333 | Ses05M_script03_2_M041 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1334 | Ses05M_script03_2_M042 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |


# Preprocessing Files

- [x] Build paths to specific files
- [ ] Reduce the time (temporal) dimension

In [9]:
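def build_paths_to_file(df_with_paths, specific_feature):
    """Build absolute paths to the feature files of one modality.
    
    Parameters:
    df_with_paths -- DataFrame loaded from iemocapRelativeAddressForFiles.csv
    specific_feature -- str (either visual_features, acoustic_features, or lexical_features)
    
    Return:
    tuple -- (list of absolute file paths,
              DataFrame with file names, speakers, relative paths, labels, and a file_with_path column)
    """
    
    # Copy the selection so that adding a column never writes into a view of the input
    features = df_with_paths.loc[0:, ['file_name_list', 'speakers', specific_feature, 'emotion_labels']].copy()
    features_path = features.loc[0:, specific_feature]
    # Prefix every relative path with the dataset root
    features["file_with_path"] = BASE + CLASS_PATH + DATASET_PATH + features_path
    list_of_features = list(features["file_with_path"])
    
    return list_of_features, features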

In [25]:
specific_feature = 'acoustic_features'
audio_features_paths, audio_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)

specific_feature = 'lexical_features'
text_features_paths, text_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)


specific_feature = 'visual_features'
visual_features_paths, visual_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)

## Load Audio (Acoustic) features
- The acoustic features come from VGGish, a deep convolutional neural network pre-trained on audio spectrograms extracted from a large database of videos to recognize a large variety of audio event categories [3]. The 128-dimensional embeddings were generated by VGGish after dimensionality reduction with Principal Component Analysis (PCA).
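
As a minimal sketch of how one acoustic feature file could be loaded and inspected, the snippet below writes a synthetic array to a temporary `.npy` file and reads it back with `np.load`. Only the 128-dimensional embedding size comes from the description above. The file name, the number of time steps (7), the helper name `load_embedding`, and the assumption that each file stores a `(T, 128)` array are illustrative and have not been checked against the dataset.

```python
import os
import tempfile

import numpy as np


def load_embedding(path):
    """Load one feature file and return it as a 2-D float32 array (time steps x dimensions)."""
    embedding = np.load(path)
    # Promote a single vector to shape (1, D) so downstream code can always index [t, d].
    return np.atleast_2d(embedding).astype(np.float32)


# Synthetic stand-in for a real file (assumed layout: T x 128).
rng = np.random.default_rng(0)
fake_utterance = rng.normal(size=(7, 128))

with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, "fake_utterance.npy")  # hypothetical file name
    np.save(fake_path, fake_utterance)
    loaded = load_embedding(fake_path)

print(loaded.shape)
```

With the real data, the same helper could be mapped over `audio_features_paths`, provided the stored arrays really follow the assumed layout.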

In [11]:
# audio_features_path = BASE + CLASS_PATH + DATASET_PATH + '/features/acoustic_features/' + SESSION_1 + SES_01F
# # audio_features
# female_s1 = audio_features_path + 'Ses01F_impro01_F001.npy'
# male_s1 = audio_features_path + 'Ses01F_impro01_M011.npy'

In [12]:
# female_s1_load = np.load(female_s1)
# male_s1_load = np.load(male_s1)

In [13]:
type(audio_features_paths)

list

## Load Text (Lexical) features

In [14]:
pd.set_option('max_colwidth', 200)

In [15]:
np.shape(text_features_paths)

(1336,)

## Load Visual features

- The visual features are face embeddings obtained from a ResNet model [4] pre-trained on ImageNet. Each utterance is represented by a $T \times 2048$ matrix, where $T$ denotes the number of frames.
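
Because the number of frames typically varies from one utterance to the next, a classifier that expects fixed-length inputs needs one summary vector per utterance. A minimal sketch of mean pooling over the frame axis follows. The random matrices stand in for real embeddings, the frame counts (5, 12, 30) are arbitrary, and the helper name `mean_pool` is hypothetical.

```python
import numpy as np


def mean_pool(frames):
    """Average a (T, D) matrix over its time axis and return a length-D vector."""
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (T, D) matrix")
    return frames.mean(axis=0)


# Three synthetic utterances with different numbers of frames.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(t, 2048)) for t in (5, 12, 30)]

# Stack the pooled vectors into one design matrix: one row per utterance.
X = np.stack([mean_pool(u) for u in utterances])
print(X.shape)  # (3, 2048)
```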

# 3. Classification Results on the Visual Modality

In [18]:
# visual_f1_micro
# audio_f1_micro
# text_f1_micro

# 4. Class Imbalance

- Visual: Below, we can see how many utterances carry each emotion label, listed in label order (0 to 3). The ordering from least to greatest is (2) happiness, (1) sadness, (0) anger, and (3) neutral. Although the classes are imbalanced, I haven't come across any problems with this modality that I can attribute to the imbalance.

(328, 308, 180, 520)
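
To put a number on the imbalance, the sketch below turns the four class counts into proportions and into inverse-frequency weights of the form `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. Whether such weights would help on this dataset is untested here; the snippet only shows the arithmetic.

```python
import numpy as np

class_names = ["anger", "sadness", "happiness", "neutral"]  # labels 0..3
counts = np.array([328, 308, 180, 520])                     # utterances per label

n_samples = int(counts.sum())
proportions = counts / n_samples
# Rare classes get weights above 1, frequent classes get weights below 1.
balanced_weights = n_samples / (len(counts) * counts)

for name, count, share, weight in zip(
    class_names, counts.tolist(), proportions.tolist(), balanced_weights.tolist()
):
    print(f"{name:<10} count={count:4d} share={share:.3f} weight={weight:.3f}")
```

A dictionary such as `dict(enumerate(balanced_weights))` could be passed as `class_weight` to classifiers that accept it, for example `sklearn.svm.SVC`.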

# 5. Fusion Results

## Early Fusion
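
A minimal sketch of early (feature-level) fusion, assuming each modality has already been reduced to one fixed-length vector per utterance: each modality is standardized and the vectors are concatenated into a single feature matrix for one classifier. The text dimension (300), the random data, and the helper names `zscore` and `early_fusion` are illustrative assumptions, not the setup used in this homework.

```python
import numpy as np


def zscore(features, eps=1e-8):
    """Standardize each column so that no modality dominates because of its scale."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


def early_fusion(*modalities):
    """Concatenate per-utterance feature matrices along the feature axis."""
    if len({m.shape[0] for m in modalities}) != 1:
        raise ValueError("all modalities must describe the same utterances")
    return np.concatenate([zscore(m) for m in modalities], axis=1)


rng = np.random.default_rng(0)
n_utterances = 10
audio = rng.normal(size=(n_utterances, 128))    # pooled acoustic embeddings
text = rng.normal(size=(n_utterances, 300))     # 300 is a placeholder dimension
visual = rng.normal(size=(n_utterances, 2048))  # pooled visual embeddings

fused = early_fusion(audio, text, visual)
print(fused.shape)  # (10, 2476)
```

In a real experiment the standardization statistics should be estimated on the training split only and then applied to the test split, otherwise information leaks across the split.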

## Late Fusion
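
A minimal sketch of late (decision-level) fusion, assuming three unimodal classifiers that each output class probabilities for the same utterances: the probabilities are averaged, optionally with per-modality weights, and the fused prediction is the class with the highest average. The probability values below are made up for illustration, and the helper name `late_fusion` is hypothetical.

```python
import numpy as np


def late_fusion(prob_list, weights=None):
    """Average per-modality class probabilities and return (fused_probs, labels)."""
    probs = np.stack(prob_list)                   # (n_modalities, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()             # normalize so the result stays a distribution
    fused = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return fused, fused.argmax(axis=1)


# Made-up probabilities for 2 utterances and 4 classes (anger, sadness, happiness, neutral).
audio_probs = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.3, 0.3]])
text_probs = np.array([[0.4, 0.3, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]])
visual_probs = np.array([[0.1, 0.2, 0.3, 0.4], [0.2, 0.5, 0.2, 0.1]])

fused_probs, labels = late_fusion([audio_probs, text_probs, visual_probs])
print(labels)  # [0 1]
```

Passing `weights`, for instance proportional to each modality's validation score, is one way to let stronger modalities count for more; majority voting over hard labels is a common alternative.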

# 6. Interpretation of my results

1. Unimodal
    - visual: Results are poor, which I attribute to the pooling method. Recall that I'm using mean pooling, which is the easiest one to get started with. Future work would implement other pooling methods and compare them against it.
    - audio:
    - text: 
2. Multimodal
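
The pooling comparison proposed for the visual modality could start from a sketch like the one below, which puts mean pooling next to two common alternatives: max pooling and mean concatenated with standard deviation. The helper name `pool`, the method names, and the tiny example matrix are hypothetical, and nothing here indicates which method would perform best on this data.

```python
import numpy as np


def pool(frames, method="mean"):
    """Reduce a (T, D) matrix over time with the chosen pooling method."""
    frames = np.asarray(frames, dtype=np.float64)
    if method == "mean":
        return frames.mean(axis=0)
    if method == "max":
        return frames.max(axis=0)
    if method == "mean_std":
        # Keeps some information about variation over time; output length is 2 * D.
        return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
    raise ValueError(f"unknown pooling method: {method}")


# Tiny example: 2 frames, 2 feature dimensions.
frames = np.array([[1.0, 4.0],
                   [3.0, 0.0]])

pooled = {method: pool(frames, method) for method in ("mean", "max", "mean_std")}
for method, vector in pooled.items():
    print(method, vector)
```

Each variant can feed the same classifier and grid search, so that any difference in F1 can be attributed to the pooling step alone.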