# HW 2 Multimodal Machine Learning for Emotion Recognition

- main (this notebook) with sub notebooks
    1. audio (acoustic)
    2. text (lexical)
    3. visual
    4. early fusion 
    5. late fusion
    6. results
- IEMOCAP (Interactive Emotional Dyadic Motion Capture) database

# TODOs
- Imports + Load Data
- Preprocess Files
    1. [x] Reduce the temporal dimension from Audio and Visual files

# Imports + Load Data

In [2]:
import os

import pandas as pd
import numpy as np
import seaborn as sns
import ipympl as mpl # to show (image) plots

from matplotlib import pyplot as plt
from sklearn import svm, naive_bayes, neighbors, pipeline, datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

In [3]:
BASE = "/Users/brinkley97/Documents/development/"
CLASS_PATH = "classes/csci_535_multimodal_probabilistic_learning/"
DATASET_PATH = "datasets/hw_2"
FILE = "/iemocapRelativeAddressForFiles.csv"
file_paths = BASE + CLASS_PATH + DATASET_PATH + FILE

In [4]:
def load_data(file):
    original_data = pd.read_csv(file)
    # original_data = pd.DataFrame(file)
    copy_of_data = original_data.copy()
    return copy_of_data

In [5]:
# 4 classes: anger (0), sadness (1), happiness (2), and neutral (3)
dataset_paths_copy = load_data(file_paths)
dataset_paths_copy

| Unnamed: 0 | file_name_list | speakers | visual_features | acoustic_features | lexical_features | emotion_labels |
|---|---|---|---|---|---|---|
| 0 | Ses01F_impro01_F001 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 1 | Ses01F_impro01_M011 | M01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 0 |
| 2 | Ses01F_impro02_F002 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| 3 | Ses01F_impro02_F003 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 3 |
| 4 | Ses01F_impro02_F004 | F01 | /features/visual_features/Session1/Ses01F_impr... | /features/acoustic_features/Session1/Ses01F_im... | /features/lexical_features/Session1/Ses01F_imp... | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 1331 | Ses05M_script03_2_M029 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1332 | Ses05M_script03_2_M039 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1333 | Ses05M_script03_2_M041 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |
| 1334 | Ses05M_script03_2_M042 | M05 | /features/visual_features/Session5/Ses05M_scri... | /features/acoustic_features/Session5/Ses05M_sc... | /features/lexical_features/Session5/Ses05M_scr... | 0 |


# Preprocessing Files

- [x] Build paths to specific files
- [x] Reduce the time (temporal) dimension to 1 for audio and visual features
- [x] Split data into train, test, and validation sets
- *NOTE:* I do NOT call the functions below in this notebook; they are called in their modality-specific notebooks, `[specific feature]-main.ipynb`

In [10]:
def build_paths_to_file(df_with_paths, specific_feature):
    """With the given DataFrame of relative paths, build the full paths to access the files
    
    Parameters:
    df_with_paths -- pd DF of paths to each ".npy" file
    specific_feature -- str (column name: 'visual_features', 'acoustic_features', or 'lexical_features')
    
    Return:
    list_of_features -- list (the full paths that belong to that specific feature)
    features -- pd DF (file names, speakers, paths, and emotion labels for that specific feature)
    """
    
    # Explicit copy so that adding a column below does not trigger pandas' SettingWithCopyWarning
    features = df_with_paths.loc[:, ['file_name_list', 'speakers', specific_feature, 'emotion_labels']].copy()
    features_path = features.loc[:, specific_feature]
    # Prefix every relative path with the dataset root
    features["file_with_path"] = BASE + CLASS_PATH + DATASET_PATH + features_path
    list_of_features = list(features["file_with_path"])
    
    return list_of_features, features

In [11]:
# specific_feature = 'acoustic_features'
# audio_features_paths, audio_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)

# specific_feature = 'lexical_features'
# text_features_paths, text_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)

# specific_feature = 'visual_features'
# visual_features_paths, visual_features_with_y = build_paths_to_file(dataset_paths_copy, specific_feature)

## 1. Reduce Temporal Dimension for Both Audio and Visual Features

In [12]:
def reduce_temporal_dimension(features_paths, ys):
    """Reduce audio and visual features by mean pooling over the time dimension
    
    Parameters:
    features_paths -- list (of the paths that belong to that specific feature)
    ys -- pd Series (of the emotion labels that belong to that specific feature)
    
    Return: 
    reduced_features -- list (one time-averaged feature vector per file found)
    true_labels -- list (the emotion labels of the files found)
    """
    
    reduced_features = []
    true_labels = []
    
    for row in range(len(features_paths)):
        # Skip files that are missing on disk so features and labels stay aligned
        if os.path.exists(features_paths[row]):
            load_features_file = np.load(features_paths[row])
            # Average over axis 0, which this function treats as the time dimension
            resampled = np.mean(load_features_file, axis=0)
            reduced_features.append(resampled)
            true_labels.append(ys[row])
    return reduced_features, true_labels
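
As a quick check of what the mean over axis 0 does, here is a minimal sketch on a made-up array. The shape `(5, 3)` and the values are assumptions for illustration only; the real feature files may have different shapes.

```python
import numpy as np

# Hypothetical utterance: 5 time steps, 3 features per step (made-up values)
utterance = np.arange(15, dtype=float).reshape(5, 3)

# Averaging over axis 0 collapses the time dimension: (5, 3) -> (3,)
pooled = np.mean(utterance, axis=0)

print(utterance.shape)  # (5, 3)
print(pooled.shape)     # (3,)
print(pooled)           # [6. 7. 8.]
```

Every utterance ends up with a vector of the same length regardless of how many time steps it had, which is what lets the reduced features be stacked into one matrix for scikit-learn.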

In [13]:
def load_text_features(text_features_paths, ys):
    """Load the text features as-is; no temporal reduction is applied here
    
    Parameters:
    text_features_paths -- list (of the paths to the lexical feature files)
    ys -- pd Series (of the emotion labels that belong to the lexical features)
    
    Return:
    loaded_text_features -- list (one loaded array per file found)
    true_labels -- list (the emotion labels of the files found)
    """
    
    loaded_text_features = []
    true_labels = []
    
    for row in range(len(text_features_paths)):
        # Skip files that are missing on disk so features and labels stay aligned
        if os.path.exists(text_features_paths[row]):
            load_text_features_file = np.load(text_features_paths[row])
            loaded_text_features.append(load_text_features_file)
            true_labels.append(ys[row])
    return loaded_text_features, true_labels

In [13]:
def split_data(specific_features, specific_features_true_labels, test_size):
    """Split data into TRAIN, TEST, and VALIDATION sets
    
    Parameters:
    specific_features -- list (of the reduced features)
    specific_features_true_labels -- list (of the emotion labels that belong to that specific feature)
    test_size -- float (to pass into sklearn train_test_split())
    
    Return:
    X_train, X_test, X_val, y_train, y_test, y_val -- list (for that specific subset of the features)
    
    """

    # First split: hold out the TEST set
    X_train, X_test, y_train, y_test = train_test_split(specific_features, specific_features_true_labels, test_size=test_size, random_state=42)
    
    print("[INFO] X, y TRAIN sets (before validation split)")
    print(np.shape(X_train), np.shape(y_train))
    
    print("\n[INFO] X, y TEST sets")
    print(np.shape(X_test), np.shape(y_test))
    
    # Second split: test_size=0.7 moves 70% of the remaining TRAIN data into VALIDATION
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.7, random_state=42)
    
    print("\n[INFO] X, y TRAIN sets (after validation split)")
    print(np.shape(X_train), np.shape(y_train))
    
    print("\n[INFO] X, y VALIDATION sets")
    print(np.shape(X_val), np.shape(y_val))

    return X_train, X_test, X_val, y_train, y_test, y_val
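
To see what proportions the two-stage split produces, here is a small sketch on synthetic data. The sample count, feature count, and `test_size=0.2` are assumptions for illustration; only the second-stage `test_size=0.7` and `random_state=42` come from `split_data()` above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the IEMOCAP features): 1000 samples, 4 features, labels 0-3
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 4, size=1000)

# Stage 1: hold out the test set (test_size=0.2 is an assumed value)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stage 2: same call as in split_data(), 70% of the remaining data goes to validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.7, random_state=42)

sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
print(sizes)
```

Under these assumptions the final training set is roughly 24% of all samples, the validation set roughly 56%, and the test set 20%. If a larger training set is wanted, a smaller second-stage `test_size` would be the value to change.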

## 2. 4-class Emotion Classification + 3. Classification Results on Each Modality

Each model in this step is unimodal, so no fusion is involved. The results show how much each modality can contribute to the next step, which is fusion. Within each individual notebook, you'll see two sections: (1) "Without Hyper-Parameter Tuning" and (2) "With Hyper-Parameter Tuning". 

- [x]  Audio: Sections (1) and (2) both use the LinearSVC() estimator. (2) performs slightly better with respect to F1-micro on a 10-fold subject-independent cross-validation. See audio-main.ipynb for more details.

- [x]  Text: Sections (1) and (2) both use (a) naive_bayes.BernoulliNB() and (b) naive_bayes.GaussianNB() estimators to see which performs better with respect to F1-micro on a 10-fold subject-independent cross-validation. (b) performs better in both sections. See text-main.ipynb for more details.

- [x]  Visual: Sections (1) and (2) both use the LinearSVC() estimator. (2) performs slightly better with respect to F1-micro on a 10-fold subject-independent cross-validation. See visual-main.ipynb for more details.
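
The evaluation protocol named above can be sketched as follows. This is a minimal sketch on synthetic data, assuming that subject independence is enforced by grouping folds on a speaker id (such as the `speakers` column); the sub-notebooks may implement it differently.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not IEMOCAP): 200 utterances, 8 features, 4 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 4, size=200)

# Assumed grouping: 10 speakers with 20 utterances each
speakers = np.repeat(np.arange(10), 20)

# With 10 groups and 10 folds, each fold holds out exactly one speaker,
# so no speaker appears in both the training and the held-out part of a fold
cv = GroupKFold(n_splits=10)

scores = cross_val_score(LinearSVC(), X, y, groups=speakers, cv=cv, scoring="f1_micro")
print(scores.shape)
print(scores.mean())
```

For single-label multiclass problems like this one, F1-micro equals accuracy, so the random labels here should score near chance level.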

## 4. Class Imbalance

Below, we count how many samples belong to each class. Ordered from fewest to most samples, the classes are (2) happiness, (1) sadness, (0) anger, and (3) neutral. Despite this class imbalance, I haven't come across any problems when running anything above. In future work, I will include some methods to handle class imbalance, which I expect to improve my F1-micro metrics.

In [14]:
def class_imbalance_check(df):
    """Differentiate which files belong to which emotion label
    
    Parameter:
    df -- pandas DF with all the files and emotion labels
    
    Return:
    dfs (4) -- pandas DFs, one per class, each holding the files of that class
    
    """
    
    # One boolean mask per class; each mask already covers every row, so no loop is needed
    y0s_df = df[df.emotion_labels == 0]
    y1s_df = df[df.emotion_labels == 1]
    y2s_df = df[df.emotion_labels == 2]
    y3s_df = df[df.emotion_labels == 3]
        
    return y0s_df, y1s_df, y2s_df, y3s_df

In [17]:
y0s_df, y1s_df, y2s_df, y3s_df = class_imbalance_check(dataset_paths_copy)

In [18]:
len(y0s_df), len(y1s_df), len(y2s_df), len(y3s_df)

(328, 308, 180, 520)
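
One common way to handle this imbalance is to weight classes inversely to their frequency. The sketch below applies scikit-learn's "balanced" heuristic to the counts above. It illustrates one possible approach, not something the notebooks currently do.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the output above: anger (0), sadness (1), happiness (2), neutral (3)
counts = {0: 328, 1: 308, 2: 180, 3: 520}

# Rebuild a label vector that has exactly these counts
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

# "balanced" weight = n_samples / (n_classes * class_count)
classes = np.array(sorted(counts))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

for label, w in zip(classes, weights):
    print(label, round(float(w), 3))
```

Happiness, the rarest class, gets the largest weight (about 1.86) and neutral the smallest (about 0.64). Passing `class_weight="balanced"` to `LinearSVC()` applies the same weighting during training.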

## 5. Fusion Results

- [x] Early Fusion: The notebook has two sections: (1) "Without Hyper-Parameter Tuning" and (2) "With Hyper-Parameter Tuning". Both sections use 3 estimators: (a) svm.LinearSVC(), (b) naive_bayes.BernoulliNB(), and (c) naive_bayes.GaussianNB(). In both sections, the estimators rank from best to worst as (a), (b), (c). See notebook for more details. Comparing section (1) against section (2) for each estimator:
    - (a) svm.LinearSVC(): section (2) performs slightly better.
    - (b) naive_bayes.BernoulliNB(): section (2) performs slightly better.
    - (c) naive_bayes.GaussianNB(): section (1) performs better. I'm not sure why the untuned model wins here.
- [x] Late Fusion: I take the unimodal predictions from tasks 2 and 3, excluding naive_bayes.BernoulliNB() because naive_bayes.GaussianNB() performs better, and do a majority vote over the three modalities. I then split the data and compute the F1-micro metric. See notebook for more details. 
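
The two fusion strategies can be sketched as follows. This is a minimal sketch with made-up feature sizes and predictions. The feature concatenation for early fusion and the tie-breaking rule in `majority_vote` are assumptions, since the notebook text does not specify them.

```python
import numpy as np

# Hypothetical per-modality feature matrices for 4 utterances (made-up sizes)
audio = np.ones((4, 3))
text = np.zeros((4, 5))
visual = np.full((4, 2), 2.0)

# Early fusion: concatenate along the feature axis, then train one model on the result
early_fused = np.concatenate([audio, text, visual], axis=1)
print(early_fused.shape)  # (4, 10)

# Hypothetical unimodal predictions (labels 0-3), one row per modality
preds = np.array([
    [0, 1, 3, 2],  # audio
    [0, 1, 1, 3],  # text
    [3, 1, 3, 0],  # visual
])

def majority_vote(predictions, n_classes=4):
    """Return the most frequent label per column; ties go to the smallest label."""
    # votes[c, j] = number of modalities that predicted class c for utterance j
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)

# Late fusion: combine the per-modality decisions
late_fused = majority_vote(preds)
print(late_fused)  # [0 1 3 0]
```

With three voters and four classes, all three modalities can disagree, as in the last column. Here that tie falls to the smallest label, which is an arbitrary choice; falling back to the strongest unimodal model would be another option.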

## 6. Interpretation of My Results

- Early fusion using the LinearSVC() estimator seems to perform the best overall, but results are poor all around. I think this may be due to the mean pooling in task 1, which averages away how the features change over time. In future work, I'll test other pooling methods. See the results notebook for more details.
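
As a starting point for that future work, here is a hedged sketch of a few pooling alternatives. The helper `pool_over_time` and its method names are hypothetical, and none of these variants has been evaluated on this dataset.

```python
import numpy as np

def pool_over_time(sequence, method="mean"):
    """Collapse a (time, n_features) array into one fixed-length vector."""
    if method == "mean":
        return sequence.mean(axis=0)
    if method == "max":
        return sequence.max(axis=0)
    if method == "mean_std":
        # Mean plus standard deviation keeps some information about variation over time
        return np.concatenate([sequence.mean(axis=0), sequence.std(axis=0)])
    raise ValueError(f"unknown pooling method: {method}")

# Hypothetical utterance: 4 time steps, 2 features (made-up values)
utterance = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 4.0], [3.0, 4.0]])

print(pool_over_time(utterance, "mean"))      # [2. 2.]
print(pool_over_time(utterance, "max"))       # [3. 4.]
print(pool_over_time(utterance, "mean_std"))  # [2. 2. 1. 2.]
```

Both features have the same mean here, so mean pooling cannot tell them apart, while the max and standard deviation can.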