# Tutorial on CMU-Multimodal SDK

This is a tutorial on using ***CMU-Multimodal SDK*** to load and process multimodal time-series datasets and training a simple late-fusion LSTM model on the processed data. 

For this tutorial, we specify some constants in `./constans/paths.py`. Please first take a look and modify the paths to point to the correct folders.

## Downloading the data

We start off by (down)loading the datasets. In the SDK each dataset has three sets of content: `highlevel`, `raw` and `labels`. `highlevel` contains the extracted features for each modality (e.g OpenFace facial landmarks, openSMILE acoustic features) while `raw` contains the raw transctripts, phonemes. `labels` are self-explanatory. Note that some datasets have more than just one set of annotations so `labels` could also give you multiple files.

Currently there's a caveat that the SDK will not automatically detect if you have downloaded the data already. In event of that it will throw a `RuntimeError`. We work around that by `try/except`. This is not ideal but it will work for now.

In [1]:
from constants import SDK_PATH, DATA_PATH, WORD_EMB_PATH, CACHE_PATH
import sys

SDK_PATH: C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [2]:


if SDK_PATH is None:
    print("SDK path is not specified! Please specify first in constants/paths.py")
    exit(0)
else:
    sys.path.append(SDK_PATH)
    print(f"SDK path is set to {SDK_PATH}")

SDK path is set to C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [3]:
import sys
for path in sys.path:
    print(path)
    

C:\Users\Viki\Documents\Thesis\tryout3
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\python310.zip
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\DLLs
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib
C:\Users\Viki\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0
C:\Users\Viki\Documents\Thesis\tryout3\.venv

C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\win32
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\win32\lib
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\Pythonwin
C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [4]:

import mmsdk
import os
import re
import numpy as np
from mmsdk import mmdatasdk as md
from subprocess import check_call, CalledProcessError

In [5]:


# create folders for storing the data
if not os.path.exists(DATA_PATH):
    check_call(' '.join(['mkdir', '-p', DATA_PATH]), shell=True)

# download highlevel features, low-level (raw) data and labels for the dataset MOSI
# if the files are already present, instead of downloading it you just load it yourself.
# here we use CMU_MOSI dataset as example.

DATASET = md.cmu_mosi

try:
    md.mmdataset(DATASET.highlevel, DATA_PATH)
except RuntimeError:
    print("High-level features have been downloaded previously.")

try:
    md.mmdataset(DATASET.raw, DATA_PATH)
except RuntimeError:
    print("Raw data have been downloaded previously.")
    
try:
    md.mmdataset(DATASET.labels, DATA_PATH)
except RuntimeError:
    print("Labels have been downloaded previously.")

Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWordVectors.csd
[91m[1m[2024-12-03 11:01:33.004] | Error   | [0mC:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWordVectors.csd file already exists ...
High-level features have been downloaded previously.
Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWords.csd
[91m[1m[2024-12-03 11:01:33.004] | Error   | [0mC:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWords.csd file already exists ...
Raw data have been downloaded previously.
Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/labels/CMU_MOSI_Opinion_Labels.csd
[91m[1m[2024-12-03

## Inspecting the downloaded files

We can print the files in the target data folder to see what files are there.

We can observe a bunch of files ending with `.csd` extension. This stands for ***computational sequences***, which is the underlying data structure for all features in the SDK. We will come back to that later when we load the data. For now we just print out what computational sequences we have downloaded.

In [6]:
# list the directory contents... let's see what features there are
data_files = os.listdir(DATA_PATH)
print('\n'.join(data_files))

embedding_and_mapping.pt
http__immortal.multicomp.cs.cmu.edu


## Loading a multimodal dataset

Loading the dataset is as simple as telling the SDK what are the features you need and where are their computational sequences. You can construct a dictionary with format `{feature_name: csd_path}` and feed it to `mmdataset` object in the SDK.

In [7]:
import os

folder_path = "C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data"
for root, dirs, files in os.walk(folder_path):
    print(f"Directory: {root}")
    for file in files:
        print(f"  {file}")


Directory: C:\Users\Viki\Documents\Thesis\tryout3\data
  embedding_and_mapping.pt
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic
  CMU_MOSI_COVAREP.csd
  CMU_MOSI_OpenSmile_EB10.csd
  CMU_MOSI_openSMILE_IS09.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\labels
  CMU_MOSI_Opinion_Labels.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\language
  CMU_MOSI_TimestampedPhones.csd
  CMU_MOSI_TimestampedWords.csd
  CMU_MOSI_TimestampedWordVectors.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual
  CMU_MOSI_Visual_Facet_41.csd
  CMU_MOSI_Visual_Facet_42.csd
  CMU_MOSI_Visual

In [8]:
# Define your different modalities - refer to the filenames of the CSD files
visual_field_Facet41 = 'CMU_MOSI_Visual_Facet_41'
visual_field_Facet42 = 'CMU_MOSI_Visual_Facet_42'
visual_field_OpenFace1 = 'CMU_MOSI_Visual_OpenFace_1'


# visual_field_OpenFace2 = 'CMU_MOSI_Visual_OpenFace_2'
# [2024-11-24 21:59:03.886] | Error   | C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_OpenFace_2.csd resource is not a valid hdf5 computational sequence format ...


acoustic_field_COVAREP = 'CMU_MOSI_COVAREP'
acoustic_field_OpenSmile_EB10 = 'CMU_MOSI_OpenSmile_EB10'
acoustic_field_OpenSmile_IS09 = 'CMU_MOSI_openSMILE_IS09'



text_field_Words = 'CMU_MOSI_TimestampedWords'
text_field_Phones = 'CMU_MOSI_TimestampedPhones'
text_field_WordVectors = 'CMU_MOSI_TimestampedWordVectors'



# text_field = 'CMU_MOSI_ModifiedTimestampedWords'
feature=[
    text_field_Words,
    visual_field_Facet41, 
    acoustic_field_COVAREP,
]

# List of features
features1 = [
    text_field_Words,
    text_field_Phones,
    text_field_WordVectors,
    visual_field_Facet41, 
    visual_field_Facet42,
    visual_field_OpenFace1,
    # visual_field_OpenFace2,
    acoustic_field_COVAREP,
    # acoustic_field_OpenSmile_EB10,
    acoustic_field_OpenSmile_IS09
]

recipe = {
    text_field_Words: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "language", text_field_Words) + '.csd',
    
    visual_field_Facet41: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_Facet41) + '.csd',
    
    acoustic_field_COVAREP: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "acoustic", acoustic_field_COVAREP) + '.csd',
}
# Recipe with correct subdirectory paths for each modality
recipe1 = {
    text_field_Words: os.path.join(DATA_PATH, text_field_Words) + '.csd',
    
    text_field_Phones: os.path.join(DATA_PATH, "CMU-MOSI", "language", text_field_Phones) + '.csd',
    
    text_field_WordVectors: os.path.join(DATA_PATH, "CMU-MOSI", "language", text_field_WordVectors) + '.csd',
    
    visual_field_Facet41: os.path.join(DATA_PATH, "CMU-MOSI", "visual", visual_field_Facet41) + '.csd',
    
    visual_field_Facet42: os.path.join(DATA_PATH, "CMU-MOSI", "visual", visual_field_Facet42) + '.csd',
    
    visual_field_OpenFace1: os.path.join(DATA_PATH, "CMU-MOSI", "visual", visual_field_OpenFace1) + '.csd',
    
    # visual_field_OpenFace2: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_OpenFace2) + '.csd',
    
    acoustic_field_COVAREP: os.path.join(DATA_PATH, "CMU-MOSI", "acoustic", acoustic_field_COVAREP) + '.csd',
    
    acoustic_field_OpenSmile_EB10: os.path.join(DATA_PATH, "CMU-MOSI", "acoustic", acoustic_field_OpenSmile_EB10) + '.csd',
    
    acoustic_field_OpenSmile_IS09: os.path.join(DATA_PATH, "CMU-MOSI", "acoustic", acoustic_field_OpenSmile_IS09) + '.csd'
}



print (recipe)
dataset = md.mmdataset(recipe)

{'CMU_MOSI_TimestampedWords': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\language\\CMU_MOSI_TimestampedWords.csd', 'CMU_MOSI_Visual_Facet_41': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\visual\\CMU_MOSI_Visual_Facet_41.csd', 'CMU_MOSI_COVAREP': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\acoustic\\CMU_MOSI_COVAREP.csd'}
[92m[1m[2024-12-03 11:01:33.096] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\language\CMU_MOSI_TimestampedWords.csd ...
[94m[1m[2024-12-03 11:01:33.138] | Status  | [0mChecking the integrity of the <words> computational sequence ...
[94m[1m[2024-12-03 11:01:33.138] | Status  | [0mChecking the format of the data in <words> computational sequence ...


                                                                             

[92m[1m[2024-12-03 11:01:33.296] | Success | [0m<words> computational sequence data in correct format.
[94m[1m[2024-12-03 11:01:33.296] | Status  | [0mChecking the format of the metadata in <words> computational sequence ...
[92m[1m[2024-12-03 11:01:33.298] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_Facet_41.csd ...
[94m[1m[2024-12-03 11:01:33.336] | Status  | [0mChecking the integrity of the <FACET_4.1> computational sequence ...
[94m[1m[2024-12-03 11:01:33.336] | Status  | [0mChecking the format of the data in <FACET_4.1> computational sequence ...


                                                                             

[92m[1m[2024-12-03 11:01:33.472] | Success | [0m<FACET_4.1> computational sequence data in correct format.
[94m[1m[2024-12-03 11:01:33.473] | Status  | [0mChecking the format of the metadata in <FACET_4.1> computational sequence ...
[92m[1m[2024-12-03 11:01:33.476] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic\CMU_MOSI_COVAREP.csd ...
[94m[1m[2024-12-03 11:01:33.505] | Status  | [0mChecking the integrity of the <COVAREP> computational sequence ...
[94m[1m[2024-12-03 11:01:33.505] | Status  | [0mChecking the format of the data in <COVAREP> computational sequence ...


                                                                             

[92m[1m[2024-12-03 11:01:33.629] | Success | [0m<COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-03 11:01:33.630] | Status  | [0mChecking the format of the metadata in <COVAREP> computational sequence ...
[92m[1m[2024-12-03 11:01:33.630] | Success | [0mDataset initialized successfully ... 





## A peek into the dataset

The multimodal dataset, after loaded, has the following hierarchy:


```
            computational_sequence_1 ---...
           /                                   ...
          /                                    /
         /                          first_video     features -- T X N array
        /                          /               /
dataset ---computational_sequence_2 -- second_video
        \                          \               \
         \                          third_video     intervals -- T X 2 array
          \                                    \...
           \
            computational_sequence_3 ---...
```

It looks like a nested dictionary and can be indexed as if it is a nested dictionary. A dataset contains multiple computational sequences whose key is the `text_field`, `visual_field`, `acoustic_field` as defined above. Each computational sequence, however, has multiple video IDs in it, and different computational sequences are supposed to have the same set of video IDs. Within each video, there are two arrays: `features` and `intervals`, denoting the feature values at each time step and the start and end timestamp for each step. We can take a look at its content.

In [9]:
print(list(dataset.keys()))
print("=" * 80)

print(list(dataset[visual_field_Facet41].keys()))
print("=" * 80)

some_id = list(dataset[visual_field_Facet41].keys())[15]
print(list(dataset[visual_field_Facet41][some_id].keys()))
print("=" * 80)

print(list(dataset[visual_field_Facet41][some_id]['intervals'].shape))
print("=" * 80)

print(list(dataset[visual_field_Facet41][some_id]['features'].shape))
print(list(dataset[text_field_Words][some_id]['features'].shape))
print(list(dataset[acoustic_field_COVAREP][some_id]['features'].shape))
print("Different modalities have different number of time steps!")



['CMU_MOSI_TimestampedWords', 'CMU_MOSI_Visual_Facet_41', 'CMU_MOSI_COVAREP']
['03bSnISJMiM', '0h-zjBukYpk', '1DmNV9C1hbY', '1iG0909rllw', '2WGyTLYerpo', '2iD-tVS8NPw', '5W7Z1C_fDaE', '6Egk_28TtTM', '6_0THN4chvY', '73jzhE8R1TQ', '7JsX8y1ysxY', '8OtFthrtaJM', '8d-gEyoeBzc', '8qrpnFRGt2A', '9J25DZhivz8', '9T9Hf74oK10', '9c67fiY0wGQ', '9qR7uwkblbs', 'Af8D0E4ZXaw', 'BI97DNYfe5I', 'BXuRRbG0Ugk', 'Bfr499ggo-0', 'BioHAh1qJAQ', 'BvYR0L6f2Ig', 'Ci-AH39fi3Y', 'Clx4VXItLTE', 'Dg_0XKD0Mf4', 'G-xst2euQUc', 'G6GlGvlkxAQ', 'GWuJjcEuzt8', 'HEsqda8_d0Q', 'I5y0__X72p0', 'Iu2PFX3z_1s', 'IumbAb8q2dM', 'Jkswaaud0hk', 'LSi-o-IrDMs', 'MLal-t_vJPM', 'Njd1F0vZSm4', 'Nzq88NnDkEk', 'OQvJTdtJ2H4', 'OtBXNcAL_lE', 'Oz06ZWiO20M', 'POKffnXeBds', 'PZ-lDQFboO8', 'QN9ZIUWUXsY', 'Qr1Ca94K55A', 'Sqr0AcuoNnk', 'TvyZBvOMOTc', 'VCslbP0mgZI', 'VbQk4H8hgr0', 'Vj1wYRQjB-o', 'W8NXH0Djyww', 'WKA5OygbEKI', 'X3j2zQgwYgE', 'ZAIRrfG22O0', 'ZUXBRvtny7o', '_dI--eQ6qVU', 'aiEXnCPZubE', 'atnd_PF-Lbs', 'bOL9jKpeJRs', 'bvLlb-M3UXU', 'c5xsK

In [10]:

print("=== Metadata FACE 41 Visual===\n")
print("Visual FACE 41 Metadata:", dataset[visual_field_Facet41].metadata)
print("\n")

print("=== Metadata COVAREP Acoustic===\n")
print("Acoustic COVAREP Metadata:", dataset[acoustic_field_COVAREP].metadata)
print("\n")

print("=== Metadata Words===\n")
print("Words Metadata:", dataset[text_field_Words].metadata)
print("\n")


=== Metadata FACE 41 Visual===

Visual FACE 41 Metadata: {'alignment compatible': True, 'computational sequence description': 'FACET 4.1 Visual Features for CMU-MOSI Dataset', 'computational sequence version': 1.0, 'contact': 'abagherz@andrew.cmu.edu', 'creator': 'Amir Zadeh', 'dataset bib citation': '@article{zadeh2016multimodal,title={Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages},author={Zadeh, Amir and Zellers, Rowan and Pincus, Eli and Morency, Louis-Philippe},journal={IEEE Intelligent Systems},volume={31},number={6},pages={82--88},year={2016},publisher={IEEE}}', 'dataset name': 'CMU-MOSI', 'dataset version': 1.0, 'dimension names': ['Face X', 'Face Y', 'Face Width', 'Face Height', 'angerEvidence', 'contemptEvidence', 'disgustEvidence', 'joyEvidence', 'fearEvidence', 'negativeEvidence', 'neutralEvidence', 'positiveEvidence', 'sadnessEvidence', 'surpriseEvidence', 'confusionEvidence', 'frustrationEvidence', 'angerIntensity', 'contemptIntensi

In [11]:
# # List all keys in the visual and acoustic fields
# # print("Keys in visual FACE41 field:", dataset[visual_field_Facet41].keys())
# # print("Keys in visual FACE42 field:", dataset[visual_field_Facet41].keys())
# # print("Keys in visual OpenFace field:", dataset[visual_field_OpenFace1].keys())
# # 
# # print("Keys in acousticC field:", dataset[acoustic_field_COVAREP].keys())
# # print("Keys in acousticOEB field:", dataset[acoustic_field_OpenSmile_EB10].keys())
# # print("Keys in acousticOIS field:", dataset[acoustic_field_OpenSmile_IS09].keys())
# 
# print("=== Metadata FACE 41 Visual===\n")
# print("Visual FACE 41 Metadata:", dataset[visual_field_Facet41].metadata)
# print("\n")
# 
# print("=== Metadata FACE 42 Visual===\n")
# print("Visual FACE 42 Metadata:", dataset[visual_field_Facet42].metadata)
# print("\n")
# 
# print("=== Metadata OpenFace Visual===\n")
# print("Visual OpenFace Metadata:", dataset[visual_field_OpenFace1].metadata)
# print("\n")
# 
# print("=== Metadata COVAREP Acoustic===\n")
# print("Acoustic COVAREP Metadata:", dataset[acoustic_field_COVAREP].metadata)
# print("\n")
# 
# # print("=== Metadata OpenSmile EB10 Acoustic===\n")
# # print("Acoustic OpenSmile EB10 Metadata:", dataset[acoustic_field_OpenSmile_EB10].metadata)
# # print("\n")
# 
# print("=== Metadata OpenSmile IS09 Acoustic===\n")
# print("Acoustic OpenSmile IS09 Metadata:", dataset[acoustic_field_OpenSmile_IS09].metadata)
# print("\n")
# 
# print("=== Metadata Words===\n")
# print("Words Metadata:", dataset[text_field_Words].metadata)
# print("\n")
# 
# print("=== Metadata Phones===\n")
# print("Phones Metadata:", dataset[text_field_Phones].metadata)
# print("\n")
# 
# print("=== Metadata WordVectors===\n")
# print("WordVectors Metadata:", dataset[text_field_WordVectors].metadata)
# print("\n")

In [12]:
# import pandas as pd
# 
# # Assuming 'dataset' and field variables (e.g., visual_field_Facet41) are already defined
# 
# # List of fields and their metadata
# fields_metadata = [
#     {"Field": "Visual FACE 41", "Metadata": dataset[visual_field_Facet41].metadata},
#     {"Field": "Visual FACE 42", "Metadata": dataset[visual_field_Facet42].metadata},
#     {"Field": "Visual OpenFace", "Metadata": dataset[visual_field_OpenFace1].metadata},
#     {"Field": "Acoustic COVAREP", "Metadata": dataset[acoustic_field_COVAREP].metadata},
#     {"Field": "Acoustic OpenSmile EB10", "Metadata": dataset[acoustic_field_OpenSmile_EB10].metadata},
#     {"Field": "Acoustic OpenSmile IS09", "Metadata": dataset[acoustic_field_OpenSmile_IS09].metadata},
#     {"Field": "Words", "Metadata": dataset[text_field_Words].metadata},
#     {"Field": "Phones", "Metadata": dataset[text_field_Phones].metadata},
#     {"Field": "WordVectors", "Metadata": dataset[text_field_WordVectors].metadata},
# ]
# 
# # Normalize (flatten) the metadata dictionaries
# normalized_data = []
# for entry in fields_metadata:
#     metadata_flat = pd.json_normalize(entry["Metadata"], sep="_")
#     metadata_flat["Field"] = entry["Field"]
#     normalized_data.append(metadata_flat)
# 
# # Concatenate all normalized data into a single DataFrame
# df_metadata = pd.concat(normalized_data, ignore_index=True)
# 
# # Reorder columns so "Field" is first
# df_metadata = df_metadata[["Field"] + [col for col in df_metadata.columns if col != "Field"]]
# 
# 
# # Save the DataFrame to a CSV file
# csv_filename = "metadata_summary.csv"
# df_metadata.to_csv(csv_filename, index=False)
# 
# print(f"Metadata successfully saved to {csv_filename}")
# 
# 
# 
# #remove \ for Acoustic OpenSmile EB10
# #remove b" and " for Acoustic OpenSmile IS09

In [13]:
# print(df_metadata.columns.tolist())
# 
# dimension_column = "dimension names"
# 
# 
# if dimension_column in df_metadata.columns:
#     # Extract the column as a pandas DataFrame and print it
#     dimension_data = df_metadata[[dimension_column]]
#     print(dimension_data)
# else:
#     print(f"The column '{dimension_column}' does not exist in the DataFrame.")
# 
# 
# csv_filename = "columns.csv"
# dimension_data.to_csv(csv_filename, index=False)
# 
# # delete "b""[ or "[ or [ from the start of the line
# # delete ] or ]" or ]""" from the end of the line


In [14]:
# import pandas as pd
# 
# # Specify the path to your CSV file
# csv_file_path = 'columns.csv'
# 
# # Read the CSV file with the specified separator
# df = pd.read_csv(csv_file_path, sep="\',\'", engine='python')
# 
# csv_filename = "columnsSeparatedElements.csv"
# df.to_csv(csv_filename, index=False)
# 
# 
# # Display the DataFrame to verify the contents
# print(df)

In [15]:
 # Iterate through all segments
# for segment in dataset[visual_field].keys():
#     # Access features for each modality
#     visual_features = dataset[visual_field][segment]['features']
#     acoustic_features = dataset[acoustic_field][segment]['features']
#     text_features = dataset[text_field][segment]['features']
#     
#     # Print segment information
#     print(f"Segment ID: {segment}")
#     print(f"Visual features shape: {visual_features.shape}")
#     print(f"Acoustic features shape: {acoustic_features.shape}")
#     print(f"Text features shape: {text_features.shape}")
#     
#     # Compute basic statistics for each segment
#     visual_mean = np.mean(visual_features, axis=0)
#     visual_std = np.std(visual_features, axis=0)
#     acoustic_mean = np.mean(acoustic_features, axis=0)
#     acoustic_std = np.std(acoustic_features, axis=0)
#     
#     print(f"Visual feature mean: {visual_mean}")
#     print(f"Visual feature std dev: {visual_std}")
#     print(f"Acoustic feature mean: {acoustic_mean}")
#     print(f"Acoustic feaure std dev: {acoustic_std}")
#     print("=" * 40)


## Alignment of multimodal time series

To work with multimodal time series that contains multiple views of data with different frequencies, we have to first align them to a ***pivot*** modality. The convention is to align to ***words***. Alignment groups feature vectors from other modalities into bins denoted by the timestamps of the pivot modality, and apply a certain processing function to each bin. We call this function ***collapse function***, because usually it is a pooling function that collapses multiple feature vectors from another modality into one single vector. This will give you sequences of same lengths in each modality (as the length of the pivot modality) for all videos.

Here we define our collapse funtion as simple averaging. We feed the function to the SDK when we invoke `align` method. Note that the SDK always expect collapse functions with two arguments: `intervals` and `features`. Even if you don't use intervals (as is in the case below) you still need to define your function in the following way.

***Note: Currently the SDK applies the collapse function to all modalities including the pivot, and obviously text modality cannot be "averaged", causing some errors. My solution is to define the avg function such that it averages the features when it can, and return the content as is when it cannot average.***

In [16]:
# we define a simple averaging function that does not depend on intervals
def avg(intervals: np.array, features: np.array) -> np.array:
    try:
        return np.average(features, axis=0)
    except:
        return features

# first we align to words with averaging, collapse_function receives a list of functions
dataset.align(text_field_Words, collapse_functions=[avg])

[94m[1m[2024-12-03 11:01:33.803] | Status  | [0mUnify was called ...
[92m[1m[2024-12-03 11:01:33.804] | Success | [0mUnify completed ...
[94m[1m[2024-12-03 11:01:33.804] | Status  | [0mPre-alignment based on <CMU_MOSI_TimestampedWords> computational sequence started ...
[94m[1m[2024-12-03 11:01:36.017] | Status  | [0mPre-alignment done for <CMU_MOSI_Visual_Facet_41> ...
[94m[1m[2024-12-03 11:01:47.223] | Status  | [0mPre-alignment done for <CMU_MOSI_COVAREP> ...
[94m[1m[2024-12-03 11:01:47.316] | Status  | [0mAlignment starting ...


Overall Progress:   0%|          | 0/93 [00:00<?, ? Computational Sequence Entries/s]
  0%|          | 0/464 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   0%|          | 0/464 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   2%|▏         | 9/464 [00:00<00:05, 88.80 Segments/s][A
Aligning 03bSnISJMiM:   6%|▋         | 29/464 [00:00<00:02, 153.16 Segments/s][A
Aligning 03bSnISJMiM:  10%|█         | 47/464 [00:00<00:02, 163.13 Segments/s][A
Aligning 03bSnISJMiM:  14%|█▍        | 67/464 [00:00<00:02, 177.45 Segments/s][A
Aligning 03bSnISJMiM:  19%|█▊        | 86/464 [00:00<00:02, 181.71 Segments/s][A
Aligning 03bSnISJMiM:  23%|██▎       | 105/464 [00:00<00:01, 179.99 Segments/s][A
Aligning 03bSnISJMiM:  27%|██▋       | 124/464 [00:00<00:02, 169.09 Segments/s][A
Aligning 03bSnISJMiM:  31%|███▏      | 145/464 [00:00<00:01, 179.11 Segments/s][A
Aligning 03bSnISJMiM:  36%|███▌      | 168/464 [00:00<00:01, 193.62 Segments/s][A
Aligning 03bSnISJMiM:  43%|████▎     | 201/464 [00:0

[92m[1m[2024-12-03 11:04:21.660] | Success | [0mAlignment to <CMU_MOSI_TimestampedWords> complete.
[94m[1m[2024-12-03 11:04:21.660] | Status  | [0mReplacing dataset content with aligned computational sequences
[92m[1m[2024-12-03 11:04:21.665] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWords> computational sequence.
[94m[1m[2024-12-03 11:04:21.665] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWords> computational sequence ...


                                                                                      

[92m[1m[2024-12-03 11:04:21.798] | Success | [0m<CMU_MOSI_TimestampedWords> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:21.799] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWords> computational sequence ...
[92m[1m[2024-12-03 11:04:21.799] | Success | [0mInitialized empty <CMU_MOSI_Visual_Facet_41> computational sequence.
[94m[1m[2024-12-03 11:04:21.799] | Status  | [0mChecking the format of the data in <CMU_MOSI_Visual_Facet_41> computational sequence ...


                                                                                      

[92m[1m[2024-12-03 11:04:21.973] | Success | [0m<CMU_MOSI_Visual_Facet_41> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:21.973] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Visual_Facet_41> computational sequence ...
[92m[1m[2024-12-03 11:04:21.973] | Success | [0mInitialized empty <CMU_MOSI_COVAREP> computational sequence.
[94m[1m[2024-12-03 11:04:21.973] | Status  | [0mChecking the format of the data in <CMU_MOSI_COVAREP> computational sequence ...


                                                                                      

[92m[1m[2024-12-03 11:04:22.096] | Success | [0m<CMU_MOSI_COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:22.096] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_COVAREP> computational sequence ...


## Append annotations to the dataset and get the data points

Now that we have a preprocessed dataset, all we need to do is to apply annotations to the data. Annotations are also computational sequences, since they are also just some values distributed on different time spans (e.g 1-3s is 'angry', 12-26s is 'neutral'). Hence, we just add the label computational sequence to the dataset and then align to the labels. Since we (may) want to preserve the whole sequences, this time we don't specify any collapse functions when aligning. 

Note that after alignment, the keys in the dataset changes from `video_id` to `video_id[segment_no]`, because alignment will segment each datapoint based on the segmentation of the pivot modality (in this case, it is segmented based on labels, which is what we need, and yes, one code block ago they are segmented to word level, which I didn't show you).

***Important: DO NOT add the labels together at the beginning, the labels will be segmented during the first alignment to words. This also holds for any situation where you want to do multiple levels of alignment.***

In [17]:
label_field = 'CMU_MOSI_Opinion_Labels'

# we add and align to lables to obtain labeled segments
# this time we don't apply collapse functions so that the temporal sequences are preserved
label_recipe = {label_field: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "labels", label_field) + '.csd'}
dataset.add_computational_sequences(label_recipe, destination=None)
dataset.align(label_field)

[92m[1m[2024-12-03 11:04:22.210] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\labels\CMU_MOSI_Opinion_Labels.csd ...
[94m[1m[2024-12-03 11:04:22.240] | Status  | [0mChecking the integrity of the <Opinion Segment Labels> computational sequence ...
[94m[1m[2024-12-03 11:04:22.240] | Status  | [0mChecking the format of the data in <Opinion Segment Labels> computational sequence ...


                                                                   

[92m[1m[2024-12-03 11:04:22.323] | Success | [0m<Opinion Segment Labels> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:22.324] | Status  | [0mChecking the format of the metadata in <Opinion Segment Labels> computational sequence ...
[94m[1m[2024-12-03 11:04:22.324] | Status  | [0mUnify was called ...




[92m[1m[2024-12-03 11:04:22.462] | Success | [0mUnify completed ...
[94m[1m[2024-12-03 11:04:22.465] | Status  | [0mPre-alignment based on <CMU_MOSI_Opinion_Labels> computational sequence started ...
[94m[1m[2024-12-03 11:04:22.687] | Status  | [0mPre-alignment done for <CMU_MOSI_TimestampedWords> ...
[94m[1m[2024-12-03 11:04:22.913] | Status  | [0mPre-alignment done for <CMU_MOSI_Visual_Facet_41> ...
[94m[1m[2024-12-03 11:04:23.166] | Status  | [0mPre-alignment done for <CMU_MOSI_COVAREP> ...
[94m[1m[2024-12-03 11:04:23.170] | Status  | [0mAlignment starting ...


Overall Progress:   0%|          | 0/93 [00:00<?, ? Computational Sequence Entries/s]
  0%|          | 0/13 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   0%|          | 0/13 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/25 [00:00<?, ? Segments/s][A
Aligning 0h-zjBukYpk:   0%|          | 0/25 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/14 [00:00<?, ? Segments/s][A
Aligning 1DmNV9C1hbY:   0%|          | 0/14 [00:00<?, ? Segments/s][A
Overall Progress:   3%|▎         | 3/93 [00:00<00:04, 21.31 Computational Sequence Entries/s]
  0%|          | 0/30 [00:00<?, ? Segments/s][A
Aligning 1iG0909rllw:   0%|          | 0/30 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/63 [00:00<?, ? Segments/s][A
Aligning 2WGyTLYerpo:   0%|          | 0/63 [00:00<?, ? Segments/s][A
Alignin

[92m[1m[2024-12-03 11:04:27.688] | Success | [0mAlignment to <CMU_MOSI_Opinion_Labels> complete.
[94m[1m[2024-12-03 11:04:27.688] | Status  | [0mReplacing dataset content with aligned computational sequences
[92m[1m[2024-12-03 11:04:27.852] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWords> computational sequence.
[94m[1m[2024-12-03 11:04:27.852] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWords> computational sequence ...


                                                                     

[92m[1m[2024-12-03 11:04:27.862] | Success | [0m<CMU_MOSI_TimestampedWords> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:27.862] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWords> computational sequence ...
[92m[1m[2024-12-03 11:04:27.862] | Success | [0mInitialized empty <CMU_MOSI_Visual_Facet_41> computational sequence.
[94m[1m[2024-12-03 11:04:27.862] | Status  | [0mChecking the format of the data in <CMU_MOSI_Visual_Facet_41> computational sequence ...


                                                                     

[92m[1m[2024-12-03 11:04:27.871] | Success | [0m<CMU_MOSI_Visual_Facet_41> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:27.872] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Visual_Facet_41> computational sequence ...
[92m[1m[2024-12-03 11:04:27.872] | Success | [0mInitialized empty <CMU_MOSI_COVAREP> computational sequence.
[94m[1m[2024-12-03 11:04:27.872] | Status  | [0mChecking the format of the data in <CMU_MOSI_COVAREP> computational sequence ...


                                                                     

[92m[1m[2024-12-03 11:04:27.880] | Success | [0m<CMU_MOSI_COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:27.881] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_COVAREP> computational sequence ...
[92m[1m[2024-12-03 11:04:27.881] | Success | [0mInitialized empty <CMU_MOSI_Opinion_Labels> computational sequence.
[94m[1m[2024-12-03 11:04:27.881] | Status  | [0mChecking the format of the data in <CMU_MOSI_Opinion_Labels> computational sequence ...


                                                                     

[92m[1m[2024-12-03 11:04:27.892] | Success | [0m<CMU_MOSI_Opinion_Labels> computational sequence data in correct format.
[94m[1m[2024-12-03 11:04:27.892] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Opinion_Labels> computational sequence ...




In [18]:
# check out what the keys look like now
print(list(dataset[text_field_Words].keys())[55])

1iG0909rllw[3]


In [30]:
num_segments = len(dataset[visual_field_Facet41].keys())  # Assuming all fields have the same segment keys
print(f"Number of segments: {num_segments}")


Number of segments: 2198


In [31]:
# Compute lengths of segments for each modality
segment_lengths_visual = [dataset[visual_field_Facet41][seg]['features'].shape[0] for seg in dataset[visual_field_Facet41].keys()]
segment_lengths_acoustic = [dataset[acoustic_field_COVAREP][seg]['features'].shape[0] for seg in dataset[acoustic_field_COVAREP].keys()]
segment_lengths_text = [dataset[text_field_Words][seg]['features'].shape[0] for seg in dataset[text_field_Words].keys()]

# Calculate maximum and minimum lengths
max_length_visual = max(segment_lengths_visual)
min_length_visual = min(segment_lengths_visual)

max_length_acoustic = max(segment_lengths_acoustic)
min_length_acoustic = min(segment_lengths_acoustic)

max_length_text = max(segment_lengths_text)
min_length_text = min(segment_lengths_text)

# Print the results
print(f"Visual modality: Max length = {max_length_visual}, Min length = {min_length_visual}")
print(f"Acoustic modality: Max length = {max_length_acoustic}, Min length = {min_length_acoustic}")
print(f"Text modality: Max length = {max_length_text}, Min length = {min_length_text}")


Visual modality: Max length = 125, Min length = 1
Acoustic modality: Max length = 125, Min length = 1
Text modality: Max length = 125, Min length = 1


In [32]:
for seg in dataset[visual_field_Facet41].keys():  # Assuming all modalities have the same segment keys
    visual_length = dataset[visual_field_Facet41][seg]['features'].shape[0]
    acoustic_length = dataset[acoustic_field_COVAREP][seg]['features'].shape[0]
    text_length = dataset[text_field_Words][seg]['features'].shape[0]
    
    print(f"Segment {seg}: Visual length = {visual_length}, Acoustic length = {acoustic_length}, Text length = {text_length}")


Segment 03bSnISJMiM[0]: Visual length = 8, Acoustic length = 8, Text length = 8
Segment 03bSnISJMiM[1]: Visual length = 27, Acoustic length = 27, Text length = 27
Segment 03bSnISJMiM[2]: Visual length = 10, Acoustic length = 10, Text length = 10
Segment 03bSnISJMiM[3]: Visual length = 7, Acoustic length = 7, Text length = 7
Segment 03bSnISJMiM[4]: Visual length = 4, Acoustic length = 4, Text length = 4
Segment 03bSnISJMiM[5]: Visual length = 5, Acoustic length = 5, Text length = 5
Segment 03bSnISJMiM[6]: Visual length = 13, Acoustic length = 13, Text length = 13
Segment 03bSnISJMiM[7]: Visual length = 20, Acoustic length = 20, Text length = 20
Segment 03bSnISJMiM[8]: Visual length = 9, Acoustic length = 9, Text length = 9
Segment 03bSnISJMiM[9]: Visual length = 5, Acoustic length = 5, Text length = 5
Segment 03bSnISJMiM[10]: Visual length = 7, Acoustic length = 7, Text length = 7
Segment 03bSnISJMiM[11]: Visual length = 7, Acoustic length = 7, Text length = 7
Segment 03bSnISJMiM[12]: V

In [33]:
# Extract unique video IDs
video_ids = set(seg.split("[")[0] for seg in dataset[visual_field_Facet41].keys())

# Count the number of unique video IDs
num_unique_ids = len(video_ids)

# Count segments for each video ID
video_segment_counts = {video_id: sum(seg.startswith(video_id) for seg in dataset[visual_field_Facet41].keys()) for video_id in video_ids}

# Print the total number of unique video IDs
print(f"Total number of unique video IDs: {num_unique_ids}\n")

# Print the segment counts for each video ID
for video_id, count in video_segment_counts.items():
    print(f"Video {video_id}: {count} segments")



Total number of unique video IDs: 93

Video zhpQhgha_KU: 35 segments
Video 8qrpnFRGt2A: 26 segments
Video 1DmNV9C1hbY: 14 segments
Video tStelxIAHjw: 16 segments
Video 9qR7uwkblbs: 33 segments
Video Sqr0AcuoNnk: 22 segments
Video vyB00TXsimI: 22 segments
Video 6Egk_28TtTM: 12 segments
Video BXuRRbG0Ugk: 31 segments
Video 9T9Hf74oK10: 25 segments
Video BioHAh1qJAQ: 30 segments
Video G6GlGvlkxAQ: 29 segments
Video OQvJTdtJ2H4: 16 segments
Video WKA5OygbEKI: 22 segments
Video 0h-zjBukYpk: 25 segments
Video bvLlb-M3UXU: 25 segments
Video Vj1wYRQjB-o: 9 segments
Video _dI--eQ6qVU: 28 segments
Video cM3Yna7AavY: 16 segments
Video rnaNMUZpvvg: 22 segments
Video BvYR0L6f2Ig: 26 segments
Video GWuJjcEuzt8: 18 segments
Video c7UH_rxdZv4: 33 segments
Video jUzDDGyPkXU: 27 segments
Video cXypl4FnoZo: 29 segments
Video 2WGyTLYerpo: 62 segments
Video tmZoasNr4rU: 20 segments
Video I5y0__X72p0: 39 segments
Video wMbj6ajWbic: 30 segments
Video ob23OKe5a9Q: 14 segments
Video 5W7Z1C_fDaE: 24 segments
Vi

## Splitting the dataset

Now it comes to our final step: splitting the dataset into train/dev/test splits. This code block is a bit long in itself, so be patience and step through carefully with the explanatory comments.

The SDK provides the splits in terms of video IDs (which video belong to which split), however, after alignment our dataset keys already changed from `video_id` to `video_id[segment_no]`. Hence, we need to extract the video ID when looping through the data to determine which split each data point belongs to.

In the following data processing, I also include instance-wise Z-normalization (subtract by mean and divide by standard dev) and converted words to unique IDs.

This example is based on PyTorch so I am using PyTorch related utils, but the same procedure should be easy to adapt to other frameworks.

In [34]:
# obtain the train/dev/test splits - these splits are based on video IDs
train_split = DATASET.standard_folds.standard_train_fold
dev_split = DATASET.standard_folds.standard_valid_fold
test_split = DATASET.standard_folds.standard_test_fold

# inspect the splits: they only contain video IDs
print(f"lengths: train {len(train_split)}, dev {len(dev_split)}, test {len(test_split)}\n")
print(train_split)
print(dev_split)
print(test_split)

lengths: train 52, dev 10, test 31

['2iD-tVS8NPw', '8d-gEyoeBzc', 'Qr1Ca94K55A', 'Ci-AH39fi3Y', '8qrpnFRGt2A', 'Bfr499ggo-0', 'QN9ZIUWUXsY', '9T9Hf74oK10', '7JsX8y1ysxY', '1iG0909rllw', 'Oz06ZWiO20M', 'BioHAh1qJAQ', '9c67fiY0wGQ', 'Iu2PFX3z_1s', 'Nzq88NnDkEk', 'Clx4VXItLTE', '9J25DZhivz8', 'Af8D0E4ZXaw', 'TvyZBvOMOTc', 'W8NXH0Djyww', '8OtFthrtaJM', '0h-zjBukYpk', 'Vj1wYRQjB-o', 'GWuJjcEuzt8', 'BI97DNYfe5I', 'PZ-lDQFboO8', '1DmNV9C1hbY', 'OQvJTdtJ2H4', 'I5y0__X72p0', '9qR7uwkblbs', 'G6GlGvlkxAQ', '6_0THN4chvY', 'Njd1F0vZSm4', 'BvYR0L6f2Ig', '03bSnISJMiM', 'Dg_0XKD0Mf4', '5W7Z1C_fDaE', 'VbQk4H8hgr0', 'G-xst2euQUc', 'MLal-t_vJPM', 'BXuRRbG0Ugk', 'LSi-o-IrDMs', 'Jkswaaud0hk', '2WGyTLYerpo', '6Egk_28TtTM', 'Sqr0AcuoNnk', 'POKffnXeBds', '73jzhE8R1TQ', 'OtBXNcAL_lE', 'HEsqda8_d0Q', 'VCslbP0mgZI', 'IumbAb8q2dM']
['WKA5OygbEKI', 'c5xsKMxpXnc', 'atnd_PF-Lbs', 'bvLlb-M3UXU', 'bOL9jKpeJRs', '_dI--eQ6qVU', 'ZAIRrfG22O0', 'X3j2zQgwYgE', 'aiEXnCPZubE', 'ZUXBRvtny7o']
['tmZoasNr4rU', 'zhpQhgha_KU', '

In [35]:
# we can see they are in the format of 'video_id[segment_no]', but the splits was specified with video_id only
# we need to use regex or something to match the video IDs...
import torch
import torch.nn as nn

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm_notebook
from collections import defaultdict


In [36]:
import os
import matplotlib.pyplot as plt

def plot_hist(visual, acoustic, title="Segment"):
    # Create the folder if it doesn't exist
    folder_name = "Value distributions"
    os.makedirs(folder_name, exist_ok=True)
    
    # Plot the histograms
    plt.hist(visual.flatten(), bins=100, alpha=0.5, label='Visual')
    plt.hist(acoustic.flatten(), bins=100, alpha=0.5, label='Acoustic')
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.title(f"Value Distribution for {title}")
    plt.show()
    plt.close()  # Close the plot to free memory

    # 
    # # Save the figure to the specified folder with the given title
    # file_name = f"value_distribution_{title}.png"
    # file_path = os.path.join(folder_name, file_name)
    # plt.savefig(file_path)


In [37]:

# a sentinel epsilon for safe division, without it we will replace illegal values with a constant
EPS = 1e-8 # a small value to not end up in a situation in which we divide with 0

# construct a word2id mapping that automatically takes increment when new words are encountered
word2id = defaultdict(lambda: len(word2id))
UNK = word2id['<unk>']
PAD = word2id['<pad>']

# place holders for the final train/dev/test dataset
train = []
dev = []
test = []

# define a regular expression to extract the video ID out of the keys
pattern = re.compile('(.*)\[.*\]')
num_drop = 0 # a counter to count how many data points went into some processing issues


for segment in dataset[label_field].keys():
    
    # get the video ID and the features out of the aligned dataset
    vid = re.search(pattern, segment).group(1)
    label = dataset[label_field][segment]['features']
    _words = dataset[text_field_Words][segment]['features']
    _visual = dataset[visual_field_Facet41][segment]['features']
    _acoustic = dataset[acoustic_field_COVAREP][segment]['features']

    # if the sequences are not same length after alignment, there must be some problem with some modalities
    # we should drop it or inspect the data again
    # print(f"Just to chec: Encountered datapoint {vid} with text shape {_words.shape}, visual shape {_visual.shape}, acoustic shape {_acoustic.shape}")    
    if not _words.shape[0] == _visual.shape[0] == _acoustic.shape[0]:
        print(f"REAL DEALLLLLL  Encountered datapoint {vid} with text shape {_words.shape}, visual shape {_visual.shape}, acoustic shape {_acoustic.shape}")
        num_drop += 1
        continue
    
    # remove nan values
    label = np.nan_to_num(label)
    _visual = np.nan_to_num(_visual)
    _acoustic = np.nan_to_num(_acoustic)

    
    # plot_hist(_visual, _acoustic, segment)

    # print("remove speech pause moment")
    # remove speech pause tokens - this is in general helpful
    # we should remove speech pauses and corresponding visual/acoustic features together
    # otherwise modalities would no longer be aligned
    words = []
    visual = []
    acoustic = []
    # print(f"Processing segment {segment}: word[0] = {_words}")

    #SHALL I REMOVE THE UM WORDS????? pause fillers (umm, uhh, etc.),
    for i, word in enumerate(_words):
        # print(f"Processing segment {segment}: word[0] = {word[0]}")

        if word[0] != b'sp':
            words.append(word2id[word[0].decode('utf-8')]) # SDK stores strings as bytes, decode into strings here
            visual.append(_visual[i, :])
            acoustic.append(_acoustic[i, :])
        # print(f"Processing segment {segment}: word[0] = {word[0]}")

    words = np.asarray(words)
    visual = np.asarray(visual)
    acoustic = np.asarray(acoustic)
    
    # print(f"words: {words}, acoustic: {acoustic} ")

    # print(f"Visual range: min={np.min(visual)}, max={np.max(visual)}")
    # print(f"Acoustic range: min={np.min(acoustic)}, max={np.max(acoustic)}")
    

    # z-normalization per instance and remove nan/infs
    std_dev = np.std(visual, axis=0, keepdims=True)
    visual = np.nan_to_num((visual - visual.mean(0, keepdims=True)) / (EPS + std_dev))
    # Skip normalization for columns with zero std_dev
    visual[:, std_dev.flatten() == 0] = EPS
    
    # print(f"visual: {visual}")
    # visual = np.nan_to_num((visual - visual.mean(0, keepdims=True)) / (EPS + np.std(visual, axis=0, keepdims=True)))


    
    # z-normalization for acoustic
    
    acoustic_mean = np.nanmean(acoustic, axis=0, keepdims=True)
    std_dev_acoustic = np.nanstd(acoustic, axis=0, keepdims=True)
        
    
    std_dev_acoustic = np.nan_to_num(std_dev_acoustic)
    std_dev_acoustic[std_dev_acoustic == 0] = EPS
    # print(f"STD: {std_dev_acoustic}")
    
    acoustic = np.nan_to_num((acoustic - acoustic_mean) / (EPS + std_dev_acoustic))
    

    # Should i do normalization from 0 - 1 or -1 to 1?????

    
    # if np.any(std_dev_acoustic == 0):
    #     print("Error: std_dev_acoustic still contains zeros after safeguard.")
    if np.any(np.isnan(std_dev_acoustic)):
        print("Error: std_dev_acoustic contains NaNs.")
        
    if np.any(np.isnan(acoustic)):
        print("Error: NaN values in normalized acoustic data.")
    if np.any(np.isinf(acoustic)):
        print("Error: Infinite values in normalized acoustic data.")
        

    # plot_hist(visual, acoustic, segment)
    
    #why are the values not consistent???

    if vid in train_split:
        train.append(((words, visual, acoustic), label, segment))
    elif vid in dev_split:
        dev.append(((words, visual, acoustic), label, segment))
    elif vid in test_split:
        test.append(((words, visual, acoustic), label, segment))
    else:
        print(f"Found video that doesn't belong to any splits: {vid}")



print(f"Total number of {num_drop} datapoints have been dropped.")

Total number of 0 datapoints have been dropped.


  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  sqr = np.multiply(arr, arr, out=arr, where=where)


In [38]:
import random
sample = random.choice(train)

words = sample[0][0]  # Text data
visual = sample[0][1]  # Visual features
acoustic = sample[0][2]  # Acoustic features

print("Words:", words)
print("Visual shape:", visual.shape)
print("Acoustic shape:", acoustic.shape)

Words: [ 135   25  185  137    7  167  565  140  689  147  119  120    7 1533
   38  119  120  492  565  211  147 1305]
Visual shape: (22, 47)
Acoustic shape: (22, 74)


In [39]:
# turn off the word2id - define a named function here to allow for pickling
def return_unk():
    return UNK
word2id.default_factory = return_unk

## Inspect the dataset

Now that we have loaded the data, we can check the sizes of each split, data point shapes, vocabulary size, etc.

## Collate function in PyTorch

Collate functions are functions used by PyTorch dataloader to gather batched data from dataset. It loads multiple data points from an iterable dataset object and put them in a certain format. Here we just use the lists we've constructed as the dataset and assume PyTorch dataloader will operate on that.

In [40]:
def multi_collate(batch):
    '''
    Collate functions assume batch = [Dataset[i] for i in index_set]
    '''
    # for later use we sort the batch in descending order of length
    batch = sorted(batch, key=lambda x: x[0][0].shape[0], reverse=True)
    
    # get the data out of the batch - use pad sequence util functions from PyTorch to pad things
    labels = torch.cat([torch.from_numpy(sample[1]) for sample in batch], dim=0)
    sentences = pad_sequence([torch.LongTensor(sample[0][0]) for sample in batch], padding_value=PAD)
    visual = pad_sequence([torch.FloatTensor(sample[0][1]) for sample in batch])
    acoustic = pad_sequence([torch.FloatTensor(sample[0][2]) for sample in batch])
    
    # lengths are useful later in using RNNs
    lengths = torch.LongTensor([sample[0][0].shape[0] for sample in batch])
    return sentences, visual, acoustic, labels, lengths

# construct dataloaders, dev and test could use around ~X3 times batch size since no_grad is used during eval
batch_sz = 56
train_loader = DataLoader(train, shuffle=True, batch_size=batch_sz, collate_fn=multi_collate)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate)
test_loader = DataLoader(test, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate)

# let's create a temporary dataloader just to see how the batch looks like
temp_loader = iter(DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate))
batch = next(temp_loader)

print(batch[0].shape) # word vectors, padded to maxlen
print(batch[1].shape) # visual features
print(batch[2].shape) # acoustic features
print(batch[3]) # labels
print(batch[4]) # lengths

torch.Size([20, 8])
torch.Size([20, 8, 47])
torch.Size([20, 8, 74])
tensor([[ 2.0000],
        [-2.4000],
        [ 1.0000],
        [ 1.2000],
        [ 1.7500],
        [-2.2000],
        [ 2.2000],
        [-1.0000]])
tensor([20, 15, 12,  9,  9,  9,  7,  7])


In [41]:
# Let's actually inspect the transcripts to ensure it's correct
id2word = {v:k for k, v in word2id.items()}
examine_target = train
idx = np.random.randint(0, len(examine_target))
print(' '.join(list(map(lambda x: id2word[x], examine_target[idx][0][0].tolist()))))


# print(' '.join(examine_target[idx][0]))
print(examine_target[idx][1]) #label
print(examine_target[idx][2]) #segment

but i just wanted to put this up soon
[[0.4]]
HEsqda8_d0Q[32]


In [42]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}


# Specify how many examples to examine
num_examples = len(train)  # Ensure we don't exceed the dataset size


max_length = 0
min_length = float('inf')

for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], train[idx][0][0].tolist()))
    label = train[idx][1]  # Label
    segment = train[idx][2]  # Segment

    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability
    if length==1:
        example = idx+1

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")
print(f"Example with length 1: {example}")




Example 1:
Text: anyhow it was really good
Length: 5
Label: [[2.4]]
Segment: 03bSnISJMiM[0]
----------------------------------------
Example 2:
Text: they didnt really do a whole bunch of background info on why she has to fight and be prepared
Length: 19
Label: [[-0.8]]
Segment: 03bSnISJMiM[1]
----------------------------------------
Example 3:
Text: i mean they did a little bit of it
Length: 9
Label: [[-1.]]
Segment: 03bSnISJMiM[2]
----------------------------------------
Example 4:
Text: but not a whole bunch
Length: 5
Label: [[-1.75]]
Segment: 03bSnISJMiM[3]
----------------------------------------
Example 5:
Text: and they i guess
Length: 4
Label: [[0.]]
Segment: 03bSnISJMiM[4]
----------------------------------------
Example 6:
Text: they live up with more
Length: 5
Label: [[0.]]
Segment: 03bSnISJMiM[5]
----------------------------------------
Example 7:
Text: and but besides that it was all over pretty good
Length: 10
Label: [[0.8]]
Segment: 03bSnISJMiM[6]
-----------------------

In [43]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}

# Specify how many examples to examine
num_examples = len(dev)  # Ensure we don't exceed the dataset size

max_length = 0
min_length = float('inf')

for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], dev[idx][0][0].tolist()))
    label = dev[idx][1]  # Label
    segment = dev[idx][2]  # Segment

    # Calculate the length of the words (number of words)
    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")



Example 1:
Text: uh i really enjoyed doing them for the philly d movie club so i thought oh why not
Length: 18
Label: [[1.8]]
Segment: WKA5OygbEKI[0]
----------------------------------------
Example 2:
Text: uh is a really cute movie
Length: 6
Label: [[2.2]]
Segment: WKA5OygbEKI[1]
----------------------------------------
Example 3:
Text: um i i liked a lot of the movie
Length: 9
Label: [[1.2]]
Segment: WKA5OygbEKI[2]
----------------------------------------
Example 4:
Text: i will admit im a big johnny depp fan
Length: 9
Label: [[1.8]]
Segment: WKA5OygbEKI[3]
----------------------------------------
Example 5:
Text: um i pretty much will go see any movie that kind of has an interesting plot if hes in it
Length: 20
Label: [[0.6]]
Segment: WKA5OygbEKI[4]
----------------------------------------
Example 6:
Text: um its not his best
Length: 5
Label: [[-1.4]]
Segment: WKA5OygbEKI[5]
----------------------------------------
Example 7:
Text: i mean corpse bride hes really good in corpse brid

In [44]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}

# Specify how many examples to examine
num_examples = len(test)  # Ensure we don't exceed the dataset size

max_length = 0
min_length = float('inf')


for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], test[idx][0][0].tolist()))
    label = test[idx][1]  # Label
    segment = test[idx][2]  # Segment

    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability
    if length == 1:
        example = idx+1

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")
print(f"example with length 1: {example}")


Example 1:
Text: oh my gosh bad movie
Length: 5
Label: [[-2.8]]
Segment: c7UH_rxdZv4[0]
----------------------------------------
Example 2:
Text: really bad movie
Length: 3
Label: [[-2.6]]
Segment: c7UH_rxdZv4[1]
----------------------------------------
Example 3:
Text: i wish i because
Length: 4
Label: [[-0.8]]
Segment: c7UH_rxdZv4[2]
----------------------------------------
Example 4:
Text: because i truly love an action flick action comedy flick even better right
Length: 13
Label: [[1.6]]
Segment: c7UH_rxdZv4[3]
----------------------------------------
Example 5:
Text: but basically seth rogan brought none of the charm that he normally does to his roles
Length: 16
Label: [[-2.2]]
Segment: c7UH_rxdZv4[4]
----------------------------------------
Example 6:
Text: in fact i have to say that this was one of those obnoxious main characters ive seen a long time
Length: 20
Label: [[-2.6]]
Segment: c7UH_rxdZv4[5]
----------------------------------------
Example 7:
Text: a horrible protagonis

## Define a multimodal model

Here we show a simple example of late-fusion LSTM. Late-fusion refers to combining the features from different modalities at the final prediction stage, without introducing any interactions between them before that.

In [56]:
# let's define a simple model that can deal with multimodal variable length sequence
class LFLSTM(nn.Module):
    def __init__(self, input_sizes, hidden_sizes, fc1_size, output_size, dropout_rate):
        super(LFLSTM, self).__init__()
        self.input_size = input_sizes
        self.hidden_size = hidden_sizes
        self.fc1_size = fc1_size
        self.output_size = output_size
        self.dropout_rate = dropout_rate
        
        # defining modules - two layer bidirectional LSTM with layer norm in between
        self.embed = nn.Embedding(len(word2id), input_sizes[0])
        self.trnn1 = nn.LSTM(input_sizes[0], hidden_sizes[0], bidirectional=True)
        self.trnn2 = nn.LSTM(2*hidden_sizes[0], hidden_sizes[0], bidirectional=True)
        
        self.vrnn1 = nn.LSTM(input_sizes[1], hidden_sizes[1], bidirectional=True)
        self.vrnn2 = nn.LSTM(2*hidden_sizes[1], hidden_sizes[1], bidirectional=True)
        
        self.arnn1 = nn.LSTM(input_sizes[2], hidden_sizes[2], bidirectional=True)
        self.arnn2 = nn.LSTM(2*hidden_sizes[2], hidden_sizes[2], bidirectional=True)

        self.fc1 = nn.Linear(sum(hidden_sizes)*4, fc1_size)
        self.fc2 = nn.Linear(fc1_size, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.tlayer_norm = nn.LayerNorm((hidden_sizes[0]*2,))
        self.vlayer_norm = nn.LayerNorm((hidden_sizes[1]*2,))
        self.alayer_norm = nn.LayerNorm((hidden_sizes[2]*2,))
        self.bn = nn.BatchNorm1d(sum(hidden_sizes)*4)

        
    def extract_features(self, sequence, lengths, rnn1, rnn2, layer_norm):
        packed_sequence = pack_padded_sequence(sequence, lengths)
        packed_h1, (final_h1, _) = rnn1(packed_sequence)
        padded_h1, _ = pad_packed_sequence(packed_h1)
        normed_h1 = layer_norm(padded_h1)
        packed_normed_h1 = pack_padded_sequence(normed_h1, lengths)
        _, (final_h2, _) = rnn2(packed_normed_h1)
        return final_h1, final_h2

        
    def fusion(self, sentences, visual, acoustic, lengths):
        batch_size = lengths.size(0)
        
        max_idx = sentences.max().item()
        if max_idx >= self.embed.num_embeddings:
            raise ValueError(f"Sentence contains an index {max_idx} that exceeds the embedding matrix size {self.embedding.num_embeddings - 1}. Please ensure input indices are within the vocab size.")

        sentences = self.embed(sentences)
        
        # extract features from text modality
        final_h1t, final_h2t = self.extract_features(sentences, lengths, self.trnn1, self.trnn2, self.tlayer_norm)
        
        # extract features from visual modality
        final_h1v, final_h2v = self.extract_features(visual, lengths, self.vrnn1, self.vrnn2, self.vlayer_norm)
        
        # extract features from acoustic modality
        final_h1a, final_h2a = self.extract_features(acoustic, lengths, self.arnn1, self.arnn2, self.alayer_norm)

        
        # simple late fusion -- concatenation + normalization
        h = torch.cat((final_h1t, final_h2t, final_h1v, final_h2v, final_h1a, final_h2a),
                       dim=2).permute(1, 0, 2).contiguous().view(batch_size, -1)
        return self.bn(h)

    def forward(self, sentences, visual, acoustic, lengths):
        batch_size = lengths.size(0)
        h = self.fusion(sentences, visual, acoustic, lengths)
        h = self.fc1(h)
        h = self.dropout(h)
        h = self.relu(h)
        o = self.fc2(h)
        return o

## Load pretrained embeddings

We define a function for loading pretrained word embeddings stored in GloVe-style file. Contextualized embeddings obviously cannot be stored and loaded this way, though.

In [57]:
import tqdm
from tqdm import tqdm_notebook

In [58]:
# define a function that loads data from GloVe-like embedding files
# we will add tutorials for loading contextualized embeddings later
# 2196017 is the vocab size of GloVe here.

def load_emb(w2i, path_to_embedding, embedding_size=300, embedding_vocab=2196017, init_emb=None):
    if init_emb is None:
        emb_mat = np.random.randn(len(w2i), embedding_size)
    else:
        emb_mat = init_emb
    f = open(path_to_embedding, 'r', encoding='utf-8', errors='replace')
    found = 0
    for line in tqdm_notebook(f, total=embedding_vocab):
        try:
            content = line.strip().split()
            vector = np.asarray(list(map(lambda x: float(x), content[-300:])))
            word = ' '.join(content[:-300])
            if word in w2i:
                idx = w2i[word]
                emb_mat[idx, :] = vector
                found += 1
        except ValueError as e:
            print(f"Skipping invalid line: {line}")
        
    print(f"Found {found} words in the embedding file.")
    return torch.tensor(emb_mat).float()


## Training a model

Next we train a model. We use Adam with gradient clipping and weight decay for training, and our loss here is Mean Absolute Error (MOSI is a regression dataset). We exclude the embeddings from trainable computation graph to prevent overfitting. We also apply a early-stopping scheme with learning rate annealing based on validation loss.

In [59]:

from torch.optim import Adam, SGD
from sklearn.metrics import accuracy_score

In [60]:
import os
path = 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data'
if os.access(path, os.R_OK):
    print("Directory is readable")
else:
    print("No read permission")
if os.access(path, os.W_OK):
    print("Directory is writable")
else:
    print("No write permission")
    

Directory is readable
Directory is writable


In [61]:


torch.manual_seed(123)
torch.cuda.manual_seed_all(123)

CUDA = torch.cuda.is_available()
MAX_EPOCH = 1000

text_size = 300
visual_size = 47
acoustic_size = 74

# define some model settings and hyper-parameters
input_sizes = [text_size, visual_size, acoustic_size]
hidden_sizes = [int(text_size * 1.5), int(visual_size * 1.5), int(acoustic_size * 1.5)]
fc1_size = sum(hidden_sizes) // 2
dropout = 0.25
output_size = 1
curr_patience = patience = 8
num_trials = 3
grad_clip_value = 1.0
weight_decay = 0.1

if os.path.exists(CACHE_PATH):
    pretrained_emb, word2id = torch.load(CACHE_PATH)
elif WORD_EMB_PATH is not None:
    pretrained_emb = load_emb(word2id, WORD_EMB_PATH)
    torch.save((pretrained_emb, word2id), CACHE_PATH)
else:
    pretrained_emb = None

model = LFLSTM(input_sizes, hidden_sizes, fc1_size, output_size, dropout)
if pretrained_emb is not None:
    model.embed.weight.data = pretrained_emb
model.embed.requires_grad = False
optimizer = Adam([param for param in model.parameters() if param.requires_grad], weight_decay=weight_decay)

if CUDA:
    model.cuda()
criterion = nn.L1Loss(reduction='sum')
criterion_test = nn.L1Loss(reduction='sum')
best_valid_loss = float('inf')
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
lr_scheduler.step() # for some reason it seems the StepLR needs to be stepped once first
train_losses = []
valid_losses = []
for e in range(MAX_EPOCH):
    model.train()
    train_iter = tqdm_notebook(train_loader)
    train_loss = 0.0
    for batch in train_iter:
        model.zero_grad()
        t, v, a, y, l = batch
        batch_size = t.size(0)
        if CUDA:
            t = t.cuda()
            v = v.cuda()
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(t, v, a, l)
        loss = criterion(y_tilde, y)
        loss.backward()
        torch.nn.utils.clip_grad_value_([param for param in model.parameters() if param.requires_grad], grad_clip_value)
        optimizer.step()
        train_iter.set_description(f"Epoch {e}/{MAX_EPOCH}, current batch loss: {round(loss.item()/batch_size, 4)}")
        train_loss += loss.item()
    train_loss = train_loss / len(train)
    train_losses.append(train_loss)
    print(f"Training loss: {round(train_loss, 4)}")

    model.eval()
    with torch.no_grad():
        valid_loss = 0.0
        for batch in dev_loader:
            model.zero_grad()
            t, v, a, y, l = batch
            if CUDA:
                t = t.cuda()
                v = v.cuda()
                a = a.cuda()
                y = y.cuda()
                l = l.cuda()
            y_tilde = model(t, v, a, l)
            loss = criterion(y_tilde, y)
            valid_loss += loss.item()
    
    valid_loss = valid_loss/len(dev)
    valid_losses.append(valid_loss)
    print(f"Validation loss: {round(valid_loss, 4)}")
    print(f"Current patience: {curr_patience}, current trial: {num_trials}.")
    if valid_loss <= best_valid_loss:
        best_valid_loss = valid_loss
        print("Found new best model on dev set!")
        torch.save(model.state_dict(), 'model.std')
        torch.save(optimizer.state_dict(), 'optim.std')
        curr_patience = patience
    else:
        curr_patience -= 1
        if curr_patience <= -1:
            print("Running out of patience, loading previous best model.")
            num_trials -= 1
            curr_patience = patience
            model.load_state_dict(torch.load('model.std'))
            optimizer.load_state_dict(torch.load('optim.std'))
            lr_scheduler.step()
            print(f"Current learning rate: {optimizer.state_dict()['param_groups'][0]['lr']}")
    
    if num_trials <= 0:
        print("Running out of patience, early stopping.")
        break

model.load_state_dict(torch.load('model.std'))
y_true = []
y_pred = []
model.eval()
with torch.no_grad():
    test_loss = 0.0
    for batch in test_loader:
        model.zero_grad()
        t, v, a, y, l = batch
        if CUDA:
            t = t.cuda()
            v = v.cuda()
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(t, v, a, l)
        loss = criterion_test(y_tilde, y)
        y_true.append(y_tilde.detach().cpu().numpy())
        y_pred.append(y.detach().cpu().numpy())
        test_loss += loss.item()
print(f"Test set performance: {test_loss/len(test)}")
y_true = np.concatenate(y_true, axis=0)
y_pred = np.concatenate(y_pred, axis=0)
                  
y_true_bin = y_true >= 0
y_pred_bin = y_pred >= 0
bin_acc = accuracy_score(y_true_bin, y_pred_bin)
print(f"Test set accuracy is {bin_acc}")


  pretrained_emb, word2id = torch.load(CACHE_PATH)
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  train_iter = tqdm_notebook(train_loader)


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.3179
Validation loss: 1.3969
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.9565
Validation loss: 1.2881
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.7411
Validation loss: 1.2608
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.5924
Validation loss: 1.1824
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4733
Validation loss: 1.2676
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4268
Validation loss: 1.2268
Current patience: 7, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4001
Validation loss: 1.2312
Current patience: 6, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.373
Validation loss: 1.2132
Current patience: 5, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3412
Validation loss: 1.2214
Current patience: 4, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3416
Validation loss: 1.1989
Current patience: 3, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3221
Validation loss: 1.2358
Current patience: 2, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3246
Validation loss: 1.1948
Current patience: 1, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.2999
Validation loss: 1.2116
Current patience: 0, current trial: 3.
Running out of patience, loading previous best model.
Current learning rate: 1e-05


  model.load_state_dict(torch.load('model.std'))
  optimizer.load_state_dict(torch.load('optim.std'))


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4781
Validation loss: 1.2005
Current patience: 8, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4546
Validation loss: 1.2138
Current patience: 7, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4294
Validation loss: 1.2088
Current patience: 6, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4174
Validation loss: 1.2119
Current patience: 5, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3871
Validation loss: 1.2122
Current patience: 4, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.365
Validation loss: 1.2178
Current patience: 3, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3683
Validation loss: 1.2062
Current patience: 2, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3542
Validation loss: 1.2167
Current patience: 1, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3356
Validation loss: 1.2038
Current patience: 0, current trial: 2.
Running out of patience, loading previous best model.
Current learning rate: 1e-05


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4712
Validation loss: 1.1985
Current patience: 8, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4374
Validation loss: 1.2085
Current patience: 7, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4274
Validation loss: 1.2102
Current patience: 6, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.4065
Validation loss: 1.2185
Current patience: 5, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3911
Validation loss: 1.2264
Current patience: 4, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3779
Validation loss: 1.2122
Current patience: 3, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3636
Validation loss: 1.2198
Current patience: 2, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3519
Validation loss: 1.2098
Current patience: 1, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.3427
Validation loss: 1.2119
Current patience: 0, current trial: 1.
Running out of patience, loading previous best model.
Current learning rate: 1e-05
Running out of patience, early stopping.


  model.load_state_dict(torch.load('model.std'))


AttributeError: 'LFLSTM' object has no attribute 'embedding'

In [ ]:
# import torch
# import numpy as np
# from sklearn.metrics import accuracy_score
# import torch.nn as nn

# Assume necessary variables like model, test_loader, CUDA, dic_size, etc. are already defined.

# Define the loss function (assuming MSELoss as used previously)

# Load the model state (assuming it's already saved in 'model.std')
model.load_state_dict(torch.load('model.std'))

# Prepare for evaluation
y_true = []
y_pred = []

# Set model to evaluation mode
model.eval()

# No gradient calculation
with torch.no_grad():
    test_loss = 0.0

    # Iterate over the test set
    for batch_idx, batch in enumerate(test_loader):
        # Zero the gradients
        model.zero_grad()

        # Unpack batch
        t, v, a, y, l = batch

        # Check input data dimensions (optional debugging step)
        print(f"Batch {batch_idx+1}: t.shape={t.shape}, v.shape={v.shape}, a.shape={a.shape}, y.shape={y.shape}, l.shape={l.shape}")

        # Move to CUDA if available
        if CUDA:
            t = t.cuda()
            v = v.cuda()
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()

        # Forward pass
        y_tilde = model(t, v, a, l)

        # Check if the forward pass output is as expected
        print(f"y_tilde shape: {y_tilde.shape}, y shape: {y.shape}")

        # Compute loss
        loss = criterion_test(y_tilde, y)
        
        # Print loss for this batch
        print(f"Batch {batch_idx+1} Loss: {loss.item()}")

        # Collect predictions and true values
        y_true.append(y_tilde.detach().cpu().numpy())
        y_pred.append(y.detach().cpu().numpy())

        # Accumulate loss
        test_loss += loss.item()

    # Compute average test loss
    avg_test_loss = test_loss / len(test_loader)
    print(f"Test set performance (Average Loss): {avg_test_loss}")

    # Concatenate all true values and predictions
    y_true = np.concatenate(y_true, axis=0)
    y_pred = np.concatenate(y_pred, axis=0)

    # Check the first few predictions and ground truth values
    print(f"First few predictions (y_tilde): {y_pred[:5]}")
    print(f"First few ground truth values (y): {y_true[:5]}")

    # Convert to binary for classification accuracy
    y_true_bin = y_true >= 0
    y_pred_bin = y_pred >= 0

    # Print first few binary values for sanity check
    print(f"First few binary predictions: {y_pred_bin[:5]}")
    print(f"First few binary ground truth values: {y_true_bin[:5]}")

    # Calculate binary accuracy
    bin_acc = accuracy_score(y_true_bin, y_pred_bin)
    print(f"Test set binary accuracy: {bin_acc}")

    # Optionally, print the total number of test examples processed
    print(f"Total test examples processed: {len(test_loader.dataset)}")
