# Tutorial on CMU-Multimodal SDK

This is a tutorial on using ***CMU-Multimodal SDK*** to load and process multimodal time-series datasets and training a simple late-fusion LSTM model on the processed data. 

For this tutorial, we specify some constants in `./constans/paths.py`. Please first take a look and modify the paths to point to the correct folders.

## Downloading the data

We start off by (down)loading the datasets. In the SDK each dataset has three sets of content: `highlevel`, `raw` and `labels`. `highlevel` contains the extracted features for each modality (Facet facial landmarks, COVAREP acoustic features) while `raw` contains the raw transctripts, phonemes. `labels` are self-explanatory. Note that some datasets have more than just one set of annotations so `labels` could also give you multiple files.

Currently there's a caveat that the SDK will not automatically detect if you have downloaded the data already. In event of that it will throw a `RuntimeError`. We work around that by `try/except`. This is not ideal but it will work for now.

In [4]:
from constants import SDK_PATH, DATA_PATH, WORD_EMB_PATH, CACHE_PATH
import sys

SDK_PATH: C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [5]:


if SDK_PATH is None:
    print("SDK path is not specified! Please specify first in constants/paths.py")
    exit(0)
else:
    sys.path.append(SDK_PATH)
    print(f"SDK path is set to {SDK_PATH}")

SDK path is set to C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [6]:
import sys
for path in sys.path:
    print(path)
    

C:\Program Files\JetBrains\PyCharm 2023.3.2\plugins\python\helpers-pro\jupyter_debug
C:\Program Files\JetBrains\PyCharm 2023.3.2\plugins\python\helpers\pydev
C:\Users\Viki\Documents\Thesis\tryout3
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\python310.zip
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\DLLs
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib
C:\Users\Viki\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0
C:\Users\Viki\Documents\Thesis\tryout3\.venv

C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\win32
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\win32\lib
C:\Users\Viki\Documents\Thesis\tryout3\.venv\lib\site-packages\Pythonwin
C:\Users\Viki\Documents\Thesis\CMU-MultimodalSDK


In [7]:

import mmsdk
import os
import re
import numpy as np
from mmsdk import mmdatasdk as md
from subprocess import check_call, CalledProcessError

In [8]:




# create folders for storing the data
if not os.path.exists(DATA_PATH):
    check_call(' '.join(['mkdir', '-p', DATA_PATH]), shell=True)

# download highlevel features, low-level (raw) data and labels for the dataset MOSI
# if the files are already present, instead of downloading it you just load it yourself.
# here we use CMU_MOSI dataset as example.

DATASET = md.cmu_mosi

try:
    md.mmdataset(DATASET.highlevel, DATA_PATH)
except RuntimeError:
    print("High-level features have been downloaded previously.")

try:
    md.mmdataset(DATASET.raw, DATA_PATH)
except RuntimeError:
    print("Raw data have been downloaded previously.")
    
try:
    md.mmdataset(DATASET.labels, DATA_PATH)
except RuntimeError:
    print("Labels have been downloaded previously.")

Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWordVectors.csd
[91m[1m[2024-12-04 11:20:43.924] | Error   | [0mC:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWordVectors.csd file already exists ...
High-level features have been downloaded previously.
Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWords.csd
[91m[1m[2024-12-04 11:20:43.925] | Error   | [0mC:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/language/CMU_MOSI_TimestampedWords.csd file already exists ...
Raw data have been downloaded previously.
Normalized destination path: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu/CMU-MOSI/labels/CMU_MOSI_Opinion_Labels.csd
[91m[1m[2024-12-04

## Inspecting the downloaded files

We can print the files in the target data folder to see what files are there.

We can observe a bunch of files ending with `.csd` extension. This stands for ***computational sequences***, which is the underlying data structure for all features in the SDK. We will come back to that later when we load the data. For now we just print out what computational sequences we have downloaded.

In [9]:
# list the directory contents... let's see what features there are
data_files = os.listdir(DATA_PATH)
print('\n'.join(data_files))

embedding_and_mapping.pt
http__immortal.multicomp.cs.cmu.edu


## Loading a multimodal dataset

Loading the dataset is as simple as telling the SDK what are the features you need and where are their computational sequences. You can construct a dictionary with format `{feature_name: csd_path}` and feed it to `mmdataset` object in the SDK.

In [10]:
import os

folder_path = "C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data"
for root, dirs, files in os.walk(folder_path):
    print(f"Directory: {root}")
    for file in files:
        print(f"  {file}")


Directory: C:\Users\Viki\Documents\Thesis\tryout3\data
  embedding_and_mapping.pt
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic
  CMU_MOSI_COVAREP.csd
  CMU_MOSI_OpenSmile_EB10.csd
  CMU_MOSI_openSMILE_IS09.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\labels
  CMU_MOSI_Opinion_Labels.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\language
  CMU_MOSI_TimestampedPhones.csd
  CMU_MOSI_TimestampedWords.csd
  CMU_MOSI_TimestampedWordVectors.csd
Directory: C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual
  CMU_MOSI_Visual_Facet_41.csd
  CMU_MOSI_Visual_Facet_42.csd
  CMU_MOSI_Visual

In [11]:
# Define your different modalities - refer to the filenames of the CSD files
visual_field_Facet41 = 'CMU_MOSI_Visual_Facet_41'
visual_field_Facet42 = 'CMU_MOSI_Visual_Facet_42'
visual_field_OpenFace1 = 'CMU_MOSI_Visual_OpenFace_1'


# visual_field_OpenFace2 = 'CMU_MOSI_Visual_OpenFace_2'
# [2024-11-24 21:59:03.886] | Error   | C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_OpenFace_2.csd resource is not a valid hdf5 computational sequence format ...


acoustic_field_COVAREP = 'CMU_MOSI_COVAREP'
acoustic_field_OpenSmile_EB10 = 'CMU_MOSI_OpenSmile_EB10'
acoustic_field_OpenSmile_IS09 = 'CMU_MOSI_openSMILE_IS09'



text_field_Words = 'CMU_MOSI_TimestampedWords'
text_field_Phones = 'CMU_MOSI_TimestampedPhones'
text_field_WordVectors = 'CMU_MOSI_TimestampedWordVectors'




In [2]:
feature1=[
    text_field_Words,
    visual_field_Facet41, 
    acoustic_field_COVAREP,
]

recipe1 = {
    text_field_Words: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "language", text_field_Words) + '.csd',
    
    visual_field_Facet41: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_Facet41) + '.csd',
    
    acoustic_field_COVAREP: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "acoustic", acoustic_field_COVAREP) + '.csd',
}

NameError: name 'os' is not defined

In [12]:

# text_field = 'CMU_MOSI_ModifiedTimestampedWords'


# List of features
features = [
    text_field_Words,
    text_field_Phones,
    text_field_WordVectors,
    visual_field_Facet41, 
    visual_field_Facet42,
    visual_field_OpenFace1,
    # visual_field_OpenFace2,
    acoustic_field_COVAREP,
    acoustic_field_OpenSmile_IS09,
    acoustic_field_OpenSmile_EB10
]


# Recipe with correct subdirectory paths for each modality
recipe = {
    text_field_Words: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "language", text_field_Words) + '.csd',
    
    # not helpful
    text_field_Phones: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu","CMU-MOSI", "language", text_field_Phones) + '.csd',

    text_field_WordVectors: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "language", text_field_WordVectors) + '.csd',
    
    visual_field_Facet41: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_Facet41) + '.csd',
    
    visual_field_Facet42: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_Facet42) + '.csd',
    
    visual_field_OpenFace1: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_OpenFace1) + '.csd',
    
    
    #not taken into account cause not runnable - 
    # visual_field_OpenFace2: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "visual", visual_field_OpenFace2) + '.csd',
    
    acoustic_field_COVAREP: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "acoustic", acoustic_field_COVAREP) + '.csd',
    
    acoustic_field_OpenSmile_EB10: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "acoustic", acoustic_field_OpenSmile_EB10) + '.csd',
    
    # has same and less features as EB10
    acoustic_field_OpenSmile_IS09: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "acoustic", acoustic_field_OpenSmile_IS09) + '.csd'
}



print (recipe)
dataset = md.mmdataset(recipe)

{'CMU_MOSI_TimestampedWords': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\language\\CMU_MOSI_TimestampedWords.csd', 'CMU_MOSI_TimestampedPhones': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\language\\CMU_MOSI_TimestampedPhones.csd', 'CMU_MOSI_TimestampedWordVectors': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\language\\CMU_MOSI_TimestampedWordVectors.csd', 'CMU_MOSI_Visual_Facet_41': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\visual\\CMU_MOSI_Visual_Facet_41.csd', 'CMU_MOSI_Visual_Facet_42': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\visual\\CMU_MOSI_Visual_Facet_42.csd', 'CMU_MOSI_Visual_OpenFace_1': 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data\\http__immortal.multicomp.cs.cmu.edu\\CMU-MOSI\\visual\\CMU_MO

                                                                   

[92m[1m[2024-12-04 11:21:02.842] | Success | [0m<words> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:02.842] | Status  | [0mChecking the format of the metadata in <words> computational sequence ...
[92m[1m[2024-12-04 11:21:02.843] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\language\CMU_MOSI_TimestampedPhones.csd ...
[94m[1m[2024-12-04 11:21:02.867] | Status  | [0mChecking the integrity of the <phoneme> computational sequence ...
[94m[1m[2024-12-04 11:21:02.867] | Status  | [0mChecking the format of the data in <phoneme> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:02.921] | Success | [0m<phoneme> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:02.922] | Status  | [0mChecking the format of the metadata in <phoneme> computational sequence ...
[92m[1m[2024-12-04 11:21:02.923] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\language\CMU_MOSI_TimestampedWordVectors.csd ...
[94m[1m[2024-12-04 11:21:02.941] | Status  | [0mChecking the integrity of the <glove_vectors> computational sequence ...
[94m[1m[2024-12-04 11:21:02.941] | Status  | [0mChecking the format of the data in <glove_vectors> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.001] | Success | [0m<glove_vectors> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.001] | Status  | [0mChecking the format of the metadata in <glove_vectors> computational sequence ...
[92m[1m[2024-12-04 11:21:03.002] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_Facet_41.csd ...
[94m[1m[2024-12-04 11:21:03.028] | Status  | [0mChecking the integrity of the <FACET_4.1> computational sequence ...
[94m[1m[2024-12-04 11:21:03.028] | Status  | [0mChecking the format of the data in <FACET_4.1> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.092] | Success | [0m<FACET_4.1> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.092] | Status  | [0mChecking the format of the metadata in <FACET_4.1> computational sequence ...
[92m[1m[2024-12-04 11:21:03.108] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_Facet_42.csd ...
[94m[1m[2024-12-04 11:21:03.135] | Status  | [0mChecking the integrity of the <FACET_4.2> computational sequence ...
[94m[1m[2024-12-04 11:21:03.136] | Status  | [0mChecking the format of the data in <FACET_4.2> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.202] | Success | [0m<FACET_4.2> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.202] | Status  | [0mChecking the format of the metadata in <FACET_4.2> computational sequence ...
[92m[1m[2024-12-04 11:21:03.204] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\visual\CMU_MOSI_Visual_OpenFace_1.csd ...
[94m[1m[2024-12-04 11:21:03.224] | Status  | [0mChecking the integrity of the <OpenFace_1> computational sequence ...
[94m[1m[2024-12-04 11:21:03.224] | Status  | [0mChecking the format of the data in <OpenFace_1> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.278] | Success | [0m<OpenFace_1> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.279] | Status  | [0mChecking the format of the metadata in <OpenFace_1> computational sequence ...
[92m[1m[2024-12-04 11:21:03.280] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic\CMU_MOSI_COVAREP.csd ...
[94m[1m[2024-12-04 11:21:03.298] | Status  | [0mChecking the integrity of the <COVAREP> computational sequence ...
[94m[1m[2024-12-04 11:21:03.298] | Status  | [0mChecking the format of the data in <COVAREP> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.358] | Success | [0m<COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.358] | Status  | [0mChecking the format of the metadata in <COVAREP> computational sequence ...
[92m[1m[2024-12-04 11:21:03.359] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic\CMU_MOSI_OpenSmile_EB10.csd ...
[94m[1m[2024-12-04 11:21:03.381] | Status  | [0mChecking the integrity of the <OpenSmile_emobase2010> computational sequence ...
[94m[1m[2024-12-04 11:21:03.381] | Status  | [0mChecking the format of the data in <OpenSmile_emobase2010> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.439] | Success | [0m<OpenSmile_emobase2010> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.439] | Status  | [0mChecking the format of the metadata in <OpenSmile_emobase2010> computational sequence ...
[92m[1m[2024-12-04 11:21:03.442] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\acoustic\CMU_MOSI_openSMILE_IS09.csd ...
[94m[1m[2024-12-04 11:21:03.462] | Status  | [0mChecking the integrity of the <b'OpenSMILE'> computational sequence ...
[94m[1m[2024-12-04 11:21:03.462] | Status  | [0mChecking the format of the data in <b'OpenSMILE'> computational sequence ...


                                                                   

[92m[1m[2024-12-04 11:21:03.505] | Success | [0m<b'OpenSMILE'> computational sequence data in correct format.
[94m[1m[2024-12-04 11:21:03.506] | Status  | [0mChecking the format of the metadata in <b'OpenSMILE'> computational sequence ...
[92m[1m[2024-12-04 11:21:03.506] | Success | [0mDataset initialized successfully ... 






## A peek into the dataset

The multimodal dataset, after loaded, has the following hierarchy:


```
            computational_sequence_1 ---...
           /                                   ...
          /                                    /
         /                          first_video     features -- T X N array
        /                          /               /
dataset ---computational_sequence_2 -- second_video
        \                          \               \
         \                          third_video     intervals -- T X 2 array
          \                                    \...
           \
            computational_sequence_3 ---...
```

It looks like a nested dictionary and can be indexed as if it is a nested dictionary. A dataset contains multiple computational sequences whose key is the `text_field`, `visual_field`, `acoustic_field` as defined above. Each computational sequence, however, has multiple video IDs in it, and different computational sequences are supposed to have the same set of video IDs. Within each video, there are two arrays: `features` and `intervals`, denoting the feature values at each time step and the start and end timestamp for each step. We can take a look at its content.

In [20]:
print(list(dataset.keys()))
print("=" * 80)

print(list(dataset[visual_field_Facet41].keys()))
print("=" * 80)

some_id = list(dataset[visual_field_Facet41].keys())[15]
print(list(dataset[visual_field_Facet41][some_id].keys()))
print("=" * 80)

print(list(dataset[visual_field_Facet41][some_id]['intervals'].shape))
print("=" * 80)

print(list(dataset[visual_field_Facet41][some_id]['features'].shape))
print(list(dataset[text_field_Words][some_id]['features'].shape))
print(list(dataset[acoustic_field_COVAREP][some_id]['features'].shape))
print("Different modalities have different number of time steps!")



['CMU_MOSI_TimestampedWords', 'CMU_MOSI_TimestampedWordVectors', 'CMU_MOSI_Visual_Facet_41', 'CMU_MOSI_COVAREP']
['03bSnISJMiM', '0h-zjBukYpk', '1DmNV9C1hbY', '1iG0909rllw', '2WGyTLYerpo', '2iD-tVS8NPw', '5W7Z1C_fDaE', '6Egk_28TtTM', '6_0THN4chvY', '73jzhE8R1TQ', '7JsX8y1ysxY', '8OtFthrtaJM', '8d-gEyoeBzc', '8qrpnFRGt2A', '9J25DZhivz8', '9T9Hf74oK10', '9c67fiY0wGQ', '9qR7uwkblbs', 'Af8D0E4ZXaw', 'BI97DNYfe5I', 'BXuRRbG0Ugk', 'Bfr499ggo-0', 'BioHAh1qJAQ', 'BvYR0L6f2Ig', 'Ci-AH39fi3Y', 'Clx4VXItLTE', 'Dg_0XKD0Mf4', 'G-xst2euQUc', 'G6GlGvlkxAQ', 'GWuJjcEuzt8', 'HEsqda8_d0Q', 'I5y0__X72p0', 'Iu2PFX3z_1s', 'IumbAb8q2dM', 'Jkswaaud0hk', 'LSi-o-IrDMs', 'MLal-t_vJPM', 'Njd1F0vZSm4', 'Nzq88NnDkEk', 'OQvJTdtJ2H4', 'OtBXNcAL_lE', 'Oz06ZWiO20M', 'POKffnXeBds', 'PZ-lDQFboO8', 'QN9ZIUWUXsY', 'Qr1Ca94K55A', 'Sqr0AcuoNnk', 'TvyZBvOMOTc', 'VCslbP0mgZI', 'VbQk4H8hgr0', 'Vj1wYRQjB-o', 'W8NXH0Djyww', 'WKA5OygbEKI', 'X3j2zQgwYgE', 'ZAIRrfG22O0', 'ZUXBRvtny7o', '_dI--eQ6qVU', 'aiEXnCPZubE', 'atnd_PF-Lbs', '

In [21]:

print("=== Metadata FACE 41 Visual===\n")
print("Visual FACE 41 Metadata:", dataset[visual_field_Facet41].metadata)
print("\n")

print("=== Metadata COVAREP Acoustic===\n")
print("Acoustic COVAREP Metadata:", dataset[acoustic_field_COVAREP].metadata)
print("\n")

print("=== Metadata Words===\n")
print("Words Metadata:", dataset[text_field_Words].metadata)
print("\n")

print("=== Metadata WordVectors===\n")
print("WordVectors Metadata:", dataset[text_field_WordVectors].metadata)
print("\n")


=== Metadata FACE 41 Visual===

Visual FACE 41 Metadata: {'alignment compatible': True, 'computational sequence description': 'FACET 4.1 Visual Features for CMU-MOSI Dataset', 'computational sequence version': 1.0, 'contact': 'abagherz@andrew.cmu.edu', 'creator': 'Amir Zadeh', 'dataset bib citation': '@article{zadeh2016multimodal,title={Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages},author={Zadeh, Amir and Zellers, Rowan and Pincus, Eli and Morency, Louis-Philippe},journal={IEEE Intelligent Systems},volume={31},number={6},pages={82--88},year={2016},publisher={IEEE}}', 'dataset name': 'CMU-MOSI', 'dataset version': 1.0, 'dimension names': ['Face X', 'Face Y', 'Face Width', 'Face Height', 'angerEvidence', 'contemptEvidence', 'disgustEvidence', 'joyEvidence', 'fearEvidence', 'negativeEvidence', 'neutralEvidence', 'positiveEvidence', 'sadnessEvidence', 'surpriseEvidence', 'confusionEvidence', 'frustrationEvidence', 'angerIntensity', 'contemptIntensi

In [13]:
# List all keys in the visual and acoustic fields
# print("Keys in visual FACE41 field:", dataset[visual_field_Facet41].keys())
# print("Keys in visual FACE42 field:", dataset[visual_field_Facet41].keys())
# print("Keys in visual OpenFace field:", dataset[visual_field_OpenFace1].keys())
# 
# print("Keys in acousticC field:", dataset[acoustic_field_COVAREP].keys())
# print("Keys in acousticOEB field:", dataset[acoustic_field_OpenSmile_EB10].keys())
# print("Keys in acousticOIS field:", dataset[acoustic_field_OpenSmile_IS09].keys())

print("=== Metadata FACE 41 Visual===\n")
print("Visual FACE 41 Metadata:", dataset[visual_field_Facet41].metadata)
print("\n")

print("=== Metadata FACE 42 Visual===\n")
print("Visual FACE 42 Metadata:", dataset[visual_field_Facet42].metadata)
print("\n")

print("=== Metadata OpenFace Visual===\n")
print("Visual OpenFace Metadata:", dataset[visual_field_OpenFace1].metadata)
print("\n")

print("=== Metadata COVAREP Acoustic===\n")
print("Acoustic COVAREP Metadata:", dataset[acoustic_field_COVAREP].metadata)
print("\n")

print("=== Metadata OpenSmile EB10 Acoustic===\n")
print("Acoustic OpenSmile EB10 Metadata:", dataset[acoustic_field_OpenSmile_EB10].metadata)
print("\n")

print("=== Metadata OpenSmile IS09 Acoustic===\n")
print("Acoustic OpenSmile IS09 Metadata:", dataset[acoustic_field_OpenSmile_IS09].metadata)
print("\n")

print("=== Metadata Words===\n")
print("Words Metadata:", dataset[text_field_Words].metadata)
print("\n")

print("=== Metadata Phones===\n")
print("Phones Metadata:", dataset[text_field_Phones].metadata)
print("\n")

print("=== Metadata WordVectors===\n")
print("WordVectors Metadata:", dataset[text_field_WordVectors].metadata)
print("\n")

=== Metadata FACE 41 Visual===

Visual FACE 41 Metadata: {'alignment compatible': True, 'computational sequence description': 'FACET 4.1 Visual Features for CMU-MOSI Dataset', 'computational sequence version': 1.0, 'contact': 'abagherz@andrew.cmu.edu', 'creator': 'Amir Zadeh', 'dataset bib citation': '@article{zadeh2016multimodal,title={Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages},author={Zadeh, Amir and Zellers, Rowan and Pincus, Eli and Morency, Louis-Philippe},journal={IEEE Intelligent Systems},volume={31},number={6},pages={82--88},year={2016},publisher={IEEE}}', 'dataset name': 'CMU-MOSI', 'dataset version': 1.0, 'dimension names': ['Face X', 'Face Y', 'Face Width', 'Face Height', 'angerEvidence', 'contemptEvidence', 'disgustEvidence', 'joyEvidence', 'fearEvidence', 'negativeEvidence', 'neutralEvidence', 'positiveEvidence', 'sadnessEvidence', 'surpriseEvidence', 'confusionEvidence', 'frustrationEvidence', 'angerIntensity', 'contemptIntensi

In [15]:
import pandas as pd

In [17]:
fields_metadata = [
    {"Field": "Visual FACE 41", "Metadata": dataset[visual_field_Facet41].metadata},
    {"Field": "Visual FACE 42", "Metadata": dataset[visual_field_Facet42].metadata},
    {"Field": "Visual OpenFace", "Metadata": dataset[visual_field_OpenFace1].metadata},
    {"Field": "Acoustic COVAREP", "Metadata": dataset[acoustic_field_COVAREP].metadata},
    {"Field": "Acoustic OpenSmile EB10", "Metadata": dataset[acoustic_field_OpenSmile_EB10].metadata},
    {"Field": "Acoustic OpenSmile IS09", "Metadata": dataset[acoustic_field_OpenSmile_IS09].metadata},
    {"Field": "Words", "Metadata": dataset[text_field_Words].metadata},
    {"Field": "Phones", "Metadata": dataset[text_field_Phones].metadata},
    {"Field": "WordVectors", "Metadata": dataset[text_field_WordVectors].metadata},
]

# Normalize (flatten) the metadata dictionaries
normalized_data = []
for entry in fields_metadata:
    metadata_flat = pd.json_normalize(entry["Metadata"], sep="_")
    metadata_flat["Field"] = entry["Field"]
    normalized_data.append(metadata_flat)

# print(f"normalized metadata: {normalized_data}")

df_metadata = pd.concat(normalized_data, ignore_index=True)
df_metadata = df_metadata[["Field"] + [col for col in df_metadata.columns if col != "Field"]]

print(df_metadata)




# Save the DataFrame to a CSV file
csv_filename = "metadata_summary.csv"
df_metadata.to_csv(csv_filename, index=False)

print(f"Metadata successfully saved to {csv_filename}")

# 
# 
#remove \ for Acoustic OpenSmile EB10
#remove b" and " for Acoustic OpenSmile IS09

                     Field alignment compatible  \
0           Visual FACE 41                 True   
1           Visual FACE 42                 True   
2          Visual OpenFace                 True   
3         Acoustic COVAREP                 True   
4  Acoustic OpenSmile EB10                 True   
5  Acoustic OpenSmile IS09              b'True'   
6                    Words                False   
7                   Phones                False   
8              WordVectors                 True   

                  computational sequence description  \
0     FACET 4.1 Visual Features for CMU-MOSI Dataset   
1     FACET 4.2 Visual Features for CMU-MOSI Dataset   
2   OpenFace V1 Visual Features for CMU-MOSI Dataset   
3     COVAREP Acoustic Features for CMU-MOSI Dataset   
4  OpenSmile emobase2010 Acoustic Features for CM...   
5                                  b'MOSI openSMILE'   
6                Word sequences for CMU-MOSI Dataset   
7             Phoneme sequences for CMU-M

In [25]:
from functools import reduce

In [26]:
df_components = df_metadata[["Field", "dimension names"]]

# Define the replacements in a dictionary
replacements = {
    'b"': '',
    '"': '',
    '\'' : '',
    '[': '',
    ']': ''

}



def replace_strings(x):
    if isinstance(x, str):
        for old, new in replacements.items():
            x = x.replace(old, new)
    return x


# Apply the replacements using map
df_components = df_components.applymap(replace_strings)

print(df_components)

# Save the DataFrame to a CSV file
csv_filename = "componentsWithoutIS.csv"
df_components.to_csv(csv_filename, index=False)

              Field                                    dimension names
0    Visual FACE 41  [Face X, Face Y, Face Width, Face Height, ange...
1  Acoustic COVAREP  [F0, VUV, NAQ, QOQ, H1H2, PSP, MDQ, peakSlope,...
2             Words                                             [word]
3       WordVectors  [glove_vector, glove_vector, glove_vector, glo...


  df_components = df_components.applymap(replace_strings)


In [27]:
import pandas as pd
import csv

# Define the replacements in a dictionary
replacements = {
    'b"': '',
    '\"': '',
    '\'': '',
    '[': '',
    ']': ''
}

# Function to apply replacements
def replace_strings(x):
    if isinstance(x, str):
        for old, new in replacements.items():
            x = x.replace(old, new)
    return x

# Read the CSV file into a DataFrame
with open('componentsWithoutIS.csv', 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    rows = []
    for row in reader:
        new_row = []
        for element in row:
            # Split elements by comma and strip spaces
            split_elements = [e.strip() for e in element.split(',')]
            new_row.extend(split_elements)
        rows.append(new_row)

# Create DataFrame from the processed rows
df = pd.DataFrame(rows)

# Apply the replacements using applymap to the entire DataFrame
df = df.applymap(replace_strings)

# Save the modified DataFrame back to a CSV file
df.to_csv('componentsModifiedWithoutIS.csv', index=False)

  df = df.applymap(replace_strings)


In [28]:
import pandas as pd

# Load the dataset from a CSV file
# Replace 'your_file.csv' with the path to your actual file
df = pd.read_csv('componentsModifiedWithoutIS.csv', header=None)

# Flatten the DataFrame into a single list, keeping track of field positions
data = []
for col in df.columns:
    for row in df.index:
        value = df.iloc[row, col]
        if pd.notna(value):  # Exclude empty elements
            data.append((value.strip(), f"Row {row+1}, Column {col+1}"))

# Create a dictionary to count occurrences and track positions
duplicate_tracker = {}
for value, position in data:
    if value in duplicate_tracker:
        duplicate_tracker[value]['count'] += 1
        duplicate_tracker[value]['positions'].append(position)
    else:
        duplicate_tracker[value] = {'count': 1, 'positions': [position]}

# Write results to a text file
with open('outputDuplicatesWithoutIS.txt', 'w') as output_file:
    output_file.write("Duplicates Found:\n")
    for value, info in duplicate_tracker.items():
        if info['count'] > 1:
            output_file.write(
                f"{value}: {info['count']}; Positions: {', '.join(info['positions'])}\n"
            )

print("Duplicate analysis complete. Results saved to 'outputWithoutIS.txt'.")


Duplicate analysis complete. Results saved to 'outputWithoutIS.txt'.


## Alignment of multimodal time series

To work with multimodal time series that contains multiple views of data with different frequencies, we have to first align them to a ***pivot*** modality. The convention is to align to ***words***. Alignment groups feature vectors from other modalities into bins denoted by the timestamps of the pivot modality, and apply a certain processing function to each bin. We call this function ***collapse function***, because usually it is a pooling function that collapses multiple feature vectors from another modality into one single vector. This will give you sequences of same lengths in each modality (as the length of the pivot modality) for all videos.

Here we define our collapse funtion as simple averaging. We feed the function to the SDK when we invoke `align` method. Note that the SDK always expect collapse functions with two arguments: `intervals` and `features`. Even if you don't use intervals (as is in the case below) you still need to define your function in the following way.

***Note: Currently the SDK applies the collapse function to all modalities including the pivot, and obviously text modality cannot be "averaged", causing some errors. My solution is to define the avg function such that it averages the features when it can, and return the content as is when it cannot average.***

In [29]:
# we define a simple averaging function that does not depend on intervals
def avg(intervals: np.array, features: np.array) -> np.array:
    try:
        return np.average(features, axis=0)
    except:
        return features

# first we align to words with averaging, collapse_function receives a list of functions
dataset.align(text_field_Words, collapse_functions=[avg])

[94m[1m[2024-12-02 19:10:03.187] | Status  | [0mUnify was called ...
[92m[1m[2024-12-02 19:10:03.188] | Success | [0mUnify completed ...
[94m[1m[2024-12-02 19:10:03.188] | Status  | [0mPre-alignment based on <CMU_MOSI_TimestampedWords> computational sequence started ...
[94m[1m[2024-12-02 19:10:13.346] | Status  | [0mPre-alignment done for <CMU_MOSI_COVAREP> ...
[94m[1m[2024-12-02 19:10:15.113] | Status  | [0mPre-alignment done for <CMU_MOSI_TimestampedWordVectors> ...
[94m[1m[2024-12-02 19:10:17.409] | Status  | [0mPre-alignment done for <CMU_MOSI_Visual_Facet_41> ...
[94m[1m[2024-12-02 19:10:17.629] | Status  | [0mAlignment starting ...


Overall Progress:   0%|          | 0/93 [00:00<?, ? Computational Sequence Entries/s]
  0%|          | 0/464 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   0%|          | 0/464 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   5%|▌         | 25/464 [00:00<00:01, 238.23 Segments/s][A
Aligning 03bSnISJMiM:  11%|█         | 49/464 [00:00<00:01, 231.14 Segments/s][A
Aligning 03bSnISJMiM:  16%|█▌        | 73/464 [00:00<00:01, 228.16 Segments/s][A
Aligning 03bSnISJMiM:  21%|██        | 96/464 [00:00<00:01, 216.93 Segments/s][A
Aligning 03bSnISJMiM:  25%|██▌       | 118/464 [00:00<00:01, 213.21 Segments/s][A
Aligning 03bSnISJMiM:  30%|███       | 141/464 [00:00<00:01, 217.19 Segments/s][A
Aligning 03bSnISJMiM:  35%|███▌      | 163/464 [00:00<00:01, 210.27 Segments/s][A
Aligning 03bSnISJMiM:  41%|████      | 191/464 [00:00<00:01, 229.72 Segments/s][A
Aligning 03bSnISJMiM:  48%|████▊     | 223/464 [00:00<00:00, 254.26 Segments/s][A
Aligning 03bSnISJMiM:  55%|█████▌    | 257/464 [0

[92m[1m[2024-12-02 19:13:01.776] | Success | [0mAlignment to <CMU_MOSI_TimestampedWords> complete.
[94m[1m[2024-12-02 19:13:01.776] | Status  | [0mReplacing dataset content with aligned computational sequences
[92m[1m[2024-12-02 19:13:01.785] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWords> computational sequence.
[94m[1m[2024-12-02 19:13:01.786] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWords> computational sequence ...


                                                                                      

[92m[1m[2024-12-02 19:13:01.924] | Success | [0m<CMU_MOSI_TimestampedWords> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:01.924] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWords> computational sequence ...
[92m[1m[2024-12-02 19:13:01.924] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWordVectors> computational sequence.
[94m[1m[2024-12-02 19:13:01.924] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWordVectors> computational sequence ...


                                                                                      

[92m[1m[2024-12-02 19:13:02.054] | Success | [0m<CMU_MOSI_TimestampedWordVectors> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:02.054] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWordVectors> computational sequence ...
[92m[1m[2024-12-02 19:13:02.054] | Success | [0mInitialized empty <CMU_MOSI_Visual_Facet_41> computational sequence.
[94m[1m[2024-12-02 19:13:02.054] | Status  | [0mChecking the format of the data in <CMU_MOSI_Visual_Facet_41> computational sequence ...


                                                                                      

[92m[1m[2024-12-02 19:13:02.176] | Success | [0m<CMU_MOSI_Visual_Facet_41> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:02.176] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Visual_Facet_41> computational sequence ...
[92m[1m[2024-12-02 19:13:02.176] | Success | [0mInitialized empty <CMU_MOSI_COVAREP> computational sequence.
[94m[1m[2024-12-02 19:13:02.176] | Status  | [0mChecking the format of the data in <CMU_MOSI_COVAREP> computational sequence ...


                                                                                      

[92m[1m[2024-12-02 19:13:02.314] | Success | [0m<CMU_MOSI_COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:02.314] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_COVAREP> computational sequence ...


## Append annotations to the dataset and get the data points

Now that we have a preprocessed dataset, all we need to do is to apply annotations to the data. Annotations are also computational sequences, since they are also just some values distributed on different time spans (e.g 1-3s is 'angry', 12-26s is 'neutral'). Hence, we just add the label computational sequence to the dataset and then align to the labels. Since we (may) want to preserve the whole sequences, this time we don't specify any collapse functions when aligning. 

Note that after alignment, the keys in the dataset changes from `video_id` to `video_id[segment_no]`, because alignment will segment each datapoint based on the segmentation of the pivot modality (in this case, it is segmented based on labels, which is what we need, and yes, one code block ago they are segmented to word level, which I didn't show you).

***Important: DO NOT add the labels together at the beginning, the labels will be segmented during the first alignment to words. This also holds for any situation where you want to do multiple levels of alignment.***

In [30]:
label_field = 'CMU_MOSI_Opinion_Labels'

# we add and align to lables to obtain labeled segments
# this time we don't apply collapse functions so that the temporal sequences are preserved
label_recipe = {label_field: os.path.join(DATA_PATH, "http__immortal.multicomp.cs.cmu.edu", "CMU-MOSI", "labels", label_field) + '.csd'}
dataset.add_computational_sequences(label_recipe, destination=None)
dataset.align(label_field)

[92m[1m[2024-12-02 19:13:02.445] | Success | [0mComputational sequence read from file C:\Users\Viki\Documents\Thesis\tryout3\data\http__immortal.multicomp.cs.cmu.edu\CMU-MOSI\labels\CMU_MOSI_Opinion_Labels.csd ...
[94m[1m[2024-12-02 19:13:02.485] | Status  | [0mChecking the integrity of the <Opinion Segment Labels> computational sequence ...
[94m[1m[2024-12-02 19:13:02.485] | Status  | [0mChecking the format of the data in <Opinion Segment Labels> computational sequence ...


                                                                   

[92m[1m[2024-12-02 19:13:02.581] | Success | [0m<Opinion Segment Labels> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:02.583] | Status  | [0mChecking the format of the metadata in <Opinion Segment Labels> computational sequence ...
[94m[1m[2024-12-02 19:13:02.584] | Status  | [0mUnify was called ...




[92m[1m[2024-12-02 19:13:02.793] | Success | [0mUnify completed ...
[94m[1m[2024-12-02 19:13:02.797] | Status  | [0mPre-alignment based on <CMU_MOSI_Opinion_Labels> computational sequence started ...
[94m[1m[2024-12-02 19:13:03.032] | Status  | [0mPre-alignment done for <CMU_MOSI_COVAREP> ...
[94m[1m[2024-12-02 19:13:03.241] | Status  | [0mPre-alignment done for <CMU_MOSI_Visual_Facet_41> ...
[94m[1m[2024-12-02 19:13:03.560] | Status  | [0mPre-alignment done for <CMU_MOSI_TimestampedWordVectors> ...
[94m[1m[2024-12-02 19:13:03.756] | Status  | [0mPre-alignment done for <CMU_MOSI_TimestampedWords> ...
[94m[1m[2024-12-02 19:13:03.763] | Status  | [0mAlignment starting ...


Overall Progress:   0%|          | 0/93 [00:00<?, ? Computational Sequence Entries/s]
  0%|          | 0/13 [00:00<?, ? Segments/s][A
Aligning 03bSnISJMiM:   0%|          | 0/13 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/25 [00:00<?, ? Segments/s][A
Aligning 0h-zjBukYpk:   0%|          | 0/25 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/14 [00:00<?, ? Segments/s][A
Aligning 1DmNV9C1hbY:   0%|          | 0/14 [00:00<?, ? Segments/s][A
Overall Progress:   3%|▎         | 3/93 [00:00<00:03, 23.75 Computational Sequence Entries/s]
  0%|          | 0/30 [00:00<?, ? Segments/s][A
Aligning 1iG0909rllw:   0%|          | 0/30 [00:00<?, ? Segments/s][A
                                                                   [A
  0%|          | 0/63 [00:00<?, ? Segments/s][A
Aligning 2WGyTLYerpo:   0%|          | 0/63 [00:00<?, ? Segments/s][A
Alignin

[92m[1m[2024-12-02 19:13:08.820] | Success | [0mAlignment to <CMU_MOSI_Opinion_Labels> complete.
[94m[1m[2024-12-02 19:13:08.820] | Status  | [0mReplacing dataset content with aligned computational sequences
[92m[1m[2024-12-02 19:13:09.077] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWords> computational sequence.
[94m[1m[2024-12-02 19:13:09.077] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWords> computational sequence ...


                                                                     

[92m[1m[2024-12-02 19:13:09.088] | Success | [0m<CMU_MOSI_TimestampedWords> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:09.088] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWords> computational sequence ...
[92m[1m[2024-12-02 19:13:09.088] | Success | [0mInitialized empty <CMU_MOSI_TimestampedWordVectors> computational sequence.
[94m[1m[2024-12-02 19:13:09.088] | Status  | [0mChecking the format of the data in <CMU_MOSI_TimestampedWordVectors> computational sequence ...


                                                                     

[92m[1m[2024-12-02 19:13:09.097] | Success | [0m<CMU_MOSI_TimestampedWordVectors> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:09.097] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_TimestampedWordVectors> computational sequence ...
[92m[1m[2024-12-02 19:13:09.097] | Success | [0mInitialized empty <CMU_MOSI_Visual_Facet_41> computational sequence.
[94m[1m[2024-12-02 19:13:09.097] | Status  | [0mChecking the format of the data in <CMU_MOSI_Visual_Facet_41> computational sequence ...


                                                                     

[92m[1m[2024-12-02 19:13:09.106] | Success | [0m<CMU_MOSI_Visual_Facet_41> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:09.106] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Visual_Facet_41> computational sequence ...
[92m[1m[2024-12-02 19:13:09.107] | Success | [0mInitialized empty <CMU_MOSI_COVAREP> computational sequence.
[94m[1m[2024-12-02 19:13:09.107] | Status  | [0mChecking the format of the data in <CMU_MOSI_COVAREP> computational sequence ...


                                                                     

[92m[1m[2024-12-02 19:13:09.117] | Success | [0m<CMU_MOSI_COVAREP> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:09.117] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_COVAREP> computational sequence ...
[92m[1m[2024-12-02 19:13:09.117] | Success | [0mInitialized empty <CMU_MOSI_Opinion_Labels> computational sequence.
[94m[1m[2024-12-02 19:13:09.117] | Status  | [0mChecking the format of the data in <CMU_MOSI_Opinion_Labels> computational sequence ...


                                                                     

[92m[1m[2024-12-02 19:13:09.129] | Success | [0m<CMU_MOSI_Opinion_Labels> computational sequence data in correct format.
[94m[1m[2024-12-02 19:13:09.129] | Status  | [0mChecking the format of the metadata in <CMU_MOSI_Opinion_Labels> computational sequence ...




In [31]:
# check out what the keys look like now
print(list(dataset[text_field_Words].keys())[55])

1iG0909rllw[3]


In [32]:
num_segments = len(dataset[visual_field_Facet41].keys())  # Assuming all fields have the same segment keys
print(f"Number of segments: {num_segments}")


Number of segments: 2198


In [33]:
# Compute lengths of segments for each modality
segment_lengths_visual = {
    visual_field_Facet41: [dataset[visual_field_Facet41][seg]['features'].shape[0] for seg in dataset[visual_field_Facet41].keys()],
    visual_field_Facet42: [dataset[visual_field_Facet42][seg]['features'].shape[0] for seg in dataset[visual_field_Facet42].keys()],
    visual_field_OpenFace1: [dataset[visual_field_OpenFace1][seg]['features'].shape[0] for seg in dataset[visual_field_OpenFace1].keys()],
}

segment_lengths_acoustic = {
    acoustic_field_COVAREP: [dataset[acoustic_field_COVAREP][seg]['features'].shape[0] for seg in dataset[acoustic_field_COVAREP].keys()],
    acoustic_field_OpenSmile_EB10: [dataset[acoustic_field_OpenSmile_EB10][seg]['features'].shape[0] for seg in dataset[acoustic_field_OpenSmile_EB10].keys()],
}

segment_lengths_text = {
    text_field_Words: [dataset[text_field_Words][seg]['features'].shape[0] for seg in dataset[text_field_Words].keys()],
    text_field_WordVectors: [dataset[text_field_WordVectors][seg]['features'].shape[0] for seg in dataset[text_field_WordVectors].keys()],
}

# Calculate maximum and minimum lengths for each modality
max_length_visual = {modality: max(lengths) for modality, lengths in segment_lengths_visual.items()}
min_length_visual = {modality: min(lengths) for modality, lengths in segment_lengths_visual.items()}

max_length_acoustic = {modality: max(lengths) for modality, lengths in segment_lengths_acoustic.items()}
min_length_acoustic = {modality: min(lengths) for modality, lengths in segment_lengths_acoustic.items()}

max_length_text = {modality: max(lengths) for modality, lengths in segment_lengths_text.items()}
min_length_text = {modality: min(lengths) for modality, lengths in segment_lengths_text.items()}

# Print the results for each modality
for modality, max_len in max_length_visual.items():
    print(f"Visual modality {modality}: Max length = {max_len}, Min length = {min_length_visual[modality]}")

for modality, max_len in max_length_acoustic.items():
    print(f"Acoustic modality {modality}: Max length = {max_len}, Min length = {min_length_acoustic[modality]}")

for modality, max_len in max_length_text.items():
    print(f"Text modality {modality}: Max length = {max_len}, Min length = {min_length_text[modality]}")


[91m[1m[2024-12-02 19:13:09.233] | Error   | [0mComputational sequence does not exist ...


RuntimeError: Computational sequence does not exist ...

In [None]:
# Iterate over each segment in the dataset (assuming all modalities have the same segment keys)
for seg in dataset[visual_field_Facet41].keys():  
    # Compute the length for each modality and feature
    visual_length_Facet41 = dataset[visual_field_Facet41][seg]['features'].shape[0]
    visual_length_Facet42 = dataset[visual_field_Facet42][seg]['features'].shape[0]
    visual_length_OpenFace1 = dataset[visual_field_OpenFace1][seg]['features'].shape[0]
    
    acoustic_length_COVAREP = dataset[acoustic_field_COVAREP][seg]['features'].shape[0]
    acoustic_length_OpenSmile_EB10 = dataset[acoustic_field_OpenSmile_EB10][seg]['features'].shape[0]
    
    text_length_Words = dataset[text_field_Words][seg]['features'].shape[0]
    text_length_WordVectors = dataset[text_field_WordVectors][seg]['features'].shape[0]
    
    # Print the lengths for all features of the current segment
    print(f"Segment {seg}:")
    print(f"  Visual (Facet41) length = {visual_length_Facet41}")
    print(f"  Visual (Facet42) length = {visual_length_Facet42}")
    print(f"  Visual (OpenFace1) length = {visual_length_OpenFace1}")
    print(f"  Acoustic (COVAREP) length = {acoustic_length_COVAREP}")
    print(f"  Acoustic (OpenSmile_EB10) length = {acoustic_length_OpenSmile_EB10}")
    print(f"  Text (Words) length = {text_length_Words}")
    print(f"  Text (WordVectors) length = {text_length_WordVectors}")
    print('-' * 50)


In [None]:
# Extract unique video IDs
video_ids = set(seg.split("[")[0] for seg in dataset[visual_field_Facet41].keys())

# Count the number of unique video IDs
num_unique_ids = len(video_ids)

# Count segments for each video ID
video_segment_counts = {video_id: sum(seg.startswith(video_id) for seg in dataset[visual_field_Facet41].keys()) for video_id in video_ids}

# Print the total number of unique video IDs
print(f"Total number of unique video IDs: {num_unique_ids}\n")

# Print the segment counts for each video ID
for video_id, count in video_segment_counts.items():
    print(f"Video {video_id}: {count} segments")



## Splitting the dataset

Now it comes to our final step: splitting the dataset into train/dev/test splits. This code block is a bit long in itself, so be patience and step through carefully with the explanatory comments.

The SDK provides the splits in terms of video IDs (which video belong to which split), however, after alignment our dataset keys already changed from `video_id` to `video_id[segment_no]`. Hence, we need to extract the video ID when looping through the data to determine which split each data point belongs to.

In the following data processing, I also include instance-wise Z-normalization (subtract by mean and divide by standard dev) and converted words to unique IDs.

This example is based on PyTorch so I am using PyTorch related utils, but the same procedure should be easy to adapt to other frameworks.

In [None]:
# obtain the train/dev/test splits - these splits are based on video IDs
train_split = DATASET.standard_folds.standard_train_fold
dev_split = DATASET.standard_folds.standard_valid_fold
test_split = DATASET.standard_folds.standard_test_fold

# inspect the splits: they only contain video IDs
print(f"lengths: train {len(train_split)}, dev {len(dev_split)}, test {len(test_split)}\n")
print(train_split)
print(dev_split)
print(test_split)

In [None]:
# we can see they are in the format of 'video_id[segment_no]', but the splits was specified with video_id only
# we need to use regex or something to match the video IDs...
import torch
import torch.nn as nn

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm_notebook
from collections import defaultdict


In [None]:
import os
import matplotlib.pyplot as plt

def plot_hist(visual, acoustic, title="Segment"):
    # Create the folder if it doesn't exist
    folder_name = "Value distributions"
    os.makedirs(folder_name, exist_ok=True)
    
    # Plot the histograms
    plt.hist(visual.flatten(), bins=100, alpha=0.5, label='Visual')
    plt.hist(acoustic.flatten(), bins=100, alpha=0.5, label='Acoustic')
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.title(f"Value Distribution for {title}")
    plt.show()
    plt.close()  # Close the plot to free memory/

    # 
    # # Save the figure to the specified folder with the given title
    # file_name = f"value_distribution_{title}.png"
    # file_path = os.path.join(folder_name, file_name)
    # plt.savefig(file_path)




In [None]:
import numpy as np

# A sentinel epsilon for safe division, avoiding division by zero
EPS = 1e-8

word2id = defaultdict(lambda: len(word2id))
UNK = word2id['<unk>']
PAD = word2id['<pad>']
EOS = word2id['<eos>']
BOS = word2id['<bos>']
# SEP = word2id['<sep>']
DUMMY = word2id['<dummy>']

pattern = re.compile('(.*)\[.*\]')

In [None]:



# Assuming the input features are already defined elsewhere
# _words, _visual_Facet41, _visual_Facet42, _visual_OpenFace1, _acoustic_COVAREP, _acoustic_OpenSmile_EB10, _wordvectors
# Also assuming train_split, dev_split, test_split are already defined

# Placeholders for final train/dev/test dataset
train = []
dev = []
test = []

# Iterate over the segments in the dataset
num_drop = 0  # Counter to track the number of dropped data points

for segment in dataset[label_field].keys():
    # Get the video ID and features
    vid = re.search(pattern, segment).group(1)
    label = dataset[label_field][segment]['features']
    _words = dataset[text_field_Words][segment]['features']
    
    # Collect all visual and acoustic features
    _visual_Facet41 = dataset[visual_field_Facet41][segment]['features']
    # _visual_Facet42 = dataset[visual_field_Facet42][segment]['features']
    # _visual_OpenFace1 = dataset[visual_field_OpenFace1][segment]['features']
    
    _acoustic_COVAREP = dataset[acoustic_field_COVAREP][segment]['features']
    # _acoustic_OpenSmile_EB10 = dataset[acoustic_field_OpenSmile_EB10][segment]['features']
    
    # _wordvectors = dataset[text_field_WordVectors][segment]['features']

    # # Check if all modalities have the same number of elements (length of sequence)
    # if not (_words.shape[0] == _visual_Facet41.shape[0] == _visual_Facet42.shape[0] == _visual_OpenFace1.shape[0] == 
    #         _acoustic_COVAREP.shape[0] == _acoustic_OpenSmile_EB10.shape[0] == _wordvectors.shape[0]):
    #     print(f"Error: Inconsistent sequence lengths for segment {vid}")
    #     num_drop += 1
    #     continue  # Skip this segment and continue with the next one
    
    
     # Check if all modalities have the same number of elements (length of sequence)
    if not (_words.shape[0] == _visual_Facet41.shape[0] == _acoustic_COVAREP.shape[0]):
            #  == _wordvectors.shape[0]):
        print(f"Error: Inconsistent sequence lengths for segment {vid}")
        num_drop += 1
        continue  # Skip this segment and continue with the next one

    # Lists to hold the processed data for each modality
    words = []
    visual = []
    acoustic = []
    # wordvectors = []

    # Remove speech pauses (um, uhh, etc.)
    for i, word in enumerate(_words):
        if word[0] != b'sp':  # Remove speech pauses
            words.append(word2id[word[0].decode('utf-8')])  # Decode bytes to string and add to words
            
            # FIGURE OUT HERE WHAT YOU NEED TO DO - how do you work with the vectors?
            
            
            # Append visual features (check the shape of each feature)
            visual.append(_visual_Facet41[i, :])  # Facet41
            # visual.append(_visual_Facet42[i, :])  # Facet42
            # visual.append(_visual_OpenFace1[i, :])  # OpenFace1
            
            # Append acoustic features (check the shape of each feature)
            acoustic.append(_acoustic_COVAREP[i, :])  # COVAREP
            # acoustic.append(_acoustic_OpenSmile_EB10[i, :])  # OpenSmile_EB10
            
            # combined_acoustic = np.vstack((_acoustic_COVAREP[i, :], _acoustic_OpenSmile_EB10[i, :]))
            # acoustic.append(combined_acoustic)
            
            # Append word vectors
            # wordvectors.append(_wordvectors[i, :])  # Word vectors
            
            
    # LOOK AT THE SHAPES

    # Check the shapes of the collected features before converting to numpy arrays
    # print(f"Word vectors shape: {np.asarray(wordvectors).shape}")
    print(f"Words shape: {np.asarray(words).shape}")
    print(f"Acoustic shape: {np.asarray(acoustic).shape}")

    print(f"Visual shape: {np.asarray(visual).shape}")

    # Convert lists to numpy arrays
    words = np.asarray(words)
    visual = np.asarray(visual)
    acoustic = np.asarray(acoustic)
    # wordvectors = np.asarray(wordvectors)

    # Z-normalization for visual modality (across all visual features)
    std_dev_visual = np.std(visual, axis=0, keepdims=True)
    visual = np.nan_to_num((visual - visual.mean(0, keepdims=True)) / (EPS + std_dev_visual))
    visual[:, std_dev_visual.flatten() == 0] = EPS  # Safeguard for zero standard deviation

    # Z-normalization for acoustic modality (across all acoustic features)
    acoustic_mean = np.nanmean(acoustic, axis=0, keepdims=True)
    std_dev_acoustic = np.nanstd(acoustic, axis=0, keepdims=True)
    std_dev_acoustic = np.nan_to_num(std_dev_acoustic)
    std_dev_acoustic[std_dev_acoustic == 0] = EPS  # Safeguard for zero standard deviation

    acoustic = np.nan_to_num((acoustic - acoustic_mean) / (EPS + std_dev_acoustic))

    # # Z-normalization for word vectors
    # wordvectors_mean = np.nanmean(wordvectors, axis=0, keepdims=True)
    # std_dev_wordvectors = np.nanstd(wordvectors, axis=0, keepdims=True)
    # std_dev_wordvectors = np.nan_to_num(std_dev_wordvectors)
    # std_dev_wordvectors[std_dev_wordvectors == 0] = EPS  # Safeguard for zero standard deviation
    # wordvectors = np.nan_to_num((wordvectors - wordvectors_mean) / (EPS + std_dev_wordvectors))

    # Ensure no NaN or Inf values in the data
    if np.any(np.isnan(acoustic)) or np.any(np.isinf(acoustic)):
        print(f"Error in acoustic data for segment {vid}")
    if np.any(np.isnan(visual)) or np.any(np.isinf(visual)):
        print(f"Error in visual data for segment {vid}")
    if np.any(np.isnan(words)) or np.any(np.isinf(words)):
        print(f"Error in wordvectors data for segment {vid}")

    # Add the data to the appropriate split
    if vid in train_split:
        train.append(((words, visual, acoustic), label, segment))
    elif vid in dev_split:
        dev.append(((words, visual, acoustic), label, segment))
    elif vid in test_split:
        test.append(((words, visual, acoustic), label, segment))
    else:
        print(f"Found video that doesn't belong to any splits: {vid}")

# Output number of dropped datapoints
print(f"Total number of {num_drop} datapoints have been dropped.")
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")


In [None]:
def plot_hist2(wordvectors, acoustic, title="Segment"):
    # Create the folder if it doesn't exist
    folder_name = "Value distributions"
    os.makedirs(folder_name, exist_ok=True)
    
    # Plot the histograms
    plt.hist(wordvectors.flatten(), bins=100, alpha=0.5, label='Wordvectors')
    plt.hist(acoustic.flatten(), bins=100, alpha=0.5, label='Acoustic')
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.title(f"Value Distribution for {title}")
    plt.show()
    plt.close() 

In [None]:
# vocab_size = len(word2id)
# print(f"Vocabulary size: {vocab_size}")
# 
# 
# # Assuming the input features are already defined elsewhere
# # _words, _visual_Facet41, _visual_Facet42, _visual_OpenFace1, _acoustic_COVAREP, _acoustic_OpenSmile_EB10, _wordvectors
# # Also assuming train_split, dev_split, test_split are already defined
# 
# # Placeholders for final train/dev/test dataset
# train = []
# dev = []
# test = []
# 
# # Iterate over the segments in the dataset
# num_drop = 0  # Counter to track the number of dropped data points
# 
# for segment in dataset[label_field].keys():
#     # Get the video ID and features
#     vid = re.search(pattern, segment).group(1)
#     label = dataset[label_field][segment]['features']
#     _words = dataset[text_field_Words][segment]['features']
#     
#     # Collect all visual and acoustic features
#     _visual_Facet41 = dataset[visual_field_Facet41][segment]['features']
#     # _visual_Facet42 = dataset[visual_field_Facet42][segment]['features']
#     # _visual_OpenFace1 = dataset[visual_field_OpenFace1][segment]['features']
#     
#     _acoustic_COVAREP = dataset[acoustic_field_COVAREP][segment]['features']
#     # _acoustic_OpenSmile_EB10 = dataset[acoustic_field_OpenSmile_EB10][segment]['features']
#     
#     _wordvectors = dataset[text_field_WordVectors][segment]['features']
# 
#     # # Check if all modalities have the same number of elements (length of sequence)
#     # if not (_words.shape[0] == _visual_Facet41.shape[0] == _visual_Facet42.shape[0] == _visual_OpenFace1.shape[0] == 
#     #         _acoustic_COVAREP.shape[0] == _acoustic_OpenSmile_EB10.shape[0] == _wordvectors.shape[0]):
#     #     print(f"Error: Inconsistent sequence lengths for segment {vid}")
#     #     num_drop += 1
#     #     continue  # Skip this segment and continue with the next one
#     
#     
#      # Check if all modalities have the same number of elements (length of sequence)
#     if not (_words.shape[0] == _visual_Facet41.shape[0] == _acoustic_COVAREP.shape[0]
#                     == _wordvectors.shape[0]):
#         print(f"Error: Inconsistent sequence lengths for segment {vid}")
#         num_drop += 1
#         continue  # Skip this segment and continue with the next one
# 
#     # Lists to hold the processed data for each modality
#     words = []
#     visual = []
#     acoustic = []
#     wordvectors = []
# 
#     # Remove speech pauses (um, uhh, etc.)
#     for i, word in enumerate(_words):
#         if word[0] != b'sp':  # Remove speech pauses
#             words.append(word2id[word[0].decode('utf-8')])  # Decode bytes to string and add to words
#             
#             # FIGURE OUT HERE WHAT YOU NEED TO DO - how do you work with the vectors?
#             
#             
#             # Append visual features (check the shape of each feature)
#             visual.append(_visual_Facet41[i, :])  # Facet41
#             # visual.append(_visual_Facet42[i, :])  # Facet42
#             # visual.append(_visual_OpenFace1[i, :])  # OpenFace1
#             
#             # Append acoustic features (check the shape of each feature)
#             acoustic.append(_acoustic_COVAREP[i, :])  # COVAREP
#             # acoustic.append(_acoustic_OpenSmile_EB10[i, :])  # OpenSmile_EB10
#             
#             # combined_acoustic = np.vstack((_acoustic_COVAREP[i, :], _acoustic_OpenSmile_EB10[i, :]))
#             # acoustic.append(combined_acoustic)
#             
#             # Append word vectors
#             wordvectors.append(_wordvectors[i, :])  # Word vectors
#             
#             
#     # LOOK AT THE SHAPES
# 
#     # Check the shapes of the collected features before converting to numpy arrays
#     # print(f"Word vectors shape: {np.asarray(wordvectors).shape}")
#     print(f"Words shape: {np.asarray(words).shape}")
#     print(f"WordVectors shape: {np.asarray(wordvectors).shape}")
#     print(f"Acoustic shape: {np.asarray(acoustic).shape}")
# 
#     print(f"Visual shape: {np.asarray(visual).shape}")
# 
#     # Convert lists to numpy arrays
#     words = np.asarray(words)
#     visual = np.asarray(visual)
#     acoustic = np.asarray(acoustic)
#     wordvectors = np.asarray(wordvectors)
# 
#     # Z-normalization for visual modality (across all visual features)
#     std_dev_visual = np.std(visual, axis=0, keepdims=True)
#     visual = np.nan_to_num((visual - visual.mean(0, keepdims=True)) / (EPS + std_dev_visual))
#     visual[:, std_dev_visual.flatten() == 0] = EPS  # Safeguard for zero standard deviation
# 
#     # Z-normalization for acoustic modality (across all acoustic features)
#     acoustic_mean = np.nanmean(acoustic, axis=0, keepdims=True)
#     std_dev_acoustic = np.nanstd(acoustic, axis=0, keepdims=True)
#     std_dev_acoustic = np.nan_to_num(std_dev_acoustic)
#     std_dev_acoustic[std_dev_acoustic == 0] = EPS  # Safeguard for zero standard deviation
# 
#     acoustic = np.nan_to_num((acoustic - acoustic_mean) / (EPS + std_dev_acoustic))
# 
#     # Z-normalization for word vectors
#     wordvectors_mean = np.nanmean(wordvectors, axis=0, keepdims=True)
#     std_dev_wordvectors = np.nanstd(wordvectors, axis=0, keepdims=True)
#     std_dev_wordvectors = np.nan_to_num(std_dev_wordvectors)
#     std_dev_wordvectors[std_dev_wordvectors == 0] = EPS  # Safeguard for zero standard deviation
#     wordvectors = np.nan_to_num((wordvectors - wordvectors_mean) / (EPS + std_dev_wordvectors))
#     
#     # plot_hist2(wordvectors, acoustic)
# 
#     # Ensure no NaN or Inf values in the data
#     if np.any(np.isnan(acoustic)) or np.any(np.isinf(acoustic)):
#         print(f"Error in acoustic data for segment {vid}")
#     if np.any(np.isnan(visual)) or np.any(np.isinf(visual)):
#         print(f"Error in visual data for segment {vid}")
#     if np.any(np.isnan(words)) or np.any(np.isinf(words)):
#         print(f"Error in wordvectors data for segment {vid}")
# 
#     # Add the data to the appropriate split
#     if vid in train_split:
#         train.append(((words, visual, acoustic), label, segment))
#     elif vid in dev_split:
#         dev.append(((words, visual, acoustic), label, segment))
#     elif vid in test_split:
#         test.append(((words, visual, acoustic), label, segment))
#     else:
#         print(f"Found video that doesn't belong to any splits: {vid}")
# 
# # Output number of dropped datapoints
# print(f"Total number of {num_drop} datapoints have been dropped.")
# # Check how many words are in the vocabulary
# vocab_size = len(word2id)
# print(f"Vocabulary size: {vocab_size}")


In [None]:
import random
# Ensure the train dataset has elements before sampling
if len(train) > 0:
    # Randomly sample one processed segment
    sample = random.choice(train)

    # Extract components
    (words, visual, acoustic, wordvectors), label, segment = sample

    # Display the details
    print(f"Segment: {segment}")
    print(f"Label: {label}")
    print(f"Words (sample, first 10 if too long): {words[:10] if len(words) > 10 else words}")
    print(f"Words shape: {words.shape}")
    print(f"Visual shape: {visual.shape}")
    print(f"Acoustic shape: {acoustic.shape}")
    print(f"WordVectors shape: {wordvectors.shape}")
else:
    print("The train dataset is empty. Check the preprocessing loop for issues.")


In [33]:
# turn off the word2id - define a named function here to allow for pickling
def return_unk():
    return UNK
word2id.default_factory = return_unk

## Inspect the dataset

Now that we have loaded the data, we can check the sizes of each split, data point shapes, vocabulary size, etc.

Frame Index and Time: frameIndex and frameTime indicate the time alignment of the features.
Loudness features (pcm_loudness_sma_*): These are statistical attributes (e.g., mean, standard deviation, kurtosis, and quartiles) related to loudness.
MFCC features (pcm_fftMag_mfcc_sma*): These are Mel-Frequency Cepstral Coefficients and their derived statistics.
Log Mel Frequency Band features (logMelFreqBand_sma*): These represent the energy in specific mel-scale frequency bands.
Regression coefficients and errors (*_linregc1, *_linregc2, *_linregerrA, etc.): These describe trends and variations in features across frames.



Voicing and Pitch (voicingFinalUnclipped):

Parameters such as sma_maxPos, sma_minPos, sma_amean, and sma_stddev indicate characteristics of voicing activity (presence of voice or sound).
The skewness, kurtosis, and percentiles reflect the distribution of these features over time.
Linreg coefficients (linregc1, linregc2) suggest trends over the analyzed time window (e.g., rising or falling pitch).
Loudness (pcm_loudness_sma):

Metrics such as de_maxPos, de_minPos, de_amean highlight loudness levels, which are important for detecting intensity and energy in speech.
Statistical properties like stddev, skewness, and kurtosis describe the variability and asymmetry of loudness.
Percentile and quartile-based features offer insight into loudness thresholds (e.g., how much time was spent above a certain loudness level).
Mel-Frequency Cepstral Coefficients (MFCCs):

Features such as pcm_fftMag_mfcc_sma_deX for X = 0 to 14 represent frequency-domain characteristics of the audio signal.
These coefficients are crucial in audio and speech processing, capturing the timbre and spectral properties of sound.

Up-Level Time Metrics: (upleveltime75, upleveltime90) indicate how long the signal remains above certain thresholds, reflecting sustained vocal effort or loudness.
Percentile Features: Help capture the extremes and typical ranges of the data.
IQR (Interquartile Range): Measures variability and could reflect vocal modulation or consistency.


lspFreq_sma_deX:

These features pertain to Line Spectral Pairs (LSP) frequencies derived from speech signals. They describe spectral envelope characteristics and are sensitive to phoneme-level variations.
X indicates different LSP frequency bands (de0, de1, ..., de7) or derivations of them.
Common statistical summaries:
maxPos/minPos: Time position of the maximum/minimum values in the speech segment.
amean: Arithmetic mean (average) value of the feature over the segment.
linregc1/linregc2: Linear regression coefficients describing the trend of the feature (e.g., increasing or decreasing).
linregerrA/Q: Absolute or quadratic error of the regression model—indicating fit quality.
stddev: Standard deviation, showing variation or spread of the feature.
skewness/kurtosis: Higher-order statistical moments—describe asymmetry (skewness) or "peakedness" (kurtosis) of the distribution.
quartile1/2/3: 25th, 50th (median), and 75th percentiles, splitting the data distribution into quartiles.
iqrX-Y: Interquartile range between quartileX and quartileY.
percentile1.0/percentile99.0: Values below which 1% or 99% of the data lie.
upleveltime75/90: Percentage of time the feature exceeds 75% or 90% of its range.
F0finEnv_sma_de:

Represents the pitch envelope or fundamental frequency (F0) final value’s dynamics. The sma suffix indicates smoothing (simple moving average), and de represents derived (e.g., delta) features.
Includes the same statistical parameters as lspFreq.
voicingFinalUnclipped_sma_de:

Measures the degree of voicing in the speech (whether vocal folds are vibrating or not).
"Unclipped" indicates no thresholding applied to separate voiced/unvoiced segments.
Provides detailed analysis using similar statistical measures.
F0final_sma and F0final_sma_de:

Final pitch values (smoothed or derived) from the speech signal, critical for prosody and intonation analysis.
These features include mean, variability, skewness, and trends of the pitch over time.
jitterLocal_sma and jitterDDP_sma:

These are measures of pitch perturbation or variability:
JitterLocal: Variability of pitch period duration, indicating instability in vocal fold vibration.
JitterDDP (Difference of Differences of Periods): More refined pitch irregularity metric.
shimmerLocal_sma:

A measure of amplitude perturbation, reflecting variability in speech loudness (e.g., voice breaks or instability).
Turn_numOnsets and Turn_duration:

Turn_numOnsets: Number of distinct voiced segments or syllables within a turn (speaking segment).
Turn_duration: Duration of the entire turn, often used to measure speaking rate.



## Collate function in PyTorch

Collate functions are functions used by PyTorch dataloader to gather batched data from dataset. It loads multiple data points from an iterable dataset object and put them in a certain format. Here we just use the lists we've constructed as the dataset and assume PyTorch dataloader will operate on that.

In [None]:
def multi_collate(batch):
    '''
    Collate functions assume batch = [Dataset[i] for i in index_set]
    '''
    # for later use we sort the batch in descending order of length
    batch = sorted(batch, key=lambda x: x[0][0].shape[0], reverse=True)
    
    # get the data out of the batch - use pad sequence util functions from PyTorch to pad things
    labels = torch.cat([torch.from_numpy(sample[1]) for sample in batch], dim=0)
    sentences = pad_sequence([torch.LongTensor(sample[0][0]) for sample in batch], padding_value=PAD)
    visual = pad_sequence([torch.FloatTensor(sample[0][1]) for sample in batch])
    acoustic = pad_sequence([torch.FloatTensor(sample[0][2]) for sample in batch])
    
    # lengths are useful later in using RNNs
    lengths = torch.LongTensor([sample[0][0].shape[0] for sample in batch])
    return sentences, visual, acoustic, labels, lengths

# construct dataloaders, dev and test could use around ~X3 times batch size since no_grad is used during eval
batch_sz = 56
train_loader = DataLoader(train, shuffle=True, batch_size=batch_sz, collate_fn=multi_collate)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate)
test_loader = DataLoader(test, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate)

# let's create a temporary dataloader just to see how the batch looks like
temp_loader = iter(DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate))
batch = next(temp_loader)

print(batch[0].shape) # word vectors, padded to maxlen
print(batch[1].shape) # visual features
print(batch[2].shape) # acoustic features
print(batch[3]) # labels
print(batch[4]) # lengths

In [202]:
# Check how many words are in the vocabulary
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 2733


In [203]:
# Let's actually inspect the transcripts to ensure it's correct
id2word = {v:k for k, v in word2id.items()}
examine_target = train
idx = np.random.randint(0, len(examine_target))
print(' '.join(list(map(lambda x: id2word[x], examine_target[idx][0][0].tolist()))))


# print(' '.join(examine_target[idx][0]))
print(examine_target[idx][1]) #label
print(examine_target[idx][2]) #segment

the the whole movie i was you know thinking this as bad as ive heard and it um
[[1.2]]
Dg_0XKD0Mf4[11]


In [204]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}


# Specify how many examples to examine
num_examples = len(train)  # Ensure we don't exceed the dataset size


max_length = 0
min_length = float('inf')

for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], train[idx][0][0].tolist()))
    label = train[idx][1]  # Label
    segment = train[idx][2]  # Segment

    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability
    if length==1:
        example = idx+1

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")
print(f"Example with length 1: {example}")




Example 1:
Text: anyhow it was really good
Length: 5
Label: [[2.4]]
Segment: 03bSnISJMiM[0]
----------------------------------------
Example 2:
Text: they didnt really do a whole bunch of background info on why she has to fight and be prepared
Length: 19
Label: [[-0.8]]
Segment: 03bSnISJMiM[1]
----------------------------------------
Example 3:
Text: i mean they did a little bit of it
Length: 9
Label: [[-1.]]
Segment: 03bSnISJMiM[2]
----------------------------------------
Example 4:
Text: but not a whole bunch
Length: 5
Label: [[-1.75]]
Segment: 03bSnISJMiM[3]
----------------------------------------
Example 5:
Text: and they i guess
Length: 4
Label: [[0.]]
Segment: 03bSnISJMiM[4]
----------------------------------------
Example 6:
Text: they live up with more
Length: 5
Label: [[0.]]
Segment: 03bSnISJMiM[5]
----------------------------------------
Example 7:
Text: and but besides that it was all over pretty good
Length: 10
Label: [[0.8]]
Segment: 03bSnISJMiM[6]
-----------------------

In [None]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}

# Specify how many examples to examine
num_examples = len(dev)  # Ensure we don't exceed the dataset size

max_length = 0
min_length = float('inf')

for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], dev[idx][0][0].tolist()))
    label = dev[idx][1]  # Label
    segment = dev[idx][2]  # Segment

    # Calculate the length of the words (number of words)
    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")



In [None]:
# Reverse mapping from word IDs to words
id2word = {v: k for k, v in word2id.items()}

# Specify how many examples to examine
num_examples = len(test)  # Ensure we don't exceed the dataset size

max_length = 0
min_length = float('inf')


for idx in range(num_examples):
    # Convert word IDs to words
    words = ' '.join(map(lambda x: id2word[x], test[idx][0][0].tolist()))
    label = test[idx][1]  # Label
    segment = test[idx][2]  # Segment

    length = len(words.split())  # Word count, not character count

    # Track the max and min lengths
    max_length = max(max_length, length)
    min_length = min(min_length, length)

    # Display the information
    print(f"Example {idx+1}:")
    print(f"Text: {words}")
    print(f"Length: {length}")
    print(f"Label: {label}")
    print(f"Segment: {segment}")
    print("-" * 40)  # Separator for readability
    if length == 1:
        example = idx+1

# After the loop, print the max and min lengths
print(f"Maximum length: {max_length}")
print(f"Minimum length: {min_length}")
print(f"example with length 1: {example}")


## Define a multimodal model

Here we show a simple example of late-fusion LSTM. Late-fusion refers to combining the features from different modalities at the final prediction stage, without introducing any interactions between them before that.

In [None]:
# let's define a simple model that can deal with multimodal variable length sequence
class LFLSTM(nn.Module):
    def __init__(self, input_sizes, hidden_sizes, fc1_size, output_size, dropout_rate):
        super(LFLSTM, self).__init__()
        self.input_size = input_sizes
        self.hidden_size = hidden_sizes
        self.fc1_size = fc1_size
        self.output_size = output_size
        self.dropout_rate = dropout_rate
        
        # defining modules - two layer bidirectional LSTM with layer norm in between
        self.embed = nn.Embedding(len(word2id), input_sizes[0])
        self.trnn1 = nn.LSTM(input_sizes[0], hidden_sizes[0], bidirectional=True)
        self.trnn2 = nn.LSTM(2*hidden_sizes[0], hidden_sizes[0], bidirectional=True)
        
        self.vrnn1 = nn.LSTM(input_sizes[1], hidden_sizes[1], bidirectional=True)
        self.vrnn2 = nn.LSTM(2*hidden_sizes[1], hidden_sizes[1], bidirectional=True)
        
        self.arnn1 = nn.LSTM(input_sizes[2], hidden_sizes[2], bidirectional=True)
        self.arnn2 = nn.LSTM(2*hidden_sizes[2], hidden_sizes[2], bidirectional=True)

        self.fc1 = nn.Linear(sum(hidden_sizes)*4, fc1_size)
        self.fc2 = nn.Linear(fc1_size, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.tlayer_norm = nn.LayerNorm((hidden_sizes[0]*2,))
        self.vlayer_norm = nn.LayerNorm((hidden_sizes[1]*2,))
        self.alayer_norm = nn.LayerNorm((hidden_sizes[2]*2,))
        self.bn = nn.BatchNorm1d(sum(hidden_sizes)*4)

        
    def extract_features(self, sequence, lengths, rnn1, rnn2, layer_norm):
        packed_sequence = pack_padded_sequence(sequence, lengths)
        packed_h1, (final_h1, _) = rnn1(packed_sequence)
        padded_h1, _ = pad_packed_sequence(packed_h1)
        normed_h1 = layer_norm(padded_h1)
        packed_normed_h1 = pack_padded_sequence(normed_h1, lengths)
        _, (final_h2, _) = rnn2(packed_normed_h1)
        return final_h1, final_h2

        
    def fusion(self, sentences, visual, acoustic, lengths):
        batch_size = lengths.size(0)
        sentences = self.embed(sentences)
        
        # extract features from text modality
        final_h1t, final_h2t = self.extract_features(sentences, lengths, self.trnn1, self.trnn2, self.tlayer_norm)
        
        # extract features from visual modality
        final_h1v, final_h2v = self.extract_features(visual, lengths, self.vrnn1, self.vrnn2, self.vlayer_norm)
        
        # extract features from acoustic modality
        final_h1a, final_h2a = self.extract_features(acoustic, lengths, self.arnn1, self.arnn2, self.alayer_norm)

        
        # simple late fusion -- concatenation + normalization
        h = torch.cat((final_h1t, final_h2t, final_h1v, final_h2v, final_h1a, final_h2a),
                       dim=2).permute(1, 0, 2).contiguous().view(batch_size, -1)
        return self.bn(h)

    def forward(self, sentences, visual, acoustic, lengths):
        batch_size = lengths.size(0)
        h = self.fusion(sentences, visual, acoustic, lengths)
        h = self.fc1(h)
        h = self.dropout(h)
        h = self.relu(h)
        o = self.fc2(h)
        return o

## Load pretrained embeddings

We define a function for loading pretrained word embeddings stored in GloVe-style file. Contextualized embeddings obviously cannot be stored and loaded this way, though.

In [None]:
import tqdm
from tqdm import tqdm_notebook

In [None]:
# define a function that loads data from GloVe-like embedding files

# 2196017 is the vocab size of GloVe here.

def load_emb(w2i, path_to_embedding, embedding_size=300, embedding_vocab=2196017, init_emb=None):
    if init_emb is None:
        emb_mat = np.random.randn(len(w2i), embedding_size)
    else:
        emb_mat = init_emb
    f = open(path_to_embedding, 'r', encoding='utf-8', errors='replace')
    found = 0
    for line in tqdm_notebook(f, total=embedding_vocab):
        try:
            content = line.strip().split()
            vector = np.asarray(list(map(lambda x: float(x), content[-300:])))
            word = ' '.join(content[:-300])
            if word in w2i:
                idx = w2i[word]
                emb_mat[idx, :] = vector
                found += 1
        except ValueError as e:
            print(f"Skipping invalid line: {line}")
        
    print(f"Found {found} words in the embedding file.")
    return torch.tensor(emb_mat).float()


## Training a model

Next we train a model. We use Adam with gradient clipping and weight decay for training, and our loss here is Mean Absolute Error (MOSI is a regression dataset). We exclude the embeddings from trainable computation graph to prevent overfitting. We also apply a early-stopping scheme with learning rate annealing based on validation loss.

In [None]:

from torch.optim import Adam, SGD
from sklearn.metrics import accuracy_score

In [3]:
import os
path = 'C:\\Users\\Viki\\Documents\\Thesis\\tryout3\\data'
if os.access(path, os.R_OK):
    print("Directory is readable")
else:
    print("No read permission")
if os.access(path, os.W_OK):
    print("Directory is writable")
else:
    print("No write permission")
    

Directory is readable
Directory is writable


In [51]:


torch.manual_seed(123)
torch.cuda.manual_seed_all(123)

CUDA = torch.cuda.is_available()
MAX_EPOCH = 1000

text_size = 300
visual_size = 47
acoustic_size = 74

# define some model settings and hyper-parameters
input_sizes = [text_size, visual_size, acoustic_size]
hidden_sizes = [int(text_size * 1.5), int(visual_size * 1.5), int(acoustic_size * 1.5)]
fc1_size = sum(hidden_sizes) // 2
dropout = 0.25
output_size = 1
curr_patience = patience = 8
num_trials = 3
grad_clip_value = 1.0
weight_decay = 0.1

if os.path.exists(CACHE_PATH):
    pretrained_emb, word2id = torch.load(CACHE_PATH)
elif WORD_EMB_PATH is not None:
    pretrained_emb = load_emb(word2id, WORD_EMB_PATH)
    torch.save((pretrained_emb, word2id), CACHE_PATH)
else:
    pretrained_emb = None

model = LFLSTM(input_sizes, hidden_sizes, fc1_size, output_size, dropout)
if pretrained_emb is not None:
    model.embed.weight.data = pretrained_emb
model.embed.requires_grad = False
optimizer = Adam([param for param in model.parameters() if param.requires_grad], weight_decay=weight_decay)

if CUDA:
    model.cuda()
criterion = nn.L1Loss(reduction='sum')
criterion_test = nn.L1Loss(reduction='sum')
best_valid_loss = float('inf')
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
lr_scheduler.step() # for some reason it seems the StepLR needs to be stepped once first
train_losses = []
valid_losses = []
for e in range(MAX_EPOCH):
    model.train()
    train_iter = tqdm_notebook(train_loader)
    train_loss = 0.0
    for batch in train_iter:
        model.zero_grad()
        t, v, a, y, l = batch
        batch_size = t.size(0)
        if CUDA:
            t = t.cuda()
            v = v.cuda()
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(t, v, a, l)
        # print(f"batch_pred: {y_tilde.shape}")
        # print(f"labels: {y.shape}")
        loss = criterion(y_tilde, y)
        loss.backward()
        torch.nn.utils.clip_grad_value_([param for param in model.parameters() if param.requires_grad], grad_clip_value)
        optimizer.step()
        train_iter.set_description(f"Epoch {e}/{MAX_EPOCH}, current batch loss: {round(loss.item()/batch_size, 4)}")
        train_loss += loss.item()
    train_loss = train_loss / len(train)
    train_losses.append(train_loss)
    print(f"Training loss: {round(train_loss, 4)}")

    model.eval()
    with torch.no_grad():
        valid_loss = 0.0
        for batch in dev_loader:
            model.zero_grad()
            t, v, a, y, l = batch
            if CUDA:
                t = t.cuda()
                v = v.cuda()
                a = a.cuda()
                y = y.cuda()
                l = l.cuda()
            y_tilde = model(t, v, a, l)
            loss = criterion(y_tilde, y)
            valid_loss += loss.item()
    
    valid_loss = valid_loss/len(dev)
    valid_losses.append(valid_loss)
    print(f"Validation loss: {round(valid_loss, 4)}")
    print(f"Current patience: {curr_patience}, current trial: {num_trials}.")
    if valid_loss <= best_valid_loss:
        best_valid_loss = valid_loss
        print("Found new best model on dev set!")
        torch.save(model.state_dict(), 'model.std')
        torch.save(optimizer.state_dict(), 'optim.std')
        curr_patience = patience
    else:
        curr_patience -= 1
        if curr_patience <= -1:
            print("Running out of patience, loading previous best model.")
            num_trials -= 1
            curr_patience = patience
            model.load_state_dict(torch.load('model.std'))
            optimizer.load_state_dict(torch.load('optim.std'))
            lr_scheduler.step()
            print(f"Current learning rate: {optimizer.state_dict()['param_groups'][0]['lr']}")
    
    if num_trials <= 0:
        print("Running out of patience, early stopping.")
        break

model.load_state_dict(torch.load('model.std'))
y_true = []
y_pred = []
model.eval()
with torch.no_grad():
    test_loss = 0.0
    for batch in test_loader:
        model.zero_grad()
        t, v, a, y, l = batch
        if CUDA:
            t = t.cuda()
            v = v.cuda()
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(t, v, a, l)
        loss = criterion_test(y_tilde, y)
        y_true.append(y_tilde.detach().cpu().numpy())
        y_pred.append(y.detach().cpu().numpy())
        test_loss += loss.item()
print(f"Test set performance: {test_loss/len(test)}")
y_true = np.concatenate(y_true, axis=0)
y_pred = np.concatenate(y_pred, axis=0)
                  
y_true_bin = y_true >= 0
y_pred_bin = y_pred >= 0
bin_acc = accuracy_score(y_true_bin, y_pred_bin)
print(f"Test set accuracy is {bin_acc}")

  pretrained_emb, word2id = torch.load(CACHE_PATH)
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  train_iter = tqdm_notebook(train_loader)


  0%|          | 0/23 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [37]:
# # ARE
# 
# #-*- coding: utf-8 -*-
# 
# """
# what    : Single Encoder Model for audio
# """
# import tensorflow as tf
# from tensorflow.contrib import rnn
# from tensorflow.contrib.rnn import DropoutWrapper 
# 
# from tensorflow.core.framework import summary_pb2
# from random import shuffle
# from project_config import *
# 
# class SingleEncoderModelAudio:
# 
#     def __init__(self, batch_size,
#                  encoder_size,
#                  num_layer, lr,
#                  hidden_dim,
#                  dr):
# 
#         self.batch_size = batch_size
#         self.encoder_size = encoder_size
#         self.num_layers = num_layer
#         self.lr = lr
#         self.hidden_dim = hidden_dim
#         self.dr = dr
# 
#         self.encoder_inputs = []
#         self.encoder_seq_length =[]
#         self.y_labels =[]
# 
#         self.M = None
#         self.b = None
# 
#         self.y = None
#         self.optimizer = None
# 
#         self.batch_loss = None
#         self.loss = 0
#         self.batch_prob = None
# 
#         # for global counter
#         self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
# 
# 
#     def _create_placeholders(self):
#         print '[launch-audio] placeholders'
#         with tf.name_scope('audio_placeholder'):
# 
#             self.encoder_inputs  = tf.placeholder(tf.float32, shape=[self.batch_size, self.encoder_size, N_AUDIO_MFCC], name="encoder")  # [batch, time_step, audio]
#             self.encoder_seq     = tf.placeholder(tf.int32, shape=[self.batch_size], name="encoder_seq")   # [batch] - valid audio step
#             self.encoder_prosody = tf.placeholder(tf.float32, shape=[self.batch_size, N_AUDIO_PROSODY], name="encoder_prosody")   
#             self.y_labels        = tf.placeholder(tf.float32, shape=[self.batch_size, N_CATEGORY], name="label")
#             self.dr_prob         = tf.placeholder(tf.float32, name="dropout")
# 
#     # cell instance
#     def gru_cell(self):
#         return tf.contrib.rnn.GRUCell(num_units=self.hidden_dim)
# 
# 
#     # cell instance with drop-out wrapper applied
#     def gru_drop_out_cell(self):
#         return tf.contrib.rnn.DropoutWrapper(self.gru_cell(), input_keep_prob=self.dr_prob, output_keep_prob=self.dr_prob)                    
# 
# 
#     def test_cross_entropy_with_logit(self, logits, labels):
#         x = logits
#         z = labels
#         return tf.maximum(x, 0) - x * z + tf.log(1 + tf.exp(-tf.abs(x)))
# 
# 
#     def _create_gru_model(self):
#         print '[launch-audio] create gru cell'
# 
#         with tf.name_scope('audio_RNN') as scope:
# 
#             with tf.variable_scope("audio_GRU", reuse=False, initializer=tf.orthogonal_initializer()):
# 
#                 cells_en = tf.contrib.rnn.MultiRNNCell( [ self.gru_drop_out_cell() for _ in range(self.num_layers) ] )
# 
#                 (self.outputs_en, last_states_en) = tf.nn.dynamic_rnn(
#                                                     cell=cells_en,
#                                                     inputs= self.encoder_inputs,
#                                                     dtype=tf.float32,
#                                                     sequence_length=self.encoder_seq,
#                                                     time_major=False)
# 
#                 self.final_encoder = last_states_en[-1]
# 
#         self.final_encoder_dimension   = self.hidden_dim
# 
# 
#     def _add_prosody(self):
#         print '[launch-audio] add prosody feature, dim: ' + str(N_AUDIO_PROSODY)
#         self.final_encoder = tf.concat( [self.final_encoder, self.encoder_prosody], axis=1 )
#         self.final_encoder_dimension = self.hidden_dim + N_AUDIO_PROSODY
# 
# 
#     def _create_output_layers(self):
#         print '[launch-audio] create output projection layer'        
# 
#         with tf.name_scope('audio_output_layer') as scope:
# 
#             self.M = tf.Variable(tf.random_uniform([self.final_encoder_dimension, N_CATEGORY],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                 trainable=True,
#                                                 name="similarity_matrix")
# 
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32), 
#                                                  trainable=True, 
#                                                  name="output_bias")
# 
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
# 
#         with tf.name_scope('loss') as scope:
# 
#             self.batch_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.batch_pred, labels=self.y_labels )
#             self.loss = tf.reduce_mean( self.batch_loss  )
# 
# 
#     def _create_output_layers_for_multi(self):
#         print '[launch-audio] create output projection layer for multi'        
# 
#         with tf.name_scope('audio_output_layer') as scope:
# 
#             self.M = tf.Variable(tf.random_uniform([self.final_encoder_dimension, (self.final_encoder_dimension/2)],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                 trainable=True,
#                                                 name="similarity_matrix")
# 
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32), 
#                                                  trainable=True, 
#                                                  name="output_bias")
# 
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
# 
# 
#     def _create_optimizer(self):
#         print '[launch-audio] create optimizer'
# 
#         with tf.name_scope('audio_optimizer') as scope:
#             opt_func = tf.train.AdamOptimizer(learning_rate=self.lr)
#             gvs = opt_func.compute_gradients(self.loss)
#             capped_gvs = [(tf.clip_by_value(t=grad, clip_value_min=-10, clip_value_max=10), var) for grad, var in gvs]
#             self.optimizer = opt_func.apply_gradients(grads_and_vars=capped_gvs, global_step=self.global_step)
# 
# 
#     def _create_summary(self):
#         print '[launch-audio] create summary'
# 
#         with tf.name_scope('summary'):
#             tf.summary.scalar('mean_loss', self.loss)
#             self.summary_op = tf.summary.merge_all()
# 
# 
#     def build_graph(self):
#         self._create_placeholders()
#         self._create_gru_model()
#         self._add_prosody()
#         self._create_output_layers()
#         self._create_optimizer()
#         self._create_summary()

In [None]:
def multi_collate_acoustic(batch):
    '''
    Collate function for acoustic data only. Batch will be sorted based on the sequence length of the acoustic features.
    '''
    # Sort batch in descending order based on the length of the acoustic feature sequence
    batch = sorted(batch, key=lambda x: x[0][2].shape[0], reverse=True)
    
    # Extract labels and acoustic features from the batch
    labels = torch.cat([torch.from_numpy(sample[1]) for sample in batch], dim=0).float()
    acoustic = pad_sequence([torch.FloatTensor(sample[0][2]) for sample in batch], batch_first=True)
    
    # Sequence lengths (useful for RNNs)
    lengths = torch.LongTensor([sample[0][2].shape[0] for sample in batch])
    return acoustic, labels, lengths


batch_sz = 56
train_loader = DataLoader(train, shuffle=True, batch_size=batch_sz, collate_fn=multi_collate_acoustic)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate_acoustic)
test_loader = DataLoader(test, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate_acoustic)

# let's create a temporary dataloader just to see how the batch looks like
temp_loader = iter(DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate_acoustic))
batch = next(temp_loader)

# print(batch[0].shape) # word vectors, padded to maxlen
print(batch[0].shape) # acoustic features
print(batch[1]) # labels
print(batch[2]) # lengths

In [8]:
from torch import optim
from torch.utils.tensorboard import SummaryWriter

class SingleEncoderModelAudio(nn.Module):
    def __init__(self, input_size, hidden_dim, num_layers, dropout_rate,num_categories, output_size):
        super(SingleEncoderModelAudio, self).__init__()
        
        self.input_size = input_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout_rate = dropout_rate    
        self.num_categories = num_categories
        self.output_size = output_size
        # GRU layer
        self.gru = nn.GRU(input_size=self.input_size, 
                          hidden_size=self.hidden_dim,  # Correct parameter is hidden_size
                          num_layers=self.num_layers, 
                          batch_first=True,  # Input is [batch_size, seq_length, input_size]
                          dropout=self.dropout_rate if self.num_layers > 1 else 0)

        # Fully connected output layer
        self.fc1 = nn.Linear((self.hidden_dim), num_categories)
        self.fc2 = nn.Linear(num_categories, output_size)


    
    def forward(self, x, lengths):
        batch_size = lengths.size(0)

        # GRU forward pass
        _, hidden = self.gru(x)  # `hidden` shape: (num_layers, batch_size, hidden_dim)
        hidden = hidden[-1]  # Use the last hidden state
        
        # Fully connected layers
        fc1_out = self.fc1(hidden)  # Output of shape [batch_size, num_categories]
        output = self.fc2(fc1_out)  # Output of shape [batch_size, output_size]
        
        return output


    def compute_loss(self, batch_pred, y_labels):
        # Use Mean Squared Error loss for regression
        loss_fn = nn.MSELoss()
        loss = loss_fn(batch_pred, y_labels)
        return loss


    def create_optimizer(self, lr):
        # Optimizer (Adam) with the specified learning rate
        optimizer = optim.Adam(self.parameters(), lr=lr)
        return optimizer



In [9]:


torch.manual_seed(123)
torch.cuda.manual_seed_all(123)

CUDA = torch.cuda.is_available()
MAX_EPOCH = 1000

input_size = 74  # Example: Number of features per time step (adjust as per your data)
hidden_sizes = int(input_size)  # Number of GRU units
num_layers = 2  # Number of layers in GRU
num_categories = 7  # Number of categories for classification (adjust as needed)
output_size = 1  # Output size (e.g., 1 for binary classification)


fc1_size = hidden_sizes
dropout = 0.5
curr_patience = patience = 8
num_trials = 3
grad_clip_value = 1.0
weight_decay = 0.1

print("Model ???")
model = SingleEncoderModelAudio(input_size=input_size, 
                                hidden_dim=hidden_sizes,
                                num_layers=num_layers,
                                dropout_rate=dropout,
                                num_categories=num_categories,
                                output_size=output_size)
print("Model created")

# model.embed.requires_grad = False
optimizer = model.create_optimizer(lr=0.001)
print("Optimizer created")


if CUDA:
    model.cuda()
    
criterion = nn.MSELoss(reduction='sum')
criterion_test = nn.MSELoss(reduction='sum')

best_valid_loss = float('inf')

lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
lr_scheduler.step() # for some reason it seems the StepLR needs to be stepped once first

train_losses = []
valid_losses = []
for e in range(MAX_EPOCH):
    model.train()
    train_iter = tqdm_notebook(train_loader)
    train_loss = 0.0
    
    for batch in train_iter:
        model.zero_grad()
        a, y, l = batch
        batch_size = a.size(0)
        if CUDA:
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        
        y_tilde = model(a, l)
        # print(f"batch_pred: {y_tilde.shape}")
        # print(f"labels: {y.shape}")
        
        loss = criterion(y_tilde, y)
        loss.backward()
        torch.nn.utils.clip_grad_value_([param for param in model.parameters() if param.requires_grad], grad_clip_value)
        optimizer.step()
        train_iter.set_description(f"Epoch {e}/{MAX_EPOCH}, current batch loss: {round(loss.item()/batch_size, 4)}")
        train_loss += loss.item()
    train_loss = train_loss / len(train)
    train_losses.append(train_loss)
    print(f"Training loss: {round(train_loss, 4)}")
    
    # print("LOOP COMPLETED")

    model.eval()
    with torch.no_grad():
        valid_loss = 0.0
        for batch in dev_loader:
            model.zero_grad()
            a, y, l = batch
            if CUDA:
                a = a.cuda()
                y = y.cuda()
                l = l.cuda()
            y_tilde = model(a, l)
            loss = criterion(y_tilde, y)
            valid_loss += loss.item()
    
    valid_loss = valid_loss/len(dev)
    valid_losses.append(valid_loss)
    print(f"Validation loss: {round(valid_loss, 4)}")
    print(f"Current patience: {curr_patience}, current trial: {num_trials}.")
    if valid_loss <= best_valid_loss:
        best_valid_loss = valid_loss
        print("Found new best model on dev set!")
        torch.save(model.state_dict(), 'model.std')
        torch.save(optimizer.state_dict(), 'optim.std')
        curr_patience = patience
    else:
        curr_patience -= 1
        if curr_patience <= -1:
            print("Running out of patience, loading previous best model.")
            num_trials -= 1
            curr_patience = patience
            model.load_state_dict(torch.load('model.std'))
            optimizer.load_state_dict(torch.load('optim.std'))
            lr_scheduler.step()
            print(f"Current learning rate: {optimizer.state_dict()['param_groups'][0]['lr']}")
    
    if num_trials <= 0:
        print("Running out of patience, early stopping.")
        break

model.load_state_dict(torch.load('model.std'))
y_true = []
y_pred = []
model.eval()
with torch.no_grad():
    test_loss = 0.0
    for batch in test_loader:
        model.zero_grad()
        a, y, l = batch
        if CUDA:
            a = a.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(a, l)
        loss = criterion_test(y_tilde, y)
        y_true.append(y_tilde.detach().cpu().numpy())
        y_pred.append(y.detach().cpu().numpy())
        test_loss += loss.item()
print(f"Test set performance: {test_loss/len(test)}")
y_true = np.concatenate(y_true, axis=0)
y_pred = np.concatenate(y_pred, axis=0)
                  
y_true_bin = y_true >= 0
y_pred_bin = y_pred >= 0
bin_acc = accuracy_score(y_true_bin, y_pred_bin)
print(f"Test set accuracy is {bin_acc}")


Model ???
Model created
Optimizer created




NameError: name 'train_loader' is not defined

In [ ]:
# # TRE
# 
# #-*- coding: utf-8 -*-
# 
# """
# what    : Single Encoder Model for text
# """
# import tensorflow as tf
# from tensorflow.contrib import rnn
# from tensorflow.contrib.rnn import DropoutWrapper 
# 
# from tensorflow.core.framework import summary_pb2
# from random import shuffle
# import numpy as np
# from project_config import *
# 
# class SingleEncoderModelText:
# 
#     def __init__(self, dic_size,
#                  use_glove,
#                  batch_size,
#                  encoder_size,
#                  num_layer, lr,
#                  hidden_dim,
#                  dr):
# 
#         self.dic_size = dic_size
#         self.use_glove = use_glove
#         self.batch_size = batch_size
#         self.encoder_size = encoder_size
#         self.num_layers = num_layer
#         self.lr = lr
#         self.hidden_dim = hidden_dim
#         self.dr = dr
# 
#         self.encoder_inputs = []
#         self.encoder_seq_length =[]
#         self.y_labels =[]
# 
#         self.M = None
#         self.b = None
# 
#         self.y = None
#         self.optimizer = None
# 
#         self.batch_loss = None
#         self.loss = 0
#         self.batch_prob = None
# 
#         if self.use_glove == 1:
#             self.embed_dim = 300
#         else:
#             self.embed_dim = DIM_WORD_EMBEDDING
# 
#         # for global counter
#         self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
# 
# 
#     def _create_placeholders(self):
#         print '[launch-text] placeholders'
#         with tf.name_scope('text_placeholder'):
# 
#             self.encoder_inputs  = tf.placeholder(tf.int32, shape=[self.batch_size, self.encoder_size], name="encoder")  # [batch,time_step]
#             self.encoder_seq     = tf.placeholder(tf.int32, shape=[self.batch_size], name="encoder_seq")   # [batch] - valid word step
#             self.y_labels        = tf.placeholder(tf.float32, shape=[self.batch_size, N_CATEGORY], name="label")
#             self.dr_prob         = tf.placeholder(tf.float32, name="dropout")
# 
#              # for using pre-trained embedding
#             self.embedding_placeholder = tf.placeholder(tf.float32, shape=[self.dic_size, self.embed_dim], name="embedding_placeholder")
# 
#     def _create_embedding(self):
#         print '[launch-text] create embedding'
#         with tf.name_scope('embed_layer'):
#             self.embed_matrix = tf.Variable(tf.random_normal([self.dic_size, self.embed_dim],
#                                                             mean=0.0,
#                                                             stddev=0.01,
#                                                             dtype=tf.float32,                                                             
#                                                             seed=None),
#                                                             trainable = EMBEDDING_TRAIN,
#                                                             name='embed_matrix')
# 
#             self.embed_en       = tf.nn.embedding_lookup(self.embed_matrix, self.encoder_inputs, name='embed_encoder')
# 
# 
#     def _use_external_embedding(self):
#         if self.use_glove == 1:
#             print '[launch-text] use pre-trained embedding'
#             self.embedding_init = self.embed_matrix.assign(self.embedding_placeholder)
# 
# 
#     # cell instance
#     def gru_cell(self):
#         return tf.contrib.rnn.GRUCell(num_units=self.hidden_dim)
# 
# 
#     # cell instance with drop-out wrapper applied
#     def gru_drop_out_cell(self):
#         return tf.contrib.rnn.DropoutWrapper(self.gru_cell(), input_keep_prob=self.dr_prob, output_keep_prob=self.dr_prob)                    
# 
# 
#     def test_cross_entropy_with_logit(self, logits, labels):
#         x = logits
#         z = labels
#         return tf.maximum(x, 0) - x * z + tf.log(1 + tf.exp(-tf.abs(x)))
# 
# 
#     def _create_gru_model(self):
#         print '[launch-text] create gru cell'
# 
#         with tf.name_scope('text_RNN') as scope:
# 
#             with tf.variable_scope("text_GRU", reuse=False, initializer=tf.orthogonal_initializer()):
# 
#                 cells_en = tf.contrib.rnn.MultiRNNCell( [ self.gru_drop_out_cell() for _ in range(self.num_layers) ] )
# 
#                 (self.outputs_en, last_states_en) = tf.nn.dynamic_rnn(
#                                                     cell=cells_en,
#                                                     inputs= self.embed_en,
#                                                     dtype=tf.float32,
#                                                     sequence_length=self.encoder_seq,
#                                                     time_major=False)
# 
#                 self.final_encoder = last_states_en[-1]
# 
#         self.final_encoder_dimension   = self.hidden_dim
# 
# 
#     def _create_output_layers(self):
#         print '[launch-text] create output projection layer'        
# 
#         with tf.name_scope('text_output_layer') as scope:
# 
#             self.M = tf.Variable(tf.random_uniform([self.final_encoder_dimension, N_CATEGORY],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                  trainable=True,
#                                                  name="similarity_matrix")
# 
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32),
#                                                  trainable=True,
#                                                  name="output_bias")
# 
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
# 
#         with tf.name_scope('loss') as scope:
# 
#             self.batch_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.batch_pred, labels=self.y_labels )
#             self.loss = tf.reduce_mean( self.batch_loss  )
# 
# 
#     def _create_output_layers_for_multi(self):
#         print '[launch-text] create output projection layer for multi'        
# 
#         with tf.name_scope('text_output_layer') as scope:
# 
#             self.M = tf.Variable(tf.random_uniform([self.final_encoder_dimension, (self.final_encoder_dimension/2)],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                  trainable=True,
#                                                  name="similarity_matrix")
# 
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32),
#                                                  trainable=True,
#                                                  name="output_bias")
# 
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
# 
# 
#     def _create_optimizer(self):
#         print '[launch-text] create optimizer'
# 
#         with tf.name_scope('text_optimizer') as scope:
#             opt_func = tf.train.AdamOptimizer(learning_rate=self.lr)
#             gvs = opt_func.compute_gradients(self.loss)
#             capped_gvs = [(tf.clip_by_value(t=grad, clip_value_min=-10, clip_value_max=10), var) for grad, var in gvs]
#             self.optimizer = opt_func.apply_gradients(grads_and_vars=capped_gvs, global_step=self.global_step)
# 
# 
#     def _create_summary(self):
#         print '[launch-text] create summary'
# 
#         with tf.name_scope('summary'):
#             tf.summary.scalar('mean_loss', self.loss)
#             self.summary_op = tf.summary.merge_all()
# 
# 
#     def build_graph(self):
#         self._create_placeholders()
#         self._create_embedding()
#         self._use_external_embedding()
#         self._create_gru_model()
#         self._create_output_layers()
#         self._create_optimizer()
#         self._create_summary()

In [34]:
# define a function that loads data from GloVe-like embedding files

# 2196017 is the vocab size of GloVe here.

def load_emb(w2i, path_to_embedding, embedding_size=300, embedding_vocab=2196017, init_emb=None):
    print("Len w2i start:", len(w2i))
    if init_emb is None:
        emb_mat = np.random.randn(len(w2i), embedding_size)
    else:
        emb_mat = init_emb
    f = open(path_to_embedding, 'r', encoding='utf-8', errors='replace')
    found = 0
    for line in tqdm_notebook(f, total=embedding_vocab):
        try:
            content = line.strip().split()
            vector = np.asarray(list(map(lambda x: float(x), content[-300:])))
            word = ' '.join(content[:-300])
            if word in w2i:
                idx = w2i[word]
                emb_mat[idx, :] = vector
                found += 1
        except ValueError as e:
            print(f"Skipping invalid line: {line}")
    
    print("Len w2i end:", len(w2i))

    print(f"Found {found} words in the embedding file.")
    return torch.tensor(emb_mat).float()

# After processing the entire file, it returns the embedding matrix as a PyTorch tensor.



In [35]:
# if os.path.exists(CACHE_PATH):
#     pretrained_emb, word2id = torch.load(CACHE_PATH)
# elif WORD_EMB_PATH is not None:
#     pretrained_emb = load_emb(word2id, WORD_EMB_PATH)
#     torch.save((pretrained_emb, word2id), CACHE_PATH)
# else:
#     pretrained_emb = None
# 
# model = #here I should have the Single Encoder for Text Model
# if pretrained_emb is not None:
#     model.embed.weight.data = pretrained_emb
# model.embed.requires_grad = False
# optimizer = Adam([param for param in model.parameters() if param.requires_grad], weight_decay=weight_decay)
# 
# embedding_matrix = model.embed.weight.data
# print(embedding_matrix.shape)
# 
# embedding_matrix_np = embedding_matrix.cpu().numpy()
# print(embedding_matrix_np[:10])  # First 10 rows
# 
# word = "example"
# word_idx = word2id[word]
# word_embedding = embedding_matrix[word_idx]
# print(f"Embedding for '{word}': {word_embedding}")


In [36]:
# import matplotlib.pyplot as plt
# import seaborn as sns
# 
# plt.figure(figsize=(18, 14))  # Set the figure size to 12x8 inches
# sns.heatmap(embedding_matrix_np[:10], annot=False)  # Visualize first 10 embeddings
# plt.xlabel("Dimensions")
# plt.ylabel("Words (Index)")
# plt.title("Embedding Matrix Heatmap")
# plt.show()

In [37]:
# from sklearn.decomposition import PCA
# 
# # Reduce dimensionality of the first 100 embeddings
# pca = PCA(n_components=2)
# reduced_embeddings = pca.fit_transform(embedding_matrix_np[:200])
# 
# plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
# for i, word in enumerate(list(word2id.keys())[:200]):
#     plt.annotate(word, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
# plt.show()
# 

In [38]:
# import numpy as np
# np.save("embedding_matrix.npy", embedding_matrix_np)


In [39]:
# embedding_matrix_loaded = np.load("embedding_matrix.npy")


In [40]:
def multi_collate_textual(batch):
    '''
    Collate functions assume batch = [Dataset[i] for i in index_set]
    '''
    # for later use we sort the batch in descending order of length
    batch = sorted(batch, key=lambda x: x[0][0].shape[0], reverse=True)
    
    # get the data out of the batch - use pad sequence util functions from PyTorch to pad things
    labels = torch.cat([torch.from_numpy(sample[1]) for sample in batch], dim=0).float()
    sentences = pad_sequence([torch.LongTensor(sample[0][0]) for sample in batch], padding_value=PAD, batch_first=True)
    
    lengths = torch.LongTensor([sample[0][0].shape[0] for sample in batch])
    return sentences, labels, lengths
        

batch_sz = 56
train_loader = DataLoader(train, shuffle=True, batch_size=batch_sz, collate_fn=multi_collate_textual)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate_textual)
test_loader = DataLoader(test, shuffle=False, batch_size=batch_sz*3, collate_fn=multi_collate_textual)

# let's create a temporary dataloader just to see how the batch looks like
temp_loader = iter(DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate_textual))
batch = next(temp_loader)

# print(batch[0].shape) # word vectors, padded to maxlen
print(batch[0].shape) # textual features
print(batch[1]) # labels
print(batch[2]) # lengths




torch.Size([8, 58])
tensor([[-1.2000],
        [ 0.2000],
        [ 0.6000],
        [-2.0000],
        [ 1.8000],
        [ 0.6000],
        [ 0.0000],
        [-1.0000]])
tensor([58, 37, 29, 23, 14, 14,  7,  5])


In [41]:
# Check how many words are in the vocabulary
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 2733


In [42]:
class SingleEncoderModelText(nn.Module):
    def __init__(self, dic_size, use_glove, encoder_size, num_layers, lr, hidden_dim, dr, num_categories, output_size, embedding_path=WORD_EMB_PATH):
        super(SingleEncoderModelText, self).__init__()
        
        # Parameters
        self.dic_size = dic_size
        self.use_glove = use_glove
        self.encoder_size = encoder_size
        self.num_layers = num_layers
        self.lr = lr
        self.hidden_dim = hidden_dim
        self.dr = dr
        self.num_categories = num_categories
        self.output_size = output_size
        
        # Embedding size is 300 if using GloVe, else defined elsewhere
        self.embed_dim = 300 if self.use_glove else 128  # Adjust accordingly
    
        # Embedding layer
        self.embedding = nn.Embedding(self.dic_size, self.embed_dim)
        
        if self.use_glove:
            print("Using glove")
            

    
        # GRU layer
        self.gru = nn.GRU(input_size=self.embed_dim, 
                          hidden_size=self.hidden_dim, 
                          num_layers=self.num_layers, 
                          dropout=self.dr, 
                          batch_first=True)
        
        # Fully connected output layer
        # self.fc1 = nn.Linear(self.hidden_dim, self.num_categories)  
        self.fc2 = nn.Linear(self.hidden_dim, self.output_size)
    # 
    # # Debugging - before passing data to the model, print out some input examples
    # # Check if all indices in the input are valid
    # def check_input_indices(input_tensor, vocab_size):
    #     if input_tensor.max().item() >= vocab_size:
    #         print(f"Warning: Input contains indices out of range! Max index: {input_tensor.max().item()} but vocab size is {vocab_size}.")
    #     else:
    #         print(f"Input indices are within valid range. Max index: {input_tensor.max().item()}.")
    # 
    # # Example usage:
    # check_input_indices(input_tensor, self.dic_size)
    

        
    def forward(self, x, lengths):
        batch_size = lengths.size(0)

        if (x.min() < 0 or x.max() >= self.dic_size):
            raise ValueError(
                f"Input indices out of range! Min index: {x.min()}, Max index: {x.max()}, Vocabulary size: {self.dic_size}"
            )
        # Step 1: Embedding layer
        embedded = self.embedding(x)  # [batch_size, seq_length] -> [batch_size, seq_length, embed_dim]
        
        # Step 2: GRU layer
        gru_out, hidden = self.gru(embedded)  # gru_out: [batch_size, seq_length, hidden_dim], hidden: [num_layers, batch_size, hidden_dim]
        
        # Step 3: Use the last hidden state (or apply pooling, e.g., mean pooling)
        # We'll use the hidden state from the last time step of the last GRU layer
        last_hidden = hidden[-1]  # Shape: [batch_size, hidden_dim]
        
        # Step 4: Pass through fully connected layers
        # fc1_out = self.fc1(last_hidden)  # [batch_size, num_categories]
        output = self.fc2(last_hidden)  # [batch_size, output_size]
        
        return output


In [43]:

temp_loader = iter(DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate_textual))
batch = next(temp_loader)

t, y, l = batch
print(f"Input tensor min: {t.min()}, max: {t.max()}")

# Check how many words are in the vocabulary
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")
# print(f"vocabulary: {list(word2id.keys())}")

Input tensor min: 1, max: 2488
Vocabulary size: 2733


In [44]:
max_value = float('-inf')  # Initialize to negative infinity
for batch in DataLoader(test, shuffle=True, batch_size=8, collate_fn=multi_collate_textual):
    t, y, l = batch
    batch_max = t.max().item()  # Find the max in the current batch
    max_value = max(max_value, batch_max)  # Update the overall max

print(f"Overall maximum value in the dataset: {max_value}")

Overall maximum value in the dataset: 2732


In [45]:
# Check how many words are in the vocabulary
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")
print(f"vocabulary: {list(word2id.keys())}")

Vocabulary size: 2733
vocabulary: ['<unk>', '<pad>', '<eos>', '<bos>', '<sep>', '<dummy>', 'anyhow', 'it', 'was', 'really', 'good', 'they', 'didnt', 'do', 'a', 'whole', 'bunch', 'of', 'background', 'info', 'on', 'why', 'she', 'has', 'to', 'fight', 'and', 'be', 'prepared', 'i', 'mean', 'did', 'little', 'bit', 'but', 'not', 'guess', 'live', 'up', 'with', 'more', 'besides', 'that', 'all', 'over', 'pretty', 'there', 'is', 'like', 'someone', 'while', 'lot', 'action', 'oh', 'my', 'god', 'sad', 'part', 'parts', 'awesome', 'its', 'funny', 'now', 'the', 'title', 'movie', 'basically', 'says', 'im', 'even', 'gonna', 'sugar', 'coat', 'this', 'frustrated', 'me', 'such', 'an', 'extreme', 'extent', 'loudly', 'exclaiming', 'at', 'end', 'film', 'reason', 'comic', 'book', 'fan', 'see', 'characters', 'treated', 'responsibly', 'huh', 'before', 'we', 'go', 'must', 'say', 'had', 'surprisingly', 'decent', 'cast', 'strange', 'since', 'one', 'biggest', 'grapes', 'series', 'always', 'hugh', 'jackman', 'best', '

In [46]:
from torch.optim import Adam, SGD


In [47]:


torch.manual_seed(123)
torch.cuda.manual_seed_all(123)

CUDA = torch.cuda.is_available()
MAX_EPOCH = 1000

dic_size = len(word2id)
print(f"Dictionary size: {dic_size}")

# define some model settings and hyper-parameters
input_sizes = dic_size
hidden_sizes = int(dic_size)
fc1_size = hidden_sizes
dropout = 0.25
output_size = 1
curr_patience = patience = 8
num_trials = 3
grad_clip_value = 1.0
weight_decay = 0.1

if os.path.exists(CACHE_PATH):
    pretrained_emb, word2id = torch.load(CACHE_PATH)
    print(f"Size of vocabulary (word2id) after pretrained 1: {len(word2id)}")

elif WORD_EMB_PATH is not None:
    pretrained_emb = load_emb(word2id, WORD_EMB_PATH)
    torch.save((pretrained_emb, word2id), CACHE_PATH)
    print(f"Size of vocabulary (word2id) after pretrained 2: {len(word2id)}")
    print(list(word2id.keys()))
else:
    pretrained_emb = None
    
print(f"Size of pre-trained embeddings: {pretrained_emb.shape}")
print(f"Size of vocabulary (word2id): {len(word2id)}")



model = SingleEncoderModelText(dic_size=dic_size,
                                use_glove=True,
                                encoder_size=300,
                                num_layers=2,
                                lr=0.001,
                                hidden_dim=128,
                                dr=0.2,
                                num_categories=64,
                                output_size=output_size)


if pretrained_emb is not None:
    model.embedding.weight.data = pretrained_emb
model.embedding.requires_grad = False

optimizer = Adam([param for param in model.parameters() if param.requires_grad], weight_decay=weight_decay)

if CUDA:
    model.cuda()
    
criterion = nn.MSELoss(reduction='sum')
criterion_test = nn.MSELoss(reduction='sum')

best_valid_loss = float('inf')

lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
lr_scheduler.step() # for some reason it seems the StepLR needs to be stepped once first

train_losses = []
valid_losses = []
for e in range(MAX_EPOCH):
    model.train()
    train_iter = tqdm_notebook(train_loader)
    train_loss = 0.0
    
    for batch in train_iter:
        model.zero_grad()
        t, y, l = batch
        batch_size = t.size(0)
        if CUDA:
            t = t.cuda()
            y = y.cuda()
            l = l.cuda()
        
        y_tilde = model(t, l)
        # print(f"batch_pred: {y_tilde.shape}")
        # print(f"labels: {y.shape}")
        
        loss = criterion(y_tilde, y)
        loss.backward()
        torch.nn.utils.clip_grad_value_([param for param in model.parameters() if param.requires_grad], grad_clip_value)
        optimizer.step()
        train_iter.set_description(f"Epoch {e}/{MAX_EPOCH}, current batch loss: {round(loss.item()/batch_size, 4)}")
        train_loss += loss.item()
    train_loss = train_loss / len(train)
    train_losses.append(train_loss)
    print(f"Training loss: {round(train_loss, 4)}")
    
    print ("Loop completed")

    model.eval()
    with torch.no_grad():
        valid_loss = 0.0
        for batch in dev_loader:
            model.zero_grad()
            t, y, l = batch
            if CUDA:
                t = t.cuda()
                y = y.cuda()
                l = l.cuda()
            y_tilde = model(t, l)
            loss = criterion(y_tilde, y)
            valid_loss += loss.item()

    valid_loss = valid_loss/len(dev)
    valid_losses.append(valid_loss)
    print(f"Validation loss: {round(valid_loss, 4)}")
    print(f"Current patience: {curr_patience}, current trial: {num_trials}.")
    if valid_loss <= best_valid_loss:
        best_valid_loss = valid_loss
        print("Found new best model on dev set!")
        torch.save(model.state_dict(), 'model.std')
        torch.save(optimizer.state_dict(), 'optim.std')
        curr_patience = patience
    else:
        curr_patience -= 1
        if curr_patience <= -1:
            print("Running out of patience, loading previous best model.")
            num_trials -= 1
            curr_patience = patience
            model.load_state_dict(torch.load('model.std'))
            optimizer.load_state_dict(torch.load('optim.std'))
            lr_scheduler.step()
            print(f"Current learning rate: {optimizer.state_dict()['param_groups'][0]['lr']}")

    if num_trials <= 0:
        print("Running out of patience, early stopping.")
        break

model.load_state_dict(torch.load('model.std'))
y_true = []
y_pred = []
model.eval()
with torch.no_grad():
    test_loss = 0.0
    for batch in test_loader:
        model.zero_grad()
        t, y, l = batch
        if CUDA:
            t = t.cuda()
            y = y.cuda()
            l = l.cuda()
        y_tilde = model(t, l)
        loss = criterion_test(y_tilde, y)
        y_true.append(y_tilde.detach().cpu().numpy())
        y_pred.append(y.detach().cpu().numpy())
        test_loss += loss.item()
print(f"Test set performance: {test_loss/len(test)}")
y_true = np.concatenate(y_true, axis=0)
y_pred = np.concatenate(y_pred, axis=0)

y_true_bin = y_true >= 0
y_pred_bin = y_pred >= 0
bin_acc = accuracy_score(y_true_bin, y_pred_bin)
print(f"Test set accuracy is {bin_acc}")

Dictionary size: 2733
Size of vocabulary (word2id) after pretrained 1: 2724
Size of pre-trained embeddings: torch.Size([2724, 300])
Size of vocabulary (word2id): 2724
Using glove


  pretrained_emb, word2id = torch.load(CACHE_PATH)
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  train_iter = tqdm_notebook(train_loader)


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.3099
Loop completed
Validation loss: 2.7348
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.279
Loop completed
Validation loss: 2.7384
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.27
Loop completed
Validation loss: 2.7254
Current patience: 7, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2806
Loop completed
Validation loss: 2.7313
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2716
Loop completed
Validation loss: 2.7278
Current patience: 7, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2666
Loop completed
Validation loss: 2.7232
Current patience: 6, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2681
Loop completed
Validation loss: 2.7185
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.269
Loop completed
Validation loss: 2.7178
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2576
Loop completed
Validation loss: 2.7037
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2432
Loop completed
Validation loss: 2.6953
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.2197
Loop completed
Validation loss: 2.6703
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.1803
Loop completed
Validation loss: 2.6357
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.1145
Loop completed
Validation loss: 2.5633
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 2.061
Loop completed
Validation loss: 2.4995
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.8947
Loop completed
Validation loss: 2.4277
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.8211
Loop completed
Validation loss: 2.4744
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.6541
Loop completed
Validation loss: 2.3861
Current patience: 7, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.53
Loop completed
Validation loss: 2.3833
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.4564
Loop completed
Validation loss: 2.4367
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.3384
Loop completed
Validation loss: 2.381
Current patience: 7, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.2027
Loop completed
Validation loss: 2.3755
Current patience: 8, current trial: 3.
Found new best model on dev set!


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1635
Loop completed
Validation loss: 2.4042
Current patience: 8, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0418
Loop completed
Validation loss: 2.5616
Current patience: 7, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.9554
Loop completed
Validation loss: 2.4299
Current patience: 6, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.8724
Loop completed
Validation loss: 2.5664
Current patience: 5, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.8298
Loop completed
Validation loss: 2.3848
Current patience: 4, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.7701
Loop completed
Validation loss: 2.5144
Current patience: 3, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.7304
Loop completed
Validation loss: 2.4101
Current patience: 2, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.6185
Loop completed
Validation loss: 2.4966
Current patience: 1, current trial: 3.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 0.5717
Loop completed
Validation loss: 2.4526
Current patience: 0, current trial: 3.
Running out of patience, loading previous best model.
Current learning rate: 1e-05


  model.load_state_dict(torch.load('model.std'))
  optimizer.load_state_dict(torch.load('optim.std'))


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1179
Loop completed
Validation loss: 2.3812
Current patience: 8, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1248
Loop completed
Validation loss: 2.3833
Current patience: 7, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1108
Loop completed
Validation loss: 2.3894
Current patience: 6, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1068
Loop completed
Validation loss: 2.384
Current patience: 5, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0736
Loop completed
Validation loss: 2.3819
Current patience: 4, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0712
Loop completed
Validation loss: 2.3887
Current patience: 3, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0708
Loop completed
Validation loss: 2.3864
Current patience: 2, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0435
Loop completed
Validation loss: 2.3922
Current patience: 1, current trial: 2.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.043
Loop completed
Validation loss: 2.3924
Current patience: 0, current trial: 2.
Running out of patience, loading previous best model.
Current learning rate: 1e-05


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1067
Loop completed
Validation loss: 2.3806
Current patience: 8, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.1182
Loop completed
Validation loss: 2.3854
Current patience: 7, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0947
Loop completed
Validation loss: 2.3893
Current patience: 6, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0993
Loop completed
Validation loss: 2.3845
Current patience: 5, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0874
Loop completed
Validation loss: 2.385
Current patience: 4, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0777
Loop completed
Validation loss: 2.3902
Current patience: 3, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0669
Loop completed
Validation loss: 2.3962
Current patience: 2, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0479
Loop completed
Validation loss: 2.4004
Current patience: 1, current trial: 1.


  0%|          | 0/23 [00:00<?, ?it/s]

Training loss: 1.0413
Loop completed
Validation loss: 2.3967
Current patience: 0, current trial: 1.
Running out of patience, loading previous best model.
Current learning rate: 1e-05
Running out of patience, early stopping.


  model.load_state_dict(torch.load('model.std'))


IndexError: index out of range in self

In [76]:
# import torch
# import torch.nn as nn
# import torch.optim as optim
# import torch.nn.functional as F
# 
# class SingleEncoderModelText(nn.Module):
#     def __init__(self, dic_size, use_glove, batch_size, encoder_size, num_layer, lr, hidden_dim, dr):
#         super(SingleEncoderModelText, self).__init__()
# 
#         self.dic_size = dic_size
#         self.use_glove = use_glove
#         self.batch_size = batch_size
#         self.encoder_size = encoder_size
#         self.num_layers = num_layer
#         self.lr = lr
#         self.hidden_dim = hidden_dim
#         self.dr = dr
#         self.embed_dim = 300 if self.use_glove else DIM_WORD_EMBEDDING
#         self.global_step = 0  # Global step for tracking
# 
#         # Embedding layer
#         self.embedding = nn.Embedding(dic_size, self.embed_dim)
# 
#         # GRU layer
#         self.gru = nn.GRU(input_size=self.embed_dim,
#                           hidden_size=self.hidden_dim,
#                           num_layers=self.num_layers,
#                           batch_first=True,
#                           dropout=dr if self.num_layers > 1 else 0)
# 
#         # Output layers
#         self.fc = nn.Linear(self.hidden_dim, N_CATEGORY)
#         self.dropout = nn.Dropout(p=self.dr)
# 
#         # Optimizer
#         self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
# 
#     def forward(self, encoder_inputs, encoder_seq_length):
#         # Embedding lookup
#         embedded = self.embedding(encoder_inputs)  # [batch_size, encoder_size, embed_dim]
# 
#         # Pack padded sequence for GRU
#         packed_input = nn.utils.rnn.pack_padded_sequence(embedded, encoder_seq_length, batch_first=True, enforce_sorted=False)
#         packed_output, hidden = self.gru(packed_input)
# 
#         # Take the last hidden state of the GRU
#         # (hidden[-1] if multi-layered GRU)
#         final_hidden = hidden[-1]  # [batch_size, hidden_dim]
# 
#         # Fully connected layer with dropout
#         output = self.fc(self.dropout(final_hidden))  # [batch_size, N_CATEGORY]
#         return output
# 
#     def compute_loss(self, logits, labels):
#         # Binary Cross-Entropy with Logits
#         return F.binary_cross_entropy_with_logits(logits, labels)
# 
#     def train_batch(self, encoder_inputs, encoder_seq_length, y_labels):
#         self.train()  # Set the model to training mode
# 
#         # Forward pass
#         logits = self(encoder_inputs, encoder_seq_length)
#         loss = self.compute_loss(logits, y_labels)
# 
#         # Backward pass and optimization
#         self.optimizer.zero_grad()
#         loss.backward()
#         nn.utils.clip_grad_value_(self.parameters(), clip_value=10)  # Gradient clipping
#         self.optimizer.step()
# 
#         # Update global step
#         self.global_step += 1
# 
#         return loss.item()
# 
#     def evaluate(self, encoder_inputs, encoder_seq_length, y_labels):
#         self.eval()  # Set the model to evaluation mode
#         with torch.no_grad():
#             logits = self(encoder_inputs, encoder_seq_length)
#             loss = self.compute_loss(logits, y_labels)
#         return loss.item(), torch.sigmoid(logits)  # Return probabilities
# 
#     def load_pretrained_embeddings(self, embeddings):
#         # Load pre-trained embedding weights
#         self.embedding.weight.data.copy_(torch.tensor(embeddings))
#         self.embedding.weight.requires_grad = EMBEDDING_TRAIN
# 
# 
# # Usage Example
# # model = SingleEncoderModelText(dic_size=5000, use_glove=True, batch_size=32, encoder_size=50, 
# #                                num_layer=2, lr=0.001, hidden_dim=128, dr=0.5)


In [77]:
# # FIGURE OUT HOW TO DO MDRE 
# 
# """
# what    : Single Encoder Model for Multi (Audio + Text)
# """
# import tensorflow as tf
# from tensorflow.contrib import rnn
# from tensorflow.contrib.rnn import DropoutWrapper 
# 
# from tensorflow.core.framework import summary_pb2
# from random import shuffle
# import numpy as np
# from project_config import *
# 
# from SE_model_audio import *
# from SE_model_text import *
# 
# 
# class SingleEncoderModelMulti:
#     
#     def __init__(self,
#                  batch_size,
#                  lr,
#                  encoder_size_audio,  # for audio
#                  num_layer_audio,
#                  hidden_dim_audio,
#                  dr_audio,
#                  dic_size,             # for text
#                  use_glove,
#                  encoder_size_text,
#                  num_layer_text,
#                  hidden_dim_text,
#                  dr_text
#                 ):
# 
#         # for audio
#         self.encoder_size_audio = encoder_size_audio
#         self.num_layers_audio = num_layer_audio
#         self.hidden_dim_audio = hidden_dim_audio
#         self.dr_audio = dr_audio
#         
#         self.encoder_inputs_audio = []
#         self.encoder_seq_length_audio =[]
#         
#         # for text        
#         self.dic_size = dic_size
#         self.use_glove = use_glove
#         self.encoder_size_text = encoder_size_text
#         self.num_layers_text = num_layer_text
#         self.hidden_dim_text = hidden_dim_text
#         self.dr_text = dr_text
#         
#         self.encoder_inputs_text = []
#         self.encoder_seq_length_text =[]
# 
#         # common        
#         self.batch_size = batch_size
#         self.lr = lr
#         self.y_labels =[]
#         
#         self.M = None
#         self.b = None
#         
#         self.y = None
#         self.optimizer = None
# 
#         self.batch_loss = None
#         self.loss = 0
#         self.batch_prob = None
#         
#         if self.use_glove == 1:
#             self.embed_dim = 300
#         else:
#             self.embed_dim = DIM_WORD_EMBEDDING
#         
#         # for global counter
#         self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
# 
# 
#     def _create_placeholders(self):
#         print '[launch-multi] placeholders'
#         with tf.name_scope('multi_placeholder'):
#             
#             # for audio
#             self.encoder_inputs_audio  = self.model_audio.encoder_inputs  # [batch, time_step, audio]
#             self.encoder_seq_audio     = self.model_audio.encoder_seq
#             self.encoder_prosody       = self.model_audio.encoder_prosody
#             self.dr_prob_audio         = self.model_audio.dr_prob
#             
#             # for text
#             self.encoder_inputs_text  = self.model_text.encoder_inputs
#             self.encoder_seq_text     = self.model_text.encoder_seq
#             self.dr_prob_text         = self.model_text.dr_prob
# 
#             # common
#             self.y_labels             = tf.placeholder(tf.float32, shape=[self.batch_size, N_CATEGORY], name="label")
#             
#             # for using pre-trained embedding
#             self.embedding_placeholder = self.model_text.embedding_placeholder
# 
# 
#     def _create_model_audio(self):
#         print '[launch-multi] create audio model'
#         self.model_audio =  SingleEncoderModelAudio(
#                                                         batch_size=self.batch_size,
#                                                         encoder_size=self.encoder_size_audio,
#                                                         num_layer=self.num_layers_audio,
#                                                         hidden_dim=self.hidden_dim_audio,
#                                                         lr = self.lr,
#                                                         dr= self.dr_audio
#                                                         )
#         self.model_audio._create_placeholders()
#         self.model_audio._create_gru_model()
#         self.model_audio._add_prosody()
#         self.model_audio._create_output_layers_for_multi()
#         
# 
# 
#     def _create_model_text(self):
#         print '[launch-multi] create text model'        
#         self.model_text = SingleEncoderModelText(
#                                                         batch_size=self.batch_size,
#                                                         dic_size=self.dic_size,
#                                                         use_glove=self.use_glove,
#                                                         encoder_size=self.encoder_size_text,
#                                                         num_layer=self.num_layers_text,
#                                                         hidden_dim=self.hidden_dim_text,
#                                                         lr = self.lr,
#                                                         dr= self.dr_text
#                                                         )
#         
#         self.model_text._create_placeholders()
#         self.model_text._create_embedding()
#         self.model_text._use_external_embedding()
#         self.model_text._create_gru_model()
#         self.model_text._create_output_layers_for_multi()
# 
# 
#     def _create_output_layers(self):
#         print '[launch-multi] create output projection layer from (audio_final_dim/2) + (text_final_dim/2)'
#         
#         with tf.name_scope('multi_output_layer') as scope:
# 
#             self.M = tf.Variable(tf.random_uniform([(self.model_audio.final_encoder_dimension/2)+(self.model_text.final_encoder_dimension/2), N_CATEGORY],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                  trainable=True,
#                                                  name="similarity_matrix")
#             
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32),
#                                                  trainable=True,
#                                                  name="output_bias")
#             
#             self.final_encoder = tf.concat( [self.model_audio.batch_pred, self.model_text.batch_pred], axis=1 )
#             
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
#         
#         with tf.name_scope('loss') as scope:
#             
#             self.batch_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.batch_pred, labels=self.y_labels )
#             self.loss = tf.reduce_mean( self.batch_loss  )
# 
#     
#     def _create_optimizer(self):
#         print '[launch-multi] create optimizer'
#         
#         with tf.name_scope('multi_optimizer') as scope:
#             opt_func = tf.train.AdamOptimizer(learning_rate=self.lr)
#             gvs = opt_func.compute_gradients(self.loss)
#             capped_gvs = [(tf.clip_by_value(t=grad, clip_value_min=-10, clip_value_max=10), var) for grad, var in gvs]
#             self.optimizer = opt_func.apply_gradients(grads_and_vars=capped_gvs, global_step=self.global_step)
#     
#     
#     def _create_summary(self):
#         print '[launch-multi] create summary'
#         
#         with tf.name_scope('summary'):
#             tf.summary.scalar('mean_loss', self.loss)
#             self.summary_op = tf.summary.merge_all()
#     
#     
#     def build_graph(self):
#         self._create_model_audio()
#         self._create_model_text()
#         self._create_placeholders()
#         self._create_output_layers()
#         self._create_optimizer()
#         self._create_summary()

In [ ]:
# # FIGURE OUT HOW TO DO MDREA
# #-*- coding: utf-8 -*-
# 
# """
# what    : Single Encoder Model for Multi (Audio + Text) with attention
# """
# import tensorflow as tf
# from tensorflow.contrib import rnn
# from tensorflow.contrib.rnn import DropoutWrapper 
# 
# from tensorflow.core.framework import summary_pb2
# from random import shuffle
# import numpy as np
# from project_config import *
# 
# from SE_model_audio import *
# from SE_model_text import *
# # from model_util import luong_attention
# from model_luong_attention import luong_attention
# 
# 
# class SingleEncoderModelMultiAttn:
#     
#     def __init__(self,
#                  batch_size,
#                  lr,
#                  encoder_size_audio,  # for audio
#                  num_layer_audio,
#                  hidden_dim_audio,
#                  dr_audio,
#                  dic_size,             # for text
#                  use_glove,
#                  encoder_size_text,
#                  num_layer_text,
#                  hidden_dim_text,
#                  dr_text
#                 ):
# 
#         # for audio
#         self.encoder_size_audio = encoder_size_audio
#         self.num_layers_audio = num_layer_audio
#         self.hidden_dim_audio = hidden_dim_audio
#         self.dr_audio = dr_audio
#         
#         self.encoder_inputs_audio = []
#         self.encoder_seq_length_audio =[]
#         
#         # for text        
#         self.dic_size = dic_size
#         self.use_glove = use_glove
#         self.encoder_size_text = encoder_size_text
#         self.num_layers_text = num_layer_text
#         self.hidden_dim_text = hidden_dim_text
#         self.dr_text = dr_text
#         
#         self.encoder_inputs_text = []
#         self.encoder_seq_length_text =[]
# 
#         # common        
#         self.batch_size = batch_size
#         self.lr = lr
#         self.y_labels =[]
#         
#         self.M = None
#         self.b = None
#         
#         self.y = None
#         self.optimizer = None
# 
#         self.batch_loss = None
#         self.loss = 0
#         self.batch_prob = None
#         
#         if self.use_glove == 1:
#             self.embed_dim = 300
#         else:
#             self.embed_dim = DIM_WORD_EMBEDDING
#         
#         # for global counter
#         self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
# 
# 
#     def _create_placeholders(self):
#         print '[launch-multi] placeholders'
#         with tf.name_scope('multi_placeholder'):
#             
#             # for audio
#             self.encoder_inputs_audio  = self.model_audio.encoder_inputs  # [batch, time_step, audio]
#             self.encoder_seq_audio     = self.model_audio.encoder_seq
#             self.encoder_prosody       = self.model_audio.encoder_prosody
#             self.dr_prob_audio         = self.model_audio.dr_prob
#             
#             # for text
#             self.encoder_inputs_text  = self.model_text.encoder_inputs
#             self.encoder_seq_text     = self.model_text.encoder_seq
#             self.dr_prob_text         = self.model_text.dr_prob
# 
#             # common
#             self.y_labels             = tf.placeholder(tf.float32, shape=[self.batch_size, N_CATEGORY], name="label")
#             
#             # for using pre-trained embedding
#             self.embedding_placeholder = self.model_text.embedding_placeholder
# 
# 
#     def _create_model_audio(self):
#         print '[launch-multi] create audio model'
#         self.model_audio =  SingleEncoderModelAudio(
#                                                         batch_size=self.batch_size,
#                                                         encoder_size=self.encoder_size_audio,
#                                                         num_layer=self.num_layers_audio,
#                                                         hidden_dim=self.hidden_dim_audio,
#                                                         lr = self.lr,
#                                                         dr= self.dr_audio
#                                                         )
#         self.model_audio._create_placeholders()
#         self.model_audio._create_gru_model()
#         self.model_audio._add_prosody()
#         #self.model_audio._create_output_layers_for_multi()
#         
# 
# 
#     def _create_model_text(self):
#         print '[launch-multi] create text model'        
#         self.model_text = SingleEncoderModelText(
#                                                     batch_size=self.batch_size,
#                                                     dic_size=self.dic_size,
#                                                     use_glove=self.use_glove,
#                                                     encoder_size=self.encoder_size_text,
#                                                     num_layer=self.num_layers_text,
#                                                     hidden_dim=self.hidden_dim_text,
#                                                     lr = self.lr,
#                                                     dr= self.dr_text
#                                                 )
#         
#         self.model_text._create_placeholders()
#         self.model_text._create_embedding()
#         self.model_text._use_external_embedding()
#         self.model_text._create_gru_model()
#         #self.model_text._create_output_layers_for_multi()
# 
# 
#     def _create_attention_module(self):
#         print '[launch-multi] create attention module'
#         # project audio dimension_size to text dimension_size
#         self.attnM = tf.Variable(tf.random_uniform([self.model_audio.final_encoder_dimension, self.model_text.final_encoder_dimension],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                  trainable=True,
#                                                  name="attn_projection_helper")
#             
#         self.attnb = tf.Variable(tf.zeros([1], dtype=tf.float32),
#                                                  trainable=True,
#                                                  name="attn_bias")
#         
# 
#         self.attn_audio_final_encoder = tf.matmul(self.model_audio.final_encoder, self.attnM) + self.attnb
#         
#         self.final_encoder, self.tmp_norm = luong_attention (
#                                                 batch_size = self.batch_size,
#                                                 target = self.model_text.outputs_en,
#                                                 condition = self.attn_audio_final_encoder,
#                                                 batch_seq = self.encoder_seq_text,
#                                                 max_len = self.model_text.encoder_size,
#                                                 hidden_dim = self.model_text.final_encoder_dimension
#                                             )
# 
#         
#     def _create_output_layers(self):
#         print '[launch-multi] create output projection layer from (text_final_dim(==audio) + text_final_dim)'
#         
#         with tf.name_scope('multi_output_layer') as scope:
# 
#             self.final_encoder = tf.concat( [self.final_encoder, self.attn_audio_final_encoder], axis=1 )
#             
#             self.M = tf.Variable(tf.random_uniform([(self.model_text.final_encoder_dimension)+(self.model_text.final_encoder_dimension), N_CATEGORY],
#                                                    minval= -0.25,
#                                                    maxval= 0.25,
#                                                    dtype=tf.float32,
#                                                    seed=None),
#                                                  trainable=True,
#                                                  name="similarity_matrix")
#             
#             self.b = tf.Variable(tf.zeros([1], dtype=tf.float32),
#                                                  trainable=True,
#                                                  name="output_bias")
#             
#             # e * M + b
#             self.batch_pred = tf.matmul(self.final_encoder, self.M) + self.b
#         
#         with tf.name_scope('loss') as scope:
#             
#             self.batch_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.batch_pred, labels=self.y_labels )
#             self.loss = tf.reduce_mean( self.batch_loss  )
# 
#     
#     def _create_optimizer(self):
#         print '[launch-multi] create optimizer'
#         
#         with tf.name_scope('multi_optimizer') as scope:
#             opt_func = tf.train.AdamOptimizer(learning_rate=self.lr)
#             gvs = opt_func.compute_gradients(self.loss)
#             capped_gvs = [(tf.clip_by_value(t=grad, clip_value_min=-10, clip_value_max=10), var) for grad, var in gvs]
#             self.optimizer = opt_func.apply_gradients(grads_and_vars=capped_gvs, global_step=self.global_step)
#     
#     
#     def _create_summary(self):
#         print '[launch-multi] create summary'
#         
#         with tf.name_scope('summary'):
#             tf.summary.scalar('mean_loss', self.loss)
#             self.summary_op = tf.summary.merge_all()
#     
#     
#     def build_graph(self):
#         self._create_model_audio()
#         self._create_model_text()
#         self._create_placeholders()
#         self._create_attention_module()
#         self._create_output_layers()
#         self._create_optimizer()
#         self._create_summary()


In [ ]:
# FIGURE OUT HOW TO DO EMOTION SHIFT?

In [ ]:
# FIGURE OUT FUSION TECHNIQUES