## Go to the root repo directory.

In [1]:
%cd /home/tk/repos/erc/

/home/tk/repos/erc


## Choose the dataset 

`DATASET` should be either `MELD` or `IEMOCAP`.

In [2]:
from glob import glob
import os

DATASET = 'MELD'
text_features_paths = {}
for SPLIT in ['train', 'val', 'test']:
    text_features_paths[SPLIT] = glob(f"Datasets/{DATASET}/text-features/{SPLIT}/*.npy")
    print(DATASET, SPLIT, len(text_features_paths[SPLIT]))

MELD train 2343
MELD val 0
MELD test 0
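
Since `DATASET` must be one of two names, a small guard can catch typos before globbing. The sketch below is not part of the original notebook: the helper name `list_feature_paths` and its `root` parameter are hypothetical, and the directory layout is simply copied from the cell above.

```python
from glob import glob
import os

VALID_DATASETS = ('MELD', 'IEMOCAP')  # the two names the notebook allows


def list_feature_paths(dataset, root='Datasets'):
    """Hypothetical helper: map each split to its sorted list of .npy paths."""
    if dataset not in VALID_DATASETS:
        raise ValueError(
            f"DATASET should be one of {VALID_DATASETS}, got {dataset!r}")
    paths = {}
    for split in ['train', 'val', 'test']:
        pattern = os.path.join(root, dataset, 'text-features', split, '*.npy')
        paths[split] = sorted(glob(pattern))
    return paths
```

A count of 0 for a split means that no file matched the pattern for that split.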


## Randomly choose a sample and see what it is

In [3]:
import random
import numpy as np

path = random.choice(text_features_paths['train'])
print(path)
print()

features = np.load(path, allow_pickle=True)

# A dict was saved as a .npy file, so np.load returns a 0-d object array;
# call item() to get the dict back out.
features = features.item()

for key, val in features.items():
    print(key, ':', val)
    print()

Datasets/MELD/text-features/train/dia153_utt0.npy

utterance : ROSS: No real-, honey, really it’s fine, just g-go with Susan.

label : 0

uttid : dia153_utt0

tokens : tensor([    0,   500, 17549,    35,   440,   588, 20551, 10658,     6,   269,
           24,    17,    27,    29,  2051,     6,    95,   821,    12,  2977,
           19,  6470,     4,     2])

logprobs : [[-0.16803461 -3.7351213  -6.2581997  -4.142675   -2.7416916  -3.1234658
  -5.3938284 ]]

probs : [[0.8453246  0.02387028 0.00191469 0.01588032 0.06446122 0.0440044
  0.00454454]]

pred : 0

features : [[ 0.43519896  0.16311124 -0.30719158 ... -0.5003236  -0.5439267
   0.0103919 ]]

emotion2num : {'neutral': 0, 'joy': 1, 'surprise': 2, 'anger': 3, 'sadness': 4, 'disgust': 5, 'fear': 6}
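
To see why both `allow_pickle=True` and `item()` are needed, here is a minimal round trip that assumes only NumPy. The file name and the toy dict are made up for illustration; the log-probabilities are copied from the output above. The sketch also checks a relation that the printed values are consistent with: `probs` is the element-wise exponential of `logprobs`, and `pred` is the index of the largest probability.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for one saved sample; the values are copied from the output above.
sample = {
    'uttid': 'dia153_utt0',
    'logprobs': np.array([[-0.16803461, -3.7351213, -6.2581997, -4.142675,
                           -2.7416916, -3.1234658, -5.3938284]],
                         dtype=np.float32),
    'pred': 0,
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.npy')  # hypothetical file name
    np.save(path, sample)  # the dict is pickled into a 0-d object array
    loaded = np.load(path, allow_pickle=True)

print(loaded.shape, loaded.dtype)  # a 0-d array of dtype object
restored = loaded.item()           # unwrap the single element: the dict

probs = np.exp(restored['logprobs'])
print(probs.round(4), probs.sum())
print(int(probs.argmax(axis=1)[0]) == restored['pred'])
```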



### See the tokens. 

Note that token ID 0 is the `<CLS>` token (written `<s>` in RoBERTa's vocabulary),
and token ID 2 is the `<EOS>` token (`</s>`).

In [4]:
features['tokens']

tensor([    0,   500, 17549,    35,   440,   588, 20551, 10658,     6,   269,
           24,    17,    27,    29,  2051,     6,    95,   821,    12,  2977,
           19,  6470,     4,     2])

### Print the number of tokens. 

The number of tokens includes the `<CLS>` and `<EOS>` tokens.

In [5]:
len(features['tokens'])

24
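
As a quick check of the two notes above, the sketch below copies the token IDs from the output into a plain Python list (the notebook itself holds them in a tensor). It confirms that the sequence starts with ID 0 and ends with ID 2, then counts the tokens that remain once those two are excluded. The constant names are illustrative only.

```python
# Token IDs copied from the output above, as a plain list instead of a tensor.
tokens = [0, 500, 17549, 35, 440, 588, 20551, 10658, 6, 269,
          24, 17, 27, 29, 2051, 6, 95, 821, 12, 2977,
          19, 6470, 4, 2]

CLS_ID, EOS_ID = 0, 2  # special-token IDs described above

assert tokens[0] == CLS_ID   # the sequence starts with <CLS>
assert tokens[-1] == EOS_ID  # the sequence ends with <EOS>

num_content_tokens = len(tokens) - 2  # drop the two special tokens
print(len(tokens), num_content_tokens)
```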

### What does the feature vector look like?

Every feature vector has the shape (`batch_size`, `num_neurons`).

The feature vectors were extracted with the code snippet below.

```python
head = self.roberta.model.classification_heads['MELD_head']

self.features = self.roberta.extract_features(self.tokens)
self.features = self.features[:, 0, :]  # <CLS> token feature

# dropout doesn't have an effect (i.e. p=0) but I'll still preserve it.
self.features = head.dropout(
    head.activation_fn(head.dense(head.dropout(self.features))))
```

Here `activation_fn` is tanh, so every element of the vector lies in the range (-1, +1).

This is the output of the second-to-last layer (dense + tanh) of the entire roberta.large neural network. The last layer is a fully connected layer that takes this feature vector and outputs a vector of size `num_classes`.
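
The shapes involved can be sketched in NumPy. This is not the RoBERTa code: the weights are random placeholders, dropout is omitted because it has no effect here, and every name (`cls_vector`, `dense_w`, `out_w`, and so on) is made up for illustration. The sizes 1024 and 7 are taken from the feature shape and the `emotion2num` mapping shown in this notebook; the input to the dense layer is assumed to be 1024-dimensional as well.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 1024  # width of the feature vector saved in this notebook
num_classes = 7     # number of entries in emotion2num for MELD
batch_size = 1

# Random placeholders standing in for the <CLS> vector and the trained weights.
cls_vector = rng.standard_normal((batch_size, hidden_size))
dense_w = rng.standard_normal((hidden_size, hidden_size)) * 0.02
dense_b = np.zeros(hidden_size)
out_w = rng.standard_normal((hidden_size, num_classes)) * 0.02
out_b = np.zeros(num_classes)

# Second-to-last layer: dense + tanh. This corresponds to the saved features.
features = np.tanh(cls_vector @ dense_w + dense_b)

# Last layer: fully connected, producing one score per class.
logits = features @ out_w + out_b

print(features.shape, logits.shape)
print(features.min() > -1.0, features.max() < 1.0)
```

Because tanh is bounded, the feature values stay strictly inside (-1, +1) whatever the weights are, while the final scores are unbounded.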

In [6]:
features['features'].shape, features['features'].dtype, features['features'].min(), features['features'].max()

((1, 1024), dtype('float32'), -0.9667779, 0.9595487)