# BERT-ERC Teacher Model Code Implementation

In this file you will see how we were able to implement the teacher model using a pretrained RoBERTa-Large Model.  

## Library instillation

In [1]:
!pip install datasets
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import RobertaModel, RobertaTokenizer, AdamW, RobertaTokenizerFast
import pandas as pd
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

Assuring the utlization of CUDA




In [2]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Using device: {device}')

Using device: cuda


Downloading the Data from huggingface

link: https://huggingface.co/datasets/roskoN/dailydialog

In [3]:
ds = load_dataset("roskoN/dailydialog")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/581 [00:00<?, ?B/s]

dailydialog.py:   0%|          | 0.00/4.59k [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/3.67M [00:00<?, ?B/s]

full/validation/0000.parquet:   0%|          | 0.00/340k [00:00<?, ?B/s]

full/test/0000.parquet:   0%|          | 0.00/337k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11118 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

##Parsing Data

Parsing the data into Train, Validation and Testing.

In [4]:
DD_train_data = ds['train']
DD_val_data = ds['validation']
DD_test_data = ds['test']

#Data Processing Stage

##Training Data processing

In this step we are segmenting the dialogue into utterances with its respective emotions and associting speaker tokens to each uttereance for the training data.

In [7]:
# Initialize an empty list to store the data
data = []

# Function to map emotion indices to emotion labels (optional)
emotion_labels = {
    0: 'no_emotion',
    1: 'anger',
    2: 'disgust',
    3: 'fear',
    4: 'happiness',
    5: 'sadness',
    6: 'surprise'
}

def assign_speakers(utterances):
    speakers = []
    current_speaker = 'Speaker A'
    for i in range(len(utterances)):
        speakers.append(current_speaker)
        # Alternate speakers
        current_speaker = 'Speaker B' if current_speaker == 'Speaker A' else 'Speaker A'
    return speakers

# Iterate over each dialogue in the training data
for dialog in DD_train_data:
    utterances = dialog['utterances']
    emotions = dialog['emotions']
    dialog_id = dialog['id']

    # Ensure the number of utterances matches the number of emotions
    if len(utterances) != len(emotions):
        print(f"Length mismatch in dialogue {dialog_id}")
        continue  # Skip this dialogue or handle accordingly

    # Assign speakers
    speakers = assign_speakers(utterances)

    # Iterate over utterance-emotion pairs
    for utt, emo, speaker in zip(utterances, emotions, speakers):
        data.append({
            'dialogue_id': dialog_id,
            'utterance': utt,
            'emotion': emo,
            'emotion_label': emotion_labels.get(emo, 'unknown'),
            'speaker': speaker
        })

# Convert the list of dictionaries into a DataFrame
df_train = pd.DataFrame(data)
df_train = df_train.drop(columns=['dialogue_id'])

# Display the first few rows
df_train.head(100)

Unnamed: 0,utterance,emotion,emotion_label,speaker
0,"Say , Jim , how about going for a few beers af...",0,no_emotion,Speaker A
1,You know that is tempting but is really not go...,0,no_emotion,Speaker B
2,What do you mean ? It will help us to relax .,0,no_emotion,Speaker A
3,Do you really think so ? I don't . It will jus...,0,no_emotion,Speaker B
4,I guess you are right.But what shall we do ? I...,0,no_emotion,Speaker A
...,...,...,...,...
95,You look so tan and healthy !,4,happiness,Speaker A
96,Thanks . I just got back from summer camp .,4,happiness,Speaker B
97,How was it ?,0,no_emotion,Speaker A
98,Great . I got to try so many things for the fi...,4,happiness,Speaker B


In [9]:
df_train.shape

(87170, 4)

##Validation Data processing

In this step we are segmenting the dialogue into utterances with its respective emotions and associting speaker tokens to each uttereance for the validation data.

In [10]:
data = []
# Iterate over each dialogue in the training data
for dialog in DD_val_data:
    utterances = dialog['utterances']
    emotions = dialog['emotions']
    dialog_id = dialog['id']

    # Ensure the number of utterances matches the number of emotions
    if len(utterances) != len(emotions):
        print(f"Length mismatch in dialogue {dialog_id}")
        continue  # Skip this dialogue or handle accordingly

    speakers = assign_speakers(utterances)

    # Iterate over utterance-emotion pairs
    for utt, emo, speaker in zip(utterances, emotions, speakers):
        data.append({
            'dialogue_id': dialog_id,
            'utterance': utt,
            'emotion': emo,
            'emotion_label': emotion_labels.get(emo, 'unknown'),
            'speaker': speaker
        })

# Convert the list of dictionaries into a DataFrame
df_val = pd.DataFrame(data)
df_val = df_val.drop(columns=['dialogue_id'])
# Display the first few rows
df_val.head(100)

Unnamed: 0,utterance,emotion,emotion_label,speaker
0,"Good morning , sir . Is there a bank near here ?",0,no_emotion,Speaker A
1,There is one . 5 blocks away from here ?,0,no_emotion,Speaker B
2,"Well , that's too far.Can you change some mone...",0,no_emotion,Speaker A
3,"Surely , of course . What kind of currency hav...",0,no_emotion,Speaker B
4,RIB .,0,no_emotion,Speaker A
...,...,...,...,...
95,That's him !,0,no_emotion,Speaker B
96,I'll call him and tell him you're here .,0,no_emotion,Speaker A
97,I appreciate your help .,0,no_emotion,Speaker B
98,Would you like to have a seat over there ? It'...,0,no_emotion,Speaker A


In [11]:
df_val.shape

(8069, 4)

##Testing Data processing

In this step we are segmenting the dialogue into utterances with its respective emotions and associting speaker tokens to each uttereance for the testing data.

In [12]:
data = []
# Iterate over each dialogue in the training data
for dialog in DD_test_data:
    utterances = dialog['utterances']
    emotions = dialog['emotions']
    dialog_id = dialog['id']

    # Ensure the number of utterances matches the number of emotions
    if len(utterances) != len(emotions):
        print(f"Length mismatch in dialogue {dialog_id}")
        continue  # Skip this dialogue or handle accordingly

    speakers = assign_speakers(utterances)

    # Iterate over utterance-emotion pairs
    for utt, emo, speaker in zip(utterances, emotions, speakers):
        data.append({
            'dialogue_id': dialog_id,
            'utterance': utt,
            'emotion': emo,
            'emotion_label': emotion_labels.get(emo, 'unknown'),
            'speaker': speaker
        })

# Convert the list of dictionaries into a DataFrame
df_test = pd.DataFrame(data)
df_test = df_test.drop(columns=['dialogue_id'])
# Display the first few rows
df_test.head(100)

Unnamed: 0,utterance,emotion,emotion_label,speaker
0,"Hey man , you wanna buy some weed ?",0,no_emotion,Speaker A
1,Some what ?,6,surprise,Speaker B
2,"Weed ! You know ? Pot , Ganja , Mary Jane some...",0,no_emotion,Speaker A
3,"Oh , umm , no thanks .",0,no_emotion,Speaker B
4,I also have blow if you prefer to do a few lin...,0,no_emotion,Speaker A
...,...,...,...,...
95,I can't really deal with any distractions righ...,0,no_emotion,Speaker B
96,Sun-set hotel . May I help you ?,0,no_emotion,Speaker A
97,"Yes , I have booked a room for 24th . It's a d...",0,no_emotion,Speaker B
98,"Hold on , please . Let me check it for you . Y...",0,no_emotion,Speaker A


In [13]:
df_test.shape

(7740, 4)

## Formating the input text

###Train input text

In [None]:
def prepare_input_text(utterance, speaker):
    # Using the suggestive text format with speaker tokens
    input_text = f"<s> {speaker} <mask> says: {utterance} </s>"
    return input_text

df_train['input_text'] = df_train.apply(lambda x: prepare_input_text(x['utterance'], x['speaker']), axis=1)
df_train.head()

Unnamed: 0,utterance,emotion,emotion_label,speaker,input_text
0,"Say , Jim , how about going for a few beers af...",0,no_emotion,Speaker A,"<s> Speaker A <mask> says: Say , Jim , how abo..."
1,You know that is tempting but is really not go...,0,no_emotion,Speaker B,<s> Speaker B <mask> says: You know that is te...
2,What do you mean ? It will help us to relax .,0,no_emotion,Speaker A,<s> Speaker A <mask> says: What do you mean ? ...
3,Do you really think so ? I don't . It will jus...,0,no_emotion,Speaker B,<s> Speaker B <mask> says: Do you really think...
4,I guess you are right.But what shall we do ? I...,0,no_emotion,Speaker A,<s> Speaker A <mask> says: I guess you are rig...


###Validation input text

In [None]:
df_val['input_text'] = df_val.apply(lambda x: prepare_input_text(x['utterance'], x['speaker']), axis=1)
df_val.head()

Unnamed: 0,utterance,emotion,emotion_label,speaker,input_text
0,"Good morning , sir . Is there a bank near here ?",0,no_emotion,Speaker A,"<s> Speaker A <mask> says: Good morning , sir ..."
1,There is one . 5 blocks away from here ?,0,no_emotion,Speaker B,<s> Speaker B <mask> says: There is one . 5 bl...
2,"Well , that's too far.Can you change some mone...",0,no_emotion,Speaker A,"<s> Speaker A <mask> says: Well , that's too f..."
3,"Surely , of course . What kind of currency hav...",0,no_emotion,Speaker B,"<s> Speaker B <mask> says: Surely , of course ..."
4,RIB .,0,no_emotion,Speaker A,<s> Speaker A <mask> says: RIB . </s>


###Test input text

In [None]:
df_test['input_text'] = df_test.apply(lambda x: prepare_input_text(x['utterance'], x['speaker']), axis=1)
df_test.head()

Unnamed: 0,utterance,emotion,emotion_label,speaker,input_text
0,"Hey man , you wanna buy some weed ?",0,no_emotion,Speaker A,"<s> Speaker A <mask> says: Hey man , you wanna..."
1,Some what ?,6,surprise,Speaker B,<s> Speaker B <mask> says: Some what ? </s>
2,"Weed ! You know ? Pot , Ganja , Mary Jane some...",0,no_emotion,Speaker A,<s> Speaker A <mask> says: Weed ! You know ? P...
3,"Oh , umm , no thanks .",0,no_emotion,Speaker B,"<s> Speaker B <mask> says: Oh , umm , no thank..."
4,I also have blow if you prefer to do a few lin...,0,no_emotion,Speaker A,<s> Speaker A <mask> says: I also have blow if...


##Tokenization and Attention Mask

###Trian Data Tokenization

In [None]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-large')

def tokenize_function(examples):
    return tokenizer(examples['input_text'], truncation=True, padding='max_length', max_length=128)

# Tokenize the dataset
df_train['tokenized'] = df_train['input_text'].apply(lambda x: tokenize_function({'input_text': x}))
df_train['input_ids'] = df_train['tokenized'].apply(lambda x: x['input_ids'])
df_train['attention_mask'] = df_train['tokenized'].apply(lambda x: x['attention_mask'])
df_train.head()

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

####Test Data Tokenization

In [None]:
df_test['tokenized'] = df_test['input_text'].apply(lambda x: tokenize_function({'input_text': x}))
df_test['input_ids'] = df_test['tokenized'].apply(lambda x: x['input_ids'])
df_test['attention_mask'] = df_test['tokenized'].apply(lambda x: x['attention_mask'])
df_test.head()

Unnamed: 0,utterance,emotion,emotion_label,speaker,input_text,tokenized,input_ids,attention_mask
0,"Hey man , you wanna buy some weed ?",0,no_emotion,Speaker A,"<s> Speaker A <mask> says: Hey man , you wanna...","[input_ids, attention_mask]","[0, 0, 6358, 83, 50264, 161, 35, 11468, 313, 2...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Some what ?,6,surprise,Speaker B,<s> Speaker B <mask> says: Some what ? </s>,"[input_ids, attention_mask]","[0, 0, 6358, 163, 50264, 161, 35, 993, 99, 174...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..."
2,"Weed ! You know ? Pot , Ganja , Mary Jane some...",0,no_emotion,Speaker A,<s> Speaker A <mask> says: Weed ! You know ? P...,"[input_ids, attention_mask]","[0, 0, 6358, 83, 50264, 161, 35, 38511, 27785,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,"Oh , umm , no thanks .",0,no_emotion,Speaker B,"<s> Speaker B <mask> says: Oh , umm , no thank...","[input_ids, attention_mask]","[0, 0, 6358, 163, 50264, 161, 35, 5534, 2156, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,I also have blow if you prefer to do a few lin...,0,no_emotion,Speaker A,<s> Speaker A <mask> says: I also have blow if...,"[input_ids, attention_mask]","[0, 0, 6358, 83, 50264, 161, 35, 38, 67, 33, 4...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


#BERT Model Archetcture

In [None]:
class BERTERCModel(nn.Module):
    def __init__(self, pretrained_model_name='roberta-large', num_classes=7):
        super(BERTERCModel, self).__init__()
        self.roberta = RobertaModel.from_pretrained(pretrained_model_name)
        self.fc = nn.Linear(3 * self.roberta.config.hidden_size, self.roberta.config.hidden_size)
        self.dropout = nn.Dropout(0.3)
        self.mlp = nn.Linear(self.roberta.config.hidden_size, num_classes)
        self.tanh = nn.Tanh()

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state  # Shape: [batch_size, seq_len, hidden_size]

        # Split into past, query, future tokens
        batch_size, seq_len, hidden_size = last_hidden_state.size()
        query_start = seq_len // 3
        query_end = 2 * seq_len // 3

        past_features = torch.mean(last_hidden_state[:, :query_start, :], dim=1)
        query_features = torch.mean(last_hidden_state[:, query_start:query_end, :], dim=1)
        future_features = torch.mean(last_hidden_state[:, query_end:, :], dim=1)

        # Concatenate past, query, and future features
        combined_features = torch.cat((past_features, query_features, future_features), dim=1)

        # Classification steps
        cls_features = self.tanh(self.fc(combined_features))
        cls_features = self.dropout(cls_features)
        logits = self.mlp(cls_features)

        return logits

#Training

In [None]:
class EmotionDataset(Dataset):
    def __init__(self, dataframe):
        self.input_ids = list(dataframe['input_ids'])
        self.attention_masks = list(dataframe['attention_mask'])
        self.labels = list(dataframe['emotion'])  # Assuming this is the integer label column

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
            'attention_mask': torch.tensor(self.attention_masks[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Create dataset and dataloader
train_dataset = EmotionDataset(df_train)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Initialize model
model = BERTERCModel(pretrained_model_name='roberta-large', num_classes=7)

# Move model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

# Training loop
epochs = 10
model.train()
for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        logits = model(input_ids=input_ids, attention_mask=attention_mask)

        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits, labels)
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

        predicted_labels = torch.argmax(logits, dim=1)
        total_correct = (predicted_labels == labels).sum().item()
        total_samples = labels.size(0)

    avg_loss = total_loss / len(train_loader)
    accuracy = total_correct / total_samples
    print(f"Epoch {epoch + 1}/{epochs} - Average Loss: {avg_loss:.4f} -` Accuracy: {accuracy:.4f}")

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10 - Average Loss: 0.3990 -` Accuracy: 1.0000
Epoch 2/10 - Average Loss: 0.3334 -` Accuracy: 1.0000
Epoch 3/10 - Average Loss: 0.2945 -` Accuracy: 1.0000
Epoch 4/10 - Average Loss: 0.2534 -` Accuracy: 1.0000
Epoch 5/10 - Average Loss: 0.2196 -` Accuracy: 1.0000
Epoch 6/10 - Average Loss: 0.1863 -` Accuracy: 1.0000
Epoch 7/10 - Average Loss: 0.1675 -` Accuracy: 1.0000
Epoch 8/10 - Average Loss: 0.1528 -` Accuracy: 1.0000
Epoch 9/10 - Average Loss: 0.1377 -` Accuracy: 1.0000
Epoch 10/10 - Average Loss: 0.1261 -` Accuracy: 1.0000


#Testing

In [None]:
test_dataset = EmotionDataset(df_test)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=True)

model.eval()
all_preds = []
all_labels = []
correct = 0
total = 0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs  # Since your model returns logits directly

        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

        # Compute accuracy for this batch
        correct += (preds == labels).sum().item()
        total += labels.size(0)

# Compute overall accuracy
accuracy = correct / total
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 83.95%
