<a href="https://colab.research.google.com/github/RosaMeyer/2023-lectures/blob/main/Week2_recurrent_neural_network_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 - Recurrent Neural Networks with LSTM for Language Modeling

Let k be the number of members in your group (k ∈ {1, 2, 3}). Implement
k different language models for the questions in the three languages Arabic,
Korean and Telugu, as well as for the document contexts in English (total k × 4
language models), using the training data.

## Install Dependencies

In [1]:
%pip install -q torch pandas pyarrow transformers
import math, random, os
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

## Fetching data

Here we follow the same technique from week 1 to get the training and validation datasets:

In [2]:
# Training data - 6335 rows

url = 'https://huggingface.co/datasets/coastalcph/tydi_xor_rc/resolve/main/train.parquet'
train_df = pd.read_parquet(url)

languages = ['ar', 'ko', 'te']
filter_train_df = train_df[train_df['lang'].isin(languages)]

In [3]:
# Validation data - 1155 rows

url = 'https://huggingface.co/datasets/coastalcph/tydi_xor_rc/resolve/main/validation.parquet'
validation_df = pd.read_parquet(url)

filtered_val_df = validation_df[validation_df['lang'].isin(languages)]

## Extracting questions and contexts

We define helper functions that extract the questions and contexts for a given language from a DataFrame:

In [4]:
def getQuestionsForLanguage(language, data):
    return data[data['lang'] == language]['question']

# TODO: The two helpers below should be used for the English context model (translated questions across languages)
def getContextForLang(language, data):
    return data[data['lang'] == language]['context']

def getContextForAllLang(languages, data):
    res = []
    for lang in languages:
        res.extend(getContextForLang(lang, data))
    return res

# filter_all = getContextForAllLang(languages, filter_train_df)
# len(filter_all)

## Tokenization

In [5]:
# TODO: Consider using the "original" tokenization (the one implemented in week 1)

# Use the multilingual BERT tokenizer (only the tokenizer, not the BERT model)
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

class FFDataset(Dataset):
    def __init__(self, texts, context, tokenizer):
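        # Concatenate all texts into one token-id stream, ending each text with [SEP]
        ids = []
        for t in texts:
            tokens = tokenizer(t, add_special_tokens=False, return_attention_mask=False)['input_ids']
            if not tokens: continue
            ids.extend(tokens + [tokenizer.sep_token_id])
        self.ids = ids
        self.context = context
        self.vocab_size = tokenizer.vocab_size

        # Slide a window over the stream: each input is `context` consecutive ids,
        # and the target is the id that immediately follows them
        x = []
        y = []
        for i in range(len(ids) - context):
            x.append(ids[i:i+context])
            y.append(ids[i+context])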

        self.x = torch.tensor(x, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self): return len(self.y)
    def __getitem__(self, idx): return self.x[idx], self.y[idx]
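
The windowing inside `FFDataset` can be illustrated on a toy id stream. This is a minimal sketch: `make_windows` and `toy_ids` are hypothetical names introduced here, and the ids are made up rather than produced by the tokenizer.

```python
def make_windows(ids, context):
    """Pair each run of `context` ids with the id that follows it."""
    x = [ids[i:i + context] for i in range(len(ids) - context)]
    y = [ids[i + context] for i in range(len(ids) - context)]
    return x, y

# Hypothetical token ids standing in for a tokenized text stream
toy_ids = [11, 12, 13, 14, 15]
x, y = make_windows(toy_ids, context=3)
print(x)  # [[11, 12, 13], [12, 13, 14]]
print(y)  # [14, 15]
```

A stream of `n` ids therefore yields `n - context` training examples, and a stream no longer than `context` yields none.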


## LSTM Model

In [6]:
# LSTM-based language model for next token prediction (uni-directional)
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size=256, hidden_size=512, num_layers=2, dropout=0.2):
        super(LSTMLanguageModel, self).__init__()

        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer: Essential for converting discrete tokens to continuous space
        # This allows the model to learn semantic relationships between tokens
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # LSTM: The core component that models sequential dependencies
        # batch_first=True makes input shape (batch, seq, feature)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout if num_layers > 1 else 0)

        # Dropout for regularization - helps prevent overfitting
        self.dropout = nn.Dropout(dropout)

        # Output projection: Maps LSTM hidden states to vocabulary logits (the loss applies the softmax)
        self.fc = nn.Linear(hidden_size, vocab_size)

        self.init_weights()

    # Initialize weights with small random values for more stable training
    def init_weights(self):
        init_range = 0.1
        self.embedding.weight.data.uniform_(-init_range, init_range)
        self.fc.bias.data.zero_()
        self.fc.weight.data.uniform_(-init_range, init_range)

    # token_ids -> embeddings -> LSTM -> logits
    def forward(self, x, hidden=None):
        # Convert tokens to dense embeddings
        embedded = self.embedding(x)
        embedded = self.dropout(embedded)

        # Process through LSTM layers
        lstm_out, hidden = self.lstm(embedded, hidden)
        lstm_out = self.dropout(lstm_out)

        # Project to vocabulary size for next token prediction
        output = self.fc(lstm_out)

        return output, hidden # Logits for next token prediction and updated hidden state
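
As a sanity check on the model size, the parameter count can be worked out by hand. The sketch below assumes the layout of PyTorch's `nn.LSTM` (four gates per layer, with separate input and recurrent weight matrices and two bias vectors) and uses the vocabulary size 119547 reported in the training logs; `lstm_lm_param_count` is a hypothetical helper, not part of the notebook.

```python
def lstm_lm_param_count(vocab_size, embed_size, hidden_size, num_layers):
    """Count trainable parameters of embedding + stacked LSTM + linear head."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings, later layers read hidden states
        input_size = embed_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        lstm += 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size
    head = hidden_size * vocab_size + vocab_size  # weight + bias of the output layer
    return embedding + lstm + head

total = lstm_lm_param_count(vocab_size=119547, embed_size=256, hidden_size=512, num_layers=2)
print(f'{total:,}')  # 95,609,851
```

About 92 million of these parameters sit in the embedding and output layers, so the model size is dominated by the tokenizer's vocabulary rather than by the LSTM itself.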

## Training function

In [7]:
# Train the LSTM language model with proper regularization
def train_model(model, train_loader, val_loader, epochs=5, lr=0.001, device='cuda' if torch.cuda.is_available() else 'cpu'):
    model.to(device)

    # Adam optimizer works well for RNNs, weight decay adds L2 regularization
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()

    # Reduce learning rate when validation loss plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=2, factor=0.5)

    training_losses = []
    validation_losses = []

    print(f'Training on {device}')
    print(f'Model parameters: {sum(p.numel() for p in model.parameters()):,}')

    for epoch in range(epochs):
        # Training phase
        model.train()
        total_train_loss = 0
        train_batches = 0

        for batch_x, batch_y in train_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            optimizer.zero_grad()

            # Forward pass - predict next token from context
            output, _ = model(batch_x)

            # Score only the last timestep: each window in the dataset has a
            # single target token, the one that follows the context
            loss = criterion(output[:, -1, :], batch_y)

            # Backward pass
            loss.backward()

            # Gradient clipping prevents exploding gradients in RNNs
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

            optimizer.step()

            total_train_loss += loss.item()
            train_batches += 1

        avg_train_loss = total_train_loss / train_batches
        training_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        total_val_loss = 0
        val_batches = 0

        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)

                output, _ = model(batch_x)
                loss = criterion(output[:, -1, :], batch_y)

                total_val_loss += loss.item()
                val_batches += 1

        avg_val_loss = total_val_loss / val_batches
        validation_losses.append(avg_val_loss)

        scheduler.step(avg_val_loss)

        print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}')

    return training_losses, validation_losses
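
The gradient clipping step rescales all gradients by a single factor whenever their joint L2 norm exceeds `max_norm`. The plain-Python sketch below illustrates that idea on made-up numbers; `clip_by_global_norm` is a hypothetical stand-in written for this example, not the PyTorch function itself.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Hypothetical gradients for two parameter tensors, flattened to lists
grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm)  # 13.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks. Gradients whose norm is already below `max_norm` pass through unchanged.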

## Perplexity Calculation

In [8]:
# calculate perplexity: perplexity = exp(cross_entropy_loss)
def calculate_perplexity(model, data_loader, device='cuda' if torch.cuda.is_available() else 'cpu'):
    model.eval()
    total_loss = 0
    total_tokens = 0
    criterion = nn.CrossEntropyLoss(reduction='sum')

    with torch.no_grad():
        for batch_x, batch_y in data_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            output, _ = model(batch_x)
            loss = criterion(output[:, -1, :], batch_y)

            total_loss += loss.item()
            total_tokens += batch_y.size(0)

    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss))

    return perplexity.item()
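
Perplexity is the exponential of the mean negative log-likelihood of the target tokens, `PPL = exp((1/N) * sum(-log p(y_i | x_i)))`. A small worked example, using made-up probabilities and a hypothetical `perplexity` helper, shows how to read the number:

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads probability uniformly over 4 tokens has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# More confident (and correct) predictions lower it
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(round(uniform, 6), confident < uniform)  # 4.0 True
```

A perplexity of about 100 can thus be read as the model being, on average, as uncertain as a uniform choice among roughly 100 tokens.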

## Training Script for Arabic, Korean and Telugu Languages

In [10]:
# Train separate LSTM models for Arabic, Korean, and Telugu
# Using separate models allows each model to specialize in language-specific patterns for the three languages
def train_all_languages():
    # Set random seeds for reproducible results
    torch.manual_seed(42)
    random.seed(42)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Using device: {device}')

    # TODO: consider experimenting with the parameters to see how they affect the results
    # Model hyperparameters - starting values shared by all three languages
    CONTEXT_SIZE = 32      # Sequence length for context
    EMBED_SIZE = 256       # Embedding dimension
    HIDDEN_SIZE = 512      # LSTM hidden size -- could be downgraded to 256
    NUM_LAYERS = 2         # Number of LSTM layers
    DROPOUT = 0.2          # Dropout probability -- if underfitting, reduce it; if overfitting, increase
    BATCH_SIZE = 64        # Training batch size
    EPOCHS = 5             # Training epochs -- 15
    LEARNING_RATE = 0.001  # Initial learning rate

    languages = ['ar', 'ko', 'te']
    language_names = {'ar': 'Arabic', 'ko': 'Korean', 'te': 'Telugu'}

    results = {}

    # Train models based on distinct languages for questions
    for lang in languages:
        print(f'\n{"="*50}')
        print(f'Training LSTM Language Model for {language_names[lang]} ({lang})')
        print(f'{"="*50}')

        # Extract language-specific data
        train_questions = getQuestionsForLanguage(lang, filter_train_df).tolist()
        val_questions = getQuestionsForLanguage(lang, filtered_val_df).tolist()

        print(f'Training samples: {len(train_questions)}')
        print(f'Validation samples: {len(val_questions)}')

        # Create datasets with tokenization
        train_dataset = FFDataset(train_questions, CONTEXT_SIZE, tokenizer)
        val_dataset = FFDataset(val_questions, CONTEXT_SIZE, tokenizer)

        print(f'Vocabulary size: {train_dataset.vocab_size}')
        print(f'Training sequences: {len(train_dataset)}')

        # Create data loaders
        train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

        # Initialize model for each language
        model = LSTMLanguageModel(
            vocab_size=train_dataset.vocab_size,
            embed_size=EMBED_SIZE,
            hidden_size=HIDDEN_SIZE,
            num_layers=NUM_LAYERS,
            dropout=DROPOUT
        )

        print(f'Model parameters: {sum(p.numel() for p in model.parameters()):,}')

        # Train the model
        train_losses, val_losses = train_model(
            model, train_loader, val_loader,
            epochs=EPOCHS, lr=LEARNING_RATE, device=device
        )

        # Calculate final performance metrics - perplexity
        train_perplexity = calculate_perplexity(model, train_loader, device)
        val_perplexity = calculate_perplexity(model, val_loader, device)

        # Store results for analysis
        results[lang] = {
            'model': model,
            'train_losses': train_losses,
            'val_losses': val_losses,
            'train_perplexity': train_perplexity,
            'val_perplexity': val_perplexity,
            'vocab_size': train_dataset.vocab_size
        }

        print(f'\nFinal Results for {language_names[lang]}:')
        print(f'Training Perplexity: {train_perplexity:.2f}')
        print(f'Validation Perplexity: {val_perplexity:.2f}')

    return results


In [13]:
def train_context_for_all_lang():
    """
    Train a single LSTM model on the English text (context column).

    Characteristics of the language:
    - English (contexts): simple inflectional system, polysemy, West Germanic

    Using a separate model lets it specialize in language-specific patterns.
    The procedure mirrors train_all_languages() above.
    """

    # Set random seeds for reproducible results
    torch.manual_seed(42)
    random.seed(42)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    # TODO: consider experimenting with the parameters to see how they affect the results
    # Model hyperparameters - same starting values as for the question models, except DROPOUT
    CONTEXT_SIZE = 32      # Sequence length for context
    EMBED_SIZE = 256       # Embedding dimension
    HIDDEN_SIZE = 512      # LSTM hidden size -- could be downgraded to 256
    NUM_LAYERS = 2         # Number of LSTM layers
    DROPOUT = 0.3          # Dropout probability -- if underfitting, reduce it; if overfitting, increase
    BATCH_SIZE = 64        # Training batch size
    EPOCHS = 5             # Training epochs -- 15
    LEARNING_RATE = 0.001  # Initial learning rate

    languages = ['ar', 'ko', 'te']
    language_names = {'ar': 'Arabic', 'ko': 'Korean', 'te': 'Telugu'}

    results = {}

    print(f"\n{'='*50}")
    print("Training LSTM Language Model for English (contexts)")
    print(f"{'='*50}")

    # Pool the English contexts from the rows of all three question languages
    train_context = getContextForAllLang(languages, filter_train_df)
    val_context = getContextForAllLang(languages, filtered_val_df)

    print(f"Training samples: {len(train_context)}")
    print(f"Validation samples: {len(val_context)}")

    # Create datasets with tokenization
    train_dataset = FFDataset(train_context, CONTEXT_SIZE, tokenizer)
    val_dataset = FFDataset(val_context, CONTEXT_SIZE, tokenizer)

    print(f"Vocabulary size: {train_dataset.vocab_size}")
    print(f"Training sequences: {len(train_dataset)}")

    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

    # Initialize model for the English context
    model = LSTMLanguageModel(
        vocab_size=train_dataset.vocab_size,
        embed_size=EMBED_SIZE,
        hidden_size=HIDDEN_SIZE,
        num_layers=NUM_LAYERS,
        dropout=DROPOUT
    )

    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

    # Train the model
    train_losses, val_losses = train_model(
        model, train_loader, val_loader,
        epochs=EPOCHS, lr=LEARNING_RATE, device=device
    )

    # Calculate final performance metrics - perplexity
    train_perplexity = calculate_perplexity(model, train_loader, device)
    val_perplexity = calculate_perplexity(model, val_loader, device)

    # Store results for analysis
    results['English (contexts)'] = {
        'model': model,
        'train_losses': train_losses,
        'val_losses': val_losses,
        'train_perplexity': train_perplexity,
        'val_perplexity': val_perplexity,
        'vocab_size': train_dataset.vocab_size
    }

    print(f"\nFinal Results for English (contexts):")
    print(f"Training Perplexity: {train_perplexity:.2f}")
    print(f"Validation Perplexity: {val_perplexity:.2f}")

    return results
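
The two training functions repeat the same hyperparameter block and differ only in `DROPOUT` (0.2 for the questions, 0.3 for the English contexts). One possible way to keep them in sync is a small config object. This is a sketch only: `LMConfig` is a hypothetical name that does not exist in the notebook.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LMConfig:
    """Hypothetical container for the hyperparameters used in both functions."""
    context_size: int = 32
    embed_size: int = 256
    hidden_size: int = 512
    num_layers: int = 2
    dropout: float = 0.2
    batch_size: int = 64
    epochs: int = 5
    learning_rate: float = 0.001

question_cfg = LMConfig()                         # values from train_all_languages
context_cfg = replace(question_cfg, dropout=0.3)  # only dropout differs for the contexts
print(context_cfg.dropout, context_cfg.hidden_size)  # 0.3 512
```

Passing such an object into a shared training routine would make the single difference between the two setups explicit.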


## Run Training and Display Results

In [11]:
# Execute the training
results = train_all_languages()

# Display summary results
print(f'\n{"="*60}')
print('SUMMARY RESULTS - LSTM Language Models')
print(f'{"="*60}')
print(f'{"Language":<15} {"Train PPL":<12} {"Val PPL":<12} {"Vocab Size":<12}')
print(f'{"-"*60}')

language_names = {'ar': 'Arabic', 'ko': 'Korean', 'te': 'Telugu'}
for lang in ['ar', 'ko', 'te']:
    lang_name = language_names[lang]
    train_ppl = results[lang]['train_perplexity']
    val_ppl = results[lang]['val_perplexity']
    vocab_size = results[lang]['vocab_size']
    print(f"{lang_name:<15} {train_ppl:<12.2f} {val_ppl:<12.2f} {vocab_size:<12}")

Using device: cuda

Training LSTM Language Model for Arabic (ar)
Training samples: 2558
Validation samples: 415
Vocabulary size: 119547
Training sequences: 36399
Model parameters: 95,609,851
Training on cuda
Model parameters: 95,609,851
Epoch [1/5], Train Loss: 6.1128, Val Loss: 5.2830
Epoch [2/5], Train Loss: 5.0891, Val Loss: 4.9530
Epoch [3/5], Train Loss: 4.8146, Val Loss: 4.8074
Epoch [4/5], Train Loss: 4.6343, Val Loss: 4.6658
Epoch [5/5], Train Loss: 4.4902, Val Loss: 4.6321

Final Results for Arabic:
Training Perplexity: 73.27
Validation Perplexity: 103.07

Training LSTM Language Model for Korean (ko)
Training samples: 2422
Validation samples: 356
Vocabulary size: 119547
Training sequences: 37494
Model parameters: 95,609,851
Training on cuda
Model parameters: 95,609,851
Epoch [1/5], Train Loss: 6.1963, Val Loss: 5.2239
Epoch [2/5], Train Loss: 4.5026, Val Loss: 4.1865
Epoch [3/5], Train Loss: 3.9285, Val Loss: 3.8758
Epoch [4/5], Train Loss: 3.6343, Val Loss: 3.6875
Epoch [5/5]

In [14]:
# Execute the training for English (contexts) specifically
results = train_context_for_all_lang()

# Display summary results
print(f"\n{'='*60}")
print("SUMMARY RESULTS - LSTM Language Models")
print(f"{'='*60}")
print(f"{'Language':<15} {'Train PPL':<12} {'Val PPL':<12} {'Vocab Size':<12}")
print(f"{'-'*60}")


lang_name = 'English (contexts)'
train_ppl = results[lang_name]['train_perplexity']
val_ppl = results[lang_name]['val_perplexity']
vocab_size = results[lang_name]['vocab_size']
print(f"{lang_name:<15} {train_ppl:<12.2f} {val_ppl:<12.2f} {vocab_size:<12}")

Token indices sequence length is longer than the specified maximum sequence length for this model (551 > 512). Running this sequence through the model will result in indexing errors


Using device: cuda

Training LSTM Language Model for English (contexts)
Training samples: 6335
Validation samples: 1155
Vocabulary size: 119547
Training sequences: 868001
Model parameters: 95,609,851
Training on cuda
Model parameters: 95,609,851


KeyboardInterrupt: 
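
The English run was interrupted before its first epoch was logged. The sequence counts in the logs above suggest why it is so much slower. The following back-of-the-envelope calculation assumes the `DataLoader` default of keeping the final partial batch:

```python
import math

BATCH_SIZE = 64
# Training sequence counts copied from the logs above
train_sequences = {'Arabic': 36399, 'Korean': 37494, 'English (contexts)': 868001}

# Number of optimizer steps per epoch when the last partial batch is kept
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in train_sequences.items()}
for name, b in batches.items():
    print(f'{name:<20} {b:>6} batches per epoch')
```

With 13,563 batches per epoch against 569 for Arabic, one epoch over the contexts takes roughly 24 times as many steps. Subsampling the contexts or lowering `EPOCHS` may be worth considering for this model.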