# LSTM TRAINING FROM SCRATCH

## Overview

This notebook implements a multi-label emotion classification pipeline built on a custom bidirectional LSTM trained from scratch. A SentencePiece (BPE) tokenizer, trained directly on the corpus, encodes each text into a fixed-length sequence of 256 token ids. Training minimizes BCEWithLogitsLoss over five emotion labels (anger, fear, joy, sadness, surprise) using AdamW with cosine annealing. Runs are tracked in W&B, and the checkpoint with the best validation macro-F1 is saved.

## Imports

In [None]:
!pip install wandb

Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (47.7 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 19.0.1
    Uninstalling pyarrow-19.0.1:
      Successfully uninstalled pyarrow-19.0.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0

In [None]:
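#import libraries
import os
import wandb
import random
import numpy as np
import pandas as pd
import sentencepiece as spm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split  # needed for the train/validation split

# Read the API key from the WANDB_API_KEY environment variable instead of hardcoding it
wandb.login(key=os.environ.get("WANDB_API_KEY"))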

2025-11-28 16:19:23.010698: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764346763.192704      47 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764346763.247451      47 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


True

## Variables

In [None]:
SEED = 42
MAX_LEN = 256
BATCH = 64
EPOCHS = 10
LR = 2e-3
LABEL_COLS = ["anger","fear","joy","sadness","surprise"]

#set random seed
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(SEED)


## Load Data

In [None]:
train_df = pd.read_csv("/kaggle/input/dlgenai/augmented_train.csv")
test_df  = pd.read_csv("/kaggle/input/dlgenai/test_clean.csv")

# Hold out 10% of the training data for validation
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=SEED)

**Export the training text to a file, replacing newlines with spaces to prepare a clean corpus, then train a SentencePiece Byte-Pair Encoding (BPE) tokenizer on that file with a vocabulary size of 28,000.**

In [None]:
import re
import unicodedata

def clean_text(text):
    # Convert to string
    text = str(text)
    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)
    # Replace newlines/tabs with space
    text = re.sub(r'\s+', ' ', text)
    return text

train_df["text_clean"] = train_df["text"].apply(clean_text)
test_df["text_clean"] = test_df["text"].apply(clean_text)

In [None]:
with open("all_text.txt", "w", encoding="utf-8") as f:
    for t in train_df["text"]:
        f.write(str(t).replace("\n", " ") + "\n")

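spm.SentencePieceTrainer.Train(
    input="all_text.txt",
    model_prefix="bpe_tokenizer",
    vocab_size=28000,
    model_type="bpe",
    character_coverage=1.0,
    max_sentence_length=99999,
    # Reserve id 0 for padding so it matches encode() and padding_idx=0;
    # by default SentencePiece assigns id 0 to <unk> and defines no pad token.
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3
)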

In [None]:

sp = spm.SentencePieceProcessor()
sp.load("bpe_tokenizer.model")

In [None]:
def encode(text, max_len=MAX_LEN):
    ids = sp.encode(text, out_type=int)
    if len(ids) < max_len:
        ids += [0] * (max_len - len(ids))
    else:
        ids = ids[:max_len]
    return ids
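
The padding and truncation logic in `encode` can be checked in isolation, without a trained tokenizer. The sketch below uses a hypothetical helper, `pad_or_truncate`, and made-up token ids; it assumes the padding id is 0, as in `encode` above.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a new list of exactly max_len ids (hypothetical helper)."""
    ids = list(ids)[:max_len]                      # truncate long inputs
    return ids + [pad_id] * (max_len - len(ids))   # right-pad short inputs

short = pad_or_truncate([17, 4, 92], max_len=6)
long = pad_or_truncate(list(range(1, 11)), max_len=6)
print(short)  # [17, 4, 92, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6]
```

Unlike `encode`, this version copies the list before padding, so the caller's list is never modified in place.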

In [None]:
def preprocess_dataset(df, is_test=False):
    data = []
    for _, row in df.iterrows():
        ids = encode(str(row["text"]))
        if is_test:
            labels = torch.zeros(len(LABEL_COLS))
        else:
            labels = torch.tensor([row[c] for c in LABEL_COLS], dtype=torch.float)

        data.append({
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "labels": labels
        })
    return data

train_ds = preprocess_dataset(train_df)
val_ds   = preprocess_dataset(val_df)
test_ds  = preprocess_dataset(test_df, is_test=True)

train_loader = DataLoader(train_ds, batch_size=BATCH, shuffle=True)
val_loader   = DataLoader(val_ds, batch_size=BATCH)
test_loader  = DataLoader(test_ds, batch_size=BATCH)

## MODEL

**Defines a bidirectional LSTM that processes the input embeddings with a 128-dimensional hidden state per direction (256 dimensions after concatenation). The model takes the LSTM output at the final time step, applies Batch Normalization and Dropout, and maps the result to 5 emotion logits.**

In [None]:
class LSTMEmotion(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_labels=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            batch_first=True,
            bidirectional=True
        )

        self.dropout = nn.Dropout(0.4)
        self.bn = nn.BatchNorm1d(hidden_dim * 2)
        self.fc = nn.Linear(hidden_dim * 2, num_labels)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        lstm_out, _ = self.lstm(x)
        cls = lstm_out[:, -1, :]
        cls = self.bn(cls)
        cls = self.dropout(cls)
        return self.fc(cls)
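
Because `encode` right-pads every sequence to 256 tokens, `lstm_out[:, -1, :]` reads a padding position for any input shorter than `MAX_LEN`. One possible alternative, sketched below under the assumption that id 0 is used only for padding, packs the sequences so the LSTM stops at each sequence's true length, then concatenates the final forward and backward hidden states. The function name `last_hidden_states` and the toy layer sizes are illustrative and not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

def last_hidden_states(embedding, lstm, input_ids, pad_id=0):
    """Final forward/backward hidden states, ignoring right-padding."""
    # True length of each sequence; clamp so all-padding rows stay valid
    lengths = (input_ids != pad_id).sum(dim=1).clamp(min=1).cpu()
    packed = pack_padded_sequence(
        embedding(input_ids), lengths, batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = lstm(packed)
    # For a single-layer bidirectional LSTM, h_n[0] is forward and h_n[1] is backward
    return torch.cat([h_n[0], h_n[1]], dim=1)

torch.manual_seed(0)
emb = nn.Embedding(50, 8, padding_idx=0)
lstm = nn.LSTM(8, 16, batch_first=True, bidirectional=True)

padded = torch.tensor([[5, 9, 3, 0, 0, 0]])
unpadded = torch.tensor([[5, 9, 3]])
feat_padded = last_hidden_states(emb, lstm, padded)
feat_unpadded = last_hidden_states(emb, lstm, unpadded)
print(feat_padded.shape)  # torch.Size([1, 32])
print(torch.allclose(feat_padded, feat_unpadded, atol=1e-6))
```

In `LSTMEmotion.forward`, the returned tensor would take the place of `lstm_out[:, -1, :]` before batch normalization. Whether this improves the results reported below has not been tested here.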

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LSTMEmotion(sp.get_piece_size()).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
loss_fn = nn.BCEWithLogitsLoss()
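
`CosineAnnealingLR` with `T_max=EPOCHS` lowers the learning rate from `LR` toward `eta_min` (0 by default) along half a cosine wave. The sketch below evaluates the standard closed-form expression for that schedule with the notebook's values. The helper name `cosine_lr` is hypothetical, and its output should agree with PyTorch's scheduler only up to floating-point error.

```python
import math

def cosine_lr(epoch, base_lr=2e-3, t_max=10, eta_min=0.0):
    """Learning rate after `epoch` scheduler steps under cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_lr(e) for e in range(11)]
for e, lr in enumerate(schedule):
    print(f"after {e:2d} steps: lr = {lr:.6f}")
```

The training loop reads the learning rate after `scheduler.step()`, so the value logged for the last epoch is the end of the schedule, 0. That agrees with the `lr` of 0.0 in the run summary.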

## TRAINING

**Full training loop with gradient clipping. After each epoch, the model is evaluated with the macro F1 score on the validation set, the training metrics are logged to W&B, and the model is saved whenever the validation F1 improves.**
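
Macro F1 computes an F1 score for each label separately and then takes the unweighted mean. The pure-Python sketch below illustrates this on a made-up example with three labels. The helper names are hypothetical, and a label with no true or predicted positives is given an F1 of 0, which is assumed to match scikit-learn's default (scikit-learn also emits a warning in that case).

```python
def f1_binary(y_true, y_pred):
    """F1 for a single label from 0/1 lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(labels[0])
    scores = [
        f1_binary([row[j] for row in labels], [row[j] for row in preds])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels

labels = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
preds  = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(labels, preds)
print(f"{score:.4f}")  # 0.5556
```

Here the third label is never predicted and contributes 0, so the macro average drops to about 0.56 even though 10 of the 12 individual predictions are correct.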

In [None]:
wandb.init(
    project="23f3001910-t32025",
    name="lstm",
    config={
        "seed": SEED,
        "lr": LR,
        "epochs": EPOCHS,
        "batch": BATCH,
        "max_len": MAX_LEN,
        "model": "LSTM + BPE",
        "scheduler": "CosineAnnealingLR"
    }
)

In [None]:
best_f1 = 0

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0

    for batch in train_loader:
        ids = batch["input_ids"].to(device)
        lbl = batch["labels"].to(device)

        logits = model(ids)
        loss = loss_fn(logits, lbl)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        total_loss += loss.item()

    scheduler.step()

    # VALIDATION
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for batch in val_loader:
            ids = batch["input_ids"].to(device)
            lbl = batch["labels"].cpu().numpy()

            logits = model(ids).cpu()
            probs = torch.sigmoid(logits).numpy()
            preds = (probs >= 0.5).astype(int)

            all_labels.append(lbl)
            all_preds.append(preds)

    all_labels = np.vstack(all_labels)
    all_preds = np.vstack(all_preds)

    macro_f1 = f1_score(all_labels, all_preds, average="macro")

    lr_now = scheduler.get_last_lr()[0]

    # WANDB LOG
    wandb.log({
        "train_loss": total_loss / len(train_loader),
        "val_f1": macro_f1,
        "lr": lr_now
    })

    print(f"Epoch {epoch+1}/{EPOCHS} | Loss={total_loss/len(train_loader):.4f} | F1={macro_f1:.4f}")


    if macro_f1 > best_f1:
        best_f1 = macro_f1
        torch.save(model.state_dict(), "best_model.pth")
        wandb.save("best_model.pth")

Training BPE tokenizer...
Tokenizer vocab size: 28000


Run history:
lr          █▇▇▆▅▃▂▂▁▁
train_loss  █▂▂▂▂▁▁▁▁▁
val_f1      ▁▃▆▅█▃▃▃▃▃

Run summary:
lr          0.0
train_loss  0.64269
val_f1      0.14419


Epoch 1/10 | Loss=0.6459 | F1=0.3448
Epoch 2/10 | Loss=0.6433 | F1=0.2501
Epoch 3/10 | Loss=0.6430 | F1=0.0947
Epoch 4/10 | Loss=0.6429 | F1=0.3409
Epoch 5/10 | Loss=0.6429 | F1=0.3942
Epoch 6/10 | Loss=0.6429 | F1=0.1442
Epoch 7/10 | Loss=0.6428 | F1=0.1442
Epoch 8/10 | Loss=0.6428 | F1=0.0000
Epoch 9/10 | Loss=0.6428 | F1=0.1442
Epoch 10/10 | Loss=0.6427 | F1=0.1442
Training complete. Best F1: 0.39416478590409926

Running inference on test.csv...
Saved test_predictions.csv
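
The training loss in the log above moves only from 0.6459 to 0.6427. For reference, the per-label loss that `BCEWithLogitsLoss` averages can be written out by hand. The sketch below uses a common numerically stable form and a hypothetical helper name, `bce_with_logits`.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# A logit of 0 means a predicted probability of 0.5
for target in (0.0, 1.0):
    print(target, round(bce_with_logits(0.0, target), 4))  # 0.6931 for both targets

# A confident, correct prediction costs far less
confident = bce_with_logits(4.0, 1.0)
print(confident < 0.02)  # True
```

A logit of 0 costs ln 2 ≈ 0.6931 whatever the target is. A loss that plateaus just below that value may indicate that the model has learned little beyond the label frequencies, which would be consistent with the unstable validation F1.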


## INFERENCE

In [None]:
#INFERENCE

# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load("best_model.pth", map_location=device))
model.eval()

test_preds = []

threshold = 0.5

with torch.no_grad():
    for batch in test_loader:
        ids = batch["input_ids"].to(device)
        logits = model(ids)
        probs = torch.sigmoid(logits).cpu().numpy()
        preds = (probs >= threshold).astype(int)  # convert to 0/1
        test_preds.append(preds)

test_preds = np.vstack(test_preds)

# Attach predictions
for i, col in enumerate(LABEL_COLS):
    test_df[col] = test_preds[:, i]

In [None]:
#Save predictions
test_df = test_df[['id','anger','fear', 'joy', 'sadness','surprise']]
test_df.to_csv("test_predictions.csv", index=False)
wandb.save("test_predictions.csv")
print("Saved test_predictions.csv")

Saved test_predictions.csv
