Text Analytics Group 2 - LSTM Model

Introduction

Long Short-Term Memory (LSTM) neural networks are a special type of recurrent neural network (RNN) designed to model sequential data and capture long-range dependencies. Unlike traditional RNNs, which struggle with learning long-term patterns due to the vanishing gradient problem, LSTMs use a sophisticated memory cell and gating mechanism to retain and control information over extended sequences. This makes them especially effective for tasks such as natural language processing, time series forecasting, and speech recognition, where understanding the context and order of data is essential.
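
To make the memory cell and gating mechanism concrete, here is a minimal NumPy sketch of a single LSTM time step. The function name lstm_step, the weight layout (gates stacked in the order input, forget, candidate, output) and the toy sizes are assumptions made for illustration; this is not the code used by PyTorch's nn.LSTM later in the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g     # updated memory cell
    h_t = o * np.tanh(c_t)       # updated hidden state
    return h_t, c_t

# Toy sizes: 4-dimensional inputs, 3 hidden units, a sequence of 5 steps
rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The additive update of c_t is the part that lets information and gradients pass across many time steps with less decay than in a plain RNN.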

References:

Understanding LSTM -- a tutorial into long short-term memory recurrent neural networks
Long Short-Term Memory

1. Set up
1.1. Import Libraries

In [None]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download("all")
import matplotlib.pyplot as plt
import torch

%matplotlib inline


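1.2. Download Datasets
The original tutorial text describes Movie Review (MR), a sentence polarity dataset from (Pang and Lee, 2005) with 5331 positive and 5331 negative processed sentences/snippets. The cell below instead loads a medical conversation dataset from an Excel file, reading the text from the Sentence column and the class from the Label column.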

In [None]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download("punkt")
import matplotlib.pyplot as plt
import torch

%matplotlib inline

# ==============================
# Load Medical Conversation Dataset
# ==============================

def load_text_from_excel(path, sentence_col='Sentence', label_col='Label'):
    """Load text and labels from Excel, lowercase text, and return lists."""
    df = pd.read_excel(path)
    texts = []
    for sentence in df[sentence_col]:
        texts.append(str(sentence).lower().strip())
    labels = df[label_col].tolist()
    return texts, labels

train_texts, train_labels = load_text_from_excel('INFO 617 Group Project Train Val (2).xlsx')



# Convert to numpy arrays (same as professor’s style)
texts = np.array(train_texts)
labels = np.array(train_labels)

# Quick check
texts[0]  # Should print a lowercased medical conversation sentence


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


np.str_('hello,')

1.3. Download fastText Word Vectors

The pretrained word vectors used in the original paper are word2vec (Mikolov et al., 2013), trained on 100 billion tokens of Google News. In this tutorial, we will use fastText pretrained word vectors (Mikolov et al., 2017), trained on 600 billion tokens of Common Crawl. fastText builds on word2vec, and its authors report that it outperforms earlier pretrained word vectors on a range of tasks.

The code below will download the fastText pretrained vectors. On Google Colab this takes a few minutes; the run recorded below took 1min 17s.



In [None]:
%%time
URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
FILE = "fastText"

if os.path.isdir(FILE):
    print("fastText exists.")
else:
    !wget -P $FILE $URL
    !unzip $FILE/crawl-300d-2M.vec.zip -d $FILE


--2025-04-26 17:44:51--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.35.7.82, 13.35.7.50, 13.35.7.128, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.35.7.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1523785255 (1.4G) [application/zip]
Saving to: ‘fastText/crawl-300d-2M.vec.zip’


2025-04-26 17:45:06 (100 MB/s) - ‘fastText/crawl-300d-2M.vec.zip’ saved [1523785255/1523785255]

Archive:  fastText/crawl-300d-2M.vec.zip
  inflating: fastText/crawl-300d-2M.vec  
CPU times: user 546 ms, sys: 71.4 ms, total: 617 ms
Wall time: 1min 17s


1.4. Set up GPU for Training
Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network it's best to utilize these features.

A GPU can be added by going to the menu and selecting:

Runtime -> Change runtime type -> Hardware accelerator: GPU

Then we need to run the following cell to specify the GPU as the device.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


No GPU available, using the CPU instead.


2. Data Preparation
To prepare our text data for training, first we need to tokenize our sentences and build a vocabulary dictionary word2idx, which will later be used to convert our tokens into indexes and build an embedding layer.

So, what is an embedding layer?

An embedding layer serves as a look-up table that takes word indexes in the vocabulary as input and outputs word vectors. Hence, the embedding layer has shape (N,d), where N is the size of the vocabulary and d is the embedding dimension. In order to fine-tune pretrained word vectors, we need to create an embedding layer in our nn.Module class. The input to the model will then be input_ids, the indexes of the tokens in the vocabulary.
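
The look-up behaviour can be sketched with plain NumPy indexing. The toy sizes and the names embedding_matrix and word_vectors are illustrative assumptions; nn.Embedding performs the same row selection on a trainable weight matrix.

```python
import numpy as np

N, d = 6, 4  # toy vocabulary size and embedding dimension

# Row k of the matrix is the vector for the token with index k
embedding_matrix = np.arange(N * d, dtype=np.float32).reshape(N, d)

# A batch of 2 sentences, each already encoded as 3 token indexes
input_ids = np.array([[2, 5, 0],
                      [1, 1, 3]])

# Indexing with an integer array selects one row per token index
word_vectors = embedding_matrix[input_ids]

print(word_vectors.shape)  # (batch, seq_len, d)
```

Each index in input_ids is replaced by a d-dimensional row, so a (batch, seq_len) array of indexes becomes a (batch, seq_len, d) array of vectors.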

2.1. Tokenize
The function tokenize will tokenize our sentences, build a vocabulary and find the maximum sentence length. The function encode will take in the outputs of tokenize, perform sentence padding and return input_ids as a numpy array.

In [None]:
from nltk.tokenize import word_tokenize
from collections import defaultdict

def tokenize(texts):
    """Tokenize texts, build vocabulary and find maximum sentence length.
    Args:
        texts (List[str]): List of text data
    Returns:
        tokenized_texts (List[List[str]]): List of list of tokens
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Maximum sentence length
    """
    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Build vocabulary from the corpus
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add tokenized sentence to list
        tokenized_texts.append(tokenized_sent)

        # Add new tokens to vocabulary
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update maximum sentence length
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):
    """Pad each sentence to the maximum sentence length and encode tokens to indexes."""
    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad a copy of the sentence so the caller's token lists are not modified
        padded_sent = tokenized_sent + ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens
        input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_sent]
        input_ids.append(input_id)

    return np.array(input_ids)

# ==============================
# Now Call These Functions for Your Data
# ==============================

# Tokenize sentences
tokenized_texts, word2idx, max_len = tokenize(texts)

# Encode sentences
input_ids = encode(tokenized_texts, word2idx, max_len)

# Quick Check
print(f"Vocabulary size: {len(word2idx)}")
print(f"Max sentence length: {max_len}")
print(f"Shape of input_ids: {input_ids.shape}")


Vocabulary size: 5066
Max sentence length: 130
Shape of input_ids: (4030, 130)
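
Padding to the longest sentence means every row has 130 positions, even when most sentences are much shorter. One possible alternative, sketched below under stated assumptions, is to cap the length at a value that covers most sentences and truncate the rest. The helpers choose_max_len and pad_or_truncate and the toy sentences are hypothetical and are not part of the notebook's pipeline.

```python
import math

def choose_max_len(tokenized_texts, coverage=0.95):
    """Return the smallest length that covers `coverage` of the sentences."""
    lengths = sorted(len(sent) for sent in tokenized_texts)
    k = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[k]

def pad_or_truncate(tokens, max_len, pad_token='<pad>'):
    """Clip a token list to max_len, then pad it on the right if it is shorter."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))

# Made-up sentences of length 2, 4 and 9
toy = [
    ['hello', ','],
    ['how', 'are', 'you', '?'],
    ['i', 'have', 'had', 'a', 'cough', 'for', 'two', 'weeks', '.'],
]

cap = choose_max_len(toy, coverage=0.66)
padded = [pad_or_truncate(sent, cap) for sent in toy]
for sent in padded:
    print(sent)
```

Whether truncation helps depends on the data: it shortens sequences for the LSTM but discards the tail of long sentences, so the coverage value would need to be checked against validation accuracy.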


2.2. Load Pretrained Vectors
We will load the pretrained vectors for each token in our vocabulary. For tokens with no pretrained vectors, we will initialize random word vectors with the same length and variance.

In [None]:
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import torch

def load_pretrained_vectors(word2idx, fname):
    """Load pretrained vectors and create embedding matrix.

    Args:
        word2idx (Dict): Vocabulary built from the corpus
        fname (str): Path to pretrained vector file

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d)
    """
    print("Loading pretrained vectors...")
    # Use a context manager so the file is closed once the vectors are read
    with open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # First line contains vocab size (n) and dimension (d)
        n, d = map(int, fin.readline().split())

        # Initialize random embeddings
        embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
        embeddings[word2idx['<pad>']] = np.zeros((d,))

        # Load pre-trained vectors
        count = 0
        for line in tqdm_notebook(fin):
            tokens = line.rstrip().split(' ')
            word = tokens[0]
            if word in word2idx:
                count += 1
                embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")
    return embeddings

# ==============================
# Now Call This Function
# ==============================

# Load pretrained embeddings
embeddings = load_pretrained_vectors(word2idx, "fastText/crawl-300d-2M.vec")

# Convert embeddings to Torch Tensor
embeddings = torch.tensor(embeddings)

# Quick Check
print(f"Shape of embeddings tensor: {embeddings.shape}")


Loading pretrained vectors...


There are 4896 / 5066 pretrained vectors found.
Shape of embeddings tensor: torch.Size([5066, 300])
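
The .vec file is plain text: a header line with the number of vectors and their dimension, then one line per word. The sketch below runs the same parsing logic on a tiny in-memory file so the behaviour is easy to inspect. The toy words, the values and the names toy_vec and word2idx_toy are made up for illustration.

```python
import io
import numpy as np

# A made-up 3-word, 4-dimensional file in the same text layout as the .vec file
toy_vec = io.StringIO(
    "3 4\n"
    "hello 0.1 0.2 0.3 0.4\n"
    "cough 0.5 0.6 0.7 0.8\n"
    "fever 0.9 1.0 1.1 1.2\n"
)
# 'zzz' has no pretrained vector, 'fever' is not in the vocabulary
word2idx_toy = {'<pad>': 0, '<unk>': 1, 'hello': 2, 'cough': 3, 'zzz': 4}

n, d = map(int, toy_vec.readline().split())

# Random start for every token, zeros for the padding token
rng = np.random.default_rng(0)
emb = rng.uniform(-0.25, 0.25, (len(word2idx_toy), d))
emb[word2idx_toy['<pad>']] = 0.0

found = 0
for line in toy_vec:
    parts = line.rstrip().split(' ')
    if parts[0] in word2idx_toy:
        emb[word2idx_toy[parts[0]]] = np.array(parts[1:], dtype=np.float32)
        found += 1

print(f"{found} / {len(word2idx_toy)} vectors found, matrix shape {emb.shape}")
```

Tokens without a pretrained vector, such as 'zzz' here, keep their random initial values and are learned during training when the embedding layer is not frozen.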


Now let's put the above steps together.

In [None]:
# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts, word2idx, max_len = tokenize(texts)
input_ids = encode(tokenized_texts, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx, "fastText/crawl-300d-2M.vec")
embeddings = torch.tensor(embeddings)

Tokenizing...

Loading pretrained vectors...


There are 4896 / 5066 pretrained vectors found.


2.3. Create PyTorch DataLoader
We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed. The batch_size used in the paper is 50.

In [None]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler, SequentialSampler)
from sklearn.model_selection import train_test_split
import torch

# ==============================
# 1. First, do Train/Validation Split
# ==============================

# Note: labels are still in text form like 'TREAT', 'EXPLAIN'
# So first encode labels into numbers

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Now split input_ids and labels
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, encoded_labels, test_size=0.1, random_state=42
)

print(f"Training size: {train_inputs.shape[0]}")
print(f"Validation size: {val_inputs.shape[0]}")

# ==============================
# 2. Define your professor's DataLoader function
# ==============================

def data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to DataLoader."""
    train_inputs, val_inputs, train_labels, val_labels = \
        tuple(torch.tensor(data) for data in [train_inputs, val_inputs, train_labels, val_labels])

    # Training DataLoader
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Validation DataLoader
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

# ==============================
# 3. Call the function to create DataLoaders
# ==============================

train_dataloader, val_dataloader = data_loader(
    train_inputs, val_inputs, train_labels, val_labels, batch_size=50
)

print("DataLoaders created successfully.")


Training size: 3627
Validation size: 403
DataLoaders created successfully.
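
As a rough picture of what the training DataLoader does with a random sampler, the sketch below shuffles the sample indexes once and yields slices of 50. The function iterate_batches and the zero-filled toy arrays are assumptions for illustration and do not reproduce PyTorch's implementation.

```python
import numpy as np

def iterate_batches(inputs, labels, batch_size=50, shuffle=True, seed=0):
    """Yield (inputs, labels) mini-batches, optionally in shuffled order."""
    order = np.arange(len(inputs))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], labels[idx]

# Placeholder arrays with the same shape as the training split above
toy_inputs = np.zeros((3627, 130), dtype=np.int64)
toy_labels = np.zeros(3627, dtype=np.int64)

batches = list(iterate_batches(toy_inputs, toy_labels))
print(len(batches), batches[-1][0].shape)
```

With 3627 training sentences and a batch size of 50, an epoch consists of 73 batches, and the last one holds only the remaining 27 sentences.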


We will use 90% of the dataset for training and 10% for validation.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import torch

# ==============================
# Encode the labels into integers
# ==============================

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# ==============================
# Train-Test Split
# ==============================

train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, encoded_labels, test_size=0.1, random_state=42
)

print(f"Training size: {train_inputs.shape[0]}")
print(f"Validation size: {val_inputs.shape[0]}")

# ==============================
# Load data to PyTorch DataLoader
# ==============================

train_dataloader, val_dataloader = data_loader(
    train_inputs, val_inputs, train_labels, val_labels, batch_size=50
)

print("DataLoaders created successfully.")


Training size: 3627
Validation size: 403
DataLoaders created successfully.
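
The label encoding step maps each distinct label string to an integer, with classes ordered alphabetically. A minimal NumPy sketch of the same mapping is shown below. The label 'ASK' is a made-up class added for illustration, and toy_labels is a hypothetical array.

```python
import numpy as np

toy_labels = np.array(['TREAT', 'EXPLAIN', 'TREAT', 'ASK', 'EXPLAIN'])

# Sorted unique classes, plus the position of each label within them
classes, encoded = np.unique(toy_labels, return_inverse=True)

# Mapping integers back to strings, e.g. to read model predictions
decoded = classes[encoded]

print(classes.tolist())
print(encoded.tolist())
```

In the notebook itself, label_encoder.classes_ holds the class names and label_encoder.inverse_transform turns predicted integers back into label strings.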


In [None]:
import torch
print(torch.__version__)


2.6.0+cu124


3. Model

3.1. Create LSTM Model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTM_NLP(nn.Module):
    def __init__(self,
                 hidden_dim=None,
                 output_dim=None,
                 n_layers=None,
                 bidirectional=False,
                 lstm_dropout=None,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 fc_dropout=0.5):
        super(LSTM_NLP, self).__init__()

        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding, freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)

        # Use self.embed_dim so the LSTM input size matches the pretrained vectors
        self.lstm = nn.LSTM(self.embed_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=bidirectional, dropout=lstm_dropout,
                            batch_first=True)

        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(fc_dropout)

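    def forward(self, input_ids):
        # Look up word vectors and cast to float32: (batch, seq_len, embed_dim)
        x_embed = self.embedding(input_ids).float()
        output, (hidden, cell) = self.lstm(x_embed)
        # Use the final hidden state of the last layer; for a BiLSTM,
        # concatenate the forward (hidden[-2]) and backward (hidden[-1]) states
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]
        # Apply dropout once before the fully connected layer
        return self.fc(self.dropout(hidden))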
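
The indexing hidden[-2] and hidden[-1] in forward is easier to follow with the tensor shapes written out. The sketch below uses small toy sizes (an assumption, chosen only so it runs quickly) to show which slices of the LSTM output the two final hidden states correspond to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, embed_dim, hidden_dim, n_layers = 4, 7, 8, 5, 2

lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)
x = torch.randn(batch, seq_len, embed_dim)

output, (hidden, cell) = lstm(x)
print(output.shape)  # (batch, seq_len, 2 * hidden_dim)
print(hidden.shape)  # (n_layers * 2, batch, hidden_dim)

# The last two entries of hidden belong to the top layer:
# hidden[-2] is its forward direction, hidden[-1] its backward direction
sentence_vec = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(sentence_vec.shape)  # (batch, 2 * hidden_dim)

# The forward state is the output at the last time step,
# the backward state is the output at the first time step
print(torch.allclose(hidden[-2], output[:, -1, :hidden_dim]))
print(torch.allclose(hidden[-1], output[:, 0, hidden_dim:]))
```

Because the inputs are padded on the right, the forward state is read after the padding positions have been processed. Packing the sequences would avoid that, but it is outside what this notebook does.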


3.2. Optimizer
To train deep learning models, we need to define a loss function and minimize it. We'll use back-propagation to compute gradients and an optimization algorithm (e.g., gradient descent) to minimize the loss. The original paper used the Adadelta optimizer.

In [None]:
import torch.optim as optim

def initialize_lstm(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    hidden_dim=256,
                    output_dim=2,
                    n_layers=2,
                    bidirectional=False,
                    lstm_dropout=0.5,
                    fc_dropout=0.5,
                    learning_rate=0.01):
    """Instantiate an LSTM model and an optimizer."""

    # Instantiate LSTM model
    lstm_model = LSTM_NLP(pretrained_embedding=pretrained_embedding,
                          freeze_embedding=freeze_embedding,
                          vocab_size=vocab_size,
                          embed_dim=embed_dim,
                          hidden_dim=hidden_dim,
                          output_dim=output_dim,
                          n_layers=n_layers,
                          bidirectional=bidirectional,
                          lstm_dropout=lstm_dropout,
                          fc_dropout=fc_dropout)

    # Send model to device (GPU/CPU)
    lstm_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(lstm_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return lstm_model, optimizer
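
For intuition about the optimizer, the Adadelta update can be written out for a single scalar parameter. The sketch below follows the standard rule with a learning-rate scale, as used by optim.Adadelta. The function adadelta_minimize and the toy objective f(x) = x^2 are illustrative assumptions.

```python
import math

def adadelta_minimize(grad_fn, x0, lr=1.0, rho=0.95, eps=1e-6, steps=200):
    """Run Adadelta on one scalar parameter and return its final value."""
    x = x0
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size is the ratio of the two running RMS values
        delta = math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x -= lr * delta
    return x

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 3
x_final = adadelta_minimize(lambda x: 2 * x, x0=3.0)
print(x_final)
```

Since the step size is derived from running averages rather than from a fixed rate, lr acts only as a multiplier on the update, which is why a value of 1.0 is a common choice with this optimizer.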


3.3. Training Loop
For each epoch, the code below will perform a forward step to compute the Cross Entropy loss, a backward step to compute gradients and use the optimizer to update weights/parameters. At the end of each epoch, the loss on training data and the accuracy over the validation data will be printed to help us keep track of the model's performance. The code is heavily annotated with detailed explanations.

In [None]:
import random
import time
import torch.nn as nn
import numpy as np

# Define Loss Function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train_lstm(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the LSTM model."""
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        t0_epoch = time.time()
        total_loss = 0
        model.train()

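        for step, batch in enumerate(train_dataloader):
            # Load batch to device
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)
            # Clear gradients left over from the previous step
            model.zero_grad()
            # Forward pass: compute logits and the cross-entropy loss
            logits = model(b_input_ids)
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()
            # Backward pass: compute gradients, then update the parameters
            loss.backward()
            optimizer.step()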

        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")

    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance on our validation set."""
    # Put the model into evaluation mode
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to device
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Average validation loss and accuracy
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
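
The cross-entropy loss and the argmax prediction used in the functions above can be reproduced by hand on a small example. The NumPy sketch below assumes raw logits as input, which is also what nn.CrossEntropyLoss expects. The function name cross_entropy and the toy logits are made up for illustration.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class, computed from raw logits."""
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Two samples, three classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

loss = cross_entropy(logits, labels)
preds = logits.argmax(axis=1)
accuracy = (preds == labels).mean() * 100

print(loss, preds.tolist(), accuracy)
```

A useful reference point: a model that assigns equal probability to every class has a loss of ln(K) for K classes, which is about 2.71 for 15 classes.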


In [None]:
# The following LSTM model USES pre-trained word embedding (FastText)
# The embedding layer is initialized from pretrained vectors and is fine-tuned during training
# The model is bi-directional and has 2 LSTM layers.

lstm_model, optimizer = initialize_lstm(
    pretrained_embedding=embeddings,       # ✅ Use FastText embeddings
    freeze_embedding=False,                # ✅ Allow fine-tuning
    vocab_size=len(word2idx),               # ✅ Vocab size from your tokenized texts
    embed_dim=300,                          # ✅ FastText embeddings are 300-dimensional
    hidden_dim=128,                         # ✅ Hidden size of LSTM
    output_dim=len(label_encoder.classes_), # ✅ Number of classes = 15 for your dataset
    n_layers=2,                             # ✅ Two LSTM layers
    bidirectional=True,                     # ✅ Make it BiLSTM for stronger performance
    learning_rate=1.0,                      # ✅ Adadelta uses higher LR
    lstm_dropout=0.3,                       # ✅ Dropout inside LSTM layers
    fc_dropout=0.5                          # ✅ Dropout before final FC layer
)


In [None]:
train_lstm(lstm_model, optimizer, train_dataloader, val_dataloader, epochs=10)


Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.670233   |  1.184896  |   70.15   |   89.48  
   2    |   0.624505   |  1.132808  |   71.04   |   83.12  
   3    |   0.554846   |  1.244203  |   70.15   |   85.11  
   4    |   0.502999   |  1.328269  |   67.04   |   93.82  
   5    |   0.481722   |  1.287065  |   68.81   |   85.15  
   6    |   0.409068   |  1.365271  |   64.89   |   88.40  
   7    |   0.376296   |  1.399057  |   67.56   |   90.02  
   8    |   0.332961   |  1.400945  |   71.04   |   88.10  
   9    |   0.319219   |  1.433249  |   66.67   |   91.89  
  10    |   0.249532   |  1.672524  |   70.37   |   94.77  


Training complete! Best accuracy: 71.04%.
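
In the log above, the training loss keeps falling while the validation loss rises after epoch 2, which is a typical sign of overfitting. One common response is early stopping. The sketch below replays the recorded validation losses through a simple patience rule; the function early_stopping_epoch and the patience value of 3 are assumptions, and the notebook's training loop does not implement this.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) under a simple patience rule.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs. Epochs are numbered from 1.
    """
    best_loss = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses copied from the training log above
val_losses = [1.184896, 1.132808, 1.244203, 1.328269, 1.287065,
              1.365271, 1.399057, 1.400945, 1.433249, 1.672524]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

Under this rule the run would have stopped after epoch 5 and kept the weights from epoch 2. In practice that also requires saving the model state whenever the validation loss improves.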
