# LSTM

### NOTE: TORCHTEXT IS NOW DEPRECATED
---

### Introduction to LSTM using Text Sentiment Analysis as an Example

Sentiment analysis of text, also called opinion mining or sentiment classification, is widely used in everyday applications.

For instance, when shopping on an e-commerce website, we often read the product reviews to check for negative feedback. These review texts express different emotions and attitudes, such as happiness, anger, sadness, praise, and criticism.

In such a scenario, the computer automatically categorizes these reviews as positive, neutral, or negative. The technology behind this categorization is sentiment analysis.

Moreover, upon further observation, you'll notice labels such as "sound volume is appropriate," "fast connection speed," or "excellent customer service." These labels are also extracted automatically by the computer, identifying topics or opinions based on the text.

The rapid growth of sentiment analysis has been greatly aided by the rise of social media. Since the early 2000s, sentiment analysis has become one of the most active areas of research in natural language processing (NLP). It is widely applied in personalized recommendations, business decision-making, and public opinion monitoring.


### Data Preparation

We have a set of movie review data (IMDB dataset) categorized into two types: positive reviews and negative reviews. Our goal is to train a sentiment analysis model that can classify the review texts.

In essence, this is a text classification problem, specifically a binary classification task, focused on movie review texts. Let's look at the training data.

IMDB (Internet Movie Database) is a dataset with 50,000 highly polarized movie reviews. It is divided into training and testing sets, each containing 25,000 reviews. Both sets include 50% positive and 50% negative reviews.


### Using Torchtext to Load the Dataset

First, install the required package:

```bash
pip install torchtext
```

Torchtext provides the IMDB dataset along with functionalities like loading corpora, converting words to vectors, mapping words to indices, and creating iterators—all essential for text processing.

In [None]:
# Loading the IMDB dataset
import torchtext
train_iter = torchtext.datasets.IMDB(root='./data', split='train')
# Each item is a (label, text) tuple: the sentiment label followed by the review text.
# "neg" indicates negative, and "pos" indicates positive
# (newer torchtext releases may yield the integers 1 and 2 instead).
print(next(iter(train_iter)))

### Data Processing Pipelines

After loading the dataset, we need to convert the text and labels into numeric form that the model can consume. Typically, this involves tokenizing the text and mapping each token to an integer ID.

Torchtext provides basic text processing tools, including the tokenizer and vocabulary functions. The `get_tokenizer` function creates a tokenizer, while the `build_vocab_from_iterator` function builds the vocabulary using the training data iterator.

In [None]:
# Create tokenizer
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# ['here', 'is', 'an', 'example', '!']
print(tokenizer('here is an example!'))

# Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = torchtext.vocab.build_vocab_from_iterator(yield_tokens(train_iter), specials=["<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])

# [131, 9, 40, 464, 0, 0]
print(vocab(tokenizer('here is an example <pad> <pad>')))

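During vocabulary creation, the `yield_tokens` function tokenizes every review in the training dataset. The `specials` argument lets us register custom tokens such as `<pad>` for padding and `<unk>` for unknown words; they are placed at the start of the vocabulary, so `<pad>` gets index 0 and `<unk>` gets index 1. Calling `set_default_index` makes every out-of-vocabulary word map to `<unk>`.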
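To make the mapping concrete, here is a minimal standard-library sketch of what building a vocabulary with special tokens amounts to. It is an illustration under stated assumptions, not Torchtext's implementation: `simple_tokenize`, `build_vocab`, and `lookup` are hypothetical helpers, the regular expression only approximates the `basic_english` tokenizer, and the ordering (most frequent first, ties alphabetical) is a simplification.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in for the 'basic_english' tokenizer:
    # lowercase, keep <special> tokens whole, split punctuation off words.
    return re.findall(r"<\w+>|\w+|[^\w\s]", text.lower())

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    # Count every token, then give the special tokens the first indices.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for special in specials:
        counts.pop(special, None)
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, unk="<unk>"):
    # Unknown words fall back to the <unk> index, like set_default_index.
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

corpus = ["A great movie!", "A boring movie."]
stoi = build_vocab(simple_tokenize(text) for text in corpus)
ids = lookup(stoi, simple_tokenize("a great film <pad>"))
print(ids)  # [2, 7, 1, 0]
```

The word "film" never appears in the toy corpus, so it maps to index 1 (`<unk>`), while `<pad>` maps to index 0.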

Next, we create data processing pipelines. The `text_pipeline` converts a given text into token IDs, while the `label_pipeline` maps sentiment labels to numerical values: "neg" to 0 and "pos" to 1.

In [None]:
# Data processing pipelines
text_pipeline = lambda x: vocab(tokenizer(x))
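# Older torchtext releases yield 'neg'/'pos'; newer ones are assumed to yield 1 (negative) / 2 (positive)
label_pipeline = lambda x: 1 if x in ('pos', 2) else 0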

# [131, 9, 40, 464] (no padding yet; padding is applied later in collate_batch)
print(text_pipeline('here is an example'))

# Output: 0
print(label_pipeline('neg'))

### Generating Training Data

Once we have the data pipelines, we can generate the training data using the `DataLoader`.

Because the reviews vary in length, we need to pad or truncate each sequence to a fixed length. For instance, if the maximum length is 256 tokens, longer reviews are truncated and shorter ones are padded with the `<pad>` index.

All these operations can be handled by the `collate_batch` function.
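The padding and truncation step can be sketched on its own in plain Python. `pad_or_truncate` is a hypothetical helper, not part of the notebook, and it assumes the `<pad>` token has index 0, as in the vocabulary built earlier. It also returns the unpadded length, which the model needs later to ignore padding.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    # Keep at most max_length tokens, then fill the remainder with pad_id.
    kept = ids[:max_length]
    length = len(kept)  # true (unpadded) length after truncation
    return kept + [pad_id] * (max_length - length), length

short, short_len = pad_or_truncate([131, 9, 40, 464], max_length=6)
long_, long_len = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short, short_len)  # [131, 9, 40, 464, 0, 0] 4
print(long_, long_len)   # [1, 2, 3, 4, 5, 6] 6
```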

In [None]:
# Generate training data
import torch
import torchtext
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    max_length = 256
    pad = text_pipeline('<pad>')
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = text_pipeline(_text)[:max_length]
        length_list.append(len(processed_text))
        text_list.append((processed_text + pad * max_length)[:max_length])
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.tensor(text_list, dtype=torch.int64)
    length_list = torch.tensor(length_list, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), length_list.to(device)

train_iter = torchtext.datasets.IMDB(root='./data', split='train')
train_dataset = to_map_style_dataset(train_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])
train_dataloader = DataLoader(split_train_, batch_size=8, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=8, shuffle=False, collate_fn=collate_batch)

The workflow includes:

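1. Loading the IMDB training dataset using Torchtext.
2. Converting the iterator to a map-style `Dataset` with `to_map_style_dataset`, so that it supports `len()` and indexing.
3. Splitting the dataset randomly into training (95%) and validation (5%) sets.
4. Creating a `DataLoader` for each split, with shuffling enabled only for training.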
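As a rough check on the sizes this workflow produces, the sketch below reproduces the split arithmetic with the standard library only. `split_indices` is a hypothetical stand-in for `random_split`, and the seed is an arbitrary choice made for reproducibility; the 25,000 reviews and the batch size of 8 come from the text above.

```python
import math
import random

def split_indices(n_items, train_fraction, seed=0):
    # Shuffle all indices once, then cut them into two disjoint parts.
    n_train = int(n_items * train_fraction)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(25_000, 0.95)
batch_size = 8
# The last batch may be smaller, so round the batch count up.
n_train_batches = math.ceil(len(train_idx) / batch_size)
n_valid_batches = math.ceil(len(valid_idx) / batch_size)
print(len(train_idx), len(valid_idx))    # 23750 1250
print(n_train_batches, n_valid_batches)  # 2969 157
```

With these settings, one epoch therefore runs 2,969 training steps and 157 validation steps.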

### Model Construction

Plain RNNs have inherent limitations, most notably gradients that vanish or explode during backpropagation through time. The LSTM network improves on the RNN by using gates to control how short-term and long-term memory are combined, which mitigates these issues.

#### Why predict with the hidden state?
The hidden state is used for predictions in LSTMs because it provides the most up-to-date representation of the input sequence, combining both long-term and short-term information. Unlike the cell state, which stores raw long-term memory, the hidden state is refined by the output gate to represent relevant features for prediction, making it a comprehensive and output-ready summary of the sequence.
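The relationship between the two states is easier to see in a single time step. The following is a minimal scalar sketch of the standard LSTM update, not the code `torch.nn.LSTM` runs internally; the function `lstm_step`, the weight dictionary `w`, and its values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Gates squash to (0, 1) and decide how much information passes.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g   # cell state: long-term memory
    h = o * math.tanh(c)     # hidden state: cell state filtered by the output gate
    return h, c

# Hypothetical weights: 0.5 for every input/recurrent weight, 0.0 for every bias.
w = {name: 0.5 for name in ("wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug")}
w.update({name: 0.0 for name in ("bi", "bf", "bo", "bg")})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:
    h, c = lstm_step(x, h, c, w)
print(h, c)
```

Because the output gate lies between 0 and 1, the hidden state is always a scaled-down view of `tanh(c)`, which is why it serves as the output-ready summary passed to the classifier.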

In [None]:
# Define model
class LSTM(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout_rate, pad_index=0):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=bidirectional, dropout=dropout_rate, batch_first=True)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.fc = torch.nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)

    def forward(self, ids, length):
        # Apply embedding layer to input ids to get word embeddings, followed by dropout for regularization
        embedded = self.dropout(self.embedding(ids))
        
        # Pack the padded sequence so the LSTM skips padding tokens, using the actual sequence lengths.
        # pack_padded_sequence expects the lengths on the CPU, so move them there even when the batch is on the GPU.
        packed_embedded = torch.nn.utils.rnn.pack_padded_sequence(embedded, length.cpu(), batch_first=True, enforce_sorted=False)
        
        # Pass the packed sequence through the LSTM layer
        # packed_output contains the output for each time step, while hidden and cell contain the final states
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        
        # Unpack the packed sequence back to padded, batch-first form; not used below, kept for further processing if needed
        output, output_length = torch.nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        
        # Extract the final hidden state of the last layer
        # If the LSTM is bidirectional, hidden[-2] is the forward direction and hidden[-1] the backward direction; concatenate both
        if self.lstm.bidirectional:
            hidden = self.dropout(torch.cat([hidden[-1], hidden[-2]], dim=-1))
        else:
            hidden = self.dropout(hidden[-1])
        
        # Pass the hidden state through the fully connected layer to produce the final output prediction
        prediction = self.fc(hidden)
        
        # Return the final prediction
        return prediction


### Model Training and Evaluation

Finally, we train and evaluate the LSTM model on the dataset.

In [None]:
import tqdm
import sys
import numpy as np

vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 300
output_dim = 2
n_layers = 2
bidirectional = True
dropout_rate = 0.5
lr = 5e-4

model = LSTM(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout_rate)
model = model.to(device)

criterion = torch.nn.CrossEntropyLoss()
criterion = criterion.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def train(dataloader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(dataloader, desc='training...', file=sys.stdout):
        (label, ids, length) = batch
        label = label.to(device)
        ids = ids.to(device)
        length = length.to(device)
        prediction = model(ids, length)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return epoch_losses, epoch_accs

def evaluate(dataloader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader, desc='evaluating...', file=sys.stdout):
            (label, ids, length) = batch
            label = label.to(device)
            ids = ids.to(device)
            length = length.to(device)
            prediction = model(ids, length)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return epoch_losses, epoch_accs

def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy

n_epochs = 10
best_valid_loss = float('inf')

train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

for epoch in range(n_epochs):
    train_loss, train_acc = train(train_dataloader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_dataloader, model, criterion, device)
    train_losses.extend(train_loss)
    train_accs.extend(train_acc)
    valid_losses.extend(valid_loss)
    valid_accs.extend(valid_acc)
    epoch_train_loss = np.mean(train_loss)
    epoch_train_acc = np.mean(train_acc)
    epoch_valid_loss = np.mean(valid_loss)
    epoch_valid_acc = np.mean(valid_acc)
    if epoch_valid_loss < best_valid_loss:
        best_valid_loss = epoch_valid_loss
        torch.save(model.state_dict(), 'lstm.pt')
    print(f'epoch: {epoch + 1}')
    print(f'train_loss: {epoch_train_loss:.3f}, train_acc: {epoch_train_acc:.3f}')
    print(f'valid_loss: {epoch_valid_loss:.3f}, valid_acc: {epoch_valid_acc:.3f}')
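To clarify what the loss and accuracy numbers printed above measure, here is a plain-Python sketch of both quantities for a batch of two-class logits. It mirrors the math behind `CrossEntropyLoss` and `get_accuracy` but is not their implementation; the helper names and the example logits are made up for illustration.

```python
import math

def cross_entropy(logits, label):
    # Negative log-softmax of the true class, computed stably via log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def batch_accuracy(batch_logits, labels):
    # The predicted class is the index of the largest logit (argmax).
    predicted = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

# Hypothetical logits for four reviews; column 0 = negative, column 1 = positive.
batch_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.0], [-0.2, 0.4]]
labels = [0, 1, 1, 1]

losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
mean_loss = sum(losses) / len(losses)
acc = batch_accuracy(batch_logits, labels)
print(acc)  # 0.75
```

The third review is misclassified (its largest logit is for class 0), so the accuracy is 3 out of 4, and that example also contributes the largest term to the mean loss.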