# Deep Learning with PyTorch

In this workshop, we will try to build some feedforward models to do sentiment analysis, using pytorch, a deep learning library: https://pytorch.org/


## Setup GPU environment

To use a free GPU in Google Colab, go to "Runtime" > "Change runtime type" and select "GPU" as the hardware accelerator.

Here's a more detailed breakdown:

- Accessing Colab: Open Google Colab in your browser and sign in with your Google account.

- Creating a Notebook: Create a new notebook by clicking on "New Notebook".
Enabling GPU:

- Go to the "Runtime" menu.

- Select "Change runtime type".

- In the pop-up window, choose "GPU" as the hardware accelerator. Click "Save".

Now you will need pandas, torch to run this code (pip install pandas torch).

In [None]:
!pip install pandas scikit-learn torch

Now you can run the below code to verify that GPU has been enabled:

In [15]:
# imports are always needed
import torch


# get index of currently selected device
print(f"current device: {torch.cuda.current_device()}") # returns 0 in my case


# get number of GPUs available
print(f"number of GPU available: {torch.cuda.device_count()}") # returns 1 in my case


# get the name of the device
print(f"name of the device: {torch.cuda.get_device_name(0)}") # good old Tesla K80

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()


#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')


current device: 0
number of GPU available: 1
name of the device: Tesla T4
Using device: cuda

Tesla T4
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


  print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')


## Loading dataset

First let's prepare the data. We are using 1000 yelp reviews, nnotated with either positive or negative sentiments. You can download the dataset file from canvas, and upload it to the current working directory in the Google Colab environment.

In [5]:
import pandas as pd

corpus = "07-yelp-dataset.txt"
df = pd.read_csv(corpus, names=['sentence', 'label'], sep='\t')
print("Number of sentences =", len(df))
print("\nData:")
print(df.iloc[:3])

Number of sentences = 1000

Data:
                                    sentence  label
0                   Wow... Loved this place.      1
1                         Crust is not good.      0
2  Not tasty and the texture was just nasty.      0


Next, let's create the train/dev/test partitions

In [6]:
import random
import numpy as np

sentences = df['sentence'].values
labels = df['label'].values

#partition data into 80/10/10 for train/dev/test
sentences_train, y_train = sentences[:800], labels[:800]
sentences_dev, y_dev = sentences[800:900], labels[800:900]
sentences_test, y_test = sentences[900:1000], labels[900:1000]

#convert label list into arrays
y_train = np.array(y_train)
y_dev = np.array(y_dev)
y_test = np.array(y_test)

print(y_train[0], sentences_train[0])
print(y_dev[0], sentences_dev[0])
print(y_test[0], sentences_test[0])

1 Wow... Loved this place.
0 I'm super pissd.
0 Spend your money elsewhere.


## Building vocabulary set and vectorizer

In this workshop, we will employ the `tokenizer` function provided by PyTorch to process our data. After tokenization, the next step involves using `build_vocab_from_iterator` to construct a frequency dictionary for our vocabulary. Moreover, we incorporate two special tokens, `<unk>` and `<pad>`, into our vocabulary set. The `<unk>` token is designated for managing tokens that have not been seen during training, ensuring the model can handle new or rare words. The `<pad>` token, on the other hand, will be utilized later to pad sequences of varying lengths, allowing us to standardize them to a uniform length for model processing.

In [47]:
import torch
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK tokenizer data (run once if not already downloaded)
nltk.download('punkt_tab')

# Assuming sentences_train is your training data (list of strings)
train_iter = sentences_train

# Define a simple tokenizer using NLTK (or you can use another library like spacy)
def tokenizer(text):
    return word_tokenize(text.lower())  # Tokenize and convert to lowercase

# Function to yield tokens from the data
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

# Build vocabulary manually using Counter
def build_vocab_from_iterator(token_iterator, specials=('<unk>', '<pad>')):
    counter = Counter()
    for tokens in token_iterator:
        counter.update(tokens)

    # Create vocab dictionary with special tokens
    vocab = {token: idx + len(specials) for idx, (token, _) in enumerate(counter.items())}
    for idx, special in enumerate(specials):
        vocab[special] = idx

    # Reverse mapping (stoi: string to index)
    stoi = vocab
    # Index to string mapping
    itos = {idx: token for token, idx in stoi.items()}

    return stoi, itos, stoi['<unk>']

# Build the vocabulary
vocab, index_to_vocab, default_index = build_vocab_from_iterator(yield_tokens(train_iter), specials=('<unk>', '<pad>'))
padding_index = vocab['<pad>']

# Use CountVectorizer with the custom tokenizer and vocabulary
vectorizer = CountVectorizer(
    tokenizer=tokenizer,
    vocabulary=vocab,  # Pass the string-to-index mapping
    lowercase=True
)

# Example usage (assuming sentences_train is a list of strings)
# vectorized_data = vectorizer.fit_transform(sentences_train)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [48]:
x_train = vectorizer.transform(sentences_train).toarray() #BOW representation
x_dev = vectorizer.transform(sentences_dev).toarray() #BOW representation
x_test = vectorizer.transform(sentences_test).toarray() #BOW representation

In [49]:
x_train.shape

(800, 1812)

Now every sentence has been transformed into a vector with frequency count of 1814 vocabulary

In [50]:
vocab_size = x_train.shape[1]
print("Vocab size =", vocab_size)
print(tokenizer(sentences_train[0]))
print(x_train[0])

Vocab size = 1812
['wow', '...', 'loved', 'this', 'place', '.']
[0 0 1 ... 0 0 0]


## Baseline with sklearn logistic regression

Before we build a neural network model, let's see how well logistic regression do with this dataset.

In [51]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
score = classifier.score(x_test, y_test)

print("Accuracy:", score)

Accuracy: 0.69


The logistic regression result is not too bad, and it will serve as a baseline for the deep learning models.



## Short Introduction of Pytorch

PyTorch is an open-source machine learning, especially deep learning, library widely acclaimed for its flexibility, speed, and ease of use. To learn pytorch properly please refer to their official tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html

### Tensor

Tensors are a specialized data structure that are very similar to numpy arrays and matrices. But more than numpy array, tensors cache and trace  the mathematic operation that the been carried; therefore capable of automatically calculate the gradient for back propagation during training. Beyond mere numerical storage, tensors uniquely track and record the mathematical operations performed on them. This intrinsic capability allows for the automatic computation of gradients, a crucial component for the backpropagation process during neural network training. When operating with your tensors, there are three important properties to look after, which are "shape (The dimenions of the matrices)", "dtype (the data type of the values)", "device (The harware (cpu, gpu) the values are stored)". detail please see: https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html

In [52]:
import torch
import numpy as np

tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")


Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


In [53]:
# Get cpu, gpu or mps device for training.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(device)

cuda


Relocate your tensor to GPU memory using .cuda()

In [54]:
# Check if CUDA is available
if torch.cuda.is_available():
    # Move tensor to GPU
    tensor_gpu = tensor.cuda()

    print(f"Tensor is on: {tensor_gpu.device}")
else:
    print("CUDA is not available.")

Tensor is on: cuda:0


If you have multiple GPU and would like to assign to a specific GPU(and even CPU, and mps).

In [55]:
if torch.cuda.is_available():
    # Move tensor to GPU (default CUDA device)
    tensor_gpu = tensor.to('cuda:0')

    # Optionally specify a specific GPU (e.g., GPU 0)
    # tensor_gpu = tensor.to('cuda:0')

    print(f"Tensor is on: {tensor_gpu.device}")
else:
    print("CUDA is not available.")

Tensor is on: cuda:0


### DataLoader

For efficient training of deep learning models, it is common practice to train in batches rather than processing instances individually. Moreover, it is essential to convert the data into PyTorch tensors prior to training. Given that unstructured data, including text and images, often necessitates pre-processing, establishing a data pipeline for pre-processing becomes imperative for effective model training. This pipeline not only streamlines the preparation of data but also ensures compatibility with PyTorch's computational framework. For a detailed guide, please refer to the official tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

In [56]:
from torch.utils.data import DataLoader

batch_size = 10


def bow_collate_batch(batch):
    label_list, text_list = [], []
    for  _text, _label in batch:
        label_list.append(_label)
        text_list.append(_text)
    # For each batched data, we convert the values into tensor.
    label_list = torch.tensor(label_list, dtype=torch.float32)
    text_list = torch.tensor(text_list, dtype=torch.float32)

    # We also place each tensor to an assigned device
    return text_list.to(device), label_list.reshape(-1, 1).to(device)

# Create data loaders.
train_dataloader = DataLoader(list(zip(x_train, y_train)), batch_size=batch_size, collate_fn=bow_collate_batch)
dev_dataloader = DataLoader(list(zip(x_dev, y_dev)), batch_size=batch_size, collate_fn=bow_collate_batch)
test_dataloader = DataLoader(list(zip(x_test, y_test)), batch_size=batch_size, collate_fn=bow_collate_batch)


### Model

PyTorch neural network models consist of sequential layers, each containing parameters that the network learns from during training. To create a custom model in PyTorch, your class should inherit from nn.Module, which is the base class for all neural network modules. Within the constructor method \_\_init\_\_, you can define the layers of your model. The forward propagation of the network, where the actual computation is performed, is defined in a method named forward within your class. This method specifies how data passes through the model. Below is an example of how a simple network for a Bag of Words model might be structured:

In [57]:
from torch import nn

class BowNetwork(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.first_layer = nn.Linear(vocab_size, hidden_dim)
        # Each layer has been randomly initaised when first created.
        # You can further initialise the weight with different algorithms like below.
        # torch.nn.init.uniform_(self.first_layer.weight)
        self.second_layer = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x.shape = [batch_size, vocab_size]

        x = torch.relu(self.first_layer(x))
        # x.shape = [batch_size, hidden_size]

        logits = torch.sigmoid(self.second_layer(x))
        # logits.shape = [batch_size, 1]
        return logits

In [58]:
bow_model = BowNetwork(vocab_size, 10).to(device)
print(bow_model)

BowNetwork(
  (first_layer): Linear(in_features=1812, out_features=10, bias=True)
  (second_layer): Linear(in_features=10, out_features=1, bias=True)
)


### Train

Training a neural network is an iterative process that involves updating the model's weights to optimize performance on a given task. This process unfolds over multiple iterations, known as epochs, during which the model undergoes several key steps, as outlined below:







In [59]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)

    # Calling .train() will evoke the tensor to start caching steps and gradients.
    model.train()
    for batch, (X, y) in enumerate(dataloader):

        # Compute prediction by directly calling the model variable.
        pred = model(X)

        # Calculate the loss by comparing the prediction and the true labels.
        loss = loss_fn(pred, y)

        # Backpropagation: calculate the gradient by walking back the cached steps.
        loss.backward()
        # Update the parameters of the model with the loss gradient.
        optimizer.step()
        # Remove all the gradient to be ready for the next training of the next batch.
        optimizer.zero_grad()

        if batch  == size - 1:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

It's essential to specify our optimizer—a component that dictates the learning strategy through parameters such as the learning rate—and our loss function, which measures the discrepancy between the model's predictions and the actual data. Finally, we must determine the number of epochs, or complete passes through the training dataset, to effectively guide the training process to convergence.

In [60]:
# We select Binacy cross entropy loss
loss_fn = nn.BCELoss()
# We select Adam optimizer and hook it up with our model parameters.
optimizer = torch.optim.Adam(bow_model.parameters(), lr=0.001)
# we set epochs to 30
epochs = 20

During each epoch iteration, we also would like to gauge the progress of the model's performance. We can use the dev set with a test functin.

In [61]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    # After calling eval(), the model no longer caching steps and gradient.
    # The model also does the inference faster with less resource.
    model.eval()
    test_loss, correct = 0, 0
    # This line specify that there will be no gradient in the operation below.
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            result = (pred>0.5).float()
            correct += (result == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Performance: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Now we can run our training:

In [62]:
print("Training BOW feedforward network model!")

for t in range(epochs):
    print(f"Epoch {t + 1}\n-------------------------------")
    train(train_dataloader, bow_model, loss_fn, optimizer)
    test(dev_dataloader, bow_model, loss_fn)
print("Done!")

Training BOW feedforward network model!
Epoch 1
-------------------------------
Test Performance: 
 Accuracy: 56.0%, Avg loss: 0.679033 

Epoch 2
-------------------------------
Test Performance: 
 Accuracy: 66.0%, Avg loss: 0.632574 

Epoch 3
-------------------------------
Test Performance: 
 Accuracy: 74.0%, Avg loss: 0.573498 

Epoch 4
-------------------------------
Test Performance: 
 Accuracy: 79.0%, Avg loss: 0.519920 

Epoch 5
-------------------------------
Test Performance: 
 Accuracy: 81.0%, Avg loss: 0.479142 

Epoch 6
-------------------------------
Test Performance: 
 Accuracy: 79.0%, Avg loss: 0.449264 

Epoch 7
-------------------------------
Test Performance: 
 Accuracy: 81.0%, Avg loss: 0.428555 

Epoch 8
-------------------------------
Test Performance: 
 Accuracy: 81.0%, Avg loss: 0.414380 

Epoch 9
-------------------------------
Test Performance: 
 Accuracy: 81.0%, Avg loss: 0.404627 

Epoch 10
-------------------------------
Test Performance: 
 Accuracy: 81.0%, 

Now test it with the test data

In [63]:
print("final test:" )
test(test_dataloader, bow_model, loss_fn)

final test:
Test Performance: 
 Accuracy: 76.0%, Avg loss: 0.558826 



How does the performance compare to logistic regression? If you run it a few times you may find that it gives slightly different numbers, and that is due to random initialisation of the model parameters.

## Embedding cosine similarity

Even though we did not explicitly define any word embeddings in the model architecture, they are in our model: in the weights between the input and the hidden layer. The hidden layer can therefore be interpreted as a sum of word embeddings for each input document.

Let's fetch the word embeddings of some words, and look at their cosine similarity, and see if they make any sense.

In [40]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def display_embedding_similarity_examples(embeddings, vocab):
    emb_love = embeddings[vocab["love"]]  # embeddings for 'love'
    emb_like = embeddings[vocab["like"]]
    emb_lukewarm = embeddings[vocab["lukewarm"]]
    emb_bad = embeddings[vocab["bad"]]

    print("show embedding similarity examples...")
    print("embedding vector of love:")
    print(emb_love)

    print("embedding cosine similarity comparisons:")
    print("love vs. like =", cos_sim(emb_love, emb_like))
    print("love vs. lukewarm =", cos_sim(emb_love, emb_lukewarm))
    print("love vs. bad =", cos_sim(emb_love, emb_bad))
    print("lukewarm vs. bad =", cos_sim(emb_lukewarm, emb_bad))

In [64]:
print("show embedding of bow model")
# extract word embeddings layer
embeddings = bow_model.first_layer.weight.T.to("cpu").detach().numpy()
display_embedding_similarity_examples(embeddings, vocab)

show embedding of bow model
show embedding similarity examples...
embedding vector of love:
[ 0.25114682  0.25714073  0.26676106 -0.26078755 -0.21964015  0.27364895
 -0.258442   -0.22225012 -0.22669387 -0.22875689]
embedding cosine similarity comparisons:
love vs. like = 0.98100877
love vs. lukewarm = -0.9891863
love vs. bad = -0.9904617
lukewarm vs. bad = 0.9966723


Not bad. You should find that for *love* and *like*, which are both positive sentiment words, produce high cosine similarity. Similar observations for *lukewarm* and *bad*. But when we compare opposite polarity words like *love* and *bad*, we get negative cosine similarity values.

## Sequence Model

Next, we are going to build another feed-forward model, but this time, instead of using BOW features as input, we want to use the word sequence as input (so order of words is preserved). It is usually not straightforward to do this for classical machine learning models, but with neural networks and embeddings, it's pretty straightforward.

Let's first build a pipeline by combining vocab and tokenizer together that can convert a sentence into a number sequence.

### Preparing sequence data

In [70]:
def sequence_pipeline(x):
  tokens = tokenizer(x)
  return [vocab.get(token, default_index) for token in tokens]

sequence_pipeline("Hello world, today is a good day.")

[1457, 894, 97, 410, 9, 71, 11, 665, 7]

Now lets build the pytorch dataloader pipeline that 1.) Convert every text sentence into number sequence, 2.) Padding all sentences to our predefined max sentence, so the input dimension will be consistent, 3.) convert all input to appropriate tensor and place them on device.

In [71]:
def seq_collate_batch(batch):
    label_list, text_list = [], []
    for  _text, _label in batch:
        label_list.append(_label)
        text_list.append(sequence_pipeline(_text))
    label_list = torch.tensor(label_list, dtype=torch.float32)

    # Pad or truncate each sequence
    padded_sequences = []
    for seq in text_list:
        # Truncate if longer than max_len
        padded_seq = seq[:max_len]
         # Pad if shorter
        padded_seq += [padding_index] * (max_len - len(padded_seq))
        padded_sequences.append(torch.tensor(padded_seq))
    text_list = torch.stack(padded_sequences)
    # Stack all sequences into a single tensor
    return text_list.to(device), label_list.reshape(-1, 1).to(device)

# set max length
max_len = 30

# Create data loaders.
xseq_train_dataloader = DataLoader(list(zip(sentences_train, y_train)), batch_size=10, collate_fn=seq_collate_batch)
xseq_dev_dataloader = DataLoader(list(zip(sentences_dev, y_dev)), batch_size=10, collate_fn=seq_collate_batch)
xseq_test_dataloader = DataLoader(list(zip(sentences_test, y_test)), batch_size=10, collate_fn=seq_collate_batch)

## Sequence Feed Forward Network

Now let's build our second model. This model first embeds each word in the input sequence into embeddings, and then concatenate the word embeddings together to represent input sequence. The ``Flatten`` function you see after the embedding layer is essentially doing the concatenation, by 'chaining' the list of word embeddings into a very long vector.

If our word embeddings has a dimension 10, and our documents always have 30 words (padded), then here the concatenated word embeddings have a dimension of 10 x 30 = 300.

The concatenated word embeddings undergo a linear transformation with non-linear activations (``layers.Dense(10, activation='relu')``), producing a hidden representation with a dimension of 10. It is then passed to the output layer.

In [72]:
class SequenceFFNetwork(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx, max_len):
        super().__init__()
        self.embedding  = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        # torch.nn.init.uniform_(self.embedding.weight)
        self.first_layer = nn.Linear(hidden_dim * max_len, hidden_dim)
        # torch.nn.init.uniform_(self.first_layer.weight)
        self.second_layer = nn.Linear(hidden_dim, 1)
        # torch.nn.init.uniform_(self.second_layer.weight)

    def forward(self, x):
        x = self.embedding(x)
        # Flattening all word vectors in to one long vector
        x = torch.flatten(x, start_dim=1)
        x = torch.relu(self.first_layer(x))
        logits = torch.sigmoid(self.second_layer(x))
        return logits

Now let's see how it performs.

In [73]:
embedding_dim = 10
hidden_dim = 10
padding_index = vocab["<pad>"]

In [74]:
print("Training seqeunce feedforward network model!")

seq_model = SequenceFFNetwork(vocab_size, embedding_dim, hidden_dim, padding_index, max_len=max_len).to(device)
print(seq_model)

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(seq_model.parameters(), lr=0.001)

for t in range(epochs):
    print(f"Epoch {t + 1}\n-------------------------------")
    train(xseq_train_dataloader, seq_model, loss_fn, optimizer)
    test(xseq_dev_dataloader, seq_model, loss_fn)
print("Done!")

print("final test:")
test(xseq_test_dataloader, seq_model, loss_fn)

Training seqeunce feedforward network model!
SequenceFFNetwork(
  (embedding): Embedding(1812, 10, padding_idx=1)
  (first_layer): Linear(in_features=300, out_features=10, bias=True)
  (second_layer): Linear(in_features=10, out_features=1, bias=True)
)
Epoch 1
-------------------------------
Test Performance: 
 Accuracy: 48.0%, Avg loss: 0.697589 

Epoch 2
-------------------------------
Test Performance: 
 Accuracy: 51.0%, Avg loss: 0.692550 

Epoch 3
-------------------------------
Test Performance: 
 Accuracy: 51.0%, Avg loss: 0.689816 

Epoch 4
-------------------------------
Test Performance: 
 Accuracy: 52.0%, Avg loss: 0.694150 

Epoch 5
-------------------------------
Test Performance: 
 Accuracy: 57.0%, Avg loss: 0.709097 

Epoch 6
-------------------------------
Test Performance: 
 Accuracy: 59.0%, Avg loss: 0.726845 

Epoch 7
-------------------------------
Test Performance: 
 Accuracy: 57.0%, Avg loss: 0.742752 

Epoch 8
-------------------------------
Test Performance: 
 A

You may find that the performance isn't as good as the BOW model. In general, concatenating word embeddings isn't a good way to represent word sequence.

A better way is to build a recurrent model. But first, let's extract the word embeddings for the 4 words as before and look at their similarity.

In [75]:
print("show sequence FF model embeddings")
# extract word embeddings layer
embeddings = seq_model.embedding.weight.to("cpu").detach().numpy()
display_embedding_similarity_examples(embeddings, vocab)

show sequence FF model embeddings
show embedding similarity examples...
embedding vector of love:
[ 0.03840087 -0.6711328  -1.1095555   0.32735482  0.34282312 -1.029826
  1.2651542  -0.19842899 -0.54404205 -1.359286  ]
embedding cosine similarity comparisons:
love vs. like = 0.2005474
love vs. lukewarm = 0.17755868
love vs. bad = -0.5497594
lukewarm vs. bad = 0.5892816


### LSTM Model

Now, let's try to build an LSTM model. After the embeddings layer, the LSTM layer will process the words one at a time, and compute the next state (dimension for the hidden state = 10 in this case). The output of the LSTM layer has three components "output (the output values of all steps)", "hidden(the hidden state at the end of the LSTM)", "cell(the cell state at the end of the LSTM)".

In [76]:
class SimpleLSTMNetwork(nn.Module):

    def __init__(self, vocab_size, embedding_dim, padding_idx):
        super().__init__()
        self.embedding  = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm_layer = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.forward_layer = nn.Linear(hidden_dim, 1)

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        h0 = torch.zeros((1, batch_size, hidden_dim)).to(device)
        c0 = torch.zeros((1, batch_size, hidden_dim)).to(device)
        hidden = (h0, c0)
        return hidden


    def forward(self, x):
        # x = [batch size, seq length]

        embedded = self.embedding(x)
        # embedded = [batch size, seq length, emb dim]

        h0 = self.init_hidden(x.shape[0])

        output, (hidden, cell) = self.lstm_layer(embedded, h0)
        # output = [batch size, seq length, hid dim * num directions]
        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]

        hidden = hidden[-1, :, :]
        # hidden = [batch size, hid dim]

        logits = torch.sigmoid(self.forward_layer(hidden))
        return logits

Let's see how it goes

In [77]:
print("Training LSTM network model!")

lstm_model = SimpleLSTMNetwork(vocab_size, embedding_dim, padding_index).to(device)
print(lstm_model)

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=0.001)

for t in range(epochs):
    print(f"Epoch {t + 1}\n-------------------------------")
    train(xseq_train_dataloader, lstm_model, loss_fn, optimizer)
    test(xseq_dev_dataloader, lstm_model, loss_fn)
print("Done!")

print("final test:")
test(xseq_test_dataloader, lstm_model, loss_fn)

Training LSTM network model!
SimpleLSTMNetwork(
  (embedding): Embedding(1812, 10, padding_idx=1)
  (lstm_layer): LSTM(10, 10, batch_first=True)
  (forward_layer): Linear(in_features=10, out_features=1, bias=True)
)
Epoch 1
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.706505 

Epoch 2
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.710847 

Epoch 3
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.713580 

Epoch 4
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.715103 

Epoch 5
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.715810 

Epoch 6
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.715926 

Epoch 7
-------------------------------
Test Performance: 
 Accuracy: 44.0%, Avg loss: 0.715123 

Epoch 8
-------------------------------
Test Performance: 
 Accuracy: 52.0%, Avg loss: 0.710390 



You should notice that the training is quite a bit slower, and that's because now the model has to process the sequence one word at a time. But the results should be better than the sequence FFmodel!

And lastly, let's extract the embeddings and look at the their similarity.

In [78]:
print("show LSTM model embeddings")
# extract word embeddings layer
embeddings = lstm_model.embedding.weight.to("cpu").detach().numpy()
display_embedding_similarity_examples(embeddings, vocab)

show LSTM model embeddings
show embedding similarity examples...
embedding vector of love:
[-0.8298477   0.38135883  0.6203263  -0.2778687   1.0426271  -2.799933
 -0.1474115  -0.5253565   0.6681899  -0.3256043 ]
embedding cosine similarity comparisons:
love vs. like = 0.12939441
love vs. lukewarm = 0.30936325
love vs. bad = 0.21451172
lukewarm vs. bad = -0.25039604


However, if you run the trainig a few times, you might notice that LSTM is not always better. In this particular case, the BOW approach seems to triumph over recurrent model. Why is this the case?