### Sentiment Analysis using LSTM

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

####  Prepare data

In [2]:
# Sentences (textual data) and their sentiment labels (1 for positive, 0 for negative)
sentences = ["i love this movie", "this film is amazing", "i didn't like it", "it was terrible"]
sentiment = [1, 1, 0, 0]

#### Create Vocabulary

In [4]:
# Simple vocabulary to represent words as indices
vocab = {"<PAD>": 0, "i": 1, "love": 2, "this": 3, "movie": 4, "film": 5, "is": 6, "amazing": 7, "didn't": 8, "like": 9, "it": 10, "was": 11, "terrible": 12}

We create a simple vocabulary to represent words as indices. This allows us to convert words in our sentences to numbers, which can be fed as input to our neural network.

#### Tokenize, encode and pad sentences

In [5]:
encoded_sentences = [[vocab[word] for word in sentence.split()] for sentence in sentences]
max_length = max([len(sentence) for sentence in encoded_sentences])
padded_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) for sentence in encoded_sentences]

We tokenize and encode the sentences using the vocabulary created earlier. We also pad the sentences with the `<PAD>` token to make them all the same length.

#### Convert data to tensors

In [5]:
inputs = torch.LongTensor(padded_sentences)
labels = torch.FloatTensor(sentiment)

We convert the input data and labels to PyTorch tensors. Inputs are converted to LongTensors, while labels are converted to FloatTensors.

#### Define LSTM Model

In [None]:
class SimpleLSTM(nn.Module):
    """
    A simple LSTM model for sentiment analysis.
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SimpleLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, _) = self.lstm(embedded)
        logits = self.fc(hidden.squeeze(0))
        return logits

We define a simple LSTM model class that inherits from `nn.Module`. The model consists of an embedding layer, an LSTM layer, and a fully connected (linear) layer. The forward method takes an input tensor `x`, passes it through the embedding layer, the LSTM layer, and finally the fully connected layer to produce the output logits.

#### Instantiate model and define loss and optimizer

In [10]:
model = SimpleLSTM(len(vocab), embedding_dim=10, hidden_dim=20, output_dim=1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

We instantiate the LSTM model with the vocabulary size, embedding dimensions, hidden dimensions, and output dimensions. We also define the binary cross-entropy with logits loss (`BCEWithLogitsLoss`) and the Adam optimizer.

In [15]:
def prepare_sentiment_data(sentences, sentiment, vocab):
    """
    Prepare sentiment analysis data for LSTM model training.

    This function takes raw sentences and their corresponding sentiment labels,
    converts them to numerical format, pads sequences to equal length, and
    prepares them in the format expected by PyTorch's LSTM.

    Args:
        sentences (list of str): List of input sentences to be processed
        sentiment (list of int): List of sentiment labels (0 for negative, 1 for positive)
        vocab (dict): Vocabulary mapping words to their corresponding indices

    Returns:
        tuple: A tuple containing:
            - inputs (torch.LongTensor): Padded and encoded input sequences 
              with shape (seq_len, batch_size)
            - labels (torch.FloatTensor): Tensor of sentiment labels with shape (batch_size,)
    """
    # Tokenize and encode the sentences
    encoded_sentences = [[vocab[word] for word in sentence.split()] for sentence in sentences]

    # Find the maximum sequence length
    max_length = max(len(sentence) for sentence in encoded_sentences)

    # Pad sequences to the same length
    padded_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) 
                       for sentence in encoded_sentences]

    # Convert to PyTorch tensors
    inputs = torch.LongTensor(padded_sentences)
    labels = torch.FloatTensor(sentiment)

    # Transpose inputs to match LSTM's expected input shape (seq_len, batch_size, input_size)
    inputs = inputs.t()
    
    return inputs, labels

# Example usage:
inputs, labels = prepare_sentiment_data(sentences, sentiment, vocab)

#### Train the model

In [16]:
# Training loop
epochs = 1000

for epoch in range(epochs):
    optimizer.zero_grad()
    predictions = model(inputs.t()).squeeze(1)
    loss = criterion(predictions, labels)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch + 1}, Loss: {loss.item()}")


Epoch: 100, Loss: 0.0011476636864244938
Epoch: 200, Loss: 0.0009985864162445068
Epoch: 300, Loss: 0.0008769434643909335
Epoch: 400, Loss: 0.0007760371081531048
Epoch: 500, Loss: 0.0006912948447279632
Epoch: 600, Loss: 0.0006192974396981299
Epoch: 700, Loss: 0.0005576762487180531
Epoch: 800, Loss: 0.0005042588454671204
Epoch: 900, Loss: 0.0004578128573484719
Epoch: 1000, Loss: 0.0004170317552052438


We train the model for 1000 epochs. In each epoch, we:

- Reset the gradients by calling optimizer.zero_grad()
- Get the model's predictions for the input sentences by calling model(inputs.t()).squeeze(1)
- Calculate the loss between the predictions and the true labels using the criterion defined earlier
- Perform backpropagation by calling loss.backward()
- Update the model's parameters by calling optimizer.step()
- We also print the loss every 100 epochs for monitoring the training progress.

#### Test the model

In [14]:
with torch.no_grad():
    test_sentences = ["i love this film", "it was terrible"]
    encoded_test_sentences = [[vocab[word] for word in sentence.split()] for sentence in test_sentences]
    padded_test_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) for sentence in encoded_test_sentences]
    test_inputs = torch.LongTensor(padded_test_sentences)
    test_predictions = torch.sigmoid(model(test_inputs.t()).squeeze(1))
    print("Test predictions:", test_predictions)


Test predictions: tensor([0.9949, 0.0283])


We test the model on two new sentences. First, we tokenize, encode, and pad the test sentences in the same way as we did for the training sentences. We then convert the test sentences to PyTorch tensors and pass them through the model. We apply the sigmoid function to the output logits to obtain the final predictions, which represent the probability of each sentence being positive.

The resulting `test_predictions` tensor contains the model's sentiment predictions for the given test sentences.