# Sentiment Analysis using LSTM

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

### Prepare data

In [2]:
# Sentences (Textual data) and their sentiment labels (1 for positive, 0 for negative)
sentences = ["i love this movie", "this film is amazing", "i didn't like it", "it was terrible"]
sentiment = [1, 1, 0, 0]

### Create Vocabulary

In [3]:
# Simple vocabulary to represent words as indices
vocab = {"<PAD>": 0, "i": 1, "love": 2, "this": 3, "movie": 4, "film": 5, "is": 6, "amazing": 7, "didn't": 8, "like": 9, "it": 10, "was": 11, "terrible": 12}

The vocabulary maps each word to an integer index, with index 0 reserved for the `<PAD>` token. This lets us convert the words in our sentences to numbers, which can be fed as input to the neural network.
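
The hand-written dictionary could also be derived from the training sentences. The following is a minimal sketch, assuming whitespace tokenization and indices assigned in order of first appearance; `build_vocab` is a hypothetical helper, not part of the original notebook.

```python
def build_vocab(sentences, pad_token="<PAD>"):
    """Map each distinct word to an integer index; index 0 is reserved for padding."""
    vocab = {pad_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                # The next free index is the current vocabulary size
                vocab[word] = len(vocab)
    return vocab

sentences = ["i love this movie", "this film is amazing", "i didn't like it", "it was terrible"]
built_vocab = build_vocab(sentences)
print(built_vocab)
```

For these four sentences, the result is the same mapping as the dictionary written out by hand.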

### Tokenize, encode and pad sentences

In [5]:
encoded_sentences = [[vocab[word] for word in sentence.split()] for sentence in sentences]
max_length = max(len(sentence) for sentence in encoded_sentences)
padded_sentences = [sentence + [vocab['<PAD>']] * (max_length - len(sentence)) for sentence in encoded_sentences]

In [6]:
encoded_sentences

[[1, 2, 3, 4], [3, 5, 6, 7], [1, 8, 9, 10], [10, 11, 12]]

In [7]:
max_length

4

We split each sentence on whitespace and map each word to its index using the vocabulary created earlier. Shorter sentences are then right-padded with the `<PAD>` token (index 0) up to `max_length`, so that all sequences have the same length.

In [8]:
padded_sentences

[[1, 2, 3, 4], [3, 5, 6, 7], [1, 8, 9, 10], [10, 11, 12, 0]]
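
PyTorch also ships a padding utility, `torch.nn.utils.rnn.pad_sequence`, which could replace the manual list arithmetic. A minimal sketch, assuming the encoded sentences shown earlier; the variable names are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

encoded = [[1, 2, 3, 4], [3, 5, 6, 7], [1, 8, 9, 10], [10, 11, 12]]

# pad_sequence expects a list of 1-D tensors and pads them to the longest one
sequences = [torch.tensor(seq, dtype=torch.long) for seq in encoded]
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

print(padded.shape)  # (batch, max_length)
print(padded)
```

With `batch_first=False` (the default), the same call would return the transposed `(max_length, batch)` layout instead.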

### Convert data to tensors

In [11]:
inputs = torch.LongTensor(padded_sentences)
labels = torch.FloatTensor(sentiment)

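We convert the input data and labels to PyTorch tensors. Inputs are converted to `LongTensor`s because `nn.Embedding` requires integer indices, while labels are converted to `FloatTensor`s because the binary cross-entropy loss used below expects floating-point targets.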

In [12]:
inputs, labels

(tensor([[ 1,  2,  3,  4],
         [ 3,  5,  6,  7],
         [ 1,  8,  9, 10],
         [10, 11, 12,  0]]),
 tensor([1., 1., 0., 0.]))
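
The same tensors can be built with `torch.tensor` and an explicit `dtype`, which makes the integer/float distinction visible in the code. A short sketch; `inputs_alt` and `labels_alt` are illustrative names.

```python
import torch

padded_sentences = [[1, 2, 3, 4], [3, 5, 6, 7], [1, 8, 9, 10], [10, 11, 12, 0]]
sentiment = [1, 1, 0, 0]

inputs_alt = torch.tensor(padded_sentences, dtype=torch.long)  # word indices
labels_alt = torch.tensor(sentiment, dtype=torch.float32)      # 0.0 / 1.0 targets

print(inputs_alt.dtype, tuple(inputs_alt.shape))
print(labels_alt.dtype, tuple(labels_alt.shape))
```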

## Define LSTM Model

In [15]:
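class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        # Maps word indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first defaults to False, so the LSTM expects (seq_len, batch, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (seq_len, batch) tensor of word indices
        embedded = self.embedding(x)
        # hidden: final hidden state, shape (1, batch, hidden_dim)
        _, (hidden, _) = self.lstm(embedded)
        # Drop the layer dimension and project to (batch, output_dim)
        logits = self.fc(hidden.squeeze(0))
        return logits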


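We define a simple LSTM model class that inherits from `nn.Module`. The model consists of an embedding layer, an LSTM layer, and a fully connected (linear) layer. The forward method takes an input tensor `x`, passes it through the embedding layer and the LSTM layer, and then feeds the LSTM's final hidden state to the fully connected layer to produce the output logits.

Because `nn.LSTM` defaults to `batch_first=False`, it expects input of shape `(seq_len, batch)` rather than `(batch, seq_len)`, which is why the training and test code pass `inputs.t()` to the model. The final hidden state has shape `(1, batch, hidden_dim)`; `squeeze(0)` removes the leading layer dimension before the linear layer.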
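
To make the tensor shapes concrete, the sketch below traces a random batch through the same three layers, outside the model class. The dimensions match those used in this notebook (vocabulary of 13, `embedding_dim=10`, `hidden_dim=20`); the random input is only a stand-in for real data.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 13, 10, 20
seq_len, batch_size = 4, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)  # batch_first=False by default
fc = nn.Linear(hidden_dim, 1)

x = torch.randint(0, vocab_size, (seq_len, batch_size))  # (seq_len, batch)
embedded = embedding(x)                # (seq_len, batch, embedding_dim)
out, (hidden, cell) = lstm(embedded)   # out: (seq_len, batch, hidden_dim)
logits = fc(hidden.squeeze(0))         # hidden: (1, batch, hidden_dim) -> logits: (batch, 1)

for name, t in [("x", x), ("embedded", embedded), ("out", out),
                ("hidden", hidden), ("logits", logits)]:
    print(f"{name:9s} {tuple(t.shape)}")
```

For a single-layer, unidirectional LSTM such as this one, `hidden[0]` equals the last time step of `out`, so either could feed the linear layer.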

### Instantiate model and define loss and optimizer

In [16]:
model = SimpleLSTM(len(vocab), embedding_dim=10, hidden_dim=20, output_dim=1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

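We instantiate the LSTM model with the vocabulary size (13), 10-dimensional embeddings, a 20-dimensional hidden state, and a single output logit. We also define the binary cross-entropy with logits loss (`BCEWithLogitsLoss`) and the Adam optimizer with a learning rate of 1e-3. `BCEWithLogitsLoss` expects raw logits, so the model has no sigmoid in its forward pass.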
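
For reference, the per-example quantity this loss computes can be written out in plain Python. For a logit z and target y, a numerically stable form is max(z, 0) - z*y + log(1 + exp(-|z|)). The sketch below compares it with the naive sigmoid-then-log version; the function names and logit values are made up for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    losses = [max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
              for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)

def bce_naive(logits, targets):
    """Same quantity via an explicit sigmoid; can lose precision for large |z|."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

example_logits = [2.0, 0.5, -1.0, -3.0]   # made-up values
example_targets = [1.0, 1.0, 0.0, 0.0]

stable_loss = bce_with_logits(example_logits, example_targets)
naive_loss = bce_naive(example_logits, example_targets)
print(stable_loss, naive_loss)
```

The two functions agree on moderate logits; the stable form avoids evaluating `log(0)` when a logit is far from zero.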

## Train the model

In [27]:
inputs[0], inputs.t()

(tensor([1, 2, 3, 4]),
 tensor([[ 1,  3,  1, 10],
         [ 2,  5,  8, 11],
         [ 3,  6,  9, 12],
         [ 4,  7, 10,  0]]))

In [30]:
epochs = 1000
model.train()
for epoch in range(1, epochs + 1):
    # Zero the gradients from the previous step
    optimizer.zero_grad()

    # Forward pass: transpose to (seq_len, batch), then squeeze logits from (batch, 1) to (batch,)
    predictions = model(inputs.t()).squeeze(1)

    # Calculate the loss and perform an optimization step
    loss = criterion(predictions, labels)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {loss.item()}")


Epoch: 100, Loss: 0.00042748049600049853
Epoch: 200, Loss: 0.0003904219774994999
Epoch: 300, Loss: 0.00035786055377684534
Epoch: 400, Loss: 0.00032914121402427554
Epoch: 500, Loss: 0.00030354931368492544
Epoch: 600, Loss: 0.0002806679403875023
Epoch: 700, Loss: 0.00026011004229076207
Epoch: 800, Loss: 0.0002414585615042597
Epoch: 900, Loss: 0.00022459443425759673
Epoch: 1000, Loss: 0.00020921982650179416


## Test the model

In [33]:
test_sentences = ["i love this film", "it was terrible"]
encoded_test_sentences = [[vocab[word] for word in sentence.split()] for sentence in test_sentences]
padded_test_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) for sentence in encoded_test_sentences]
test_inputs = torch.LongTensor(padded_test_sentences)

model.eval()
# Gradients are not needed for inference
with torch.no_grad():
    test_predictions = torch.sigmoid(model(test_inputs.t()).squeeze(1))
print("Test predictions:", test_predictions)


Test predictions: tensor([9.9605e-01, 1.7678e-04])


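We test the model on two sentences: "i love this film", a word combination not seen during training, and "it was terrible", which also appears in the training set. First, we tokenize, encode, and pad the test sentences in the same way as the training sentences. We then convert them to PyTorch tensors and pass them through the model. We apply the sigmoid function to the output logits to obtain the final predictions, which represent the probability of each sentence being positive.

The resulting `test_predictions` tensor gives a probability of about 0.996 for the first sentence (positive) and about 0.0002 for the second (negative). Note that every test word must already be in the vocabulary; a sentence containing an unseen word would raise a `KeyError` during encoding.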
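
To turn these probabilities into class labels, a common choice is to threshold at 0.5. A minimal sketch using the two values printed in the test output; `to_labels` and the 0.5 cutoff are assumptions for illustration, not part of the original notebook.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each positive-class probability to a sentiment label."""
    return ["positive" if p >= threshold else "negative" for p in probabilities]

test_sentences = ["i love this film", "it was terrible"]
test_probabilities = [9.9605e-01, 1.7678e-04]  # values copied from the test output

predicted = to_labels(test_probabilities)
for sentence, label in zip(test_sentences, predicted):
    print(f"{sentence!r}: {label}")
```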