### Sentiment Analysis using LSTM

#### Prepare data

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sentences (textual data) and their sentiment labels (1 for positive, 0 for negative)
sentences = ["i love this movie", "this film is amazing", "i didn't like it", "it was terrible"]
sentiment = [1, 1, 0, 0]

#### Create Vocabulary

In [2]:
# Simple vocabulary to represent words as indices
vocab = {"<PAD>": 0, "i": 1, "love": 2, "this": 3, "movie": 4, "film": 5, "is": 6, "amazing": 7, "didn't": 8, "like": 9, "it": 10, "was": 11, "terrible": 12}

We create a simple vocabulary to represent words as indices. This allows us to convert words in our sentences to numbers, which can be fed as input to our neural network.
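
The vocabulary above is written by hand. As a minimal sketch, the same mapping can be derived from the training sentences by assigning the next free index to each new word. The helper name `build_vocab` is hypothetical and not part of the notebook.

```python
def build_vocab(sentences, pad_token="<PAD>"):
    """Assign an index to every distinct whitespace-separated word, reserving 0 for padding."""
    vocab = {pad_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused index
    return vocab


sentences = ["i love this movie", "this film is amazing", "i didn't like it", "it was terrible"]
built_vocab = build_vocab(sentences)
print(built_vocab)
```

For these four sentences the result matches the hand-written dictionary, because words are numbered in order of first appearance.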

#### Tokenize, encode and pad sentences

In [3]:
encoded_sentences = [[vocab[word] for word in sentence.split()] for sentence in sentences]
max_length = max([len(sentence) for sentence in encoded_sentences])
padded_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) for sentence in encoded_sentences]

We tokenize and encode the sentences using the vocabulary created earlier. We also pad the sentences with the `<PAD>` token to make them all the same length.
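
The encoding above raises `KeyError` for any word that is missing from the vocabulary. The sketch below shows one common workaround, assuming the vocabulary is extended with an `<UNK>` entry (index 13 here is an assumption, as is the helper name `encode_and_pad`). It also truncates sequences that exceed `max_length`.

```python
def encode_and_pad(sentences, vocab, max_length=None):
    """Encode sentences as index lists, mapping unknown words to <UNK> and padding with <PAD>."""
    unk, pad = vocab["<UNK>"], vocab["<PAD>"]
    encoded = [[vocab.get(word, unk) for word in sentence.split()] for sentence in sentences]
    if max_length is None:
        max_length = max(len(seq) for seq in encoded)
    # Truncate long sequences, then right-pad short ones to max_length
    return [seq[:max_length] + [pad] * (max_length - len(seq)) for seq in encoded]


# Same vocabulary as in the notebook, plus a hypothetical <UNK> entry
vocab_with_unk = {"<PAD>": 0, "i": 1, "love": 2, "this": 3, "movie": 4, "film": 5, "is": 6,
                  "amazing": 7, "didn't": 8, "like": 9, "it": 10, "was": 11, "terrible": 12,
                  "<UNK>": 13}

padded = encode_and_pad(["i love this show", "it was terrible"], vocab_with_unk, max_length=4)
print(padded)
```

If this approach is used with the model defined later, `vocab_size` must cover the extra `<UNK>` index.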

#### Convert data to tensors

In [4]:
inputs = torch.LongTensor(padded_sentences)
labels = torch.FloatTensor(sentiment)

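We convert the input data and labels to PyTorch tensors. Inputs become a `LongTensor` because the embedding layer looks up integer indices; labels become a `FloatTensor` because `BCEWithLogitsLoss`, used below, expects floating-point targets.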
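
As a small illustration, the same tensors can be built with `torch.tensor` and an explicit `dtype`, which makes the intended types visible at the call site. The literal lists below are the padded encodings of the four training sentences; the variable names are chosen for this sketch only.

```python
import torch

# Padded encodings of the four training sentences (0 is the <PAD> index)
padded = [[1, 2, 3, 4], [3, 5, 6, 7], [1, 8, 9, 10], [10, 11, 12, 0]]
targets = [1, 1, 0, 0]

inputs_explicit = torch.tensor(padded, dtype=torch.long)       # integer indices for nn.Embedding
labels_explicit = torch.tensor(targets, dtype=torch.float32)   # float targets for the loss

print(inputs_explicit.shape, inputs_explicit.dtype)
print(labels_explicit.shape, labels_explicit.dtype)
```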

#### Define LSTM Model

In [None]:
class SimpleLSTM(nn.Module):
    """
    LSTM-based binary sentiment classifier for text sequences.
    
    This model implements a standard sequence-to-one LSTM architecture for sentiment
    analysis tasks. It processes batches of padded token-index sequences and outputs
    sentiment predictions by using the final LSTM hidden state as a sentence-level
    representation. Sequences are not packed, so padding positions are also processed.
    
    Architecture components:
    1. Embedding layer: Converts discrete token indices to dense vector representations
    2. LSTM layer: Processes sequences to capture temporal dependencies and context
    3. Linear layer: Maps final hidden state to sentiment classification logits
    
    The model assumes binary sentiment classification (positive/negative) but can be
    extended to multi-class sentiment analysis by adjusting output_dim.

    Args:
        vocab_size (int): Size of vocabulary (number of unique tokens)
        embedding_dim (int): Dimensionality of word embeddings
        hidden_dim (int): Number of hidden units in LSTM layer
        output_dim (int): Number of output classes (typically 1 for binary, 2+ for multi-class)
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        # Embedding layer: learnable lookup table mapping vocabulary indices
        # to dense vectors of size embedding_dim.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layer: with the default batch_first=False it expects input of
        # shape (seq_len, batch_size, embedding_dim).
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # Linear classification layer: maps the final LSTM hidden state to logits.
        # Use output_dim=1 for binary sentiment, output_dim=num_classes for multi-class.
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        """
        Forward pass for sentiment prediction.

        Args:
            x (torch.Tensor): Input token indices of shape (seq_len, batch_size)

        Returns:
            torch.Tensor: Sentiment logits of shape (batch_size, output_dim)
        """
        
        # Convert token indices to dense embeddings:
        # (seq_len, batch_size) → (seq_len, batch_size, embedding_dim)
        embedded = self.embedding(x)

        # Run the LSTM. It returns the hidden states for every time step (unused here)
        # and the final (hidden, cell) tuple; hidden has shape
        # (1, batch_size, hidden_dim) for a single-layer, unidirectional LSTM.
        _, (hidden, _) = self.lstm(embedded)

        # squeeze(0) drops the layer dimension: (1, batch_size, hidden_dim) → (batch_size, hidden_dim).
        # The linear layer then maps to logits: (batch_size, hidden_dim) → (batch_size, output_dim).
        logits = self.fc(hidden.squeeze(0))
        return logits

We define a simple LSTM model class that inherits from `nn.Module`. The model consists of an embedding layer, an LSTM layer, and a fully connected (linear) layer. The forward method takes an input tensor `x`, passes it through the embedding layer, the LSTM layer, and finally the fully connected layer to produce the output logits.
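
To make the shape flow concrete, the sketch below pushes a random batch through standalone copies of the three layers and prints the shape after each step. The sizes match the ones used in this notebook; the random input and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

seq_len, batch_size = 4, 2
vocab_size, embedding_dim, hidden_dim, output_dim = 13, 10, 20, 1

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)   # batch_first=False by default
fc = nn.Linear(hidden_dim, output_dim)

# Random token indices laid out as (seq_len, batch_size)
x = torch.randint(0, vocab_size, (seq_len, batch_size))

embedded = embedding(x)                      # (seq_len, batch_size, embedding_dim)
output, (hidden, cell) = lstm(embedded)      # output: (seq_len, batch_size, hidden_dim)
logits = fc(hidden.squeeze(0))               # (batch_size, output_dim)

print(embedded.shape, output.shape, hidden.shape, logits.shape)
```

For a single-layer, unidirectional LSTM, `hidden[0]` equals the last time step of `output`, which is why either can serve as the sentence representation.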

#### Instantiate model and define loss and optimizer

In [6]:
model = SimpleLSTM(len(vocab), embedding_dim=10, hidden_dim=20, output_dim=1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

We instantiate the LSTM model with the vocabulary size, embedding dimensions, hidden dimensions, and output dimensions. We also define the binary cross-entropy with logits loss (`BCEWithLogitsLoss`) and the Adam optimizer.
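
`BCEWithLogitsLoss` takes raw logits and combines the sigmoid and the binary cross-entropy in one step, which is why the model has no sigmoid in its forward pass. The pure-Python sketch below illustrates the idea with a commonly used numerically stable formulation; it is not claimed to be PyTorch's exact implementation, and the function names are hypothetical.

```python
import math


def bce_naive(logit, target):
    """Sigmoid followed by binary cross-entropy; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Algebraically equivalent form that never takes the log of 0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

# Mean over the batch, which is the default reduction of BCEWithLogitsLoss
mean_loss = sum(bce_with_logits(z, y) for z, y in zip(logits, targets)) / len(logits)
print(mean_loss)
```

With a logit of 100 and a target of 0, `bce_naive` fails because the sigmoid rounds to exactly 1.0, while `bce_with_logits` returns 100.0.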

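#### Wrap preprocessing in a function

The preprocessing steps above can be collected into one function. Unlike the earlier cells, it also transposes the inputs to the `(seq_len, batch_size)` layout that the LSTM expects.

In [None]: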
def prepare_sentiment_data(sentences, sentiment, vocab):
    """
    Transform raw text and sentiment labels into LSTM-compatible tensor format.

    This preprocessing function handles the complete pipeline from raw text to model-ready
    tensors, including tokenization, vocabulary encoding, sequence padding, and tensor
    conversion. It ensures all sequences have uniform length and proper formatting for
    batch processing in PyTorch LSTM models.

    The function assumes text is already cleaned, tokenizes it by splitting on
    whitespace, and requires every word in the sentences to be present in the vocab
    dictionary. Unknown words will cause KeyError exceptions.

    Args:
        sentences (list of str): Raw text sentences for sentiment analysis.
            Example: ["this movie is great", "terrible acting"]
        sentiment (list of int): Corresponding sentiment labels for each sentence.
            Typically 0 for negative, 1 for positive sentiment.
            Must have same length as sentences list.
        vocab (dict): Word-to-index mapping for vocabulary.
            Example: {"this": 0, "movie": 1, "is": 2, "great": 3, "<PAD>": 4}
            Must include "<PAD>" token for sequence padding.

    Returns:
        tuple: A tuple containing:
            - inputs (torch.LongTensor): Encoded and padded sequences with shape 
              (max_seq_len, batch_size). Transposed for LSTM's expected input format.
            - labels (torch.FloatTensor): Sentiment labels with shape (batch_size,).
              Float tensor suitable for binary classification with BCEWithLogitsLoss or BCELoss.

    Raises:
        KeyError: If any word in sentences is not found in vocab dictionary.
        ValueError: If sentences and sentiment lists have different lengths.

    Example:
        >>> sentences = ["good movie", "bad film"]
        >>> sentiment = [1, 0]  
        >>> vocab = {"good": 0, "movie": 1, "bad": 2, "film": 3, "<PAD>": 4}
        >>> inputs, labels = prepare_sentiment_data(sentences, sentiment, vocab)
        >>> print(inputs.shape)  # torch.Size([2, 2]) -> (max_seq_len, batch_size)
        >>> print(labels.shape)  # torch.Size([2])
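    """
    # Enforce the length contract documented under "Raises"
    if len(sentences) != len(sentiment):
        raise ValueError("sentences and sentiment must have the same length")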
    
    # Tokenize each sentence on whitespace and map every token to its vocabulary
    # index, e.g. "good movie" with vocab {"good": 0, "movie": 1} becomes [0, 1].
    # A word missing from vocab raises KeyError; vocab.get(word, vocab["<UNK>"])
    # is one way to handle unknown words if the vocabulary defines an <UNK> token.
    encoded_sentences = [[vocab[word] for word in sentence.split()] for sentence in sentences]

    # Longest sequence in the batch. Every other sequence is padded to this length
    # so the batch can be stored in a single tensor; longer sequences cost more
    # memory but no tokens are dropped.
    max_length = max(len(sentence) for sentence in encoded_sentences)

    # Pad every sequence to max_length by appending the <PAD> index
    # (max_length - len(sequence)) times.
    # Example: with max_length=4 and <PAD>=4, [0, 1] becomes [0, 1, 4, 4].
    padded_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) 
                       for sentence in encoded_sentences]

    # Convert to tensors. The embedding layer requires integer (long) indices;
    # float labels are what BCEWithLogitsLoss and BCELoss expect as targets.
    # Shapes at this point: inputs (batch_size, max_seq_len), labels (batch_size,).
    inputs = torch.LongTensor(padded_sentences)
    labels = torch.FloatTensor(sentiment)

    # Transpose (batch_size, seq_len) → (seq_len, batch_size). An LSTM with
    # batch_first=False expects the sequence dimension first, which is the
    # layout SimpleLSTM.forward assumes.
    inputs = inputs.t()
    
    return inputs, labels

# Example usage:
inputs, labels = prepare_sentiment_data(sentences, sentiment, vocab)
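
Because every sequence is padded to a common length, the LSTM also steps over `<PAD>` positions before producing its final hidden state. The notebook's model accepts this. As an optional variation, the sketch below uses `pack_padded_sequence` so that the final hidden state corresponds to the last real token of each sentence. The standalone layers, the `padding_idx` setting, and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

PAD = 0
embedding = nn.Embedding(13, 10, padding_idx=PAD)
lstm = nn.LSTM(10, 20)

# Two encoded sentences, batch-first: the second has 3 real tokens and 1 <PAD>
batch = torch.tensor([[1, 2, 3, 4],
                      [10, 11, 12, PAD]])
lengths = (batch != PAD).sum(dim=1)          # number of real tokens per sentence

inputs_t = batch.t()                         # (seq_len, batch_size)
packed = pack_padded_sequence(embedding(inputs_t), lengths, enforce_sorted=False)
_, (hidden, _) = lstm(packed)                # hidden: (1, batch_size, 20)

# Reference: run the second sentence alone, without its padding
alone = embedding(batch[1, :3]).unsqueeze(1) # (3, 1, 10)
_, (hidden_alone, _) = lstm(alone)

print(torch.allclose(hidden[0, 1], hidden_alone[0, 0], atol=1e-6))
```

The comparison at the end checks that the packed run ignores the padding: the hidden state for the second sentence matches a run over its three real tokens only.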

#### Train the model

In [None]:
# Training loop
epochs = 1000

for epoch in range(epochs):
    # Reset the gradients accumulated during the previous step
    optimizer.zero_grad()
    # Forward pass. `inputs` was already transposed to (seq_len, batch_size) by
    # prepare_sentiment_data, so it is passed to the model as-is; squeeze(1)
    # turns the (batch_size, 1) logits into shape (batch_size,)
    predictions = model(inputs).squeeze(1)
    # Loss between the predicted logits and the true labels
    loss = criterion(predictions, labels)
    # Backpropagate, then update the model's parameters
    loss.backward()
    optimizer.step()

    # Print the loss every 100 epochs to monitor training progress
    if (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch + 1}, Loss: {loss.item()}")


Epoch: 100, Loss: 0.0011476636864244938
Epoch: 200, Loss: 0.0009985864162445068
Epoch: 300, Loss: 0.0008769434643909335
Epoch: 400, Loss: 0.0007760371081531048
Epoch: 500, Loss: 0.0006912948447279632
Epoch: 600, Loss: 0.0006192974396981299
Epoch: 700, Loss: 0.0005576762487180531
Epoch: 800, Loss: 0.0005042588454671204
Epoch: 900, Loss: 0.0004578128573484719
Epoch: 1000, Loss: 0.0004170317552052438


#### Test the model

In [None]:
with torch.no_grad():
    # Gradients are not needed for inference, so the block runs under torch.no_grad().
    # Steps: encode the test sentences with the training vocabulary, pad them to the
    # training max_length, convert them to a tensor, transpose to (seq_len, batch_size),
    # run the model, and apply sigmoid to turn the logits into probabilities.
    test_sentences = ["i love this film", "it was terrible"]
    encoded_test_sentences = [[vocab[word] for word in sentence.split()] for sentence in test_sentences]
    padded_test_sentences = [sentence + [vocab["<PAD>"]] * (max_length - len(sentence)) for sentence in encoded_test_sentences]
    test_inputs = torch.LongTensor(padded_test_sentences)
    test_predictions = torch.sigmoid(model(test_inputs.t()).squeeze(1))
    
    print("Test predictions:", test_predictions)


Test predictions: tensor([0.9949, 0.0283])


We test the model on two sentences: "i love this film", which is new, and "it was terrible", which also appears in the training data. We tokenize, encode, and pad them in the same way as the training sentences, convert them to a PyTorch tensor, and pass them through the model. Applying the sigmoid function to the output logits gives the probability of each sentence being positive.

The resulting `test_predictions` tensor holds one probability per sentence: about 0.99 for "i love this film" (positive) and about 0.03 for "it was terrible" (negative).
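
To turn these probabilities into discrete labels, a threshold is applied. The sketch below uses the two values printed above and a cutoff of 0.5, which is a conventional choice rather than something fixed by the notebook. The helper name `to_label` is hypothetical.

```python
def to_label(probability, threshold=0.5):
    """Map a positive-class probability to a sentiment label."""
    return "positive" if probability >= threshold else "negative"


# Probabilities taken from the test output shown above
test_probabilities = [0.9949, 0.0283]
predicted_labels = [to_label(p) for p in test_probabilities]
print(predicted_labels)
```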