### Text classification using LSTM

In this coding exercise, you will create a simple LSTM model using PyTorch to perform text classification on a dataset of short phrases. We will perform the following steps:

- Create a vocabulary to represent words as indices.
- Tokenize, encode, and pad the phrases.
- Convert the phrases and categories to PyTorch tensors.
- Instantiate the LSTM model with the vocabulary size, embedding dimensions, hidden dimensions, and output dimensions.
- Define the loss function and optimizer.
- Train the model for a number of epochs.
- Test the model on new phrases and print the category predictions.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

In [9]:
"""
Phrases (textual data) and their category labels (0 for sports, 1 for technology, 2 for food).
This dataset is far too small to train an LSTM model realistically, and the model is likely
to overfit. Feel free to use another data source for training or to create your own dummy data.
"""

phrases = ["great goal scored", "amazing touchdown", "new phone release", "latest laptop model", "tasty pizza", "delicious burger"]
categories = [0, 0, 1, 1, 2, 2]

"""
Create a vocabulary to represent words as indices
"""
vocab = {"<PAD>": 0, "great": 1, "goal": 2, "scored": 3, "amazing": 4, "touchdown": 5, "new": 6, "phone": 7, "release": 8, "latest": 9, "laptop": 10, "model": 11, "tasty": 12, "pizza": 13, "delicious": 14, "burger": 15}

"""
Tokenize, encode, and pad phrases
"""
encoded_phrases = [[vocab[word] for word in phrase.split()] for phrase in phrases]
max_length = max([len(phrase) for phrase in encoded_phrases])
padded_phrases = [phrase + [vocab["<PAD>"]] * (max_length - len(phrase)) for phrase in encoded_phrases]

"""
Concise unpacking approach for printing all elements.
Uses print()'s ability to handle multiple arguments with sep parameter.
"""
print(*padded_phrases, sep='\n')
print("\n")

"""
Proper use of map() for data transformation.
Converts each phrase to uppercase - this creates new data.
"""
uppercase_phrases = list(map(str.upper, phrases))
print(*uppercase_phrases, sep='\n')
print("\n")

"""
Even better: use list comprehension for transformations.
More readable and Pythonic than map() in most cases.
"""
uppercase_phrases = [phrase.upper() for phrase in phrases]
print(*uppercase_phrases, sep='\n')
print("\n")

"""
Convert phrases and categories to PyTorch tensors
"""
inputs = torch.LongTensor(padded_phrases)
labels = torch.LongTensor(categories)

[1, 2, 3]
[4, 5, 0]
[6, 7, 8]
[9, 10, 11]
[12, 13, 0]
[14, 15, 0]


GREAT GOAL SCORED
AMAZING TOUCHDOWN
NEW PHONE RELEASE
LATEST LAPTOP MODEL
TASTY PIZZA
DELICIOUS BURGER


GREAT GOAL SCORED
AMAZING TOUCHDOWN
NEW PHONE RELEASE
LATEST LAPTOP MODEL
TASTY PIZZA
DELICIOUS BURGER





---

##### Detailed Dimension Analysis

###### Step-by-Step Dimension Tracking

**Starting dimensions**: Assume `seq_len=10`, `batch_size=32`, `embedding_dim=100`, `hidden_dim=128`, `output_dim=5`

1. **Input**: `x.shape = (10, 32)` - 10 tokens per sequence, 32 sequences in batch

2. **After Embedding**: `embedded.shape = (10, 32, 100)`
   - Each of 10×32 token indices becomes 100-dimensional vector
   - Total: 320 embedding vectors, each 100-dimensional

3. **After LSTM**: 
   - `output.shape = (10, 32, 128)` - Hidden states for all 10 timesteps
   - `hidden.shape = (1, 32, 128)` - Only final hidden state
   - The "1" comes from `num_layers=1` (default for single-layer LSTM)

4. **After squeeze(0)**: `hidden.squeeze(0).shape = (32, 128)`
   - Removes the layer dimension, keeping batch and hidden dimensions

5. **After Linear**: `logits.shape = (32, 5)`
   - 32 sequences → 32 predictions, each with 5 class scores
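
The shape walkthrough above can be checked directly. The sketch below is a minimal, self-contained trace. `vocab_size = 1000` is an assumed value (the walkthrough does not specify one), and the input is random token indices rather than real text.

```python
import torch
import torch.nn as nn

# Dimensions from the walkthrough; vocab_size is an assumption for illustration.
seq_len, batch_size = 10, 32
vocab_size, embedding_dim, hidden_dim, output_dim = 1000, 100, 128, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)  # batch_first=False by default
fc = nn.Linear(hidden_dim, output_dim)

x = torch.randint(0, vocab_size, (seq_len, batch_size))  # random token indices
embedded = embedding(x)
output, (hidden, cell) = lstm(embedded)
logits = fc(hidden.squeeze(0))

for name, t in [("x", x), ("embedded", embedded), ("output", output),
                ("hidden", hidden), ("logits", logits)]:
    print(f"{name:9s} {tuple(t.shape)}")
```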

###### Why These Specific Dimensions?

**LSTM Output Format**: PyTorch LSTM returns `(num_layers, batch_size, hidden_dim)` for hidden state, hence the need for `squeeze(0)` to remove the single layer dimension.

**Sequence-to-One**: We discard all intermediate LSTM outputs and use only the final hidden state, assuming it encodes sufficient information about the entire sequence for classification.

**Batch Processing**: All operations preserve the batch dimension, allowing efficient parallel processing of multiple sequences simultaneously.
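
For a single-layer, unidirectional LSTM, the final hidden state is the same tensor as the last timestep of `output`. The sketch below illustrates this with small, arbitrary dimensions (4-dimensional inputs, 8 hidden units) chosen only for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8)  # one layer, unidirectional
embedded = torch.randn(5, 3, 4)              # (seq_len, batch_size, embedding_dim)

output, (hidden, _) = lstm(embedded)

last_step = output[-1]            # hidden state at the last timestep: (3, 8)
final_hidden = hidden.squeeze(0)  # drop the num_layers dimension: (3, 8)
same = torch.allclose(last_step, final_hidden)
print(same)
```

This equivalence no longer holds as stated for stacked or bidirectional LSTMs, where `hidden` carries one entry per layer and direction.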

---

#### Numerical Example of Embedding Layer

The following concrete example uses small dimensions for clarity.

##### Setup Parameters
```python
vocab_size = 6      # Small vocabulary: ["hello", "world", "good", "bad", "movie", "<PAD>"]
embedding_dim = 4   # 4-dimensional embeddings
```

##### The Embedding Matrix

When you create `nn.Embedding(6, 4)`, PyTorch initializes a learnable matrix:

```python
# Example embedding matrix (illustrative values; nn.Embedding draws the real ones at random)
embedding_matrix = torch.tensor([
    [0.2, -0.1,  0.8,  0.3],  # Index 0: "hello"
    [0.5,  0.9, -0.4,  0.1],  # Index 1: "world" 
    [-0.2, 0.7,  0.6, -0.8],  # Index 2: "good"
    [0.9, -0.5,  0.2,  0.4],  # Index 3: "bad"
    [0.1,  0.3, -0.9,  0.7],  # Index 4: "movie"
    [0.0,  0.0,  0.0,  0.0]   # Index 5: "<PAD>" (zero only if padding_idx=5 is passed)
])
# Shape: (6, 4) = (vocab_size, embedding_dim)
```
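
The zero row for `<PAD>` is not automatic. As a minimal sketch, the code below compares a plain `nn.Embedding(6, 4)` with one created with `padding_idx=5`. The variable names `plain` and `with_pad` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
plain = nn.Embedding(6, 4)                    # every row is randomly initialized
with_pad = nn.Embedding(6, 4, padding_idx=5)  # row 5 is initialized to zeros

print(tuple(plain.weight.shape))  # (vocab_size, embedding_dim)
print(plain.weight[5])            # random values
print(with_pad.weight[5])         # all zeros
```

With `padding_idx` set, the padding row also receives no gradient updates, so it stays at zero during training.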

##### Understanding Embedding Dimensions vs. Features

The `embedding_dim = 4` is **not** the features we want to extract for classification. It's an **intermediate representation** that helps the model learn those features.

##### The Distinction

**Embedding dimensions**: Dense vector representation of individual tokens

**Features for classification**: Higher-level patterns learned by the LSTM

##### What Each Layer Extracts

```python
# Input: Token indices [2, 4] representing "good movie"

# Embedding layer output (embedding_dim = 4):
embedded = [
    [-0.2,  0.7,  0.6, -0.8],  # "good" token representation
    [ 0.1,  0.3, -0.9,  0.7]   # "movie" token representation  
]
# These are NOT the final features - just token-level representations

# LSTM layer processes these embeddings and extracts:
final_hidden_state = [0.3, -0.1, 0.8, 0.2, -0.5, 0.9, ...]  # hidden_dim=128
# THIS is the feature vector we want - sentence-level representation
```

##### The Learning Hierarchy

```mermaid
flowchart TD
    A["Token Level<br>Individual word meanings<br>embedding_dim=4"] --> B["Sequence Level<br>Sentence patterns & context<br>hidden_dim=128"]
    B --> C["Classification Features<br>Sentiment, topic, etc.<br>output_dim=num_classes"]
    
    A1["'good' → [-0.2, 0.7, 0.6, -0.8]<br>'movie' → [0.1, 0.3, -0.9, 0.7]"] --> A
    B1["'good movie' → [0.3, -0.1, 0.8, ...]<br>Captures: positive sentiment<br>+ movie context"] --> B
    C1["Final prediction:<br>[0.1, 0.9] = 90% positive"] --> C
    
    style A fill:#E8F4FD,color:#000
    style B fill:#A3CCF6,color:#000
    style C fill:#E0F0E0,color:#000
    style A1 fill:#FFE0E0,color:#000
    style B1 fill:#FFF0E0,color:#000
    style C1 fill:#FFFFCC,color:#000
```

##### What We Actually Want

The **classification features** we want are in the LSTM's final hidden state:

```python
# This 128-dimensional vector contains the learned features:
final_features = [0.3, -0.1, 0.8, 0.2, -0.5, 0.9, ...]  # 128 values

# Purely illustrative: individual hidden units are rarely this interpretable,
# but collectively they might encode concepts like:
# feature[0] = 0.3   → "positive sentiment strength"
# feature[1] = -0.1  → "movie genre indicator" 
# feature[2] = 0.8   → "emotional intensity"
# feature[67] = 0.2  → "grammatical complexity"
# etc.
```

##### Dimensionality Purpose

**embedding_dim (4)**: Just enough dimensions to distinguish between vocabulary words and capture basic semantic relationships.

**hidden_dim (128)**: Much larger dimension to capture complex sequential patterns, context dependencies, and classification-relevant features.

##### Analogy

Think of embeddings like individual LEGO pieces (simple building blocks), while the LSTM hidden state is like the complex structure built from those pieces. The embedding dimensions give you the basic components, but the LSTM features give you the architectural patterns needed for classification.

The 4-dimensional embeddings are tools for learning, not the final features we want for classification.


##### Token-to-Index Mapping
```python
word_to_idx = {
    "hello": 0, "world": 1, "good": 2, 
    "bad": 3, "movie": 4, "<PAD>": 5
}
```

##### Input Processing Example

**Input sentence**: "good movie"

**Tokenized indices**: `[2, 4]`

```python
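# Build the layer from the example matrix so the lookup reproduces its rows
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)

# Input tensor of token indices
x = torch.tensor([2, 4])  # Shape: (2,) representing "good movie"

# Embedding lookup
embedded = embedding_layer(x)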
```

##### Step-by-Step Lookup Process

```python
# Manual lookup to show what happens:
token_2_embedding = embedding_matrix[2, :]  # "good" → [-0.2, 0.7, 0.6, -0.8]
token_4_embedding = embedding_matrix[4, :]  # "movie" → [0.1, 0.3, -0.9, 0.7]

# Result
embedded = torch.tensor([
    [-0.2,  0.7,  0.6, -0.8],  # "good"
    [ 0.1,  0.3, -0.9,  0.7]   # "movie"
])
# Shape: (2, 4) = (sequence_length, embedding_dim)
```

##### Batch Processing Example

**Multiple sentences**:
- Sentence 1: "good movie" → `[2, 4]`
- Sentence 2: "bad movie" → `[3, 4]`

```python
# Batched input (assuming sequences padded to same length)
x_batch = torch.tensor([
    [2, 4],  # "good movie"
    [3, 4]   # "bad movie"  
])
# Shape: (2, 2) = (batch_size, seq_len)

# After embedding
embedded_batch = embedding_layer(x_batch)
# Shape: (2, 2, 4) = (batch_size, seq_len, embedding_dim)

embedded_batch = torch.tensor([
    [[-0.2,  0.7,  0.6, -0.8],   # "good"
     [ 0.1,  0.3, -0.9,  0.7]],  # "movie"
    [[ 0.9, -0.5,  0.2,  0.4],   # "bad"
     [ 0.1,  0.3, -0.9,  0.7]]   # "movie"
])
```
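
The batch above is laid out as `(batch_size, seq_len)`, whereas `nn.LSTM` with its default `batch_first=False` expects the sequence dimension first. The sketch below reuses the example matrix to show the batched lookup in both layouts. Loading the matrix through `nn.Embedding.from_pretrained` is a choice made for this illustration only.

```python
import torch
import torch.nn as nn

embedding_matrix = torch.tensor([
    [0.2, -0.1,  0.8,  0.3],   # "hello"
    [0.5,  0.9, -0.4,  0.1],   # "world"
    [-0.2, 0.7,  0.6, -0.8],   # "good"
    [0.9, -0.5,  0.2,  0.4],   # "bad"
    [0.1,  0.3, -0.9,  0.7],   # "movie"
    [0.0,  0.0,  0.0,  0.0],   # "<PAD>"
])
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)

x_batch = torch.tensor([[2, 4],   # "good movie"
                        [3, 4]])  # "bad movie"
embedded_batch = embedding_layer(x_batch)   # (batch_size, seq_len, embedding_dim)

# Transpose the indices to (seq_len, batch_size) before feeding a default nn.LSTM.
x_seq_first = x_batch.t()
embedded_seq_first = embedding_layer(x_seq_first)  # (seq_len, batch_size, embedding_dim)

print(tuple(embedded_batch.shape), tuple(embedded_seq_first.shape))
```

Both tensors hold the same vectors; only the order of the first two axes differs.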

##### Learning Process

During training, gradients update the embedding matrix:

```python
# Initial "good" embedding (illustrative values)
embedding_matrix[2] = torch.tensor([-0.2, 0.7, 0.6, -0.8])

# After some training (hypothetical values standing in for an optimizer update)
embedding_matrix[2] = torch.tensor([-0.1, 0.8, 0.5, -0.7])  # Learned better representation

# The model learns that "good" should have embeddings that help
# distinguish positive sentiment in classification tasks
```
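
To make the update step concrete, here is a minimal sketch of one optimizer step on an embedding layer. The sum-of-squares loss and the SGD optimizer are stand-ins chosen for the example, not the training setup used in this exercise. Only the rows for tokens that appear in the input change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_layer = nn.Embedding(6, 4)
optimizer = torch.optim.SGD(embedding_layer.parameters(), lr=0.1)
before = embedding_layer.weight.detach().clone()

x = torch.tensor([2, 4])                # "good movie"
loss = embedding_layer(x).pow(2).sum()  # toy loss, for illustration only

optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding_layer.weight.detach()
changed_rows = (before != after).any(dim=1).tolist()
print(changed_rows)  # [False, False, True, False, True, False]
```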

##### Key Properties

**Discrete → Continuous**: Token index `2` becomes dense vector `[-0.2, 0.7, 0.6, -0.8]`

**Learnable**: These numbers change during training to better represent semantic relationships

**Shared Representations**: Token `4` ("movie") gets the same embedding `[0.1, 0.3, -0.9, 0.7]` wherever it appears

**Efficient Lookup**: No matrix multiplication - just indexing into the embedding table

This embedding layer effectively converts sparse, discrete token indices into dense, continuous representations that neural networks can process effectively.
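
The lookup is mathematically equivalent to multiplying a one-hot encoding of each token by the embedding matrix, which is why indexing can replace the multiplication. A minimal sketch, using a randomly initialized layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding_layer = nn.Embedding(6, 4)
x = torch.tensor([2, 4])

looked_up = embedding_layer(x)                  # direct row lookup: (2, 4)
one_hot = F.one_hot(x, num_classes=6).float()   # sparse representation: (2, 6)
via_matmul = one_hot @ embedding_layer.weight   # same rows, via matrix product

print(torch.allclose(looked_up, via_matmul))
```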

---

```mermaid
flowchart LR
    A["Input Tokens<br>(seq_len, batch_size)<br>Token indices"] --> B["Embedding Layer<br>nn.Embedding<br>vocab_size → embedding_dim"]
    B --> C["Dense Vectors<br>(seq_len, batch_size, embedding_dim)<br>Continuous representations"]
    C --> D["LSTM Layer<br>nn.LSTM<br>Sequential processing"]
    D --> E["All Hidden States<br>(seq_len, batch_size, hidden_dim)<br>Temporal features"]
    D --> F["Final Hidden State<br>(1, batch_size, hidden_dim)<br>Sequence summary"]
    F --> G["Squeeze Operation<br>Remove layer dimension<br>(batch_size, hidden_dim)"]
    G --> H["Linear Layer<br>nn.Linear<br>hidden_dim → output_dim"]
    H --> I["Class Logits<br>(batch_size, output_dim)<br>Raw classification scores"]
    
    E -.->|"Discarded"| J["Not used for<br>classification"]
    
    style A fill:#E8F4FD,color:#000
    style B fill:#D1E7FB,color:#000
    style C fill:#A3CCF6,color:#000
    style D fill:#75B1F1,color:#000
    style E fill:#FFE0E0,color:#000
    style F fill:#4796EC,color:#000
    style G fill:#1E88E5,color:#000
    style H fill:#1976D2,color:#000
    style I fill:#E0F0E0,color:#000
    style J fill:#FFCCCC,color:#000
```

The diagram illustrates the key architectural decisions:

**Information Flow**: Input tokens undergo progressive transformation from discrete indices to dense embeddings to sequential features to final classification scores.

**Dimensionality Changes**: Each step shows how tensor shapes evolve, with the critical dimension changes highlighted at each transformation.

**Selection Strategy**: The dotted line shows that intermediate LSTM outputs are discarded, with only the final hidden state used for classification - a sequence-to-one prediction approach.

**Bottleneck Design**: The final hidden state serves as a compressed representation of the entire input sequence, requiring the LSTM to encode all relevant information for classification in this single vector.
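
One consequence of this design is that, with padded inputs, the final hidden state is computed after the LSTM has also read the pad tokens. The sketch below shows one possible way to avoid that with `pack_padded_sequence`. This is not what the classifier in this exercise does, and the layer sizes and `lengths` tensor are assumptions for the illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(16, 10, padding_idx=0)
lstm = nn.LSTM(10, 20)

# Two phrases padded with index 0, transposed to (seq_len, batch_size).
x = torch.tensor([[1, 2, 3], [4, 5, 0]]).t()
lengths = torch.tensor([3, 2])  # true (unpadded) lengths, kept on the CPU

embedded = embedding(x)
_, (hidden_padded, _) = lstm(embedded)  # state after reading the pad token too

packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
_, (hidden_packed, _) = lstm(packed)    # state at each phrase's last real token

# Reference: run the short phrase on its own, without any padding.
_, (hidden_ref, _) = lstm(embedding(torch.tensor([[4], [5]])))

print(torch.allclose(hidden_packed[0, 1], hidden_ref[0, 0], atol=1e-6))
print(torch.allclose(hidden_padded[0, 1], hidden_ref[0, 0], atol=1e-6))
```

The packed result should match the unpadded reference, while the plain padded run generally differs for the shorter phrase.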

---

In [None]:
class PhraseClassifier(nn.Module):
    """
    LSTM-based phrase classifier for text classification tasks using sequence-to-one prediction.
    
    Architecture Flow:
    Input tokens → Embedding → LSTM → Final hidden state → Linear → Class logits
    
    The model processes batches of token sequences (padded to a common length) by:
    1. Converting token indices to dense vector representations via embedding
    2. Processing embedded sequences through LSTM to capture sequential patterns
    3. Using the final LSTM hidden state as a sentence-level representation
    4. Mapping this representation to class logits via linear transformation
    
    This architecture is effective for document-level classification where the entire
    sequence context determines the output label (sentiment, topic, etc.).

    Attributes:
        embedding (nn.Embedding): Token index to dense vector conversion
        lstm (nn.LSTM): Sequential pattern learning with memory
        fc (nn.Linear): Classification head for final prediction

    Args:
        vocab_size (int): Total number of unique tokens in vocabulary
        embedding_dim (int): Dimensionality of dense word vectors (typically 50-300)
        hidden_dim (int): Size of LSTM hidden state (controls model capacity)
        output_dim (int): Number of target classes for classification
    """

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        """
        Initialize phrase classifier with specified architecture dimensions.
        
        Layer initialization creates:
        - Embedding matrix: (vocab_size × embedding_dim) learnable lookup table
        - LSTM cell: Processes sequences with hidden_dim internal memory
        - Linear classifier: Maps hidden_dim features to output_dim classes
        """
        super(PhraseClassifier, self).__init__()
        
        """
        Embedding layer converts discrete token indices to continuous vectors.
        
        Creates learnable lookup table of shape (vocab_size, embedding_dim) where:
        - Each row represents one vocabulary token's dense embedding
        - Token index i maps to embedding_matrix[i, :] vector
        - Gradients update embeddings during training for task-specific representations
        """
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        """
        LSTM layer processes embedded sequences to learn temporal dependencies.
        
        Default configuration (no batch_first=True) expects input shape:
        (sequence_length, batch_size, embedding_dim)
        
        LSTM maintains internal cell state and hidden state to capture:
        - Long-range dependencies across sequence positions
        - Sequential patterns relevant for classification
        - Context from earlier tokens only (this LSTM is unidirectional; bidirectional=False by default)
        """
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        """
        Final classification layer maps LSTM output to class logits.
        
        Linear transformation from hidden_dim → output_dim:
        - Takes sentence-level representation (final LSTM hidden state)
        - Produces raw scores (logits) for each possible class
        - Applying softmax at inference converts the logits to class probabilities
        """
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        """
        Process input token sequence through embedding, LSTM, and classification layers.

        Args:
            x (torch.Tensor): Input token indices of shape (seq_len, batch_size)
                Each element x[i,j] is vocabulary index for token i in sequence j

        Returns:
            torch.Tensor: Classification logits of shape (batch_size, output_dim)
                Raw scores for each class, typically passed through softmax for probabilities
        """
        
        """
        Convert token indices to dense embedding vectors.
        
        Dimension transformation: (seq_len, batch_size) → (seq_len, batch_size, embedding_dim)
        
        Process:
        - Each token index x[i,j] gets replaced by embedding_matrix[x[i,j], :]
        - Result: embedded[i,j,:] contains embedding_dim-dimensional vector for token i in sequence j
        - Embeddings are learned parameters that capture semantic token relationships
        """
        embedded = self.embedding(x)  # (seq_len, batch_size, embedding_dim)
        
        """
        Process embedded sequence through LSTM to extract sequential patterns.
        
        LSTM returns tuple: (output_states, (final_hidden_state, final_cell_state))
        
        Dimension transformations:
        - Input: (seq_len, batch_size, embedding_dim)
        - output: (seq_len, batch_size, hidden_dim) - hidden states for ALL timesteps
        - hidden: (1, batch_size, hidden_dim) - FINAL hidden state only
        - cell: (1, batch_size, hidden_dim) - final cell state (discarded with _)
        
        Key insight: We use only final hidden state for classification, assuming it
        contains sufficient information about the entire sequence for classification.
        """
        output, (hidden, _) = self.lstm(embedded)  # output: (seq_len, batch_size, hidden_dim)
                                                   # hidden: (1, batch_size, hidden_dim)
        
        """
        Generate classification logits from final LSTM hidden state.
        
        Dimension transformation: (1, batch_size, hidden_dim) → (batch_size, output_dim)
        
        Process:
        1. hidden.squeeze(0): Remove first dimension (1) → (batch_size, hidden_dim)
        2. self.fc(): Linear transformation → (batch_size, output_dim)
        
        The squeeze(0) removes the num_layers dimension from the hidden state, since
        LSTM returns shape (num_layers, batch_size, hidden_dim) and we have 1 layer.
        """
        logits = self.fc(hidden.squeeze(0))  # (batch_size, output_dim)
        return logits

In [13]:
"""
Initialize and train PhraseClassifier model using CrossEntropyLoss and Adam optimizer.

Model configuration:
- vocab_size: len(vocab) - total number of unique tokens in vocabulary
- embedding_dim=10: Compact 10-dimensional word embeddings for token representation
- hidden_dim=20: Small LSTM hidden state size (20 units) for sequence processing
- output_dim=3: Three classification classes (0 = sports, 1 = technology, 2 = food)

The architecture is deliberately small, but with only six training phrases the model
can still overfit, as noted where the data is defined.
"""
model = PhraseClassifier(len(vocab), embedding_dim=10, hidden_dim=20, output_dim=3)

"""
Configure CrossEntropyLoss for multi-class classification training.

CrossEntropyLoss combines LogSoftmax and NLLLoss, making it ideal for classification:
- Automatically applies softmax to model outputs (raw logits)
- Computes negative log-likelihood loss for true class labels
- Provides stable gradients for backpropagation in classification tasks
- Expects raw logits from model (not probabilities) and integer class labels

Mathematical formulation: Loss = -log(softmax(logits)[true_class])
"""
criterion = nn.CrossEntropyLoss()

"""
Initialize Adam optimizer for adaptive learning rate optimization.

Adam configuration:
- lr=0.001: Conservative learning rate suitable for small text classification models
- Default momentum parameters (β₁=0.9, β₂=0.999) provide stable convergence
- Adaptive per-parameter learning rates help with embedding layer training
- Efficient for sparse gradients common in NLP tasks with vocabulary lookups
"""
optimizer = optim.Adam(model.parameters(), lr=0.001)

"""
Train model for 100 epochs using full-batch gradient descent.

Training characteristics:
- epochs=100: Enough for the loss to fall on this toy dataset, though the value printed after training (about 0.43) suggests it has not fully converged
- Full-batch training: Uses entire dataset per iteration (suitable for small datasets)
- No data shuffling: May be suboptimal for larger datasets but acceptable here
- Loss reporting every 100 epochs: With epochs=100, the loss is printed once, after the final epoch
"""
epochs = 100

for epoch in range(epochs):
    """
    Clear accumulated gradients from previous iteration.
    Essential step preventing gradient accumulation across batches.
    """
    optimizer.zero_grad()
    
    """
    Forward pass: compute model predictions for input sequences.
    
    inputs.t() transposes input tensor from (batch_size, seq_len) to (seq_len, batch_size)
    to match LSTM's expected input format (since batch_first=False by default).
    
    Returns: predictions tensor of shape (batch_size, output_dim=3) containing
    raw logits for each of the 3 classification classes.
    """
    predictions = model(inputs.t())
    
    """
    Compute CrossEntropyLoss between predictions and true labels.
    
    Arguments:
    - predictions: (batch_size, 3) raw logits from model
    - labels: (batch_size,) true class indices [0, 1, or 2]
    
    CrossEntropyLoss automatically applies softmax to predictions and computes
    negative log-likelihood for the correct class.
    """
    loss = criterion(predictions, labels)
    
    """
    Backward pass: compute gradients for all model parameters.
    Uses automatic differentiation to compute ∂loss/∂parameter for:
    - Embedding matrix weights
    - LSTM gate parameters (input, forget, cell, output gates)
    - Final linear layer weights and biases
    """
    loss.backward()
    
    """
    Update model parameters using computed gradients and Adam optimizer.
    Adam applies adaptive learning rates and momentum to each parameter
    for stable and efficient convergence.
    """
    optimizer.step()

    """
    Monitor training progress by printing loss every 100 epochs.
    loss.item() extracts scalar value from tensor for logging.
    Expected behavior: loss should decrease over epochs indicating learning.
    """
    if (epoch + 1) % 100 == 0:
        print(f"Epoch: {epoch + 1}, Loss: {loss.item()}")

Epoch: 100, Loss: 0.4263986051082611
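
To put the reported loss in context, the sketch below reproduces `nn.CrossEntropyLoss` by hand and computes the loss of a classifier that assigns equal scores to all three classes. The logits are random stand-ins, not outputs of the trained model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 3)                  # stand-in for model outputs
labels = torch.tensor([0, 0, 1, 1, 2, 2])   # same labels as `categories`

criterion = nn.CrossEntropyLoss()           # default reduction is the batch mean
loss = criterion(logits, labels)

# Manual version: mean of -log(softmax(logits)[true_class]) over the batch.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(6), labels].mean()

# Uniform scores give every class probability 1/3, so the loss is ln(3).
uniform = criterion(torch.zeros(6, 3), labels)

print(loss.item(), manual.item())
print(uniform.item(), math.log(3))
```

A loss of about 0.43 is therefore clearly below the chance level of ln(3) ≈ 1.10, but still some distance from zero.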


In [14]:
"""
Inference pipeline for testing trained PhraseClassifier on new text samples.

This block demonstrates proper inference practices:
1. Disable gradient computation for memory efficiency and speed
2. Set model to evaluation mode to handle dropout/batch normalization appropriately
3. Preprocess new text using same tokenization and padding as training
4. Generate predictions and convert logits to class predictions

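The pipeline maps unknown words to the <PAD> index so that the lookup never raises a KeyError.
The model cannot distinguish such words from padding, so predictions on out-of-vocabulary
terms carry little information; a dedicated <UNK> token is the more common choice.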
"""
with torch.no_grad():  # Disable gradient calculation for inference
    """
    Set model to evaluation mode for proper inference behavior.
    
    model.eval() affects certain layers:
    - Dropout layers: Disabled (no random neuron dropping)
    - BatchNorm layers: Use running statistics instead of batch statistics
    - Other layers: May have different behavior between training/eval modes
    
    Essential for consistent, deterministic predictions during inference.
    """
    model.eval()
    
    """
    Define test phrases for model evaluation.
    These are new phrases, one per intended category (sports, technology, food).
    The model will predict a category index for each phrase based on training.
    """
    test_phrases = ["incredible match", "newest gadget", "yummy cake"]
    
    """
    Tokenize test phrases using training vocabulary with unknown word handling.
    
    Process for each phrase:
    1. Split phrase into individual tokens (words)
    2. Map each token to vocabulary index using vocab.get()
    3. Unknown tokens fall back to the <PAD> index (0)
    4. Result: List of token indices representing each phrase
    
    Example: "great match" → [vocab["great"], vocab["<PAD>"]] = [1, 0], because "match" is not in vocab.
    None of the words in test_phrases appear in vocab, so each phrase encodes to [0, 0].
    """
    encoded_test_phrases = [
        [vocab.get(word, vocab["<PAD>"]) for word in phrase.split()] 
        for phrase in test_phrases
    ]
    
    """
    Pad sequences to uniform length matching training data requirements.
    
    Padding process:
    1. Calculate padding needed: max_length - current_phrase_length
    2. Append <PAD> token indices to reach max_length
    3. Ensures all sequences have identical length for batch processing
    
    Example: If max_length=5 and phrase has 2 tokens:
    [token1, token2] → [token1, token2, <PAD>, <PAD>, <PAD>]
    """
    padded_test_phrases = [
        phrase + [vocab["<PAD>"]] * (max_length - len(phrase)) 
        for phrase in encoded_test_phrases
    ]
    
    """
    Convert preprocessed sequences to PyTorch tensor for model input.
    
    torch.LongTensor creates integer tensor suitable for embedding layer lookup.
    Shape: (batch_size=3, seq_len=max_length) where each element is vocab index.
    LongTensor is required for nn.Embedding which expects integer indices.
    """
    test_inputs = torch.LongTensor(padded_test_phrases)
    
    """
    Generate model predictions and extract most probable class for each input.
    
    Processing steps:
    1. test_inputs.t(): Transpose (3, seq_len) → (seq_len, 3) for LSTM input format
    2. model(...): Forward pass returns (3, output_dim=3) logits tensor
    3. torch.argmax(..., dim=1): Find class index with highest logit per sample
    4. Result: (3,) tensor with predicted class indices [0, 1, or 2]
    
    argmax converts raw logits to discrete class predictions:
    [logit_class0, logit_class1, logit_class2] → class_with_highest_logit
    """
    test_predictions = torch.argmax(model(test_inputs.t()), dim=1)
    
    """
    Display predicted class indices for interpretation.
    
    Output interpretation follows the training label mapping:
    - 0: Sports
    - 1: Technology
    - 2: Food
    """
    print("Test predictions:", test_predictions)

Test predictions: tensor([2, 2, 2])
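
The identical predictions have a simple explanation: none of the test words occur in the training vocabulary, so all three phrases encode to the same all-`<PAD>` input and the model necessarily returns the same class for each. The sketch below repeats the encoding step on its own to show this. The vocabulary and `max_length = 3` are copied from the training cell.

```python
vocab = {"<PAD>": 0, "great": 1, "goal": 2, "scored": 3, "amazing": 4,
         "touchdown": 5, "new": 6, "phone": 7, "release": 8, "latest": 9,
         "laptop": 10, "model": 11, "tasty": 12, "pizza": 13,
         "delicious": 14, "burger": 15}
max_length = 3  # longest training phrase

test_phrases = ["incredible match", "newest gadget", "yummy cake"]

# Same encoding and padding as the inference cell.
encoded = [[vocab.get(word, vocab["<PAD>"]) for word in phrase.split()]
           for phrase in test_phrases]
padded = [ids + [vocab["<PAD>"]] * (max_length - len(ids)) for ids in encoded]

# Words that the vocabulary does not contain.
oov_words = [w for phrase in test_phrases for w in phrase.split() if w not in vocab]

print(padded)     # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(oov_words)  # all six test words
```

Testing with phrases that reuse training words, such as "tasty burger", would give the model something to distinguish.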
