### Text classification using LSTM
Objective: Create a simple LSTM model using PyTorch to perform text classification on a dataset of short phrases. The steps are to:

- Create a vocabulary to represent words as indices.
- Tokenize, encode, and pad the phrases.
- Convert the phrases and categories to PyTorch tensors.
- Instantiate the LSTM model with the vocabulary size, embedding dimensions, hidden dimensions, and output dimensions.
- Define the loss function and optimizer.
- Train the model for a number of epochs.
- Test the model on new phrases and print the category predictions.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
# Phrases (textual data) and their category labels (0 for sports, 1 for technology, 2 for food)
# Note: this dataset is far too small to realistically train an LSTM model. Feel free to use
# a relevant data source or create your own dummy data for this exercise.
phrases = ["great goal scored", "amazing touchdown", "new phone release", "latest laptop model", "tasty pizza", "delicious burger"]
categories = [0, 0, 1, 1, 2, 2]

### Create a vocabulary to represent words as indices

In [3]:
vocab = {"<PAD>": 0, "great": 1, "goal": 2,
         "scored": 3, "amazing": 4, "touchdown": 5,
         "new": 6, "phone": 7, "release": 8,
         "latest": 9, "laptop": 10, "model": 11,
         "tasty": 12, "pizza": 13, "delicious": 14,
         "burger": 15
        }
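The hand-written dictionary above can also be generated from the phrases. The following is a minimal sketch, assuming words are split on whitespace and indexed in order of first appearance; `build_vocab` is a hypothetical helper, not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): reserve index 0 for
# padding, then give each new word the next free index.
def build_vocab(phrases, pad_token="<PAD>"):
    vocab = {pad_token: 0}
    for phrase in phrases:
        for word in phrase.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

phrases = ["great goal scored", "amazing touchdown", "new phone release",
           "latest laptop model", "tasty pizza", "delicious burger"]
built_vocab = build_vocab(phrases)
print(built_vocab)
```

For these six phrases the result should match the manual mapping, with `"<PAD>"` at 0 and `"burger"` at 15.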

### Tokenize, encode, and pad phrases

In [15]:
encoded = [[vocab[word] for word in phrase.split()] for phrase in phrases]
max_length = max(len(phrase) for phrase in encoded)
padded = [phrase + [vocab['<PAD>']] * (max_length - len(phrase)) for phrase in encoded]

In [16]:
padded

[[1, 2, 3], [4, 5, 0], [6, 7, 8], [9, 10, 11], [12, 13, 0], [14, 15, 0]]
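The manual padding above could alternatively be done with PyTorch's `pad_sequence` utility. This is a sketch under the assumption that the phrases have already been encoded as lists of indices; the `encoded` values are copied from the notebook so the snippet runs on its own.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Encoded phrases before padding (same indices as in the notebook)
encoded = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10, 11], [12, 13], [14, 15]]

# pad_sequence expects a list of tensors; batch_first=True gives (batch, seq_len)
sequences = [torch.tensor(seq, dtype=torch.long) for seq in encoded]
padded_tensor = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded_tensor)
```

The result is already a `LongTensor` of shape `(6, 3)`, so a separate conversion step is not needed with this approach.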

### Convert phrases and categories to PyTorch tensors

In [33]:
inputs = torch.LongTensor(padded)
labels = torch.LongTensor(categories)
inputs, labels

(tensor([[ 1,  2,  3],
         [ 4,  5,  0],
         [ 6,  7,  8],
         [ 9, 10, 11],
         [12, 13,  0],
         [14, 15,  0]]),
 tensor([0, 0, 1, 1, 2, 2]))

In [34]:
# Define LSTM model
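class PhraseClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        # Map each word index to a dense vector
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        # batch_first defaults to False, so the LSTM expects (seq_len, batch, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)
        # Project the final hidden state to one score per category
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (seq_len, batch) tensor of word indices
        embedded = self.embedding(x)
        # hidden: (1, batch, hidden_dim), the hidden state after the last time step
        output, (hidden, _) = self.lstm(embedded)
        logits = self.fc(hidden.squeeze(0))
        return logits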
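To make the tensor shapes in the forward pass concrete, here is a minimal sketch that runs the same three layers outside the class. The layer sizes mirror the ones used in this notebook, and the random `batch` tensor is only a stand-in for the real inputs.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=16, embedding_dim=10)
lstm = nn.LSTM(input_size=10, hidden_size=20)   # batch_first=False by default
fc = nn.Linear(20, 3)

batch = torch.randint(0, 16, (6, 3))            # (batch=6, seq_len=3), stand-in data
x = batch.t()                                   # (seq_len=3, batch=6)
embedded = embedding(x)                         # (3, 6, 10)
output, (hidden, cell) = lstm(embedded)         # output: (3, 6, 20), hidden: (1, 6, 20)
logits = fc(hidden.squeeze(0))                  # (6, 3)

print(embedded.shape, output.shape, hidden.shape, logits.shape)
# For a single-layer, unidirectional LSTM the last output step equals the final hidden state
print(torch.allclose(output[-1], hidden[0]))
```

This is why the inputs must be transposed before being passed to the model: without `batch_first=True`, the first dimension is read as the sequence length.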

### Instantiate model and define loss and optimizer

In [70]:
model = PhraseClassifier(vocab_size=len(vocab), embedding_dim=10, hidden_dim=20, output_dim=3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
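`nn.CrossEntropyLoss` takes raw logits and integer class labels, so the model does not need a softmax layer. The sketch below, using random stand-in logits rather than real model output, shows that it matches `log_softmax` followed by the negative log-likelihood loss.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 3)                      # stand-in for model output
labels = torch.tensor([0, 0, 1, 1, 2, 2])

criterion = nn.CrossEntropyLoss()
loss_ce = criterion(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_ce, loss_manual))

# With equal scores for all three classes the loss is ln(3), about 1.0986
uniform_loss = criterion(torch.zeros(6, 3), labels)
print(uniform_loss.item(), math.log(3))
```

A loss near ln(3) therefore corresponds to a three-class model that is still guessing uniformly.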

### Train the model

In [71]:
num_epoch = 100
model.train()

for epoch in range(1, num_epoch + 1):
    # Clear the gradients
    optimizer.zero_grad()

    # Forward pass; transpose to (seq_len, batch) because the LSTM is not batch-first
    predictions = model(inputs.t())
    # Calculate the loss and perform optimization step
    loss = criterion(predictions, labels)
    loss.backward()

    optimizer.step()
    
    if epoch % 10 == 0:
        print(f"Epoch: {epoch}, Loss: {loss.item()}")

Epoch: 10, Loss: 1.0620263814926147
Epoch: 20, Loss: 1.0145480632781982
Epoch: 30, Loss: 0.9610541462898254
Epoch: 40, Loss: 0.8977944850921631
Epoch: 50, Loss: 0.8221752643585205
Epoch: 60, Loss: 0.7337680459022522
Epoch: 70, Loss: 0.6349571943283081
Epoch: 80, Loss: 0.5307366847991943
Epoch: 90, Loss: 0.42742419242858887
Epoch: 100, Loss: 0.33165740966796875
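The loss alone does not say how many training phrases are classified correctly. A minimal sketch of an accuracy check follows; the logits are made-up illustrative values, not the output of the trained model.

```python
import torch

# Made-up logits for six phrases and three classes (illustrative values only)
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.2, 1.8, 0.4],
                       [0.9, 0.7, 0.1],   # class 0 scores highest, but the label is 1
                       [0.1, 0.3, 2.2],
                       [0.2, 0.1, 1.9]])
labels = torch.tensor([0, 0, 1, 1, 2, 2])

predicted = logits.argmax(dim=1)                       # highest-scoring class per row
accuracy = (predicted == labels).float().mean().item()
print(predicted, accuracy)
```

In the notebook, the same two lines could be applied to `model(inputs.t())` inside a `torch.no_grad()` block.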


### Test the model

In [72]:
# Test the model on new phrases
with torch.no_grad():
    test_phrases = ["incredible match", "newest gadget", "yummy cake"]
    encoded_test_phrases = [[vocab.get(word, vocab["<PAD>"]) for word in phrase.split()] for phrase in test_phrases]
    padded_test_phrases = [phrase + [vocab["<PAD>"]] * (max_length - len(phrase)) for phrase in encoded_test_phrases]
    test_inputs = torch.LongTensor(padded_test_phrases)
    test_predictions = torch.argmax(model(test_inputs.t()), dim=1)
    print("Test predictions:", test_predictions)

Test predictions: tensor([2, 2, 2])


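All three test phrases receive the same prediction, which shows that the model does not generalize. With only six training phrases it has overfitted, and none of the test words appear in the vocabulary, so every test phrase is encoded as a sequence of `<PAD>` tokens and the model sees identical inputs. We can improve the testing accuracy by getting more data, which would also enlarge the vocabulary.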
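The identical predictions can be traced to the encoding step. Here is a minimal check, with the notebook's `vocab` and `max_length` copied in so the snippet runs on its own:

```python
vocab = {"<PAD>": 0, "great": 1, "goal": 2, "scored": 3, "amazing": 4,
         "touchdown": 5, "new": 6, "phone": 7, "release": 8, "latest": 9,
         "laptop": 10, "model": 11, "tasty": 12, "pizza": 13,
         "delicious": 14, "burger": 15}
max_length = 3
test_phrases = ["incredible match", "newest gadget", "yummy cake"]

# Words missing from the vocabulary
unknown = [w for p in test_phrases for w in p.split() if w not in vocab]
print(unknown)

# Same encoding and padding as in the test cell: unknown words fall back to <PAD>
encoded_test = [[vocab.get(w, vocab["<PAD>"]) for w in p.split()] for p in test_phrases]
padded_test = [p + [vocab["<PAD>"]] * (max_length - len(p)) for p in encoded_test]
print(padded_test)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Since the three encoded inputs are identical, the model returns the same class for each of them regardless of how well it was trained. A dedicated unknown-word token or pretrained word embeddings are possible remedies, in addition to more training data.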