# Semantique

Using natural language understanding and deep learning to predict movie star ratings from free-text reviews.

In this project, we develop an NLP-based system that infers a star rating (1–5) from an English-language movie review. We preprocess text data, train a deep learning model using PyTorch, and deploy the model for real-time prediction using a lightweight web app.


**Table of Contents**

1. [Setup](#setup)

    - [Imports & Libraries](#imports--libraries)

    - [Data Collection](#data-collection)

2. [Preprocessing](#preprocessing)

    - [Loading Data](#loading-data)

    - [Cleaning the Text](#cleaning-the-text)

    - [Building the Vocabulary](#building-the-vocabulary)

    - [Encoding and Padding](#encoding-and-padding)

    - [Creating DataLoaders](#creating-dataloaders)

    - [Loading Saved Vocabulary](#loading-saved-vocabulary)

3. [Training](#training)

    - [Model Architecture](#model-architecture)

    - [Training Configuration](#training-configuration)

    - [Training Loop](#training-loop)

4. [Inference](#inference)

    - [Loading the Model](#loading-the-model)

    - [Testing on unseen data](#testing-on-unseen-data)


**NOTE:** *The data was preprocessed separately and training was done on a GPU-enabled machine. The point of this notebook is to explain the process rather than to be used for training. To reproduce the model and run the code, refer to the project's repository and follow the Usage instructions.*

Project Repository: [semantique (github)](https://github.com/SepehrAkbari/semantique/)

## **Setup**

### Imports & Libraries

In [1]:
import os
import sys
import tarfile
import urllib.request
import glob
import re
import json
from collections import Counter
from random import shuffle

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split



### Data Collection

We first download and extract the IMDb Large Movie Review dataset if it does not already exist locally.

In [2]:
def download_imdb():
    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    filename = "aclImdb_v1.tar.gz"
    data_dir = "../data"
    foldername = os.path.join(data_dir, "aclImdb")

    if not os.path.exists(foldername):
        print("Downloading IMDb dataset...")
        urllib.request.urlretrieve(url, filename)
        print("Extracting...")
        # Extract into data_dir so the archive's top-level aclImdb/ folder
        # ends up at the path checked above.
        os.makedirs(data_dir, exist_ok=True)
        with tarfile.open(filename, "r:gz") as tar:
            tar.extractall(path=data_dir)
        print("Done!")
    else:
        print("IMDb dataset already exists.")

## **Preprocessing**

### Loading Data

In this step we load the positive and negative movie reviews into memory, extracting the rating from filenames.

In [3]:
def load_reviews(path):
    reviews = []
    for filepath in glob.glob(path + "/*.txt"):
        with open(filepath, encoding='utf8') as f:
            text = f.read()
        rating = int(filepath.split("_")[-1].split(".")[0])
        reviews.append((text, rating))
    return reviews
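
As a small illustration of the filename parsing above: files in the IMDb dataset are named `<id>_<rating>.txt`, so the rating is the part between the underscore and the extension. The sketch below is a slightly more explicit variant of the same idea; the helper name `rating_from_filename` is hypothetical and not part of the project.

```python
import os

def rating_from_filename(filepath):
    """Return the integer rating encoded in an '<id>_<rating>.txt' filename."""
    # Drop the directory and the extension, e.g. "123_8".
    stem = os.path.splitext(os.path.basename(filepath))[0]
    # The rating is whatever follows the last underscore.
    return int(stem.rsplit("_", 1)[1])

print(rating_from_filename("../data/aclImdb/train/pos/123_8.txt"))  # 8
print(rating_from_filename("../data/aclImdb/train/neg/45_2.txt"))   # 2
```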

### Cleaning the Text

We clean the text by removing HTML tags, stripping punctuation, and lowercasing, then split it into word tokens with a regular expression. Tokenizing lets the model work with individual words rather than raw strings. Stop words (e.g., "the", "is", "and") are not removed here; every token is kept.

In [8]:
def regex_tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def preprocess(text):
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return regex_tokenize(text)

Here is an example:

In [9]:
preprocess("I loved this movie! <br> It was amazing!!!")

['i', 'loved', 'this', 'movie', 'it', 'was', 'amazing']
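
One edge case is worth knowing about: when a tag sits between two words with no surrounding whitespace (for instance `<br /><br />` directly after a full stop), removing it outright fuses the neighbouring words into one token. The sketch below reproduces `preprocess` as defined above and compares it with `preprocess_spaced`, a hypothetical variant that substitutes a space for each tag. The variant is only an illustration; it is not the function used to train the model, and switching to it would change the vocabulary.

```python
import re

def regex_tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def preprocess(text):
    # Same steps as the notebook: tags are replaced with an empty string.
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return regex_tokenize(text)

def preprocess_spaced(text):
    # Hypothetical variant: replace each tag with a space instead.
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return regex_tokenize(text)

sample = "Great film.<br /><br />The acting was superb."
print(preprocess(sample))         # ['great', 'filmthe', 'acting', 'was', 'superb']
print(preprocess_spaced(sample))  # ['great', 'film', 'the', 'acting', 'was', 'superb']
```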

### Building the Vocabulary

We build a vocabulary from the 20,000 most frequent words in the training reviews. Capping the vocabulary keeps the embedding table small and drops rare words, which are later mapped to `<UNK>`. Each word is assigned an integer index starting at 2, because indices 0 and 1 are reserved for the `<PAD>` and `<UNK>` tokens. These indices are the numerical representation the model takes as input.

In [11]:
def build_vocab(tokenized_data, vocab_size = 20000):
    counter = Counter()
    for tokens, _ in tokenized_data:
        counter.update(tokens)
    vocab = {word: idx+2 for idx, (word, _) in enumerate(counter.most_common(vocab_size))}
    vocab["<PAD>"] = 0
    vocab["<UNK>"] = 1
    return vocab

Here is an example:

In [10]:
example_tokenized = [
    (["i", "loved", "the", "movie"], 9),
    (["the", "movie", "was", "great"], 8)
]

build_vocab(example_tokenized)

{'the': 2,
 'movie': 3,
 'i': 4,
 'loved': 5,
 'was': 6,
 'great': 7,
 '<PAD>': 0,
 '<UNK>': 1}
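
To make the effect of the `vocab_size` cap concrete, the sketch below rebuilds `build_vocab` as defined above and applies it to a toy corpus with `vocab_size=2`. The toy data and the `unk_fraction` variable are made up for illustration; the point is only that words outside the top-k fall back to the `<UNK>` index when encoded.

```python
from collections import Counter

def build_vocab(tokenized_data, vocab_size=20000):
    counter = Counter()
    for tokens, _ in tokenized_data:
        counter.update(tokens)
    vocab = {word: idx + 2 for idx, (word, _) in enumerate(counter.most_common(vocab_size))}
    vocab["<PAD>"] = 0
    vocab["<UNK>"] = 1
    return vocab

# Toy corpus: "the" appears 3 times, "movie" twice, everything else once.
toy_data = [
    (["the", "movie", "was", "great"], 8),
    (["the", "movie", "flopped"], 2),
    (["the", "plot"], 3),
]

small_vocab = build_vocab(toy_data, vocab_size=2)
print(small_vocab)  # {'the': 2, 'movie': 3, '<PAD>': 0, '<UNK>': 1}

# Words that did not make the cut are encoded as <UNK> (index 1).
tokens = ["the", "plot", "was", "great"]
encoded = [small_vocab.get(t, small_vocab["<UNK>"]) for t in tokens]
unk_fraction = encoded.count(small_vocab["<UNK>"]) / len(encoded)
print(encoded)       # [2, 1, 1, 1]
print(unk_fraction)  # 0.75
```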

### Encoding and Padding

We encode the tokens into integers using the vocabulary, mapping out-of-vocabulary words to `<UNK>`, and pad or truncate each review to a fixed length of 300 tokens. Shorter reviews are right-padded with 0 (the `<PAD>` index) and longer ones are cut off, so every review has the same shape and the reviews can be stacked into a single batch tensor.

In [13]:
def encode(tokens, vocab):
    return [vocab.get(token, vocab["<UNK>"]) for token in tokens]

def pad_sequence(seq, max_len=300):
    if len(seq) < max_len:
        return seq + [0] * (max_len - len(seq))
    else:
        return seq[:max_len]

Here is an example:

In [16]:
tokens = ["the", "movie", "was", "awesome"]
vocab = {"the": 2, "movie": 3, "was": 4, "<UNK>": 1}

seq = encode(tokens, vocab)
print("Encoded:", seq)

seq_padded = pad_sequence(seq, max_len=10)
print("Padded:", seq_padded)

Encoded: [2, 3, 4, 1]
Padded: [2, 3, 4, 1, 0, 0, 0, 0, 0, 0]
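
The example above only exercises the padding branch. As a complementary sketch, the snippet below shows truncation with the same `pad_sequence` logic and counts how many real (non-padding) tokens a padded sequence contains. The `real_length` variable is illustrative and not something the project computes.

```python
def pad_sequence(seq, max_len=300):
    if len(seq) < max_len:
        return seq + [0] * (max_len - len(seq))
    return seq[:max_len]

long_seq = [5, 6, 7, 8, 9]
short_seq = [5, 6]

# Sequences longer than max_len lose their tail.
print(pad_sequence(long_seq, max_len=3))   # [5, 6, 7]
# Shorter sequences are right-padded with the <PAD> index 0.
print(pad_sequence(short_seq, max_len=3))  # [5, 6, 0]

# Count the real tokens in a padded sequence (index 0 is reserved for <PAD>).
padded = pad_sequence(short_seq, max_len=3)
real_length = sum(1 for idx in padded if idx != 0)
print(real_length)  # 2
```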


### Creating DataLoaders

We convert the encoded reviews into PyTorch tensors, organize them into datasets, and wrap them with DataLoaders for batch-wise training and evaluation. This is important as it allows us to efficiently load and process the data in batches, which is essential for training. We also create a DataLoader for the test set, which allows us to evaluate the performance of our model on unseen data.

In [18]:
def get_dataloaders(train_data, test_data, vocab, max_len=300, batch_size=64):
    X_train = [pad_sequence(encode(tokens, vocab), max_len) for tokens, _ in train_data]
    y_train = [rating for _, rating in train_data]

    X_test = [pad_sequence(encode(tokens, vocab), max_len) for tokens, _ in test_data]
    y_test = [rating for _, rating in test_data]

    X_train = torch.tensor(X_train, dtype=torch.long)
    y_train = torch.tensor(y_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.long)
    y_test = torch.tensor(y_test, dtype=torch.float32)

    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    return train_loader, test_loader

After this final step of preprocessing we now have a dataset that follows this structure:

```python
X_batch = [
    [2, 3, 4, 1, 0, ..., 0],    # review 1
    [7, 6, 9, 2, 5, ..., 0],    # review 2
    ...
    [1, 2, 3, 4, 5, ..., 0],  # review n
]   # Shape: [n, 300]

y_batch = [8.0, 9.0, ..., 6.0]  # Shape: [n]
```
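
To check the batch shapes described above without the real dataset, here is a minimal sketch that wraps random stand-in tensors in a `TensorDataset` and iterates over a `DataLoader`. The sizes (10 reviews, batch size 4) are arbitrary assumptions chosen to show that the last batch can be smaller than the rest.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 "reviews" of 300 token indices each, with ratings in 1..10.
X = torch.randint(low=0, high=100, size=(10, 300), dtype=torch.long)
y = torch.randint(low=1, high=11, size=(10,)).float()

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

# Collect the shape of every batch the loader yields.
shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
for x_shape, y_shape in shapes:
    print(x_shape, y_shape)
# (4, 300) (4,)
# (4, 300) (4,)
# (2, 300) (2,)
```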

### Loading Saved Vocabulary

We define utility functions to export the vocabulary stored in the model checkpoint to a JSON file and to load it back. This allows us to reuse the same word-to-index mapping across runs and to encode new reviews at prediction time with the `encode` function defined earlier.

In [19]:
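def extract_vocab():
    # Read the vocabulary stored alongside the weights in the training checkpoint.
    checkpoint = torch.load("../models/lstm_model.pt", map_location="cpu")
    vocab = checkpoint["vocab"]

    # Write it next to the checkpoint so inference can load it without torch.
    with open("../models/vocab.json", "w") as f:
        json.dump(vocab, f)

def load_vocab(path="../models/vocab.json"):
    with open(path, "r") as f:
        vocab = json.load(f)
    # Make sure every index is an int, whatever the file contains.
    return {k: int(v) for k, v in vocab.items()}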
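
A quick way to convince yourself that the mapping survives the JSON round trip is sketched below. It uses a temporary directory and a four-word toy vocabulary (both assumptions for illustration) rather than the project's real files.

```python
import json
import os
import tempfile

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.json")
    # Save the mapping, then read it back the same way load_vocab does.
    with open(path, "w") as f:
        json.dump(toy_vocab, f)
    with open(path, "r") as f:
        loaded = {k: int(v) for k, v in json.load(f).items()}

print(loaded == toy_vocab)                    # True
print(loaded.get("unseen", loaded["<UNK>"]))  # 1
```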

## **Training**

### Model Architecture

We define a simple LSTM-based regression model to predict movie ratings from text sequences. An LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that is well-suited for sequence prediction tasks. It is capable of learning long-term dependencies in the data, which makes it a good choice for NLP tasks. We use an embedding layer to convert the input words into dense vectors, followed by an LSTM layer to process the sequences. Finally, we use a linear layer to output the predicted rating.

In [20]:
class ReviewRegressor(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, dropout=0.3, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        last_hidden = lstm_out[:, -1, :]
        out = self.dropout(last_hidden)
        out = self.fc(out).squeeze(1)
        return out

Embedding Layer:

- This layer turns each word index into a dense vector of size 128.

- Now we have a tensor of shape [64, 300, 128], meaning each review is now a sequence of word embeddings.

LSTM Layer:

- This embedding sequence is passed into a 2-layer LSTM.

- The LSTM processes each word in context, capturing the sequence structure — things like grammar, sentiment buildup, negations, and so on.

- It outputs a new sequence of the same length: [64, 300, 128], but now with contextualized vectors.

Last Hidden State:

- Instead of using all 300 outputs, we only keep the last hidden state.

- This gives us one 128-dimensional vector per review, representing the model’s understanding of the entire sequence.

Dropout:

- We apply dropout to reduce overfitting and help the model generalize better.

Fully Connected Layer:

- This final dense layer maps the 128-dimensional vector to a single number.

- It’s a regression output: a single float trained against ratings on the 1 to 10 scale. The layer itself is unbounded, so predictions are clamped to that range at inference time.
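
The shapes in the walkthrough above can be traced with a few standalone layers. This is a sketch that mirrors the layer sizes of `ReviewRegressor` but uses freshly initialized weights and a random batch of 4 reviews (an arbitrary assumption), so only the shapes are meaningful, not the values.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 4, 300, 20002

# Random token indices standing in for a batch of encoded reviews.
x = torch.randint(0, vocab_size, (batch_size, seq_len))

embedding = nn.Embedding(vocab_size, 128, padding_idx=0)
lstm = nn.LSTM(128, 128, num_layers=2, dropout=0.3, batch_first=True)
fc = nn.Linear(128, 1)

embedded = embedding(x)           # one 128-d vector per token
lstm_out, _ = lstm(embedded)      # one contextualized vector per time step
last_hidden = lstm_out[:, -1, :]  # keep only the final time step
out = fc(last_hidden).squeeze(1)  # one scalar per review

print(tuple(embedded.shape))     # (4, 300, 128)
print(tuple(lstm_out.shape))     # (4, 300, 128)
print(tuple(last_hidden.shape))  # (4, 128)
print(tuple(out.shape))          # (4,)
```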

### Training Configuration

To configure the training procedure, we first prepare the data by creating training and validation splits. We reserve 10% of the training set for validation to monitor generalization performance during training.

The model is instantiated using the `ReviewRegressor` architecture and moved to a GPU when one is available, falling back to the CPU otherwise.

We use the Adam optimizer with a learning rate of 1e-3, and Mean Squared Error (MSE) is chosen as the loss function to directly regress onto the continuous rating scale.

Batch-wise data loading is applied to both training and validation sets to enable mini-batch stochastic optimization.
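
As a rough check on the sizes involved in the 90/10 split, the arithmetic can be sketched as follows. The figure of 25,000 reviews is an assumption based on the size of the IMDb training split; the number actually passed to the split depends on how the data is prepared in the repository.

```python
import math

n_train = 25_000  # assumption: size of the IMDb training split
batch_size = 64

# Same arithmetic as the split: 10% for validation, the rest for training.
val_size = int(0.1 * n_train)
train_size = n_train - val_size

# A DataLoader keeps the final partial batch by default, hence the ceiling.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size)        # 22500 2500
print(train_batches, val_batches)  # 352 40
```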

In [21]:
def config_training():
    # prepare_data() is not defined in this notebook; it is assumed to run the
    # preprocessing steps above and return the loaders and the vocabulary.
    train_loader, test_loader, vocab = prepare_data()

    full_train_dataset = train_loader.dataset
    val_size = int(0.1 * len(full_train_dataset))
    train_size = len(full_train_dataset) - val_size
    train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

    batch_size = 64
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = ReviewRegressor(vocab_size=len(vocab)).to(device)
    loss_fn = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    # Return everything the training loop needs.
    return train_loader, val_loader, vocab, device, model, loss_fn, optimizer


### Training Loop

The training process optimizes the model using mini-batch gradient descent across multiple epochs. After each epoch, the model's performance is evaluated on a held-out validation set to monitor overfitting.

The Mean Squared Error (MSE) loss between the predicted ratings and the ground truth ratings is computed for both training and validation sets. Validation loss is used as the primary metric for model checkpointing.

The model is saved whenever an improvement in validation loss is detected. To prevent unnecessary computation and overfitting, early stopping is implemented: if validation loss does not improve for a specified number of consecutive epochs ("patience"), training is terminated automatically.

In [18]:
def main():
    train_loader, val_loader, vocab, device, model, loss_fn, optimizer = config_training()
    
    epochs = 50
    best_val_loss = float('inf')
    patience = 10
    patience_counter = 0
    save_dir = "../models"
    os.makedirs(save_dir, exist_ok=True)

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            pred = model(xb)
            loss = loss_fn(pred, yb)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                pred = model(xb)
                loss = loss_fn(pred, yb)
                val_loss += loss.item()

        avg_val_loss = val_loss / len(val_loader)

        print(f"Epoch {epoch+1}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
            torch.save({
                'model_state_dict': model.state_dict(),
                'vocab': vocab,
                'params': {
                    'embed_dim': 128,
                    'hidden_dim': 128,
                    'num_layers': 2
                }
            }, os.path.join(save_dir, "lstm_model.pt"))
            print("Best model saved!")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered.")
                break

    print("Training complete.")

For each batch:

- We move data to device.

- Perform a forward pass to get predictions.

- Compute loss between predicted and actual ratings with MSE.

- Backpropagate the error.

- Update the model weights with the optimizer.

- Accumulate the loss for tracking.

Here is the training loop output:

IMDb dataset already exists.

Epoch 1, Train Loss: 13.1095, Val Loss: 12.2072

Best model saved!

Epoch 2, Train Loss: 12.1692, Val Loss: 12.2012

Best model saved!

Epoch 3, Train Loss: 12.1631, Val Loss: 12.2016

Epoch 4, Train Loss: 12.0537, Val Loss: 12.2294

Epoch 5, Train Loss: 11.6965, Val Loss: 12.0832

Best model saved!

Epoch 6, Train Loss: 10.2653, Val Loss: 10.6756

Best model saved!

Epoch 7, Train Loss: 8.8276, Val Loss: 7.7987

Best model saved!

Epoch 8, Train Loss: 9.7357, Val Loss: 8.6655

Epoch 9, Train Loss: 7.3917, Val Loss: 8.3216

Epoch 10, Train Loss: 6.4967, Val Loss: 6.2363

Best model saved!

Epoch 11, Train Loss: 4.6747, Val Loss: 6.0862

Best model saved!

Epoch 12, Train Loss: 3.9636, Val Loss: 5.0654

Best model saved!

Epoch 13, Train Loss: 3.4546, Val Loss: 4.7474

Best model saved!

Epoch 14, Train Loss: 3.0516, Val Loss: 4.5928

Best model saved!

Epoch 15, Train Loss: 2.7172, Val Loss: 4.6426

Epoch 16, Train Loss: 2.5244, Val Loss: 4.6294

Epoch 17, Train Loss: 2.3367, Val Loss: 4.6185

Epoch 18, Train Loss: 2.1732, Val Loss: 4.8058

Epoch 19, Train Loss: 2.0793, Val Loss: 4.6943

Epoch 20, Train Loss: 1.9065, Val Loss: 4.3475

Best model saved!

Epoch 21, Train Loss: 1.7136, Val Loss: 4.5039

Epoch 22, Train Loss: 1.6124, Val Loss: 4.6814

Epoch 23, Train Loss: 1.4649, Val Loss: 4.5539

Epoch 24, Train Loss: 1.3505, Val Loss: 4.5139

Epoch 25, Train Loss: 1.3097, Val Loss: 4.7228

Epoch 26, Train Loss: 1.1824, Val Loss: 4.7661

Epoch 27, Train Loss: 1.0835, Val Loss: 4.7407

Epoch 28, Train Loss: 1.0418, Val Loss: 4.7476

Epoch 29, Train Loss: 0.9816, Val Loss: 5.2269

Epoch 30, Train Loss: 0.9809, Val Loss: 4.8599

Early stopping triggered.

Training complete.
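
To see how the patience rule produces this log, the sketch below replays the early-stopping logic on the validation losses printed above, with the same `patience = 10`. No model is involved; the loss values are copied from the output, and the RMSE line is added only to express the best loss in rating points.

```python
import math

# Validation losses for epochs 1..30, copied from the training output.
val_losses = [
    12.2072, 12.2012, 12.2016, 12.2294, 12.0832, 10.6756, 7.7987, 8.6655,
    8.3216, 6.2363, 6.0862, 5.0654, 4.7474, 4.5928, 4.6426, 4.6294,
    4.6185, 4.8058, 4.6943, 4.3475, 4.5039, 4.6814, 4.5539, 4.5139,
    4.7228, 4.7661, 4.7407, 4.7476, 5.2269, 4.8599,
]

patience = 10
best_val_loss = float("inf")
best_epoch = None
stopped_epoch = None
patience_counter = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_val_loss:
        # Improvement: remember it and reset the counter.
        best_val_loss, best_epoch = loss, epoch
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            stopped_epoch = epoch
            break

print(best_epoch, stopped_epoch)  # 20 30

# MSE is in squared rating points; the square root is easier to interpret.
rmse = math.sqrt(best_val_loss)
print(round(rmse, 2))  # 2.09
```

The last improvement is at epoch 20, and the ten epochs that follow never beat it, so training stops at epoch 30. The saved checkpoint therefore corresponds to a validation RMSE of roughly 2.09 points on the 1 to 10 scale, while the training loss keeps falling, which suggests the model overfits after epoch 20.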

## **Inference**

### Loading the Model

We load the vocabulary mapping from the saved JSON file, recreate the model with the same hyperparameters used in training, and load the best weights from the training checkpoint.

In [None]:
vocab = load_vocab("../models/vocab.json")

model_params = {"embed_dim": 128, "hidden_dim": 128, "num_layers": 2}
model = ReviewRegressor(vocab_size=len(vocab), **model_params)

checkpoint = torch.load("../models/lstm_model.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

ReviewRegressor(
  (embedding): Embedding(20002, 128, padding_idx=0)
  (lstm): LSTM(128, 128, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

### Testing on unseen data

To sanity-check the model, we write a few reviews of our own that are not part of the training or validation sets and have no ground-truth rating. We predict a rating for each and judge it qualitatively against the tone and content of the review. Predictions on the 1 to 10 scale are clamped, rounded, and halved to give a star rating in half-star steps.

In [26]:
def convert_to_stars(pred_scale):
    pred_scale = round(min(max(pred_scale, 1), 10))
    return pred_scale / 2

sample_reviews = [
    "A beautiful masterpiece, visually stunning and emotionally resonant.",
    "Great acting, I would maybe watch it again.",
    "The movie was a dull and cliched mess from start to finish."
]

for review in sample_reviews:
    tokens = preprocess(review)
    encoded = pad_sequence(encode(tokens, vocab))
    x = torch.tensor([encoded])

    with torch.no_grad():
        pred = model(x).item()
    
    stars = convert_to_stars(pred)
    print(f"Review: {review}\nPredicted Rating: {stars} out of 5 stars\n")

Review: A beautiful masterpiece, visually stunning and emotionally resonant.
Predicted Rating: 4.5 out of 5 stars

Review: Great acting, I would maybe watch it again.
Predicted Rating: 3.5 out of 5 stars

Review: The movie was a dull and cliched mess from start to finish.
Predicted Rating: 1.0 out of 5 stars
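
The behaviour of `convert_to_stars` at the edges is easy to check in isolation. The sketch below repeats the function as defined above and feeds it a few made-up raw predictions, including out-of-range values and exact halves. Note that Python's built-in `round` rounds halves to the nearest even integer, so 6.5 rounds down to 6 while 7.5 rounds up to 8.

```python
def convert_to_stars(pred_scale):
    # Clamp to the 1..10 rating scale, round to an integer, then halve.
    pred_scale = round(min(max(pred_scale, 1), 10))
    return pred_scale / 2

print(convert_to_stars(8.7))   # 4.5
print(convert_to_stars(12.3))  # 5.0  (clamped to 10)
print(convert_to_stars(-0.4))  # 0.5  (clamped to 1)
print(convert_to_stars(6.5))   # 3.0  (round half to even: 6.5 -> 6)
print(convert_to_stars(7.5))   # 4.0  (round half to even: 7.5 -> 8)
```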

