Loading and Preparing the IMDb Dataset

In this step, I load a real-world dataset, the IMDb movie reviews dataset, and prepare it for training.
I sample 5,000 reviews for faster experimentation and convert each one into the format spaCy expects: a (text, {"cats": {...}}) tuple.

In [1]:
import pandas as pd
import random

# Load CSV
df = pd.read_csv("Data/IMDB Dataset.csv")

# Keep only a random subset for faster training (fixed seed for reproducibility)
df = df.sample(5000, random_state=42)

# Convert to spaCy format: (text, {"cats": {label: 0 or 1}})
train_data = []
for review, sentiment in zip(df["review"], df["sentiment"]):
    sentiment = sentiment.upper()  # POSITIVE or NEGATIVE
    cats = {"POSITIVE": 1 if sentiment == "POSITIVE" else 0,
            "NEGATIVE": 1 if sentiment == "NEGATIVE" else 0}
    train_data.append((review, {"cats": cats}))
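
To make the target format concrete, here is a small sketch that applies the same conversion to two made-up rows. It uses only the standard library's csv module so it runs without the dataset file; the helper name to_spacy_format and the sample reviews are my own illustrations, not part of the IMDb data.

```python
import csv
import io

# Two invented rows with the same column names as the IMDb CSV (assumption).
SAMPLE_CSV = """review,sentiment
"A moving, beautifully shot film.",positive
"Two hours I will never get back.",negative
"""


def to_spacy_format(review, sentiment):
    """Return one (text, {"cats": {...}}) tuple for a labelled review."""
    label = sentiment.strip().upper()
    cats = {
        "POSITIVE": int(label == "POSITIVE"),
        "NEGATIVE": int(label == "NEGATIVE"),
    }
    return (review, {"cats": cats})


rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
sample_data = [to_spacy_format(r["review"], r["sentiment"]) for r in rows]

for item in sample_data:
    print(item)
```

Each tuple carries exactly one label set to 1, which matches the binary positive/negative labelling used above.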


Training and Testing the Model

I first train a blank English spaCy pipeline with a text categorizer for five epochs, then run the trained model on an unseen review.
For each review, the pipeline returns a score for each label, which gives me a quick check of whether the model generalizes beyond the training data.

In [2]:
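import random

import spacy
from spacy.training.example import Example
from spacy.util import minibatch, compounding

# Start from a blank English pipeline and add a text categorizer.
# textcat_multilabel scores each label independently, so the two
# scores are not forced to sum to 1.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel", last=True)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# nlp.initialize() replaces the deprecated nlp.begin_training() in spaCy v3
optimizer = nlp.initialize()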

for epoch in range(5):
    random.shuffle(train_data)
    losses = {}
    batches = minibatch(train_data, size=compounding(16.0, 64.0, 1.5))
    for batch in batches:
        examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
        nlp.update(examples, sgd=optimizer, losses=losses)
    print(f"Epoch {epoch+1} – Loss: {losses['textcat_multilabel']:.4f}")


Epoch 1 – Loss: 15.6838
Epoch 2 – Loss: 6.0309
Epoch 3 – Loss: 2.7492
Epoch 4 – Loss: 2.0106
Epoch 5 – Loss: 1.2127
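
The batch sizes in the training loop come from compounding(16.0, 64.0, 1.5). As I understand the schedule, it starts at 16, multiplies by 1.5 at each step, and is capped at 64. The sketch below re-implements that idea in plain Python to show the batch sizes it would give for 5,000 examples; compounding_sketch and minibatch_sketch are illustrative stand-ins, not the spaCy functions themselves.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, ... and never exceed stop (assumed behaviour)."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose lengths follow the schedule."""
    it = iter(items)
    while True:
        batch = list(islice(it, int(next(sizes))))
        if not batch:
            break
        yield batch


schedule = compounding_sketch(16.0, 64.0, 1.5)
batch_lengths = [len(b) for b in minibatch_sketch(range(5000), schedule)]

print(batch_lengths[:6])                       # [16, 24, 36, 54, 64, 64]
print(len(batch_lengths), batch_lengths[-1])   # 81 6
```

Because the training loop builds a new schedule inside each epoch, every epoch starts again with small batches and grows toward the cap.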


In [3]:
texts = [
    """
Incredible starcast that shines throughout. Creative camera work including the one where it is attached to the bumper of the ride to show you how the state of affairs really is. Delicate subject handled carefully, this one is all adrenalin! Di Caprio's role is similar to the one in Once Upon a Time in Hollywood as he appears to be a hapless man, struggling often to comical tones, Sean Penn as the Colonel plays his part earnestly but it is finally Benicio who shines in a short but a part that stays with long after the credits roll over.
"""]
for text in texts:
    doc = nlp(text)
    print(doc.cats)


{'POSITIVE': 0.9879484176635742, 'NEGATIVE': 0.010546259582042694}
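
To turn the scores into a single prediction, one simple option is to take the highest-scoring label and only accept it above a cutoff. This is a minimal sketch in plain Python; the helper top_label and the 0.5 threshold are my own assumptions, not something spaCy provides.

```python
def top_label(cats, threshold=0.5):
    """Return (label, score) for the best label, or (None, score) if below threshold."""
    label, score = max(cats.items(), key=lambda kv: kv[1])
    if score >= threshold:
        return label, score
    return None, score


# Scores copied from the output above
cats = {'POSITIVE': 0.9879484176635742, 'NEGATIVE': 0.010546259582042694}
label, score = top_label(cats)
print(label, round(score, 4))  # POSITIVE 0.9879
```

The two scores come from a multilabel categorizer, so they are independent and do not have to sum to 1; that is why I pick the maximum rather than relying on one score alone.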


Saving the Trained Model

Finally, I save the trained spaCy pipeline to disk with nlp.to_disk("movie_sentiment_model").
I can later reload it with spacy.load("movie_sentiment_model") and classify new movie reviews without retraining.

In [4]:
nlp.to_disk("movie_sentiment_model")