<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/4.topics/TopicModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/4.topics/TopicModel.ipynb)

# Topic modeling movie summaries

In this notebook we'll use topic modeling to discover broad themes in a collection of movie summaries.

In [1]:
!pip install gensim



In [2]:
import operator
import random
import re

import gensim
import nltk
import numpy as np
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from tqdm import tqdm  # for progress bars

# Stopword list and tokenizer models used below
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")

random.seed(1)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [3]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/jockers.stopwords
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/movie.metadata.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/plot_summaries.txt

--2025-09-16 23:23:45--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/jockers.stopwords
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44453 (43K) [text/plain]
Saving to: ‘jockers.stopwords’


2025-09-16 23:23:46 (5.29 MB/s) - ‘jockers.stopwords’ saved [44453/44453]

--2025-09-16 23:23:46--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/movie.metadata.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15038604 (14M) [text/plain]
Saving to: ‘movie.metadata.tsv’


2025-09-16 23:23:46 (160 MB/s) - ‘movie.

## Loading stopwords

Since we're running topic modeling on texts with lots of names, we'll add the Jockers list of stopwords (which includes character names) to our stoplist. We'll also filter out any words that don't contain at least one letter.

In [4]:
def read_stopwords(filename):
    """Reads a file of stopwords (one per line) into a set."""
    # Use a context manager so the file is closed after reading
    with open(filename) as f:
        return {line.rstrip() for line in f}

In [5]:
stop_words = set(stopwords.words('english'))
stop_words = stop_words | read_stopwords("jockers.stopwords")
stop_words.add("'s")
stop_words=list(stop_words)

In [6]:
pattern = re.compile(r"[A-Za-z]")


def stopword_filter(word, stopwords):
    """Returns True if a word should be kept: not a stopword and contains a letter."""

    # no stopwords
    if word in stopwords:
        return False

    # has to contain at least one letter
    if pattern.search(word) is not None:
        return True

    return False
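
As a quick check of the two rules, the sketch below applies them to a handful of tokens. The stoplist and tokens are made up for illustration, and `keep_token` is a hypothetical stand-in that mirrors `stopword_filter` so the example runs on its own.

```python
import re

# Illustrative stand-ins, not the notebook's real stoplist or data.
toy_stopwords = {"the", "and", "'s", "harry"}
letter_pattern = re.compile(r"[A-Za-z]")


def keep_token(word, stopwords):
    """Keep a token only if it is not a stopword and contains a letter."""
    if word in stopwords:
        return False
    return letter_pattern.search(word) is not None


tokens = ["the", "wizard", "'s", "1997", "--", "harry", "r2-d2", "castle"]
kept = [t for t in tokens if keep_token(t, toy_stopwords)]
print(kept)
```

Because the check uses `in`, a set gives constant-time lookups; a list also works but is scanned element by element for every token.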

## Loading summaries

We'll read in summaries of the 5,000 movies with the highest box office revenues. This may take 3-4 minutes.

In [7]:
def read_docs(plot_filename, metadata_filename, stopwords):
    """Tokenizes the summaries of the top-n movies by box office; returns (docs, titles)."""
    n = 5000

    # only get box office top N
    metadata = pd.read_csv(metadata_filename, sep="\t", names=["movie_id", "_", "title", "year", "box_office", "_1", "_2", "_3", "_4"])
    metadata = metadata.dropna(subset=["box_office"]).sort_values(by="box_office", ascending=False)
    metadata = metadata.iloc[:n].set_index("movie_id")

    plots = pd.read_csv(plot_filename, sep="\t", names=["movie_id", "summary"])
    plots = plots.set_index("movie_id")
    plots = metadata.join(plots)

    def tokenize_and_process(text):
        return [
            x for x in nltk.word_tokenize(text.lower()) if stopword_filter(x, stopwords)
        ]

    docs = []
    for summary in tqdm(plots.summary.fillna("")):
        docs.append(tokenize_and_process(summary))

    names = plots.title.to_list()
    return docs, names

In [8]:
docs, names = read_docs("plot_summaries.txt", "movie.metadata.tsv", stop_words)

100%|██████████| 5000/5000 [03:53<00:00, 21.40it/s]


We will convert the movie summaries into a bag-of-words representation using gensim's [corpora.dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) methods.

In [9]:
# Create vocab from data; restrict vocab to only the top 10K terms that show up in at least 5 documents
# and no more than 50% of all documents

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=.5, keep_n=10000)

In [10]:
# Replace each document with (token id, count) pairs for the words in the vocab (all other words are dropped)
corpus = [dictionary.doc2bow(text) for text in docs]
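
To see what these two steps do to the data, here is a pure-Python sketch on a toy corpus. It imitates the idea of filtering a vocabulary by document frequency and then counting in-vocabulary tokens. It is not gensim's implementation, the helpers `build_vocab` and `to_bow` are hypothetical, and the ids it assigns will not match the ones gensim would choose.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration).
toy_docs = [
    ["ship", "crew", "island", "crew"],
    ["ship", "alien", "planet"],
    ["alien", "planet", "crew", "ship"],
    ["ghost", "ship"],
]


def build_vocab(docs, no_below, no_above, keep_n):
    """Pick a vocabulary by document frequency, in the spirit of filter_extremes."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most widespread terms first; ties broken alphabetically for determinism.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return {tok: idx for idx, tok in enumerate(kept[:keep_n])}


def to_bow(doc, vocab):
    """Count in-vocabulary tokens and return sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


vocab = build_vocab(toy_docs, no_below=2, no_above=0.75, keep_n=10)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

In this toy corpus "ship" appears in every document and "island" and "ghost" in only one, so all three are dropped; the last document ends up with an empty bag of words.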

## Running topic model

Now let's run a topic model on this data using gensim's built-in LDA.

In [11]:
num_topics = 20

In [12]:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=10,
    alpha='auto'
)

## Interpreting topic model

**Topic word distributions**

We can get a sense of what the topics are by printing the 10 words with the highest $P(\text{word} \mid \text{topic})$ for each topic.

In [13]:
for i in range(num_topics):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model.show_topic(i, topn=10)])))

topic 0:	fight kid stone tournament match martial boxing training arts evil
topic 1:	earth alien dr. planet space ship machine aliens human humans
topic 2:	bond collins dracula monster gibson count frankenstein phantom coffin bride
topic 3:	father village camp death wolf de lord killed family men
topic 4:	race murphy vegas batman dodge thompson pink las stark casino
topic 5:	smith president wheeler murder political film company agent death government
topic 6:	team game school film students high win play coach time
topic 7:	police hotel wife train prison apartment murder time takes tells
topic 8:	beck miller stu silver sullivan snake vic bush paulie tooth
topic 9:	ship crew island boat water captain escape find group sea
topic 10:	father tells mother family life day time house school finds
topic 11:	war army soldiers men general attack orders colonel military battle
topic 12:	find world city castle life help return named finds children
topic 13:	ghost christmas spirit mill doll oz frost
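
To make the ranking concrete, here is a minimal pure-Python sketch of the idea: normalize each topic's word weights into $P(\text{word} \mid \text{topic})$ and list the highest-probability words. The weights are invented for illustration and the helper `top_words` is hypothetical; in the notebook, `show_topic` reads the corresponding values from the trained model.

```python
# Invented word weights for two toy topics (not taken from the trained model).
toy_topic_weights = {
    0: {"ship": 6.0, "crew": 3.0, "island": 1.0},
    1: {"alien": 5.0, "planet": 4.0, "ship": 1.0},
}


def top_words(weights, topn):
    """Return the topn (word, probability) pairs for one topic."""
    total = sum(weights.values())
    probs = {word: w / total for word, w in weights.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:topn]


for topic_id, weights in toy_topic_weights.items():
    words = [word for word, _ in top_words(weights, topn=2)]
    print(f"topic {topic_id}:\t{' '.join(words)}")
```

Note that "ship" has nonzero probability under both toy topics: a word can belong to several topics, and only its rank within each topic decides whether it is printed.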

**Topic document distributions**

Another way of understanding topics is to print out the documents that have the highest topic representation -- i.e., for a given topic $k$, the documents with highest $P(\text{topic}=k \mid \text{document})$.  How much do the documents listed here align with your understanding of the topics?

In [14]:
topic_model = lda_model
topic_docs = []

for i in range(num_topics):
    topic_docs.append({})

for doc_id in range(len(corpus)):
    doc_topics = topic_model.get_document_topics(corpus[doc_id])
    for topic_num, topic_prob in doc_topics:
        topic_docs[topic_num][doc_id] = topic_prob

for i in range(num_topics):
    top_topic_terms = [term for term, _ in topic_model.show_topic(i, topn=10)]
    sorted_docs = sorted(topic_docs[i].items(), key=lambda x: x[1], reverse=True)

    print(" ".join(top_topic_terms))
    print()

    for doc_id, prob in sorted_docs[:5]:
        print(f"{i}\t{prob}\t{names[doc_id]}")
    print()

fight kid stone tournament match martial boxing training arts evil

0	0.712681233882904	Sidekicks
0	0.7078146934509277	3 Ninjas Kick Back
0	0.6681515574455261	Bloodsport
0	0.5996115803718567	The Meteor Man
0	0.5739623308181763	Salsa

earth alien dr. planet space ship machine aliens human humans

1	0.9143122434616089	Queen of Blood
1	0.6407159566879272	Star Trek: The Motion Picture
1	0.6007406115531921	Moon
1	0.589035153388977	Alien: Resurrection
1	0.552208423614502	Battle for Terra

bond collins dracula monster gibson count frankenstein phantom coffin bride

2	0.6625617146492004	Dracula: Dead and Loving It
2	0.4098891019821167	Private Eye
2	0.35834944248199463	Dracula 2000
2	0.35705670714378357	The Man with the Golden Gun
2	0.31080031394958496	Terror Train

father village camp death wolf de lord killed family men

3	0.9747889637947083	Just Heroes
3	0.936826765537262	White Fang 2: Myth of the White Wolf
3	0.9367643594741821	The Ottoman Republic
3	0.9210546016693115	Shinobi: Heart Under 

The results above pair films with topics in a way that seems generally consistent. For documents I am familiar with, such as Star Trek, the assignment to the topic containing 'alien', 'ship', and 'planet' feels fitting. I expect this representation to break down somewhat for documents whose words fall under several distinct topics. Because the list ranks movies by $P(\text{topic}=k \mid \text{document})$, it surfaces movies that are dominated by a single topic: a document's topic probabilities sum to 1, so a large share devoted to another topic would lower its probability for topic $k$ and push it down the ranking.
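
The dilution effect can be illustrated with a small sketch. The titles and topic distributions below are invented, and `rank_docs` is a hypothetical helper that does the same kind of sorting as the cell above, only on toy data.

```python
# Invented per-document topic distributions; each row sums to 1.
toy_doc_topics = {
    "Pure space film": {0: 0.90, 1: 0.10},
    "Pure war film": {0: 0.05, 1: 0.95},
    "Space war film": {0: 0.50, 1: 0.50},
}


def rank_docs(doc_topics, topic_id):
    """Sort documents by their probability for one topic, highest first."""
    scored = [(name, dist.get(topic_id, 0.0)) for name, dist in doc_topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for topic_id in (0, 1):
    print(topic_id, rank_docs(toy_doc_topics, topic_id))
```

The film that is split evenly between the two topics never ranks first for either one, even though both topics describe it well.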