<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/4.topics/TopicModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/4.topics/TopicModel.ipynb)

# Topic modeling movie summaries

In this notebook we'll use topic modeling to discover broad themes in a collection of movie summaries.

In [1]:
!pip install gensim


In [None]:
import operator
import random
import re

import gensim
import nltk
import numpy as np
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from tqdm import tqdm  # for progress bars

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")

random.seed(1)

In [None]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/jockers.stopwords
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/movie.metadata.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/plot_summaries.txt

## Loading stopwords

Since we're running topic modeling on texts with lots of names, we'll add the Jockers list of stopwords (which includes character names) to our stoplist. We'll also filter out any words that don't contain at least one letter.

In [None]:
def read_stopwords(filename):
    """Reads a file of stopwords (one per line) into a set."""
    # use a context manager so the file handle is closed after reading
    with open(filename) as f:
        return {line.rstrip() for line in f}

In [None]:
stop_words = set(stopwords.words('english'))
stop_words = stop_words | read_stopwords("jockers.stopwords")
stop_words.add("'s")
# stop_words stays a set so the membership test in stopword_filter is a fast hash lookup

In [None]:
pattern = re.compile(r"[A-Za-z]")


def stopword_filter(word, stopwords):
    """Returns True if a word should be kept: not a stopword and contains a letter."""

    # no stopwords
    if word in stopwords:
        return False

    # has to contain at least one letter
    return pattern.search(word) is not None
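
To see what the filter keeps, here is a minimal sketch on a handful of tokens. The stoplist and tokens are made up for illustration (the notebook itself uses the NLTK and Jockers lists), and `keep_token` is a hypothetical stand-in for the same two checks.

```python
import re

# Hypothetical mini stoplist; the notebook uses NLTK + Jockers stopwords instead.
toy_stopwords = {"the", "in", "'s", "john"}
has_letter = re.compile(r"[A-Za-z]")


def keep_token(word, stopwords):
    """Keep a token only if it is not a stopword and contains at least one letter."""
    return word not in stopwords and has_letter.search(word) is not None


# Made-up, already lowercased tokens
tokens = ["john", "'s", "ship", "sinks", "in", "1912", ",", "the", "r2-d2"]
kept = [t for t in tokens if keep_token(t, toy_stopwords)]
print(kept)
```

Names, function words, bare numbers, and punctuation are dropped, while a token like `r2-d2` survives because it contains a letter.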

## Loading summaries

We'll read in summaries of the 5,000 movies with the highest box office revenues. This may take 3-4 minutes.

In [None]:
def read_docs(plot_filename, metadata_filename, stopwords):
    """Returns tokenized summaries and titles for the top-n movies by box office."""
    n = 5000

    # only get box office top N
    metadata = pd.read_csv(
        metadata_filename,
        sep="\t",
        names=["movie_id", "_", "title", "year", "box_office", "_1", "_2", "_3", "_4"],
    )
    metadata = metadata.dropna(subset=["box_office"]).sort_values(by="box_office", ascending=False)
    metadata = metadata.iloc[:n].set_index("movie_id")

    # attach each movie's summary (movies without one get NaN, filled with "" below)
    plots = pd.read_csv(plot_filename, sep="\t", names=["movie_id", "summary"])
    plots = plots.set_index("movie_id")
    plots = metadata.join(plots)

    def tokenize_and_process(text):
        # lowercase, tokenize, then drop stopwords and tokens without letters
        return [
            x for x in nltk.word_tokenize(text.lower()) if stopword_filter(x, stopwords)
        ]

    docs = []
    for summary in tqdm(plots.summary.fillna("")):
        docs.append(tokenize_and_process(summary))

    names = plots.title.to_list()
    return docs, names

In [None]:
docs, names = read_docs("plot_summaries.txt", "movie.metadata.tsv", stop_words)

We will convert the movie summaries into a bag-of-words representation using gensim's [corpora.dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) methods.

In [None]:
# Create vocab from data; restrict vocab to only the top 10K terms that show up in at least 5 documents
# and no more than 50% of all documents

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=.5, keep_n=10000)

In [None]:
# Replace each document with (token id, count) pairs for the words in the vocab (all other words are excluded)
corpus = [dictionary.doc2bow(text) for text in docs]
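
To make the two steps above concrete, here is a rough pure-Python sketch of document-frequency filtering followed by bag-of-words conversion. It is not gensim's implementation: the toy documents, the helper names `build_vocab` and `to_bow`, the id assignment, and the tie-breaking are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for the filtered summaries.
toy_docs = [
    ["alien", "ship", "crew", "alien"],
    ["crew", "ship", "storm"],
    ["alien", "crew", "planet"],
    ["crew", "heist", "bank"],
]


def build_vocab(docs, no_below, no_above, keep_n):
    """Keep terms found in at least `no_below` docs and at most `no_above` (a fraction) of docs."""
    # document frequency: number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_docs = no_above * len(docs)
    kept = [(t, df) for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # keep the keep_n terms with highest document frequency (ties broken alphabetically)
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {term: idx for idx, (term, _) in enumerate(kept[:keep_n])}


def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping out-of-vocab words."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


vocab = build_vocab(toy_docs, no_below=2, no_above=0.5, keep_n=10)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

In this toy corpus `crew` appears in every document, so the `no_above` cutoff removes it, and words seen in only one document fall below `no_below`. A document with no surviving words becomes an empty list.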

## Running topic model

Now let's run a topic model on this data using gensim's built-in LDA.

In [None]:
num_topics = 20

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=10,
    alpha='auto',
    random_state=1,  # seeds gensim's own RNG; random.seed() above does not affect it
)

## Interpreting topic model

**Topic word distributions**

We can get a sense of what the topics are by printing the top 10 words with highest $P(\text{word} \mid \text{topic})$ for each topic.

In [None]:
for i in range(num_topics):
    # show_topic returns (word, probability) pairs, sorted by probability
    top_terms = [term for term, _ in lda_model.show_topic(i, topn=10)]
    print(f"topic {i}:\t{' '.join(top_terms)}")
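
As a reminder of what these rankings mean, each topic is a probability distribution over the vocabulary, so its word probabilities sum to 1. The sketch below uses made-up counts for a single topic and a hypothetical `top_words` helper. LDA estimates these probabilities through inference rather than simple counting, so treat this only as a picture of the final distribution.

```python
# Hypothetical word counts for one topic (not output from the notebook's model).
topic_word_counts = {"ship": 30, "crew": 25, "alien": 20, "planet": 15, "bank": 10}

# Normalize counts so they form a distribution P(word | topic)
total = sum(topic_word_counts.values())
p_word_given_topic = {w: c / total for w, c in topic_word_counts.items()}


def top_words(dist, topn):
    """Return the topn (word, probability) pairs with the highest probability."""
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:topn]


top3 = top_words(p_word_given_topic, 3)
for word, prob in top3:
    print(f"{word}\t{prob:.2f}")
```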

**Topic document distributions**

Another way of understanding topics is to print out the documents that have the highest topic representation -- i.e., for a given topic $k$, the documents with highest $P(\text{topic}=k \mid \text{document})$.  How much do the documents listed here align with your understanding of the topics?

In [None]:
topic_model = lda_model
topic_docs = []

for i in range(num_topics):
    topic_docs.append({})

for doc_id in range(len(corpus)):
    doc_topics = topic_model.get_document_topics(corpus[doc_id])
    for topic_num, topic_prob in doc_topics:
        topic_docs[topic_num][doc_id] = topic_prob

for i in range(num_topics):
    top_topic_terms = [term for term, _ in topic_model.show_topic(i, topn=10)]
    sorted_docs = sorted(topic_docs[i].items(), key=lambda x: x[1], reverse=True)

    print(" ".join(top_topic_terms))
    print()

    for doc_id, prob in sorted_docs[:5]:
        print(f"{i}\t{prob}\t{names[doc_id]}")
    print()
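
One detail worth keeping in mind: by default, gensim's `get_document_topics` only returns topics whose probability is above a minimum threshold, so a document may simply be absent from a topic's ranking. The toy sketch below mimics that sparse `(topic_id, probability)` shape with made-up titles and numbers; `rank_docs_for_topic` is a hypothetical helper, not part of gensim.

```python
# Hypothetical titles and per-document topic lists in (topic_id, probability) form.
toy_titles = ["Space Film", "Heist Film", "Space Heist"]
toy_doc_topics = [
    [(0, 0.90), (1, 0.08)],
    [(1, 0.97)],  # topic 0 omitted, as if it fell below the reporting threshold
    [(0, 0.55), (1, 0.44)],
]


def rank_docs_for_topic(doc_topics, topic_id):
    """Return (doc_id, probability) pairs for one topic, highest probability first."""
    scored = []
    for doc_id, topics in enumerate(doc_topics):
        prob = dict(topics).get(topic_id)
        if prob is not None:  # skip documents where the topic was not reported
            scored.append((doc_id, prob))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_docs_for_topic(toy_doc_topics, 0)
top_titles = [toy_titles[doc_id] for doc_id, _ in ranking]
print(top_titles)
```

Here "Heist Film" never appears in the ranking for topic 0, which matches how the loop above only records the topics that were returned for each document.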