[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/4.topics/HW4_ExploreTopics.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

# HW4: Exploring topics
In this homework, you have an open-ended task: tell us something **interesting** about a collection of text data using topic modeling. You can choose between the following datasets:

In `acl.all.tsv` you'll find 7,188 papers published at major NLP venues (ACL, EMNLP, NAACL, TACL, etc.) between 2013 and 2020.  Here is a sample of that data:

|id|year of publication|title|abstract|
|---|---|---|---|
|pimentel-etal-2020-phonotactic|2020|Phonotactic Complexity and Its Trade-offs|We present methods for calculating a measure of phonotactic complexity ...|
|wang-etal-2020-amr|2020|AMR-To-Text Generation with Graph Transformer|Abstract meaning representation (AMR)-to-text generation is the challenging task  ...|


In [None]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/acl.all.tsv

In `gutenberg.genre.tsv` you'll find 1,250 passages of English-language fiction from Project Gutenberg, categorized by genre (adventure, detective, love stories, science fiction, westerns).  Here's a sample:

|id|genre|author|passage|
|---|---|---|---|
|66390_1796	|love stories	|Prime-Stevenson, Edward	|Only a few days absence. I shall think of you. Imre. P. S. Please write me." I was amused,  ...|
|50157_2780	|detective and mystery stories|	Wheeler, Janet D.	|Edina shook her head. “They think I’ve lied to them. They think I’ve cheated them. They want their money, and you can’t rightly blame them...|


In [1]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/gutenberg.genre.tsv

--2025-09-20 16:37:48--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/gutenberg.genre.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2766963 (2.6M) [text/plain]
Saving to: ‘gutenberg.genre.tsv’


2025-09-20 16:37:48 (36.7 MB/s) - ‘gutenberg.genre.tsv’ saved [2766963/2766963]



In `convote.train.tsv`, you'll find 2,723 dialogue turns from the Convote dataset on congressional speeches, paired with the party affiliation of the speaker (Democrat/Republican). Here's a sample:

|party|passage|
|---|---|
|R	|mr. speaker , i claim the time in opposition to the motion to recommit .|
|D	|mr. speaker , on that i demand the yeas and nays .|

In [None]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote.train.tsv

Choose one of these datasets and use topic modeling as exploratory data analysis to find some interesting structure in that data.  There are only two constraints:

* You must use topic modeling to find topics in the data
* You must relate those **topics** to some aspect of the metadata. Examples of this could be: charting the rise and fall of **topics** over time (i.e., with year of publication as metadata) or associating **topics** with genre to find which topics are more aligned with one genre than another (perhaps using some of the measures of association we talked about earlier this semester).
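
As a rough illustration of the second constraint, the sketch below averages each topic's weight over the documents that share a metadata label (a genre here, but a year would work the same way). The toy `doc_topics` data and the helper name `mean_topic_weights` are hypothetical; in practice the distributions would come from your own fitted topic model.

```python
from collections import defaultdict

# Hypothetical toy input: one (metadata label, topic distribution) pair per document.
doc_topics = [
    ("western", [0.7, 0.2, 0.1]),
    ("western", [0.5, 0.4, 0.1]),
    ("detective", [0.1, 0.1, 0.8]),
    ("detective", [0.2, 0.2, 0.6]),
]


def mean_topic_weights(doc_topics):
    """Average each topic's weight over the documents sharing a label."""
    sums = {}
    counts = defaultdict(int)
    for label, dist in doc_topics:
        if label not in sums:
            sums[label] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[label][k] += p
        counts[label] += 1
    return {
        label: [s / counts[label] for s in totals]
        for label, totals in sums.items()
    }


group_means = mean_topic_weights(doc_topics)
for label, means in sorted(group_means.items()):
    print(label, [round(m, 2) for m in means])
```

Comparing rows of this table shows which topics are over- or under-represented in each group; it is only one of several reasonable ways to summarize the relationship.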

## Part 1: topic modeling

Q1: Try creating several topic models with different parameters for the _number_ of topics. Compare the outputs, and decide what number of topics is most reasonable for your dataset. **In a couple of sentences, justify your selection.**
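
Besides reading the top words by eye, one possible way to compare models is a coherence score. The block below is a minimal pure-Python sketch of a UMass-style coherence measure on hypothetical toy documents; the function name `umass_coherence` and the data are illustrative assumptions, not part of the assignment. It assumes every top word occurs in at least one document. (gensim also ships a `CoherenceModel` class that could be used instead.)

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic.

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words, where w_j
    is ranked above w_i and D counts the documents containing the word(s).
    Assumes every word in `top_words` appears in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations yields index pairs (j, i) with j < i, so w_j is higher ranked
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Hypothetical toy corpus of already-tokenized documents.
toy_docs = [
    ["horse", "saddle", "trail"],
    ["horse", "trail", "ranch"],
    ["detective", "clue", "murder"],
    ["detective", "murder", "inspector"],
]

coherent_score = umass_coherence(["horse", "trail", "saddle"], toy_docs)
mixed_score = umass_coherence(["horse", "murder", "clue"], toy_docs)
print(coherent_score, mixed_score)
```

Higher (less negative) scores mean the top words tend to co-occur in the same documents. Averaging the score over all topics of a model gives one number per choice of `num_topics`, which should complement rather than replace a qualitative reading of the topics.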

In [36]:
!pip install gensim



In [37]:
import operator
import random
import re

import nltk
import numpy as np
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from tqdm import tqdm  # for progress bars

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")

random.seed(1)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [38]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/jockers.stopwords

--2025-09-20 17:42:45--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/jockers.stopwords
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44453 (43K) [text/plain]
Saving to: ‘jockers.stopwords.1’


2025-09-20 17:42:46 (2.72 MB/s) - ‘jockers.stopwords.1’ saved [44453/44453]



In [39]:
def read_stopwords(filename):
    """Reads a file of stopwords (one per line) into a set."""
    # use a context manager so the file handle is closed after reading
    with open(filename) as f:
        return {line.rstrip() for line in f}

In [63]:
stop_words = set(stopwords.words('english'))
stop_words = stop_words | read_stopwords("jockers.stopwords")
stop_words.add("'s")
stop_words.add("n't")  # add another nonsense token that keeps appearing for some reason
# stop_words stays a set so the membership test in stopword_filter is fast

In [64]:
pattern = re.compile(r"[A-Za-z]")
def stopword_filter(word, stopwords):
    """Return True if `word` should be kept: not a stopword and contains at least one letter."""

    # no stopwords
    if word in stopwords:
        return False

    # has to contain at least one letter
    if pattern.search(word) is not None:
        return True

    return False

In [65]:
import string

def clean_token(token):
    return token.strip(string.punctuation)

In [66]:
def read_gutenberg(filename, stopwords):
    '''
    Adapted function from TopicModel.ipynb for gutenberg corpus
    Four columns: id, genre, author, text
    '''
    df = pd.read_csv(filename, sep="\t", names=["id", "genre", "author", "text"])

    def tokenize_and_process(text):
        # lowercase, tokenize, strip surrounding punctuation, then drop stopwords
        tokens = (clean_token(x) for x in nltk.word_tokenize(str(text).lower()))
        return [tok for tok in tokens if stopword_filter(tok, stopwords)]

    docs = []
    for text in tqdm(df["text"].fillna("")):
        docs.append(tokenize_and_process(text))

    names = df.genre.to_list()   # fetch genre col
    return docs, names

In [67]:
docs, names = read_gutenberg("gutenberg.genre.tsv", stop_words)

100%|██████████| 1247/1247 [00:43<00:00, 28.83it/s]


In [79]:
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.2, keep_n=10000)  # drop tokens in fewer than 5 docs or in more than 20% of docs; keep at most 10,000
corpus = [dictionary.doc2bow(text) for text in docs]
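
To make the effect of `no_below`, `no_above`, and `keep_n` concrete, here is a pure-Python sketch of the same kind of document-frequency filtering on hypothetical toy data. The helper `filter_by_doc_freq` is an illustrative assumption that follows the documented behavior of `filter_extremes`; it is not gensim's actual implementation, and edge cases such as rounding may differ.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above, keep_n):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents, then keep only the `keep_n`
    tokens with the highest document frequency."""
    # count each token once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = [tok for tok, n in doc_freq.items() if no_below <= n <= max_docs]
    # most frequent first; ties broken alphabetically for a stable result
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return set(kept[:keep_n])


# Hypothetical toy corpus: "door" is in every document, "horse" in half.
toy_docs = [
    ["door", "horse", "gun"],
    ["door", "horse"],
    ["door", "ship"],
    ["door", "clue"],
]

vocab = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(vocab)
```

Here "door" is dropped for being too common and "gun", "ship", and "clue" for being too rare, which is the same trade-off the thresholds above control on the real corpus.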

In [81]:
import gensim

lda_model20 = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
    passes=10,
    alpha='auto'
)

In [82]:
for i in range(20):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model20.show_topic(i, topn=10)])))

topic 0:	door sir girls lord girl seen letter heard hour gone
topic 1:	dr father dark work st house light suddenly heard lawyer
topic 2:	father quite door ter help girl done ye git fer
topic 3:	captain people big give gentleman girl morning course sure sir
topic 4:	mean course trouble woman things believe give girl wife child
topic 5:	horse woman house word people door mind race side brought
topic 6:	fire cash close feathers cabin mayor sound means tried door
topic 7:	pole give people sir light read division quite days big
topic 8:	woman door work kiron help nature mackinney heart days created
topic 9:	sir door leave seen heard gone father days captain stood
topic 10:	light door want sir captain vogel heard window small began
topic 11:	th horse country house sir rode yo aw big people
topic 12:	work sure people course want set big river side guess
topic 13:	bork prosecutor want water prokle feet light hour things side
topic 14:	weapon white yuh point yore heart sure figure space seen
to

In [83]:
lda_model10 = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10,
    alpha='auto'
)

In [84]:
for i in range(10):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model10.show_topic(i, topn=10)])))

topic 0:	captain coroner years sir people boy days dark felt happened
topic 1:	de door boy stood table house gave quite work word
topic 2:	woman felt door things girl people gone work white heart
topic 3:	th want house water horse perk sure uncle river things
topic 4:	father mother girl want stood woman child door felt morning
topic 5:	people girl world course quite want give name really sure
topic 6:	bork de prosecutor case matter need things big moved natural
topic 7:	marsden yuh people prale set carlin things days work big
topic 8:	door light heard open side ship water air feet white
topic 9:	sir th kennedy north heard want dead course range inspector


In [85]:
lda_model5 = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    passes=10,
    alpha='auto'
)

In [86]:
for i in range(5):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model5.show_topic(i, topn=10)])))

topic 0:	th big sure feet side water perk horse felt dead
topic 1:	people white light work things gone captain ship seen morning
topic 2:	door sir quite course people stood black want heard things
topic 3:	door heard sir began small horse kennedy returned coroner stood
topic 4:	girl house woman want sir door give de course felt


Q2: Using the topic model with the number of topics that you selected, identify a couple of topics that are coherent. **In a paragraph, interpret these topics (i.e., by giving them a name / description). Support your answer with evidence from word and/or document distributions.**
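
For evidence from document distributions, one option is to look at the documents that load most heavily on a topic. The sketch below assumes each document's topics are given as a list of `(topic_id, probability)` pairs, which is the shape gensim's `get_document_topics` returns (topics below its probability threshold are omitted). The toy data and the helper name `top_documents` are hypothetical.

```python
def top_documents(doc_topics, topic_id, n=3):
    """Return (doc_index, weight) pairs for the n documents with the
    highest weight on `topic_id`; a missing topic counts as weight 0."""
    weights = []
    for idx, dist in enumerate(doc_topics):
        weight = dict(dist).get(topic_id, 0.0)
        weights.append((idx, weight))
    return sorted(weights, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy distributions for four documents and two topics.
toy_doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(1, 1.0)],  # topic 0 omitted, as in a sparse representation
    [(0, 0.6), (1, 0.4)],
]

top = top_documents(toy_doc_topics, topic_id=0, n=2)
print(top)  # [(0, 0.9), (3, 0.6)]
```

Reading the passages at the returned indices, alongside their genre labels, is a quick check on whether the name you give a topic matches the text it actually describes.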

## Part 2: analysis

Q3: Relate the topics to some aspect of the metadata. These don't have to be the same topics that you described in question 2. **Write a paragraph synthesizing your findings, and provide quantitative evidence.**
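
One form of quantitative evidence is an association score between each document's dominant topic and its metadata label. The sketch below computes pointwise mutual information (PMI) over hypothetical `(genre, dominant_topic)` pairs. Both the choice of PMI and the helper name `pmi_table` are assumptions made for illustration; other association measures would fit equally well.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI(genre, topic) = log2(P(genre, topic) / (P(genre) * P(topic))),
    computed for every (genre, topic) pair that occurs at least once."""
    n = len(pairs)
    joint = Counter(pairs)
    genre_counts = Counter(g for g, _ in pairs)
    topic_counts = Counter(t for _, t in pairs)
    return {
        (g, t): math.log2((c / n) / ((genre_counts[g] / n) * (topic_counts[t] / n)))
        for (g, t), c in joint.items()
    }


# Hypothetical toy data: one (genre, dominant topic id) pair per document.
toy_pairs = [
    ("western", 0), ("western", 0), ("western", 1),
    ("detective", 1), ("detective", 1), ("detective", 1),
]

pmi = pmi_table(toy_pairs)
for (genre, topic), score in sorted(pmi.items()):
    print(genre, topic, round(score, 3))
```

Positive values mean the topic is the dominant one in that genre more often than chance would predict, and negative values mean less often. With small counts PMI is noisy, so it is worth reporting the underlying counts alongside the scores.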

---

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.

1.  Download a copy of the notebook file: `File > Download > Download .ipynb`.

2.  Print the notebook as PDF (via your browser, or tools like [nbconvert](https://nbconvert.readthedocs.io/en/latest/)).