#Lab Assignment 8 : Topic Modeling ✈

by Sawit Koseeyaumporn 65070507238

---

# Supplementary: Topic Modeling

Objectives:
- To demonstrate students how to apply topic modeling to real-world data.
- Students will gain hands-on experience through this example.

Create a data directory and store the downloaded data (state-of-the-union.csv) in it. The data is about State of the Union addresses from 1970 to 2012.

In [1]:
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv -P data

--2024-11-05 14:13:28--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10501219 (10M) [text/csv]
Saving to: ‘data/state-of-the-union.csv’


2024-11-05 14:13:31 (5.06 MB/s) - ‘data/state-of-the-union.csv’ saved [10501219/10501219]



Read data

In [2]:
import pandas as pd

df = pd.read_csv("data/state-of-the-union.csv")

# Clean it up a little bit, removing non-word characters (numbers and ___ etc)
df.content = df.content.str.replace("[^A-Za-z ]", " ")

df.head()

Unnamed: 0,year,content
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


In [3]:
df.shape

(226, 2)

Using Gensim to perform topic modeling

In [None]:
# Run this cell if gensim has not been installed yet.
!pip install gensim

Apply `simple_process` to convert a document into a list of tokens. The input will be lowercased, tokenized, and de-accented (optional).



In [4]:
from gensim.utils import simple_preprocess

texts = df.content.apply(simple_preprocess)

In [5]:
texts

Unnamed: 0,content
0,"[george, washington, january, fellow, citizens..."
1,"[state, of, the, union, address, george, washi..."
2,"[state, of, the, union, address, george, washi..."
3,"[state, of, the, union, address, george, washi..."
4,"[state, of, the, union, address, george, washi..."
...,...
221,"[state, of, the, union, address, george, bush,..."
222,"[address, to, joint, session, of, congress, ba..."
223,"[state, of, the, union, address, barack, obama..."
224,"[state, of, the, union, address, barack, obama..."


## Task 1 : ID-to-word mapping:

In the current notebook, after calling the doc2bow method, all words are represented by their IDs. Consequently, when you use the print_topics method, only these IDs are displayed, making the output challenging to interpret and less meaningful. Therefore, Task #1 is to incorporate an ID-to-word mapping to resolve this issue.

Create a dictionary, using the texts that have already been preprocessed.

The method `doc2bow` is for converting document (a list of words) into the bag-of-words format.

In [6]:
from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=2000)

# สร้าง Bag-of-word เป็นรูปแบบ Corpus

corpus = [dictionary.doc2bow(text) for text in texts]

In [12]:
print(corpus[:10])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 2), (36, 1), (37, 2), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 1), (80, 1), (81, 2), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 3), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1)

### Finding the topics with the LDA Models and with the different number of max_topics

In [9]:
from gensim import models

max_topics = 10

for n_topics in range(1, max_topics + 1):
    lda_model = models.LdaModel(corpus=corpus, num_topics=n_topics, id2word=dictionary)
    print(f"Topics for {n_topics} topics:")
    topics = lda_model.print_topics()
    for topic in topics:
        print(topic)
    print("\n")



Topics for 1 topics:
(0, '0.003*"mexico" + 0.003*"help" + 0.003*"americans" + 0.003*"program" + 0.002*"budget" + 0.002*"per" + 0.002*"million" + 0.002*"convention" + 0.002*"programs" + 0.002*"spain"')






Topics for 2 topics:
(0, '0.004*"mexico" + 0.003*"americans" + 0.003*"help" + 0.002*"program" + 0.002*"today" + 0.002*"budget" + 0.002*"convention" + 0.002*"court" + 0.002*"spain" + 0.002*"tonight"')
(1, '0.003*"program" + 0.003*"help" + 0.003*"programs" + 0.003*"mexico" + 0.003*"per" + 0.003*"million" + 0.002*"budget" + 0.002*"americans" + 0.002*"convention" + 0.002*"minister"')






Topics for 3 topics:
(0, '0.005*"mexico" + 0.003*"program" + 0.003*"per" + 0.002*"americans" + 0.002*"help" + 0.002*"tariff" + 0.002*"million" + 0.002*"estimated" + 0.002*"convention" + 0.002*"minister"')
(1, '0.003*"mexico" + 0.003*"program" + 0.003*"americans" + 0.003*"help" + 0.002*"budget" + 0.002*"convention" + 0.002*"million" + 0.002*"minister" + 0.002*"per" + 0.002*"banks"')
(2, '0.004*"help" + 0.004*"americans" + 0.003*"tonight" + 0.003*"billion" + 0.003*"jobs" + 0.003*"budget" + 0.003*"program" + 0.003*"programs" + 0.003*"million" + 0.002*"today"')






Topics for 4 topics:
(0, '0.004*"americans" + 0.003*"help" + 0.003*"per" + 0.003*"mexico" + 0.002*"banks" + 0.002*"program" + 0.002*"budget" + 0.002*"convention" + 0.002*"court" + 0.002*"reform"')
(1, '0.004*"program" + 0.003*"help" + 0.003*"americans" + 0.003*"million" + 0.003*"mexico" + 0.003*"budget" + 0.002*"convention" + 0.002*"today" + 0.002*"billion" + 0.002*"programs"')
(2, '0.005*"mexico" + 0.004*"program" + 0.003*"americans" + 0.003*"help" + 0.003*"programs" + 0.002*"billion" + 0.002*"million" + 0.002*"budget" + 0.002*"per" + 0.002*"spain"')
(3, '0.003*"help" + 0.002*"mexico" + 0.002*"per" + 0.002*"convention" + 0.002*"budget" + 0.002*"industrial" + 0.002*"bank" + 0.002*"ships" + 0.002*"program" + 0.002*"currency"')






Topics for 5 topics:
(0, '0.006*"mexico" + 0.003*"americans" + 0.003*"minister" + 0.003*"help" + 0.002*"convention" + 0.002*"spain" + 0.002*"payment" + 0.002*"court" + 0.002*"bank" + 0.002*"intercourse"')
(1, '0.003*"americans" + 0.003*"program" + 0.002*"per" + 0.002*"budget" + 0.002*"mexico" + 0.002*"problems" + 0.002*"products" + 0.002*"gold" + 0.002*"intercourse" + 0.002*"today"')
(2, '0.005*"program" + 0.004*"million" + 0.003*"billion" + 0.003*"help" + 0.003*"per" + 0.002*"americans" + 0.002*"budget" + 0.002*"percent" + 0.002*"major" + 0.002*"programs"')
(3, '0.005*"help" + 0.003*"americans" + 0.003*"program" + 0.003*"tariff" + 0.003*"mexico" + 0.003*"budget" + 0.002*"convention" + 0.002*"per" + 0.002*"programs" + 0.002*"today"')
(4, '0.003*"mexico" + 0.003*"americans" + 0.003*"help" + 0.003*"tonight" + 0.003*"per" + 0.002*"convention" + 0.002*"budget" + 0.002*"banks" + 0.002*"spain" + 0.002*"gold"')






Topics for 6 topics:
(0, '0.005*"mexico" + 0.003*"americans" + 0.003*"help" + 0.003*"per" + 0.002*"budget" + 0.002*"jobs" + 0.002*"court" + 0.002*"gold" + 0.002*"million" + 0.002*"tariff"')
(1, '0.004*"mexico" + 0.003*"program" + 0.003*"americans" + 0.003*"help" + 0.003*"minister" + 0.003*"today" + 0.003*"spain" + 0.002*"convention" + 0.002*"million" + 0.002*"intercourse"')
(2, '0.004*"mexico" + 0.003*"americans" + 0.003*"convention" + 0.002*"spain" + 0.002*"banks" + 0.002*"budget" + 0.002*"court" + 0.002*"bill" + 0.002*"per" + 0.002*"program"')
(3, '0.004*"program" + 0.004*"help" + 0.003*"americans" + 0.002*"million" + 0.002*"mexico" + 0.002*"budget" + 0.002*"tariff" + 0.002*"convention" + 0.002*"billion" + 0.002*"per"')
(4, '0.003*"help" + 0.003*"mexico" + 0.002*"per" + 0.002*"americans" + 0.002*"today" + 0.002*"convention" + 0.002*"programs" + 0.002*"estimated" + 0.002*"cannot" + 0.002*"tariff"')
(5, '0.004*"help" + 0.004*"program" + 0.003*"million" + 0.003*"programs" + 0.003*"billi



Topics for 7 topics:
(0, '0.003*"mexico" + 0.003*"help" + 0.003*"americans" + 0.003*"budget" + 0.002*"billion" + 0.002*"program" + 0.002*"islands" + 0.002*"today" + 0.002*"intercourse" + 0.002*"programs"')
(1, '0.003*"per" + 0.003*"americans" + 0.003*"program" + 0.002*"help" + 0.002*"minister" + 0.002*"mexico" + 0.002*"court" + 0.002*"cent" + 0.002*"intercourse" + 0.002*"ships"')
(2, '0.004*"mexico" + 0.003*"help" + 0.003*"gold" + 0.003*"per" + 0.003*"americans" + 0.002*"minister" + 0.002*"convention" + 0.002*"bank" + 0.002*"banks" + 0.002*"notes"')
(3, '0.003*"spain" + 0.003*"estimated" + 0.003*"help" + 0.003*"convention" + 0.002*"program" + 0.002*"mexico" + 0.002*"americans" + 0.002*"budget" + 0.002*"per" + 0.002*"gold"')
(4, '0.004*"mexico" + 0.003*"program" + 0.003*"help" + 0.003*"jobs" + 0.003*"million" + 0.003*"americans" + 0.003*"convention" + 0.002*"today" + 0.002*"billion" + 0.002*"tonight"')
(5, '0.004*"million" + 0.004*"help" + 0.003*"program" + 0.003*"americans" + 0.003*"bu



Topics for 8 topics:
(0, '0.005*"mexico" + 0.004*"americans" + 0.003*"help" + 0.003*"budget" + 0.003*"programs" + 0.003*"million" + 0.002*"program" + 0.002*"jobs" + 0.002*"today" + 0.002*"minister"')
(1, '0.003*"program" + 0.003*"help" + 0.003*"mexico" + 0.003*"americans" + 0.003*"budget" + 0.003*"million" + 0.002*"convention" + 0.002*"gold" + 0.002*"billion" + 0.002*"court"')
(2, '0.003*"mexico" + 0.003*"americans" + 0.003*"program" + 0.003*"million" + 0.003*"help" + 0.002*"budget" + 0.002*"banks" + 0.002*"billion" + 0.002*"programs" + 0.002*"working"')
(3, '0.003*"mexico" + 0.003*"help" + 0.003*"program" + 0.003*"americans" + 0.002*"per" + 0.002*"tariff" + 0.002*"programs" + 0.002*"tonight" + 0.002*"convention" + 0.002*"cent"')
(4, '0.003*"mexico" + 0.003*"americans" + 0.002*"program" + 0.002*"convention" + 0.002*"spain" + 0.002*"bank" + 0.002*"silver" + 0.002*"payment" + 0.002*"bill" + 0.002*"per"')
(5, '0.004*"mexico" + 0.004*"program" + 0.003*"per" + 0.002*"spain" + 0.002*"ministe



Topics for 9 topics:
(0, '0.004*"help" + 0.004*"program" + 0.003*"per" + 0.003*"americans" + 0.003*"million" + 0.002*"mexico" + 0.002*"budget" + 0.002*"tariff" + 0.002*"conference" + 0.002*"court"')
(1, '0.004*"mexico" + 0.003*"million" + 0.003*"per" + 0.003*"program" + 0.002*"budget" + 0.002*"help" + 0.002*"americans" + 0.002*"courts" + 0.002*"convention" + 0.002*"cannot"')
(2, '0.004*"mexico" + 0.004*"convention" + 0.003*"gold" + 0.003*"minister" + 0.002*"spain" + 0.002*"per" + 0.002*"help" + 0.002*"mexican" + 0.002*"budget" + 0.002*"payment"')
(3, '0.004*"mexico" + 0.003*"minister" + 0.003*"cent" + 0.003*"spain" + 0.002*"intercourse" + 0.002*"islands" + 0.002*"per" + 0.002*"france" + 0.002*"program" + 0.002*"july"')
(4, '0.005*"help" + 0.004*"americans" + 0.004*"program" + 0.003*"mexico" + 0.003*"tonight" + 0.003*"re" + 0.003*"million" + 0.003*"today" + 0.003*"jobs" + 0.002*"banks"')
(5, '0.004*"americans" + 0.004*"help" + 0.003*"mexico" + 0.003*"tonight" + 0.002*"billion" + 0.002*"

In [14]:
# Run this cell if pyLDAvis has never been installed
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.6/2.6 MB[0m [31m23.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


Topic Analysis: The initial visualization from pyLDAvis shows significant overlap among many topics. Additionally, when hovering over each topic, you may notice that several words appear across multiple topics, many of which may not even contribute meaningfully to the analysis. Thus, for Task #2, try tuning the LDA parameters or incorporate additional text preprocessing. Finally, present your analysis of the topics derived from this dataset.

In [15]:
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis

# Task 2 : try tuning the LDA parameters or incorporate additional text preprocessing

In [27]:
import pandas as pd
from gensim.utils import simple_preprocess
from gensim import corpora
from gensim import models
import pyLDAvis

df = pd.read_csv("data/state-of-the-union.csv")

df.head(5)

  and should_run_async(code)


Unnamed: 0,year,content
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


In [29]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()


def clean_punctuation(text):
    # Define a regular expression pattern to match punctuation and special characters
    # Matches any character that is not a word character (\w), space (\s), or underscore (_)
    punctuation_pattern = re.compile(r'[^\w\s]|_')

    # Replace punctuation and special characters with an empty string
    cleaned_text = re.sub(punctuation_pattern, '', text)

    return cleaned_text

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Lemmatizing
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    # Remove punctuation
    text = clean_punctuation(text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df.content = df.content.str.replace("[^A-Za-z ]", " ")
df.content = df.content.apply(clean_text)

df.head()

  and should_run_async(code)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,year,content
0,1790,george washington january fellowcitizens senat...
1,1790,state union address george washington december...
2,1791,state union address george washington october ...
3,1792,state union address george washington november...
4,1793,state union address george washington december...


In [32]:
from gensim import models
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
import pandas as pd

# ... (your previous code for data loading and cleaning) ...

# Assuming 'cleaned_content' column contains the cleaned text
texts = df['content'].apply(simple_preprocess)
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=2000)
corpus = [dictionary.doc2bow(text) for text in texts]

# Define the parameter grid
param_grid = {
    'num_topics': [2, 5, 10, 15],  # Number of topics
    'passes': [10]
}

# Initialize variables to store the best model and its coherence score
best_lda_model = None
best_coherence_score = -1

# Iterate through the parameter grid (excluding alpha and eta)
for num_topics in param_grid['num_topics']:
    for passes in param_grid['passes']:
        # Train the LDA model
        lda_model = models.LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            passes=passes,
            random_state=42
        )

        # ... (rest of your code for coherence calculation and model selection) ...

                # Calculate coherence score
        coherence_model_lda = models.CoherenceModel(
            model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v'
        )
        coherence_score = coherence_model_lda.get_coherence()

        # Update best model if coherence score is improved
        if coherence_score > best_coherence_score:
            best_coherence_score = coherence_score
            best_lda_model = lda_model

# Print the best model and its coherence score
print(f"Best LDA Model: {best_lda_model}")
print(f"Best Coherence Score: {best_coherence_score}")

  and should_run_async(code)


Best LDA Model: LdaModel<num_terms=2000, num_topics=2, decay=0.5, chunksize=2000>
Best Coherence Score: 0.5241149774801065


In [34]:
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(best_lda_model, corpus, dictionary)
vis

  and should_run_async(code)
