# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
# Write your code here
# Import necessary libraries
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Get the list of English stopwords and extend it with custom stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Function to tokenize sentences into words
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation;
        # deacc=True additionally strips accent marks
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

# Function to remove stopwords from already-tokenized documents
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words]
            for doc in texts]

# Sample data
sample_data = [
    "The new restaurant in town exceeded all my expectations. The food was divine, and the service was impeccable.",
    "The concert was a complete disappointment. The sound quality was poor, and the performers lacked energy.",
    "I have mixed feelings about the latest novel. Some parts were captivating, but others were dull and predictable.",
]

# Preprocess the sample data
data_words = list(sent_to_words(sample_data))

# Remove stop words from the tokenized data
data_words = remove_stopwords(data_words)

# Create a dictionary mapping words to unique integer IDs
import gensim.corpora as corpora
id2word = corpora.Dictionary(data_words)

# Create a bag-of-words corpus representation
texts = data_words
corpus = [id2word.doc2bow(text) for text in texts]

# Number of topics to identify in the data.
# Fixed at 2 for this tiny sample; on the real corpus, choose K by coherence score.
num_topics = 2

# Build an LDA (Latent Dirichlet Allocation) model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# Print the keywords associated with each topic
from pprint import pprint
pprint(lda_model.print_topics())

# Get the topic distribution for each document in the corpus
doc_lda = lda_model[corpus]








[(0,
  '0.048*"latest" + 0.048*"others" + 0.047*"captivating" + 0.047*"novel" + '
  '0.047*"predictable" + 0.046*"mixed" + 0.046*"feelings" + 0.046*"parts" + '
  '0.045*"dull" + 0.040*"food"'),
 (1,
  '0.049*"complete" + 0.048*"disappointment" + 0.048*"sound" + 0.048*"quality" '
  '+ 0.047*"lacked" + 0.047*"concert" + 0.047*"performers" + 0.047*"energy" + '
  '0.046*"poor" + 0.037*"new"')]
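The cell above fixes K at 2, while the task asks for K to be chosen by coherence score. As a minimal sketch of what such a score measures, the block below computes a UMass-style coherence by hand using only the standard library. The function name `umass_coherence` and the toy documents are assumptions made for illustration; gensim's `CoherenceModel` offers a comparable measure via `coherence='u_mass'`, and its exact averaging may differ from this sketch.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Average UMass-style coherence of one topic's top words.

    top_words: words ordered from most to least probable in the topic.
    tokenized_docs: list of token lists (the preprocessed corpus).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        # log of smoothed conditional document frequency
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy corpus, not the exercise data
docs = [
    ["restaurant", "food", "service", "divine"],
    ["concert", "sound", "quality", "poor"],
    ["restaurant", "food", "poor", "service"],
]
coherent = umass_coherence(["restaurant", "food", "service"], docs)
mixed = umass_coherence(["restaurant", "concert", "divine"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words drawn from unrelated documents. To choose K, one would train a model for each candidate K, average this score over its topics, and keep the K with the highest value.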


## (2) (10 points) Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
# Write your code here
# Import necessary libraries
import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string, strip_short, stem_text
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Sample data
sample_data = [
    "The new restaurant in town exceeded all my expectations. The food was divine, and the service was impeccable.",
    "The concert was a complete disappointment. The sound quality was poor, and the performers lacked energy.",
    "I have mixed feelings about the latest novel. Some parts were captivating, but others were dull and predictable.",
]

# Create a DataFrame for the sample data
df = pd.DataFrame({'Review': sample_data})

# Preprocess the text: lowercase, remove stopwords and punctuation, drop short tokens, and stem
def preprocess(text):
    CUSTOM_FILTERS = [lambda x: x.lower(), remove_stopwords, strip_punctuation, strip_short, stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)
    return text

# Apply the preprocessing to the sample data
df['Review_Text (Clean)'] = df['Review'].apply(lambda x: preprocess(x))

# Create a dictionary with the corpus
corpus = df['Review_Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# Convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]

# The coherence score measures how interpretable the topics are to humans.
# Compute it for models with different numbers of topics
for i in range(2, 11):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=df['Review_Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

# Perform SVD on the bag of words with the LsiModel to extract 2 topics
lsi = LsiModel(bow, num_topics=2, id2word=dictionary)

# Find the 10 words with the strongest association to the derived topics
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))

# Display the preprocessed (tokenized and stemmed) reviews
corpus






Coherence score with 2 clusters: 0.2778931922701693
Coherence score with 3 clusters: 0.2880807171704993
Coherence score with 4 clusters: 0.25709984398343083
Coherence score with 5 clusters: 0.26749651812680003
Coherence score with 6 clusters: 0.273816583032251
Coherence score with 7 clusters: 0.27789319227016934
Coherence score with 8 clusters: 0.25709984398343083
Coherence score with 9 clusters: 0.27381658303225104
Coherence score with 10 clusters: 0.28722471444322994
Words in 0: -0.333*"sound" + -0.333*"qualiti" + -0.333*"poor" + -0.333*"complet" + -0.333*"energi" + -0.333*"perform" + -0.333*"lack" + -0.333*"disappoint" + -0.333*"concert" + -0.000*"predict".
Words in 1: 0.333*"food" + 0.333*"expect" + 0.333*"impecc" + 0.333*"restaur" + 0.333*"divin" + 0.333*"servic" + 0.333*"town" + 0.333*"exceed" + 0.333*"new" + -0.000*"mix".


0    [new, restaur, town, exceed, expect, food, div...
1    [concert, complet, disappoint, sound, qualiti,...
2    [mix, feel, latest, novel, part, captiv, dull,...
Name: Review_Text (Clean), dtype: object
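The cell above prints a coherence score for each candidate K but then hard-codes `num_topics=2`. As a small sketch, the block below picks K directly from the printed scores. The helper name `best_num_topics` is an assumption, and the values are copied from the output above.

```python
# Coherence scores copied from the output above (number of topics -> c_v score)
coherence_by_k = {
    2: 0.2778931922701693,
    3: 0.2880807171704993,
    4: 0.25709984398343083,
    5: 0.26749651812680003,
    6: 0.273816583032251,
    7: 0.27789319227016934,
    8: 0.25709984398343083,
    9: 0.27381658303225104,
    10: 0.28722471444322994,
}

def best_num_topics(scores):
    # Highest coherence wins; iterating K in ascending order means
    # that ties are resolved in favor of the smaller K
    return max(sorted(scores), key=lambda k: scores[k])

best_k = best_num_topics(coherence_by_k)
print(best_k, coherence_by_k[best_k])
```

On these numbers the highest score belongs to K=3. With only three sample documents the differences between scores are small, so the choice should be re-checked on the full corpus.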

## (3) (10 points) Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
# Write your code here
import nltk
nltk.download('all')
!pip install preprocess
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline







In [None]:
!pip install pyLDAvis



In [None]:
# Import necessary libraries
import pyLDAvis
pyLDAvis.enable_notebook()
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
sample_data = [
    "The new restaurant in town exceeded all my expectations. The food was divine, and the service was impeccable.",
    "The concert was a complete disappointment. The sound quality was poor, and the performers lacked energy.",
    "I have mixed feelings about the latest novel. Some parts were captivating, but others were dull and predictable.",
]

# Preprocess the sample data (reuses preprocess() defined in section (2))
sample_data_clean = [preprocess(text) for text in sample_data]

# Join the preprocessed texts into a list of strings
sample_data_clean_text = [' '.join(text) for text in sample_data_clean]

# Create a CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data_clean_text)

# Rank the terms of each document by raw count. Each document stands in for
# a "topic" here; this is a simple proxy and does not train an lda2vec model.
top = 10
feature_names = vectorizer.get_feature_names_out()
topic_to_topwords = {}
for j in range(X.shape[0]):
    top_words_indices = np.argsort(X[j].toarray()[0])[::-1][:top]
    top_words = [feature_names[i] for i in top_words_indices]
    msg = 'Topic %i has top words: %s' % (j, ', '.join(top_words))
    print(msg)
    topic_to_topwords[j] = top_words


Topic 0 has top words: town, food, servic, restaur, divin, new, exceed, expect, impecc, complet
Topic 1 has top words: lack, complet, concert, qualiti, disappoint, poor, perform, energi, sound, food
Topic 2 has top words: captiv, feel, predict, dull, part, novel, mix, latest, complet, concert


## (4) (10 points) Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
pip install bertopic


In [None]:
# Write your code here
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic
import matplotlib.pyplot as plt

# Download NLTK data
nltk.download('stopwords')
nltk.download('punkt')





True

In [10]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus (headers, footers, and quotes are kept here)
data = fetch_20newsgroups(subset='all')['data']

# nr_topics="auto" lets BERTopic merge similar topics after clustering
topic_model = BERTopic(nr_topics="auto", calculate_probabilities=True, verbose=True)
topics, _ = topic_model.fit_transform(data)

# Topic frequencies sorted by size; topic -1 collects outlier documents, so drop it
topic_overview = topic_model.get_topic_freq()
topic_overview = topic_overview[topic_overview["Topic"] != -1]

# Summarize each topic by its five highest-weighted words
for topic_num, freq in topic_overview.values:
    topic_words = topic_model.get_topic(topic_num)
    topic_summary = ", ".join([word[0] for word in topic_words[:5]])
    print(f"Topic {topic_num}: {topic_summary} (Freq: {freq})")


2023-11-10 05:43:19,874 - BERTopic - Transformed documents to Embeddings
2023-11-10 05:44:08,808 - BERTopic - Reduced dimensionality
2023-11-10 05:46:18,536 - BERTopic - Clustered reduced embeddings
2023-11-10 05:46:41,088 - BERTopic - Reduced number of topics from 352 to 196


Topic 0: image, jpeg, scsi, drive, for (Freq: 2029)
Topic 1: game, team, he, games, players (Freq: 1613)
Topic 2: god, that, jesus, is, you (Freq: 842)
Topic 3: president, government, clipper, that, the (Freq: 672)
Topic 4: gun, guns, militia, weapons, you (Freq: 555)
Topic 5: israel, israeli, jews, arab, peace (Freq: 378)
Topic 6: car, bike, mustang, v8, toyota (Freq: 293)
Topic 7: insurance, health, cancer, medical, patients (Freq: 258)
Topic 8: homosexual, homosexuality, gay, homosexuals, sexual (Freq: 223)
Topic 9: amp, sale, cd, receiver, cds (Freq: 215)
Topic 10: armenian, turkish, armenians, were, armenia (Freq: 205)
Topic 11: windows, dos, nt, memory, swap (Freq: 154)
Topic 12: printer, deskjet, hp, printers, laser (Freq: 142)
Topic 13: modem, modems, serial, fax, courier (Freq: 137)
Topic 14: space, venus, mars, mission, launch (Freq: 132)
Topic 15: radar, detector, detectors, ir, alarm (Freq: 113)
Topic 16: polygon, points, sphere, algorithm, polygons (Freq: 105)
Topic 17: mo

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.

In [11]:
# Write your answer here (no code needed for this question)
'''
To compare the topic modeling techniques (LDA, NMF, LSA, and BERTopic), we must take into account factors such as topic quality and interpretability.

LDA's per-topic word probability distributions are quite easy to interpret.

NMF and LSA, while not exactly the same as LDA, also produce topics that are sufficiently interpretable.

Because BERTopic relies on a BERT embedding model, it may not yield semantically relevant topics in the same way as the standard approaches.

In terms of topic coherence, NMF and LDA frequently score well.

LSA typically has lower topic coherence.

Depending on the BERT model selected, BERTopic can have strong coherence.

In terms of scalability, LDA, NMF, and LSA can handle large datasets.

BERTopic may need a lot of processing power, particularly with large BERT models.

The models' robustness and hyperparameter sensitivity, as well as the availability of pre-trained models, should also be examined for a given project.

LDA is often considered the best option for interpretability because of its clear word probability distributions for each topic.

NMF and LSA might be useful options when better scalability combined with strong interpretability is desired.

BERTopic provides semantic context, yet it may require more computing power and its interpretability varies.

Based on these factors and a comparison of the four models, I think LSA is the most effective.

'''


