# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [2]:
# Import necessary libraries
import pandas as pd
import gensim
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Load NLTK stopwords
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Read the dataset from the CSV file
df = pd.read_csv("df_file.csv")

# Preprocess the text
def preprocess_text(text):
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Lowercase; keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

# Preprocess the text data in the 'Text' column
preprocessed_texts = [preprocess_text(text) for text in df['Text']]

# Create dictionary and document-term matrix
dictionary = Dictionary(preprocessed_texts)
doc_term_matrix = [dictionary.doc2bow(text) for text in preprocessed_texts]

# Compute coherence scores for different number of topics
coherence_scores = []
for num_topics in range(2, 11):
    lda_model = LdaModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics, random_state=42)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    coherence_scores.append((num_topics, coherence_score))

# Choose the number of topics with the highest coherence score
optimal_num_topics = max(coherence_scores, key=lambda x: x[1])[0]

# Train the LDA model with the optimal number of topics
lda_model = LdaModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=optimal_num_topics, random_state=42)

# Print the topics
print("Top words for each topic:")
for topic_id, topic_words in lda_model.print_topics():
    print(f"Topic {topic_id + 1}: {topic_words}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Top words for each topic:
Topic 1: 0.008*"great" + 0.008*"product" + 0.007*"one" + 0.005*"like" + 0.005*"even" + 0.005*"good" + 0.004*"used" + 0.004*"would" + 0.004*"little" + 0.004*"use"
Topic 2: 0.009*"product" + 0.008*"time" + 0.008*"use" + 0.007*"one" + 0.006*"hair" + 0.006*"old" + 0.005*"like" + 0.005*"great" + 0.005*"daughter" + 0.005*"easy"
Topic 3: 0.010*"really" + 0.009*"good" + 0.008*"one" + 0.008*"get" + 0.006*"like" + 0.005*"great" + 0.005*"toy" + 0.005*"product" + 0.005*"well" + 0.005*"would"
Topic 4: 0.009*"like" + 0.008*"one" + 0.007*"get" + 0.007*"great" + 0.007*"love" + 0.005*"toy" + 0.005*"game" + 0.005*"even" + 0.005*"little" + 0.005*"easy"
Topic 5: 0.011*"great" + 0.010*"like" + 0.006*"one" + 0.006*"would" + 0.006*"use" + 0.006*"love" + 0.005*"product" + 0.004*"well" + 0.004*"bought" + 0.004*"even"
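
The selection loop above relies on gensim's `c_v` measure. To make concrete what a coherence score rewards, here is a minimal sketch of the simpler UMass measure in plain Python. It is not the `c_v` computation and not gensim's implementation (gensim also aggregates the pair scores differently). The function names `umass_coherence` and `pick_best_k` and the toy documents are assumptions made for illustration.

```python
import math


def umass_coherence(topic_words, tokenized_docs):
    """Sum log((D(later, earlier) + 1) / D(earlier)) over word pairs of one topic.

    D(...) counts the documents that contain all of the given words.
    topic_words is assumed to be ordered from most to least important.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for later in range(1, len(topic_words)):
        for earlier in range(later):
            d_earlier = doc_freq(topic_words[earlier])
            if d_earlier == 0:
                continue  # word never occurs; skip to avoid dividing by zero
            d_both = doc_freq(topic_words[later], topic_words[earlier])
            score += math.log((d_both + 1) / d_earlier)
    return score


def pick_best_k(scores_by_k):
    """Return the K whose coherence score is highest."""
    return max(scores_by_k, key=scores_by_k.get)


# Toy corpus (hypothetical, not the exercise data)
docs = [
    ["hair", "brush", "shaver"],
    ["hair", "brush"],
    ["toy", "game"],
    ["toy", "game", "love"],
]
coherent = umass_coherence(["hair", "brush"], docs)  # words that co-occur
incoherent = umass_coherence(["hair", "toy"], docs)  # words that never co-occur
print(coherent > incoherent)
```

Word pairs that appear in the same documents raise the score, while pairs that never co-occur lower it. This is why topics dominated by generic words that appear everywhere can still score well without being informative.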


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [3]:
# Import necessary libraries
import pandas as pd
import gensim
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Load NLTK stopwords
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Read the dataset from the CSV file
df = pd.read_csv("df_file.csv")

# Preprocess the text
def preprocess_text(text):
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Lowercase; keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

# Preprocess the text data in the 'Text' column
preprocessed_texts = [preprocess_text(text) for text in df['Text']]

# Create dictionary and document-term matrix
dictionary = Dictionary(preprocessed_texts)
doc_term_matrix = [dictionary.doc2bow(text) for text in preprocessed_texts]

# Compute coherence scores for different number of topics
coherence_scores = []
for num_topics in range(2, 11):
    lsi_model = LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics)
    coherence_model_lsi = CoherenceModel(model=lsi_model, texts=preprocessed_texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lsi.get_coherence()
    coherence_scores.append((num_topics, coherence_score))

# Choose the number of topics with the highest coherence score
optimal_num_topics = max(coherence_scores, key=lambda x: x[1])[0]

# Train the LSA model with the optimal number of topics
lsi_model = LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=optimal_num_topics)

# Print the topics
print("Top words for each topic:")
for topic_id, topic_words in lsi_model.print_topics():
    print(f"Topic {topic_id + 1}: {topic_words}")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Top words for each topic:
Topic 1: 0.277*"one" + 0.250*"like" + 0.193*"get" + 0.191*"great" + 0.183*"use" + 0.175*"hair" + 0.164*"product" + 0.159*"would" + 0.152*"time" + 0.137*"really"
Topic 2: -0.686*"hair" + -0.243*"brush" + 0.191*"toy" + -0.151*"used" + -0.147*"black" + 0.146*"great" + -0.143*"product" + -0.133*"shaver" + -0.112*"natural" + -0.111*"use"
Topic 3: -0.385*"shaver" + -0.328*"cleaning" + -0.204*"unit" + 0.203*"hair" + -0.195*"head" + 0.151*"product" + -0.140*"cycle" + 0.137*"like" + 0.113*"brush" + -0.102*"find"
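
Unlike LDA probabilities, LSA weights are signed, and the overall sign of a component is arbitrary, so a topic is often easier to read as a contrast between its positive and negative terms. The following is a minimal sketch that assumes the string format printed above; `split_poles` is a hypothetical helper name, and the input is Topic 2 copied from the output.

```python
import re

# Matches terms such as -0.686*"hair" or 0.191*"toy"
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')


def split_poles(topic_string):
    """Split a printed LSA topic into its positive and negative terms."""
    terms = [(float(weight), word) for weight, word in TERM_PATTERN.findall(topic_string)]
    positive = [word for weight, word in terms if weight > 0]
    negative = [word for weight, word in terms if weight < 0]
    return positive, negative


topic_2 = (
    '-0.686*"hair" + -0.243*"brush" + 0.191*"toy" + -0.151*"used" + '
    '-0.147*"black" + 0.146*"great" + -0.143*"product" + -0.133*"shaver" + '
    '-0.112*"natural" + -0.111*"use"'
)
positive, negative = split_poles(topic_2)
print("positive pole:", positive)
print("negative pole:", negative)
```

Read this way, Topic 2 separates hair-care vocabulary (hair, brush, shaver) from toy-related vocabulary, which a flat top-ten list does not make obvious.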


## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [15]:
# Install necessary packages
!pip install lda2vec pyldavis

# Import required libraries
import numpy as np
import pyLDAvis
import lda2vec
from sklearn.datasets import fetch_20newsgroups
from lda2vec import preprocess, Corpus, Lda2vec

# Load 20 Newsgroups dataset
remove = ('headers', 'footers', 'quotes')
texts = fetch_20newsgroups(subset='train', remove=remove).data

# Preprocess text data
tokenized_texts, vocab = preprocess.tokenize(texts)
corpus = Corpus()
corpus.update_word_count(tokenized_texts)
corpus.finalize()

# Train lda2vec model
model = Lda2vec(n_words=len(vocab), n_hidden=128, counts=corpus.word_count,
                n_vocab=len(vocab), alpha=0.1, beta=0.01, num_epochs=100)
model.fit(corpus.word_vectors)

# Get top words for each topic
top_n = 10
topic_to_topwords = {}
for j, topic_to_word in enumerate(model.topic_to_word):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    top_words = [vocab[i].strip()[:35] for i in top]
    topic_to_topwords[j] = top_words
    print('Topic %i: %s' % (j, ', '.join(top_words)))

# Visualize topics using PyLDAvis
prepared_data = pyLDAvis.prepare(model.topic_to_word, model.doc_to_topic,
                                 model.doc_lengths, vocab, R=10, lambda_step=0.01, mds='tsne')
pyLDAvis.display(prepared_data)




ImportError: cannot import name 'preprocess' from 'lda2vec' (/usr/local/lib/python3.10/dist-packages/lda2vec/__init__.py)
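
The ImportError above indicates that the installed `lda2vec` package does not expose a `preprocess` name, so the cell stops before any model is trained. One way to confirm what a package exposes before writing code against it is sketched below, using only the standard library. `can_import_name` is a hypothetical helper, and nothing here assumes any particular lda2vec API.

```python
import importlib


def can_import_name(package, name):
    """Return True if `from package import name` would be expected to succeed."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return False  # the package itself is not installed
    if hasattr(module, name):
        return True  # name is an attribute of the package
    try:
        # `from package import name` also works when name is a submodule
        importlib.import_module(f"{package}.{name}")
        return True
    except ImportError:
        return False


# Checks against the standard library, to show the behaviour
print(can_import_name("json", "dumps"))       # attribute that exists
print(can_import_name("json", "preprocess"))  # name that does not exist

# The same check for the failing import; the result depends on the environment
print(can_import_name("lda2vec", "preprocess"))
```

Running such a check for `preprocess`, `Corpus`, and `Lda2vec` in a separate cell would show which of the imported names are actually available in the installed package.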

## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [4]:

!pip install bertopic
# Import necessary libraries
from bertopic import BERTopic
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Load NLTK stopwords
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Read the dataset from the CSV file
df = pd.read_csv("df_file.csv")

# Preprocess the text
def preprocess_text(text):
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Lowercase; keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

# Preprocess the text data in the 'Text' column
preprocessed_texts = [preprocess_text(text) for text in df['Text']]

# Generate BERT embeddings
model = "paraphrase-MiniLM-L6-v2"  # Pretrained BERT model
topic_model = BERTopic(language="english", embedding_model=model)
topics, _ = topic_model.fit_transform(preprocessed_texts)

# Evaluate coherence score for different number of topics
coherence_scores = []
for num_topics in range(2, 11):
    _, coherence_score = topic_model.reduce_topics(preprocessed_texts, topics, nr_topics=num_topics, return_cohesion=True)
    coherence_scores.append((num_topics, coherence_score))

# Choose the number of topics with the highest coherence score
optimal_num_topics = max(coherence_scores, key=lambda x: x[1])[0]

# Summarize topics
top_words = topic_model.get_topics(num_topics=optimal_num_topics)
print("Top words for each topic:")
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {words}")




Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


TypeError: BERTopic.reduce_topics() got multiple values for argument 'nr_topics'
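
The TypeError above suggests that, in the installed BERTopic version, the second positional parameter of `reduce_topics` is already `nr_topics`, so passing `topics` positionally and `nr_topics=` by keyword collides. BERTopic also does not appear to return a coherence score itself. A possible workaround, sketched here under those assumptions, is to reduce the fitted model to each candidate size and score its top words with gensim's `CoherenceModel`. The helper names `top_words_per_topic` and `coherence_by_nr_topics` are hypothetical, and the sketch has not been run against the exercise data.

```python
def top_words_per_topic(topic_model, top_n=10):
    """Collect the top words of every topic except BERTopic's outlier topic (-1)."""
    topics = []
    for topic_id, word_scores in topic_model.get_topics().items():
        if topic_id == -1:
            continue  # -1 holds documents that were not assigned to any topic
        words = [word for word, _ in word_scores[:top_n] if word]
        if words:
            topics.append(words)
    return topics


def coherence_by_nr_topics(topic_model, docs, candidates):
    """Reduce a fitted BERTopic model step by step and score each size with c_v."""
    # Imported lazily so the helper above can be used without gensim installed.
    from gensim.corpora.dictionary import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    tokenized = [doc.split() for doc in docs]
    dictionary = Dictionary(tokenized)
    scores = {}
    # Largest size first: a model that was already reduced cannot grow again.
    for k in sorted(candidates, reverse=True):
        topic_model.reduce_topics(docs, nr_topics=k)
        # Keep only words the gensim dictionary knows, to avoid lookup errors.
        topics = [
            [word for word in words if word in dictionary.token2id]
            for words in top_words_per_topic(topic_model)
        ]
        topics = [words for words in topics if len(words) > 1]
        if not topics:
            continue
        coherence_model = CoherenceModel(
            topics=topics, texts=tokenized, dictionary=dictionary, coherence="c_v"
        )
        scores[k] = coherence_model.get_coherence()
    return scores
```

With a fitted model, a call such as `scores = coherence_by_nr_topics(topic_model, preprocessed_texts, range(2, 11))` followed by `max(scores, key=scores.get)` would pick K. The number of topics actually scored can differ from `k`, since the outlier topic is skipped and the initial fit may produce fewer topics than requested.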

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
"""If a single algorithm must be chosen, BERTopic is the preferred choice among the four topic modeling algorithms. This preference rests on the general properties of the methods: in this notebook only the LDA and LSA cells ran to completion, while the lda2vec and BERTopic cells ended in errors.

Here's the rationale:

1. Semantic Coherence: BERTopic builds on BERT-based sentence embeddings, which capture contextual meaning that bag-of-words models miss. Consequently, it often yields topics that are more cohesive and reflective of the underlying semantic structure of the text data.

2. Interpretability: Traditional methods such as LDA and LSA produce topics that are easy to inspect as weighted word lists. BERTopic also generates interpretable word lists, with the added advantage of reflecting nuanced semantic relationships.

3. Robustness: BERTopic combines BERT embeddings with hierarchical density-based clustering, which typically results in robust topics capable of handling complex semantic associations within the dataset.

4. Scalability: BERTopic demands significant computational resources, particularly for large datasets, because every document must be embedded. In return, it manages unstructured or intricate text data effectively.

In summary, BERTopic stands out for its capacity to capture semantic nuances and produce coherent topics, making it a compelling choice for many text analysis tasks."""


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:
Learning Experience:
The exercise provided a hands-on learning experience in text data processing and feature extraction using various topic modeling algorithms. It enhanced my understanding of LDA, LSA, lda2vec, and BERTopic and their applications in NLP.

Challenges Encountered:
The main challenge was understanding the intricacies of each algorithm and selecting appropriate parameters, particularly when deciding the number of topics based on coherence score.

Relevance to Your Field of Study:
This exercise is highly relevant to NLP as it addresses fundamental tasks like topic modeling and feature extraction. Understanding these algorithms is crucial for various NLP applications, making the exercise beneficial for my field of study.




'''