# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
import gensim
import nltk
from gensim import corpora
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import requests
from bs4 import BeautifulSoup
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download('punkt')

# URL of the webpage
url = "https://en.wikipedia.org/wiki/Short_story"

# Fetch the webpage content
response = requests.get(url)
html_content = response.text

# Parse the HTML content to extract text
soup = BeautifulSoup(html_content, 'html.parser')
text_content = soup.get_text()

# Tokenization and preprocessing
tokenized_data = [word_tokenize(text.lower()) for text in text_content.split()]
tokenized_data = [[word for word in doc if word not in stop_words] for doc in tokenized_data]

# Create dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data]

# Function to clean up the words in each topic
def clean_topic_words(topic_words):
    cleaned_words = []
    for word in topic_words:
        # Remove symbols and characters using regex
        cleaned_word = re.sub(r'[^a-zA-Z]', '', word)
        if cleaned_word:
            cleaned_words.append(cleaned_word)
    return cleaned_words

# Function to compute coherence score for a given number of topics
def compute_coherence_score(corpus, dictionary, k):
    lda_model = gensim.models.LdaMulticore(corpus, num_topics=k, id2word=dictionary, passes=2, workers=2)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_data, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    return coherence_score

# Optimal number of topics based on coherence score
max_topics = 10
best_coherence = 0
best_num_topics = 0

for k in range(1, max_topics+1):
    coherence_score = compute_coherence_score(corpus, dictionary, k)
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_num_topics = k

# LDA Model Training
lda_model = gensim.models.LdaMulticore(corpus, num_topics=best_num_topics, id2word=dictionary, passes=10, workers=2)

# Summarize topics with cleaned words
topics = lda_model.show_topics(num_topics=best_num_topics, num_words=10, formatted=False)
for topic_id, words in topics:
    cleaned_words = clean_topic_words([word for word, _ in words])
    print(f"Topic {topic_id + 1}: {' '.join(cleaned_words)}")

# Coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_data, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model_lda.get_coherence()

print("Coherence Score:", coherence_score)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Topic 1: prize american de august structure sales contemporary james another often
Topic 2: short s narrative writers original style novel archived
Topic 3: century action cambridge william
Topic 4: retrieved fiction isbn modern literary needed published character
Topic 5: form pin drop book s wikipedia bbc hayes
Topic 6: stories ed genre identifiersarticles may articles awards
Topic 7: story mitchell magazine jos
Topic 8: one writer women read using allan life fairy
Topic 9: literature press p edit first citation new
Topic 10: university nobel tales united young writing
Coherence Score: 0.7270097037173542


## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import requests
from bs4 import BeautifulSoup

# URL of the webpage
url = "https://en.wikipedia.org/wiki/Short_story"

# Fetch the webpage content
response = requests.get(url)
html_content = response.text

# Parse the HTML content to extract text
soup = BeautifulSoup(html_content, 'html.parser')
text_content = soup.get_text()

# Tokenization and preprocessing
stop_words = set(stopwords.words('english'))
tokenized_data = [word_tokenize(text.lower()) for text in text_content.split()]
tokenized_data = [[word for word in doc if word.isalpha() and word not in stop_words] for doc in tokenized_data]

# Create dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data]

# Function to compute coherence score for a given number of topics
def compute_coherence_score(corpus, dictionary, texts, k):
    lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=k)
    coherence_model_lsa = CoherenceModel(model=lsa_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lsa.get_coherence()
    return coherence_score

# Find optimal number of topics based on coherence score
max_topics = 10
best_coherence = 0
best_num_topics = 0

for k in range(1, max_topics+1):
    coherence_score = compute_coherence_score(corpus, dictionary, tokenized_data, k)
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_num_topics = k

# Train LSA model with optimal number of topics
lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=best_num_topics)

# Summarize topics
topics = lsa_model.show_topics(num_topics=best_num_topics, num_words=10, formatted=False)
for topic_id, topic in topics:
    cleaned_words = [word for word in topic if word[0].isalpha()]  # Remove non-alphabetic tokens
    print(f"Topic {topic_id + 1}: {' '.join([word for word, _ in cleaned_words])}")

# Compute coherence score
coherence_score = compute_coherence_score(corpus, dictionary, tokenized_data, best_num_topics)
print("Coherence Score:", coherence_score)


Topic 1: short length awards william sometimes women often something articles issn
Topic 2: story citation city techniques point elsewhere sales fonseca thousands beattie
Coherence Score: 0.7434366821572954


## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
!pip install lda2vec
!pip install pyLDAvis



In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import requests
from bs4 import BeautifulSoup
import gensim
from gensim import corpora
from gensim.models import Word2Vec, LdaModel
import numpy as np

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# URL of the webpage containing text data
url = "https://en.wikipedia.org/wiki/Short_story"

# Fetch the webpage content
response = requests.get(url)
html_content = response.text

# Parse the HTML content to extract text
soup = BeautifulSoup(html_content, 'html.parser')
text_content = soup.get_text()

# Tokenization and preprocessing
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    tokens = [wordnet_lemmatizer.lemmatize(word) for word in tokens]
    return tokens

tokenized_data = [preprocess_text(text) for text in text_content.split()]

# Train word embeddings using Word2Vec
word2vec_model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

# Convert tokenized data to bag-of-words representation
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data]

# Train LDA on top of the embeddings
num_topics = 5
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)

# Print topics
topics = lda_model.print_topics(num_topics=num_topics, num_words=10)
for topic_id, words in topics:
    print(f"Topic {topic_id + 1}: {words}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Topic 1: 0.020*"character" + 0.017*"press" + 0.017*"genre" + 0.016*"novel" + 0.014*"ed" + 0.014*"structure" + 0.010*"york" + 0.009*"university" + 0.008*"conflict" + 0.008*"modern"
Topic 2: 0.031*"narrative" + 0.016*"identifiersarticles" + 0.011*"literature" + 0.011*"new" + 0.009*"tale" + 0.008*"statement" + 0.008*"narrator" + 0.008*"related" + 0.008*"term" + 0.006*"united"
Topic 3: 0.027*"retrieved" + 0.020*"theory" + 0.013*"form" + 0.011*"study" + 0.011*"award" + 0.008*"american" + 0.008*"magazine" + 0.008*"essay" + 0.007*"dramatic" + 0.007*"realism"
Topic 4: 0.019*"writer" + 0.018*"style" + 0.016*"plot" + 0.015*"additional" + 0.015*"unsourced" + 0.012*"state" + 0.011*"using" + 0.011*"time" + 0.011*"common" + 0.010*"nobel"
Topic 5: 0.164*"story" + 0.153*"short" + 0.044*"fiction" + 0.028*"article" + 0.016*"august" + 0.015*"literary" + 0.013*"action" + 0.012*"edit" + 0.012*"science" + 0.011*"reference"


## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [19]:
%%capture
!pip install bertopic

In [18]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import requests
from bs4 import BeautifulSoup
from bertopic import BERTopic

# Fetch the webpage content
url = "https://en.wikipedia.org/wiki/Short_story"
response = requests.get(url)
html_content = response.text

# Parse the HTML content to extract text
soup = BeautifulSoup(html_content, 'html.parser')
text_content = soup.get_text()

# Split the text into paragraphs
paragraphs = [p.strip() for p in text_content.split('\n') if p.strip()]

# Tokenize and preprocess each paragraph
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

documents = []
for paragraph in paragraphs:
    tokens = word_tokenize(paragraph.lower())
    filtered_tokens = [token for token in tokens if token not in stop_words and token not in punctuation]
    documents.append(' '.join(filtered_tokens))

# Initialize BERTopic model
topic_model = BERTopic()

# Fit BERTopic model on filtered text
topics, probs = topic_model.fit_transform(documents)

# Get the most frequent topic
most_frequent_topic = topic_model.get_topic(0)
print("Most Frequent Topic:", most_frequent_topic)

# Visualize topics
topic_model.visualize_topics()

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Most Frequent Topic: [('short', 0.04586836114559908), ('story', 0.039936311964951766), ('stories', 0.03246597925251005), ('writers', 0.01660470658621207), ('form', 0.015294505267520912), ('century', 0.013859348148061166), ('modern', 0.01324204616635544), ('citation', 0.01324204616635544), ('needed', 0.012603392739070594), ('tales', 0.012506383941374797)]


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
'''
  Latent Dirichlet Allocation (LDA) emerges as the preeminent choice among topic modeling algorithms
  due to its superior interpretability and coherence. LDA generates topics as distributions over words,
  offering clear and intuitive insights into the underlying themes present in text data. This interpretability
  is invaluable in various applications, such as academic research and content analysis, where human
  understanding of topics is essential. Additionally, LDA consistently achieves high coherence scores,
  indicating the logical connectedness and semantic coherence of the topics it generates. This coherence
  ensures that the topics are meaningful and internally consistent, further enhancing the utility of LDA
  in extracting actionable insights from text corpora.

'''


'\n  Latent Dirichlet Allocation (LDA) emerges as the preeminent choice among topic modeling algorithms\n  due to its superior interpretability and coherence. LDA generates topics as distributions over words,\n  offering clear and intuitive insights into the underlying themes present in text data. This interpretability\n  is invaluable in various applications, such as academic research and content analysis, where human\n  understanding of topics is essential. Additionally, LDA consistently achieves high coherence scores,\n  indicating the logical connectedness and semantic coherence of the topics it generates. This coherence\n  ensures that the topics are meaningful and internally consistent, further enhancing the utility of LDA\n  in extracting actionable insights from text corpora.\n  \n'

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms and did the implementations helped in grasping the nuances of feature extraction from text data.

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
'''
The learning experience in working with text data and extracting features from them using the different
algorithms on topic modeling was quite enriching and informative. Got to learn new techniques from the tasks in each algorithm.
Understanding the algorithms behind LDA, LSA, BERTopic, and LDA2Vec methods provided valuable insights.
These algorithms have further helped us in the domain of text processing, where the main concern is in the generation and
representation of topics within the text data. The understanding from the subtleties of these algorithms' implementation helped greatly.
from text data, including tokenization, preprocessing, and modeling latent topics.

The major challenge experienced is during the tokenization of words from the URL of a website, specifically in handling the HTML tags and acquisition of the meaningful words and text content.
Decided to leverage tools like BeautifulSoup and carry out text preprocessing techniques. At first the algorithms were tokening the link text instead of the website content.
Further, it required not only invaluable understanding but also an in-depth understanding of the concepts and algorithms underneath Ljson from scratch.
opportunity to delve into the intricacies of topic modeling and feature extraction.

The given exercise is highly relevant for the area of Natural Language Processing (NLP) dealing with ways of extracting meaningful information.
These methods allow the latent topics from the text to be extracted. NLP has very broad uses, from sentiment analysis to document classification
to information retrieval. Where the role of topic modeling is crucial to expose hidden patterns and structures in textual content, so as to master
the techniques of topic modeling, comprehensive study in this particular area helps to pursue effective research in considered. algorithms and
feature extraction techniques is essential for building effective NLP systems and applications.






'''

"\nThe learning experience in working with text data and extracting features from them using the different\nalgorithms on topic modeling was quite enriching and informative. Got to learn new techniques from the tasks in each algorithm.\nUnderstanding the algorithms behind LDA, LSA, BERTopic, and LDA2Vec methods provided valuable insights.\nThese algorithms have further helped us in the domain of text processing, where the main concern is in the generation and\nrepresentation of topics within the text data. The understanding from the subtleties of these algorithms' implementation helped greatly.\nfrom text data, including tokenization, preprocessing, and modeling latent topics.\n\nThe major challenge experienced is during the tokenization of words from the URL of a website, specifically in handling the HTML tags and acquisition of the meaningful words and text content.\nDecided to leverage tools like BeautifulSoup and carry out text preprocessing techniques. At first the algorithms were