# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise and perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [47]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')


news_articles = [
    "Political leaders discussed the budget in the parliament session.",
    "The football match last night ended in a thrilling penalty shootout.",
    "A new breakthrough in artificial intelligence was announced by a tech company.",
    "The latest movie in the franchise has been well-received by audiences.",
    "Researchers discovered a potential treatment for a common health condition."
]

# Tokenizing and preprocessing the input text
stop_words = set(stopwords.words('english'))
tokenized_articles = [word_tokenize(article.lower()) for article in news_articles]
filtered_tokens = [[word for word in tokens if word.isalnum() and word not in stop_words] for tokens in tokenized_articles]

# Creating the dictionary and corpus
dictionary = corpora.Dictionary(filtered_tokens)
corpus = [dictionary.doc2bow(tokens) for tokens in filtered_tokens]

# Computing the coherence scores to determine optimal number of topics
coherence_scores = {}
for k in range(2, 6):  # Trying different numbers of topics
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=filtered_tokens, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores[k] = coherence_score

# Getting the optimal number of topics with the highest coherence score
optimal_k = max(coherence_scores, key=coherence_scores.get)

# Training LDA model with optimal number of topics
optimal_lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k, random_state=42)

# Summarizing the topics
topics = optimal_lda_model.print_topics(num_words=5)

print("Optimal number of topics:", optimal_k)
print("Topics:")
for topic in topics:
    print(topic)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Optimal number of topics: 5
Topics:
(0, '0.031*"movie" + 0.031*"political" + 0.031*"latest" + 0.031*"franchise" + 0.031*"session"')
(1, '0.047*"thrilling" + 0.047*"match" + 0.047*"last" + 0.047*"announced" + 0.047*"night"')
(2, '0.031*"movie" + 0.031*"session" + 0.031*"latest" + 0.031*"audiences" + 0.031*"political"')
(3, '0.089*"health" + 0.089*"researchers" + 0.089*"treatment" + 0.089*"potential" + 0.089*"common"')
(4, '0.097*"discussed" + 0.097*"leaders" + 0.096*"parliament" + 0.096*"budget" + 0.096*"session"')
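
The selection of K above depends on what a coherence score actually rewards. As a rough illustration, the sketch below computes a UMass-style coherence from document co-occurrence counts in plain Python. It is a simplified stand-in, not the `c_v` measure that gensim computes in the cell above; the function name `umass_coherence` and the hand-tokenized documents are assumptions made for this example.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average of log((D(w_i, w_j) + 1) / D(w_i)) over pairs of top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Count the documents that contain every word in `words`
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    pair_scores = []
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]  # i < j, so `higher` is ranked first
        if doc_freq(higher) == 0:
            continue  # skip words that never occur, to avoid dividing by zero
        pair_scores.append(math.log((doc_freq(higher, lower) + 1) / doc_freq(higher)))
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0

# Hand-tokenized versions of three of the sample sentences (illustrative only)
documents = [
    ["political", "leaders", "discussed", "budget", "parliament", "session"],
    ["football", "match", "last", "night", "ended", "thrilling", "penalty", "shootout"],
    ["researchers", "discovered", "potential", "treatment", "common", "health", "condition"],
]

coherent_score = umass_coherence(["budget", "parliament", "session"], documents)
mixed_score = umass_coherence(["budget", "football", "health"], documents)
print("same-document words:", coherent_score)
print("mixed-document words:", mixed_score)
```

Word lists whose members appear together in the same documents score higher than lists that mix unrelated documents, which is the property the K-selection loop relies on.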


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [48]:
from gensim import corpora
from gensim.models import LsiModel
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def compute_coherence_score(model, texts, dictionary, coherence='c_v'):
    coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence=coherence)
    return coherence_model.get_coherence()

news_articles = [
    "Political leaders discussed the budget in the parliament session.",
    "The football match last night ended in a thrilling penalty shootout.",
    "A new breakthrough in artificial intelligence was announced by a tech company.",
    "The latest movie in the franchise has been well-received by audiences.",
    "Researchers discovered a potential treatment for a common health condition."
]

# Tokenizing and preprocessing the text
stop_words = set(stopwords.words('english'))
tokenized_articles = [word_tokenize(article.lower()) for article in news_articles]
filtered_tokens = [[word for word in tokens if word.isalnum() and word not in stop_words] for tokens in tokenized_articles]

# Creating the dictionary and corpus
dictionary = corpora.Dictionary(filtered_tokens)
corpus = [dictionary.doc2bow(tokens) for tokens in filtered_tokens]

# Computing the coherence scores to determine optimal number of topics
coherence_scores = {}
max_topics = min(len(news_articles), len(dictionary)) - 1  # Exclusive upper bound: k runs from 2 to max_topics - 1
for k in range(2, max_topics):
    lsi_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=k)
    coherence_score = compute_coherence_score(lsi_model, filtered_tokens, dictionary)
    coherence_scores[k] = coherence_score

# Getting the optimal number of topics with the highest coherence score
optimal_k = max(coherence_scores, key=coherence_scores.get)

# Training LSI model with optimal number of topics
optimal_lsi_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k)

# Summarizing the topics
topics = optimal_lsi_model.print_topics(num_words=5)

print("Optimal number of topics:", optimal_k)
print("Topics:")
for topic in topics:
    print(topic)


Optimal number of topics: 2
Topics:
(0, '-0.354*"penalty" + -0.354*"shootout" + -0.354*"last" + -0.354*"match" + -0.354*"football"')
(1, '0.378*"treatment" + 0.378*"potential" + 0.378*"common" + 0.378*"discovered" + 0.378*"researchers"')
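
The LSA topics above are printed as weighted-term strings, and topic 0 carries only negative weights. The sign of an LSA component is arbitrary (singular vectors are defined only up to sign), so ranking terms by absolute weight is a reasonable way to summarize a topic. The sketch below parses such a string with a regular expression; `parse_topic` and `top_words_by_magnitude` are hypothetical helper names, not part of gensim.

```python
import re

# Matches terms such as -0.354*"penalty" in a gensim topic string
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (weight, word) pairs."""
    return [(float(weight), word) for weight, word in TERM_PATTERN.findall(topic_string)]

def top_words_by_magnitude(topic_string, n=5):
    """Return the n words with the largest absolute weight, ignoring sign."""
    terms = sorted(parse_topic(topic_string), key=lambda t: abs(t[0]), reverse=True)
    return [word for _, word in terms[:n]]

# Topic 0 copied from the LSA output above
lsa_topic_0 = '-0.354*"penalty" + -0.354*"shootout" + -0.354*"last" + -0.354*"match" + -0.354*"football"'
parsed = parse_topic(lsa_topic_0)
words = top_words_by_magnitude(lsa_topic_0)
print(words)
```

Read this way, topic 0 is simply the football-match sentence, regardless of the minus signs.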


## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [56]:
from gensim import corpora, models
import pyLDAvis
from gensim.models import CoherenceModel

try:
    import seaborn
except ImportError:
    pass

pyLDAvis.enable_notebook()


news_articles = [
    "Political leaders discussed the budget in the parliament session.",
    "The football match last night ended in a thrilling penalty shootout.",
    "A new breakthrough in artificial intelligence was announced by a tech company.",
    "The latest movie in the franchise has been well-received by audiences.",
    "Researchers discovered a potential treatment for a common health condition."
]

# Tokenize and preprocess text data
tokenized_text = [text.lower().split() for text in news_articles]

# Create dictionary and corpus
dictionary = corpora.Dictionary(tokenized_text)
corpus = [dictionary.doc2bow(text) for text in tokenized_text]

# Train a standard gensim LDA model with K fixed at 2 (a stand-in; this is not lda2vec)
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_text, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print("Coherence Score:", coherence_lda)

# Extract topics
topics = lda_model.print_topics(num_words=10)

# Print topics
for topic in topics:
    print("Topic {}: {}".format(topic[0], topic[1]))


Coherence Score: 0.2407869499258611
Topic 0: 0.083*"a" + 0.049*"potential" + 0.049*"discovered" + 0.049*"treatment" + 0.049*"health" + 0.049*"researchers" + 0.049*"for" + 0.049*"common" + 0.049*"condition." + 0.017*"in"
Topic 1: 0.087*"the" + 0.071*"in" + 0.054*"a" + 0.039*"by" + 0.024*"franchise" + 0.024*"been" + 0.024*"new" + 0.024*"latest" + 0.024*"session." + 0.024*"announced"
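
The top terms in the output above are dominated by function words such as "a", "the", and "in", and some tokens keep trailing punctuation ("condition.", "session."), because the cell tokenizes with `str.split()` and applies no stop word filter. A minimal sketch of a cleaner tokenizer follows; the small `STOP_WORDS` set and the name `simple_tokenize` are assumptions for illustration, whereas the Question 1 cell uses NLTK's English stop word list.

```python
import re

# Small illustrative stop word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "the", "in", "by", "for", "has", "been", "was", "of", "on"}

def simple_tokenize(text):
    """Lowercase, keep alphanumeric tokens (hyphenated words stay whole), drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

tokens = simple_tokenize(
    "Researchers discovered a potential treatment for a common health condition."
)
print(tokens)
```

Feeding tokens like these into `corpora.Dictionary` instead of the raw `split()` output would keep words such as "a" and "condition." out of the topic vocabulary.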


## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [19]:
!pip install bertopic
from bertopic import BERTopic

# Using more data here: BERTopic needs a larger corpus to work correctly
news_articles = [
    "Political leaders discussed the budget in the parliament session.",
    "The football match last night ended in a thrilling penalty shootout.",
    "A new breakthrough in artificial intelligence was announced by a tech company.",
    "The latest movie in the franchise has been well-received by audiences.",
    "Researchers discovered a potential treatment for a common health condition.",
    "The company's stock saw a significant increase in value following the announcement.",
    "A new study sheds light on the long-term effects of climate change.",
    "The city unveiled plans for a new public transportation system.",
    "Celebrations erupted across the country after the team won the championship.",
    "Experts warn of potential cybersecurity threats in the upcoming elections."
]

# Creating BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fitting the model
topics, probs = topic_model.fit_transform(news_articles)

# Getting topic information
freq = topic_model.get_topic_info()
print(freq.head(5))

# Getting the top words for a specific topic (returns False if topic 0 does not exist)
print(topic_model.get_topic(0))

# Getting the topic assigned to each of the first 10 documents
print(topic_model.topics_[:10])




2024-03-30 02:39:27,156 - BERTopic - Embedding - Transforming documents to embeddings.



2024-03-30 02:39:27,657 - BERTopic - Embedding - Completed ✓
2024-03-30 02:39:27,661 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-30 02:39:30,092 - BERTopic - Dimensionality - Completed ✓
2024-03-30 02:39:30,095 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-30 02:39:30,165 - BERTopic - Cluster - Completed ✓
2024-03-30 02:39:30,194 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-30 02:39:30,225 - BERTopic - Representation - Completed ✓


   Topic  Count                     Name  \
0     -1     10  -1_the_in_new_potential   

                                      Representation  \
0  [the, in, new, potential, of, for, by, warn, w...   

                                 Representative_Docs  
0  [Celebrations erupted across the country after...  
False
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
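
In BERTopic, topic `-1` is reserved for outliers, so the assignment list above means that no document was placed in a real topic. A quick check like the following sketch makes that visible before any topics are interpreted; `outlier_share` is a hypothetical helper, and the list is copied from the output above.

```python
from collections import Counter

def outlier_share(topic_assignments):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topic_assignments:
        return 0.0
    counts = Counter(topic_assignments)
    return counts.get(-1, 0) / len(topic_assignments)

# Topic assignments copied from the BERTopic output above
assignments = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
share = outlier_share(assignments)
real_topics = sorted(set(assignments) - {-1})

print("Share of outlier documents:", share)
print("Real topics found:", real_topics)
```

A share close to 1 suggests the corpus is too small or too varied for the clustering step, so the result says little about BERTopic's quality on a realistic dataset.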


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [20]:
# Write your code here
'''
#LDA produced topics about politics, health studies, movies, and sporting events. The political and movie themes overlapped (topics 0 and 2 share most of their top words), but overall the topics were understandable.

#The themes generated by LSA were easy to interpret and centered on sporting events and health studies.
#The two topics did not overlap and were not ambiguous.

#The lda2vec cell produced one topic on health research and possible treatments and one loosely related to entertainment (movies/franchises).
#Stop words such as "a", "the", and "in" dominated the top terms and the coherence score was low (about 0.24), so coherence could be improved.

#BERTopic groups documents using embeddings and describes each cluster with representative documents and frequently occurring terms.
#In this run all 10 documents were assigned to the '-1' (outlier) cluster, so no narrowly focused topics emerged; the corpus is most likely
#too small for clustering.

#LSA and LDA produced the clearest and most interpretable topics, with LSA showing the least redundancy, making them suitable choices for this dataset.

#lda2vec and BERTopic are more advanced techniques that leverage embeddings, but their topics were less interpretable and coherent in this context.
#For this dataset, LSA and LDA appear to provide the most effective and interpretable results.
'''
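
One way to back a comparison like this with numbers is to measure how much the top words of different topics overlap. The sketch below computes the Jaccard overlap between the five LDA topics printed under Question 1; the helper `jaccard` and the variable names are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Top-5 words per LDA topic, copied from the Question 1 output
lda_topics = {
    0: ["movie", "political", "latest", "franchise", "session"],
    1: ["thrilling", "match", "last", "announced", "night"],
    2: ["movie", "session", "latest", "audiences", "political"],
    3: ["health", "researchers", "treatment", "potential", "common"],
    4: ["discussed", "leaders", "parliament", "budget", "session"],
}

overlaps = {
    (i, j): jaccard(lda_topics[i], lda_topics[j])
    for i, j in combinations(sorted(lda_topics), 2)
}
most_similar = max(overlaps, key=overlaps.get)
print("Most similar pair:", most_similar, "overlap:", round(overlaps[most_similar], 3))
```

A high overlap between two topics is a sign of redundancy, which is one concrete criterion for preferring one model's output over another's.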

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help in grasping the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; answer the questions above in as much detail as possible):

'''
Please write your answer here:

The exercise was really challenging. It improved my experience with LSA, LDA, BERTopic, and lda2vec. My overall experience with this exercise was good.



'''