# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
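
As a rough illustration of what a coherence score measures, the following is a minimal pure-Python sketch of the UMass variant, which rewards topics whose top words tend to appear in the same documents. It is not the `c_v` measure used in this notebook (that one is more involved), and the function name `umass_coherence` and the toy documents are assumptions made for this example only.

```python
import math


def umass_coherence(topic_words, docs):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    topic_words: top words of one topic, most probable first.
    docs: tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            d_l = doc_freq(topic_words[l])
            if d_l == 0:
                continue  # the word never occurs, so the pair is skipped
            score += math.log((doc_freq(topic_words[m], topic_words[l]) + 1) / d_l)
    return score


# Toy documents (hypothetical, not the exercise corpus).
toy_docs = [["effects", "plot"], ["effects", "plot"], ["love", "memory"]]
together = umass_coherence(["effects", "plot"], toy_docs)   # words that co-occur
apart = umass_coherence(["effects", "love"], toy_docs)      # words that never co-occur
print(together > apart)
```

A higher (less negative) value means the topic's words co-occur more often, which is the intuition behind picking the K whose topics score best.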

In [10]:
# Write your code here
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Prepare the text corpus
texts = [
    "Inception is a mind-bending masterpiece that keeps viewers on the edge of their seats with its intricate plot and stunning visual effects.",
    "The Godfather is a cinematic classic, with its gripping narrative, compelling characters, and iconic performances making it a timeless masterpiece.",
    "Jurassic Park revolutionized the world of cinema with its groundbreaking special effects and thrilling storyline that continues to captivate audiences of all ages.",
    "Eternal Sunshine of the Spotless Mind is a thought-provoking exploration of love and memory, with its inventive storytelling and heartfelt performances leaving a lasting impression"
]

# Tokenize the texts
tokenized_texts = [text.lower().split() for text in texts]

# Create a dictionary from the tokenized texts
dictionary = corpora.Dictionary(tokenized_texts)

# Convert the tokenized texts into a bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Find the optimal number of topics using coherence score
coherence_scores = {}
for k in range(2, 8):
    lda_model = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=k,
        random_state=100,
        chunksize=100,
        passes=10,
        alpha='auto',
        per_word_topics=True
    )
    coherence_model = CoherenceModel(
        model=lda_model,
        texts=tokenized_texts,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_scores[k] = coherence_model.get_coherence()

optimal_k = max(coherence_scores, key=coherence_scores.get)
print(f"The optimal number of topics is {optimal_k}")

# Train the LDA model with the optimal number of topics
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=optimal_k,
    random_state=100,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# Summarize the topics
topics = lda_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print(f"Topic {i+1}: {' '.join([w[0] for w in topic[1]])}")

The optimal number of topics is 3
Topic 1: and of the its that with on inception visual masterpiece
Topic 2: a and of is its with the performances mind sunshine
Topic 3: of with and its the captivate cinema park effects storyline
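
The topics above are dominated by function words such as "and", "of", and "the" because the tokenizer only lowercases and splits on whitespace. A minimal sketch of a stricter tokenizer follows; the `STOP_WORDS` set is a small illustrative list assumed for this example (not a complete stop-word list), and the sample sentence is a shortened version of one of the reviews.

```python
import re

# Illustrative stop-word list (an assumption for this sketch, far from complete).
STOP_WORDS = {
    "a", "all", "an", "and", "is", "it", "its", "of",
    "on", "that", "the", "their", "to", "with",
}


def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


sample = "The Godfather is a cinematic classic, with its gripping narrative."
print(tokenize(sample))
# ['godfather', 'cinematic', 'classic', 'gripping', 'narrative']
```

Feeding tokens like these into `corpora.Dictionary` instead of the raw `split()` output would likely change the topics and the coherence scores, so the results above would need to be regenerated.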


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [11]:
# Write your code here
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Create dictionary
texts_tokenized = [text.split() for text in texts]
dictionary = Dictionary(texts_tokenized)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Build LSA model
num_topics = 4
lsa_model = LsiModel(corpus, num_topics=num_topics, id2word=dictionary)

# Compute coherence score
coherence_model_lsa = CoherenceModel(model=lsa_model, texts=texts_tokenized, dictionary=dictionary, coherence='c_v')
coherence_lsa = coherence_model_lsa.get_coherence()

# Print coherence score
print('Coherence Score: ', coherence_lsa)

# Print topics
topics = lsa_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print('Topic {}: {}'.format(i+1, [word[0] for word in topic[1]]))

Coherence Score:  0.6791577725415017
Topic 1: ['of', 'and', 'a', 'with', 'its', 'the', 'is', 'performances', 'that', 'impression']
Topic 2: ['a', 'of', 'that', 'revolutionized', 'Jurassic', 'all', 'world', 'effects', 'captivate', 'audiences']
Topic 3: ['edge', 'viewers', 'stunning', 'seats', 'mind-bending', 'plot', 'on', 'visual', 'masterpiece', 'effects.']
Topic 4: ['The', 'timeless', 'it', 'compelling', 'characters,', 'classic,', 'Godfather', 'narrative,', 'making', 'iconic']


## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [12]:
# Write your code here
# Note: this cell fits gensim's LdaModel on stemmed, stop-word-filtered tokens.
# The lda2vec package is not used, so it is neither installed nor imported.
import gensim
from gensim.parsing.preprocessing import preprocess_string

texts = [
    "Inception is a mind-bending masterpiece that keeps viewers on the edge of their seats with its intricate plot and stunning visual effects.",
    "The Godfather is a cinematic classic, with its gripping narrative, compelling characters, and iconic performances making it a timeless masterpiece.",
    "Jurassic Park revolutionized the world of cinema with its groundbreaking special effects and thrilling storyline that continues to captivate audiences of all ages.",
    "Eternal Sunshine of the Spotless Mind is a thought-provoking exploration of love and memory, with its inventive storytelling and heartfelt performances leaving a lasting impression"
]

# Preprocess the text data (lowercase, strip punctuation and stop words, stem)
processed_texts = [preprocess_string(text) for text in texts]

# Create a dictionary and corpus
dictionary = gensim.corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train the LDA model (K is fixed at 5 here rather than chosen by coherence)
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# Get the topics and their top words
topics = lda_model.show_topics(num_topics=5, num_words=10, formatted=False)
top_words_per_topic = []
for topic in topics:
    top_words = [word[0] for word in topic[1]]
    top_words_per_topic.append(top_words)

# Print the top words for each topic
for i, top_words in enumerate(top_words_per_topic):
    print(f"Topic {i+1}: {', '.join(top_words)}")

Topic 1: mind, visual, masterpiec, stun, bend, effect, edg, intric, viewer, incept
Topic 2: perform, grip, classic, narr, godfath, compel, icon, charact, make, timeless
Topic 3: continu, cinema, ag, world, revolution, storylin, thrill, jurass, special, effect
Topic 4: audienc, captiv, groundbreak, park, effect, special, jurass, thrill, storylin, revolution
Topic 5: seat, plot, keep, incept, viewer, masterpiec, intric, edg, effect, mind


## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [37]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
# Note: BERTopic's default UMAP/HDBSCAN settings expect far more than a handful of documents.
texts = ["Your text data goes here...", "Another document...", "..."]

# Stop words are removed when BERTopic builds the word lists for each topic
vectorizer = CountVectorizer(stop_words="english")

# BERTopic embeds the raw strings itself. Passing a document-term matrix instead raises
# "TypeError: Make sure that the iterable only contains strings."
model = BERTopic(vectorizer_model=vectorizer)
topics, _ = model.fit_transform(texts)

# Print the topics (topic -1, if present, collects outlier documents)
topic_words = model.get_topics()
print("Number of topics:", len(topic_words))
print("\nTopics:")
for topic_id, words in topic_words.items():
    print(f"Topic {topic_id}: {', '.join(word for word, _ in words)}")

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [None]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [17]:
# Write your code here
'''
LDA generated 3 topics, fewer than the other two algorithms, and identified some topics similar to those from LSA and LDA2Vec.
LSA generated 4 topics.
LDA2Vec generated 5 topics and identified some topics similar to those from LDA and LSA.
LDA and LDA2Vec generated more coherent and specific topics than LSA. LDA2Vec may have an advantage in identifying specific features and sentiments, while LDA seems to capture a more diverse range of topics.
'''

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [15]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

All of these topics were new to me, and working through them helped me learn feature extraction from text data. I found Question 3 (LDA2Vec) and BERTopic challenging.
'''