
# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [24]:
# Write your code here
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy

# Sample data
sample_data = [
    "This movie is absolutely fantastic! I loved every moment of it.",
    "I'm not sure why people like this movie; it's quite boring.",
    "The restaurant had great service, but the food was disappointing.",
]

# Step 1: Preprocess the text data
nlp = spacy.load("en_core_web_sm")

def preprocess_text(texts):
    processed_texts = []
    for text in texts:
        doc = nlp(text)
        tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
        processed_texts.append(tokens)
    return processed_texts

processed_data = preprocess_text(sample_data)

# Step 2: Create a dictionary and document-term matrix
id2word = corpora.Dictionary(processed_data)
corpus = [id2word.doc2bow(text) for text in processed_data]
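
To make the bag-of-words step above concrete, here is a minimal standard-library sketch of the kind of structure `doc2bow` returns: a list of `(token_id, count)` pairs per document. The documents, the `token2id` mapping, and the `to_bow` helper are hypothetical illustrations. Ids are assigned in first-seen order here and need not match the ids gensim would assign.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (lemmas, stop words removed).
docs = [
    ["movie", "absolutely", "fantastic", "love", "moment"],
    ["sure", "people", "like", "movie", "boring"],
]

# Assign each distinct token an integer id in first-seen order.
# (Assumption: gensim's own id assignment may differ.)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bows = [to_bow(doc) for doc in docs]
print(bows[1])
```

The second document shares only the id of "movie" with the first one, which is the sort of overlap a topic model has to work from.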



In [25]:
# Step 3: Determine the optimal number of topics (K) using coherence scores
coherence_scores = []
for k in range(2, 11):
    lda_model = gensim.models.LdaModel(corpus=corpus, id2word=id2word, num_topics=k, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_data, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((k, coherence_score))

coherence_scores

[(2, 0.2759024284927243),
 (3, 0.27590242849272434),
 (4, 0.2759024284927243),
 (5, 0.2759024284927243),
 (6, 0.27590242849272434),
 (7, 0.2759024284927243),
 (8, 0.2759024284927243),
 (9, 0.2759024284927243),
 (10, 0.2759024284927244)]

In [26]:
# Step 4: Select the K with the highest coherence score
best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])
print(f"Optimal number of topics (K): {best_k}")

# Step 5: Train the LDA model with the selected K
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=id2word, num_topics=best_k, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

# Step 6: Summarize the topics
topics = lda_model.print_topics(num_topics=best_k, num_words=5)
for topic in topics:
    print(f"Topic {topic[0] + 1}: {topic[1]}")

Optimal number of topics (K): 10
Topic 1: 0.172*"service" + 0.172*"restaurant" + 0.172*"food" + 0.172*"disappointing" + 0.172*"great"
Topic 2: 0.071*"movie" + 0.071*"boring" + 0.071*"food" + 0.071*"like" + 0.071*"fantastic"
Topic 3: 0.071*"movie" + 0.071*"boring" + 0.071*"great" + 0.071*"disappointing" + 0.071*"people"
Topic 4: 0.071*"movie" + 0.071*"boring" + 0.071*"sure" + 0.071*"great" + 0.071*"food"
Topic 5: 0.071*"movie" + 0.071*"great" + 0.071*"boring" + 0.071*"sure" + 0.071*"disappointing"
Topic 6: 0.172*"absolutely" + 0.172*"moment" + 0.172*"fantastic" + 0.172*"love" + 0.172*"movie"
Topic 7: 0.071*"movie" + 0.071*"great" + 0.071*"boring" + 0.071*"food" + 0.071*"absolutely"
Topic 8: 0.071*"movie" + 0.071*"food" + 0.071*"great" + 0.071*"boring" + 0.071*"fantastic"
Topic 9: 0.172*"movie" + 0.172*"people" + 0.172*"like" + 0.172*"sure" + 0.172*"boring"
Topic 10: 0.071*"movie" + 0.071*"restaurant" + 0.071*"disappointing" + 0.071*"great" + 0.071*"like"
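
The coherence scores listed above agree to about fifteen decimal places, so `max` selects K = 10 only because of floating-point round-off. One possible way to handle such ties, sketched below, is to treat scores within a small tolerance as equal and prefer the smallest K. The helper name `pick_k` and the tolerance `rel_tol=1e-9` are assumptions made for illustration, not part of gensim.

```python
import math

# Coherence scores copied from the output above: (K, c_v coherence).
coherence_scores = [
    (2, 0.2759024284927243),
    (3, 0.27590242849272434),
    (4, 0.2759024284927243),
    (5, 0.2759024284927243),
    (6, 0.27590242849272434),
    (7, 0.2759024284927243),
    (8, 0.2759024284927243),
    (9, 0.2759024284927243),
    (10, 0.2759024284927244),
]


def pick_k(scores, rel_tol=1e-9):
    """Return the smallest K whose score is within rel_tol of the best score."""
    best = max(score for _, score in scores)
    tied = [k for k, score in scores if math.isclose(score, best, rel_tol=rel_tol)]
    return min(tied)


print(pick_k(coherence_scores))  # prints 2
```

With only three short documents the measure cannot separate the candidates, so a tie-aware rule at least avoids reporting more topics than documents.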




## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [27]:
# Write your code here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

# Sample data
sample_data = [
    "This movie is absolutely fantastic! I loved every moment of it.",
    "I'm not sure why people like this movie; it's quite boring.",
    "The restaurant had great service, but the food was disappointing.",
]

# Step 1: Preprocess the text data and create a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_data)

# Step 2: Fit LSA (truncated SVD) with a manually chosen number of topics K
K = 2  # Chosen by hand here; no coherence score is computed in this cell
lsa = TruncatedSVD(n_components=K)
lsa_topic_matrix = lsa.fit_transform(tfidf_matrix)

# Step 3: Summarize the topics
terms = tfidf_vectorizer.get_feature_names_out()
topic_keywords = []
for i, topic in enumerate(lsa.components_):
    top_terms = [terms[idx] for idx in topic.argsort()[-5:][::-1]]
    topic_keywords.append(top_terms)
    print(f"Topic {i + 1}: {', '.join(top_terms)}")


Topic 1: this, it, movie, loved, every
Topic 2: the, but, disappointing, service, restaurant
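
Question 2 asks for K to be chosen by coherence, whereas the cell above fixes K = 2 by hand. One possible way to score the top terms of each candidate K is a simplified UMass-style coherence based on document co-occurrence counts. The sketch below uses only the standard library. The helper `umass_coherence` and the tokenized documents are hypothetical, and the values will not match those produced by gensim's `CoherenceModel`.

```python
import math


def umass_coherence(top_terms, tokenized_docs):
    """Simplified UMass-style coherence for one topic.

    Averages log((D(w_i, w_j) + 1) / D(w_j)) over all pairs where w_j
    ranks above w_i in top_terms; D counts the documents that contain
    the given word(s). Higher (closer to zero) means more coherent.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_terms)):
        for j in range(i):
            w_i, w_j = top_terms[i], top_terms[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip terms that never occur in the corpus
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical tokenized documents for illustration only.
docs = [
    ["movie", "fantastic", "love"],
    ["movie", "boring", "people"],
    ["restaurant", "service", "food"],
]

print(umass_coherence(["movie", "fantastic"], docs))  # terms that co-occur
print(umass_coherence(["movie", "food"], docs))       # terms that never co-occur
```

To select K, one could fit `TruncatedSVD` for each candidate K, take the top terms of every component, average this score over the components, and keep the K with the highest average.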


## Question 3 (10 Points)

**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
# Write your code here


## Question 4 (10 Points)

**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [30]:
%%capture
!pip install bertopic

In [58]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="test",  remove=('headers', 'footers', 'quotes'))['data']

from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

freq = topic_model.get_topic_info(); freq.head(5)

topic_model.get_topic(0)

2024-03-29 02:45:20,719 - BERTopic - Embedding - Transforming documents to embeddings.

2024-03-29 02:59:15,174 - BERTopic - Embedding - Completed ✓
2024-03-29 02:59:15,176 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-29 02:59:55,581 - BERTopic - Dimensionality - Completed ✓
2024-03-29 02:59:55,584 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-29 02:59:59,987 - BERTopic - Cluster - Completed ✓
2024-03-29 03:00:00,007 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-29 03:00:03,712 - BERTopic - Representation - Completed ✓


[('game', 0.017455823228888123),
 ('he', 0.010815978122157677),
 ('team', 0.01037046882293),
 ('games', 0.009787749294113185),
 ('the', 0.008510960023341622),
 ('was', 0.008431934318538847),
 ('25', 0.008193396194866254),
 ('his', 0.007644650432912428),
 ('in', 0.007635976813955437),
 ('year', 0.007167922175027246)]

In [59]:
topic_model.topics_[:10]

[1, -1, -1, 50, 43, 56, -1, -1, 6, 12]
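
In BERTopic, the label `-1` marks documents that were not assigned to any topic (outliers). As a small illustration, the sketch below counts how many of the ten assignments printed above fall into that group. The variable names are hypothetical, and the list is copied from the output rather than read from `topic_model`.

```python
from collections import Counter

# First ten topic assignments, copied from the output above.
assignments = [1, -1, -1, 50, 43, 56, -1, -1, 6, 12]

counts = Counter(assignments)
outlier_share = counts[-1] / len(assignments)

print(f"{counts[-1]} of {len(assignments)} documents are outliers ({outlier_share:.0%})")
```

Running the same count over the full `topic_model.topics_` list would show how much of the corpus the discovered topics actually cover.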


## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternative question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each of the visualizations in detail.

In [63]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

topic_model.visualize_topics()

In [64]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
# Write your code here


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:





'''