
# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [107]:
# Required Libraries
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy

In [108]:
# Sample Text Corpus
documents = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

In [109]:
# Step 1: Preprocess the text data
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

In [110]:
def preprocess_text(texts):
    return [[token.lemma_ for token in nlp(text) if not token.is_stop and not token.is_punct] for text in texts]

processed_data = preprocess_text(documents)

In [111]:
# Step 2: Create a dictionary and document-term matrix
id2word = corpora.Dictionary(processed_data)
corpus = [id2word.doc2bow(text) for text in processed_data]
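
To make Step 2 concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words corpus contain. It only mimics the idea behind `corpora.Dictionary` and `doc2bow`; it is not gensim's implementation. The helper names `build_vocab` and `to_bow` and the toy token lists are hypothetical and chosen for illustration.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy, already-preprocessed documents (assumed for illustration only)
toy_docs = [["product", "good", "love"], ["service", "terrible", "service"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the same shape of input that the LDA model in the following steps consumes.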

In [112]:
# Step 3: Determine the optimal number of topics (K) using coherence scores
coherence_scores = []
for k in range(2, 11):
    candidate_model = gensim.models.LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=k,
        random_state=100,
        update_every=1,
        chunksize=100,
        passes=10,
        alpha='auto',
        per_word_topics=True,
    )
    coherence = CoherenceModel(
        model=candidate_model, texts=processed_data,
        dictionary=id2word, coherence='c_v',
    ).get_coherence()
    coherence_scores.append((k, coherence))


In [113]:
# Step 4: Select the K with the highest coherence score
best_k = None
best_coherence = float("-inf")

for k, coherence in coherence_scores:
    if coherence > best_coherence:
        best_k = k
        best_coherence = coherence

print(f"Optimal number of topics (K): {best_k}")

Optimal number of topics (K): 4
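
For intuition about what a coherence score rewards, the following is a rough pure-Python sketch of UMass coherence for a single topic. The notebook itself uses the `c_v` measure through gensim's `CoherenceModel`; UMass is a simpler, related measure (gensim exposes it as `coherence='u_mass'`), so this is a stand-in for illustration, not the computation performed above. The function name `umass_coherence` and the toy documents are assumptions.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Sum over word pairs of log((co-document count + 1) / document count
    of the earlier word). topic_words should be ordered most probable first."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for earlier, later in combinations(topic_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # skip words that never occur in the reference documents
        score += math.log((doc_freq(earlier, later) + 1) / denominator)
    return score


# Toy reference documents (assumed for illustration only)
toy_docs = [
    ["product", "good", "love"],
    ["service", "terrible"],
    ["quality", "poor", "item"],
    ["product", "good"],
]

coherent_score = umass_coherence(["product", "good"], toy_docs)
incoherent_score = umass_coherence(["product", "terrible"], toy_docs)
print(coherent_score, incoherent_score)
```

Words that appear together in the same documents score higher than words that never co-occur, which is why a higher coherence value is read as a more interpretable topic.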


In [114]:
# Step 5: Train the LDA model with the selected K
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=best_k,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

In [115]:
# Step 6: Summarize the topics
topics = lda_model.print_topics(num_topics=best_k, num_words=5)
for topic in topics:
    print(f"Topic {topic[0] + 1}: {topic[1]}")

Topic 1: 0.125*"money" + 0.125*"waste" + 0.125*"love" + 0.125*"recommend" + 0.125*"product"
Topic 2: 0.208*"terrible" + 0.208*"service" + 0.042*"good" + 0.042*"love" + 0.042*"recommend"
Topic 3: 0.156*"quality" + 0.156*"disappointed" + 0.156*"poor" + 0.156*"item" + 0.031*"good"
Topic 4: 0.156*"excellent" + 0.156*"support" + 0.156*"customer" + 0.156*"helpful" + 0.031*"good"


## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [116]:
# Required Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

In [117]:
# Step 1: Preprocess the text data and create a TF-IDF matrix
def preprocess_text_data(data):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(data)
    # Return the fitted vectorizer too; its vocabulary is needed in Step 3
    return tfidf_vectorizer, tfidf_matrix

# Sample Text Corpus
documents = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

tfidf_vectorizer, tfidf_matrix = preprocess_text_data(documents)

In [118]:
# Step 2: Fit LSA (truncated SVD) with K topics; K is specified manually here
def determine_optimal_topics(tfidf_matrix, K):
    lsa = TruncatedSVD(n_components=K)
    lsa_topic_matrix = lsa.fit_transform(tfidf_matrix)
    # Return the fitted model too; its components_ are needed in Step 3
    return lsa, lsa_topic_matrix

# Choose the number of topics
num_topics = 5
lsa, lsa_topic_matrix = determine_optimal_topics(tfidf_matrix, num_topics)

In [119]:
# Step 3: Summarize the topics (top 5 terms per LSA component)
terms = tfidf_vectorizer.get_feature_names_out()
topic_keywords = [[terms[idx] for idx in topic.argsort()[-5:][::-1]] for topic in lsa.components_]
for i, top_terms in enumerate(topic_keywords):
    print(f"Topic {i + 1}: {', '.join(top_terms)}")

Topic 1: the, is, it, really, good
Topic 2: of, money, not, waste, recommended
Topic 3: customer, excellent, helpful, were, was
Topic 4: again, service, using, this, never
Topic 5: good, really, love, product, it
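
The cells above fix K by hand, while the question asks for K to be chosen by coherence. One possible way to do that is sketched below, under several assumptions: it uses a raw count matrix instead of TF-IDF, a full NumPy SVD instead of `TruncatedSVD`, UMass coherence instead of `c_v`, and ranks terms by absolute loading because the sign of a singular vector is arbitrary. The toy token lists and all helper names are hypothetical.

```python
import math
from itertools import combinations

import numpy as np

# Toy, already-preprocessed documents (assumed for illustration only)
toy_docs = [
    ["product", "good", "love"],
    ["service", "terrible", "use"],
    ["quality", "item", "poor", "disappointed"],
    ["customer", "support", "excellent", "helpful"],
    ["recommend", "waste", "money"],
    ["product", "quality", "good"],
]

# Build a document-term count matrix
vocab = sorted({token for doc in toy_docs for token in doc})
index = {token: i for i, token in enumerate(vocab)}
counts = np.zeros((len(toy_docs), len(vocab)))
for row, doc in enumerate(toy_docs):
    for token in doc:
        counts[row, index[token]] += 1

# LSA: rows of vt are the term loadings of each latent topic
_, _, vt = np.linalg.svd(counts, full_matrices=False)
doc_sets = [set(doc) for doc in toy_docs]


def top_words(component, n=3):
    """Return the n terms with the largest absolute loading."""
    order = np.argsort(-np.abs(component))[:n]
    return [vocab[i] for i in order]


def umass(words):
    """UMass coherence of one topic against the toy documents."""
    score = 0.0
    for earlier, later in combinations(words, 2):
        d_earlier = sum(earlier in doc for doc in doc_sets)
        d_both = sum(earlier in doc and later in doc for doc in doc_sets)
        score += math.log((d_both + 1) / d_earlier)
    return score


# Mean topic coherence for each candidate K, then pick the best K
scores = {}
for k in range(2, 6):
    topics = [top_words(vt[i]) for i in range(k)]
    scores[k] = sum(umass(topic) for topic in topics) / k

best_k = max(scores, key=scores.get)
lsa_topics = [top_words(vt[i]) for i in range(best_k)]
print(best_k, lsa_topics)
```

On a corpus this small the scores are not meaningful; the point is only the shape of the loop, which mirrors the coherence-based selection used for LDA in Question 1.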


## Question 3 (10 Points)
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [120]:
# Write your code here


## Question 4 (10 Points)
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [121]:
# Write your code here


## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, answer this alternative question instead.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [122]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [123]:
'''
Comparing the outcomes of two topic modeling algorithms, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA),
means weighing several factors: the coherence score, the interpretability of the topics,
computational efficiency, and domain-specific requirements.

Which method is "better" depends on the particular needs of the task.
Because of its interpretability, coherence, and computational efficiency, LDA is a good option.


LDA - The coherence score helps determine the ideal number of topics.
LDA frequently produces topics with relatively high coherence scores, which indicate cohesive and distinct topics.
Since each topic is represented by a distribution over words, LDA usually produces highly interpretable topics.
These topics are often easy to understand and evaluate. LDA is also computationally efficient enough to scale to large corpora.
It is commonly used for topic modeling tasks. It is a flexible technique that works well for a wide range of text mining applications
and is used extensively in both academia and industry.


LSA - As with LDA, coherence scores can be used to choose the ideal number of topics for LSA.
However, because LSA depends on singular value decomposition (SVD), it may not achieve coherence scores as high as LDA's.
Since LSA relies on linear algebra transformations that may not adequately preserve semantic relationships,
the topics it generates may be less interpretable than those produced by LDA.

'''


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [124]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

Overall, the exercise was very difficult. I completed Questions 1 and 2, which were fairly easy to code.
Questions 3 and 4 were very tough, and I ran into challenges while working on them.
Unfortunately, I could not complete Questions 3 and 4, but I am still working on them and will try to solve them soon.



'''
