<a href="https://colab.research.google.com/github/SaiTejaMunja/SaiTeja_INFO_5731_In_Class_Excercise/blob/main/Munja_Exercise_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics by using LDA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:


https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
!pip install gensim
!pip install nltk



In [None]:
# Write your code here

import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy

# Corpus
texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

# Preprocessing the corpus obtained from the earlier assignment
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [token for token in tokens if len(token) > 1]  # Remove single-character words
    tokens = [token for token in tokens if not token.isdigit()]  # Remove numbers
    tokens = [token for token in tokens if token.isalpha()]  # Remove punctuation
    tokens = [token.lower() for token in tokens]  # Convert to lowercase
    return tokens

processed_texts = [preprocess_text(text) for text in texts]

# Creating a dictionary and a corpus
id2word = corpora.Dictionary(processed_texts)
corpus = [id2word.doc2bow(text) for text in processed_texts]

# Finding the optimal number of topics using coherence scores
coherence_scores = {}
start_topic, end_topic = 2, 10

for num_topics in range(start_topic, end_topic + 1):
    lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=id2word, passes=10)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_texts, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores[num_topics] = coherence_score

# Finding the number of topics with the highest coherence score
optimal_num_topics = max(coherence_scores, key=coherence_scores.get)

# Training the final LDA model with the optimal number of topics
lda_model = gensim.models.LdaModel(corpus, num_topics=optimal_num_topics, id2word=id2word, passes=10)

# Printing the topics and their top words
topics = lda_model.print_topics()
print("Topics and their top words:")
for topic in topics:
    print(topic)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Topics and their top words:
(0, '0.026*"product" + 0.026*"great" + 0.026*"could" + 0.026*"terrible" + 0.026*"today" + 0.026*"beach" + 0.026*"okay" + 0.026*"hate" + 0.026*"night" + 0.026*"masterpiece"')
(1, '0.026*"product" + 0.026*"great" + 0.026*"could" + 0.026*"terrible" + 0.026*"hate" + 0.026*"perfect" + 0.026*"today" + 0.026*"okay" + 0.026*"book" + 0.026*"beach"')
(2, '0.095*"could" + 0.095*"product" + 0.095*"put" + 0.095*"masterpiece" + 0.095*"book" + 0.095*"hate" + 0.095*"terrible" + 0.011*"great" + 0.011*"beach" + 0.011*"time"')
(3, '0.081*"great" + 0.081*"city" + 0.081*"breathtaking" + 0.081*"beautiful" + 0.081*"friends" + 0.081*"time" + 0.081*"skyline" + 0.081*"beach" + 0.081*"night" + 0.009*"product"')
(4, '0.026*"product" + 0.026*"great" + 0.026*"could" + 0.026*"terrible" + 0.026*"today" + 0.026*"beach" + 0.026*"hate" + 0.026*"book" + 0.026*"skyline" + 0.026*"perfect"')
(5, '0.087*"night" + 0.087*"band" + 0.087*"concert" + 0.087*"played" + 0.087*"amazing" + 0.087*"last" + 0.
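
To make the coherence score that drives the choice of K less opaque, here is a minimal sketch of the simpler UMass coherence measure with add-one smoothing. It is not the `c_v` measure used in the cell above, which is more involved. The function name `umass_coherence` and the toy documents are illustrative assumptions, not part of gensim or of this exercise.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Average UMass score over ordered pairs of a topic's top words.

    score(w_i, w_j) = log((D(w_i, w_j) + 1) / D(w_j)), where w_j is ranked
    before w_i and D counts the documents containing the word(s).
    Higher scores mean the top words co-occur more often.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # Pair each word with every word ranked before it.
    for j, i in combinations(range(len(top_words)), 2):
        earlier, later = top_words[j], top_words[i]
        d_earlier = doc_count(earlier)
        if d_earlier == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_count(later, earlier) + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy documents (assumed for illustration), already tokenized.
docs = [
    ["product", "terrible", "hate"],
    ["okay", "product", "great"],
    ["concert", "night", "band", "played"],
    ["city", "skyline", "night", "beautiful"],
]

coherent = umass_coherence(["concert", "band", "played"], docs)
mixed = umass_coherence(["concert", "product", "skyline"], docs)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

A topic whose top words appear together in the same documents scores higher than one whose words never co-occur, which is the property the K search relies on.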

## (2) (10 points) Generate K topics by using LSA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy

texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

# Preprocessing the text data
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [token for token in tokens if len(token) > 1]  # Remove single-character words
    tokens = [token for token in tokens if not token.isdigit()]  # Remove numbers
    tokens = [token for token in tokens if token.isalpha()]  # Remove punctuation
    tokens = [token.lower() for token in tokens]  # Convert to lowercase
    return tokens

processed_texts = [preprocess_text(text) for text in texts]

# Creating a dictionary and a corpus
id2word = corpora.Dictionary(processed_texts)
corpus = [id2word.doc2bow(text) for text in processed_texts]

# Determining the optimal number of topics using coherence score
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    for num_topics in range(start, limit, step):
        model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()
        coherence_values.append((num_topics, coherence_score))
    return coherence_values

limit = 10
start = 2
step = 1
coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=processed_texts, start=start, limit=limit, step=step)

# Finding the number of topics with the highest coherence score
best_num_topics = max(coherence_values, key=lambda x: x[1])[0]

# Training the LSA model with the best number of topics
lsa_model = LsiModel(corpus=corpus, id2word=id2word, num_topics=best_num_topics)

# Printing the best number of topics
print("Optimal number of topics:", best_num_topics)

# Printing the topics and their top words
topics = lsa_model.print_topics()
for topic in topics:
    print(topic)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Optimal number of topics: 9
(0, '0.455*"better" + 0.363*"product" + 0.305*"could" + 0.264*"today" + 0.250*"bad" + 0.250*"love" + 0.250*"really" + 0.250*"good" + 0.205*"get" + 0.205*"feeling"')
(1, '-0.433*"night" + -0.332*"songs" + -0.332*"concert" + -0.332*"played" + -0.332*"last" + -0.332*"amazing" + -0.332*"band" + -0.332*"best" + -0.101*"city" + -0.101*"breathtaking"')
(2, '0.409*"today" + -0.350*"product" + -0.269*"could" + 0.263*"get" + 0.263*"hope" + 0.263*"soon" + 0.263*"well" + 0.263*"feeling" + -0.199*"really" + -0.199*"love"')
(3, '0.442*"beautiful" + 0.442*"city" + 0.442*"breathtaking" + 0.442*"skyline" + 0.308*"night" + -0.134*"concert" + -0.134*"amazing" + -0.134*"played" + -0.134*"songs" + -0.134*"band"')
(4, '-0.502*"great" + -0.301*"time" + -0.301*"friends" + -0.301*"beach" + -0.243*"weather" + -0.243*"picnic" + -0.243*"lovely" + -0.243*"perfect" + -0.235*"product" + -0.201*"okay"')
(5, '0.347*"perfect" + 0.347*"weather" + 0.347*"picnic" + 0.347*"lovely" + -0.305*"grea
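
The run above selects K = 9 for a corpus of nine documents, which may mean the coherence curve is flat or noisy rather than that nine distinct topics exist. One possible way to guard against this, sketched below, is to prefer the smallest K whose score is within a small tolerance of the best score. The helper `pick_num_topics`, the `tolerance` parameter, and the example scores are all assumptions for illustration; the scores are placeholders, not values measured in this notebook.

```python
def pick_num_topics(coherence_values, tolerance=0.01):
    """Return the smallest K whose score is within `tolerance` of the best score.

    `coherence_values` is a list of (num_topics, score) pairs, the same shape
    that compute_coherence_values() returns in the cell above.
    """
    if not coherence_values:
        raise ValueError("coherence_values is empty")
    best_score = max(score for _, score in coherence_values)
    candidates = [k for k, score in coherence_values
                  if best_score - score <= tolerance]
    return min(candidates)

# Placeholder scores for illustration only (not measured in this notebook).
example_values = [(2, 0.41), (3, 0.47), (4, 0.465), (5, 0.44), (6, 0.47)]

for k, score in example_values:
    print(f"K={k}: coherence={score:.3f}")
print("Chosen K:", pick_num_topics(example_values))
```

Printing the full table of scores next to the chosen K also makes it easier to justify the choice when summarizing the topics.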

## (3) (10 points) Generate K topics by using lda2vec. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
pip install lda2vec



In [None]:
pip install gensim --upgrade



In [None]:
!pip install pyLDAvis



In [None]:
!pip install preprocess



In [None]:
# importing necessary libraries

import pyLDAvis
import numpy as np
import nltk
nltk.download('all')
pyLDAvis.enable_notebook()


In [None]:
top = 10
topic_to_topwords = {}

texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

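# Note: this loop treats each document as one "topic" and lists its most
# frequent raw tokens; it does not train an lda2vec model.
for k, topic in enumerate(texts):
    # Tokenizing the text and calculating the top words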
    words = nltk.word_tokenize(topic)
    word_freq = nltk.FreqDist(words)
    top_words = [word for word, freq in word_freq.most_common(top)]

    msg = 'Topic %i has top words: %s' % (k, ', '.join(top_words))
    print(msg)
    topic_to_topwords[k] = top_words

Topic 0 has top words: ., The, product, is, really, good, I, love, it, !
Topic 1 has top words: This, product, is, terrible, ., I, hate, it, !
Topic 2 has top words: It, 's, an, okay, product, ,, not, great, .
Topic 3 has top words: The, weather, is, lovely, today, ., Perfect, for, a, picnic
Topic 4 has top words: I, had, a, great, time, at, the, beach, with, my
Topic 5 has top words: ., This, book, is, a, masterpiece, I, could, n't, put
Topic 6 has top words: The, ., concert, last, night, was, amazing, band, played, their
Topic 7 has top words: I, ., 'm, not, feeling, well, today, hope, get, better
Topic 8 has top words: ., The, city, skyline, at, night, is, breathtaking, It, 's
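
The top words printed above are dominated by punctuation and function words because raw tokens are counted as-is. A minimal sketch of the same per-document count with lowercasing and filtering follows. The small `STOP_WORDS` set and the helper name `top_content_words` are illustrative assumptions (the earlier cells use NLTK's stop-word list instead), and this remains a per-document word frequency, not an lda2vec model.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; NLTK's list is far more complete.
STOP_WORDS = {"the", "is", "it", "i", "a", "an", "this", "at", "was", "their",
              "not", "but", "be", "so", "for", "with", "my", "had"}

def top_content_words(text, top=10):
    """Return up to `top` most frequent alphabetic, non-stop-word tokens."""
    # Keep only runs of letters, which drops punctuation, digits, and emoji.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
    return [word for word, _ in counts.most_common(top)]

sample = "The concert last night was amazing. The band played their best songs."
print(top_content_words(sample))
```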




## (4) (10 points) Generate K topics by using BERTopic. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
pip install bertopic





In [None]:
!pip install --upgrade joblib





In [None]:
# importing libraries

import pandas as pd

# Defining some list of sentences for creating a corpus
sentences = [
    "This is the first sentence.",
    "Here is the second sentence.",
    "A third sentence for testing.",
    "And a fourth sentence for variety.",
    "Fifth sentence, just to add more data.",
    "Another sentence to make it six.",
    "Seventh sentence, almost there.",
    "Eighth sentence for the example.",
    "Ninth sentence, we're getting closer.",
    "Tenth sentence, halfway through.",
    "Eleventh sentence to keep going.",
    "Twelfth sentence, still more to come.",
    "Thirteenth sentence, lucky number.",
    "Fourteenth sentence, almost done.",
    "Fifteenth sentence, just a few more.",
    "Sixteenth sentence, getting there.",
    "Seventeenth sentence, almost finished.",
    "Eighteenth sentence, so close.",
    "Nineteenth sentence, penultimate.",
    "Twentieth sentence, last one.",
    "Twenty-first sentence, the first of the second set.",
    "Twenty-second sentence, continuing the second set.",
    "Twenty-third sentence, adding more.",
    "Twenty-fourth sentence, almost done with the second set.",
    "Twenty-fifth sentence, last of the second set.",
    "Twenty-sixth sentence, beginning the third set.",
    "Twenty-seventh sentence, ongoing.",
    "Twenty-eighth sentence, more to go.",
    "Twenty-ninth sentence, not stopping yet.",
    "Thirtieth sentence, third set's halfway point.",
    "Thirty-first sentence, picking up speed.",
    "Thirty-second sentence, still more to come.",
    "Thirty-third sentence, almost there.",
    "Thirty-fourth sentence, getting closer.",
    "Thirty-fifth sentence, close to the end.",
    "Thirty-sixth sentence, wrapping up the third set.",
    "Thirty-seventh sentence, starting the fourth set.",
    "Thirty-eighth sentence, not stopping now.",
    "Thirty-ninth sentence, fourth set's halfway point.",
    "Fortieth sentence, making progress.",
    "Forty-first sentence, keeping going.",
    "Forty-second sentence, just a few more to go.",
    "Forty-third sentence, almost there.",
    "Forty-fourth sentence, getting closer to the end.",
    "Forty-fifth sentence, almost done.",
    "Forty-sixth sentence, penultimate in the fourth set.",
    "Forty-seventh sentence, last one in the fourth set.",
    "Forty-eighth sentence, starting the fifth set.",
    "Forty-ninth sentence, not stopping now.",
    "Fiftieth sentence, last one for the example.",
]

# Create a DataFrame with a column containing the sentences
data_frame = pd.DataFrame({'Sentences': sentences})

# Print the DataFrame
print(data_frame)

                                            Sentences
0                         This is the first sentence.
1                        Here is the second sentence.
2                       A third sentence for testing.
3                  And a fourth sentence for variety.
4              Fifth sentence, just to add more data.
5                    Another sentence to make it six.
6                     Seventh sentence, almost there.
7                    Eighth sentence for the example.
8               Ninth sentence, we're getting closer.
9                    Tenth sentence, halfway through.
10                   Eleventh sentence to keep going.
11              Twelfth sentence, still more to come.
12                 Thirteenth sentence, lucky number.
13                  Fourteenth sentence, almost done.
14               Fifteenth sentence, just a few more.
15                 Sixteenth sentence, getting there.
16             Seventeenth sentence, almost finished.
17                     Eight



In [None]:
!pip install --upgrade tensorflow





In [None]:
from bertopic import BERTopic


corpus = data_frame.Sentences.to_list()
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(corpus)


2023-11-07 23:57:45,914 - BERTopic - Transformed documents to Embeddings
2023-11-07 23:57:50,320 - BERTopic - Reduced dimensionality
2023-11-07 23:57:50,331 - BERTopic - Clustered reduced embeddings


In [None]:
for topic, prob in zip(topics, probs):
    print(f"Topic: {topic}, Probability: {prob}")

Topic: 1, Probability: [5.96026351e-309 1.00000000e+000]
Topic: -1, Probability: [0.14379035 0.66898313]
Topic: -1, Probability: [0.20068501 0.61210311]
Topic: 1, Probability: [4.13438674e-309 1.00000000e+000]
Topic: 0, Probability: [0.39278415 0.49856449]
Topic: 1, Probability: [6.28791633e-309 1.00000000e+000]
Topic: 0, Probability: [0.73245176 0.19722432]
Topic: -1, Probability: [0.22898339 0.52500104]
Topic: 0, Probability: [1.00000000e+000 4.78521097e-309]
Topic: 0, Probability: [0.61435095 0.27775248]
Topic: 0, Probability: [0.60604337 0.31791562]
Topic: 0, Probability: [1.00000000e+000 5.69318566e-309]
Topic: -1, Probability: [0.29172695 0.4239999 ]
Topic: -1, Probability: [0.23984051 0.46030979]
Topic: -1, Probability: [0.31927874 0.39397789]
Topic: -1, Probability: [0.38711595 0.37043326]
Topic: -1, Probability: [0.24687979 0.45790246]
Topic: -1, Probability: [0.21870974 0.66156848]
Topic: 1, Probability: [6.18653171e-309 1.00000000e+000]
Topic: 1, Probability: [5.98564394e-30
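
In the output above, topic `-1` is BERTopic's label for outlier documents that were not assigned to any cluster, and each probability row has one entry per discovered topic (here topics 0 and 1). The sketch below summarizes such output using the first eight rows printed above as hard-coded sample data. `summarize_assignments` is a hypothetical helper written for this illustration, not part of BERTopic.

```python
from collections import Counter

def summarize_assignments(topics, probs):
    """Count documents per topic and find the most probable topic for each outlier."""
    counts = Counter(topics)
    outlier_best = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            # Index of the largest probability = the closest non-outlier topic.
            best = max(range(len(row)), key=lambda i: row[i])
            outlier_best.append((best, row[best]))
    return counts, outlier_best

# First eight rows of the output above, copied by hand.
topics = [1, -1, -1, 1, 0, 1, 0, -1]
probs = [
    [5.96026351e-309, 1.0],
    [0.14379035, 0.66898313],
    [0.20068501, 0.61210311],
    [4.13438674e-309, 1.0],
    [0.39278415, 0.49856449],
    [6.28791633e-309, 1.0],
    [0.73245176, 0.19722432],
    [0.22898339, 0.52500104],
]

counts, outlier_best = summarize_assignments(topics, probs)
print("Documents per topic:", dict(counts))
print("Closest topic for each outlier:", outlier_best)
```

A large share of `-1` assignments, as in this sample, suggests the corpus is too small or too uniform for the clustering step to separate topics cleanly.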



## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? You should explain the reasons in detail.

In [None]:
# Write your answer here (no code needed for this question)

'''When comparing LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis) for topic modeling, the choice depends on your specific needs. LDA offers interpretability with explicit topic-word probabilities and document-topic distributions, making it valuable when you want to understand the underlying topics in a corpus. LSA, on the other hand, excels at capturing semantic relationships and reducing dimensionality in the data but may lack the interpretability of LDA due to the absence of topic-word probabilities. The choice between the two algorithms should be driven by your project's objectives and whether you prioritize clear topic interpretation (LDA) or semantic similarity capture (LSA).

In comparison to LDA and LSA, BERTopic and lda2vec are more recent, embedding-based topic modeling approaches. BERTopic builds on context-aware embeddings from a pre-trained transformer model, so it can capture finer semantic relationships, but it may require substantial computational resources. lda2vec combines word2vec-style word embeddings with LDA-style document-topic mixtures, offering topic interpretability while using the distributional semantics of words. Both suit tasks that need a deeper understanding of text and context, although they can be computationally intensive. As with LDA and LSA, the choice depends on the project goals, the available resources, and the balance needed between interpretability and semantic understanding.


Latent Dirichlet Allocation (LDA):
Latent Dirichlet Allocation (LDA) is a well-established and widely used topic modeling technique. It provides interpretable topics by assigning a probability distribution of words to each topic. LDA is computationally efficient and works well with large text corpora. However, it assumes a bag-of-words model, which doesn't consider word order or semantics. This can lead to less accurate topic modeling, particularly in modern text data with complex language structures and semantics. LDA is suitable when you prioritize simplicity and have a large, well-structured corpus of text data where interpretability is crucial.

Latent Semantic Analysis (LSA):
Latent Semantic Analysis (LSA) captures the underlying semantic structure in text data and is useful for reducing dimensionality in large datasets. LSA focuses on capturing latent semantic relationships among words and documents. However, LSA doesn't provide direct topic interpretability. It often requires additional techniques to identify and label topics. Additionally, LSA is limited in handling non-linear relationships in the data and struggles to capture word order and semantic nuances. LSA is a valuable choice when you need dimensionality reduction and have a large text dataset but don't require explicit topic labels.

lda2vec:
lda2vec combines ideas from LDA and word2vec to capture semantic relationships between words and documents. Because it learns word vectors from local context windows, it can reflect word co-occurrence and semantics more effectively than traditional bag-of-words LDA. This makes it suitable for tasks where capturing contextual information is essential. However, lda2vec may require more data and tuning to achieve good results. It can also be computationally intensive, making it less suitable for resource-constrained environments. lda2vec is a reasonable choice when you want both interpretability and the ability to capture more complex relationships in your text data.

BERTopic (using BERT embeddings):
BERTopic uses embeddings from pre-trained transformer models (BERT-style sentence embeddings) to capture contextual and semantic information in text data, clusters those embeddings, and describes each cluster with its most characteristic words. Because the embeddings reflect context, it is well suited to modern text with complex language, including short and noisy documents. However, it requires more computational resources than LDA or LSA, and documents that fit no cluster are labeled as outliers (topic -1). BERTopic is a strong choice when you prioritize topic interpretability and need to capture the nuanced meaning of words and phrases in your text data.

'''
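
The bag-of-words limitation noted for LDA above can be seen in a minimal sketch: two sentences with opposite meanings but the same words produce identical count vectors. The example sentences and the helper `bag_of_words` are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())

a = bag_of_words("the band was good not bad")
b = bag_of_words("the band was bad not good")
print(a == b)
```

Models that start from such counts (LDA, LSA) cannot tell these two sentences apart, whereas embedding-based models at least have access to the word order.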


