# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
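To make the idea of a coherence score concrete, here is a minimal sketch of a simplified UMass-style coherence written in plain Python. It is an illustration only: the function name `umass_coherence` and the small token lists are assumptions made for this example, and the formula is a simplified variant rather than the `c_v` measure computed by gensim's `CoherenceModel` (which also accepts `coherence='u_mass'`). A topic whose top words tend to appear in the same documents scores closer to zero.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Simplified UMass-style coherence for one topic's ranked top words.

    documents is a list of token lists. Scores closer to 0 mean the
    top words co-occur in the same documents more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # For each pair, the higher-ranked word conditions the lower-ranked one.
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never appears in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score

# Hypothetical, already-preprocessed documents (illustrative only).
docs = [
    ["product", "really", "good", "love", "bad", "could", "better"],
    ["product", "terrible", "hate"],
    ["okay", "product", "great"],
    ["weather", "lovely", "today", "perfect", "picnic"],
    ["feeling", "well", "today", "hope", "get", "better", "soon"],
]

coherent = umass_coherence(["product", "terrible", "hate"], docs)
mixed = umass_coherence(["product", "picnic", "hope"], docs)
print("co-occurring words:", round(coherent, 3))
print("unrelated words:   ", round(mixed, 3))
```

Choosing K by coherence then amounts to training one model per candidate K, averaging this kind of score over its topics, and keeping the K with the highest average. On a corpus of only a few short sentences the scores are noisy, so the selected K should be read with caution.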

In [1]:
!pip install gensim
!pip install nltk




In [2]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy

# Corpus
texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

# Preprocessing the corpus obtained from the earlier assignment
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [token for token in tokens if len(token) > 1]  # Remove single-character words
    tokens = [token for token in tokens if not token.isdigit()]  # Remove numbers
    tokens = [token for token in tokens if token.isalpha()]  # Remove punctuation
    tokens = [token.lower() for token in tokens]  # Convert to lowercase
    return tokens

processed_texts = [preprocess_text(text) for text in texts]

# Creating a dictionary and a corpus
id2word = corpora.Dictionary(processed_texts)
corpus = [id2word.doc2bow(text) for text in processed_texts]

# Finding the optimal number of topics using coherence scores
coherence_scores = {}
start_topic, end_topic = 2, 10

for num_topics in range(start_topic, end_topic + 1):
    lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=id2word, passes=10)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_texts, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores[num_topics] = coherence_score

# Finding the number of topics with the highest coherence score
optimal_num_topics = max(coherence_scores, key=coherence_scores.get)

# Training the final LDA model with the optimal number of topics
lda_model = gensim.models.LdaModel(corpus, num_topics=optimal_num_topics, id2word=id2word, passes=10)

# Printing the topics and their top words
topics = lda_model.print_topics()
print("Topics and their top words:")
for topic in topics:
    print(topic)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Topics and their top words:
(0, '0.026*"product" + 0.026*"great" + 0.026*"today" + 0.026*"could" + 0.026*"terrible" + 0.026*"okay" + 0.026*"beach" + 0.026*"night" + 0.026*"better" + 0.026*"put"')
(1, '0.092*"picnic" + 0.092*"perfect" + 0.092*"weather" + 0.092*"lovely" + 0.092*"hate" + 0.092*"terrible" + 0.092*"today" + 0.092*"product" + 0.008*"great" + 0.008*"okay"')
(2, '0.101*"better" + 0.101*"today" + 0.101*"well" + 0.101*"soon" + 0.101*"get" + 0.101*"feeling" + 0.101*"hope" + 0.009*"product" + 0.009*"great" + 0.009*"could"')
(3, '0.141*"could" + 0.074*"really" + 0.074*"bad" + 0.074*"good" + 0.074*"love" + 0.074*"put" + 0.074*"book" + 0.074*"masterpiece" + 0.074*"product" + 0.074*"better"')
(4, '0.026*"product" + 0.026*"great" + 0.026*"okay" + 0.026*"night" + 0.026*"today" + 0.026*"book" + 0.026*"better" + 0.026*"beach" + 0.026*"could" + 0.026*"hate"')
(5, '0.159*"product" + 0.159*"great" + 0.159*"okay" + 0.014*"today" + 0.014*"terrible" + 0.014*"could" + 0.014*"better" + 0.014*"mas

## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
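As a rough illustration of what LSA does under the hood, the sketch below builds a small term-document count matrix and factorizes it with a singular value decomposition using NumPy. The toy documents and variable names are assumptions made for this example; gensim's `LsiModel` uses its own incremental SVD implementation, so this is a conceptual sketch rather than a reproduction of it.

```python
import numpy as np

# Hypothetical, already-preprocessed documents (illustrative only).
docs = [
    ["product", "terrible", "hate"],
    ["okay", "product", "great"],
    ["concert", "night", "band"],
    ["city", "night", "skyline"],
]

vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[index[w], j] += 1

# Each column of U is one latent component ("topic") over the vocabulary.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of components to keep
for t in range(k):
    # Rank terms by the magnitude of their loading on component t.
    order = np.argsort(-np.abs(U[:, t]))[:3]
    print(t, [(vocab[i], round(float(U[i, t]), 3)) for i in order])
```

The sign of each singular vector is arbitrary, which is one reason LSA topics can show negative weights. What matters for interpretation is which terms have large loadings and whether they share a sign.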

In [3]:
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy

texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

# Preprocessing the text data
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [token for token in tokens if len(token) > 1]  # Remove single-character words
    tokens = [token for token in tokens if not token.isdigit()]  # Remove numbers
    tokens = [token for token in tokens if token.isalpha()]  # Remove punctuation
    tokens = [token.lower() for token in tokens]  # Convert to lowercase
    return tokens

processed_texts = [preprocess_text(text) for text in texts]

# Creating a dictionary and a corpus
id2word = corpora.Dictionary(processed_texts)
corpus = [id2word.doc2bow(text) for text in processed_texts]

# Determining the optimal number of topics using coherence score
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    for num_topics in range(start, limit, step):
        model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()
        coherence_values.append((num_topics, coherence_score))
    return coherence_values

limit = 10
start = 2
step = 1
coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=processed_texts, start=start, limit=limit, step=step)

# Finding the number of topics with the highest coherence score
best_num_topics = max(coherence_values, key=lambda x: x[1])[0]

# Training the LSA model with the best number of topics
lsa_model = LsiModel(corpus=corpus, id2word=id2word, num_topics=best_num_topics)

# Printing the best number of topics
print("Optimal number of topics:", best_num_topics)

# Printing the topics and their top words
topics = lsa_model.print_topics()
for topic in topics:
    print(topic)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Optimal number of topics: 6
(0, '0.455*"better" + 0.363*"product" + 0.305*"could" + 0.264*"today" + 0.250*"good" + 0.250*"really" + 0.250*"bad" + 0.250*"love" + 0.205*"get" + 0.205*"well"')
(1, '-0.433*"night" + -0.332*"songs" + -0.332*"amazing" + -0.332*"best" + -0.332*"concert" + -0.332*"last" + -0.332*"played" + -0.332*"band" + -0.101*"city" + -0.101*"beautiful"')
(2, '0.409*"today" + -0.350*"product" + -0.269*"could" + 0.263*"get" + 0.263*"feeling" + 0.263*"soon" + 0.263*"hope" + 0.263*"well" + -0.199*"bad" + -0.199*"good"')
(3, '0.442*"beautiful" + 0.442*"skyline" + 0.442*"city" + 0.442*"breathtaking" + 0.308*"night" + -0.134*"last" + -0.134*"concert" + -0.134*"played" + -0.134*"amazing" + -0.134*"band"')
(4, '0.502*"great" + 0.301*"beach" + 0.301*"friends" + 0.301*"time" + 0.243*"weather" + 0.243*"lovely" + 0.243*"perfect" + 0.243*"picnic" + 0.235*"product" + 0.201*"okay"')
(5, '-0.347*"weather" + -0.347*"picnic" + -0.347*"perfect" + -0.347*"lovely" + 0.305*"great" + 0.201*"frien

## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb
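For orientation, the following sketch shows the core representation idea described for lda2vec: each document is a softmax-weighted mixture of topic vectors, and the context vector used to predict nearby words is the sum of a word vector and that document vector. All arrays here are random placeholders and every name is hypothetical; training (the skip-gram loss and the Dirichlet sparsity term) is not shown, and this does not use the `lda2vec` package API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_words, dim = 4, 3, 10, 5

# Randomly initialized parameters; in a real model these are learned.
word_vectors = rng.normal(size=(n_words, dim))
topic_vectors = rng.normal(size=(n_topics, dim))
doc_weights = rng.normal(size=(n_docs, n_topics))  # unnormalized topic weights

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each row is a document's topic proportions and sums to 1.
doc_topic = softmax(doc_weights)

# A document vector is the proportion-weighted sum of topic vectors.
doc_vectors = doc_topic @ topic_vectors

def context_vector(word_id, doc_id):
    # Context = pivot word vector + vector of the document it occurs in.
    return word_vectors[word_id] + doc_vectors[doc_id]

ctx = context_vector(word_id=2, doc_id=1)
print(doc_topic.round(3))
print(ctx.round(3))
```

After training, a topic is typically summarized by the words whose vectors lie closest to its topic vector, which is what makes the document-topic proportions interpretable.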

In [4]:
pip install lda2vec

Collecting lda2vec
  Downloading lda2vec-0.16.10.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lda2vec
  Building wheel for lda2vec (setup.py) ... done
  Created wheel for lda2vec: filename=lda2vec-0.16.10-py3-none-any.whl size=14410 sha256=0ead3ace398feaeb2dd446516e5df3abb4ffc516648bab954528d2c78e54cc4f
  Stored in directory: /root/.cache/pip/wheels/1e/90/24/a97126c0fe8b479ba3bb79d3b18ebaab571a18d90bb2967ab6
Successfully built lda2vec
Installing collected packages: lda2vec
Successfully installed lda2vec-0.16.10


In [5]:
pip install gensim --upgrade



In [6]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting tzdata>=2022.7 (from pandas>=2.0.0->pyLDAvis)
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: funcy, tzdata, pandas, pyLDAvis
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
[31mERRO

In [7]:
!pip install preprocess

Collecting preprocess
  Downloading preprocess-2.0.0-py3-none-any.whl (12 kB)
Installing collected packages: preprocess
Successfully installed preprocess-2.0.0


In [8]:
import pyLDAvis
import numpy as np
import nltk
nltk.download('all')
pyLDAvis.enable_notebook()

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

In [9]:
top = 10
topic_to_topwords = {}

texts = [
    "The product is really good. I love it! Not bad, but could be better. 😊",
    "This product is terrible. I hate it!",
    "It's an okay product, not great.",
    "The weather is lovely today. Perfect for a picnic!",
    "I had a great time at the beach with my friends.",
    "This book is a masterpiece. I couldn't put it down.",
    "The concert last night was amazing. The band played their best songs.",
    "I'm not feeling well today. I hope I get better soon.",
    "The city skyline at night is breathtaking. It's so beautiful.",
]

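# Note: this loop treats each document as its own "topic" and ranks its raw
# tokens by frequency. It is a per-document frequency baseline, not an lda2vec model.
for k, topic in enumerate(texts):
    # Tokenizing the text and calculating the top words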
    words = nltk.word_tokenize(topic)
    word_freq = nltk.FreqDist(words)
    top_words = [word for word, freq in word_freq.most_common(top)]

    msg = 'Topic %i has top words: %s' % (k, ', '.join(top_words))
    print(msg)
    topic_to_topwords[k] = top_words

Topic 0 has top words: ., The, product, is, really, good, I, love, it, !
Topic 1 has top words: This, product, is, terrible, ., I, hate, it, !
Topic 2 has top words: It, 's, an, okay, product, ,, not, great, .
Topic 3 has top words: The, weather, is, lovely, today, ., Perfect, for, a, picnic
Topic 4 has top words: I, had, a, great, time, at, the, beach, with, my
Topic 5 has top words: ., This, book, is, a, masterpiece, I, could, n't, put
Topic 6 has top words: The, ., concert, last, night, was, amazing, band, played, their
Topic 7 has top words: I, ., 'm, not, feeling, well, today, hope, get, better
Topic 8 has top words: ., The, city, skyline, at, night, is, breathtaking, It, 's




## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize what the topics are.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing
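BERTopic embeds the documents, clusters the embeddings, and then describes each cluster with a class-based TF-IDF (c-TF-IDF) score. The sketch below illustrates only that last step in plain Python, following the formula given in the BERTopic documentation: term frequency within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. The cluster assignments and names are hypothetical and hand-written for the example; in practice they would come from the clustering step.

```python
import math
from collections import Counter

# Hypothetical clusters of already-tokenized documents (illustrative only).
clusters = {
    "reviews": [["product", "terrible", "hate"], ["okay", "product", "great"]],
    "outdoors": [["weather", "lovely", "picnic"], ["great", "time", "beach"]],
}

# Merge each cluster's documents into a single bag of words.
class_counts = {
    name: Counter(w for doc in docs for w in doc)
    for name, docs in clusters.items()
}

# Frequency of each word across all clusters.
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)

# Average number of words per cluster (A in the formula).
avg_words_per_class = sum(total_counts.values()) / len(class_counts)

def c_tf_idf(word, cls):
    # Frequent in this cluster and rare elsewhere gives a high score.
    tf = class_counts[cls][word]
    return tf * math.log(1 + avg_words_per_class / total_counts[word])

top = {
    cls: sorted(counts, key=lambda w: -c_tf_idf(w, cls))[:3]
    for cls, counts in class_counts.items()
}
print(top)
```

Words shared between clusters, such as "great" here, are down-weighted relative to words that occur in only one cluster, which is why the top words tend to be distinctive for their topic.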

In [10]:
pip install bertopic



Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)

In [11]:
!pip install --upgrade joblib





In [12]:
# Importing libraries

import pandas as pd

# Defining some list of sentences for creating a corpus
sentences = [
    "This is the first sentence.",
    "Here is the second sentence.",
    "A third sentence for testing.",
    "And a fourth sentence for variety.",
    "Fifth sentence, just to add more data.",
    "Another sentence to make it six.",
    "Seventh sentence, almost there.",
    "Eighth sentence for the example.",
    "Ninth sentence, we're getting closer.",
    "Tenth sentence, halfway through.",
    "Eleventh sentence to keep going.",
    "Twelfth sentence, still more to come.",
    "Thirteenth sentence, lucky number.",
    "Fourteenth sentence, almost done.",
    "Fifteenth sentence, just a few more.",
    "Sixteenth sentence, getting there.",
    "Seventeenth sentence, almost finished.",
    "Eighteenth sentence, so close.",
    "Nineteenth sentence, penultimate.",
    "Twentieth sentence, last one.",
    "Twenty-first sentence, the first of the second set.",
    "Twenty-second sentence, continuing the second set.",
    "Twenty-third sentence, adding more.",
    "Twenty-fourth sentence, almost done with the second set.",
    "Twenty-fifth sentence, last of the second set.",
    "Twenty-sixth sentence, beginning the third set.",
    "Twenty-seventh sentence, ongoing.",
    "Twenty-eighth sentence, more to go.",
    "Twenty-ninth sentence, not stopping yet.",
    "Thirtieth sentence, third set's halfway point.",
    "Thirty-first sentence, picking up speed.",
    "Thirty-second sentence, still more to come.",
    "Thirty-third sentence, almost there.",
    "Thirty-fourth sentence, getting closer.",
    "Thirty-fifth sentence, close to the end.",
    "Thirty-sixth sentence, wrapping up the third set.",
    "Thirty-seventh sentence, starting the fourth set.",
    "Thirty-eighth sentence, not stopping now.",
    "Thirty-ninth sentence, fourth set's halfway point.",
    "Fortieth sentence, making progress.",
    "Forty-first sentence, keeping going.",
    "Forty-second sentence, just a few more to go.",
    "Forty-third sentence, almost there.",
    "Forty-fourth sentence, getting closer to the end.",
    "Forty-fifth sentence, almost done.",
    "Forty-sixth sentence, penultimate in the fourth set.",
    "Forty-seventh sentence, last one in the fourth set.",
    "Forty-eighth sentence, starting the fifth set.",
    "Forty-ninth sentence, not stopping now.",
    "Fiftieth sentence, last one for the example.",
]

# Create a DataFrame with a column containing the sentences
data_frame = pd.DataFrame({'Sentences': sentences})

# Print the DataFrame
print(data_frame)

                                            Sentences
0                         This is the first sentence.
1                        Here is the second sentence.
2                       A third sentence for testing.
3                  And a fourth sentence for variety.
4              Fifth sentence, just to add more data.
5                    Another sentence to make it six.
6                     Seventh sentence, almost there.
7                    Eighth sentence for the example.
8               Ninth sentence, we're getting closer.
9                    Tenth sentence, halfway through.
10                   Eleventh sentence to keep going.
11              Twelfth sentence, still more to come.
12                 Thirteenth sentence, lucky number.
13                  Fourteenth sentence, almost done.
14               Fifteenth sentence, just a few more.
15                 Sixteenth sentence, getting there.
16             Seventeenth sentence, almost finished.
17                     Eight



In [16]:
!pip install --upgrade tensorflow





In [22]:
!pip install bertopic





In [23]:
from bertopic import BERTopic


corpus = data_frame.Sentences.to_list()
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(corpus)



AttributeError: module 'keras._tf_keras.keras' has no attribute '__internal__'

In [24]:
for topic, prob in zip(topics, probs):
    print(f"Topic: {topic}, Probability: {prob}")



NameError: name 'probs' is not defined

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.
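As a possible starting point for such visualizations, the sketch below parses a topic string in the format printed by gensim's `print_topics()` into (word, weight) pairs and renders a plain-text bar chart. The helper name `parse_topic` is an assumption made for this example, and the text chart is only a stand-in: the same pairs could be passed to a plotting library to draw a real bar chart.

```python
import re

def parse_topic(topic_string):
    """Turn '0.141*"could" + 0.074*"really"' into [("could", 0.141), ...]."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# A shortened topic string in the gensim print_topics() format.
topic = '0.141*"could" + 0.074*"really" + 0.074*"bad"'
weights = parse_topic(topic)

# One '#' per 0.01 of weight gives a quick visual comparison of the words.
for word, weight in weights:
    print(f"{word:<10} {'#' * round(weight * 100)} {weight:.3f}")
```

The pattern also accepts negative weights, so it works for LSA topic strings as well as LDA ones.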

In [None]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [25]:
# Write your answer here (no code needed for this question)

'''When comparing LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis) for topic modeling, the choice depends on your specific needs. LDA offers interpretability with explicit topic-word probabilities and document-topic distributions, making it valuable when you want to understand the underlying topics in a corpus. LSA, on the other hand, captures latent semantic relationships and reduces dimensionality through a truncated SVD, but its components carry signed weights rather than probabilities, which makes them harder to read as topics. The choice between the two algorithms should be driven by your project's objectives and whether you prioritize clear topic interpretation (LDA) or semantic similarity capture (LSA).

In comparison to LDA and LSA, BERTopic and Lda2Vec are more recent, embedding-based topic modeling approaches. BERTopic clusters context-aware document embeddings from a pre-trained transformer model, so it can capture intricate semantic relationships, but it may require substantial computational resources. Lda2Vec combines the benefits of word2vec and LDA, offering topic interpretability while considering the distributional semantics of words. BERTopic and Lda2Vec are suitable for tasks requiring a deeper understanding of text data and context, although they can be computationally intensive. The choice between these methods, like LDA and LSA, depends on the specific project goals, resources, and the balance between interpretability and semantic understanding needed.


Latent Dirichlet Allocation (LDA):
Latent Dirichlet Allocation (LDA) is a well-established and widely used topic modeling technique. It provides interpretable topics by assigning a probability distribution of words to each topic. LDA is computationally efficient and works well with large text corpora. However, it assumes a bag-of-words model, which doesn't consider word order or semantics. This can lead to less accurate topic modeling, particularly in modern text data with complex language structures and semantics. LDA is suitable when you prioritize simplicity and have a large, well-structured corpus of text data where interpretability is crucial.

Latent Semantic Analysis (LSA):
Latent Semantic Analysis (LSA) captures the underlying semantic structure in text data and is useful for reducing dimensionality in large datasets. LSA focuses on capturing latent semantic relationships among words and documents. However, LSA doesn't provide direct topic interpretability. It often requires additional techniques to identify and label topics. Additionally, LSA is limited in handling non-linear relationships in the data and struggles to capture word order and semantic nuances. LSA is a valuable choice when you need dimensionality reduction and have a large text dataset but don't require explicit topic labels.

LDA2Vec:
LDA2Vec combines the strengths of LDA and word2vec: it learns word vectors from local context windows and represents each document as a mixture of topic vectors in the same space. This lets it capture more semantic information than bag-of-words LDA, which makes it suitable for tasks where word-level semantics are essential. However, LDA2Vec may require more data and tuning to achieve good results. It can also be computationally intensive, making it less ideal for resource-constrained environments. LDA2Vec is a good choice when you want both interpretability and the ability to capture complex relationships in your text data.

BERTopic (using BERT embeddings):
BERTopic uses embeddings from a pre-trained transformer model to capture contextual and semantic information, clusters those embeddings, and then describes each cluster with its most characteristic words. Because the embeddings are contextual, it handles word order and ambiguity better than bag-of-words methods, making it well-suited for modern text data with complex language structures. It often produces readable topics even for short and noisy text. However, it may require more computational resources because of the embedding step. BERTopic is a strong choice when you prioritize topic interpretability and need to capture the nuanced meaning of words and phrases in your text data, making it a valuable option for a wide range of NLP tasks.

'''



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [26]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''

Learning Experience:
Working with text data and extracting features using topic modeling algorithms such as BERTopic provided a valuable learning experience. It allowed me to understand how different algorithms can be utilized to uncover latent topics and patterns within textual data. Implementing these algorithms helped in grasping the nuances of feature extraction from text data, especially in the context of natural language processing (NLP). Understanding topics and extracting meaningful features is crucial for tasks like document clustering, text summarization, and sentiment analysis, which are fundamental in NLP.

Challenges Encountered:
While working on the exercise, I encountered specific challenges such as understanding the intricacies of topic modeling algorithms and tuning hyperparameters for optimal results. Additionally, working with large text corpora can sometimes lead to computational challenges, especially when dealing with memory constraints or processing time. However, these challenges provided opportunities to delve deeper into the algorithms and explore ways to optimize performance and efficiency.

Relevance to Your Field of Study:
This exercise is highly relevant to the field of Natural Language Processing (NLP). NLP focuses on enabling computers to understand, interpret, and generate human language, and topic modeling is a crucial technique within NLP for extracting meaningful information from text data. By learning and implementing topic modeling algorithms like BERTopic, I gained insights into how NLP techniques can be applied to real-world problems such as document clustering, text classification, and information retrieval. This exercise not only deepened my understanding of NLP concepts but also provided hands-on experience in applying these concepts to practical tasks involving text data analysis and feature extraction. Overall, it enhanced my skills and knowledge in the field of NLP and text analytics.





'''
