# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be chosen by coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [26]:
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Placeholder documents; replace this list with the corpus collected in the previous exercise
documents = ['The quick brown fox jumps over the lazy dog', 'Never odd or even', 'Don’t nod']


stop_words = stopwords.words('english')
texts = [[word for word in document.lower().split() if word not in stop_words] for document in documents]

# Create Dictionary
dictionary = corpora.Dictionary(texts)

# Create Corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# Build LDA model
# Note: num_topics is not set here, so gensim falls back to its default of 100 topics
lda_model = gensim.models.LdaMulticore(corpus, id2word=dictionary)

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print('Coherence Score: ', coherence_lda)

# Print the keywords in each topic
print(lda_model.print_topics())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Coherence Score:  0.3427752250857407
[(99, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (32, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (0, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (16, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (88, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (83, '0.091*"brown" + 0.091*"dog" + 0.091*"fox" + 0.091*"jumps" + 0.091*"lazy" + 0.091*"quick" + 0.091*"even" + 0.091*"never" + 0.091*"odd" + 0.091*"don’t"'), (89, '0.0
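
Choosing K by coherence means scoring several candidate models rather than one; the cell above does not set `num_topics`, so it evaluates a single default model. The following is a minimal, library-free sketch under stated assumptions: `umass_coherence` and `pick_best_k` are hypothetical helper names, the toy `docs` and the per-K top-word lists are hand-written stand-ins for trained models, and the score is a simplified UMass-style co-document measure rather than the `c_v` measure used above. With gensim, the same loop would presumably train one model per candidate K (passing `num_topics=k`) and read the score from `CoherenceModel`.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Mean UMass-style coherence for one topic's top words.

    For each word pair (w_l earlier, w_m later in the list) this adds
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents that
    contain all of the given words. Simplified for illustration.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for l, m in combinations(range(len(topic_words)), 2):
        w_l, w_m = topic_words[l], topic_words[m]
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(w_m, w_l) + 1) / d_l))
    return sum(scores) / len(scores) if scores else 0.0


def pick_best_k(topics_by_k, tokenized_docs):
    """Return (best_k, scores), where scores maps K to mean topic coherence."""
    scores = {}
    for k, topics in topics_by_k.items():
        per_topic = [umass_coherence(t, tokenized_docs) for t in topics]
        scores[k] = sum(per_topic) / len(per_topic)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy corpus (assumption: already tokenized and stop-word filtered)
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "trade"],
    ["stock", "market", "bank"],
]

# Hypothetical top words of models trained with K=1 and K=2
topics_by_k = {
    1: [["cat", "stock", "dog", "market"]],
    2: [["cat", "dog"], ["stock", "market"]],
}

best_k, scores = pick_best_k(topics_by_k, docs)
print("Best K:", best_k)
print("Coherence by K:", scores)
```

On this toy input the two-topic split scores higher because its word pairs co-occur in the same documents, whereas the single mixed topic pairs words that never appear together.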

## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be chosen by coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
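
As a rough illustration of what LSA computes, the sketch below extracts only the leading topic from a raw document-term count matrix using power iteration, which converges to the first right singular vector. It is a simplification under stated assumptions: `top_lsa_topic` is a hypothetical helper name, the toy `docs` are invented, and a real solution would typically apply TF-IDF weighting and a truncated SVD with K components (for example through gensim's `LsiModel`) before scoring coherence.

```python
import math


def top_lsa_topic(tokenized_docs, n_iter=100):
    """Return (vocab, weights) for the leading LSA topic.

    Builds a document-term count matrix A and runs power iteration on
    A^T A, whose dominant eigenvector is the first right singular
    vector of A (one weight per vocabulary term).
    """
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}

    # Rows are documents, columns are terms, entries are raw counts
    matrix = [[0.0] * len(vocab) for _ in tokenized_docs]
    for row, doc in zip(matrix, tokenized_docs):
        for w in doc:
            row[index[w]] += 1.0

    v = [1.0] * len(vocab)
    for _ in range(n_iter):
        # u = A v gives one score per document
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # v = A^T u gives one weight per term
        v = [
            sum(matrix[d][t] * u[d] for d in range(len(matrix)))
            for t in range(len(vocab))
        ]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return vocab, v


# Toy corpus (assumption: already tokenized and stop-word filtered)
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "pet", "vet"],
    ["cat", "dog"],
    ["stock", "market"],
]

vocab, weights = top_lsa_topic(docs)
ranked = sorted(zip(vocab, weights), key=lambda pair: -pair[1])
print([word for word, _ in ranked[:3]])
```

The dominant theme of the toy corpus (the pet-related documents) ends up with the largest term weights, while terms from the minority theme shrink toward zero in this first component.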

## Question 3 (10 Points)

**Generate K topics using lda2vec. The number of topics K should be chosen by coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

## Question 4 (10 Points)

**Generate K topics using BERTopic. The number of topics K should be chosen by coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [20]:
!pip install bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load data
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

# Create BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit to data and transform to topics
topics, probs = topic_model.fit_transform(docs)

# Get topic representation
topic_representation = topic_model.get_topic_info()
topic_representation.head(10)

# Get individual topics
individual_topics = topic_model.get_topics()




2024-03-30 01:50:10,298 - BERTopic - Embedding - Transforming documents to embeddings.


2024-03-30 02:25:28,426 - BERTopic - Embedding - Completed ✓
2024-03-30 02:25:28,428 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-30 02:26:09,274 - BERTopic - Dimensionality - Completed ✓
2024-03-30 02:26:09,276 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-30 02:26:54,185 - BERTopic - Cluster - Completed ✓
2024-03-30 02:26:54,214 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-30 02:27:02,730 - BERTopic - Representation - Completed ✓
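
The cell above fits BERTopic but stops before any coherence scoring. One possible intermediate step, sketched here, is to turn the mapping returned by `get_topics()` (topic id to a list of `(word, score)` pairs) into plain word lists that a coherence measure can consume. `topics_to_word_lists` is a hypothetical helper name and `example_topics` is a hand-written stand-in with made-up scores, not real model output; the sketch also assumes that topic `-1` holds BERTopic's outlier documents and should be skipped.

```python
def topics_to_word_lists(topics, top_n=10, skip_outliers=True):
    """Convert {topic_id: [(word, score), ...]} into {topic_id: [word, ...]}.

    Keeps only the first top_n words of each topic and, by default,
    drops topic -1 (assumed to be the outlier topic).
    """
    word_lists = {}
    for topic_id, word_scores in topics.items():
        if skip_outliers and topic_id == -1:
            continue
        word_lists[topic_id] = [word for word, _ in word_scores[:top_n]]
    return word_lists


# Hand-written stand-in for `individual_topics`; the scores are invented
example_topics = {
    -1: [("the", 0.01), ("and", 0.01)],
    0: [("game", 0.05), ("team", 0.04), ("season", 0.03)],
    1: [("key", 0.06), ("encryption", 0.05), ("chip", 0.04)],
}

word_lists = topics_to_word_lists(example_topics, top_n=2)
print("Number of topics (K):", len(word_lists))
print(word_lists)
```

The resulting lists, together with the tokenized documents, could then be passed to a coherence measure, and the fit repeated with different settings to compare values of K.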


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one performs best? Explain your reasoning in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
# Write your code here




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did implementing them help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [16]:
# Your answer here (no code for this question; answer the questions above in as much detail as possible):

'''
Learning Experience: This assignment taught me how to work with text data and apply topic modelling techniques effectively. Implementing LDA, LSA, lda2vec, and BERTopic helped me understand both their theoretical ideas and their practical applications. The exercises also showed me the complexities of feature extraction from text data and how different algorithms approach topic modelling in different ways.

Challenges: The main difficulty was understanding the complexity of newer algorithms, such as lda2vec and BERTopic, compared with conventional approaches such as LDA and LSA. These newer algorithms are more sophisticated, and interpreting their results requires a deeper understanding.

Relevance to Your Field of Study: This activity is directly relevant to Natural Language Processing (NLP). Topic modelling is a key NLP task that extracts relevant information and identifies hidden patterns in text. Understanding and implementing diverse topic modelling techniques is important for NLP practitioners because these techniques are widely used in applications such as document clustering, summarization, and information retrieval. As a result, the knowledge gained from this exercise transfers directly to real-world NLP tasks and research projects.

'''
