<a href="https://colab.research.google.com/github/MohanaSrinitha/Mohana_INF05731_Spring2024/blob/main/Shaga_Mohana_Exercise_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenizer itself drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    # texts is already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words]
            for doc in texts]

# Sample data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Preprocess the sample data
data_words = list(sent_to_words(sample_data))

# Remove stop words
data_words = remove_stopwords(data_words)

import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

from pprint import pprint

# Number of topics
num_topics = 2  # You can adjust the number of topics as needed

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                         id2word=id2word,
                                         num_topics=num_topics)

# Print the Keyword in the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[(0,
  '0.074*"movie" + 0.074*"fantastic" + 0.073*"minute" + 0.072*"absolutely" + '
  '0.072*"loved" + 0.072*"every" + 0.068*"moments" + 0.067*"mediocre" + '
  '0.067*"overall" + 0.066*"sure"'),
 (1,
  '0.090*"sense" + 0.090*"acting" + 0.089*"plot" + 0.089*"terrible" + '
  '0.087*"made" + 0.052*"feel" + 0.052*"film" + 0.050*"sure" + 0.050*"overall" '
  '+ 0.049*"mediocre"')]
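The question asks for K to be chosen by coherence score, which the cell above does not compute. As a rough illustration of what such a score measures, the sketch below implements a UMass-style coherence in plain Python over the three tokenized sample reviews (as reflected in the topic output above). The function name `umass_coherence` and the two example word lists are hypothetical, and library implementations such as gensim's `CoherenceModel` may differ in smoothing and aggregation, so the numbers are illustrative only.

```python
import math

# Tokenized sample reviews after stop word removal.
docs = [
    ["movie", "absolutely", "fantastic", "loved", "every", "minute"],
    ["acting", "terrible", "plot", "made", "sense"],
    ["sure", "feel", "film", "moments", "overall", "mediocre"],
]

def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs in the corpus; skip the pair
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score

# Words drawn from one review versus words drawn from three different reviews.
same_review = umass_coherence(["movie", "fantastic", "loved"], docs)
mixed_reviews = umass_coherence(["movie", "plot", "film"], docs)
print(same_review, mixed_reviews)
```

Words that co-occur in the same documents score higher than words that never appear together. To pick K, one could train a model for each candidate K, score each model's top words this way (or with `CoherenceModel`), and keep the K with the highest average score.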


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [3]:
# Write your code here
import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string, strip_short, stem_text
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Sample data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Create a DataFrame for the sample data
df = pd.DataFrame({'Review': sample_data})

# Preprocess the text with a fixed chain of gensim filters
def preprocess(text):
    CUSTOM_FILTERS = [lambda x: x.lower(), remove_stopwords, strip_punctuation, strip_short, stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)
    return text

# Apply the preprocessing to the sample data
df['Review_Text (Clean)'] = df['Review'].apply(lambda x: preprocess(x))

# Create a dictionary with the corpus
corpus = df['Review_Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# Convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]

# Coherence scores estimate how interpretable the topics are to humans.
# Compute the coherence score for each candidate number of topics
for i in range(2, 11):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=df['Review_Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

# Perform SVD on the bag of words with the LsiModel to extract 2 topics
lsi = LsiModel(bow, num_topics=2, id2word=dictionary)

# Find the 10 words with the strongest association to the derived topics
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))

# Display the cleaned, tokenized reviews (this shows the corpus, not per-topic scores)
corpus



Coherence score with 2 clusters: 0.2974621508745987
Coherence score with 3 clusters: 0.2974621508745987
Coherence score with 4 clusters: 0.2974621508745987
Coherence score with 5 clusters: 0.2974621508745987
Coherence score with 6 clusters: 0.2974621508745987
Coherence score with 7 clusters: 0.2974621508745987
Coherence score with 8 clusters: 0.2974621508745987
Coherence score with 9 clusters: 0.29746215087459865
Coherence score with 10 clusters: 0.2974621508745987
Words in 0: -0.408*"overal" + -0.408*"moment" + -0.408*"mediocr" + -0.408*"feel" + -0.408*"sure" + -0.408*"film" + 0.000*"plot" + 0.000*"act" + 0.000*"sens" + 0.000*"terribl".
Words in 1: 0.447*"love" + 0.447*"movi" + 0.447*"minut" + 0.447*"absolut" + 0.447*"fantast" + -0.000*"plot" + -0.000*"act" + -0.000*"mediocr" + -0.000*"sens" + -0.000*"feel".


0          [movi, absolut, fantast, love, minut]
1                     [act, terribl, plot, sens]
2    [sure, feel, film, moment, overal, mediocr]
Name: Review_Text (Clean), dtype: object
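With only three documents, the term-document matrix has rank at most three, so SVD cannot produce more than three meaningful topics no matter how large K is; this probably contributes to the flat coherence values above. The NumPy sketch below (a plain SVD, not gensim's `LsiModel` implementation) rebuilds the term-document matrix from the stemmed tokens shown above to make that visible. The variable names are illustrative.

```python
import numpy as np

# Stemmed tokens copied from the cleaned reviews printed above.
docs = [
    ["movi", "absolut", "fantast", "love", "minut"],
    ["act", "terribl", "plot", "sens"],
    ["sure", "feel", "film", "moment", "overal", "mediocr"],
]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[index[w], j] += 1

# Thin SVD: the columns of U hold the term loadings of each latent topic.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", np.round(s, 3))
for k in range(len(s)):
    top = np.argsort(-np.abs(U[:, k]))[:5]
    print(k, [(vocab[i], round(float(U[i, k]), 3)) for i in top])
```

Because the three reviews share no terms, each latent topic lines up with a single review. The loadings of the two largest components have magnitude 1/sqrt(6) ≈ 0.408 and 1/sqrt(5) ≈ 0.447, which matches the `LsiModel` output above up to sign.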

## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [4]:
# Write your code here
import nltk
nltk.download('all')
!pip install preprocess
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

Collecting preprocess
  Downloading preprocess-2.0.0-py3-none-any.whl (12 kB)
Installing collected packages: preprocess
Successfully installed preprocess-2.0.0


In [None]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting tzdata>=2022.7 (from pandas>=2.0.0->pyLDAvis)
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: funcy, tzdata, pandas, pyLDAvis
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERRO

In [None]:
import pyLDAvis
pyLDAvis.enable_notebook()
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Preprocess the sample data (preprocess() is defined in the Question 2 cell)
sample_data_clean = [preprocess(text) for text in sample_data]

# Join the preprocessed texts into a list of strings
sample_data_clean_text = [' '.join(text) for text in sample_data_clean]

# Create a CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data_clean_text)

# NOTE: each row of X is one document, so this ranks the most frequent terms
# per document as a stand-in for topics; it does not train an lda2vec model.
top = 10
topic_to_topwords = {}
feature_names = vectorizer.get_feature_names_out()
for j in range(X.shape[0]):
    top_words_indices = np.argsort(X[j].toarray()[0])[::-1][:top]
    top_words = [feature_names[i] for i in top_words_indices]
    msg = 'Topic %i has top words: %s' % (j, ', '.join(top_words))
    print(msg)
    topic_to_topwords[j] = top_words


Topic 0 has top words: movi, minut, love, fantast, absolut, terribl, sure, sens, plot, overal
Topic 1 has top words: terribl, sens, plot, act, sure, overal, movi, moment, minut, mediocr
Topic 2 has top words: sure, overal, moment, mediocr, film, feel, terribl, sens, plot, movi
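The cell above ranks terms by raw count within each document. For reference, the core idea of lda2vec is different: each document gets a vector that is a softmax-weighted mixture of learned topic vectors, and that document vector is added to a word vector to form the context used for prediction. The NumPy sketch below shows only this composition step with random, untrained values; all names, shapes, and numbers are hypothetical, and it is not the training procedure from the linked notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, dim = 3, 2, 8  # hypothetical sizes

# Randomly initialised parameters; in lda2vec these would be learned.
topic_vectors = rng.normal(size=(n_topics, dim))        # one vector per topic
doc_topic_logits = rng.normal(size=(n_docs, n_topics))  # unnormalised topic weights per document
word_vector = rng.normal(size=dim)                      # embedding of one pivot word

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each document's topic proportions sum to 1, as in LDA.
doc_topic_props = softmax(doc_topic_logits)

# Document vector = proportion-weighted sum of topic vectors.
doc_vectors = doc_topic_props @ topic_vectors

# Context vector = word vector + document vector.
context_vectors = word_vector + doc_vectors
print(doc_topic_props.round(3))
print(context_vectors.shape)
```

Reading off the largest entries of `doc_topic_props` for a document gives its dominant topics, which is what makes the learned mixture interpretable.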


## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)

In [None]:
# Write your code here
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic
import matplotlib.pyplot as plt

# Download NLTK data
nltk.download('stopwords')
nltk.download('punkt')

# Build the set of English stop words
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Raw posts, including headers, footers, and quoted replies.
# Passing remove=('headers', 'footers', 'quotes') would strip that metadata.
data = fetch_20newsgroups(subset='all')['data']

topic_model = BERTopic(nr_topics="auto", calculate_probabilities=True, verbose=True)
topics, _ = topic_model.fit_transform(data)

# One row per topic with its document count, sorted by count
topic_overview = topic_model.get_topic_freq()

# NOTE: [1:] skips the first (largest) row, which is not necessarily the
# outlier topic -1; filter on topic_num != -1 to exclude outliers instead.
for topic_num, freq in topic_overview[1:].values:
    topic_words = topic_model.get_topic(topic_num)
    topic_summary = ", ".join([word[0] for word in topic_words[:5]])
    print(f"Topic {topic_num}: {topic_summary} (Freq: {freq})")

2024-03-29 01:25:15,040 - BERTopic - Embedding - Transforming documents to embeddings.



2024-03-29 02:26:19,980 - BERTopic - Embedding - Completed ✓
2024-03-29 02:26:19,982 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-29 02:27:06,584 - BERTopic - Dimensionality - Completed ✓
2024-03-29 02:27:06,586 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-29 02:29:05,958 - BERTopic - Cluster - Completed ✓
2024-03-29 02:29:05,962 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-29 02:29:16,701 - BERTopic - Representation - Completed ✓
2024-03-29 02:29:16,710 - BERTopic - Topic reduction - Reducing number of topics
2024-03-29 02:29:27,642 - BERTopic - Topic reduction - Reduced number of topics from 375 to 44


Topic -1: the, to, of, and, in (Freq: 6502)
Topic 1: battery, batteries, concrete, acid, lead (Freq: 51)
Topic 2: oil, drain, changing, my, plug (Freq: 45)
Topic 3: wax, chain, scratches, plastic, paint (Freq: 43)
Topic 4: cpu, fan, heat, fans, sink (Freq: 42)
Topic 5: monitors, hours, day, 24, power (Freq: 37)
Topic 6: air, r12, heat, substitutes, conditioning (Freq: 27)
Topic 7: solvent, adhesive, ducttape, mek, carpet (Freq: 23)
Topic 8: uv, subliminal, tv, second, flashlight (Freq: 23)
Topic 9: pregnency, sex, teacher, biology, sperm (Freq: 22)
Topic 10: weight, chromium, fat, diet, wa7kgx (Freq: 22)
Topic 11: crohns, inflammation, ibd, disease, diet (Freq: 22)
Topic 12: ear, ears, hearing, it, earwax (Freq: 21)
Topic 13: blue, leds, boards, green, solder (Freq: 21)
Topic 14: maxaxaxaxaxaxaxaxaxaxaxaxaxaxax, mg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9vg9v, pwisemansalmonusdedu, cliff, 14 (Freq: 20)
Topic 15: kidney, stones, she, calcium, stone (Freq: 20)
Topic 16: 42, tiff, philosop
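BERTopic labels each cluster with the terms that score highest under a class-based TF-IDF (c-TF-IDF) weighting, which is where word lists like the ones above come from. The sketch below is a plain-Python approximation of that weighting on two hypothetical toy clusters; the cluster contents and the function name `c_tf_idf` are assumptions for illustration, and the library's exact normalization may differ.

```python
import math
from collections import Counter

# Hypothetical toy clusters: all tokens from the documents assigned to each topic.
clusters = {
    0: "battery acid lead battery concrete the of".split(),
    1: "oil drain plug oil changing the of".split(),
}

def c_tf_idf(clusters):
    """Weight each term by its in-cluster frequency times log(1 + A / f_t)."""
    term_counts = {c: Counter(tokens) for c, tokens in clusters.items()}

    # f_t: frequency of each term across all clusters.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)

    # A: average number of tokens per cluster.
    avg_words = sum(len(tokens) for tokens in clusters.values()) / len(clusters)

    scores = {}
    for c, counts in term_counts.items():
        n_tokens = sum(counts.values())
        scores[c] = {
            term: (n / n_tokens) * math.log(1 + avg_words / total[term])
            for term, n in counts.items()
        }
    return scores

scores = c_tf_idf(clusters)
ranked = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Terms that appear in every cluster, such as "the", receive a lower weight than cluster-specific terms. When most documents land in the outlier topic -1, as in the output above, that bucket spans many subjects and tends to be described by generic words.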

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [None]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
# Write your code here
To compare the topic modeling techniques (LDA, NMF, LSA, and BERTopic), I considered topic quality and interpretability. LDA gives a word
probability distribution for each topic, which is easy to read. NMF and LSA work differently from LDA but still produce reasonably interpretable
topics. Because BERTopic depends on a BERT embedding model, its topics may not be semantically relevant in the same way as those of the standard
approaches. On topic coherence, NMF and LDA frequently score well, LSA typically scores lower, and BERTopic can score well depending on the BERT
model selected.

On scalability, LDA, NMF, and LSA can handle large datasets, while BERTopic may require a lot of processing power, particularly with large BERT
models. It is also worth examining each model's robustness and hyperparameter sensitivity for the project, as well as the availability of
pre-trained models. LDA is often considered the best option for interpretability because of its clear per-topic word probability distributions.
NMF and LSA are useful options when better scalability is needed along with good interpretability. BERTopic captures semantic context, but it may
require more computing power and its interpretability varies.

Based on these features and a comparison of the four models, I think LSA is the most effective.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
From this exercise I learned about different topic modeling algorithms. I used LDA, LSA, lda2vec, and BERTopic to analyze the data. Each algorithm
takes its own approach to the given text corpus. Because of this variation, we can better understand the different approaches and how they are used
in natural language processing (NLP).

Through this exercise, I came to understand feature extraction from text and the accompanying data analysis. By observing how each algorithm
processes the text and identifies topics, I gained a practical understanding of the complexity of text analysis. Applying these techniques also
deepened my understanding by highlighting the many ambiguities involved in topic modeling.

As I am still new to NLP, I had some difficulty implementing the algorithms and selecting appropriate hyperparameters, such as the number of
topics. However, I worked through all the errors and learned these techniques well, as far as I understand them.

This exercise is closely related to NLP, since it involves text analysis and feature extraction. Topic modeling is generally regarded as a
foundational task for discovering meaningful insights in text. These algorithms have many applications, such as recommendation systems and
content analysis. Therefore, understanding topic modeling algorithms is important and useful for NLP.





'''



