
# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [65]:
# Write your code here
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Prepare the text corpus

textdatas = ["I watched the movie first and then I discovered that there was book to the movie and those who know me know that I cant pass up a book/movie combo",
             "This book was totally unlike the movie I personally like the book more than the movie and even though the movie was good",
             "the book was so full of emotion it connects you with what is going on in Tessa's mind and it makes you want more and to know what is going to happen",
             "being almost 16 feel like people sugar coat what goes on in their first real relationship",
             "Anna Todd if your reading this thank you for writing such an amazing book. To sum this up I definitely recommend reading this book"]

# Tokenize the texts
tokenized_texts = [textdata.lower().split() for textdata in textdatas]

# Create a dictionary from the tokenized texts
dictionary = corpora.Dictionary(tokenized_texts)

# Convert the tokenized texts into a bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Find the optimal number of topics using coherence score
coherence_scores = {}
for k in range(2, 11):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                                                random_state=100, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

    coherence_model = CoherenceModel(model=lda_model, texts=tokenized_texts, dictionary=dictionary, coherence='c_v')

    coherence_scores[k] = coherence_model.get_coherence()
    print(f"coherence_scores[{k}]:", coherence_scores[k])

optimal_k = max(coherence_scores, key=coherence_scores.get)

print(f"The optimal number of topics is {optimal_k}")

# Train the LDA model with the optimal number of topics
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k,
                                            random_state=100, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

# Summarize the topics
topics = lda_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print(f"Topic {i+1}: {' '.join([w[0] for w in topic[1]])}")

coherence_scores[2]: 0.36765756896062923
coherence_scores[3]: 0.736029644819416
coherence_scores[4]: 0.6201629151127688
coherence_scores[5]: 0.618340544697902
coherence_scores[6]: 0.8070711526499893
coherence_scores[7]: 0.765308914569904
coherence_scores[8]: 0.5830118627211708
coherence_scores[9]: 0.6901764547444075
coherence_scores[10]: 0.6744320334732778
The optimal number of topics is 6
Topic 1: you it going and is to what so was full
Topic 2: what to going is and you it of makes book
Topic 3: the movie i and was book know that to more
Topic 4: to you it what is and with going on in
Topic 5: in their goes being real sugar coat feel on 16
Topic 6: this reading todd recommend if anna thank to up you
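
Most of the top words above are function words (the, and, to, is), because the tokenizer only lowercases and splits on whitespace. Below is a minimal sketch of a stricter tokenizer. The `STOP_WORDS` set and the `tokenize` helper are illustrative assumptions, not part of the original notebook; in practice a fuller stop-word list from an NLP library would likely be used.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {
    "a", "an", "and", "the", "i", "it", "is", "to", "of", "in", "on", "was",
    "that", "this", "what", "you", "so", "with", "for", "then", "there",
    "those", "who", "me", "up", "if", "your", "than", "even", "though",
    "their", "such",
}

def tokenize(text):
    """Lowercase, keep alphabetic tokens (with an inner apostrophe), drop stop words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "This book was totally unlike the movie I personally like the book more"
print(tokenize(sample))
```

The resulting token lists could be passed to `corpora.Dictionary` in place of `tokenized_texts`, so that coherence is computed over content words rather than function words.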


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [66]:
# Write your code here
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

textdatas = ["I watched the movie first and then I discovered that there was book to the movie and those who know me know that I cant pass up a book/movie combo",
             "This book was totally unlike the movie I personally like the book more than the movie and even though the movie was good",
             "the book was so full of emotion it connects you with what is going on in Tessa's mind and it makes you want more and to know what is going to happen",
             "being almost 16 feel like people sugar coat what goes on in their first real relationship",
             "Anna Todd if your reading this thank you for writing such an amazing book. To sum this up I definitely recommend reading this book"]


# Create dictionary
texts_tokenized = [text.split() for text in textdatas]
dictionary = Dictionary(texts_tokenized)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Build LSA model
num_topics = 4
lsa_model = LsiModel(corpus, num_topics=num_topics, id2word=dictionary)

# Compute coherence score
coherence_model_lsa = CoherenceModel(model=lsa_model, texts=texts_tokenized, dictionary=dictionary, coherence='c_v')
coherence_lsa = coherence_model_lsa.get_coherence()

# Print coherence score
print('Coherence Score: ', coherence_lsa)

# Print topics
topics = lsa_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print('Topic {}: {}'.format(i+1, [word[0] for word in topic[1]]))

Coherence Score:  0.6434173301145665
Topic 1: ['the', 'movie', 'and', 'I', 'book', 'was', 'know', 'to', 'that', 'more']
Topic 2: ['what', 'it', 'is', 'going', 'you', 'movie', 'to', 'I', 'the', 'on']
Topic 3: ['this', 'reading', 'up', 'definitely', 'an', 'book.', 'your', 'Anna', 'To', 'Todd']
Topic 4: ['that', 'know', 'the', 'I', 'book', 'there', 'book/movie', 'pass', 'watched', 'then']
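
The cell above fixes `num_topics = 4` instead of choosing K by coherence, as the question asks. Below is a minimal sketch of the selection loop. `select_k` and `toy_scores` are hypothetical names, and the scoring callable is a stand-in: in the notebook it would fit `LsiModel(corpus, num_topics=k, id2word=dictionary)` and return `CoherenceModel(...).get_coherence()`, using the same gensim calls as the cell above.

```python
def select_k(candidate_ks, score_for_k):
    """Return (best_k, scores), where scores maps each k to its coherence."""
    scores = {k: score_for_k(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first k with the highest score
    return best_k, scores

# Placeholder scores so the sketch runs without gensim.
# These numbers are made up for illustration and are NOT results from this corpus.
toy_scores = {2: 0.1, 3: 0.3, 4: 0.2}

best_k, scores = select_k(range(2, 5), toy_scores.get)
print(best_k, scores[best_k])
```

Passing the scorer as a callable keeps the loop independent of the model, so the same helper could serve the LDA and LSA cells.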


## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [67]:
# Write your code here
!pip install lda2vec

import numpy as np
import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_string
from sklearn.feature_extraction.text import CountVectorizer

textdatas = ["I watched the movie first and then I discovered that there was book to the movie and those who know me know that I cant pass up a book/movie combo",
             "This book was totally unlike the movie I personally like the book more than the movie and even though the movie was good",
             "the book was so full of emotion it connects you with what is going on in Tessa's mind and it makes you want more and to know what is going to happen",
             "being almost 16 feel like people sugar coat what goes on in their first real relationship",
             "Anna Todd if your reading this thank you for writing such an amazing book. To sum this up I definitely recommend reading this book"]


# Preprocess the text data
processed_texts = [preprocess_string(text) for text in textdatas]

# Create a dictionary and corpus
dictionary = gensim.corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train a plain gensim LDA model on the stemmed tokens.
# Note: lda2vec is installed above but not used, and K is fixed at 5 rather than chosen by coherence.
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# Get the topics and their top words
topics = lda_model.show_topics(num_topics=5, num_words=10, formatted=False)
top_words_per_topic = []
for topic in topics:
    top_words = [word[0] for word in topic[1]]
    top_words_per_topic.append(top_words)

# Print the top words for each topic
for i, top_words in enumerate(top_words_per_topic):
    print(f"Topic {i+1}: {', '.join(top_words)}")

Topic 1: book, know, movi, read, go, anna, amaz, write, todd, thank
Topic 2: like, relationship, peopl, coat, feel, goe, sugar, real, book, movi
Topic 3: go, emot, mind, make, want, happen, tessa, connect, book, know
Topic 4: movi, book, like, know, go, make, mind, connect, happen, read
Topic 5: movi, book, person, total, like, good, unlik, know, go, happen


## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [68]:
%%capture
!pip install bertopic

In [69]:
from bertopic import BERTopic

textdatas = ["I watched the movie first and then I discovered that there was book to the movie and those who know me know that I cant pass up a book/movie combo",
             "This book was totally unlike the movie I personally like the book more than the movie and even though the movie was good",
             "the book was so full of emotion it connects you with what is going on in Tessa's mind and it makes you want more and to know what is going to happen",
             "being almost 16 feel like people sugar coat what goes on in their first real relationship",
             "Anna Todd if your reading this thank you for writing such an amazing book. To sum this up I definitely recommend reading this book"]


# Initialize BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit the model to the data and transform it into topic representations
topics, probabilities = topic_model.fit_transform(textdatas)

# Print the topics
print("\nTopics:")
for topic_id, word_scores in topic_model.get_topics().items():
    # get_topics() maps each topic id to (word, score) pairs; -1 is the outlier topic
    print(f"Topic {topic_id}: {', '.join(word for word, _ in word_scores)}")



2024-03-30 04:29:27,557 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2024-03-30 04:29:28,976 - BERTopic - Embedding - Completed ✓
2024-03-30 04:29:28,978 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

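The corpus is too small for BERTopic's default pipeline, so the fit fails before any topics are produced. This is not underfitting: the log shows the failure in the dimensionality-reduction step, and the TypeError says the eigen-decomposition was asked for k eigenvectors with k >= N, most likely because N is the number of documents (5). From what I found, BERTopic is normally fitted on hundreds or thousands of documents, so this question needs a much larger corpus.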


## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [70]:
# Write your code here

"""

K for LDA = 6 (chosen by coherence score)
K for LSA = 4 (set manually)
K for lda2vec = 5 (set manually)

From what I have read, BERTopic is generally regarded as the best model for topic modeling on large data sets.
However, I used a small corpus with only 5 lines of text, so based on the coherence scores and K values
I think Latent Dirichlet Allocation (LDA) is the best fit for my data, because it generated more topics, and more relevant ones, than the other three.


"""
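
One simple, model-agnostic way to support a comparison like the one above is topic diversity: the share of distinct words among the top words of all topics. Below is a minimal sketch using the LDA and LSA top words printed earlier in this notebook. The `topic_diversity` helper is a hypothetical name, and the measure is only a supplement to coherence: a high value means little word overlap between topics, not that the topics are meaningful.

```python
def topic_diversity(topics):
    """Fraction of distinct words among all top words of all topics (0 to 1)."""
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        raise ValueError("topics must contain at least one word")
    return len(set(all_words)) / len(all_words)

# Top words copied from the LDA and LSA outputs earlier in this notebook.
lda_topics = [
    "you it going and is to what so was full".split(),
    "what to going is and you it of makes book".split(),
    "the movie i and was book know that to more".split(),
    "to you it what is and with going on in".split(),
    "in their goes being real sugar coat feel on 16".split(),
    "this reading todd recommend if anna thank to up you".split(),
]
lsa_topics = [
    "the movie and I book was know to that more".split(),
    "what it is going you movie to I the on".split(),
    "this reading up definitely an book. your Anna To Todd".split(),
    "that know the I book there book/movie pass watched then".split(),
]

for name, topics in [("LDA", lda_topics), ("LSA", lsa_topics)]:
    print(f"{name}: diversity = {topic_diversity(topics):.2f}")
```

Because the two cells tokenize differently (LDA lowercases, LSA does not), the two numbers are only roughly comparable.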

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help in grasping the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [71]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:

The exercise is good, but I expected guidance on how much input data should be provided.
I ran into issues with coherence in all the models. Overall it is good.
NLP is dragging me down; I am losing confidence in some cases, but I am trying to keep up with the course.

'''