<a href="https://colab.research.google.com/github/TharunSaiVT/INFO-5731/blob/main/INFO5731_Exercise_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [11]:
# Write your code here
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Prepare the text corpus
sample_texts = [
    "It Starts with Us was a beautifully written and heartwarming sequel to It Ends with Us. This was one of my most anticipated reads and it did not disappoint. Atlas and Lily deserved their happy ending and getting to read that was absolutely everything I wanted and more."
    "Again this book picks up immediately after the ending of It Ends with Us, so I would highly recommend just reading that book first. I also suggest checking the content and trigger warnings before reading."
    "It is dual POVs of Atlas and Lily and has a second chance romance and the found family trope."
    "The premise of this book was definitely a lot lighter than the first book, as it focused more on accepting and fighting for love and family. The writing was amazing and just easy to read; I also loved the short chapters."
]

# Tokenize the texts
tokenized_texts = [text.lower().split() for text in sample_texts]

# Create a dictionary from the tokenized texts
dictionary = corpora.Dictionary(tokenized_texts)

# Convert the tokenized texts into a bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Find the optimal number of topics using coherence score
coherence_scores = {}
for k in range(2, 8):
    lda_model = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=k,
        random_state=100,
        chunksize=100,
        passes=10,
        alpha='auto',
        per_word_topics=True
    )
    coherence_model = CoherenceModel(
        model=lda_model,
        texts=tokenized_texts,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_scores[k] = coherence_model.get_coherence()
print(f"coherence_scores[{k+1}]:",coherence_scores[k])
optimal_k = max(coherence_scores, key=coherence_scores.get)
print(f"The optimal number of topics is {optimal_k}")

# Train the LDA model with the optimal number of topics
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=optimal_k,
    random_state=100,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# Summarize the topics
topics = lda_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print(f"Topic {i+1}: {' '.join([w[0] for w in topic[1]])}")



coherence_scores[8]: 0.5233749594880391
The optimal number of topics is 7
Topic 1: and was i it book the to ending this of
Topic 2: and the was it of i with to this a
Topic 3: and i it the was book of with a also
Topic 4: and was it a book i with the to of
Topic 5: and it the of i was book this with to
Topic 6: and was it the i to a book of ending
Topic 7: and it was the a this i of to book


## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [7]:
# Write your code here
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

sample_texts = [
    "It Starts with Us was a beautifully written and heartwarming sequel to It Ends with Us. This was one of my most anticipated reads and it did not disappoint. Atlas and Lily deserved their happy ending and getting to read that was absolutely everything I wanted and more."
    "Again this book picks up immediately after the ending of It Ends with Us, so I would highly recommend just reading that book first. I also suggest checking the content and trigger warnings before reading."
    "It is dual POVs of Atlas and Lily and has a second chance romance and the found family trope."
    "The premise of this book was definitely a lot lighter than the first book, as it focused more on accepting and fighting for love and family. The writing was amazing and just easy to read; I also loved the short chapters."
]

# Create dictionary
texts_tokenized = [text.split() for text in sample_texts]
dictionary = Dictionary(texts_tokenized)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in texts_tokenized]

# Build LSA model
num_topics = 4
lsa_model = LsiModel(corpus, num_topics=num_topics, id2word=dictionary)

# Compute coherence score
coherence_model_lsa = CoherenceModel(model=lsa_model, texts=texts_tokenized, dictionary=dictionary, coherence='c_v')
coherence_lsa = coherence_model_lsa.get_coherence()

# Print coherence score
print('Coherence Score: ', coherence_lsa)

# Print topics
topics = lsa_model.show_topics(formatted=False)
for i, topic in enumerate(topics):
    print('Topic {}: {}'.format(i+1, [word[0] for word in topic[1]]))

Coherence Score:  0.5031479809700152
Topic 1: ['and', 'the', 'was', 'I', 'of', 'It', 'book', 'a', 'to', 'with']


## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [6]:
# Write your code here
# Write your code here
!pip install lda2vec

import numpy as np
import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_string
from sklearn.feature_extraction.text import CountVectorizer

sample_texts = [
    "It Starts with Us was a beautifully written and heartwarming sequel to It Ends with Us. This was one of my most anticipated reads and it did not disappoint. Atlas and Lily deserved their happy ending and getting to read that was absolutely everything I wanted and more."
    "Again this book picks up immediately after the ending of It Ends with Us, so I would highly recommend just reading that book first. I also suggest checking the content and trigger warnings before reading."
    "It is dual POVs of Atlas and Lily and has a second chance romance and the found family trope."
    "The premise of this book was definitely a lot lighter than the first book, as it focused more on accepting and fighting for love and family. The writing was amazing and just easy to read; I also loved the short chapters."
]

# Preprocess the text data
processed_texts = [preprocess_string(text) for text in sample_texts]

# Create a dictionary and corpus
dictionary = gensim.corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train the LDA model
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(sample_texts)
feature_names = vectorizer.get_feature_names_out()
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# Get the topics and their top words
topics = lda_model.show_topics(num_topics=5, num_words=10, formatted=False)
top_words_per_topic = []
for topic in topics:
    top_words = [word[0] for word in topic[1]]
    top_words_per_topic.append(top_words)

# Print the top words for each topic
for i, top_words in enumerate(top_words_per_topic):
    print(f"Topic {i+1}: {', '.join(top_words)}")

Topic 1: read, end, book, trope, absolut, atla, trigger, pick, definit, check
Topic 2: read, book, end, love, atla, famili, lili, deserv, beautifulli, amaz
Topic 3: read, end, book, love, famili, atla, lili, lighter, start, trope
Topic 4: read, book, end, lili, famili, atla, love, heartwarm, trigger, dual
Topic 5: book, read, end, lighter, love, accept, short, romanc, immedi, pick


## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [10]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.1/154.1 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.9/90.9 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)
[2K     [90m━━━━━

In [19]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
texts = [
    "Machine learning is a subset of artificial intelligence that focuses on the development of computer programs that can access data and use it to learn for themselves.",
    "Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems.",
    "Deep learning is a subset of machine learning that deals with neural networks: algorithms inspired by the structure and function of the brain's neural networks.",
    "Neural networks are a series of algorithms that mimic the operations of a human brain to recognize relationships between vast amounts of data.",
    "Natural language processing is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language.",
    "Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning it learns from input-output pairs.",
    "Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data and learns to recognize patterns without supervision.",
    "Reinforcement learning is a type of machine learning where an agent learns to behave in an environment by performing certain actions and receiving rewards or penalties.",
    "Clustering is an unsupervised learning technique used to group data points or objects that are somehow similar.",
    "Classification is a supervised learning technique used to categorize data points into predefined classes or categories."
]

# Initialize BERTopic model
model = BERTopic(language="english")

# Fit BERTopic model to determine optimal number of topics
topics, _ = model.fit_transform(texts)

# Get topic info
topic_info = model.get_topic_info()

# Summarize topics
for i, (topic_id, top_words, _) in enumerate(topic_info.values):
    print(f"Topic {topic_id}:")
    print(f"Top Words: {', '.join(top_words)}")
    print()


ValueError: too many values to unpack (expected 3)

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
# Write your code here
LDA generated 7 topics, which is more than the other two algorithms.
It identified some similar topics as LSA and LDA2Vec,
LSA generated 1 topics.
LDA2Vec generated 5 topics and identified some similar topics as LDA and LSA.
Compared to LSA, LDA and LDA2Vec produced more focused and cohesive subjects.
As berttopic needs more data for its output we tried for a better input data to pass , but we are unable to execute it.
LDA tends to capture a wider range of themes, although LDA2Vec might be better at recognising particular traits and attitudes.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms and did the implementations helped in grasping the nuances of feature extraction from text data.

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:
For now everthing seems good. But need some more practice on it to get some deeper knowledge as we are the beginners to this.
We encountered the challenges that which type of data is good for the better output.
We need to have more handson experience in this so that we can further use them in any real time applications..





'''