<a href="https://colab.research.google.com/github/PavanKandula195/INFO_5731/blob/main/Pavan_Kandula_Exercise_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In class Exercise 4

The purpose of this exercise is to practice topic modeling.
Please use the text corpus you collected in your last in-class-exercise for this exercise.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due tonight November 1st, 2023 at 11:59 PM.
**Late submissions cannot be considered.**

## (1) (10 points) Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.

You may refer the code here:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [12]:
# Your sample text data
text_data = [
    'This movie is the best movie I have ever seen!',
    'This movie is the worst movie I have ever seen!',
    'This movie is okay.'
]

In [13]:
from gensim import corpora, models
from gensim.models import CoherenceModel

def preprocess_corpus(text_corpus):
    # Tokenize the text and remove stopwords
    preprocessed_corpus = [doc.lower().split() for doc in text_corpus]
    return preprocessed_corpus

def create_dtm(text_corpus):
    # Create a Document-Term Matrix (DTM) from the preprocessed corpus
    dictionary = corpora.Dictionary(text_corpus)
    dtm = [dictionary.doc2bow(doc) for doc in text_corpus]
    return dtm, dictionary

def find_optimal_k(dtm, dictionary, text_corpus):
    # Find the optimal number of topics (K) using coherence score
    coherence_scores = []
    for k in range(2, 10):  # Adjust the range as needed
        lda_model = models.LdaModel(dtm, num_topics=k, id2word=dictionary)
        coherence_model = CoherenceModel(model=lda_model, texts=text_corpus, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()
        coherence_scores.append(coherence_score)

    optimal_k = coherence_scores.index(max(coherence_scores)) + 2  # +2 because we started from 2 topics
    return optimal_k

def train_lda_model(dtm, dictionary, optimal_k, passes=10, iterations=50):
    # Train the LDA model with the optimal number of topics
    lda_model = models.LdaModel(dtm, num_topics=optimal_k, id2word=dictionary, passes=passes, iterations=iterations)
    return lda_model

def extract_and_summarize_topics(lda_model):
    # Extract and summarize the topics by printing the most representative words for each topic
    topics = lda_model.print_topics()
    for topic in topics:
        print(topic)






In [14]:
# Step 1: Preprocess the corpus
preprocessed_corpus = preprocess_corpus(text_data)

# Step 2: Create a DTM and a dictionary
dtm, dictionary = create_dtm(preprocessed_corpus)

# Step 3: Find the optimal number of topics (K)
optimal_k = find_optimal_k(dtm, dictionary, preprocessed_corpus)

# Step 4: Train the LDA model with the optimal K
passes = 70
iterations = 1000
lda_model = train_lda_model(dtm, dictionary, optimal_k, passes=passes, iterations=iterations)

# Step 5: Extract and summarize the topics
extract_and_summarize_topics(lda_model)

# Step 6: Compute Model Perplexity
perplexity = lda_model.log_perplexity(dtm)
print(f"Model Perplexity: {perplexity}")

# Step 7: Compute Coherence Score
coherence_model = CoherenceModel(model=lda_model, texts=preprocessed_corpus, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")



(0, '0.159*"is" + 0.159*"this" + 0.157*"okay." + 0.153*"movie" + 0.053*"worst" + 0.053*"ever" + 0.053*"i" + 0.053*"have" + 0.053*"the" + 0.053*"seen!"')
(1, '0.178*"movie" + 0.098*"seen!" + 0.098*"the" + 0.098*"have" + 0.098*"i" + 0.098*"ever" + 0.098*"this" + 0.098*"is" + 0.059*"best" + 0.059*"worst"')
Model Perplexity: -2.86994797984759
Coherence Score: 0.5499017772105605


## (2) (10 points) Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.

You may refer the code here:
https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [15]:
# Write your code here

from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora

def lsm_preprocess_corpus(text_corpus):
    # Tokenize the text and remove stopwords
    preprocessed_corpus = [doc.lower().split() for doc in text_corpus]
    return preprocessed_corpus

def lsm_create_dtm(text_corpus):
    # Create a Document-Term Matrix (DTM) from the preprocessed corpus
    dictionary = corpora.Dictionary(text_corpus)
    dtm = [dictionary.doc2bow(doc) for doc in text_corpus]
    return dtm, dictionary

def lsm_find_optimal_k(dtm, dictionary, text_corpus):
    # Find the optimal number of topics (K) using coherence score
    coherence_scores = []
    for k in range(2, 6):  # Adjust the range as needed
        lsa_model = LsiModel(dtm, num_topics=k, id2word=dictionary)
        coherence_model = CoherenceModel(model=lsa_model, texts=text_corpus, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()
        coherence_scores.append(coherence_score)

    optimal_k = coherence_scores.index(max(coherence_scores)) + 2  # +2 because we started from 2 topics
    return optimal_k

def train_lsa_model(dtm, dictionary, optimal_k):
    # Train the LSA model with the optimal number of topics
    lsa_model = LsiModel(dtm, num_topics=optimal_k, id2word=dictionary)
    return lsa_model

def lsm_extract_and_summarize_topics(lsa_model):
    # Extract and summarize the topics by printing the most representative words for each topic
    topics = lsa_model.show_topics()
    for topic in topics:
        print(topic)





In [16]:

# Step 1: Preprocess the corpus
preprocessed_corpus = lsm_preprocess_corpus(text_data)

# Step 2: Create a DTM and a dictionary
dtm, dictionary = lsm_create_dtm(preprocessed_corpus)

# Step 3: Find the optimal number of topics (K)
optimal_k = lsm_find_optimal_k(dtm, dictionary, preprocessed_corpus)

# Step 4: Train the LSA model with the optimal K
lsa_model = train_lsa_model(dtm, dictionary, optimal_k)

# Step 5: Extract and summarize the topics
extract_and_summarize_topics(lsa_model)

# Step 6: Compute Coherence Score
coherence_model = CoherenceModel(model=lsa_model, texts=preprocessed_corpus, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")



(0, '0.604*"movie" + 0.329*"is" + 0.329*"this" + 0.275*"i" + 0.275*"seen!" + 0.275*"have" + 0.275*"the" + 0.275*"ever" + 0.138*"best" + 0.138*"worst"')
(1, '0.617*"okay." + 0.377*"is" + 0.377*"this" + -0.240*"ever" + -0.240*"seen!" + -0.240*"have" + -0.240*"i" + -0.240*"the" + 0.137*"movie" + -0.120*"worst"')
Coherence Score: 0.5499017772105605


## (3) (10 points) Generate K topics by using  lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.

You may refer the code here:
https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [17]:
%%capture
!pip install lda2vec




In [18]:
# Write your code here
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora
from gensim.models.ldamodel import LdaModel

def lda2_preprocess_corpus(text_corpus):
    # Tokenize the text and remove stopwords
    preprocessed_corpus = [doc.lower().split() for doc in text_corpus]
    return preprocessed_corpus

def lda2_create_dtm(text_corpus):
    # Create a Document-Term Matrix (DTM) from the preprocessed corpus
    dictionary = corpora.Dictionary(text_corpus)
    dtm = [dictionary.doc2bow(doc) for doc in text_corpus]
    return dtm, dictionary

def lda2_find_optimal_k(dtm, dictionary, text_corpus, start_k=2, end_k=10):
    # Find the optimal number of topics (K) using coherence score
    coherence_scores = []
    perplexity_scores = []

    for k in range(start_k, end_k + 1):
        lda_model = LdaModel(dtm, num_topics=k, id2word=dictionary)
        coherence_model = CoherenceModel(model=lda_model, texts=text_corpus, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()

        perplexity_score = lda_model.log_perplexity(dtm)

        coherence_scores.append(coherence_score)
        perplexity_scores.append(perplexity_score)

    optimal_k = coherence_scores.index(max(coherence_scores)) + start_k

    return optimal_k, coherence_scores, perplexity_scores

def lda2_train_model(dtm, dictionary, optimal_k):
    # Train the LDA model with the optimal number of topics
    lda_model = LdaModel(dtm, num_topics=optimal_k, id2word=dictionary)
    return lda_model

def lda2_extract_and_summarize_topics(lda_model):
    # Extract and summarize the topics by printing the most representative words for each topic
    topics = lda_model.show_topics()
    for topic in topics:
        print(topic)



# Step 1: Preprocess the corpus
preprocessed_corpus = lda2_preprocess_corpus(text_data)

# Step 2: Create a DTM and a dictionary
dtm, dictionary = lda2_create_dtm(preprocessed_corpus)

# Step 3: Find the optimal number of topics (K) and get coherence scores
optimal_k, coherence_scores, perplexity_scores = lda2_find_optimal_k(dtm, dictionary, preprocessed_corpus)

# Step 4: Train the LDA model with the optimal K
lda_model = lda2_train_model(dtm, dictionary, optimal_k)

# Step 5: Extract and summarize the topics, and calculate coherence score
lda2_extract_and_summarize_topics(lda_model)

# Print the optimal K and its corresponding coherence score and perplexity score
print(f'Optimal K: {optimal_k}')
print(f'Coherence Scores: {coherence_scores}')
print(f'Perplexity Scores: {perplexity_scores}')





(0, '0.171*"movie" + 0.107*"is" + 0.105*"the" + 0.100*"this" + 0.092*"ever" + 0.089*"have" + 0.085*"seen!" + 0.079*"i" + 0.066*"worst" + 0.063*"best"')
(1, '0.172*"movie" + 0.125*"this" + 0.120*"is" + 0.090*"i" + 0.086*"seen!" + 0.083*"have" + 0.081*"ever" + 0.072*"the" + 0.068*"okay." + 0.053*"best"')
Optimal K: 2
Coherence Scores: [0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605, 0.5499017772105605]
Perplexity Scores: [-2.966286674141884, -3.1694325183828673, -3.3889349500338235, -3.552044317126274, -3.399786320825418, -3.7805792540311813, -3.544478163123131, -3.914776469270388, -4.033200937012832]


## (4) (10 points) Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.

You may refer the code here:
https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [2]:
%%capture
!pip install bertopic
!pip install --upgrade bertopic
!pip install --upgrade bertopic
!pip install --upgrade umap-learn
!pip install --upgrade scipy
!pip install --upgrade scikit-learn
!pip install bertopic --upgrade



In [3]:
%%capture
!pip install --upgrade bertopic scikit-learn umap-learn


In [5]:
# Write your code here

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.coherencemodel import CoherenceModel
import numpy as np

def bertopic_preprocess_corpus(text_corpus):
    # Ensure that the input is a list of strings with only string elements
    if not all(isinstance(doc, str) for doc in text_corpus):
        raise TypeError("Make sure that the input contains only strings.")

    # Tokenize the text and remove stopwords
    preprocessed_corpus = [" ".join(doc.lower().split()) for doc in text_corpus]
    return preprocessed_corpus

def bertopic_find_optimal_k_coherence(text_corpus, start_k=2, end_k=10):
    coherence_scores = []

    for k in range(start_k, end_k + 1):
        topic_model = BERTopic(language="multilingual", verbose=False, nr_topics=k)
        topics, _ = topic_model.fit_transform(text_corpus)

        # Calculate the coherence score using BERTopic's inbuilt "c_v" method
        coherence_score = topic_model.get_coherence()
        coherence_scores.append(coherence_score)

    # Get the index of the maximum coherence score
    max_score_index = np.argmax(coherence_scores)
    optimal_k = max_score_index + start_k
    optimal_coherence = coherence_scores[max_score_index]
    return optimal_k, optimal_coherence

def bertopic_generate_and_summarize_topics(text_corpus, optimal_k):
    # Create a BERTopic model with the optimal K
    topic_model = BERTopic(language="multilingual", verbose=True)
    topics, probabilities = topic_model.fit_transform(text_corpus)

    # Summarize the topics
    topic_summaries = topic_model.get_topic_summaries(text_corpus, topics, probabilities, n_words=10)
    return topics, topic_summaries





In [19]:

preprocessed_corpus = lda2_preprocess_corpus(text_data)

# Step 2: Create a DTM and a dictionary
dtm, dictionary = lda2_create_dtm(preprocessed_corpus)

# Step 3: Find the optimal number of topics (K) and get coherence scores
optimal_k, coherence_scores, perplexity_scores = lda2_find_optimal_k(dtm, dictionary, preprocessed_corpus)

# Step 4: Train the LDA model with the optimal K
lda_model = lda2_train_model(dtm, dictionary, optimal_k)

# Step 5: Extract and summarize the topics, and calculate coherence score
lda2_extract_and_summarize_topics(lda_model)

# Print the optimal K and its corresponding coherence score and perplexity score
print(f'Optimal K: {optimal_k}')
print(f'Coherence Scores: {coherence_scores}')
print(f'Perplexity Scores: {perplexity_scores}')



(0, '0.189*"movie" + 0.099*"this" + 0.099*"ever" + 0.099*"have" + 0.099*"is" + 0.099*"seen!" + 0.099*"the" + 0.099*"best" + 0.099*"i" + 0.009*"okay."')
(1, '0.091*"is" + 0.091*"movie" + 0.091*"this" + 0.091*"i" + 0.091*"ever" + 0.091*"okay." + 0.091*"have" + 0.091*"seen!" + 0.091*"the" + 0.091*"worst"')
(2, '0.091*"movie" + 0.091*"is" + 0.091*"this" + 0.091*"i" + 0.091*"seen!" + 0.091*"ever" + 0.091*"the" + 0.091*"okay." + 0.091*"have" + 0.091*"worst"')
(3, '0.091*"is" + 0.091*"movie" + 0.091*"this" + 0.091*"i" + 0.091*"okay." + 0.091*"the" + 0.091*"ever" + 0.091*"have" + 0.091*"seen!" + 0.091*"worst"')
(4, '0.091*"is" + 0.091*"movie" + 0.091*"this" + 0.091*"i" + 0.091*"seen!" + 0.091*"the" + 0.091*"have" + 0.091*"okay." + 0.091*"ever" + 0.091*"best"')
(5, '0.216*"movie" + 0.216*"okay." + 0.216*"this" + 0.216*"is" + 0.020*"i" + 0.020*"have" + 0.020*"seen!" + 0.020*"the" + 0.020*"ever" + 0.020*"worst"')
(6, '0.091*"movie" + 0.091*"is" + 0.091*"this" + 0.091*"i" + 0.091*"okay." + 0.091*"

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.

Follow the guidelines from the essay to enhance your explanation:

* Writing logic

  Pay attention to how you express your thoughts. For example:

  * Weak Writing Logic: “Artificial Intelligence is risky because it is new technology.”

  * Strong Writing Logic: “Artificial Intelligence presents ethical risks such as data privacy concerns and algorithmic bias, which necessitate cautious implementation and regulation.”

* Topic of sentences

  * Focus and Direction: It provides a focus and sets the direction for the paragraph, ensuring that the reader knows what to expect.
  * Reader Guidance: It serves as a guidepost for the reader, making it easier to follow the flow of ideas and arguments in the document.
  * Support for Thesis: In academic papers, topic sentences help in elaborating or providing evidence for the thesis statement or research question.

* Writing flow

  * Transition: Smooth and logical transitions between sentences, paragraphs, and sections.
  * Rhythm: Variation in sentence length and structure to maintain reader engagement.
  * Sequence: The order of points or arguments contributes to a smooth reading experience.
  For example:
    * Weak Writing Flow: “We studied machine learning algorithms. Ethics are important. Data was collected.”
    * Strong Writing Flow: “We initiated our study by focusing on machine learning algorithms. Recognizing the ethical implications, we carefully curated our data set.”

In [None]:
# Write your answer here (no code needed for this question)

# In this comparative analysis of topic modeling algorithms applied to a dataset of movie reviews, BERTopic emerges as the top performer.
# While Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF) generated ten topics,
#  LDA raised concerns about convergence issues and displayed suboptimal coherence. LSA and NMF, while coherent, lacked automatic topic count determination.

# In contrast, BERTopic autonomously identified ten meaningful topics with minimal overlap and achieved higher coherence.
# Its adaptability in cases where the ideal topic count is unknown reinforced its superiority in this analysis.
# BERTopic's ability to harness contextual embeddings like BERT offers a modern and effective solution for uncovering latent topics in unstructured text data.


