# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [12]:
# Write your code here
# importing the gensim module to use the LDA modelling technique
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel
# importing the nlp tool kit and downloading the stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# input data to use
news_input = ["Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",]

# removing the stop words from input text and tokenizing the text
stop_words = set(stopwords.words('english'))
tokenized_text = [[word for word in doc.lower().split() if word not in stop_words] for doc in news_input]

# Creating dictionary and corpus using the tokenized text
dictionary = corpora.Dictionary(tokenized_text)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_text]

# Using coherence score parameter to decide number of topics
coherence_scores = []
for k in range(4, 16):  
    lda_model = LdaModel(corpus=corpus, num_topics=k, id2word=dictionary)
    coherence_model = CoherenceModel(model=lda_model, texts=tokenized_text, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((k, coherence_score))

# Find the K with the highest coherence score
best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])

# Train the final LDA model with the best K
final_lda_model = LdaModel(corpus=corpus, num_topics=best_k, id2word=dictionary)

# Summarize the topics
topics = final_lda_model.print_topics(num_words=5)  # You can change the number of words in topics

# Print the best K and the topics
print("Optimal number of topics (K):", best_k)
print("Topics:")
for topic in topics:
    print(topic)




[nltk_data] Downloading package stopwords to C:\Users\soumya
[nltk_data]     nanditha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Optimal number of topics (K): 4
Topics:
(0, '0.108*"musk" + 0.108*"fund" + 0.108*"billionaire" + 0.108*"elon" + 0.108*"mars"')
(1, '0.109*"government" + 0.108*"warns" + 0.108*"breaking:" + 0.108*"alien" + 0.108*"imminent,"')
(2, '0.108*"disease" + 0.108*"effective" + 0.108*"vaccines" + 0.108*"study" + 0.108*"preventing"')
(3, '0.118*"files" + 0.118*"leaked" + 0.118*"government" + 0.118*"top-secret" + 0.118*"exclusive:"')
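The topic terms above still carry punctuation (`breaking:`, `imminent,`, `exclusive:`) because the documents were split on whitespace only. A minimal sketch of a regex-based tokenizer that would avoid this is shown below. The `tokenize` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses nltk's English list.
STOP_WORDS = {"to", "in", "the", "a", "of"}

def tokenize(doc):
    """Lowercase, keep alphabetic runs (inner hyphens allowed), drop stop words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

news_input = [
    "Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",
]

tokenized_text = [tokenize(doc) for doc in news_input]
for tokens in tokenized_text:
    print(tokens)
```

The resulting `tokenized_text` could be passed to `corpora.Dictionary` in place of the whitespace-split version, so that `breaking:` and `breaking` are not treated as different terms.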


## (2) (10 points) Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [14]:
# Write your code here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

# Sample data
news_input = ["Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",]

# Step 1: Preprocess the text data and create a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(news_input)

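# Step 2: Fit LSA with a manually chosen K (no coherence-based selection is done here).
# With only four documents, at most four components can be recovered,
# so fewer than K topics may be printed.
K = 6
lsa = TruncatedSVD(n_components=K)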
lsa_topic_matrix = lsa.fit_transform(tfidf_matrix)

# Step 3: Summarize the topics
terms = tfidf_vectorizer.get_feature_names_out()
topic_keywords = []
for i, topic in enumerate(lsa.components_):
    top_terms = [terms[idx] for idx in topic.argsort()[-5:][::-1]]
    topic_keywords.append(top_terms)
    print(f"Topic {i + 1}: {', '.join(top_terms)}")


Topic 1: government, warns, breaking, imminent, invasion
Topic 2: to, in, preventing, finds, vaccines
Topic 3: to, mission, mars, fund, elon
Topic 4: leaked, top, exclusive, files, secret
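The cell above fixes K by hand, while the task asks for K to be chosen by a coherence score. As a rough illustration of what such a score measures, here is a minimal sketch of a UMass-style coherence computed from document co-occurrence counts in plain Python. The `umass_coherence` helper is hypothetical, it averages over word pairs rather than summing, and library implementations such as gensim's may differ in detail. The `mixed_topic` list is made up for contrast and is not a topic produced by the model.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, w_j ranked above w_i."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for w_j, w_i in combinations(topic_words, 2):
        d_j = doc_count(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_count(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

docs = [
    "breaking alien invasion imminent government warns".split(),
    "study finds vaccines effective in preventing disease".split(),
    "billionaire elon musk to fund mission to mars".split(),
    "exclusive top secret government files leaked".split(),
]

# Topic 1 as printed above, and a made-up topic mixing words from different headlines
lsa_topic_1 = ["government", "warns", "breaking", "imminent", "invasion"]
mixed_topic = ["alien", "vaccines", "mars", "leaked", "study"]

print("LSA topic 1:", umass_coherence(lsa_topic_1, docs))
print("Mixed topic:", umass_coherence(mixed_topic, docs))
```

To select K, one would fit the model for each candidate K, average this score over the K topics, and keep the K with the highest average. On a corpus of four short headlines the scores carry very little information, so the collected corpus should be used instead.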


## (3) (10 points) Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [4]:
pip install lda2vec

Collecting lda2vec
  Downloading lda2vec-0.16.10.tar.gz (13 kB)
Building wheels for collected packages: lda2vec
  Building wheel for lda2vec (setup.py): started
  Building wheel for lda2vec (setup.py): finished with status 'done'
  Created wheel for lda2vec: filename=lda2vec-0.16.10-py3-none-any.whl size=14434 sha256=e2401badef4cd6e16f7a6e2e6c689a5e92956ce7c9e4083788fd5a854f8e1cf9
  Stored in directory: c:\users\soumya nanditha\appdata\local\pip\cache\wheels\fa\ad\6c\38aa944b34a94fd5d4f4d48e7432f94cd97f18d15779bdc9e5
Note: you may need to restart the kernel to use updated packages.
Successfully built lda2vec
Installing collected packages: lda2vec
Successfully installed lda2vec-0.16.10


In [3]:
# Write your code here

import lda2vec
import numpy as np

news_input = ["Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",]


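# Initialize the LDA2Vec model
# NOTE: K and vocabulary are not defined in this cell, and this constructor call
# was never verified because the import above failed.
model = lda2vec.Lda2Vec(num_topics=K, num_words=len(vocabulary), min_df=5)


# Print the top words of each topic
# NOTE: dat must hold 'topic_term_dists' and 'vocab' (presumably as produced in the
# referenced notebook), but it is never loaded in this cell.
top_n = 10
topic_to_topwords = {}
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'Topic %i ' % j
    top_words = [dat['vocab'][i].strip()[:35] for i in top]
    msg += ' '.join(top_words)
    print(msg)  # Python 3 print call (the original used the Python 2 statement)
    topic_to_topwords[j] = top_words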

ModuleNotFoundError: No module named 'lda2vec'
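
The execution counters suggest that this cell (`In [3]`) ran before the install cell (`In [4]`), and pip's own note says a kernel restart may be needed after installing. A minimal sketch of checking whether a package is importable before using it, with only the standard library, is shown below. The `is_installed` helper is a hypothetical name introduced for illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the import system can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None

for name in ("lda2vec", "bertopic"):
    if is_installed(name):
        print(f"{name}: available")
    else:
        print(f"{name}: not found; install it and restart the kernel before importing")
```

Running a check like this at the top of the notebook makes it clear whether a later `ModuleNotFoundError` comes from a failed install or from cell ordering.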

## (4) (10 points) Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [17]:
pip install bertopic

Collecting bertopic
  Using cached bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting hdbscan>=0.8.29
  Using cached hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting umap-learn>=0.5.0
  Using cached umap_learn-0.5.4-py3-none-any.whl
Collecting sentence-transformers>=0.4.1
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting huggingface-hub>=0.4.0
  Using cached huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.99-cp39-cp39-win_amd64.whl (977 kB)
Collecting transformers<5.0.0,>=4.6.0
  Using cached transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting fsspec>=2023.5.0
  Using cached fsspec-2023.10.0-py3-n

  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\soumya nanditha\anaconda3\python.exe' 'C:\Users\soumya nanditha\anaconda3\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' build_wheel 'C:\Users\SOUMYA~1\AppData\Local\Temp\tmph6xlntsu'
       cwd: C:\Users\soumya nanditha\AppData\Local\Temp\pip-install-vsczq41l\hdbscan_a5f1b0a0619048bba15f2344513efec1
  Complete output (40 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-39
  creating build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\flat.py -> build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\hdbscan_.py -> build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\plots.py -> build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\prediction.py -> build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\robust_single_linkage_.py -> build\lib.win-amd64-cpython-39\hdbscan
  copying hdbscan\validity.py -> build\lib.wi

In [16]:
!pip install bertopic
from bertopic import BERTopic


ModuleNotFoundError: No module named 'bertopic'

In [15]:
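# Write your code here
# BERTopic is imported in the previous cell; the coherence model comes from gensim
# (bertopic does not provide a CoherenceModel).
from gensim import corpora
from gensim.models import CoherenceModel

# NOTE: four headlines may be too few for BERTopic's dimensionality-reduction and
# clustering steps; the collected corpus should be used instead.
text_data = ["Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",]

best_k = -1
best_coherence = float("-inf")
best_model = None

for k in range(3, 14):  # Adjust the range as needed
    # nr_topics asks BERTopic to reduce the discovered topics to k
    topic_model = BERTopic(language="english", nr_topics=k, calculate_probabilities=True, verbose=True)
    topics, probs = topic_model.fit_transform(text_data)

    # Tokenize with BERTopic's own vectorizer so the tokens match the topic words
    analyzer = topic_model.vectorizer_model.build_analyzer()
    tokenized_text = [analyzer(doc) for doc in text_data]
    dictionary = corpora.Dictionary(tokenized_text)

    # Top words per topic, skipping the outlier topic (-1) and empty padding words
    topic_words = [[word for word, _ in words if word]
                   for topic_id, words in topic_model.get_topics().items() if topic_id != -1]

    coherence_model = CoherenceModel(topics=topic_words, texts=tokenized_text, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    if coherence_score > best_coherence:
        best_k = k
        best_coherence = coherence_score
        best_model = topic_model

# get_topics() returns a dict mapping topic id to a list of (word, score) pairs
top_words_per_topic = best_model.get_topics()

print("Optimal number of topics (K):", best_k)
print("Topics:")
for topic_id, words in top_words_per_topic.items():
    print(f"Topic {topic_id}:", words)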



NameError: name 'BERTopic' is not defined

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.

In [None]:
# Write your answer here (no code needed for this question)



