
# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
import csv  # Importing the CSV module for reading CSV files
import gensim  # Importing the Gensim library for topic modeling
from gensim import corpora  # Importing the corpora module for creating a document-term matrix
from gensim.models import CoherenceModel  # Importing CoherenceModel for computing coherence score
from nltk.tokenize import word_tokenize  # Importing the word_tokenize tool for tokenization
from nltk.corpus import stopwords  # Importing a list of common stop words
from nltk.stem import WordNetLemmatizer  # Importing the WordNetLemmatizer for lemmatization

# Reading the news articles from the CSV file
news_articles = []
with open('news_articles.csv', mode='r', encoding='utf-8', newline='') as file:  # Opening the CSV file
    reader = csv.reader(file)  # Creating a reader object
    next(reader, None)  # Skipping the header row (assumes the file has one, as the pandas cells later in this notebook imply)
    for row in reader:  # Looping through each row in the CSV
        news_articles.append(row[2])  # Appending the text from the third column to the news_articles list

# Getting set of common stop words in english
stop_words = set(stopwords.words('english'))

# Lemmatization to convert words into their base form
lemmatizer = WordNetLemmatizer()

# Tokenizing each article, removing stop words and non-alphabetic tokens, then lemmatizing what remains
tokenized_articles = [
    [
        lemmatizer.lemmatize(word)
        for word in word_tokenize(article.lower())
        if word.isalpha() and word not in stop_words
    ]
    for article in news_articles
]

# Let's create a dictionary of tokenized articles
dictionary = corpora.Dictionary(tokenized_articles)

# Create a corpus using the tokenized dictionary
corpus = [dictionary.doc2bow(article) for article in tokenized_articles]

# Function to compute coherence values for different numbers of topics
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):  # Iterating over a range of topic numbers
        model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, random_state=42)  # Creating an LDA model (fixed seed so reruns give the same topics)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')  # Computing coherence score
        coherence_values.append(coherence_model.get_coherence())
    return model_list, coherence_values

# Compute coherence values for different topic numbers
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_articles, start=2, limit=10, step=1)

# Selecting the model with the highest coherence value
optimal_model = model_list[coherence_values.index(max(coherence_values))]

# Print the topics and their respective keywords
optimal_model.print_topics(num_words=5)


[(0,
  '0.012*"one" + 0.010*"say" + 0.009*"drag" + 0.008*"also" + 0.006*"first"'),
 (1,
  '0.008*"also" + 0.007*"say" + 0.007*"one" + 0.005*"indigenous" + 0.005*"price"'),
 (2,
  '0.014*"price" + 0.012*"company" + 0.011*"streaming" + 0.009*"service" + 0.008*"business"'),
 (3,
  '0.010*"also" + 0.009*"work" + 0.009*"like" + 0.008*"straw" + 0.008*"plastic"'),
 (4,
  '0.010*"say" + 0.009*"price" + 0.007*"company" + 0.006*"pricing" + 0.006*"straw"'),
 (5,
  '0.008*"say" + 0.007*"one" + 0.006*"price" + 0.006*"climate" + 0.006*"business"'),
 (6,
  '0.006*"straw" + 0.006*"say" + 0.005*"price" + 0.005*"also" + 0.005*"company"'),
 (7,
  '0.008*"say" + 0.007*"also" + 0.007*"community" + 0.006*"climate" + 0.005*"text"')]
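
The cell above keeps the best model but never reports which K it corresponds to. The sketch below shows one way to recover K from the list of coherence values, assuming the values were produced in the same order as `range(start, limit, step)`. The helper name `best_num_topics` and the example scores are hypothetical and are not taken from this run.

```python
def best_num_topics(coherence_values, start=2, step=1):
    """Return (K, score) for the highest coherence value.

    Assumes coherence_values[i] was computed for K = start + i * step.
    """
    if not coherence_values:
        raise ValueError("coherence_values is empty")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_index * step, coherence_values[best_index]


# Hypothetical scores for K = 2..9, for illustration only
example_scores = [0.31, 0.34, 0.36, 0.33, 0.35, 0.37, 0.41, 0.38]
k, score = best_num_topics(example_scores, start=2, step=1)
print(f"Best K = {k} (coherence {score:.2f})")  # prints "Best K = 8 (coherence 0.41)"
```

With the variables from the cell above, the call would be `best_num_topics(coherence_values, start=2, step=1)`.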

## (2) (10 points) Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Read news articles from the CSV file
news_articles = []
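with open('news_articles.csv', mode='r', encoding='utf-8', newline='') as file:  # Opening the CSV file
    reader = csv.reader(file)  # Creating a reader object
    next(reader, None)  # Skipping the header row (assumes the file has one, as the pandas cells later in this notebook imply)
    for row in reader:
        news_articles.append(row[2])  # Appending the text from the third column to the news_articles list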

# Converting the raw documents to a matrix of TF-IDF features (done once, since it does not depend on the number of topics)
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(news_articles)  # Fitting and transforming the data

# Scoring each candidate number of topics.
# Note: the score below is the total explained variance ratio, which is a stand-in and not a true coherence score.
# It generally grows as components are added, so this selection tends to pick the largest K tried.
variance_scores = []
for num_topics in range(2, 11):  # Trying different numbers of topics
    lsa_model = TruncatedSVD(num_topics, random_state=42)  # Creating an LSA model with a specific number of topics
    lsa_model.fit(tfidf_matrix)  # Fitting the LSA model to the TF-IDF matrix
    variance_scores.append((num_topics, lsa_model.explained_variance_ratio_.sum()))  # Appending the score

# Choosing the number of topics that maximizes the score
optimal_num_topics = max(variance_scores, key=lambda x: x[1])[0]

# Generating topics using LSA with the optimal number of topics
lsa_model = TruncatedSVD(optimal_num_topics, random_state=42)  # Creating an LSA model with the optimal number of topics
lsa_pipeline = make_pipeline(lsa_model, Normalizer(copy=False))  # Creating an LSA pipeline with normalization
lsa_matrix = lsa_pipeline.fit_transform(tfidf_matrix)  # Fitting and transforming the data using the LSA pipeline

# Getting the terms from the vectorizer
terms = vectorizer.get_feature_names_out()

# Printing the summary of the topics
print(f"Summary of {optimal_num_topics} topics generated by LSA:")
for i, component in enumerate(lsa_model.components_):  # Iterating over the components
    top_terms = [terms[j] for j in component.argsort()[:-6:-1]]  # Extracting the top terms for each topic
    print(f"Topic {i + 1}: {', '.join(top_terms)}")  # Printing the top terms for each topic


Summary of 10 topics generated by LSA:
Topic 1: says, like, ad, climate, million
Topic 2: like, sea, catalogue, brought, green
Topic 3: pricing, ad, companies, price, prices
Topic 4: work, 2016, museum, seen, said
Topic 5: hours, view, difficult, sense, later
Topic 6: ll, just, planet, places, international
Topic 7: climate, says, change, pricing, natural
Topic 8: like, known, came, pricing, catalogue
Topic 9: ad, company, time, free, business
Topic 10: adopted, announced, blue, hours, green
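
The selection above relies on explained variance, which is not a coherence measure. As a rough illustration of what a coherence score looks at, the sketch below computes a simplified UMass-style coherence for one topic from document co-occurrence counts, using only the standard library. The function name `umass_coherence` and the toy documents are hypothetical. Library implementations such as gensim's `CoherenceModel` differ in details, so treat this as a sketch and not a drop-in replacement.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Average UMass-style score over the word pairs of one topic.

    For each pair (higher-ranked word, lower-ranked word) the score is
    log((co-document count + 1) / document count of the higher-ranked word).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # The higher-ranked word never occurs, so the pair is skipped
        scores.append(math.log((doc_freq(earlier, later) + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy documents, for illustration only
toy_docs = [
    ["climate", "change", "price"],
    ["climate", "change"],
    ["price", "company"],
]
print(umass_coherence(["climate", "change"], toy_docs))   # words that co-occur
print(umass_coherence(["climate", "company"], toy_docs))  # words that never co-occur
```

Averaging this score over all topics for each candidate K, and keeping the K with the highest average, would follow the same selection pattern used for LDA in part (1).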


## (3) (10 points) Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
pip install lda2vec

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install pyldavis

Collecting pyldavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting pandas>=2.0.0
  Downloading pandas-2.1.2-cp39-cp39-win_amd64.whl (10.8 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
Collecting numpy>=1.24.2
  Downloading numpy-1.26.1-cp39-cp39-win_amd64.whl (15.8 MB)
Collecting funcy
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting tzdata>=2022.1
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting scipy
  Downloading scipy-1.11.3-cp39-cp39-win_amd64.whl (44.3 MB)
Installing collected packages: numpy, tzdata, scipy, joblib, pandas, funcy, pyldavis
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.7.3
    Uninstalling scipy-1.7.3:
      Successfully uninstalled scipy-1.7.3
  Attempting uninstall: joblib
    Found existing installat

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
tensorflow-intel 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.1 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.26.1 which is incompatible.


In [None]:
pip install pyLDAvis

Collecting FuzzyTM>=0.4.0 (from gensim->pyLDAvis)
  Downloading FuzzyTM-2.0.5-py3-none-any.whl (29 kB)
Collecting pyfume (from FuzzyTM>=0.4.0->gensim->pyLDAvis)
  Downloading pyFUME-0.2.25-py3-none-any.whl (67 kB)
Collecting simpful (from pyfume->FuzzyTM>=0.4.0->gensim->pyLDAvis)
  Downloading simpful-2.11.0-py3-none-any.whl (32 kB)
Collecting fst-pso (from pyfume->FuzzyTM>=0.4.0->gensim->pyLDAvis)
  Downloading fst-pso-1.8.1.tar.gz (18 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting miniful (from fst-pso->pyfume->FuzzyTM>=0.4.0->gensim->pyLDAvis)
  Downloading miniful-0.0.6.tar.gz (2.8 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: fst-pso, miniful
  Building wheel for f

In [None]:
pip install preprocess

Collecting preprocess
  Downloading preprocess-2.0.0-py3-none-any.whl (12 kB)
Installing collected packages: preprocess
Successfully installed preprocess-2.0.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from lda2vec import preprocess, Corpus, Lda2Vec

try:
    # Load the data
    data_frame = pd.read_csv('news_articles.csv')

    # Preprocess the data
    vectorizer = CountVectorizer()  # Initialize the CountVectorizer
    X = vectorizer.fit_transform(data_frame['text'])  # Transform the text data into a document-term matrix
    vocab = vectorizer.get_feature_names_out()  # Get the vocabulary of the text data

    # Create a corpus object
    corpus_obj = Corpus()  # Initialize the corpus object
    corpus_obj.fit(X)  # Fit the document-term matrix to the corpus object

    # Train the LDA2Vec model to find the optimal number of topics
    lda2vec_model = Lda2Vec(num_topics=10, passes=5, vocab=vocab)  # Initialize the LDA2Vec model
    lda2vec_model.fit(corpus_obj)  # Fit the corpus object to the LDA2Vec model

    # Get the coherence scores for different numbers of topics
    coherence_scores = []
    for num_topics in range(5, 15):
        model = Lda2Vec(num_topics=num_topics, passes=5, vocab=vocab)  # Initialize the LDA2Vec model with varying topics
        model.fit(corpus_obj)  # Fit the corpus object to the LDA2Vec model
        coherence_scores.append((num_topics, model.get_coherence()))  # Append the coherence score for each number of topics

    # Find the number of topics with the highest coherence score
    best_num_topics = max(coherence_scores, key=lambda x: x[1])[0]  # Find the number of topics with the highest coherence score

    # Train the LDA2Vec model with the best number of topics
    final_model = Lda2Vec(num_topics=best_num_topics, passes=5, vocab=vocab)  # Initialize the LDA2Vec model with the best number of topics
    final_model.fit(corpus_obj)  # Fit the corpus object to the final LDA2Vec model

    # Get the topics
    topics = final_model.print_topics(num_words=5)  # Get the topics with the specified number of words

    # Display the topics
    for topic in topics:
        print("Topic {}: {}".format(topic[0], topic[1]))  # Print the topics with their corresponding words

except Exception as e:
    print(f"An error occurred: {e}")

## (4) (10 points) Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.15.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
tables 3.8.0 requires blosc2~=2.0.0, which is not installed.
featurewiz 0.2.2 requires pyarrow~=7.0.0, but you have pyarrow 11.0.0 which is incompatible.
tensorboard 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
tensorflow-intel 2.12.0 requires keras<2.13,>=2.12.0, but you have keras 2.10.0 which is incompatible.
tensorflow-intel 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.25.2 which is incompatible.
tensorflow-intel 2.12.0 requires tensorboard<2.13,>=2.12, but you have tensorboard 2.10.0 which is incompatible.
tensorflow-in

In [None]:
pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install torch




In [None]:
import pandas as pd
from bertopic import BERTopic  # Import the BERTopic library
from sentence_transformers import SentenceTransformer  # Import the SentenceTransformer for BERT embeddings

# Load the CSV file
data = pd.read_csv('news_articles.csv')  # Load the news_articles.csv file

# Extract the text data from the CSV
articles_text = data['Text'].tolist()  # Extract the text data from the CSV

# Define the sentence-transformer model for BERT embeddings
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # Load the pre-trained BERT-based model

# Create the BERTopic model and fit it to the data
topic_model = BERTopic(language='english', calculate_probabilities=True, embedding_model=model)  # Create a BERTopic model
topics, _ = topic_model.fit_transform(articles_text)  # Fit the model to the text data

# Get the topic frequencies (a DataFrame with 'Topic' and 'Count' columns)
top_topics = topic_model.get_topic_freq()

# Print the topics and their top words (topic -1, if present, collects outlier documents)
for topic_id in top_topics['Topic']:
    words = topic_model.get_topic(topic_id)  # List of (word, score) pairs for the topic
    print(f"Topic {topic_id}: {', '.join(word for word, _ in words)}")  # Print the topic ID and its top words
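
The task asks for K to be chosen by a coherence score, and a coherence computation needs each topic as a plain list of words. The sketch below shows that conversion step only. It assumes the dictionary shape returned by `topic_model.get_topics()`, that is, a topic id mapped to a list of (word, score) pairs, with -1 used for outliers. The helper name `topic_word_lists` and the sample values are made up for illustration.

```python
def topic_word_lists(topics, top_n=5, skip_outliers=True):
    """Convert {topic_id: [(word, score), ...]} into {topic_id: [word, ...]}."""
    word_lists = {}
    for topic_id, word_scores in topics.items():
        if skip_outliers and topic_id == -1:
            continue  # -1 is the label BERTopic gives to outlier documents
        word_lists[topic_id] = [word for word, _ in word_scores[:top_n]]
    return word_lists


# Made-up stand-in for topic_model.get_topics(); words and scores are illustrative only
sample_topics = {
    -1: [("said", 0.02), ("also", 0.02)],
    0: [("price", 0.05), ("streaming", 0.04), ("service", 0.03)],
    1: [("straw", 0.06), ("plastic", 0.05), ("climate", 0.03)],
}

for topic_id, words in topic_word_lists(sample_topics, top_n=2).items():
    print(f"Topic {topic_id}: {', '.join(words)}")
```

The resulting word lists could then be scored with a coherence measure for several model configurations, and the configuration with the highest score kept.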


## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.

LDA (Latent Dirichlet Allocation) produces topics that are easy to interpret and distinct, which helps with topic selection. Each topic comes with a clear theme and a weighted keyword distribution, so identifying and selecting relevant topics is straightforward.
LSA (Latent Semantic Analysis), on the other hand, generates broader topics that may not align as precisely with the specific themes of the documents, which can complicate topic selection.

Meanwhile, lda2vec combines LDA with word embeddings, so it can take semantic relationships between words into account and give a more nuanced view of the topics. BERTopic builds on context-rich BERT embeddings and excels at capturing fine-grained topic nuances.

Given this project's emphasis on topic selection, LDA's transparent and focused output makes it the best choice. It gives a clear view of the key themes in the documents and supports efficient selection of the most relevant topics. Its simplicity and interpretability make LDA the most appropriate of the four models for topic selection.
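
One rough way to put a number on how distinct a model's topics are is the average Jaccard overlap between the top-word sets of every pair of topics, where a lower value means less shared vocabulary. The sketch below applies it to the top words copied from the LDA and LSA outputs printed earlier in this notebook. The function name `mean_pairwise_jaccard` is hypothetical, and this overlap measure is an assumption made for illustration. It complements a coherence score and does not replace one.

```python
from itertools import combinations


def mean_pairwise_jaccard(topics):
    """Average Jaccard overlap between the word sets of every pair of topics."""
    pairs = list(combinations([set(words) for words in topics], 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Top words copied from the LDA output in part (1)
lda_topics = [
    ["one", "say", "drag", "also", "first"],
    ["also", "say", "one", "indigenous", "price"],
    ["price", "company", "streaming", "service", "business"],
    ["also", "work", "like", "straw", "plastic"],
    ["say", "price", "company", "pricing", "straw"],
    ["say", "one", "price", "climate", "business"],
    ["straw", "say", "price", "also", "company"],
    ["say", "also", "community", "climate", "text"],
]

# Top words copied from the LSA output in part (2)
lsa_topics = [
    ["says", "like", "ad", "climate", "million"],
    ["like", "sea", "catalogue", "brought", "green"],
    ["pricing", "ad", "companies", "price", "prices"],
    ["work", "2016", "museum", "seen", "said"],
    ["hours", "view", "difficult", "sense", "later"],
    ["ll", "just", "planet", "places", "international"],
    ["climate", "says", "change", "pricing", "natural"],
    ["like", "known", "came", "pricing", "catalogue"],
    ["ad", "company", "time", "free", "business"],
    ["adopted", "announced", "blue", "hours", "green"],
]

print(f"LDA mean overlap: {mean_pairwise_jaccard(lda_topics):.3f}")
print(f"LSA mean overlap: {mean_pairwise_jaccard(lsa_topics):.3f}")
```

The two models preprocess text differently (lemmatized tokens for LDA, raw TF-IDF terms for LSA), so the two numbers are only loosely comparable.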