# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [6]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from collections import Counter

# Function to preprocess text data for topic modeling
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # Build the stopword set once per call
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize tokens
    return tokens

# Function to scrape IMDb reviews
def scrape_imdb_reviews(url, max_reviews=50):
    reviews = []
    page_num = 1

    while len(reviews) < max_reviews:
        page_url = f"{url}&start={page_num * 10 - 10}"
        response = requests.get(page_url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            review_containers = soup.find_all('div', class_='lister-item-content')

            if not review_containers:
                print("No more reviews available.")
                break

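            added = 0
            for container in review_containers:
                text_div = container.find('div', class_='text show-more__control')
                if text_div is None:  # Skip containers without a matching review body
                    continue
                reviews.append(text_div.get_text().strip())
                added += 1

                if len(reviews) >= max_reviews:
                    break

            if added == 0:  # No usable reviews on this page; stop instead of looping forever
                break
            page_num += 1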
        else:
            print(f"Failed to fetch IMDb reviews page {page_num}. Status code:", response.status_code)
            break

    return reviews[:max_reviews]

# URL of the IMDb page to scrape reviews from
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv'
reviews = scrape_imdb_reviews(url, max_reviews=50)

# Preprocess reviews for LDA topic modeling
preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Create a dictionary from the preprocessed reviews
dictionary = Dictionary(preprocessed_reviews)

# Convert the reviews to bag-of-words format
corpus = [dictionary.doc2bow(review) for review in preprocessed_reviews]

# Determine the optimal number of topics (K) based on coherence score
coherence_scores = {}
for k in range(2, 11):
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_reviews, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    coherence_scores[k] = coherence_score

# Choose the number of topics with the highest coherence score
optimal_k = max(coherence_scores, key=coherence_scores.get)
print("Optimal number of topics:", optimal_k)

# Train the LDA model with the optimal number of topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k, random_state=42)

# Get the top words for each topic
topic_top_words = {}
for topic_id in range(optimal_k):
    topic_words = lda_model.get_topic_terms(topic_id, topn=5)
    top_words = [dictionary[word_id] for word_id, _ in topic_words]
    topic_top_words[topic_id] = top_words

# Print the topics and their top words
print("\nTopics and Top Words:")
for topic, top_words in topic_top_words.items():
    print(f"Topic {topic + 1}: {' | '.join(top_words)}")

# Function to summarize topics based on top words
def summarize_topics(topic_top_words):
    topics_summary = []
    for topic, top_words in topic_top_words.items():
        topics_summary.append(f"Topic {topic + 1}: {' | '.join(top_words)}")
    return '\n'.join(topics_summary)

# Summarize the topics
topics_summary = summarize_topics(topic_top_words)
print("\nTopics Summary:")
print(topics_summary)




Optimal number of topics: 5

Topics and Top Words:
Topic 1: film | movie | harry | potter | book
Topic 2: harry | movie | potter | film | book
Topic 3: film | movie | harry | potter | book
Topic 4: film | movie | book | potter | harry
Topic 5: movie | harry | film | potter | book

Topics Summary:
Topic 1: film | movie | harry | potter | book
Topic 2: harry | movie | potter | film | book
Topic 3: film | movie | harry | potter | book
Topic 4: film | movie | book | potter | harry
Topic 5: movie | harry | film | potter | book
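
The five LDA topics printed above contain the same five words in different orders. One way to quantify that redundancy is the pairwise Jaccard overlap of the top-word sets. The following is a minimal sketch, not part of the original cell; the helper names `jaccard` and `mean_pairwise_jaccard` are hypothetical, and the word lists are copied from the output above.

```python
from itertools import combinations

# Top words copied from the LDA output above.
lda_topics = [
    ["film", "movie", "harry", "potter", "book"],
    ["harry", "movie", "potter", "film", "book"],
    ["film", "movie", "harry", "potter", "book"],
    ["film", "movie", "book", "potter", "harry"],
    ["movie", "harry", "film", "potter", "book"],
]

def jaccard(a, b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mean_pairwise_jaccard(topics):
    """Average Jaccard similarity over all unordered topic pairs."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(mean_pairwise_jaccard(lda_topics))  # 1.0: every topic has the same word set
```

A value near 1.0 suggests the topics are not distinguishable by their top words, so the chosen K may be larger than this small corpus supports.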


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [8]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LsiModel, CoherenceModel
from collections import Counter

# Function to preprocess text data for topic modeling
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # Build the stopword set once per call
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize tokens
    return tokens

# Function to scrape IMDb reviews
def scrape_imdb_reviews(url, max_reviews=50):
    reviews = []
    page_num = 1

    while len(reviews) < max_reviews:
        page_url = f"{url}&start={page_num * 10 - 10}"
        response = requests.get(page_url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            review_containers = soup.find_all('div', class_='lister-item-content')

            if not review_containers:
                print("No more reviews available.")
                break

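            added = 0
            for container in review_containers:
                text_div = container.find('div', class_='text show-more__control')
                if text_div is None:  # Skip containers without a matching review body
                    continue
                reviews.append(text_div.get_text().strip())
                added += 1

                if len(reviews) >= max_reviews:
                    break

            if added == 0:  # No usable reviews on this page; stop instead of looping forever
                break
            page_num += 1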
        else:
            print(f"Failed to fetch IMDb reviews page {page_num}. Status code:", response.status_code)
            break

    return reviews[:max_reviews]

# URL of the IMDb page to scrape reviews from
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv'
reviews = scrape_imdb_reviews(url, max_reviews=50)

# Preprocess reviews for LSA topic modeling
preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Create a dictionary from the preprocessed reviews
dictionary = Dictionary(preprocessed_reviews)

# Convert the reviews to bag-of-words format
corpus = [dictionary.doc2bow(review) for review in preprocessed_reviews]

# Determine the optimal number of topics (K) based on coherence score
coherence_scores = {}
for k in range(2, 11):
    lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=k)
    coherence_model_lsa = CoherenceModel(model=lsa_model, texts=preprocessed_reviews, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lsa.get_coherence()
    coherence_scores[k] = coherence_score

# Choose the number of topics with the highest coherence score
optimal_k = max(coherence_scores, key=coherence_scores.get)
print("Optimal number of topics:", optimal_k)

# Train the LSA model with the optimal number of topics
lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k)

# Get the top words for each topic
topic_top_words = {}
for topic_id, topic in lsa_model.show_topics(formatted=False):
    top_words = [word for word, _ in topic]
    topic_top_words[topic_id] = top_words

# Print the topics and their top words
print("\nTopics and Top Words:")
for topic, top_words in topic_top_words.items():
    print(f"Topic {topic + 1}: {' | '.join(top_words)}")

# Function to summarize topics based on top words
def summarize_topics(topic_top_words):
    topics_summary = []
    for topic, top_words in topic_top_words.items():
        topics_summary.append(f"Topic {topic + 1}: {' | '.join(top_words)}")
    return '\n'.join(topics_summary)

# Summarize the topics
topics_summary = summarize_topics(topic_top_words)
print("\nTopics Summary:")
print(topics_summary)


Optimal number of topics: 8

Topics and Top Words:
Topic 1: movie | harry | film | potter | book | like | child | character | first | time
Topic 2: movie | film | harry | also | wizard | magic | rowling | though | boy | villain
Topic 3: book | harry | potter | story | film | also | stone | way | much | even
Topic 4: philosopher | like | special | would | potter | good | thing | really | kid | stone
Topic 5: first | kid | world | see | harry | also | special | movie | magic | richard
Topic 6: film | like | would | also | without | special | feel | scene | new | want
Topic 7: nothing | actor | young | harry | read | character | role | novel | want | fine
Topic 8: feel | book | film | harry | never | potter | scene | wizard | also | like

Topics Summary:
Topic 1: movie | harry | film | potter | book | like | child | character | first | time
Topic 2: movie | film | harry | also | wizard | magic | rowling | though | boy | villain
Topic 3: book | harry | potter | story | film | also | stone 
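
The cell above picks K by maximizing gensim's `c_v` coherence. To make concrete what a coherence score rewards, here is a minimal pure-Python sketch of the simpler UMass-style measure, which scores a topic by how often its top words co-occur in the same documents. This is an illustration under stated assumptions: it is a different measure from `c_v`, it is not gensim's implementation (gensim exposes its own version through `coherence='u_mass'`), and the function name `umass_coherence` and the toy documents are hypothetical.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence: mean of log((D(wi, wj) + 1) / D(wj)) over word pairs.

    top_words: topic words, most probable first.
    docs: list of token lists (the tokenized documents).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            denom = doc_freq(wj)
            if denom == 0:
                continue  # Word never appears in docs; skip the pair
            scores.append(math.log((doc_freq(wi, wj) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy documents, not the scraped reviews.
toy_docs = [
    ["harry", "wizard", "magic"],
    ["harry", "wizard", "school"],
    ["actor", "scene", "camera"],
    ["actor", "scene", "magic"],
]
coherent = umass_coherence(["harry", "wizard"], toy_docs)
mixed = umass_coherence(["harry", "actor"], toy_docs)
print(coherent > mixed)  # True: words that co-occur score higher
```

Words that appear together in documents raise the score, while a topic mixing words from unrelated documents lowers it, which is why the K with the highest average score is preferred.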

## Question 3 (10 Points)

**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [16]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from collections import Counter

# Function to preprocess text data for topic modeling
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # Build the stopword set once per call
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize tokens
    return tokens

# Function to scrape IMDb reviews
def scrape_imdb_reviews(url, max_reviews=50):
    reviews = []
    page_num = 1

    while len(reviews) < max_reviews:
        page_url = f"{url}&start={page_num * 10 - 10}"
        response = requests.get(page_url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            review_containers = soup.find_all('div', class_='lister-item-content')

            if not review_containers:
                print("No more reviews available.")
                break

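            added = 0
            for container in review_containers:
                text_div = container.find('div', class_='text show-more__control')
                if text_div is None:  # Skip containers without a matching review body
                    continue
                reviews.append(text_div.get_text().strip())
                added += 1

                if len(reviews) >= max_reviews:
                    break

            if added == 0:  # No usable reviews on this page; stop instead of looping forever
                break
            page_num += 1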
        else:
            print(f"Failed to fetch IMDb reviews page {page_num}. Status code:", response.status_code)
            break

    return reviews[:max_reviews]

# URL of the IMDb page to scrape reviews from
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv'
reviews = scrape_imdb_reviews(url, max_reviews=50)

# Preprocess reviews for LDA topic modeling
# (LDA is used in this cell in place of lda2vec, which could not be installed in Colab)
preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Create a dictionary from the preprocessed reviews
dictionary = Dictionary(preprocessed_reviews)

# Convert the reviews to bag-of-words format
corpus = [dictionary.doc2bow(review) for review in preprocessed_reviews]

# Determine the optimal number of topics (K) based on coherence score
coherence_scores = {}
for k in range(2, 11):
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_reviews, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    coherence_scores[k] = coherence_score

# Choose the number of topics with the highest coherence score
optimal_k = max(coherence_scores, key=coherence_scores.get)
print("Optimal number of topics:", optimal_k)

# Train the LDA model with the optimal number of topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k)

# Get the top words for each topic
topic_top_words = {}
for topic_id, topic in lda_model.show_topics(formatted=False):
    top_words = [word for word, _ in topic]
    topic_top_words[topic_id] = top_words

# Print the topics and their top words
print("\nTopics and Top Words:")
for topic, top_words in topic_top_words.items():
    print(f"Topic {topic + 1}: {' | '.join(top_words)}")

# Function to summarize topics based on top words
def summarize_topics(topic_top_words):
    topics_summary = []
    for topic, top_words in topic_top_words.items():
        topics_summary.append(f"Topic {topic + 1}: {' | '.join(top_words)}")
    return '\n'.join(topics_summary)

# Summarize the topics
topics_summary = summarize_topics(topic_top_words)
print("\nTopics Summary:")
print(topics_summary)




Optimal number of topics: 2

Topics and Top Words:
Topic 1: movie | harry | potter | film | book | like | great | character | first | good
Topic 2: movie | film | harry | potter | book | character | like | child | wizard | one

Topics Summary:
Topic 1: movie | harry | potter | film | book | like | great | character | first | good
Topic 2: movie | film | harry | potter | book | character | like | child | wizard | one
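
The two topics above share most of their top words ("movie", "harry", "potter", "film", "book"), probably because almost every review mentions them. A common remedy, which the original cell does not apply, is to drop tokens that occur in too many or too few documents before building the dictionary; gensim offers `Dictionary.filter_extremes(no_below=..., no_above=...)` for this. The sketch below shows the idea in plain Python; the function name `filter_by_doc_frequency`, its thresholds, and the toy reviews are hypothetical.

```python
from collections import Counter

def filter_by_doc_frequency(docs, min_docs=2, max_share=0.5):
    """Keep tokens that occur in at least `min_docs` documents
    and in at most `max_share` of all documents."""
    # Count each token once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token for token, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_share
    }
    return [[token for token in doc if token in keep] for doc in docs]

# Hypothetical tokenized reviews for illustration only.
toy_reviews = [
    ["harry", "potter", "magic", "wizard"],
    ["harry", "potter", "actor", "scene"],
    ["harry", "potter", "magic", "school"],
    ["harry", "potter", "actor", "music"],
]
for doc in filter_by_doc_frequency(toy_reviews):
    print(doc)
```

In this toy example "harry" and "potter" are removed because they occur in every document, and words seen only once are removed as too rare, leaving the terms that actually separate the reviews.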


## Question 4 (10 Points)

**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [18]:

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from bertopic import BERTopic
from gensim.models import CoherenceModel

# Install NLTK data if needed
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Function to preprocess text data for topic modeling
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # Build the stopword set once per call
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize tokens
    return tokens

# Function to scrape IMDb reviews
def scrape_imdb_reviews(url, max_reviews=50):
    reviews = []
    page_num = 1

    while len(reviews) < max_reviews:
        page_url = f"{url}&start={page_num * 10 - 10}"
        response = requests.get(page_url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            review_containers = soup.find_all('div', class_='lister-item-content')

            if not review_containers:
                print("No more reviews available.")
                break

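            added = 0
            for container in review_containers:
                text_div = container.find('div', class_='text show-more__control')
                if text_div is None:  # Skip containers without a matching review body
                    continue
                reviews.append(text_div.get_text().strip())
                added += 1

                if len(reviews) >= max_reviews:
                    break

            if added == 0:  # No usable reviews on this page; stop instead of looping forever
                break
            page_num += 1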
        else:
            print(f"Failed to fetch IMDb reviews page {page_num}. Status code:", response.status_code)
            break

    return reviews[:max_reviews]

# URL of the IMDb page to scrape reviews from
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv'
reviews = scrape_imdb_reviews(url, max_reviews=50)

# Preprocess reviews for BERTopic
preprocessed_reviews = [' '.join(preprocess_text(review)) for review in reviews]

# Fit BERTopic with its default settings; the number of topics comes from its clustering step
# (no coherence-based search over K is performed in this cell)
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(preprocessed_reviews)

# Print the number of topics and their top words
topic_top_words = topic_model.get_topics()
print("\nTopics and Top Words:")
for topic_id, top_words in topic_top_words.items():
    print(f"Topic {topic_id + 1}: {' | '.join(map(str, top_words[:5]))}")  # Convert top_words to strings

# Function to summarize topics based on top words
def summarize_topics(topic_top_words):
    topics_summary = []
    for topic_id, top_words in topic_top_words.items():
        topics_summary.append(f"Topic {topic_id + 1}: {' | '.join(map(str, top_words[:5]))}")
    return '\n'.join(topics_summary)

# Summarize the topics
topics_summary = summarize_topics(topic_top_words)
print("\nTopics Summary:")
print(topics_summary)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Topics and Top Words:
Topic 0: ('harry', 0.08904450600962192) | ('film', 0.06861888795651858) | ('magic', 0.04653463597890399) | ('potter', 0.043116946571880574) | ('wizard', 0.04272198355777768)
Topic 1: ('movie', 0.08578276013531544) | ('film', 0.0565590582428615) | ('harry', 0.051756965131882306) | ('potter', 0.04975474757827734) | ('book', 0.04110724687113712)
Topic 2: ('great', 0.1145428279901484) | ('film', 0.0631336438447923) | ('good', 0.06248197933300945) | ('harry', 0.05391742870571163) | ('harris', 0.053792902033650954)

Topics Summary:
Topic 0: ('harry', 0.08904450600962192) | ('film', 0.06861888795651858) | ('magic', 0.04653463597890399) | ('potter', 0.043116946571880574) | ('wizard', 0.04272198355777768)
Topic 1: ('movie', 0.08578276013531544) | ('film', 0.0565590582428615) | ('harry', 0.051756965131882306) | ('potter', 0.04975474757827734) | ('book', 0.04110724687113712)
Topic 2: ('great', 0.1145428279901484) | ('film', 0.0631336438447923) | ('good', 0.06248197933300945
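
The output above prints raw `(word, score)` tuples because `get_topics()` maps each topic id to a list of such pairs. The labels are also shifted by one: BERTopic reserves id -1 for outlier documents, so the line printed as "Topic 0" most likely corresponds to id -1 (an inference from the `topic_id + 1` in the print statement, not something the output states). A minimal sketch of a more readable rendering follows; the helper name `format_topics` is hypothetical, and the sample data reuses the words above with rounded scores.

```python
def format_topics(topic_terms, top_n=5):
    """Render {topic_id: [(word, score), ...]} as readable lines, words only."""
    lines = []
    for topic_id, terms in sorted(topic_terms.items()):
        label = "Outliers (-1)" if topic_id == -1 else f"Topic {topic_id}"
        words = [word for word, _score in terms[:top_n]]
        lines.append(f"{label}: {' | '.join(words)}")
    return "\n".join(lines)

# Same shape as the get_topics() result; words taken from the output above,
# scores rounded, ids inferred from the +1 shift in the printed labels.
sample_topics = {
    -1: [("harry", 0.089), ("film", 0.069), ("magic", 0.047),
         ("potter", 0.043), ("wizard", 0.043)],
    0: [("movie", 0.086), ("film", 0.057), ("harry", 0.052),
        ("potter", 0.050), ("book", 0.041)],
    1: [("great", 0.115), ("film", 0.063), ("good", 0.062),
        ("harry", 0.054), ("harris", 0.054)],
}
print(format_topics(sample_topics))
```

Labeling the outlier group separately keeps it from being read as a genuine topic when the results are summarized.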

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
In summary, BERTopic and LDA seem to perform best in terms of coherence and interpretability, which is why they are becoming increasingly popular for text analysis.
LSA relies on singular value decomposition, so its topics usually score lower on coherence and are harder for a human reader to interpret.
lda2vec can produce more accurate topics but needs additional tuning, and it may not surpass LDA, which benefits from its simplicity.
Which algorithm is best depends on factors such as coherence, interpretability, and the need for context-aware modeling, all of which have to be weighed.
Overall, BERTopic and LDA are chosen most often because they balance coherence and interpretability well.
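
The comparison above is qualitative. One way to check it against the outputs already printed in this notebook is topic diversity: the share of unique words among all top words of a model (1.0 means no word is shared between topics). The sketch below is an illustration only. The helper name `topic_diversity` is hypothetical, the word lists are the first five words per topic copied from the outputs, the Question 3 entry is the LDA run used in place of lda2vec, and the BERTopic entry includes every topic line as printed. Diversity is a complement to coherence, not a replacement for it.

```python
def topic_diversity(topics):
    """Share of unique words among all top words across topics (1.0 = no overlap)."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0

# First five top words per topic, copied from the outputs in this notebook.
results = {
    "LDA (Q1)": [
        "film movie harry potter book".split(),
        "harry movie potter film book".split(),
        "film movie harry potter book".split(),
        "film movie book potter harry".split(),
        "movie harry film potter book".split(),
    ],
    "LSA (Q2)": [
        "movie harry film potter book".split(),
        "movie film harry also wizard".split(),
        "book harry potter story film".split(),
        "philosopher like special would potter".split(),
        "first kid world see harry".split(),
        "film like would also without".split(),
        "nothing actor young harry read".split(),
        "feel book film harry never".split(),
    ],
    "LDA in place of lda2vec (Q3)": [
        "movie harry potter film book".split(),
        "movie film harry potter book".split(),
    ],
    "BERTopic (Q4)": [
        "harry film magic potter wizard".split(),
        "movie film harry potter book".split(),
        "great film good harry harris".split(),
    ],
}
for name, topics in results.items():
    print(f"{name}: {topic_diversity(topics):.2f}")
```

Because the runs use different numbers of topics, the values are best read as a rough indication of how much each model's topics repeat the same words.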


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
Please write your answer here:

This assignment was challenging for me. I learned about the BERT and LDA models, and I learned how to choose the value of K using LDA.
I gained a clear understanding of these algorithms and of how to implement them on a corpus.
I faced issues with the lda2vec implementation: Colab did not support it when I installed lda2vec, because not all files are
included in this version, and I would have had to add the missing files manually in a conda installation. So I used LDA instead of lda2vec.




'''