
# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


# Mounting Google Drive:

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

# Loading, Validating, and Preprocessing Data:

In [2]:
# Loading and validating Data:
import pandas as pd

# Load CSV file (Musical_instruments_reviews.csv) reviewer ID and reviewText columns:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/INFO 5731/Musical_instruments_reviews.csv', usecols=['reviewerID', 'reviewText'])

# Display the first few rows to ensure it loaded correctly.
df.head()

| | reviewerID | reviewText |
|---|---|---|
| 0 | A2IBPI20UZIR0U | Not much to write about here, but it does exac... |
| 1 | A14VAT5EAX3D9S | The product does exactly as it should and is q... |
| 2 | A195EZSQDW3E21 | The primary job of this device is to block the... |
| 3 | A2C00NNG1ZQQG2 | Nice windscreen protects my MXL mic and preven... |
| 4 | A94QU4C90B1AX | This pop filter is great. It looks and perform... |


In [3]:
# Preprocessing Steps:
from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    stem_text,
    strip_punctuation,
    strip_short,
)

# Identify and remove empty values:
df['reviewText'].isna().value_counts()

print('Original shape of Data: ')
print(df.shape)
df = df.dropna()
print('New shape of Data: ')
print(df.shape)

# Text preprocessing using the gensim library
def preprocess(text):

    # clean text based on given filters
    CUSTOM_FILTERS = [lambda x: x.lower(),          # Converts text to lowercase
                                remove_stopwords,   # Removes stopwords
                                strip_punctuation,  # Removes punctuation
                                strip_short,        # Removes short words that are useless for the analysis
                                stem_text]          # Reduces words to their base form
    text = preprocess_string(text, CUSTOM_FILTERS)

    return text

# Apply the preprocessing function to all reviews
df['Text (Clean)'] = df['reviewText'].apply(preprocess)
df.head()


Original shape of Data: 
(10261, 2)
New shape of Data: 
(10254, 2)


| | reviewerID | reviewText | Text (Clean) |
|---|---|---|---|
| 0 | A2IBPI20UZIR0U | Not much to write about here, but it does exac... | [write, here, exactli, suppos, filter, pop, so... |
| 1 | A14VAT5EAX3D9S | The product does exactly as it should and is q... | [product, exactli, afford, realiz, doubl, scre... |
| 2 | A195EZSQDW3E21 | The primary job of this device is to block the... | [primari, job, devic, block, breath, produc, p... |
| 3 | A2C00NNG1ZQQG2 | Nice windscreen protects my MXL mic and preven... | [nice, windscreen, protect, mxl, mic, prevent,... |
| 4 | A94QU4C90B1AX | This pop filter is great. It looks and perform... | [pop, filter, great, look, perform, like, stud... |


In [4]:
from gensim import corpora
# Create a dictionary from the corpus and convert the corpus to Bag of Words (BOW)
corpus = df['Text (Clean)']
dictionary = corpora.Dictionary(corpus)

bow = [dictionary.doc2bow(text) for text in corpus]
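
To make the bag-of-words step concrete, here is a minimal standard-library sketch of the idea behind `corpora.Dictionary` and `doc2bow`: map each token to an integer id, then represent each document as `(token_id, count)` pairs. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and the ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical toy documents, already tokenized and stemmed
toy_docs = [["pop", "filter", "great", "pop"], ["filter", "mic"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)  # {'pop': 0, 'filter': 1, 'great': 2, 'mic': 3}
print(toy_bow)    # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1)]]
```

Word order is discarded in this representation; only how often each token occurs in each document is kept.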

## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
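
As a rough illustration of what a coherence score measures, the sketch below computes a simplified UMass-style coherence in plain Python. This is a swapped-in, simpler measure, not the `c_v` measure used in the cells that follow, and the function name `umass_coherence` and the toy documents are assumptions made for illustration. The idea is that a topic is coherent when its top words tend to occur in the same documents.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Simplified UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)).

    top_words is ordered from most to least important; D counts documents.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # pairs with l < m
        w_l, w_m = top_words[l], top_words[m]
        score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


# Hypothetical toy corpus of tokenized documents
toy_documents = [
    ["pedal", "amp", "tone"],
    ["pedal", "amp", "sound"],
    ["string", "guitar", "tone"],
    ["string", "guitar", "tune"],
]

coherent_topic = umass_coherence(["pedal", "amp"], toy_documents)
mixed_topic = umass_coherence(["pedal", "string"], toy_documents)
print(coherent_topic > mixed_topic)  # True: "pedal" and "amp" co-occur, "pedal" and "string" do not
```

Higher values indicate top words that co-occur more often. The `c_v` measure used below is more involved, but it rewards the same kind of co-occurrence.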

In [14]:
# Write your code here
# Run LDA for Different K Values. For each K, fit an LDA model on the BOW corpus.
# Calculate Coherence Score. Use Coherence Model from Gensim to calculate the coherence score for each LDA model.

from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

for i in range(2,11):
    lda_model = LdaModel(corpus=bow, num_topics=i, id2word=dictionary, random_state=42, passes=10, alpha='auto')
    coherence_model = CoherenceModel(model=lda_model, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))


# My results using a range of (2, 11) suggest an optimal K of 6.
# However, the coherence scores look like they might trend back upwards. Let's run this again using a range of (11, 15) to be sure.

Coherence score with 2 clusters: 0.31270568136382726
Coherence score with 3 clusters: 0.3315808413717854
Coherence score with 4 clusters: 0.39518526561266987
Coherence score with 5 clusters: 0.38766384108707896
Coherence score with 6 clusters: 0.4302480419246328
Coherence score with 7 clusters: 0.40868170865460574
Coherence score with 8 clusters: 0.40765306155635816
Coherence score with 9 clusters: 0.40908100751760545
Coherence score with 10 clusters: 0.4203352040020213


In [17]:
# Question 1 cont
for i in range(11,15):
    lda_model = LdaModel(corpus=bow, num_topics=i, id2word=dictionary, random_state=42, passes=10, alpha='auto')
    coherence_model = CoherenceModel(model=lda_model, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

# Again, it's unclear. Let's run this again using a range of (15, 20) to be sure.

Coherence score with 11 clusters: 0.4142583729389068
Coherence score with 12 clusters: 0.4027129302677517
Coherence score with 13 clusters: 0.4111895420526138
Coherence score with 14 clusters: 0.41426464454853085


In [16]:
# Question 1 cont
for i in range(15,20):
    lda_model = LdaModel(corpus=bow, num_topics=i, id2word=dictionary, random_state=42, passes=10, alpha='auto')
    coherence_model = CoherenceModel(model=lda_model, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

# We will select K=6 for a coherence score of ~0.43025.
# Optimal K value: K=6

Coherence score with 15 clusters: 0.3912705220986081
Coherence score with 16 clusters: 0.38785242393446073
Coherence score with 17 clusters: 0.39428691159459706
Coherence score with 18 clusters: 0.4016429899877864
Coherence score with 19 clusters: 0.4228717893003384
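
The selection of K across the three runs above can be sketched in a few lines by collecting the printed scores into one dictionary and taking the maximum. The variable names `lda_coherence_by_k`, `best_lda_k`, and `top_three` are hypothetical; the scores are copied from the outputs above.

```python
# c_v coherence scores copied from the three LDA runs above (K = 2..19)
lda_coherence_by_k = {
    2: 0.31270568136382726, 3: 0.3315808413717854, 4: 0.39518526561266987,
    5: 0.38766384108707896, 6: 0.4302480419246328, 7: 0.40868170865460574,
    8: 0.40765306155635816, 9: 0.40908100751760545, 10: 0.4203352040020213,
    11: 0.4142583729389068, 12: 0.4027129302677517, 13: 0.4111895420526138,
    14: 0.41426464454853085, 15: 0.3912705220986081, 16: 0.38785242393446073,
    17: 0.39428691159459706, 18: 0.4016429899877864, 19: 0.4228717893003384,
}

# K with the highest coherence score
best_lda_k = max(lda_coherence_by_k, key=lda_coherence_by_k.get)

# The three best candidates, to see how close the runners-up are
top_three = sorted(lda_coherence_by_k, key=lda_coherence_by_k.get, reverse=True)[:3]

print(best_lda_k)  # 6
print(top_three)   # [6, 19, 10]
```

Keeping the scores in one structure makes it easier to see that K=19 and K=10 come close to K=6, which is why the smaller, more interpretable K is a reasonable pick.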


In [20]:
# Question 1 cont, Extracting top words using K=6:
# Set up the final LDA model with K=6
lda_model = LdaModel(corpus=bow,
                     id2word=dictionary,
                     num_topics=6,        # Using K=6 as determined by coherence scores
                     random_state=42,     # Ensures reproducibility
                     passes=10,           # Number of passes through the corpus
                     alpha='auto')        # Automatically adjusts topic distribution

# Print the top words for each topic
for idx, topic in lda_model.print_topics(num_words=10):
  print(f"Topic {idx}: {topic}")

Topic 0: 0.034*"pedal" + 0.027*"amp" + 0.026*"sound" + 0.014*"like" + 0.012*"tone" + 0.009*"effect" + 0.009*"us" + 0.008*"plai" + 0.007*"great" + 0.007*"tube"
Topic 1: 0.030*"guitar" + 0.023*"pick" + 0.016*"stand" + 0.016*"strap" + 0.014*"like" + 0.010*"fit" + 0.010*"us" + 0.009*"work" + 0.009*"good" + 0.009*"look"
Topic 2: 0.058*"string" + 0.031*"guitar" + 0.024*"sound" + 0.020*"plai" + 0.011*"great" + 0.011*"tone" + 0.011*"like" + 0.009*"good" + 0.009*"time" + 0.009*"set"
Topic 3: 0.047*"tuner" + 0.041*"capo" + 0.035*"tune" + 0.017*"guitar" + 0.017*"us" + 0.013*"string" + 0.012*"snark" + 0.012*"clip" + 0.012*"work" + 0.011*"fret"
Topic 4: 0.024*"great" + 0.023*"work" + 0.021*"price" + 0.020*"cabl" + 0.019*"good" + 0.015*"qualiti" + 0.013*"product" + 0.011*"bui" + 0.010*"need" + 0.009*"purchas"
Topic 5: 0.021*"record" + 0.013*"mic" + 0.012*"us" + 0.009*"headphon" + 0.008*"work" + 0.008*"sound" + 0.007*"track" + 0.007*"devic" + 0.007*"good" + 0.006*"tascam"


### Question 1 cont:

Topic 0: Likely related to guitar pedals and amplifiers. "Pedal" and "amp" are the top weighted words for Topic 0, followed by words describing what pedals and amplifiers (amps) produce, such as sound, tone, and effect.

Topic 1: Likely related to guitar accessories. Guitar, pick, stand, and strap are the top weighted words for Topic 1, followed by general descriptive words such as fit, work, good, and look.

Topic 2: Likely related to guitar strings. "String" is the top weighted word for Topic 2, followed by descriptive words such as sound, tone, and great.

Topic 3: Likely related to tuning devices and capos. Tuner and capo are the most heavily weighted words, followed by related terms such as tune, snark, clip, and fret.

Topic 4: Likely related to product quality and pricing. Words like work, price, qualiti, and product suggest this topic centers around user satisfaction, build quality, and value for money.

Topic 5: Likely related to recording devices. Record and mic (microphone) are the most heavily weighted terms.

## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [21]:
# Write your code here
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Set a k value range:
k_values = range(2, 11)
coherence_scores = []

# Calculate coherence score for each K in LSA
for k in k_values:
    lsi_model = LsiModel(corpus=bow, id2word=dictionary, num_topics=k)
    coherence_model = CoherenceModel(model=lsi_model, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)
    print(f"Coherence score with {k} topics: {coherence_score}")


Coherence score with 2 topics: 0.4638705176288849
Coherence score with 3 topics: 0.4255173024383379
Coherence score with 4 topics: 0.4433661893353248
Coherence score with 5 topics: 0.39039042820287334
Coherence score with 6 topics: 0.4147966554692966
Coherence score with 7 topics: 0.394001260725729
Coherence score with 8 topics: 0.42509014670798967
Coherence score with 9 topics: 0.3954390436481675
Coherence score with 10 topics: 0.3765926097204083


In [22]:
# Question 2 cont:
# K = 2 seems to be the optimal choice.

lsi_model = LsiModel(corpus=bow, num_topics=2, id2word=dictionary)
# Print the top words for each topic
for idx, topic in lsi_model.print_topics(num_words=10):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.329*"sound" + 0.314*"guitar" + 0.242*"string" + 0.232*"pedal" + 0.217*"amp" + 0.214*"like" + 0.198*"us" + 0.163*"plai" + 0.161*"good" + 0.148*"great"
Topic 1: 0.584*"string" + -0.428*"pedal" + 0.380*"guitar" + -0.312*"amp" + 0.161*"tune" + -0.157*"sound" + -0.114*"tone" + 0.099*"tuner" + -0.090*"effect" + -0.080*"tube"


### Question 2 cont:

Topic 0: Likely related to the "sound" of different "guitar", "strings", "pedals", and "amps" (amplifiers). These are the top 5 weighted words.

Topic 1: This topic has both positive and negative weights, which is common in LSA. The positive weights are aligned with the topic of guitar strings and tuning equipment. The negative weights are aligned with a contrasting theme that appears to be pedals, amps, and sound equipment.
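
The split between positive and negative weights in Topic 1 can be made explicit by parsing the printed topic string. This is a small sketch using only the standard library; the variable names are hypothetical, and the topic string is copied from the LSA output above.

```python
import re

# Topic 1 string copied from the LSA output above
lsa_topic_1 = (
    '0.584*"string" + -0.428*"pedal" + 0.380*"guitar" + -0.312*"amp" + '
    '0.161*"tune" + -0.157*"sound" + -0.114*"tone" + 0.099*"tuner" + '
    '-0.090*"effect" + -0.080*"tube"'
)

# Extract (term, weight) pairs, keeping the sign of each weight
pairs = [
    (term, float(weight))
    for weight, term in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', lsa_topic_1)
]

positive_terms = [term for term, weight in pairs if weight > 0]
negative_terms = [term for term, weight in pairs if weight < 0]

print(positive_terms)  # ['string', 'guitar', 'tune', 'tuner']
print(negative_terms)  # ['pedal', 'amp', 'sound', 'tone', 'effect', 'tube']
```

Read this way, the topic acts as an axis that separates reviews about strings and tuning from reviews about pedals and amps.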

## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [7]:
# Write your code here
# See alternate Question 3

## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [32]:
# Write your code here
!pip install bertopic -q
from bertopic import BERTopic
from gensim.models.coherencemodel import CoherenceModel

# Check for missing values. I used this to troubleshoot an issue that caused me to think some rows might be empty.
print("Initial data shape:", df.shape)
print("Missing values in 'reviewText':", df['reviewText'].isna().sum())

# Drop rows where 'reviewText' is missing or empty
df = df.dropna(subset=['reviewText'])
df = df[df['reviewText'].str.strip() != ""]

print("Data shape after removing empty rows:", df.shape)


Initial data shape: (10254, 3)
Missing values in 'reviewText': 0
Data shape after removing empty rows: (10254, 3)


In [35]:
# Question 4 cont
# Initialize variables to store the best model and coherence score
best_model = None
highest_coherence = 0
best_k = 0

# Text data for BERTopic
texts = df['Text (Clean)'].apply(lambda x: ' '.join(x)).tolist()

# Loop over target topic counts K from 2 to 10, reducing the fitted model to K topics each time
for k in range(2, 11):
    # Initialize BERTopic and fit to data
    topic_model = BERTopic(language="english")
    topics, probs = topic_model.fit_transform(texts)

    # Reduce the number of topics to K
    topic_model.reduce_topics(texts, nr_topics=k)

    # Extract topics as token lists for coherence calculation
    # Only include topics that exist after reduction
    topic_words = [[word for word, _ in topic_model.get_topic(i)]
                   for i in range(topic_model.get_topic_info().shape[0] -1 ) # Exclude -1 topic (outliers)
                   if topic_model.get_topic(i)]

    # Compute coherence score
    coherence_model = CoherenceModel(topics=topic_words, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    print(f"Coherence score with {k} topics: {coherence_score}")

    # Update best model and coherence score if current model is better
    if coherence_score > highest_coherence:
        highest_coherence = coherence_score
        best_model = topic_model
        best_k = k

print(f"\nOptimal K: {best_k} with Coherence Score: {highest_coherence}")

Coherence score with 2 topics: 0.34680413540095634
Coherence score with 3 topics: 0.33431669291235155
Coherence score with 4 topics: 0.4038861730296926
Coherence score with 5 topics: 0.37185104383243717
Coherence score with 6 topics: 0.4434509201699048
Coherence score with 7 topics: 0.4216606486881167
Coherence score with 8 topics: 0.4786719734452172
Coherence score with 9 topics: 0.46062612141383835
Coherence score with 10 topics: 0.4408593928293564

Optimal K: 8 with Coherence Score: 0.4786719734452172


In [36]:
# Question 4 cont:
# Display the topic information of the best model
topic_info = best_model.get_topic_info()
print(topic_info)

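# Display top words for each topic in the best model.
# Iterate over the topic ids that actually exist after reduction, skipping the outlier topic (-1).
for topic_id in topic_info['Topic']:
    if topic_id == -1:
        continue
    print(f"Topic {topic_id}: {best_model.get_topic(topic_id)}")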

   Topic  Count                          Name  \
0     -1   2186       -1_guitar_sound_us_work   
1      0   7204   0_guitar_string_sound_pedal   
2      1    353      1_stand_guitar_hold_fold   
3      2    324        2_work_good_price_capo   
4      3     89  3_light_batteri_music_bright   
5      4     50  4_filter_pop_clamp_microphon   
6      5     29     5_bench_seat_chair_adjust   
7      6     19     6_rack_screw_washer_mount   

                                      Representation  \
0  [guitar, sound, us, work, great, mic, like, go...   
1  [guitar, string, sound, pedal, like, us, great...   
2  [stand, guitar, hold, fold, sturdi, music, us,...   
3  [work, good, price, capo, qualiti, product, gr...   
4  [light, batteri, music, bright, power, adapt, ...   
5  [filter, pop, clamp, microphon, mic, screen, w...   
6  [bench, seat, chair, adjust, height, comfort, ...   
7  [rack, screw, washer, mount, hole, raxxess, ra...   

                                 Representative_Docs 

### Question 4 cont:

Topic 0: Most likely related to the sound quality of guitars, strings, and pedals, with terms like "guitar," "string," "sound," and "pedal" being the most heavily weighted terms.

Topic 1: Most likely related to the stability of guitar stands, with terms like "stand," "hold," "fold," and "sturdi" being the most heavily weighted terms.

Topic 2: Most likely related to the quality and value of guitar capos, with terms like "work," "price," "capo," and "qualiti" being the most heavily weighted terms.

Topic 3: Most likely related to lighting equipment, with terms like "light," "batteri," "music," and "bright" being the most heavily weighted terms.

Topic 4: Most likely related to microphone components, with terms like "filter," "pop," "clamp," and "microphon" being the most heavily weighted terms.

Topic 5: Most likely related to seating, with terms like "bench," "seat," "chair," and "adjust" being the most heavily weighted terms.

Topic 6: Most likely related to rack mounting and associated hardware, with terms like "rack," "screw," "washer," and "mount" being the most heavily weighted terms.

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each of the visualizations in detail.

In [37]:
# Write your code here

# Visualization 1: a bar chart of the top words per topic.
# Call .show() explicitly, because the explanation string below is the last expression in this cell.
best_model.visualize_barchart().show()
# Explanation of the visualization:
"""
Each chart below shows the most important words for each topic in the dataset.
Each box represents a different topic, with bars indicating the top words that define it.
The longer the bar, the more important the word is for that topic.

Topic 0 (e.g., "guitar," "string," "sound") focuses on the sound quality of guitars, strings, and pedals.
Topic 1 (e.g., "stand," "hold," "fold") focuses on the stability of guitar stands.
Topic 2 (e.g., "work," "price," "capo") focuses on the quality and value of guitar capos.
Topic 3 (e.g., "light," "batteri," "music") is about lighting equipment.
Topic 4 (e.g., "filter," "pop," "clamp") is about microphone components.
Topic 5 (e.g., "bench," "seat," "chair") is about seating equipment.
Topic 6 (e.g., "rack," "screw," "washer") is about rack mounting and associated hardware.
Each topic’s words provide a quick snapshot of what that topic is about, while each chart helps us
visualize the importance of each term, making it easier to understand the main themes in the data.
"""

In [38]:
# Question 3 (Alternative) cont:
# Visualization 2 is an intertopic distance map, which provides a visual overview of how topics relate to each other.
best_model.visualize_topics()

# The first thing to note is the size of the circles. Topic 0 dominates the dataset with 7204 documents; the other topics range from 19 to 353.
# The next thing to note is the distance between circles. The closer two circles are, the more similar the topics.
# This type of visualization helps you get a sense of how the themes in the dataset are structured.
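
To put numbers on how strongly one topic dominates, the document counts from the `get_topic_info()` output above can be turned into shares of the corpus. This is a minimal sketch; the names `topic_counts` and `topic_shares` are hypothetical, and the counts are copied from that output.

```python
# Document counts per topic, copied from the get_topic_info() output (-1 is the outlier topic)
topic_counts = {-1: 2186, 0: 7204, 1: 353, 2: 324, 3: 89, 4: 50, 5: 29, 6: 19}

total_docs = sum(topic_counts.values())
topic_shares = {topic: count / total_docs for topic, count in topic_counts.items()}

for topic, share in topic_shares.items():
    print(f"Topic {topic}: {share:.1%}")

print(total_docs)  # 10254
```

Topic 0 holds roughly 70% of the reviews and the outlier topic roughly 21%, so the remaining six topics together describe under a tenth of the corpus.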

In [39]:
# Question 3 (Alternative) cont:
# Generate and display a topic similarity heatmap (Matrix)
best_model.visualize_heatmap()

# The similarity heatmap displays the relationships between topics based on their pairwise similarity scores.
# The darker the blue, the more similar the two corresponding topics are. Light green or nearly white indicates a low similarity score.
# Note: the diagonal has a perfect similarity score of 1 because each topic is compared with itself, so the diagonal can be ignored.
# The upper-right and lower-left triangles are mirror images of one another, because similarity is symmetric.
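
The two properties noted above (a diagonal of 1 and a mirrored matrix) follow directly from how a pairwise similarity matrix is built. The sketch below assumes the heatmap is based on cosine similarity between topic vectors; the toy vectors and the helper `cosine_similarity` are hypothetical and are not taken from the fitted model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic vectors: the first two point in similar directions, the third does not
toy_topic_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
]

similarity = [
    [cosine_similarity(u, v) for v in toy_topic_vectors]
    for u in toy_topic_vectors
]

for row in similarity:
    print([round(value, 3) for value in row])
```

Each diagonal entry compares a vector with itself, and swapping the two arguments does not change the result, which is why only one triangle of the heatmap carries information.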

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? You should explain the reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [10]:
# Write your code here
"""
After comparing LSA, LDA, and BERTopic, I found that each has its strengths, but
BERTopic stands out as the best choice for generating meaningful topics in complex
text data. BERTopic uses BERT embeddings, allowing it to capture the context and
meaning of words more effectively than LSA and LDA. This results in more coherent
and semantically rich topics, especially useful for datasets where word meanings
vary depending on context. It also doesn’t require setting a fixed number of
topics, making it more flexible and adaptable.

LDA, on the other hand, is a probabilistic model that works well for structured
analysis on larger datasets. It provides clear document-to-topic assignments,
which can be helpful when you need to know the main topic of each document. However,
LDA requires setting a fixed number of topics and fine-tuning parameters, which can
be time-consuming and less adaptable to datasets with varying themes.

LSA is the fastest and most computationally efficient method, making it good for
quick, exploratory analyses. However, it lacks the depth and interpretability of
LDA and BERTopic, especially in handling word meanings across contexts. Overall,
BERTopic is the most powerful choice here: it reached the highest c_v coherence
(about 0.479 at K=8, versus about 0.464 for LSA at K=2 and about 0.430 for LDA at
K=6), and it offers flexibility and advanced visualization options.
"""

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [11]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
This wasn't a topic that excited me initially. However, after watching Ms. Fengjiao's demonstration, hearing her explanation
for different topic modeling techniques, and seeing her visualizations, I started to understand the value of topic modeling.
I found myself becoming more interested.

Learning Experience:
Overall, I learned a lot about topic modeling and code to implement some of the techniques. I really enjoyed the visualizations.
I can see how a corporation might be able to use this to better understand customer feedback in a matter of minutes instead of
paying people to read all the reviews and try to make sense of them.

Challenges encountered:
I tried to use code as demonstrated in Ms. Fengjiao's demonstration where I could. I struggled most with lda2vec. This was my
first attempt at topic modeling. lda2vec is a technically advanced method that I plan to learn after becoming familiar and comfortable
with the other three. I read some articles on lda2vec, but I couldn't get my code to work. I realized I needed to dedicate more time
to learning this.

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?
I can't tell if the question is asking how this relates to my field of study or NLP. It obviously relates to NLP, so I'll explain for my field of study:

The US Army uses surveys to improve performance in many domains. These surveys provide insight to senior leaders, but they can be laborious
and take a lot of man-hours. Topic modeling could help senior leaders understand responses to mass surveys regarding soldier quality of life,
recruiting challenges, modernization friction points, and many other topics.

'''
