<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/Topic_Modeling_with_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with LLM
We will read the first 100 rows of the file `Reviews.csv` and conduct topic modeling with a transformer-based model loaded through the Hugging Face Transformers `pipeline`.

We will take a slightly different approach to topic modeling. Instead of discovering topics from the data, we score each review against a list of pre-chosen topics using zero-shot classification:

["Fit", "Comfort", "Material", "Quality", "Price", "Style", "Color", "Size", "Shipping", "Customer Service"].
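
As a rough sketch of how zero-shot classification uses these topics: the pipeline turns each candidate label into a natural-language hypothesis and asks the NLI model whether the review entails it. The template below is the pipeline's default (`"This example is {}."`); the variable names are illustrative only.

```python
# Candidate topics used throughout this notebook
candidate_labels = ["Fit", "Comfort", "Material", "Quality", "Price",
                    "Style", "Color", "Size", "Shipping", "Customer Service"]

# Default template of the zero-shot pipeline; it can be overridden
# by passing hypothesis_template=... when calling the pipeline
hypothesis_template = "This example is {}."

# One hypothesis per label; each is paired with the review text as the premise
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

for hypothesis in hypotheses[:3]:
    print(hypothesis)
```

A template closer to the domain, for example `"This review is about {}."`, may suit short labels like these better, although that is an untested suggestion here.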

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/Reviews.csv', nrows=100)

# Build a mask that is True for rows whose 'Review' value is not NaN
mask = df['Review'].notnull()

# Keep only the rows that have review text
df = df[mask]

# Display the updated DataFrame (Optional)
display(df.head())

## Data preparation
Extract the review text from the `Review` column of the DataFrame `df` into a list called `review_texts`.



In [None]:
review_texts = df['Review'].astype(str).tolist()

## Data wrangling
Clean and preprocess the review texts by removing punctuation and special characters, converting to lowercase, and removing stop words.



In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already present
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

preprocessed_reviews = []
for review in review_texts:
    # Clean the text
    review = re.sub(r'[^\w\s]', '', str(review))  # Remove punctuation and special characters

    # Convert to lowercase
    review = review.lower()

    # Remove stop words
    words = review.split()
    filtered_words = [word for word in words if word not in stop_words]

    preprocessed_reviews.append(" ".join(filtered_words))
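
To make the effect of these three steps concrete, here is a minimal sketch that applies them to a single sentence. The sample review and the tiny stop-word set are made up for illustration; NLTK's English list is much longer.

```python
import re

# Small illustrative stop-word set (an assumption; not NLTK's full list)
sample_stop_words = {"the", "is", "a", "and", "but", "it", "was", "too"}

def preprocess(text, stop_words):
    """Apply the same three steps as the cell above to one string."""
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation and special characters
    text = text.lower()                   # normalize case
    return " ".join(w for w in text.split() if w not in stop_words)

sample_review = "The fabric is soft, but it was too small!"
cleaned = preprocess(sample_review, sample_stop_words)
print(cleaned)  # fabric soft small
```

Note that NLTK's English stop-word list includes words such as "not", so this step can strip negation from a review; that is worth keeping in mind when reading the scores.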

## Model training
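Perform topic modeling using the Hugging Face Transformers zero-shot classification pipeline. No training takes place here: the pretrained `facebook/bart-large-mnli` model is used as-is to score each review against the candidate topics.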


In [None]:
from transformers import pipeline

# Create a zero-shot classification pipeline (the model is downloaded on first use)
topic_model = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# Define candidate topics
candidate_labels = ["Fit", "Comfort", "Material", "Quality", "Price", "Style", "Color", "Size", "Shipping", "Customer Service"]

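# Score every review against every candidate topic
results = topic_model(preprocessed_reviews, candidate_labels)

# Store results in a list of dictionaries, one per review.
# The pipeline returns labels sorted by score, so re-key them in
# candidate_labels order to keep topic order (and plot colors) consistent.
topic_modeling_results = []
for i, result in enumerate(results):
    scores_by_label = dict(zip(result['labels'], result['scores']))
    topic_probabilities = {label: scores_by_label[label] for label in candidate_labels}
    topic_modeling_results.append({"review_index": i, **topic_probabilities})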

# Display the first few results (optional)
print(topic_modeling_results[:5])
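
The scores for a single review sum to 1 because, in the pipeline's default single-label mode, a softmax is taken over the entailment logits of the candidate labels. The sketch below reproduces that normalization with made-up logits for three labels; the numbers are not model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits for one review (illustrative values only)
entailment_logits = {"Fit": 2.0, "Price": 0.5, "Color": -1.0}

scores = dict(zip(entailment_logits, softmax(list(entailment_logits.values()))))
print(scores)
print(sum(scores.values()))
```

Because of this normalization the labels compete with each other: a review that mentions both fit and price splits its probability mass between them. Passing `multi_label=True` to the pipeline call scores each label independently instead.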

## Data visualization
Visualize the topic modeling results.


In [None]:
import matplotlib.pyplot as plt

# Assuming topic_modeling_results is a list of dictionaries as produced by the previous code block.

# Individual review visualizations
for i, result in enumerate(topic_modeling_results):
    plt.figure(figsize=(10, 6))  # Adjust figure size as needed
    # Separate the topic scores from the 'review_index' bookkeeping field
    scores = {k: v for k, v in result.items() if k != 'review_index'}
    topics = list(scores.keys())
    probabilities = list(scores.values())

    plt.bar(topics, probabilities, color=['skyblue', 'orange', 'green', 'red', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan'])
    plt.xlabel("Topics")
    plt.ylabel("Probability Score")
    plt.title(f"Topic Probabilities for Review {i}")
    plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
    plt.tight_layout()
    plt.savefig(f"review_{i}_topic_probabilities.png")  # Save each plot
    plt.close()  # Close the plot to free up memory


# Summary visualization (average probabilities across all reviews)
# Collect topic names by key rather than by position, skipping 'review_index'
topic_names = [k for k in topic_modeling_results[0] if k != 'review_index']
average_probabilities = {
    topic: sum(result[topic] for result in topic_modeling_results) / len(topic_modeling_results)
    for topic in topic_names
}

plt.figure(figsize=(12, 6))
plt.bar(average_probabilities.keys(), average_probabilities.values(), color=['skyblue', 'orange', 'green', 'red', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan'])
plt.xlabel("Topics")
plt.ylabel("Average Probability Score")
plt.title("Average Topic Probabilities Across All Reviews")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig("average_topic_probabilities.png")
plt.show()


## Summary

The goal was to identify the main topics discussed in the customer reviews. The analysis assigns each review a probability for each of the ten pre-defined topics, and the average of those probabilities across reviews indicates which topics are discussed most.
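
One simple way to turn the per-review probabilities into an answer is to take the highest-scoring topic for each review and count how often each topic wins. The sketch below assumes rows shaped like `topic_modeling_results`; the three sample rows and the helper name `dominant_topic` are hypothetical.

```python
from collections import Counter

# Hypothetical rows in the same shape as topic_modeling_results
sample_results = [
    {"review_index": 0, "Fit": 0.55, "Price": 0.25, "Color": 0.20},
    {"review_index": 1, "Fit": 0.10, "Price": 0.15, "Color": 0.75},
    {"review_index": 2, "Fit": 0.60, "Price": 0.30, "Color": 0.10},
]

def dominant_topic(row):
    """Return (topic, score) for the highest-scoring topic in one row."""
    scores = {k: v for k, v in row.items() if k != "review_index"}
    topic = max(scores, key=scores.get)
    return topic, scores[topic]

# Map each review to its top topic, then tally the winners
dominant = {row["review_index"]: dominant_topic(row) for row in sample_results}
topic_counts = Counter(topic for topic, _ in dominant.values())

print(dominant)
print(topic_counts.most_common())
```

Counting winners complements the average-probability chart: averages show how probability mass is spread overall, while the tally shows which topic leads in the most reviews.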

