# NLP Models Used in Product Review Analysis

## 1. BART for Extractive Summarization
- **Purpose**: Used to summarize reviews by extracting the most important parts.
- **How It Works**: It employs a technique called "extractive summarization," analyzing text and selecting the sentences that best represent the content of the review.
- **Advantages**:
  - Fast and reliable for summarizing large volumes of text.
  - Provides accurate summaries that directly reflect the content of the reviews.
- **Drawbacks**: May lack the ability to produce more creative text or convey nuanced emotions deeply.

## 2. T5 for Blog-Style Summaries
- **Purpose**: Used to create summaries in a blog-style format, making the text more engaging and easier to read.
- **How It Works**: It relies on a "generative" approach, creating new text based on the input. It uses previous examples to guide its output.
- **Advantages**:
  - Highly flexible in producing human-like text.
  - Can incorporate narrative elements, making summaries more appealing to readers.
- **Drawbacks**: May require more training data to achieve excellent performance and can be slower than BART in some cases.

## 3. GPT-4 for Generative Understanding
- **Purpose**: Used to enhance understanding and storytelling from raw reviews, with the ability to generate new text.
- **How It Works**: Based on a very complex neural network architecture, it can analyze context deeply and understand sentiments and strengths/weaknesses in reviews.
- **Advantages**:
  - Excellent ability to produce marketing-style and creative texts.
  - Can understand context better, facilitating rich, informative summaries.
- **Drawbacks**: Requires high resources to operate and may be more costly than other models.

## Key Differences
- **Summarization Style**: 
  - BART focuses on extracting important sentences, while T5 and GPT-4 focus on generating new text.

- **Creativity**:
  - T5 and GPT-4 produce more creative and engaging texts, while BART provides accurate but less creative summaries.

- **Speed and Performance**:
  - BART is generally faster at summarization, while T5 and GPT-4 may take more time to generate texts.

- **Applications**:
  - BART is preferred for quickly summarizing reviews, while T5 and GPT-4 are better suited for cases requiring more engaging and complex content creation.

In [3]:
import pandas as pd
from transformers import pipeline

# 1. Load data
df = pd.read_csv("/Users/aleph/Desktop/Project4/updated_categories_with_clusters.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Select the necessary columns
df = df[['primaryCategories', 'name', 'reviews.rating', 'reviews.text']].dropna()
df = df[df['reviews.rating'].astype(str).str.isnumeric()]
df['reviews.rating'] = df['reviews.rating'].astype(float)

# Summarization model (BERT)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Allow user to specify the category
category = input("Please enter a category (e.g., Electronics, Health & Beauty, etc.): ")

# Filter the DataFrame by the selected category
cat_df = df[df['primaryCategories'] == category]

# Product analysis in this category
grouped = cat_df.groupby('name').agg({
    'reviews.rating': 'mean',
    'reviews.text': lambda x: list(x)
}).reset_index()

# Best 3 products and worst product
top3 = grouped.sort_values(by='reviews.rating', ascending=False).head(3)
worst = grouped.sort_values(by='reviews.rating', ascending=True).head(1)

# Summarization function
def summarize_reviews(reviews, prompt="Summarize:"):
    combined = " ".join(reviews[:10])[:1024]  # Summarization of length review
    text = f"{prompt} {combined}"
    input_length = len(combined.split())
    if input_length < 20:
        return "Text too short to summarize effectively."
    max_length = min(120, max(30, input_length // 2))  # Adjust max_length dynamically
    return summarizer(text, max_length=max_length, min_length=7, do_sample=False)[0]['summary_text']

# Summary of the top products
top_summaries = []
for _, row in top3.iterrows():
    name = row['name']
    reviews = row['reviews.text']
    summary = summarize_reviews(reviews, prompt=f"What are customers saying about {name}?")
    complaints = summarize_reviews([r for r in reviews if "not" in r or "bad" in r or "problem" in r],
                                    prompt=f"What are the top complaints about {name}?")
    top_summaries.append((name, summary, complaints))

# Worst product
worst_name = worst.iloc[0]['name']
worst_reviews = worst.iloc[0]['reviews.text']
worst_summary = summarize_reviews(worst_reviews, prompt=f"Why should customers avoid {worst_name}?")

# Print the report
print(f"\nBlog-Style Article for Category: {category}\n")
print("Top 3 Products & Differences:\n")
for i, (name, summary, complaints) in enumerate(top_summaries, 1):
    print(f"{i}.  {name}\n")
    print(f" Summary:\n{summary}\n")
    print(f" Top Complaints:\n{complaints}\n")
    print("-" * 60)

print(f"\n Worst Product: {worst_name}")
print(f" Why it should be avoided:\n{worst_summary}")


Device set to use cpu



📝 Blog-Style Article for Category: Electronics

🔥 Top 3 Products & Differences:

1. 📦 Certified Refurbished Amazon Echo

✅ Summary:
This is my second Echo. The first one I purchased at Home Depot for 139. I knocked it off my nightstand a few times with no damage. The last time it fell, it completely died.

😡 Top Complaints:
The new version does not give you much of a choice for colors and you can get a skin for the 1st generation version. The speaker is great, fills the room with sound.

------------------------------------------------------------
2. 📦 Fire TV Stick Streaming Media Player Pair Kit

✅ Summary:
The mirror cast function didn't seem to work well displaying video while it was steaming to a Windows 10 laptop. With many apps to download and many apps that allow you free use of services such as streaming as long as you have a Cable provided, you can log in and stream.

😡 Top Complaints:
With many apps to download and many apps that allow you free use of services such as strea

summarization T5 model 

In [7]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load data
df = pd.read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Select the necessary columns
df = df[['primaryCategories', 'name', 'reviews.rating', 'reviews.text']].dropna()
df = df[df['reviews.rating'].astype(str).str.isnumeric()]
df['reviews.rating'] = df['reviews.rating'].astype(float)

# Load T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Function to summarize the reviews with few-shot prompt
def create_prompt(product, reviews):
    example = """You are a product analyst. Write a blog-style product summary using customer reviews.
Example:
Product: Fire HD 8
Reviews:
- Fast and easy to use
- My kids love it
- Excellent screen for the price
Summary:
A perfect tablet for families. It’s fast, easy to use, and loved by kids. Great value for money.
---
Now summarize this:
"""
    review_lines = "\n- " + "\n- ".join(reviews)
    prompt = example + f"Product: {product}\nReviews:{review_lines}\nOutput:"
    return prompt

# Function to generate the summary using T5 model
def generate_summary(prompt):
    inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, max_length=200, min_length=50, do_sample=False)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Allow user to specify the category
category = input("Please enter a category (e.g., Electronics, Health & Beauty, etc.): ")

# Filter the DataFrame by the selected category
cat_df = df[df['primaryCategories'] == category]

# Product analysis in this category
grouped = cat_df.groupby('name').agg({
    'reviews.rating': 'mean',
    'reviews.text': lambda x: list(x)
}).reset_index()

# Best 3 products and worst product
top3 = grouped.sort_values(by='reviews.rating', ascending=False).head(3)
worst = grouped.sort_values(by='reviews.rating', ascending=True).head(1)

# Summary of the top products
top_summaries = []
for _, row in top3.iterrows():
    name = row['name']
    reviews = row['reviews.text']
    prompt = create_prompt(name, reviews)
    summary = generate_summary(prompt)
    top_summaries.append((name, summary))

# Worst product
worst_name = worst.iloc[0]['name']
worst_reviews = worst.iloc[0]['reviews.text']
worst_prompt = create_prompt(worst_name, worst_reviews)
worst_summary = generate_summary(worst_prompt)

# Print the report
print(f"\n📝 Blog-Style Article for Category: {category}\n")
print("🔥 Top 3 Products & Differences:\n")
for i, (name, summary) in enumerate(top_summaries, 1):
    print(f"{i}. 📦 {name}\n")
    print(f"✅ Summary:\n{summary}\n")
    print("-" * 60)

print(f"\n🚫 Worst Product: {worst_name}")
print(f"❌ Why it should be avoided:\n{worst_summary}")


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]


📝 Blog-Style Article for Category: Electronics

🔥 Top 3 Products & Differences:

1. 📦 Certified Refurbished Amazon Echo

✅ Summary:
Amazon Echo is a great first step towards a smart home but there are more accessories required to connect the house together . if you live outside of the us you may find some of the features limited at the moment . if you live outside of the us you may find some of the features limited at the moment .

------------------------------------------------------------
2. 📦 Fire TV Stick Streaming Media Player Pair Kit

✅ Summary:
a fire TV stick - a 'perfect tablet for families' - works well like Alexia . a 'smart' tablet is a great value for money . a 'smart' tablet is a great way to watch movies and watch videos .

------------------------------------------------------------
3. 📦 AmazonBasics 16-Gauge Speaker Wire - 100 Feet

✅ Summary:
write a blog-style product summary using customer reviews . example: Fire HD 8 is fast and easy to use - my kids love it - E

summarization GPT-4 model 

In [1]:
import pandas as pd
from transformers import pipeline

# 1. Load data
df = pd.read_csv("/Users/aleph/Desktop/Project4/updated_categories_with_clusters.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Select the necessary columns
df = df[['primaryCategories', 'name', 'reviews.rating', 'reviews.text']].dropna()
df = df[df['reviews.rating'].astype(str).str.isnumeric()]
df['reviews.rating'] = df['reviews.rating'].astype(float)

# Summarization model (BERT)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Allow user to specify the category
category = input("Please enter a category (e.g., Electronics, Health & Beauty, etc.): ")

# Filter the DataFrame by the selected category
cat_df = df[df['primaryCategories'] == category]

# Product analysis in this category
grouped = cat_df.groupby('name').agg({
    'reviews.rating': 'mean',
    'reviews.text': lambda x: list(x)
}).reset_index()

# Best 3 products and worst product
top3 = grouped.sort_values(by='reviews.rating', ascending=False).head(3)
worst = grouped.sort_values(by='reviews.rating', ascending=True).head(1)

# Summarization function
def summarize_reviews(reviews, prompt="Summarize:"):
    combined = " ".join(reviews[:10])[:1024]  # Summarization of length review
    text = f"{prompt} {combined}"
    input_length = len(combined.split())
    if input_length < 20:
        return "Text too short to summarize effectively."
    max_length = min(120, max(30, input_length // 2))  # Adjust max_length dynamically
    return summarizer(text, max_length=max_length, min_length=7, do_sample=False)[0]['summary_text']

# Summary of the top products
top_summaries = []
for _, row in top3.iterrows():
    name = row['name']
    reviews = row['reviews.text']
    summary = summarize_reviews(reviews, prompt=f"What are customers saying about {name}?")
    complaints = summarize_reviews([r for r in reviews if "not" in r or "bad" in r or "problem" in r],
                                    prompt=f"What are the top complaints about {name}?")
    top_summaries.append((name, summary, complaints))

# Worst product
if not worst.empty:
    worst_name = worst.iloc[0]['name']
    worst_reviews = worst.iloc[0]['reviews.text']
    worst_summary = summarize_reviews(worst_reviews, prompt=f"Why should customers avoid {worst_name}?")
else:
    worst_name = "No products found"
    worst_summary = "No reviews available for the worst product."

# Print the report
print(f"\n📝 Blog-Style Article for Category: {category}\n")
print("🔥 Top 3 Products & Differences:\n")
for i, (name, summary, complaints) in enumerate(top_summaries, 1):
    print(f"{i}. 📦 {name}\n")
    print(f"✅ Summary:\n{summary}\n")
    print(f"😡 Top Complaints:\n{complaints}\n")
    print("-" * 60)

print(f"\n🚫 Worst Product: {worst_name}")
print(f"❌ Why it should be avoided:\n{worst_summary}")


2025-04-11 19:57:04.081480: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Device set to use cpu



📝 Blog-Style Article for Category: Electronics

🔥 Top 3 Products & Differences:

1. 📦 Certified Refurbished Amazon Echo

✅ Summary:
This is my second Echo. The first one I purchased at Home Depot for 139. I knocked it off my nightstand a few times with no damage. The last time it fell, it completely died.

😡 Top Complaints:
The new version does not give you much of a choice for colors and you can get a skin for the 1st generation version. The speaker is great, fills the room with sound.

------------------------------------------------------------
2. 📦 Fire TV Stick Streaming Media Player Pair Kit

✅ Summary:
The mirror cast function didn't seem to work well displaying video while it was steaming to a Windows 10 laptop. With many apps to download and many apps that allow you free use of services such as streaming as long as you have a Cable provided, you can log in and stream.

😡 Top Complaints:
With many apps to download and many apps that allow you free use of services such as strea

In [5]:
# Print all unique categories available in the dataset
print(df['primaryCategories'].unique())

['Health & Beauty' 'Electronics' 'Office Supplies'
 'Animals & Pet Supplies' 'Home & Garden' 'Electronics,Furniture'
 'Toys & Games,Electronics' 'Electronics,Media'
 'Office Supplies,Electronics']
