## **Part - 3 Review Summarization Using Generative AI**
Objective: Summarize reviews into articles that recommend the top products for each category.

Task: Create a model that generates a short article (like a blog post) for each product category. The output should include:

- Top 3 products and key differences between them.
- Top complaints for each of those products.
- Worst product in the category and why it should be avoided.

#### Imports.

In [1]:
import os
import re
import json
import torch
import openai
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from dotenv import load_dotenv, find_dotenv
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline, BitsAndBytesConfig, AutoModelForCausalLM



In [2]:
load_dotenv(find_dotenv())
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = openai.OpenAI(api_key=OPENAI_API_KEY)

#### Load the Dataset


In [3]:
# Load the dataset
df = pd.read_csv("../Dataset/full_reviews_with_clusters.csv")

# Display basic info and the first few rows
df.info(), df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59046 entries, 0 to 59045
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            59046 non-null  object 
 1   cluster_name    59046 non-null  object 
 2   review          59046 non-null  object 
 3   reviews.rating  59046 non-null  float64
 4   sentiment       59046 non-null  object 
dtypes: float64(1), object(4)
memory usage: 2.3+ MB


(None,
                                                 name  \
 0  All-New Kindle Oasis E-reader - 7 High-Resolut...   
 1  All-New Kindle Oasis E-reader - 7 High-Resolut...   
 2  All-New Kindle Oasis E-reader - 7 High-Resolut...   
 3  All-New Kindle Oasis E-reader - 7 High-Resolut...   
 4  All-New Kindle Oasis E-reader - 7 High-Resolut...   
 
                      cluster_name  \
 0  E-Readers & Kindle Accessories   
 1  E-Readers & Kindle Accessories   
 2  E-Readers & Kindle Accessories   
 3  E-Readers & Kindle Accessories   
 4  E-Readers & Kindle Accessories   
 
                                               review  reviews.rating sentiment  
 0  Great device for reading. Definately pricey.. ...             5.0  Positive  
 1  Excellent Kindle. The best Kindle ever, for me...             5.0  Positive  
 2  Love it. I absolutely love this reader. The bi...             4.0  Positive  
 3  Good kindle. I always use it when i read ebook...             5.0  Positive  
 4  So mu

#### Clean the product names so that variants of the same product are not counted as separate products.


In [4]:
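def clean_product_name(name):
    """Normalize a product name so that color, storage, and generation variants map to one product."""

    # Lowercase first. Note: the patterns below run on the lowercased string, so
    # capitalized alternatives such as "Aluminum" or " Special Offers" never match.
    name = name.lower()
    # Remove color words
    name = re.sub(r'\b(black|blue|pink|silver|marine|tangerine|magenta|white|red|yellow|green|purple|gold|gray|grey|orange|rose|charcoal|graphite|plum|teal|Aluminum| lavender|coral)\b', '', name)
    # Remove marketing phrases
    name = re.sub(r'\b(all-new|includes special offers|includes offers| Special Offers| with special offers|with alexa)\b', '', name)
    # Remove storage sizes such as "16 gb"
    name = re.sub(r'\b\d+\s?gb\b', '', name)
    # Remove generation markers such as "7th generation"
    name = re.sub(r'\b(5th|6th|7th|8th|9th|10th|11th|12th)\s?gen(eration)?\b', '', name)
    # Remove "wifi"/"wi-fi" and "display"
    name = re.sub(r'\bwi[-]?fi\b|\bdisplay\b', '', name)
    # Strip punctuation except hyphens and double quotes
    name = re.sub(r'[^\w\s\-"]+', '', name)
    # Collapse a repeated "tablet ... tablet" into a single "tablet"
    name = re.sub(r'(tablet)[^a-zA-Z]*(tablet)', r'\1', name)
    # Collapse runs of whitespace
    name = re.sub(r'\s+', ' ', name)

    return name.strip()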

df['clean_name'] = df['name'].apply(clean_product_name)
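
Because the name is lowercased before the substitutions run, any alternative written with capital letters in the patterns cannot match. The sketch below shows one way to keep the token lists and the matching consistent. The helper `strip_tokens` and the shortened token lists are hypothetical and chosen only for illustration; this is not the cleaning function used to produce the results in this notebook.

```python
import re

# Hypothetical, shortened token lists (assumption: the real lists are longer).
COLORS = ["black", "silver", "aluminum", "lavender"]
MARKETING = ["all-new", "includes special offers", "with special offers", "special offers"]


def _alternation(tokens):
    # Longest tokens first, so "includes special offers" wins over "special offers".
    ordered = sorted(tokens, key=len, reverse=True)
    return "|".join(re.escape(token) for token in ordered)


# IGNORECASE keeps the pattern robust even if a token is written with capitals.
TOKEN_RE = re.compile(r"\b(?:" + _alternation(COLORS + MARKETING) + r")\b", flags=re.IGNORECASE)


def strip_tokens(name: str) -> str:
    """Lowercase a name and remove color and marketing tokens."""
    name = TOKEN_RE.sub("", name.lower())
    name = re.sub(r"[^\w\s\-\"]+", "", name)  # drop punctuation except - and "
    name = re.sub(r"\s+", " ", name)          # collapse whitespace
    return name.strip(" -")                   # trim leftover spaces and hyphens


print(strip_tokens("All-New Fire HD 10 Tablet, Special Offers - Silver Aluminum"))
```

Note that switching the notebook to a variant like this would change `clean_name` values and therefore the rankings shown in the following cells.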


#### Analyze the top 3 products per category, requiring a minimum of 50 reviews per product and lowering that minimum when necessary so that enough products are included.
    

In [5]:
def get_top_n_products_flexible(df, top_n=3, min_reviews=50, min_allowed=5):
    """Return the top_n best-rated products per category, lowering the review-count threshold in steps of 5 when too few products qualify."""

    result = []

    for category in df['cluster_name'].unique():
        df_cat = df[df['cluster_name'] == category]

        # Review count and mean rating per cleaned product name
        product_stats = df_cat.groupby('clean_name').agg(
            review_count=('review', 'count'),
            avg_rating=('reviews.rating', 'mean')
        ).reset_index()

        # Start at min_reviews and lower the threshold until top_n products qualify
        current_threshold = min_reviews
        while current_threshold >= min_allowed:
            filtered = product_stats[product_stats['review_count'] >= current_threshold]
            if len(filtered) >= top_n:
                break
            current_threshold -= 5

        # Rank by rating, then by review count; copy so the assignment below does not write to a slice
        top_products = filtered.sort_values(
            by=['avg_rating', 'review_count'], ascending=[False, False]
        ).head(top_n).copy()

        top_products['cluster_name'] = category  # reattach category label
        result.append(top_products)

    return pd.concat(result, ignore_index=True)

top_3 = get_top_n_products_flexible(df)
top_3.head(12)


| | clean_name | review_count | avg_rating | cluster_name |
|---|---|---|---|---|
| 0 | amazon kindle paperwhite - ebook reader - - 6 ... | 3176 | 4.755038 | E-Readers & Kindle Accessories |
| 1 | amazon 9w powerfast official oem usb charger a... | 61 | 4.737705 | E-Readers & Kindle Accessories |
| 2 | kindle voyage e-reader 6 high-resolution 300 p... | 1085 | 4.735484 | E-Readers & Kindle Accessories |
| 3 | amazon fire hd 10 tablet special offers - alum... | 128 | 4.773438 | Smart Devices & Streaming |
| 4 | amazon - echo plus w built-in hub - | 590 | 4.749153 | Smart Devices & Streaming |
| 5 | amazon - amazon tap portable bluetooth and spe... | 318 | 4.72956 | Smart Devices & Streaming |
| 6 | amazonbasics aa performance alkaline batteries... | 3519 | 4.425973 | Batteries & Accesories mix (AmazonBasics) |
| 7 | amazonbasics aaa performance alkaline batterie... | 7754 | 4.411014 | Batteries & Accesories mix (AmazonBasics) |
| 8 | amazonbasics backpack for laptops up to 17-inches | 25 | 4.16 | Batteries & Accesories mix (AmazonBasics) |
| 9 | amazon fire hd 8 8in tablet b018szt3bk 2016 an... | 135 | 4.740741 | Tablets |
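
The threshold relaxation is easier to follow in isolation. The sketch below runs the same idea on a toy table; the helper name `relax_threshold` and the product names are made up for illustration, while the review counts mirror the batteries category, where the backpack with 25 reviews only qualifies once the threshold has dropped from 50 to 25.

```python
import pandas as pd


def relax_threshold(stats, top_n=3, min_reviews=50, min_allowed=5, step=5):
    """Return (threshold, rows) for the highest threshold that keeps top_n products."""
    threshold = min_reviews
    filtered = stats[stats["review_count"] >= threshold]
    # Lower the threshold step by step until enough products qualify
    # or the floor (min_allowed) is reached.
    while len(filtered) < top_n and threshold - step >= min_allowed:
        threshold -= step
        filtered = stats[stats["review_count"] >= threshold]
    return threshold, filtered


# Toy statistics (hypothetical product names).
toy_stats = pd.DataFrame({
    "clean_name": ["a", "b", "c", "d"],
    "review_count": [3519, 7754, 25, 4],
    "avg_rating": [4.43, 4.41, 4.16, 5.0],
})

threshold, kept = relax_threshold(toy_stats)
print(threshold, kept["clean_name"].tolist())
```

Product `d` has a perfect rating but only 4 reviews, so it never qualifies: the floor of 5 reviews keeps products with very little evidence out of the ranking.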


#### Worst product by category. We apply the same review-count threshold logic as before.

In [6]:
def get_worst_product_flexible(df, min_reviews=50, min_allowed=5):
    """Return the lowest-rated product per category, lowering the review-count threshold in steps of 5 when no product qualifies."""

    results = []

    for category in df['cluster_name'].unique():
        df_cat = df[df['cluster_name'] == category]

        # Review count and mean rating per cleaned product name
        product_stats = df_cat.groupby('clean_name').agg(
            review_count=('review', 'count'),
            avg_rating=('reviews.rating', 'mean')
        ).reset_index()

        # Start at min_reviews and lower the threshold until at least one product qualifies
        current_threshold = min_reviews
        while current_threshold >= min_allowed:
            filtered = product_stats[product_stats['review_count'] >= current_threshold]
            if not filtered.empty:
                break
            current_threshold -= 5

        if filtered.empty:
            continue  # no valid products in this category

        # Lowest rating first; ties go to the product with more reviews
        worst_product = filtered.sort_values(
            by=['avg_rating', 'review_count'], ascending=[True, False]
        ).head(1).copy()

        worst_product['cluster_name'] = category  # ensure label is present
        results.append(worst_product)

    return pd.concat(results, ignore_index=True)

worst_products = get_worst_product_flexible(df)
worst_products.head()

| | clean_name | review_count | avg_rating | cluster_name |
|---|---|---|---|---|
| 0 | amazon kindle e-reader 6" 2016 | 96 | 4.40625 | E-Readers & Kindle Accessories |
| 1 | amazon 5w usb official oem charger and power a... | 207 | 4.458937 | Smart Devices & Streaming |
| 2 | amazonbasics aaa performance alkaline batterie... | 7754 | 4.411014 | Batteries & Accesories mix (AmazonBasics) |
| 3 | fire tablet 7 - | 12363 | 4.462347 | Tablets |
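
In the batteries category, the AAA batteries appear both among the top 3 and as the worst product. This can happen when a category has very few products that pass the review threshold. A quick way to detect such overlaps is an inner merge of the two result tables. The sketch below uses small made-up tables and a hypothetical helper `find_overlap`; on the real data the call would be `find_overlap(top_3, worst_products)`.

```python
import pandas as pd

# Toy stand-ins for the top_3 and worst_products tables (hypothetical values).
top_toy = pd.DataFrame({
    "cluster_name": ["Batteries", "Batteries", "Tablets"],
    "clean_name": ["aa batteries", "aaa batteries", "fire hd 8"],
})
worst_toy = pd.DataFrame({
    "cluster_name": ["Batteries", "Tablets"],
    "clean_name": ["aaa batteries", "fire tablet 7"],
})


def find_overlap(top, worst):
    """Return products that are listed as both top-rated and worst-rated."""
    return top.merge(worst, on=["cluster_name", "clean_name"], how="inner")


overlap = find_overlap(top_toy, worst_toy)
print(overlap)
```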


#### Prompt Engineering


We use GPT-3.5 through the OpenAI API. It is a high-quality model for:
- Human-like writing
- Summarization, comparisons, and blog-style content
- Following detailed prompts

In [7]:
def call_gpt_3_5(system_prompt, user_prompt, model="gpt-3.5-turbo", temperature=0.7, max_tokens=800):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"OpenAI API Error: {e}")
        return None
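
Since `call_gpt_3_5` returns `None` on any API error, a transient failure (for example a rate limit) silently skips a category. One option is to wrap the call in a small retry helper with exponential backoff. The sketch below is a generic illustration: `call_with_retries`, its parameters, and the delay values are assumptions, not part of the OpenAI client.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-None value or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    The sleep function is injectable so the helper can be tested without waiting.
    """
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a stand-in function that fails twice, then succeeds.
attempts = []

def flaky_call():
    attempts.append(1)
    return "summary text" if len(attempts) == 3 else None

print(call_with_retries(flaky_call, base_delay=0.01))
```

With the function defined above, a call could look like `call_with_retries(lambda: call_gpt_3_5(system_prompt, user_prompt))`.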

Next, we write the summarization function.

The `summarize_category` function builds the complete text prompt, sends it to the language model, and returns the generated article. It takes the selected product category, the full review DataFrame, the top 3 products, and the worst product from that category.

In [8]:
def summarize_category(category, df, top_3, worst_products, max_tokens_per_product=1000):
    """
    Generates a blog-style product summary for a given category using GPT-3.5.

    Parameters:
    - category: name of the cluster/category
    - df: full DataFrame with reviews
    - top_3: DataFrame of top 3 products per category
    - worst_products: DataFrame with worst-rated product per category
    - max_tokens_per_product: approximate token budget per product review block (converted to characters at ~4 characters per token)

    Returns:
    - GPT-generated summary as markdown-formatted string
    """

    # Instruction prompt
    instruction_prompt = f"""
You are a professional product review writer creating a blog-style article for a tech-savvy audience (like The Verge or Wirecutter).

Your task is to:
- Compare the **top 3 Amazon products in the "{category}" category** based on real customer feedback
- Include a short, informative paragraph for each product:
    - What customers liked
    - What customers complained about
    - What makes it unique
- Add 2 bullet points at the end of each paragraph:
    - **Pros**
    - **Cons**
- Finish with a final paragraph about the **worst-rated product in the "{category}" category**, including:
    - Why it scored lower
    - What users complained about
    - Why readers should consider avoiding it

Write in a clear, helpful, and slightly conversational tone. Format the article using markdown with product names as headings.

Here are the reviews:
"""

    
    review_text = ""
    top_names = top_3[top_3['cluster_name'] == category]['clean_name'].tolist()
    worst_name_row = worst_products[worst_products['cluster_name'] == category]

    if worst_name_row.empty:
        print(f"[!] No worst-rated product found for category '{category}'")
        return None

    worst_name = worst_name_row['clean_name'].values[0]

    for product in top_names + [worst_name]:
        product_reviews = df[
            (df['cluster_name'] == category) &
            (df['clean_name'] == product)
        ]['review'].dropna().tolist()

        # Build a display name from the cleaned product name (hyphens to spaces, title case)
        display_name = product.replace("-", " ").title()

        # Join and truncate review text
        combined_reviews = " ".join(product_reviews)
        truncated = combined_reviews[:max_tokens_per_product * 4]  # ~4 characters per token
        review_text += f"\n\n## {display_name}\n\n{truncated.strip()}\n"

    # Combine prompt + review block
    full_prompt = instruction_prompt + review_text.strip()

    return call_gpt_3_5(
        system_prompt="You are a helpful product review summarizer.",
        user_prompt=full_prompt,
        max_tokens=800
    )
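
The character slice in `summarize_category` can cut the last review in the middle of a sentence. An alternative is to keep only whole reviews that fit the budget. The sketch below assumes the same rough heuristic of 4 characters per token; the helper name `take_reviews_within_budget` is hypothetical.

```python
def take_reviews_within_budget(reviews, max_tokens=1000, chars_per_token=4):
    """Join whole reviews with spaces until the character budget is used up.

    The budget is max_tokens * chars_per_token characters, which is only a
    rough approximation of the real token count.
    """
    budget = max_tokens * chars_per_token
    kept, used = [], 0
    for review in reviews:
        # Each review after the first also costs one joining space.
        cost = len(review) + (1 if kept else 0)
        if used + cost > budget:
            break
        kept.append(review)
        used += cost
    return " ".join(kept)


sample = ["Great screen.", "Battery lasts for weeks.", "Charger not included."]
print(take_reviews_within_budget(sample, max_tokens=10, chars_per_token=4))
```

A trade-off of this design is that a first review longer than the whole budget yields an empty string, so a fallback to the plain slice may be needed for very long reviews.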



#### Summarize All Categories

In [9]:
summary = summarize_category(
    category="E-Readers & Kindle Accessories",
    df=df,
    top_3=top_3,
    worst_products=worst_products
)

print(summary)


# Amazon Kindle Paperwhite Ebook Reader

**What customers liked:** Customers appreciated the lightweight design, adjustable backlight for easy reading in various environments, and the distraction-free ebook reading experience. Many found it great for reading books, manuals, and other materials, with easy setup and cool features like Airplane Mode.

**What customers complained about:** Some users expressed disappointment about the lack of a faster charging option and the absence of a cord with the plug, assuming it was included. A few customers also wished for a standard inclusion of the wall charger with the Kindle purchase.

**What makes it unique:** The Kindle Paperwhite stands out for its clear display, lightweight design, and focus on providing an optimal reading experience without distractions.

- **Pros**
  - Lightweight and easy to operate
  - Adjustable backlight for night reading
- **Cons**
  - No faster charging option
  - Wall charger not always included with purchase

# Ama

In [10]:
summary = summarize_category(
    category="Smart Devices & Streaming",
    df=df,
    top_3=top_3,
    worst_products=worst_products
)

print(summary)

# Amazon Fire Hd 10 Tablet Special Offers - Aluminum

The Amazon Fire HD 10 Tablet with Special Offers in Aluminum has garnered positive feedback for its fun and entertaining features, making it a great addition to any household. Users appreciate its music-playing capabilities and hands-free commands, especially with the Echo integration. However, some users found the app selection limited and noted that Alexa requires precise wording for queries. One standout feature is its seamless integration with Amazon Prime services.

**Pros**
- Fun and entertaining features
- Hands-free commands

**Cons**
- Limited app selection
- Alexa requires precise wording

# Amazon Echo Plus with Built-In Hub

The Amazon Echo Plus with Built-In Hub has received praise for its advanced features, with users enjoying its ability to control smart home devices effortlessly. Customers particularly like the sound quality and compatibility with various smart home accessories like Philips Hue bulbs. However, some u

In [11]:
summary = summarize_category(
    category="Batteries & Accesories mix (AmazonBasics)",
    df=df,
    top_3=top_3,
    worst_products=worst_products
)

print(summary)

# AmazonBasics AA Performance Alkaline Batteries 48 Count

Customers liked the great value these batteries offer, providing plenty of power at an affordable price point. The quality is comparable to name brands, and users appreciate the cost savings. However, some noted that these batteries may not last as long as higher-priced alternatives. One unique aspect is the extensive range of battery sizes available from AmazonBasics, catering to various needs.

**Pros**
- Affordable price
- Good quality compared to name brands

**Cons**
- May not last as long as premium brands

# AmazonBasics AAA Performance Alkaline Batteries 36 Count

Users praised the excellent price point of these AAA batteries and found them to be a great value for powering devices with lower power drain. While they work well for everyday use, some customers complained that these batteries do not last as long as other premium brands. The packaging was noted to be less than convenient, with difficulties in opening and han

In [12]:
summary = summarize_category(
    category="Tablets",
    df=df,
    top_3=top_3,
    worst_products=worst_products
)

print(summary)

## Amazon Fire Hd 8 8In Tablet B018Szt3Bk 2016 Android

The Amazon Fire HD 8 8-inch tablet received high praise for its fantastic reading experience in dark conditions and lightweight design that makes it easy to hold. Customers appreciated the presence of side buttons for easier page-turning compared to touchscreen-only devices. The Kindle Voyage is lauded for its compactness and user-friendly interface, making it a great e-reader upgrade. However, some users were dissatisfied with the automatic system updates enforced by Amazon without user consent. Despite this, the Kindle Voyage remains a top choice for serious readers.

**Pros**
- Fantastic reading experience in dark conditions
- Lightweight and compact design

**Cons**
- Automatic system updates without user consent



## Fire Hd 8 Kids Edition Tablet 8 Hd Kid Proof Case

Parents were delighted with the durability and warranty of the Fire HD 8 Kids Edition tablet, allowing young children to use it without worry. The extended batt

#### Save All Summaries

In [13]:
output_dir = "summaries_clean"
os.makedirs(output_dir, exist_ok=True)

summary_results = {}

for category in df["cluster_name"].unique():
    print(f"📄 Summarizing: {category}")
    summary = summarize_category(
        category=category,
        df=df,
        top_3=top_3,
        worst_products=worst_products
    )

    if summary:
        filename = category.lower().replace(" ", "_").replace("&", "and").replace("(", "").replace(")", "") + "_summary.md"
        filepath = os.path.join(output_dir, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(summary)
        summary_results[category] = filepath
        print(f"✅ Saved to {filepath}")
    else:
        print(f"⚠️ Skipped {category}")

📄 Summarizing: E-Readers & Kindle Accessories
✅ Saved to summaries_clean\e-readers_and_kindle_accessories_summary.md
📄 Summarizing: Smart Devices & Streaming
✅ Saved to summaries_clean\smart_devices_and_streaming_summary.md
📄 Summarizing: Batteries & Accesories mix (AmazonBasics)
✅ Saved to summaries_clean\batteries_and_accesories_mix_amazonbasics_summary.md
📄 Summarizing: Tablets
✅ Saved to summaries_clean\tablets_summary.md
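
The chained `replace` calls used for the filename only handle the characters that occur in the current category names. A regex-based slug is a more general option. The sketch below uses a hypothetical helper `category_to_filename` and produces the same filenames as above for the four categories in this dataset.

```python
import re


def category_to_filename(category, suffix="_summary.md"):
    """Turn a category label into a filesystem-friendly filename."""
    slug = category.lower().replace("&", "and")
    # Replace every run of characters other than a-z, 0-9, and hyphen with "_".
    slug = re.sub(r"[^a-z0-9\-]+", "_", slug)
    return slug.strip("_") + suffix


for label in ["E-Readers & Kindle Accessories", "Batteries & Accesories mix (AmazonBasics)"]:
    print(category_to_filename(label))
```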
