## **Part 3 - Review Summarization Using Generative AI**
Objective: Summarize reviews into articles that recommend the top products for each category.

Task: Create a model that generates a short article (like a blog post) for each product category. The output should include:

- Top 3 products and key differences between them.
- Top complaints for each of those products.
- Worst product in the category and why it should be avoided.

In [None]:
# Pin the pre-1.0 SDK: the code below calls openai.ChatCompletion, which was removed in openai 1.0
!pip install openai==0.28

#### Imports.

In [19]:
import os
import re
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import openai
from dotenv import load_dotenv, find_dotenv
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)



#### Load the Dataset


In [None]:
# Load the dataset
df = pd.read_csv(r".\Dataset\full_reviews_with_clusters.csv")

# Display basic info and the first few rows
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59046 entries, 0 to 59045
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            59046 non-null  object 
 1   cluster_name    59046 non-null  object 
 2   review          59046 non-null  object 
 3   reviews.rating  59046 non-null  float64
 4   sentiment       59046 non-null  object 
dtypes: float64(1), object(4)
memory usage: 2.3+ MB

Unnamed: 0,name,cluster_name,review,reviews.rating,sentiment
0,All-New Kindle Oasis E-reader - 7 High-Resolut...,E-Readers & Kindle Accessories,Great device for reading. Definately pricey.. ...,5.0,Positive
1,All-New Kindle Oasis E-reader - 7 High-Resolut...,E-Readers & Kindle Accessories,"Excellent Kindle. The best Kindle ever, for me...",5.0,Positive
2,All-New Kindle Oasis E-reader - 7 High-Resolut...,E-Readers & Kindle Accessories,Love it. I absolutely love this reader. The bi...,4.0,Positive
3,All-New Kindle Oasis E-reader - 7 High-Resolut...,E-Readers & Kindle Accessories,Good kindle. I always use it when i read ebook...,5.0,Positive
4,All-New Kindle Oasis E-reader - 7 High-Resolut...,E-Readers & Kindle Accessories,So mu

Let's strip colors, storage sizes, generation labels, and marketing phrases from the product names so that variants of the same product are not counted as separate products.


In [29]:
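def clean_product_name(name):
    """Normalize a product name so that variants of one product share the same key."""
    name = name.lower()
    # Color words
    name = re.sub(r'\b(black|blue|pink|silver|marine|tangerine|magenta|white|red|yellow|green|purple|gold|gray|grey|orange|rose|charcoal|graphite|plum|teal|lavender|coral)\b', '', name)
    # Marketing phrases
    name = re.sub(r'\b(all-new|includes special offers|includes offers|with special offers|with alexa)\b', '', name)
    # Storage sizes such as "16 gb"
    name = re.sub(r'\b\d+\s?gb\b', '', name)
    # Generation labels such as "7th generation"
    name = re.sub(r'\b(5th|6th|7th|8th|9th|10th|11th|12th)\s?gen(eration)?\b', '', name)
    # "wifi" / "wi-fi" and "display"
    name = re.sub(r'\bwi[-]?fi\b|\bdisplay\b', '', name)
    # Punctuation other than hyphens and double quotes
    name = re.sub(r'[^\w\s\-"]+', '', name)
    # A repeated "tablet" left behind by the removals above
    name = re.sub(r'(tablet)[^a-zA-Z]*(tablet)', r'\1', name)
    # Runs of whitespace
    name = re.sub(r'\s+', ' ', name)

    return name.strip()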

df['clean_name'] = df['name'].apply(clean_product_name)
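
To see why this normalization matters, the following minimal sketch applies a reduced version of the same idea to two listings that differ only in color and storage size. The shortened color list, the helper name `normalize_name`, and the example listings are assumptions made for illustration; the notebook's `clean_product_name` covers more cases.

```python
import re

# Reduced color list for illustration; the notebook's function covers many more.
COLORS = r"\b(black|blue|pink|silver)\b"

def normalize_name(name: str) -> str:
    """Lowercase a product name and strip color and storage-size tokens."""
    name = name.lower()
    name = re.sub(COLORS, "", name)           # drop color words
    name = re.sub(r"\b\d+\s?gb\b", "", name)  # drop storage sizes such as "16 gb"
    name = re.sub(r"[^\w\s\-]+", "", name)    # drop punctuation except hyphens
    return re.sub(r"\s+", " ", name).strip()  # collapse whitespace

# Hypothetical listings: the same product in two variants.
variants = [
    "Fire Tablet, 7 Display, 8 GB, Black",
    "Fire Tablet, 7 Display, 16 GB, Blue",
]
normalized = {normalize_name(v) for v in variants}
print(normalized)
```

Both variants collapse to a single key, so their reviews are pooled when ratings are averaged per product.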


#### Analyze top 3 products per category.

In [30]:
def get_top_3_products(df, min_reviews=50):
    """
    Returns the top 3 highest-rated products per category, with a minimum number of reviews.
    """
    # Count reviews and average ratings per product
    product_stats = df.groupby(['cluster_name', 'clean_name']).agg(
        review_count=('review', 'count'),
        average_rating=('reviews.rating', 'mean')
    ).reset_index()

    # Only keep products with enough reviews
    filtered = product_stats[product_stats['review_count'] >= min_reviews]

    # Sort by category, then by highest rating and most reviews
    sorted_products = filtered.sort_values(
        by=['cluster_name', 'average_rating', 'review_count'],
        ascending=[True, False, False]
    )

    # Get top 3 per category
    top_3 = sorted_products.groupby('cluster_name').head(3)

    return top_3

top_3 = get_top_3_products(df)
top_3.head(10)

Unnamed: 0,cluster_name,clean_name,review_count,average_rating
3,Batteries & Accesories mix (AmazonBasics),amazonbasics aa performance alkaline batteries...,3519,4.425973
4,Batteries & Accesories mix (AmazonBasics),amazonbasics aaa performance alkaline batterie...,7754,4.411014
29,E-Readers & Kindle Accessories,amazon kindle paperwhite - ebook reader - - 6 ...,3176,4.755038
21,E-Readers & Kindle Accessories,amazon 9w powerfast official oem usb charger a...,61,4.737705
49,E-Readers & Kindle Accessories,kindle voyage e-reader 6 high-resolution 300 p...,1085,4.735484
61,Smart Devices & Streaming,amazon fire hd 10 tablet special offers - alum...,128,4.773438
53,Smart Devices & Streaming,amazon - echo plus w built-in hub -,590,4.749153
52,Smart Devices & Streaming,amazon - amazon tap portable bluetooth and spe...,318,4.72956
88,Tablets,amazon fire hd 8 8in tablet b018szt3bk 2016 an...,135,4.740741
93,Tablets,fire hd 8 kids edition tablet 8 hd kid-proof case,526,4.636882


#### Worst Products by Category

In [25]:
def get_worst_product(df, min_reviews=50):
    """
    Returns the lowest-rated product per category, with a minimum number of reviews.
    """
    # Count reviews and average ratings per product
    product_stats = df.groupby(['cluster_name', 'clean_name']).agg(
        review_count=('review', 'count'),
        average_rating=('reviews.rating', 'mean')
    ).reset_index()

    # Only keep products with enough reviews
    filtered = product_stats[product_stats['review_count'] >= min_reviews]

    # Sort by category, then by lowest rating and most reviews
    sorted_products = filtered.sort_values(
        by=['cluster_name', 'average_rating', 'review_count'],
        ascending=[True, True, False]
    )

    # Get worst product per category
    worst = sorted_products.groupby('cluster_name').head(1)

    return worst


worst_products = get_worst_product(df)
worst_products.head()

Unnamed: 0,cluster_name,clean_name,review_count,average_rating
4,Batteries & Accesories mix (AmazonBasics),amazonbasics aaa performance alkaline batterie...,7754,4.411014
25,E-Readers & Kindle Accessories,"amazon kindle e-reader 6"" 2016",96,4.40625
54,Smart Devices & Streaming,amazon 5w usb official oem charger and power a...,207,4.458937
97,Tablets,fire tablet 7 -,14095,4.468606
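
With `min_reviews=50`, only two battery products qualify in the output above, so the AAA batteries appear both in the top list and as the worst product of that category. A small check like the following sketch could flag such overlaps before the prompt is built. The helper `find_overlaps` and the abbreviated names are hypothetical and only illustrate the idea.

```python
def find_overlaps(top_by_category, worst_by_category):
    """Return {category: product} where the 'worst' product is also in the top list."""
    overlaps = {}
    for category, worst in worst_by_category.items():
        if worst in top_by_category.get(category, []):
            overlaps[category] = worst
    return overlaps

# Names abbreviated from the outputs above. In the notebook these dicts could be
# built with top_3.groupby("cluster_name")["clean_name"].apply(list).to_dict()
# and worst_products.set_index("cluster_name")["clean_name"].to_dict().
top_by_category = {
    "Batteries": ["aa alkaline batteries", "aaa alkaline batteries"],
    "Tablets": ["fire hd 8", "fire hd 8 kids edition"],
}
worst_by_category = {
    "Batteries": "aaa alkaline batteries",
    "Tablets": "fire tablet 7",
}
overlaps = find_overlaps(top_by_category, worst_by_category)
print(overlaps)
```

Categories reported here have too few qualifying products for a meaningful "worst product" paragraph at the chosen review threshold.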


#### Prompt Engineering


We use GPT-4 through the OpenAI API. It is well suited to:
- Human-like writing
- Summarization, comparisons, and blog-style content
- Following detailed prompts

In [10]:
def generate_with_gpt4(prompt, model="gpt-4", max_tokens=1000, temperature=0.7):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful product reviewer."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature
    )
    return response['choices'][0]['message']['content'].strip()
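
The call above relies on `openai.ChatCompletion`, which exists only in the pre-1.0 SDK (hence the `openai==0.28` pin). For readers on `openai>=1.0`, the same request might look like the sketch below. The function name `generate_with_client` and the stand-in client are assumptions for illustration; the stand-in only mimics the response shape so the example runs without an API key.

```python
from types import SimpleNamespace

def generate_with_client(client, prompt, model="gpt-4", max_tokens=1000, temperature=0.7):
    """Same request as generate_with_gpt4, written against the openai>=1.0 client interface."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful product reviewer."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content.strip()

# With the real SDK (openai>=1.0) the client would be created as:
#     from openai import OpenAI
#     client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in client for offline runs: returns a canned reply.
def _fake_create(**kwargs):
    user_prompt = kwargs["messages"][1]["content"]
    reply = SimpleNamespace(content="  stub article for: " + user_prompt + "  ")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
result = generate_with_client(fake_client, "Tablets")
print(result)
```

Passing the client in as an argument also makes the function easy to exercise without network access.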

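Next, we define the function that produces the article for one category.

The `summarize_category()` function builds the complete prompt and sends it to the language model. It takes the selected product category, the top 3 products, and the worst product from that category, appends up to roughly 4,000 characters of reviews per product (with the default `max_tokens_per_product=1000`), and returns the generated article.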

In [30]:
def summarize_category(category, df, top_3, worst_products, max_tokens_per_product=1000):
    # Start with the base instruction prompt
    prompt = f"""
You are a professional product review writer creating a blog-style article for a tech-savvy audience (like The Verge or Wirecutter).

Your task is to:
- Compare the **top 3 products in the Amazon category "{category}"** based on real customer feedback
- Include a short, informative paragraph for each product:
    - What customers liked
    - What customers complained about
    - What makes it unique
- Add 2 bullet points at the end of each paragraph:
    - **Pros**
    - **Cons**
- Finish with a final paragraph about the **worst-rated product in this category**, including:
    - Why it scored lower
    - What users complained about
    - Why readers should consider avoiding it

Write in a clear, helpful, and slightly conversational tone. Format the article using markdown with product names as headings.

Here are the reviews:
"""

    review_text = ""

    # Get top 3 product names
    top_names = top_3[top_3['cluster_name'] == category]['clean_name'].tolist()
    worst_name = worst_products[worst_products['cluster_name'] == category]['clean_name'].values[0]

    for product in top_names + [worst_name]:
        product_reviews = df[
            (df['cluster_name'] == category) & (df['clean_name'] == product)
        ]['review'].tolist()

        if product_reviews:
            # Use most common original name
            display_name = df[
                (df['clean_name'] == product) & (df['cluster_name'] == category)
            ]['name'].value_counts().idxmax()
        else:
            display_name = product.title()

        # Character-based cap (about 4 characters per token), so the token budget is only approximate
        joined = " ".join(product_reviews)[:max_tokens_per_product * 4]
        review_text += f"\n\n## {display_name}\n\n{joined.strip()}\n"

    full_prompt = prompt + review_text.strip()
    response = generate_with_gpt4(full_prompt)
    return response
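
The slice `[:max_tokens_per_product * 4]` budgets characters rather than tokens, using the rough rule of thumb of about four characters per English token, and it can cut the last review mid-word. A minimal sketch of an alternative that stops at a review boundary follows; `take_reviews_within_budget` is a hypothetical helper, not part of the notebook.

```python
def take_reviews_within_budget(reviews, max_chars):
    """Join whole reviews until adding the next one would exceed max_chars."""
    selected = []
    used = 0
    for review in reviews:
        text = review.strip()
        if not text:
            continue  # skip empty reviews
        cost = len(text) + (1 if selected else 0)  # +1 for the joining space
        if used + cost > max_chars:
            break
        selected.append(text)
        used += cost
    return " ".join(selected)

reviews = ["Great screen.", "Battery drains fast.", "Love it."]
joined = take_reviews_within_budget(reviews, max_chars=35)
print(joined)
```

In `summarize_category`, the call would take `product_reviews` and `max_tokens_per_product * 4` as arguments in place of the join-and-slice line.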


In [31]:
top_3 = get_top_3_products(df)
worst_products = get_worst_product(df)

category = "Tablets"
summary = summarize_category(category, df, top_3, worst_products)

print(summary)

# Amazon Fire HD 8 8in Tablet 16gb Black B018szt3bk 6th Gen (2016) Android

Many customers praised this tablet for its high-quality reading experience and ease of use, particularly when borrowing books from the library. Users appreciated its lightweight design and convenient side-press buttons for effortless page-turning. However, some customers criticized the automatic system update, wishing that it could be done at their convenience. The Kindle Voyage, as this model is also known, has been described as an improvement on the Paperwhite, offering a seamless and enjoyable reading experience, especially for serious readers.

- **Pros**
    - Excellent for reading in various light conditions
    - Lightweight and easy to hold
- **Cons**
    - Automatic system update can be inconvenient
    - May be a bit pricey for some

# All-New Fire HD 8 Kids Edition Tablet, 8 HD Display, 32 GB, Pink Kid-Proof Case

On the positive side, the All-New Fire HD 8 Kids Edition has been lauded for its kid-fr

#### Summarize All Categories

In [32]:
def summarize_all_categories(df, top_3, worst_products, max_tokens_per_product=1000):
    """
    Summarizes all product categories using GPT-4.
    Returns a dictionary with category names as keys and summaries as values.
    """
    summaries = {}
    categories = df['cluster_name'].unique()

    for category in categories:
        print(f" Summarizing category: {category}")
        try:
            summary = summarize_category(category, df, top_3, worst_products, max_tokens_per_product)
            summaries[category] = summary
        except Exception as e:
            print(f" Error summarizing {category}: {e}")

    return summaries
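
The loop above prints errors and moves on, so a failed category is simply missing from the result. One possible variation, sketched here under the assumption that failed categories should be rerun later, returns the errors alongside the summaries. `summarize_many` and `fake_summarize` are hypothetical names; the fake summarizer stands in for the API-backed call.

```python
def summarize_many(categories, summarize_fn):
    """Run summarize_fn per category; return (summaries, errors) dictionaries."""
    summaries, errors = {}, {}
    for category in categories:
        try:
            summaries[category] = summarize_fn(category)
        except Exception as exc:  # keep going, but remember what failed
            errors[category] = str(exc)
    return summaries, errors

# Hypothetical stand-in for the API-backed summarizer, failing for one category.
def fake_summarize(category):
    if category == "Tablets":
        raise RuntimeError("rate limit reached")
    return f"Article about {category}"

summaries, errors = summarize_many(["Tablets", "E-Readers"], fake_summarize)
print(summaries)
print(errors)
```

In the notebook, `summarize_fn` could be `lambda c: summarize_category(c, df, top_3, worst_products)`, and the keys of `errors` give the categories to retry.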


In [33]:
top_3 = get_top_3_products(df)
worst_products = get_worst_product(df)

all_summaries = summarize_all_categories(df, top_3, worst_products)

 Summarizing category: E-Readers & Kindle Accessories
 Summarizing category: Smart Devices & Streaming
 Summarizing category: Batteries & Accesories mix (AmazonBasics)
 Summarizing category: Tablets


#### Save Each Summary to a File

In [35]:
for category, summary in all_summaries.items():
    filename = category.lower().replace(" ", "_") + "_summary.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(summary)
    print(f"✅ Saved {filename}")

✅ Saved e-readers_&_kindle_accessories_summary.md
✅ Saved smart_devices_&_streaming_summary.md
✅ Saved batteries_&_accesories_mix_(amazonbasics)_summary.md
✅ Saved tablets_summary.md
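
The generated file names keep characters such as `&`, `(`, and `)`, which are awkward in shells and URLs. A minimal sketch of a stricter naming helper is shown below; `category_to_filename` is a hypothetical function, and replacing every non-alphanumeric run with a single underscore is just one reasonable convention.

```python
import re

def category_to_filename(category, suffix="_summary.md"):
    """Build a filesystem-friendly file name from a category label."""
    slug = category.lower()
    slug = re.sub(r"[^a-z0-9]+", "_", slug)  # any run of other characters -> "_"
    return slug.strip("_") + suffix

for name in ["Tablets", "Batteries & Accesories mix (AmazonBasics)"]:
    print(category_to_filename(name))
```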
