<a href="https://colab.research.google.com/github/KJanzon/project-nlp-business-case-automated-customers-reviews/blob/main/review_summarisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
pip install python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0


In [44]:
!cp /content/drive/MyDrive/reviews-project/.env .
!cp /content/drive/MyDrive/reviews-project/.gitignore .

In [3]:

def load_and_clean(file_path):
    cols = [
        "id",
        "name",
        "brand",
        "reviews.text",
        "reviews.title",
        "reviews.rating",
        "mainCategory"
    ]

    df = pd.read_csv(file_path, usecols=cols)

    # Drop nulls
    df = df.dropna(subset=["reviews.text", "reviews.rating"])

    # Map ratings to sentiment
    def map_sentiment(rating):
        if rating <= 3:
            return "negative"
        else:
            return "positive"

    df["sentiment"] = df["reviews.rating"].apply(map_sentiment)

    # Drop exact duplicates if any
    df = df.drop_duplicates(subset=["id", "reviews.text"])

    return df

In [2]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')


df = load_and_clean("/content/drive/MyDrive/reviews-project/products_with_mainCategory.csv")


Mounted at /content/drive


In [4]:
print(df.shape)

(29771, 8)


Cleaning product name so that I don't compare the same product with different colours or other very minor variations.


In [20]:
import re


def clean_product_name(name):
    # Remove common variant patterns: color, GB, special offers, etc.
    name = name.lower()
    name = re.sub(r"\b\d{1,3}\s?gb\b", "", name)                    # Remove '16 GB' or '32GB'
    name = re.sub(r"\bwith special offers\b", "", name)
    name = re.sub(r"\bblack|blue|pink|magenta|tangerine|white\b", "", name)
    name = re.sub(r"\s+", " ", name)                                # Remove extra spaces
    return name.strip()


In [21]:
#filter products to have at least 5 reviews
min_reviews = 5
product_reviews = (
    df.groupby(['mainCategory', 'name'])
    .filter(lambda x: len(x) >= min_reviews)
)


In [22]:
#per product; review count, average rating, string with all reviwes for the product

summary_df = (
    product_reviews.groupby(['mainCategory', 'name'])
    .agg(
        review_count=('reviews.text', 'count'),
        avg_rating=('reviews.rating', 'mean'),
        all_reviews=('reviews.text', lambda x: ' '.join(x))
    )
    .reset_index()
)

summary_df['all_reviews'] = summary_df['all_reviews'].apply(lambda x: x[:2000] if isinstance(x, str) else "")

summary_df["base_name"] = summary_df["name"].apply(clean_product_name)


In [23]:
#top 3 and worst product
def get_top_and_worst(df, min_reviews=10):
    df = df.copy()
    df = df[df['review_count'] >= min_reviews]

    if len(df) < 4:
        return None, None

    # Add base_name column if not already there
    if 'base_name' not in df.columns:
        df['base_name'] = df['name'].apply(clean_product_name)

    # Group by base_name and take the best-rated item per group
    distinct = df.sort_values('avg_rating', ascending=False)
    distinct = distinct.drop_duplicates(subset='base_name', keep='first')

    # Top 3 distinct products
    top = distinct.head(3)

    # Worst product (doesn't need to be distinct, just lowest-rated)
    worst = df.sort_values('avg_rating', ascending=True).head(1)

    return top, worst




## Chatgpt 3.5 Model

In [8]:
pip install openai




In [9]:
pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.2 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.4/1.2 MB[0m [31m10.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [45]:
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load from .env file
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)


In [25]:
def build_gpt_prompt_from_reviews(category, top_df, worst_df, max_tokens_per_product=1000):
    prompt = f"""
You are a professional product review writer creating a blog-style article for a tech-savvy audience (like The Verge or Wirecutter).

Your task is to:
- Compare the **top 3 Amazon {category.lower()}s** based on real customer feedback
- Include a short, informative paragraph for each product:
    - What customers liked ✅
    - What customers complained about ❌
    - What makes it unique
- Add 2 bullet points at the end of each paragraph:
    - **Pros**
    - **Cons**
- Finish with a final paragraph about the **worst-rated {category.lower()}**, including:
    - Why it scored lower
    - What users complained about
    - Why readers should consider avoiding it

Write in a clear, helpful, and slightly conversational tone. Format the article using markdown with product names as headings.

Here are the reviews:
"""

    # Add top 3 products
    for i, row in top_df.iterrows():
        name = row['name']
        rating = row['avg_rating']
        reviews = safe_truncate(row['all_reviews'], max_tokens=max_tokens_per_product)
        prompt += f"\n### {i+1}. {name} (Rating: {rating:.2f})\n{reviews}\n"

    # Add worst product
    worst_row = worst_df.iloc[0]
    name = worst_row['name']
    rating = worst_row['avg_rating']
    reviews = safe_truncate(worst_row['all_reviews'], max_tokens=max_tokens_per_product)
    prompt += f"\n### Worst Product: {name} (Rating: {rating:.2f})\n{reviews}\n"

    return prompt


In [26]:

import tiktoken

def safe_truncate(text: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    tokens = tokens[:max_tokens]
    return enc.decode(tokens)



In [27]:
from openai import OpenAI


def generate_article(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content


In [14]:
product_counts = summary_df.groupby("mainCategory")['name'].nunique().sort_values()
print("📊 Product count per category:\n")
print(product_counts)

📊 Product count per category:

mainCategory
Batteries         2
Accessories       4
Smart Speaker     7
Other             8
E-reader         14
Tablet           26
Name: name, dtype: int64


In [15]:
smart_speakers = summary_df[summary_df["mainCategory"] == "Other"]

# Show all available columns (name, review_count, avg_rating, etc.)
print(f"🔊 Smart Speaker Products ({len(smart_speakers)} found):\n")
display(smart_speakers[['name', 'review_count', 'avg_rating']].sort_values(by='review_count', ascending=False))


🔊 Smart Speaker Products (8 found):



Unnamed: 0,name,review_count,avg_rating
22,AmazonBasics Backpack for Laptops up to 17-inches,25,4.16
20,AmazonBasics 15.6-Inch Laptop and Tablet Bag,21,4.52381
26,Expanding Accordion File Folder Plastic Portab...,9,5.0
25,AmazonBasics Ventilated Adjustable Laptop Stand,8,4.0
24,AmazonBasics External Hard Drive Case,6,4.5
23,AmazonBasics Bluetooth Keyboard for Android De...,6,4.333333
27,Fire TV Stick Streaming Media Player Pair Kit,6,5.0
21,AmazonBasics 16-Gauge Speaker Wire - 100 Feet,5,5.0


In [28]:
test_df = summary_df[summary_df['mainCategory'] == "Tablet"]
top, worst = get_top_and_worst(test_df)
prompt = build_gpt_prompt_from_reviews("Tablet", top, worst)
print(f"\n -- Prompt: {prompt[:2000]}")  # preview prompt

article = generate_article(prompt)
print(f"\n -- Article: {article}")


 -- Prompt: 
You are a professional product review writer creating a blog-style article for a tech-savvy audience (like The Verge or Wirecutter).

Your task is to:
- Compare the **top 3 Amazon tablets** based on real customer feedback
- Include a short, informative paragraph for each product:
    - What customers liked ✅
    - What customers complained about ❌
    - What makes it unique
- Add 2 bullet points at the end of each paragraph:
    - **Pros**
    - **Cons**
- Finish with a final paragraph about the **worst-rated tablet**, including:
    - Why it scored lower
    - What users complained about
    - Why readers should consider avoiding it

Write in a clear, helpful, and slightly conversational tone. Format the article using markdown with product names as headings.

Here are the reviews:

### 38. All-New Fire HD 8 Kids Edition Tablet, 8 HD Display, 32 GB, Pink Kid-Proof Case (Rating: 4.64)
Purchase this Amazon - Fire HD 8 Kids Edition for my 2 year old grand daughter. She can w

In [32]:
#generate article by category

all_summaries = []

categories = summary_df['mainCategory'].unique()

for cat in categories:
    cat_df = summary_df[summary_df['mainCategory'] == cat]
    top, worst = get_top_and_worst(cat_df)

    # Skip if there aren't enough products
    if top is None or worst is None:
        print(f"\n⚠️ Skipping category '{cat}' due to insufficient data.")
        continue

    prompt = build_gpt_prompt_from_reviews(cat, top, worst)

     # Optional: Preview what you're sending
    #print(f"\n🧠 [{cat}] Prompt preview:\n{prompt[:500]}...\n")

        # Generate article using ChatGPT API
    article = generate_article(prompt)

    print(f"\n📝 === Blog Summary for Category: {cat} ===\n")
    print(article)

  # Store in a list for optional saving
    all_summaries.append({
        "category": cat,
        "article": article,
        "top_products": top["name"].tolist(),
        "worst_product": worst["name"].values[0],
    })


⚠️ Skipping category 'Accessories' due to insufficient data.

⚠️ Skipping category 'Batteries' due to insufficient data.

📝 === Blog Summary for Category: E-reader ===

# Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Free 3G + Wi-Fi

Customers have praised the Kindle Voyage for its exceptional reading experience akin to a real book. The adaptive built-in light makes it easy to read in various lighting conditions, while the 3G connectivity allows for on-the-go book downloads. The device's portability, direct sun readability, and long battery life have also been highlighted.

**Pros**
- Real book-like reading experience
- Exceptional direct sun readability

**Cons**
- Sensitive touch controls
- Experimental web browser

# Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Wi-Fi

Users appreciate the Kindle Voyage's lightweight design and readable e-ink display. The pa

In [46]:
import os

# Create a folder to store outputs
output_dir = "/content/drive/MyDrive/reviews-project/product_articles"
os.makedirs(output_dir, exist_ok=True)

for summary in all_summaries:
    category = summary["category"]
    article = summary["article"]

    # Clean file name (e.g., no slashes/spaces)
    safe_filename = category.lower().replace(" ", "_").replace("/", "-")

    # Choose file extension
    file_path = os.path.join(output_dir, f"{safe_filename}.md")  # or .txt if preferred

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(f"# Product Review Summary: {category}\n\n")
        f.write(article)

    print(f"✅ Saved: {file_path}")


✅ Saved: /content/drive/MyDrive/reviews-project/product_articles/e-reader.md
✅ Saved: /content/drive/MyDrive/reviews-project/product_articles/smart_speaker.md
✅ Saved: /content/drive/MyDrive/reviews-project/product_articles/tablet.md
