# Synthetic Retail Data Generator for RAG Recommendation Agent

**When opened in Google Colab: Run all cells. Output folder: /content/synthetic_retail_data/**

This notebook generates synthetic retail data including product catalogs, customer profiles, inventory data, promotions, and product brochure PDFs for demonstrating a Retrieval-Augmented Generation (RAG) recommendation agent.

## What this notebook produces:
- `products.csv`: 30 product entries with SKUs, descriptions, pricing, etc.
- `customers.json`: 10 customer profiles with purchase histories
- `inventory.json`: Stock levels across 5 stores and 1 warehouse
- `promotions.json`: 5 promotional rules
- `product_brochures/`: 3 PDF brochures with text and placeholder images
- `images/`: Placeholder PNG images used in the PDFs
- `README_generated_files.md`: Instructions for ingesting the data

All files are saved to `/content/synthetic_retail_data/` and packaged into `/content/synthetic_retail_data.zip`

For LangChain ingestion, you can load the CSV with `CSVLoader` and PDFs with `PyPDFLoader`.

In [1]:
# Install required packages
!pip install fpdf2 faker pandas pillow lorem

# Standard imports
import os
import json
import random
import pandas as pd
import numpy as np
from faker import Faker
from fpdf import FPDF
from PIL import Image, ImageDraw, ImageFont
import lorem
from datetime import date, timedelta
import shutil
from google.colab import files

# Set random seeds for reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Initialize Faker with seed
fake = Faker()
Faker.seed(RANDOM_SEED)

print("Packages installed and imports completed!")

Collecting fpdf2
  Downloading fpdf2-2.8.5-py3-none-any.whl.metadata (76 kB)
Collecting faker
  Downloading faker-37.12.0-py3-none-any.whl.metadata (15 kB)
Collecting lorem
  Downloading lorem-0.1.1-py3-none-any.whl.metadata (2.3 kB)
Downloading fpdf2-2.8.5-py3-none-any.whl (301 kB)
Downloading faker-37.12.0-py3-none-any.whl (2.0 MB)

In [2]:
# Create output directory structure
output_dir = "/content/synthetic_retail_data"
brochures_dir = os.path.join(output_dir, "product_brochures")
images_dir = os.path.join(output_dir, "images")

# Clear existing output directory if it exists
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)

# Create directories
os.makedirs(output_dir, exist_ok=True)
os.makedirs(brochures_dir, exist_ok=True)
os.makedirs(images_dir, exist_ok=True)

print(f"Created directory structure at {output_dir}")
print(f"- Main directory: {output_dir}")
print(f"- Brochures directory: {brochures_dir}")
print(f"- Images directory: {images_dir}")

Created directory structure at /content/synthetic_retail_data
- Main directory: /content/synthetic_retail_data
- Brochures directory: /content/synthetic_retail_data/product_brochures
- Images directory: /content/synthetic_retail_data/images


In [3]:
def generate_products(n=30):
    """Generate a synthetic product catalog with n products."""

    # Product categories and brands
    categories = ["Footwear", "Apparel", "Accessories", "Tech", "Home"]
    category_abbrev = {"Footwear": "FT", "Apparel": "AP", "Accessories": "AC", "Tech": "TC", "Home": "HM"}

    brands = {
        "Footwear": ["Nike", "Adidas", "Puma", "Reebok", "New Balance"],
        "Apparel": ["Uniqlo", "Zara", "H&M", "Gucci", "Prada"],
        "Accessories": ["Ray-Ban", "Swatch", "Coach", "Michael Kors", "Kate Spade"],
        "Tech": ["Apple", "Samsung", "Sony", "Microsoft", "Google"],
        "Home": ["IKEA", "West Elm", "CB2", "Pottery Barn", "Crate & Barrel"]
    }

    materials = ["Cotton", "Polyester", "Leather", "Nylon", "Wool", "Silk", "Denim", "Linen"]
    colors = ["Black", "White", "Blue", "Red", "Green", "Yellow", "Purple", "Pink", "Orange", "Brown", "Grey"]

    size_ranges = {
        "Footwear": ["US 6", "US 7", "US 8", "US 9", "US 10", "US 11"],
        "Apparel": ["XS", "S", "M", "L", "XL", "XXL"],
        "Accessories": ["One Size"],
        "Tech": ["N/A"],
        "Home": ["Small", "Medium", "Large"]
    }

    # Generate products
    products = []
    for i in range(n):
        category = random.choice(categories)
        brand = random.choice(brands[category])

        # Generate SKU
        sku_number = str(i+1).zfill(4)
        sku = f"PRD-{category_abbrev[category]}-{sku_number}"

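        # Generate title. Note: dropping the trailing "s" is a naive singularization,
        # so "Accessories" becomes "Accessorie" in the generated titles.
        title = f"{brand} {fake.word().title()} {category[:-1] if category.endswith('s') else category}"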

        # Generate price (INR)
        price = random.randint(500, 15000)

        # Generate description
        description = f"{lorem.sentence()} {lorem.sentence()} {lorem.sentence()} "
        description += f"Upper material: {random.choice(materials)}; "
        description += f"Color: {random.choice(colors)}; "
        description += f"Care: {fake.sentence()[:50]}"

        # Generate attributes
        size_range = size_ranges[category]
        product_colors = random.sample(colors, k=min(3, len(colors)))
        material = random.choice(materials)

        # Generate tags
        tags = [fake.word(), fake.word(), category.lower()]

        # Image filename
        image_filename = f"{sku}.png"

        # Store availability (5 stores)
        store_availability = {}
        for j in range(1, 6):
            store_availability[f"store_{j}"] = random.randint(0, 20)

        # Warehouse stock
        warehouse_stock = random.randint(0, 200)

        # Rating
        rating = round(random.uniform(3.0, 5.0), 1)

        product = {
            "sku": sku,
            "title": title,
            "category": category,
            "brand": brand,
            "price": price,
            "description": description,
            "attributes": {
                "size_range": size_range,
                "colors": product_colors,
                "material": material
            },
            "tags": tags,
            "image_filename": image_filename,
            "store_availability": store_availability,
            "warehouse_stock": warehouse_stock,
            "rating": rating
        }

        products.append(product)

    return products

# Generate products
products = generate_products(30)

# Save to CSV
products_df = pd.DataFrame(products)
# Flatten nested fields for CSV
products_df['attributes'] = products_df['attributes'].apply(json.dumps)
products_df['tags'] = products_df['tags'].apply(json.dumps)
products_df['store_availability'] = products_df['store_availability'].apply(json.dumps)

csv_path = os.path.join(output_dir, "products.csv")
products_df.to_csv(csv_path, index=False, encoding='utf-8')

print(f"Generated and saved {len(products)} products to {csv_path}")
print("\nFirst 5 products:")
products_df.head()

Generated and saved 30 products to /content/synthetic_retail_data/products.csv

First 5 products:


Unnamed: 0,sku,title,category,brand,price,description,attributes,tags,image_filename,store_availability,warehouse_stock,rating
0,PRD-FT-0001,Nike Purpose Footwear,Footwear,Nike,12649,Eius eius dolor tempora consectetur sed. Amet ...,"{""size_range"": [""US 6"", ""US 7"", ""US 8"", ""US 9""...","[""require"", ""sit"", ""footwear""]",PRD-FT-0001.png,"{""store_1"": 5, ""store_2"": 13, ""store_3"": 10, ""...",55,4.9
1,PRD-AC-0002,Ray-Ban Wait Accessorie,Accessories,Ray-Ban,2019,Consectetur labore labore quiquia est velit al...,"{""size_range"": [""One Size""], ""colors"": [""Red"",...","[""rate"", ""science"", ""accessories""]",PRD-AC-0002.png,"{""store_1"": 14, ""store_2"": 20, ""store_3"": 11, ...",90,3.4
2,PRD-AC-0003,Ray-Ban Offer Accessorie,Accessories,Ray-Ban,10480,Porro tempora eius dolore neque. Est quisquam ...,"{""size_range"": [""One Size""], ""colors"": [""Green...","[""grow"", ""fall"", ""accessories""]",PRD-AC-0003.png,"{""store_1"": 6, ""store_2"": 20, ""store_3"": 15, ""...",117,3.3
3,PRD-AP-0004,Zara Clearly Apparel,Apparel,Zara,12705,Porro est tempora quaerat modi quaerat magnam ...,"{""size_range"": [""XS"", ""S"", ""M"", ""L"", ""XL"", ""XX...","[""PM"", ""everything"", ""apparel""]",PRD-AP-0004.png,"{""store_1"": 16, ""store_2"": 8, ""store_3"": 17, ""...",174,4.8
4,PRD-AC-0005,Coach Surface Accessorie,Accessories,Coach,2327,Modi dolore neque adipisci tempora tempora. Nu...,"{""size_range"": [""One Size""], ""colors"": [""Orang...","[""human"", ""bar"", ""accessories""]",PRD-AC-0005.png,"{""store_1"": 15, ""store_2"": 0, ""store_3"": 3, ""s...",61,3.1
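
Because `attributes`, `tags`, and `store_availability` are written to `products.csv` as JSON strings, anything that reads the file back has to decode those columns before using them. The following is a minimal sketch of that round trip using only the standard library. The helper name `load_products` and the one-row in-memory sample are assumptions for illustration, not part of the notebook.

```python
import csv
import io
import json

# Columns the generator serialized with json.dumps before writing the CSV
JSON_COLUMNS = ("attributes", "tags", "store_availability")

def load_products(csv_file):
    """Read product rows and decode the JSON-encoded columns (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(csv_file):
        for column in JSON_COLUMNS:
            row[column] = json.loads(row[column])
        row["price"] = int(row["price"])  # csv returns every field as a string
        rows.append(row)
    return rows

# Build a one-row sample in memory, flattened the same way as in the cell above
sample_csv = io.StringIO()
writer = csv.DictWriter(
    sample_csv,
    fieldnames=["sku", "price", "attributes", "tags", "store_availability"],
)
writer.writeheader()
writer.writerow({
    "sku": "PRD-FT-0001",
    "price": 12649,
    "attributes": json.dumps({"colors": ["Black", "Red"], "material": "Leather"}),
    "tags": json.dumps(["require", "sit", "footwear"]),
    "store_availability": json.dumps({"store_1": 5, "store_2": 13}),
})
sample_csv.seek(0)

loaded_products = load_products(sample_csv)
print(loaded_products[0]["store_availability"]["store_2"])
```

With the real file, the same helper could be called on `open(csv_path, newline="", encoding="utf-8")` instead of the in-memory sample.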


In [4]:
def generate_placeholder_image(sku, title, image_filename, width=400, height=300):
    """Generate a placeholder image with SKU and title text."""

    # Create image with random background color
    bg_color = (
        random.randint(100, 255),
        random.randint(100, 255),
        random.randint(100, 255)
    )

    image = Image.new('RGB', (width, height), bg_color)
    draw = ImageDraw.Draw(image)

    # Draw text
    try:
        # Try to use a better font if available
        from PIL import ImageFont
        font_large = ImageFont.truetype("DejaVuSans.ttf", 24)
        font_small = ImageFont.truetype("DejaVuSans.ttf", 16)
    except:
        # Fallback to default font
        font_large = ImageFont.load_default()
        font_small = ImageFont.load_default()

    # Draw SKU
    draw.text((10, 10), sku, fill=(0, 0, 0), font=font_large)

    # Draw title (may need to wrap)
    words = title.split()
    lines = []
    current_line = ""

    for word in words:
        test_line = current_line + " " + word if current_line else word
        bbox = draw.textbbox((0, 0), test_line, font=font_small)
        text_width = bbox[2] - bbox[0]

        if text_width <= width - 20:
            current_line = test_line
        else:
            # Avoid appending an empty line when the first word alone is too wide
            if current_line:
                lines.append(current_line)
            current_line = word

    if current_line:
        lines.append(current_line)

    # Draw lines of title
    y_offset = 50
    for line in lines[:3]:  # Limit to 3 lines
        draw.text((10, y_offset), line, fill=(0, 0, 0), font=font_small)
        y_offset += 20

    # Save image
    image_path = os.path.join(images_dir, image_filename)
    image.save(image_path)

    return image_path

# Generate placeholder images for all products
generated_images = []
for product in products:
    image_path = generate_placeholder_image(
        product['sku'],
        product['title'],
        product['image_filename']
    )
    generated_images.append(image_path)

print(f"Generated {len(generated_images)} placeholder images in {images_dir}")

Generated 30 placeholder images in /content/synthetic_retail_data/images


In [5]:
def generate_customers(n=10, products=None):
    """Generate synthetic customer profiles with purchase histories."""

    loyalty_tiers = [None, "Bronze", "Silver", "Gold", "Platinum"]
    channels = ["web", "mobile_app", "in_store", "phone"]

    customers = []

    for i in range(n):
        customer_id = f"CUST{str(i+1).zfill(4)}"
        name = fake.name()
        email = fake.email()

        # Loyalty tier (weighted toward None and Bronze)
        loyalty_tier = random.choices(loyalty_tiers, weights=[0.4, 0.3, 0.15, 0.1, 0.05])[0]

        # Preferred brands (select from product brands)
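        if products:
            # Sort the unique brands so sampling is reproducible under the fixed seed
            # (set iteration order for strings can vary between Python sessions)
            all_brands = sorted({p['brand'] for p in products})
            preferred_brands = random.sample(all_brands, k=min(3, len(all_brands)))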
        else:
            preferred_brands = [fake.company() for _ in range(3)]

        preferred_store = f"store_{random.randint(1, 5)}"

        # Sizes
        shoe_size = random.choice(["US 6", "US 7", "US 8", "US 9", "US 10", "US 11"])
        clothing_size = random.choice(["XS", "S", "M", "L", "XL", "XXL"])

        last_channel = random.choice(channels)

        # Past purchases (2-4 purchases for some customers)
        past_purchases = []
        if products and random.random() > 0.3:  # 70% chance to have purchases
            num_purchases = random.randint(2, 4)
            purchased_skus = random.sample(products, min(num_purchases, len(products)))

            for product in purchased_skus:
                # Purchase date within last year
                days_ago = random.randint(1, 365)
                purchase_date = date.today() - timedelta(days=days_ago)

                purchase = {
                    "sku": product['sku'],
                    "date": purchase_date.strftime("%Y-%m-%d"),
                    "price": product['price']
                }
                past_purchases.append(purchase)

        customer = {
            "id": customer_id,
            "name": name,
            "email": email,
            "loyalty_tier": loyalty_tier,
            "past_purchases": past_purchases,
            "preferred_brands": preferred_brands,
            "preferred_store": preferred_store,
            "sizes": {
                "shoe": shoe_size,
                "clothing": clothing_size
            },
            "last_channel": last_channel
        }

        customers.append(customer)

    return customers

# Generate customers
customers = generate_customers(10, products)

# Save to JSON
json_path = os.path.join(output_dir, "customers.json")
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(customers, f, indent=2, ensure_ascii=False)

print(f"Generated and saved {len(customers)} customers to {json_path}")
print("\nSample customers:")
for i in range(min(2, len(customers))):
    cust = customers[i]
    print(f"\nCustomer {cust['id']}: {cust['name']} ({cust['email']})")
    print(f"  Loyalty Tier: {cust['loyalty_tier']}")
    print(f"  Preferred Brands: {', '.join(cust['preferred_brands'])}")
    print(f"  Past Purchases: {len(cust['past_purchases'])} items")

Generated and saved 10 customers to /content/synthetic_retail_data/customers.json

Sample customers:

Customer CUST0001: Todd Hudson (harveyrobert@example.net)
  Loyalty Tier: Gold
  Preferred Brands: Crate & Barrel, Puma, CB2
  Past Purchases: 0 items

Customer CUST0002: Emily Green (john62@example.net)
  Loyalty Tier: Bronze
  Preferred Brands: Zara, New Balance, IKEA
  Past Purchases: 4 items
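
A recommendation agent typically needs a compact summary of each profile rather than the raw record. The sketch below shows one way to derive total spend and purchased categories from records shaped like those in `customers.json`. The sample records and both helper names are hypothetical, and the category lookup assumes the `PRD-<abbrev>-<number>` SKU pattern used by `generate_products`.

```python
# Hypothetical records shaped like the entries in customers.json
sample_customers = [
    {"id": "CUST0001", "loyalty_tier": "Gold", "past_purchases": []},
    {"id": "CUST0002", "loyalty_tier": "Bronze", "past_purchases": [
        {"sku": "PRD-FT-0001", "date": "2025-01-10", "price": 12649},
        {"sku": "PRD-AC-0002", "date": "2025-03-02", "price": 2019},
    ]},
]

# Reverse of the category_abbrev mapping used when SKUs are generated
ABBREV_TO_CATEGORY = {"FT": "Footwear", "AP": "Apparel", "AC": "Accessories",
                      "TC": "Tech", "HM": "Home"}

def total_spend(customer):
    """Sum the recorded purchase prices for one customer."""
    return sum(purchase["price"] for purchase in customer["past_purchases"])

def purchased_categories(customer):
    """Derive categories from the SKU prefix, e.g. PRD-FT-0001 -> Footwear."""
    abbrevs = {purchase["sku"].split("-")[1] for purchase in customer["past_purchases"]}
    return sorted(ABBREV_TO_CATEGORY[a] for a in abbrevs)

for customer in sample_customers:
    print(customer["id"], total_spend(customer), purchased_categories(customer))
```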


In [6]:
def generate_inventory(products=None):
    """Generate inventory data for 5 stores and 1 warehouse."""

    inventory = {}

    # Initialize locations
    locations = [f"store_{i}" for i in range(1, 6)] + ["warehouse_1"]

    # Generate inventory for each product
    if products:
        for product in products:
            sku = product['sku']
            inventory[sku] = {}

            for location in locations:
                if location.startswith("store"):
                    # Stores have smaller inventory
                    inventory[sku][location] = random.randint(0, 20)
                else:  # warehouse
                    # Warehouse has larger inventory
                    inventory[sku][location] = random.randint(0, 200)

    return inventory

# Generate inventory
inventory = generate_inventory(products)

# Save to JSON
json_path = os.path.join(output_dir, "inventory.json")
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(inventory, f, indent=2)

print(f"Generated and saved inventory data for {len(inventory)} products to {json_path}")
print(f"Inventory covers 5 stores and 1 warehouse")

Generated and saved inventory data for 30 products to /content/synthetic_retail_data/inventory.json
Inventory covers 5 stores and 1 warehouse
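
The inventory file maps each SKU to a quantity per location, so an availability lookup is a dictionary access plus a filter. The sketch below illustrates this with a hypothetical two-SKU sample. The helper names and the fulfilment labels are assumptions for illustration and do not come from the notebook.

```python
# Hypothetical sample shaped like inventory.json: {sku: {location: quantity}}
sample_inventory = {
    "PRD-FT-0001": {"store_1": 5, "store_2": 0, "warehouse_1": 55},
    "PRD-AC-0002": {"store_1": 0, "store_2": 0, "warehouse_1": 0},
}

def locations_with_stock(inventory, sku, min_qty=1):
    """Return the locations holding at least min_qty units of the SKU."""
    stock = inventory.get(sku, {})
    return sorted(loc for loc, qty in stock.items() if qty >= min_qty)

def fulfilment_option(inventory, sku, preferred_store):
    """Pick a fulfilment label, preferring the customer's usual store."""
    stock = inventory.get(sku, {})
    if stock.get(preferred_store, 0) > 0:
        return "pickup_at_preferred_store"
    if any(qty > 0 for loc, qty in stock.items() if loc.startswith("store_")):
        return "pickup_at_other_store"
    if stock.get("warehouse_1", 0) > 0:
        return "ship_from_warehouse"
    return "out_of_stock"

print(locations_with_stock(sample_inventory, "PRD-FT-0001"))
print(fulfilment_option(sample_inventory, "PRD-FT-0001", "store_2"))
```

Note that `generate_products` also writes a separate, independently randomized `store_availability` field, so the two sources will generally disagree. A consumer should pick one of them as the source of truth.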


In [12]:
!pip install lorem-text


Collecting lorem-text
  Downloading lorem_text-3.0-py3-none-any.whl.metadata (2.3 kB)
Downloading lorem_text-3.0-py3-none-any.whl (5.6 kB)
Installing collected packages: lorem-text
Successfully installed lorem-text-3.0


In [13]:
# ✅ Full working code for brochure PDF generation (Unicode-safe)
!apt install -y fonts-dejavu-core > /dev/null

import os
import random
from datetime import date
from fpdf import FPDF
from lorem_text import lorem

# Example data (replace with your actual data).
# Note: this reassigns `products`, replacing the 30 generated products from the earlier cell.
products = [
    {
        "sku": "FT001",
        "title": "Classic Leather Loafers",
        "category": "Footwear",
        "brand": "Premium Footwear Co.",
        "price": 2999,
        "rating": 4.5,
        "attributes": {
            "material": "Leather",
            "colors": ["Black", "Brown"]
        },
        "image_filename": "shoe1.png"
    },
    {
        "sku": "AP001",
        "title": "Cotton Summer Shirt",
        "category": "Apparel",
        "brand": "Seasonal Fashion House",
        "price": 1499,
        "rating": 4.2,
        "attributes": {
            "material": "Cotton",
            "colors": ["Blue", "White", "Green"]
        },
        "image_filename": "shirt1.png"
    },
]

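# Note: these reassign the /content/synthetic_retail_data paths set earlier, so brochures
# are written to ./brochures and the placeholder images are not found in ./images.
images_dir = "./images"
brochures_dir = "./brochures"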
os.makedirs(images_dir, exist_ok=True)
os.makedirs(brochures_dir, exist_ok=True)

# ------------------------- PDF CREATION FUNCTION -------------------------

def create_product_brochure(brochure_info, products, images_dir, output_path):
    """Create a product brochure PDF with multiple product sections."""

    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.add_page()

    # ✅ Add a Unicode TrueType font
    font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
    # The `uni` argument is deprecated in fpdf2, so it is omitted here
    pdf.add_font("DejaVu", "", font_path)
    pdf.add_font("DejaVu", "B", font_path)

    # ---------- Title Page ----------
    pdf.set_font("DejaVu", "B", 24)
    pdf.cell(0, 20, brochure_info["title"], ln=True, align="C")

    pdf.set_font("DejaVu", "", 16)
    pdf.cell(0, 15, brochure_info["brand"], ln=True, align="C")

    pdf.set_font("DejaVu", "", 12)
    pdf.cell(0, 10, f"Date: {date.today().strftime('%Y-%m-%d')}", ln=True, align="C")

    pdf.ln(10)
    pdf.multi_cell(0, 8, brochure_info["description"])
    pdf.add_page()

    # ---------- Product Sections ----------
    skus_in_brochure = []

    for product_section in brochure_info["products"]:
        product = next((p for p in products if p["sku"] == product_section["sku"]), None)
        if not product:
            continue

        skus_in_brochure.append(product["sku"])

        pdf.set_font("DejaVu", "B", 16)
        pdf.cell(0, 12, product["title"], ln=True)

        pdf.set_font("DejaVu", "", 10)
        pdf.cell(0, 8, f"SKU: {product['sku']}", ln=True)

        pdf.set_font("DejaVu", "", 11)
        for feature in product_section["features"]:
            # Use the Unicode bullet (U+2022); chr(149) is a control character in Unicode
            pdf.cell(10, 6, "\u2022", ln=False)
            pdf.cell(0, 6, feature, ln=True)

        pdf.ln(2)

        pdf.set_font("DejaVu", "", 11)
        description = (
            f"{product_section['description']} "
            f"This {product['category'].lower()} pairs well with complementary items. "
            f"Recommended accessories include products from the same collection. "
            f"Reserve in-store for a try-on - limited stock available. "
            f"For care instructions, gently clean with appropriate methods for "
            f"{product['attributes']['material']} materials."
        )
        pdf.multi_cell(0, 7, description)
        pdf.ln(5)

        # ---------- Specifications ----------
        pdf.set_font("DejaVu", "B", 12)
        pdf.cell(0, 10, "Specifications:", ln=True)

        pdf.set_font("DejaVu", "", 10)
        pdf.cell(40, 7, "Category:", ln=False)
        pdf.cell(0, 7, product["category"], ln=True)

        pdf.cell(40, 7, "Brand:", ln=False)
        pdf.cell(0, 7, product["brand"], ln=True)

        pdf.cell(40, 7, "Material:", ln=False)
        pdf.cell(0, 7, product["attributes"]["material"], ln=True)

        pdf.cell(40, 7, "Colors:", ln=False)
        pdf.cell(0, 7, ", ".join(product["attributes"]["colors"]), ln=True)

        pdf.cell(40, 7, "Price:", ln=False)
        pdf.cell(0, 7, f"₹{product['price']}", ln=True)

        pdf.cell(40, 7, "Rating:", ln=False)
        pdf.cell(0, 7, f"{product['rating']}/5.0", ln=True)

        # ---------- Image (optional) ----------
        image_path = os.path.join(images_dir, product["image_filename"])
        if os.path.exists(image_path):
            pdf.ln(5)
            try:
                pdf.image(image_path, w=80, h=60)
            except Exception:
                # Fall back to a text placeholder if the image cannot be embedded
                pdf.cell(0, 10, "[Product Image]", ln=True, align="C")

        pdf.ln(10)

    # ---------- Footer ----------
    pdf.add_page()
    pdf.set_font("DejaVu", "", 10)
    pdf.cell(0, 8, "SKU References:", ln=True)
    for sku in skus_in_brochure:
        pdf.cell(0, 6, f"• {sku}", ln=True)

    pdf.ln(10)
    pdf.cell(0, 8, "Contact us for more information or to place an order.", ln=True)
    pdf.cell(0, 6, "Email: info@retail-demo.com | Phone: +91-9876543210", ln=True)

    # ✅ Save PDF
    pdf.output(output_path)
    return skus_in_brochure


# ------------------------- BROCHURE GENERATION FUNCTION -------------------------

def generate_brochures(products):
    """Generate product brochure PDFs."""

    brochure_themes = [
        {
            "filename": "brochure_footwear.pdf",
            "title": "Classic Footwear Collection",
            "brand": "Premium Footwear Co.",
            "description": "Discover our premium footwear collection featuring comfort and style.",
            "category": "Footwear"
        },
        {
            "filename": "brochure_apparel.pdf",
            "title": "Summer Apparel Edit",
            "brand": "Seasonal Fashion House",
            "description": "Refresh your wardrobe with our summer collection.",
            "category": "Apparel"
        },
        {
            "filename": "brochure_best_sellers.pdf",
            "title": "Best Sellers & Customer Favorites",
            "brand": "Popular Choice",
            "description": "Our most loved products chosen by thousands of satisfied customers.",
            "category": "mixed"
        }
    ]

    products_by_category = {}
    for product in products:
        category = product["category"]
        if category not in products_by_category:
            products_by_category[category] = []
        products_by_category[category].append(product)

    generated_brochures = []

    for theme in brochure_themes:
        brochure_products = []

        if theme["category"] == "mixed":
            for category, cat_products in products_by_category.items():
                sorted_products = sorted(cat_products, key=lambda x: x["rating"], reverse=True)
                brochure_products.extend(sorted_products[:2])
        elif theme["category"] in products_by_category:
            cat_products = products_by_category[theme["category"]]
            brochure_products = cat_products[:min(5, len(cat_products))]

        if not brochure_products:
            continue

        brochure_info = {
            "title": theme["title"],
            "brand": theme["brand"],
            "description": theme["description"],
            "products": []
        }

        for product in brochure_products:
            features = [
                f"Premium {product['attributes']['material']} construction",
                f"Available in {len(product['attributes']['colors'])} colors",
                f"{random.choice(['Lightweight design', 'Durable materials', 'Ergonomic fit', 'Easy maintenance'])}",
                f"Customer rating: {product['rating']}/5.0",
                f"Price: ₹{product['price']}"
            ]

            description = f"{lorem.sentence()} {lorem.sentence()} {lorem.sentence()}"

            product_section = {
                "sku": product["sku"],
                "features": features[:5],
                "description": description
            }

            brochure_info["products"].append(product_section)

        output_path = os.path.join(brochures_dir, theme["filename"])
        skus_in_brochure = create_product_brochure(brochure_info, products, images_dir, output_path)

        generated_brochures.append({
            "filename": theme["filename"],
            "title": theme["title"],
            "skus": skus_in_brochure,
            "path": output_path
        })

        print(f"✅ Generated brochure: {theme['filename']} with {len(skus_in_brochure)} products")

    return generated_brochures


# ------------------------- RUN GENERATION -------------------------
brochures = generate_brochures(products)

print(f"\n📘 Generated {len(brochures)} brochures in '{brochures_dir}'")
for brochure in brochures:
    print(f"- {brochure['filename']}: {brochure['title']} ({len(brochure['skus'])} products)")







✅ Generated brochure: brochure_footwear.pdf with 1 products
✅ Generated brochure: brochure_apparel.pdf with 1 products
✅ Generated brochure: brochure_best_sellers.pdf with 2 products

📘 Generated 3 brochures in './brochures'
- brochure_footwear.pdf: Classic Footwear Collection (1 products)
- brochure_apparel.pdf: Summer Apparel Edit (1 products)
- brochure_best_sellers.pdf: Best Sellers & Customer Favorites (2 products)


In [7]:
def generate_promotions():
    """Generate mock promotion rules."""

    categories = ["Footwear", "Apparel", "Accessories", "Tech", "Home"]
    loyalty_tiers = ["Bronze", "Silver", "Gold", "Platinum"]

    promotions = []

    # Promotion 1: Category discount
    promo1 = {
        "promo_id": "PROMO001",
        "description": "10% off on all Footwear",
        "valid_from": date.today().strftime("%Y-%m-%d"),
        "valid_to": (date.today() + timedelta(days=30)).strftime("%Y-%m-%d"),
        "eligibility": {
            "category": "Footwear"
        },
        "discount_type": "percent",
        "value": 10,
        "coupon_code": "FOOTWEAR10"
    }
    promotions.append(promo1)

    # Promotion 2: Loyalty discount
    promo2 = {
        "promo_id": "PROMO002",
        "description": "15% off for Gold and Platinum members",
        "valid_from": date.today().strftime("%Y-%m-%d"),
        "valid_to": (date.today() + timedelta(days=30)).strftime("%Y-%m-%d"),
        "eligibility": {
            "loyalty_tier": ["Gold", "Platinum"]
        },
        "discount_type": "percent",
        "value": 15,
        "coupon_code": "LOYALTY15"
    }
    promotions.append(promo2)

    # Promotion 3: Tech category discount
    promo3 = {
        "promo_id": "PROMO003",
        "description": "Buy 2 Tech items, get 20% off",
        "valid_from": date.today().strftime("%Y-%m-%d"),
        "valid_to": (date.today() + timedelta(days=15)).strftime("%Y-%m-%d"),
        "eligibility": {
            "category": "Tech",
            "min_quantity": 2
        },
        "discount_type": "percent",
        "value": 20,
        "coupon_code": "TECH20"
    }
    promotions.append(promo3)

    # Promotion 4: High-value item discount
    promo4 = {
        "promo_id": "PROMO004",
        "description": "₹500 off on orders above ₹5000",
        "valid_from": date.today().strftime("%Y-%m-%d"),
        "valid_to": (date.today() + timedelta(days=45)).strftime("%Y-%m-%d"),
        "eligibility": {
            "min_order_value": 5000
        },
        "discount_type": "fixed",
        "value": 500,
        "coupon_code": "HIGHVALUE500"
    }
    promotions.append(promo4)

    # Promotion 5: Seasonal discount
    promo5 = {
        "promo_id": "PROMO005",
        "description": "Summer Sale - 25% off on Apparel",
        "valid_from": date.today().strftime("%Y-%m-%d"),
        "valid_to": (date.today() + timedelta(days=20)).strftime("%Y-%m-%d"),
        "eligibility": {
            "category": "Apparel"
        },
        "discount_type": "percent",
        "value": 25,
        "coupon_code": "SUMMER25"
    }
    promotions.append(promo5)

    return promotions

# Generate promotions
promotions = generate_promotions()

# Save to JSON
json_path = os.path.join(output_dir, "promotions.json")
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(promotions, f, indent=2, ensure_ascii=False)

print(f"Generated and saved {len(promotions)} promotions to {json_path}")
print("\nPromotion examples:")
for promo in promotions[:3]:
    print(f"- {promo['description']} ({promo['coupon_code']})")

Generated and saved 5 promotions to /content/synthetic_retail_data/promotions.json

Promotion examples:
- 10% off on all Footwear (FOOTWEAR10)
- 15% off for Gold and Platinum members (LOYALTY15)
- Buy 2 Tech items, get 20% off (TECH20)
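
The promotion records only describe eligibility rules. Applying them to an order is left to the consumer. The sketch below shows one possible interpretation: a category rule restricts the discount to matching items, `min_quantity` counts those items, and `min_order_value` is checked against the whole cart. These semantics, the `qty` field on cart items, the fixed dates, and all helper names are assumptions for illustration.

```python
from datetime import date

# Hypothetical promotions shaped like promotions.json entries, with fixed dates
sample_promotions = [
    {"coupon_code": "FOOTWEAR10", "valid_from": "2025-01-01", "valid_to": "2025-01-31",
     "eligibility": {"category": "Footwear"}, "discount_type": "percent", "value": 10},
    {"coupon_code": "TECH20", "valid_from": "2025-01-01", "valid_to": "2025-01-16",
     "eligibility": {"category": "Tech", "min_quantity": 2},
     "discount_type": "percent", "value": 20},
    {"coupon_code": "HIGHVALUE500", "valid_from": "2025-01-01", "valid_to": "2025-02-15",
     "eligibility": {"min_order_value": 5000}, "discount_type": "fixed", "value": 500},
]

# Hypothetical cart; the "qty" field is an assumption, not part of the generated data
sample_cart = [
    {"sku": "PRD-FT-0001", "category": "Footwear", "price": 4000, "qty": 1},
    {"sku": "PRD-TC-0007", "category": "Tech", "price": 3000, "qty": 1},
]

def eligible_items(promo, cart):
    """Items the promotion applies to: one category if specified, else the whole cart."""
    category = promo["eligibility"].get("category")
    return [item for item in cart if category is None or item["category"] == category]

def is_eligible(promo, cart, loyalty_tier, today):
    """Check the validity window and every rule present in the eligibility block."""
    rules = promo["eligibility"]
    items = eligible_items(promo, cart)
    if not date.fromisoformat(promo["valid_from"]) <= today <= date.fromisoformat(promo["valid_to"]):
        return False
    if not items:
        return False
    if sum(item["qty"] for item in items) < rules.get("min_quantity", 1):
        return False
    if "loyalty_tier" in rules and loyalty_tier not in rules["loyalty_tier"]:
        return False
    order_total = sum(item["price"] * item["qty"] for item in cart)
    return order_total >= rules.get("min_order_value", 0)

def discount_amount(promo, cart):
    """Percent discounts scale with the eligible subtotal; fixed ones are capped by it."""
    base = sum(item["price"] * item["qty"] for item in eligible_items(promo, cart))
    if promo["discount_type"] == "percent":
        return base * promo["value"] / 100
    return min(promo["value"], base)

today = date(2025, 1, 15)
applicable = [p for p in sample_promotions if is_eligible(p, sample_cart, None, today)]
best = max(applicable, key=lambda p: discount_amount(p, sample_cart))
print([p["coupon_code"] for p in applicable], best["coupon_code"])
```

In this sample, `TECH20` is rejected because the cart holds only one Tech item, and the fixed ₹500 discount beats the 10% Footwear discount of ₹400.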


In [9]:
!apt install fonts-dejavu-core -y


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  fonts-dejavu-core
0 upgraded, 1 newly installed, 0 to remove and 41 not upgraded.
Need to get 1,041 kB of archives.
After this operation, 3,025 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-dejavu-core all 2.37-2build1 [1,041 kB]
Fetched 1,041 kB in 2s (488 kB/s)
Selecting previously unselected package fonts-dejavu-core.
(Reading database ... 125080 files and directories currently installed.)
Preparing to unpack .../fonts-dejavu-core_2.37-2build1_all.deb ...
Unpacking fonts-dejavu-core (2.37-2build1) ...
Setting up fonts-dejavu-core (2.37-2build1) ...
Processing triggers for fontconfig (2.13.1-4.2ubuntu5) ...


In [15]:
# Validate generated files and show preview
print("=== VALIDATION & PREVIEW ===\n")

# Count files in each directory
def count_files_in_directory(directory):
    count = 0
    total_size = 0
    for root, dirs, files in os.walk(directory):
        count += len(files)
        for file in files:
            file_path = os.path.join(root, file)
            total_size += os.path.getsize(file_path)
    return count, total_size

# Main directory
main_count, main_size = count_files_in_directory(output_dir)
print(f"Main directory ({output_dir}): {main_count} files, {main_size/1024:.2f} KB")

# Brochures directory
brochures_count, brochures_size = count_files_in_directory(brochures_dir)
print(f"Brochures directory ({brochures_dir}): {brochures_count} files, {brochures_size/1024:.2f} KB")

# Images directory
images_count, images_size = count_files_in_directory(images_dir)
print(f"Images directory ({images_dir}): {images_count} files, {images_size/1024:.2f} KB")

print("\n=== BROCHURE PREVIEW ===")
# Show first few brochure names
print("Generated brochure PDFs:")
for brochure in brochures:
    print(f"- {brochure['filename']}")

print("\n=== DATA SUMMARY ===")
print(f"Products: {len(products)}")
print(f"Customers: {len(customers)}")
print(f"Inventory items: {len(inventory)}")
print(f"Promotions: {len(promotions)}")
print(f"Product brochures: {len(brochures)}")
print(f"Placeholder images: {len(generated_images)}")

=== VALIDATION & PREVIEW ===

Main directory (/content/synthetic_retail_data): 34 files, 122.25 KB
Brochures directory (./brochures): 3 files, 70.83 KB
Images directory (./images): 0 files, 0.00 KB

=== BROCHURE PREVIEW ===
Generated brochure PDFs:
- brochure_footwear.pdf
- brochure_apparel.pdf
- brochure_best_sellers.pdf

=== DATA SUMMARY ===
Products: 2
Customers: 10
Inventory items: 30
Promotions: 5
Product brochures: 3
Placeholder images: 30
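
The summary can be checked against the counts promised in the notebook's introduction (30 products, 10 customers, 3-5 promotions, 6 brochures). The following is a minimal sketch, not part of the original notebook: `EXPECTED_COUNTS` and `find_count_mismatches` are hypothetical names, and the `summary` values are copied from the output printed above.

```python
# Expected counts, taken from the notebook's introduction.
# Each entry is an inclusive (min, max) range.
EXPECTED_COUNTS = {
    "products": (30, 30),
    "customers": (10, 10),
    "promotions": (3, 5),
    "brochures": (6, 6),
}

def find_count_mismatches(actual, expected=EXPECTED_COUNTS):
    """Return {name: (actual_value, (min, max))} for every count outside its range."""
    mismatches = {}
    for name, (low, high) in expected.items():
        value = actual.get(name)
        if value is None or not (low <= value <= high):
            mismatches[name] = (value, (low, high))
    return mismatches

# Values copied from the DATA SUMMARY output above.
summary = {"products": 2, "customers": 10, "promotions": 5, "brochures": 3}
for name, (value, (low, high)) in find_count_mismatches(summary).items():
    print(f"{name}: got {value}, expected between {low} and {high}")
```

With the values from this run, the check flags `products` and `brochures`, so those two generation steps are worth a second look before ingesting the data.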


In [16]:
# Create README with ingestion instructions
readme_content = """# Generated Retail Data for RAG Demo

This package contains synthetic retail data for demonstrating a Retrieval-Augmented Generation (RAG) recommendation agent.

## Contents

1. `products.csv` - 30 product entries with SKUs, descriptions, pricing, etc.
2. `customers.json` - 10 customer profiles with purchase histories
3. `inventory.json` - Stock levels across 5 stores and 1 warehouse
4. `promotions.json` - 3-5 promotional rules
5. `product_brochures/` - 6 PDF brochures with product information
6. `images/` - Placeholder PNG images used in the PDFs

## Ingestion Instructions

### Loading Products CSV with LangChain

```python
from langchain.document_loaders import CSVLoader

loader = CSVLoader(file_path='products.csv')
documents = loader.load()
```

### Loading Product Brochures with LangChain

```python
from langchain.document_loaders import PyPDFLoader
import os

# Load all PDFs in the brochures directory
brochure_docs = []
for filename in os.listdir('product_brochures'):
    if filename.endswith('.pdf'):
        loader = PyPDFLoader(os.path.join('product_brochures', filename))
        brochure_docs.extend(loader.load())
```

## Token Budget Tips

1. For large PDFs, consider splitting into smaller chunks (500-1000 tokens each)
2. Use only 1-2 product sections per retrieval context for the demo
3. The CSV and JSON files include metadata to support Pinecone filters (category, price, store_availability)

## Next Steps

1. Run embedding creation on loaded documents
2. Upsert embeddings into Pinecone vector database
3. Implement retrieval tests with various queries
"""

readme_path = os.path.join(output_dir, "README_generated_files.md")
with open(readme_path, 'w', encoding='utf-8') as f:
    f.write(readme_content)

print(f"Created README at {readme_path}")

Created README at /content/synthetic_retail_data/README_generated_files.md


In [17]:
# Zip the output directory
import zipfile

zip_path = "/content/synthetic_retail_data.zip"

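with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Use 'filenames' here: a module-level loop variable named 'files' would
    # overwrite the google.colab 'files' import needed for files.download()
    for root, dirs, filenames in os.walk(output_dir):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            arc_path = os.path.relpath(file_path, output_dir)
            zipf.write(file_path, arcname=os.path.join("synthetic_retail_data", arc_path))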

print(f"Created zip archive at {zip_path}")
print(f"Archive size: {os.path.getsize(zip_path)/1024/1024:.2f} MB")

Created zip archive at /content/synthetic_retail_data.zip
Archive size: 0.08 MB


In [18]:
# Provide download link
print("=== DOWNLOAD LINK ===")
print("Click the folder icon on the left panel to access the files, or use the download link below:")

# Uncomment the following lines to enable direct download
# from google.colab import files
# files.download(zip_path)

print(f"Download path: {zip_path}")
print("To download, either:")
print("1. Click the folder icon on the left, navigate to content/, and download synthetic_retail_data.zip")
print("2. Uncomment the files.download() line in the code cell above and run it")

=== DOWNLOAD LINK ===
Click the folder icon on the left panel to access the files, or use the download link below:
Download path: /content/synthetic_retail_data.zip
To download, either:
1. Click the folder icon on the left, navigate to content/, and download synthetic_retail_data.zip
2. Uncomment the files.download() line in the code cell above and run it


In [20]:
from google.colab import files
import os

def _notebook_path():
    # Attempts to find the path of the current notebook by querying Colab's
    # internal sessions API. The endpoint is undocumented, so reliability may vary.
    # In this run the reported path had the form 'fileId=<id>', which is not a
    # file on the VM's local disk. A more robust approach might involve saving
    # the notebook first.
    import ipykernel
    import requests

    try:
        connection_file = os.path.basename(ipykernel.get_connection_file())
        kernel_id = connection_file.split('-', 1)[1].split('.')[0]
        # Internal Colab API endpoint; the timeout keeps the cell from hanging
        response = requests.get('http://172.28.0.12:9000/api/sessions', timeout=5)
        response.raise_for_status()
        sessions = response.json()
        for session in sessions:
            if session['kernel']['id'] == kernel_id:
                return session['path']
    except (requests.exceptions.RequestException, IndexError, KeyError) as e:
        print(f"Could not determine notebook path automatically: {e}")
    return 'Untitled.ipynb' # Fallback name

notebook_name = _notebook_path()
if not notebook_name.endswith('.ipynb'):
    notebook_name += '.ipynb' # Ensure .ipynb extension

print(f"Attempting to download: {notebook_name}")
files.download(notebook_name)


Attempting to download: fileId=1SzCcEC96qipx3rMDZjUySBAkvBvSy412.ipynb


FileNotFoundError: Cannot find file: fileId=1SzCcEC96qipx3rMDZjUySBAkvBvSy412.ipynb
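
The download fails because the reported name is not a file on the Colab VM's disk. One way to avoid the traceback is to check the path before calling the download function. This is a minimal sketch, assuming a hypothetical `download_if_present` helper; the download callable is passed in so the logic can be exercised outside Colab (inside Colab it would be `files.download`).

```python
import os

def download_if_present(path, download_fn):
    """Call download_fn(path) only when path is an existing file.

    Returns True if the download was attempted, False otherwise.
    """
    if os.path.isfile(path):
        download_fn(path)
        return True
    print(f"Skipping download, no such file on disk: {path}")
    return False

# In Colab: download_if_present(notebook_name, files.download)
```

When the check fails, the notebook itself can still be saved through Colab's File > Download menu.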

# Final Notes

## Token Budget Considerations

When implementing your RAG system:

1. **Chunk Large PDFs**: Split lengthy brochure documents into smaller chunks (500-1000 tokens each) for better retrieval precision.
2. **Context Window Management**: Use only 1-2 product sections per retrieval context to stay within model token limits.
3. **Metadata Filtering**: Leverage metadata in CSV/JSON files (category, price, store_availability) for Pinecone pre-filtering.
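
The chunking advice in item 1 can be sketched without a tokenizer dependency. This is an illustration only: `chunk_text` is a hypothetical helper, and it approximates tokens by whitespace-separated words, which usually yields fewer units than a model tokenizer would count for the same text.

```python
def chunk_text(text, max_tokens=800, overlap=100):
    """Split text into overlapping chunks of at most max_tokens words.

    Words are a rough stand-in for model tokens (an assumption of this sketch).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(2000))
pieces = chunk_text(sample, max_tokens=800, overlap=100)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk; for production use, swapping the word count for the embedding model's own tokenizer would give tighter control over chunk size.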

## File Summary

Exact files generated:
- `/content/synthetic_retail_data/products.csv`
- `/content/synthetic_retail_data/customers.json`
- `/content/synthetic_retail_data/inventory.json`
- `/content/synthetic_retail_data/promotions.json`
- `/content/synthetic_retail_data/product_brochures/*.pdf` (6 files)
- `/content/synthetic_retail_data/images/*.png` (30 files)
- `/content/synthetic_retail_data/README_generated_files.md`
- `/content/synthetic_retail_data.zip` (compressed archive)
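
Before ingestion it can be useful to confirm that the files listed above actually exist. The sketch below assumes a hypothetical `find_missing_outputs` helper; it only checks for presence (at least one PDF and one PNG), not for the exact file counts.

```python
import glob
import os

def find_missing_outputs(base_dir):
    """Return the expected files or glob patterns that are absent under base_dir."""
    required_files = [
        "products.csv",
        "customers.json",
        "inventory.json",
        "promotions.json",
        "README_generated_files.md",
    ]
    required_patterns = [
        os.path.join("product_brochures", "*.pdf"),
        os.path.join("images", "*.png"),
    ]
    missing = [name for name in required_files
               if not os.path.isfile(os.path.join(base_dir, name))]
    missing += [pattern for pattern in required_patterns
                if not glob.glob(os.path.join(base_dir, pattern))]
    return missing

print(find_missing_outputs("/content/synthetic_retail_data"))
```

An empty list means every expected file or pattern was found; any entry in the list names a file or folder pattern to regenerate.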

## Example LangChain Ingestion

```python
# Imports (newer LangChain releases expose these loaders from
# langchain_community.document_loaders instead)
import glob
from langchain.document_loaders import CSVLoader, PyPDFLoader

# Load structured data
loader = CSVLoader(file_path='/content/synthetic_retail_data/products.csv')
product_docs = loader.load()

# Load brochure PDFs (sorted so the document order is stable across runs)
brochure_files = sorted(glob.glob('/content/synthetic_retail_data/product_brochures/*.pdf'))
brochure_docs = []
for file_path in brochure_files:
    loader = PyPDFLoader(file_path)
    brochure_docs.extend(loader.load())
```
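
The loaders above cover the CSV and the PDFs but not the three JSON files. The following is a minimal sketch for turning JSON records into plain text-plus-metadata dictionaries. It assumes only that each file holds either a list of objects or a single object, since the actual schema is not shown here; `json_to_records` is a hypothetical helper.

```python
import json

def json_to_records(path):
    """Load a JSON file and return one {'text', 'metadata'} dict per record."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Treat a top-level object as a single record (assumption of this sketch).
    items = data if isinstance(data, list) else [data]
    return [
        {
            "text": json.dumps(item, ensure_ascii=False, sort_keys=True),
            "metadata": {"source": path, "index": i},
        }
        for i, item in enumerate(items)
    ]

# Example: json_to_records('/content/synthetic_retail_data/customers.json')
```

Each dictionary can then be wrapped in whatever document type the chosen vector store client expects.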

## Suggested Next Steps

1. Create embeddings using OpenAIEmbeddings or similar
2. Upsert documents into Pinecone with metadata
3. Build a retrieval chain with filters
4. Test with sample queries
5. Evaluate retrieval relevance and adjust chunking/embedding strategies
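
Steps 1 to 4 can be prototyped offline before any API keys are involved. The sketch below is a stand-in, not the real stack: `toy_embed` is a hashed bag-of-words vector rather than `OpenAIEmbeddings`, `InMemoryIndex` is a hypothetical class that imitates upsert and filtered query rather than Pinecone, and the sample products and metadata values are made up for illustration.

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Hashed bag-of-words embedding, L2-normalised (stand-in for a real model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class InMemoryIndex:
    """Tiny imitation of a vector index with exact-match metadata filtering."""

    def __init__(self):
        self._items = {}

    def upsert(self, doc_id, text, metadata):
        # Re-using a doc_id overwrites the previous entry.
        self._items[doc_id] = (toy_embed(text), metadata)

    def query(self, text, top_k=3, metadata_filter=None):
        query_vec = toy_embed(text)
        scored = []
        for doc_id, (vec, metadata) in self._items.items():
            if metadata_filter and any(
                metadata.get(key) != value for key, value in metadata_filter.items()
            ):
                continue  # pre-filter on metadata before scoring
            # Dot product of unit vectors equals cosine similarity.
            score = sum(a * b for a, b in zip(query_vec, vec))
            scored.append((score, doc_id))
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]

index = InMemoryIndex()
index.upsert("sku-1", "lightweight trail running shoes", {"category": "footwear"})
index.upsert("sku-2", "waterproof hiking boots", {"category": "footwear"})
index.upsert("sku-3", "running shorts with pockets", {"category": "apparel"})

print(index.query("running shoes", top_k=2))
print(index.query("running", metadata_filter={"category": "apparel"}))
```

Keeping the `upsert` and `query` shape close to what a hosted index offers makes it easier to swap in the real embedding model and vector store once the retrieval tests in step 4 look reasonable.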

---
*Note: All data is synthetic and randomized with a fixed seed for reproducibility.*

In [None]:
#@title Optional GPU Check
#@markdown Uncomment the line below to check if GPU is available (optional)

# !nvidia-smi