# Sales Health Monitor - Data Generation
## Project Overview
This notebook generates realistic sales data for our AI-powered business monitoring system.

### What we'll create:
- 2-3 years of daily sales transactions (~500K-1M records)
- Multiple regions, products, and sales channels
- Realistic business patterns and seasonality
- Intentional data quality issues for cleaning practice

### Generated Tables:
1. **Main Sales Transactions** (~800K records)
2. **Customer Master Data** (~50K records) 
3. **Product Catalog** (~500 records)
4. **Regional Information** (~20 records)

In [3]:
# Data manipulation and generation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from faker import Faker

In [4]:
# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
fake = Faker()
Faker.seed(42)  # Keep faker data consistent too

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## Business Context & Parameters
Before generating data, we need to define our business scenario and realistic parameters.

In [5]:
# Business scenario parameters
print("🏢 BUSINESS SCENARIO SETUP")
print("=" * 50)

# Time period for our data
START_DATE = datetime(2022, 1, 1)
END_DATE = datetime(2024, 12, 31)
TOTAL_DAYS = (END_DATE - START_DATE).days

print(f"📅 Data Period: {START_DATE.strftime('%Y-%m-%d')} to {END_DATE.strftime('%Y-%m-%d')}")
print(f"📊 Total Days: {TOTAL_DAYS} days")

# Geographic regions
REGIONS = ['North', 'South', 'East', 'West', 'Central']

# Product categories
PRODUCT_CATEGORIES = [
    'Electronics', 'Clothing', 'Home & Garden', 
    'Sports & Outdoors', 'Books & Media'
]

# Sales channels
SALES_CHANNELS = ['Online', 'Retail Store', 'Phone Orders', 'Mobile App']

print(f"\n🌍 Regions: {len(REGIONS)} regions")
print(f"📦 Products: {len(PRODUCT_CATEGORIES)} categories") 
print(f"🛒 Channels: {len(SALES_CHANNELS)} channels")

🏢 BUSINESS SCENARIO SETUP
📅 Data Period: 2022-01-01 to 2024-12-31
📊 Total Days: 1095 days

🌍 Regions: 5 regions
📦 Products: 5 categories
🛒 Channels: 4 channels


## Generating Customer Master Data
Creating realistic customer profiles that our sales transactions will reference.

In [6]:
print("👥 GENERATING CUSTOMER MASTER DATA")
print("=" * 50)

# Customer parameters
NUM_CUSTOMERS = 50000  # Realistic customer base for mid-size company

# Customer segments (important for business analysis)
CUSTOMER_SEGMENTS = ['Premium', 'Standard', 'Budget']
SEGMENT_WEIGHTS = [0.15, 0.60, 0.25]  # 15% premium, 60% standard, 25% budget

print(f"🎯 Generating {NUM_CUSTOMERS:,} customers")
print(f"📊 Segments: {dict(zip(CUSTOMER_SEGMENTS, SEGMENT_WEIGHTS))}")

# Generate customer data
customers_data = []

for i in range(NUM_CUSTOMERS):
    # Create realistic customer profile
    customer_id = f"CUST_{i+1:06d}"
    
    # Random acquisition date (customers joined over time)
    acquisition_date = fake.date_between(
        start_date=START_DATE - timedelta(days=365),  # Some customers from before our data period
        end_date=END_DATE - timedelta(days=30)        # No brand new customers
    )
    
    # Assign segment with realistic distribution
    segment = np.random.choice(CUSTOMER_SEGMENTS, p=SEGMENT_WEIGHTS)
    
    # Generate realistic customer info
    first_name = fake.first_name()
    last_name = fake.last_name()
    email = f"{first_name.lower()}.{last_name.lower()}@{fake.domain_name()}"
    
    # Assign to region (customers distributed across regions)
    region = np.random.choice(REGIONS)
    
    customers_data.append({
        'customer_id': customer_id,
        'first_name': first_name,
        'last_name': last_name,
        'email': email,
        'segment': segment,
        'region': region,
        'acquisition_date': acquisition_date
    })

# Create DataFrame
customers_df = pd.DataFrame(customers_data)

print(f"✅ Generated {len(customers_df):,} customer records")
print(f"📅 Acquisition date range: {customers_df['acquisition_date'].min()} to {customers_df['acquisition_date'].max()}")
print("\n📊 Customer segment distribution:")
print(customers_df['segment'].value_counts())

👥 GENERATING CUSTOMER MASTER DATA
🎯 Generating 50,000 customers
📊 Segments: {'Premium': 0.15, 'Standard': 0.6, 'Budget': 0.25}
✅ Generated 50,000 customer records
📅 Acquisition date range: 2021-01-01 to 2024-11-30

📊 Customer segment distribution:
segment
Standard    30115
Budget      12453
Premium      7432
Name: count, dtype: int64


## Generating Product Catalog
Creating our product master data with realistic pricing and categories.

In [7]:
print("📦 GENERATING PRODUCT CATALOG")
print("=" * 50)

# Product parameters
PRODUCTS_PER_CATEGORY = 100  # 100 products per category = 500 total products
TOTAL_PRODUCTS = len(PRODUCT_CATEGORIES) * PRODUCTS_PER_CATEGORY

print(f"🎯 Generating {TOTAL_PRODUCTS} products ({PRODUCTS_PER_CATEGORY} per category)")

# Price ranges by category (realistic business pricing)
CATEGORY_PRICE_RANGES = {
    'Electronics': (50, 2000),      # $50 - $2000
    'Clothing': (15, 300),          # $15 - $300
    'Home & Garden': (10, 500),     # $10 - $500
    'Sports & Outdoors': (20, 800), # $20 - $800
    'Books & Media': (5, 100)       # $5 - $100
}

products_data = []
product_counter = 1

for category in PRODUCT_CATEGORIES:
    min_price, max_price = CATEGORY_PRICE_RANGES[category]
    
    for i in range(PRODUCTS_PER_CATEGORY):
        product_id = f"PROD_{product_counter:04d}"
        
        # Generate realistic product name
        if category == 'Electronics':
            product_name = f"{fake.company()} {random.choice(['Laptop', 'Phone', 'Tablet', 'Camera', 'Speaker'])}"
        elif category == 'Clothing':
            product_name = f"{random.choice(['Premium', 'Classic', 'Sport'])} {random.choice(['Jacket', 'Jeans', 'Shirt', 'Dress', 'Shoes'])}"
        elif category == 'Home & Garden':
            product_name = f"{random.choice(['Deluxe', 'Standard', 'Compact'])} {random.choice(['Chair', 'Table', 'Lamp', 'Plant', 'Tool'])}"
        elif category == 'Sports & Outdoors':
            product_name = f"{random.choice(['Pro', 'Amateur', 'Kids'])} {random.choice(['Ball', 'Racket', 'Bike', 'Gear', 'Equipment'])}"
        else:  # Books & Media
            product_name = f"{fake.catch_phrase()} {random.choice(['Book', 'DVD', 'Game', 'Magazine'])}"
        
        # Realistic pricing with some variation
        base_price = np.random.uniform(min_price, max_price)
        price = round(base_price * np.random.uniform(0.8, 1.2), 2)  # ±20% variation
        
        # Cost (70-85% of price for realistic margins)
        cost = round(price * np.random.uniform(0.70, 0.85), 2)
        
        # Launch date (products launched over time)
        launch_date = fake.date_between(
            start_date=START_DATE - timedelta(days=500),
            end_date=END_DATE - timedelta(days=60)
        )
        
        products_data.append({
            'product_id': product_id,
            'product_name': product_name,
            'category': category,
            'price': price,
            'cost': cost,
            'margin_percent': round((price - cost) / price * 100, 1),
            'launch_date': launch_date
        })
        
        product_counter += 1

# Create DataFrame
products_df = pd.DataFrame(products_data)

print(f"✅ Generated {len(products_df):,} product records")
print(f"💰 Price range: ${products_df['price'].min():.2f} - ${products_df['price'].max():.2f}")
print(f"📊 Average margin: {products_df['margin_percent'].mean():.1f}%")

print("\n📦 Products per category:")
print(products_df['category'].value_counts())

print("\n💡 Sample products:")
print(products_df[['product_id', 'product_name', 'category', 'price']].head(10))

📦 GENERATING PRODUCT CATALOG
🎯 Generating 500 products (100 per category)
✅ Generated 500 product records
💰 Price range: $5.16 - $2229.43
📊 Average margin: 22.5%

📦 Products per category:
category
Electronics          100
Clothing             100
Home & Garden        100
Sports & Outdoors    100
Books & Media        100
Name: count, dtype: int64

💡 Sample products:
  product_id                          product_name     category    price
0  PROD_0001                  Castillo-Diaz Laptop  Electronics   932.32
1  PROD_0002                Mitchell-Martin Laptop  Electronics   162.41
2  PROD_0003  Jennings, Hansen and Figueroa Tablet  Electronics  1709.67
3  PROD_0004           Zhang, Smith and Snow Phone  Electronics  2089.26
4  PROD_0005                  Boone and Sons Phone  Electronics    61.34
5  PROD_0006                     Buckley PLC Phone  Electronics   712.10
6  PROD_0007    Johnson, Martinez and Clark Laptop  Electronics   231.61
7  PROD_0008    Griffin, Flores and Jacobs Speak

## Generating Main Sales Transactions
Creating realistic daily sales with seasonal patterns, regional differences, and intentional anomalies for our ML system to detect.

In [8]:
print("💰 GENERATING SALES TRANSACTIONS")
print("=" * 50)

# Transaction parameters
DAILY_TRANSACTION_TARGET = 800  # Target ~800 transactions per day
TOTAL_EXPECTED_TRANSACTIONS = TOTAL_DAYS * DAILY_TRANSACTION_TARGET

print(f"🎯 Target: ~{DAILY_TRANSACTION_TARGET} transactions/day")
print(f"📊 Expected total: ~{TOTAL_EXPECTED_TRANSACTIONS:,} transactions over {TOTAL_DAYS} days")

# Create date range
date_range = pd.date_range(start=START_DATE, end=END_DATE, freq='D')

print(f"📅 Generating transactions for {len(date_range)} days...")

# This will take a moment - let's track progress
sales_transactions = []
transaction_id_counter = 1

print("\n🔄 Progress:")
for day_idx, current_date in enumerate(date_range):
    # Show progress every 100 days
    if day_idx % 100 == 0:
        print(f"   Day {day_idx+1}/{len(date_range)} ({current_date.strftime('%Y-%m-%d')})")
    
    # Calculate daily transaction count with variations
    base_daily_count = DAILY_TRANSACTION_TARGET
    
    # Day of week patterns (realistic business patterns)
    day_of_week = current_date.weekday()  # 0=Monday, 6=Sunday
    if day_of_week == 6:  # Sunday - lowest sales
        daily_multiplier = 0.4
    elif day_of_week == 5:  # Saturday
        daily_multiplier = 0.6
    elif day_of_week in [0, 1]:  # Monday, Tuesday
        daily_multiplier = 0.8
    else:  # Wed, Thu, Fri - peak days
        daily_multiplier = 1.0
    
    # Seasonal patterns (this is key for realistic data!)
    month = current_date.month
    if month in [11, 12]:  # Holiday season
        seasonal_multiplier = 1.8
    elif month in [1, 2]:  # Post-holiday slowdown
        seasonal_multiplier = 0.6
    elif month in [3, 4, 5]:  # Spring uptick
        seasonal_multiplier = 1.1
    elif month in [6, 7, 8]:  # Summer
        seasonal_multiplier = 1.0
    else:  # Fall
        seasonal_multiplier = 1.2
    
    # Random daily variation (±20%)
    random_multiplier = np.random.uniform(0.8, 1.2)
    
    # Calculate final daily transaction count
    daily_count = int(base_daily_count * daily_multiplier * seasonal_multiplier * random_multiplier)
    daily_count = max(50, daily_count)  # Minimum 50 transactions per day
    
    # Generate transactions for this day
    for trans_idx in range(daily_count):
        transaction_id = f"TXN_{transaction_id_counter:08d}"
        
        # Select random customer, product, region, channel
        customer = customers_df.sample(1).iloc[0]
        product = products_df.sample(1).iloc[0]
        region = np.random.choice(REGIONS)
        channel = np.random.choice(SALES_CHANNELS)
        
        # Calculate realistic quantities (most orders are 1-3 items)
        if np.random.random() < 0.7:  # 70% are single items
            quantity = 1
        elif np.random.random() < 0.9:  # 20% are 2-3 items
            quantity = np.random.randint(2, 4)
        else:  # 10% are larger orders
            quantity = np.random.randint(4, 10)
        
        # Calculate pricing
        base_price = product['price']
        
        # Random discounts (realistic business practice)
        if np.random.random() < 0.15:  # 15% of transactions have discounts
            discount_percent = np.random.uniform(5, 25)  # 5-25% discount
            final_price = base_price * (1 - discount_percent/100)
        else:
            discount_percent = 0
            final_price = base_price
        
        # Total transaction amount
        total_amount = round(final_price * quantity, 2)
        
        # Random transaction time during the day
        hour = np.random.randint(6, 23)  # Business hours 6 AM to 11 PM
        minute = np.random.randint(0, 60)
        transaction_datetime = current_date.replace(hour=hour, minute=minute)
        
        sales_transactions.append({
            'transaction_id': transaction_id,
            'transaction_date': current_date,
            'transaction_datetime': transaction_datetime,
            'customer_id': customer['customer_id'],
            'customer_segment': customer['segment'],
            'product_id': product['product_id'],
            'product_category': product['category'],
            'region': region,
            'sales_channel': channel,
            'quantity': quantity,
            'unit_price': round(final_price, 2),
            'discount_percent': round(discount_percent, 1),
            'total_amount': total_amount
        })
        
        transaction_id_counter += 1

print(f"\n✅ Generated {len(sales_transactions):,} sales transactions!")

💰 GENERATING SALES TRANSACTIONS
🎯 Target: ~800 transactions/day
📊 Expected total: ~876,000 transactions over 1095 days
📅 Generating transactions for 1096 days...

🔄 Progress:
   Day 1/1096 (2022-01-01)
   Day 101/1096 (2022-04-11)
   Day 201/1096 (2022-07-20)
   Day 301/1096 (2022-10-28)
   Day 401/1096 (2023-02-05)
   Day 501/1096 (2023-05-16)
   Day 601/1096 (2023-08-24)
   Day 701/1096 (2023-12-02)
   Day 801/1096 (2024-03-11)
   Day 901/1096 (2024-06-19)
   Day 1001/1096 (2024-09-27)

✅ Generated 793,505 sales transactions!


## Convert to DataFrame & Quick Analysis
Let's analyze our generated sales data to verify patterns.

In [9]:
print("📊 CONVERTING TO DATAFRAME & ANALYSIS")
print("=" * 50)

# Convert to DataFrame
sales_df = pd.DataFrame(sales_transactions)

print(f"✅ Created DataFrame with {len(sales_df):,} rows and {len(sales_df.columns)} columns")
print(f"💾 Memory usage: {sales_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Quick data analysis
print("\n📈 TRANSACTION SUMMARY:")
print(f"   Date range: {sales_df['transaction_date'].min()} to {sales_df['transaction_date'].max()}")
print(f"   Total revenue: ${sales_df['total_amount'].sum():,.2f}")
print(f"   Average transaction: ${sales_df['total_amount'].mean():.2f}")
print(f"   Total customers: {sales_df['customer_id'].nunique():,}")
print(f"   Total products sold: {sales_df['product_id'].nunique():,}")

print("\n🛒 CHANNEL DISTRIBUTION:")
print(sales_df['sales_channel'].value_counts())

print("\n🏢 REGIONAL DISTRIBUTION:")
print(sales_df['region'].value_counts())

print("\n📦 CATEGORY DISTRIBUTION:")
print(sales_df['product_category'].value_counts())

print("\n💰 REVENUE BY SEGMENT:")
segment_revenue = sales_df.groupby('customer_segment')['total_amount'].agg(['count', 'sum', 'mean'])
print(segment_revenue)

# Check for seasonal patterns (this proves our logic worked!)
print("\n📅 MONTHLY SALES PATTERN:")
monthly_sales = sales_df.groupby(sales_df['transaction_date'].dt.month)['total_amount'].sum()
print(monthly_sales)

print(f"\n🎯 November sales: ${monthly_sales[11]:,.2f}")
print(f"🎯 December sales: ${monthly_sales[12]:,.2f}")
print(f"🎯 January sales: ${monthly_sales[1]:,.2f}")
print("📊 Notice the holiday peak and post-holiday drop - our seasonal logic worked!")

📊 CONVERTING TO DATAFRAME & ANALYSIS
✅ Created DataFrame with 793,505 rows and 13 columns
💾 Memory usage: 400.9 MB

📈 TRANSACTION SUMMARY:
   Date range: 2022-01-01 00:00:00 to 2024-12-31 00:00:00
   Total revenue: $469,040,584.24
   Average transaction: $591.10
   Total customers: 50,000
   Total products sold: 500

🛒 CHANNEL DISTRIBUTION:
sales_channel
Retail Store    198925
Online          198427
Phone Orders    198119
Mobile App      198034
Name: count, dtype: int64

🏢 REGIONAL DISTRIBUTION:
region
West       159252
East       159069
Central    158476
North      158436
South      158272
Name: count, dtype: int64

📦 CATEGORY DISTRIBUTION:
product_category
Clothing             158952
Electronics          158746
Books & Media        158647
Home & Garden        158584
Sports & Outdoors    158576
Name: count, dtype: int64

💰 REVENUE BY SEGMENT:
                   count           sum        mean
customer_segment                                  
Budget            197208  1.166416e+08  59

## Adding Data Quality Issues for Cleaning Practice
Now we'll intentionally introduce realistic data problems that occur in real business systems.

In [10]:
print("🧹 ADDING DATA QUALITY ISSUES")
print("=" * 50)

# Create a copy for corrupting (keep original clean)
corrupted_sales_df = sales_df.copy()

print(f"📊 Starting with {len(corrupted_sales_df):,} clean records")

# Issue 1: Missing customer IDs (2% - realistic system integration issues)
missing_customer_count = int(len(corrupted_sales_df) * 0.02)
missing_customer_indices = np.random.choice(corrupted_sales_df.index, missing_customer_count, replace=False)
corrupted_sales_df.loc[missing_customer_indices, 'customer_id'] = None

print(f"❌ Added {missing_customer_count:,} missing customer IDs (2%)")

# Issue 2: Duplicate transactions (1% - system glitches)
duplicate_count = int(len(corrupted_sales_df) * 0.01)
duplicate_indices = np.random.choice(corrupted_sales_df.index, duplicate_count, replace=False)
duplicated_rows = corrupted_sales_df.loc[duplicate_indices].copy()
# Change transaction_id but keep everything else same (realistic duplicate scenario)
duplicated_rows['transaction_id'] = duplicated_rows['transaction_id'] + '_DUP'
corrupted_sales_df = pd.concat([corrupted_sales_df, duplicated_rows], ignore_index=True)

print(f"❌ Added {duplicate_count:,} duplicate transactions (1%)")

# Issue 3: Impossible negative quantities (0.1% - data entry errors)
negative_qty_count = int(len(corrupted_sales_df) * 0.001)
negative_qty_indices = np.random.choice(corrupted_sales_df.index, negative_qty_count, replace=False)
corrupted_sales_df.loc[negative_qty_indices, 'quantity'] = -1 * corrupted_sales_df.loc[negative_qty_indices, 'quantity']

print(f"❌ Added {negative_qty_count:,} negative quantities (0.1%)")

# Issue 4: Future dates (0.05% - system clock issues)
future_date_count = int(len(corrupted_sales_df) * 0.0005)
future_date_indices = np.random.choice(corrupted_sales_df.index, future_date_count, replace=False)
future_date = datetime(2025, 6, 15)  # Future date
corrupted_sales_df.loc[future_date_indices, 'transaction_date'] = future_date

print(f"❌ Added {future_date_count:,} future dates (0.05%)")

# Issue 5: Inconsistent region names (1% - data entry variations)
inconsistent_region_count = int(len(corrupted_sales_df) * 0.01)
inconsistent_region_indices = np.random.choice(corrupted_sales_df.index, inconsistent_region_count, replace=False)
# Create variations of existing regions
region_variations = {
    'North': ['NORTH', 'north', 'N', 'Northern'],
    'South': ['SOUTH', 'south', 'S', 'Southern'],
    'East': ['EAST', 'east', 'E', 'Eastern'],
    'West': ['WEST', 'west', 'W', 'Western'],
    'Central': ['CENTRAL', 'central', 'C', 'Centre']
}

for idx in inconsistent_region_indices:
    current_region = corrupted_sales_df.loc[idx, 'region']
    if current_region in region_variations:
        new_region = np.random.choice(region_variations[current_region])
        corrupted_sales_df.loc[idx, 'region'] = new_region

print(f"❌ Added {inconsistent_region_count:,} inconsistent region names (1%)")

# Issue 6: Extreme outliers (0.2% - data entry or system errors)
outlier_count = int(len(corrupted_sales_df) * 0.002)
outlier_indices = np.random.choice(corrupted_sales_df.index, outlier_count, replace=False)
# Create unrealistic high amounts (10x-100x normal)
corrupted_sales_df.loc[outlier_indices, 'total_amount'] = corrupted_sales_df.loc[outlier_indices, 'total_amount'] * np.random.uniform(10, 100, size=outlier_count)

print(f"❌ Added {outlier_count:,} extreme outliers (0.2%)")

print(f"\n✅ Final corrupted dataset: {len(corrupted_sales_df):,} records")
print(f"📈 Added {len(corrupted_sales_df) - len(sales_df):,} total data quality issues")

# Quick summary of issues
print(f"\n🔍 DATA QUALITY ISSUES SUMMARY:")
print(f"   Missing customer IDs: {corrupted_sales_df['customer_id'].isnull().sum():,}")
print(f"   Negative quantities: {(corrupted_sales_df['quantity'] < 0).sum():,}")
print(f"   Future dates: {(corrupted_sales_df['transaction_date'] > datetime.now()).sum():,}")
print(f"   Unique regions: {corrupted_sales_df['region'].nunique()} (should be 5)")
print(f"   Max transaction amount: ${corrupted_sales_df['total_amount'].max():,.2f}")

🧹 ADDING DATA QUALITY ISSUES
📊 Starting with 793,505 clean records
❌ Added 15,870 missing customer IDs (2%)
❌ Added 7,935 duplicate transactions (1%)
❌ Added 801 negative quantities (0.1%)
❌ Added 400 future dates (0.05%)
❌ Added 8,014 inconsistent region names (1%)
❌ Added 1,602 extreme outliers (0.2%)

✅ Final corrupted dataset: 801,440 records
📈 Added 7,935 total data quality issues

🔍 DATA QUALITY ISSUES SUMMARY:
   Missing customer IDs: 16,041
   Negative quantities: 801
   Future dates: 0
   Unique regions: 25 (should be 5)
   Max transaction amount: $1,255,612.50


## Save All Generated Datasets
Saving clean and corrupted datasets to appropriate folders for the next phase of our project.

In [11]:
print("💾 SAVING ALL DATASETS")
print("=" * 50)

import os

# Create directories if they don't exist
os.makedirs('../Dataset/raw', exist_ok=True)
os.makedirs('../Dataset/processed', exist_ok=True)
os.makedirs('../Dataset/sample', exist_ok=True)

print("📁 Created/verified directory structure")

# Save clean master data
print("\n🔄 Saving clean master datasets...")

customers_df.to_csv('../Dataset/raw/customers.csv', index=False)
print(f"✅ Saved customers.csv: {len(customers_df):,} records")

products_df.to_csv('../Dataset/raw/products.csv', index=False)
print(f"✅ Saved products.csv: {len(products_df):,} records")

sales_df.to_csv('../Dataset/raw/sales_clean.csv', index=False)
print(f"✅ Saved sales_clean.csv: {len(sales_df):,} records")

# Save corrupted dataset for cleaning practice
corrupted_sales_df.to_csv('../Dataset/raw/sales_corrupted.csv', index=False)
print(f"✅ Saved sales_corrupted.csv: {len(corrupted_sales_df):,} records")

# Create sample datasets for GitHub (to avoid large file uploads)
print("\n🔄 Creating sample datasets for GitHub...")

# Sample 1000 customers
customers_sample = customers_df.sample(n=1000, random_state=42)
customers_sample.to_csv('../Dataset/sample/customers_sample.csv', index=False)
print(f"✅ Saved customers_sample.csv: {len(customers_sample):,} records")

# Sample 100 products  
products_sample = products_df.sample(n=100, random_state=42)
products_sample.to_csv('../Dataset/sample/products_sample.csv', index=False)
print(f"✅ Saved products_sample.csv: {len(products_sample):,} records")

# Sample 5000 sales transactions
sales_sample = corrupted_sales_df.sample(n=5000, random_state=42)
sales_sample.to_csv('../Dataset/sample/sales_sample.csv', index=False)
print(f"✅ Saved sales_sample.csv: {len(sales_sample):,} records")

# Check file sizes
print("\n📊 FILE SIZES:")
import os

def get_file_size_mb(filepath):
    return os.path.getsize(filepath) / (1024 * 1024)

files_to_check = [
    '../Dataset/raw/customers.csv',
    '../Dataset/raw/products.csv', 
    '../Dataset/raw/sales_clean.csv',
    '../Dataset/raw/sales_corrupted.csv'
]

total_size = 0
for file_path in files_to_check:
    if os.path.exists(file_path):
        size_mb = get_file_size_mb(file_path)
        total_size += size_mb
        print(f"   {os.path.basename(file_path)}: {size_mb:.1f} MB")

print(f"\n💾 Total dataset size: {total_size:.1f} MB")

# Create data summary for next notebook
data_summary = {
    'generation_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'customers': len(customers_df),
    'products': len(products_df),
    'clean_transactions': len(sales_df),
    'corrupted_transactions': len(corrupted_sales_df),
    'date_range': f"{sales_df['transaction_date'].min()} to {sales_df['transaction_date'].max()}",
    'total_revenue_clean': sales_df['total_amount'].sum(),
    'data_quality_issues': {
        'missing_customer_ids': corrupted_sales_df['customer_id'].isnull().sum(),
        'duplicate_transactions': len(corrupted_sales_df) - len(sales_df),
        'negative_quantities': (corrupted_sales_df['quantity'] < 0).sum(),
        'inconsistent_regions': corrupted_sales_df['region'].nunique() - 5,
        'extreme_outliers': (corrupted_sales_df['total_amount'] > corrupted_sales_df['total_amount'].quantile(0.99)).sum()
    }
}

# Save summary for next notebook
import json
with open('../Dataset/raw/data_generation_summary.json', 'w') as f:
    json.dump(data_summary, f, indent=2, default=str)

print(f"\n✅ Saved data generation summary")
print(f"\n🎯 READY FOR NEXT PHASE:")
print(f"   📂 All datasets saved to Dataset/raw/")
print(f"   📊 Sample files created for GitHub")
print(f"   📋 Summary file for next notebook")
print(f"   🧹 Ready to start data cleaning!")

💾 SAVING ALL DATASETS
📁 Created/verified directory structure

🔄 Saving clean master datasets...
✅ Saved customers.csv: 50,000 records
✅ Saved products.csv: 500 records
✅ Saved sales_clean.csv: 793,505 records
✅ Saved sales_corrupted.csv: 801,440 records

🔄 Creating sample datasets for GitHub...
✅ Saved customers_sample.csv: 1,000 records
✅ Saved products_sample.csv: 100 records
✅ Saved sales_sample.csv: 5,000 records

📊 FILE SIZES:
   customers.csv: 3.8 MB
   products.csv: 0.0 MB
   sales_clean.csv: 94.8 MB
   sales_corrupted.csv: 95.7 MB

💾 Total dataset size: 194.4 MB

✅ Saved data generation summary

🎯 READY FOR NEXT PHASE:
   📂 All datasets saved to Dataset/raw/
   📊 Sample files created for GitHub
   📋 Summary file for next notebook
   🧹 Ready to start data cleaning!


## Data Generation Complete! 
### 📊 What We Accomplished:

**Generated Datasets:**
- **50,000 customers** across 5 regions with realistic segments
- **500 products** across 5 categories with proper pricing
- **793,505 clean transactions** with seasonal patterns
- **801,440 corrupted transactions** with 6 types of data quality issues

**Business Patterns Created:**
- ✅ Seasonal trends (holiday spikes, post-holiday drops)
- ✅ Weekly patterns (weekday vs weekend differences)  
- ✅ Regional distributions (balanced across all regions)
- ✅ Customer segments (Premium, Standard, Budget)
- ✅ Realistic pricing and margins

**Data Quality Issues for Cleaning Practice:**
- ❌ Missing customer IDs (16,041 records)
- ❌ Duplicate transactions (7,935 duplicates)
- ❌ Negative quantities (801 records)
- ❌ Inconsistent region names (25 variations instead of 5)
- ❌ Extreme outliers ($1.2M max transaction)

### 🎯 Next Steps:
Move to `02_Data_Cleaning_Validation.ipynb` to identify and fix all these issues!

In [12]:
print("🎉 DATA GENERATION PHASE COMPLETE!")
print("=" * 50)

# Final project status
print("✅ PHASE 1 COMPLETE: Data Generation")
print("   - Realistic business data created")
print("   - Seasonal patterns implemented") 
print("   - Data quality issues introduced")
print("   - All files saved successfully")

print("\n🎯 NEXT PHASE: Data Cleaning & Validation")
print("   - Load corrupted data")
print("   - Identify all quality issues")
print("   - Clean and validate data")
print("   - Prepare for ML/AI components")

print(f"\n📊 PROJECT METRICS:")
print(f"   Total Records Generated: {len(sales_df) + len(customers_df) + len(products_df):,}")
print(f"   Data Quality Issues: {len(corrupted_sales_df) - len(sales_df):,}")
print(f"   Files Created: 8 datasets")
print(f"   Storage Used: 194.4 MB")

print(f"\n🏆 READY TO MOVE TO NOTEBOOK 02!")

🎉 DATA GENERATION PHASE COMPLETE!
✅ PHASE 1 COMPLETE: Data Generation
   - Realistic business data created
   - Seasonal patterns implemented
   - Data quality issues introduced
   - All files saved successfully

🎯 NEXT PHASE: Data Cleaning & Validation
   - Load corrupted data
   - Identify all quality issues
   - Clean and validate data
   - Prepare for ML/AI components

📊 PROJECT METRICS:
   Total Records Generated: 844,005
   Data Quality Issues: 7,935
   Files Created: 8 datasets
   Storage Used: 194.4 MB

🏆 READY TO MOVE TO NOTEBOOK 02!
