# Data Generation and Exploration

This notebook generates synthetic consumer goods sales data, explores its basic characteristics, and saves it for downstream analytics.

## Objectives
- Generate realistic synthetic sales data
- Explore data characteristics
- Prepare data for segmentation and forecasting analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import sys
import os

# Add src directory to path
sys.path.append('../src')

from data_generator import generate_synthetic_data, calculate_rfm_features

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## 1. Generate Synthetic Data

In [None]:
# Generate synthetic consumer goods data
print("Generating synthetic data...")
data = generate_synthetic_data(n_customers=1000, n_transactions=5000)

print(f"Generated {len(data)} transactions for {data['customer_id'].nunique()} customers")
print(f"Date range: {data['transaction_date'].min()} to {data['transaction_date'].max()}")

# Display basic info
data.info()
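
The implementation of `generate_synthetic_data` lives in `../src/data_generator.py` and is not shown in this notebook. As a rough illustration only, a generator of this kind could look like the sketch below. It assumes nothing beyond the column names this notebook uses (`customer_id`, `transaction_date`, `category`, `quantity`, `total_amount`). The function name `sketch_synthetic_data`, the category labels, the `unit_price` column, the price distribution, and the 2023 date range are all hypothetical and may differ from the real module.

```python
import numpy as np
import pandas as pd


def sketch_synthetic_data(n_customers=1000, n_transactions=5000, seed=42):
    """Hypothetical stand-in for generate_synthetic_data (assumed schema)."""
    rng = np.random.default_rng(seed)

    # Assumed category labels; the real generator may use different ones.
    categories = ["Beverages", "Snacks", "Household", "Personal Care"]

    # Assumed date range: transactions spread uniformly over one year.
    start = pd.Timestamp("2023-01-01")
    day_offsets = rng.integers(0, 365, size=n_transactions)

    df = pd.DataFrame({
        "customer_id": rng.integers(1, n_customers + 1, size=n_transactions),
        "transaction_date": start + pd.to_timedelta(day_offsets, unit="D"),
        "category": rng.choice(categories, size=n_transactions),
        "quantity": rng.integers(1, 6, size=n_transactions),
        # Log-normal prices give a right-skewed, strictly positive distribution.
        "unit_price": rng.lognormal(mean=2.0, sigma=0.5, size=n_transactions).round(2),
    })
    df["total_amount"] = (df["quantity"] * df["unit_price"]).round(2)
    return df.sort_values("transaction_date").reset_index(drop=True)


sample = sketch_synthetic_data(n_customers=50, n_transactions=200, seed=0)
print(sample.head())
```

Fixing the seed makes the sample reproducible across runs, which helps when later notebooks depend on the saved files.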

In [None]:
# Display sample data
print("Sample transaction data:")
data.head(10)

## 2. Data Exploration

In [None]:
# Basic statistics
print("=== BASIC STATISTICS ===")
print(f"Total Revenue: ${data['total_amount'].sum():,.2f}")
print(f"Average Order Value: ${data['total_amount'].mean():.2f}")
print(f"Total Customers: {data['customer_id'].nunique():,}")
print(f"Total Transactions: {len(data):,}")
print(f"Average Transactions per Customer: {len(data) / data['customer_id'].nunique():.1f}")

# Category breakdown
print("\n=== CATEGORY BREAKDOWN ===")
category_stats = data.groupby('category').agg({
    'total_amount': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)
category_stats.columns = ['Total_Revenue', 'Avg_Order_Value', 'Transactions', 'Total_Quantity']
print(category_stats)
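
The category breakdown above aggregates with a dictionary and then overwrites the resulting MultiIndex columns by position. As an optional alternative, pandas named aggregation assigns the output column names directly, so they cannot drift out of order. The sketch below uses a small hypothetical DataFrame (`toy`) with the same column names so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the notebook's `data`.
toy = pd.DataFrame({
    "category": ["Snacks", "Snacks", "Beverages"],
    "total_amount": [10.0, 20.0, 5.0],
    "quantity": [1, 2, 1],
})

# Named aggregation: output_name=(input_column, aggregation_function).
category_stats_named = toy.groupby("category").agg(
    Total_Revenue=("total_amount", "sum"),
    Avg_Order_Value=("total_amount", "mean"),
    Transactions=("total_amount", "count"),
    Total_Quantity=("quantity", "sum"),
).round(2)

print(category_stats_named)
```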

## 3. Save Data

Let's save the generated data for use in other notebooks.

In [None]:
# Calculate RFM features
rfm_data = calculate_rfm_features(data)

# Create data directory if it doesn't exist
os.makedirs('../data', exist_ok=True)

# Save datasets
data.to_csv('../data/synthetic_data.csv', index=False)
rfm_data.to_csv('../data/processed_data.csv', index=False)

print("✅ Data saved successfully!")
print(f"📁 Synthetic data: ../data/synthetic_data.csv ({len(data)} rows)")
print(f"📁 Processed data: ../data/processed_data.csv ({len(rfm_data)} rows)")
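
`calculate_rfm_features` is imported from `data_generator` and its implementation is not shown here. For readers unfamiliar with RFM, the sketch below shows one common way to derive recency, frequency, and monetary value per customer. The function name `sketch_rfm`, the output column names, and the choice of reference date (one day after the last transaction) are assumptions and may not match the real function.

```python
import pandas as pd


def sketch_rfm(transactions, reference_date=None):
    """Hypothetical RFM computation; not necessarily what calculate_rfm_features does."""
    if reference_date is None:
        # Assumption: measure recency from the day after the latest transaction.
        reference_date = transactions["transaction_date"].max() + pd.Timedelta(days=1)

    rfm = transactions.groupby("customer_id").agg(
        last_purchase=("transaction_date", "max"),
        frequency=("transaction_date", "count"),  # number of transactions
        monetary=("total_amount", "sum"),         # total spend
    )
    # Recency: whole days between the reference date and the last purchase.
    rfm["recency"] = (reference_date - rfm["last_purchase"]).dt.days
    rfm = rfm.drop(columns="last_purchase").reset_index()
    return rfm[["customer_id", "recency", "frequency", "monetary"]]


# Hypothetical toy transactions to illustrate the output shape.
toy_tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "transaction_date": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-01-05"]),
    "total_amount": [10.0, 20.0, 5.0],
})
rfm_toy = sketch_rfm(toy_tx)
print(rfm_toy)
```

The result has one row per customer, which is why the processed file is expected to contain far fewer rows than the raw transaction file.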

## Summary

This notebook successfully:

1. **Generated synthetic data** with realistic consumer goods sales patterns
2. **Explored data characteristics** including total revenue, average order value, and a per-category breakdown
3. **Calculated RFM features** for customer segmentation analysis
4. **Saved the raw and processed data** to `../data/` for use in subsequent analysis

**Next Steps**: 
- Proceed to customer segmentation analysis
- Develop demand forecasting models
- Build interactive dashboard