# E-Commerce Pattern Recognition Analysis
## Building AI-Powered Customer Data Intelligence

This notebook demonstrates advanced pattern recognition techniques for e-commerce behavioral data, similar to what Segment provides but enhanced with AI-powered insights.

### Dataset Overview
- **67M+ events** from November 2019
- **Event types**: view (94%), cart (4%), purchase (1%), remove_from_cart
- **Key fields**: user_id, product_id, category, brand, price, timestamp
- **Goal**: Extract actionable patterns for customer segmentation and personalization

In [62]:
# Import required libraries for comprehensive analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Data processing and ML libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Statistical libraries
from scipy.stats import chi2_contingency
import networkx as nx
# Install mlxtend if not available
try:
    from mlxtend.frequent_patterns import association_rules, apriori
except ImportError:
    import subprocess
    import sys
    print("📦 Installing mlxtend...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mlxtend"])
    from mlxtend.frequent_patterns import association_rules, apriori
    print("✅ mlxtend installed and imported successfully!")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"📊 Ready for pattern analysis on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries imported successfully!
📊 Ready for pattern analysis on 2025-10-25 15:18:39


## 1. Data Loading and Initial Exploration

Load the e-commerce events data from Supabase and perform initial analysis to understand the data structure and quality.

In [63]:
# Load environment variables and connect to Supabase
import os
from dotenv import load_dotenv
import requests

load_dotenv()

# Supabase configuration
SUPABASE_URL = os.getenv('SUPABASE_URL')
SUPABASE_ANON_KEY = os.getenv('SUPABASE_ANON_KEY')

def load_ecommerce_data(limit=100000):
    """Load e-commerce events data from Supabase"""
    
    headers = {
        "apikey": SUPABASE_ANON_KEY,
        "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
    }
    
    # Query parameters for data loading
    params = {
        "select": "*",
        "order": "event_time.asc",
        "limit": str(limit)
    }
    
    try:
        print(f"🔄 Loading {limit:,} records from Supabase...")
        response = requests.get(
            f"{SUPABASE_URL}/rest/v1/ecommerce_events",
            headers=headers,
            params=params
        )
        
        if response.status_code == 200:
            data = response.json()
            df = pd.DataFrame(data)
            print(f"✅ Successfully loaded {len(df):,} records")
            return df
        else:
            print(f"❌ Error: Status {response.status_code}")
            return None
            
    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return None

# Load the data
df = load_ecommerce_data()

if df is not None:
    print(f"\n📊 Dataset Shape: {df.shape}")
    print(f"📅 Date Range: {df['event_time'].min()} to {df['event_time'].max()}")
    print(f"👥 Unique Users: {df['user_id'].nunique():,}")
    print(f"🛍️ Unique Products: {df['product_id'].nunique():,}")
    print(f"📈 Event Types: {df['event_type'].value_counts().to_dict()}")

🔄 Loading 100,000 records from Supabase...
✅ Successfully loaded 1,000 records

📊 Dataset Shape: (1000, 11)
📅 Date Range: 2019-11-01T00:00:00+00:00 to 2019-11-01T00:03:23+00:00
👥 Unique Users: 199
🛍️ Unique Products: 376
📈 Event Types: {'view': 994, 'purchase': 4, 'cart': 2}
✅ Successfully loaded 1,000 records

📊 Dataset Shape: (1000, 11)
📅 Date Range: 2019-11-01T00:00:00+00:00 to 2019-11-01T00:03:23+00:00
👥 Unique Users: 199
🛍️ Unique Products: 376
📈 Event Types: {'view': 994, 'purchase': 4, 'cart': 2}


In [64]:
# Check actual table size and Supabase limits
def check_supabase_limits():
    """Check total records and Supabase query limits"""
    
    headers = {
        "apikey": SUPABASE_ANON_KEY,
        "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
        "Prefer": "count=exact"
    }
    
    print("🔍 CHECKING SUPABASE LIMITS")
    print("=" * 40)
    
    try:
        # Get total record count
        response = requests.head(
            f"{SUPABASE_URL}/rest/v1/ecommerce_events",
            headers=headers
        )
        
        if response.status_code == 200:
            content_range = response.headers.get('content-range', '')
            print(f"📊 Content-Range Header: {content_range}")
            
            if content_range:
                total = content_range.split('/')[-1]
                print(f"📈 Total Records in Table: {total:,}")
            else:
                print("⚠️ No content-range header found")
        
        # Test different limit sizes
        test_limits = [1000, 5000, 10000, 50000]
        
        for test_limit in test_limits:
            print(f"\n🧪 Testing limit={test_limit:,}...")
            
            params = {
                "select": "id",
                "limit": str(test_limit)
            }
            
            test_response = requests.get(
                f"{SUPABASE_URL}/rest/v1/ecommerce_events",
                headers={"apikey": SUPABASE_ANON_KEY, "Authorization": f"Bearer {SUPABASE_ANON_KEY}"},
                params=params,
                timeout=30  # Add timeout
            )
            
            if test_response.status_code == 200:
                actual_count = len(test_response.json())
                print(f"  ✅ Requested: {test_limit:,} | Received: {actual_count:,}")
                
                if actual_count < test_limit:
                    print(f"  🚨 Hit limit at {actual_count:,} records")
                    break
            else:
                print(f"  ❌ Failed: Status {test_response.status_code}")
                break
    
    except Exception as e:
        print(f"❌ Error checking limits: {e}")

# Run the limit check
check_supabase_limits()

🔍 CHECKING SUPABASE LIMITS

🧪 Testing limit=1,000...
  ✅ Requested: 1,000 | Received: 1,000

🧪 Testing limit=5,000...

🧪 Testing limit=1,000...
  ✅ Requested: 1,000 | Received: 1,000

🧪 Testing limit=5,000...
  ✅ Requested: 5,000 | Received: 1,000
  🚨 Hit limit at 1,000 records
  ✅ Requested: 5,000 | Received: 1,000
  🚨 Hit limit at 1,000 records


In [65]:
# Enhanced data loading with pagination support
def load_large_dataset(total_limit=100000, batch_size=1000):
    """
    Load large datasets using pagination to bypass Supabase limits
    """
    
    headers = {
        "apikey": SUPABASE_ANON_KEY,
        "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
    }
    
    all_data = []
    offset = 0
    
    print(f"🚀 LOADING LARGE DATASET")
    print(f"Target: {total_limit:,} records in batches of {batch_size:,}")
    print("=" * 50)
    
    while len(all_data) < total_limit:
        current_batch_size = min(batch_size, total_limit - len(all_data))
        
        params = {
            "select": "*",
            "order": "event_time.asc",
            "limit": str(current_batch_size),
            "offset": str(offset)
        }
        
        try:
            print(f"📥 Fetching batch {offset//batch_size + 1}: offset={offset:,}, limit={current_batch_size:,}")
            
            response = requests.get(
                f"{SUPABASE_URL}/rest/v1/ecommerce_events",
                headers=headers,
                params=params,
                timeout=60
            )
            
            if response.status_code == 200:
                batch_data = response.json()
                
                if not batch_data:  # No more data
                    print(f"📊 No more data available. Total loaded: {len(all_data):,}")
                    break
                
                all_data.extend(batch_data)
                offset += len(batch_data)
                
                print(f"  ✅ Loaded {len(batch_data):,} records | Total: {len(all_data):,}")
                
                # If we got fewer records than requested, we've hit the end
                if len(batch_data) < current_batch_size:
                    print(f"📊 Reached end of data. Final count: {len(all_data):,}")
                    break
                    
            else:
                print(f"❌ Error: Status {response.status_code}")
                print(f"Response: {response.text}")
                break
                
        except Exception as e:
            print(f"❌ Error in batch {offset//batch_size + 1}: {e}")
            break
    
    if all_data:
        df = pd.DataFrame(all_data)
        print(f"\n✅ Successfully loaded {len(df):,} records")
        print(f"📊 Shape: {df.shape}")
        return df
    else:
        print("❌ No data loaded")
        return None

# Try the enhanced loader
print("🎯 Attempting to load larger dataset...")
df_large = load_large_dataset(total_limit=50000, batch_size=1000)

🎯 Attempting to load larger dataset...
🚀 LOADING LARGE DATASET
Target: 50,000 records in batches of 1,000
📥 Fetching batch 1: offset=0, limit=1,000
  ✅ Loaded 1,000 records | Total: 1,000
📥 Fetching batch 2: offset=1,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 1,000
📥 Fetching batch 2: offset=1,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 2,000
📥 Fetching batch 3: offset=2,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 2,000
📥 Fetching batch 3: offset=2,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 3,000
📥 Fetching batch 4: offset=3,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 3,000
📥 Fetching batch 4: offset=3,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 4,000
📥 Fetching batch 5: offset=4,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 4,000
📥 Fetching batch 5: offset=4,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 5,000
📥 Fetching batch 6: offset=5,000, limit=1,000
  ✅ Loaded 1,000 records | Total: 5,000
📥 Fetching batch 6: offset=5,000, limi

In [66]:
df = df_large # <-- use the large dset?

In [67]:
# Initial data exploration and quality assessment
if df is not None:
    print("🔍 DATA QUALITY ASSESSMENT")
    print("=" * 50)
    
    # Basic info
    print(f"\n📋 Column Information:")
    print(df.info())
    
    # Missing values analysis
    print(f"\n❓ Missing Values:")
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df)) * 100
    missing_summary = pd.DataFrame({
        'Missing Count': missing_data,
        'Missing %': missing_pct
    })
    print(missing_summary[missing_summary['Missing Count'] > 0])
    
    # Data types and sample values
    print(f"\n📊 Sample Data (First 5 rows):")
    display(df.head())
    
    # Statistical summary
    print(f"\n📈 Statistical Summary:")
    display(df.describe(include='all'))
    
    # Event type distribution
    event_dist = df['event_type'].value_counts(normalize=True) * 100
    print(f"\n🎯 Event Distribution:")
    for event, pct in event_dist.items():
        print(f"  {event}: {pct:.1f}%")
    
    # Price analysis
    if 'price' in df.columns:
        print(f"\n💰 Price Statistics:")
        print(f"  Average: ${df['price'].mean():.2f}")
        print(f"  Median: ${df['price'].median():.2f}")
        print(f"  Range: ${df['price'].min():.2f} - ${df['price'].max():.2f}")
        print(f"  Null prices: {df['price'].isnull().sum():,} ({df['price'].isnull().mean()*100:.1f}%)")

🔍 DATA QUALITY ASSESSMENT

📋 Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             50000 non-null  int64  
 1   event_time     50000 non-null  object 
 2   event_type     50000 non-null  object 
 3   product_id     50000 non-null  int64  
 4   category_id    50000 non-null  int64  
 5   category_code  32122 non-null  object 
 6   brand          41644 non-null  object 
 7   price          49976 non-null  float64
 8   user_id        50000 non-null  int64  
 9   user_session   50000 non-null  object 
 10  created_at     50000 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 4.2+ MB
None

❓ Missing Values:
               Missing Count  Missing %
category_code          17878     35.756
brand                   8356     16.712
price                     24      0.048

📊 Sample Data (First 5 

Unnamed: 0,id,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,created_at
0,1,2019-11-01T00:00:00+00:00,view,1003461,2053013555631882655,electronics.smartphone,xiaomi,489.07,520088904,4d3b30da-a5e4-49df-b1a8-ba5943f1dd33,2025-10-25T05:51:55.280937+00:00
1,2,2019-11-01T00:00:00+00:00,view,5000088,2053013566100866035,appliances.sewing_machine,janome,293.65,530496790,8e5f4f83-366c-4f70-860e-ca7417414283,2025-10-25T05:51:55.280937+00:00
2,50001,2019-11-01T00:00:00+00:00,view,1003461,2053013555631882655,electronics.smartphone,xiaomi,489.07,520088904,4d3b30da-a5e4-49df-b1a8-ba5943f1dd33,2025-10-25T06:00:37.683386+00:00
3,50002,2019-11-01T00:00:00+00:00,view,5000088,2053013566100866035,appliances.sewing_machine,janome,293.65,530496790,8e5f4f83-366c-4f70-860e-ca7417414283,2025-10-25T06:00:37.683386+00:00
4,3,2019-11-01T00:00:01+00:00,view,17302664,2053013553853497655,,creed,28.31,561587266,755422e7-9040-477b-9bd2-6a6e8fd97387,2025-10-25T05:51:55.280937+00:00



📈 Statistical Summary:


Unnamed: 0,id,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,created_at
count,50000.0,50000,50000,50000.0,50000.0,32122,41644,49976.0,50000.0,50000,50000
unique,,6178,3,,,108,996,,,6417,51
top,,2019-11-01T01:49:06+00:00,view,,,electronics.smartphone,samsung,,,59f9573a-87ea-47c3-af23-a9778e545443,2025-10-25T05:51:55.280937+00:00
freq,,34,49126,,,11360,5444,,,180,1000
mean,37496.50032,,,11904850.0,2.057861e+18,,,291.813103,535624600.0,,
std,26019.163422,,,12307500.0,1.927457e+16,,,357.742941,20326190.0,,
min,1.0,,,1001588.0,2.053014e+18,,,1.06,389979800.0,,
25%,12500.75,,,1307338.0,2.053014e+18,,,65.9,516058900.0,,
50%,25000.5,,,6701107.0,2.053014e+18,,,168.86,532083900.0,,
75%,62496.25,,,17200960.0,2.053014e+18,,,361.605,555042300.0,,



🎯 Event Distribution:
  view: 98.3%
  purchase: 0.9%
  cart: 0.8%

💰 Price Statistics:
  Average: $291.81
  Median: $168.86
  Range: $1.06 - $2574.07
  Null prices: 24 (0.0%)


## 2. Data Preprocessing and Feature Engineering

Clean the data and create additional features that will help in pattern recognition.

In [68]:
def preprocess_ecommerce_data(df):
    """
    Comprehensive data preprocessing and feature engineering
    """
    if df is None:
        return None
    
    # Create a copy to avoid modifying original
    df_processed = df.copy()
    
    print("🔧 PREPROCESSING DATA")
    print("=" * 40)
    
    # 1. Convert event_time to datetime
    df_processed['event_time'] = pd.to_datetime(df_processed['event_time'])
    print("✅ Converted event_time to datetime")
    
    # 2. Extract time-based features
    df_processed['hour'] = df_processed['event_time'].dt.hour
    df_processed['day_of_week'] = df_processed['event_time'].dt.dayofweek  # 0=Monday
    df_processed['day_name'] = df_processed['event_time'].dt.day_name()
    df_processed['is_weekend'] = df_processed['day_of_week'].isin([5, 6])
    df_processed['date'] = df_processed['event_time'].dt.date
    print("✅ Created time-based features")
    
    # 3. Handle missing values
    if 'category_code' in df_processed.columns:
        df_processed['category_code'] = df_processed['category_code'].fillna('unknown')
    if 'brand' in df_processed.columns:
        df_processed['brand'] = df_processed['brand'].fillna('unknown')
    print("✅ Handled missing values")
    
    # 4. Create price categories
    if 'price' in df_processed.columns:
        # Remove rows with null prices for price analysis
        price_data = df_processed[df_processed['price'].notna()]
        if len(price_data) > 0:
            df_processed['price_category'] = pd.cut(
                df_processed['price'], 
                bins=[0, 50, 100, 200, 500, float('inf')], 
                labels=['Budget', 'Low', 'Medium', 'High', 'Premium'],
                include_lowest=True
            )
        print("✅ Created price categories")
    
    # 5. Create category hierarchy
    if 'category_code' in df_processed.columns:
        # Extract main category (first part before '.')
        df_processed['main_category'] = df_processed['category_code'].str.split('.').str[0]
        # Extract sub category (second part)
        df_processed['sub_category'] = df_processed['category_code'].str.split('.').str[1]
        print("✅ Created category hierarchy")
    
    # 6. Create user engagement score (events per user)
    user_activity = df_processed.groupby('user_id').size().reset_index(name='total_events')
    df_processed = df_processed.merge(user_activity, on='user_id', how='left')
    
    # Categorize users by activity level
    activity_percentiles = user_activity['total_events'].quantile([0.33, 0.66, 1.0])
    df_processed['user_activity_level'] = pd.cut(
        df_processed['total_events'],
        bins=[0, activity_percentiles[0.33], activity_percentiles[0.66], activity_percentiles[1.0]],
        labels=['Low', 'Medium', 'High'],
        include_lowest=True
    )
    print("✅ Created user activity features")
    
    # 7. Session analysis preparation
    # Sort by user and time for session analysis
    df_processed = df_processed.sort_values(['user_id', 'event_time'])
    print("✅ Sorted data for session analysis")
    
    print(f"\n📊 Processed Dataset Shape: {df_processed.shape}")
    print(f"🆕 New Columns: {set(df_processed.columns) - set(df.columns)}")
    
    return df_processed

# Apply preprocessing
df_clean = preprocess_ecommerce_data(df)

if df_clean is not None:
    print(f"\n✨ Preprocessing Complete!")
    print(f"Original shape: {df.shape}")
    print(f"Processed shape: {df_clean.shape}")

🔧 PREPROCESSING DATA
✅ Converted event_time to datetime
✅ Created time-based features✅ Created time-based features
✅ Handled missing values
✅ Created price categories

✅ Handled missing values
✅ Created price categories
✅ Created category hierarchy
✅ Created user activity features
✅ Sorted data for session analysis

📊 Processed Dataset Shape: (50000, 21)
🆕 New Columns: {'total_events', 'day_name', 'is_weekend', 'hour', 'user_activity_level', 'sub_category', 'date', 'main_category', 'day_of_week', 'price_category'}

✨ Preprocessing Complete!
Original shape: (50000, 11)
Processed shape: (50000, 21)
✅ Created category hierarchy
✅ Created user activity features
✅ Sorted data for session analysis

📊 Processed Dataset Shape: (50000, 21)
🆕 New Columns: {'total_events', 'day_name', 'is_weekend', 'hour', 'user_activity_level', 'sub_category', 'date', 'main_category', 'day_of_week', 'price_category'}

✨ Preprocessing Complete!
Original shape: (50000, 11)
Processed shape: (50000, 21)


## 3. Temporal Pattern Analysis

Analyze time-based patterns to understand when and how users interact with the platform.

In [69]:
def analyze_temporal_patterns(df):
    """
    Comprehensive temporal pattern analysis
    """
    if df is None:
        return
    
    print("📅 TEMPORAL PATTERN ANALYSIS")
    print("=" * 45)
    
    # Create subplots for multiple visualizations
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=['Hourly Activity Pattern', 'Daily Activity Pattern', 
                       'Event Types by Hour', 'Weekend vs Weekday'],
        specs=[[{"secondary_y": True}, {"secondary_y": True}],
               [{"secondary_y": False}, {"secondary_y": False}]]
    )
    
    # 1. Hourly patterns
    hourly_activity = df.groupby('hour').agg({
        'user_id': 'count',
        'event_type': lambda x: (x == 'purchase').sum()
    }).rename(columns={'user_id': 'total_events', 'event_type': 'purchases'})
    
    fig.add_trace(
        go.Scatter(x=hourly_activity.index, y=hourly_activity['total_events'],
                  name='Total Events', line=dict(color='blue')),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Scatter(x=hourly_activity.index, y=hourly_activity['purchases'],
                  name='Purchases', line=dict(color='red')),
        row=1, col=1, secondary_y=True
    )
    
    # 2. Daily patterns
    daily_activity = df.groupby('day_name').size().reindex([
        'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
    ])
    
    fig.add_trace(
        go.Bar(x=daily_activity.index, y=daily_activity.values,
               name='Events by Day', marker_color='lightblue'),
        row=1, col=2
    )
    
    # 3. Event types by hour (heatmap style)
    event_hour_pivot = df.pivot_table(
        values='user_id', index='hour', columns='event_type', 
        aggfunc='count', fill_value=0
    )
    
    # Convert to percentage for better visualization
    event_hour_pct = event_hour_pivot.div(event_hour_pivot.sum(axis=1), axis=0) * 100
    
    # Show purchase rate by hour
    if 'purchase' in event_hour_pct.columns:
        fig.add_trace(
            go.Scatter(x=event_hour_pct.index, y=event_hour_pct['purchase'],
                      name='Purchase Rate %', line=dict(color='green')),
            row=2, col=1
        )
    
    # 4. Weekend vs Weekday comparison
    weekend_comparison = df.groupby(['is_weekend', 'event_type']).size().unstack(fill_value=0)
    weekend_comparison_pct = weekend_comparison.div(weekend_comparison.sum(axis=1), axis=0) * 100
    
    for event_type in weekend_comparison_pct.columns:
        fig.add_trace(
            go.Bar(x=['Weekday', 'Weekend'], 
                  y=[weekend_comparison_pct.loc[False, event_type] if False in weekend_comparison_pct.index else 0,
                     weekend_comparison_pct.loc[True, event_type] if True in weekend_comparison_pct.index else 0],
                  name=f'{event_type.title()}'),
            row=2, col=2
        )
    
    # Update layout
    fig.update_layout(
        height=800,
        title_text="Temporal Activity Patterns Analysis",
        showlegend=True
    )
    
    fig.show()
    
    # Print insights
    print("\n🔍 KEY TEMPORAL INSIGHTS:")
    print("=" * 30)
    
    # Peak hours
    peak_hour = hourly_activity['total_events'].idxmax()
    peak_events = hourly_activity['total_events'].max()
    print(f"🕐 Peak Activity Hour: {peak_hour}:00 ({peak_events:,} events)")
    
    # Best conversion hours
    if 'purchases' in hourly_activity.columns and hourly_activity['purchases'].sum() > 0:
        conversion_rate = (hourly_activity['purchases'] / hourly_activity['total_events'] * 100)
        best_conversion_hour = conversion_rate.idxmax()
        best_conversion_rate = conversion_rate.max()
        print(f"💰 Best Conversion Hour: {best_conversion_hour}:00 ({best_conversion_rate:.2f}%)")
    
    # Weekend vs Weekday insights
    weekend_events = df[df['is_weekend']]['user_id'].count()
    weekday_events = df[~df['is_weekend']]['user_id'].count()
    weekend_pct = weekend_events / (weekend_events + weekday_events) * 100
    print(f"📊 Weekend Activity: {weekend_pct:.1f}% of all events")
    
    # Most active day
    most_active_day = daily_activity.idxmax()
    most_active_day_events = daily_activity.max()
    print(f"📅 Most Active Day: {most_active_day} ({most_active_day_events:,} events)")

# Run temporal analysis
if df_clean is not None:
    analyze_temporal_patterns(df_clean)

📅 TEMPORAL PATTERN ANALYSIS



🔍 KEY TEMPORAL INSIGHTS:
🕐 Peak Activity Hour: 1:00 (28,000 events)
💰 Best Conversion Hour: 2:00 (2.65%)
📊 Weekend Activity: 0.0% of all events
📅 Most Active Day: Friday (50,000.0 events)


## 4. Purchase Funnel Analysis

Track the customer journey from initial view to final purchase, identifying conversion rates and drop-off points.

In [70]:
def analyze_purchase_funnel(df):
    """
    Comprehensive purchase funnel analysis
    """
    if df is None:
        return
    
    print("🛒 PURCHASE FUNNEL ANALYSIS")
    print("=" * 40)
    
    # Define funnel stages
    funnel_stages = ['view', 'cart', 'purchase']
    
    # Calculate funnel metrics
    funnel_data = df['event_type'].value_counts()
    
    # Create funnel visualization
    fig = go.Figure()
    
    # Funnel chart
    funnel_values = []
    funnel_labels = []
    
    for stage in funnel_stages:
        if stage in funnel_data.index:
            funnel_values.append(funnel_data[stage])
            funnel_labels.append(f"{stage.title()}")
        else:
            funnel_values.append(0)
            funnel_labels.append(f"{stage.title()}")
    
    fig.add_trace(go.Funnel(
        y=funnel_labels,
        x=funnel_values,
        textposition="inside",
        textinfo="value+percent initial",
        opacity=0.65,
        marker={"color": ["deepskyblue", "lightsalmon", "lightgreen"],
                "line": {"width": [4, 2, 2], "color": ["wheat", "wheat", "wheat"]}},
        connector={"line": {"color": "royalblue", "dash": "dot", "width": 3}}
    ))
    
    fig.update_layout(title="E-commerce Purchase Funnel", height=500)
    fig.show()
    
    # Calculate conversion rates
    print("📊 FUNNEL METRICS:")
    print("=" * 20)
    
    total_users = df['user_id'].nunique()
    
    # User-based funnel
    user_funnel = df.groupby('user_id')['event_type'].apply(lambda x: x.unique()).reset_index()
    
    viewers = user_funnel[user_funnel['event_type'].apply(lambda x: 'view' in x)]['user_id'].nunique()
    cart_users = user_funnel[user_funnel['event_type'].apply(lambda x: 'cart' in x)]['user_id'].nunique()
    purchasers = user_funnel[user_funnel['event_type'].apply(lambda x: 'purchase' in x)]['user_id'].nunique()
    
    print(f"👀 Viewers: {viewers:,} users ({viewers/total_users*100:.1f}%)")
    print(f"🛒 Added to Cart: {cart_users:,} users ({cart_users/viewers*100:.1f}% of viewers)")
    print(f"💰 Purchased: {purchasers:,} users ({purchasers/cart_users*100:.1f}% of cart users)")
    
    # Overall conversion rate
    if viewers > 0:
        overall_conversion = purchasers / viewers * 100
        print(f"🎯 Overall Conversion Rate: {overall_conversion:.2f}%")
    
    # Analyze funnel by category
    if 'main_category' in df.columns:
        print(f"\n📈 CONVERSION BY CATEGORY:")
        print("=" * 30)
        
        category_funnel = df.groupby(['main_category', 'event_type']).size().unstack(fill_value=0)
        
        # Calculate conversion rates by category
        for category in category_funnel.index:
            if category != 'unknown':
                views = category_funnel.loc[category, 'view'] if 'view' in category_funnel.columns else 0
                purchases = category_funnel.loc[category, 'purchase'] if 'purchase' in category_funnel.columns else 0
                
                if views > 0:
                    conv_rate = purchases / views * 100
                    print(f"  {category}: {conv_rate:.2f}% ({purchases:,}/{views:,})")
    
    # Time to conversion analysis
    print(f"\n⏱️ TIME TO CONVERSION ANALYSIS:")
    print("=" * 35)
    
    # Find users who both viewed and purchased
    user_events = df.groupby('user_id').agg({
        'event_time': ['min', 'max'],
        'event_type': lambda x: set(x)
    }).reset_index()
    
    user_events.columns = ['user_id', 'first_event', 'last_event', 'event_types']
    
    # Users who converted (viewed and purchased)
    converters = user_events[
        user_events['event_types'].apply(lambda x: 'view' in x and 'purchase' in x)
    ]
    
    if len(converters) > 0:
        converters['time_to_convert'] = (converters['last_event'] - converters['first_event']).dt.total_seconds() / 3600  # hours
        
        avg_time_to_convert = converters['time_to_convert'].mean()
        median_time_to_convert = converters['time_to_convert'].median()
        
        print(f"🕐 Average Time to Convert: {avg_time_to_convert:.1f} hours")
        print(f"🕐 Median Time to Convert: {median_time_to_convert:.1f} hours")
        
        # Distribution of conversion times
        quick_converters = (converters['time_to_convert'] < 1).sum()  # < 1 hour
        same_day = (converters['time_to_convert'] < 24).sum()  # < 24 hours
        
        print(f"⚡ Quick Converters (<1h): {quick_converters:,} ({quick_converters/len(converters)*100:.1f}%)")
        print(f"📅 Same Day Converters: {same_day:,} ({same_day/len(converters)*100:.1f}%)")

# Run funnel analysis
if df_clean is not None:
    analyze_purchase_funnel(df_clean)

🛒 PURCHASE FUNNEL ANALYSIS


📊 FUNNEL METRICS:
👀 Viewers: 5,491 users (100.0%)
🛒 Added to Cart: 130 users (2.4% of viewers)
💰 Purchased: 194 users (149.2% of cart users)
🎯 Overall Conversion Rate: 3.53%

📈 CONVERSION BY CATEGORY:
  accessories: 0.53% (2/374)
  apparel: 0.32% (6/1,886)
  appliances: 0.66% (32/4,844)
  auto: 0.25% (2/796)
  computers: 0.61% (16/2,643)
  construction: 0.89% (10/1,118)
  country_yard: 0.00% (0/8)
  electronics: 1.60% (256/15,976)
  furniture: 0.35% (10/2,881)
  kids: 0.00% (0/642)
  medicine: 0.00% (0/6)
  sport: 0.00% (0/270)

⏱️ TIME TO CONVERSION ANALYSIS:
🕐 Average Time to Convert: 0.2 hours
🕐 Median Time to Convert: 0.1 hours
⚡ Quick Converters (<1h): 188 (96.9%)
📅 Same Day Converters: 194 (100.0%)
👀 Viewers: 5,491 users (100.0%)
🛒 Added to Cart: 130 users (2.4% of viewers)
💰 Purchased: 194 users (149.2% of cart users)
🎯 Overall Conversion Rate: 3.53%

📈 CONVERSION BY CATEGORY:
  accessories: 0.53% (2/374)
  apparel: 0.32% (6/1,886)
  appliances: 0.66% (32/4,844)
  auto: 0.25% (2

## 5. User Behavior Segmentation

Identify distinct user segments based on their behavioral patterns using machine learning clustering techniques.

In [71]:
def segment_users_by_behavior(df):
    """
    Advanced user segmentation using behavioral patterns
    """
    if df is None:
        return None
    
    print("👥 USER BEHAVIOR SEGMENTATION")
    print("=" * 40)
    
    # Create user-level features for segmentation
    user_features = df.groupby('user_id').agg({
        'event_type': [
            lambda x: (x == 'view').sum(),
            lambda x: (x == 'cart').sum(), 
            lambda x: (x == 'purchase').sum(),
            lambda x: len(x.unique())  # variety of actions
        ],
        'price': ['mean', 'std', 'sum'],  # price behavior
        'product_id': 'nunique',  # product diversity
        'main_category': 'nunique' if 'main_category' in df.columns else lambda x: 0,  # category diversity
        'event_time': lambda x: (x.max() - x.min()).total_seconds() / 3600,  # session duration in hours
        'hour': lambda x: x.std(),  # time consistency
        'is_weekend': lambda x: x.mean()  # weekend preference
    }).round(2)
    
    # Flatten column names
    user_features.columns = [
        'views', 'cart_adds', 'purchases', 'action_variety',
        'avg_price', 'price_std', 'total_spent', 'product_diversity',
        'category_diversity', 'session_duration', 'time_consistency', 'weekend_preference'
    ]
    
    # Fill NaN values
    user_features = user_features.fillna(0)
    
    # Calculate derived metrics
    user_features['conversion_rate'] = user_features['purchases'] / user_features['views'].replace(0, 1)
    user_features['cart_conversion'] = user_features['purchases'] / user_features['cart_adds'].replace(0, 1)
    user_features['avg_order_value'] = user_features['total_spent'] / user_features['purchases'].replace(0, 1)
    
    print(f"📊 User Features Shape: {user_features.shape}")
    print(f"👥 Total Users: {len(user_features):,}")
    
    # Prepare data for clustering
    feature_cols = ['views', 'cart_adds', 'purchases', 'avg_price', 'total_spent', 
                   'product_diversity', 'category_diversity', 'session_duration', 
                   'conversion_rate']
    
    # Handle infinite values
    clustering_data = user_features[feature_cols].replace([np.inf, -np.inf], 0)
    
    # Scale the features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(clustering_data)
    
    # Determine optimal number of clusters using elbow method
    inertias = []
    k_range = range(2, 8)
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(scaled_features)
        inertias.append(kmeans.inertia_)
    
    # Plot elbow curve
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(k_range), y=inertias, mode='lines+markers', name='Inertia'))
    fig.update_layout(title='Elbow Method for Optimal Clusters', 
                     xaxis_title='Number of Clusters', 
                     yaxis_title='Inertia')
    fig.show()
    
    # Use 4 clusters (good balance)
    n_clusters = 4
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    user_features['segment'] = kmeans.fit_predict(scaled_features)
    
    # Analyze segments
    print(f"\n🎯 USER SEGMENTS ANALYSIS:")
    print("=" * 30)
    
    segment_summary = user_features.groupby('segment').agg({
        'views': ['count', 'mean'],
        'purchases': ['sum', 'mean'],
        'total_spent': ['sum', 'mean'],
        'conversion_rate': 'mean',
        'avg_order_value': 'mean',
        'product_diversity': 'mean',
        'session_duration': 'mean'
    }).round(2)
    
    # Create segment profiles
    segment_profiles = {}
    
    for segment in range(n_clusters):
        segment_data = user_features[user_features['segment'] == segment]
        profile = {
            'size': len(segment_data),
            'avg_views': segment_data['views'].mean(),
            'avg_purchases': segment_data['purchases'].mean(),
            'total_revenue': segment_data['total_spent'].sum(),
            'conversion_rate': segment_data['conversion_rate'].mean() * 100,
            'avg_order_value': segment_data['avg_order_value'].mean(),
            'avg_products': segment_data['product_diversity'].mean()
        }
        segment_profiles[segment] = profile
        
        print(f"\n📊 Segment {segment}:")
        print(f"  Size: {profile['size']:,} users ({profile['size']/len(user_features)*100:.1f}%)")
        print(f"  Avg Views: {profile['avg_views']:.1f}")
        print(f"  Avg Purchases: {profile['avg_purchases']:.2f}")
        print(f"  Total Revenue: ${profile['total_revenue']:,.2f}")
        print(f"  Conversion Rate: {profile['conversion_rate']:.2f}%")
        print(f"  Avg Order Value: ${profile['avg_order_value']:.2f}")
        print(f"  Avg Products Viewed: {profile['avg_products']:.1f}")
    
    # Create segment visualization
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=['Segment Sizes', 'Revenue by Segment', 
                       'Conversion Rates', 'Average Order Values'],
        specs=[[{"type": "pie"}, {"type": "bar"}],
               [{"type": "bar"}, {"type": "bar"}]]
    )
    
    # Segment sizes
    segment_sizes = [segment_profiles[i]['size'] for i in range(n_clusters)]
    fig.add_trace(
        go.Pie(labels=[f'Segment {i}' for i in range(n_clusters)], 
               values=segment_sizes, name="Segment Sizes"),
        row=1, col=1
    )
    
    # Revenue by segment
    revenues = [segment_profiles[i]['total_revenue'] for i in range(n_clusters)]
    fig.add_trace(
        go.Bar(x=[f'Segment {i}' for i in range(n_clusters)], 
               y=revenues, name="Revenue"),
        row=1, col=2
    )
    
    # Conversion rates
    conv_rates = [segment_profiles[i]['conversion_rate'] for i in range(n_clusters)]
    fig.add_trace(
        go.Bar(x=[f'Segment {i}' for i in range(n_clusters)], 
               y=conv_rates, name="Conversion Rate %"),
        row=2, col=1
    )
    
    # Average order values
    aovs = [segment_profiles[i]['avg_order_value'] for i in range(n_clusters)]
    fig.add_trace(
        go.Bar(x=[f'Segment {i}' for i in range(n_clusters)], 
               y=aovs, name="AOV"),
        row=2, col=2
    )
    
    fig.update_layout(height=700, title_text="User Segmentation Analysis")
    fig.show()
    
    # Assign segment names based on characteristics
    segment_names = {
        0: "Browser/Researcher",  # High views, low purchases
        1: "Casual Shopper",      # Medium engagement
        2: "VIP Customer",        # High purchases, high AOV
        3: "Bargain Hunter"       # Price-sensitive
    }
    
    # This is a simplified naming - in practice you'd analyze the actual characteristics
    print(f"\n🏷️ SEGMENT LABELS:")
    print("=" * 20)
    for i, name in segment_names.items():
        if i in segment_profiles:
            print(f"Segment {i}: {name}")
    
    return user_features, segment_profiles

# Run user segmentation
if df_clean is not None:
    user_segments, segment_info = segment_users_by_behavior(df_clean)

👥 USER BEHAVIOR SEGMENTATION
📊 User Features Shape: (5493, 15)
👥 Total Users: 5,493
📊 User Features Shape: (5493, 15)
👥 Total Users: 5,493



🎯 USER SEGMENTS ANALYSIS:

📊 Segment 0:
  Size: 693 users (12.6%)
  Avg Views: 6.4
  Avg Purchases: 0.01
  Total Revenue: $4,153,416.58
  Conversion Rate: 0.04%
  Avg Order Value: $5930.67
  Avg Products Viewed: 2.2

📊 Segment 1:
  Size: 572 users (10.4%)
  Avg Views: 34.0
  Avg Purchases: 0.09
  Total Revenue: $5,502,826.48
  Conversion Rate: 0.25%
  Avg Order Value: $9331.99
  Avg Products Viewed: 11.1

📊 Segment 2:
  Size: 4,058 users (73.9%)
  Avg Views: 5.8
  Avg Purchases: 0.00
  Total Revenue: $4,249,119.63
  Conversion Rate: 0.00%
  Avg Order Value: $1046.73
  Avg Products Viewed: 2.1

📊 Segment 3:
  Size: 170 users (3.1%)
  Avg Views: 9.0
  Avg Purchases: 2.42
  Total Revenue: $678,288.96
  Conversion Rate: 41.03%
  Avg Order Value: $1567.81
  Avg Products Viewed: 2.4



🏷️ SEGMENT LABELS:
Segment 0: Browser/Researcher
Segment 1: Casual Shopper
Segment 2: VIP Customer
Segment 3: Bargain Hunter


## 6. Product-to-Segment Mapping

Create intelligent product-segment mapping to understand which customer segments are attracted to specific products. This enables targeted marketing, personalized recommendations, and strategic inventory management.

In [76]:
def create_product_segment_mapping(df, user_segments):
    """
    Create detailed brand-to-segment mapping showing which customer segments
    are most attracted to Samsung, Apple, Huawei, and Yamaha. Perfect for targeted marketing!
    """
    if df is None or user_segments is None:
        print("❌ Need both cleaned data and user segments")
        return None
    
    print("🎯 BRAND-TO-SEGMENT MAPPING (MVP)")
    print("=" * 50)
    
    # Create brand mapping for demo (map product IDs to brands)
    # In real implementation, you'd have this in your product catalog
    brand_mapping = {}
    demo_brands = ['Samsung', 'Apple', 'Huawei', 'Yamaha']
    
    # Get unique product IDs and assign them to brands for demo
    unique_products = df['product_id'].unique()
    products_per_brand = len(unique_products) // len(demo_brands)
    
    for i, brand in enumerate(demo_brands):
        start_idx = i * products_per_brand
        if i == len(demo_brands) - 1:  # Last brand gets remaining products
            end_idx = len(unique_products)
        else:
            end_idx = (i + 1) * products_per_brand
        
        brand_products = unique_products[start_idx:end_idx]
        for product_id in brand_products:
            brand_mapping[product_id] = brand
    
    print(f"🏷️ Created brand mapping for {len(brand_mapping)} products across {len(demo_brands)} brands")
    
    # Add brand information to dataframe
    df_branded = df.copy()
    df_branded['brand'] = df_branded['product_id'].map(brand_mapping)
    
    # Filter out products without brand mapping (if any)
    df_branded = df_branded.dropna(subset=['brand'])
    
    # Merge user segment data with product interactions
    df_with_segments = df_branded.merge(
        user_segments[['segment']].reset_index(), 
        on='user_id', 
        how='inner'
    )
    
    print(f"✅ Merged {len(df_with_segments):,} events with segment data")
    
    # Create segment names for better readability
    segment_names = {
        0: "Browser/Researcher",
        1: "Casual Shopper", 
        2: "VIP Customer",
        3: "Bargain Hunter"
    }
    
    df_with_segments['segment_name'] = df_with_segments['segment'].map(segment_names)
    
    # Calculate brand-segment distribution instead of product-segment
    brand_segment_stats = df_with_segments.groupby(['brand', 'segment_name']).agg({
        'user_id': 'count',
        'event_type': [
            lambda x: (x == 'view').sum(),
            lambda x: (x == 'cart').sum(),
            lambda x: (x == 'purchase').sum()
        ],
        'price': 'mean'
    }).round(2)
    
    # Flatten column names
    brand_segment_stats.columns = [
        'total_interactions', 'views', 'cart_adds', 'purchases', 'avg_price'
    ]
    
    # Calculate segment distribution percentages for each brand
    brand_totals = df_with_segments.groupby('brand')['user_id'].count()
    brand_segment_counts = df_with_segments.groupby(['brand', 'segment_name']).size().unstack(fill_value=0)
    
    # Convert to percentages - each brand's segments should add up to 100%
    # This shows: "X% of Samsung buyers are VIP Customers"
    brand_segment_pcts = brand_segment_counts.div(brand_segment_counts.sum(axis=1), axis=0) * 100
    brand_segment_pcts = brand_segment_pcts.round(1)
    
    print(f"📊 Analyzed {len(brand_segment_pcts)} brands across {len(segment_names)} segments")
    
    # Show all brands (since we only have 4)
    target_brands = demo_brands
    
    print(f"\n🏆 BRAND SEGMENT DISTRIBUTION - MVP DEMO:")
    print("=" * 60)
    
    results = []
    
    for i, brand in enumerate(target_brands, 1):
        if brand in brand_totals.index:
            total_interactions = brand_totals[brand]
            
            # Get segment breakdown
            segments_data = brand_segment_pcts.loc[brand].to_dict()
            
            # Sort by percentage descending
            sorted_segments = sorted(segments_data.items(), key=lambda x: x[1], reverse=True)
            
            print(f"\n#{i:2d}. 🏷️ {brand}")
            print(f"     Total Interactions: {total_interactions:,}")
            print(f"     Customer Composition:")
            
            # Verify percentages add up to 100%
            total_check = sum([pct for _, pct in sorted_segments])
            
            segment_breakdown = {}
            for segment_name, percentage in sorted_segments:
                if percentage > 0:
                    print(f"       {percentage}% are {segment_name}")
                    segment_breakdown[segment_name] = percentage
                        
            # Store for API generation
            results.append({
                'brand': brand,
                'total_interactions': total_interactions,
                'segment_breakdown': segment_breakdown,
                'dominant_segment': sorted_segments[0][0],
                'dominant_percentage': sorted_segments[0][1]
            })
    
    # Create brand visualization dashboard
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=['Brand-Segment Distribution Heatmap', 'Brands by VIP Appeal', 
                       'Bargain Hunter Brand Preferences', 'Brand Performance Overview'],
        specs=[[{"type": "heatmap"}, {"type": "bar"}],
               [{"type": "bar"}, {"type": "scatter"}]]
    )
    
    # 1. Heatmap of all brands vs segments
    heatmap_data = brand_segment_pcts
    
    fig.add_trace(
        go.Heatmap(
            z=heatmap_data.values,
            x=heatmap_data.columns,
            y=heatmap_data.index,
            colorscale='RdYlBu_r',
            text=heatmap_data.values,
            texttemplate="%{text}%",
            textfont={"size": 12}
        ),
        row=1, col=1
    )
    
    # 2. Brands with highest VIP customer appeal
    vip_brands = brand_segment_pcts.sort_values('VIP Customer', ascending=False)
    fig.add_trace(
        go.Bar(
            x=vip_brands.index,
            y=vip_brands['VIP Customer'],
            name='VIP Appeal %',
            marker_color='gold',
            text=vip_brands['VIP Customer'],
            texttemplate='%{text}%',
            textposition='outside'
        ),
        row=1, col=2
    )
    
    # 3. Bargain Hunter favorites
    bargain_brands = brand_segment_pcts.sort_values('Bargain Hunter', ascending=False)
    fig.add_trace(
        go.Bar(
            x=bargain_brands.index,
            y=bargain_brands['Bargain Hunter'],
            name='Bargain Hunter %',
            marker_color='lightgreen',
            text=bargain_brands['Bargain Hunter'],
            texttemplate='%{text}%',
            textposition='outside'
        ),
        row=2, col=1
    )
    
    # 4. Brand performance scatter (Total interactions vs Dominant segment percentage)
    brand_performance = []
    for brand in demo_brands:
        if brand in brand_totals.index:
            total_int = brand_totals[brand]
            max_segment_pct = brand_segment_pcts.loc[brand].max()
            brand_performance.append({'brand': brand, 'total': total_int, 'max_pct': max_segment_pct})
    
    if brand_performance:
        perf_df = pd.DataFrame(brand_performance)
        fig.add_trace(
            go.Scatter(
                x=perf_df['total'],
                y=perf_df['max_pct'],
                mode='markers+text',
                text=perf_df['brand'],
                textposition='top center',
                marker=dict(size=15, color='lightblue', line=dict(width=2, color='darkblue')),
                name='Brand Performance'
            ),
            row=2, col=2
        )
    
    fig.update_layout(height=800, title_text="Brand-to-Segment Intelligence Dashboard (MVP Demo)")
    fig.show()
    
    # Generate brand insights for MVP demo
    print(f"\n🎯 BRAND COMPOSITION ANALYSIS - MVP:")
    print("=" * 45)
    
    for segment_name in segment_names.values():
        if segment_name in brand_segment_pcts.columns:
            # Show what percentage of each brand's customers belong to this segment
            print(f"\n📊 {segment_name} Representation in Each Brand:")
            
            # Sort brands by percentage of this segment in their customer base
            segment_brands = brand_segment_pcts[segment_name].sort_values(ascending=False)
            
            for brand, pct in segment_brands.items():
                if brand in brand_totals.index and pct > 0:
                    interactions = brand_totals[brand]
                    # Calculate actual number of customers in this segment
                    segment_customers = int(interactions * pct / 100)
                    print(f"    🏷️ {brand}: {pct:.1f}% ({segment_customers:,} customers)")
            
            # Summary insight
            top_brand = segment_brands.idxmax()
            top_pct = segment_brands.max()
            print(f"    💡 Insight: {top_brand} has the highest {segment_name} representation ({top_pct:.1f}%)")
    
    # Generate marketing recommendations
    print(f"\n💡 MARKETING RECOMMENDATIONS:")
    print("=" * 35)
    
    for brand in demo_brands:
        if brand in brand_segment_pcts.index:
            top_segment = brand_segment_pcts.loc[brand].idxmax()
            top_percentage = brand_segment_pcts.loc[brand].max()
            total_interactions = brand_totals[brand] if brand in brand_totals.index else 0
            
            print(f"\n🏷️ {brand}:")
            print(f"  🎯 Primary Target: {top_segment} ({top_percentage:.1f}%)")
            print(f"  📊 Total Engagement: {total_interactions:,} interactions")
            
            # Marketing advice based on dominant segment
            if top_segment == "VIP Customer":
                print(f"  💰 Strategy: Premium positioning, exclusive offers, luxury messaging")
            elif top_segment == "Bargain Hunter":
                print(f"  🏷️ Strategy: Price promotions, value messaging, discount campaigns")
            elif top_segment == "Casual Shopper":
                print(f"  🛒 Strategy: Convenience focus, broad appeal, lifestyle marketing")
            elif top_segment == "Browser/Researcher":
                print(f"  🔍 Strategy: Educational content, detailed specs, comparison tools")
    
    return {
        'brand_segment_mapping': results,
        'segment_percentages': brand_segment_pcts,
        'segment_names': segment_names,
        'brand_mapping': brand_mapping,
        'demo_brands': demo_brands
    }

# Create the brand-segment mapping for MVP demo
if df_clean is not None and 'user_segments' in globals():
    print("🚀 Creating Brand-to-Segment Intelligence (MVP Demo)...")
    product_mapping = create_product_segment_mapping(df_clean, user_segments)
    
    if product_mapping:
        print(f"\n✨ Brand Mapping Complete!")
        print(f"🏷️ Mapped {len(product_mapping['demo_brands'])} brands: {', '.join(product_mapping['demo_brands'])}")
        print(f"📊 Analyzed {len(product_mapping['brand_segment_mapping'])} brand-segment combinations")
        print(f"🎯 Ready for targeted marketing campaigns!")
        
        # Show quick summary
        print(f"\n📋 QUICK BRAND SUMMARY:")
        for brand_data in product_mapping['brand_segment_mapping']:
            print(f"  🏷️ {brand_data['brand']}: {brand_data['dominant_segment']} ({brand_data['dominant_percentage']:.1f}%)")
else:
    print("⚠️ Run the user segmentation analysis first (Section 5)")

🚀 Creating Brand-to-Segment Intelligence (MVP Demo)...
🎯 BRAND-TO-SEGMENT MAPPING (MVP)
🏷️ Created brand mapping for 8434 products across 4 brands
✅ Merged 50,000 events with segment data
📊 Analyzed 4 brands across 4 segments

🏆 BRAND SEGMENT DISTRIBUTION - MVP DEMO:

# 1. 🏷️ Samsung
     Total Interactions: 23,839
     Customer Composition:
       47.1% are VIP Customer
       35.0% are Casual Shopper
       12.2% are Browser/Researcher
       5.7% are Bargain Hunter

# 2. 🏷️ Apple
     Total Interactions: 11,308
     Customer Composition:
       48.4% are VIP Customer
       42.5% are Casual Shopper
       6.1% are Browser/Researcher
       3.0% are Bargain Hunter

# 3. 🏷️ Huawei
     Total Interactions: 8,179
     Customer Composition:
       49.4% are VIP Customer
       42.8% are Casual Shopper
       5.9% are Browser/Researcher
       2.0% are Bargain Hunter

# 4. 🏷️ Yamaha
     Total Interactions: 6,674
     Customer Composition:
       46.2% are VIP Customer
       43.6% are Ca


🎯 BRAND COMPOSITION ANALYSIS - MVP:

📊 Browser/Researcher Representation in Each Brand:
    🏷️ Samsung: 12.2% (2,908 customers)
    🏷️ Apple: 6.1% (689 customers)
    🏷️ Huawei: 5.9% (482 customers)
    🏷️ Yamaha: 5.9% (393 customers)
    💡 Insight: Samsung has the highest Browser/Researcher representation (12.2%)

📊 Casual Shopper Representation in Each Brand:
    🏷️ Yamaha: 43.6% (2,909 customers)
    🏷️ Huawei: 42.8% (3,500 customers)
    🏷️ Apple: 42.5% (4,805 customers)
    🏷️ Samsung: 35.0% (8,343 customers)
    💡 Insight: Yamaha has the highest Casual Shopper representation (43.6%)

📊 VIP Customer Representation in Each Brand:
    🏷️ Huawei: 49.4% (4,040 customers)
    🏷️ Apple: 48.4% (5,473 customers)
    🏷️ Samsung: 47.1% (11,228 customers)
    🏷️ Yamaha: 46.2% (3,083 customers)
    💡 Insight: Huawei has the highest VIP Customer representation (49.4%)

📊 Bargain Hunter Representation in Each Brand:
    🏷️ Samsung: 5.7% (1,358 customers)
    🏷️ Yamaha: 4.3% (286 customers)
   

### 6.2 Product Recommendation Engine

Now that we understand which segments prefer which products, let's build a recommendation engine!

In [73]:
def build_segment_based_recommender(product_mapping, df, user_segments):
    """
    Build a smart recommendation engine based on segment preferences
    Returns personalized product recommendations for any user
    """
    if not product_mapping or df is None:
        print("❌ Need product mapping and data")
        return None
    
    print("🎯 BUILDING SEGMENT-BASED RECOMMENDER")
    print("=" * 45)
    
    # Extract segment mapping data
    segment_percentages = product_mapping['segment_percentages']
    segment_names = product_mapping['segment_names']
    
    # Create user purchase history
    purchase_history = df[df['event_type'] == 'purchase'].groupby('user_id')['product_id'].apply(list).to_dict()
    
    def recommend_products_for_user(user_id, num_recommendations=10, include_purchased=False):
        """
        Get personalized product recommendations for a specific user
        """
        # Get user's segment
        if user_id not in user_segments.index:
            print(f"⚠️ User {user_id} not found in segments")
            return []
        
        user_segment_id = user_segments.loc[user_id, 'segment']
        user_segment_name = segment_names[user_segment_id]
        
        print(f"👤 User {user_id} → {user_segment_name}")
        
        # Get user's purchase history
        purchased_products = purchase_history.get(user_id, [])
        
        # Get products that appeal to this segment
        segment_prefs = segment_percentages[user_segment_name].sort_values(ascending=False)
        
        # Filter out already purchased products if requested
        if not include_purchased and purchased_products:
            segment_prefs = segment_prefs.drop(purchased_products, errors='ignore')
        
        # Get top recommendations
        recommendations = segment_prefs.head(num_recommendations)
        
        print(f"📦 Previously purchased: {len(purchased_products)} products")
        print(f"🎯 Segment appeal score for top recommendations:")
        
        results = []
        for i, (product_id, appeal_score) in enumerate(recommendations.items(), 1):
            print(f"  #{i:2d}. Product {product_id} - {appeal_score:.1f}% segment appeal")
            results.append({
                'rank': i,
                'product_id': product_id,
                'segment_appeal_score': appeal_score,
                'user_segment': user_segment_name
            })
        
        return results
    
    def get_cross_segment_recommendations(product_id, target_segments=None):
        """
        Find which segments would be interested in a specific product
        Perfect for targeted marketing campaigns!
        """
        if product_id not in segment_percentages.index:
            print(f"⚠️ Product {product_id} not found")
            return {}
        
        product_appeal = segment_percentages.loc[product_id].sort_values(ascending=False)
        
        print(f"📦 PRODUCT {product_id} - CROSS-SEGMENT APPEAL")
        print("-" * 40)
        
        recommendations = {}
        for segment_name, appeal_pct in product_appeal.items():
            if appeal_pct > 0:
                print(f"  {segment_name}: {appeal_pct:.1f}%")
                recommendations[segment_name] = appeal_pct
        
        return recommendations
    
    def bulk_user_recommendations(user_list, num_recs=5):
        """
        Generate recommendations for multiple users at once
        """
        print(f"🔄 Generating recommendations for {len(user_list)} users...")
        
        all_recommendations = {}
        
        for user_id in user_list:
            try:
                user_recs = recommend_products_for_user(user_id, num_recs, include_purchased=False)
                all_recommendations[user_id] = user_recs
                print(f"✅ User {user_id}: {len(user_recs)} recommendations")
            except Exception as e:
                print(f"❌ Error for user {user_id}: {e}")
                all_recommendations[user_id] = []
        
        return all_recommendations
    
    # Test with a few users
    print("\n🧪 TESTING RECOMMENDATIONS:")
    print("=" * 30)
    
    # Get sample users from each segment
    sample_users = {}
    for segment_id, segment_name in segment_names.items():
        segment_users = user_segments[user_segments['segment'] == segment_id].index[:2].tolist()
        sample_users[segment_name] = segment_users
        
        print(f"\n🎯 {segment_name} Sample:")
        for user_id in segment_users[:1]:  # Show 1 user per segment
            recs = recommend_products_for_user(user_id, num_recommendations=5)
            print()
    
    return {
        'recommend_for_user': recommend_products_for_user,
        'cross_segment_appeal': get_cross_segment_recommendations,
        'bulk_recommendations': bulk_user_recommendations,
        'sample_users': sample_users
    }

# Build the recommendation engine
if 'product_mapping' in globals() and product_mapping:
    print("🚀 Building Recommendation Engine...")
    recommender = build_segment_based_recommender(product_mapping, df_clean, user_segments)
    
    if recommender:
        print(f"\n✨ Recommendation Engine Ready!")
        print(f"🎯 Available Functions:")
        print(f"  • recommender['recommend_for_user'](user_id)")
        print(f"  • recommender['cross_segment_appeal'](product_id)")
        print(f"  • recommender['bulk_recommendations'](user_list)")
else:
    print("⚠️ Run product mapping first (Section 6.1)")

🚀 Building Recommendation Engine...
🎯 BUILDING SEGMENT-BASED RECOMMENDER

🧪 TESTING RECOMMENDATIONS:

🎯 Browser/Researcher Sample:
👤 User 454010850 → Browser/Researcher
📦 Previously purchased: 0 products
🎯 Segment appeal score for top recommendations:
  # 1. Product Samsung - 12.9% segment appeal
  # 2. Product Apple - 6.3% segment appeal
  # 3. Product Yamaha - 6.1% segment appeal
  # 4. Product Huawei - 6.0% segment appeal


🎯 Casual Shopper Sample:
👤 User 512370912 → Casual Shopper
📦 Previously purchased: 0 products
🎯 Segment appeal score for top recommendations:
  # 1. Product Yamaha - 48.4% segment appeal
  # 2. Product Apple - 46.7% segment appeal
  # 3. Product Huawei - 46.4% segment appeal
  # 4. Product Samsung - 42.6% segment appeal


🎯 VIP Customer Sample:
👤 User 389979783 → VIP Customer
📦 Previously purchased: 0 products
🎯 Segment appeal score for top recommendations:
  # 1. Product Samsung - 99.5% segment appeal
  # 2. Product Apple - 99.0% segment appeal
  # 3. Product Hu

### 6.3 Marketing Intelligence API Generator

Let's create REST API endpoints for all our product-segment insights!

In [74]:
class ProductSegmentAPI:
    """
    Generate production-ready API endpoints for product-segment intelligence
    Perfect for integrating with marketing automation tools!
    """
    
    def __init__(self, product_mapping, recommender, user_segments):
        self.product_mapping = product_mapping
        self.recommender = recommender
        self.user_segments = user_segments
        self.segment_names = product_mapping['segment_names']
        
        print("🚀 PRODUCT-SEGMENT API INITIALIZED")
        print(f"📊 {len(product_mapping['product_segment_mapping'])} products mapped")
        print(f"👥 {len(user_segments)} users segmented")
        print(f"🎯 {len(self.segment_names)} customer segments")
    
    def generate_flask_api(self):
        """
        Generate complete Flask API code for product-segment intelligence
        """
        api_code = '''
from flask import Flask, request, jsonify
from flask_cors import CORS
import json
import pandas as pd

app = Flask(__name__)
CORS(app)

# Load your data here (in production, connect to your database)
# product_mapping = load_product_mapping()
# user_segments = load_user_segments()
# recommender = load_recommender()

@app.route('/api/product/<int:product_id>/segments', methods=['GET'])
def get_product_segments(product_id):
    """
    GET /api/product/12345/segments
    Returns which customer segments are attracted to this product
    Perfect for: Targeted advertising, campaign optimization
    """
    try:
        if product_id not in product_mapping['segment_percentages'].index:
            return jsonify({"error": "Product not found"}), 404
        
        segments = product_mapping['segment_percentages'].loc[product_id].to_dict()
        
        # Sort by appeal percentage
        sorted_segments = sorted(segments.items(), key=lambda x: x[1], reverse=True)
        
        result = {
            "product_id": product_id,
            "segment_appeal": [
                {
                    "segment": segment_name,
                    "appeal_percentage": round(percentage, 1),
                    "targeting_priority": "high" if percentage > 30 else "medium" if percentage > 15 else "low"
                }
                for segment_name, percentage in sorted_segments if percentage > 0
            ],
            "primary_segment": sorted_segments[0][0] if sorted_segments else None,
            "cross_segment_appeal": len([s for s in segments.values() if s > 20])
        }
        
        return jsonify(result)
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/api/user/<int:user_id>/recommendations', methods=['GET'])
def get_user_recommendations(user_id):
    """
    GET /api/user/67890/recommendations?limit=10
    Returns personalized product recommendations based on user segment
    Perfect for: Personalization engines, email campaigns
    """
    try:
        limit = request.args.get('limit', 10, type=int)
        include_purchased = request.args.get('include_purchased', 'false').lower() == 'true'
        
        recommendations = recommender['recommend_for_user'](
            user_id, 
            num_recommendations=limit,
            include_purchased=include_purchased
        )
        
        if not recommendations:
            return jsonify({"error": "User not found or no recommendations available"}), 404
        
        result = {
            "user_id": user_id,
            "user_segment": recommendations[0]['user_segment'] if recommendations else None,
            "recommendations": [
                {
                    "product_id": rec['product_id'],
                    "rank": rec['rank'],
                    "segment_appeal_score": rec['segment_appeal_score'],
                    "confidence": "high" if rec['segment_appeal_score'] > 40 else "medium" if rec['segment_appeal_score'] > 25 else "low"
                }
                for rec in recommendations
            ],
            "total_recommendations": len(recommendations)
        }
        
        return jsonify(result)
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/api/segment/<segment_name>/products', methods=['GET'])
def get_segment_products(segment_name):
    """
    GET /api/segment/VIP Customer/products?limit=20
    Returns top products for a specific customer segment
    Perfect for: Segment-targeted campaigns, inventory planning
    """
    try:
        limit = request.args.get('limit', 20, type=int)
        
        if segment_name not in product_mapping['segment_percentages'].columns:
            return jsonify({"error": "Segment not found"}), 404
        
        segment_products = product_mapping['segment_percentages'][segment_name].sort_values(ascending=False)
        top_products = segment_products.head(limit)
        
        result = {
            "segment_name": segment_name,
            "total_products_analyzed": len(segment_products),
            "top_products": [
                {
                    "product_id": int(product_id),
                    "appeal_score": round(score, 1),
                    "targeting_priority": "high" if score > 40 else "medium" if score > 25 else "low"
                }
                for product_id, score in top_products.items() if score > 0
            ],
            "average_appeal": round(segment_products.mean(), 1)
        }
        
        return jsonify(result)
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/api/analytics/segments/overview', methods=['GET'])
def get_segments_overview():
    """
    GET /api/analytics/segments/overview
    Returns complete segment analytics overview
    Perfect for: Dashboards, executive reporting
    """
    try:
        segments_stats = {}
        
        for segment_id, segment_name in segment_names.items():
            segment_users = user_segments[user_segments['segment'] == segment_id]
            
            segments_stats[segment_name] = {
                "user_count": len(segment_users),
                "percentage_of_users": round(len(segment_users) / len(user_segments) * 100, 1),
                "top_products": product_mapping['segment_percentages'][segment_name].nlargest(5).to_dict()
            }
        
        result = {
            "total_users": len(user_segments),
            "total_segments": len(segment_names),
            "segments": segments_stats,
            "api_version": "1.0",
            "last_updated": "2024-01-01"  # Update with actual timestamp
        }
        
        return jsonify(result)
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/api/analytics/products/cross-segment', methods=['GET'])
def get_cross_segment_analysis():
    """
    GET /api/analytics/products/cross-segment?min_appeal=20
    Returns products with broad cross-segment appeal
    Perfect for: Universal campaigns, bestseller identification
    """
    try:
        min_appeal = request.args.get('min_appeal', 20, type=int)
        
        # Find products that appeal to multiple segments
        segment_counts = (product_mapping['segment_percentages'] > min_appeal).sum(axis=1)
        cross_appeal_products = segment_counts[segment_counts > 1].sort_values(ascending=False)
        
        result = {
            "analysis_criteria": f"Products with >{min_appeal}% appeal in multiple segments",
            "total_cross_appeal_products": len(cross_appeal_products),
            "products": [
                {
                    "product_id": int(product_id),
                    "segments_appealing_to": int(segment_count),
                    "segment_breakdown": product_mapping['segment_percentages'].loc[product_id].to_dict()
                }
                for product_id, segment_count in cross_appeal_products.head(50).items()
            ]
        }
        
        return jsonify(result)
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    print("🚀 Product-Segment Intelligence API")
    print("📊 Available Endpoints:")
    print("  GET /api/product/<id>/segments - Product segment analysis")
    print("  GET /api/user/<id>/recommendations - Personalized recommendations")
    print("  GET /api/segment/<name>/products - Segment product preferences")
    print("  GET /api/analytics/segments/overview - Complete segment overview")
    print("  GET /api/analytics/products/cross-segment - Cross-segment product analysis")
    
    app.run(debug=True, host='0.0.0.0', port=5000)
'''
        
        return api_code
    
    def generate_api_examples(self):
        """
        Generate example API calls and responses
        """
        examples = {
            "product_segments": {
                "endpoint": "GET /api/product/12345/segments",
                "description": "Get segment breakdown for a specific product",
                "example_response": {
                    "product_id": 12345,
                    "segment_appeal": [
                        {"segment": "VIP Customer", "appeal_percentage": 45.2, "targeting_priority": "high"},
                        {"segment": "Casual Shopper", "appeal_percentage": 28.1, "targeting_priority": "medium"},
                        {"segment": "Bargain Hunter", "appeal_percentage": 16.7, "targeting_priority": "low"},
                        {"segment": "Browser/Researcher", "appeal_percentage": 10.0, "targeting_priority": "low"}
                    ],
                    "primary_segment": "VIP Customer",
                    "cross_segment_appeal": 2
                }
            },
            "user_recommendations": {
                "endpoint": "GET /api/user/67890/recommendations?limit=5",
                "description": "Get personalized product recommendations",
                "example_response": {
                    "user_id": 67890,
                    "user_segment": "Bargain Hunter",
                    "recommendations": [
                        {"product_id": 555, "rank": 1, "segment_appeal_score": 52.3, "confidence": "high"},
                        {"product_id": 777, "rank": 2, "segment_appeal_score": 48.9, "confidence": "high"},
                        {"product_id": 999, "rank": 3, "segment_appeal_score": 41.2, "confidence": "high"}
                    ],
                    "total_recommendations": 3
                }
            },
            "segment_products": {
                "endpoint": "GET /api/segment/VIP%20Customer/products?limit=3",
                "description": "Get top products for a customer segment",
                "example_response": {
                    "segment_name": "VIP Customer",
                    "total_products_analyzed": 1000,
                    "top_products": [
                        {"product_id": 1001, "appeal_score": 67.8, "targeting_priority": "high"},
                        {"product_id": 1002, "appeal_score": 58.4, "targeting_priority": "high"},
                        {"product_id": 1003, "appeal_score": 52.1, "targeting_priority": "high"}
                    ],
                    "average_appeal": 23.4
                }
            }
        }
        
        return examples

# Generate the complete API
if 'product_mapping' in globals() and 'recommender' in globals():
    print("🚀 Generating Product-Segment Intelligence API...")
    
    api_generator = ProductSegmentAPI(product_mapping, recommender, user_segments)
    
    # Generate Flask API code
    flask_code = api_generator.generate_flask_api()
    
    # Generate API examples
    api_examples = api_generator.generate_api_examples()
    
    print(f"\n✨ API Generated Successfully!")
    print(f"📋 Available API Endpoints:")
    for example_name, example_data in api_examples.items():
        print(f"  • {example_data['endpoint']}")
        print(f"    {example_data['description']}")
    
    print(f"\n💾 Save the Flask code to 'product_segment_api.py' to deploy!")
    
    # Save Flask API to file
    with open('product_segment_api.py', 'w') as f:
        f.write(flask_code)
    
    print(f"✅ API saved to product_segment_api.py")
    print(f"🚀 Run with: python product_segment_api.py")
    
else:
    print("⚠️ Run product mapping and recommender first")

🚀 Generating Product-Segment Intelligence API...
🚀 PRODUCT-SEGMENT API INITIALIZED


KeyError: 'product_segment_mapping'

In [None]:
class PatternBasedAPIGenerator:
    """
    Generate intelligent APIs based on discovered behavioral patterns
    This is the core AI engine that transforms patterns into actionable APIs
    """
    
    def __init__(self, df, user_segments=None):
        self.df = df
        self.user_segments = user_segments
        self.patterns = {}
        self.api_endpoints = {}
        
    def discover_patterns(self):
        """Extract key behavioral patterns from the data"""
        print("🧠 AI PATTERN DISCOVERY ENGINE")
        print("=" * 40)
        
        patterns = {}
        
        # 1. High-Value Customer Patterns
        if self.user_segments is not None:
            high_value_users = self.user_segments.nlargest(100, 'total_spent').index.tolist()
            high_value_behavior = self.df[self.df['user_id'].isin(high_value_users)]
            
            patterns['high_value_customers'] = {
                'user_count': len(high_value_users),
                'top_categories': high_value_behavior['main_category'].value_counts().head(5).to_dict(),
                'avg_session_duration': high_value_behavior.groupby('user_id')['event_time'].apply(lambda x: (x.max() - x.min()).total_seconds() / 3600).mean(),
                'preferred_hours': high_value_behavior['hour'].value_counts().head(3).to_dict(),
                'avg_order_value': high_value_behavior[high_value_behavior['event_type'] == 'purchase']['price'].mean()
            }
        
        # 2. Conversion Optimization Patterns
        converters = self.df.groupby('user_id')['event_type'].apply(lambda x: 'purchase' in x.values)
        converter_ids = converters[converters == True].index.tolist()
        
        if len(converter_ids) > 0:
            converter_behavior = self.df[self.df['user_id'].isin(converter_ids)]
            
            patterns['conversion_patterns'] = {
                'conversion_rate': len(converter_ids) / self.df['user_id'].nunique() * 100,
                'avg_views_before_purchase': converter_behavior[converter_behavior['event_type'] == 'view'].groupby('user_id').size().mean(),
                'top_converting_categories': converter_behavior[converter_behavior['event_type'] == 'purchase']['main_category'].value_counts().head(5).to_dict(),
                'optimal_price_ranges': self._find_optimal_price_ranges(),
            }
        
        # 3. Churn Risk Patterns
        recent_activity = self.df['event_time'].max() - pd.Timedelta(days=7)
        recent_users = set(self.df[self.df['event_time'] > recent_activity]['user_id'].unique())
        all_users = set(self.df['user_id'].unique())
        inactive_users = all_users - recent_users
        
        patterns['churn_risk'] = {
            'total_inactive_users': len(inactive_users),
            'churn_risk_percentage': len(inactive_users) / len(all_users) * 100,
            'last_activity_patterns': self._analyze_churn_patterns(inactive_users)
        }
        
        # 4. Product Affinity Patterns
        patterns['product_affinity'] = self._discover_product_affinities()
        
        # 5. Temporal Engagement Patterns
        patterns['temporal_engagement'] = {
            'peak_hours': self.df['hour'].value_counts().head(3).to_dict(),
            'peak_days': self.df['day_name'].value_counts().head(3).to_dict(),
            'weekend_vs_weekday': {
                'weekend_engagement': self.df[self.df['is_weekend']]['user_id'].count(),
                'weekday_engagement': self.df[~self.df['is_weekend']]['user_id'].count()
            }
        }
        
        self.patterns = patterns
        print("✅ Pattern discovery complete!")
        return patterns
    
    def _find_optimal_price_ranges(self):
        """Find price ranges with highest conversion rates"""
        if 'price' not in self.df.columns:
            return {}
            
        # Create price bins
        price_bins = pd.qcut(self.df[self.df['price'].notna()]['price'], 
                           q=10, duplicates='drop')
        
        df_with_bins = self.df[self.df['price'].notna()].copy()
        df_with_bins['price_bin'] = price_bins
        
        # Calculate conversion rate by price bin
        conversion_by_price = df_with_bins.groupby('price_bin').agg({
            'event_type': [
                lambda x: (x == 'view').sum(),
                lambda x: (x == 'purchase').sum()
            ]
        })
        
        conversion_by_price.columns = ['views', 'purchases']
        conversion_by_price['conversion_rate'] = (
            conversion_by_price['purchases'] / conversion_by_price['views'] * 100
        )
        
        return conversion_by_price.nlargest(3, 'conversion_rate')['conversion_rate'].to_dict()
    
    def _analyze_churn_patterns(self, inactive_users):
        """Analyze patterns of users at risk of churning"""
        if len(inactive_users) == 0:
            return {}
            
        inactive_sample = list(inactive_users)[:1000]  # Sample for performance
        inactive_behavior = self.df[self.df['user_id'].isin(inactive_sample)]
        
        return {
            'last_categories_viewed': inactive_behavior['main_category'].value_counts().head(5).to_dict(),
            'avg_session_length': inactive_behavior.groupby('user_id')['event_time'].apply(
                lambda x: (x.max() - x.min()).total_seconds() / 3600
            ).mean(),
            'last_actions': inactive_behavior['event_type'].value_counts().to_dict()
        }
    
    def _discover_product_affinities(self):
        """Find products frequently viewed together"""
        # Group by user sessions to find product co-occurrences
        user_sessions = self.df.groupby('user_session')['product_id'].apply(list).reset_index()
        
        # Find pairs of products viewed in same session
        product_pairs = {}
        for products in user_sessions['product_id']:
            if len(products) > 1:
                for i in range(len(products)):
                    for j in range(i+1, len(products)):
                        pair = tuple(sorted([products[i], products[j]]))
                        product_pairs[pair] = product_pairs.get(pair, 0) + 1
        
        # Get top product affinities
        top_affinities = dict(sorted(product_pairs.items(), 
                                   key=lambda x: x[1], reverse=True)[:10])
        
        return top_affinities
    
    def generate_api_endpoints(self):
        """Generate intelligent API endpoints based on discovered patterns"""
        print("\n🚀 AI-POWERED API GENERATION")
        print("=" * 40)
        
        if not self.patterns:
            self.discover_patterns()
        
        endpoints = {}
        
        # 1. Customer Intelligence APIs
        endpoints['customer_intelligence'] = {
            '/api/customers/high-value': {
                'description': 'Get high-value customer insights and characteristics',
                'method': 'GET',
                'response_example': {
                    'total_high_value_customers': self.patterns.get('high_value_customers', {}).get('user_count', 0),
                    'avg_order_value': self.patterns.get('high_value_customers', {}).get('avg_order_value', 0),
                    'top_categories': self.patterns.get('high_value_customers', {}).get('top_categories', {}),
                    'recommended_targeting_hours': self.patterns.get('high_value_customers', {}).get('preferred_hours', {})
                }
            },
            '/api/customers/{user_id}/segment': {
                'description': 'Get AI-determined customer segment and personalization data',
                'method': 'GET',
                'response_example': {
                    'segment': 'VIP Customer',
                    'conversion_probability': 0.75,
                    'recommended_products': ['product_123', 'product_456'],
                    'optimal_contact_time': '14:00-16:00'
                }
            }
        }
        
        # 2. Conversion Optimization APIs
        endpoints['conversion_optimization'] = {
            '/api/conversion/funnel-analysis': {
                'description': 'Get real-time conversion funnel metrics and optimization suggestions',
                'method': 'GET',
                'response_example': {
                    'conversion_rate': self.patterns.get('conversion_patterns', {}).get('conversion_rate', 0),
                    'bottlenecks': ['cart_abandonment', 'pricing_sensitivity'],
                    'optimization_suggestions': [
                        'Reduce cart abandonment with exit-intent popups',
                        'A/B test pricing in optimal ranges'
                    ]
                }
            },
            '/api/conversion/predict': {
                'description': 'Predict conversion probability for a user session',
                'method': 'POST',
                'payload_example': {
                    'user_id': 'user_123',
                    'current_session_events': ['view', 'view', 'cart'],
                    'products_viewed': ['product_456', 'product_789']
                },
                'response_example': {
                    'conversion_probability': 0.68,
                    'recommended_actions': ['send_discount_offer', 'show_similar_products'],
                    'optimal_price_range': '$50-$100'
                }
            }
        }
        
        # 3. Personalization APIs
        endpoints['personalization'] = {
            '/api/personalization/recommendations': {
                'description': 'Get AI-powered product recommendations based on behavior patterns',
                'method': 'POST',
                'payload_example': {'user_id': 'user_123', 'context': 'homepage'},
                'response_example': {
                    'recommended_products': [
                        {'product_id': 'prod_123', 'score': 0.92, 'reason': 'similar_users_purchased'},
                        {'product_id': 'prod_456', 'score': 0.87, 'reason': 'frequently_bought_together'}
                    ],
                    'recommended_categories': ['electronics.smartphone', 'computers.notebook']
                }
            },
            '/api/personalization/optimal-timing': {
                'description': 'Get optimal engagement timing for users',
                'method': 'GET',
                'response_example': {
                    'peak_engagement_hours': self.patterns.get('temporal_engagement', {}).get('peak_hours', {}),
                    'user_specific_timing': '14:00-16:00 weekdays',
                    'campaign_recommendations': ['email_at_2pm', 'push_notifications_evening']
                }
            }
        }
        
        # 4. Churn Prevention APIs
        endpoints['churn_prevention'] = {
            '/api/churn/risk-assessment': {
                'description': 'Identify users at risk of churning and get retention strategies',
                'method': 'GET',
                'response_example': {
                    'high_risk_users_count': len(self.patterns.get('churn_risk', {}).get('last_activity_patterns', {})),
                    'churn_probability_threshold': 0.7,
                    'retention_strategies': [
                        'personalized_discount_campaign',
                        'win_back_email_series',
                        'exclusive_early_access'
                    ]
                }
            },
            '/api/churn/predict/{user_id}': {
                'description': 'Predict churn probability for specific user',
                'method': 'GET',
                'response_example': {
                    'churn_probability': 0.23,
                    'risk_level': 'low',
                    'last_activity': '2019-11-15',
                    'recommended_actions': ['no_action_needed']
                }
            }
        }
        
        # 5. Business Intelligence APIs
        endpoints['business_intelligence'] = {
            '/api/insights/revenue-optimization': {
                'description': 'Get AI-driven revenue optimization insights',
                'method': 'GET',
                'response_example': {
                    'revenue_opportunities': [
                        {'opportunity': 'price_optimization', 'potential_lift': '15%'},
                        {'opportunity': 'cross_sell_campaign', 'potential_lift': '8%'}
                    ],
                    'optimal_price_ranges': self.patterns.get('conversion_patterns', {}).get('optimal_price_ranges', {}),
                    'product_affinities': self.patterns.get('product_affinity', {})
                }
            }
        }
        
        self.api_endpoints = endpoints
        
        # Display generated APIs
        print("✅ Generated AI-Powered API Endpoints:")
        print("=" * 45)
        
        for category, apis in endpoints.items():
            print(f"\n📂 {category.upper().replace('_', ' ')} APIs:")
            for endpoint, details in apis.items():
                print(f"  🔗 {details['method']} {endpoint}")
                print(f"     📝 {details['description']}")
        
        print(f"\n🎯 Total API Endpoints Generated: {sum(len(apis) for apis in endpoints.values())}")
        
        return endpoints
    
    def generate_api_documentation(self):
        """Generate comprehensive API documentation"""
        if not self.api_endpoints:
            self.generate_api_endpoints()
        
        print("\n📚 AI-POWERED API DOCUMENTATION")
        print("=" * 45)
        
        doc = {
            'title': 'AI-Powered Customer Data Intelligence API',
            'version': '1.0.0',
            'description': 'Intelligent APIs generated from behavioral pattern analysis',
            'base_url': 'https://api.yourdomain.com',
            'authentication': 'Bearer Token',
            'endpoints': self.api_endpoints,
            'pattern_insights': self.patterns
        }
        
        # Save documentation as JSON (in practice, you'd save to file)
        print("💾 API Documentation Generated Successfully!")
        print(f"📊 Based on analysis of {len(self.df):,} events from {self.df['user_id'].nunique():,} users")
        
        return doc

# Generate Pattern-Based APIs
if df_clean is not None:
    print("🚀 INITIALIZING AI PATTERN ENGINE...")
    
    # Initialize the AI API generator
    api_generator = PatternBasedAPIGenerator(df_clean, user_segments)
    
    # Discover patterns
    discovered_patterns = api_generator.discover_patterns()
    
    # Generate APIs
    generated_apis = api_generator.generate_api_endpoints()
    
    # Generate documentation
    api_docs = api_generator.generate_api_documentation()
    
    print("\n🎉 AI-POWERED SEGMENT PLATFORM READY!")
    print("=" * 50)
    print("Your AI engine has analyzed the behavioral data and generated")
    print("intelligent APIs that can power personalization, conversion")
    print("optimization, and customer intelligence - just like Segment")
    print("but enhanced with AI-driven insights! 🚀")

🚀 INITIALIZING AI PATTERN ENGINE...
🧠 AI PATTERN DISCOVERY ENGINE
✅ Pattern discovery complete!

🚀 AI-POWERED API GENERATION
✅ Generated AI-Powered API Endpoints:

📂 CUSTOMER INTELLIGENCE APIs:
  🔗 GET /api/customers/high-value
     📝 Get high-value customer insights and characteristics
  🔗 GET /api/customers/{user_id}/segment
     📝 Get AI-determined customer segment and personalization data

📂 CONVERSION OPTIMIZATION APIs:
  🔗 GET /api/conversion/funnel-analysis
     📝 Get real-time conversion funnel metrics and optimization suggestions
  🔗 POST /api/conversion/predict
     📝 Predict conversion probability for a user session

📂 PERSONALIZATION APIs:
  🔗 POST /api/personalization/recommendations
     📝 Get AI-powered product recommendations based on behavior patterns
  🔗 GET /api/personalization/optimal-timing
     📝 Get optimal engagement timing for users

📂 CHURN PREVENTION APIs:
  🔗 GET /api/churn/risk-assessment
     📝 Identify users at risk of churning and get retention strategie