# Customer LTV Forecasting – Synthetic Data EDA

This notebook sanity-checks the synthetic transactional dataset and outlines the initial modeling plan.

## Data Overview
- Granularity: transaction-level records per customer
- Fields: `customer_id`, `transaction_date`, `revenue`, `frequency`, `recency`, `channel`
- Channels cover lifecycle and acquisition programs (email, sms, push, paid, organic, referral)
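
A quick schema check can catch missing fields or stray channel labels before any plotting. The sketch below is illustrative only: `check_schema` is a hypothetical helper, the toy frame is made up rather than taken from `transactions.csv`, and it assumes the channel labels are stored as the exact lowercase strings listed above.

```python
import pandas as pd

# Column names and channel values come from the Data Overview list;
# the exact lowercase spelling of the channel labels is an assumption.
EXPECTED_COLUMNS = ["customer_id", "transaction_date", "revenue",
                    "frequency", "recency", "channel"]
EXPECTED_CHANNELS = {"email", "sms", "push", "paid", "organic", "referral"}


def check_schema(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: report missing columns and unexpected channel labels."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    unexpected = []
    if "channel" in frame.columns:
        observed = set(frame["channel"].dropna().unique())
        unexpected = sorted(observed - EXPECTED_CHANNELS)
    return {"missing_columns": missing, "unexpected_channels": unexpected}


# Tiny made-up frame purely for illustration; it is not real project data.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "transaction_date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-01"]),
    "revenue": [20.0, 35.5, 12.0],
    "frequency": [1, 2, 1],
    "recency": [0, 14, 0],
    "channel": ["email", "paid", "affiliate"],
})
report = check_schema(toy)
print(report)
```

On the loaded data, the same call would be `check_schema(df)`; two empty lists in the result suggest the file matches the field list above.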

In [None]:
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Resolve the project root whether the notebook runs from the repo root or from notebooks/
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == 'notebooks':
    PROJECT_ROOT = PROJECT_ROOT.parent
data_path = PROJECT_ROOT / 'data' / 'transactions.csv'
df = pd.read_csv(data_path, parse_dates=['transaction_date'])
# Order each customer's transactions chronologically
df = df.sort_values(['customer_id', 'transaction_date'])
df.head()

In [None]:
summary = {
    'rows': len(df),
    'customers': df['customer_id'].nunique(),
    'date_range': (df['transaction_date'].min().date(), df['transaction_date'].max().date()),
    'channels': sorted(df['channel'].unique())
}
summary

In [None]:
df.describe(include='all')

In [None]:
channel_perf = (
    df.groupby('channel')
      .agg(txn_count=('revenue', 'size'),
           customers=('customer_id', 'nunique'),
           revenue_sum=('revenue', 'sum'),
           revenue_mean=('revenue', 'mean'))
      .sort_values('revenue_sum', ascending=False)
)
channel_perf

In [None]:
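fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Left panel: distribution of the per-transaction frequency field
sns.histplot(df['frequency'], bins=10, ax=axes[0])
axes[0].set_title('In-session frequency (transaction index)')
# Right panel: recency vs revenue on a reproducible sample, capped at the row count
# so that sampling does not fail on datasets with fewer than 400 rows
sample = df.sample(min(400, len(df)), random_state=42)
sns.scatterplot(x='recency', y='revenue', data=sample, ax=axes[1])
axes[1].set_title('Recency gap vs revenue (sample)')
plt.tight_layout()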