# 05 - Sales Forecasting

Weekly revenue forecasting for the Olist marketplace using time series analysis.

**Key Questions:**
1. What are the underlying trend and seasonality patterns in revenue?
2. Can we accurately forecast revenue for the next 8 weeks?
3. What does the forecast tell us about business trajectory?

**Methods:**
- Time series decomposition (trend, seasonality, residual)
- Holt-Winters Exponential Smoothing
- MAPE evaluation metric

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Time series specific imports
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Paths (assumes the notebook is run from a subdirectory of the project root)
PROJECT_ROOT = Path.cwd().parent
DB_PATH = PROJECT_ROOT / 'data' / 'olist_ecommerce.db'
IMAGES_PATH = PROJECT_ROOT / 'images'
IMAGES_PATH.mkdir(exist_ok=True)

print(f"Database path: {DB_PATH}")
print(f"Images path: {IMAGES_PATH}")

## 1. Data Loading

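Load orders and payments from the SQLite database, sum the payments per order, and merge the two on `order_id` to get one row per order for the time series analysis. The merge is a left join, so orders without any payment record keep a missing `payment_value`.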
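As a minimal sketch of why payments are summed per order before joining, the example below uses a hypothetical in-memory database. The table and column names mirror the queries in this notebook, but the rows are made up, and the join is done in SQL for brevity (the notebook merges in pandas, where `NULL` corresponds to `NaN`).

```python
import sqlite3

# In-memory database with made-up rows (not real Olist data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, order_status TEXT);
CREATE TABLE order_payments (order_id TEXT, payment_value REAL);
INSERT INTO orders VALUES ('a', 'delivered'), ('b', 'delivered'), ('c', 'canceled');
-- order 'a' was paid in two parts; order 'c' has no payment row
INSERT INTO order_payments VALUES ('a', 50.0), ('a', 25.0), ('b', 40.0);
""")

# Aggregate payments to one row per order, then left-join onto orders
rows = conn.execute("""
SELECT o.order_id, p.payment_value
FROM orders AS o
LEFT JOIN (
    SELECT order_id, SUM(payment_value) AS payment_value
    FROM order_payments
    GROUP BY order_id
) AS p ON p.order_id = o.order_id
ORDER BY o.order_id
""").fetchall()
conn.close()

print(rows)  # [('a', 75.0), ('b', 40.0), ('c', None)]
```

Without the `GROUP BY`, order `a` would appear twice after the join and inflate the order count.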

In [None]:
# Connect to database and load data
conn = sqlite3.connect(DB_PATH)

# Load orders
orders_query = """
SELECT 
    order_id,
    order_status,
    order_purchase_timestamp
FROM orders
"""
orders_df = pd.read_sql_query(orders_query, conn)

# Load payments (aggregate by order_id)
payments_query = """
SELECT order_id, SUM(payment_value) as payment_value
FROM order_payments
GROUP BY order_id
"""
payments_df = pd.read_sql_query(payments_query, conn)

conn.close()

# Merge orders with payments
merged_df = orders_df.merge(payments_df, on='order_id', how='left')

# Convert timestamp
merged_df['order_purchase_timestamp'] = pd.to_datetime(merged_df['order_purchase_timestamp'])

print(f"Total orders: {len(merged_df):,}")
print(f"Delivered orders: {(merged_df['order_status'] == 'delivered').sum():,}")
print(f"Date range: {merged_df['order_purchase_timestamp'].min().date()} to {merged_df['order_purchase_timestamp'].max().date()}")
print(f"Total revenue: R${merged_df['payment_value'].sum():,.2f}")
merged_df.head()

## 2. Prepare Weekly Revenue Time Series

Aggregate revenue by week of purchase for time series analysis. Only delivered orders are included, so the series measures completed transactions.
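The behaviour of weekly resampling is easy to misread, so here is a small sketch on made-up values (not Olist data). It shows that pandas labels each `'W'` bin by the Sunday that ends the week, and that a week with no orders is kept in the series with a sum of 0.

```python
import pandas as pd

# Made-up payments indexed by purchase timestamp
payments = pd.Series(
    [100.0, 50.0, 40.0],
    index=pd.to_datetime(["2018-01-01", "2018-01-03", "2018-01-17"]),
)

# 'W' means weeks ending on Sunday; each bin is labeled by its end date
weekly = payments.resample("W").sum()
print(weekly)
# 2018-01-07    150.0
# 2018-01-14      0.0   <- empty week is kept as zero
# 2018-01-21     40.0
```

Two consequences are worth keeping in mind: the first and last weeks of a real series may be partial, and zero-revenue weeks make percentage-based error metrics such as MAPE undefined for those weeks.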

In [None]:
# Create weekly revenue time series from delivered orders only.
# 'W' bins are weeks ending on Sunday; weeks without delivered orders sum to 0.
weekly_revenue = (
    merged_df[merged_df['order_status'] == 'delivered']
    .set_index('order_purchase_timestamp')
    .resample('W')['payment_value']
    .sum()
)

# Display summary
print(f"Weekly revenue time series:")
print(f"  - Start: {weekly_revenue.index.min().date()}")
print(f"  - End: {weekly_revenue.index.max().date()}")
print(f"  - Number of weeks: {len(weekly_revenue)}")
print(f"  - Average weekly revenue: R${weekly_revenue.mean():,.2f}")
print(f"  - Min weekly revenue: R${weekly_revenue.min():,.2f}")
print(f"  - Max weekly revenue: R${weekly_revenue.max():,.2f}")

# Preview the time series
weekly_revenue.tail(10)

In [None]:
# Visualize weekly revenue over time
fig, ax = plt.subplots(figsize=(14, 5))

ax.plot(weekly_revenue.index, weekly_revenue.values, linewidth=2, color='#2E86AB')
ax.fill_between(weekly_revenue.index, weekly_revenue.values, alpha=0.3, color='#2E86AB')

ax.set_title('Weekly Revenue Over Time', fontsize=14, fontweight='bold')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Revenue (R$)', fontsize=11)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'R${x/1000:,.0f}K'))

# Add grid
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(IMAGES_PATH / 'weekly_revenue_timeseries.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Saved: {IMAGES_PATH / 'weekly_revenue_timeseries.png'}")

## 3. Time Series Decomposition

Decompose the time series into its core components:
- **Trend**: Long-term direction of the series
- **Seasonal**: Repeating patterns at fixed intervals
- **Residual**: Random variation after trend and seasonality are removed

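We use additive decomposition with a 4-week seasonal period, which approximates a monthly cycle (a calendar month is about 4.35 weeks, so the alignment with calendar months drifts over time).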
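To make the additive model (observed = trend + seasonal + residual) concrete, the following is a simplified sketch of the classical procedure on synthetic data. The helper `additive_decompose` is hypothetical and is not the `statsmodels` implementation, although it follows the same general idea: a centered moving average for the trend, then per-position averages of the detrended series for the seasonal component.

```python
import numpy as np

def additive_decompose(y, period):
    """Classical additive decomposition: returns (trend, seasonal, resid)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Centered moving average; an even period needs half weights at both ends
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # the trend is undefined at the edges
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values by position in the cycle, then center them
    indices = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    indices -= indices.mean()
    seasonal = np.tile(indices, n // period + 1)[:n]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic series: linear trend plus a zero-sum 4-week pattern, no noise
t = np.arange(24)
pattern = np.array([10.0, -5.0, -20.0, 15.0])
y = 100 + 2 * t + np.tile(pattern, 6)

trend, seasonal, resid = additive_decompose(y, period=4)
print(seasonal[:4])  # recovers the 4-week pattern
```

On this noise-free input the pattern is recovered exactly and the residual is zero wherever the trend is defined. The period is an input to the procedure, so the decomposition always returns a cycle of that length whether or not the data truly has one.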

In [None]:
# Time series decomposition
decomposition = seasonal_decompose(weekly_revenue, model='additive', period=4)

# Create decomposition plot
fig, axes = plt.subplots(4, 1, figsize=(14, 10))

# Observed
axes[0].plot(weekly_revenue.index, decomposition.observed, linewidth=1.5, color='#2E86AB')
axes[0].set_ylabel('Observed', fontsize=10)
axes[0].set_title('Time Series Decomposition (Additive Model, Period=4 weeks)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Trend
axes[1].plot(weekly_revenue.index, decomposition.trend, linewidth=1.5, color='#E94F37')
axes[1].set_ylabel('Trend', fontsize=10)
axes[1].grid(True, alpha=0.3)

# Seasonal
axes[2].plot(weekly_revenue.index, decomposition.seasonal, linewidth=1.5, color='#44AF69')
axes[2].set_ylabel('Seasonal', fontsize=10)
axes[2].grid(True, alpha=0.3)

# Residual
axes[3].plot(weekly_revenue.index, decomposition.resid, linewidth=1.5, color='#F8961E')
axes[3].set_ylabel('Residual', fontsize=10)
axes[3].set_xlabel('Date', fontsize=11)
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(IMAGES_PATH / 'seasonal_decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Saved: {IMAGES_PATH / 'seasonal_decomposition.png'}")

In [None]:
# Analyze decomposition components
print("Decomposition Analysis:")
print("="*50)

# Trend analysis
trend_values = decomposition.trend.dropna()
trend_start = trend_values.iloc[:4].mean()
trend_end = trend_values.iloc[-4:].mean()
trend_growth = ((trend_end - trend_start) / trend_start) * 100

print(f"\nTREND COMPONENT:")
print(f"  - Start avg (first 4 weeks): R${trend_start:,.2f}")
print(f"  - End avg (last 4 weeks): R${trend_end:,.2f}")
print(f"  - Overall growth: {trend_growth:+.1f}%")

# Seasonal analysis
seasonal_values = decomposition.seasonal.dropna()
print(f"\nSEASONAL COMPONENT:")
print(f"  - Average amplitude: R${seasonal_values.abs().mean():,.2f}")
print(f"  - Max swing: R${seasonal_values.max() - seasonal_values.min():,.2f}")

# Residual analysis
resid_values = decomposition.resid.dropna()
print(f"\nRESIDUAL COMPONENT:")
print(f"  - Mean: R${resid_values.mean():,.2f} (should be ~0)")
print(f"  - Std Dev: R${resid_values.std():,.2f}")
print(f"  - This represents unexplained variance in the data")

## 4. Train/Test Split

Hold out the last 8 weeks of data for model evaluation. This allows us to measure forecast accuracy on unseen data.

In [None]:
# Train/Test split: hold out last 8 weeks
train = weekly_revenue[:-8]
test = weekly_revenue[-8:]

print("Train/Test Split:")
print("="*50)
print(f"\nTRAINING SET:")
print(f"  - Period: {train.index.min().date()} to {train.index.max().date()}")
print(f"  - Weeks: {len(train)}")
print(f"  - Total revenue: R${train.sum():,.2f}")

print(f"\nTEST SET:")
print(f"  - Period: {test.index.min().date()} to {test.index.max().date()}")
print(f"  - Weeks: {len(test)}")
print(f"  - Total revenue: R${test.sum():,.2f}")

# Visualize the split
fig, ax = plt.subplots(figsize=(14, 5))

ax.plot(train.index, train.values, linewidth=2, color='#2E86AB', label='Training Data')
ax.plot(test.index, test.values, linewidth=2, color='#E94F37', label='Test Data (holdout)')
ax.axvline(x=test.index[0], color='gray', linestyle='--', alpha=0.7, label='Train/Test Split')

ax.set_title('Train/Test Split for Forecasting', fontsize=14, fontweight='bold')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Revenue (R$)', fontsize=11)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'R${x/1000:,.0f}K'))
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Holt-Winters Exponential Smoothing

Holt-Winters is well suited to time series with both trend and seasonality. It maintains three smoothed components:
- **Level**: Base value of the series
- **Trend**: Direction and rate of change
- **Seasonality**: Repeating patterns

We use additive components for both trend and seasonality with a 4-week seasonal period.
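For intuition, the additive Holt-Winters updates can be sketched in plain Python. This follows the common textbook form of the recursions; the function name, the explicit initial states, and the smoothing values below are illustrative assumptions, and `statsmodels` estimates its own parameters and initial states, so its numbers will differ.

```python
def holt_winters_additive(y, m, alpha, beta, gamma,
                          level0, trend0, season0, horizon):
    """Run additive Holt-Winters recursions and return `horizon` forecasts."""
    level, trend = level0, trend0
    season = list(season0)  # season[t % m] is the latest estimate for that slot
    for t, obs in enumerate(y):
        s_prev = season[t % m]
        prev_level, prev_trend = level, trend
        # Level: blend the deseasonalized observation with the one-step projection
        level = alpha * (obs - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: blend the latest change in level with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonal: blend the detrended observation with the previous seasonal term
        season[t % m] = gamma * (obs - prev_level - prev_trend) + (1 - gamma) * s_prev
    n = len(y)
    # h-step forecast: extrapolate level and trend, add the matching seasonal term
    return [level + h * trend + season[(n - 1 + h) % m]
            for h in range(1, horizon + 1)]

# Synthetic series: linear trend plus a 4-week pattern, no noise
pattern = [10.0, -5.0, -20.0, 15.0]
y = [100 + 2 * t + pattern[t % 4] for t in range(20)]

forecasts = holt_winters_additive(
    y, m=4, alpha=0.3, beta=0.1, gamma=0.2,
    level0=98.0, trend0=2.0, season0=pattern, horizon=8,
)
expected = [100 + 2 * t + pattern[t % 4] for t in range(20, 28)]
```

With exact initial states and no noise, the forecasts reproduce the series regardless of the smoothing values; on real data the three parameters control how quickly each component reacts to new observations.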

In [None]:
# Fit Holt-Winters Exponential Smoothing model
model = ExponentialSmoothing(
    train,
    trend='add',
    seasonal='add',
    seasonal_periods=4
)

fit = model.fit()

# Display model parameters
print("Holt-Winters Model Summary:")
print("="*50)
print(f"\nSmoothing Parameters:")
print(f"  - Alpha (level): {fit.params['smoothing_level']:.4f}")
print(f"  - Beta (trend): {fit.params['smoothing_trend']:.4f}")
print(f"  - Gamma (seasonal): {fit.params['smoothing_seasonal']:.4f}")

print(f"\nModel Fit Statistics:")
print(f"  - AIC: {fit.aic:,.2f}")
print(f"  - BIC: {fit.bic:,.2f}")

In [None]:
# Generate forecast for test period
forecast = fit.forecast(8)

# Display forecast values
print("8-Week Forecast:")
print("="*50)
forecast_df = pd.DataFrame({
    'Week': forecast.index,
    'Actual': test.values,
    'Forecast': forecast.values,
    'Error': test.values - forecast.values,
    'Error %': ((test.values - forecast.values) / test.values * 100)
})
forecast_df['Week'] = forecast_df['Week'].dt.strftime('%Y-%m-%d')
print(forecast_df.to_string(index=False))

## 6. Model Evaluation

Evaluate forecast accuracy using Mean Absolute Percentage Error (MAPE), interpreted with a common rule of thumb:
- MAPE < 10%: Highly accurate forecasting
- MAPE 10-20%: Good forecasting  
- MAPE 20-50%: Reasonable forecasting
- MAPE > 50%: Inaccurate forecasting
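As a worked example of the metrics used in this section, the sketch below computes MAPE, MAE, and RMSE by hand on made-up numbers. The helper `forecast_metrics` is hypothetical and written in plain Python for readability.

```python
import math

def forecast_metrics(actual, forecast):
    """Return (MAPE in %, MAE, RMSE) for two equal-length sequences."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mape, mae, rmse

# Made-up values: percentage errors are 10%, 10%, and 0%
demo_mape, demo_mae, demo_rmse = forecast_metrics([100, 200, 400], [110, 180, 400])
print(f"MAPE={demo_mape:.2f}%  MAE={demo_mae:.2f}  RMSE={demo_rmse:.2f}")
# MAPE=6.67%  MAE=10.00  RMSE=12.91
```

MAPE is scale-free, which makes it easy to communicate, but it weights errors on low-revenue weeks more heavily and cannot be computed for weeks with zero revenue. MAE and RMSE are in R$ and do not have that limitation.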

In [None]:
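# Calculate evaluation metrics on the raw arrays so that an index mismatch
# between test and forecast cannot silently introduce NaNs.
# Note: MAPE is undefined if any actual (test) value is zero.
errors = test.values - forecast.values
mape = np.mean(np.abs(errors / test.values)) * 100
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))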

print("Forecast Evaluation Metrics:")
print("="*50)
print(f"\nMAPE (Mean Absolute Percentage Error): {mape:.1f}%")
print(f"MAE (Mean Absolute Error): R${mae:,.2f}")
print(f"RMSE (Root Mean Square Error): R${rmse:,.2f}")

# Interpretation
print("\n" + "="*50)
print("INTERPRETATION:")
if mape < 10:
    print(f"  MAPE of {mape:.1f}% indicates HIGHLY ACCURATE forecasting.")
    print("  The model captures revenue patterns very well.")
elif mape < 20:
    print(f"  MAPE of {mape:.1f}% indicates GOOD forecasting accuracy.")
    print("  The model provides reliable revenue predictions.")
elif mape < 50:
    print(f"  MAPE of {mape:.1f}% indicates REASONABLE forecasting.")
    print("  Predictions are useful but should be used with caution.")
else:
    print(f"  MAPE of {mape:.1f}% indicates POOR forecasting accuracy.")
    print("  The model struggles to capture revenue patterns.")

## 7. Forecast Visualization

Visualize the complete forecast with training data, test data, and predictions.

In [None]:
# Main forecast visualization
fig, ax = plt.subplots(figsize=(14, 6))

# Plot training data
ax.plot(train.index, train.values, linewidth=2, color='#2E86AB', label='Training Data')

# Plot test data (actual)
ax.plot(test.index, test.values, linewidth=2, color='#44AF69', marker='o', markersize=6, label='Actual (Test)')

# Plot forecast
ax.plot(forecast.index, forecast.values, linewidth=2, color='#E94F37', marker='s', markersize=6, 
        linestyle='--', label='Forecast')

# Add an approximate 95% prediction interval: a constant-width band based on the
# std of in-sample residuals (it does not widen with the forecast horizon)
residuals_std = (train - fit.fittedvalues).std()
upper_ci = forecast + 1.96 * residuals_std
lower_ci = forecast - 1.96 * residuals_std
ax.fill_between(forecast.index, lower_ci, upper_ci, alpha=0.2, color='#E94F37', label='Approx. 95% Prediction Interval')

# Add vertical line at forecast start
ax.axvline(x=test.index[0], color='gray', linestyle='--', alpha=0.5)

# Formatting
ax.set_title('8-Week Revenue Forecast', fontsize=14, fontweight='bold')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Revenue (R$)', fontsize=11)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'R${x/1000:,.0f}K'))
ax.legend(loc='upper left', fontsize=10)
ax.grid(True, alpha=0.3)

# Add MAPE annotation
ax.annotate(f'MAPE: {mape:.1f}%', xy=(0.98, 0.95), xycoords='axes fraction',
            fontsize=12, ha='right', va='top',
            bbox=dict(boxstyle='round', facecolor='white', edgecolor='gray', alpha=0.8))

plt.tight_layout()
plt.savefig(IMAGES_PATH / 'revenue_forecast.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Saved: {IMAGES_PATH / 'revenue_forecast.png'}")

In [None]:
# Zoomed view of forecast period
fig, ax = plt.subplots(figsize=(12, 5))

# Include last 12 weeks of training data for context
context_start = train.index[-12]
context_data = train[train.index >= context_start]

# Plot context (recent training data)
ax.plot(context_data.index, context_data.values, linewidth=2, color='#2E86AB', 
        marker='o', markersize=5, alpha=0.7, label='Recent History')

# Plot test data (actual)
ax.plot(test.index, test.values, linewidth=2.5, color='#44AF69', 
        marker='o', markersize=8, label='Actual')

# Plot forecast
ax.plot(forecast.index, forecast.values, linewidth=2.5, color='#E94F37', 
        marker='s', markersize=8, linestyle='--', label='Forecast')

# Approximate prediction interval (computed in the previous cell)
ax.fill_between(forecast.index, lower_ci, upper_ci, alpha=0.2, color='#E94F37')

# Connect each actual value to its forecast with a dotted line to show the error
for week, actual, pred in zip(test.index, test.values, forecast.values):
    ax.plot([week, week], [actual, pred],
            color='gray', linestyle=':', alpha=0.5)

# Formatting
ax.set_title('Forecast vs Actual (8-Week Period)', fontsize=14, fontweight='bold')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Revenue (R$)', fontsize=11)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'R${x/1000:,.0f}K'))
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Business Interpretation & Key Findings

In [None]:
# Business interpretation summary
print("="*60)
print("SALES FORECASTING - BUSINESS INTERPRETATION")
print("="*60)

# 1. Trend Direction
trend_direction = "GROWING" if trend_growth > 0 else "DECLINING"
print(f"\n1. TREND DIRECTION: {trend_direction}")
print(f"   - Revenue trend has {'increased' if trend_growth > 0 else 'decreased'} by {abs(trend_growth):.1f}% over the analysis period")
print(f"   - Early period avg: R${trend_start:,.2f}/week")
print(f"   - Recent period avg: R${trend_end:,.2f}/week")

# 2. Seasonality Patterns
print(f"\n2. SEASONALITY PATTERNS:")
print(f"   - 4-week seasonal cycle assumed (period=4 in the decomposition), not detected from the data")
print(f"   - Average seasonal swing: +/- R${seasonal_values.abs().mean():,.2f}")
print(f"   - Revenue varies by approximately R${seasonal_values.max() - seasonal_values.min():,.2f} within each 4-week cycle")

# 3. Forecast Accuracy
print(f"\n3. FORECAST RELIABILITY:")
print(f"   - MAPE: {mape:.1f}% - ", end="")
if mape < 10:
    print("Highly reliable for planning")
elif mape < 20:
    print("Good for strategic planning")
elif mape < 50:
    print("Use with caution, factor in uncertainty")
else:
    print("High uncertainty, use directional guidance only")

# 4. Forecast vs. actual over the 8-week holdout period
avg_forecast = forecast.mean()
avg_actual = test.mean()
print(f"\n4. FORECAST SUMMARY:")
print(f"   - 8-week forecast average: R${avg_forecast:,.2f}/week")
print(f"   - 8-week actual average: R${avg_actual:,.2f}/week")
print(f"   - Total forecasted revenue: R${forecast.sum():,.2f}")
print(f"   - Total actual revenue: R${test.sum():,.2f}")

## 9. Summary

| Metric | Value |
|--------|-------|
| **Time Series** | Weekly revenue (delivered orders) |
| **Analysis Period** | Full dataset span |
| **Forecast Horizon** | 8 weeks |
| **Model** | Holt-Winters Exponential Smoothing |
| **Seasonal Period** | 4 weeks (approximately monthly; assumed, not estimated) |
| **MAPE** | See results above |

### Key Takeaways

1. **Trend Analysis**: The decomposition reveals the underlying direction of business growth, separating it from seasonal noise.

2. **Seasonal Patterns**: The model assumes a 4-week cycle as a proxy for monthly purchasing behavior. Where the seasonal component is large relative to the residual, it can inform inventory and staffing planning.

3. **Forecast Utility**: Even with modest accuracy, forecasts provide value for:
   - Cash flow planning
   - Inventory procurement
   - Marketing campaign timing
   - Resource allocation

4. **Model Limitations**: 
   - Does not account for external shocks (promotions, holidays, economic events)
   - Assumes patterns will continue
   - Longer forecast horizons have higher uncertainty

### Recommendations

- Use forecasts for 4-8 week planning horizons
- Monitor actual vs forecast weekly and adjust
- Consider adding exogenous variables (holidays, promotions) for improved accuracy
- Re-train model monthly with new data
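One possible way to monitor actual versus forecast each week is sketched below. The helper `flag_forecast_drift` and its 20% default threshold are hypothetical (the threshold simply reuses the upper bound of the "Good" band from the MAPE scale in Section 6), and the values are made up.

```python
def flag_forecast_drift(actuals, forecasts, threshold_pct=20.0):
    """Return (week_number, abs_pct_error) for weeks above the threshold.

    Assumes every actual value is nonzero.
    """
    flagged = []
    for week, (actual, predicted) in enumerate(zip(actuals, forecasts), start=1):
        ape = abs(actual - predicted) / abs(actual) * 100
        if ape > threshold_pct:
            flagged.append((week, round(ape, 1)))
    return flagged

# Made-up weekly values: absolute percentage errors are 5%, 25%, and 2.5%
flagged = flag_forecast_drift([100.0, 120.0, 80.0], [105.0, 90.0, 82.0])
print(flagged)  # [(2, 25.0)]
```

Several flagged weeks in a row would be a reasonable trigger for re-training ahead of the monthly schedule.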

In [None]:
# Final summary output
print("="*60)
print("NOTEBOOK 05 - SALES FORECASTING COMPLETE")
print("="*60)
print(f"\nImages saved to: {IMAGES_PATH}")
print("  - weekly_revenue_timeseries.png")
print("  - seasonal_decomposition.png")
print("  - revenue_forecast.png")
print("\nThis analysis demonstrates predictive analytics capability")
print("for strategic business planning and resource allocation.")