# MMM Data Preparation Pipeline

This notebook walks through the data preparation pipeline for marketing mix modeling (MMM), including:
- Handling weekly seasonality and trends
- Zero-spend period treatment
- Feature scaling and transformations
- Adstock and saturation transformations
- Interaction feature creation
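
To make the adstock and saturation items above concrete, here is a minimal sketch of one common formulation: geometric adstock followed by a Hill-type saturation curve. The function names `geometric_adstock` and `hill_saturation`, the `decay`, `half_saturation` and `shape` parameters, and the toy spend values are all assumptions made for illustration. The `adstock_transform` and `saturation_transform` helpers imported from `utils` later in this notebook may use different formulas and signatures.

```python
import numpy as np


def geometric_adstock(spend, decay=0.5):
    """Hypothetical helper: carry a fraction `decay` of last period's adstock forward."""
    spend = np.asarray(spend, dtype=float)
    out = np.zeros_like(spend)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry  # current spend plus decayed carry-over
        out[t] = carry
    return out


def hill_saturation(x, half_saturation, shape=1.0):
    """Hypothetical helper: Hill curve, 0 at zero input and 0.5 at `half_saturation`."""
    x = np.asarray(x, dtype=float)
    return x**shape / (x**shape + half_saturation**shape)


# Toy weekly spend (made-up values), including zero-spend weeks
weekly_spend = [100.0, 0.0, 0.0, 50.0]
adstocked = geometric_adstock(weekly_spend, decay=0.5)
saturated = hill_saturation(adstocked, half_saturation=100.0)
```

With these toy values the adstocked series is 100, 50, 25, 62.5, so the zero-spend weeks still carry a decaying effect from the first week. The saturation step then maps each adstocked value into the range 0 to 1, with diminishing returns at higher spend.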


In [1]:
# Import necessary libraries
import sys
import os
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler

# Import our custom modules
from data_preparation import DataPreparator
from utils import set_random_seed, adstock_transform, saturation_transform

# Set random seed for reproducibility
set_random_seed(42)

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)


In [2]:
# Load the raw MMM dataset and parse the date column
data = pd.read_csv('../data/raw/mmm_data.csv')
data['date'] = pd.to_datetime(data['date'])

print(f"Loaded data shape: {data.shape}")
print(f"Date range: {data['date'].min()} to {data['date'].max()}")
print(f"Columns: {list(data.columns)}")
data.head()


Loaded data shape: (104, 12)
Date range: 2023-09-17 00:00:00 to 2025-09-07 00:00:00
Columns: ['facebook_spend', 'google_spend', 'tiktok_spend', 'instagram_spend', 'snapchat_spend', 'followers', 'average_price', 'promotions', 'email_volume', 'sms_volume', 'revenue', 'date']


,facebook_spend,google_spend,tiktok_spend,instagram_spend,snapchat_spend,followers,average_price,promotions,email_volume,sms_volume,revenue,date
0,6030.8,3130.14,2993.22,1841.08,2204.72,0,101.95,0,102684,20098,83124.16,2023-09-17
1,5241.44,2704.0,0.0,0.0,0.0,0,103.86,0,96573,29920,373.02,2023-09-24
2,5893.0,0.0,0.0,0.0,0.0,0,100.38,0,96797,22304,513.01,2023-10-01
3,7167.16,0.0,0.0,0.0,0.0,0,103.14,1,99098,14171,452.78,2023-10-08
4,5360.29,0.0,0.0,3237.15,0.0,0,107.76,1,120754,30207,41441.95,2023-10-15


## 1. Data Preparation Pipeline

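Apply the full pipeline with `DataPreparator.prepare_data`, which handles missing values and outliers, creates additional features, and applies the media transformations: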


In [3]:
# Initialize data preparator
prep = DataPreparator(random_seed=42)

# Apply comprehensive data preparation
prepared_data = prep.prepare_data(data, apply_transformations=True)

print(f"Prepared data shape: {prepared_data.shape}")
print(f"New features created: {len(prepared_data.columns) - len(data.columns)}")
print(f"Total features: {len(prepared_data.columns)}")


Preparing data for modeling...
Handling missing values...
Handling outliers...
Creating additional features...
Applying media transformations...
Data preparation complete. Features: 50
Prepared data shape: (104, 52)
New features created: 40
Total features: 52
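
The log above mentions a feature-creation step without showing what it does. As a rough sketch only, lag, moving-average and spend-share columns could be built with pandas as follows. The helpers `add_lag_and_rolling_features` and `add_spend_shares`, their parameters, and the toy DataFrame are hypothetical. The actual logic inside `DataPreparator` is not shown in this notebook and may differ, for example in how it treats the first rows of a rolling window.

```python
import pandas as pd


def add_lag_and_rolling_features(df, columns, lags=(1, 2, 4), windows=(4, 8, 12)):
    """Hypothetical helper: add `<col>_lag_<k>` and `<col>_ma_<w>` columns."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            # Value from `lag` rows earlier; the first `lag` rows become NaN
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
        for window in windows:
            # Trailing mean; min_periods=1 avoids NaN at the start of the series
            out[f"{col}_ma_{window}"] = out[col].rolling(window, min_periods=1).mean()
    return out


def add_spend_shares(df, spend_columns):
    """Hypothetical helper: add total spend and each channel's share of it."""
    out = df.copy()
    out["total_media_spend"] = out[spend_columns].sum(axis=1)
    # A zero total means every channel is zero; mask it so the share is explicitly NaN
    total = out["total_media_spend"].replace(0, float("nan"))
    for col in spend_columns:
        out[f"{col}_ratio"] = out[col] / total
    return out


# Toy data (made-up values), including one week with no spend at all
toy = pd.DataFrame({
    "google_spend": [10.0, 0.0, 30.0, 20.0],
    "facebook_spend": [30.0, 0.0, 10.0, 20.0],
    "revenue": [100.0, 50.0, 120.0, 90.0],
})
features = add_lag_and_rolling_features(
    toy, ["revenue", "google_spend"], lags=(1,), windows=(2,)
)
features = add_spend_shares(features, ["google_spend", "facebook_spend"])
```

Lagged columns start with missing values, so a downstream step has to either drop those rows or fill them. Note also that lagged revenue is only available at prediction time if past revenue is known, which is worth checking before using such columns as model inputs.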


In [4]:
# Display the new features created
new_features = [col for col in prepared_data.columns if col not in data.columns]
print("New features created:")
for i, feature in enumerate(new_features, 1):
    print(f"{i:2d}. {feature}")

print("\nFeature categories:")
print(f"- Original features: {len(data.columns)}")
print(f"- New features: {len(new_features)}")
print(f"- Total features: {len(prepared_data.columns)}")


New features created:
 1. year
 2. month
 3. quarter
 4. day_of_year
 5. revenue_lag_1
 6. google_spend_lag_1
 7. revenue_lag_2
 8. google_spend_lag_2
 9. revenue_lag_4
10. google_spend_lag_4
11. revenue_ma_4
12. google_spend_ma_4
13. revenue_ma_8
14. google_spend_ma_8
15. revenue_ma_12
16. google_spend_ma_12
17. total_media_spend
18. facebook_spend_ratio
19. google_spend_ratio
20. tiktok_spend_ratio
21. instagram_spend_ratio
22. snapchat_spend_ratio
23. google_spend_lag_1_ratio
24. google_spend_lag_2_ratio
25. google_spend_lag_4_ratio
26. google_spend_ma_4_ratio
27. google_spend_ma_8_ratio
28. google_spend_ma_12_ratio
29. facebook_spend_adstock
30. facebook_spend_saturated
31. google_spend_adstock
32. google_spend_saturated
33. tiktok_spend_adstock
34. tiktok_spend_saturated
35. snapchat_spend_adstock
36. snapchat_spend_saturated
37. google_spend_x_average_price
38. facebook_spend_x_promotions
39. tiktok_spend_x_promotions
40. email_volume_x_sms_volume

Feature categories:
- Original 