# 01 - Data Preprocessing
This notebook loads the Online Retail II transactions, removes invalid entries, aggregates sales into daily and monthly totals, and writes the processed CSVs that later phases consume.


## Workflow
1. Import project helpers.
2. Load and clean the raw dataset.
3. Aggregate sales to daily and monthly frequencies.
4. Persist the processed outputs for the modeling stack.
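
The project helpers behind steps 2 and 3 live in `src.data_prep` and are not reproduced in this notebook. The sketch below shows one plausible shape for them, assuming the raw frame uses the Online Retail II column names (`Invoice`, `Quantity`, `Price`, `InvoiceDate`). The `_sketch` function names and the exact filtering rules are illustrative assumptions, not the project's actual implementation.

```python
import pandas as pd


def clean_transactions_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that cannot contribute to a sales total (illustrative rules only)."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    # Assumption: cancellations are invoices whose number starts with "C".
    is_cancelled = out["Invoice"].astype(str).str.startswith("C")
    valid = ~is_cancelled & (out["Quantity"] > 0) & (out["Price"] > 0)
    return out.loc[valid].reset_index(drop=True)


def aggregate_sales_sketch(df: pd.DataFrame, freq: str = "D") -> pd.DataFrame:
    """Sum revenue per period and return `ds`/`y` columns."""
    revenue = (df["Quantity"] * df["Price"]).rename("y")
    revenue.index = pd.DatetimeIndex(df["InvoiceDate"], name="ds")
    # resample() emits every period in the range; periods without sales sum to 0.
    return revenue.resample(freq).sum().reset_index()


# Tiny made-up frame, used only to exercise the two functions.
raw = pd.DataFrame(
    {
        "Invoice": ["489434", "C489449", "489435", "489436"],
        "Quantity": [2, -1, 3, 0],
        "Price": [1.5, 4.0, 2.0, 5.0],
        "InvoiceDate": [
            "2009-12-01 07:45",
            "2009-12-01 10:33",
            "2009-12-03 09:00",
            "2009-12-03 09:30",
        ],
    }
)
daily = aggregate_sales_sketch(clean_transactions_sketch(raw), freq="D")
print(daily)
```

Resampling rather than grouping on the observed dates keeps the series contiguous, which is convenient when the output feeds a forecasting model that expects regularly spaced timestamps.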


In [1]:
import sys
from pathlib import Path

# Resolve the parent directory and make it importable so that `src` can be found.
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))


In [2]:
from src.data_prep import (
    load_raw_transactions,
    clean_transactions,
    aggregate_sales,
    save_aggregated_sales,
)


In [3]:
raw_df = load_raw_transactions()
clean_df = clean_transactions(raw_df)
daily_df = aggregate_sales(clean_df, freq="D")
monthly_df = aggregate_sales(clean_df, freq="M")

daily_path = save_aggregated_sales(daily_df, freq="D")
monthly_path = save_aggregated_sales(monthly_df, freq="M")

daily_df.head()


|   | ds | y |
|---|------------|----------|
| 0 | 2009-12-01 | 54513.50 |
| 1 | 2009-12-02 | 63352.51 |
| 2 | 2009-12-03 | 74037.91 |
| 3 | 2009-12-04 | 40732.92 |
| 4 | 2009-12-05 | 9803.05 |


In [4]:
print(f"Daily data stored at: {daily_path}")
print(f"Monthly data stored at: {monthly_path}")
summary = {
    "daily_rows": len(daily_df),
    "monthly_rows": len(monthly_df),
    "daily_range": f"{daily_df['ds'].min()} -> {daily_df['ds'].max()}",
}
summary


Daily data stored at: ..\data\processed\sales_daily.csv
Monthly data stored at: ..\data\processed\sales_monthly.csv


{'daily_rows': 739,
 'monthly_rows': 25,
 'daily_range': '2009-12-01 00:00:00 -> 2011-12-09 00:00:00'}
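
The summary can be cross-checked against the calendar: the reported range spans 739 days inclusive, which matches `daily_rows` and is consistent with the daily series having no missing dates. The sketch below reproduces that arithmetic and adds a hypothetical helper, `missing_days`, that could be pointed at `daily_df["ds"]` to list any gaps. The helper is an illustration and not part of `src.data_prep`.

```python
import pandas as pd


def missing_days(dates: pd.Series) -> pd.DatetimeIndex:
    """Return the calendar days absent from a daily date column."""
    ds = pd.to_datetime(dates).dt.normalize()
    expected = pd.date_range(ds.min(), ds.max(), freq="D")
    return expected.difference(pd.DatetimeIndex(ds))


# Range reported in the summary: 2009-12-01 through 2011-12-09, inclusive.
span = pd.date_range("2009-12-01", "2011-12-09", freq="D")
print(len(span))  # 739, equal to daily_rows

# Made-up example with one gap, to show what the helper returns.
gaps = missing_days(pd.Series(["2009-12-01", "2009-12-02", "2009-12-04"]))
print(list(gaps))
```

In the notebook itself, `missing_days(daily_df["ds"])` would be expected to return an empty index if the row count and date range agree as they do here.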