# Prepare time series data for causal impact evaluation

This notebook shows how to prepare a daily time series dataset for **causal impact / synthetic control** style evaluation. It covers:

- the input schema and required columns
- handling of missing dates
- feature engineering (time trend, day-of-week)
- saving the prepared dataset for downstream notebooks


In [None]:
import sys
from pathlib import Path

# Make `src/` importable. This assumes the notebook is launched from the repo root,
# because `Path.cwd()` is simply the current working directory.
repo_root = Path.cwd()
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from tecore.causal import DataSpec
from tecore.causal.schema import validate_timeseries_df
from tecore.causal.preprocess import prepare_timeseries


## 1) Load a dataset

You can use either:
- `data/example_ts/example_daily.csv` (the bundled example)
- your own CSV, which must include the `date` and `y` columns and may include covariate columns
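
If you bring your own CSV, a quick pre-check for the required columns can fail early with a clear message. The sketch below uses plain pandas; `missing_required_columns` is a hypothetical helper written for this illustration, not part of `tecore`, and the in-memory CSV only stands in for a real file.

```python
import io

import pandas as pd


# Hypothetical helper (not part of tecore): list the required columns a frame lacks.
def missing_required_columns(df, required=("date", "y")):
    return [c for c in required if c not in df.columns]


# Tiny in-memory CSV standing in for your own file.
csv_text = "date,y,sessions\n2025-01-01,10.0,100\n2025-01-02,12.5,110\n"
df_check = pd.read_csv(io.StringIO(csv_text))

missing = missing_required_columns(df_check)
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}")
```

Covariate columns are optional, so the check only looks at `date` and `y`.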


In [None]:
data_path = repo_root / "data" / "example_ts" / "example_daily.csv"
df_raw = pd.read_csv(data_path)
df_raw.head()

## 2) Define the input spec

For v1 we focus on a single treated unit:

- `date`: daily index (YYYY-MM-DD)
- `y`: target KPI (revenue/orders/margin/…)
- `x_cols`: covariates (controls)

Recommended covariates:
- sessions
- daily active users (DAU)
- marketing spend
- an external index


In [None]:
spec = DataSpec(
    date_col="date",
    y_col="y",
    x_cols=["sessions", "active_users", "marketing_spend", "external_index"],
    freq="D",
    missing_policy="raise",
    aggregation="mean",
    add_time_trend=True,
    add_day_of_week=True,
    winsorize_p=None,
)

df_valid = validate_timeseries_df(df_raw, spec)
df_valid.dtypes

## 3) Sanity plots

Plot `y` and each covariate over time to spot gaps, outliers, or level shifts before preparing the data.
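
Short gaps are easy to miss in a plot, so a numeric check can complement the charts. The following is a minimal sketch in plain pandas on a toy frame (`df_demo` is made up for the illustration); it counts missing calendar days, NaN values, and duplicated dates, and it does not reproduce whatever checks `validate_timeseries_df` performs internally.

```python
import pandas as pd

# Toy daily series: 2025-01-03 is absent and one y value is NaN.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-04", "2025-01-05"]),
    "y": [10.0, 11.0, None, 12.0],
})

# Calendar days between the first and last observation that have no row.
full_range = pd.date_range(df_demo["date"].min(), df_demo["date"].max(), freq="D")
missing_dates = full_range.difference(pd.DatetimeIndex(df_demo["date"]))

# NaN count per non-date column, and number of repeated dates.
nan_counts = df_demo.drop(columns="date").isna().sum()
duplicate_dates = int(df_demo["date"].duplicated().sum())

print("missing dates:", list(missing_dates.strftime("%Y-%m-%d")))
print("NaN counts:", nan_counts.to_dict())
print("duplicate dates:", duplicate_dates)
```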

In [None]:
# Work on a copy and make sure the date column is datetime so the x-axis is a time axis
df_plot = df_valid.copy()
df_plot[spec.date_col] = pd.to_datetime(df_plot[spec.date_col])

plt.figure()
plt.plot(df_plot[spec.date_col], df_plot[spec.y_col])
plt.title("Target series y")
plt.tight_layout()
plt.show()

for c in spec.x_cols:
    plt.figure()
    plt.plot(df_plot[spec.date_col], df_plot[c])
    plt.title(f"Covariate: {c}")
    plt.tight_layout()
    plt.show()

## 4) Prepare data for the causal module

Preparation performs the following steps:
- align to a complete calendar (daily)
- apply missing-date policy
- add features: `_t` (time trend) and `dow_*` (day-of-week dummies)
- split into pre/post using an intervention date


In [None]:
intervention_date = pd.Timestamp("2025-05-01")

prepared = prepare_timeseries(df_valid, spec, intervention_date=intervention_date)
prepared.df.head()

### Prepared dataset summary

In [None]:
print("n_total:", len(prepared.df))
print("n_pre:", int(prepared.pre_mask.sum()))
print("n_post:", int(prepared.post_mask.sum()))
print("features:", len(prepared.feature_cols))
prepared.feature_cols[:20]

## 5) Save prepared dataset

We save the prepared dataset so that the following notebooks can load it.

Note: the causal module will re-run preprocessing internally; this export is mainly for transparency and demos.
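
When the exported CSV is read back, the date column comes in as plain strings unless it is parsed explicitly. The sketch below shows a round trip on a toy frame written to a temporary directory; the file name `prepared_demo.csv` and the frame contents are made up for the illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for a prepared dataset.
frame = pd.DataFrame({
    "date": pd.date_range("2025-04-29", periods=3, freq="D"),
    "y": [11.0, 12.0, 13.0],
    "_t": [0, 1, 2],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prepared_demo.csv"
    frame.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store.
    reloaded = pd.read_csv(path, parse_dates=["date"])

dates_match = reloaded["date"].tolist() == frame["date"].tolist()
print(reloaded.dtypes)
print("dates match:", dates_match)
```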

In [None]:
out_dir = repo_root / "data" / "prepared_ts"
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "prepared_example_daily.csv"
prepared.df.to_csv(out_path, index=False)

print("Wrote:", out_path)