# 00 – Prepare Data

This notebook documents the preprocessing steps. Run the CLI commands below from the project root to regenerate the processed UNSW-NB15 and CESNET datasets under `data_processed/`.

## Commands

```bash
python scripts/make_flows_from_unsw.py \
  --train data_raw/unsw_nb15/Training\ and\ Testing\ Sets/UNSW_NB15_training-set.csv \
  --test data_raw/unsw_nb15/Training\ and\ Testing\ Sets/UNSW_NB15_testing-set.csv \
  --output data_processed \
  --drop-duplicates

python scripts/build_cesnet_residuals.py \
  --agg-dir data_raw/cesnet/ip_addresses_sample/ip_addresses_sample/agg_10_minutes \
  --ids data_raw/cesnet/ids_relationship.csv \
  --times data_raw/cesnet/times/times_10_minutes.csv \
  --calendar data_raw/cesnet/weekends_and_holidays.csv \
  --outdir data_processed \
  --max-files 500 \
  --lag-windows 1 2 3 6 12 \
  --roll-window 12 \
  --contamination 0.03
```
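Both commands fail if a raw input is missing, so it can help to check the paths first. The following is a minimal sketch, not part of the project scripts: the helper name `missing_inputs` is hypothetical, and the paths are copied from the commands above and resolved against the project root.

```python
from pathlib import Path

# Raw inputs referenced by the commands above, relative to the project root.
RAW_INPUTS = [
    "data_raw/unsw_nb15/Training and Testing Sets/UNSW_NB15_training-set.csv",
    "data_raw/unsw_nb15/Training and Testing Sets/UNSW_NB15_testing-set.csv",
    "data_raw/cesnet/ip_addresses_sample/ip_addresses_sample/agg_10_minutes",
    "data_raw/cesnet/ids_relationship.csv",
    "data_raw/cesnet/times/times_10_minutes.csv",
    "data_raw/cesnet/weekends_and_holidays.csv",
]


def missing_inputs(root, relative_paths=RAW_INPUTS):
    """Return the relative paths (files or directories) that do not exist under root.

    Hypothetical helper for illustration; it is not defined anywhere in the project.
    """
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


# Run from the project root, like the commands themselves.
for rel_path in missing_inputs("."):
    print(f"Missing input: {rel_path}")
```

If nothing is printed, every input referenced by the two commands is present.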

## Verify Outputs

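```python
from pathlib import Path

import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

# Assumes this notebook lives one level below the project root.
data_dir = Path('..') / 'data_processed'
files = [
    'flows_clean.csv',
    'label_category_map.csv',
    'cesnet_windows_train.csv',
    'cesnet_windows_test.csv',
]

for fname in files:
    path = data_dir / fname
    if not path.exists():
        print(f'Missing {fname}')
        continue
    # Read only the first five rows to keep the preview cheap.
    df = pd.read_csv(path, nrows=5)
    print(f'Preview of {fname}')
    display(df)
    print('-' * 40)
```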
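A five-row preview confirms that a file exists and parses, but it does not reveal an empty or truncated output. As a complementary check, the sketch below counts the data rows and columns of each file with the standard-library `csv` module, streaming the file instead of loading it into memory. The helper name `summarize_csv` is hypothetical, and the sketch assumes each output has a single header row.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (data_row_count, column_names) for a CSV file.

    Hypothetical helper for illustration; assumes the first record is the header.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])        # empty list if the file has no content
        n_rows = sum(1 for _ in reader)  # remaining records are data rows
    return n_rows, header


data_dir = Path("..") / "data_processed"
for fname in [
    "flows_clean.csv",
    "label_category_map.csv",
    "cesnet_windows_train.csv",
    "cesnet_windows_test.csv",
]:
    path = data_dir / fname
    if path.exists():
        n_rows, columns = summarize_csv(path)
        print(f"{fname}: {n_rows} rows, {len(columns)} columns")
    else:
        print(f"Missing {fname}")
```

A row count of zero, or a count far below what the raw data suggests, is a sign that the corresponding command should be rerun.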


## Next
- Proceed to `01_eda_flows.ipynb` for exploratory analysis.
- Training commands are in `README.md` under *Quick commands*.