# 00 – Prepare Data

This notebook documents the preprocessing steps. Run the CLI commands below from the project root to regenerate the processed datasets in `data_processed/`.

## Commands

```bash
python scripts/make_flows_from_unsw.py \
  --train data_raw/unsw_nb15/Training\ and\ Testing\ Sets/UNSW_NB15_training-set.csv \
  --test data_raw/unsw_nb15/Training\ and\ Testing\ Sets/UNSW_NB15_testing-set.csv \
  --output data_processed \
  --drop-duplicates

./scripts/build_cesnet_residuals.py \
  --agg-dir data_raw/cesnet/ip_addresses_sample/ip_addresses_sample/agg_10_minutes \
  --ids data_raw/cesnet/ids_relationship.csv \
  --times data_raw/cesnet/times/times_10_minutes.csv \
  --calendar data_raw/cesnet/weekends_and_holidays.csv \
  --outdir data_processed \
  --max-files 500 \
  --lag-windows 1 2 3 6 12 \
  --roll-window 12 \
  --contamination 0.03
```
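
The `--lag-windows` and `--roll-window` flags line up with the `lagK_*` and `rolling_*_12_bytes` columns shown in the CESNET previews further down. The script itself is not reproduced in this notebook, so the following is only a minimal sketch of how such features could be derived with pandas. `add_lag_features` is a hypothetical helper, and letting the rolling window include the current row is an assumption (the preview values are consistent with it).

```python
import pandas as pd


def add_lag_features(df, lags=(1, 2, 3, 6, 12), roll_window=12):
    """Hypothetical helper: add per-IP lag and rolling features."""
    out = df.sort_values(["ip_id", "id_time"]).copy()
    for lag in lags:
        # lagK_* is the value observed K windows earlier for the same ip_id.
        for col in ("n_bytes", "n_flows"):
            out[f"lag{lag}_{col}"] = out.groupby("ip_id")[col].shift(lag)
    # Assumption: the rolling window ends at (and includes) the current row.
    out[f"rolling_mean_{roll_window}_bytes"] = out.groupby("ip_id")["n_bytes"].transform(
        lambda s: s.rolling(roll_window).mean()
    )
    out[f"rolling_std_{roll_window}_bytes"] = out.groupby("ip_id")["n_bytes"].transform(
        lambda s: s.rolling(roll_window).std()
    )
    return out


# First rows of the train preview, with shortened windows for illustration.
sample = pd.DataFrame({
    "ip_id": [11] * 5,
    "id_time": [12, 13, 14, 15, 16],
    "n_bytes": [6377998, 6308482, 6742263, 6250844, 5586679],
    "n_flows": [35542, 34848, 37401, 37032, 31226],
})
features = add_lag_features(sample, lags=(1, 2), roll_window=2)
print(features[["id_time", "lag2_n_bytes", "rolling_mean_2_bytes"]])
```

For the row with `id_time` 14, the sketch yields a `lag2_n_bytes` of 6377998, which matches the train preview. The first `roll_window - 1` rows of each IP have no rolling value in this sketch; how the real script treats those warm-up rows is not documented here.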

## Verify Outputs

```python
from pathlib import Path

import pandas as pd
from IPython.display import display

# Assumes the notebook lives one level below the project root, hence the '..'.
data_dir = Path('..') / 'data_processed'
files = [
    'flows_clean.csv',
    'label_category_map.csv',
    'cesnet_windows_train.csv',
    'cesnet_windows_test.csv',
]

for fname in files:
    path = data_dir / fname
    if not path.exists():
        print(f'Missing {fname}')
        continue
    # Read only the first five rows; a preview does not need the full file.
    df = pd.read_csv(path, nrows=5)
    print(f'Preview of {fname}')
    display(df.head())
    print('-' * 40)
```


Preview of flows_clean.csv

```text
Unnamed: 0,dur,protocol_type,service,flag,spkts,dpkts,sbytes,dbytes,rate,sttl,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label_family
0,0.121478,tcp,UNK,FIN,6,4,258,172,74.08749,252,...,1,1,1,0,0,0,1,1,0,Normal
1,0.649902,tcp,UNK,FIN,14,38,734,42014,78.473372,62,...,1,1,2,0,0,0,1,6,0,Normal
2,1.623129,tcp,UNK,FIN,8,16,364,13186,14.170161,62,...,1,1,3,0,0,0,2,6,0,Normal
3,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,62,...,1,1,3,1,1,0,2,1,0,Normal
4,0.449454,tcp,UNK,FIN,10,6,534,268,33.373826,254,...,2,1,40,0,0,0,2,39,0,Normal
```
----------------------------------------
Preview of label_category_map.csv

```text
Unnamed: 0,label,category
0,Analysis,Attack
1,Backdoor,Attack
2,DoS,Attack
3,Exploits,Attack
4,Fuzzers,Attack
```
----------------------------------------
Preview of cesnet_windows_train.csv

```text
Unnamed: 0,ip_id,id_time,time,id_institution,n_bytes,hour,dow,is_weekend,is_holiday,n_flows,...,lag2_n_bytes,lag2_n_flows,lag3_n_bytes,lag3_n_flows,lag6_n_bytes,lag6_n_flows,lag12_n_bytes,lag12_n_flows,rolling_mean_12_bytes,rolling_std_12_bytes
0,11,12,2023-10-09 02:03:49,1,6377998,2,0,0,0,35542,...,6354128.0,36352.0,6870070.0,38091.0,6678362.0,36003.0,5806461.0,31049.0,6204462.0,564613.128235
1,11,13,2023-10-09 02:13:49,1,6308482,2,0,0,0,34848,...,7122921.0,37858.0,6354128.0,36352.0,5923330.0,35158.0,5887159.0,32765.0,6239572.0,556124.07301
2,11,14,2023-10-09 02:23:49,1,6742263,2,0,0,0,37401,...,6377998.0,35542.0,7122921.0,37858.0,6601350.0,37478.0,5432005.0,30469.0,6348761.0,509855.740136
3,11,15,2023-10-09 02:33:49,1,6250844,2,0,0,0,37032,...,6308482.0,34848.0,6377998.0,35542.0,6870070.0,38091.0,5271808.0,29960.0,6430347.0,384869.080831
4,11,16,2023-10-09 02:43:49,1,5586679,2,0,0,0,31226,...,6742263.0,37401.0,6308482.0,34848.0,6354128.0,36352.0,6082582.0,35818.0,6389022.0,447183.934774
```
----------------------------------------
Preview of cesnet_windows_test.csv

```text
Unnamed: 0,ip_id,id_time,time,id_institution,n_bytes,hour,dow,is_weekend,is_holiday,n_flows,...,lag3_n_flows,lag6_n_bytes,lag6_n_flows,lag12_n_bytes,lag12_n_flows,rolling_mean_12_bytes,rolling_std_12_bytes,n_bytes_pred,residual,is_anom
0,11,27914,2024-04-19 21:30:52,1,6000603,21,4,0,0,38810,...,35491.0,5114212.0,35529.0,5786994.0,37475.0,5450101.0,530422.498259,10135020.0,-4134422.0,0
1,11,27915,2024-04-19 21:40:52,1,5404360,21,4,0,0,37676,...,35007.0,6747297.0,34482.0,5289450.0,37126.0,5459677.0,528291.535421,2009177.0,3395183.0,0
2,11,27916,2024-04-19 21:50:52,1,5153714,21,4,0,0,34599,...,37850.0,4759196.0,32364.0,5103750.0,34021.0,5463841.0,525420.412907,2009177.0,3144537.0,0
3,11,27917,2024-04-19 22:00:52,1,5827300,22,4,0,0,38545,...,38810.0,5457591.0,35491.0,5431656.0,38511.0,5496811.0,535533.302495,9188815.0,-3361515.0,0
4,11,27918,2024-04-19 22:10:52,1,5216288,22,4,0,0,34107,...,37676.0,5497490.0,35007.0,5152944.0,34155.0,5502090.0,532137.147174,2009177.0,3207111.0,0
```
----------------------------------------
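
In the test preview, `residual` appears to equal `n_bytes - n_bytes_pred` (for example, 5404360 - 2009177 = 3395183). The sketch below is one possible sanity check on that relationship. `check_residuals` and its tolerance are hypothetical, and reading `--contamination 0.03` as the expected share of rows flagged in `is_anom` is an assumption about the script that this notebook does not confirm.

```python
import pandas as pd


def check_residuals(df, tol=1.0):
    """Hypothetical check: return rows where residual != n_bytes - n_bytes_pred."""
    expected = df["n_bytes"] - df["n_bytes_pred"]
    return df[(df["residual"] - expected).abs() > tol]


# Three rows copied from the test preview above (rows 1, 2 and 4).
preview = pd.DataFrame({
    "n_bytes": [5404360, 5153714, 5216288],
    "n_bytes_pred": [2009177.0, 2009177.0, 2009177.0],
    "residual": [3395183.0, 3144537.0, 3207111.0],
    "is_anom": [0, 0, 0],
})

mismatches = check_residuals(preview)
anomaly_rate = preview["is_anom"].mean()
print(f"mismatched rows: {len(mismatches)}, anomaly rate: {anomaly_rate:.3f}")
```

On the full file, the same check could be run on `pd.read_csv(data_dir / 'cesnet_windows_test.csv')`, comparing `anomaly_rate` with 0.03 if the assumption above holds. The displayed preview rounds large floats, so values copied from it (such as the first row) can differ by a few units; the tolerance exists for that reason.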


## Next
- Proceed to `01_eda_flows.ipynb` for exploratory analysis.
- Training commands are in `README.md` under *Quick commands*.