# MITRE ATT&CK Command Classification: Data Preparation

This notebook prepares the dataset used throughout the project:
- Load the raw dataset from `data/raw/`
- Build a deduplicated dataset (`df_final`) using frequency-aware aggregation
- Save the processed dataset to `data/processed/df_final.csv`
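
As a rough illustration of what "frequency-aware aggregation" could look like, the sketch below collapses duplicate commands while recording how often each one occurred. It is not the project's implementation (that lives behind `load_df_final`). The `command` column name, the majority-label rule, and the helper name `dedupe_with_counts` are assumptions made for this example; only `technique_grouped` and `count` appear in this notebook.

```python
import pandas as pd


def dedupe_with_counts(raw, text_col="command", label_col="technique_grouped"):
    """Hypothetical helper: one row per command, majority label, frequency count."""
    # Count how often each (command, label) pair occurs in the raw data.
    pairs = (
        raw.groupby([text_col, label_col])
        .size()
        .reset_index(name="pair_count")
    )
    # Put the most frequent label first within each command
    # (ties are broken alphabetically by label, an arbitrary choice).
    pairs = pairs.sort_values(
        [text_col, "pair_count", label_col], ascending=[True, False, True]
    )
    majority = pairs.drop_duplicates(subset=text_col, keep="first")
    # Total number of raw rows per command, stored as `count`.
    totals = raw.groupby(text_col).size().rename("count").reset_index()
    out = majority[[text_col, label_col]].merge(totals, on=text_col)
    return out.reset_index(drop=True)


# Toy data: "whoami" appears three times with two different labels.
toy = pd.DataFrame(
    {
        "command": ["whoami", "whoami", "whoami", "net user", "net user"],
        "technique_grouped": ["T1033", "T1033", "T1087", "T1087", "T1087"],
    }
)
deduped = dedupe_with_counts(toy)
print(deduped)
```

In this toy case `whoami` keeps its majority label `T1033` with a `count` of 3, so the frequency information survives deduplication and can later serve as a sample weight or as class mass.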


In [None]:
import sys
from pathlib import Path

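def find_repo_root(start: Path) -> Path:
    """Walk upward from `start` to the first directory containing both `command_classifier/` and `data/`."""
    for p in [start] + list(start.parents):
        if (p / "command_classifier").exists() and (p / "data").exists():
            return p
    # No matching ancestor found: fall back to the starting directory.
    return start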

ROOT = find_repo_root(Path.cwd())
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print("Repo root:", ROOT)


In [None]:
from command_classifier.common import get_paths

paths = get_paths()
print("Raw CSV:", paths.raw_csv)
print("Processed df_final:", paths.df_final_csv)


In [None]:
from command_classifier.common import load_df_final

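# Rebuild from the raw CSV instead of reusing an existing processed file,
# and save the result to the processed path.
df_final = load_df_final(prefer_processed=False, save_processed=True)
df_final.head()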


In [None]:
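# Size of the deduplicated dataset and number of distinct classes.
print("Rows:", len(df_final))
print("Classes:", df_final["technique_grouped"].nunique())
# Deduplicated rows per class (top 15).
df_final["technique_grouped"].value_counts().head(15)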


In [None]:
# Per-class sum of the `count` column (top 15), as opposed to the row counts above.
mass = df_final.groupby("technique_grouped")["count"].sum().sort_values(ascending=False)
mass.head(15)


`df_final` is now available at `data/processed/df_final.csv`.

Next notebooks:
- `02_ml_lr.ipynb` trains a TF-IDF + Logistic Regression baseline (with a custom token pattern).
- `03_dl_lstm.ipynb` trains an LSTM baseline.
