# Pipeline commands (pure Python)

This notebook runs the project CLIs via `subprocess` using the **current Python interpreter** (`sys.executable`).
It avoids `%%bash` and shell quoting issues.

Each cell prints the command's **STDOUT** and **STDERR**, then raises an exception if the command exited with a non-zero return code.


In [4]:
from __future__ import annotations

from pathlib import Path
import os
import shlex
import subprocess
import sys
from typing import Optional

def find_pipeline_dir(start: Optional[Path] = None) -> Path:
    """Locate the repo's `pipeline/` directory by searching upward from `start` (or CWD)."""
    p = (start or Path.cwd()).resolve()
    for cur in [p, *p.parents]:
        # Case A: repo root contains pipeline/
        if (cur / "pipeline").is_dir():
            return (cur / "pipeline").resolve()
        # Case B: already inside pipeline/
        if cur.name == "pipeline" and (cur / "pair_dataset").exists():
            return cur.resolve()
    raise FileNotFoundError(f"Could not locate pipeline/ from {p}")

def run_module(module: str, *args: str, cwd: Optional[Path] = None) -> None:
    """Run `python -m <module> ...` and print stdout/stderr."""
    cmd = [sys.executable, "-m", module, *args]
    # shlex.join quotes each argument, so empty or space-containing values stay visible.
    print("Running:\n ", shlex.join(cmd))
    res = subprocess.run(
        cmd,
        text=True,
        capture_output=True,
        cwd=str(cwd) if cwd else None,
    )
    print("returncode =", res.returncode)
    if res.stdout:
        print("\nSTDOUT:\n", res.stdout)
    if res.stderr:
        print("\nSTDERR:\n", res.stderr)
    res.check_returncode()

pipeline_dir = find_pipeline_dir()
os.chdir(pipeline_dir)
print("cwd =", Path.cwd())


cwd = /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline
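
The `run_module` helper above relies on two `subprocess` behaviours: `capture_output=True` collects both streams on the result object, and `check_returncode()` raises `CalledProcessError` only when the exit code is non-zero. The minimal sketch below shows both with a throwaway `python -c` child. The name `run_snippet` is made up for this illustration and is not part of the project.

```python
import subprocess
import sys

def run_snippet(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child interpreter, capturing both streams."""
    return subprocess.run(
        [sys.executable, "-c", code],
        text=True,
        capture_output=True,
    )

good = run_snippet("print('hello')")
bad = run_snippet("import sys; print('oops', file=sys.stderr); sys.exit(3)")

print(good.returncode, repr(good.stdout))
print(bad.returncode, repr(bad.stderr))

raised = False
try:
    bad.check_returncode()  # raises because the exit code is 3
except subprocess.CalledProcessError as exc:
    raised = True
    print("raised CalledProcessError with returncode", exc.returncode)
```

A command that exits with 0 never raises here, even when its output reports problems, so the printed output is still worth reading.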


## 1) Preview pair sampling (synthetic dataset)

Adjust `DATASET_ROOT` and `CLONE_TYPE` as needed.


In [5]:
from pathlib import Path

DATASET_ROOT = (Path.cwd() / "../data/code-clone-dataset/dataset").resolve()
CLONE_TYPE = "type-3"   # type-1 | type-2 | type-3

run_module(
    "pair_dataset.cli_preview",
    "--root", str(DATASET_ROOT),
    "--clone-type", CLONE_TYPE,
    "--neg-pool", "same_clone_type",
    "--n", "5",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m pair_dataset.cli_preview --root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset --clone-type type-3 --neg-pool same_clone_type --n 5
returncode = 0

STDOUT:
 indices=100 indices_with_clones=100 total_clones=300 clone_type=type-3 neg_pool=same_clone_type

POSITIVE samples:
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/16/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/16/2.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/39/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/39/1.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/93/main.java -> /Users/jonas/Documents/NTNU/Bachelor/
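
The positive samples above pair `base/<index>/main.java` with a file under `type-3/<index>/`. As a rough illustration of that layout only, the sketch below enumerates such pairs from a directory tree. It is an assumption about how pairs could be listed, not the code behind `pair_dataset.cli_preview`; the function name `positive_pairs` is hypothetical, and the demo builds a tiny temporary tree instead of reading the real dataset.

```python
from pathlib import Path
import tempfile

def positive_pairs(root: Path, clone_type: str) -> list[tuple[Path, Path]]:
    """Pair each base/<index>/main.java with every <clone_type>/<index>/*.java."""
    pairs = []
    for base_file in sorted((root / "base").glob("*/main.java")):
        index = base_file.parent.name
        for clone in sorted((root / clone_type / index).glob("*.java")):
            pairs.append((base_file, clone))
    return pairs

# Build a small throwaway tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for index in ("01", "02"):
        (root / "base" / index).mkdir(parents=True)
        (root / "base" / index / "main.java").write_text("class Main {}\n")
        (root / "type-3" / index).mkdir(parents=True)
        for k in (1, 2, 3):
            (root / "type-3" / index / f"{k}.java").write_text("class Main {}\n")
    demo_pairs = positive_pairs(root, "type-3")

print(len(demo_pairs))  # 2 indices x 3 clones each = 6
```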

## 2) Preview pair sampling (Google Code Jam compiled dataset)

This uses the bucket-based layout under `gcj_compiled/` (buckets 1..12).


In [6]:
from pathlib import Path

GCJ_ROOT = (Path.cwd() / "../data/gcj_compiled").resolve()

run_module(
    "pair_dataset_googlejam.cli_preview",
    "--root", str(GCJ_ROOT),
    "--limit-buckets", "4",
    "--n", "10",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m pair_dataset_googlejam.cli_preview --root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled --limit-buckets 4 --n 10
returncode = 0

STDOUT:
 Samples:
0 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/4/p044/googlejam4/p044/Main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/2/p223/googlejam2/p223/CounterCulture.java
0 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/4/p112/googlejam4/p112/a.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/2/p315/googlejam2/p315/A.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/4/p144/googlejam4/p144/A.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/4/p044/googlejam4/p044/Main.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddi
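
In the samples visible above, pairs labelled `1` come from the same bucket directory and pairs labelled `0` come from different buckets. The sketch below encodes that reading as a rule. It is inferred from a handful of output lines, so treat it as an assumption rather than the CLI's documented behaviour; `bucket_of` and `pair_label` are hypothetical names, and the root path is a placeholder (no files are read).

```python
from pathlib import Path

def bucket_of(path: Path, root: Path) -> str:
    """Return the first path component below `root`, i.e. the bucket name."""
    return path.relative_to(root).parts[0]

def pair_label(a: Path, b: Path, root: Path) -> int:
    """1 if both files sit in the same bucket, else 0 (assumed labelling rule)."""
    return int(bucket_of(a, root) == bucket_of(b, root))

root = Path("/data/gcj_compiled")  # placeholder root for illustration

same_bucket = pair_label(
    root / "4/p144/googlejam4/p144/A.java",
    root / "4/p044/googlejam4/p044/Main.java",
    root,
)
cross_bucket = pair_label(
    root / "4/p044/googlejam4/p044/Main.java",
    root / "2/p223/googlejam2/p223/CounterCulture.java",
    root,
)
print(same_bucket, cross_bucket)
```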

## 3) Build program artifacts (synthetic dataset)

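Creates `program_index.json` plus per-program artifact directories, including embed shards. The synthetic run needs a JDK (`--jdk-home`, read from `JAVA_HOME` here) and the Vineflower decompiler JAR (`--vineflower`). In the recorded run below, `--jdk-home` was passed as an empty string and all 40 programs failed, although the process still exited with return code 0.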


In [7]:
from pathlib import Path

DATASET_ROOT = (Path.cwd() / "../data/code-clone-dataset/dataset").resolve()
OUT_DIR = (Path.cwd() / "./program_artifacts").resolve()

run_module(
    "program_project.cli",
    "--dataset-root", str(DATASET_ROOT),
    "--clone-type", "type-3",
    "--out", str(OUT_DIR),
    "--jdk-home", os.environ.get("JAVA_HOME", ""),
    "--vineflower", str((Path.cwd() / "../chatgpt/vineflower-1.11.2.jar").resolve()),
    "--model", "microsoft/graphcodebert-base",
    "--device", "mps",
    "--limit-indices", "10",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m program_project.cli --dataset-root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset --clone-type type-3 --out /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts --jdk-home  --vineflower /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/chatgpt/vineflower-1.11.2.jar --model microsoft/graphcodebert-base --device mps --limit-indices 10
returncode = 0

STDOUT:
 Wrote program index: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts/program_index.json
Programs ok: 0  failed: 40
First failure: {'source_path': '/Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/01/main.java', 'error': 'Internal error: jdk_home/vineflower_jar missing for synthetic run'}
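
The failure above comes from `os.environ.get("JAVA_HOME", "")` returning an empty string. A small pre-flight check, sketched below, could surface this before the CLI is launched. The function name `check_java_tooling` and its messages are hypothetical; it only inspects the two values and makes no claim about what `program_project.cli` validates internally.

```python
import os
from pathlib import Path

def check_java_tooling(jdk_home: str, vineflower_jar: str) -> list[str]:
    """Return human-readable problems; an empty list means both inputs look usable."""
    problems = []
    if not jdk_home:
        problems.append("JDK home is empty (is JAVA_HOME set?)")
    elif not Path(jdk_home).is_dir():
        problems.append(f"JDK home is not a directory: {jdk_home}")
    if not vineflower_jar:
        problems.append("Vineflower JAR path is empty")
    elif not Path(vineflower_jar).is_file():
        problems.append(f"Vineflower JAR not found: {vineflower_jar}")
    return problems

# Same inputs the cell above would pass to --jdk-home and --vineflower.
found = check_java_tooling(
    os.environ.get("JAVA_HOME", ""),
    str(Path.cwd() / "../chatgpt/vineflower-1.11.2.jar"),
)
for problem in found:
    print("WARNING:", problem)
```

Raising an exception instead of printing a warning would stop the notebook before a run that cannot succeed.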



## 4) Build program artifacts (GCJ compiled dataset)

Uses the already-compilable files in `gcj_compiled/` and passes neither `--jdk-home` nor `--vineflower`. Produces `program_artifacts_googlejam/`.


In [8]:
from pathlib import Path

GCJ_ROOT = (Path.cwd() / "../data/gcj_compiled").resolve()
OUT_DIR = (Path.cwd() / "./program_artifacts_googlejam").resolve()

run_module(
    "program_project.cli",
    "--dataset-root", str(GCJ_ROOT),
    "--clone-type", "googlejam",
    "--out", str(OUT_DIR),
    "--model", "microsoft/graphcodebert-base",
    "--device", "mps",
    "--limit-indices", "4",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m program_project.cli --dataset-root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled --clone-type googlejam --out /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts_googlejam --model microsoft/graphcodebert-base --device mps --limit-indices 4
returncode = 0

STDOUT:
 Wrote program index: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts_googlejam/program_index.json
Programs ok: 268  failed: 0
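
Both artifact builds exit with return code 0 regardless of how many programs failed, so `check_returncode()` alone does not catch a run like the one in section 3. One option, sketched here, is to parse the `Programs ok: N  failed: M` summary line from the captured output. The regular expression is based only on the two summary lines recorded in this notebook, and `parse_program_summary` is a hypothetical helper; `run_module` as defined above returns `None`, so it would have to return `res.stdout` for this to be wired in.

```python
import re

SUMMARY_RE = re.compile(r"Programs ok:\s*(\d+)\s+failed:\s*(\d+)")

def parse_program_summary(stdout: str) -> tuple[int, int]:
    """Return (ok, failed) from the summary line, or raise if it is absent."""
    match = SUMMARY_RE.search(stdout)
    if match is None:
        raise ValueError("no 'Programs ok: N  failed: M' line found")
    return int(match.group(1)), int(match.group(2))

# Summary lines as they appear in the recorded outputs of sections 3 and 4.
synthetic_ok, synthetic_failed = parse_program_summary("Programs ok: 0  failed: 40\n")
gcj_ok, gcj_failed = parse_program_summary("Programs ok: 268  failed: 0\n")

print("synthetic:", synthetic_ok, synthetic_failed)
print("gcj:", gcj_ok, gcj_failed)
```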



## 5) Train program-level model (synthetic dataset)

Trains for 2000 steps with a validation split (`--val-ratio 0.2`) and evaluates every 100 steps (`--eval-every 100`).


In [9]:
from pathlib import Path

PROGRAM_INDEX = (Path.cwd() / "./program_artifacts/program_index.json").resolve()
DATASET_ROOT = (Path.cwd() / "../data/code-clone-dataset/dataset").resolve()

run_module(
    "gnn_train_program.cli",
    "--program-index", str(PROGRAM_INDEX),
    "--dataset-root", str(DATASET_ROOT),
    "--clone-type", "type-3",
    "--limit-indices", "10",
    "--val-ratio", "0.2",
    "--steps", "2000",
    "--batch-pairs", "32",
    "--pos-ratio", "0.5",
    "--device", "mps",
    "--eval-every", "100",
    "--val-pairs", "200",
    "--log-every", "50",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m gnn_train_program.cli --program-index /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts/program_index.json --dataset-root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset --clone-type type-3 --limit-indices 10 --val-ratio 0.2 --steps 2000 --batch-pairs 32 --pos-ratio 0.5 --device mps --eval-every 100 --val-pairs 200 --log-every 50
returncode = 0

STDOUT:
 Saved to /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/gnn_models_program


STDERR:
 
Training(program):   0%|          | 0/2000 [00:00<?, ?it/s]
Training(program):   1%|          | 16/2000 [00:00<00:14, 136.42it/s]
Training(program):   3%|▎         | 60/2000 [00:00<00:06, 302.00it/s]
Training(program):   6%|▌         | 117/2000 [00:00<00:04, 418.28it/s]
Training(program):   9%|▉         | 177/2000 [00:00<00:03, 485.10it/s]
Training(program):  12%|█
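
Because `run_module` uses `capture_output=True`, the progress bar in STDERR only appears after the command has finished. For long training runs, a streaming variant can echo output as it arrives. The sketch below is one way to do that with `subprocess.Popen`; `stream_command` is a hypothetical name, and merging STDERR into STDOUT is a design choice made here for simplicity, not something the project does.

```python
import subprocess
import sys

def stream_command(cmd: list[str]) -> tuple[int, list[str]]:
    """Run `cmd`, echo each output line as it arrives, return (returncode, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same pipe
        text=True,
        bufsize=1,  # line-buffered on the reading side
    ) as proc:
        for line in proc.stdout:
            print(line, end="", flush=True)
            lines.append(line.rstrip("\n"))
    # Leaving the `with` block waits for the child, so returncode is set here.
    return proc.returncode, lines

exit_code, echoed = stream_command(
    [sys.executable, "-c", "print('step 1'); print('step 2')"]
)
print("returncode =", exit_code)
```

A Python child may still buffer its own STDOUT when it writes to a pipe; passing `-u` to the interpreter or setting `PYTHONUNBUFFERED=1` makes it flush promptly.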

## 6) Train program-level model (GCJ compiled dataset)

Same trainer, but with `--clone-type googlejam` and `--dataset-root` pointing to `gcj_compiled/`.


In [10]:
from pathlib import Path

PROGRAM_INDEX = (Path.cwd() / "./program_artifacts_googlejam/program_index.json").resolve()
GCJ_ROOT = (Path.cwd() / "../data/gcj_compiled").resolve()

run_module(
    "gnn_train_program.cli",
    "--program-index", str(PROGRAM_INDEX),
    "--dataset-root", str(GCJ_ROOT),
    "--clone-type", "googlejam",
    "--limit-indices", "4",
    "--val-ratio", "0.2",
    "--steps", "2000",
    "--batch-pairs", "32",
    "--pos-ratio", "0.5",
    "--device", "mps",
    "--eval-every", "100",
    "--val-pairs", "200",
    "--log-every", "50",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m gnn_train_program.cli --program-index /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/program_artifacts_googlejam/program_index.json --dataset-root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled --clone-type googlejam --limit-indices 4 --val-ratio 0.2 --steps 2000 --batch-pairs 32 --pos-ratio 0.5 --device mps --eval-every 100 --val-pairs 200 --log-every 50
returncode = 0

STDOUT:
 step=50 loss=0.6443 ema=0.6948
step=100 loss=0.7343 ema=0.6966
[VAL] used=139 pos=74 neg=65 val_loss=0.6981 val_acc=0.532
val_loss=0.6981160640716553 val_acc=0.5323741007194245
step=150 loss=0.6322 ema=0.6873
step=200 loss=0.7377 ema=0.6975
[VAL] used=120 pos=55 neg=65 val_loss=0.7074 val_acc=0.458
val_loss=0.7073948979377747 val_acc=0.4583333333333333
step=250 loss=0.6729 ema=0.7058
step=300 loss=0.6913 ema=0.6960
[VAL] used=122 pos=58 neg=64 val_loss=0.6953 val_acc=0.475
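
The `[VAL]` lines above can be checked against two simple baselines. In each of the three visible checkpoints, `val_acc` equals `pos / used` to the printed precision, which is what a model that predicts the positive class for every pair would score. The validation loss also stays close to ln 2 ≈ 0.693, the cross-entropy of a 50/50 guess, assuming the trainer reports binary cross-entropy in nats (the notebook does not say). The sketch below parses the lines and prints both comparisons; `parse_val_lines` is a hypothetical helper.

```python
import math
import re

VAL_RE = re.compile(
    r"^\[VAL\] used=(\d+) pos=(\d+) neg=(\d+) val_loss=([\d.]+) val_acc=([\d.]+)$"
)

def parse_val_lines(log_text: str) -> list[dict]:
    """Extract one record per `[VAL]` line; all other lines are ignored."""
    records = []
    for line in log_text.splitlines():
        m = VAL_RE.match(line.strip())
        if m:
            used, pos, neg = (int(m.group(i)) for i in (1, 2, 3))
            records.append({
                "used": used,
                "pos": pos,
                "neg": neg,
                "val_loss": float(m.group(4)),
                "val_acc": float(m.group(5)),
                "always_positive_acc": pos / used,  # baseline: predict 1 everywhere
            })
    return records

# Lines copied from the recorded output above.
LOG = """\
step=100 loss=0.7343 ema=0.6966
[VAL] used=139 pos=74 neg=65 val_loss=0.6981 val_acc=0.532
step=200 loss=0.7377 ema=0.6975
[VAL] used=120 pos=55 neg=65 val_loss=0.7074 val_acc=0.458
step=300 loss=0.6913 ema=0.6960
[VAL] used=122 pos=58 neg=64 val_loss=0.6953 val_acc=0.475
"""

val_records = parse_val_lines(LOG)
chance_loss = math.log(2)  # about 0.693
for rec in val_records:
    print(
        rec["used"],
        rec["val_acc"],
        round(rec["always_positive_acc"], 3),
        round(rec["val_loss"] - chance_loss, 4),
    )
```

If the pattern holds beyond these checkpoints, the model had not yet learned to separate the classes at this point in the run; the rest of the log is truncated, so later steps may differ.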

## 7) Preview augmentation pairs (GCJ)

This previews mixed pairs in which a positive may point to a file written into the augmentation cache directory (`--out`).


In [11]:
from pathlib import Path

GCJ_ROOT = (Path.cwd() / "../data/gcj_compiled").resolve()
AUG_OUT = (Path.cwd() / "./aug_cache_gcj").resolve()

run_module(
    "augment_pipeline.cli_preview",
    "--root", str(GCJ_ROOT),
    "--out", str(AUG_OUT),
    "--limit-buckets", "4",
    "--n", "10",
    "--pos-ratio", "0.5",
    "--seed", "0",
)


Running:
  /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/venv/bin/python -m augment_pipeline.cli_preview --root /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled --out /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/aug_cache_gcj --limit-buckets 4 --n 10 --pos-ratio 0.5 --seed 0
returncode = 0

STDOUT:
 0 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/3/p518/googlejam3/p518/Main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/1/p782/googlejam1/p782/Main.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/2/p293/googlejam2/p293/CodeJamCounter.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/aug_cache_gcj/4aee06c8c5f89af9c387ea24f9c93a2a1bc649d9.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled/2/p134/googlejam2/p134/R1BA.java -> /Users/jonas/Documents/NTNU/Bachelor/code-
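
The cached positive above has a 40-character hexadecimal filename, the same length as a SHA-1 hex digest. That suggests a content-addressed cache, though the notebook does not say what the pipeline actually hashes. The sketch below shows the general idea under that assumption: the file name is derived from the text it stores, so writing the same augmented source twice reuses one file. `cache_path` and `write_cached` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path

def cache_path(cache_dir: Path, source_text: str) -> Path:
    """Derive a file name from the SHA-1 of the text (assumed naming scheme)."""
    digest = hashlib.sha1(source_text.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.java"

def write_cached(cache_dir: Path, source_text: str) -> Path:
    """Write the text once; repeated calls with equal text return the same path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_path(cache_dir, source_text)
    if not target.exists():
        target.write_text(source_text, encoding="utf-8")
    return target

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "aug_cache"
    first = write_cached(cache_dir, "class A {}\n")
    second = write_cached(cache_dir, "class A {}\n")
    other = write_cached(cache_dir, "class B {}\n")
    stem_length = len(first.stem)
    same_file = first == second
    file_count = len(list(cache_dir.iterdir()))

print(stem_length, same_file, file_count)
```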