# Code clone pipeline runner (commands notebook)

This notebook collects the **exact CLI commands** we’ve been using for:

- Building **program artifacts** (per-program method graphs + embeddings)
- Training the **program-level clone detector**
- Previewing **pair generators**
- Optional: generating an **augmentation cache**
- Quick sanity checks (index keys, shard content, etc.)

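The cells resolve the repo root as `..`, so start the notebook from `pipeline/` (or any directory directly under the repo root). If you start it from the repo root itself, change `Path("..")` to `Path(".")` in the setup cell and in each `cd` helper.

> Edit the variables in the first cells (paths, model, device, limits). The `%%bash` cells below repeat those values as literals, so update them there as well.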


## 0) Setup variables


In [7]:
from pathlib import Path

# === Adjust these ===
REPO_ROOT = Path("..").resolve()         # if you open the notebook inside pipeline/, keep ".."
PIPELINE_DIR = REPO_ROOT / "pipeline"

# Dataset roots
SYNTH_ROOT = REPO_ROOT / "data" / "code-clone-dataset" / "dataset"
GCJ_ROOT   = REPO_ROOT / "data" / "gcj_compiled"

# Artifacts output dirs
OUT_SYNTH = PIPELINE_DIR / "program_artifacts"
OUT_GCJ   = PIPELINE_DIR / "program_artifacts_googlejam"

# Model / device
MODEL  = "microsoft/graphcodebert-base"
DEVICE = "mps"     # "cpu" or "cuda" or "mps"

# Limits (useful for quick tests)
LIMIT_INDICES = 10     # for synthetic dataset (base/type-3 indices)
LIMIT_BUCKETS = 4      # for GCJ buckets (1..12)

# Training config
STEPS = 2000
BATCH_PAIRS = 32
POS_RATIO = 0.5
VAL_RATIO = 0.2
EVAL_EVERY = 100
VAL_PAIRS = 200
LOG_EVERY = 50

print("REPO_ROOT:", REPO_ROOT)
print("PIPELINE_DIR:", PIPELINE_DIR)
print("SYNTH_ROOT:", SYNTH_ROOT)
print("GCJ_ROOT:", GCJ_ROOT)


REPO_ROOT: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings
PIPELINE_DIR: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline
SYNTH_ROOT: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset
GCJ_ROOT: /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/gcj_compiled
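
One way to drive the commands from the setup variables rather than retyping the values is to build the argument list in Python and run it with `subprocess`. The sketch below does this for the preview command from section 1A. `preview_cmd` is a hypothetical helper name, not part of the pipeline, and the flags are only the ones that appear in this notebook.

```python
import subprocess
import sys
from pathlib import Path


def preview_cmd(root: Path, clone_type: str, neg_pool: str, n: int) -> list[str]:
    """Build the argv for pair_dataset.cli_preview (flags as used in section 1A)."""
    return [
        sys.executable, "-m", "pair_dataset.cli_preview",
        "--root", str(root),
        "--clone-type", clone_type,
        "--neg-pool", neg_pool,
        "--n", str(n),
    ]


# Stand-in path for illustration; in the notebook, pass SYNTH_ROOT from the setup cell.
cmd = preview_cmd(Path("data/code-clone-dataset/dataset"), "type-3", "same_clone_type", 10)
print(" ".join(cmd[1:]))

# To actually run it from pipeline/ (not executed here):
# subprocess.run(cmd, cwd=PIPELINE_DIR, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.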


## 1) Preview pair generation


### 1A) Synthetic dataset (type-1/type-2/type-3)


In [8]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

python3 -m pair_dataset.cli_preview \
  --root "../data/code-clone-dataset/dataset" \
  --clone-type type-3 \
  --neg-pool same_clone_type \
  --n 10


indices=100 indices_with_clones=100 total_clones=300 clone_type=type-3 neg_pool=same_clone_type

POSITIVE samples:
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/16/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/16/2.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/39/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/39/1.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/93/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/93/3.java
1 /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/base/09/main.java -> /Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/data/code-clone-dataset/dataset/type-3/09/2.ja
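
If you want to check the preview output programmatically, the summary line and the pair lines are easy to parse. This is a minimal sketch that assumes the line shapes seen in the output above (`key=value` tokens, and `<label> <left> -> <right>` for pairs, where the leading `1` is taken to be the positive label). `parse_summary` and `parse_pair` are hypothetical helpers, and the pair in the demo uses shortened paths.

```python
def parse_summary(line: str) -> dict[str, str]:
    """Split a 'key=value key=value ...' summary line into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split())


def parse_pair(line: str) -> tuple[int, str, str]:
    """Split '<label> <left> -> <right>' into (label, left, right)."""
    label, rest = line.split(" ", 1)
    left, right = rest.split(" -> ", 1)
    return int(label), left, right


summary = parse_summary(
    "indices=100 indices_with_clones=100 total_clones=300 "
    "clone_type=type-3 neg_pool=same_clone_type"
)
# Average number of clones per base program implied by the summary line.
clones_per_index = int(summary["total_clones"]) / int(summary["indices_with_clones"])
print(summary["clone_type"], clones_per_index)

pair = parse_pair("1 base/16/main.java -> type-3/16/2.java")
print(pair)
```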

### 1B) Google Code Jam compiled buckets (unlabeled; proxy labels: same bucket = clone)


In [9]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

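# The recorded run below failed with a SyntaxError: line 1 of
# pair_dataset_googlejam/generators.py starts with a stray `<file name=0 path=...>`
# wrapper before its first comment. Delete that prefix, then rerun this cell.
python3 -m pair_dataset_googlejam.cli_preview \
  --root "../data/gcj_compiled" \
  --limit-buckets 4 \
  --n 10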


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/pair_dataset_googlejam/cli_preview.py", line 7, in <module>
    from pair_dataset_googlejam.generators import GoogleJamConfig, positive_pairs, negative_pairs, interleave
  File "/Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/pair_dataset_googlejam/generators.py", line 1
    <file name=0 path=/Users/jonas/Documents/NTNU/Bachelor/code-model-embeddings/pipeline/pair_dataset_googlejam/generators.py># pipeline/pair_dataset_googlejam/generators.py
    ^
SyntaxError: invalid syntax


CalledProcessError: Command 'b'set -euo pipefail\n\ncd "$(python3 - <<\'PY\'\nfrom pathlib import Path\nprint((Path("..").resolve() / "pipeline").as_posix())\nPY\n)"\n\npython3 -m pair_dataset_googlejam.cli_preview \\\n  --root "../data/gcj_compiled" \\\n  --limit-buckets 4 \\\n  --n 10\n'' returned non-zero exit status 1.
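
The proxy-labelling rule for GCJ (same bucket = clone) can be illustrated in a few lines. This is a toy sketch of the idea only, not the code in `pair_dataset_googlejam.generators`: `proxy_pairs` is a hypothetical function, the file names are made up, and it enumerates every pair exhaustively, whereas a real generator would more likely sample.

```python
from itertools import combinations


def proxy_pairs(buckets: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """Label every unordered pair: 1 if both files share a bucket, else 0."""
    tagged = [(bucket, path) for bucket, paths in sorted(buckets.items()) for path in paths]
    return [
        (path_a, path_b, int(bucket_a == bucket_b))
        for (bucket_a, path_a), (bucket_b, path_b) in combinations(tagged, 2)
    ]


# Toy buckets with made-up file names.
buckets = {"1": ["a.java", "b.java"], "2": ["c.java"]}
pairs = proxy_pairs(buckets)
for pair in pairs:
    print(pair)
```

Because the labels come from bucket membership rather than manual annotation, two solutions in the same bucket count as a positive pair even when their implementations differ substantially.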

## 2) Build program artifacts


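The program artifacts stage writes a `program_index.json` that maps each **source Java path → artifact dir**.

- For the synthetic dataset: `--clone-type type-3` (or `type-1`/`type-2`). These modes also require `--jdk-home` and `--vineflower`.
- For GCJ: `--clone-type googlejam` (supported by the updated CLI). `--jdk-home` and `--vineflower` are not required in this mode.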


### 2A) Synthetic dataset → `pipeline/program_artifacts/`


In [10]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

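# Non-googlejam clone types require --jdk-home and --vineflower; the recorded error
# below came from a run without them. JAVA_HOME and VINEFLOWER_JAR are assumed
# environment variables, and the exact form of the --vineflower value is not confirmed here.
python3 -m program_project.cli \
  --dataset-root "../data/code-clone-dataset/dataset" \
  --clone-type type-3 \
  --jdk-home "${JAVA_HOME:?set JAVA_HOME to your JDK directory}" \
  --vineflower "${VINEFLOWER_JAR:?set VINEFLOWER_JAR to your Vineflower path}" \
  --out "./program_artifacts" \
  --model "microsoft/graphcodebert-base" \
  --device "mps" \
  --limit-indices 10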


Error: --jdk-home and --vineflower are required unless --clone-type is 'googlejam'


CalledProcessError: Command 'b'set -euo pipefail\n\ncd "$(python3 - <<\'PY\'\nfrom pathlib import Path\nprint((Path("..").resolve() / "pipeline").as_posix())\nPY\n)"\n\npython3 -m program_project.cli \\\n  --dataset-root "../data/code-clone-dataset/dataset" \\\n  --clone-type type-3 \\\n  --out "./program_artifacts" \\\n  --model "microsoft/graphcodebert-base" \\\n  --device "mps" \\\n  --limit-indices 10\n'' returned non-zero exit status 1.

### 2B) GCJ compiled dataset → `pipeline/program_artifacts_googlejam/`


In [None]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

python3 -m program_project.cli \
  --dataset-root "../data/gcj_compiled" \
  --clone-type googlejam \
  --out "./program_artifacts_googlejam" \
  --model "microsoft/graphcodebert-base" \
  --device "mps" \
  --limit-indices 4


## 3) Sanity check: inspect program_index.json


In [None]:
import json
from pathlib import Path

p = Path("./program_artifacts_googlejam/program_index.json")  # change to ./program_artifacts/program_index.json if needed
idx = json.loads(p.read_text(encoding="utf-8"))
items = idx.get("items", {})
print("num_items:", len(items))

if items:
    k = next(iter(items))
    print("example key:", k)
    print("example item keys:", sorted(items[k].keys()))
    print("example item:", items[k])


## 4) Train program-level GNN clone detector


This uses `gnn_train_program.cli`:

- `--program-index` points to the program index produced above
- `--dataset-root` must match the generator root used for pairs
- `--clone-type` selects generator mode (`type-3` or `googlejam`)
- Validation is done by splitting indices/buckets by `--val-ratio`
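
Splitting by whole indices or buckets, rather than by individual pairs, keeps every clone of a given base program on one side of the split. The following is a minimal sketch of such a group-level split. The actual rule inside `gnn_train_program` may differ (rounding, seeding, ordering), and `split_groups` is a hypothetical name.

```python
import random


def split_groups(groups: list[str], val_ratio: float, seed: int = 0) -> tuple[list[str], list[str]]:
    """Shuffle group ids and hold out about val_ratio of them for validation."""
    shuffled = sorted(groups)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one group when there is more than one to choose from.
    n_val = max(1, round(len(shuffled) * val_ratio)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


# Ten toy index ids, matching --limit-indices 10 and --val-ratio 0.2.
all_groups = [f"{i:02d}" for i in range(10)]
train, val = split_groups(all_groups, val_ratio=0.2)
print(len(train), len(val))
```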


### 4A) Train on synthetic dataset


In [None]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

python3 -m gnn_train_program.cli \
  --program-index "./program_artifacts/program_index.json" \
  --dataset-root "../data/code-clone-dataset/dataset" \
  --clone-type type-3 \
  --limit-indices 10 \
  --val-ratio 0.2 \
  --steps 2000 \
  --batch-pairs 32 \
  --pos-ratio 0.5 \
  --device "mps" \
  --eval-every 100 \
  --val-pairs 200 \
  --log-every 50


### 4B) Train on GCJ compiled dataset


In [None]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

python3 -m gnn_train_program.cli \
  --program-index "./program_artifacts_googlejam/program_index.json" \
  --dataset-root "../data/gcj_compiled" \
  --clone-type googlejam \
  --limit-indices 4 \
  --val-ratio 0.2 \
  --steps 2000 \
  --batch-pairs 32 \
  --pos-ratio 0.5 \
  --device "mps" \
  --eval-every 100 \
  --val-pairs 200 \
  --log-every 50
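
For a feel of what these settings imply, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch that assumes the trainer rounds `batch_pairs * pos_ratio` to get the positives per batch and evaluates at every multiple of `--eval-every`. Neither assumption is confirmed by the CLI.

```python
STEPS, BATCH_PAIRS, POS_RATIO = 2000, 32, 0.5
EVAL_EVERY, LOG_EVERY, VAL_PAIRS = 100, 50, 200

pos_per_batch = round(BATCH_PAIRS * POS_RATIO)   # positives in each batch
neg_per_batch = BATCH_PAIRS - pos_per_batch      # negatives in each batch
train_pairs_seen = STEPS * BATCH_PAIRS           # pairs sampled over the whole run
n_evals = STEPS // EVAL_EVERY                    # validation passes
n_logs = STEPS // LOG_EVERY                      # log lines
val_pairs_scored = n_evals * VAL_PAIRS           # validation pairs scored in total

print(pos_per_batch, neg_per_batch, train_pairs_seen, n_evals, n_logs, val_pairs_scored)
```

With a small `--limit-indices`, the number of distinct positive pairs is far below the number of pairs sampled, so the same pairs will repeat many times during a 2000-step run.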


## 5) Optional: augmentation cache preview (variable-renaming etc.)


If you’re generating augmented positives into a cache directory, preview them like this.

This assumes the `augment_pipeline.cli_preview` module is importable from `pipeline/` (invoked as `python3 -m augment_pipeline.cli_preview`).


In [None]:
%%bash
set -euo pipefail

cd "$(python3 - <<'PY'
from pathlib import Path
print((Path("..").resolve() / "pipeline").as_posix())
PY
)"

python3 -m augment_pipeline.cli_preview \
  --root "../data/gcj_compiled" \
  --out "./aug_cache_gcj" \
  --limit-buckets 4 \
  --n 10 \
  --pos-ratio 0.5 \
  --seed 0
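
To make the idea of a variable-renaming augmentation concrete, here is a deliberately naive sketch. It is not the `augment_pipeline` implementation: `rename_identifier` is a hypothetical function that does a whole-word regex replace, so it would also rewrite matches inside string literals and comments, which a real transformation would need to avoid by working on the parsed source.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new` (naive: ignores Java syntax)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


java = "int total = 0; for (int i = 0; i < n; i++) { total += i; } return total;"
augmented = rename_identifier(java, "total", "acc")
print(augmented)
```

The renamed snippet behaves identically to the original, which is why this kind of transform yields a positive (clone) pair.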


## 6) Transfer / copy artifacts to another machine (or move dirs)


If you want to move the artifacts directory as a unit (recommended), copy:

- `pipeline/program_artifacts/` or `pipeline/program_artifacts_googlejam/`
- `pipeline/gnn_models_program/` (after training)

Example rsync commands (edit paths/host):


In [None]:
%%bash
set -euo pipefail

# Example: copy artifacts to a remote machine
# rsync -avh --progress ./program_artifacts_googlejam user@host:/path/to/repo/pipeline/

# Example: copy trained models
# rsync -avh --progress ./gnn_models_program user@host:/path/to/repo/pipeline/
echo "Edit and run rsync commands as needed."


## 7) Debug helpers


### 7A) Inspect one embed_cache shard (methods_00000.pt)


In [None]:
import torch
from pathlib import Path

# Point at one program artifact's embed_cache shards
# You can copy a shards_dir from program_index.json and paste it here:
shards_dir = None  # e.g. Path("/abs/path/to/prog_xxx/embed_cache/shards")

if shards_dir is None:
    print("Set shards_dir first (copy from program_index.json -> item['shards_dir']).")
else:
    shard_files = sorted(Path(shards_dir).glob("*.pt"))
    print("num_shards:", len(shard_files))
    if shard_files:
        data = torch.load(shard_files[0], map_location="cpu")
        print("records:", len(data))
        item = data[0]
        print("keys:", sorted(item.keys()))
        print("method_id:", item.get("method_id"))
        print("x:", item["x"].shape, item["x"].dtype)
        print("edge_index:", item["edge_index"].shape, item["edge_index"].dtype)
        print("edge_type:", item["edge_type"].shape, item["edge_type"].dtype)
        print("num_nodes/edges:", item["num_nodes"], item["num_edges"])
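
Beyond printing shapes, a record can be checked for internal consistency. The sketch below assumes the layout suggested by the fields above: `x` has one row per node, `edge_index` is shaped `[2, num_edges]` (the PyTorch Geometric convention), `edge_type` has one entry per edge, and `num_nodes`/`num_edges` are plain integers. `check_record` is a hypothetical helper; the demo uses stand-in objects with a `.shape` attribute so it runs without `torch`, but real tensors work the same way.

```python
from types import SimpleNamespace


def check_record(item: dict) -> list[str]:
    """Return a list of shape inconsistencies found in one shard record."""
    problems = []
    if item["x"].shape[0] != item["num_nodes"]:
        problems.append("x rows != num_nodes")
    if tuple(item["edge_index"].shape) != (2, item["num_edges"]):
        problems.append("edge_index is not [2, num_edges]")
    if item["edge_type"].shape[0] != item["num_edges"]:
        problems.append("edge_type length != num_edges")
    return problems


# Stand-in records: only .shape is inspected, so no tensors are needed for the demo.
good = {
    "x": SimpleNamespace(shape=(5, 768)),
    "edge_index": SimpleNamespace(shape=(2, 7)),
    "edge_type": SimpleNamespace(shape=(7,)),
    "num_nodes": 5,
    "num_edges": 7,
}
bad = dict(good, num_edges=8)
print(check_record(good))
print(check_record(bad))
```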
