# CS559 Code Completion (Colab Notebook) — Token-Level Only

This notebook runs the full pipeline end-to-end for **token-level next-token prediction**:

1. Install dependencies
2. Download + extract Py150
3. Preprocess/tokenize
4. Create **token-level** completion datasets
5. Train (`train_v2.py`)
6. Evaluate (`evaluate.py`)
7. Run a small inference demo

**Tip:** start with the **Quick sanity run** settings (`QUICK_RUN = True`, which selects a small `MAX_TRAIN_EXAMPLES`) to verify everything works, then scale up.


In [None]:
# (Optional) Mount Google Drive to save runs
try:
    from google.colab import drive
    drive.mount('/content/drive')
except Exception as e:
    print('Not running in Colab (or Drive mount failed):', e)


In [None]:
# Clone the repo (if you uploaded only this notebook)
import os
import subprocess

REPO_URL = "https://github.com/MahyarFardin/cs559_code_completion.git"
REPO_DIR = "/content/cs559_code_completion"

if not os.path.exists(REPO_DIR):
    subprocess.check_call(["git", "clone", REPO_URL, REPO_DIR])

os.chdir(REPO_DIR)
print("cwd:", os.getcwd())
subprocess.check_call(["python", "-V"])


In [None]:
# Install dependencies
# Note: Colab typically already includes PyTorch; requirements.txt should be compatible.
!pip -q install --upgrade pip
!pip -q install -r requirements.txt


In [None]:
# Check GPU
import subprocess
import torch

print('torch:', torch.__version__)
print('cuda available:', torch.cuda.is_available())

if torch.cuda.is_available():
    try:
        subprocess.check_call(["nvidia-smi"])
    except Exception as e:
        print("nvidia-smi failed:", e)


## Configure your run

- Set `QUICK_RUN=True` first to validate the pipeline.
- When you scale up, increase `MAX_TRAIN_EXAMPLES` and optionally `NUM_EPOCHS`.
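
To get a feel for how these settings translate into training work, the sketch below estimates optimizer steps per epoch from the values in the configuration cell. It assumes `train_v2.py` applies gradient accumulation in the usual way (one optimizer step per `ACCUMULATION_STEPS` batches, with trailing partial groups still triggering a step), which is not confirmed here. `optimizer_steps_per_epoch` is a hypothetical helper, not part of the repo.

```python
import math


def optimizer_steps_per_epoch(num_examples, batch_size, accumulation_steps):
    """Estimate optimizer updates per epoch under standard gradient accumulation.

    Assumption: a trailing partial batch and a trailing partial accumulation
    group each still count as one; train_v2.py may handle them differently.
    """
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / accumulation_steps)


# Values copied from this notebook's configuration cell.
quick = optimizer_steps_per_epoch(2000, batch_size=16, accumulation_steps=4)
full = optimizer_steps_per_epoch(200000, batch_size=16, accumulation_steps=4)
print("effective batch size:", 16 * 4)
print("quick run steps/epoch:", quick)
print("full run steps/epoch:", full)
```

Under these assumptions, raising `MAX_TRAIN_EXAMPLES` scales the step count roughly linearly, while raising `ACCUMULATION_STEPS` reduces the number of updates without changing per-batch memory use.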


In [None]:
# --- Run configuration (token-level only) ---
# Imported here so this cell also works if the earlier cells were skipped.
import os

import torch

QUICK_RUN = True

# Fixed task for this notebook
TASK = 'token'

# If True, rebuild cached dataset JSONLs (slow)
FORCE_REBUILD_DATASET = False

# Common training hyperparams
BATCH_SIZE = 16
ACCUMULATION_STEPS = 4
NUM_EPOCHS = 2 if QUICK_RUN else 15
MAX_LENGTH = 256

# Vocab controls model size heavily
VOCAB_MIN_FREQ = 50 if QUICK_RUN else 20
VOCAB_SAMPLE_LINES = 20000 if QUICK_RUN else 50000

# Dataset sizing
MAX_TRAIN_EXAMPLES = 2000 if QUICK_RUN else 200000
MAX_VAL_EXAMPLES = 1000 if QUICK_RUN else 10000
MAX_TEST_EXAMPLES = 2000 if QUICK_RUN else 50000

# DataLoader workers
NUM_WORKERS = 2
EVAL_NUM_WORKERS = 0  # keep at 0 if evaluation hangs when using worker processes

# Device
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def make_run_name(task, bs, ep, max_length, vocab_min_freq, max_train_examples, acc_steps):
    name = f"run_{task}_v2_bs{bs}_ep{ep}_len{max_length}_vocab{vocab_min_freq}"
    if max_train_examples:
        name += f"_train{max_train_examples}"
    if acc_steps and acc_steps > 1:
        name += f"_acc{acc_steps}"
    return name

RUN_NAME = make_run_name(TASK, BATCH_SIZE, NUM_EPOCHS, MAX_LENGTH, VOCAB_MIN_FREQ, MAX_TRAIN_EXAMPLES, ACCUMULATION_STEPS)
OUTPUT_DIR = f"runs/{RUN_NAME}"

# Shared dataset cache (one-time creation; reused across runs)
# - On Colab: lives under /content
# - Locally: lives under the current repo directory
COMMON_DATASET_ROOT = "/content/cs559_shared_datasets" if os.path.exists("/content") else os.path.join(os.getcwd(), "cs559_shared_datasets")
DATASET_DIR = os.path.join(COMMON_DATASET_ROOT, "completion_datasets_token_only")

MODEL_PATH = f"{OUTPUT_DIR}/best_model_{TASK}_level.pt"
VOCAB_PATH = f"{OUTPUT_DIR}/vocab.json"

print('TASK:', TASK)
print('RUN_NAME:', RUN_NAME)
print('OUTPUT_DIR:', OUTPUT_DIR)
print('DATASET_DIR:', DATASET_DIR)
print('FORCE_REBUILD_DATASET:', FORCE_REBUILD_DATASET)
print('MODEL_PATH:', MODEL_PATH)
print('VOCAB_PATH:', VOCAB_PATH)
print('DEVICE:', DEVICE)


## Download dataset (Py150)

This can take a while. If you already have `py150_files/` from a previous session, you can skip this step.
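
If you would rather have the notebook decide whether to skip, a check along these lines could guard the download cell. It is a sketch that assumes `download_and_extract.sh` extracts into `py150_files/` under the current directory (the path the preprocessing step reads from). `needs_download` is a hypothetical helper.

```python
import os


def needs_download(data_dir="py150_files"):
    """Return True when the dataset directory is missing or empty."""
    return not os.path.isdir(data_dir) or not os.listdir(data_dir)


if needs_download():
    print("py150_files/ not found or empty: run the download cell.")
else:
    print("py150_files/ already present: the download can be skipped.")
```

A non-empty directory can still come from an interrupted extraction, so this check is only a first filter.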


In [None]:
import subprocess

subprocess.check_call(["bash", "download_and_extract.sh"])


## Preprocess/tokenize

Creates `token_completion/{train,dev,test}.txt`.
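
A quick check after preprocessing can confirm that all three splits were written before moving on. The sketch below relies only on the file names listed above. `missing_splits` is a hypothetical helper.

```python
import os


def missing_splits(output_dir="token_completion", splits=("train", "dev", "test")):
    """Return the split files that are absent or empty in output_dir."""
    missing = []
    for split in splits:
        path = os.path.join(output_dir, f"{split}.txt")
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing


print("missing or empty splits:", missing_splits() or "none")
```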


In [None]:
!python preprocess.py --base_dir py150_files --output_dir token_completion


## Create completion datasets (token-level only)

Creates `{DATASET_DIR}/token_level/{train,dev,test}.jsonl` (stored in a shared cache folder so you don’t regenerate it every run).
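
The field names written by `create_completion_datasets.py` are not shown here, so a schema-agnostic peek is a safe way to inspect the cache. The sketch assumes only that each line of a `.jsonl` file is one JSON object. `peek_jsonl` is a hypothetical helper.

```python
import itertools
import json


def peek_jsonl(path, n=3):
    """Parse and return the first n non-blank records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in itertools.islice(lines, n)]
```

Once the dataset exists, calling `peek_jsonl(os.path.join(DATASET_DIR, "token_level", "train.jsonl"), n=1)` and printing the keys of the returned record shows what each example contains without loading the whole file.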


In [None]:
# Generate token-level dataset only (shared cache)
# This runs once and then reuses the cached JSONLs across runs.
import os
import shutil

token_train = os.path.join(DATASET_DIR, "token_level", "train.jsonl")

if FORCE_REBUILD_DATASET or not os.path.exists(token_train):
    print("Building token-level completion dataset cache...")
    shutil.rmtree(DATASET_DIR, ignore_errors=True)
    os.makedirs(DATASET_DIR, exist_ok=True)
    !python create_completion_datasets.py --input_dir token_completion --output_dir {DATASET_DIR} --token_level
else:
    print("Found cached token-level dataset. Skipping rebuild:", token_train)

# Safety: if an old line_level folder exists in the cache, remove it
line_level_dir = os.path.join(DATASET_DIR, "line_level")
if os.path.exists(line_level_dir):
    print("Removing stale line_level cache:", line_level_dir)
    shutil.rmtree(line_level_dir)


## Train token-level (train_v2.py)

- This creates `runs/<RUN_NAME>/` with `training_params.json`, `vocab.json`, and the best model checkpoint.
- Note: `train_v2.py` names runs **deterministically** (no timestamp), so re-running the same config reuses the same folder and can resume from its checkpoints.
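
Because a repeated config lands in the same folder, it can be worth checking what a previous run left behind before re-running. The sketch below looks for the three artifacts named above; the checkpoint file name follows `MODEL_PATH` in the configuration cell. `run_artifacts` is a hypothetical helper.

```python
import os


def run_artifacts(output_dir, task="token"):
    """Map each expected artifact name to whether it exists in output_dir."""
    names = [
        "training_params.json",
        "vocab.json",
        f"best_model_{task}_level.pt",  # mirrors MODEL_PATH in the config cell
    ]
    return {name: os.path.isfile(os.path.join(output_dir, name)) for name in names}
```

Calling `run_artifacts(OUTPUT_DIR, TASK)` before training shows whether the folder is fresh. Whether training actually resumes still depends on the checkpoint logic inside `train_v2.py`.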


In [None]:
!python train_v2.py \
  --task {TASK} \
  --dataset_dir {DATASET_DIR} \
  --batch_size {BATCH_SIZE} \
  --accumulation_steps {ACCUMULATION_STEPS} \
  --num_epochs {NUM_EPOCHS} \
  --max_length {MAX_LENGTH} \
  --vocab_min_freq {VOCAB_MIN_FREQ} \
  --vocab_sample_lines {VOCAB_SAMPLE_LINES} \
  --max_train_examples {MAX_TRAIN_EXAMPLES} \
  --max_val_examples {MAX_VAL_EXAMPLES} \
  --lazy_load \
  --num_workers {NUM_WORKERS} \
  --device {DEVICE}


## Evaluate

- Uses the model from the run directory.
- If evaluation hangs at `0%`, keep `EVAL_NUM_WORKERS = 0` (passed to `evaluate.py` as `--num_workers 0`).


In [None]:
!python evaluate.py \
  --model_path {MODEL_PATH} \
  --vocab_path {VOCAB_PATH} \
  --task {TASK} \
  --dataset_dir {DATASET_DIR} \
  --max_length {MAX_LENGTH} \
  --max_test_examples {MAX_TEST_EXAMPLES} \
  --lazy_load \
  --num_workers {EVAL_NUM_WORKERS} \
  --device {DEVICE}


## Quick token-level inference demo

Adjust the `--context` string to try different prompts.
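
For intuition about `--top_k 5`: token-level completion typically scores every vocabulary entry and reports the few with the highest scores. The sketch below illustrates that selection step only. It does not reproduce `inference.py`, whose internals are not shown here, and the tokens and scores are invented for illustration.

```python
import heapq


def top_k_tokens(scores, k=5):
    """Return the k (token, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Invented scores for the context "from collections import" (not model output).
example_scores = {
    "defaultdict": 0.41,
    "OrderedDict": 0.22,
    "namedtuple": 0.18,
    "Counter": 0.09,
    "deque": 0.06,
    "os": 0.01,
}
for token, score in top_k_tokens(example_scores, k=3):
    print(f"{token}: {score:.2f}")
```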


In [None]:
import shlex
import subprocess

# Token-level inference.
# An argument list (no shell) avoids quoting problems in the context string.
cmd = [
    "python", "inference.py",
    "--model_path", MODEL_PATH,
    "--vocab_path", VOCAB_PATH,
    "--task", "token",
    "--context", "from collections import",
    "--top_k", "5",
    "--device", DEVICE,
]

print(shlex.join(cmd))
subprocess.check_call(cmd)


## Save your run to Google Drive (optional)

If you mounted Drive, copy the whole run folder so it persists after the Colab session ends.
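
If you prefer not to delete an existing copy on Drive before re-copying, `shutil.copytree` with `dirs_exist_ok=True` (Python 3.8+) merges into the destination instead. This is a sketch of an alternative, not the notebook's own approach. `backup_run` is a hypothetical helper.

```python
import os
import shutil


def backup_run(src_dir, dst_dir):
    """Copy src_dir into dst_dir, overwriting same-named files and keeping others."""
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(f"run folder not found: {src_dir}")
    os.makedirs(os.path.dirname(dst_dir) or ".", exist_ok=True)
    return shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
```

With this variant, files that were deleted from the run folder since the last copy remain in the Drive destination, so it trades a clean mirror for safety against accidental data loss.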


In [None]:
import os
import shutil

drive_root = '/content/drive/MyDrive'
# Check the Drive folder itself, not just the mount point.
if os.path.isdir(drive_root):
    dst = os.path.join(drive_root, 'cs559_runs', RUN_NAME)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        print('Destination already exists, removing:', dst)
        shutil.rmtree(dst)
    shutil.copytree(OUTPUT_DIR, dst)
    print('Copied run to:', dst)
else:
    print('Drive is not mounted; skipping copy.')
