# Onboarding Notebook: Anticipatory Prefill Project

This notebook verifies that your environment is ready for development. It walks through six steps:
1. Check the Python version and GPU
2. Locate the repository, cloning it when running remotely
3. Install dependencies
4. Load the project config and model
5. Run a basic inference
6. Run a KV-cache sanity check

## 1. Runtime Checks & Setup

In [14]:
import sys
import os
import torch

print(f"Python Version: {sys.version}")
if torch.cuda.is_available():
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("WARNING: No GPU detected. Please enable GPU runtime.")

Python Version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
GPU Available: Tesla T4
CUDA Version: 12.6


## 2. Remote Setup (Colab/VS Code Remote)
If running remotely (e.g. VS Code connected to Colab), we clone the repo into a subdirectory to ensure we have access to `src/`.

In [15]:
import os

REPO_DIR = "OT1-APITS"

# Check if we are running locally (src exists relative to us) or need to clone
if os.path.exists("../src"):
    print("Running locally (parent directory has src).")
    REPO_ROOT = ".."
elif os.path.exists("src"):
    print("Running locally (current directory has src).")
    REPO_ROOT = "."
else:
    # We are likely in a fresh Colab VM
    print(f"Local src not found. Checking for clone in {REPO_DIR}...")
    if not os.path.exists(REPO_DIR):
        print("Cloning repo...")
        !git clone https://github.com/Samsam19191/OT1-APITS.git {REPO_DIR}
    else:
        print(f"Repo already cloned in {REPO_DIR}. Pulling latest changes...")
        !cd {REPO_DIR} && git pull
    
    REPO_ROOT = REPO_DIR

print(f"Repository root set to: {REPO_ROOT}")

Local src not found. Checking for clone in OT1-APITS...
Repo already cloned in OT1-APITS. Pulling latest changes...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 17 (delta 0), reused 17 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (17/17), 12.78 KiB | 4.26 MiB/s, done.
From https://github.com/Samsam19191/OT1-APITS
   3fe9f43..887f09b  main       -> origin/main
Updating 3fe9f43..887f09b
Fast-forward
 .gitignore                                         |   5 +
 README.md                                          |  40 ++-
 docs/colab.md                                      |  19 ++
 docs/vscode_colab.md                               |  24 ++
 notebooks/00_onboarding.ipynb                      | 355 +++++++++++++++++++++
 experiments/requirements.txt => requirements.txt   |   4 +-
 src/__init__.py      
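
The lookup order used in the cell above (parent directory, current directory, then the clone) can also be written as a small pure function, which is easier to test outside a notebook. This is only a sketch: `find_repo_root` is a hypothetical helper, not part of the project, and unlike the cell it returns `None` instead of cloning when no `src/` directory is found.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, clone_dir: str = "OT1-APITS") -> Optional[Path]:
    """Return the first candidate directory that contains src/, or None.

    Hypothetical helper mirroring the search order of the setup cell.
    """
    # Same order as the cell: parent directory, current directory, then the clone.
    candidates = [start.parent, start, start / clone_dir]
    for candidate in candidates:
        if (candidate / "src").is_dir():
            return candidate.resolve()
    # Nothing found: the caller decides whether to clone.
    return None


print("Repository root:", find_repo_root(Path.cwd()))
```

A `None` result corresponds to the "fresh Colab VM" branch, where the repository still has to be cloned.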

## 3. Install Dependencies

In [16]:
import os

req_file = os.path.join(REPO_ROOT, "requirements.txt")

if os.path.exists(req_file):
    print(f"Found requirements at: {req_file}")
    !pip install -r {req_file}
else:
    print(f"WARNING: requirements.txt not found at {req_file}")
    print("Current working directory files:", os.listdir("."))
    if os.path.exists(REPO_ROOT):
        print(f"Files in {REPO_ROOT}:", os.listdir(REPO_ROOT))
        
    print("Installing default packages as fallback...")
    !pip install torch transformers accelerate bitsandbytes sentencepiece protobuf

Found requirements at: OT1-APITS/requirements.txt
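
To confirm that the install step actually took effect, one option is to query the installed distributions with the standard-library `importlib.metadata` module. The sketch below is an optional extra check, not part of the project; `missing_packages` is a hypothetical helper, and the package list is the fallback list from the cell above.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Fallback package list taken from the install cell above.
FALLBACK_PACKAGES = [
    "torch", "transformers", "accelerate",
    "bitsandbytes", "sentencepiece", "protobuf",
]
print("Missing packages:", missing_packages(FALLBACK_PACKAGES))
```

An empty list means every package in the fallback list is importable by name in the current environment.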


## 4. Import & Load Model

In [17]:
import sys
import os

# Add the absolute repo root to sys.path so that `import src...` works
repo_root_abs = os.path.abspath(REPO_ROOT)
if repo_root_abs not in sys.path:
    sys.path.append(repo_root_abs)

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

try:
    # Try importing from src.config
    from src.config import MODEL_NAME, LOAD_IN_4BIT, DEVICE, MAX_SEQ_LEN_TYPING
    print(f"Config loaded from src. Model: {MODEL_NAME}")
except ImportError as e:
    print(f"ImportError: {e}")
    print("Could not import src.config. Using defaults.")
    MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
    LOAD_IN_4BIT = True
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Transformers version: {transformers.__version__}")
print(f"Loading model: {MODEL_NAME}...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    load_in_4bit=LOAD_IN_4BIT,
    trust_remote_code=True
)

print("Model loaded successfully!")

Config loaded from src. Model: Qwen/Qwen2.5-1.5B-Instruct
Transformers version: 4.57.3
Loading model: Qwen/Qwen2.5-1.5B-Instruct...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Model loaded successfully!
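
The deprecation warning in the output above asks for a `BitsAndBytesConfig` object passed via `quantization_config` instead of `load_in_4bit`. A minimal sketch of that change follows. `build_load_kwargs` is a hypothetical helper, and the fp16 compute dtype is an assumption chosen for the T4, not a project setting.

```python
def build_load_kwargs(load_in_4bit: bool) -> dict:
    """Build keyword arguments for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper; mirrors the arguments used in the loading cell.
    """
    kwargs = {"device_map": "auto", "trust_remote_code": True}
    if load_in_4bit:
        # Imported lazily: only the quantized path needs these modules.
        import torch
        from transformers import BitsAndBytesConfig

        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute on a T4
        )
    return kwargs


# Usage in this notebook would look like:
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **build_load_kwargs(LOAD_IN_4BIT))
```

With `LOAD_IN_4BIT = False` the helper returns only the plain loading arguments, so the same call works for both quantized and unquantized runs.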


## 5. Sanity Inference

In [18]:
input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt").to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("--- Generation sanity check ---")
print(result)
print("------------------------------")

--- Generation sanity check ---
def fibonacci(n): 
    a = 0
    b = 1
    if n == 0:

------------------------------


## 6. KV-Cache Sanity Check

In [19]:
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

# Run forward pass with use_cache=True
with torch.no_grad():
    outputs = model(inputs.input_ids, use_cache=True)

kv_cache = outputs.past_key_values

if kv_cache is not None:
    print("✅ KV Cache extracted successfully.")
    # Check structure (layers, keys/values)
    print(f"Cache type: {type(kv_cache)}")
    if hasattr(kv_cache, "get_seq_length"):
        print(f"Sequence length in cache: {kv_cache.get_seq_length()}")
    else:
        # Fallback for the legacy tuple-based cache
        print(f"Layers: {len(kv_cache)}")
else:
    print("❌ Failed to extract KV Cache.")

✅ KV Cache extracted successfully.
Cache type: <class 'transformers.cache_utils.DynamicCache'>
Sequence length in cache: 4
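
The two branches of the check above report different things: the sequence length for a cache object and the layer count for a legacy tuple. The sketch below reports both values for either format. `cache_summary` is a hypothetical helper, and it assumes the legacy layout of one `(key, value)` pair per layer, each shaped `(batch, num_kv_heads, seq_len, head_dim)`.

```python
def cache_summary(kv_cache):
    """Return (num_layers, seq_len) for a cache object or a legacy tuple cache.

    Hypothetical helper for inspection only.
    """
    if hasattr(kv_cache, "get_seq_length"):
        # Cache objects such as DynamicCache expose the length directly;
        # len() is assumed to give the number of layers.
        return len(kv_cache), kv_cache.get_seq_length()
    # Legacy format: one (key, value) pair per layer, with seq_len
    # on the second-to-last axis of each tensor.
    return len(kv_cache), kv_cache[0][0].shape[-2]


# Usage in this notebook would look like:
# num_layers, seq_len = cache_summary(kv_cache)
```

For the prompt above, the reported sequence length should match the 4 tokens shown in the recorded output.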
