# Final Assignment (Main Competition) Inference Standard Code

## 1. Overview
This notebook provides the standard code for running benchmark inference with the LoRA adapter you trained and uploaded to Hugging Face, and for writing the results to a competition submission JSON file.
- The competition submission is the inference result JSON file, not the trained LoRA itself.
- This notebook walks through creating that submission JSON reliably.

- Generates answers (runs inference) for 150 questions sampled from StructEval-T.
- Requires the handout file `/content/public_150.json`.
- The output is `inference.json` (submission format), which you upload to Omnicampus for grading.
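
As a rough illustration of the two file shapes involved, the sketch below pairs one input record with its submission record. The field names `task_id`, `query`, `output_type`, and `generation` are the ones the execution code in this notebook reads and writes; the sample values (the task ID, the query text, and the `"JSON"` output type) are made up for illustration, and `to_submission` is a hypothetical helper, not part of StructEval.

```python
import json

# Hypothetical sample record; real records come from public_150.json.
sample_input = [
    {
        "task_id": "example-0001",  # made-up ID for illustration
        "query": "Return a JSON object with a single key 'ok' set to true.",
        "output_type": "JSON",  # assumed value; check the handout for real ones
    }
]


def to_submission(records, generations):
    """Pair each input record's task_id with the model's generated text."""
    return [
        {"task_id": rec["task_id"], "generation": gen}
        for rec, gen in zip(records, generations)
    ]


# The generated text is stored verbatim as a string, even when it is JSON.
submission = to_submission(sample_input, ['{"ok": true}'])
print(json.dumps(submission, ensure_ascii=False, indent=2))
```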

## 2. Advance preparation

- Set the Colab runtime to **GPU (T4)**.
- Log in to Hugging Face (token entry required).
- As a general rule, use the LoRA adapter that you uploaded from the training notebook for inference.

---

## 3. Execution procedure (recommended flow)

### Step 0: Setup (clone / install)
Execute the cells in order from top to bottom.

- Clone `StructEval` and install dependencies (vLLM, etc.).
- If `python3 -m structeval.cli --help` prints its help text, the basic setup succeeded.

### Step 1: Hugging Face Login
- Run `login()` and enter your token.

### Step 2: LoRA Integration (Merge)
- Load the LoRA at `adapter_id`, merge it with the base model, and generate `./merged_model`.
- Once this is complete, use `./merged_model` as the model path for subsequent inferences.

### Step 3: Run vLLM Inference and Generate Submission JSON
- `custom_inference.py` is generated and then run.
- The inference results are saved to `/content/StructEval/outputs/nonrenderable.json`.
- Each model output is stored under the `generation` key, and the submission file `/content/inference.json` is written.
- Download `/content/inference.json` and submit it to Omnicampus.

---

## 4. Handling output files (submissions)

### 4.1 Main Generated Files
- Merged model (no submission required)
  - `./merged_model/`

- Inference results: **submission file (most important)**
  - `/content/inference.json`
  - This file is already formatted to include the `generation` field.

### 4.2 Submission Procedure (Download → Upload to Omnicampus)
1. In Colab, **download** the final output `/content/inference.json` to your local PC.
   - Open `/content/` from "Files" (the folder icon) on the left side of Colab.
   - Right-click `inference.json` → **Download**

2. On the Omnicampus submission screen, **upload and submit** the downloaded `inference.json`.

Please name the submission file `inference.json`.

### 4.3 Important points to note when participating in the competition
- For inference with this code, use the LoRA adapter that you trained and uploaded yourself. Submitting inference results produced with any other model will result in disqualification.
- The submission must be the inference result JSON, not the LoRA itself.
- When submitting, be sure to include the URL of the adapter you uploaded to Hugging Face.

---

## 5. Common mistakes and solutions

- **GPU is not enabled**
  - Inference may be extremely slow, or vLLM may fail to start. Confirm that the runtime type is set to T4.

- **`./merged_model` does not exist**
  - The LoRA merge may not have completed. Re-run the merge cell.

- **Out of memory (OOM) occurs when running vLLM**
  - This standard code starts from `gpu_memory_utilization=0.6` for safety, but it may still fail depending on the environment.
  - In that case, restart the runtime (factory reset) and re-run the same procedure from the top.
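
The first two checks above can be scripted. The following is a minimal sketch that uses only the standard library; `gpu_names` and `merged_model_ready` are hypothetical helper names, and the sketch assumes that a completed merge leaves a `config.json` in the output directory (which `save_pretrained` normally writes).

```python
import shutil
import subprocess
from pathlib import Path


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def merged_model_ready(path="./merged_model"):
    """Return True if the directory exists and contains a config.json."""
    return (Path(path) / "config.json").is_file()


print("GPUs:", gpu_names() or "none detected")
print("merged model ready:", merged_model_ready())
```

If no GPU is listed, change the Colab runtime type before continuing; if the merged model is not ready, re-run the merge cell.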

---

## 6. Expected final state (check)

Just before submission, the following conditions must be met:

- `/content/inference.json` exists
- The JSON is a list, and every element contains a non-empty `generation` field
- Upload `inference.json` to Omnicampus and submit it
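
These conditions can be checked locally before uploading. The sketch below is one possible way to do it; `check_submission` is a hypothetical helper (not part of StructEval or the grading system), and it also reports duplicate `task_id` values, mirroring the guard in the execution code.

```python
import json
from pathlib import Path


def check_submission(path, expected_count=150):
    """Return a list of problems found in the submission file (empty if none)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        return ["top-level JSON value is not a list"]

    problems = []
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} items, found {len(data)}")

    seen = set()
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        tid = item.get("task_id")
        if tid in seen:
            problems.append(f"duplicate task_id: {tid}")
        seen.add(tid)
        gen = item.get("generation")
        if not isinstance(gen, str) or not gen.strip():
            problems.append(f"item {i} (task_id={tid}) has an empty or missing generation")
    return problems


# Only runs when the submission file is present (e.g. inside Colab).
if Path("/content/inference.json").exists():
    found = check_submission("/content/inference.json")
    print("OK" if not found else "\n".join(found))
```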

---

# Execution code


### Step 0: Setup (clone / install)

In [None]:
# 0) Setup (Fixed version)


!git clone -b fix-module-not-found-issue-2 https://github.com/Osakana7777777/StructEval.git

!uv pip install \
  "vllm==0.13.0" \
  "torch==2.9.0" \
  "torchaudio==2.9.0" \
  "torchvision==0.24.0" \
  "triton==3.5.0" \
  "compressed-tensors==0.12.2" \
  "openai==2.15.0" \
  "xgrammar==0.1.27" \
  "bitsandbytes==0.46.1" \
  fire
# Only flash-attn does not have a fixed version because its behavior changes depending on the environment
!uv pip install flash-attn --no-build-isolation

%cd StructEval
!uv pip install -e .

!python3 -m structeval.cli --help
!mkdir -p outputs


Cloning into 'StructEval'...
remote: Enumerating objects: 17398, done.
remote: Counting objects: 100% (149/149), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 17398 (delta 91), reused 45 (delta 26), pack-reused 17249 (from 3)
Receiving objects: 100% (17398/17398), 529.90 MiB | 16.40 MiB/s, done.
Resolving deltas: 100% (5424/5424), done.
Using Python 3.12.12 environment at: /usr
Resolved 165 packages in 1.92s
Prepared 47 packages in 20.14s
Uninstalled 4 packages in 118ms
Installed 47 packages in 172ms
 + anthropic==0.71.0
 + apache-tvm-ffi==0.1.8.post2
 + astor==0.8.1
 + bitsandbytes==0.46.1
 + blake3==1.0.8
 + cbor2==5.8.0
 + compressed-tensors==0.12.2
 ...

### Step 1: Hugging Face Login
- Run `login()` and enter your token.

In [1]:

# -----------------------------
# 1) HF login (once)
# -----------------------------
# Log in to Hugging Face so that the base model and your LoRA adapter can be downloaded from the HF Hub.
#
from huggingface_hub import login
login()  # Colab will prompt

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Step 2: LoRA Merge
- Load the LoRA at `adapter_id` and merge it with the base model to generate `./merged_model`.
- Once this is complete, subsequent inference will use `./merged_model` as the model path.

- Before running the next cells, upload the evaluation file `public_150.json` to the Colab file area so that it is available at `/content/public_150.json`.

In [None]:
# ------------------------------------------------------------
# 1) Config
# ------------------------------------------------------------

MODEL_SOURCE = "adapter_merge"   # "merged" | "base" | "adapter_merge"
# Select which model to use. For this example, select "adapter_merge."

# - "base": Base model (untrained raw model)
# - "merged": Model with LoRA already merged (assuming it's distributed as a finished product)
# - "adapter_merge": Load the base model and LoRA adapter on the fly and merge them locally before use.

# base model (HF repo ID or local path)
# Enter the base model used during training.
BASE_MODEL_ID_OR_PATH   = "Qwen/Qwen3-4B-Instruct-2507"

# merged model (HF repo id or local path)
# If you uploaded a merged model instead of an adapter, enter its ID here.
# Fill in if you selected "merged"
MERGED_MODEL_ID_OR_PATH = "your_id/your-merged-repo"

# adapter merge
# Enter the ID of the adapter you uploaded to HuggingFace.
# Fill in if you selected "adapter_merge"
ADAPTER_ID       = "your_id/test-lora-repo"
# Temporarily save merged model
MERGED_LOCAL_DIR = "./merged_model"

# Specify input (150 questions) and output (submission) file paths
INPUT_PATH  = "/content/public_150.json"
OUTPUT_PATH = "/content/inference.json"


TEMPERATURE = 0.0
# 0.0 is the most deterministic (the same input is likely to produce the same output) and is generally stable for evaluation purposes.
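
A common slip at this point is leaving a placeholder ID in place. The sketch below shows one way such a pre-flight check could look; `config_problems` and `VALID_SOURCES` are hypothetical names, and the `your_id/` prefix test is only an assumption based on the placeholder values used in this notebook.

```python
VALID_SOURCES = {"merged", "base", "adapter_merge"}


def config_problems(model_source, adapter_id, merged_id):
    """Return a list of human-readable problems with the config values."""
    problems = []
    if model_source not in VALID_SOURCES:
        problems.append(f"MODEL_SOURCE must be one of {sorted(VALID_SOURCES)}")
    # Only the ID that the selected mode actually uses needs to be filled in.
    if model_source == "adapter_merge" and adapter_id.startswith("your_id/"):
        problems.append("ADAPTER_ID still holds the placeholder value")
    if model_source == "merged" and merged_id.startswith("your_id/"):
        problems.append("MERGED_MODEL_ID_OR_PATH still holds the placeholder value")
    return problems


# Example with the placeholder values from the config cell.
for problem in config_problems(
    "adapter_merge", "your_id/test-lora-repo", "your_id/your-merged-repo"
):
    print("[WARN]", problem)
```

In the notebook itself, the call would take `MODEL_SOURCE`, `ADAPTER_ID`, and `MERGED_MODEL_ID_OR_PATH` as arguments.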


### Step 3: Run vLLM inference and generate JSON for submission
- `custom_inference.py` is generated and executed.
- The inference results are saved to `/content/StructEval/outputs/nonrenderable.json`.
- Each model output is stored under the `generation` key, and the submission file `/content/inference.json` is written.
- Download `/content/inference.json` and submit it to Omnicampus.

---

In [None]:

# ------------------------------------------------------------
# 2) Stable vLLM env (IMPORTANT: must be set BEFORE importing vllm)
# ------------------------------------------------------------

import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
# The method of creating worker processes within vLLM will be fixed to "spawn".
# In some environments, such as Colab, this may be more stable than "fork".

os.environ["VLLM_LOGGING_LEVEL"] = "INFO"
# Set the vLLM log level (INFO). This is useful for debugging.

# ------------------------------------------------------------
# 3) Resolve model_path
# ------------------------------------------------------------
# Depending on the MODEL_SOURCE you select, determine the "model location (model_path)" to be passed to vLLM.

def resolve_model_path():
    # A function that returns the path/ID to pass to vLLM depending on which model to use.
    if MODEL_SOURCE == "base":
        return BASE_MODEL_ID_OR_PATH

    if MODEL_SOURCE == "merged":
        return MERGED_MODEL_ID_OR_PATH

    if MODEL_SOURCE == "adapter_merge":
        # NOTE: The merge uses torch/CUDA (GPU), so it must finish before vLLM is started.
        import os, gc
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel
        print("[INFO] Merging adapter into base model...")
        base_model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL_ID_OR_PATH,
            dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
        )
        # Load the tokenizer corresponding to the base model (usually the same one is used after merging)
        tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID_OR_PATH, trust_remote_code=True)

        # Merge the LoRA adapter (ADAPTER_ID) into the base_model.
        # After merging, you can remove the LoRA layer (unload), simplifying inference handling.
        model_to_merge = PeftModel.from_pretrained(base_model, ADAPTER_ID)
        merged_model = model_to_merge.merge_and_unload()

        os.makedirs(MERGED_LOCAL_DIR, exist_ok=True)
        merged_model.save_pretrained(MERGED_LOCAL_DIR)
        tokenizer.save_pretrained(MERGED_LOCAL_DIR)

        del base_model, model_to_merge, merged_model
        gc.collect()
        torch.cuda.empty_cache()
        print("[INFO] Merged model saved:", MERGED_LOCAL_DIR)
        return MERGED_LOCAL_DIR

    raise ValueError("MODEL_SOURCE must be 'merged'|'base'|'adapter_merge'")

# Determine the path/ID of the final model to be used
model_path = resolve_model_path()
print("[INFO] Using model:", model_path)

# ------------------------------------------------------------
# 4) Load public_150 and build prompts (no torch usage here)
# ------------------------------------------------------------
# Read the input file and create prompts (strings to pass to the model) for each question.

import json
from pathlib import Path
from transformers import AutoTokenizer

pub = json.loads(Path(INPUT_PATH).read_text(encoding="utf-8"))

assert isinstance(pub, list), "public_150.json must be a list"
assert len(pub) == 150, f"public_150 must have 150 items, got {len(pub)}"
assert len({x["task_id"] for x in pub}) == 150, "public_150 has duplicate task_id"

# Safety: ensure output_type exists (i.e., this is the enriched file distributed by the organizers)

missing_ot = [x.get("task_id") for x in pub if not (x.get("output_type") or "").strip()]

if missing_ot:
    raise RuntimeError(f"FATAL: public_150 missing output_type (not enriched). Examples: {missing_ot[:5]}")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# task_ids: Stores the sequence of task_ids to be used for output.
# prompts: Stores the prompt string to be passed to vLLM.
task_ids, prompts = [], []

for item in pub:
    task_ids.append(item["task_id"])
    query = item.get("query", "")
    messages = [{"role": "user", "content": query}]
    prompts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
    # ↑ Use apply_chat_template to format the string into the conversational format expected by the model.
    # tokenize=False: Do not tokenize yet and return as a string.
    # add_generation_prompt=True: Add a boundary where the assistant will answer.
    # This makes it easier for the model to continue generating answers.

# ------------------------------------------------------------
# 5) Presets + fallback plan
# ------------------------------------------------------------
# If you set the "context length (max_model_len)" or "output limit (max_tokens)" too large when starting vLLM,
# it is likely to crash due to insufficient GPU memory (OOM).
# Therefore, we prepare several settings that are likely to succeed, and if they fail, we gradually reduce the settings and retry.
# Because actual memory usage can vary between merged (already baked) and adapter_merge (on-the-fly merge),
# the settings to be tried first (e.g., gpu_mem) are different.
# Create a "trial candidate list" in advance and try them in order from top to bottom.

def build_try_configs():

    # Primary presets

    if MODEL_SOURCE == "merged":
        base = [
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.85},
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.80},
        ]
        # ↑ Allow up to 4096 tokens of context/output, and lower gpu_mem from 0.85 to 0.80 if the first attempt fails.
    elif MODEL_SOURCE == "adapter_merge":
        base = [
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.60},
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.65},
        ]
        # ↑ adapter_merge tends to be memory intensive, so try starting with a low gpu_mem.

    else:  # base
        base = [
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.80},
            {"max_model_len": 4096, "max_tokens": 4096, "gpu_mem": 0.70},
        ]
        # ↑ The base model is assumed to be relatively light, so we will try 0.80→0.70.

    # Fallback ladder (reduce context / output)
    # A "gradual reduction setting" in case of failure.
    # Lowering max_model_len and max_tokens reduces memory requirements and increases the likelihood of success.
    ladder = [
        {"max_model_len": 3072, "max_tokens": 3072},
        {"max_model_len": 2048, "max_tokens": 2048},
        {"max_model_len": 1536, "max_tokens": 1536},
    ]

    # Expand base configs with ladder and a couple of gpu_mem tweaks
    # ↑ Combine each base config with every ladder step to widen the set of configurations to try.
    # Also try a slightly higher gpu_mem (this can help when the failure is caused by too small a memory allocation).
    out = []
    for cfg in base:
        out.append(cfg)

        for step in ladder:
            out.append({**cfg, **step})

        # try a slightly higher gpu_mem if still failing (some failures are "not enough alloc")
        out.append({**cfg, "gpu_mem": min(0.90, cfg["gpu_mem"] + 0.05)})

    # Deduplicate while preserving order
    # ↑ Some configs may coincide, so drop the duplicates while keeping the original order.
    seen = set()
    uniq = []
    for c in out:
        key = (c["max_model_len"], c["max_tokens"], round(c["gpu_mem"], 2))

        if key in seen:
            continue

        seen.add(key)
        uniq.append(c)

    return uniq


TRY_CONFIGS = build_try_configs()
# ↑ Create a list of settings to try out.

print("[INFO] Try configs (in order):")

for i, c in enumerate(TRY_CONFIGS[:8], 1):
    print(f"  {i:02d}. max_model_len={c['max_model_len']} max_tokens={c['max_tokens']} gpu_mem={c['gpu_mem']}")

if len(TRY_CONFIGS) > 8:
    print(f"  ... total {len(TRY_CONFIGS)} configs")

# ------------------------------------------------------------
# 6) vLLM run with retry
# ------------------------------------------------------------
# ↑ This is the main part of the inference.

from vllm import LLM, SamplingParams


def run_with_config(cfg):

    sampling = SamplingParams(
        temperature=TEMPERATURE,
        max_tokens=cfg["max_tokens"],
    )

    llm = LLM(
        model=model_path,
        max_model_len=cfg["max_model_len"],
        gpu_memory_utilization=cfg["gpu_mem"],
        enforce_eager=True,
        tensor_parallel_size=1,
        disable_log_stats=True,
    )

    outs = llm.generate(prompts, sampling)

    submission = []
    # ↑ Build the submission list [{"task_id": ..., "generation": ...}, ...].

    for tid, out in zip(task_ids, outs):
        gen = out.outputs[0].text if out.outputs else ""
        submission.append({"task_id": tid, "generation": gen})
    return submission
    # ↑ Returns the submission list for all 150 questions.


last_err = None
submission = None
# ↑ Holds the submission data (150 items) once generation succeeds; None until then.
for idx, cfg in enumerate(TRY_CONFIGS, 1):
    print(f"[INFO] Attempt {idx}/{len(TRY_CONFIGS)}: max_model_len={cfg['max_model_len']} max_tokens={cfg['max_tokens']} gpu_mem={cfg['gpu_mem']}")
    try:
        submission = run_with_config(cfg)
        print("[INFO] ✅ Generation succeeded with this config.")
        # ↑ Success log
        break
    except RuntimeError as e:
        last_err = e
        msg = str(e)
        print("[WARN] Failed:", msg[:200].replace("\n", " "))
        # Fall through to the next config.

if submission is None:
    raise RuntimeError(f"All configs failed. Last error: {last_err}")


# Final guards
# ↑ Finally, perform a "submission consistency check."

if len(submission) != 150:
    # ↑ Check if 150 items have been generated
    raise RuntimeError(f"Submission count mismatch: {len(submission)}")

if len({x['task_id'] for x in submission}) != 150:
    # ↑ Check for duplicate task_ids
    raise RuntimeError("Duplicate task_id in submission")

Path(OUTPUT_PATH).write_text(json.dumps(submission, ensure_ascii=False, indent=2), encoding="utf-8")
# ↑ Convert the submission (Python object) into a JSON string and save it to a file.

print("[OK] wrote:", OUTPUT_PATH, "items=150")
