<a href="https://colab.research.google.com/github/H4miiiid/MentorApp/blob/main/context_engineering_models(10_samples).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Test notebook: generate 10 samples (description + code) in JSON format with different models

## A. Generate project descriptions and titles with different models

### 1. Set up the models and the OpenRouter API key

In [None]:
!pip -q install requests jsonschema

In [12]:
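import os, json, re, textwrap, datetime, time
import requests
from jsonschema import Draft7Validator

# Read the key from the environment; never hard-code a real key in the notebook.
# Set OPENROUTER_API_KEY before running the cells below.
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")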
MODELS = [
    #"openai/gpt-4.1-mini",
    "anthropic/claude-sonnet-4.5",
    "qwen/qwen3-coder",
]

### 2. Set the variables

In [27]:
SYSTEM_PROMPT = """
You are a meticulous AI project designer.
Your job is to produce concise, implementable AI mini-project ideas that can be turned into runnable Python scripts.

Dataset policy (VERY IMPORTANT):
- You must prefer ONLY these real datasets when proposing projects:
  - sklearn: iris, digits, wine, breast_cancer, diabetes
  - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
  - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
- Do NOT propose or mention 20 Newsgroups, fetch_20newsgroups, or any "newsgroups"/"news group" variant.
- If the project is NOT naturally covered by the whitelist (e.g. NLP, audio, recommendation, time-series), you MUST say in the description:
  - “generate 200–300 synthetic … samples” (text / audio-like / tabular / time-series)
  - Keep it clearly offline and small.
- You may mention standard AI datasets (MNIST, FashionMNIST, CIFAR10) even if they need a download, BUT you must phrase the description so the script can fall back to a small synthetic dataset if the download is not available.

Metrics & acceptance:
- Every project idea must propose an acceptance/check that is realistic for the dataset + model.
- If the code will FALL BACK to synthetic/FakeData, the acceptance must also FALL BACK to an easier threshold.
- Use these safe ranges:
  • iris (classification): accuracy ≥ 0.90
  • wine (classification): accuracy ≥ 0.90–0.92
  • breast_cancer (classification): accuracy ≥ 0.90–0.94
  • digits + simple model (logreg / linear SVM): accuracy ≥ 0.90–0.93 (not 0.98)
  • diabetes (regression): R² ≥ 0.35–0.45
  • classic synthetic classifiers (make_moons, make_circles, make_blobs): accuracy ≥ 0.85–0.90, or silhouette ≥ 0.5 for blobs
  • PCA / plotting / KMeans on images: acceptance = “file exists and non-empty” or “score in easy range”

- If the task says “use synthetic / generate N samples”: set accuracy to 0.60–0.75 or R² to 0.25–0.35.
- Do NOT demand SOTA or long training (no 0.99+, no 1e-4 MAE) for mini-projects.
- If the dataset may not be available offline (e.g. Fashion-MNIST, MNIST, Reuters), explicitly tell the code generator:
  “If real dataset not available → generate synthetic data → use lower threshold.”

Rules:
- Output must be valid JSON ONLY (no extra text).
- Each item has exactly two keys: "title" and "description".
- Titles are short and specific (≤ 6 words).
- Descriptions are 1–2 sentences, concrete, and implementable offline in 20–60 minutes.
- Prefer single-file, single-metric projects with tiny data and fast runtime.
- Avoid duplicate or near-duplicate ideas.
- Prefer standard Python libs or widely used ML libs (numpy, pandas, scikit-learn, PyTorch, TensorFlow, OpenCV).
- No external downloads; use built-in toy datasets (e.g., sklearn iris/digits) or tiny synthetic data.
""".strip()


In [5]:
FEW_SHOTS = """
{"title":"Iris KNN Classifier",
"description":"Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9."}
{"title":"Synthetic Text Sentiment",
"description":"Create 200 short synthetic sentences labeled positive or negative, vectorize with CountVectorizer, train a LogisticRegression, and print accuracy; print TEST_PASS if accuracy ≥ 0.7."}
""".strip()


### 3. A function for describing the task

In [6]:
import textwrap

def build_task(n=10):
    return textwrap.dedent(f"""
    Task: Generate {n} distinct AI mini-project ideas.

    Constraints:
    - Return a JSON array of length {n}.
    - Each item: object with exactly "title" (string) and "description" (string).
    - No comments, no prose outside JSON.

    Scope & Simplicity:
    - Each project is doable offline in 20–60 minutes on CPU.
    - Single-file mindset: one clear goal, one primary metric or artifact.
    - Keep dependencies minimal (numpy/pandas/sklearn/torch/tf/opencv only).
    - Mention one artifact or metric (png, accuracy, inertia, silhouette, TEST_PASS).

    Dataset whitelist (must follow):
    - sklearn: iris, digits, wine, breast_cancer, diabetes
    - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
    - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
    - Do NOT use 20 Newsgroups or fetch_20newsgroups.

    For NLP / audio / task-specific topics:
    - Explicitly say: “generate 200–300 synthetic <domain> samples” so the code agent knows to build data in-code.

    Description style:
    - Titles ≤ 6 words, specific.
    - Descriptions are 1–2 sentences with concrete I/O hints (flags, paths, outputs).
    - Include at least one quick validation (e.g., accuracy threshold, file existence, non-empty output).

    Diversity (within the allowed areas):
    - Avoid repeating the same idea or trivial variants.

    Follow the style of these examples without repeating them:
    {FEW_SHOTS}

    Now produce the JSON array of {n} items.
    """).strip()


### 4. OpenRouter call helper

In [7]:
# Helper to make a safe Python variable name from a slug
def varname_from_slug(slug: str) -> str:
    name = slug.lower().replace("/", "_").replace("-", "_").replace(".", "_")
    return f"{name}_result"

In [8]:
# Generic OpenRouter caller taking model_id
def call_openrouter_model(model_id, messages, temperature=0.3, top_p=0.9, max_tokens=6000):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://colab.research.google.com/",
        "X-Title": "Multi-Model Project Generator",
    }
    payload = {
        "model": model_id,
        "messages": messages,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
    }

    # Pin Qwen models to a single provider (atlas-cloud/fp8) with no fallbacks
    if model_id.startswith("qwen/"):
        payload["provider"] = {
            "only": ["atlas-cloud/fp8"],
            "allow_fallbacks": False
        }

    t0 = time.time()
    r = requests.post(url, headers=headers, json=payload, timeout=120)
    latency = time.time() - t0
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
    return content, latency
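
The caller above raises on the first HTTP error through `raise_for_status()`, so one transient failure ends the run for that model. A minimal sketch of a retry wrapper with exponential backoff follows. The name `with_retries` and all of its parameters are hypothetical and not part of this notebook or of the OpenRouter API; which exceptions are worth retrying is an assumption to adjust.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper: `fn` is any zero-argument callable, for example
    lambda: call_openrouter_model(model_id, messages).
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))  # waits 1x, 2x, 4x ... base_delay
```

It could be used as `raw, secs = with_retries(lambda: call_openrouter_model(model_id, messages, temperature=0.2))`. Passing `retry_on=(requests.exceptions.RequestException,)` would limit retries to network and HTTP errors.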

In [9]:
# Validator for title/description array
from jsonschema import Draft7Validator
def validate_items(arr, N):
    ITEM_SCHEMA = {
        "type":"object",
        "required":["title","description"],
        "properties":{
            "title":{"type":"string","minLength":3, "maxLength":100},
            "description":{"type":"string","minLength":20, "maxLength":600}
        },
        "additionalProperties": False
    }
    ARRAY_SCHEMA = {"type":"array","items":ITEM_SCHEMA, "minItems":N, "maxItems":N}
    errs = [e.message for e in Draft7Validator(ARRAY_SCHEMA).iter_errors(arr)]
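    # Only dict items carry a title; non-dict items are already reported by the schema check
    titles = [str(x.get("title") or "").strip().lower() for x in arr if isinstance(x, dict)]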
    if len(set(titles)) != len(titles):
        errs.append("Duplicate titles detected.")
    return errs
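
The schema check above validates structure only. It does not verify that the acceptance threshold written into a description stays inside the "safe ranges" from `SYSTEM_PROMPT`. A rough sketch of such a check is given here. The table `MAX_THRESHOLDS` and the function `check_thresholds` are hypothetical names, the upper bounds are copied from the prompt, and the keyword matching is deliberately crude (it ignores which model the description uses).

```python
import re

# Hypothetical upper bounds, copied from the "safe ranges" in SYSTEM_PROMPT.
# Keys are lowercase keywords searched for in the title + description.
MAX_THRESHOLDS = {
    "iris": 0.90,
    "wine": 0.92,
    "breast_cancer": 0.94,
    "digits": 0.93,
}

# Matches e.g. "accuracy ≥ 0.95" or "accuracy >= 0.95".
_ACC_RE = re.compile(r"accuracy\s*(?:≥|>=)\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)

def check_thresholds(items):
    """Return warnings for items whose accuracy threshold exceeds the safe range."""
    warnings = []
    for item in items:
        text = f"{item.get('title', '')} {item.get('description', '')}".lower()
        m = _ACC_RE.search(text)
        if not m:
            continue  # no accuracy threshold stated (e.g. a file-exists check)
        asked = float(m.group(1))
        for keyword, limit in MAX_THRESHOLDS.items():
            if keyword in text and asked > limit:
                warnings.append(
                    f"{item.get('title')}: accuracy ≥ {asked} exceeds {limit} for {keyword}"
                )
    return warnings
```

The returned warnings could be appended to the list from `validate_items`, so that over-ambitious thresholds show up next to the schema errors.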

In [10]:
# Extract the first JSON array from a model reply (strips a code fence if present, then matches brackets)
import re, json, time, requests
def extract_json_array(text: str):
    m = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE)
    if m:
        text = m.group(1).strip()
    start = text.find('[')
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == '[': depth += 1
        elif ch == ']':
            depth -= 1
            if depth == 0:
                return text[start:i+1]
    return None
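
The bracket counter above does not know about JSON strings, so a `[` or `]` inside a title or description can end the match early. One alternative, sketched here under the assumption that returning the parsed list (rather than the substring) is acceptable, is to let the standard library's `json.JSONDecoder.raw_decode` find where the array ends. The function name `extract_first_json_array` is hypothetical.

```python
import json

def extract_first_json_array(text: str):
    """Return the first parseable JSON array in `text` as a list, or None."""
    decoder = json.JSONDecoder()
    pos = text.find("[")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores trailing text
            value, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        pos = text.find("[", pos + 1)  # try the next candidate bracket
    return None
```

Because the decoder tolerates trailing text, this also works on replies wrapped in a code fence or followed by prose, without a separate fence-stripping step.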

In [13]:
# Build a single shared message set
N = 10
task = build_task(N) + "\n\nReturn a raw JSON array only — no prose, no code fences, no markdown."
messages = [
    {"role":"system","content": SYSTEM_PROMPT},
    {"role":"user","content": task}
]

# Loop models and create a separate variable per model with the results
for model_id in MODELS:
    print(f"\n===== {model_id} =====")
    varname = varname_from_slug(model_id)
    try:
        raw, secs = call_openrouter_model(model_id, messages, temperature=0.2)
        json_str = extract_json_array(raw)
        if not json_str:
            globals()[varname] = {
                "raw": raw, "json_str": None, "items": None,
                "errors": ["No JSON array found"], "latency": secs
            }
            print(f"❌ No JSON array found | {secs:.1f}s")
            continue

        items = json.loads(json_str)
        errors = validate_items(items, N)

        globals()[varname] = {
            "raw": raw, "json_str": json_str, "items": items,
            "errors": errors, "latency": secs
        }

        if errors:
            print(f"⚠️ Parsed but validation errors ({len(errors)}) | {secs:.1f}s")
            for e in errors[:5]:
                print(" -", e)
        else:
            print(f"✅ Valid JSON ({len(items)} items) | {secs:.1f}s")
            print(json.dumps(items[:2], indent=2, ensure_ascii=False))
    except Exception as e:
        globals()[varname] = {
            "raw": None, "json_str": None, "items": None,
            "errors": [str(e)], "latency": None
        }
        print("❌ Exception:", e)


===== anthropic/claude-sonnet-4.5 =====
✅ Valid JSON (10 items) | 11.0s
[
  {
    "title": "Digits SVM Classifier",
    "description": "Load sklearn's digits dataset, train a linear SVM, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.95."
  },
  {
    "title": "Wine Random Forest",
    "description": "Load sklearn's wine dataset, train a RandomForestClassifier with 50 trees, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  }
]

===== qwen/qwen3-coder =====
✅ Valid JSON (10 items) | 42.3s
[
  {
    "title": "Digits PCA Visualization",
    "description": "Load sklearn's digits dataset, apply PCA to reduce to 2D, and save a scatter plot as 'digits_pca.png'. Print TEST_PASS if the file exists and is non-empty."
  },
  {
    "title": "Wine Cluster Silhouette",
    "description": "Use sklearn's wine dataset to perform KMeans clustering (k=3), compute the average silhouette score, and print it. Print TEST_PASS if the score is ≥ 0.4."
  }
]


In [14]:
for model_id in MODELS:
    print(" -", varname_from_slug(model_id))

 - anthropic_claude_sonnet_4_5_result
 - qwen_qwen3_coder_result


### Full results for each model:

In [15]:
print(json.dumps(anthropic_claude_sonnet_4_5_result['items'], indent=2, ensure_ascii=False))

[
  {
    "title": "Digits SVM Classifier",
    "description": "Load sklearn's digits dataset, train a linear SVM, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.95."
  },
  {
    "title": "Wine Random Forest",
    "description": "Load sklearn's wine dataset, train a RandomForestClassifier with 50 trees, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  },
  {
    "title": "K-Means on Blobs",
    "description": "Generate 300 samples with make_blobs (3 clusters), run KMeans (k=3), and print silhouette score. Print TEST_PASS if silhouette ≥ 0.5."
  },
  {
    "title": "Diabetes Linear Regression",
    "description": "Load sklearn's diabetes dataset, train a LinearRegression model, and print test R² score. Print TEST_PASS if R² ≥ 0.4."
  },
  {
    "title": "MNIST Logistic Regression",
    "description": "Load MNIST (or generate 1000 synthetic 28×28 grayscale images if unavailable), flatten pixels, train LogisticRegression, and print test accuracy. Print TEST_PASS 

In [16]:
print(json.dumps(qwen_qwen3_coder_result['items'], indent=2, ensure_ascii=False))

[
  {
    "title": "Digits PCA Visualization",
    "description": "Load sklearn's digits dataset, apply PCA to reduce to 2D, and save a scatter plot as 'digits_pca.png'. Print TEST_PASS if the file exists and is non-empty."
  },
  {
    "title": "Wine Cluster Silhouette",
    "description": "Use sklearn's wine dataset to perform KMeans clustering (k=3), compute the average silhouette score, and print it. Print TEST_PASS if the score is ≥ 0.4."
  },
  {
    "title": "Breast Cancer SVM",
    "description": "Train an SVM classifier on sklearn's breast_cancer dataset, split 80/20, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  },
  {
    "title": "Diabetes Linear Regression",
    "description": "Fit a linear regression model on sklearn's diabetes dataset, print the R² score on test split. Print TEST_PASS if R² ≥ 0.4."
  },
  {
    "title": "MNIST CNN Classifier",
    "description": "Train a small CNN on MNIST (or FakeData if unavailable), print test accuracy after 5 epochs

## B. Generate code for each project produced in part A

In [17]:
# -------- External memory (compact, curated) --------
memory = {
  "style_guide": [
    "Single-file script with `if __name__ == '__main__':` entrypoint.",
    "Use argparse with clear --help and sensible defaults.",
    "Prefer standard library datasets (sklearn, torchvision, keras).",
    "Attempt auto-download/cache with short timeout; if unavailable, fallback to a tiny structured synthetic dataset.",
    "Fix randomness: set seeds for random, numpy; torch if used; run on CPU by default.",
    "Validate inputs (paths, columns, image loads) and fail gracefully with one-line reason.",
    "Keep runtime < 2 minutes (few epochs, small subsets).",
    "Print `TEST_PASS` on success; otherwise `TEST_FAIL: <reason>`."
  ],
  "lessons": [
    "When standardizing features use sklearn.pipeline.Pipeline to avoid leakage.",
    "For OpenCV Canny, expose --threshold1 and --threshold2; convert to grayscale before edges.",
    "For CSV tasks, explicitly validate required columns; show a friendly error if missing.",
    "For plotting, save figures to disk and plt.close() to avoid backend issues.",
    "One file only. Return exactly ONE ```python block. No extra prose.",
    "CLI + help. Use a single 'argparse.ArgumentParser()'. All help strings are single-line (no embedded newlines).",
    "Seeds in 'main()'. Expose '--seed' and set seeds for random, numpy, and torch (if present) inside main().",
    "Data access policy. Only use library datasets when --allow-download is passed. Otherwise do not download; use a robust fallback (sklearn tabular, torchvision FakeData, PIL shapes, etc.). If using 20NG, call with download_if_missing=False unless allowed.",
    "Task–dataset match. Choose datasets that match the task (e.g., do not use 20 Newsgroups for spam/ham).",
    "CV safety. For OpenCV: 1. convert to grayscale if needed. 2. ensure input to detectors is uint8 (cv2.convertScaleAbs if needed). 3. for Haar, check face_cascade.empty() == False or fail.",
    "Acceptance contract. Implement explicit pass/fail checks (files exist, metrics ≥ thresholds, non-empty edge map, etc.). Print TEST_PASS only when all conditions hold; otherwise TEST_FAIL: <reason> and sys.exit(1).",
    "No broken syntax. Never split identifiers across lines. Never break f-strings or string literals across lines.",
    "End marker. Append '# END_OF_SCRIPT' as the last line of the file."
  ],
  "snippets": [
    # seed block to embed in each script
    "import random, numpy as np\nrandom.seed(42)\nnp.random.seed(42)\ntry:\n    import torch\n    torch.manual_seed(42)\nexcept Exception:\n    pass"
  ]
}
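
The style guide and the acceptance-contract lesson above define a machine-checkable outcome: a script prints `TEST_PASS` on success, or prints `TEST_FAIL: <reason>` and exits with a non-zero status. A minimal sketch of a harness that runs one generated script and applies that contract is shown here. The name `run_generated_script` and the 120-second default timeout are assumptions. Since the code comes from a model, it should only be run in a throwaway environment such as a fresh Colab runtime.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_script(code: str, timeout: float = 120.0):
    """Run `code` in a fresh interpreter; return (passed, stdout, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "script.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                cwd=workdir,          # saved plots and data stay in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "", f"timed out after {timeout}s"
    # Contract: exit code 0 and a line that is exactly TEST_PASS
    printed_pass = any(line.strip() == "TEST_PASS" for line in proc.stdout.splitlines())
    return proc.returncode == 0 and printed_pass, proc.stdout, proc.stderr
```

Applied to each generated script, this would give a pass rate per model that can be compared alongside the latencies.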

In [18]:
SYSTEM_PROMPT_CODEGEN = """
You are a meticulous senior Python engineer who writes production-quality, runnable scripts.
Priorities: (1) correctness, (2) reproducibility, (3) clarity, (4) speed.

Formatting & Output Contract:
- Return ONE code block only: ```python ...```
- The code must be a single file with `if __name__ == "__main__":` entrypoint.
- Provide a clear CLI via argparse and `--help`. All help strings must be single-line (no embedded newlines).
- Do not print explanations. Do not include markdown outside the single code block.
- Append `# END_OF_SCRIPT` as the final line of the file.

Dataset whitelist (MUST follow):
- You may directly load/use ONLY these real datasets:
  - sklearn: iris, digits, wine, breast_cancer, diabetes
  - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
  - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
- Do NOT use 20 Newsgroups, fetch_20newsgroups, or any "newsgroups" variant.
- If the project description says to “generate 200–300 synthetic … samples”, you MUST implement that synthetic dataset in code (e.g. build 200 labeled sentences, or 300 tabular rows, or 200 (x,y) pairs).
- If the project mentions an allowed dataset that might require download (e.g. MNIST, FashionMNIST, CIFAR10), first TRY to load it, and if it fails or `--allow-download` was not passed, fall back to a synthetic dataset that matches the task.

Behavioral Rules:
- Expose `--seed` and set seeds **inside `main()`** for `random`, `numpy`, and `torch` (if available); run on CPU by default.
- Validate inputs (paths, columns, image loads, flags) and fail gracefully with a concise message.
- For OpenCV tasks: convert to grayscale when needed; ensure `uint8` input; for Haar cascades, ensure `face_cascade.empty() == False` or fail.
- Keep the code minimal, readable, and fully runnable in a fresh Colab.
- Never split identifiers across lines; never break string literals, f-strings, or comments across lines.
  - Comments must be on one line each (e.g. `# custom text prediction`), not split into two lines.
- Implement explicit acceptance checks tied to the task (files exist, metrics ≥ thresholds, non-empty edge map, etc.).
- Print `TEST_PASS` only when all acceptance conditions hold; otherwise print `TEST_FAIL: <reason>` and `sys.exit(1)`.

Self-Check Before Returning (silently revise if any item fails):
- argparse help strings are single-line.
- Seeds are applied in `main()` for random/numpy/torch.
- No downloads are attempted unless `--allow-download` was passed.
- Dataset matches the task semantics.
- Dataset name is NOT `20newsgroups` / `fetch_20newsgroups` / “newsgroups”, unless the task is explicitly about newsgroups AND `--allow-download` was passed.
- Acceptance checks implemented; `TEST_PASS`/`TEST_FAIL` present.
- File ends with `# END_OF_SCRIPT`.
- Code parses without SyntaxError and comments are not broken across lines.
""".strip()


In [19]:
FEWSHOTS_CODE = """
Example A (tabular classification with sklearn iris -> fallback synthetic; seeds-in-main; single-line help; acceptance checks)
```python
import argparse, sys
import random, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def load_iris_or_synthetic(seed=42):
    try:
        from sklearn.datasets import load_iris  # no download required
        data = load_iris()
        X, y, used = data.data, data.target, "iris"
    except Exception:
        rng = np.random.default_rng(seed)
        n = 210
        c = rng.integers(0, 3, size=n)
        X = rng.normal(0, 1, size=(n, 4)) + c[:, None] * 1.5
        y, used = c, "synthetic"
    return X, y, used

def main():
    p = argparse.ArgumentParser(description="Iris (no-download) or synthetic fallback; seeds set in main; explicit acceptance.")
    p.add_argument("--test-size", type=float, default=0.2, help="Test set fraction (default: 0.2).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    # seeds in main
    random.seed(args.seed); np.random.seed(args.seed)
    try:
        import torch; torch.manual_seed(args.seed)
    except Exception:
        pass

    X, y, used = load_iris_or_synthetic(args.seed)
    if X is None or y is None or len(X) == 0:
        print("TEST_FAIL: dataset not available"); sys.exit(1)

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=args.test_size, random_state=args.seed, stratify=y)
    clf = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=300))])
    clf.fit(Xtr, ytr)
    acc = clf.score(Xte, yte)
    print(f"dataset={used} accuracy={acc:.3f}")
    # acceptance: stricter if iris, looser if synthetic
    if acc >= (0.85 if used == "iris" else 0.70):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: accuracy below threshold"); sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT
```

Example B (vision MNIST with opt-in download -> fallback FakeData; uint8 safety; seeds-in-main; acceptance checks)
```python
import argparse, sys, os
import random, numpy as np

def load_mnist_or_fakedata(max_train=2000, max_test=500, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        # only download if explicitly allowed
        train = datasets.MNIST(root="./data", train=True, download=bool(allow_download), transform=tfm)
        test  = datasets.MNIST(root="./data", train=False, download=bool(allow_download), transform=tfm)
        # if dataset objects are empty because cache missing and download disabled, trigger fallback
        if len(train) == 0 or len(test) == 0:
            raise RuntimeError("MNIST cache missing and download disabled")
        train = torch.utils.data.Subset(train, list(range(min(len(train), max_train))))
        test  = torch.utils.data.Subset(test,  list(range(min(len(test),  max_test))))
        return train, test, True
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = FakeData(size=max_train, image_size=(1, 28, 28), num_classes=10, transform=tfm)
        test  = FakeData(size=max_test,  image_size=(1, 28, 28), num_classes=10, transform=tfm)
        return train, test, False

def main():
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader

    p = argparse.ArgumentParser(description="MNIST (opt-in download) or FakeData fallback; seeds in main; explicit acceptance.")
    p.add_argument("--epochs", type=int, default=1, help="Training epochs (default: 1).")
    p.add_argument("--batch", type=int, default=128, help="Batch size (default: 128).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Permit MNIST download if not cached.")
    args = p.parse_args()

    # seeds in main
    random.seed(args.seed); np.random.seed(args.seed); torch.manual_seed(args.seed)

    train_ds, test_ds, real = load_mnist_or_fakedata(seed=args.seed, allow_download=args.allow_download)
    train = DataLoader(train_ds, batch_size=args.batch, shuffle=True)
    test  = DataLoader(test_ds,  batch_size=args.batch, shuffle=False)

    class TinyCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, 1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, 1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Flatten(),
                nn.Linear(32*5*5, 64), nn.ReLU(),
                nn.Linear(64, 10)
            )
        def forward(self, x): return self.net(x)

    model = TinyCNN()
    opt = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for _ in range(args.epochs):
        for xb, yb in train:
            # ensure uint8 -> float32 is handled by ToTensor; just train
            opt.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward(); opt.step()

    # eval + acceptance
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in test:
            pred = model(xb).argmax(1)
            correct += (pred == yb).sum().item()
            total += yb.numel()
    acc = correct / max(total, 1)
    print(f"acc={acc:.3f} dataset={'mnist' if real else 'fake'}")
    # stricter if real, looser if fake
    if acc >= (0.85 if real else 0.20):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: accuracy below threshold"); sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT
```
""".strip()

In [20]:
import os, re, json, requests

# --- helpers ---
def extract_code_block(text: str) -> str:
    """
    Return the first code block content if present; otherwise return the whole text.
    Prefers ```python ... ``` but accepts ``` ... ```.
    """
    m = re.search(r"```(?:python)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE)
    return m.group(1) if m else text

def print_long(s: str, width: int = 4000):
    """Print long strings without Colab truncation, in chunks."""
    for i in range(0, len(s), width):
        print(s[i:i+width])

In [21]:
# -------- Task builder (per project) --------
def build_code_task(project, memory):
    guide = "\n- ".join(memory["style_guide"])
    lessons = "\n- ".join(memory["lessons"][-6:])
    seed_block = memory["snippets"][0]

    return f"""
    PROJECT TITLE:
    {project['title']}

    PROJECT DESCRIPTION:
    {project['description']}

    Follow this style guide:
    - {guide}

    Incorporate recent lessons:
    - {lessons}

    Hard guardrails (must follow):
    - Return ONE code block only: ```python ...``` (no extra prose).
    - Single file with `if __name__ == "__main__":` entrypoint.
    - Use argparse; **all help strings are single-line** (no embedded newlines).
    - Expose `--seed` and set seeds **inside `main()`** for random, numpy, and torch (if available).
    - Use ONLY the whitelist datasets when a real dataset is required:
      - sklearn: iris, digits, wine, breast_cancer, diabetes
      - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
      - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
    - If the description asks for NLP, audio, or task-specific data that is NOT in the whitelist, generate 200–300 synthetic samples in code (labelled if classification).
    - Do NOT use 20 Newsgroups or fetch_20newsgroups.
    - Choose datasets that **match the task semantics** (e.g., do NOT use 20 Newsgroups for spam/ham).
    - For OpenCV tasks: convert to grayscale when needed; ensure `uint8` input (use `cv2.convertScaleAbs` if necessary); for Haar cascades verify `face_cascade.empty()==False` or fail.
    - Never split identifiers across lines; never break string literals or f-strings across lines.

    Embed this seed block near the top of the script:
    ```python
    {seed_block}
    ```
    """


In [22]:
# -------- OpenRouter caller --> (code) --------
import time

def call_openrouter_model_code(model_id, messages, temperature=0.2, top_p=0.9, max_tokens=6000):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://colab.research.google.com/",
        "X-Title": "Multi-Model Project Generator",
    }
    payload = {
        "model": model_id,
        "messages": messages,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
    }

    # Pin Qwen models to a single provider (novita/fp8) with no fallbacks
    if model_id.startswith("qwen/"):
        payload["provider"] = {
            "only": ["novita/fp8"],
            "allow_fallbacks": False
        }

    time.sleep(2.0)
    t0 = time.time()
    r = requests.post(url, headers=headers, json=payload, timeout=120)
    latency = time.time() - t0
    r.raise_for_status()
    code = r.json()["choices"][0]["message"]["content"]
    return code, latency


In [23]:
#-------- Map: model slug -> per-model dict variable name --------

MODEL_TO_RESULTVAR = {
#"openai/gpt-5": "openai_gpt_5_result",
"anthropic/claude-sonnet-4.5": "anthropic_claude_sonnet_4_5_result" ,
"qwen/qwen3-coder": "qwen_qwen3_coder_result"
}

In [24]:
# -------- Generate code for every project of each model --------

print("Generating code for first 10 items of each model's projects...\n")
per_model_generated_code = {}

def code_items_varname(slug: str) -> str:
    return slug.lower().replace("/", "_").replace("-", "_").replace(".", "_") + "_code_items"


for model_id, varname in MODEL_TO_RESULTVAR.items():
    result = globals().get(varname)
    if not result or not result.get("items"):
        print(f"Skipping {model_id}: no items found in `{varname}`")
        continue

    projects = result["items"][:]
    print(f"\n===== {model_id}: generating and attaching code for {len(projects)} projects =====")

    items_with_code = []
    for idx, proj in enumerate(projects, 1):
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT_CODEGEN},
            {"role": "user", "content": "Study these two short code examples and copy their structure (CLI, dataset policy, seeds, TEST_PASS contract)."},
            {"role": "assistant", "content": FEWSHOTS_CODE},
            {"role": "user", "content": build_code_task(proj, memory)},
        ]
        raw, latency = call_openrouter_model_code(model_id, messages, temperature=0.2, max_tokens=7000)
        code = extract_code_block(raw)

        item = {
            "title": proj["title"],
            "description": proj["description"],
            "code": code
        }
        items_with_code.append(item)

        # show the full code (no truncation)
        print(f"\n--- {model_id} • Project {idx}: {proj['title']} --- Latency: {latency:.2f}s ---\n")
        print_long(code)  # full code printed

    # put per-model list into a dedicated variable
    var_codes = code_items_varname(model_id)  # e.g., openai_gpt_5_code_items
    globals()[var_codes] = items_with_code

    # also store inside the original result dict under 'items_with_code' for convenience
    result["items_with_code"] = items_with_code

    # pretty JSON view of the per-model list
    print(f"\n>>> {model_id} • JSON with title, description, code:")
    print_long(json.dumps(items_with_code, indent=2, ensure_ascii=False))

Generating code for first 10 items of each model's projects...


===== anthropic/claude-sonnet-4.5: generating and attaching code for 10 projects =====

--- anthropic/claude-sonnet-4.5 • Project 1: Digits SVM Classifier --- Latency: 8.21s ---

import argparse
import sys
import random
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def main():
    p = argparse.ArgumentParser(description="Train a linear SVM on sklearn digits dataset and print test accuracy.")
    p.add_argument("--test-size", type=float, default=0.2, help="Test set fraction (default: 0.2).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    # Set seeds in main
    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        torch.manual_seed(

In [25]:
print(json.dumps(anthropic_claude_sonnet_4_5_result["items_with_code"], indent=2, ensure_ascii=False))

[
  {
    "title": "Digits SVM Classifier",
    "description": "Load sklearn's digits dataset, train a linear SVM, and print test accuracy. Print TEST_PASS if accuracy ≥ 0.95.",
    "code": "import argparse\nimport sys\nimport random\nimport numpy as np\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import LinearSVC\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.pipeline import Pipeline\n\ndef main():\n    p = argparse.ArgumentParser(description=\"Train a linear SVM on sklearn digits dataset and print test accuracy.\")\n    p.add_argument(\"--test-size\", type=float, default=0.2, help=\"Test set fraction (default: 0.2).\")\n    p.add_argument(\"--seed\", type=int, default=42, help=\"Random seed (default: 42).\")\n    args = p.parse_args()\n\n    # Set seeds in main\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    try:\n        import torch\n        torch.manual_seed(args.seed)\n    except

**Note:** The project descriptions and code are now better structured and more correct than before, which makes them acceptable for fine-tuning smaller models.