In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Test notebook for generating sample projects (description + code) in JSON format with different models (the runs below use N = 50).

## A. Generate project descriptions and titles with multiple models

### 1. Set up the model list and the OpenRouter API key

In [None]:
!pip -q install requests jsonschema

In [29]:
import os, json, re, textwrap, datetime, time
import requests
from jsonschema import Draft7Validator

# Read the key from the environment instead of hard-coding a secret in the notebook.
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")
MODELS = [
    #"openai/gpt-4.1-mini",
    "anthropic/claude-sonnet-4.5",
    "qwen/qwen3-coder",
]

### 2. Set the variables

In [27]:
SYSTEM_PROMPT = """
You are a meticulous AI project designer.
Your job is to produce concise, implementable AI mini-project ideas that can be turned into runnable Python scripts.

Dataset policy (VERY IMPORTANT):
- You must prefer ONLY these real datasets when proposing projects:
  - sklearn: iris, digits, wine, breast_cancer, diabetes
  - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
  - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
- Do NOT propose or mention 20 Newsgroups, fetch_20newsgroups, or any "newsgroups"/"news group" variant.
- If the project is NOT naturally covered by the whitelist (e.g. NLP, audio, recommendation, time-series), you MUST say in the description:
  - “generate 200–300 synthetic … samples” (text / audio-like / tabular / time-series)
  - Keep it clearly offline and small.
- You may mention standard AI datasets (MNIST, FashionMNIST, CIFAR10) even if they need a download, BUT you must phrase the description so the script can fall back to a small synthetic dataset if the download is not available.

Metrics & acceptance:
- Every project idea must propose an acceptance/check that is realistic for the dataset + model.
- If the code will FALL BACK to synthetic/FakeData, the acceptance must also FALL BACK to an easier threshold.
- Use these safe ranges:
  • iris (classification): accuracy ≥ 0.90
  • wine (classification): accuracy ≥ 0.90–0.92
  • breast_cancer (classification): accuracy ≥ 0.90–0.94
  • digits + simple model (logreg / linear SVM): accuracy ≥ 0.90–0.93 (not 0.98)
  • diabetes (regression): R² ≥ 0.35–0.45
  • classic synthetic classifiers (make_moons, make_circles, make_blobs): accuracy ≥ 0.85–0.90, or silhouette ≥ 0.5 for blobs
  • PCA / plotting / KMeans on images: acceptance = “file exists and non-empty” or “score in easy range”

- If the task says “use synthetic / generate N samples”: set accuracy to 0.60–0.75 or R² to 0.25–0.35.
- Do NOT demand SOTA or long training (no 0.99+, no 1e-4 MAE) for mini-projects.
- If the dataset may not be available offline (e.g. Fashion-MNIST, MNIST, Reuters), explicitly tell the code generator:
  “If real dataset not available → generate synthetic data → use lower threshold.”

Rules:
- Output must be valid JSON ONLY (no extra text).
- Each item has exactly two keys: "title" and "description".
- Titles are short and specific (≤ 6 words).
- Descriptions are 1–2 sentences, concrete, and implementable offline in 20–60 minutes.
- Prefer single-file, single-metric projects with tiny data and fast runtime.
- Avoid duplicate or near-duplicate ideas.
- Prefer standard Python libs or widely used ML libs (numpy, pandas, scikit-learn, PyTorch, TensorFlow, OpenCV).
- No external downloads; use built-in toy datasets (e.g., sklearn iris/digits) or tiny synthetic data.
""".strip()


In [5]:
FEW_SHOTS = """
{"title":"Iris KNN Classifier",
"description":"Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9."}
{"title":"Synthetic Text Sentiment",
"description":"Create 200 short synthetic sentences labeled positive or negative, vectorize with CountVectorizer, train a LogisticRegression, and print accuracy; print TEST_PASS if accuracy ≥ 0.7."}
""".strip()


### 3. A function for describing the task

In [6]:
import textwrap

def build_task(n=10):
    return textwrap.dedent(f"""
    Task: Generate {n} distinct AI mini-project ideas.

    Constraints:
    - Return a JSON array of length {n}.
    - Each item: object with exactly "title" (string) and "description" (string).
    - No comments, no prose outside JSON.

    Scope & Simplicity:
    - Each project is doable offline in 20–60 minutes on CPU.
    - Single-file mindset: one clear goal, one primary metric or artifact.
    - Keep dependencies minimal (numpy/pandas/sklearn/torch/tf/opencv only).
    - Mention one artifact or metric (png, accuracy, inertia, silhouette, TEST_PASS).

    Dataset whitelist (must follow):
    - sklearn: iris, digits, wine, breast_cancer, diabetes
    - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
    - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
    - Do NOT use 20 Newsgroups or fetch_20newsgroups.

    For NLP / audio / task-specific topics:
    - Explicitly say: “generate 200–300 synthetic <domain> samples” so the code agent knows to build data in-code.

    Description style:
    - Titles ≤ 6 words, specific.
    - Descriptions are 1–2 sentences with concrete I/O hints (flags, paths, outputs).
    - Include at least one quick validation (e.g., accuracy threshold, file existence, non-empty output).

    Diversity (within the allowed areas):
    - Avoid repeating the same idea or trivial variants.

    Follow the style of these examples without repeating them:
    {textwrap.indent(FEW_SHOTS, '    ').lstrip()}

    Now produce the JSON array of {n} items.
    """).strip()


### 4. OpenRouter call helper

In [7]:
# Helper to make a safe Python variable name from a slug
def varname_from_slug(slug: str) -> str:
    name = slug.lower().replace("/", "_").replace("-", "_").replace(".", "_")
    return f"{name}_result"
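
A quick check of what the helper produces can be sketched as below. The third slug is hypothetical and only illustrates an edge case: a slug that starts with a digit yields a name that is not a valid identifier, so it can still be stored via `globals()[name]` but cannot be typed as a bare variable name.

```python
def varname_from_slug(slug: str) -> str:
    # Same logic as the helper above, repeated so this cell runs on its own.
    name = slug.lower().replace("/", "_").replace("-", "_").replace(".", "_")
    return f"{name}_result"

slugs = [
    "anthropic/claude-sonnet-4.5",
    "qwen/qwen3-coder",
    "3rdparty/demo-model",  # hypothetical slug, not a real model
]
for slug in slugs:
    name = varname_from_slug(slug)
    print(f"{slug} -> {name} | valid identifier: {name.isidentifier()}")
```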

In [8]:
# Generic OpenRouter caller taking model_id
def call_openrouter_model(model_id, messages, temperature=0.3, top_p=0.9, max_tokens=6000):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://colab.research.google.com/",
        "X-Title": "Multi-Model Project Generator",
    }
    payload = {
        "model": model_id,
        "messages": messages,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
    }

    # Pin Qwen models to a single provider (no fallbacks)
    if model_id.startswith("qwen/"):
        payload["provider"] = {
            "only": ["atlas-cloud/fp8"],
            "allow_fallbacks": False,
        }

    t0 = time.time()
    r = requests.post(url, headers=headers, json=payload, timeout=120)
    latency = time.time() - t0
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
    return content, latency
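
The caller above indexes straight into `choices[0]`, which raises a bare `KeyError` or `IndexError` if the body has a different shape. A minimal sketch of a more defensive lookup follows. The helper name `first_message_content` is hypothetical, and the top-level `error` object is an assumption about how a failed request might look, not something shown in this notebook.

```python
def first_message_content(body: dict) -> str:
    """Return the assistant text from a chat-completions style body, or raise ValueError."""
    if "error" in body:  # assumed error shape: {"error": {"message": ...}}
        err = body["error"]
        message = err.get("message", err) if isinstance(err, dict) else err
        raise ValueError(f"API returned an error: {message}")
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    content = (choices[0].get("message") or {}).get("content")
    if not content:
        raise ValueError("first choice has empty content")
    return content

ok_body = {"choices": [{"message": {"role": "assistant", "content": "[]"}}]}
err_body = {"error": {"code": 429, "message": "rate limited"}}

print(first_message_content(ok_body))
try:
    first_message_content(err_body)
except ValueError as exc:
    print("handled:", exc)
```

With this in place, the last two lines of the caller could become `content = first_message_content(r.json())`.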

In [9]:
# Validator for title/description array
from jsonschema import Draft7Validator
def validate_items(arr, N):
    ITEM_SCHEMA = {
        "type":"object",
        "required":["title","description"],
        "properties":{
            "title":{"type":"string","minLength":3, "maxLength":100},
            "description":{"type":"string","minLength":20, "maxLength":600}
        },
        "additionalProperties": False
    }
    ARRAY_SCHEMA = {"type":"array","items":ITEM_SCHEMA, "minItems":N, "maxItems":N}
    errs = [e.message for e in Draft7Validator(ARRAY_SCHEMA).iter_errors(arr)]
    titles = [ (x.get("title") or "").strip().lower() for x in arr ]
    if len(set(titles)) != len(titles):
        errs.append("Duplicate titles detected.")
    return errs
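
The validator only flags titles that are identical after lowercasing, while the system prompt also asks the model to avoid near-duplicates. One possible way to surface those is a similarity check with the standard library, sketched below. The function name and the 0.85 threshold are assumptions chosen for illustration, not part of the original pipeline.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_titles(items, threshold=0.85):
    """Return (title_a, title_b, ratio) for title pairs at or above the similarity threshold."""
    titles = [(it.get("title") or "").strip() for it in items]
    pairs = []
    for a, b in combinations(titles, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

sample = [
    {"title": "Iris KNN Classifier", "description": "..."},
    {"title": "Iris k-NN Classifier", "description": "..."},
    {"title": "Diabetes Ridge Regression", "description": "..."},
]
flagged = near_duplicate_titles(sample)
for a, b, ratio in flagged:
    print(f"possible duplicate: {a!r} vs {b!r} (ratio={ratio})")
```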

In [10]:
# Extract the first JSON array from a model reply (strips an optional code fence first)
import re, json, time, requests
def extract_json_array(text: str):
    m = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE)
    if m:
        text = m.group(1).strip()
    start = text.find('[')
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == '[': depth += 1
        elif ch == ']':
            depth -= 1
            if depth == 0:
                return text[start:i+1]
    return None
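
The depth counter above treats every `[` and `]` as structural, so a `]` inside a string value closes the array early and the later `json.loads` call fails. A sketch of an alternative that lets the JSON parser find the end of the array is shown below. Note that this hypothetical variant returns the parsed list rather than a substring, so it would replace both `extract_json_array` and the `json.loads` step.

```python
import json

def extract_json_array_decoder(text: str):
    """Return the first parseable JSON array found in text as a list, or None."""
    decoder = json.JSONDecoder()
    pos = text.find("[")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores trailing text.
            value, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        pos = text.find("[", pos + 1)
    return None

reply = 'Here you go:\n[{"title": "Edge ] Case", "description": "Brackets [inside] strings."}]\nDone.'
items = extract_json_array_decoder(reply)
print(items)
```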

In [35]:
# Build a single shared message set
N = 50
task = build_task(N) + "\n\nReturn a raw JSON array only — no prose, no code fences, no markdown."
messages = [
    {"role":"system","content": SYSTEM_PROMPT},
    {"role":"user","content": task}
]

# Loop models and create a separate variable per model with the results
for model_id in MODELS:
    print(f"\n===== {model_id} =====")
    varname = varname_from_slug(model_id)
    try:
        raw, secs = call_openrouter_model(model_id, messages, temperature=0.2)
        json_str = extract_json_array(raw)
        if not json_str:
            globals()[varname] = {
                "raw": raw, "json_str": None, "items": None,
                "errors": ["No JSON array found"], "latency": secs
            }
            print(f"❌ No JSON array found | {secs:.1f}s")
            continue

        items = json.loads(json_str)
        errors = validate_items(items, N)

        globals()[varname] = {
            "raw": raw, "json_str": json_str, "items": items,
            "errors": errors, "latency": secs
        }

        if errors:
            print(f"⚠️ Parsed but validation errors ({len(errors)}) | {secs:.1f}s")
            for e in errors[:5]:
                print(" -", e)
        else:
            print(f"✅ Valid JSON ({len(items)} items) | {secs:.1f}s")
            print(json.dumps(items[:2], indent=2, ensure_ascii=False))
    except Exception as e:
        globals()[varname] = {
            "raw": None, "json_str": None, "items": None,
            "errors": [str(e)], "latency": None
        }
        print("❌ Exception:", e)


===== anthropic/claude-sonnet-4.5 =====
✅ Valid JSON (50 items) | 44.2s
[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.90."
  },
  {
    "title": "Digits Logistic Regression",
    "description": "Load sklearn digits, flatten images, train LogisticRegression with max_iter=1000, print test accuracy. Print TEST_PASS if accuracy ≥ 0.90."
  }
]

===== qwen/qwen3-coder =====
✅ Valid JSON (50 items) | 38.4s
[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9."
  },
  {
    "title": "Wine Logistic Regression",
    "description": "Use sklearn wine dataset, scale features, train logistic regression, and report test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  }
]


In [14]:
for model_id in MODELS:
    print(" -", varname_from_slug(model_id))

 - anthropic_claude_sonnet_4_5_result
 - qwen_qwen3_coder_result


### All the results for the models used are below:

In [36]:
print(json.dumps(anthropic_claude_sonnet_4_5_result['items'], indent=2, ensure_ascii=False))

[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.90."
  },
  {
    "title": "Digits Logistic Regression",
    "description": "Load sklearn digits, flatten images, train LogisticRegression with max_iter=1000, print test accuracy. Print TEST_PASS if accuracy ≥ 0.90."
  },
  {
    "title": "Wine Random Forest",
    "description": "Load sklearn wine dataset, train a RandomForestClassifier with 50 trees, print test accuracy. Print TEST_PASS if accuracy ≥ 0.90."
  },
  {
    "title": "Breast Cancer SVM",
    "description": "Load sklearn breast_cancer, train a linear SVM (SVC kernel='linear'), print test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  },
  {
    "title": "Diabetes Ridge Regression",
    "description": "Load sklearn diabetes, train Ridge regression (alpha=1.0), compute test R². Print TEST_PASS if R² ≥ 0.35."
  },
  {
    "tit

In [37]:
print(json.dumps(qwen_qwen3_coder_result['items'], indent=2, ensure_ascii=False))

[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9."
  },
  {
    "title": "Wine Logistic Regression",
    "description": "Use sklearn wine dataset, scale features, train logistic regression, and report test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  },
  {
    "title": "Breast Cancer SVM",
    "description": "Train a linear SVM on sklearn breast_cancer dataset, report test accuracy. Print TEST_PASS if accuracy ≥ 0.93."
  },
  {
    "title": "Digits Linear Classifier",
    "description": "Classify sklearn digits using LogisticRegression, report test accuracy. Print TEST_PASS if accuracy ≥ 0.92."
  },
  {
    "title": "Diabetes Ridge Regression",
    "description": "Fit Ridge regression on sklearn diabetes dataset, compute R² score. Print TEST_PASS if R² ≥ 0.4."
  },
  {
    "title": "Moons Decision Tree",
    "description": "Gene

## B. Generate code for each project produced in part A

In [17]:
# -------- External memory (compact, curated) --------
memory = {
  "style_guide": [
    "Single-file script with `if __name__ == '__main__':` entrypoint.",
    "Use argparse with clear --help and sensible defaults.",
    "Prefer standard library datasets (sklearn, torchvision, keras).",
    "Attempt auto-download/cache with short timeout; if unavailable, fallback to a tiny structured synthetic dataset.",
    "Fix randomness: set seeds for random, numpy; torch if used; run on CPU by default.",
    "Validate inputs (paths, columns, image loads) and fail gracefully with one-line reason.",
    "Keep runtime < 2 minutes (few epochs, small subsets).",
    "Print `TEST_PASS` on success; otherwise `TEST_FAIL: <reason>`."
  ],
  "lessons": [
    "When standardizing features use sklearn.pipeline.Pipeline to avoid leakage.",
    "For OpenCV Canny, expose --threshold1 and --threshold2; convert to grayscale before edges.",
    "For CSV tasks, explicitly validate required columns; show a friendly error if missing.",
    "For plotting, save figures to disk and plt.close() to avoid backend issues.",
    "One file only. Return exactly ONE ```python block. No extra prose.",
    "CLI + help. Use a single 'argparse.ArgumentParser()'. All help strings are single-line (no embedded newlines).",
    "Seeds in 'main()'. Expose '--seed' and set seeds for random, numpy, and torch (if present) inside main().",
    "Data access policy. Only use library datasets when --allow-download is passed. Otherwise do not download; use a robust fallback (sklearn tabular, torchvision FakeData, PIL shapes, etc.). If using 20NG, call with download_if_missing=False unless allowed.",
    "Task–dataset match. Choose datasets that match the task (e.g., do not use 20 Newsgroups for spam/ham).",
    "CV safety. For OpenCV: 1. convert to grayscale if needed. 2. ensure input to detectors is uint8 (cv2.convertScaleAbs if needed). 3. for Haar, check face_cascade.empty() == False or fail.",
    "Acceptance contract. Implement explicit pass/fail checks (files exist, metrics ≥ thresholds, non-empty edge map, etc.). Print TEST_PASS only when all conditions hold; otherwise TEST_FAIL: <reason> and sys.exit(1).",
    "No broken syntax. Never split identifiers across lines. Never break f-strings or string literals across lines.",
    "End marker. Append '# END_OF_SCRIPT' as the last line of the file."
  ],
  "snippets": [
    # seed block to embed in each script
    "import random, numpy as np\nrandom.seed(42)\nnp.random.seed(42)\ntry:\n    import torch\n    torch.manual_seed(42)\nexcept Exception:\n    pass"
  ]
}

In [34]:
SYSTEM_PROMPT_CODEGEN = """
You are a meticulous senior Python engineer who writes production-quality, runnable scripts.
Priorities: (1) correctness, (2) reproducibility, (3) clarity, (4) speed.

Formatting & Output Contract:
- Return ONE code block only: ```python ...```
- The code must be a single file with `if __name__ == "__main__":` entrypoint.
- Provide a clear CLI via argparse and `--help`. All help strings must be single-line (no embedded newlines).
- Do not print explanations. Do not include markdown outside the single code block.
- Append `# END_OF_SCRIPT` as the final line of the file.

Dataset whitelist (MUST follow):
- You may directly load/use ONLY these real datasets:
  - sklearn: iris, digits, wine, breast_cancer, diabetes
  - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
  - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
- Do NOT use 20 Newsgroups, fetch_20newsgroups, or any "newsgroups" variant.
- If the project description says to “generate 200–300 synthetic … samples”, you MUST implement that synthetic dataset in code (e.g. build 200 labeled sentences, or 300 tabular rows, or 200 (x,y) pairs).
- If the project mentions an allowed dataset that might require download (e.g. MNIST, FashionMNIST, CIFAR10), first TRY to load it, and if it fails or `--allow-download` was not passed, fall back to a synthetic dataset that matches the task.

Behavioral Rules:
- Expose `--seed` and set seeds **inside `main()`** for `random`, `numpy`, and `torch` (if available); run on CPU by default.
- Validate inputs (paths, columns, image loads, flags) and fail gracefully with a concise message.
- For OpenCV tasks: convert to grayscale when needed; ensure `uint8` input; for Haar cascades, ensure `face_cascade.empty() == False` or fail.
- Keep the code minimal, readable, and fully runnable in a fresh Colab.
- Never split identifiers across lines; never break string literals, f-strings, or comments across lines.
  - Comments must be on one line each (e.g. `# custom text prediction`), not split into two lines.
- Implement explicit acceptance checks tied to the task (files exist, metrics ≥ thresholds, non-empty edge map, etc.).
- Print `TEST_PASS` only when all acceptance conditions hold; otherwise print `TEST_FAIL: <reason>` and `sys.exit(1)`.

Self-Check Before Returning:
- argparse help strings are single-line.
- Seeds are applied in `main()` for random/numpy/torch.
- No downloads are attempted unless `--allow-download` was passed.
- Dataset matches the task semantics.
- Dataset name is NOT `20newsgroups` / `fetch_20newsgroups` / “newsgroups”.
- Acceptance checks implemented; `TEST_PASS`/`TEST_FAIL` present.
- File ends with `# END_OF_SCRIPT`.
- Code parses without SyntaxError and comments are not broken across lines.
- If using LSTM on synthetic TS: TEST_PASS if MAE ≤ 0.35.
- If LSTM import fails: fall back to LinearRegression on same features with MAE ≤ 0.50.
- If using real CIFAR10: TEST_PASS if accuracy ≥ 0.55.
- If using FakeData: TEST_PASS if accuracy ≥ 0.25.
- If neither dataset is available: print TEST_FAIL: dataset unavailable.
""".strip()


In [19]:
FEWSHOTS_CODE = """
Example A (tabular classification with sklearn iris -> fallback synthetic; seeds-in-main; single-line help; acceptance checks)
```python
import argparse, sys
import random, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def load_iris_or_synthetic(seed=42):
    try:
        from sklearn.datasets import load_iris  # no download required
        data = load_iris()
        X, y, used = data.data, data.target, "iris"
    except Exception:
        rng = np.random.default_rng(seed)
        n = 210
        c = rng.integers(0, 3, size=n)
        X = rng.normal(0, 1, size=(n, 4)) + c[:, None] * 1.5
        y, used = c, "synthetic"
    return X, y, used

def main():
    p = argparse.ArgumentParser(description="Iris (no-download) or synthetic fallback; seeds set in main; explicit acceptance.")
    p.add_argument("--test-size", type=float, default=0.2, help="Test set fraction (default: 0.2).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    # seeds in main
    random.seed(args.seed); np.random.seed(args.seed)
    try:
        import torch; torch.manual_seed(args.seed)
    except Exception:
        pass

    X, y, used = load_iris_or_synthetic(args.seed)
    if X is None or y is None or len(X) == 0:
        print("TEST_FAIL: dataset not available"); sys.exit(1)

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=args.test_size, random_state=args.seed, stratify=y)
    clf = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=300))])
    clf.fit(Xtr, ytr)
    acc = clf.score(Xte, yte)
    print(f"dataset={used} accuracy={acc:.3f}")
    # acceptance: stricter if iris, looser if synthetic
    if acc >= (0.85 if used == "iris" else 0.70):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: accuracy below threshold"); sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT
```

Example B (vision MNIST with opt-in download -> fallback FakeData; uint8 safety; seeds-in-main; acceptance checks)
```python
import argparse, sys, os
import random, numpy as np

def load_mnist_or_fakedata(max_train=2000, max_test=500, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        # only download if explicitly allowed
        train = datasets.MNIST(root="./data", train=True, download=bool(allow_download), transform=tfm)
        test  = datasets.MNIST(root="./data", train=False, download=bool(allow_download), transform=tfm)
        # if dataset objects are empty because cache missing and download disabled, trigger fallback
        if len(train) == 0 or len(test) == 0:
            raise RuntimeError("MNIST cache missing and download disabled")
        train = torch.utils.data.Subset(train, list(range(min(len(train), max_train))))
        test  = torch.utils.data.Subset(test,  list(range(min(len(test),  max_test))))
        return train, test, True
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = FakeData(size=max_train, image_size=(1, 28, 28), num_classes=10, transform=tfm)
        test  = FakeData(size=max_test,  image_size=(1, 28, 28), num_classes=10, transform=tfm)
        return train, test, False

def main():
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader

    p = argparse.ArgumentParser(description="MNIST (opt-in download) or FakeData fallback; seeds in main; explicit acceptance.")
    p.add_argument("--epochs", type=int, default=1, help="Training epochs (default: 1).")
    p.add_argument("--batch", type=int, default=128, help="Batch size (default: 128).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Permit MNIST download if not cached.")
    args = p.parse_args()

    # seeds in main
    random.seed(args.seed); np.random.seed(args.seed); torch.manual_seed(args.seed)

    train_ds, test_ds, real = load_mnist_or_fakedata(seed=args.seed, allow_download=args.allow_download)
    train = DataLoader(train_ds, batch_size=args.batch, shuffle=True)
    test  = DataLoader(test_ds,  batch_size=args.batch, shuffle=False)

    class TinyCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, 1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, 1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Flatten(),
                nn.Linear(32*5*5, 64), nn.ReLU(),
                nn.Linear(64, 10)
            )
        def forward(self, x): return self.net(x)

    model = TinyCNN()
    opt = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for _ in range(args.epochs):
        for xb, yb in train:
            # ensure uint8 -> float32 is handled by ToTensor; just train
            opt.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward(); opt.step()

    # eval + acceptance
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in test:
            pred = model(xb).argmax(1)
            correct += (pred == yb).sum().item()
            total += yb.numel()
    acc = correct / max(total, 1)
    print(f"acc={acc:.3f} dataset={'mnist' if real else 'fake'}")
    # stricter if real, looser if fake
    if acc >= (0.85 if real else 0.20):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: accuracy below threshold"); sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT
```
""".strip()

In [20]:
import os, re, json, requests

# --- helpers ---
def extract_code_block(text: str) -> str:
    """
    Return the first code block content if present; otherwise return the whole text.
    Prefers ```python ... ``` but accepts ``` ... ```.
    """
    m = re.search(r"```(?:python)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE)
    return m.group(1) if m else text

def print_long(s: str, width: int = 4000):
    """Print long strings without Colab truncation, in chunks."""
    for i in range(0, len(s), width):
        print(s[i:i+width])
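
The code-generation prompt asks for scripts that parse cleanly, contain the `TEST_PASS` contract, and end with `# END_OF_SCRIPT`, but nothing shown so far checks the returned text. A minimal static check might look like the sketch below. `check_generated_script` is a hypothetical helper, and it only inspects the text without running the script.

```python
import ast

def check_generated_script(code: str):
    """Return a list of problems found in a generated script (empty list means none found)."""
    problems = []
    try:
        ast.parse(code)
    except SyntaxError as exc:
        problems.append(f"SyntaxError: {exc.msg} (line {exc.lineno})")
    non_empty = [ln for ln in code.splitlines() if ln.strip()]
    if not non_empty or non_empty[-1].strip() != "# END_OF_SCRIPT":
        problems.append("missing '# END_OF_SCRIPT' as the final line")
    if "TEST_PASS" not in code:
        problems.append("no TEST_PASS marker found")
    return problems

good = 'print("TEST_PASS")\n# END_OF_SCRIPT\n'
bad = "def main(:\n    pass\n"

print(check_generated_script(good))
print(check_generated_script(bad))
```

Running it on `extract_code_block(raw)` before storing the item would let the loop flag or regenerate scripts that were cut off by `max_tokens`.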

In [21]:
# -------- Task builder (per project) --------
def build_code_task(project, memory):
    guide = "\n- ".join(memory["style_guide"])
    lessons = "\n- ".join(memory["lessons"][-6:])
    seed_block = memory["snippets"][0]

    return f"""
    PROJECT TITLE:
    {project['title']}

    PROJECT DESCRIPTION:
    {project['description']}

    Follow this style guide:
    - {guide}

    Incorporate recent lessons:
    - {lessons}

    Hard guardrails (must follow):
    - Return ONE code block only: ```python ...``` (no extra prose).
    - Single file with `if __name__ == "__main__":` entrypoint.
    - Use argparse; **all help strings are single-line** (no embedded newlines).
    - Expose `--seed` and set seeds **inside `main()`** for random, numpy, and torch (if available).
    - Use ONLY the whitelist datasets when a real dataset is required:
      - sklearn: iris, digits, wine, breast_cancer, diabetes
      - sklearn generators: make_classification, make_regression, make_blobs, make_moons, make_circles
      - torchvision: MNIST, FashionMNIST, CIFAR10, FakeData, ImageFolder
    - If the description asks for NLP, audio, or task-specific data that is NOT in the whitelist, generate 200–300 synthetic samples in code (labelled if classification).
    - Do NOT use 20 Newsgroups or fetch_20newsgroups.
    - Choose datasets that **match the task semantics** (e.g., do NOT use 20 Newsgroups for spam/ham).
    - For OpenCV tasks: convert to grayscale when needed; ensure `uint8` input (use `cv2.convertScaleAbs` if necessary); for Haar cascades verify `face_cascade.empty()==False` or fail.
    - Never split identifiers across lines; never break string literals or f-strings across lines.

    Embed this seed block near the top of the script:
    ```python
    {seed_block}
    ```
    """
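
Note that `memory["lessons"][-6:]` forwards only the last six lessons. With the 13 entries defined in `memory` above, the first seven (the Pipeline, Canny, CSV, plotting, one-file, CLI-help and seeds-in-main lessons) never reach the prompt. The sketch below uses a stand-in list to show the effect of the slice.

```python
# Stand-in for the 13 lesson strings in memory["lessons"]
lessons = [f"lesson {i}" for i in range(1, 14)]

recent = lessons[-6:]   # what build_code_task sends
dropped = lessons[:-6]  # what is silently left out

rendered = "- " + "\n- ".join(recent)
print(rendered)
print("dropped:", dropped)
```

If every lesson should be included, joining `memory["lessons"]` without the slice is the simplest change, at the cost of a longer prompt.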


In [22]:
# -------- OpenRouter caller --> (code) --------
import time

def call_openrouter_model_code(model_id, messages, temperature=0.2, top_p=0.9, max_tokens=6000):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://colab.research.google.com/",
        "X-Title": "Multi-Model Project Generator",
    }
    payload = {
        "model": model_id,
        "messages": messages,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
    }

    # Pin Qwen models to a single provider (no fallbacks)
    if model_id.startswith("qwen/"):
        payload["provider"] = {
            "only": ["novita/fp8"],
            "allow_fallbacks": False,
        }

    time.sleep(2.0)
    t0 = time.time()
    r = requests.post(url, headers=headers, json=payload, timeout=120)
    latency = time.time() - t0
    r.raise_for_status()
    code = r.json()["choices"][0]["message"]["content"]
    return code, latency
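
The caller pauses two seconds before each request but does not retry, so a single failed request aborts a long generation loop. A small retry wrapper with exponential backoff is sketched below. `call_with_retries` and its defaults are assumptions for illustration, and the `sleep` parameter exists only so the example can run without real delays.

```python
import time

def call_with_retries(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt and retry, up to `attempts` tries."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_exc

# Demonstration with a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the generation loop this could wrap the request as `call_with_retries(lambda: call_openrouter_model_code(model_id, messages))`.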


In [43]:
# -------- Map: model slug -> per-model dict variable name --------

MODEL_TO_RESULTVAR = {
    # "openai/gpt-5": "openai_gpt_5_result",
    "anthropic/claude-sonnet-4.5": "anthropic_claude_sonnet_4_5_result",
    # "qwen/qwen3-coder": "qwen_qwen3_coder_result",
}

In [44]:
# -------- Generate code for every project in the selected result --------

print("Generating code for first 10 items of each model's projects...\n")
per_model_generated_code = {}

def code_items_varname(slug: str) -> str:
    return slug.lower().replace("/", "_").replace("-", "_").replace(".", "_") + "_code_items"


for model_id, varname in MODEL_TO_RESULTVAR.items():
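    # NOTE: project ideas are read from the Qwen result for every model_id;
    # use globals().get(varname) instead to pair each model with its own ideas.
    result = qwen_qwen3_coder_result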
    if not result or not result.get("items"):
        print(f"Skipping {model_id}: no items found in `{varname}`")
        continue

    projects = result["items"][:]
    print(f"\n===== {model_id}: generating and attaching code for {len(projects)} projects =====")

    items_with_code = []
    for idx, proj in enumerate(projects, 1):
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT_CODEGEN},
            {"role": "user", "content": "Study these two short code examples and copy their structure (CLI, dataset policy, seeds, TEST_PASS contract)."},
            {"role": "assistant", "content": FEWSHOTS_CODE},
            {"role": "user", "content": build_code_task(proj, memory)},
        ]
        raw, latency = call_openrouter_model_code(model_id, messages, temperature=0.2, max_tokens=7000)
        code = extract_code_block(raw)

        item = {
            "title": proj["title"],
            "description": proj["description"],
            "code": code
        }
        items_with_code.append(item)

        # show the full code (no truncation)
        print(f"\n--- {model_id} • Project {idx}: {proj['title']} --- Latency: {latency:.2f}s ---\n")
        print_long(code)  # full code printed

    # put per-model list into a dedicated variable
    var_codes = code_items_varname(model_id)  # e.g., openai_gpt_5_code_items
    globals()[var_codes] = items_with_code

    # also store inside the original result dict under 'items_with_code' for convenience
    result["items_with_code"] = items_with_code

    # pretty JSON view of the per-model list
    print(f"\n>>> {model_id} • JSON with title, description, code:")
    print_long(json.dumps(items_with_code, indent=2, ensure_ascii=False))

Generating code for first 10 items of each model's projects...


===== anthropic/claude-sonnet-4.5: generating and attaching code for 50 projects =====

--- anthropic/claude-sonnet-4.5 • Project 1: Iris KNN Classifier --- Latency: 8.85s ---

import argparse
import sys
import random
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def main():
    p = argparse.ArgumentParser(description="Iris KNN classifier with k=3; prints TEST_PASS if accuracy >= 0.9.")
    p.add_argument("--test-size", type=float, default=0.2, help="Test set fraction (default: 0.2).")
    p.add_argument("--k", type=int, default=3, help="Number of neighbors for KNN (default: 3).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    # seeds in main
    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        to

In [53]:
print(json.dumps(result["items_with_code"], indent=2, ensure_ascii=False))

[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9.",
    "code": "import argparse\nimport sys\nimport random\nimport numpy as np\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\n\ndef main():\n    p = argparse.ArgumentParser(description=\"Iris KNN classifier with k=3; prints TEST_PASS if accuracy >= 0.9.\")\n    p.add_argument(\"--test-size\", type=float, default=0.2, help=\"Test set fraction (default: 0.2).\")\n    p.add_argument(\"--k\", type=int, default=3, help=\"Number of neighbors for KNN (default: 3).\")\n    p.add_argument(\"--seed\", type=int, default=42, help=\"Random seed (default: 42).\")\n    args = p.parse_args()\n\n    # seeds in main\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    try:\n        import torch

In [56]:
import json

# existing items with generated code
items = result["items_with_code"]

# the 6 fixed versions
fixed_items = [
    {
        "title": "Classification Synthetic Data",
        "description": "Generate make_classification data (n=500, n_features=4), train an SGDClassifier inside a StandardScaler pipeline, report accuracy. Print TEST_PASS if accuracy ≥ 0.85.",
        "code": """import argparse
import sys
import random
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def main():
    p = argparse.ArgumentParser(description="Generate make_classification data (n=500, n_features=4), train SGDClassifier, report accuracy.")
    p.add_argument("--n-samples", type=int, default=500, help="Number of samples (default: 500).")
    p.add_argument("--n-features", type=int, default=4, help="Number of features (default: 4).")
    p.add_argument("--test-size", type=float, default=0.2, help="Test set fraction (default: 0.2).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        torch.manual_seed(args.seed)
    except Exception:
        pass

    X, y = make_classification(
        n_samples=args.n_samples,
        n_features=args.n_features,
        n_informative=max(2, args.n_features // 2),
        n_redundant=0,
        n_clusters_per_class=1,
        random_state=args.seed
    )

    if X is None or y is None or len(X) == 0:
        print("TEST_FAIL: dataset generation failed")
        sys.exit(1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=args.test_size, random_state=args.seed, stratify=y
    )

    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("sgd", SGDClassifier(max_iter=1000, tol=1e-3, random_state=args.seed))
    ])
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"accuracy={acc:.3f}")

    if acc >= 0.85:
        print("TEST_PASS")
    else:
        print("TEST_FAIL: accuracy below 0.85")
        sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    },
    {
        "title": "Digits Confusion Heatmap",
        "description": "Train SVC on sklearn digits, compute confusion matrix, plot it with matplotlib (no seaborn), save as digits_heatmap.png. Print TEST_PASS if file exists.",
        "code": """import argparse
import sys
import random
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import os

def main():
    p = argparse.ArgumentParser(description="Train SVC on digits dataset, plot confusion matrix heatmap, save as digits_heatmap.png.")
    p.add_argument("--output", type=str, default="digits_heatmap.png", help="Output heatmap filename (default: digits_heatmap.png).")
    p.add_argument("--test-size", type=float, default=0.3, help="Test set fraction (default: 0.3).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    args = p.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        torch.manual_seed(args.seed)
    except Exception:
        pass

    try:
        data = load_digits()
        X, y = data.data, data.target
    except Exception as e:
        print(f"TEST_FAIL: failed to load digits dataset: {e}")
        sys.exit(1)

    if X is None or y is None or len(X) == 0:
        print("TEST_FAIL: digits dataset is empty")
        sys.exit(1)

    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=args.test_size, random_state=args.seed, stratify=y
    )

    clf = SVC(kernel='rbf', gamma='scale', random_state=args.seed)
    clf.fit(Xtr, ytr)
    ypred = clf.predict(Xte)

    cm = confusion_matrix(yte, ypred)

    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(cm, cmap='Blues')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title('Digits Confusion Matrix')
    plt.colorbar(im, ax=ax)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j], ha='center', va='center', color='black')
    plt.tight_layout()
    plt.savefig(args.output, dpi=100)
    plt.close()

    if not os.path.isfile(args.output):
        print(f"TEST_FAIL: output file {args.output} does not exist")
        sys.exit(1)

    print(f"Confusion matrix heatmap saved to {args.output}")
    print("TEST_PASS")

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    },
    {
        "title": "ImageFolder Histogram",
        "description": "Try to load an ImageFolder and compute average RGB histogram, else fall back to FakeData. Save plot to histogram.png. Print TEST_PASS if file exists and is non-empty.",
        "code": """import argparse
import sys
import os
import random
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def load_imagefolder_or_fakedata(root_dir, num_fake=50, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        if os.path.isdir(root_dir):
            tfm = transforms.ToTensor()
            ds = datasets.ImageFolder(root=root_dir, transform=tfm)
            if len(ds) > 0:
                return ds, True
        raise RuntimeError("ImageFolder not available or empty")
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        ds = FakeData(size=num_fake, image_size=(3, 64, 64), num_classes=2, transform=tfm)
        return ds, False

def compute_rgb_histogram(dataset, num_samples=None):
    if num_samples is None:
        num_samples = len(dataset)
    else:
        num_samples = min(num_samples, len(dataset))

    r_sum = np.zeros(256, dtype=np.float64)
    g_sum = np.zeros(256, dtype=np.float64)
    b_sum = np.zeros(256, dtype=np.float64)

    for i in range(num_samples):
        img_tensor, _ = dataset[i]
        img_np = (img_tensor.numpy() * 255).astype(np.uint8)
        r_hist, _ = np.histogram(img_np[0].flatten(), bins=256, range=(0, 256))
        g_hist, _ = np.histogram(img_np[1].flatten(), bins=256, range=(0, 256))
        b_hist, _ = np.histogram(img_np[2].flatten(), bins=256, range=(0, 256))
        r_sum += r_hist
        g_sum += g_hist
        b_sum += b_hist

    r_avg = r_sum / num_samples
    g_avg = g_sum / num_samples
    b_avg = b_sum / num_samples
    return r_avg, g_avg, b_avg

def main():
    p = argparse.ArgumentParser(description="Compute average RGB histogram from ImageFolder or FakeData and save as histogram.png.")
    p.add_argument("--root-dir", type=str, default="./imagefolder_data", help="Path to ImageFolder root directory (default: ./imagefolder_data).")
    p.add_argument("--num-fake", type=int, default=50, help="Number of FakeData images if ImageFolder unavailable (default: 50).")
    p.add_argument("--output", type=str, default="histogram.png", help="Output histogram file path (default: histogram.png).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Accepted for CLI consistency; ImageFolder and FakeData never download.")
    args = p.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        torch.manual_seed(args.seed)
    except Exception:
        pass

    dataset, is_real = load_imagefolder_or_fakedata(
        root_dir=args.root_dir,
        num_fake=args.num_fake,
        seed=args.seed,
        allow_download=args.allow_download
    )

    if dataset is None or len(dataset) == 0:
        print("TEST_FAIL: dataset not available or empty")
        sys.exit(1)

    print(f"Using {'ImageFolder' if is_real else 'FakeData'} with {len(dataset)} images")

    r_hist, g_hist, b_hist = compute_rgb_histogram(dataset)

    fig, ax = plt.subplots(figsize=(10, 6))
    bins = np.arange(256)
    ax.plot(bins, r_hist, color='red', alpha=0.7, label='Red')
    ax.plot(bins, g_hist, color='green', alpha=0.7, label='Green')
    ax.plot(bins, b_hist, color='blue', alpha=0.7, label='Blue')
    ax.set_xlabel('Pixel Value')
    ax.set_ylabel('Average Frequency')
    ax.set_title('Average RGB Histogram')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(args.output, dpi=100)
    plt.close()

    if os.path.isfile(args.output):
        file_size = os.path.getsize(args.output)
        if file_size > 0:
            print(f"Histogram saved to {args.output} ({file_size} bytes)")
            print("TEST_PASS")
        else:
            print("TEST_FAIL: output file is empty")
            sys.exit(1)
    else:
        print("TEST_FAIL: output file does not exist")
        sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    },
    {
        "title": "KMeans on CIFAR Colors",
        "description": "Load CIFAR10 pixels (or FakeData fallback), run KMeans(k=5) on RGB pixels, and print inertia. Print TEST_PASS if inertia < 2.0e5. Uses fewer images to reduce runtime.",
        "code": """import argparse
import sys
import random
import numpy as np

def load_cifar_or_fakedata(max_samples=500, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = datasets.CIFAR10(root="./data", train=True, download=bool(allow_download), transform=tfm)
        if len(train) == 0:
            raise RuntimeError("CIFAR10 cache missing and download disabled")
        subset = torch.utils.data.Subset(train, list(range(min(len(train), max_samples))))
        loader = torch.utils.data.DataLoader(subset, batch_size=max_samples, shuffle=False)
        for xb, _ in loader:
            pixels = xb.permute(0, 2, 3, 1).reshape(-1, 3).numpy()
            return pixels, True
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        fake = FakeData(size=max_samples, image_size=(3, 32, 32), num_classes=10, transform=tfm)
        loader = torch.utils.data.DataLoader(fake, batch_size=max_samples, shuffle=False)
        for xb, _ in loader:
            pixels = xb.permute(0, 2, 3, 1).reshape(-1, 3).numpy()
            return pixels, False
    return None, False

def main():
    p = argparse.ArgumentParser(description="KMeans on CIFAR10 RGB pixels (opt-in download) or FakeData fallback; seeds in main; explicit acceptance.")
    p.add_argument("--k", type=int, default=5, help="Number of clusters (default: 5).")
    p.add_argument("--max-samples", type=int, default=500, help="Max images to load (default: 500).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Permit CIFAR10 download if not cached.")
    args = p.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    try:
        import torch
        torch.manual_seed(args.seed)
    except Exception:
        pass

    pixels, real = load_cifar_or_fakedata(max_samples=args.max_samples, seed=args.seed, allow_download=args.allow_download)
    if pixels is None or len(pixels) == 0:
        print("TEST_FAIL: dataset not available")
        sys.exit(1)

    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=args.k, random_state=args.seed, n_init=10, max_iter=100)
    kmeans.fit(pixels)
    inertia = kmeans.inertia_

    print(f"dataset={'cifar10' if real else 'fake'} k={args.k} inertia={inertia:.2e}")

    if inertia < 2.0e5:
        print("TEST_PASS")
    else:
        print("TEST_FAIL: inertia >= 2.0e5")
        sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    },
    {
        "title": "MNIST Autoencoder",
        "description": "Load MNIST (or FakeData fallback), train a small autoencoder for a few epochs, reconstruct one image, and save original/reconstructed pair to autoencode.png. Print TEST_PASS if file exists.",
        "code": """import argparse
import sys
import os
import random
import numpy as np

def load_mnist_or_fakedata(max_train=2000, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = datasets.MNIST(root="./data", train=True, download=bool(allow_download), transform=tfm)
        if len(train) == 0:
            raise RuntimeError("MNIST cache missing and download disabled")
        train = torch.utils.data.Subset(train, list(range(min(len(train), max_train))))
        return train, True
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = FakeData(size=max_train, image_size=(1, 28, 28), num_classes=10, transform=tfm)
        return train, False

def main():
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt

    p = argparse.ArgumentParser(description="MNIST autoencoder (opt-in download) or FakeData fallback; saves autoencode.png.")
    p.add_argument("--epochs", type=int, default=3, help="Training epochs (default: 3).")
    p.add_argument("--batch", type=int, default=128, help="Batch size (default: 128).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Permit MNIST download if not cached.")
    p.add_argument("--output", type=str, default="autoencode.png", help="Output image path (default: autoencode.png).")
    args = p.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

    train_ds, real = load_mnist_or_fakedata(max_train=2000, seed=args.seed, allow_download=args.allow_download)
    train_loader = DataLoader(train_ds, batch_size=args.batch, shuffle=True)

    class Autoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Flatten(),
                nn.Linear(28*28, 128),
                nn.ReLU(),
                nn.Linear(128, 64),
                nn.ReLU(),
                nn.Linear(64, 32)
            )
            self.decoder = nn.Sequential(
                nn.Linear(32, 64),
                nn.ReLU(),
                nn.Linear(64, 128),
                nn.ReLU(),
                nn.Linear(128, 28*28),
                nn.Sigmoid()
            )
        def forward(self, x):
            z = self.encoder(x)
            recon = self.decoder(z)
            return recon.view(-1, 1, 28, 28)

    model = Autoencoder()
    opt = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    model.train()
    for epoch in range(args.epochs):
        total_loss = 0.0
        for xb, _ in train_loader:
            opt.zero_grad()
            recon = model(xb)
            loss = loss_fn(recon, xb)
            loss.backward()
            opt.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f"epoch={epoch+1}/{args.epochs} loss={avg_loss:.4f}")

    model.eval()
    with torch.no_grad():
        sample_x, _ = next(iter(DataLoader(train_ds, batch_size=1, shuffle=False)))
        sample_recon = model(sample_x)

    orig = sample_x[0, 0].cpu().numpy()
    recon = sample_recon[0, 0].cpu().numpy()

    fig, axes = plt.subplots(1, 2, figsize=(6, 3))
    axes[0].imshow(orig, cmap='gray')
    axes[0].set_title('Original')
    axes[0].axis('off')
    axes[1].imshow(recon, cmap='gray')
    axes[1].set_title('Reconstructed')
    axes[1].axis('off')
    plt.tight_layout()
    plt.savefig(args.output)
    plt.close()

    print(f"dataset={'mnist' if real else 'fake'} output={args.output}")

    if os.path.isfile(args.output):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: output file not created")
        sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    },
    {
        "title": "FashionMNIST VAE Latent",
        "description": "Train a small VAE on FashionMNIST (or FakeData fallback), collect 2D latent codes, and save scatter plot to latent_vae.png. Print TEST_PASS if file exists.",
        "code": """import argparse
import sys
import os
import random
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def load_fashionmnist_or_fakedata(max_train=2000, seed=42, allow_download=False):
    try:
        import torch
        from torchvision import datasets, transforms
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = datasets.FashionMNIST(root="./data", train=True, download=bool(allow_download), transform=tfm)
        if len(train) == 0:
            raise RuntimeError("FashionMNIST cache missing and download disabled")
        train = torch.utils.data.Subset(train, list(range(min(len(train), max_train))))
        return train, True
    except Exception:
        import torch
        from torchvision import transforms
        from torchvision.datasets import FakeData
        torch.manual_seed(seed)
        tfm = transforms.ToTensor()
        train = FakeData(size=max_train, image_size=(1, 28, 28), num_classes=10, transform=tfm)
        return train, False

def main():
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader

    p = argparse.ArgumentParser(description="FashionMNIST VAE latent space visualization with opt-in download or FakeData fallback.")
    p.add_argument("--epochs", type=int, default=2, help="Training epochs (default: 2).")
    p.add_argument("--batch", type=int, default=128, help="Batch size (default: 128).")
    p.add_argument("--latent-dim", type=int, default=2, help="Latent dimension (default: 2).")
    p.add_argument("--seed", type=int, default=42, help="Random seed (default: 42).")
    p.add_argument("--allow-download", action="store_true", help="Permit FashionMNIST download if not cached.")
    p.add_argument("--output", type=str, default="latent_vae.png", help="Output latent space plot filename (default: latent_vae.png).")
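    args = p.parse_args()

    # the scatter plot below reads latent columns 0 and 1, so at least 2 dims are required
    if args.latent_dim < 2:
        print("TEST_FAIL: --latent-dim must be >= 2 for the 2D scatter plot")
        sys.exit(1)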

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

    train_ds, real = load_fashionmnist_or_fakedata(max_train=2000, seed=args.seed, allow_download=args.allow_download)
    train_loader = DataLoader(train_ds, batch_size=args.batch, shuffle=True)

    class VAE(nn.Module):
        def __init__(self, latent_dim=2):
            super().__init__()
            self.latent_dim = latent_dim
            self.encoder = nn.Sequential(
                nn.Flatten(),
                nn.Linear(28*28, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU()
            )
            self.fc_mu = nn.Linear(128, latent_dim)
            self.fc_logvar = nn.Linear(128, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, 256), nn.ReLU(),
                nn.Linear(256, 28*28), nn.Sigmoid()
            )

        def encode(self, x):
            h = self.encoder(x)
            return self.fc_mu(h), self.fc_logvar(h)

        def reparameterize(self, mu, logvar):
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std

        def decode(self, z):
            return self.decoder(z).view(-1, 1, 28, 28)

        def forward(self, x):
            mu, logvar = self.encode(x)
            z = self.reparameterize(mu, logvar)
            recon = self.decode(z)
            return recon, mu, logvar

    def vae_loss(recon, x, mu, logvar):
        bce = nn.functional.binary_cross_entropy(recon, x, reduction='sum')
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return bce + kld

    model = VAE(latent_dim=args.latent_dim)
    opt = optim.Adam(model.parameters(), lr=1e-3)

    model.train()
    for epoch in range(args.epochs):
        total_loss = 0.0
        for xb, _ in train_loader:
            opt.zero_grad()
            recon, mu, logvar = model(xb)
            loss = vae_loss(recon, xb, mu, logvar)
            loss.backward()
            opt.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader.dataset)
        print(f"epoch={epoch+1}/{args.epochs} loss={avg_loss:.4f}")

    model.eval()
    latents = []
    labels = []
    with torch.no_grad():
        for xb, yb in train_loader:
            mu, _ = model.encode(xb)
            latents.append(mu.cpu().numpy())
            labels.append(yb.cpu().numpy())
    latents = np.concatenate(latents, axis=0)
    labels = np.concatenate(labels, axis=0)

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(latents[:, 0], latents[:, 1], c=labels, cmap='tab10', alpha=0.6, s=10)
    plt.colorbar(scatter, label='Class')
    plt.xlabel('Latent Dim 1')
    plt.ylabel('Latent Dim 2')
    plt.title(f"VAE Latent Space ({'FashionMNIST' if real else 'FakeData'})")
    plt.tight_layout()
    plt.savefig(args.output, dpi=100)
    plt.close()
    print(f"Saved latent space plot to {args.output}")

    if os.path.isfile(args.output):
        print("TEST_PASS")
    else:
        print("TEST_FAIL: output file not created")
        sys.exit(1)

if __name__ == "__main__":
    main()
# END_OF_SCRIPT"""
    }
]

# map each title we want to replace to its fixed item
title_to_fixed = {fi["title"]: fi for fi in fixed_items}

# replace in-place
for i, item in enumerate(items):
    t = item.get("title")
    if t in title_to_fixed:
        items[i] = title_to_fixed[t]

# if some are missing (not present in original), append them
existing_titles = {it["title"] for it in items}
for fi in fixed_items:
    if fi["title"] not in existing_titles:
        items.append(fi)

# optional: pretty print
print(json.dumps(items, indent=2, ensure_ascii=False))

[
  {
    "title": "Iris KNN Classifier",
    "description": "Load sklearn's iris dataset, split into train/test, train a k-NN classifier (k=3), and print test accuracy. Print TEST_PASS if accuracy ≥ 0.9.",
    "code": "import argparse\nimport sys\nimport random\nimport numpy as np\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\n\ndef main():\n    p = argparse.ArgumentParser(description=\"Iris KNN classifier with k=3; prints TEST_PASS if accuracy >= 0.9.\")\n    p.add_argument(\"--test-size\", type=float, default=0.2, help=\"Test set fraction (default: 0.2).\")\n    p.add_argument(\"--k\", type=int, default=3, help=\"Number of neighbors for KNN (default: 3).\")\n    p.add_argument(\"--seed\", type=int, default=42, help=\"Random seed (default: 42).\")\n    args = p.parse_args()\n\n    # seeds in main\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    try:\n        import torch
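
Once the fixed versions are swapped in, a cheap static check can flag items whose code is truncated or does not follow the output contract. The sketch below is illustrative only: `check_item` is a hypothetical helper, not part of this notebook, and treating a trailing `# END_OF_SCRIPT` line as required is an assumption based on the fixed scripts above, which all end with it. Nothing is executed; `ast.parse` only checks syntax.

```python
import ast

def check_item(item):
    """Return a list of problems found in one {title, description, code} item."""
    problems = []
    code = item.get("code", "")
    try:
        ast.parse(code)  # syntax check only; the script is not run
    except SyntaxError as exc:
        problems.append(f"syntax error at line {exc.lineno}: {exc.msg}")
    if "TEST_PASS" not in code:
        problems.append("no TEST_PASS marker")
    # Assumption: a complete script ends with the END_OF_SCRIPT comment.
    if not code.rstrip().endswith("# END_OF_SCRIPT"):
        problems.append("missing trailing # END_OF_SCRIPT (possibly truncated)")
    return problems

# Hand-made examples for illustration (not real model output).
good = {"title": "ok", "code": 'print("TEST_PASS")\n# END_OF_SCRIPT'}
bad = {"title": "cut off", "code": "def main(:\n    pass"}

for it in (good, bad):
    print(it["title"], "->", check_item(it) or "no problems found")
```

Running `check_item` over `items` before saving would list the titles that still need a manual fix.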

In [60]:
items = result["items_with_code"]

for item in items:
    if item["title"] == "Classification Synthetic Data":
        print(item)
        break

{'title': 'Classification Synthetic Data', 'description': 'Generate make_classification data (n=500, n_features=4), train an SGDClassifier inside a StandardScaler pipeline, report accuracy. Print TEST_PASS if accuracy ≥ 0.85.', 'code': 'import argparse\nimport sys\nimport random\nimport numpy as np\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\ndef main():\n    p = argparse.ArgumentParser(description="Generate make_classification data (n=500, n_features=4), train SGDClassifier, report accuracy.")\n    p.add_argument("--n-samples", type=int, default=500, help="Number of samples (default: 500).")\n    p.add_argument("--n-features", type=int, default=4, help="Number of features (default: 4).")\n    p.add_argument("--test-size", type=float, default=0.2,
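
Printing an item shows what is stored, but not whether the script actually meets its acceptance check. One way to test that, sketched here under assumptions, is to run each `code` string in a fresh interpreter and look for `TEST_PASS` on stdout together with a zero exit code. `run_generated_script` and the 120-second timeout are hypothetical choices, not something defined earlier in this notebook. Because the code is model-generated, running it in an isolated environment is advisable.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_script(code, timeout=120):
    """Run `code` in a separate Python process; return (passed, stdout)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "script.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                cwd=workdir,          # output files (e.g. PNGs) land in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, ""
    # Pass only if the script exited cleanly and printed the marker.
    passed = proc.returncode == 0 and "TEST_PASS" in proc.stdout
    return passed, proc.stdout

# Toy scripts for illustration (not real model output).
ok, _ = run_generated_script('print("TEST_PASS")')
failed, _ = run_generated_script('import sys\nprint("TEST_FAIL: demo")\nsys.exit(1)')
print(ok, failed)
```

The temporary working directory is removed after each run, so file-existence checks inside the scripts still work while leaving no artifacts behind.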

In [78]:
claude_sonnet = result["items_with_code"]

In [80]:
import json

# 49th project -> index 48  (Breast Cancer SHAP Values)
# 50th project -> index 49  (Digits Grad-CAM Heatmap)
to_drop = {48, 49}

claude_results = [item for i, item in enumerate(claude_sonnet) if i not in to_drop]

with open("claude_sonnet_clean.json", "w", encoding="utf-8") as f:
    json.dump({"items_with_code": claude_results}, f, indent=2, ensure_ascii=False)

print(f"Kept {len(claude_results)} projects, dropped {len(to_drop)}")


Kept 48 projects, dropped 2
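
Dropping by position works only while the list order stays the same. A title-based filter is a possible alternative, sketched below. `drop_by_title` is a hypothetical helper; the two titles are the ones named in the comments of the cell above, and the toy list stands in for `claude_sonnet`.

```python
def drop_by_title(items, titles):
    """Return (kept_items, number_dropped), removing items whose title is in `titles`."""
    titles = set(titles)
    kept = [it for it in items if it.get("title") not in titles]
    return kept, len(items) - len(kept)

# Toy stand-in for the real list of 50 items.
toy_items = [
    {"title": "Iris KNN Classifier"},
    {"title": "Breast Cancer SHAP Values"},
    {"title": "Digits Grad-CAM Heatmap"},
]
kept, n_dropped = drop_by_title(
    toy_items,
    ["Breast Cancer SHAP Values", "Digits Grad-CAM Heatmap"],
)
print(f"Kept {len(kept)} projects, dropped {n_dropped}")
```

Checking `n_dropped` against the expected count also catches a misspelled title, which an index-based drop cannot do.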


**Note:** We now have a base dataset of 48 samples, saved as a single JSON file (`claude_sonnet_clean.json`) whose `items_with_code` list holds a title, description, and code for each project. This is only a starting point: we still need to improve it and generate more samples with greater diversity in both topics and styles.
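
The file written above is one JSON document. If a line-delimited (JSONL) copy is wanted, with one sample per line, a conversion could look like the sketch below. `write_jsonl` and `read_jsonl` are hypothetical helpers and the demo file name is a placeholder. The sample rows are made up for the round-trip check.

```python
import json
import os
import tempfile

def write_jsonl(items, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            # json.dumps escapes newlines inside strings, so each item stays on one line
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up rows for a round-trip check.
sample = [
    {"title": "Demo A", "description": "accuracy ≥ 0.9", "code": "print('TEST_PASS')"},
    {"title": "Demo B", "description": "second row", "code": "x = 1\nprint(x)"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    write_jsonl(sample, path)
    with open(path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)
    restored = read_jsonl(path)

print(line_count, restored == sample)
```

With the real data, `write_jsonl(claude_results, ...)` would produce one line per project, which makes it easy to append new samples later.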