# Benchmark Generator for the miniformer Code Agent Task

This notebook orchestrates the complete generation of the `miniformer-15` benchmark, a high-fidelity, repo-level code generation task designed for evaluating LLM-based agents.

### Core Objectives
*   **Generate a Codebase:** Create the complete `miniformer` repository, a minimal but architecturally sound Python library for building transformers. The codebase is engineered to be statistically representative of real-world projects, featuring a long-tail file size distribution and a rich dependency graph.
*   **Generate a Dataset:** Create a dataset of 15 programming tasks. These tasks are formatted as verbose, documentation-style prompts and are designed to be **verified by an executable test suite**, not simple string matching.
*   **Produce Artifacts:** The final outputs of this notebook are:
    1.  `miniformer_codebase.jsonl`: The complete codebase in JSONL format.
    2.  `miniformer_tasks.jsonl`: The 15 tasks in the required JSONL format.
    3.  `miniformer_repo.zip`: A downloadable archive of the physical repository.

This setup is inspired by the principles outlined in the `CODEAGENT` paper, providing a rigorous and realistic environment for agent evaluation.

## 1. Setup and Configuration

This cell handles the initial setup for the generation process. It defines all necessary file paths and constants, and cleans up any artifacts from previous runs to ensure a clean slate.

In [2]:
import os
import json
import shutil

# --- Configuration ---
# The root directory for the physical repository structure
REPO_ROOT_PATH = "/content/mini_transformers_repo"

# Output file paths for the JSONL benchmark data
CODEBASE_OUTPUT_PATH = "/content/mini_transformers_codebase.jsonl"
TASKS_OUTPUT_PATH = "/content/mini_transformers_tasks.jsonl"

# Clean up previous runs to ensure a fresh start
if os.path.exists(REPO_ROOT_PATH):
    shutil.rmtree(REPO_ROOT_PATH)
if os.path.exists(CODEBASE_OUTPUT_PATH):
    os.remove(CODEBASE_OUTPUT_PATH)
if os.path.exists(TASKS_OUTPUT_PATH):
    os.remove(TASKS_OUTPUT_PATH)

print("Setup complete. Old artifacts removed.")

Setup complete. Old artifacts removed.


## 2. Codebase Definition

This cell defines the entire source code for the `miniformer` repository. The code is stored in a Python dictionary where keys are the relative file paths and values are the file contents.

The codebase includes:
*   The core `miniformer` library with sub-packages for `layers`, `models`, `utils`, etc.
*   A comprehensive `tests/` directory containing the test suite required for the test-driven evaluation of our 15 tasks.

In [3]:
# A dictionary holding all file paths and their content.
# This version is engineered to have a statistical profile closer to the reference codebase.
CODEBASE_CONTENT = {
    # --- Project Root Files (Small) ---
    "main.py": "# Main entry point for agent tasks. Initially empty.",
    "README.md": """
# Miniformer
A minimal, educational library for building transformer blocks, designed for agent-based code generation tasks.
This library provides core components for building and experimenting with transformer architectures.
""",
    "requirements.txt": "numpy\ntorch\nscipy\npydantic",

    # --- Core Library: miniformer (Small __init__ files) ---
    "miniformer/__init__.py": "from .models.block import TransformerBlock",
    "miniformer/config.py": """
from pydantic import BaseModel, Field
from typing import Literal

class TransformerConfig(BaseModel):
    \"\"\"Configuration for a Miniformer model.\"\"\"
    vocab_size: int = Field(default=1000, ge=1)
    n_layer: int = Field(default=4, ge=1)
    n_head: int = Field(default=4, ge=1)
    n_embd: int = Field(default=128, ge=1)
    block_size: int = Field(default=256, ge=1)
    attn_pdrop: float = Field(default=0.1, ge=0.0, le=1.0)
    resid_pdrop: float = Field(default=0.1, ge=0.0, le=1.0)
    activation_function: Literal['relu', 'gelu'] = 'gelu'

    class Config:
        validate_assignment = True

    def __post_init__(self):
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
""",

    # --- Layers Package (Medium-sized files) ---
    "miniformer/layers/__init__.py": "from .attention import MultiHeadSelfAttention",
    "miniformer/layers/attention.py": """
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class CausalSelfAttention(nn.Module):
    \"\"\"A vanilla multi-head masked self-attention layer with a projection at the end.\"\"\"
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # Regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # Causal mask
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                      .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # Batch size, sequence length, embedding dimensionality
        # Calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # Causal self-attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # Re-assemble all head outputs
        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

class MultiHeadSelfAttention(CausalSelfAttention):
    \"\"\"Alias for CausalSelfAttention for clearer naming conventions.\"\"\"
    pass
""",
    "miniformer/layers/feedforward.py": """
import torch.nn as nn
from miniformer.activations import get_activation

class FeedForward(nn.Module):
    \"\"\"A position-wise feed-forward network.\"\"\"
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd)
        self.dropout = nn.Dropout(config.resid_pdrop)
        self.activation = get_activation(config.activation_function)

    def forward(self, x):
        return self.dropout(self.c_proj(self.activation(self.c_fc(x))))
""",

    # --- Activations Module (Small) ---
    "miniformer/activations.py": """
import torch
import torch.nn.functional as F
from scipy.special import erf
import numpy as np

def gelu_exact(x):
    \"\"\"Gaussian Error Linear Unit (GELU) activation function using SciPy's erf for NumPy arrays.\"\"\"
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def get_activation(name: str):
    \"\"\"Returns the activation function corresponding to the name.\"\"\"
    if name == 'relu':
        return F.relu
    elif name == 'gelu':
        return F.gelu # PyTorch's optimized GELU for tensors
    else:
        raise ValueError(f"Unknown activation function: {name}")
""",

    # --- Models Package (Small/Medium) ---
    "miniformer/models/__init__.py": "from .block import TransformerBlock",
    "miniformer/models/block.py": """
import torch.nn as nn
from miniformer.layers.attention import MultiHeadSelfAttention
from miniformer.layers.feedforward import FeedForward

class TransformerBlock(nn.Module):
    \"\"\"
    A single block of a transformer model. It consists of a multi-head
    self-attention layer followed by a feed-forward network. Layer
    normalization and residual connections are applied.
    \"\"\"
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = MultiHeadSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = FeedForward(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
""",

    # --- Utils Package (Small files) ---
    "miniformer/utils/__init__.py": "from .testing import assert_allclose",
    "miniformer/utils/testing.py": """
import numpy as np
import time

def assert_allclose(actual, desired, rtol=1e-6, atol=1e-6, label=""):
    \"\"\"A wrapper around np.testing.assert_allclose with a label for better test reporting.\"\"\"
    try:
        np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol)
        if label: print(f"Assertion PASSED for: {label}")
    except AssertionError as e:
        print(f"Assertion FAILED for: {label}" if label else "Assertion FAILED.")
        raise e
""",
    "miniformer/utils/tensor_ops.py": """
import numpy as np
import math
import torch

def to_numpy(tensor):
    \"\"\"Converts a PyTorch tensor to a NumPy array on the CPU.\"\"\"
    if tensor is None: return None
    return tensor.detach().cpu().numpy()

def kaiming_uniform_numpy(shape, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    \"\"\"A simplified numpy-based Kaiming uniform initializer.\"\"\"
    fan = np.prod(shape[1:]) if mode == 'fan_in' else shape[0]
    gain = math.sqrt(2.0 / (1 + a**2)) if nonlinearity == 'leaky_relu' else 1.0
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std
    return np.random.uniform(-bound, bound, size=shape)
""",

    # --- Tests Directory (Our single "heavyweight" file) ---
    "tests/__init__.py": "",
    "tests/test_integration.py": """
# ==============================================================================
# Comprehensive Integration Test Suite for the Miniformer Library
# ==============================================================================
#
# This file serves as the primary validation mechanism for the entire miniformer
# codebase. Unlike unit tests that might focus on a single function, these
# integration tests ensure that all the different components (config, layers,
# models, utils) work together as expected.
#
# This file is intentionally verbose and detailed to:
# 1. Provide a clear example of how to use the library's components.
# 2. Serve as a challenging, realistic file for a code generation agent to modify.
# 3. Create a statistical outlier in file size, mimicking real-world codebases.
#
# It imports from nearly every other module in the project.

import torch
import numpy as np
import time
import pytest # Using pytest for better test structure and fixtures

# Import all major components from the library
from miniformer.config import TransformerConfig
from miniformer.models.block import TransformerBlock
from miniformer.layers.attention import MultiHeadSelfAttention
from miniformer.layers.feedforward import FeedForward
from miniformer.activations import get_activation
from miniformer.utils.testing import assert_allclose
from miniformer.utils.tensor_ops import to_numpy

# --- Test Fixtures ---

@pytest.fixture(scope="module")
def default_config():
    \"\"\"Provides a default, consistent TransformerConfig for all tests in this module.\"\"\"
    return TransformerConfig(
        vocab_size=100,
        n_layer=2,
        n_head=4,
        n_embd=64,
        block_size=128,
        attn_pdrop=0.0,  # Disable dropout for deterministic tests
        resid_pdrop=0.0
    )

@pytest.fixture
def transformer_block(default_config):
    \"\"\"Provides an initialized TransformerBlock instance for testing.\"\"\"
    model = TransformerBlock(default_config)
    model.eval() # Set to evaluation mode to disable dropout behavior
    return model


# --- Core Functionality Tests ---

def test_block_forward_pass_shape(transformer_block, default_config):
    \"\"\"
    PURPOSE: To verify that a forward pass through a TransformerBlock preserves
    the shape of the input tensor. This is the most fundamental sanity check.
    If input shape is (B, T, C), output shape must also be (B, T, C).
    \"\"\"
    print("\\n--- Running Test: Block Forward Pass Shape ---")
    batch_size = 4
    seq_len = 32
    input_tensor = torch.rand(batch_size, seq_len, default_config.n_embd)

    with torch.no_grad():
        output_tensor = transformer_block(input_tensor)

    assert input_tensor.shape == output_tensor.shape, \\
        f"Shape mismatch: In {input_tensor.shape} vs Out {output_tensor.shape}"
    print("Shape preservation test PASSED.")


def test_attention_causality(default_config):
    \"\"\"
    PURPOSE: To rigorously test the causal nature of the self-attention mechanism.
    A token at position `i` should NEVER be influenced by tokens at positions `j > i`.

    METHOD:
    1. Create an input tensor `input1`.
    2. Create a second tensor `input2` which is identical to `input1` initially.
    3. Modify `input2` at a future position (e.g., add noise at `t+1`).
    4. Pass both tensors through the attention layer.
    5. The output for position `t` should be IDENTICAL for both `output1` and `output2`.
    \"\"\"
    print("\\n--- Running Test: Attention Causality ---")
    attention_layer = MultiHeadSelfAttention(default_config).eval()
    seq_len = 16
    pos_to_check = 8 # The position we will observe

    # Create two input tensors.
    input1 = torch.randn(1, seq_len, default_config.n_embd)
    input2 = input1.clone()
    input2[:, pos_to_check + 1, :] += 10.0 # Add large noise to a future token

    with torch.no_grad():
        output1 = attention_layer(input1)
        output2 = attention_layer(input2)

    # The output up to and including `pos_to_check` should be bit-for-bit identical.
    out1_numpy = to_numpy(output1[:, :pos_to_check + 1, :])
    out2_numpy = to_numpy(output2[:, :pos_to_check + 1, :])

    assert_allclose(out1_numpy, out2_numpy, label="Causality Check")
    print("Causality test PASSED.")

def test_batch_independence(transformer_block):
    \"\"\"
    PURPOSE: To ensure that computations for different items in a batch are
    completely independent of one another.

    METHOD:
    1. Process a full batch of size N > 1.
    2. Process the first item of that batch alone (batch size 1).
    3. The output from the single-item pass must be identical to the first slice
       of the output from the full-batch pass.
    \"\"\"
    print("\\n--- Running Test: Batch Independence ---")
    # Input with batch size > 1
    full_batch_input = torch.rand(4, 16, transformer_block.attn.n_embd)
    # Input with only the first element of the batch
    single_item_input = full_batch_input[0:1, :, :].clone()

    with torch.no_grad():
        full_batch_output = transformer_block(full_batch_input)
        single_item_output = transformer_block(single_item_input)

    # Compare the first item from the full batch output to the single item output
    out_full_numpy = to_numpy(full_batch_output[0:1, :, :])
    out_single_numpy = to_numpy(single_item_output)

    assert_allclose(out_full_numpy, out_single_numpy, label="Batch Independence")
    print("Batch independence test PASSED.")

# --- A placeholder for future tests ---
def test_model_training_step():
    \"\"\"
    PURPOSE: This is a placeholder test. A real implementation would check if the
    model parameters are updated after a backward pass and optimizer step.
    For this benchmark, we just confirm that it runs without error.
    \"\"\"
    print("\\n--- Running Test: Model Training Step (Placeholder) ---")
    assert True
    print("Training step placeholder test PASSED.")

# --- Main execution block to run tests if the file is executed directly ---

def run_all_tests():
    \"\"\"Main function to run all defined tests sequentially.\"\"\"
    # This function is more for direct execution than for pytest
    config = default_config()
    block = transformer_block(config)

    test_block_forward_pass_shape(block, config)
    test_attention_causality(config)
    test_batch_independence(block)
    test_model_training_step()


if __name__ == "__main__":
    print("=============================================")
    print("    RUNNING MINIFORMER INTEGRATION SUITE     ")
    print("=============================================")
    start_time = time.time()

    # Manually run tests if not using pytest
    run_all_tests()

    end_time = time.time()
    print("\\n=============================================")
    print(f"   SUITE FINISHED in {end_time - start_time:.2f}s")
    print("=============================================")
""",
"tests/test_config.py": """
import pytest
from miniformer.config import TransformerConfig

def test_config_instantiation():
    # Tests that a config can be created with custom values (Task #11)
    config = TransformerConfig(n_layer=2, n_head=2, n_embd=32)
    assert config.n_layer == 2
    assert config.n_head == 2
    assert config.n_embd == 32

def test_config_to_dict_method():
    # This test is expected to fail until Task #3 is completed.
    config = TransformerConfig()
    try:
        d = config.to_dict()
        assert isinstance(d, dict)
        assert d['n_embd'] == 128
    except AttributeError:
        pytest.fail("The 'to_dict' method does not exist on TransformerConfig.")

def test_config_bias_field():
    # This test is expected to fail until Task #1 is completed.
    try:
        config = TransformerConfig(use_bias=True)
        assert config.use_bias is True
    except TypeError:
        pytest.fail("The 'use_bias' field does not exist on TransformerConfig.")
""",

"tests/test_activations.py": """
import pytest
import torch
from miniformer.activations import get_activation

def test_swish_activation():
    # This test is expected to fail until Task #4 is completed.
    try:
        swish = get_activation('swish')
        x = torch.tensor([1.0, 2.0, -1.0])
        expected = x * torch.sigmoid(x)
        assert torch.allclose(swish(x), expected)
    except ValueError:
        pytest.fail("Activation 'swish' is not registered in get_activation.")
""",

"tests/test_layers.py": """
import pytest
import torch
from miniformer.config import TransformerConfig

def test_flash_attention_placeholder():
    # This test is expected to fail until Task #12 is completed.
    from miniformer.layers.attention import CausalSelfAttention

    # This config would enable flash attention if the field existed
    try:
        config = TransformerConfig(use_flash=True, n_embd=32, n_head=4)
        attention = CausalSelfAttention(config)
        input_tensor = torch.rand(1, 16, 32)
        with pytest.raises(NotImplementedError, match="Flash Attention not yet implemented"):
            attention(input_tensor)
    except (TypeError, AttributeError):
        pytest.fail("Task #12 is not complete. Either 'use_flash' field is missing or the check in CausalSelfAttention is not implemented.")
""",

"tests/test_models.py": """
import pytest
from miniformer.config import TransformerConfig
from miniformer.models.block import TransformerBlock

def test_block_summary_method():
    # This test is expected to fail until Task #9 is completed.
    config = TransformerConfig()
    block = TransformerBlock(config)
    try:
        # Check if the method exists and returns a string containing layer names
        summary_str = block.summary()
        assert isinstance(summary_str, str)
        assert "MultiHeadSelfAttention" in summary_str
        assert "FeedForward" in summary_str
    except (AttributeError, TypeError):
        pytest.fail("The 'summary' method does not exist or does not return a string.")
""",

"tests/test_utils.py": """
import torch
import numpy as np
from miniformer.utils.tensor_ops import to_numpy

def test_to_numpy_conversion():
    # Tests the functionality for Task #2
    tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    arr = to_numpy(tensor)
    assert isinstance(arr, np.ndarray)
    assert arr.shape == (2, 2)

def test_xavier_initializer():
    # This test is expected to fail until Task #14 (the replacement task) is completed.
    try:
        from miniformer.utils.tensor_ops import xavier_uniform_numpy
        shape = (100, 100)
        result = xavier_uniform_numpy(shape)
        assert result.shape == shape
        # Check if values are within a reasonable bound for uniform dist
        assert np.abs(result).mean() < 0.5
    except ImportError:
        pytest.fail("Function 'xavier_uniform_numpy' not found in tensor_ops.py.")
""",

"tests/test_created_files.py": """
import pytest
from miniformer.config import TransformerConfig

# This file contains tests for components that are created by agent tasks.
# These tests are EXPECTED TO FAIL until the agent completes the tasks.

def test_positional_embedding_layer():
    # This test is for Task #5.
    try:
        from miniformer.layers.embedding import PositionalEmbedding
        config = TransformerConfig(block_size=64, n_embd=32)
        layer = PositionalEmbedding(config)
        input_tensor = torch.rand(2, 16, 32) # B, T, C
        pos_emb = layer(input_tensor)
        assert pos_emb.shape == (16, 32) # T, C
    except ImportError:
        pytest.fail("File 'miniformer/layers/embedding.py' or class 'PositionalEmbedding' not found.")

def test_full_model_instantiation():
    # This test is for Task #10.
    try:
        from miniformer.models.model import Miniformer
        config = TransformerConfig()
        model = Miniformer(config)
        assert model is not None
    except ImportError:
        pytest.fail("File 'miniformer/models/model.py' or class 'Miniformer' not found.")

def test_language_model_head():
    # This test is for Task #15.
    try:
        from miniformer.models.head import LanguageModelHead
        config = TransformerConfig(n_embd=32, vocab_size=100)
        head = LanguageModelHead(config)
        input_tensor = torch.rand(2, 16, 32) # B, T, C
        logits = head(input_tensor)
        assert logits.shape == (2, 16, 100) # B, T, vocab_size
    except ImportError:
        pytest.fail("File 'miniformer/models/head.py' or class 'LanguageModelHead' not found.")
"""
}

## 3. Task Dataset Definition

This cell defines the 15 programming tasks for our benchmark. These are the "raw" task definitions, written in a human-readable format.

Key characteristics of these tasks:
*   **Test-Driven:** Each task is designed to be verified by making a specific test in the `tests/` directory pass.
*   **Rich Prompts:** The `prompt` for each task is written in a verbose, documentation-style format to mimic real-world specifications.
*   **Varied Difficulty:** The tasks range from simple modifications to complex refactoring and file creation.

In [4]:
# Raw, human-centric task definitions with verbose, documentation-style prompts.
# This version is designed to match the content style of the reference benchmark.
TASK_DEFINITIONS = [
    {
        "task_id": "miniformer-01",
        "target_symbol": "TransformerConfig",
        "target_file": "miniformer/config.py",
        "test_file_path": "tests/test_config.py",
        "prompt": """
"Add Bias Configuration to TransformerConfig"

class miniformer.config.TransformerConfig

Modify the configuration class to support toggling bias terms in linear layers.

-[ Notes ]-
In some transformer architectures, bias terms in linear projections are considered redundant, especially when followed by a normalization layer like LayerNorm. Providing a configurable flag for this is a common design pattern for architectural experimentation. Your change is required to make the test `test_config_bias_field` in `tests/test_config.py` pass.

Parameters to Add:
  * **use_bias** (*bool*) -- If True, linear layers will include a bias term. This should default to `False` to follow modern best practices.
"""
    },
    {
        "task_id": "miniformer-02",
        "target_symbol": "to_numpy",
        "target_file": "miniformer/utils/tensor_ops.py",
        "test_file_path": "tests/test_utils.py",
        "prompt": """
"Verify Tensor Conversion Utility"

function miniformer.utils.tensor_ops.to_numpy(tensor)

Verify that the `to_numpy` utility function is correctly implemented.

-[ Description ]-
This function serves as a standard bridge between PyTorch tensors and NumPy arrays, a common operation in ML workflows for debugging, analysis, or interfacing with other libraries. It should handle the device transfer (`.cpu()`) and gradient detachment (`.detach()`) before conversion. The test `test_to_numpy_conversion` in `tests/test_utils.py` validates this behavior. No changes are needed if the implementation is correct.
"""
    },
    {
        "task_id": "miniformer-03",
        "target_symbol": "TransformerConfig",
        "target_file": "miniformer/config.py",
        "test_file_path": "tests/test_config.py",
        "prompt": """
"Implement Configuration Export Method"

class miniformer.config.TransformerConfig

Add a method to the `TransformerConfig` class for exporting its settings.

Method to Add:
  to_dict(self)
    Exports the configuration instance to a Python dictionary. This is crucial for serialization (e.g., saving to JSON), logging experiment parameters, or re-instantiating models from a saved state. The implementation should leverage the built-in `.dict()` method from the Pydantic `BaseModel`. This is required to pass the `test_config_to_dict_method`.
"""
    },
    {
        "task_id": "miniformer-04",
        "target_symbol": "get_activation",
        "target_file": "miniformer/activations.py",
        "test_file_path": "tests/test_activations.py",
        "prompt": """
"Add Swish Activation Support"

function miniformer.activations.get_activation(name)

Extend the activation function factory to include support for 'swish'.

-[ Notes ]-
The Swish activation function, defined as `f(x) = x * sigmoid(x)`, is a smooth, non-monotonic function that often matches or exceeds the performance of ReLU on deeper models.

Implementation Steps:
1.  Implement a new Python function `swish(x)` that computes the activation. It should use `torch.sigmoid`.
2.  Modify the `get_activation` function to return a reference to your `swish` function when the input `name` is 'swish'.
This change is required to pass `test_swish_activation` in `tests/test_activations.py`.
"""
    },
    {
        "task_id": "miniformer-05",
        "target_symbol": "PositionalEmbedding",
        "target_file": "miniformer/layers/embedding.py",
        "test_file_path": "tests/test_created_files.py",
        "prompt": """
"Create PositionalEmbedding Layer"

class miniformer.layers.embedding.PositionalEmbedding(config)

Bases: `torch.nn.Module`

Create a new file `miniformer/layers/embedding.py`. In it, define a `PositionalEmbedding` class.

-[ Description ]-
This layer learns a unique vector for each position in the input sequence, up to a maximum length defined by `config.block_size`. These embeddings are added to token embeddings to provide the model with information about token order, which is essential for sequence processing tasks since self-attention is permutation-invariant. This implementation is required to pass `test_positional_embedding_layer`.

__init__(self, config):
  The constructor must initialize a `torch.nn.Embedding` layer.
  Parameters:
    * **num_embeddings** (*int*): The maximum number of positions, from `config.block_size`.
    * **embedding_dim** (*int*): The dimensionality of the embedding vectors, from `config.n_embd`.

forward(self, x):
  The forward pass must generate the positional embeddings for the sequence length `T` of the input tensor `x`.
  Parameters:
    * **x** (*torch.Tensor* of shape *(B, T, C)*): The input tensor from the previous layer. Only the sequence length `T` is used.
  Returns:
    * **pos_emb** (*torch.Tensor* of shape *(T, C)*): The learned positional embeddings for positions 0 to T-1.
"""
    },
    # --- [Tasks 6-9 follow a similar, moderately verbose pattern] ---
    {
        "task_id": "miniformer-06", "target_symbol": "MultiHeadSelfAttention", "target_file": "miniformer/layers/attention.py", "test_file_path": "tests/test_integration.py",
        "prompt": """
"Refactor Attention Scaling"

class miniformer.layers.attention.MultiHeadSelfAttention

Refactor the `forward` method of the `MultiHeadSelfAttention` class to make the scaling factor explicit.

-[ Notes ]-
The attention formula `softmax(Q @ K.T / sqrt(d_k))` includes a scaling factor to prevent the dot products from growing too large and pushing the softmax into regions with extremely small gradients. Making this factor a local variable improves code clarity and maintainability. The end-to-end logic must remain identical to ensure the existing tests in `test_integration.py` continue to pass.
"""
    },
    {
        "task_id": "miniformer-07", "target_symbol": "FeedForward", "target_file": "miniformer/layers/feedforward.py", "test_file_path": "tests/test_integration.py",
        "prompt": """
"Refactor FeedForward Layer with Dependency Injection"

class miniformer.layers.feedforward.FeedForward

Refactor the `FeedForward` class to accept an `activation_fn` object directly in its constructor, rather than a string name from the config.

-[ Rationale ]-
This change decouples the `FeedForward` layer from the `config` object and the `activations` module, an example of Dependency Injection. It makes the layer more reusable and easier to test in isolation. You must also update the instantiation of `FeedForward` within the `TransformerBlock` in `miniformer/models/block.py` to pass the correct activation function object. The refactoring must be correct for all tests in `test_integration.py` to pass.
"""
    },
    {
        "task_id": "miniformer-08", "target_symbol": "test_integration.py", "target_file": "tests/test_integration.py", "test_file_path": "tests/test_integration.py",
        "prompt": """
"Add Test for ReLU Activation"

file tests/test_integration.py

Add a new test function `test_relu_activation` to the integration test suite.

-[ Description ]-
A robust test suite should cover various configurations. This test will ensure that the `TransformerBlock` can be successfully instantiated and executed using the 'relu' activation function, as specified in the configuration. The new test must verify that a forward pass completes without error and preserves the input tensor shape. The entire test suite must remain passing after your addition.
"""
    },
    {
        "task_id": "miniformer-09", "target_symbol": "TransformerBlock", "target_file": "miniformer/models/block.py", "test_file_path": "tests/test_models.py",
        "prompt": """
"Implement Model Block Summary"

class miniformer.models.block.TransformerBlock

Add a `summary(self)` method to the `TransformerBlock` class.

-[ Description ]-
This method should provide a simple, human-readable string representation of the layers contained within the block, which is a common utility for debugging and inspecting model architectures. The returned string must contain the class names of the attention and MLP layers. This change is required to pass `test_block_summary_method` in `tests/test_models.py`.
"""
    },
    # --- [Task 10 is our "heavyweight" task, mimicking the reference's verbosity] ---
    {
        "task_id": "miniformer-10",
        "target_symbol": "Miniformer",
        "target_file": "miniformer/models/model.py",
        "test_file_path": "tests/test_created_files.py",
        "prompt": """
"Create Full Miniformer Model"

class miniformer.models.model.Miniformer(config)

Bases: `torch.nn.Module`

Create a new file `miniformer/models/model.py`. In it, define the main `Miniformer` class that assembles the complete model architecture.

-[ Architecture ]-
This class orchestrates the entire forward pass of a decoder-only transformer. It combines token and positional embeddings, processes them through a stack of transformer blocks, and applies a final normalization step.

__init__(self, config):
  The constructor must initialize all the necessary sub-modules of the model.
  Parameters:
    * **config** (*TransformerConfig*): An object containing the model's hyperparameters.
  Modules to create:
    * **wte**: A `torch.nn.Embedding` for token embeddings. Its size should be `config.vocab_size` x `config.n_embd`.
    * **wpe**: A `torch.nn.Embedding` for positional embeddings. Its size should be `config.block_size` x `config.n_embd`.
    * **drop**: A `torch.nn.Dropout` layer, using the `config.resid_pdrop` probability.
    * **h**: A `torch.nn.Sequential` container that holds `config.n_layer` instances of the `TransformerBlock` class.
    * **ln_f**: A final `torch.nn.LayerNorm` layer.

forward(self, idx):
  Defines the computation performed at every call.
  -[ Logic ]-
  1. Get the token embeddings from `wte` using the input indices `idx`.
  2. Get the positional embeddings from `wpe` for the sequence length of the input.
  3. Add the token and positional embeddings together.
  4. Apply dropout to the combined embeddings.
  5. Pass the result through the stack of transformer blocks (`self.h`).
  6. Apply the final layer norm (`self.ln_f`).
  7. Return the final output tensor.
This implementation is required to pass `test_full_model_instantiation`.
"""
    },
    {
        "task_id": "miniformer-11", "target_symbol": "TransformerConfig", "target_file": "miniformer/config.py", "test_file_path": "tests/test_config.py",
        "prompt": """
"Verify Custom Configuration Instantiation"

class miniformer.config.TransformerConfig

The test `test_config_instantiation` in `tests/test_config.py` validates that the `TransformerConfig` class correctly handles instantiation with custom, non-default values. Your task is to ensure the class implementation supports this. No changes should be necessary if the Pydantic model is correctly defined.
"""
    },
    {
        "task_id": "miniformer-12", "target_symbol": "CausalSelfAttention", "target_file": "miniformer/layers/attention.py", "test_file_path": "tests/test_layers.py",
        "prompt": """
"Implement Flash Attention Placeholder"

class miniformer.layers.attention.CausalSelfAttention

Modify the `CausalSelfAttention` and `TransformerConfig` classes to support a placeholder for Flash Attention.

-[ Rationale ]-
Flash Attention is a highly optimized algorithm for attention. While we will not implement it, we want our architecture to be forward-compatible. This involves adding a flag to the configuration and a check in the `forward` pass. This change is required to pass `test_flash_attention_placeholder`.

Implementation Steps:
1.  Add a `use_flash: bool = True` field to the `TransformerConfig` class.
2.  In `CausalSelfAttention`, check for this flag in the `forward` method. If `True`, the method must raise a `NotImplementedError`.
3.  The original attention logic should execute if the flag is `False`.
"""
    },
    {
        "task_id": "miniformer-13", "target_symbol": "CausalSelfAttention", "target_file": "miniformer/layers/attention.py", "test_file_path": "tests/test_integration.py",
        "prompt": """
"Expose Key-Value Projections"

class miniformer.layers.attention.CausalSelfAttention

Refactor the `c_attn` linear layer, which currently creates Q, K, and V projections in one batch, into three separate `nn.Linear` layers: `q_attn`, `k_attn`, and `v_attn`.

-[ Notes ]-
Separating the projection matrices makes the architecture more explicit and is a prerequisite for more advanced schemes like Grouped-Query Attention (GQA). The forward pass must be updated to use these three distinct layers. The end-to-end functionality must remain identical to ensure all tests in `test_integration.py` continue to pass.
"""
    },
    {
        "task_id": "miniformer-14", "target_symbol": "xavier_uniform_numpy", "target_file": "miniformer/utils/tensor_ops.py", "test_file_path": "tests/test_utils.py",
        "prompt": """
"Add Xavier Uniform Initializer"

function miniformer.utils.tensor_ops.xavier_uniform_numpy(shape)

In `miniformer/utils/tensor_ops.py`, implement a new function `xavier_uniform_numpy(shape)` that provides a simplified, NumPy-based Xavier (or Glorot) uniform weight initializer.

-[ Notes ]-
The Xavier initializer is designed to keep the variance of activations the same across every layer. The bound for the uniform distribution is calculated as `sqrt(6 / (fan_in + fan_out))`. For simplicity, you can assume a 2D weight matrix where `fan_in = shape[1]` and `fan_out = shape[0]`. This is required to pass `test_xavier_initializer`.
"""
    },
    {
        "task_id": "miniformer-15", "target_symbol": "LanguageModelHead", "target_file": "miniformer/models/head.py", "test_file_path": "tests/test_created_files.py",
        "prompt": """
"Create Language Model Head"

class miniformer.models.head.LanguageModelHead(config)

Bases: `torch.nn.Module`

Create a new file `miniformer/models/head.py` and define the `LanguageModelHead` class.

-[ Description ]-
This is the final projection layer in a language model. It takes the high-level feature vectors produced by the transformer blocks and maps them to a score for each word in the vocabulary.

Implementation:
- It must inherit from `torch.nn.Module`.
- The `__init__` method must create a single `torch.nn.Linear` layer named `lm_head`. This layer should map from the embedding dimension (`config.n_embd`) to the vocabulary size (`config.vocab_size`), with `bias=False`.
- The `forward` method should accept a tensor and pass it through the `lm_head`.
This implementation is required to pass the `test_language_model_head`.
"""
    }
]

## 4. Task Enrichment and Formatting

This cell contains the `enrich_task_definitions` function. Its purpose is to process the raw task list from the previous cell and transform it into the final, precise JSONL schema required by our benchmark.

It uses Python's Abstract Syntax Tree (`ast`) module to introspect the codebase and intelligently generate metadata fields like `class_annotation` and `class_name`, ensuring the final output matches the structure of the reference `CODEAGENTBENCH` dataset.

In [5]:
import ast

def enrich_task_definitions(task_list, codebase_content):
    """
    Introspects the codebase to find authentic metadata for each task,
    then builds the final task records with the exact schema from the reference,
    including the 'title' field. This version relies on the `test_file_path`
    for evaluation.
    """
    enriched_tasks = []

    # Build an AST for every Python file to aid in metadata generation
    file_asts = {}
    for path, content in codebase_content.items():
        if path.endswith(".py"):
            try:
                file_asts[path] = ast.parse(content.strip())
            except (SyntaxError, ValueError) as e:
                print(f"Warning: Could not parse {path}: {e}")
                continue

    for task in task_list:
        symbol_name = task.get("target_symbol")
        file_path = task.get("target_file")

        # --- Generate Metadata ---
        class_annotation = "N/A"
        class_name = "N/A"
        module_path = file_path.replace('/', '.').replace('.py', '')

        # Heuristic: If a symbol starts with a capital letter, it's likely a class.
        if symbol_name[0].isupper():
            class_annotation = f"{module_path}.{symbol_name}"
            class_name = class_annotation
        # Otherwise, treat it as a function or module-level entity.
        else:
            class_annotation = f"{module_path}.{symbol_name}"

        # --- Build the Final Record ---
        # The schema now exactly matches the reference benchmark.
        final_task_record = {
            "title": symbol_name,
            "class_annotation": class_annotation,
            "comment": task['prompt'].strip(),
            "class_name": class_name,
            "class_link": f"{file_path}",
            "test_file_path": task.get("test_file_path"),
            "task_id": task['task_id']
        }
        enriched_tasks.append(final_task_record)

    return enriched_tasks

## 5. Generation and Execution

This is the main execution cell. It calls the helper functions defined previously to perform the three core generation steps:
1.  **Create the physical repository** on the local filesystem.
2.  **Generate the `codebase.jsonl` file** from the source code dictionary.
3.  **Process and generate the `tasks.jsonl` file** from the enriched task definitions.

In [6]:
def create_physical_repo(repo_root, codebase_content):
    """Creates the physical directory structure and files on disk."""
    print(f"Creating physical repository at: {repo_root}")
    for rel_path, content in codebase_content.items():
        full_path = os.path.join(repo_root, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        # .strip() to remove potential leading/trailing whitespace from multiline strings
        with open(full_path, 'w', encoding='utf-8') as f:
            f.write(content.strip())
    print("Physical repository created successfully.")

def create_codebase_jsonl(output_path, codebase_content):
    """Creates the codebase.jsonl file with the correct list-of-strings format."""
    print(f"Creating codebase JSONL at: {output_path}")
    with open(output_path, 'w', encoding='utf-8') as f:
        for rel_path, content in codebase_content.items():
            entry = {
                "path": rel_path,
                # Use splitlines(True) to match the original benchmark format
                "content": content.strip().splitlines(keepends=True)
            }
            # Robust handling for empty files
            if not entry["content"] and content.strip() == "":
                entry["content"] = [""]
            elif not entry["content"]:
                entry["content"] = [content]
            f.write(json.dumps(entry) + '\n')
    print("Codebase JSONL created successfully.")


def create_tasks_jsonl(output_path, enriched_task_list):
    """Creates the tasks.jsonl file from the enriched task data."""
    print(f"Creating tasks JSONL at: {output_path}")
    with open(output_path, 'w', encoding='utf-8') as f:
        for task_record in enriched_task_list:
            f.write(json.dumps(task_record) + '\n')
    print("Tasks JSONL created successfully.")

# --- Main Execution ---
# This block will use the string-based path variables defined in your original Cell 1
create_physical_repo(REPO_ROOT_PATH, CODEBASE_CONTENT)
print("-" * 20)
create_codebase_jsonl(CODEBASE_OUTPUT_PATH, CODEBASE_CONTENT)
print("-" * 20)

enriched_tasks = enrich_task_definitions(TASK_DEFINITIONS, CODEBASE_CONTENT)
create_tasks_jsonl(TASKS_OUTPUT_PATH, enriched_tasks)
print("\nBenchmark generation complete with authentic metadata.")

Creating physical repository at: /content/mini_transformers_repo
Physical repository created successfully.
--------------------
Creating codebase JSONL at: /content/mini_transformers_codebase.jsonl
Codebase JSONL created successfully.
--------------------
Creating tasks JSONL at: /content/mini_transformers_tasks.jsonl
Tasks JSONL created successfully.

Benchmark generation complete with authentic metadata.


## 6. Verification and Packaging

This final cell performs a sanity check on the generated artifacts to ensure the process was successful. It displays the first few lines of the output `jsonl` files and lists the directory structure of the created repository.

Finally, it packages the entire `miniformer_repo` directory into a `.zip` file for easy download and distribution.

In [7]:
print("--- Verifying Outputs ---")

# Check codebase file
print(f"\nFirst 2 lines of {CODEBASE_OUTPUT_PATH}:")
!head -n 2 {CODEBASE_OUTPUT_PATH}

# Check tasks file
print(f"\nFirst 2 lines of {TASKS_OUTPUT_PATH}:")
!head -n 2 {TASKS_OUTPUT_PATH}

# Check physical repo structure
print(f"\nPhysical directory structure of {REPO_ROOT_PATH}:")
!ls -R {REPO_ROOT_PATH}

# --- Create a downloadable ZIP file of the repository ---
print("\n--- Creating ZIP file for GitHub ---")
shutil.make_archive("mini_transformers_repo", 'zip', REPO_ROOT_PATH)
print("Created mini_transformers_repo.zip. You can download it from the Colab file browser (click the folder icon on the left).")

--- Verifying Outputs ---

First 2 lines of /content/mini_transformers_codebase.jsonl:
{"path": "main.py", "content": ["# Main entry point for agent tasks. Initially empty."]}
{"path": "README.md", "content": ["# Miniformer\n", "A minimal, educational library for building transformer blocks, designed for agent-based code generation tasks.\n", "This library provides core components for building and experimenting with transformer architectures."]}

First 2 lines of /content/mini_transformers_tasks.jsonl:
{"title": "TransformerConfig", "class_annotation": "miniformer.config.TransformerConfig", "comment": "\"Add Bias Configuration to TransformerConfig\"\n\nclass miniformer.config.TransformerConfig\n\nModify the configuration class to support toggling bias terms in linear layers.\n\n-[ Notes ]-\nIn some transformer architectures, bias terms in linear projections are considered redundant, especially when followed by a normalization layer like LayerNorm. Providing a configurable flag for this