# LLM-Based Code Generation & Evaluation System

This notebook implements a complete system to:
1.  **Generate Python code** using a local LLM (Qwen2.5-Coder-1.5B-Instruct).
2.  **Evaluate the generated code** by executing it against test cases.

### Prerequisites
A GPU speeds up inference, but this model is small (1.5B parameters) and also runs on CPU, just more slowly.

In [None]:
# Install necessary dependencies
%pip -q install torch transformers accelerate

## 1. The Code Generator

This class loads the model from Hugging Face and generates code from a prompt.
We use `Qwen/Qwen2.5-Coder-1.5B-Instruct`, a small instruction-tuned model specialized for code.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class CodeGenerator:
    def __init__(self, model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct", device=None):
        if device is None:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
        else:
            self.device = device

        print(f"Loading model {model_name} on {self.device}...")
        try:
            # Load tokenizer and model
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype="auto",
                device_map=self.device,
                trust_remote_code=True
            )
            print("Model loaded successfully.")
        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def generate_code(self, prompt, max_new_tokens=512):
        # Format the prompt using the chat template
        messages = [
            {"role": "system", "content": "You are a helpful coding assistant. You only output valid Python code inside markdown code blocks. Do not add explanations unless asked."},
            {"role": "user", "content": prompt}
        ]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)

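        # Generate. Unpacking model_inputs passes attention_mask along with
        # input_ids; passing input_ids alone triggers the "attention mask is
        # not set" warning from transformers.
        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.2,  # Lower temperature for more deterministic code
        )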

        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        return self._extract_code_block(response)

    def _extract_code_block(self, text):
        """Extracts Python code from the first markdown code block, if present."""
        # str.split always yields at least two parts when the separator is
        # present, so no IndexError handling is needed here.
        if "```python" in text:
            return text.split("```python", 1)[1].split("```", 1)[0].strip()
        if "```" in text:
            # Untagged fence: take the text between the first pair of fences
            return text.split("```", 1)[1].split("```", 1)[0].strip()
        return text.strip()
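
To make the fence handling concrete, here is a standalone sketch of the same idea using a regular expression. The name `extract_first_code_block` and the sample response are illustrative assumptions, not part of the notebook's classes. Unlike `_extract_code_block`, this variant also drops any language tag on the opening fence, not only `python`.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block
FENCE = "`" * 3

# Opening fence, optional language tag, newline, then a lazy capture up to the closing fence
_BLOCK_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_first_code_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    match = _BLOCK_RE.search(text)
    return match.group(1).strip() if match else text.strip()

# Hypothetical model response, for illustration only
sample = (
    f"Here you go:\n{FENCE}python\n"
    "def add(a, b):\n    return a + b\n"
    f"{FENCE}\nHope that helps."
)
print(extract_first_code_block(sample))
```

Falling back to the stripped text keeps the behavior for responses that contain no fence at all.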

## 2. The Evaluator

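This class executes the generated string with `exec` in the notebook's own process and runs test cases against it. Nothing is sandboxed, so only evaluate code you would be willing to run by hand.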
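
One detail of `exec` matters for an evaluator like this: when it is given separate globals and locals dictionaries, top-level functions are stored in the locals dictionary but resolve names through the globals dictionary, so a recursive function cannot find itself. The sketch below illustrates that behavior with a hypothetical `factorial` snippet standing in for model output.

```python
SOURCE = """
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)
"""

# Separate globals and locals: `factorial` lands in local_ns, but the function
# body looks names up in global_ns, so the recursive call raises NameError.
global_ns, local_ns = {}, {}
exec(SOURCE, global_ns, local_ns)
try:
    local_ns["factorial"](3)
    separate_dicts_ok = True
except NameError:
    separate_dicts_ok = False

# A single dictionary serves as both globals and locals, so recursion works.
namespace = {}
exec(SOURCE, namespace)
single_dict_result = namespace["factorial"](5)

print(separate_dicts_ok, single_dict_result)  # False 120
```

The same applies to generated code that defines a helper function and calls it from another function.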

In [None]:
import sys
import io
import traceback
import contextlib

class CodeEvaluator:
    def evaluate(self, code, test_cases=None, function_name="solution"):
        """
        Executes the provided code and runs test cases if provided.
        """
        buffer = io.StringIO()

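        # Capture stdout and stderr to keep the notebook clean
        with contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
            try:
                # Use one namespace as both globals and locals. With separate
                # dicts, generated functions could not call themselves or
                # each other, because they resolve names through globals.
                local_scope = {}
                try:
                    exec(code, local_scope)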
                except Exception as e:
                    return {
                        "success": False,
                        "output": buffer.getvalue(),
                        "error": f"Syntax/Runtime Error during definition: {str(e)}",
                        "passed_tests": 0,
                        "total_tests": len(test_cases) if test_cases else 0
                    }

                if not test_cases:
                    return {
                        "success": True,
                        "output": buffer.getvalue(),
                        "error": None,
                        "passed_tests": 0,
                        "total_tests": 0
                    }

                # Verify function exists
                if function_name not in local_scope:
                    return {
                        "success": False,
                        "output": buffer.getvalue(),
                        "error": f"Function '{function_name}' not found in generated code.",
                        "passed_tests": 0,
                        "total_tests": len(test_cases)
                    }

                func = local_scope[function_name]
                passed = 0

                # Run tests
                for args, expected in test_cases:
                    try:
                        if isinstance(args, tuple):
                            result = func(*args)
                        else:
                            result = func(args)

                        if result == expected:
                            passed += 1
                        else:
                            print(f"Test failed: Input {args}, Expected {expected}, Got {result}")
                    except Exception as e:
                        print(f"Error during test execution with args {args}: {e}")
                        # traceback.print_exc()

                return {
                    "success": passed == len(test_cases),
                    "output": buffer.getvalue(),
                    "error": None if passed == len(test_cases) else "Some tests failed",
                    "passed_tests": passed,
                    "total_tests": len(test_cases)
                }

            except Exception as e:
                return {
                    "success": False,
                    "output": buffer.getvalue(),
                    "error": str(e) + "\n" + traceback.format_exc(),
                    "passed_tests": 0,
                    "total_tests": len(test_cases) if test_cases else 0
                }
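
Because the evaluator runs generated code inside the notebook process, an infinite loop in that code would hang the cell. One possible mitigation, sketched here as an assumption rather than as part of the notebook, is to run the code in a separate interpreter with a timeout. The helper name `run_isolated` and its return keys are hypothetical, and process isolation of this kind is not a security sandbox.

```python
import subprocess
import sys

def run_isolated(code: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a fresh Python interpreter and report what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # collect stdout and stderr instead of printing them
            text=True,
            timeout=timeout_s,    # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "output": "", "error": f"Timed out after {timeout_s}s"}
    return {
        "success": proc.returncode == 0,
        "output": proc.stdout,
        "error": proc.stderr or None,
    }

print(run_isolated("print(sum([1, 2, 3]))"))
print(run_isolated("while True: pass", timeout_s=1.0)["error"])
```

Running test cases this way would require passing inputs and results across the process boundary (for example as JSON on stdout), which the in-process `exec` approach avoids.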

## 3. Run the Experiment

This section initializes the model, defines tasks, and runs the evaluation loop.
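
Each task pairs a prompt with `(args, expected)` test cases. The evaluator unpacks `args` when it is a tuple and otherwise passes it as a single argument. A minimal sketch of that calling convention, using hand-written functions in place of model output (`run_cases` and `add` are hypothetical names for illustration):

```python
def run_cases(func, test_cases):
    """Count how many (args, expected) pairs `func` satisfies."""
    passed = 0
    for args, expected in test_cases:
        # A tuple is spread into positional arguments; anything else is one argument
        result = func(*args) if isinstance(args, tuple) else func(args)
        if result == expected:
            passed += 1
    return passed

def add(a, b):
    return a + b

two_arg_passed = run_cases(add, [((1, 2), 3), ((0, 0), 0)])   # tuples are unpacked
one_arg_passed = run_cases(sum, [([1, 2, 3], 6), ([], 0)])    # a list is one argument
print(two_arg_passed, one_arg_passed)  # 2 2
```

One consequence of this convention: a function that takes a single tuple as its argument needs that tuple wrapped in another tuple, for example `((1, 2),)`.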

In [None]:
# Initialize Generator (This may take a minute to download/load the model)
try:
    generator = CodeGenerator()
    evaluator = CodeEvaluator()
    print("system initialized.")
except Exception as e:
    print("Initialization failed.", e)

Loading model Qwen/Qwen2.5-Coder-1.5B-Instruct on cuda...


Model loaded successfully.
system initialized.


In [None]:
# Define Tasks
tasks = [
    {
        "name": "Fibonacci Sequence",
        "prompt": "Write a Python function named `fibonacci` that takes an integer `n` as input and returns the n-th Fibonacci number. The sequence starts with 0 for n=0, 1 for n=1. Output ONLY the function code. The Code should 100% correct if you are less than 95% sure for the code then reiterate and give me only the final correct code ",
        "function_name": "fibonacci",
        "test_cases": [
            (0, 0),
            (1, 1),
            (2, 1),
            (5, 5),
            (10, 55)
        ]
    },
    {
        "name": "Check Palindrome",
        "prompt": "Write a Python function named `is_palindrome` that checks if a given string is a palindrome (reads the same forwards and backwards). It should return True or False. Output ONLY the function code.",
        "function_name": "is_palindrome",
        "test_cases": [
            ("madam", True),
            ("racecar", True),
            ("hello", False),
            ("", True)
        ]
    },
    {
        "name": "Sum of Array",
        "prompt": "Write a Python function named `sum_array` that takes a list of numbers and returns their sum. Output ONLY the function code.",
        "function_name": "sum_array",
        "test_cases": [
            ([1, 2, 3], 6),
            ([], 0),
            ([-1, 1], 0)
        ]
    }
]

# Run Loop
print("\n--- Starting Evaluation ---\n")

for task in tasks:
    print(f"Task: {task['name']}")
    print(f"Prompt: {task['prompt']}")

    print("Generating code...")
    try:
        generated_code = generator.generate_code(task['prompt'])
        print("\nGenerated Code:")
        print("--------------------------------------------------")
        print(generated_code)
        print("--------------------------------------------------")

        print("Evaluating...")
        result = evaluator.evaluate(
            generated_code,
            test_cases=task['test_cases'],
            function_name=task['function_name']
        )

        if result['success']:
            print(f"Result: SUCCESS ({result['passed_tests']}/{result['total_tests']} tests passed)")
        else:
            print(f"Result: FAILED ({result['passed_tests']}/{result['total_tests']} tests passed)")
            if result['error']:
                print(f"Error: {result['error']}")
            if result['output']:
                print(f"Execution Output: {result['output']}")

    except Exception as e:
        print(f"An unexpected error occurred during task execution: {e}")

    print("\n" + "="*40 + "\n")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



--- Starting Evaluation ---

Task: Fibonacci Sequence
Prompt: Write a Python function named `fibonacci` that takes an integer `n` as input and returns the n-th Fibonacci number. The sequence starts with 0 for n=0, 1 for n=1. Output ONLY the function code. The Code should 100% correct if you are less than 95% sure for the code then reiterate and give me only the final correct code 
Generating code...

Generated Code:
--------------------------------------------------
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n + 1):
            a, b = b, a + b
        return b
--------------------------------------------------
Evaluating...
Result: SUCCESS (5/5 tests passed)


Task: Check Palindrome
Prompt: Write a Python function named `is_palindrome` that checks if a given string is a palindrome (reads the same forwards and backwards). It should return True or False. Output ONLY the function code.
Genera