# ICL Evaluation for Code Generation (Qwen Only)

Model: `Qwen/Qwen2.5-Coder-3B-Instruct`  
This notebook runs:

- **Phase 1**: Prompt selection on **MBPP** (first 100 problems)
- **Phase 2 (template)**: Final evaluation on HumanEval using the best-shot prompt

All prompts instruct the model not to emit Markdown code fences, so the generated code can be executed directly.

In [None]:
%pip install transformers accelerate datasets tqdm sentencepiece human-eval evalplus --upgrade -q

In [None]:
!git clone https://github.com/arthur900530/bigcode-evaluation-harness.git

fatal: destination path 'bigcode-evaluation-harness' already exists and is not an empty directory.


In [None]:
%cd bigcode-evaluation-harness
!pip install -e . --quiet
!pip install -q "bitsandbytes>=0.41.0"

/content/bigcode-evaluation-harness
  Preparing metadata (setup.py) ... done


In [None]:
import os, re, json, math, textwrap, traceback
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from human_eval.data import read_problems

# math.comb exists only on Python 3.8+; provide a fallback for older interpreters
if not hasattr(math, 'comb'):
    def comb(n, k):
        if k > n or k < 0:
            return 0
        if k == 0 or k == n:
            return 1
        k = min(k, n - k)
        result = 1
        for i in range(k):
            result = result * (n - i) // (i + 1)
        return result
    math.comb = comb

MBPP_LIMIT = 100   # number of MBPP problems for Phase 1
SEED = 11667

print("✅ Config:")
print(f"  MBPP_LIMIT = {MBPP_LIMIT}")
print(f"  SEED       = {SEED}")

✅ Config:
  MBPP_LIMIT = 100
  SEED       = 11667


In [None]:
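# IPython substitutes $PREFIX_5_SHOT from the Python namespace, so run the cell
# that defines PREFIX_5_SHOT (further down) before this one.
!python main.py \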
  --model "Qwen/Qwen2.5-Coder-3B-Instruct" \
  --tasks "mbpp" \
  --top_p 0.95 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --max_length 2048 \
  --max_length_generation 2048 \
  --allow_code_execution \
  --save_generations \
  --limit 100 \
  --prefix "$PREFIX_5_SHOT"

In [None]:
!python main.py \
  --model "Qwen/Qwen2.5-Coder-3B-Instruct" \
  --tasks "humanevalplus" \
  --top_p 0.95 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --max_length 2048 \
  --max_length_generation 2048 \
  --allow_code_execution \
  --save_generations \
  --prefix "$PREFIX_5_SHOT"

In [None]:
!python main.py \
  --model "Qwen/Qwen2.5-Coder-3B-Instruct" \
  --tasks "humanevalplus" \
  --top_p 0.95 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --max_length 2048 \
  --max_length_generation 2048 \
  --allow_code_execution \
  --save_generations


In [None]:
# !huggingface-cli login
from huggingface_hub import notebook_login
notebook_login()


In [None]:
%%bash
PREFIX="$(cat prompts/mbpp_5shot.txt)"
python main.py \
  --model "Qwen/Qwen2.5-Coder-3B-Instruct" \
  --tasks "humanevalplus" \
  --top_p 0.95 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --max_length 2048 \
  --max_length_generation 2048 \
  --allow_code_execution \
  --save_generations \
  --prefix "$PREFIX"

Process is terminated.


In [None]:
%%bash
cat prompts/mbpp_5shot.txt

def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst

# You are a Python coding assistant.
# Only output valid Python code implementing the required function.
# Do NOT use markdown or ```.
# Do NOT print explanations or comments outside the function body.



In [None]:
HARD_RULE = (
    "# You are a Python coding assistant.\n"
    "# Only output valid Python code implementing the required function.\n"
    "# Do NOT use markdown or ```.\n"
    "# Do NOT print explanations or comments outside the function body.\n\n"
)

CODE_EXAMPLE_1 = """def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

"""

CODE_EXAMPLE_2 = """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

"""

CODE_EXAMPLE_3 = """def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

"""

CODE_EXAMPLE_4 = """def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

"""

CODE_EXAMPLE_5 = """def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst

"""

PREFIX_5_SHOT = "".join([
    CODE_EXAMPLE_1,
    CODE_EXAMPLE_2,
    CODE_EXAMPLE_3,
    CODE_EXAMPLE_4,
    CODE_EXAMPLE_5,
    HARD_RULE,
])

In [None]:
import os

os.makedirs("prompts", exist_ok=True)

with open("prompts/mbpp_5shot.txt", "w") as f:
    f.write(PREFIX_5_SHOT)

In [None]:
%%bash
PREFIX="$(cat prompts/mbpp_5shot.txt)"
python main.py \
  --model "Qwen/Qwen2.5-Coder-3B-Instruct" \
  --tasks "mbpp" \
  --top_p 0.95 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --max_length 2048 \
  --max_length_generation 2048 \
  --allow_code_execution \
  --save_generations \
  --limit 100 \
  --prefix "$PREFIX"

Process is terminated.


In [None]:
import os, re, json, math, textwrap, traceback
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from human_eval.data import read_problems

MBPP_LIMIT = 100   # number of MBPP problems for Phase 1
SEED = 11667

print("✅ Config:")
print(f"  MBPP_LIMIT = {MBPP_LIMIT}")
print(f"  SEED       = {SEED}")

✅ Config:
  MBPP_LIMIT = 100
  SEED       = 11667


In [None]:
model_name = "Qwen/Qwen2.5-Coder-3B-Instruct"
print(f"Loading model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

model.eval()
print("✅ Model loaded and set to eval()")

Loading model: Qwen/Qwen2.5-Coder-3B-Instruct


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Model loaded and set to eval()


## Few-shot examples (pure Python code, no Markdown)

In [None]:
# Pure code examples: these look like functions from a Python file

CODE_EXAMPLE_1 = """def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

"""

CODE_EXAMPLE_2 = """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

"""

CODE_EXAMPLE_3 = """def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib

"""

CODE_EXAMPLE_4 = """def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for num in lst[1:]:
        if num > max_val:
            max_val = num
    return max_val

"""

CODE_EXAMPLE_5 = """def reverse_list(lst):
    left, right = 0, len(lst) - 1
    while left < right:
        lst[left], lst[right] = lst[right], lst[left]
        left += 1
        right -= 1
    return lst

"""

ALL_CODE_EXAMPLES = [
    CODE_EXAMPLE_1,
    CODE_EXAMPLE_2,
    CODE_EXAMPLE_3,
    CODE_EXAMPLE_4,
    CODE_EXAMPLE_5,
]

print("✅ Defined 5 pure-code few-shot examples")

✅ Defined 5 pure-code few-shot examples


## Prompt builder (shots: baseline / 0 / 1 / 3 / 5)

In [None]:
# ========= Prompt config for MBPP + Qwen =========

HARD_RULE = (
    "# You are a Python coding assistant.\n"
    "# Only output valid Python code implementing the required function.\n"
    "# Do NOT use markdown or ```.\n"
    "# Do NOT print explanations or comments outside the function body.\n\n"
)

def build_task_text_mbpp(entry):
    """Return the natural-language task description that ships with MBPP."""
    return entry["text"].strip()

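def get_signature(entry):
    """
    Extract the function signature from the MBPP reference solution.
    e.g. 'def remove_first_and_last(s, ch):'

    Uses the first line that starts with 'def', because some reference
    solutions begin with imports; falls back to the first line otherwise.
    """
    lines = entry["code"].strip().split("\n")
    first_line = next((ln for ln in lines if ln.lstrip().startswith("def ")), lines[0])
    sig = first_line.strip().rstrip(":") + ":"
    return sig + "\n"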

def build_core_prompt(task_text: str, entry) -> str:
    """
    Shared core structure:
    - the rules (HARD_RULE) at the top
    - then a "# Task: <natural-language description>" line
    - then the function signature (empty body for the model to complete)
    """
    one_line_task = task_text.replace("\n", " ")
    sig = get_signature(entry)
    core = (
        HARD_RULE +
        f"# Task: {one_line_task}\n"
        f"# Implement the following function to solve the task.\n\n"
        f"{sig}"
    )
    return core

# ---- Few-shot code examples (pure code, no markdown or instructions) ----

# CODE_EXAMPLE_1..5 were already defined above; they are repeated here to confirm they are plain functions:
CODE_EXAMPLE_1 = """def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

"""

CODE_EXAMPLE_2 = """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

"""

CODE_EXAMPLE_3 = """def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

"""

CODE_EXAMPLE_4 = """def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

"""

CODE_EXAMPLE_5 = """def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst

"""

ALL_CODE_EXAMPLES = [
    CODE_EXAMPLE_1,
    CODE_EXAMPLE_2,
    CODE_EXAMPLE_3,
    CODE_EXAMPLE_4,
    CODE_EXAMPLE_5,
]

# ---- Prompt builders for each shot count ----

def prompt_baseline(task_text: str, entry) -> str:
    # No few-shot examples; only task + signature
    return build_core_prompt(task_text, entry)

def prompt_0shot(task_text: str, entry) -> str:
    # Same as baseline, plus an extra "Complete the function" hint
    core = build_core_prompt(task_text, entry)
    return "# Complete the function below.\n\n" + core

def prompt_1shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_3shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1 + CODE_EXAMPLE_2 + CODE_EXAMPLE_3
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_5shot(task_text: str, entry) -> str:
    prefix = "".join(ALL_CODE_EXAMPLES)
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

MAIN_PROMPTS = {
    "baseline": ("Baseline (task + signature)", prompt_baseline),
    "0shot":   ("0-shot (instruction + task + signature)", prompt_0shot),
    "1shot":   ("1-shot (code prefix + task + signature)", prompt_1shot),
    "3shot":   ("3-shot (code prefix + task + signature)", prompt_3shot),
    "5shot":   ("5-shot (code prefix + task + signature)", prompt_5shot),
}

print("✅ MAIN_PROMPTS re-defined with:")
for k, (name, _) in MAIN_PROMPTS.items():
    print(f"  {k}: {name}")


✅ MAIN_PROMPTS re-defined with:
  baseline: Baseline (task + signature)
  0shot: 0-shot (instruction + task + signature)
  1shot: 1-shot (code prefix + task + signature)
  3shot: 3-shot (code prefix + task + signature)
  5shot: 5-shot (code prefix + task + signature)
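
To make the prompt layout concrete, here is a minimal, self-contained sketch that assembles a 1-shot prompt in the same order as the builders above: example code, then the rule comments, then the task line, then the bare signature. The names `RULES`, `EXAMPLE`, `signature_of`, `one_shot_prompt`, and the `toy_entry` record are illustrative assumptions and are not part of the notebook or of MBPP. The sketch also assumes that the signature is the first line starting with `def`.

```python
# Illustrative sketch only: shortened rules and a made-up MBPP-like record.
RULES = (
    "# You are a Python coding assistant.\n"
    "# Only output valid Python code implementing the required function.\n\n"
)

EXAMPLE = (
    "def factorial(n):\n"
    "    if n == 0 or n == 1:\n"
    "        return 1\n"
    "    return n * factorial(n - 1)\n\n"
)

def signature_of(code: str) -> str:
    """Return the first 'def' line of a reference solution, ending in ':' and a newline."""
    for line in code.strip().splitlines():
        if line.lstrip().startswith("def "):
            return line.strip().rstrip(":") + ":\n"
    raise ValueError("no function definition found")

def one_shot_prompt(task_text: str, code: str) -> str:
    """Example code first, then rules, task description, and the signature to complete."""
    task = task_text.replace("\n", " ")
    return (
        EXAMPLE + "\n" + RULES
        + f"# Task: {task}\n"
        + "# Implement the following function to solve the task.\n\n"
        + signature_of(code)
    )

# Hypothetical record shaped like an MBPP entry (fields "text" and "code").
toy_entry = {
    "text": "Write a function to add two numbers.",
    "code": "import math\ndef add(a, b):\n    return a + b",
}

prompt = one_shot_prompt(toy_entry["text"], toy_entry["code"])
print(prompt)
```

The prompt ends with the bare signature line, so the model is expected to continue with an indented function body.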


## Generation + Markdown-safe code extraction

In [None]:
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 2048, num_samples: int = 1):
    """
    Generate code from a prompt.

    Sampling settings mirror the bigcode-evaluation-harness runs above:
    - do_sample=True
    - temperature=0.2
    - top_p=0.95
    - up to 2048 new tokens by default

    Args:
        num_samples: Number of samples to generate (for pass@k calculation)
    Returns:
        A single decoded string if num_samples == 1, otherwise a list of
        decoded strings. Each decoded string still contains the prompt.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.2,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
            num_return_sequences=num_samples,
        )

    texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return texts[0] if num_samples == 1 else texts

def strip_prompt(full_output: str, prompt: str) -> str:
    if prompt in full_output:
        return full_output.split(prompt, 1)[1].strip()
    return full_output.strip()

def extract_code_from_markdown(text: str) -> str:
    """Robustly extract Python code from possibly-markdown output."""
    if not isinstance(text, str):
        return ""

    s = text.strip()

    # 1) Prefer fenced code blocks
    patterns = [
        r"```python\s*\n(.*?)\n```",
        r"```py\s*\n(.*?)\n```",
        r"```\s*\n(.*?)\n```",
        r"```python\s*(.*?)```",
        r"```\s*(.*?)```",
    ]
    for pat in patterns:
        m = re.search(pat, s, flags=re.DOTALL | re.IGNORECASE)
        if m:
            code = m.group(1).strip()
            if len(code) > 0:
                s = code
                break

    # 2) Remove any leading markers like ### Instruction/Output/Response
    s = re.sub(r"^###\s*(Instruction|Output|Response):\s*", "", s, flags=re.MULTILINE)
    s = re.sub(r"^(Instruction|Output|Response):\s*", "", s, flags=re.MULTILINE)

    # 3) If we can find a def ...(): block, prefer it
    m_def = re.search(r"(def\s+\w+\([^)]*\):[\s\S]*)", s)
    if m_def:
        s = m_def.group(1).strip()

    # 4) Cut off if model started generating another instruction marker
    s = s.split("```")[0]
    s = s.split("### Instruction:")[0]
    s = s.split("### Output:")[0]
    s = s.split("### Response:")[0]

    return s.strip()
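
As a small illustration of the first extraction step (prefer a fenced block if the model produced one), the sketch below pulls the code out of a chat-style reply. It is a simplified stand-in rather than the function above: `first_fenced_block` and the sample `reply` are hypothetical, and the fence marker is built programmatically only so that the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime

def first_fenced_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    pattern = FENCE + r"(?:python|py)?[ \t]*\n(.*?)\n?" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else text.strip()

# Made-up model reply that wraps the code in a fence and adds chatter around it.
reply = (
    "Here is the solution:\n"
    f"{FENCE}python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    f"{FENCE}\n"
    "Hope this helps!"
)

code = first_fenced_block(reply)
print(code)

# The extracted text is plain Python, so it can be executed directly.
namespace = {}
exec(code, namespace)
print(namespace["add"](2, 3))
```

Text without any fence is returned unchanged apart from surrounding whitespace, which matches the case where the model follows the "no markdown" rule.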

## MBPP dataset + correctness checker

In [None]:
mbpp = load_dataset("google-research-datasets/mbpp", split="test").select(range(MBPP_LIMIT))
print(f"✅ Loaded MBPP test subset: {len(mbpp)} problems")

def check_mbpp_correct(code: str, entry) -> bool:
    """Exec generated code + MBPP tests, return True if all pass."""
    tests = entry["test_list"]
    setup = entry["test_setup_code"]
    if not tests:
        return False

    code_clean = extract_code_from_markdown(code)
    if not code_clean or len(code_clean) < 8:
        return False

    full = code_clean + "\n" + setup + "\n" + "\n".join(tests)
    # Use a single namespace: with separate globals/locals dicts, functions
    # defined by the generated code cannot see each other (or themselves,
    # when recursing) and fail with NameError.
    env = {}
    try:
        exec(full, env)
        return True
    except Exception as e:
        # Uncomment for debugging
        # print("Error:", e)
        # print("Code snippet:", code_clean[:200])
        return False

print("✅ MBPP checker ready")

README.md: 0.00B [00:00, ?B/s]

full/train-00000-of-00001.parquet:   0%|          | 0.00/87.2k [00:00<?, ?B/s]

full/test-00000-of-00001.parquet:   0%|          | 0.00/116k [00:00<?, ?B/s]

full/validation-00000-of-00001.parquet:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

full/prompt-00000-of-00001.parquet:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/374 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/90 [00:00<?, ? examples/s]

Generating prompt split:   0%|          | 0/10 [00:00<?, ? examples/s]

✅ Loaded MBPP test subset: 100 problems
✅ MBPP checker ready
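
Running generated code with `exec` inside the notebook process means that an infinite loop hangs the sweep and any `print` calls in the generated code end up in the cell output. One possible alternative, sketched below under stated assumptions, is to run each candidate plus its assert-style tests in a fresh interpreter with a timeout. The helper `passes_tests` and its `timeout_s` parameter are hypothetical names, and this is not a security sandbox: it does not restrict file or network access.

```python
import subprocess
import sys

def passes_tests(code: str, tests: list, timeout_s: float = 5.0) -> bool:
    """Run code plus assert-style tests in a child interpreter; True only on a clean exit."""
    program = code + "\n\n" + "\n".join(tests) + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,   # keep the candidate's prints out of the notebook output
            timeout=timeout_s,     # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

# Made-up candidates and tests, in the style of MBPP's test_list entries.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
stuck = "def add(a, b):\n    while True:\n        pass"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

print(passes_tests(good, tests))
print(passes_tests(bad, tests))
print(passes_tests(stuck, tests, timeout_s=1.0))
```

A failing assertion makes the child exit with a non-zero status, so wrong answers, exceptions, and timeouts are all reported as failures.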


## Quick sanity test on a few MBPP problems

In [None]:
from tqdm.auto import tqdm

def sanity_test_mbpp(model, tokenizer, n_problems: int = 5, prompt_key: str = "3shot"):
    name, prompt_fn = MAIN_PROMPTS[prompt_key]
    print(f"Running sanity test on {n_problems} MBPP problems with prompt: {name}")

    correct = 0
    for i in range(n_problems):
        entry = mbpp[i]
        task_text = build_task_text_mbpp(entry)
        prompt = prompt_fn(task_text, entry)
        full_out = generate(model, tokenizer, prompt, max_new_tokens=256)
        gen_part = strip_prompt(full_out, prompt)
        code = extract_code_from_markdown(gen_part)

        ok = check_mbpp_correct(code, entry)
        correct += int(ok)

        print(f"\nProblem {i+1}:")
        print("Task:", task_text[:120].replace("\n", " ") + ("..." if len(task_text) > 120 else ""))
        print("Generated code (first 200 chars):")
        print(code[:200] + ("..." if len(code) > 200 else ""))
        print("Pass:", "✅" if ok else "❌")

    print(f"\nSanity accuracy: {correct}/{n_problems} = {correct/n_problems:.2%}")

# Run this once to check everything is wired correctly
# sanity_test_mbpp(model, tokenizer, n_problems=3, prompt_key="3shot")
print("💡 Tip: run sanity_test_mbpp(model, tokenizer, 3, '3shot') before the full sweep.")

💡 Tip: run sanity_test_mbpp(model, tokenizer, 3, '3shot') before the full sweep.


In [None]:
sanity_test_mbpp(model, tokenizer, 3, '3shot')

Running sanity test on 3 MBPP problems with prompt: 3-shot (code prefix + task + signature)

Problem 1:
Task: Write a python function to remove first and last occurrence of a given character from the string.
Generated code (first 200 chars):
def remove_Occ(s, ch):
    # Find the first occurrence of the character
    first_occurrence = s.find(ch)
    # Find the last occurrence of the character
    last_occurrence = s.rfind(ch)
    
    # I...
Pass: ❌
[[3, 1, 4], [2, 6, 5], [1, 5, 9]]

Problem 2:
Task: Write a function to sort a given matrix in ascending order according to the sum of its rows.
Generated code (first 200 chars):
def sort_matrix(M):
    M.sort(key=lambda x: sum(x))
    return M

# Example usage:
matrix = [
    [3, 1, 4],
    [1, 5, 9],
    [2, 6, 5]
]
sorted_matrix = sort_matrix(matrix)
print(sorted_matrix)  #...
Pass: ✅

Problem 3:
Task: Write a function to count the most common words in a dictionary.
Generated code (first 200 chars):
def most_common_words(word_dict):


## Phase 1: Prompt selection on MBPP (full sweep)

In [None]:
def eval_shots_on_mbpp(model, tokenizer):
    results = []
    for key, (name, prompt_fn) in MAIN_PROMPTS.items():
        print("\n" + "="*60)
        print(f"Evaluating config: {key} - {name}")
        print("="*60)

        correct = 0
        total = len(mbpp)
        for i, entry in enumerate(tqdm(mbpp, desc=f"{key}", leave=False)):
            task_text = build_task_text_mbpp(entry)
            prompt = prompt_fn(task_text, entry)
            full_out = generate(model, tokenizer, prompt, max_new_tokens=256)
            gen_part = strip_prompt(full_out, prompt)
            code = extract_code_from_markdown(gen_part)

            if check_mbpp_correct(code, entry):
                correct += 1

        acc = correct / total
        results.append((key, name, acc, correct, total))
        print(f"  → {name}: {acc:.4f} ({correct}/{total})")

    # Sort by accuracy desc
    results.sort(key=lambda x: x[2], reverse=True)
    return results

mbpp_results = eval_shots_on_mbpp(model, tokenizer)

print("\n=== MBPP Prompt Selection Results ===")
for key, name, acc, correct, total in mbpp_results:
    print(f"{key:7s} | {name:30s} | acc={acc:.4f} ({correct}/{total})")

# pick best
best_key, best_name, best_acc, best_correct, best_total = mbpp_results[0]
print("\n🏆 Best config on MBPP:")
print(f"  key       = {best_key}")
print(f"  name      = {best_name}")
print(f"  accuracy  = {best_acc:.4f} ({best_correct}/{best_total})")

os.makedirs("results", exist_ok=True)
mbpp_summary = {
    "best": {
        "key": best_key,
        "name": best_name,
        "accuracy": float(best_acc),
        "correct": int(best_correct),
        "total": int(best_total),
    },
    "all_results": [
        {
            "key": k,
            "name": n,
            "accuracy": float(a),
            "correct": int(c),
            "total": int(t),
        }
        for (k, n, a, c, t) in mbpp_results
    ],
}
with open("results/qwen_mbpp_prompt_selection.json", "w") as f:
    json.dump(mbpp_summary, f, indent=2)

print("\n💾 Saved MBPP selection results to results/qwen_mbpp_prompt_selection.json")

# store best prompt fn for later use (Phase 2)
best_prompt_fn = MAIN_PROMPTS[best_key][1]


Evaluating config: baseline - Baseline (task + signature)


baseline:   0%|          | 0/100 [00:00<?, ?it/s]

True
False
[1, 3, 4, 'apple', 'banana', 'cherry']
Equilateral Triangle
Not Equilateral Triangle
Not Equilateral Triangle
True
True
{'a': 1, 'b': 3, 'c': 4, 'd': 5}
HelloWorld
ConvertThisString
[['apple', 'banana', 'cherry'], ['cat', 'dog', 'elephant'], ['giraffe', 'lion', 'zebra']]
  → Baseline (task + signature): 0.2000 (20/100)

Evaluating config: 0shot - 0-shot (instruction + task + signature)


0shot:   0%|          | 0/100 [00:00<?, ?it/s]

Not Equilateral
{'a': 1, 'b': [2, 3], 'c': 4, 'd': 5}
True
  → 0-shot (instruction + task + signature): 0.2000 (20/100)

Evaluating config: 1shot - 1-shot (code prefix + task + signature)


1shot:   0%|          | 0/100 [00:00<?, ?it/s]

[[3, 1, 4], [2, 6, 5], [1, 5, 9]]
False
True
False
True
True
True
False
False
True
True
True
False
False
1
1
6.0
156.25
0
[['apple', 'banana', 'cherry'], ['cat', 'dog', 'elephant'], ['giraffe', 'lion', 'zebra']]
  → 1-shot (code prefix + task + signature): 0.2600 (26/100)

Evaluating config: 3shot - 3-shot (code prefix + task + signature)


3shot:   0%|          | 0/100 [00:00<?, ?it/s]

[[3, 1, 4], [2, 6, 5], [1, 5, 9]]
20
28
40
False
2
[1, 3, 4, 'apple', 'banana', 'cherry']
Not Equilateral Triangle
True
True
False
False
False
1
-8
None
7
3
6
7
3
0
1
3
4
6
6.0
156.25
0
[['apple', 'banana', 'cherry'], ['cat', 'dog', 'elephant'], ['giraffe', 'lion', 'zebra']]
  → 3-shot (code prefix + task + signature): 0.3200 (32/100)

Evaluating config: 5shot - 5-shot (code prefix + task + signature)


5shot:   0%|          | 0/100 [00:00<?, ?it/s]

[[3, 1, 4], [2, 6, 5], [1, 5, 9]]
False
2
2
30
110
[1, 3, 4, 'apple', 'banana', 'cherry']
Not Equilateral Triangle
Ħ
ū
Ȕ
  → 5-shot (code prefix + task + signature): 0.3500 (35/100)

=== MBPP Prompt Selection Results ===
5shot   | 5-shot (code prefix + task + signature) | acc=0.3500 (35/100)
3shot   | 3-shot (code prefix + task + signature) | acc=0.3200 (32/100)
1shot   | 1-shot (code prefix + task + signature) | acc=0.2600 (26/100)
baseline | Baseline (task + signature)    | acc=0.2000 (20/100)
0shot   | 0-shot (instruction + task + signature) | acc=0.2000 (20/100)

🏆 Best config on MBPP:
  key       = 5shot
  name      = 5-shot (code prefix + task + signature)
  accuracy  = 0.3500 (35/100)

💾 Saved MBPP selection results to results/qwen_mbpp_prompt_selection.json


## Phase 2: HumanEval evaluation with best-shot prompt

In [None]:
humaneval_problems = read_problems()
print(f"✅ Loaded {len(humaneval_problems)} HumanEval problems")

def check_humaneval(code: str, problem: dict) -> bool:
    """Simple HumanEval checker using exec() and provided test code."""
    code_clean = extract_code_from_markdown(code)
    if not code_clean or len(code_clean) < 8:
        return False

    prompt_sig = problem["prompt"]   # includes def + docstring, etc.
    test_code = problem["test"]

    # Avoid duplicating the signature block if the model echoed it
    if prompt_sig in code_clean:
        body = code_clean.replace(prompt_sig, "")
    else:
        body = code_clean

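    # The HumanEval "test" field only defines check(candidate); it has to be
    # called on the entry point, otherwise no assertion ever runs.
    full = prompt_sig + body + "\n" + test_code + "\n" + f"check({problem['entry_point']})\n"
    try:
        exec(full, {})
        return True
    except Exception:
        return False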


# ---- Adapter: Convert HumanEval problem to MBPP-like entry ----

def make_fake_entry_for_humaneval(problem: dict) -> dict:
    """
    Create a fake MBPP-like entry from HumanEval problem.
    This allows us to reuse MBPP prompt functions.

    get_signature() expects entry["code"] to be a function definition line.
    We extract the first line (function signature) from problem["prompt"].
    """
    # Extract function signature from problem["prompt"] (first line)
    prompt_lines = problem["prompt"].strip().split("\n")
    first_line = prompt_lines[0] if prompt_lines else ""

    # Ensure it ends with colon (get_signature adds it, but just in case)
    if first_line and not first_line.rstrip().endswith(":"):
        first_line = first_line.rstrip() + ":"

    # Create fake entry with the structure MBPP prompt functions expect
    fake_entry = {
        "code": first_line + "\n",  # get_signature() will extract from this
        "text": problem.get("entry_point", ""),  # Not used, but for compatibility
    }
    return fake_entry

def build_task_text_from_humaneval(problem: dict) -> str:
    """
    Extract task description from HumanEval problem.
    HumanEval doesn't have explicit "text" field, so we use the docstring.
    """
    # Extract docstring from prompt (usually the second line)
    prompt_lines = problem["prompt"].strip().split("\n")
    if len(prompt_lines) > 1:
        # Try to extract docstring (between triple quotes)
        docstring = "\n".join(prompt_lines[1:])
        # Remove quotes if present
        docstring = docstring.strip().strip('"""').strip("'''").strip()
        return docstring
    return "Complete the function"  # Fallback


# ---- Use the best prompt_fn from MBPP selection ----

def calculate_pass_at_k(n, c, k):
    """
    Calculate pass@k metric.
    n: total number of samples
    c: number of correct samples
    k: k for pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - (math.comb(n - c, k) / math.comb(n, k))

def eval_humaneval(model, tokenizer, prompt_fn, calculate_pass10: bool = True):
    """
    Evaluate HumanEval using the best prompt function from MBPP selection.

    Args:
        model: The model to evaluate
        tokenizer: The tokenizer
        prompt_fn: The best prompt function from MAIN_PROMPTS (takes task_text, entry)
        calculate_pass10: If True, also calculate pass@10 (requires 10 samples per problem)
    """
    correct_pass1 = 0
    correct_pass10 = 0
    total = len(humaneval_problems)
    n_samples = 10 if calculate_pass10 else 1

    print(f"🚀 Evaluating HumanEval with best prompt from MBPP selection")
    print(f"   Total problems: {total}")
    if calculate_pass10:
        print(f"   Generating {n_samples} samples per problem for pass@10 calculation")

    # Use tqdm for progress bar
    from tqdm.auto import tqdm

    for i, (task_id, problem) in enumerate(tqdm(humaneval_problems.items(), desc="HumanEval")):
        # Create fake entry for prompt_fn
        fake_entry = make_fake_entry_for_humaneval(problem)
        task_text = build_task_text_from_humaneval(problem)

        # Use the same prompt function as MBPP
        prompt = prompt_fn(task_text, fake_entry)

        # Generate samples
        if calculate_pass10:
            full_outs = generate(model, tokenizer, prompt, max_new_tokens=2048, num_samples=n_samples)
            # Check each sample
            passed_samples = 0
            for full_out in full_outs:
                gen_part = strip_prompt(full_out, prompt)
                code = extract_code_from_markdown(gen_part)
                if check_humaneval(code, problem):
                    passed_samples += 1

            # Pass@1: first sample passes
            if check_humaneval(extract_code_from_markdown(strip_prompt(full_outs[0], prompt)), problem):
                correct_pass1 += 1

            # Pass@10: at least one sample passes (out of 10)
            if passed_samples > 0:
                correct_pass10 += 1
        else:
            # Only pass@1
            full_out = generate(model, tokenizer, prompt, max_new_tokens=2048, num_samples=1)
            gen_part = strip_prompt(full_out, prompt)
            code = extract_code_from_markdown(gen_part)
            if check_humaneval(code, problem):
                correct_pass1 += 1

        # Update progress bar description with current accuracy
        if (i + 1) % 10 == 0:
            current_acc1 = correct_pass1 / (i + 1)
            if calculate_pass10:
                current_acc10 = correct_pass10 / (i + 1)
                tqdm.write(f"  Progress: {i+1}/{total} - pass@1={current_acc1:.2%}, pass@10={current_acc10:.2%}")
            else:
                tqdm.write(f"  Progress: {i+1}/{total} - pass@1={current_acc1:.2%}")

    acc_pass1 = correct_pass1 / total
    results = {
        "pass@1": acc_pass1,
        "pass@1_correct": correct_pass1,
        "pass@1_total": total,
    }

    if calculate_pass10:
        acc_pass10 = correct_pass10 / total
        results["pass@10"] = acc_pass10
        results["pass@10_correct"] = correct_pass10
        results["pass@10_total"] = total
        print(f"\n✅ HumanEval pass@1: {acc_pass1:.4f} ({correct_pass1}/{total})")
        print(f"✅ HumanEval pass@10: {acc_pass10:.4f} ({correct_pass10}/{total})")
    else:
        print(f"\n✅ HumanEval pass@1: {acc_pass1:.4f} ({correct_pass1}/{total})")

    # Free cached GPU memory after evaluation (this does not shut the GPU down)
    import os
    if os.getenv("COLAB_GPU") or torch.cuda.is_available():
        print("\n💤 Clearing the CUDA cache to free GPU memory...")
        torch.cuda.empty_cache()
        print("   GPU cache cleared. The GPU itself is released when the runtime ends.")

    return results


# ---- Run AFTER Phase 1, when best_prompt_fn is defined ----
# Running HumanEval evaluation with best prompt from MBPP selection
# Using 5-shot (best config from MBPP: 0.3500 accuracy)

# Use 5-shot as the best prompt (from MBPP results: 35/100 = 0.3500)
best_key = "5shot"
best_name = MAIN_PROMPTS[best_key][0]
best_prompt_fn = MAIN_PROMPTS[best_key][1]

print(f"✅ Using best prompt from MBPP: {best_key} - {best_name}")
print(f"   (MBPP accuracy: 0.3500 (35/100))")

humaneval_results = eval_humaneval(
    model, tokenizer, best_prompt_fn, calculate_pass10=True
)

os.makedirs("results", exist_ok=True)
with open("results/qwen_humaneval_best_prompt.json", "w") as f:
    json.dump({
        "pass@1": float(humaneval_results["pass@1"]),
        "pass@1_correct": int(humaneval_results["pass@1_correct"]),
        "pass@1_total": int(humaneval_results["pass@1_total"]),
        "pass@10": float(humaneval_results.get("pass@10", 0)),
        "pass@10_correct": int(humaneval_results.get("pass@10_correct", 0)),
        "pass@10_total": int(humaneval_results.get("pass@10_total", 0)),
        "best_prompt_key": best_key,
        "best_prompt_name": best_name,
    }, f, indent=2)
print("💾 Saved HumanEval results to results/qwen_humaneval_best_prompt.json")

# Auto download result file (important: Colab files are lost when runtime disconnects)
try:
    from google.colab import files
    print("\n📥 Auto-downloading HumanEval results...")
    files.download("results/qwen_humaneval_best_prompt.json")
    print("✅ HumanEval results downloaded!")
except ImportError:
    print("⚠️  Not in Colab, skipping auto-download")
except Exception as e:
    print(f"⚠️  Auto-download failed: {e}")
    print("   Please manually download results/qwen_humaneval_best_prompt.json")
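
The evaluation above scores pass@1 from the first sample only and pass@10 as "at least one of the 10 samples passed". A common alternative is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which uses all n samples for every k. The sketch below is an illustration under that assumption, not the notebook's scoring code; `pass_at_k`, `mean_pass_at_k`, and the pass counts are hypothetical names and data.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(pass_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in pass_counts) / len(pass_counts)

# Hypothetical pass counts for three problems, 10 samples each.
counts = [0, 3, 10]
print(f"pass@1  = {mean_pass_at_k(counts, n=10, k=1):.4f}")
print(f"pass@10 = {mean_pass_at_k(counts, n=10, k=10):.4f}")
```

With n = 10 and k = 1 the estimate reduces to c / n, so pass@1 averages over all ten samples instead of depending on which sample happened to come first.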


In [None]:
%pip uninstall -y evalplus

Found existing installation: evalplus 0.3.1
Uninstalling evalplus-0.3.1:
  Successfully uninstalled evalplus-0.3.1


In [None]:
%pip install --no-cache-dir "git+https://github.com/evalplus/evalplus.git"

In [None]:
%pip show evalplus

Name: evalplus
Version: 0.4.0.dev44
Summary: "EvalPlus for rigourous evaluation of LLM-synthesized code"
Home-page: https://github.com/evalplus/evalplus
Author: 
Author-email: 
License: Apache-2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: anthropic, appdirs, boto3, datasets, fire, google-generativeai, multipledispatch, numpy, ollama, openai, psutil, rich, tempdir, termcolor, tqdm, transformers, tree-sitter-python, tree_sitter, wget
Required-by: 


In [None]:
import evalplus.evaluate as ev
print(ev.__file__)
print([name for name in dir(ev) if "eval" in name.lower()])

/usr/local/lib/python3.12/dist-packages/evalplus/evaluate.py
['PERF_EVAL_TIMEOUT_SECOND', 'compatible_eval_result', 'evaluate', 'get_human_eval_plus', 'get_human_eval_plus_hash']


In [None]:
import os, re, json, math, textwrap, traceback
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Re-define necessary functions/variables from earlier cells (e.g., 3de27163, ZPaxlKmxTV-h)
# to ensure they are in scope for this execution.

# Adapter functions (from 3de27163)
def make_fake_entry_for_humaneval(problem: dict) -> dict:
    """
    Create a fake MBPP-like entry from HumanEval problem.
    This allows us to reuse MBPP prompt functions.

    get_signature() expects entry["code"] to be a function definition line.
    We need to extract the 'def ...:' line from problem["prompt"].
    """
    signature_line = ""
    for line in problem["prompt"].strip().split("\n"):
        if line.strip().startswith("def "):
            signature_line = line.strip()
            break
    if not signature_line:
        signature_line = problem["prompt"].strip().split("\n")[0].strip()
    fake_entry = {
        "code": signature_line + "\n",
        "text": problem.get("entry_point", ""),
    }
    return fake_entry

def build_task_text_from_humaneval(problem: dict) -> str:
    """
    Extract task description from HumanEval problem.
    HumanEval doesn't have explicit "text" field, so we use the docstring.
    """
    prompt_lines = problem["prompt"].strip().split("\n")
    def_found = False
    docstring_lines = []
    for line in prompt_lines:
        if line.strip().startswith("def "):
            def_found = True
            continue
        if def_found:
            docstring_lines.append(line)
    if docstring_lines:
        docstring = "\n".join(docstring_lines)
        docstring = docstring.strip().strip('"""').strip("'''").strip()
        return docstring
    return "Complete the function"

# Prompt builder dependencies (from JspMltXLTV-h / 3de27163)
HARD_RULE = (
    "# You are a Python coding assistant.\n"
    "# Only output valid Python code implementing the required function.\n"
    "# Do NOT use markdown or ```.\n"
    "# Do NOT print explanations or comments outside the function body.\n\n"
)

def get_signature(entry):
    first_line = entry["code"].strip().split("\n")[0]
    sig = first_line.strip().rstrip(":") + ":"
    return sig + "\n"

def build_core_prompt(task_text: str, entry) -> str:
    one_line_task = task_text.replace("\n", " ")
    sig = get_signature(entry)
    core = (
        HARD_RULE +
        f"# Task: {one_line_task}\n"
        f"# Implement the following function to solve the task.\n\n"
        f"{sig}"
    )
    return core

CODE_EXAMPLE_1 = """def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

"""

CODE_EXAMPLE_2 = """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

"""

CODE_EXAMPLE_3 = """def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

"""

CODE_EXAMPLE_4 = """def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

"""

CODE_EXAMPLE_5 = """def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst

"""

ALL_CODE_EXAMPLES = [
    CODE_EXAMPLE_1,
    CODE_EXAMPLE_2,
    CODE_EXAMPLE_3,
    CODE_EXAMPLE_4,
    CODE_EXAMPLE_5,
]

def prompt_baseline(task_text: str, entry) -> str:
    return build_core_prompt(task_text, entry)

def prompt_0shot(task_text: str, entry) -> str:
    core = build_core_prompt(task_text, entry)
    return "# Complete the function below.\n\n" + core

def prompt_1shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_3shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1 + CODE_EXAMPLE_2 + CODE_EXAMPLE_3
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_5shot(task_text: str, entry) -> str:
    prefix = "".join(ALL_CODE_EXAMPLES)
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

MAIN_PROMPTS = {
    "baseline": ("Baseline (task + signature)", prompt_baseline),
    "0shot":   ("0-shot (instruction + task + signature)", prompt_0shot),
    "1shot":   ("1-shot (code prefix + task + signature)", prompt_1shot),
    "3shot":   ("3-shot (code prefix + task + signature)", prompt_3shot),
    "5shot":   ("5-shot (code prefix + task + signature)", prompt_5shot),
}

# Helper functions for generation and extraction (from ZPaxlKmxTV-h)
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 2048, num_samples: int = 1):
    # Returns one decoded string when num_samples == 1, otherwise a list of decoded strings.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if num_samples == 1:
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.2,
                top_p=0.95,
                pad_token_id=tokenizer.eos_token_id,
            )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    else:
        generated_texts = []
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.2,
                top_p=0.95,
                pad_token_id=tokenizer.eos_token_id,
                num_return_sequences=num_samples,
            )
        for output in outputs:
            generated_texts.append(tokenizer.decode(output, skip_special_tokens=True))
        return generated_texts

def strip_prompt(full_output: str, prompt: str) -> str:
    if prompt in full_output:
        return full_output.split(prompt, 1)[1].strip()
    return full_output.strip()

def extract_code_from_markdown(text: str) -> str:
    if not isinstance(text, str):
        return ""
    s = text.strip()
    patterns = [
        r"```python\s*\n(.*?)\n```",
        r"```py\s*\n(.*?)\n```",
        r"```\s*\n(.*?)\n```",
        r"```python\s*(.*?)```",
        r"```\s*(.*?)```",
    ]
    for pat in patterns:
        m = re.search(pat, s, flags=re.DOTALL | re.IGNORECASE)
        if m:
            code = m.group(1).strip()
            if len(code) > 0:
                s = code
                break
    s = re.sub(r"^###\s*(Instruction|Output|Response):\s*", "", s, flags=re.MULTILINE)
    s = re.sub(r"^(Instruction|Output|Response):\s*", "", s, flags=re.MULTILINE)
    m_def = re.search(r"(def\s+\w+\([^)]*\):[\s\S]*)", s)
    if m_def:
        s = m_def.group(1).strip()
    s = s.split("```")[0]
    s = s.split("### Instruction:")[0]
    s = s.split("### Output:")[0]
    s = s.split("### Response:")[0]
    return s.strip()
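
The `prompt_1shot`, `prompt_3shot`, and `prompt_5shot` functions above differ only in how many examples they prepend. As a sketch of one way to express that, the factory below builds a k-shot prompt function from a list of examples. `make_kshot_prompt`, `demo_examples`, and `demo_core` are hypothetical stand-ins so the snippet runs on its own; in the notebook they would correspond to `ALL_CODE_EXAMPLES` and `build_core_prompt`.

```python
from typing import Callable, Sequence

def make_kshot_prompt(
    examples: Sequence[str],
    k: int,
    build_core: Callable[[str, dict], str],
) -> Callable[[str, dict], str]:
    """Return a prompt function that prepends the first k examples to the core prompt."""
    if not 0 <= k <= len(examples):
        raise ValueError(f"k must be in [0, {len(examples)}]")
    prefix = "".join(examples[:k])

    def prompt_fn(task_text: str, entry: dict) -> str:
        core = build_core(task_text, entry)
        # Same layout as the notebook: examples, a blank line, then the core prompt.
        return prefix + "\n" + core if k else core

    return prompt_fn

# Shortened stand-ins for the notebook's examples and core prompt builder.
demo_examples = ["def one():\n    return 1\n\n", "def two():\n    return 2\n\n"]

def demo_core(task_text: str, entry: dict) -> str:
    return f"# Task: {task_text}\n\n{entry['code'].strip()}\n"

two_shot = make_kshot_prompt(demo_examples, 2, demo_core)
print(two_shot("Add two numbers.", {"code": "def add(a, b):"}))
```

Adding a new shot count, for example a 2-shot variant for an ablation, then takes one call rather than a new function.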

In [None]:
# ============================================================
# Phase 5: TRUE HumanEval+ Evaluation (evalplus enhanced tests)
# ============================================================
from evalplus.data import get_human_eval_plus
from evalplus.evaluate import evaluate
from time import time
from tqdm.auto import tqdm  # used by the generation loop below

humaneval_plus_problems = get_human_eval_plus()
assert humaneval_plus_problems, "❌ Failed to load HumanEval+ problems!"
print(f"✅ Loaded {len(humaneval_plus_problems)} HumanEval+ problems (enhanced tests)")
print("=" * 60)
print("📊 Phase 5: HumanEval+ Evaluation (evalplus)")
print("=" * 60)

best_key = "5shot"
best_name, best_prompt_fn = MAIN_PROMPTS[best_key]
print(f"✅ Using best prompt: {best_key} - {best_name}\n")

def check_humaneval_plus_local(code: str, problem: dict) -> bool:
    code_clean = extract_code_from_markdown(code)
    if not code_clean or len(code_clean) < 8:
        return False

    prompt_sig = problem["prompt"]
    test_code = problem["test"]
    body = code_clean.replace(prompt_sig, "") if prompt_sig in code_clean else code_clean

    try:
        exec(prompt_sig + body + "\n" + test_code, {})
        return True
    except Exception:
        return False

MAX_SAMPLES = 10   # pass@10 target
BATCH_SIZE = 3     # max samples per round
generations = {}

print("🚀 Generating model solutions (progressive sampling)...")
for idx, (task_id, problem) in enumerate(tqdm(humaneval_plus_problems.items(), desc="Generating"), 1):
    start = time()
    fake_entry = make_fake_entry_for_humaneval(problem)
    task_text = build_task_text_from_humaneval(problem)
    prompt = best_prompt_fn(task_text, fake_entry)

    samples = []
    pass_found = False
    success_code = ""
    attempts = 0

    while attempts < MAX_SAMPLES and not pass_found:
        to_generate = min(BATCH_SIZE, MAX_SAMPLES - attempts)
        full_outs = generate(
            model,
            tokenizer,
            prompt,
            max_new_tokens=2048,
            num_samples=to_generate,
        )
        if to_generate == 1:
            full_outs = [full_outs]

        for full_out in full_outs:
            gen_part = strip_prompt(full_out, prompt)
            code = extract_code_from_markdown(gen_part)
            samples.append(code)
            attempts += 1

            if check_humaneval_plus_local(code, problem):
                pass_found = True
                success_code = code
                break

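    # Pad to MAX_SAMPLES so every task has the same sample count.
    # Caution: stopping at the first pass and repeating the passing sample
    # skews the pass@1 reported below; it is not computed over 10 independent samples.
    if pass_found and len(samples) < MAX_SAMPLES:
        samples.extend([success_code] * (MAX_SAMPLES - len(samples)))
    elif len(samples) < MAX_SAMPLES:
        samples.extend([""] * (MAX_SAMPLES - len(samples)))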

    generations[task_id] = samples
    tqdm.write(
        f"[{idx}/{len(humaneval_plus_problems)}] "
        f"attempts={attempts}, pass_found={pass_found}, "
        f"time={time()-start:.1f}s"
    )

print("\n🧪 Running evalplus enhanced tests...")
results = evaluate(
    problems=humaneval_plus_problems,
    samples=generations,
    k=[1, 10],
    timeout=3,
)

print("\n🎯 TRUE HumanEval+ Results")
print(f"  ▸ pass@1:  {results['pass@1']:.4f}")
print(f"  ▸ pass@10: {results['pass@10']:.4f}\n")

os.makedirs("results", exist_ok=True)
with open("results/qwen_humaneval_plus_best_prompt.json", "w") as f:
    json.dump(
        {
            "pass@1": float(results["pass@1"]),
            "pass@10": float(results["pass@10"]),
            "best_prompt_key": best_key,
            "best_prompt_name": best_name,
        },
        f,
        indent=2,
    )
print("💾 Saved TRUE HumanEval+ results to results/qwen_humaneval_plus_best_prompt.json")
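
`check_humaneval_plus_local` above runs generated code with `exec()` inside the notebook process, so a sample containing an infinite loop would stall the whole generation loop. A minimal sketch of one alternative, assuming it is acceptable to start a fresh Python process per candidate: run the program through `subprocess.run` with a timeout. `run_candidate` is a hypothetical helper, and this is not a security sandbox; it only adds process isolation and a time limit.

```python
import subprocess
import sys

def run_candidate(program: str, timeout_s: float = 5.0) -> bool:
    """Run program text in a fresh Python process; True only on a clean exit in time."""
    try:
        proc = subprocess.run(
            [sys.executable, "-"],   # "-" makes the interpreter read the program from stdin
            input=program,
            text=True,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # e.g. an infinite loop in the generated code
    return proc.returncode == 0  # a failed assert or any uncaught exception exits non-zero

print(run_candidate("assert 1 + 1 == 2"))
print(run_candidate("assert 1 + 1 == 3"))
print(run_candidate("while True: pass", timeout_s=1.0))
```

The program is passed on stdin rather than with `-c` so that long test suites do not run into command-line length limits.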

In [None]:
from human_eval.data import read_problems

# ---- Adapter functions: Convert HumanEval problem to MBPP-like entry ----

def make_fake_entry_for_humaneval(problem: dict) -> dict:
    """
    Create a fake MBPP-like entry from HumanEval problem.
    This allows us to reuse MBPP prompt functions.

    get_signature() expects entry["code"] to be a function definition line.
    We need to extract the 'def ...:' line from problem["prompt"].
    """
    # Extract the function signature (the 'def ...:' line) from problem["prompt"]
    signature_line = ""
    for line in problem["prompt"].strip().split("\n"):
        if line.strip().startswith("def "):
            signature_line = line.strip()
            break

    # If no 'def' line is found (shouldn't happen for HumanEval), fall back.
    if not signature_line:
        signature_line = problem["prompt"].strip().split("\n")[0].strip()

    # The get_signature function already adds the trailing colon if missing,
    # so we just need to provide the 'def ...' line.
    fake_entry = {
        "code": signature_line + "\n",  # get_signature() will extract from this
        "text": problem.get("entry_point", ""),  # Not directly used by prompt builder but good for compatibility
    }
    return fake_entry

def build_task_text_from_humaneval(problem: dict) -> str:
    """
    Extract task description from HumanEval problem.
    HumanEval doesn't have explicit "text" field, so we use the docstring.
    """
    # Extract docstring from prompt (usually the second line)
    prompt_lines = problem["prompt"].strip().split("\n")
    # Find the line that starts with 'def', and take everything after it as docstring/task text
    def_found = False
    docstring_lines = []
    for line in prompt_lines:
        if line.strip().startswith("def "):
            def_found = True
            continue
        if def_found:
            docstring_lines.append(line)

    if docstring_lines:
        docstring = "\n".join(docstring_lines)
        # Remove quotes if present
        docstring = docstring.strip().strip('"""').strip("'''").strip()
        return docstring
    return "Complete the function"  # Fallback

# ==============================================================
# Start of definitions needed for best_prompt_fn (copied from earlier cells)
# ==============================================================

HARD_RULE = (
    "# You are a Python coding assistant.\n"
    "# Only output valid Python code implementing the required function.\n"
    "# Do NOT use markdown or ```.\n"
    "# Do NOT print explanations or comments outside the function body.\n\n"
)

def get_signature(entry):
    """
    e.g. 'def remove_first_and_last(s, ch):'
    """
    first_line = entry["code"].strip().split("\n")[0]
    sig = first_line.strip().rstrip(":") + ":"
    return sig + "\n"

def build_core_prompt(task_text: str, entry) -> str:
    one_line_task = task_text.replace("\n", " ")
    sig = get_signature(entry)
    core = (
        HARD_RULE +
        f"# Task: {one_line_task}\n"
        f"# Implement the following function to solve the task.\n\n"
        f"{sig}"
    )
    return core

CODE_EXAMPLE_1 = """def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

"""

CODE_EXAMPLE_2 = """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

"""

CODE_EXAMPLE_3 = """def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

"""

CODE_EXAMPLE_4 = """def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

"""

CODE_EXAMPLE_5 = """def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst

"""

ALL_CODE_EXAMPLES = [
    CODE_EXAMPLE_1,
    CODE_EXAMPLE_2,
    CODE_EXAMPLE_3,
    CODE_EXAMPLE_4,
    CODE_EXAMPLE_5,
]


def prompt_baseline(task_text: str, entry) -> str:
    return build_core_prompt(task_text, entry)

def prompt_0shot(task_text: str, entry) -> str:
    core = build_core_prompt(task_text, entry)
    return "# Complete the function below.\n\n" + core

def prompt_1shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_3shot(task_text: str, entry) -> str:
    prefix = CODE_EXAMPLE_1 + CODE_EXAMPLE_2 + CODE_EXAMPLE_3
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

def prompt_5shot(task_text: str, entry) -> str:
    prefix = "".join(ALL_CODE_EXAMPLES)
    core = build_core_prompt(task_text, entry)
    return prefix + "\n" + core

MAIN_PROMPTS = {
    "baseline": ("Baseline (task + signature)", prompt_baseline),
    "0shot":   ("0-shot (instruction + task + signature)", prompt_0shot),
    "1shot":   ("1-shot (code prefix + task + signature)", prompt_1shot),
    "3shot":   ("3-shot (code prefix + task + signature)", prompt_3shot),
    "5shot":   ("5-shot (code prefix + task + signature)", prompt_5shot),
}

# Use 5-shot as the best prompt (hardcoded based on MBPP results)
best_key = "5shot"
best_prompt_fn = MAIN_PROMPTS[best_key][1]

# ==============================================================
# End of definitions needed for best_prompt_fn
# ==============================================================

print("\n--- Sample HumanEval Prompt ---")

# Get the first HumanEval problem as an example
humaneval_problems = read_problems()
sample_task_id = list(humaneval_problems.keys())[0]
sample_problem = humaneval_problems[sample_task_id]

# Create a fake MBPP-like entry for the prompt function
fake_entry = make_fake_entry_for_humaneval(sample_problem)
task_text = build_task_text_from_humaneval(sample_problem)

# Build the prompt using the best_prompt_fn
final_humaneval_prompt = best_prompt_fn(task_text, fake_entry)

print(final_humaneval_prompt)
print("-----------------------------")


--- Sample HumanEval Prompt ---
def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]

def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    seq = [0, 1]
    for i in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq

def find_max(lst):
    if not lst:
        return None
    m = lst[0]
    for x in lst[1:]:
        if x > m:
            m = x
    return m

def reverse_list(lst):
    i, j = 0, len(lst) - 1
    while i < j:
        lst[i], lst[j] = lst[j], lst[i]
        i += 1
        j -= 1
    return lst


# You are a Python coding assistant.
# Only output valid Python code implementing the required function.
# Do NOT use markdown or ```.
# Do NOT print explanations or comments outside the function body.

# Task: Check if in given list of numbers, are any two numbers closer 
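
The adapter functions in this cell find the signature and docstring by scanning lines and stripping quote characters. As a sketch of an alternative, the standard-library `ast` module can parse the prompt once a placeholder body is appended. `split_humaneval_prompt` and `sample_prompt` are hypothetical; the sketch assumes the prompt is valid Python apart from the missing body and that the function body is indented with four spaces.

```python
import ast

def split_humaneval_prompt(prompt: str, entry_point: str):
    """Return (signature_line, docstring) for the function named entry_point."""
    # The prompt stops after the docstring, so append a body to make it parseable.
    tree = ast.parse(prompt.rstrip() + "\n    pass\n")
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == entry_point:
            # node.lineno is the 1-based line of the 'def' keyword.
            signature = prompt.splitlines()[node.lineno - 1].strip()
            docstring = ast.get_docstring(node) or "Complete the function"
            return signature, docstring
    raise ValueError(f"no function named {entry_point!r} in prompt")

# A made-up prompt in the HumanEval style (import line, then def plus docstring).
sample_prompt = '''from typing import List


def has_pair(numbers: List[float]) -> bool:
    """Return True if any two numbers are equal.
    >>> has_pair([1.0, 1.0])
    True
    """
'''

signature, docstring = split_humaneval_prompt(sample_prompt, "has_pair")
print(signature)
print(docstring)
```

Selecting the function by `entry_point` avoids picking up an import line or a helper function that precedes the target definition.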

In [None]:
# Load HumanEval+ (if needed for report)

# Ensure evalplus is installed
!pip install evalplus --upgrade -q

import importlib
import sys

# Clear all evalplus related modules from sys.modules to ensure a fresh import
for mod in list(sys.modules.keys()):
    if mod.startswith('evalplus'):
        del sys.modules[mod]

# Re-import evalplus to get the latest version's API
import evalplus

try:
    # Try to import from evalplus.data
    from evalplus.data import get_human_eval_plus # Corrected function name
    humaneval_plus_problems = get_human_eval_plus()
    print(f"✅ Loaded {len(humaneval_plus_problems)} HumanEval+ problems from evalplus.data")
except ImportError:
    # If not found in evalplus.data, try directly from evalplus (or alternative location)
    try:
        # Some versions might have it directly under evalplus
        from evalplus import get_human_eval_plus # Corrected function name
        humaneval_plus_problems = get_human_eval_plus()
        print(f"✅ Loaded {len(humaneval_plus_problems)} HumanEval+ problems from evalplus")
    except ImportError:
        # Fallback to human-eval base if evalplus+ specific problems can't be loaded
        from human_eval.data import read_problems
        humaneval_plus_problems = read_problems()
        print(f"⚠️  Could not find 'get_human_eval_plus' in evalplus library. Loaded {len(humaneval_plus_problems)} HumanEval problems instead.")

# Print dir(evalplus.data) and dir(evalplus) for inspection if needed
print("\n--- Inspecting evalplus.data module ---")
print(dir(evalplus.data))
print("\n--- Inspecting evalplus module ---")
print(dir(evalplus))


✅ Loaded 164 HumanEval+ problems from evalplus.data

--- Inspecting evalplus.data module ---
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'get_evalperf_data', 'get_human_eval_plus', 'get_human_eval_plus_hash', 'get_mbpp_plus', 'get_mbpp_plus_hash', 'humaneval', 'json', 'load_dataset', 'load_solutions', 'mbpp', 'utils', 'write_directory', 'write_jsonl']

--- Inspecting evalplus module ---
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '__version_tuple__', '_version', 'data']


In [None]:
## Phase 3: InstructHumanEval evaluation

In [None]:
# Load InstructHumanEval
# ---- Adapter functions: Convert InstructHumanEval problem to MBPP-like entry ----
# (These are needed even if HumanEval cell wasn't run after restarting runtime)

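def make_fake_entry_for_humaneval(problem: dict) -> dict:
    """Create a fake MBPP-like entry from HumanEval problem."""
    prompt_lines = problem["prompt"].strip().split("\n")
    # Use the first 'def ...' line as the signature. The prompt's first line can be
    # an import (e.g. 'from typing import List') rather than the definition.
    first_line = next(
        (line.strip() for line in prompt_lines if line.strip().startswith("def ")),
        prompt_lines[0] if prompt_lines else "",
    )
    fake_entry = {
        "code": first_line,
        "text": problem.get("instruction", ""),
    }
    return fake_entry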

def build_task_text_from_humaneval(problem: dict) -> str:
    """Extract task description from HumanEval problem."""
    prompt_lines = problem["prompt"].strip().split("\n")
    if len(prompt_lines) > 1:
        docstring = "\n".join(prompt_lines[1:])
        docstring = docstring.strip().strip('"""').strip("'''").strip()
        return docstring
    return "Complete the function"
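# Note: the ImportError branch is also taken when evalplus is installed but its
# data module does not export get_instruct_humaneval, not only when evalplus is missing.
try:
    from evalplus.data import get_instruct_humaneval
    instruct_humaneval_problems = get_instruct_humaneval()
    print(f"✅ Loaded {len(instruct_humaneval_problems)} InstructHumanEval problems")
except ImportError:
    print("⚠️  evalplus not installed. Install with: pip install evalplus")
    instruct_humaneval_problems = None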

def check_instruct_humaneval(code: str, problem: dict) -> bool:
    """Check InstructHumanEval using exec() - same as HumanEval but uses instruction."""
    code_clean = extract_code_from_markdown(code)
    if not code_clean or len(code_clean) < 8:
        return False

    # InstructHumanEval has "instruction" (natural language) and "prompt" (function signature)
    prompt_sig = problem["prompt"]   # function signature
    test_code = problem["test"]

    # Avoid duplicating the signature if model echoed it
    if prompt_sig in code_clean:
        body = code_clean.replace(prompt_sig, "")
    else:
        body = code_clean

    full = prompt_sig + body + "\n" + test_code
    try:
        exec(full, {})
        return True
    except Exception:
        return False

# Note: build_instruct_humaneval_prompt is no longer needed
# We now use the same prompt_fn from MBPP selection (via make_fake_entry_for_humaneval)

def eval_instruct_humaneval(model, tokenizer, prompt_fn, calculate_pass10: bool = True):
    """
    Evaluate InstructHumanEval using the best prompt function from MBPP selection.

    Args:
        model: The model to evaluate
        tokenizer: The tokenizer
        prompt_fn: The best prompt function from MAIN_PROMPTS (takes task_text, entry)
        calculate_pass10: If True, also calculate pass@10 (requires 10 samples per problem)
    """
    if instruct_humaneval_problems is None:
        print("❌ InstructHumanEval not loaded. Skipping.")
        return None

    correct_pass1 = 0
    correct_pass10 = 0
    total = len(instruct_humaneval_problems)
    n_samples = 10 if calculate_pass10 else 1

    print(f"🚀 Evaluating InstructHumanEval with best prompt from MBPP selection")
    print(f"   Total problems: {total}")
    if calculate_pass10:
        print(f"   Generating {n_samples} samples per problem for pass@10 calculation")

    # Use tqdm for progress bar
    from tqdm.auto import tqdm

    for i, (task_id, problem) in enumerate(tqdm(instruct_humaneval_problems.items(), desc="InstructHumanEval")):
        # Create fake entry for prompt_fn (same as HumanEval)
        fake_entry = make_fake_entry_for_humaneval(problem)
        # For InstructHumanEval, use the instruction field as task text
        task_text = problem.get("instruction", "Complete the function")

        # Use the same prompt function as MBPP
        prompt = prompt_fn(task_text, fake_entry)

        # Generate samples
        if calculate_pass10:
            full_outs = generate(model, tokenizer, prompt, max_new_tokens=2048, num_samples=n_samples)
            # Check each sample
            passed_samples = 0
            for full_out in full_outs:
                gen_part = strip_prompt(full_out, prompt)
                code = extract_code_from_markdown(gen_part)
                if check_instruct_humaneval(code, problem):
                    passed_samples += 1

            # Pass@1: first sample passes
            if check_instruct_humaneval(extract_code_from_markdown(strip_prompt(full_outs[0], prompt)), problem):
                correct_pass1 += 1

            # Pass@10: at least one sample passes (out of 10)
            if passed_samples > 0:
                correct_pass10 += 1
        else:
            # Only pass@1
            full_out = generate(model, tokenizer, prompt, max_new_tokens=2048, num_samples=1)
            gen_part = strip_prompt(full_out, prompt)
            code = extract_code_from_markdown(gen_part)
            if check_instruct_humaneval(code, problem):
                correct_pass1 += 1

        # Update progress bar description with current accuracy
        if (i + 1) % 10 == 0:
            current_acc1 = correct_pass1 / (i + 1)
            if calculate_pass10:
                current_acc10 = correct_pass10 / (i + 1)
                tqdm.write(f"  Progress: {i+1}/{total} - pass@1={current_acc1:.2%}, pass@10={current_acc10:.2%}")
            else:
                tqdm.write(f"  Progress: {i+1}/{total} - pass@1={current_acc1:.2%}")

    acc_pass1 = correct_pass1 / total
    results = {
        "pass@1": acc_pass1,
        "pass@1_correct": correct_pass1,
        "pass@1_total": total,
    }

    if calculate_pass10:
        acc_pass10 = correct_pass10 / total
        results["pass@10"] = acc_pass10
        results["pass@10_correct"] = correct_pass10
        results["pass@10_total"] = total
        print(f"\n✅ InstructHumanEval pass@1: {acc_pass1:.4f} ({correct_pass1}/{total})")
        print(f"✅ InstructHumanEval pass@10: {acc_pass10:.4f} ({correct_pass10}/{total})")
    else:
        print(f"\n✅ InstructHumanEval pass@1: {acc_pass1:.4f} ({correct_pass1}/{total})")

    # Free cached GPU memory after evaluation. This does not shut the GPU down;
    # the device is only released when the runtime ends.
    import os
    if os.getenv("COLAB_GPU") or torch.cuda.is_available():
        print("\n💤 Clearing GPU cache to free memory...")
        torch.cuda.empty_cache()
        print("   GPU cache cleared. GPU will be released when runtime ends.")

    return results

# Run AFTER Phase 1, when best_prompt_fn is defined
# Running InstructHumanEval evaluation with best prompt from MBPP selection
# Using 5-shot (best config from MBPP: 0.3500 accuracy)

# Use 5-shot as the best prompt (from MBPP results: 35/100 = 0.3500)
best_key = "5shot"
best_name = MAIN_PROMPTS[best_key][0]
best_prompt_fn = MAIN_PROMPTS[best_key][1]

print(f"✅ Using best prompt from MBPP: {best_key} - {best_name}")
print(f"   (MBPP accuracy: 0.3500 (35/100))")

instruct_humaneval_results = eval_instruct_humaneval(
    model, tokenizer, best_prompt_fn, calculate_pass10=True
)

if instruct_humaneval_results is not None:
    os.makedirs("results", exist_ok=True)
    with open("results/qwen_instruct_humaneval_best_prompt.json", "w") as f:
        json.dump({
            "pass@1": float(instruct_humaneval_results["pass@1"]),
            "pass@1_correct": int(instruct_humaneval_results["pass@1_correct"]),
            "pass@1_total": int(instruct_humaneval_results["pass@1_total"]),
            "pass@10": float(instruct_humaneval_results.get("pass@10", 0)),
            "pass@10_correct": int(instruct_humaneval_results.get("pass@10_correct", 0)),
            "pass@10_total": int(instruct_humaneval_results.get("pass@10_total", 0)),
            "best_prompt_key": best_key,
            "best_prompt_name": best_name,
        }, f, indent=2)
    print("💾 Saved InstructHumanEval results to results/qwen_instruct_humaneval_best_prompt.json")

# Auto download result file (important: Colab files are lost when runtime disconnects).
# Only attempt this when results were actually produced and saved above.
if instruct_humaneval_results is not None:
    try:
        from google.colab import files
        print("\n📥 Auto-downloading InstructHumanEval results...")
        files.download("results/qwen_instruct_humaneval_best_prompt.json")
        print("✅ InstructHumanEval results downloaded!")
    except ImportError:
        print("⚠️  Not in Colab, skipping auto-download")
    except Exception as e:
        print(f"⚠️  Auto-download failed: {e}")
        print("   Please manually download results/qwen_instruct_humaneval_best_prompt.json")

# Auto shutdown GPU after all evaluations complete
print("\n" + "="*60)
print("🎉 All evaluations complete!")
print("="*60)

# Download all results as a zip file (including HumanEval+)
print("\n📦 Creating and downloading all results as zip...")
try:
    from google.colab import files
    import zipfile
    from pathlib import Path
    from datetime import datetime

    results_dir = "results"
    result_files = list(Path(results_dir).glob("*.json"))

    if result_files:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        zip_filename = f"qwen_icl_evaluation_results_{timestamp}.zip"

        print(f"   Found {len(result_files)} result file(s):")
        for rf in sorted(result_files):
            print(f"     - {rf.name}")

        with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for result_file in result_files:
                zipf.write(result_file, result_file.name)

        print(f"\n📥 Downloading {zip_filename}...")
        files.download(zip_filename)
        print("✅ All results downloaded as zip file!")
        print("   Includes: MBPP, HumanEval, HumanEval+, InstructHumanEval")
    else:
        print("⚠️  No result files found")
except Exception as e:
    print(f"⚠️  Auto-download failed: {e}")
    print("   Please manually download results/ directory")

# Clean up GPU
print("\n💤 Cleaning up GPU resources...")
torch.cuda.empty_cache()
print("✅ GPU cache cleared")

# Auto disconnect runtime after all evaluations complete
print("\n" + "="*60)
print("🌙 All evaluations complete! Preparing to disconnect...")
print("="*60)
print("✅ All results have been downloaded.")
print("💤 Disconnecting runtime to save GPU resources...")

try:
    import os
    import time

    # Wait a moment to ensure all downloads complete
    time.sleep(2)

    # Method 1: Use JavaScript to request a runtime disconnect (Colab-specific, best effort)
    try:
        from IPython.display import HTML, display
        print("   Attempting to auto-disconnect runtime...")
        # This JavaScript will try to disconnect the Colab runtime; it may have no effect
        js_code = """
        <script>
        // Try to disconnect Colab runtime
        if (typeof google !== 'undefined' && google.colab) {
            google.colab.kernel.proxyPort(0, {'cache': false});
        }
        // Alternative: Close the browser tab (may not work due to browser security)
        setTimeout(function() {
            window.close();
        }, 2000);
        </script>
        """
        display(HTML(js_code))
        print("   ✅ Disconnect signal sent")
    except Exception as e:
        print(f"   ⚠️  JavaScript disconnect failed: {e}")

    print("   💡 If auto-disconnect doesn't work, runtime will auto-disconnect after ~90 min of inactivity")
    print("   💡 Or manually: Runtime → Disconnect and delete runtime")

    # Print the closing message first: nothing after os._exit(0) is executed
    print("\n😴 Good night! All results are saved and downloaded.")
    print("   Runtime should disconnect automatically.")

    # Method 2: Kill the Python process (this terminates the kernel)
    time.sleep(1)
    print("   🔄 Attempting final disconnect...")
    os._exit(0)  # This will terminate the runtime

except Exception as e:
    print(f"   ⚠️  Auto-disconnect not available: {e}")
    print("   💡 Please manually disconnect: Runtime → Disconnect and delete runtime")
    print("   💡 Or wait ~90 minutes for auto-disconnect")


⚠️  evalplus not installed. Install with: pip install evalplus
✅ Using best prompt from MBPP: 5shot - 5-shot (code prefix + task + signature)
   (MBPP accuracy: 0.3500 (35/100))
❌ InstructHumanEval not loaded. Skipping.

📥 Auto-downloading InstructHumanEval results...
⚠️  Auto-download failed: Cannot find file: results/qwen_instruct_humaneval_best_prompt.json
   Please manually download results/qwen_instruct_humaneval_best_prompt.json

🎉 All evaluations complete!

📦 Creating and downloading all results as zip...
⚠️  No result files found

💤 Cleaning up GPU resources...
✅ GPU cache cleared

🌙 All evaluations complete! Preparing to disconnect...
✅ All results have been downloaded.
💤 Disconnecting runtime to save GPU resources...
   Attempting to auto-disconnect runtime...


   ✅ Disconnect signal sent
   💡 If auto-disconnect doesn't work, runtime will auto-disconnect after ~90 min of inactivity
   💡 Or manually: Runtime → Disconnect and delete runtime
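
The output above shows "No result files found" followed by "All results have been downloaded", because the final message is printed unconditionally. As a minimal sketch, the archiving step could be wrapped in a function that returns what it actually archived, so the caller can choose the message. The name `zip_json_results` and its parameters are hypothetical and not part of the notebook.

```python
import zipfile
from pathlib import Path


def zip_json_results(results_dir, zip_path):
    """Archive every *.json file in results_dir; return the archived file names."""
    result_files = sorted(Path(results_dir).glob("*.json"))
    if not result_files:
        return []  # nothing to archive, so no zip file is created
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for result_file in result_files:
            # Store each file under its bare name, as the download cell does
            zipf.write(result_file, result_file.name)
    return [p.name for p in result_files]
```

A caller would then print a success message only when the returned list is non-empty, and call `files.download(zip_path)` only in that case.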


In [None]:
# Step 1: Create 5-shot prefix file for MultiPL-E evaluation
# This will be used by bigcode-evaluation-harness

print("📝 Creating 5-shot prefix file for MultiPL-E evaluation...")

# Create prompts directory
os.makedirs("prompts", exist_ok=True)

# Generate 5-shot pure code prefix (same as MBPP)
MBPP_PREFIX_5SHOT = "".join(ALL_CODE_EXAMPLES)

# Save to file for bigcode-evaluation-harness
prefix_file = "prompts/mbpp_5shot.txt"
with open(prefix_file, "w", encoding="utf-8") as f:  # later cells read this file back as UTF-8
    f.write(MBPP_PREFIX_5SHOT)

print(f"✅ Created {prefix_file}")
print(f"   Size: {len(MBPP_PREFIX_5SHOT)} characters")
print(f"   Contains: {len(ALL_CODE_EXAMPLES)} pure code examples")
print("\n💡 This prefix file will be used for MultiPL-E evaluation")

# Install bigcode-evaluation-harness (if not already installed)
import os

if not os.path.exists("bigcode-evaluation-harness"):
    print("📦 Installing bigcode-evaluation-harness...")
    print("   This will take a few minutes...")
    os.system("git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git")
    os.chdir("bigcode-evaluation-harness")
    os.system("pip install -e .")
    os.chdir("..")
    print("✅ bigcode-evaluation-harness installed")
else:
    print("✅ bigcode-evaluation-harness already exists")
    print("💡 If you need to reinstall, delete the directory first")


📝 Creating 5-shot prefix file for MultiPL-E evaluation...
✅ Created prompts/mbpp_5shot.txt
   Size: 693 characters
   Contains: 5 pure code examples

💡 This prefix file will be used for MultiPL-E evaluation
📦 Installing bigcode-evaluation-harness...
   This will take a few minutes...
✅ bigcode-evaluation-harness installed
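
The prefix is simply the concatenation of the few-shot code examples, written once and read back by the evaluation cells. Below is a minimal sketch of that round trip with an explicit encoding on both sides. `EXAMPLES` is a hypothetical stand-in for `ALL_CODE_EXAMPLES`, which is defined elsewhere in the notebook, and the helper names are assumptions.

```python
import os

# Hypothetical stand-ins for ALL_CODE_EXAMPLES; the real examples are defined elsewhere
EXAMPLES = [
    "def add(a, b):\n    return a + b\n\n",
    "def is_even(n):\n    return n % 2 == 0\n\n",
]


def write_prefix(examples, path):
    """Concatenate the examples, write them as UTF-8, and return the prefix text."""
    prefix = "".join(examples)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(prefix)
    return prefix


def read_prefix(path):
    """Read the prefix back with the same encoding it was written with."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Using the same encoding for writing and reading avoids platform-dependent defaults changing the prefix between cells.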


In [None]:
# Re-clone bigcode-evaluation-harness from arthur900530 fork
# This will remove the existing directory and clone the fork

import os
import shutil
import subprocess
import sys  # required for sys.executable in the pip install step below

harness_dir = "bigcode-evaluation-harness"
fork_repo = "https://github.com/arthur900530/bigcode-evaluation-harness.git"

print("🔄 Re-cloning bigcode-evaluation-harness from arthur900530 fork...")

# Remove existing directory if it exists
if os.path.exists(harness_dir):
    print(f"   Removing existing directory: {harness_dir}")
    try:
        shutil.rmtree(harness_dir)
        print("   ✅ Directory removed")
    except Exception as e:
        print(f"   ⚠️  Error removing directory: {e}")
        print("   Trying alternative method...")
        # Try using system command as fallback
        if os.name == 'nt':  # Windows
            subprocess.run(["rmdir", "/s", "/q", harness_dir], shell=True, check=False)
        else:  # Linux/Mac
            subprocess.run(["rm", "-rf", harness_dir], check=False)
        # Report success only if the directory is actually gone
        if os.path.exists(harness_dir):
            print("   ⚠️  Directory still exists; the clone below will likely fail")
        else:
            print("   ✅ Directory removed (alternative method)")

# Clone the fork
print(f"\n📦 Cloning from {fork_repo}...")
try:
    result = subprocess.run(
        ["git", "clone", fork_repo, harness_dir],
        capture_output=True,
        text=True,
        check=True
    )
    print("✅ Cloned successfully!")

    # Install the package
    print("\n📦 Installing bigcode-evaluation-harness package...")
    original_dir = os.getcwd()
    os.chdir(harness_dir)

    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-e", "."],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: pip install failed: {e.stderr}")
        print("   Continuing anyway...")

    os.chdir(original_dir)

    print("\n✅ Setup complete! You can now run the C++ evaluation cell.")

except subprocess.CalledProcessError as e:
    print(f"❌ Failed to clone: {e.stderr}")
    print("   Please check your internet connection and try again.")
except FileNotFoundError:
    print("❌ git not found. Please install git first.")
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()
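
Several cells in this notebook call `os.chdir(harness_dir)` and restore the original directory by hand, which leaves the kernel in the wrong directory if an exception is raised in between. A minimal sketch of a context manager that always restores the directory is shown here. `working_directory` is a hypothetical helper, not something the notebook defines; on Python 3.11+ the standard library's `contextlib.chdir` serves the same purpose.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the current directory, restoring it even on error."""
    original_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises
        os.chdir(original_dir)
```

With this helper, the install step could be written as `with working_directory(harness_dir): subprocess.run(...)`, and no explicit `os.chdir(original_dir)` would be needed afterwards.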

In [None]:
# Setup: Install C++ Compiler (Required for MultiPL-E C++ evaluation)
# This cell checks for and installs C++ compiler if needed

import os
import subprocess
import sys
import platform

print("🔧 Setting up C++ compiler for MultiPL-E evaluation...")
print(f"   Platform: {platform.system()}")

# Check if C++ compiler is available
cpp_compiler_found = False
compiler_name = None

# Check for g++ (GNU C++ compiler)
try:
    result = subprocess.run(
        ["g++", "--version"],
        capture_output=True,
        text=True,
        timeout=5
    )
    if result.returncode == 0:
        cpp_compiler_found = True
        compiler_name = "g++"
        print(f"✅ Found {compiler_name}: {result.stdout.split(chr(10))[0]}")
except (FileNotFoundError, subprocess.TimeoutExpired):
    pass

# Check for cl (MSVC on Windows)
if not cpp_compiler_found:
    try:
        result = subprocess.run(
            ["cl"],
            capture_output=True,
            text=True,
            timeout=5
        )
        if "Microsoft" in result.stderr or "Microsoft" in result.stdout:
            cpp_compiler_found = True
            compiler_name = "cl (MSVC)"
            print(f"✅ Found {compiler_name}")
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass

# Install compiler if not found (Linux/Colab only)
if not cpp_compiler_found:
    if platform.system() == "Linux" or os.path.exists("/etc/debian_version"):
        print("📦 Installing C++ compiler (build-essential g++)...")
        try:
            result = subprocess.run(
                ["apt-get", "update", "-qq"],
                capture_output=True,
                text=True,
                timeout=60
            )
            result = subprocess.run(
                ["apt-get", "install", "-y", "-qq", "build-essential", "g++"],
                capture_output=True,
                text=True,
                timeout=120
            )
            if result.returncode == 0:
                print("✅ C++ compiler installed successfully")
                # Verify installation
                result = subprocess.run(
                    ["g++", "--version"],
                    capture_output=True,
                    text=True,
                    timeout=5
                )
                if result.returncode == 0:
                    print(f"   Version: {result.stdout.split(chr(10))[0]}")
                    cpp_compiler_found = True
            else:
                # Without this branch a failed install would pass silently
                print(f"⚠️  apt-get install failed: {result.stderr.strip()}")
        except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError) as e:
            print(f"⚠️  Could not install C++ compiler automatically: {e}")
            print("   Please install manually:")
            print("   - Linux: sudo apt-get install build-essential g++")
            print("   - Windows: Install Visual Studio Build Tools or MinGW")
            print("   - macOS: xcode-select --install")
    elif platform.system() == "Windows":
        print("⚠️  C++ compiler not found on Windows")
        print("   Please install one of the following:")
        print("   1. Visual Studio Build Tools: https://visualstudio.microsoft.com/downloads/")
        print("   2. MinGW-w64: https://www.mingw-w64.org/downloads/")
        print("   3. Or use WSL (Windows Subsystem for Linux)")
    else:
        print("⚠️  C++ compiler not found")
        print("   Please install a C++ compiler manually for your platform")

if cpp_compiler_found:
    print("\n✅ C++ compiler setup complete! Ready for MultiPL-E C++ evaluation.")
else:
    print("\n⚠️  Warning: C++ compiler not available. MultiPL-E C++ evaluation may fail.")
    print("   The evaluation will still attempt to run, but code execution will fail.")


🔧 Setting up C++ compiler for MultiPL-E evaluation...
   Platform: Linux
✅ Found g++: g++ (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0

✅ C++ compiler setup complete! Ready for MultiPL-E C++ evaluation.
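
The compiler check above launches each candidate and inspects its output. A lighter alternative, sketched here under the assumption that finding the executable on `PATH` is sufficient, is `shutil.which`, which does not run anything. The helper name and the candidate list (including `clang++`, which the cell above does not probe) are illustrative choices.

```python
import shutil


def find_cpp_compiler(candidates=("g++", "clang++", "cl")):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None
```

This only shows that an executable with that name exists; running `g++ --version` as the cell does is still the way to confirm it works and to report the version.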


In [None]:
!git clone https://github.com/arthur900530/bigcode-evaluation-harness.git

fatal: destination path 'bigcode-evaluation-harness' already exists and is not an empty directory.


In [None]:
# MultiPL-E C++ Evaluation - Fixed Version
# This will automatically install bigcode-evaluation-harness and run the evaluation

import os
import subprocess
import sys

# Fixed parameters
MODEL = "Qwen/Qwen2.5-Coder-3B-Instruct"

# Task name detection:
# - 'multiple-cpp': Standard name (used in newer/older versions)
# - 'multiple-cljcpp': Alternative name (used in some versions, but has a bug mapping to humaneval-cljcpp)
# We'll try both and use whichever works
TASK_OPTIONS = ["multiple-cpp", "multiple-cljcpp"]
TASK = None  # Will be determined below
TOP_P = "0.95"
TEMPERATURE = "0.2"
DO_SAMPLE = "True"
N_SAMPLES = "10"
BATCH_SIZE = "10"
MAX_LENGTH = "2048"
MAX_LENGTH_GENERATION = "2048"
SEED = "11667"

print("🚀 Evaluating MultiPL-E C++...")
print("   This will take ~30-60 minutes...")
print("   Using 5-shot prompt from MBPP selection")

# Read the 5-shot prefix
with open("prompts/mbpp_5shot.txt", "r", encoding="utf-8") as f:
    prefix_content = f.read()

# Check and install bigcode-evaluation-harness if needed
original_dir = os.getcwd()
harness_dir = "bigcode-evaluation-harness"

if not os.path.exists(harness_dir):
    print("\n📦 Installing bigcode-evaluation-harness...")
    print("   This will take a few minutes...")
    try:
        # Try official repo first, fallback to fork if needed
        repo_url = "https://github.com/bigcode-project/bigcode-evaluation-harness.git"
        result = subprocess.run(
            ["git", "clone", repo_url],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Cloned successfully")
    except subprocess.CalledProcessError as e:
        print("⚠️  Official repo failed, trying fork...")
        try:
            repo_url = "https://github.com/arthur900530/bigcode-evaluation-harness.git"
            result = subprocess.run(
                ["git", "clone", repo_url],
                capture_output=True,
                text=True,
                check=True
            )
            print("✅ Cloned from fork successfully")
        except subprocess.CalledProcessError as e2:
            print(f"❌ Failed to clone: {e2.stderr}")
            raise RuntimeError("Failed to install bigcode-evaluation-harness")
    except FileNotFoundError:
        print("❌ git not found. Please install git first.")
        raise RuntimeError("git is required but not found")

    # Install the package
    os.chdir(harness_dir)
    try:
        print("   Installing bigcode-evaluation-harness package...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-e", "."],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: pip install failed: {e.stderr}")
        print("   Continuing anyway...")

    # Install bitsandbytes (required for some models)
    try:
        print("   Installing bitsandbytes>=0.41.0...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q", "bitsandbytes>=0.41.0"],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ bitsandbytes installed")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: bitsandbytes install failed: {e.stderr}")
        print("   Continuing anyway (may not be needed for CPU-only)...")

    os.chdir(original_dir)
else:
    print("✅ bigcode-evaluation-harness already exists")
    # Try to update to latest version (might fix task name issues)
    print("   Attempting to update to latest version...")
    try:
        os.chdir(harness_dir)
        update_result = subprocess.run(
            ["git", "pull"],
            capture_output=True,
            text=True,
            timeout=30
        )
        if update_result.returncode == 0:
            print("   ✅ Updated to latest version")
            # Reinstall in case dependencies changed
            subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."], check=False)
        else:
            print("   ⚠️  Update failed or already up to date")
        os.chdir(original_dir)
    except Exception as e:
        print(f"   ⚠️  Could not update: {e}")
        os.chdir(original_dir)

# Save prefix to a file in bigcode-evaluation-harness directory
os.makedirs(f"{harness_dir}/prompts", exist_ok=True)
prefix_file = f"{harness_dir}/prompts/mbpp_5shot.txt"
with open(prefix_file, "w", encoding="utf-8") as f:
    f.write(prefix_content)

# Change to bigcode-evaluation-harness directory
os.chdir(harness_dir)

# Check if main.py exists
if not os.path.exists("main.py"):
    print(f"❌ Error: main.py not found in {os.getcwd()}")
    print("   Listing directory contents:")
    for item in os.listdir("."):
        print(f"     - {item}")
    os.chdir(original_dir)
    raise FileNotFoundError(f"main.py not found in {harness_dir}. Please check the installation.")

print(f"\n✅ Found main.py at: {os.path.join(os.getcwd(), 'main.py')}")

# Try to detect which C++ task name is available
print("\n🔍 Detecting available C++ task name...")
for task_option in TASK_OPTIONS:
    try:
        # Quick test to see if the task name is accepted. A valid name makes
        # main.py go on to load the model, so this probe usually hits the
        # 10-second timeout instead of returning.
        test_result = subprocess.run(
            [sys.executable, "main.py", "--tasks", task_option, "--limit", "1", "--model", MODEL],
            capture_output=True,
            text=True,
            timeout=10
        )
        if "invalid choice" not in test_result.stderr.lower() and test_result.returncode != 2:
            TASK = task_option
            print(f"   ✅ Found working task: {TASK}")
            break
    except (subprocess.TimeoutExpired, OSError):
        # Catch only expected failures so Ctrl-C still interrupts the cell
        continue

# If still not found, read the task list from the error printed for an invalid task name
if TASK is None:
    try:
        help_result = subprocess.run(
            [sys.executable, "main.py", "--tasks", "invalid_task_for_help"],
            capture_output=True,
            text=True,
            timeout=5
        )
        # Look for cpp-related tasks in error message
        if "multiple-cpp" in help_result.stderr:
            TASK = "multiple-cpp"
            print(f"   ✅ Found task in help: {TASK}")
        elif "multiple-cljcpp" in help_result.stderr:
            TASK = "multiple-cljcpp"
            print(f"   ✅ Found task in help: {TASK}")
    except (subprocess.TimeoutExpired, OSError):
        pass

if TASK is None:
    TASK = "multiple-cpp"
    print(f"   ⚠️  Could not auto-detect, using default: {TASK}")

print(f"\n💡 Using task: {TASK}")


# Build the command - pass prefix content directly (not using bash $(cat))
cmd = [
    sys.executable, "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--top_p", TOP_P,
    "--temperature", TEMPERATURE,
    "--do_sample", DO_SAMPLE,
    "--n_samples", N_SAMPLES,
    "--batch_size", BATCH_SIZE,
    "--max_length", MAX_LENGTH,
    "--max_length_generation", MAX_LENGTH_GENERATION,
    "--allow_code_execution",
    "--save_generations",
    "--prefix", prefix_content,  # Pass prefix content directly
    "--seed", SEED
]

print(f"\n📝 Command: python main.py --model {MODEL} --tasks {TASK} ...")
print(f"   Prefix length: {len(prefix_content)} characters")
print(f"   Working directory: {os.getcwd()}")
print("\n⏳ Starting evaluation (this will take 30-60 minutes)...\n")

try:
    # Run the command and show output in real-time
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )

    # Print output line by line
    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode == 0:
        print("\n✅ C++ evaluation completed successfully!")
    else:
        print(f"\n❌ Evaluation failed with return code {process.returncode}")
        # returncode is an int, so it cannot be searched for text;
        # argparse exits with status 2 on a usage error such as an unknown task name
        if process.returncode == 2:
            print("\n💡 Checking available tasks...")
            try:
                help_result = subprocess.run(
                    [sys.executable, "main.py", "--tasks", "invalid_task_for_help"],
                    capture_output=True,
                    text=True,
                    timeout=5
                )
                # Extract task list from error message
                if "choose from" in help_result.stderr:
                    tasks_line = help_result.stderr.split("choose from")[-1]
                    cpp_tasks = [t for t in tasks_line.split(",") if "cpp" in t.lower()]
                    if cpp_tasks:
                        print(f"   Found C++ related tasks: {cpp_tasks}")
                        print(f"   Try using one of these instead of '{TASK}'")
            except (subprocess.TimeoutExpired, OSError):
                pass

except Exception as e:
    print(f"\n❌ Error running evaluation: {e}")
    import traceback
    traceback.print_exc()
    # If it's a task name error, provide helpful message
    if "invalid choice" in str(e).lower() or "not found" in str(e).lower():
        print("\n💡 Tip: The task name might be incorrect.")
        print("   Available dataset configs include: humaneval-cpp")
        print("   You may need to:")
        print("   1. Update bigcode-evaluation-harness: cd bigcode-evaluation-harness && git pull")
        print("   2. Or check available tasks: python main.py --help")
finally:
    os.chdir(original_dir)


🚀 Evaluating MultiPL-E C++...
   This will take ~30-60 minutes...
   Using 5-shot prompt from MBPP selection
✅ bigcode-evaluation-harness already exists
   Attempting to update to latest version...
   ✅ Updated to latest version

✅ Found main.py at: /content/bigcode-evaluation-harness/main.py

🔍 Detecting available C++ task name...
   ⚠️  Could not auto-detect, using default: multiple-cpp

💡 Using task: multiple-cpp

📝 Command: python main.py --model Qwen/Qwen2.5-Coder-3B-Instruct --tasks multiple-cpp ...
   Prefix length: 693 characters
   Working directory: /content/bigcode-evaluation-harness

⏳ Starting evaluation (this will take 30-60 minutes)...

2025-11-29 03:29:22.573002: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764386962.594237   14491 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to re
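
The run above fell back to the default task name because the detection probes did not succeed. The cell already relies on the error text containing "choose from" when an unknown task is passed, so the parsing part can be isolated into a pure function that is easy to check without starting the harness. This is a sketch under that assumption about the message format; `tasks_from_argparse_error` is a hypothetical name.

```python
import re


def tasks_from_argparse_error(stderr_text, keyword="cpp"):
    """Extract task names containing `keyword` from an argparse 'choose from' error."""
    # argparse-style messages end with "(choose from a, b, c)"
    match = re.search(r"choose from (.*)\)", stderr_text, flags=re.DOTALL)
    if match is None:
        return []
    # Names may or may not be quoted depending on the Python version
    names = [part.strip().strip("'\"") for part in match.group(1).split(",")]
    return [name for name in names if keyword in name.lower()]
```

Feeding it `help_result.stderr` from the invalid-task probe would give a clean list of candidate names, rather than the raw comma-split fragments the cell prints.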

In [None]:
# MultiPL-E C++ Evaluation - Fixed Version
# This will automatically install bigcode-evaluation-harness and run the evaluation

import os
import subprocess
import sys

# Fixed parameters
MODEL = "Qwen/Qwen2.5-Coder-3B-Instruct"

# Task name detection:
# - 'multiple-cpp': Standard name (used in newer/older versions)
# - 'multiple-cljcpp': Alternative name (used in some versions, but has a bug mapping to humaneval-cljcpp)
# We'll try both and use whichever works
TASK_OPTIONS = ["multiple-cpp", "multiple-cljcpp"]
TASK = None  # Will be determined below
TOP_P = "0.95"
TEMPERATURE = "0.2"
DO_SAMPLE = "True"
N_SAMPLES = "10"
BATCH_SIZE = "10"
MAX_LENGTH = "2048"
MAX_LENGTH_GENERATION = "2048"
SEED = "11667"

print("🚀 Evaluating MultiPL-E C++...")
print("   This will take ~30-60 minutes...")
print("   Using 5-shot prompt from MBPP selection")

# Read the 5-shot prefix
with open("prompts/mbpp_5shot.txt", "r", encoding="utf-8") as f:
    prefix_content = f.read()

# Check and install bigcode-evaluation-harness if needed
original_dir = os.getcwd()
harness_dir = "bigcode-evaluation-harness"

if not os.path.exists(harness_dir):
    print("\n📦 Installing bigcode-evaluation-harness...")
    print("   This will take a few minutes...")
    try:
        # Try official repo first, fallback to fork if needed
        repo_url = "https://github.com/bigcode-project/bigcode-evaluation-harness.git"
        result = subprocess.run(
            ["git", "clone", repo_url],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Cloned successfully")
    except subprocess.CalledProcessError as e:
        print("⚠️  Official repo failed, trying fork...")
        try:
            repo_url = "https://github.com/arthur900530/bigcode-evaluation-harness.git"
            result = subprocess.run(
                ["git", "clone", repo_url],
                capture_output=True,
                text=True,
                check=True
            )
            print("✅ Cloned from fork successfully")
        except subprocess.CalledProcessError as e2:
            print(f"❌ Failed to clone: {e2.stderr}")
            raise RuntimeError("Failed to install bigcode-evaluation-harness")
    except FileNotFoundError:
        print("❌ git not found. Please install git first.")
        raise RuntimeError("git is required but not found")

    # Install the package
    os.chdir(harness_dir)
    try:
        print("   Installing bigcode-evaluation-harness package...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-e", "."],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: pip install failed: {e.stderr}")
        print("   Continuing anyway...")

    # Install bitsandbytes (required for some models)
    try:
        print("   Installing bitsandbytes>=0.41.0...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q", "bitsandbytes>=0.41.0"],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ bitsandbytes installed")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: bitsandbytes install failed: {e.stderr}")
        print("   Continuing anyway (may not be needed for CPU-only)...")

    os.chdir(original_dir)
else:
    print("✅ bigcode-evaluation-harness already exists")
    # Try to update to latest version (might fix task name issues)
    print("   Attempting to update to latest version...")
    try:
        os.chdir(harness_dir)
        update_result = subprocess.run(
            ["git", "pull"],
            capture_output=True,
            text=True,
            timeout=30
        )
        if update_result.returncode == 0:
            print("   ✅ Updated to latest version")
            # Reinstall in case dependencies changed
            subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."], check=False)
        else:
            print("   ⚠️  Update failed or already up to date")
        os.chdir(original_dir)
    except Exception as e:
        print(f"   ⚠️  Could not update: {e}")
        os.chdir(original_dir)

# Save prefix to a file in bigcode-evaluation-harness directory
os.makedirs(f"{harness_dir}/prompts", exist_ok=True)
prefix_file = f"{harness_dir}/prompts/mbpp_5shot.txt"
with open(prefix_file, "w", encoding="utf-8") as f:
    f.write(prefix_content)

# Change to bigcode-evaluation-harness directory
os.chdir(harness_dir)

# Check if main.py exists
if not os.path.exists("main.py"):
    print(f"❌ Error: main.py not found in {os.getcwd()}")
    print("   Listing directory contents:")
    for item in os.listdir("."):
        print(f"     - {item}")
    os.chdir(original_dir)
    raise FileNotFoundError(f"main.py not found in {harness_dir}. Please check the installation.")

print(f"\n✅ Found main.py at: {os.path.join(os.getcwd(), 'main.py')}")

# Try to detect which C++ task name is available
print("\n🔍 Detecting available C++ task name...")
for task_option in TASK_OPTIONS:
    try:
        # Quick test to see if the task name is accepted. A valid name makes
        # main.py go on to load the model, so this probe usually hits the
        # 10-second timeout instead of returning.
        test_result = subprocess.run(
            [sys.executable, "main.py", "--tasks", task_option, "--limit", "1", "--model", MODEL],
            capture_output=True,
            text=True,
            timeout=10
        )
        if "invalid choice" not in test_result.stderr.lower() and test_result.returncode != 2:
            TASK = task_option
            print(f"   ✅ Found working task: {TASK}")
            break
    except (subprocess.TimeoutExpired, OSError):
        # Catch only expected failures so Ctrl-C still interrupts the cell
        continue

# If still not found, read the task list from the error printed for an invalid task name
if TASK is None:
    try:
        help_result = subprocess.run(
            [sys.executable, "main.py", "--tasks", "invalid_task_for_help"],
            capture_output=True,
            text=True,
            timeout=5
        )
        # Look for cpp-related tasks in error message
        if "multiple-cpp" in help_result.stderr:
            TASK = "multiple-cpp"
            print(f"   ✅ Found task in help: {TASK}")
        elif "multiple-cljcpp" in help_result.stderr:
            TASK = "multiple-cljcpp"
            print(f"   ✅ Found task in help: {TASK}")
    except (subprocess.TimeoutExpired, OSError):
        pass

if TASK is None:
    TASK = "multiple-cpp"
    print(f"   ⚠️  Could not auto-detect, using default: {TASK}")

print(f"\n💡 Using task: {TASK}")


# Build the command. Note: unlike the previous cell, this command has no --prefix argument,
# so the 5-shot prefix read above is not passed to the harness in this run.
cmd = [
    sys.executable, "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--top_p", TOP_P,
    "--temperature", TEMPERATURE,
    "--do_sample", DO_SAMPLE,
    "--n_samples", N_SAMPLES,
    "--batch_size", BATCH_SIZE,
    "--max_length", MAX_LENGTH,
    "--max_length_generation", MAX_LENGTH_GENERATION,
    "--allow_code_execution",
    "--save_generations",
    "--seed", SEED
]

print(f"\n📝 Command: python main.py --model {MODEL} --tasks {TASK} ...")
print(f"   Prefix length: {len(prefix_content)} characters")
print(f"   Working directory: {os.getcwd()}")
print("\n⏳ Starting evaluation (this will take 30-60 minutes)...\n")

try:
    # Run the command and show output in real-time
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )

    # Print output line by line
    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode == 0:
        print("\n✅ C++ evaluation completed successfully!")
    else:
        print(f"\n❌ Evaluation failed with return code {process.returncode}")
        # returncode is an int, so it cannot be searched for text;
        # argparse exits with status 2 on a usage error such as an unknown task name
        if process.returncode == 2:
            print("\n💡 Checking available tasks...")
            try:
                help_result = subprocess.run(
                    [sys.executable, "main.py", "--tasks", "invalid_task_for_help"],
                    capture_output=True,
                    text=True,
                    timeout=5
                )
                # Extract task list from error message
                if "choose from" in help_result.stderr:
                    tasks_line = help_result.stderr.split("choose from")[-1]
                    cpp_tasks = [t for t in tasks_line.split(",") if "cpp" in t.lower()]
                    if cpp_tasks:
                        print(f"   Found C++ related tasks: {cpp_tasks}")
                        print(f"   Try using one of these instead of '{TASK}'")
            except (subprocess.TimeoutExpired, OSError):
                pass

except Exception as e:
    print(f"\n❌ Error running evaluation: {e}")
    import traceback
    traceback.print_exc()
    # If it's a task name error, provide helpful message
    if "invalid choice" in str(e).lower() or "not found" in str(e).lower():
        print("\n💡 Tip: The task name might be incorrect.")
        print("   Available dataset configs include: humaneval-cpp")
        print("   You may need to:")
        print("   1. Update bigcode-evaluation-harness: cd bigcode-evaluation-harness && git pull")
        print("   2. Or check available tasks: python main.py --help")
finally:
    os.chdir(original_dir)
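
The evaluation cells stream the harness output line by line but keep none of it, so any later check for messages such as "invalid choice" has only the return code to go on. A minimal sketch of a runner that echoes output and also collects it is given below. `run_streaming` is a hypothetical helper, not part of the notebook or the harness.

```python
import subprocess


def run_streaming(cmd):
    """Run cmd, echo its combined stdout/stderr line by line, return (returncode, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so error text is captured too
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as process:
        for line in process.stdout:
            print(line, end="")
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set here
    return process.returncode, lines
```

With the captured lines available, a condition such as `"invalid choice" in "".join(lines)` tests the actual error text instead of the numeric exit status.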


In [None]:
# Setup: Install Java Dependencies (Required for MultiPL-E Java evaluation)
# This cell installs javatuples JAR file needed for Java evaluation

import os
import subprocess
import platform

print("🔧 Setting up Java dependencies for MultiPL-E evaluation...")
print(f"   Platform: {platform.system()}")

# Check if javatuples JAR exists
javatuples_path = "/usr/multiple/javatuples-1.2.jar"
javatuples_exists = os.path.exists(javatuples_path)

if javatuples_exists:
    print(f"✅ javatuples JAR already exists at {javatuples_path}")
else:
    # Install javatuples (Linux/Colab only)
    if platform.system() == "Linux" or os.path.exists("/etc/debian_version"):
        print("📦 Installing javatuples JAR...")
        try:
            # Create directory
            result = subprocess.run(
                ["sudo", "mkdir", "-p", "/usr/multiple"],
                capture_output=True,
                text=True,
                timeout=10
            )

            # Download JAR
            result = subprocess.run(
                ["sudo", "wget", "https://repo1.maven.org/maven2/org/javatuples/javatuples/1.2/javatuples-1.2.jar",
                 "-O", "/usr/multiple/javatuples-1.2.jar"],
                capture_output=True,
                text=True,
                timeout=60
            )

            if result.returncode == 0:
                print("✅ javatuples JAR installed successfully")
                # Verify
                if os.path.exists(javatuples_path):
                    print(f"   Verified: {javatuples_path}")
            else:
                print(f"⚠️  Warning: wget failed: {result.stderr}")
        except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError) as e:
            print(f"⚠️  Could not install javatuples automatically: {e}")
            print("   Please install manually:")
            print("   sudo mkdir -p /usr/multiple")
            print("   sudo wget https://repo1.maven.org/maven2/org/javatuples/javatuples/1.2/javatuples-1.2.jar -O /usr/multiple/javatuples-1.2.jar")
    else:
        print("⚠️  javatuples JAR not found")
        print("   Please install manually for your platform")

if javatuples_exists or os.path.exists(javatuples_path):
    print("\n✅ Java dependencies setup complete! Ready for MultiPL-E Java evaluation.")
else:
    print("\n⚠️  Warning: javatuples JAR not available. MultiPL-E Java evaluation may fail.")
    print("   The evaluation will still attempt to run, but code execution may fail.")


In [None]:
# Setup: Install Node.js & TypeScript (Required for MultiPL-E JavaScript evaluation)
# This cell installs Node.js and TypeScript needed for JavaScript evaluation

import os
import subprocess
import platform

print("🔧 Setting up Node.js and TypeScript for MultiPL-E evaluation...")
print(f"   Platform: {platform.system()}")

# Check if Node.js is installed
node_installed = False
node_version = None

try:
    result = subprocess.run(
        ["node", "--version"],
        capture_output=True,
        text=True,
        timeout=5
    )
    if result.returncode == 0:
        node_installed = True
        node_version = result.stdout.strip()
        print(f"✅ Found Node.js: {node_version}")
except (FileNotFoundError, subprocess.TimeoutExpired):
    pass

# Check if TypeScript is installed
ts_installed = False
ts_version = None

if node_installed:
    try:
        result = subprocess.run(
            ["tsc", "--version"],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode == 0:
            ts_installed = True
            ts_version = result.stdout.strip()
            print(f"✅ Found TypeScript: {ts_version}")
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass

# Install Node.js and TypeScript if not found (Linux/Colab only)
if not node_installed or not ts_installed:
    if platform.system() == "Linux" or os.path.exists("/etc/debian_version"):
        if not node_installed:
            print("📦 Installing Node.js...")
            try:
                # Add NodeSource repository
                result = subprocess.run(
                    ["bash", "-c", "curl -fsSL https://deb.nodesource.com/setup_20.x | bash -"],
                    capture_output=True,
                    text=True,
                    timeout=60
                )

                # Install Node.js
                result = subprocess.run(
                    ["apt-get", "install", "-y", "-qq", "nodejs"],
                    capture_output=True,
                    text=True,
                    timeout=120
                )

                if result.returncode == 0:
                    print("✅ Node.js installed successfully")
                    # Verify
                    result = subprocess.run(["node", "--version"], capture_output=True, text=True, timeout=5)
                    if result.returncode == 0:
                        print(f"   Version: {result.stdout.strip()}")
                        node_installed = True
            except Exception as e:
                print(f"⚠️  Could not install Node.js automatically: {e}")

        if node_installed and not ts_installed:
            print("📦 Installing TypeScript...")
            try:
                result = subprocess.run(
                    ["npm", "install", "-g", "typescript"],
                    capture_output=True,
                    text=True,
                    timeout=120
                )
                if result.returncode == 0:
                    print("✅ TypeScript installed successfully")
                    # Verify
                    result = subprocess.run(["tsc", "--version"], capture_output=True, text=True, timeout=5)
                    if result.returncode == 0:
                        print(f"   Version: {result.stdout.strip()}")
                        ts_installed = True
            except Exception as e:
                print(f"⚠️  Could not install TypeScript automatically: {e}")
    else:
        print("⚠️  Node.js/TypeScript not found")
        print("   Please install manually:")
        print("   - Linux: curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && apt-get install -y nodejs && npm install -g typescript")
        print("   - macOS: brew install node && npm install -g typescript")
        print("   - Windows: Download from https://nodejs.org/")

if node_installed and ts_installed:
    print("\n✅ Node.js and TypeScript setup complete! Ready for MultiPL-E JavaScript evaluation.")
else:
    print("\n⚠️  Warning: Node.js or TypeScript not available. MultiPL-E JavaScript evaluation may fail.")
    print("   The evaluation will still attempt to run, but code execution may fail.")


In [None]:
# MultiPL-E Java Evaluation - Fixed Version
# This will automatically install bigcode-evaluation-harness and run the evaluation

import os
import subprocess
import sys

# Fixed parameters
MODEL = "Qwen/Qwen2.5-Coder-3B-Instruct"
TASK = "multiple-java"
TOP_P = "0.95"
TEMPERATURE = "0.2"
DO_SAMPLE = "True"
N_SAMPLES = "10"
BATCH_SIZE = "10"
MAX_LENGTH = "2048"
MAX_LENGTH_GENERATION = "2048"
SEED = "11667"

print("🚀 Evaluating MultiPL-E Java...")
print("   This will take ~30-60 minutes...")
print("   Using 5-shot prompt from MBPP selection")

# Read the 5-shot prefix
with open("prompts/mbpp_5shot.txt", "r", encoding="utf-8") as f:
    prefix_content = f.read()

# Check and install bigcode-evaluation-harness if needed
original_dir = os.getcwd()
harness_dir = "bigcode-evaluation-harness"

if not os.path.exists(harness_dir):
    print("\n📦 Installing bigcode-evaluation-harness...")
    print("   This will take a few minutes...")
    try:
        repo_url = "https://github.com/arthur900530/bigcode-evaluation-harness.git"
        result = subprocess.run(
            ["git", "clone", repo_url],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Cloned successfully")
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to clone: {e.stderr}")
        raise RuntimeError("Failed to install bigcode-evaluation-harness")
    except FileNotFoundError:
        print("❌ git not found. Please install git first.")
        raise RuntimeError("git is required but not found")

    # Install the package
    os.chdir(harness_dir)
    try:
        print("   Installing bigcode-evaluation-harness package...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-e", "."],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: pip install failed: {e.stderr}")
        print("   Continuing anyway...")

    os.chdir(original_dir)
else:
    print("✅ bigcode-evaluation-harness already exists")

# Save prefix to a file in bigcode-evaluation-harness directory
os.makedirs(f"{harness_dir}/prompts", exist_ok=True)
prefix_file = f"{harness_dir}/prompts/mbpp_5shot.txt"
with open(prefix_file, "w", encoding="utf-8") as f:
    f.write(prefix_content)

# Change to bigcode-evaluation-harness directory
os.chdir(harness_dir)

# Check if main.py exists
if not os.path.exists("main.py"):
    print(f"❌ Error: main.py not found in {os.getcwd()}")
    os.chdir(original_dir)
    raise FileNotFoundError(f"main.py not found in {harness_dir}")

print(f"\n✅ Found main.py at: {os.path.join(os.getcwd(), 'main.py')}")
print(f"\n💡 Using task: {TASK}")
print("💡 Running evaluation command...")
print("   This will take 30-60 minutes, please be patient...")

# Build the command and pass the 5-shot prefix content via --prefix
# (without this flag the prefix read above is never sent to the harness)
cmd = [
    sys.executable, "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--top_p", TOP_P,
    "--temperature", TEMPERATURE,
    "--do_sample", DO_SAMPLE,
    "--n_samples", N_SAMPLES,
    "--batch_size", BATCH_SIZE,
    "--max_length", MAX_LENGTH,
    "--max_length_generation", MAX_LENGTH_GENERATION,
    "--allow_code_execution",
    "--save_generations",
    "--seed", SEED,
    "--trust_remote_code",
    "--prefix", prefix_content,
]

print(f"\n📝 Command: python main.py --model {MODEL} --tasks {TASK} ...")
print(f"   Prefix length: {len(prefix_content)} characters")
print(f"   Working directory: {os.getcwd()}")
print("\n⏳ Starting evaluation (this will take 30-60 minutes)...\n")

try:
    # Run the command and show output in real-time
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )

    # Print output line by line
    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode == 0:
        print("\n✅ Java evaluation completed successfully!")
    else:
        print(f"\n❌ Evaluation failed with return code {process.returncode}")

except Exception as e:
    print(f"\n❌ Error running evaluation: {e}")
    import traceback
    traceback.print_exc()
finally:
    os.chdir(original_dir)


🚀 Evaluating MultiPL-E Java...
   This will take ~30-60 minutes...
   Using 5-shot prompt from MBPP selection
✅ bigcode-evaluation-harness already exists

✅ Found main.py at: /content/bigcode-evaluation-harness/bigcode-evaluation-harness/main.py

💡 Using task: multiple-java
💡 Running evaluation command...
   This will take 30-60 minutes, please be patient...

📝 Command: python main.py --model Qwen/Qwen2.5-Coder-3B-Instruct --tasks multiple-java ...
   Prefix length: 896 characters
   Working directory: /content/bigcode-evaluation-harness/bigcode-evaluation-harness

⏳ Starting evaluation (this will take 30-60 minutes)...

2025-12-03 17:27:26.618548: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764782846.640650   43366 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when 

In [None]:
# MultiPL-E JavaScript Evaluation - Fixed Version
# This will automatically install bigcode-evaluation-harness and run the evaluation

import os
import subprocess
import sys

# Fixed parameters
MODEL = "Qwen/Qwen2.5-Coder-3B-Instruct"
TASK = "multiple-js"
TOP_P = "0.95"
TEMPERATURE = "0.2"
DO_SAMPLE = "True"
N_SAMPLES = "10"
BATCH_SIZE = "10"
MAX_LENGTH = "2048"
MAX_LENGTH_GENERATION = "2048"
SEED = "11667"

print("🚀 Evaluating MultiPL-E JavaScript...")
print("   This will take ~30-60 minutes...")
print("   Using 5-shot prompt from MBPP selection")

# Read the 5-shot prefix
with open("prompts/mbpp_5shot.txt", "r", encoding="utf-8") as f:
    prefix_content = f.read()

# Check and install bigcode-evaluation-harness if needed
original_dir = os.getcwd()
harness_dir = "bigcode-evaluation-harness"

if not os.path.exists(harness_dir):
    print("\n📦 Installing bigcode-evaluation-harness...")
    print("   This will take a few minutes...")
    try:
        repo_url = "https://github.com/arthur900530/bigcode-evaluation-harness.git"
        result = subprocess.run(
            ["git", "clone", repo_url],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Cloned successfully")
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to clone: {e.stderr}")
        raise RuntimeError("Failed to install bigcode-evaluation-harness")
    except FileNotFoundError:
        print("❌ git not found. Please install git first.")
        raise RuntimeError("git is required but not found")

    # Install the package
    os.chdir(harness_dir)
    try:
        print("   Installing bigcode-evaluation-harness package...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-e", "."],
            capture_output=True,
            text=True,
            check=True
        )
        print("✅ Installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"⚠️  Warning: pip install failed: {e.stderr}")
        print("   Continuing anyway...")

    os.chdir(original_dir)
else:
    print("✅ bigcode-evaluation-harness already exists")

# Save prefix to a file in bigcode-evaluation-harness directory
os.makedirs(f"{harness_dir}/prompts", exist_ok=True)
prefix_file = f"{harness_dir}/prompts/mbpp_5shot.txt"
with open(prefix_file, "w", encoding="utf-8") as f:
    f.write(prefix_content)

# Change to bigcode-evaluation-harness directory
os.chdir(harness_dir)

# Check if main.py exists
if not os.path.exists("main.py"):
    print(f"❌ Error: main.py not found in {os.getcwd()}")
    os.chdir(original_dir)
    raise FileNotFoundError(f"main.py not found in {harness_dir}")

print(f"\n✅ Found main.py at: {os.path.join(os.getcwd(), 'main.py')}")
print(f"\n💡 Using task: {TASK}")
print("💡 Running evaluation command...")
print("   This will take 30-60 minutes, please be patient...")

# Build the command and pass the 5-shot prefix content via --prefix
# (without this flag the prefix read above is never sent to the harness)
cmd = [
    sys.executable, "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--top_p", TOP_P,
    "--temperature", TEMPERATURE,
    "--do_sample", DO_SAMPLE,
    "--n_samples", N_SAMPLES,
    "--batch_size", BATCH_SIZE,
    "--max_length", MAX_LENGTH,
    "--max_length_generation", MAX_LENGTH_GENERATION,
    "--allow_code_execution",
    "--save_generations",
    "--seed", SEED,
    "--prefix", prefix_content,
]

print(f"\n📝 Command: python main.py --model {MODEL} --tasks {TASK} ...")
print(f"   Prefix length: {len(prefix_content)} characters")
print(f"   Working directory: {os.getcwd()}")
print("\n⏳ Starting evaluation (this will take 30-60 minutes)...\n")

try:
    # Run the command and show output in real-time
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )

    # Print output line by line
    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode == 0:
        print("\n✅ JavaScript evaluation completed successfully!")
    else:
        print(f"\n❌ Evaluation failed with return code {process.returncode}")

except Exception as e:
    print(f"\n❌ Error running evaluation: {e}")
    import traceback
    traceback.print_exc()
finally:
    os.chdir(original_dir)


🚀 Evaluating MultiPL-E JavaScript...
   This will take ~30-60 minutes...
   Using 5-shot prompt from MBPP selection
✅ bigcode-evaluation-harness already exists

✅ Found main.py at: /content/bigcode-evaluation-harness/bigcode-evaluation-harness/main.py

💡 Using task: multiple-js
💡 Running evaluation command...
   This will take 30-60 minutes, please be patient...

📝 Command: python main.py --model Qwen/Qwen2.5-Coder-3B-Instruct --tasks multiple-js ...
   Prefix length: 896 characters
   Working directory: /content/bigcode-evaluation-harness/bigcode-evaluation-harness

⏳ Starting evaluation (this will take 30-60 minutes)...

2025-12-03 17:56:03.948100: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764784563.975425   73677 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN whe

KeyboardInterrupt: 
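
The Java and JavaScript cells build nearly identical commands and differ only in the task name and the `--trust_remote_code` flag. A minimal sketch of a shared builder follows, assuming the harness accepts the flags already used in this notebook, including the `--prefix` flag from the MBPP runs. The name `build_eval_cmd` and its parameters are hypothetical and are not part of bigcode-evaluation-harness.

```python
import sys


def build_eval_cmd(task, prefix="", model="Qwen/Qwen2.5-Coder-3B-Instruct",
                   n_samples=10, batch_size=10, seed=11667,
                   trust_remote_code=False):
    """Return the argument list for the harness's main.py (hypothetical helper)."""
    cmd = [
        sys.executable, "main.py",
        "--model", model,
        "--tasks", task,
        "--top_p", "0.95",
        "--temperature", "0.2",
        "--do_sample", "True",
        "--n_samples", str(n_samples),
        "--batch_size", str(batch_size),
        "--max_length", "2048",
        "--max_length_generation", "2048",
        "--allow_code_execution",
        "--save_generations",
        "--seed", str(seed),
    ]
    # Only forward the few-shot prefix when one was supplied.
    if prefix:
        cmd += ["--prefix", prefix]
    if trust_remote_code:
        cmd.append("--trust_remote_code")
    return cmd


# Placeholder prefix; in the notebook this would be the text of prompts/mbpp_5shot.txt.
java_cmd = build_eval_cmd("multiple-java", prefix="# few-shot examples\n",
                          trust_remote_code=True)
js_cmd = build_eval_cmd("multiple-js")
```

Passing the returned list to `subprocess.Popen` (as the cells above do) avoids shell quoting problems with a multi-line prefix.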

## Summary

- **Phase 1 (MBPP)**: sweep baseline / 0 / 1 / 3 / 5-shot with **pure-code prefixes**,  
  avoiding Markdown so that exec-based evaluation is stable.
- **Phase 2 (HumanEval)**: re-use the best-shot prompt function `best_prompt_fn` from Phase 1.

You can now:
1. Run the MBPP sweep cell to get the best configuration.
2. Uncomment the HumanEval cell and evaluate with the selected prompt.
3. Use the JSON files in `results/` directly in your LaTeX tables.
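
For step 3, one way to turn the result files into LaTeX table rows is sketched below. It assumes each file in `results/` is a flat JSON object with a numeric `pass@1` field in the range 0 to 1; that schema and the helper name `results_to_latex_rows` are assumptions, so adjust the key lookup to match the files this notebook actually writes.

```python
import json
from pathlib import Path


def results_to_latex_rows(results_dir="results", metric="pass@1"):
    """Return one LaTeX row per JSON file: '<file stem> & <metric in %> \\\\'."""
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Skip files that do not follow the assumed flat schema.
        if not isinstance(data, dict) or metric not in data:
            continue
        name = path.stem.replace("_", r"\_")  # escape underscores for LaTeX
        rows.append(f"{name} & {100 * data[metric]:.1f} \\\\")
    return rows
```

Calling `print("\n".join(results_to_latex_rows()))` emits rows that can be pasted between `\toprule` and `\bottomrule` in a `tabular` environment.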