## Reproduction of the evaluation of OctoCoder on HumanEvalFix(tests, python)

To reproduce the results it is always important to do everything exactly as the authors did. So let's literally use their code:

In [3]:
# this is actually a repo that does evaluation
!git clone -b octopack https://github.com/bigcode-project/bigcode-evaluation-harness
%cd bigcode-evaluation-harness
!pip install -q -r requirements.txt
!pip install -e .

In [4]:
!huggingface-cli login
!accelerate config

Interestingly, the authors never mention bf16/fp16 in the paper, but the official evaluation script they provide uses bf16: [original script](https://github.com/bigcode-project/octopack/blob/main/evaluation/run/eval_scripts/octocoder/eval_octocoder_humanevalfix.sh)

I am using the [OctoCoder](https://huggingface.co/bigcode/octocoder) model with the same parameters (including bf16 precision and a batch size of 5).

In [21]:
!accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalfixtests-python \
--do_sample True \
--n_samples 20 \
--temperature 0.2 \
--allow_code_execution \
--trust_remote_code \
--prompt octocoder \
--max_length_generation 2048 \
--precision bf16 \
--batch_size 5 \
--save_generations \
--save_generations_path OC_generations_same.json \
--metric_output_path OC_metrics_1.json

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Selected Tasks: ['humanevalfixtests-python']
Loading tokenizer and model (in bf16)
config.json: 100%|█████████████████████████| 1.01k/1.01k [00:00<00:00, 3.55MB/s]
model.safetensors.index.json: 100%|████████| 38.2k/38.2k [00:00<00:00, 1.43MB/s]
Downloading shards:   0%|                                 | 0/7 [00:00<?, ?it/s]
Downloading shards: 100%|████████████████████████| 7/7 [25:18<00:00, 216.97s/it]
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:32<00:00,  4.59s/it]
generation_config.json: 100%|███████████████████| 116/116 [00:00<00:00, 492kB/s]
tokenizer_config.json: 100%|███████████████████| 717/717 [00:00<00:00, 2.61MB/s]
vocab.json: 100%|████████████████████████████| 777k/777k [00:00<00:00, 12.5MB/s]
merges.txt: 100%|████████████████████████████| 442k/442k [00:00<00:00, 4.88MB/s]

Here we see **"pass@1": 0.31432926829268293**, which is relatively close to the authors' result with these settings (**0.304**).

This means our open-source pretrained model fixes the bug with a single generation in almost 1/3 of the cases!
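
For context, pass@1 here is estimated from the 20 samples generated per problem. Below is a minimal sketch of the unbiased pass@k estimator from the Codex paper, which, as far as I can tell, is what the harness's code-eval metric computes. The function name `pass_at_k` and the toy counts are my own and not taken from the harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of which passed."""
    if n - c < k:
        # fewer than k failing samples: every k-subset contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# toy example: 20 samples for one problem, 5 of them pass the tests
print(pass_at_k(n=20, c=5, k=1))   # 0.25, i.e. c / n when k = 1
print(pass_at_k(n=20, c=5, k=10))  # chance that at least one of 10 samples passes
```

For k = 1 the estimate is simply the fraction of passing samples, averaged over problems. Assuming 164 problems with 20 samples each, the reported 0.3143... would correspond to 1031 passing generations out of 3280.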

## Getting deeper & some Qualitative Analysis

To be honest, I was mostly interested in the details (prompts, pass@k metrics, actually running the scripts and fixing code), so I decided to spend more time on that than on the data exploration part in the other notebook. So let's go a little deeper and try to:
1) Look at correct and wrong results
2) Run inference ourselves
3) Run automatic tests

### Part 1. Looking at the results

In [1]:
# %cd bigcode-evaluation-harness

/home/bigcode-evaluation-harness


In [2]:
### load dataset and our generations
import datasets
import json

# there is only test set
dataset = datasets.load_dataset("bigcode/humanevalpack", "python")['test']

with open("../OC_generations_same.json", "r") as f:
    generations = json.load(f)
    
with open("../logs.json", "r") as f:
    logs = json.load(f)

In [3]:
### function to print code
from IPython.display import Code as print_code
# print_code(dataset[1]['old_contents'], language='python')

### let's start with the first sample
IDX = 0

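Unfortunately, this pretty printing only works once per cell: the `Code` object is rendered automatically only when it is the last expression of the cell (unless it is passed to `display` explicitly), so below I simply use one call per cell.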

In [4]:
print("Buggy input")
print_code(dataset[IDX]["prompt"] + dataset[IDX]["buggy_solution"], language='python')

Buggy input


In [5]:
print("Solution")
print_code(dataset[IDX]["prompt"] + dataset[IDX]["canonical_solution"], language='python')

Solution


#### The solution is to add ```abs```

In [6]:
print("Generation")
print("passed?", logs[str(IDX)][0][1]['passed'])
print_code(generations[IDX][0], language='python')

Generation
passed? False


We can actually see that this sample passes on the **second** generation. And there is indeed a solution with ```abs```:

In [7]:
print("Generation")
print("passed?", logs[str(IDX)][1][1]['passed'])
print_code(generations[IDX][1], language='python')

Generation
passed? True
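
Instead of checking generations one at a time, the pass flags can be collected per task. This is a minimal sketch that assumes `logs` has the layout used in the cells above (`logs[str(task_idx)]` is a list of `[generation_index, result_dict]` pairs). The helper names and `toy_logs` are made up for illustration.

```python
def passed_flags(logs: dict, task_idx: int) -> list:
    """Return the pass/fail flag of every generation for one task."""
    # each entry is assumed to be a [generation_index, result_dict] pair
    return [result["passed"] for _, result in logs[str(task_idx)]]


def first_passing(logs: dict, task_idx: int):
    """Index of the first passing generation, or None if none passed."""
    flags = passed_flags(logs, task_idx)
    return flags.index(True) if True in flags else None


# made-up stand-in for logs.json with the same nesting as used above
toy_logs = {"0": [[0, {"passed": False}], [1, {"passed": True}], [2, {"passed": True}]]}

print(passed_flags(toy_logs, 0))
print(first_passing(toy_logs, 0))
```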


#### Let's see another one:

In [8]:
IDX = 120  # e.g. random.randrange(len(dataset))

In [9]:
print("Buggy input")
print_code(dataset[IDX]["prompt"] + dataset[IDX]["buggy_solution"], language='python')

Buggy input


In [10]:
print("Solution")
print_code(dataset[IDX]["prompt"] + dataset[IDX]["canonical_solution"], language='python')

Solution


Here the error is in double reverse sorting. Let's see if we passed:

In [11]:
print("Generation")
print("passed?", logs[str(IDX)][0][1]['passed'])
# show the same generation whose pass flag is printed above
print_code(generations[IDX][0], language='python')

Generation
passed? True


I think that's too many comments for such a small, simple function, but on the bright side, **we fixed it!**

### Part 2. Let's try inference

It's not really clear how the model is used and how it works, so I believe it's worth running inference on the model ourselves and seeing what happens.

We could obviously use the same script as the authors, but we have already done that (with the bash command above), so here I will try to reproduce the prompt from the paper myself.

In [203]:
###
func_name = "has_close_elements"


###
func = '''from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False'''


###
func_start = '''from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
'''

In [204]:
def create_sample_and_print(func_name, func, func_start):
    whole_sample = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:

    Fix bugs in {func_name}.

    {func}

    ### Response:

    {func_start}'''
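    # inside a function the Code object is not rendered automatically, so display it explicitly
    from IPython.display import display
    display(print_code(whole_sample, language='python'))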
    return whole_sample


whole_sample = create_sample_and_print(func_name, func, func_start)
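
One detail worth checking in a template like this: a triple-quoted string written inside an indented function body keeps that indentation, so the template lines, and the first line of each substituted block, start with four spaces. The sketch below makes this visible with a shortened, made-up template (`build_indented`, `build_flat` and `toy_func` are hypothetical names). Whether the model is sensitive to this whitespace was not checked here.

```python
def build_indented(func_name: str, func: str, func_start: str) -> str:
    # template written inside an indented block, as in the cell above
    return f'''Header line.

    ### Instruction:

    Fix bugs in {func_name}.

    {func}

    ### Response:

    {func_start}'''


def build_flat(func_name: str, func: str, func_start: str) -> str:
    # same pieces joined explicitly, so no template line starts with spaces
    return "\n\n".join([
        "Header line.",
        "### Instruction:",
        f"Fix bugs in {func_name}.",
        func,
        "### Response:",
        func_start,
    ])


toy_func = "def f(x):\n    return x"
indented = build_indented("f", toy_func, "def f(x):\n")
flat = build_flat("f", toy_func, "def f(x):\n")

# repr makes the leading spaces visible
for line in indented.splitlines():
    print(repr(line))
print("---")
for line in flat.splitlines():
    print(repr(line))
```

In the indented variant only the first line of the substituted function gets the extra four spaces, so the code inside the prompt ends up inconsistently indented.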

Now let's run inference with the model:

In [14]:
#!pip install -q transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/octocoder"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# load the weights directly in bf16 instead of loading in fp32 and casting afterwards
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)

inputs = tokenizer.encode_plus(whole_sample, return_attention_mask=True, return_tensors="pt").to(device)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Generate and look at the code:

In [226]:
def generate_code(model, tokenizer, inputs):
    # note: with max_length=2048 the model starts to repeat the code after its answer
    # (I guess the harness strips that); lowering max_length to e.g. 512 limits the repetition
    outputs = model.generate(
        input_ids=inputs['input_ids'], 
        attention_mask=inputs['attention_mask'],
        max_length=2048, 
        do_sample=True, 
        temperature=0.2
    )
    # decode only the newly generated tokens, i.e. everything after the prompt
    prompt_len = inputs['input_ids'].shape[1]
    decoded = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return decoded
    
decoded = generate_code(model, tokenizer, inputs)
# add the start of the function
decoded = func_start + decoded
print_code(decoded, language="python")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
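
The comment in `generate_code` guesses that the harness removes the repeated code. One common way to do that is to cut the generation at the first stop sequence. This is a minimal sketch of the idea; `truncate_at_stop` and the stop list are hypothetical and not taken from the harness.

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        pos = text.find(stop)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut]


# hypothetical stop sequences, chosen for illustration only
STOP = ["\ndef ", "\nclass ", "\nprint(", "\n### "]

generated = "    return abs(a - b) < t\n\ndef helper():\n    pass\n"
print(repr(truncate_at_stop(generated, STOP)))
```

Everything from the start of the second, unrelated function onwards is dropped, while a generation without any stop sequence is returned unchanged.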


And it indeed solves this one! There is an ```abs```.

Let's now try some automatic testing in Python.

### Part 3. Running some automatic tests (pass@k)

Some shenanigans with the branches of the repo (because I only found the eval script in main, not in the OctoCoder branch):

In [67]:
%cd ..

/home


In [69]:
!yes | rm -r bigcode-evaluation-harness
!git clone -b main https://github.com/bigcode-project/bigcode-evaluation-harness
!pip install -q evaluate
!pip install -q datasets
%cd bigcode-evaluation-harness

In [70]:
from bigcode_eval.tasks.custom_metrics.code_eval import compute_code_eval
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

Now we are ready!

In [228]:
predictions = [[decoded]]
references = ["""assert has_close_elements([1.0, 2.0, 3.0], 0.5)==False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)==True"""]
k = [1]

pass_at_k, results = compute_code_eval(references=references, predictions=predictions, k=k)
pass_at_k

{'pass@1': 1.0}

Hurray!
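
Conceptually, the check boils down to executing the candidate together with the asserts. Here is a stripped-down sketch of that idea. Unlike `compute_code_eval`, which as far as I understand runs each program in a separate process with a timeout, this version uses plain `exec` with no sandboxing, so it should only be run on trusted code. `passes_tests` is a hypothetical helper, and the `fixed` function is hand-written rather than model output.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Run candidate and tests in a fresh namespace; True if nothing raises."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the function
        exec(test_src, namespace)       # run the asserts against it
    except Exception:
        return False
    return True


fixed = '''
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
'''
# reintroduce the bug from the sample above: the distance without abs
buggy = fixed.replace("abs(a - b)", "a - b")

tests = '''
assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True
'''

print(passes_tests(fixed, tests), passes_tests(buggy, tests))
```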

#### My own example

So now we can try it on our own functions! I will try one. The bug here is using >= instead of >. Let's see if the model can fix it.

In [187]:
###
func_name = "has_subsequence_with_sum_greater_than"


###
func = '''from typing import List
from itertools import combinations

def has_subsequence_with_sum_greater_than(numbers: List[float], value: float) -> bool:
    """
    Check if in the given list of numbers, there is any subsequence of any length whose sum is greater than the specified value.
    >>> has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 5.0)
    True
    >>> has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 10.0)
    False
    """
    for length in range(1, len(numbers) + 1):
        for subset in combinations(numbers, length):
            if sum(subset) >= value:
                return True
    return False'''


###
func_start = '''from typing import List
from itertools import combinations

def has_subsequence_with_sum_greater_than(numbers: List[float], value: float) -> bool:'''

In [189]:
# create whole prompt+sample
whole_sample = create_sample_and_print(func_name, func, func_start)
# tokenize
inputs = tokenizer.encode_plus(whole_sample, return_attention_mask=True, return_tensors="pt").to(device)
# generate answer
decoded = generate_code(model, tokenizer, inputs)
# add the start of the function
decoded = func_start + decoded
# print
print_code(decoded, language="python")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Looks mostly good, but it still uses >=. Let's check pass@1 with the two test cases from the docstring plus one additional case that the buggy version should fail:

In [190]:
predictions = [[decoded]]
references = ["""assert has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 5.0)==True
assert has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 10.0)==False
assert has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 6.0)==False"""]

pass_at_k, results = compute_code_eval(references=references, predictions=predictions, k=[1,2])
pass_at_k

{'pass@1': 0.0}
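
To see which assertion is responsible, the buggy comparison and the intended one can be run side by side on the three cases. This is a small sketch with both variants written by hand (`has_sum_ge` and `has_sum_gt` are hypothetical names, not model output).

```python
from itertools import combinations


def has_sum_ge(numbers, value):
    # comparison as in the buggy function: >=
    return any(sum(s) >= value
               for n in range(1, len(numbers) + 1)
               for s in combinations(numbers, n))


def has_sum_gt(numbers, value):
    # intended comparison: strictly greater
    return any(sum(s) > value
               for n in range(1, len(numbers) + 1)
               for s in combinations(numbers, n))


cases = [([1.0, 2.0, 3.0], 5.0, True),
         ([1.0, 2.0, 3.0], 10.0, False),
         ([1.0, 2.0, 3.0], 6.0, False)]

# per case: threshold, does >= match the expectation, does > match it
for numbers, value, expected in cases:
    print(value,
          has_sum_ge(numbers, value) == expected,
          has_sum_gt(numbers, value) == expected)
```

Only the third case separates the two: the full list sums to exactly 6.0, so `>=` returns True where False is expected.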

Since it does not pass all the tests, the function counts as incorrect, and since we have only one prediction, pass@1 is 0. That feels a bit harsh, given that it does pass the two tests from the docstring. Let's add an example to the docstring that covers exactly our case. We only change ```func```:

In [191]:
func = '''from typing import List
from itertools import combinations

def has_subsequence_with_sum_greater_than(numbers: List[float], value: float) -> bool:
    """
    Check if in the given list of numbers, there is any subsequence of any length whose sum is strictly greater than the specified value.
    >>> has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 5.0)
    True
    >>> has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 6.0)
    False
    >>> has_subsequence_with_sum_greater_than([1.0, 2.0, 3.0], 10.0)
    False
    """
    for length in range(1, len(numbers) + 1):
        for subset in combinations(numbers, length):
            if sum(subset) >= value:
                return True
    return False'''

In [194]:
# create whole prompt+sample
whole_sample = create_sample_and_print(func_name, func, func_start)
# tokenize
inputs = tokenizer.encode_plus(whole_sample, return_attention_mask=True, return_tensors="pt").to(device)
# generate answer
decoded = generate_code(model, tokenizer, inputs)
# add the start of the function
decoded = func_start + decoded
# print
print_code(decoded, language="python")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [195]:
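# rebuild predictions from the new generation, otherwise the previous one is evaluated again
predictions = [[decoded]]
pass_at_k, results = compute_code_eval(references=references, predictions=predictions, k=[1,2])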
pass_at_k

{'pass@1': 0.0}

I tried 10 different prompts (adding "strictly", extra examples, different formats and so on), but unfortunately the model never generated > instead of >=. Maybe I have a mistake somewhere in the prompt format, or maybe there were no samples using ```combinations``` in the training set, haha.