<a href="https://colab.research.google.com/github/PioneerAlexander/Leveraging-software-evolution-data-with-LLMs/blob/main/Refact-1_6B-eval-final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Firstly, check that the encoding is correct.

In [3]:
import locale
assert(locale.getpreferredencoding()=="UTF-8")

## Install the required packages


In [4]:
!pip install datasets
!pip install timeout_decorator

Collecting datasets
  Downloading datasets-2.16.0-py3-none-any.whl (507 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6
Collecting timeout_decorator
  Downloading timeout-decorator-0.5.0.tar.gz (4.8 kB)
  Prep

## Load dataset from Hugging Face - we need Python language

In [5]:
from datasets import load_dataset
ds = load_dataset("bigcode/humanevalpack", "python")['test']

Downloading data:   0%|          | 0.00/199k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/164 [00:00<?, ? examples/s]

## Fix dataset: append the missed imports for tasks.

In [6]:
new_ds = []
for task in ds:

  task["import"] = "from typing import List, Tuple, Optional, Any, Callable\n"
  if task["task_id"] == 'Python/32':  # In the task 32 the function poly misses completely in the code which is needed to fix
    task["import"] += '''import math\ndef poly(xs: list, x: float):
    """
    Evaluates polynomial with coefficients xs at point x.
    return xs[0] + xs[1] * x + xs[1] * x^2 + .... xs[n] * x^n
    """
    return sum([coeff * math.pow(x, i) for i, coeff in enumerate(xs)])'''
  new_ds.append(task)

print(new_ds[32]["import"])


from typing import List, Tuple, Optional, Any, Callable
import math
def poly(xs: list, x: float):
    """
    Evaluates polynomial with coefficients xs at point x.
    return xs[0] + xs[1] * x + xs[1] * x^2 + .... xs[n] * x^n
    """
    return sum([coeff * math.pow(x, i) for i, coeff in enumerate(xs)])


## Load the model.
Warning: here we load with cuda. If your GPU access in colab is exceeded, change device to 'cpu'

In [None]:
# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

tokenizer_config.json:   0%|          | 0.00/717 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/532 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]

configuration_gpt_refact.py:   0%|          | 0.00/1.89k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- configuration_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_gpt_refact.py:   0%|          | 0.00/24.9k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- modeling_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/3.17G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

## Prompt template to work with the model in [Chat format](https://huggingface.co/smallcloudai/Refact-1_6B-fim#chat-format)

In [None]:
prompt_template = "<empty_output>SYSTEM {system}\n" \
                  "<empty_output>USER {query}\n" \
                  "<empty_output>ASSISTANT\n"

## Generate prompt function

As the parameters we pass there are:


*   *system*, which is the string describing for the model on the highest level what she should do. While dealing with **HumanEvalFix** and **Refact-1_6B** this parameter is an empty string, but you could experiment for you choice with this parameter.
*   *task* is the task for which we generate prompt. From this task we get the function signature, imports, solution we need to fix, unit tests and docstring.
*   *with_tests* is the option to run the task in the regime NL+C->C, where C is a buggy solution with unit tests, NL is natural language, here "Fix the bugs in..."
*   *with_docstring* is the option to add the strong hint - docstring -  to the model. Here the experiment is runned without this parameter.




In [None]:
def generate_prompt(system, task, with_tests=False, with_docstring=False):
    query_template="Question: Fix the bugs in {entry_point}: \n{import_desc} \ndef {signature}: " \
    "\n {docstring}\n{buggy_solution} {test}\nAnswer:\n{import_desc}\ndef {signature}: \n"
    return prompt_template.format(system=system,
                                  query=query_template.format(entry_point=task['entry_point'],
                                                              import_desc=task['import'],
                                                              signature=task['signature'],
                                                              buggy_solution=task['buggy_solution'],
                                                              test=task['test'] if with_tests else "",
                                                              docstring="'''" + task['docstring'] + "'''" if with_docstring else ""))

## Parse the answer

The following function parses the model answer which goes after ASSISTANT. The code it returns will then be checked by the *exec* function.

In [None]:
import re

def get_model_answer(text: str) -> str:
  pattern = r'<empty_output>ASSISTANT\n(.*?)<empty_output>'

  match = re.search(pattern, text, re.DOTALL)

  return match.group(1) if match else ""

## Set timeout to the output generating and execution

How to deal with the infinite cycles, if we execute the solution provided by the model? Or other unexpected behaviour which will make our run much more time-consuming? I solved this problem by using special decorator which checks the time execution of the function. You can change the limit by your choice in the decorator parameter (it is in seconds). If the function executes to long, it would raise the TimeoutError.  

In [None]:
from timeout_decorator import timeout

@timeout(60)
def execute_tests(text: str) -> None:
  exec(text)
  locals().clear() # clear local variable values after execution

In [None]:
@timeout(180)
def generate_output(inputs_model):
  return model.generate(inputs_model, max_length=15000, temperature=0.2, pad_token_id=tokenizer.eos_token_id) # hyperparameter temperature could be fine-tuned


In [None]:
false_generated_tasks = {} # here we store the task which were failed, and the respective exception.

## Is the solution correct?

In the paper all the result is checked by the zero-shot pass metric (1@pass). More about k@pass metric in [this paper](https://arxiv.org/pdf/2107.03374.pdf?trk=public_post_comment-text).

The core idea in the zero-shot pass metric is just check one time whether the generated solution passes the unit tests. We check this using the following function, store all exceptions.

In [None]:
def did_task_pass_unit_tests(task):
  inputs_model = tokenizer.encode(generate_prompt("", task, with_tests=True), return_tensors="pt").to(device) # generate input
  try:
    outputs_model = generate_output(inputs_model) # generate output
  except:
    false_generated_tasks[task["task_id"]] = TimeoutError
    return 0
  code_with_tests = get_model_answer(tokenizer.decode(outputs_model[0])) # answer of the model which is checked

  try:
    execute_tests(code_with_tests) # execution
    return 1 # if all tests were executed, return 1
  except Exception as e:
    false_generated_tasks[task["task_id"]] = e
    return 0

# 1@pass metric

For each task in the provided dataset sample (ds_sample variable) we check if the solution has passed unit tests or not. If yes, correct_number is increased by 1. The metric value is just a fraction of the correct answers.

In [None]:
from tqdm import tqdm

def calculate_zero_shot_pass_metric(ds_sample):
    correct_number = 0

    total_number = len(ds_sample)


    for task in tqdm(ds_sample, desc="Progress: "): # representative running process implemented with tqdm
      correct_number += did_task_pass_unit_tests(task)

    return correct_number / total_number


## Split the dataset by the categories
**HumanEvalFix** is the most challenging task
for most models. They commonly regenerate the buggy function without making any change (e.g. WizardCoder) or they introduce new bugs (e.g. GPT-4)

We analyze model performance by bug type (categories: "function misuse", "variable misuse", "operator misuse", "excess logic", "missing logic", "value misuse"). Misuse bug types could be together classified as "wrong logic"

In [None]:
splitted_by_categories = {}
categories = ["function misuse", "variable misuse", "operator misuse", "excess logic", "missing logic", "value misuse"]

for category in categories:
  splitted_by_categories[category] = []

for task in new_ds:
  splitted_by_categories[task["bug_type"]].append(task)

## Evaluate all tasks from the category


In [None]:
def evaluate_the_category(category_name):
  print("Starting evalution of tasks with bug type {category_name}:".format(category_name=category_name))
  zero_shot_pass_metric_result = calculate_zero_shot_pass_metric(splitted_by_categories[category_name])
  print("Result for {category_name}: {result}".format(category_name=category_name, result=zero_shot_pass_metric_result))
  return zero_shot_pass_metric_result


In [None]:
evaluate_the_category("excess logic")

Starting evalution of tasks with bug type excess logic:


Progress: 100%|██████████| 31/31 [12:02<00:00, 23.30s/it]

Result for excess logic: 0.06451612903225806





0.06451612903225806

In [None]:
evaluate_the_category("missing logic")

Starting evalution of tasks with bug type missing logic:


Progress: 100%|██████████| 33/33 [10:15<00:00, 18.64s/it]

Result for missing logic: 0.030303030303030304





0.030303030303030304

In [None]:
evaluate_the_category("value misuse")

Starting evalution of tasks with bug type value misuse:


Progress: 100%|██████████| 44/44 [14:43<00:00, 20.09s/it]

Result for value misuse: 0.13636363636363635





0.13636363636363635

In [None]:
for category in ["function misuse", "variable misuse", "operator misuse"]:
  evaluate_the_category(category)

Starting evalution of tasks with bug type function misuse:


Progress: 100%|██████████| 8/8 [01:12<00:00,  9.01s/it]


Result for function misuse: 0.25
Starting evalution of tasks with bug type variable misuse:


Progress: 100%|██████████| 23/23 [06:46<00:00, 17.67s/it]


Result for variable misuse: 0.2608695652173913
Starting evalution of tasks with bug type operator misuse:


Progress: 100%|██████████| 25/25 [04:06<00:00,  9.88s/it]

Result for operator misuse: 0.16





We have calculated the metric values for each bug type. Let's observe the received results. We see that (in my implementation) the "missing logic" turns out to be the most challenging, but in general let's notice that "missing logic" and "excess logic" bugs the model had changed only 3 times out of 64 tasks, which is really poor performance.

Because of it on the whole dataset the function has (2 + 1 + 6 + 2 + 6 + 4) / 164 = **0.128...**, which is less than it stays in hugging face experiments:

pass@1 (T=0.2) on HumanEvalFixTests Python
self-reported
**18.380**

But if calculate the statistic among only *"wrong logic"* bug type, it would be (6 + 2 + 4 + 6) / 100 = **0.18**, which is near the value they had presented.