# Autopilot training from scratch using Hugging Face Trainer (PyTorch)

This notebook shows how to train a causal language model, based on a pretrained foundation model architecture, on a custom dataset such as Python code. We use the standard functions from the Hugging Face libraries; a separate notebook covers customized loss functions, optimizers, and a custom training loop. We push the trained model to the Hub and use it for several code generation examples. Finally, we evaluate the model's performance on completions sampled for the HumanEval dataset. This notebook contains setup instructions for training on either a GPU or a TPU.

# Installation and setup

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
import torch
print(torch.__version__)

1.10.0+cu111


In [None]:
#@title Installation for GPU training
!pip install datasets transformers[sentencepiece]
!apt install git-lfs

In [None]:
#@title Installation for TPU training

!pip install datasets transformers[sentencepiece]
!apt install git-lfs
!pip uninstall -y torch
!pip install torch==1.8.2+cpu torchvision==0.9.2+cpu -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html

You will need to set up git. Adapt the email and name in the following cell.

In [None]:
!git config --global user.email "your_email@example.com"
!git config --global user.name "YourName"

Store git credentials

In [3]:
!git config --global credential.helper store

## Hugging Face login
Log in to the Hugging Face Hub. Run the following cell and enter your credentials.

In [4]:
from huggingface_hub import notebook_login

notebook_login()



Login successful
Your token has been saved to /root/.huggingface/token


# Prepare the dataset

## Filter the dataset for specific libraries

In [9]:
def any_keyword_in_string(string, keywords):
    for keyword in keywords:
        if keyword in string:
            return True
    return False

In [10]:
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import matplotlib.pyplot as plt"
example_2 = "import datetime"

print(
    any_keyword_in_string(example_1, filters), any_keyword_in_string(example_2, filters)
)

True False


In [12]:
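from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm


def filter_streaming_dataset(dataset, filters):
    # Collect matching samples column-wise so they can be turned into a Dataset.
    filtered_dict = defaultdict(list)
    total = 0
    for sample in tqdm(iter(dataset)):
        total += 1
        if any_keyword_in_string(sample["content"], filters):
            for k, v in sample.items():
                filtered_dict[k].append(v)
    print(f"{len(filtered_dict['content'])/total:.2%} of data after filtering.")
    # Build an in-memory Dataset from the collected columns.
    return Dataset.from_dict(filtered_dict)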

Filtering the full dataset can take a long time. The following sections load already cleaned and sampled datasets, which is much quicker.

In [None]:
from datasets import load_dataset
from collections import defaultdict
from tqdm import tqdm

split = "train"  # "valid"
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]

data = load_dataset(f"transformersbook/codeparrot-{split}", split=split, streaming=True)
filtered_data = filter_streaming_dataset(data, filters)
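
The filtering loop can be tried without downloading anything. The following is a minimal sketch that applies the same column-wise accumulation to a small in-memory list. The `samples` list and the `filter_samples` name are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict


def any_keyword_in_string(string, keywords):
    """Return True if any keyword occurs as a substring of string."""
    return any(keyword in string for keyword in keywords)


def filter_samples(samples, keywords):
    """Keep samples whose "content" mentions a keyword, stored column-wise."""
    filtered = defaultdict(list)
    total = 0
    for sample in samples:
        total += 1
        if any_keyword_in_string(sample["content"], keywords):
            for key, value in sample.items():
                filtered[key].append(value)
    kept = len(filtered["content"])
    return dict(filtered), kept / total


# Hypothetical samples, standing in for rows of the streamed dataset.
samples = [
    {"path": "a.py", "content": "import pandas as pd"},
    {"path": "b.py", "content": "import datetime"},
    {"path": "c.py", "content": "from sklearn import svm"},
]
filtered, fraction = filter_samples(samples, ["pandas", "sklearn", "matplotlib", "seaborn"])
print(filtered["path"], f"{fraction:.2%}")
```

Note that the match is a plain substring test, so a file that only mentions `pandas` in a comment is kept as well.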

## Load fully cleaned dataset

In [None]:
from datasets import load_dataset, DatasetDict

ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split="train")
ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split="validation")

raw_datasets = DatasetDict(
    {
        "train": ds_train,  # to sample, append: .shuffle(28).select(range(50000))
        "valid": ds_valid,  # to sample, append: .shuffle(28).select(range(500))
    }
)

raw_datasets

To push a sampled dataset to the Hub, run the following cell.

In [None]:
raw_datasets['train'].push_to_hub('autopilot-sampled-train')
raw_datasets['valid'].push_to_hub('autopilot-sampled-valid')

## Load sampled dataset

This cell loads the sampled datasets (50,000 training files and 500 validation files) drawn from 'huggingface-course/codeparrot-ds-train' and 'huggingface-course/codeparrot-ds-valid'.

In [5]:
from datasets import load_dataset, DatasetDict

ds_train = load_dataset("Pavithra/autopilot-sampled50k-train", split="train")
ds_valid = load_dataset("Pavithra/autopilot-sampled50k-valid", split="validation")

raw_datasets = DatasetDict(
    {
        "train": ds_train, 
        "valid": ds_valid
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 50000
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 500
    })
})

In [17]:
print("The number of files sampled for the training set ", len(raw_datasets['train']['repo_name']))

The number of files sampled for the training set  50000


Inspect one example from the training set:

In [None]:
for key in raw_datasets["train"][10]:
    print(f"{key.upper()}: {raw_datasets['train'][10][key][:2000]}")

REPO_NAME: rtrwalker/geotecha
PATH: geotecha/consolidation/xieandleo2004.py
COPIES: 1
SIZE: 32571
CONTENT: # geotecha - A software suite for geotechncial engineering
# Copyright (C) 2018  Rohan T. Walker (rtrwalker@gmail.com)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see http://www.gnu.org/licenses/gpl.html.

"""
Xie and Leo (2004) "Analytical solutions of one-dimensional large strain
consolidation of saturated and homogeneous clays".

""

# Tokenize the dataset

Chunk the input sequences into pieces of at most `context_length` tokens. We use a pretrained tokenizer here.

In [6]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")


Input IDs length: 43
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41, 128, 128, 128, 128, 128, 128, 128, 128, 26]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
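
The chunk lengths above (runs of 128 followed by one shorter remainder per file) can be reproduced with plain list slicing. This is only a sketch of the chunking pattern, not the tokenizer's implementation. The `chunk_token_ids` helper and the ten-token document are hypothetical, and special tokens are ignored.

```python
def chunk_token_ids(token_ids, context_length):
    """Split a token id list into consecutive pieces of at most context_length."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]


# Hypothetical document of 10 token ids, chunked with a context of 4.
token_ids = list(range(10))
chunks = chunk_token_ids(token_ids, context_length=4)
lengths = [len(chunk) for chunk in chunks]
print(lengths)

# Keeping only full-length chunks drops the short tail.
full_chunks = [chunk for chunk in chunks if len(chunk) == 4]
print(len(full_chunks))
```

Dropping the tail keeps every training example the same length, at the cost of discarding the last few tokens of each file.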


In [7]:
# Chunks shorter than context_length are dropped here. With long contexts or
# short files, concatenate the tokenized samples before chunking instead.
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

  0%|          | 0/50 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 1377812
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 13133
    })
})

# Load the model

Two model architectures are available for training. The cells below build each model from its configuration only, so the weights are randomly initialized.
* GPT-2 small
* GPT-Neo small

## GPT-2 small model

In [9]:
from transformers import GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [10]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.2M parameters


## GPT-Neo small model

In [None]:
from transformers import AutoTokenizer, GPTNeoForCausalLM, AutoConfig

config = AutoConfig.from_pretrained(
    "EleutherAI/gpt-neo-125M",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [None]:
model = GPTNeoForCausalLM(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-Neo size: {model_size/1000**2:.1f}M parameters")

GPT-Neo size: 125.0M parameters


## Set up the data collator

We need to set up a data collator that takes care of creating the batches. We can use the `DataCollatorForLanguageModeling` collator, which is designed specifically for language modeling.

Besides stacking and padding batches, it also creates the language model labels. In causal language modeling the inputs serve as labels too: the collator copies `input_ids` into `labels` on the fly during training, and the model shifts them by one position when computing the loss, so we don't need to duplicate the `input_ids` in the dataset.

In [11]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [12]:
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])
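
The labels have the same shape as the inputs because, as far as we understand the GPT-2 style language modeling head, the one-position shift is applied when the loss is computed. A plain-Python sketch of which target each position is scored against follows. `shifted_pairs` is a hypothetical helper and the token ids are invented.

```python
def shifted_pairs(input_ids):
    """Pair each position's input token with the next token it should predict."""
    labels = list(input_ids)      # the collator copies the inputs as labels
    contexts = input_ids[:-1]     # the last position has nothing to predict
    targets = labels[1:]          # each target is the following token
    return list(zip(contexts, targets))


pairs = shifted_pairs([10, 11, 12, 13])
print(pairs)
```

A sequence of `n` tokens therefore contributes `n - 1` prediction targets to the loss.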


# Train the model

## Specify training arguments

In [None]:
from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir="autopilot-from-scratch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=1_000,
    logging_steps=1_000,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=1_000,
    fp16=True,  # comment this out when training on a TPU, which does not support fp16
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
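
To get a feel for `lr_scheduler_type="cosine"` combined with `warmup_steps=1_000`, here is a small sketch of a linear-warmup, cosine-decay schedule. It is written from the usual definition of this schedule rather than copied from the library, and the default `total_steps` is taken from the training log of this notebook, so treat both as assumptions.

```python
import math


def cosine_lr(step, base_lr=5e-4, warmup_steps=1_000, total_steps=10_764):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, mid-warmup, peak, midpoint of decay, and end.
for step in (0, 500, 1_000, 5_882, 10_764):
    print(step, f"{cosine_lr(step):.2e}")
```

The rate climbs linearly to `learning_rate` during warmup and then decays smoothly towards zero by the last step.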

## Begin training

In [None]:
from datetime import datetime
start_time = datetime.now()
trainer.train()
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

***** Running training *****
  Num examples = 1377812
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 8
  Total optimization steps = 10764
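
The batch size and step count in this log follow from the training arguments. The arithmetic below is a sketch that assumes a single device and that incomplete accumulation groups at the end of an epoch are not counted as a step.

```python
import math

num_examples = 1_377_812
per_device_batch_size = 32
gradient_accumulation_steps = 8
num_epochs = 2
num_devices = 1  # assumption: a single GPU

# Batches seen per epoch, then optimizer updates after gradient accumulation.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_epochs
effective_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps
print(effective_batch_size, total_steps)
```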


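| Step | Training Loss | Validation Loss |
|------:|--------------:|----------------:|
| 1000 | 4.5248 | 2.975685 |
| 2000 | 2.5422 | 2.439659 |
| 3000 | 2.1642 | 2.188042 |
| 4000 | 1.9135 | 1.988376 |
| 5000 | 1.7236 | 1.84696 |
| 6000 | 1.5459 | 1.750104 |
| 7000 | 1.4363 | 1.676125 |
| 8000 | 1.3639 | 1.610456 |
| 9000 | 1.3046 | 1.566746 |
| 10000 | 1.273 | 1.548346 |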
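
Assuming the reported validation loss is the mean token-level cross-entropy in nats, it can be converted to perplexity with `exp(loss)`, which is sometimes easier to interpret. The sketch below uses three of the values from the log above.

```python
import math

# Validation losses copied from the training log (assumed to be in nats).
validation_loss = {1000: 2.975685, 5000: 1.84696, 10000: 1.548346}

for step, loss in validation_loss.items():
    print(step, f"perplexity = {math.exp(loss):.2f}")
```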


***** Running Evaluation *****
  Num examples = 13133
  Batch size = 32
Saving model checkpoint to codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-1000
Configuration saved in codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-1000/config.json
Model weights saved in codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-1000/tokenizer_config.json
Special tokens file saved in codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-1000/special_tokens_map.json
tokenizer config file saved in codeparrot-ds-500sample-gpt-neo-2ep/tokenizer_config.json
Special tokens file saved in codeparrot-ds-500sample-gpt-neo-2ep/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 13133
  Batch size = 32
Saving model checkpoint to codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-2000
Configuration saved in codeparrot-ds-500sample-gpt-neo-2ep/checkpoint-2000/config.json
Model weights saved in codeparrot-ds-500sam

Duration: 14:18:32.125029


## Push the model to the Hub

In [None]:
trainer.push_to_hub()

Saving model checkpoint to codeparrot-ds-500sample-gpt-neo-2ep
Configuration saved in codeparrot-ds-500sample-gpt-neo-2ep/config.json
Model weights saved in codeparrot-ds-500sample-gpt-neo-2ep/pytorch_model.bin
tokenizer config file saved in codeparrot-ds-500sample-gpt-neo-2ep/tokenizer_config.json
Special tokens file saved in codeparrot-ds-500sample-gpt-neo-2ep/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


To https://huggingface.co/Pavithra/codeparrot-ds-500sample-gpt-neo-2ep
   f917f07..2b19b03  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
To https://huggingface.co/Pavithra/codeparrot-ds-500sample-gpt-neo-2ep
   2b19b03..17fcc64  main -> main



'https://huggingface.co/Pavithra/codeparrot-ds-500sample-gpt-neo-2ep/commit/2b19b03f4f65f90d7da2f55f97bc9fd059267eb4'

# Inference

## Load the fine-tuned model

In [18]:
import torch
from transformers import pipeline

# Use the first GPU if one is available, otherwise fall back to the CPU (-1).
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline(
    "text-generation", model="mimicheng/codeparrot-ds-sample-2ep", device=device
)

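## Test with prompts

Given a prompt, the model generates a code completion. The completions below are short because generation stops once the default maximum length is reached.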

In [19]:
prompt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(prompt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
scatter, = thinkstats2.NormalPlot(x, y


In [20]:
prompt = """\
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
"""
print(pipe(prompt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
adp_tax_


In [21]:
prompt = """
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
"""
print(pipe(prompt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:

n_estimators = 30
mask = np.arange(n_


In [25]:
prompt = """
# import numpy as np
# create an array of size 5
# x = np.array([1,2,3,4,5])

# create an array of size 10
"""
print(pipe(prompt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



# import numpy as np
# create an array of size 5
# x = np.array([1,2,3,4,5])

# create an array of size 10
x = np.array([1,


# Test on the HumanEval dataset

In [None]:
!pip install datasets

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline


tokenizer = AutoTokenizer.from_pretrained("Pavithra/Autopilot-madgrad-training-version-1")
model = AutoModelForCausalLM.from_pretrained("Pavithra/madgrad-best-version")

pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer)

In [18]:
dataset = load_dataset("openai_humaneval")
dataset

DatasetDict({
    test: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point'],
        num_rows: 164
    })
})

In [19]:
import re  # needed by first_block below

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
    set_seed,
)

class EndOfFunctionCriteria(StoppingCriteria):
    """Custom `StoppingCriteria` which checks if all generated functions in the batch are completed."""

    def __init__(self, start_length, eof_strings, tokenizer):
        self.start_length = start_length
        self.eof_strings = eof_strings
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        """Returns true if all generated sequences contain any of the end-of-function strings."""
        decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
        done = []
        for decoded_generation in decoded_generations:
            done.append(any([stop_string in decoded_generation for stop_string in self.eof_strings]))
        return all(done)

def first_block(string):
    """Split off first block of code by scanning for class, def etc. on newlines."""
    return re.split("|".join(EOF_STRINGS), string)[0].rstrip()


def complete_code(pipe, prompt, num_completions=1, **gen_kwargs):
    """Complete prompt with text generation pipeline and return num_completions."""
    prompt = pipe.tokenizer.eos_token + prompt
    code_gens = pipe(prompt, num_return_sequences=num_completions, **gen_kwargs)
    return [first_block(code_gen["generated_text"][len(prompt) :]) for code_gen in code_gens]
    
EOF_STRINGS = ["\nclass", "\ndef", "\n#", "\n@", "\nprint", "\nif"]

gen_kwargs = {
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.95,
    "stopping_criteria": StoppingCriteriaList([EndOfFunctionCriteria(0, EOF_STRINGS, tokenizer)]),
}
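
The post-processing done by `first_block` can be checked in isolation. The sketch below repeats the function and the stop strings from the cell above and applies them to a made-up completion; the `raw_completion` string is hypothetical.

```python
import re

# Same stop strings as in the cell above.
EOF_STRINGS = ["\nclass", "\ndef", "\n#", "\n@", "\nprint", "\nif"]


def first_block(string):
    """Return the text before the first stop string, without trailing whitespace."""
    return re.split("|".join(EOF_STRINGS), string)[0].rstrip()


# Hypothetical raw completion: the wanted body followed by an unrelated function.
raw_completion = "    return a + b\n\ndef unrelated():\n    pass\n"
body = first_block(raw_completion)
print(repr(body))
```

Only stop strings at the start of a line trigger the cut, so an indented `if` inside the function body is left alone.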

In [22]:
from datasets import load_metric, load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
code_eval = load_metric("code_eval")
# test_cases = ["assert add(2,3)==5"]
# candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
# pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])

# print(pass_at_k)

In [None]:
import re
from tqdm import tqdm

n_samples = 200
batch_size = 64  # note: 200 // 64 = 3 batches, so 192 completions are generated per task

human_eval = load_dataset("openai_humaneval")
code_eval_metric = load_metric("code_eval")

n_tasks = len(human_eval["test"])
generations, references = [], []

for task in tqdm(range(n_tasks)):
    task_generations = []
    prompt = human_eval["test"][task]["prompt"].strip()
    gen_kwargs["stopping_criteria"][0].start_length = len(tokenizer(prompt)["input_ids"])
    for batch in range(n_samples // batch_size):
        task_generations.extend(complete_code(pipe, prompt, num_completions=batch_size, **gen_kwargs))
    generations.append([prompt + gen for gen in task_generations])
    test_func = human_eval["test"][task]["test"]
    entry_point = f"check({human_eval['test'][task]['entry_point']})"
    references.append("\n" + test_func + "\n" + entry_point)

# Evaluate completions with "code_eval" metric
pass_at_k, _ = code_eval_metric.compute(
    references=references, predictions=generations
)
print(f"Results: {pass_at_k}")
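
To help read the `pass_at_k` results, here is a sketch of the unbiased pass@k estimator described in the HumanEval paper, which, as far as we understand, is what the `code_eval` metric computes per task before averaging. The function name and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 192 completions for one task, 3 of them correct.
for k in (1, 10, 100):
    print(k, round(pass_at_k(192, 3, k), 4))
```

For `k=1` the estimate reduces to the fraction of correct completions, `c / n`.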