# Training Autopilot with a custom training loop and loss function

This notebook shows how to fine-tune a large pretrained foundation model on a customized dataset using a custom loss function, an optimizer other than those available on the Hugging Face Hub, and a custom training loop. The notebook trains on a GPU.

# Installation and setup

In [1]:
import torch
print(torch.__version__)

1.10.0+cu111


In [2]:
#@title Install requirements
!pip install datasets transformers[sentencepiece]
!apt install git-lfs
!pip -q install madgrad

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)

You will need to set up git. Replace the email and name in the following cell with your own.

In [None]:
!git config --global user.email "your_email@example.com"
!git config --global user.name "YourName"

Store git credentials

In [3]:
!git config --global credential.helper store

## Hugging Face login
Log in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Prepare the dataset

## Load sampled dataset

In [4]:
from datasets import load_dataset, DatasetDict

ds_train = load_dataset("Pavithra/autopilot-sampled50k-train", split="train")
ds_valid = load_dataset("Pavithra/autopilot-sampled50k-valid", split="validation")

raw_datasets = DatasetDict(
    {
        "train": ds_train, 
        "valid": ds_valid
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 50000
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 500
    })
})

# Tokenizer

Chunk the input sequences into pieces of at most `context_length` tokens. We use a pretrained tokenizer here.

In [5]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 43
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41, 128, 128, 128, 128, 128, 128, 128, 128, 26]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
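
To make the chunk mapping concrete, here is a minimal pure-Python sketch of what `return_overflowing_tokens=True` does conceptually: each sample's token list is cut into pieces of at most `context_length` tokens, and a parallel list records which sample each piece came from. The helper `chunk_with_mapping` is hypothetical and works on plain lists of integers rather than calling the tokenizer.

```python
def chunk_with_mapping(samples, context_length):
    """Split each token list into pieces of at most `context_length` tokens.

    Returns the chunks, their lengths, and the index of the source sample
    for every chunk (analogous to `overflow_to_sample_mapping`).
    """
    chunks, lengths, mapping = [], [], []
    for sample_idx, token_ids in enumerate(samples):
        for start in range(0, len(token_ids), context_length):
            piece = token_ids[start:start + context_length]
            chunks.append(piece)
            lengths.append(len(piece))
            mapping.append(sample_idx)
    return chunks, lengths, mapping


# Two toy "documents" of 10 and 5 tokens, chunked with a context of 4
samples = [list(range(10)), list(range(100, 105))]
chunks, lengths, mapping = chunk_with_mapping(samples, context_length=4)
print(lengths)  # [4, 4, 2, 4, 1]
print(mapping)  # [0, 0, 0, 1, 1]
```

The notebook output follows the same pattern: the 34 chunks mapped to sample 0 end in a 41-token remainder, and the 9 chunks mapped to sample 1 end in a 26-token remainder.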


## Tokenize the dataset

In [6]:
# Chunks shorter than context_length are dropped below. With very long contexts
# or many short sequences, concatenate the samples first and then chunk them.
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 1377812
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 13133
    })
})
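
The `tokenize` function above discards every chunk shorter than `context_length`, so one remainder per sample is lost. The alternative hinted at in the cell comment is to concatenate first. The following is a minimal sketch of that idea on plain lists; `concat_and_chunk` is a hypothetical helper and the `eos_id` value is illustrative, not the tokenizer's actual EOS ID.

```python
def concat_and_chunk(samples, context_length, eos_id):
    """Join all token lists with an EOS separator, then cut fixed-size blocks.

    Only the final incomplete block is dropped, instead of one remainder
    per sample.
    """
    stream = []
    for token_ids in samples:
        stream.extend(token_ids)
        stream.append(eos_id)  # mark the document boundary
    n_blocks = len(stream) // context_length
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(n_blocks)
    ]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = concat_and_chunk(samples, context_length=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

The trade-off is that a block can span two documents, which is why the separator token is inserted at each boundary.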

## Set up the data collator

In [7]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training setup

## Customized Loss

In [8]:
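# Collect the IDs of keywords that the tokenizer encodes as a single token,
# both with and without a leading space. Each keyword is listed once, because
# a duplicated ID would be counted twice when the loss weights are computed.
keytoken_ids = []
for keyword in [
    "plt",
    "pd",
    "sk",
    "fit",
    "predict",
    "np",
    " plt",
    " pd",
    " sk",
    " fit",
    " predict",
    " np",
    "dummy_test",
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")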

Keyword has not single token: dummy_test


In [9]:
from torch.nn import CrossEntropyLoss
import torch

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()

    # Calculate per-token loss (reduction="none" replaces the deprecated reduce=False)
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(dim=1)

    # Count key-token occurrences per sample and scale the weighting
    weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
        dim=[0, 2]
    )
    weights = alpha * (1.0 + weights)

    # Calculate weighted average
    weighted_loss = (loss_per_sample * weights).mean()
    return weighted_loss
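
The weighting step can be checked by hand. The sketch below reproduces only the weight arithmetic of `keytoken_weighted_loss` in plain Python, using made-up token IDs and per-sample losses. `sample_weights` and `weighted_mean` are hypothetical helper names, not part of the notebook.

```python
def sample_weights(batch_ids, keytoken_ids, alpha=1.0):
    """Weight of each sample: alpha * (1 + number of key-token occurrences)."""
    weights = []
    for ids in batch_ids:
        count = sum(ids.count(kt) for kt in keytoken_ids)
        weights.append(alpha * (1.0 + count))
    return weights


def weighted_mean(losses, weights):
    """Mean over the batch of loss_i * weight_i, as in the function above."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)


# Made-up batch: token 7 and token 9 play the role of key tokens
batch_ids = [[1, 7, 3, 7], [2, 4, 6, 8], [9, 1, 1, 1]]
weights = sample_weights(batch_ids, keytoken_ids=[7, 9])
print(weights)  # [3.0, 1.0, 2.0]
print(weighted_mean([2.0, 2.0, 2.0], weights))  # 4.0
```

Because the result is divided by the batch size rather than by the sum of the weights, batches that contain many key tokens produce a larger loss overall.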

## Dataloader

In [10]:
from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["valid"], batch_size=32, collate_fn=data_collator)

In [11]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([32, 128]),
 'input_ids': torch.Size([32, 128]),
 'labels': torch.Size([32, 128])}

## Weight decay on model parameters

In [12]:
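def get_grouped_params(model, no_decay=("bias", "LayerNorm.weight")):
    # Split the parameters into two groups: parameters whose name contains one
    # of the `no_decay` patterns get no weight decay, all others do.
    # Note: this reads the global `weight_decay` set in the optimizer cells below.
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]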
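
The grouping relies purely on substring matching against parameter names, so it is worth checking which names actually match. The sketch below applies the same rule to a hand-written list of names. The names are assumptions written for illustration, not read from the model; printing `model.named_parameters()` shows the real ones. `split_by_name` is a hypothetical helper.

```python
def split_by_name(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Return (names_with_decay, names_without_decay) using substring matching."""
    with_wd, without_wd = [], []
    for name in param_names:
        if any(pattern in name for pattern in no_decay):
            without_wd.append(name)
        else:
            with_wd.append(name)
    return with_wd, without_wd


# Hand-written example names (assumed, not read from the model)
names = [
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.ln_1.weight",
    "encoder.layer.0.output.LayerNorm.weight",
]
with_wd, without_wd = split_by_name(names)
print(with_wd)
# ['transformer.h.0.attn.c_attn.weight', 'transformer.h.0.ln_1.weight']
print(without_wd)
# ['transformer.h.0.attn.c_attn.bias', 'encoder.layer.0.output.LayerNorm.weight']
```

If the model names its layer-norm weights like `ln_1.weight`, the `LayerNorm.weight` pattern does not match them and they receive weight decay. Adding a matching pattern to `no_decay` would exclude them.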

## Load GPT-2 small model

In [13]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [14]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.2M parameters
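
The printed total can be reproduced by hand from the GPT-2 small hyperparameters. The sketch assumes 12 layers, a hidden size of 768, 1024 positions, tied input and output embeddings, and a vocabulary of 50,000 tokens. The vocabulary size is not printed in this notebook; it is an assumption chosen because it reproduces the 124.2M figure. `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style decoder with tied output embeddings."""
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused QKV projection plus output projection (weights + biases)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)  # two layer norms per block, weight + bias
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * (attn + mlp + layer_norms) + final_layer_norm


total = gpt2_param_count(vocab_size=50_000)  # assumed vocabulary size
print(f"{total / 1000**2:.1f}M")  # 124.2M
```

Under these assumptions the embedding tables account for roughly 39M parameters and the 12 transformer blocks for roughly 85M.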


## Set up device

In [15]:
# Move the model to the GPU (a CUDA device is required)
device = torch.device("cuda")
model.to(device)
device

device(type='cuda')

## Evaluate dataset

The following function computes the model's loss and perplexity. It evaluates whatever dataset is loaded in `eval_dataloader`.

In [16]:
def evaluate():
    model.eval()
    losses = []
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(batch["input_ids"], labels=batch["input_ids"])

        # Mean cross-entropy loss of this batch
        losses.append(outputs.loss)
    loss = torch.mean(torch.stack(losses))

    # torch.exp returns inf on overflow instead of raising, so no try/except is needed
    perplexity = torch.exp(loss)

    return loss.item(), perplexity.item()
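
Perplexity is the exponential of the mean cross-entropy loss. The sketch below shows the same computation with the standard library only; `perplexity_from_losses` is a hypothetical helper, and the vocabulary size of 50,000 is used purely as an illustrative number. Unlike `torch.exp`, `math.exp` raises `OverflowError` for very large inputs, so the guard is needed here.

```python
import math


def perplexity_from_losses(batch_losses):
    """Perplexity = exp(mean of the per-batch cross-entropy losses)."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")


# A model guessing uniformly over 50,000 tokens has a loss of ln(50000)
uniform_loss = math.log(50_000)
print(round(perplexity_from_losses([uniform_loss, uniform_loss])))  # 50000
```

Averaging per-batch means equals the per-token mean only when all batches have the same size, so a smaller final batch is slightly over-weighted.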

## Define Optimizer

### AdamW optimizer

In [18]:
from torch.optim import AdamW

weight_decay = 0.1
optimizer = AdamW(get_grouped_params(model), lr=5e-4)


### MADGRAD optimizer

In [19]:
from madgrad import MADGRAD

optimizer_type = 'MADGRAD'
learning_rate = 2e-5
weight_decay = 1e-5
epsilon = 1e-6
max_grad_norm = 1.0

optimizer = MADGRAD(
          get_grouped_params(model),
          lr=learning_rate,
          eps=epsilon,
          weight_decay=weight_decay
      )

## Define scheduler

In [20]:
from transformers import get_scheduler

num_train_epochs = 2
# This counts one scheduler step per batch. With gradient accumulation the
# scheduler only advances once per optimizer update, so fewer steps are taken.
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="cosine",
    optimizer=optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)

def get_lr():
    return optimizer.param_groups[0]["lr"]
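
To show the shape of the schedule, here is a minimal sketch of a cosine schedule with linear warmup written with the standard library. It follows my understanding of what `get_scheduler(name="cosine", ...)` computes with its default settings (linear warmup, then a half-cosine decay to zero). `cosine_with_warmup` is a hypothetical helper, and the total of 10,000 steps is illustrative.

```python
import math


def cosine_with_warmup(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 2e-5
for step in (0, 500, 1_000, 5_500, 10_000):
    lr = cosine_with_warmup(
        step, base_lr, num_warmup_steps=1_000, num_training_steps=10_000
    )
    print(step, f"{lr:.2e}")
```

The rate rises linearly to `base_lr` over the first 1,000 steps, passes through half of `base_lr` midway through the decay phase, and reaches zero at the final step.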

## Define repo and output directories

In [23]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "autopilot-ds-madgrad"
repo_name = get_full_repo_name(model_name)
repo_name

'Pavithra/autopilot-ds-madgrad'

In [24]:
output_dir = "autopilot-ds-madgrad"
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/Pavithra/autopilot-ds-madgrad into local empty directory.


If you want to commit the tokenizer to the Hub, uncomment and run the following code.

In [None]:
# output_dir = "codeparrot-tokenizer-50k"
# repo = Repository(output_dir, clone_from=repo_name)
# tokenizer.save_pretrained(output_dir)
# repo.push_to_hub(
#                 commit_message=f"Tokenizer customized for smaller dataset", blocking=False
#             )

Cloning https://huggingface.co/Pavithra/codeparrot-ds-madgrad into local empty directory.


## Training

In [None]:
from tqdm.notebook import tqdm

gradient_accumulation_steps = 8
eval_steps = 5
save_checkpoints_steps = 5

model.train()
completed_steps = 0

for epoch in range(num_train_epochs):
    for step, batch in tqdm(
        enumerate(train_dataloader, start=1), total=len(train_dataloader)
    ):
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)
        if step % 500 == 0:
            # Log the unscaled loss of the current batch
            print(
                "lr: ", get_lr(),
                "; steps: ", completed_steps,
                "; loss/train: ", loss.item(),
            )
        # Scale the loss so that gradients accumulated over several batches average out
        loss = loss / gradient_accumulation_steps
        # No argument here: passing the loss as the gradient would rescale the gradients
        loss.backward()
        if step % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        if (step % eval_steps) == 0:
            eval_loss, perplexity = evaluate()
            print("step: ", step, " loss/eval: ", eval_loss, "; perplexity: ", perplexity)

            # Save the model and tokenizer, then push the checkpoint to the Hub
            model.save_pretrained(output_dir)
            tokenizer.save_pretrained(output_dir)
            repo.push_to_hub(
                commit_message=f"Training in progress step {step}", blocking=False
            )
            model.train()

model.push_to_hub("codeparrot-madgrad")
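
The reason for dividing the loss by `gradient_accumulation_steps` can be illustrated with a toy model. The sketch below uses a one-parameter least-squares problem in plain Python, not the notebook's model. It shows that summing micro-batch gradients, each scaled by one over the number of micro-batches, gives the same gradient as one large batch. `mse_grad` and `accumulated_grad` are hypothetical helpers.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    n_micro = len(xs) // micro_batch_size
    grad = 0.0
    for i in range(n_micro):
        sl = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        # Same role as `loss / gradient_accumulation_steps` before backward()
        grad += mse_grad(w, xs[sl], ys[sl]) / n_micro
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(abs(full - accum) < 1e-9)  # True
```

The equivalence holds when the loss is a mean over the batch and all micro-batches have the same size.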

# Inference

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline


tokenizer = AutoTokenizer.from_pretrained("Pavithra/Autopilot-madgrad-training-version-1")
model = AutoModelForCausalLM.from_pretrained("Pavithra/madgrad-best-version")

pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer)

## Prompts

In [31]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
x_plot = np.arange(100)
y_


In [32]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
for i, (train, test) in enumerate(train):
   


In [33]:
txt = """\
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
var_dict = {'


In [34]:
txt = """
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
# get random forest classifier
X, y = make_friedman
