# Training Our Own Language Model

We will build an autocomplete model for Python code.

1. Follow steps to get data onto device.

Note: Data preparation is crucial, and one should make an effort to clean up the dataset as much as possible: e.g. remove duplicated code, consider copyright, investigate the language used in comments and docstrings, and remove personally identifying information such as passwords or keys.

Huggingface overcomes memory limitations for large datasets (200GB) by memory-mapping them, using the hard drive as a direct extension of RAM. Can load with `DownloadConfig`, setting `delete_extracted=True` so that extracted files are deleted as soon as they are no longer needed and we do not use up all our disk space.
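
To make the "hard drive as an extension of RAM" idea concrete, here is a minimal sketch using Python's standard `mmap` module. It only illustrates memory mapping in general and is not how the Datasets library is implemented; the temporary file is a made-up stand-in for a dataset cache file.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a large cache file: ~1 MB of filler plus a marker.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1_000_000 + b"needle")
    path = tmp.name

# Map the file into memory; the OS pages data in on demand instead of
# reading the whole file into RAM up front.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total_bytes = len(mm)
        tail = mm[total_bytes - 6:total_bytes]  # copy out just this slice

os.remove(path)
print(total_bytes, tail)
```

Only the slice that is actually accessed gets copied into a Python object, which is the property that lets a dataset far larger than RAM be indexed cheaply.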

**Load Data From Disk**

In [None]:
from datasets import load_dataset, DownloadConfig

download_config = DownloadConfig(delete_extracted=True)
dataset = load_dataset(
    "./codeparrot", split="train", download_config=download_config)

NLP data is lightweight to load compared with the model's processing computations, so I/O is rarely the bottleneck. Datasets also uses Apache Arrow under the hood with a zero-copy/zero-overhead format, making it efficient to access any element.

Can be very fast, from a few tenths of a GB/s to several GB/s, depending on the hardware. And if we don't have enough space on disk, we can always stream.

**Streaming**

Some datasets may be >1TB so hard to fit onto a standard hard drive; an alternative is to stream. 

In [None]:
# load dataset directly from compressed JSON files instead of creating a cache file from them
# is an iterable Dataset object; so cannot access random elements but need to read in order
streamed_dataset = load_dataset("./codeparrot", split="train", streaming=True)

In [None]:
iterator = iter(streamed_dataset)

print(dataset[0] == next(iterator))
print(dataset[1] == next(iterator))
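
As a rough picture of what streaming means, the sketch below reads a compressed JSON-lines file lazily with the standard library only. The file, its contents and the `stream_examples` helper are hypothetical; the Datasets library does considerably more than this.

```python
import gzip
import json
import os
import tempfile

# Hypothetical compressed JSON-lines file standing in for one dataset shard.
records = [{"content": f"print({i})"} for i in range(5)]
with tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

def stream_examples(file_path):
    """Yield one example at a time, decompressing lazily."""
    with gzip.open(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

iterator = iter(stream_examples(path))
first = next(iterator)       # examples can only be read in order
second = next(iterator)
remaining = list(iterator)   # exhausting the generator closes the file
os.remove(path)
print(first, second, len(remaining))
```

There is no `stream_examples(path)[3]`: as with the streamed dataset above, random access is traded for never holding the full dataset in memory or in a cache file.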

Going a step further, we can reference the dataset on the Hub and stream samples directly, without downloading the raw files locally.

In [None]:
# downloads examples on the fly
remote_dataset = load_dataset(
    "transformersbook/codeparrot", split="train", streaming=True)

## Add Data to Huggingface Hub
- Easily access from training server
- Share with community

1. Create HF Repository
2. Clone repository
3. Copy data to it
4. Push to hub
    - `git add .` may take a while since a hash of all the files is computed
    - Also good practice to add `README` cards with documentation


## Building a Tokenizer

When using a pretrained model, it's important to stick with the same preprocessing design choices selected for pretraining. Otherwise the model may be fed out-of-distribution patterns or unknown tokens.

However, when training a new model, a tokenizer prepared for another dataset can be suboptimal as we may encounter the following problems:
- T5 tokenizer trained on C4 corpus that had extensive stopword filtering, so has never seen common English words such as "sex".
- CamemBERT tokenizer also trained on a very large corpus of text, but only French text. Therefore, is unaware of common English words such as "being".

T5 and CamemBERT tokenizers instead split the words into subwords, which is inefficient as it also increases the sequence length of the model. **Lesson**: Be aware of the domain and dataset preprocessing used to train tokeniser; the tokeniser and model can encode bias that has a downstream impact on model behaviour. So to create an optimal tokenizer for our dataset, we need to train one for ourselves.

> Note: Training a tokenizer does not involve backpropagation or weights; it involves creating an optimal mapping from a string of text to a list of integers that can be ingested by the model. Today's tokenisers have a vocabulary consisting of a list of atomic strings, and a method to convert, normalise, cut or map a text string into a list of indices with this vocabulary. This list of indices is then the input for our neural network.


### The Tokenizer Model

There are several variations of preprocessing and algorithms for creating tokenisers. BPE and Unigram have the most reasonable performance in most cases:
- **BPE**: Starts from a list of basic units (characters) and creates a vocabulary by progressively adding new tokens formed by merging the most frequently co-occurring basic units. This process is repeated until a predefined vocabulary size is reached.
- **Unigram**: Starts from the opposite end to BPE, initialising the base vocabulary with all the words in the corpus and potential subwords. It then progressively removes/splits less useful tokens to obtain a smaller and smaller vocabulary, until the target vocabulary size is reached. WordPiece is a predecessor of Unigram, and its official implementation was never open-sourced by Google.
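
The BPE merge loop described above can be sketched in a few lines of plain Python. This is a toy version working on whole words and characters, with a hypothetical `train_bpe` helper and a made-up corpus; production implementations are far more optimised and work on pre-tokenized text.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["the", "the", "the", "then", "hen"], num_merges=2)
print(merges)  # [('h', 'e'), ('t', 'he')]
```

Each merge adds one token to the vocabulary, so the target vocabulary size is simply the base alphabet size plus the number of merges.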

Note: Superiority of performance for tokenising algorithm may depend on downstream task.

We want to look at aspects to consider when evaluating.

### Measuring Tokeniser Performance

Some possible metrics to consider (for optimality and performance):
- *Subword fertility*: Average number of subwords produced per tokenized word.
- *Proportion of continued words*: Proportion of tokenised words in a corpus split into at least two subtokens.
- *Coverage metrics*: Such as the proportion of unknown words or rarely used tokens in a tokenised corpus.
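
A small sketch of the first two metrics, using a hypothetical greedy longest-match splitter and a made-up vocabulary in place of a real tokenizer:

```python
def tokenize_word(word, vocab):
    """Greedy longest-match split; single characters are always allowed."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one char long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"token", "izer", "train", "ing", "the"}  # hypothetical toy vocabulary
words = "the tokenizer training the".split()

splits = [tokenize_word(w, vocab) for w in words]
fertility = sum(len(s) for s in splits) / len(words)
continued = sum(len(s) >= 2 for s in splits) / len(words)

print(splits)
print(f"subword fertility: {fertility}, continued words: {continued:.0%}")
```

Here two of the four words are split in two, giving a fertility of 1.5 and 50% continued words.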

Also robustness to misspellings and noise, model performance on out-of-domain examples, etc.

Above provides views on tokeniser's performance but tends to ignore the interaction of tokeniser with the model. E.g. subword fertility can be minimised by including all possible words in the vocabulary but this produces a very large vocabulary. 

So performance of tokenisation generally best estimated by downstream performance of model as ultimate metric. E.g. BPE good performance demonstrated on machine translation tasks by models trained with these tokenisers and vocabularies instead of character/word based tokenisation.

See how we can build our own tokenizer optimised for Python code.

### A Tokenizer for Python

Have to think about what the tokeniser preserves; e.g. if we remove whitespace we lose indentation, which carries meaning in Python. Line breaks, on the other hand, are less meaningful, as they can often be added/removed without impacting semantics. Splitting on underscores, as a natural-language pretokeniser might, would break apart variable names, so using a natural-language tokeniser for code may be suboptimal.

We want a tokeniser that preserves spaces; a good candidate is a byte-level tokeniser, like GPT-2's.

In [None]:
from transformers import AutoTokenizer

python_code = r"""def say_hello():
    print("Hello, world!")

# print it
say_hello()

"""

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer(python_code).tokens())

Note:
> Python has a built-in `tokenize` module that splits Python code strings into meaningful units (code operations, comments, indents, etc.). However, one issue is that this pretokeniser is Python-based, which is slow and limited by the GIL. The tokenizers in the Tokenizers library, by contrast, are coded in Rust and are many orders of magnitude faster to train and use, so we are more likely to use these given the size of our corpus.

In [None]:
# GPT2 uses no normaliser!
print(tokenizer.backend_tokenizer.normalizer)

In [None]:
# Works directly on raw unicode inputs without any normalisation steps. 
# look at pretokenisation
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))

All operations on the input string are tracked, so we know exactly which part of the input string each token corresponds to after tokenization. This is called *offset tracking*. Even if some characters are removed in normalisation, we can still associate each token with the respective part of the original string.
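
A toy illustration of offset tracking, using a regular expression as a stand-in pre-tokenizer (this is an assumption for illustration, not the rule GPT-2 uses):

```python
import re

text = "def  say_hello():"

# Toy pre-tokenizer: runs of word characters, or single non-space symbols.
# Each token is stored together with its (start, end) span in `text`.
tokens_with_offsets = [
    (m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)
]

for token, (start, end) in tokens_with_offsets:
    # The span always recovers the exact source characters of the token.
    print(f"{token!r:12} -> text[{start}:{end}] = {text[start:end]!r}")
```

Even though the two spaces after `def` are skipped by this toy rule, the offsets still point every token back to its exact position in the original string.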

The odd-looking characters come from the tokenizer being byte-level: it works on bytes instead of Unicode characters. The byte alphabet has only 256 values, instead of the 143,859 characters in the Unicode alphabet, and each Unicode character can be expressed as a sequence of bytes. A far smaller base vocabulary means our model's embedding layer can be much smaller.

Trade-off is that with only 256 byte values the input sequence is segmented into many small pieces, so sequences get longer and the model has to spend compute reconstructing Unicode characters from their bytes.
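
The trade-off is easy to see with UTF-8 directly. The example string is arbitrary; non-ASCII characters are written as escapes to keep the snippet encoding-safe.

```python
# "naive" with a diaeresis, "cafe" with an acute accent, and a hot-beverage emoji
text = "na\u00efve caf\u00e9 \u2615"

num_chars = len(text)
byte_values = list(text.encode("utf-8"))

print(f"{num_chars} Unicode characters -> {len(byte_values)} bytes")
print(byte_values)

# Every value fits the 256-entry byte alphabet, and the mapping is lossless.
round_trip = bytes(byte_values).decode("utf-8")
```

Twelve characters become sixteen bytes: the alphabet shrinks to 256 symbols, but the sequence the model sees gets longer.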

Middle-ground solution: Construct a medium-sized vocabulary by extending the 256-value byte vocabulary with the most common combinations of bytes. This is the BPE approach: progressively construct a vocabulary of predefined size by creating new vocabulary tokens through iteratively merging the most frequently co-occurring pair of tokens in the vocabulary. E.g. `t` + `h` -> `th`, then `th` + `e` -> `the`; stringing together common tokens to form bigger elementary units.

One issue with BPE algorithms: they are designed to work with clean Unicode strings as inputs, not bytes, and expect regular ASCII characters without spaces or control characters. However, many of the 256 byte values are control characters (newline, tab, escape, etc.) that are nonprintable.

GPT-2 tokenizer first maps all 256 input bytes to Unicode characters that can be easily digested by standard BPE algorithms. What's important is that we have 256 single values at the end, forming our base vocabulary, and that these 256 values are correctly handled by our BPE algorithm.

We could have used a more explicit conversion, like mapping newlines to a `NEWLINE` string, but BPE algorithms are typically designed to work on characters. So keeping one Unicode character for each byte value is easier to handle with an out-of-the-box BPE algorithm.
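
The byte-to-character mapping can be sketched as follows. This is a reimplementation written from memory of how the GPT-2 mapping behaves, under the assumption that printable bytes keep their own character and the rest are shifted past code point 255; the function name `byte_to_unicode_sketch` is hypothetical.

```python
def byte_to_unicode_sketch():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable, non-space characters keep their code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping, shift = {}, 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            # Control characters, space, etc. are moved to code points >= 256.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

mapping = byte_to_unicode_sketch()
encoded = "".join(mapping[b] for b in "a b\n".encode("utf-8"))
print(len(mapping), repr(encoded))
```

Under this scheme the space byte becomes `Ġ` (U+0120) and the newline byte becomes `Ċ` (U+010A), which are the odd characters visible in the tokenizer output.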

We know that newlines are mapped to unicode characters; also:
- Spaces, and consecutive spaces are conserved
- Consecutive spaces are considered a single word
- Each space preceding a word is attached and considered part of the subsequent word

BPE model is in charge of splitting words into subunits until all subunits belong to the predefined vocabulary. 

Vocabulary of GPT-2 tokenizer comprises 50,257 tokens:
- Base vocabulary with 256 values of the bytes
- 50,000 additional tokens created by repeatedly merging most commonly co-occuring tokens
- Special character added to vocabulary to represent document boundaries 

In [None]:
# check length attribute of tokenizer
print(f"Size of the vocabulary: {len(tokenizer)}")

print(tokenizer(python_code).tokens())

BPE tokenizer keeps most of the words but splits the multiple spaces of the indentation into several consecutive single spaces. This happens because the tokenizer was not specifically trained on code, but on text where consecutive spaces are rare. The BPE model therefore doesn't include a specific token in its vocabulary for indentation, so we can see that the tokenizer is poorly suited to the dataset's domain.

Solution is to retrain the tokenizer on the target corpus, let's get to it!

### Training a Tokenizer

Retrain byte-level BPE tokenizer on a slice of our corpus to get a vocabulary better adapted to Python code. To retrain a tokenizer, we just need to:
- Specify our target vocabulary size
- Prepare an iterator to supply lists of input strings to process to train the tokenizer's model
- Call the `train_new_from_iterator()` method

Tokenizers are trained to extract the main statistics, unlike deep learning models which are often expected to memorise a lot of specific details from the training corpus. Effectively, tokenizers are trained to know which letter combinations are the most frequent in our corpus.

Therefore, don't need a large corpus to train our tokenizer on, just needs to be representative of domain and big enough for tokenizer to extract statistically significant measures. 

Note: Depending on vocabulary size and exact texts in corpus, the tokenizer can end up storing unexpected words. 

In [None]:
# access mapping
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v,k) for k,v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())

print(f"Size of our base vocabulary: {len(base_vocab)}")
print(f"First element: `{base_vocab[0]}`, last element: `{base_vocab[-1]}`")

For each token, our model will have to learn an associated word embedding, and we don't want the embedding matrix to contain too many noisy words. Unnecessary tokens are also expensive in that each takes up a vector in the embedding matrix and increases the total vocabulary size.

To train a fresh tokenizer on our corpus and examine its learned vocabulary, we just need a corpus reasonably representative of our dataset statistics. Use about 1-2GB of data, or about 100,000 documents from our corpus:

In [None]:
from tqdm.auto import tqdm

length = 100_000
dataset_name = "transformersbook/codeparrot-train"
dataset = load_dataset(dataset_name, split="train", streaming=True)
iter_dataset = iter(dataset)

def batch_iterator(batch_size=10):
    for _ in tqdm(range(0, length, batch_size)):
        yield [next(iter_dataset)["content"] for _ in range(batch_size)]

new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=12_500, initial_alphabet=base_vocab)

In [None]:
# skip the 256 byte tokens and look at the first tokens added thereafter
tokens = sorted(new_tokenizer.vocab.items(), key=lambda x: x[1])
# pass each token as a one-element list: the method expects a list of tokens
print([new_tokenizer.convert_tokens_to_string([t]) for t, _ in tokens[257:280]])

In [None]:
# can see last words
print([new_tokenizer.convert_tokens_to_string([t]) for t, _ in tokens[-12:]])

In [None]:
# see how can tokenise example of Python code to see what it looks like
print(new_tokenizer(python_code).tokens())

In [None]:
# check if Python reserved keywords are in vocabulary

import keyword

print(f"There are in total {len(keyword.kwlist)} Python keywords.")
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer.vocab:
        print(f"No, keyword `{keyw}` is not in the vocabulary.")

Quite a number of frequent keywords, like `finally`, are not in the vocabulary. Try building a larger vocabulary from a larger sample of our dataset: 32,768 tokens, training the tokenizer on a slice of our corpus twice as large:

In [None]:
length = 200_000
new_tokenizer_larger = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32768, initial_alphabet=base_vocab
)

In [None]:
# look at last tokens
tokens = sorted(new_tokenizer_larger.vocab.items(), key=lambda x: x[1])
print([new_tokenizer_larger.convert_tokens_to_string([t]) for t, _ in tokens[-12:]])

No regular programming words, which is promising. Try our code sample with larger tokenizer:

In [None]:
print(new_tokenizer_larger(python_code).tokens())

Can see indents are conveniently kept in vocabulary, and common English words like `Hello`, `World` are included as single tokens. This is more in line with our expectations of the data and model we may likely encounter down-stream.

In [None]:
# investigate common Python keywords
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer_larger.vocab:
        print(f"No, keyword `{keyw}` is not in vocabulary")

Missing `nonlocal` keyword, but it's rarely used in practice as it makes the syntax more complex. Keeping it out of the vocabulary seems reasonable.

So from eyeballing, our larger tokenizer seems better adapted for our task, though objectively evaluating the tokenizer's performance is challenging without measuring the model's performance.

Proceed with this tokeniser and train a model to see how well it works in practice.

> Can verify the new tokenizer ~twice as efficient as standard GPT-2 tokenizer by comparing sequence lengths of tokenized code examples to the original. Our tokenizer uses approx half as many tokens as existing one to encode a text, giving us twice the effective model context for free. So training a model on a context window of size 1,024 is equivalent to training a model with the old tokenizer on a context window of size 2,048 with the advantage of being much faster and more memory efficient.

### Saving a Custom Tokenizer On the Hub

In [None]:
# can push tokeniser as we already authenticated with huggingface-cli
model_ckpt = "codeparrot"
org = "transformersbook"
new_tokenizer_larger.push_to_hub(model_ckpt) # create a repository in our namespace

In [None]:
# anyone can load our tokenizer by running
reloaded_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(reloaded_tokenizer(python_code).tokens())

In [None]:
# can save our smaller tokenizer as well
new_tokenizer.push_to_hub(model_ckpt+"-small-vocabulary")

## Training a Model From Scratch

Here we will:
- Decide best architecture for the task
- Initialise a fresh model without pretrained weights
- Set up custom data loading class
- Create a scalable training loop
- Finally train small and large GPT-2 models with 111 million and 1.5bn parameters

With such a large codebase of code snippets and an efficient tokeniser, we are able to tackle several tasks. The three common tasks:
- **Causal Language Modeling**: Provide the beginning of a code sample and ask the model to generate possible completions. This is self-supervised, without any annotations. A decoder-only architecture such as the GPT family is usually best suited for this task.
- **Masked Language Modeling**: Provide a model with a noisy code sample and ask it to reconstruct the original clean sample; also a self-supervised training procedure. Denoising is generally a good pretraining task to learn general representations for later downstream tasks. Can combine with fine-tuning on downstream tasks with a limited number of examples.
- **Sequence-to-Sequence Training**: Use heuristics like regular expressions to separate comments/docstrings from code and build a large-scale dataset of (code, comments) pairs that can be used as an annotated dataset. The training task is supervised, with (input, label) pairs such as (comment, code). Can try to train a model that learns to translate comments into code or vice-versa. A downstream task is documentation generation from code, or code generation from documentation, depending on how we set our inputs/outputs.
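
To make the difference between the first two objectives concrete, here is a minimal sketch of how training pairs could be derived from one tokenized snippet. The token IDs, the mask ID and the hand-picked masked positions are all hypothetical.

```python
token_ids = [11, 22, 33, 44, 55, 66]  # hypothetical token IDs for one snippet
MASK_ID = 0                            # hypothetical mask token ID

# Causal LM: predict each token from the tokens before it,
# so the targets are the inputs shifted by one position.
causal_inputs, causal_targets = token_ids[:-1], token_ids[1:]

# Masked LM: corrupt some positions and ask the model to restore them.
masked_positions = {1, 4}  # chosen by hand here; normally sampled at random
masked_inputs = [
    MASK_ID if i in masked_positions else t for i, t in enumerate(token_ids)
]
masked_targets = {i: token_ids[i] for i in sorted(masked_positions)}

print(causal_inputs, causal_targets)
print(masked_inputs, masked_targets)
```

Neither objective needs human annotation: the labels come from the code itself.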

Our objective is a code autocompletion model, so we'll go ahead with first objective!

> Note: Datasets and models are large so may require multiple GPUs and would run better with scripts than on notebook!

### Initialising the Model

Won't use `from_pretrained()`; instead load the configuration of `gpt2-xl` to utilise the hyperparameters and adapt the vocabulary size for the new tokenizer. Then initialise a new model with this configuration with `from_config()` method.

In [None]:
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
config = AutoConfig.from_pretrained("gpt2-xl", vocab_size=len(tokenizer))
model = AutoModelForCausalLM.from_config(config)

In [None]:
def model_size(model):
    # total number of parameters in the model
    return sum(t.numel() for t in model.parameters())

In [None]:
# check how large the model actually is
print(f"GPT-2 (xl) size: {model_size(model) / 1000**2:.1f}M parameters")

1.5bn-parameter model! A lot of capacity, but we do have a large dataset.

Generally, large language models are more efficient to train as long as the dataset is reasonably large.

In [None]:
# save the newly initialised model in `models/` directory and push to hub
# may take a few minutes given the size of the ckpt (>5gb)
model.save_pretrained("models/" + model_ckpt, push_to_hub=True)

In [None]:
# create a smaller version we can train to make sure everything works before scaling
# take standard GPT-2 size as a base
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
config_small = AutoConfig.from_pretrained("gpt2", vocab_size=len(tokenizer))
model_small = AutoModelForCausalLM.from_config(config_small)

print(f"GPT-2 size: {model_size(model_small) / 1000 **2:.1f}M parameters")
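
As a sanity check on these sizes, the parameter count of a GPT-2-style model can be estimated by hand. The sketch below assumes the standard GPT-2 layout (tied input/output embeddings, 1,024 positions, a 4x MLP expansion), the usual `gpt2` and `gpt2-xl` layer and width settings, and a vocabulary of 32,768 tokens; `gpt2_param_count` is a hypothetical helper, not a library function.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions=1024):
    """Rough parameter count for a GPT-2-style decoder with tied embeddings."""
    # Token and position embedding matrices.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused QKV projection plus output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd and project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm

small = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=32_768)
xl = gpt2_param_count(n_layer=48, n_embd=1600, vocab_size=32_768)
print(f"small: {small / 1000**2:.1f}M, xl: {xl / 1000**2:.1f}M")
```

Under these assumptions the estimate comes out at about 111M and 1.5bn parameters, in line with the two model sizes mentioned earlier; most of the gap to the original GPT-2 counts comes from the smaller vocabulary.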

In [None]:
# also save to hub for easy sharing and reuse
model_small.save_pretrained("models/" + model_ckpt + "-small", push_to_hub=True)

Now with two models to train, we need to make sure we can feed them the input data efficiently during training.

### Implementing the Dataloader

For maximal efficiency, we want to supply our model with sequences filling its context; e.g. if the context length is 1,024, we want to provide 1,024-token sequences in training. But some code examples may be shorter or longer, so we should either drop the last incomplete sequence or pad it. However, this renders our training slightly less efficient and forces us to take care of padding and masking padded token labels.

We are more compute- than data-constrained, so we'll take the easy and efficient way. We can tokenise several examples and then concatenate them, separated by the special end-of-sequence token, to get a very long sequence. Finally, split this sequence into equally sized chunks, so we lose at most a small fraction of data at the end.

Can make sure we have roughly one hundred full sequences in our tokenised examples by defining our input string character length as: 

`input_characters = number_of_sequences * sequence_length * characters_per_token`

Where:
- `input_characters`: Number of characters in string input to our tokenizer
- `number_of_sequences`: Number of (truncated) sequences we'd like from our tokenizer (eg. 100)
- `sequence_length`: Number of tokens per sequence returned by the tokenizer (eg. 1024)
- `characters_per_token`: Avg. number of characters per output token that we first need to estimate

If we input a string with `input_characters` characters we will get on average `number_of_sequences` output sequences, and can easily calculate how much input data we are losing by dropping the last sequence.

If `number_of_sequences=100`, we stack roughly 100 sequences and at most lose the last one, which may be too short or too long; so we lose at most about 1% of our dataset. This approach also ensures we don't introduce a bias by cutting off the majority of file endings.
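
Plugging in the example values gives a feel for the buffer size. The value of 3.6 characters per token is an assumed placeholder here; it should be estimated on the actual data.

```python
number_of_sequences = 100
sequence_length = 1024
characters_per_token = 3.6  # assumed value; estimate it on your own data

# Characters to collect before tokenizing one buffer.
input_characters = number_of_sequences * sequence_length * characters_per_token

# Worst case: the single trailing incomplete sequence is dropped.
max_fraction_lost = 1 / number_of_sequences

print(f"{input_characters:,.0f} characters per buffer")
print(f"at most about {max_fraction_lost:.0%} of tokens dropped per buffer")
```

So a buffer of roughly 370,000 characters yields about one hundred full 1,024-token sequences.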

In [None]:
# first estimate the average number of characters per token
examples, total_characters, total_tokens = 500, 0, 0
dataset = load_dataset("transformersbook/codeparrot-train", split="train", streaming=True)

for _, example in tqdm(zip(range(examples), iter(dataset)), total=examples):
    total_characters += len(example['content'])
    total_tokens += len(tokenizer(example['content']).tokens())

characters_per_token = total_characters / total_tokens

print(characters_per_token)

We have all that is needed to create our own `IterableDataset` (a helper class provided by PyTorch) for preparing constant-length inputs for the model. Just need to inherit `IterableDataset` and set up `__iter__()` function that yields the next element with the logic we walked through.

In [None]:
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):

    def __init__(
        self, tokenizer, dataset, seq_length=1024, num_of_sequences=1024, chars_per_token=3.6):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.concat_token_id = tokenizer.eos_token_id
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
    
        while more_examples:
            # builds buffer of strings until it has enough characters
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.input_characters:
                    m = f"Buffer full: {buffer_len}>={self.input_characters:.0f}"
                    print(m)
                    break
                try:
                    m = f"Fill buffer: {buffer_len}<{self.input_characters:.0f}"
                    print(m)
                    buffer.append(next(iterator)["content"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    iterator = iter(self.dataset)
            
            # tokenized and concat with EOS
            all_token_ids = []
            
            tokenized_inputs = self.tokenizer(buffer, truncation=False)
            for tokenized_input in tokenized_inputs["input_ids"]:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])

            # chunked in seq_length-sized slices
            # no need to pad as all sequences are maximal length so no need mask either
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i:i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)

In [None]:
# test iterable dataset

# Note: we shuffle the raw dataset. Since this is an iterable dataset we cannot
# shuffle the whole dataset; instead elements are shuffled within a buffer
shuffled_dataset = dataset.shuffle(buffer_size=100)
constant_length_dataset = ConstantLengthDataset(
    tokenizer, shuffled_dataset, num_of_sequences=10)

dataset_iterator = iter(constant_length_dataset)

lengths = [len(b) for _, b in zip(range(5), dataset_iterator)]
print(f"Lengths of the sequences: {lengths}")

So this works as intended and we get constant-length inputs for the model. With a reliable data source, we can build the actual training loop!
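
The concatenate-and-chunk step at the core of the dataset class can be isolated in a few lines of plain Python. The `pack_sequences` helper and the tiny token lists are hypothetical, chosen so the result is easy to check by hand.

```python
def pack_sequences(token_lists, eos_id, seq_length):
    """Join tokenized documents with EOS and cut into full-length chunks."""
    all_ids = []
    for ids in token_lists:
        all_ids.extend(ids + [eos_id])
    # Only start positions that leave room for a full chunk are used,
    # so a trailing incomplete chunk is dropped.
    return [
        all_ids[i:i + seq_length]
        for i in range(0, len(all_ids) - seq_length + 1, seq_length)
    ]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # hypothetical tokenized files
chunks = pack_sequences(docs, eos_id=0, seq_length=4)
print(chunks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note how a chunk can span two documents, with the EOS token marking the boundary, and how the leftover `[10, 0]` is discarded rather than padded.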

### Defining the Training Loop

An obvious limitation is the memory available on a single GPU, so we would need to utilise several GPUs for training. Fortunately, we can use Accelerate to make our code scalable.

The Accelerate library is designed to make distributed training, and changing the underlying hardware for training, easy. Accelerate gives us full control over the training loop, which is what we want to explore.

HF Accelerate provides an API that lets training scripts run with mixed precision in any kind of distributed setting (single GPU, multi-GPU and TPUs). The same script can then run on a local machine for debugging or on a beefy training cluster for the final training run.

Making changes to native PyTorch training loop goes as follows:

In [None]:
import torch
import torch.nn.functional as F
import datasets
import transformers
from datasets import load_dataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset("my_dataset")
data = torch.utils.data.DataLoader(dataset, shuffle=True)
# makes sure optimizers and dataloaders are prepared and distributed on infrastructure
model, optimizer, data = accelerator.prepare(model, optimizer, data)

model.train()

for epoch in range(10):
    for source, targets in data:
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()

So! We can define some helper functions, starting with the hyperparameters for training, which we wrap in a `Namespace` for easy access later.

In [None]:
from argparse import Namespace

# commented parameters correspond to the small model
config = {
    "train_batch_size": 2, # 12,
    "valid_batch_size": 2, # 12,
    "weight_decay": 0.1,
    "shuffle_buffer": 1000,
    "learning_rate": 2e-4, # 5e-4
    "lr_scheduler_type": "cosine",
    "num_warmup_steps": 750, # 2000
    "gradient_accumulation_steps": 16, # 1
    "max_train_steps": 50_000, #150_000
    "max_eval_steps": -1,
    "seq_length": 1024, 
    "seed": 1,
    "save_checkpoint_steps": 50_000 #15_000
}

args = Namespace(**config)
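
It helps to work out what these hyperparameters imply. The sketch below computes the effective batch per optimizer update and a cosine learning-rate curve with linear warmup. The number of processes is an assumption, and `cosine_lr` is a hand-written approximation of the schedule shape, not the scheduler from Transformers.

```python
import math

train_batch_size = 2
gradient_accumulation_steps = 16
seq_length = 1024
learning_rate = 2e-4
num_warmup_steps = 750
max_train_steps = 50_000
num_processes = 1  # assumption: a single worker; multiply for multi-GPU runs

# Gradients from several small batches are summed before each update.
sequences_per_update = train_batch_size * gradient_accumulation_steps * num_processes
tokens_per_update = sequences_per_update * seq_length

def cosine_lr(step):
    """Linear warmup followed by half-cosine decay to zero."""
    if step < num_warmup_steps:
        return learning_rate * step / num_warmup_steps
    progress = (step - num_warmup_steps) / (max_train_steps - num_warmup_steps)
    return learning_rate * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(sequences_per_update, tokens_per_update)
for step in (0, 750, 25_375, 50_000):
    print(step, f"{cosine_lr(step):.2e}")
```

With these values each update sees 32 sequences (32,768 tokens) per worker, and the learning rate peaks at 2e-4 after warmup before decaying to zero at the final step.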

Set up logging for training; we want to make sure relevant information is stored and easily accessible. The `setup_logging()` function sets up three kinds of logging, using the standard Python `logging` module, TensorBoard and Weights & Biases; use whichever suits your preferences.

In [None]:
from torch.utils.tensorboard import SummaryWriter
import logging
import wandb

def setup_logging(project_name):
    logger = logging.getLogger(__name__)
    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(name)s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S', level=logging.INFO,
        # each logger gets accelerator process index; log of each worker to file
        handlers=[
            logging.FileHandler(f"log/debug_{accelerator.process_index}.log"),
            logging.StreamHandler()
        ]
    )
    # true only for main worker; so don't initialise TensorBoard and W&B several times
    # as we decrease the logging levels for other workers
    if accelerator.is_main_process: # only want to set up logging once
        wandb.init(project=project_name, config=args)
        # return autogenerated unique wandb.run.name we use later to name our expt branch on the hub
        run_name = wandb.run.name
        tb_writer = SummaryWriter()
        tb_writer.add_hparams(vars(args), {'0': 0})
        logger.setLevel(logging.INFO)
        datasets.utils.logging.set_verbosity_debug()
        transformers.utils.logging.set_verbosity_info()
    else:
        tb_writer = None
        run_name = ""
        logger.setLevel(logging.ERROR)
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    return logger, tb_writer, run_name

In [None]:
def log_metrics(step, metrics):
    # Fn to log metrics with TensorBoard and W&B
    logger.info(f"Step {step}: {metrics}")
    if accelerator.is_main_process:
        wandb.log(metrics)
        for k, v in metrics.items():
            tb_writer.add_scalar(k, v, step)

In [None]:
from torch.utils.data.dataloader import DataLoader

def create_dataloaders(dataset_name):
    # create dataloader for train and validation sets 
    train_data = load_dataset(dataset_name+'-train', split="train", streaming=True)
    train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=args.seed)

    valid_data = load_dataset(dataset_name+'-valid', split="validation", streaming=True)

    train_dataset = ConstantLengthDataset(tokenizer, train_data, seq_length=args.seq_length)
    valid_dataset = ConstantLengthDataset(tokenizer, valid_data, seq_length=args.seq_length)

    # wrap dataset in DataLoader; which takes care of batching
    # HF Accelerate takes care of distributing the batches to each worker
    train_dataloader = DataLoader(train_dataset, batch_size=args.train_batch_size)
    eval_dataloader = DataLoader(valid_dataset, batch_size=args.valid_batch_size)

    return train_dataloader, eval_dataloader

In [None]:
def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    # fn to differentiate parameters that should receive weight decay
    # generally, biases and LayerNorm weights are not subject to weight decay
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {'params': params_with_wd, 'weight_decay': args.weight_decay},
        {'params': params_without_wd, 'weight_decay': 0.0}
        ]
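
The grouping relies on plain substring matching against parameter names, so it is worth checking which names actually match. The names below are hypothetical examples: the first three follow GPT-2-style naming and the last one BERT-style naming.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Hypothetical parameter names for illustration only.
names = [
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.ln_1.weight",
    "encoder.layer.0.output.LayerNorm.weight",
]

# A parameter skips weight decay if any pattern occurs in its name.
without_decay = [n for n in names if any(nd in n for nd in no_decay)]
with_decay = [n for n in names if n not in without_decay]

print("decay:   ", with_decay)
print("no decay:", without_decay)
```

In this toy list, a layer-norm weight named `ln_1.weight` does not match `LayerNorm.weight` and so stays in the decayed group. It may be worth printing the names from `model.named_parameters()` and adjusting `no_decay` to the architecture in use.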

In [None]:
def evaluate():
    # eval fn that calculates loss and perplexity on evaluation set
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch, labels=batch)
        loss = outputs.loss.repeat(args.valid_batch_size)
        losses.append(accelerator.gather(loss))
        if args.max_eval_steps > 0 and step >= args.max_eval_steps: break
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = torch.tensor(float("inf"))
    return loss.item(), perplexity.item()


Perplexity measures how well the model's output probability distributions predict the target tokens. Lower perplexity -> better performance. Note: We compute perplexity by exponentiating the cross-entropy loss that we get from the model's output. Especially at the start of training, when the loss is still high, we can get a numerical overflow when calculating perplexity. We catch this error and set perplexity to infinity in these instances.
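
The relationship between loss and perplexity, including the overflow guard, can be shown with the standard library alone. This is a plain-Python sketch with a hypothetical `perplexity_from_loss` helper; tensor libraries may return `inf` on overflow instead of raising, so the `except` branch matters most for Python floats.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    try:
        return math.exp(mean_cross_entropy)
    except OverflowError:
        # Loss too large for a float: report infinite perplexity.
        return float("inf")

print(perplexity_from_loss(0.0))           # 1.0: every token predicted perfectly
print(perplexity_from_loss(math.log(10)))  # ~10: as uncertain as 10 equal choices
print(perplexity_from_loss(1000.0))        # inf: overflow caught
```

A perplexity of 10 can be read as the model being, on average, as uncertain as if it were choosing uniformly among ten tokens.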

One more thing: HF Hub uses Git under the hood to store and version models and datasets. The `Repository` class from the `huggingface_hub` library allows us to programmatically access the repository and pull, branch, commit or push. We use it in our script to continuously push checkpoints to the Hub during training.

With everything in place, we can write the heart of the training script!

In [None]:
set_seed(args.seed)

# Accelerator
accelerator = Accelerator()
samples_per_step = accelerator.state.num_processes * args.train_batch_size

# logging
logger, tb_writer, run_name = setup_logging(project_name.split("/")[1])
logger.info(accelerator.state)

# Load model and tokenizer
if accelerator.is_main_process:
    hf_repo = Repository("./", clone_from=project_name, revision=run_name)
model = AutoModelForCausalLM.from_pretrained("./", gradient_checkpointing=True)
tokenizer = AutoTokenizer.from_pretrained("./")

# Load dataset and dataloader
train_dataloader, eval_dataloader = create_dataloaders(dataset_name)

# Prepare optimizer and LR scheduler
optimizer = AdamW(get_grouped_params(model), lr=args.learning_rate)
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type, optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps, 
    num_training_steps=args.max_train_steps
)

def get_lr(): return optimizer.param_groups[0]["lr"]

# prepare everything with our `accelerator` (order of args not important)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

# Train model
model.train()
completed_steps = 0
for step, batch in enumerate(train_dataloader, start=1):
    loss = model(batch, labels=batch).loss
    log_metrics(
        step, 
        {"lr": get_lr(), "samples": step*samples_per_step, "steps": completed_steps, "loss/train": loss.item()})
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss) 

    if step % args.gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        completed_steps += 1

    if step % args.save_checkpoint_steps == 0:
        logger.info("Evaluating and saving model checkpoint")
        eval_loss, perplexity = evaluate()
        log_metrics(step, {"loss/eval": eval_loss, "perplexity": perplexity})
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        if accelerator.is_main_process:
            unwrapped_model.save_pretrained("./")
            hf_repo.push_to_hub(commit_message=f"step {step}")
        model.train()
    
    if completed_steps >= args.max_train_steps:
        break

# Evaluate and save the last checkpoint
logger.info("Evaluating and saving model after training")
eval_loss, perplexity = evaluate()
log_metrics(step, {"loss/eval": eval_loss, "perplexity": perplexity})
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained("./")
    hf_repo.push_to_hub(commit_message=f"final model")

Above is all we need to train a large language model on distributed infrastructure! Deconstructing the script a bit:

- *Model Saving*: We run the script from the model repository, and at the start we check out a new branch named after the `run_name` we get from Weights & Biases. Later, we commit the model at each checkpoint and push it to the Hub. So each experiment is on a new branch and each commit represents a model checkpoint. Note that we need to call `wait_for_everyone()` and `unwrap_model()` to make sure the model is properly synchronised before we store it.
- *Optimization*: We use `AdamW` with a cosine learning rate schedule after a linear warmup period. Hyperparameters closely follow the GPT-3 parameters for similar-sized models.
- *Evaluation*: Evaluate every time we save, so every `save_checkpoint_steps` and after training. We log validation perplexity as well as validation loss.
- *Gradient accumulation and checkpointing*: Gradient accumulation lets us train with an effective batch size that would not fit into GPU memory: we gather gradients over several backward passes and take an optimizer step once enough gradients have accumulated. With a method called *gradient checkpointing*, we trade an approx. 20% training slowdown for a smaller memory footprint, allowing us to fit even the large model on a single GPU.
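
As a rough illustration of why the training loop divides the loss by `args.gradient_accumulation_steps`, the sketch below uses a toy one-parameter model in plain Python. The data, the model `y_hat = w * x`, and the helper `mse_grad` are all assumptions for illustration; the point is only that scaled gradients summed over equally sized micro-batches match the gradient of one large batch.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical toy data: (x, y) pairs
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
accumulation_steps = 2
micro_batch_size = len(data) // accumulation_steps

# Accumulate scaled gradients over micro-batches, mirroring
# `loss = loss / args.gradient_accumulation_steps` in the training loop
accumulated = 0.0
for i in range(accumulation_steps):
    micro_batch = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    accumulated += mse_grad(w, micro_batch) / accumulation_steps

full_batch = mse_grad(w, data)
print(accumulated, full_batch)  # the two agree up to floating-point error
```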


For training on multiple GPUs, there are several approaches depending on model size and data volume. HF Accelerate uses PyTorch's `DistributedDataParallel` (DDP), which allows one to train models faster with larger batch sizes that wouldn't fit into any single GPU. Step by step:

1. Each worker consists of a GPU; in HF Accelerate, there is a dataloader on the main process that prepares batches of data and sends them to all the workers.
2. Each GPU receives a batch of data and calculates the loss and accumulated gradients from forward and backward passes with a local copy of the model.
3. The gradients from each node are averaged with a *reduce* pattern, and the averaged gradients are sent back to each worker.
4. The gradients are applied using the optimizer on each node individually. This might seem like redundant work, but it avoids transferring copies of the large models between nodes. Without it, the other nodes would each need to wait until they received the updated version of the model.
5. With the models updated, we start all over again, with the main worker preparing new batches.

This simple pattern allows us to train large models extremely fast by scaling up to the number of available GPUs without much additional logic. Sometimes, however, this is not enough, and we may need more sophisticated parallelism strategies.
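
The data-parallel steps can be mimicked in a few lines of plain Python. This is only a single-process simulation under stated assumptions: two hypothetical "workers", a toy one-parameter model, and `all_reduce_mean` standing in for the real *reduce* communication; it is not how Accelerate implements it.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y_hat = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the reduce step: every worker receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Hypothetical setup: 2 workers, each with its own data shard
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]  # identical initial model copy on every worker
lr = 0.01

for step in range(3):
    # each worker computes gradients on its own batch with its local copy
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    # gradients are averaged and sent back to every worker
    grads = all_reduce_mean(grads)
    # each worker applies the same averaged gradient to its own copy
    weights = [w - lr * g for w, g in zip(weights, grads)]

print(weights)  # all copies stay identical without ever sending the model itself
```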

For now, though, we have everything required to launch a job!

### The Training Run

Put the training script in a Python file (`codeparrot_training.py`), add a `requirements.txt` file, and push both to the model repository on the Hub. To spin up the training script, we can run the commands:

```sh
git clone https://huggingface.co/transformersbook/codeparrot
cd codeparrot
pip install -r requirements.txt
wandb login
accelerate config
accelerate launch codeparrot_training.py
```

And our model will start training! `wandb login` will prompt for credentials, and `accelerate config` will guide us through setting up the infrastructure. The smaller model takes about 24 hours to train and the large one about 7 days. We can use the smaller model as a testing ground for larger models.

After training is completed, we can merge the experiment branch on the Hub back into the main branch with the commands:

```sh
git checkout main
git merge <RUN_NAME>
git push
```

And now we can look at how to investigate the model performance.

## Results and Analysis

We see the training loss and validation perplexity go down continuously, and the loss curve looks almost linear on a log-log scale. The large model converges faster in terms of processed tokens, though overall training takes longer.

There are two types of analysis: qualitative and quantitative. In the former, we look at concrete examples to better understand in which cases the model succeeds and where it fails. In the latter, we evaluate the model's performance statistically on a large set of test cases.

First, we wrap the small model in a pipeline and use it to continue some code inputs:

In [None]:
from transformers import pipeline, set_seed

# wrap code in a pipeline
model_ckpt = "transformersbook/codeparrot-small"
generation = pipeline("text-generation", model=model_ckpt, device=0)

We can use the generation pipeline to generate candidate completions from a prompt. By default, the pipeline will generate code until a predefined maximum length, and the output could contain multiple functions or classes.

To keep outputs concise, we'll implement a `first_block()` function that uses regex to extract the first occurrence of a function or class. The `complete_code()` function below applies this logic to print out the completions generated by CodeParrot:

In [None]:
import re
from transformers import set_seed

def first_block(string):
    # keep only the text before the next top-level statement
    return re.split("\nclass|\ndef|\n#|\n@|\nprint|\nif", string)[0].rstrip()

def complete_code(pipe, prompt, max_length=64, num_completions=4, seed=1):
    set_seed(seed)
    gen_kwargs = {"temperature":0.4, "top_p":0.95, "top_k":0, "num_beams":1, "do_sample": True}
    code_gens = pipe(
        prompt, num_return_sequences=num_completions, max_length=max_length, **gen_kwargs)
    code_strings = []
    for code_gen in code_gens:
        # strip the prompt and keep only the first generated block
        generated_code = first_block(code_gen["generated_text"][len(prompt):])
        code_strings.append(generated_code)
    print(("\n"+"="*80 + "\n").join(code_strings))
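
To see what the truncation does without running a model, here is a small standalone check. It repeats the `first_block()` splitting rule and applies it to a made-up raw completion (the string `raw_completion` is hypothetical, not real model output).

```python
import re

def first_block(string):
    # same splitting rule as above: cut at the next top-level statement
    return re.split("\nclass|\ndef|\n#|\n@|\nprint|\nif", string)[0].rstrip()

# Hypothetical raw completion that runs on into a second function
raw_completion = "    return a * b\n\ndef perimeter(a, b):\n    return 2 * (a + b)\n"

# Only the body of the first function survives
print(repr(first_block(raw_completion)))
```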

In [None]:
# test; have the model write a function that calculates the area of a rectangle
prompt = """def area_of_rectangle(a: float, b: float):
    # return the area of the rectangle
"""
complete_code(generation, prompt)

In [None]:
# extract urls from a HTML string
prompt = """def get_urls_from_html(html):
    # Get all embedded URLs in a HTML string.
"""
complete_code(generation, prompt)

In [None]:
# test function on HF home page
import requests

def get_urls_from_html(html):
    return [url for url in re.findall(r'<a href="(.*?)"', html) if url]

print(" | ".join(get_urls_from_html(requests.get("https://hf.co/").text)))

In [None]:
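# load large model and see if we can translate a function from pure Python to numpy
model_ckpt = "transformersbook/codeparrot"
generation = pipeline("text-generation", model=model_ckpt, device=0)
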
prompt = """# a function in native python
def mean(a):
    return sum(a) / len(a)
    
# the same function using numpy:
import numpy as np
def mean(a):"""
complete_code(generation, prompt, max_length=64)

In [None]:
# use to build a Scikit-learn model
prompt = '''X = np.random.randn(100, 100)
y = np.random.randint(0, 1, 100)
# fit random forest classifier with 20 estimators'''
complete_code(generation, prompt, max_length=96)

The BLEU metric is a poor fit here because it measures text overlap. In code there is a lot of freedom in naming variables and classes, and the success of a program does not depend on the naming convention as long as it is consistent. However, BLEU would punish a name that deviates from the reference, even though the reference name may be impossible to predict.
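
A rough way to see the problem is to compute clipped n-gram precision, the core ingredient of BLEU, on two functionally identical snippets. This is a simplified sketch and not full BLEU (no brevity penalty, no combination of n-gram orders); the helper `ngram_precision` and the whitespace-tokenized snippets are assumptions for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision on whitespace tokens (not full BLEU)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # count each candidate n-gram at most as often as it appears in the reference
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

reference = "def area ( width , height ) : return width * height"
same_names = "def area ( width , height ) : return width * height"
renamed = "def area ( a , b ) : return a * b"  # same behavior, different names

print(ngram_precision(same_names, reference))  # perfect overlap
print(ngram_precision(renamed, reference))     # lower score despite identical behavior
```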

There are better ways to measure code quality, such as running the generated code against unit tests and reporting the fraction of completions that pass. Other evaluation regimens are possible too!
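
A minimal sketch of the unit-test idea, assuming we already have a few candidate completions as strings. The candidates, the tests, and the helper `passes_tests` are all made up for illustration; a real evaluation should run generated code in a sandbox rather than with a bare `exec`.

```python
def passes_tests(candidate_source, test_source):
    """Run a candidate and its tests in a fresh namespace; True if nothing raises."""
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
        return True
    except Exception:
        return False

# Hypothetical model completions for the same prompt
candidates = [
    "def area_of_rectangle(a, b):\n    return a * b\n",
    "def area_of_rectangle(a, b):\n    return a + b\n",  # wrong formula
    "def area_of_rectangle(width, height):\n    return width * height\n",
]
tests = "assert area_of_rectangle(2, 3) == 6\nassert area_of_rectangle(0, 5) == 0\n"

results = [passes_tests(candidate, tests) for candidate in candidates]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Note that the third candidate passes even though its variable names differ from the first, which is exactly the freedom a text-overlap metric would penalize.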

## Conclusion

We built a code autocomplete function for Python! Along the way, we:
- Built a large-scale dataset suitable for pretraining a large language model
- Created a custom tokenizer able to efficiently encode Python code with that dataset
- Wrote a training script with HF Transformers and Accelerate to train small and large versions of a GPT-2 model from scratch on a multi-GPU instance

We saw reasonable code continuations and discussed areas of improvement!

Go forth and conquer!