# Code Autocompletion with GPT-2

GPT-2 is an autoregressive model trained on a causal language modeling task. This means that GPT-2 was trained on next-token prediction: provided a sequence of $n$ tokens, the model has to predict the $(n+1)$th token. The task is causal because each prediction depends only on the tokens that come before it, and choosing the most likely $(n+1)$th token can be framed as the probabilistic task below:

$$t_{n+1} = \operatorname*{arg\,max}_{x} \Pr(x \mid t_1, t_2, \ldots, t_n)$$

By giving this model a sequence of code ($n$ tokens of code, to be specific), we can expect to receive what, probabilistically, the next bit of code should be (the $(n+1)$th token). Once the model predicts the $(n+1)$th token, we can use the extended sequence $[t_1, \ldots, t_{n+1}]$ to predict the $(n+2)$th token, and this process can be repeated to generate as many tokens as we would like. This is known as autoregressive generation.
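
The loop described above can be sketched without any model at all. The following is a minimal illustration, assuming a hypothetical toy lookup table (`TOY_PROBS`) in place of GPT-2; the names `next_token_probs` and `generate` are made up for this sketch, and the toy table conditions only on the last token, whereas GPT-2 conditions on the whole prefix.

```python
from typing import Callable, Dict, List

# Hypothetical toy "model": maps the last token to a distribution over next tokens.
# A real model such as GPT-2 conditions on the entire prefix, not just the last token.
TOY_PROBS: Dict[str, Dict[str, float]] = {
    "def": {"add": 0.7, "(": 0.3},
    "add": {"(": 0.9, ")": 0.1},
    "(": {"a": 0.6, ")": 0.4},
    "a": {")": 0.8, "a": 0.2},
    ")": {":": 1.0},
}


def next_token_probs(tokens: List[str]) -> Dict[str, float]:
    """Return Pr(x | t_1, ..., t_n) for every candidate x (toy stand-in)."""
    return TOY_PROBS.get(tokens[-1], {})


def generate(
    prompt: List[str],
    max_new_tokens: int,
    probs_fn: Callable[[List[str]], Dict[str, float]] = next_token_probs,
) -> List[str]:
    """Greedy autoregressive decoding: append the argmax token, then repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = probs_fn(tokens)
        if not probs:  # no known continuation: stop early
            break
        tokens.append(max(probs, key=probs.get))  # argmax over candidates
    return tokens


print(generate(["def"], max_new_tokens=10))
# ['def', 'add', '(', 'a', ')', ':']
```

In practice, the `generate` method of a `transformers` causal language model runs a loop of this kind for you, with sampling and beam search available as alternatives to the greedy argmax.
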

In [13]:
import os
import torch
import evaluate
import regex as re
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from torch.utils.data import DataLoader
from datasets import load_dataset, Dataset, IterableDataset
# These are all the libraries you should need.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))  # Should return the name of the GPU

In [None]:
def clean_data(inp: str) -> str:
    """OPTIONAL: Perform data cleaning, if necessary."""
    ...

def get_data() -> IterableDataset:  # streaming=True yields an IterableDataset, not a Dataset
    # https://huggingface.co/datasets/codeparrot/codeparrot-clean
    # Load the dataset
    ds = load_dataset("codeparrot/codeparrot-clean", streaming=True, trust_remote_code=True, split="train")

    # Clean the data
    # ds = ds.map(lambda x: {"content": clean_data(x["content"])})

    return ds

dataset = get_data()

In [None]:
type(dataset) # This is important...

In [None]:
model     = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

model.to(device)

In [11]:
def get_train_valid_data(dataset: IterableDataset) -> tuple[IterableDataset, IterableDataset]:
    """TODO: Split the dataset into training and validation sets."""
    # This is not entirely straightforward because the dataset is streamed
    raise NotImplementedError

train_data, valid_data = get_train_valid_data(dataset)
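
One possible way to approach the split, offered as a sketch and not as the intended solution: a streaming `IterableDataset` from the `datasets` library provides `take(n)` and `skip(n)`, so the first `n` examples can be held out for validation while training uses the rest. The helper `split_stream`, its `n_valid` parameter, and the `ListStream` stand-in class are hypothetical names introduced here so the example runs without downloading anything.

```python
from itertools import islice
from typing import Iterable, Iterator


class ListStream:
    """Hypothetical stand-in for a streaming dataset that exposes take() and skip()."""

    def __init__(self, items: Iterable[dict]):
        self._items = list(items)

    def take(self, n: int) -> "ListStream":
        return ListStream(islice(self._items, n))

    def skip(self, n: int) -> "ListStream":
        return ListStream(islice(self._items, n, None))

    def __iter__(self) -> Iterator[dict]:
        return iter(self._items)


def split_stream(stream, n_valid: int):
    """Hold out the first n_valid examples for validation; train on the rest."""
    valid = stream.take(n_valid)
    train = stream.skip(n_valid)
    return train, valid


demo = ListStream({"content": f"file_{i}"} for i in range(10))
train_demo, valid_demo = split_stream(demo, n_valid=3)
print(len(list(train_demo)), len(list(valid_demo)))  # 7 3
```

Because the held-out examples are simply the first ones in the stream, they may not be representative; shuffling the stream first (the streaming dataset has a buffered `shuffle` method) is one option to consider.
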

In [None]:
class SafeIterableDataset(torch.utils.data.IterableDataset):
    """Wrapper to account for download errors so training doesn't stop due to error pulling data from HF."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        iterator = iter(self.dataset)
        while True:
            try:
                item = next(iterator)
                yield item
            except StopIteration:
                break
            except Exception as e:
                print(f"Caught exception during data loading: {e}. Skipping item.")
                continue

train_data = SafeIterableDataset(train_data)
valid_data = SafeIterableDataset(valid_data)

train_loader = DataLoader(train_data,  batch_size=16)
test_loader  = DataLoader(valid_data,  batch_size=16)

In [None]:
def tokenize(inp: list[str]):
    """
    TODO: Tokenize the input.
    Consider:
    - Padding?
    - Truncation?
    - Anything else?
    """
    ...
    raise NotImplementedError

In [None]:
def train():
    model.train()

    for batch in train_loader:
        # TODO: Implement training loop
        # Note: the batch tensors must be on the same device as the model
        ...
        raise NotImplementedError


In [None]:
def val():
    model.eval()

    with torch.no_grad():
        for batch in test_loader:
            # TODO: Implement validation loop
            # Note: the batch tensors must be on the same device as the model
            ...
            raise NotImplementedError

In [None]:
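# Raise Hub timeouts (seconds). These may be read when huggingface_hub is first
# imported; if they seem to have no effect, set them before the imports above.
os.environ["HF_HUB_ETAG_TIMEOUT"]     = "500"
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "500"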

In [None]:
# TODO: Consider setting up model checkpointing (set up a directory to save checkpoints)
...

In [None]:
# Clear any leftover gradients (stale .grad tensors can cause issues once layers are frozen)
model.zero_grad(set_to_none=True)

n_epochs = ...

for epoch in range(n_epochs):
    print(f"Epoch: {epoch}")

    # TODO: Implement training and validation
    ...
    raise NotImplementedError

print("Training complete")

Common remedies for CUDA out-of-memory errors include:
1. Freezing layers of your model (training fewer parameters).
2. Using gradient checkpointing to save GPU memory.
3. Reducing the max sequence length of your data (the GPT-2 tokenizer defaults to a maximum of 1024 tokens, which is colossal).
4. Reducing the batch size (look into gradient accumulation to keep the effective batch size).

And, of course:

5. Using a smaller model.
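
To see why gradient accumulation lets a smaller batch size stand in for a larger one, here is a small worked example in plain Python, assuming a one-parameter linear model with a mean-squared-error loss (the functions `grad_mse` and `accumulated_grad` and the data are made up for illustration). It checks that summing the gradients of equally sized micro-batches, each scaled by one over the number of micro-batches, reproduces the full-batch gradient.

```python
from typing import Sequence


def grad_mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(
    w: float, xs: Sequence[float], ys: Sequence[float], micro_batch_size: int
) -> float:
    """Sum per-micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    n_steps = len(starts)
    total = 0.0
    for s in starts:
        g = grad_mse(w, xs[s:s + micro_batch_size], ys[s:s + micro_batch_size])
        total += g / n_steps  # same effect as scaling the loss by 1/n_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # four micro-batches of 2
print(abs(full - accum) < 1e-9)  # True
```

In PyTorch the same idea follows from the fact that calling `backward()` repeatedly adds to each parameter's `.grad` until it is zeroed: divide the loss by the number of accumulation steps, call `backward()` on every micro-batch, and only step and zero the optimizer once per group. The equivalence shown here is exact only when the micro-batches have equal size.
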

In [None]:
# TODO: Save the model
...
raise NotImplementedError