### Environment
Training was originally run on Kaggle. It should work in other environments as well, except for how the secret token is read.

First, we import the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datasets
from datasets import DatasetDict, Dataset
from transformers import AutoTokenizer
import re

### Configurations
We shall try to keep the minimalistic configuration that we used in the pure PyTorch version [here](https://github.com/DipanjanSanyal/llm-fundamentals/blob/main/Simplified%20GPT%20Pretraining%20-%20Pure%20Pytorch.ipynb).

In [1]:
# Configurations
context_length = 64
embed_dim = 128
heads = 4
layers = 3
batch_size = 32
learning_rate = 0.00003
response_tokens = 50
num_epochs = 3
output_model_name = 'wikipedia_sample_tiny_gpt2_base'
commit_message = "1st commit"

### Data
Here the coding is much easier thanks to the Hugging Face libraries. Hence, let us try to use a better dataset.

In [None]:
# Setting up a streaming handle to the English Wikipedia dataset on the Hugging Face hub
data = datasets.load_dataset("wikimedia/wikipedia", "20231101.en", streaming = True)
data

Streaming only sets up the connection; no data is downloaded until we iterate over the dataset. We shall pull only a subset of the data.

We shall not use very short texts, so we'll skip them while iterating.

Also, we'll limit the download to a desired number of rows.

In [None]:
# This is not reproducible, though
# To make it reproducible, hashing can be used, but that won't be efficient
# Otherwise, one can use streaming = False in the above cell with split = 'train[:5%]', although that is not optimal on disk

def filter_and_limit(dataset, min_char_len, row_subset):
    result = []
    for entry in dataset['train']:
        if (len(entry['text']) > min_char_len):
            result.append(entry)
        if len(result) >= row_subset:
            break
    return Dataset.from_list(result)


sample_data = filter_and_limit(data, 1024, 30000)
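The same filter-and-limit idea can be sketched lazily with a generator and `itertools.islice`, which stops pulling rows from the stream as soon as enough have been kept. This is only an illustration on a toy stream: `filter_and_limit_lazy` and `toy_stream` are hypothetical names, and the rows are plain dictionaries rather than real Wikipedia entries.

```python
from itertools import islice

def filter_and_limit_lazy(rows, min_char_len, row_subset):
    """Keep the first `row_subset` rows whose text is longer than `min_char_len`."""
    long_enough = (row for row in rows if len(row["text"]) > min_char_len)
    # islice stops requesting rows once row_subset rows have been collected
    return list(islice(long_enough, row_subset))

consumed = []  # records which rows were actually pulled from the stream

def toy_stream():
    # Hypothetical stand-in for a streaming dataset: text lengths cycle 0..4
    for i in range(1000):
        consumed.append(i)
        yield {"id": i, "text": "x" * (i % 5)}

subset = filter_and_limit_lazy(toy_stream(), min_char_len=2, row_subset=4)
print([row["id"] for row in subset])  # ids of the rows that were kept
print(len(consumed))                  # rows pulled from the stream, far fewer than 1000
```

With a real dataset, the resulting list could be passed to `Dataset.from_list` as in the cell above.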

In [None]:
sample_data

Please note that, unlike the pure PyTorch version, here we shall not concatenate all the texts into a single long string but will keep them separate. This ensures that every selected example contains tokens from the same Wikipedia article. This is because:
1. we are working with a small context window, and we are fine with limiting the context to one paragraph
2. concatenating would have produced samples with text from 2 different articles, which keeping the articles separate prevents
3. here we shall pass the whole training data <code>num_epochs</code> times rather than sampling, so the cases in #2 would definitely arise if we concatenated
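A toy illustration of point 2, assuming two made-up "articles" of 10 tokens each and a chunk size of 4 (all names and numbers here are hypothetical): chunking the concatenated text yields a chunk that straddles both articles, while chunking each article separately does not.

```python
def fixed_chunks(tokens, size):
    """Split tokens into disjoint chunks of exactly `size`, dropping the remainder."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def count_mixed(chunks):
    """Count chunks that contain tokens from more than one article."""
    return sum(1 for chunk in chunks if len({article_id for article_id, _ in chunk}) > 1)

# Each token is tagged as (article_id, position) so we can trace where it came from
articles = [[(article_id, pos) for pos in range(10)] for article_id in range(2)]

# Option A: concatenate everything first, then chunk
concatenated = [token for article in articles for token in article]
mixed_when_concatenated = count_mixed(fixed_chunks(concatenated, 4))

# Option B: chunk every article on its own (what this notebook does)
per_article_chunks = [chunk for article in articles for chunk in fixed_chunks(article, 4)]
mixed_when_separate = count_mixed(per_article_chunks)

print(mixed_when_concatenated, mixed_when_separate)
```

The price of option B is that the leftover tokens at the end of each article are dropped.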

#### Data Tokenization, Chunking, Input vs Target Setup
1. First, we load an instance of a pre-trained tokenizer class. This time we won't use the tiny TokenMonster tokenizers but a standard tokenizer, say DistilBERT's.
2. Next, we pass our data through this instance using [this calling protocol](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__) (similar to forward in neural nets).

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# example call
mysample = tokenizer(
    sample_data['text'][0:3], # 3 articles entered
    truncation = True, # cuts each text at max_length tokens
    max_length = context_length,
    return_overflowing_tokens = True, # returns the truncated remainder as further chunks, with a mapping back to the source text
    return_length = True, # returns the number of tokens in each chunk
    add_special_tokens = False # prevents special tokens such as [CLS] and [SEP] from being added to each chunk
)

for k in mysample.keys():
    print(k)

In [None]:
len(mysample['input_ids'])

Just based on 3 articles, we produced 256 disjoint chunks of at most 64 tokens each.

By default, chunking produces disjoint chunks, so the last chunk of each article will mostly be shorter (inspect <code>mysample['length']</code>).

But we need chunks exactly as long as context_length, so we'll exclude shorter chunks. We shall also use a stride (overlap between consecutive chunks) to roughly mimic what we did in the pure PyTorch version. We chose to overlap 1/4th of each chunk with the next chunk.
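To make the overlap concrete, here is a pure-Python sketch of windowing with a stride. It assumes that each new chunk starts `max_length - stride` tokens after the previous one, which is how the overlap is understood here; `chunk_with_overlap` is a hypothetical helper and not part of the tokenizer API.

```python
def chunk_with_overlap(token_ids, max_length, stride):
    """Split token_ids into windows of max_length that overlap by `stride` tokens."""
    step = max_length - stride  # how far the window moves each time
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reached the end of the text
    return chunks

toy_ids = list(range(150))  # hypothetical article of 150 token ids
chunks = chunk_with_overlap(toy_ids, max_length=64, stride=16)
lengths = [len(chunk) for chunk in chunks]
print(lengths)

# Keep only full-length chunks, as the tokenize function in the next cell does
full_chunks = [chunk for chunk in chunks if len(chunk) == 64]
print(len(full_chunks))
```

The last 16 tokens of one chunk reappear as the first 16 tokens of the next, and the short final chunk is the one that gets filtered out.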

In [None]:
def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation = True,
        max_length = context_length,
        stride = int(context_length*1/4),
        return_overflowing_tokens = True,
        return_length = True,
        add_special_tokens = False
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

# Tokenizing and chunking the whole dataset into chunks of length exactly equal to context_length
# The original columns are dropped because each article maps to a different number of chunks
tokenized_datasets = sample_data.map(tokenize, batched = True, remove_columns = sample_data.column_names)

In [None]:
tokenized_datasets

Let's do the train-test split since this data doesn't come with a default split.

In [None]:
tokenized_datasets = tokenized_datasets.train_test_split(test_size = 0.2)
tokenized_datasets

We'll use a data collator, a callable that the <code>DataLoader</code> applies to each batch. This is largely redundant for our task, but the training code is written to generalize across tasks, so we need to supply one.

For us it only creates the input (X) and target (y) within a batch during training. It puts the same sequence in both X (<code>input_ids</code>) and y (<code>labels</code>); the one-token shift between input and target is applied inside the model when the loss is computed.

It can also pad unequal-length sequences in a batch, which we don't need since all our chunks have equal length.
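The copy-then-shift behaviour can be sketched in plain Python. This is a simplified illustration with lists instead of tensors and without padding; `collate_for_causal_lm` and `shifted_pairs` are hypothetical helpers, not the library's implementation.

```python
def collate_for_causal_lm(batch):
    """Mimic the collator's role here: labels are simply a copy of input_ids."""
    input_ids = [list(example["input_ids"]) for example in batch]
    labels = [list(ids) for ids in input_ids]
    return {"input_ids": input_ids, "labels": labels}

def shifted_pairs(input_ids, labels):
    """Mimic the shift done at loss time: the token at position t predicts labels[t + 1]."""
    return list(zip(input_ids[:-1], labels[1:]))

toy_batch = [{"input_ids": [10, 11, 12, 13]}]  # hypothetical token ids
collated = collate_for_causal_lm(toy_batch)
pairs = shifted_pairs(collated["input_ids"][0], collated["labels"][0])
print(pairs)  # (current token, token to predict) pairs
```

A chunk of `context_length` tokens therefore yields `context_length - 1` prediction targets.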

In [None]:
# Data collator function for dynamic padding by batch within training loop
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm defaults to True (masked language modeling, like BERT); False gives causal LM labels

In [None]:
# Examining samples
out = data_collator([tokenized_datasets["train"][i] for i in range(2)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

### Model Initialization

We shall use the GPT-2 architecture, which is conveniently available in the hub, so we don't need to write all the model classes in PyTorch. Just to be clear, we are not loading the model weights, which is what <code>AutoModel.from_pretrained()</code> usually does.

In [None]:
# Loading only the model architecture and hyperparameters, not the trained weights
# Also, customizing with the desired transformer size (similar to pure PyTorch)
# This may not work well, because here our data and tokenizer vocabulary are much larger, but let's still use it

from transformers import GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size = len(tokenizer),
    n_ctx = context_length,
    n_positions = context_length,
    n_embd = embed_dim,
    n_head = heads,
    n_layer = layers,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
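As a sanity check on the printed size, the parameter count can be estimated by hand. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, two layer norms per block, output head tied to the token embedding) and a vocabulary of 30,522 tokens for `distilbert-base-uncased`; both are assumptions, and `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate GPT-2 parameters, assuming the LM head shares the token embedding."""
    d = n_embd
    token_embedding = vocab_size * d
    position_embedding = n_positions * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4d, then project back to d
    layer_norms = 2 * (2 * d)                       # two layer norms per block (weight + bias)
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return token_embedding + position_embedding + n_layer * per_block + final_layer_norm

# Values from the configuration cell; the vocabulary size is an assumption
estimate = gpt2_param_count(vocab_size=30522, n_positions=64, n_embd=128, n_layer=3)
embedding_share = 30522 * 128 / estimate
print(f"{estimate / 1000**2:.1f}M parameters, {embedding_share:.0%} in the token embedding")
```

Under these assumptions most of the parameters sit in the token embedding rather than in the transformer blocks, and the number of heads does not change the count.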

### Training

This is the part which is going to be heavy on compute.

We'll now handle some housekeeping stuff (like logins) and then train the model.

In [None]:
# Disable wandb tracking
import os
os.environ["WANDB_DISABLED"] = "true"

# Logging in to huggingface for training (Save your HF API Token in secrets first)

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("HF_TOKEN")

from huggingface_hub import login
login(token=secret_value)
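Outside Kaggle, the token has to come from somewhere else. One possible approach, sketched below, is to look for an environment variable first and fall back to Kaggle secrets only when that package is importable. `get_hf_token` is a hypothetical helper, and using an environment variable named `HF_TOKEN` is an assumption about your setup.

```python
import os

def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from the environment, else from Kaggle secrets, else None."""
    token = os.environ.get(secret_name)
    if token:
        return token
    try:
        # Only available inside Kaggle notebooks
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return None
    return UserSecretsClient().get_secret(secret_name)

# Demonstration with a dummy value; never hard-code a real token like this
os.environ["HF_TOKEN"] = "dummy-token-for-illustration"
token = get_hf_token()
print(token is not None)
```

The returned value could then be passed to `login(token=...)` as in the cell above.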

Before we start training, we can define various training parameters as well as model hosting parameters.

We can also optionally calculate additional evaluation scores like BERTScore during training, but we'll do a comprehensive evaluation later.

In [None]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_model_name, # model name in hub when pushed
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    eval_strategy = "epoch", # to do evaluation at the end of each epoch
    save_strategy = "epoch", # saves checkpoint (weights) at the end of each epoch
    num_train_epochs = num_epochs, 
    weight_decay = 0.1,
    learning_rate = learning_rate,
    fp16 = True, # trains in 16-bit mixed precision instead of the default 32-bit precision
    push_to_hub = True, # this will push the model to the hub
    report_to = "tensorboard" # this will report the metrics to TensorBoard on the hub (including additional metrics like BERTScore if added)
)

trainer = Trainer(
    model = model,
    tokenizer = tokenizer,
    args = args,
    data_collator = data_collator,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["test"],
    # compute_metrics = compute_metrics # Activate it when using any additional metrics
)

Now train the model.

Please note that it will run about <code>num_epochs x (training_size / batch_size)</code> steps. So, be mindful of the available resources and num_epochs.
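The step count can be worked out before launching the run. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept (hence the ceiling); the training size of 100,000 chunks is a made-up number, and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(training_size, batch_size, num_epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(training_size / batch_size)  # last partial batch still counts
    return num_epochs * steps_per_epoch

# Hypothetical training size; use len(tokenized_datasets["train"]) in the notebook
example_steps = total_training_steps(training_size=100_000, batch_size=32, num_epochs=3)
print(example_steps)
```

Doubling `num_epochs` or halving `batch_size` doubles the number of steps.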

In [None]:
# This will complete the training for given number of epochs
trainer.train()

In [None]:
# This will push the final model to the hub with the given commit message
trainer.push_to_hub(commit_message)

### Sample Inference

Note that we have pushed the model to https://huggingface.co/DipanjanSanyal/wikipedia_sample_tiny_gpt2_base

In [14]:
from transformers import pipeline
pipe = pipeline("text-generation", model="DipanjanSanyal/"+output_model_name)

Device set to use mps:0


In [15]:
pipe(
    [
    "Going to school is fun and", 
    "I really like pizza as much as I like holidays because"
    ],
    # temperature = 0.6,
    # truncation = True,
    max_new_tokens = response_tokens
)

[[{'generated_text': "Going to school is fun and from them to the first day. in 2012, the same time after the school was the following year. the club was a third - year - day and was the team's first team, as a total of the title. in the season,"}],
 [{'generated_text': 'I really like pizza as much as I like holidays because by the ability to prevent their second - time - time - time, having a few years, and in the next day of the band\'s own. in 2004, the band said that " the game he was to be " the song ".'}]]

The model did not actually learn much, but because we use a much better tokenizer than in the pure PyTorch version, we are getting at least some English word pairs or triplets.

Looking at the model card details (and TensorBoard), you'll notice that I had also attempted a richer training run, but I paused it to save GPU time. By the time I paused, it had run 15 epochs over many more chunks from the same data. Let's also see how that model performs.

In [16]:
pipe2 = pipeline("text-generation", model="DipanjanSanyal/"+output_model_name, 
                 revision = '87d9aa1eb492a5c20db562f113f07b8f8522f5d2')

Device set to use mps:0


In [19]:
pipe2(
    [
    "Going to school is fun and", 
    "I really like pizza as much as I like holidays because"
    ],
    # temperature = 0.6,
    # truncation = True,
    max_new_tokens = response_tokens
)

[[{'generated_text': 'Going to school is fun ands. the first song is played by a different team, a band of young boys in the series before the 1990s, and has been performed by the british band. the band also appeared in several countries to play in the top 40 charts as well as'}],
 [{'generated_text': 'I really like pizza as much as I like holidays because and other friends of the world. " it is a popular destination for the city in washington.. after a short period, she has been in the role of the " most influential woman ", but as a result of the death of an american writer'}]]

This model generates much more sensible text, which seems to draw on similar articles. But how the model related the input to that set of articles is unclear (how going to school relates to music and a British band, and how holidays and pizza relate to a Washington travel destination and an influential woman).

In any case, hopefully we can imagine how good the model becomes as we continue to train it for longer. We used only a minuscule proportion of Wikipedia, which is itself a minuscule proportion of the internet, and we used a minuscule 4.5M-parameter model. If this is done on internet-scale text with billions of parameters, it can generate the impressive contextual text that we see in modern LLMs.