# Training Transformers from Scratch

In the opening paragraph of this book, we mentioned GitHub Copilot, which uses GPT-like transformers to perform code autocompletion, a feature that is particularly useful when programming in a new language or framework, when learning to code, or for automatically producing boilerplate code.

In notebook-5 we looked at different decoding strategies and sampling methods for generating quality text. In this notebook, we'll build our very own GPT-like model for generating Python source code! We'll call the resulting model *CodeParrot*.

Unlike the multilingual NER task, where we had little data for a few languages and used transfer learning to compensate, here we have loads of data. We'll therefore explore the pretraining step itself and learn how to train a transformer from scratch. In this notebook, we'll cover the following aspects of training that we haven't considered yet:
* Gathering and processing a very large dataset
* Creating a custom tokenizer for our dataset
* Training a model on multiple GPUs at scale

To efficiently train large models with billions of parameters, we'll need special tools for distributed training. Although the `Trainer` from the Transformers library supports distributed training, we'll use Hugging Face's Accelerate library (built on top of PyTorch) to showcase its power. We'll end up touching on some of the largest NLP models, but let's find a sufficiently large dataset first.


## Large Datasets and Where to Find Them

There are many domains where large amounts of data may be at hand, ranging from biomedical datasets to programming codebases. In most cases these datasets are unlabeled, and their large size means that they can usually only be labeled through heuristics (rules distilled from past labeling experience) or by using accompanying metadata that is stored during the gathering process.

Nevertheless, a large corpus is useful even when it is unlabeled or only heuristically labeled. For instance, it can be used to fine-tune a language model for domain adaptation.

The decision between fine-tuning and training from scratch depends on two things:

1. How large is the fine-tuning corpus?
2. How different is the domain of the corpus from the domain the pretrained model was trained on?

Using a pretrained model forces you to use the tokenizer that was trained with it. If that tokenizer was trained on a corpus from another domain, it is typically suboptimal.

For example, using GPT's tokenizer on legal documents, other languages, or even completely different sequences such as musical notes or DNA sequences will result in poor tokenization.
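To make "poor tokenization" concrete, here is a minimal sketch using a toy greedy longest-match tokenizer. It is not GPT's actual byte-pair encoding; the function `greedy_tokenize`, the tiny vocabulary, and both sample strings are hypothetical and only illustrate the effect: text from the vocabulary's own domain compresses into a few tokens, while an out-of-domain sequence falls apart into single characters.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest pieces found in vocab, else single characters."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary, standing in for one learned from English prose
english_vocab = {"the ", "model ", "is ", "train", "ed ", "on ", "text"}

in_domain = "the model is trained on text"
out_of_domain = "GATTACAGGCTTAACGTTAGCCATGACT"  # DNA-like sequence of equal length

for name, sample in [("English", in_domain), ("DNA", out_of_domain)]:
    tokens = greedy_tokenize(sample, english_vocab)
    print(f"{name}: {len(sample)} characters -> {len(tokens)} tokens")
```

Longer token sequences for the same amount of text mean less content fits into the model's context window and each token carries less meaning, which is one reason a domain-specific tokenizer can be worth training.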

As the amount of training data we have inches closer to the amount used for pretraining, training the model and tokenizer from scratch becomes an interesting option (provided the necessary compute resources are available).

Before we discuss pretraining objectives, we need to build a large corpus, which comes with its own challenges. Let's explore that next.

In [None]:
!pip install transformers


### Challenges of Building a Large-Scale Corpus

The quality of a pretrained model depends on the pretraining corpus itself, as the model inherits any defects in that corpus. Hence, before creating one, let's become aware of the common issues and challenges associated with building large corpora for pretraining.

```
Pretraining Corpus[good/bad] ---Training---> Pretrained model[good/bad]
```

***1. Can we be aware of what's inside a very large dataset?***

*As a dataset grows larger and larger, the chances of having full control over, or even a precise idea of, what is inside it diminish.*

***2. How is a large dataset created, and what does that tell us about our visibility into it?***

* *It is usually not created by dedicated people who craft one sample at a time while being aware and knowledgeable of the full pipeline and the task that the machine learning model will be applied to.*
* *It is much more likely to be created in an automatic or semiautomatic way, by collecting data generated as a side effect of other activities. For example, it may consist of all the documents (contracts, purchase orders, etc.) that a company stores, logs from user activities, or data gathered from the internet.*

***3. What are the consequences of creating corpora with such a high degree of automation?***

* *There is limited control over the content and the way it is created, which increases the risk of training a model on biased or lower-quality data.*
* *Recent investigations of large-scale datasets like BookCorpus and C4, which were used to train BERT and T5 respectively, have uncovered (among other things) that:*
    * A significant proportion of the C4 corpus is machine-translated rather than translated by humans.
    * Disparate erasure of African-American English as a result of stopword filtering in C4 resulted in an underrepresentation of such content.
    * It is typically difficult to find a middle ground between including (often too much) sexually or otherwise explicit content and totally erasing all mention of sexuality or gender. As a surprising consequence, a rather common word like *sex* (which can have both neutral and explicit meanings) is completely unknown to a tokenizer trained on C4, since the word is entirely absent from the corpus.
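The last point can be illustrated with a toy sketch. This is not the actual C4 pipeline: the mini-corpus, the single-entry blocklist, and the helper names `keep` and `build_vocab` are all hypothetical. The sketch assumes a document-level filter that drops every document containing a blocklisted word, and shows how a vocabulary built from the filtered corpus loses the word entirely, along with unrelated words from the discarded documents.

```python
from collections import Counter

# Hypothetical mini-corpus and blocklist, for illustration only
documents = [
    "the survey recorded age and sex of each participant",
    "sex education is part of the school curriculum",
    "the weather was sunny all week",
    "participants reported their age",
]
blocklist = {"sex"}

def keep(document, blocklist):
    """Document-level filter: drop the whole document if any word is blocklisted."""
    return not any(word in blocklist for word in document.split())

def build_vocab(docs):
    """Count whitespace-separated words across all documents."""
    return Counter(word for doc in docs for word in doc.split())

filtered = [doc for doc in documents if keep(doc, blocklist)]
vocab_before = build_vocab(documents)
vocab_after = build_vocab(filtered)
collateral = sorted(set(vocab_before) - set(vocab_after) - blocklist)

print(f"Documents kept: {len(filtered)} of {len(documents)}")
print(f"'sex' in vocabulary before filtering: {'sex' in vocab_before}")
print(f"'sex' in vocabulary after filtering: {'sex' in vocab_after}")
print(f"Other words lost along the way: {collateral}")
```

Both discarded documents use the word in a neutral sense, which is exactly the kind of collateral damage an aggressive filter can cause.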

These discrepancies might not be a problem if the downstream task calls for such a skew. For example, BookCorpus contains a strong overrepresentation of romance novels, and if a model is intended to be a romance novel writing tool, this skew works in its favor.

Let's check how the dataset skews a model by comparing GPT and GPT-2 on the same prompt. We use similar-sized versions of the two models, so the main difference is the training data: BookCorpus for GPT versus web pages, blogs, and news articles linked from Reddit for GPT-2.

We'll use the `text-generation` pipeline to investigate the model outputs.

In [None]:
from transformers import pipeline, set_seed

generation_gpt = pipeline("text-generation", model="openai-gpt")
generation_gpt2 = pipeline("text-generation", model="gpt2")

In [None]:
# Function to calculate the total number of parameters in the model
def model_size(model):
  return sum(param.numel() for param in model.parameters())

print(f"GPT Size: {model_size(generation_gpt.model)/1000**2:.1f}M parameters")
print(f"GPT2 size: {model_size(generation_gpt2.model)/1000**2:.1f}M parameters")

GPT Size: 116.5M parameters
GPT2 size: 124.4M parameters
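
As a sanity check on the GPT-2 figure, the count can be reproduced by hand from the architecture. The sketch below assumes the published hyperparameters of the smallest GPT-2 checkpoint (vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, 12 layers) and that the output projection shares its weights with the token embeddings, so it is counted only once. The variable and helper names are our own.

```python
# Assumed hyperparameters of the smallest GPT-2 checkpoint
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # one scale and one shift vector
embeddings = vocab_size * d_model + n_positions * d_model  # token + position
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)  # QKV + output
feed_forward = linear(d_model, 4 * d_model) + linear(4 * d_model, d_model)
block = attention + feed_forward + 2 * layer_norm

# Final layer norm included; the tied output projection adds no new parameters
total = embeddings + n_layers * block + layer_norm
print(f"GPT-2 small: {total / 1000**2:.1f}M parameters")
```

Roughly a third of the parameters sit in the embedding matrices, which is why the vocabulary size has a noticeable effect on the size of small models.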


We're comparing the original GPT model with the smallest GPT-2 model, which have roughly the same number of parameters. Next, let's generate three different completions from each model with the same input prompt.

In [None]:
import transformers

def enum_pipeline_outputs(
    pipe: transformers.Pipeline,
    prompt: str,
    num_return_sequences: int
    ) -> str:
  """
  Generate text with a text-generation pipeline and number the results.

  Args:
    pipe (transformers.Pipeline): Text-generation pipeline used to generate text
    prompt (str): Input text prompt to generate text from
    num_return_sequences (int): Number of sequences to generate

  Returns:
    str: The generated sequences, numbered and separated by newlines
  """

  out = pipe(
      prompt,
      num_return_sequences=num_return_sequences,
      clean_up_tokenization_spaces=True,
      )
  # Prefix each completion with its 1-based index
  return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))

In [None]:
# Let's generate some text using the function defined in the cell above
prompt = "\nWhen they came back"
gpt_completions = enum_pipeline_outputs(
    pipe=generation_gpt,
    prompt=prompt,
    num_return_sequences=3
)
gpt_2_completions = enum_pipeline_outputs(
    pipe=generation_gpt2,
    prompt=prompt,
    num_return_sequences=3
)
print(f"GPT completions: \n {gpt_completions}")
print(f"GPT-2 completions: \n {gpt_2_completions}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT completions: 
 1.
When they came back, the next one would be the best. 
 the first and only person who might pped a gun was david becker. 
 becker stared up at the gray beams from the ceiling. he was terrified. the light was blinding. it
2.
When they came back, she was ready. " 
 my jaw had dropped. " you've been watching me? " 
 " uh - huh, " he smiled. " i found your address on the phone. " 
 that made sense. i
3.
When they came back. 
 i would find out soon enough. but right now, my mind was busy processing all the information i was learning to deal with at a late stage of the journey in an uncomfortable sort of way. once we were in the
GPT-2 completions: 
 1.
When they came back to look over their shoulders they noticed a tiny black bear and she ran away; I thought they'd be worried, though the bear had a very unusual, large, tail with a large sharp sharp-edged claw.
The
2.
When they came back to him at the top of the stairs.

"I want to say this to anyone in the world out ther

Looking at these few samples, we can already see the romance skew in the GPT generations, which typically imagine an interaction between a man and a woman. GPT-2, on the other hand, was trained on web text linked to and from Reddit articles and mostly adopts a neutral *they* in its generations, which contain "blog-like" or adventure-related elements.

In general, any model trained on a dataset will reflect the language bias and the over- or underrepresentation of populations and events in its training data. These biases in the model's behavior are important to take into consideration with regard to the target audience interacting with the model.
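One crude way to put a number on the skew observed above is to count gendered versus neutral pronouns in the completions. The sketch below is only a rough proxy, not an established bias metric: the pronoun lists and the `pronoun_counts` helper are our own assumptions, and the two strings are short excerpts copied from the outputs shown earlier. A handful of samples proves nothing statistically, so a real analysis would need many prompts and generations.

```python
import re
from collections import Counter

# Assumed pronoun lists; deliberately small and English-only
GENDERED = {"he", "she", "him", "her", "his", "hers"}
NEUTRAL = {"they", "them", "their", "theirs"}

def pronoun_counts(text):
    """Count gendered and neutral third-person pronouns in text."""
    counts = Counter(gendered=0, neutral=0)
    # Splitting on non-letters turns "they'd" into "they" and "d"
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in GENDERED:
            counts["gendered"] += 1
        elif word in NEUTRAL:
            counts["neutral"] += 1
    return counts

# Excerpts from the completions printed above
gpt_sample = (
    'When they came back, she was ready. " '
    '" uh - huh, " he smiled. " i found your address on the phone. "'
)
gpt2_sample = (
    "When they came back to look over their shoulders they noticed a tiny "
    "black bear and she ran away; I thought they'd be worried"
)

for name, sample in [("GPT", gpt_sample), ("GPT-2", gpt2_sample)]:
    counts = pronoun_counts(sample)
    print(f"{name}: gendered={counts['gendered']}, neutral={counts['neutral']}")
```

Note that the shared prompt already contributes one *they* to every completion, so only the pronouns beyond that reflect the model's own choices.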