<a href="https://colab.research.google.com/github/Taaniya/exploring-gpt2-language-model/blob/main/fine_tuning_DistilGPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is a recreation of the original Hugging Face notebook available [here](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb), with some modifications made for exploration, learning and experimentation.

In [None]:
! nvidia-smi

Sat Sep 24 18:16:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
! pip install transformers
! pip install datasets

In [2]:
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from datasets import ClassLabel

import math
import random
import pandas as pd
from IPython.display import display, HTML


In [None]:
print(transformers.__version__)

4.15.0


In this notebook, we'll see how to fine-tune one of the 🤗 Transformers models on a language modeling task. We cover the following type of language modeling task:

### Causal language modeling: 
[Causal language modelling](https://huggingface.co/docs/transformers/tasks/language_modeling), the standard language modelling objective, is the left-to-right task in which the model predicts the next token given the preceding sequence of tokens. The target at each position is the next token, so the loss is computed over every next-token prediction.

Since the target is the next token in the sentence, the labels are the inputs shifted by one position: the label at position $i$ is the input token at position $i+1$. To make sure that the model does not cheat during training, it uses an attention mask that prevents it from accessing any token after index $i$ while predicting the token at index $i+1$.
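
To make the mask and the targets concrete, here is a minimal sketch in plain Python. It is an illustration only, not the library's implementation: the helper names causal_mask and next_token_targets are hypothetical, and the token ids are arbitrary sample values.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len table; entry [i][j] is True when
    position i is allowed to attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def next_token_targets(token_ids):
    """Pair every input token with the token that follows it,
    which is what the model is trained to predict at that position."""
    return list(zip(token_ids[:-1], token_ids[1:]))


token_ids = [796, 569, 18354, 7496]  # arbitrary sample ids
mask = causal_mask(len(token_ids))
targets = next_token_targets(token_ids)

# Print the mask row by row: "x" marks a visible position, "." a hidden one.
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
print(targets)
```

Each row of the mask reveals one more token than the row before it, and the last token has no target because nothing follows it.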

### Preparing the dataset
We use the WikiText-2 dataset, loaded in the next cell. You can replace it with any dataset hosted on the hub, or use your own files by uncommenting the commented-out cell that follows it and replacing the paths with ones that lead to your files:

In [None]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...


Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.



In [None]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [None]:
# To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
datasets

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,= = Election of Clement XIV = = \n
1,"The university has two buildings in downtown Fort Lauderdale , both of which are considered part of one Fort Lauderdale campus . The Askew Tower ( AT ) and the Higher Education Complex ( HEC ) on Las Olas Boulevard . The campus offers courses in communication , graphic design , architecture , and urban and regional planning . The campus is home to approximately 900 students or 3 @.@ 2 % of the university 's student body . \n"
2,"In 1913 , a Romanesque Revival church designed by Nachtigall was built on the existing basement . The cornerstone was laid and construction begun in May ; the church was completed by the end of November , and formally dedicated on December 4 . The cost of construction was about $ 75 @,@ 000 . While the church was under construction , Catholic services were held in Madison 's armory . \n"
3,
4,
5,"4 ) , both of which are dangerously explosive and powerful oxidizing agents , and xenon dioxide ( XeO2 ) , which was reported in 2011 with a coordination number of four . XeO2 forms when xenon tetrafluoride is poured over ice . Its crystal structure may allow it to replace silicon in silicate minerals . The XeOO + cation has been identified by infrared spectroscopy in solid argon . \n"
6,
7,"Townsend 's next project took several years to come to fruition . After the creation of the IR8 demo tape , Townsend and Jason Newsted had begun work on a new project called Fizzicist , which they described as "" heavier than Strapping Young Lad "" . When the IR8 tape was leaked , Newsted 's Metallica bandmates James Hetfield and Lars Ulrich learned of the project . Hetfield was "" fucking pissed "" that Newsted was playing outside the band , and Newsted was prevented by his bandmates from working on any more side projects . With the project stalled , Townsend instead wrote the album himself , entitling it Physicist . Townsend assembled his Strapping Young Lad bandmates to record it , the only time this lineup was featured on a Devin Townsend album . The thrash @-@ influenced Physicist was released in June 2000 , and is generally considered a low point in Townsend 's career . Hoglan and the rest of the band were dissatisfied with the way the sound was mixed , and Townsend considers it his worst album to date . \n"
8,= = Music video = = \n
9,"Brooks conducted copious research while writing World War Z. The technology , politics , economics , culture , and military tactics were based on a variety of reference books and consultations with expert sources . Brooks also cites the U.S. Army as a reference on firearm statistics . \n"


As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

### Causal Language Modelling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:

`part of text 1`

or 

`end of text 1 [BOS_TOKEN] beginning of text 2`

depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left by one position, so that each position is paired with the token that follows it.

We will use the [distilgpt2](https://huggingface.co/distilgpt2) model for this example. GPT2 was pre-trained in an unsupervised manner on the causal language modelling task, and we will fine-tune distilgpt2, its distilled version, with the same objective.

Refer to the earlier [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) paper for details on the pre-training settings.

In [None]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the AutoTokenizer class:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the map method from the Datasets library. First we check the vocabulary size, then define a function that calls the tokenizer on our texts:

In [None]:
tokenizer.vocab_size

50257

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our datasets object, using batched=True and 4 processes to speed up the preprocessing. We won't need the text column afterward, so we discard it.

In [None]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
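
With batched=True, the function passed to map receives a dictionary of columns, each holding a list of values for the whole batch, and returns a dictionary of new columns. The sketch below imitates that contract with a toy whitespace tokenizer. Both toy_tokenize_function and toy_vocab are hypothetical stand-ins used for illustration; they are not part of 🤗 Datasets or 🤗 Transformers.

```python
toy_vocab = {}  # hypothetical vocabulary, filled in as new words are seen


def toy_tokenize_function(examples):
    """Mimic a batched map function: take {"text": [str, ...]} and return
    one list of ids and one attention mask per input text."""
    input_ids = []
    for text in examples["text"]:
        # Assign the next free id to every word not seen before.
        ids = [toy_vocab.setdefault(word, len(toy_vocab)) for word in text.split()]
        input_ids.append(ids)
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# One batch with three rows, including an empty line like those in the dataset.
batch = {"text": ["the game began", "", "the game"]}
encoded = toy_tokenize_function(batch)
print(encoded)
```

Empty lines produce empty id lists, which is harmless here because the texts are concatenated in a later step.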

In [None]:
# If we now look at an element of our datasets, we will see the text has been replaced by the input_ids the model will need:

tokenized_datasets["train"][1]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198]}

Now for the harder part: we need to concatenate all our texts together then split the result into small chunks of a certain block_size. To do this, we will use the map method again, with the option batched=True. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a bit less at just 128.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library apply the one-position shift internally, so we don't need to do it manually.

Also note that by default, the map method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of block_size every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
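
The effect of dropping the remainder is easier to see on a tiny batch. The following is a simplified sketch of the same grouping logic on plain lists of token ids, with a block size of 4 instead of 128. The function group_into_blocks is a hypothetical helper written for this illustration.

```python
def group_into_blocks(token_lists, block_size):
    """Concatenate the token lists of one batch and cut the result into
    blocks of exactly block_size tokens, dropping the leftover tokens."""
    concatenated = [tok for tokens in token_lists for tok in tokens]
    usable = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i:i + block_size] for i in range(0, usable, block_size)]
    dropped = len(concatenated) - usable
    return blocks, dropped


# Four tokenized texts of different lengths (one of them empty).
batch = [[1, 2, 3], [], [4, 5, 6, 7, 8], [9, 10]]
blocks, dropped = group_into_blocks(batch, block_size=4)
print(blocks)
print(dropped)
```

Because the remainder is dropped once per batch, fewer than block_size tokens are lost for every batch sent to the function.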

And we can check our datasets have changed: now the samples contain chunks of block_size contiguous tokens, potentially spanning over several of our original texts.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Oz'

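We use a data collator to assemble the batches. It can pad inputs dynamically to the longest sequence within a batch rather than in the whole dataset, although here every example already has exactly block_size tokens, so no padding is added. With mlm set to False (which selects the causal language modelling objective), the data collator for language modelling copies the input ids into the labels and replaces padding positions with -100 so that the loss ignores them. The shift by one position is applied inside the model.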

In [None]:
# set EOS token as pad_token
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
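
As a rough illustration of how the labels are used for causal language modelling, the sketch below mirrors two steps in plain Python: replacing padding positions in the labels with -100 (the value the loss ignores), and pairing the prediction made at each position with the label one position later, as the model does internally. The function names and the pad id are hypothetical sample values, and the real implementation operates on tensors.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(input_ids, pad_token_id):
    """Copy the input ids into labels, replacing padding with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def shifted_pairs(labels):
    """Return (position, target) pairs: the prediction made at `position`
    is scored against labels[position + 1], unless that label is ignored."""
    return [
        (pos, labels[pos + 1])
        for pos in range(len(labels) - 1)
        if labels[pos + 1] != IGNORE_INDEX
    ]


PAD = 0  # arbitrary pad id for this toy example
input_ids = [11, 12, 13, PAD, PAD]
labels = mask_padding(input_ids, PAD)
pairs = shifted_pairs(labels)
print(labels)
print(pairs)
```

Only the positions whose following token is real text contribute to the loss, and padded positions are skipped.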

Now that the data has been prepared, we're ready to set up our Trainer. First, we instantiate a model:

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [None]:
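# Note: num_parameters is a method. Without parentheses, the notebook displays the bound method
# (and with it the model architecture); call model.num_parameters() to get the parameter count.
model.num_parameters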

<bound method ModuleUtilsMixin.num_parameters of GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_d

In [None]:
# And some TrainingArguments:

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

The last argument, push_to_hub, controls whether the model is pushed to the Hub regularly during training. It is set to False here, so nothing is uploaded; set it to True only if you have logged in to the Hub. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the hub_model_id argument to set the repo name (it needs to be the full name, including your namespace: for instance "sgugger/gpt-finetuned-wikitext2" or "huggingface/gpt-finetuned-wikitext2").

We pass along all of those to the Trainer class:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator
)

Before training our model, we can inspect the full set of training arguments:

In [None]:
trainer.args

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=distilgpt2-finetuned

The Trainer uses the [Adam optimizer with weight decay](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.AdamW):

In [None]:
trainer.optimizer

AdamW (
Parameter Group 0
    betas: (0.9, 0.999)
    correct_bias: True
    eps: 1e-08
    initial_lr: 2e-05
    lr: 0.0
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    correct_bias: True
    eps: 1e-08
    initial_lr: 2e-05
    lr: 0.0
    weight_decay: 0.0
)
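
The lr: 0.0 shown above is consistent with a learning rate that has decayed to zero by the end of training. Here is a minimal sketch, assuming a linear decay with no warmup. The truncated argument dump in this notebook does not show the scheduler setting, so treat the schedule as an assumption. The helper linear_lr is hypothetical, and the total step count is taken from the training log of this notebook.

```python
def linear_lr(step, total_steps, initial_lr, warmup_steps=0):
    """Learning rate at a given step under linear warmup (optional)
    followed by linear decay to zero at total_steps."""
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 7002  # "Total optimization steps" reported in the training log
for step in (0, 3501, 7002):
    print(step, linear_lr(step, total_steps, initial_lr=2e-5))
```

Under this assumption the rate starts at the configured 2e-5, is halved at the midpoint and reaches zero at the last step.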

In [None]:
trainer.train()

***** Running training *****
  Num examples = 18666
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7002


Epoch,Training Loss,Validation Loss
1,3.7602,3.666851
2,3.633,3.645479
3,3.6078,3.642323


Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-500
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-500/config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-500/pytorch_model.bin
Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-1000
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-1000/config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-1500
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-1500/config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-2000
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-2000/config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-2000/pytorch_model.bin
***** Running Evaluation **

TrainOutput(global_step=7002, training_loss=3.694407620657447, metrics={'train_runtime': 1546.2368, 'train_samples_per_second': 36.216, 'train_steps_per_second': 4.528, 'total_flos': 1829011929956352.0, 'train_loss': 3.694407620657447, 'epoch': 3.0})
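
The step count and throughput figures in this log can be cross-checked with a little arithmetic. The sketch below uses only numbers reported above and assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (dataloader_drop_last=False in the training arguments).

```python
import math

num_examples = 18666       # "Num examples" in the training log
batch_size = 8             # total train batch size
num_epochs = 3
train_runtime = 1546.2368  # seconds, from TrainOutput

# The last batch of each epoch is smaller than 8, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
```

The computed total matches the 7002 optimization steps in the log, and the two rates agree with train_samples_per_second and train_steps_per_second.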

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


Perplexity: 38.18
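
Perplexity is the exponential of the mean negative log-likelihood per token, which is why it is obtained here by exponentiating the evaluation loss. The sketch below shows the relationship on a toy example and on the validation loss reported after the third epoch. The two helper functions are hypothetical names introduced for this illustration.

```python
import math


def perplexity_from_loss(mean_nll):
    """Perplexity given the mean negative log-likelihood (cross-entropy loss)."""
    return math.exp(mean_nll)


def perplexity_from_probs(token_probs):
    """Perplexity given the probability the model assigned to each correct token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Validation loss after epoch 3, taken from the training table above.
print(f"{perplexity_from_loss(3.642323):.2f}")

# Toy case: the model gave the correct tokens probabilities 0.5 and 0.25.
print(f"{perplexity_from_probs([0.5, 0.25]):.3f}")
```

A perplexity of about 38 means that, on average, the model is as uncertain as if it were choosing uniformly among roughly 38 tokens at each position.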


#### Summary of training settings to fine-tune

* Data source - [Wiki Text](https://huggingface.co/datasets/wikitext), configuration wikitext-2-raw-v1: text from Wikipedia articles (download size 4.50 MiB, generated size 12.90 MiB)
* training examples - 18666 blocks of 128 tokens
* validation examples - 1931 blocks of 128 tokens
* no. of epochs - 3
* batch size - 8
* learning rate - 2e-05
* weight decay - 0.01
* example sequence length - 128
* optimizer - Adam optimizer with weight decay
* model vocab size - 50257

### References

* https://tecknoworks.com/30-surprising-business-questions-data-can-answer/
* https://www.datapine.com/blog/analytics-and-business-intelligence-examples/
* [Training large models (Huggingface): Introduction, tools & examples](https://huggingface.co/transformers/v1.2.0/examples.html#introduction)
* https://huggingface.co/docs/transformers/tasks/language_modeling
* [GPT2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
* [GPT paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)