# Overview

There are two types of language modeling:
* Causal
* Masked

Causal language models are frequently used for text generation. We can use these models for creative applications such as a choose-your-own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.


Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
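To make the left-only attention concrete, here is a minimal sketch in plain Python. It is an illustration only, not how GPT-2 is implemented internally, and the helper name `causal_mask` is hypothetical. Row `i` of the mask marks which positions token `i` may attend to.

```python
def causal_mask(seq_len):
    """Return a lower-triangular mask: entry [i][j] is 1 if position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    # Collect the tokens this position is allowed to see (itself and everything to its left).
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can attend to {visible}")
```

The first token sees only itself, while the last token sees the whole sequence; no token sees anything to its right.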

In this notebook, we finetune DistilGPT2 on a small subset of the ELI5 dataset (loaded below as `eli5_category`).

In [1]:
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install accelerate==0.25.0


In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
# os.environ["WANDB_NOTES"] = "Fine tune model with low rank adaptation"
os.environ["WANDB_NAME"] = "ft-distilGPT2-with-ELI5"
os.environ["MODEL_NAME"] = "distilgpt2"

# For debugging on GPU
# os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # Note: enabling this caused training to stop at the beginning

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `distilgpt2` from `transformers`...
config.json: 100%|█████████████████████████████| 762/762 [00:00<00:00, 5.40MB/s]
┌────────────────────────────────────────────────────┐
│       Memory Usage for loading `distilgpt2`        │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│  147.24 MB  │313.22 MB │      1.22 GB      │
│float16│   73.62 MB  │156.61 MB │     626.44 MB     │
│  int8 │   36.81 MB  │ 78.31 MB │     313.22 MB     │
│  int4 │   18.4 MB   │ 39.15 MB │     156.61 MB     │
└───────┴─────────────┴──────────┴───────────────────┘
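In the table above, the "Training using Adam" column is roughly four times the "Total Size" column. The sketch below only re-does that arithmetic on the printed figures. The factor of 4 is read off the table and labeled as an assumption here; actual training memory also depends on batch size and activations.

```python
# Total model size in MB for each dtype, copied from the table above.
total_size_mb = {"float32": 313.22, "float16": 156.61, "int8": 78.31, "int4": 39.15}

# Assumption: training with Adam needs about 4x the model size
# (weights, gradients, and two optimizer states per parameter).
ADAM_FACTOR = 4

for dtype, size_mb in total_size_mb.items():
    training_mb = size_mb * ADAM_FACTOR
    print(f"{dtype}: ~{training_mb:.2f} MB (~{training_mb / 1024:.2f} GB)")
```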


## Load ELI5 dataset

Start by loading a small subset of the ELI5 dataset: the first 100 examples of the training split. Working with a small subset makes data preparation quicker.

In [4]:
from datasets import load_dataset

# Load a small subset for the demo
eli5=load_dataset("eli5_category", split="train[:100]")


Split the loaded subset into a train and test set with the `train_test_split` method:

In [5]:
eli5=eli5.train_test_split(test_size=0.2)
eli5["train"][0]

{'q_id': '5li6sn',
 'title': 'Why is the moment of death always associated with "the light going out from one\'s eyes"? Does this actually happen or is it just a metaphor?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dbvvkbl',
   'dbw6vrb',
   'dbvy5c3',
   'dbw1qhx',
   'dbvvmx8',
   'dbw723v',
   'dbvylf4',
   'dbw64sm',
   'dbwciko'],
  'text': ["Absolutely not. If you have ever looked into something's eyes when they die the pupils dilate and focus become unfixed. The eyes begin to look waxy. The spark of life you literally see leave their eyes. Its the most heartbreaking thing I have ever seen. I know this because I work with wild animals, wildlife. I have seen many animals die and it never ceases to be something that signifies the intake is dead. It's horrible. I hate it. But it's legitimately also metaphor for the soul leaving the body.",
   "The scientific reason is merely a cumulative effect of small, almost unnoticeable t

## Preprocess

Load the DistilGPT2 tokenizer to process the `text` subfield. Because `text` is nested inside `answers` (see the example above), use the `flatten` method to lift it into its own `answers.text` column.
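As a rough picture of what flattening does to one record, here is a plain-Python sketch. It is not the `datasets` implementation; the function and the sample record are made up for illustration.

```python
def flatten(example, sep="."):
    """Lift nested dict fields to the top level, joining key names with `sep`."""
    flat = {}
    for key, value in example.items():
        if isinstance(value, dict):
            # Recurse into the nested dict and prefix its keys with the parent key.
            for sub_key, sub_value in flatten(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# A shortened, made-up record with the same shape as an ELI5 example.
example = {
    "q_id": "abc123",
    "answers": {"a_id": ["a1", "a2"], "text": ["First answer.", "Second answer."]},
}
flat_example = flatten(example)
print(list(flat_example.keys()))
```

The nested `answers` field disappears and its subfields become top-level columns named `answers.a_id` and `answers.text`.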

In [6]:
from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))
eli5=eli5.flatten()
eli5["train"][0]


{'q_id': '5li6sn',
 'title': 'Why is the moment of death always associated with "the light going out from one\'s eyes"? Does this actually happen or is it just a metaphor?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dbvvkbl',
  'dbw6vrb',
  'dbvy5c3',
  'dbw1qhx',
  'dbvvmx8',
  'dbw723v',
  'dbvylf4',
  'dbw64sm',
  'dbwciko'],
 'answers.text': ["Absolutely not. If you have ever looked into something's eyes when they die the pupils dilate and focus become unfixed. The eyes begin to look waxy. The spark of life you literally see leave their eyes. Its the most heartbreaking thing I have ever seen. I know this because I work with wild animals, wildlife. I have seen many animals die and it never ceases to be something that signifies the intake is dead. It's horrible. I hate it. But it's legitimately also metaphor for the soul leaving the body.",
  "The scientific reason is merely a cumulative effect of small, almost unnoticeable things.

### 1. Tokenizing

Each example's `answers.text` field is a list of answer strings. Instead of tokenizing each answer separately, join the list into a single string so the answers are tokenized together.

In [7]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
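To see what the list comprehension hands to the tokenizer, here is a small sketch on a made-up batch. With `batched=True`, `map` passes each column as a list with one entry per example, so `answers.text` arrives as a list of lists of strings.

```python
# A made-up batch in the shape `map(batched=True)` passes in:
# one list of answer strings per example.
batch = {
    "answers.text": [
        ["Short answer.", "A longer second answer."],
        ["Only one answer here."],
    ]
}

# One string per example, ready to hand to the tokenizer in a single call.
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```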

Next, apply the preprocessing function over the entire dataset with `map`.

In [8]:
tokenized_eli5=eli5.map(
    preprocess_function,
    # Speed up the map function by using the parameters below
    batched=True, # Processing multiple elements of the dataset at once
    num_proc=4, # Increasing the number of processes
    remove_columns=eli5["train"].column_names, # Removing any columns we do not need
)

Map (num_proc=4):   0%|          | 0/80 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1695 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1648 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1543 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1062 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/20 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1145 > 1024). Running this sequence through the model will result in indexing errors


### 2. Splitting into chunks

The dataset now contains token sequences, but some of them are longer than the maximum input length of the model (1024 tokens, as the warnings above show). A second preprocessing function concatenates all the sequences and splits the result into shorter chunks of `block_size` tokens. `block_size` should be both shorter than the maximum input length and small enough for your GPU RAM.

In [9]:
block_size=128

def group_texts(examples):
    concatenated_examples={k: sum(examples[k], []) for k in examples.keys()}
    total_length=len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. We could pad
    # instead if the model supported it; customize this part to your needs.
    if total_length>=block_size:
        total_length=(total_length//block_size)*block_size
    # Split by chunks of block_size
    result={
        k: [t[i: i+block_size] for i in range(0, total_length, block_size)] for k,t in concatenated_examples.items()
    }
    
    result["labels"]=result["input_ids"].copy()
    return result
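A toy run makes the concatenate-then-chunk behavior easier to follow. The sketch below repeats the same logic with `block_size` passed as a parameter (the name `group_into_blocks` and the token ids are made up) and uses a block size of 4 so the result fits on one line.

```python
def group_into_blocks(examples, block_size):
    # Concatenate every sequence in the batch into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Slice each column into consecutive chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three made-up tokenized examples of different lengths (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1, 1, 1], [1, 1, 1], [1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])
```

The ten tokens become two blocks of four, and the last two tokens are dropped. Note that blocks can span the boundary between two original examples.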

Apply the group_texts function over the entire dataset.

In [10]:
lm_dataset=tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/80 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/20 [00:00<?, ? examples/s]

Create a batch of examples using `DataCollatorForLanguageModeling`. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length. Use the end-of-sequence token as the padding token and set `mlm=False`, which uses the inputs as labels shifted to the right by one element.

In [11]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token=tokenizer.eos_token
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm=False)

In [12]:
data_collator

DataCollatorForLanguageModeling(tokenizer=GPT2TokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}, mlm=False, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='pt')
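The following plain-Python sketch shows the idea behind dynamic padding and label creation for causal language modeling. It is a simplified illustration, not the library's implementation: the `collate` helper and the token ids are made up, while `50256` and `-100` are the `<|endoftext|>` id shown above and the label value conventionally ignored by the loss.

```python
PAD_ID = 50256  # <|endoftext|>, reused as the padding token
IGNORE_INDEX = -100  # label value that the loss ignores


def collate(batch):
    """Pad each sequence to the longest one in the batch and build labels."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    # Labels are a copy of the inputs, with padding positions masked out.
    labels = [[IGNORE_INDEX if t == PAD_ID else t for t in ids] for ids in input_ids]
    return {"input_ids": input_ids, "labels": labels}


out = collate([[11, 12, 13], [21, 22]])
print(out)

# The one-position shift happens when the loss is computed:
# the prediction at position i is compared with the token at position i + 1.
first_inputs = out["input_ids"][0][:-1]
first_targets = out["labels"][0][1:]
print(first_inputs, "->", first_targets)
```

Because `group_texts` already produces blocks of `block_size` tokens, the batches in this notebook need little or no padding.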

## Training


In [13]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model=AutoModelForCausalLM.from_pretrained(os.getenv("MODEL_NAME"))
model.config


GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 50257
}

At this point, only three steps remain:

* Define your training hyperparameters in `TrainingArguments`. The only required parameter is `output_dir`, which specifies where to save the model.
* Pass the training arguments to `Trainer` along with the model, datasets, and data collator.
* Call `train()` to finetune the model.

In [14]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)
trainer.train()

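| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 4.001085        |
| 2     | No log        | 3.988127        |
| 3     | No log        | 3.980833        |
| 4     | No log        | 3.977145        |
| 5     | No log        | 3.978417        |
| 6     | No log        | 3.976121        |
| 7     | No log        | 3.977854        |
| 8     | No log        | 3.979368        |
| 9     | No log        | 3.979689        |
| 10    | No log        | 3.979581        |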


TrainOutput(global_step=170, training_loss=3.8116893095128677, metrics={'train_runtime': 80.7421, 'train_samples_per_second': 31.954, 'train_steps_per_second': 2.105, 'total_flos': 84268202065920.0, 'train_loss': 3.8116893095128677, 'epoch': 10.0})

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

In [15]:
import math

eval_results=trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 53.49
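Perplexity is the exponential of the average per-token cross-entropy loss. As a quick check, the sketch below recomputes it from the epoch-10 validation loss in the training log, on the assumption that `trainer.evaluate()` returned that same loss.

```python
import math

# Validation loss after epoch 10, copied from the training log above.
eval_loss = 3.979581

# Perplexity is exp(average cross-entropy loss per token).
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

A perplexity near 53 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 53 tokens at each step.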


In [16]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))


CommitInfo(commit_url='https://huggingface.co/aisuko/ft-distilGPT2-with-ELI5/commit/8088510a2a70f1f4ceb11d28c708bac040a056c6', commit_message='ft-distilGPT2-with-ELI5', commit_description='', oid='8088510a2a70f1f4ceb11d28c708bac040a056c6', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Come up with a prompt you'd like to generate text from:

In [17]:
prompt="Somatic hypermutation allows the immune system to"

The simplest way to try out our finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with our model, and pass our text to it:

In [18]:
from transformers import pipeline

generator = pipeline("text-generation", model="aisuko/"+os.getenv("WANDB_NAME"))
generator(prompt)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Somatic hypermutation allows the immune system to detect diseases. The immune system is basically a sort of 'laser-in-the-air' weapon but also a form of electroshock (EE) that is used to cause skin irritation and"}]

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [19]:
tokenizer=AutoTokenizer.from_pretrained("aisuko/"+os.getenv("WANDB_NAME"))
inputs=tokenizer(prompt, return_tensors="pt").input_ids

Use the generate() method to generate text.

In [20]:
model=AutoModelForCausalLM.from_pretrained("aisuko/"+os.getenv("WANDB_NAME"))
outputs=model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
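The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below illustrates the idea on a made-up next-token distribution. It is a simplified stand-in, not the code `generate()` runs, and the helper name `top_k_top_p_filter` is hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Step 1 (top-k): keep the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2 (top-p): keep tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 before sampling.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token probabilities, for illustration only.
next_token_probs = {"detect": 0.5, "fight": 0.3, "adapt": 0.15, "banana": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)
print(filtered)
```

With `top_k=3` the least likely token is removed first; with `top_p=0.8` only the two most likely tokens remain, and the next token would be sampled from those two.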


Decode the generated token ids back into text:

In [21]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Somatic hypermutation allows the immune system to control and control specific viruses, viruses, parasites and other types of organisms. There are certain antiviral conditions, such as low levels of immune function, and other things that can affect how the immune system functions, including immune system functions. So viruses that are not the only ones. All these things are common in different species of viruses, so most of them are immune and can't be prevented. In the modern era, viruses aren't viruses anymore. It's just now that viruses are more common and"]