<a href="https://www.kaggle.com/code/aisuko/causal-language-modelling-nlp?scriptVersionId=164642813" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

There are two types of language modeling: `causal` and `masked`. `Causal language models` are frequently used for `text generation`. You can use these models for creative applications like a choose-your-own text adventure or an intelligent coding assistant like Copilot or CodeParrot. Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model. More detail is available in [Decoder Architectures](https://www.kaggle.com/code/aisuko/neural-network-architecture-transformers). In this notebook, we are going to fine-tune a pretrained text-generation model on a corresponding dataset.
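
As a rough illustration of "can only attend to tokens on the left", the sketch below builds a causal attention mask in plain Python. The helper name `causal_mask` and the three-word example are assumptions made for illustration; they are not part of the `transformers` API, which builds this mask internally.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if position i may attend to position j."""
    # Position i may only look at itself and at earlier positions (j <= i).
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat"]  # toy example, not real tokenizer output
for token, row in zip(tokens, causal_mask(len(tokens))):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can attend to {visible}")
```

The mask is lower-triangular: the first token sees only itself, while the last token sees the whole prefix.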

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model distilgpt2"
os.environ["WANDB_NAME"] = "ft-distilGPT2-with-askscience"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Load ELI5 dataset

We will start by loading a small subset of the ELI5 dataset: the first 500 examples of the `train` split of `eli5_category`. This gives us a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [3]:
from datasets import load_dataset

eli5= load_dataset("eli5_category", split="train[:500]")

Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

Split the loaded `train` subset into train and test sets with the `train_test_split` method:

In [4]:
eli5=eli5.train_test_split(test_size=0.2)
eli5["train"][0]

{'q_id': '5lgx81',
 'title': 'how do people hold their breath for so long without passing out/dying',
 'selftext': 'The official record for breath held underwater is 24 minutes and 3 seconds. HOW?!',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dbvm1nh', 'dbvpucx'],
  'text': ['They train to perform well for their sport. If you do something that is challenging repeatedly your body adapts. So their bodies are more efficient with oxygen, and they probably have a large lung capacity and good control over their heart rate. Some of these records are also set by breathing in pure oxygen.',
   "There's a video out there of a ted talk David Blaine did explaining how he got his record of holding his breath under water. He talks about the process and training of it. Edit: here's the link URL_0"],
  'score': [12, 3],
  'text_urls': [[],
   ['https://www.ted.com/talks/david_blaine_how_i_held_my_breath_for_17_min']]},
 'title_urls': ['url'],
 'selftext_urls': ['

Although there are lots of text fields, language modeling does not need separate labels, because the next word is the label.

This is known as an unsupervised (more precisely, self-supervised) task: the model predicts the next token in a sequence of tokens without the need for labeled data. This approach makes it possible to build NLP models using little to no annotated data, and to reuse the knowledge embedded in large pretrained language models.
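
To make "the next word is the label" concrete, here is a minimal sketch that turns one token sequence into (context, target) training pairs. The function name `next_token_examples` and the token ids are hypothetical and only serve the illustration.

```python
def next_token_examples(token_ids):
    """Pair every non-empty prefix of the sequence with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


token_ids = [464, 3797, 3332, 319]  # hypothetical ids, for illustration only
for context, target in next_token_examples(token_ids):
    print(context, "->", target)
```

Every position in the text yields a training signal, which is why no extra annotation is required.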

# Preprocess

Load a DistilGPT2 tokenizer to process the `text` subfield:

In [5]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("distilgpt2")
print(tokenizer)

GPT2TokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}


In [6]:
eli5=eli5.flatten()
eli5["train"][0]

{'q_id': '5lgx81',
 'title': 'how do people hold their breath for so long without passing out/dying',
 'selftext': 'The official record for breath held underwater is 24 minutes and 3 seconds. HOW?!',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dbvm1nh', 'dbvpucx'],
 'answers.text': ['They train to perform well for their sport. If you do something that is challenging repeatedly your body adapts. So their bodies are more efficient with oxygen, and they probably have a large lung capacity and good control over their heart rate. Some of these records are also set by breathing in pure oxygen.',
  "There's a video out there of a ted talk David Blaine did explaining how he got his record of holding his breath under water. He talks about the process and training of it. Edit: here's the link URL_0"],
 'answers.score': [12, 3],
 'answers.text_urls': [[],
  ['https://www.ted.com/talks/david_blaine_how_i_held_my_breath_for_17_min']],
 'title_urls': ['url'],
 'self

The `text` field is actually nested inside `answers`, so we extract the `text` subfield from its nested structure with the `flatten` method (see the cell above), which exposes it as `answers.text`. Then, instead of tokenizing each answer separately, we join the list into a single string so the answers can be tokenized jointly. We apply this preprocessing function over the entire dataset with `map`:

In [7]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5=eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/400 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1695 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1647 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1433 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4615 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/100 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1062 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2081 > 1024). Running this sequence through the model will result in indexing errors
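
The flatten-and-join step can be sketched on a single toy record without the `datasets` library. The helper `flatten_record` below is a hypothetical stand-in that only mimics one level of what `flatten` does, and the record contents are made up for illustration.

```python
def flatten_record(record, sep="."):
    """Lift one level of nested dict keys to the top, e.g. answers -> answers.text."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Toy record shaped like an ELI5 example (values are invented).
record = {
    "q_id": "abc123",
    "answers": {"text": ["First answer.", "Second answer."], "score": [12, 3]},
}
flat = flatten_record(record)
# Join all answers of one question into a single string before tokenizing.
joined = " ".join(flat["answers.text"])
print(joined)
```

Joining the answers is also what produces the long sequences reported in the warnings above: several answers concatenated together can easily exceed 1024 tokens.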


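The warnings above show that some joined sequences are longer than the model's maximum input length (1024 tokens). The `group_texts` function below concatenates all sequences in a batch, splits the result into chunks of `block_size` tokens (dropping the remainder that does not fill a whole block), and copies `input_ids` into `labels`. Apply it over the entire dataset: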

In [8]:
block_size=128

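def group_texts(examples):
    # Concatenate every column (input_ids, attention_mask) across the batch into one long list.
    concatenated_examples={k: sum(examples[k], []) for k in examples.keys()}
    total_length=len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so the total length is a multiple of block_size.
    if total_length>=block_size:
        total_length=(total_length//block_size)* block_size
    # Split by chunks of block size
    result={
        k: [t[i: i+block_size] for i in range(0, total_length, block_size)]
        for k,t in concatenated_examples.items()
    }
    
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"]=result["input_ids"].copy()
    return result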


lm_dataset=tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/400 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/100 [00:00<?, ? examples/s]
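
To see what the grouping does, here is a small self-contained variant run on a toy batch with a block size of 4. The name `group_into_blocks` and the toy ids are assumptions for illustration; the logic follows `group_texts` above, with `block_size` passed as an argument instead of read from a global.

```python
def group_into_blocks(batch, block_size):
    """Concatenate all sequences per column and cut them into fixed-size blocks."""
    concatenated = {k: sum(batch[k], []) for k in batch}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }
    # Labels are a copy of the inputs for causal language modeling.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
blocks = group_into_blocks(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten input tokens become two blocks of four; tokens 9 and 10 are dropped. Note that block boundaries ignore the original example boundaries.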

Here we `dynamically pad` the sequences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [9]:
from transformers import DataCollatorForLanguageModeling

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token=tokenizer.eos_token
# With `mlm=False` the collator copies the inputs into the labels;
# the model shifts them by one position internally when computing the loss.
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
print(data_collator)

DataCollatorForLanguageModeling(tokenizer=GPT2TokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}, mlm=False, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='pt')
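
The idea of dynamic padding can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the collator's actual implementation: `pad_batch` is a hypothetical helper, padding is applied on the right (matching `padding_side='right'` above), and padded label positions are set to `-100`, the index that the loss ignores.

```python
def pad_batch(sequences, pad_id, ignore_index=-100):
    """Right-pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Padded positions must not contribute to the loss.
        labels.append(seq + [ignore_index] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# 50256 is the id of <|endoftext|>, used here as the padding token.
batch = pad_batch([[11, 12, 13], [21]], pad_id=50256)
print(batch["input_ids"])
print(batch["labels"])
```

Because `group_texts` already produces blocks of equal length, the batches in this notebook need little or no padding; dynamic padding matters more when sequence lengths vary.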


# Training

In [10]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model=AutoModelForCausalLM.from_pretrained("distilgpt2")
print(model.config)

GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 50257
}



In [11]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_checkpointing=True,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    push_to_hub=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240228_055901-za8a80hs[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-distilGPT2-with-askscience[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/za8a80hs[0m
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a cal

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 3.917277 |
| 2 | No log | 3.900251 |
| 3 | No log | 3.895213 |
| 4 | No log | 3.891415 |
| 5 | No log | 3.890968 |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=180, training_loss=3.9374786376953126, metrics={'train_runtime': 239.3969, 'train_samples_per_second': 23.58, 'train_steps_per_second': 0.752, 'total_flos': 184377519636480.0, 'train_loss': 3.9374786376953126, 'epoch': 5.0})

# Evaluate

In [12]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")



Perplexity: 48.96
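
Perplexity is the exponential of the average per-token cross-entropy loss, so the reported value can be checked directly against the final-epoch validation loss in the training table. The helper `perplexity_from_token_probs` is a hypothetical function added only to show the definition on a toy case.

```python
import math


def perplexity_from_token_probs(probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    mean_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_nll)


# Validation loss after epoch 5, taken from the training table.
eval_loss = 3.890968
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")

# Toy check: a model that always assigns probability 1/4 to the correct token
# is as uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity_from_token_probs([0.25, 0.25, 0.25])
print(f"{uniform_ppl:.2f}")
```

A perplexity of roughly 49 can be read as the model being, on average, about as uncertain as if it were choosing uniformly among 49 tokens at each step.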


In [13]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-distilGPT2-with-askscience/commit/c0d0eeb142395b3a398f74e1e015ac630c3a2932', commit_message='ft-distilGPT2-with-askscience', commit_description='', oid='c0d0eeb142395b3a398f74e1e015ac630c3a2932', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [14]:
from transformers import pipeline

prompt="Somatic hypermutation allows the immune system to"

generator=pipeline("text-generation", model=os.getenv("WANDB_NAME"))
generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Somatic hypermutation allows the immune system to react to stimuli that make you sick and/or unable to handle the stress. In addition to suppressing the body's pressure, this prevents the immune from feeling so tired that it can become resistant to attack"}]

Tokenize the text and return the input_ids as PyTorch tensors:

In [15]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("WANDB_NAME"))
inputs=tokenizer(prompt, return_tensors="pt").input_ids

In [16]:
from transformers import AutoModelForCausalLM

model=AutoModelForCausalLM.from_pretrained(os.getenv("WANDB_NAME"))
outputs=model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [17]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Somatic hypermutation allows the immune system to suppress the infection of certain diseases. For example, a virus that can't cause a specific type of immune attack is known as a virus with a mutation called the 'Eugenic HIV/AIDS' mutation and this means it has some genetic information. If you've already experienced the virus, you might think the virus actually exists. To this end, the virus has a DNA function called the DNA 'Vibrio A' mutation and this is used by the virus as a protection against viruses from infections"]
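
The `top_k` and `top_p` arguments passed to `generate` restrict which tokens can be sampled at each step. The sketch below shows the general idea on a toy distribution; `top_k_top_p_filter` is a hypothetical helper written for illustration, not the library's implementation, and the four-token vocabulary is invented.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    # Sort tokens by probability, highest first, and apply top-k.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        # Cumulative probability is measured after renormalizing the top-k set.
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sample the next token from the filtered distribution.
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

With `top_k=3` the token `d` is removed, and with `top_p=0.8` the tokens `a` and `b` already cover enough probability mass, so `c` is removed as well. Sampling from a truncated distribution keeps the output varied while avoiding very unlikely tokens, although, as the generated text above shows, it does not make the content factually reliable.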