# Fine-Tuning DistilGPT2 for Text Generation

DistilGPT2 is a smaller, faster version of OpenAI's GPT-2, produced by applying knowledge distillation to GPT-2. Here, we fine-tune DistilGPT2 on a sample of the "emotion" dataset from the Hugging Face Hub so that it generates short, emotion-laden English sentences. The same procedure can be used to adapt the model to English text on other subjects.

In [1]:
from datasets import load_dataset
emotions = load_dataset("emotion")
emotions.set_format("pandas")
train, valid, test = emotions["train"][:], emotions["validation"][:], emotions["test"][:]

Using custom data configuration default
Reusing dataset emotion (/root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)



Fine-tuning distilgpt2 needs only the text field.

In [5]:
texts = list(train['text'])

Write the texts to a txt file, one expression per line, with an `|EndOfText|` marker separating consecutive expressions.

In [7]:
file_name = 'testing.txt'
with open(file_name, 'w') as f:
    f.write(" |EndOfText|\n".join(texts))
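To make the resulting file layout concrete, here is a small sketch using three made-up sentences (placeholders, not rows from the dataset). It suggests that `join` places the marker between entries, so the final line ends without one.

```python
# Hypothetical sample sentences standing in for the dataset's text column.
sample_texts = ["i feel great today", "i am so tired", "i feel calm"]

# Same separator as in the cell above: the marker closes a line,
# then the newline starts the next expression.
separator = " |EndOfText|\n"
contents = separator.join(sample_texts)

# Inspect the lines exactly as they would appear in the file.
lines = contents.split("\n")
for line in lines:
    print(repr(line))
```

If every line, including the last, should carry the marker, appending it once more after the join would be one way to do that.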

Now, let's turn to Transformers by Hugging Face and put it to work.

Make 2 directories. 

1) weights - for storing the weights of distilgpt2

2) tokenizer - for storing the tokenizer of distilgpt2
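A minimal sketch of creating the two directories, assuming the names from the list above (`weights` and `tokenizer`) are used literally. Adjust them if checkpoints are stored elsewhere, as the training cell later does with `output`.

```python
import os

# Assumed directory names, taken from the list above.
directories = ("weights", "tokenizer")

for directory in directories:
    # exist_ok=True makes the call safe to repeat when the cell is re-run.
    os.makedirs(directory, exist_ok=True)
```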

# Fine-Tuning of DistilGPT2
Now, it's time to train (or fine-tune) distilgpt2 on the emotion texts.  
Given below is a command with a few parameters that tell Transformers how to fine-tune distilgpt2. Here is what these parameters mean:

1. output_dir: The directory (weights_dir below) where the fine-tuned model is stored in the form of checkpoints

2. model_name_or_path: The name of a pretrained model on the Hub, or a path to a local one, to start from

3. per_device_train_batch_size: The training batch size for each device (GPU or CPU)

4. do_train: It tells the script to run training

5. train_file: This is where we give the input text data 

6. num_train_epochs: Number of epochs for finetuning
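As an illustration of how these parameters fit together, the sketch below assembles them into an argument list. The helper `build_clm_command` and its defaults are hypothetical; the script path and flag names are the ones this notebook uses.

```python
import shlex

def build_clm_command(train_file, output_dir, model="distilgpt2",
                      epochs=3, batch_size=2):
    """Assemble the run_clm.py invocation as a list of arguments (hypothetical helper)."""
    return [
        "python", "transformers/examples/pytorch/language-modeling/run_clm.py",
        "--model_name_or_path", model,      # which pretrained model to start from
        "--train_file", train_file,         # input text data
        "--do_train",                       # run training
        "--num_train_epochs", str(epochs),
        "--per_device_train_batch_size", str(batch_size),
        "--output_dir", output_dir,         # where checkpoints are written
    ]

args = build_clm_command("testing.txt", "output")
# shlex.join quotes each argument so the string is safe to paste into a shell.
print(shlex.join(args))
```

Keeping the arguments in a list rather than one long string makes it easier to change a single value between runs.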


Now, let the training begin...

In [8]:
import logging
logging.basicConfig(level=logging.ERROR)

In [9]:
weights_dir = "output"

Fine-tuning DistilGPT2 (a GPT-2 model) for causal language modeling on the emotion dataset.

In [10]:
!git clone --depth=1 --branch v4.6.0-release https://github.com/huggingface/transformers.git

fatal: destination path 'transformers' already exists and is not an empty directory.


In [11]:
cmd = f'''
python transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file {file_name} \
    --do_train \
    --num_train_epochs 3 \
    --overwrite_output_dir \
    --per_device_train_batch_size 2 \
    --output_dir {weights_dir}
'''

In [12]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [13]:
!{cmd}

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
03/22/2022 13:51:04 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=output, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Mar22_13-51-04_61e3c5fc8419, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=N

Hugging Face provides a run_generation.py script for language generation. However, running it from the command line (since it takes the prompt as input) loads the model and the tokenizer on every invocation, which slows down generation. To reduce this I/O overhead, I have restructured run_generation.py into the following code, which loads the model and tokenizer only once into a model object and a tokenizer object that can be reused to generate text over and over again.

In [14]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def get_model_tokenizer(weights_dir, device = 'cuda'):
    print("Loading Model ...")
    model = GPT2LMHeadModel.from_pretrained(weights_dir)
    model.to(device)  # honor the device argument instead of hard-coding 'cuda'
    print("Model Loaded ...")
    tokenizer = GPT2Tokenizer.from_pretrained(weights_dir)
    return model, tokenizer

def generate_messages(
    model,
    tokenizer,
    prompt_text,
    stop_token,
    length,
    num_return_sequences,
    temperature = 0.7,
    k=20,
    p=0.9,
    repetition_penalty = 1.0,
    device = 'cuda'
):

    MAX_LENGTH = int(10000)
    def adjust_length_to_model(length, max_sequence_length):
        if length < 0 and max_sequence_length > 0:
            length = max_sequence_length
        elif 0 < max_sequence_length < length:
            length = max_sequence_length  # No generation bigger than model size
        elif length < 0:
            length = MAX_LENGTH  # avoid infinite loop
        return length
        
    length = adjust_length_to_model(length=length, max_sequence_length=model.config.max_position_embeddings)

    encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")

    encoded_prompt = encoded_prompt.to(device)

    output_sequences = model.generate(
            input_ids=encoded_prompt,
            max_length=length + len(encoded_prompt[0]),
            temperature=temperature,
            top_k=k,
            top_p=p,
            repetition_penalty=repetition_penalty,
            do_sample=True,
            num_return_sequences=num_return_sequences,
        )

    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()

    generated_sequences = []

    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        # print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
        generated_sequence = generated_sequence.tolist()

        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

        # Remove all text after the stop token (str.find returns -1 when it is absent)
        stop_index = text.find(stop_token) if stop_token else -1
        if stop_index != -1:
            text = text[:stop_index]

        # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
        total_sequence = (
            prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
        )

        generated_sequences.append(total_sequence)
    return generated_sequences
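The stop-token trimming step can be tried in isolation. The sketch below uses a hypothetical helper, `trim_at_stop_token`, and made-up strings. It guards the case where the marker never appears, since `str.find` returns -1 then and slicing with -1 would drop the final character.

```python
def trim_at_stop_token(text, stop_token):
    """Return text up to the first stop_token; unchanged if the token is absent or empty."""
    if not stop_token:
        return text
    index = text.find(stop_token)
    # -1 means "not found": keep the full text instead of slicing with it.
    return text if index == -1 else text[:index]

# Marker present: everything from the marker onward is removed.
with_marker = trim_at_stop_token("i feel fine |EndOfText|\ni feel", "|EndOfText|")
# Marker absent: the text comes back untouched.
without_marker = trim_at_stop_token("i feel fine", "|EndOfText|")

print(repr(with_marker))
print(repr(without_marker))
```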


In [15]:
model, tokenizer = get_model_tokenizer(weights_dir, device = 'cuda')

Loading Model ...
Model Loaded ...


In [16]:
temperature = 1.0
k = 400
p = 0.9
repetition_penalty = 1.0
num_return_sequences = 5
length = 1000
stop_token = '|EndOfText|'
prompt_text = "this is"

In [17]:
%%time
generate_messages(
    model,
    tokenizer,
    prompt_text,
    stop_token,
    length,
    num_return_sequences,
    temperature=temperature,
    k=k,
    p=p,
    repetition_penalty=repetition_penalty
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


CPU times: user 6.84 s, sys: 50.6 ms, total: 6.89 s
Wall time: 6.9 s


['this is the good health of women in this country it is the same as the average man should have ',
 'this is only going to change my mind if i feel punished for it and something i will do differently for the better because my precious daughter isnt hurt ',
 'this is too early to tell this just how dumb it was to have that information but i am completely numb to it so i was impressed with this fact that it was presented in my own way to feel accepted in my world ',
 'this is useful for helping reduce headaches and in some cases, pain to patients of any age group who may not have had the stress disorder ',
 'this is a gentle reminder that you will always enjoy taking part in what you do with your child i feel that i am giving her pleasure which is to do justice for her the way i did before her so that her life will never be ruined like this again in the long run ']

In [27]:
import torch
ckpt_dir = 'emotions_distilgpt2_model'
os.makedirs(ckpt_dir, exist_ok=True)  # no error if the directory already exists

output_model_file = os.path.join(ckpt_dir, 'torch_distilgpt2_news.pt')
# Pickle the whole model object
torch.save(model, output_model_file)
# Write the tokenizer's vocabulary files next to the model
tokenizer.save_vocabulary(ckpt_dir)
print('Saved')

Saved
