<a href="https://colab.research.google.com/github/AniLeo-01/MailCompletion-bot/blob/main/Distilgpt2_finetuning_on_AESLC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the dependencies and import the libraries

In [None]:
!pip -q install datasets accelerate -U wandb

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset
import math
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

## Load the AESLC dataset from HuggingFace

We will use the already cleaned version of AESLC published by postbot.

In [None]:

dataset = load_dataset("postbot/aeslc_kw")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Initialize the tokenizer

In [None]:
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)



The `map` method can run the tokenization in parallel: `batched=True` passes batches of examples to the function, and `num_proc=4` spreads the work over four processes.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["clean_email"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4,
                                 remove_columns=['email_body', 'subject_line',
                                                 'clean_email', 'clean_email_keywords'])

Map (num_proc=4):   0%|          | 0/14436 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1159 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1182 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1162 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2466 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1960 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1079 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3178 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3452 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4245 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1906 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1054 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1040 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1134 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1038 > 1024). Running this sequence through the model will result in indexing errors


Set the block size to 1024, the maximum length reported by the tokenizer (`tokenizer.model_max_length`).

Reduce the value if GPU memory is limited.

In [None]:
block_size = tokenizer.model_max_length

Concatenate the tokenized texts and split the result into chunks of `block_size` tokens each.

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
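To see what `group_texts` does, here is a minimal sketch on toy data. The token ids and the block size of 4 are made up for illustration, and `group_texts_demo` is a hypothetical copy of the function above that takes `block_size` as a parameter so the example is self-contained.

```python
def group_texts_demo(examples, block_size):
    # Concatenate every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


# Three toy "emails" of 3, 5 and 2 tokens (hypothetical ids).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

chunks = group_texts_demo(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block can span the boundary between two emails, and the last two tokens are dropped because they do not fill a whole block.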

Apply the batched transformation to every split of the dataset.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)


## Initialize the distilgpt2 model

Read more about distilgpt2 over here: https://huggingface.co/distilgpt2

In [None]:
model_checkpoint = 'distilgpt2'
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)


## Setup the training arguments

To reduce GPU memory usage and speed up training, we enable mixed-precision training with `fp16=True`, so most computations run in fp16 instead of fp32.
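As a rough, library-free illustration of the trade-off between the two formats, the sketch below uses Python's `struct` module to compare storage size and rounding. It does not reproduce what the Trainer does internally; `to_fp16` is a hypothetical helper written for this example.

```python
import struct

# Bytes needed to store one value in each format.
fp16_bytes = struct.calcsize("e")  # IEEE 754 half precision
fp32_bytes = struct.calcsize("f")  # IEEE 754 single precision
print(fp16_bytes, fp32_bytes)  # 2 4


def to_fp16(x):
    """Round a Python float to the nearest fp16 value and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


# fp16 keeps only 10 mantissa bits, so values are rounded more coarsely.
half = to_fp16(0.1)
print(half)  # 0.0999755859375
```

Half the bytes per value means less memory traffic, at the cost of coarser rounding and a smaller representable range.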

In [None]:
model_name = model_checkpoint
training_args = TrainingArguments(
    output_dir="./distilgpt2-fine_tuned-aeslc-T4",
    dataloader_drop_last=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10,
    logging_steps=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    gradient_accumulation_steps=4,
    fp16=True,
    weight_decay=0.05,
    run_name="distilgpt2-fine_tuned-aeslc-T4",
    report_to="wandb",
)


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

Run the training process

In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss
0,3.0745,3.089179
2,2.536,3.041362
4,1.96,3.160594
6,1.7625,3.341778
8,1.4545,3.475159
9,1.4303,3.502504


TrainOutput(global_step=1530, training_loss=2.094871759726331, metrics={'train_runtime': 2688.7851, 'train_samples_per_second': 9.134, 'train_steps_per_second': 0.569, 'total_flos': 6396544454492160.0, 'train_loss': 2.094871759726331, 'epoch': 9.97})
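The numbers above can be cross-checked with a little arithmetic. This is a minimal sketch that copies the values from the training arguments, the log table and the `TrainOutput`; the single-GPU setting (`num_devices = 1`) is an assumption based on the run name.

```python
import math

# Values copied from the training arguments and the TrainOutput above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: one GPU, as the "T4" run name suggests
global_step = 1530
num_train_epochs = 10

# Sequences contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = global_step // num_train_epochs
print(effective_batch_size, steps_per_epoch)  # 16 153

# (epoch, training loss, validation loss) rows from the log table above.
log = [
    (0, 3.0745, 3.089179),
    (2, 2.536, 3.041362),
    (4, 1.96, 3.160594),
    (6, 1.7625, 3.341778),
    (8, 1.4545, 3.475159),
    (9, 1.4303, 3.502504),
]

# Row with the lowest validation loss, and the perplexity it corresponds to.
best_epoch, _, best_val_loss = min(log, key=lambda row: row[2])
best_perplexity = math.exp(best_val_loss)
print(best_epoch)  # 2
```

In this log the validation loss is lowest at the row labelled 2 and rises afterwards while the training loss keeps falling, a pattern that usually points to overfitting.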

## Evaluate the model

Calculate the perplexity of the fine-tuned LLM.

Perplexity measures how uncertain the model is when predicting the next token. It is the exponential of the average cross-entropy loss per token, where each token's loss comes from the probability the model assigns to it given the preceding context.

A higher perplexity is generally worse: it means the model assigns lower probability to the evaluation text, i.e. it predicts that text less well.
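As a minimal sketch of the definition, perplexity is the exponential of the mean negative log-likelihood of the observed tokens. The probabilities below are made up for illustration and `perplexity` is a hypothetical helper, not a library function.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-likelihood) for the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that gives every observed token probability 1/4 is as uncertain
# as a uniform choice among four options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is more confident about the observed tokens scores lower.
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(round(uniform, 2))    # 4.0
print(confident < uniform)  # True
```

The Trainer's `eval_loss` is already an average cross-entropy, which is why the cell below only needs `math.exp`.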

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 33.20


Save the model

In [None]:
trainer.save_model("/content/drive/MyDrive/MailCompletion Bot/finetuned_models")

## Run inference with the LLM

Load the model from the final Trainer checkpoint (step 1530) in the training output directory.

In [None]:
model = AutoModelForCausalLM.from_pretrained('/content/distilgpt2-fine_tuned-aeslc-T4/checkpoint-1530')

In [None]:
text = """
Hey Aniruddha Mandal,

I'm sending you a final reminder that you've been added to the pre-vetted pool. We see that you haven't finished completing your profile details and need you to do this immediately in order to show your profile to US companies. If you do not complete your details, we will remove you from the pool.

Please confirm all your information is correct
"""

Generate a continuation with the causal language model and decode it back to text. Note that `max_length` counts the prompt tokens plus the generated ones, so it is set to the prompt length plus the number of new tokens wanted. Keeping that number low (2 to 5 tokens) gives much better contextual results.

In [None]:
inputs = tokenizer(text, return_tensors="pt")

generation_output = model.generate(
    **inputs,
    return_dict_in_generate=True,
    output_scores=True,
    max_length=inputs.input_ids.shape[-1]+2,  # Prompt length + 2: generate at most two new tokens
    # no_repeat_ngram_size=2,  # Avoid repeating word pairs
    num_beams=1,
    do_sample=False,  # Greedy search, so the output is deterministic
    repetition_penalty=1.5,
    length_penalty=2.0  # Only used by beam-based generation; no effect with num_beams=1
)

generated_word = tokenizer.decode(generation_output['sequences'][0])
print(generated_word)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Hey Aniruddha Mandal,

I'm sending you a final reminder that you've been added to the pre-vetted pool. We see that you haven't finished completing your profile details and need you to do this immediately in order to show your profile to US companies. If you do not complete your details, we will remove you from the pool.

Please confirm all your information is correct so there
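The call above uses greedy decoding with a budget of two new tokens. The idea can be sketched without a model: at each step, pick the most probable next token and stop when the budget is used up. The probability table is hypothetical and conditions only on the previous token, whereas a real language model conditions on the whole prefix.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "correct": {"so": 0.5, "and": 0.3, ".": 0.2},
    "so": {"there": 0.4, "that": 0.35, "we": 0.25},
    "there": {"is": 0.6, "are": 0.4},
}


def greedy_complete(prompt_tokens, max_new_tokens):
    """Append the most probable next token until the budget is used up."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # nothing known about this context
        tokens.append(max(candidates, key=candidates.get))
    return tokens


prompt = ["information", "is", "correct"]
completed = greedy_complete(prompt, max_new_tokens=2)
print(completed[len(prompt):])  # ['so', 'there']
```

In `generate`, the same budget can also be expressed directly with the `max_new_tokens` argument instead of computing `max_length` from the prompt length.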
