<a href="https://colab.research.google.com/github/Rami-RK/HugingFace_Transformers/blob/main/Fine_Tune_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Fine-tuning the GPT-2 language model**




### **Objectives:**

By the end of this experiment you will be able to understand and implement:

1. The GPT-2 language model
2. Fine-tuning GPT-2 for a downstream task
3. A domain-specific generative question-answering system

#### Data & Expected outcome
The training text is a Project Gutenberg book, **"The Buddha's Path of Virtue: A Translation of the Dhammapada by F. L. Woodward"**. We fine-tune the GPT-2 model on this text and expect that, after fine-tuning, the model will be able to respond to prompts related to the subject matter of the book.

#### Installing and importing libraries/packages

In [None]:
!pip install -U accelerate
!pip install -U transformers
!pip install torch

In [None]:
import os
import re
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

#### Function to read the text files
This notebook reads a plain-text file. Other file types, such as PDF or DOC, need a different read function.


In [None]:
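def read_txt(file_path):
    # Read the whole file into one string; the encoding is set explicitly
    # (assumed UTF-8) so the result does not depend on the platform default
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text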
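
Since the read function depends on the file type, one option is to dispatch on the file extension. The sketch below is a hypothetical extension of `read_txt`: `READERS` and `read_document` are names invented for this illustration, and only plain text is handled. Support for PDF or DOC would mean registering a reader built on a third-party parser in the same table.

```python
from pathlib import Path

def read_txt(file_path):
    # Plain-text reader with an explicit encoding (assumed UTF-8)
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()

# Hypothetical registry: lowercase file extension -> reader function
READERS = {".txt": read_txt}

def read_document(file_path):
    # Pick the reader from the file extension and fail loudly if none is registered
    suffix = Path(file_path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"No reader registered for '{suffix}' files")
    return READERS[suffix](file_path)
```

Calling `read_document('/content/35185-0.txt')` would then behave like `read_txt`, while an unregistered extension raises a `ValueError` instead of returning garbled text.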


In [None]:
# Loading text file
!wget https://www.gutenberg.org/files/35185/35185-0.txt

--2023-08-23 04:28:26--  https://www.gutenberg.org/files/35185/35185-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120621 (118K) [text/plain]
Saving to: ‘35185-0.txt’


2023-08-23 04:28:26 (1.60 MB/s) - ‘35185-0.txt’ saved [120621/120621]



In [None]:
# Read files/documents
file_path = '/content/35185-0.txt'
text_file = read_txt(file_path)
text_file = re.sub(r'\n+', '\n', text_file).strip()  # Remove excess newline characters

In [None]:
# Split the text into training and validation sets
train_fraction=0.8
split_index = int(train_fraction * len(text_file))
train_text = text_file[:split_index]
val_text = text_file[split_index:]
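
The split above cuts at a character index, so the boundary can fall in the middle of a line or even a word. A minimal sketch of the same 80/20 split that moves the cut forward to the next newline is shown here; `split_at_newline` is a hypothetical helper, not part of the original pipeline.

```python
def split_at_newline(text, train_fraction=0.8):
    # Character index of the nominal split point
    split_index = int(train_fraction * len(text))
    # Move the cut forward to just after the next newline so no line is split in two
    newline_index = text.find("\n", split_index)
    if newline_index != -1:
        split_index = newline_index + 1
    return text[:split_index], text[split_index:]

sample = "line one\nline two\nline three\nline four\nline five"
train_part, val_part = split_at_newline(sample)
print(repr(val_part))   # 'line five'
```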

In [None]:
# Save the training and validation data as text files
with open("train.txt", "w", encoding="utf-8") as f:
    f.write(train_text)
with open("val.txt", "w", encoding="utf-8") as f:
    f.write(val_text)

In [None]:
# Set up the tokenizer and model
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)  # also try gpt2-medium
model = GPT2LMHeadModel.from_pretrained(checkpoint)  # also try gpt2-medium


**Approximate GPU training time for different GPT-2 models on this dataset:**
* **GPT-2: 20 minutes for 100 epochs**

* **GPT-2 medium: 1 hour for 100 epochs**

* **GPT-2 large: ran out of memory**

In [None]:
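# Prepare the dataset
# block_size=128: each example is a 128-token chunk of the file
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
val_dataset = TextDataset(tokenizer=tokenizer, file_path="val.txt", block_size=128)
# mlm=False selects causal (next-token) language modeling instead of masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)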
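
With a block size of 128, the tokenized file is cut into consecutive 128-token examples. The sketch below illustrates that chunking on a plain list of token ids. It assumes non-overlapping blocks with an incomplete final block dropped, which is how `TextDataset` is understood to behave; `chunk_token_ids` is a name made up for the illustration.

```python
def chunk_token_ids(token_ids, block_size):
    # Consecutive, non-overlapping blocks; a trailing partial block is dropped
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

token_ids = list(range(300))   # stand-in for the token ids of a text file
blocks = chunk_token_ids(token_ids, block_size=128)
print(len(blocks), [len(b) for b in blocks])   # 2 [128, 128]
```

Under this assumption a file shorter than one block yields no examples at all, which matters for small validation files.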



In [None]:
# Set up the training arguments
model_output_path = "/content/gpt_model"

training_args = TrainingArguments(
    output_dir=model_output_path,
    overwrite_output_dir=True,
    per_device_train_batch_size=4, # try with 2
    per_device_eval_batch_size=4,  #  try with 2
    num_train_epochs=100,
    save_steps=1_000,
    save_total_limit=2,
    logging_dir='./logs',
    )
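
The number of optimizer steps follows from these arguments: steps per epoch is the number of training examples divided by the batch size, rounded up, and the total is that value times `num_train_epochs`. A small worked example, assuming a single device, no gradient accumulation, and a hypothetical count of 200 training blocks (the notebook does not report the real count):

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    # One optimizer step per batch; the last batch of an epoch may be smaller
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

num_examples = 200   # hypothetical number of 128-token blocks in train.txt
steps = total_training_steps(num_examples, batch_size=4, num_epochs=100)
print(steps)            # 5000
print(steps // 1_000)   # 5 checkpoints written with save_steps=1_000
```

With `save_total_limit=2`, only the two most recent of those checkpoints are kept on disk.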

In [None]:
# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
# Save the model
trainer.save_model(model_output_path)
# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

Step,Training Loss
500,0.9385
1000,0.3482
1500,0.161
2000,0.0938
2500,0.0667
3000,0.0531
3500,0.0443
4000,0.0395
4500,0.0354
5000,0.0314


('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')
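
The training loss logged above is a mean cross-entropy per token, so `exp(loss)` gives the training perplexity. The sketch below converts a few of the logged values. A perplexity close to 1 at the end suggests the model has largely memorized the training text; that reading is an interpretation, since the notebook does not report validation loss.

```python
import math

# (step, training loss) pairs copied from the training log above
logged_losses = [(500, 0.9385), (1000, 0.3482), (2500, 0.0667), (5000, 0.0314)]

# Perplexity is the exponential of the mean cross-entropy loss
perplexities = [math.exp(loss) for _, loss in logged_losses]

for (step, loss), ppl in zip(logged_losses, perplexities):
    print(f"step {step:>4}: loss {loss:.4f} -> perplexity {ppl:.3f}")
```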

### **Testing the model with some prompts**


The **`generate_response`** function takes a trained model, tokenizer, and a prompt string as input and generates a response using the GPT-2 model.

In [None]:
def generate_response(model, tokenizer, prompt, max_length=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Single unpadded prompt, so every position is attended to (mask of ones)
    attention_mask = torch.ones_like(input_ids)
    # GPT-2 has no pad token; reuse the EOS token id for padding
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
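
`generate_response` does not set `do_sample`, which normally means greedy decoding, and greedy decoding tends to repeat itself on longer outputs. A hedged variant that samples instead is sketched below. `generate_sampled_response` and the particular `top_k`, `top_p`, and `temperature` values are illustrative choices, not settings from the original notebook.

```python
def generate_sampled_response(model, tokenizer, prompt, max_new_tokens=100,
                              top_k=50, top_p=0.95, temperature=0.8):
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,        # counts generated tokens only, not the prompt
        do_sample=True,                       # sample instead of always taking the most likely token
        top_k=top_k,                          # keep only the k most likely next tokens
        top_p=top_p,                          # nucleus sampling threshold
        temperature=temperature,              # values below 1 sharpen the distribution
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
```

It is called the same way as `generate_response`, for example `generate_sampled_response(my_model, my_tokenizer, "what is dharma ?")` once the fine-tuned model is loaded. Because of sampling, the output differs between runs.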


In [None]:
# Load the fine-tuned model and tokenizer
my_model = GPT2LMHeadModel.from_pretrained(model_output_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

#### **Prompt Example 1**

In [None]:
prompt = "What is teaching of Buddha?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What is teaching of Buddha?
  29.
  "Lust is not sated, tho' it rain gold coins;
  Brief is the pleasure, great the pains of lust"--
  Whoso saith this and knows it, he is wise.
  30.
  He finds no pleasure e'en in heaven's delights;
  He finds his joy in slaying all desire,
  That follower of the All-Awakened Ones.



#### **Prompt Example 2**

In [None]:
prompt = "what is dharma ?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: what is dharma?
    To lack it leads to hell.
  349.
  Seek not for love; things loved when lost bring woe;
    Both are impermanent.
  350.
  Sorrow and fear are born of things beloved.
    From things beloved set free,
  How canst thou sorrow? fearful how canst be?
  351.
  From things held dear, sorrow and fear are born.
    Set free by the perfect knowledge,
  How canst thou sorrow? fearful how canst be?
  352.
  From things held dear, sorrow and fear are born.
    Set free by


#### **Prompt Example 3**

In [None]:
prompt = "how to live ?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: how to live?
    With the uncongenial;
  Like a hare run to and fro,
  By the fetters' bonds entangled,
    Long must sorrow undergo.
  343.
  Beings, in the highest degree,
    Whoso is deep in wisdom and intelligence,
    Who can with skill discern the right and wrong,
      Pleasant is he to hear.
  344.
  Whoso with householders and wanderers alike
    Small dealings hath, who lives the homeless life,
     Neighbourless life, is blest indeed;
      Who live


The GPT-2 tokenizer uses byte-pair encoding (BPE), which splits text into subword units, so a single word may be represented by several tokens. `max_length` is measured in tokens and includes the prompt. For example, with `max_length=50` the prompt and the generated text together are limited to 50 tokens, which is usually fewer than 50 words.
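
To see the difference between words and tokens for a given prompt, the two counts can be compared directly. The helper below is a small sketch: `token_budget` is a hypothetical name, and it relies only on the tokenizer's `encode` method used earlier.

```python
def token_budget(tokenizer, prompt, max_length):
    # Words are split on whitespace; tokens come from the tokenizer's BPE vocabulary
    num_words = len(prompt.split())
    num_prompt_tokens = len(tokenizer.encode(prompt))
    # max_length covers prompt + generated tokens, so this is what is left to generate
    remaining = max(max_length - num_prompt_tokens, 0)
    return num_words, num_prompt_tokens, remaining

# Usage in the notebook (requires the fine-tuned tokenizer loaded above):
# token_budget(my_tokenizer, "What is teaching of Buddha?", max_length=100)
```

The third value is the number of tokens `generate` can still produce before `max_length` is reached.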