# 🤖 **Training GPT-2 for Instruction Following**

### Importing Libraries

In [1]:
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
DataCollatorForLanguageModeling,
Trainer,
TrainingArguments
)
from datasets import load_dataset
import torch
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.cuda.empty_cache()  # Frees unreferenced memory
torch.cuda.ipc_collect()  # Collects inter-process memory

# print(torch.cuda.is_available())  # Should return True
# print(torch.cuda.device_count())  # Should be > 0
# print(torch.cuda.current_device())  # Should return an integer (device index)
# print(torch.cuda.get_device_name(0))  # Should return your GPU name

# print(torch.__version__)         # PyTorch version
# print(torch.version.cuda)        # CUDA version PyTorch was built for
# print(torch.backends.cudnn.version())  # cuDNN version

### Load the Dataset

In [2]:
dataset = load_dataset("hakurei/open-instruct-v1", split='train')

In [3]:
dataset.to_pandas().sample(20)

Unnamed: 0,output,input,instruction
457654,1. Stanford University\n2. Massachusetts Insti...,,Top 20 colleges produced successful founders a...
459891,​,,Hi Joanne.\nI would like to take the renovatio...
159410,The average price for a new pair of running sh...,,What is the average price for a new pair of ru...
391368,- Lack of sex education in schools.\n- Peer pr...,,What do you think are the main reasons for tee...
145447,It is gratifying to hear that you have acquire...,"Hey, it's cool you got a job and stuff, but li...",Rewrite the given paragraph using formal langu...
222951,(defn solve-problem [lst target]\n (-> lst\n ...,,Write Clojure code to solve the following prob...
387513,"I love you, my dear.",,"Write a poem about your life, or something els..."
39024,"Bread, Cake, Coffee, Milk","Cake, Coffee, Bread, Milk",Organize the given list alphabetically.
191769,The 'if' statement is used to check if the len...,,"In the following code, what is the purpose of ..."
246831,Oil and water do not mix. Water is denser than...,,Why does adding water to a hot pan of oil caus...


* Each record contains an `instruction`, an `input` (often empty), and an `output`
* The model is trained to produce the output given the instruction and input
* We build each training prompt by concatenating the instruction, input, and output

In [4]:
def preprocess(example):
    # Join instruction, input, and output into a single training string
    example['prompt'] = f"{example['instruction']} {example['input']} {example['output']}"
    return example

def tokenize_dataset(dataset):
    # Uses the global `tokenizer` created in a later cell; prompts are truncated to 128 tokens
    tokenized = dataset.map(
        lambda batch: tokenizer(batch['prompt'], truncation=True, max_length=128),
        batched=True,
        remove_columns=['prompt'],
    )
    return tokenized
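
To see what the template in `preprocess` produces, here is a minimal sketch in plain Python. The first record is taken from the sample shown earlier; the second record and the helper name `build_prompt` are hypothetical and used only for illustration.

```python
def build_prompt(record):
    # Same template as preprocess(): the three fields joined by single spaces
    return f"{record['instruction']} {record['input']} {record['output']}"

# Record copied from the dataset sample above
with_input = {
    "instruction": "Organize the given list alphabetically.",
    "input": "Cake, Coffee, Bread, Milk",
    "output": "Bread, Cake, Coffee, Milk",
}

# Hypothetical record with an empty input field
no_input = {
    "instruction": "Why is the sky blue?",
    "input": "",
    "output": "Because of Rayleigh scattering.",
}

print(build_prompt(with_input))
# Organize the given list alphabetically. Cake, Coffee, Bread, Milk Bread, Cake, Coffee, Milk

print(repr(build_prompt(no_input)))
# 'Why is the sky blue?  Because of Rayleigh scattering.'
```

Note that an empty `input` leaves two consecutive spaces between the instruction and the output, and nothing in the prompt marks where the instruction ends and the answer begins.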

In [5]:
dataset = dataset.map(preprocess, remove_columns=['output', 'input', 'instruction'])
dataset

Dataset({
    features: ['prompt'],
    num_rows: 498813
})

* Shuffle the data, keep a 100,000-example subset, and split it 90/10 into train and test sets

In [6]:
dataset = dataset.shuffle(seed=42).select(range(100000)).train_test_split(test_size=0.1, seed=42)
# dataset = dataset.shuffle(seed=42).train_test_split(test_size=0.1, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt'],
        num_rows: 90000
    })
    test: Dataset({
        features: ['prompt'],
        num_rows: 10000
    })
})

In [7]:
train_dataset = dataset["train"]
test_dataset = dataset["test"]

### Loading the Model

* GPT is a causal language model (CLM)
* A CLM predicts the next token in a sequence from the preceding tokens only, without access to future context
* Such models are usually pretrained with a sliding window over the text
* In this technique the data is divided into fixed-length chunks with some overlap; this keeps GPU usage low during training because every position holds a real token and none is wasted on padding
* For that reason GPT-2 has no padding token by default, but our usage is different: we batch prompts of varying length, so they must be padded
* So we reuse the end-of-sequence (EOS) token as the pad token
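
The fixed-length chunking described above can be sketched in plain Python. This is only an illustration of the idea, not the pipeline used to pretrain GPT-2; the function name and the choice to drop a trailing incomplete chunk are assumptions.

```python
def sliding_window_chunks(token_ids, chunk_size, overlap=0):
    """Split a token list into fixed-length chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks = []
    # Stop early so every chunk is full; a trailing partial chunk is dropped
    for start in range(0, len(token_ids) - chunk_size + 1, stride):
        chunks.append(token_ids[start:start + chunk_size])
    return chunks

tokens = list(range(10))  # stand-in for real token IDs
print(sliding_window_chunks(tokens, chunk_size=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Every chunk has the same length, so a batch of chunks needs no padding at all.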

In [8]:
MODEL_NAME = "microsoft/DialoGPT-medium"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

tokenizer.pad_token = tokenizer.eos_token
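
To illustrate what reusing EOS as the pad token implies, here is a small sketch of right-padding in plain Python. The helper `pad_batch` and the short token lists are hypothetical; 50256 is the ID of GPT-2's `<|endoftext|>` token.

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> token, reused here as the pad token

def pad_batch(sequences, pad_id):
    """Right-pad token ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[10, 11, 12], [20]], pad_id=EOS_ID)
print(ids)   # [[10, 11, 12], [20, 50256, 50256]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Because padding and a genuine EOS share the same ID, the attention mask is the only reliable way to tell them apart.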

In [9]:
train_dataset = tokenize_dataset(train_dataset)
test_dataset = tokenize_dataset(test_dataset)

In [10]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.gradient_checkpointing_enable()

## Training the GPT Model 🎯

Now, we configure the training parameters and initiate the training process using our prepared datasets.

To turn the tokenized examples into training batches we need a data collator. It pads the examples in each batch to a common length (which is why it needs the tokenizer) and builds the labels used to compute the loss. Since GPT is a causal, generative model rather than a masked language model, we set `mlm=False`.

In [11]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
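
As a rough picture of what the collator returns when `mlm=False`, here is a simplified sketch in plain Python. It is not the library's implementation: the function name is hypothetical, lists stand in for tensors, and it assumes that labels are a copy of the input IDs with every position equal to the pad token ID replaced by `-100`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def collate_causal_lm(sequences, pad_id):
    """Pad a batch and derive causal-LM labels from the padded input IDs."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids = seq + [pad_id] * n_pad
        batch["input_ids"].append(ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; the model shifts them by one position internally
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in ids])
    return batch

# Toy batch where 0 plays the role of the shared EOS/pad ID
batch = collate_causal_lm([[5, 6, 7], [8, 0]], pad_id=0)
print(batch["labels"])
# [[5, 6, 7], [8, -100, -100]]
```

In this sketch the second sequence ends with a real EOS (ID 0), yet its label is also set to `-100`. When the pad token equals the EOS token, genuine end-of-sequence positions are therefore excluded from the loss as well.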

In [12]:
training_args = TrainingArguments(
    output_dir="./models/dialogpt2-instruct",
    num_train_epochs=1,
    per_device_eval_batch_size=1,
    per_device_train_batch_size=1,
    fp16=True,  # Enables mixed precision (saves memory)
    # deepspeed="ds_config.json"
)
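
To get a feel for how long one epoch is with these settings, the number of optimizer steps can be estimated as below. This is an approximation under stated assumptions (a single GPU and no gradient accumulation unless specified); `optimizer_steps` is a hypothetical helper, and the exact count reported by `Trainer` may differ slightly.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_devices=1,
                    gradient_accumulation_steps=1, num_epochs=1):
    """Estimate how many optimizer updates a training run performs."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

# Settings from the cell above: 90,000 training rows, batch size 1, one epoch
print(optimizer_steps(90_000, per_device_batch_size=1))
# 90000

# Same data with gradient accumulation over 8 batches (an illustrative value)
print(optimizer_steps(90_000, per_device_batch_size=1, gradient_accumulation_steps=8))
# 11250
```

A batch size of 1 keeps memory use low at the cost of many small updates; `gradient_accumulation_steps` in `TrainingArguments` is one way to raise the effective batch size without using more memory per step.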

Next we create the training pipeline; the `Trainer` class handles this for us.

In [13]:
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     data_collator=data_collator
# )

# Training is GPU-intensive and slow, so instead we load a model that was already trained on this dataset
# trainer.train()

In [14]:
trained_model_name = "TheFuzzyScientist/diabloGPT_open-instruct"
trained_model = AutoModelForCausalLM.from_pretrained(trained_model_name).to(device)

* We tokenize the input prompt and pass it to the model
* We set `max_length` to 64, which caps the whole sequence (prompt plus generated tokens) at 64 tokens and keeps the answers short and crisp
* `generate` returns token IDs rather than probability vectors; these IDs are decoded back into text with the same tokenizer

In [15]:
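def generate_text(prompt):
    # Encode the prompt into token IDs on the same device as the model
    inputs = tokenizer.encode(prompt, return_tensors='pt').to(device)
    # max_length counts the prompt tokens as well as the generated ones
    outputs = trained_model.generate(inputs, max_length=64, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep everything up to the last period so the answer does not stop mid-sentence
    return generated_text[:generated_text.rfind('.')+1]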
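
The last line of `generate_text` trims the output at the final period. The sketch below isolates that slicing in plain Python and shows one edge case. The helper names and sample strings are hypothetical, and `trim_or_keep` is a suggested variant, not part of the notebook.

```python
def trim_to_last_period(text):
    # Same slicing as generate_text: rfind returns -1 when no period exists
    return text[:text.rfind('.') + 1]

def trim_or_keep(text):
    # Variant that falls back to the untrimmed text when there is no period
    end = text.rfind('.')
    return text[:end + 1] if end != -1 else text

print(trim_to_last_period("First sentence. Second sentence. Unfinished thi"))
# First sentence. Second sentence.

print(repr(trim_to_last_period("No period here")))
# ''

print(repr(trim_or_keep("No period here")))
# 'No period here'
```

If the generated text contains no period at all, the original slicing returns an empty string, which may be worth guarding against.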

In [16]:
prompt = "Should I invest stocks?"

In [17]:
print(generate_text("What's the best way to cook chiken breast?"))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What's the best way to cook chiken breast?  The best way to cook chiken breast is to season it with salt and pepper, then heat a pan over medium heat. Add a tablespoon of olive oil and cook for about 5 minutes, stirring occasionally.
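
The attention-mask warning printed above appears because `tokenizer.encode` returns only input IDs. One possible way to avoid it is sketched below, assuming the `tokenizer`, `trained_model`, and `device` objects from the earlier cells are passed in. The function name is hypothetical, and the sketch swaps `max_length` for `max_new_tokens`, which limits only the newly generated tokens.

```python
def generate_text_with_mask(prompt, tokenizer, model, device="cpu", max_new_tokens=64):
    # Calling the tokenizer directly returns input_ids and an attention_mask
    enc = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],  # tells the model which positions are real tokens
        max_new_tokens=max_new_tokens,         # counts generated tokens only, not the prompt
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example call in this notebook (not executed here):
# generate_text_with_mask("Should I invest stocks?", tokenizer, trained_model, device)
```

This version returns the full decoded text; the period trimming from `generate_text` can be applied to the result if desired.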


In [18]:
print(generate_text(prompt))

Should I invest stocks?  Yes, it is a good idea to invest in stocks. It is important to understand the risks associated with investing in stocks and to make sure that you are taking the necessary precautions. It is also important to understand the potential returns and to make sure that you are making the right investment.


In [19]:
print(generate_text("I need a place to go for this summer vacation, what locations would you recommend"))

I need a place to go for this summer vacation, what locations would you recommend.  I would recommend visiting the beach in San Diego, California. It is a popular destination for vacationers and has a great view of the ocean.


In [20]:
generate_text("What's the fastest route from NY City to Boston?")

"What's the fastest route from NY City to Boston?  The fastest route from New York City to Boston is by taking the New York City subway. The subway takes about 3 hours and 15 minutes to get from the city center to the Boston Common."