# **Training GPT for Instruction Following**

This notebook fine-tunes a GPT-style model to follow instructions, using the `hakurei/open-instruct-v1` dataset, which covers a wide range of tasks. We walk through data preprocessing, model training, and text generation.


## Setup and Dependencies

First, we install pinned versions of `transformers` and `datasets`, then import the classes we need.

In [None]:
!pip install transformers[torch]==4.38.2
!pip install datasets==2.13.1


In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset



## Data Exploration and Preparation

Let's load and preview our dataset, ensuring we understand the kind of data we're working with.

In [None]:
dataset = load_dataset("hakurei/open-instruct-v1", split='train')
dataset.to_pandas().sample(20)



| index | instruction | input | output |
|---|---|---|---|
| 258186 | Write a summary of the given article. Make sur... | The European Central Bank has decided to furth... | The European Central Bank (ECB) has lowered it... |
| 356368 | Classify whether the given sentence is about f... | | Entertainment |
| 238644 | Create a function that takes two arrays and re... | | def find_common_elements(arr1, arr2):\n com... |
| 111738 | Given a list of words, find out how many words... | | 3 |
| 437035 | chegg A spherical container 1.6m in diameter i... | | To solve this problem, we need to apply princi... |
| 363348 | What is their age group?\n25 | | 20-30 |
| 482587 | What are the laws for filing taxes in my state. | | The answer to this question depends on what st... |
| 292561 | Identify the odd word among the following. | scream, whisper, shout | whisper |
| 372829 | Design a simple algorithm for playing chess, i... | | White wins. |
| 4665 | Describe the attributes of a pine tree. | | A pine tree is an evergreen coniferous tree wi... |


In [None]:
dataset[:5]

{'instruction': ['Give three tips for staying healthy.',
  'What are the three primary colors?',
  'Describe the structure of an atom.',
  'How can we reduce air pollution?',
  'Pretend you are a project manager of a construction company. Describe a time when you had to make a difficult decision.'],
 'input': ['', '', '', '', ''],
 'output': ['1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
  'The three primary colors are red, blue, and yellow.',
  'An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.',
  'There are a number of ways to reduc

In [None]:
dataset.to_pandas()[:5]

| index | instruction | input | output |
|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1. Eat a balanced diet and make sure to includ... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... |
| 4 | Pretend you are a project manager of a constru... | | I had to make a difficult decision when I was ... |


In [None]:
example = dataset[0]
example

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

In [None]:
def preprocess(example):
    example['prompt'] = f"{example['instruction']} {example['input']} {example['output']}"
    return example

preprocess(example)

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'prompt': 'Give three tips for staying healthy.  1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
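The `preprocess` template joins the three fields with single spaces, so an empty `input` leaves a double space and nothing marks where the response begins. A minimal sketch of one possible alternative follows. The `build_prompt` helper and the `### Response:` marker are hypothetical: they are not used anywhere else in this notebook, and the fine-tuned checkpoint loaded later was not necessarily trained with them.

```python
def build_prompt(example):
    """Hypothetical alternative to preprocess(): skip empty fields and mark the response."""
    parts = [example["instruction"].strip()]
    # Only include the optional input when it is non-empty
    if example["input"].strip():
        parts.append(example["input"].strip())
    # Hypothetical marker so the model can tell the instruction from the answer
    parts.append("### Response:")
    parts.append(example["output"].strip())
    return "\n".join(parts)


sample = {
    "instruction": "Identify the odd word among the following.",
    "input": "scream, whisper, shout",
    "output": "whisper",
}
print(build_prompt(sample))
```

If a template like this were used for training, the same marker would also have to be appended to prompts at generation time.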

In [None]:
def tokenize_datasets(dataset):
    # Tokenize prompts in batches, truncate to 128 tokens, and drop the raw text column
    return dataset.map(lambda example: tokenizer(example['prompt'], truncation=True, max_length=128),
                       batched=True, remove_columns=['prompt'])


### Shuffling and Splitting the Dataset

Next, we apply `preprocess` to the whole dataset, shuffle it, keep a 1,000-example subset to keep training fast, and split that subset into training (90%) and test (10%) sets.

In [None]:
dataset = dataset.map(preprocess, remove_columns=['instruction', 'input', 'output'])
dataset[:2]

Map:   0%|          | 0/498813 [00:00<?, ? examples/s]

{'prompt': ['Give three tips for staying healthy.  1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
  'What are the three primary colors?  The three primary colors are red, blue, and yellow.']}

In [None]:
dataset = dataset.shuffle(seed=42).select(range(1000)).train_test_split(test_size=0.1, seed=42)

In [None]:
train_dataset = dataset['train']
test_dataset = dataset['test']
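The shuffle, select, and split chain can be illustrated on a plain Python list. This is only a sketch of the logic using the standard library: the `shuffle_select_split` helper is hypothetical, and it does not reproduce the exact ordering or rounding that the `datasets` library uses.

```python
import random


def shuffle_select_split(rows, n_keep, test_size, seed):
    """Shuffle rows reproducibly, keep the first n_keep, then hold out a test fraction."""
    rng = random.Random(seed)
    shuffled = rows[:]           # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    subset = shuffled[:n_keep]   # analogous to .select(range(n_keep))
    rng.shuffle(subset)          # train_test_split also shuffles by default
    n_test = int(round(len(subset) * test_size))
    return subset[n_test:], subset[:n_test]


rows = [f"prompt {i}" for i in range(5000)]  # made-up stand-in data
train_rows, test_rows = shuffle_select_split(rows, n_keep=1000, test_size=0.1, seed=42)
print(len(train_rows), len(test_rows))  # 900 100
```

Fixing the seed makes the subset and the split reproducible across runs, which is why both `shuffle` and `train_test_split` receive `42` in the notebook.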

## Model Initialization and Tokenization

We load the tokenizer and the model. The GPT-2-style tokenizer used by DialoGPT has no padding token, so we reuse the end-of-sequence token for padding before tokenizing both splits.

In [None]:
MODEL_NAME = "microsoft/DialoGPT-medium"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = tokenize_datasets(train_dataset)
test_dataset = tokenize_datasets(test_dataset)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)




In [None]:
train_dataset.shape

(900, 2)

## Training the GPT Model
Now, we configure the training parameters and initiate the training process using our prepared datasets.

`DataCollatorForLanguageModeling` from `transformers` assembles tokenized examples into training batches. It pads every sequence in a batch to the same length and builds the `labels` tensor. With `mlm=True` it randomly masks tokens for masked language modeling; with `mlm=False`, as used here for a causal model such as GPT, the labels are a copy of the input IDs with padding positions set to -100 so that the loss ignores them.
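To make the causal (`mlm=False`) behavior concrete, here is a simplified pure-Python sketch of that kind of collation. It is an illustration, not the library's implementation: `collate_causal_lm` is a hypothetical helper, the token IDs are made up, and it works on lists rather than tensors.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(batch, pad_token_id):
    """Pad token-id lists to a common length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad  # pad on the right
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad id is ignored by the loss
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

Because masking is keyed on the pad ID, reusing the end-of-sequence token as the padding token means end-of-sequence positions are excluded from the loss as well. The one-position shift between inputs and targets happens inside the model, not in the collator.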

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(output_dir="models/diablo_gpt",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
                  data_collator=data_collator)



In [None]:
# Training takes a long time, so the call is left commented out; uncomment it to train
# trainer.train()
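As a rough guide to how much work one epoch involves, the number of optimizer steps can be estimated from the split size and batch size used in this notebook. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

num_examples = 900     # size of the training split in this notebook
batch_size = 32        # per_device_train_batch_size
num_epochs = 1         # num_train_epochs
num_devices = 1        # assumption: a single GPU or CPU
grad_accum_steps = 1   # assumption: no gradient accumulation

# The last, smaller batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 29 29
```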

In [None]:
# Load an already fine-tuned checkpoint from the Hugging Face Hub instead of training
model = AutoModelForCausalLM.from_pretrained("TheFuzzyScientist/diabloGPT_open-instruct")




## Text Generation and Application
Finally, let's demonstrate the capability of our trained model by generating responses to various instructions.

In [None]:
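def generate_text(prompt):
    # Encode the prompt and move it to the same device as the model (CPU or GPU)
    inputs = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
    # max_length counts the prompt tokens plus the generated tokens
    outputs = model.generate(inputs, max_length=64, pad_token_id=tokenizer.eos_token_id)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Trim to the last complete sentence; keep everything if there is no period
    end = generated.rfind('.')
    return generated[:end + 1] if end != -1 else generated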
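Because a causal model continues the text it is given, the string returned by `generate_text` starts with the prompt itself. If only the answer is wanted, the prompt can be stripped afterwards. A minimal sketch, assuming a hypothetical `strip_prompt` helper and an illustrative string in the same format as the notebook's outputs:

```python
def strip_prompt(generated, prompt):
    """Return only the model's continuation when the output starts with the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fall back to the full text if the prompt was not echoed verbatim
    return generated.strip()


full = "What are the three primary colors?  The three primary colors are red, blue, and yellow."
print(strip_prompt(full, "What are the three primary colors?"))
# The three primary colors are red, blue, and yellow.
```

An alternative that avoids string matching is to decode only the token IDs that come after the prompt's length in the generated sequence.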

In [None]:
generate_text("What's the best way to cook chiken breast?")

"What's the best way to cook chiken breast?  The best way to cook chiken breast is to season it with salt and pepper, then heat a pan over medium heat. Add a tablespoon of olive oil and cook for about 5 minutes, stirring occasionally."

In [None]:
generate_text("Should I invest stocks?")

'Should I invest stocks?  Yes, it is a good idea to invest in stocks. It is important to understand the risks associated with investing in stocks and to make sure that you are taking the necessary precautions. It is also important to understand the potential returns and to make sure that you are making the right investment.'

In [None]:
generate_text("I need a place to go for this summer vacation, what locations would you recommend")

'I need a place to go for this summer vacation, what locations would you recommend.  I would recommend visiting the beach in San Diego, California. It is a popular destination for vacationers and has a great view of the ocean.'

In [None]:
generate_text("I need a place to go for this summer vacation, what locations would you recommend in India")

'I need a place to go for this summer vacation, what locations would you recommend in India.  I would recommend visiting the Taj Mahal in Mumbai, India. It is a beautiful and historic building that is known for its rich history and culture.'

In [None]:
generate_text("What's the fastest route from NY City to Boston?")

"What's the fastest route from NY City to Boston?  The fastest route from New York City to Boston is by taking the New York City subway. The subway takes about 3 hours and 15 minutes to get from the city center to the Boston Common."