# Instruction tuning

In [1]:
!pip install transformers
!pip install torch
!pip install accelerate
!pip install pyarrow
!pip install datasets


Today, we break up the pipeline function from transformers that we have used previously into its separate steps.

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, max_length=250)


What the pipeline was doing behind the curtain was tokenising the text, but we can just as easily do that in a separate step. Hugging Face lets us initialize our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.


In [4]:
input_text = "My name is "

tokenized_text = tokenizer(input_text, return_tensors="pt")
tokenized_text

{'input_ids': tensor([[499, 564,  19,   3,   1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

The input ids are the ids of the tokens in the vocabulary, which the model then converts into the embeddings of the tokens. We can check this by decoding the ids back into words.
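
To make the step from ids to embeddings concrete, here is a minimal sketch in plain Python. The table size and its values are invented for illustration; the real model stores a learned matrix with one row per vocabulary entry, and the ids below are hypothetical.

```python
# Toy embedding table: one row (vector) per token id.
# Sizes and values are made up for illustration only.
vocab_size, hidden_size = 6, 3
embedding_table = [
    [float(token_id * 10 + dim) for dim in range(hidden_size)]
    for token_id in range(vocab_size)
]

def embed(input_ids):
    """Look up the embedding row for every token id in a sequence."""
    return [embedding_table[token_id] for token_id in input_ids]

# Hypothetical ids for a three-token input
vectors = embed([4, 2, 5])
print(vectors)
```

Each id simply selects a row of the table, which is why the same token always starts from the same vector regardless of where it appears in the text.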

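The attention mask is something we have seen before: it marks which parts of the input the model will attend to. A 1 means the token is attended to, and a 0 means it is ignored (for example because it is padding).
Try to add a "padding" argument to the tokenizer call and see how the attention mask changes.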

Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.

In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well.
Try to insert a long sentence and add the truncation argument.
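
The following is a minimal sketch of what padding and truncation do to a batch, written in plain Python rather than with the real tokenizer. The helper name pad_and_truncate is hypothetical, the pad id of 0 is an assumption about this checkpoint, and the real tokenizer also takes care of special tokens such as `</s>` when truncating, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding token id for this checkpoint

def pad_and_truncate(batch_ids, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        # Padding goes on the right; the mask is 0 wherever we padded
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[499, 564, 19, 3, 1], [564, 1]], max_length=4)
print(batch)
```

With the real tokenizer, roughly the same effect comes from passing padding=True, truncation=True and max_length to the tokenizer call on a list of texts.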

In [5]:
tokenizer.decode([564])

'name'

Another part of the pipeline corresponds to the .generate() method, which takes the input token ids and generates the output token ids. We can do this in a separate step as well.



In [7]:
output = model.generate(tokenized_text["input_ids"])
output

tensor([[  0,   3,   9,   3,   7,   9, 967,   1]])

We then only need to decode the ids back into words to get the generated text.

In [8]:
tokenizer.decode(output[0])

'<pad> a sailor</s>'

Try running the generation with and without moving the model and the inputs to cuda.


In [12]:
model = model.to("cuda")
model.generate(tokenizer(input_text, return_tensors="pt").to("cuda")["input_ids"])

tensor([[  0,   3,   9,   3,   7,   9, 967,   1]], device='cuda:0')

The output now tells us that the device used is cuda (the GPU), and the processing time is much faster.

Make your own function that works like the pipeline, but with the tokenization and generation steps separated.


In [8]:
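def my_pipe(input_text, model):
    # Tokenize and move the inputs to whichever device the model is on
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    # Generate output ids, then decode the first (only) sequence back into text
    output = model.generate(inputs["input_ids"])
    return tokenizer.decode(output[0])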

## Machine translation

We will try to do machine translation with three prompting strategies:

- zero shot
- one shot
- few shot

Then we fine-tune the model and try them all again to see if the results improved.

The dataset is [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100), which contains translation pairs covering 100 languages. I chose the Danish-English translation pairs because that makes it easier for me to evaluate the quality of the translations, so feel free to choose a different language pair if you prefer. You can see the available language pairs in the "Subset" part of the dataset viewer.


In [14]:
from datasets import load_dataset

ds = load_dataset("Helsinki-NLP/opus-100", "da-en", split='train[:1%]')

In [15]:
ds

Dataset({
    features: ['translation'],
    num_rows: 10000
})

We used Hugging Face's datasets library to load the dataset. The dataset is already split into training, validation, and test sets, so we can use those directly; here we loaded the first 1% of the training split. Each row stores the sentence pair in a dict, so we can just use the language key to access the part of the data we need.

In [16]:
def unpack_cols(row):
    row["en"] = row["translation"]["en"]
    row["da"] = row["translation"]["da"]
    return row

train = ds.map(unpack_cols, remove_columns=["translation"])
train

Dataset({
    features: ['en', 'da'],
    num_rows: 10000
})

In [17]:
train[150]

{'en': 'That looks very painful, Viktor.',
 'da': 'Det ser smertefuldt ud, Viktor.'}

Try to pick a few sentences and see how well the model can translate out of the box.

In [18]:
input_ids = tokenizer(train[150]['en'], return_tensors="pt").to("cuda")["input_ids"]
input_ids

tensor([[  466,  1416,   182, 10875,     6,  1813, 10377,     5,     1]],
       device='cuda:0')

In [19]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> That looks very painful, Viktor.</s>'

In [20]:
input_ids = tokenizer(f"English: {train[150]['en']} Danish: ", return_tensors="pt").to("cuda")["input_ids"]
input_ids

tensor([[ 1566,    10,   466,  1416,   182, 10875,     6,  1813, 10377,     5,
         23124,    10,     3,     1]], device='cuda:0')

In [21]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> Viktor: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>'

In [22]:
input_ids = tokenizer(f"""
                      English: "What a wonderful day!"
                      Danish: "Sikke en vidunderlig dag!"

                      English: "How are you?"
                      Danish: "Hvordan har du det?"
                      
                      English: {train[150]['en']} 
                      Danish: """, 
                      return_tensors="pt").to("cuda")["input_ids"]
input_ids

tensor([[ 1566,    10,    96,  5680,     3,     9,  1627,   239,  4720, 23124,
            10,    96,   134,    23,  8511,    15,     3,    35,     3,  6961,
          7248,  2825,   836,   122,  4720,  1566,    10,    96,  7825,    33,
            25,  4609, 23124,    10,    96,   566,  1967,  3768,     3,  3272,
           146,    20,    17,  4609,  1566,    10,   466,  1416,   182, 10875,
             6,  1813, 10377,     5, 23124,    10,     3,     1]],
       device='cuda:0')

In [23]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> Viktor: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>'
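
The zero-, one-, and few-shot prompts differ only in how many demonstration pairs come before the sentence to translate, so they can be built by one helper. This is a minimal sketch: the function name build_prompt is hypothetical, and the demonstration pairs are the ones used in the prompt above (without the surrounding quotation marks).

```python
def build_prompt(source_text, examples=()):
    """Build a zero-, one-, or few-shot prompt from (english, danish) pairs."""
    blocks = [f"English: {en}\nDanish: {da}" for en, da in examples]
    # The final block leaves the Danish side empty for the model to fill in
    blocks.append(f"English: {source_text}\nDanish: ")
    return "\n\n".join(blocks)

demos = [
    ("What a wonderful day!", "Sikke en vidunderlig dag!"),
    ("How are you?", "Hvordan har du det?"),
]
sentence = "That looks very painful, Viktor."
zero_shot = build_prompt(sentence)
one_shot = build_prompt(sentence, demos[:1])
few_shot = build_prompt(sentence, demos)
print(few_shot)
```

Any of the three strings can then be passed to the tokenizer in place of the hand-written prompts.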

Now let's try instruction tuning the model to hopefully get a better result.

The datasets library has a nice map method that we can use to apply a function to all the examples in the dataset. The map method can take a custom function, so we just need to write a function that prepares our data for the model.

Write a preprocessing function that takes in a batch of rows from the dataset and:
- defines an instruction
- appends each input text to the instruction
- creates a list of all input texts
- creates a new column in the dataset called "input_ids" that contains the token ids of the input texts
- creates a list of all output texts
- creates a new column in the dataset called "labels" that contains the token ids of the output texts
- returns the augmented batch


In [24]:
def preprocessing_func(batch):
    # Wrap every English source sentence in the instruction
    input_texts = ["English: " + text + " Danish: " for text in batch["en"]]
    # Tokenize on the CPU; the Trainer moves each batch to the GPU itself
    batch["input_ids"] = tokenizer(input_texts, padding="max_length", truncation=True).input_ids
    # The Danish targets become the labels the model learns to generate
    batch["labels"] = tokenizer(batch["da"], padding="max_length", truncation=True).input_ids
    return batch
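
One possible refinement, not used in this notebook: because the labels are padded to the maximum length, the loss is also computed on the padding positions. In PyTorch and the Hugging Face seq2seq models, label positions set to -100 are skipped by the cross-entropy loss, so a common approach is to replace the padding ids in the labels. The sketch below shows that replacement in plain Python; the helper name mask_padding and the example label ids are hypothetical, and the pad id of 0 is an assumption about this tokenizer.

```python
PAD_ID = 0           # assumed padding token id of this tokenizer
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_padding(label_ids):
    """Replace padding ids in the labels with the ignore index."""
    return [
        [IGNORE_INDEX if tok == PAD_ID else tok for tok in seq]
        for seq in label_ids
    ]

# Hypothetical padded label ids for two target sentences
labels = mask_padding([[374, 52, 1, 0, 0], [8, 1, 0, 0, 0]])
print(labels)
```

Only the labels are changed this way; the input ids keep their padding, since the attention mask already tells the model to ignore it.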

In [25]:
tokenized_train = train.map(preprocessing_func, batched=True)

In [26]:
tokenized_train = tokenized_train.remove_columns(["en", "da"])
tokenized_train

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 10000
})

In [67]:
tokenized_train[0]

{'input_ids': [1566, 10, 242, 8, 3, 5080, 188, 16761, 3201, 23124, 10, 3, 1, 0, 0, 0, ...

(output truncated: the remaining displayed values are all 0, the padding token id)

We then want to initialize a Trainer class.

To do this, we have to define the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional.

There are many things you can optimise here, like the learning rate, the batch size, the number of epochs, etc., and the default values are a fine starting point.
I have changed a few parameters: the learning rate and weight decay, the max number of steps (so it doesn't run for a very long time), the logging steps (so we get updated more frequently on the loss), and the batch size (also for speed).
If you want to change them, you can find the full list of arguments in the documentation.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(output_dir="./flan-t5-small-da-en",
   per_device_train_batch_size=4,
   learning_rate=1e-3,
   weight_decay=0.01,
   max_steps=3000,
   logging_steps=200,
)

trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
)

If max_steps is given, it overrides any value given in num_train_epochs.
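
To get a feel for how far max_steps takes us through the data, here is a small back-of-the-envelope calculation using the numbers from this notebook. It assumes a single device and no gradient accumulation, so that one step consumes exactly one batch.

```python
import math

num_examples = 10_000  # rows in the training subset loaded earlier
batch_size = 4         # per_device_train_batch_size
max_steps = 3_000      # optimizer steps requested

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(num_examples / batch_size)
epochs_covered = max_steps / steps_per_epoch
print(steps_per_epoch, epochs_covered)
```

Under those assumptions, 3000 steps correspond to a little more than one pass over the 10000 training pairs.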


Buckle up, this will probably take a handful of minutes.

In [28]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
200,1.3577
400,0.1665
600,0.1669
800,0.1499




KeyboardInterrupt: 

In [37]:
trainer.save_model("instruct-model")

In [38]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("instruct-model").to("cuda")

In [39]:
my_pipe("English: How are you? Danish: ", instruct_model)

'<pad> Hvordan er du?</s>'

Try to use your new, instruction-tuned model with the pipeline function you made earlier, and test it on a handful of examples again.
- Does it perform better than before? Why?
- What if you instruct it to perform a different task? Does the performance transfer? Why?
- If you wanted to instruction tune a model to be able to solve a multitude of tasks (like ChatGPT), what kind of training data would you need?
- How would you produce that kind of data?
- What are the limitations?

- One- and few-shot inference: With smaller and more manageable models, you can sometimes achieve good performance with good prompting and in-context learning. This can happen, for example, if the model has been trained on similar tasks and just requires some nudging to adapt its behavior to a new task or dataset.
- Instruction fine-tuning: If in-context learning is not enough, you can fine-tune your model. Note that here we have fine-tuned a relatively small model, and scaling to larger models can become prohibitively resource-intensive.