we are breaking up the pipeline

In [None]:
# TODO: add requirements (this notebook imports transformers and datasets, and asks for PyTorch tensors)

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM



In [2]:
model_name = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, max_length=250)



behind the curtain, the pipeline first tokenises the text, but we can just as easily do that as a separate step. Hugging Face lets us initialise the tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.


In [69]:
input_text = "My name is "

tokenizer(input_text, return_tensors="pt")

{'input_ids': tensor([[499, 564,  19,   3,   1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

the input ids are the indices of the tokens in the vocabulary; the model looks up an embedding for each id. we can check this by decoding an id back into its token
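to make the id lookup concrete without loading the real tokenizer, here is a minimal sketch with a toy table. only 564 -> "name" and 1 -> "</s>" can be read off the notebook's outputs; the other entries, and the helper names toy_decode and toy_encode, are assumptions made up for illustration. the real tokenizer also splits words into subword pieces, which this sketch ignores.

```python
# Toy id -> token table. Only 564 -> "name" and 1 -> "</s>" are confirmed by
# the notebook outputs; the remaining entries are illustrative assumptions.
toy_vocab = {499: "My", 564: "name", 19: "is", 3: "", 1: "</s>"}

# Reverse table for encoding (token -> id).
toy_ids = {token: idx for idx, token in toy_vocab.items()}


def toy_decode(ids):
    """Look up each id and join the non-empty tokens with spaces."""
    return " ".join(toy_vocab[i] for i in ids if toy_vocab[i])


def toy_encode(text):
    """Split on whitespace and look up each word (no subword handling)."""
    return [toy_ids[word] for word in text.split()]


decoded = toy_decode([499, 564, 19, 3, 1])
encoded = toy_encode("My name is")
print(decoded)
print(encoded)
```

encoding and decoding are just two directions of the same lookup, which is why decoding a single id gives back a single token.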


In [70]:
tokenizer.decode([564])

'name'

the next part of the pipeline corresponds to the .generate() method, which takes the input token ids and generates the output token ids one at a time. we can do this in a separate step as well
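as a rough picture of what generating ids one at a time means, here is a toy greedy decoding loop. it assumes greedy decoding (always pick the highest-scoring next id) and swaps the real model for a hypothetical toy_scores function that simply replays a fixed sequence of ids, so it is a sketch of the loop and not of how .generate() is actually implemented.

```python
START_ID = 0  # <pad>, the id that generated sequences in this notebook begin with
EOS_ID = 1    # </s>, the end-of-sequence id

# Hypothetical scripted "model" output, used only to drive the toy loop.
SCRIPT = [3, 9, 3, 7, 9, 967, EOS_ID]


def toy_scores(prefix):
    """Stand-in for the model: give the scripted next id the highest score."""
    step = min(len(prefix) - 1, len(SCRIPT) - 1)
    target = SCRIPT[step]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(SCRIPT)}


def greedy_generate(score_fn, max_length=250):
    """Append the best-scoring id until EOS is produced or max_length is hit."""
    ids = [START_ID]
    while len(ids) < max_length:
        scores = score_fn(ids)
        next_id = max(scores, key=scores.get)
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids


generated = greedy_generate(toy_scores)
print(generated)
```

the loop has two stopping conditions: the end-of-sequence id or the maximum length. when a model never produces the end-of-sequence id, generation runs until the length limit.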

In [None]:
output = model.generate(tokenizer(input_text, return_tensors="pt")["input_ids"])
output



tensor([[  0,   3,   9,   3,   7,   9, 967,   1]])

we then only need to decode the ids back into words to get the generated text

In [72]:
tokenizer.decode(output[0])

'<pad> a sailor</s>'
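the decoded string still contains the special tokens <pad> and </s>. with the real tokenizer you can pass skip_special_tokens=True to decode to drop them. the sketch below shows the same effect on the string itself, without loading a model; the helper name strip_special_tokens is made up for illustration.

```python
import re

# The two special-token markers visible in the decoded output above.
SPECIAL_TOKENS = ("<pad>", "</s>")


def strip_special_tokens(text, specials=SPECIAL_TOKENS):
    """Remove special-token markers and tidy up leftover whitespace."""
    for token in specials:
        text = text.replace(token, "")
    return re.sub(r"\s+", " ", text).strip()


cleaned = strip_special_tokens("<pad> a sailor</s>")
print(cleaned)  # a sailor
```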

make a function that works like the pipeline, but with the tokenization and generation steps separated


In [32]:
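def my_pipe(input_text, model):
    # tokenize the text (uses the global tokenizer defined earlier)
    input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"]
    # generate output ids, then decode the first (and only) sequence back to text
    output = model.generate(input_ids)
    return tokenizer.decode(output[0])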


## instruction tuning

we will try to do machine translation in three prompting setups:

- zero shot
- one shot
- few shot

then we finetune the model and run all three again to see if they improved
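the three prompting setups differ only in how many worked examples are placed in front of the sentence to translate. a minimal sketch, assuming the "English: ... Danish: ..." prompt format used in this notebook; build_prompt is a hypothetical helper, not part of any library.

```python
def build_prompt(query, examples=()):
    """Build an 'English: ... Danish: ...' prompt with zero, one or several examples."""
    parts = [f"English: {en}\nDanish: {da}\n" for en, da in examples]
    # The final block leaves the Danish side empty for the model to complete.
    parts.append(f"English: {query}\nDanish: ")
    return "\n".join(parts)


# Example pairs and query taken from this notebook.
demos = [
    ("What a wonderful day!", "Sikke en vidunderlig dag!"),
    ("How are you?", "Hvordan har du det?"),
]
query = "What'd she say?"

zero_shot = build_prompt(query)            # no examples
one_shot = build_prompt(query, demos[:1])  # one example
few_shot = build_prompt(query, demos)      # several examples
print(few_shot)
```

keeping the prompt format in one function makes it easy to rerun the same three setups after finetuning.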

the dataset is [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100), which contains translation pairs covering over 100 languages. i chose the Danish-English pairs because that makes it easier for me to evaluate the quality of the translations, but feel free to choose a different language pair if you prefer. you can see the available language pairs in the "Subset" part of the dataset viewer.


In [3]:
from datasets import load_dataset

ds = load_dataset("Helsinki-NLP/opus-100", "da-en")

In [46]:
ds

DatasetDict({
    test: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
})

we use Hugging Face's datasets library to load the data. it already comes split into training, validation, and test sets, so we can use those directly. the result behaves like a dict, so we can index it with the split name to get the part of the data we need

In [4]:
train = ds["validation"]  # NOTE: using the validation split for now; change to ds["train"] for real training
train

Dataset({
    features: ['translation'],
    num_rows: 2000
})

In [50]:
train["translation"][250]

{'da': '– Hvad sagde hun?', 'en': "- What'd she say?"}

try to pick a few sentences and see how well the model can translate out of the box.

In [22]:
input_ids = tokenizer(train['translation'][250]['en'], return_tensors="pt")["input_ids"]
input_ids

tensor([[ 466,   31,  195,   36, 1533,   40,   35,    9,    5,    1]])

In [74]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> Arlena is a beautiful girl.</s>'

In [75]:
input_ids = tokenizer(f"English: {train['translation'][250]['en']} Danish: ", return_tensors="pt")["input_ids"]
input_ids

tensor([[ 1566,    10,   466,    31,   195,    36,  1533,    40,    35,     9,
             5, 23124,    10,     3,     1]])

In [76]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> Arlena s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s<unk> s'

In [77]:
input_ids = tokenizer(f"""
                      English: "What a wonderful day!"
                      Danish: "Sikke en vidunderlig dag!"

                      English: "How are you?"
                      Danish: "Hvordan har du det?"
                      
                      English: {train['translation'][250]['en']} 
                      Danish: """, 
                      return_tensors="pt")["input_ids"]
input_ids

tensor([[ 1566,    10,    96,  5680,     3,     9,  1627,   239,  4720, 23124,
            10,    96,   134,    23,  8511,    15,     3,    35,     3,  6961,
          7248,  2825,   836,   122,  4720,  1566,    10,    96,  7825,    33,
            25,  4609, 23124,    10,    96,   566,  1967,  3768,     3,  3272,
           146,    20,    17,  4609,  1566,    10,   466,    31,   195,    36,
          1533,    40,    35,     9,     5, 23124,    10,     3,     1]])

In [78]:
tokenizer.decode(model.generate(input_ids)[0])

'<pad> Arlena s<unk> t<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> '

now let's try instruction tuning the model to hopefully get a better result

the datasets library has a nice map method that we can use to apply a function to all the examples in the dataset. the map method can take a custom function, so we just need to write a function that prepares our data for the model.
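conceptually, calling map with a per-row function behaves like the loop below: the function receives one row as a dict, adds or changes fields, and returns it. this is a plain-Python stand-in with two hand-written rows, not the datasets implementation, and the "prompt" field and add_prompt helper are made up for illustration.

```python
# Two hand-written rows in the same shape as the dataset rows in this notebook.
rows = [
    {"translation": {"da": "Hvordan har du det?", "en": "How are you?"}},
    {"translation": {"da": "Sikke en vidunderlig dag!", "en": "What a wonderful day!"}},
]


def add_prompt(row):
    """Per-row function: add a new 'prompt' field and return the row."""
    instruction = "Translate the following English sentence to Danish: "
    row["prompt"] = instruction + row["translation"]["en"]
    return row


# Stand-in for `dataset.map(add_prompt)`: call the function once per row,
# working on a copy so the original rows stay untouched.
mapped = [add_prompt(dict(row)) for row in rows]
print(mapped[0]["prompt"])
```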

write a preprocessing function that takes in a row of the dataset and:
- defines an instruction
- appends the input text to the instruction
- creates a new column in the dataset called "input_ids" that contains the token ids of the input text
- creates a new column in the dataset called "labels" that contains the token ids of the output text
- returns the augmented row


In [None]:
def preprocessing_func(row):
    instruction = "Translate the following English sentence to Danish: "
    input_text = instruction + row['translation']['en']
    # return_tensors="pt" gives a batch of shape (1, sequence_length);
    # [0] takes the single sequence so each row stores a flat list of ids.
    # padding="max_length" pads every example to the tokenizer's maximum length.
    row["input_ids"] = tokenizer(input_text, padding="max_length", truncation=True, return_tensors="pt").input_ids[0]
    target_text = row['translation']['da']
    row["labels"] = tokenizer(target_text, padding="max_length", truncation=True, return_tensors="pt").input_ids[0]
    return row
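one thing to be aware of: with padding="max_length", the labels contain many padding ids, and as written those positions count towards the training loss. a common convention in Hugging Face seq2seq training is to replace padding ids in the labels with -100, which the loss ignores. a minimal sketch of that step, assuming the pad id is 0 (the id that decodes to <pad> in the outputs above); mask_padding is a hypothetical helper and is not used elsewhere in this notebook.

```python
PAD_ID = 0           # assumed pad id: sequences starting with 0 decode to <pad> above
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids in a label sequence so they do not count towards the loss."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]


# Toy label sequence: real ids followed by padding.
padded_labels = [3, 9, 3, 7, 9, 967, 1, 0, 0, 0]
masked = mask_padding(padded_labels)
print(masked)  # [3, 9, 3, 7, 9, 967, 1, -100, -100, -100]
```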

In [25]:
tokenized_train = train.map(preprocessing_func)

Map: 100%|██████████| 2000/2000 [00:00<00:00, 2870.66 examples/s]


In [26]:
tokenized_train = tokenized_train.remove_columns(["translation"])
tokenized_train

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 2000
})

We then want to initialize a Trainer class (here its sequence-to-sequence variant, Seq2SeqTrainer).

To do this, we have to define the TrainingArguments, a class that contains all the attributes used to customize the training. It requires one folder name, which will be used to save the model checkpoints; all other arguments are optional.

There are many things you can tune here, like the learning rate, the batch size, the number of epochs, etc. For now we set only a few of them and leave the rest at their default values. If you want to change more, you can find the full list of arguments in the documentation.

In [29]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(output_dir="./flan-t5-small-da-en",
   learning_rate=1e-4,
   per_device_train_batch_size=4,
   weight_decay=0.01,
   num_train_epochs=1,
   max_steps=100,
   predict_with_generate=True,
   push_to_hub=False
)

trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   tokenizer=tokenizer,
)

note that if max_steps is given, it overrides any value given in num_train_epochs
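to see what 100 steps means in terms of data, here is the arithmetic, assuming a single device and no gradient accumulation. the numbers come from the training arguments and the 2000-row split used above.

```python
max_steps = 100   # max_steps in the training arguments
batch_size = 4    # per_device_train_batch_size
num_rows = 2000   # rows in the split used for training

# Assumes one device and no gradient accumulation.
examples_seen = max_steps * batch_size
epochs_covered = examples_seen / num_rows
print(examples_seen, epochs_covered)  # 400 0.2
```

so the run only sees a fifth of the split once, which matches the 'epoch': 0.2 value reported in the training output.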


In [30]:
trainer.train()

Non-default generation parameters: {'max_length': 250}
100%|██████████| 100/100 [03:36<00:00,  2.17s/it]

{'train_runtime': 216.5011, 'train_samples_per_second': 1.848, 'train_steps_per_second': 0.462, 'train_loss': 11.138912353515625, 'epoch': 0.2}





TrainOutput(global_step=100, training_loss=11.138912353515625, metrics={'train_runtime': 216.5011, 'train_samples_per_second': 1.848, 'train_steps_per_second': 0.462, 'total_flos': 74356201881600.0, 'train_loss': 11.138912353515625, 'epoch': 0.2})

In [33]:
trainer.save_model("instruct-model")

Non-default generation parameters: {'max_length': 250}


In [34]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("instruct-model")

In [35]:
my_pipe("Translate the following English sentence to Danish: How are you?", instruct_model)



'<pad> Wie s<unk>?</s>'