<a href="https://colab.research.google.com/github/NikyParfenov/EncoderDecoderTraining/blob/master/EncoderDecoderDifferentTraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Training an Encoder-Decoder Model in Different Ways

1. Masked LM (denoising) - used for encoder-decoder models (T5, Flan-T5, etc.)
2. Causal LM - used for decoder-only models (GPT-2, BLOOM, PaLM, etc.)
3. Teacher forcing - can be used for both


In [None]:
!pip install transformers[sentencepiece]
!pip install accelerate
!pip install bitsandbytes
!pip install peft
from transformers import T5Tokenizer, T5ForConditionalGeneration
import numpy as np
import torch

Installing collected packages: peft
Successfully installed peft-0.3.0


## Masked LM / denoising training

https://huggingface.co/docs/transformers/main/model_doc/t5#training

In [None]:
# utility class for denoised training, taken from hugging face library
class FlaxDataCollatorForT5MLM:
    """
    From https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py
    """
    def __init__(self,tokenizer,noise_density,mean_noise_span_length) -> None:
        self.tokenizer = tokenizer
        self.noise_density = noise_density
        self.mean_noise_span_length =mean_noise_span_length

    def create_sentinel_ids(self, mask_indices):
        """
        Sentinel ids creation given the indices that should be masked.
        The start indices of each mask are replaced by the sentinel ids in increasing
        order. Consecutive mask indices to be deleted are replaced with `-1`.
        """
        start_indices = mask_indices - np.roll(mask_indices, 1, axis=-1) * mask_indices
        start_indices[:, 0] = mask_indices[:, 0]

        sentinel_ids = np.where(start_indices != 0, np.cumsum(start_indices, axis=-1), start_indices)
        sentinel_ids = np.where(sentinel_ids != 0, (len(self.tokenizer) - sentinel_ids), 0)
        sentinel_ids -= mask_indices - start_indices

        return sentinel_ids

    def filter_input_ids(self, input_ids, sentinel_ids):
        """
        Puts sentinel mask on `input_ids` and fuse consecutive mask tokens into a single mask token by deleting.
        This will reduce the sequence length from `expanded_inputs_length` to `input_length`.
        """
        batch_size = input_ids.shape[0]

        input_ids_full = np.where(sentinel_ids != 0, sentinel_ids, input_ids)
        # input_ids tokens and sentinel tokens are >= 0, tokens < 0 are
        # masked tokens coming after sentinel tokens and should be removed
        input_ids = input_ids_full[input_ids_full >= 0].reshape((batch_size, -1))
        input_ids = np.concatenate(
            [input_ids, np.full((batch_size, 1), self.tokenizer.eos_token_id, dtype=np.int32)], axis=-1
        )
        return input_ids

    def random_spans_noise_mask(self, length):
        """This function is copy of `random_spans_helper <https://github.com/google-research/text-to-text-transfer-transformer/blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2682>`__ .
        # with the correction of this https://github.com/huggingface/transformers/pull/22938/files
        Noise mask consisting of random spans of noise tokens.
        The number of noise tokens and the number of noise spans and non-noise spans
        are determined deterministically as follows:
        num_noise_tokens = round(length * noise_density)
        num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length)
        Spans alternate between non-noise and noise, beginning with non-noise.
        Subject to the above restrictions, all masks are equally likely.
        Args:
            length: an int32 scalar (length of the incoming token sequence)
            noise_density: a float - approximate density of output mask
            mean_noise_span_length: a number
        Returns:
            a boolean tensor with shape [length]
        """

        orig_length = length

        num_noise_tokens = int(np.round(length * self.noise_density))
        num_nonnoise_tokens = length - num_noise_tokens
        # avoid degeneracy by ensuring positive numbers of noise and nonnoise tokens.
        num_noise_tokens = min(max(num_noise_tokens, 1), length - 1)
        # num_noise_spans should not exceed num_noise_tokens or num_nonnoise_tokens
        num_noise_spans = int(np.round(min(num_noise_tokens, num_nonnoise_tokens) / self.mean_noise_span_length))

        # avoid degeneracy by ensuring positive number of noise spans
        num_noise_spans = max(num_noise_spans, 1)

        # pick the lengths of the noise spans and the non-noise spans
        def _random_segmentation(num_items, num_segments):
            """Partition a sequence of items randomly into non-empty segments.
            Args:
                num_items: an integer scalar > 0
                num_segments: an integer scalar in [1, num_items]
            Returns:
                a Tensor with shape [num_segments] containing positive integers that add
                up to num_items
            """
            mask_indices = np.arange(num_items - 1) < (num_segments - 1)
            np.random.shuffle(mask_indices)
            first_in_segment = np.pad(mask_indices, [[1, 0]])
            segment_id = np.cumsum(first_in_segment)
            # count length of sub segments assuming that list is sorted
            _, segment_length = np.unique(segment_id, return_counts=True)
            return segment_length

        noise_span_lengths = _random_segmentation(num_noise_tokens, num_noise_spans)
        nonnoise_span_lengths = _random_segmentation(num_nonnoise_tokens, num_noise_spans)

        interleaved_span_lengths = np.reshape(
            np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
        )
        span_starts = np.cumsum(interleaved_span_lengths)[:-1]
        span_start_indicator = np.zeros((length,), dtype=np.int8)
        span_start_indicator[span_starts] = True
        span_num = np.cumsum(span_start_indicator)
        is_noise = np.equal(span_num % 2, 1)

        return is_noise[:orig_length]


def get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt):
    """Apply T5-style span corruption to `prompt`; returns (labels, input_ids) as NumPy arrays."""
    encoded = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt")
    batch_size = 1
    input_length = encoded.input_ids.shape[1]
    # noise_density=0.55, mean_noise_span_length=1.5
    denoiser = FlaxDataCollatorForT5MLM(tokenizer, 0.55, 1.5)
    mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for _ in range(batch_size)])
    labels_mask = ~mask_indices
    # sentinels replace the noise spans in the inputs and the kept spans in the labels
    input_ids_sentinel = denoiser.create_sentinel_ids(mask_indices.astype(np.int8))
    labels_sentinel = denoiser.create_sentinel_ids(labels_mask.astype(np.int8))
    input_ids = denoiser.filter_input_ids(encoded.input_ids, input_ids_sentinel)
    labels = denoiser.filter_input_ids(encoded.input_ids, labels_sentinel)
    return labels, input_ids



def print_token_id(tokenizer, token):
    """Print and return the id of the first token that `token` is encoded to."""
    encoded = tokenizer.encode(token)
    print(token, encoded[0])
    return encoded[0]

def print_special_tokens(tokenizer):
    # Special tokens and their ids
    special_tokens = {}
    for attr in tokenizer.special_tokens_map:
        special_tokens[attr] = tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map[attr])

    # Print special tokens
    print(special_tokens)
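The collator above works on arrays of token ids, which makes the idea hard to see. As a rough illustration only, the following sketch applies the same kind of span corruption to a list of words. The function `span_corrupt` and the hand-written `noise_mask` are hypothetical and are not part of the Hugging Face collator; the real one picks the mask at random and appends `</s>`.

```python
def span_corrupt(tokens, noise_mask):
    """Toy span corruption: returns (inputs, targets) as lists of strings."""
    inputs, targets = [], []
    sentinel = 0
    prev_noise = False
    for tok, is_noise in zip(tokens, noise_mask):
        if is_noise:
            if not prev_noise:
                # A new noise span starts: the same sentinel marks it on both sides.
                marker = f"<extra_id_{sentinel}>"
                inputs.append(marker)
                targets.append(marker)
                sentinel += 1
            targets.append(tok)  # masked tokens move to the target
        else:
            inputs.append(tok)  # kept tokens stay in the input
        prev_noise = is_noise
    return inputs, targets


tokens = "The cute dog walks in the green park".split()
# Hypothetical mask: True marks a token that is masked out.
noise_mask = [False, False, True, True, False, True, False, True]
inputs, targets = span_corrupt(tokens, noise_mask)
print(" ".join(inputs))
print(" ".join(targets))
```

Each run of masked words collapses into one sentinel in the input, and the target lists every sentinel followed by the words it replaced.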


In [None]:
def shift_tokens_right(input_ids, pad_token_id, eos_token_id):
    """
    Shift input ids one position to the right. The result is one token longer,
    starts with the pad token and ends with the eos token.
    Note that the eos token overwrites the last input token; it is not appended after it.
    """
    # Allocate a tensor one position longer than the input
    shifted_input_ids = torch.zeros((input_ids.shape[0], input_ids.shape[1] + 1), dtype=input_ids.dtype)

    # Copy input_ids one step to the right
    shifted_input_ids[:, 1:] = input_ids

    # Set the first token to the pad_token_id
    shifted_input_ids[:, 0] = pad_token_id

    # Overwrite the last position (the last input token) with the eos_token_id
    shifted_input_ids[:, -1] = eos_token_id

    return shifted_input_ids

In [None]:
arr = np.array([[1, 2, 3, 4, 5]])
arr = torch.tensor(arr)
print(shift_tokens_right(arr, 0, 6))

tensor([[0, 1, 2, 3, 4, 6]])


## Using Masked LM for Seq to Seq model

In [None]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 't5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)  # or T5Tokenizer
len_tokenizer = len(tokenizer)  # 32100; the sentinel ids are derived from this length
print(f"len_tokenizer={len_tokenizer}")

len_tokenizer=32100
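The collator derives sentinel ids by counting down from `len(tokenizer)`. A small sketch of that arithmetic, assuming the first masked span gets `len(tokenizer) - 1`, the second `len(tokenizer) - 2`, and so on; the helper `sentinel_id` is hypothetical and written only for this illustration:

```python
def sentinel_id(vocab_size, k):
    """Id the collator assigns to the k-th masked span (k starts at 0)."""
    return vocab_size - 1 - k


T5_VOCAB = 32100    # len(tokenizer) printed above for t5-base
GPT2_VOCAB = 50257  # assumed len(tokenizer) for gpt2 with no added tokens

print([sentinel_id(T5_VOCAB, k) for k in range(3)])
print([sentinel_id(GPT2_VOCAB, k) for k in range(3)])
```

For T5 these ids are the `<extra_id_0>`, `<extra_id_1>`, `<extra_id_2>` tokens. For GPT-2 the same arithmetic lands on ordinary vocabulary entries, starting with the `<|endoftext|>` id 50256.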




In [None]:
# What denoising training looks like, written out by hand
prompt = "The <extra_id_0> walks in <extra_id_1> park"
encoded_prompt = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt").input_ids
print(f"encoded_prompt ={encoded_prompt}")
labels = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
encoded_labels = tokenizer(labels, truncation=False, padding=False, return_tensors="pt").input_ids
print(f"encoded_labels ={encoded_labels}")
print(f"encoded_prompt.shape=encoded_labels.shape {encoded_prompt.shape} ={encoded_labels.shape}")

# The same thing, generated by the collator

print("\n"*2)

prompt = "The cute dog walks in the green park"
labels, input_ids = get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt)
print(f"denoised input_ids decoded = {tokenizer.decode(*input_ids,skip_special_tokens=False)}")
print(f"denoised labels decoded   = {tokenizer.decode(*labels,skip_special_tokens=False)}")
print(f"input_ids.shape {input_ids.shape} labels.shape {labels.shape}")  # the two lengths need not match for an encoder-decoder model


encoded_prompt =tensor([[   37, 32099, 10681,    16, 32098,  2447,     1]])
encoded_labels =tensor([[32099,  5295,  1782, 32098,     8, 32097,     1]])
encoded_prompt.shape=encoded_labels.shape torch.Size([1, 7]) =torch.Size([1, 7])



denoised input_ids decoded = The cute<extra_id_0> in<extra_id_1> green<extra_id_2></s>
denoised labels decoded   = <extra_id_0> dog walks<extra_id_1> the<extra_id_2> park</s></s>
input_ids.shape (1, 8) labels.shape (1, 9)
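The shapes (1, 8) and (1, 9) follow from the arithmetic in `random_spans_noise_mask`. A minimal sketch that redoes that arithmetic for the 9-token prompt (8 word pieces plus `</s>`) with the settings used in `get_denoised`. The helper `span_budget` is hypothetical, and it assumes Python's `round` agrees with `np.round` for these values:

```python
def span_budget(length, noise_density, mean_noise_span_length):
    """Token and span counts, following the arithmetic in random_spans_noise_mask."""
    num_noise = int(round(length * noise_density))
    num_nonnoise = length - num_noise
    # Keep at least one noise token and at least one non-noise token.
    num_noise = min(max(num_noise, 1), length - 1)
    num_spans = int(round(min(num_noise, num_nonnoise) / mean_noise_span_length))
    num_spans = max(num_spans, 1)
    return num_noise, num_nonnoise, num_spans


num_noise, num_nonnoise, num_spans = span_budget(9, 0.55, 1.5)
# Each span contributes one sentinel, and filter_input_ids appends one eos token.
input_len = num_nonnoise + num_spans + 1
label_len = num_noise + num_spans + 1
print(num_noise, num_nonnoise, num_spans, input_len, label_len)
```

With 5 noise tokens, 4 kept tokens and 3 spans, the input has 8 ids and the labels have 9, which matches the printed shapes.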


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # or T5ForConditionalGeneration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

In [None]:
prompt = "The cute dog walks in the green park"
labels, input_ids = get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt)
print(f"denoised input_ids decoded = {tokenizer.decode(*input_ids,skip_special_tokens=False)}")
print(f"denoised labels decoded   = {tokenizer.decode(*labels,skip_special_tokens=False)}")
print(f"input_ids.shape {input_ids.shape} labels.shape {labels.shape}")  # the two lengths need not match for an encoder-decoder model
denoised_input_ids = torch.from_numpy(input_ids)
denoised_labels = torch.from_numpy(labels)
denoised_attention_mask = torch.ones(input_ids.shape)

model.train()
# each "epoch" here is one optimizer step on the same single example
for epoch in range(100):
    outputs = model(input_ids=denoised_input_ids, attention_mask=denoised_attention_mask,
                    labels=denoised_labels)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")

denoised input_ids decoded = The<extra_id_0> dog walks<extra_id_1> the<extra_id_2></s>
denoised labels decoded   = <extra_id_0> cute<extra_id_1> in<extra_id_2> green park</s></s>
input_ids.shape (1, 8) labels.shape (1, 9)
Epoch 0  Loss 3.533181667327881
Epoch 20  Loss 1.1342705488204956
Epoch 40  Loss 0.5043283700942993
Epoch 60  Loss 0.11743383854627609
Epoch 80  Loss 0.38232171535491943
Epoch 99  Loss 0.01801321841776371


In [None]:
# After training
model.eval()
test_prompt = "The  <extra_id_0> dog  walks in the <extra_id_2>"
encoded = tokenizer(test_prompt, truncation=False, padding=False, return_tensors="pt")
test_output = model.generate(input_ids=encoded.input_ids, num_return_sequences=1, max_length=125)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

After Training:'The  <extra_id_0> dog  walks in the <extra_id_2>'-->'cute green park'


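## Masked LM for Generative Models

Let's try masked LM training on a decoder-only model.

GPT-2 is a causal language model, designed to predict the next token in a sequence given the previous tokens. It is usually trained with a standard language-modeling objective, without denoising. Its tokenizer also has no `<extra_id_N>` sentinel tokens, so the ids that the collator computes by counting down from `len(tokenizer)` land on ordinary vocabulary entries (`<|endoftext|>`, ` gazed` and ` informants` in the output below).

As the output below shows, this sort of training is not effective for generative models.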

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

In [None]:
prompt = "The cute dog walks in the green park"
labels, input_ids = get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt)
print(f"denoised input_ids decoded = {tokenizer.decode(*input_ids,skip_special_tokens=False)}")
print(f"denoised labels decoded   = {tokenizer.decode(*labels,skip_special_tokens=False)}")
print(f"input_ids.shape {input_ids.shape} labels.shape {labels.shape}")  # these should be equal for CausalLM models
denoised_input_ids = torch.from_numpy(input_ids)
denoised_labels = torch.from_numpy(labels)
denoised_attention_mask = torch.ones(input_ids.shape)

model.train()
for epoch in range(100):
    outputs = model(input_ids=denoised_input_ids, attention_mask=denoised_attention_mask,
                    labels=denoised_labels)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")

denoised input_ids decoded = The<|endoftext|> dog gazed the green informants<|endoftext|>
denoised labels decoded   = <|endoftext|> cute gazed walks in informants park<|endoftext|>
input_ids.shape (1, 8) labels.shape (1, 8)
Epoch 0  Loss 11.180167198181152
Epoch 20  Loss 0.0005788219859823585
Epoch 40  Loss 0.0019290262134745717
Epoch 60  Loss 0.000539803528226912
Epoch 80  Loss 5.3726726036984473e-05
Epoch 99  Loss 0.8452759385108948


In [None]:
# After training
model.eval()
test_prompt = "The  <extra_id_0> dog  walks in the <extra_id_2>"
encoded = tokenizer(test_prompt, truncation=False, padding=False, return_tensors="pt")
test_output = model.generate(input_ids=encoded.input_ids, num_return_sequences=1, max_length=125)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


After Training:'The  <extra_id_0> dog  walks in the <extra_id_2>'-->'The  <extra_id_0> dog  walks in the <extra_id_2> gazed in."'


## Causal LM training - teacher forced for Sequence to Sequence Models

Here we use teacher forcing for language modeling: the model predicts the next token from the previous ground-truth tokens.

To line each position up with its ground-truth token, we shift the label sequence, which is the ground truth, one position to the right, as shown below.

This type of training is best suited to causal LMs.
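As a rough illustration of teacher forcing, the sketch below lists what a decoder has seen at each step and which token it is asked to predict. The helper `teacher_forcing_steps` and the `<pad>` start token are assumptions made for this example; it works on plain strings and does not call any model.

```python
def teacher_forcing_steps(target, start_token="<pad>"):
    """For each step, return (ground-truth prefix fed to the decoder, token to predict)."""
    # The decoder input is the target shifted one position to the right.
    decoder_inputs = [start_token] + target[:-1]
    steps = []
    for i, gold in enumerate(target):
        steps.append((decoder_inputs[: i + 1], gold))
    return steps


target = ["The", "cute", "dog", "</s>"]
steps = teacher_forcing_steps(target)
for seen, gold in steps:
    print(seen, "->", gold)
```

At every step the prefix comes from the ground truth, never from the model's own earlier predictions, which is what distinguishes teacher forcing from free-running generation.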


In [None]:
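# Note: "<\s>" is not T5's eos token, which is "</s>" (id 1 in the map below);
# the id printed here is the first token id of the literal string.
print_token_id(tokenizer, "<\\s>")
print_special_tokens(tokenizer)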

3
{'eos_token': 1, 'unk_token': 2, 'pad_token': 0, 'additional_special_tokens': [32099, 32098, 32097, 32096, 32095, 32094, 32093, 32092, 32091, 32090, 32089, 32088, 32087, 32086, 32085, 32084, 32083, 32082, 32081, 32080, 32079, 32078, 32077, 32076, 32075, 32074, 32073, 32072, 32071, 32070, 32069, 32068, 32067, 32066, 32065, 32064, 32063, 32062, 32061, 32060, 32059, 32058, 32057, 32056, 32055, 32054, 32053, 32052, 32051, 32050, 32049, 32048, 32047, 32046, 32045, 32044, 32043, 32042, 32041, 32040, 32039, 32038, 32037, 32036, 32035, 32034, 32033, 32032, 32031, 32030, 32029, 32028, 32027, 32026, 32025, 32024, 32023, 32022, 32021, 32020, 32019, 32018, 32017, 32016, 32015, 32014, 32013, 32012, 32011, 32010, 32009, 32008, 32007, 32006, 32005, 32004, 32003, 32002, 32001, 32000]}


In [None]:
model_name ="t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokenizer = T5Tokenizer.from_pretrained(model_name)

In [None]:
test_prompt = "The cute dog walks in the green park"
encoded = tokenizer(test_prompt, truncation=False, padding=True, return_tensors="pt")
label_input_ids = shift_tokens_right(encoded.input_ids,model.config.pad_token_id,model.config.eos_token_id)

print(f"teacher_forced input_ids    = {(encoded.input_ids.squeeze())}")
print(f"teacher_forced input_ids decoded = {tokenizer.decode(encoded.input_ids.squeeze(),skip_special_tokens=False)}")# .squeeze() as it takes a batch
print(f"teacher_forced labels    = {(label_input_ids.squeeze())}")
print(f"teacher_forced labels decoded   = {tokenizer.decode(label_input_ids.squeeze(),skip_special_tokens=False)}")

teacher_forced input_ids    = tensor([   37,  5295,  1782, 10681,    16,     8,  1442,  2447,     1])
teacher_forced input_ids decoded = The cute dog walks in the green park</s>
teacher_forced labels    = tensor([    0,    37,  5295,  1782, 10681,    16,     8,  1442,  2447,     1])
teacher_forced labels decoded   = <pad> The cute dog walks in the green park</s>


In [None]:
model.train()
for epoch in range(200):
    outputs = model(input_ids=encoded.input_ids,attention_mask=encoded.attention_mask,
                    labels=label_input_ids)
    loss = outputs.loss
    if epoch % 40 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")
  #-------------------------------------------------------------

Epoch 0  Loss 2.868603229522705
Epoch 40  Loss 0.622085690498352
Epoch 80  Loss 0.21471920609474182
Epoch 120  Loss 0.19337992370128632
Epoch 160  Loss 0.2608181834220886
Epoch 199  Loss 0.4793122708797455


In [None]:
model.eval()
test_prompt = "The cute dog walks in the"
encoded = tokenizer(test_prompt, truncation=False, padding=True, return_tensors="pt")
test_output = model.generate(input_ids=encoded.input_ids, max_new_tokens=125)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

After Training:'The cute dog walks in the'-->'The cute dog walks in the green park'


## Causal LM training - teacher forced for Generative models

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(model.config.pad_token_id, model.config.eos_token_id)
# GPT-2 does not have a pad token set, so we add a new one;
# add_special_tokens also registers it as tokenizer.pad_token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
pad_token_id = print_token_id(tokenizer, "[PAD]")
# load the model; resizing the embeddings would make room for the new token
model = AutoModelForCausalLM.from_pretrained("gpt2")
#model.resize_token_embeddings(len(tokenizer))


None 50256
[PAD] 50257


In [None]:
test_prompt = "The cute dog walks in the green park"

encoded = tokenizer(test_prompt, truncation=False, padding=True, return_tensors="pt")
#from torch.nn.functional import pad
# Pad the input_ids tensor on the right (at the end)
#encoded.input_ids = pad(encoded.input_ids, (0, 1), value=tokenizer.pad_token_id)
labels = shift_tokens_right(encoded.input_ids,pad_token_id,tokenizer.eos_token_id)

print(f"teacher_forced input_ids decoded = {tokenizer.decode(encoded.input_ids.squeeze(),skip_special_tokens=False)}")# .squeeze() as it takes a batch
print(f"teacher_forced labels decoded   = {tokenizer.decode(labels.squeeze(),skip_special_tokens=False)}")

# we need the sizes to match

# append an eos token to input_ids so that their length matches the labels
new_token = torch.tensor([tokenizer.eos_token_id])
new_token = new_token.view(1, -1)
# Append the new token
encoded.input_ids = torch.cat((encoded.input_ids, new_token),dim=1)
print(f"teacher_forced input_ids decoded = {tokenizer.decode(encoded.input_ids.squeeze(),skip_special_tokens=False)}")#

assert encoded.input_ids.size() == labels.size()

attention_mask = torch.ones(encoded.input_ids.shape)
print(attention_mask.shape)

# Note: when training with the inputs and labels above, the loss does not decrease

teacher_forced input_ids decoded = The cute dog walks in the green park
teacher_forced labels decoded   = [PAD]The cute dog walks in the green<|endoftext|>
teacher_forced input_ids decoded = The cute dog walks in the green park<|endoftext|>
torch.Size([1, 9])


In [None]:
model.train()
for epoch in range(10):
    outputs = model(input_ids=encoded.input_ids,attention_mask=attention_mask,
                    labels=labels)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")


Epoch 0  Loss 7.023041248321533
Epoch 9  Loss 8.018503189086914


In [None]:
# Generative models need the input shape and the target shape to match exactly; the shift to the right happens automatically inside the model
# see https://discuss.huggingface.co/t/shifting-ids-to-the-right-when-training-gpt-2-on-text-generation/5308/2
# so the manual shifting in the cell above is not needed

# So let's try without shifting right, on the assumption that the model already takes care of it
# Note: this tokenizer has no pad token; see above for how to set one for the tokenizer and the model

from transformers import AutoModelForCausalLM,AutoTokenizer,GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2") #GPT2LMHeadModel
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
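To see why the labels can simply be the input ids, here is a sketch of the alignment a causal LM head applies, as described in the forum thread linked above: the prediction made at position i is scored against the label at position i+1. The helper `scored_pairs` is hypothetical and works on plain strings; it only mimics that alignment and does not call the model.

```python
def scored_pairs(input_ids, labels):
    """(input token at position i, label it is scored against) under the internal shift."""
    return list(zip(input_ids[:-1], labels[1:]))


tokens = ["The", "cute", "dog", "walks"]

# labels == input_ids: every token is scored against the next one.
next_token_pairs = scored_pairs(tokens, tokens)
print(next_token_pairs)

# labels shifted right by hand as well: every token is scored against itself.
shifted = ["[PAD]"] + tokens[:-1]
double_shift_pairs = scored_pairs(tokens, shifted)
print(double_shift_pairs)
```

Under this assumption, shifting the labels by hand before passing them in turns the objective into copying the current token, which is not next-token prediction.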


In [None]:
test_prompt = "The cute dog walks in the green park"
#test_prompt = ["The cute", "cute dog", "dog walks", "walks in" ,"in the", "the green", "green park"]
encoded = tokenizer(test_prompt, truncation=False, padding='longest', return_tensors="pt") #
label_input_ids = encoded.input_ids

model.train()
for epoch in range(50):
    outputs = model(input_ids=encoded.input_ids,attention_mask=encoded.attention_mask,
                    labels=label_input_ids)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")

Epoch 0  Loss 5.569321155548096
Epoch 20  Loss 1.2724746465682983
Epoch 40  Loss 0.00023454656184185296
Epoch 49  Loss 8.242394869739655e-06


In [None]:
model.eval()
test_prompt = "The"
encoded = tokenizer(test_prompt, truncation=False, padding=True, return_tensors="pt")
test_output = model.generate(encoded.input_ids,max_new_tokens=25)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")
test_prompt = "Where does the cute dog walk"
encoded = tokenizer(test_prompt, truncation=False, padding=True, return_tensors="pt")
test_output = model.generate(encoded.input_ids,max_new_tokens=25)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


After Training:'The'-->'The cute dog walks in the green park park park park park green park park park in the green park park park park in the green'
After Training:'Where does the cute dog walk'-->'Where does the cute dog walk in the green park park park park green park park park park in the green park park park park park in the green park park'


## Teacher Forcing for Tasks

The target here is not the input ids, but the labels.

The label is the ground truth (the actual next word or sequence).

During the forward pass, the model makes a prediction, and the difference between the prediction and this ground truth is used to compute the loss.

Here, however, the training can be used for arbitrary tasks such as translation or question answering.
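The "difference between the prediction and the ground truth" is typically a token-level cross-entropy. The following is a minimal sketch with made-up probabilities over a four-word vocabulary; the function `cross_entropy` and all numbers are illustrative assumptions, not values produced by T5.

```python
import math


def cross_entropy(step_probs, target):
    """Mean negative log-probability assigned to the ground-truth tokens."""
    losses = [-math.log(probs[tok]) for probs, tok in zip(step_probs, target)]
    return sum(losses) / len(losses)


target = ["In", "the", "park"]

# An untrained model that spreads its probability evenly over four words.
uniform = [{"In": 0.25, "the": 0.25, "park": 0.25, "dog": 0.25}] * 3

# A model that has learned to put most of its probability on the right word.
confident = [
    {"In": 0.97, "the": 0.01, "park": 0.01, "dog": 0.01},
    {"In": 0.01, "the": 0.97, "park": 0.01, "dog": 0.01},
    {"In": 0.01, "the": 0.01, "park": 0.97, "dog": 0.01},
]

loss_uniform = cross_entropy(uniform, target)
loss_confident = cross_entropy(confident, target)
print(loss_uniform, loss_confident)
```

The loss falls as the model puts more probability on the label tokens, which is the trend the training logs in this notebook show.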

In [None]:
model_name ="t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokenizer = T5Tokenizer.from_pretrained(model_name)


In [None]:
test_prompt = "Q:Where does the cute dog walk"
label_prompt = "A:In the green park"
encoded = tokenizer(test_prompt, truncation=False, padding=False, return_tensors="pt")
label = tokenizer(label_prompt, truncation=False, padding=False, return_tensors="pt")
model.train()
for epoch in range(50):
    outputs = model(input_ids=encoded.input_ids, attention_mask=encoded.attention_mask,
                    labels=label.input_ids)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch}  Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch}  Loss {loss}")



Epoch 0  Loss 4.423830986022949
Epoch 20  Loss 1.126334309577942
Epoch 40  Loss 0.24285714328289032
Epoch 49  Loss 0.27664294838905334


In [None]:
# After training
model.eval()
test_prompt = "Q:Where does the cute dog walk"
encoded = tokenizer(test_prompt, truncation=False, padding=False, return_tensors="pt")
test_output = model.generate(input_ids=encoded.input_ids, max_length=125)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

After Training:'Q:Where does the cute dog walk'-->'A:In the green park'
