**Question 1:**


The program provided is a custom collate function that prepares a batch of token-ID sequences for a machine learning model. The function takes a list of sequences, appends an end-of-text token to each one, and pads them all to a common length based on the longest sequence in the batch. It then returns two tensors: an input tensor and a target tensor. The input tensor is the padded sequence with the last element removed, and the target tensor is the padded sequence with the first element removed.

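First, we find the length of the longest sequence in the batch, plus one to leave room for the end-of-text token that is appended to every sequence. Then, for each sequence in the batch, we append that token and pad the result with the pad token ID (50256) up to this length.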

In [None]:
import torch

def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):    
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )

Next, we separate each padded sequence into an input tensor and a target tensor. For the input tensor we remove the last token of the sequence, and for the target tensor we remove the first token, so the target at each position is the token that follows the corresponding input token.

In [None]:
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets
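
To see the one-position shift in isolation, here is a minimal sketch on a single, already padded sequence. The token values are made up for illustration and are not taken from the assignment; only slicing and `torch.tensor` are used.

```python
import torch

pad_token_id = 50256  # same pad / end-of-text ID as in the function above

# Hypothetical padded sequence: two real tokens, then end-of-text and padding
padded = [5, 6, pad_token_id, pad_token_id, pad_token_id]

inputs = torch.tensor(padded[:-1])   # drop the last token
targets = torch.tensor(padded[1:])   # drop the first token

# targets[i] is the token that follows inputs[i] in the padded sequence
print(inputs.tolist())   # [5, 6, 50256, 50256]
print(targets.tolist())  # [6, 50256, 50256, 50256]
```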

Next, we replace every pad token ID in the target tensor except the first occurrence with the ignore index (-100). This excludes the extra padding tokens from the training loss, while the first one is kept as the end-of-text target. (If a maximum allowed sequence length is given, we also truncate the inputs and targets to that length.)

In [None]:
        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)
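
The masking step can be tried on its own. The sketch below applies the same lines to one hypothetical target row (the values are chosen for illustration); it is meant as a standalone check of the idea, not as part of the function.

```python
import torch

pad_token_id = 50256
ignore_index = -100

# Hypothetical target row: one real token followed by four pad tokens
targets = torch.tensor([6, pad_token_id, pad_token_id, pad_token_id, pad_token_id])

mask = targets == pad_token_id            # True wherever a pad token sits
indices = torch.nonzero(mask).squeeze()   # positions of all pad tokens
if indices.numel() > 1:
    # Keep the first pad token (it acts as end-of-text), mask the rest
    targets[indices[1:]] = ignore_index

print(targets.tolist())  # [6, 50256, -100, -100, -100]
```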

Then, since the inputs and targets have so far been collected in Python lists, we stack each list into a single PyTorch tensor and send both tensors to the target device (which in this program is the CPU).

In [None]:
    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor

This function prepares the data that the model is trained on. For the sample batch below, the output is as follows:

In [None]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

In [None]:
Input Tensor --> tensor([[    0,     1,     2,     3,     4],
                         [    5,     6, 50256, 50256, 50256],
                         [    7,     8,     9, 50256, 50256]])

                         
Targets Tensor --> tensor([[    1,     2,     3,     4, 50256],
                           [    6, 50256,  -100,  -100,  -100],
                           [    8,     9, 50256,  -100,  -100]])

**Question 2:**

Program C has the same output as Program A because of the target values it uses. Program C passes the values *[0, 1, -100]* to its *torch.tensor()* call.

In PyTorch, -100 is the default value of the *ignore_index* argument of the cross-entropy loss, and targets equal to that value are skipped. This means that even though *logits_2* has 3 training examples, the third training example is ignored because in the line *targets_3 = torch.tensor([0, 1, -100])*, the third value is -100.

Therefore, when *loss_3* is calculated as *torch.nn.functional.cross_entropy(logits_2, targets_3)*, only the first two training examples contribute, and the loss is averaged over those two alone. Since they are identical to Program A's training examples, the result is the same loss.
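
This behavior can be checked with a small sketch. The logit values and the names `logits_1`, `targets_1`, and `loss_1` are assumptions that stand in for Program A, whose code is not reproduced here; only `logits_2`, `targets_3`, and `loss_3` are named in the answer above.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 2-class problem (values chosen for illustration)
logits_1 = torch.tensor([[-1.0, 1.0], [-0.5, 1.5]])               # stand-in for Program A
logits_2 = torch.tensor([[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]])  # one extra example

targets_1 = torch.tensor([0, 1])
targets_3 = torch.tensor([0, 1, -100])  # third target equals the default ignore_index

loss_1 = F.cross_entropy(logits_1, targets_1)
loss_3 = F.cross_entropy(logits_2, targets_3)  # third example is skipped

print(torch.allclose(loss_1, loss_3))  # True
```

With the default mean reduction, the average is taken over the non-ignored targets only, which is why the extra row does not change the value.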