# Batching Input Sequences Together
## Model Expects Batch Inputs
Convert list of numbers to a tensor & send it to model:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

Result: ```IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)```
### Why did it fail?
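* Sent a single sequence to the model
    * HFT models expect multiple sentences (a batch) by default
    * Calling the tokenizer directly (w/ `return_tensors`) adds a batch dimension on top; the manual `tokenize` + `convert_tokens_to_ids` steps above don't, so `input_ids` is 1-D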
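
To make the missing dimension concrete, here is a minimal sketch using plain nested Python lists instead of tensors, so it runs without torch. The helper `nested_shape` is hypothetical and exists only for illustration; the token IDs are the ones printed further down in these notes.

```python
def nested_shape(x):
    """Return the shape of a rectangular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0] if x else None  # descend into the first element, stop on empty lists
    return tuple(shape)

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

single = ids     # what the failing call passed: (sequence_length,)
batched = [ids]  # what the model expects: (batch_size, sequence_length)

print(nested_shape(single))   # (14,)
print(nested_shape(batched))  # (1, 14)
```

Wrapping the list in one more pair of brackets is all that `torch.tensor([ids])` changes compared with `torch.tensor(ids)`.
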
### Trying again w/ new dimension:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

* Result is input IDs & resulting logits:
```bash
Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276,  2.8789]]
```
* Batching is the act of sending multiple sentences through the model at once. W/ only 1 sentence, you can still build a batch from a single sequence:

In [None]:
batched_ids = [ids, ids]

* Creates a batch of two identical sequences
* Batching lets the model work when you feed it multiple sequences
* Issue: When batching 2+ sentences, they may be diff lengths
    * Tensors require rectangular shapes, preventing conversion of a list of input IDs to a tensor directly
    * Solution: ```pad``` input
* Can't be converted to tensor:
```python
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
```
* To work around this, padding makes the shape rectangular: it ensures all sentences have the same length by adding a special word called the ***padding token*** to the sentences w/ fewer values.
    * E.g. if you have 10 sentences w/ 10 words & 1 w/ 20, padding ensures all have 20 words.
```python
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
```
* Padding token ID can be found in ```tokenizer.pad_token_id```
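
The padding step above can be sketched as a small helper. This is only an illustration on plain lists: `pad_batch` is a hypothetical name, it pads on the right, and `padding_id = 100` is the placeholder value from these notes rather than a real tokenizer's pad ID.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)  # assumes at least one sequence
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

ragged = [
    [200, 200, 200],
    [200, 200],
]

padding_id = 100  # placeholder; real code would read tokenizer.pad_token_id
padded = pad_batch(ragged, padding_id)
print(padded)  # [[200, 200, 200], [200, 200, 100]]
```

Once every row has the same length, the nested list is rectangular and can be converted to a tensor.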

In [4]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

Returns:
```bash
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)
```
* Problem: logits for the second row of the batch should match the logits for `sequence2_ids` alone, but the values differ (`[1.3374, -1.2163]` vs `[0.5803, -0.4125]`)
    * This is bc attention layers contextualize each token
        * They take the padding tokens into account, since they attend to all tokens in a sequence
    * Solution: to get the same result whether passing indiv. sentences of diff lengths through the model or passing a batch w/ the same sentences & padding applied, tell the attention layers to ignore padding tokens using an **attention mask**
### Attention masks
* Tensors w/ exact same shape as input IDs tensor, filled w/ 0s & 1s.
    * 1s indicate corresponding tokens should be attended to
    * 0s indicate corresponding tokens shouldn't be attended to (i.e., they are ignored by the attention layers)
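
One way to build such a mask is to derive it from the original sequence lengths while padding. The sketch below uses plain lists; `pad_with_mask` is a hypothetical helper and `pad_id=0` is an assumed placeholder value, not something read from a tokenizer.

```python
def pad_with_mask(sequences, pad_id):
    """Right-pad sequences and return (input_ids, attention_mask) as nested lists."""
    max_len = max(len(seq) for seq in sequences)  # assumes at least one sequence
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for every real token, 0 for every padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

input_ids, attention_mask = pad_with_mask([[200, 200, 200], [200, 200]], pad_id=0)
print(input_ids)       # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Deriving the mask from lengths, rather than from comparing IDs against the pad ID, avoids masking a real token that happens to share the pad ID.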

In [5]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

Result:
```bash
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
```
* Now we get the same logits for the second sentence in the batch, bc the last value of the second sequence is a padding ID w/ a 0 value in the attention mask
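
Why masking restores the unpadded result can be illustrated with a toy softmax over attention scores. This is a simplified sketch, not DistilBERT's actual implementation: the scores are made up, and `masked_softmax` is a hypothetical helper that assumes at least one unmasked position.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0."""
    # Masked positions are set to -inf, so exp(-inf) contributes exactly 0.0.
    kept = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in kept]
    total = sum(exps)
    return [e / total for e in exps]

real_scores = [2.0, 1.0]         # toy scores for two real tokens
padded_scores = [2.0, 1.0, 3.0]  # same scores plus one padding position

unpadded = masked_softmax(real_scores, [1, 1])
without_mask = masked_softmax(padded_scores, [1, 1, 1])
with_mask = masked_softmax(padded_scores, [1, 1, 0])

print(unpadded)      # weights for the sequence on its own
print(without_mask)  # padding position takes weight away from real tokens
print(with_mask)     # real-token weights match `unpadded`; padding gets 0.0
```

W/o the mask, the padding position absorbs part of the attention weight, which is why the batched logits drifted; w/ the mask, the real tokens are weighted exactly as in the unpadded case.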

# Longer sequences
* HFT models have a limit on sequence length. Most handle sequences of up to 512 or 1024 tokens & will crash if given longer ones. Two solutions:
    * Use a model w/ longer supported sequence length
        * Models have different supported sequence lengths; some specialize in handling v long sequences.
            * Longformer & LED are two examples worth looking into.
    * Truncate sequences
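
Truncation can be as simple as slicing the list of token IDs. The sketch below is a minimal illustration: `truncate` is a hypothetical helper, `max_sequence_length = 512` is an assumed limit (the real value depends on the model), and the input is a stand-in list rather than real token IDs.

```python
def truncate(ids, max_length):
    """Keep at most max_length token IDs, dropping everything after that."""
    return ids[:max_length]

max_sequence_length = 512     # assumed limit; check the model's config for the real value
long_ids = list(range(600))   # stand-in for a long tokenized sequence

short_ids = truncate(long_ids, max_sequence_length)
print(len(short_ids))  # 512
```

A plain slice like this ignores any special tokens the model expects at the end of a sequence. Calling the tokenizer w/ `truncation=True` and a `max_length` is the usual way to let the library handle that.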
