# Handling Multiple Sequences

In [1]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [2]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [3]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [7]:
sentences = [
    'Hello There',
    'May the Force be with you'
]

In [8]:
tokens = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence)) for sentence in sentences]
print(tokens)

[[7592, 2045], [2089, 1996, 2486, 2022, 2007, 2017]]


Next, convert these token ids to a tensor.

In [10]:
tensor = tf.constant(tokens)

ValueError: Can't convert non-rectangular Python sequence to Tensor.

`tokens` is not a rectangular array: the two lists have different lengths (2 and 6), so it can't be converted to a tensor.
To deal with this, the shorter sequences are padded up to the length of the longest one.

In [11]:
pad_id = 0  # id of the [PAD] token in this vocabulary (tokenizer.pad_token_id)

In [12]:
# find the maximum sequence length

max_len = max(len(token) for token in tokens)
print(max_len)

6


In [13]:
# pad every sequence on the right up to max_len
for i in range(len(tokens)):
  padding = max_len - len(tokens[i])
  tokens[i].extend([pad_id] * padding)
print(tokens)

[[7592, 2045, 0, 0, 0, 0], [2089, 1996, 2486, 2022, 2007, 2017]]


In [15]:
# now the padded lists can be converted to a tensor
tensor = tf.constant(tokens)
print(tensor)

tf.Tensor(
[[7592 2045    0    0    0    0]
 [2089 1996 2486 2022 2007 2017]], shape=(2, 6), dtype=int32)


Since the padding ids must be ignored by the model, an attention mask is created alongside the input ids. The mask has value 1 for ids that should be attended to and 0 for padding.

In [16]:
attention_masks = []

for token in tokens:
  attention_mask = []
  for token_id in token:
    # 0 for padding positions, 1 for real tokens
    if token_id == pad_id:
      attention_mask.append(0)
    else:
      attention_mask.append(1)
  attention_masks.append(attention_mask)
print(attention_masks)

[[1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
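
The two steps above (padding, then masking) can also be done in one pass. The sketch below uses a hypothetical helper, `pad_and_mask`, which is not part of `transformers`. It derives the mask from each sequence's original length instead of comparing ids against `pad_id`, so it assumes the input lists are still unpadded.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # real ids first, then padding
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 for every real id, 0 for every padding position
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# the unpadded ids printed earlier for the two example sentences
raw_ids = [[7592, 2045], [2089, 1996, 2486, 2022, 2007, 2017]]
padded_ids, masks = pad_and_mask(raw_ids)
print(padded_ids)
print(masks)
```

Building the mask from lengths keeps it correct even if a real token happened to share its id with the padding id.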


In [17]:
# pass the padded ids and the attention mask to the model
outputs = model(tf.constant(tokens), attention_mask=tf.constant(attention_masks))
print(outputs.logits)

tf.Tensor(
[[ 3.1864295 -2.6999588]
 [-2.9819214  3.1728363]], shape=(2, 2), dtype=float32)
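
The logits above are unnormalized scores, one per class. A minimal sketch of turning them into probabilities with a softmax, written in plain Python on the values printed above so that it runs without the model. With the model loaded, `tf.nn.softmax(outputs.logits, axis=-1)` should give the same result, and `model.config.id2label` maps each class index to its label name.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# logits copied from the model output above
logits = [[3.1864295, -2.6999588], [-2.9819214, 3.1728363]]
probs = [softmax(row) for row in logits]
# index of the highest-probability class for each sentence
predicted = [row.index(max(row)) for row in probs]
print(probs)
print(predicted)
```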


In [20]:
# Calling the tokenizer directly returns both input_ids and attention_mask,
# and adds the special tokens (101 and 102 below).
# Padding is not applied unless requested, e.g. with padding=True.

tokens = tokenizer(sentences)
print('input_ids : ', tokens['input_ids'])
print('attention_mask : ', tokens['attention_mask'])

input_ids :  [[101, 7592, 2045, 102], [101, 2089, 1996, 2486, 2022, 2007, 2017, 102]]
attention_mask :  [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]
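
Compared with the ids built by hand earlier, the tokenizer output has two extra ids per sentence: 101 at the start and 102 at the end, the `[CLS]` and `[SEP]` special tokens of this vocabulary. A small sketch that reproduces this by hand; `add_special_tokens` is a hypothetical helper, and the two ids are taken from the output above.

```python
CLS_ID = 101  # first id in every tokenizer output above
SEP_ID = 102  # last id in every tokenizer output above


def add_special_tokens(ids):
    """Wrap one list of token ids with the start and end special tokens."""
    return [CLS_ID] + list(ids) + [SEP_ID]


# the ids produced earlier by tokenize + convert_tokens_to_ids
raw_ids = [[7592, 2045], [2089, 1996, 2486, 2022, 2007, 2017]]
with_special = [add_special_tokens(ids) for ids in raw_ids]
print(with_special)
```

With the tokenizer loaded, `tokenizer.cls_token_id` and `tokenizer.sep_token_id` give these ids without hard-coding them.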
