ViLT as encoder and BART as decoder


In [None]:
import torch
from transformers import ViltProcessor, ViltModel
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput # Import BaseModelOutput from the correct submodule
from PIL import Image


Understanding BART

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

# Load BART model and tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

# Prepare input text
input_text = "This is a sample input text to test the BART encoder."
inputs = tokenizer(input_text, return_tensors="pt")

# Display tokenized input
print("Tokenized Input:", inputs)
print("Keys of Tokenized Input:", inputs.keys())


Tokenized Input: {'input_ids': tensor([[    0,   713,    16,    10,  7728,  8135,  2788,     7,  1296,     5,
         30634,  9689, 15362,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Keys of Tokenized Input: dict_keys(['input_ids', 'attention_mask'])


The tokenizer from the transformers library returns a BatchEncoding, a dictionary-like object containing the tokenized input. For BART its keys are input_ids and attention_mask:

    input_ids: This key holds the tensor of token IDs. These are the numerical representations of the tokens in the input text after being passed through the tokenizer's vocabulary.
    attention_mask: This key holds a tensor that indicates which tokens should be attended to by the model (1) and which should be ignored (0). It lets the model tell actual tokens from padding tokens. Here every entry is 1 because a single sentence needs no padding.
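
A single sentence never shows a 0 in the mask, so the following minimal sketch pads two token-ID lists by hand to show how the mask separates real tokens from padding. The helper pad_batch is hypothetical and only imitates what the tokenizer does when called with padding=True; the pad ID of 1 is BART's <pad> token, and the second, shorter ID list is made up for illustration.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad token-ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the tail is filled with the pad ID.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# First list: the opening tokens of the sample sentence; second list: illustrative.
batch = pad_batch([[0, 713, 16, 10, 7728, 2], [0, 713, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is extended with pad IDs, and its mask is 0 at exactly those positions.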

In [None]:
# Get encoder outputs
with torch.no_grad():
    encoder_outputs = model.model.encoder(**inputs)

# Display the encoder outputs structure
# print("Encoder Outputs:", encoder_outputs)
print("Keys of Encoder Outputs:", encoder_outputs.keys())
print("Last Hidden State:", encoder_outputs.last_hidden_state)
print("Shape of Last Hidden State:", encoder_outputs.last_hidden_state.shape)


Keys of Encoder Outputs: odict_keys(['last_hidden_state'])
Last Hidden State: tensor([[[-0.0328,  0.0104, -0.0033,  ...,  0.0111, -0.0028, -0.0040],
         [ 0.5018,  0.1893, -0.2240,  ...,  0.0146,  0.0865,  0.3661],
         [ 0.1537,  0.0491, -0.4342,  ...,  0.1374, -0.1584,  0.3290],
         ...,
         [-0.1970,  0.0455, -0.1976,  ...,  0.0687,  0.1580,  0.0917],
         [ 0.1074,  0.0920,  0.1550,  ...,  0.0297, -0.2720,  0.2010],
         [ 0.0758, -0.1095,  0.4273,  ...,  0.0900, -0.1925,  0.4005]]])
Shape of Last Hidden State: torch.Size([1, 15, 768])


The encoder in the BART model returns an instance of BaseModelOutput, which typically includes:

    last_hidden_state: A tensor of shape (batch_size, sequence_length, hidden_size) representing the hidden states of the last layer of the encoder for each token in the input sequence.
    hidden_states (optional): A tuple of tensors representing the hidden states of all layers of the encoder (if output_hidden_states=True is passed).
    attentions (optional): A tuple of tensors representing the attention weights (if output_attentions=True is passed).
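
To see an optional field appear, here is a small sketch that requests hidden_states from a BART encoder. It assumes a tiny, randomly initialised model (all sizes in the config are made up, and tiny_model is a hypothetical name) so that nothing has to be downloaded; the same call pattern should apply to facebook/bart-base.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# Tiny random BART; the sizes are illustrative assumptions, not bart-base values.
config = BartConfig(
    vocab_size=100, d_model=16, max_position_embeddings=64,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
tiny_model = BartForConditionalGeneration(config).eval()

input_ids = torch.tensor([[0, 11, 12, 13, 2]])  # <s> ... </s>
with torch.no_grad():
    enc = tiny_model.get_encoder()(input_ids=input_ids, output_hidden_states=True)

# Only the fields that were actually computed show up as keys.
print(list(enc.keys()))
print(enc.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
# One entry for the embedding output plus one per encoder layer.
print(len(enc.hidden_states))
```

With two encoder layers, hidden_states holds three tensors, and the last of them is the same as last_hidden_state.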

In [None]:
# Prepare decoder input
decoder_input_ids = tokenizer("<s>", return_tensors="pt").input_ids

# Decode the encoded text
with torch.no_grad():
    outputs = model.generate(
        input_ids=decoder_input_ids,
        encoder_outputs=encoder_outputs,
        max_length=50,
        num_beams=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        decoder_start_token_id=tokenizer.bos_token_id
    )

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text:", generated_text)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Text: This is a sample input text to test the BART encoder.


model.generate is meant for inference only: it decodes autoregressively with gradient tracking disabled, so its output cannot be used to compute a training loss.

During training, call the model directly (model(...), which invokes model.forward) with input_ids and the labels for the target sequence. If decoder_input_ids is not supplied, BART derives it by shifting labels one position to the right. The forward pass then returns the loss, which can be backpropagated to update the weights.
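
A single training step might look like the sketch below. It again assumes a tiny, randomly initialised BART (made-up sizes, hypothetical name tiny_model) and arbitrary token IDs, so it only illustrates the call pattern, not a real fine-tuning setup.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# Tiny random BART; the sizes are illustrative assumptions, not bart-base values.
config = BartConfig(
    vocab_size=100, d_model=16, max_position_embeddings=64,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
tiny_model = BartForConditionalGeneration(config).train()
optimizer = torch.optim.AdamW(tiny_model.parameters(), lr=1e-3)

input_ids = torch.tensor([[0, 11, 12, 13, 2]])  # source sequence (arbitrary IDs)
labels = torch.tensor([[0, 21, 22, 2]])         # target sequence (arbitrary IDs)

# Passing labels makes the model build decoder_input_ids itself and return a loss.
outputs = tiny_model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

print("loss:", outputs.loss.item())
print("logits shape:", outputs.logits.shape)  # (batch, target_length, vocab_size)
```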

Ultimate Test

In [None]:
# Load ViLT model and processor
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
vilt_model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

# Load and preprocess image
image_path = "/content/zozo (2).jpg"
image = Image.open(image_path).convert("RGB")

# Prepare text
text = "what is in the picture?"

# Process image and text
inputs = vilt_processor(image, text, return_tensors="pt")

# Get ViLT embeddings
with torch.no_grad():
    encoder_outputs = vilt_model(**inputs, output_hidden_states=True, output_attentions=True)


# Display the vilt_embeddings structure
print("Encoder Outputs:", encoder_outputs)
print("Keys of Encoder Outputs:", encoder_outputs.keys())
# print("Shape of Encoder Outputs:", encoder_outputs.shape)
# print("Last Hidden State:", encoder_outputs.last_hidden_state)
print("Shape of Last Hidden State:", encoder_outputs.last_hidden_state.shape)




Encoder Outputs: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 1.5182e-01,  1.9085e-01, -9.8345e-02,  ..., -3.2899e-01,
           5.3501e-02, -1.7396e-01],
         [-4.1584e-02,  4.8902e-01, -1.9124e-01,  ..., -2.0771e-01,
           3.7670e-03,  2.6071e-02],
         [-2.0652e-02, -2.8288e-03, -2.5246e-01,  ..., -3.2357e-01,
          -2.0183e-01,  1.1570e-02],
         ...,
         [ 2.2303e-01,  3.8300e-01, -3.4090e-01,  ...,  2.5716e-04,
          -4.3134e-01, -3.0933e-01],
         [ 1.9196e-01,  3.6919e-01, -3.6550e-01,  ..., -1.9023e-01,
           3.6328e-01, -4.2724e-01],
         [ 6.8233e-02,  3.6576e-01, -4.3577e-01,  ..., -3.7716e-01,
           3.3243e-01, -4.9777e-01]]]), pooler_output=tensor([[ 4.4882e-01,  3.1593e-02, -4.4512e-02, -1.3824e-01,  1.1775e-01,
          2.8137e-02,  1.0715e-02,  3.9919e-02, -8.8631e-01, -6.6708e-02,
          4.0552e-02, -5.1927e-02, -2.3030e-02,  7.6173e-01,  6.4507e-01,
          2.5455e-01, -2.0904e-01, -7.2270e-01,  8.7653

ViLT returns an output object whose keys include last_hidden_state and hidden_states, the same structure the BART decoder expects from its own encoder, so it can be handed to the BART-base model directly. We chose BART-base because its encoder hidden size (d_model = 768) matches ViLT's hidden_size of 768.

In [None]:
# ViLT model summary

vilt_model.config


ViltConfig {
  "_name_or_path": "dandelin/vilt-b32-mlm",
  "architectures": [
    "ViltForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_image_length": -1,
  "max_position_embeddings": 40,
  "modality_type_vocab_size": 2,
  "model_type": "vilt",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "num_images": -1,
  "patch_size": 32,
  "qkv_bias": true,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "vocab_size": 30522
}

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

# Load BART model and tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')



In [None]:
# Prepare decoder input
decoder_input_ids = tokenizer("<s>", return_tensors="pt").input_ids

# Decode the encoded text
with torch.no_grad():
    outputs = model.generate(
        input_ids=decoder_input_ids,
        encoder_outputs=encoder_outputs,
        max_length=10,
        num_beams=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        decoder_start_token_id=tokenizer.bos_token_id
    )

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text:", generated_text)


Generated Text: �Â\'-\'\'\'


In [None]:
model.config

BartConfig {
  "_name_or_path": "facebook/bart-base",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_ty

The generated text is meaningless because the BART decoder has not yet been trained on ViLT's embeddings.

KeyError: 0 Issue

KeyError: 0 is raised when code looks up the integer 0 in a mapping that has no such key, for example a plain dict. Here it indicates that encoder_outputs is structured differently from what BART's generate method expects. generate expects encoder_outputs to be a structured object such as BaseModelOutput, which provides named attributes like last_hidden_state, hidden_states, and attentions and also supports integer indexing.
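
The difference can be reproduced in isolation. The sketch below assumes a tiny, randomly initialised BART (made-up sizes, hypothetical name tiny_model) and a random tensor standing in for another encoder's hidden states; it is one possible way to wrap such a tensor, not necessarily the fix used in the original notebook.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

# Tiny random BART; the sizes are illustrative assumptions, not bart-base values.
config = BartConfig(
    vocab_size=100, d_model=16, max_position_embeddings=64,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
tiny_model = BartForConditionalGeneration(config).eval()

# Stand-in for an external encoder's output: (batch, sequence_length, d_model).
hidden = torch.randn(1, 7, config.d_model)
decoder_input_ids = torch.tensor([[2, 0]])  # </s> <s>, BART's usual decoder prefix

# A plain dict has no key 0, so integer indexing fails.
as_dict = {"last_hidden_state": hidden}
try:
    as_dict[0]
except KeyError as err:
    print("KeyError:", err)

# BaseModelOutput supports both attribute access and integer indexing.
wrapped = BaseModelOutput(last_hidden_state=hidden)
print(torch.equal(wrapped[0], wrapped.last_hidden_state))

with torch.no_grad():
    out = tiny_model(encoder_outputs=wrapped, decoder_input_ids=decoder_input_ids)
print(out.logits.shape)  # (batch, decoder_length, vocab_size)
```

Wrapping only last_hidden_state is enough for the decoder's cross-attention; the hidden size of the wrapped tensor has to equal the decoder's d_model.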