
blogs/the-illustrated-image-captioning-using-transformers/ #2

Open
utterances-bot opened this issue Dec 19, 2022 · 36 comments

@utterances-bot

The Illustrated Image Captioning using transformers - Ankur NLP Enthusiast

The Illustrated Image Captioning using transformers

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Owner

Ankur3107 commented Dec 19, 2022

To train on a large dataset, you can use a PyTorch Dataset as a lazy data iterator:

import torch
from PIL import Image

# `tokenizer`, `feature_extractor`, `model`, `training_args`, `compute_metrics`
# and `default_data_collator` are the objects defined earlier in the blog post.
class ImageCaptioningDataset(torch.utils.data.Dataset):
    def __init__(self, ds, ds_type, max_target_length):
        self.ds = ds
        self.max_target_length = max_target_length
        self.ds_type = ds_type

    def __getitem__(self, idx):
        image_path = self.ds[self.ds_type]['image_path'][idx]
        caption = self.ds[self.ds_type]['caption'][idx]
        model_inputs = dict()
        model_inputs['labels'] = self.tokenization_fn(caption, self.max_target_length)
        model_inputs['pixel_values'] = self.feature_extraction_fn(image_path)
        return model_inputs

    def __len__(self):
        return len(self.ds[self.ds_type])
    
    # text preprocessing step
    def tokenization_fn(self, caption, max_target_length):
        """Run tokenization on caption."""
        labels = tokenizer(caption,
                           padding="max_length",
                           truncation=True,  # keep labels at a fixed length
                           max_length=max_target_length).input_ids

        return labels
    
    # image preprocessing step
    def feature_extraction_fn(self, image_path):
        """
        Run feature extraction on images
        If `check_image` is `True`, the examples that fails during `Image.open()` will be caught and discarded.
        Otherwise, an exception will be thrown.
        """
        image = Image.open(image_path)

        encoder_inputs = feature_extractor(images=image, return_tensors="np")

        return encoder_inputs.pixel_values[0]


train_ds = ImageCaptioningDataset(ds, 'train', 64)
eval_ds = ImageCaptioningDataset(ds, 'validation', 64)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=feature_extractor,  # passed so the image processor is saved with the model
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=default_data_collator,
)
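
For reference, this snippet assumes the globals set up earlier in the blog post. A rough sketch of what they might look like (checkpoint names and argument values are illustrative, and compute_metrics is omitted):

from transformers import (AutoTokenizer, ViTFeatureExtractor,
                          VisionEncoderDecoderModel, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, default_data_collator)

# ViT encoder + GPT-2 decoder, as in the post
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

training_args = Seq2SeqTrainingArguments(
    output_dir="./image-captioning-output",
    per_device_train_batch_size=8,
    predict_with_generate=True,
)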


yesidc commented Jan 14, 2023

Hi, thanks for the tutorial!
I am trying to run your code, but it throws an error that maybe you could help me with. The source of the error seems to be the image feature extractor: it fails when normalizing the images (image_utils.py, line 237). The error occurs when processing image_id 573223, which strangely has shape (224, 224), while the rest of the images processed up to that point have shape (3, 224, 224).
Error:

ValueError: operands could not be broadcast together with shapes (224,224) (3,) 

Thanks in advance!

@yesidc
I ran the code in Colab without any problem.


yesidc commented Jan 19, 2023

I finally managed to run it. For one thing, the preprocessing function provided by the transformers API throws an error when processing black-and-white images (it always expects 3-channel images). I had to override this function, and now the code works (I am using transformers 4.25.1).
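
A lightweight way to handle this (a sketch, not necessarily the exact override used above) is to force every image to 3 channels before handing it to the feature extractor:

    # drop-in replacement for feature_extraction_fn in the dataset class above
    def feature_extraction_fn(self, image_path):
        """Open the image, force RGB, and return its pixel values."""
        # grayscale / RGBA images become (3, H, W) after convert("RGB"),
        # so normalization no longer breaks on single-channel inputs
        image = Image.open(image_path).convert("RGB")
        encoder_inputs = feature_extractor(images=image, return_tensors="np")
        return encoder_inputs.pixel_values[0]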


Unrecognized feature extractor in /content/image-captioning-output. Should have a feature_extractor_type key in its preprocessor_config.json of config.json, or one of the following model_type keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, swin, swinv2, table-transformer, timesformer, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos

I am getting this error. Why is that?

The error occurs in the inference stage, when I am trying to load the pipeline.


Hi Ankur, what if we want multiple captions of the same image?


newbietuan commented Feb 25, 2023

Hi Ankur, I want to do something between the encoder and the decoder, so I define the model as follows:

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, ViTModel

class caption_model(nn.Module):
    def __init__(self, args):
        super(caption_model, self).__init__()
        self.args = args
        self.gpt2_type = self.args.gpt2_type
        self.config = GPT2Config.from_pretrained('./gpt/' + self.gpt2_type)
        self.config.add_cross_attention = True
        # self.config.is_decoder = True
        self.config.is_encoder_decoder = True
        self.encoder = ViTModel.from_pretrained('./vit', local_files_only=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('./gpt/' + self.gpt2_type, config=self.config)

    def forward(self, pixel_values, input_ids):
        image_feat = self.encoder(pixel_values)
        encoder_outputs = image_feat.last_hidden_state
        # encoder_outputs = do something
        output = self.decoder(input_ids=input_ids, encoder_hidden_states=encoder_outputs)
        return output.logits

However, I run into trouble at the inference stage. It seems I should set is_encoder_decoder = True to use BeamSearchEncoderDecoderOutput from generation_utils.py, but then I get "torch.nn.modules.module.ModuleAttributeError: 'GPT2LMHeadModel' object has no attribute 'get_encoder'".
Indeed, VisionEncoderDecoderModel implements ViT-GPT2 for image captioning, but it is integrated, so I cannot do something between the encoder and the decoder; and when I take it apart, I cannot complete the beam_search stage. It seems impossible to rewrite beam_search. Do you have any suggestions, or how should I set the parameters to call generate() directly?
Thank you very much.

Owner

@newbietuan I think you should ask this at https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try; if I find anything, I will update you here.
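
For what it is worth, one pattern that may avoid rewriting beam_search entirely (a rough sketch, version dependent and untested here): keep a plain VisionEncoderDecoderModel, run its vision encoder yourself, modify the hidden states, and pass them to generate() as precomputed encoder_outputs so the built-in beam search takes over from there.

import torch
from transformers import VisionEncoderDecoderModel
from transformers.modeling_outputs import BaseModelOutput

# any ViT-GPT2 VisionEncoderDecoderModel checkpoint works the same way
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
model.eval()

def generate_with_modified_encoder(pixel_values, tokenizer):
    with torch.no_grad():
        # 1. run the vision encoder on its own
        encoder_hidden = model.encoder(pixel_values=pixel_values).last_hidden_state
        # 2. "do something" with the encoder states here
        modified = encoder_hidden  # placeholder for a custom transformation
        # 3. wrap the states so generate() skips its own encoder pass
        encoder_outputs = BaseModelOutput(last_hidden_state=modified)
        output_ids = model.generate(
            encoder_outputs=encoder_outputs,
            num_beams=3,
            max_length=64,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)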

Owner

@Aaryan562
There might be a transformers version issue.

Owner

@DeependraParichha1004

You may have to use a combination of num_return_sequences, num_beams, penalty_alpha, top_k, top_p, etc.

You can refer to:

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

generate_kwargs = {
    "num_return_sequences": 3,
    "num_beams": 3,
}
image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png", generate_kwargs=generate_kwargs)

@newbietuan

@newbietuan I think you should ask this at https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try; if I find anything, I will update you here.

Thank you very much!

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Are you sure that if I copy-pasted each and every line it would not give me any errors, or are there some changes?

@Aaryan562

@Aaryan562 There might be a transformers version issue.

I also checked the config.json and it had the model_type key = 'vit' in it, but it still gives the ValueError.

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

@DeependraParichha1004

Got it, @Ankur3107. Thank you for the explanation.

@TheTahaaa

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

Hi, I also have this issue. Have you found a solution?

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

Hi, I also have this issue. Have you found a solution?

No, I have not. Are you also getting the error in the inference stage?

@AnhaarHussain

@Aaryan562 There might be a transformers version issue.

So, which version of transformers should we use?


How can I load a custom local dataset using load_dataset()? I have downloaded the Flickr30k dataset, which has images and captions in separate folders.
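
One way to do this (a sketch under the assumption that the captions live in a CSV mapping image names to captions; the file names and column names below are hypothetical, so adjust them to your layout):

import os
import pandas as pd
from datasets import Dataset, DatasetDict

# hypothetical layout: ./flickr30k/images/*.jpg and a captions.csv
# with columns "image_name" and "caption"
captions = pd.read_csv("./flickr30k/captions.csv")
captions["image_path"] = captions["image_name"].apply(
    lambda name: os.path.join("./flickr30k/images", name)
)

full = Dataset.from_pandas(captions[["image_path", "caption"]])
split = full.train_test_split(test_size=0.1, seed=42)

# same column names the ImageCaptioningDataset above expects
ds = DatasetDict({"train": split["train"], "validation": split["test"]})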


Hi, I also got the same error during the inference stage:

ValueError: Unrecognized feature extractor in ./instagram-captioning-output. Should have a `feature_extractor_type` key in its preprocessor_config.json of config.json, or one of the following `model_type` keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clap, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, speecht5, swin, swinv2, table-transformer, timesformer, tvlt, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos


I resolved it by downgrading transformers with !pip install transformers==4.28.0, using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.


Hello,
I have an error that model is not defined:

Traceback (most recent call last)
in <cell line: 4>:5
NameError: name 'model' is not defined


Hi @Ankur, thanks for this amazing work. Is there a way to extract the probability for the predicted tokens in inference? Best,
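
One way to get at this (a sketch; compute_transition_scores needs a reasonably recent transformers release, roughly 4.26+): ask generate() to return its scores and convert them into per-token probabilities. Here image is a PIL image, and model, feature_extractor, and tokenizer are the objects from the blog post.

import torch

pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

out = model.generate(
    pixel_values,
    max_length=32,
    return_dict_in_generate=True,
    output_scores=True,
)

# per-token log-probabilities of the generated caption
transition_scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
for token_id, score in zip(out.sequences[0][1:], transition_scores[0]):
    print(tokenizer.decode(token_id), torch.exp(score).item())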


@katiele47 this seems to solve it :)


I receive
[{'generated_text': 'A train coming down the tracks in the city.'}]
no matter the image.
What am I missing? Are there any training parameters I need to adjust?
Many thanks!

@dramab

dramab commented Dec 15, 2023

I resolved it by downgrading transformers with !pip install transformers==4.28.0, using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.

I met the same problem. With your fix, it turns into the warning 'You are using a model of type vit to instantiate a model of type vision-encoder-decoder. This is not supported for all configurations of models and can yield errors.'


dramab commented Dec 18, 2023

I found the solution for the 'ValueError: Unrecognized feature extractor in ./instagram-captioning-output ...' error reported above:
just add a "feature_extractor_type": "ViTFeatureExtractor" entry to the preprocessor_config.json file.
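
A quick way to apply that patch from Python (assuming the output directory used in the blog post; adjust the path to your own):

import json

# one-off fix: add the missing key to the saved preprocessor config
path = "./image-captioning-output/preprocessor_config.json"
with open(path) as f:
    cfg = json.load(f)

cfg["feature_extractor_type"] = "ViTFeatureExtractor"

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)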


dramab's solution (adding "feature_extractor_type": "ViTFeatureExtractor" to the preprocessor_config.json file) worked for me to avoid the error. However, when I run image_captioner("sample_image.png") as the last step, I just get a warning and no other output. What is the expected output of running this line? I only get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."


@pleomax0730 Can you provide me your Colab, please?


@Aaryan562 did you find the solution for the error? I am also getting the same error.

@AnhaarHussain

AnhaarHussain commented Feb 25, 2024 via email

@arielshaulov

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Hi, did you succeed in solving that? I am trying to solve the exact same problem.


is124 commented Apr 13, 2024

Hi @Ankur, if I want a certain type of caption, can I provide a prompt to the model? I've been trying it but am not able to get the desired results.

@sswaiting

dramab's solution (adding "feature_extractor_type": "ViTFeatureExtractor" to the preprocessor_config.json file) worked for me to avoid the error. However, when I run image_captioner("sample_image.png") as the last step, I just get a warning and no other output. What is the expected output of running this line? I only get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."

You may create a variable to keep the result and then print it out:
result = image_captioner("sample_image.png")
print(result)
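
To address the max_length warning mentioned above, you can also pass generation arguments through the pipeline call (a sketch; the accepted kwargs can vary between transformers versions):

result = image_captioner(
    "sample_image.png",
    generate_kwargs={"max_new_tokens": 50},  # instead of the default max_length=20
)
print(result)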
