
blogs/the-illustrated-image-captioning-using-transformers/ #2

Open
utterances-bot opened this issue Dec 19, 2022 · 36 comments

@utterances-bot

The Illustrated Image Captioning using transformers - Ankur NLP Enthusiast

The Illustrated Image Captioning using transformers

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Owner

Ankur3107 commented Dec 19, 2022

To train on a large dataset, you can use a PyTorch Dataset as a lazy data iterator:

import torch
from PIL import Image

# `tokenizer`, `feature_extractor`, `model`, `training_args`, `compute_metrics`
# and `default_data_collator` are the objects defined earlier in the blog post.
class ImageCaptioningDataset(torch.utils.data.Dataset):
    def __init__(self, ds, ds_type, max_target_length):
        self.ds = ds
        self.max_target_length = max_target_length
        self.ds_type = ds_type

    def __getitem__(self, idx):
        image_path = self.ds[self.ds_type]['image_path'][idx]
        caption = self.ds[self.ds_type]['caption'][idx]
        model_inputs = dict()
        model_inputs['labels'] = self.tokenization_fn(caption, self.max_target_length)
        model_inputs['pixel_values'] = self.feature_extraction_fn(image_path)
        return model_inputs

    def __len__(self):
        return len(self.ds[self.ds_type])
    
    # text preprocessing step
    def tokenization_fn(self, caption, max_target_length):
        """Run tokenization on caption."""
        labels = tokenizer(caption,
                           padding="max_length",
                           truncation=True,  # keep labels at a fixed length
                           max_length=max_target_length).input_ids

        return labels
    
    # image preprocessing step
    def feature_extraction_fn(self, image_path):
        """
        Run feature extraction on images
        If `check_image` is `True`, the examples that fails during `Image.open()` will be caught and discarded.
        Otherwise, an exception will be thrown.
        """
        image = Image.open(image_path)

        encoder_inputs = feature_extractor(images=image, return_tensors="np")

        return encoder_inputs.pixel_values[0]


train_ds = ImageCaptioningDataset(ds, 'train', 64)
eval_ds = ImageCaptioningDataset(ds, 'validation', 64)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=feature_extractor,  # passed so the image processor is saved with the model
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=default_data_collator,
)
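
For reference, this snippet assumes the globals set up earlier in the blog post. A rough sketch of what they might look like (checkpoint names and argument values are illustrative, and compute_metrics is omitted):

from transformers import (AutoTokenizer, ViTFeatureExtractor,
                          VisionEncoderDecoderModel, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, default_data_collator)

# ViT encoder + GPT-2 decoder, as in the post
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

training_args = Seq2SeqTrainingArguments(
    output_dir="./image-captioning-output",
    per_device_train_batch_size=8,
    predict_with_generate=True,
)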


yesidc commented Jan 14, 2023

Hi, thanks for the tutorial!
I am trying to run your code, but it throws an error that maybe you could help me with. The source of the error seems to be the image feature extractor: it fails when normalizing the images (image_utils.py, line 237). The error occurs when processing image_id 573223, which strangely has shape (224, 224), while the rest of the images processed up to that point have shape (3, 224, 224).
Error:

ValueError: operands could not be broadcast together with shapes (224,224) (3,) 

Thanks in advance!

@yesidc
I ran the code in Colab without any problem.


yesidc commented Jan 19, 2023

I finally managed to run it. For one thing, the preprocessing function provided by the transformers API throws an error when processing black-and-white images (it always expects 3-channel images). I had to override this function, and now the code works (I am using transformers 4.25.1).
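
A lightweight way to handle this (a sketch, not necessarily the exact override used above) is to force every image to 3 channels before handing it to the feature extractor:

    # drop-in replacement for feature_extraction_fn in the dataset class above
    def feature_extraction_fn(self, image_path):
        """Open the image, force RGB, and return its pixel values."""
        # grayscale / RGBA images become (3, H, W) after convert("RGB"),
        # so normalization no longer breaks on single-channel inputs
        image = Image.open(image_path).convert("RGB")
        encoder_inputs = feature_extractor(images=image, return_tensors="np")
        return encoder_inputs.pixel_values[0]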


Unrecognized feature extractor in /content/image-captioning-output. Should have a feature_extractor_type key in its preprocessor_config.json of config.json, or one of the following model_type keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, swin, swinv2, table-transformer, timesformer, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos

I am getting this error. Why is that?

The error occurs in the inference stage, when I am trying to load the pipeline.


Hi Ankur, what if we want multiple captions of the same image?


newbietuan commented Feb 25, 2023

Hi Ankur, I want to do something between the encoder and the decoder, so I define the model as follows:

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, ViTModel

class caption_model(nn.Module):
    def __init__(self, args):
        super(caption_model, self).__init__()
        self.args = args
        self.gpt2_type = self.args.gpt2_type
        self.config = GPT2Config.from_pretrained('./gpt/' + self.gpt2_type)
        self.config.add_cross_attention = True
        # self.config.is_decoder = True
        self.config.is_encoder_decoder = True
        self.encoder = ViTModel.from_pretrained('./vit', local_files_only=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('./gpt/' + self.gpt2_type, config=self.config)

    def forward(self, pixel_values, input_ids):
        image_feat = self.encoder(pixel_values)
        encoder_outputs = image_feat.last_hidden_state
        # encoder_outputs = do something
        output = self.decoder(input_ids=input_ids, encoder_hidden_states=encoder_outputs)
        return output.logits

However, I run into trouble at the inference stage. It seems I should set is_encoder_decoder = True to use BeamSearchEncoderDecoderOutput from generation_utils.py, but then I get "torch.nn.modules.module.ModuleAttributeError: 'GPT2LMHeadModel' object has no attribute 'get_encoder'".
Indeed, VisionEncoderDecoderModel implements ViT-GPT2 for image captioning, but it is integrated, so I cannot do something between the encoder and the decoder; and when I take it apart, I cannot complete the beam_search stage. It seems impossible to rewrite beam_search. Do you have any suggestions, or how should I set the parameters to call generate() directly?
Thank you very much.

Owner

@newbietuan I think you should ask this at https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try; if I find anything, I will update you here.
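
For what it is worth, one pattern that may avoid rewriting beam_search entirely (a rough sketch, version dependent and untested here): keep a plain VisionEncoderDecoderModel, run its vision encoder yourself, modify the hidden states, and pass them to generate() as precomputed encoder_outputs so the built-in beam search takes over from there.

import torch
from transformers import VisionEncoderDecoderModel
from transformers.modeling_outputs import BaseModelOutput

# any ViT-GPT2 VisionEncoderDecoderModel checkpoint works the same way
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
model.eval()

def generate_with_modified_encoder(pixel_values, tokenizer):
    with torch.no_grad():
        # 1. run the vision encoder on its own
        encoder_hidden = model.encoder(pixel_values=pixel_values).last_hidden_state
        # 2. "do something" with the encoder states here
        modified = encoder_hidden  # placeholder for a custom transformation
        # 3. wrap the states so generate() skips its own encoder pass
        encoder_outputs = BaseModelOutput(last_hidden_state=modified)
        output_ids = model.generate(
            encoder_outputs=encoder_outputs,
            num_beams=3,
            max_length=64,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)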

Owner

@Aaryan562
There might be a transformers version issue.

Owner

@DeependraParichha1004

You may have to use a combination of num_return_sequences, num_beams, penalty_alpha, top_k, top_p, etc.

You can refer to:

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

generate_kwargs = {
    "num_return_sequences": 3,
    "num_beams": 3,
}
image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png", generate_kwargs=generate_kwargs)

@newbietuan

@newbietuan I think you should ask this at https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try; if I find anything, I will update you here.

Thank you very much!

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Are you sure that if I copy-pasted each and every line it would not give me any errors, or are there some changes?

@Aaryan562

@Aaryan562 There might be a transformers version issue.

I also checked the config.json and it had the model_type key = 'vit' in it, but it still gives the ValueError.

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

@DeependraParichha1004

Got it, @Ankur3107. Thank you for the explanation.

@TheTahaaa

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

Hi, I also have this issue. Have you found a solution?

@Aaryan562

@Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

Hi, I also have this issue. Have you found a solution?

No, I have not. Are you also getting the error in the inference stage?

@AnhaarHussain

@Aaryan562 There might be a transformers version issue.

So, which version of transformers should we use?


How can I load a custom local dataset using load_dataset()? I have downloaded the Flickr30k dataset, which has images and captions in separate folders.
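
One way to do this (a sketch under the assumption that the captions live in a CSV mapping image names to captions; the file names and column names below are hypothetical, so adjust them to your layout):

import os
import pandas as pd
from datasets import Dataset, DatasetDict

# hypothetical layout: ./flickr30k/images/*.jpg and a captions.csv
# with columns "image_name" and "caption"
captions = pd.read_csv("./flickr30k/captions.csv")
captions["image_path"] = captions["image_name"].apply(
    lambda name: os.path.join("./flickr30k/images", name)
)

full = Dataset.from_pandas(captions[["image_path", "caption"]])
split = full.train_test_split(test_size=0.1, seed=42)

# same column names the ImageCaptioningDataset above expects
ds = DatasetDict({"train": split["train"], "validation": split["test"]})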


Hi, I also got the same error during the inference stage:

ValueError: Unrecognized feature extractor in ./instagram-captioning-output. Should have a `feature_extractor_type` key in its preprocessor_config.json of config.json, or one of the following `model_type` keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clap, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, speecht5, swin, swinv2, table-transformer, timesformer, tvlt, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos


I resolved it by downgrading transformers with !pip install transformers==4.28.0, using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.


Hello,
I have an error that model is not defined:

Traceback (most recent call last)
in <cell line: 4>:5
NameError: name 'model' is not defined


Hi @Ankur, thanks for this amazing work. Is there a way to extract the probability for the predicted tokens in inference? Best,
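
One way to get at this (a sketch; compute_transition_scores needs a reasonably recent transformers release, roughly 4.26+): ask generate() to return its scores and convert them into per-token probabilities. Here image is a PIL image, and model, feature_extractor, and tokenizer are the objects from the blog post.

import torch

pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

out = model.generate(
    pixel_values,
    max_length=32,
    return_dict_in_generate=True,
    output_scores=True,
)

# per-token log-probabilities of the generated caption
transition_scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
for token_id, score in zip(out.sequences[0][1:], transition_scores[0]):
    print(tokenizer.decode(token_id), torch.exp(score).item())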


@katiele47 this seems to solve it :)


I receive
[{'generated_text': 'A train coming down the tracks in the city.'}]
no matter the image.
What am I missing? Are there any training parameters I need to adjust?
Many thanks!

@dramab

dramab commented Dec 15, 2023

I resolved it by downgrading transformers with !pip install transformers==4.28.0, using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.

I met the same problem. With your fix, it turns into the warning 'You are using a model of type vit to instantiate a model of type vision-encoder-decoder. This is not supported for all configurations of models and can yield errors.'


dramab commented Dec 18, 2023

I found the solution for the 'ValueError: Unrecognized feature extractor in ./instagram-captioning-output ...' error reported above:
just add a "feature_extractor_type": "ViTFeatureExtractor" entry to the preprocessor_config.json file.
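
A quick way to apply that patch from Python (assuming the output directory used in the blog post; adjust the path to your own):

import json

# one-off fix: add the missing key to the saved preprocessor config
path = "./image-captioning-output/preprocessor_config.json"
with open(path) as f:
    cfg = json.load(f)

cfg["feature_extractor_type"] = "ViTFeatureExtractor"

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)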


dramab's solution (adding "feature_extractor_type": "ViTFeatureExtractor" to the preprocessor_config.json file) worked for me to avoid the error. However, when I run image_captioner("sample_image.png") as the last step, I just get a warning and no other output. What is the expected output of running this line? I only get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."


@pleomax0730 Can you provide me your Colab, please?


@Aaryan562 did you find the solution for the error? I am also getting the same error.

@AnhaarHussain

AnhaarHussain commented Feb 25, 2024 via email

@arielshaulov

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Hi, did you succeed in solving that? I am trying to solve the exact same problem.


is124 commented Apr 13, 2024

Hi @Ankur, if I want a certain type of caption, can I provide a prompt to the model? I've been trying it but am not able to get the desired results.

@sswaiting

dramab's solution (adding "feature_extractor_type": "ViTFeatureExtractor" to the preprocessor_config.json file) worked for me to avoid the error. However, when I run image_captioner("sample_image.png") as the last step, I just get a warning and no other output. What is the expected output of running this line? I only get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."

You may create a variable to keep the result and then print it out:
result = image_captioner("sample_image.png")
print(result)
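
To address the max_length warning mentioned above, you can also pass generation arguments through the pipeline call (a sketch; the accepted kwargs can vary between transformers versions):

result = image_captioner(
    "sample_image.png",
    generate_kwargs={"max_new_tokens": 50},  # instead of the default max_length=20
)
print(result)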
