


#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Introduction to Image Captioning </a> </li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Setting Up Working Environment </a></li>
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Loading the Model and Processor </a></li>
    <li><a href="#4" target="_self" rel=" noreferrer nofollow">4. Loading & Displaying the Image </a></li>
    <li><a href="#5" target="_self" rel=" noreferrer nofollow">5. Conditional Image Captioning </a></li>
    <li><a href="#6" target="_self" rel=" noreferrer nofollow">6. Unconditional Image Captioning </a></li>
    <li><a href="#7" target="_self" rel=" noreferrer nofollow">7. Trying Another Example with a Better Condition </a></li>

</ul>
</div>

***


In [None]:
# transformers
# torch
# datasets
# pillow

In [1]:
!pip install transformers torch -q

In [None]:
# !pip install --upgrade transformers

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

Next, we will load the processor with the BlipProcessor class imported above. It is a convenient tool for handling preprocessing tasks, such as tokenizing text and processing images, to prepare them for model input.

In [None]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

<a id="4"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 4. Loading & Displaying the Image </b></div>

Next, load the image with the Image module from PIL (Pillow). In a Jupyter Notebook or a similar environment, leaving the image object as the last expression of a cell displays it.

In [None]:
from PIL import Image
image = Image.open("/kaggle/input/gaza-images/gaza_under_fire.png")
image

If you are running this code outside a Jupyter Notebook, simply loading the image with **Image.open** will not display it. Using **matplotlib** or another image display library will ensure the image is shown correctly.
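
As a minimal sketch of the matplotlib route mentioned above, the snippet below draws a PIL image on a figure. The `show_image` helper and the in-memory placeholder image are illustrative assumptions, not part of the original notebook; in practice you would pass the image returned by `Image.open`.

```python
import matplotlib.pyplot as plt
from PIL import Image


def show_image(image, title=None):
    """Draw a PIL image on a matplotlib figure and return the figure.

    Hypothetical helper for illustration only.
    """
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(image)   # imshow accepts PIL images directly
    ax.axis("off")     # hide the axis ticks and frame
    if title is not None:
        ax.set_title(title)
    return fig


# Placeholder image (an assumption) so the sketch runs without a file on disk.
placeholder = Image.new("RGB", (64, 48), color=(60, 121, 245))
fig = show_image(placeholder, title="placeholder")
```

In a standalone script, call `plt.show()` after `show_image(...)` to open the window.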

In [None]:
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
inputs

To generate a caption for the image using the BLIP model with the processed inputs, you can use the generate method of the model. After generating the output, you can decode it to get the caption.

In [None]:
out = model.generate(**inputs)
out

Finally, decode the output with the processor.decode method to get the caption in a readable format.

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))

We can see that the caption describes the image in general terms but is very generic. It does not capture what the image actually shows, which is Israel bombing Gaza. The model may need fine-tuning to produce a more specific caption. Another possible solution is to provide more guidance in the text prompt.
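
The three steps used in this section (process, generate, decode) can be wrapped in one small helper. The sketch below is an illustration rather than part of the original notebook: `caption_image` and its `generate_kwargs` passthrough are hypothetical names, and it assumes `processor` and `model` are the BLIP objects loaded earlier.

```python
def caption_image(image, processor, model, prompt=None, **generate_kwargs):
    """Return a caption for `image`.

    Hypothetical helper: with `prompt=None` the processor receives only the
    image; otherwise the prompt is passed as the text condition.
    """
    if prompt is None:
        inputs = processor(image, return_tensors="pt")
    else:
        inputs = processor(image, prompt, return_tensors="pt")
    # Extra keyword arguments (for example max_new_tokens) go to generate().
    out = model.generate(**inputs, **generate_kwargs)
    return processor.decode(out[0], skip_special_tokens=True)
```

With the objects from this notebook, a call would look like `caption_image(image, processor, model, prompt="a photograph of")`.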

<a id="6"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 6. Unconditional Image Captioning </b></div>


Let's try Unconditional Image Captioning. This refers to a type of image captioning task where the model generates captions for an image without any specific guidance or additional input beyond the image itself.
In other words, the model generates captions solely based on the visual information present in the image, without being given any explicit prompts or cues about what the caption should focus on or describe.
We will follow the same steps as in conditional image captioning, but without passing a text argument to the processor.

In [None]:
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

We can see that the returned caption is very similar to the one we got with conditional image captioning.

<a id="7"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 7. Trying Another Example with a Better Condition </b></div>


Let's try another example in which we provide better textual guidance to the captioning model. We will use the image below, in which a group of Israeli soldiers are abusing a Palestinian child in the streets of Jerusalem.

In [None]:
image = Image.open("/kaggle/input/gaza-images/israeli_abuse_child.png")
image

The text guidance we will use is "Israeli soldiers". Let's observe how this improves the caption and compare the result with the unconditional image captioning method. We will follow the same process as before, starting by passing both the image and the text prompt to the processor.

In [None]:
text = "Israeli soldiers"
inputs = processor(image, text, return_tensors="pt")
inputs

Next, we will generate the caption for the image by passing the processed inputs to the generate method of the BLIP model.

In [None]:
out = model.generate(**inputs)
out

Finally, we will decode the output into text.

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))

We can see that the generated caption is very precise, even more than I expected. This was achieved simply by adding better guidance. To observe the difference, let's generate the caption without using a textual condition.

In [None]:
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

We can see the difference between the two captions: the first one is very specific and accurate, while the second one is very general. This shows the importance of good textual guidance.
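
To compare several kinds of guidance side by side, a small loop can collect one caption per prompt. This is a hedged sketch: `compare_prompts`, `print_comparison`, and the `caption_fn` argument are hypothetical names introduced for illustration. `caption_fn` stands for any callable that takes `(image, prompt)` and returns a string, for example a thin wrapper around the processor, `model.generate`, and `processor.decode` calls used in this notebook.

```python
def compare_prompts(image, prompts, caption_fn):
    """Caption `image` once per prompt; a prompt of None means unconditional.

    Hypothetical helper: `caption_fn(image, prompt)` must return a string.
    """
    results = {}
    for prompt in prompts:
        label = "(unconditional)" if prompt is None else prompt
        results[label] = caption_fn(image, prompt)
    return results


def print_comparison(results):
    """Print each prompt label next to its caption in aligned columns."""
    width = max(len(label) for label in results)
    for label, caption in results.items():
        print(f"{label:<{width}}  ->  {caption}")
```

For the example in this section, the prompt list would be `[None, "Israeli soldiers"]`.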