# **Tested and Modified for NVIDIA Infrastructure**

## **BLIP (Bootstrapping Language-Image Pre-training)**

BLIP is a state-of-the-art model designed for vision-language tasks such as image captioning, image-text retrieval, and visual question answering. It leverages large-scale pretraining to learn rich representations that jointly understand both images and natural language. BLIP was introduced to bridge the gap between visual understanding and natural language generation in a scalable and efficient manner.


BLIP uses a combination of vision encoders and language models, pretrained on massive amounts of image-text pairs. The model bootstraps itself through an iterative training process, improving its multimodal understanding by aligning image features and text embeddings.

![A working example of BLIP image captioning](https://www.researchgate.net/profile/Zeyu-Xiong-5/publication/367369926/figure/fig2/AS:11431281114845784@1674699948232/A-working-example-of-BLIP-Image-Captioning.ppm)


**Here's how BLIP works:**

1. Vision and Language Encoders:
BLIP consists of a vision encoder (a Vision Transformer) that processes images into visual embeddings, and a text encoder (a Transformer based on BERT) that processes text inputs.

2. Bootstrapping Pretraining:
BLIP bootstraps its own training captions: a captioner generates synthetic captions for web images, and a filter removes noisy image-text pairs. The model is then pretrained on this cleaner set of image-text pairs instead of the raw web data alone.

3. Contrastive Learning:
BLIP aligns image and text embeddings in a shared semantic space using contrastive loss. This encourages the model to bring matching image-text pairs closer while pushing apart non-matching pairs, improving cross-modal retrieval.

4. Generative Fine-tuning:
Beyond retrieval, BLIP can be fine-tuned to generate descriptive captions for images. This generative capability is achieved using a sequence-to-sequence Transformer that conditions language generation on visual features.

5. Applications:
BLIP is versatile and achieves strong performance on tasks like image captioning, image-text retrieval, and visual question answering, enabling machines to better understand and describe visual content in natural language.
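
The contrastive step (item 3 above) can be illustrated with a small, dependency-free sketch. This is a simplified symmetric contrastive loss over toy embeddings, not BLIP's exact training objective; the function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair i (image i, text i) is treated as the only positive for row/column i.
    """
    n = len(image_embeds)
    # sim[i][j]: temperature-scaled similarity between image i and text j
    sim = [[cosine_similarity(img, txt) / temperature for txt in text_embeds]
           for img in image_embeds]
    # Image-to-text: each row should put its probability mass on the diagonal
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image: same idea along the columns
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy 2-D embeddings (hypothetical values)
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is close to zero when each image embedding points in the same direction as its own caption, and large when the captions are swapped, which is the behaviour that pulls matching pairs together and pushes non-matching pairs apart.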

**References:**

1. https://ahmed-sabir.medium.com/paper-summary-blip-bootstrapping-language-image-pre-training-for-unified-vision-language-c1df6f6c9166
2. https://www.researchgate.net/figure/A-working-example-of-BLIP-Image-Captioning_fig2_367369926

## **Install required dependencies**

The command below installs four Python libraries that this notebook uses for vision-language deep learning.

- **Transformers**: use and fine-tune pretrained language and multimodal models. In our case, it provides the BlipProcessor and BlipForConditionalGeneration classes needed for caption generation.

- **Torchvision**: access computer vision datasets and prebuilt models.

- **Pillow**: load, process, and save images easily.

- **Datasets**: build and preprocess the small custom dataset used in the fine-tuning section.

In [None]:
pip install transformers torchvision Pillow datasets

## **Import Required Dependencies**

**PIL.Image** from the Python Imaging Library (Pillow) is used to open images and convert them to the required format.

**BlipProcessor** handles all the necessary preprocessing required before passing an image to the BLIP model. This includes resizing, normalization, and formatting the image into tensors compatible with the model.

**BlipForConditionalGeneration** is the core pre-trained BLIP model from Hugging Face's Transformers library. It generates the image caption from the processed image input. Its from_pretrained() method loads model weights; the checkpoint used in this notebook was fine-tuned specifically for image captioning.
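
To make the normalization step concrete, here is a minimal sketch of what happens to a single pixel: the 8-bit value is rescaled to the range [0, 1] and then standardized per channel. The mean and standard deviation below are hypothetical placeholders, not the values BLIP uses; the real ones should be available on the loaded processor as `processor.image_processor.image_mean` and `processor.image_processor.image_std`.

```python
def normalize_pixel(rgb, mean, std):
    """Rescale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [((value / 255.0) - m) / s for value, m, s in zip(rgb, mean, std)]

# Hypothetical per-channel statistics, for illustration only
EXAMPLE_MEAN = [0.5, 0.5, 0.5]
EXAMPLE_STD = [0.5, 0.5, 0.5]

white = normalize_pixel([255, 255, 255], EXAMPLE_MEAN, EXAMPLE_STD)
black = normalize_pixel([0, 0, 0], EXAMPLE_MEAN, EXAMPLE_STD)
print(white, black)
```

BlipProcessor applies the same kind of transformation to every pixel after resizing the image, and returns the result as the `pixel_values` tensor.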

In [None]:
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration, Trainer, TrainingArguments
from PIL import ExifTags, Image

In [None]:
pip uninstall pillow -y

In [None]:
pip install pillow==8.2.0

## **Download and load the pre-trained BLIP model and processor**

In [None]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

In [None]:
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

## **Load an image**

In [None]:
image_path = "rain1.jpg"  # Replace with your image path
image = Image.open(image_path).convert("RGB")

## **Preprocess the image and prepare the input**


In [None]:
inputs = processor(images=image, return_tensors="pt")

## **Generate captions using BLIP**

In [None]:
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)

In [None]:
print("Generated Caption: ", caption)
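
The preprocessing, generation, and decoding steps above can be wrapped in one helper. The sketch below is an illustration, not part of the original notebook: `generate_caption` is a hypothetical name, the optional `prompt` relies on the processor accepting a `text` argument for conditional captioning, and any extra keyword arguments are passed straight to `model.generate`.

```python
def generate_caption(model, processor, image, prompt=None, **generate_kwargs):
    """Caption one image, optionally continuing from a text prompt."""
    if prompt is None:
        # Unconditional captioning: the model sees only the image
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional captioning: the caption starts with the given prompt
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generate_kwargs)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example call with the objects loaded earlier in this notebook:
# generate_caption(model, processor, image,
#                  prompt="a photograph of", max_new_tokens=30, num_beams=3)
```

`max_new_tokens` caps the caption length and `num_beams` enables beam search; both are standard generation arguments, and the values shown are arbitrary starting points. If the model has been moved to a GPU, the inputs also need to be moved to the same device before calling `generate`.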

## **Fine-tuning on a small custom dataset**

Fine-tuning in machine learning is the process of taking a pre-trained model and training it further on a new, smaller dataset relevant to a specific task. This adapts the model's parameters to the new task while building on the knowledge it already has.

**Creating a small dummy dataset** (you can expand it with more images and captions, or use a dataset from Hugging Face)

In [None]:
data = {
    "image": [
        Image.open("dog1.jpg").convert("RGB"), # replace with your image
        Image.open("ppl1.jpg").convert("RGB") # replace with your image
    ],
    "text": [
        "A dog playing in the grass.", # add caption as per image
        "A group of people hiking a mountain." # add caption as per image
    ]
}
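
When the dataset grows beyond a couple of images, it can help to keep file names and captions together in one mapping and derive the columns from it. The helper below is a hypothetical sketch (`build_columns` and the `image_path` column name are assumptions, not part of the original notebook); it only pairs paths with captions and rejects empty captions.

```python
from pathlib import Path

def build_columns(captions_by_file, image_dir="."):
    """Turn {file name: caption} into parallel lists of paths and captions."""
    columns = {"image_path": [], "text": []}
    for file_name, caption in captions_by_file.items():
        caption = caption.strip()
        if not caption:
            raise ValueError(f"Empty caption for {file_name}")
        columns["image_path"].append(str(Path(image_dir) / file_name))
        columns["text"].append(caption)
    return columns

columns = build_columns({
    "dog1.jpg": "A dog playing in the grass.",
    "ppl1.jpg": "A group of people hiking a mountain.",
})
print(columns["image_path"])
```

Each path can then be opened with `Image.open(path).convert("RGB")`, as in the cell above, to fill the `image` list.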

In [None]:
from datasets import Dataset

The datasets library handles large datasets efficiently and is tightly integrated with the transformers library for NLP and multimodal tasks.

In [None]:
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(data)


In [None]:
def preprocess_function(example):
    # Extract the image and caption from the dataset example
    image = example["image"]
    caption = example["text"]

    # Preprocess the image and tokenize the caption in a single processor call
    encoding = processor(images=image, text=caption, return_tensors="pt", padding="max_length", truncation=True)

    # Return the processed inputs with the batch dimension removed.
    # The labels are the caption token ids, since the model is trained to predict the caption.
    return {
        "pixel_values": encoding["pixel_values"].squeeze(0),
        "input_ids": encoding["input_ids"].squeeze(0),
        "attention_mask": encoding["attention_mask"].squeeze(0),
        "labels": encoding["input_ids"].squeeze(0),
    }
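
Because the captions are padded with `padding="max_length"`, the labels above also contain padding token ids. A common practice when fine-tuning with Hugging Face models is to replace the labels at padded positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. Whether this changes results noticeably for this tiny dataset is an assumption to verify; the sketch below shows the idea on plain Python lists with made-up token ids.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Keep token ids where attention_mask is 1, replace padded positions."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]

# Hypothetical token ids: the trailing zeros stand in for padding
labels = mask_padding([101, 1037, 3899, 102, 0, 0], [1, 1, 1, 1, 0, 0])
print(labels)
```

With tensors, the equivalent operation is `input_ids.masked_fill(attention_mask == 0, -100)`.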


### **Applying the function to each example in the dataset**
The original 'image' and 'text' fields are removed after processing.

In [None]:
processed_dataset = dataset.map(
    preprocess_function,
    remove_columns=["image", "text"]
)


## **Setting up training hyperparameters with training arguments**
The TrainingArguments class defines key settings for model training, such as output directory, batch size, number of epochs, logging frequency, checkpoint saving, and whether to use mixed-precision (FP16) if a GPU is available.

In [None]:
training_args = TrainingArguments(
    output_dir="./blip-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=5,
    logging_steps=1,
    save_steps=5,
    save_total_limit=1,
    remove_unused_columns=False,
    fp16=torch.cuda.is_available(),
    report_to="none"
)
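
To see how these settings interact, it helps to work out how many optimizer steps the run will take. The helper below is a rough sketch that assumes no gradient accumulation; `total_optimizer_steps` is a hypothetical name, not a Transformers function.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Approximate number of optimizer steps without gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

# Two images, batch size 2, five epochs, as configured above
steps = total_optimizer_steps(num_examples=2, per_device_batch_size=2, num_epochs=5)
print(steps)
```

With the two-image dataset each epoch is a single step, so training runs for 5 steps in total. `logging_steps=1` therefore logs the loss after every step, and `save_steps=5` writes one checkpoint at the final step.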

## **Initializing the Trainer for Model Training**
The Trainer class handles the full training loop by combining the model, training arguments, dataset, and tokenizer. It simplifies the training process by managing batching, optimization, logging, and checkpointing automatically.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
    tokenizer=processor,
)

## **Starting the Training Process**
The trainer.train() command begins the model training using the configured settings, dataset, and model. It runs the training loop, updating model weights over the specified epochs.

In [None]:
trainer.train()

## **Saving the Trained Model and Processor**
Save the fine-tuned model and the associated processor/tokenizer to the specified directory for later use or deployment.

The fine-tuned model is saved inside the blip-finetuned directory.

In [None]:
trainer.save_model("./blip-finetuned")
processor.save_pretrained("./blip-finetuned")

## **Load the fine-tuned model from the blip-finetuned directory**

Test the fine-tuned model on an image.

In [None]:
model_fine_tune = BlipForConditionalGeneration.from_pretrained("./blip-finetuned")
processor_fine_tune = BlipProcessor.from_pretrained("./blip-finetuned")

In [None]:
image = Image.open("rain1.jpg").convert("RGB") # replace with your image

In [None]:
# Prepare inputs
inputs = processor_fine_tune(images=image, return_tensors="pt").to(model_fine_tune.device)

In [None]:
# Generate caption
out = model_fine_tune.generate(**inputs)
caption1 = processor_fine_tune.decode(out[0], skip_special_tokens=True)

In [None]:
print("Caption:", caption1)