# IDEFICS: A Flamingo-based model, trained at scale for the community
# Finetuning Demo Notebook:

<div style="display: flex; justify-content: center;">
    <img src="https://huggingface.co/HuggingFaceM4/idefics-80b/resolve/main/assets/Idefics_colab.png" alt="Idefics image" >
</div>

Credit: [Flamingo blog](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model)

This Google Colab notebook shows how to run predictions with the 4-bit quantized 🤗 [Idefics-9B model](https://huggingface.co/HuggingFaceM4/idefics-9b) and finetune it on a specific dataset.

[IDEFICS](https://huggingface.co/HuggingFaceM4/idefics-80b) is a multi-modal model based on the [Flamingo](https://arxiv.org/abs/2204.14198) architecture. It takes images and text as input and returns text as output; it does not support image generation. \\
IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents. \\
The [finetuned versions](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) of IDEFICS behave like LLM chatbots while also understanding visual input. \\
You can play with the [demo here](https://huggingface.co/spaces/HuggingFaceM4/idefics_playground).

The code for this notebook was contributed by *Léo Tronchon, Younes Belkada, and Stas Bekman*. The IDEFICS model was contributed by *Lucile Saulnier, Léo Tronchon, Hugo Laurençon, Stas Bekman, Amanpreet Singh, Siddharth Karamcheti, and Victor Sanh*.

# Install and import necessary libraries

In [None]:
!pip install -q datasets
!pip install -q transformers==4.45.2
!pip install -q bitsandbytes sentencepiece accelerate loralib
!pip install -q -U git+https://github.com/huggingface/peft.git

In [1]:
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from PIL import Image
from transformers import IdeficsForVisionText2Text, AutoProcessor, Trainer, TrainingArguments, BitsAndBytesConfig
import torchvision.transforms as transforms

# Load ~~quantized~~ model
First, load the quantized version of the model. This allows us to use the 9B version of IDEFICS on a single 16 GB GPU.
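
To see why quantization matters for a 16 GB GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a round figure of 9 billion parameters (taken from the model name, not measured) and ignores activations, optimizer state, quantization constants, and the modules that are kept unquantized, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: a round 9 billion parameters, as suggested by the "9b" in the checkpoint name.
NUM_PARAMS = 9_000_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half precision
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the half-precision weights alone exceed 16 GB, while the 4-bit weights leave room for activations and the adapter parameters trained later.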



In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# checkpoint = "HuggingFaceM4/tiny-random-idefics"
checkpoint = "HuggingFaceM4/idefics-9b"

# Here we skip some special modules that can't be quantized properly
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["lm_head", "embed_tokens"],
)

processor = AutoProcessor.from_pretrained(checkpoint, use_auth_token=True)
# Remove the quantization_config argument if you want to load the original (unquantized) model
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, quantization_config=bnb_config, device_map="auto")

IdeficsForVisionText2Text has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Instantiating IdeficsAttention without passing a `layer_idx` is not recommended and will lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` when creating this class.


Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

If you print the model, you will see that the `nn.Linear` layers have been replaced by `bnb.nn.Linear4bit` layers, apart from the modules listed in `llm_int8_skip_modules`.

In [3]:
print(model)

IdeficsForVisionText2Text(
  (model): IdeficsModel(
    (embed_tokens): IdeficsDecoupledEmbedding(
      num_embeddings=32000, num_additional_embeddings=2, embedding_dim=4096, partially_freeze=False
      (additional_embedding): Embedding(2, 4096)
    )
    (vision_model): IdeficsVisionTransformer(
      (embeddings): IdeficsVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(257, 1280)
      )
      (pre_layrnorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder): IdeficsVisionEncoder(
        (layers): ModuleList(
          (0-31): 32 x IdeficsVisionEncoderLayer(
            (self_attn): IdeficsVisionAttention(
              (k_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
              (v_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
              (q_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
        

In [4]:
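from transformers.models.idefics.modeling_idefics import IdeficsDecoupledLinear

# Optional experiment: swap the language-model head for a 2-output head.
# With out_features=2 the logits no longer span the vocabulary, so generated
# token ids are limited to 0 and 1 and the `labels = input_ids` loss used for
# finetuning below would receive out-of-range targets.
in_features = model.lm_head.in_features

model.lm_head = IdeficsDecoupledLinear(in_features, 2, bias=False, partially_freeze=False, device=model.device, dtype=model.dtype)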

In [5]:
model

IdeficsForVisionText2Text(
  (model): IdeficsModel(
    (embed_tokens): IdeficsDecoupledEmbedding(
      num_embeddings=32000, num_additional_embeddings=2, embedding_dim=4096, partially_freeze=False
      (additional_embedding): Embedding(2, 4096)
    )
    (vision_model): IdeficsVisionTransformer(
      (embeddings): IdeficsVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(257, 1280)
      )
      (pre_layrnorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder): IdeficsVisionEncoder(
        (layers): ModuleList(
          (0-31): 32 x IdeficsVisionEncoderLayer(
            (self_attn): IdeficsVisionAttention(
              (k_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
              (v_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
              (q_proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
        

# Inference
Let's write a simple helper function to test the model's inference.

In [4]:
def check_inference(model, processor, prompts, max_new_tokens=50):
    tokenizer = processor.tokenizer
    bad_words = ["<image>", "<fake_token_around_image>"]
    if len(bad_words) > 0:
        bad_words_ids = tokenizer(bad_words, add_special_tokens=False).input_ids

    eos_token = "</s>"
    eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)

    inputs = processor(prompts, return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, eos_token_id=[eos_token_id], bad_words_ids=bad_words_ids, max_new_tokens=max_new_tokens, early_stopping=True)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(generated_text)


Let's run prediction with the quantized model for the image below which pictures two kittens. \\
<img src="https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg" width="400"/>

In [5]:
url = "https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg"
prompts = [
    # "Instruction: provide an answer to the question. Use the image to answer.\n",
    url,
    "Question: What's on the picture? Answer:",
]
check_inference(model, processor, prompts, max_new_tokens=5)




Question: What's on the picture? Answer: Two kittens.


Now let's see how the model fares on Pokémon knowledge before we finetune it further. \\
<img src="https://images.pokemontcg.io/pop6/2_hires.png" width="194"/>


In [6]:
# check generation before finetuning

url = "https://images.pokemontcg.io/pop6/2_hires.png"
prompts = [
    url,
    "Question: What's on the picture? Answer:",
]
check_inference(model, processor, prompts, max_new_tokens=100)
# It looks like the model is already aware of pokemon - but it could be more specific, and less repetitive

Question: What's on the picture? Answer: Lucario


# Finetuning dataset
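Prepare the dataset that will be used for finetuning. Each example pairs an image with a question and an answer, which are combined into a single prompt.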


In [21]:
def convert_to_rgb(image):
    # `image.convert("RGB")` would only work for .jpg images, as it creates a wrong background
    # for transparent images. The call to `alpha_composite` handles this case
    if image.mode == "RGB":
        return image

    image_rgba = image.convert("RGBA")
    background = Image.new("RGBA", image_rgba.size, (255, 255, 255))
    alpha_composite = Image.alpha_composite(background, image_rgba)
    alpha_composite = alpha_composite.convert("RGB")
    return alpha_composite

def ds_transforms(example_batch):
    image_size = processor.image_processor.image_size
    image_mean = processor.image_processor.image_mean
    image_std = processor.image_processor.image_std

    image_transform = transforms.Compose([
        convert_to_rgb,
        transforms.RandomResizedCrop((image_size, image_size), scale=(0.9, 1.0), interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=image_mean, std=image_std),
    ])

    prompts = []
    for i in range(len(example_batch['question'])):
        question = example_batch['question'][i]
        answer = example_batch['answer'][i]
        prompts.append(
            [
                Image.open(example_batch['image_path'][i]),
                f"Question: {question} Answer: {answer}</s>",
            ],
        )

    inputs = processor(prompts, transform=image_transform, return_tensors="pt").to(device)

    inputs["labels"] = inputs["input_ids"]

    return inputs
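
The comment in `convert_to_rgb` says that dropping the alpha channel gives the wrong background for transparent images. The following pure-Python sketch illustrates the idea on single pixels. It approximates alpha compositing with floating-point arithmetic, so results may differ by a rounding step from what `Image.alpha_composite` returns; the helper names are hypothetical and not part of the notebook.

```python
def drop_alpha(rgba):
    """Discard the alpha channel and keep the stored RGB values as they are."""
    return tuple(rgba[:3])

def composite_over_white(rgba):
    """Blend one RGBA pixel (channels in 0-255) over an opaque white background."""
    r, g, b, a = rgba
    alpha = a / 255
    return tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b))

# A fully transparent pixel is often stored with black RGB values.
transparent_black = (0, 0, 0, 0)
print(drop_alpha(transparent_black))            # stays black
print(composite_over_white(transparent_black))  # becomes white

# A half-transparent red pixel is lightened towards white.
print(composite_over_white((255, 0, 0, 128)))
```

Compositing over white keeps transparent regions bright instead of turning them into a solid dark block, which is the case `convert_to_rgb` is written to handle.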

### Debugging

In [20]:
from PIL import Image
prompts = []
image_size = processor.image_processor.image_size
image_mean = processor.image_processor.image_mean
image_std = processor.image_processor.image_std
print(f"Image size: {image_size}, mean: {image_mean}, std: {image_std}")

image_transform = transforms.Compose([
    convert_to_rgb,
    transforms.RandomResizedCrop((image_size, image_size), scale=(0.9, 1.0), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=image_mean, std=image_std),
])

prompts.append([
    Image.open("/home/gpuuser3/sinngam_albert/datasets/mmsd2/dataset_image/930770279348961280.jpg"),
    "Question: What do you see? Answer: kittens</s>",
])

inputs = processor(prompts, transform=image_transform, return_tensors="pt", padding=True).to(device)
print(inputs['pixel_values'])

Image size: 224, mean: [0.48145466, 0.4578275, 0.40821073], std: [0.26862954, 0.26130258, 0.27577711]
tensor([[[[[-0.2302, -0.2010, -0.2010,  ...,  0.0179,  0.0033,  0.0033],
           [-0.2010, -0.2010, -0.2010,  ...,  0.0179,  0.0471,  0.0179],
           [-0.2010, -0.2156, -0.1864,  ...,  0.0179,  0.0179, -0.0259],
           ...,
           [ 1.3464,  1.3610,  1.2442,  ...,  1.6822,  1.6384,  1.5508],
           [ 1.3610,  1.3610,  1.2734,  ...,  1.6384,  1.6384,  1.5800],
           [ 1.3172,  1.3026,  1.2150,  ...,  1.5362,  1.5800,  1.5800]],

          [[-0.0862, -0.0562, -0.0712,  ...,  0.1389,  0.1239,  0.1239],
           [-0.0862, -0.0862, -0.0862,  ...,  0.1389,  0.1689,  0.1389],
           [-0.1012, -0.1163, -0.0862,  ...,  0.1389,  0.1389,  0.0939],
           ...,
           [-0.3564, -0.3264, -0.4464,  ..., -0.3864, -0.2963, -0.0412],
           [-0.3414, -0.3414, -0.4314,  ..., -0.3564, -0.2963, -0.1313],
           [-0.3864, -0.4014, -0.4914,  ..., -0.3114, -0.2963

In [22]:
import os
from tqdm import tqdm
import pandas as pd
from PIL import Image

print("#### Preparing dataset.")
data_folder = "../../datasets/mmsd2/"
train_dataset_path = os.path.join(data_folder, "train.json")
valid_dataset_path = os.path.join(data_folder, "valid.json")

train_df = pd.read_json(train_dataset_path)
valid_df = pd.read_json(valid_dataset_path)
valid_sarcastic = valid_df[valid_df["label"] == 1].sample(n=50, random_state=42)
valid_not_sarcastic = valid_df[valid_df["label"] == 0].sample(n=50, random_state=42)
valid_df = pd.concat([valid_sarcastic, valid_not_sarcastic], ignore_index=True)


def create_img_paths(id):
    img_directory = "/home/gpuuser3/sinngam_albert/datasets/mmsd2/dataset_image"
    path = f"{img_directory}/{id}.jpg"
    return path

def convert_label_to_text(label):
    if label == 0:
        return "NOT SARCASTIC"
    else:
        return "SARCASTIC"
    
train_df["image_path"] = train_df["image_id"].apply(lambda x: create_img_paths(x))
valid_df["image_path"] = valid_df["image_id"].apply(lambda x: create_img_paths(x))

def adjust_dataset(sample):
    question = f"Classify the text <{sample['text']}> and the image into one of the following categories: <SARCASTIC, NOT SARCASTIC>."
    answer = convert_label_to_text(sample["label"])
    return {
        "question": question,
        "answer": answer,
        "image_path": sample["image_path"],
    }

train_dataset = [
    adjust_dataset(sample) for i, sample in tqdm(train_df.iterrows(), total=len(train_df))
]
valid_dataset = [
    adjust_dataset(sample) for i, sample in tqdm(valid_df.iterrows(), total=len(valid_df))
]


#### Preparing dataset.


  0%|          | 0/19816 [00:00<?, ?it/s]

100%|██████████| 19816/19816 [00:01<00:00, 19525.96it/s]
100%|██████████| 100/100 [00:00<00:00, 17503.25it/s]
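
To make the data format concrete, this sketch shows the text that the model ends up seeing for one example, combining the question built in `adjust_dataset` with the template used in `ds_transforms`. The function name and the sample text are hypothetical. Unlike `convert_label_to_text`, which maps every non-zero label to `SARCASTIC`, the lookup here only accepts 0 and 1.

```python
LABEL_TEXT = {0: "NOT SARCASTIC", 1: "SARCASTIC"}

def build_training_text(text: str, label: int) -> str:
    """Return the prompt string that is paired with the image during finetuning."""
    question = (
        f"Classify the text <{text}> and the image into one of the "
        "following categories: <SARCASTIC, NOT SARCASTIC>."
    )
    answer = LABEL_TEXT[label]
    # "</s>" is the end-of-sequence token, so the model learns to stop after the answer.
    return f"Question: {question} Answer: {answer}</s>"

# Hypothetical sample, not taken from the dataset.
print(build_training_text("what a great day to miss my train", 1))
```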


In [23]:
# Convert train_dataset and valid_dataset to Dataset objects
from datasets import Dataset
train_dataset = Dataset.from_list(train_dataset)
valid_dataset = Dataset.from_list(valid_dataset)

In [24]:
train_dataset.set_transform(ds_transforms)
valid_dataset.set_transform(ds_transforms)

In [26]:
train_dataset[200]['pixel_values']

tensor([[[[ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486],
          [ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486],
          [ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486],
          ...,
          [ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486],
          [ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486],
          [ 1.4486,  1.4486,  1.4486,  ...,  1.4486,  1.4486,  1.4486]],

         [[-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521],
          [-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521],
          [-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521],
          ...,
          [-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521],
          [-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521],
          [-1.7521, -1.7521, -1.7521,  ..., -1.7521, -1.7521, -1.7521]],

         [[-1.4802, -1.4802, -1.4802,  ..., -1.4802, -1.4802, -1.4802],
          [-1.4802, -1.4802, -
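
A quick sanity check on the printed tensor: `ToTensor` scales pixel values to [0, 1], and `Normalize` then computes `(x - mean) / std` per channel. The sketch below inverts that step using the mean and std printed in the debugging cell. The helper names are hypothetical, and the check only covers the values visible in the truncated output.

```python
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]

def normalize_pixel(rgb):
    """Apply ToTensor scaling (divide by 255) followed by per-channel normalization."""
    return [(v / 255 - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)]

def denormalize_pixel(values):
    """Invert the normalization and return integer 0-255 channel values."""
    return [round((x * s + m) * 255) for x, m, s in zip(values, IMAGE_MEAN, IMAGE_STD)]

# Per-channel values visible in the printed tensor above.
printed = [1.4486, -1.7521, -1.4802]
print(denormalize_pixel(printed))  # [222, 0, 0]
print([round(v, 4) for v in normalize_pixel((222, 0, 0))])
```

The visible values are therefore consistent with a uniform red region of roughly (222, 0, 0) along the image border, which suggests the transform is behaving as expected rather than producing corrupted data.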

# LoRA
After specifying the low-rank adapters (LoRA) config, we wrap the model in a `PeftModel` using the `get_peft_model` utility function.

In [27]:
print(model)

IdeficsForVisionText2Text(
  (model): IdeficsModel(
    (embed_tokens): IdeficsDecoupledEmbedding(
      num_embeddings=32000, num_additional_embeddings=2, embedding_dim=4096, partially_freeze=False
      (additional_embedding): Embedding(2, 4096)
    )
    (vision_model): IdeficsVisionTransformer(
      (embeddings): IdeficsVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(257, 1280)
      )
      (pre_layrnorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder): IdeficsVisionEncoder(
        (layers): ModuleList(
          (0-31): 32 x IdeficsVisionEncoderLayer(
            (self_attn): IdeficsVisionAttention(
              (k_proj): Linear(in_features=1280, out_features=1280, bias=True)
              (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
              (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
              (out_p

In [28]:
model_name = checkpoint.split("/")[1]
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, config)

In [29]:
model.print_trainable_parameters()

trainable params: 63,422,496 || all params: 8,993,102,128 || trainable%: 0.7052
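
As a rough guide to where the trainable parameters come from, standard LoRA adds two small matrices to each targeted linear layer: `A` with shape `(r, in_features)` and `B` with shape `(out_features, r)`. The sketch below applies that formula to one of the 1280 x 1280 attention projections shown in the model printout and re-derives the reported percentage. It assumes the plain LoRA formulation with scaling `lora_alpha / r`; the function name is hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

r, lora_alpha = 16, 16

# One 1280 -> 1280 projection from the vision encoder.
per_layer = lora_param_count(1280, 1280, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

# Figures reported by print_trainable_parameters() above.
trainable, total = 63_422_496, 8_993_102_128
trainable_pct = 100 * trainable / total

print(f"adapter params for one 1280x1280 layer: {per_layer}")
print(f"scaling: {scaling}")
print(f"trainable%: {trainable_pct:.4f}")
```

With `r` equal to `lora_alpha`, the low-rank update is added without rescaling; increasing `r` grows the adapter size linearly.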


# Training
Finally, using the Hugging Face Trainer, we can finetune the model!

For the sake of the demo, we have set `max_steps` to 30. That's about 0.05 epoch on this dataset, so feel free to tune further!

It has been reported that fine-tuning in mixed precision fp16 can lead to overflows. As such, we recommend training in mixed precision bf16 when possible. The cell below sets `fp16=True`; on hardware that supports bf16, replace it with `bf16=True`.
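
The epoch fraction can be checked by hand from the training arguments used in this notebook (`per_device_train_batch_size=8`, `gradient_accumulation_steps=4`, `max_steps=30`) and the 19,816 training examples reported when the dataset was prepared. The sketch assumes a single GPU; with more devices the effective batch size scales accordingly.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 30
num_train_examples = 19_816
num_devices = 1  # assumption: single-GPU run

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_train_examples

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}")
print(f"fraction of an epoch: {epochs:.4f}")
```

To train for a full epoch under the same settings, `max_steps` would need to be roughly `num_train_examples / effective_batch`, or you can drop `max_steps` and set `num_train_epochs` instead.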

In [30]:
training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=2e-4,
    fp16=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    dataloader_pin_memory=False,
    eval_strategy="steps",
    eval_steps=3,
    logging_steps=1,
    max_steps=30,
    remove_unused_columns=False,
    push_to_hub=False,
    label_names=["labels"],
    report_to="wandb",
    run_name="idefics2_9B_mmsd2_01",
    optim="paged_adamw_8bit",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)

trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


max_steps is given, it will override any value given in num_train_epochs
wandb: Currently logged in as: sinngamkhaidem (sinngamkhaidem-nit-manipur) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


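| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 3 | 2.8095 | 2.414836 |
| 6 | 1.8692 | 1.448733 |
| 9 | 1.0617 | 1.072 |
| 12 | 1.0836 | 1.008807 |
| 15 | 0.9922 | 0.979103 |
| 18 | 1.1659 | 0.961309 |
| 21 | 1.092 | 0.949622 |
| 24 | 1.0967 | 0.939099 |
| 27 | 0.9016 | 0.932325 |
| 30 | 1.0273 | 0.92982 |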


TrainOutput(global_step=30, training_loss=1.3823440750439961, metrics={'train_runtime': 514.6601, 'train_samples_per_second': 1.865, 'train_steps_per_second': 0.058, 'total_flos': 4075811620609536.0, 'train_loss': 1.3823440750439961, 'epoch': 0.04844570044408559})

In [31]:
model.save_pretrained("idefics2_9B_mmsd2_outputs")