## Introduction to IDEFICS

IDEFICS stands for **Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentions**. This model draws inspiration from Deepmind's Flamingo and is specifically designed to seamlessly integrate and interpret both text and image inputs. The versatility of IDEFICS makes it a powerful tool for a variety of tasks, including visual question answering, image captioning, and functioning as a standalone language model.

1. **Understanding Multimodal LLMs**:
   - Discover the basics of multimodal large language models (LLMs) and their wide-ranging capabilities.

2. **Fine-Tuning the IDEFICS 9B Model**:
   - Get step-by-step instructions on how to fine-tune the IDEFICS 9B model to optimize it for specific tasks.

3. **Utilizing Google Colab for Fine-Tuning**:
   - Learn how to leverage Google Colab, a free cloud service, to perform fine-tuning processes efficiently.

4. **Fine-Tuning on a Specific Dataset**:
   - Explore fine-tuning using the "Pokemon Go Cards" dataset from Hugging Face, tailored to enhance the model's performance on this specific dataset.



- This command installs the datasets library from Hugging Face. The -q option stands for "quiet," which means that the command will minimize the amount of output it produces during the installation process.

- **!pip install -q git+https://github.com/huggingface/transformers**
This command installs the latest version of the transformers library directly from the Hugging Face GitHub repository.

- **!pip install -q bitsandbytes sentencepiece accelerate loralib**
This command installs several libraries at once:

* bitsandbytes: A library for 8-bit optimizers and quantization, useful for efficient training and inference of neural networks.
* sentencepiece: A text tokenizer and detokenizer mainly used for processing text in NLP tasks. It is a popular library for subword tokenization.
* accelerate: A library from Hugging Face that helps in accelerating the training of machine learning models by providing simple interfaces for distributed training.
* loralib: A library that supports Low-Rank Adaptation (LoRA) for fine-tuning large-scale pre-trained language models efficiently.

- **!pip install -q -U git+https://github.com/huggingface/peft.git**
This command installs or updates (-U for update) the peft library directly from its GitHub repository.

In [1]:
!pip install -q datasets
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q bitsandbytes sentencepiece accelerate loralib
!pip install -q -U git+https://github.com/huggingface/peft.git

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m35.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━

In [2]:
!pip install --upgrade huggingface_hub
!huggingface-cli login

Collecting huggingface_hub
  Downloading huggingface_hub-0.23.2-py3-none-any.whl (401 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m401.7/401.7 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.23.1
    Uninstalling huggingface-hub-0.23.1:
      Successfully uninstalled huggingface-hub-0.23.1
Successfully installed huggingface_hub-0.23.2



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [3]:
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from PIL import Image
from transformers import IdeficsForVisionText2Text, AutoProcessor, Trainer, TrainingArguments, BitsAndBytesConfig

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [5]:
device

'cuda'

In [6]:
checkpoint = "HuggingFaceM4/idefics-9b"

This creates a configuration object for the BitsAndBytesConfig class with several specific settings. The BitsAndBytesConfig class is part of the bitsandbytes library and is used to configure quantization settings for efficient computation.

Let's go through each parameter:

- load_in_4bit=True: This parameter indicates that the model should be loaded with 4-bit precision. This reduces the model's memory footprint and can speed up computation at the cost of some precision.

- bnb_4bit_use_double_quant=True: This enables double quantization for 4-bit precision. Double quantization is a technique to further reduce the model size and improve performance by quantizing the quantized values again.

- bnb_4bit_quant_type="nf4": This specifies the type of 4-bit quantization to use. "nf4" is a specific quantization method that is efficient and effective for neural network computations.

- bnb_4bit_compute_dtype=torch.float16: This sets the computation data type to 16-bit floating point (torch.float16). This is a common choice for mixed-precision training because it strikes a good balance between performance and precision.

- llm_int8_skip_modules=["lm_head", "embed_tokens"]: This parameter specifies which modules to skip during 8-bit quantization. In this case, the lm_head and embed_tokens modules will not be quantized to 8-bit, possibly because these parts of the model are particularly sensitive to precision and could degrade the model's performance if quantized too aggressively.

In [7]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["lm_head", "embed_tokens"]
)

In [8]:
processor = AutoProcessor.from_pretrained(checkpoint)

preprocessor_config.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.36k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/61.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, quantization_config=bnb_config, device_map="auto")


config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/99.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/19 [00:00<?, ?it/s]

model-00001-of-00019.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

model-00002-of-00019.safetensors:   0%|          | 0.00/1.82G [00:00<?, ?B/s]

model-00003-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00005-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00006-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00007-of-00019.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00008-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00009-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00010-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00011-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00012-of-00019.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00013-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00014-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00015-of-00019.safetensors:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

model-00016-of-00019.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00017-of-00019.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00018-of-00019.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00019-of-00019.safetensors:   0%|          | 0.00/705M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

### Function Definition:

The function `do_inference` takes four parameters:
- `model`: The pre-trained model used for inference.
- `processor`: The processor used to tokenize the input and decode the output.
- `prompts`: The input prompts to generate text from.
- `max_new_tokens`: The maximum number of new tokens to generate (default is 50).

### Tokenizer and Bad Words:

- The tokenizer is extracted from the processor.
- A list of `bad_words` is defined, which includes words or tokens that should be avoided in the generated output.
- If there are any bad words, their token IDs are obtained using the tokenizer.

### End of Sequence (EOS) Token:

- The end-of-sequence token is defined as `"</s>"`.
- The token ID for the EOS token is obtained using the tokenizer.

### Prepare Inputs:

- The prompts are processed into tensors using the processor and moved to the appropriate device (CPU/GPU).

### Generate Tokens:

- The model generates new tokens based on the processed inputs.
- Several parameters are passed to the `generate` method:
  - `eos_token_id`: The EOS token ID to signal the end of the sequence.
  - `bad_words_ids`: The token IDs of the bad words to avoid.
  - `max_new_tokens`: The maximum number of new tokens to generate.
  - `early_stopping`: If set to `True`, the generation stops early when the EOS token is generated.

### Decode and Print:

- The generated token IDs are decoded back into text using the processor.
- Special tokens are skipped during decoding.
- The first element of the decoded text batch is printed.


In [37]:
# Inference
def do_inference(model, processor, prompts, max_new_tokens=50):
  tokenizer = processor.tokenizer
  bad_words = ["<image", "fake_token_around_image>"]
  if len(bad_words) > 0:
    bad_words_ids = tokenizer(bad_words, add_special_tokens = False).input_ids
  eos_token = "</s>"
  eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)

  inputs = processor(prompts, return_tensors='pt').to(device)
  generate_ids = model.generate(
      **inputs,
      eos_token_id = [eos_token_id],
      bad_words_ids = bad_words_ids,
      max_new_tokens = max_new_tokens,
      early_stopping = True
  )

  generated_text = processor.batch_decode(generate_ids,
                                          skip_special_tokens=True)[0]
  print(generated_text)

In [11]:
import torchvision.transforms as transforms

In [12]:
url = "https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg"
prompts = [
    url,
    "Question: What's on the picture? Answer:",
]

## Preprocessing Functions

### `convert_to_rgb` Function

This function ensures that an image is in RGB mode. If the image is not already in RGB mode, it converts it.

Steps:
- If the image is already in RGB mode, it returns the image.
- If the image is not in RGB mode, it converts it to RGBA.
- It creates a new background image with a white color (255, 255, 255) and the same size as the original image.
- It combines the background with the original image to handle transparency.
- It converts the combined image back to RGB mode and returns it.

### `ds_transforms` Function
This function prepares batches of examples for model input, including image transformations and prompt formatting.

- Extracts the image size, mean, and standard deviation from the processor's image processor.
- Defines a series of image transformations:
Converts images to RGB.
Randomly resizes and crops images.
Converts images to tensors.
Normalizes images using the specified mean and standard deviation.
- Creates prompts for each example in the batch:
Splits the caption at the first period and uses the first sentence.
Formats a prompt asking "What's on the picture?" and provides a partial answer.
- Processes the prompts and applies the image transformations using the processor, returning the tensors.
- Sets the labels field to the input IDs (for supervised learning).

In [13]:
##preprocessing
def convert_to_rgb(image):
  if image.mode == "RGB":
    return image

  image_rgba = image.convert("RGBA")
  background = Image.new("RGBA", image_rgba.size, (255,255,255))
  alpha_composite = Image.alpha_composite(background, image_rgba)
  alpha_composite = alpha_composite.convert("RGB")
  return alpha_composite

def ds_transforms(example_batch):
  image_size = processor.image_processor.image_size
  image_mean = processor.image_processor.image_mean
  image_std = processor.image_processor.image_std

  image_transform = transforms.Compose([
      convert_to_rgb,
      transforms.RandomResizedCrop((image_size, image_size), scale=(0.9, 1.0), interpolation=transforms.InterpolationMode.BICUBIC),
      transforms.ToTensor(),
      transforms.Normalize(mean=image_mean, std=image_std)
  ])

  prompts = []
  for i in range(len(example_batch['caption'])):
    caption = example_batch['caption'][i].split(".")[0]
    prompts.append(
        [
            example_batch['image_url'][i],
            f"Question: What's on the picture? Answer: This is {example_batch['name']}. {caption}",
        ],
    )
  inputs = processor(prompts, transform=image_transform, return_tensors="pt").to(device)
  inputs["labels"] = inputs["input_ids"]
  return inputs

In [21]:
#Load and prepare the data
ds = load_dataset("TheFusion21/PokemonCards")
ds = ds["train"].train_test_split(test_size=0.002)
train_ds = ds["train"]
eval_ds = ds["test"]
train_ds.set_transform(ds_transforms)
eval_ds.set_transform(ds_transforms)

In [22]:
model_name = checkpoint.split("/")[1]
config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj"],
    lora_dropout = 0.05,
    bias="none"
)

In [23]:
model = get_peft_model(model, config)

In [24]:
model.print_trainable_parameters()

trainable params: 19,750,912 || all params: 8,949,430,544 || trainable%: 0.2207


In [25]:
training_args = TrainingArguments(
    output_dir = f"{model_name}-PokemonCards",
    learning_rate = 2e-4,
    fp16 = True,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 2,
    gradient_accumulation_steps = 8,
    dataloader_pin_memory = False,
    save_total_limit = 3,
    evaluation_strategy ="steps",
    save_strategy = "steps",
    eval_steps = 10,
    save_steps = 25,
    max_steps = 25,
    logging_steps = 5,
    remove_unused_columns = False,
    push_to_hub=False,
    label_names = ["labels"],
    load_best_model_at_end = False,
    report_to = "none",
    optim = "paged_adamw_8bit",
)



In [26]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_ds,
    eval_dataset = eval_ds
)

max_steps is given, it will override any value given in num_train_epochs


In [27]:
trainer.train()

Step,Training Loss,Validation Loss
10,1.5898,1.164266
20,0.8903,0.880843


  return F.conv2d(input, weight, bias, self.stride,


TrainOutput(global_step=25, training_loss=1.4211182975769043, metrics={'train_runtime': 352.0542, 'train_samples_per_second': 1.136, 'train_steps_per_second': 0.071, 'total_flos': 1878661821522816.0, 'train_loss': 1.4211182975769043, 'epoch': 0.03050640634533252})

In [28]:
url = "https://images.pokemontcg.io/pop6/2_hires.png"

In [29]:
prompts = [
    url,
    "Question: What's on the picture? Answer:",
]

In [33]:
processor

IdeficsProcessor:
- image_processor: IdeficsImageProcessor {
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_num_channels": 3,
  "image_processor_type": "IdeficsImageProcessor",
  "image_size": 224,
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "IdeficsProcessor"
}

- tokenizer: LlamaTokenizerFast(name_or_path='HuggingFaceM4/idefics-9b', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<fake_token_around_image>', '<image>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=F

In [38]:
do_inference(model, processor, prompts, max_new_tokens=100)



Question: What's on the picture? Answer: This is ['Lucario-GX', 'Lucario']. A Basic Pokemon Card of type Fire with the title Lucario-GX and 90 HP of rarity Rare Holo evolved from Lucario from the set EX Legends Awakened and the flavor text: It's a Pokemon that can use its tail as a weapon. It's a Pokemon that can use its tail as a weapon. It's a Pok


In [None]:
model.push_to_hub(f"{model_name}-PokemonCards", private=False)