# Chapter 11: Multimodal LLMs and their Fine-tuning

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:450px">
    <img src="image/timeline-MMD.png" alt="" />
</div>

## Vision Language Models (VLMs)

Vision language models (VLMs) are multimodal models that learn from both image and text inputs. They are generative models that take image and text data as input and produce textual outputs. Some advanced vision language models can also understand spatial attributes within images.

### Architecture

Vision-language models integrate visual and textual information through three fundamental components:

+ Image Encoder: This component translates visual data (images) into a format that the model can process.
+ Text Encoder: Similar to the image encoder, this component converts textual data (words and sentences) into a format the model can understand.
+ Fusion Strategy: This component combines the information from both the image and text encoders, merging the two data types into a unified representation.

These elements work together, and the model's training objective (its loss function) is tailored to the architecture and learning strategy employed.
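As a rough illustration of the fusion step, the sketch below assumes one particular strategy: image features are linearly projected into the text embedding space and prepended to the text token embeddings. The function names `project` and `fuse`, the toy dimensions, and the projection matrix are all hypothetical; other fusion strategies (for example cross-attention) are equally possible.

```python
def project(features, matrix):
    """Linearly project one image feature vector into the text embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in matrix]


def fuse(image_features, text_embeddings, projection):
    """Prepend projected image features to the text token embeddings."""
    image_tokens = [project(patch, projection) for patch in image_features]
    return image_tokens + text_embeddings


# Toy values (assumptions): 2 image patches with 3 features, 2 text tokens of width 2.
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
text_embeddings = [[0.5, 0.5], [0.1, 0.9]]
projection = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # maps 3 dimensions to 2

fused = fuse(image_features, text_embeddings, projection)
print(len(fused), "tokens in the unified sequence")
```

The fused sequence can then be consumed by a single downstream model, which is what "unified representation" refers to in this sketch.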

### Contrastive Learning

Contrastive learning is a technique that learns by comparing data points. It computes a similarity score between instances and minimises a contrastive loss, which pulls matching pairs together and pushes non-matching pairs apart. This makes it particularly useful in semi-supervised learning, where a limited number of labelled samples guide the optimisation process to classify unseen data points.
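To make the idea of a similarity score and a contrastive loss concrete, here is a minimal sketch. It assumes cosine similarity and an InfoNCE-style loss, which is one common choice rather than the only one. The function names and the `temperature` value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])  # correct pairing
swapped = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])  # wrong pairing
print(aligned < swapped)
```

The loss is small when each anchor is most similar to its own partner and large when the pairing is wrong, which is the signal the optimiser follows.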

**How it works**

CLIP is a model that utilises contrastive learning to compute similarity between text and image embeddings through textual and visual encoders. It follows a three-step process for zero-shot predictions:

+ Pre-training: Trains a text encoder and an image encoder jointly on image-text pairs.
+ Caption Conversion: Converts dataset class labels into captions.
+ Zero-Shot Prediction: Selects the best caption for a given input image based on the learned similarities.
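The caption conversion and zero-shot prediction steps can be sketched as follows. This is not CLIP's actual code: the lookup table `TOY_TEXT_EMBEDDINGS` stands in for a trained text encoder, `toy_image_embedding` stands in for the output of a trained image encoder, and the caption template is an assumed example.

```python
import math


def normalise(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_predict(image_embedding, class_names, encode_text,
                      template="a photo of a {}"):
    """Return the caption whose embedding is most similar to the image embedding."""
    captions = [template.format(name) for name in class_names]  # caption conversion
    image = normalise(image_embedding)
    scores = []
    for caption in captions:
        text = normalise(encode_text(caption))
        scores.append(sum(a * b for a, b in zip(image, text)))
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores


# Hypothetical stand-in for a trained text encoder.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}


def toy_encode_text(caption):
    return TOY_TEXT_EMBEDDINGS[caption]


toy_image_embedding = [0.8, 0.2, 0.1]  # pretend output of the image encoder
caption, scores = zero_shot_predict(
    toy_image_embedding, ["cat", "dog", "car"], toy_encode_text
)
print(caption)
```

No classifier is trained for the new classes here. The prediction comes entirely from comparing embeddings, which is why the approach is called zero-shot.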

### Fine-tuning of multimodal models

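Parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA can be used to fine-tune multimodal models. LoRA trains small low-rank update matrices alongside the frozen pre-trained weights, and QLoRA applies the same idea on top of a quantised base model.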
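As a minimal sketch of the LoRA idea, the class below adds a low-rank update `B @ A` to a frozen weight matrix. It is written in plain Python for readability and is not the API of any library. The class name `LoRALinear`, the initialisation scale, and the `alpha / rank` scaling are assumptions that follow common practice.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, never updated
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


weight = [[float(i == j) for j in range(8)] for i in range(8)]  # toy 8x8 weight
layer = LoRALinear(weight, rank=2, alpha=4.0)
x = [0.5] * 8
print(layer.trainable_parameters(), "trainable vs", 8 * 8, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices would receive gradient updates during fine-tuning.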

LLM-Adapters integrate various adapter
 modules into the pre-trained model’s architecture, enabling parameter-efficient fine-tuning for diverse
 tasks by updating only the adapter parameters while keeping the base model parameters fixed. 

(IA)³, or Infused Adapter by Inhibiting and Amplifying Inner Activations, learns vectors that rescale the model's inner activations through element-wise multiplication, supporting robust few-shot performance and task mixing without manual adjustments.
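The rescaling used by (IA)³ can be illustrated with a frozen linear layer whose output is multiplied element-wise by a learned vector. This is a simplified sketch: the function names and the `learned_scaling` values are hypothetical, and the sketch does not show which activations are rescaled in a real transformer.

```python
def linear(weight, x):
    """Frozen linear layer: weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def ia3_forward(weight, x, scaling):
    """Rescale the frozen layer's activations element-wise with a learned vector."""
    return [l * h for l, h in zip(scaling, linear(weight, x))]


frozen_weight = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
x = [1.0, 1.0]

identity_scaling = [1.0, 1.0, 1.0]  # initial state: activations pass through unchanged
learned_scaling = [0.5, 2.0, 0.0]   # hypothetical values after training

unchanged = ia3_forward(frozen_weight, x, identity_scaling)
rescaled = ia3_forward(frozen_weight, x, learned_scaling)
print(unchanged, rescaled)
```

Entries below 1 inhibit an activation and entries above 1 amplify it. In this toy layer only 3 scaling values would be trained, against 6 frozen weights.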

Dynamic adaptation techniques like DyLoRA allow for the training of low-rank adaptation blocks across different ranks, optimising
 the learning process by sorting the representations during training.

LoRA-FA, a variant of LoRA, freezes the first low-rank matrix after initialisation and uses it as a random projection while training only the second, thereby roughly halving the number of trainable parameters without compromising performance.
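The following sketch shows a single LoRA-FA-style update under simplifying assumptions: a squared-error loss, plain SGD, and a toy 4x4 layer. The function name `lora_fa_step` and all numeric values are illustrative. Only `B` is updated, while `W` and the randomly initialised `A` stay fixed.

```python
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_fa_step(W, A, B, x, target, lr=0.1):
    """One SGD step on 0.5 * ||W x + B (A x) - target||^2 that updates B only."""
    z = matvec(A, x)  # random projection through the frozen matrix A
    y = [w + b for w, b in zip(matvec(W, x), matvec(B, z))]
    error = [yi - ti for yi, ti in zip(y, target)]
    loss = 0.5 * sum(e * e for e in error)
    # dL/dB[i][j] = error[i] * z[j]; W and A receive no update.
    new_B = [[B[i][j] - lr * error[i] * z[j] for j in range(len(z))]
             for i in range(len(B))]
    return new_B, loss


rng = random.Random(0)
d, rank = 4, 2
W = [[float(i == j) for j in range(d)] for i in range(d)]  # frozen base weight
A = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(rank)]  # frozen after init
B = [[0.0] * rank for _ in range(d)]  # the only trainable matrix
x = [0.5, -0.5, 0.5, -0.5]
target = [1.0, 0.0, 0.0, 1.0]

A_snapshot = [row[:] for row in A]
B, loss_before = lora_fa_step(W, A, B, x, target)
_, loss_after = lora_fa_step(W, A, B, x, target)
print(loss_before, loss_after)
```

In this toy case LoRA would train `rank * d + d * rank = 16` values, whereas LoRA-FA trains only the `d * rank = 8` entries of `B`. The saving is exactly half when the layer's input and output dimensions are equal.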

The Efficient Attention Skipping (EAS) module introduces a novel parameter- and computation-efficient tuning method for MLLMs, aiming to maintain high performance while reducing parameter and computation costs for downstream tasks.

MemVP integrates visual prompts
 with the weights of Feed Forward Networks, thereby injecting visual knowledge to decrease training time
 and inference latency, ultimately outperforming previous PEFT methods.

### Full-parameter Fine-Tuning

Methods such as LOMO and MeZO offer alternatives that make full-parameter fine-tuning more memory-efficient:

+ LOMO uses a low-memory optimisation technique derived from Stochastic Gradient Descent (SGD), avoiding the extra memory consumption typically associated with the Adam optimiser.
+ MeZO is a memory-efficient optimiser that needs only two forward passes to estimate gradients, enabling full fine-tuning of large models with a memory footprint equivalent to inference.
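The two-forward-pass idea behind MeZO can be sketched with a zeroth-order gradient estimate on a toy problem. This is a simplified illustration, not the reference implementation: the function `mezo_step`, the quadratic loss, and the hyperparameters are assumptions. The random direction is regenerated from a seed instead of being stored, which is what keeps memory close to inference level.

```python
import random


def mezo_step(params, loss_fn, lr=0.02, eps=1e-3, seed=0):
    """Update params in place using a gradient estimate from two forward passes."""

    def perturb(scale):
        rng = random.Random(seed)  # same seed, so the same direction z every time
        for i in range(len(params)):
            params[i] += scale * eps * rng.gauss(0.0, 1.0)

    perturb(+1)
    loss_plus = loss_fn(params)   # forward pass 1 at theta + eps * z
    perturb(-2)
    loss_minus = loss_fn(params)  # forward pass 2 at theta - eps * z
    perturb(+1)                   # restore the original parameters

    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    rng = random.Random(seed)     # regenerate z rather than keeping it in memory
    for i in range(len(params)):
        params[i] -= lr * projected_grad * rng.gauss(0.0, 1.0)
    return projected_grad


def quadratic_loss(params):
    target = [1.0, -2.0, 0.5]  # toy optimum
    return sum((p - t) ** 2 for p, t in zip(params, target))


params = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(params)
for step in range(200):
    mezo_step(params, quadratic_loss, seed=step)
final_loss = quadratic_loss(params)
print(initial_loss, final_loss)
```

No backward pass and no stored activations or optimiser state are needed, at the cost of a noisier gradient estimate than backpropagation would give.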