# **Fine-Tuning in LLMs and Multimodal LLMs**


## **Definition**

- `Fine-tuning` is the process of adapting a pretrained large model (LLM or multimodal LLM) to a downstream task or domain by continuing training on task data, updating all of its parameters, a subset of them, or a small set of newly added ones.

- Instead of training from scratch (which is computationally prohibitive), `fine-tuning` leverages the `general knowledge already learned` during pretraining and aligns it to `task-specific distributions` (`text classification`, `question answering`, `image captioning`, `speech-text alignment`, etc.).



## **Intuition**

- Pretrained LLMs capture broad world knowledge but lack `task specialization`.

- Fine-tuning bridges this gap by teaching the model how to `“speak the dialect”` of the `downstream task`.

- In multimodal LLMs, fine-tuning aligns `cross-modal representations` (e.g., mapping image embeddings to the same semantic space as text embeddings).

## **Mathematical Representation**

Let:

- $x$ = input data (text, image, audio, etc.), with target $y$

- $f_{\theta}$ = pretrained LLM with parameters $\theta$

- $\mathcal{L}$ = task-specific loss (e.g., cross-entropy for classification)


Fine-tuning optimizes:

$$\theta^{*} = \arg \min_{\theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[ \mathcal{L}(f_{\theta}(x), y) \big]$$


Where $\mathcal{D}$ is the downstream dataset.

In practice, training may:

- Update all of $\theta$ (full fine-tuning).

- Update only a subset of $\theta$, or freeze $\theta$ and train a small set of added parameters (parameter-efficient fine-tuning, e.g., LoRA, adapters, prefix-tuning).
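
The difference between a full and a partial update can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions: `backbone` and `head` are hypothetical stand-ins for a pretrained model and a task layer, and random tensors take the place of the downstream dataset $\mathcal{D}$. It is not tied to any particular library's fine-tuning API.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: `backbone` plays the pretrained model, `head` the task layer.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
head = nn.Linear(32, 2)
model = nn.Sequential(backbone, head)

# Random toy data standing in for the downstream dataset D.
x = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))


def count_trainable(module: nn.Module) -> int:
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


full_count = count_trainable(model)  # full fine-tuning: every parameter is trainable

# Partial update: freeze the backbone so only the head is optimized.
for p in backbone.parameters():
    p.requires_grad = False
partial_count = count_trainable(model)

frozen_before = backbone[0].weight.clone()
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()  # the task loss L

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # empirical estimate of E[L(f_theta(x), y)]
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"trainable params: full={full_count}, head-only={partial_count}")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Freezing is done by setting `requires_grad = False`, and only the remaining parameters are passed to the optimizer, so the frozen weights are guaranteed to stay at their pretrained values.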

## **Types of LLM Fine-Tuning**

| Method | Definition | Key Properties | Use Cases |
| :--- | :--- | :--- | :--- |
| **Full Fine-Tuning** | Update **all** of the model's weights using a downstream dataset. | High accuracy but **very computationally costly** (high GPU/TPU memory and time). | Small models, highly specialized domains where performance is critical. |
| **Feature Extraction** | Freeze the entire LLM and use its hidden representations (embeddings) as input to an external classifier or model. | Fast and cheap, but less flexible because the LLM itself is not adapted. | Embedding extraction, retrieval (e.g., the retriever in retrieval-augmented generation, RAG), simple classification tasks. |
| **Adapters** | Insert small, trainable bottleneck layers inside the otherwise frozen pre-trained Transformer blocks. | Parameter-efficient and modular (adapters can be stacked or swapped per task). | Domain adaptation, multi-task learning. |
| **LoRA (Low-Rank Adaptation)** | Freeze the pretrained weights and learn a low-rank update $\Delta W = BA$ for selected weight matrices (typically the attention projections, optionally other linear layers). | Highly **memory-efficient**; performance is often close to full fine-tuning; widely used. | A common default for fine-tuning large LLMs on small-to-medium scale tasks. |
| **Prefix/Prompt-Tuning** | Keep the model frozen and optimize a small set of continuous, task-specific vectors: a "soft prompt" prepended to the input embeddings (prompt-tuning) or to the keys and values of each attention layer (prefix-tuning). | **Extremely parameter-efficient**; only a tiny fraction of parameters is trained. | Few-shot learning, quick domain adaptation without changing the model weights. |
| **Instruction-Tuning** | Fine-tune the LLM on a dataset of tasks formatted as **natural language instructions** and desired responses. | Primarily improves instruction following, **generalization to unseen tasks**, and response formatting rather than adding new knowledge. | Training chatbots such as ChatGPT and any model that needs to follow user intent. |
| **RLHF (Reinforcement Learning from Human Feedback)** | Use reinforcement learning (e.g., PPO) to fine-tune the model against a reward model trained on **human preferences**. | **Aligns** model outputs with human values and intentions (helpfulness, honesty, harmlessness). | Creating advanced AI assistants like ChatGPT, Claude, and Bard (now Gemini). |
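
To make the LoRA row concrete, here is a minimal from-scratch sketch of a low-rank adapted linear layer in PyTorch. `LoRALinear` and `merged_weight` are hypothetical names chosen for illustration; this is not how the `peft` library implements LoRA internally, only the underlying idea: the pretrained weight stays frozen and a scaled product of two small matrices is added to its output.

```python
import math

import torch
from torch import nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable update (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.scaling = alpha / r
        # A starts random and B starts at zero, so the update is zero at initialization.
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + self.scaling * update

    def merged_weight(self) -> torch.Tensor:
        """Dense weight equivalent to base weight plus the low-rank update."""
        return self.base.weight + self.scaling * (self.lora_B @ self.lora_A)


torch.manual_seed(0)
base = nn.Linear(768, 768)
layer = LoRALinear(base, r=8, alpha=32)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

Because the update is a plain matrix product, it can be folded back into the base weight after training (as `merged_weight` illustrates), which is one reason LoRA adds no inference latency once merged.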

## **Fine-Tuning in Multimodal LLMs**

`Multimodal LLM` fine-tuning extends adaptation across `modalities` (text + vision, text + audio, etc.), ensuring representations from different domains align in a shared semantic space.


**Key Approaches:**

- `Cross-modal pretraining + fine-tuning:` Align image and text embeddings with a contrastive objective (CLIP-style), then fine-tune on the downstream task.

- `Adapter Fusion:` Train adapters for each modality, then fuse for multimodal reasoning.

- `LoRA in multimodal layers:` Apply LoRA to vision encoders, audio encoders, and text decoders.

- `Instruction-Tuning Multimodal LLMs:` Fine-tune on multimodal prompts paired with target responses (e.g., an image plus “Describe this image…”) so the model follows instructions across modalities.
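
The CLIP-style alignment mentioned in the first approach is usually trained with a symmetric contrastive loss. The sketch below is a minimal version under stated assumptions: `clip_contrastive_loss` is a hypothetical helper name, the fixed `temperature` is a simplification (CLIP learns this scale during training), and random tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(
    image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07
) -> torch.Tensor:
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of `image_emb` and row i of `text_emb` are assumed to be a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # each image should pick its own caption
    loss_t2i = F.cross_entropy(logits.T, targets)  # each caption should pick its own image
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
text = torch.randn(8, 64)
aligned_images = text + 0.05 * torch.randn(8, 64)  # nearly matching pairs
random_images = torch.randn(8, 64)                 # unrelated pairs

aligned_loss = clip_contrastive_loss(aligned_images, text)
random_loss = clip_contrastive_loss(random_images, text)
print("aligned:", aligned_loss.item())
print("random: ", random_loss.item())
```

Well-aligned pairs should yield a much lower loss than unrelated ones, which is the signal that pulls matching image and text embeddings together in the shared space.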


**Mathematical Sketch**

For multimodal input $(x_{text}, x_{vision})$:

$$h_{text} = f_{\theta_{T}}(x_{text}), \quad h_{vision} = g_{\theta_{V}}(x_{vision})$$

$$z = \phi(h_{text}, h_{vision})$$


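Fine-tuning optimizes the task loss over the fused representation $z$:

$$\theta^{*} = \arg \min_{\theta} \; \mathbb{E}_{(x_{text}, x_{vision}, y) \sim \mathcal{D}} \big[ \mathcal{L}(z, y) \big], \quad \theta = (\theta_{T}, \theta_{V}, \theta_{\phi})$$

Where $\phi$ is a fusion mechanism (cross-attention, concatenation, projection) with its own parameters $\theta_{\phi}$, and any of the three parameter groups may be kept frozen.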
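
The sketch above maps onto a small PyTorch module. This is an illustrative toy only: `ConcatFusionClassifier` is a hypothetical name, single linear layers stand in for the text and vision encoders, concatenation followed by a projection plays the role of $\phi$, and random features replace real inputs.

```python
import torch
from torch import nn


class ConcatFusionClassifier(nn.Module):
    """Toy multimodal model: two encoders, concatenation fusion, linear task head."""

    def __init__(self, text_dim: int, vision_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.text_encoder = nn.Linear(text_dim, hidden)      # stands in for the text encoder f
        self.vision_encoder = nn.Linear(vision_dim, hidden)  # stands in for the vision encoder g
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())  # phi
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x_text: torch.Tensor, x_vision: torch.Tensor) -> torch.Tensor:
        h_text = self.text_encoder(x_text)
        h_vision = self.vision_encoder(x_vision)
        z = self.fusion(torch.cat([h_text, h_vision], dim=-1))  # fused representation
        return self.head(z)


torch.manual_seed(0)
model = ConcatFusionClassifier(text_dim=32, vision_dim=48, hidden=16, num_classes=3)

# Random features standing in for real text and image inputs.
x_text, x_vision = torch.randn(20, 32), torch.randn(20, 48)
y = torch.randint(0, 3, (20,))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x_text, x_vision), y)  # task loss on the fused representation
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In a realistic setup the encoders would be large pretrained models, and a common choice is to freeze one or both of them and train only the fusion module and head.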


## **Use Cases**

**LLMs:**

- Sentiment analysis, classification, summarization.

- Chatbots and domain-specific assistants.

- Code generation (fine-tuning LLMs on code corpora).


**Multimodal LLMs:**

- Image captioning (COCO, Flickr30k datasets)

- Visual question answering (VQA)

- Speech-to-text alignment

- Multimodal search & retrieval

- Robotics (aligning vision + language for action planning)


## **Code Examples**

- **Full Fine-Tuning (Hugging Face LLM)**

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy dataset (two examples, for illustration only)
texts = ["I love this!", "I hate it."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")


class ToyDataset(torch.utils.data.Dataset):
    """Yields dict examples; Trainer cannot consume the tuples a TensorDataset returns."""

    def __len__(self):
        return len(labels)

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item


dataset = ToyDataset()

# Training setup
training_args = TrainingArguments(output_dir="./results", num_train_epochs=2, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)

# trainer.train()  # uncomment to run training
```

Output:

```text
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```


- **LoRA Fine-Tuning**

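```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                # rank of the low-rank update
    lora_alpha=32,      # scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Only about 0.3M of the ~125M parameters are trainable.
# print_trainable_parameters() prints its report and returns None,
# so it should not be wrapped in print().
model.print_trainable_parameters()
```

Output:

```text
trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
```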


- **Multimodal Fine-Tuning (CLIP-style)**

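```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.png")  # path to a local image file
text = ["a photo of a dog", "a photo of a cat"]

inputs = processor(text=text, images=image, return_tensors="pt", padding=True)

# Forward pass only: this scores the image against each caption with the
# pretrained weights and updates nothing. Fine-tuning would backpropagate a
# contrastive loss computed from these logits.
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, normalized over the candidate captions
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
```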

## **Summary**

- Fine-tuning in `LLMs` specializes pretrained models for downstream NLP tasks.

- Fine-tuning in `Multimodal LLMs` additionally requires aligning heterogeneous modalities in a shared embedding space.

- Techniques range from full `fine-tuning` (expensive) to `parameter-efficient methods` (LoRA, adapters, prefix-tuning), with multimodal extensions for cross-modal reasoning.