# CLIP Fine-Tuning: Partial Freezing and LoRA

This script demonstrates:

*   CLIP architecture (text encoder, vision encoder, projection layers).
*   Freezing all parameters.
*   Unfreezing the text projection layer.
*   Unfreezing the last N transformer layers of the text encoder.
*   Applying Parameter-Efficient Fine-Tuning (PEFT): LoRA.

In [1]:
# PEFT: Parameter-Efficient Fine-Tuning
!pip install transformers peft -q

In [2]:
import torch
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model
import pandas as pd

In [3]:
# Function to count trainable and frozen parameters
def count_parameters(model):
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    total = trainable + frozen
    print(f"Trainable params: {trainable:,}")
    print(f"Frozen params:    {frozen:,}")
    print(f"Total params:     {total:,}")
    print(f"Trainable ratio:  {100*trainable/total:.2f}%")

# 1. CLIP Architecture Overview

### Load the model

In [4]:
model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)


### High-Level Modules

In [5]:
for name, module in model.named_children():
    print(f"{name}: {module.__class__.__name__}")

text_model: CLIPTextTransformer
vision_model: CLIPVisionTransformer
visual_projection: Linear
text_projection: Linear


### Text and Image Models

In [6]:
print('Text Transformer:')
print(model.text_model)
print('\n')
print('Image Transformer:')
print(model.vision_model)

Text Transformer:
CLIPTextTransformer(
  (embeddings): CLIPTextEmbeddings(
    (token_embedding): Embedding(49408, 512)
    (position_embedding): Embedding(77, 512)
  )
  (encoder): CLIPEncoder(
    (layers): ModuleList(
      (0-11): 12 x CLIPEncoderLayer(
        (self_attn): CLIPAttention(
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): CLIPMLP(
          (activation_fn): QuickGELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (fi

### Projection Layers

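The two encoders produce outputs of different sizes (512 for text, 768 for vision). The projection layers map both into a shared 512-dimensional space so that text and image embeddings can be compared directly.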

In [7]:
print(model.text_projection)    # text → joint space
print(model.visual_projection)  # vision → joint space

Linear(in_features=512, out_features=512, bias=False)
Linear(in_features=768, out_features=512, bias=False)
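
As a rough illustration of what these two layers do, the sketch below uses randomly initialised stand-in `Linear` layers with the same shapes as the ones printed above (not the pretrained weights) and random tensors in place of real encoder outputs. The batch size of 4 and the variable names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins with the same shapes as CLIP's projection layers (random weights, NOT pretrained)
text_projection = nn.Linear(512, 512, bias=False)
visual_projection = nn.Linear(768, 512, bias=False)

# Fake pooled encoder outputs for a batch of 4 text/image pairs
text_features = torch.randn(4, 512)   # text encoder hidden size
image_features = torch.randn(4, 768)  # vision encoder hidden size

# Project both modalities into the shared 512-dimensional space and L2-normalise
text_embeds = F.normalize(text_projection(text_features), dim=-1)
image_embeds = F.normalize(visual_projection(image_features), dim=-1)

# Cosine similarity between every text and every image in the batch
similarity = text_embeds @ image_embeds.T

print(text_embeds.shape, image_embeds.shape, similarity.shape)
```

Once both modalities share the same dimension, a single matrix product is enough to score every text against every image.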


# 2. Freezing / unfreezing CLIP

### Freeze all parameters

In [8]:
# freeze all
for param in model.parameters():
    param.requires_grad = False

print("All parameters frozen")
count_parameters(model)

All parameters frozen
Trainable params: 0
Frozen params:    151,277,313
Total params:     151,277,313
Trainable ratio:  0.00%


### Unfreeze text projection head only

In [9]:
# load model
model_proj = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# freeze all
for param in model_proj.parameters():
    param.requires_grad = False

# unfreeze the text projection head
for param in model_proj.text_projection.parameters():
    param.requires_grad = True

print("Text projection layer unfrozen")
count_parameters(model_proj)

Text projection layer unfrozen
Trainable params: 262,144
Frozen params:    151,015,169
Total params:     151,277,313
Trainable ratio:  0.17%


#### Check trainable parameters

In [11]:
for name, p in model_proj.named_parameters():
    if p.requires_grad:
        print("Trainable:", name)

Trainable: text_projection.weight
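
Once only part of a model is trainable, a common follow-up is to hand just those parameters to the optimizer. The sketch below shows the idea on a tiny stand-in model rather than CLIP itself. The `toy` module, its layer sizes, the dummy loss, and the learning rate are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model (hypothetical): a frozen "backbone" and a trainable "projection"
toy = nn.ModuleDict({
    "backbone": nn.Linear(16, 16),
    "projection": nn.Linear(16, 8, bias=False),
})

# Same pattern as above: freeze everything, then unfreeze the projection
for param in toy.parameters():
    param.requires_grad = False
for param in toy["projection"].parameters():
    param.requires_grad = True

# Hand only the trainable parameters to the optimizer
trainable_params = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

# Snapshot the weights, then take one dummy optimisation step
backbone_before = toy["backbone"].weight.clone()
projection_before = toy["projection"].weight.clone()

x = torch.randn(4, 16)
loss = toy["projection"](toy["backbone"](x)).pow(2).mean()
loss.backward()
optimizer.step()

print("backbone changed:  ", not torch.equal(backbone_before, toy["backbone"].weight))
print("projection changed:", not torch.equal(projection_before, toy["projection"].weight))
```

The same filter, `[p for p in model_proj.parameters() if p.requires_grad]`, should carry over to the CLIP model loaded above.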


### Unfreeze last *N* transformer layers

In [12]:
# load model
model_last2 = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# freeze all
for param in model_last2.parameters():
    param.requires_grad = False

# unfreeze last N layers in the text encoder
N = 2
layers = model_last2.text_model.encoder.layers

for layer in layers[-N:]:
    for param in layer.parameters():
        param.requires_grad = True

print(f"Last {N} transformer layer(s) unfrozen")
count_parameters(model_last2)

Last 2 transformer layer(s) unfrozen
Trainable params: 6,304,768
Frozen params:    144,972,545
Total params:     151,277,313
Trainable ratio:  4.17%
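
The trainable count above can be cross-checked by hand from the layer shapes printed in the architecture overview (hidden size 512, MLP size 2048). This is a small arithmetic sketch, and the variable names are chosen for this example only.

```python
# Dimensions read from the printed text encoder: hidden size 512, MLP size 2048
hidden, mlp = 512, 2048

attention = 4 * (hidden * hidden + hidden)                     # q, k, v, out projections (weight + bias)
layer_norms = 2 * (hidden + hidden)                            # two LayerNorms (weight + bias each)
feed_forward = (hidden * mlp + mlp) + (mlp * hidden + hidden)  # fc1 + fc2 (weight + bias)

per_layer = attention + layer_norms + feed_forward
N = 2

print(f"Per layer:      {per_layer:,}")
print(f"Last {N} layers: {N * per_layer:,}")
```

Each text encoder layer holds 3,152,384 parameters, so unfreezing two of them gives the 6,304,768 trainable parameters reported above.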


#### Check trainable parameters

In [13]:
for name, p in model_last2.named_parameters():
    if p.requires_grad:
        print("Trainable:", name)

Trainable: text_model.encoder.layers.10.self_attn.k_proj.weight
Trainable: text_model.encoder.layers.10.self_attn.k_proj.bias
Trainable: text_model.encoder.layers.10.self_attn.v_proj.weight
Trainable: text_model.encoder.layers.10.self_attn.v_proj.bias
Trainable: text_model.encoder.layers.10.self_attn.q_proj.weight
Trainable: text_model.encoder.layers.10.self_attn.q_proj.bias
Trainable: text_model.encoder.layers.10.self_attn.out_proj.weight
Trainable: text_model.encoder.layers.10.self_attn.out_proj.bias
Trainable: text_model.encoder.layers.10.layer_norm1.weight
Trainable: text_model.encoder.layers.10.layer_norm1.bias
Trainable: text_model.encoder.layers.10.mlp.fc1.weight
Trainable: text_model.encoder.layers.10.mlp.fc1.bias
Trainable: text_model.encoder.layers.10.mlp.fc2.weight
Trainable: text_model.encoder.layers.10.mlp.fc2.bias
Trainable: text_model.encoder.layers.10.layer_norm2.weight
Trainable: text_model.encoder.layers.10.layer_norm2.bias
Trainable: text_model.encoder.layers.11.self

# 3. Parameter-Efficient Fine-Tuning (PEFT)

Instead of updating all (or a large share of) the pretrained parameters, PEFT methods train only a small set of additional parameters while keeping the pretrained backbone frozen. Popular strategies include *Adapters*, which insert lightweight bottleneck layers inside Transformer blocks, and LoRA (Low-Rank Adaptation), which learns low-rank updates to attention weights. These approaches reduce memory and computation cost while still allowing effective adaptation.
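
To make the adapter idea concrete, here is a minimal sketch of a bottleneck adapter in plain PyTorch. The class name `BottleneckAdapter`, the bottleneck width of 64, the GELU activation, and the zero initialisation are assumptions for illustration. This is not the API of any particular adapter library, and the notebook itself does not use adapters.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Hypothetical adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity mapping
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 77, 512)  # (batch, tokens, hidden)
adapter_params = sum(p.numel() for p in adapter.parameters())

print("Identity at init:", torch.allclose(adapter(x), x))
print(f"Adapter params:   {adapter_params:,}")
```

Only the adapter's own weights would be trained. The surrounding Transformer block stays frozen.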

## 3.1 LoRA

#### Apply LoRA to last N layers

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating a full weight matrix, LoRA keeps it frozen and learns a low-rank update: the product of two small trainable matrices that is added to the frozen weight. In transformers, LoRA is usually applied to the attention projection matrices (most often the **query** *q_proj* and **value** *v_proj*), which transform hidden states into attention representations. Let's inject LoRA into the last 2 layers of the text encoder while keeping the rest of the model frozen!
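
The mechanism can be sketched in a few lines of PyTorch. The `LoRALinear` wrapper below is a hypothetical, simplified illustration and not the implementation used by the `peft` library. It omits dropout, and the initialisation is an assumption (a random `lora_A` and a zero `lora_B`, so the update starts at zero).

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical wrapper: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path W x plus scaled low-rank path (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

q_proj = nn.Linear(512, 512)  # same shape as a text-encoder q_proj
lora_q = LoRALinear(q_proj)
x = torch.randn(2, 77, 512)

trainable = sum(p.numel() for p in lora_q.parameters() if p.requires_grad)
print("Same output at init:", torch.allclose(lora_q(x), q_proj(x)))
print(f"Trainable params:    {trainable:,}")
```

Because `lora_B` starts at zero, the wrapped layer initially behaves exactly like the frozen original. Training then moves only the two small matrices.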

In [14]:
# load model
model_lora = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# freeze all
for param in model_lora.parameters():
    param.requires_grad = False

# list module names to tell LoRA where to insert adapters
target_modules = [
    "text_model.encoder.layers.10.self_attn.q_proj",
    "text_model.encoder.layers.10.self_attn.v_proj",
    "text_model.encoder.layers.11.self_attn.q_proj",
    "text_model.encoder.layers.11.self_attn.v_proj",
]

# Check target modules
print("\n".join(target_modules))

text_model.encoder.layers.10.self_attn.q_proj
text_model.encoder.layers.10.self_attn.v_proj
text_model.encoder.layers.11.self_attn.q_proj
text_model.encoder.layers.11.self_attn.v_proj
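
The list above is written out by hand for N = 2. A small helper can build the same names for any N. The function name `lora_targets` and its default arguments are hypothetical, chosen to match the 12-layer text encoder printed earlier.

```python
def lora_targets(num_layers: int = 12, last_n: int = 2, projections=("q_proj", "v_proj")):
    """Build module names for the last `last_n` text-encoder layers (hypothetical helper)."""
    start = num_layers - last_n
    return [
        f"text_model.encoder.layers.{i}.self_attn.{proj}"
        for i in range(start, num_layers)
        for proj in projections
    ]

generated_targets = lora_targets()
print("\n".join(generated_targets))
```

With the loaded model, `num_layers` could be taken from `len(model_lora.text_model.encoder.layers)` instead of being hard-coded.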


#### LoRA Configuration and Integration

Set *task_type* to FEATURE_EXTRACTION, since CLIP is used here to produce embeddings. Other task types include SEQ_CLS for sequence classification, QUESTION_ANS for question answering, and others.

In [15]:
lora_config = LoraConfig(
    r=8, # rank of the LoRA update matrices (controls adapter capacity)
    lora_alpha=16, # scaling factor: updates are scaled by lora_alpha / r = 16 / 8 = 2
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="FEATURE_EXTRACTION",
)

In [17]:
# inject LoRA
model_lora = get_peft_model(model_lora, lora_config)

print("LoRA applied")
count_parameters(model_lora)

LoRA applied
Trainable params: 32,768
Frozen params:    151,277,313
Total params:     151,310,081
Trainable ratio:  0.02%
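
The 32,768 trainable parameters follow directly from the configuration. Each targeted 512 × 512 projection gains one `r × 512` matrix and one `512 × r` matrix. The arithmetic sketch below reproduces the numbers, with variable names chosen for this example only.

```python
r = 8
in_features = out_features = 512  # q_proj / v_proj shape in the text encoder
num_target_modules = 4            # q_proj and v_proj in layers 10 and 11

per_module = r * in_features + out_features * r  # lora_A + lora_B
lora_params = num_target_modules * per_module

base_params = 151_277_313         # frozen CLIP parameters reported above
total = base_params + lora_params

print(f"LoRA params per module: {per_module:,}")
print(f"LoRA params total:      {lora_params:,}")
print(f"Total params:           {total:,}")
print(f"Trainable ratio:        {100 * lora_params / total:.2f}%")
```

Doubling `r` doubles the adapter size, which is still tiny next to the 6.3M parameters of two fully unfrozen layers.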




#### Check trainable parameters

In [18]:
for name, p in model_lora.named_parameters():
    if p.requires_grad:
        print("Trainable:", name)

Trainable: base_model.model.base_model.model.text_model.encoder.layers.10.self_attn.v_proj.lora_A.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.10.self_attn.v_proj.lora_B.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.10.self_attn.q_proj.lora_A.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.10.self_attn.q_proj.lora_B.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.11.self_attn.v_proj.lora_A.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.11.self_attn.v_proj.lora_B.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.11.self_attn.q_proj.lora_A.default.weight
Trainable: base_model.model.base_model.model.text_model.encoder.layers.11.self_attn.q_proj.lora_B.default.weight


#### Compare trainable parameter counts across full freezing, partial freezing, and LoRA

In [19]:
def get_param_stats(model):
    trainable, frozen = 0, 0
    for _, param in model.named_parameters():
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    total = trainable + frozen
    ratio = 100 * trainable / total
    return trainable, frozen, total, ratio

# Build comparison table
results = {
    "Frozen": get_param_stats(model),
    "Projection unfrozen": get_param_stats(model_proj),
    "Last 2 layers unfrozen": get_param_stats(model_last2),
    "LoRA applied": get_param_stats(model_lora),
}

df = pd.DataFrame(results, index=["Trainable", "Frozen", "Total", "Ratio (%)"]).T

df[["Trainable", "Frozen", "Total"]] = df[["Trainable", "Frozen", "Total"]].astype(int)
df["Ratio (%)"] = df["Ratio (%)"].round(2)

print("\nComparison Table:")
print(df)


Comparison Table:
                        Trainable     Frozen      Total  Ratio (%)
Frozen                          0  151277313  151277313       0.00
Projection unfrozen        262144  151015169  151277313       0.17
Last 2 layers unfrozen    6304768  144972545  151277313       4.17
LoRA applied                32768  151277313  151310081       0.02


At this point, we can flexibly freeze or unfreeze CLIP's hidden layers and projection head, and plug in Adapters or LoRA for efficient fine-tuning. Now you are ready to experiment with these strategies in your EEG-to-text captioning setup!