# Prefix tuning for conditional generation

https://github.com/XiangLi1999/PrefixTuning

Prefix tuning is an additive method where only a sequence of continuous task-specific vectors is attached to the beginning of the input, or prefix. Only the prefix parameters are optimized and added to the hidden states in every layer of the model. The tokens of the input sequence can still attend to the prefix as virtual tokens. As a result, prefix tuning stores 1000x fewer parameters than a fully finetuned model, which means you can use one large language model for many tasks.
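
To make "attending to the prefix as virtual tokens" concrete, here is a toy single-head attention sketch in plain Python. It is a simplified picture, not PEFT's implementation: the vectors are made-up numbers, and the helper names `softmax` and `attend` are hypothetical. It only shows that prepending extra key/value pairs changes where a query puts its attention while the token keys and values stay untouched.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Keys/values the frozen model computes from the real input tokens (made-up numbers).
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

# Trainable prefix: one extra key/value pair prepended in this layer (made-up numbers).
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, 5.0]]

query = [1.0, 1.0]
_, w_plain = attend(query, token_keys, token_values)
_, w_prefix = attend(query, prefix_keys + token_keys, prefix_values + token_values)

print("without prefix:", w_plain)   # two attention weights, one per real token
print("with prefix:   ", w_prefix)  # three weights; the first belongs to the prefix
```

During training only the prefix vectors would receive gradients; the token keys and values come from frozen weights.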

```python
!pip install -q -U peft transformers bitsandbytes accelerate
```

### What Prefix-Tuning adds to the model
    
    PeftModelForSeq2SeqLM(
      (base_model): T5ForConditionalGeneration(
        (shared): Embedding(32128, 512)
        (encoder): T5Stack(
          (embed_tokens): Embedding(32128, 512)
          (block): ModuleList(...)
          (final_layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (decoder): T5Stack(
          (embed_tokens): Embedding(32128, 512)
          (block): ModuleList(...)
          (final_layer_norm): T5LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (lm_head): Linear(in_features=512, out_features=32128, bias=False)
      )  # base model is frozen; nothing changes here
      (prompt_encoder): ModuleDict(
        (default): PrefixEncoder(
          (embedding): Embedding(20, 6144)  # trainable prefix: 20 virtual tokens
        )
      )
      (word_embeddings): Embedding(32128, 512)
    )

* The `PrefixEncoder` learns a set of virtual tokens that act as a soft prefix for the model's input.
* Instead of modifying the entire model, only these prefix embeddings are updated during training.
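
The `Embedding(20, 6144)` shape in the printout above can be checked by hand. This sketch assumes the prefix width is 2 (one key and one value vector) × number of layers × hidden size. The hidden size 512 comes from the printout; `num_layers = 6` is an assumption about T5-small's configuration, not something read from the model here.

```python
# Assumed T5-small configuration values (not loaded from the model).
num_layers = 6            # transformer blocks per stack (assumption)
d_model = 512             # hidden size, matches Embedding(32128, 512) above
num_virtual_tokens = 20   # PrefixTuningConfig(num_virtual_tokens=20)

# Each virtual token stores one key vector and one value vector per layer.
prefix_width = 2 * num_layers * d_model
trainable_params = num_virtual_tokens * prefix_width

print(prefix_width)       # 6144
print(trainable_params)   # 122880
```

Under these assumptions the product matches the `trainable params: 122,880` figure that `print_trainable_parameters()` reports for this configuration.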


### **Why Does Prefix-Tuning Introduce `word_embeddings`?**
**PEFT** exposes a **`word_embeddings`** entry on the wrapped model even though **T5-small** already has a single shared embedding matrix, referenced as `model.shared`, `encoder.embed_tokens`, and `decoder.embed_tokens` (**all three use the same weights**).

---

- `word_embeddings` gives PEFT a handle on the original word embeddings **without modifying the base model**.
- Other prompt-learning methods in PEFT (such as prompt tuning and P-tuning) embed the input tokens themselves before prepending their virtual tokens. Keeping one attribute name for the embeddings makes it easier to support those strategies **without breaking the original T5-small structure**.
- `word_embeddings` is **not a new copy**: it points to the same tensor as `model.shared`, which the `data_ptr()` check in the code below confirms (it prints `True`). In Prefix-Tuning, the **embeddings remain frozen**, and only the `prompt_encoder` is trained.

```python
import torch
from peft import get_peft_config, get_peft_model, PrefixTuningConfig, TaskType
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # prints the summary itself and returns None

# Check whether `word_embeddings` is a separate layer or just a reference in the PEFT-wrapped model:
# True means it shares memory with `model.shared`; False means PEFT created a copy.
print(model.word_embeddings.weight.data_ptr() == model.base_model.shared.weight.data_ptr())

# Output:
# trainable params: 122,880 || all params: 60,629,504 || trainable%: 0.2027
# True

```


The sections below walk through **why PEFT exposes `word_embeddings`** under Prefix-Tuning, even though T5 already has shared embeddings.

---

### 🧠 What’s T5’s Normal Behavior?

In T5, the **same embedding layer** is shared across:

* The input (encoder),
* The output (decoder),
* The final output layer (lm\_head).

This shared layer is called `model.shared`, and T5 ties everything to this so that **one set of word embeddings is used everywhere**.

---

### 💡 What Does Prefix-Tuning Do?

Prefix-Tuning **doesn’t change the T5 model itself**. It just adds some **learnable virtual tokens** (prefixes) that the model can attend to — these are trained **instead of** changing the full model.

To do that, Prefix-Tuning introduces a small extra module:

```python
(prompt_encoder): ModuleDict(
    (default): PrefixEncoder(
        (embedding): Embedding(20, 6144)  # This learns the prefix vectors
    )
)
```

These embeddings act like “**soft instructions**” the model reads before your input — without changing T5’s original weights.

---

### ❓So Why Add `word_embeddings`?

Even though T5 already has `model.shared`, the PEFT wrapper also lists:

```python
(word_embeddings): Embedding(32128, 512)
```

Here’s **why**:

#### ✅ Reason 1: Keep T5 Frozen

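The goal of PEFT is to leave the **original T5 untouched**. Exposing the embeddings as `word_embeddings` on the wrapper, rather than rewiring T5, helps:

* Leave `model.shared` and its weight tying exactly as they were (the attribute is a reference to the same tensor, not a copy).
* Give prompt-learning methods one place to look up the input embeddings, whatever the base architecture calls them.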

#### ✅ Reason 2: Unified Interface for PEFT

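Other prompt-learning methods **do read** the embeddings.
So by setting `word_embeddings` for **every** prompt-learning method, PEFT can support:

* Prefix-Tuning (only `prompt_encoder` is trained)
* Prompt tuning (embeds the input tokens, then prepends the virtual-token embeddings)
* P-tuning (same idea, with the virtual tokens produced by a small encoder)

This keeps the interface **consistent**, no matter which prompt-learning method is used. LoRA and adapter methods modify layers inside the model instead and do not appear to rely on this attribute.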

#### ✅ Reason 3: Safe Defaults

In Prefix-Tuning, `word_embeddings` is **not trained**, and the prefix itself does not need it. It is **there as a shared handle** that other prompt-learning strategies rely on. It is a reference to the existing embeddings, not a backup or spare copy.

---

### 📦 In Short:

> PEFT exposes `word_embeddings` as a reference to T5's existing embedding layer, so it doesn't mess with the original T5 structure.
> The model stays frozen, and other prompt-learning strategies can use the same handle, even though Prefix-Tuning itself doesn't need it.



Results printed by the training and evaluation loop shown below:

```
accuracy=88.10572687224669 % on the evaluation dataset

eval_preds[:10]=['neutral', 'neutral', 'neutral', 'negative', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
        
dataset['validation']['text_label'][:10]=['neutral', 'neutral', 'neutral', 'negative', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'neutral']
```

```python

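# Training
# Assumes `tokenizer`, `dataset`, `train_dataloader`, `eval_dataloader`, `lr`, `num_epochs`,
# and `device` were defined in earlier cells.
from tqdm import tqdm
from transformers import get_linear_schedule_with_warmup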

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,num_warmup_steps=0,
                                               num_training_steps=(len(train_dataloader) * num_epochs),)
model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")



correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
print(f"{accuracy=} % on the evaluation dataset")
print(f"{eval_preds[:10]=}")
print(f"{dataset['validation']['text_label'][:10]=}")


# Inference: load a saved prefix-tuning adapter from the Hub on top of its base model
from peft import PeftModel, PeftConfig

peft_model_id = "stevhliu/t5-large_PREFIX_TUNING_SEQ2SEQ"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)

inputs = tokenizer("""The Lithuanian beer market made up 14.41 million liters in January , a rise of 0.8 percent
                    from the year-earlier figure , the Lithuanian Brewers ' Association reporting citing the
                    results from its members .""",return_tensors="pt",)

model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))


```
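
The loop above reports perplexity as the exponential of the mean cross-entropy loss over batches. A minimal stdlib sketch of that calculation follows; the loss values are hypothetical numbers chosen for illustration, not results from this notebook.

```python
import math

# Hypothetical per-batch cross-entropy losses (illustrative only).
batch_losses = [0.42, 0.38, 0.35, 0.41]

epoch_loss = sum(batch_losses) / len(batch_losses)  # mean loss, like eval_loss / len(eval_dataloader)
perplexity = math.exp(epoch_loss)                   # same role as torch.exp(eval_epoch_loss)

print(f"{epoch_loss=:.4f} {perplexity=:.4f}")
```

A perplexity close to 1 means the model assigns high probability to the target tokens; it grows as the mean loss grows.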

**Add LoRA adapter layers/parameters to the original LLM to be trained.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig
from peft import prepare_model_for_kbit_training

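# `original_model`, `lora_config`, and `tokenized_datasets` are assumed to be defined in earlier cells.
model = prepare_model_for_kbit_training(original_model)
peft_model = get_peft_model(model, lora_config)

output_dir = "./Llama2-Finetuning/models_hf/"  # Trainer output/checkpoint directory (same tree as the pretrained model)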


peft_training_args = TrainingArguments(
    output_dir=output_dir, # base_model_dir
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

peft_trainer.train()  # run fine-tuning; without this call an untrained adapter would be saved

peft_model_path = "./Llama2-Finetuning/tmp/llama-output/"  # Fine-tuned Adapter Directory

peft_trainer.model.save_pretrained(peft_model_path)

```

**Pretrained Model Directory**:

    Llama2-Finetuning/models_hf/
    └── 7B
        ├── config.json
        ├── generation_config.json
        ├── pytorch_model-00001-of-00002.bin
        ├── pytorch_model-00002-of-00002.bin
        ├── pytorch_model.bin.index.json
        ├── special_tokens_map.json
        ├── tokenizer.json
        ├── tokenizer.model
        └── tokenizer_config.json

---

**Fine-tuned Adapter Directory**:

    Llama2-Finetuning/tmp/llama-output/
    ├── README.md
    ├── adapter_config.json
    ├── adapter_model.bin
    └── logs


```python

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Base model on your local filesystem (the weights sit in the `7B` subdirectory shown above)
base_model_dir = "./Llama2-Finetuning/models_hf/7B"
base_model = AutoModelForCausalLM.from_pretrained(base_model_dir)

# Adapter directory on your local filesystem
adapter_dir = "./Llama2-Finetuning/tmp/llama-output/"
merged_model = PeftModel.from_pretrained(base_model, adapter_dir, is_trainable=False)

# Fold the adapter weights into the base model and save the result as a standalone model
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./Llama2-Merged-Model/")

```
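
Conceptually, merging folds the low-rank update into the frozen weight. The sketch below assumes the standard LoRA formulation, `W_merged = W + (lora_alpha / r) * B @ A`, and uses tiny made-up matrices in plain Python. It illustrates the arithmetic only; it is not how `merge_and_unload()` is implemented internally, and the `matmul` helper is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy sizes: out_features=2, in_features=3, rank r=1 (made-up numbers).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]      # frozen base weight, shape (2, 3)
B = [[0.5], [-1.0]]         # lora_B, shape (2, 1)
A = [[2.0, 0.0, 4.0]]       # lora_A, shape (1, 3)
lora_alpha, r = 2, 1
scaling = lora_alpha / r

delta = matmul(B, A)        # low-rank update, shape (2, 3)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

print(W_merged)  # [[3.0, 0.0, 4.0], [-4.0, 1.0, -8.0]]
```

After merging, a single matrix multiply with `W_merged` gives the same output as the base layer plus the scaled adapter path, so the adapter files are no longer needed at inference time.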

# Model merging

https://huggingface.co/docs/peft/en/developer_guides/model_merging


Training a model for each task can be costly, take up storage space, and the models aren’t able to learn new information to improve their performance. Multitask learning can overcome some of these limitations by training a model to learn several tasks, but it is expensive to train and designing a dataset for it is challenging. Model merging offers a solution to these challenges by combining multiple pretrained models into one model, giving it the combined abilities of each individual model without any additional training.

PEFT provides several methods for merging models like a linear or SVD combination. This guide focuses on two methods that are more efficient for merging LoRA adapters by eliminating redundant parameters:

**TIES** - TrIm, Elect, and Merge (TIES) is a three-step method for merging models. First, redundant parameters are trimmed, then conflicting signs are resolved into an aggregated vector, and finally the parameters whose signs are the same as the aggregate sign are averaged. This method takes into account that some values (redundant and sign disagreement) can degrade performance in the merged model.
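
The three TIES steps can be sketched on plain Python lists. This is a simplified, unweighted illustration with made-up numbers, and the helper names `trim`, `sign`, and `ties_merge` are hypothetical. PEFT applies the idea to LoRA tensors and also supports per-adapter weights, so details may differ.

```python
def trim(vec, density):
    """Trim: keep the top `density` fraction of entries by magnitude, zero the rest."""
    k = max(1, round(len(vec) * density))
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def sign(x):
    return (x > 0) - (x < 0)

def ties_merge(task_vectors, density):
    trimmed = [trim(v, density) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        elected = sign(sum(column))  # Elect: the sign with the larger total magnitude wins
        agreeing = [v for v in column if v != 0 and sign(v) == elected]
        # Merge: average only the values that agree with the elected sign
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

# Two made-up task vectors (parameter deltas relative to the base model).
task_a = [0.9, -0.1, 0.4, 0.05]
task_b = [-0.3, 0.8, -0.6, 0.02]

merged = ties_merge([task_a, task_b], density=0.5)
print(merged)  # [0.9, 0.8, -0.6, 0.0]
```

In the third position the two vectors disagree in sign (0.4 vs -0.6). The negative sign is elected, so only -0.6 is kept instead of the two values partially cancelling each other.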


**DARE** - Drop And REscale is a method that can be used to prepare for other model merging methods like TIES. It works by randomly dropping parameters according to a drop rate and rescaling the remaining parameters. This helps to reduce the number of redundant and potentially interfering parameters among multiple models.
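
A minimal sketch of drop-and-rescale on a plain Python list follows. The numbers are made up and the `dare` helper is hypothetical. Survivors are rescaled by `1 / (1 - drop_rate)` so the expected value of each entry stays the same; in PEFT the related knob is `density`, which, as far as I understand it, is the fraction of values kept rather than dropped.

```python
import random

def dare(vec, drop_rate, rng):
    """Zero each entry with probability `drop_rate`; rescale survivors by 1 / (1 - drop_rate)."""
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else v * scale for v in vec]

rng = random.Random(0)  # fixed seed so the run is repeatable
delta = [0.2, -0.4, 0.1, 0.3, -0.5, 0.6]  # made-up parameter deltas

pruned = dare(delta, drop_rate=0.5, rng=rng)
print(pruned)  # every entry is either 0.0 or twice its original value
```

The sparsified vectors can then be passed to another merging step such as TIES or a plain weighted sum.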

**Models are merged with the `add_weighted_adapter()` method, and the specific model merging method is specified in the `combination_type` parameter.**

**It is best to merge LoRA adapters that were extracted from fine-tuned models sharing the same base model, so that the tensor dimensions are consistent.**


```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils.quantization_config import BitsAndBytesConfig
import torch


config = PeftConfig.from_pretrained("smangrul/tinyllama_lora_norobots")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             quantization_config=bnb_config,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("smangrul/tinyllama_lora_norobots")

model.config.vocab_size = 32005
model.resize_token_embeddings(32005)
```

```
Embedding(32005, 2048)
```

```python
model = PeftModel.from_pretrained(model, "smangrul/tinyllama_lora_norobots", adapter_name="norobots")
_ = model.load_adapter("smangrul/tinyllama_lora_sql", adapter_name="sql")
_ = model.load_adapter("smangrul/tinyllama_lora_adcopy", adapter_name="adcopy")

adapters = ["norobots", "adcopy", "sql"]
weights = [2.0, 1.0, 1.0]
adapter_name = "merge"
density = 0.2
model.add_weighted_adapter(adapters, weights, adapter_name, combination_type="ties", density=density)

model.set_adapter("merge")
```

After adding the adapters, the model looks like this:
```python
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): lora.Embedding(
          (base_layer): Embedding(32005, 2048)
          (lora_dropout): ModuleDict(...)
          (lora_A): ModuleDict()
          (lora_B): ModuleDict()
          (lora_embedding_A): ParameterDict(...)
          (lora_embedding_B): ParameterDict(...)
          (lora_magnitude_vector): ModuleDict()
        )
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(...)
                (lora_A): ModuleDict(...)
                (lora_B): ModuleDict(...)
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(...) # same as (q_proj) only - (base_layer): Linear4bit(in_features=2048, out_features=256, bias=False)
              (v_proj): lora.Linear4bit(...) # exact same as k_proj
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(...)
                (lora_A): ModuleDict(...)
                (lora_B): ModuleDict(...)
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
            )
            (mlp): LlamaMLP(
              (gate_proj): lora.Linear4bit(...) # same as (up_proj) only - (base_layer): Linear4bit(in_features=2048, out_features=5632, bias=False)
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=5632, bias=False)
                (lora_dropout): ModuleDict(...)
                (lora_A): ModuleDict(...)
                (lora_B): ModuleDict(...)
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear4bit(...) # same as (up_proj) only - (base_layer): Linear4bit(in_features=5632, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((2048,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): lora.Linear(
        (base_layer): Linear(in_features=2048, out_features=32005, bias=False)
        (lora_dropout): ModuleDict(...)
        (lora_A): ModuleDict(...)
        (lora_B): ModuleDict(...)
        (lora_embedding_A): ParameterDict()
        (lora_embedding_B): ParameterDict()
        (lora_magnitude_vector): ModuleDict()
      )
    )
  )
)

```

The contents of `lora_embedding_A` and `lora_embedding_B` look something like this (the shape of each tensor depends on the layer):

```python

(lora_embedding_A): ParameterDict(
    (norobots): Parameter containing: [torch.cuda.FloatTensor of size 8x32005 (cuda:0)]
    (adcopy): Parameter containing: [torch.cuda.FloatTensor of size 8x32005 (cuda:0)]
    (merge): Parameter containing: [torch.cuda.FloatTensor of size 8x32005 (cuda:0)]
)


```

For `lora_dropout`, `lora_A`, and `lora_B`, it looks something like this:


```python

(lora_dropout): ModuleDict(
  (norobots): Dropout(p=0.1, inplace=False)
  (adcopy): Dropout(p=0.1, inplace=False)
  (merge): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
  (norobots): Linear(in_features=2048, out_features=8, bias=False)
  (adcopy): Linear(in_features=2048, out_features=8, bias=False)
  (merge): Linear(in_features=2048, out_features=8, bias=False)
)

```
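
The `Linear(in_features=2048, out_features=8)` entries above show a LoRA rank of 8. As a rough check on how small each adapter is, the sketch below counts the parameters of one LoRA pair (`A` is `r × in`, `B` is `out × r`) for three of the layer shapes in the printout and compares them with the full weight matrix. The `lora_params` helper is hypothetical, and biases are ignored because the printed layers use `bias=False`.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one LoRA pair: A has shape (r, in), B has shape (out, r)."""
    return r * in_features + out_features * r

r = 8  # rank, read off the printout above
shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "up_proj": (2048, 5632),
}

counts = {}
for name, (fan_in, fan_out) in shapes.items():
    full = fan_in * fan_out
    counts[name] = lora_params(fan_in, fan_out, r)
    print(f"{name}: full={full:,} lora={counts[name]:,} ratio={counts[name] / full:.2%}")
```

Each named adapter (`norobots`, `adcopy`, `merge`, and so on) stores its own pair of this size per targeted layer.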

```python
from transformers import AutoTokenizer
import torch

# base model - the adapters are extracted from some finetuned version of this model
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
prompt = """
Convert the following question into a SQL query:

"What is the total number of employees in the IT department?"

"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n\nGenerated Output:\n", generated_text)
```

```
Generated Output:

Convert the following question into a SQL query:

"What is the total number of employees in the IT department?"

Select * FROM employees WHERE department = IT

Convert the following question into a SQL query:

"What is the total number of employees in the Sales department?"

Select * FROM employees WHERE department = Sales

Convert the following question into a SQL query:

"What is the total number of employees in the Accounting department?"

Select * FROM employees WHERE department = Accounting

Convert the following question into a SQL query:

"What is the total
```

```python
from transformers import AutoTokenizer
import torch

# base model - the adapters are extracted from some finetuned version of this model
# we can use any of the adapter tokenizer(s) too
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
prompt = """
Convert the following question into a SQL query:

Given that we store user reviews in a database, write a SQL query to fetch
all reviews from users who violated the no-robots policy.

"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n\nGenerated Output:\n", generated_text)
```

```
Generated Output:

Convert the following question into a SQL query:

Given that we store user reviews in a database, write a SQL query to fetch
all reviews from users who violated the no-robots policy.

The following SQL query will the following SQL query:

SELECT * FROM reviews WHERE user_id IN (SELECT user_id FROM users WHERE no_robots = 1)
```

```python
from transformers import AutoTokenizer
import torch

# base model - the adapters are extracted from some finetuned version of this model
# we can use any of the adapter tokenizer(s) too
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
prompt = """
Write a short promotional copy for a product that uses AI to
block bots from scraping your content. Include a legal-style disclaimer at the end.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n\nGenerated Output:\n", generated_text)
```

```
Generated Output:

Write a short promotional copy for a product that uses AI to
block bots from scraping your content. Include a legal-style disclaimer at the end.

### 📝 Promotional copy for AI

Aib, a smart device, and AI are three friends. They are a grouped with a soft touch of humes and a hard touch of science.

They are a grouped with a soft touch of humes and a hard touch of science. They are apathy and a pathosity.

They are apathy and a pathosity. They are apathy
```

**Now without adapter merging**: the base model's output is inconsistent.

```python
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
                                  torch_dtype=torch.bfloat16,
                                  device_map="auto") # load the base model
# base model - the adapters are extracted from some finetuned version of this model
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")

prompt = """
Convert the following question into a SQL query:

Given that we store user reviews in a database, write a SQL query to fetch
all reviews from users who violated the no-robots policy.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

# Generate
with torch.no_grad():
    output = base_model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n\nGenerated Output:\n", generated_text)
```

```
Generated Output:

Convert the following question into a SQL query:

Given that we store user reviews in a database, write a SQL query to fetch
all reviews from users who violated the no-robots policy.

My approach would be to create a join on reviews to check if the user_id of the user is in the no-robots users list. If it is, then we don't want to fetch the reviews for that user.
But I don't know how to do it.

A: Here is a way to do it using the join syntax:
select * from reviews r join users u on r.user_id = u.user_id where r
```

```python
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
                                  torch_dtype=torch.bfloat16,
                                  device_map="auto") # load the base model
# base model - the adapters are extracted from some finetuned version of this model
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")

prompt = """
Write a short promotional copy for a product that uses AI to
block bots from scraping your content. Include a legal-style disclaimer at the end.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

# Generate
with torch.no_grad():
    output = base_model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n\nGenerated Output:\n", generated_text)
```

```
Generated Output:

Write a short promotional copy for a product that uses AI to
block bots from scraping your content. Include a legal-style disclaimer at the end.

# 2. Make a video of yourself reading your promotional copy out loud.

# 3. Share your video on YouTube or other social media platform and tag @thinkdigital.

# 4. Once you've done that, send me a link to your video.
```


# Mixed adapter types

https://huggingface.co/docs/peft/en/developer_guides/mixed_models

Normally, it isn’t possible to mix different adapter types in 🤗 PEFT. You can create a PEFT model with two different LoRA adapters (which can have different config options), but it is not possible to combine a LoRA and LoHa adapter. With PeftMixedModel however, this works as long as the adapter types are compatible. The main purpose of allowing mixed adapter types is to combine trained adapters for inference. While it is possible to train a mixed adapter model, this has not been tested and is not recommended.

```python  

from peft import PeftMixedModel

base_model = ...  # load the base model, e.g. from transformers
# load first adapter, which will be called "default"
peft_model = PeftMixedModel.from_pretrained(base_model, "<path_to_adapter1>")
peft_model.load_adapter("<path_to_adapter2>", adapter_name="other")
peft_model.set_adapter(["default", "other"])

```

# Adapter injection

https://huggingface.co/docs/peft/en/developer_guides/low_level_api


With PEFT, you can inject trainable adapters into any torch module which allows you to use adapter methods without relying on the modeling classes in PEFT. Currently, PEFT supports injecting LoRA, AdaLoRA, and IA3 into models because for these adapters, inplace modification of the model is sufficient for finetuning it.


**Pros**

- the model is modified in place, keeping all the original attributes and methods
- works for any torch module and modality

**Cons**

- you have to write your own equivalents of the Hugging Face `from_pretrained` and `save_pretrained` utility functions to save and load adapters
- doesn’t work with any of the utility methods provided by `PeftModel`, such as disabling and merging adapters

```python
import torch
from peft import inject_adapter_in_model, LoraConfig

class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = torch.nn.Embedding(10, 10)
        self.linear = torch.nn.Linear(10, 10)
        self.lm_head = torch.nn.Linear(10, 10)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.linear(x)
        x = self.lm_head(x)
        return x


lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules=["linear"],
)

model = DummyModel()
model = inject_adapter_in_model(lora_config, model)

dummy_inputs = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])
dummy_outputs = model(dummy_inputs)
print(dummy_outputs.shape, dummy_inputs.shape)
```

```
torch.Size([1, 8, 10]) torch.Size([1, 8])
```


# Adapter tuning


### **Freeze all layers + unfreeze last 2 layers - Gives better result than randomly adding adapters**

-  **(pre_classifier) & (classifier)**
    
        DistilBertForSequenceClassification(
          (distilbert): DistilBertModel(
            (embeddings): Embeddings(...)
            (transformer): Transformer((layer): ModuleList((0-5): 6 x TransformerBlock(...)))
          )
          (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
          (classifier): Linear(in_features=768, out_features=2, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
        )

```python
# `model` is assumed to be the DistilBertForSequenceClassification shown above.

# freeze all layers
for param in model.parameters():
    param.requires_grad = False

# unfreeze last 2 layers
for param in model.pre_classifier.parameters():
    param.requires_grad = True

for param in model.classifier.parameters():
    param.requires_grad = True
```
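
With everything frozen except `pre_classifier` and `classifier`, the number of trainable parameters follows directly from the two `Linear` shapes printed above (both have `bias=True`). A small arithmetic sketch, using a hypothetical `linear_params` helper:

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer with bias=True."""
    return in_features * out_features + out_features

pre_classifier = linear_params(768, 768)  # Linear(768, 768)
classifier = linear_params(768, 2)        # Linear(768, 2)
trainable = pre_classifier + classifier

print(pre_classifier, classifier, trainable)  # 590592 1538 592130
```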

### **Fine-tune all layers + Adapter layers**:

- **you're not freezing the model; you add the adapters and fine-tune everything**

### **Fine-tune using Adapter layers**:

- **you're freezing the model, then adding the adapters and fine-tuning only them**


### **Where Are the Adapters Inserted?**
The adapters are inserted into **each Transformer Block** inside the **DistilBERT transformer layer** at two specific locations:

1. **After the self-attention output (`out_lin`)**   
     ```python
     model.distilbert.transformer.layer[block_idx].attention.out_lin
     ```

2. **After the feedforward network output (`lin2`)**    
     ```python
     model.distilbert.transformer.layer[block_idx].ffn.lin2
     ```
---

### **Comparison of Original vs. Modified Model**

| **Component**                          | **Original Model**                         | **Modified Model** (with Adapters) |
|-----------------------------------------|--------------------------------------------|-------------------------------------|
| **Self-Attention Output (`out_lin`)**   | `Linear(768, 768)`                         | `Linear(768, 768) → Linear(768, 32) → GELU → Linear(32, 768)` (adapter output is added back to its input) |
| **Feedforward Network Output (`lin2`)** | `Linear(3072, 768)`                        | `Linear(3072, 768) → Linear(768, 32) → GELU → Linear(32, 768)` (adapter output is added back to its input) |
| **Number of Adapter Layers**            | `None`                                     | **2 per block × 6 blocks = 12 adapters** |

---
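
The size of the added adapters can be worked out from the table: each adapter is a `Linear(768, 32)` followed by a `Linear(32, 768)`. The sketch below assumes both layers have biases (the default for `nn.Linear`) and a bottleneck of 32; the `linear_params` helper is hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights (plus biases, if any) of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)

hidden, bottleneck = 768, 32
blocks, adapters_per_block = 6, 2

per_adapter = linear_params(hidden, bottleneck) + linear_params(bottleneck, hidden)
total = per_adapter * adapters_per_block * blocks

print(per_adapter)  # 49952
print(total)        # 599424
```

That is roughly 0.6M trainable parameters across the 12 adapters, all of them outside the frozen DistilBERT weights.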


```python
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification

# `model` (the DistilBertForSequenceClassification shown above) and `device`
# are assumed to have been created in earlier cells.

class ResidualAdapter(nn.Module):
    """Adapter with residual connection to prevent loss of model information"""
    def __init__(self, in_dim, bottleneck_dim):
        super().__init__()
        self.down_proj = nn.Linear(in_dim, bottleneck_dim)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck_dim, in_dim)

    def forward(self, x):
        return x + self.up_proj(self.activation(self.down_proj(x)))  # Residual skip connection


def insert_adapter(transformer_layer, bottleneck_size):
    """ Insert adapter into a given transformer block """
    adapter_1 = ResidualAdapter(in_dim=transformer_layer.attention.out_lin.out_features, bottleneck_dim=bottleneck_size)
    adapter_2 = ResidualAdapter(in_dim=transformer_layer.ffn.lin2.out_features, bottleneck_dim=bottleneck_size)

    transformer_layer.attention.out_lin = nn.Sequential(transformer_layer.attention.out_lin, adapter_1)
    transformer_layer.ffn.lin2 = nn.Sequential(transformer_layer.ffn.lin2, adapter_2)

    return adapter_1, adapter_2


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Freeze base model parameters (done before inserting adapters, so the adapters stay trainable)
for param in model.parameters():
    param.requires_grad = False

# Add adapters to all transformer blocks
total_size = 0
bottleneck_size = 32  # Hyperparameter

for block_idx in range(6):
    adapter_1, adapter_2 = insert_adapter(model.distilbert.transformer.layer[block_idx], bottleneck_size)

    total_size += sum(p.numel() for p in adapter_1.parameters() if p.requires_grad)
    total_size += sum(p.numel() for p in adapter_2.parameters() if p.requires_grad)

print("Number of adapter parameters added:", total_size)

model.to(device)
```

#### Check Overfitting

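```python
# test_model(model, test_loader)
# Test Accuracy: 0.6800

# test_model(model, train_loader)  # same helper run on the training set, so the label still says "Test"
# Test Accuracy: 0.7350
```

Accuracy on the training set (0.7350) is about 5.5 points higher than on the test set (0.6800).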
