This code demonstrates how to perform **prompt tuning** on the BLOOMZ-560M language model using PEFT (Parameter-Efficient Fine-Tuning). It loads two datasets, one with prompts and one with quotes, then adds and trains a small set of virtual tokens to adapt the model efficiently for each task. Training is done using Hugging Face's `Trainer`, and the tuned adapters are saved separately for easy loading and text generation later, enabling lightweight customization without updating the full model.


# PEFT Prompt Tuning with BLOOMZ-560M: Summary

- **Set up model and tokenizer**
  - Load pretrained BLOOMZ model and tokenizer from Hugging Face.
  - Choose model variant (e.g., `"bigscience/bloomz-560m"`).

- **Define constants**
  - Number of virtual tokens (`NUM_VIRTUAL_TOKENS = 4`).
  - Number of training epochs (`NUM_EPOCHS = 5`).

- **Helper function to generate outputs**
  - Wrap model `.generate()` with custom parameters (e.g., repetition penalty).

- **Load and tokenize datasets**
  - Load prompt dataset (`fka/awesome-chatgpt-prompts`), tokenize and select subset.
  - Load sentence dataset (`Abirate/english_quotes`), tokenize and select subset.

- **Configure PEFT prompt tuning**
  - Create `PromptTuningConfig` with random initialization and virtual tokens.
  - Wrap foundation model with PEFT for prompt tuning (for both prompt and sentence datasets).

- **Prepare training arguments**
  - Define `TrainingArguments` tailored for CPU, auto batch sizing, higher LR, and output dirs.

- **Set up output directories**
  - Create separate folders to save prompt-tuned and sentence-tuned PEFT adapters.

- **Create Hugging Face `Trainer` instances**
  - Use `DataCollatorForLanguageModeling` (causal LM mode).
  - Build trainers for prompt and sentence PEFT models.

- **Train models**
  - Run `.train()` on both trainers.

- **Save tuned adapters**
  - Save PEFT adapters separately to their output directories.

- **Load tuned adapters for inference**
  - Load base BLOOMZ model.
  - Load PEFT adapters with `PeftModel.from_pretrained()` on base model.
  - Generate and decode text outputs for evaluation.

---

This workflow enables efficient fine-tuning of large language models by training a small set of virtual tokens (prompt tuning), reducing compute and storage costs while customizing generation behavior.


In [None]:
!pip install -q peft==0.8.2

In [None]:
!pip install -q datasets==2.14.5

From the transformers library, we import the necessary classes to instantiate the model and the tokenizer.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

### Loading the model and the tokenizer

The Bloom family includes some of the smallest models that work well for training with the PEFT library using Prompt Tuning. You can choose any model from the Bloom family, and I encourage you to try at least two of them to observe the differences.

I'm opting for the smallest one to minimize training time and avoid memory issues in Colab.

In [None]:
'''This line is assigning a pretrained language model name from Hugging Face's Model Hub to the variable model_name.
"bigscience/bloomz-560m" refers to the BLOOMZ-560M model, a version of the BLOOMZ series developed by BigScience. The 560m refers to the size of the model, which has 560 million parameters.
The second line is commented out (#). It shows an alternative model: "bigscience/bloom-1b1" (BLOOM with 1.1 billion parameters), but it's not currently active.
These models are part of the BLOOM or BLOOMZ family, which are multilingual and open-source large language models.'''
model_name = "bigscience/bloomz-560m"
#model_name="bigscience/bloom-1b1"
'''NUM_VIRTUAL_TOKENS configures prompt tuning (soft prompts), a parameter-efficient way to fine-tune large models.
NUM_VIRTUAL_TOKENS = 4 means that the fine-tuning method will prepend 4 virtual tokens to the input.
These tokens are learned embeddings, not actual words, and are optimized during training to guide the model's behavior on downstream tasks.'''
NUM_VIRTUAL_TOKENS = 4
NUM_EPOCHS = 5

In [None]:
'''Loads the tokenizer associated with the model.
AutoTokenizer is a generic class that automatically picks the right tokenizer class based on the model name.
from_pretrained(model_name) tells it to load the tokenizer files for "bigscience/bloomz-560m" from the Hugging Face hub.
The tokenizer is responsible for: Splitting text into tokens (tokenization). Mapping those tokens to numerical IDs. Decoding IDs back into human-readable text.'''
tokenizer = AutoTokenizer.from_pretrained(model_name)
'''Loads the pretrained causal language model (BLOOMZ-560M in this case).

AutoModelForCausalLM is a class that loads the correct model type for causal language modeling, which is the type used for tasks like: Text generation, Code completion, Dialogue modeling
from_pretrained(model_name) again fetches the model weights from Hugging Face.
trust_remote_code=True allows the loading of custom model code from the repository.
Some models include their own modeling_*.py files. This is necessary if the model repo defines custom behavior beyond the base transformer models.'''
foundational_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

## Inference with the pre-trained Bloom model

In [None]:
# This function returns the token IDs generated by the given model for the given inputs.
'''get_outputs() generates text from a language model (like BLOOMZ) using the Hugging Face Transformers API.
input_ids=inputs["input_ids"]: These are the token IDs for your input prompt.
attention_mask=inputs["attention_mask"]: Specifies which tokens should be attended to (1 = real token, 0 = padding).
max_new_tokens=...: Caps how many tokens the model will generate beyond the prompt.
repetition_penalty=1.5: Discourages the model from repeating itself. Values > 1 penalize repetition; 1.5 is a moderately strong penalty.
early_stopping=True: Only affects beam search, where it controls when the search ends. With the default greedy decoding (num_beams=1) used here it has no effect.
eos_token_id=tokenizer.eos_token_id: The token that signals the end of a response. Generation stops before max_new_tokens if the model produces it.'''
def get_outputs(model, inputs, max_new_tokens=100):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5, # Avoid repetition.
        early_stopping=True, # Only relevant for beam search; no effect with greedy decoding.
        eos_token_id=tokenizer.eos_token_id # Stop as soon as the end-of-sequence token is generated.
    )
    return outputs
'''Returns the generated token IDs from the model. These need to be decoded back to text with the tokenizer, for example:
tokenizer.batch_decode(outputs, skip_special_tokens=True)'''
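
To make the `repetition_penalty` setting more concrete, here is a minimal sketch of the idea in plain Python. It is not the Transformers implementation, only an illustration of the rule it is based on: the score of every token that already appears in the sequence (prompt included) is pushed down before the next token is chosen. The function name `apply_repetition_penalty` and the toy scores are made up for this example.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.5):
    """Return a copy of `logits` where already-seen tokens are made less likely."""
    penalized = list(logits)
    for token_id in set(seen_token_ids):
        score = penalized[token_id]
        # Positive scores are divided and negative scores are multiplied,
        # so the score of a seen token always moves down.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


logits = [2.0, -1.0, 0.5, 3.0]  # one score per entry of a toy 4-token vocabulary
seen_token_ids = [0, 1]         # tokens 0 and 1 already appear in the sequence
penalized_logits = apply_repetition_penalty(logits, seen_token_ids)
print(penalized_logits)
```

With `penalty=1.0` the scores are left untouched, which is why values above 1 are the ones that discourage repetition.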

As we want to have two different trained models, I will create two distinct prompts.

The first model will be trained with a dataset containing prompts, and the second one with a dataset of motivational sentences.

The first model will receive the prompt "I want you to act as a motivational coach." and the second model will receive "There are two things that matter:"

But first, I'm going to collect some results from the model without fine-tuning.

In [None]:
'''Purpose: Tokenizes your prompt using the model's tokenizer. "I want you to act as a motivational coach." is the seed prompt.
return_tensors="pt" returns the input as PyTorch tensors, which is what the model expects.'''
input_prompt = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt")
'''Generates text from the model starting with the prompt. The model will generate up to 50 new tokens, unless it hits an eos_token_id and stops early.
It uses all the settings you defined in get_outputs() (repetition penalty, early stopping, etc.).'''
foundational_outputs_prompt = get_outputs(foundational_model, input_prompt, max_new_tokens=50)
'''Decodes the list of token IDs into human-readable strings. skip_special_tokens=True removes things like <pad>, <eos>, etc., from the output.
batch_decode is used even though the batch has size 1 here: it's still appropriate because the model outputs a tensor of shape [batch_size, sequence_length].'''
print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))

In [None]:
input_sentences = tokenizer("There are two things that matter:", return_tensors="pt")
foundational_outputs_sentence = get_outputs(foundational_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

Both answers are plausible. The Bloom models are pre-trained and can already generate sensible, well-formed sentences. Let's see whether the responses stay the same or improve after training.

## Preparing the Datasets
The datasets used are:
* https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
* https://huggingface.co/datasets/Abirate/english_quotes


In [None]:
import os
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
from datasets import load_dataset
'''Loads the Hugging Face dataset called fka/awesome-chatgpt-prompts. This dataset contains a collection of structured prompts, often with an act and prompt field 
(e.g., "Act as a doctor", "Act as a motivational coach", etc.).'''

dataset_prompt = "fka/awesome-chatgpt-prompts"

#Create the Dataset to create prompts.
data_prompt = load_dataset(dataset_prompt)
'''Applies your tokenizer to the "prompt" field in a batched manner. The tokenizer's output ('input_ids', 'attention_mask') is added to each sample as new columns.
Note: map() keeps the original columns ("act", "prompt") unless you drop them explicitly, for example with remove_columns.'''
data_prompt = data_prompt.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
'''Selects the first 50 examples from the "train" split. This is useful for quick testing, development, or small-scale fine-tuning.'''
train_sample_prompt = data_prompt["train"].select(range(50))

#train_sample_prompt = train_sample_prompt.remove_columns('act')

display(train_sample_prompt)

In [None]:
print(train_sample_prompt[:1])
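
If it is unclear what the batched `map` call does to the dataset, the following plain-Python sketch mimics it under simplifying assumptions. `toy_tokenize` and `batched_map` are hypothetical stand-ins (not part of `datasets` or `transformers`), and the two rows are invented. The point it illustrates is that the function receives whole columns and that its output is merged into the existing columns.

```python
def toy_tokenize(texts):
    """Hypothetical stand-in for the real tokenizer: one integer ID per word."""
    vocab = {}
    input_ids = [[vocab.setdefault(word, len(vocab)) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def batched_map(columns, fn):
    """Mimic map(..., batched=True): call fn on whole columns and merge its output."""
    return {**columns, **fn(columns)}


# Two invented rows in column-oriented form, the way a batch is passed to the function.
batch = {
    "act": ["Terminal", "Translator"],
    "prompt": ["act as a terminal", "act as a translator"],
}
tokenized = batched_map(batch, lambda samples: toy_tokenize(samples["prompt"]))
print(sorted(tokenized))
```

The original `act` and `prompt` columns are still there next to the new `input_ids` and `attention_mask` columns.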

In [None]:
'''Loads the public dataset Abirate/english_quotes. This dataset includes fields like: "quote" (the actual text), "author" (who said it),
"tags" (topic or theme categories like "inspiration", "life", etc.)'''
dataset_sentences = load_dataset("Abirate/english_quotes")
'''Applies the tokenizer to the "quote" field. batched=True allows efficient batch tokenization. The tokenizer output (e.g., input_ids, attention_mask) is added to each sample as new columns.'''
data_sentences = dataset_sentences.map(lambda samples: tokenizer(samples["quote"]), batched=True)
'''Selects the first 25 quotes for a smaller working dataset, useful for testing or small-scale experiments.'''
train_sample_sentences = data_sentences["train"].select(range(25))
'''Removes the "author" and "tags" columns, leaving the quote text and its tokenized form.'''
train_sample_sentences = train_sample_sentences.remove_columns(['author', 'tags'])
display(train_sample_sentences)

## Fine-tuning

### PEFT configurations


API docs:
https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig

We can use the same configuration for both models to be trained.


In [None]:
'''This library allows tuning only a small part of the model (like prompts, adapters, or LoRA layers), reducing memory and compute needs.'''
from peft import  get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

'''| Parameter                                    | Purpose                                                                                                                                        |
| -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `task_type=TaskType.CAUSAL_LM`               | Specifies that this is a **Causal Language Modeling** task (like GPT, BLOOM, etc.), where the model predicts the next word/token.              |
| `prompt_tuning_init=PromptTuningInit.RANDOM` | Initializes the virtual prompt tokens **randomly**. (Alternatives include using tokens from a real prompt or a known initialization.)          |
| `num_virtual_tokens=NUM_VIRTUAL_TOKENS`      | The number of **learnable prompt tokens** (e.g., 4, as defined earlier). These will be prepended to every input during training and inference. |
| `tokenizer_name_or_path=model_name`          | Tokenizer used to build the initial embeddings when `prompt_tuning_init` is `TEXT`; it has no effect with `RANDOM` initialization.             |
'''
generation_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM, # This type indicates the model will generate text.
    prompt_tuning_init=PromptTuningInit.RANDOM,  # The added virtual tokens are initialized with random numbers.
    num_virtual_tokens=NUM_VIRTUAL_TOKENS, # Number of virtual tokens to be added and trained.
    tokenizer_name_or_path=model_name # Tokenizer of the pre-trained model; only used with TEXT initialization.
)
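
To picture what "virtual tokens" means, here is a small sketch using plain Python lists instead of tensors. It is an illustration under simplifying assumptions, not PEFT's code: `HIDDEN_SIZE` is a toy value, the input embeddings are placeholders, and `prepend_virtual_tokens` is a made-up helper. The learned vectors are simply placed in front of the embedded input, and the attention mask is extended so the model attends to them.

```python
import random

NUM_VIRTUAL_TOKENS = 4
HIDDEN_SIZE = 8  # toy value for readability; the real embedding width is much larger

random.seed(0)
# The only trainable parameters: one embedding vector per virtual token.
prompt_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_VIRTUAL_TOKENS)
]


def prepend_virtual_tokens(input_embeddings, attention_mask):
    """Put the learned vectors in front of the embedded input and extend the mask."""
    extended_embeddings = prompt_embeddings + input_embeddings
    extended_mask = [1] * NUM_VIRTUAL_TOKENS + attention_mask
    return extended_embeddings, extended_mask


# A 3-token input, as if already embedded by the frozen model (zeros as placeholders).
input_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(3)]
embeddings, mask = prepend_virtual_tokens(input_embeddings, [1, 1, 1])
print(len(embeddings), mask)
```

During training only `prompt_embeddings` would receive gradient updates; everything the frozen model contributes stays as it is.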


### Creating two Prompt Tuning Models.
We will create two identical prompt tuning models using the same pre-trained model and the same config.

In [None]:
'''This wraps your original foundational_model (e.g., BLOOMZ-560M) with a prompt tuning adapter. It doesn't modify the pretrained model weights.
Instead, it adds a set of virtual prompt embeddings that are trained while the rest of the model stays frozen.'''
peft_model_prompt = get_peft_model(foundational_model, generation_config)
peft_model_prompt.print_trainable_parameters()  # Prints the summary itself and returns None.
'''The exact numbers depend on: NUM_VIRTUAL_TOKENS (you set this to 4).
The hidden size of the model (for BLOOMZ-560M, it's 1024).
How it's calculated: for prompt tuning, the trainable parameters = num_virtual_tokens × hidden_size,
because each virtual token is a single learned embedding vector at the input layer (unlike prefix tuning, nothing is added inside the attention layers).
So: 4 × 1024 = 4096 trainable parameters (for BLOOMZ-560M).
This is a tiny fraction of the total model size, which makes training faster, cheaper, and much more resource-efficient.'''

In [None]:
'''You're now creating a second PEFT model called peft_model_sentences, for use with your English quotes dataset. You're wrapping the same foundational_model (bloomz-560m) again, using the
same generation_config. This gives you another instance of a prompt-tuned model with a fresh set of virtual tokens, initialized randomly. This means peft_model_sentences can be
trained separately on a different task (e.g., quotes), while peft_model_prompt is trained on your prompt-style dataset.
Both share the same base model but learn different soft prompts.'''
peft_model_sentences = get_peft_model(foundational_model, generation_config)
peft_model_sentences.print_trainable_parameters()  # Prints the summary itself and returns None.

**That's amazing: did you see the reduction in trainable parameters? We are going to train less than 0.001% of the available parameters.**
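
The percentage can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the notebook and are worth confirming on your side (for example with `foundational_model.config.hidden_size` and the summary printed above): an embedding width of 1024 for bloomz-560m and a frozen base model of roughly 559 million parameters.

```python
NUM_VIRTUAL_TOKENS = 4
HIDDEN_SIZE = 1024             # assumption: embedding width of bloomz-560m
BASE_PARAMETERS = 559_214_592  # assumption: parameter count of the frozen base model

# Prompt tuning trains one embedding vector per virtual token and nothing else.
trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
total = BASE_PARAMETERS + trainable
trainable_percent = 100 * trainable / total

print(f"trainable params: {trainable:,} || all params: {total:,} || trainable%: {trainable_percent:.4f}")
```

Doubling `NUM_VIRTUAL_TOKENS` doubles the trainable count, but the share of the total stays far below one percent.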

Now we are going to create the training arguments, and we will use the same configuration in both trainings.

In [None]:
from transformers import TrainingArguments
'''| Argument                    | Meaning                                                                                                                     |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `output_dir=path`           | Where to store model checkpoints, logs, and outputs.                                                                        |
| `use_cpu=True`              | Tells Hugging Face `Trainer` to use CPU instead of GPU. Very important for CPU-only systems.                                |
| `auto_find_batch_size=True` | Dynamically adjusts batch size during training to avoid OOM errors. Great for large models.                                 |
| `learning_rate=0.0035`      | A higher learning rate than full-model fine-tuning; works well for prompt tuning, where only a few parameters are trained.  |
| `num_train_epochs=6`        | Number of passes over the dataset.                                                                                          |
'''
def create_training_arguments(path, learning_rate=0.0035, epochs=6):
    training_args = TrainingArguments(
        output_dir=path, # Where the model predictions and checkpoints will be written
        use_cpu=True, # This is necessary for CPU clusters.
        auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically
        learning_rate= learning_rate, # Higher learning rate than full fine-tuning
        num_train_epochs=epochs
    )
    return training_args

In [None]:

import os

working_dir = "./"

# It is best to store the models in separate folders.
# Build the names of the directories where the models will be stored.
output_directory_prompt = os.path.join(working_dir, "peft_outputs_prompt")
output_directory_sentences = os.path.join(working_dir, "peft_outputs_sentences")

# Create the directories (and working_dir, if needed) when they do not exist yet.
os.makedirs(output_directory_prompt, exist_ok=True)
os.makedirs(output_directory_sentences, exist_ok=True)


We need to indicate the directory containing the model when creating the TrainingArguments.

In [None]:
'''You're calling your create_training_arguments() function twice: once for your prompt-tuning model, saving checkpoints & logs to peft_outputs_prompt,
and once for your sentence/quotes-tuning model, saving to peft_outputs_sentences. Both use a learning rate of 0.003 and the number of epochs you defined earlier (NUM_EPOCHS = 5).'''
training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)
training_args_sentences = create_training_arguments(output_directory_sentences, 0.003, NUM_EPOCHS)
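
As a rough guide to how long each run will be, the number of optimizer steps can be estimated from the dataset size, the batch size and the number of epochs. The sketch below assumes a batch size of 8 (the `Trainer` default per device, which `auto_find_batch_size` only lowers if memory runs out), a single device and no gradient accumulation. `optimizer_steps` is a helper written for this example.

```python
import math


def optimizer_steps(num_samples, batch_size, epochs):
    """Batches per epoch (the last one may be smaller) times the number of epochs."""
    return math.ceil(num_samples / batch_size) * epochs


BATCH_SIZE = 8  # assumption: default per-device batch size, one device, no accumulation
NUM_EPOCHS = 5

steps_prompt = optimizer_steps(50, BATCH_SIZE, NUM_EPOCHS)     # 50 prompt samples
steps_sentences = optimizer_steps(25, BATCH_SIZE, NUM_EPOCHS)  # 25 quote samples
print(steps_prompt, steps_sentences)
```

With so few steps, the relatively high learning rate is what lets the virtual tokens move noticeably during training.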

## Train

We will create the trainer Object, one for each model to train.  

In [None]:
from transformers import Trainer, DataCollatorForLanguageModeling

'''model=model: You're passing in the PEFT model (which has frozen base weights and trainable virtual tokens).
args=training_args: Passes your previously created training configurations.
train_dataset=train_dataset: Expects a dataset formatted with tokenized inputs (like your prompt or quotes datasets).
data_collator=DataCollatorForLanguageModeling(...):
This batches and pads samples dynamically during training.
mlm=False because BLOOMZ is a causal LM, not masked LM (e.g., BERT).
So this means it prepares inputs for autoregressive generation training.'''
def create_trainer(model, training_args, train_dataset):
    trainer = Trainer(
        model=model, # The PEFT version of the foundation model, bloomz-560m.
        args=training_args, # The training arguments.
        train_dataset=train_dataset, # The dataset used to train the model.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False: causal LM, no masked language modeling.
    )
    return trainer
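
What the collator does with `mlm=False` can be sketched in plain Python. This is a simplified illustration, not the library code: `collate_causal_lm` is a made-up function, the token IDs are invented, and padding is added on the right here, whereas the real side depends on the tokenizer. The idea is that samples are padded to a common length and the labels are a copy of the inputs in which padded positions are marked so the loss ignores them.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def collate_causal_lm(samples, pad_token_id):
    """Pad to the longest sample and build labels as a copy of the inputs."""
    max_len = max(len(ids) for ids in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        padding = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * padding)
        batch["attention_mask"].append([1] * len(ids) + [0] * padding)
        # Padded positions get IGNORE_INDEX so they do not contribute to the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * padding)
    return batch


# Two tokenized samples of different length (invented IDs, hypothetical pad ID 0).
batch = collate_causal_lm([[11, 12, 13], [21, 22]], pad_token_id=0)
print(batch["labels"])
```

The shift between inputs and labels (predict token t+1 from tokens up to t) is not visible here because causal LM models in Transformers apply it internally when computing the loss.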


In [None]:
#Training first model.
'''Sets up the Trainer object with your prompt-tuned PEFT model, training args, and prompt dataset.'''
trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_sample_prompt)
'''Starts the training loop: Loads batches from your train_sample_prompt. Runs forward and backward passes only updating your virtual prompt tokens.
Saves checkpoints to peft_outputs_prompt (your specified directory). Logs training progress (loss, step, etc.).'''
trainer_prompt.train()

In [None]:
#Training second model.
trainer_sentences = create_trainer(peft_model_sentences, training_args_sentences, train_sample_sentences)
trainer_sentences.train()

In less than 10 minutes (CPU time on an M1 Pro) we trained two different models for two different tasks, using the same foundational model as a base.

## Save models
We are going to save the models. These models are ready to be used, as long as we have the pre-trained model from which they were created in memory.

In [None]:
'''Your base foundation model (foundational_model) weights remain unchanged.
The prompt tuning adapters (virtual tokens embeddings) are saved in those directories.
This makes it easy to reload and apply the fine-tuned prompts later without retraining.'''
trainer_prompt.model.save_pretrained(output_directory_prompt)
trainer_sentences.model.save_pretrained(output_directory_sentences)


## Inference

You can load a model from the path where it was saved and ask it to generate text from the same inputs we used earlier.

In [None]:
'''The PeftModel.from_pretrained() method expects the base model object (already loaded) as the first argument, not the model name string. So you first load the base model
(here foundational_model, the instance loaded earlier), then wrap it with the PEFT adapter from your saved directory.'''
from peft import PeftModel

loaded_model_prompt = PeftModel.from_pretrained(foundational_model,
                                         output_directory_prompt,
                                         #device_map='auto',
                                         is_trainable=False)

In [None]:
'''Calls your previously defined get_outputs() function, which runs .generate() on the model with the tokenized input prompt.

Then decodes the generated token IDs back into readable text, skipping special tokens like <eos>.'''
loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)
print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))

Comparing both answers, we can see that the output changed.
* ***Pretrained Model:*** *I want you to act as a motivational coach.  Don't be afraid of being challenged.*
* ***Fine Tuned Model:*** *I want you to act as a motivational coach.  You can use this method if you're not sure what your goals are.*

We have to keep in mind that we have only trained the model for a few minutes, but they have been enough to obtain a response closer to what we were looking for.

In [None]:
loaded_model_sentences = PeftModel.from_pretrained(foundational_model,
                                         output_directory_sentences,
                                         #device_map='auto',
                                         is_trainable=False)

In [None]:
loaded_model_sentences_outputs = get_outputs(loaded_model_sentences, input_sentences)
print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))

With the second model we have a similar result.
* **Pretrained Model:** *There are two things that matter: the size and shape of a flower*
* **Fine Tuned Model:** *There are two things that matter: one is the weather and another, what you do.*



# Conclusion
Prompt Tuning is an amazing technique that can save us hours of training and a significant amount of money. In the notebook, we have trained two models in just a few minutes, and we can have both models in memory, providing service to different clients.

If you want to try different combinations and models, the notebook is ready to use another model from the Bloom family.

You can change the number of epochs to train, the number of virtual tokens, and the model in the configuration cell near the top of the notebook. There are many other settings to experiment with as well. If you're looking for a good exercise, you can replace the random initialization of the virtual tokens with an initialization from text (`PromptTuningInit.TEXT`).

*The responses of the fine-tuned models may vary every time we train them. I've pasted the results of one of my trainings, but the actual results may differ.*