In [1]:
import transformers
transformers.__version__

'4.38.1'

##### Prompt Methods Concept
https://huggingface.co/docs/peft/conceptual_guides/prompting

- Hard prompts are hand-crafted text prompts made of discrete tokens; designing a good one takes a lot of effort.

- Soft prompts are "learnable tensors" concatenated with the input embeddings that can be **optimized for the dataset**. The drawback is that the prompt isn't human readable: the virtual tokens **are not matched** to the embeddings of real words.

* Prompt Tuning:

    - Developed for text classification on T5 models, with classification cast as a generation task

    - Prompt tokens are added to the model's input embeddings as a series of virtual tokens

    - Prompt tokens have their own parameters and gradients, which are updated during training

* Prefix Tuning:

    - Developed for NLG on GPT models

    - Similar to prompt tuning in adding trainable task-specific vectors

    - Prefix parameters are inserted into all the layers

    - A separate FFN updates the prefix parameters and is later discarded

* P-Tuning: (Prompt Encoding)

    - Developed for NLU on all language models

    - Adds trainable embedding tensors that can be optimized to find better prompts

    - Uses a prompt encoder

    - Prompt tokens can be inserted anywhere in the input sequence, not only at the start

    - Prompt tokens are added only to the input embeddings

    - Introduces anchor tokens, which improve performance
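To make the "concatenated with input embeddings" idea concrete, here is a minimal framework-free sketch of what happens to one sequence when virtual tokens are placed in front of it. The function name `prepend_virtual_tokens` and the toy vectors are hypothetical and only illustrate the shapes; a real implementation works on tensors and keeps the prompt vectors as trainable parameters.

```python
def prepend_virtual_tokens(prompt_embeds, input_embeds, attention_mask, labels=None):
    """Prepend soft-prompt vectors to one sequence of input embeddings.

    prompt_embeds: list of num_virtual_tokens vectors (the trainable part)
    input_embeds:  list of seq_len vectors from the frozen embedding layer
    """
    n_virtual = len(prompt_embeds)
    # virtual tokens go in front of the real token embeddings
    combined = prompt_embeds + input_embeds
    # virtual tokens are always attended to
    mask = [1] * n_virtual + attention_mask
    # virtual tokens have no target, so the loss ignores them (-100)
    new_labels = None if labels is None else [-100] * n_virtual + labels
    return combined, mask, new_labels


# toy example: 2 virtual tokens, 3 real tokens, hidden size 4
hidden = 4
prompt_embeds = [[0.1] * hidden, [0.2] * hidden]
input_embeds = [[1.0] * hidden, [2.0] * hidden, [3.0] * hidden]
embeds, mask, labels = prepend_virtual_tokens(
    prompt_embeds, input_embeds, attention_mask=[1, 1, 1], labels=[11, 12, 13]
)
print(len(embeds), mask, labels)
```

This covers the input-level case (prompt tuning and p-tuning). Prefix tuning differs in that the extra vectors are supplied to every layer rather than only to the input embeddings.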

Experiments Done:

The experiments below were run as causal language generation, which is not the right setup for these methods. They need to be repeated on a specific task with a matching dataset.

- With bigscience/bloomz:

  a) Prompt-Encoding : Worked

  b) Prefix-Tuning : Worked

  c) Prompt-Tuning : Worked

  Datasets : OpenAssistant/oasst_top1_2023-08-25

  Trainer : Transformers Trainer / Training Args

- With Gemma Model:

  a) Prompt-Encoding : Worked

  b) Prefix-Tuning : Failed

  c) Prompt-Tuning : Worked

  Datasets : OpenAssistant/oasst_top1_2023-08-25

  Trainer : Transformers Trainer / Training Args

##### Initial setup

In [1]:
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    default_data_collator,
    get_linear_schedule_with_warmup,
)
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import (
    load_dataset,
    Dataset,
)
from peft import (
    LoraConfig,
    PeftConfig,
    PromptEncoderConfig,
    PrefixTuningConfig,
    IA3Config,
    get_peft_model,
    TaskType,
    PromptTuningInit,
    PromptTuningConfig,
)
import torch
from torch import optim
from torch.utils.data import DataLoader
from tqdm import tqdm
import os

In [2]:
device = "cuda"
# model_path = "bigscience/bloomz-560m" 
# model_path = "facebook/opt-350m"
model_path = "google/gemma-2b-it"
lr = 1e-3
num_epochs = 1
batch_size = 1
steps = 150

save_location = "/home/aicoder/training/multi_promptmeth_gemma"
# save_location = "/home/aicoder/training/multi_bloomz"
# save_location = "/home/aicoder/training/multi_opt350"  # using qlora_minimal with AutoSeq2SeqLM
# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [3]:
# creating the BnB config for 4-bit (NF4) loading
bnb_4bit_nf4 = BitsAndBytesConfig(load_in_4bit=True,
                                  bnb_4bit_quant_type='nf4',
                                  bnb_4bit_compute_dtype=torch.bfloat16)

In [4]:
# loading the model with 4-bit NF4 quantisation (nested/double quantisation is not enabled in the config above)
# device_map="auto" takes 2.70 GB VRAM
# a causal LM class is used here; AutoModelForSeq2SeqLM is only for encoder-decoder models
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             # load_in_4bit=True is deprecated; pass quantization_config instead
                                             quantization_config=bnb_4bit_nf4,
                                             device_map="auto",
                                             torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
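def test_prompt(prompt, your_model):
    # tokenize the prompt and move the tensors to the target device
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    # generate() returns token ids, not logits
    output_ids = your_model.generate(**inputs,
                                     max_new_tokens=50)
    return tokenizer.decode(output_ids[0])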

In [6]:
tokenizer.chat_template

"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

In [6]:
# Skip this cell for the Gemma model: its tokenizer should not be modified
# Adds <|im_start|> as a regular token and <|im_end|> as the special EOS token
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
model.resize_token_embeddings(len(tokenizer))
model.config.eos_token_id = tokenizer.eos_token_id
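For Gemma, the chat template printed above already defines its own turn markers. As a reading aid, the sketch below re-implements that Jinja template in plain Python. The function name `format_gemma_chat` is hypothetical, and the default `bos_token="<bos>"` is an assumption, since the notebook never prints the tokenizer's BOS token.

```python
def format_gemma_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python rendering of the Gemma chat template shown above."""
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    parts = [bos_token]
    for index, message in enumerate(messages):
        # even positions must be user turns, odd positions assistant turns
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        # the template renames the assistant role to "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(
            f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
        )
    if add_generation_prompt:
        # open a model turn so generation continues as the assistant
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": " What are the zodiac signs? "}]
rendered = format_gemma_chat(chat, add_generation_prompt=True)
print(rendered)
```

In practice the same string should come from `tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)`; the manual version is only meant to show what the template does.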

In [5]:
# Testing the loaded model
# DONT RUN THIS WHEN PLANNING TO TRAIN
input_text = "What is the zodiac signs"
print(test_prompt(input_text,model))

</s>What is the zodiac signs?

The zodiac signs are the signs of the zodiac. The zodiac signs are the signs of the zodiac. The zodiac signs are the signs of the zodiac. The zodiac signs are the signs of the zodiac


##### Moving the models

In [5]:
# moving the model back to the CPU still keeps the GPU memory allocated
# and calling .to() on a quantised model throws an error
cpu_model = model.to('cpu')

In [6]:
# deleting the quantised model on its own does not release the memory either
del model

In [7]:
# after emptying the CUDA cache, the memory is released from VRAM
torch.cuda.empty_cache()

##### Preparing Datasets (fails when returning the data batches)

In [6]:
# preparing the dataset for training
import pandas as pd
import json

from datasets import load_dataset

ds = load_dataset("ought/raft", "twitter_complaints")

classes = [k.replace("_", " ") for k in ds["train"].features["Label"].names]

ds = ds.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])  # 4

max_length = 64

def preprocess_function(examples, text_column="Tweet text", label_column="text_label"):
    batch_size = len(examples[text_column])
    inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
    targets = [str(x) for x in examples[label_column]]
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    classes = [k.replace("_", " ") for k in ds["train"].features["Label"].names]
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

process_ds = ds.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=ds["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_ds = process_ds["train"]
eval_ds = process_ds["test"]

batch_size = 16

train_dataloader = DataLoader(train_ds, shuffle=True,
                              collate_fn=default_data_collator,
                              batch_size=batch_size,
                              pin_memory=True)
eval_dataloader = DataLoader(eval_ds, collate_fn=default_data_collator,
                             batch_size=batch_size,
                             pin_memory=True)

Running tokenizer on dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Running tokenizer on dataset:   0%|          | 0/3399 [00:00<?, ? examples/s]
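One probable reason the batches fail: in `preprocess_function` above, each label row is built from `max_length - len(sample_input_ids)` ignore tokens plus the label ids, so its length varies per example, and the label ids are never appended to the input ids. The sketch below shows one way to build rows of equal length for a causal LM. `build_example` is a hypothetical helper that works on plain lists of token ids and is not the notebook's code.

```python
def build_example(prompt_ids, label_ids, pad_id, max_length):
    """Return left-padded input_ids, attention_mask and labels of equal length."""
    # full sequence the causal LM sees: prompt followed by the label tokens
    input_ids = prompt_ids + label_ids
    # loss is computed on the label tokens only
    labels = [-100] * len(prompt_ids) + label_ids
    attention_mask = [1] * len(input_ids)

    # left-pad all three lists by the same amount
    pad_len = max(0, max_length - len(input_ids))
    input_ids = [pad_id] * pad_len + input_ids
    attention_mask = [0] * pad_len + attention_mask
    labels = [-100] * pad_len + labels

    # truncate so that every row has exactly max_length entries
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


row = build_example(prompt_ids=[5, 6, 7], label_ids=[8, 9], pad_id=0, max_length=8)
print(row)
```

Because all three lists share one length, a collator such as `default_data_collator` can stack them into tensors. Note that the final truncation cuts from the right, so an over-long prompt would lose its label tokens; a larger `max_length` avoids that.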

##### Preparing datasets 

In [6]:
dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")
# BloomZ
# the dataset and below collate function works for prompt_encoding
# the dataset and below collate function works for prefix_tuning
# the dataset and below collate function works for Prompt_tuning
# Gemma
# the dataset and below collate function works for prompt_encoding
# the dataset and below collate function FAILED for prefix_tuning
# the dataset and below collate function works for Prompt_tuning
def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        # max_length=512,
        max_length=1024,
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize, 
    batched=True, 
    num_proc=os.cpu_count(),    # one worker process per CPU core
    remove_columns=["text"]     # the raw text is no longer needed once it is tokenized
)

In [16]:
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 12947
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 690
    })
})

In [7]:
# define collate function - transform list of dictionaries [ {input_ids: [123, ..]}, {.. ] to single batch dictionary { input_ids: [..], labels: [..], attention_mask: [..] }
def collate(elements):
    tokenlist=[e["input_ids"] for e in elements]
    tokens_maxlen=max([len(t) for t in tokenlist])

    input_ids,labels,attention_masks = [],[],[]
    for tokens in tokenlist:
        pad_len=tokens_maxlen-len(tokens)

        # pad input_ids with pad_token, labels with ignore_index (-100) and set attention_mask 1 where content otherwise 0
        input_ids.append( tokens + [tokenizer.pad_token_id]*pad_len )   
        labels.append( tokens + [-100]*pad_len )    
        attention_masks.append( [1]*len(tokens) + [0]*pad_len ) 

    batch={
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_masks)
    }
    return batch

##### Model size details

In [9]:
# reviewing model footprint, use this for all models
model.get_memory_footprint()

2576502784

In [11]:
# looking at the model architecture
model

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250682, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear4bit(in_features=1024, out_features=3072, bias=True)
          (dense): Linear4bit(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear4bit(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear4bit(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affi

### Run the cells below when doing PEFT

In [8]:
# configs for the three prompt methods: p-tuning (prompt encoder), prefix tuning and prompt tuning

prompt_enc_config = PromptEncoderConfig(task_type="CAUSAL_LM",
                                        num_virtual_tokens=20,
                                        encoder_hidden_size=128)

prefix_tun_config = PrefixTuningConfig(task_type="CAUSAL_LM",
                                       num_virtual_tokens=20)

prompt_tuning_init_text = "Classify if the tweet is a complaint or no complaint.\n"

prompt_tuning_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=len(tokenizer(prompt_tuning_init_text)["input_ids"]),
    prompt_tuning_init_text=prompt_tuning_init_text,
    tokenizer_name_or_path=model_path,
)

In [11]:
prompt_enc_model = get_peft_model(model, prompt_enc_config)
prompt_enc_model.print_trainable_parameters()
# bloomz "trainable params: 300,288 || all params: 559,514,880 || trainable%: 0.05366935013417338"
# gemma "trainable params: 583,936 || all params: 2,506,756,352 || trainable%: 0.023294485701975395"

trainable params: 583,936 || all params: 2,506,756,352 || trainable%: 0.023294485701975395


In [9]:
prefix_tun_model = get_peft_model(model, prefix_tun_config)
prefix_tun_model.print_trainable_parameters()
# Bloomz "trainable params: 983,040 || all params: 560,197,632 || trainable%: 0.1754809274167014"
# Gemma "trainable params: 1,474,560 || all params: 2,507,646,976 || trainable%: 0.05880253536931667"

trainable params: 1,474,560 || all params: 2,507,646,976 || trainable%: 0.05880253536931667


In [9]:
prompt_tun_model = get_peft_model(model, prompt_tuning_config)
prompt_tun_model.print_trainable_parameters()
# Bloomz "trainable params: 8,192 || all params: 559,222,784 || trainable%: 0.0014648902430985358"
# Gemma "trainable params: 26,624 || all params: 2,506,199,040 || trainable%: 0.0010623258398502937"

trainable params: 26,624 || all params: 2,506,199,040 || trainable%: 0.0010623258398502937
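The trainable-parameter counts printed above can be reproduced with simple arithmetic, which is a useful sanity check on what each method trains. The sketch below makes several assumptions that the notebook does not state:

- Gemma-2b has hidden size 2048 and 18 layers. The Bloomz values (1024 and 24) come from the architecture dump.
- The prompt encoder is a three-layer MLP on top of an embedding table.
- Prefix tuning runs without a projection layer.
- The prompt-tuning init text tokenizes to 8 tokens for Bloomz and 13 for Gemma, as inferred from the printed counts.

```python
def prompt_tuning_params(num_virtual_tokens, hidden):
    # one trainable embedding vector per virtual token
    return num_virtual_tokens * hidden


def prefix_tuning_params(num_virtual_tokens, num_layers, hidden):
    # one key vector and one value vector per layer for every virtual token
    return num_virtual_tokens * num_layers * 2 * hidden


def prompt_encoder_params(num_virtual_tokens, hidden, encoder_hidden):
    # embedding table for the virtual tokens
    embedding = num_virtual_tokens * hidden
    # assumed MLP: hidden -> encoder_hidden -> encoder_hidden -> hidden (weights + biases)
    mlp = (
        (hidden * encoder_hidden + encoder_hidden)
        + (encoder_hidden * encoder_hidden + encoder_hidden)
        + (encoder_hidden * hidden + hidden)
    )
    return embedding + mlp


bloomz = {"hidden": 1024, "layers": 24}   # from the architecture dump
gemma = {"hidden": 2048, "layers": 18}    # assumed, not shown in the notebook

print(prompt_encoder_params(20, bloomz["hidden"], 128),
      prompt_encoder_params(20, gemma["hidden"], 128))
print(prefix_tuning_params(20, bloomz["layers"], bloomz["hidden"]),
      prefix_tuning_params(20, gemma["layers"], gemma["hidden"]))
print(prompt_tuning_params(8, bloomz["hidden"]),
      prompt_tuning_params(13, gemma["hidden"]))
```

Under these assumptions the results match the six counts recorded in the cell comments, which supports reading prompt tuning as the smallest of the three and prefix tuning as the largest.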


##### PyTorch training part (not used)

In [14]:
prompt_enc_optimizer = optim.AdamW(prompt_enc_model.parameters(),
                                   lr=lr)

In [None]:
prompt_tun_optimizer = optim.AdamW(prompt_tun_model.parameters(),
                                   lr=lr)

In [12]:
prefix_tun_optimizer = optim.AdamW(prefix_tun_model.parameters(),
                                   lr=lr)

###### The train loop below fails

In [15]:
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=prompt_enc_optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs)
)

NameError: name 'train_dataloader' is not defined

In [14]:
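from tqdm import tqdm
model = prompt_enc_model
# the loop below refers to `optimizer`; bind it to the optimizer created for this model
optimizer = prompt_enc_optimizer

device = "cuda"
model = model.to(device)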

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")

  0%|          | 0/4 [00:00<?, ?it/s]


ValueError: expected sequence of length 38 at dim 1 (got 23)

##### qloraminimal trainloop

In [10]:
bs=1        # batch size

ga_steps=1  # gradient acc. steps

epochs=1

steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)

In [11]:
args = TrainingArguments(
    output_dir=save_location,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=200,     # evaluate every 200 steps
    save_steps=200,     # save a checkpoint every 200 steps
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    # optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    # fp16=True,
    ddp_find_unused_parameters=False,

)
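As a quick check of how often evaluation and checkpointing run with these arguments, the arithmetic below uses the train split size printed earlier (12,947 rows) together with `bs`, `ga_steps`, `epochs` and `eval_steps=200`. It is only a sketch; the Trainer's own step count may differ slightly in how it rounds.

```python
train_rows = 12947            # num_rows of the train split printed earlier
bs, ga_steps, epochs = 1, 1, 1
eval_steps = 200

# optimizer steps in one pass over the training data
steps_per_epoch = train_rows // (bs * ga_steps)
total_steps = steps_per_epoch * epochs
# how many times evaluation (and saving) is triggered within one epoch
evals_per_epoch = steps_per_epoch // eval_steps
print(steps_per_epoch, total_steps, evals_per_epoch)
```

With these values evaluation runs many times per epoch. To evaluate only once per epoch, `eval_steps` could be set to `steps_per_epoch`, or `evaluation_strategy="epoch"` could be used instead.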

In [13]:
trainer = Trainer(
    # model=prompt_enc_model,
    # model=prefix_tun_model,
    model=prompt_tun_model,
    tokenizer=tokenizer,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args,
)


In [14]:
trainer.train()

Step,Training Loss,Validation Loss
200,1.6094,2.214934


Checkpoint destination directory /home/aicoder/training/multi_promptmeth_gemma/checkpoint-200 already exists and is non-empty. Saving will proceed but saved results may be invalid.


KeyboardInterrupt: 

In [16]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("/home/aicoder/training/multi_bloomz/checkpoint-400/").to("cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
# NOTE: the tokenizer must belong to the same base model as the checkpoint loaded above
tokenizer = AutoTokenizer.from_pretrained(model_path)
text_column = "text"
i = 15
inputs = tokenizer(f'{text_column} : {dataset["test"][i]["text"]} Label : ',
                   return_tensors="pt")
print(dataset["test"][i]["text"])

<|im_start|>user
¿Cuál es el sentido de la vida?<|im_end|>
<|im_start|>assistant
Algunas corrientes filosóficas, como el existencialismo, proponen que puedes encontrar el sentido de la vida en el valor humano, la mejora del ser y la búsqueda del sentido.

Por otro lado, el nihilismo, aunque aborrece la idea, dice que la vida no tiene un sentido.

En balance de estas dos ideas, aparece el absurdismo o filosofía de lo absurdo, que apoya la idea de que la vida carece de sentido, pero no odia la idea, al contrario, reitera que el hecho de que la vida carezca de sentido no significa que no merezca la pena ser vivida, el mayor exponente de esta corriente filosófica es el escritor francés Albert Camus.

Como tal, la vida no tiene un sentido que puedas encontrar o definir, sin embargo, esto es una gran noticia, porque, el sinsentido de la vida nos dota de la libertad de no cumplir un propósito, permitiéndonos crear uno desde nuestro interior.<|im_end|>



In [20]:
with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))


['text : <|im_start|>user\n¿Cuál es el sentido de la vida?<|im_end|>\n<|im_start|>assistant\nAlgunas corrientes filosóficas, como el existencialismo, proponen que puedes encontrar el sentido de la vida en el valor humano, la mejora del ser y la búsqueda del sentido.\n\nPor otro lado, el nihilismo, aunque aborrece la idea, dice que la vida no tiene un sentido.\n\nEn balance de estas dos ideas, aparece el absurdismo o filosofía de lo absurdo, que apoya la idea de que la vida carece de sentido, pero no odia la idea, al contrario, reitera que el hecho de que la vida carezca de sentido no significa que no merezca la pena ser vivida, el mayor exponente de esta corriente filosófica es el escritor francés Albert Camus.\n\nComo tal, la vida no tiene un sentido que puedas encontrar o definir, sin embargo, esto es una gran noticia, porque, el sinsentido de la vida nos dota de la libertad de no cumplir un propósito, permitiéndonos crear uno desde nuestro interior.<|im_end|>\n Label : lar),),),<unu