**distinctions between prefix language modelling (conditional language modelling) and autoregressive language modelling**

Prefix Language Modelling can be considered as the case wherin the training (for the predictions) are done on a fixed prefix of the text sequence and not the entire text. Context is learned via the prefix chosen (similar to conditional probability). Training is faster.

Autoregressive LM takes into account the complete sequence seen so far (all prev. words). Full context of prediction is taken into account. Training is relatively difficult.


!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

In [2]:
import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m", load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/693 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

In [4]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
from peft import PrefixEncoder, PrefixTuningConfig, get_peft_model

# config = PrefixTuningConfig(
#     peft_type="PREFIX_TUNING",
#     task_type="SEQ_2_SEQ_LM",
#     num_virtual_tokens=10,
#     token_dim=768,
#     num_transformer_submodules=1,
#     num_attention_heads=12,
#     num_layers=12,
#     encoder_hidden_size=768,
# )

config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=30)

# prefix_encoder = PrefixEncoder(config)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 1474560 || all params: 560689152 || trainable%: 0.26299064191632515


In [6]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

Downloading readme:   0%|          | 0.00/5.55k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/647k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2508 [00:00<?, ? examples/s]

In [7]:
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

Step,Training Loss
1,9.2569
2,9.2254
3,8.0559
4,9.3824
5,9.1456
6,10.2141
7,8.5074
8,8.5655
9,8.8533
10,8.8995


TrainOutput(global_step=10, training_loss=9.010600471496582, metrics={'train_runtime': 4.4413, 'train_samples_per_second': 36.025, 'train_steps_per_second': 2.252, 'total_flos': 20257421131776.0, 'train_loss': 9.010600471496582, 'epoch': 0.06})

In [8]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [9]:
from datasets import load_dataset

encodings = tokenizer("\n\n".join(str(v) for v in data['train']), return_tensors="pt")

In [12]:
import torch
from tqdm import tqdm

max_length = 100000
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
device = "cuda"
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

  0%|          | 0/1371 [00:00<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 37.25 GiB. GPU 0 has a total capacty of 14.58 GiB of which 11.38 GiB is free. Process 14377 has 3.19 GiB memory in use. Of the allocated memory 2.90 GiB is allocated by PyTorch, and 167.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF