<a href="https://colab.research.google.com/github/Lastget/Lord_of_the_Ring_LLM/blob/main/Bloom3B_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a model that generates text in the style of The Lord of the Rings
  - Fine-tune Bloom-3B with PEFT QLoRA adapters.
  - Training data: the text of The Lord of the Rings.
  - Base model: Bloom-3B from Hugging Face.



In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U accelerate
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q datasets



### Data Preprocessing
Preprocess the data and, optionally, push it to the Hugging Face Hub.

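# Fine-tune the LLM
To reduce training memory and limit catastrophic forgetting, we use the following two techniques:
- PEFT with QLoRA (Parameter-Efficient Fine-Tuning: low-rank adapters trained on top of a frozen, quantized base model)
- Quantization (the base model weights are loaded in 4-bit)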
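
To make the low-rank adapter idea concrete, here is a minimal sketch in plain Python lists, not the PEFT implementation. The frozen weight `W` is left untouched and only the two small matrices `A` and `B` would be trained. The sizes, the names `matvec` and `lora_forward`, and the zero initialization of `B` are illustrative assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)                # frozen pretrained projection
    update = matvec(B, matvec(A, x))   # low-rank path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes, chosen only for illustration.
d_in, d_out, r, alpha = 8, 6, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # trainable, d_out x r
x = [rng.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(x, W, A, B, alpha, r)

full_params = d_in * d_out        # parameters in the frozen weight
lora_params = r * (d_in + d_out)  # parameters in A and B together
print(full_params, lora_params)
```

Because `B` starts at zero in this sketch, the adapter initially leaves the base model's output unchanged, and training only has to learn the low-rank correction.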


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-3b"
# model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Use the double quantization proposed by QLoRA.
    bnb_4bit_use_double_quant=True,
    # 4-bit NormalFloat
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Get tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load model in 4bit
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})



Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-3b and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
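
As a rough illustration of what 4-bit storage means, the sketch below quantizes one block of weights with a shared scale and then restores it. It uses uniform absmax rounding as a stand-in: the `nf4` type selected above uses non-uniform levels, and bitsandbytes handles all of this internally. The function names and the sample weights are hypothetical.

```python
def quantize_block(values, levels=7):
    """Map floats to integer codes in [-levels, levels] with one shared scale."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = absmax / levels
    codes = [round(v / scale) for v in values]   # 15 possible codes fit in 4 bits
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.5]  # made-up block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as a small integer plus one scale per block, so the rounding error per weight stays within half a quantization step.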


In [3]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [4]:
from peft import LoraConfig, get_peft_model
from peft import prepare_model_for_kbit_training

# Gradient checkpointing saves memory: activations are recomputed during the backward pass instead of being stored.
model.gradient_checkpointing_enable()

model = prepare_model_for_kbit_training(model)

# use_cache is not possible with gradient checkpointing
model.config.use_cache = False

config = LoraConfig(
    r=16, # the dimension of the low-rank matrices
    lora_alpha=32, # scaling factor for the weight matrices,  a higher lora_alpha value assigns more weight to the LoRA activations.
    target_modules=["query_key_value"], # The modules (for example, attention blocks) to apply the LoRA update matrices.
    lora_dropout=0.05, # dropout probability of the LoRA layers
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4915200 || all params: 1827824640 || trainable%: 0.26890982277161996
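
The numbers printed above can be reproduced by hand. The sketch below assumes the published bloom-3b architecture values (hidden size 2560, 30 layers, vocabulary 250880, tied output embedding), which are not shown in this notebook, and assumes that 4-bit linear weights are stored packed two per element, which would explain why the total is below 3 billion.

```python
# Assumed bloom-3b architecture values (not taken from this notebook).
hidden = 2560
n_layers = 30
vocab = 250880
r = 16  # LoRA rank from the LoraConfig above

# LoRA adds A (r x in) and B (out x r) to each query_key_value projection.
qkv_in, qkv_out = hidden, 3 * hidden
lora_total = n_layers * r * (qkv_in + qkv_out)

# Linear weights per layer: qkv, attention output, and the two MLP projections.
linear_per_layer = hidden * 3 * hidden + hidden * hidden + 2 * hidden * 4 * hidden
packed_linear = n_layers * linear_per_layer // 2  # assumption: two 4-bit values per element

# Biases of the four linear layers and the two layer norms (weight + bias) per layer.
bias_per_layer = 3 * hidden + hidden + 4 * hidden + hidden
layernorm_per_layer = 2 * 2 * hidden
embeddings = vocab * hidden           # input embedding, shared with the output head
extra_layernorms = 2 * 2 * hidden     # embedding layer norm and final layer norm

all_params = (packed_linear
              + n_layers * (bias_per_layer + layernorm_per_layer)
              + embeddings + extra_layernorms + lora_total)

print(lora_total, all_params, 100 * lora_total / all_params)
```

Under these assumptions the adapters account for about 0.27% of the counted parameters, in line with the printed output.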


# Prepare dataset

In [5]:
from datasets import load_dataset

!wget https://raw.githubusercontent.com/jeremyarancio/llm-tolkien/main/llm/data/extracted_text.jsonl

--2023-07-19 09:25:46--  https://raw.githubusercontent.com/jeremyarancio/llm-tolkien/main/llm/data/extracted_text.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2546792 (2.4M) [text/plain]
Saving to: ‘extracted_text.jsonl’


2023-07-19 09:25:47 (58.2 MB/s) - ‘extracted_text.jsonl’ saved [2546792/2546792]



In [6]:
# import pdfplumber
import json
from datasets import Dataset


def preprocess_text(text) -> str:
  text = text.replace('\n', ' ')
  return text


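def preprocess_data(dataset_path, min_length, tokenizer) -> str:
  '''
    Prepare the dataset for training from the jsonl file.
    Load the JSON Lines file, mark sentence ends (a period followed by a newline)
    with the EOS token, and replace the remaining newlines with spaces.
    Pages whose text is not longer than min_length characters are skipped.
  '''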
  with open(dataset_path, 'r') as f:
    grouped_text = ""
    for line in f:
      sen_dict = json.loads(line)
      text = list(sen_dict.values())[0]
      if len(text) > min_length:
        grouped_text += text

    # Replace to EOS
    grouped_text = grouped_text.replace(".\n", "." + tokenizer.eos_token)
    return preprocess_text(grouped_text)


def tokenize(element, tokenizer, context_length) -> dict:
  '''
    Tokenize the text into chunks of context_length tokens.
    The last chunk, which is shorter than the context length, is dropped.
  '''
  inputs = tokenizer(element['text'], truncation=True, return_overflowing_tokens=True,
                     return_length=True, max_length=context_length)
  inputs_batch = []
  for length, input_ids in zip(inputs['length'], inputs['input_ids']):
    # Drop the last input_ids that are shorter than max_length
    if length == context_length:
      inputs_batch.append(input_ids)
  return {"input_ids": inputs_batch}


def prepare_dataset(dataset_path, min_length, context_length,
                    test_size, train_size, shuffle, model_id):
    """Prepare the dataset for training and return the train/test splits.
    Pushing to the Hub is left commented out below.
    """
    tokenizer =  AutoTokenizer.from_pretrained(model_id)
    # Get all info in str
    text = preprocess_data(dataset_path, min_length, tokenizer)
    # Wrap it in a Dataset: features = "text", num_rows = 1
    dataset = Dataset.from_dict({'text': [text]})
    # Apply the tokenize function
    tokenized_dataset = dataset.map(tokenize, batched=True, fn_kwargs={'tokenizer': tokenizer, 'context_length': context_length},
                                         remove_columns=dataset.column_names)
    tokenized_dataset_dict = tokenized_dataset.train_test_split(test_size=test_size, train_size=train_size, shuffle=shuffle)
    # push processed data to hugging face repo
    # tokenized_dataset_dict.push_to_hub(hf_repo)
    return tokenized_dataset_dict
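
The chunking done in `tokenize` can be pictured with a toy example. This is a simplified stand-in that works on a plain list of token ids rather than calling the tokenizer, and `chunk_token_ids` is a hypothetical helper name.

```python
def chunk_token_ids(token_ids, context_length):
    """Split ids into consecutive chunks and keep only the full-length ones."""
    chunks = [token_ids[i:i + context_length]
              for i in range(0, len(token_ids), context_length)]
    # The trailing chunk is dropped when it is shorter than context_length.
    return [chunk for chunk in chunks if len(chunk) == context_length]


toy_ids = list(range(10))  # stand-in for a tokenized book
print(chunk_token_ids(toy_ids, 4))
```

Keeping only full-length chunks means every training example has the same length, so no padding is needed within a batch.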


In [7]:
dataset = prepare_dataset(dataset_path = "/content/extracted_text.jsonl",
                          min_length = 2,
                          context_length = 2048,
                          test_size = 0.2,
                          train_size = 0.8,
                          shuffle = True,
                          model_id = model_id)


# Finetune Model

In [8]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset = dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8, # effective batch size = 1 x 8 sequences per optimizer step
        warmup_steps=2, # learning-rate warmup over the first 2 steps
        max_steps=20,
        learning_rate=2e-4,
        fp16=True, # mixed-precision training in float16
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit" # Paged 8-bit AdamW for better memory management. Without it, we get out-of-memory errors.
    ),
    # The data collator pads each batch and copies input_ids to labels; the model shifts them internally for causal LM.
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Step,Training Loss
1,3.3008
2,3.2018
3,3.2217
4,3.237
5,3.1292
6,3.115
7,3.1179
8,3.0734
9,3.0717
10,3.1228


TrainOutput(global_step=20, training_loss=3.0869220018386843, metrics={'train_runtime': 952.7506, 'train_samples_per_second': 0.168, 'train_steps_per_second': 0.021, 'total_flos': 2330929083187200.0, 'train_loss': 3.0869220018386843, 'epoch': 0.67})
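
A quick sanity check on how much data this short run actually saw. The values are copied from the training arguments and the `TrainOutput` above; the only assumptions are a single GPU and the context length of 2048 used when preparing the dataset.

```python
# Values copied from the TrainingArguments and TrainOutput in this notebook.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
max_steps = 20
context_length = 2048
train_runtime_s = 952.7506
train_samples_per_second = 0.168

# Assumes a single GPU.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = sequences_per_step * max_steps
tokens_seen = sequences_seen * context_length

# Cross-check against the throughput reported by the Trainer.
sequences_from_log = train_runtime_s * train_samples_per_second

print(sequences_per_step, sequences_seen, tokens_seen, round(sequences_from_log))
```

Both routes give about 160 sequences, which is less than one pass over the training split (the log reports epoch 0.67), so the loss shown above reflects only a brief fine-tune.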

# Inference

In [9]:
prompt = "The hobbits were so suprised seeing their friend"

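# Move the inputs to the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=1,  # only used when do_sample=True; decoding here is greedy
    eos_token_id=tokenizer.eos_token_id,
    early_stopping=True  # only affects beam search
)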



In [11]:
print(tokenizer.decode(tokens[0], skip_special_tokens=True))

The hobbits were so suprised seeing their friend hobbit again that they did not know what to say. They were so glad to see him that they could not help laughing. They were so glad to see him that they could not help laughing. They were so glad to see him that they could not help laughing. They were so glad to see him that they could not help laughing. They were so glad to see him that they could not help laughing. They were so glad to see him that they could not help laughing. They were so glad
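
The repeated sentence in the output is typical of greedy decoding, which always picks the highest-scoring token. The toy sketch below, with made-up logits and a hypothetical helper name, shows that temperature reshapes the sampling distribution but never changes which token is the greedy pick.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)

# Greedy decoding takes the argmax, which is the same at any temperature.
greedy_sharp = max(range(len(sharp)), key=sharp.__getitem__)
greedy_flat = max(range(len(flat)), key=flat.__getitem__)
print(greedy_sharp, greedy_flat, sharp[0] > flat[0])
```

To get more varied text from `generate`, one option is to enable sampling with `do_sample=True`, possibly combined with `top_p` or `repetition_penalty`. Suitable values would need to be tuned for this model.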
