# Fine-tuning Dolly 2.0 with LoRA

*   Dolly-v2-3b - https://huggingface.co/databricks/dolly-v2-3b
*   LoRA paper - https://arxiv.org/abs/2106.09685
*   Alpaca Cleaned Dataset - https://github.com/gururise/AlpacaDataCleaned





In [1]:
!git clone https://github.com/gururise/AlpacaDataCleaned.git

Cloning into 'AlpacaDataCleaned'...

In [3]:
ls AlpacaDataCleaned

 Volume in drive C is Windows
 Volume Serial Number is B8E1-FE27

 Directory of C:\Users\rravula\AlpacaDataCleaned

06/26/2023  11:14 AM    <DIR>          .
06/26/2023  11:14 AM    <DIR>          ..
06/26/2023  11:14 AM    <DIR>          .github
06/26/2023  11:14 AM             3,238 .gitignore
06/26/2023  11:14 AM        23,034,003 alpaca_data.json
06/26/2023  11:14 AM        44,566,362 alpaca_data_cleaned.json
06/26/2023  11:14 AM        23,832,175 alpaca_data_cleaned_archive.json
06/26/2023  11:14 AM             4,485 alpacaModifier.py
06/26/2023  11:14 AM    <DIR>          assets
06/26/2023  11:14 AM            19,753 DATA_LICENSE
06/26/2023  11:14 AM    <DIR>          dataset_extensions
06/26/2023  11:14 AM    <DIR>          eval
06/26/2023  11:14 AM             9,142 generate_instruction.py
06/26/2023  11:14 AM    <DIR>          gui
06/26/2023  11:14 AM            11,558 LICENSE
06/26/2023  11:14 AM             3,096 modifierGui.py
06/26/2023  11:14 AM             1,762 prompt.tx

In [4]:
!pip install "accelerate>=0.12.0" "transformers[torch]==4.25.1"
!pip install -q datasets loralib sentencepiece
!pip -q install git+https://github.com/huggingface/peft.git
!pip -q install bitsandbytes

In [5]:
# Create Instruct Pipeline
import logging
import re

import numpy as np
from transformers import Pipeline, PreTrainedTokenizer

logger = logging.getLogger(__name__)

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = (
    "Below is an instruction that describes a task. Write a response that appropriately completes the request."
)

# This is the prompt that is used for generating responses using an already trained model.  It ends with the response
# key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)


def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
    """Gets the token ID for a given string that has been added to the tokenizer as a special token.
    When training, we configure the tokenizer so that the sequences like "### Instruction:" and "### End" are
    treated specially and converted to a single, new token.  This retrieves the token ID each of these keys map to.
    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token
    Raises:
        ValueError: if more than one ID was generated
    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]


class InstructionTextGenerationPipeline(Pipeline):
    def __init__(
        self, *args, do_sample: bool = True, max_new_tokens: int = 256, top_p: float = 0.92, top_k: int = 0, **kwargs
    ):
        super().__init__(*args, do_sample=do_sample, max_new_tokens=max_new_tokens, top_p=top_p, top_k=top_k, **kwargs)

    def _sanitize_parameters(self, return_instruction_text=False, **generate_kwargs):
        preprocess_params = {}

        # newer versions of the tokenizer configure the response key as a special token.  newer versions still may
        # append a newline to yield a single token.  find whatever token is configured for the response key.
        tokenizer_response_key = next(
            (token for token in self.tokenizer.additional_special_tokens if token.startswith(RESPONSE_KEY)), None
        )

        response_key_token_id = None
        end_key_token_id = None
        if tokenizer_response_key:
            try:
                response_key_token_id = get_special_token_id(self.tokenizer, tokenizer_response_key)
                end_key_token_id = get_special_token_id(self.tokenizer, END_KEY)

                # Ensure generation stops once it generates "### End"
                generate_kwargs["eos_token_id"] = end_key_token_id
            except ValueError:
                pass

        forward_params = generate_kwargs
        postprocess_params = {
            "response_key_token_id": response_key_token_id,
            "end_key_token_id": end_key_token_id,
            "return_instruction_text": return_instruction_text,
        }

        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, instruction_text, **generate_kwargs):
        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)
        inputs = self.tokenizer(
            prompt_text,
            return_tensors="pt",
        )
        inputs["prompt_text"] = prompt_text
        inputs["instruction_text"] = instruction_text
        return inputs

    def _forward(self, model_inputs, **generate_kwargs):
        input_ids = model_inputs["input_ids"]
        attention_mask = model_inputs.get("attention_mask", None)
        generated_sequence = self.model.generate(
            input_ids=input_ids.to(self.model.device),
            attention_mask=attention_mask,
            pad_token_id=self.tokenizer.pad_token_id,
            **generate_kwargs,
        )[0].cpu()
        instruction_text = model_inputs.pop("instruction_text")
        return {"generated_sequence": generated_sequence, "input_ids": input_ids, "instruction_text": instruction_text}

    def postprocess(self, model_outputs, response_key_token_id, end_key_token_id, return_instruction_text):
        sequence = model_outputs["generated_sequence"]
        instruction_text = model_outputs["instruction_text"]

        # The response will be set to this variable if we can identify it.
        decoded = None

        # If we have token IDs for the response and end, then we can find the tokens and only decode between them.
        if response_key_token_id and end_key_token_id:
            # Find where "### Response:" is first found in the generated tokens.  Considering this is part of the
            # prompt, we should definitely find it.  We will return the tokens found after this token.
            response_pos = None
            response_positions = np.where(sequence == response_key_token_id)[0]
            if len(response_positions) == 0:
                logger.warning(f"Could not find response key {response_key_token_id} in: {sequence}")
            else:
                response_pos = response_positions[0]

            if response_pos is not None:
                # Next find where "### End" is located.  The model has been trained to end its responses with this
                # sequence (or actually, the token ID it maps to, since it is a special token).  We may not find
                # this token, as the response could be truncated.  If we don't find it then just return everything
                # to the end.  Note that even though we set eos_token_id, we still see this token at the end.
                end_pos = None
                end_positions = np.where(sequence == end_key_token_id)[0]
                if len(end_positions) > 0:
                    end_pos = end_positions[0]

                decoded = self.tokenizer.decode(sequence[response_pos + 1 : end_pos]).strip()
        else:
            # Otherwise we'll decode everything and use a regex to find the response and end.

            fully_decoded = self.tokenizer.decode(sequence)

            # The response appears after "### Response:".  The model has been trained to append "### End" at the
            # end.
            m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", fully_decoded, flags=re.DOTALL)

            if m:
                decoded = m.group(1).strip()
            else:
                # The model might not generate the "### End" sequence before reaching the max tokens.  In this case,
                # return everything after "### Response:".
                m = re.search(r"#+\s*Response:\s*(.+)", fully_decoded, flags=re.DOTALL)
                if m:
                    decoded = m.group(1).strip()
                else:
                    logger.warning(f"Failed to find response in:\n{fully_decoded}")

        if return_instruction_text:
            return {"instruction_text": instruction_text, "generated_text": decoded}

        return decoded
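
The regex fallback in `postprocess` can be tried on plain strings, without loading a model. The following is a minimal sketch that reuses the two patterns from the pipeline above; the helper name `extract_response` and the sample strings are hypothetical and exist only for illustration.

```python
import re


def extract_response(fully_decoded: str):
    """Return the text after '### Response:', stopping at '### End' when it is present."""
    # First try to capture only the text between the response key and the end key.
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", fully_decoded, flags=re.DOTALL)
    if m is None:
        # No end key (for example, generation hit max_new_tokens): take everything after the key.
        m = re.search(r"#+\s*Response:\s*(.+)", fully_decoded, flags=re.DOTALL)
    return m.group(1).strip() if m else None


complete = "### Instruction:\nName a color.\n### Response:\nBlue.\n### End"
truncated = "### Instruction:\nName a color.\n### Response:\nBlue is a"

print(extract_response(complete))           # text between the two keys
print(extract_response(truncated))          # everything after the response key
print(extract_response("no markers here"))  # None: no response key found
```

When the end key never appears, the second pattern returns everything after the response key, so any text the model keeps generating is included in the result.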

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b",
                                             device_map="auto",
                                             torch_dtype=torch.bfloat16)

# generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

In [7]:
from datasets import load_dataset

data = load_dataset("json",
                    data_files="./AlpacaDataCleaned/alpaca_data.json")

def generate_prompt(data_point):
    # taken from https://github.com/tloen/alpaca-lora
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:
{data_point["output"]}"""


data = data.map(lambda data_point: {"prompt": tokenizer(generate_prompt(data_point))})

data

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction', 'prompt'],
        num_rows: 52002
    })
})

## Fine-tuning Dolly

In [8]:
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig, GPTJForCausalLM

from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin C:\Users\rravula\anaconda3\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
CUDA SETUP: Loading binary C:\Users\rravula\anaconda3\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable


  warn("The installed version of bitsandbytes was compiled without GPU support. "


In [9]:
# Settings sized for an RTX 3090; an A100 may fit a larger MICRO_BATCH_SIZE
MICRO_BATCH_SIZE = 4  # per-device batch size
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 2  # paper uses 3
LEARNING_RATE = 2e-5
CUTOFF_LEN = 256
LORA_R = 4
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
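
To make the roles of `LORA_R` and `LORA_ALPHA` concrete, here is a minimal pure-Python sketch of the low-rank update described in the LoRA paper: the frozen weight `W` is left untouched and a trainable product `B @ A`, scaled by `alpha / r`, is added to its output. The toy dimensions (`D_IN`, `D_OUT`) and the helper names are hypothetical, dropout is omitted, and this is not how `peft` implements the layer internally.

```python
import random

LORA_R = 4
LORA_ALPHA = 16
D_IN, D_OUT = 64, 64  # hypothetical toy layer size, far smaller than a real model


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


random.seed(0)
W = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]      # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(LORA_R)]  # trainable, random init
B = [[0.0] * LORA_R for _ in range(D_OUT)]                                 # trainable, zero init

SCALE = LORA_ALPHA / LORA_R


def lora_forward(x):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + SCALE * d for b, d in zip(base, delta)]


x = [1.0] * D_IN
full_params = D_IN * D_OUT
lora_params = LORA_R * (D_IN + D_OUT)

print("scale:", SCALE)
print("trainable parameters:", lora_params, "of", full_params)
print("matches base layer at init:", lora_forward(x) == matvec(W, x))
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly; training only has to learn the `r * (d_in + d_out)` adapter parameters.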

In [10]:
model = prepare_model_for_int8_training(model,
                                        use_gradient_checkpointing=True)



In [11]:
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
tokenizer.pad_token_id = 0  # carried over from alpaca-lora, where ID 0 is LLaMA's <unk>; check that tokenizer.eos_token_id != 0 for Dolly's tokenizer

data = load_dataset("json", data_files="./AlpacaDataCleaned/alpaca_data_cleaned.json")

In [12]:
data = data.shuffle().map(
    lambda data_point: tokenizer(
        generate_prompt(data_point),
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
)

In [13]:
data

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction', 'input_ids', 'attention_mask'],
        num_rows: 51760
    })
})
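
The `input_ids` and `attention_mask` columns come from truncating each prompt to `CUTOFF_LEN` tokens and padding shorter ones up to that length. The sketch below imitates that behavior in pure Python with made-up token IDs, assuming left padding (the tokenizer above is loaded with `padding_side="left"`). It also mimics how `DataCollatorForLanguageModeling(mlm=False)` builds labels by copying `input_ids` and replacing pad positions with `-100`. The helper names are hypothetical.

```python
CUTOFF_LEN = 8    # the notebook uses 256; shortened here for readability
PAD_TOKEN_ID = 0  # mirrors tokenizer.pad_token_id = 0


def pad_or_truncate(token_ids, max_length=CUTOFF_LEN, pad_id=PAD_TOKEN_ID):
    """Cut to max_length tokens, then left-pad up to max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": [pad_id] * n_pad + ids,
        "attention_mask": [0] * n_pad + [1] * len(ids),
    }


def causal_lm_labels(input_ids, pad_id=PAD_TOKEN_ID):
    """Copy input_ids and mask pad positions with -100 so the loss ignores them."""
    return [-100 if token == pad_id else token for token in input_ids]


short = pad_or_truncate([11, 12, 13])           # hypothetical token IDs
long = pad_or_truncate(list(range(101, 111)))   # 10 tokens, longer than CUTOFF_LEN

print(short["input_ids"])       # [0, 0, 0, 0, 0, 11, 12, 13]
print(short["attention_mask"])  # [0, 0, 0, 0, 0, 1, 1, 1]
print(long["input_ids"])        # [101, 102, 103, 104, 105, 106, 107, 108]
print(causal_lm_labels(short["input_ids"]))  # [-100, -100, -100, -100, -100, 11, 12, 13]
```

One consequence of masking by ID is that any real token sharing the pad ID is also excluded from the loss, which is why the choice of `pad_token_id` matters.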

In [None]:

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        # gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        output_dir="lora-dolly",
        save_total_limit=3,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint=False)

model.save_pretrained("alpaca-lora-dolly-2.0")
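
Since `gradient_accumulation_steps` is commented out in the `TrainingArguments` above, each optimizer step sees only `MICRO_BATCH_SIZE` examples rather than `BATCH_SIZE`. The arithmetic below is a rough sketch of what that means for the number of optimizer steps, assuming a single device and the 51,760-example dataset shown earlier. The helper `optimizer_steps` is hypothetical, and the exact rounding used by `Trainer` may differ between versions.

```python
import math

NUM_EXAMPLES = 51760
MICRO_BATCH_SIZE = 4
BATCH_SIZE = 128
EPOCHS = 2
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32


def optimizer_steps(num_examples, micro_batch, accumulation, epochs):
    """Approximate optimizer-step count on one device."""
    batches_per_epoch = math.ceil(num_examples / micro_batch)
    steps_per_epoch = max(batches_per_epoch // accumulation, 1)
    return steps_per_epoch * epochs


without_accumulation = optimizer_steps(NUM_EXAMPLES, MICRO_BATCH_SIZE, 1, EPOCHS)
with_accumulation = optimizer_steps(NUM_EXAMPLES, MICRO_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, EPOCHS)

print("effective batch without accumulation:", MICRO_BATCH_SIZE)
print("optimizer steps without accumulation:", without_accumulation)
print("effective batch with accumulation:", MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)
print("optimizer steps with accumulation:", with_accumulation)
```

Under these assumptions the run takes roughly 25,880 small steps instead of about 808 steps at an effective batch of 128, so `warmup_steps=100` covers a much smaller fraction of training.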

In [None]:
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

In [None]:
generate_text("Look up the boiling point of water.")

'Water boils at 100 ° C (212 ° F) at atmospheric pressure.\n\n\n  \xa0\xa0\xa0\nTo look up the boiling point of water, you can look up the properties of water in a chemistry book or ask a knowledgeable friend for advice.'

In [None]:
generate_text("Find the capital of Spain.")

'The capital of Spain is Madrid. Madrid is the largest city in the country and one of the oldest settlements in Europe, having been inhabited since prehistoric times.\nBelow is the list of the main cities:\n1. Barcelona\n2. Valencia\n3. Seville\n4. Madrid\n5. Bilbao\n6. Bilbao is also known as the 3 cities.\n7. Valencia is also known as the 8 cities because of its eight cities that have been the capital.\n8. Barcelona is also known as the 10 cities and is the capital of the regional capital in Catalonia.\n9. Valencia and Barcelona are connected by a tunnel.\nBelow is a map of the main cities of Spain:\nBelow is the city of Valencia in Spain:\n\n clocks in is 6:45 am and my head is reeling with the knowledge that today is the first day of my first year in college and it has been an exciting morning so far. I just finished getting dressed and just went downstairs to my new class, the 101 Science II course. I find myself overwhelmed with both excitement and nervousness as I sit down at my

In [None]:
generate_text("Translate the following phrase into French: I love my dog")

'I adore mon chien.\n\n### Response:\nJe suis passionnée par mon chien.'

In [None]:
generate_text("Given a set of numbers, find the maximum value: Set: {10, 3, 25, 62, 16}")

"The maximum value of the given set of numbers is 16. The numbers are arranged in descending order: 3, 10, 25, 62. And the value of the maximum element of the given set is 16. \n\nThe reason behind this is that the given set is made of smaller elements with increasing values, in descending order. So, the maximum element can be obtained by taking the smallest element first and then finding the maximum value of the rest of the numbers that are lesser than it, in the given order.\n\nIn the given set, the first element, in the order from largest to smallest is 25, the element next to it is 10. So, 25 - 10 = 15. The value of the maximum element in the given set is 16 = 15 + (25 - 10) / 2 = 15 + 15/2 = 16.\n\n�\nThe answer is 16.\n\nLet's look at the given set.\nThe first element is 25. The next element is 10. The remaining elements are 62, 16. \n25 - 10 = 15.\nNext, we find the maximum value of the rest of the numbers in the given set:\n62 - 16 = 46.\nThe value of the maximum element in the