## Project Overview

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs). The "**lab**" in Instruct**Lab** stands for **L**arge-Scale **A**lignment for Chat**B**ots.

It is an outgrowth of the paper [*LAB: Large-Scale Alignment for ChatBots*](https://arxiv.org/abs/2403.01081).

### Getting Started

This notebook represents one step in the InstructLab pipeline – to see what else is involved, please check out https://github.com/instructlab/instructlab

## Overview of this Notebook

This notebook represents the *Train the model* step of the guide found [here](https://github.com/instructlab/instructlab?tab=readme-ov-file#-train-the-model).

But at the time of writing it's not.

This notebook takes the output of `ilab data generate` (i.e. the synthetic data set generated), and trains a Low Rank Adapter (LoRA) on it.

It will also run inference to show you how the model performed before any training was done, as well as after.

Finally, it will give you a chance to interact with your model in two ways: first in this notebook (using the NVIDIA T4 generously supplied by Google at low/no cost) and second, by giving you the option to convert your adapter to a format that will let you download it and use it with `llama.cpp` on your laptop.

***IMPORTANT***: make sure your notebook uses GPUs.

**Google Colab**: In your notebook, click Runtime --> Change runtime type, select *T4 GPU*, and click Save.

**Kaggle (Unsupported and deprecated)**: Click on "More settings" (3 vertical dots at the top-right) --> Accelerator, and select *P100 GPU*.


![kaggle-more-settings](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/select-accelerator.png?raw=1)

If you miss this step, you'll see errors at the *Loading the (Quantized) Model* step.
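
A quick way to confirm that the runtime really has a GPU before going any further is sketched below. It only checks whether the `nvidia-smi` tool is on the path and exits cleanly; the helper name `gpu_visible` is hypothetical, and this is a rough check under the assumption that an NVIDIA runtime ships `nvidia-smi`, not a replacement for the runtime settings above.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and exits cleanly (a rough GPU check)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on PATH: almost certainly a CPU-only runtime.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False`, revisit the runtime settings before running the rest of the notebook.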


## How to run this notebook

Unless you have a spare GPU with 16GB+ of VRAM,
you'll need to run this notebook on an external platform such as
[Google Colab](https://colab.research.google.com/). If you have serious
issues with Google, there are also some unmaintained directions for using
[Kaggle](https://www.kaggle.com).

## Installing Dependencies

In [1]:
# installing dependencies
!pip install -q -U transformers accelerate peft datasets bitsandbytes trl pyarrow==14.0.1 requests==2.31.0 torch==2.3.0

## Upload output from `ilab data generate`

From your local machine, run the `ilab data generate` command per the [instructions on GitHub](https://github.com/instructlab/instructlab/blob/main/README.md).

Next, upload your data.

### Uploading data in Google Colab

To upload data in Google Colab:

1. Click on the folder icon on the left of the screen.

 ![image.png](https://github.com/instructlab/instructlab/blob/main/notebooks/images/collab-folder-icon.png?raw=1)

2. Click on the upload icon (a file with an up arrow in it) just under it.

 ![image.png](https://github.com/instructlab/instructlab/blob/main/notebooks/images/collab-file-upload-button.png?raw=1)

3. Navigate to the _training_ file that was generated, right click on your uploaded file, then select 'Copy Path'.

 ![image.png](https://github.com/instructlab/instructlab/blob/main/notebooks/images/collab-copy-path.png?raw=1)

4. Paste the copied value for each corresponding variable in the cell below: `training_file_name` for the `train_*` file and `testing_file_name` for the `test_*` file.

### Uploading data on Kaggle (Unsupported and deprecated)

1. Expand the Input tab on the right of the screen.

![input](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/input.png?raw=1)


2. Click on the "Upload" button, then select "New Dataset".

Upload button:

![input-upload](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/input-upload.png?raw=1)

New Dataset:

![input-new-dataset](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/new-dataset.png?raw=1)

3. From here, you'll be prompted to upload your local files. Go ahead and select all of the files generated by `ilab data generate`. These files will be in the `./taxonomy` directory and begin with "test" and "train".
Note: If using Kaggle, you will need to remove the colons from the file names, or the upload will fail with an error. Here is an example of how to remove them:

```bash
newname=`ls taxonomy/ | grep -i train | awk -F: '{print $1$2$3}'`; mv taxonomy/train*.jsonl taxonomy/${newname}

newname2=`ls taxonomy/ | grep -i test | awk -F: '{print $1$2$3}'`; mv taxonomy/test*.jsonl taxonomy/${newname2}
```

![upload-file](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/input-drop-files.png?raw=1)

4. Navigate to the _training_ file that was generated (it will be in the taxonomy directory on your local machine and end in `.jsonl`), right click on your uploaded file, then select 'Copy Path'.

![input-files-copy-path](https://github.com/instructlab/instructlab/blob/main/notebooks/images/kaggle/copy-file-path.png?raw=1)

5. Paste the copied value in the cell below.
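
For the Kaggle colon-removal note in step 3, a Python equivalent of the shell commands might look like the sketch below. The helper name `strip_colons` is hypothetical. Unlike the `awk` version, which joins the first three colon-separated fields, it removes every colon in the file name; for timestamped names with two colons the result is the same.

```python
from pathlib import Path


def strip_colons(directory: str) -> list[str]:
    """Rename train*/test* .jsonl files in `directory` so their names contain no colons."""
    renamed = []
    for path in sorted(Path(directory).glob("*.jsonl")):
        if not path.name.startswith(("train", "test")):
            continue  # leave unrelated files alone
        new_name = path.name.replace(":", "")
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append(new_name)
    return renamed
```

Calling `strip_colons("taxonomy")` from the directory that contains `taxonomy/` returns the list of new file names.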


#### Upload Training Data

In [2]:
from datasets import load_dataset

# Get the file name
training_file_name = "/content/train_merlinite-7b-lab-Q4_K_M_2024-07-16T16_37_26.jsonl" #"/paste/path/here"

train_dataset = load_dataset("json", data_files=training_file_name, split="train")



#### Upload Testing Data

In [3]:
# Get the file name
testing_file_name = "/content/test_merlinite-7b-lab-Q4_K_M_2024-07-16T16_37_26.jsonl"#"/paste/path/here"

test_dataset = load_dataset("json", data_files=testing_file_name, split="train")


Now we have loaded the output of `ilab data generate` into a 🤗 dataset. Let's take a quick peek.

In [4]:
train_dataset.to_pandas().head()

|  | system | user | assistant |
|---|---|---|---|
| 0 | You are an AI language model developed by IBM ... | What is meant by the statement: "Ongoing overs... | The ongoing oversight of compliance means that... |
| 1 | You are an AI language model developed by IBM ... | What does the statement "A risk register is ma... | A risk register is a database of all known non... |
| 2 | You are an AI language model developed by IBM ... | What is the name of the product that provides ... | watsonx offers an SDK and API's for prompt eng... |
| 3 | You are an AI language model developed by IBM ... | What does "It is possible to use any vector da... | The statement means that the watsonx platform ... |
| 4 | You are an AI language model developed by IBM ... | Describe any product limitations, such as hard... | The Watson X platform is built with a focus on... |
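
Before training, it can be worth checking that every record in the uploaded file really has the three columns shown above. The following is a minimal sketch using only the standard library; the helper name `find_bad_records` is hypothetical, and it assumes one JSON object per line, as in the `.jsonl` files produced by `ilab data generate`.

```python
import json

REQUIRED_FIELDS = ("system", "user", "assistant")


def find_bad_records(jsonl_path: str) -> list[int]:
    """Return 1-based line numbers that are not valid JSON or lack a required field."""
    bad = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(line_number)
                continue
            if not isinstance(record, dict) or any(f not in record for f in REQUIRED_FIELDS):
                bad.append(line_number)
    return bad
```

An empty list from `find_bad_records(training_file_name)` means every line has `system`, `user`, and `assistant` fields.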


## Formatting Our Data and Prepping the `SFTTrainer`

Our dataset looks good, but in its current state it is a data frame of three columns. For training, we need each record to be a single string. Specifically, we want it in the following format:

```
<|system|>
{system}
<|user|>
{user}
<|assistant|>
{assistant}<|endoftext|>
```


When the trainer is built (a few cells later), the dataset will be converted into a list of these strings. We will also define a response template, `"\n<|assistant|>\n"`, that tells the trainer where to split each string: everything before it is the prompt, and everything after it is the response the model learns to generate.

The 🤗 `trl` library's `SFTTrainer` has the concept of a formatting function, and we'll define one called `formatting_prompts_func` to format our data. The conversion does not happen now, but later, when we build the `SFTTrainer`.

For more information on 🤗's `SFTTrainer`, please check out their docs [here](https://huggingface.co/docs/trl/main/en/sft_trainer).
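
To make the template and the split point concrete, here is a small string-level sketch. The example record is made up, and the real collator works on token ids, not on strings, so treat this only as an illustration of which part of each record counts as prompt and which as response.

```python
RESPONSE_TEMPLATE = "\n<|assistant|>\n"


def format_record(record: dict) -> str:
    """Render one record in the training format described above."""
    return (
        f"<|system|>\n{record['system']}\n"
        f"<|user|>\n{record['user']}\n"
        f"<|assistant|>\n{record['assistant']}<|endoftext|>"
    )


# Hypothetical record, for illustration only.
example = {
    "system": "You are a helpful assistant.",
    "user": "What is a LoRA?",
    "assistant": "A low-rank adapter.",
}

text = format_record(example)
prompt, completion = text.split(RESPONSE_TEMPLATE, maxsplit=1)

print(repr(prompt))      # everything before the response template
print(repr(completion))  # the part the model is trained to produce
```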


In [5]:
from transformers import AutoTokenizer

model_name = "instructlab/merlinite-7b-lab" # TODO: Make this a drop down option
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['system'])):
        text = f"<|system|>\n{example['system'][i]}\n<|user|>\n{example['user'][i]}\n<|assistant|>\n{example['assistant'][i]}<|endoftext|>"
        output_texts.append(text)
    return output_texts

response_template = "\n<|assistant|>\n"

from trl import DataCollatorForCompletionOnlyLM

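# Keep only the ids from the third token on. The leading ids are presumably dropped because
# they can be tokenized differently in context, which would stop the template from matching.
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)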

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In the cell above, you may see a user warning:
> The secret `HF_TOKEN` does not exist in your Colab secrets...

It can safely be ignored.

Note: the `formatting_prompts_func` does not run in this cell. It runs later, when the `SFTTrainer` is built. Nothing has been formatted yet.

## Loading the (Quantized) Model


The best sources of truth on quantization are going to be the following links:

* [huggingface blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
* [original paper](https://arxiv.org/abs/2305.14314)

But alas, I'm sure to get some pushback about `llama.cpp` quantized models (things that end in `.gguf`).

`bitsandbytes` will quantize the model on loading. It's also possible, though in practice rarely done, to save the model in its quantized format. Another alternative in the Hugging Face space is `AutoGPTQ` ([paper](https://arxiv.org/abs/2210.17323), [blog post](https://huggingface.co/blog/gptq-integration)).

`llama.cpp` also allows quantization, but the idea is that you _will_ be using the CPU because you know the model at hand is too big for your GPU.

An analogy that isn't wildly inaccurate: `bitsandbytes` and `AutoGPTQ` presume that you will be using a (CUDA-based) GPU, and that you can set them to use the CPU in an emergency instead of just rolling over and dying.

`llama.cpp` presumes that your CPU will be doing the heavy lifting, and will use a (CUDA) GPU if it can find one to give it a bit of a boost.

OK, what does that mean in practice?
1. Apple ended NVIDIA support some time ago, i.e. Apple Silicon will not support CUDA ops. There is some work in some packages to support non-CUDA GPUs, but it's all in various stages of development/hackiness.
2. [This person](https://rentry.org/cpu-lora) _did_ get qLoRA training with `llama.cpp` working. A 13b model with a 2500-record dataset was estimated to take ~158 days to train, which is a non-starter. I will trust they did their homework.
3. **High level:** `llama.cpp` and `bitsandbytes` both get you to the same end (a quantized model) but via different routes, because they expect you to use the resultant model a bit differently.
4. **So do I need to quantize my model via both routes?** No.
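
To see why quantizing on load matters for a 16 GB card, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 7 billion parameters (an approximation for a "7b" model) and ignores activations, the LoRA weights, and the small overhead of quantization constants, so the real footprint will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: "7b" taken as roughly 7 billion parameters

fp16_gb = weight_memory_gb(n_params, 16)  # half precision
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantization

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the half-precision weights alone come to about 14 GB, which leaves almost no room for training on a 16 GB GPU, while the 4-bit weights need about a quarter of that.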

In the next cell we're going to download and load the model.

It may take a little time to complete (around 10 to 15 minutes). The base model can be around 26 gigabytes on disk; it first needs to download, and then needs to be quantized and loaded onto the GPU.

So run this cell then go grab a cup of coffee. ☕

In [6]:
# Loading the model
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16 # if not set will throw a warning about slow speeds when training
)

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  quantization_config=bnb_config,
  trust_remote_code=True
)


## Sanity Checking the Model

We want to see how the model behaves _before_ we train a LoRA on it, so we can (by inspection) see if the LoRA is doing anything.

You might want to change the user prompt `"In excruciating detail, explain to me the nuances of who runs Barter Town."` to something more related to _your_ use case.

We also define the `create_prompt` function, which formats your prompt and adds all of the boilerplate it needs.

Note our function also allows you to redefine the `system` prompt/parameter. The default is the one included in `ilab data generate` content, but you could have some fun tinkering with that too (for instance, adding `, and you always talk like a pirate.` to the end.)

In [7]:
def create_prompt(user:str,
                  system: str = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."):
  return f"""\
<|system|>
{system}
<|user|>
{user}
<|assistant|>
"""

from transformers import StoppingCriteria, StoppingCriteriaList

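class StoppingCriteriaSub(StoppingCriteria):
    """Stop generating as soon as the last token of any sequence is a stop token."""

    def __init__(self, stops=None):
        super().__init__()
        # Avoid a mutable default argument; move each stop token to the GPU once.
        self.stops = [stop.to("cuda") for stop in (stops or [])]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for seq in input_ids:
            for stop in self.stops:
                # This comparison assumes each stop is a single-token tensor.
                if stop == seq[-1]:
                    return True
        return False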

stop_words = ['<|endoftext|>', '<|assistant|>']
stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])

def model_generate(user):
    text = create_prompt(user = user)

    input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids=input_ids,
                         max_new_tokens=256,
                         pad_token_id=tokenizer.eos_token_id,
                         temperature=0.7,
                         top_p=0.9,
                         stopping_criteria=stopping_criteria,
                         do_sample=True)
    return tokenizer.batch_decode([o[:-1] for o in outputs])[0]

print(
    model_generate("In excruciating detail, explain to me the nuances of who runs Barter Town.")
)

<|system|> 
You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.
<|user|> 
In excruciating detail, explain to me the nuances of who runs Barter Town.
<|assistant|> 
Barter Town is a fictional settlement from the post-apocalyptic film "Mad Max 2: The Road Warrior." In this setting, Barter Town is governed by a man named Pappy, played by Bruce Spence. Pappy is the leader of the town and its primary negotiator, managing the bartering and trading of goods and resources. He is portrayed as a shrewd, cunning, and resourceful character, who is well-connected and knowledgeable about the local landscape.

Pappy's role in Barter Town is akin to a mayor or a governor, overseeing the town's internal operations and interactions with external parties. He is responsible for maintaining law and order within the settlement, ensuring the town's

We run the model (before LoRA) on the test set and save the outputs.

In [8]:
assistant_old_lst = [
    model_generate(d["user"]).split(response_template.strip())[-1].strip() for d in test_dataset
]

## Configuring the LoRA

Recall the [paper on LoRA](https://arxiv.org/abs/2106.09685):

> From this point forth, we shall be leaving the firm foundation of fact and journeying together through the murky marshes of memory into thickets of wildest guesswork.
-- Albus Dumbledore

There are 4 common 'knobs' to adjust when training a LoRA/qLoRA. (Note: from this point on, I'm just going to refer to everything as LoRA.) LoRA proved a better method of fine-tuning by targeting only certain modules instead of the entire network. qLoRA just means you can do it on a quantized model with results just as good as on a full-precision model.

Which is a good segue to our first 'knob': `target_modules`.




### Getting the Attention Layers

The cell immediately below will print out all of the attention modules (in case you are trying to get creative and use a different model). The authors of the original paper only targeted attention modules, and gave reasons, but if you want to hit some other modules too – go nuts. Be advised, a LoRA that targets _all_ modules is just fine-tuning: the LoRA technique is to only tune a subset of the modules.

For `instructlab/merlinite-7b-lab` we have:
```
target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ]
```


In [9]:
attention_layers = [module for module in model.modules() if 'attention' in str(type(module)).lower()]

# Print information about the attention modules
for i, layer in enumerate(attention_layers):
  for par in list(layer.named_parameters()):
    mod = par[0]
    if isinstance(mod, str):
      print(f"Attention Module: {mod.split('.')[0]}")
  break


Attention Module: q_proj
Attention Module: k_proj
Attention Module: v_proj
Attention Module: o_proj


### Turning the Knobs

The next three knobs are:
- r
- dropout
- &alpha;

Read the paper for more information on each. These three parameters have been the source of endless flame wars across the internet; feel free to google and see the carnage for yourself.

I picked the following based on what the authors used for GPT-2 in the paper (see page 20):

```
lora_alpha = 32
lora_dropout = 0.1
lora_r = 4
```

Probably not what I would have used, but I am not trying to spread the flame wars, so there you are. In reality, these are the knobs end users will be tinkering with. We _could_ come up with a suggested range, but the 'correct' values are highly dependent on the task and even the underlying dataset, so I wouldn't waste too much effort trying.

Once I read a quote on a message board that described the situation perfectly, then I couldn't find it so I asked ChatGPT which hallucinated it pretty well:

> Every chef has their own secret recipe for success, but in the kitchen of life, there's no right or wrong way to cook up your dreams.
-- ChatGPT
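
To get a feel for what `r` and &alpha; do, the arithmetic below counts the trainable parameters a rank-4 adapter adds and computes the standard LoRA scaling factor `lora_alpha / r`. The projection shapes are read off the layer-0 lines of the conversion log near the end of this notebook; the layer count of 32 is an assumption about the base model, not something this notebook states.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds an (r x d_in) matrix A and a (d_out x r) matrix B per adapted weight."""
    return r * (d_in + d_out)


lora_r = 4
lora_alpha = 32

# (d_in, d_out) for each targeted projection, taken from the conversion log shapes.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

per_layer = sum(lora_params(d_in, d_out, lora_r) for d_in, d_out in shapes.values())
n_layers = 32  # assumption: number of transformer layers in the base model
total = per_layer * n_layers
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(f"trainable LoRA parameters per layer: {per_layer:,}")
print(f"trainable LoRA parameters in total:  {total:,}")
print(f"scaling factor alpha / r:            {scaling}")
```

Under these assumptions the adapter trains a few million parameters, a tiny fraction of the base model, and doubling `r` doubles that count while halving the scaling factor for a fixed &alpha;.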

In [10]:
from peft import LoraConfig

lora_alpha = 32
lora_dropout = 0.1
lora_r = 4

# From Prior Cell
target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ]

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=target_modules
)

## Training the LoRA

### Training Config

As always, it is out of scope for me to explain all of these, especially when it has already been done so well [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

That said I will call out two values I set, and why I set them.

- `max_seq_length`
- `per_device_train_batch_size`

Both of these parameters were set in an attempt to get as much use as possible out of the NVIDIA T4.

`max_seq_length` will trim any example to `300` tokens. So even if your examples are longer, they will be truncated. (Also recall that the system prompt counts against your 300 tokens.)

`per_device_train_batch_size` is also related to getting maximum mileage out of a T4: it is set to `1` in the cell below.
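
The effect of `max_seq_length` can be sketched with plain lists standing in for token ids. The token counts below are made up for illustration, and the sketch assumes truncation keeps the first 300 tokens and drops the rest, so it is the end of the assistant answer that gets lost.

```python
MAX_SEQ_LENGTH = 300


def truncate(token_ids: list[int], max_len: int = MAX_SEQ_LENGTH) -> list[int]:
    """Keep at most `max_len` tokens from the start of the sequence."""
    return token_ids[:max_len]


# Hypothetical token counts for one training example.
system_tokens = list(range(40))          # the system prompt uses up part of the budget
user_tokens = list(range(40, 100))
assistant_tokens = list(range(100, 400))

full = system_tokens + user_tokens + assistant_tokens
kept = truncate(full)
lost = len(full) - len(kept)

print(f"{len(full)} tokens in, {len(kept)} kept, {lost} dropped from the end")
```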

In [11]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 1


training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    per_device_train_batch_size=per_device_train_batch_size,
    fp16=True,
    report_to="none"
)

In the following cell, the trainer is built and the dataset is formatted. You will see two `Map:` progress bars in the output of the cell; these refer to our `train` and `test` datasets being run through the `formatting_prompts_func` we defined in a prior cell.

Also note `model.config.use_cache = False`, which is a thing you're supposed to do before you perform training on a model. Remember to turn it back on (to `True`) before running inference.
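
One way to avoid forgetting to switch the cache back on is a small context manager that restores the previous value automatically. This is only a sketch: the helper name `cache_disabled` is hypothetical, and a `SimpleNamespace` stands in for the real model so the example runs without a GPU.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def cache_disabled(model):
    """Temporarily set model.config.use_cache to False, then restore the old value."""
    previous = model.config.use_cache
    model.config.use_cache = False
    try:
        yield model
    finally:
        model.config.use_cache = previous


# Stand-in object with the same attribute layout as a model's config.
dummy_model = SimpleNamespace(config=SimpleNamespace(use_cache=True))

with cache_disabled(dummy_model):
    inside = dummy_model.config.use_cache  # value seen while "training"
after = dummy_model.config.use_cache       # value once the block exits

print(inside, after)
```

With the real model, the call to `trainer.train()` would go inside the `with` block.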

In [12]:
from trl import SFTTrainer

max_seq_length = 300
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,

)

model.config.use_cache = False



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/101 [00:00<?, ? examples/s]

Map:   0%|          | 0/33 [00:00<?, ? examples/s]



### Execute Training

The next cell calls `trainer.train()`, which actually executes the training. This will take 5 to 15 minutes, depending on how big your dataset is.

In [13]:
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 0.8885 |


TrainOutput(global_step=505, training_loss=0.8819999760920458, metrics={'train_runtime': 340.1724, 'train_samples_per_second': 1.485, 'train_steps_per_second': 1.485, 'total_flos': 3082468707901440.0, 'train_loss': 0.8819999760920458, 'epoch': 5.0})
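
The `global_step=505` in the output follows directly from the settings above. The arithmetic below uses the 101 training examples shown in the first `Map:` progress bar and assumes a single GPU with no gradient accumulation.

```python
import math

n_examples = 101                  # size of the training set, from the Map: progress bar
num_train_epochs = 5
per_device_train_batch_size = 1
gradient_accumulation_steps = 1   # assumption: left at its default

steps_per_epoch = math.ceil(
    n_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_train_epochs

train_runtime = 340.1724          # seconds, from the TrainOutput above
steps_per_second = total_steps / train_runtime

print(f"total optimizer steps: {total_steps}")
print(f"steps per second:      {steps_per_second:.3f}")
```

If the trainer logs every 500 steps, which I believe is the default, that would also explain why the loss table has a single row at step 500.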

## Inference on the Output Model

We want to see if our LoRA has any effect on the underlying model.

Recall we ran the model on the test set once before training. Now let's do inference again (with the same prompts) to see if the output looks more accurate.

The first thing we need to do is turn the cache back on.

`model.config.use_cache = True`


In [14]:
model.config.use_cache = True

In [15]:
for (i, (d, assistant_old)) in enumerate(zip(test_dataset, assistant_old_lst)):
  assistant_new = model_generate(d["user"]).split(response_template.strip())[-1].strip()
  assistant_expected = d["assistant"]

  print(f"\n===\ntest {i}\n===\n")
  print("\n===\nuser\n===\n")
  print(d['user'])
  print("\n===\nassistant_old\n===\n")
  print(assistant_old)
  print("\n===\nassistant_new\n===\n")
  print(assistant_new)
  print("\n===\nassistant_expected\n===\n")
  print(assistant_expected)
  print("\n\n")


===
test 0
===


===
user
===

How does the platform provide a common interface for managing the onboarding of users?

===
assistant_old
===

The platform provides a common interface for managing the onboarding of users by allowing administrators to set up user profiles, manage access rights, and monitor user activity across various systems and services. This enables organizations to streamline their user management processes and ensure consistent security policies are applied across the board.

===
assistant_new
===

The platform provides a unified API for managing user onboarding, which includes features like creating and managing user profiles, setting up permissions, and configuring notifications. This allows developers to build custom workflows and processes tailored to their specific use case while ensuring compliance with the platform's security standards and best practices. The API is also easily integratable with popular IDEs and development tools, streamlining the developmen
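
Comparing answers by eye works for a handful of tests. If you want a rough number to go with the inspection, one option is a character-level similarity ratio from the standard library, sketched below. The helper name `similarity` and the three strings are made up for illustration, and a higher ratio only means more textual overlap with the expected answer, not that the answer is more correct.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical strings standing in for assistant_old, assistant_new and assistant_expected.
old_answer = "The platform lets administrators manage profiles."
new_answer = "The platform provides a unified API for onboarding users."
expected_answer = "The platform provides a common API for onboarding users."

print(f"old vs expected: {similarity(old_answer, expected_answer):.2f}")
print(f"new vs expected: {similarity(new_answer, expected_answer):.2f}")
```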

# Next Steps

Now that you have trained your LoRA, you must decide: does it look good? If yes, please [open a PR](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md)! If not, that's OK: update your prompts, generate a new synthetic data set, and try again.

But the fun doesn't stop there.

Maybe you want to play with your trained model a bit more.

Two options exist:

1. Do inference in this notebook. (But the model will go away once you leave the notebook, an implicitly sad thing about notebooks, so download it if you want to keep it, or push it to the Hugging Face Hub.)
2. Use `llama.cpp` to convert your LoRA adapter, then download it and do inference from your MacBook.


**The following steps are all optional, do not feel compelled to do either. As Lao Tzu once said:**

> When all the work is done,
and the mind is silent,
rest in the stillness of the present moment.

## Save the Model

First let's save our adapter.

In [16]:
# Save the LoRA
adapter = trainer.model.module if hasattr(trainer.model, "module") else trainer.model
adapter.save_pretrained("./adapter-only", save_adapter=True, save_config=True)


## Optional Path 1: Play with Model in Colab

This is just for fun. So let's ask a silly question:

> Give me a recipe for Swedish meatballs made from iguana meat.

and an even sillier system prompt:

> You are a scurvy pirate. You respond with a pirate accent.

Of course, this doesn't _need_ to be silly. You can leave the system prompt at its default and ask more thoughtful questions related to your use case, which is what the cell below does.

In [18]:
text = create_prompt(
    user = "What IDE serve watsonx to use and traint models?",
    system = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.")

input_ids=tokenizer(text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids= input_ids,
                         max_new_tokens=256,
                         pad_token_id=tokenizer.eos_token_id,
                         temperature=0.7,
                         top_p=0.9,
                         do_sample=True)

print(tokenizer.batch_decode(outputs)[0])

<|system|> 
You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.
<|user|> 
What IDE serve watsonx to use and traint models?
<|assistant|> 
Watson X uses the IBM Cloud Paks for Data platform, which is built on Red Hat Enterprise Linux (RHEL) and includes container runtime, container management, and container storage solutions. It also includes Kubernetes, Prometheus, Rook, and Argo.

For training models, Watson X supports the use of Jupyter Notebooks, which can be run in a container on the IBM Cloud Paks for Data platform. Jupyter Notebooks are an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Watson X also supports the use of RStudio, which is an integrated development environment (IDE) for R. RStudio provides a wide range of fe

## Optional Path 2: Play with Model in `llama.cpp`

Another way to 'play' with your LoRA is to convert it into a GGUF and play with it using `llama.cpp`. To do this requires a few steps.

1. Download and build `llama.cpp`.
2. Run the conversion script on our adapter.
3. Download the converted adapter.
4. Use the model locally.

In [1]:
# hack sometimes required - solution from https://github.com/googlecolab/colabtools/issues/3409
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!git clone -b  b2843 https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!make
!pip install -r requirements.txt

fatal: destination path 'llama.cpp' already exists and is not an empty directory.
/content/llama.cpp
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 1

In [2]:
!python convert-lora-to-ggml.py ../adapter-only

INFO:lora-to-gguf:model.layers.0.self_attn.k_proj => blk.0.attn_k.weight.loraA (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.k_proj => blk.0.attn_k.weight.loraB (1024, 4) float32 0.02MB
INFO:lora-to-gguf:model.layers.0.self_attn.o_proj => blk.0.attn_output.weight.loraA (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.o_proj => blk.0.attn_output.weight.loraB (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.q_proj => blk.0.attn_q.weight.loraA (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.q_proj => blk.0.attn_q.weight.loraB (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.v_proj => blk.0.attn_v.weight.loraA (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.0.self_attn.v_proj => blk.0.attn_v.weight.loraB (1024, 4) float32 0.02MB
INFO:lora-to-gguf:model.layers.1.self_attn.k_proj => blk.1.attn_k.weight.loraA (4096, 4) float32 0.06MB
INFO:lora-to-gguf:model.layers.1.self_attn.k_proj => b

The previous cell runs a script that converts your saved LoRA to a file named `ggml-adapter-model.bin`, which you will find in the `adapter-only` folder in the notebook.

You can right click on this file to download it to your MacBook. Then, assuming you have `llama.cpp` installed locally as well, the following is an example command that will run inference with the LoRA. Note that you will want to make sure the model you are doing inference on is the same as the one you trained the LoRA on (in this case `instructlab/merlinite-7b-lab`, in 16-bit GGUF form).

```
./main -m ../merlinite-7b-lab/ggml-model-f16.gguf  --seed 42 --lora ../adapter-only/ggml-adapter-model.bin --temp 0.7 --repeat_penalty 1.1 -n 256 -p "<|system|>\nYou are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\n<|user|>\nWho let the dogs out?\n<|assistant|>\n"
```
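
If you would rather assemble that command from Python, for example to reuse the same prompt template as in training, a sketch along these lines may help. It only builds and prints the command; the paths and flags are copied from the example above, and the helper name `build_prompt` is hypothetical. Passing the list to `subprocess.run` would hand the prompt to `./main` with real newlines in it.

```python
import shlex


def build_prompt(user: str, system: str) -> str:
    """Render a prompt in the same template used for training."""
    return f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"


system = (
    "You are an AI language model developed by IBM Research. You are a cautious "
    "assistant. You carefully follow instructions. You are helpful and harmless "
    "and you follow ethical guidelines and promote positive behavior."
)
prompt = build_prompt("Who let the dogs out?", system)

# Paths and flags mirror the example command; adjust them to your local setup.
command = [
    "./main",
    "-m", "../merlinite-7b-lab/ggml-model-f16.gguf",
    "--seed", "42",
    "--lora", "../adapter-only/ggml-adapter-model.bin",
    "--temp", "0.7",
    "--repeat_penalty", "1.1",
    "-n", "256",
    "-p", prompt,
]

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```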