## Fine-tune Mistral-7B-v0.3 for geoscience content

### 1. (Base) Pre-setup
#### 1.1 Set up environment

In [None]:
#conda create --name FINE_TUNING_LLM python=3.12.4
#conda activate FINE_TUNING_LLM
#conda install -n FINE_TUNING_LLM ipykernel --update-deps --force-reinstall

#### 1.2 Install libraries

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [1]:
# You only need to run this once per machine
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets scipy ipywidgets matplotlib

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.1
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-5ed8qnr4
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-5ed8qnr4
  Resolved https://github.com/huggingface/transformers.git to commit aab08297903de0ae39d4a6d87196b5056d76f110
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggin

In [3]:
!pip install sentencepiece
!pip install protobuf

#### 1.3 Check GPU and Torch

In [4]:
import torch
torch.__version__

'2.2.1'
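
The cell above prints only the Torch version. A minimal sketch of a fuller check follows, covering whether a CUDA device is visible and whether it supports bfloat16 (which the quantization config later relies on). The helper name `describe_torch_environment` is hypothetical and not part of the original notebook.

```python
import importlib.util


def describe_torch_environment():
    """Collect basic facts about the local PyTorch / CUDA setup."""
    # Hypothetical helper: returns early if torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,          # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(0)
        info["bf16_supported"] = torch.cuda.is_bf16_supported()
    return info


for key, value in describe_torch_environment().items():
    print(f"{key}: {value}")
```

If `cuda_available` is `False`, the 4-bit loading and training cells below are unlikely to work as written.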

### 2. (Fine-tuning) Prepare data
#### 2.1 Import data

In [5]:
from datasets import load_dataset

train_dataset = load_dataset('json', data_files='notes.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='notes_validation.jsonl', split='train')
#print(train_dataset[0])

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

#### 2.2 Formatting prompts
Next, create a `formatting_func` that structures training examples into prompts.

In [6]:
def formatting_func(example):
    text = f"### Here are some Notes:\n{example['note']}\n"
    return text
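
To make the expected data layout concrete, here is a small sketch of how one JSONL record turns into a prompt. The sample record is invented for illustration; the only thing taken from the notebook is that each record carries a `note` field.

```python
import json


def formatting_func(example):
    # Same template as the cell above.
    return f"### Here are some Notes:\n{example['note']}\n"


# Hypothetical line from notes.jsonl; the real file's content is not shown here.
sample_line = '{"note": "Basalt sample B-12 shows olivine phenocrysts."}'

example = json.loads(sample_line)   # one JSON object per line
prompt = formatting_func(example)
print(prompt)
```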

### 3. (Base) Load Base Model

Let's now load Mistral (`mistralai/Mistral-7B-v0.3`) using 4-bit quantization!

In [7]:
from huggingface_hub import notebook_login

notebook_login()

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.3"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
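
As a rough guide to why 4-bit loading matters, the sketch below estimates the storage needed for the weights alone at different precisions. The parameter count is an assumption inferred from the roughly 14.5 GB of 16-bit shards in the download output above. The estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: about 7.25e9 parameters, inferred from ~14.5 GB of 16-bit
# safetensors shards (14.5e9 bytes / 2 bytes per parameter).
n_params = 7.25e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```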

### 4. (Fine-tuning) Tokenization

#### 4.1 Without padding

In [9]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

def generate_and_tokenize_prompt(prompt):
    return tokenizer(formatting_func(prompt))

tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Reformat the prompt and tokenize each sample:

In [10]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Map:   0%|          | 0/29156 [00:00<?, ? examples/s]

Map:   0%|          | 0/145 [00:00<?, ? examples/s]

#### 4.2 With padding
Now, you have the flexibility to set the max_length as needed. Truncating and padding training examples allows you to tailor them to your chosen size. Keep in mind that opting for a larger max_length can impact computational efficiency.
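
One way to pick `max_length` is to look at the distribution of token counts and choose a value that covers most examples without truncation. The sketch below is an illustration only: `suggest_max_length`, the coverage target, and the sample lengths are all assumptions rather than the method used in this notebook.

```python
import math


def suggest_max_length(token_lengths, coverage=0.95, multiple_of=64):
    """Smallest multiple of `multiple_of` that leaves `coverage` of examples untruncated."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Nearest-rank quantile: the length of the example at the coverage cut-off.
    rank = max(1, math.ceil(coverage * len(ordered)))
    quantile_length = ordered[rank - 1]
    return math.ceil(quantile_length / multiple_of) * multiple_of


# Hypothetical lengths; in the notebook they could come from the unpadded dataset:
# lengths = [len(x["input_ids"]) for x in tokenized_train_dataset]
lengths = [120, 180, 240, 300, 310, 350, 400, 450, 480, 900]
print(suggest_max_length(lengths, coverage=0.9))
```

A lower coverage target trades a few truncated examples for less padding and faster training steps.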

Next, we'll tokenize the data again with padding and truncation to `max_length`, and copy `input_ids` into `labels`. Using the inputs as their own labels is what makes this self-supervised (next-token prediction) fine-tuning.

In [11]:
max_length = 512 # This was an appropriate max length for my dataset

def generate_and_tokenize_prompt2(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

In [12]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt2)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt2)

Map:   0%|          | 0/29156 [00:00<?, ? examples/s]

Map:   0%|          | 0/145 [00:00<?, ? examples/s]

Check that input_ids are padded on the left with the `eos_token` (2) and that an `eos_token` (2) is added to the end. Ensure the prompt starts with a `bos_token` (1).

In [13]:
print(tokenized_train_dataset[1]['input_ids'])

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 

Now all the samples should be of uniform length, `max_length`.
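
Rather than eyeballing the printed IDs, the checks described above can be sketched as a small function. The token IDs (`bos_token` = 1, `eos_token` = pad = 2) follow the text above; the helper name `check_left_padded` is hypothetical.

```python
def check_left_padded(input_ids, max_length, pad_id=2, bos_id=1, eos_id=2):
    """Return a list of problems found in one tokenized example (empty list means OK)."""
    problems = []
    if len(input_ids) != max_length:
        problems.append(f"length {len(input_ids)} != max_length {max_length}")

    # Skip the left padding; since pad_id == eos_id here, padding is a run of 2s.
    start = 0
    while start < len(input_ids) and input_ids[start] == pad_id:
        start += 1
    if start == len(input_ids):
        problems.append("example contains only padding")
        return problems

    if input_ids[start] != bos_id:
        problems.append("first non-padding token is not bos_token")
    if input_ids[-1] != eos_id:
        problems.append("last token is not eos_token")
    return problems


# Toy example with max_length=6: two pads, bos, two content tokens, eos.
print(check_left_padded([2, 2, 1, 10, 11, 2], max_length=6))
```

In the notebook this could be run over a few rows, for example `check_left_padded(tokenized_train_dataset[1]['input_ids'], max_length)`. An example that was truncated to `max_length` may or may not end with `eos_token` depending on the tokenizer, so treat that report as a prompt to inspect the row, not as a definite error.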

### 5. (Fine-Tuning) Test of the pre-trained model

You can evaluate Mistral's performance on one of your data samples. For instance, you could test the following `eval_prompt` to assess its understanding of methodologies in geoscience:

In [14]:
eval_prompt = """ How to measure Hg concentration in a rock sample?
### Answer :
"""

In [15]:
# Init an eval tokenizer that doesn't add padding or eos token
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
)

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.2)
    print(eval_tokenizer.decode(output_ids[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


How to measure Hg concentration in a rock sample?
### Answer :

The mercury content of the samples is determined by atomic absorption spectrophotometry. The method involves dissolving 1 gm of the powdered sample in concentrated nitric acid and then adding an excess amount of potassium iodide solution (KI). This forms a complex with mercury, which can be measured at 253.7 nm wavelength using a flame atomizer. A standard curve is prepared from known concentrations of mercuric chloride solutions.


Here is the output. It sounds plausible, but mercury is volatile, so the described procedure may not be correct.

### 6. (Fine-tuning) Set Up QLoRA fine-tuning method

Now, to initiate our fine-tuning process, we need to preprocess the model using the `prepare_model_for_kbit_training` method from PEFT.

In [16]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [17]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Let's print the model to inspect its layers, since we plan to apply LoRA adapters to all the linear layers: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, and `lm_head`.

In [18]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

Here we define the LoRA config.

- `r` denotes the rank of the low-rank matrix used in the adapters, which determines the number of trained parameters. A higher `r` allows for greater expressivity, but entails a compute tradeoff.

- `lora_alpha` is the scaling factor for the learned weights. The LoRA update is scaled by `lora_alpha / r`, so a higher `lora_alpha` gives the LoRA activations more weight.

In the [QLoRA paper](https://arxiv.org/pdf/2305.14314), the values used were `r=64` and `lora_alpha=16`, known for their good generalization.
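
To show how `r` and `lora_alpha` interact, here is a toy sketch of a LoRA forward pass in plain Python: the frozen weight `W` is left untouched and a low-rank update `B (A x)` is added, scaled by `lora_alpha / r`. The tiny matrices and function names are invented for illustration, and dropout is left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus scaled low-rank update: W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy shapes: 2 inputs, 2 outputs, rank 1 (the notebook uses 4096-wide layers and r=64).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # r x in_features
B = [[1.0], [0.0]]             # out_features x r
x = [1.0, 2.0]

print(lora_forward(W, A, B, x, r=1, lora_alpha=2))   # [7.0, 2.0]
print(16 / 64)                                       # scaling for r=64, lora_alpha=16: 0.25
```

With `r=64` and `lora_alpha=16` the update is scaled by 0.25, so raising `r` without raising `lora_alpha` also shrinks the contribution of each adapter.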

In [19]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 170131456 || all params: 3928494080 || trainable%: 4.33070414605283
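
The trainable-parameter count can be reproduced by hand. Each adapted `Linear` layer gains an `A` matrix of shape `r x in_features` and a `B` matrix of shape `out_features x r`. The sketch below uses the layer shapes printed above; the `lm_head` shape (4096 to 32768) is an assumption based on the embedding size, since it does not appear in the truncated printout.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r = 64
hidden, kv, mlp, vocab = 4096, 1024, 14336, 32768   # shapes from print(model)

# (in_features, out_features) of each adapted layer inside one decoder block.
per_layer_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in per_layer_shapes.values())
# 32 decoder layers plus lm_head (assumed shape: hidden -> vocab).
total = 32 * per_layer + lora_params(hidden, vocab, r)
print(total)   # 170131456
```

The result matches the `trainable params` figure above. The `all params` figure of about 3.9 billion is lower than the model's nominal size, most likely because 4-bit weights are stored packed, so the 4.33% is relative to that packed count and not to the full parameter count.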


Let's observe how the model changes with the addition of LoRA adapters.

In [20]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32768, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k

### 7. (Fine-tuning) Training

The training process took approximately 3.5 hours (`train_runtime` of about 12,773 seconds) on a single A100 40GB GPU.

Overfitting occurs when the validation loss rises while the training loss keeps falling, which indicates that the model fits the training data well but struggles to generalize to new data. Overfitting is usually undesirable, but since I was experimenting with a model that generates text similar to my notes, I tolerated a moderate level of it.

Regarding training strategy: you can initially set a high `max_steps` and watch for the point where the validation loss starts to rise. That point indicates a suitable number of steps. For instance, if you set 1000 steps and notice overfitting around step 500, then 500 steps would be your choice. You would then select the checkpoint-500 model from your output directory (`mistral-journal-finetune`) as your final model in section 8 below.
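
The checkpoint selection described above can be sketched as picking the evaluated step with the lowest validation loss. The helper `best_checkpoint` is hypothetical; the loss values are the ones reported for this run further down in the notebook.

```python
def best_checkpoint(eval_log, output_dir="mistral-journal-finetune"):
    """Return (step, path) of the evaluated step with the lowest validation loss."""
    if not eval_log:
        raise ValueError("eval_log must not be empty")
    step, _ = min(eval_log, key=lambda item: item[1])
    return step, f"{output_dir}/checkpoint-{step}"


# (step, validation loss) pairs reported for this run.
eval_log = [(1000, 1.834581), (2000, 1.765966), (3000, 1.734804)]
print(best_checkpoint(eval_log))
```

Here the validation loss is still falling at step 3000, so the last checkpoint is selected. Note that only steps listed in `save_steps` have a checkpoint on disk.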

If you're exploring and can accept overfitting for experimentation, you can try different checkpoint versions to gauge varying levels of overfitting.

To halt the training process prematurely, you can use `Kernel -> Interrupt Kernel` from the top navigation bar once you determine that further training is unnecessary.

In [21]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

from accelerate import Accelerator
# Initialize the accelerator
accelerator = Accelerator()

model = accelerator.prepare_model(model)

In [22]:
import transformers
from datetime import datetime

project = "journal-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=2,
        per_device_train_batch_size=16,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=3000,
        learning_rate=2.5e-5,        # Small learning rate for fine-tuning
        bf16=True,
        optim="paged_adamw_8bit",
        logging_steps=25,            # Log the training loss every 25 steps
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step schedule
        save_steps=1000,             # Save a checkpoint every 1000 steps
        evaluation_strategy="steps", # Evaluate on a step schedule (renamed eval_strategy in newer transformers releases)
        eval_steps=1000,             # Evaluate every 1000 steps
        do_eval=True,                # Enable evaluation on eval_dataset
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"  # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 1.9399 | 1.834581 |
| 2000 | 1.8416 | 1.765966 |
| 3000 | 1.8245 | 1.734804 |




TrainOutput(global_step=3000, training_loss=1.896956984202067, metrics={'train_runtime': 12772.6007, 'train_samples_per_second': 3.758, 'train_steps_per_second': 0.235, 'total_flos': 1.0737917404957901e+18, 'train_loss': 1.896956984202067, 'epoch': 1.6456390565002743})
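
The `epoch` and throughput figures in `TrainOutput` can be cross-checked from the settings above. This sketch assumes a single device and that steps per epoch are counted as the number of batches divided by the accumulation steps; the function name is hypothetical.

```python
import math


def training_run_stats(num_examples, batch_size, grad_accum, max_steps, runtime_s):
    """Derive epochs and throughput from the run settings (single-device assumption)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    samples_seen = max_steps * batch_size * grad_accum
    return {
        "steps_per_epoch": steps_per_epoch,
        "epochs": max_steps / steps_per_epoch,
        "samples_per_second": samples_seen / runtime_s,
        "steps_per_second": max_steps / runtime_s,
    }


# 29156 training examples, batch size 16, no accumulation, 3000 steps, 12772.6007 s.
stats = training_run_stats(29156, 16, 1, 3000, 12772.6007)
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```

These values agree with the reported `epoch` of about 1.6456, `train_samples_per_second` of 3.758, and `train_steps_per_second` of 0.235, so the run covered a little over one and a half passes through the data.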

### 8. (Base) Inference the Trained Model

To avoid running out of memory when loading the base model again on top of the model we just trained, it's advisable to **restart the kernel**. You can do this via `Kernel > Restart Kernel` or by terminating the process from the terminal.

By default, the PEFT library saves only the QLoRA adapters. Therefore, we need to first load the base model from the Hugging Face Hub.

If you have not yet logged in to the Hugging Face Hub with your access token, revisit [section 3](#3-(base)-load-base-model) or log in below:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.3"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Now load the QLoRA adapter from the best-performing model checkpoint directory.

In [15]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "mistral-journal-finetune/checkpoint-3000")

Now, let's run inference! We'll use an `eval_prompt` close to the earlier one to see whether the fine-tuned model performs better. I like to experiment with the repetition penalty, adjusting it in small steps of 0.01 to 0.05 at a time.

In [13]:
eval_prompt = """ What's the special method to measure Hg concentration in a rock sample?
### Answer :
"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()  # Inference mode: disables dropout
with torch.no_grad():
    output_ids = ft_model.generate(**model_input, max_new_tokens=128, repetition_penalty=1.2)
    output = eval_tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What's the special method to measure Hg concentration in a rock sample?
### Answer :

The most common methods for measuring mercury concentrations are cold vapor atomic absorption spectrometry (CVAAS) and hydride generation-atomic fluorescence spectroscopy. The CVAAS technique is based on the principle that elemental mercury can be converted into an atomized gas by heating it above its boiling point of 357 °C, which allows measurement using an atomic absorption spectrometer. This technique has been used extensively since the early 1980s. However, this technique requires large amounts of mercury, making it unsuitable for low
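
For intuition about what `repetition_penalty=1.2` does during generation, here is a simplified sketch of the usual rule: for every token that has already appeared, a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. This is an illustration with made-up logits, not the library's implementation.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Lower the scores of tokens that already appeared in the sequence."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Made-up logits for a 4-token vocabulary; tokens 0 and 1 were already generated.
logits = [2.4, -1.2, 0.6, 3.0]
print(apply_repetition_penalty(logits, previous_token_ids=[0, 1], penalty=1.2))
```

A penalty of 1.0 leaves the logits unchanged, which is why small adjustments around 1.2 are enough to shift the output noticeably.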


### The fine-tuned model now generates text in a geoscience style!

My knowledge of geoscience has significantly expanded!

I hope you found this tutorial on fine-tuning Mistral with your own data enjoyable. If you have any questions, feel free to ask!