# Prerequisites

## Install the required Python libraries

In [None]:
!pip install transformers datasets accelerate peft bitsandbytes trl

## Login to the 🤗 Hub

In [None]:
from huggingface_hub import login
login()

# Dataset preparations

## Load dataset


To fine-tune the model, we need to load a dataset.
Here you would use your own dataset instead of **HuggingFaceH4/helpful-instructions**.
We will only use the first 5% of the dataset to reduce the training time.

In [None]:
from datasets import load_dataset, DatasetDict
instruct_tune_dataset = load_dataset("HuggingFaceH4/helpful-instructions", split='train[:5%]')

## Preprocessing dataset

Each model has a specific input format, so we need to preprocess the dataset to match it. You can find the input format in the model's documentation. The easiest way is to follow the reference code provided by the model's author. Have a look at the [Mistral-7B-Instruct reference implementation](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/README.md#instruction-format).

In [None]:
# Function to format dataset for Mistral
def format_for_mistral(examples):
    # Reformat into the required text format
    formatted_texts = [
        f"<s>[INST] {instruction} [/INST] {demonstration}</s>"
        for instruction, demonstration in zip(examples["instruction"], examples["demonstration"])
    ]
    return {"text": formatted_texts}

# Apply the formatting function
instruct_tune_dataset_mistral = instruct_tune_dataset.map(format_for_mistral, batched=True)
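
To see what the formatter produces without loading the dataset, the sketch below applies the same template to a small hand-written batch. With `batched=True`, `map` passes a dictionary of lists to the function, which the toy batch mimics. The helper name `format_toy_batch` and the two example rows are made up for illustration.

```python
# Minimal sketch: apply the Mistral instruction template to a toy batch.
# The rows below are invented for illustration; they are not from the dataset.
def format_toy_batch(examples):
    """Build one formatted string per (instruction, demonstration) pair."""
    return {
        "text": [
            f"<s>[INST] {instruction} [/INST] {demonstration}</s>"
            for instruction, demonstration in zip(
                examples["instruction"], examples["demonstration"]
            )
        ]
    }

# A batch is a dict of lists, one list per column
toy_batch = {
    "instruction": ["Name a primary color.", "What is 2 + 2?"],
    "demonstration": ["Red.", "4"],
}

formatted = format_toy_batch(toy_batch)
for text in formatted["text"]:
    print(text)
```

Each printed line is one training example in the `text` column that the trainer later reads.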


We need to split the dataset into train, eval, and test sets.

In [5]:
# 70% train, 10% eval, 20% test
train_test_split = instruct_tune_dataset_mistral.train_test_split(test_size=0.30)
eval_test_split = train_test_split['test'].train_test_split(test_size=0.66)

# Create the final DatasetDict with the splits
final_dataset = DatasetDict({
    'train': train_test_split['train'], # 70% of the original dataset
    'eval': eval_test_split['train'],   # 10% of the original dataset
    'test': eval_test_split['test']     # 20% of the original dataset
})
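
The two chained splits can be checked with a little arithmetic. The sketch below multiplies the split ratios used above to get the share of the original dataset that ends up in each set. The actual row counts may differ slightly because the library rounds to whole rows.

```python
# Sketch: work out the overall fractions produced by two chained splits.
first_test_size = 0.30   # share held out by the first split
second_test_size = 0.66  # share of the held-out part that becomes the test set

train_fraction = 1.0 - first_test_size
eval_fraction = first_test_size * (1.0 - second_test_size)
test_fraction = first_test_size * second_test_size

print(f"train: {train_fraction:.3f}")
print(f"eval:  {eval_fraction:.3f}")
print(f"test:  {test_fraction:.3f}")
```

The result is roughly 70% train, 10.2% eval, and 19.8% test, which matches the intended 70/10/20 split closely.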

Have a look at the final dataset:

In [None]:
final_dataset

# Mistral 7B fine-tuning


## Import necessary libraries


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

## Bits and Bytes settings for 4-bit quantization

In [8]:
# Define the model name to be used
model_name = "mistralai/Mistral-7B-v0.1"

4-bit quantization stores the model weights in 4 bits instead of 16 or 32, which shrinks the memory footprint enough to fit the model on a T4 GPU.
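
A rough, weights-only estimate shows why this matters. The sketch below assumes a round figure of 7 billion parameters and ignores quantization constants, activations, gradients, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Assumptions (not from the notebook): ~7e9 parameters, weights only.
NUM_PARAMS = 7e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "4-bit (NF4)": 0.5,
}

def weight_memory_gb(num_params, bytes_per_param):
    """Return the approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {
    name: weight_memory_gb(NUM_PARAMS, bytes_per_param)
    for name, bytes_per_param in BYTES_PER_PARAM.items()
}
for name, gigabytes in estimates.items():
    print(f"{name:>20}: ~{gigabytes:.1f} GB")
```

In half precision the weights alone take about 14 GB, which leaves almost no headroom on a 16 GB T4. At 4 bits they take about 3.5 GB, leaving room for activations and the adapter.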

In [9]:
# Configure Bits and Bytes settings for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # Use NF4 quantization type
    bnb_4bit_use_double_quant=True,         # Enable double quantization
    bnb_4bit_compute_dtype=torch.bfloat16   # Use bfloat16 for computation
)

In [None]:
# Load the pre-trained model with the specified quantization settings
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',                    # Automatically map model to available devices (e.g., GPU)
    quantization_config=nf4_config,       # Apply the quantization configuration
    use_cache=False                       # Disable caching for training
)

In [None]:
# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token     # Set padding token to end-of-sequence token
tokenizer.padding_side = "right"              # Pad on the right side
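
Sequences in a batch must have the same length, so shorter ones are filled up with the padding token. The sketch below illustrates right padding with the EOS token reused as the padding token. The helper `pad_right` and all token ids are made up for illustration; real ids come from the tokenizer.

```python
# Sketch of right padding with the EOS token reused as the pad token.
# Token ids are made up; real ids come from the tokenizer.
EOS_TOKEN_ID = 2
PAD_TOKEN_ID = EOS_TOKEN_ID  # mirrors tokenizer.pad_token = tokenizer.eos_token

def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask

batch = [[1, 415, 907], [1, 22557, 1526, 28808, 2]]
input_ids, attention_mask = pad_right(batch, PAD_TOKEN_ID)

print(input_ids)
print(attention_mask)
```

The attention mask is what distinguishes a padded EOS from a real one, since both share the same token id.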

In [14]:
# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

## PEFT (Parameter Efficient Fine-Tuning)

LoRA (Low-Rank Adaptation) allows for effective fine-tuning of large models with far fewer trainable parameters. It freezes the original model weights and trains only small low-rank adapter matrices that are added to selected layers.

You can find out more about PEFT and techniques other than LoRA for reducing the memory footprint [here](https://huggingface.co/docs/peft/en/index).

In [15]:
# Define PEFT (Parameter Efficient Fine-Tuning) configuration
peft_config = LoraConfig(
    lora_alpha=16,                  # Scaling factor for LoRA
    lora_dropout=0.1,               # Dropout rate for LoRA
    r=64,                           # Rank for LoRA
    bias="none",                    # No bias terms used
    task_type="CAUSAL_LM"           # Specify the task type (Causal Language Model)
)
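
To get a feeling for the numbers, the sketch below counts the parameters a LoRA adapter adds to a single linear layer and computes the scaling factor `lora_alpha / r` applied to the low-rank update. The layer size of 4096 x 4096 is an illustrative assumption, and `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
# Sketch: parameter count of a LoRA adapter for one linear layer.
# The layer size (4096 x 4096) is an illustrative assumption.
def lora_param_count(d_in, d_out, rank):
    """LoRA adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

d_in, d_out = 4096, 4096
rank, lora_alpha = 64, 16        # values from the LoraConfig above

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, rank)
scaling = lora_alpha / rank      # factor applied to the low-rank update

print(f"full layer:   {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({100 * adapter_params / full_params:.2f}% of the layer)")
print(f"scaling factor alpha / r = {scaling}")
```

A lower rank shrinks the adapter further, while a higher `lora_alpha` relative to `r` gives the adapter update more weight.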

In [16]:
# Prepare the model for PEFT
model = get_peft_model(model, peft_config)

You can check whether the PEFT settings are applied correctly by looking at the number of trainable parameters.

In [17]:
def print_train_params(model):
    # Calculate total and trainable parameters
    all_param = sum(param.numel() for param in model.parameters())
    trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)

    # Print summary of parameters
    print(
        f"trainable params: {trainable_params} \n all params: {all_param} \n trainable%: {100 * trainable_params / all_param:.2f}"
    )

# Print the training parameters of the model
print_train_params(model)

## Training Parameters

In [18]:
# Define training arguments for the model
args = TrainingArguments(
    output_dir="mistral_instruct_generation",     # Directory to save model outputs
    warmup_steps=1,                               # Number of warmup steps for learning rate
    per_device_train_batch_size=4,                # Batch size per device during training
    gradient_accumulation_steps=1,                # Gradient accumulation steps (1 means no accumulation)
    gradient_checkpointing=True,                  # Enable gradient checkpointing to save memory
    max_steps=30,                                # Total training steps
    learning_rate=1e-4,                           # Learning rate for the optimizer
    fp16=True,                                    # Use mixed precision training
    fp16_full_eval=True,                          # Use mixed precision during evaluation
    optim="paged_adamw_8bit",                     # Use 8-bit AdamW optimizer
    logging_steps=10,                            # Steps between logging
    logging_dir="./logs",                         # Directory for storing logs
    save_strategy="steps",                        # Strategy for saving model checkpoints
    save_steps=10,                               # Steps between model saves
    eval_strategy="steps",                        # Strategy for evaluation
    eval_steps=10,                               # Steps between evaluations
    do_eval=True,                                 # Perform evaluation during training
    report_to="none",                             # Do not report to any tracking tool (i.e. wandb etc.)
)
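
`TrainingArguments` uses a linear learning rate schedule by default: the rate ramps up during warmup and then decays linearly to zero. The sketch below reproduces that shape for the values above. The function `linear_schedule_lr` is a hypothetical helper written for illustration, not the library's scheduler.

```python
# Sketch of a linear warmup followed by linear decay, using the values above.
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=1, max_steps=30):
    """Return the learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp up linearly from zero
        return base_lr * step / max(1, warmup_steps)
    # Decay: go down linearly so that the rate reaches zero at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

schedule = [linear_schedule_lr(step) for step in range(31)]
for step in (0, 1, 10, 20, 30):
    print(f"step {step:>2}: lr = {schedule[step]:.2e}")
```

With only 30 steps the rate decays quickly, so a longer run keeps the model at a useful learning rate for more updates.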

In [19]:
# Set maximum sequence length for training
max_seq_length = 512

In [None]:
# Initialize the trainer with the model, tokenizer, and training arguments
trainer = SFTTrainer(
    model=model,                                # The model to train
    peft_config=peft_config,                    # Configuration for PEFT
    max_seq_length=max_seq_length,              # Maximum sequence length for input
    tokenizer=tokenizer,                        # Tokenizer for processing text
    packing=True,                               # Enable packing for efficiency
    args=args,                                  # Training arguments
    dataset_text_field="text",                  # Dataset column holding the formatted text created during preprocessing
    train_dataset=final_dataset["train"],       # Training dataset
    eval_dataset=final_dataset["eval"]          # Evaluation dataset
)
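
The idea behind `packing=True` is to avoid wasting compute on padding by joining several short examples into sequences of exactly `max_seq_length` tokens. The sketch below is a simplified illustration of that idea, not TRL's actual implementation. The helper `pack_sequences` and all token ids are made up.

```python
# Simplified illustration of packing: concatenate examples, then cut fixed-size chunks.
# This is not TRL's implementation; token ids and the EOS id are made up.
def pack_sequences(tokenized_examples, max_seq_length, eos_token_id):
    """Join examples (separated by EOS) and split them into equal-length chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens + [eos_token_id])
    # Drop the incomplete tail so that every chunk is exactly max_seq_length long
    num_chunks = len(stream) // max_seq_length
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(num_chunks)
    ]

examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(examples, max_seq_length=4, eos_token_id=2)
print(packed)
```

Note that a packed sequence can start or end in the middle of an example, which is the price paid for having no padding tokens.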

A few notes on the chosen settings:
- **Learning Rate**: Start with a low learning rate, like `1e-4` or `1e-5`. Fine-tuning a pre-trained LLM requires subtle adjustments rather than aggressive changes.
- **Batch Size**: Due to GPU memory limitations, set a small batch size, e.g., `4`. You can also set `gradient_accumulation_steps` > 1 to accumulate gradients over several batches, i.e., to simulate a larger batch size.
- **Gradient Checkpointing**: Reduces memory usage by recomputing activations during the backward pass instead of storing them all. It trades extra compute for memory, which is crucial for fitting large models on the GPU.
- **Mixed Precision (FP16)**: It helps save memory and can speed up training.

Settings to play around with:
- **Max Epochs / Steps**: Increase the number of epochs / training steps if the model is not converging.
- **Batch Size**: Increase the batch size if you have enough GPU memory, or increase `gradient_accumulation_steps` to simulate a larger batch size without the extra memory.
- **Learning Rate**: Adjust the learning rate if the model is not converging.
- **Sequence Length**: Increase the maximum sequence length, i.e., the number of tokens in each training sequence. This lets the model train on longer contexts, which can help it generate more coherent text, at the cost of more GPU memory.
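
The relationship between batch size, gradient accumulation, and the amount of data seen during training can be made concrete with a short calculation. The sketch below assumes a single GPU, and `effective_batch_size` is a hypothetical helper written for illustration.

```python
# Sketch: effective batch size and number of sequences seen during training.
# Assumption: a single GPU (num_devices=1).
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

current = effective_batch_size(per_device_batch_size=4, gradient_accumulation_steps=1)
accumulated = effective_batch_size(per_device_batch_size=4, gradient_accumulation_steps=4)

max_steps = 30
print(f"current setup: {current} sequences per step, {current * max_steps} in total")
print(f"with 4 accumulation steps: {accumulated} sequences per step, "
      f"{accumulated * max_steps} in total")
```

With the settings in this notebook the model sees only 120 training sequences, so increasing `max_steps` or the effective batch size is the first thing to try if the loss has not flattened out.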

## Start the Fine-Tuning job


In [None]:
trainer.train()

## Save the fine-tuned model

You can save the model in different ways. Here we are going to push the model to 🤗 Hub.

Have a look at our [enterprise version](https://huggingface.co/enterprise) for non-public model storage.

In [None]:
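trainer.save_model("mistral7b-hf-cloud-mle")  # Save the fine-tuned adapter locally
# Trainer.push_to_hub() treats its first argument as the commit message, so push via the model to set the repo id
trainer.model.push_to_hub("Mystorius/mistral7b-hf-cloud-mle")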

# Evaluation

After fine-tuning, you have several options to evaluate your model's performance. In general, evaluating LLMs is hard and there is no "go-to" method, but you have a few choices:

1.   Quantitative Metrics (Train & Eval Loss)
2.   Qualitative Evaluation or Human Evaluation. Have domain experts or employees familiar with the company’s use cases evaluate the responses based on correctness, fluency, and relevance.


To learn more about LLM evaluation, have a look at our [Blog](https://huggingface.co/blog/clefourrier/llm-evaluation).

Below is an example of a quantitative metric: `trainer.evaluate()` reports the loss on the eval split. For additional metrics, have a look at our [🤗 Evaluate](https://huggingface.co/docs/evaluate/en/index) library.

In [None]:
trainer.evaluate()
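
The returned dictionary contains the evaluation loss under the key `eval_loss`. A common way to make that number easier to interpret is to convert it to perplexity, the exponential of the mean cross-entropy loss. The sketch below uses a placeholder loss value, not a result from this notebook.

```python
import math

# Sketch: turn an evaluation loss into perplexity.
# The loss value is a placeholder, not a result from this notebook.
example_metrics = {"eval_loss": 1.25}

def perplexity_from_loss(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)

perplexity = perplexity_from_loss(example_metrics["eval_loss"])
print(f"eval loss: {example_metrics['eval_loss']:.2f} -> perplexity: {perplexity:.2f}")
```

Lower is better for both values. A perplexity of 1.0 would mean the model predicts every token of the eval split with certainty.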

To use the model and run it against our test dataset, we need to tokenize the *instruction* column so it can be passed to the model's `forward()` function.

In [97]:
# Define a function to preprocess the dataset (tokenizing the 'instruction')
def preprocess_function(examples):
    # Tokenize 'instruction' as input
    tokenized_inputs = tokenizer(examples['instruction'], truncation=True, padding=True)
    return tokenized_inputs

# Apply the preprocessing to the test dataset
tokenized_test_dataset = final_dataset["test"].map(preprocess_function, batched=True)

# Now we remove the unnecessary columns such as 'instruction', 'meta', 'text'
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["instruction", "meta", "text"])

# Let's use 5 rows as an example
tokenized_test_dataset = tokenized_test_dataset.select(range(5))

In [None]:
predictions = trainer.predict(
    test_dataset=tokenized_test_dataset
)

In [103]:
# Extract the first 5 original inputs (instructions) and ground truth outputs (demonstrations)
original_instructions = final_dataset["test"]["instruction"][:5]  # Input instructions
ground_truth_outputs = final_dataset["test"]["demonstration"][:5]  # Expected outputs

In [109]:
# Decode the predicted tokens back into text
predicted_tokens = predictions.predictions.argmax(axis=-1)
decoded_predictions = tokenizer.batch_decode(predicted_tokens, skip_special_tokens=True)
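
`predictions.predictions` holds one score per vocabulary entry for every input position, so `argmax(axis=-1)` picks the highest-scoring token id at each position. The sketch below shows the operation on a tiny made-up score table in plain Python. Keep in mind that each position is predicted from the original input tokens before it, which is not the same as step-by-step text generation with `model.generate()`.

```python
# Sketch: what argmax over the last axis does, using a tiny made-up score table.
# Shape: (sequence_length=3, vocab_size=4) for a single example.
toy_logits = [
    [0.1, 2.0, 0.3, 0.0],   # position 0: highest score at index 1
    [1.5, 0.2, 0.1, 0.9],   # position 1: highest score at index 0
    [0.0, 0.1, 0.2, 3.0],   # position 2: highest score at index 3
]

def argmax_last_axis(logits):
    """Return the index of the largest score at every position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

toy_predicted_ids = argmax_last_axis(toy_logits)
print(toy_predicted_ids)
```

The resulting list of token ids is what `tokenizer.batch_decode` turns back into text.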

In [105]:
# Print input, ground truth, and prediction for each of the 5 examples
for i in range(5):
    print(f"Input (Instruction): {original_instructions[i]}")
    print(f"Ground Truth (Demonstration): {ground_truth_outputs[i]}")
    print(f"Generated Output: {decoded_predictions[i]}")
    print("-" * 50)  # Separator

# Going Bigger


Let’s assume the results from Mistral-7B are not satisfactory and you want to try out a larger open-source model like LLaMA-70B. Here are some considerations:

1. **Memory Scaling**: LLaMA-70B has roughly 10x as many parameters as Mistral-7B and therefore needs roughly 10x the memory for training. This far exceeds the capacity of a single T4 GPU. To fine-tune such large models, you would need:
   - **Multi-GPU setup**: You could explore distributed training across multiple GPUs or even TPUs.
   - **Memory-efficient techniques**: Use **DeepSpeed**, **ZeRO optimization**, and **tensor parallelism** to handle the larger model size.

2. **Longer Training Time**: The training duration will scale accordingly with the model size and dataset complexity.

3. **Costs**: Larger models require expensive cloud infrastructure or optimized hardware to support fine-tuning.
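
The memory point can be illustrated with a rough lower bound on the number of GPUs needed just to hold the quantized weights. The sketch below assumes 4-bit weights (0.5 bytes per parameter) and 16 GB of memory per GPU. `min_gpus_for_weights` is a hypothetical helper, and real requirements are higher because activations, gradients, and optimizer state are ignored.

```python
import math

# Rough sketch: how many GPUs are needed just to hold the 4-bit weights?
# Assumptions: 0.5 bytes per parameter, 16 GB per GPU, weights only.
def min_gpus_for_weights(num_params, bytes_per_param=0.5, gpu_memory_gb=16.0):
    """Lower bound on the GPU count; ignores activations, gradients and optimizer state."""
    weight_gb = num_params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / gpu_memory_gb)

for name, num_params in [("7B model", 7e9), ("70B model", 70e9)]:
    weight_gb, gpus = min_gpus_for_weights(num_params)
    print(f"{name}: ~{weight_gb:.1f} GB of 4-bit weights, at least {gpus} x 16 GB GPU(s)")
```

Even in this optimistic estimate, the 70B model no longer fits on a single 16 GB card, which is why a multi-GPU setup or larger accelerators become necessary.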

While fine-tuning a large model like LLaMA-70B would provide additional capabilities, it might be too expensive and time-consuming.


## Fine-Tuning as a Service

If managing the fine-tuning process yourself seems daunting, you can opt for fine-tuning as a service offered by cloud providers like Amazon Web Services (AWS) or Google Cloud Platform (GCP). These platforms provide managed solutions that simplify the process of training large models:

- **AWS Bedrock**: You can find out more [here](https://aws.amazon.com/blogs/aws/customize-models-in-amazon-bedrock-with-your-own-data-using-fine-tuning-and-continued-pre-training/)

- **GCP Vertex AI**: You can find out more [here](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-models)

Utilizing these services can save you time and effort, allowing you to focus on data preparation and model evaluation rather than the complexities of infrastructure management.