# Chapter 6: Stage 4: Selection of Fine-Tuning Techniques and Appropriate Model Configurations

##  Steps Involved in Fine-Tuning

1. **Initialise the Pre-Trained Tokenizer and Model**
2. **Modify the Model’s Output Layer**
3. **Choose an Appropriate Fine-Tuning Strategy**: Select the fine-tuning strategy that best fits the task and the model architecture. Some options include:
+ Task-Specific Fine-Tuning: For tasks such as text summarisation, code generation, classification, and question answering, adapt the model using relevant datasets.
+ Domain-Specific Fine-Tuning: Tailor the model to comprehend and generate text relevant to specific domains, such as medical, financial, or legal fields.
+ Parameter-Efficient Fine-Tuning (PEFT): Techniques like LoRA, QLoRA, and adapters allow for fine-tuning with reduced computational costs by updating a small subset of model parameters.
+ Half Fine-Tuning (HFT): Balance between retaining pre-trained knowledge and learning new tasks by updating only half of the model’s parameters during each fine-tuning round.

4. **Set Up the Training Loop**
5. **Incorporate Techniques for Handling Multiple Tasks**
6. **Monitor Performance on a Validation Set**
7. **Optimise Model Using Advanced Techniques**: Employ techniques such as Proximal Policy Optimisation (PPO) for reinforcement learning scenarios, or Direct Preference Optimisation (DPO) for aligning model outputs with human preferences. These techniques are particularly useful in fine-tuning models for tasks requiring nuanced decision-making or human-like responses.

8. **Prune and Optimise the Model** (if necessary)
9. **Continuous Evaluation and Iteration**

##  Fine-Tuning Strategies for LLMs

### Task-Specific Fine-Tuning

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:400px">
    <img src="image/task_specific.png" alt="" />
</div>

### Domain-Specific Fine-Tuning

##  Parameter-Efficient Fine-Tuning (PEFT) Techniques

Parameter-Efficient Fine-Tuning (PEFT) adapts pre-trained language models to new applications by fine-tuning only a small number of (often additional) model parameters while keeping most of the pre-trained LLM parameters frozen, which significantly reduces computational and storage costs. Because the pre-trained weights are left largely untouched, this approach also mitigates catastrophic forgetting, a phenomenon in which a neural network loses previously acquired knowledge and suffers a significant performance decline on previously learned tasks when trained on new datasets. PEFT methods have been reported to outperform full fine-tuning in low-data scenarios and to generalise better to out-of-domain contexts.

### Adapters

Adapter-based methods introduce additional trainable parameters after the attention and fully connected layers of a frozen pre-trained model, aiming to reduce memory usage and accelerate training.

The specific approach varies depending on the adapter: it might involve adding an extra layer, or representing the weight update (ΔW) as a low-rank decomposition of the weight matrix. Regardless of the method, adapters are generally small yet achieve performance comparable to fully fine-tuned models, allowing larger models to be trained with fewer resources.
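
As a rough illustration of the "extra layer" variant, the sketch below implements a residual bottleneck adapter in NumPy. The class name, the dimensions (768 and 64) and the zero initialisation of the up-projection are assumptions chosen for the example, not details taken from a specific adapter implementation.

```python
import numpy as np

class BottleneckAdapter:
    """Residual bottleneck adapter: h + up(relu(down(h)))."""

    def __init__(self, d_model, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection to a small bottleneck dimension (trainable).
        self.w_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
        # Up-projection starts at zero, so the adapter is initially an identity map.
        self.w_up = np.zeros((bottleneck, d_model))

    def __call__(self, h):
        return h + np.maximum(h @ self.w_down, 0.0) @ self.w_up

    def num_params(self):
        return self.w_down.size + self.w_up.size


d_model, bottleneck = 768, 64  # hypothetical sizes
adapter = BottleneckAdapter(d_model, bottleneck)
hidden = np.random.default_rng(1).normal(size=(4, d_model))
output = adapter(hidden)

# Size of a frozen feed-forward block with a 4x expansion, for comparison.
frozen_ffn_params = 2 * d_model * (4 * d_model)
ratio = adapter.num_params() / frozen_ffn_params
print(f"adapter params: {adapter.num_params()}, fraction of FFN block: {ratio:.2%}")
```

Only `w_down` and `w_up` would receive gradients during fine-tuning; the surrounding layers stay frozen.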

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:600px">
    <img src="image/peft.png" alt="" />
</div>

### Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a fine-tuning technique for large language models that freezes the original model weights and learns the changes in a separate, much smaller set of weights, whose product is added to the original parameters.

Rather than updating a full weight matrix, LoRA represents the weight update as the product of two low-rank matrices, which reduces the number of trainable parameters, speeds up training, and lowers costs.

This method is particularly useful when multiple clients require fine-tuned models for different applications: a small set of adapter weights can be created for each use case and attached to one shared base model, without the need for a separate full model per client.
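
The low-rank update can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation: the matrix sizes, the rank `r = 8` and the scaling factor `alpha = 16` are assumed values, and the "trained" `B` matrix is simulated with random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16  # hypothetical sizes

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so the update starts at 0


def lora_forward(x, W0, A, B, alpha, r):
    """Frozen path plus the scaled low-rank path: x W0^T + (alpha / r) x A^T B^T."""
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y_start = lora_forward(x, W0, A, B, alpha, r)   # identical to the frozen model

# Simulate a trained adapter, then merge it into a single weight matrix.
B = rng.normal(scale=0.01, size=(d_out, r))
y_adapter = lora_forward(x, W0, A, B, alpha, r)
W_merged = W0 + (alpha / r) * B @ A
y_merged = x @ W_merged.T

full_params = W0.size
lora_params = A.size + B.size
print(f"trainable: {lora_params} of {full_params} ({lora_params / full_params:.2%})")
```

Because `W0` never changes, several `(A, B)` pairs can be stored for different use cases and swapped onto the same base weights.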

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:500px">
    <img src="image/lora.png" alt="" />
</div>

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:300px">
    <img src="image/lora_weight.png" alt="" />
</div>

### QLoRA

QLoRA extends LoRA for greater memory efficiency by quantising the frozen base-model weights of a large language model (LLM) to 4-bit precision. LLM parameters are typically stored in a 16-bit or 32-bit format; compressing them to 4-bit significantly reduces the memory footprint and allows fine-tuning on less powerful hardware, including consumer GPUs. The LoRA adapter weights themselves are not quantised: they are kept in 16-bit precision and trained by backpropagating through the frozen 4-bit weights. QLoRA reduces memory further by also quantising the quantisation constants (double quantisation). Despite the reduction in bit precision, QLoRA maintains performance levels comparable to traditional 16-bit fine-tuning.
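
To make the memory saving concrete, the sketch below estimates the weight-only footprint of a hypothetical 7B-parameter model at different precisions and applies a simple block-wise 4-bit quantisation to a random weight vector. The uniform absmax scheme is a simplified stand-in chosen for illustration; it is not the quantisation data type used by QLoRA, and the block size of 64 is an assumption.

```python
import numpy as np


def weight_memory_gib(n_params, bits):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits / 8 / 2**30


def quantise_blockwise(w, block=64, levels=16):
    """Uniform absmax quantisation per block (illustrative stand-in for 4-bit storage)."""
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block
    half = levels // 2 - 1                             # codes lie in [-7, 7]
    codes = np.round(blocks / scale * half).astype(np.int8)
    return codes, scale


def dequantise_blockwise(codes, scale, levels=16):
    half = levels // 2 - 1
    return codes / half * scale


n_params = 7e9  # hypothetical model size
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):.2f} GiB")

w = np.random.default_rng(0).normal(size=4096)
codes, scale = quantise_blockwise(w)
w_hat = dequantise_blockwise(codes, scale)
max_error = np.abs(w.reshape(-1, 64) - w_hat).max()
print(f"largest reconstruction error: {max_error:.4f}")
```

For this example the weights alone take roughly 26 GiB at 32-bit, 13 GiB at 16-bit and 3.3 GiB at 4-bit; activations, gradients and optimiser state come on top of that.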

### Weight-Decomposed Low-Rank Adaptation (DoRA)

Weight-Decomposed Low-Rank Adaptation (DoRA) is a fine-tuning methodology that decomposes pre-trained weights into magnitude and directional components.

This approach leverages the efficiency of Low-Rank Adaptation (LoRA) for the directional updates, facilitating substantial parameter updates without altering the entire model architecture.

DoRA addresses the computational challenges associated with traditional full fine-tuning (FT) by maintaining model simplicity and inference efficiency, while simultaneously bridging the performance gap typically observed between LoRA and FT.
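
The decomposition can be sketched as follows: the magnitude is a per-column norm, the direction is the column-normalised weight, and a LoRA-style low-rank term updates only the direction. The dimensions and rank are assumed values, and the random `B` simply simulates a trained update.

```python
import numpy as np


def column_norm(M):
    """Norm of every column, kept as a row vector of shape (1, d_in)."""
    return np.linalg.norm(M, axis=0, keepdims=True)


def dora_weight(W0, m, A, B):
    """Magnitude m times the normalised direction of (W0 + B A)."""
    V = W0 + B @ A
    return m * V / column_norm(V)


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4  # hypothetical sizes

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
m = column_norm(W0)                          # trainable magnitude, initialised from W0
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable, zero init

W_init = dora_weight(W0, m, A, B)            # reproduces W0 before any training

B = rng.normal(scale=0.1, size=(d_out, r))   # simulated directional update
W_new = dora_weight(W0, m, A, B)
print("magnitudes preserved:", np.allclose(column_norm(W_new), m))
```

With `m` held fixed, the low-rank term changes only where each column points; changing `m` rescales columns without touching their direction.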

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:600px">
    <img src="image/dora.png" alt="" />
</div>

**Comparison between LoRA and DoRA**

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:300px">
    <img src="image/lora_dora.png" alt="" />
</div>

###  Fine-Tuning with Multiple Adapters

The PEFT library simplifies the process of merging adapters with its `add_weighted_adapter` function, which offers three distinct methods:

1. Concatenation: This straightforward method concatenates the parameters of the adapters. For instance, if two adapters each have a rank of 16, the resulting adapter will have a rank of 32. This method is highly efficient.
2. Linear Combination: Although less documented, this method appears to perform a weighted sum of the adapters’ parameters.
3. SVD: The default method employs singular value decomposition through torch.linalg.svd. While versatile, it is notably slower than the other methods, particularly for adapters with high ranks (greater than 100), which can take several hours.
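
The arithmetic behind the concatenation and SVD options can be illustrated with plain NumPy. This sketch is not the PEFT library's code: the adapter shapes, the weights `w1` and `w2`, and the choice to apply the weights to the `A` matrices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 48, 32, 16  # hypothetical sizes; both adapters have rank 16

A1, B1 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
A2, B2 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
w1, w2 = 0.7, 0.3

# Weighted sum of the two weight updates that the merged adapter should reproduce.
delta_target = w1 * B1 @ A1 + w2 * B2 @ A2

# Concatenation: stack the factors, giving one adapter of rank 2r = 32.
A_cat = np.concatenate([w1 * A1, w2 * A2], axis=0)   # (32, d_in)
B_cat = np.concatenate([B1, B2], axis=1)             # (d_out, 32)
delta_cat = B_cat @ A_cat

# SVD: compress the target update back to rank r (an approximation).
U, S, Vt = np.linalg.svd(delta_target, full_matrices=False)
B_svd = U[:, :r] * S[:r]
A_svd = Vt[:r]
svd_error = np.linalg.norm(delta_target - B_svd @ A_svd) / np.linalg.norm(delta_target)

print("concatenated rank:", A_cat.shape[0])
print(f"relative error of rank-{r} SVD merge: {svd_error:.3f}")
```

Concatenation reproduces the weighted sum exactly at the cost of a higher rank, whereas the SVD route keeps the rank fixed but needs a decomposition of the full update matrix, which is why it becomes slow for large layers and high ranks.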

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:600px">
    <img src="image/multiple_adapter.png" alt="" />
</div>

## Half Fine-Tuning

Half Fine-Tuning (HFT) is a technique designed to balance the retention of foundational knowledge with the acquisition of new skills in large language models (LLMs).

HFT involves freezing half of the model’s parameters during each fine-tuning round while updating the other half, allowing the model to retain pre-trained knowledge and enhance new task performance without altering the model architecture.
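
The per-round freezing can be sketched with a simple selection routine. The block names and the uniform random choice below are assumptions for illustration; the selection scheme used in published HFT experiments may differ (for example, by balancing the choice across parameter types).

```python
import random


def choose_trainable_half(param_names, rng):
    """Pick half of the parameter blocks to update this round; the rest stay frozen."""
    names = list(param_names)
    rng.shuffle(names)
    return set(names[: len(names) // 2])


# Hypothetical parameter blocks of a 4-layer model.
blocks = [f"layer{i}.{part}" for i in range(4) for part in ("attn", "mlp")]

rng = random.Random(0)
rounds = []
for round_idx in range(3):
    trainable = choose_trainable_half(blocks, rng)
    frozen = set(blocks) - trainable
    rounds.append((trainable, frozen))
    print(f"round {round_idx}: update {sorted(trainable)}")
```

In a real training loop, the `frozen` set would have gradient computation switched off for the duration of the round, and a new half would be drawn for the next round.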

### Benefits of Using Half Fine-Tuning

1. Recovery of Pre-Trained Knowledge

2. Enhanced Performance: Experiments show that HFT matches or even surpasses the performance of full fine-tuning (FFT) on downstream tasks, demonstrating its effectiveness in balancing knowledge retention with task-specific learning.

3. Robustness

4. Simplicity and Scalability

5. Versatility

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:500px">
    <img src="image/hft.png" alt="" />
</div>

###  Comparison between HFT and LoRA

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:600px">
    <img src="image/hft_vs_lora.png" alt="" />
</div>

## Lamini Memory Tuning

Foundation models often follow a training regimen similar to the Chinchilla recipe, which prescribes training for a single epoch on a massive corpus, such as training Llama 2 7B on about two trillion tokens. This approach leaves a substantial residual training loss and is geared towards generalisation and creativity, where a degree of randomness in token selection is permissible. However, it falls short for tasks demanding high factual precision.

In contrast, Lamini Memory Tuning goes further by analysing the loss on individual facts, which significantly improves the accuracy of factual recall. By augmenting a model with additional parameters dedicated to memory (e.g., an 8B parameter model with an extra 2B parameters for weights), Lamini enables the model to memorise and accurately recall a significant number of facts, closely aligning performance with LLM scaling laws without compromising generalisation.

### Lamini-1: A Model Architecture Based on Lamini

Departing from traditional transformer-based designs, the Lamini-1 model architecture employs a massive mixture of memory experts (MoME). This system features a pre-trained transformer backbone augmented by adapters that are dynamically selected from an index using cross-attention mechanisms.

These adapters function similarly to experts in Mixture of Experts (MoE) architectures, and the network is trained end-to-end while the backbone is frozen. This setup allows specific facts to be stored exactly in the selected experts.

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:450px">
    <img src="image/lamini-1.png" alt="" />
</div>

###  Systems Optimisations for Banishing Hallucinations

 The MoME architecture is designed to minimise the computational demand required to memorise facts.
 During training, a subset of experts, such as 32 out of a million, is selected for each fact. The weights of
 the backbone network and the cross attention used to select the expert are frozen, and gradient descent
 steps are taken until the loss is sufficiently reduced to memorise the fact. 
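
The training procedure described above can be mimicked with a toy model. The sketch below is not Lamini's implementation: the dot-product lookup stands in for the frozen cross-attention selection, each "expert" is reduced to a single vector, and the sizes, learning rate and tolerance are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d = 10_000, 32, 16  # hypothetical sizes: pick 32 experts out of 10,000

keys = rng.normal(size=(n_experts, d))   # frozen selection index (stand-in for cross-attention)
experts = np.zeros((n_experts, d))       # trainable memory experts


def memorise(query, target, tol=1e-8, max_steps=200):
    """Train only the k selected experts until the loss for this fact is below tol."""
    chosen = np.argsort(keys @ query)[-k:]      # frozen top-k selection
    lr = 0.25 / k
    for step in range(max_steps):
        error = experts[chosen].sum(axis=0) - target
        loss = float(error @ error)
        if loss < tol:
            break
        # Gradient of the squared error with respect to each selected expert.
        experts[chosen] -= lr * 2.0 * error
    return chosen, step, loss


query = rng.normal(size=d)    # representation of the prompt for one fact
target = rng.normal(size=d)   # representation the model should recall
chosen, steps, loss = memorise(query, target)

untouched = np.delete(experts, chosen, axis=0)
print(f"steps: {steps}, final loss: {loss:.2e}, experts changed: {len(chosen)}")
```

Only the selected rows of `experts` are modified, so the cost of memorising one fact is independent of the total number of experts.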

## Mixture of Experts

A mixture of experts (MoE) is an architectural design for neural networks that divides the computation of a layer or operation (e.g., linear layers, MLPs, or attention projection) into several specialised subnetworks, referred to as "experts". Each expert independently carries out its computation, and the results are aggregated to produce the final output of the MoE layer.

MoE architectures can be categorised as either dense, where every expert is engaged for each input, or sparse, where only a subset of experts is utilised for each input.
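
The difference between dense and sparse routing can be shown with a small NumPy sketch. The gating function (a softmax over the selected experts' logits), the linear experts and the sizes are assumptions chosen for brevity.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def moe_layer(x, gate_w, experts, top_k=None):
    """Aggregate expert outputs; top_k=None engages every expert (dense)."""
    logits = x @ gate_w                              # one score per expert
    if top_k is None:
        idx = np.arange(len(experts))
    else:
        idx = np.argsort(logits)[-top_k:]            # keep the highest-scoring experts
    weights = softmax(logits[idx])
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, idx))
    return out, idx


rng = np.random.default_rng(0)
d, n_experts = 8, 4  # hypothetical sizes
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
x = rng.normal(size=d)

dense_out, dense_idx = moe_layer(x, gate_w, experts)
sparse_out, sparse_idx = moe_layer(x, gate_w, experts, top_k=2)
print("dense uses experts:", dense_idx.tolist())
print("sparse uses experts:", sorted(sparse_idx.tolist()))
```

In the sparse case only the selected experts are evaluated, so compute per input scales with `top_k` rather than with the total number of experts.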

###  Mixtral 8x7B Architecture and Performance

Mixtral 8x7B employs a Sparse Mixture of Experts (SMoE) architecture, mirroring the structure of Mistral 7B but incorporating eight feedforward blocks (experts) in each layer.

For every token at each layer, a router network selects two experts to process the current state and combine their outputs. Although each token interacts with only two experts at a time, the selected experts can vary at each timestep. Consequently, each token has access to 47 billion parameters but utilises only 13 billion active parameters during inference.
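
The total and active parameter counts can be approximated with back-of-the-envelope arithmetic. The configuration values below are assumed from the published Mixtral 8x7B model configuration and are not stated in this chapter; normalisation weights are ignored because they are negligible at this scale.

```python
# Assumed configuration values (not given in the text above).
dim, hidden_dim, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab_size, n_experts, top_k = 32000, 8, 2

expert = 3 * dim * hidden_dim                       # three projections of one gated feedforward expert
attention = (2 * dim * n_heads * head_dim           # query and output projections
             + 2 * dim * n_kv_heads * head_dim)     # key and value projections (grouped-query attention)
router = dim * n_experts                            # gating network of one layer
embeddings = 2 * vocab_size * dim                   # input embedding and output head


def total_params(experts_per_layer):
    per_layer = experts_per_layer * expert + attention + router
    return n_layers * per_layer + embeddings


total = total_params(n_experts)   # every expert counted
active = total_params(top_k)      # only the two routed experts per layer
print(f"total:  {total / 1e9:.1f}B parameters")
print(f"active: {active / 1e9:.1f}B parameters per token")
```

The attention, router and embedding weights are shared by every token; only the expert feedforward blocks are sparsely used, which is why the active count is well above a simple quarter of the total.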

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:300px">
    <img src="image/mistral.png" alt="" />
</div>

##  Mixture of Agents

A recent study has investigated leveraging the collective expertise of multiple LLMs to build a more capable and robust system, a method known as Mixture of Agents (MoA).

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:500px">
    <img src="image/moa.png" alt="" />
</div>

###  Methodology

To enhance collaboration among multiple LLMs, it is essential to understand their individual strengths and classify them accordingly. The classification includes:

1. Proposers: These models excel at generating valuable reference responses for other models. While they may not perform exceptionally on their own, they provide useful context and varied perspectives that improve the final output when utilised by an aggregator.
2. Aggregators: These models are adept at merging responses from various models into a single high-quality result. An effective aggregator should maintain or even enhance the quality of the final response, regardless of the quality of the individual inputs.
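
The interaction between the two roles can be sketched as a small pipeline. The layering of proposers, the prompt format and the stub models below are assumptions for illustration; in practice each callable would wrap a call to a real LLM.

```python
from typing import Callable, List

Model = Callable[[str], str]


def build_prompt(question: str, references: List[str]) -> str:
    """Attach earlier responses to the question as reference material."""
    if not references:
        return question
    listed = "\n".join(f"- {r}" for r in references)
    return f"{question}\n\nReference responses:\n{listed}"


def mixture_of_agents(question: str, proposer_layers: List[List[Model]], aggregator: Model) -> str:
    references: List[str] = []
    for layer in proposer_layers:
        prompt = build_prompt(question, references)
        # Every proposer in the layer sees the previous layer's responses.
        references = [model(prompt) for model in layer]
    # The aggregator merges the final set of references into one answer.
    return aggregator(build_prompt(question, references))


def make_stub(name: str) -> Model:
    """Hypothetical stand-in for an LLM: reports how many references it was given."""
    return lambda prompt: f"[{name} given {prompt.count(chr(10) + '- ')} refs]"


layers = [[make_stub("p1"), make_stub("p2")], [make_stub("p3")]]
answer = mixture_of_agents("What is PEFT?", layers, make_stub("agg"))
print(answer)
```

Swapping the stubs for real model clients leaves the control flow unchanged: proposers add context layer by layer and a single aggregator produces the final response.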

##  Proximal Policy Optimisation

## Direct Preference Optimisation (DPO)

## Optimised Routing and Pruning Operations 