# Main Components

1. **LLMs** based on the **Transformer Architecture**
2. **RAG** (Retrieval Augmented Generation) architecture that pairs generalised LLMs with domain-specific data sets in **Vector Databases**
3. **Multi-Agent Systems** combine multiple LLM agents, Vector Database sources and Function Execution to perform complex tasks
4. **Open Models** based on either open, free-to-use weights (e.g. Llama from Meta) or open source (e.g. Mixtral from Mistral). These can be fine-tuned and allow proprietary, low-cost model-serving solutions to be developed.

# LLMs (Large Language Models)

## What are LLMs?

LLMs are advanced AI systems designed to understand and generate text that sounds like it was written by a human. These models use large amounts of data and complex neural networks, like transformers, to perform many language-related tasks.

<img src='https://sciforce.solutions/strapi/uploads/02_guide_773e2d4361.jpg' width='500'>

**Autoregressive LLMs**
- generate text by predicting the next word based on the previous words in a sentence.
- for tasks:
    - Chatbots and Virtual Assistants
    - Content Creation Tools
    - Predictive Text and Autocomplete
    - Language Translation Services
    - Interactive Storytelling
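
The autoregressive loop can be illustrated with a toy sketch. It assumes a hand-written lookup table (`NEXT_WORD`, hypothetical) in place of a trained model; a real LLM would instead predict a probability distribution over tokens from the whole preceding context.

```python
# Hypothetical lookup table standing in for a trained model:
# it maps the previous word to the most likely next word.
NEXT_WORD = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "</s>",
}


def generate(max_tokens=10):
    """Greedy autoregressive decoding: each step feeds on the previous output."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        next_token = NEXT_WORD.get(tokens[-1], "</s>")
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(generate())
```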

**Autoencoding LLMs**
- focus on learning the structure and meaning of a text.
- compress the input text into a simpler, lower-dimensional representation (encoding)
- then reconstruct it back to the original form (decoding)
- for tasks:
    - Text Classification
    - Sentiment Analysis
    - Information Retrieval
    - Anomaly Detection
    - Text Summarization
    - Semantic Search

**Hybrid LLMs**
- leverage the strengths of both autoregressive and autoencoding models.
- generate text while also deeply understanding the context.
- for tasks:
    - Conversational Agents
    - Advanced Content Creation
    - Comprehensive Text Analysis
    - Interactive Storytelling
    - Intelligent Search Engines
    - Personalized Recommendations
    - Language Translation

## Building a Private LLM

### Data Curation

The datasets used typically range from hundreds of terabytes to multiple petabytes.

<img src='https://sciforce.solutions/strapi/uploads/03_guide_b1a09510db.jpg' width='400'>

- **Web Data**: FineWeb (not fully deduplicated for better performance, entirely English), Common Crawl (55% non-English)

- **Code**: Publicly Available Code from all the major code hosting platforms

- **Academic Texts**: Anna’s Archive, Google Scholar, Google Patents

- **Books**: Google Books, Anna’s Archive

- **Court Documents**: RECAP archive (USA), Open Legal Data (Germany)

### Data Preprocessing

**Tokenization**

Tokenization is the process of breaking text down into tokens: words, subwords, or characters.

<img src='https://sciforce.solutions/strapi/uploads/04_llm_aea9a8fcc4.jpg' width='400'>

It helps the model:
- handle various text lengths.
- easily manage a set of words or subwords.
- understand the context of each token within a sentence.
- improve the accuracy of tasks like translation or text generation.
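
A minimal sketch of word-level and character-level tokenization using only the standard library. The sample sentence and the toy vocabulary are assumptions for illustration; production LLMs typically use learned subword tokenizers instead.

```python
import re

text = "Tokenization helps models read text."

# Word-level: split into words and punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character is its own token.
char_tokens = list(text)

# Toy vocabulary: map each distinct word token to an integer id.
vocab = {token: idx for idx, token in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[token] for token in word_tokens]

print(word_tokens)
print(token_ids)
```

The model never sees the raw strings, only the integer ids.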

**Embedding**

Embedding turns each token (or piece of text) into a vector, a list of numbers that captures its meaning so a computer can work with it. For example:

<img src='https://sciforce.solutions/strapi/uploads/05_guide_0369548893.jpg' width='400'>
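
A small sketch of an embedding lookup with `torch.nn.Embedding`. The three-word vocabulary and the 4-dimensional vectors are assumptions chosen to keep the example readable. The vectors here are randomly initialised, so the similarity score only becomes meaningful after training.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy vocabulary: token -> integer id.
vocab = {"king": 0, "queen": 1, "apple": 2}

# Lookup table with one trainable 4-dimensional vector per token.
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

token_ids = torch.tensor([vocab["king"], vocab["queen"]])
vectors = embedding(token_ids)  # shape: (2, 4)

# Cosine similarity is a common way to compare two embedding vectors.
similarity = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)

print(vectors.shape)
print(similarity.item())
```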

**Attention**

Helps the model determine which words matter most for getting the meaning right.

<img src='https://sciforce.solutions/strapi/uploads/06_guide_cfc59c7660.jpg' width='400'>

- The green attention reflects positive feedback.
- The red attention reflects negative feedback.
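
One common form of attention is the scaled dot-product attention used in transformers. The sketch below is a simplified single-head version without learned projections or masking; the function name and tensor sizes are assumptions for illustration.

```python
import math

import torch


def scaled_dot_product_attention(query, key, value):
    """Return the attended values and the attention weights."""
    d_k = query.size(-1)
    # Similarity of every query token with every key token, scaled for stability.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights


torch.manual_seed(0)
tokens = torch.randn(3, 8)  # 3 tokens, each an 8-dimensional embedding
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)

print(weights)  # row i shows how much token i attends to every token
```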

### LLM Training

**Data Input and Preparation**

1. **Data Ingestion**: Collect and load data from various sources.
2. **Data Cleaning**: Remove noise, handle missing data, and redact sensitive information.
3. **Normalization**: Standardize text, handle categorical data, and ensure data consistency.
4. **Chunking**: Split large texts into manageable chunks while preserving context.
5. **Tokenization**: Convert text chunks into tokens for model processing.
6. **Data Loading**: Efficiently load and shuffle data into batches for optimized training, using parallel loading when necessary.
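
The chunking step can be sketched as follows. This is one possible approach, assuming word-based chunks with a fixed overlap so neighbouring chunks share some context. The function name and default sizes are hypothetical.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # the last words are already covered
            break
    return chunks


sample = " ".join(str(i) for i in range(20))
for chunk in chunk_words(sample):
    print(chunk)
```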

**Hyperparameter Tuning**

Tuning **key hyperparameters** is essential to ensure the training loop converges effectively, leading to better model performance and efficiency.
- **Learning Rate**: Determines the size of weight updates during training.

<img src='https://sciforce.solutions/strapi/uploads/07_guide_5e1f6d07c5.jpg' width='500'>

- **Batch Size**: The number of samples processed in each iteration.
    - Larger batches stabilize training but require more memory.
    - Smaller batches introduce variability but are less resource-intensive.

**Parallelization and Resource Management**

As LLMs grow in size, parallelization and resource management techniques become essential for speeding up training and handling large datasets efficiently.
- **Data Parallelization** replicates the model on multiple GPUs and splits each batch of data across them; the gradients are then combined so that all replicas stay in sync.
- **Model Parallelization** divides the model itself across GPUs, so a model too large for a single GPU's memory can still be trained.
- **Gradient Checkpointing** reduces memory usage by storing only selected intermediate results (activations) during forward propagation and recomputing the rest during backward propagation, trading extra compute for lower memory.
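
As an illustration of gradient checkpointing, PyTorch provides `torch.utils.checkpoint.checkpoint`. The sketch below wraps a small block (layer sizes are assumptions) so that its intermediate activations are recomputed during the backward pass instead of being kept in memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# A small block standing in for one transformer layer.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 16),
)

x = torch.randn(4, 16, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed in backward().
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()

print(out.shape, x.grad.shape)
```

The gradients match an ordinary forward and backward pass; only the memory and compute trade-off changes.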

**Iteration and Epochs** (for each data batch):
1. Make a prediction (forward propagation).
2. Calculate the training loss.
3. Calculate the loss gradient (backward propagation).
4. Update the model weights.
5. Calculate the validation loss.

Through repeated iterations and multiple epochs:
- the model's parameters are progressively adjusted, making the model increasingly accurate and robust.
- comparing the training loss with the validation loss helps detect issues such as overfitting or underfitting.
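
The loop can be sketched in PyTorch on a tiny synthetic regression task. The data, the linear model, the learning rate and the batch size are all assumptions chosen so the example runs in a moment. As a simplification, the validation loss is computed once per epoch rather than after every batch.

```python
import torch

torch.manual_seed(0)

# Synthetic data: targets are a fixed linear function of the inputs.
X = torch.randn(64, 4)
y = (X @ torch.tensor([1.0, -2.0, 0.5, 3.0])).unsqueeze(1)
X_train, y_train, X_val, y_val = X[:48], y[:48], X[48:], y[48:]

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate
loss_fn = torch.nn.MSELoss()
batch_size = 16

val_history = []
for epoch in range(20):
    model.train()
    for start in range(0, len(X_train), batch_size):
        xb = X_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        prediction = model(xb)          # 1. forward propagation
        loss = loss_fn(prediction, yb)  # 2. training loss
        optimizer.zero_grad()
        loss.backward()                 # 3. backward propagation
        optimizer.step()                # 4. weight update
    model.eval()
    with torch.no_grad():               # 5. validation loss
        val_history.append(loss_fn(model(X_val), y_val).item())

print(f"first epoch: {val_history[0]:.4f}, last epoch: {val_history[-1]:.4f}")
```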

### Evaluating LLMs

**Technical Evaluation** through benchmarks:
- **MMLU (Massive Multitask Language Understanding)**:
    - measures the model’s natural language understanding across a broad range of subjects
    - is used to assess the general linguistic and reasoning capabilities of LLMs.

- **GPQA (Graduate-Level Google-Proof Q&A)**:
    - evaluates the ability to answer difficult, expert-written multiple-choice questions in biology, physics, and chemistry.
    - tests whether the model can give accurate answers that are hard to find through a simple web search.

- **MATH**:
    - tests mathematical reasoning skills.
    - involves solving multi-step problems that require both calculation and logical reasoning.

- **HumanEval**:
    - assesses the ability to generate functional and correct code.

- **Arena** (advanced benchmarks and platforms):
    - users pose questions to two anonymous LLMs and vote for the better answer, which produces an LLM ranking.
    - provides a more dynamic, user-driven evaluation.

`Note`:
- Fine-tuning typically involves adapting the model to specific prompts and contexts.
- The chosen metrics should reflect the business objectives.

**Conversational Performance Evaluation**

The engagement, coherence, and context awareness metrics measure how effectively the model engages with users and maintains conversation quality.

<img src='https://sciforce.solutions/strapi/uploads/08_guide_f1eb9f3dc9.jpg' width='500'>

**Continuous Monitoring**
- Ensure that the model maintains its performance over time, especially as new data becomes available or as the model is deployed in different contexts.
- As new data is introduced, periodically retrain and fine-tune the model to keep it accurate and relevant.

# RAG (Retrieval Augmented Generation)

RAG adds extra contextual data to the prompt.

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*qbkE0RhqQw8AH6N9EeMnnw.png' width=600>

**Explanation**:
- The red arrow path (indexing):
    - split the documents in the data source into small chunks.
    - pass the chunks through an embedding model to get embedded chunks.
    - store the embedded chunks in a vector DB.
- The remaining path (querying):
    - pass the user's query through the embedding model to get the embedded query.
    - calculate the similarity score between the embedded query and each embedded chunk.
    - combine the highest-scoring chunk with the query to form the prompt.
    - pass the prompt to an LLM to get the final response.
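
Both paths can be sketched end to end in plain Python. This sketch assumes a toy bag-of-words `embed` function in place of a real embedding model and an in-memory list in place of a vector DB. The final LLM call is left out; the sketch stops at the assembled prompt. All names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Indexing path: chunk -> embedding -> store.
chunks = [
    "Quantization maps weights from FP32 to INT8.",
    "RAG retrieves relevant chunks from a vector database.",
    "Tokenization splits text into tokens.",
]
vector_db = [(chunk, embed(chunk)) for chunk in chunks]


def build_prompt(query):
    """Query path: embed the query, pick the best chunk, assemble the prompt."""
    query_vector = embed(query)
    best_chunk, _ = max(
        vector_db, key=lambda item: cosine_similarity(query_vector, item[1])
    )
    return f"Context: {best_chunk}\n\nQuestion: {query}"


prompt = build_prompt("How does quantization change weights?")
print(prompt)
```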

# Multi-Agent Systems

Multi-agent systems are where multiple smaller models collaborate to achieve goals that are unattainable by single models alone.

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*cJT7HJAhLZHx8zVw2jdsxQ.png' width=500>

This structure enables the following:

- Guardrails to check output quality: techniques such as LLM-as-a-judge can check for signs of hallucination or inappropriate responses.
- Automated process flows, driven by function calls.
- A human-in-the-loop who gives feedback on the responses and makes an expert assessment of automated content.
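
The control flow of such a system can be sketched with plain functions. Everything here is hypothetical: `writer_agent` and `judge_agent` are stand-ins for LLM calls, and `get_order_status` is an invented business function. Only the orchestration pattern (function call, guardrail, optional human review) is the point.

```python
def get_order_status(order_id):
    """Hypothetical business function that an agent is allowed to call."""
    return {"order_id": order_id, "status": "shipped"}


def writer_agent(question):
    """Stand-in for an LLM agent: calls a function, then drafts an answer."""
    order = get_order_status(order_id="A123")
    return f"Order {order['order_id']} is {order['status']}."


def judge_agent(question, draft):
    """Stand-in for an LLM-as-a-judge guardrail: accept only grounded drafts."""
    return "shipped" in draft or "pending" in draft


def answer(question, human_review=None):
    draft = writer_agent(question)
    if not judge_agent(question, draft):
        return "Escalated to a human expert."
    # Optional human-in-the-loop check before the answer is released.
    if human_review is not None and not human_review(draft):
        return "Escalated to a human expert."
    return draft


print(answer("Where is order A123?"))
```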

# LLM Quantization

## Definition

**Quantization** is a method of compressing a larger size model (LLM or any deep learning model) to a smaller size.

**Quantization** maps the model's weights and activations from higher precision (e.g. FP32) to lower precision (e.g. FP16, BF16, or INT8).
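
To get a feel for the saving, the weight storage of a model can be estimated as the number of parameters times the bytes per parameter. The 7-billion-parameter model below is an assumed example, and the estimate ignores activations and any per-tensor quantization metadata.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}


def model_size_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_size_gb(num_params, dtype):.1f} GB")
```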

### A. Asymmetric linear quantization

The method maps the values from the original tensor range (Wmin, Wmax) to the values in the quantized tensor range (Qmin, Qmax).

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*2PR4w80jx4Eem3ouVleZFg.png' width='500'>

- **Original tensor (W)**
- **Quantized tensor (Q)**
- **Scale value (S)**:
    - scales down the values of the original tensor during quantization.
    - scales up the values of the quantized tensor during dequantization.
    - The data type of the scale value is the same as the original tensor.
- **Zero point (Z)**:
    - is the value in the quantized tensor range (generally non-zero) that maps directly to the value 0 in the original tensor range.
    - The data type of the zero point is the same as the quantized tensor.

**Quantization Process**

**Step 1: Calculate de-quantized (origin) tensor value:**
$$W=S(Q-Z) \text{        (eq 1)}$$

**Step 2: Calculate scale value:**
$$\begin{align*}
W_{max}-W_{min} &= S(Q_{max}-Z)-S(Q_{min}-Z) \\
&= S(Q_{max}-Q_{min}) \\
\Rightarrow S &= \frac{W_{max}-W_{min}}{Q_{max}-Q_{min}}  \text{        (eq 2)}
\end{align*}$$

**Step 3: Calculate zero point value:**
$$W_{min} = S(Q_{min}-Z) \Rightarrow Z = Q_{min}-\frac{W_{min}}{S}$$
Z must be stored as INT8, so the result is rounded and cast to an integer.
$$Z = int\left(round\left(Q_{min}-\frac{W_{min}}{S}\right)\right) \text{        (eq 3)}$$

**Step 4: Calculate quantized tensor value:**
$$W = S(Q-Z) \Rightarrow Q = \frac{W}{S} + Z$$
Q must be stored as INT8, so the result is rounded and cast to an integer.
$$Q = int\left(round\left(\frac{W}{S} + Z\right)\right) \text{        (eq 4)}$$
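
As a worked example (the numbers are assumed purely for illustration), take $W_{min}=-1$, $W_{max}=3$ and an INT8 target, so $Q_{min}=-128$ and $Q_{max}=127$. Quantizing the value $W=0.5$ by inverting eq 1, then de-quantizing it again, gives:

$$\begin{align*}
S &= \frac{3-(-1)}{127-(-128)} = \frac{4}{255} \approx 0.01569 \\
Z &= int\left(round\left(-128-\frac{-1}{4/255}\right)\right) = int(round(-64.25)) = -64 \\
Q &= int\left(round\left(\frac{0.5}{4/255} + (-64)\right)\right) = int(round(-32.125)) = -32 \\
W' &= S(Q-Z) = \frac{4}{255}\left(-32-(-64)\right) = \frac{128}{255} \approx 0.502
\end{align*}$$

The round trip returns about 0.502 instead of 0.5; this small difference is the quantization error.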

**Issue: Zero point (Z) or Quantized tensor value (Q) runs out of the range** 

`Solution`: Change the value of Z or Q to $Q_{min}$ if it is smaller than $Q_{min}$ and to $Q_{max}$ if it is bigger than $Q_{max}$.

`Note:` The range of values is
- [-128, 127] for INT8 (signed integer).
- [0, 255] for UINT8 (unsigned integer).

**Cons:** the zero point Z is an extra parameter to store, so the memory footprint is higher than with symmetric quantization for large models.

**Recommended for:** quantization to 4-bit, 2-bit, or 1-bit integers.

### B. Symmetric linear quantization

In the symmetric method, the 0 point in the original tensor range maps to the 0 point in the quantized tensor range.
 
<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*zsy-J4_-8qyLV_yXMgYBBA.png' width=500>

**Quantization Process**

**Step 1: Calculate de-quantized (origin) tensor value:**
$$W=S(Q) \text{        (eq 1)}$$

**Step 2: Calculate scale value:**
$$W_{max} = SQ_{max} \Rightarrow S = \frac{W_{max}}{Q_{max}}  \text{        (eq 2)}$$

**Step 3: Calculate quantized tensor value:**
$$W = S(Q) \Rightarrow Q = \frac{W}{S}$$
Q should have the data type of INT8.
$$Q = int\left(round\left(\frac{W}{S}\right)\right) \text{        (eq 3)}$$

**Cons:** if the weight values are not distributed evenly around zero, a large section of the quantized range goes unused.

**Recommended for:** quantization to 8-bit integers.

## Quantize and de-quantize LLM weight parameters

Let’s have a quick look at how weight parameter values change after quantization in the transformer model.

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*5MnWAtVmMCzOWAgzwd-8xg.png' width=500>

Example of FP32, INT8, UINT8 data type distribution and calculation

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*X3xu6Msz36b0IcL3iRCNJw.png' width=500>

### Asymmetric quantization

Given the original weight tensor (datatype: FP32)

```python
import torch

original_weight = torch.randn((4,4))
print(original_weight)
```

```
tensor([[-1.1538,  1.4439, -1.7790, -0.1252],
        [ 0.5709, -0.0624, -2.9413,  0.8313],
        [-1.1088, -0.3239, -0.0625,  0.9305],
        [ 1.0707,  1.0060,  0.3864, -0.8464]])
```


Define the quantization and de-quantization functions:

```python
def asymmetric_quantization(original_weight):
    # Define the data type to quantize to. In our example, it's INT8.
    quantized_data_type = torch.int8

    # Get the Wmax and Wmin values from the original weight, which is in FP32.
    Wmax = original_weight.max().item()
    Wmin = original_weight.min().item()

    # Get the Qmax and Qmin values from the quantized data type.
    Qmax = torch.iinfo(quantized_data_type).max
    Qmin = torch.iinfo(quantized_data_type).min

    # Calculate the scale value using the scale formula (eq 2). Datatype - FP32.
    S = (Wmax - Wmin)/(Qmax - Qmin)

    # Calculate the zero point value using the zero point formula (eq 3). Datatype - INT8.
    Z = Qmin - (Wmin/S)

    # Clamp Z to [Qmin, Qmax] in case it is out of range, then round it to an integer.
    Z = max(Qmin, Z)
    Z = min(Qmax, Z)
    Z = int(round(Z))

    # Calculate the quantized weight.
    quantized_weight = (original_weight/S) + Z

    # Ensure the quantized weight doesn't go out of the range [Qmin, Qmax].
    quantized_weight = torch.clamp(torch.round(quantized_weight), Qmin, Qmax)

    # Cast the datatype to INT8.
    quantized_weight = quantized_weight.to(quantized_data_type)

    return quantized_weight, S, Z

def asymmetric_dequantization(quantized_weight, scale, zero_point):
    # Use the dequantization formula derived above (eq 1).
    # Convert quantized_weight to float first: subtracting zero_point in INT8 could overflow and give wrong results.
    dequantized_weight = scale * (quantized_weight.to(torch.float32) - zero_point)

    return dequantized_weight
```

```python
quantized_weight, scale, zero_point = asymmetric_quantization(original_weight)
print(f"quantized weight: {quantized_weight}")
print(f"scale: {scale}")
print(f"zero point: {zero_point}")
```

```
quantized weight: tensor([[ -24,  127,  -60,   36],
        [  76,   39, -128,   91],
        [ -21,   24,   39,   97],
        [ 105,  102,   65,   -6]], dtype=torch.int8)
scale: 0.01719683665855258
zero point: 43
```


```python
dequantized_weight = asymmetric_dequantization(quantized_weight, scale, zero_point)
print(dequantized_weight)
```

```
tensor([[-1.1522,  1.4445, -1.7713, -0.1204],
        [ 0.5675, -0.0688, -2.9407,  0.8254],
        [-1.1006, -0.3267, -0.0688,  0.9286],
        [ 1.0662,  1.0146,  0.3783, -0.8426]])
```


Calculate the quantization error (mean squared error) between the de-quantized weights and the original ones:

```python
quantization_error = (dequantized_weight - original_weight).square().mean()
print(quantization_error)
```

```
tensor(2.9039e-05)
```

`Note:` The quantization error is very small.

### Symmetric quantization

The code is the same as for the asymmetric method, with two changes: the zero point (`zero_point`) is always 0, and the scale is computed as $S = W_{max}/Q_{max}$ (eq 2 of the symmetric method).
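
A minimal sketch of the symmetric variant is shown below. One assumption differs from the formula as written: the scale uses the largest absolute weight rather than $W_{max}$, so that negative weights also fit into the range. The function names are illustrative.

```python
import torch


def symmetric_quantization(original_weight, quantized_data_type=torch.int8):
    # Largest representable magnitude of the target type (127 for INT8).
    Qmax = torch.iinfo(quantized_data_type).max

    # Assumption: use the largest absolute weight so both signs fit in [-Qmax, Qmax].
    Wmax = original_weight.abs().max().item()

    # Scale value; fall back to 1.0 for an all-zero tensor to avoid dividing by zero.
    S = Wmax / Qmax if Wmax > 0 else 1.0

    # The zero point is fixed at 0, so only scaling, rounding and clamping remain.
    quantized_weight = torch.clamp(torch.round(original_weight / S), -Qmax, Qmax)
    return quantized_weight.to(quantized_data_type), S


def symmetric_dequantization(quantized_weight, scale):
    # W = S * Q, computed in float to avoid integer overflow.
    return scale * quantized_weight.to(torch.float32)


torch.manual_seed(0)
weight = torch.randn((4, 4))
quantized, scale = symmetric_quantization(weight)
dequantized = symmetric_dequantization(quantized, scale)
print((dequantized - weight).square().mean())
```

Because no value is clipped, each de-quantized weight differs from the original by at most half a scale step.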

# Reference

[LLMs— From Research to Reality](https://medium.com/@ed.bullen/llms-from-research-to-reality-57d5936552c1)

[Step-by-Step Guide to Creating Your Own Large Language Model](https://sciforce.solutions/blog/stepbystep-guide-to-creating-your-own-large-language-model-208)

[Building Multi AI Agent Systems: A Comprehensive Guide!](https://ai.plainenglish.io/building-multi-ai-agent-systems-a-comprehensive-guide-58bf21f84f6e)

[Want to Learn Quantization in The Large Language Model?](https://pub.towardsai.net/want-to-learn-quantization-in-the-large-language-model-57f062d2ec17)