# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and Ollama, build a tool that takes a technical question  
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [7]:
# imports
import ollama
import os
from openai import OpenAI
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display

In [1]:
# constants

MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

In [6]:
# set up environment
load_dotenv(override=True)
api_key_OpenAI = os.getenv('OPENAI_API_KEY')
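
A quick sanity check on the loaded key can save a confusing authentication error later. The sketch below is a hypothetical helper (`describe_api_key` is not part of the original notebook), and the `sk-` prefix test is an assumption about how OpenAI keys are usually formatted.

```python
def describe_api_key(key):
    """Return a short status message for an API key string (hypothetical helper)."""
    if not key:
        return "No API key found; check that .env defines OPENAI_API_KEY"
    if key != key.strip():
        return "API key has leading or trailing whitespace"
    if not key.startswith("sk-"):
        # Assumption: OpenAI keys start with 'sk-'; adjust if your key format differs
        return "API key found, but it does not start with 'sk-'"
    return "API key looks plausible"

# Placeholder value for illustration only
print(describe_api_key("sk-example"))
```

In the notebook you would pass `api_key_OpenAI` instead of the placeholder string.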

In [12]:
# here is the question; type over this to ask something new

system_prompt = """
You are a personal tutor in LLM engineering, AI, and Machine Learning.
Do not just give a direct answer to each question:
explain the concept behind the question
or give a framework for solving it,
to help your student master their AI engineering skills.
Provide your response in markdown.
"""

question = """
Name the 3 main types of quantization for LLMs
and explain how they differ.
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

In [20]:
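def stream_response(stream):
    """Stream a reply from the OpenAI or Ollama client into a live Markdown display."""
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        if hasattr(chunk, 'message'):
            # Ollama chunk: text is in chunk.message.content
            piece = getattr(chunk.message, 'content', '')
        else:
            # OpenAI chunk: text is in choices[0].delta.content, which can be None
            piece = getattr(chunk.choices[0].delta, 'content', '') if chunk.choices else ''
        response += piece or ''
        # strip code fences and the "markdown" tag so the reply renders as Markdown
        response = response.replace("```","").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)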
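
The function above handles two chunk shapes: Ollama chunks carry text in `chunk.message.content`, while OpenAI chunks carry it in `chunk.choices[0].delta.content`. As a rough way to check that dispatch without calling either API, the sketch below pulls the extraction into a hypothetical helper, `chunk_text`, and feeds it stand-in objects built with `SimpleNamespace`. The stand-ins only mimic the attribute layout; they are not real API responses.

```python
from types import SimpleNamespace

def chunk_text(chunk):
    """Return the text carried by one streamed chunk, or '' if it has none."""
    if hasattr(chunk, "message"):
        # Ollama-style chunk
        return getattr(chunk.message, "content", "") or ""
    if not chunk.choices:
        return ""
    # OpenAI-style chunk; delta.content can be None
    return getattr(chunk.choices[0].delta, "content", "") or ""

# Stand-in objects that mimic the two shapes (assumed layouts, for testing only)
ollama_like = SimpleNamespace(message=SimpleNamespace(content="Hello"))
openai_like = SimpleNamespace(
    choices=[SimpleNamespace(delta=SimpleNamespace(content=", world"))]
)
openai_last = SimpleNamespace(
    choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]
)

text = "".join(chunk_text(c) for c in [ollama_like, openai_like, openai_last])
print(text)
```

The `or ""` guard matters because a chunk whose content is `None` would otherwise raise a `TypeError` when appended to a string.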

In [None]:
# Get gpt-4o-mini to answer, with streaming

openai = OpenAI(api_key=api_key_OpenAI)

stream = openai.chat.completions.create(
    model=MODEL_GPT,
    messages=messages,
    stream=True
)

stream_response(stream)

Quantization is a crucial technique in optimizing large language models (LLMs) for efficiency in terms of computational resources and memory usage. By reducing the precision of the numbers used in model weights and activations, quantization enables faster computation and smaller memory footprints without significantly degrading model performance. Here are the three main types of quantization for LLM models:

### 1. **Post-Training Quantization (PTQ)**

**Concept**: 
Post-training quantization is applied to an already trained model. It doesn’t require retraining the model, making it a quick and simple method. PTQ typically converts floating-point weights into lower-bit formats (such as 8-bit integers).

**How It Works**:
- The weights and activations of the model are quantized based on a predefined scale and zero-point.
- Techniques such as **Min-Max Quantization** or **Mean-Std Quantization** are commonly used to determine the scaling factors.
  
**Advantages**:
- Saves time since it doesn’t require retraining.
- Good enough for many applications where extreme precision is not critical.

**Disadvantages**:
- Potential for loss in accuracy if the model is sensitive to the reduced precision of weights and activations.
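
The scale and zero-point idea described above can be sketched in a few lines of plain Python. This is a minimal illustration of min-max affine quantization to unsigned 8-bit integers, not the procedure any particular framework uses. The function names and the toy `weights` list are assumptions made for the example.

```python
def minmax_params(values, num_bits=8):
    """Compute scale and zero-point for unsigned affine quantization."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero stays exactly representable
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax

def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers and clamp them to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # toy stand-in for model weights
scale, zero_point, qmin, qmax = minmax_params(weights)
q = quantize(weights, scale, zero_point, qmin, qmax)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The round-trip error stays within about one quantization step (`scale`), which is the precision loss the disadvantages above refer to.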

### 2. **Quantization-Aware Training (QAT)**

**Concept**: 
Quantization-aware training incorporates quantization noise simulation in the training process itself. This allows the model to learn to adjust to the quantization effects during training, often leading to better performance in the quantized model.

**How It Works**:
- The model is trained with adjustments to weight and activation computations to account for quantization.
- During training, both full-precision and quantized computations are used, allowing the model to learn robust representations.

**Advantages**:
- Generally results in higher accuracy compared to PTQ because the model learns to adapt to the quantization effects.
- Prepares the model specifically for deployment in a quantized form.

**Disadvantages**:
- Longer training times since it may require additional training epochs.
- More complex implementation compared to PTQ.
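
As a toy illustration of training with simulated quantization, the sketch below fits a single weight whose forward pass always uses a quantized copy, while the gradient update is applied to a full-precision "latent" weight. Passing the gradient through the rounding step unchanged is known as the straight-through estimator. It is one common way to do this and is an assumption here, since the text above does not name a specific method. The data, grid step, and learning rate are made up for the example.

```python
def fake_quantize(w, step=0.25):
    """Snap w to the nearest multiple of step (simulated quantization)."""
    return round(w / step) * step

xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.73 * x for x in xs]  # toy data; the true weight 0.73 is not on the grid

def loss(w_q):
    """Mean squared error of the model y = w_q * x on the toy data."""
    return sum((w_q * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0  # latent full-precision weight
initial_loss = loss(fake_quantize(w))
for _ in range(200):
    w_q = fake_quantize(w)  # forward pass uses the quantized weight
    grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * grad  # straight-through: treat d(w_q)/d(w) as 1

final_w_q = fake_quantize(w)
final_loss = loss(final_w_q)
print(final_w_q, final_loss < initial_loss)
```

In this toy setup the latent weight can hover near a rounding boundary, so the quantized value may flip between neighboring grid points from step to step.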

### 3. **Dynamic Quantization**

**Concept**: 
Dynamic quantization converts the weights to lower precision ahead of inference and quantizes the activations on the fly as they are computed. The model is not altered during training.

**How It Works**:
- The model weights are usually quantized to a lower bit-width before inference, but the activations are quantized dynamically as they are processed.
  
**Advantages**:
- Can be applied easily with little overhead.
- Provides a good balance between model performance and computational efficiency.

**Disadvantages**:
- May still consume considerable resources for certain operations and doesn't fully optimize the memory usage of model activations.
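
The split between weights quantized ahead of time and activations quantized per input can be sketched as follows. This is a simplified pure-Python illustration using symmetric 8-bit quantization and a single dot product. The helper names and values are assumptions, and real frameworks do this per layer with optimized integer kernels.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1
    return (max(abs(v) for v in values) / qmax) or 1.0

def to_int(values, scale, num_bits=8):
    """Round to integers and clamp to the symmetric signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

# Done once, before inference: quantize the weights
weights = [0.8, -0.3, 0.05, -1.2]
w_scale = symmetric_scale(weights)
w_int = to_int(weights, w_scale)

def dynamic_dot(activations):
    """Quantize activations at call time, then take an integer dot product."""
    a_scale = symmetric_scale(activations)  # computed per input, on the fly
    a_int = to_int(activations, a_scale)
    acc = sum(w * a for w, a in zip(w_int, a_int))  # integer accumulation
    return acc * w_scale * a_scale  # rescale the result to floating point

x = [1.0, 2.0, -0.5, 0.25]
exact = sum(w * a for w, a in zip(weights, x))
approx = dynamic_dot(x)
print(exact, approx)
```

Because the activation scale is recomputed for every input, no calibration data is needed, at the cost of a little extra work on each call.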

### Summary of Differences

| Feature                   | Post-Training Quantization | Quantization-Aware Training | Dynamic Quantization |
|---------------------------|----------------------------|-----------------------------|----------------------|
| Training Requirement       | No retraining              | Requires retraining         | No retraining        |
| Accuracy Level             | Moderate                   | High                        | Moderate             |
| Implementation Complexity   | Low                        | High                        | Moderate             |
| Inference Speed            | Fast                       | Varies                      | Fast                 |

### Conclusion

Each type of quantization has its strengths and weaknesses, and the choice of which one to use can depend on the specific application, performance requirements, and acceptable trade-offs in model accuracy. As a student in AI engineering, you should deepen your understanding of these concepts and consider practical implementation examples to solidify your learning.

In [21]:
# Get Llama 3.2 to answer
stream = ollama.chat(
    model=MODEL_LLAMA,
    messages=messages,
    stream=True
)

stream_response(stream)

**Quantization in Large Language Models (LLMs)**
=====================================================

Quantization is the process of reducing the precision of model weights and activations to improve inference speed and memory efficiency. In the context of Large Language Models (LLMs), quantization is crucial for achieving significant reductions in computational resources and energy consumption.

### 1. **Integer Quantization**

In integer quantization, both model weights and activations are represented using integers (typically 8-bit or 16-bit). This type of quantization provides a trade-off between precision and speed.

**Pros:**

* Fast inference speeds due to reduced floating-point operations
* Reduced memory requirements

**Cons:**

* Lower accuracy compared to float32 (single-precision) models
* Can suffer from loss of information, particularly for models with high computational complexity
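
To make the memory saving concrete, here is a back-of-the-envelope calculation of weight storage at different bit widths. The 3-billion-parameter count is a hypothetical figure chosen for illustration, and the estimate ignores activations, caches, and any per-tensor metadata.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 3_000_000_000  # hypothetical 3B-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(num_params, bits):.1f} GB")
```

Halving the bit width halves the weight storage, which is why 8-bit weights need roughly a quarter of the memory of float32 weights.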

### 2. **Binary Quantization**

In binary quantization, model weights and activations are represented using binary values (typically ±1). This type of quantization is more aggressive than integer quantization but still provides significant speed improvements.

**Pros:**

* Even faster inference speeds compared to integer quantization
* Reduced memory requirements

**Cons:**

* Lower accuracy due to reduced representational capacity
* Can suffer from unstable training, because the sign function used to binarize values has zero gradient almost everywhere
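
A minimal sketch of binarization: each weight is replaced by its sign, and a single scaling factor restores the overall magnitude. The scaling factor (the mean absolute weight) is an assumption added for this example. The text above only mentions the ±1 values.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

weights = [0.4, -0.2, 0.1, -0.7]  # toy stand-in for model weights
alpha, signs = binarize(weights)
approx = [alpha * s for s in signs]
print(signs, approx)
```

Only the sign of each weight survives, so `0.1` and `0.4` become the same value. That is the loss of representational capacity listed among the cons.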

### 3. **Mixed Precision Quantization**

In mixed precision quantization, model weights are represented using a combination of integer and binary values, while activations are represented using floating-point numbers (typically float32). This type of quantization aims to balance accuracy and speed.

**Pros:**

* Offers a trade-off between precision and speed
* Can achieve better accuracy compared to pure integer or binary quantization

**Cons:**

* Requires more complex model design and optimization
* May require additional computational resources for activation computations

**Comparison of Quantization Types**
------------------------------------

| Quantization Type | Model Weights | Activations | Pros | Cons |
| --- | --- | --- | --- | --- |
| Integer | Int8/Int16 | Int8/Int16 | Fast inference, reduced memory | Lower accuracy |
| Binary | ±1 (1-bit) | ±1 (1-bit) | Fastest inference, reduced memory | Lowest accuracy |
| Mixed Precision | Int8/Int16/Binary | Float32 | Balances precision and speed | Requires complex model design |

In summary, the choice of quantization type depends on the specific use case and trade-off between accuracy and speed. Integer and binary quantization are more aggressive but provide faster inference speeds, while mixed precision quantization offers a balance between precision and speed.

**Implementation Considerations**
---------------------------------

When implementing quantization in an LLM model, consider the following:

* Use existing frameworks (e.g., TensorFlow Lite, PyTorch Quantization) to simplify the process
* Choose an appropriate quantization scheme based on the specific use case and performance requirements
* Monitor accuracy and adjust the quantization parameters as needed
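
One simple way to monitor the effect of a quantization setting is to measure the round-trip error at several bit widths before committing to one. The sketch below does this for symmetric quantization on a toy list of values. The helper name and data are assumptions, and in practice you would also evaluate task accuracy on held-out data rather than rely on weight error alone.

```python
def roundtrip_error(values, num_bits):
    """Mean absolute error after symmetric quantize/dequantize at num_bits."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    restored = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

values = [0.013 * i - 0.4 for i in range(64)]  # toy stand-in for a weight tensor
errors = {bits: roundtrip_error(values, bits) for bits in (4, 8, 16)}
for bits, err in errors.items():
    print(f"{bits:>2}-bit mean abs error: {err:.6f}")
```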