In [None]:
#################################################################
#                                                               #
#  CS435 Generative AI: Security, Ethics and Governance         #
#                                                               #
#  Instructor: Dr. Adnan Masood                                 #
#  Contact:    adnanmasood@gmail.com                            #
#                                                               #
#  Notebook is MIT Licensed                                     #
#################################################################


# Quantization, LoRA, and QLoRA Explained in a Jupyter Notebook

Welcome! In this notebook, we’ll explore **Quantization**, **LoRA**, and **QLoRA** in detail. We will begin with an intuitive, simple explanation, and progressively dive deeper into the technical and mathematical aspects. This notebook is structured to provide a multi-level explanation: from a very simple level to a very advanced level. We'll go step-by-step with examples and code. By the end of this notebook, you’ll have a thorough understanding of how these techniques work, why they matter, and how to apply them.


# Table of Contents
1. [Building an Intuitive Understanding](#five-levels)
2. [Intuition Behind Quantization, LoRA, and QLoRA](#intuition)
3. [Brief History and Underlying Technology](#history)
4. [Math Behind the Techniques](#math)
5. [Illustrative Example with Code](#example_code)
6. [Example Calculations](#mock_calcs)
7. [Step-by-Step Example from Scratch](#step_by_step)
8. [Illustrative Problem It Solves](#illustrative_problem)
9. [Real-World Problems Solved](#realworld_problem)
10. [How to Solve a Real-World Problem Using This Tech](#use_tech)
11. [What Other Questions Can You Ask? (Points to Ponder)](#questions)
12. [Answers to the Questions (with Code Examples)](#answers)
13. [A Sample Exercise](#easy_sample)
14. [Glossary](#glossary)


<a id="five-levels"></a>
# 1. Building an Intuitive Understanding
Below is a progression of explanations for **Quantization, LoRA, and QLoRA**, each adding more detail but focusing on the same core ideas:

**Quantization**: Imagine you have a big box of colored pencils with many shades. Quantization means you're picking fewer shades to color your picture—this makes your box smaller and easier to carry, but you can still draw almost the same picture.

**LoRA**: Think of it like having a big puzzle, but you only add a small set of new pieces to update your puzzle’s picture. This is much easier than remaking the whole puzzle from scratch.

**QLoRA**: Combine the idea of picking fewer shades (Quantization) **and** adding small puzzle pieces (LoRA) together to make learning and using a big AI model faster and cheaper.

When dealing with neural networks, we store numbers (weights) in computers. Normally, we use full 32-bit numbers (floats). Quantization shrinks them to fewer bits (like 8-bit or 4-bit) to use less memory and compute faster.

LoRA (Low-Rank Adaptation) is a method to update a large model with fewer parameters. Instead of changing all the model’s weights, we keep them frozen and learn a small update expressed as the product of two much smaller (low-rank) matrices. Only those two matrices are trained, saving memory and training time.

QLoRA is a technique that uses 4-bit quantization for a large model while applying LoRA for fine-tuning. It drastically cuts down memory usage, which makes fine-tuning and serving large models feasible on smaller GPUs.

Neural networks rely on large weight matrices. Using high-precision floating-point data can be overkill. Quantization maps these continuous values to discrete sets of integers in fewer bits, e.g., using 8-bit or 4-bit integers instead of 32-bit floats. This requires a scale and an offset, which are usually shared by a whole tensor, channel, or block of weights rather than stored per weight.

LoRA decouples the main model weights from the update process. Instead of fine-tuning every single weight, we introduce low-dimensional matrices (A and B). We only learn these matrices during training, which drastically reduces the number of trainable parameters.

QLoRA combines 4-bit quantization (to store and run the large base model more efficiently) with LoRA to fine-tune it on new data. It offers the benefit of minimal hardware requirements alongside parameter-efficient fine-tuning.

We can view quantization as approximating the weight vector space with a smaller set of possible values. For instance, 4-bit quantization defines 2^4 = 16 possible levels. Each weight is mapped to a scaled integer representation, e.g., $w_{quant} = \text{round}((w - \alpha)/\delta)$ where $\alpha$ is an offset and $\delta$ is the step size.

If a weight matrix $W$ has dimension $d \times d$, LoRA factorizes the update into $A$ and $B$, with dimensions $d \times r$ and $r \times d$ respectively, giving the low-rank update $\Delta W = A B$. We only train $A$ and $B$ ($2dr$ values instead of $d^2$), keeping the base $W$ fixed, thereby drastically reducing the parameter count.
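
To make the savings concrete, the short sketch below counts trainable parameters for one square weight matrix under full fine-tuning versus a rank-$r$ update. The values `d = 4096` and `r = 8` are illustrative assumptions, not taken from any specific model.

```python
# Trainable-parameter count: full fine-tuning vs. a rank-r LoRA update.
# d and r are illustrative assumptions, not values from a specific model.
d = 4096   # width of a square weight matrix W (d x d)
r = 8      # LoRA rank

full_params = d * d          # every entry of W is trainable
lora_params = d * r + r * d  # A is d x r, B is r x d

print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (r={r}):      {lora_params:,} parameters")
print(f"Reduction factor: {full_params / lora_params:.0f}x")
```

For a square matrix the reduction factor is $d / (2r)$, so it grows as the layer gets wider or the rank gets smaller.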

With QLoRA, the base weights are quantized to 4 bits (using specific algorithms designed for large language models). The LoRA low-rank matrices remain in higher precision. This effectively combines memory savings of quantization with the parameter efficiency of LoRA.

In large-scale LLMs, advanced quantization strategies might be used, such as per-channel scaling or block-wise quantization. There's ongoing research on the representational capacity of low-rank updates, how the rank $r$ determines the model’s flexibility, and how quantization error interacts with low-rank decomposition. QLoRA enables fine-tuning multi-billion-parameter models on GPUs with limited memory.
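
As a rough illustration of why per-channel (or block-wise) scaling helps, the sketch below quantizes a matrix whose rows have very different magnitudes, once with a single scale for the whole tensor and once with one scale per row. It uses symmetric absmax rounding to 4-bit signed integers as a simple stand-in. The helper names `absmax_roundtrip` and `relative_error` and the synthetic data are assumptions for this demo, not the scheme of any particular library.

```python
import numpy as np

def absmax_roundtrip(x, num_bits=4):
    """Symmetric quantize/dequantize: scale by max |x|, round to signed ints."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for 4 bits
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# 8 output channels (rows) whose typical magnitudes differ by up to 100x
row_scales = np.logspace(-2, 0, num=8).reshape(-1, 1)
W = rng.normal(size=(8, 256)) * row_scales

per_tensor = absmax_roundtrip(W)                              # one scale overall
per_channel = np.stack([absmax_roundtrip(row) for row in W])  # one scale per row

def relative_error(approx):
    """Per-row reconstruction error relative to the row's own norm."""
    return np.linalg.norm(W - approx, axis=1) / np.linalg.norm(W, axis=1)

err_tensor = relative_error(per_tensor)
err_channel = relative_error(per_channel)
for i in range(W.shape[0]):
    print(f"row {i}: per-tensor {err_tensor[i]:.3f} | per-channel {err_channel[i]:.3f}")
```

With a single scale, the small-magnitude rows all round to zero, while a per-row scale keeps every row at a similar relative error. Block-wise quantization applies the same idea to fixed-size chunks of the flattened tensor.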


<a id="intuition"></a>
# 2. Intuition Behind Quantization, LoRA, and QLoRA
**Intuition**:
- **Quantization**: We reduce the precision of numbers. Think of it like rounding prices to the nearest dollar instead of using pennies. You lose some fine detail, but you speed up calculations and reduce storage.
- **LoRA**: Instead of rewriting the entire book (model weights), you add notes in the margins that reference which parts of the book need updating. This is more efficient.
- **QLoRA**: It’s like having a short summary (4-bit quantization) of the big book plus those margin notes (LoRA) combined. You can store everything in a smaller space and still adapt it for new tasks.


<a id="history"></a>
# 3. Brief History and Underlying Technology
- **Quantization**: Has been around in digital signal processing for decades. Became popular in neural networks to speed up inference on edge devices and reduce memory.
- **LoRA**: Introduced by Hu et al. (2021) as a parameter-efficient fine-tuning technique, in response to the exploding size of large language models (LLMs). The original paper demonstrated significant savings in GPU memory and training speed.
- **QLoRA**: Proposed by Dettmers et al. (2023) to combine 4-bit quantization of the base LLM weights with LoRA for fine-tuning. It reduces the hardware (GPU) cost and memory usage, enabling LLM tuning on more modest systems.


<a id="math"></a>
# 4. Math Behind the Techniques
### 4.1 Quantization
- **Simple equation**: If $w$ is a real-valued weight,
$$
w_{q} = \text{round}\left(\frac{w - \alpha}{\delta}\right),
$$
where $w_{q}$ is the quantized integer, $\alpha$ is a zero point (offset), and $\delta$ is the scale. The real value is approximated by:
$$
w_{approx} = w_q \times \delta + \alpha.
$$
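
As a short worked consequence of these two equations (assuming $w$ lies inside the quantization range, so no clipping occurs): rounding moves $(w - \alpha)/\delta$ by at most $\tfrac{1}{2}$, which bounds the reconstruction error by half a step:
$$
\left| w - w_{approx} \right| = \delta \left| \frac{w - \alpha}{\delta} - w_q \right| \le \frac{\delta}{2}.
$$
With $b$ bits spread over a range $[w_{\min}, w_{\max}]$ (symbols introduced here for illustration), the step is $\delta = (w_{\max} - w_{\min})/(2^b - 1)$, so each extra bit roughly halves the worst-case error.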

### 4.2 LoRA
- A weight matrix $W$ is large. We keep it **frozen** and add a small update $\Delta W$ that can be written as the product $A B$:
$$
W_{\text{new}} = W + A B,
$$
where $A$ and $B$ are much smaller matrices with dimensions $d \times r$ and $r \times d$, respectively, so the update has rank at most $r$.
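
A minimal sketch of this idea as a PyTorch module follows, using the shapes given above ($A$ is $d \times r$, $B$ is $r \times d$, so the update is `A @ B` in code). The class name `LoRALinear`, the sizes, and the initialization scale are assumptions for illustration. Starting one factor at zero, so the update is initially zero, follows common LoRA practice.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x (W + A B)^T with W frozen and only A, B trainable (illustrative)."""
    def __init__(self, d, r):
        super().__init__()
        # Frozen base weight: excluded from gradient updates
        self.W = nn.Parameter(torch.randn(d, d), requires_grad=False)
        # Low-rank factors; A starts at zero so the update A @ B is zero at first
        self.A = nn.Parameter(torch.zeros(d, r))
        self.B = nn.Parameter(torch.randn(r, d) * 0.01)

    def forward(self, x):
        # Base path plus low-rank path; W + A @ B is never materialized
        return x @ self.W.T + (x @ self.B.T) @ self.A.T

torch.manual_seed(0)
layer = LoRALinear(d=64, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable: {trainable} of {total} parameters")

# One backward pass: gradients reach A and B only
x = torch.randn(8, 64)
layer(x).sum().backward()
print("W.grad is None:", layer.W.grad is None)
print("A.grad shape:", tuple(layer.A.grad.shape))
```

Because `W` never receives a gradient, an optimizer built from the trainable parameters only has to track state for `A` and `B`.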

### 4.3 QLoRA
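- **4-bit Quantization of the base weights**: In practice, this is done with a library such as [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), which implements the 4-bit NormalFloat (NF4) data type introduced with QLoRA. [GPTQ](https://arxiv.org/abs/2210.17323) is an alternative 4-bit quantization method.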
- **LoRA applied on top**: The original weights are stored in 4-bit precision, while the LoRA matrices remain in higher precision. We can train just the LoRA part, combining memory efficiency with effective fine-tuning.
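
The two bullets above can be sketched in a few lines: the base weight is stored as small integer codes and dequantized on the fly, while the LoRA factors stay in floating point. This is only a conceptual sketch. It uses plain min-max quantization as a stand-in (real QLoRA uses NF4), and the function name `qlora_forward` and all sizes are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
d, r = 32, 4

# Random stand-in for a pretrained weight matrix
W = torch.randn(d, d)

# Base weights: min-max quantized to 16 levels and stored as integer codes.
# (A simple stand-in; real QLoRA uses the NF4 data type, not min-max.)
w_min = W.min()
scale = (W.max() - w_min) / 15
codes = torch.round((W - w_min) / scale).to(torch.uint8)

# LoRA factors: kept in float32
A = torch.zeros(d, r)
B = 0.01 * torch.randn(r, d)

def qlora_forward(x):
    W_hat = codes.float() * scale + w_min      # dequantize on the fly
    return x @ W_hat.T + (x @ B.T) @ A.T       # frozen base path + low-rank path

x = torch.randn(5, d)
y = qlora_forward(x)
max_weight_err = float((W - (codes.float() * scale + w_min)).abs().max())

print("codes:", codes.dtype, "values", int(codes.min()), "to", int(codes.max()))
print("A, B :", A.dtype)
print("max |W - W_hat| =", round(max_weight_err, 4), "(step size", round(float(scale), 4), ")")
```

Here each code still occupies a full byte. Storing two 4-bit codes per byte would halve the footprint again.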


<a id="example_code"></a>
# 5. Illustrative Example with Code
In this section, we’ll do a very simplified demonstration of how quantization might work on a small tensor, then how LoRA can be conceptually applied in PyTorch.


In [None]:
# Let's do a toy example of quantization in Python
import torch

def simple_quantize(tensor, num_bits=4):
    # For demonstration, we do a naive min-max scaling
    # Determine min and max
    t_min = tensor.min()
    t_max = tensor.max()

    # Number of levels
    levels = 2 ** num_bits

    # Scale and zero-point
    scale = (t_max - t_min) / (levels - 1)
    zero_point = t_min

    # Quantize: integer codes in [0, levels - 1], kept in a float tensor for simplicity
    quantized = torch.round((tensor - zero_point) / scale)

    return quantized, scale, zero_point

def simple_dequantize(quantized, scale, zero_point):
    return quantized * scale + zero_point

# Example tensor
data = torch.tensor([0.1, 0.2, 0.25, 0.75, 1.0, 1.1], dtype=torch.float32)
print("Original Data:", data)

q_data, scale, zp = simple_quantize(data, num_bits=4)
dq_data = simple_dequantize(q_data, scale, zp)

print("\nQuantized Data (4-bit):", q_data)
print("Scale:", scale, "Zero Point:", zp)
print("Dequantized Data:", dq_data)

Original Data: tensor([0.1000, 0.2000, 0.2500, 0.7500, 1.0000, 1.1000])

Quantized Data (4-bit): tensor([ 0.,  2.,  2., 10., 13., 15.])
Scale: tensor(0.0667) Zero Point: tensor(0.1000)
Dequantized Data: tensor([0.1000, 0.2333, 0.2333, 0.7667, 0.9667, 1.1000])


In the code above, we:
1. **Compute** the range of the data ($t_\text{min}$ and $t_\text{max}$).
2. **Derive** the scale and zero-point for a given bit precision (e.g., 4 bits gives 16 levels).
3. **Round** the data to these discrete levels.
4. **Dequantize** by reversing the process.

This is a simplistic demonstration of what quantization does.
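
One detail the toy code glosses over: the quantized values are still held in a float32 tensor, so no memory is actually saved yet. The sketch below shows one way to store the 4-bit levels printed above compactly, by packing two codes into each byte. The helper names `pack_nibbles` and `unpack_nibbles` are hypothetical and written only for this example.

```python
# Pack pairs of 4-bit codes (0..15) into single bytes, then unpack them.
codes = [0, 2, 2, 10, 13, 15]      # the 4-bit levels printed above

def pack_nibbles(values):
    """Two 4-bit codes per byte: first code in the high nibble."""
    if len(values) % 2:
        values = values + [0]       # pad to an even length
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))

def unpack_nibbles(packed, count):
    """Recover the first `count` codes from the packed bytes."""
    out = []
    for byte in packed:
        out.extend([byte >> 4, byte & 0x0F])
    return out[:count]

packed = pack_nibbles(codes)
restored = unpack_nibbles(packed, len(codes))
print("Packed bytes:", packed.hex())
print("Restored    :", restored)
print(f"{len(codes) * 4} bytes as float32 -> {len(packed)} bytes packed")
```

The scale and zero point still have to be stored alongside the packed codes, but that cost is shared across all the values they cover.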


### LoRA Conceptual Example
Let’s do a toy example of applying a low-rank update to a matrix in PyTorch. Note: This is a contrived example just to illustrate the idea of $W + AB$.

In [None]:
import torch

d = 6  # dimension
r = 2  # rank

# Original weight matrix W
W = torch.randn(d, d)

# LoRA matrices
A = torch.randn(d, r)
B = torch.randn(r, d)

# Low-rank update Delta W = A.mm(B)
Delta_W = A.mm(B)

# New weight
W_new = W + Delta_W

print("Original W:\n", W)
print("\nLoRA update Delta_W = A*B:\n", Delta_W)
print("\nW_new = W + Delta_W:\n", W_new)


Original W:
 tensor([[ 1.1697, -0.5668, -0.5548, -0.0254,  1.2123, -0.7499],
        [-0.2529, -1.4790, -1.9017, -0.0548,  0.2024, -0.1532],
        [-0.2562, -2.6487, -1.3980,  0.9825, -0.3399, -0.3712],
        [-0.1242, -1.4526,  1.0699, -0.8227, -0.7459, -1.5155],
        [ 0.9345,  0.0629,  0.8952, -1.8664,  0.3253, -0.3888],
        [ 0.5772,  0.4870, -0.0182, -1.8467,  0.2777, -1.0933]])

LoRA update Delta_W = A*B:
 tensor([[ 0.4216, -2.2879,  0.2140,  1.3414, -3.0245,  0.0369],
        [-1.2779, -3.8631,  0.9686,  1.3902, -3.5345, -0.6987],
        [-1.8092, -3.7664,  1.1162,  1.1078, -3.0010, -0.8966],
        [ 1.3620,  0.3973, -0.4752,  0.3980, -0.6089,  0.5425],
        [ 1.8465,  2.4369, -0.9285, -0.4196,  1.4075,  0.8386],
        [ 1.7876,  2.6547, -0.9431, -0.5556,  1.7103,  0.8280]])

W_new = W + Delta_W:
 tensor([[ 1.5912, -2.8548, -0.3408,  1.3161, -1.8122, -0.7130],
        [-1.5308, -5.3421, -0.9332,  1.3355, -3.3321, -0.8519],
        [-2.0653, -6.4151, -0.2818,  

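Here, we are simulating the **LoRA** technique with random $A$ and $B$. In an actual training scenario, $W$ (the base model weights) would be frozen, and only $A$ and $B$ would be updated during backpropagation, giving the effective weight $W + AB$. One of the two factors is typically initialized to zero so that training starts from the unmodified base model.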
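
A small check that may help build intuition: the low-rank update does not have to be added into $W$ at all. The sketch below, which reuses the toy sizes from the cell above with fresh random values, compares the merged form against applying the adapter as a separate path.

```python
import torch

torch.manual_seed(0)
d, r = 6, 2
W = torch.randn(d, d)
A = torch.randn(d, r)
B = torch.randn(r, d)
x = torch.randn(d)

# Merged: build the full d x d matrix, then multiply
y_merged = (W + A @ B) @ x

# Unmerged: keep the adapter separate; B @ x is only r numbers
y_separate = W @ x + A @ (B @ x)

print("Max difference:", float((y_merged - y_separate).abs().max()))
```

The two results agree up to floating-point rounding. Keeping the adapter separate is what lets the base weights stay frozen (or quantized) during training, while merging is an option when a single plain weight matrix is preferred at inference time.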


<a id="mock_calcs"></a>
# 6. Example Calculations
Let’s do a small conceptual calculation:
- Suppose you have a single weight $w = 0.78$ in float32.
- You decide to quantize to 2 bits (4 levels: $\{0, 1, 2, 3\}$).
- If your range is [0, 1], then scale $\delta = (1 - 0)/3 = 0.3333$, zero-point $\alpha = 0$.
- Quantized value $w_q = \text{round}((0.78 - 0)/0.3333) = \text{round}(2.34) = 2$.
- Dequantized $w_{approx} = 2 \times 0.3333 = 0.6667$.

Here, you see that you lost some precision (went from 0.78 to ~0.67), but you gained efficiency.
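
The same arithmetic can be reproduced in a few lines of plain Python. This is just a sketch of the calculation above, using the same range, bit width, and weight.

```python
# Reproduce the 2-bit calculation above for w = 0.78 on the range [0, 1].
w = 0.78
num_bits = 2
lo, hi = 0.0, 1.0

levels = 2 ** num_bits                 # 4 integer codes: 0, 1, 2, 3
delta = (hi - lo) / (levels - 1)       # step size, 1/3
alpha = lo                             # offset

w_q = round((w - alpha) / delta)       # nearest integer code
w_approx = w_q * delta + alpha         # value recovered after dequantization

grid = [round(k * delta + alpha, 4) for k in range(levels)]
print("Representable values:", grid)
print(f"w_q = {w_q}, w_approx = {w_approx:.4f}, error = {abs(w - w_approx):.4f}")
```

Listing the grid makes the trade-off visible: with only four representable values, 0.78 has to snap to either 0.6667 or 1.0.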

### Terms:
- **Weight**: The parameter in the neural network.
- **Bias**: An additional parameter added to the weighted sum before passing through an activation.
- **Rank**: The dimension of the subspace for the LoRA update.
- **Scale & Zero Point**: Used in quantization to map continuous ranges to discrete integer levels.


<a id="step_by_step"></a>
# 7. Step-by-Step Example from Scratch
We’ll build a **tiny** neural network and do the following:
1. **Initialize** the network.
2. **Quantize** its parameters.
3. **Apply** a LoRA-like update.

Note: This is a conceptual demonstration and not a full-blown training session.


In [None]:
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(TinyNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Step 1: Initialize network
model = TinyNet(input_dim=4, hidden_dim=4, output_dim=2)
print("Original model parameters (fc1 weight):\n", model.fc1.weight)

# Step 2: Quantize the fc1 weight
fc1_weight = model.fc1.weight.data
q_fc1_weight, scale, zp = simple_quantize(fc1_weight, num_bits=4)
model.fc1.weight.data = simple_dequantize(q_fc1_weight, scale, zp)

print("\nQuantized + Dequantized model parameters (fc1 weight):\n", model.fc1.weight)

# Step 3: Apply a LoRA-like update
r = 2
A = torch.randn(model.fc1.weight.data.shape[0], r)
B = torch.randn(r, model.fc1.weight.data.shape[1])
Delta = A.mm(B)
model.fc1.weight.data += Delta
print("\nAfter LoRA-like update (fc1 weight):\n", model.fc1.weight)


Original model parameters (fc1 weight):
 Parameter containing:
tensor([[-0.3707,  0.0318, -0.4941,  0.1818],
        [-0.3660, -0.2462,  0.2597,  0.4726],
        [-0.1961,  0.4204, -0.1074,  0.0015],
        [ 0.4629,  0.2084,  0.0776, -0.1198]], requires_grad=True)

Quantized + Dequantized model parameters (fc1 weight):
 Parameter containing:
tensor([[-0.3652,  0.0215, -0.4941,  0.1504],
        [-0.3652, -0.2363,  0.2793,  0.4726],
        [-0.1718,  0.4082, -0.1074,  0.0215],
        [ 0.4726,  0.2148,  0.0859, -0.1074]], requires_grad=True)

After LoRA-like update (fc1 weight):
 Parameter containing:
tensor([[ 1.0648,  0.5687,  0.0070,  0.5903],
        [ 1.0167,  0.7033,  1.0627,  1.6390],
        [-1.4179, -0.3033, -0.7150, -0.7853],
        [-1.1519, -0.2765, -0.3883, -0.3721]], requires_grad=True)


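We first **initialized** a tiny network. Then we **quantized** its `fc1` weight to 4 bits and immediately **dequantized** it, writing the result back into the model. This simulates the rounding error of 4-bit storage, although the tensor itself is still held as 32-bit floats. Finally, we added a random LoRA-like low-rank update to illustrate how a quantized model can be adapted. In real QLoRA training, the adapter is kept as separate trainable matrices rather than added into the stored weights.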


<a id="illustrative_problem"></a>
# 8. Illustrative Problem It Solves
For instance, if you want to run a **Large Language Model** on a **resource-limited device** (like a smaller GPU or edge device), you can’t afford storing massive 32-bit floats for billions of parameters. By using 4-bit quantization, you drastically reduce memory.

But you also want to **adapt** or fine-tune the model for a new language or domain. Full fine-tuning of billions of parameters is also expensive. Instead, you apply **LoRA** (only a small set of trainable parameters), and combine that with the quantized model. This is **QLoRA**, which solves the problem of running and updating large models cheaply.
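
For a sense of scale, the sketch below estimates how much memory the weights alone occupy at different precisions. The 7-billion-parameter size is an illustrative assumption, and the estimate ignores quantization constants, activations, and optimizer state.

```python
# Approximate weight-storage cost of a model at different precisions.
# 7e9 parameters is an illustrative size; overheads such as quantization
# constants, activations, and optimizer state are ignored.
num_params = 7_000_000_000

def weight_gib(params, bits):
    """Storage for `params` weights at `bits` bits each, in GiB."""
    return params * bits / 8 / 2**30

footprint = {bits: weight_gib(num_params, bits) for bits in (32, 16, 8, 4)}
for bits, gib in footprint.items():
    print(f"{bits:>2}-bit weights: {gib:6.2f} GiB")
```

Going from 32-bit to 4-bit storage shrinks the weights by a factor of 8, which is often the difference between a model fitting on a single consumer GPU or not.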


<a id="realworld_problem"></a>
# 9. Real-World Problems Solved
1. **Domain adaptation**: Fine-tune a large model on a specialized domain (e.g., medical data) using LoRA, while storing the base model in 4-bit.
2. **Edge deployment**: Deploy on embedded devices or small data centers with limited GPU memory.
3. **Cost reduction**: Lower hardware requirements mean cheaper training and inference.


<a id="use_tech"></a>
# 10. How to Solve a Real-World Problem Using This Tech
1. **Pick a big pretrained model** (like a 7B or 13B parameter language model).
2. **Quantize** the model using a 4-bit or 8-bit quantization tool.
3. **Apply LoRA**: Freeze the base weights and add low-rank adapters.
4. **Train** the LoRA parameters with a small learning rate on your domain data.
5. **Deploy**: Use the quantized model + LoRA parameters in production. Only the small LoRA matrices need to be kept in higher precision. Everything else stays compressed.


<a id="questions"></a>
# 11. What Other Questions Can You Ask? (Points to Ponder)
1. *How do we choose the right number of bits for quantization?*
2. *How do we select the rank $r$ for LoRA?*
3. *When does quantization error significantly degrade performance?*
4. *How does outlier-aware quantization improve results?*
5. *How do we handle gradient updates in QLoRA for the quantized weights?*


<a id="answers"></a>
# 12. Answers to the Questions (with Code Examples)
### 1. How do we choose the right number of bits for quantization?
- Typically, we try 8-bit or 4-bit because they balance memory savings and accuracy. 2-bit might be too aggressive. The best way is to **experiment** on a validation set.
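
As a starting point for such an experiment, the sketch below sweeps a few bit widths over synthetic Gaussian "weights" and reports the round-trip error of min-max quantization. The data and the helper name `roundtrip_mse` are assumptions for this demo. In a real project you would compare a task metric on a validation set rather than weight error alone.

```python
import random

def roundtrip_mse(values, num_bits):
    """Min-max quantize, dequantize, and return the mean squared error."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** num_bits - 1)
    approx = [round((v - lo) / scale) * scale + lo for v in values]
    return sum((v - a) ** 2 for v, a in zip(values, approx)) / len(values)

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(10_000)]  # stand-in weights

errors = {bits: roundtrip_mse(weights, bits) for bits in (2, 3, 4, 8)}
for bits, mse in errors.items():
    print(f"{bits}-bit: MSE = {mse:.2e}")
```

The error drops quickly as bits are added, which is why the interesting question is usually where the task metric stops improving, not where the weight error reaches zero.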

### 2. How do we select the rank $r$ for LoRA?
- This depends on how complex the new task is. A higher rank means more parameters to learn, but better capacity. A typical range is $r \in [4, 64]$ for LLM fine-tuning.
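
One way to build intuition for the capacity side of this trade-off is to ask how well a rank-$r$ matrix can approximate a given update at all. The sketch below constructs a hypothetical "ideal" update of rank 8 plus noise and measures the best possible rank-$r$ approximation using a truncated SVD. This only illustrates capacity. LoRA itself learns $A$ and $B$ by gradient descent, not by SVD, and the sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 8

# Hypothetical "ideal" weight update with rank 8 plus a little noise
target = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
target += 0.01 * rng.normal(size=(d, d))

U, S, Vt = np.linalg.svd(target)
rel_errors = {}
for r in (1, 2, 4, 8, 16):
    # Best rank-r approximation (Eckart-Young): keep the top r singular triplets
    approx = (U[:, :r] * S[:r]) @ Vt[:r, :]
    rel_errors[r] = np.linalg.norm(target - approx) / np.linalg.norm(target)
    print(f"r={r:>2}: trainable params = {2 * d * r:>5}, relative error = {rel_errors[r]:.4f}")
```

Once $r$ reaches the intrinsic rank of the update, extra rank mostly adds parameters without much reducing the error, which is one reason to start small and increase $r$ only if validation results call for it.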

### 3. When does quantization error significantly degrade performance?
- If your network is very sensitive to small changes in weights or if you quantize to too few bits (2-bit or 3-bit). Large networks tend to be somewhat robust to quantization, but tasks requiring very high precision might suffer.

### 4. How does outlier-aware quantization improve results?
- Some weights (outliers) can be very large, messing up min-max scaling. Outlier-aware quantization uses separate scaling for outliers or specialized algorithms to reduce the impact of these extreme values.
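
A minimal sketch of the "separate handling for outliers" idea follows: the few largest-magnitude weights are kept exact and only the rest are min-max quantized. The injected outliers, the choice `k = 3`, and the helper name `minmax_roundtrip` are assumptions for this demo, and real outlier-aware schemes are more elaborate.

```python
import numpy as np

def minmax_roundtrip(x, num_bits=4):
    """Min-max quantize to 2**num_bits levels and dequantize again."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** num_bits - 1)
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1000)
w[:3] = [0.9, -1.1, 1.3]                     # three injected outliers

# Plain min-max: the outliers stretch the range for everyone
plain = minmax_roundtrip(w)

# Outlier-aware: keep the k largest-magnitude weights exact, quantize the rest
k = 3
outlier_idx = np.argsort(np.abs(w))[-k:]
mask = np.ones(w.size, dtype=bool)
mask[outlier_idx] = False
aware = w.copy()
aware[mask] = minmax_roundtrip(w[mask])

mse_plain = float(np.mean((w - plain) ** 2))
mse_aware = float(np.mean((w - aware) ** 2))
print(f"Plain min-max MSE:   {mse_plain:.2e}")
print(f"Outlier-aware MSE:   {mse_aware:.2e}")
```

The cost is a small amount of extra storage for the outlier values and their positions, in exchange for a much finer grid for the bulk of the weights.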

### 5. How do we handle gradient updates in QLoRA for the quantized weights?
- In practice, the gradients are computed in higher precision. We don’t update the quantized weights at all: they remain frozen. During the backward pass, gradients flow through the dequantized weights and are used only to update the LoRA parameters.
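
The sketch below runs two optimizer steps on a toy layer to show this behaviour: the integer codes of the base weight are identical before and after training, while the LoRA factors change. It uses min-max quantization as a stand-in for NF4, and the sizes, learning rate, and random data are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
d, r = 16, 2

# Frozen 4-bit codes for the base weight (min-max stand-in, not NF4)
W = torch.randn(d, d)
lo = W.min()
scale = (W.max() - lo) / 15
codes = torch.round((W - lo) / scale).to(torch.uint8)
codes_before = codes.clone()

# Trainable LoRA factors in float32
A = torch.zeros(d, r, requires_grad=True)
B = (0.1 * torch.randn(r, d)).requires_grad_()
optimizer = torch.optim.SGD([A, B], lr=0.1)

x, target = torch.randn(32, d), torch.randn(32, d)
for _ in range(2):
    W_hat = codes.float() * scale + lo              # dequantized, carries no grad
    pred = x @ W_hat.T + (x @ B.T) @ A.T
    loss = torch.nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()                                 # gradients flow to A and B only
    optimizer.step()

print("Codes unchanged:", torch.equal(codes, codes_before))
print("A moved away from zero:", bool(A.abs().sum() > 0))
print("W_hat requires grad:", W_hat.requires_grad)
```

Only `A` and `B` are handed to the optimizer, so the base weights cannot change even in principle, and the optimizer state stays proportional to the adapter size.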


<a id="easy_sample"></a>
# 13. A Sample Exercise
Below is **template** code for the exercise. Each TODO marks a step to implement: quantizing a small linear layer, dequantizing it, and applying a LoRA update. The version shown here already has the TODOs filled in, so you can run it as is, or clear the implementations and try them yourself.


In [None]:
# TODO: Complete and run this code.
import torch
import torch.nn as nn

def my_quantize(tensor, num_bits=4):
    # TODO: Implement a simple min-max quantization
    # 1. Find the min and max of the tensor
    # 2. Compute scale = (max - min) / (levels - 1)
    # 3. Compute zero_point = min
    # 4. Quantize by rounding
    minv = tensor.min()
    maxv = tensor.max()
    scale = (maxv - minv) / (2 ** num_bits - 1)
    zero_point = minv
    quantized = torch.round((tensor - zero_point) / scale)
    return quantized, scale, zero_point

def my_dequantize(quantized_tensor, scale, zero_point):
    # TODO: Implement the reverse process
    return quantized_tensor * scale + zero_point

# Define a simple linear layer
linear = nn.Linear(3, 2, bias=False)

# Print original parameters
print("Original params:", linear.weight)

# TODO: Call my_quantize on linear.weight.data
q_weight, scale, zp = my_quantize(linear.weight.data, num_bits=4)

# TODO: Dequantize
dq_weight = my_dequantize(q_weight, scale, zp)

# TODO: Assign dq_weight back to linear.weight.data
linear.weight.data = dq_weight

# Create LoRA-like update
r = 1
A = torch.randn(linear.weight.size(0), r)
B = torch.randn(r, linear.weight.size(1))
Delta = A.mm(B)

# TODO: Add Delta to linear.weight.data
linear.weight.data += Delta

# Print updated parameters
print("\nUpdated params after LoRA-like update (complete the TODOs first):", linear.weight)


Original params: Parameter containing:
tensor([[ 0.0100, -0.2597, -0.5754],
        [ 0.0595,  0.5132,  0.4081]], requires_grad=True)

Updated params after LoRA-like update (complete the TODOs first): Parameter containing:
tensor([[-0.5791,  0.6882, -1.1573],
        [-0.2583,  1.0730,  0.1060]], requires_grad=True)


By filling in the `TODO`s, you’ll practice implementing your own min-max quantization logic, dequantization, and finally adding a LoRA-like update.


<a id="glossary"></a>
# 14. Glossary
- **Quantization**: Reducing the number of bits used to represent numbers in neural networks.
- **LoRA (Low-Rank Adaptation)**: A method that updates only a small set of parameters through low-rank factorization.
- **QLoRA**: 4-bit quantization + LoRA to fine-tune large models efficiently.
- **Scale**: Factor used to scale floating-point range into a smaller discrete range.
- **Zero Point (Offset)**: A shift added so that the minimum value can map to 0 in quantized integer space.
- **Rank**: The number of columns in matrix A (and rows in matrix B) for LoRA.
- **Parameter-Efficient Fine-Tuning**: Methods that avoid updating all model parameters.
- **Outlier-Aware Quantization**: Specialized approach for dealing with extremely large or small weight values.
- **Min-Max Quantization**: The simplest form of quantization using the min and max of the data.
- **bitsandbytes**: A popular library for 8-bit and 4-bit optimization and quantization.


## Conclusion
With these examples and explanations, you should now have a comprehensive understanding of **Quantization**, **LoRA**, and **QLoRA**, along with their theory, mathematics, and practical implementation details. Happy exploring and coding!

In [None]:
import os, sys, platform, datetime, uuid, socket

def signoff(name="Anonymous"):
    colab_check = "Yes" if 'google.colab' in sys.modules else "No"
    mac_addr = ':'.join(format((uuid.getnode() >> i) & 0xff, '02x') for i in reversed(range(0, 48, 8)))
    print("+++ Acknowledgement +++")
    print(f"Executed on: {datetime.datetime.now()}")
    print(f"In Google Colab: {colab_check}")
    print(f"System info: {platform.system()} {platform.release()}")
    print(f"Node name: {platform.node()}")
    print(f"MAC address: {mac_addr}")
    try:
        print(f"IP address: {socket.gethostbyname(socket.gethostname())}")
    except OSError:  # hostname lookup can fail, e.g. without network access
        print("IP address: Unknown")
    print(f"Signing off, {name}")

signoff("Kushal Chandani")


+++ Acknowledgement +++
Executed on: 2025-02-02 18:06:16.160094
In Google Colab: Yes
System info: Linux 6.1.85+
Node name: f134a40a4502
MAC address: 02:42:ac:1c:00:0c
IP address: 172.28.0.12
Signing off, Kushal Chandani
