# Pretraining Optimizations
The pretraining step involves the largest amount of data and is shaped by architectural aspects of the model: its size (parameters), shape (width and depth), and so on.
This notebook covers optimization techniques focussed on the pretraining step.

We will cover:
- Different Floating Point Representations/Formats
- Quantization of Floats
- Post-Training Quantization of Models:
  - Torch-based dynamic quantization
  - Hugging Face and bitsandbytes-based 8-bit and 4-bit quantization

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ❗ <b>This notebook requires a GPU</b></p>

In [1]:
# !pip3 install -U bitsandbytes
# restart after this step

In [2]:
import torch
import struct
import numpy as np
from time import time
from utils import get_model_size
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, QuantoConfig

In [3]:
# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)

# Specify random seed for repeatable results
torch.manual_seed(191009)

<torch._C.Generator at 0x7b220cf2c130>

## Representing Floating Point Numbers

<img src="./assets/ch_09_05.png">


### Binary Representation of Floats
- Sign bit
- Exponent bits
- Mantissa bits

In [4]:
num = 3.1457898
print(f"Sample Floating Point Number:{num}")

Sample Floating Point Number:3.1457898


In [5]:
def float32_to_binary(num):
    return ''.join(f'{b:08b}' for b in struct.pack('!f', num))

binary = float32_to_binary(num)

print(f"Float32 representation of {num}:")
print(f"Sign: {binary[0]}")
print(f"Exponent: {binary[1:9]}")
print(f"Fraction: {binary[9:]}")

Float32 representation of 3.1457898:
Sign: 0
Exponent: 10000000
Fraction: 10010010101010010011111
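
The three fields can be turned back into the stored value with the IEEE 754 single-precision formula, value = (-1)^sign * 2^(exponent - 127) * (1 + fraction / 2^23). The following is a minimal sketch for normal numbers only (it ignores zeros, subnormals, infinities, and NaN); `decode_float32_bits` is a name introduced here for illustration.

```python
import struct

def float32_to_binary(num):
    # Same helper as above: 32-character bit string, big-endian
    return ''.join(f'{b:08b}' for b in struct.pack('!f', num))

def decode_float32_bits(bits):
    """Rebuild a normal float32 value from its 32-character bit string."""
    sign = int(bits[0], 2)
    exponent = int(bits[1:9], 2)   # stored with a bias of 127
    fraction = int(bits[9:], 2)    # 23 explicit bits; the leading 1 is implicit
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)

bits = float32_to_binary(3.1457898)
decoded = decode_float32_bits(bits)
print(decoded)
```

The decoded value differs from 3.1457898 only by the float32 rounding error, which is why the exponent field `10000000` (128, meaning 2^1) and the fraction together land slightly off the literal that was typed in.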


### Different Types of Floats

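- FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits
- FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits
- bFloat16: 1 sign bit, 8 exponent bits, 7 mantissa bits (the same range as FP32, with less precision)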

In [6]:
# Create arrays with different float types
f32 = np.array([num], dtype=np.float32)
f16 = np.array([num], dtype=np.float16)

print(f"Float32: {f32}")
print(f"Float16: {f16}")

Float32: [3.1457899]
Float16: [3.146]


In [7]:
og_scalar = torch.scalar_tensor(num)
fp16_scalar = og_scalar.to(dtype=torch.float16)
bf16_scalar = og_scalar.to(dtype=torch.bfloat16)

In [8]:
print(f"Torch Float32: {og_scalar}")
print(f"Torch Float16: {fp16_scalar}")
print(f"Torch bFloat16: {bf16_scalar}")

Torch Float32: 3.145789861679077
Torch Float16: 3.146484375
Torch bFloat16: 3.140625
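
The differences above follow from the bit budgets: FP16 keeps 10 mantissa bits, while bfloat16 keeps the 8 exponent bits of FP32 but only 7 mantissa bits. A sketch using only the standard library reproduces both roundings. `to_fp16` and `to_bf16` are helper names introduced here; the bfloat16 version assumes round-to-nearest-even on the float32 bit pattern and does not handle NaN.

```python
import struct

def to_fp16(x):
    """Round x to IEEE half precision and return it as a Python float."""
    return struct.unpack('!e', struct.pack('!e', x))[0]

def to_bf16(x):
    """Round x to bfloat16 by keeping the top 16 bits of its float32 pattern."""
    bits = struct.unpack('!I', struct.pack('!f', x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack('!f', struct.pack('!I', bits))[0]

num = 3.1457898
print(to_fp16(num))  # prints 3.146484375
print(to_bf16(num))  # prints 3.140625
```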


## Quantization
Quantization aims to reduce the number of bits needed to store a model's weights by binning floating-point values into lower-precision buckets. This reduces memory usage, often with only a small impact on model quality, because small precision losses are usually acceptable.
<img src="./assets/ch_09_04.png">

In [9]:
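# Assume values lie in a symmetric range [-ceil(num), ceil(num)], here [-4, 4]
min_x = -np.ceil([num])[0]
max_x = np.ceil([num])[0]
# Integer steps per unit of real value: 255 steps span the whole range
scale = 255/(max_x-min_x)
# Offset chosen so that min_x maps to the lowest int8 value, -128
zero_point = -round(scale*min_x)-128
scale,zero_point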

(31.875, 0)

In [10]:
x_quant = round(scale*og_scalar.numpy()+zero_point)
x_quant

100

In [11]:
x_dequant = (x_quant-zero_point)/scale
x_dequant

3.1372549019607843
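
The three cells above can be folded into a pair of helpers that work on a list of values. This is a sketch under the same conventions as those cells (scale expressed as integer steps per unit, signed 8-bit range); `quantize_int8` and `dequantize_int8` are names introduced here, and the clipping step is an added safeguard for values that round past 127.

```python
def quantize_int8(values, min_x, max_x):
    """Affine-quantize values from [min_x, max_x] to integers in [-128, 127]."""
    scale = 255 / (max_x - min_x)              # integer steps per unit
    zero_point = -round(scale * min_x) - 128   # min_x maps to -128
    q = [max(-128, min(127, round(scale * v + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map integers back to approximate real values."""
    return [(qi - zero_point) / scale for qi in q]

values = [-4.0, -1.25, 0.0, 3.1457898, 4.0]
q, scale, zero_point = quantize_int8(values, -4.0, 4.0)
restored = dequantize_int8(q, scale, zero_point)
for v, qi, r in zip(values, q, restored):
    print(f"{v:>10.6f} -> {qi:>4d} -> {r:>10.6f} (error {abs(v - r):.6f})")
```

Every round-trip error stays within one quantization step (1 / scale), which is the precision being traded for the smaller storage format.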

### Quantization using Torch

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ❗ <b>Static Quantization</b></p>

In [12]:
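# Note: torch computes q = round(x / scale) + zero_point, so it expects the real-valued step size.
# The hand-computed `scale` is steps per unit (the reciprocal), which is why the result here is 0.
qscalar = torch.quantize_per_tensor(og_scalar,torch.scalar_tensor(scale),torch.scalar_tensor(zero_point),torch.qint8)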
qscalar

tensor(0., size=(), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=31.875, zero_point=0)

In [13]:
print(f"Data Type Original Scalar:{og_scalar.dtype}")
print(f"Data Type Quantized Scalar:{qscalar.dtype}")
print(f"Integer Representation of Quantized Scalar:{qscalar.int_repr()}")

Data Type Original Scalar:torch.float32
Data Type Quantized Scalar:torch.qint8
Integer Representation of Quantized Scalar:0
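
PyTorch's convention is q = round(x / scale) + zero_point, so its `scale` is the real-valued size of one integer step, the reciprocal of the steps-per-unit value computed by hand earlier. The sketch below emulates that formula in plain Python (no torch call is made; `torch_style_quantize` is a name introduced here) to show how the two conventions differ for the sample number.

```python
def torch_style_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Emulate q = clamp(round(x / scale) + zero_point) for a single value."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

x = 3.1457898
steps_per_unit = 31.875  # the hand-computed scale from the earlier cell

print(torch_style_quantize(x, steps_per_unit, 0))      # prints 0
print(torch_style_quantize(x, 1 / steps_per_unit, 0))  # prints 100
```

Under this reading, passing `1 / scale` as the scale argument of `torch.quantize_per_tensor` should yield an integer representation of 100, in line with the manual result.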


<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ❗ <b>Dynamic Quantization</b></p>

In [14]:
dq_scalar = torch.quantize_per_tensor_dynamic(og_scalar,torch.qint8,False)
dq_scalar

tensor(3.1458, size=(), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.012336430830114029,
       zero_point=-128)

In [15]:
print(f"Data Type Dynamically Quantized Scalar:{dq_scalar.dtype}")
print(f"Integer Representation of Dynamically Quantized Scalar:{dq_scalar.int_repr()}")

Data Type Dynamically Quantized Scalar:torch.qint8
Integer Representation of Dynamically Quantized Scalar:127
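
The scale and zero point chosen by the dynamic call can be reproduced by hand. The sketch below assumes the observed range is widened to include zero and then mapped onto [-128, 127], which is consistent with the values printed above; `dynamic_qparams` is a name introduced here, not a torch function, and it does not handle an all-zero input.

```python
def dynamic_qparams(values, qmin=-128, qmax=127):
    """Derive scale and zero point from the data itself (range widened to include 0)."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)      # real-valued size of one integer step
    zero_point = qmin - round(lo / scale)  # lo maps to qmin
    return scale, zero_point

x = 3.145789861679077  # float32 value of the sample number
scale, zero_point = dynamic_qparams([x])
q = max(-128, min(127, round(x / scale) + zero_point))
print(scale, zero_point, q)
```

Because the range is derived from the single value being quantized, that value sits at the top of the range and maps to 127, unlike the fixed [-4, 4] range used in the manual example.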


## Post Training Quantization

Post-training quantization (PTQ), unlike mixed precision training, is performed after the model has been fully trained in high precision. In PTQ, weights are converted to lower-precision formats such as int8 or bfloat16. Static quantization uses scaling factors calibrated ahead of time, whereas dynamic quantization computes them on the fly at runtime. PTQ is particularly useful for deployment scenarios, where memory footprint and latency are critical.

### Torch Quantization

In [16]:
MODEL = "bert-base-uncased"

In [17]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


In [18]:
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In [19]:
size_model = get_model_size(model)
print(f"Original model's size: {size_model} bits | {size_model / 8e6:.2f} MB")

Original model's size: 3504457536 bits | 438.06 MB


In [20]:
size_model = get_model_size(quantized_model)
print(f"Quantized model's size: {size_model} bits | {size_model / 8e6:.2f} MB")

Quantized model's size: 764995392 bits | 95.62 MB
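
A quick parameter count helps interpret these two figures. Using the published `bert-base-uncased` dimensions (vocabulary 30522, hidden size 768, 12 layers, intermediate size 3072, 512 positions, 2 token types) and assuming the LM head's decoder weight is tied to the word embeddings, the arithmetic below reproduces both numbers. The quantized figure matches the FP32 parameters that live outside `nn.Linear` modules. This suggests (an inference from the arithmetic, since `get_model_size` is a local helper not shown here) that the int8 weights packed inside the quantized linear layers are not being counted, so the real footprint is probably larger than 95.62 MB.

```python
# Published bert-base-uncased dimensions
vocab, hidden, layers, inter, positions, token_types = 30522, 768, 12, 3072, 512, 2

layer_norm = 2 * hidden  # weight + bias
embeddings = (vocab + positions + token_types) * hidden + layer_norm

# Linear layers per encoder block: Q, K, V, attention output, two feed-forward layers
linear_per_layer = (4 * (hidden * hidden + hidden)
                    + (hidden * inter + inter)
                    + (inter * hidden + hidden))
encoder_linear = layers * linear_per_layer
encoder_norms = layers * 2 * layer_norm

# LM head: dense transform (Linear), LayerNorm, output bias; decoder weight assumed tied
head_linear = hidden * hidden + hidden
head_other = layer_norm + vocab

total = embeddings + encoder_linear + encoder_norms + head_linear + head_other
non_linear = embeddings + encoder_norms + head_other

print(f"All parameters at 32 bits:        {total * 32} bits")
print(f"Non-Linear parameters at 32 bits: {non_linear * 32} bits")
```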


### HuggingFace

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ❗ <b>This section needs a GPU</b></p>

In [21]:
MODEL = "raghavbali/aligned-gpt2-movie_reviewer"

In [22]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

Some weights of the model checkpoint at raghavbali/aligned-gpt2-movie_reviewer were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
size_model = get_model_size(model)
print(f"Original model's size: {size_model} bits | {size_model / 8e6:.2f} MB")

Original model's size: 3982098432 bits | 497.76 MB


In [24]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

`low_cpu_mem_usage` was None, now default to True since model is quantized.

In [25]:
size_model_4bit = get_model_size(model_4bit)
size_model_8bit = get_model_size(model_8bit)

print(f"Model size after 8bit quantization: {size_model_8bit} bits | {size_model_8bit / 8e6:.2f} MB")
print(f"Model size after 4bit quantization: {size_model_4bit} bits | {size_model_4bit / 8e6:.2f} MB")

Model size after 8bit quantization: 1311571968 bits | 163.95 MB
Model size after 4bit quantization: 971833344 bits | 121.48 MB


Confirm that the models still work as intended after quantization.

In [26]:
inputs = tokenizer("King Kong", return_tensors="pt", return_token_type_ids=False)

In [27]:
# Time one sampled generation for the original model
og_start = time()
outputs_og = model.generate(**inputs,
                            max_new_tokens=25,
                            temperature=0.8,
                            do_sample=True,
                            pad_token_id=tokenizer.eos_token_id)
og_end = time()

# Time the 4-bit model
q4_start = time()
outputs_4bit = model_4bit.generate(**inputs,
                                   max_new_tokens=25,
                                   temperature=0.8,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
q4_end = time()

# Time the 8-bit model
q8_start = time()
outputs_8bit = model_8bit.generate(**inputs,
                                   max_new_tokens=25,
                                   temperature=0.8,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
q8_end = time()



In [28]:
print("::Model Outputs::")
print("*"*15)
print()
print(f"Original Model:({og_end-og_start})")
print("-"*15)
print(tokenizer.decode(outputs_og[0], skip_special_tokens=True))
print()
print(f"8bit Model:({q8_end-q8_start})")
print("-"*15)
print(tokenizer.decode(outputs_8bit[0], skip_special_tokens=True))
print()
print(f"4bit Model:({q4_end-q4_start})")
print("-"*15)
print(tokenizer.decode(outputs_4bit[0], skip_special_tokens=True))

::Model Outputs::
***************

Original Model:(1.6615946292877197)
---------------
King Kong and the Killing Joke is the best in modern cinema. The acting is great, the direction is wonderful, the performances are

8bit Model:(1.7423856258392334)
---------------
King Kong: Skull Island - Full HD Remaster - 2.5/10.

 video is beautiful and the music is great

4bit Model:(4.4493348598480225)
---------------
King Kong movie, then I'd like to see a big action movie with an action movie attached. The first two thirds of the movie
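
Each timing above comes from a single call, so it may include one-off costs such as kernel initialization, and sampled generations can stop at different lengths. The following is a sketch of a slightly more robust measurement, with an untimed warm-up call and an average over several runs. `time_call` is a helper introduced here, and the commented usage line assumes the `model`, `inputs`, and `tokenizer` objects from the cells above.

```python
from time import perf_counter

def time_call(fn, repeats=5, warmup=1):
    """Return mean wall-clock seconds of fn() over `repeats` runs after `warmup` untimed runs."""
    for _ in range(warmup):
        fn()
    start = perf_counter()
    for _ in range(repeats):
        fn()
    return (perf_counter() - start) / repeats

# Stand-in workload so the sketch runs on its own
mean_s = time_call(lambda: sum(range(100_000)))
print(f"mean time per call: {mean_s:.6f} s")

# With the notebook's objects this might look like:
# time_call(lambda: model.generate(**inputs, max_new_tokens=25, do_sample=False,
#                                  pad_token_id=tokenizer.eos_token_id))
```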
