# Quantization in Transformer Models: FP16 Example

## What is Quantization?

Quantization is a model compression technique that reduces the numerical precision of neural network weights and activations. Rather than using the default 32-bit floating point (FP32), we can use formats like:

- FP16 (16-bit floating point)
- INT8 (8-bit integer)
- 4-bit formats

### Why Use Quantization?

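- **Faster inference**: Lower-precision arithmetic runs faster on hardware with native support for it, and less data has to move through memory.
- **Less memory usage**: Each value takes fewer bytes, so FP16 weights need about half the memory of FP32 weights.
- **Lower power consumption**: Less computation and memory traffic make quantized models better suited to edge and mobile devices.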
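
To make the memory argument concrete, here is a back-of-the-envelope estimate of weight storage at different precisions. It is a minimal sketch: the helper name `weight_memory_gib` is hypothetical, the 6.7 billion parameter count is an assumption taken from the name of the model used later in this notebook, and the estimate covers weights only (no activations, KV cache, or framework overhead).

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gib(num_params, fmt):
    """Estimate weight storage in GiB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[fmt] / 2**30

# Assumed parameter count (illustrative, based on the "6.7b" in the model name).
num_params = 6.7e9
for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gib(num_params, fmt):.1f} GiB")
```

Under these assumptions the weights alone take roughly 25 GiB in FP32 and about half that in FP16, which is often the difference between fitting on a single GPU or not.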


## FP16 Quantization

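**Half-Precision Floating Point (FP16)** is a commonly used reduced-precision format. It represents each floating-point number using only 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits. Strictly speaking, converting FP32 weights to FP16 is a precision cast rather than integer quantization, but it is commonly grouped with quantization techniques.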

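### Benefits of FP16
- Keeps a floating-point dynamic range (largest finite value 65,504), which is narrower than FP32 but sufficient for most inference workloads.
- Accelerates matrix operations on GPUs (especially NVIDIA Tensor Cores).
- Usually causes little or no accuracy degradation compared to FP32 at inference time.
- Supported out of the box in libraries like Hugging Face Transformers and PyTorch.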
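
The precision and range limits of FP16 can be observed without a GPU. The sketch below uses the standard library's `struct` module, whose `"e"` format code packs IEEE 754 half-precision values. The helper name `to_fp16` is hypothetical and is only meant to illustrate rounding behavior, not how PyTorch performs the cast internally.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(0.1))      # nearest representable FP16 value to 0.1
print(to_fp16(2049.0))   # above 2048, FP16 can only represent even integers
print(to_fp16(65504.0))  # largest finite FP16 value

# Values beyond the FP16 range cannot be packed at all.
try:
    to_fp16(70000.0)
except (OverflowError, struct.error) as exc:
    print("out of range:", exc)
```

The small rounding errors are usually harmless for inference, while the limited range is the reason very large intermediate values can overflow in FP16.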


## Setting Up the Environment

Before running the code:
- Make sure `transformers`, `torch`, and `accelerate` are installed (`accelerate` is required for `device_map="auto"`).
- A GPU with FP16 support (e.g., NVIDIA Turing or Ampere GPUs) is recommended.
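
A quick way to verify the prerequisites is to check whether the packages can be imported before running the heavier cells. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and `accelerate` is included on the assumption that the model is loaded with `device_map="auto"`, which relies on it.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["transformers", "torch", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```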


In [1]:
# Install Hugging Face Transformers and PyTorch if not already installed
# Uncomment the next line if needed
# !pip install transformers torch accelerate

#### Load Model in FP16

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch

# Configuration
model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"
# Read the Hugging Face access token from the environment; never hard-code it in a notebook.
# If HF_TOKEN is unset, token is None and no explicit token is passed.
token = os.environ.get("HF_TOKEN")

# Load the model with FP16 precision and automatic GPU/CPU allocation
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Load weights in FP16 (half precision)
    device_map="auto",          # Place layers on the available devices, GPU first
    token=token                 # Hugging Face Hub access token
)

# Load the tokenizer (converts text to tokens and back)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token=token
)


#### Inference with FP16 Model

In [None]:
# Prepare the input prompt
input_text = "Explain the transformer architecture"

# Tokenize and move tensors to the device the model was placed on
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate model response (text generation)
outputs = model.generate(**inputs, max_new_tokens=200)

# Decode token IDs into readable text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(response)
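
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded text above starts with the original prompt. A minimal sketch of separating the two is shown below; the helper name `completion_ids` is hypothetical, and plain lists stand in for tensors so the snippet runs on its own.

```python
def completion_ids(output_ids, prompt_length):
    """Drop the first prompt_length token IDs, keeping only the generated ones."""
    return output_ids[prompt_length:]

# Illustrative token IDs (made up for this example).
prompt_ids = [101, 7, 8]
generated = prompt_ids + [42, 43, 2]
print(completion_ids(generated, len(prompt_ids)))  # [42, 43, 2]
```

With the notebook's objects, the same slicing applies to tensors: `tokenizer.decode(completion_ids(outputs[0], inputs["input_ids"].shape[1]), skip_special_tokens=True)` should return only the model's answer.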

## Summary

Using FP16 precision provides a solid balance between performance and accuracy:

- Roughly halves the GPU memory needed for weights compared to FP32, allowing for larger models or batch sizes.
- Inference is typically faster on GPUs with native FP16 support.
- Accuracy usually remains very close to that of FP32 models.
- Integration is simple using the Hugging Face Transformers library with `torch_dtype=torch.float16`.

This makes FP16 a go-to solution for production inference on powerful GPUs.

In the next section, we'll explore more aggressive quantization using **INT8** for deployments on mobile or edge devices.
