# Model Quantization for Efficient Text Generation 🚀

In this lab, we'll explore model quantization with CTranslate2 and measure its impact on text generation speed. Quantization stores model weights at lower numerical precision, which shrinks the model and typically speeds up inference, both crucial for deploying models in resource-constrained environments.

**Objectives:**
- 📦 Understand the basics of model quantization.
- ⚖️ Quantize a pre-trained model for efficient text generation.
- ⏱ Compare execution times before and after quantization.
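
To make the core idea concrete before touching a real model, here is a minimal sketch using only the standard library. It stores a few made-up weight values as 32-bit and 16-bit floats (via `struct`'s `f` and `e` formats) and compares storage size and rounding error. The values and the `roundtrip` helper are illustrative assumptions; CTranslate2 converts whole weight tensors, not individual Python floats.

```python
import struct


def roundtrip(value, fmt):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Made-up weights, for illustration only
weights = [0.1, -1.5, 3.14159, 0.000123]

# Bytes needed to store the weights at each precision
fp32_bytes = struct.calcsize("<f") * len(weights)
fp16_bytes = struct.calcsize("<e") * len(weights)

# Values as they would be read back after float16 storage
fp16_weights = [roundtrip(w, "<e") for w in weights]
max_error = max(abs(a - b) for a, b in zip(weights, fp16_weights))

print(f"float32: {fp32_bytes} bytes, float16: {fp16_bytes} bytes")
print(f"largest rounding error: {max_error:.6f}")
```

Half the bytes per weight is where the size reduction comes from; the small rounding error is the price paid for it.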


## Setup and Imports  🛠

First, let's get our workspace ready with all the necessary tools:

- `ctranslate2`: For model conversion and quantization.
- `transformers` & `datasets`: For our model, tokenizer, and data.
- `torch`: For tensor operations.
- `tqdm`: Visual progress indication.


In [None]:
# !pip install ctranslate2
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch
from tqdm.auto import tqdm
from ctranslate2.converters import TransformersConverter
from ctranslate2 import Generator

from contextlib import contextmanager
import time

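@contextmanager
def track_time():
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
    try:
        yield
    finally:
        # Report the elapsed time even if the block raises an exception
        elapsed = time.perf_counter() - start
        print(f"Execution time: {elapsed} seconds")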

## Model and Tokenizer Setup  🧩

Before quantization, we need to load and prepare our model and tokenizer:

- **Model:** "TheFuzzyScientist/diabloGPT_open-instruct", used here for instruction-following text generation.
- **Tokenizer:** Loaded from "microsoft/DialoGPT-medium", with left-side padding and the EOS token reused as the padding token.
- **Device:** CUDA, for GPU acceleration.
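
Left padding matters for decoder-only models because generation continues from the last position of each row; with right padding, shorter prompts would end in padding tokens. Here is a library-free sketch of what `padding_side="left"` produces. `pad_left` is a hypothetical helper (not part of `transformers`), and the prompt token IDs are made up; 50256 is the end-of-text ID in the GPT-2 vocabulary that DialoGPT uses.

```python
def pad_left(sequences, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2 end-of-text ID, reused as the padding ID
batch = [[15496, 995], [40, 588, 7545, 13]]  # illustrative token IDs

ids, mask = pad_left(batch, PAD_ID)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row now ends with a real token, and the attention mask tells the model which positions to ignore.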


In [None]:
model = AutoModelForCausalLM.from_pretrained("TheFuzzyScientist/diabloGPT_open-instruct").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token


## Model Quantization  ⚖️

Quantizing our model to reduce its size and improve inference speed:

- **Conversion & Quantization:** `TransformersConverter` converts the saved model to the CTranslate2 format and stores its weights as float16.
- **Output:** A converted model directory that `Generator` loads for efficient text generation.


In [None]:
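# Save the model and tokenizer locally so the converter can read them
model.save_pretrained("models/gpt-instruct")
tokenizer.save_pretrained("models/gpt-instruct")

# Convert to the CTranslate2 format with float16 weights
# (force=True overwrites the output directory if this cell is re-run)
converter = TransformersConverter("models/gpt-instruct")
out_path = converter.convert(output_dir="models/gpt-instruct-quant", quantization="float16", force=True)

# Load the converted model on the GPU
generator = Generator("models/gpt-instruct-quant", device="cuda")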
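
Since smaller size is one of the promised benefits, it's worth checking it on disk. The sketch below sums the file sizes under the two directories used in this lab. `dir_size_mb` is a hypothetical helper written for this check, and the loop skips any directory that doesn't exist yet. If the original checkpoint is stored in float32, the float16 copy should come out at roughly half the size, but treat that as an expectation to verify rather than a guarantee.

```python
import os


def dir_size_mb(path):
    """Return the total size of all files under `path`, in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e6


# Directory names taken from the conversion cell in this lab
for label, path in [("original", "models/gpt-instruct"), ("converted", "models/gpt-instruct-quant")]:
    if os.path.isdir(path):
        print(f"{label}: {dir_size_mb(path):.1f} MB")
    else:
        print(f"{label}: {path} not found, run the conversion cell first")
```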

## Dataset Preparation 📚

Loading and preparing a dataset for our text generation tasks:

- **Dataset:** "hakurei/open-instruct-v1", a source of instruction prompts.
- **Sampling:** 3,000 randomly selected instructions, drawn with a fixed seed (`random_state=42`) so the experiment is reproducible.


In [None]:
dataset = load_dataset("hakurei/open-instruct-v1", split="train")
dataset = dataset.to_pandas()

prompts = dataset["instruction"].sample(3000, random_state=42).tolist()


## Normal Batching Method 🔄

Using the original model, we'll generate text in batches to establish a baseline for performance:

- **Chunker:** Splits prompts into manageable batch sizes.
- **Batch Generation:** Generates text for each batch.
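
To see what the chunker does in isolation, here is a small standalone sketch that runs it on a toy list and then works out how many batches 3,000 prompts produce at a batch size of 32. The toy list is an illustrative assumption; the function body matches the one used in this lab.

```python
import math


def chunker(seq, size):
    """Yield consecutive slices of `seq`, each with at most `size` items."""
    return (seq[pos : pos + size] for pos in range(0, len(seq), size))


# Seven items in chunks of three: the last chunk is shorter
print(list(chunker(list(range(7)), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]

# Batch count for this lab's settings
num_prompts, batch_size = 3000, 32
num_batches = math.ceil(num_prompts / batch_size)
last_batch = num_prompts - (num_batches - 1) * batch_size
print(num_batches, last_batch)  # 94 24
```

So the progress bar should stop at 94 iterations, with the final batch holding only 24 prompts.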


In [None]:

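# Normal batching
def chunker(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    return (seq[pos : pos + size] for pos in range(0, len(seq), size))


def batch_generate_tokens(input_ids, attention_mask):
    # Pass the attention mask so the left-padding tokens are ignored
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=256,
        pad_token_id=tokenizer.eos_token_id,
        num_beams=2,
        repetition_penalty=1.5,
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def predict_batch(prompts, batch_size):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=128)

    # Chunk the IDs and the mask in lockstep so each batch stays aligned
    batches = zip(chunker(inputs["input_ids"], batch_size), chunker(inputs["attention_mask"], batch_size))
    for input_ids, attention_mask in batches:
        yield batch_generate_tokens(input_ids.to(model.device), attention_mask.to(model.device))


with track_time():
    for batch_prediction in tqdm(predict_batch(prompts, 32)):
        pass  # Only the timing matters here, so the predictions are discarded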

# Execution time: 242.11289978027344 seconds

## Quantized Model Batching 🎯

Switching to our quantized model for more efficient text generation:

- **CTranslate2 Tokenization:** `generate_batch` takes lists of token strings rather than tensors of token IDs, so each prompt is tokenized on its own, without padding.
- **Batch Generation:** Utilizing the quantized model.


In [None]:
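# CTranslate2 batching with quantized model
def batch_generate_ctrans(prompts, batch_size):
    # CTranslate2 expects token strings (not IDs); no padding is needed here
    inputs = [tokenizer.tokenize(prompt, truncation=True, max_length=128) for prompt in prompts]

    # max_batch_size lets CTranslate2 split the inputs into batches itself
    results = generator.generate_batch(inputs, max_length=256, max_batch_size=batch_size, beam_size=2, repetition_penalty=1.5)

    # Keep the first (best) hypothesis for each prompt and decode it back to text
    result_ids = [res.sequences_ids[0] for res in results]
    return tokenizer.batch_decode(result_ids, skip_special_tokens=True)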



## Predicting with Quantized Model 🚀

Finally, let's see the performance improvement with our quantized model:

- **Execution:** Generate text with the quantized model.
- **Comparison:** Observe the reduction in execution time versus the unquantized model.


In [None]:
del model
torch.cuda.empty_cache()
with track_time():
    batch_generate_ctrans(prompts, 32)

# Execution time: 150.97192573547363 seconds
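
The two timings recorded in this lab can be turned into a speedup figure with a little arithmetic. The sketch below hard-codes the numbers reported above; your own runs will differ with hardware and library versions, so substitute your measurements.

```python
# Timings recorded in this lab (seconds)
baseline_seconds = 242.11289978027344   # original model
quantized_seconds = 150.97192573547363  # CTranslate2 float16 model
num_prompts = 3000

speedup = baseline_seconds / quantized_seconds
time_saved_pct = 100 * (1 - quantized_seconds / baseline_seconds)
baseline_throughput = num_prompts / baseline_seconds
quantized_throughput = num_prompts / quantized_seconds

print(f"Speedup: {speedup:.2f}x")
print(f"Time saved: {time_saved_pct:.1f}%")
print(f"Throughput: {baseline_throughput:.1f} -> {quantized_throughput:.1f} prompts/s")
```

Keep in mind that this compares two inference stacks as well as two precisions, so part of the gain may come from the CTranslate2 runtime itself rather than from float16 alone.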


## Conclusion and Next Steps 🌈

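We converted a text generation model to CTranslate2 with float16 weights and cut the time to generate responses for 3,000 prompts from about 242 seconds to about 151 seconds, roughly a 1.6x speedup. This showcases how quantization, paired with an optimized inference runtime, makes NLP models cheaper to deploy in production.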

**Suggested Next Steps:**
- 🤖 Try quantizing different models.
- 📊 Compare quantization effects on various model sizes.
- 🔍 Explore further optimizations for deployment.
