Implementing Triton (optionally CUDA) kernels for quantizing LLM weights and for running inference with the quantized model.

Plan:
1) Implement a kernel that quantizes a 2D matrix from fp16 to int4
and then packs the quantized matrix into int8 or int32.
Memory consumption should drop by a factor of 4.
2) Implement a kernel that multiplies a bf16 matrix by the int4-quantized matrix (X16@W4^T).
3) Compare the speed of (X16@W4^T) with that of (X16@W16^T). The sizes of W match the weight matrix sizes of the Llama-3.2-1B-Instruct model (https://huggingface.co/unsloth/Llama-3.2-1B-Instruct).
Number of rows (tokens) in the activation matrix X: 128, 512, 2048.
4) Using these kernels, write a quantized linear layer and apply it to the linear layers of Llama-3.2-1B-Instruct.
5) Measure inference speed and perplexity on wikitext2.
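
Steps 1 and 2 of the plan can be prototyped on the CPU before writing any Triton code. The block below is a minimal NumPy reference sketch, not the kernels themselves. It assumes symmetric per-row scaling to the int4 range [-8, 7] and packs two values per byte, low nibble first. Both choices, and the function names `quantize_int4`, `pack_int4`, `unpack_int4`, and `matmul_x16_w4t`, are assumptions made for illustration; the plan does not fix a quantization scheme. NumPy has no bf16 type, so fp16 stands in for the activations.

```python
import numpy as np


def quantize_int4(w: np.ndarray):
    """Symmetric per-row quantization of a 2D matrix to the int4 range [-8, 7]."""
    w32 = w.astype(np.float32)
    # One scale per row (assumption): the largest |w| in the row maps to 7.
    absmax = np.abs(w32).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(w32 / scale), -8, 7).astype(np.int8)
    return q, scale


def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack pairs of int4 values along axis 1 into one byte, low nibble first."""
    assert q.shape[1] % 2 == 0, "number of columns must be even"
    nibbles = q.astype(np.uint8) & 0x0F  # keep the 4-bit two's-complement pattern
    return (nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)).astype(np.uint8)


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover signed int4 values stored as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend: nibble values 8..15 represent -8..-1.
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2] = lo
    q[:, 1::2] = hi
    return q


def matmul_x16_w4t(x: np.ndarray, packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reference for X16 @ W4^T: dequantize to float32, then multiply."""
    w = unpack_int4(packed).astype(np.float32) * scale
    return x.astype(np.float32) @ w.T


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float16)  # toy weight matrix
X = rng.standard_normal((4, 128)).astype(np.float16)   # toy activations

q, scale = quantize_int4(W)
packed = pack_int4(q)
Y = matmul_x16_w4t(X, packed, scale)

print(W.nbytes / packed.nbytes)  # 4.0
```

A reference like this gives a Triton kernel something to be checked against. The 4x figure counts only the packed weights; the per-row scales add a small overhead on top.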

In [1]:
!pip install -U transformers



In [2]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))



I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."<|eot_id|>


In [None]:
model.to("cuda:0")

In [6]:
model.device

device(type='cuda', index=0)

In [14]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
eval_dataset = dataset["validation"]

def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=False)

tokenized = eval_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

block_size = 2048

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized.map(group_texts, batched=True)

lm_dataset.set_format(type="torch", columns=["input_ids", "labels", "attention_mask"])



In [16]:
import torch
from torch.utils.data import DataLoader
import time
import math

batch_size = 1

dataloader = DataLoader(lm_dataset, batch_size=batch_size, pin_memory=True)

device = next(model.parameters()).device

start_time = time.time()
num_tokens = 0
loss_sum = 0.0
count = 0

model.eval()
with torch.no_grad():
    for batch in dataloader:
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch.get("attention_mask"),
            labels=batch["labels"],
        )
        loss = outputs.loss
        num_tokens += batch["labels"].numel()
        loss_sum += loss.item()
        count += 1

if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.time() - start_time
mean_loss = loss_sum / max(count, 1)
ppl = math.exp(mean_loss)
tps = num_tokens / elapsed if elapsed > 0 else float("nan")

print(f"Perplexity: {ppl:.4f}")
print(f"Eval time (s): {elapsed:.2f}")
print(f"Tokens processed: {num_tokens}")
print(f"Tokens/sec: {tps:.2f}")


Perplexity: 36.8581
Eval time (s): 284.20
Tokens processed: 249856
Tokens/sec: 879.17
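
The loop above averages the per-batch losses with equal weight before exponentiating. That is exact here because every block holds `block_size` tokens, but it would bias the result if batches differed in length. A minimal sketch of the token-weighted form follows; `perplexity_from_batches` is a hypothetical helper name, not part of the notebook.

```python
import math


def perplexity_from_batches(batch_losses, batch_token_counts):
    """Perplexity from per-batch mean losses, weighted by tokens per batch."""
    # Turn each mean loss back into a summed negative log-likelihood.
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_nll / total_tokens)


# With equal batch sizes this matches exp(mean of the batch losses).
equal = perplexity_from_batches([2.0, 4.0], [10, 10])
# With unequal sizes the longer batch dominates the average.
unequal = perplexity_from_batches([2.0, 4.0], [30, 10])
```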


In [11]:
messages = [
    {"role": "user", "content": "how to hire a good engineer?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=400)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Hiring a good engineer requires a combination of skills, experience, and personal qualities. Here are some tips to help you find a qualified engineer:

1. **Clearly define the job requirements**: Make sure you have a detailed job description and requirements. Be specific about the skills, qualifications, and experience needed for the role.
2. **Use a job board and industry-specific platforms**: Websites like LinkedIn, Glassdoor, and Indeed can help you find qualified engineers in your area or industry.
3. **Network with professionals**: Reach out to engineers you know or have connections in your industry. Ask for referrals or advice on finding the right candidate.
4. **Check online profiles**: Look for engineers on professional networking sites like LinkedIn, and check their work experience, skills, and education.
5. **Conduct interviews**: Invite candidates to interview, either in-person or remotely. Pay attention to their communication skills, problem-solving abilities, and willingne