# Overview

Let's compress an LLM with knowledge distillation and quantization. We will use knowledge distillation to compress the xxx-parameter teacher model into a 50M-parameter student. Then we use 4-bit quantization to reduce the memory footprint by 3X, resulting in a final model that is 7X smaller than the original one.
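
The two compression steps compound, so the headline figures can be sanity-checked with simple arithmetic. The sketch below uses only the two ratios quoted above (3X from quantization, 7X overall) and assumes the steps multiply; the function name is illustrative and not part of the notebook.

```python
def implied_distillation_ratio(overall_ratio: float, quantization_ratio: float) -> float:
    """Return the size reduction that distillation must supply on its own.

    Assumes the two steps compound multiplicatively:
    overall = distillation * quantization.
    """
    if quantization_ratio <= 0:
        raise ValueError("quantization_ratio must be positive")
    return overall_ratio / quantization_ratio


# Ratios quoted in the overview: 7X overall, 3X from 4-bit quantization.
ratio = implied_distillation_ratio(7.0, 3.0)
print(f"Distillation must shrink the model by about {ratio:.2f}X")
```

Under that assumption, distillation alone would need to reduce the model size by a bit more than 2X.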

# Load the dataset

In [1]:
import os

os.environ['DATASET']='aisuko/phishing-binary-classification'
os.environ["TEACHER"]='shawhin/bert-phishing-classifier_teacher'
os.environ["STUDENT"]='aisuko/phishing-binary-classification_student'

# Run CUDA kernels synchronously so that a "CUDA error: device-side assert triggered"
# is reported at the call that actually failed (a debugging aid, not a fix).
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In [2]:
from datasets import load_dataset

ds=load_dataset(os.getenv('DATASET'))
ds

DatasetDict({
    train: Dataset({
        features: ['url', 'labels'],
        num_rows: 528006
    })
    validation: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
    test: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
})

## Subsample the data to fit limited GPU resources

In [3]:
from datasets import DatasetDict

pre_processed_ds_train_low=ds['train'].shuffle(seed=42).select(range(10000))
pre_processed_ds_test_low=ds['test'].shuffle(seed=42).select(range(5000))
pre_processed_ds_validate_low=ds['validation'].shuffle(seed=42).select(range(5000))

ds_low=DatasetDict({
    'train': pre_processed_ds_train_low,
    'test': pre_processed_ds_test_low,
    'validation': pre_processed_ds_validate_low,
})
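
After subsampling, it may be worth checking that both classes are still well represented in each split. The following is a minimal sketch in plain Python; `label_distribution` is a hypothetical helper, and the toy list stands in for something like `ds_low['train']['labels']`.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: fraction} for an iterable of class labels."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels standing in for ds_low['train']['labels'].
toy_labels = [0, 1, 1, 0, 1, 1, 0, 1]
print(label_distribution(toy_labels))
```

A heavily skewed distribution would make accuracy alone a misleading metric for the comparison later on.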

# Load the teacher model

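The teacher model is described as a fine-tuned version of the [openai-community/roberta-large-openai-detector](https://huggingface.co/openai-community/roberta-large-openai-detector) model on a phishing website URL dataset; see [FT GPT-2 Detector for text classification](https://www.kaggle.com/code/aisuko/ft-gpt-2-detector-for-binary-classification). Note that the checkpoint actually loaded in this notebook, `shawhin/bert-phishing-classifier_teacher`, prints as a 12-layer `BertForSequenceClassification`.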

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

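# use the GPU when available, otherwise fall back to the CPU
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")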

# load teacher model and tokenizer
tokenizer=AutoTokenizer.from_pretrained(os.getenv('TEACHER'))
teacher_model=AutoModelForSequenceClassification.from_pretrained(os.getenv('TEACHER')).to(device)
teacher_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

# Load the student model

In [5]:
from transformers import DistilBertForSequenceClassification, DistilBertConfig

# Shrink the default DistilBERT (12 attention heads per layer, 6 layers) to 8 heads per layer and 4 layers.
configuration= DistilBertConfig(n_heads=8, n_layers=4)
configuration

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 8,
  "n_layers": 4,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.45.1",
  "vocab_size": 30522
}
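
To see how this configuration relates to the 50M-parameter target, the parameter count can be worked out by hand from the config values. The sketch below assumes the standard DistilBERT classifier layout (learned position embeddings, biased linear layers, two LayerNorms per block, and a `pre_classifier` plus `classifier` head with two labels); the function name is illustrative.

```python
def distilbert_classifier_params(vocab_size=30522, max_positions=512, dim=768,
                                 hidden_dim=3072, n_layers=4, num_labels=2):
    """Count trainable parameters of a DistilBERT sequence classifier."""
    layer_norm = 2 * dim                      # weight + bias
    embeddings = vocab_size * dim + max_positions * dim + layer_norm

    attention = 4 * (dim * dim + dim)         # q, k, v and output projections
    ffn = (dim * hidden_dim + hidden_dim) + (hidden_dim * dim + dim)
    block = attention + ffn + 2 * layer_norm  # two LayerNorms per block

    head = (dim * dim + dim) + (dim * num_labels + num_labels)
    return embeddings + n_layers * block + head


params = distilbert_classifier_params()
print(f"parameters: {params:,}")
print(f"fp32 size:  {params * 4 / 1e6:.1f} MB")
# Weights only; real 4-bit formats add some overhead for scales and metadata.
print(f"4-bit size: {params * 0.5 / 1e6:.1f} MB")
```

Note that `n_heads` does not appear in the count: because `dim` stays at 768, going from 12 to 8 heads only changes how the projections are split, so the size reduction comes from dropping two layers.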

In [6]:
student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=configuration).to(device)
student_model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-3): 4 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

# Tokenize the text

In [7]:
def preprocess_func(examples):
    return tokenizer(examples["url"], padding='max_length', truncation=True)


# tokenize all data
tokenized_data=ds_low.map(preprocess_func, batched=True)
tokenized_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [8]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['url', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['url', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['url', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
})
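
With `padding='max_length'`, every URL is padded to the tokenizer's maximum length, even though most URLs are much shorter. One alternative, which the notebook does not use, is to pad each batch only to its longest sequence. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch.

    Returns (input_ids, attention_mask) as lists of equal-length lists.
    """
    longest = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Made-up token ids standing in for three tokenized URLs.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

In the `transformers` library, `DataCollatorWithPadding` offers this kind of per-batch padding when passed as the `collate_fn` of a `DataLoader`; whether it helps here depends on the actual URL lengths.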

# Evaluation function

In [9]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Function to evaluate model performance
def evaluate_model(model, dataloader, device):
    model.eval()  # Set model to evaluation mode
    all_preds = []
    all_labels = []

    # Disable gradient calculations
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass to get logits
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # Get predictions
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())

    # Calculate evaluation metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')

    return accuracy, precision, recall, f1
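
To make the four reported metrics concrete, here is a small plain-Python sketch that computes them from true and predicted labels for the binary case. It is meant to mirror what `accuracy_score` and `precision_recall_fscore_support(..., average='binary')` report, with class 1 assumed to be the positive class; `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```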

# Train Student Model

In [10]:
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# hyperparameters
batch_size = 32
lr = 1e-4
num_epochs = 5
temperature = 2.0
alpha = 0.5

# Function to compute distillation and hard-label loss
def distillation_loss(student_logits, teacher_logits, true_labels, temperature, alpha):
    # Compute soft targets from teacher logits
    soft_targets = nn.functional.softmax(teacher_logits / temperature, dim=1)
    student_soft = nn.functional.log_softmax(student_logits / temperature, dim=1)

    # KL Divergence loss for distillation
    distill_loss = nn.functional.kl_div(student_soft, soft_targets, reduction='batchmean') * (temperature ** 2)

    # Cross-entropy loss for hard labels
    hard_loss = nn.CrossEntropyLoss()(student_logits, true_labels)

    # Combine losses
    loss = alpha * distill_loss + (1.0 - alpha) * hard_loss

    return loss


# define optimizer
optimizer = optim.Adam(student_model.parameters(), lr=lr)

# create training data loader
dataloader = DataLoader(tokenized_data['train'], batch_size=batch_size)
# create testing data loader
test_dataloader = DataLoader(tokenized_data['test'], batch_size=batch_size)
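
The loss above combines a temperature-softened KL term with a standard cross-entropy term. The following plain-Python sketch reproduces the same arithmetic for a single example, which can help when checking values by hand; the helper names are illustrative, and for a batch of one the `batchmean` reduction equals the per-example KL divergence.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss_single(student_logits, teacher_logits, true_label,
                             temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(teacher_soft, student_soft) if t > 0)
    distill = kl * temperature ** 2
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * distill + (1.0 - alpha) * hard


# A higher temperature spreads probability mass more evenly.
print(softmax([4.0, 0.0], temperature=1.0))
print(softmax([4.0, 0.0], temperature=2.0))
# The student disagrees with the teacher, so the KL term is positive.
print(distillation_loss_single([0.5, 1.5], [4.0, 0.0], true_label=0))
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted hard-label loss remains.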

In [11]:
student_model.train()

# train model
for epoch in range(num_epochs):
    for batch in dataloader:
        # Prepare inputs
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Disable gradient calculation for teacher model
        with torch.no_grad():
            teacher_outputs = teacher_model(input_ids, attention_mask=attention_mask)
            teacher_logits = teacher_outputs.logits

        # Forward pass through the student model
        student_outputs = student_model(input_ids, attention_mask=attention_mask)
        student_logits = student_outputs.logits

        # Compute the distillation loss
        loss = distillation_loss(student_logits, teacher_logits, labels, temperature, alpha)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} completed with loss: {loss.item()}")

    # Evaluate the teacher model
    teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, test_dataloader, device)
    print(f"Teacher (test) - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1 Score: {teacher_f1:.4f}")

    # Evaluate the student model
    student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, test_dataloader, device)
    print(f"Student (test) - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1 Score: {student_f1:.4f}")
    print("\n")

    # put student model back into train mode
    student_model.train()

Epoch 1 completed with loss: 0.7947463989257812
Teacher (test) - Accuracy: 0.1406, Precision: 0.1157, Recall: 0.0974, F1 Score: 0.1057
Student (test) - Accuracy: 0.4454, Precision: 0.4585, Recall: 0.3476, F1 Score: 0.3955


Epoch 2 completed with loss: 0.7779355049133301
Teacher (test) - Accuracy: 0.1406, Precision: 0.1157, Recall: 0.0974, F1 Score: 0.1057
Student (test) - Accuracy: 0.4216, Precision: 0.4224, Recall: 0.2951, F1 Score: 0.3475


Epoch 3 completed with loss: 0.7771766185760498
Teacher (test) - Accuracy: 0.1406, Precision: 0.1157, Recall: 0.0974, F1 Score: 0.1057
Student (test) - Accuracy: 0.4338, Precision: 0.4446, Recall: 0.3411, F1 Score: 0.3860


Epoch 4 completed with loss: 0.7862991690635681
Teacher (test) - Accuracy: 0.1406, Precision: 0.1157, Recall: 0.0974, F1 Score: 0.1057
Student (test) - Accuracy: 0.4038, Precision: 0.4399, Recall: 0.5217, F1 Score: 0.4773


Epoch 5 completed with loss: 0.7774569392204285
Teacher (test) - Accuracy: 0.1406, Precision: 0.1157, Re
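
The loop above prints the loss of the last batch in each epoch, which can be noisy. A possible refinement, not part of the original notebook, is to report the mean loss over the epoch. The sketch below shows a small accumulator in plain Python; `RunningMean` is a hypothetical helper that could be updated with `loss.item()` and the batch size inside the training loop.

```python
class RunningMean:
    """Accumulate a sample-weighted mean of values such as per-batch losses."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        """Add a value that is the average over n samples."""
        self.total += float(value) * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


# Toy per-batch losses; the last batch is smaller than the others.
epoch_loss = RunningMean()
for batch_loss, batch_len in [(0.80, 32), (0.76, 32), (0.90, 16)]:
    epoch_loss.update(batch_loss, n=batch_len)
print(f"mean loss over epoch: {epoch_loss.mean:.4f}")
```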

# Evaluate the models


In [12]:
# create validation data loader
validation_dataloader = DataLoader(tokenized_data['validation'], batch_size=8)

# Evaluate the teacher model
teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, validation_dataloader, device)
print(f"Teacher (validation) - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1 Score: {teacher_f1:.4f}")

# Evaluate the student model
student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, validation_dataloader, device)
print(f"Student (validation) - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1 Score: {student_f1:.4f}")

Teacher (validation) - Accuracy: 0.1314, Precision: 0.1092, Recall: 0.0940, F1 Score: 0.1010
Student (validation) - Accuracy: 0.4044, Precision: 0.3818, Recall: 0.2368, F1 Score: 0.2923
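
A teacher accuracy of 0.1314 on a binary task is far below the 0.5 expected from random guessing. One possible explanation, which is an assumption and not confirmed by the notebook, is that the teacher checkpoint and this dataset use opposite label conventions. For a binary classifier, inverting every prediction turns an accuracy of `a` into `1 - a`, which the sketch below demonstrates on toy labels before applying it to the reported numbers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def flipped_accuracy(y_true, y_pred):
    """Accuracy after swapping the two classes (0 <-> 1) in the predictions."""
    return accuracy(y_true, [1 - p for p in y_pred])


# Toy labels: the predictions are mostly the inverse of the truth.
y_true = [1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 1]
print(accuracy(y_true, y_pred), flipped_accuracy(y_true, y_pred))

# Applying 1 - a to the teacher accuracies reported above.
for split, reported in [("test", 0.1406), ("validation", 0.1314)]:
    print(f"teacher ({split}) with inverted labels: {1 - reported:.4f}")
```

If the conventions really are inverted, the teacher's soft targets and the hard labels pull the student in opposite directions, which might explain why the loss stays nearly flat across epochs. Comparing a few teacher predictions against known examples would confirm or rule this out.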


# Push to HuggingFace Hub

In [13]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

In [14]:
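# metadata describing the model (not passed to push_to_hub below)
kwargs={
    'model_name': os.getenv('STUDENT'),
    'tasks': 'text-classification',
    'dataset': os.getenv('DATASET')
}

# push the tokenizer and the student model to the same model repository
tokenizer.push_to_hub(os.getenv('STUDENT'))
student_model.push_to_hub(os.getenv('STUDENT'))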

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/aisuko/phishing-binary-classification_student/commit/7df8d7aacc901ff527619da99862d6992a192203', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='7df8d7aacc901ff527619da99862d6992a192203', pr_url=None, repo_url=RepoUrl('https://huggingface.co/aisuko/phishing-binary-classification_student', endpoint='https://huggingface.co', repo_type='model', repo_id='aisuko/phishing-binary-classification_student'), pr_revision=None, pr_num=None)

# Acknowledgements

* https://github.com/ShawhinT/YouTube-Blog/blob/main/LLMs/model-compression/1_knowledge_distillation.ipynb