# Lazy-loading DistilBERT

This notebook uses a small language model from the Hugging Face Transformers library that can run on a CPU: "distilbert-base-uncased", a distilled, smaller version of BERT. It demonstrates how to separate the model architecture from its weights and load the weights on demand.

## Imports

In [None]:
%pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cpu
%pip install transformers==4.41.2

In [None]:
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer
import os
import time


## Saving and loading weights

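The helpers below show the basic idea of separating model architecture from weights: one saves a model's `state_dict` (its weights) to disk, and the other loads it back into an already constructed architecture.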

In [None]:
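def save_model_weights(model, path):
    # Create the parent directory only if the path contains one
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Save only the weights (state_dict), not the architecture
    torch.save(model.state_dict(), path)

def load_model_weights(model, path):
    # Keep tensors on the CPU and restrict unpickling to tensor data
    state_dict = torch.load(path, map_location="cpu", weights_only=True)
    model.load_state_dict(state_dict)
    return model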

## Creating a model

In [None]:
# Initialize tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Initialize model configuration
config = DistilBertConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 2  # Binary classification

# Create the model architecture (weights are randomly initialized)
model = DistilBertForSequenceClassification(config)

In [None]:
# Save the randomly initialized weights to disk (standing in for a pre-trained checkpoint)
save_model_weights(model, "weights/distilbert_weights.pth")

## Loading 



In [None]:
def get_model_lazy():
    # Create the model architecture (no checkpoint is read here, but the
    # parameters are still allocated and randomly initialized)
    model = DistilBertForSequenceClassification(config)

    # Load the saved weights from local disk (this is the potentially slower part)
    return load_model_weights(model, "weights/distilbert_weights.pth")



In [None]:
def get_model_full():
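    # Build the architecture and load the pre-trained weights in one call
    # (downloads from the Hugging Face Hub on first use, then reads the local cache)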
    return DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)


## Evaluation

In [None]:
# Function to run inference
def run_inference(model, text):
    model.eval()

    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get prediction
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"

This approach lets you keep the model architecture definition in your code while storing the weights separately. The weights are read from disk only when `get_model_lazy()` is called, so call it at the point where you actually need to run inference.
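The loading functions above read the weights as soon as they are called. To defer the load until the model is actually used, one option is a small wrapper that calls the loader on first access. The sketch below is a minimal illustration, not part of the original notebook: `LazyModel` and `fake_loader` are hypothetical names, and the stand-in loader keeps the example runnable without torch. In this notebook you would pass `get_model_lazy` instead.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Hypothetical wrapper: defer an expensive load until first use."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader            # e.g. get_model_lazy in this notebook
        self._model: Optional[T] = None  # nothing is loaded yet

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def get(self) -> T:
        # Call the loader on first access only; later calls reuse the result.
        if self._model is None:
            self._model = self._loader()
        return self._model


# Stand-in loader (an assumption for illustration) that records each call.
load_calls = []


def fake_loader() -> dict:
    load_calls.append(1)
    return {"weights": "loaded"}


lazy = LazyModel(fake_loader)
print(lazy.is_loaded)                   # False: constructing the wrapper loads nothing
lazy.get()
lazy.get()
print(lazy.is_loaded, len(load_calls))  # True 1: the loader ran exactly once
```

With this shape, creating the wrapper at startup costs almost nothing, and the cost of reading the weights is paid on the first call to `get()`.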

In a production environment, you might want to:

* Implement caching to avoid loading the weights for every inference.
* Use more efficient storage formats for the weights, such as safetensors.
* Implement error handling and logging.
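As a rough sketch of the caching, error handling, and logging points, the example below wraps a loader in `functools.lru_cache` so that each weights file is read at most once per process. The names `read_weights` and `load_cached` are hypothetical, and `read_weights` only reads raw bytes as a stand-in for `load_model_weights`, so the sketch runs without torch.

```python
import logging
import os
import tempfile
from functools import lru_cache

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_loader")


def read_weights(path: str) -> bytes:
    """Hypothetical stand-in for load_model_weights: returns the raw bytes."""
    with open(path, "rb") as f:
        return f.read()


@lru_cache(maxsize=2)
def load_cached(path: str) -> bytes:
    """Load a weights file once per path; repeat calls are served from the cache."""
    if not os.path.isfile(path):
        logger.error("Weights file not found: %s", path)
        raise FileNotFoundError(path)
    logger.info("Loading weights from %s", path)
    return read_weights(path)


# Demo with a temporary file standing in for a real weights file.
with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "weights.bin")
    with open(weights_path, "wb") as f:
        f.write(b"\x00" * 16)

    first = load_cached(weights_path)     # reads the file
    second = load_cached(weights_path)    # served from the cache
    print(first is second)                # True
    print(load_cached.cache_info().hits)  # 1

    try:
        load_cached(os.path.join(tmp, "missing.bin"))
    except FileNotFoundError as exc:
        print(f"Handled missing file: {exc}")
```

`lru_cache` does not cache calls that raise, so a failed load can be retried once the file exists. A long-running service would likely also want a way to invalidate the cache when the weights file changes, for example with `load_cached.cache_clear()`.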

In [None]:
# Timing comparison
sample_text = "I love using this language model. It's fantastic!"

print("\nFull loading approach:")
start_time = time.time()
model_full = get_model_full()
full_load_time = time.time() - start_time
print(f"  Model loading time: {full_load_time:.4f} seconds")

start_time = time.time()
result_full = run_inference(model_full, sample_text)
full_inference_time = time.time() - start_time
print(f"  Inference time: {full_inference_time:.4f} seconds")
print(f"  Total time: {full_load_time + full_inference_time:.4f} seconds")
print(f"  Result: {result_full}")

print("\nLazy loading approach:")
start_time = time.time()
model_lazy = get_model_lazy()
lazy_load_time = time.time() - start_time
print(f"  Model loading time: {lazy_load_time:.4f} seconds")

start_time = time.time()
result_lazy = run_inference(model_lazy, sample_text)
lazy_inference_time = time.time() - start_time
print(f"  Inference time: {lazy_inference_time:.4f} seconds")
print(f"  Total time: {lazy_load_time + lazy_inference_time:.4f} seconds")
print(f"  Result: {result_lazy}")

# Print number of parameters
num_params = sum(p.numel() for p in model_lazy.parameters())
print(f"\nNumber of parameters in the model: {num_params:,}")

Here's what to expect:

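* Both loading times include reading the weights. `get_model_lazy()` builds the architecture and then loads the saved `state_dict` from local disk, while `get_model_full()` uses `from_pretrained`, which downloads the checkpoint on first use and reads it from the local cache afterwards. The first run of the full loading approach can therefore be much slower than later runs.
* Inference time should be about the same for both approaches, since both models are fully loaded before `run_inference` is called. Whichever model runs first may include some one-time warm-up overhead.
* The predicted labels are not meaningful. The lazy model uses the randomly initialized weights saved earlier, and the full model's classification head is also newly initialized, because "distilbert-base-uncased" has not been fine-tuned for classification.
* The number of parameters gives you an idea of the model's size. DistilBERT has around 66 million parameters, plus a small classification head here.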
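To turn the parameter count into an approximate size, multiply it by the number of bytes per parameter. The sketch below assumes a round figure of 66 million parameters and common storage widths; `estimate_weights_mb` is a hypothetical helper, and the result ignores file-format overhead and any non-parameter buffers.

```python
# Bytes used to store one parameter at common precisions
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weights_mb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimate_weights_mb(66_000_000, dtype):.0f} MB")
# float32: 264 MB
# float16: 132 MB
# int8: 66 MB
```

In the notebook, you could pass `num_params` from the timing cell instead of the round figure.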