# Transformer Tokenizers with CUDA: A Practical Guide

## Introduction

In this notebook, we explore how to use tokenizers from the HuggingFace `transformers` library alongside CUDA-accelerated models. Tokenization itself runs on the CPU; CUDA accelerates the model forward pass that consumes the tokens. We'll examine several popular transformer models, their memory requirements, and practical tokenization examples.

**What you'll learn:**
- Setting up transformers with CUDA support
- Understanding tokenizers and how they work
- Comparing different transformer models (BERT, DistilBERT, RoBERTa, and more)
- Evaluating memory requirements for various models
- Determining which models fit on an RTX 5060 Ti (16GB VRAM; the setup cell reports 15.93 GB)

**Why tokenizers matter:**
Tokenizers are the first step in any NLP pipeline. They convert raw text into numerical tokens that transformer models can process. Different models use different tokenization strategies (WordPiece, Byte-Pair Encoding, etc.), which affect both performance and vocabulary size.

## 1. Setup and Dependencies

First, let's import the necessary libraries and check our CUDA availability.

In [1]:
# Install required packages (uncomment if needed)
# !pip install transformers torch accelerate huggingface_hub python-dotenv

import torch
from transformers import (
    AutoTokenizer, 
    AutoModel,
    BertTokenizer, BertModel,
    DistilBertTokenizer, DistilBertModel,
    RobertaTokenizer, RobertaModel
)
from huggingface_hub import login
import gc
import sys
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

PyTorch version: 2.9.1+cu130
CUDA available: True
CUDA version: 13.0
GPU Device: NVIDIA GeForce RTX 5060 Ti
Total GPU Memory: 15.93 GB


## 2. HuggingFace Authentication

Authenticate with HuggingFace to access models (optional for public models, required for gated models).

In [2]:
# Load HF token from environment variable
HF_TOKEN = os.getenv("HF_TOKEN")

if HF_TOKEN:
    login(token=HF_TOKEN)
    print("✓ Successfully authenticated with HuggingFace")
else:
    print("⚠ No HF_TOKEN found. Set it in .env file or as environment variable.")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


✓ Successfully authenticated with HuggingFace


## 3. Understanding Transformer Models

Before diving into tokenization, let's understand the models we'll be working with:

### BERT (Bidirectional Encoder Representations from Transformers)
- **Tokenizer**: WordPiece tokenization
- **Vocabulary Size**: ~30,000 tokens
- **Model Sizes**: 
  - BERT-base: 110M parameters (~440MB)
  - BERT-large: 340M parameters (~1.3GB)
- **Use Case**: General-purpose NLP tasks, question answering, text classification

### DistilBERT (Distilled BERT)
- **Tokenizer**: Same WordPiece as BERT
- **Vocabulary Size**: ~30,000 tokens
- **Model Size**: 66M parameters (~260MB)
- **Key Feature**: 40% smaller, 60% faster than BERT-base, retains 97% of performance
- **Use Case**: When you need BERT-like performance with lower resource requirements

### RoBERTa (Robustly Optimized BERT)
- **Tokenizer**: Byte-Pair Encoding (BPE)
- **Vocabulary Size**: ~50,000 tokens
- **Model Sizes**:
  - RoBERTa-base: 125M parameters (~500MB)
  - RoBERTa-large: 355M parameters (~1.4GB)
- **Key Feature**: Improved training procedure, better performance than BERT
- **Use Case**: When you want stronger language-understanding accuracy than BERT at a similar model size

### Additional Models to Consider

**ALBERT (A Lite BERT)**
- Model Size: 12M parameters (~50MB for base)
- Far fewer parameters than DistilBERT through cross-layer parameter sharing (this shrinks the model in memory and on disk, not necessarily its inference time)

**GPT-2**
- Model Sizes: 124M (small) to 1.5B (XL) parameters
- Decoder-only architecture, excellent for text generation

**T5 (Text-to-Text Transfer Transformer)**
- Model Sizes: 60M (small) to 11B (XXL) parameters
- Unified text-to-text framework

## 4. Memory Requirements and RTX 5060 Ti Compatibility

Let's analyze which models can run on an RTX 5060 Ti. The setup cell above reports 15.93 GB of total GPU memory, so we plan around 16GB of VRAM:

### Memory Calculation Formula
For inference: `Memory ≈ Model Size × 1.2 (overhead) + Batch Size × Sequence Length × Hidden Size × 4 bytes`
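
The rule of thumb above can be written as a small helper. This is a minimal sketch, not a measurement: the function name and its default arguments are our own, and the activation term counts only a single hidden-state tensor, so real peaks will be higher once intermediate activations are included.

```python
def estimate_inference_memory_mb(model_size_mb, batch_size, seq_length, hidden_size,
                                 overhead=1.2, bytes_per_value=4):
    """Apply the rule of thumb; returns an estimate in MB (1 MB = 1024**2 bytes)."""
    # Weights plus a flat overhead factor (1.2 in the formula above).
    weights_mb = model_size_mb * overhead
    # One hidden-state tensor of shape [batch, seq, hidden] in float32.
    activations_mb = batch_size * seq_length * hidden_size * bytes_per_value / 1024**2
    return weights_mb + activations_mb


# BERT-base figures from this notebook: ~440MB of weights, hidden size 768.
bert_base_estimate = estimate_inference_memory_mb(
    model_size_mb=440, batch_size=1, seq_length=512, hidden_size=768
)
print(f"BERT-base estimate: {bert_base_estimate:.1f} MB")
```

Because the second term grows linearly with batch size and sequence length, doubling either one doubles the activation part of the estimate while the weight part stays fixed.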

### Models Compatible with RTX 5060 Ti (16GB VRAM)

| Model | Parameters | Model Size | Peak Memory (batch=1, seq=512) | Status |
|-------|------------|------------|--------------------------------|--------|
| DistilBERT | 66M | ~260MB | ~0.5GB | ✅ Excellent |
| BERT-base | 110M | ~440MB | ~0.8GB | ✅ Excellent |
| RoBERTa-base | 125M | ~500MB | ~0.9GB | ✅ Excellent |
| ALBERT-base | 12M | ~50MB | ~0.3GB | ✅ Excellent |
| BERT-large | 340M | ~1.3GB | ~2.5GB | ✅ Good |
| RoBERTa-large | 355M | ~1.4GB | ~2.6GB | ✅ Good |
| GPT-2 | 124M | ~500MB | ~1.0GB | ✅ Excellent |
| GPT-2 Medium | 355M | ~1.4GB | ~2.7GB | ✅ Good |
| GPT-2 Large | 774M | ~3.0GB | ~5.5GB | ✅ Acceptable |
| T5-base | 220M | ~850MB | ~1.8GB | ✅ Excellent |
| T5-large | 770M | ~3.0GB | ~5.8GB | ✅ Acceptable |

**Note**: For training/fine-tuning, memory requirements are ~3-4x higher due to gradients and optimizer states.
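
One way to see where a multiplier in that range comes from is to count the per-parameter tensors kept during training. The sketch below is an illustration under stated assumptions: FP32 everywhere, an Adam-style optimizer that keeps two extra buffers per parameter, and activations ignored entirely. The helper name is hypothetical.

```python
def estimate_training_state_mb(num_params, bytes_per_value=4, optimizer_buffers=2):
    """Weights + gradients + optimizer buffers in MB; activations are not counted."""
    copies = 1 + 1 + optimizer_buffers  # weights, gradients, optimizer state
    return num_params * bytes_per_value * copies / 1024**2


# Approximate parameter counts taken from the table above.
for name, params in {"DistilBERT": 66_000_000,
                     "BERT-base": 110_000_000,
                     "BERT-large": 340_000_000}.items():
    state_gb = estimate_training_state_mb(params) / 1024
    print(f"{name}: ~{state_gb:.2f} GB of training state before activations")
```

With `optimizer_buffers=1` (for example, SGD with momentum) the same count gives three copies instead of four, which matches the lower end of the range in the note.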

## 5. Practical Tokenization Function

Let's create a versatile function that can tokenize text using different models and display detailed information.

In [12]:
def tokenize_with_model(text, model_name="bert-base-uncased", use_cuda=True, verbose=True):
    """
    Tokenize text using a specified transformer model.
    
    Args:
        text (str): Text to tokenize
        model_name (str): HuggingFace model identifier
        use_cuda (bool): Whether to use CUDA if available
        verbose (bool): Print detailed information
    
    Returns:
        dict: Contains tokenizer output and metadata
    """
    device = "cuda" if use_cuda and torch.cuda.is_available() else "cpu"
    
    try:
        # Load tokenizer and model
        if verbose:
            print(f"Loading {model_name}...")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name).to(device)
        
        # Set pad token if it's not defined (common for GPT models)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
            
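        # Track memory before the forward pass. The model is already on the GPU here,
        # so the "Model Memory" delta printed below reflects forward-pass overhead,
        # not the size of the weights.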
        if device == "cuda":
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()
            mem_before = torch.cuda.memory_allocated() / 1024**2
        
        # Tokenize
        encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        
        encoded = {k: v.to(device) for k, v in encoded.items()}
        
        # Get model output (optional, to see full pipeline)
        with torch.no_grad():
            outputs = model(**encoded)
        
        # Track memory after
        if device == "cuda":
            mem_after = torch.cuda.memory_allocated() / 1024**2
            mem_peak = torch.cuda.max_memory_allocated() / 1024**2
        
        # Decode tokens back to text
        tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
        
        if verbose:
            print(f"\n{'='*60}")
            print(f"Model: {model_name}")
            print(f"Device: {device.upper()}")
            print(f"{'='*60}")
            print(f"\nOriginal text:")
            print(f"  {text}")
            print(f"\nTokens ({len(tokens)} total):")
            print(f"  {tokens}")
            print(f"\nToken IDs:")
            print(f"  {encoded['input_ids'][0].tolist()}")
            print(f"\nVocabulary size: {tokenizer.vocab_size:,}")
            print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
            
            if device == "cuda":
                print(f"\nGPU Memory Usage:")
                print(f"  Before: {mem_before:.2f} MB")
                print(f"  After: {mem_after:.2f} MB")
                print(f"  Peak: {mem_peak:.2f} MB")
                print(f"  Model Memory: {(mem_after - mem_before):.2f} MB")
            
            print(f"\nOutput shape: {outputs.last_hidden_state.shape}")
            print(f"  [batch_size, sequence_length, hidden_size]")
        
        # Clean up
        del model, outputs
        if device == "cuda":
            torch.cuda.empty_cache()
        gc.collect()
        
        return {
            "tokens": tokens,
            "token_ids": encoded['input_ids'][0].tolist(),
            "attention_mask": encoded['attention_mask'][0].tolist(),
            "num_tokens": len(tokens),
            "vocab_size": tokenizer.vocab_size,
            "model_name": model_name
        }
    
    except Exception as e:
        print(f"Error loading {model_name}: {e}")
        return None

## 6. Example: Tokenizing with BERT

Let's test our function with BERT, one of the most popular transformer models.

In [4]:
sample_text = "Transformers have revolutionized natural language processing with their attention mechanisms."

result_bert = tokenize_with_model(
    text=sample_text,
    model_name="bert-base-uncased",
    use_cuda=True,
    verbose=True
)

Loading bert-base-uncased...

Model: bert-base-uncased
Device: CUDA

Original text:
  Transformers have revolutionized natural language processing with their attention mechanisms.

Tokens (14 total):
  ['[CLS]', 'transformers', 'have', 'revolution', '##ized', 'natural', 'language', 'processing', 'with', 'their', 'attention', 'mechanisms', '.', '[SEP]']

Token IDs:
  [101, 19081, 2031, 4329, 3550, 3019, 2653, 6364, 2007, 2037, 3086, 10595, 1012, 102]

Vocabulary size: 30,522
Model parameters: 109,482,240

GPU Memory Usage:
  Before: 418.73 MB
  After: 427.90 MB
  Peak: 428.31 MB
  Model Memory: 9.17 MB

Output shape: torch.Size([1, 14, 768])
  [batch_size, sequence_length, hidden_size]


### Understanding BERT Tokenization

BERT uses **WordPiece tokenization**, which:
- Breaks unknown words into subword units
- Uses special tokens: `[CLS]` at start, `[SEP]` at end
- Handles out-of-vocabulary words by splitting them (e.g., "revolutionized" → "revolution", "##ized")
- Preserves common words as single tokens
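
The splitting behaviour described above can be illustrated with a greedy longest-match-first loop over a tiny vocabulary. This is a toy sketch of the idea, not the library's implementation: `toy_vocab` is hypothetical and holds only a handful of pieces, whereas the real BERT vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary for illustration only.
toy_vocab = {"revolution", "##ized", "natural", "cu", "##da", "nl", "##p"}

print(wordpiece_tokenize("revolutionized", toy_vocab))  # ['revolution', '##ized']
print(wordpiece_tokenize("cuda", toy_vocab))            # ['cu', '##da']
print(wordpiece_tokenize("xyz", toy_vocab))             # ['[UNK]']
```

Common words survive as single tokens simply because the full word is in the vocabulary and is therefore the longest match.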

## 7. Example: Tokenizing with DistilBERT

DistilBERT is a smaller, faster version of BERT that's perfect for resource-constrained environments.

In [5]:
result_distilbert = tokenize_with_model(
    text=sample_text,
    model_name="distilbert-base-uncased",
    use_cuda=True,
    verbose=True
)

Loading distilbert-base-uncased...

Model: distilbert-base-uncased
Device: CUDA

Original text:
  Transformers have revolutionized natural language processing with their attention mechanisms.

Tokens (14 total):
  ['[CLS]', 'transformers', 'have', 'revolution', '##ized', 'natural', 'language', 'processing', 'with', 'their', 'attention', 'mechanisms', '.', '[SEP]']

Token IDs:
  [101, 19081, 2031, 4329, 3550, 3019, 2653, 6364, 2007, 2037, 3086, 10595, 1012, 102]

Vocabulary size: 30,522
Model parameters: 66,362,880

GPU Memory Usage:
  Before: 264.24 MB
  After: 264.28 MB
  Peak: 264.69 MB
  Model Memory: 0.04 MB

Output shape: torch.Size([1, 14, 768])
  [batch_size, sequence_length, hidden_size]


### DistilBERT Key Features
- Uses the same tokenizer as BERT (WordPiece)
- 40% fewer parameters than BERT-base
- 60% faster inference
- Only ~260MB model size vs ~440MB for BERT-base

## 8. Example: Tokenizing with RoBERTa

RoBERTa uses a different tokenization approach than BERT: Byte-Pair Encoding (BPE).

In [16]:
result_roberta = tokenize_with_model(
    text=sample_text,
    model_name="roberta-base",
    use_cuda=True,
    verbose=True
)

Loading roberta-base...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model: roberta-base
Device: CUDA

Original text:
  Transformers have revolutionized natural language processing with their attention mechanisms.

Tokens (15 total):
  ['<s>', 'Transform', 'ers', 'Ġhave', 'Ġrevolution', 'ized', 'Ġnatural', 'Ġlanguage', 'Ġprocessing', 'Ġwith', 'Ġtheir', 'Ġattention', 'Ġmechanisms', '.', '</s>']

Token IDs:
  [0, 44820, 268, 33, 7977, 1538, 1632, 2777, 5774, 19, 49, 1503, 14519, 4, 2]

Vocabulary size: 50,265
Model parameters: 124,645,632

GPU Memory Usage:
  Before: 486.73 MB
  After: 486.78 MB
  Peak: 487.21 MB
  Model Memory: 0.05 MB

Output shape: torch.Size([1, 15, 768])
  [batch_size, sequence_length, hidden_size]


### RoBERTa Tokenization Differences
- Uses **Byte-Pair Encoding (BPE)** instead of WordPiece
- Larger vocabulary: ~50,000 tokens vs ~30,000 for BERT
- Different special tokens: `<s>` (start), `</s>` (end) instead of `[CLS]`, `[SEP]`
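- Byte-level BPE keeps the leading space as part of the token (displayed as `Ġ`) and preserves capitalization, unlike the lowercasing `bert-base-uncased` tokenizer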
- Generally better performance on downstream tasks
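
The `Ġ` prefix on RoBERTa tokens comes from the byte-to-character table that byte-level BPE uses to make every byte printable. The sketch below rebuilds that table as we understand it from the GPT-2 style encoder that RoBERTa shares; treat it as illustrative. The helper `to_byte_level` is our own name, and the sketch covers only the character mapping, not the BPE merges.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Remaining bytes (controls, space, ...) are shifted to code points from 256 up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_map = bytes_to_unicode()


def to_byte_level(text):
    """Rewrite text byte by byte using the printable alphabet."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))


print(to_byte_level(" have"))  # the space (byte 32) becomes U+0120, i.e. 'Ġ'
```

This is why a word in the middle of a sentence shows up as `Ġhave` while a sentence-initial word such as `Transform` has no prefix.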

## 9. Comparing Tokenization Across Models

Let's compare how different models tokenize the same text.

In [15]:
def compare_tokenizers(text, models):
    """
    Compare tokenization across multiple models.
    """
    print(f"Comparing tokenization for: '{text}'\n")
    print("="*80)
    
    results = {}
    for model_name in models:
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_name)
            tokens = tokenizer.tokenize(text)
            token_ids = tokenizer.encode(text)
            
            print(f"\n{model_name}:")
            print(f"  Tokens ({len(tokens)}): {tokens}")
            print(f"  Token IDs: {token_ids}")
            print(f"  Vocab size: {tokenizer.vocab_size:,}")
            
            results[model_name] = {
                "tokens": tokens,
                "num_tokens": len(tokens),
                "vocab_size": tokenizer.vocab_size
            }
        except Exception as e:
            print(f"\n{model_name}: Error - {e}")
    
    print("\n" + "="*80)
    return results

# Compare the three main models
comparison_text = "CUDA-accelerated transformers enable efficient NLP."
models_to_compare = [
    "bert-base-uncased",
    "distilbert-base-uncased",
    "roberta-base"
]

comparison_results = compare_tokenizers(comparison_text, models_to_compare)

Comparing tokenization for: 'CUDA-accelerated transformers enable efficient NLP.'


bert-base-uncased:
  Tokens (10): ['cu', '##da', '-', 'accelerated', 'transformers', 'enable', 'efficient', 'nl', '##p', '.']
  Token IDs: [101, 12731, 2850, 1011, 14613, 19081, 9585, 8114, 17953, 2361, 1012, 102]
  Vocab size: 30,522

distilbert-base-uncased:
  Tokens (10): ['cu', '##da', '-', 'accelerated', 'transformers', 'enable', 'efficient', 'nl', '##p', '.']
  Token IDs: [101, 12731, 2850, 1011, 14613, 19081, 9585, 8114, 17953, 2361, 1012, 102]
  Vocab size: 30,522

roberta-base:
  Tokens (13): ['CU', 'DA', '-', 'ac', 'celer', 'ated', 'Ġtransform', 'ers', 'Ġenable', 'Ġefficient', 'ĠN', 'LP', '.']
  Token IDs: [0, 38260, 3134, 12, 1043, 24608, 1070, 7891, 268, 3155, 5693, 234, 21992, 4, 2]
  Vocab size: 50,265
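
To put a number on the difference, we can compute tokens per whitespace-separated word from the counts printed above. This is a small sketch using only figures from the output. The counts exclude special tokens because `tokenizer.tokenize` does not add them, which is also why each ID list is two entries longer than its token list.

```python
comparison_text = "CUDA-accelerated transformers enable efficient NLP."

# Token counts copied from the comparison output above (special tokens excluded).
token_counts = {
    "bert-base-uncased": 10,
    "distilbert-base-uncased": 10,
    "roberta-base": 13,
}

num_words = len(comparison_text.split())
tokens_per_word = {model: count / num_words for model, count in token_counts.items()}

for model, ratio in tokens_per_word.items():
    print(f"{model}: {ratio:.1f} tokens per word")
```

A larger vocabulary does not guarantee fewer tokens: on this sentence the cased RoBERTa tokenizer splits "accelerated" and "transformers" further than the uncased BERT tokenizer does.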



## 10. Batch Processing and GPU Utilization

One of the key advantages of using CUDA is the ability to process multiple texts in parallel. Let's demonstrate batch processing.

In [14]:
import time

def batch_tokenize(texts, model_name="distilbert-base-uncased", batch_size=8, use_cuda=True):
    """
    Process multiple texts in batches using GPU acceleration.
    """
    device = "cuda" if use_cuda and torch.cuda.is_available() else "cpu"
    
    print(f"Loading {model_name} on {device.upper()}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device)
    
    # Set pad token if it's not defined
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print(f"Processing {len(texts)} texts in batches of {batch_size}...")
    
    start_time = time.time()
    
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize batch
        encoded = tokenizer(
            batch_texts, 
            padding=True, 
            truncation=True, 
            max_length=512,
            return_tensors="pt"
        )
        encoded = {k: v.to(device) for k, v in encoded.items()}
        
        # Get embeddings
        with torch.no_grad():
            outputs = model(**encoded)
        
        # Use [CLS] token embedding (first token)
        batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu()
        all_embeddings.append(batch_embeddings)
    
    end_time = time.time()
    
    # Combine all embeddings
    all_embeddings = torch.cat(all_embeddings, dim=0)
    
    # Calculate statistics
    total_time = end_time - start_time
    texts_per_second = len(texts) / total_time
    
    print(f"\n{'='*60}")
    print(f"Processing complete!")
    print(f"  Total texts: {len(texts)}")
    print(f"  Total time: {total_time:.3f} seconds")
    print(f"  Throughput: {texts_per_second:.2f} texts/second")
    print(f"  Average time per text: {total_time/len(texts)*1000:.2f} ms")
    print(f"  Output shape: {all_embeddings.shape}")
    print(f"{'='*60}")
    
    # Clean up
    del model
    if device == "cuda":
        torch.cuda.empty_cache()
    gc.collect()
    
    return all_embeddings

# Create sample texts for batch processing
sample_texts = [
    "Natural language processing has advanced significantly.",
    "CUDA acceleration enables faster model inference.",
    "Transformers are the state-of-the-art in NLP.",
    "BERT introduced bidirectional context understanding.",
    "DistilBERT offers a good balance of speed and accuracy.",
    "RoBERTa improved upon BERT's training methodology.",
    "Tokenization is crucial for transformer models.",
    "GPU memory is important for large models.",
    "The RTX 5060 Ti has 16GB of VRAM.",
    "Batch processing improves throughput significantly.",
    "PyTorch provides excellent CUDA support.",
    "HuggingFace makes transformers accessible to everyone.",
]

embeddings = batch_tokenize(sample_texts, model_name="distilbert-base-uncased", batch_size=4)

Loading distilbert-base-uncased on CUDA...
Processing 12 texts in batches of 4...

Processing complete!
  Total texts: 12
  Total time: 0.052 seconds
  Throughput: 229.60 texts/second
  Average time per text: 4.36 ms
  Output shape: torch.Size([12, 768])
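
The batch function above relies on `padding=True`, which pads every sequence in a batch to the length of the longest one and marks the padded positions in the attention mask. A minimal pure-Python sketch of that step follows; `pad_batch` is a hypothetical helper, the token IDs are made up, and right-padding with ID 0 is assumed (the convention the BERT-family tokenizers here use).

```python
def pad_batch(batch_token_ids, pad_id=0):
    """Right-pad each sequence to the batch maximum and build attention masks."""
    max_len = max(len(ids) for ids in batch_token_ids)
    input_ids, attention_mask = [], []
    for ids in batch_token_ids:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


# Hypothetical token IDs; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Since every sequence is padded to the batch maximum, grouping texts of similar length into the same batch reduces the work spent on padding.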


## 11. Memory Profiling Function

Let's create a function to profile GPU memory usage for different models.

In [None]:
def profile_model_memory(model_name, seq_length=512):
    """
    Profile GPU memory usage for a model.
    """
    if not torch.cuda.is_available():
        print("CUDA not available. Cannot profile GPU memory.")
        return None
    
    print(f"Profiling {model_name}...")
    
    try:
        # Clear GPU memory
        torch.cuda.empty_cache()
        gc.collect()
        torch.cuda.reset_peak_memory_stats()
        
        mem_start = torch.cuda.memory_allocated() / 1024**2
        
        # Load model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name).to("cuda")
        
        # Set pad token if it's not defined
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        mem_after_load = torch.cuda.memory_allocated() / 1024**2
        
        # Create dummy input
        dummy_text = "a " * (seq_length // 2)  # Yields roughly seq_length // 2 tokens, so the input is shorter than seq_length
        encoded = tokenizer(dummy_text, return_tensors="pt", truncation=True, max_length=seq_length)
        encoded = {k: v.to("cuda") for k, v in encoded.items()}
        
        # Forward pass
        with torch.no_grad():
            outputs = model(**encoded)
        
        mem_peak = torch.cuda.max_memory_allocated() / 1024**2
        
        # Get model parameters
        num_params = sum(p.numel() for p in model.parameters())
        model_size_mb = num_params * 4 / 1024**2  # 4 bytes per float32 parameter
        
        result = {
            "model_name": model_name,
            "parameters": num_params,
            "model_size_mb": model_size_mb,
            "memory_allocated_mb": mem_after_load - mem_start,
            "peak_memory_mb": mem_peak,
            "sequence_length": seq_length
        }
        
        print(f"  Parameters: {num_params:,}")
        print(f"  Model size: {model_size_mb:.2f} MB")
        print(f"  Memory allocated: {result['memory_allocated_mb']:.2f} MB")
        print(f"  Peak memory: {mem_peak:.2f} MB")
        
        # Clean up
        del model, tokenizer, outputs
        torch.cuda.empty_cache()
        gc.collect()
        
        return result
        
    except Exception as e:
        print(f"  Error: {e}")
        return None

# Profile all models
print("="*60)
print("GPU Memory Profiling")
print("="*60)

models_to_profile = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base",
]

profile_results = []
for model_name in models_to_profile:
    result = profile_model_memory(model_name)
    if result:
        profile_results.append(result)
    print()

GPU Memory Profiling
Profiling distilbert-base-uncased...
  Parameters: 66,362,880
  Model size: 253.15 MB
  Memory allocated: 255.11 MB
  Peak memory: 272.56 MB

Profiling bert-base-uncased...
  Parameters: 109,482,240
  Model size: 417.64 MB
  Memory allocated: 419.60 MB
  Peak memory: 437.05 MB

Profiling roberta-base...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  Parameters: 124,645,632
  Model size: 475.49 MB
  Memory allocated: 477.60 MB
  Peak memory: 495.08 MB
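
The "Model size" lines follow directly from the parameter counts, and the gap between peak and allocated memory gives a rough view of forward-pass overhead. The sketch below redoes that arithmetic on the figures printed above. It assumes almost nothing else was allocated on the GPU when profiling started, so the overhead numbers are approximate.

```python
# Figures copied from the profiling output above.
profiles = {
    "distilbert-base-uncased": {"params": 66_362_880, "allocated_mb": 255.11, "peak_mb": 272.56},
    "bert-base-uncased": {"params": 109_482_240, "allocated_mb": 419.60, "peak_mb": 437.05},
    "roberta-base": {"params": 124_645_632, "allocated_mb": 477.60, "peak_mb": 495.08},
}


def fp32_size_mb(num_params):
    """Size of the weights at 4 bytes per float32 parameter, in MB."""
    return num_params * 4 / 1024**2


for name, p in profiles.items():
    size_mb = fp32_size_mb(p["params"])
    forward_overhead_mb = p["peak_mb"] - p["allocated_mb"]
    print(f"{name}: weights {size_mb:.2f} MB, forward-pass overhead ~{forward_overhead_mb:.2f} MB")
```

For all three base models the weights dominate: the forward pass on this input adds under 20 MB on top of several hundred MB of parameters.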



## 12. Summary and Recommendations

### Key Takeaways

1. **All three main models (BERT, DistilBERT, RoBERTa) work excellently on RTX 5060 Ti (16GB)**
   - DistilBERT: Best for speed and efficiency (~260MB, 66M params)
   - BERT: Good balance of performance and size (~440MB, 110M params)
   - RoBERTa: Best accuracy, slightly larger (~500MB, 125M params)

2. **Tokenization Strategies**
   - BERT/DistilBERT: WordPiece tokenization, ~30K vocab
   - RoBERTa: Byte-Pair Encoding (BPE), ~50K vocab
   - Each has trade-offs in handling rare words and languages

3. **CUDA Benefits**
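   - Typically a large speedup over CPU for the model forward pass, growing with batch size (not benchmarked in this notebook; tokenization itself still runs on the CPU)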
   - Enables real-time processing for many applications
   - Critical for batch processing and production deployments

4. **Memory Considerations**
   - For inference: Model size × 1.2-1.5 is a good estimate
   - For training: Model size × 3-4 due to gradients and optimizer states
   - Batch size has linear impact on memory usage

### Recommendations for RTX 5060 Ti (16GB)

**For Inference:**
- ✅ All base models: Excellent
- ✅ Large models (BERT-large, RoBERTa-large): Good with moderate batch sizes
- ✅ GPT-2 up to Large: Acceptable
- ⚠️ Models > 2GB: Possible but limited batch size

**For Fine-tuning:**
- ✅ DistilBERT, ALBERT: Excellent, can use larger batch sizes
- ✅ BERT-base, RoBERTa-base: Good with batch size 8-16
- ⚠️ Large models: Use gradient accumulation or mixed precision training

### Best Practices

1. **Use half precision (FP16 or BF16)** to roughly halve the memory taken by model weights during inference
2. **Start with DistilBERT** for prototyping
3. **Profile memory** before deploying to production
4. **Use batch processing** to maximize GPU utilization
5. **Delete unused models and clear the GPU cache** between runs to reduce the risk of OOM errors

## 13. Bonus: Testing Additional Models

Feel free to test other models by modifying the model name below:

In [13]:
# Try other models:
# - "albert-base-v2" (smaller, parameter-efficient)
# - "bert-large-uncased" (larger BERT)
# - "roberta-large" (larger RoBERTa)
# - "xlm-roberta-base" (multilingual)
# - "google/electra-base-discriminator" (efficient pre-training)
# - "gpt2" (GPT-2 small, 124M params)
# - "gpt2-medium" (GPT-2 medium, 355M params)
# - "gpt2-large" (GPT-2 large, 774M params)
# - "gpt2-xl" (GPT-2 XL, 1.5B params; GPT-2 tokenizers define no pad token, which is why tokenize_with_model falls back to eos_token)

test_model = "gpt2-xl"  # Change this to test different models
test_text = "This is a test of the tokenization system with CUDA acceleration."

result = tokenize_with_model(
    text=test_text,
    model_name=test_model,
    use_cuda=True,
    verbose=True
)

Loading gpt2-xl...

Model: gpt2-xl
Device: CUDA

Original text:
  This is a test of the tokenization system with CUDA acceleration.

Tokens (14 total):
  ['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġof', 'Ġthe', 'Ġtoken', 'ization', 'Ġsystem', 'Ġwith', 'ĠCU', 'DA', 'Ġacceleration', '.']

Token IDs:
  [1212, 318, 257, 1332, 286, 262, 11241, 1634, 1080, 351, 29369, 5631, 20309, 13]

Vocabulary size: 50,257
Model parameters: 1,557,611,200

GPU Memory Usage:
  Before: 6134.05 MB
  After: 6142.34 MB
  Peak: 6144.13 MB
  Model Memory: 8.29 MB

Output shape: torch.Size([1, 14, 1600])
  [batch_size, sequence_length, hidden_size]


## Conclusion

In this notebook, we've explored:
- How to set up and use HuggingFace transformers with CUDA
- The differences between BERT, DistilBERT, and RoBERTa tokenization
- Memory requirements and GPU compatibility
- Practical examples and batch processing
- Performance optimization techniques

The RTX 5060 Ti with 16GB VRAM is an excellent GPU for working with modern transformer models, capable of handling all base models and many large models for both inference and fine-tuning.

**Next Steps:**
- Experiment with different models for your specific use case
- Try fine-tuning on your own dataset
- Explore multi-GPU training for larger models
- Investigate quantization and model compression techniques

Happy tokenizing! 🚀