# Hugging Face Transformers

## Learning Objectives
By the end of this notebook, you'll understand:
1. **Pre-trained models and pipelines** - how to use transformer models without training
2. **Task-specific applications** - sentiment analysis, NER, zero-shot classification
3. **Tokenization** - converting text to numerical representations for models
4. **Model inference** - using raw models with PyTorch/TensorFlow
5. **Model persistence** - saving and loading pre-trained models

## Why This Matters
Hugging Face Transformers democratizes access to state-of-the-art NLP models. Instead of:
- Training models from scratch (months of GPU time, massive datasets)
- Implementing transformer architecture yourself
- Managing complex dependencies

You can now:
- Load production-ready models in 2 lines of code
- Fine-tune on your data (hours instead of months)
- Focus on problem-solving instead of implementation

## Real-World Context
Transformers power modern NLP:
- BERT, GPT, BART: Foundation models for language understanding and generation
- Specialized models: Domain-specific, multilingual, efficient versions
- Pipeline abstraction: Hide complexity, expose functionality
- Production deployment: Save/load models, batch processing, optimization

This notebook covers the practical workflow: load → tokenize → predict → save.

## Phase 1: High-Level Pipelines

The `pipeline()` function abstracts away complexity:
- Automatically downloads the model
- Handles tokenization internally
- Returns human-readable results

This is the simplest way to use transformers for quick prototyping.

In [35]:
# Import the high-level pipeline API for easy model usage
from transformers import pipeline

In [36]:
# Create a sentiment analysis pipeline
# Automatically downloads the distilbert model fine-tuned on SST-2 dataset
# distilbert is a smaller, faster version of BERT (40% smaller, 60% faster)
sentiment_classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [37]:
# Sample text for sentiment analysis
# This demonstrates sentiment detection on an optimistic statement
text = "The technological advances in artificial intelligence are remarkable and promising for solving complex problems worldwide"

In [38]:
# Run sentiment analysis on the text
# Returns label (POSITIVE/NEGATIVE) and confidence score
sentiment_classifier(text)

[{'label': 'POSITIVE', 'score': 0.9998711347579956}]

In [39]:
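# Create Named Entity Recognition pipeline
# Uses BERT fine-tuned on NER task
# aggregation_strategy controls how subword tokens are merged into whole entities
# "max": when a word is split into subwords, label the word using its highest-scoring token
# Other options: "none" (raw per-token output), "simple" (group adjacent tokens with the same tag), "first", "average"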
ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="max"
)


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [40]:
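# Run NER on the same text
# This model tags four entity types: PER, ORG, LOC, MISC
# Returns a list of dicts: entity_group (type), word, score, start/end positions
# The sentence contains no named entities, so the result is an empty list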
ner(text)

[]
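
The empty list is expected for a sentence with no people, organizations, or places. To make the aggregation step more concrete, here is a minimal pure-Python sketch of how per-token B-/I- tags could be merged into entity groups. The `group_entities` helper and the hand-written tags are illustrative assumptions; the real pipeline works on subword tokens, scores, and character offsets, so treat this as the idea rather than the library's implementation.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (entity_type, text) pairs."""
    groups = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any entity: close the span that was open, if any
            if current_words:
                groups.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            # A new entity starts: flush the previous one first
            if current_words:
                groups.append((current_type, " ".join(current_words)))
            current_type, current_words = entity_type, [token]
        else:
            # I- tag continuing the current entity
            current_words.append(token)
    if current_words:
        groups.append((current_type, " ".join(current_words)))
    return groups


# Hypothetical tokens and tags, written by hand for illustration
tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
entities = group_entities(tokens, tags)
print(entities)
```

Running the same helper on a sentence tagged entirely with `O` returns an empty list, which mirrors the output above.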

In [41]:
# Zero-shot classification: classify without training data for specific categories
# Uses BART model trained on Natural Language Inference (NLI) task
# Can classify any text into arbitrary categories without fine-tuning
zeroshot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cpu


In [42]:
# Text to classify with zero-shot approach
sequence_to_classify = "Exploring new cultures and visiting landmark sites across continents"
# Define possible categories - can be anything!
candidate_label = ['travel', 'food', 'sports']

In [43]:
# Run zero-shot classification
# Returns ranked labels with confidence scores
zeroshot_classifier(sequence_to_classify, candidate_label)

{'sequence': 'Exploring new cultures and visiting landmark sites across continents',
 'labels': ['travel', 'sports', 'food'],
 'scores': [0.9943504333496094, 0.0029930926393717527, 0.0026564467698335648]}
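
A rough sketch of how an NLI model can rank arbitrary labels: each candidate label is slotted into a hypothesis sentence, the model scores how strongly the text entails that hypothesis, and the entailment scores are normalized across labels. The template string, the `rank_labels` helper, and the logits below are illustrative assumptions, not values produced by `facebook/bart-large-mnli`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits, template="This example is {}."):
    """Pair each label with its hypothesis and normalized score, best first."""
    hypotheses = [template.format(label) for label in labels]
    scores = softmax(entailment_logits)
    return sorted(zip(labels, hypotheses, scores), key=lambda item: item[2], reverse=True)


labels = ["travel", "food", "sports"]
# Hypothetical entailment logits, one per label (assumed values, not model output)
fake_logits = [4.2, -1.7, -1.6]
ranking = rank_labels(labels, fake_logits)
for label, hypothesis, score in ranking:
    print(f"{label:<8} {score:.4f}  hypothesis: {hypothesis}")
```

Because the scores are normalized across the candidate labels, they sum to 1, which matches the shape of the pipeline output above.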

## Phase 2: Tokenization Deep Dive

Pipelines hide tokenization. Understanding tokenizers is crucial:
- Different models use different tokenization strategies
- Tokenization affects model input/output
- Subword tokenization (BPE, WordPiece, SentencePiece) handles rare words by splitting them into known pieces
- Special tokens ([CLS], [SEP], [PAD]) serve specific purposes
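
To see how subword splitting handles rare words, here is a toy greedy longest-match tokenizer in the style of WordPiece. The vocabulary and the `wordpiece_tokenize` helper are made up for illustration; a real tokenizer also normalizes text and uses a vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Tiny hypothetical vocabulary
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able", "the"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

The first two words are absent from the vocabulary as whole words but can still be represented by known pieces; only `xyz`, which has no matching piece at all, falls back to `[UNK]`.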

### Pre-trained Tokenizers

In [44]:
# Import AutoTokenizer for automatic model-appropriate tokenizer loading
from transformers import AutoTokenizer

In [45]:
# Specify which model's tokenizer to load
# bert-base-uncased: Lowercase English BERT, widely used baseline
model = "bert-base-uncased"

In [46]:
# Load the pre-trained tokenizer for BERT
# Automatically downloads vocabulary and tokenization rules
tokenizer = AutoTokenizer.from_pretrained(model)

In [47]:
# Tokenize the text - converts text to token IDs plus the masks the model expects
# Returns a dict-like BatchEncoding with: input_ids (token indices, including [CLS]=101 and [SEP]=102),
# token_type_ids (segment IDs for sentence pairs), attention_mask (1 = real token, 0 = padding)
input_ids = tokenizer(text)
print(input_ids)

{'input_ids': [101, 1996, 10660, 9849, 1999, 7976, 4454, 2024, 9487, 1998, 10015, 2005, 13729, 3375, 3471, 4969, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [48]:
# Get token strings (not IDs)
# WordPiece splits out-of-vocabulary words into subwords (## prefix marks a continuation piece)
# Every word in this sentence is in BERT's vocabulary, so none are split here
tokens = tokenizer.tokenize(text)
print(tokens)

['the', 'technological', 'advances', 'in', 'artificial', 'intelligence', 'are', 'remarkable', 'and', 'promising', 'for', 'solving', 'complex', 'problems', 'worldwide']


In [49]:
# Map each token string to its vocabulary ID (no [CLS]/[SEP] added, unlike tokenizer(text))
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1996, 10660, 9849, 1999, 7976, 4454, 2024, 9487, 1998, 10015, 2005, 13729, 3375, 3471, 4969]


In [50]:
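# Convert IDs back to text; the result is lowercase because the uncased tokenizer lowercased the input
decoded_ids = tokenizer.decode(token_ids)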
print(decoded_ids)

the technological advances in artificial intelligence are remarkable and promising for solving complex problems worldwide


In [52]:
# Load a second tokenizer for comparison: XLNet uses a SentencePiece vocabulary and keeps case
model2 = "xlnet-base-cased"

In [53]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [54]:
input_ids2 = tokenizer2(text)
print(input_ids2)

{'input_ids': [32, 8647, 8809, 25, 8298, 2503, 41, 7459, 21, 7559, 28, 12901, 1881, 708, 2805, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [65]:
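# Decode the full encoding, special tokens included
# Unlike BERT, XLNet appends <sep> and <cls> at the END of the sequence
print(tokenizer2.decode(input_ids2["input_ids"]))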

The technological advances in artificial intelligence are remarkable and promising for solving complex problems worldwide<sep><cls>


In [56]:
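# SentencePiece marks the start of each word with the ▁ character instead of using a ## continuation prefix
tokens2 = tokenizer2.tokenize(text)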
print(tokens2)

['▁The', '▁technological', '▁advances', '▁in', '▁artificial', '▁intelligence', '▁are', '▁remarkable', '▁and', '▁promising', '▁for', '▁solving', '▁complex', '▁problems', '▁worldwide']


In [57]:
ids2 = tokenizer2.convert_tokens_to_ids(tokens2)
print(ids2)

[32, 8647, 8809, 25, 8298, 2503, 41, 7459, 21, 7559, 28, 12901, 1881, 708, 2805]


## Special Tokens

Special tokens are fundamental to transformer functionality. Each serves a specific purpose:

| Token | Purpose | Example |
|-------|---------|----------|
| [CLS] | Classification token - start of sequence | Added before sentence for sequence-level tasks |
| [SEP] | Separation token - marks sentence boundaries | Separates two sentences in sentence-pair tasks |
| [PAD] | Padding token - fills sequences to fixed length | Makes all sequences same length for batching |
| [UNK] | Unknown token - for out-of-vocabulary words | When word not in model vocabulary |
| [MASK] | Mask token - for masked language modeling | Used in pre-training and fill-mask tasks |
| [BOS]/[EOS] | Begin/End of sequence - for generation | Marks start/end in sequence-to-sequence models |

### Why Special Tokens Matter
- **[CLS]** output is used for classification tasks (sentiment, entailment, etc.)
- **[SEP]** helps model understand sentence boundaries and relationships
- **[PAD]** enables efficient batch processing with variable-length sequences
- **[MASK]** is central to BERT's masked-language-modeling pre-training objective and to fill-mask inference

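Different models use different special tokens and place them differently (XLNet above appends `<sep>` and `<cls>` at the end of the sequence) - always check the model documentation!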
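
As a small illustration of what `[PAD]` and the attention mask do during batching, the sketch below right-pads two already-tokenized sequences to a common length. The `pad_batch` helper is hypothetical, and the IDs assume the `bert-base-uncased` vocabulary (101 = `[CLS]`, 102 = `[SEP]`, 0 = `[PAD]`); in practice `tokenizer(texts, padding=True)` does this for you.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical encoded sentences of different lengths
batch = pad_batch([
    [101, 1996, 102],
    [101, 1996, 10660, 9849, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row now has the same length, so the batch can be stacked into a single tensor, while the mask keeps the padded positions from influencing attention.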

## Phase 3: Low-Level API - PyTorch Integration

Pipelines are convenient, but sometimes you need fine-grained control:
- Custom preprocessing or postprocessing
- Batch processing optimization
- Access to intermediate representations
- Fine-tuning on custom data

This phase shows how to use AutoTokenizer + AutoModelForSequenceClassification directly with PyTorch.

In [58]:
# Import the tokenizer and the sequence-classification model class for direct model access
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch  # PyTorch for tensor operations and GPU support

In [59]:
print(text)

The technological advances in artificial intelligence are remarkable and promising for solving complex problems worldwide


In [60]:
print(input_ids2)

{'input_ids': [32, 8647, 8809, 25, 8298, 2503, 41, 7459, 21, 7559, 28, 12901, 1881, 708, 2805, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [61]:
# Load tokenizer for the sentiment classification model
# This tokenizer matches the model's training
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [62]:
# Tokenize and return PyTorch tensors (not lists)
# return_tensors='pt' produces torch.Tensor values with a leading batch dimension
# No padding happens for a single text; pass padding=True when batching texts of different lengths
input_ids_pt = tokenizer(text, return_tensors='pt')
print(input_ids_pt)

{'input_ids': tensor([[  101,  1996, 10660,  9849,  1999,  7976,  4454,  2024,  9487,  1998,
         10015,  2005, 13729,  3375,  3471,  4969,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [63]:
# Load the fine-tuned model for sequence classification
# AutoModelForSequenceClassification automatically selects appropriate model class
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [64]:
# Run inference: get predictions without computing gradients (faster, lower memory)
with torch.no_grad():
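    # model() returns a ModelOutput; .logits holds the raw (pre-softmax) class scores
    # (hidden states and attentions are included only if explicitly requested)
    # ** unpacks the dict so input_ids and attention_mask are passed as keyword arguments
    logits = model(**input_ids_pt).logits

# Get the class with the highest score along the class dimension
predicted_class_id = logits.argmax(dim=-1).item()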
# Convert ID to label using model's label mapping
model.config.id2label[predicted_class_id]

'POSITIVE'
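
The raw model returns logits, not the confidence score the pipeline printed in Phase 1. Applying a softmax over the logits bridges the two. The sketch below uses plain Python and made-up logits so it runs without downloading a model; with real tensors, `torch.softmax(logits, dim=-1)` performs the same computation. The `id2label` mapping is assumed to match the SST-2 model's config.

```python
import math


def softmax(logits):
    """Numerically stable softmax: turns raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping for a binary sentiment model
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for one input (not actual model output)
example_logits = [-4.3, 4.7]

probs = softmax(example_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_id], round(probs[predicted_id], 4))
```

Softmax preserves the ordering of the logits, so `argmax` on the logits and on the probabilities selects the same class; the softmax is only needed when you want a score to report.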

## Model Persistence and Deployment

### Saving Models and Tokenizers
After training or fine-tuning, save your model for reproducibility and deployment:

```python
# Save tokenizer
model_directory = "my_saved_models"
tokenizer.save_pretrained(model_directory)
# Creates: vocab.txt, tokenizer_config.json, special_tokens_map.json, tokenizer.json

# Save model
model.save_pretrained(model_directory)
# Creates: pytorch_model.bin (or model.safetensors), config.json
```

### Loading Saved Models
Restore models from disk:

```python
# Load tokenizer from disk
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)

# Load model from disk
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)
```

`from_pretrained` accepts either a Hub model ID or a path to a local directory, so loading from disk works the same way as loading from the Hugging Face Hub.

### Production Deployment Best Practices
- Save model config alongside weights for reproducibility
- Version your saved models (model_v1, model_v2, etc.)
- Prefer the safetensors format (`model.safetensors`): it loads faster and, unlike pickle-based `.bin` files, cannot execute arbitrary code on load
- Implement model versioning in your application
- Consider quantization or distillation for deployment efficiency
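
One possible way to version saved models on disk is to pick the next free `model_vN` directory before saving. The `next_version_dir` helper and the directory naming scheme below are assumptions for illustration, not part of the Transformers library.

```python
import re
import tempfile
from pathlib import Path


def next_version_dir(base_dir, prefix="model_v"):
    """Return base_dir/<prefix>N, where N is one above the highest existing version."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    versions = []
    for path in base.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            versions.append(int(match.group(1)))
    return base / f"{prefix}{max(versions, default=0) + 1}"


# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    first = next_version_dir(tmp)
    first.mkdir()
    second = next_version_dir(tmp)
    names = (first.name, second.name)

print(names)
```

In the notebook, the returned path could be passed to both `model.save_pretrained(...)` and `tokenizer.save_pretrained(...)` so that each version keeps its weights, config, and tokenizer files together.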

## Key Takeaways: Hugging Face Transformers

### 1. Three Abstraction Levels
- **Pipelines**: Simplest, best for quick prototyping (sentiment, NER, Q&A)
- **AutoTokenizer + AutoModel**: More control, good for custom workflows
- **Raw PyTorch/TensorFlow**: Maximum control, for research and advanced use cases

### 2. Tokenization is Critical
- Different models use different tokenizers
- Subword tokenization (WordPiece, BPE) handles rare words
- Special tokens serve specific purposes: [CLS], [SEP], [PAD], [MASK]
- Always use the matching tokenizer for your model

### 3. Pre-trained > Training from Scratch
- Transfer learning: pre-train on massive corpora, then fine-tune on your data
- Hours of training instead of months
- Better results with less data
- Democratized: anyone can use SOTA models

### 4. Production Considerations
- Save models locally for reproducibility
- Use `padding=True`, `truncation=True`, and `return_tensors` together when tokenizing batches
- Optimize models (distillation, quantization) for deployment
- Monitor inference latency and memory usage

### 5. Common Pitfalls
- Mismatch between tokenizer and model
- Forgetting torch.no_grad() for inference
- Not handling variable-length sequences properly
- Using wrong model class for task (e.g., AutoModel vs AutoModelForSequenceClassification)


## Practice Exercises

### Exercise 1: Compare Different Tokenizers
**Objective**: Understand how different tokenizers handle the same text

```python
sample_text = "Artificial intelligence is revolutionizing technology worldwide"
tokenizers = {
    'bert-base': AutoTokenizer.from_pretrained("bert-base-uncased"),
    'roberta': AutoTokenizer.from_pretrained("roberta-base"),
    'gpt2': AutoTokenizer.from_pretrained("gpt2")
}
# Use a distinct loop variable so the notebook's `tokenizer` is not overwritten
for name, tok in tokenizers.items():
    pieces = tok.tokenize(sample_text)
    print(f"{name}: {len(pieces)} tokens -> {pieces}")
```

---

### Exercise 2: Build a Sentiment Pipeline
**Objective**: Classify multiple texts and interpret results

```python
texts = [
    "This product is amazing and exceeded expectations",
    "Terrible experience, would not recommend",
    "It works okay, nothing special"
]
classifier = pipeline("sentiment-analysis")
for review in texts:
    result = classifier(review)[0]
    print(f"{review}: {result['label']} ({result['score']:.3f})")
```

---

### Exercise 3: Extract Entities from Text
**Objective**: Use NER on diverse text types

```python
texts = [
    "Apple Inc. was founded by Steve Jobs in California",
    "Barack Obama was elected president in 2008",
    "The Eiffel Tower is located in Paris, France"
]
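# aggregation_strategy is required for the 'entity_group' key;
# without it each result is a single token with an 'entity' key instead
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for sentence in texts:
    entities = ner(sentence)
    print(f"Text: {sentence}")
    for entity in entities:
        print(f"  {entity['word']}: {entity['entity_group']}")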
```

---

## Next Steps

1. **Fine-tuning**: Adapt models to domain-specific tasks
2. **Model Optimization**: Distillation, quantization for deployment
3. **Multi-GPU Training**: Scale training to massive datasets
4. **Advanced Architectures**: T5, GPT-2, BART for generation tasks
5. **Production Deployment**: FastAPI, ONNX, TorchServe for serving models
