# 🎓 Week 13 - Day 3: Hugging Face and Pre-trained Models

## Today's Goals:
✅ Master Hugging Face pipelines for instant NLP
✅ Load and use pre-trained models (DistilBERT, GPT-2)
✅ Fine-tune models for custom tasks
✅ Build production-ready NLP applications


## 🔧 Part 1: Setup - Install & Import All Libraries

**IMPORTANT:** Run ALL cells in this part sequentially!

**Why Hugging Face?**
- 🚀 100,000+ pre-trained models  
- 💰 Free to download (many published by Google, Meta, OpenAI, and the community; check each model's license)
- ⚡ 3 lines of code for most tasks
- 🎯 No need to train from scratch!


In [None]:
# STEP 1: Install Hugging Face libraries
!pip install -q transformers datasets torch evaluate accelerate

print("✅ Hugging Face installed!")


In [None]:
# STEP 2: Import core libraries
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported!")


In [None]:
# STEP 3: Check device and setup
device = 0 if torch.cuda.is_available() else -1
print(f"✅ Using: {'GPU' if device == 0 else 'CPU'}")
print("\n🚀 Ready to use Hugging Face!")


## 🚀 Part 2: Hugging Face Pipelines - Instant NLP!

**What are Pipelines?**  
Pre-built functions that combine tokenizer + model for common tasks!
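
To make "tokenizer + model" concrete, here is a minimal sketch of the three steps a pipeline chains together: preprocess, forward pass, postprocess. Everything in it (`TOY_VOCAB`, `forward`, `toy_pipeline`) is a hypothetical stand-in written in plain Python. It is not the library's implementation, only the shape of the idea.

```python
import math

# Hypothetical word scores standing in for a trained model's knowledge.
TOY_VOCAB = {"love": 2.0, "great": 1.5, "terrible": -2.0, "bad": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Tokenizer step: lowercase the text and split it into word tokens."""
    return [w.strip(".,!?") for w in text.lower().split()]

def forward(tokens):
    """Model step: turn tokens into one raw score (logit) per label."""
    score = sum(TOY_VOCAB.get(tok, 0.0) for tok in tokens)
    return [-score, score]  # [NEGATIVE logit, POSITIVE logit]

def postprocess(logits):
    """Output step: softmax the logits and report the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for stability
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three steps, the way a pipeline object does in one call."""
    return postprocess(forward(preprocess(text)))

print(toy_pipeline("I love this, it is great!"))
```

The result has the same `label` / `score` shape that the sentiment pipeline below returns.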

**Available Pipelines:**
- 🎭 Sentiment Analysis
- ❓ Question Answering
- 📝 Text Generation
- 📰 Summarization
- 🔍 Named Entity Recognition (NER)
- 🎯 Zero-Shot Classification

Let's try the first five!


In [None]:
# Task 1: Sentiment Analysis (Positive/Negative detector)
sentiment = pipeline("sentiment-analysis", device=device)

texts = [
    "I love this product! Amazing quality!",
    "Terrible experience, very disappointed.",
    "It's okay, nothing special.",
    "Best purchase ever! Highly recommend!"
]

print("🎭 Sentiment Analysis Results:\n")
for text in texts:
    result = sentiment(text)[0]
    print(f"'{text}'")
    print(f"   → {result['label']}: {result['score']:.0%}\n")


In [None]:
# Task 2: Question Answering
qa = pipeline("question-answering", device=device)

context = """
Transformers were introduced in 2017 by Google researchers. They use 
self-attention instead of recurrent layers, enabling parallel processing. 
Major models include BERT (2018), GPT-2 (2019), and GPT-3 (2020). These models 
power ChatGPT, Claude, and other AI systems.
"""

questions = [
    "When were Transformers introduced?",
    "Who introduced Transformers?",
    "Name a major Transformer model."
]

print("❓ Question Answering:\n")
for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} ({result['score']:.0%})\n")


In [None]:
# Task 3: Text Generation with GPT-2 (a small predecessor of the models behind ChatGPT)
generator = pipeline("text-generation", model="gpt2", device=device)

prompts = [
    "Artificial intelligence will",
    "The future of technology is"
]

print("📝 Text Generation:\n")
for prompt in prompts:
    result = generator(prompt, max_length=40, num_return_sequences=1)[0]
    print(f"Prompt: '{prompt}'")
    print(f"Generated: {result['generated_text']}\n")
    print("-" * 70 + "\n")


In [None]:
# Task 4: Summarization
summarizer = pipeline("summarization", device=device)

article = """
Artificial intelligence has made remarkable progress in recent years. 
Transformer models revolutionized natural language processing. Companies like 
OpenAI, Google, and Anthropic developed powerful models such as GPT-4, PaLM, 
and Claude. These models can perform tasks including translation, summarization, 
question answering, and code generation. Applications span healthcare, education, 
customer service, and creative writing.
"""

print(f"📰 Original: {len(article.split())} words\n")

summary = summarizer(article, max_length=50, min_length=25)[0]
print("📝 Summary:")
print(summary['summary_text'])
print(f"\n✅ Summary: {len(summary['summary_text'].split())} words")


In [None]:
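# Task 5: Named Entity Recognition (Extract names, places, organizations)
# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline("ner", aggregation_strategy="simple", device=device)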

text = """
Apple Inc. was founded by Steve Jobs in Cupertino, California. Tim Cook became 
CEO in 2011. The company is valued at over $3 trillion.
"""

print("🔍 Named Entity Recognition:\n")
entities = ner(text)

for entity in entities:
    print(f"{entity['word']:<20} → {entity['entity_group']:<10} ({entity['score']:.0%})")

print("\n💡 PER=Person | ORG=Organization | LOC=Location | MISC=Miscellaneous")


### 💡 Key Insights:

✅ **One-line solutions** for complex NLP tasks  
✅ **Pre-trained models** - no training needed!  
✅ **Multiple tasks** - sentiment, QA, generation, NER  
✅ **Production-ready** - used by real companies


## 🔧 Part 3: Manual Model Loading & Tokenization

Understanding what happens inside pipelines!

**Every model needs TWO components:**
1. **Tokenizer:** Text → Numbers
2. **Model:** Numbers → Predictions
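
The "Text → Numbers" step can be sketched in plain Python. The vocabulary below is a hypothetical nine-entry toy, and the splitting rule is whole words only. Real tokenizers such as DistilBERT's use a learned subword vocabulary with tens of thousands of entries, so treat this as an illustration of the idea rather than the library's algorithm.

```python
# Hypothetical mini-vocabulary, for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hugging": 4, "face": 5, "makes": 6, "nlp": 7, "easy": 8, "!": 9}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def tokenize(text):
    """Split lowercase text into word tokens, keeping '!' as its own token."""
    return text.lower().replace("!", " !").split()

def encode(text):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokenize(text)]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

def decode(ids):
    """Map IDs back to their tokens."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("Hugging Face makes NLP easy!")
print(ids)          # [2, 4, 5, 6, 7, 8, 9, 3]
print(decode(ids))  # [CLS] hugging face makes nlp easy ! [SEP]
```

Words missing from the vocabulary fall back to `[UNK]`, which is one reason real tokenizers split rare words into smaller known pieces instead.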


In [None]:
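# Load a DistilBERT tokenizer and model manually.
# This checkpoint is already fine-tuned for sentiment (SST-2), so its
# classification head gives meaningful predictions. Plain
# "distilbert-base-uncased" would start with a randomly initialized head.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print(f"✅ Loaded: {model_name}")
print(f"📊 Parameters: {model.num_parameters():,}")
print(f"🔤 Vocabulary: {tokenizer.vocab_size:,} tokens")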


In [None]:
# How tokenization works
text = "Hugging Face makes NLP easy!"

# Step 1: Split into tokens
tokens = tokenizer.tokenize(text)
print(f"🔤 Tokens: {tokens}")

# Step 2: Convert to IDs (encode() also adds the special [CLS] and [SEP] tokens)
ids = tokenizer.encode(text)
print(f"🔢 Token IDs: {ids}")

# Step 3: Full encoding (for models)
encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(f"\n📦 Encoded input:")
print(f"   Input IDs shape: {encoded['input_ids'].shape}")
print(f"   Attention mask: {encoded['attention_mask'].shape}")

# Step 4: Decode back to text
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"\n🔄 Decoded: {decoded}")


In [None]:
# Make predictions manually
model.eval()

texts = [
    "This movie is amazing!",
    "I hated every minute."
]

print("🎯 Manual Predictions:\n")

for text in texts:
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt")
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        pred = torch.argmax(logits, dim=1).item()
        probs = torch.softmax(logits, dim=1)[0]
    
    print(f"Text: '{text}'")
    print(f"   {'✅ Positive' if pred == 1 else '❌ Negative'} ({probs[pred]:.0%})\n")


### 💡 Key Insights:

✅ **Tokenizer** converts text to numbers that models understand  
✅ **Special tokens** [CLS] and [SEP] mark the start and end of each input sequence  
✅ **Attention masks** tell model which tokens are real vs padding  
✅ **Manual control** useful for custom applications
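
Here is a small sketch of how padding and attention masks fit together when sentences of different lengths share a batch. The token IDs and the `pad_batch` helper are made up for illustration. The padding ID of 0 is an assumption of this sketch, and a real tokenizer does this for you when you pass `padding=True`.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every ID list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two illustrative sequences of different lengths (IDs are arbitrary).
batch = pad_batch([[2, 11, 12, 13, 3], [2, 14, 3]])
print(batch["input_ids"])       # [[2, 11, 12, 13, 3], [2, 14, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Both rows now have the same length, so they can be stacked into one tensor, and the mask tells the model to ignore the padded positions.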


## 🎓 Part 4: Fine-Tuning on Custom Data

**The Power of Fine-Tuning:**
- 🎯 Adapt pre-trained models to YOUR task
- ⚡ Train in minutes, not weeks
- 📊 Reach strong accuracy even with a small labeled dataset

**Today's Task:** Fine-tune DistilBERT (a smaller, faster BERT) on movie review sentiment!


In [None]:
# Load IMDB movie review dataset
dataset = load_dataset("imdb")

print("✅ Dataset loaded!")
print(f"   Training: {len(dataset['train']):,} reviews")
print(f"   Test: {len(dataset['test']):,} reviews")

# Show examples
print("\n📝 Sample reviews:\n")
for i in range(2):
    example = dataset['train'][i]
    label = "Positive" if example['label'] == 1 else "Negative"
    print(f"Review: {example['text'][:100]}...")
    print(f"Label: {label}\n")


In [None]:
# Create smaller subset for faster training
small_train = dataset['train'].shuffle(seed=42).select(range(1000))
small_test = dataset['test'].shuffle(seed=42).select(range(200))

print(f"✅ Training subset: {len(small_train)} samples")
print(f"✅ Test subset: {len(small_test)} samples")
print("\n💡 Using small dataset for demo - scale up for production!")


In [None]:
# Tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", 
                     truncation=True, max_length=256)

print("🔤 Tokenizing datasets...")
tokenized_train = small_train.map(tokenize_function, batched=True)
tokenized_test = small_test.map(tokenize_function, batched=True)

print("✅ Tokenization complete!")


In [None]:
# Load fresh model for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2  # Binary: positive/negative
)

print("✅ Model loaded for fine-tuning!")
print(f"📊 Parameters: {model.num_parameters():,}")


In [None]:
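# Configure training
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",        # named `evaluation_strategy` in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    save_strategy="epoch",        # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    report_to="none",             # skip external experiment trackers such as Weights & Biases
)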

print("✅ Training configured!")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")


In [None]:
# Define evaluation metrics
import evaluate

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

print("✅ Metrics configured (Accuracy)")


In [None]:
# Create Trainer and fine-tune!
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

print("🚀 Starting fine-tuning...")
print("⏳ Expect a few minutes on a GPU; much longer on CPU\n")

trainer.train()

print("\n✅ Fine-tuning complete! 🎉")


In [None]:
# Evaluate the fine-tuned model
results = trainer.evaluate()

print("\n📊 Final Results:")
print(f"   Accuracy: {results['eval_accuracy']:.1%}")
print(f"   Loss: {results['eval_loss']:.4f}")

print(f"\n💡 Comparison:")
print(f"   Random guessing: 50.0%")
print(f"   Our fine-tuned model: {results['eval_accuracy']:.1%}")
print(f"   Relative gain over chance: {(results['eval_accuracy'] - 0.5) / 0.5 * 100:.0f}%")


In [None]:
# Test on custom movie reviews
test_reviews = [
    "This movie was absolutely incredible! Best film I've seen in years.",
    "Waste of time and money. Terrible acting.",
    "It was okay, nothing special.",
    "Mind-blowing cinematography! A masterpiece!"
]

print("🎯 Testing on Custom Reviews:\n")

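model.eval()
for review in test_reviews:
    inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
    # Trainer may have moved the model to the GPU; keep the inputs on the same device
    inputs = inputs.to(model.device)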
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        pred = torch.argmax(logits, dim=1).item()
        probs = torch.softmax(logits, dim=1)[0]
    
    print(f"Review: '{review}'")
    print(f"   {'✅ Positive' if pred == 1 else '❌ Negative'} ({probs[pred]:.0%})\n")


In [None]:
# Save the fine-tuned model
model.save_pretrained("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")

print("✅ Model saved to './my-sentiment-model'")
print("\n💡 You can now:")
print("   1. Load it anytime with AutoModelForSequenceClassification.from_pretrained()")
print("   2. Share it on Hugging Face Hub")
print("   3. Deploy in production applications!")


### 💡 Key Insights:

✅ **Fine-tuning is fast** - 3 epochs in minutes vs weeks of training from scratch  
✅ **Small datasets work** - 1,000 samples are enough to beat random guessing by a wide margin (check the accuracy printed by your own run)  
✅ **Transfer learning magic** - pre-trained knowledge + your data  
✅ **Production ready** - save and deploy anywhere
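
To see why this run is short, it helps to count optimizer steps. The sketch below uses the numbers from this notebook (1,000 samples, batch size 8, 3 epochs, learning rate 2e-5) and assumes a single device with no gradient accumulation. The linear decay without warmup is my understanding of the `Trainer` default schedule, so treat that part as an assumption. Both helper functions are hypothetical and written only for this illustration.

```python
import math

def total_training_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate at a given step when decaying linearly to zero (no warmup)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

total = total_training_steps(num_samples=1000, batch_size=8, num_epochs=3)
print(f"Total optimizer steps: {total}")  # 375

# Learning rate at the start, middle, and end of training
for step in (0, total // 2, total):
    print(f"step {step:>3}: lr = {linear_decay_lr(step, total, base_lr=2e-5):.2e}")
```

A few hundred small updates are enough here because the pre-trained weights already encode general language knowledge; fine-tuning only nudges them toward the new task.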


## 🌐 Part 5: Exploring the Hugging Face Hub

**The Hub = GitHub for ML Models**
- 🎯 100,000+ models
- 🆓 Free to use
- 👥 Community-driven
- 📊 Models for every task

Let's explore!


In [None]:
# Search for models programmatically
from huggingface_hub import list_models

print("🔍 Text-Classification Models on the Hub (sorted by downloads):\n")
hub_models = list_models(filter="text-classification", sort="downloads", limit=5)

# Use a distinct loop variable so the fine-tuned `model` is not overwritten
for i, info in enumerate(hub_models, 1):
    print(f"{i}. {info.id}")
    print(f"   Downloads: {info.downloads:,}")
    print(f"   Likes: {info.likes}\n")

print("💡 Visit https://huggingface.co/models to explore more!")


### 💡 Key Insights:

✅ **Massive library** - 100,000+ pre-trained models  
✅ **Every task covered** - classification, generation, QA, etc.  
✅ **Community driven** - anyone can share models  
✅ **Easy to use** - load with 1 line of code


## 🎯 Challenge Time!

### 🏆 Challenge: Build a Multi-Task NLP Analyzer

**Your Mission:** Create a function that analyzes text and returns:
1. Sentiment (positive/negative)
2. Named entities (people, places, organizations)
3. Summary (if text is long enough)

**Starter Code:**
```python
def analyze_text(text):
    # TODO: Use pipelines to analyze text
    # Return dictionary with all results
    results = {
        'sentiment': None,
        'entities': None,
        'summary': None
    }
    return results

# Test it
sample = """
Elon Musk announced that Tesla will open a new factory in Austin, Texas. 
The company expects to create 5,000 jobs. This is great news for the economy.
"""

results = analyze_text(sample)
print(results)
```

**Expected Output:**
```
{
    'sentiment': 'POSITIVE (95%)',
    'entities': ['Elon Musk', 'Tesla', 'Austin', 'Texas'],
    'summary': 'Tesla opening new factory...'
}
```
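
**Hint:** getting from raw pipeline results to the format above is mostly a formatting job. The sketch below works on hand-written dictionaries shaped like the sentiment and NER outputs printed earlier in this notebook. The sample values, the helper names, and the `min_score` threshold are all hypothetical, and calling the actual pipelines is still up to you.

```python
def format_sentiment(result):
    """Turn {'label': 'POSITIVE', 'score': 0.95} into 'POSITIVE (95%)'."""
    return f"{result['label']} ({result['score']:.0%})"

def entity_names(entities, min_score=0.5):
    """Keep the text of each grouped entity above a confidence threshold."""
    return [e["word"] for e in entities if e["score"] >= min_score]

# Hypothetical raw outputs, shaped like the pipeline results shown earlier.
raw_sentiment = {"label": "POSITIVE", "score": 0.95}
raw_entities = [
    {"word": "Elon Musk", "entity_group": "PER", "score": 0.99},
    {"word": "Tesla", "entity_group": "ORG", "score": 0.98},
    {"word": "Austin", "entity_group": "LOC", "score": 0.97},
    {"word": "factory", "entity_group": "MISC", "score": 0.31},
]

print(format_sentiment(raw_sentiment))  # POSITIVE (95%)
print(entity_names(raw_entities))       # ['Elon Musk', 'Tesla', 'Austin']
```

Filtering on the score is one simple way to drop low-confidence entities; pick a threshold that suits your own text.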

**Bonus Challenges:**
1. Add zero-shot topic classification
2. Handle multiple languages
3. Create error handling for edge cases

Good luck! 🚀


In [None]:
# Your solution code here!

def analyze_text(text):
    # TODO: Implement using pipelines
    pass

# Test your function
sample = """
Elon Musk announced that Tesla will open a new factory in Austin, Texas.
"""

# results = analyze_text(sample)
# print(results)

print("💡 Implement the function above and test it!")


---

## 📚 Summary - What We Learned Today

### 1. Hugging Face Pipelines 🚀
- **One-line solutions** for sentiment, QA, generation, NER
- **Pre-trained models** ready to use
- **No training required** for common tasks

### 2. Model Components 🔧
- **Tokenizers** convert text to numbers
- **Models** process numbers into predictions
- **Manual loading** gives full control

### 3. Fine-Tuning 🎓
- **Adapt models** to your specific task
- **Fast training** - minutes not weeks
- **Small datasets** work well (1000 samples)
- **Save and deploy** anywhere

### 4. Hugging Face Hub 🌐
- **100,000+ models** available
- **Free to use** and share
- **Community driven** ecosystem

---

## 🚀 Next Week Preview

**Week 14: LLMs and AI Agents**
- Large Language Models (GPT, Claude)
- Fine-tuning with LoRA
- LangChain framework
- Building chatbots with memory
- RAG (Retrieval-Augmented Generation)

---

## 💡 Key Takeaways:

✅ **Don't train from scratch** - use pre-trained models  
✅ **Pipelines for speed** - 1-2 lines for most tasks  
✅ **Fine-tune for custom** - adapt to your data  
✅ **Production ready** - used by Google, Meta, Microsoft

**Excellent work this week! 🎉**

You've mastered:
- TensorFlow vs PyTorch (Day 1)
- Transformer architecture (Day 2)
- Hugging Face ecosystem (Day 3)

You're now ready to build real-world NLP applications!
