# Hugging Face: A Simple Course

This notebook introduces Hugging Face's Transformers library, focusing on how to use it for text embeddings and language models in retrieval-augmented generation (RAG) applications. Hugging Face has become a de facto standard for accessing pre-trained NLP models.

## What is Hugging Face?

Hugging Face is a company that provides:

1. **The Transformers library**: An open-source library with pre-trained models
2. **The Model Hub**: A platform for sharing and discovering models
3. **The Datasets library**: A library for loading and processing datasets, many of which are hosted on the Hub
4. **Spaces**: For deploying machine learning apps

We'll focus mainly on the Transformers library and how to use it for generating embeddings and working with language models.

## Key Components

- **Transformers**: The main library for working with pre-trained models
- **Tokenizers**: For converting text to tokens that models can understand
- **Pipelines**: Easy-to-use abstractions for common tasks
- **Model Hub**: Where pre-trained models are shared and downloaded from

## Installation

In [None]:
# Install the libraries used in this notebook
!pip install -q transformers sentence-transformers scikit-learn torch

## Basic Usage: Text Classification

Let's start with a simple example: using a pre-trained model to classify text sentiment.

In [None]:
from transformers import pipeline

# Create a text classification pipeline
classifier = pipeline("sentiment-analysis")

# Analyze some text
results = classifier([
    "I love using Hugging Face transformers!",
    "This code is complicated and hard to understand.",
    "Learning new libraries can be challenging but rewarding."
])

# Print the results
for i, result in enumerate(results):
    print(f"Text {i+1}: {result['label']} (Confidence: {result['score']:.4f})")
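
The `score` printed above is a probability. For a single-label classifier it is typically obtained by applying a softmax to the model's raw outputs (logits) and reporting the largest value. The following is a minimal sketch of that post-processing step, not the library's own code: the logits are made up and `logits_to_prediction` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_prediction(logits, labels):
    """Pick the most probable label and report its probability (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Made-up logits for one input; real values come from the model
prediction = logits_to_prediction([-1.2, 2.3], ["NEGATIVE", "POSITIVE"])
print(prediction)
```

A score close to 1.0 means the logits were far apart; a score near 0.5 in a two-class setting means the model was nearly undecided.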

## Generating Embeddings with Sentence Transformers

For RAG applications, we typically need to generate embeddings for documents and queries. Sentence Transformers, built on top of Hugging Face's Transformers, provides models specifically designed for this purpose.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # A good, lightweight model

# Example sentences
sentences = [
    "This is a sentence about artificial intelligence.",
    "Embeddings are vector representations of text.",
    "The weather today is sunny and warm.",
    "Machine learning models can process natural language."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Print information about the embeddings
print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence is represented by a {embeddings.shape[1]}-dimensional vector")

# Calculate similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("\nSimilarity Matrix:")
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"Similarity between sentence {i+1} and {j+1}: {similarity_matrix[i][j]:.4f}")
    
# Find the most similar pair (each unordered pair is checked once)
max_sim = -1
max_pair = (0, 0)
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if similarity_matrix[i][j] > max_sim:
            max_sim = similarity_matrix[i][j]
            max_pair = (i, j)

print(f"\nMost similar sentences: {max_pair[0]+1} and {max_pair[1]+1}")
print(f"Sentence {max_pair[0]+1}: \"{sentences[max_pair[0]]}\"")
print(f"Sentence {max_pair[1]+1}: \"{sentences[max_pair[1]]}\"")
print(f"Similarity: {max_sim:.4f}")
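
To make the similarity scores less of a black box, here is a small sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The `cosine` helper and the 3-dimensional toy vectors are illustrative assumptions; real embeddings from the model above have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for real embeddings
toy_vectors = {"a": [1.0, 2.0, 0.0], "b": [2.0, 4.0, 0.0], "c": [0.0, 0.0, 3.0]}

print(cosine(toy_vectors["a"], toy_vectors["b"]))  # same direction, different length
print(cosine(toy_vectors["a"], toy_vectors["c"]))  # orthogonal vectors
```

Because the result depends only on direction, vectors `a` and `b` score close to 1 even though one is twice as long as the other.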

## Popular Embedding Models

Hugging Face hosts numerous embedding models with different characteristics. Here are some popular ones:

1. **all-MiniLM-L6-v2**: Small and fast, 384 dimensions
2. **all-mpnet-base-v2**: Better quality but slower, 768 dimensions
3. **all-distilroberta-v1**: Good balance of quality and speed
4. **multi-qa-mpnet-base-dot-v1**: Specialized for question-answering

Let's compare a couple of these models:

In [None]:
# Compare embedding models
models = [
    'all-MiniLM-L6-v2',  # Fast, compact
    'all-mpnet-base-v2'  # Higher quality
]

sentences = [
    "What is the capital of France?",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "France is a country in Western Europe."
]

import time

for model_name in models:
    print(f"\nTesting model: {model_name}")
    
    # Load model
    start_time = time.time()
    model = SentenceTransformer(model_name)
    load_time = time.time() - start_time
    print(f"Model loading time: {load_time:.2f} seconds")
    
    # Generate embeddings
    start_time = time.time()
    embeddings = model.encode(sentences)
    encode_time = time.time() - start_time
    print(f"Encoding time for {len(sentences)} sentences: {encode_time:.4f} seconds")
    print(f"Embedding dimensions: {embeddings.shape[1]}")
    
    # Calculate similarity
    sim = cosine_similarity(embeddings)
    
    # Print similarity between question and each sentence
    for i in range(1, len(sentences)):
        print(f"Similarity between question and sentence {i}: {sim[0][i]:.4f}")
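
Under the hood, a transformer produces one vector per token, so an embedding model needs a pooling step to get a single vector per sentence. The models compared above commonly use mean pooling: averaging the token vectors while ignoring padding. The sketch below illustrates the idea with toy numbers; `mean_pool` is a hypothetical helper, not the Sentence Transformers implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy example: three real tokens plus one padding token, 2 dimensions each
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # the padding vector [9.0, 9.0] does not contribute
```

Masking matters when sentences are batched: shorter sentences are padded to a common length, and the padding must not dilute the average.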

## Using Transformers for Text Generation

Besides embeddings, Hugging Face is widely used for text generation. Let's see how to use a language model to generate text:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# We'll use a small model for demonstration purposes
model_name = "distilgpt2"  # A smaller version of GPT-2

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
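def generate_text(prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate without tracking gradients (inference only)
    with torch.no_grad():
        output = model.generate(
            **inputs,  # passes both input_ids and attention_mask
            max_new_tokens=max_new_tokens,  # counts generated tokens only, not the prompt
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id  # GPT-2 models define no pad token
        )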
    
    # Decode and return
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Test with some prompts
prompts = [
    "Artificial intelligence is",
    "The future of natural language processing",
    "Transformers are neural networks that"
]

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    response = generate_text(prompt)
    print(f"Generated: {response}")
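
When no sampling options are set, `generate` typically falls back to greedy decoding: at every step it appends the single most probable next token. The sketch below shows that loop on a toy lookup table. The table and the `greedy_decode` helper are hypothetical; a real language model computes the probabilities from the entire preceding context rather than from the last token alone.

```python
# Hypothetical next-token probabilities keyed by the previous token only
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "data": 0.3},
    "a": {"model": 0.5, "token": 0.5},
    "model": {"generates": 0.8, "</s>": 0.2},
    "data": {"</s>": 1.0},
    "token": {"</s>": 1.0},
    "generates": {"text": 0.9, "</s>": 0.1},
    "text": {"</s>": 1.0},
}

def greedy_decode(start="<s>", max_new_tokens=10):
    """Repeatedly append the most probable next token until </s> or the limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":
            break  # end-of-sequence token: stop generating
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(" ".join(greedy_decode()))
```

Greedy decoding is deterministic, which makes results reproducible but can lead to repetitive text; sampling (for example `do_sample=True`) trades that determinism for variety.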

## Pipelines: The Easy Way to Use Transformers

Hugging Face provides pipelines that simplify many common tasks:

In [None]:
from transformers import pipeline

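# Some commonly used pipeline tasks (not an exhaustive list)
tasks = [
    "sentiment-analysis",  # Alias of "text-classification"
    "text-generation",
    "text-classification",
    "token-classification",
    "question-answering",
    "summarization",
    "translation",  # Usually given with a language pair, e.g. "translation_en_to_fr"
    "feature-extraction"  # For embeddings
]

print("Common pipeline tasks:")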
for task in tasks:
    print(f"- {task}")

# Let's try a few

# 1. Question answering
qa_pipeline = pipeline("question-answering")
context = """
Hugging Face is an AI company that provides tools and libraries for natural language processing (NLP). 
Their most popular product is the Transformers library, which provides pre-trained models for a wide range of NLP tasks. 
The company was founded in 2016 and is based in New York City and Paris.
"""

questions = [
    "What is Hugging Face's most popular product?",
    "Where is Hugging Face based?",
    "When was Hugging Face founded?"
]

print("\nQuestion Answering:")
for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']} (Score: {result['score']:.4f})")

# 2. Summarization
summarizer = pipeline("summarization")
long_text = """
The Transformer is a deep learning architecture that has revolutionized natural language processing.
It was introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017.
The key innovation was the self-attention mechanism, which allows the model to weigh the importance of different words in relation to each other.
This has led to state-of-the-art results in translation, question answering, text generation, and many other NLP tasks.
Pre-trained transformer models like BERT, GPT, and T5 have become the foundation for most advanced NLP systems.
"""

print("\nSummarization:")
summary = summarizer(long_text, max_length=50, min_length=10, do_sample=False)
print(summary[0]['summary_text'])

## Working with Tokenizers

To understand how transformers process text, it's important to understand tokenization.

In [None]:
from transformers import AutoTokenizer

# Let's examine how different tokenizers process the same text
tokenizer_names = [
    "bert-base-uncased",  # WordPiece tokenizer
    "gpt2",               # Byte-level BPE tokenizer
    "t5-base"             # SentencePiece tokenizer
]
]

text = "Hugging Face Transformers converts your text into tokens, which are numbers the model can understand."

print("Tokenization example:\n")
for name in tokenizer_names:
    tokenizer = AutoTokenizer.from_pretrained(name)
    
    print(f"Tokenizer: {name}")
    
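    # Tokenize: tokenize() returns subword strings without special tokens,
    # while encode() returns IDs and may add special tokens (e.g. [CLS]/[SEP] for BERT)
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)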
    
    print(f"Number of tokens: {len(tokens)}")
    print(f"First 10 tokens: {tokens[:10]}")
    print(f"Token IDs: {token_ids[:10]}...")
    
    # Decode back to text
    decoded = tokenizer.decode(token_ids)
    print(f"Decoded: {decoded}\n")
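
The `##` prefixes in the BERT output come from WordPiece's matching rule: each word is split greedily into the longest pieces found in the vocabulary, and pieces that continue a word are marked with `##`. The sketch below reproduces that rule on a tiny made-up vocabulary. `wordpiece_tokenize` is a simplified illustration, not the tokenizer's actual implementation, which also handles normalization, punctuation, and a far larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny made-up vocabulary; real vocabularies hold tens of thousands of entries
toy_vocab = {"token", "##ize", "##izer", "##s", "hug", "##ging", "face"}

for word in ["tokenizers", "hugging", "face", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text it has never seen as whole words.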

## Integration with RAG Systems

Now let's see how Hugging Face components can be integrated into a simple RAG (Retrieval-Augmented Generation) system:

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

class SimpleRAG:
    def __init__(self):
        # Initialize embedding model
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        
        # Initialize generation model
        self.generator = pipeline('text-generation', model='distilgpt2')
        
        # Knowledge base (documents and their embeddings)
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, documents):
        """Add documents to the knowledge base."""
        self.documents.extend(documents)
        
        # Generate embeddings for new documents
        new_embeddings = self.embedding_model.encode(documents)
        
        if len(self.embeddings) == 0:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
    
    def retrieve(self, query, top_k=2):
        """Retrieve most relevant documents for a query."""
        # Generate query embedding
        query_embedding = self.embedding_model.encode([query])[0]
        
        # Calculate similarities
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        
        # Get top_k indices
        top_indices = similarities.argsort()[-top_k:][::-1]
        
        return [(self.documents[i], similarities[i]) for i in top_indices]
    
    def generate(self, query):
        """Generate an answer based on retrieved documents."""
        # Retrieve relevant documents
        retrieved_docs = self.retrieve(query)
        
        # Format context
        context = "\n".join([doc for doc, _ in retrieved_docs])
        
        # Create prompt
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        
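        # Generate answer (max_new_tokens limits the generated part, not the prompt)
        response = self.generator(prompt, max_new_tokens=50, do_sample=True)

        # The pipeline returns prompt + continuation, so strip the prompt prefix
        generated_text = response[0]["generated_text"]
        answer = generated_text[len(prompt):].strip()

        return {
            "query": query,
            "answer": answer,
            "retrieved_documents": retrieved_docs
        }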

# Test the RAG system
rag = SimpleRAG()

# Add documents to knowledge base
documents = [
    "The capital of France is Paris. Paris is known for the Eiffel Tower and Louvre Museum.",
    "Rome is the capital of Italy. The Colosseum and Vatican City are in Rome.",
    "Germany's capital is Berlin. Berlin is known for the Brandenburg Gate.",
    "Madrid is the capital of Spain and home to the Prado Museum.",
    "Tokyo is the capital of Japan and the most populous city in the world."
]

rag.add_documents(documents)

# Test with a query
query = "What is the capital of France and what is it known for?"
result = rag.generate(query)

print("Query:", result["query"])
print("\nRetrieved Documents:")
for i, (doc, score) in enumerate(result["retrieved_documents"]):
    print(f"{i+1}. [{score:.4f}] {doc}")

print("\nGenerated Answer:", result["answer"])
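
The prompt format has a large effect on what a small generator produces. As one possible refinement of the prompt used in `generate`, the sketch below numbers each retrieved passage, adds a short instruction, and caps the context size. `build_prompt` and its `max_context_chars` parameter are hypothetical names introduced here for illustration; the input uses the same `(document, score)` shape that `retrieve` returns.

```python
def build_prompt(query, retrieved_docs, max_context_chars=500):
    """Format (document, score) pairs into a numbered context block plus the question."""
    lines, used = [], 0
    for rank, (doc, _score) in enumerate(retrieved_docs, 1):
        if used + len(doc) > max_context_chars:
            break  # stop before the context grows past the budget
        lines.append(f"[{rank}] {doc}")
        used += len(doc)
    context = "\n".join(lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Toy retrieval results in the same shape that retrieve() returns
docs = [
    ("The capital of France is Paris.", 0.82),
    ("Rome is the capital of Italy.", 0.41),
]
print(build_prompt("What is the capital of France?", docs))
```

Because documents arrive sorted by similarity, the budget check drops the least relevant passages first.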

## Accessing the Hugging Face Hub

The Hugging Face Hub hosts hundreds of thousands of models along with a large collection of datasets. Let's see how to interact with it:

In [None]:
# Install the huggingface_hub package if not already installed
!pip install -q huggingface_hub

from huggingface_hub import HfApi, list_models

# Initialize the API
api = HfApi()

# List some models based on criteria
def explore_models(task=None, library=None, limit=5):
    models = list_models(
        filter=task,
        library=library,
        limit=limit
    )
    
    print(f"Models for task: {task}, library: {library}")
    for i, model_info in enumerate(models, 1):
        downloads = model_info.downloads or 0  # download count may be missing
        print(f"{i}. {model_info.id} (Downloads: {downloads:,})")
        
# Explore models for different tasks
print("Text embedding models:")
explore_models(task="feature-extraction", limit=5)

print("\nText classification models:")
explore_models(task="text-classification", limit=5)

print("\nQuestion answering models:")
explore_models(task="question-answering", limit=5)

## Best Practices for Using Hugging Face in RAG Systems

1. **Choose the right embedding model**:
   - For high quality: `all-mpnet-base-v2` or `all-distilroberta-v1`
   - For speed: `all-MiniLM-L6-v2`
   - For multilingual: `paraphrase-multilingual-mpnet-base-v2`

2. **Effective text chunking**:
   - Use semantic chunking when possible
   - Consider sentence or paragraph boundaries
   - Maintain appropriate context in each chunk

3. **Model optimization**:
   - Quantize models for faster inference
   - Use smaller models when appropriate
   - Consider batching embeddings generation

4. **Prompt engineering**:
   - Format retrieved context clearly
   - Provide clear instructions in prompts
   - Use examples in prompts when necessary
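
As an illustration of the chunking advice above, the following sketch packs whole sentences into chunks under a size limit, so that no chunk cuts a sentence in half. It is a simple baseline rather than true semantic chunking: `chunk_by_sentences` is a hypothetical helper, and its regex splitter is naive (it will mis-split abbreviations such as "e.g.").

```python
import re

def chunk_by_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive splitter: break after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full: start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = (
    "Paris is the capital of France. It is known for the Eiffel Tower. "
    "Rome is the capital of Italy. The Colosseum is in Rome."
)
for i, chunk in enumerate(chunk_by_sentences(sample, max_chars=70), 1):
    print(f"{i}. {chunk}")
```

A sentence longer than `max_chars` is kept whole as its own oversized chunk; a production chunker would need a fallback for that case, and often adds overlap between neighboring chunks.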

## Conclusion

Hugging Face provides powerful tools for working with state-of-the-art NLP models. Its libraries form the foundation of many modern RAG systems and other NLP applications. The combination of pre-trained models, easy-to-use APIs, and model sharing capabilities makes it an essential resource for anyone working with text data.

For more information, visit the [Hugging Face documentation](https://huggingface.co/docs).