# Using Open Source LLMs Natively with Hugging Face Transformers

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
1. **Understand** what the Hugging Face Transformers library is and why it's important
2. **Load** pre-trained Large Language Models (LLMs) locally on your machine
3. **Use** tokenizers to prepare text input for LLMs
4. **Generate** text responses using the model's `generate()` method
5. **Simplify** the workflow using Hugging Face Pipelines

## 📚 Prerequisites
- Basic Python knowledge
- Understanding of what LLMs are
- A Hugging Face account (free)

## 🔧 What is Hugging Face Transformers?

**Hugging Face Transformers** is an open-source library that provides:
- Access to thousands of pre-trained models for NLP, computer vision, and audio tasks
- Easy-to-use APIs for downloading and using these models
- Tools for fine-tuning models on your own data

**Key Advantage**: Run models locally without sending data to external servers (privacy-friendly!)

## Install Dependencies

In [13]:
# ============================================================================
# 📦 INSTALLING REQUIRED PACKAGES
# ============================================================================
# Uncomment and run these lines if you haven't installed the packages yet
#
# transformers: The main Hugging Face library for working with pre-trained models
# accelerate: Enables efficient model loading and GPU/CPU optimization
# groq: SDK for Groq Cloud inference (alternative to local inference)
#
# The -qq flag suppresses output for cleaner installation logs
# ============================================================================

# !pip install -qq transformers==4.47.0
# !pip install -qq accelerate==1.1.0
# !pip install -qq groq==0.13.0

In [14]:
# ============================================================================
# 🔥 PYTORCH INSTALLATION
# ============================================================================
# PyTorch is the deep learning framework that powers Hugging Face Transformers.
# It handles tensor operations and GPU computations under the hood.
#
# Choose ONE of these options based on your setup:
# - Option 1: Install specific version (for reproducibility)
# - Option 2: Install with all components (torch, torchvision, torchaudio)
# ============================================================================

# Option 1: Install specific version
# !pip install -qq torch==2.7.1

In [15]:
# Option 2: Install all PyTorch components (recommended for full functionality)
# - torch: Core deep learning library
# - torchvision: Computer vision utilities (not required for text LLMs)
# - torchaudio: Audio processing utilities (not required for text LLMs)
# !pip install -qq torch torchvision torchaudio
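
After installing, it can help to confirm which versions are actually active in the current kernel. The sketch below is only an illustration using the standard library: the helper name `installed_versions` is hypothetical, and the package list simply mirrors the packages named in the install cells above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install cells above.
report = installed_versions(["transformers", "accelerate", "torch", "python-dotenv"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package shows as not installed, uncomment the matching install line and re-run that cell.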

## Get Hugging Face Access Token

An access token lets you download models from the Hugging Face Hub and is required for gated models. Creating an account is free.

1. After creating your account, go to [Settings -> Access Tokens](https://huggingface.co/settings/tokens) and create a new access token with write permissions (read permissions are enough if you only plan to download models).

![](https://i.imgur.com/dtS6tFr.png)

2. Remember to __Save__ your token somewhere safe, because it is shown only once, as in the screenshot below. Copy it into a secure local file for later use. If you lose it, you can create a new token at any time.

![](https://i.imgur.com/NmZmpmw.png)

## Load Hugging Face Access Token


In [16]:
# ============================================================================
# 🔐 LOADING ENVIRONMENT VARIABLES
# ============================================================================
# We use python-dotenv to load sensitive information (like API keys) from a
# .env file. This keeps your credentials secure and out of your code.
#
# Your .env file should contain:
#   HUGGINGFACE_API_KEY=your_token_here
#
# ⚠️  NEVER commit your .env file to version control (add it to .gitignore)
# ============================================================================

from dotenv import load_dotenv  # Library to load environment variables from .env file
import os                        # Standard library for OS operations

# load_dotenv() searches for a .env file and loads its contents as environment variables
# Returns True if .env file was found and loaded successfully
load_dotenv()

True
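
Once `load_dotenv()` has run, the token is available through `os.environ`. The sketch below is a minimal, hedged example of reading it: the helper name `read_hf_token` is hypothetical, and the variable name `HUGGINGFACE_API_KEY` is the one used in the `.env` comment above.

```python
import os


def read_hf_token(env=None, var_name="HUGGINGFACE_API_KEY"):
    """Return the token stored under var_name, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file and re-run load_dotenv()."
        )
    return token


# Demonstration with a stand-in mapping, so no real credential appears here.
demo_token = read_hf_token({"HUGGINGFACE_API_KEY": "  hf_example_token  "})
print(f"Token loaded ({len(demo_token)} characters)")
```

With a real token, one common option is `huggingface_hub.login(token=read_hf_token())`, which authenticates the current session for gated downloads. TinyLlama, used below, is not gated, so this step is optional for this notebook.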

---

## 🖥️ Part 1: Using LLMs Locally with Hugging Face

### Why Run Models Locally?

| Advantage | Description |
|-----------|-------------|
| **Privacy** | Your data never leaves your machine |
| **No API Costs** | Once downloaded, use unlimited times for free |
| **Offline Access** | Works without internet connection |
| **Customization** | Full control over model parameters |

### ⚠️ Hardware Requirements

Running LLMs locally requires significant computational resources:

- **GPU Recommended**: Even small models (1-3B parameters) benefit greatly from GPU acceleration
- **RAM**: At least 8GB for small models, 16GB+ for medium models
- **Storage**: Models can range from 2GB to 100GB+ depending on size

> **💡 Tip**: If you don't have a GPU, consider using the Hugging Face Inference API or Groq Cloud (covered in other notebooks)
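
As a rough rule of thumb, the memory needed just to hold a model's weights is the parameter count multiplied by the bytes per parameter. The sketch below illustrates that arithmetic for a 1.1-billion-parameter model; the helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores activations, the generation cache, and framework overhead, so real usage will be higher.

```python
# Bytes used per parameter for common weight data types.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16"):
    print(f"1.1B parameters in {dtype}: ~{weight_memory_gb(1.1e9, dtype):.1f} GB")
```

The same arithmetic explains why half-precision types roughly halve the memory footprint compared with full precision.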

### 🔒 Understanding Gated Models

Some LLMs on Hugging Face are **"gated"** - meaning you need to accept terms and conditions before accessing them.

**Examples of Gated Models:**
- [Meta Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- [Mistral 7B Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

**How to Request Access:**
1. Go to the model page on Hugging Face
2. Click on "Request Access" or "Agree and access repository"
3. Fill out the required form (usually just accept terms)
4. Wait for approval (usually instant for most models)

![](https://i.imgur.com/M88MOu5.png)

> **Note**: For this tutorial, we'll use **TinyLlama** - an open (non-gated) model that works well for learning!

### Step 1: Load the LLM and Tokenizer

Every LLM requires two main components:

1. **Tokenizer**: Converts text to numbers (tokens) that the model can understand
2. **Model**: The actual neural network that generates predictions

```
Text Input → [Tokenizer] → Token IDs → [Model] → Token IDs → [Tokenizer] → Text Output
```
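
To make the round trip concrete, here is a toy illustration. It uses a tiny, made-up whitespace vocabulary and a hypothetical `ToyTokenizer` class; real tokenizers split text into subword pieces with vocabularies of tens of thousands of entries, so treat this only as a sketch of the encode/decode idea.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, words):
        # ID 0 is reserved for words outside the vocabulary.
        self.id_to_token = ["<unk>"] + list(words)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        """Text -> list of integer token IDs."""
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        """List of integer token IDs -> text."""
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["what", "is", "generative", "ai"])
ids = toy.encode("what is generative ai")
print(ids)
print(toy.decode(ids))
```

The model itself only ever sees the integer IDs; the tokenizer is what connects them back to text.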

**About TinyLlama:**
- Size: 1.1 Billion parameters
- Based on Llama 2 architecture
- Trained on 3 trillion tokens
- Great for learning and experimentation due to small size

In [17]:
# ============================================================================
# 🚀 LOADING THE MODEL AND TOKENIZER
# ============================================================================

# Import required libraries from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# ============================================================================
# STEP 1: Define the Model ID
# ============================================================================
# The model_id is the unique identifier on Hugging Face Hub
# Format: "organization_name/model_name" or "username/model_name"
# You can find model IDs by browsing: https://huggingface.co/models

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# ============================================================================
# STEP 2: Load the Tokenizer
# ============================================================================
# AutoTokenizer automatically detects and loads the correct tokenizer class
# based on the model. It handles:
#   - Vocabulary loading
#   - Special tokens (like <|user|>, </s>, etc.)
#   - Text encoding/decoding

tokenizer = AutoTokenizer.from_pretrained(model_id)

# ============================================================================
# STEP 3: Load the Model
# ============================================================================
# AutoModelForCausalLM loads models designed for text generation (causal LM)
# 
# Key Parameters:
#   - model_id: The Hugging Face model identifier
#   - torch_dtype: Data type for model weights
#       • torch.float32: Full precision (more accurate, uses more memory)
#       • torch.float16: Half precision (good balance)
#       • torch.bfloat16: Brain float (best for modern GPUs, handles larger ranges)
#
# 💡 Using bfloat16 reduces memory usage by ~50% compared with float32, with minimal quality loss

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16  # Use bfloat16 for better memory efficiency
)

# Note: First run will download the model (~2GB for TinyLlama)
# Subsequent runs will load from cache (~/.cache/huggingface/)

### Step 2: Prepare Your Prompt Using Chat Templates

Modern chat LLMs expect input in a specific format with **special tokens** to distinguish between:
- User messages
- Assistant responses
- System instructions

**Why Chat Templates Matter:**
- Each model family (Llama, Mistral, etc.) uses different formatting
- Using wrong format = poor quality responses
- `apply_chat_template()` automatically formats your messages correctly

**Example Format (TinyLlama chat models, Zephyr-style):**
```
<|user|>
Your question here</s>
<|assistant|>
```
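
For intuition, the format above can be approximated in a few lines of plain Python. This is a simplified sketch of this one format, not the template the tokenizer actually ships: the function name `render_chat` is hypothetical, and in practice `apply_chat_template()` should be used so that each model's own template is applied.

```python
def render_chat(messages, add_generation_prompt=True, eos_token="</s>"):
    """Render role/content messages in the <|role|> ... </s> layout shown above."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo_chat = [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
]
print(render_chat(demo_chat))
```

Hand-written formatting like this breaks as soon as you switch model families, which is the main reason to rely on the tokenizer's built-in template.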


In [18]:
# ============================================================================
# 📝 CREATING THE CHAT MESSAGE
# ============================================================================
# Chat messages are structured as a list of dictionaries
# Each message has:
#   - "role": Who is speaking ("user", "assistant", or "system")
#   - "content": The actual message text

chat = [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
]

# ============================================================================
# 🔄 APPLYING THE CHAT TEMPLATE
# ============================================================================
# apply_chat_template() converts your structured messages into the format
# the model expects
#
# Parameters:
#   - chat: The list of message dictionaries
#   - tokenize: False = return string, True = return token IDs
#   - add_generation_prompt: True = add the assistant turn start token
#                           (signals the model to start generating)

prompt = tokenizer.apply_chat_template(
    chat, 
    tokenize=False,            # Return human-readable string (not token IDs)
    add_generation_prompt=True  # Add "<|assistant|>\n" to prompt generation
)

# Let's see what the formatted prompt looks like:
print("=" * 50)
print("FORMATTED PROMPT:")
print("=" * 50)
print(prompt)
print("=" * 50)

FORMATTED PROMPT:
<|user|>
Explain what is Generative AI in 2 bullet points</s>
<|assistant|>



### Step 3: Generate Text with the Model

Now we'll use the model's `generate()` method to produce a response. Understanding the generation parameters is crucial for controlling output quality.

📚 **[Full Documentation](https://huggingface.co/docs/transformers/main_classes/text_generation)**

#### Key Generation Parameters:

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
| `max_length` | Maximum total length (input + output) | 512, 1024, 2048 |
| `max_new_tokens` | Maximum tokens to generate (output only) | 100, 500, 1000 |
| `do_sample` | Enable random sampling | `True` (creative) / `False` (deterministic) |
| `temperature` | Controls randomness (only used when `do_sample=True`) | 0.1-1.0 (higher = more creative); must be > 0 |
| `top_p` | Nucleus sampling threshold | 0.9-0.95 |
| `top_k` | Limit vocabulary choices | 50 |

> ⚠️ **Important**: Set either `max_new_tokens` or `max_length`, not both. If both are set, `max_new_tokens` takes precedence.
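
To illustrate what `top_k` and `top_p` do, the sketch below filters a made-up next-token distribution in plain Python. The function name `filter_top_k_top_p` and the example probabilities are hypothetical; the library applies the same idea to tensors of model scores rather than to a dictionary.

```python
def _normalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep only the most likely tokens, then renormalize.

    top_k keeps the k highest-probability tokens.
    top_p keeps the smallest set of top tokens whose cumulative probability reaches top_p.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = _normalize(ranked[:top_k])
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _normalize(kept)
    return dict(ranked)


# Made-up probabilities for four candidate next tokens.
next_token_probs = {"data": 0.5, "text": 0.3, "images": 0.15, "noise": 0.05}
print(filter_top_k_top_p(next_token_probs, top_k=2))
print(filter_top_k_top_p(next_token_probs, top_p=0.9))
```

Sampling then draws from the filtered distribution, so unlikely tokens can never be picked.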

#### Temperature Explained:
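- **Close to 0**: Approaches greedy decoding (nearly always picks the highest-probability token). Temperature must be greater than 0 and only applies when `do_sample=True`; for deterministic greedy decoding, set `do_sample=False`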
- **0.3-0.5**: More focused, consistent outputs
- **0.7-0.9**: Balanced creativity and coherence
- **1.0+**: Very creative but potentially incoherent
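
Temperature works by dividing the model's raw scores (logits) before they are turned into probabilities. The sketch below shows the effect on a made-up set of three scores; `softmax_with_temperature` and `example_logits` are hypothetical names used only for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across the candidates.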

In [19]:
# ============================================================================
# 🔢 STEP 3A: TOKENIZE THE INPUT (Convert Text → Numbers)
# ============================================================================
# The model can only process numbers, so we need to convert our text prompt
# into token IDs using the tokenizer's encode() method
#
# Parameters:
#   - prompt: The formatted text string
#   - add_special_tokens: False because chat template already added them
#   - return_tensors: "pt" = PyTorch tensor format (required for model)

inputs = tokenizer.encode(
    prompt, 
    add_special_tokens=False,  # Chat template already includes special tokens
    return_tensors="pt"        # Return as PyTorch tensor
)

print(f"📊 Input shape: {inputs.shape}")  # [batch_size, sequence_length]
print(f"📊 Number of input tokens: {inputs.shape[1]}")

# ============================================================================
# 🤖 STEP 3B: GENERATE OUTPUT TOKENS
# ============================================================================
# model.generate() produces new tokens based on the input
#
# Key Steps (under the hood):
#   1. Process input tokens through the model
#   2. Get probability distribution for next token
#   3. Select next token (based on sampling/greedy strategy)
#   4. Repeat until max_new_tokens or end token reached
#
# Note: .to(model.device) moves input to same device as model (CPU/GPU)

outputs = model.generate(
    input_ids=inputs.to(model.device),  # Move input to model's device
    max_new_tokens=1000                  # Generate up to 1000 new tokens
)

# ============================================================================
# 📝 STEP 3C: DECODE OUTPUT (Convert Numbers → Text)
# ============================================================================
# tokenizer.decode() converts the generated token IDs back to readable text

print("\n" + "=" * 60)
print("🤖 MODEL RESPONSE:")
print("=" * 60)
print(tokenizer.decode(outputs[0]))
print("=" * 60)

📊 Input shape: torch.Size([1, 29])
📊 Number of input tokens: 29

🤖 MODEL RESPONSE:
<|user|>
Explain what is Generative AI in 2 bullet points</s> 
<|assistant|>
1. Generative AI is a type of artificial intelligence that can generate new ideas, concepts, and solutions based on data. It is a form of machine learning that uses algorithms to analyze large amounts of data and generate new insights or solutions.

2. Generative AI can be used in various industries, including finance, healthcare, marketing, and education. It can help businesses to identify new products or services, improve marketing campaigns, and develop new educational programs.

3. Generative AI can also be used to create new forms of art, such as music or visual art. It can generate new melodies or paintings based on user input or data.

4. Generative AI is still in its early stages of development, and there are still many challenges to overcome. One of the biggest challenges is the creation of a universal language
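
The generation loop described in the code comments above (score the sequence, pick the next token, repeat until an end token or the token limit) can be sketched with a stand-in model. Everything here is hypothetical: `TOY_SCORES` is a made-up lookup table rather than a neural network, and `greedy_generate` only mimics greedy decoding.

```python
EOS = "</s>"

# Made-up "model": maps the last token to scores for possible next tokens.
TOY_SCORES = {
    "<start>": {"generative": 0.9, "ai": 0.1},
    "generative": {"ai": 0.8, EOS: 0.2},
    "ai": {"creates": 0.6, EOS: 0.4},
    "creates": {"content": 0.7, EOS: 0.3},
    "content": {EOS: 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens):
    """Append the highest-scoring next token until EOS or the token limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


prompt_tokens = ["<start>"]
output_tokens = greedy_generate(prompt_tokens, max_new_tokens=10)
print(output_tokens)                       # prompt followed by the generated tokens
print(output_tokens[len(prompt_tokens):])  # generated tokens only
```

Like this toy, `model.generate()` returns the prompt tokens followed by the new ones, which is why the decoded output above repeats the prompt. With the real objects, slicing with `outputs[0][inputs.shape[1]:]` and passing `skip_special_tokens=True` to `tokenizer.decode()` would print only the answer.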

---

## 🚀 Part 2: The Easier Way - Using Pipelines

The manual process above (encode → generate → decode) works but is verbose. Hugging Face **Pipelines** simplify this significantly!

### What are Pipelines?

Pipelines are high-level abstractions that:
- ✅ Handle tokenization automatically
- ✅ Manage device placement (CPU/GPU)
- ✅ Decode outputs for you
- ✅ Support batching for efficiency
- ✅ Work with chat message format directly

### Comparison:

| Manual Approach | Pipeline Approach |
|-----------------|-------------------|
| `tokenizer.encode()` | Just pass your message! |
| `model.generate()` | Pipeline handles it |
| `tokenizer.decode()` | Returns clean text |
| ~10 lines of code | ~3 lines of code |

In [20]:
# ============================================================================
# 🛠️ CREATING A TEXT GENERATION PIPELINE
# ============================================================================
# transformers.pipeline() is a factory function that creates an easy-to-use
# interface for various NLP tasks
#
# Common task types:
#   - "text-generation": Generate text continuations (what we need for chat)
#   - "text-classification": Sentiment analysis, categorization
#   - "question-answering": Extract answers from context
#   - "summarization": Condense long text
#   - "translation": Translate between languages
# ============================================================================

llama_pipe = transformers.pipeline(
    "text-generation",           # Task type: generate text
    model=model,                 # Our loaded TinyLlama model
    tokenizer=tokenizer,         # Matching tokenizer
    torch_dtype=torch.bfloat16,  # Keep memory-efficient dtype
    trust_remote_code=True,      # Allow model's custom code (if any)
    device_map="auto",           # Automatically choose best device (GPU if available)
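)

# 💡 Note: "auto" device_map will use:
#   - CUDA GPU if available (fastest)
#   - Apple MPS if on Mac with M-series chip
#   - CPU as fallback (slowest)
# When `model` is already loaded (as here), loading options such as torch_dtype and
# device_map may have no effect; they apply when you pass a model ID string instead.
print("✅ Pipeline created successfully!")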

Device set to use mps:0


✅ Pipeline created successfully!


In [21]:
# ============================================================================
# 💬 PREPARING CHAT MESSAGES FOR THE PIPELINE
# ============================================================================
# With pipelines, you can pass the chat messages directly!
# No need to manually apply chat templates - the pipeline handles it.
#
# The pipeline accepts the same message format we used before:
# A list of dictionaries with "role" and "content" keys

chat = [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
]

# 💡 You can also include conversation history:
# chat = [
#     {"role": "system", "content": "You are a helpful assistant."},
#     {"role": "user", "content": "Hello!"},
#     {"role": "assistant", "content": "Hi there! How can I help you today?"},
#     {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
# ]

In [22]:
# ============================================================================
# 🎯 GENERATING TEXT WITH THE PIPELINE
# ============================================================================
# Simply call the pipeline like a function!
# It handles all the complexity (tokenization, generation, decoding)
#
# The pipeline accepts the same generation parameters as model.generate()
# Common parameters:
#   - max_new_tokens: Maximum tokens to generate
#   - temperature: Creativity control (must be > 0; only used when do_sample=True)
#   - do_sample: Enable/disable random sampling
#   - top_p, top_k: Fine-tune sampling behavior

response = llama_pipe(
    chat,                    # Our chat messages
    max_new_tokens=1000      # Generate up to 1000 new tokens
)

# Let's examine the raw response structure:
print("=" * 60)
print("📦 RAW RESPONSE STRUCTURE:")
print("=" * 60)
print(response)
print("=" * 60)

📦 RAW RESPONSE STRUCTURE:
[{'generated_text': [{'role': 'user', 'content': 'Explain what is Generative AI in 2 bullet points'}, {'role': 'assistant', 'content': '1. Generative AI is a type of artificial intelligence that can generate new ideas, concepts, and solutions based on data. It is a form of machine learning that uses algorithms to analyze large amounts of data and generate new insights or solutions.\n\n2. Generative AI can be used in various industries, including finance, healthcare, marketing, and education. It can help businesses to identify new products, services, and marketing strategies, as well as improve customer experience and reduce costs.\n\n3. Generative AI can also be used to create new content, such as blog posts, social media posts, and videos. It can generate content based on user data, such as browsing history or search queries, and can create content that is tailored to the specific needs and interests of the user.\n\n4. Generative AI can also be used to cre

In [23]:
# ============================================================================
# 📤 EXTRACTING THE ASSISTANT'S RESPONSE
# ============================================================================
# The pipeline returns a nested structure:
#   response[0]["generated_text"] = list of all messages (input + generated)
#   [-1] gets the last message (the assistant's response)
#   ['content'] extracts just the text content
#
# Structure breakdown:
#   response = [
#       {
#           "generated_text": [
#               {"role": "user", "content": "..."},        # Original input
#               {"role": "assistant", "content": "..."}   # Generated response ← We want this!
#           ]
#       }
#   ]

# Extract just the assistant's message content:
assistant_response = response[0]["generated_text"][-1]['content']

print("=" * 60)
print("🤖 ASSISTANT'S RESPONSE (CLEAN):")
print("=" * 60)
print(assistant_response)
print("=" * 60)

🤖 ASSISTANT'S RESPONSE (CLEAN):
1. Generative AI is a type of artificial intelligence that can generate new ideas, concepts, and solutions based on data. It is a form of machine learning that uses algorithms to analyze large amounts of data and generate new insights or solutions.

2. Generative AI can be used in various industries, including finance, healthcare, marketing, and education. It can help businesses to identify new products, services, and marketing strategies, as well as improve customer experience and reduce costs.

3. Generative AI can also be used to create new content, such as blog posts, social media posts, and videos. It can generate content based on user data, such as browsing history or search queries, and can create content that is tailored to the specific needs and interests of the user.

4. Generative AI can also be used to create new products, such as virtual assistants or chatbots. These AI-powered tools can help businesses to improve customer service, reduce
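
The extraction step above can be wrapped in a small helper, and the returned message list can be reused to continue the conversation. The sketch below runs on a stand-in response shaped like the raw structure printed earlier; the helper name `last_assistant_message` and the sample data are hypothetical.

```python
def last_assistant_message(response):
    """Return the content of the final message in the first generated conversation."""
    conversation = response[0]["generated_text"]
    last = conversation[-1]
    if last["role"] != "assistant":
        raise ValueError("the last message was not produced by the assistant")
    return last["content"]


# Stand-in for a pipeline result, shaped like the raw response printed above.
sample_response = [{
    "generated_text": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there! How can I help you today?"},
    ]
}]

reply = last_assistant_message(sample_response)
print(reply)

# To continue the conversation, reuse the returned history and add the next user turn.
next_chat = sample_response[0]["generated_text"] + [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
]
print(len(next_chat))
```

Passing a list like `next_chat` back to the pipeline gives the model the earlier turns as context for its next reply.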

## 📝 Summary & Key Takeaways

### What We Learned:

1. **Hugging Face Transformers** provides easy access to thousands of pre-trained models
2. **Loading models** requires two components: **Tokenizer** + **Model**
3. **Chat templates** format messages correctly for each model family
4. **Manual workflow**: encode → generate → decode
5. **Pipelines** simplify everything into a single function call

### Key Code Patterns:

```python
# Loading a model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("model_id")
model = AutoModelForCausalLM.from_pretrained("model_id")

# Using pipelines (recommended for simplicity)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
response = pipe([{"role": "user", "content": "Your question"}])
```

### 🎯 When to Use Each Approach:

| Approach | Use Case |
|----------|----------|
| **Manual** (encode/generate/decode) | Fine-grained control, custom generation logic |
| **Pipeline** | Quick prototyping, standard use cases |
| **API/Cloud** (Groq, HF Inference) | No local GPU, production deployments |

---

## 🚀 Next Steps

- Explore **notebook 4**: Using Hugging Face Inference Client (API-based)
- Explore **notebook 5**: Using Groq Cloud for faster inference
- Explore **notebook 6**: Integrating with LangChain

## 📚 Additional Resources

- [Hugging Face Hub](https://huggingface.co/models) - Browse models
- [Transformers Documentation](https://huggingface.co/docs/transformers) - Official docs
- [Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies) - Deep dive into text generation
