## Using LLMs via Hugging Face Inference Client

### What is the Hugging Face Inference Client?

The **Hugging Face Inference Client** (`InferenceClient`, part of the `huggingface_hub` Python library) lets you interact with Large Language Models (LLMs) hosted on Hugging Face's servers, **without needing to download or run the models locally**.

### Why Use It?

| Advantage | Description |
|-----------|-------------|
| **Free Tier Available** | HuggingFace offers a [free inference API](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client) with basic rate limits |
| **No Infrastructure Needed** | Access 150,000+ models without GPU/hardware requirements |
| **Easy to Use** | Simple Python API similar to OpenAI's client |
| **Wide Model Selection** | Access to latest open-source models like Llama, Mistral, etc. |

### Prerequisites
- A Hugging Face account (free)
- A Hugging Face API token (get it from [Settings > Access Tokens](https://huggingface.co/settings/tokens))
- `huggingface_hub` and `python-dotenv` libraries installed (`pip install huggingface_hub python-dotenv`)

In [7]:
# =============================================================================
# Step 1: Import the Hugging Face Hub Library
# =============================================================================

import huggingface_hub

# Print the version to ensure compatibility
# IMPORTANT: Version should be >= 0.36.0 for Inference Providers to work properly
print(f"huggingface_hub version: {huggingface_hub.__version__}")

# Import the InferenceClient class
# This is the main class we'll use to interact with HuggingFace's hosted models
from huggingface_hub import InferenceClient

huggingface_hub version: 0.36.0
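
The version check above is done by eye. As a minimal sketch, it could also be done programmatically. The helper names `parse_version` and `meets_minimum` are hypothetical (they are not part of `huggingface_hub`), and the minimum of `0.36.0` is simply the value stated in the cell above. Comparing integer tuples avoids the pitfall of comparing version strings, where `"0.4.0"` would sort after `"0.36.0"`.

```python
import re
from typing import Tuple

# Minimum version stated in this notebook (an assumption taken from the cell above).
MIN_VERSION = (0, 36, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.36.0.dev0' -> (0, 36, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version: str, minimum: Tuple[int, ...] = MIN_VERSION) -> bool:
    """Compare versions numerically, padding the shorter tuple with zeros."""
    found = parse_version(version)
    width = max(len(found), len(minimum))
    padded_found = found + (0,) * (width - len(found))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum
```

In the notebook this would be called as `meets_minimum(huggingface_hub.__version__)`. Note that this simple parser treats pre-releases such as `0.36.0rc1` as equal to `0.36.0`.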


### Step 2: Setting Up Authentication

To use the Inference API, you need to authenticate with your Hugging Face API token. We'll load it securely from environment variables using the `python-dotenv` library.

> 🔐 **Security Best Practice**: Never hardcode your API tokens directly in code. Always use environment variables or secret management tools.

📚 **Documentation**: Feel free to refer to the [official InferenceClient documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) for more details on available methods and parameters.

In [8]:
# =============================================================================
# Step 2: Load API Token from Environment Variables
# =============================================================================

from dotenv import load_dotenv  # Library to load variables from .env file
import os  # Standard library for OS operations

# Load environment variables from a .env file in the project root
# Your .env file should contain: HF_TOKEN=your_huggingface_api_token_here
load_dotenv()

# Retrieve the Hugging Face API token from environment variables
# This token authenticates your requests to the HuggingFace Inference API
hf_key = os.getenv("HF_TOKEN")

# Optional: Verify the token was loaded (don't print the actual token!)
if hf_key:
    print("✅ HuggingFace API token loaded successfully!")
else:
    print("❌ Warning: HF_TOKEN not found. Please check your .env file.")

✅ HuggingFace API token loaded successfully!
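
The cell above only prints a warning when the token is missing, so later cells would still run and fail at request time. A minimal sketch of a fail-fast alternative follows. `load_hf_token` and `mask_token` are hypothetical helper names introduced here for illustration; only `os.getenv` and the `HF_TOKEN` variable name come from the notebook.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token from the environment, or raise if it is missing or blank."""
    token = os.getenv(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Add it to your .env file or shell environment."
        )
    return token


def mask_token(token: str, visible: int = 4) -> str:
    """Hide all but the first few characters so the token can be logged safely."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)
```

Calling `load_hf_token()` right after `load_dotenv()` surfaces a missing token immediately, and `mask_token` gives a way to confirm which token was loaded without printing it in full.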


### Step 3: Making Your First API Call

Now let's use the `InferenceClient` to interact with a Large Language Model. We'll use **SmolLM3-3B** (`HuggingFaceTB/SmolLM3-3B`), which is:
- An open-weight model available for free
- Instruction-tuned (optimized to follow instructions)
- 3 billion parameters (small enough to respond quickly)

#### Key Concepts:
- **Chat Completion**: A conversation-style API where you send messages with roles (user, assistant, system)
- **Messages Format**: A list of dictionaries with `role` and `content` keys
- **max_tokens**: Controls the maximum length of the generated response
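
To make the messages format concrete, here is a minimal sketch of a validator that checks a conversation before it is sent. `validate_messages` is a hypothetical helper, not part of `huggingface_hub`. It assumes text-only messages and follows the common convention that a system message, if present, comes first.

```python
from typing import Dict, List

# Roles described in the Key Concepts list above.
VALID_ROLES = ("system", "user", "assistant")


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if the list does not follow the role/content format."""
    if not messages:
        raise ValueError("messages must contain at least one entry")
    for index, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in VALID_ROLES:
            raise ValueError(f"message {index}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"message {index}: content must be a non-empty string")
        if role == "system" and index != 0:
            raise ValueError("a system message, if present, should come first")


# A valid single-turn conversation passes silently.
validate_messages([{"role": "user", "content": "Explain what is Generative AI"}])
```

Running a check like this locally catches typos such as `"roles"` or `"human"` before they cost an API call.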


In [13]:
# =============================================================================
# Step 3: Create the Inference Client and Make a Chat Completion Request
# =============================================================================

# Define the model to use
# Format: "organization/model-name"
# Note: Only models with "warm" inference status work with the free API
# You can find available models at: https://huggingface.co/models?inference=warm
model_name = "HuggingFaceTB/SmolLM3-3B"

# Initialize the InferenceClient with your API token
# This client handles all communication with HuggingFace's servers
client = InferenceClient(token=hf_key)

# Define the conversation as a list of messages
# Each message has:
#   - "role": Who is speaking ("system", "user", or "assistant")
#   - "content": The actual message text
# 
# Common roles:
#   - "system": Sets the behavior/personality of the AI (optional)
#   - "user": Messages from the human user
#   - "assistant": Previous responses from the AI (for multi-turn conversations)
chat = [
    {
        "role": "user",
        "content": "Explain what is Generative AI in 2 bullet points"
    },
]

# Make the API call using chat_completion()
# Parameters:
#   - messages: The conversation history (our 'chat' list)
#   - model: Which model to use for generation
#   - max_tokens: Maximum number of tokens in the response (1 token ≈ 4 characters)
response = client.chat_completion(chat, model=model_name, max_tokens=1000)

# Print the full response object to see its structure
print("Full API Response:")
print(response)

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://router.huggingface.co/hf-inference/models/HuggingFaceTB/SmolLM3-3B/v1/chat/completions (Request ID: Root=1-697c6e30-1b7acd773d8a06a474da36a7;de3b34e0-0feb-4fad-95f1-3ee1e5a0ebf8)

Invalid username or password.
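
A `401 Unauthorized` response like the one above generally means the server rejected the credentials: the token may be missing, mistyped, revoked, or (for fine-grained tokens) lacking permission to call Inference Providers. The sketch below is a hypothetical troubleshooting aid, not part of `huggingface_hub`. The function name and the hint wording are assumptions; only the HTTP status code meanings are standard.

```python
# Hypothetical helper: maps common HTTP status codes to troubleshooting hints.
STATUS_HINTS = {
    401: "Token rejected: check that HF_TOKEN is set, current, and has inference permission.",
    403: "Access denied: the token is valid but not allowed to use this model or provider.",
    404: "Not found: check the model id and whether the provider serves this model.",
    429: "Rate limited: wait before retrying or reduce the request rate.",
    503: "Service unavailable: the model may still be loading; retry later.",
}


def hint_for_status(status_code: int) -> str:
    """Return a short hint for a status code, with a generic fallback."""
    return STATUS_HINTS.get(
        status_code,
        f"Unexpected HTTP status {status_code}; read the full error message.",
    )


print(hint_for_status(401))
```

For the error shown here, the first thing to verify is that the `.env` file holds a current token copied from the Access Tokens settings page.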

In [15]:
import os
from huggingface_hub import InferenceClient

# Select the provider with the `provider` argument.
# The "model:provider" suffix (e.g. "HuggingFaceTB/SmolLM3-3B:hf-inference")
# was rejected as an invalid repo id (HFValidationError) by the version used here.
client = InferenceClient(
    provider="hf-inference",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)

print(completion.choices[0].message)

### Step 4: Extracting the Response

The API returns a `ChatCompletionOutput` object with several fields:
- `choices`: List of generated responses (usually just one)
- `id`: Unique identifier for this request
- `model`: The model that was used
- `usage`: Token usage statistics (prompt_tokens, completion_tokens, total_tokens)

To get just the text response, we need to navigate: `response.choices[0].message.content`
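
The navigation path above assumes the response always contains at least one choice with text content. As a minimal sketch, a small helper can make that assumption explicit. `extract_text` is a hypothetical name, and the `SimpleNamespace` object below is a stand-in that only mimics the attribute layout described in this section; it is not a real `ChatCompletionOutput`.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Return the first choice's text, raising a clear error if there is none."""
    choices = getattr(response, "choices", None) or []
    if not choices:
        raise ValueError("response contains no choices")
    content = choices[0].message.content
    return (content or "").strip()


# Stand-in object with the same attribute layout as the real response.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="  Paris.  "))]
)
print(extract_text(fake_response))
```

With the real client, the same helper would be called as `extract_text(response)`.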


In [None]:
# =============================================================================
# Step 4: Extract the Generated Text from the Response
# =============================================================================

# The response structure is:
# response
#   └── choices (list of completions)
#       └── [0] (first/only choice)
#           └── message
#               └── content (the actual generated text)

# Extract just the text content
generated_text = response.choices[0].message.content
print("Generated Response:")
print("-" * 50)
print(generated_text)
print("-" * 50)

# Bonus: Let's also look at the token usage
print("\n📊 Token Usage Statistics:")
print(f"   - Prompt tokens: {response.usage.prompt_tokens}")
print(f"   - Completion tokens: {response.usage.completion_tokens}")
print(f"   - Total tokens: {response.usage.total_tokens}")

Here are 2 bullet points explaining what Generative AI is:

• **Definition**: Generative AI refers to a type of artificial intelligence that can create new, original content such as images, music, text, or videos using algorithms and machine learning models. These models are trained on large datasets and can learn patterns, styles, and structures to generate new content that is often indistinguishable from human-created work.

• **Applications**: Generative AI has numerous applications across various industries, including art and design, music and audio production, writing and content creation, and even product design. Some examples of generative AI include generating realistic images of people, creating new music tracks, or producing automated content such as news articles or social media posts.


### Bonus: Advanced Usage with System Prompt

You can customize the AI's behavior using a **system prompt**. This is especially useful for:
- Setting a specific persona or role
- Defining output format requirements
- Establishing constraints or guidelines


In [5]:
# =============================================================================
# Bonus: Using System Prompts to Customize AI Behavior
# =============================================================================

# Define a conversation with a system prompt
# The system prompt sets the AI's persona and behavior rules
chat_with_system = [
    {
        "role": "system",
        "content": "You are a helpful coding tutor. Explain concepts simply and use analogies. Keep responses concise."
    },
    {
        "role": "user",
        "content": "What is an API?"
    }
]

# Make the request with the system prompt
response_with_system = client.chat_completion(
    chat_with_system,
    model=model_name,
    max_tokens=500
)

print("Response with System Prompt:")
print("-" * 50)
print(response_with_system.choices[0].message.content)


Response with System Prompt:
--------------------------------------------------
Imagine you're at a restaurant and you want to order food. You can't just walk into the kitchen and start making your own food, right? You need to tell the waiter what you want, and they'll order it for you.

An API (Application Programming Interface) is like the waiter. You give the waiter (API) instructions (requests), and they go to the kitchen (server) to get what you need (data). The waiter then brings back the data (response) to you.

In code, you send a request to the API, and it returns data that you can use in your program. APIs help different apps and systems talk to each other and share data.


---

## 📝 Summary

In this notebook, you learned how to:

1. **Set up** the Hugging Face Inference Client
2. **Authenticate** using API tokens stored in environment variables
3. **Make API calls** to open-source LLMs hosted on HuggingFace
4. **Parse responses** to extract the generated text
5. **Use system prompts** to customize AI behavior

## 🔗 Additional Resources

- [HuggingFace Inference Client Documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client)
- [Available Models for Inference](https://huggingface.co/models?inference=warm)
- [Chat Completion API Reference](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion)

## 💡 Try It Yourself

Experiment with:
- Different models (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- Different `max_tokens` values
- Adding multi-turn conversations
- Using different system prompts to change the AI's personality
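
For the multi-turn experiment, the key idea is that the model has no memory between calls: each request must carry the whole conversation, including earlier assistant replies. The sketch below shows one way to manage that history. The `Conversation` class and its `send` callback are hypothetical names introduced for illustration, and the `echo_backend` function is a fake stand-in so the example runs without network access.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the running message history and appends each reply to it."""

    def __init__(self, send: Callable[[List[Message]], str], system_prompt: str = ""):
        self._send = send  # function that takes messages and returns reply text
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # pass a copy of the full history
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: List[Message]) -> str:
    """Fake backend: reports how many user turns it has seen so far."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"This is my reply to user turn {user_turns}."


chat_session = Conversation(echo_backend, system_prompt="You are a helpful tutor.")
print(chat_session.ask("What is an API?"))
print(chat_session.ask("Can you give an example?"))
```

To use the real model instead of the fake backend, `send` could be a function that wraps the call already used in this notebook, for example `lambda msgs: client.chat_completion(msgs, model=model_name, max_tokens=500).choices[0].message.content`.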
