# Getting Started with LLM Serving on Databricks: A Comprehensive Guide

## Introduction

Large language models (LLMs) have changed how teams build intelligent applications, but deploying and querying them at scale remains challenging. Databricks addresses this with Foundation Model APIs, which give access to state-of-the-art language models with minimal setup. This guide walks through getting started with LLM serving on Databricks: choosing an endpoint type, authenticating, sending a first query, and planning the move to production.

Whether you're prototyping a generative AI application or building production-ready solutions, Databricks offers options for both: pay-per-token endpoints for experimentation and provisioned throughput for performance-critical workloads.


## Understanding Foundation Model APIs

Foundation Model APIs on Databricks provide two primary deployment options:

### 1. **Pay-Per-Token Endpoints**
Perfect for getting started, prototyping, and variable workloads where you only pay for what you use. These endpoints are automatically available in your Databricks workspace's Serving UI, providing instant access to popular foundation models without any infrastructure setup.

### 2. **Provisioned Throughput Endpoints**
Recommended for production workloads that require:
- Fine-tuned custom models
- Performance guarantees and SLAs
- Predictable latency
- High-volume, consistent traffic patterns

## Prerequisites

Before diving into LLM serving on Databricks, ensure you have:

1. **A Databricks Workspace** in a [supported region](https://docs.databricks.com/aws/en/machine-learning/model-serving/model-serving-limits#regions) for Foundation Model APIs
2. **Personal Access Token (PAT)** for authenticating API requests to Mosaic AI Model Serving endpoints

### Security Best Practices

⚠️ **Important Security Note**: Databricks recommends:
- Using **machine-to-machine OAuth tokens** for authentication in production environments
- Using a personal access token that belongs to a **service principal**, rather than an individual user account, for testing and development
- Never hardcoding tokens in your application code
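
Keeping tokens out of code can be sketched as follows. This is a minimal example, assuming the token is exposed through an environment variable named `DATABRICKS_TOKEN` (the name used later in this guide); the helper `load_token` is a hypothetical name introduced here for illustration, not part of any Databricks library.

```python
import os


def load_token(env_var: str = "DATABRICKS_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is missing or blank."""
    # Assumption: the token was exported into the environment beforehand,
    # for example by a job configuration or a secret store integration.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it or read it from a secret store "
            "instead of pasting the token into code."
        )
    return token
```

Failing fast here avoids sending requests with an empty key, which would otherwise surface later as a less obvious authentication error.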

## Hands-On: Querying Your First LLM

Let's walk through a practical example of querying the **Meta Llama 3.1 405B Instruct** model using the OpenAI client. This example demonstrates how easily you can integrate Databricks Foundation Model APIs into your applications.

### Step 1: Setup and Configuration

The code below should be run in a Databricks notebook:

```python
from openai import OpenAI
import os

# Retrieve your Databricks personal access token
DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

# Initialize the OpenAI client with Databricks endpoint
client = OpenAI(
    api_key=DATABRICKS_TOKEN,  # Your personal access token
    base_url='https://<workspace_id>.databricks.com/serving-endpoints',  # Your Databricks workspace URL
)
```

**Key Configuration Points:**
- Replace `<workspace_id>` with your actual Databricks workspace instance
- Store your token securely using environment variables
- The base URL points to your workspace's serving endpoints
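
A workspace address copied from the browser often carries a trailing slash or lacks a scheme, so it can help to normalize it before passing it as `base_url`. The sketch below is one way to do that; `serving_base_url` is a hypothetical helper, and `example.cloud.databricks.com` is a made-up host used only for illustration.

```python
def serving_base_url(workspace_host: str) -> str:
    """Build the serving-endpoints base URL from a workspace host or URL."""
    host = workspace_host.strip().rstrip("/")
    if not host:
        raise ValueError("workspace_host must not be empty")
    if "<" in host or ">" in host:
        raise ValueError("replace the <workspace_id> placeholder first")
    if "://" in host:
        scheme = host.partition("://")[0]
        if scheme != "https":
            raise ValueError("the workspace URL must use https")
    else:
        host = "https://" + host
    # Avoid appending the path twice if the caller already included it.
    if host.endswith("/serving-endpoints"):
        return host
    return host + "/serving-endpoints"


# Hypothetical host, for illustration only.
print(serving_base_url("example.cloud.databricks.com/"))
```

The result can then be passed as the `base_url` argument when constructing the client.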

### Step 2: Making Your First Request

```python
# Create a chat completion request
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant",
        },
        {
            "role": "user",
            "content": "What is a mixture of experts model?",
        }
    ],
    model="databricks-meta-llama-3-1-405b-instruct",  # Model endpoint name
    max_tokens=256
)

# Display the response
print(chat_completion.choices[0].message.content)
```

### Step 3: Understanding the Response

The API returns a structured JSON response:

```json
{
  "id": "xxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": "xxxxxxxxx",
  "model": "databricks-meta-llama-3-1-405b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance. Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 23,
    "total_tokens": 146
  }
}
```

**Key Response Elements:**
- `choices[0].message.content`: The model's generated response
- `usage`: Token consumption breakdown for billing and monitoring
- `finish_reason`: Indicates why generation stopped (e.g., `"stop"` for natural completion, or `"length"` when the `max_tokens` limit cut the response short)
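
These fields can be pulled out of the response programmatically. The following is a minimal sketch that works on a plain dictionary in the shape shown above (for example, the result of calling `model_dump()` on the client's response object); `summarize_response` is a hypothetical helper, and the check for `"length"` assumes the OpenAI-compatible convention for responses cut off by `max_tokens`.

```python
def summarize_response(response: dict) -> dict:
    """Extract the generated text, a truncation flag, and token counts."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        # Assumption: "length" marks a response stopped by the max_tokens limit.
        "truncated": choice.get("finish_reason") == "length",
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


# Abbreviated copy of the sample response above, so the sketch runs offline.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "A Mixture of Experts (MoE) model is a machine learning technique..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 123, "completion_tokens": 23, "total_tokens": 146},
}

summary = summarize_response(sample)
print(summary["truncated"], summary["total_tokens"])
```

A `truncated` value of `True` is a hint to raise `max_tokens` or shorten the prompt before retrying.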

### Troubleshooting Common Issues

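If you encounter an `ImportError: cannot import name 'OpenAI' from 'openai'`, the installed `openai` package most likely predates version 1.0, which introduced the `OpenAI` client class. Upgrade the package and restart Python: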

```python
%pip install -U openai
dbutils.library.restartPython()
```
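
To confirm which version is installed before or after upgrading, a small check along these lines may help. It assumes the `OpenAI` class is available from major version 1 of the `openai` package onward; the helper name `has_openai_client_class` is made up for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def has_openai_client_class(installed_version: str) -> bool:
    """Return True when the version string has a major version of 1 or higher."""
    major = installed_version.split(".", 1)[0]
    return major.isdigit() and int(major) >= 1


try:
    installed = version("openai")
    print(f"openai {installed}: OpenAI class expected = {has_openai_client_class(installed)}")
except PackageNotFoundError:
    print("openai is not installed in this environment")
```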

## Available Foundation Models

Databricks provides access to a curated selection of state-of-the-art foundation models, including:

- **Meta Llama 3.1** series (8B, 70B, 405B parameter variants)
- **Mistral** models
- **DBRX** - Databricks' own foundation model
- Additional models, whose availability varies by region and changes over time

For the complete list of supported models, visit the [Databricks Foundation Model APIs documentation](https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/supported-models).

## Beyond Basic Querying: Next Steps

Once you're comfortable with basic LLM queries, explore these advanced capabilities:

### 1. **AI Playground**
Test and experiment with different models through an interactive chat interface before integrating them into your applications. The AI Playground provides a low-code environment for prompt engineering and model comparison.

### 2. **External Models Integration**
Access models hosted outside Databricks (like OpenAI, Anthropic, or Cohere) through a unified interface, enabling vendor flexibility and multi-model strategies.

### 3. **Fine-Tuning and Custom Models**
Deploy your own fine-tuned models using provisioned throughput endpoints for specialized use cases that require domain-specific knowledge or behavior.

### 4. **Building AI Agents**
Leverage Databricks' Agent Framework to build sophisticated AI agents with tool-calling capabilities, retrieval-augmented generation (RAG), and multi-step reasoning.

### 5. **Monitoring and Observability**
Implement comprehensive monitoring for model quality, endpoint health, latency metrics, and cost optimization to ensure production reliability.

## Real-World Use Cases

Foundation Model APIs on Databricks enable various enterprise applications:

- **Customer Support Automation**: Deploy conversational AI agents that understand context and provide accurate responses
- **Document Intelligence**: Extract insights from unstructured data, summarize reports, and answer questions about internal documents
- **Code Generation**: Build developer productivity tools that generate, explain, and debug code
- **Content Creation**: Automate marketing copy, product descriptions, and personalized communications
- **Data Analysis**: Generate SQL queries, create visualizations, and explain analytical insights in natural language

## Cost Optimization Tips

When working with Foundation Model APIs:

1. **Start with Pay-Per-Token**: Test and validate your use case before committing to provisioned throughput
2. **Monitor Token Usage**: Track the `usage` field in responses to optimize prompt design and reduce costs
3. **Right-Size Your Requests**: Use the `max_tokens` parameter to cap response length and avoid unnecessary token consumption
4. **Cache Responses**: For repeated queries, implement caching strategies to reduce API calls
5. **Upgrade to Provisioned Throughput**: Move to provisioned throughput once your traffic patterns are predictable and you need performance guarantees
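
Tips 2 and 4 can be combined in a thin wrapper. The sketch below is one possible approach, not a Databricks feature: `CachedChat` and `fake_send` are hypothetical names, and the `send` callable is assumed to return a response dictionary in the shape shown under Step 3. Caching like this only makes sense when an identical request should receive an identical answer.

```python
import hashlib
import json


class CachedChat:
    """Memoize chat completions by request content and keep a running token tally."""

    def __init__(self, send):
        # `send` is any callable taking (model, messages, max_tokens)
        # and returning a response dict with an optional "usage" field.
        self._send = send
        self._cache = {}
        self.total_tokens = 0

    def _key(self, model, messages, max_tokens):
        # Hash a canonical JSON form so equal requests map to the same key.
        payload = json.dumps(
            {"model": model, "messages": messages, "max_tokens": max_tokens},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, messages, max_tokens=256):
        key = self._key(model, messages, max_tokens)
        if key not in self._cache:
            response = self._send(model, messages, max_tokens)
            # Only real calls consume tokens; cache hits add nothing.
            self.total_tokens += response.get("usage", {}).get("total_tokens", 0)
            self._cache[key] = response
        return self._cache[key]


def fake_send(model, messages, max_tokens):
    # Stand-in for a real endpoint call so the sketch runs offline.
    return {"choices": [{"message": {"content": "stub"}}], "usage": {"total_tokens": 10}}


chat = CachedChat(fake_send)
question = [{"role": "user", "content": "What is a mixture of experts model?"}]
chat.complete("databricks-meta-llama-3-1-405b-instruct", question)
chat.complete("databricks-meta-llama-3-1-405b-instruct", question)  # served from the cache
print(chat.total_tokens)
```

With the real client, `send` could wrap `client.chat.completions.create(...)` and return `model_dump()` of the result.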

## Conclusion

Databricks Foundation Model APIs democratize access to powerful LLMs, making it easier than ever to integrate advanced AI capabilities into your applications. With minimal setup, you can go from zero to querying state-of-the-art models in minutes.

The combination of pay-per-token flexibility for experimentation and provisioned throughput for production workloads ensures you have the right tools for every stage of your AI journey. Whether you're building a proof-of-concept or deploying mission-critical applications, Databricks provides the infrastructure, governance, and tooling to succeed.

## Additional Resources

- [Databricks Foundation Model APIs Documentation](https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/)
- [AI Playground Tutorial](https://docs.databricks.com/aws/en/large-language-models/ai-playground)
- [Building Generative AI Applications Guide](https://docs.databricks.com/aws/en/generative-ai/guide/introduction-generative-ai-apps)
- [MLflow for Generative AI](https://docs.databricks.com/aws/en/mlflow3/genai/)
- [Model Serving Best Practices](https://docs.databricks.com/aws/en/machine-learning/model-serving/)

---

*Ready to start building with LLMs on Databricks? Create your free workspace and begin experimenting with Foundation Model APIs today!*

---

**About the Author**: This guide is designed for data engineers, ML engineers, and solution architects looking to leverage LLMs in their data platforms. For questions or discussions about LLM serving on Databricks, feel free to reach out through the Databricks Community forums.

In [0]:
import json

# Convert the ChatCompletion object to a dictionary
response_dict = chat_completion.model_dump()

# Pretty print as JSON
print(json.dumps(response_dict, indent=2))

In [0]:
response_dict

In [0]:
print("Token consumption breakdown for billing and monitoring")
print(f"completion_tokens count is {response_dict['usage']['completion_tokens']}")
print(f"prompt_tokens count is {response_dict['usage']['prompt_tokens']}")
print(f"total_tokens count is {response_dict['usage']['total_tokens']}")