# LlamaStack 0.3.0 API Testing

This notebook demonstrates how to connect to a LlamaStack server and perform basic operations using the **0.3.0 API**.

## Key API Changes in 0.3.0

| Old (0.2.x) | New (0.3.0) |
|-------------|-------------|
| `client.inference.chat_completion()` | `client.chat.completions.create()` |
| `client.vector_dbs.list()` | `client.vector_stores.list()` |
| `vdb.identifier` | `vs.id` |
| `vdb.vector_db_name` | `vs.name` |
| `rag_tool.query(vector_db_ids=[...])` | `vector_stores.search(vector_store_id=...)` |
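
While migrating, code may need to handle objects from both API generations. The following is a minimal sketch of a compatibility shim for the attribute renames in the table; the helper name `store_id_and_name` and the stand-in objects are hypothetical and not part of the LlamaStack client.

```python
from types import SimpleNamespace


def store_id_and_name(store):
    """Return (id, name) for a 0.3.0 vector store or a 0.2.x vector DB object."""
    # Prefer the 0.3.0 attribute names, then fall back to the 0.2.x ones.
    store_id = getattr(store, "id", None) or getattr(store, "identifier", None)
    name = getattr(store, "name", None) or getattr(store, "vector_db_name", None)
    return store_id, name


# Stand-in objects for illustration only; real ones come from the client.
new_style = SimpleNamespace(id="vs_123", name="competitor-docs")
old_style = SimpleNamespace(identifier="db_123", vector_db_name="competitor-docs")

print(store_id_and_name(new_style))
print(store_id_and_name(old_style))
```

A shim like this is only useful during a transition; once everything runs on 0.3.0, reading `vs.id` and `vs.name` directly is simpler.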

Let's start by querying the LlamaStack server to check that it is healthy and ready to accept requests. First, install the LlamaStack client using pip:

In [9]:
%pip install llama-stack-client==0.3.0 rich


Note: you may need to restart the kernel to use updated packages.


Now import the `LlamaStackClient` class and set the base URL of the LlamaStack server. You can get the service URL by running `oc get svc -n competitor-analysis`.

In [None]:
from llama_stack_client import LlamaStackClient
import rich

# For access from Notebooks within the cluster
LS_URL = "http://llama-stack-dist-service:8321"

# For access from Notebooks external to the cluster, use the route URL instead (oc get route -n competitor-analysis)
#LS_URL = "https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/"

# Initialize the client
client = LlamaStackClient(base_url=LS_URL)
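
Before issuing further requests, it can help to confirm that the server responds. The following is a minimal sketch of a health check using only the standard library. It assumes the server exposes a `GET /v1/health` route that returns JSON such as `{"status": "OK"}`; the route, the payload shape, and the helper name `is_healthy` are assumptions and are not taken from this notebook. The `opener` parameter exists only so the function can be exercised without a live server.

```python
import io
import json
import urllib.request


def is_healthy(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return True if GET {base_url}/v1/health reports status "OK"."""
    url = base_url.rstrip("/") + "/v1/health"  # assumed health route
    try:
        with opener(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection and HTTP errors; ValueError covers bad JSON.
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("status", "")).upper() == "OK"


# Stand-in openers for illustration; they simulate a healthy and an unreachable server.
def fake_ok_opener(url, timeout=5):
    return io.BytesIO(b'{"status": "OK"}')


def fake_failing_opener(url, timeout=5):
    raise OSError("connection refused")


print(is_healthy("http://llama-stack-dist-service:8321", opener=fake_ok_opener))
print(is_healthy("http://llama-stack-dist-service:8321", opener=fake_failing_opener))
```

Against a real deployment, the call would be `is_healthy(LS_URL)` with the default opener.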

List the models available in this LlamaStack instance.

In [14]:
models = client.models.list()
rich.print(models)

INFO:httpx:HTTP Request: GET https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/models "HTTP/1.1 200 OK"


Now list the providers available in this LlamaStack instance.

In [17]:
providers = client.providers.list()
rich.print(providers[:3])  # Print only the first 3 providers
remaining = max(len(providers) - 3, 0)  # Avoid a negative count when fewer than 3 exist
print(f"\n... and {remaining} more providers")

INFO:httpx:HTTP Request: GET https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/providers "HTTP/1.1 200 OK"



... and 15 more providers


## Chat Completion (OpenAI-Compatible API)

In LlamaStack 0.3.0, use `client.chat.completions.create()` instead of the deprecated `client.inference.chat_completion()`.

In [18]:
# Get the LLM model for inference
inference_model_id = next(m.identifier for m in models if m.model_type == "llm")
print(f"Using model: {inference_model_id}")

prompt = "What is the capital of Mongolia?"

# LlamaStack 0.3.0: Use OpenAI-compatible chat.completions.create() API
response = client.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    model=inference_model_id  # Note: 'model' not 'model_id'
)
rich.print(response)

# Extract the answer
if response.choices and len(response.choices) > 0:
    answer = response.choices[0].message.content
    print(f"\n✅ Answer: {answer}")

Using model: vllm-inference/granite-3-3-8b-instruct


INFO:httpx:HTTP Request: POST https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/chat/completions "HTTP/1.1 200 OK"



✅ Answer: The capital city of Mongolia is Ulaanbaatar.
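
The answer-extraction step above can be wrapped in a small helper so that later cells do not repeat the guard. This is a minimal sketch; `first_answer` is a hypothetical helper, and the stand-in response below only mimics the `choices[0].message.content` shape used in the cell above.

```python
from types import SimpleNamespace


def first_answer(response, default=""):
    """Return the text of the first choice, or `default` if none is available."""
    choices = getattr(response, "choices", None) or []
    if not choices:
        return default
    message = getattr(choices[0], "message", None)
    content = getattr(message, "content", None)
    return content if content is not None else default


# Stand-in response for illustration; a real one comes from chat.completions.create().
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Ulaanbaatar"))]
)
empty_response = SimpleNamespace(choices=[])

print(first_answer(fake_response))
print(first_answer(empty_response, default="(no answer)"))
```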


## List Vector Stores (0.3.0 API)

In LlamaStack 0.3.0, `vector_dbs` has been renamed to `vector_stores`. The attributes have also changed:
- `identifier` → `id`
- `vector_db_name` → `name`


In [19]:
# List all vector stores using the 0.3.0 API
vector_stores = list(client.vector_stores.list())

if not vector_stores:
    print("❌ No vector stores found!")
    print("Run the KFP pipeline to ingest documents first")
else:
    print(f"✅ Found {len(vector_stores)} vector store(s)\n")
    
    for vs in vector_stores:
        # 0.3.0 API: 'id' is the primary identifier, 'name' is the logical name
        vs_id = getattr(vs, 'id', None)
        vs_name = getattr(vs, 'name', None)
        file_counts = getattr(vs, 'file_counts', None)
        
        print(f"Vector Store: {vs_name}")
        print(f"  ID: {vs_id}")
        if file_counts:
            print(f"  Files: {file_counts.total} total, {file_counts.completed} completed, {file_counts.failed} failed")
        print()
        
        # Store for later use
        target_vector_store = vs


INFO:httpx:HTTP Request: GET https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/vector_stores "HTTP/1.1 200 OK"


✅ Found 1 vector store(s)

Vector Store: competitor-docs
  ID: vs_06031455-0d5c-420d-9dce-743229452311
  Files: 11 total, 11 completed, 0 failed
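
The listing cell keeps whichever store the loop visits last as `target_vector_store`. When several stores exist, selecting one by name may be more predictable. This is a minimal sketch; `pick_vector_store` is a hypothetical helper, and the stand-in objects only mimic the `id` and `name` attributes shown above.

```python
from types import SimpleNamespace


def pick_vector_store(stores, name):
    """Return the first store whose `name` matches, or None if there is no match."""
    return next((vs for vs in stores if getattr(vs, "name", None) == name), None)


# Stand-in stores for illustration; real ones come from client.vector_stores.list().
example_stores = [
    SimpleNamespace(id="vs_aaa", name="scratch"),
    SimpleNamespace(id="vs_bbb", name="competitor-docs"),
]

chosen = pick_vector_store(example_stores, "competitor-docs")
print(chosen.id if chosen else "No matching vector store")
```

Returning `None` instead of raising lets the caller decide whether a missing store means "run the ingestion pipeline first" or is a hard error.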



## RAG Query: Search Vector Store

Use `client.vector_stores.search()` to perform semantic search on the indexed documents.

**Note:** The legacy `rag_tool.query()` API with the `vector_db_ids` parameter may not work correctly with the new vector stores. Use the OpenAI-compatible `vector_stores.search()` instead.


In [20]:
# Define your search query
query_text = "What was the standalone Profit After Tax (PAT) for HDFC Bank in Q2 FY26?"

print(f"🔍 Query: {query_text}\n")

# Perform semantic search using the 0.3.0 API
try:
    search_response = client.vector_stores.search(
        vector_store_id=target_vector_store.id,
        query=query_text,
        max_num_results=5
    )
    
    # Display results
    if search_response.data and len(search_response.data) > 0:
        print(f"✅ Found {len(search_response.data)} results:\n")
        
        for i, item in enumerate(search_response.data):
            # Extract content
            content = ""
            if hasattr(item, 'content'):
                if isinstance(item.content, list):
                    content = " ".join([
                        c.text if hasattr(c, 'text') else str(c) 
                        for c in item.content
                    ])
                else:
                    content = str(item.content)
            
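            score = getattr(item, 'score', None)
            # Apply ':.4f' only to numeric scores; a string fallback would raise ValueError
            score_text = f"{score:.4f}" if isinstance(score, (int, float)) else "N/A"
            
            print(f"[{i+1}] Score: {score_text}")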
            print(f"    {content[:300]}...")
            print()
    else:
        print("❌ No results found for your query")
        
except Exception as e:
    print(f"❌ Search failed: {e}")
    import traceback
    traceback.print_exc()


🔍 Query: What was the standalone Profit After Tax (PAT) for HDFC Bank in Q2 FY26?



INFO:httpx:HTTP Request: POST https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/vector_stores/vs_06031455-0d5c-420d-9dce-743229452311/search "HTTP/1.1 200 OK"


✅ Found 5 results:

[1] Score: 0.9169
     ₹ 40.1 bn; down 22% YoY
- Net profit after tax of ₹ 1.8 bn compared to profit of ₹ 2.0 bn in the prior year
- EPS of ₹ 2.52
- Solvency Ratio at 210% as of September 30, 2025

## Subsidiaries - Q2FY26 update - HDFC Securities Ltd

- 94.11% stake held by the Bank as of September 30, 2025
- 7.4 millio...

[2] Score: 0.9029
     - 400 013. CIN:  L65920MH1994PLC080618

<!-- image -->

## NEWS RELEASE

HDFC Bank Ltd. HDFC Bank House, Senapati Bapat Marg, Lower Parel, Mumbai - 400 013. CIN:  L65920MH1994PLC080618

herein below are in accordance with the accounting standards used in their standalone reporting under the applica...

[3] Score: 0.9002
    13% YoY and AUM at ₹ 3.6 tn up by 11% YoY
- New Business Premium of ₹ 89 bn with new business margin at 24%
- Value of new business for the quarter ₹ 10.1 bn
- PAT of ₹ 4.5 bn up by 3% YoY
- Solvency Ratio at 175% as of September 30, 2025
- Embedded value at ₹ 595 bn improved 14% YoY
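
The content-flattening logic in the search cell is reused when building RAG context, so it may be worth factoring out. This is a minimal sketch under the same assumptions the cell makes: each result has a `content` attribute that is either a list of parts (some with a `.text` attribute) or a single value, plus an optional `score`. The helper names `result_text` and `format_score` are hypothetical.

```python
from types import SimpleNamespace


def result_text(item):
    """Flatten a search result's content into a single string."""
    content = getattr(item, "content", None)
    if content is None:
        return ""
    if isinstance(content, list):
        # Use each part's .text when present, otherwise its string form.
        return " ".join(c.text if hasattr(c, "text") else str(c) for c in content)
    return str(content)


def format_score(score):
    """Format a numeric score to 4 decimals, or 'N/A' when it is missing."""
    return f"{score:.4f}" if isinstance(score, (int, float)) else "N/A"


# Stand-in result for illustration; real items come from vector_stores.search().
example_item = SimpleNamespace(
    content=[SimpleNamespace(text="first chunk"), "second chunk"],
    score=0.91694,
)

print(format_score(example_item.score), result_text(example_item))
```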

## RAG-Augmented Generation

Now let's use the retrieved context to generate an answer using the LLM.


In [21]:
# Build context from search results
context_parts = []
if search_response.data:
    for item in search_response.data[:3]:  # Use top 3 results
        if hasattr(item, 'content'):
            if isinstance(item.content, list):
                text = " ".join([c.text if hasattr(c, 'text') else str(c) for c in item.content])
            else:
                text = str(item.content)
            context_parts.append(text)

context = "\n\n".join(context_parts)

# Create RAG prompt
rag_prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query_text}

Answer:"""

print("📝 Generating answer using RAG...\n")

# Generate answer using the LLM
rag_response = client.chat.completions.create(
    messages=[{"role": "user", "content": rag_prompt}],
    model=inference_model_id
)

# Display the answer
if rag_response.choices and len(rag_response.choices) > 0:
    answer = rag_response.choices[0].message.content
    print(f"✅ Answer:\n\n{answer}")


📝 Generating answer using RAG...



INFO:httpx:HTTP Request: POST https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/v1/chat/completions "HTTP/1.1 200 OK"


✅ Answer:

The standalone Profit After Tax (PAT) for HDFC Bank in Q2 FY26 was ₹ 196.1 bn.
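
The RAG prompt above passes the retrieved text straight to the model without a length limit and without telling it what to do when the context lacks the answer. The following is a minimal sketch of one way to build the messages with a character budget and a grounding instruction. The helper name `build_rag_messages`, the 4000-character default, and the system prompt wording are all assumptions made for illustration.

```python
def build_rag_messages(question, chunks, max_context_chars=4000):
    """Build chat messages for a RAG prompt, capping the total context length."""
    kept, used = [], 0
    for chunk in chunks:
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        piece = chunk[:remaining]  # Truncate the last chunk to fit the budget
        kept.append(piece)
        used += len(piece)

    context = "\n\n".join(kept)
    system = (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Illustrative input; in the notebook, the chunks would be `context_parts`.
example_messages = build_rag_messages("Q?", ["aaaa", "bbbb"], max_context_chars=6)
for message in example_messages:
    print(message["role"], repr(message["content"]))
```

In the notebook, the result could be passed as `messages=build_rag_messages(query_text, context_parts)` to `client.chat.completions.create()`. A character budget is a rough proxy for tokens; a tokenizer-based limit would be more precise.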
