# Index Building with Llama Stack & Milvus on watsonx.data


#### Prerequisites

Run the llama-stack server (version 0.3.0) with the watsonx distribution in a terminal:
```
llama stack list-deps watsonx | xargs -L1 uv pip install
uv run llama stack run watsonx
```

Set the following environment variables:
- **WATSONX_PROJECT_ID**
- **WATSONX_BASE_URL**
- **WATSONX_API_KEY**
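
Before starting the server, it can help to confirm that all three variables are visible to the process. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of llama-stack.

```python
import os

REQUIRED_VARS = ("WATSONX_PROJECT_ID", "WATSONX_BASE_URL", "WATSONX_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All watsonx environment variables are set.")
```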


> [!WARNING]
> llama-stack version 0.3.2 does not support embedding models from providers, so upgrading llama-stack may cause complications.


#### Import dependencies


In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage
import requests

#### Create LlamaStackClient object


In [4]:
base_url = os.getenv("REMOTE_BASE_URL", "http://localhost:8321")

client = LlamaStackClient(base_url=base_url)


#### Models used in the example

In [5]:
EMBEDDING_MODEL = "ibm/slate-125m-english-rtrvr-v2"
EMBEDDING_MODEL_DIMENSION = 768
LLM_MODEL = "meta-llama/llama-3-3-70b-instruct"

In [7]:
client.models.list()[0].model_dump()

INFO:httpx:HTTP Request: GET http://127.0.0.1:8321/v1/models "HTTP/1.1 200 OK"


{'identifier': 'watsonx/ibm/granite-3-2-8b-instruct',
 'metadata': {},
 'api_model_type': 'llm',
 'provider_id': 'watsonx',
 'type': 'model',
 'provider_resource_id': 'ibm/granite-3-2-8b-instruct',
 'model_type': 'llm'}

#### Test model availability via Llama Stack

In [11]:
resp = client.chat.completions.create(messages=[UserMessage(content="Generate a short poem", role="user")],
                                      model=LLM_MODEL)

resp.choices

INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/chat/completions "HTTP/1.1 200 OK"


[OpenAIChatCompletionChoice(finish_reason='stop', index=0, message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(role='assistant', content="Softly falls the evening dew,\nA calming hush, a peaceful view.\nThe stars appear, one by one,\nA night of rest, a day is done.\n\nThe world is quiet, still and deep,\nThe moon's soft light, our souls do keep.\nIn this calm night, we find our peace,\nA sense of rest, our worries release.", name=None, tool_calls=None, function_call=None, provider_specific_fields=None), logprobs=None)]

In [12]:
embedding_response = client.embeddings.create(input="Hello", model=EMBEDDING_MODEL)
embedding_response.model_dump()


INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/embeddings "HTTP/1.1 200 OK"


{'data': [{'embedding': [-0.036214184015989304,
    0.008323820307850838,
    -0.02359858714044094,
    0.012813488952815533,
    -0.0586748942732811,
    -0.024105684831738472,
    -0.0036331682931631804,
    -0.004786505829542875,
    0.024501468986272812,
    0.04385775327682495,
    -0.012448625639081001,
    -0.03725311532616615,
    -0.04217567294836044,
    -0.02295543998479843,
    -0.017896832898259163,
    -0.01434714999049902,
    0.005983132403343916,
    0.02169387973845005,
    0.025255931541323662,
    -0.04761769622564316,
    0.025008566677570343,
    -0.012566124089062214,
    -0.028867455199360847,
    -0.008806181140244007,
    0.012491914443671703,
    0.034383684396743774,
    0.04202725365757942,
    0.021533092483878136,
    -0.004715388640761375,
    0.02367279678583145,
    0.02797694131731987,
    0.005027686711400747,
    0.0018150381511077285,
    0.06644214689731598,
    0.05333181843161583,
    -0.045515093952417374,
    -0.04348670691251755,
    -0.02599

### Files used in the example

In [29]:
from pathlib import Path
import json


In [30]:
docs_folder_path = Path("../")
documents_path = [docs_folder_path.joinpath(doc) for doc in ("ibm-annual-report-2024-pt1_1-20.pdf", "ibm-annual-report-2024-pt2_20-40.pdf")]
benchmark_qa_path = docs_folder_path.joinpath("ibm_annual_report_pdf_benchmarking_data.json")

In [15]:
benchmark_qa = json.loads(benchmark_qa_path.read_bytes())
benchmark_qa[:3]

[{'correct_answer_document_ids': ['ibm-annual-report-2024-pt1_1-20.pdf'],
  'question': 'What was the operating other income and expense in 2024?',
  'correct_answer': '$1,656 million'},
 {'correct_answer_document_ids': ['ibm-annual-report-2024-pt1_1-20.pdf'],
  'question': 'What is the amount of gain on land/building dispositions included in "Other"',
  'correct_answer': '$126 million'},
 {'correct_answer_document_ids': ['ibm-annual-report-2024-pt1_1-20.pdf'],
  'question': 'How much was the non-operating retirement-related costs in the current-year period?',
  'correct_answer': '$3,457 million'}]

## Option 1. With the Llama Stack Files API

### 1.1 Upload files with the Llama Stack [Files API](https://llamastack.github.io/docs/api/files)

In [16]:
files_ids = []

for file_path in documents_path:
    file_obj = client.files.create(file=(str(file_path), file_path.read_bytes(), "application/pdf"), purpose="assistants")

    files_ids.append(file_obj.id)


INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/files "HTTP/1.1 200 OK"


### 1.2 Create a Vector Store using the Llama Stack [Vector IO API](https://llamastack.github.io/docs/api/creates-a-vector-store)


Using Remote Milvus - Milvus instance from watsonx.data (configured in run.yaml)

```yaml
  vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: "http://<milvus_host>:<milvus_port>"
      token: "<milvus_username>:<milvus_password>"
      secure: True
#      collection_name: "llama_stack_collection"  # otherwise default
      persistence:
        namespace: vector_io::milvus
        backend: kv_default
```

In [17]:
vs = client.vector_stores.create(
    extra_body={
        "provider_id": "milvus",
        "embedding_model": EMBEDDING_MODEL,
        "embedding_dimension": EMBEDDING_MODEL_DIMENSION
    }
)
vs_id = vs.id


INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores "HTTP/1.1 200 OK"


### 1.3 Attach files to the Vector Store.

Llama Stack supports two chunking strategies:
- `auto`
- `static`

The `auto` strategy splits data into chunks with the parameters `chunk_overlap_tokens = 400` and `max_chunk_size_tokens = 800`, which is too long for the watsonx.ai embedding models to process. The solution is to use the `static` strategy with explicitly provided `chunk_overlap_tokens` and `max_chunk_size_tokens`.

The `static` chunking strategy is simple chunking with overlap; tokens are counted with the tiktoken tokenizer (see https://github.com/llamastack/llama-stack/blob/a078f089d9070d5618d185fb9dfdbc53f5e3c34f/src/llama_stack/providers/utils/memory/vector_store.py#L159-L207). It is similar to `TokenTextSplitter` from `langchain`.
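
To make the overlap behaviour concrete, here is a simplified sketch of sliding-window chunking. It is not the llama-stack implementation: the function name `chunk_with_overlap` is hypothetical, and whitespace-separated words stand in for tiktoken tokens so the example runs without extra dependencies.

```python
def chunk_with_overlap(tokens, max_chunk_size_tokens=256, chunk_overlap_tokens=64):
    """Split a token list into windows that share `chunk_overlap_tokens` tokens."""
    if chunk_overlap_tokens >= max_chunk_size_tokens:
        raise ValueError("overlap must be smaller than the chunk size")
    # Each new window starts `step` tokens after the previous one.
    step = max_chunk_size_tokens - chunk_overlap_tokens
    return [tokens[start:start + max_chunk_size_tokens]
            for start in range(0, len(tokens), step)]


# ASSUMPTION: words stand in for tokens; a real run would tokenize with tiktoken.
tokens = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(tokens, max_chunk_size_tokens=4, chunk_overlap_tokens=1)
for chunk in chunks:
    print(chunk)
```

Each window begins with the last token of the previous one, so a sentence cut at a chunk boundary still appears with some context in the next chunk. With this simple stepping the final window can be shorter than the others.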

Llama Stack supports two approaches for file upload:
- single file (SDK and API)
- batch (only API)
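
Since the batch approach is only reachable through the REST API, a request has to be assembled by hand. The sketch below only builds the URL and JSON body. It assumes the endpoint follows the OpenAI-compatible `file_batches` route and body shape; the helper name `build_file_batch_request` and the example IDs are hypothetical, so check the route against your llama-stack version before relying on it.

```python
def build_file_batch_request(base_url, vector_store_id, file_ids,
                             max_chunk_size_tokens=256, chunk_overlap_tokens=64):
    """Return (url, json_body) for attaching several files in one request.

    ASSUMPTION: the path and body mirror the OpenAI-compatible
    `file_batches` route; verify against your llama-stack version.
    """
    url = f"{base_url.rstrip('/')}/v1/vector_stores/{vector_store_id}/file_batches"
    body = {
        "file_ids": list(file_ids),
        "chunking_strategy": {
            "type": "static",
            "static": {
                "max_chunk_size_tokens": max_chunk_size_tokens,
                "chunk_overlap_tokens": chunk_overlap_tokens,
            },
        },
    }
    return url, body


# Hypothetical IDs, used only to show the resulting request shape.
url, body = build_file_batch_request("http://localhost:8321", "vs_example", ["file-1", "file-2"])
print(url)
print(body)
```

With a running server, the request could then be sent with `requests.post(url, json=body)`.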



In [18]:
for file_id in files_ids:
    client.vector_stores.files.create(vector_store_id=vs_id,
                                      file_id=file_id,
                                      attributes={"additional_tag": "IBM_PDF_FILE"},
                                      chunking_strategy={
                                          "type": "static",
                                          "static": {"chunk_overlap_tokens": 64,
                                                     "max_chunk_size_tokens": 256}
                                      })

INFO:llama_stack_client._base_client:Retrying request to /v1/vector_stores/vs_24d6eaac-4d01-458d-84b0-17ce1bac6964/files in 0.456660 seconds
INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores/vs_24d6eaac-4d01-458d-84b0-17ce1bac6964/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores/vs_24d6eaac-4d01-458d-84b0-17ce1bac6964/files "HTTP/1.1 200 OK"


In [19]:
client.vector_stores.files.list(vector_store_id=vs_id)

INFO:httpx:HTTP Request: GET http://127.0.0.1:8321/v1/vector_stores/vs_24d6eaac-4d01-458d-84b0-17ce1bac6964/files "HTTP/1.1 200 OK"


SyncOpenAICursorPage[VectorStoreFile](data=[VectorStoreFile(id='file-6e68763af4444eb8a085f55828f93dd2', attributes={'additional_tag': 'IBM_PDF_FILE'}, chunking_strategy=ChunkingStrategyVectorStoreChunkingStrategyStatic(static=ChunkingStrategyVectorStoreChunkingStrategyStaticStatic(chunk_overlap_tokens=64, max_chunk_size_tokens=256), type='static'), created_at=1764586283, object='vector_store.file', status='completed', usage_bytes=0, vector_store_id='vs_24d6eaac-4d01-458d-84b0-17ce1bac6964', last_error=None), VectorStoreFile(id='file-530cca2199da4a2188dd3f7af34ab255', attributes={'additional_tag': 'IBM_PDF_FILE'}, chunking_strategy=ChunkingStrategyVectorStoreChunkingStrategyStatic(static=ChunkingStrategyVectorStoreChunkingStrategyStaticStatic(chunk_overlap_tokens=64, max_chunk_size_tokens=256), type='static'), created_at=1764586247, object='vector_store.file', status='completed', usage_bytes=0, vector_store_id='vs_24d6eaac-4d01-458d-84b0-17ce1bac6964', last_error=None), VectorStoreFile(

### 1.4 Retrieval test

There are three `search_mode` values for [VectorStore Search](https://llamastack.github.io/docs/api/search-for-chunks-in-a-vector-store):


#### Supported Search Modes

The SQLite vector store supports three search modes:

1. **Vector Search** (`mode="vector"`): Uses vector similarity to find relevant chunks
2. **Keyword Search** (`mode="keyword"`): Uses keyword matching to find relevant chunks
3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword scores using a ranker

###### Hybrid Search

Link to Hybrid Search for Milvus: https://github.com/llamastack/llama-stack/blob/bd5ad2963e496e78f6e115dfc9910d55ce2121b5/src/llama_stack/providers/remote/vector_io/milvus/milvus.py#L198

Hybrid search combines the strengths of both vector and keyword search by:
- Computing vector similarity scores
- Computing keyword match scores
- Using a ranker to combine these scores

#### Supported rankers

Two ranker types are supported:

1. **RRF (Reciprocal Rank Fusion)**:
   - Combines ranks from both vector and keyword results
   - Uses an impact factor (default: 60.0) to control the weight of higher-ranked results
   - Good for balancing between vector and keyword results
   - The default impact factor of 60.0 comes from the original RRF paper by Cormack et al. (2009) [^1], which found this value to provide optimal performance across various retrieval tasks

2. **Weighted**:
   - Linearly combines normalized vector and keyword scores
   - Uses an alpha parameter (0-1) to control the blend:
     - alpha=0: Only use keyword scores
     - alpha=1: Only use vector scores
     - alpha=0.5: Equal weight to both (default)

**NOTE:** Support for each search mode depends on the provider.
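
The two rankers can be illustrated with a few lines of plain Python. This is a sketch of the textbook formulas, not the provider's code: the function names are hypothetical, ranks are assumed to be 1-based, and the weighted variant assumes the scores passed in are already normalized to the 0-1 range.

```python
def rrf_scores(vector_ranking, keyword_ranking, impact_factor=60.0):
    """Reciprocal Rank Fusion: sum 1 / (impact_factor + rank) over both rankings."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):  # ASSUMPTION: 1-based ranks
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (impact_factor + rank)
    return scores


def weighted_scores(vector_scores, keyword_scores, alpha=0.5):
    """Linear blend of normalized scores: alpha * vector + (1 - alpha) * keyword."""
    chunk_ids = set(vector_scores) | set(keyword_scores)
    return {
        chunk_id: alpha * vector_scores.get(chunk_id, 0.0)
        + (1 - alpha) * keyword_scores.get(chunk_id, 0.0)
        for chunk_id in chunk_ids
    }


# "c2" appears in both rankings, so it collects two reciprocal-rank terms.
fused = rrf_scores(["c1", "c2"], ["c2", "c3"])
best = max(fused, key=fused.get)
print(best, fused[best])
```

Under these assumptions, a chunk ranked first by only one of the two retrievers scores 1/61 ≈ 0.0164, which is consistent with the top `score` value in the hybrid search output shown later in this notebook.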


In [20]:
question = benchmark_qa[-1].get("question")
question

'What was the total revenue for year 2024?'

In [21]:
search_response = client.vector_stores.search(
    vector_store_id=vs_id,
    query=question,
    search_mode="hybrid",
    max_num_results=5,
)
search_response.data[0].content[0].text

INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores/vs_24d6eaac-4d01-458d-84b0-17ce1bac6964/search "HTTP/1.1 200 OK"


' r g i n\nC h a n g e\nI n f r a s t r u c t u r e\nG r o s s p r o f i t $ 7 , 8 1 9 $ 8 , 1 8 7 ( 4 . 5 ) %\nG r o s s p r o f i t m a r g i n 5 5 . 8 % 5 6 . 1 % ( 0 . 3 ) p t s .\nS e g m e n t p r o f i t $ 2 , 4 5 0 $ 2 , 8 2 8 ( 1 3 . 4 ) %\nS e g m e n t p r o f i t m a r g i n 1 7 . 5 % 1 9 . 4 % ( 1 . 9 ) p t s .\n( 1 ) R e c a s t t o r e f l e c t J a n u a r y 2 0 2 4 s e g m e n t c h a n g e s .\nI n f r a s t r'

In [22]:
# pypdf doesn't process the file well: the extracted characters are separated by spaces
print(search_response.data[0].content[0].text.replace(" ", ""))

rgin
Change
Infrastructure
Grossprofit$7,819$8,187(4.5)%
Grossprofitmargin55.8%56.1%(0.3)pts.
Segmentprofit$2,450$2,828(13.4)%
Segmentprofitmargin17.5%19.4%(1.9)pts.
(1)RecasttoreflectJanuary2024segmentchanges.
Infrastr


## Option 2. With VectorStore Insert chunks

### 2.1  Read documents with PyPDF Reader

In [33]:
from pypdf import PdfReader
import logging

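def load_pdf(document_path: Path) -> str:
    """Extract the text of every page and join the pages with newlines."""
    # Silence pypdf warnings while parsing; always restore logging afterwards.
    logging.disable(logging.WARNING)
    try:
        with open(document_path, "rb") as open_pdf_file:
            reader = PdfReader(open_pdf_file)
            full_text = [page.extract_text() for page in reader.pages]
    finally:
        logging.disable(logging.NOTSET)
    return "\n".join(full_text)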

In [34]:
files_text = dict()

for file_path in documents_path:
    files_text[file_path.name] = load_pdf(file_path)

In [37]:
files_text.keys()

dict_keys(['ibm-annual-report-2024-pt1_1-20.pdf', 'ibm-annual-report-2024-pt2_20-40.pdf'])

### 2.2 Prepare chunks

##### Simple chunking function

In [23]:
def fixed_window_splitter(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive, non-overlapping windows of chunk_size characters."""
    splits = []
    for i in range(0, len(text), chunk_size):
        # The last window may be shorter than chunk_size.
        splits.append(text[i:i + chunk_size])
    return splits

In [49]:
from llama_stack_client.types.vector_io_insert_params import Chunk, ChunkChunkMetadata

In [53]:
# Available metadata

print(ChunkChunkMetadata.__annotations__.keys())

dict_keys(['chunk_embedding_dimension', 'chunk_embedding_model', 'chunk_id', 'chunk_tokenizer', 'chunk_window', 'content_token_count', 'created_timestamp', 'document_id', 'metadata_token_count', 'source', 'updated_timestamp'])


In [64]:
chunks = []
for file_name, file_text in files_text.items():
    chunked_file = fixed_window_splitter(file_text, chunk_size=500)
    for chunk_text in chunked_file:
        chunk = Chunk(content=chunk_text, metadata={"document_id": file_name, "description": "Annual Report from IBM "},
                      chunk_metadata=ChunkChunkMetadata(document_id=file_name))
        chunks.append(chunk)


In [65]:
chunks[:2]

[{'content': '\n\n\nI n f r a s t r u c t u r eC o n s u l t i n gS o f t w a r e\n2\nt o s h a r e h o l d e r s t h r o u g h d i v i d e n d s . I B M e x p a n d e d o p e r a t i n g\ng r o s s p r o f t m a r g i n b y 1 3 0 b a s i s p o i n t s i n 2 0 2 4 , d r i v e n b y a\n\ue006 o c u s o n h i g h - v a l u e o \ue006 \ue006 e r i n g s a n d p r o d u c t i v i t y i n i t i a t i v e s .\nS o f t w a r e r e v e n u e g r e w 9 % a t c o n s t a n t c u r r e n c y , r e f l e c t i n g\ns t r o n g d e m a n d f o r',
  'metadata': {'document_id': 'ibm-annual-report-2024-pt1_1-20.pdf',
   'description': 'Annual Report from IBM '},
  'chunk_metadata': {'document_id': 'ibm-annual-report-2024-pt1_1-20.pdf'}},
 {'content': ' o u r a d v a n c e d c a p a b i l i t i e s i n h y b r i d c l o u d ,\nd a t a a n d A I , a u t o m a t i o n , t r a n s a c t i o n p r o c e s s i n g , a n d s e c u r i t y .\nR e d H a t p e r f o r m e d w e l l a s c l i e n t s e m b r a 

### 2.2 Create a Vector Store using the Llama Stack [Vector IO API]

In [40]:
vs_for_chunks = client.vector_stores.create(
    extra_body={
        "provider_id": "milvus",
        "embedding_model": EMBEDDING_MODEL,
        "embedding_dimension": EMBEDDING_MODEL_DIMENSION
    }
)


INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores "HTTP/1.1 200 OK"


In [42]:
vs_for_chunks.model_dump()

{'id': 'vs_62b774e4-1dda-4203-a8c7-18db45b0e947',
 'created_at': 1764587227,
 'file_counts': {'cancelled': 0,
  'completed': 0,
  'failed': 0,
  'in_progress': 0,
  'total': 0},
 'metadata': {'provider_id': 'milvus',
  'provider_vector_store_id': 'vs_62b774e4-1dda-4203-a8c7-18db45b0e947'},
 'object': 'vector_store',
 'status': 'completed',
 'usage_bytes': 0,
 'expires_after': None,
 'expires_at': None,
 'last_active_at': 1764587227,
 'name': None}

### 2.3 Embed and index documents

We have two options:

1. **Embed documents before insertion**:
   Generate embeddings on your side and include them in the `insert` request.
2. **Defer embedding to the service**:
   Send the raw documents without embeddings; the llama-stack VectorIO module will create embeddings later during the `insert` process.

**_Option 2 is presented later in the notebook._**
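
For the first option, the embeddings have to be attached to each chunk before the `insert` call. The sketch below shows one way this could look. It assumes chunks are plain dicts with a `content` key and that the insert request accepts an `embedding` list per chunk; the names `attach_embeddings` and `fake_embed` are hypothetical, and `fake_embed` is only a stand-in so the example runs offline.

```python
def attach_embeddings(chunks, embed_fn):
    """Return copies of `chunks` with an `embedding` list added to each one.

    ASSUMPTION: chunks are dict-like and the insert request accepts an
    `embedding` key per chunk; `embed_fn` maps list[str] -> list[list[float]].
    """
    texts = [chunk["content"] for chunk in chunks]
    vectors = embed_fn(texts)
    if len(vectors) != len(chunks):
        raise ValueError("embed_fn must return one vector per chunk")
    return [{**chunk, "embedding": list(vector)} for chunk, vector in zip(chunks, vectors)]


# Stand-in embedder so the sketch runs offline; replace with a real model call.
def fake_embed(texts):
    return [[float(len(text)), 0.0] for text in texts]


embedded = attach_embeddings([{"content": "abc"}, {"content": "hello"}], fake_embed)
print(embedded[0])
```

In this notebook, `embed_fn` could wrap `client.embeddings.create(...)` with `EMBEDDING_MODEL` and read the `embedding` field of each item in the response's `data` list.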

In [66]:
client.vector_io.insert(vector_db_id=vs_for_chunks.id, chunks=chunks)

INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector-io/insert "HTTP/1.1 200 OK"


In [67]:
vs_for_chunks

VectorStore(id='vs_62b774e4-1dda-4203-a8c7-18db45b0e947', created_at=1764587227, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), metadata={'provider_id': 'milvus', 'provider_vector_store_id': 'vs_62b774e4-1dda-4203-a8c7-18db45b0e947'}, object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None, last_active_at=1764587227, name=None)

### 2.4 Retrieval test

In [74]:
question = benchmark_qa[-1].get("question")
question

'What was the total revenue for year 2024?'

In [75]:
search_response = client.vector_stores.search(
    vector_store_id=vs_for_chunks.id,
    query=question,
    search_mode="hybrid",
    max_num_results=5,
)
search_response.data[0].content[0].text

INFO:httpx:HTTP Request: POST http://127.0.0.1:8321/v1/vector_stores/vs_62b774e4-1dda-4203-a8c7-18db45b0e947/search "HTTP/1.1 200 OK"


'h q u a r t e r t o a s s e s s t h e a d e q u a c y o f t h e e s t i m a t e s . I f t h e e s t i m a t e s w e r e\nc h a n g e d b y 1 0 p e r c e n t i n 2 0 2 4 , t h e i m p a c t o n n e t i n c o m e w o u l d h a v e b e e n $ 3 1 m i l l i o n .\nC o s t s t o C o m p l e t e S e r v i c e C o n t r a c t s\nW e e n t e r i n t o n u m e r o u s s e r v i c e c o n t r a c t s t h r o u g h o u r s e r v i c e s b u s i n e s s e s . D u r i n g t h e c o n t r a c t u a l p e r i o d '

In [76]:
# pypdf doesn't process the file well: the extracted characters are separated by spaces
print(search_response.data[0].content[0].text.replace(" ", ""))

hquartertoassesstheadequacyoftheestimates.Iftheestimateswere
changedby10percentin2024,theimpactonnetincomewouldhavebeen$31million.
CoststoCompleteServiceContracts
Weenterintonumerousservicecontractsthroughourservicesbusinesses.Duringthecontractualperiod


In [79]:
search_response.model_dump()

{'data': [{'content': [{'text': 'h q u a r t e r t o a s s e s s t h e a d e q u a c y o f t h e e s t i m a t e s . I f t h e e s t i m a t e s w e r e\nc h a n g e d b y 1 0 p e r c e n t i n 2 0 2 4 , t h e i m p a c t o n n e t i n c o m e w o u l d h a v e b e e n $ 3 1 m i l l i o n .\nC o s t s t o C o m p l e t e S e r v i c e C o n t r a c t s\nW e e n t e r i n t o n u m e r o u s s e r v i c e c o n t r a c t s t h r o u g h o u r s e r v i c e s b u s i n e s s e s . D u r i n g t h e c o n t r a c t u a l p e r i o d ',
     'type': 'text'}],
   'file_id': 'ibm-annual-report-2024-pt2_20-40.pdf',
   'filename': '',
   'score': 0.016393441706895828,
   'attributes': {'document_id': 'ibm-annual-report-2024-pt2_20-40.pdf',
    'description': 'Annual Report from IBM '}},
  {'content': [{'text': '- t o - Y r .\nP e r c e n t\nC h a n g e\nI n t e l l e c t u a l p r o p e r t y i n c o m e\n( 1 ) ( 2 )\n$ 3 2 9 $ 3 7 4 ( 1 2 . 1 ) %\nC u s t o m d e v e l o p m e n t i n c o m e

In [80]:
# Count the returned results rather than the keys of the dumped dict
len(search_response.data)

5