# Set up collections and indexes with PhariaSearch

While large language models come equipped with extensive built-in knowledge, answering questions given a known text may not be sufficient for your use case. At some point, you will probably want to search through, or answer questions about, your own knowledge base.

You can leverage Aleph Alpha's PhariaSearch (previously known as DocumentIndex) – a robust semantic search tool – to pinpoint sections in documents that align closely with your query.

In this tutorial, we will go through the creation of Collections and Index configurations, and the upload of text directly to PhariaSearch.

## Environment Validation

⚠️ **CRITICAL: Validate your environment before proceeding**

This step ensures your `.env` file is properly configured with valid API endpoints, authentication tokens, and unique resource names.

**DO NOT SKIP!** If validation fails:
- Check your `.env` file for missing values
- Contact your infrastructure administrator if errors persist

Only proceed after ALL checks pass.

In [2]:
from validation_utils import validate_environment

# Run environment validation
validate_environment()

# If validation fails, DO NOT proceed with the rest of the tutorial
# Fix the issues identified above first!

🔍 Validating environment configuration...

1️⃣  Checking required environment variables:
   ✅ PHARIA_API_BASE_URL: https://api.customer.pharia.com
   ✅ PHARIA_AI_TOKEN: eyJhbGci...
   ✅ PHARIA_DATA_NAMESPACE: Studio
   ✅ PHARIA_DATA_COLLECTION: pharia-tutorial-rag-vsp-1
   ✅ INDEX: rag-tutorial-index-1
   ✅ HYBRID_INDEX: rag-tutorial-hybrid-index-vsp-1
   ✅ FILTER_INDEX: rag-tutorial-filter-index-vsp-1
   ✅ EMBEDDING_MODEL_NAME: luminous-base

2️⃣  Validating URL format:
   ✅ PHARIA_API_BASE_URL: Valid format

3️⃣  Testing PhariaAI API access:
   ✅ API connection successful

✅ All validation checks passed! Your environment is properly configured.

You can now proceed with the tutorial.


True
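
The internals of `validate_environment` are not shown in this tutorial. As a rough, hypothetical sketch of the first two checks (required variables and URL format), something like the following could be used; the function name `check_environment` is an assumption for illustration, and the API connectivity check is deliberately left out.

```python
from urllib.parse import urlparse

# Variable names taken from the validation output above.
REQUIRED_VARS = [
    "PHARIA_API_BASE_URL",
    "PHARIA_AI_TOKEN",
    "PHARIA_DATA_NAMESPACE",
    "PHARIA_DATA_COLLECTION",
    "INDEX",
    "HYBRID_INDEX",
    "FILTER_INDEX",
    "EMBEDDING_MODEL_NAME",
]


def check_environment(env):
    """Hypothetical helper: return a list of problems (empty list means OK)."""
    # Report every required variable that is missing or empty.
    problems = [f"missing value: {name}" for name in REQUIRED_VARS if not env.get(name)]

    # Check that the base URL has an http(s) scheme and a host.
    url = env.get("PHARIA_API_BASE_URL") or ""
    parsed = urlparse(url)
    if url and (parsed.scheme not in ("http", "https") or not parsed.netloc):
        problems.append("PHARIA_API_BASE_URL is not a valid http(s) URL")
    return problems
```

It can be called with `check_environment(dict(os.environ))` after `load_dotenv()`; any returned entry points at a value to fix in `.env`.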

### Load Environment Variables

Now we'll load all the configuration parameters from your `.env` file. These include API endpoints, authentication tokens, and the names of the resources we'll create in PhariaSearch.

In [None]:
from os import getenv
from dotenv import load_dotenv

print("Loading environment variables...")

load_dotenv(override=True)

# PHARIA_API_BASE_URL should NOT end in /v1; the /v1 path segment is appended below
PHARIA_API_BASE_URL = getenv("PHARIA_API_BASE_URL")
PHARIA_SEARCH_API_URL = f"{PHARIA_API_BASE_URL}/v1/studio/search"
TOKEN = getenv("PHARIA_AI_TOKEN")

NAMESPACE = getenv("PHARIA_DATA_NAMESPACE")
COLLECTION = getenv("PHARIA_DATA_COLLECTION")
INDEX = getenv("INDEX")
HYBRID_INDEX = getenv("HYBRID_INDEX")
FILTER_INDEX = getenv("FILTER_INDEX")

EMBEDDING_MODEL_NAME = getenv("EMBEDDING_MODEL_NAME")
print("Environment variables loaded successfully!")
print(f"  - API Base URL: {PHARIA_API_BASE_URL}")
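
Because the search URL is built by string concatenation, a trailing slash or an already present `/v1` in `PHARIA_API_BASE_URL` would produce a malformed URL. A small, hypothetical helper (the name `build_search_url` is not part of the SDK) could normalise the value first; this is a sketch, assuming the `/v1/studio/search` path used in the cell above.

```python
def build_search_url(base_url: str) -> str:
    """Hypothetical helper: normalise the base URL and append the search path."""
    base = base_url.rstrip("/")  # drop any trailing slashes
    if base.endswith("/v1"):  # tolerate a base URL that already carries /v1
        base = base[: -len("/v1")]
    return f"{base}/v1/studio/search"


print(build_search_url("https://api.customer.pharia.com/"))
```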

## 1. Setup a Search Client with Pharia Data SDK

To search through PhariaSearch, you first need to set up the search client using the environment variables loaded at the beginning of the notebook.

In [None]:
from pharia_data_sdk.connectors import DocumentIndexClient

search_client = DocumentIndexClient(
    token=TOKEN,
    base_url=PHARIA_SEARCH_API_URL,
)
try:
    print(f"Search client created successfully. \n Remote server available at {search_client._base_url}")
except AttributeError:
    print("Search client created successfully.")

## 2. Create a collection
The collection is where our searchable content will be stored. The following code uses the Pharia Data SDK to create a new collection in the configured namespace.

In [None]:
from pharia_data_sdk.connectors.document_index.document_index import CollectionPath

collection_path = CollectionPath(namespace=NAMESPACE, collection=COLLECTION)

search_client.create_collection(collection_path)

print(f"Collection created successfully. \n Collection path: {collection_path}")

## 3. Create a semantic index configuration

Now, let's create an index and assign it to this collection. You can do this before or after populating the collection with documents; PhariaSearch automatically updates semantic indexes in the background.
The embedding model is read from `EMBEDDING_MODEL_NAME` in your `.env` file. This tutorial was written for the `pharia-1-embedding-256-control` embedding model; if that model is not available in your environment, set the variable to one that is (e.g. `luminous-base`).

In [None]:
from pharia_data_sdk.connectors.document_index.document_index import (
    IndexConfiguration,
    IndexPath,
    SemanticEmbed,
)

index_path = IndexPath(namespace=NAMESPACE, index=INDEX)

# customise the parameters of the index here
index_configuration = IndexConfiguration(
    chunk_size=64,
    chunk_overlap=0,
    embedding=SemanticEmbed(model_name=EMBEDDING_MODEL_NAME, representation="asymmetric"),
)

search_client.create_index(index_path, index_configuration)

print(f"Index created successfully. \n Index path: {index_path}")

# assign the index to the collection
search_client.assign_index_to_collection(collection_path, INDEX)

print(f"Index assigned to collection successfully. \n Collection path: {collection_path}")
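
To build some intuition for `chunk_size` and `chunk_overlap`, here is a purely local sketch of fixed-size chunking with overlap. It assumes that both parameters count tokens and uses whitespace splitting as a stand-in for a real tokenizer; how PhariaSearch actually tokenizes and splits documents is not described in this tutorial, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Whitespace "tokens" are an assumption for illustration only.
words = "Jane Jacobs was an American-Canadian journalist author and activist".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With `chunk_overlap=0`, as in the configuration above, consecutive chunks share no tokens; a positive overlap repeats the tail of one chunk at the head of the next.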

## 4. Upload some text to the collection

Now that we have our collection set up, we need to populate it with content that can be searched. Let's create three text objects that will serve as our sample knowledge base. These documents will demonstrate how PhariaSearch can identify relevant information across different types of content.

We'll add biographical information about notable figures to showcase the semantic search capabilities. The document content is defined in the `sample_documents.py` file, outside of the notebook, and imported below.

We'll upload each document to our collection. The SDK will automatically handle the document storage and prepare them for indexing.

In [None]:
from pharia_data_sdk.connectors.document_index.document_index import (
    DocumentContents,
    DocumentPath,
)

from sample_documents import documents

for doc in documents:
    document_path = DocumentPath(
        collection_path=collection_path, document_name=doc["name"]
    )
    search_client.add_document(
        document_path, contents=DocumentContents.from_text(doc["content"])
    )
    print(f"Document `{doc['name']}` uploaded successfully.")
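
The upload loop above only relies on each entry of `documents` having a `"name"` and a `"content"` key. The sketch below shows that assumed shape with placeholder content (the real texts live in `sample_documents.py`) together with a hypothetical `validate_documents` helper. Since each document is addressed by its name within the collection, two entries sharing a name would target the same `DocumentPath`, so the helper rejects duplicates.

```python
def validate_documents(documents):
    """Hypothetical pre-upload check: return the document names in order."""
    seen = []
    for doc in documents:
        if not doc.get("name") or not doc.get("content"):
            raise ValueError(f"document needs non-empty 'name' and 'content': {doc!r}")
        if doc["name"] in seen:
            raise ValueError(f"duplicate document name: {doc['name']}")
        seen.append(doc["name"])
    return seen


# Placeholder content only; the names match the ones used in this tutorial.
example_documents = [
    {"name": "robert_moses", "content": "<text about Robert Moses>"},
    {"name": "jane_jacobs", "content": "<text about Jane Jacobs>"},
    {"name": "nelson_rockefeller", "content": "<text about Nelson Rockefeller>"},
]
print(validate_documents(example_documents))
```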

### Verify Documents in Collection

Let's verify that our documents have been successfully uploaded to the collection:

In [None]:
search_client.documents(collection_path)

Once the text is indexed, we can also have a look at its chunks:

In [None]:
from pharia_data_sdk.connectors import ResourceNotFound

try:
    chunks = search_client.chunks(
        DocumentPath(collection_path=collection_path, document_name=documents[0]["name"]),
        index_name=INDEX,
    )
    print(chunks)
except ResourceNotFound:
    pass  # This is expected if the document is still embedding.
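
Because embedding happens in the background, the cell above may print nothing on the first run. Instead of re-running it by hand, a small retry loop can be used. The helper below is a generic, hypothetical sketch (`retry_until_available` is not part of the SDK); the exception type to wait on is passed in, so it works with `ResourceNotFound` or any other exception.

```python
import time


def retry_until_available(fetch, not_ready, attempts=5, delay_seconds=2.0):
    """Call fetch() until it stops raising not_ready, at most `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except not_ready:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds)


# Possible usage with the objects defined in this notebook:
# chunks = retry_until_available(
#     lambda: search_client.chunks(document_path, index_name=INDEX),
#     not_ready=ResourceNotFound,
# )
```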

## 5. Set up search indexes

### 5.1. Perform semantic search

Now that we have uploaded our text, we can search through it using the semantic similarities between a given query and each chunk.

To do so, let's use the `DocumentIndexRetriever`:

In [None]:
from pharia_data_sdk.connectors.retrievers import DocumentIndexRetriever

document_index_retriever = DocumentIndexRetriever(
    document_index=search_client,
    index_name=INDEX,
    namespace=NAMESPACE,
    collection=COLLECTION,
    k=5,
)

document_index_retriever.get_relevant_documents_with_scores(
    query="The influence of Robert Moses"
)
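
Conceptually, semantic search embeds the query and every chunk as vectors and ranks chunks by how close they are to the query. The toy sketch below illustrates that idea with cosine similarity over hand-written 2-dimensional vectors. It is only an illustration: the vectors are made up, and the exact scoring function used by PhariaSearch is not specified in this tutorial.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat the zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy embeddings, for illustration only.
toy_chunks = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_chunks, k=2))
```

The `k` argument plays the same role as `k=5` in the retriever above: it caps the number of results returned.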

### 5.2. Hybrid search

PhariaSearch supports hybrid search, which combines results of semantic search and keyword search.
In order to use hybrid search, we need to create a hybrid index and assign it to the collection:

In [55]:
index_path = IndexPath(namespace=NAMESPACE, index=HYBRID_INDEX)

# customise the parameters of the index here
index_configuration = IndexConfiguration(
    chunk_size=64,
    chunk_overlap=0,
    hybrid_index="bm25",
    embedding=SemanticEmbed(model_name=EMBEDDING_MODEL_NAME, representation="asymmetric"),
)

# create the namespace-wide index resource
search_client.create_index(index_path, index_configuration)

# assign the index to the collection
search_client.assign_index_to_collection(collection_path, HYBRID_INDEX)

In [None]:
search_client.list_indexes(NAMESPACE)

If we now search on the hybrid index, we will not only get chunks with a semantic similarity but also chunks that match the keywords in the query:

In [56]:
document_index_retriever = DocumentIndexRetriever(
    document_index=search_client,
    index_name=HYBRID_INDEX,
    namespace=NAMESPACE,
    collection=COLLECTION,
    k=5,
    threshold=0.5,
)

document_index_retriever.get_relevant_documents_with_scores(query="25 April")

[SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='jane_jacobs'), score=0.5, document_chunk=DocumentChunk(text='Jane Jacobs OC OOnt (née Butzner; 4 May 1916 – 25 April 2006) was an American-Canadian journalist, author, theorist, and activist who influenced urban studies, sociology, and economics.', start=0, end=184, metadata=None)),
 SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='robert_moses'), score=0.33333334, document_chunk=DocumentChunk(text="www.nycroads.com/roads/taconic/ |title=Taconic State Parkway |website=NYCRoads.com |access-date=May 25, 2006}}</ref> Moses helped build Long Island's [[Meadowbrook State Parkway]].", start=10999, end=11178, metadata=None))]
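
This tutorial does not state how PhariaSearch merges the semantic and keyword rankings. One widely used technique for combining ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration of the general idea; it is an assumption, not a description of the service's implementation, and the constant `k=60` is simply a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list sorted by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a semantic and a keyword (BM25) search.
semantic_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a"]
print(reciprocal_rank_fusion([semantic_ranking, keyword_ranking]))
```

Items that appear near the top of both lists end up first, which matches the intuition behind hybrid search: a chunk benefits from being relevant both semantically and by keyword.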

### 5.3. Search with metadata filtering

PhariaSearch also supports filter-indexes, which let us restrict a search based on each document's metadata.

To try this out, let's first upload another version of our documents, this time with some metadata attached, e.g. a "title" field.

In [64]:
for doc in documents:
    document_path = DocumentPath(
        collection_path=collection_path, document_name=doc["name"]
    )
    search_client.add_document(
        document_path,
        contents=DocumentContents(
            contents=[doc["content"]], metadata={"title": doc["name"]}
        ),
    )
    print(f"Document `{doc['name']}` with metadata `'title': {doc['name']}` uploaded successfully.")

Document `robert_moses` with metadata `'title': robert_moses` uploaded successfully.
Document `jane_jacobs` with metadata `'title': jane_jacobs` uploaded successfully.
Document `nelson_rockefeller` with metadata `'title': nelson_rockefeller` uploaded successfully.


#### Preparation of the Filter Index
To use metadata filtering, we need to complete the following steps:
1. Check whether a search index is already assigned to the collection. If not, assign one: filter-indexes are defined at the namespace level, but they can only be assigned to existing search indexes.
2. Define a new filter-index configuration for our specific collection metadata.
3. Assign the filter-index that we created to a search index.

In [65]:
# 1
# list all the assigned search indexes for our collection
search_client.list_assigned_index_names(collection_path=collection_path)

['rag-tutorial-hybrid-index-vsp-1',
 'rag-tutorial-index-vps-1',
 'rag-tutorial-index-1']

In [67]:
# 2
# define a new filter-index
search_client.create_filter_index_in_namespace(
    namespace=collection_path.namespace,
    filter_index_name=FILTER_INDEX,  # this is how our filter-index is identified in our namespace
    field_name="title",  # this is the name of the field to which we want to apply our filter
    field_type="string",  # type of the field we want to apply our filter to. Must be one of "string", "integer", "float", "boolean" or "datetime"
)

# let's check if our index is present now
filter_present = FILTER_INDEX in search_client.list_filter_indexes_in_namespace(
    namespace=collection_path.namespace
)
if filter_present:
    print(f"Filter {FILTER_INDEX} has been created.")

Filter rag-tutorial-filter-index-vsp-1 has been created.
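
The `field_type` passed above has to match the type of the metadata value. As a convenience, a hypothetical helper such as `infer_field_type` below could derive the type name from a sample Python value. The five type names come from the comment in the cell above; the mapping from Python types to those names is an assumption made for illustration.

```python
from datetime import datetime


def infer_field_type(value):
    """Hypothetical helper: map a Python metadata value to a filter-index field type."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, datetime):
        return "datetime"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported metadata value type: {type(value).__name__}")


print(infer_field_type("robert_moses"))
```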


In [70]:
# 3
# assign our new filter-index to a search index of our collection
search_client.assign_filter_index_to_search_index(
    collection_path=collection_path,
    index_name=INDEX,  # the semantic search index created in section 3
    filter_index_name=FILTER_INDEX,
)

# check if our filter-index is assigned to the search index
print(f"List of filters assigned to {collection_path}:")
search_client.list_assigned_filter_index_names(
    collection_path=collection_path, index_name=INDEX
)

List of filters assigned to namespace='Studio' collection='pharia-tutorial-rag-vsp-1':


['rag-tutorial-filter-index-vsp-1']

Now that the filter-index is assigned, we need to initialize a new `DocumentIndexRetriever` for the search index to which we added the filter-index.

In [73]:
document_index_retriever = DocumentIndexRetriever(
    document_index=search_client,
    index_name=INDEX,
    namespace=NAMESPACE,
    collection=COLLECTION,
    k=5,
)

print(f"document_index_retriever instance created for index={INDEX}, namespace={NAMESPACE}, collection={COLLECTION}")

document_index_retriever instance created for index=rag-tutorial-index-1, namespace=Studio, collection=pharia-tutorial-rag-vsp-1


#### Defining and Using a Filter
Before we perform the filtered search, we have to define a filter.
Filters are composed of the following elements:
- `filter_type`, which can be one of "with", "without" or "with_one_of"
- `fields`, a list of `FilterField` objects that define the actual filtering criteria over a certain value for our chosen field

If we want a filter that accepts only documents whose "title" field equals the "name" of the first document (`documents[0]`), we define the filter as follows:

In [79]:
from pharia_data_sdk.connectors import FilterField, FilterOps, Filters

filters = Filters(
    filter_type="with",  # we want to only return documents matching our filter
    fields=[
        FilterField(
            field_name="title",  # this is the key we used in our metadata dict
            field_value=documents[0][
                "name"
            ],  # this is what we used as a value in the metadata dict
            criteria=FilterOps.EQUAL_TO,  # we want to match exactly
        ),
    ],
)
print(f"Filter instance created: \n ({filters})")

Filter instance created: 
 (filter_type='with' fields=[FilterField(field_name='title', field_value='robert_moses', criteria=<FilterOps.EQUAL_TO: 'equal_to'>)])


In [81]:
# let's use the filters with our query to restrict the search to documents with the title "robert_moses"
document_index_retriever.get_relevant_documents_with_scores(
    query="Robert Moses", filters=[filters]
)

[SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='robert_moses'), score=0.8402531, document_chunk=DocumentChunk(text="Robert Moses''' (December 18, 1888 – July 29, 1981) was an American [[urban planner]] and public official who worked in the [[New York metropolitan area]] during the early to mid 20th century.", start=0, end=191, metadata=None)),
 SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='robert_moses'), score=0.6727346, document_chunk=DocumentChunk(text='1914, Moses became attracted to New York City reform politics.<ref>{{Cite web|url=http://c250.columbia.edu/c250_celebrates/remarkable_columbians/robert_moses.html|title = Robert Moses}}</ref> A committed [[', start=5403, end=5608, metadata=None)),
 SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag

Great! We only get chunks from the document whose "title" metadata is exactly "robert_moses".

#### Exclusion of Documents
We can also modify our filter so that we only get documents that do *not* match the specified filter fields. This is as simple as replacing the "with" `filter_type` with "without":

In [84]:
# let's now try to exclude the document with the title "robert_moses"
filters_without = Filters(
    filter_type="without",  # we change this to "without" to exclude the document
    fields=[
        FilterField(
            field_name="title",  # this is the key we used in our metadata dict
            field_value=documents[0][
                "name"
            ],  # this is what we used as a value in the metadata dict
            criteria=FilterOps.EQUAL_TO,  # we want to match exactly
        ),
    ],
)
print(f"Filter instance created: \n ({filters_without})")

Filter instance created: 
 (filter_type='without' fields=[FilterField(field_name='title', field_value='robert_moses', criteria=<FilterOps.EQUAL_TO: 'equal_to'>)])


In [85]:
# let's use the filters with our query to exclude the document with the title "robert_moses"
document_index_retriever.get_relevant_documents_with_scores(
    query="Robert Moses", filters=[filters_without]
)

[SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='jane_jacobs'), score=0.39886028, document_chunk=DocumentChunk(text='Jacobs organized grassroots efforts to protect neighborhoods from urban renewal and slum clearance – in particular plans by Robert Moses to overhaul her own Greenwich Village neighborhood.', start=346, end=533, metadata=None)),
 SearchResult(id=DocumentPath(collection_path=CollectionPath(namespace='Studio', collection='pharia-tutorial-rag-vsp-1'), document_name='nelson_rockefeller'), score=0.3330956, document_chunk=DocumentChunk(text="As Governor of New York from 1959 to 1973, Rockefeller's achievements included the expansion of the State University of New York (SUNY), efforts to protect the environment, the construction of the Empire State Plaza in Albany, increased facilities and personnel for medical care, and the creation of the New York State Council on the Arts", start=1241, end

Notice how we only get results where the `document_name` is not "robert_moses".
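
The "with" and "without" behaviour seen above can be mimicked locally on plain dictionaries. The sketch below is only a mental model of an equality filter on one metadata field; `apply_equality_filter` is a hypothetical helper and does not reflect how PhariaSearch evaluates filters internally. The "with_one_of" type is left out because its exact semantics are not covered in this tutorial.

```python
def apply_equality_filter(documents, filter_type, field_name, field_value):
    """Keep ("with") or drop ("without") documents whose metadata field equals the value."""
    if filter_type == "with":
        return [d for d in documents if d["metadata"].get(field_name) == field_value]
    if filter_type == "without":
        return [d for d in documents if d["metadata"].get(field_name) != field_value]
    raise ValueError(f"unsupported filter_type in this sketch: {filter_type}")


# Local stand-ins for the uploaded documents and their "title" metadata.
local_docs = [
    {"name": "robert_moses", "metadata": {"title": "robert_moses"}},
    {"name": "jane_jacobs", "metadata": {"title": "jane_jacobs"}},
    {"name": "nelson_rockefeller", "metadata": {"title": "nelson_rockefeller"}},
]

kept = apply_equality_filter(local_docs, "with", "title", "robert_moses")
dropped = apply_equality_filter(local_docs, "without", "title", "robert_moses")
print([d["name"] for d in kept])
print([d["name"] for d in dropped])
```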

## 6. Conclusions

We now have a collection with several documents uploaded, a semantic index, a hybrid index, and a filter-index that lets us use metadata to filter search results.