# RAG and Semantic Retrieval on a Document Collection

Deep Search lets users interact with documents through conversational AI: a virtual assistant answers your questions using the information in the corpus.

In this example we demonstrate how to achieve the same interaction programmatically.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.


### GenAI Integration required

When interacting with the virtual assistant, Deep Search requires a connection to a Generative AI API. Currently, we support connections to [watsonx.ai](https://www.ibm.com/products/watsonx-ai) or the IBM-internal GenAI platform BAM.

Deep Search allows custom GenAI configurations for each project.
To run the following example, you need to work in a project that has such GenAI capabilities activated.

### Set notebook parameters


In [1]:
from dsnotebooks.settings import CollQANotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = CollQANotebookSettings()

PROFILE_NAME = notebook_settings.profile      # the profile to use
PROJ_KEY = notebook_settings.proj_key         # the project to use
INDEX_KEY = notebook_settings.sem_on_idx_key  # the collection to use

SKIP_INGESTED_DOCS = notebook_settings.skip_ingested_docs  # whether to skip any already semantically ingested docs

RETR_K = notebook_settings.retr_k             # the number of search results to retrieve
TEXT_WEIGHT = notebook_settings.text_weight   # the weight of lexical search (0.0: semantic-only, 1.0: lexical-only, anything in between: hybrid search)
RERANK = notebook_settings.rerank             # whether to rerank the search results
RAISE = notebook_settings.raise_on_sem_err    # whether semantic operation errors should raise an exception or be reflected in response fields

### Import example dependencies

In [2]:
# Import standard dependencies
import pandas as pd
import rich

# IPython utilities
from IPython.display import display, Markdown

# Import the deepsearch-toolkit
from deepsearch.cps.client.api import CpsApi
from deepsearch.cps.client.components.elastic import ElasticProjectDataCollectionSource
from deepsearch.cps.queries import DataQuery, RAGQuery, SemanticQuery
from deepsearch.cps.queries.results import RAGResult, SearchResult, SearchResultItem


### Connect to Deep Search

In [3]:
api = CpsApi.from_env(profile_name=PROFILE_NAME)

### Utils

In [4]:
def render_provenance_url(
        api: CpsApi,
        coords: ElasticProjectDataCollectionSource,
        retr_item: SearchResultItem,
):
    ## compute URL to the document in the Deep Search UI
    item_index = int(retr_item.main_path[retr_item.main_path.rfind(".")+1:])
    doc_url = api.documents.generate_url(
        document_hash=retr_item.doc_hash,
        data_source=coords,
        item_index=item_index,
    )
    display(Markdown(f"The provenance of the answer can be inspected on the [source document]({doc_url})."))
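
The helper above takes the item index from the last dot-separated segment of `main_path`. The following standalone sketch isolates that parsing step. It assumes paths of the form `main-text.71`; this example path and the function name `parse_item_index` are illustrative and not taken from the toolkit. Unlike the one-liner in the helper, the sketch raises a descriptive error when the path has no trailing integer segment.

```python
def parse_item_index(main_path: str) -> int:
    """Return the integer after the last '.' in a path such as 'main-text.71'."""
    # rpartition splits on the LAST dot; the tail is the candidate index
    _, sep, tail = main_path.rpartition(".")
    if not sep or not tail.isdecimal():
        raise ValueError(f"no trailing item index in path: {main_path!r}")
    return int(tail)


# hypothetical example path, used only for illustration
item_index = parse_item_index("main-text.71")
print(item_index)
```

Failing early with a clear message can make it easier to spot retrieval items whose path does not point at a numbered document item.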

---

Prepare the collection coordinates:

In [5]:
coll_coords = ElasticProjectDataCollectionSource(
    proj_key=PROJ_KEY,
    index_key=INDEX_KEY,
)

We are using a small collection, so we can just list its documents to get an idea of its contents (for more details on querying, check the [Data Query Quick Start](https://github.com/DS4SD/deepsearch-examples/tree/main/examples/data_query_quick_start)).

In [6]:
# Prepare the data query
query = DataQuery(
    search_query="*",  # The search query to be executed
    source=[           # Which fields of documents we want to fetch
        "file-info.document-hash",
        "file-info.filename",
        # "description.title",
    ],
    coordinates=coll_coords,  # The data collection to be queried
)

# Query Deep Search for the documents matching the query
results = []
query_results = api.queries.run(query)
for row in query_results.outputs["data_outputs"]:
    # Add row to results table
    results.append({
        "Filename": row["_source"]["file-info"]["filename"],
        "DocHash": row["_source"]["file-info"]["document-hash"],
        # "Title": row["_source"].get("description", {}).get("title"),
    })

print(f'Finished fetching all data. Total is {len(results)} records.')

# Visualize the table with all results
df = pd.json_normalize(results)
display(df)

Finished fetching all data. Total is 10 records.


| | Filename | DocHash |
|---|---|---|
| 0 | natural-language-processing.pdf | 000f892ddcc67f165797a96e94f44fb9e0697c7912a383... |
| 1 | ibm-z.pdf | 07e56eb5a10f725fccad9386d126b7b05bec1fa71b9b3d... |
| 2 | ibm.pdf | 234bc5cf2c860d49574b0ff7191c354b7bbc11472a0997... |
| 3 | ibm-the-great-mind-challenge.pdf | 335120a57b418655196e3315b562a2f9e89cedeaef9318... |
| 4 | turing-award.pdf | 8a7c91a269abc3063df9f4e19f7961ddb8e2393fa0f272... |
| 5 | ibm-research.pdf | b30bc667a324ae111d025526563b674a8d3fd869bc07c8... |
| 6 | artificial-intelligence.pdf | b60a87c1d62a59d517f2fd6f2d3ea1a96c58b651332a8b... |
| 7 | machine-learning.pdf | e470e7b42a92c8e5f25094362361947b9203e0074c2223... |
| 8 | deep-blue-chess-computer.pdf | fa7ce2f66a7a5e061813d36348425f81d9e7ebc23454d8... |
| 9 | red-hat.pdf | fb53bb607f2e9642d7fe044585d1dcdb052c57febe1b87... |
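
The loop in the cell above indexes `row["_source"]["file-info"]` directly, which raises a `KeyError` if a document lacks one of the fields. As a sketch of a more tolerant variant, the function below flattens rows with `.get()` and fills missing values with `None`. The sample rows and hashes are made up for illustration; only the `_source` / `file-info` / `description` nesting is taken from the cell above, and `rows_to_records` is a hypothetical helper name.

```python
from typing import Any


def rows_to_records(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten query rows into table records, using None for missing fields."""
    records = []
    for row in rows:
        source = row.get("_source", {})
        file_info = source.get("file-info", {})
        records.append({
            "Filename": file_info.get("filename"),
            "DocHash": file_info.get("document-hash"),
            "Title": source.get("description", {}).get("title"),
        })
    return records


# made-up rows mimicking the nesting used in the query cell above
sample_rows = [
    {"_source": {"file-info": {"filename": "a.pdf", "document-hash": "abc123"}}},
    {"_source": {"file-info": {"filename": "b.pdf"}, "description": {"title": "B"}}},
]
records = rows_to_records(sample_rows)
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed to `pd.json_normalize` in the same way as `results` in the cell above.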


## Prepare source

In [7]:
from deepsearch.cps.client.components.documents import PrivateDataCollectionSource, PrivateDataDocumentSource, PublicDataDocumentSource

data_source = PrivateDataCollectionSource(
    source=coll_coords,
)

## Ingestion

The cell below shows how to semantically index your collection. Whether documents that are already indexed get indexed again is controlled via the `skip_ingested_docs` parameter:

In [8]:
# launch the ingestion of the collection for DocumentQA
task = api.documents.semantic_ingest(
    project=PROJ_KEY,
    data_source=data_source,
    skip_ingested_docs=SKIP_INGESTED_DOCS,
)

# wait for the ingestion task to finish
api.tasks.wait_for(PROJ_KEY, task.task_id)



{'ing_out': {}}

## RAG

In [9]:
question = "Where is the IBM lab in Zurich?"

# submit natural-language query on collection
question_query = RAGQuery(
    question=question,
    project=PROJ_KEY,
    data_source=data_source,

    ## optional retrieval params
    retr_k=RETR_K,
)
api_output = api.queries.run(question_query)
rag_result = RAGResult.from_api_output(api_output, raise_on_error=RAISE)

rich.print(rag_result)

Additionally, we can generate a provenance URL to the document in the Deep Search UI:

In [10]:
render_provenance_url(api=api, coords=coll_coords, retr_item=rag_result.answers[0].grounding.retr_items[0])

The provenance of the answer can be inspected on the [source document](https://cps.foc-deepsearch.zurich.ibm.com/projects/e0ea87922f4b732407fb3b9cf3475f0edb90cc2d/library/private/d70f151acff22f19f9cfaffb1f5baa810c8de3db?search=JTdCJTIycHJpdmF0ZUNvbGxlY3Rpb24lMjIlM0ElMjJkNzBmMTUxYWNmZjIyZjE5ZjljZmFmZmIxZjViYWE4MTBjOGRlM2RiJTIyJTJDJTIydHlwZSUyMiUzQSUyMkRvY3VtZW50JTIyJTJDJTIyZXhwcmVzc2lvbiUyMiUzQSUyMmZpbGUtaW5mby5kb2N1bWVudC1oYXNoJTNBJTIwJTVDJTIyYjMwYmM2NjdhMzI0YWUxMTFkMDI1NTI2NTYzYjY3NGE4ZDNmZDg2OWJjMDdjOGZkMjA0YWE5NWIwNWQ0MWYwYyU1QyUyMiUyMiUyQyUyMmZpbHRlcnMlMjIlM0ElNUIlNUQlMkMlMjJzZWxlY3QlMjIlM0ElNUIlMjJfbmFtZSUyMiUyQyUyMmRlc2NyaXB0aW9uLmNvbGxlY3Rpb24lMjIlMkMlMjJwcm92JTIyJTJDJTIyZGVzY3JpcHRpb24udGl0bGUlMjIlMkMlMjJkZXNjcmlwdGlvbi5wdWJsaWNhdGlvbl9kYXRlJTIyJTJDJTIyZGVzY3JpcHRpb24udXJsX3JlZnMlMjIlNUQlMkMlMjJpdGVtSW5kZXglMjIlM0EwJTJDJTIycGFnZVNpemUlMjIlM0ExMCUyQyUyMnNlYXJjaEFmdGVySGlzdG9yeSUyMiUzQSU1QiU1RCUyQyUyMnZpZXdUeXBlJTIyJTNBJTIyc25pcHBldHMlMjIlMkMlMjJyZWNvcmRTZWxlY3Rpb24lMjIlM0ElN0IlMjJyZWNvcmQlMjIlM0ElN0IlMjJpZCUyMiUzQSUyMmIzMGJjNjY3YTMyNGFlMTExZDAyNTUyNjU2M2I2NzRhOGQzZmQ4NjliYzA3YzhmZDIwNGFhOTViMDVkNDFmMGMlMjIlN0QlMkMlMjJpdGVtSW5kZXglMjIlM0E3MSU3RCU3RA%3D%3D).

Let us try out a different question on our document corpus.
Here we also include (commented out) various additional parameters the user can optionally set:
- `retr_k`: number of items to retrieve
- `text_weight`: weight of lexical search (`0.0`: fully semantic search, `1.0`: fully lexical search, anything in-between: hybrid search)
- `rerank`: whether to rerank the retrieval results
- `gen_ctx_extr_method`: method for extracting the generation context from the document, either `"window"` or `"page"` (default: `"window"`)
- `gen_ctx_window_size`: maximum number of characters to use for the extracted generation context; relevant only if `gen_ctx_extr_method` is `"window"`; the actual extraction is quantized at the document-item level (default: `5000`)
- `gen_ctx_window_lead_weight`: weight of the leading text when distributing the window size that remains after extracting the `main_path`; relevant only if `gen_ctx_extr_method` is `"window"` (default: `0.5`, i.e. centered around `main_path`)
- `return_prompt`: whether to return the instantiated prompt (default: `False`)

For more details refer to `deepsearch.cps.queries.RAGQuery`.
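
To build some intuition for `gen_ctx_window_size` and `gen_ctx_window_lead_weight`, here is a toy sketch of one way such a window could be formed. It is an assumption based only on the parameter descriptions above, not the actual Deep Search implementation: the matched item is always kept, the remaining character budget is split between leading and trailing text according to the lead weight, and neighbouring items are added only as whole items (the "quantized on doc item level" part). The function name `extract_window` and the sample items are hypothetical.

```python
def extract_window(
    items: list[str],
    main_idx: int,
    window_size: int = 5000,
    lead_weight: float = 0.5,
) -> list[str]:
    """Pick whole items around items[main_idx] within a character budget."""
    # budget left once the matched item itself is accounted for
    remaining = max(window_size - len(items[main_idx]), 0)
    lead_budget = remaining * lead_weight
    trail_budget = remaining - lead_budget

    # extend backwards while the next preceding item fits entirely
    start = main_idx
    while start > 0 and len(items[start - 1]) <= lead_budget:
        lead_budget -= len(items[start - 1])
        start -= 1

    # extend forwards while the next following item fits entirely
    end = main_idx
    while end < len(items) - 1 and len(items[end + 1]) <= trail_budget:
        trail_budget -= len(items[end + 1])
        end += 1

    return items[start:end + 1]


# toy document: five items of 10, 20, 30, 20 and 10 characters
doc_items = ["a" * 10, "b" * 20, "c" * 30, "d" * 20, "e" * 10]
centered = extract_window(doc_items, main_idx=2, window_size=80, lead_weight=0.5)
leading = extract_window(doc_items, main_idx=2, window_size=80, lead_weight=1.0)
trailing = extract_window(doc_items, main_idx=2, window_size=80, lead_weight=0.0)
print([item[0] for item in centered])
print([item[0] for item in leading])
print([item[0] for item in trailing])
```

In this toy model a lead weight of `0.5` keeps the match centered, while `1.0` and `0.0` spend the whole remaining budget on preceding or following text, respectively.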

In [11]:
question = "Who came up with the term 'machine learning'?"

# submit natural-language query on collection
question_query = RAGQuery(
    question=question,
    project=PROJ_KEY,
    data_source=data_source,

    ## optional retrieval params
    retr_k=RETR_K,
    # text_weight=TEXT_WEIGHT,
    # rerank=RERANK,

    ## optional generation params
    # model_id="ibm-mistralai/mixtral-8x7b-instruct-v01-q",
    # gen_params={"random_seed": 42, "max_new_tokens": 1024},
    # prompt_template="Answer the query based on the context.\n\nContext: {{ context }}\n\nQuery: {{ query }}",

    # gen_ctx_extr_method="window",
    # gen_ctx_window_size=5000,
    # gen_ctx_window_lead_weight=0.5,
    # return_prompt=True,
)
api_output = api.queries.run(question_query)
rag_result = RAGResult.from_api_output(api_output, raise_on_error=RAISE)

rich.print(rag_result)

As the returned `doc_hash` shows, this answer came from a different document than the previous one.

In [12]:
render_provenance_url(api=api, coords=coll_coords, retr_item=rag_result.answers[0].grounding.retr_items[0])

The provenance of the answer can be inspected on the [source document](https://cps.foc-deepsearch.zurich.ibm.com/projects/e0ea87922f4b732407fb3b9cf3475f0edb90cc2d/library/private/d70f151acff22f19f9cfaffb1f5baa810c8de3db?search=JTdCJTIycHJpdmF0ZUNvbGxlY3Rpb24lMjIlM0ElMjJkNzBmMTUxYWNmZjIyZjE5ZjljZmFmZmIxZjViYWE4MTBjOGRlM2RiJTIyJTJDJTIydHlwZSUyMiUzQSUyMkRvY3VtZW50JTIyJTJDJTIyZXhwcmVzc2lvbiUyMiUzQSUyMmZpbGUtaW5mby5kb2N1bWVudC1oYXNoJTNBJTIwJTVDJTIyZTQ3MGU3YjQyYTkyYzhlNWYyNTA5NDM2MjM2MTk0N2I5MjAzZTAwNzRjMjIyMzUwNWI0OTIxOTQwZWMwNzVhMSU1QyUyMiUyMiUyQyUyMmZpbHRlcnMlMjIlM0ElNUIlNUQlMkMlMjJzZWxlY3QlMjIlM0ElNUIlMjJfbmFtZSUyMiUyQyUyMmRlc2NyaXB0aW9uLmNvbGxlY3Rpb24lMjIlMkMlMjJwcm92JTIyJTJDJTIyZGVzY3JpcHRpb24udGl0bGUlMjIlMkMlMjJkZXNjcmlwdGlvbi5wdWJsaWNhdGlvbl9kYXRlJTIyJTJDJTIyZGVzY3JpcHRpb24udXJsX3JlZnMlMjIlNUQlMkMlMjJpdGVtSW5kZXglMjIlM0EwJTJDJTIycGFnZVNpemUlMjIlM0ExMCUyQyUyMnNlYXJjaEFmdGVySGlzdG9yeSUyMiUzQSU1QiU1RCUyQyUyMnZpZXdUeXBlJTIyJTNBJTIyc25pcHBldHMlMjIlMkMlMjJyZWNvcmRTZWxlY3Rpb24lMjIlM0ElN0IlMjJyZWNvcmQlMjIlM0ElN0IlMjJpZCUyMiUzQSUyMmU0NzBlN2I0MmE5MmM4ZTVmMjUwOTQzNjIzNjE5NDdiOTIwM2UwMDc0YzIyMjM1MDViNDkyMTk0MGVjMDc1YTElMjIlN0QlMkMlMjJpdGVtSW5kZXglMjIlM0E2JTdEJTdE).
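
The commented-out `prompt_template` in the RAG query above contains the placeholders `{{ context }}` and `{{ query }}`. How Deep Search instantiates the template is not shown in this notebook; the sketch below only illustrates the idea with a small regex-based substitution. The function `render_prompt` and the sample context string are hypothetical.

```python
import re


def render_prompt(template: str, **values: str) -> str:
    """Replace each `{{ name }}` placeholder with the matching keyword value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]

    # match `{{ name }}` with optional whitespace inside the braces
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Answer the query based on the context.\n\n"
    "Context: {{ context }}\n\nQuery: {{ query }}"
)
prompt = render_prompt(
    template,
    context="<retrieved passage>",  # made-up stand-in for the retrieved text
    query="Who came up with the term 'machine learning'?",
)
print(prompt)
```

Setting `return_prompt=True` on the `RAGQuery` is the way to inspect the prompt that was actually instantiated by the service.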

## Semantic retrieval

Besides RAG, which includes natural language generation, a user may be interested only in
the semantic retrieval part.

This can be obtained in much the same way as RAG, as shown below:

In [13]:
question = "Where is the IBM lab in Zurich?"

# submit natural-language query on collection
question_query = SemanticQuery(
    question=question,
    project=PROJ_KEY,
    data_source=data_source,

    ## optional params
    retr_k=RETR_K,
    # text_weight=TEXT_WEIGHT,
    # rerank=RERANK,
)
api_output = api.queries.run(question_query)
search_result = SearchResult.from_api_output(api_output, raise_on_error=RAISE)

rich.print(search_result)
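
The `text_weight` parameter is described above as `0.0` for semantic-only, `1.0` for lexical-only, and anything in between for hybrid search. The scoring formula used by Deep Search is not documented in this notebook; the sketch below assumes a simple linear blend of two scores, which is one common way to obtain that behaviour at the endpoints. The candidate scores and the names `hybrid_score` and `rank` are made up for illustration.

```python
def hybrid_score(lexical: float, semantic: float, text_weight: float) -> float:
    """Blend two scores linearly; text_weight=1.0 means lexical only (assumed)."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0.0, 1.0]")
    return text_weight * lexical + (1.0 - text_weight) * semantic


def rank(candidates: list[dict], text_weight: float, k: int) -> list[str]:
    """Return the ids of the k candidates with the highest blended score."""
    ordered = sorted(
        candidates,
        key=lambda c: hybrid_score(c["lexical"], c["semantic"], text_weight),
        reverse=True,
    )
    return [c["id"] for c in ordered[:k]]


# made-up scores for three candidate passages
candidates = [
    {"id": "doc-a", "lexical": 0.9, "semantic": 0.2},
    {"id": "doc-b", "lexical": 0.3, "semantic": 0.9},
    {"id": "doc-c", "lexical": 0.5, "semantic": 0.5},
]
semantic_top = rank(candidates, text_weight=0.0, k=2)
lexical_top = rank(candidates, text_weight=1.0, k=2)
hybrid_top = rank(candidates, text_weight=0.5, k=2)
print(semantic_top, lexical_top, hybrid_top)
```

Here `k` plays the role of `retr_k`: changing the weight changes which candidates make it into the top `k`, which is why it can be worth trying a few `text_weight` values on your own collection.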