# RAG and Semantic Retrieval on a Document Collection

Deep Search allows users to interact with the documents using conversational AI, i.e. you interact with a virtual assistant which answers your questions using the information in the corpus.

In this example we demonstrate how to achieve the same interaction programmatically.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.


### GenAI Integration required

When interacting with the virtual assistant, Deep Search requires a connection to a Generative AI API. Currently, we support connections to [watsonx.ai](https://www.ibm.com/products/watsonx-ai) or the IBM-internal GenAI platform BAM.

Deep Search allows custom GenAI configurations for each project.
To run the following example, you need to work in a project which has such GenAI capabilities activated.

### Set notebook parameters


In [None]:
from dsnotebooks.settings import CollQANotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = CollQANotebookSettings()

PROFILE_NAME = notebook_settings.profile      # the profile to use
PROJ_KEY = notebook_settings.proj_key         # the project to use
INDEX_KEY = notebook_settings.sem_on_idx_key  # the collection to use

RETR_K = notebook_settings.retr_k
TEXT_WEIGHT = notebook_settings.text_weight
RERANK = notebook_settings.rerank

### Import example dependencies

In [None]:
# Import standard dependencies
import pandas as pd
import rich

# IPython utilities
from IPython.display import display, Markdown

# Import the deepsearch-toolkit
from deepsearch.cps.client.api import CpsApi
from deepsearch.cps.client.components.elastic import ElasticProjectDataCollectionSource
from deepsearch.cps.queries import DataQuery, CorpusRAGQuery, CorpusSemanticQuery
from deepsearch.cps.queries.results import RAGResult, SearchResult, SearchResultItem


### Connect to Deep Search

In [None]:
api = CpsApi.from_env(profile_name=PROFILE_NAME)

### Utils

In [None]:
def render_provenance_url(
        api: CpsApi,
        coords: ElasticProjectDataCollectionSource,
        retr_item: SearchResultItem,
):
    # compute URL to the document in the Deep Search UI
    item_index = int(retr_item.path_in_doc[retr_item.path_in_doc.rfind(".")+1:])
    doc_url = api.documents.generate_url(
        document_hash=retr_item.doc_hash,
        data_source=coords,
        item_index=item_index,
    )
    display(Markdown(f"The provenance of the answer can be inspected on the [source document]({doc_url})."))
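
The helper above recovers the item index from the trailing segment of `path_in_doc`. A minimal standalone sketch of that parsing step follows; the function name `item_index_from_path` and the sample path `main-text.42` are illustrative assumptions, not part of the toolkit. Unlike the slicing expression above, this variant raises an explicit error when the path contains no `.` separator.

```python
def item_index_from_path(path_in_doc: str) -> int:
    """Return the integer that follows the last '.' in a document path.

    Hypothetical helper for illustration; the path layout is assumed.
    """
    _prefix, sep, tail = path_in_doc.rpartition(".")
    if not sep:
        raise ValueError(f"no '.' separator in path: {path_in_doc!r}")
    # int() raises ValueError if the trailing segment is not an integer
    return int(tail)


# Assumed example path, not taken from a real collection
print(item_index_from_path("main-text.42"))
```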

---

Prepare the collection coordinates:

In [None]:
coll_coords = ElasticProjectDataCollectionSource(
    proj_key=PROJ_KEY,
    index_key=INDEX_KEY,
)

We are using a small collection, so we can just list its documents to get an idea of its contents (for more details on querying, check the [Data Query Quick Start](https://github.com/DS4SD/deepsearch-examples/tree/main/examples/data_query_quick_start)).

In [None]:
# Prepare the data query
query = DataQuery(
    search_query="*",  # The search query to be executed
    source=[           # Which fields of documents we want to fetch
            "file-info.document-hash",
            "file-info.filename",
            # "description.title",
    ],
    coordinates=coll_coords,  # The data collection to be queried
)

# Query Deep Search for the documents matching the query
results = []
query_results = api.queries.run(query)
for row in query_results.outputs["data_outputs"]:
    # Add row to results table
    results.append({
        "Filename": row["_source"]["file-info"]["filename"],
        "DocHash": row["_source"]["file-info"]["document-hash"],
        # "Title": row["_source"].get("description", {}).get("title"),
    })

print(f'Finished fetching all data. Total is {len(results)} records.')

# Visualize the table with all results
df = pd.json_normalize(results)
display(df)
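
The loop above indexes directly into `_source` and `file-info`, so a row missing either key would raise a `KeyError`. As a sketch of a more tolerant variant, the hypothetical helper `rows_to_records` below uses `dict.get` and fills missing fields with `None`. The sample rows are made up for illustration and only mirror the nesting used in the query above.

```python
from typing import Any, Dict, List, Optional


def rows_to_records(rows: List[Dict[str, Any]]) -> List[Dict[str, Optional[str]]]:
    """Flatten query rows into table records, tolerating missing fields.

    Hypothetical helper; the row layout is assumed to match the loop above.
    """
    records = []
    for row in rows:
        file_info = row.get("_source", {}).get("file-info", {})
        records.append({
            "Filename": file_info.get("filename"),
            "DocHash": file_info.get("document-hash"),
        })
    return records


# Made-up rows standing in for query_results.outputs["data_outputs"]
sample_rows = [
    {"_source": {"file-info": {"filename": "example-a.pdf", "document-hash": "hash-a"}}},
    {"_source": {"file-info": {"filename": "example-b.pdf"}}},  # no hash present
]
records = rows_to_records(sample_rows)
print(records)
```

The resulting list can be passed to `pd.json_normalize` in the same way as `results` above.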

## Ingestion

In the cell below we show how to semantically index your collection (skip this step if the collection is already semantically indexed):

In [None]:
# from deepsearch.cps.client.components.documents import SemIngestPrivateDataCollectionSource

# # launch the ingestion of the collection for DocumentQA
# task = api.documents.semantic_ingest(
#     project=PROJ_KEY,
#     data_source=SemIngestPrivateDataCollectionSource(
#         source=coll_coords,
#     ),
# )

# # wait for the ingestion task to finish
# api.tasks.wait_for(task.proj_key, task.task_id)

## RAG

In [None]:
question = "Where is the IBM lab in Zurich?"

# submit natural-language query on collection
question_query = CorpusRAGQuery(
    question=question,
    project=PROJ_KEY,
    index_key=INDEX_KEY,
)
api_output = api.queries.run(question_query)
rag_result = RAGResult.from_api_output(api_output)

rich.print(rag_result)


Additionally, we can generate a provenance URL to the document in the Deep Search UI:

In [None]:
render_provenance_url(api=api, coords=coll_coords, retr_item=rag_result.answers[0].grounding.items[0])

Let us try out a different question on our document corpus.
Here we also illustrate some further parameters the user can optionally set:
- `retr_k`: number of items to retrieve
- `text_weight`: weight of lexical search (`0.0`: fully semantic search, `1.0`: fully lexical search, anything in-between: hybrid search)
- `rerank`: whether to rerank the retrieval results
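
To build intuition for `text_weight`, the sketch below blends a lexical and a semantic score as a convex combination. This is an assumption made purely for illustration: how Deep Search actually combines the two signals is not described here, and the function `hybrid_score` and the sample scores are hypothetical.

```python
def hybrid_score(lexical: float, semantic: float, text_weight: float) -> float:
    """Blend two relevance scores with a convex combination.

    Illustrative only; not necessarily how Deep Search computes its ranking.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0.0, 1.0]")
    return text_weight * lexical + (1.0 - text_weight) * semantic


# Made-up scores for a single retrieved item
lexical, semantic = 0.2, 0.8
for weight in (0.0, 0.5, 1.0):
    print(weight, hybrid_score(lexical, semantic, weight))
```

Under this assumed scheme, `0.0` reproduces the semantic score, `1.0` reproduces the lexical score, and intermediate values interpolate between the two.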

In [None]:
question = "Who coined the term 'machine learning'?"

# submit natural-language query on collection
question_query = CorpusRAGQuery(
    question=question,
    project=PROJ_KEY,
    index_key=INDEX_KEY,

    # optional params:
    retr_k=RETR_K,
    text_weight=TEXT_WEIGHT,
    rerank=RERANK,
)
api_output = api.queries.run(question_query)
rag_result = RAGResult.from_api_output(api_output)

rich.print(rag_result)

As the returned `doc_hash` shows, this answer came from a different document than the previous one.

In [None]:
render_provenance_url(api=api, coords=coll_coords, retr_item=rag_result.answers[0].grounding.items[0])
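
To check programmatically whether two answers are grounded in the same document, one could compare the `doc_hash` values of their grounding items. The sketch below uses `SimpleNamespace` stand-ins that mirror only the attributes accessed in this notebook (`answers`, `grounding.items`, `doc_hash`); the helper name `grounding_doc_hashes` and the hash values are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Set


def grounding_doc_hashes(rag_result: Any) -> Set[str]:
    """Collect the doc_hash of every grounding item across all answers."""
    return {
        item.doc_hash
        for answer in rag_result.answers
        for item in answer.grounding.items
    }


def make_result(doc_hash: str) -> SimpleNamespace:
    """Build a stand-in result with one answer and one grounding item."""
    item = SimpleNamespace(doc_hash=doc_hash)
    answer = SimpleNamespace(grounding=SimpleNamespace(items=[item]))
    return SimpleNamespace(answers=[answer])


# Made-up results standing in for two separate RAG queries
first_result = make_result("hash-a")
second_result = make_result("hash-b")

shared = grounding_doc_hashes(first_result) & grounding_doc_hashes(second_result)
print("same document" if shared else "different documents")
```

With real results, the stand-ins would be replaced by the `rag_result` objects returned for each question.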

## Semantic retrieval

Besides RAG, which includes natural language generation, a user may only be interested in
the semantic retrieval part.

This can be obtained very similarly to RAG, as shown below:

In [None]:
question = "Where is the IBM lab in Zurich?"

# submit natural-language query on collection
question_query = CorpusSemanticQuery(
    question=question,
    project=PROJ_KEY,
    index_key=INDEX_KEY,

    # optional params:
    retr_k=RETR_K,
    # text_weight=TEXT_WEIGHT,
    # rerank=RERANK,
)
api_output = api.queries.run(question_query)
search_result = SearchResult.from_api_output(api_output)

rich.print(search_result)