![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# Use watsonx and Chroma to answer questions (RAG)

#### Disclaimers

- Use only Projects and Spaces that are available in the watsonx context.

## Notebook content

This notebook contains the steps and code to demonstrate support for Retrieval Augmented Generation (RAG) in watsonx.ai. It introduces commands for data retrieval, knowledge base building and querying, and model testing.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

#### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index the knowledge base passages (once)
- Retrieve the relevant passage(s) from the knowledge base (for every user query)
- Generate a response by feeding the retrieved passage(s) into a large language model (for every user query)
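
The three steps above can be sketched end to end with the standard library only. This is a toy illustration, not the method used later in this notebook: it ranks passages by simple word overlap instead of dense embeddings, and the helper names (`build_index`, `retrieve`, `build_prompt`) and sample passages are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Step 1: index each passage once (here: a bag-of-words Counter)."""
    return [(passage, Counter(tokenize(passage))) for passage in passages]


def retrieve(index, query, n_results=1):
    """Step 2: rank passages by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = [
        (sum((counts & query_counts).values()), passage)
        for passage, counts in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:n_results]]


def build_prompt(contexts, question):
    """Step 3: combine the retrieved passages and the question for the model."""
    return "Please answer the following.\n" + "\n\n".join(contexts) + ":\n\n" + question


# Invented passages, for illustration only.
toy_passages = [
    "In Washington DC, DC stands for the District of Columbia.",
    "Daylight saving time ends on the first Sunday in November.",
]
toy_index = build_index(toy_passages)
toy_question = "what does dc stand for in washington dc?"
toy_contexts = retrieve(toy_index, toy_question)
toy_prompt = build_prompt(toy_contexts, toy_question)
print(toy_prompt)
```

The resulting prompt string is what would be sent to a large language model in the third step.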

## Contents

This notebook contains the following parts:

- [Setup](#setup)
- [Data (test) loading](#data)
- [Foundation Models on watsonx](#models)
- [Generate a retrieval-augmented response to a question](#predict)
- [Calculate rougeL metric](#score)


<a id="setup"></a>
## Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

-  Contact your Cloud Pak for Data administrator and ask for your account credentials.


### Install and import `ibm-watsonx-ai` and its dependencies
**Note:** `ibm-watsonx-ai` documentation can be found <a href="https://ibm.github.io/watsonx-ai-python-sdk/index.html" target="_blank" rel="noopener no referrer">here</a>.

In [None]:
!pip install chromadb==0.3.27 | tail -n 1
!pip install sentence_transformers | tail -n 1
!pip install wget | tail -n 1
!pip install pandas | tail -n 1
!pip install evaluate | tail -n 1
!pip install nltk | tail -n 1
!pip install -U ibm-watsonx-ai | tail -n 1

In [2]:
import os
import pandas as pd
from typing import Optional, Dict, Any, Iterable, List

try:
    from sentence_transformers import SentenceTransformer
except ImportError as err:
    raise ImportError("Could not import sentence_transformers: Please install the sentence-transformers package.") from err

try:
    import chromadb
    from chromadb.api.types import EmbeddingFunction
except ImportError as err:
    raise ImportError("Could not import chromadb: Please install the chromadb package.") from err

### Connection to WML

Authenticate to the Watson Machine Learning (WML) service on IBM Cloud Pak for Data. You need to provide the platform `url`, your `username`, and your `api_key`.

In [3]:
username = 'PASTE YOUR USERNAME HERE'
api_key = 'PASTE YOUR API_KEY HERE'
url = 'PASTE THE PLATFORM URL HERE'

In [3]:
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    username=username,
    api_key=api_key,
    url=url,
    instance_id="openshift",
    version="5.0"
)

Alternatively, you can use your `username` and `password` to authenticate to WML services.

```python
credentials = Credentials(
    username="***",
    password="***",
    url="***",
    instance_id="openshift",
    version="5.0"
)
```

### Defining the project id
The foundation model requires a project ID that provides the context for the call. The notebook first tries to read the ID from the `PROJECT_ID` environment variable of the project in which it runs. If the variable is not set, you are prompted to enter the project ID.

In [4]:
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

<a id="data"></a>
## Data (test) loading

Download the test dataset. This dataset is used to calculate the metric scores for the selected model, the defined prompts, and the parameters.

In [5]:
import wget

questions_test_filename = 'questions_test.csv'
questions_train_filename = 'questions_train.csv'
questions_test_url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/RAG/questions_test.csv'
questions_train_url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/RAG/questions_train.csv'


if not os.path.isfile(questions_test_filename): 
    wget.download(questions_test_url, out=questions_test_filename)


if not os.path.isfile(questions_train_filename): 
    wget.download(questions_train_url, out=questions_train_filename)

In [6]:
filename_test = './questions_test.csv'
filename_train =  './questions_train.csv'

test_data = pd.read_csv(filename_test)
train_data = pd.read_csv(filename_train)

Inspect a data sample.

In [7]:
train_data.head()

Unnamed: 0,qid,question,answers
0,1961,where does diffusion occur in the excretory sy...,diffusion
1,7528,when did the us join world war one,"April 6 , 1917"
2,8685,who played wilma in the movie the flintstones,Elizabeth Perkins
3,6716,when was the office of the vice president created,1787
4,2916,where does carbon fixation occur in c4 plants,in the mesophyll cells


### Build up knowledge base

The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.
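
To make "semantic similarity" between dense vectors concrete, here is a minimal sketch of cosine similarity on made-up 3-dimensional vectors. Cosine similarity is only one common choice, and the vectors and names below are illustrative assumptions; the distance a vector database actually reports depends on how its collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real models emit hundreds of dimensions.
query_vec = [0.9, 0.1, 0.0]
passage_vecs = {
    "passage_a": [0.8, 0.2, 0.1],
    "passage_b": [0.0, 0.1, 0.9],
}
# Rank passage names from most to least similar to the query.
ranked = sorted(
    passage_vecs,
    key=lambda name: cosine_similarity(query_vec, passage_vecs[name]),
    reverse=True,
)
print(ranked)
```

Passages whose vectors point in nearly the same direction as the query vector rank first.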

We can generate dense vector representations using embedding models. In this notebook, we use [SentenceTransformers](https://www.google.com/search?client=safari&rls=en&q=sentencetransformers&ie=UTF-8&oe=UTF-8) [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to embed both the knowledge base passages and user queries. `all-MiniLM-L6-v2` is a performant open-source model that is small enough to run locally.

A vector database is optimized for dense vector indexing and retrieval. This notebook uses [Chroma](https://docs.trychroma.com), a user-friendly open-source vector database licensed under Apache 2.0, which offers good speed and performance with the `all-MiniLM-L6-v2` embedding model.

The dataset we are using is already split into self-contained passages that can be ingested by Chroma. 

The size of each passage is limited by the embedding model's context window (which is 256 tokens for `all-MiniLM-L6-v2`).
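
The dataset in this notebook is already split, but for your own documents a splitting step would be needed. The following is a minimal sketch, assuming that a word count is an acceptable rough proxy for model tokens (it is not exact); `split_into_chunks`, `max_words`, and `overlap` are hypothetical names chosen for the example.

```python
def split_into_chunks(text, max_words=100, overlap=20):
    """Split text into overlapping word windows.

    Word counts are only a rough proxy for model tokens (assumption).
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Synthetic 250-word text: w0 w1 ... w249
long_text = " ".join(f"w{i}" for i in range(250))
chunks = split_into_chunks(long_text, max_words=100, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap keeps some shared words between neighbouring chunks, so a sentence that straddles a boundary is not lost entirely.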

### Load knowledge base documents

Load the set of documents that is used to build the knowledge base.

In [8]:
knowledge_base_dir = "./knowledge_base"

In [9]:
my_path = f"{os.getcwd()}/knowledge_base"
if not os.path.isdir(my_path):
    os.makedirs(my_path)

In [10]:
documents_filename = 'knowledge_base/psgs.tsv'
documents_url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/RAG/psgs.tsv'


if not os.path.isfile(documents_filename): 
    wget.download(documents_url, out=documents_filename)

In [11]:
documents = pd.read_csv(f"{knowledge_base_dir}/psgs.tsv", sep='\t', header=0)
documents['indextext'] = documents['title'].astype(str) + "\n" + documents['text']
documents = documents[:517]

### Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of chromadb may differ depending on the embedding model used.

In [12]:
class MiniLML6V2EmbeddingFunction(EmbeddingFunction):
    MODEL = SentenceTransformer('all-MiniLM-L6-v2')
    def __call__(self, texts):
        return MiniLML6V2EmbeddingFunction.MODEL.encode(texts).tolist()
emb_func = MiniLML6V2EmbeddingFunction()
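
The class above accepts a list of texts and returns one vector per text. The same contract can be sketched without downloading a model. `HashingEmbeddingFunction` below is a hypothetical toy (it is not part of chromadb or sentence_transformers) that hashes words into buckets; it captures no semantics and is meant only to show the input and output shapes.

```python
import hashlib
import math


class HashingEmbeddingFunction:
    """Toy embedding: hash each word into one of `dim` buckets (hypothetical)."""

    def __init__(self, dim=16):
        self.dim = dim

    def _embed_one(self, text):
        vector = [0.0] * self.dim
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vector[bucket] += 1.0
        # Normalize to unit length so that texts of different sizes are comparable.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def __call__(self, texts):
        # Same contract as the notebook's class: list of texts in, list of vectors out.
        return [self._embed_one(text) for text in texts]


toy_emb_func = HashingEmbeddingFunction(dim=16)
vectors = toy_emb_func(["carbon fixation", "carbon fixation", "world war one"])
print(len(vectors), len(vectors[0]))
```

Any callable with this shape could in principle be swapped in, although retrieval quality depends heavily on the embedding model.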

### Set up Chroma upsert

Upserting a document means inserting it if it is not yet in the database and updating it if it already exists. Without upsert, re-inserting an existing document throws an error. Upserting is therefore useful for experimentation, where cells are often re-run.
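
The difference between a strict insert and an upsert can be sketched with a plain dictionary. `ToyCollection` is a hypothetical in-memory stand-in written for this illustration; it is not the Chroma API.

```python
class ToyCollection:
    """In-memory stand-in for a document collection (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        # Strict insert: refuses to overwrite an existing ID.
        if doc_id in self._docs:
            raise ValueError(f"ID {doc_id!r} already exists")
        self._docs[doc_id] = text

    def upsert(self, doc_id, text):
        # Insert if absent, update if present.
        self._docs[doc_id] = text

    def count(self):
        return len(self._docs)

    def get(self, doc_id):
        return self._docs[doc_id]


collection = ToyCollection()
collection.upsert("1", "first version")
collection.upsert("1", "second version")  # re-running the same step is safe
try:
    collection.add("1", "third version")
    add_failed = False
except ValueError:
    add_failed = True
print(collection.count(), collection.get("1"), add_failed)
```
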

In [13]:
class ChromaWithUpsert:
    def __init__(
            self,
            name: Optional[str] = "watsonx_rag_collection",
            persist_directory:Optional[str]=None,
            embedding_function: Optional[EmbeddingFunction]=None,
            collection_metadata: Optional[Dict] = None,
    ):
        self._client_settings = chromadb.config.Settings()
        if persist_directory is not None:
            self._client_settings = chromadb.config.Settings(
                chroma_db_impl="duckdb+parquet",
                persist_directory=persist_directory,
            )
        self._client = chromadb.Client(self._client_settings)
        self._embedding_function = embedding_function
        self._persist_directory = persist_directory
        self._name = name
        self._collection = self._client.get_or_create_collection(
            name=self._name,
            embedding_function=self._embedding_function
            if self._embedding_function is not None
            else None,
            metadata=collection_metadata,
        )

    def upsert_texts(
        self,
        texts: Iterable[str],
        metadata: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Run texts through the embedding function and upsert them into the vector store.
        Args:
            texts (Iterable[str]): Texts to add to the vector store.
            metadata (Optional[List[dict]]): Optional per-text metadata (such as title).
            ids (Optional[List[str]]): Optional list of IDs; generated when omitted.
        Returns:
            List[str]: List of IDs of the upserted texts.
        """
        texts = list(texts)  # materialize so a one-shot iterable is not exhausted below
        if ids is None:
            import uuid
            ids = [str(uuid.uuid1()) for _ in texts]
        self._collection.upsert(
            metadatas=metadata, documents=texts, ids=ids
        )
        return ids

    def is_empty(self):
        return self._collection.count()==0

    def persist(self):
        self._client.persist()

    def query(self, query_texts:str, n_results:int=5):
        """
        Returns the passages whose vectors are closest to the question vector
        :param query_texts: the question(s)
        :param n_results: number of results to return per question
        :return: the closest results to the given question(s)
        """
        return self._collection.query(query_texts=query_texts, n_results=n_results)

### Embed and index documents with Chroma

**Note: Could take several minutes if you don't have pre-built indices**

In [14]:
%%time
chroma = ChromaWithUpsert(
    name="nq910_minilm6v2",
    embedding_function=emb_func,  # you can have something here using /embed endpoint
    persist_directory=knowledge_base_dir,
)
if chroma.is_empty():
    _ = chroma.upsert_texts(
        texts=documents.indextext.tolist(),
        # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
        metadata=[{'title': title, 'id': id}
                  for (title,id) in
                  zip(documents.title, documents.id)],  # filter on these!
        ids=[str(i) for i in documents.id],  # unique for each doc
    )
    chroma.persist()

CPU times: user 3.69 s, sys: 382 ms, total: 4.07 s
Wall time: 4.54 s


<a id="models"></a>
## Foundation Models on watsonx

### Defining model
You need to specify the `model_id` that will be used for inferencing:

In [15]:
from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.FLAN_UL2

### Defining the model parameters
We need to provide a set of model parameters that will influence the result:

In [16]:
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 50
}
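
To illustrate what greedy decoding and the minimum and maximum new-token limits mean, here is a toy sketch. It is not how watsonx.ai implements generation: `greedy_decode`, `toy_model`, the `<eos>` marker, and the scripted probabilities are all assumptions made for the example, and suppressing the end-of-sequence token is just one plausible way to enforce a minimum length.

```python
def greedy_decode(next_token_probs, min_new_tokens=1, max_new_tokens=50, eos="<eos>"):
    """Pick the most probable token at every step.

    `next_token_probs` is a hypothetical callable: it takes the tokens
    generated so far and returns a dict mapping candidate tokens to probabilities.
    """
    generated = []
    while len(generated) < max_new_tokens:
        probs = dict(next_token_probs(generated))
        if len(generated) < min_new_tokens:
            # Too early to stop, so the end-of-sequence token is not allowed yet.
            probs.pop(eos, None)
        if not probs:
            break
        token = max(probs, key=probs.get)
        if token == eos:
            break
        generated.append(token)
    return generated


def toy_model(generated):
    """Scripted distribution standing in for a real language model."""
    script = [
        {"District": 0.7, "Washington": 0.3},
        {"of": 0.9, "<eos>": 0.1},
        {"Columbia": 0.8, "<eos>": 0.2},
        {"<eos>": 0.95, "County": 0.05},
    ]
    return script[min(len(generated), len(script) - 1)]


print(greedy_decode(toy_model))
print(greedy_decode(toy_model, max_new_tokens=2))
```

Because greedy decoding always takes the top-scoring token, repeated calls with the same input give the same output in this sketch.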

### Initialize the `ModelInference` class.

In [18]:
from ibm_watsonx_ai.foundation_models import ModelInference

model = ModelInference(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
)

<a id="predict"></a>
## Generate a retrieval-augmented response to a question

### Select questions

Get questions from the previously loaded test dataset.

In [19]:
question_texts = [q.strip("?") + "?" for q in test_data['question'].tolist()]
print("\n".join(question_texts))

when do abby and luka get back together?
what does dc stand for in washigton dc?
where was agatha christie s crooked house filmed?
where did the song god bless america originate?
when does daylight savings time end in colorado?
who did the steelers play in the playoffs last year?
most road maps are what kind of map?
who plays captian hook in once upon a time?
when is the last time mayon volcano erupted?
who does finn wolf hard play in stranger things?
who plays scott granger on young and restless?
who did the original power rangers theme song?
who is going to be in the world cup final?
who does the voice of brian on family guy?
who holds the 3 point record in nba?
when did the food stamp card come out?
who is the goddess of the moon in greek mythology?
who won the 2015 great british baking show?
which team has the most ncaa tournament appearances?
who is the original singer of where is the love?
what is the seatbelt compliancy rate in texas?
when did fresh prince of bel air start?
what

### Retrieve relevant context

Fetch paragraphs similar to the question.

In [20]:
relevant_contexts = []

for question_text in question_texts:
    relevant_chunks = chroma.query(
        query_texts=[question_text],
        n_results=5,
    )
    relevant_contexts.append(relevant_chunks)

Get the set of chunks for one of the questions.

In [21]:
sample_chunks = relevant_contexts[0]
for i, chunk in enumerate(sample_chunks['documents'][0]):
    print("=========")
    print("Paragraph index : ", sample_chunks['ids'][0][i])
    print("Paragraph : ", chunk)
    print("Distance : ", sample_chunks['distances'][0][i])

Paragraph index :  180
Paragraph :  Brian Cassidy
episode , `` Undercover Blue '' , Cassidy is accused of rape by a prostitute while he was undercover almost four years prior . It is revealed that Cassidy was being set up by the woman and her boss to make money off a lawsuit against the NYPD and the charges are dropped . Also in this episode , Munch says that Cassidy paid the price for having a relationship with a prostitute while undercover with Ganzel , as he was demoted from detective to an officer who works nights at a Bronx courthouse . Benson and Cassidy also are forced to reveal their romantic relationship in this episode when Amaro and Munch go to Cassidy 's apartment and find Benson there . In Season 15 , Cassidy and Benson are still romantically involved and move in together . In the episode `` Internal Affairs '' , Cassidy is put undercover by Internal Affairs Bureau Lt. Ed Tucker ( Robert John Burke ) to investigate a dirty precinct , an assignment that very nearly leads to
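
The query result used above stores IDs, documents, and distances as parallel lists, with one inner list per query. A minimal sketch of flattening that shape into per-passage records follows; `flatten_query_result` is a hypothetical helper and the mock values are invented for illustration.

```python
def flatten_query_result(result, query_index=0):
    """Turn one query's parallel lists into a list of per-passage dicts."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    distances = result["distances"][query_index]
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(ids, documents, distances)
    ]


# Hand-written mock with the same nesting the cell above indexes into;
# the IDs, texts, and distances are invented for illustration.
mock_result = {
    "ids": [["180", "42"]],
    "documents": [["first passage", "second passage"]],
    "distances": [[0.91, 1.27]],
}
rows = flatten_query_result(mock_result)
# A smaller distance means the passage is closer to the query.
closest = min(rows, key=lambda row: row["distance"])
print(closest["id"], closest["distance"])
```
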

### Feed the context and the questions to the `watsonx.ai` model

In [22]:
def make_prompt(context, question_text):
    return (f"Please answer the following.\n"
          + f"{context}:\n\n"
          + f"{question_text}")

prompt_texts = []

for relevant_context, question_text in zip(relevant_contexts, question_texts):
    context = "\n\n\n".join(relevant_context["documents"][0])
    prompt_text = make_prompt(context, question_text)
    prompt_texts.append(prompt_text)
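
Joining several passages can produce a long prompt. If prompt length becomes a concern, one option is to keep whole passages only until a size budget is reached. The sketch below assumes a hypothetical character budget (`max_context_chars`); a real limit would be measured in model tokens, and `make_bounded_prompt` is not part of this notebook's pipeline.

```python
def make_bounded_prompt(passages, question_text, max_context_chars=2000):
    """Build a prompt, keeping whole passages until a character budget is hit.

    `max_context_chars` is a hypothetical budget; a real limit would be
    counted in model tokens, not characters.
    """
    kept = []
    used = 0
    for passage in passages:
        # The first passage is always kept, even if it alone exceeds the budget.
        if kept and used + len(passage) > max_context_chars:
            break
        kept.append(passage)
        used += len(passage)
    context = "\n\n\n".join(kept)
    return f"Please answer the following.\n{context}:\n\n{question_text}"


# Synthetic passages of 1500, 1000, and 10 characters.
toy_passages = ["a" * 1500, "b" * 1000, "c" * 10]
bounded_prompt = make_bounded_prompt(toy_passages, "what is a?", max_context_chars=2000)
print(len(bounded_prompt))
```

Because retrieved passages arrive ordered by distance, dropping from the end removes the least similar passages first.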

Inspect the prompt for a sample question.

In [23]:
print(prompt_texts[0])

Please answer the following.
Brian Cassidy
episode , `` Undercover Blue '' , Cassidy is accused of rape by a prostitute while he was undercover almost four years prior . It is revealed that Cassidy was being set up by the woman and her boss to make money off a lawsuit against the NYPD and the charges are dropped . Also in this episode , Munch says that Cassidy paid the price for having a relationship with a prostitute while undercover with Ganzel , as he was demoted from detective to an officer who works nights at a Bronx courthouse . Benson and Cassidy also are forced to reveal their romantic relationship in this episode when Amaro and Munch go to Cassidy 's apartment and find Benson there . In Season 15 , Cassidy and Benson are still romantically involved and move in together . In the episode `` Internal Affairs '' , Cassidy is put undercover by Internal Affairs Bureau Lt. Ed Tucker ( Robert John Burke ) to investigate a dirty precinct , an assignment that very nearly leads to his de

### Generate a retrieval-augmented response

In [24]:
results = []

for prompt_text in prompt_texts:
    results.append(model.generate_text(prompt_text))

In [25]:
for idx, result in enumerate(results):
    print("Question = ", test_data.iloc[idx]['question'])
    print("Answer = ", result)
    print("Expected Answer(s) (may not appear with exact wording in the dataset) = ", test_data.iloc[idx]['answers'])
    print("\n")

Question =  when do abby and luka get back together
Answer =  season 14
Expected Answer(s) (may not appear with exact wording in the dataset) =  season 12


Question =  what does dc stand for in washigton dc
Answer =  District of Columbia
Expected Answer(s) (may not appear with exact wording in the dataset) =  District of Columbia


Question =  where was agatha christie s crooked house filmed
Answer =  Florence Cathedral
Expected Answer(s) (may not appear with exact wording in the dataset) =  Tyntesfield , near Bristol


Question =  where did the song god bless america originate
Answer =  the United States
Expected Answer(s) (may not appear with exact wording in the dataset) =  Yaphank , New York


Question =  when does daylight savings time end in colorado
Answer =  the first Sunday in November
Expected Answer(s) (may not appear with exact wording in the dataset) =  first Sunday in November


Question =  who did the steelers play in the playoffs last year
Answer =  Beng

<a id="score"></a>
## Calculate rougeL metric
This sample notebook uses the `evaluate` module from Hugging Face to calculate the rougeL metric.

#### Rouge Metric

**Note:** The Rouge (Recall-Oriented Understudy for Gisting Evaluation) metric is a set of evaluation measures used in natural language processing (NLP) and specifically in text summarization and machine translation tasks. The Rouge metrics are designed to assess the quality of generated summaries or translations by comparing them to one or more reference texts.

The main idea behind Rouge is to measure the overlap between the generated summary (or translation) and the reference text(s) in terms of n-grams or longest common subsequences. By calculating recall, precision, and F1 scores based on these overlapping units, Rouge provides a quantitative assessment of the summary's content overlap with the reference(s).

Rouge-1 focuses on individual word overlap, Rouge-2 considers pairs of consecutive words, and Rouge-L is based on the longest common subsequence, so it rewards words that appear in the same order in both texts. These metrics provide different perspectives on the similarity between two texts and can be used to evaluate different aspects of summarization or text generation models.
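
The overlap idea can be sketched in plain Python. This is a simplified illustration, assuming lowercased whitespace tokens and F1 scoring; the `evaluate` implementation applies its own tokenization and options, so its numbers can differ. The function names are made up for the example.

```python
from collections import Counter


def _f1(overlap, n_pred, n_ref):
    """F1 from an overlap count and the two sequence lengths."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_1(prediction, reference):
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, reference):
    """Longest-common-subsequence F1 on the same simple tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return _f1(lcs_length(pred, ref), len(pred), len(ref))


print(rouge_1("the first Sunday in November", "first Sunday in November"))
print(rouge_l("November in Sunday first", "first Sunday in November"))
```

The second call shows why word order matters for Rouge-L: the prediction contains every reference word, so its unigram overlap is perfect, but the reversed order leaves a longest common subsequence of only one word.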

In [27]:
from evaluate import load

rouge = load('rouge')
scores = rouge.compute(predictions=results, references=test_data.answers)
print(scores)

{'rouge1': 0.20555555555555555, 'rouge2': 0.12499999999999997, 'rougeL': 0.20347222222222222, 'rougeLsum': 0.20069444444444445}


---

<a id="summary"></a>
## Summary and next steps

You successfully completed this notebook!
 
Check out our _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener no referrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts. 

Copyright © 2023, 2024 IBM. This notebook and its source code are released under the terms of the MIT License.