![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# Use Watsonx to respond to natural language questions using the RAG approach

**Note:** For the watsonx challenge, run these notebooks locally on your laptop or desktop, not in IBM Cloud. Instructions for running the notebook locally are in the readme.md file included in the zip file.

This notebook contains the steps and code to demonstrate support for Retrieval Augmented Generation in watsonx.ai. It introduces commands for data retrieval, knowledge base building and querying, and model testing.

Some familiarity with Python is helpful. This notebook uses Python 3.10.

#### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from knowledge base (for every user query)
- Generate a response by feeding retrieved passage into a large language model (for every user query)
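The three steps above can be sketched with a toy example. This is only an illustration under simplifying assumptions: retrieval uses plain word overlap instead of embeddings, the generation step is reduced to building the prompt, and all names (`index_passages`, `retrieve`, `build_prompt`) are hypothetical and not used elsewhere in this notebook.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_passages(passages):
    """Step 1: index each passage once (here: a bag-of-words Counter)."""
    return [(passage, Counter(tokenize(passage))) for passage in passages]

def retrieve(index, query, n_results=1):
    """Step 2: rank passages by how many query tokens they share."""
    query_tokens = Counter(tokenize(query))
    scored = [
        (sum((tokens & query_tokens).values()), passage)
        for passage, tokens in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:n_results]]

def build_prompt(passages, query):
    """Step 3 (input side): combine retrieved passages and the question."""
    return "\n\n".join(passages) + "\n\n" + query

toy_passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]
toy_index = index_passages(toy_passages)
top_passages = retrieve(toy_index, "What is the capital of France?")
toy_prompt = build_prompt(top_passages, "What is the capital of France?")
print(toy_prompt)
```

In the rest of the notebook, the word-overlap index is replaced by a vector database and the prompt is sent to a foundation model.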


<a id="setup"></a>
##  Set up the environment



### Install and import dependencies

In [None]:
#!pip install chromadb==0.3.27 | tail -n 1
!pip install sentence_transformers | tail -n 1
!pip install pandas | tail -n 1
!pip install rouge_score | tail -n 1
!pip install nltk | tail -n 1
!pip install "ibm-watson-machine-learning>=1.0.312" | tail -n 1

In [None]:
pip install chromadb==0.3.27

**Note:** Restart the notebook kernel to pick up the correct versions of the packages installed above.

In [None]:
import os, getpass
import pandas as pd
from typing import Optional, Dict, Any, Iterable, List

try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    raise ImportError("Could not import sentence_transformers: Please install sentence-transformers package.")
    
try:
    import chromadb
    from chromadb.api.types import EmbeddingFunction
except ImportError:
    raise ImportError("Could not import chromadb: Please install chromadb package.")

### Watsonx API connection
This cell defines the credentials required to work with the watsonx API for Foundation
Model inferencing.

**Action:** Provide the IBM Cloud user API key. For details, see
[documentation](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).

In [None]:
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Please enter your WML api key (hit enter): ")
}

### Defining the project id
The API requires a project ID, which provides the context for the call. The cell below reads it from the `PROJECT_ID` environment variable if it is set; otherwise, it prompts you to enter the project ID.

**Hint**: You can find the `project_id` as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be `Projects / <project name> /`. Click on the `<project name>` link. Then get the `project_id` from Project's Manage tab (Project -> Manage -> General -> Details).


In [None]:
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

<a id="data"></a>
## Train/test data loading

Load the train and test datasets. First, use the training dataset (`train_data`) to work with the models and to prepare and tune the prompt. Then use the test dataset (`test_data`) to calculate the metric scores for the selected model, prompts, and parameters.

In [None]:
filename_test = 'data/RAG/nq910_400_instances/test.tsv'
filename_train = 'data/RAG/nq910_400_instances/train.tsv'

test_data = pd.read_csv(filename_test, delimiter='\t')
train_data = pd.read_csv(filename_train, delimiter='\t')

In [None]:
train_data.head()

In [None]:
test_data.head()

## Build up knowledge base

The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.
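To make "semantic similarity" between dense vectors concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The vectors are made up for illustration, and this is not necessarily the distance measure the vector database in this notebook uses internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))  # near 1: similar direction
print(cosine_similarity(query_vec, far_vec))    # 0: orthogonal vectors
```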

We can generate dense vector representations using embedding models. In this notebook, we use [SentenceTransformers](https://www.google.com/search?client=safari&rls=en&q=sentencetransformers&ie=UTF-8&oe=UTF-8) [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to embed both the knowledge base passages and user queries. `all-MiniLM-L6-v2` is a performant open-source model that is small enough to run locally.

A vector database is optimized for dense vector indexing and retrieval. This notebook uses [Chroma](https://docs.trychroma.com), a user-friendly open-source vector database licensed under Apache 2.0, which offers good speed and performance with the `all-MiniLM-L6-v2` embedding model.

The dataset we are using is already split into self-contained passages that can be ingested by Chroma. 

The size of each passage is limited by the embedding model's context window (which is 256 tokens for `all-MiniLM-L6-v2`).

### Load knowledge base documents

Load the set of documents used to build the knowledge base.

In [None]:
data_root = "data"
knowledge_base_dir = f"{data_root}/knowledge_base"

In [None]:
if not os.path.exists(knowledge_base_dir):
    from zipfile import ZipFile
    with ZipFile(knowledge_base_dir + ".zip", 'r') as zObject:
        zObject.extractall(data_root)

In [None]:
documents = pd.read_csv(f"{knowledge_base_dir}/psgs.tsv", sep='\t', header=0)
documents['indextext'] = documents['title'].astype(str) + "\n" + documents['text']

### Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of chromadb may differ depending on the embedding model used.

In [None]:
class MiniLML6V2EmbeddingFunction(EmbeddingFunction):
    MODEL = SentenceTransformer('all-MiniLM-L6-v2')
    def __call__(self, texts):
        return MiniLML6V2EmbeddingFunction.MODEL.encode(texts).tolist()
emb_func = MiniLML6V2EmbeddingFunction()

### Set up Chroma upsert

Upserting a document inserts it if it is new and updates it if it already exists in the database. Re-inserting an existing document would otherwise throw an error. This is useful for experimentation, because the indexing cells can be re-run safely.
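The difference between a strict insert and an upsert can be sketched with a plain dictionary. `ToyStore` is a hypothetical in-memory stand-in for illustration only; it is not how Chroma is implemented.

```python
class ToyStore:
    """In-memory stand-in for a document collection keyed by ID."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, text):
        """Strict insert: refuse to overwrite an existing ID."""
        if doc_id in self._docs:
            raise KeyError(f"document {doc_id!r} already exists")
        self._docs[doc_id] = text

    def upsert(self, doc_id, text):
        """Insert the document, or overwrite it if the ID already exists."""
        self._docs[doc_id] = text

    def count(self):
        return len(self._docs)

store = ToyStore()
store.upsert("1", "first version")
store.upsert("1", "second version")  # no error: the document is replaced

try:
    store.insert("1", "third version")
    insert_failed = False
except KeyError:
    insert_failed = True

print(store.count(), insert_failed)
```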

In [None]:
class ChromaWithUpsert:
    def __init__(
            self,
            name: Optional[str] = "watsonx_rag_collection",
            persist_directory:Optional[str]=None,
            embedding_function: Optional[EmbeddingFunction]=None,
            collection_metadata: Optional[Dict] = None,
    ):
        self._client_settings = chromadb.config.Settings()
        if persist_directory is not None:
            self._client_settings = chromadb.config.Settings(
                chroma_db_impl="duckdb+parquet",
                persist_directory=persist_directory,
            )
        self._client = chromadb.Client(self._client_settings)
        self._embedding_function = embedding_function
        self._persist_directory = persist_directory
        self._name = name
        self._collection = self._client.get_or_create_collection(
            name=self._name,
            embedding_function=self._embedding_function
            if self._embedding_function is not None
            else None,
            metadata=collection_metadata,
        )

    def upsert_texts(
        self,
        texts: Iterable[str],
        metadata: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Run more texts through the embeddings and add to the vectorstore.
        Args:
            :param texts (Iterable[str]): Texts to add to the vectorstore.
            :param metadata (Optional[List[dict]], optional): Optional list of metadata dicts (such as title, etc.).
            :param ids (Optional[List[str]], optional): Optional list of IDs.
        Returns:
            List[str]: List of IDs of the added texts.
        """
        # TODO: Handle the case where the user doesn't provide ids on the Collection
        if ids is None:
            import uuid
            ids = [str(uuid.uuid1()) for _ in texts]
        self._collection.upsert(
            metadatas=metadata, documents=texts, ids=ids
        )
        return ids

    def is_empty(self):
        return self._collection.count()==0

    def persist(self):
        self._client.persist()

    def query(self, query_texts:str, n_results:int=5):
        """
        Returns the closest vectors to the question vector
        :param query_texts: the question
        :param n_results: number of results to return
        :return: the closest results to the given question
        """
        return self._collection.query(query_texts=query_texts, n_results=n_results)

### Embed and index documents with Chroma

**Note: Could take several minutes if you don't have pre-built indices**

In [None]:
%%time
chroma = ChromaWithUpsert(
    name=f"nq910_minilm6v2",
    embedding_function=emb_func,  # you can have something here using /embed endpoint
    persist_directory=knowledge_base_dir,
)
if chroma.is_empty():
    _ = chroma.upsert_texts(
        texts=documents.indextext.tolist(),
        # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
        metadata=[{'title': title, 'id': id}
                  for (title,id) in
                  zip(documents.title, documents.id)],  # filter on these!
        ids=[str(i) for i in documents.id],  # unique for each doc
    )
    chroma.persist()

<a id="models"></a>
## Foundation Models on Watsonx

You need to specify the `model_id` that will be used for inferencing.

**Action**: Use the `FLAN_UL2` model.

In [None]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

In [None]:
model_id = ModelTypes.FLAN_UL2

<a id="predict"></a>
## Generate a retrieval-augmented response to a question

### Select questions

Get questions from the previously loaded test dataset.

In [None]:
question_texts = [q.strip("?") + "?" for q in test_data['question'].tolist()]
print("\n".join(question_texts))

### Retrieve relevant context

Fetch paragraphs similar to the question.

In [None]:
relevant_contexts = []

for question_text in question_texts:
    relevant_chunks = chroma.query(
        query_texts=[question_text],
        n_results=5,
    )
    relevant_contexts.append(relevant_chunks)

Get the set of chunks for one of the questions.

In [None]:
sample_chunks = relevant_contexts[0]
for i, chunk in enumerate(sample_chunks['documents'][0]):
    print("=========")
    print("Paragraph index : ", sample_chunks['ids'][0][i])
    print("Paragraph : ", chunk)
    print("Distance : ", sample_chunks['distances'][0][i])
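The query result is a dictionary of nested lists, with one inner list per query, as the indexing in the cell above shows. A small helper can flatten it into one record per passage. This is a sketch: `flatten_query_result` is a hypothetical name, and `fake_result` is hand-written data that only mimics the `ids` / `documents` / `distances` keys used above.

```python
def flatten_query_result(result, query_index=0):
    """Turn one query's nested result lists into a list of per-passage dicts."""
    return [
        {"id": doc_id, "text": text, "distance": distance}
        for doc_id, text, distance in zip(
            result["ids"][query_index],
            result["documents"][query_index],
            result["distances"][query_index],
        )
    ]

# Hand-written stand-in with the same nesting as the result printed above.
fake_result = {
    "ids": [["12", "7"]],
    "documents": [["First passage.", "Second passage."]],
    "distances": [[0.25, 0.75]],
}

rows = flatten_query_result(fake_result)
for row in rows:
    print(row["id"], row["distance"], row["text"])
```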

### Feed the context and the questions to the `watsonx.ai` model.

Define instructions for the model.

**Note:** **Start with the [watsonx.ai Prompt Lab](https://dataplatform.cloud.ibm.com/wx/home?context=wx)** to find the prompt that gives the best results on a small subset of the training records (in the `train_data` variable). Do not run inference on all of `train_data`, as it will take a long time to get the results. To get a sample from `train_data`, you can use, for example, `train_data.head(n=10)` for the first 10 records or `train_data.sample(n=10)` for 10 random records. Only once you have identified the best-performing prompt, update this notebook to use it and compute the metrics on the test data.

**Action:** Edit the cell below and add your own prompt. In the prompt below, the first line is the instruction, followed by the retrieved context and the question. If you want to change the instruction or add your own examples, change the prompt accordingly.

In [None]:
def make_prompt(context, question_text):
    return (f"Please answer the following.\n"
          + f"{context}:\n\n"
          + f"{question_text}")

prompt_texts = []

for relevant_context, question_text in zip(relevant_contexts, question_texts):
    context = "\n\n\n".join(relevant_context["documents"][0])
    prompt_text = make_prompt(context, question_text)
    prompt_texts.append(prompt_text)

Inspect the prompt for a sample question.

In [None]:
print(prompt_texts[0])
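If you decide to add worked examples to the prompt, one possible layout is sketched below. The function name `make_few_shot_prompt`, the `Question:` / `Answer:` labels, and the sample texts are all assumptions made for illustration; which layout works best for a given model is something to verify in Prompt Lab.

```python
def make_few_shot_prompt(instruction, examples, context, question_text):
    """Build a prompt from an instruction, worked examples, and the real query.

    `examples` is a list of (context, question, answer) tuples.
    """
    parts = [instruction]
    for ex_context, ex_question, ex_answer in examples:
        parts.append(f"{ex_context}\n\nQuestion: {ex_question}\nAnswer: {ex_answer}")
    # The real query ends with an open "Answer:" for the model to complete.
    parts.append(f"{context}\n\nQuestion: {question_text}\nAnswer:")
    return "\n\n".join(parts)

demo_prompt = make_few_shot_prompt(
    instruction="Answer the question using only the passages provided.",
    examples=[
        ("Paris is the capital of France.", "What is the capital of France?", "Paris"),
    ],
    context="The Nile is a river in Africa.",
    question_text="On which continent is the Nile?",
)
print(demo_prompt)
```

Keep in mind that every example lengthens the prompt, so the retrieved context plus the examples must still fit within the model's input limit.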

### Defining the model parameters
We need to provide a set of model parameters that will influence the result:

In [None]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

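parameters = {
    GenParams.DECODING_METHOD: "greedy",  # always pick the most likely next token
    GenParams.MIN_NEW_TOKENS: 1,          # generate at least one token
    GenParams.MAX_NEW_TOKENS: 50          # cap the length of the answer
}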

Initialize the `Model` class.

In [None]:
from ibm_watson_machine_learning.foundation_models import Model

model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id)

### Generate a retrieval-augmented response

**Note:** Execution of this cell could take several minutes.

In [None]:
results = []

for prompt_text in prompt_texts:
    results.append(model.generate_text(prompt=prompt_text))
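Because the loop above can run for several minutes, a single failed call would otherwise discard all progress. The sketch below shows one way to keep going and record which prompts failed. `generate_all` and `fake_generate` are hypothetical helpers; in the notebook, `generate_fn` would be something like `model.generate_text`, called with the prompt as its argument.

```python
def generate_all(generate_fn, prompts):
    """Call generate_fn on each prompt; record None where a call raises."""
    outputs, failed = [], []
    for idx, prompt in enumerate(prompts):
        try:
            outputs.append(generate_fn(prompt))
        except Exception as err:  # keep going so earlier results are not lost
            print(f"prompt {idx} failed: {err}")
            outputs.append(None)
            failed.append(idx)
    return outputs, failed

def fake_generate(prompt):
    """Hypothetical stand-in for a remote model call."""
    if "fail" in prompt:
        raise RuntimeError("simulated service error")
    return prompt.upper()

outputs, failed = generate_all(fake_generate, ["hello", "please fail", "bye"])
print(outputs, failed)
```

Any `None` entries need to be retried or dropped before the results are scored.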

In [None]:
for idx, result in enumerate(results):
    print("Question = ", test_data.iloc[idx]['question'])
    print("Answer = ", result)
    print("Expected Answer(s) (may not appear with exact wording in the dataset) = ", test_data.iloc[idx]['answers'])
    print("\n")

<a id="score"></a>
## Calculate rougeL metric

This sample notebook uses the `rouge_score` module to calculate rougeL.

#### Rouge Metric

**Note:** The Rouge (Recall-Oriented Understudy for Gisting Evaluation) metric is a set of evaluation measures used in natural language processing (NLP) and specifically in text summarization and machine translation tasks. The Rouge metrics are designed to assess the quality of generated summaries or translations by comparing them to one or more reference texts.

The main idea behind Rouge is to measure the overlap between the generated summary (or translation) and the reference text(s) in terms of n-grams or longest common subsequences. By calculating recall, precision, and F1 scores based on these overlapping units, Rouge provides a quantitative assessment of the summary's content overlap with the reference(s).

Rouge-1 focuses on individual word overlap, Rouge-2 considers pairs of consecutive words, and Rouge-L takes into account the ordering of words and phrases. These metrics provide different perspectives on the similarity between two texts and can be used to evaluate different aspects of summarization or text generation models.
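To illustrate how Rouge-L uses the longest common subsequence (LCS), here is a simplified pure-Python sketch. It assumes lowercase whitespace tokenization and a single reference, so its scores can differ from those of the `rouge_score` package, which applies its own tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(reference, prediction):
    """Simplified Rouge-L F1 using lowercase whitespace tokens."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction that matches
    recall = lcs / len(ref_tokens)      # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)

# LCS is "the cat on the mat" (5 tokens); both texts have 6 tokens.
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Here precision and recall are both 5/6, so the F1 score is also 5/6.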

In [None]:
from rouge_score import rouge_scorer
from collections import defaultdict
import numpy as np

def get_rouge_score(predictions, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
    aggregate_score = defaultdict(list)

    for result, ref in zip(predictions, references):
        # RougeScorer.score expects (target, prediction), i.e. the reference first
        for key, val in scorer.score(ref, result).items():
            aggregate_score[key].append(val.fmeasure)

    scores = {}
    for key in aggregate_score:
        scores[key] = np.mean(aggregate_score[key])
    
    return scores

In [None]:
print(get_rouge_score(results, test_data.answers))

---

Copyright © 2023 IBM. This notebook and its source code are released under the terms of the MIT License.