## RAG NOTEBOOK:
This notebook contains the steps and code to demonstrate support for Retrieval Augmented Generation in watsonx.ai. It introduces commands for data retrieval, knowledge base building and querying, and model testing.

Some familiarity with Python is helpful.

### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from knowledge base (for every user query)
- Generate a response by feeding retrieved passage into a large language model (for every user query)
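
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration only: it scores passages by word overlap instead of embeddings, and `generate` is a stand-in that merely assembles a prompt rather than calling a model. All function names and the sample passages are made up for this sketch and are not part of watsonx.ai or LangChain.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(passages):
    """Step 1: index each passage once as a bag of word counts."""
    return [(passage, Counter(tokenize(passage))) for passage in passages]

def retrieve(index, query, k=1):
    """Step 2: return the k passages sharing the most words with the query."""
    query_counts = Counter(tokenize(query))
    scored = [
        (sum((counts & query_counts).values()), passage)
        for passage, counts in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]

def generate(query, passages):
    """Step 3: stand-in for the LLM call; it only assembles the prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical two-passage knowledge base
toy_passages = [
    "Revenue grew compared with the prior year.",
    "The company opened a new research lab.",
]
toy_index = build_index(toy_passages)
toy_query = "How did revenue change?"
toy_prompt = generate(toy_query, retrieve(toy_index, toy_query))
print(toy_prompt)
```

The rest of this notebook follows the same shape, with a vector database doing the indexing and retrieval and a foundation model doing the generation.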

## Contents

This notebook contains the following parts:

- [Setup](#setup)
- [Document data loading](#data)
- [Build up knowledge base](#build_base)
- [Foundation Models on watsonx](#models)
- [Generate a retrieval-augmented response to a question](#predict)
- [Summary and next steps](#summary)


<a id="setup"></a>
## Set up the environment

In [1]:
!pip3 install pypdf
!pip3 install langchain-ibm
!pip3 install langchain
!pip3 install chromadb

Collecting pypdf
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.3.0-py3-none-any.whl (300 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.0
Collecting langchain-ibm
  Downloading langchain_ibm-0.3.6-py3-none-any.whl.metadata (5.2 kB)
Collecting ibm-watsonx-ai<2.0.0,>=1.1.16 (from langchain-ibm)
  Downloading ibm_watsonx_ai-1.2.8-py3-none-any.whl.metadata (6.5 kB)
Collecting pandas<2.2.0,>=0.24.2 (from ibm-watsonx-ai<2.0.0,>=1.1.16->langchain-ibm)
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collectin

Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.14.2-py2.py3-none-any.whl.metadata (2.9 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.30.0-py3

<a id="data"></a>
## Document data loading

In [2]:
# Import PdfReader from pypdf to read PDF files
from pypdf import PdfReader

# Load the PDF file
reader = PdfReader("IBM_Annual_Report_2023.pdf")

# Extract text from each page and strip any leading/trailing spaces
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out empty strings to remove blank pages or pages with no extractable text
pdf_texts = [text for text in pdf_texts if text]

# Print the text of the 9th page (index 8)
print(pdf_texts[8])



MANAGEMENT DISCUSSION SNAPSHOT
($ and shares in millions except per share amounts)
For year ended December 31: 2023 2022 (1)
Yr.-to-Yr. 
Percent/Margin 
Change
Revenue (2) $ 61,860 $ 60,530  2.2 % 
Gross profit margin  55.4 %  54.0 %  1.4 pts. 
Total expense and other (income) $ 25,610 $ 31,531  (18.8) %    
Income from continuing operations before income taxes $ 8,690 $ 1,156  NM 
Provision for/(benefit from) income taxes from continuing operations $ 1,176 $ (626)  NM 
Income from continuing operations $ 7,514 $ 1,783  NM  
Income from continuing operations margin  12.1 %  2.9 %  9.2 pts. 
Loss from discontinued operations, net of tax $ (12) $ (143)  (91.8) %    
Net income $ 7,502 $ 1,639  NM 
Earnings per share from continuing operations–assuming dilution $ 8.15 $ 1.95  NM 
Consolidated earnings per share–assuming dilution $ 8.14 $ 1.80  NM 
Weighted-average shares outstanding–assuming dilution  922.1  912.3  1.1 % 
Assets (3)
$ 135,241 $ 127,243  6.3 %    
Liabilities (3)
$ 112,628

<a id="build_base"></a>
## Build up knowledge base

In [4]:
# Import RecursiveCharacterTextSplitter for text chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with custom separators and chunk size
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # Define separators for splitting text (paragraphs, new lines, sentences, words, and characters)
    chunk_size=1000,  # Set the maximum size of each chunk
    chunk_overlap=0    # Set the overlap between chunks (0 means no overlap)
)

# Join all extracted PDF text with double newlines and split it into smaller chunks
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# Print the 11th chunk (index 10) of the split text
print(character_split_texts[10])

# Print the total number of chunks created after splitting
print(f"\nTotal chunks: {len(character_split_texts)}")


of several critical technologies, including AI, quantum 
computing, and semiconductors. 
In AI, we demonstrated our ability to quickly transform 
research into commercial applications. We launched the 
watsonx AI and data platform, introduced the groundbreaking 
Granite AI foundational model, and developed new AI-
optimized hardware. 
We have IBM Quantum System One engagements with several 
leading organizations, including Cleveland Clinic, the Platform 
for Digital and Quantum Innovation of Quebec, Rensselaer 
Polytechnic Institute, and the University of Tokyo. We also 
IBM 2023 Annual Report 3

Total chunks: 557
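
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class tries the separators in order before falling back to smaller units). The helper name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
# No overlap: windows do not share characters
print(sliding_window_chunks(sample, chunk_size=4))
# Overlap of 2: each window repeats the last 2 characters of the previous one
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as used above, a sentence that straddles a chunk boundary is split between two chunks. A small overlap may help retrieval in such cases, at the cost of storing more chunks.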


In [5]:
# Import the ChromaDB library to work with vector databases
import chromadb

# Import the SentenceTransformerEmbeddingFunction utility from ChromaDB
# This function helps generate embeddings for text data
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Initialize an instance of the SentenceTransformer embedding function
# This function will be used to convert text into numerical vector representations (embeddings)
embedding_function = SentenceTransformerEmbeddingFunction()

# Generate the embedding for the 11th chunk (index 10) in the 'character_split_texts' list
# and print the resulting embedding vector
print(embedding_function([character_split_texts[10]]))




[array([-7.89173841e-02, -4.01439629e-02, -5.06751938e-03, -4.50713374e-02,
       -6.05559535e-02, -5.35960644e-02, -6.19608127e-02,  1.74064487e-02,
       -2.08196361e-02,  3.05105150e-02, -7.06966519e-02, -3.19342464e-02,
        3.21453884e-02,  3.13325189e-02,  6.07725792e-03,  6.91315532e-02,
        6.26400709e-02, -7.42840245e-02,  4.58349800e-03, -9.77578238e-02,
       -6.77729433e-04, -1.10707921e-03,  1.70304812e-02, -4.96981069e-02,
        9.25210770e-03,  1.00498490e-01,  1.35069089e-02, -7.54344612e-02,
        1.82290394e-02, -2.45860331e-02,  1.96209699e-02,  1.38095417e-03,
        4.37021954e-03, -1.90817192e-02, -1.90199930e-02,  1.43828979e-02,
        3.03840451e-02, -9.09089521e-02,  3.65068689e-02, -5.74921370e-02,
       -2.15360206e-02, -2.52513178e-02, -7.55397901e-02,  5.72257005e-02,
        7.91858137e-02,  8.05481523e-02,  6.50975015e-03, -2.11736914e-02,
        4.14696373e-02, -7.92840049e-02, -7.32430890e-02, -7.28528053e-02,
        5.41955084e-02, 
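
Embedding vectors like the one printed above are useful because texts with similar meaning tend to map to nearby vectors. The sketch below ranks a few made-up 2-dimensional vectors by cosine similarity, which is one common closeness measure. It is an illustration only: the helper names are hypothetical, and it does not attempt to reproduce the distance metric that Chroma applies to the collection in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def top_k(query_vector, vectors, k=2):
    """Return the indices of the k vectors most similar to query_vector."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query_vector, vectors[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up 2-dimensional "embeddings"; real ones have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], toy_vectors))
```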

In [6]:
# Initialize a ChromaDB client instance to interact with the database
chroma_client = chromadb.Client()

# Create a new collection in ChromaDB named "IBM_Annual_report_2023"
# The collection will store embedded documents, using the specified embedding function
chroma_collection = chroma_client.create_collection("IBM_Annual_report_2023", embedding_function=embedding_function)

# Generate unique string IDs for each document by converting their indices to strings
ids = [str(i) for i in range(len(character_split_texts))]

# Add the documents to the ChromaDB collection along with their corresponding IDs
chroma_collection.add(ids=ids, documents=character_split_texts)

# Count the number of documents stored in the collection and return the count
chroma_collection.count()


557

In [7]:
# Define the query we want to search for
query = "What was the total revenue?"

# Perform a similarity search using ChromaDB, retrieving the top 5 most relevant documents
results = chroma_collection.query(query_texts=[query], n_results=5)

# Extract the list of retrieved documents from the query results
retrieved_documents = results['documents'][0]

# Iterate through each retrieved document
for document in retrieved_documents:
    # Print the document content
    print(document)
    # Print a newline for better readability between documents
    print('\n')


Revenue Recognized for Performance Obligations Satisfied (or Partially Satisfied) in Prior Periods
For the year ended December  31, 2023, revenue was reduced by $16 million for performance obligations satisfied or partially 
satisfied in previous periods mainly due to changes in estimates on contracts with cost-to-cost measures of progress. Refer to note 
A, “Significant Accounting Policies,” for additional information on these contracts and estimates of costs to complete.
Reconciliation of Contract Balances
The following table provides information about notes and accounts receivable—trade, contract assets and deferred income 
balances.
($ in millions)
At December 31: 2023 2022
Notes and accounts receivable — trade (net of allowances of $192 in 2023 and $233 in 2022) $ 7,214 $ 6,541 
Contract assets (1)  505  464 
Deferred income (current)  13,451  12,032 
Deferred income (noncurrent)  3,533  3,499


Total revenue $ 61,860 $ 60,530  2.2 %  2.9 %
Total gross profit $ 34,300 $ 32,687  4.

In [8]:
def retreiver(query):
  """Return the text of the 5 chunks in chroma_collection most similar to the query."""
  # Perform a similarity search using ChromaDB, retrieving the top 5 most relevant documents
  results = chroma_collection.query(query_texts=[query], n_results=5)

  # Extract the list of retrieved documents (index 0 because only one query was sent)
  retrieved_documents = results['documents'][0]
  return retrieved_documents

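<a id="models"></a>
## Foundation Models on watsonx

In [9]:
# Fill in your IBM Cloud API key and watsonx.ai project ID before running the next cells
ibm_cloud_api_key=""
project_id=""
# Regional watsonx.ai endpoint (us-south here)
watson_url="https://us-south.ml.cloud.ibm.com"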
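
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One alternative, sketched here, is to read them from environment variables. The helper `read_setting` and the variable name `WATSONX_URL` are assumptions made for this illustration; use whatever names fit your environment.

```python
import os

def read_setting(name, default=None):
    """Return environment variable `name`, or `default`; fail loudly if neither is set."""
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value

# Hypothetical variable name; falls back to the us-south endpoint used above
print(read_setting("WATSONX_URL", default="https://us-south.ml.cloud.ibm.com"))
```

The API key and project ID could be read the same way, without a default, so that a missing value stops the notebook early instead of failing later inside the model call.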

In [10]:
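from langchain_ibm import WatsonxLLM

# Configure the watsonx.ai foundation model used to generate answers
llm = WatsonxLLM(
    model_id='mistralai/mixtral-8x7b-instruct-v01',
    apikey=ibm_cloud_api_key,
    project_id=project_id,
    params={
        "decoding_method": "greedy",  # Always pick the most likely next token
        "max_new_tokens": 200,        # Upper bound on the length of the answer
        "min_new_tokens": 1,          # Require at least one generated token
        "repetition_penalty": 1,      # 1 applies no repetition penalty
    },
    url=watson_url,
)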

<a id="predict"></a>
## Generate a retrieval-augmented response to a question

In [11]:
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
        """
        |System|
        You are a financial research analyst. You analyse the IBM 2023 Annual Report to answer questions about the company.

        |Instruction|
        Refer to the context from the annual report and answer the following question. Answer it in a concise manner. Do not add any additional information.

        |Question|
        {question}

        |Context|
        {context}

        |Answer|
        """
        )




In [12]:
query1="What was the total revenue?"
documents1=retreiver(query1)

In [13]:
prompt1=prompt_template.format(question=query1, context=documents1)
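
Note that `documents1` is a Python list, so formatting it into the template inserts its list representation (brackets, quotes, and escaped newlines) into the prompt. That works, but joining the passages into one string first may give the model a cleaner context. The sketch below shows one way to do that. The helper `format_context` and its separator are assumptions for illustration, not part of LangChain.

```python
def format_context(passages, separator="\n\n---\n\n"):
    """Join retrieved passages into one string, labeling each with its rank."""
    return separator.join(
        f"[Passage {rank}]\n{passage.strip()}"
        for rank, passage in enumerate(passages, start=1)
    )

# Short excerpts in the style of the retrieved chunks
example_passages = ["Total revenue $ 61,860 $ 60,530", "Gross profit margin 55.4 %"]
print(format_context(example_passages))
```

In this notebook, the result could be passed as `context=format_context(documents1)` in place of the raw list.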

In [14]:
print(llm.invoke(prompt1))

61,860 million dollars


In [18]:
query2="What are IBM’s top-performing geographic regions in terms of revenue?"
documents2=retreiver(query2)

In [19]:
prompt2=prompt_template.format(question=query2, context=documents2)

In [20]:
print(llm.invoke(prompt2))


        The United States is IBM's top-performing geographic region in terms of revenue.


In [21]:
query3="How is IBM reducing its carbon footprint?"
documents3=retreiver(query3)

In [22]:
prompt3=prompt_template.format(question=query3, context=documents3)

In [23]:
print(llm.invoke(prompt3))


        IBM is reducing its carbon footprint by achieving a 63% reduction in greenhouse gas emissions against the base year 2010.
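
The three questions above repeat the same retrieve, format, and invoke cells. As a sketch, the steps can be wrapped in a single helper. The helper `answer` and the three stand-in functions are made up for this illustration, so the block runs without watsonx.ai credentials.

```python
def answer(question, retrieve, render_prompt, generate):
    """Run one retrieve -> prompt -> generate round and return the model's text."""
    passages = retrieve(question)
    prompt = render_prompt(question=question, context=passages)
    return generate(prompt).strip()

# Stand-ins for the retriever, prompt template, and model
def fake_retrieve(question):
    return ["Total revenue $ 61,860 $ 60,530"]

def fake_render(question, context):
    return f"Question: {question}\nContext: {context}"

def fake_generate(prompt):
    # Echo the first prompt line, padded with whitespace like raw model output
    return "  echo: " + prompt.splitlines()[0] + "\n"

print(answer("What was the total revenue?", fake_retrieve, fake_render, fake_generate))
```

With the objects defined in this notebook, the equivalent call would be `answer(query1, retreiver, prompt_template.format, llm.invoke)`.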
