## RAG Notebook
This notebook contains the steps and code to demonstrate support for Retrieval Augmented Generation in watsonx.ai. It introduces commands for data retrieval, knowledge base building and querying, and model testing.

Some familiarity with Python is helpful.

### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from knowledge base (for every user query)
- Generate a response by feeding the retrieved passage(s) into a large language model (for every user query)
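
The three steps above can be sketched without any external library. The following is a minimal illustration, not the method used later in this notebook: it swaps the embedding model for plain word counts compared by cosine similarity, and it stops at building the prompt rather than calling a model. The passages and the helper names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical passages, for illustration only
passages = [
    "The cafeteria opens at eight in the morning.",
    "Annual leave requests must be approved by a manager.",
    "The library closes at nine in the evening.",
]

# Step 1: index the passages once (here: one word-count vector per passage)
index = [(p, Counter(tokenize(p))) for p in passages]


def retrieve(query, k=1):
    """Step 2: return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]


def build_prompt(query, k=1):
    """Step 3 (input side): combine the retrieved passages with the question."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("When does the library close?"))
```

In a real pipeline the word-count vectors would be replaced by embeddings and the prompt would be sent to a large language model, but the division into an indexing step done once and two per-query steps stays the same.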

## Contents

This notebook contains the following parts:

- [Setup](#setup)
- [Document data loading](#data)
- [Build up knowledge base](#build_base)
- [Foundation Models on watsonx](#models)
- [Generate a retrieval-augmented response to a question](#predict)
- [Summary and next steps](#summary)


<a id="setup"></a>
## Set up the environment

In [None]:
!pip3 install pypdf
!pip3 install langchain
!pip3 install chromadb
!pip3 install langchain_community
!pip3 install sentence-transformers


<a id="data"></a>
## Document data loading

In [125]:
# Import PdfReader from pypdf to read PDF files
from pypdf import PdfReader

# Load the PDF file
reader = PdfReader("IBM_Annual_Report_2023.pdf")

# Extract text from each page and strip any leading/trailing spaces
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out empty strings to remove blank pages or pages with no extractable text
pdf_texts = [text for text in pdf_texts if text]

# Print the text at index 8 (the 9th page, provided no earlier page was dropped as blank)
print(pdf_texts[8])



MANAGEMENT DISCUSSION SNAPSHOT
($ and shares in millions except per share amounts)
For year ended December 31: 2023 2022 (1)
Yr.-to-Yr. 
Percent/Margin 
Change
Revenue (2) $ 61,860 $ 60,530  2.2 % 
Gross profit margin  55.4 %  54.0 %  1.4 pts. 
Total expense and other (income) $ 25,610 $ 31,531  (18.8) %    
Income from continuing operations before income taxes $ 8,690 $ 1,156  NM 
Provision for/(benefit from) income taxes from continuing operations $ 1,176 $ (626)  NM 
Income from continuing operations $ 7,514 $ 1,783  NM  
Income from continuing operations margin  12.1 %  2.9 %  9.2 pts. 
Loss from discontinued operations, net of tax $ (12) $ (143)  (91.8) %    
Net income $ 7,502 $ 1,639  NM 
Earnings per share from continuing operations–assuming dilution $ 8.15 $ 1.95  NM 
Consolidated earnings per share–assuming dilution $ 8.14 $ 1.80  NM 
Weighted-average shares outstanding–assuming dilution  922.1  912.3  1.1 % 
Assets (3)
$ 135,241 $ 127,243  6.3 %    
Liabilities (3)
$ 112,628

In [127]:
# Import RecursiveCharacterTextSplitter for text chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with custom separators and chunk size
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # Define separators for splitting text (paragraphs, new lines, sentences, words, and characters)
    chunk_size=1000,  # Set the maximum size of each chunk
    chunk_overlap=0    # Set the overlap between chunks (0 means no overlap)
)

# Join all extracted PDF text with double newlines and split it into smaller chunks
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# Print the 11th chunk (index 10) of the split text
print(character_split_texts[10])

# Print the total number of chunks created after splitting
print(f"\nTotal chunks: {len(character_split_texts)}")


of several critical technologies, including AI, quantum 
computing, and semiconductors. 
In AI, we demonstrated our ability to quickly transform 
research into commercial applications. We launched the 
watsonx AI and data platform, introduced the groundbreaking 
Granite AI foundational model, and developed new AI-
optimized hardware. 
We have IBM Quantum System One engagements with several 
leading organizations, including Cleveland Clinic, the Platform 
for Digital and Quantum Innovation of Quebec, Rensselaer 
Polytechnic Institute, and the University of Tokyo. We also 
IBM 2023 Annual Report 3

Total chunks: 557
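
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at the listed separators); the function `chunk_text` is a hypothetical helper meant only to show how the two parameters interact.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With no overlap, a sentence that straddles a boundary is cut in two and each half lands in a different chunk. A non-zero overlap repeats the boundary text in both chunks at the cost of storing more data.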


<a id="build_base"></a>
## Build up knowledge base

In [128]:
# Import the ChromaDB library to work with vector databases
import chromadb

# Import the SentenceTransformerEmbeddingFunction utility from ChromaDB
# This function helps generate embeddings for text data
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Initialize an instance of the SentenceTransformer embedding function
# This function will be used to convert text into numerical vector representations (embeddings)
embedding_function = SentenceTransformerEmbeddingFunction()

# Generate the embedding for the 11th chunk (index 10) in the 'character_split_texts' list
# and print the resulting embedding vector
print(embedding_function([character_split_texts[10]]))


[array([-7.89173990e-02, -4.01439816e-02, -5.06751938e-03, -4.50713672e-02,
       -6.05559945e-02, -5.35961054e-02, -6.19608238e-02,  1.74064599e-02,
       -2.08196677e-02,  3.05104759e-02, -7.06966594e-02, -3.19342948e-02,
        3.21453847e-02,  3.13325301e-02,  6.07728492e-03,  6.91314936e-02,
        6.26401007e-02, -7.42840096e-02,  4.58352687e-03, -9.77577865e-02,
       -6.77745440e-04, -1.10710936e-03,  1.70304421e-02, -4.96980734e-02,
        9.25213471e-03,  1.00498505e-01,  1.35068800e-02, -7.54344612e-02,
        1.82290673e-02, -2.45860573e-02,  1.96209382e-02,  1.38088688e-03,
        4.37018368e-03, -1.90817062e-02, -1.90199669e-02,  1.43828755e-02,
        3.03840302e-02, -9.09088776e-02,  3.65068540e-02, -5.74920848e-02,
       -2.15360653e-02, -2.52513383e-02, -7.55398050e-02,  5.72257340e-02,
        7.91857690e-02,  8.05481970e-02,  6.50972314e-03, -2.11736653e-02,
        4.14696485e-02, -7.92839900e-02, -7.32430816e-02, -7.28527457e-02,
        5.41954562e-02, 

In [129]:
# Initialize a ChromaDB client instance to interact with the database
chroma_client = chromadb.Client()

# Get the collection named "IBM_Annual_report_2023", creating it if it does not exist yet
# (create_collection raises an error if the collection already exists, e.g. when this cell is re-run)
# The collection will store embedded documents, using the specified embedding function
chroma_collection = chroma_client.get_or_create_collection("IBM_Annual_report_2023", embedding_function=embedding_function)

# Generate unique string IDs for each document by converting their indices to strings
ids = [str(i) for i in range(len(character_split_texts))]

# Insert the documents with their IDs; upsert overwrites entries whose ID already exists,
# so re-running this cell does not fail or create duplicates
chroma_collection.upsert(ids=ids, documents=character_split_texts)

# Count the number of documents stored in the collection and return the count
chroma_collection.count()

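
The indexing cell assigns positional IDs ("0", "1", ...). One possible alternative, sketched here under the assumption that identical chunks should be stored only once, is to derive each ID from the chunk's content so that re-indexing the same text always yields the same IDs. The helpers `chunk_id` and `dedupe_chunks` and the sample chunks are hypothetical and are not part of ChromaDB.

```python
import hashlib


def chunk_id(text):
    """Derive a stable ID from chunk content (first 16 hex characters of SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def dedupe_chunks(chunks):
    """Return (ids, documents) with exact-duplicate chunks removed, keeping first-seen order."""
    seen = {}
    for text in chunks:
        # setdefault keeps the first occurrence and ignores later duplicates
        seen.setdefault(chunk_id(text), text)
    return list(seen.keys()), list(seen.values())


# Hypothetical chunks, for illustration only
chunks = ["Revenue grew.", "Margins improved.", "Revenue grew."]
ids, documents = dedupe_chunks(chunks)
print(len(ids), documents)
```

The resulting `ids` and `documents` lists have the same length and could be passed to the collection in place of the positional IDs.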

In [None]:
# Define the query we want to search for
query = "What was the total revenue?"

# Perform a similarity search using ChromaDB, retrieving the top 5 most relevant documents
results = chroma_collection.query(query_texts=[query], n_results=5)

# Extract the list of retrieved documents from the query results
retrieved_documents = results['documents'][0]

# Iterate through each retrieved document
for document in retrieved_documents:
    # Print the document content
    print(document)
    # Print a newline for better readability between documents
    print('\n')


Revenue Recognized for Performance Obligations Satisfied (or Partially Satisfied) in Prior Periods
For the year ended December  31, 2023, revenue was reduced by $16 million for performance obligations satisfied or partially 
satisfied in previous periods mainly due to changes in estimates on contracts with cost-to-cost measures of progress. Refer to note 
A, “Significant Accounting Policies,” for additional information on these contracts and estimates of costs to complete.
Reconciliation of Contract Balances
The following table provides information about notes and accounts receivable—trade, contract assets and deferred income 
balances.
($ in millions)
At December 31: 2023 2022
Notes and accounts receivable — trade (net of allowances of $192 in 2023 and $233 in 2022) $ 7,214 $ 6,541 
Contract assets (1)  505  464 
Deferred income (current)  13,451  12,032 
Deferred income (noncurrent)  3,533  3,499


Total revenue $ 61,860 $ 60,530  2.2 %  2.9 %
Total gross profit $ 34,300 $ 32,687  4.
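
The `[0]` in `results['documents'][0]` is needed because the query result is nested: each key holds one inner list per query text. The sketch below mimics that shape with a hand-written dictionary (the IDs, texts, and distances are made up) and pairs each document with its ID and distance. `hits_for_query` is a hypothetical helper.

```python
# Hypothetical result in the nested shape returned for a single query text:
# one outer list entry per query, one inner entry per retrieved document.
results = {
    "ids": [["12", "87"]],
    "documents": [["Total revenue rose year over year.", "Deferred income increased."]],
    "distances": [[0.41, 0.58]],
}


def hits_for_query(results, query_index=0):
    """Pair each retrieved document with its ID and distance for one query."""
    return [
        {"id": doc_id, "document": doc, "distance": dist}
        for doc_id, doc, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
        )
    ]


for hit in hits_for_query(results):
    print(f'{hit["distance"]:.2f}  {hit["id"]}  {hit["document"]}')
```

Printing the distance next to each passage makes it easier to judge whether the top-ranked hits are actually close to the query or merely the least distant of a poor set.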

In [None]:
def retreiver(query):
    """Return the 5 chunks from the collection that are most similar to the query."""
    # Perform a similarity search using ChromaDB, retrieving the top 5 most relevant documents
    results = chroma_collection.query(query_texts=[query], n_results=5)

    # Extract the list of retrieved documents for the first (and only) query text
    retrieved_documents = results['documents'][0]
    return retrieved_documents
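
The retriever returns a Python list. If that list is placed into a prompt directly, its printed form (brackets, quotes, escaped `\n`) ends up in the prompt text, as the echoed prompts later in this notebook show. A possible way to avoid this, sketched here with a hypothetical helper `format_context` and an assumed character budget, is to join the passages into a single string first.

```python
def format_context(documents, max_chars=4000, separator="\n\n---\n\n"):
    """Join passages into one numbered string, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for number, doc in enumerate(documents, start=1):
        block = f"[{number}] {doc.strip()}"
        # Account for the separator that precedes every block except the first
        extra = len(block) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(block)
        used += extra
    return separator.join(parts)


# Hypothetical passages, for illustration only
print(format_context(["First passage. ", "Second passage."]))
```

The value of `max_chars` is an arbitrary placeholder; a suitable limit depends on the context window of the model being used.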

<a id="models"></a>
## Foundation Models on watsonx

In [130]:
from getpass import getpass

# Read the token without echoing it into the notebook output
huggingface_api_key = getpass("Enter Hugging Face token: ")


In [131]:
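# Import the Hugging Face Hub LLM wrapper from the langchain_community package installed earlier
from langchain_community.llms import HuggingFaceHub

# Model repository on the Hugging Face Hub; replace with your desired model
repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Initialize the LLM
llm = HuggingFaceHub(
    repo_id=repo_id,
    huggingfacehub_api_token=huggingface_api_key
)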




In [132]:
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
        """
        |System|
        You are a financial research analyst. You have to analyse the Request For Proposal (RFP) documents for the bidding process.

        |Instruction|
        Refer to the context from the RFP document and answer the following question. Answer it in a concise manner. Do not add any additional information.

        |Question|
        {question}

        |Context|
        {context}

        |Answer|
        """
        )




<a id="predict"></a>
## Generate a retrieval-augmented response to a question

In [133]:
query1 = "What was the total revenue?"
documents1 = retreiver(query1)

In [134]:
prompt1 = prompt_template.format(question=query1, context=documents1)

In [135]:
print(llm.invoke(prompt1))




        |System|
        You are a financial research analyst. You have to analyse the Request For Proposal(RFP) documents for the bidding process.

        |Instruction|
        Refer to the context from RFP document and answer the following question.Answer it in a concise manner. Do not add any additonal information.

        |Question|
        What was the total revenue?

        |Context|
        ['Revenue Recognized for Performance Obligations Satisfied (or Partially Satisfied) in Prior Periods\nFor the year ended December\xa0 31, 2023, revenue was reduced by $16 million for performance obligations satisfied or partially \nsatisfied in previous periods mainly due to changes in estimates on contracts with cost-to-cost measures of progress. Refer to note \nA, “Significant Accounting Policies,” for additional information on these contracts and estimates of costs to complete.\nReconciliation of Contract Balances\nThe following table provides information about notes and accounts recei
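
The printed output above begins by repeating the prompt, which suggests the model call returned the prompt together with its completion. One way to isolate the answer, assuming the completion follows the final `|Answer|` marker, is to keep only the text after that marker. `extract_answer` is a hypothetical helper, and the sample string is made up.

```python
ANSWER_MARKER = "|Answer|"


def extract_answer(generated_text, marker=ANSWER_MARKER):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Hypothetical model output that echoes the prompt before answering
sample_output = "|Question|\nWhat is the colour?\n\n|Answer|\n  Blue.  "
print(extract_answer(sample_output))  # Blue.
```

This approach is fragile if a retrieved passage or the completion itself contains the marker text, so a more distinctive marker may be preferable in practice.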

In [111]:
query2 = "What are IBM’s top-performing geographic regions in terms of revenue?"
documents2 = retreiver(query2)

In [112]:
prompt2 = prompt_template.format(question=query2, context=documents2)

In [113]:
print(llm.invoke(prompt2))




        |System|
        You are a financial research analyst. You have to analyse the Request For Proposal(RFP) documents for the bidding process.

        |Instruction|
        Refer to the context from RFP document and answer the following question.Answer it in a concise manner. Do not add any additonal information.

        |Question|
        What are IBM’s top-performing geographic regions in terms of revenue?

        |Context|
        ['Reconciliations of IBM as Reported\n($ in millions)\nAt December 31: 2023 2022\nAssets\nTotal reportable segments $ 101,883 $ 98,667 \nElimination of internal transactions  (1,028)  (1,062) \nOther—divested businesses  19  100 \nUnallocated amounts\nCash and marketable securities  12,907  8,138 \nDeferred tax assets  6,468  6,078 \nPlant, other property and equipment  1,838  1,760 \nOperating right-of-use assets  2,085  1,586 \nPension assets  7,506  8,236 \n  Other (1)  3,563  3,740 \nTotal IBM consolidated assets $ 135,241 $ 127,243 \n(1) Prio

In [114]:
query3 = "How is IBM reducing its carbon footprint?"
documents3 = retreiver(query3)

In [116]:
prompt3 = prompt_template.format(question=query3, context=documents3)

In [117]:
print(llm.invoke(prompt3))




        |System|
        You are a financial research analyst. You have to analyse the Request For Proposal(RFP) documents for the bidding process.

        |Instruction|
        Refer to the context from RFP document and answer the following question.Answer it in a concise manner. Do not add any additonal information.

        |Question|
        How is IBM reducing its carbon footprint?

        |Context|
        ['The literature mentioned below on IBM is available without charge from:\nComputershare Trust Company, N.A., P.O. Box 43078, Providence, Rhode Island 02940-3078, (888) IBM-6700.\nInvestors residing outside the United States, Canada and Puerto Rico should call (781) 575-2727.\nThe company’s annual report on Form 10-K and the quarterly reports on Form 10-Q provide additional information on IBM’s \nbusiness. The 10-K report is released by the end of February; 10-Q reports are released by the end of April, July and October. \nThe IBM ESG Report reflects IBM’s belief that corpor