In this tutorial, we’ll use LangChain to walk through a step-by-step Retrieval Augmented Generation ([RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)) example in Python. For our use case, we’ll be setting up a RAG system for [IBM Think 2024](https://www.ibm.com/events/think). IBM Think 2024 is a conference or event where IBM announces new products, technologies, and partnerships.

RAG is a technique in natural language processing (NLP) that combines information retrieval and generative models to produce more accurate, relevant and contextually aware responses. 

In traditional language generation tasks, [large language models](https://www.ibm.com/topics/large-language-models) (LLMs) like OpenAI’s GPT-3.5 (Generative Pre-trained Transformer) or [IBM’s Granite Models](https://www.ibm.com/products/watsonx-ai/foundation-models) are used to construct responses based on an input prompt. However, these models may struggle to produce responses that are contextually relevant, factually accurate or up to date. RAG applications address this limitation by incorporating a retrieval step before response generation. During retrieval, [vector search](https://www.ibm.com/topics/vector-search) can be used to identify contextually pertinent information, such as relevant information or documents from a large corpus of text, typically stored in a [vector database](https://www.ibm.com/topics/vector-database). Finally, an LLM is used to generate a response based on the retrieved context.

LangChain is a powerful, open-source framework that facilitates the development of applications using LLMs for various NLP tasks. In the context of RAG, LangChain provides the building blocks, such as document loaders, text splitters, vector store integrations and retrievers, that connect retrieval-based methods to generative models.

For this tutorial, we have downloaded content from several IBM.com web pages to create a knowledge base, which we'll use to give an LLM the context it needs to answer questions about Think 2024.

The content and this Jupyter Notebook are available on [GitHub](https://github.com/Erika-Russi/think/tree/main/tutorials/langchain).

# Prerequisites

You need an [IBM Cloud account](https://cloud.ibm.com/registration?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-trial) to create a [watsonx.ai](https://www.ibm.com/products/watsonx-ai?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-product) project.

# Steps

## Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook. Jupyter Notebooks are widely used within [data science](https://www.ibm.com/topics/data-science) to combine code, text, images, and [data visualizations](https://www.ibm.com/topics/data-visualization) to formulate a well-formed analysis.

1. Log in to [watsonx.ai](https://dataplatform.cloud.ibm.com/registration/stepone) using your IBM Cloud account.

2. Create a [watsonx.ai project](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-creating-project#create-a-project).

3. Create a [Jupyter Notebook](https://www.ibm.com/docs/en/watsonx/saas?topic=editor-creating-notebooks).

This step will open a notebook environment where you can copy the code from this tutorial to build the RAG system.

## Step 2. Get an API Key for the IBM Generative AI Python SDK 

Create an IBMid and log in to https://bam.res.ibm.com/ to generate an API key. We have exported this credential as `GENAI_KEY` for this tutorial. The other credential we need to export is `GENAI_API`, which is `https://bam-api.res.ibm.com`.

## Step 3. Install and import relevant libraries

We'll need a few libraries for this tutorial. Make sure to import the ones below, and if they're not installed, you can resolve this with a quick pip install.

In [1]:
#installations
%pip install langchain
%pip install ibm-generative-ai
%pip install langchain_chroma
%pip install beautifulsoup4
%pip install lxml
%pip install sentence-transformers

In [2]:
#imports
import json
import glob

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


from langchain_community.document_loaders import (
    BSHTMLLoader, 
    TextLoader,
)

from langchain_text_splitters import RecursiveCharacterTextSplitter


from genai import Credentials, Client
from genai.extensions.langchain.chat_llm import LangChainChatInterface
from genai.schema import (
    TextGenerationParameters,
    TextGenerationReturnOptions,
)

Ensure that the `GENAI_KEY` and `GENAI_API` variables are exported correctly. The code below uses `load_dotenv()` from the `python-dotenv` package and should return `True` when the variables have been loaded from a `.env` file:

In [3]:
from dotenv import load_dotenv
load_dotenv()

True
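
A `True` return value only tells us that something was loaded from the `.env` file, not that both names we need are present. The following is a minimal sketch of an explicit check using only the standard library; the helper name `missing_credentials` is our own and is not part of any SDK.

```python
import os

# The two variable names this tutorial expects to be exported
REQUIRED_VARS = ("GENAI_KEY", "GENAI_API")


def missing_credentials(env=os.environ):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching your real environment.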

## Step 4. Indexing

We’ll index our Think 2024-specific articles to create a knowledge base in a vector store. The first step in building vector embeddings is to clean and process the raw dataset, which may involve removing noise and standardizing the text. Our source text is already fairly clean, so the only cleaning we'll do is stripping extra whitespace after loading the documents.

First, let's set up a helper function to load the URLs we have in `URLS_DICTIONARY`. `URLS_DICTIONARY` maps each file name to the URL from which we extracted the content. Let's also establish a name for our collection: `askibm_think_2024`.

In [4]:
def load_urls(json_file: str) -> dict[str, str]:
    with open(json_file, "r") as f:
        return json.load(f)
    
URLS_DICTIONARY = load_urls("./sources/sources.json")
COLLECTION_NAME = "askibm_think_2024"
URLS_DICTIONARY

{'./corpus/ibm.com_events_think_faq.html': 'https://www.ibm.com/events/think/faq',
 './corpus/events_think_agenda.html': 'https://www.ibm.com/events/think/agenda',
 './corpus/products_watsonx_ai.html': 'https://www.ibm.com/products/watsonx-ai',
 './corpus/products_watsonx_ai_foundation_models.html': 'https://www.ibm.com/products/watsonx-ai/foundation-models',
 './corpus/watsonx_pricing.html': 'https://www.ibm.com/watsonx/pricing',
 './corpus/watsonx.html': 'https://www.ibm.com/watsonx',
 './corpus/products_watsonx_data.html': 'https://www.ibm.com/products/watsonx-data',
 './corpus/products_watsonx_assistant.html': 'https://www.ibm.com/products/watsonx-assistant',
 './corpus/products_watsonx_code_assistant.html': 'https://www.ibm.com/products/watsonx-code-assistant',
 './corpus/products_watsonx_orchestrate.html': 'https://www.ibm.com/products/watsonx-orchestrate',
 './corpus/products_watsonx_governance.html': 'https://www.ibm.com/products/watsonx-governance',
 './corpus/granite_code_mod

Next, we set up a function `load_think_corpus()` that lists the files in our data sources. We'll use it to collect the paths of the files in the [`corpus` directory](https://github.com/Erika-Russi/think/tree/main/tutorials/langchain/corpus) so they're ready to load.

In [5]:
def load_think_corpus(directory, text=False):
    # Return the paths of the .txt files when text=True, otherwise the .html files
    pattern = "*.txt" if text else "*.html"
    return glob.glob(directory + "/" + pattern)

In [6]:
html_corpus_files = load_think_corpus("./corpus")
text_corpus_files = load_think_corpus("./corpus", text=True)
print(text_corpus_files)
html_corpus_files

['./corpus/ibm_concert.txt', './corpus/announcements.txt']


['./corpus/products_watsonx_ai_foundation_models.html',
 './corpus/products_watsonx_data.html',
 './corpus/products_watsonx_governance.html',
 './corpus/ibm.com_events_think_faq.html',
 './corpus/ibm_price_performance_data.html',
 './corpus/accelerating_gen_ai.html',
 './corpus/code_assistant_for_orchestrate.html',
 './corpus/red_hat_enterprise_linux_ai.html',
 './corpus/watsonx_open_source.html',
 './corpus/granite_code_models_open_source.html',
 './corpus/watsonx.html',
 './corpus/products_watsonx_code_assistant.html',
 './corpus/ibm_consulting_advantage_info.html',
 './corpus/events_think_agenda.html',
 './corpus/products_watsonx_assistant.html',
 './corpus/products_watsonx_orchestrate.html',
 './corpus/ibm_consulting_advantage_news.html',
 './corpus/democratizing.html',
 './corpus/products_watsonx_ai.html',
 './corpus/watsonx_pricing.html',
 './corpus/ibm_data_product_hub.html',
 './corpus/ibm_consulting_expands_ai.html',
 './corpus/watsonx_code_assistant_for_z.html',
 './corpus/co
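
The metadata step later in this section looks up each document's `source` path in `URLS_DICTIONARY`, so a corpus file without an entry there would raise a `KeyError`. As a hedged sketch, the check below lists such files up front. The helper `files_without_url` and the example data are hypothetical stand-ins, not part of the original notebook.

```python
def files_without_url(file_paths, url_map):
    """Return the file paths that have no entry in the URL dictionary."""
    return [path for path in file_paths if path not in url_map]


# Hypothetical stand-ins for URLS_DICTIONARY and the corpus file lists
example_map = {"./corpus/watsonx.html": "https://www.ibm.com/watsonx"}
example_files = ["./corpus/watsonx.html", "./corpus/not_in_sources.html"]

print(files_without_url(example_files, example_map))
```

In the notebook, the equivalent call would be `files_without_url(html_corpus_files + text_corpus_files, URLS_DICTIONARY)`; an empty list means every file has a source URL.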

Let's load our documents using the LangChain BSHTMLLoader for our HTML files and the TextLoader for our txt files. We'll print a sample document at the end to see how it's been loaded.

In [7]:
documents = []


for f in html_corpus_files:
    loader = BSHTMLLoader(f)
    data = loader.load()
    documents += data

for f in text_corpus_files:
    loader = TextLoader(f)
    documents += loader.load()

#show sample document
documents[0]

Document(page_content="\n\n\n\n\n\n\nFoundation Models - IBM watsonx.ai\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFoundation models in watsonx.ai\xa0\n\n\n\n                        \n\n\n  \n  \n      Explore the IBM library of foundation models on the watsonx platform to scale generative AI for your business with confidence \n  \n\n\n\n\n    \n\n\n                    \n\n\n\nStart your free trial \n\n\nBook a live demo\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n\n\n\n    Enterprise-grade models with the power of choice \n\n\n\n\n\n    \n\n\n            \n        \n\n\n\n\nIBM watsonx™ models\xa0are designed for the enterprise and optimized for targeted business domains and use cases. Through the AI studio IBM®\xa0watsonx.ai™ we\xa0offer a selection of cost-effective, enterprise-grade foundation models developed by IBM, open-source models and models sourced from third-party providers to

Based on the sample document, there's a lot of whitespace and many newline characters that we can remove. Let's clean that up and add some metadata to our documents, including an ID number and the source of the content.

In [8]:
doc_id = 0
for doc in documents:
    doc.page_content = " ".join(doc.page_content.split()) # remove white space

    
    doc.metadata["id"] = doc_id #make a document id and add it to the document metadata
    doc.metadata["fileName"] = URLS_DICTIONARY[doc.metadata["source"]]

    if "title" not in doc.metadata.keys():
        doc.metadata["title"] = URLS_DICTIONARY[doc.metadata["source"]].replace(".txt", "")

    print(doc.metadata)
    doc_id += 1

{'source': './corpus/products_watsonx_ai_foundation_models.html', 'title': 'Foundation Models - IBM watsonx.ai', 'id': 0, 'fileName': 'https://www.ibm.com/products/watsonx-ai/foundation-models'}
{'source': './corpus/products_watsonx_data.html', 'title': 'IBM watsonx.data', 'id': 1, 'fileName': 'https://www.ibm.com/products/watsonx-data'}
{'source': './corpus/products_watsonx_governance.html', 'title': 'IBM watsonx.governance', 'id': 2, 'fileName': 'https://www.ibm.com/products/watsonx-governance'}
{'source': './corpus/ibm.com_events_think_faq.html', 'title': 'Think 2024 FAQ | IBM', 'id': 3, 'fileName': 'https://www.ibm.com/events/think/faq'}
{'source': './corpus/ibm_price_performance_data.html', 'title': 'Delivering superior price-performance and enhanced data management for AI with IBM watsonx.data - IBM Blog', 'id': 4, 'fileName': 'https://www.ibm.com/blog/announcement/delivering-superior-price-performance-and-enhanced-data-management-for-ai-with-ibm-watsonx-data/'}
{'source': './cor
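
The whitespace cleanup in the cell above relies on `str.split()` with no arguments, which splits on any run of whitespace (spaces, newlines and non-breaking spaces such as `\xa0`), followed by `" ".join(...)`. Here is a small standalone illustration; the helper name `normalize_whitespace` and the sample string are ours, modeled on the raw document shown earlier.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())


# Sample modeled on the raw HTML text: newlines, indentation and a non-breaking space
raw = (
    "\n\n\nFoundation Models - IBM watsonx.ai\n\n\n"
    "  Foundation models in watsonx.ai\xa0\n\n"
    "   Start your free trial \n\nBook a live demo\n"
)

print(normalize_whitespace(raw))
```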

Let's see how our sample document looks now after we cleaned it up:

In [9]:
documents[0]

Document(page_content="Foundation Models - IBM watsonx.ai Foundation models in watsonx.ai Explore the IBM library of foundation models on the watsonx platform to scale generative AI for your business with confidence Start your free trial Book a live demo Enterprise-grade models with the power of choice IBM watsonx™ models are designed for the enterprise and optimized for targeted business domains and use cases. Through the AI studio IBM® watsonx.ai™ we offer a selection of cost-effective, enterprise-grade foundation models developed by IBM, open-source models and models sourced from third-party providers to help clients and partners scale and operationalize artificial intelligence (AI) faster with minimal risk. You can deploy the AI models wherever your workload is, both on-premises and on hybrid cloud. IBM takes a differentiated approach to delivering enterprise-grade foundation models: Open: Bring best-in-class IBM and proven open-source models to watsonx foundation model library or 

We need to split our text into smaller, more manageable pieces known as "chunks". LangChain's `RecursiveCharacterTextSplitter` takes a large text and splits it to a specified chunk size using a predefined list of separators. The default separators are `["\n\n", "\n", " ", ""]`.

The process starts by attempting to split the text on the first separator, `\n\n`. If the resulting chunks are still too large, it moves to the next separator, `\n`, and tries splitting again. This continues through each separator in the list until the chunks fit within the specified chunk size.
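
To make the recursion concrete, here is a simplified, standard-library sketch of the idea. It is not LangChain's implementation: the function name `recursive_split` is ours, it measures size in characters, and it ignores overlap and other options the real splitter supports.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Pieces are split on the first separator and greedily merged back together
    while they still fit; a piece that is too long on its own is split again
    with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, so keep growing the chunk
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next separator
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
print(recursive_split(sample, chunk_size=16))
```

Note that our cleanup step collapsed all whitespace into single spaces, so the documents in this tutorial no longer contain `\n\n` or `\n`; in practice, the splitting here happens on spaces.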

We settled on a chunk size of 512 (measured in characters by default) after experimenting with a chunk size of 1000. With chunks that large, our model was getting too much context for question answering, so we switched to smaller chunks. Feel free to experiment with chunk size further!


In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
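
The splitter above sets `chunk_overlap=0`, so neighbouring chunks share no text. If you experiment with a non-zero overlap, the toy sketch below shows the general effect using fixed character windows. It is only an illustration under that simplifying assumption (the helper `sliding_chunks` is hypothetical and is not how LangChain applies overlap internally).

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut `text` into character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With overlap, the end of one chunk is repeated at the start of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more chunks.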

Next, we choose an embedding model to generate vector embeddings for our Think 2024 dataset. The model converts each chunk into a numeric vector that captures its meaning. For text data, popular open-source options include Word2Vec, GloVe, FastText and pre-trained transformer-based models like BERT or RoBERTa. OpenAI embeddings may also be used through the OpenAI embeddings API endpoint with an `openai_api_key`; however, there is a cost associated with this usage.

Unfortunately, because embedding models can be very large, generating vector embeddings often demands significant computational resources, such as a GPU. We can greatly lower the costs linked to embedding vectors, while preserving performance and accuracy, by using Hugging Face embeddings.

Hugging Face provides open-source NLP libraries and a vast array of pre-trained models. Embeddings generated by models like BERT, GPT and RoBERTa encapsulate semantic information from text. Unlike traditional embedding methods that necessitate training from scratch, pre-trained Hugging Face models can be used immediately for various NLP tasks.

In [11]:
embeddings = HuggingFaceEmbeddings()



Let's load our content into a local instance of a vector database, using Chroma.

In [16]:
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

In [13]:
# vectordb = Chroma(persist_directory='saved_vdb', embedding_function=embeddings)

Let's do a quick search of our vector database to test it out! Using `similarity_search_with_score` returns each matching document together with its distance from the query. The returned score is a Euclidean (L2) distance, so a lower score indicates a closer match.
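
To see why a lower distance means a better match, here is a small standalone sketch that ranks documents by Euclidean distance. The three-dimensional vectors and document names are made up for illustration; real embeddings have hundreds of dimensions and come from the embedding model.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query_vector, doc_vectors):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = [
        (name, euclidean_distance(query_vector, vector))
        for name, vector in doc_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Made-up vectors standing in for embedded chunks and an embedded query
doc_vectors = {
    "pricing_page": [0.0, 0.8, 0.6],
    "concert_blog": [0.9, 0.1, 0.0],
}
query_vector = [1.0, 0.0, 0.0]

for name, distance in rank_by_distance(query_vector, doc_vectors):
    print(f"{name}: {distance:.3f}")
```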

In [20]:
prompt = "What is IBM concert?"
search = vectorstore.similarity_search_with_score(prompt)
search

[(Document(page_content='addressing issues and solving problems before they happen. Concert will initially focus on helping application owners, SREs and IT leaders gain insights about, pre-empt and more quickly address issues around application risk and compliance management. Read this blog to learn more about IBM Concert. IBM expands ecosystem access to watsonx, adds third-party models IBM continues to foster a strong ecosystem of partners to offer clients choice and flexibility through bringing third-party models onto watsonx,', metadata={'fileName': 'https://newsroom.ibm.com/2024-05-21-IBM-Unveils-Next-Chapter-of-watsonx-with-Open-Source,-Product-Ecosystem-Innovations-to-Drive-Enterprise-AI-at-Scale', 'id': 8, 'source': './corpus/watsonx_open_source.html', 'title': 'IBM Unveils Next Chapter of watsonx with Open Source, Product & Ecosystem Innovations to Drive Enterprise AI at Scale'}),
  0.6187030076980591),
 (Document(page_content='application risk and compliance management. Read t

In [21]:
prompt = "Where is Think 2024?"
search = vectorstore.similarity_search_with_score(prompt)
search

[(Document(page_content='up now to attend Think 2024, 20–23 May in Boston Register now When we know, you’ll know Get updates on speakers, sessions and essential conference information, delivered right to your inbox. Subscribe', metadata={'fileName': 'https://www.ibm.com/events/think/agenda', 'id': 13, 'source': './corpus/events_think_agenda.html', 'title': 'IBM Think 2024 Agenda'}),
  0.5850661396980286),
 (Document(page_content='technology leaders from across industries. Content will be geared toward C-level, line of business and senior IT leaders. Think 2024 programming will be held at the Boston Convention & Exhibition Center (BCEC), with some activities at the Omni Boston Hotel at the Seaport. At IBM, we are committed to sustainability and environmentally responsible event planning. We are proud to partner with two distinguished venues, each known for their exemplary sustainable practices. Our event will take place at the', metadata={'fileName': 'https://www.ibm.com/events/think/fa

## Step 5. Set up a retriever

We'll set up our vector store as a retriever. The retrieved information from the vector store serves as additional context or knowledge that can be used by a generative model.

In [22]:
retriever = vectorstore.as_retriever()


## Step 6. Generate a response with a generative model

Finally, we’ll generate a response. The generative model (like GPT-4 or IBM Granite) uses the retrieved information to produce a more accurate and contextually relevant response to our questions about Think 2024.

First, we'll establish which LLM we're going to use to generate the response. For this tutorial, we'll use IBM's Granite 13B Chat model. Don't forget to load in your API credentials to access this model using the [IBM GenAI SDK](https://github.com/IBM/ibm-generative-ai).

In [16]:
# IBM_GRANITE_13B_CHAT_V2 = "ibm/granite-13b-chat-v2"
# MODEL_ID_PARAMS = {
#     IBM_GRANITE_13B_CHAT_V2: TextGenerationParameters(
#         decoding_method="greedy",
#         max_new_tokens=512,
#         min_new_tokens=10,
#         repetition_penalty=1.2,
#         return_options=TextGenerationReturnOptions(
#     generated_tokens=True,
#     token_logprobs=True,
#     token_ranks=True,
#     # input_tokens=True
#     # top_n_tokens=
# ))}

In [23]:
llm = LangChainChatInterface(
    client=Client(credentials=Credentials.from_env()),
    model_id="ibm/granite-13b-chat-v2",
    parameters=TextGenerationParameters(
        decoding_method="greedy",
        max_new_tokens=512,
        min_new_tokens=10,
        repetition_penalty=1.2,
        return_options=TextGenerationReturnOptions(
            generated_tokens=True,
            token_logprobs=True,
            token_ranks=True,
        ),
    ),
)

We'll set up a `ChatPromptTemplate` to ask multiple questions. The "context" will come from our retriever (backed by our vector database), which supplies the relevant documents, and the "question" will come from the user query.

In [24]:
template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
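
To preview what the model will receive, the sketch below fills the same template with plain `str.format`. This is only an approximation for illustration: `ChatPromptTemplate` additionally wraps the text in a chat message, and the context string here is a shortened sample rather than real retriever output.

```python
template = """Answer the question based only on the following context:

{context}

Question: {question}
"""

# Sample values standing in for the retriever output and the user query
filled = template.format(
    context="Think 2024 programming will be held at the Boston Convention & Exhibition Center (BCEC).",
    question="Where is Think 2024?",
)
print(filled)
```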

Let's set up a helper function to format the docs accordingly:

In [25]:
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])
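
To try the helper without the vector store, the sketch below calls it on stand-in objects that only carry a `page_content` attribute. The `SimpleNamespace` objects and their text are placeholders for the `Document` objects the retriever returns.

```python
from types import SimpleNamespace


def format_docs(docs):
    # Join the text of each retrieved document, separated by a blank line
    return "\n\n".join([d.page_content for d in docs])


# Stand-ins for retrieved Document objects
fake_docs = [
    SimpleNamespace(page_content="Think 2024 takes place in Boston."),
    SimpleNamespace(page_content="Programming will be held at the BCEC."),
]

context = format_docs(fake_docs)
print(context)
```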

And now we can set up a chain with our context, our prompt and our LLM. The generative model processes the augmented context along with the user's question to produce a response.

In [26]:
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
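
If the pipe syntax is new to you, it may help to read the chain as ordinary function composition. The sketch below mirrors the same flow with plain Python functions. Every component here is a hypothetical stub so that it runs without the retriever or the LLM; it is not how LangChain executes the chain internally.

```python
def run_rag(question, retrieve, format_docs, build_prompt, generate):
    """Plain-Python view of the chain: retrieve, format, prompt, generate, stringify."""
    context = format_docs(retrieve(question))  # "context": retriever | format_docs
    prompt_text = build_prompt(context=context, question=question)  # | prompt
    return str(generate(prompt_text))  # | llm | StrOutputParser()


# Stand-in components (all hypothetical) so the sketch runs on its own
def fake_retrieve(question):
    return ["Think 2024 will be held in Boston."]


def join_chunks(chunks):
    return "\n\n".join(chunks)


def build_prompt(context, question):
    return (
        "Answer the question based only on the following context:\n\n"
        f"{context}\n\nQuestion: {question}\n"
    )


def stub_llm(prompt_text):
    # A real LLM would generate an answer; this stub only reports the prompt size
    return f"(stub answer based on {len(prompt_text)} prompt characters)"


print(run_rag("Where is Think 2024?", fake_retrieve, join_chunks, build_prompt, stub_llm))
```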

And now we can ask multiple questions:

In [27]:
chain.invoke("Where is Think 2024?")

'Think 2024 is being held in Boston, Massachusetts. The specific venue mentioned is the Boston Convention & Exhibition Center (BCEC).'

In [28]:
chain.invoke("What is IBM Concert?")

'IBM Concert is described as "a new tool powered by the IBM watsonx AI and data platform that will provide visibility and insight into the entire ecosystem of business applications, and the clouds, networks, and assets on which they are built."'

In [29]:
chain.invoke("What is IBM Think 2024?")

'IBM Think 2024 is a conference or event where IBM announces new products, technologies, and partnerships related to artificial intelligence and other areas of interest to businesses and organizations. The most recent edition of this event was held in Boston, Massachusetts, and featured sessions and discussions aimed at senior business and technology leaders.'

And that's it! Feel free to ask even more questions!

You can imagine using a chain like this to build a chatbot that fields these questions.
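
As a first step in that direction, the following sketch loops over several questions and collects the answers. The helper `answer_all` is our own, and `fake_ask` is a stub standing in for `chain.invoke` so the example runs without the model.

```python
def answer_all(ask, questions):
    """Call `ask` once per question and return a {question: answer} dictionary."""
    return {question: ask(question) for question in questions}


# Stub standing in for chain.invoke (hypothetical)
def fake_ask(question):
    return f"(answer to: {question})"


answers = answer_all(fake_ask, ["Where is Think 2024?", "What is IBM Concert?"])
for question, answer in answers.items():
    print(question, "->", answer)
```

In the notebook, passing `chain.invoke` as the first argument would send each question through the RAG chain built above.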

We encourage you to check out the [LangChain documentation page](https://python.langchain.com/v0.2/docs/tutorials/rag/) for more information and tutorials on RAG.