In this tutorial, we’ll use LangChain to walk through a step-by-step Retrieval Augmented Generation ([RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)) example in Python. For our use case, we’ll be setting up a RAG system for [IBM Think 2024](https://www.ibm.com/events/think).

RAG is a technique in natural language processing (NLP) that combines information retrieval and generative models to produce more accurate, relevant and contextually aware responses. 

In traditional language generation tasks, [large language models](https://www.ibm.com/topics/large-language-models) (LLMs) like OpenAI’s GPT-3.5 (Generative Pre-trained Transformer) or [IBM’s Granite Models](https://www.ibm.com/products/watsonx-ai/foundation-models) are used to construct responses based on an input prompt. However, these models may struggle to produce responses that are contextually relevant, factually accurate or up to date. RAG applications address this limitation by incorporating a retrieval step before response generation. During retrieval, [vector search](https://www.ibm.com/topics/vector-search) can be used to identify contextually pertinent information, such as relevant passages or documents from a large corpus of text, typically stored in a [vector database](https://www.ibm.com/topics/vector-database). Finally, an LLM is used to generate a response based on the retrieved context.

LangChain is a powerful, open-source framework that facilitates the development of applications using LLMs for various NLP tasks. In the context of RAG, LangChain plays a critical role by combining the strengths of retrieval-based methods and generative models to enhance the capabilities of NLP systems.

# Prerequisites

You need an [IBM Cloud account](https://cloud.ibm.com/registration?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-trial) to create a [watsonx.ai](https://www.ibm.com/products/watsonx-ai?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-product) project.

# Steps

## Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook. Jupyter Notebooks are widely used within [data science](https://www.ibm.com/topics/data-science) to combine code, text, images, and [data visualizations](https://www.ibm.com/topics/data-visualization) to formulate a well-formed analysis.

1. Log in to [watsonx.ai](https://dataplatform.cloud.ibm.com/registration/stepone) using your IBM Cloud account.

2. Create a [watsonx.ai project](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-creating-project#create-a-project).

3. Create a [Jupyter Notebook](https://www.ibm.com/docs/en/watsonx/saas?topic=editor-creating-notebooks).

This step will open a Notebook environment where you can load your data set and copy the code from this tutorial to implement a binary classification task using the gradient boosting algorithm.

## Step 2. Install and import relevant libraries

We'll need a few libraries for this tutorial. Make sure to import the ones below, and if they're not installed, you can resolve this with a quick pip install.

In [None]:
#installations
%pip install langchain
%pip install ibm-generative-ai
%pip install langchain_chroma
%pip install beautifulsoup4
%pip install lxml
%pip install sentence-transformers

In [31]:
#imports
import json
import glob

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from langchain_community.document_loaders import (
    BSHTMLLoader, 
    TextLoader,
)

from langchain_text_splitters import RecursiveCharacterTextSplitter


from genai import Credentials, Client
from genai.schema import (
    TextGenerationParameters,
    TextGenerationReturnOptions,
)
from genai.extensions.langchain.chat_llm import LangChainChatInterface


from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

load_dotenv()



True

#  Step 1 - indexing

sss

In [3]:
def load_think_corpus(directory, text=False):

    html_files = glob.glob(directory + "/" + "*.html")

    if text:
        html_files = glob.glob(directory + "/" + "*.txt")

    return html_files

def load_urls(json_file: str) -> dict[str, str]:
    with open(json_file, "r") as f:
        return json.load(f)
    



In [4]:
URLS_DICTIONARY = load_urls("./sources/sources.json")
COLLECTION_NAME = "askibm_think_2024"
URLS_DICTIONARY

{'./corpus/ibm.com_events_think_faq.html': 'https://www.ibm.com/events/think/faq',
 './corpus/events_think_agenda.html': 'https://www.ibm.com/events/think/agenda',
 './corpus/products_watsonx_ai.html': 'https://www.ibm.com/products/watsonx-ai',
 './corpus/products_watsonx_ai_foundation_models.html': 'https://www.ibm.com/products/watsonx-ai/foundation-models',
 './corpus/watsonx_pricing.html': 'https://www.ibm.com/watsonx/pricing',
 './corpus/watsonx.html': 'https://www.ibm.com/watsonx',
 './corpus/products_watsonx_data.html': 'https://www.ibm.com/products/watsonx-data',
 './corpus/products_watsonx_assistant.html': 'https://www.ibm.com/products/watsonx-assistant',
 './corpus/products_watsonx_code_assistant.html': 'https://www.ibm.com/products/watsonx-code-assistant',
 './corpus/products_watsonx_orchestrate.html': 'https://www.ibm.com/products/watsonx-orchestrate',
 './corpus/products_watsonx_governance.html': 'https://www.ibm.com/products/watsonx-governance',
 './corpus/w3publisher_thin

In [5]:
embeddings = HuggingFaceEmbeddings()
html_corpus_files = load_think_corpus("./corpus")

text_corpus_files = load_think_corpus("./corpus", text=True)
html_corpus_files

  from tqdm.autonotebook import tqdm, trange


['./corpus/internal_genai.html',
 './corpus/products_watsonx_ai_foundation_models.html',
 './corpus/think_overview.html',
 './corpus/all_sales_blogs.html',
 './corpus/products_watsonx_data.html',
 './corpus/products_watsonx_governance.html',
 './corpus/ibm.com_events_think_faq.html',
 './corpus/what_is_think.html',
 './corpus/ibm_price_performance_data.html',
 './corpus/accelerating_gen_ai.html',
 './corpus/code_assistant_for_orchestrate.html',
 './corpus/red_hat_enterprise_linux_ai.html',
 './corpus/watsonx_open_source.html',
 './corpus/granite_code_models_open_source.html',
 './corpus/watsonx.html',
 './corpus/products_watsonx_code_assistant.html',
 './corpus/ibm_consulting_advantage_info.html',
 './corpus/events_think_agenda.html',
 './corpus/products_watsonx_assistant.html',
 './corpus/products_watsonx_orchestrate.html',
 './corpus/ibm_consulting_advantage_news.html',
 './corpus/instruct_lab.html',
 './corpus/democratizing.html',
 './corpus/w3publisher_think.html',
 './corpus/produ



We're using RecursiveCharacterTextSplitter to split the text, which splits the text by recursively look at characters. Started with a chunk size of 1000 and the results were not as good, because the model was getting too much context, so changed it to smaller chunks. Feel free to experiment with chunk size further!

Then we use the Re

https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846

In [6]:
documents = []


for f in html_corpus_files:
    loader = BSHTMLLoader(f)
    data = loader.load()
    documents += data

for f in text_corpus_files:
    loader = TextLoader(f)
    documents += loader.load()


doc_id = 0
for doc in documents:
    doc.page_content = " ".join(doc.page_content.split()) # remove white space

    
    doc.metadata["id"] = doc_id #make a document id and add it to the document metadata
    doc.metadata["fileName"] = URLS_DICTIONARY[doc.metadata["source"]]

    if "title" not in doc.metadata.keys():
        doc.metadata["title"] = URLS_DICTIONARY[doc.metadata["source"]].replace(".txt", "")

    print(doc.metadata)
    doc_id += 1

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
docs = text_splitter.split_documents(documents)



{'source': './corpus/internal_genai.html', 'title': 'None', 'id': 0, 'fileName': 'https://w3.ibm.com/w3publisher/ai-for-business/internal-generative-ai-tools'}
{'source': './corpus/products_watsonx_ai_foundation_models.html', 'title': 'Foundation Models - IBM watsonx.ai', 'id': 1, 'fileName': 'https://www.ibm.com/products/watsonx-ai/foundation-models'}
{'source': './corpus/think_overview.html', 'title': 'None', 'id': 2, 'fileName': 'https://w3.ibm.com/w3publisher/news/may-2024/think-2024'}
{'source': './corpus/all_sales_blogs.html', 'title': 'None', 'id': 3, 'fileName': 'https://w3.ibm.com/w3publisher/ibmsaleszone/sales-news/all-sales-blogs/f062a050-0886-11ef-8344-dbc484c79139'}
{'source': './corpus/products_watsonx_data.html', 'title': 'IBM watsonx.data', 'id': 4, 'fileName': 'https://www.ibm.com/products/watsonx-data'}
{'source': './corpus/products_watsonx_governance.html', 'title': 'IBM watsonx.governance', 'id': 5, 'fileName': 'https://www.ibm.com/products/watsonx-governance'}
{'so

In [7]:
# vector_db = Milvus(
#     embeddings,
#     collection_name=COLLECTION_NAME,
#     connection_args=compose_connection_args_from_environment(),
#     auto_id=True,
# )

# for doc in docs:
#     print(doc.metadata)

#     expr = "id == " + str(doc.metadata["id"])

#     pks = vector_db.get_pks(expr)

#     print("Primary keys: ", pks)

#     if pks is None:
#         pks = []

#     result = vector_db.upsert(pks, [doc])

#     print(result)

In [9]:
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory='saved_vdb')

In [22]:
vectorstore

<langchain_chroma.vectorstores.Chroma at 0x1778fa0b0>

In [10]:
vectordb = Chroma(persist_directory='saved_vdb', embedding_function=embeddings)

# Step . Retriever

In [12]:
prompt = "What is IBM Concert?"
search = vectordb.similarity_search_with_score(prompt)
search

[(Document(page_content='addressing issues and solving problems before they happen. Concert will initially focus on helping application owners, SREs and IT leaders gain insights about, pre-empt and more quickly address issues around application risk and compliance management. Read this blog to learn more about IBM Concert. IBM expands ecosystem access to watsonx, adds third-party models IBM continues to foster a strong ecosystem of partners to offer clients choice and flexibility through bringing third-party models onto watsonx,', metadata={'fileName': 'https://newsroom.ibm.com/2024-05-21-IBM-Unveils-Next-Chapter-of-watsonx-with-Open-Source,-Product-Ecosystem-Innovations-to-Drive-Enterprise-AI-at-Scale', 'id': 12, 'source': './corpus/watsonx_open_source.html', 'title': 'IBM Unveils Next Chapter of watsonx with Open Source, Product & Ecosystem Innovations to Drive Enterprise AI at Scale'}),
  0.6187030076980591),
 (Document(page_content='application risk and compliance management. Read 

In [17]:
prompt = "Where is Think 2024t?"
search = vectordb.similarity_search_with_score(prompt)
search

[(Document(page_content='up now to attend Think 2024, 20–23 May in Boston Register now When we know, you’ll know Get updates on speakers, sessions and essential conference information, delivered right to your inbox. Subscribe', metadata={'fileName': 'https://www.ibm.com/events/think/agenda', 'id': 17, 'source': './corpus/events_think_agenda.html', 'title': 'IBM Think 2024 Agenda'}),
  0.6360556483268738),
 (Document(page_content='technology leaders from across industries. Content will be geared toward C-level, line of business and senior IT leaders. Think 2024 programming will be held at the Boston Convention & Exhibition Center (BCEC), with some activities at the Omni Boston Hotel at the Seaport. At IBM, we are committed to sustainability and environmentally responsible event planning. We are proud to partner with two distinguished venues, each known for their exemplary sustainable practices. Our event will take place at the', metadata={'fileName': 'https://www.ibm.com/events/think/fa

In [34]:
retriever = vectordb.as_retriever()



In [16]:
IBM_GRANITE_13B_CHAT_V2 = "ibm/granite-13b-chat-v2"
MODEL_ID_PARAMS = {
    IBM_GRANITE_13B_CHAT_V2: TextGenerationParameters(
        decoding_method="greedy",
        max_new_tokens=512,
        min_new_tokens=10,
        repetition_penalty=1.2,
        return_options=TextGenerationReturnOptions(
    generated_tokens=True,
    token_logprobs=True,
    token_ranks=True,
    # input_tokens=True
    # top_n_tokens=
))}

In [30]:
llm = LangChainChatInterface(
    client=Client(credentials=Credentials.from_env()),
    model_id= "ibm/granite-13b-chat-v2",
    parameters=TextGenerationParameters(
        decoding_method="greedy",
        max_new_tokens=512,
        min_new_tokens=10,
        repetition_penalty=1.2,
        return_options=TextGenerationReturnOptions(
            generated_tokens=True,
            token_logprobs=True,
            token_ranks=True,)
            )
)

In [32]:
template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

In [35]:
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("Where is Think 2024?")

'Think 2024 is being held in Boston, Massachusetts. The specific venue mentioned is the Boston Convention & Exhibition Center (BCEC).'

In [36]:
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("What is IBM Concert?")

'IBM Concert is described as "a new tool powered by the IBM watsonx AI and data platform that will provide visibility and insight into the entire ecosystem of business applications, and the clouds, networks, and assets on which they are built."'

In [37]:
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("What is IBM Think 2024?")

'IBM Think 2024 is a conference or event where IBM announces new products, technologies, and partnerships related to artificial intelligence and other areas of interest to businesses and organizations. The most recent edition of this event was held in Boston, Massachusetts, and featured sessions and discussions aimed at senior business and technology leaders.'