# Create a LangChain RAG system in Python with watsonx and Llama 3

In this tutorial, we’ll use LangChain and Llama 3 to walk through a step-by-step Retrieval Augmented Generation ([RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)) example in Python. 

For our use case, we’ll set up a local RAG system for 18 IBM products. We will fetch content from several ibm.com pages to build a knowledge base, then supply that content as context to Meta's Llama 3 LLM so it can answer questions about these IBM products. Meta AI just released its latest Llama 3 model, and we'll test it out in this tutorial.

RAG is a technique in natural language processing (NLP) that combines information retrieval and generative models to produce more accurate, relevant and contextually aware responses. 

# More about RAG and LangChain

In traditional language generation tasks, [large language models](https://www.ibm.com/topics/large-language-models) (LLMs) like OpenAI’s GPT (Generative Pre-trained Transformer) or [IBM’s Granite Models](https://www.ibm.com/products/watsonx-ai/foundation-models) are used to construct responses based on an input prompt. However, these models may struggle to produce responses that are contextually relevant, factually accurate or up to date. The models may not know the latest information about IBM products. 

RAG applications address this limitation by incorporating a retrieval step before response generation. During retrieval, [vector search](https://www.ibm.com/topics/vector-search) can be used to identify contextually pertinent information, such as relevant information or documents from a large corpus of text, typically stored in a [vector database](https://www.ibm.com/topics/vector-database). Finally, an LLM is used to generate a response based on the retrieved context. RAG is an affordable and simple alternative to [fine-tuning](https://www.ibm.com/topics/fine-tuning) a model for text-generation artificial intelligence tasks.
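The retrieve-then-generate flow described above can be sketched in plain Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the three-line `toy_corpus` and its product names are hypothetical, and the final prompt would be sent to an LLM rather than printed.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine_similarity(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus, k=1):
    """Augment the question with the retrieved context before generation."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"{context}\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base
toy_corpus = [
    "Product A monitors application performance",
    "Product B manages mobile devices",
    "Product C automates repetitive business tasks",
]
toy_prompt = build_prompt("which product manages mobile devices", toy_corpus)
print(toy_prompt)
```

In the rest of this tutorial, a vector database and a real embedding model take the place of `retrieve` and `embed`, and an LLM consumes the augmented prompt.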


LangChain is a powerful, open-source framework that facilitates the development of applications using LLMs for various NLP tasks. In the context of RAG, LangChain plays a critical role by combining the strengths of retrieval-based methods and generative models to enhance the capabilities of NLP systems.

# Prerequisites

You need an [IBM Cloud account](https://cloud.ibm.com/registration?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-trial) to create a [watsonx.ai](https://www.ibm.com/products/watsonx-ai?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-product) project.

# Steps

## Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook. Jupyter Notebooks are widely used within [data science](https://www.ibm.com/topics/data-science) to combine code, text, images, and [data visualizations](https://www.ibm.com/topics/data-visualization) to develop a well-formed analysis.

1. Log in to [watsonx.ai](https://dataplatform.cloud.ibm.com/registration/stepone) using your IBM Cloud account.

2. Create a [watsonx.ai project](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-creating-project#create-a-project).
   Take note of the project ID in project > Manage > General > Project ID. You'll need this ID for this tutorial.

3. Create a [Jupyter Notebook](https://www.ibm.com/docs/en/watsonx/saas?topic=editor-creating-managing-notebooks).

This step will open a Notebook environment where you can copy the code from this tutorial to implement a RAG application for 18 IBM products. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook is available on [GitHub](https://github.com/Erika-Russi/think/blob/main/tutorials/langchain/langchain-rag-llama3.ipynb).

## Step 2. Set up a Watson Machine Learning (WML) Service Instance and API Key

1. Create a [Watson Machine Learning (WML) Service](https://cloud.ibm.com/catalog/services/watson-machine-learning) instance (choose the Lite plan, which is a free instance).

2. Generate an [API Key in WML](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-authentication.html). Save this API key for use in this tutorial.

3. Associate the WML service to the project you created in [watsonx.ai](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html).

## Step 3. Install and import relevant libraries and set up credentials

We'll need a few libraries for this tutorial. Install any that are missing with pip using the cell below, then import them.

In [1]:
#installations of dependencies
%pip install langchain
%pip install langchain_chroma
%pip install langchain-community
%pip install -U langchain_ibm
%pip install unstructured
%pip install "ibm-watson-machine-learning>=1.0.327"

Import the relevant libraries:

In [2]:
#imports
import os
import getpass

from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes

from langchain_ibm import WatsonxEmbeddings, WatsonxLLM
from langchain_chroma import Chroma

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain_community.document_loaders import UnstructuredURLLoader

from langchain_text_splitters import RecursiveCharacterTextSplitter

Set up your credentials and input your API Key:

In [3]:
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Please enter your WML api key (hit enter): ")
}

Set up your `project_id` as part of your environment variables or input it:

In [4]:
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

## Step 4. Index the URLs to create the knowledge base

We’ll index the product-specific IBM pages at these URLs to create a knowledge base stored as a vector store. The content from these URLs will be our data source and context for this exercise. The context will then be provided to an LLM to answer any questions we have about the IBM products.

The first step to building vector embeddings is to clean and process the raw dataset. This may involve the removal of noise and standardization of the text. For our example, the only cleaning we'll need is whitespace normalization, since the text is otherwise clean and standardized.

First, let's establish `URLS_DICTIONARY`. `URLS_DICTIONARY` is a dict that helps us map the URLs from which we will be extracting the content. Let's also set up a name for our collection: `ibm_products`.

In [5]:
URLS_DICTIONARY = {
    "api_connect": "https://www.ibm.com/products/api-connect",
    "concert": "https://www.ibm.com/products/concert",
    "environment_intelligence_suite": "https://www.ibm.com/products/environmental-intelligence-suite",
    "envizi": "https://www.ibm.com/products/envizi",
    "flashsystem": "https://www.ibm.com/flashsystem",
    "ibm_cloud": "https://www.ibm.com/cloud",
    "ibm_z": "https://www.ibm.com/z",
    "instana": "https://www.ibm.com/products/instana",
    "maas360": "https://www.ibm.com/products/maas360",
    "maximo": "https://www.ibm.com/products/maximo",
    "planning_analytics": "https://www.ibm.com/products/planning-analytics",
    "qradar_edr": "https://www.ibm.com/products/qradar-edr",
    "robotic_process_automation": "https://www.ibm.com/products/robotic-process-automation",
    "storage_defender": "https://www.ibm.com/products/storage-defender",
    "turbonomic": "https://www.ibm.com/products/turbonomic",
    "watsonx": "https://www.ibm.com/watsonx",
    "watsonx_assistant": "https://www.ibm.com/products/watsonx-assistant",
    "watsonx_orchestrate": "https://www.ibm.com/products/watsonx-orchestrate",
}
COLLECTION_NAME = "ibm_products"

Next, let's load our documents using the LangChain `UnstructuredURLLoader` for the list of URLs we have. We'll print a sample document at the end to see how it's been loaded.

In [6]:
documents = []

for url in list(URLS_DICTIONARY.values()):
    loader = UnstructuredURLLoader(urls=[url])
    data = loader.load()
    documents += data

#show sample document
documents[0]

#Output:

Document(metadata={'source': 'https://www.ibm.com/products/api-connect'}, page_content='Create, secure and manage APIs through their entire lifecycle\n\nExplore the award-winning design and built-in AI capabilities of IBM API Connect®\n\nTry it free\n\nTake the self guided tour\n\nRead more\n\nGo to report\n\nA management solution for the entire API lifecycle\n\nIBM API Connect is a full lifecycle API management solution that uses an intuitive experience to help consistently create, manage, secure, socialize and monetize APIs, which promotes digital transformation on premises and across clouds. This means you and your customers can power digital apps and spur innovation in real-time. IBM API Connect is also available as-a-Service as a highly scalable, fully managed API management platform on Amazon Web Services (AWS).\n\nAnnouncement\n\nIBM is partnering with Noname Security to deliver advanced API security capabilities.\n\nLearn more\n\nSmartPaper\n\nLearn how to unlock the full poten

Based on the sample document, it looks like there's a lot of white space and new line characters that we can get rid of. Let's clean that up and add some metadata to our documents, including an id number and the source of the content.

In [7]:
doc_id = 0
for doc in documents:
    doc.page_content = " ".join(doc.page_content.split()) # remove white space

    doc.metadata["id"] = doc_id #make a document id and add it to the document metadata

    print(doc.metadata)
    doc_id += 1

{'source': 'https://www.ibm.com/products/api-connect', 'id': 0}
{'source': 'https://www.ibm.com/products/concert', 'id': 1}
{'source': 'https://www.ibm.com/products/environmental-intelligence-suite', 'id': 2}
{'source': 'https://www.ibm.com/products/envizi', 'id': 3}
{'source': 'https://www.ibm.com/flashsystem', 'id': 4}
{'source': 'https://www.ibm.com/cloud', 'id': 5}
{'source': 'https://www.ibm.com/z', 'id': 6}
{'source': 'https://www.ibm.com/products/instana', 'id': 7}
{'source': 'https://www.ibm.com/products/maas360', 'id': 8}
{'source': 'https://www.ibm.com/products/maximo', 'id': 9}
{'source': 'https://www.ibm.com/products/planning-analytics', 'id': 10}
{'source': 'https://www.ibm.com/products/qradar-edr', 'id': 11}
{'source': 'https://www.ibm.com/products/robotic-process-automation', 'id': 12}
{'source': 'https://www.ibm.com/products/storage-defender', 'id': 13}
{'source': 'https://www.ibm.com/products/turbonomic', 'id': 14}
{'source': 'https://www.ibm.com/watsonx', 'id': 15}
{'

Let's see how our sample document looks now after we cleaned it up:

In [8]:
documents[0]

Document(metadata={'source': 'https://www.ibm.com/products/api-connect', 'id': 0}, page_content='Create, secure and manage APIs through their entire lifecycle Explore the award-winning design and built-in AI capabilities of IBM API Connect® Try it free Take the self guided tour Read more Go to report A management solution for the entire API lifecycle IBM API Connect is a full lifecycle API management solution that uses an intuitive experience to help consistently create, manage, secure, socialize and monetize APIs, which promotes digital transformation on premises and across clouds. This means you and your customers can power digital apps and spur innovation in real-time. IBM API Connect is also available as-a-Service as a highly scalable, fully managed API management platform on Amazon Web Services (AWS). Announcement IBM is partnering with Noname Security to deliver advanced API security capabilities. Learn more SmartPaper Learn how to unlock the full potential of your APIs. Read Sma
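As a small illustration of why the cleanup step works: `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, and new lines) and drops empty pieces, so joining the result with single spaces collapses all of it. The `raw` string below is a made-up sample, not actual page content.

```python
# Hypothetical sample with mixed new lines, extra spaces, and a trailing tab
raw = "Try it free\n\nTake the self guided tour\n\n  Read more\t"

# split() with no separator splits on any whitespace run; join() rebuilds with single spaces
cleaned = " ".join(raw.split())
print(cleaned)  # Try it free Take the self guided tour Read more
```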

We need to split up our text into smaller, more manageable pieces known as "chunks". LangChain's `RecursiveCharacterTextSplitter` takes a large text and splits it based on a specified chunk size using a predefined list of separators. In order, the default separators are:
- "\n\n" - two new line characters
- "\n" - one new line character
- " " - a space
- "" - an empty string

The process starts by attempting to split the text using the first separator, "\n\n". If the resulting chunks are still too large, it moves to the next separator, "\n", and tries splitting again. This continues with each separator in the list until the chunks are smaller than the specified chunk size. Since we already removed all the "\n\n" and "\n" characters when we cleaned up the text, the `RecursiveCharacterTextSplitter` will effectively begin at the " " (space) separator.
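The fallback behavior described above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is ignored, and separators are not preserved the way the real splitter can preserve them.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the first separator, recurse into oversized
    pieces with the remaining separators, then merge neighbors that still fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]

    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part:
            pieces.append(part)

    # Greedily merge adjacent pieces while the result stays within chunk_size
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

toy_chunks = recursive_split("the quick brown fox jumps over the lazy dog", chunk_size=10)
print(toy_chunks)  # ['the quick', 'brown fox', 'jumps over', 'the lazy', 'dog']
```

Because the sample sentence contains no new line characters, the first two separators leave it in one piece and the split effectively happens on spaces, which mirrors what happens to our cleaned documents.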

We settled on a chunk size of 512 after experimenting with a chunk size of 1000. When the chunks were that large, our model was getting too much context for question-answering; this led to confused responses by the LLM because it was receiving too much information, so we changed it to smaller chunks. Feel free to experiment with chunk size further!


In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Next, we choose an embedding model to convert each chunk of our IBM products dataset into a vector. We use a pretrained model as-is; no training on our data is required. For text data, popular open-source embedding models include Word2Vec, GloVe, FastText or pretrained transformer-based models like BERT or RoBERTa. OpenAI embeddings may also be used through the OpenAI embeddings API endpoint, the `langchain_openai` package and an `openai_api_key`; however, there is a cost associated with this usage.

Unfortunately, because embedding models are so large, generating vector embeddings often demands significant computational resources, such as a GPU. We can greatly lower the costs linked to embedding vectors, while preserving performance and accuracy, by using `WatsonxEmbeddings`. We'll use the IBM embeddings model, Slate, an encoder-only (RoBERTa-based) model which, while not generative, is fast and effective for many NLP tasks.

Alternatively, we can use the [Hugging Face embeddings models](https://python.langchain.com/v0.2/docs/integrations/platforms/huggingface/#embedding-models) via LangChain.


In [10]:
embeddings = WatsonxEmbeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG.value,
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
    )

Let's load our content into a local instance of a vector database, using Chroma.

In [11]:
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,  # the collection name defined earlier
)


Let's do a quick search of our vector database to test it out! Using `similarity_search_with_score` allows us to return the documents and the distance score of the query to them. The returned distance score is Euclidean distance. Therefore, a lower score is better.

In [None]:
query = "What is IBM concert?"
search = vectorstore.similarity_search_with_score(query)
search

[(Document(metadata={'id': 1, 'source': 'https://www.ibm.com/products/concert'}, page_content='IBM Concert Simplify and optimize your app management and technology operations with generative AI-driven insights Book a live demo Prioritize and act on the most significant risks to your business applications Explore risk management with IBM Concert Keep applications continuously compliant, even as they evolve Explore compliance management with IBM Concert What is IBM Concert? Get the overview IBM® Concert® puts you in control, so you can simplify and optimize your operations to focus on continuously'),
  0.3117217421531677),
 (Document(metadata={'id': 5, 'source': 'https://www.ibm.com/cloud'}, page_content='certified on IBM Cloud Talk to an IBM expert IBM business partners'),
  0.43175169825553894),
 (Document(metadata={'id': 1, 'source': 'https://www.ibm.com/products/concert'}, page_content='delivering enhanced client experiences and improved developer and SRE productivity IBM Concert pro
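To make the "lower score is better" point concrete, here is a minimal sketch of ranking by Euclidean distance. The two-dimensional vectors and the names `doc_close` and `doc_far` are made up for illustration; real embeddings have hundreds of dimensions, and the vector store computes the scores for us.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D embeddings
query_vec = [1.0, 0.0]
doc_vecs = {"doc_close": [0.9, 0.1], "doc_far": [0.0, 1.0]}

# Sort ascending: the smallest distance is the best match
ranked = sorted(doc_vecs.items(), key=lambda kv: euclidean_distance(query_vec, kv[1]))
for name, vec in ranked:
    print(name, round(euclidean_distance(query_vec, vec), 3))
```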

## Step 5. Set up a retriever

We'll set up our vector store as a retriever. The retrieved information from the vector store serves as additional context or knowledge that can be used by a generative model.

In [13]:
retriever = vectorstore.as_retriever()


## Step 6. Generate a response with a Generative Model

Finally, we’ll generate a response. The generative model (like GPT-4 or IBM Granite) uses the retrieved information to produce a more accurate and contextually relevant response to our questions about IBM products.

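First, we'll establish which LLM we're going to use to generate the response. This tutorial targets Llama 3; note that the `ModelTypes.LLAMA_2_70B_CHAT` enum member in the next cell selects the Llama 2 70B chat model, so substitute the Llama 3 model ID available in your watsonx.ai project to follow along with Llama 3.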

In [15]:
model_id = ModelTypes.LLAMA_2_70B_CHAT

The available model parameters are listed in the [SDK documentation](https://ibm.github.io/watson-machine-learning-sdk/model.html#enums). We experimented with various model parameters, including Temperature, Top P, and Top K. For more information on model parameters and what they mean, see the [watsonx documentation](https://www.ibm.com/docs/en/watsonx/saas?topic=lab-model-parameters-prompting).

In [16]:
parameters = {
    GenParams.DECODING_METHOD: 'greedy',
    GenParams.MIN_NEW_TOKENS: 10,
    GenParams.MAX_NEW_TOKENS: 512,
    GenParams.REPETITION_PENALTY: 1,
    GenParams.RETURN_OPTIONS: {
        'input_tokens': True,
        'generated_tokens': True,
        'token_logprobs': True,
        'token_ranks': True,
    },
}
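For intuition about the decoding method, the sketch below contrasts greedy decoding with Top K sampling on a made-up next-token distribution. The token names and probabilities are hypothetical, and the helper functions are illustrative only; the actual decoding happens inside the hosted model.

```python
import random

def greedy_pick(probs):
    """Greedy decoding: always choose the single most likely token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Top K sampling: keep the k most likely tokens, then sample among them by weight."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token probabilities
next_token_probs = {"cloud": 0.6, "data": 0.3, "banana": 0.1}

print(greedy_pick(next_token_probs))  # cloud

rng = random.Random(0)  # seeded for repeatability
samples = {top_k_sample(next_token_probs, k=2, rng=rng) for _ in range(50)}
print(samples)  # only tokens from the top 2 can appear
```

Greedy decoding is deterministic, which suits question answering over retrieved context; sampling trades that determinism for variety.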

Next, we instantiate the LLM.

In [17]:
llm = WatsonxLLM(
    model_id=model_id.value,
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params=parameters
)

We'll set up a prompt template so that we can reuse the same prompt for multiple questions. The "context" will be derived from our retriever (our vector database) with the relevant documents, and the "question" will be derived from the user query.

In [18]:
template = """Generate a summary of the context that answers the question. Explain the answer in multiple steps if possible. 
Answer style should match the context. Ideal Answer Length 2-3 sentences.\n\n{context}\nQuestion: {question}\nAnswer:
"""
prompt = ChatPromptTemplate.from_template(template)

Let's set up a helper function to format the docs accordingly:

In [19]:
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])
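To see roughly what text the model receives, the sketch below fills a shortened copy of the template using plain `str.format`. This is an approximation under stated assumptions: `FakeDoc` is a hypothetical stand-in for LangChain's `Document`, the chunk texts are placeholders, and `ChatPromptTemplate` additionally wraps the filled text in a chat message.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only page_content is needed."""
    page_content: str

def format_docs(docs):
    # Join retrieved chunks with a blank line between them
    return "\n\n".join(d.page_content for d in docs)

# Shortened version of the tutorial's template, with the same two placeholders
toy_template = "Generate a summary of the context that answers the question.\n\n{context}\nQuestion: {question}\nAnswer:"

toy_docs = [FakeDoc("First retrieved chunk."), FakeDoc("Second retrieved chunk.")]
filled = toy_template.format(context=format_docs(toy_docs), question="What is IBM Concert?")
print(filled)
```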

And now we can set up a chain with our context, our prompt and our LLM. We'll use `StrOutputParser` for parsing the results. The generative model processes the augmented context along with the user's question to produce a response.

In [20]:
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
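The `|` syntax can look opaque at first, so here is a minimal pure-Python sketch of the same data flow. The `Step` class and `fan_out` helper are hypothetical stand-ins, not LangChain classes; in LangChain, the dictionary at the start of the chain plays the role of `fan_out`, running each branch on the same input. The retriever and LLM here are fakes that need no network access.

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose left to right: the output of self feeds the next step
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda x: {name: step.invoke(x) for name, step in branches.items()})

# Fake components (assumptions for illustration)
toy_retriever = Step(lambda question: ["chunk about " + question])
toy_format_docs = Step(lambda chunks: "\n\n".join(chunks))
toy_passthrough = Step(lambda x: x)
toy_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}\nAnswer:")
toy_llm = Step(lambda p: "  echo: " + p.splitlines()[-2] + "  ")  # echoes the question line
toy_parser = Step(str.strip)

toy_chain = (
    fan_out(context=toy_retriever | toy_format_docs, question=toy_passthrough)
    | toy_prompt
    | toy_llm
    | toy_parser
)
result = toy_chain.invoke("What is RPA?")
print(result)  # echo: Question: What is RPA?
```

The user's question enters once, is sent both to the retriever branch and straight through as the question, and the two results meet again in the prompt.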

And now we can ask multiple questions:

In [23]:
chain.invoke("What is the difference between watsonx, watsonx assistant, and watsonx orchestrate?")

'Watsonx is a generative AI and automation solution that helps businesses automate tasks, simplify complex processes, and save time and effort. Watsonx Orchestrate is a purpose-built AI assistant that helps end users get work done with helpful conversations. Watsonx Assistant is an AI assistant builder that allows users to quickly and easily create and deploy their own custom AI assistants without the need for coding expertise.'

Let's ask about IBM Concert next.

In [21]:
chain.invoke("What is IBM Concert?")

'\nIBM Concert is a solution that helps organizations simplify and optimize their application management and technology operations. It uses generative AI-driven insights to provide a comprehensive understanding of the application landscape, identify potential risks and opportunities, and generate recommendations for improvement. It integrates with existing environments and toolsets to provide real-time data and dependency mapping, enabling organizations to anticipate and address issues before they happen. Additionally, it helps maintain accurate records of system states over time for enhanced security, compliance, and operational integrity.'

And finally, let's ask for information about RPA.

In [24]:
chain.invoke("Tell me about RPA")

'RPA stands for Robotic Process Automation. It is a technology that allows software bots to automate repetitive, rule-based tasks by mimicking the actions of a human user. RPA tools can automate a wide range of business processes, such as data entry, document processing, and customer service, without the need for manual intervention. RPA bots can work 24/7, increasing productivity and reducing errors. RPA can be combined with artificial intelligence (AI) to create even more advanced automation capabilities, such as natural language processing and machine learning.'

And that's it! Feel free to ask even more questions!

# Summary and next steps

In this tutorial, you created a LangChain RAG system in Python with watsonx. You fetched 18 pages from https://www.ibm.com to create a vector store as context for an LLM to answer questions about IBM products.

The same approach could power a chatbot that fields these kinds of product questions.

We encourage you to check out the [LangChain documentation page](https://python.langchain.com/v0.2/docs/tutorials/rag/) for more information and tutorials on RAG.


## Try watsonx for free

Build an AI strategy for your business on one collaborative AI and data platform called [IBM watsonx](https://www.ibm.com/watsonx), which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With [watsonx.ai](https://www.ibm.com/products/watsonx-ai), you can train, validate, tune, and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data.

Try [watsonx.ai](https://dataplatform.cloud.ibm.com/registration/stepone), the next-generation studio for AI builders.

## Next steps

Explore more [articles and tutorials about watsonx](https://developer.ibm.com/components/watsonx/?) on IBM Developer.