# `MergedDataLoader`: Retrieval Augmented Generation

## Synthesize Answers to Coding Questions Across Multiple Code Repositories and Web Docs

### Learning Objectives:
*Learn how to load data from multiple sources and in multiple ways before retrieving their context for retrieval augmented generation.*

Retrieval augmented generation (RAG) improves the quality of LLM-generated answers by retrieving relevant, domain-specific documents and supplying them to the model as context.

**Goals**
1) Use `GenericLoader` and `LanguageParser` to load Python files from GitHub repositories.
2) Combine all loaders with `MergedDataLoader` so that their documents can be indexed in a single vectorstore.
3) Use a vectorstore as a retriever to index multiple repositories and webpage documentation for retrieval augmented generation.
4) Answer a user's coding questions using synthesized knowledge from documentation for Streamlit, LangGraph and LangChain.

In [1]:
%pip install -qU langchain langchain-community langchain-core GitPython langchain-openai beautifulsoup4 faiss-cpu langchain-anthropic

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_API_KEY"
os.environ["LANGCHAIN_PROJECT"] = "MultiSource-RAG-MergedDataLoader"
# Environment variable names are case-sensitive on most systems
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
os.environ["ANTHROPIC_API_KEY"] = "YOUR_API_KEY"

### Downloading Repositories using the Git Python library

In [3]:
### GIT REPO DOWNLOADING ###
from git import Repo

# Clone
langchain_path = "./langchain-library"  # directory to clone the repository into
# `./` creates a sub-directory named langchain-library in the current working directory

# Clone the repo if the sub-directory above doesn't exist
if not os.path.exists(langchain_path):
    repo = Repo.clone_from(
        "https://github.com/langchain-ai/langchain", to_path=langchain_path
    )

langgraph_path = "./langgraph-library"

# Clone the repo if the sub-directory above doesn't exist
if not os.path.exists(langgraph_path):
    repo = Repo.clone_from(
        "https://github.com/langchain-ai/langgraph", to_path=langgraph_path
    )

streamlit_path = "./streamlit-library"

# Clone the repo if the sub-directory above doesn't exist
if not os.path.exists(streamlit_path):
    repo = Repo.clone_from(
        "https://github.com/streamlit/streamlit", to_path=streamlit_path
    )
###

### Specifically load Python files from each repository

In [4]:
### REPOSITORY LOADING ###
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language

# Use `GenericLoader` to load Python files from the cloned repository
load_langchain_api_docs = GenericLoader.from_filesystem(
    langchain_path + "/libs/langchain/langchain",
    glob="**/*",
    suffixes=[".py"],  # Specify a list of file types
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)

load_langgraph_api_docs = GenericLoader.from_filesystem(
    langgraph_path + "/langgraph",
    glob="**/*",
    suffixes=[".py"],  # Specify a list of file types
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)

load_streamlit_api_docs = GenericLoader.from_filesystem(
    streamlit_path + "/lib/streamlit",
    glob="**/*",
    suffixes=[".py"],  # Specify a list of file types
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
###
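
To preview which files a `glob` and `suffixes` combination like the one above would pick up, a plain `pathlib` walk gives a rough approximation. This is only a sketch and not how `GenericLoader` works internally; `list_matching_files` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path


def list_matching_files(
    root: str, glob: str = "**/*", suffixes: tuple[str, ...] = (".py",)
) -> list[Path]:
    """Return files under `root` that match `glob` and end in one of `suffixes`."""
    return sorted(
        path
        for path in Path(root).glob(glob)
        if path.is_file() and path.suffix in suffixes
    )
```

Calling it on one of the cloned directories, for example `list_matching_files("./streamlit-library/lib/streamlit")`, is a cheap sanity check before parsing and embedding thousands of files.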

### Scrape web documentation for Streamlit and LangChain

Here we use a `max_depth` of 5 to be as thorough as possible without crawling too deep.

*Refer to the documentation you're going to scrape before configuring that parameter.*

A rough way to choose the value is to count the path segments in the deepest URL you want to reach.
> For example: 
>> Say you're going to scrape `https://docs.streamlit.io/library/api-reference/charts/st.area_chart` and your base URL is `https://docs.streamlit.io/`
>>> Your `max_depth` should be at least 5: the URL has 4 path segments below the base URL, and the base page itself counts as one level.
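
Counting segments by hand is error-prone, so a small helper can do it. This is a sketch that uses only the standard library; `segments_below` is a hypothetical name, and it counts only the segments beneath the base URL, not the base page itself.

```python
from urllib.parse import urlparse


def segments_below(base_url: str, target_url: str) -> int:
    """Count the path segments of `target_url` that lie below `base_url`."""
    base_parts = [part for part in urlparse(base_url).path.split("/") if part]
    target_parts = [part for part in urlparse(target_url).path.split("/") if part]
    if target_parts[: len(base_parts)] != base_parts:
        raise ValueError("target_url is not located under base_url")
    return len(target_parts) - len(base_parts)


base = "https://docs.streamlit.io/"
target = "https://docs.streamlit.io/library/api-reference/charts/st.area_chart"
print(segments_below(base, target))
```

Treat the result as a starting point rather than a guarantee, since the links on a site do not always mirror its URL hierarchy.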

In [5]:
### WEB LOADING ###
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

streamlit_url = "https://docs.streamlit.io/library/api-reference/"
streamlit_loader = RecursiveUrlLoader(
    url=streamlit_url, max_depth=5, extractor=lambda x: Soup(x, "html.parser").text
)

langchain_url = "https://python.langchain.com/docs/"
langchain_loader = RecursiveUrlLoader(
    url=langchain_url, max_depth=5, extractor=lambda x: Soup(x, "html.parser").text
)
###

### Document Preparation for Retrieval

In [6]:
### LOAD EVERYTHING ###
from langchain_community.document_loaders.merge import MergedDataLoader

documents = MergedDataLoader(
    loaders=[
        load_langchain_api_docs,
        load_langgraph_api_docs,
        load_streamlit_api_docs,
        streamlit_loader,
        langchain_loader,
    ]
)

all_docs = documents.load()
###
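
Conceptually, merging loaders amounts to iterating over each loader in turn and concatenating whatever they yield. The toy classes below are hypothetical stand-ins written in plain Python to show that idea; they are not the LangChain implementation, and the documents are plain strings rather than `Document` objects.

```python
from typing import Iterable, Iterator


class ListLoader:
    """Toy loader that yields pre-built documents (plain strings here)."""

    def __init__(self, docs: Iterable[str]) -> None:
        self.docs = list(docs)

    def lazy_load(self) -> Iterator[str]:
        yield from self.docs


class ToyMergedLoader:
    """Yield documents from several loaders, one loader after another."""

    def __init__(self, loaders: list) -> None:
        self.loaders = loaders

    def lazy_load(self) -> Iterator[str]:
        for loader in self.loaders:
            yield from loader.lazy_load()

    def load(self) -> list:
        return list(self.lazy_load())


merged = ToyMergedLoader(
    [ListLoader(["repo file 1", "repo file 2"]), ListLoader(["web page 1"])]
)
docs = merged.load()
print(len(docs))
```

In this sketch the order of the loaders determines the order of the merged documents, so the first document comes from the first loader in the list.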

In [7]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f"\nYou have {len(all_docs)} document(s) in your data")
print(f"\nThere are {len(all_docs[0].page_content)} characters in your sample document")
print(f"\nHere is a sample: \n\n```\n\n{all_docs[0].page_content[:200]}\n\n```")


You have 2422 document(s) in your data

There are 217 characters in your sample document

Here is a sample: 

```

"""Deprecated module for BaseLanguageModel class, kept for backwards compatibility."""
from __future__ import annotations

from langchain_core.language_models import BaseLanguageModel

__all__ = ["Bas

```


In [8]:
### TEXT SPLITTING ###

# Instantiate a text splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=50, length_function=len
)

# Split the documents
processed_documents = text_splitter.split_documents(all_docs)
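
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window splitter. It rests on a big assumption: `RecursiveCharacterTextSplitter` prefers to break on separators such as blank lines and newlines, which this version ignores entirely. `split_fixed` is a hypothetical helper, not part of any library.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With the notebook's settings (`chunk_size=1500`, `chunk_overlap=50`), consecutive windows in this simplified version would start 1450 characters apart.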

In [9]:
# Instantiate an embeddings model
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=256,
)

In [10]:
### CREATE A VECTORSTORE AS A RETRIEVER ###
from langchain_community.vectorstores import FAISS

index_path = "./codegen_faiss"

# Check for the local index's existence
if os.path.exists(index_path):
    # Load the local index
    vectorstore = FAISS.load_local(
        folder_path=index_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True,  # Not recommended for production
    )
else:
    # Embed documents in a vectorstore
    vectorstore = FAISS.from_documents(
        processed_documents,
        embeddings,
    )

    # Save the vectorstore locally
    vectorstore.save_local(
        folder_path=index_path,
    )

# Configure the retriever (identical for a loaded or a freshly built index)
retriever = vectorstore.as_retriever(
    # Optionally configure retrieval parameters
    # search_type="mmr", search_kwargs={"k": 7, "fetch_k": 14}
)
###
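
The idea behind a vectorstore retriever is that the query is embedded and the stored chunks with the closest vectors are returned. The sketch below illustrates this with hand-made 3-dimensional vectors and cosine similarity. The vectors, texts and the `top_k` helper are invented for illustration, and the real store may use a different distance metric and a far more efficient implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs standing in for embedded chunks
store = [
    ("st.slider docs", [0.9, 0.1, 0.0]),
    ("LangGraph state docs", [0.0, 0.2, 0.9]),
    ("st.selectbox docs", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], store, k=2))
```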

#### Now that we have a retriever, let's configure a chat prompt template to pass to our chat model later on.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages([
    # Not every chat model integration accepts "system" messages, so the instructions
    # are given here as a user/assistant exchange. Where system messages are supported,
    # I recommend using one to better steer the AI
    ("user", "You are an AI programming assistant and answer synthesist."),
    ("assistant", "I will do my best to answer accurately, honestly, and in pertinence to your context."),
    ("human", """[Task]: Answer the user's question based on the provided context, and only that context. If the context is not sufficient to answer the question, then say "I don't know." 
        
        [Context]:
        {context}
        
        [User's question]:
        {question}
        """),
])

Here we instantiate a Chat Model as an LLM to generate text. 

***Haiku** has been prioritized because it has a large context window, is cheap to use, and is far more capable than GPT-3.5-Turbo.*

In [12]:
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-3-haiku-20240307",
    temperature=0.15,
    streaming=True,
    max_tokens=4096,
)

### Invocation

Below we define the chain we want to use. The retriever returns Documents relevant to the user's question, and we provide those results as the context. The context and the question are stuffed into the `prompt_template` by way of [LCEL](https://python.langchain.com/docs/expression_language/ "LCEL Docs") and passed to the LLM, whose output text is parsed into a string and then printed.

The example invocation uses a coding example that requires context from the retriever in order to be correctly answered in full.
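
The data flow described above can be mimicked in plain Python: the same question is sent to a retriever and also passed through unchanged, and both values fill the prompt. Everything below (`fake_retriever`, `fan_out`, `format_prompt`, the shortened template) is a hypothetical stand-in meant to show the shape of the data, not LCEL itself.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for a retriever: returns canned snippets regardless of input."""
    return ["st.checkbox renders a checkbox.", "st.slider renders a slider."]


def fan_out(question: str) -> dict:
    """Send the same input to every branch and collect the results in a dict."""
    return {"context": fake_retriever(question), "question": question}


TEMPLATE = "[Context]:\n{context}\n\n[User's question]:\n{question}"


def format_prompt(inputs: dict) -> str:
    """Fill the template with the retrieved context and the original question."""
    context = "\n".join(inputs["context"])
    return TEMPLATE.format(context=context, question=inputs["question"])


prompt = format_prompt(fan_out("Which widget suits a boolean flag?"))
print(prompt)
```

The point to notice is that the prompt step needs a single dict containing both keys, which is why the retriever branch and the pass-through branch have to run side by side on the same input.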

In [13]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

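chain = (
    # Send the question to the retriever and pass it through unchanged in the
    # same step, so the prompt receives both "context" and "question"
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)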

invocation = chain.invoke(
    """
List every single Streamlit widget. 

Then, suggest a widget side by side each of the following CLI arguments. Our job is to come up with widgets for the Streamlit Real-ESRGAN image upscaler:

```python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', type=str, default='inputs', help='Input image or folder')
    parser.add_argument(
        '-n',
        '--model_name',
        type=str,
        default='RealESRGAN_x4plus',
        help=('Model names: RealESRGAN_x4plus | RealESRNet_x4plus | RealESRGAN_x4plus_anime_6B | RealESRGAN_x2plus | '
              'realesr-animevideov3 | realesr-general-x4v3'))
    parser.add_argument('-o', '--output', type=str, default='results', help='Output folder')
    parser.add_argument(
        '-dn',
        '--denoise_strength',
        type=float,
        default=0.5,
        help=('Denoise strength. 0 for weak denoise (keep noise), 1 for strong denoise ability. '
              'Only used for the realesr-general-x4v3 model'))
    parser.add_argument('-s', '--outscale', type=float, default=4, help='The final upsampling scale of the image')
    parser.add_argument(
        '--model_path', type=str, default=None, help='[Option] Model path. Usually, you do not need to specify it')
    parser.add_argument('--suffix', type=str, default='out', help='Suffix of the restored image')
    parser.add_argument('-t', '--tile', type=int, default=0, help='Tile size, 0 for no tile during testing')
    parser.add_argument('--tile_pad', type=int, default=10, help='Tile padding')
    parser.add_argument('--pre_pad', type=int, default=0, help='Pre padding size at each border')
    parser.add_argument('--face_enhance', action='store_true', help='Use GFPGAN to enhance face')
    parser.add_argument(
        '--fp32', action='store_true', help='Use fp32 precision during inference. Default: fp16 (half precision).')
    parser.add_argument(
        '--alpha_upsampler',
        type=str,
        default='realesrgan',
        help='The upsampler for the alpha channels. Options: realesrgan | bicubic')
    parser.add_argument(
        '--ext',
        type=str,
        default='auto',
        help='Image extension. Options: auto | jpg | png, auto means using the same extension as inputs')
    parser.add_argument(
        '-g', '--gpu-id', type=int, default=None, help='gpu device to use (default=None) can be 0,1,2 for multi-gpu')
```
"""
)
print(invocation)

Here is a list of Streamlit widgets and suggestions for widgets to use for the Streamlit Real-ESRGAN image upscaler:

Streamlit Widgets:
- st.button
- st.download_button
- st.link_button
- st.page_link
- st.checkbox
- st.toggle
- st.radio
- st.selectbox
- st.multiselect
- st.slider
- st.select_slider
- st.text_input
- st.number_input
- st.text_area
- st.date_input
- st.time_input
- st.file_uploader
- st.camera_input
- st.color_picker

Suggested Widgets for Streamlit Real-ESRGAN:

- `--input`: st.file_uploader or st.text_input (for folder path)
- `--model_name`: st.selectbox
- `--output`: st.text_input or st.directory_picker
- `--denoise_strength`: st.slider
- `--outscale`: st.slider
- `--model_path`: st.text_input
- `--suffix`: st.text_input
- `--tile`: st.number_input
- `--tile_pad`: st.number_input
- `--pre_pad`: st.number_input
- `--face_enhance`: st.checkbox
- `--fp32`: st.checkbox
- `--alpha_upsampler`: st.selectbox
- `--ext`: st.selectbox
- `--gpu-id`: st.number_input

The key is