[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1MTjwKZwaTosHL3faV9L0BRyweI1Rb3Vg/view?usp=sharing)

# LangChain 101: Part 3a. Talking to Own Documents: Load and Split

This notebook focuses on the loading and splitting of documents in the RAG pipeline. We'll load CSV, TXT, and PDF files, scrape web pages, and even retrieve text from YouTube. We'll learn different ways to split text and build a simple RAG pipeline with sourcing at the end!


Embeddings, Vectorstores and RAG in more detail will be covered in the next parts.


## Langchain 101 course: [Link](https://medium.com/@ivanreznikov/langchain-101-course-updated-668f7b41d6cb)

#### Setting Up the OpenAI API Key
First, we set up the OpenAI API key for our application. This is necessary for accessing various OpenAI services, including the language models. If you're running this in Google Colab, it will attempt to use the API key stored in your Colab environment; otherwise, it falls back to a default value you can replace.


In [1]:
!pip install -qU langchain==0.1.5 langchain-community==0.0.17 langchain-core==0.1.18 langchain-openai==0.0.5 openai==1.11.0 tiktoken==0.5.2 chromadb==0.4.22 yt-dlp==2023.12.30 pydub==0.25.1 pypdf==4.0.1

In [2]:
# Set OpenAI API key from Google Colab's user environment or default
import os


def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Set the OpenAI API key from Google Colab's user secrets or use a default value."""
    try:
        from google.colab import userdata

        api_key = userdata.get("OPENAI_API_KEY")
    except Exception:
        # Not running in Colab, or the secret is missing or inaccessible
        api_key = None

    os.environ["OPENAI_API_KEY"] = api_key or default_key


set_openai_api_key()

In [3]:
from typing import Any

from langchain.document_loaders import CSVLoader, PyPDFLoader, WebBaseLoader, TextLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

## Loaders

In [4]:
import requests


def download_and_save_file(url: str, filename: str) -> None:
    """Download file from a URL and save it locally.

    Args:
        url: The URL of the file to download.
        filename: The local filename to save the downloaded content.
    """
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for bad responses
    with open(filename, "wb") as file:
        file.write(response.content)


# Example of loading and using document loaders
def load_document_from_url(url: str, filename: str, loader_class: Any) -> Any:
    """Download a document from a URL, save it, and load it using a specified loader class.

    Args:
        url: URL of the document to download.
        filename: Filename to save the downloaded document.
        loader_class: The class of loader to use for loading the document.

    Returns:
        The loaded document.
    """
    download_and_save_file(url, filename)
    loader = loader_class(filename)
    return loader.load()

#### Loading and Processing Text Documents
Let's take a closer look at the function we wrote for loading and processing documents. The loading utility function downloads a file from a specified URL and saves it locally. The processing part consists of two calls:
1. `loader_class(filename)` - initializes the loader
2. `loader.load()` - loads the document
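
The two calls can be illustrated with a minimal, hypothetical loader. `SimpleDocument` and `SimpleTextLoader` below are stand-ins written for this example, not LangChain classes; they only mimic the init-then-`load()` pattern and the `page_content`/`metadata` shape that appears in this notebook's outputs.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: remembers the path on init, reads it on load()."""

    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        text = Path(self.file_path).read_text(encoding="utf-8")
        return [SimpleDocument(page_content=text, metadata={"source": str(self.file_path)})]


# Write a small file, then run the two steps: init the loader, load the document
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example.txt"
    example_path.write_text("hello loader", encoding="utf-8")
    docs = SimpleTextLoader(example_path).load()
```

Because every loader exposes the same `load()` call, a helper such as `load_document_from_url` can accept the loader class as a parameter.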

Let's load documents from different sources:

### TextLoader:
   - **Purpose**: Loads plain text files. This loader is the simplest form of document loader, designed to read text data directly from .txt files or similar plaintext formats.
   - **Use Case**: Ideal for loading and analyzing documents, scripts, books, or any other content that is stored in plain text format.

In [5]:
text_url = "https://raw.githubusercontent.com/IvanReznikov/DataVerse/main/Courses/LangChain/data/planets.txt"
text_filename = "planets.txt"

In [6]:
text_document = load_document_from_url(text_url, text_filename, TextLoader)

In [7]:
text_document

[Document(page_content="Freddyland: Small and swift, Freddyland orbits the Sun in just 88 days. Its days are long - longer than its years, lasting 59 Blueberry days. Temperatures can soar up to 800°F, making it the hottest planet. No atmosphere to speak of. It's a rocky world, covered in craters. Barely any tilt means no seasons. It's closest to the Sun.\nFoamborn: Veiled in thick clouds, Foamborn's surface is hidden. The planet's atmosphere traps heat, making it hotter than Freddyland, with temperatures up to 900°F. Acidic rains carve its landscape. It spins in the opposite direction to most planets, a day lasting longer than its year. High pressure crushes anything that lands. It's the second planet from the Sun. Its thick clouds reflect sunlight, making it bright.\nBlueberry: Home to millions of species, including humans. Water covers 70% of its surface. The atmosphere is a mix of nitrogen and oxygen, vital for life. It orbits the Sun every 365.25 days. Its axial tilt creates season

**Important!**

This document contains some information about planets and will be used for RAG later. To verify that answers come from the text rather than from the model's prior knowledge, all planets were renamed:
- Mercury -> Freddyland
- Venus -> Foamborn
- Earth -> Blueberry
- Mars -> Twix
- Jupiter -> Ipynb
- Saturn -> Sauron
- Uranus -> Nuclearium
- Neptune -> Neverborn

### CSVLoader:
   - **Purpose**: Loads CSV (Comma-Separated Values) files. CSV files are a common format for storing tabular data in plain text, where each line corresponds to a row in the table, and columns are separated by commas.
   - **Use Case**: Ideal for loading and processing datasets in CSV format. For example, it can be used to load datasets containing insurance information, user data, financial records, etc.

In [8]:
csv_url = "https://raw.githubusercontent.com/IvanReznikov/DataVerse/main/Courses/LangChain/data/insurance.csv"
csv_filename = "insurance.csv"

In [9]:
csv_document = load_document_from_url(csv_url, csv_filename, CSVLoader)

In [10]:
csv_document

[Document(page_content='age: 19\nsex: female\nbmi: 27.9\nchildren: 0\nsmoker: yes\nregion: southwest\ncharges: 16884.924', metadata={'source': 'insurance.csv', 'row': 0}),
 Document(page_content='age: 18\nsex: male\nbmi: 33.77\nchildren: 1\nsmoker: no\nregion: southeast\ncharges: 1725.5523', metadata={'source': 'insurance.csv', 'row': 1}),
 Document(page_content='age: 28\nsex: male\nbmi: 33\nchildren: 3\nsmoker: no\nregion: southeast\ncharges: 4449.462', metadata={'source': 'insurance.csv', 'row': 2}),
 Document(page_content='age: 33\nsex: male\nbmi: 22.705\nchildren: 0\nsmoker: no\nregion: northwest\ncharges: 21984.47061', metadata={'source': 'insurance.csv', 'row': 3}),
 Document(page_content='age: 32\nsex: male\nbmi: 28.88\nchildren: 0\nsmoker: no\nregion: northwest\ncharges: 3866.8552', metadata={'source': 'insurance.csv', 'row': 4}),
 Document(page_content='age: 31\nsex: female\nbmi: 25.74\nchildren: 0\nsmoker: no\nregion: southeast\ncharges: 3756.6216', metadata={'source': 'insur
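
The output above shows one document per CSV row, with the columns rendered as `column: value` lines and the row number stored in the metadata. A rough standard-library sketch of that transformation might look as follows; `csv_rows_to_documents` is a hypothetical helper written for illustration (plain dicts instead of `Document` objects), not the loader's actual implementation.

```python
import csv
import io


def csv_rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, joined by newlines
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# First two rows of the insurance data shown above
sample_csv = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
row_documents = csv_rows_to_documents(sample_csv, "insurance.csv")
```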

### PyPDFLoader:
   - **Purpose**: Loads PDF (Portable Document Format) files. PDFs are widely used for documents that are intended to be viewed in the same format across different devices.
   - **Use Case**: Useful for loading documents like reports, articles, legal documents, and any other content that is stored as PDFs. It can extract text from PDFs for further processing or analysis.


In [11]:
# https://dlp.dubai.gov.ae/en/Pages/LegislationSearch.aspx
pdf_url = "https://raw.githubusercontent.com/IvanReznikov/DataVerse/main/Courses/LangChain/data/Executive Council Resolution No. (107) of 2023 Regulating the Tourist.pdf"
pdf_filename = "dubai_law.pdf"

In [12]:
pdf_document = load_document_from_url(pdf_url, pdf_filename, PyPDFLoader)

In [13]:
pdf_document

[Document(page_content=' \nExecutive Council Resolution No. (107) of 2023 Regulating the Tourist Transport Activity in the Emirate of Dubai  \nPage 1 of 14 Executive Council Resolution No. (107) of 2023  \nRegulating the  \nTourist Transport Activity in the Emirate of Dubai1 \nـــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــ ــــــــــــــــــــــــــــــــــــــــــــــــ  \nWe, Hamdan bin Mohammed bin Rashid Al Maktoum, Crown Prince of Dubai, Chairman of the \nExecutive Council,  \nAfter perusal of:  \nLaw No. (3) of 2003 Establishing the Executive Council of the Emirate of Dubai;  \nLaw No. (17) of 2005 Establishing the Roads and Transport Authority and its amendments;  \nLaw No. (13) of 2011 Regulating the Conduct of Economic Activities in the Emirate of Dubai and its \namendments;  \nLaw No. (1) of 2016 Concerning the Financial Regulations of the Government of Dubai, its Implementing \nByl

### WebBaseLoader:
   - **Purpose**: Loads content from web pages. This loader is designed to fetch and parse content directly from URLs, allowing the processing of web-based documents.
   - **Use Case**: Perfect for scraping and processing content from web pages, such as articles, blog posts, or any web content you wish to analyze or use as part of a larger dataset.



In [14]:
web_url = "https://www.linkedin.com/in/reznikovivan/"

In [15]:
web_document = WebBaseLoader(
    web_url
).load()  # WebBaseLoader usage is direct as it doesn't involve downloading

In [16]:
web_document

[Document(page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIvan Reznikov - QBurst | LinkedIn\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n      Skip to main content\n    \n\n\n\nLinkedIn\n\n\n\n\n\n\n\n\n\n\n        Articles\n      \n\n\n\n\n\n\n\n        People\n      \n\n\n\n\n\n\n\n        Learning\n      \n\n\n\n\n\n\n\n        Jobs\n      \n\n\n\n\n\n      Join now\n    \n\n          Sign in\n      \n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n                        Ivan Reznikov\n                      \n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n                Sign in to view Ivan’s full profile\n              \n \n\n\n\n            Sign in\n        \n\n\n\n \n\n\n\n\n\n\n\n                Welcome back\n            \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n          Email or phone\n        \n\n\n\n\n\n\n\n\n\n          Password\n        \n\n\nShow\n\n\n\n\n\n \n\nForgot password?\n\n\n\n          Sign in\n        \n\n\n\n            or\n          \n\n\n
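
The raw page text above is dominated by whitespace and navigation labels. As a rough sketch of one possible cleanup step, the example below extracts visible text from an HTML string using only the standard library's `html.parser`. `TextExtractor` and `html_to_text` are hypothetical names for this illustration, and this is not a description of how `WebBaseLoader` works internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse every run of whitespace (including newlines) into one space
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><head><title>Demo</title><style>p {}</style></head>"
    "<body><p>Hello   <b>world</b></p><script>var a = 1;</script></body></html>"
)
clean_text = html_to_text(sample_html)
```

Collapsing whitespace before splitting keeps chunks from being filled with blank lines, which matters once the text is split by size.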

### YoutubeAudioLoader:
   - **Purpose**: Downloads audio from YouTube videos. This loader is designed to fetch the audio track of a YouTube video for processing, such as transcription or analysis.
   - **Use Case**: Useful for extracting audio for transcription, sentiment analysis, content analysis, or any scenario where audio content from YouTube needs to be analyzed.


In [17]:
# Video on short Langchain tutorial on templates
youtube_url = "https://www.youtube.com/watch?v=aA6KZ4L_ono"
youtube_filename = "youtube"

In [18]:
loader = GenericLoader(
    YoutubeAudioLoader([youtube_url], youtube_filename), OpenAIWhisperParser()
)

In [19]:
# YouTube audio loading and parsing (with exception handling)
try:
    youtube_loader = GenericLoader(
        YoutubeAudioLoader([youtube_url], "youtube"), OpenAIWhisperParser()
    )
    youtube_document = youtube_loader.load()
except Exception as e:
    print(e)

[youtube] Extracting URL: https://www.youtube.com/watch?v=aA6KZ4L_ono
[youtube] aA6KZ4L_ono: Downloading webpage
[youtube] aA6KZ4L_ono: Downloading ios player API JSON
[youtube] aA6KZ4L_ono: Downloading android player API JSON
[youtube] aA6KZ4L_ono: Downloading m3u8 information
[info] aA6KZ4L_ono: Downloading 1 format(s): 140
[download] youtube/LangChain Templates.m4a has already been downloaded
[download] 100% of    4.91MiB
[ExtractAudio] Not converting audio youtube/LangChain Templates.m4a; file is already in target format m4a
Transcribing part 1!


## Text Splitters

In [20]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter,
)

In [21]:
zero_ten_string = "0123456789"
zero_ten_string_1 = "000_000.123_456.789"

### CharacterTextSplitter:
   - **Purpose**: Splits text based on a specified character or characters. This splitter divides a larger text into smaller chunks at every occurrence of the specified character(s).
   - **Use Case**: Useful for splitting text at specific delimiters, such as periods for sentences, new lines for paragraphs, or any custom character that signifies a logical division in text.

In [22]:
c_splitter = CharacterTextSplitter(
    separator="_",
    chunk_size=1,
    chunk_overlap = 0
)

In [23]:
c_splitter.split_text(zero_ten_string)

['0123456789']

No "_" character - no split

In [24]:
c_splitter.split_text(zero_ten_string_1)



['000', '000.123', '456.789']

Let's see what happens if we set `keep_separator=True` (done here by flipping the splitter's private `_keep_separator` attribute):

In [25]:
c_splitter._keep_separator = True

In [26]:
c_splitter.split_text(zero_ten_string_1)



['000', '_000.123', '_456.789']
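
The effect of `keep_separator` can be reproduced with a few lines of plain Python. This is a simplified sketch, not the library's code: `split_on_separator` is a hypothetical helper that ignores `chunk_size` and merging and only shows where the separator ends up.

```python
import re


def split_on_separator(text, separator, keep_separator=False):
    """Split text on a literal separator, optionally keeping it at the start of each piece."""
    if not keep_separator:
        return [piece for piece in text.split(separator) if piece]
    # A capturing group makes re.split return the separators as well:
    # [text, sep, text, sep, text, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts) - 1, 2):
        pieces.append(parts[i] + parts[i + 1])  # glue separator onto the following text
    return [piece for piece in pieces if piece]


without_separator = split_on_separator("000_000.123_456.789", "_")
with_separator = split_on_separator("000_000.123_456.789", "_", keep_separator=True)
```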

### RecursiveCharacterTextSplitter:
   - **Purpose**: Splits text into chunks of a specified size, optionally using specific characters as additional splitting criteria. It can also overlap chunks to ensure continuity of context.
   - **Use Case**: Ideal for processing large texts by breaking them into smaller, manageable pieces with or without considering specific delimiters. The overlapping feature is useful for creating datasets for machine learning models where context continuity is important.

In [27]:
chunk_size = 5
chunk_overlap = 2

`RecursiveCharacterTextSplitter` can split on separators (one or several), on chunk size, or on both. When both are given, the separators are tried first.

In [28]:
r_splitter = RecursiveCharacterTextSplitter(
    separators=["."], chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

In [29]:
r_splitter.split_text(zero_ten_string_1)

['000_000', '.123_456', '.789']

In [30]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

In [31]:
r_splitter._separators = [""]

In [32]:
r_splitter.split_text(zero_ten_string)

['01234', '34567', '6789']

In [33]:
r_splitter.split_text(zero_ten_string_1)

['000_0', '_000.', '0.123', '23_45', '456.7', '.789']
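
With the empty string as the only separator, the splitter falls back to purely size-based chunks. For these two short strings the result matches a simple sliding window, sketched below. `sliding_window_chunks` is a hypothetical helper; the library's merging logic is more involved, so treat this only as an illustration of how `chunk_size` and `chunk_overlap` interact.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


digits_chunks = sliding_window_chunks("0123456789", chunk_size=5, chunk_overlap=2)
mixed_chunks = sliding_window_chunks("000_000.123_456.789", chunk_size=5, chunk_overlap=2)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so with a size of 5 and an overlap of 2 consecutive chunks share their last and first two characters.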

Let's take a look at `RecursiveCharacterTextSplitter` for real documents:

In [34]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n"], chunk_size=250, chunk_overlap=0, keep_separator=False
)
chunks = text_splitter.split_text(text_document[0].page_content)

In [35]:
[len(c) for c in chunks]

[332, 419, 279, 341, 335, 325, 318, 355]

In [36]:
chunks[0:2]

["Freddyland: Small and swift, Freddyland orbits the Sun in just 88 days. Its days are long - longer than its years, lasting 59 Blueberry days. Temperatures can soar up to 800°F, making it the hottest planet. No atmosphere to speak of. It's a rocky world, covered in craters. Barely any tilt means no seasons. It's closest to the Sun.",
 "Foamborn: Veiled in thick clouds, Foamborn's surface is hidden. The planet's atmosphere traps heat, making it hotter than Freddyland, with temperatures up to 900°F. Acidic rains carve its landscape. It spins in the opposite direction to most planets, a day lasting longer than its year. High pressure crushes anything that lands. It's the second planet from the Sun. Its thick clouds reflect sunlight, making it bright."]

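As you can see, all chunks are larger than the configured `chunk_size=250`. With `"\n"` as the only separator, the splitter has nothing finer to split on, so each line of the file stays whole even when it exceeds the limit. *What splitting strategy do you think is the most effective for this example? What are the optimal `chunk_size` and `chunk_overlap` parameters?*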
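
To see why the choice of separators matters, here is a simplified sketch of recursive splitting. `recursive_split` is a hypothetical function (no overlap, non-empty separators only) rather than the library's implementation: it splits on the first separator, greedily merges pieces up to `chunk_size`, and recurses with the remaining separators only for pieces that are still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Split on separators in priority order; oversized pieces fall through to the next one."""
    if len(text) <= chunk_size or not separators:
        # Either it fits, or there is nothing left to split on (piece stays oversized)
        return [text] if text else []
    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\ndd"
newline_only = recursive_split(sample, ["\n"], chunk_size=9)
newline_then_space = recursive_split(sample, ["\n", " "], chunk_size=9)
```

With only `"\n"` available, the first line stays at 14 characters despite the limit of 9. Adding `" "` as a second separator lets the function break that line into pieces that fit.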

Let's take a look at chunks created from pdf and webpage:

In [37]:
chunks = text_splitter.split_text(pdf_document[0].page_content)

In [38]:
[len(c) for c in chunks]

[245, 190, 207, 201, 207, 184, 208, 247, 148, 125, 143]

In [39]:
chunks[0:2]

['Executive Council Resolution No. (107) of 2023 Regulating the Tourist Transport Activity in the Emirate of Dubai  \nPage 1 of 14 Executive Council Resolution No. (107) of 2023  \nRegulating the  \nTourist Transport Activity in the Emirate of Dubai1',
 'ـــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــ ــــــــــــــــــــــــــــــــــــــــــــــــ']

In [40]:
chunks = text_splitter.split_text(web_document[0].page_content)

In [41]:
[len(c) for c in chunks][:10]

[204, 205, 197, 221, 232, 197, 221, 223, 231, 221]

In [42]:
chunks[0:2]

['Ivan Reznikov - QBurst | LinkedIn\n \n      Skip to main content\n    \nLinkedIn\n        Articles\n      \n        People\n      \n        Learning\n      \n        Jobs\n      \n      Join now\n    \n          Sign in',
 'Ivan Reznikov\n                      \n \n                Sign in to view Ivan’s full profile\n              \n \n            Sign in\n        \n \n                Welcome back\n            \n          Email or phone']

### TokenTextSplitter:
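   - **Purpose**: Splits text into chunks of a specified number of tokens, as counted by a tokenizer (here `tiktoken`, so tokens are subword units rather than whole words), with optional overlap between chunks.
   - **Use Case**: Useful when chunk sizes must be measured in the same units as a model's limits, for example to keep each chunk within a context window or an embedding model's token limit.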

In [43]:
token_splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=10)

In [44]:
token_splitter.split_documents(youtube_document)

[Document(page_content="Hello, my name is Harrison, co-founder of LangChain, and I want to walk through a new thing that we're launching today, LangChain templates. So LangChain templates are intended to be the easiest and fastest way to build a production", metadata={'source': 'youtube/LangChain Templates.m4a', 'chunk': 0}),
 Document(page_content=" be the easiest and fastest way to build a production-ready LLM application. So they serve as a set of reference architectures for a wide variety of LLM cases. And they're all in a standard format that makes it really easy to deploy them", metadata={'source': 'youtube/LangChain Templates.m4a', 'chunk': 0}),
 Document(page_content=" standard format that makes it really easy to deploy them with LangServe, and we'll see what that gets us. So I'm going to walk through how exactly to do that, and then touch on some of the features and benefits of LangChain", metadata={'source': 'youtube/LangChain Templates.m4a', 'chunk': 0}),
 Document(page_cont

### MarkdownHeaderTextSplitter:
   - **Purpose**: Splits Markdown documents based on header tags. This splitter can segment documents into sections according to their header levels.
   - **Use Case**: Ideal for structuring and analyzing Markdown documents by their logical sections, such as chapters in a book, sections in an article, or any Markdown-based content that uses headers to organize information.

In [45]:
markdown_document = """
# Star Wars Movies Hierarchy

## The Skywalker Saga

### Original Trilogy
- **Episode IV: A New Hope (1977)**
- **Episode V: The Empire Strikes Back (1980)**
- **Episode VI: Return of the Jedi (1983)**

### Prequel Trilogy
- **Episode I: The Phantom Menace (1999)**
- **Episode II: Attack of the Clones (2002)**
- **Episode III: Revenge of the Sith (2005)**

### Sequel Trilogy
- **Episode VII: The Force Awakens (2015)**
- **Episode VIII: The Last Jedi (2017)**
- **Episode IX: The Rise of Skywalker (2019)**

## Standalone Movies and Spin-offs

## A Star Wars Story
- **Rogue One: A Star Wars Story (2016)**
- **Solo: A Star Wars Story (2018)**"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [46]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [47]:
md_header_splits

[Document(page_content='- **Episode IV: A New Hope (1977)**\n- **Episode V: The Empire Strikes Back (1980)**\n- **Episode VI: Return of the Jedi (1983)**', metadata={'Header 1': 'Star Wars Movies Hierarchy', 'Header 2': 'The Skywalker Saga', 'Header 3': 'Original Trilogy'}),
 Document(page_content='- **Episode I: The Phantom Menace (1999)**\n- **Episode II: Attack of the Clones (2002)**\n- **Episode III: Revenge of the Sith (2005)**', metadata={'Header 1': 'Star Wars Movies Hierarchy', 'Header 2': 'The Skywalker Saga', 'Header 3': 'Prequel Trilogy'}),
 Document(page_content='- **Episode VII: The Force Awakens (2015)**\n- **Episode VIII: The Last Jedi (2017)**\n- **Episode IX: The Rise of Skywalker (2019)**', metadata={'Header 1': 'Star Wars Movies Hierarchy', 'Header 2': 'The Skywalker Saga', 'Header 3': 'Sequel Trilogy'}),
 Document(page_content='- **Rogue One: A Star Wars Story (2016)**\n- **Solo: A Star Wars Story (2018)**', metadata={'Header 1': 'Star Wars Movies Hierarchy', 'Heade
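
The header-based behaviour above can be approximated in plain Python. The sketch below is a simplified illustration, not the library's implementation: `split_markdown_by_headers` is a hypothetical function that returns dicts, tracks the currently active header at each level, and resets deeper levels whenever a higher-level header appears.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group body lines under their headers and record the headers as metadata."""
    # Check longer markers first so "###" is not mistaken for "#"
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections = []
    active = {}  # header level -> (metadata key, header title)
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            sections.append({"page_content": content, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        match = next(((m, k) for m, k in markers if stripped.startswith(m + " ")), None)
        if match is None:
            body.append(line)
            continue
        flush()  # a new header closes the previous section
        marker, key = match
        level = len(marker)
        for deeper in [lvl for lvl in active if lvl >= level]:
            del active[deeper]  # headers at the same or deeper level no longer apply
        active[level] = (key, stripped[len(marker):].strip())
    flush()
    return sections


toy_markdown = "# Title\n\n## A\n\n### A1\n- x\n- y\n\n### A2\n- z\n\n## B\n- w"
toy_sections = split_markdown_by_headers(
    toy_markdown, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
```

Storing the headers in metadata rather than in the text keeps the chunk content short while preserving where in the document it came from.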

## Simple RAG

Retrieval-Augmented Generation (RAG) combines the power of retrieval-based and generative approaches to produce more informed and contextually relevant natural language outputs.

A RAG pipeline consists of several steps:
1. Load documents
2. Split documents
3. Embed documents
4. Store embeddings
5. Retrieve relevant chunks
6. Add custom prompt
7. Pass to LLM
8. Parse the output
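
Steps 3 to 5 can be sketched without any external service. The toy example below replaces real embeddings with word counts and the vector store with a plain list, so `embed`, `cosine`, and `retrieve` are illustrative assumptions rather than what `OpenAIEmbeddings` or Chroma do. The chunks are sentences from the planets file loaded earlier.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, store, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Freddyland: Small and swift, Freddyland orbits the Sun in just 88 days.",
    "Foamborn: Veiled in thick clouds, Foamborn's surface is hidden.",
    "Blueberry: Home to millions of species, including humans. Water covers 70% of its surface.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # step 4: store the vectors
top_chunks = retrieve("What planet has water?", store)  # step 5: retrieve
```

Real embeddings capture meaning rather than exact word overlap, but the retrieval step is the same idea: rank stored chunks by similarity to the question and pass the best ones to the LLM.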

In [48]:
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import PromptTemplate

In [49]:
def initialize_language_model(
    model_name: str = "gpt-3.5-turbo", temperature: float = 0
) -> ChatOpenAI:
    """Initialize a ChatGPT model with specified parameters.

    Args:
        model_name: The name of the model to initialize.
        temperature: The temperature setting for the model's responses.

    Returns:
        An instance of the ChatOpenAI class initialized with the specified model.
    """
    return ChatOpenAI(model_name=model_name, temperature=temperature)


llm = initialize_language_model()

In [50]:
# Function to format documents for readability
def format_documents(docs) -> str:
    """Concatenate page contents of multiple documents into a single string.

    Args:
        docs: A list of documents.

    Returns:
        A string containing the concatenated page contents of the documents.
    """
    return "\n\n".join(doc.page_content for doc in docs)

In [51]:
# Example of document processing and querying
def create_rag_pipeline(
    documents, collection_name="", prompt_addon="", separators=None, chunk_size: int = 250, chunk_overlap: int = 0
):
    """Split documents into chunks, store them in a vector store, and build a RAG chain.

    Args:
        documents: A list of documents to process.
        collection_name: Name of the Chroma collection.
        prompt_addon: An additional phrase to add to the prompt.
        separators: Separators used for splitting; None uses the splitter's defaults.
        chunk_size: The target maximum size of each text chunk after splitting.
        chunk_overlap: The overlap between consecutive text chunks.

    Returns:
        A runnable RAG chain; call `.invoke(question)` on it to get the answer.
    """
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        separators=separators, chunk_size=chunk_size, chunk_overlap=chunk_overlap, keep_separator=False
    )
    splits = text_splitter.split_documents(documents)
    print(f"Number of documents added: {len(splits)}")

    vector_store = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings(), collection_name=collection_name)

    print(f"Number of documents in {collection_name} vectorstore: {vector_store._collection.count()}")
    retriever = vector_store.as_retriever()

    # Set up a custom prompt for querying
    template_text = (
        """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Use three sentences maximum and keep the answer as concise as possible.
    """
        + prompt_addon
        + """

    {context}

    Question: {question}

    Helpful Answer:"""
    )
    custom_prompt = PromptTemplate.from_template(template_text)

    # Query using the language model and custom prompt
    rag_chain = (
        {"context": retriever | format_documents, "question": RunnablePassthrough()}
        | custom_prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain
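
The data flow of the chain above can be mimicked with ordinary function calls: the question goes both to the retriever (whose results are formatted into the context) and straight into the prompt, and the filled prompt goes to the model. The sketch below is a plain-Python illustration under that reading; `run_rag_chain`, `fake_retriever`, and `echo_llm` are hypothetical stubs, not LangChain components.

```python
def format_documents(docs):
    """Join the page contents of the retrieved documents."""
    return "\n\n".join(doc["page_content"] for doc in docs)


def run_rag_chain(question, retriever, prompt_template, llm):
    # The same question feeds both branches of the input mapping
    inputs = {
        "context": format_documents(retriever(question)),
        "question": question,
    }
    prompt = prompt_template.format(**inputs)
    return str(llm(prompt))  # final step: parse the output into a string


def fake_retriever(question):
    # Stub: always returns one chunk from the planets file
    return [{"page_content": "Blueberry: Home to millions of species, including humans. Water covers 70% of its surface."}]


def echo_llm(prompt):
    # Stub: returns the prompt so we can inspect what the model would receive
    return prompt


template = "{context}\n\nQuestion: {question}\n\nHelpful Answer:"
seen_by_model = run_rag_chain("What planet has water?", fake_retriever, template, echo_llm)
```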

Let's see what results we get with a small `chunk_size`:

In [52]:
rag_pipeline_short = create_rag_pipeline(
    text_document, collection_name = "small_chunk", prompt_addon = "Finish sentence: Chunk is to short", chunk_size=40, chunk_overlap=10
)

Number of documents added: 92
Number of documents in small_chunk vectorstore: 92


In [53]:
#Ipynb
rag_pipeline_short.invoke("What is the largest planet?")

'The largest planet is Jupiter.'

In [54]:
#Twix
rag_pipeline_short.invoke("What planet were explored by robots?")

'Mars.'

In [55]:
#Blueberry
rag_pipeline_short.invoke("What planet has water?")

'Saturn'

In [56]:
#Foamborn and Twix
rag_pipeline_short.invoke("What are the closest planets to Blueberry?")

"I don't know."

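Remember that the document uses fake planet names (Blueberry, Freddyland, etc.), so answers such as "Jupiter" or "Mars" come from the model's prior knowledge rather than from the retrieved text.

*How would you evaluate the results of the small-chunk RAG pipeline above?*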

Now let's increase `chunk_size` and `chunk_overlap` :

In [57]:
rag_pipeline_long = create_rag_pipeline(
    text_document, collection_name = "large_chunk", prompt_addon = "Finish sentence: Chunk is larger, thank you!", chunk_size=300, chunk_overlap=50
)

Number of documents added: 15
Number of documents in large_chunk vectorstore: 15


In [58]:
#Ipynb
rag_pipeline_long.invoke("What is the largest planet?")

'Sauron is the largest planet.'

In [59]:
#Twix
rag_pipeline_long.invoke("What planet were explored by robots?")

'Mars.'

In [60]:
#Blueberry
rag_pipeline_long.invoke("What planet has water?")

'Blueberry'

In [61]:
#Foamborn and Twix
rag_pipeline_long.invoke("What are the closest planets to Blueberry?")

'Freddyland and Neverborn are the closest planets to Blueberry.'

The results look more promising: some answers now use the renamed planets, although not all of them are correct. What if we use `separators` instead?

In [62]:
rag_pipeline_separator = create_rag_pipeline(
    text_document, collection_name = "separator", prompt_addon = "Finish sentence: Added separators, finally!", separators = ["\n"]
)

Number of documents added: 8
Number of documents in separator vectorstore: 8


In [63]:
#Ipynb
rag_pipeline_separator.invoke("What is the largest planet?")

'The largest planet is Sauron.'

In [64]:
#Twix
rag_pipeline_separator.invoke("What planet were explored by robots?")

'Twix.'

In [65]:
#Blueberry
rag_pipeline_separator.invoke("What planet has water?")

'Blueberry'

In [66]:
#Foamborn and Twix
rag_pipeline_separator.invoke("What are the closest planets to Blueberry?")

'Freddyland and Ipynb are the closest planets to Blueberry.'

As you can see, not every answer was both
1. drawn from the source document, and
2. correct.

Setting up a RAG pipeline requires well-chosen splitting parameters. We'll cover that later in the series.

Now let's try using a retriever for tabular data.

In [67]:
import pandas as pd
df = pd.read_csv(csv_filename)

In [68]:
df.head()

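|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |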


In [69]:
rag_pipeline_csv = create_rag_pipeline(
    csv_document, collection_name = "csv", prompt_addon = "Finish sentence: CSV document, interesting!", separators = ["\n"]
)

Number of documents added: 1338
Number of documents in csv vectorstore: 1338


In [70]:
rag_pipeline_csv.invoke("What is the average bmi?")

'The average BMI is 33.21.'

In [71]:
df['bmi'].mean()

30.66339686098655

In [72]:
rag_pipeline_csv.invoke("What is the percentage of smokers in the data?")

'The percentage of smokers in the data is 0%.'

In [73]:
df['smoker'].value_counts(normalize=True)

no     0.795217
yes    0.204783
Name: smoker, dtype: float64

As you can see, our RAG pipeline isn't very good at answering questions about numeric data.

*Do you have ideas why?*
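
One possible reason, offered as an assumption rather than a diagnosis: the retriever passes only a handful of rows to the LLM, so the model never sees the whole column. The sketch below uses the five rows from `df.head()` and pretends that only two of them were retrieved (an arbitrary choice for illustration) to show how an aggregate over a subset drifts from the true value.

```python
import csv
import io
from statistics import mean

# The five rows shown by df.head() above
head_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""

rows = list(csv.DictReader(io.StringIO(head_csv)))
retrieved_rows = rows[1:3]  # assumption: the retriever returned only these two rows

mean_all = mean(float(row["bmi"]) for row in rows)
mean_retrieved = mean(float(row["bmi"]) for row in retrieved_rows)
print(f"mean over all rows: {mean_all:.3f}, mean over retrieved rows: {mean_retrieved:.3f}")
```

Aggregations such as averages and percentages need every row, which is why computing them directly with pandas gives the reliable answer here.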

We'll finish by taking a real document from the Dubai Legislation portal:  https://dlp.dubai.gov.ae/en/Pages/LegislationSearch.aspx

We'll include sourcing - adding reasoning and sources that were used to answer the question.

In [74]:
from langchain_core.runnables import RunnableParallel

In [75]:
def create_rag_pipeline_with_sourcing(pdf_documents):
    """Build a RAG chain over PDF documents that returns the answer together with its sources.

    Args:
        pdf_documents: A list of PDF document objects (one per page).

    Returns:
        A runnable chain whose output is a dict with the retrieved `context` documents,
        the `question`, and the generated `answer`.
    """
    # Documents are added as loaded, without further splitting
    print(f"Number of documents added: {len(pdf_documents)}")

    # Create a vector store from the PDF documents
    vector_store = Chroma.from_documents(documents=pdf_documents, embedding=OpenAIEmbeddings(), collection_name="pdf")
    print(f"Number of documents in pdf vectorstore: {vector_store._collection.count()}")
    retriever = vector_store.as_retriever()

    # Custom prompt for PDF documents querying
    template_text = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}

    Helpful Answer:"""
    custom_prompt = PromptTemplate.from_template(template_text)

    # Setup parallel processing chain for querying
    rag_chain_with_source = RunnableParallel(
        {"context": retriever, "question": RunnablePassthrough()}
    ).assign(
        answer=(
            RunnablePassthrough.assign(
                context=(lambda x: format_documents(x["context"]))
            )
            | custom_prompt
            | llm
            | StrOutputParser()
        )
    )

    # Return the chain; the query is executed when `.invoke()` is called
    return rag_chain_with_source
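
The difference from the earlier pipeline is that the retrieved documents are kept in the result instead of being consumed by the prompt. A plain-Python sketch of that output shape is shown below; `run_rag_chain_with_sources`, `fake_retriever`, and `fake_llm` are hypothetical stubs, and the `source` metadata value is illustrative.

```python
def run_rag_chain_with_sources(question, retriever, prompt_template, llm):
    """Return the answer together with the documents it was based on."""
    docs = retriever(question)
    context_text = "\n\n".join(doc["page_content"] for doc in docs)
    answer = llm(prompt_template.format(context=context_text, question=question))
    # Keep the raw documents so the caller can inspect or cite the sources
    return {"context": docs, "question": question, "answer": answer}


def fake_retriever(question):
    # Stub: one line from the fines schedule, with illustrative metadata
    return [
        {
            "page_content": "Conducting the Activity without a Permit  10,000.00",
            "metadata": {"source": "dubai_law.pdf"},
        }
    ]


def fake_llm(prompt):
    # Stub: a fixed reply standing in for a real model call
    return "stub answer"


template = "{context}\n\nQuestion: {question}\n\nHelpful Answer:"
result = run_rag_chain_with_sources(
    "What is the fine for conducting Tourist Transport Activity without a Permit?",
    fake_retriever,
    template,
    fake_llm,
)
sources = [doc["metadata"]["source"] for doc in result["context"]]
```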

In [76]:
# Example usage of the function to query PDF documents
rag_pipeline_with_sourcing = create_rag_pipeline_with_sourcing(pdf_document)

Number of documents added: 14
Number of documents in pdf vectorstore: 14


In [77]:
"""
For the purposes of this Resolution, the DET will have the duties and powers to:
1. qualify  the  tour  guides  working  in  Establishments  and  issue  them  with  identification  cards,  in
accordance with the relevant rules adopted by the DET;
2. determine the Emirate's tourist destinations to which tourists will be transported; and provide the
RTA and Establishments with lists of these destinations; and
3. exercise any other duties or powers that fall within the functions of the DET under the legislation in
force as required for the achievement of the objectives of this Resolution.
"""

rag_pipeline_with_sourcing.invoke("What are the functions of the DET?")

{'context': [Document(page_content=" \nExecutive Council Resolution No. (107) of 2023 Regulating the Tourist Transport Activity in the Emirate of Dubai  \nPage 5 of 14 13. create a database of Establishments, Tourist Vehicles, and Tourist Vehicle drivers; and  \n14. exercise any other duties or powers that fall within the functions of the RTA under the legislation in \nforce as required for the achie vement of the objectives of this Resolution.  \nFunctions of the DET  \nArticle (6)  \nFor the purposes of this Resolution, the DET will have the duties and powers to:  \n1. qualify the tour guides working in Establishments and issue them with identification cards, in \naccordance with the relevant rules adopted by the DET;  \n2. determine the Emirate's tourist destinations to which tourists will be transported; and provide the \nRTA and Establishments with lists of these destinations; and  \n3. exercise any other duties or powers that fall within the functions of the DET under the legisla

In [78]:
# 10,000.00
rag_pipeline_with_sourcing.invoke(
    "What is the fine for conducting Tourist Transport Activity without a Permit?"
)

{'context': [Document(page_content=' \nExecutive Council Resolution No. (107) of 2023 Regulating the Tourist Transport Activity in the Emirate of Dubai  \nPage 13 of 14 Schedule (2)  \nViolations and Fines Related to Conducting the Activity  \n \nSN Violation  Fine (in dirhams)  \n1 Conducting the Activity without a Permit  10,000.00  \n2 Conducting the Activity after expiry of the Permit  2,000.00  \n3 Failure to provide the RTA with the documents, information, or statistics \nrelated to conducting the Activity as it deems necessary to revie w 500.00  \n4 Failure to maintain records containing the details of the tourist trips  1,000.00  \n5 Hindering or obstructing the work of the competent RTA employees or \nfailure to cooperate with them  500.00  \n6 Using Tourist Vehicles for purposes other than conducting the Activity  5,000.00  \n7 Failure to comply with the safety standards, specifications, and \nrequirements applicable to Tourist Vehicles  500.00  \n8 Failure to display the nam

In [79]:
# 2,000
rag_pipeline_with_sourcing.invoke(
    "What is the fine for conducting Tourist Transport Activity after expiry of the Permit?"
)

{'context': [Document(page_content=' \nExecutive Council Resolution No. (107) of 2023 Regulating the Tourist Transport Activity in the Emirate of Dubai  \nPage 13 of 14 Schedule (2)  \nViolations and Fines Related to Conducting the Activity  \n \nSN Violation  Fine (in dirhams)  \n1 Conducting the Activity without a Permit  10,000.00  \n2 Conducting the Activity after expiry of the Permit  2,000.00  \n3 Failure to provide the RTA with the documents, information, or statistics \nrelated to conducting the Activity as it deems necessary to revie w 500.00  \n4 Failure to maintain records containing the details of the tourist trips  1,000.00  \n5 Hindering or obstructing the work of the competent RTA employees or \nfailure to cooperate with them  500.00  \n6 Using Tourist Vehicles for purposes other than conducting the Activity  5,000.00  \n7 Failure to comply with the safety standards, specifications, and \nrequirements applicable to Tourist Vehicles  500.00  \n8 Failure to display the nam

---

Pretty impressive! All three questions were answered correctly!

## Next
This is all for this part. In the next colab notebook, we'll discuss Embeddings and Vectorstores.

## Practice:

For those of you who want to practice building RAG pipelines, here are some tasks:

1. Are there better ways to load web pages?
2. Look into semantic ways of splitting documents.
3. Build a RAG pipeline to query a YouTube video.