# Build a RAG Application with LangChain, Part 2

In [Lab 2](./2_rag.ipynb) you built a RAG application that searched for relevant parts of a YouTube video transcript file that had been split into 5-minute intervals. That means the context passed to the LLM consisted of blocks of text, each covering 5 minutes of a video's transcript.

This lab starts by setting up what was explained in Lab 2 and then focuses on working with a different transcript file, one that contains the whole transcript for each video. You will face the same challenge when you start loading files (e.g. PDF, Docx) into your RAG application: the content will be larger than the context window allowed by the LLM.

Learning Objectives

* Learn how to chunk the transcript into smaller sizes
* Learn how chunk size affects the quality of retrieval results in a RAG application


### Step 1: Set up what we learned in Lab 2

Run the following to get ready for this lesson:

In [None]:
import os
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
import pandas as pd

load_dotenv()

llm = AzureChatOpenAI(
  openai_api_version="2023-05-15",
  azure_deployment=os.getenv("AZURE_OPENAI_MODEL_DEPLOYMENT_NAME")
)

embeddings = AzureOpenAIEmbeddings()

parser = StrOutputParser()

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", """You are a helpful assistant that is very brief but polite in your answers. Answer questions in less than 50 words.
            Answer the question based on the context below. If you can't 
            answer the question, reply "I don't know".

            Context: {context}
         """),
        ("human", "{question}")
    ],
)

### Step 2: Load the transcript file

As mentioned above, we are using a different transcript file. We still load this one the same way we did in Lab 2:

In [None]:
DATASET_NAME = "./prep/output/master.json"

transcripts_dataset = pd.read_json(DATASET_NAME)

If you want to see the contents you can run this:

In [None]:
transcripts_dataset

In [None]:
# load the dataset and specify to use the transcript text column for the page content
loader = DataFrameLoader(transcripts_dataset, page_content_column="text")
transcripts = loader.load()

If you want to see the document listing, run this:

In [None]:
transcripts

### Step 3: Chunk the transcripts into smaller pieces

#### Document Chunking

The process of taking a document and splitting it into pieces is often referred to as "chunking". There are many ways to split a document, but you need to keep in mind what each chunk means for your RAG system.

Important things to remember about these chunks:

* We will get embeddings for each chunk
* Relevant chunks will be found by a similarity search using embeddings
* An overlap of 10-20% between consecutive chunks is often used
* We are using the text-embedding-ada-002 embedding model, which means each chunk will have a vector of 1,536 floating-point numbers attached to it
* When working with real documents, you may need to address tables and images (images typically have different embedding models)
* Each chunk needs to fit in the context window of the LLM, and keep in mind things can get [lost in the middle](https://arxiv.org/abs/2307.03172) when the context is too big
* You may need to modify your chunking to improve the retrieval quality of your system
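
To make the "similarity search using embeddings" point concrete, here is a minimal sketch of ranking chunks against a query vector. It assumes cosine similarity as the metric, which is a common choice but not necessarily what your vector store does internally. The function names `cosine_similarity` and `top_k` and the 3-dimensional vectors are made up for illustration; real text-embedding-ada-002 vectors have 1,536 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (hypothetical values, one vector per chunk)
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

The two chunks whose vectors point in nearly the same direction as the query come back first. A vector store does the same kind of ranking, just over many more chunks and dimensions.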

Let's start with a simple RecursiveCharacterTextSplitter (it seems to be one of the more popular choices):

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# use a small chunk size so the overlap is easy to see
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)
documents_256 = text_splitter.split_documents(transcripts)

Now see how many there are:

In [None]:
len(documents_256)

Run the following and look at the `page_content` attribute. Notice how the text at the end of one document is repeated at the beginning of the next document's `page_content`. This is the overlap.

In [None]:
documents_256
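
If reading through the listing by eye is tedious, a small helper can find the shared text for you. This is only a sketch: `shared_overlap` is a hypothetical helper, not part of LangChain, and the two strings below are made-up stand-ins for consecutive `page_content` values.

```python
def shared_overlap(prev_text, next_text):
    """Return the longest suffix of prev_text that is also a prefix of next_text."""
    max_len = min(len(prev_text), len(next_text))
    # Try the longest candidate first and shrink until one matches
    for size in range(max_len, 0, -1):
        if prev_text.endswith(next_text[:size]):
            return next_text[:size]
    return ""

# Hypothetical chunk texts standing in for two consecutive chunks
chunk_a = "LangChain is a framework for building apps"
chunk_b = "for building apps powered by language models"

print(repr(shared_overlap(chunk_a, chunk_b)))
```

In the notebook you could try `shared_overlap(documents_256[0].page_content, documents_256[1].page_content)`. Expect an empty string where two neighboring chunks come from different videos, and note that the shared text may be shorter than `chunk_overlap` because the splitter prefers to break on separators.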

Now let's use a larger chunk size of 512:

In [None]:
text_splitter2 = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)
documents_512 = text_splitter2.split_documents(transcripts)

Now see how many documents there are:

In [None]:
len(documents_512)
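
To build intuition for why the larger chunk size produces fewer documents, here is a naive fixed-window splitter. It is deliberately not how RecursiveCharacterTextSplitter works (that splitter prefers to break on separators such as paragraphs and spaces), so the counts will not match the notebook; `fixed_window_chunks` and the placeholder text are assumptions for illustration only. Both settings used above keep the overlap at roughly 20% of the chunk size (50/256 and 100/512).

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into chunk_size-character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "x" * 10_000  # placeholder text, not a real transcript

count_256 = len(fixed_window_chunks(sample, 256, 50))
count_512 = len(fixed_window_chunks(sample, 512, 100))
print(count_256, count_512)
```

Doubling both the chunk size and the overlap roughly halves the number of chunks, which also means roughly half as many embedding calls.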

Now let's load both into in-memory vector stores to see if there is any difference in retrieval quality with them.

> NOTE:
>
> This took more than 1 minute to run on my home machine. You may want to use only a portion of `documents_256`, such as `documents_256[:1000]`.

In [None]:
vectorstore_256 = DocArrayInMemorySearch.from_documents(documents_256, embeddings)
vectorstore_512 = DocArrayInMemorySearch.from_documents(documents_512, embeddings)

retriever_256 = vectorstore_256.as_retriever()
retriever_512 = vectorstore_512.as_retriever()

Now using the same text query, get the top 4 most relevant documents:

In [None]:
unique_docs_256 = retriever_256.get_relevant_documents(query="What is langchain?")
unique_docs_512 = retriever_512.get_relevant_documents(query="What is langchain?")

Take a look at the 256 chunk documents:

In [None]:
unique_docs_256

Take a look at the 512 chunk documents:

In [None]:
unique_docs_512

Notice they are **not the same document listing**. When I run it, only 1 document appears in both.
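
One way to compare the two result lists without reading them side by side is to diff their text. The sketch below assumes exact text equality is a good enough test for "the same document"; `compare_texts` is a hypothetical helper and the strings are made-up examples. Since the two splitters produce different chunk boundaries, you may prefer to compare a metadata field instead.

```python
def compare_texts(texts_a, texts_b):
    """Group chunk texts by whether they appear in both lists or only in one."""
    set_a, set_b = set(texts_a), set(texts_b)
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Hypothetical retrieval results for the two chunk sizes
results_256 = ["intro to langchain", "chains and agents", "prompt templates"]
results_512 = ["intro to langchain", "vector stores explained"]

summary = compare_texts(results_256, results_512)
print(summary["both"])
```

In the notebook, the inputs could be built with `[d.page_content for d in unique_docs_256]` and `[d.page_content for d in unique_docs_512]`.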

So now we know that chunk size affects the similarity search.

Next, let's see which one produces a better response from the LLM.

First try the 256 chunks:

In [None]:
chain_256 = (
    {"context": retriever_256, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | parser
)

chain_256.invoke("What is LangChain?")

Now try the 512 chunks:

In [None]:
chain_512 = (
    {"context": retriever_512, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | parser
)

chain_512.invoke("What is LangChain?")

Your results may vary, but the 512 chunks give me a much better response.

I'll leave it to you to try other sizes. The next size I would try is 1024.

You may also want to try one of the other [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) like the [SemanticChunker](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/) or [CharacterTextSplitter using tokens](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token/) to see how they affect the results of retrieval.

### Reference

TODO: Add some useful links here...