# Retrieval Augmented Generation with LangChain

In this notebook, we'll build a simple, naive RAG pipeline with LangChain. We use OpenAI for embeddings and LLM responses, and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) as the vector store.

In [None]:
from operator import itemgetter

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings

import os

from dotenv import load_dotenv

# Load environment variables such as OPENAI_API_KEY from a local .env file, if present
load_dotenv()

## Naive RAG

The cells below show a minimal version of RAG without a source document. We index a single sentence and have the LLM answer questions using only that sentence as context.

In [None]:
vectorstore = FAISS.from_texts(
    ["jason ran to panda express"],
    embedding=OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY")),
)

retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

model = ChatOpenAI(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-3.5-turbo")

In [None]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke("where did jason run to?")

In [None]:
chain.invoke("who is jason?")
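
To make the retrieve-then-prompt flow above concrete, here is a minimal sketch in plain Python. It is not how LangChain or FAISS work internally: the word-count `embed` function is a toy stand-in for `OpenAIEmbeddings`, and `retrieve` and `build_prompt` are hypothetical helper names introduced only for illustration. The second sentence in `texts` is made up so that there is something to rank against.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, texts, k=1):
    """Return the k texts most similar to the question."""
    query = embed(question)
    ranked = sorted(texts, key=lambda t: cosine(query, embed(t)), reverse=True)
    return ranked[:k]

TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""

def build_prompt(question, texts, k=1):
    """Fill the prompt template with the retrieved context and the question."""
    context = "\n\n".join(retrieve(question, texts, k))
    return TEMPLATE.format(context=context, question=question)

texts = ["jason ran to panda express", "the library closes at nine"]
print(build_prompt("where did jason run to?", texts))
```

The printed prompt is what the LLM would receive. Because the prompt says to answer only from the context, a question such as "who is jason?" can only be answered with what the retrieved sentence contains.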

In [None]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)

In [None]:
chain.invoke({"question": "where did jason run to", "language": "italian"})
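
The dictionary at the start of the chain above routes one input dict to three prompt variables. The sketch below reproduces that routing in plain Python, assuming a hypothetical `fake_retriever` function in place of the real retriever; `route` is also a name introduced here purely for illustration.

```python
from operator import itemgetter

# itemgetter("question") builds a function that pulls the "question" key out of a dict
get_question = itemgetter("question")
get_language = itemgetter("language")

def fake_retriever(query):
    """Hypothetical stand-in for the vector store retriever."""
    return ["jason ran to panda express"]

def route(payload, retriever):
    """Mimic the chain's first step: build the variables the prompt template expects."""
    return {
        "context": retriever(get_question(payload)),  # only the question is sent to the retriever
        "question": get_question(payload),
        "language": get_language(payload),
    }

payload = {"question": "where did jason run to", "language": "italian"}
routed = route(payload, fake_retriever)
print(routed)
```

Only the `question` value reaches the retriever; `language` is passed straight through to the prompt.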

### Naive RAG with Documents

Now, we will perform RAG over an environmental science text. The PDF is available in the [Drive](https://drive.google.com/drive/folders/1EBnXiHcnpZNQ3IWwXOFQLbRJCVQG4sXb?usp=drive_link) folder; the cells below expect it at `data/environmental_sci.pdf`.

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
loader = PyPDFLoader("data/environmental_sci.pdf")

# The text splitter breaks the document into chunks
# Try different parameter values to see how they affect the chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

print(chunks[25].page_content)

In [None]:
len(chunks)
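
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size splitting with overlap. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter first tries to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only shows how the two parameters interact.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the previous
    one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return pieces

demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Larger chunks give the LLM more context per retrieved item but make each embedding less specific; more overlap reduces the chance that a sentence is cut off at a chunk boundary, at the cost of more chunks to embed.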

In [None]:
# Build a vector store from the document chunks with from_documents
vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"))
)

# Retrieve the 5 most similar chunks; k must be passed through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

In [None]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
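
The chain above passes the retriever's output, a list of `Document` objects, straight into the prompt, so the context is rendered as the list's string representation, metadata included. A common alternative is to join only the page text before it reaches the prompt. The sketch below shows that idea, using a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document` so it runs on its own; `format_docs` is a helper name chosen here, not part of the original notebook.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of the retrieved chunks, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [
    Doc("First retrieved chunk.", {"page": 1}),
    Doc("Second retrieved chunk.", {"page": 2}),
]
context_text = format_docs(retrieved)
print(context_text)
```

In the chain, this would correspond to using `retriever | format_docs` for the `"context"` entry, since LangChain coerces a plain function into a runnable when it is piped.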

In [None]:
# Print the 5 chunks most similar to the question
# Use this to make sense of the output of the next cell
print("\n\n".join(x.page_content for x in vectorstore.similarity_search("What is the main cause of global warming?", k=5)))

In [None]:
chain.invoke("What is the main cause of global warming?")

Experiment with the splitting method ([LangChain splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)), the splitter's parameters, and the number of retrieved chunks injected into the LLM's prompt as context. These choices significantly affect how the LLM answers questions.

## Advanced RAG

We leave this as a challenge for you. How can we implement advanced RAG methods in LangChain?

1. Find some data that you would like to perform RAG over. 
2. Implement some form of advanced search with LangChain. 

Note: The LangChain [EnsembleRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble) may be of use.
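
As a starting point for the challenge, one way to combine results from several retrievers (for example, a keyword search and a vector search) is Reciprocal Rank Fusion, which, as far as we know, is the approach the `EnsembleRetriever` documentation describes. The sketch below is a plain-Python illustration of that idea, not LangChain's implementation; the function name, the document ids, and the constant `c=60` are assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores weight / (c + rank) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from two retrievers, best match first
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers (`doc_a` and `doc_b` here) outrank documents found by only one, which is the main appeal of fusing a lexical and a semantic search.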

In [None]:
pass