# Project: "RAG with Llama3 and LangChain"

## 1. Load the pdf

## 2. Chunk it and index it

## 3. Retrieve relevant information from it

## 4. Test it a bit

## 5. Make it modular and build your production chain

## 6. Deploy with monitoring using LangSmith and LangServe

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
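
The cell above prompts for the API key on every run. As a minimal sketch (the helper name `ensure_env` is made up for illustration and is not part of LangChain or LangSmith), you could prompt only when the variable is not already set:

```python
import os
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str]) -> str:
    """Return the value of `name`, asking for it only when it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # `prompt` is injected so the helper can be tested without a terminal.
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value


# In a notebook you would pass getpass.getpass as the prompt, for example:
#   import getpass
#   ensure_env("LANGCHAIN_API_KEY", getpass.getpass)
```

Re-running the cell then keeps the key that is already in the environment instead of asking again.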

## 1. Load the pdf

In [4]:
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader  # community package, same path used later in chain.py
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

loader = PyPDFLoader("./paper.pdf")
docs = loader.load_and_split()
len(docs)

16
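
`load_and_split()` loads the PDF and then cuts the text into chunks with a text splitter. To get a feel for what chunking with overlap does, here is a simplified sketch in plain Python. It is a fixed-size character splitter written for illustration, not the splitter LangChain uses internally; `split_text`, `chunk_size`, and `overlap` are names chosen for this example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Cut `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so a sentence that falls
    on a boundary still appears in full in at least one neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap at work.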

## 2. Chunk it and index it

In [5]:
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(docs, embeddings)
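
Conceptually, indexing means turning each chunk into a vector and keeping those vectors around so that the chunks closest to a query can be found later. The toy sketch below illustrates the idea with word counts and cosine similarity. It is a stand-in for illustration only: `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, and real embeddings (such as the OpenAI ones above) are dense learned vectors, not word counts.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, int]:
    """Toy 'embedding': a bag-of-words count vector."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self, texts: List[str]) -> None:
        self._items: List[Tuple[str, Dict[str, int]]] = [(t, embed(t)) for t in texts]

    def search(self, query: str, k: int = 4) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(query_vector, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "attention relates positions of a sequence",
    "the cat sat on the mat",
    "recurrent networks process tokens in order",
])
print(store.search("what is attention", k=1))
# ['attention relates positions of a sequence']
```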

In [6]:
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.0)

## 3. Retrieve relevant information from it

In [7]:
retriever = vectordb.as_retriever() 
retriever
# MODEL = 'gpt-4o-mini'
# llm = ChatOpenAI(model=MODEL, temperature=0)
# source: https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/#question-answering-with-rag

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
qa_chain = create_stuff_documents_chain(llm, prompt)
qa_chain
# This method `create_stuff_documents_chain` [outputs an LCEL runnable](https://arc.net/l/quote/bnsztwth)

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), config={'run_name': 'format_inputs'})
| ChatPromptTemplate(input_variables=['context', 'input'], messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}")), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}'))])
| ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x10ff2bed0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x12f003c90>, model_name='gpt-4o-mini', temperature=0.0, openai_api_key=SecretStr('**********'), openai_api_base='https://api.openai.com/v1', openai_proxy='')
| StrOutputParser()
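
The repr above shows the three stages of the chain: format the documents into `{context}`, fill the prompt, call the model, and parse the output to a string. The "stuff" part simply means that all retrieved chunks are put into one prompt. A plain-Python sketch of the first two stages (the function names and the shortened template are made up for illustration, and the blank-line separator is an assumption):

```python
from typing import List, Tuple

SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(chunks: List[str], separator: str = "\n\n") -> str:
    """Join every retrieved chunk into one context string."""
    return separator.join(chunks)


def build_messages(question: str, chunks: List[str]) -> List[Tuple[str, str]]:
    """Fill the system prompt with the context and append the user question."""
    context = stuff_documents(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_messages(
    "What is self-attention?",
    ["chunk about attention", "chunk about the encoder"],
)
for role, content in messages:
    print(f"[{role}] {content}")
```

Because everything lands in a single prompt, this approach only works while the retrieved chunks fit in the model's context window.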

In [8]:
query = "What is self-attention according to this paper?"
rag_chain = create_retrieval_chain(retriever, qa_chain)
results = rag_chain.invoke({"input": query})
results

{'input': 'What is self-attention according to this paper?',
 'context': [Document(metadata={'page': 0, 'source': './paper.pdf'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmech

In [9]:
from IPython.display import Markdown

final_answer = results["answer"]

Markdown(final_answer)

Self-attention is a mechanism used in the Transformer architecture that allows the model to weigh the importance of different words in a sequence when encoding a particular word. It enables the model to capture long-distance dependencies and relationships within the input data without relying on recurrence or convolutions. This mechanism enhances the model's ability to understand context and improve performance on tasks like machine translation.
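
The result is a dictionary with the keys `input`, `context`, and `answer`, so the retrieved chunks are available next to the answer. A small sketch of how you might list the pages an answer was based on. The `Doc` dataclass is a stand-in that mimics only the two attributes used here (`page_content` and `metadata`), so that the example runs without LangChain; `cited_pages` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def cited_pages(results: Dict[str, Any]) -> List[int]:
    """Return the sorted, de-duplicated page numbers of the retrieved chunks."""
    pages = {
        doc.metadata["page"]
        for doc in results["context"]
        if "page" in doc.metadata
    }
    return sorted(pages)


fake_results = {
    "input": "What is self-attention according to this paper?",
    "context": [
        Doc("first chunk", {"page": 12, "source": "./paper.pdf"}),
        Doc("second chunk", {"page": 0, "source": "./paper.pdf"}),
        Doc("third chunk", {"page": 12, "source": "./paper.pdf"}),
    ],
    "answer": "stub answer",
}
print(cited_pages(fake_results))
# [0, 12]
```

Judging by the output above, the page numbers in the metadata start at 0, so add 1 if you want to show them to a reader.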

## 4. Test it a bit

In [10]:
rag_chain.invoke({'input': "Explain in simple terms the attention mechanism."})

{'input': 'Explain in simple terms the attention mechanism.',
 'context': [Document(metadata={'page': 12, 'source': './paper.pdf'}, page_content='Attention Visualizations\nInput-Input Layer5\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different he
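
Instead of eyeballing one answer at a time, you could run a handful of questions through the chain and check a few basic properties, such as "the answer is not empty" and "the answer respects the three-sentence limit from the prompt". The sketch below is one possible way to do that. `smoke_test` and `count_sentences` are hypothetical helpers, the sentence counter is a rough heuristic, and the chain is replaced by a fake function so the example runs on its own.

```python
import re
from typing import Any, Callable, Dict, List


def count_sentences(text: str) -> int:
    """Rough sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def smoke_test(
    invoke: Callable[[Dict[str, Any]], Dict[str, Any]],
    questions: List[str],
    max_sentences: int = 3,
) -> List[str]:
    """Run every question and return a list of human-readable failures."""
    failures: List[str] = []
    for question in questions:
        result = invoke({"input": question})
        answer = result.get("answer", "")
        if not answer.strip():
            failures.append(f"{question!r}: empty answer")
        elif count_sentences(answer) > max_sentences:
            failures.append(f"{question!r}: answer longer than {max_sentences} sentences")
        if not result.get("context"):
            failures.append(f"{question!r}: no context retrieved")
    return failures


def fake_chain(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for rag_chain.invoke, used only to make the sketch runnable."""
    return {"input": payload["input"], "context": ["some chunk"], "answer": "A short answer."}


print(smoke_test(fake_chain, ["What is self-attention?", "What is multi-head attention?"]))
# []
```

With the real chain you would pass `rag_chain.invoke` as the first argument.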

## 5. Make it modular and build your production chain

In [12]:
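from typing import List

from langchain_core.documents import Document

def load_pdf(file_path: str = "./paper.pdf") -> List[Document]:
    # Load the PDF and split it into chunks; each chunk is a Document.
    loader = PyPDFLoader(file_path)
    return loader.load_and_split()

def index_docs(docs: List[Document]):
    # Embed the chunks, store them in Chroma, and expose the store as a retriever.
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(docs, embedding=embeddings)
    retriever = vectordb.as_retriever()

    return retriever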

def setup_chain(retriever):
    system_prompt = (
        "You are an assistant for question-answering tasks. "
        "Use the following pieces of retrieved context to answer "
        "the question. If you don't know the answer, say that you "
        "don't know. Use three sentences maximum and keep the "
        "answer concise."
        "\n\n"
        "{context}"
    )

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )
    # Note: `llm` is the module-level model created earlier in the notebook (In [6]).
    qa_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, qa_chain)
    return rag_chain


file_path = "./paper.pdf"

docs = load_pdf(file_path)

retriever = index_docs(docs)

qa_chain = setup_chain(retriever)
qa_chain.invoke({'input': "What is the main contribution of the paper?"})

{'input': 'What is the main contribution of the paper?',
 'context': [Document(metadata={'page': 0, 'source': './paper.pdf'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanis

## 6. Deploy with monitoring using LangSmith and LangServe

In [27]:
# !langchain app new my-app --package rag-conversation
# !tree my-app

Folder structure of a langchain project:

![](./assets-resources/langchain-project-structure.png)

Let's start with the server.py file:

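```python
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from langserve import add_routes

app = FastAPI()


@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")


# Edit this to add the chain you want to add
add_routes(app, NotImplemented)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```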

This script sets up a basic web server using FastAPI, a popular framework for building APIs with Python. FastAPI is designed to be fast and easy to use, with automatic data validation and interactive API documentation. The script also imports `add_routes` from LangServe (the `langserve` package), LangChain's library for serving chains and other runnables as REST endpoints. Here's a breakdown of each part of the script:

### Imports
```python
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from langserve import add_routes
```
- **`FastAPI`**: Imports the FastAPI class, which is used to create the API server.
- **`RedirectResponse`**: Imports a helper from FastAPI for issuing HTTP redirects.
- **`add_routes`**: Imports a function from the `langserve` package that registers HTTP endpoints (such as `/invoke`, `/batch`, and `/stream`) for a LangChain runnable on the FastAPI app.

### Initialize the FastAPI Application
```python
app = FastAPI()
```
- This line initializes a new FastAPI application. `app` is now an instance of `FastAPI` and will be used to register routes and run the server.

### Root Route
```python
@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")
```
- `@app.get("/")`: This is a route decorator provided by FastAPI. It tells FastAPI that the function directly below should be executed when an HTTP GET request is made to the root URL (`"/"`).
- `async def redirect_root_to_docs()`: Defines an asynchronous function that handles requests to the root URL.
- `return RedirectResponse("/docs")`: The function returns a `RedirectResponse` object, which redirects the client to the `/docs` URL. FastAPI automatically generates interactive API documentation (using Swagger UI), accessible at `/docs`. This redirect is a convenience to lead users directly to the documentation.

### LangChain Routes Integration
```python
# Edit this to add the chain you want to add
add_routes(app, NotImplemented)
```
- `add_routes` extends the FastAPI application (`app`) with the routes that serve a LangChain runnable.
    - It connects the chain to our app, exposing the chain's methods (for example `invoke`, `batch`, and `stream`) through our web server.
- The second argument, `NotImplemented`, is a placeholder. In practice, you replace it with the runnable (chain) that you want to serve.
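
To make the idea of "exposing the methods of the chain" concrete, here is a conceptual model in plain Python. It is not how `langserve` is implemented: `toy_add_routes` and the `routes` dictionary are invented for this sketch, and the request and response shapes (`{"input": ...}` in, `{"output": ...}` out, with `inputs` for batches) are stated here as assumptions about the wire format.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def toy_add_routes(routes: Dict[str, Handler], chain: Callable[[Any], Any], path: str = "") -> None:
    """Register 'invoke' and 'batch' handlers for `chain` under `path`."""

    def invoke(body: Dict[str, Any]) -> Dict[str, Any]:
        # One input in, one output back.
        return {"output": chain(body["input"])}

    def batch(body: Dict[str, Any]) -> Dict[str, Any]:
        # A list of inputs in, a list of outputs back (same order).
        return {"output": [chain(item) for item in body["inputs"]]}

    routes[f"POST {path}/invoke"] = invoke
    routes[f"POST {path}/batch"] = batch


routes: Dict[str, Handler] = {}
toy_add_routes(routes, str.upper, path="/rag-local")  # str.upper plays the role of the chain

print(sorted(routes))
# ['POST /rag-local/batch', 'POST /rag-local/invoke']
print(routes["POST /rag-local/invoke"]({"input": "hello"}))
# {'output': 'HELLO'}
```

The point of the sketch is only that one call wires several endpoints to the same underlying object, which is why a single `add_routes` line is enough to serve a chain.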

### Main Execution Block
```python
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
- `if __name__ == "__main__"`: This is a common Python idiom. It checks if the script is being run as the main program (not imported as a module in another script) and only then executes the code block inside it.
- `import uvicorn`: Imports `uvicorn`, an ASGI server for running FastAPI applications. It's specifically designed to work well with asynchronous applications and is highly performant.
- `uvicorn.run(app, host="0.0.0.0", port=8000)`: This line starts the `uvicorn` server with the FastAPI app. The `host="0.0.0.0"` configuration tells the server to listen on all network interfaces, making the server accessible externally. `port=8000` specifies the network port for the server.

Overall, this script sets up a basic FastAPI application, redirects the root URL to the automatically generated API documentation, and uses `langserve` to add routes for a chain, which remains a placeholder until you pass in a real runnable.

Now for the `chain.py` file, which contains the heart of our application. In the pre-built LangChain template, this file holds the app-specific logic and ends by creating the app's chain:

```python
# last line from packages/rag-conversation/rag-conversation/chain.py

chain = _inputs | ANSWER_PROMPT | ChatOpenAI() | StrOutputParser()

```
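
The `|` in that line is ordinary Python operator overloading: each component defines what "pipe into the next step" means. The sketch below shows the mechanism with a tiny `Step` class. It is an illustration of the idea only; `Step` is a made-up class, and LangChain's real runnables do much more (batching, streaming, async, tracing).

```python
from typing import Any, Callable


class Step:
    """Wraps a function and lets steps be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a first and feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for prompt, model, and output parser.
build_prompt = Step(lambda question: f"Question: {question}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})
parse_output = Step(lambda message: message["content"])

chain = build_prompt | fake_llm | parse_output
print(chain.invoke("hi"))
# QUESTION: HI
```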

Let's now create our own package with our own custom logic for RAG over a PDF.

For that we'll use this structure described in the LangChain [documentation here for getting started with LangServe](https://python.langchain.com/docs/langserve/).

Setup

Note: We use Poetry for dependency management. See the Poetry documentation to learn more about it.

1. Create a new app using the LangChain CLI:
`langchain app new rag-pdf-app`

2. Define the runnable in `add_routes`. Go to `server.py` and edit:
`add_routes(app, NotImplemented)`

3. Use `poetry` to add third-party packages (e.g., langchain-openai, langchain-anthropic, langchain-mistral):
`poetry add [package-name]` (e.g., `poetry add langchain-openai`)

4. Set up the relevant environment variables. For example:
`export OPENAI_API_KEY="sk-..."`

5. Serve your app:
`poetry run langchain serve --port=8100`

In [3]:
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8100/rag-local/")

In [4]:
runnable.invoke("What is self-attention?")

{'result': 'Based on the provided context, it appears that Noam proposed "scaled dot-product attention" as a mechanism in the Transformer model. This suggests that self-attention refers to an attention mechanism that computes attention weights between different positions in the input sequence using dot products and then scales them by a learnable weight factor.',
 'source_documents': [Document(page_content='[5]Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. CoRR , abs/1406.1078, 2014.\n[6]Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv\npreprint arXiv:1610.02357 , 2016.\n[7]Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural networks on sequence modeling. CoRR , abs/1412.3555, 2014.\n[8]Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros
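
`RemoteRunnable` is a convenience wrapper around plain HTTP calls to the served chain. As a rough sketch of what an equivalent request could look like with only the standard library: the `/invoke` suffix and the `{"input": ...}` / `{"output": ...}` JSON shapes are assumptions about the server's wire format, and both function names are made up for this example.

```python
import json
import urllib.request
from typing import Any


def build_invoke_request(base_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the chain's invoke endpoint."""
    url = base_url.rstrip("/") + "/invoke"
    body = json.dumps({"input": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_remote(base_url: str, question: str, timeout: float = 60.0) -> Any:
    """Send the request and return the chain output (needs the server to be running)."""
    request = build_invoke_request(base_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["output"]


request = build_invoke_request("http://localhost:8100/rag-local/", "What is self-attention?")
print(request.get_method(), request.full_url)
# POST http://localhost:8100/rag-local/invoke
```

Only `build_invoke_request` runs here; call `invoke_remote(...)` once the app from step 5 is being served.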

Environment set up:


```
conda create -n langserve-test-env-pinecone python=3.11
conda activate langserve-test-env-pinecone
pip install -U "langchain-cli[serve]" "langserve[all]"
langchain app new .
poetry add pinecone-client==3.0.0.dev8
poetry add langchain-community==0.0.12
poetry add cohere
poetry add openai
poetry add jupyter
poetry add python-dotenv
poetry run jupyter notebook
```

- See langchain-env-test for the basic example

- Then maybe evolve to discuss an example like this [rag-conversation](https://github.com/langchain-ai/langchain/blob/master/templates/rag-conversation/rag_conversation/chain.py)

In [4]:
# in rag-pdf-app/app/chain.py

# inspired by this template from langchain: https://github.com/langchain-ai/langchain/blob/master/templates/rag-chroma-private/rag_chroma_private/chain.py

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder
from langchain.chains import RetrievalQA
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel, RunnableLambda
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain import hub
from typing import List, Tuple

from langchain_core.documents import Document

def load_pdf(file_path: str = "./paper.pdf") -> List[Document]:
    # PyPDFLoader.load() returns one Document per PDF page.
    loader = PyPDFLoader(file_path)
    return loader.load()

def index_docs(docs: List[Document],
               persist_directory: str = "./i-shall-persist",
               embedding_model: str = "llama3"):
    # Embed the pages with a local Ollama model and persist the Chroma index to disk.
    embeddings = OllamaEmbeddings(model=embedding_model)
    vectordb = Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
    retriever = vectordb.as_retriever()

    return retriever


file_path = "./paper.pdf"

docs = load_pdf(file_path)

retriever = index_docs(docs)

template = """
Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOllama(model="llama3")

# 2 suggestions for creating the rag chain:

# chain = (
#     RunnableParallel({"context": retriever, "question": RunnablePassthrough()}) # RunnablePassthrough source: https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.passthrough.RunnablePassthrough.html#langchain-core-runnables-passthrough-runnablepassthrough:~:text=Runnable%20to%20passthrough,and%20experiment%20with.
#     # RunnableParallel source: https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableParallel.html
#     | prompt
#     | llm
#     | StrOutputParser()
# )

chain = RetrievalQA.from_chain_type(llm, 
                                    retriever=retriever, 
                                    chain_type_kwargs={"prompt": prompt}, 
                                    return_source_documents=True)
# qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, return_source_documents=True)

# Add typing for input
class Question(BaseModel):
    __root__: str
    # The __root__ field in Pydantic models is used to define a model
    # where you expect a single value or a list rather than a dictionary 
    # of named fields. Essentially, it allows your model to handle instances 
    # where data does not naturally fit into a key-value structure, 
    # such as a single value or a list.


rag_chain = chain.with_types(input_type=Question)

How to pass the link to the document along with the output

In [8]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

runnable_parallel = RunnableParallel(
    {"input": rag_chain, "file_url": RunnablePassthrough()}
)

In [10]:
runnable_parallel.invoke({"input": "What is self-attention?", "file_url": file_path, "query": "What is self-attention?"})

{'input': {'input': 'What is self-attention?',
  'file_url': './paper.pdf',
  'query': 'What is self-attention?',
  'result': 'According to this text, self-attention (also called intra-attention) is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence.',
  'source_documents': [Document(page_content='[5]Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. CoRR , abs/1406.1078, 2014.\n[6]Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv\npreprint arXiv:1610.02357 , 2016.\n[7]Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural networks on sequence modeling. CoRR , abs/1412.3555, 2014.\n[8]Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent ne
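
A parallel step hands the same input to every branch and collects the results under the branch names, which is why the whole RAG result ends up nested under the `input` key above. The plain-Python sketch below models that behaviour. `run_parallel` and `fake_rag` are hypothetical stand-ins, and the branches run one after another here, so only the shape of the result is illustrated. It also shows one option for returning just the file path instead of the whole input: select the field with `itemgetter`.

```python
from operator import itemgetter
from typing import Any, Callable, Dict


def run_parallel(branches: Dict[str, Callable[[Any], Any]], value: Any) -> Dict[str, Any]:
    """Give the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_rag(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the RAG chain: echoes its input and adds a result."""
    return {**inputs, "result": "stub answer"}


payload = {"query": "What is self-attention?", "file_url": "./paper.pdf"}

# A passthrough branch returns the entire input dictionary.
with_passthrough = run_parallel({"answer": fake_rag, "file_url": lambda x: x}, payload)

# Selecting one field keeps only the path.
with_itemgetter = run_parallel({"answer": fake_rag, "file_url": itemgetter("file_url")}, payload)

print(with_passthrough["file_url"])
# {'query': 'What is self-attention?', 'file_url': './paper.pdf'}
print(with_itemgetter["file_url"])
# ./paper.pdf
```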