# Project: "Rag with Llama3 and LangChain"

## 1. Load the pdf

## 2. Chunk it and index it

## 3. Retrieve relevant information from it

## 4. Test it a bit

## 5. Make it modular and build your production chain

## 6. Deploy with monitoring using LangSmith and LangServe

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

## 1. Load the pdf

In [3]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./paper.pdf")
docs = loader.load()
len(docs)

15

## 2. Chunk it and index it

In [4]:
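# Imported from langchain_community, matching the loader and chat model imports in this notebook
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma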

In [5]:
embeddings = OllamaEmbeddings(model="llama3")

In [6]:
persist_directory = "./i-shall-persist"

In [7]:
vectordb = Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
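
The cell above indexes the pages exactly as the loader returned them, so the "chunk it" half of this step is not shown. As a rough illustration of the idea, here is a minimal sketch of fixed-size chunking with overlap in plain Python. `chunk_text` is a hypothetical helper written for this example, not a LangChain API; in practice LangChain's own text splitters (for example `RecursiveCharacterTextSplitter`) would be the usual choice, and the sizes below are arbitrary.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in at least one chunk (if it is short enough).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 30 characters, windows of 12 with 4 characters of overlap
sample = "abcdefghij" * 3
pieces = chunk_text(sample, chunk_size=12, overlap=4)
print([len(p) for p in pieces])
```

Smaller chunks tend to give the retriever more focused passages, at the cost of more embeddings to compute and store.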

In [9]:
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3")

llm.invoke("hi!")

AIMessage(content="Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?", response_metadata={'model': 'llama3', 'created_at': '2024-04-21T10:43:50.43677Z', 'message': {'role': 'assistant', 'content': ''}, 'done': True, 'total_duration': 4860749792, 'load_duration': 7016208, 'prompt_eval_count': 12, 'prompt_eval_duration': 3907058000, 'eval_count': 26, 'eval_duration': 930898000}, id='run-2aa487e2-f045-4237-87bc-5da0693325e7-0')

## 3. Retrieve relevant information from it

In [26]:
from langchain.chains import RetrievalQA

In [22]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever(), return_source_documents=True)

In [23]:
qa_chain.invoke("What are the main arguments for using self-attention according to the paper?")

{'query': 'What are the main arguments for using self-attention according to the paper?',
 'result': 'According to the Transformer paper, the main arguments for using self-attention (multi-headed) instead of recurrent layers or convolutional layers are:\n\n1. **Parallelization**: Self-attention allows for parallelizing computations across all positions in the input sequence, making it much faster to compute than recurrent or convolutional layers.\n2. **Efficient handling of long-range dependencies**: Self-attention can capture long-range dependencies and contextual relationships between any two positions in the input sequence, without requiring a fixed-size context window like recurrent layers or a fixed-size kernel like convolutional layers.\n3. **Better modeling of complex structures**: Self-attention can model complex structural relationships between different parts of the input sequence, which is particularly important for tasks like machine translation and text summarization.\n\nT
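
Conceptually, the retriever embeds the question and returns the stored passages whose embeddings are closest to it; those passages are then handed to the LLM as context. The sketch below illustrates that ranking step with hand-written toy vectors and cosine similarity. Both are assumptions made for illustration: real embeddings have far more dimensions, and the distance metric actually used depends on how the vector store is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          indexed: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k passages most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": passage label -> made-up 3-dimensional embedding
toy_index = {
    "self-attention section": [0.9, 0.1, 0.0],
    "training details": [0.1, 0.9, 0.1],
    "reference list": [0.0, 0.2, 0.9],
}

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for label, score in hits:
    print(f"{score:.3f}  {label}")
```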

## 4. Test it a bit

In [24]:
qa_chain.invoke("Explain in simple terms the attention mechanism.")

{'query': 'Explain in simple terms the attention mechanism.',
 'result': 'The attention mechanism is a way for neural networks to focus on specific parts of an input when processing it.\n\nThink of it like a librarian helping you find a book in a huge library. You tell the librarian what type of book you\'re looking for (e.g., a book about attention mechanisms), and they help you navigate the shelves to find the relevant books.\n\nIn the same way, the attention mechanism helps the neural network focus on specific parts of an input sentence or paragraph when trying to understand it. It\'s like a spotlight shining on certain words or phrases that are important for understanding the meaning.\n\nHere\'s how it works:\n\n1. The neural network breaks down the input into smaller chunks, called "keys."\n2. Each key is then compared to all the other keys (and their corresponding values) to determine which ones are most relevant.\n3. The attention mechanism assigns a weight or score to each key 
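
Reading answers by eye works for a couple of questions; a small keyword check can make the same spot-testing repeatable. The sketch below is one possible shape for that, not a full evaluation method. `smoke_test` and `fake_chain` are hypothetical names: `fake_chain` stands in for `qa_chain.invoke` so the block runs without Ollama, and it only assumes the `'query'`/`'result'` dictionary shape shown in the outputs above.

```python
from typing import Callable, Dict, List


def smoke_test(ask: Callable[[str], Dict[str, str]],
               cases: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each question, list the expected keywords missing from the answer."""
    missing: Dict[str, List[str]] = {}
    for question, keywords in cases.items():
        answer = ask(question)["result"].lower()
        missing[question] = [kw for kw in keywords if kw.lower() not in answer]
    return missing


def fake_chain(question: str) -> Dict[str, str]:
    # Stand-in for qa_chain.invoke: returns a canned answer in the same dict shape
    return {
        "query": question,
        "result": "Self-attention allows parallelization across positions.",
    }


report = smoke_test(
    fake_chain,
    {"Why self-attention?": ["parallelization", "long-range"]},
)
print(report)
```

In the notebook, passing `qa_chain.invoke` as `ask` would run the same checks against the real chain. Keyword matching is crude, so treat a missing keyword as a prompt to read the answer rather than as a failure.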

## 5. Make it modular and build your production chain

In [25]:
from typing import List

from langchain_core.documents import Document

def load_pdf(file_path: str = "./paper.pdf") -> List[Document]:
    # PyPDFLoader returns one Document per page
    loader = PyPDFLoader(file_path)
    return loader.load()

def index_docs(docs: List[Document],
               persist_directory: str = "./i-shall-persist",
               embedding_model: str = "llama3"):
    # Embed the documents, store them in Chroma, and expose the store as a retriever
    embeddings = OllamaEmbeddings(model=embedding_model)
    vectordb = Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
    return vectordb.as_retriever()

def setup_chain(retriever):
    # Relies on the `llm` (ChatOllama) defined in an earlier cell
    return RetrievalQA.from_chain_type(llm, retriever=retriever, return_source_documents=True)

file_path = "./paper.pdf"

docs = load_pdf(file_path)

retriever = index_docs(docs)

qa_chain = setup_chain(retriever)

qa_chain.invoke("What are the main arguments for using self-attention according to the paper?")

{'query': 'What are the main arguments for using self-attention according to the paper?',
 'result': 'According to the paper, the main arguments for using self-attention (also known as multi-headed self-attention) instead of recurrent layers in sequence-to-sequence models are:\n\n1. **Parallelization**: Self-attention allows for parallel computation across all input elements, whereas recurrent layers require sequential processing, which can be a bottleneck (assistant',
 'source_documents': [Document(page_content='[5]Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. CoRR , abs/1406.1078, 2014.\n[6]Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv\npreprint arXiv:1610.02357 , 2016.\n[7]Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural net

## 6. Deploy with monitoring using LangSmith and LangServe

In [27]:
# !langchain app new my-app --package rag-conversation
# !tree my-app

Folder structure of a langchain project:

![](./assets-resources/langchain-project-structure.png)

Let's start with the `server.py` file. It sets up a basic web server using FastAPI, a popular Python framework for building APIs that offers automatic data validation and interactive API documentation. It also imports `langserve`, the LangChain library for serving chains as REST endpoints, which is how a retrieval-augmented generation (RAG) chain gets exposed. Here's a breakdown of each part of the script:

### Imports
```python
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from langserve import add_routes
```
- **`FastAPI`**: Imports the FastAPI class, which is used to create the API server.
- **`RedirectResponse`**: Imports a helper from FastAPI for issuing HTTP redirects.
- **`add_routes`**: Imports the function from the `langserve` package that registers a chain's endpoints (such as `/invoke`, `/batch`, and `/stream`) on the FastAPI app.

### Initialize the FastAPI Application
```python
app = FastAPI()
```
- This line initializes a new FastAPI application. `app` is now an instance of `FastAPI` and will be used to register routes and run the server.

### Root Route
```python
@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")
```
- `@app.get("/")`: This is a route decorator provided by FastAPI. It tells FastAPI that the function directly below should be executed when an HTTP GET request is made to the root URL (`"/"`).
- `async def redirect_root_to_docs()`: Defines an asynchronous function that handles requests to the root URL.
- `return RedirectResponse("/docs")`: The function returns a `RedirectResponse` object, which redirects the client to the `/docs` URL. FastAPI automatically generates interactive API documentation (using Swagger UI), accessible at `/docs`. This redirect is a convenience to lead users directly to the documentation.

### LangChain Routes Integration
```python
# Edit this to add the chain you want to add
add_routes(app, NotImplemented)
```
- `add_routes` extends the FastAPI application (`app`) with routes for a LangChain chain: it connects the chain to the app and exposes the chain's methods through the web server.
- The second argument, `NotImplemented`, is a placeholder. In practice, you replace it with the chain (any LangChain runnable) that you want to serve.
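
To make "exposing the methods of the chain" more concrete, here is a toy, dependency-free sketch of the idea: take something callable and register handlers for it under a path. This is not LangServe's implementation. `sketch_add_routes` is a hypothetical function, the route table is a plain dictionary rather than a FastAPI app, and the `{"input": ...}` / `{"output": ...}` body shapes are simplified assumptions based on the LangServe convention.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def sketch_add_routes(routes: Dict[str, Handler],
                      runnable: Callable[[Any], Any],
                      path: str = "") -> None:
    """Register toy invoke and batch handlers for `runnable` under `path`."""

    def invoke(body: Dict[str, Any]) -> Dict[str, Any]:
        # One input in, one output back
        return {"output": runnable(body["input"])}

    def batch(body: Dict[str, Any]) -> Dict[str, Any]:
        # A list of inputs in, a list of outputs back
        return {"output": [runnable(item) for item in body["inputs"]]}

    routes[f"{path}/invoke"] = invoke
    routes[f"{path}/batch"] = batch


# str.upper stands in for a chain; "/rag" is an arbitrary example path
routes: Dict[str, Handler] = {}
sketch_add_routes(routes, str.upper, path="/rag")

print(sorted(routes))
print(routes["/rag/invoke"]({"input": "hi"}))
```

In the real `server.py`, the equivalent step is passing your chain to `add_routes` in place of `NotImplemented`.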

### Main Execution Block
```python
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
- `if __name__ == "__main__"`: This is a common Python idiom. It checks if the script is being run as the main program (not imported as a module in another script) and only then executes the code block inside it.
- `import uvicorn`: Imports `uvicorn`, an ASGI server for running FastAPI applications. It's specifically designed to work well with asynchronous applications and is highly performant.
- `uvicorn.run(app, host="0.0.0.0", port=8000)`: This line starts the `uvicorn` server with the FastAPI app. The `host="0.0.0.0"` configuration tells the server to listen on all network interfaces, making the server accessible externally. `port=8000` specifies the network port for the server.

Overall, this script sets up a basic FastAPI application, redirects the root URL to the automatically generated API documentation, and leaves a placeholder where `langserve` will register the routes for the chain you want to serve.

The `chain.py` file contains the heart of the application. In the pre-built LangChain project, it holds the app-specific logic and ends by creating the app's chain:

```python
# last line from packages/rag-conversation/rag_conversation/chain.py

chain = _inputs | ANSWER_PROMPT | ChatOpenAI() | StrOutputParser()

```

Let's now create our own package with our own custom logic for rag over a pdf.

For that we'll use this structure described in the LangChain [documentation here for getting started with LangServe](https://python.langchain.com/docs/langserve/).

### Setup

Note: we use Poetry for dependency management. See the Poetry documentation to learn more about it.

1. Create a new app using the LangChain CLI:
`langchain app new rag-pdf-app`

2. Define the runnable in `add_routes`. Go to `server.py` and edit:
`add_routes(app, NotImplemented)`

3. Use `poetry` to add third-party packages (e.g., langchain-openai, langchain-anthropic, langchain-mistral):
`poetry add [package-name]` (e.g., `poetry add langchain-openai`)

4. Set up the relevant environment variables. For example:
`export OPENAI_API_KEY="sk-..."`

5. Serve your app:
`poetry run langchain serve --port=8100`

In [None]:
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/<put your app name here>")
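
`RemoteRunnable` is a client for the HTTP endpoints that `add_routes` registers. For a sense of what travels over the wire, the sketch below builds (but does not send) the equivalent request with the standard library. It assumes the LangServe convention of a POST to `<path>/invoke` with a JSON body of the form `{"input": ...}`. `build_invoke_request` is a hypothetical helper, and `/my-chain` is a placeholder path; adjust the host, port, and path to wherever your app is served.

```python
import json
import urllib.request
from typing import Any


def build_invoke_request(base_url: str, chain_input: Any) -> urllib.request.Request:
    """Build a POST request for the /invoke endpoint of a served chain."""
    url = base_url.rstrip("/") + "/invoke"
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace with your own host, port, and route path
request = build_invoke_request(
    "http://localhost:8000/my-chain",
    "What are the main arguments for using self-attention?",
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["output"])
```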

Environment set up:


```
conda create -n langserve-test-env-pinecone python=3.11
conda activate langserve-test-env-pinecone
pip install -U "langchain-cli[serve]" "langserve[all]"
langchain app new .
poetry add pinecone-client==3.0.0.dev8
poetry add langchain-community==0.0.12
poetry add cohere
poetry add openai
poetry add jupyter
poetry add python-dotenv
poetry run jupyter notebook
```

- See langchain-env-test for the basic example

- Then maybe evolve to discuss an example like this [rag-conversation](https://github.com/langchain-ai/langchain/blob/master/templates/rag-conversation/rag_conversation/chain.py)