# LLM RAG Tutorial
<a target="_blank" href="https://colab.research.google.com/github/SamHollings/llm_tutorial/blob/main/llm_tutorial_rag_sources.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial gives a simple introduction to building a RAG pipeline which also tells you the sources of its findings.

If you haven't seen a basic RAG pipeline, it's worth having a look at our [RAG tutorial](llm_tutorial_rag.ipynb) first.

## Setup
- **Add documents to docs folder**: First there is a bit of setup. In this tutorial we won't go through how to take arbitrary sources and turn them into text files - that can be covered elsewhere. Instead, simply place some plain text documents ending in ".txt" in the "docs" folder.
    - There is a flat text version of the [Goldacre review](https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis/better-broader-safer-using-health-data-for-research-and-analysis) already there to get you started
- **.env file**: to use the Anthropic Claude model you'll need an access token, which can be created here: https://console.anthropic.com. After this you need to copy the env_example file, rename it ".env" and add in your access token.

In [None]:
# this forces Google Colab to install the dependencies
if "google.colab" in str(get_ipython()):
    print("Running on Colab")
    !git clone https://github.com/SamHollings/llm_tutorial.git -q
    %cd llm_tutorial
    !pip install -r requirements.txt -q -q

    import src.utils.colab as colab

    colab.upload_dot_env_file()

In [None]:
import glob
import os

import toml
from dotenv import load_dotenv
from langchain.chains import RetrievalQA, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chat_models import ChatAnthropic
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.schema.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

from tqdm import tqdm

config = toml.load("config.toml")

In [None]:
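load_dotenv(".env")

# Copy the key from the .env file into the variable name the Anthropic client expects
# os.environ["OPENAI_API_KEY"] = os.getenv('openai_key')
anthropic_key = os.getenv("anthropic_key")
if anthropic_key is None:
    # os.environ only accepts strings, so fail early with a clear message
    raise RuntimeError('"anthropic_key" not found - check your .env file')
os.environ["ANTHROPIC_API_KEY"] = anthropic_key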

## Initialise objects

We use a few different types of objects in a RAG pipeline.
- **chunk**: because LLMs can often only take in relatively small amounts of text, we need to break larger bodies of text into small chunks. For this we use the `text_splitter`. Exactly how we chunk up the text is an art in itself, and in this example we simply break it into ~1000-character chunks (a very simple approach!).
- **embed**: the `embedding` model (by default we've chosen HuggingFace's "sentence-transformer") converts strings of text in the chunks into a vector representation (if you want to learn more about why it does this, have a look into natural language processing theory)
- **store**: the `vectorstore` is the database in which we will store and later retrieve the embedded text vectors for each chunk.
- **Question and Answer Chain**: the `RetrievalQA` chain is a langchain object which does a few things for us:
    - it takes our question and passes it to the `retriever`, which in this case embeds the question, queries the `vectorstore`, and returns the nearest chunks (in vector space). By default langchain returns the top 4; in the cell below we set `k` to 20.
    - this is then "stuffed" into a new prompt along with your question. The default prompt is something like this:
        - `"using the following documents: {stuffed documents} answer the following question: {question}. Answer:"`
    - this new prompt is then sent off to the `llm` - in this case that is the Anthropic Claude model.
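
The embed and retrieve steps above can be sketched without any libraries. This is a toy stand-in, not how langchain, Chroma or the sentence-transformer model work internally: the `embed` function here is a hypothetical bag-of-words counter, whereas the real embedding model produces dense vectors. It only illustrates the idea of ranking chunks by similarity to the question and keeping the top `k`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (an assumption for illustration): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=4):
    """Return the k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


toy_chunks = [
    "Reproducible analytical pipelines make analysis easier to check.",
    "Open source code is easy to share.",
    "The weather was cold and wet.",
]
top_chunks = retrieve("What are reproducible analytical pipelines?", toy_chunks, k=1)
print(top_chunks)
```

A real vector store does the same ranking, but with learned embeddings and an index so that it does not need to compare the question against every chunk one by one.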

In [None]:
DEV_MODE = True
PERSIST_DIRECTORY = "db"
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"

if DEV_MODE:
    PERSIST_DIRECTORY += "/dev"

embedding = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL
)  # embedding_functions.DefaultEmbeddingFunction()
vectorstore = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embedding)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
llm = ChatAnthropic(anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"))

### System Prompt

To control how the AI retrieves and uses the retrieved documents, we need to create a few prompts.

First we have the `SYSTEM_PROMPT`. This sets the scene for the AI, and can be useful for defining some of the "framework" you want the AI to follow ([see this useful tweet thread](https://en.rattibha.com/thread/1711716995987894738)). In this instance, the prompt mostly defines the **Role** the AI will play.

In [None]:
# A trailing backslash joins lines with no space, so each one is preceded by a space
SYSTEM_PROMPT = PromptTemplate.from_template("""You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humans \
make more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so. \
Be concise and professional in your responses.\n\n """)

### Stuff Document Prompt

Next we have the `STUFF_DOCUMENTS_PROMPT`. The core of RAG is taking documents and jamming them into the prompt which is then sent to the LLM. This is the prompt that defines how that is done (along with the `load_qa_with_sources_chain` which we will see shortly.)

Here you can see it follows a straightforward format (see examples of other formats [here](https://en.rattibha.com/thread/1711716995987894738))
* *Role* - in the `SYSTEM_PROMPT`
* *Objective* - using the docs and the question, create an answer with references
* *Details* - don't make stuff up, always return sources used.
* *Examples* - "few-shot" examples - these are made up, but useful for the AI to *understand* how the output should look. Changing these can greatly change how the output looks and how consistent it is.

Next we have the part where it assembles the question and the retrieved documents together, before prepping the AI for the final answer.
It is important to note that the pipeline we pass this prompt to will not just dump the documents one by one into the "docs" part of the prompt. Using the following prompts we can get it to edit what is retrieved from the database, injecting metadata and adding other text (e.g. reference IDs), and we can change what delimiter separates the documents as they are "stuffed" into the prompt.
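
A minimal sketch of this "stuffing" step in plain Python, assuming documents are simple dicts rather than langchain `Document` objects. The function name `stuff_documents` and the example file names are hypothetical, and the templates are shortened versions of the ones used in this tutorial; the real chain does the equivalent formatting internally.

```python
# Template applied to every retrieved chunk: metadata first, then the text.
DOCUMENT_TEMPLATE = "{file_path}:\n{page_content}"

# Shortened version of the final prompt; {docs} receives the joined chunks.
PROMPT_TEMPLATE = "QUESTION: {question}\n=========\n{docs}\n=========\nFINAL ANSWER:"


def stuff_documents(question, documents, separator="\n\n"):
    """Format each document, join them with the separator, and fill the prompt."""
    formatted = [
        DOCUMENT_TEMPLATE.format(
            file_path=document["file_path"],
            page_content=document["page_content"],
        )
        for document in documents
    ]
    return PROMPT_TEMPLATE.format(question=question, docs=separator.join(formatted))


example_docs = [
    {"file_path": "docs/a.txt", "page_content": "First chunk."},
    {"file_path": "docs/b.txt", "page_content": "Second chunk."},
]
stuffed_prompt = stuff_documents("What do the chunks say?", example_docs)
print(stuffed_prompt)
```

Because every chunk arrives in the prompt labelled with its file path, the model has something concrete to cite when it writes its "SOURCES" section.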

In [None]:
# we can just add prompts together: just add a string to an existing prompt
STUFF_DOCUMENTS_PROMPT = SYSTEM_PROMPT+"""Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). \
If you don't know the answer, just say that you don't know. Don't try to make up an answer. \
ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre_review.txt)

SOURCES: (goldacre_review.txt, open_source_guidlines.txt)"

QUESTION: {question}
=========
{docs}
=========
FINAL ANSWER:"""

### Inject Document Metadata Prompt

This prompt controls how each retrieved chunk is formatted before it is stuffed into `{docs}`: the chunk's `file_path` metadata is placed on the line above its text, so the model can see which file each chunk came from.

In [None]:
INJECT_METADATA_PROMPT = PromptTemplate.from_template("{file_path}:\n{page_content}")

### load_qa_with_sources_chain

In [None]:
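# Build the chain that stuffs the retrieved documents into the prompt and sends it to the LLM
stuff_docs_sources_chain = load_qa_with_sources_chain(
    llm,
    chain_type="stuff",
    prompt=STUFF_DOCUMENTS_PROMPT,
    document_prompt=INJECT_METADATA_PROMPT,  # how each chunk is formatted
    document_variable_name="docs",  # the placeholder the chunks are stuffed into
    document_separator="\n\n",  # the delimiter between chunks
    verbose=True,
)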

## Populate Vector Database

The cell below loads the text files into the vector database.
- first it uses `glob` to get a list of all of the text files in "docs"
- next it converts this into the `Document` class preferred by langchain
- the document is run through the `text_splitter` to break it down into manageable chunks
- these chunks are added to the `vectorstore` (where they are first run through the `embedding` model prior to insertion into the database).
    - the database itself is just a SQLite database - you can even open it and look inside if you go to the db folder.
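
The load-and-chunk steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `load_and_chunk` is a hypothetical helper, and it cuts the text at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to split at paragraph, line and word boundaries. The point it shows is that every chunk keeps the `file_path` metadata of the file it came from.

```python
import glob
import os
import tempfile


def load_and_chunk(folder, chunk_size=1000):
    """Read every .txt file in folder and cut it into fixed-size chunks."""
    records = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, "r", encoding="utf-8") as text_file:
            text = text_file.read()
        for start in range(0, len(text), chunk_size):
            records.append(
                {
                    "page_content": text[start:start + chunk_size],
                    "metadata": {"file_path": path},  # kept so answers can cite the file
                }
            )
    return records


# Demonstrate on a throwaway folder so the sketch does not depend on "docs".
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "example.txt"), "w", encoding="utf-8") as example_file:
        example_file.write("x" * 2500)
    chunk_records = load_and_chunk(folder, chunk_size=1000)

print([len(record["page_content"]) for record in chunk_records])  # [1000, 1000, 500]
```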

**NOTE**: this cell may take a bit of time to run, as it needs to chew through and embed quite a lot of text. Go away and make a cup of coffee.

In [None]:
# Skip populating the database in dev mode - we can just use what was already loaded.
if not DEV_MODE:
    for text_file_path in tqdm(
        glob.glob("docs/*.txt", recursive=True), desc="Processing Files", position=0
    ):
        with open(text_file_path, "r", encoding="utf-8") as text_file:
            doc = Document(
                page_content=text_file.read(), metadata={"file_path": text_file_path}
            )
            texts = text_splitter.split_documents([doc])
            vectorstore.add_documents(documents=texts)

## Question and Retrieve
Now we can do the fun part - **ask the model questions**.

In [None]:
question = "Explain the main benefits of Reproducible Analytical Pipelines (RAP)"

docs = retriever.get_relevant_documents(question)

results = stuff_docs_sources_chain({"question": question, "input_documents": docs})

In [None]:
print(results['output_text'])
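
If you want the sources as a list rather than as part of the answer text, one option is to parse them out. The sketch below assumes the model followed the `SOURCES: (file_a.txt, file_b.txt)` format from the few-shot examples; `extract_sources` is a hypothetical helper, and it returns an empty list when the model has not followed that format, which can happen.

```python
import re


def extract_sources(answer_text):
    """Return the file names listed after "SOURCES:" (empty list if none found)."""
    match = re.search(r"SOURCES:\s*\(([^)]*)\)", answer_text)
    if match is None:
        return []
    sources = []
    for name in match.group(1).split(","):
        name = name.strip()
        if name and name not in sources:  # drop blanks and duplicates
            sources.append(name)
    return sources


# Made-up answer text in the same shape as the few-shot examples.
example_answer = (
    "Open source code is easy to share (goldacre_review.txt)\n\n"
    "SOURCES: (goldacre_review.txt, open_source_guidlines.txt)"
)
print(extract_sources(example_answer))
```

With the real pipeline you would call `extract_sources(results["output_text"])`.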