# LLM RAG Tutorial
This tutorial gives a simple introduction to using an LLM to build a basic RAG app.

RAG (Retrieval Augmented Generation) lets us give foundation models local context without expensive fine-tuning, and it can be done even on normal everyday machines like your laptop.
The basic idea is that we store documents as vectors in a database. When the user asks the LLM a question, we can use LangChain to first pass that question to the vector database, which retrieves the relevant documents (these can be broken up into chunks, given metadata, summarised, and processed in various other ways to improve retrieval). The original question and the retrieved documents are then passed to the LLM (e.g. Claude), which returns the answer. In effect, the model appears to know what was in the database, e.g. local knowledge about your business or hobby, when in reality that information was injected into the prompt just before the model saw it!
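The retrieve-then-prompt flow described above can be sketched without any of the real libraries. This is only an illustration under loud assumptions: the "embedding" is a toy bag-of-words count (a stand-in for a real embedding model), the two documents are made up, and the function names `embed`, `retrieve`, and `build_prompt` are hypothetical rather than part of LangChain or ChromaDB.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real app would use an embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n\n".join(context_docs)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up documents standing in for the contents of the vector database
documents = [
    "Our office is closed on public holidays.",
    "The support team answers email within two working days.",
]
question = "How quickly does the support team answer email?"
prompt = build_prompt(question, retrieve(question, documents, k=1))
print(prompt)  # this string is what would be sent to the LLM
```

The LLM only ever sees `prompt`, which is why it seems to "know" the local documents.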

The main libraries we will use are:
- LangChain: a wrapper around the various LLMs and other tools that gives them a more consistent interface (so you can easily swap, say, OpenAI for Anthropic)
- Anthropic: the library through which we will access the Claude model (more on why this was chosen below)
- ChromaDB: a simple vector database, which is a key part of a RAG app
- sentence-transformers: an open-source library of models for embedding text

None of the above are "the best" tools; they're just examples, and you may wish to use different embedding models, LLMs, vector databases, etc.

## Setup
- **Add documents to docs folder**: First there is a bit of setup. This tutorial doesn't cover how to turn arbitrary sources into text files; that can be covered elsewhere. Instead, simply place some plain text documents ending in ".txt" in the "docs" folder.
    - There is a flat text version of the [Goldacre review](https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis/better-broader-safer-using-health-data-for-research-and-analysis) already there to get you started
- **.env file**: to use the Anthropic Claude model you'll need an access token, which you can create at https://console.anthropic.com. Then copy the env_example file, rename the copy ".env", and add your access token to it.
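To check that the access token is picked up, something like the following could be used. It is a minimal sketch, not the tutorial's own loader: `load_env_file` is a hypothetical helper that parses simple `KEY=VALUE` lines, and the variable name `ANTHROPIC_API_KEY` is an assumption, so use whatever name env_example actually defines. In practice a library such as python-dotenv handles this for you.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env = load_env_file(".env")
# ANTHROPIC_API_KEY is an assumed name; match it to the one in env_example
api_key = env.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_API_KEY")
if api_key is None:
    print("No access token found: check your .env file")
else:
    print("Access token loaded")
```

Avoid printing the token itself, and keep the ".env" file out of version control.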

In [1]:
# On Google Colab, clone the repo and install the dependencies
if 'google.colab' in str(get_ipython()):
  print('Running on Colab')
  !git clone https://github.com/SamHollings/llm_tutorial.git
  !pip install poetry
  !poetry config virtualenvs.in-project true
  !cd llm_tutorial && poetry install --no-ansi

  import os, sys
  VENV_PATH = "/content/llm_tutorial/.venv/lib/python3.10/site-packages"
  LOCAL_VENV_PATH = '/content/venv' # local notebook
  os.symlink(VENV_PATH, LOCAL_VENV_PATH) # connect to directory in drive
  sys.path.insert(0, LOCAL_VENV_PATH)

In [2]:
import chromadb
import langchain

## Create Vector Database

In [4]:
import toml
from langchain.vectorstores import Chroma
from langchain.schema.document import Document
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm import tqdm
import glob

config = toml.load('config.toml')

PERSIST_DIRECTORY = "db"  # folder where Chroma keeps the vector database on disk
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Embedding model that turns text into vectors
embedding = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
# Vector store backed by the persist directory above
vectorstore = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embedding)
# Splits each document into chunks of roughly 1000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
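
To give a feel for what chunking does, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and sentences); this version just slides a fixed-size window over the text. The function name `split_into_chunks` and the overlap of 200 characters are illustrative assumptions.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample_text = "abcdefghij" * 250  # 2500 characters of made-up text
chunks = split_into_chunks(sample_text, chunk_size=1000, chunk_overlap=200)
print([len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.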




In [13]:
# Read each .txt file in docs/, split it into chunks, and add the chunks to the vector store
for text_file_path in tqdm(glob.glob("docs/*.txt"), desc="Processing Files", position=0):
    with open(text_file_path, "r", encoding="utf-8") as text_file:
        # Keep the source path as metadata so answers can be traced back to a file
        doc = Document(page_content=text_file.read(), metadata={"file_path": text_file_path})
    texts = text_splitter.split_documents([doc])
    vectorstore.add_documents(documents=texts)

Processing Files:   0%|          | 0/1 [00:00<?, ?it/s]

[Document(page_content='Skip to main content\n GOV.UK\n Navigation menu \nMenu Search GOV.UK \nHomeHealth and social careTechnology in health and social careBetter, broader, safer: using health data for research and analysis\nDepartment\nof Health &\nSocial Care\nIndependent report\nBetter, broader, safer: using health data for research and analysis\nPublished 7 April 2022', metadata={'file_path': 'docs\\goldacre_review.txt'}), Document(page_content='Applies to England\nContents\nReview team\nSenior stakeholder group\nBackground information\nMinisterial introduction\nForeword\nExecutive summary\nSummary recommendations\nModernising NHS service analytics\nModern, open working methods for NHS data analysis\nThe challenge of privacy in health data\nTrusted Research Environments\nInformation governance, ethics and participation\nData curation\nStrategy\nConclusions\nProfessor Ben Goldacre, declaration of interests\nAcknowledgements\nPrint this page\nReview team\nProfessor Ben Goldacre, Gol