<a href="https://colab.research.google.com/github/MJMortensonWarwick/GenAI_in_Business/blob/main/simple_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple RAG System using LlamaIndex
In this tutorial we will build a full retrieval-augmented generation (RAG) pipeline using LlamaIndex. We will install a small open-source LLM, [Phi-3](https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/) from Microsoft, and run it locally on our Colab instance to do the generation. To augment the results we will give the LLM access to our arXiv abstracts, this time stored in [Chroma](https://www.trychroma.com/) as a vector store/document store. Finally, we use [Ollama](https://ollama.com/) as the interface to Phi-3 on our machine.


## PART ONE: LLM Setup

We'll start by installing two things: the `arxiv` Python package and Ollama.

In [1]:
!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.1.2-py3-none-any.whl (11 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 1.5 MB/s eta 0:00:00
Collecting requests~=2.32.0 (from arxiv)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 2.6 MB/s eta 0:00:00
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6049 sha256=ac225a0798a67be9e4a9cbb33abdc0e454668f6fae29f7c1faa75bc36043e686
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d6

In [2]:
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>> Downloading ollama...
100 10975    0 10975    0     0  21688      0 --:--:-- --:--:-- --:--:-- 21732
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Next we do a bit of work to get Ollama running in the background of our Colab (Linux) instance. Don't worry too much about this code; it is mostly Linux/Bash commands wrapped in Python.

In [3]:
# Set up the model name as a global variable
# (in the Ollama library the "phi" tag refers to Phi-2; the Phi-3 tag is "phi3")
OLLAMA_MODEL='phi:latest'

# Add the model to the environment of the operating system
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL # print the global variable to check it saved

import subprocess
import time

# Start ollama on the server ("serve")
command = "nohup ollama serve&" # "nohup" and "&" means run in the background

# Use subprocess.Popen to run the command
process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

print("Process ID:", process.pid) # print the process ID
time.sleep(5)  # Makes Python wait for 5 seconds

!ollama -v # print the Ollama version number as a check

phi:latest
Process ID: 604
ollama version is 0.1.45
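
The fixed five-second sleep above assumes the server starts within that time. As a hedged alternative, the sketch below polls the port until something is listening. The helper name `wait_for_port` is hypothetical (it is not part of Ollama), and the address `127.0.0.1:11434` is taken from the install log above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, try again shortly
    return False


# Usage after launching "ollama serve" (hypothetical helper, port from the install log):
# ready = wait_for_port("127.0.0.1", 11434)
# print("Ollama is ready" if ready else "Ollama did not start in time")
```

Polling like this only confirms that the port is open; it does not check that a model has been pulled.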


Now that everything should be set up, we can query the model and get it to generate some text about Warwick Business School (WBS):

In [None]:
# Query the model via the command line
# First time running it will "pull" (import) the model
!ollama run $OLLAMA_MODEL "Give me short summary of Warwick Business School."
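
The same query can also be sent from Python over Ollama's local HTTP API instead of the command line. The following is a minimal sketch using only the standard library; the function names `build_generate_request` and `generate` are hypothetical helpers, and the address is the one reported in the install log above.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # address from the install log


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 240.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (only works while "ollama serve" is running and the model has been pulled):
# print(generate("phi:latest", "Give me short summary of Warwick Business School."))
```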

With everything seemingly working, we can start to build our RAG pipeline. To begin with, this involves installing several packages:

In [None]:
# Install prerequisites
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index-vector-stores-chroma
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install chromadb

# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.core import StorageContext

# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb


With everything installed, we can set up our LLM in Python:

In [6]:
# Use the global variable (OLLAMA_MODEL) as our LLM
# Set a timeout of 4 minutes
llm = Ollama(model=OLLAMA_MODEL, request_timeout=240.0)

We will also specify an embeddings model and add everything into LlamaIndex's settings.

In [7]:
# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Specify the LLM and embedding model into LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## PART TWO: Vector Database
We have a generator, but now we need a retriever to augment our results. We will extract 500 abstracts from arXiv as our documents. You can add your own search keywords to customise this to your research interests.

In [8]:
# Add your own search keyword(s) based on your research
q = "text analytics" # search query

!mkdir -p '/content/data/' # create an empty directory called "data"

import arxiv

# Number of records
n_records = 500

# Construct the default API client.
client = arxiv.Client()

# Search for the n most recent articles matching the keyword "text analytics."
search = arxiv.Search(
  query = q,
  max_results = n_records, # number of records to return - defined above
  sort_by = arxiv.SortCriterion.SubmittedDate # sort order (submission date)
)

results = client.results(search) # lazy generator over the search results

count = 0

for r in results: # iterate through the results (fetched once)
  doc = r.summary
  # create a file name as a combination of the path ("/content/data/") ...
  # the name "Output", a number that increases with each abstract (0-499) ...
  # and the file type (".txt").
  # e.g. the first file will be "/content/data/Output0.txt" ...
  # and the last will be "/content/data/Output499.txt"
  fname = "/content/data/Output" + str(count) + ".txt"
  with open(fname, "w") as text_file:
    text_file.write(doc) # save the file
  count += 1 # increment the count

With all our documents created, we can check the first one to make sure everything worked:

In [9]:
# Use a context manager so the file is closed automatically
with open('/content/data/Output0.txt', 'r') as first_file:
    file_contents = first_file.read() # read the document
print(file_contents)

When presented with questions involving visual thinking, humans naturally
switch reasoning modalities, often forming mental images or drawing visual
aids. Large language models have shown promising results in arithmetic and
symbolic reasoning by expressing intermediate reasoning in text as a chain of
thought, yet struggle to extend this capability to answer text queries that are
easily solved by visual reasoning, even with extensive multimodal pretraining.
We introduce a simple method, whiteboard-of-thought prompting, to unlock the
visual reasoning capabilities of multimodal large language models across
modalities. Whiteboard-of-thought prompting provides multimodal large language
models with a metaphorical `whiteboard' to draw out reasoning steps as images,
then returns these images back to the model for further processing. We find
this can be accomplished with no demonstrations or specialized modules, instead
leveraging models' existing ability to write code with libraries such as
Ma
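
Beyond reading one file, it can be worth confirming how many abstracts were actually saved, since a search may return fewer results than requested. A small sketch, assuming the files sit directly in the data directory; `count_text_files` is a hypothetical helper name.

```python
from pathlib import Path


def count_text_files(directory: str) -> int:
    """Count the .txt files directly inside the given directory."""
    return len(list(Path(directory).glob("*.txt")))


# Usage in the notebook (path taken from the download cell above):
# print(count_text_files("/content/data"), "abstracts saved")
```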

Now we will set up a vector database (this time using Chroma DB) and index our documents (via embeddings). As in the previous tutorial, this will allow us to search for documents relevant to our query:

In [10]:
# Create client ("db") and a database ("chroma_db")
db = chromadb.PersistentClient(path="./chroma_db")

# Create a collection/table ("demo-for-vinh") in the db
chroma_collection = db.create_collection("demo-for-vinh")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)
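
To make the idea of searching by embeddings concrete, here is a toy sketch of similarity-based retrieval. It is an illustration only, not how Chroma is implemented: the three-dimensional vectors are made up (the `bge-small-en-v1.5` model produces 384-dimensional vectors), cosine similarity is just one common choice of measure, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document ids whose vectors are most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy "embeddings" for three documents
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.05, 0.0], toy_vectors, k=2))  # ['doc_a', 'doc_b']
```

A vector database does the same kind of ranking, but with indexing structures that avoid comparing the query against every stored vector.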

Now we will use LlamaIndex's `SimpleDirectoryReader` to load all the documents. We will then chunk the data using a simple sentence splitter:

In [11]:
reader = SimpleDirectoryReader(input_dir="/content/data")
all_docs = []

from llama_index.core.node_parser import SentenceSplitter

# we will limit to chunk size 100, with overlap = 10
parser = SentenceSplitter(chunk_size=100, chunk_overlap=10)

# load every file in the directory into a single list of documents
for docs in reader.iter_data():
    all_docs.extend(docs)

In [12]:
# Create the index, passing the splitter so each document is chunked
# with it before embedding (otherwise the default splitter is used)
index = VectorStoreIndex.from_documents(
    all_docs,
    storage_context = storage_context,
    embed_model = embed_model,
    transformations = [parser]
)

# Print the collection object
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

<chromadb.api.models.Collection.Collection object at 0x79cf2d809420>
Collection name is: demo-for-vinh
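
To see what a chunk size and an overlap mean in practice, the sketch below splits text into overlapping windows of words. This is a deliberately simplified stand-in and not `SentenceSplitter`'s actual algorithm, which counts tokens and tries to respect sentence boundaries; `chunk_words` is a hypothetical helper.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

The repeated word at each boundary is the overlap: it keeps some shared context between neighbouring chunks so that a sentence cut at a boundary is not lost entirely.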


## PART THREE: Prompt Template
Now we will create a custom prompt template that instructs the LLM to prefer the context retrieved by RAG, but to always answer the question even when that context is unhelpful.


In [13]:
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Always try to use the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context isn't helpful."
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
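
To illustrate what the model eventually receives, the sketch below fills the two placeholders by hand with plain `str.format`. LlamaIndex performs this substitution itself at query time; the helper `fill_prompt` and the blank-line separator between chunks are assumptions made for illustration.

```python
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Always try to use the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def fill_prompt(context_chunks, query):
    """Join the retrieved chunks and substitute both placeholders."""
    context_str = "\n\n".join(context_chunks)  # separator is an assumption
    return qa_prompt_str.format(context_str=context_str, query_str=query)


# Made-up chunks standing in for retrieved abstracts
print(fill_prompt(["First retrieved chunk.", "Second retrieved chunk."], "What are LLMs?"))
```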

## PART FOUR: Prompting
Now that everything is set up, we can start (retrieval-augmented) generating! Let's try a simple query about LLMs. Feel free to replace this with a query based on your subject area:

In [14]:
print(
    index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("What are LLMs?")
)

 Large Language Models (LLMs) are deep learning models that have been trained on vast amounts of text data, allowing them to generate coherent and high-quality outputs for various tasks such as language translation, summarization, and text generation. They use a technique called transformer architecture, which allows them to learn relationships between words and phrases in a sequence of input sentences.
"""




And there we have it! A working RAG pipeline running in our notebook and talking about our research area. To better understand the process behind this output, we can inspect the response metadata. Alongside the response above, it shows the documents retrieved to augment the generation. Feel free to look up those documents to see how they may have influenced the result.

In [15]:
response = index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
        ).query("What are LLMs?")
response.metadata # print the source documents used by the LLM

{'45aa8d77-9589-4de1-a10a-84f88925dc38': {'file_path': '/content/data/Output456.txt',
  'file_name': 'Output456.txt',
  'file_type': 'text/plain',
  'file_size': 1907,
  'creation_date': '2024-06-23',
  'last_modified_date': '2024-06-23'},
 'ab80444d-44e6-45ad-bfa5-08e196462463': {'file_path': '/content/data/Output107.txt',
  'file_name': 'Output107.txt',
  'file_type': 'text/plain',
  'file_size': 1513,
  'creation_date': '2024-06-23',
  'last_modified_date': '2024-06-23'}}
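
Since the metadata is a dictionary keyed by node ID, the source file names can be pulled out with a few lines of Python. A minimal sketch, using a shortened copy of the output above as sample data; `source_files` is a hypothetical helper.

```python
def source_files(metadata: dict) -> list[str]:
    """Return the sorted, de-duplicated file names of the retrieved sources."""
    return sorted({info["file_name"] for info in metadata.values()})


# Shortened copy of the metadata printed above
sample_metadata = {
    "45aa8d77-9589-4de1-a10a-84f88925dc38": {
        "file_path": "/content/data/Output456.txt",
        "file_name": "Output456.txt",
    },
    "ab80444d-44e6-45ad-bfa5-08e196462463": {
        "file_path": "/content/data/Output107.txt",
        "file_name": "Output107.txt",
    },
}
print(source_files(sample_metadata))  # ['Output107.txt', 'Output456.txt']

# Usage in the notebook: source_files(response.metadata)
```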

And that's it! Well done 👍