<a href="https://colab.research.google.com/github/RDGopal/IB9CW0-Text-Analytics/blob/main/day_seven_simple_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple RAG System using Llama Index
In this tutorial we will build a full retrieval-augmented generation (RAG) pipeline using Llama Index. We will install a small, open-source LLM from Microsoft, [Phi-3](https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/), and run it locally on our Colab instance to do the generation. To augment the results, we will give the LLM access to our ArXiv abstracts, this time stored in [Chroma DB](https://www.trychroma.com/) as a vector store/document store. Finally, we use [Ollama](https://ollama.com/) as the interface for interacting with Phi-3 on our machine.

We'll start by installing Ollama:

In [None]:
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  caca-utils chafa fonts-droid-fallback fonts-noto-mono fonts-urw-base35
  ghostscript gsfonts image

Next we do a bit of work to get Ollama running in the background of our Colab (Linux) instance. Don't worry too much about this code, as it is mostly Linux/Bash commands.

In [None]:
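# Set up the model name as a global variable
# Note: in the Ollama library the "phi" tag refers to Phi-2;
# the Phi-3 models are published under the "phi3" tag
OLLAMA_MODEL='phi:latest'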

# Add the model to the environment of the operating system
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL # print the global variable to check it saved

import subprocess
import time

# Start the Ollama server ("serve")
command = "nohup ollama serve &" # "nohup" and "&" mean run in the background

# Use subprocess.Popen to run the command
# The server's output is discarded: a PIPE that is never read can fill up and stall the server
process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

print("Process ID:", process.pid) # print the ID of the shell that launched the server
time.sleep(5)  # Makes Python wait for 5 seconds while the server starts

!ollama -v # print the Ollama version number as a check

phi:latest
Process ID: 6151
ollama version is 0.1.32
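
A fixed five-second pause usually works, but it is a guess about how long the server takes to start. As a hedged alternative, the sketch below polls the server until it answers. The helper names `ollama_is_up` and `wait_until_ready` are our own (not part of Ollama), and we assume the server replies to a plain GET on the address printed by the installer (`127.0.0.1:11434`).

```python
import time
import urllib.error
import urllib.request

# Assumption: the default address reported by the Ollama installer
OLLAMA_URL = "http://127.0.0.1:11434"

def ollama_is_up(url=OLLAMA_URL):
    """Return True if a GET request to the server succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, etc.

def wait_until_ready(probe=ollama_is_up, timeout=30.0, interval=0.5):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, `wait_until_ready()` could stand in for `time.sleep(5)`: it returns `True` as soon as the server responds and `False` if it never does.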


Now that everything is set up, we can query the model and get it to generate some text about Warwick Business School (WBS):

In [None]:
# Query the model via the command line
# The first run will "pull" (download) the model
!ollama run $OLLAMA_MODEL "Give me a short summary of Warwick Business School."

pulling manifest
pulling 04778965089b...   0% ▕▏    0 B/1.6 GB
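
The command line is not the only way in: the Ollama server also exposes an HTTP API, which is what Python clients talk to. The sketch below builds a request for the `/api/generate` endpoint using only the standard library. The helper names are ours, the host is the default address printed by the installer, and we assume the model has already been pulled.

```python
import json
import urllib.request

def build_generate_request(model, prompt, host="http://127.0.0.1:11434"):
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON reply instead of a token stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=240) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate(OLLAMA_MODEL, "Give me a short summary of Warwick Business School.")` should return text comparable to the command-line call above.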

With everything seemingly working, we can start to build our RAG pipeline. First, we install the required packages and import the modules we need:

In [None]:
# Install prerequisites
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index-vector-stores-chroma
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install chromadb

# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings, StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Import the Ollama class
from llama_index.llms.ollama import Ollama

# Import ChromaVectorStore and the chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.0-py3-none-any.whl (7.1 kB)
Collecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-embeddings-huggingface)
  Downloading llama_index_core-0.10.33-py3-none-any.whl (15.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.4/15.4 MB 48.4 MB/s eta 0:00:00
Collecting sentence-transformers<3.0.0,>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 171.5/171.5 kB 17.8 MB/s eta 0:00:00
Collecting dataclasses-json (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface)
  Downloading dataclasses_json-0.6.5-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface)
  Downloading Deprecated-1.2.14-py2.py3-n

With everything installed, we can set up our LLM in Python:

In [None]:
# Use the global variable (OLLAMA_MODEL) as our LLM
# Set a timeout of 4 minutes
llm = Ollama(model=OLLAMA_MODEL, request_timeout=240.0)

We have a generator, but now we need a retriever to augment our results. As before, we will extract 500 abstracts from ArXiv as our documents. This is basically the same code as in the previous notebook, except that, rather than building a Pandas DataFrame, we store each abstract as a separate document:

In [None]:
!mkdir -p '/content/data/' # create an empty directory called "data"
!pip install arxiv

import arxiv

# Number of records
n_records = 500

# Construct the default API client.
client = arxiv.Client()

# Search for the n most recent articles matching the keyword "text analytics."
search = arxiv.Search(
  query = "text analytics", # search query
  max_results = n_records, # number of records to return - defined above
  sort_by = arxiv.SortCriterion.SubmittedDate # sort order (submission date)
)

results = client.results(search) # collect results (fetched lazily)

for count, r in enumerate(results): # iterate through the results, counting from 0
  doc = r.summary # the abstract text
  # create a file name as a combination of the path ("/content/data/") ...
  # the name "Output", a number that increases with each abstract (0-499) ...
  # and the file type (".txt").
  # e.g. the first file will be "/content/data/Output0.txt" ...
  # and the last will be "/content/data/Output499.txt"
  fname = "/content/data/Output" + str(count) + ".txt"
  with open(fname, "w", encoding="utf-8") as text_file:
    text_file.write(doc) # save the file

Collecting arxiv
  Downloading arxiv-2.1.0-py3-none-any.whl (11 kB)
Collecting feedparser==6.0.10 (from arxiv)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.1/81.1 kB 3.2 MB/s eta 0:00:00
Collecting sgmllib3k (from feedparser==6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6049 sha256=aa3251d39dc95d6cc538222d117dfea00250d9fac6967b37cc1274f7122a9e0e
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sgmllib3k,

With all our documents created, we can check the first one to make sure everything worked:

In [None]:
# "with" closes the file automatically when the block ends
with open('/content/data/Output0.txt', 'r') as first_file:
  file_contents = first_file.read() # read the document
print(file_contents)

Generative models have enabled intuitive image creation and manipulation
using natural language. In particular, diffusion models have recently shown
remarkable results for natural image editing. In this work, we propose to apply
diffusion techniques to edit textures, a specific class of images that are an
essential part of 3D content creation pipelines. We analyze existing editing
methods and show that they are not directly applicable to textures, since their
common underlying approach, manipulating attention maps, is unsuitable for the
texture domain. To address this, we propose a novel approach that instead
manipulates CLIP image embeddings to condition the diffusion generation. We
define editing directions using simple text prompts (e.g., "aged wood" to "new
wood") and map these to CLIP image embedding space using a texture prior, with
a sampling-based approach that gives us identity-preserving directions in CLIP
space. To further improve identity preservation, we project these dire
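
Reading one file by eye is a good start. As a rough sketch of a fuller check, the helper below counts the `.txt` files in a folder and reports any that are empty. The name `summarise_text_files` is our own, and the demo runs on a throwaway folder standing in for `/content/data`.

```python
import tempfile
from pathlib import Path

def summarise_text_files(folder):
    """Return (number of .txt files, names of the empty ones) for a folder."""
    paths = sorted(Path(folder).glob("*.txt"))
    empty = [p.name for p in paths if not p.read_text(encoding="utf-8").strip()]
    return len(paths), empty

# Demo on a temporary folder with one good file and one empty file
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "Output0.txt").write_text("An abstract.", encoding="utf-8")
    Path(demo_dir, "Output1.txt").write_text("", encoding="utf-8")
    demo_result = summarise_text_files(demo_dir)

print(demo_result)  # (2, ['Output1.txt'])
```

In the notebook, `summarise_text_files("/content/data")` should report 500 files and, ideally, an empty list.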

Now we will use Llama Index's [SimpleDirectoryReader](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/) to load all the documents. After this we will specify an embedding model from HuggingFace and add all of this to Llama Index's settings:

In [None]:
# Load documents
reader = SimpleDirectoryReader("/content/data") # load documents from the /data folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Register the LLM and embedding model in LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model

Loaded 500 docs


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Now we will set up a vector database (this time using Chroma DB) and index our documents (via embeddings). As in the previous tutorial, this will allow us to search for the documents most relevant to our query:

In [None]:
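# Create a persistent Chroma client ("db") that stores its data in "./chroma_db"
db = chromadb.PersistentClient(path="./chroma_db")

# Get the collection/table ("demo-for-ram") in the db, creating it if needed
# (create_collection raises an error if this cell is re-run)
chroma_collection = db.get_or_create_collection("demo-for-ram")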

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
index = VectorStoreIndex.from_documents(
    docs, # the abstract documents (e.g. "Output0.txt")
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

name='demo-for-ram' id=UUID('02ca3334-f2a6-4b6d-bddb-dab2e58d9ebc') metadata=None tenant='default_tenant' database='default_database'
Collection name is: demo-for-ram
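
To build some intuition for what searching an index "via embeddings" means, here is a toy sketch of nearest-neighbour retrieval. The three-dimensional vectors are made up for illustration (real `bge-small-en-v1.5` embeddings have 384 dimensions), and cosine similarity is just one common choice of metric: the one Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up "embeddings" for three documents and one query
toy_index = {
    "Output0.txt": [0.9, 0.1, 0.0],
    "Output1.txt": [0.1, 0.8, 0.3],
    "Output2.txt": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

best = top_k(query_vec, toy_index, k=2)
print([name for name, _ in best])  # ['Output0.txt', 'Output2.txt']
```

The vector store does the same job at scale: it embeds the query, compares it against the stored document vectors, and hands back the closest matches.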


Now that everything is set up, we can start (retrieval-augmented) generating! Let's try a simple query about LLMs:

In [None]:
query_engine = index.as_query_engine() # use the vector db for queries

response = query_engine.query("What are LLMs?") # query the LLM with retrieved context
response.response # display the response text

' Given the context information provided, I can give a general definition of Large Language Models (LLMs) as models that use deep learning algorithms to process massive amounts of text data and generate outputs such as text generation, summarization, and translation. They have been widely adopted in various industries but also exhibit inconsistencies and lack reasoning capabilities, leading to inaccuracies in their responses.\n'

And there we have it! A working RAG pipeline running in our notebook. To better understand the process behind this output, we can inspect the full response object. Alongside the response above, it also shows the two documents retrieved to augment the generation:

In [None]:
response # print the full output from the LLM

Response(response=' Given the context information provided, I can give a general definition of Large Language Models (LLMs) as models that use deep learning algorithms to process massive amounts of text data and generate outputs such as text generation, summarization, and translation. They have been widely adopted in various industries but also exhibit inconsistencies and lack reasoning capabilities, leading to inaccuracies in their responses.\n', source_nodes=[NodeWithScore(node=TextNode(id_='041fef61-0cb1-4e95-8d5d-6a2f9614b152', embedding=None, metadata={'file_path': '/content/data/Output487.txt', 'file_name': 'Output487.txt', 'file_type': 'text/plain', 'file_size': 1299, 'creation_date': '2024-05-02', 'last_modified_date': '2024-05-02'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_acces
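
The `source_nodes` above are the retrieved abstracts. Conceptually, the query engine pastes their text into the prompt before sending it to the LLM. The sketch below illustrates that "augmentation" step. The function `build_rag_prompt` and the template wording are our own illustration, not necessarily the exact prompt Llama Index uses, and the two abstracts are placeholders.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text and the user's question into a single prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {question}\n"
        "Answer: "
    )

# Placeholder text standing in for the two retrieved abstracts
chunks = [
    "Placeholder abstract about large language models.",
    "Placeholder abstract about text analytics.",
]
prompt = build_rag_prompt("What are LLMs?", chunks)
print(prompt)
```

This is why the response begins with "Given the context information provided": the model is answering a prompt that contains both the retrieved abstracts and our question.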

And that's it! Well done 👍