# **Building an RAG Query pipeline with FAISS, and Ollama’s Llama 2 Model**

This project develops a Retrieval-Augmented Generation (RAG) query app that combines efficient retrieval with AI-driven responses. By integrating Chroma DB for data storage, FAISS for vector search, and Llama 2 via Ollama for response generation, it can deliver precise, context-aware answers to user queries. Each part of the code is structured for clarity, following best practices to ensure a seamless experience, from setup to deployment. Let's dive in and bring this RAG app to life!








In [None]:
%pip install faiss
%pip install langchain_ollama
%pip install faiss-cpu
%pip install --upgrade gradio


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-win_amd64.whl.metadata (4.5 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-win_amd64.whl (14.9 MB)
   ---------------------------------------- 14.9/14.9 MB 808.9 kB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.1.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
%pip install arxiv
%pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.13-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.13-cp39-abi3-win_amd64.whl (16.2 MB)
   ---------------------------------------- 16.2/16.2 MB 557.2 kB/s eta 0:00:00
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.13
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.1.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:

import langchain
import langchain.vectorstores
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain.vectorstores import faiss
from langchain_community.vectorstores import FAISS

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import ArxivLoader, PyPDFLoader
from faiss import IndexFlatL2
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain.document_transformers import LongContextReorder
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser



For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  from langchain_community.utilities.arxiv import ArxivAPIWrapper


In [3]:
import gradio as gr
from functools import partial
from operator import itemgetter
import json
from pprint import pprint

  from .autonotebook import tqdm as notebook_tqdm


In [22]:
text_splitter= RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " "],
    )

In [23]:
#import Llama2 
instruct_llm = ChatOllama(model="llama2", temperature=0.6, num_predict=256)

#Using Nvidia embeddings
embedder= OllamaEmbeddings(model="llama2")

In [24]:
docs=[
    ArxivLoader(query="1706.03762").load(),  ## Attention Is All You Need Paper
    #ArxivLoader(query="1810.04805").load(),  ## BERT Paper
    #ArxivLoader(query="2005.11401").load(),  ## RAG Paper
    #ArxivLoader(query="2205.00445").load(),  ## MRKL Paper
    #ArxivLoader(query="2310.06825").load(),  ## Mistral Paper
    ArxivLoader(query="2306.05685").load(),  ## LLM-as-a-Judge
    
    ]


In [25]:
for doc in docs:
    content=json.dumps(doc[0].page_content)
    if "References" in content:
        doc[0].page_content = content[:content.index("References")]

In [26]:
#Chunking the documents and remove very short chunks
print("Start chunking")
doc_chunks=[text_splitter.split_documents(doc) for doc in docs]
doc_chunks=[[c for c in dchunks if len(c.page_content)>200] for dchunks in doc_chunks]

Start chunking


In [27]:
#Adding the big-picture details
Doc_string="Available Documents: "
Doc_metadata=[]
for chunk in doc_chunks:
    metadata= getattr(chunk[0], 'metadata',{})
    Doc_string+= "\n - " + metadata.get('Title')
    Doc_metadata+= [str(metadata)]
    
BP_Chunks=  [Doc_string] + Doc_metadata

In [15]:
## Printing out some summary information for reference
print(Doc_string, '\n')
for i, chunks in enumerate(doc_chunks):
    print(f"Document {i}")
    print(f" - # Chunks: {len(chunks)}")
    print(f" - Metadata: ")
    pprint(chunks[0].metadata)
    print()

Available Documents: 
 - Attention Is All You Need
 - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena 

Document 0
 - # Chunks: 35
 - Metadata: 
{'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion '
            'Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin',
 'Published': '2023-08-02',
 'Summary': 'The dominant sequence transduction models are based on complex '
            'recurrent or\n'
            'convolutional neural networks in an encoder-decoder '
            'configuration. The best\n'
            'performing models also connect the encoder and decoder through an '
            'attention\n'
            'mechanism. We propose a new simple network architecture, the '
            'Transformer, based\n'
            'solely on attention mechanisms, dispensing with recurrence and '
            'convolutions\n'
            'entirely. Experiments on two machine translation tasks show these '
            'models to be\n'
            'superio

In [25]:
#Consctructing the vector store
vecstore=[FAISS.from_texts(BP_Chunks, embedding=embedder)]
vecstore+=[FAISS.from_documents(doc_chunk,embedding=embedder) for  doc_chunk in doc_chunks]  


In [26]:
embed_dims = len(embedder.embed_query("test"))
def default_FAISS():
    '''Useful utility for making an empty FAISS vectorstore'''
    return FAISS(
        embedding_function=embedder,
        index=IndexFlatL2(embed_dims),
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
        normalize_L2=False
    )

In [27]:
def aggregate_vstores(vectorstores):
    ## Initialize an empty FAISS Index and merge others into it
    ## We'll use default_faiss for simplicity.
    agg_vstore = default_FAISS()
    for vstore in vectorstores:
        agg_vstore.merge_from(vstore)
    return agg_vstore

In [28]:
## Unintuitive optimization; merge_from seems to optimize constituent vector stores away
docstore = aggregate_vstores(vecstore)

print(f"Constructed aggregate docstore with {len(docstore.docstore._dict)} chunks")

Constructed aggregate docstore with 82 chunks


In [29]:
convstore = default_FAISS()

def save_memory_and_get_output(d, vstore):
    """Accepts 'input'/'output' dictionary and saves to convstore"""
    vstore.add_texts([
        f"User previously responded with {d.get('input')}",
        f"Agent previously responded with {d.get('output')}"
    ])
    return d.get('output')

In [None]:
initial_msg = (
    "Hello! I am a document chat agent here to help the user!"
    f" I have access to the following documents: {Doc_string}\n\nHow can I help you?"
)


chat_prompt = ChatPromptTemplate.from_messages([("system",
    "You are a document chatbot. Help the user as they ask questions about documents."
    " User messaged just asked: {input}\n\n"
    " From this, we have retrieved the following potentially-useful info: "
    " Conversation History Retrieval:\n{history}\n\n"
    " Document Retrieval:\n{context}\n\n"
    " (Answer only from retrieval. Only cite sources that are used. Make your response conversational.)"
    "Be concize and precize to answer in less than 250 words"
), ('user', '{input}')])

In [31]:
def RPrint(preface=""):
    """Simple passthrough "prints, then returns" chain"""
    def print_and_return(x, preface):
        if preface: print(preface, end="")
        pprint(x)
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))



In [32]:
stream_chain = chat_prompt| RPrint() | instruct_llm | StrOutputParser()

def docs2str(docs, title="Document"):
    """Useful utility for making chunks into context string. Optional, but useful"""
    out_str = ""
    for doc in docs:
        doc_name = getattr(doc, 'metadata', {}).get('Title', title)
        if doc_name:
            out_str += f"[Quote from {doc_name}] "
        out_str += getattr(doc, 'page_content', str(doc)) + "\n"
    return out_str

In [33]:
## Reorders longer documents to center of output text
long_reorder = RunnableLambda(LongContextReorder().transform_documents)

retrieval_chain = (
    {'input' : (lambda x: x)}
    | RunnableAssign({'history' : itemgetter('input')| convstore.as_retriever()| long_reorder | docs2str })
    | RunnableAssign({'context' : itemgetter('input')|  docstore.as_retriever()| long_reorder | docs2str})
)

In [34]:
def chat_gen(message, history=[], return_buffer=True):
    buffer = ""
    ## First perform the retrieval based on the input message
    retrieval = retrieval_chain.invoke(message)
    line_buffer = ""

    ## Then, stream the results of the stream_chain
    for token in stream_chain.stream(retrieval):
        buffer += token
        ## If you're using standard print, keep line from getting too long
        yield buffer if return_buffer else token

    ## Lastly, save the chat exchange to the conversation memory buffer
    save_memory_and_get_output({'input':  message, 'output': buffer}, convstore)



In [35]:
## Start of Agent Event Loop
test_question = "Tell me about RAG!"  

## Before you launch your gradio interface, make sure your thing works
for response in chat_gen(test_question, return_buffer=False):
    print(response, end='')

ChatPromptValue(messages=[SystemMessage(content='You are a document chatbot. Help the user as they ask questions about documents. User messaged just asked: Tell me about RAG!\n\n From this, we have retrieved the following potentially-useful info:  Conversation History Retrieval:\n\n\n Document Retrieval:\n[Quote from Attention Is All You Need] .1, instead of 0.3.\\nFor the base models, we used a single model obtained by averaging the last 5 checkpoints, which\\nwere written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We\\nused beam search with a beam size of 4 and length penalty \\u03b1 = 0.6 [38]. These hyperparameters\\nwere chosen after experimentation on the development set. We set the maximum output length during\\ninference to input length + 50, but terminate early when possible [38].\\nTable 2 summarizes our results and compares our translation quality and training costs to other model\\narchitectures from the literature. We estimate the numb

In [36]:
chatbot = gr.Chatbot(value = [[None, initial_msg]])
demo = gr.ChatInterface(chat_gen, chatbot=chatbot).queue()

try:
    

    demo.launch(debug=True, share=True, show_api=False)
    
    demo.close()
except Exception as e:
    demo.close()
    print(e)
    raise e




* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://c13aa2db04d38efd90.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


ChatPromptValue(messages=[SystemMessage(content='You are a document chatbot. Help the user as they ask questions about documents. User messaged just asked: What is the core idea behind the LLM-a-Judge concept?\n\n From this, we have retrieved the following potentially-useful info:  Conversation History Retrieval:\n[Quote from Document] User previously responded with Tell me about RAG!\n[Quote from Document] Agent previously responded with Ah, another exciting topic related to natural language processing! RAG, or the Recent Advances in Generative models, is a fascinating area of research that has seen significant developments in recent years.\n\nRAG refers to the latest advancements in generative models, such as transformer-based architectures like BERT and RoBERTa, which have shown remarkable performance in various natural language processing tasks. These models are capable of generating coherent and contextually relevant text, thanks to their ability to learn from large amounts of dat

In [51]:
## Save and compress your index
docstore.save_local("Mydocstore_index")

In [52]:
!tar czvf Mydocstore_index.tgz Mydocstore_index

!rm -rf Mydocstore_index



a Mydocstore_index
a Mydocstore_index/index.faiss
a Mydocstore_index/index.pkl
'rm' is not recognized as an internal or external command,
operable program or batch file.


In [53]:
#Make sure the retreival from Mydocstore_index works
!tar xzvf Mydocstore_index.tgz
new_db = FAISS.load_local("Mydocstore_index", embedder, allow_dangerous_deserialization=True)
docs_test = new_db.similarity_search("Testing the index")
print(docs_test[0].page_content[:1000])

x Mydocstore_index/
x Mydocstore_index/index.faiss
x Mydocstore_index/index.pkl


. Our results indicate that using LLM-as-a-judge to approximate\nhuman preferences is highly feasible and could become a new standard in future benchmarks. We\nare also hosting a regularly updated leaderboard with more models 2. Notably, DynaBench [21], a\nresearch platform dedicated to dynamic data collection and benchmarking, aligns with our spirit.\nDynaBench addresses the challenges posed by static standardized benchmarks, such as saturation and\noverfitting, by emphasizing dynamic data with human-in-the-loop. Our LLM-as-a-judge approach\ncan automate and scale platforms of this nature.\n6\nDiscussion\nLimitations. This paper emphasizes helpfulness but largely neglects safety. Honesty and harm-\nlessness are crucial for a chat assistant as well [2]. We anticipate similar methods can be used to\nevaluate these metrics by modifying the default prompt


**Now our DocQuery pipeline works fine and stores locally the files successfully. In the next steps, we will use ChromaDB for effecient vector store and retreival. Furthermore, we will try to implement the concept of LLM-as-Judge to evaluate the results generated by our model.**

## **Using ChromaDB as Vector Store**

In [None]:
import chromadb

In [28]:
doc_chunks[0][0]

Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntra

In [None]:
#Initialize a chromaDB client
chroma_client=chromadb.Client()

In [18]:
collection= chroma_client.create_collection("This_is_a_research_paper_collection")


In [10]:
documents=[]
for i, doc in enumerate(doc_chunks):
    #Generate an embedding
    embedding=embedder.embed_documents(doc[0].page_content)
    
    document={
        "id":f"doc_{i}",
        "embedding": embedding,
        "text":doc[0].page_content,
        "metadata": doc[0].metadata      
    }
    documents.append(document)
    
    

In [11]:
print(len(documents))

2


In [12]:
print(documents)

[{'id': 'doc_0', 'embedding': [[0.009755322, -0.01926408, 0.008863539, 0.0008448122, -0.020804169, -0.005998144, 0.0072136475, -0.023670753, 0.02452533, -0.005019852, -0.0016919915, -0.008492391, 0.009254219, 0.008378187, -0.0044605196, 0.013003512, -0.018523658, 0.015078548, 0.019184208, -0.013521163, 0.012511935, -0.01633989, -0.017151237, -0.0056184684, 0.0092539145, 0.002122987, 0.009276425, 0.007788671, 0.0032514117, 0.0008590021, 0.0013625367, -0.008021739, -0.018879239, 0.008564272, -0.009087808, -0.014277264, -0.01456393, 0.0025856835, -0.014063155, -0.0023733499, -0.009481689, 0.018401893, 0.025337018, -0.0063798814, -0.024430888, 0.006744091, -0.010955901, -0.013315011, -0.0037905634, 0.012678949, -0.0034331274, 0.007996763, 0.002896165, -0.004052194, -0.017099557, 0.008977967, -0.023415381, 0.0022502155, 0.015317851, 0.0016659342, -0.0031318755, -0.013876448, -0.0029693202, -0.008296445, 0.015213418, 0.0038922185, -0.022887455, 0.019575207, 0.01629644, 0.007112028, 0.026366,

In [48]:
# print(documents[0].keys())
(documents[0])["text"]

'"Provided proper attribution is provided, Google hereby grants permission to\\nreproduce the tables and figures in this paper solely for use in journalistic or\\nscholarly works.\\nAttention Is All You Need\\nAshish Vaswani\\u2217\\nGoogle Brain\\navaswani@google.com\\nNoam Shazeer\\u2217\\nGoogle Brain\\nnoam@google.com\\nNiki Parmar\\u2217\\nGoogle Research\\nnikip@google.com\\nJakob Uszkoreit\\u2217\\nGoogle Research\\nusz@google.com\\nLlion Jones\\u2217\\nGoogle Research\\nllion@google.com\\nAidan N. Gomez\\u2217\\u2020\\nUniversity of Toronto\\naidan@cs.toronto.edu\\n\\u0141ukasz Kaiser\\u2217\\nGoogle Brain\\nlukaszkaiser@google.com\\nIllia Polosukhin\\u2217\\u2021\\nillia.polosukhin@gmail.com\\nAbstract\\nThe dominant sequence transduction models are based on complex recurrent or\\nconvolutional neural networks that include an encoder and a decoder. The best\\nperforming models also connect the encoder and decoder through an attention\\nmechanism'

Now, add some documents to the collection:

In [None]:
collection.add(
    documents=[doc["text"] for doc in documents],
    metadatas=[{"source":json.dumps(doc["metadata"])}for doc in documents ],
    ids=[doc["id"] for doc in documents]
)

Here, we will use simple query from the collection using the query() method withoutthe use of gradio for simplicity and testing:

In [None]:

results = collection.query(
    query_texts=[
        "What is the core idea behind the LLM-as-a-Judge concept?"
    ],
    n_results=2
)

print(results)

**Setup persistent Storage**


To store embeddings generated in a local storage, we created a local folder DB and linked it to ChromaDB.

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()

storage_path = os.getenv('STORAGE_PATH')
print(storage_path)
if storage_path is None:
    raise ValueError('STORAGE_PATH environment variable is not set')

client = chromadb.PersistentClient(path=storage_path)

collection = client.get_or_create_collection(name="test")


In [None]:
print(collection.count())

## **Trying Chroma from Langchain**

In [19]:
# from langchain_community.vectorstores import chroma
from langchain_chroma import Chroma

In [21]:
print(doc_chunks)

[[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\nt

In [29]:
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'Ch_db'

## here we are using OpenAI embeddings but in future we will swap out to local embeddings

doc=[d[0] for d in doc_chunks]
vectordb = Chroma.from_documents(documents=doc, 
                                 embedding=embedder,
                                 persist_directory=persist_directory)

In [31]:
# # persiste the db to disk
# vectordb.persist()
# vectordb = None

In [32]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedder)

In [33]:
retriever = vectordb.as_retriever()

In [34]:
docs = retriever.get_relevant_documents("what the innovative idea behind the transformers architecture?")

  docs = retriever.get_relevant_documents("what the innovative idea behind the transformers architecture?")
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


In [35]:
len(docs)

2

In [36]:
retriever.search_type

'similarity'

**Using ChromaDB either using Chroma or from LangChain vectorstore is working fine and retreiving relevant files with persistent storage. In the next and final step, we will be implementing LLM-as-a-Judge concept and evaluate the generated content based.**