<a href="https://colab.research.google.com/github/FranklinChui/ml-llm-rag/blob/main/rag_pipeline_llama_cpp_chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with Local LLM & ChromaDB

Based on [this Machine Learning Mastery tutorial](https://machinelearningmastery.com/building-a-rag-pipeline-with-llama-cpp-in-python/).

Briefly, it demonstrates how **llama.cpp** enables inference of large language models (LLMs) on local devices, particularly on CPUs.

A full retrieval-augmented generation (RAG) pipeline is tested here, with the LLM running locally through llama.cpp.

ChromaDB is the vector database that stores the document embeddings.

The purpose is to learn and understand by trying things out.



### Env Setup

The installation can take some time, around 5-7 minutes.

In [1]:
# install pre-requisites
!pip install llama-cpp-python
!pip install langchain-community langchain-huggingface huggingface_hub[hf_xet]
!pip install langchain sentence-transformers chromadb
!pip install pypdf requests pydantic tqdm

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.8.tar.gz (67.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.8-cp311-cp311-linux_x86_64.whl size=5959570 sha256=957dc74bab852176cec4
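
Once the installs finish, a quick check like the one below can confirm that the packages are importable before going further. This is a minimal sketch: the module list is an assumption derived from the pip commands above (import names can differ from package names), and `missing_modules` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Import names assumed from the pip commands above.
REQUIRED_MODULES = [
    "llama_cpp",
    "langchain",
    "langchain_community",
    "langchain_huggingface",
    "sentence_transformers",
    "chromadb",
    "pypdf",
    "requests",
    "tqdm",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

If anything is reported as missing, rerunning the install cell is usually the first thing to try.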

### Load Imports

In [2]:
import os
from langchain_huggingface import HuggingFaceEmbeddings
# from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
import requests
from tqdm import tqdm
import time
from pprint import pprint

### Helper Function - download_model()

In [3]:
# helper function for downloading model
def download_model(_url: str = "", _filepath: str = "./"):
    """
    Download a model from HuggingFace.

    Args:
        _url (str, required): Fully qualified URL of the model file. An empty string raises ValueError.
        _filepath (str, optional): Directory in which to save the model. Defaults to the current directory.

    Returns:
        str: Filepath of the downloaded model.
    """

    if _url == "":
        raise ValueError("url must be provided")

    if _filepath == "":
        raise ValueError("filepath must be provided")
    elif not os.path.exists(_filepath):
        try:
            os.makedirs(_filepath)
        except Exception as e:
            raise ValueError(f"Invalid filepath: {e}")

    model_url = _url
    model_filepath = _filepath
    model_file = model_url.split("/")[-1]

    print(f"Downloading {model_file}...")

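    try:
        response = requests.get(model_url, stream=True)
        response.raise_for_status()  # fail on HTTP errors instead of saving an error page to disk
    except requests.RequestException as e:
        raise ValueError(f"Request error: {e}")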

    total_size = int(response.headers.get('content-length', 0))

    with open(os.path.join(model_filepath, model_file), 'wb') as f:
        for data in tqdm(response.iter_content(chunk_size=1024), total=total_size//1024):
            f.write(data)

    print("Download complete!")

    return os.path.join(model_filepath, model_file)
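
Since the model file is large, it can help to skip the download when the file is already on disk. The sketch below assumes the same naming rule as `download_model()` (last URL segment inside the target directory); `local_model_path` and `needs_download` are hypothetical helpers, and the example URL is a placeholder.

```python
import os


def local_model_path(url: str, directory: str = "models") -> str:
    """Derive the local path that download_model() would write to for this URL."""
    return os.path.join(directory, url.split("/")[-1])


def needs_download(url: str, directory: str = "models") -> bool:
    """Return True when no non-empty file exists yet at the expected path."""
    path = local_model_path(url, directory)
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)


# Placeholder URL for illustration only.
example_url = "https://example.com/some-model.gguf"
print(local_model_path(example_url), needs_download(example_url))
```

A caller could then wrap the download step as `if needs_download(url): download_model(_url=url, _filepath="models")`. Note that this does not detect a partially downloaded file.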



### Download Model

Instead of the 7B model, a 1B model works just fine.

In [4]:
# original model; 7B seemed too large
# model_path = "TheBloke/Llama-2-7B-Chat-GGUF"

# try something smaller
# the Q4_K_M quantization gave me rubbish responses
# model_url = "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
model_url = "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf"

model_file = download_model(_url=model_url, _filepath="models")

Downloading tinyllama-1.1b-chat-v1.0.Q8_0.gguf...


1143342it [00:27, 41947.62it/s]                             

Download complete!
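
The progress counter above reports 1,143,342 chunks of 1 KiB, which is about 1.17 GB. As a rough cross-check, the sketch below estimates the size of a Q8_0 file from the parameter count. It assumes the Q8_0 layout of 32 one-byte weights plus a 2-byte scale per block and ignores file metadata and any tensors stored in other formats, so it is only an approximation.

```python
# Assumed Q8_0 layout: blocks of 32 weights, each block = 32 int8 values + one 2-byte scale.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 + 2


def q8_0_size_bytes(n_params: float) -> float:
    """Rough file size for a Q8_0 model, ignoring metadata and unquantized tensors."""
    return n_params / BLOCK_WEIGHTS * BLOCK_BYTES


estimated = q8_0_size_bytes(1.1e9)  # "1.1B" taken from the model name
observed = 1143342 * 1024           # chunk count from the tqdm output above, 1 KiB each
print(f"estimated ~{estimated / 1e9:.2f} GB, observed ~{observed / 1e9:.2f} GB")
```

The two figures land close together, which suggests the download completed in full.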





### Sample input texts

In [5]:
old_text="""
    Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then
    using that information to generate more accurate and informed responses.

    RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context
    for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.

    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models
    on consumer hardware without requiring high-end GPUs.

    LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible
    with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services.
"""
new_text="""
The transformer is a deep learning architecture that was developed by researchers at Google and is based on the multi-head attention mechanism, which was proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLM) on large (language) datasets.

Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multimodal learning, robotics, and even playing chess. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (bidirectional encoder representations from transformers).
"""


### Write sample text files and load them into memory

In [6]:
# setup document base
os.makedirs("docs", exist_ok=True)

# Sample text files
documents = []
with open("docs/sample1.txt", "w") as f:
    f.write(old_text)
with open("docs/sample2.txt", "w") as f:
    f.write(new_text)

for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join("docs", file))
        documents.extend(loader.load())
    elif file.endswith(".txt"):
        loader = TextLoader(os.path.join("docs", file))
        documents.extend(loader.load())



In [7]:
pprint(documents)

[Document(metadata={'source': 'docs/sample1.txt'}, page_content="\n    Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches\n    for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then\n    using that information to generate more accurate and informed responses.\n\n    RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context\n    for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.\n\n    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models\n    on consumer hardware without requiring high-end GPUs.\n\n    LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible\n    with OpenAI's interfaces, all

### Chunking

In [8]:
# chunking - Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_documents(documents)

In [9]:
pprint(chunks)

[Document(metadata={'source': 'docs/sample1.txt'}, page_content="Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches\n    for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then\n    using that information to generate more accurate and informed responses.\n\n    RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context\n    for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.\n\n    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models\n    on consumer hardware without requiring high-end GPUs."),
 Document(metadata={'source': 'docs/sample1.txt'}, page_content="The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allow
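
The second chunk above starts with text that also closes the first chunk; that repetition is the effect of `chunk_overlap`. The sketch below shows only the overlap idea with a fixed-size sliding window. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break on separators such as paragraphs and lines; `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_size=1000` and `chunk_overlap=200`, each window would advance by 800 characters, so a sentence cut at a boundary still appears whole in one of the neighbouring chunks.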

### Setup ChromaDB vector store with huggingface embedding model

In [10]:
# build vector store for the text embeddings
# all-MiniLM-L6-v2 is a public model, so a Hugging Face token should be optional here
embeddings_model = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model)

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
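
Conceptually, the vector store keeps one embedding per chunk and returns the chunks whose vectors are closest to the query vector. The sketch below illustrates that lookup with cosine similarity over hand-made 3-dimensional vectors. The vectors and names are invented for illustration, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the k document names whose vectors are most similar to the query vector."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors; real embeddings from all-MiniLM-L6-v2 have many more dimensions.
toy_vectors = {
    "rag": [0.9, 0.1, 0.0],
    "llama_cpp": [0.1, 0.9, 0.1],
    "transformer": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
```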


### Initialize LlamaCpp

In [11]:
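# LlamaCpp object
# note: n_ctx=4096 exceeds TinyLlama's 2048-token training context, which triggers the warning in the output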

llm = LlamaCpp(
    model_path=model_file,
    temperature=0.7,
    max_tokens=2000,
    n_ctx=4096,
    verbose=False
)

llama_init_from_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_init_from_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
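
The second warning says the requested context (4096) is larger than the context the model was trained with (2048). Because the prompt and the generated tokens share the same context window, the arithmetic below may help when picking values. It is a sketch only: `prompt_budget` is a hypothetical helper, and 512 is an arbitrary example value, not a recommendation from the source.

```python
def prompt_budget(n_ctx: int, max_tokens: int) -> int:
    """Tokens left for the prompt (template + retrieved chunks + question) after reserving the answer."""
    return n_ctx - max_tokens


print(prompt_budget(4096, 2000))  # settings used in the cell above
print(prompt_budget(2048, 2000))  # same max_tokens inside the 2048-token training context
print(prompt_budget(2048, 512))   # hypothetical smaller answer limit
```

Staying within the training context while keeping `max_tokens=2000` would leave almost no room for retrieved chunks, which is one reason to lower `max_tokens` if `n_ctx` is reduced.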


### Prompt Template Setup

In [12]:
# RAG prompt template
# define how the retrieved context and user query are combined into a single, well-structured input for the LLM during inference
template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
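
To see what the model actually receives, the same template can be filled with plain `str.format`. This is a sketch assuming that `PromptTemplate` with simple `{name}` placeholders substitutes values the same way; the context and question strings are sample values.

```python
template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer:
"""

# Sample values standing in for the retrieved chunks and the user query.
filled = template.format(
    context="The llama.cpp library is a C/C++ implementation of Meta's LLaMA model.",
    question="What is llama.cpp?",
)
print(filled)
```

Ending the prompt with `Answer:` nudges the model to continue with the answer itself rather than restating the question.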

### Pipeline Setup

In [13]:
# rag pipeline
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
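
The `"stuff"` chain type roughly means: retrieve the top `k` chunks, join them into one context string, fill the prompt, and call the model once. The sketch below mimics that flow with stand-in functions so it runs without a model. `stuff_answer`, `fake_retrieve`, and `fake_llm` are hypothetical names, and joining chunks with a blank line is an assumption about the default separator.

```python
TEMPLATE = (
    "Answer the question based on the following context:\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)


def stuff_answer(question, retrieve, llm, template=TEMPLATE, k=3):
    """Retrieve k chunks, stuff them into a single prompt, and call the model once."""
    docs = retrieve(question)[:k]
    context = "\n\n".join(docs)  # assumed separator between chunks
    prompt = template.format(context=context, question=question)
    return {"result": llm(prompt), "source_documents": docs}


def fake_retrieve(question):
    """Stand-in retriever returning fixed text chunks."""
    return ["RAG combines retrieval and generation.", "llama.cpp runs LLaMA models on CPUs."]


def fake_llm(prompt):
    """Stand-in model that only reports how much text it received."""
    return f"(model received {len(prompt)} characters)"


print(stuff_answer("What is RAG?", fake_retrieve, fake_llm))
```

The returned dictionary mirrors the `result` and `source_documents` keys that the real pipeline exposes when `return_source_documents=True`.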

### Helper Function - ask_question()

In [14]:
def ask_question(question):
    """Run the RAG pipeline on a question and print the answer, elapsed time, and sources."""
    start_time = time.time()
    # result = rag_pipeline({"query": question})
    result = rag_pipeline.invoke({"query": question})

    end_time = time.time()

    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print("\nSource documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Document {i+1}:")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"Content: {doc.page_content[:150]}...\n")

### Test out RAG (**locally**)

In [15]:
ask_question("What is RAG and how does it work?")


Question: What is RAG and how does it work?
Answer: Retrieval-Augmenteed Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for natural languaage processing tasks. It involves retrieving relevant information from a knowledge base and then using that information to generate more accurate and informed responses. RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context for language generation. This approach helps ground the model's responses in factual information and reduces hallucination. The Llama.cpp library is a C/C++ implementation of Meta's LLaMA model optimized for CPU usage without requiring high-end GPUs.
Time taken: 62.23 seconds

Source documents:
Document 1:
Source: docs/sample1.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 2:
Source: docs/samp

In [16]:
ask_question("What is llama.cpp?")


Question: What is llama.cpp?
Answer: Llamas are a type of llama, one of the most popular farm animals in Ecuador. Llama.cpp is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models on consumer hardware without requiring high-end GPUs. LLama is a term used to refer to the llama.cpp library itself.
Time taken: 41.03 seconds

Source documents:
Document 1:
Source: docs/sample1.txt
Content: The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models
    on consumer hardwar...

Document 2:
Source: docs/sample1.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 3:
Source: docs/sample2.txt
Content: Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since. They ...



In [17]:
ask_question("How does LocalAI relate to cloud AI services?")

Question: How does LocalAI relate to cloud AI services?
Answer: LocalAI is a framework that enables running AI models locally without rely on cloud services. It provides APIs compatible with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services.

Cloud AI services are designed to process and analyze large amounts of data in real-time, using specialized hardware (like GPUs) and software (like TensorFlow). This requires significant resources and computational power that consumers may not have access to or require more advanced computing capabilities than local servers. LocalAI provides a cost-effective alternative for running AI models locally without the need for high-end GPUs, making it an attractive solution for developers who want to use their own models on consumer hardware without relying on cloud services.
Time taken: 50.02 seconds

Source documents:
Document 1:
Source: docs/sample1.txt
Content: The llama.cpp library