# Retrieval-Augmented Generation (RAG) with a llama.cpp Quantized Model

### Install llama.cpp, llama-cpp-python, and chromadb
In my previous video, I showed how to build a quantized model with llama.cpp.

In this notebook, you will see how to do RAG with a quantized model so that you can query your own documents.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.64 --no-cache-dir

pip install chromadb 
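Before running the cells below, it can help to confirm that the packages are importable in the current kernel. This is a minimal sketch, not part of the original notebook: the list of import names is an assumption based on the imports used in later cells (`langchain`, `torch`, and `sentence_transformers` are not installed by the two commands above), and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Import names assumed from the cells in this notebook; adjust as needed.
required = ["llama_cpp", "chromadb", "langchain", "torch", "sentence_transformers"]

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
print("Missing packages:", missing or "none")
```

Anything reported as missing would need a `pip install` before continuing.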

##### Step 1: Instantiate an embedding model, which will later be used to store data in the vector DB

In [1]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

##### Step 2: Process Custom Content into Chunks

In [2]:

from langchain.document_loaders import WebBaseLoader

##loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
loader = WebBaseLoader("https://www.quadratics.com/MLOPSimplified.html")

data = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
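To build intuition for what `chunk_size=500` and `chunk_overlap=0` mean, here is a simplified sketch of fixed-size character chunking. It is not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as paragraph breaks, newlines, and spaces; `split_text` is a hypothetical helper and the sample text is a placeholder.

```python
def split_text(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Placeholder text standing in for the loaded web page.
sample = "MLOps simplified. " * 100
chunks = split_text(sample, chunk_size=500, chunk_overlap=0)
print(len(chunks), [len(c) for c in chunks])
```

With zero overlap, as in the cell above, every character lands in exactly one chunk. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when an answer straddles a chunk boundary.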

##### Step 3: Store the custom content into a Vector DB (Chroma)

In [3]:
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=embed_model)



##### Step 4: Set parameters for the llama.cpp quantized model and instantiate it

In [4]:
from langchain.embeddings import LlamaCppEmbeddings
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
n_gpu_layers = 32  # Number of layers to offload to the GPU (the 7B model has 32 layers).
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of memory available on your GPU.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [5]:
#llama = LlamaCppEmbeddings(model_path="/data/llama.cpp/models/llama-2-7b-chat/ggml-model-q4_0.bin")
llm = LlamaCpp(
    model_path="/data/llama.cpp/models/llama-2-7b-chat/ggml-model-q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # Must be True; otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=False,
)


ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080
  Device 1: NVIDIA GeForce GTX 1080
llama.cpp: loading model from /data/llama.cpp/models/llama-2-7b-chat/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce GTX 1080) as main device
llama_model_load_internal: mem required  = 1964.9

##### Step 5: Do a similarity search on the vector DB to retrieve data related to the query

In [7]:
question = "what accelerators did quadratic build"
docs = vectorstore.similarity_search(question)
#result = llm_chain(docs)
docs

[Document(page_content='Training models at scale   \n Data acquisition for exploratory analysis \n Consistent interface for training and serving\nDeployment to production\nMonitoring the model performance\n\nSolutions Delivered:\n\r\n                        Quadratic has built a set of accelerators for enabling Ml/AI Model Lifecycle as a MLOPS suite.  \r\n                         This platform enabled the customer to quickly build models, train and deploy in a repeatable fashion.\nOutcomes', metadata={'description': '', 'language': 'en', 'source': 'https://www.quadratics.com/MLOPSimplified.html', 'title': 'Quadratics'}),
 Document(page_content='Quadratics\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\n\n\r\n              Competencies \n\n\nML/AL Platforms \nData Engineering \nCloud Adoption\n\n\n\nGlobal Workforce\n\n\nPlatforml\n\n\nContact us\n\n\n\n\n\r\n                         Insights \n\n\nCase Studies\n\n\n\n\n\n\n\n\n\n\nMLOPs Simplified', metadata={'description': '', 'languag
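Conceptually, a similarity search embeds the question and ranks the stored chunks by how close their vectors are to it. The sketch below illustrates that idea with cosine similarity in plain Python. It is an assumption-laden illustration, not how Chroma is implemented: Chroma's index and distance metric may differ, the 3-dimensional vectors are made up (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings for illustration only.
store = [
    ("accelerators for the ML model lifecycle", [0.9, 0.1, 0.0]),
    ("site navigation menu", [0.0, 0.2, 0.9]),
    ("training models at scale", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]

for text, score in top_k(query, store):
    print(f"{score:.3f}  {text}")
```

The second result above shows why ranking matters: a page's navigation text is also stored as a chunk, and it should score lower than chunks that actually discuss the question.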

##### Step 6: Create a RAG pipeline that grounds the query in the custom data

In [10]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
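The `'stuff'` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea only. The prompt wording is an illustrative template, not necessarily the exact text LangChain uses, and `build_stuff_prompt` is a hypothetical helper. The example chunk is taken from the page content retrieved in Step 5.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

retrieved = [
    "Quadratic has built a set of accelerators for enabling Ml/AI Model Lifecycle as a MLOPS suite.",
    "This platform enabled the customer to quickly build models, train and deploy in a repeatable fashion.",
]
prompt = build_stuff_prompt("what accelerators did quadratic build", retrieved)
print(prompt)
```

Because everything is concatenated, the combined chunks plus the question must fit within the model's context window (`n_ctx=2048` in Step 4), which is one reason to keep chunks small.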

In [11]:
rag_pipeline("what accelerators did quadratic build")

 Quadratic built a set of accelerators for enabling Ml/AI Model Lifecycle as an MLOPS suite.

{'query': 'what accelerators did quadratic build',
 'result': ' Quadratic built a set of accelerators for enabling Ml/AI Model Lifecycle as an MLOPS suite.'}

In [12]:
rag_pipeline("how do the accelerators built by Quadratic help their customers")

  Accelerators created by Quadratic enable Ml/AI Model Lifecycle as a MLOPS suite, enabling the customer to quickly build models, train and deploy in a repeatable fashion.

{'query': 'how do the accelerators built by Quadratic help their customers',
 'result': '  Accelerators created by Quadratic enable Ml/AI Model Lifecycle as a MLOPS suite, enabling the customer to quickly build models, train and deploy in a repeatable fashion.'}

In [13]:
# For comparison: the same question sent straight to the model, without retrieval
llm("what accelerators did quadratic build")

?
 nobody knows when or if quadratic will launch. the company has not provided any updates on its launch plans, and its website is no longer active.
Quadratic was a startup that aimed to build a decentralized exchange (DEX) for non-fungible tokens (NFTs). The platform was designed to provide a more secure and reliable way of trading NFTs compared to traditional centralized exchanges. However, the project appears to have been abandoned, and no further information is available on its launch plans or development progress.
Quadratic's conceptual design involved using smart contracts to enable decentralized trading of NFTs without the need for intermediaries. The platform was expected to offer a range of features, including support for multiple blockchain networks, an intuitive user interface, and automated liquidity provision through quadratic funding.
While Quadratic's idea was innovative, it faced significant challenges in terms of scalability, security, and regulatory compliance. The de

"?\n nobody knows when or if quadratic will launch. the company has not provided any updates on its launch plans, and its website is no longer active.\nQuadratic was a startup that aimed to build a decentralized exchange (DEX) for non-fungible tokens (NFTs). The platform was designed to provide a more secure and reliable way of trading NFTs compared to traditional centralized exchanges. However, the project appears to have been abandoned, and no further information is available on its launch plans or development progress.\nQuadratic's conceptual design involved using smart contracts to enable decentralized trading of NFTs without the need for intermediaries. The platform was expected to offer a range of features, including support for multiple blockchain networks, an intuitive user interface, and automated liquidity provision through quadratic funding.\nWhile Quadratic's idea was innovative, it faced significant challenges in terms of scalability, security, and regulatory compliance. T