# Why Locally?

## The Importance of Private LLMs

We don't just want a chatbot that can talk about what it was trained and fine-tuned on; we want a live, dynamic model that learns about our data, our needs, our information, and so on.

The answer is to leverage something called RAG, or retrieval-augmented generation, a technique introduced in [Lewis et al. 2020](https://arxiv.org/pdf/2005.11401.pdf). It is about augmenting LLMs with knowledge from documents such as PDFs, CSVs, and .txt files.

However, when we talk about RAG, the immediate question that comes to mind is: where would I be sending my data?

Privacy has been a concern since the birth of LLMs: people want to know where their data is going and, for obvious reasons, want control over who or what has access to it.

Once LLMs are integrated with our personal data, be it for research, work, or personal productivity, the problem becomes: how can I get around the need to send my data to some external private cloud provider?

The answer is to leverage open-source LLMs like Llama 2, connected to RAG systems that allow them to query and chat with your documents without your data leaving your computer.

# What is RAG?

## From Ctrl-F to Semantic Search

- RAG, or retrieval-augmented generation, is a procedure for augmenting a large language model's capabilities by giving it programmatic access to documents like PDFs, CSVs, and text files.

- To understand it, we need to understand what embeddings are.

- Embeddings are nothing more than vectorized representations of data (like text) in a space where distances have a meaningful connection to the semantic content of the data itself.

- What do I mean? (dog cat examples...)

- RAG systems leverage embeddings to build what are called vector stores, which are literally stores of vectors that let us index many embeddings in one queryable representation.

- Now, why not have just one big embedding? Because of the known context-length limitations of transformer-based models like ChatGPT.

- Now, more specifically this is how a RAG system would usually work:  (explain rag slides)
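To make the idea of "distance carries meaning" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The numbers are made up for illustration and do not come from any real embedding model; `cosine_similarity` and `most_similar` are hypothetical helper names. Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": invented values, chosen so that the two animals
# point in a similar direction and the vehicle points elsewhere.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def most_similar(word, store):
    """Return the other word in the store whose vector is closest to `word`."""
    query = store[word]
    scores = {
        other: cosine_similarity(query, vector)
        for other, vector in store.items()
        if other != word
    }
    return max(scores, key=scores.get)


print(most_similar("dog", toy_embeddings))
```

A vector store does essentially this lookup at scale: it keeps many vectors and returns the ones closest to the query vector.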


# Frameworks to Query Docs Locally with Llama2

- There are many frameworks, ranging from ready-to-use products like chatpdf.com or ChatGPT Plus plugins like AskYourPDF or AI Pdf, to less ready yet more open solutions like privateGPT, h2oGPT, localGPT, and many others.

- The way I like to think about these is on a spectrum of complexity/friction to access.

- On the one hand, you have something as simple as: upload your doc and use it! On the other, you have frameworks that facilitate setting up these query-your-docs systems for yourself, locally on your machine. We'll be focusing on the latter.


- We'll now learn the concepts and the standard pipeline for querying your docs with RAG systems, focusing on langchain + llama-cpp-python as our main frameworks.

- Now let's look at a basic RAG pipeline:

- Load the docs, split them into chunks, embed the chunks, index the vectors in a vector store, query the docs (find common framework)

- Ok, let's look at what that looks like in practice
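Before the real thing, the stages listed above can be sketched with nothing but the standard library. This is a toy stand-in, not how langchain or Chroma work internally: the "embedding" here is just a bag-of-words count, the splitter cuts on sentence boundaries, and every function name is hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Toy splitter: one chunk per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Toy embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Toy vector store: a list of (chunk, embedding) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def query(index, question, k=2):
    """Return the k chunks whose embeddings are closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# "Loading" is just a string literal here; a doc loader would read a file or URL.
document = (
    "Llamas are domesticated animals from South America. "
    "Vector stores index embeddings for similarity search. "
    "Chunking splits long documents into smaller pieces."
)

index = build_index(split_into_chunks(document))
top = query(index, "What do vector stores index?", k=1)
print(top)
```

In a full RAG system the retrieved chunks are then pasted into the LLM prompt as context, which is the step the notebook cells below perform with real components.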

In [20]:
# uncomment before running the rest
# !pip install langchain
# !pip install llama-cpp-python
# !pip install chromadb
# !pip install beautifulsoup4
# !pip install gpt4all
# !pip install langchainhub
# !pip install pypdf

Collecting pypdf
  Using cached pypdf-3.17.1-py3-none-any.whl.metadata (7.5 kB)
Using cached pypdf-3.17.1-py3-none-any.whl (277 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.17.1


In [1]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import LlamaCppEmbeddings
from langchain.chains import RetrievalQA
from langchain import hub

In [2]:
n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=4096,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [3]:
llm("Simulate a rap battle between Stephen Colbert and John Oliver")

.
The two comedians, known for their sharp wit and biting satire, square off in a hilarious rap battle.
Stephen Colbert: Yo, John, I heard you've been talking smack about me
John Oliver: That's right, Stephen, I ain't afraid to be bold and free
Colbert: Well, let me tell you something, my man, I'm the king of the night
Oliver: King? You mean like a lion in a zoo? That's cute, but I'm the real deal, Colbert
Colbert: Oh, snap! John Oliver, you better bring it, boy
Oliver: Don't get too cocky, Stephen, 'cause I'm on a roll
Colbert: You think you can take me down with your witty jokes and satire?
Oliver: You ain't seen nothing yet! My rhymes are like a cybernetic uprising, they'll leave you in the dust
Colbert: Well, I've got a few tricks up my sleeve, John Oliver, don't forget.
O


llama_print_timings:        load time =  6167.55 ms
llama_print_timings:      sample time =   197.49 ms /   256 runs   (    0.77 ms per token,  1296.28 tokens per second)
llama_print_timings: prompt eval time =  6167.51 ms /    13 tokens (  474.42 ms per token,     2.11 tokens per second)
llama_print_timings:        eval time =  8129.01 ms /   255 runs   (   31.88 ms per token,    31.37 tokens per second)
llama_print_timings:       total time = 15091.81 ms


".\nThe two comedians, known for their sharp wit and biting satire, square off in a hilarious rap battle.\nStephen Colbert: Yo, John, I heard you've been talking smack about me\nJohn Oliver: That's right, Stephen, I ain't afraid to be bold and free\nColbert: Well, let me tell you something, my man, I'm the king of the night\nOliver: King? You mean like a lion in a zoo? That's cute, but I'm the real deal, Colbert\nColbert: Oh, snap! John Oliver, you better bring it, boy\nOliver: Don't get too cocky, Stephen, 'cause I'm on a roll\nColbert: You think you can take me down with your witty jokes and satire?\nOliver: You ain't seen nothing yet! My rhymes are like a cybernetic uprising, they'll leave you in the dust\nColbert: Well, I've got a few tricks up my sleeve, John Oliver, don't forget.\nO"

## Query a Web Article

In [4]:
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

In [5]:
data

[Document(page_content='\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil\'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil\'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three: Tool Use\n\nCase Studies\n\nScientific Discovery Agent\n\nGenerative Agents Simulation\n\nProof-of-Concept Examples\n\n\nChallenges\n\nCitation\n\nReferences\n\n\n\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer an

In [6]:

data[0]

Document(page_content='\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil\'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil\'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three: Tool Use\n\nCase Studies\n\nScientific Discovery Agent\n\nGenerative Agents Simulation\n\nProof-of-Concept Examples\n\n\nChallenges\n\nCitation\n\nReferences\n\n\n\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and

Split the document into chunks

In [7]:
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)
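To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified character-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to cut on natural separators such as paragraph and line breaks, which this version ignores, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` characters over `text`,
    stepping forward by `chunk_size - chunk_overlap` each time."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.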

In [8]:
# Embed and store
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=LlamaCppEmbeddings(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
        n_gpu_layers=1,
        n_batch=512,
        n_ctx=2048,
        f16_kv=True,
    ),
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [9]:
# Retrieve
question = "How can Task Decomposition be done?"
docs = vectorstore.similarity_search(question)
print(len(docs))
docs

4



llama_print_timings:        load time =  3166.68 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   745.17 ms /    10 tokens (   74.52 ms per token,    13.42 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   747.06 ms


[Document(page_content='Resources:\n1. Internet access for searches and information gathering.\n2. Long Term memory management.\n3. GPT-3.5 powered Agents for delegation of simple tasks.\n4. File output.\n\nPerformance Evaluation:\n1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.\n2. Constructively self-criticize your big-picture behavior constantly.\n3. Reflect on past decisions and strategies to refine your approach.\n4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LL

Let's now pull a RAG prompt written specifically for Llama 2 models from the LangChain Hub. The prompt itself is this:



```
[INST]<<SYS>> You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]
```
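To see what happens to the `{question}` and `{context}` placeholders, here is a minimal sketch that fills the same template with plain Python string formatting. `build_prompt` is a hypothetical helper, not a langchain function; the chunk strings are placeholders, and joining chunks with a blank line is an assumption about how the context is assembled.

```python
RAG_TEMPLATE = (
    "[INST]<<SYS>> You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"
)


def build_prompt(question, chunks):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return RAG_TEMPLATE.format(question=question, context=context)


prompt = build_prompt(
    "How can Task Decomposition be done?",
    ["First retrieved chunk.", "Second retrieved chunk."],  # placeholder text
)
print(prompt)
```

The model only ever sees this final string, so the quality of the answer depends directly on which chunks the retriever puts into the context slot.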

In [10]:
# RAG prompt

QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")
print(QA_CHAIN_PROMPT)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"))]


In [11]:
# QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

question = "What are the various prompting techniques aimed at task decomposition described in the article?"
result = qa_chain({"query": question})


llama_print_timings:        load time =  3166.68 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  1622.09 ms /    18 tokens (   90.12 ms per token,    11.10 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  1623.81 ms
Llama.generate: prefix-match hit


  Based on the retrieved context, the various prompting techniques aimed at task decomposition described in the article are:
1. Chain of thought prompting: This technique elicits reasoning in large language models by presenting a sequence of prompts that guide the model to generate responses.
2. Tree of thoughts: This technique involves building a hierarchical structure of ideas, called a tree of thoughts, and using it to organize and structure the model's thinking.
3. Chain of hindsight: This technique involves presenting a series of past events or experiences to the model, which can then use these events to inform its current decision-making process.
4. ReAct: This technique combines reasoning and acting in language models by using a combination of natural language prompts and reinforcement learning to guide the model's behavior.
5. Memory stream: This technique involves using a long-term memory module (external database) to record a comprehensive list of agents' experiences in natur


llama_print_timings:        load time =  6167.55 ms
llama_print_timings:      sample time =   199.64 ms /   255 runs   (    0.78 ms per token,  1277.30 tokens per second)
llama_print_timings: prompt eval time =  7481.29 ms /  1345 tokens (    5.56 ms per token,   179.78 tokens per second)
llama_print_timings:        eval time =  9756.08 ms /   254 runs   (   38.41 ms per token,    26.04 tokens per second)
llama_print_timings:       total time = 17991.53 ms


In [12]:
print(result["result"])

  Based on the retrieved context, the various prompting techniques aimed at task decomposition described in the article are:
1. Chain of thought prompting: This technique elicits reasoning in large language models by presenting a sequence of prompts that guide the model to generate responses.
2. Tree of thoughts: This technique involves building a hierarchical structure of ideas, called a tree of thoughts, and using it to organize and structure the model's thinking.
3. Chain of hindsight: This technique involves presenting a series of past events or experiences to the model, which can then use these events to inform its current decision-making process.
4. ReAct: This technique combines reasoning and acting in language models by using a combination of natural language prompts and reinforcement learning to guide the model's behavior.
5. Memory stream: This technique involves using a long-term memory module (external database) to record a comprehensive list of agents' experiences in natur

## Query a PDF

In [13]:
from langchain.document_loaders import PyPDFLoader

docs = PyPDFLoader("./assets-resources/attention_paper.pdf").load_and_split()

In [14]:
docs

[Document(page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring

In [15]:
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=LlamaCppEmbeddings(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
        n_gpu_layers=1,
        n_batch=512,
        n_ctx=4096,
        f16_kv=True,
    ),
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [16]:
# QA chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

question = "How does the self-attention layer work?"
result = qa_chain({"query": question})
print(result["result"])

ggml_metal_free: deallocating

llama_print_timings:        load time =  3724.03 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  2607.08 ms /    11 tokens (  237.01 ms per token,     4.22 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  2609.65 ms
Llama.generate: prefix-match hit


  The self-attention layer in transformer models is a key component that allows the model to focus on specific parts of the input sequence when generating output. It works by computing a weighted sum of the input elements, where the weights are learned during training and reflect the relative importance of each element for the task at hand.
In more detail, the self-attention layer consists of three components: the query, the key, and the value. The query represents the context in which the attention is being applied, while the key and value represent the input elements that are being attended to. The attention weights are computed by taking the dot product of the query and key vectors and applying a softmax function to the resulting scores. These weights are then used to compute a weighted sum of the value vector, which forms the final output of the self-attention layer.
The self-attention layer is trained using a combination of masked language modeling and next sentence prediction tas


llama_print_timings:        load time =  6167.55 ms
llama_print_timings:      sample time =   185.29 ms /   256 runs   (    0.72 ms per token,  1381.59 tokens per second)
llama_print_timings: prompt eval time =  7414.09 ms /  1306 tokens (    5.68 ms per token,   176.15 tokens per second)
llama_print_timings:        eval time =  9811.12 ms /   255 runs   (   38.47 ms per token,    25.99 tokens per second)
llama_print_timings:       total time = 18034.94 ms


# References
- https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa
- https://python.plainenglish.io/super-quick-fine-tuning-llama-2-0-on-cpu-with-personal-data-d2d284559f
- https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8
- https://python.langchain.com/docs/integrations/llms/ollama