# Why Locally?

## The Importance of Private LLMs

We don't just want a chatbot that can talk about what it was trained and fine-tuned on; we want a live, dynamic model that learns about our data, our needs, our information, and so on.

One answer is to leverage RAG, or retrieval-augmented generation, a technique introduced in [Lewis et al. 2020](https://arxiv.org/pdf/2005.11401.pdf). It is about augmenting LLMs with knowledge from documents such as PDFs, CSVs, and .txt files.

However, when we talk about RAG, the immediate question that comes to mind is: where would I be sending my data?

Privacy has been a concern since the birth of LLMs, because people want to know where their data is going and, for obvious reasons, want control over who or what has access to it.

As we integrate LLMs with our personal data, be it for research, work, or personal productivity, the problem becomes: how can I get around the need to send my data to some external cloud provider?

The answer is to leverage open source LLMs like Llama2 connected to RAG systems that allow them to query and chat with your documents without your data leaving your computer.

# What is RAG?

## From Ctrl-F to Semantic Search

- RAG, or retrieval-augmented generation, is a procedure for augmenting a large language model's capabilities by giving it programmatic access to documents like PDFs, CSVs, and text files.

- To understand it, we need to understand what embeddings are.

- Embeddings are nothing more than vectorized representations of data (like text) in a space where distances have a meaningful connection to the semantic content of the data itself.

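- What do I mean? For example, we expect the embeddings of "dog" and "cat" to sit closer to each other than either sits to an unrelated word like "car".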

- RAG systems leverage embeddings to build what are called vector stores: literally stores of vectors that let us index many embeddings in one queryable representation.

- Now, why not have just one big embedding? Because of the known context-length limitations of transformer-based models like the ones behind ChatGPT.

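- More specifically, this is how a RAG system usually works: documents are split into chunks, each chunk is embedded and stored in a vector store, and at query time the question is embedded, the most similar chunks are retrieved, and those chunks are handed to the LLM as context alongside the question.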
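To make the idea of "distance carries meaning" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-picked toy values, not the output of any real embedding model, and the helper names are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings". Real models produce hundreds or thousands
# of dimensions; these hand-picked numbers are purely illustrative.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, store):
    """Return the other word in the store whose vector is closest to `word`."""
    query = store[word]
    scores = {
        other: cosine_similarity(query, vector)
        for other, vector in store.items()
        if other != word
    }
    return max(scores, key=scores.get)


dog_cat = cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"])
dog_car = cosine_similarity(toy_embeddings["dog"], toy_embeddings["car"])
print(f"dog vs cat: {dog_cat:.3f}")
print(f"dog vs car: {dog_car:.3f}")
print("closest to dog:", most_similar("dog", toy_embeddings))
```

A vector store does essentially this lookup at scale: it keeps many vectors and returns the ones closest to the query vector.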


# Frameworks to Query Docs Locally with Llama2

- There are many frameworks, ranging from ready-to-use products like chatpdf.com or ChatGPT Plus plugins like AskYourPDF or AI PDF, to less polished yet more open solutions like privateGPT, h2oGPT, localGPT, and many others.

- The way I like to think about these is as a spectrum of complexity/friction to access.

- On one end you have something as simple as: upload your doc and use it! On the other, you have frameworks that help you set up these query-your-docs systems yourself, locally on your machine. We'll be focusing on the latter.


- We'll now learn the concepts and the standard pipeline for querying your docs with RAG systems, focusing on langchain + llama-cpp-python as our main frameworks.

- Now let's look at a basic RAG pipeline:

- Load the docs with a doc loader, split them into chunks, embed the chunks, index the vectors in a vector store, then query the docs.

- Ok, let's look at what that looks like in practice.
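Before turning to the real libraries, the pipeline steps listed above can be sketched in plain Python. Every function below is a hypothetical stand-in: the "loader" just cleans a string, the "splitter" makes one chunk per line, and the "embedding" is a bag-of-words count rather than a neural model. It is only meant to show how the stages connect.

```python
import math
import re
from collections import Counter


def load(raw_text):
    # Stand-in for a document loader: the "document" is just a string.
    return raw_text.strip()


def split(document):
    # Stand-in for a text splitter: one chunk per non-empty line.
    return [line.strip() for line in document.splitlines() if line.strip()]


def embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    # Stand-in for a vector store: a list of (chunk, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]


def query(index, question, k=2):
    # Embed the question and return the k most similar chunks.
    question_vector = embed(question)
    ranked = sorted(
        index, key=lambda item: cosine(question_vector, item[1]), reverse=True
    )
    return [chunk for chunk, _ in ranked[:k]]


raw = """
Llama2 is a language model that can run on a local machine.
Chroma is a vector store that indexes embeddings for similarity search.
A text splitter cuts long documents into smaller chunks.
"""

index = build_index(split(load(raw)))
top_chunks = query(index, "Which vector store indexes embeddings?", k=1)
print(top_chunks[0])
```

In a full RAG system the retrieved chunks would then be placed into the LLM prompt as context; the cells that follow do this with langchain components instead of these toy functions.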

In [20]:
# !pip install langchain
# !pip install llama-cpp-python
# !pip install chromadb
# !pip install beautifulsoup4
# !pip install gpt4all
# !pip install langchainhub
# !pip install pypdf



In [11]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import LlamaCppEmbeddings
from langchain.chains import RetrievalQA
from langchain import hub

In [5]:
n_gpu_layers = 1  # With Metal (Apple Silicon), setting this to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM on your Apple Silicon chip.
# Stream generated tokens to stdout as they are produced.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=4096,  # Context window size, in tokens.
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls.
    callback_manager=callback_manager,
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [6]:
llm("Simulate a rap battle between Stephen Colbert and John Oliver")



Stephen Colbert:
Yo, John, it's on. Let's see who's the real king of comedy.
John Oliver: Hold up, Colby. Don't get too cocky. You think you're funnier than me?
Stephen Colbert: Oh, most definitely. My wit is like a sharp blade. Cutting through the bullshit and exposing the truth.
John Oliver: (laughs) Well, Stephen, your wit may be sharp but it's also quite dull. You're always talking about politics but never actually saying anything meaningful.
Stephen Colbert: (smirking) Oh, come on John. You can't knock the king of satire. My show is like a breath of fresh air in a stale political landscape.
John Oliver: (chuckles) Fresh air? More like hot air. You're always yelling and waving your arms around but never actually making any progress.
Stephen Colbert: (sneering) Progress? Ha! You think you're funnier than me just because you have a few good one-liners? I'


llama_print_timings:        load time =  2950.47 ms
llama_print_timings:      sample time =   256.12 ms /   256 runs   (    1.00 ms per token,   999.52 tokens per second)
llama_print_timings: prompt eval time =  2950.40 ms /    13 tokens (  226.95 ms per token,     4.41 tokens per second)
llama_print_timings:        eval time =  7946.97 ms /   255 runs   (   31.16 ms per token,    32.09 tokens per second)
llama_print_timings:       total time = 11810.70 ms
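The per-token and tokens-per-second figures in the timing log are simple ratios of the reported totals. A small sketch that recomputes them from the numbers printed above (the helper name `throughput` is hypothetical, not part of llama.cpp):

```python
def throughput(total_ms, n_tokens):
    """Return (ms per token, tokens per second) for a timed phase."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second


# Values copied from the log above.
prompt_ms_per_token, prompt_tps = throughput(2950.40, 13)  # prompt eval time
eval_ms_per_token, eval_tps = throughput(7946.97, 255)     # eval (generation) time

print(f"prompt eval: {prompt_ms_per_token:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"generation:  {eval_ms_per_token:.2f} ms/token, {eval_tps:.2f} tokens/s")
```

The generation rate (the `eval time` line) is the number to watch when judging how responsive a local model feels.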


"\n\nStephen Colbert:\nYo, John, it's on. Let's see who's the real king of comedy.\nJohn Oliver: Hold up, Colby. Don't get too cocky. You think you're funnier than me?\nStephen Colbert: Oh, most definitely. My wit is like a sharp blade. Cutting through the bullshit and exposing the truth.\nJohn Oliver: (laughs) Well, Stephen, your wit may be sharp but it's also quite dull. You're always talking about politics but never actually saying anything meaningful.\nStephen Colbert: (smirking) Oh, come on John. You can't knock the king of satire. My show is like a breath of fresh air in a stale political landscape.\nJohn Oliver: (chuckles) Fresh air? More like hot air. You're always yelling and waving your arms around but never actually making any progress.\nStephen Colbert: (sneering) Progress? Ha! You think you're funnier than me just because you have a few good one-liners? I'"

## Query a Web Article

In [7]:
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

Split the document into chunks

In [8]:
# Split into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)
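To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. It is an assumption-laden sketch, not the library's algorithm: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs, lines, and spaces, whereas this hypothetical helper cuts at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.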

In [11]:
# Embed and store
embeddings = LlamaCppEmbeddings(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)


In [12]:
# Retrieve
question = "How can Task Decomposition be done?"
docs = vectorstore.similarity_search(question)
print(len(docs))
docs

4



llama_print_timings:        load time =  4291.81 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  1140.90 ms /    10 tokens (  114.09 ms per token,     8.77 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  1143.69 ms


[Document(page_content='Resources:\n1. Internet access for searches and information gathering.\n2. Long Term memory management.\n3. GPT-3.5 powered Agents for delegation of simple tasks.\n4. File output.\n\nPerformance Evaluation:\n1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.\n2. Constructively self-criticize your big-picture behavior constantly.\n3. Reflect on past decisions and strategies to refine your approach.\n4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LL

Let's now pull a RAG prompt written specifically for Llama2 models from the langchain hub. The prompt itself is this:



```
[INST]<<SYS>> You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]
```
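The `{question}` and `{context}` placeholders are filled in before the text reaches the model. A minimal sketch of that step with plain string formatting follows. The template text is copied from the prompt above, while `build_prompt`, the example chunks, and the choice to join chunks with a blank line are illustrative assumptions rather than the library's confirmed internals.

```python
# Template text copied from the prompt shown above. Treating "\n" as a real
# newline is an assumption about how the hub prompt is stored.
RAG_TEMPLATE = (
    "[INST]<<SYS>> You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"
)


def build_prompt(question, retrieved_chunks):
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(question=question, context=context)


# Hypothetical retrieved chunks, standing in for vector store results.
example_chunks = [
    "Chunk A: first passage returned by the retriever.",
    "Chunk B: second passage returned by the retriever.",
]

prompt = build_prompt("How can Task Decomposition be done?", example_chunks)
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, the number and size of chunks directly determine how much of the model's context window the prompt consumes.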

In [6]:
# RAG prompt

QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")

In [16]:
# QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

question = "What are the various prompting techniques aimed at task decomposition described in the article?"
result = qa_chain({"query": question})


llama_print_timings:        load time =  4291.81 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  1966.41 ms /    18 tokens (  109.24 ms per token,     9.15 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  1972.76 ms
Llama.generate: prefix-match hit


  Based on the provided context, the various prompting techniques aimed at task decomposition described in the article are:
1. Chain of thought prompting: This technique elicits reasoning in large language models by providing a sequence of prompts that guide the model's reasoning process.
2. Tree of thoughts: This technique involves building a tree-like structure to represent the model's reasoning process, allowing for more efficient and scalable problem-solving.
3. Chain of hindsight: This technique involves providing the model with a sequence of past events or observations, which the model can use to inform its current decision-making process.
4. ReAct: This technique combines reasoning and acting in language models by using a set of prompts that guide the model's reasoning process and then acting on the resulting plan.
5. LLM+P: This technique empowers large language models with optimal planning proficiency by using a set of prompts to guide the model's planning process and improve 


llama_print_timings:        load time =  2950.47 ms
llama_print_timings:      sample time =   329.12 ms /   256 runs   (    1.29 ms per token,   777.82 tokens per second)
llama_print_timings: prompt eval time =  6574.65 ms /  1272 tokens (    5.17 ms per token,   193.47 tokens per second)
llama_print_timings:        eval time =  9633.38 ms /   255 runs   (   37.78 ms per token,    26.47 tokens per second)
llama_print_timings:       total time = 17417.54 ms


In [17]:
print(result["result"])

  Based on the provided context, the various prompting techniques aimed at task decomposition described in the article are:
1. Chain of thought prompting: This technique elicits reasoning in large language models by providing a sequence of prompts that guide the model's reasoning process.
2. Tree of thoughts: This technique involves building a tree-like structure to represent the model's reasoning process, allowing for more efficient and scalable problem-solving.
3. Chain of hindsight: This technique involves providing the model with a sequence of past events or observations, which the model can use to inform its current decision-making process.
4. ReAct: This technique combines reasoning and acting in language models by using a set of prompts that guide the model's reasoning process and then acting on the resulting plan.
5. LLM+P: This technique empowers large language models with optimal planning proficiency by using a set of prompts to guide the model's planning process and improve 

## Query a PDF

In [8]:
from langchain.document_loaders import PyPDFLoader

docs = PyPDFLoader("./assets-resources/attention_paper.pdf").load_and_split()

In [9]:
docs

[Document(page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring

In [12]:
embeddings = LlamaCppEmbeddings(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=4096,
    f16_kv=True,
)
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)


In [13]:
# QA chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

question = "How does the self-attention layer work?"
result = qa_chain({"query": question})
print(result["result"])


llama_print_timings:        load time =  5502.64 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   139.83 ms /    11 tokens (   12.71 ms per token,    78.67 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   143.25 ms


  The self-attention layer in the Transformer model works by first computing the attention scores for each element in the input sequence using a dot product attention mechanism. Specifically, it computes the attention score for each element by taking the dot product of the query vector (which is a representation of the entire input sequence) and the key vector (which is a representation of the entire input sequence). The attention scores are then normalized and used to compute a weighted sum of the value vectors (which represent the input sequence at each position).
The Transformer model uses a novel attention mechanism called multi-head attention, which allows it to jointly attend to information from different representation subspaces at different positions. This is achieved by computing multiple attention scores for each element in the input sequence, using different linear transformations of the query and key vectors. The outputs of these attention scores are then concatenated and l


llama_print_timings:        load time =  3251.79 ms
llama_print_timings:      sample time =   313.99 ms /   256 runs   (    1.23 ms per token,   815.30 tokens per second)
llama_print_timings: prompt eval time = 16333.33 ms /  3326 tokens (    4.91 ms per token,   203.63 tokens per second)
llama_print_timings:        eval time = 12246.31 ms /   255 runs   (   48.02 ms per token,    20.82 tokens per second)
llama_print_timings:       total time = 29924.38 ms
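The timing logs also hint at how close each query came to the context limit. Below is a rough budget check using the `n_ctx=4096` set on the `LlamaCpp` model and the prompt sizes reported in the logs (1272 tokens for the web article query, 3326 for the PDF query). The 256-token reply cap is an assumption inferred from the "256 runs" in the sample timings and from both answers stopping mid-sentence; it is not a value set explicitly in this notebook, so check it against your installed version.

```python
def context_headroom(n_ctx, prompt_tokens, max_new_tokens):
    """Tokens left in the context window after the prompt and the reply."""
    return n_ctx - prompt_tokens - max_new_tokens


N_CTX = 4096          # n_ctx passed to LlamaCpp earlier in the notebook
MAX_NEW_TOKENS = 256  # assumed cap, inferred from "256 runs" in the logs

web_headroom = context_headroom(N_CTX, 1272, MAX_NEW_TOKENS)
pdf_headroom = context_headroom(N_CTX, 3326, MAX_NEW_TOKENS)

print("web article query headroom:", web_headroom, "tokens")
print("PDF query headroom:", pdf_headroom, "tokens")
```

The PDF query leaves far less room than the web query. If prompts get close to `n_ctx`, using smaller chunks or retrieving fewer documents are the obvious knobs to try.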


# References
- https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa
- https://python.plainenglish.io/super-quick-fine-tuning-llama-2-0-on-cpu-with-personal-data-d2d284559f
- https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8
- https://python.langchain.com/docs/integrations/llms/ollama