# Following the [tutorial](https://python.langchain.com/docs/use_cases/question_answering/#step-1-load) from the LangChain website

Steps to implement
1. **Loading:** First we need to load our data. Unstructured data can be loaded from many sources. Use the LangChain integration hub to browse the full set of loaders. Each loader returns data as a LangChain Document.
2. **Splitting:** Text splitters break Documents into splits of a specified size.
3. **Storage:** Storage (often a vectorstore) houses the splits and often embeds them.
4. **Retrieval:** The app retrieves splits from storage (often those whose embeddings are similar to the embedding of the input question).
5. **Generation:** An LLM produces an answer using a prompt that includes the question and the retrieved data.
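
To make the five steps concrete before bringing in LangChain, here is a minimal, standard-library-only sketch of the same flow. Everything in it is a stand-in: `split_text`, `embed` and `cosine` are hypothetical helpers, the "embedding" is just a bag-of-words count rather than a model embedding, and the generation step only builds the prompt that an LLM would receive.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size):
    """Cut text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text):
    """Toy embedding: lower-cased word counts (NOT a real model embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Load: a hard-coded string stands in for a loaded Document
document = (
    "Task decomposition breaks a big task into steps. "
    "Memory lets an agent recall past events. "
    "Tool use lets an agent call external APIs."
)

# 2. Split
splits = split_text(document, chunk_size=50)

# 3. Store: keep each chunk next to its vector (an in-memory "vector store")
store = [(chunk, embed(chunk)) for chunk in splits]

# 4. Retrieve: pick the chunk most similar to the question
question = "What is task decomposition?"
question_vector = embed(question)
best_chunk, _ = max(store, key=lambda item: cosine(question_vector, item[1]))

# 5. Generate: build the prompt an LLM would be asked to complete
prompt = f"Context: {best_chunk}\nQuestion: {question}\nAnswer:"
print(prompt)
```

The rest of the notebook replaces each of these stand-ins with the corresponding LangChain component.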

## Imports

In [6]:
import os 
import logging
from pprint import pprint
from collections import defaultdict, ChainMap

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

## 1. Load

In [4]:
# TODO: Create some local documents and replace the WebBaseLoader with the appropriate one to utilise them
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

## 2. Split

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
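
To get a feel for what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself first tries to split on separators (paragraphs, lines, spaces), so its chunks will not match these exactly.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Fixed-size character chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


no_overlap = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as used above, a sentence that straddles a chunk boundary is cut in two; a small overlap trades some duplicated storage for keeping that context together.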

## 3. Store

In [8]:
# Llama 2 model configs
llama_cpp_shared_configs = {
    "model_path": "../models/llama-2-q4_0.gguf",  # path to quantized model
    "n_gpu_layers": 35,                           # max number of layers to get offloaded to GPU (35 according to model's metadata)
    "n_batch": 512,                               # Tokens to process in parallel
    "n_ctx": 512                                  # Context window length (should be 4096 for llama2)
}

# Instantiate embeddings to use to transform documents to vectors before storing
embeddings = LlamaCppEmbeddings(**ChainMap({"n_gpu_layers": 0}, llama_cpp_shared_configs))

# Create vector store, vectors and persist on disk
vectorstore_path = "../vectorstore/"

# Load from disk if a persisted store is already there
# (check that the directory exists first: os.listdir raises FileNotFoundError otherwise)
if os.path.isdir(vectorstore_path) and os.listdir(vectorstore_path):
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory=vectorstore_path
    )
else:
    # Create a new one if missing
    vectorstore = Chroma.from_documents(
        documents=all_splits,
        embedding=embeddings,
        persist_directory=vectorstore_path
    )
    # Write the new vector store to disk so later runs can reload it
    vectorstore.persist()

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from ../models/llama-2-q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     
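
The `ChainMap({"n_gpu_layers": 0}, llama_cpp_shared_configs)` call in the cell above overrides one key for the embeddings without touching the shared dict. Here is a small standalone sketch of that lookup behaviour; the `shared` dict and its `model_path` value are placeholders, not the notebook's real config.

```python
from collections import ChainMap

# Placeholder config, standing in for llama_cpp_shared_configs
shared = {
    "model_path": "path/to/model.gguf",
    "n_gpu_layers": 35,
    "n_batch": 512,
    "n_ctx": 512,
}

# Lookups go through the maps left to right, so the first dict wins
embedding_configs = ChainMap({"n_gpu_layers": 0}, shared)


def show_kwargs(**kwargs):
    """Return the keyword arguments a constructor would receive."""
    return kwargs


received = show_kwargs(**embedding_configs)
print(received["n_gpu_layers"])  # overridden value
print(shared["n_gpu_layers"])    # the shared dict is unchanged
```

A plain dict merge, `{**shared, "n_gpu_layers": 0}`, would pass the same keyword arguments; `ChainMap` just avoids building a copy.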

## 4. Retrieve

In [9]:
question = "What are the approaches to Task Decomposition?"

# Simplest approach is to retrieve vectors based on similarity search directly from the vector store
# docs = vectorstore.similarity_search(question)
# len(docs)

# A slightly more sophisticated approach is to use the MultiQueryRetriever with an LLM
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

llm = LlamaCpp(
    temperature=0,
    **llama_cpp_shared_configs
)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)

llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from ../models/llama-2-q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]

12
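
As I understand it from the LangChain docs, `MultiQueryRetriever` asks the LLM for several rephrasings of the question, runs the underlying retriever once per rephrasing, and returns the unique union of the results (hence `unique_docs`). The merging step can be sketched as below; `unique_union` is a hypothetical helper and plain strings stand in for Documents.

```python
def unique_union(result_lists):
    """Merge several result lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    unique = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                unique.append(doc)
    return unique


# Made-up results for three rephrasings of the same question
results_per_query = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk D"],
]
merged = unique_union(results_per_query)
print(merged)
```

This is why the number of documents returned can be larger than what a single similarity search gives back, but smaller than the sum over all generated queries.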

## 5. Generate

In [10]:
qa_chain = RetrievalQA.from_chain_type(
    llm, 
    retriever=vectorstore.as_retriever(),
    return_source_documents=True)
result = qa_chain({"query": question})


llama_print_timings:        load time =  1100.99 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  1472.44 ms /    11 tokens (  133.86 ms per token,     7.47 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  1473.10 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =   511.84 ms
llama_print_timings:      sample time =   110.32 ms /   255 runs   (    0.43 ms per token,  2311.48 tokens per second)
llama_print_timings: prompt eval time =   644.33 ms /   256 tokens (    2.52 ms per token,   397.31 tokens per second)
llama_print_timings:        eval time =  6134.86 ms /   254 runs   (   24.15 ms per token,    41.40 tokens per second)
llama_print_timings:       total time =  7305.60 ms
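
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt together with the question. A rough sketch of that assembly follows; `build_stuff_prompt` is a hypothetical helper and the template wording is illustrative, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the retrieved splits
example_chunks = [
    "Task decomposition can be done by simple prompting.",
    "It can also use task-specific instructions.",
]
example_prompt = build_stuff_prompt(
    "What are the approaches to Task Decomposition?", example_chunks
)
print(example_prompt)
```

Because every retrieved chunk ends up in the prompt, the combined length of chunks, question and answer has to fit within `n_ctx`.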


In [11]:
# Print document metadata to see which documents were used to produce the response
# Because documents are split into chunks, several chunks can share a source, so keep each source only once
source_docs = defaultdict(dict)

for doc in result["source_documents"]:
    source_docs[doc.metadata["source"]].update(doc.metadata) 

for k, v in source_docs.items():
    print("-" * 100)
    pprint(v, width=120)


----------------------------------------------------------------------------------------------------
{'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several '
                'proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The '
                'potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it '
                'can be framed as a powerful general problem solver.\n'
                'Agent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, '
                'complemented by several key components:',
 'language': 'en',
 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'title': "LLM Powered Autonomous Agents | Lil'Log"}


In [12]:
# Examine results

print(result.keys())
print("")
pprint(f"{result['query']}")
print("")
pprint(result['result'], width=120)

dict_keys(['query', 'result', 'source_documents'])

'What are the approaches to Task Decomposition?'

(' There are two main approaches to task decomposition, namely, top-down and bottom-up. Top-down approach involves '
 'breaking down a complex task into smaller subtasks that can be solved independently, while bottom-up approach '
 'involves breaking down a complex task into simpler subtasks that can be combined together to solve the original '
 'problem.\n'
 '\n'
 '### Task Decomposition\n'
 '\n'
 'Task decomposition is the process of breaking down a complex task into smaller subtasks that can be solved '
 'independently. This approach helps in improving efficiency and reducing complexity, as it allows for parallel '
 'processing and reduces the need for coordination between different components.\n'
 '\n'
 '#### Top-Down Approach\n'
 '\n'
 'The top-down approach involves breaking down a complex task into smaller subtasks that can be solved independently. '
 'This approach is useful wh

## Next steps
1. Compare LlamaCpp with LlamaCppEmbeddings and see if we can use the same object.
2. Investigate the vectorstores further.
3. Formalise the process for adding new documents.
4. Investigate the document vectorisation process to understand it better.
5. Distribute the system across multiple machines (e.g. the VectorStore and the LLM can live independently of the main program).