# Following the [tutorial](https://python.langchain.com/docs/use_cases/question_answering/#step-1-load) from the LangChain website

Steps to implement:
1. **Loading:** First we need to load our data. Unstructured data can be loaded from many sources; the LangChain integration hub lists the full set of loaders. Each loader returns data as a LangChain Document.
2. **Splitting:** Text splitters break Documents into splits of a specified size.
3. **Storage:** Storage (often a vectorstore) houses the splits and usually embeds them.
4. **Retrieval:** The app retrieves splits from storage, often those whose embeddings are most similar to the embedding of the input question.
5. **Generation:** An LLM produces an answer using a prompt that includes the question and the retrieved data.

## Imports

In [1]:
import os 
import logging
import json
import pandas as pd
from pprint import pprint
from textwrap import wrap
from collections import defaultdict, ChainMap

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

## 1. Load

In [2]:
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

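# # Alternative source: the SQuAD dataset (needs `from langchain.document_loaders import JSONLoader` and the `jq` package)
# # jq expression to parse the SQuAD dataset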
# jq_exp =   ".data[].paragraphs[] | {context:.context, queries:[.qas[] | {question:.question, id:.id, answers:[.answers[].text]}]}"

# # Keep metadata in document to check questions and answers later
# def metadata_func(record: dict, metadata: dict) -> dict:
#     metadata["queries"] = record.get("queries")
#     return metadata

# loader = JSONLoader("../documents/train-v2.0.json", jq_schema=jq_exp, content_key="context", metadata_func=metadata_func)
# docs = loader.load()

## 2. Split

In [4]:
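# Split the page into chunks of at most 500 characters with no overlap.
# (Documents that are already short enough, such as the SQuAD paragraphs above, would not need this step.)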
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(docs)
print(len(all_splits))

130
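
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to cut on separators such as paragraph and line breaks); `split_text` and the sample string are hypothetical names introduced only for illustration.

```python
def split_text(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Unlike
    RecursiveCharacterTextSplitter, this ignores paragraph and sentence
    boundaries; it only illustrates what the two parameters control.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 12  # 120 characters of stand-in text
chunks = split_text(sample, chunk_size=50, chunk_overlap=10)
print([len(c) for c in chunks])  # [50, 50, 40]
```

With `chunk_overlap=0`, as in the cell above, the windows simply tile the text; a non-zero overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary.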


## 3. Store

In [5]:
# Llama 2 model configs
# Some configs taken from https://python.langchain.com/docs/guides/local_llms#llamacpp
llama_cpp_shared_configs = {
    "n_gpu_layers": 35,  # Max number of layers to offload to the GPU (35 according to the model's metadata)
    "n_batch": 512,      # Tokens to process in parallel
    "n_ctx": 2048,       # Context window length (Llama 2 supports up to 4096)
    "f16_kv": True,      # Half-precision key/value cache for lower memory consumption
}

llama_embeddings_configs = {
    "model_path": "../models/llama-2-q4_0.gguf",
    "n_gpu_layers": 1
}

# Instantiate the embeddings used to turn documents into vectors before storing them
embeddings = LlamaCppEmbeddings(**ChainMap(llama_embeddings_configs, llama_cpp_shared_configs))

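# Create the vector store, embed the splits and persist them on disk
vectorstore_path = "../vectorstore/"

# Load from disk if already there (a missing directory is treated like an empty one)
if os.path.isdir(vectorstore_path) and os.listdir(vectorstore_path):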
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory=vectorstore_path
    )
else:
    # Create new one if missing
    vectorstore = Chroma.from_documents(
        documents=all_splits,
        embedding=embeddings,
        persist_directory=vectorstore_path
    )
    # Store embedded text on disk to avoid re-doing the same work
    vectorstore.persist()

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from ../models/llama-2-q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     
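
The cell above merges two config dicts with `ChainMap`. The short sketch below shows the lookup order this relies on: the first mapping passed wins, so the embeddings model is loaded with `n_gpu_layers=1` rather than the shared value of 35. The variable names `shared`, `embedding_overrides` and `merged` are illustrative only.

```python
from collections import ChainMap

shared = {"n_gpu_layers": 35, "n_batch": 512, "n_ctx": 2048, "f16_kv": True}
embedding_overrides = {"model_path": "../models/llama-2-q4_0.gguf", "n_gpu_layers": 1}

# Lookups try embedding_overrides first and fall back to shared,
# so keys present in both take the value from embedding_overrides.
merged = dict(ChainMap(embedding_overrides, shared))

print(merged["n_gpu_layers"])  # 1
print(sorted(merged))          # keys from both dicts
```

The same result could be written as `{**shared, **embedding_overrides}`; note that the argument order is reversed there, because in a dict literal the later entry wins.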

## 4. Retrieve

In [6]:
# Question to use
question = "What are the approaches to task decomposition?"

In [7]:
# The simplest approach is a similarity search run directly against the vector store
unique_docs = vectorstore.similarity_search(question)

# A slightly more sophisticated approach uses the MultiQueryRetriever and an LLM
# logging.basicConfig()
# logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

llama_llm_configs = {
    "model_path": "../models/llama-2-chat-q4_0.gguf",
    "temperature": 0,
    "callback_manager": CallbackManager([StreamingStdOutCallbackHandler()]),
    "verbose": True,
}
llm = LlamaCpp(**ChainMap(llama_llm_configs, llama_cpp_shared_configs))


# retriever_from_llm = MultiQueryRetriever.from_llm(
#     retriever=vectorstore.as_retriever(),
#     llm=llm
# )
# unique_docs = retriever_from_llm.get_relevant_documents(query=question)


llama_print_timings:        load time =   925.12 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   577.45 ms /     9 tokens (   64.16 ms per token,    15.59 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   578.39 ms
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from ../models/llama-2-chat-q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0
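
As a rough picture of what a similarity search does, the sketch below ranks a few hand-made vectors against a query vector. It assumes cosine similarity and a brute-force scan; the metric and index Chroma actually uses may differ, and `toy_store`, `cosine_similarity` and `top_k` are hypothetical names with made-up three-dimensional vectors standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the texts of the k stored (text, vector) pairs most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up vectors; a real store would hold the embeddings of the splits
toy_store = [
    ("chunk about planning", [0.9, 0.1, 0.0]),
    ("chunk about memory", [0.1, 0.9, 0.0]),
    ("chunk about tools", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], toy_store, k=2))
# ['chunk about planning', 'chunk about memory']
```

The quality of the ranking depends entirely on the embedding model, which is one reason to inspect the retrieved chunks before passing them to the LLM.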

In [8]:
for i, d in enumerate(unique_docs):
    m = d.metadata
    c = d.page_content
    c = '\n'.join(wrap(c.strip(), width=120))
    print("Title {}, source {} Part {}:\n {}\n".format(m["title"], m["source"], i, c))

Title LLM Powered Autonomous Agents | Lil'Log, source https://lilianweng.github.io/posts/2023-06-23-agent/ Part 0:
 Resources: 1. Internet access for searches and information gathering. 2. Long Term memory management. 3. GPT-3.5 powered
Agents for delegation of simple tasks. 4. File output.

Title LLM Powered Autonomous Agents | Lil'Log, source https://lilianweng.github.io/posts/2023-06-23-agent/ Part 1:
 Fig. 10. A picture of a sea otter using rock to crack open a seashell, while floating in the water. While some other
animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools)

Title LLM Powered Autonomous Agents | Lil'Log, source https://lilianweng.github.io/posts/2023-06-23-agent/ Part 2:
 You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the
architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end,
implemented as code. Th
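
The commented-out `MultiQueryRetriever` asks the LLM for several rephrasings of the question, retrieves for each one, and returns the unique union of the results. The sketch below imitates only the merging step: the rephrasings are hard-coded instead of generated, and `keyword_retrieve` is a hypothetical stand-in for a real retriever.

```python
CHUNKS = [
    "Task decomposition can be done by simple prompting.",
    "Chain of thought breaks hard tasks into smaller steps.",
    "Memory can be short-term or long-term.",
]


def keyword_retrieve(query, chunks=CHUNKS):
    """Stand-in retriever: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    return [c for c in chunks if words & set(c.lower().rstrip(".").split())]


def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each chunk."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hand-written rephrasings; the real retriever would generate these with the LLM
queries = ["task decomposition", "breaking tasks into steps", "smaller steps for a task"]
docs = unique_union(keyword_retrieve(q) for q in queries)
print(len(docs))  # 2: the third query returns both chunks again, but each is kept once
```

Asking the same thing in several ways can surface chunks that a single phrasing misses, at the cost of extra LLM and retrieval calls.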

## 5. Generate

In [9]:
template = """
[INST]<<SYS>> You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.<</SYS>> 
Question: {question} 
Context: {context} 
Answer:[/INST]
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

qa_chain = RetrievalQA.from_chain_type(
    llm, 
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
    return_source_documents=True)
    
result = qa_chain({"query": question})


llama_print_timings:        load time =   925.12 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   551.33 ms /     9 tokens (   61.26 ms per token,    16.32 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   552.89 ms


Based on the context provided, there are several approaches to task decomposition:
1. Divide and Conquer: Breaking down a complex task into smaller, manageable sub-tasks, and then solving each sub-task independently. This approach can help improve efficiency and reduce the risk of failure.
2. Modular Design: Creating a modular architecture for a task, where each module is responsible for a specific aspect of the task. This approach allows for greater flexibility and maintainability, as modules can be easily replaced or updated without affecting the entire task.
3. Task Graph: Representing a task as a graph, where each node represents a sub-task and each edge represents the dependency between tasks. This approach allows for visualization of the task decomposition and easier management of dependencies.
4. Workflow Management: Defining a workflow for a task, which outlines the steps involved in completing the task. This approach provides a structured way to manage complex tasks and ensure


llama_print_timings:        load time =  1113.98 ms
llama_print_timings:      sample time =   115.18 ms /   256 runs   (    0.45 ms per token,  2222.57 tokens per second)
llama_print_timings: prompt eval time =  1113.89 ms /   363 tokens (    3.07 ms per token,   325.89 tokens per second)
llama_print_timings:        eval time = 76865.35 ms /   255 runs   (  301.43 ms per token,     3.32 tokens per second)
llama_print_timings:       total time = 78714.26 ms
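
The sketch below shows roughly what happens to the prompt before it reaches the model, assuming the default "stuff" behaviour in which the retrieved chunks are concatenated and substituted into `{context}`. It uses plain `str.format` rather than LangChain, a shortened version of the template above, and a hypothetical `build_prompt` helper with made-up chunks.

```python
TEMPLATE = (
    "[INST]<<SYS>> Use the following pieces of retrieved context "
    "to answer the question.<</SYS>>\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:[/INST]"
)


def build_prompt(question, chunks, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(chunk.strip() for chunk in chunks)
    return TEMPLATE.format(question=question, context=context)


prompt = build_prompt(
    "What are the approaches to task decomposition?",
    ["First retrieved chunk.", "  Second retrieved chunk.  "],  # made-up chunks
)
print(prompt)
```

Because every retrieved chunk ends up in the prompt, the number and size of the chunks must leave room for the answer within `n_ctx`.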


In [10]:
# Print the metadata of the source documents to see which documents were used to produce the response
# Because we split documents into chunks, several chunks can share one source, so keep each document only once
source_docs = defaultdict(dict)

for doc in result["source_documents"]:
    source_docs[(doc.metadata["title"], doc.metadata["source"])].update(doc.metadata) 

for k, v in source_docs.items():
    print("-" * 100)
    pprint(v, width=120)


----------------------------------------------------------------------------------------------------
{'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several '
                'proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The '
                'potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it '
                'can be framed as a powerful general problem solver.\n'
                'Agent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, '
                'complemented by several key components:',
 'language': 'en',
 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'title': "LLM Powered Autonomous Agents | Lil'Log"}


In [11]:
# Examine results
print(result.keys())
print("")
pprint(result['query'])
print("")
print(result['result'])

dict_keys(['query', 'result', 'source_documents'])

'What are the approaches to task decomposition?'

Based on the context provided, there are several approaches to task decomposition:
1. Divide and Conquer: Breaking down a complex task into smaller, manageable sub-tasks, and then solving each sub-task independently. This approach can help improve efficiency and reduce the risk of failure.
2. Modular Design: Creating a modular architecture for a task, where each module is responsible for a specific aspect of the task. This approach allows for greater flexibility and maintainability, as modules can be easily replaced or updated without affecting the entire task.
3. Task Graph: Representing a task as a graph, where each node represents a sub-task and each edge represents the dependency between tasks. This approach allows for visualization of the task decomposition and easier management of dependencies.
4. Workflow Management: Defining a workflow for a task, which outlines the steps invol

## Next steps
1. Compare LlamaCpp with LlamaCppEmbeddings and see if we can use the same object. (No; I checked this when I was running out of GPU memory.)
2. Add a proper evaluation of the model's performance on the task.
3. Investigate the vectorstores further.
4. Formalise the process for adding new documents.
5. Investigate and better understand how documents are vectorised.
6. Verify that answers actually come only from the documents provided.
7. Distribute the system across multiple machines (e.g. the VectorStore and the LLM can live independently of the main program).