# Understanding Retrieval Question Answering

In [4]:
import os, random
from pathlib import Path
from getpass import getpass
from rich.markdown import Markdown
import transformers
import torch
import wandb

# Set Llama2 API key 

To get the key, go to your Hugging Face account settings and copy a token from your Access Tokens.

In [2]:
# Set LLAMA2 API key environment variable
if os.getenv("LLAMA2_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["LLAMA2_API_KEY"] = getpass("Paste your Llama2 key from your Hugging Face settings \n")

assert os.getenv("LLAMA2_API_KEY", "").startswith("hf_"), "This doesn't look like a valid HuggingFace llama2 key"
print("Llama2 API key configured")

# Get the HF auth token
hf_auth = os.getenv("LLAMA2_API_KEY", "")

Please enter password in the VS Code prompt at the top of your VS Code window!
Llama2 API key configured


# Langchain

[LangChain](https://docs.langchain.com/docs/) is a framework for developing applications powered by LLMs. We will use some of its features in the code below:
* Document loaders and text splitters, for processing and parsing documents.
* The retrieval chain, which bundles much of the functionality needed to implement our question-answering system.

Let's start by configuring W&B tracing. 

In [22]:
# Need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"


# !!!! In theory we should be able to log progress in W&B via LangChain, by setting the 
#      above environment variable and optionally the one below. However, this is currently 
#      not working. Hence, we start the W&B run with wandb.init()


# # wandb documentation to configure wandb using env variables
# # https://docs.wandb.ai/guides/track/advanced/environment-variables
# # here we are configuring the wandb project name
# os.environ["WANDB_PROJECT"] = "llmapps"

# Set parameters that are typically passed to wandb.init()
run = wandb.init(project="llmapps", job_type="langchain")

[34m[1mwandb[0m: Currently logged in as: [33md-oliver-cort[0m ([33mdoc93[0m). Use [1m`wandb login --relogin`[0m to force relogin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



# Load model

In [23]:
# Define llama2 model to load
model_id = 'meta-llama/Llama-2-7b-chat-hf'

In [24]:
# Set quantization configuration to load large model with less GPU memory
# - this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# This configuration object uses the model configuration from Hugging Face 
# to set different model parameters
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# Download and initialize the model 
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()




LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
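
To see roughly why 4-bit quantization lets the model load with less GPU memory, here is a back-of-the-envelope estimate of weight storage only. It assumes a nominal 7 billion parameters for the "7b" model and ignores activations, the KV cache, and the small overhead that quantization constants add, so treat the numbers as approximate.

```python
# Rough weight-memory estimate for a "7B" model.
# Assumption: a nominal 7e9 parameters; real checkpoints differ slightly.
N_PARAMS = 7e9


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights take about a quarter of the half-precision footprint.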


## Parsing documents

We will use a small sample of Markdown documents in this notebook. Let's find them and make sure we can stuff them into the prompt. That means they may need to be chunked so that no chunk exceeds a given number of tokens.

In [25]:
# First step of parsing our documents is to load all the Markdown files in 
# specified directory
# - we do this by using class from LangChain -> DirectoryLoader

from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
    "Find all Markdown files in a directory and return them as a list of LangChain Documents"
    dl = DirectoryLoader(directory, "**/*.md")
    return dl.load()

# Load all the Markdown
documents = find_md_files('../docs_sample/')

# Number of documents
len(documents)

11

In [26]:
# We will need to count tokens in the documents. For that we need a tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
    )



In [27]:
# Function to count the number of tokens in each document
def count_tokens(documents):
    token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
    return token_counts

count_tokens(documents)

[771, 2222, 319, 401, 733, 1204, 2340, 1213, 2626, 2168, 2588]

The result above shows that some documents are fairly short while others are quite long, so we may want to chunk the longer ones into sections.

We use LangChain's built-in `MarkdownTextSplitter` to split the documents into sections (since the docs are in `Markdown` format).
* Splitting `Markdown` without breaking syntax is not that easy. This splitter strips out `syntax`.
* The `MarkdownTextSplitter` also removes double line breaks, which saves us some tokens.

We can pass:
* `chunk_size` param - caps the length of each chunk (measured in characters by default) to avoid lengthy chunks.
* `chunk_overlap` param - repeats some text between neighboring chunks so that sentences are not cut at arbitrary points (less necessary with Markdown).
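
To make the two parameters concrete, here is a minimal sketch of fixed-window chunking with overlap in plain Python. It is not how `MarkdownTextSplitter` works internally (that splitter looks for Markdown-aware break points); the function name `chunk_text` and the sample string are hypothetical and only show how `chunk_size` and `chunk_overlap` interact.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighboring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the effect `chunk_overlap` is meant to have on sentence boundaries.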

In [28]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter = MarkdownTextSplitter(chunk_size=1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(90, 387)

The split above yields 90 chunks (each a LangChain document), and the largest chunk contains 387 tokens. Every chunk will therefore fit inside the context window.
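
As a rough sanity check on the context-window claim, the arithmetic below budgets a worst-case prompt. The 4096-token window is the Llama 2 context length and 387 is the largest chunk reported by the split output above; the number of retrieved chunks, the generation budget, and the 100-token template allowance are assumptions made for illustration.

```python
CONTEXT_WINDOW = 4096     # Llama 2 context length, in tokens
MAX_CHUNK_TOKENS = 387    # largest chunk reported by the split output above
K_RETRIEVED = 3           # assumption: number of chunks stuffed into the prompt
MAX_NEW_TOKENS = 500      # assumption: tokens reserved for the generated answer
TEMPLATE_OVERHEAD = 100   # hypothetical allowance for instructions and the question

# Worst case: every retrieved chunk is as large as the largest one.
worst_case_prompt = K_RETRIEVED * MAX_CHUNK_TOKENS + TEMPLATE_OVERHEAD
total_tokens = worst_case_prompt + MAX_NEW_TOKENS
headroom = CONTEXT_WINDOW - total_tokens

print(f"worst-case prompt: {worst_case_prompt} tokens")  # 1261
print(f"prompt + answer:   {total_tokens} tokens")       # 1761
print(f"headroom:          {headroom} tokens")           # 2335
```

Even under these pessimistic assumptions, more than half of the window remains unused.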

In [29]:
# Here we look at the first section
Markdown(document_sections[0].page_content)

# Embeddings

Now we use `embeddings` with a `vector database retriever` to find relevant documents for a `query`. 

We use:
* `langchain.embeddings` to embed the text. Here we use the sentence-similarity model `SBERT` from [`HuggingFace sentence`](https://huggingface.co/blog/getting-started-with-embeddings), but we could use other embedding models from [different sources](https://python.langchain.com/docs/integrations/text_embedding/).
  * Example: `Cohere` provides a good multilingual embedding model for languages other than English.
* `Chroma` as the vector store that holds the embeddings.

In [30]:
from langchain.embeddings import HuggingFaceEmbeddings

# Initialise embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [31]:
from langchain.vectorstores import Chroma

# Embed the document chunks from above and index them in a Chroma vector store
db = Chroma.from_documents(document_sections, embeddings)

Now we can create a `retriever` from the db. 

* The `k` param sets how many relevant sections the similarity search returns.

In [32]:
retriever = db.as_retriever(search_kwargs=dict(k=3))

In [33]:
# Retrieve the docs relevant to the query using the Chroma retriever above
query = "How can I share my W&B report with my team members in a public W&B project?"
docs = retriever.get_relevant_documents(query)

In [34]:
# Let's see the results
for doc in docs:
    print(doc.metadata["source"])

../docs_sample/collaborate-on-reports.md
../docs_sample/collaborate-on-reports.md
../docs_sample/collaborate-on-reports.md


The results above show that the right kind of documents are retrieved: all three chunks come from the collaboration docs, which matches the query.
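
The idea behind the retriever can be sketched in a few lines of plain Python: embed the query, score it against every stored vector, and keep the `k` best matches. The three-dimensional vectors and two of the document names below are made up for illustration, and cosine similarity is an assumption here; Chroma's actual distance metric and index structure may differ.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "vector store": document name -> embedding.
toy_index = {
    "collaborate-on-reports": [0.9, 0.1, 0.0],
    "log-metrics": [0.1, 0.9, 0.1],
    "sweeps": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the user query

results = top_k(query_vec, toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A real vector store avoids scoring every vector by using an approximate nearest-neighbor index, but the ranking principle is the same.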

# Stuff Prompt

Now that we have retrieved the relevant docs, we stuff them into the prompt template along with the user query and pass the result to an LLM to obtain the answer.

* To do this we use the `PromptTemplate` from `LangChain` (similar to an f-string in Python)
* This is a simple prompt (not a Level 5 prompt)
* We define two variables: `context` and `question`
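
As a rough illustration of the f-string analogy, plain `str.format` fills named placeholders in much the same way. This sketch uses no LangChain at all; the shortened template text and the two sample chunks are hypothetical stand-ins for the real template and retrieved docs.

```python
# A shortened, hypothetical template with the same two placeholders.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Hypothetical retrieved chunks standing in for the real documents.
retrieved_chunks = ["Chunk one text.", "Chunk two text."]

# Join the chunks with blank lines, then fill both placeholders.
joined_context = "\n\n".join(retrieved_chunks)
filled_prompt = TEMPLATE.format(context=joined_context, question="How do I share a report?")

print(filled_prompt)
```

The filled prompt is an ordinary string, which is why it can be handed directly to any text-generation model.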

In [35]:
from langchain.prompts import PromptTemplate

prompt_template = """<s>[INST] Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:  [/INST]"""

# prompt  = """<s>[INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user_1} [/INST]"""


# In the prompt template we define two inputs: "context", "question"
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# The context is a concatenation of the retrieved docs
context = "\n\n".join([doc.page_content for doc in docs])
# Populate the prompt with the context and query variables
prompt = PROMPT.format(context=context, question=query)

We use LangChain to call the [Hugging Face API](https://python.langchain.com/docs/integrations/llms/) (also see this [link](https://api.python.langchain.com/en/latest/llms/langchain.llms.huggingface_pipeline.HuggingFacePipeline.html)) and predict an answer to the above prompt, given the docs retrieved via embeddings.

In [36]:
from langchain.llms import HuggingFacePipeline

pipe = transformers.pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    do_sample=True,           # Whether or not to use sampling 
    temperature=0.6,
    repetition_penalty=1.1,   # without this output begins repeating
    max_new_tokens=500
    )

llm = HuggingFacePipeline(pipeline=pipe)
            
response = llm.predict(prompt)
Markdown(response)

Above, LangChain activity is streamed into W&B (since we previously set `LANGCHAIN_WANDB_TRACING` to `true`). This is useful for checking what worked, which errors occurred, and what kind of results were obtained.

# Using LangChain

`LangChain` provides tools (like the `RetrievalQA` chain) that encapsulate the above sequence of actions into a chain in a few lines of code.
* That is, the `RetrievalQA` chain distills the retrieved documents into an answer using the LLM (Llama 2).
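
Conceptually, a "stuff" style retrieval chain performs the same three steps as the manual code: retrieve, fill the prompt, call the model. The sketch below wires those steps together with stub functions; it is not LangChain's implementation, and `make_stuff_qa`, `fake_retrieve`, and `fake_llm` are hypothetical names used only for illustration.

```python
from typing import Callable, List

TEMPLATE = "{context}\n\nQuestion: {question}\nHelpful Answer:"


def make_stuff_qa(
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    template: str = TEMPLATE,
) -> Callable[[str], str]:
    """Bundle retrieval, prompt stuffing, and the model call into one callable."""

    def run(query: str) -> str:
        docs = retrieve(query)                      # 1. fetch relevant chunks
        context = "\n\n".join(docs)                 # 2. stuff them into one context
        prompt = template.format(context=context, question=query)
        return llm(prompt)                          # 3. ask the model

    return run


# Stubs standing in for the vector-store retriever and the LLM.
def fake_retrieve(query: str) -> List[str]:
    return ["Reports can be shared via a link.", "Team members can edit reports."]


seen_prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)  # record the prompt so it can be inspected
    return "stub answer"


qa_sketch = make_stuff_qa(fake_retrieve, fake_llm)
answer = qa_sketch("How can I share a report?")
print(answer)
```

Swapping the stubs for a real retriever and model changes nothing about the control flow, which is the convenience the chain abstraction provides.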


In [37]:
from langchain.chains import RetrievalQA

# Instantiate the retrieval QA chain from the Llama 2 LLM defined above
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
# Run the "query" above against this chain
# - retrieves the most relevant docs (k=3) for the "query"
# - stuffs the retrieved docs into the prompt together with the query
result = qa.run(query)

# We should see a similar answer to what we saw before
Markdown(result)

In [38]:
wandb.finish()

