<a href="https://colab.research.google.com/github/Khaihuyennguyen/NLP_LLM/blob/main/Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding Retrieval Question Answering

## Set up


In [17]:
!pip install -Uqqq rich openai==0.27.2 tiktoken wandb langchain unstructured tabulate pdf2image chromadb

In [18]:
import os, random
from pathlib import PurePosixPath
import tiktoken
from getpass import getpass
from rich.markdown import Markdown


In [30]:
# Prompt for the key only if it is not already set in the environment
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

In [31]:
os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

Paste your OpenAI key from: https://platform.openai.com/account/api-keys
··········


# LangChain

LangChain is a framework for developing applications powered by language models. We will use some of its features in the code below. Let's start by configuring W&B tracing.

In [20]:
# We need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# wandb documentation to configure wandb using env variables
# https://docs.wandb.ai/guides/track/advanced/environment-variables
# here we are configuring the wandb project name

os.environ["WANDB_PROJECT"] = "llmapps"

# Parsing documents

We will use a small sample of markdown documents in this notebook. Let's find them and make sure we can stuff them into the prompt. That means they may need to be chunked so that they do not exceed some number of tokens.

In [21]:
MODEL_NAME = "text-davinci-003"


In [22]:
# Grab some sample data

!git clone https://github.com/wandb/edu.git

fatal: destination path 'edu' already exists and is not an empty directory.


In [23]:
# We will need to count tokens in the documents, and for that we need the tokenizer
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

In [24]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
  "Find all markdown files in a directory and return them as LangChain documents"
  dl = DirectoryLoader(directory, "**/*.md")
  return dl.load()

documents = find_md_files('edu/llm-apps-course/docs_sample/')
len(documents)

11

In [25]:
# The function to count the tokens in each doc
def count_tokens(documents):
  token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
  return token_counts
count_tokens(documents)

[2135, 395, 763, 310, 665, 1957, 1154, 1199, 2657, 2676, 2330]
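
To make "fits into the prompt" concrete, the counts above can be compared against a token budget. This is only a sketch: `CONTEXT_BUDGET` and `over_budget` are hypothetical names, and the budget value is an assumption chosen for illustration, not a limit taken from the model.

```python
# Token counts per document, copied from the output above.
token_counts = [2135, 395, 763, 310, 665, 1957, 1154, 1199, 2657, 2676, 2330]

# Assumed budget for context tokens, leaving room for the question and the answer.
# The value is illustrative only.
CONTEXT_BUDGET = 2000

def over_budget(counts, budget):
    """Return the indices of documents whose token count exceeds the budget."""
    return [i for i, n in enumerate(counts) if n > budget]

total_tokens = sum(token_counts)
too_long = over_budget(token_counts, CONTEXT_BUDGET)
print(f"total tokens: {total_tokens}")
print(f"documents over budget: {too_long}")
```

With this assumed budget, several whole documents are too long on their own, and all of them together are far too long, which is why the next step splits them into sections.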

We will use LangChain's built-in MarkdownTextSplitter to split the documents into sections.

- We can pass the chunk_size param to avoid lengthy chunks
- The chunk_overlap param is useful so you don't cut sentences randomly. This is less necessary with Markdown
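
To illustrate what the two parameters control, here is a deliberately simplified sliding-window chunker. It is not how MarkdownTextSplitter works internally (that splitter prefers to break at Markdown structure such as headings); `chunk_with_overlap` is a hypothetical helper that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk.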

In [26]:
from langchain.text_splitter import MarkdownTextSplitter
md_text_splitter = MarkdownTextSplitter(chunk_size = 1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(88, 438)

In [27]:
# Let's see the first section
Markdown(document_sections[0].page_content)

# Embeddings

Let's now use embeddings with a vector database retriever to find relevant documents for a query


In [32]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# We will use the OpenAIEmbeddings to embed the text, and Chroma to store the vectors
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(document_sections, embeddings) # store the embeddings

In [33]:
# with the embeddings we created, we now can retrieve from the database


In [34]:
retriever = db.as_retriever(search_kwargs=dict(k=3))

In [35]:
query = "How can I share my W&B report with my team members in a public W&B project?"
docs = retriever.get_relevant_documents(query)

In [36]:
# Let's see the results
for doc in docs:
  print(doc.metadata["source"])

edu/llm-apps-course/docs_sample/collaborate-on-reports.md
edu/llm-apps-course/docs_sample/collaborate-on-reports.md
edu/llm-apps-course/docs_sample/teams.md
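
Conceptually, the retriever returns the k sections whose embedding vectors are closest to the embedding of the query. The sketch below shows that idea with toy 3-dimensional vectors and cosine similarity. The vectors, the names `toy_vectors`, `query_vector` and `top_k`, and the choice of cosine similarity are all assumptions for illustration; the metric and index that Chroma actually uses are not shown in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the names of the k vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy "embeddings": made-up numbers, not real model output.
toy_vectors = {
    "reports": [0.9, 0.1, 0.0],
    "teams": [0.6, 0.4, 0.0],
    "sweeps": [0.0, 0.2, 0.9],
    "artifacts": [0.1, 0.9, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

nearest = top_k(query_vector, toy_vectors, k=2)
print(nearest)
```

A real vector store does the same ranking over much higher-dimensional vectors, usually with an index that avoids comparing the query against every stored vector.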


# Stuff Prompt

We will take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass it into an LLM to obtain the answer.

In [38]:
from langchain.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
Question: {question}
Helpful Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context","question"]
)

context = "\n\n".join([doc.page_content for doc in docs])
prompt = PROMPT.format(context=context, question = query)

In [39]:
# Use LangChain to call the OpenAI completion API with the prompt
from langchain.llms import OpenAI
llm = OpenAI()
response = llm.predict(prompt)
Markdown(response)

# Using LangChain

LangChain gives us tools to do this efficiently in a few lines of code. Let's do the same using the RetrievalQA chain.
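
Conceptually, a "stuff" retrieval chain bundles the manual steps from the previous section: retrieve sections, join them into one context string, fill the prompt template, and call the model. The sketch below shows that flow with stand-in functions. It is not LangChain's implementation; `stuff_qa`, `fake_retrieve` and `echo_llm` are hypothetical names, and the template is a shortened version of the one used earlier.

```python
from typing import Callable, List

TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}
Question: {question}
Helpful Answer:"""

def stuff_qa(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    template: str = TEMPLATE,
) -> str:
    """Retrieve sections, stuff them into one prompt, and pass the prompt to the model."""
    sections = retrieve(question)
    context = "\n\n".join(sections)
    prompt = template.format(context=context, question=question)
    return llm(prompt)

def fake_retrieve(question: str) -> List[str]:
    # Stand-in for a vector store retriever; returns placeholder text.
    return ["section A text", "section B text"]

def echo_llm(prompt: str) -> str:
    # Stand-in for a model call; returns the prompt so it can be inspected.
    return prompt

filled_prompt = stuff_qa("placeholder question?", fake_retrieve, echo_llm)
print(filled_prompt)
```

Because `echo_llm` returns its input, the printed value is the exact prompt a real model would receive, which is handy for checking what ends up in the context.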

In [41]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type ="stuff", retriever=retriever)
result = qa.run(query)

Markdown(result)


In [42]:
import wandb
wandb.finish()