<a href="https://colab.research.google.com/github/Khaihuyennguyen/NLP_LLM/blob/main/Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding Retrieval Question Answering

## Set up


In [17]:
!pip install -Uqqq rich openai==0.27.2 tiktoken wandb langchain unstructured tabulate pdf2image chromadb

In [18]:
import os, random
from pathlib import PurePosixPath
import tiktoken
from getpass import getpass
from rich.markdown import Markdown

In [30]:
# Prompt for the key only if it is not already set in the environment
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

In [31]:
os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

Paste your OpenAI key from: https://platform.openai.com/account/api-keys
··········


# LangChain

LangChain is a framework for developing applications powered by language models. We will use some of its features in the code below. Let's start by configuring W&B tracing.

In [20]:
# We need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# See the wandb documentation on configuring wandb with environment variables:
# https://docs.wandb.ai/guides/track/advanced/environment-variables
# Here we set the wandb project name

os.environ["WANDB_PROJECT"] = "llmapps"

# Parsing documents

We will use a small sample of markdown documents in this notebook. Let's find them and make sure we can stuff them into the prompt. That means they may need to be chunked so that the prompt does not exceed the model's token limit.

In [21]:
MODEL_NAME = "text-davinci-003"


In [22]:
# Grab some sample data

!git clone https://github.com/wandb/edu.git

fatal: destination path 'edu' already exists and is not an empty directory.


In [23]:
# We will need to count tokens in the documents, and for that we need the tokenizer
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

In [24]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
  "Find all markdown files in a directory and return them as LangChain documents"
  dl = DirectoryLoader(directory, "**/*.md")
  return dl.load()

documents = find_md_files('edu/llm-apps-course/docs_sample/')
len(documents)

11

In [25]:
# The function to count the tokens in each doc
def count_tokens(documents):
  token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
  return token_counts
count_tokens(documents)

[2135, 395, 763, 310, 665, 1957, 1154, 1199, 2657, 2676, 2330]
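To see why chunking is needed, the token counts above can be compared against a context budget. This is a minimal sketch: `CONTEXT_LIMIT = 4097` is an assumption about the model's context window, not a value taken from this notebook.

```python
# Token counts copied from the output above.
token_counts = [2135, 395, 763, 310, 665, 1957, 1154, 1199, 2657, 2676, 2330]

# Assumed context window for the model (illustrative value, not from the notebook).
CONTEXT_LIMIT = 4097

total_tokens = sum(token_counts)
largest_doc = max(token_counts)

print(f"total tokens: {total_tokens}")
print(f"largest document: {largest_doc}")
print(f"all documents fit together: {total_tokens <= CONTEXT_LIMIT}")
print(f"each document fits alone: {largest_doc <= CONTEXT_LIMIT}")
```

Under that assumption, every document fits on its own but the full set does not, and a real prompt also needs room for the question and the generated answer.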

We will use LangChain's built-in `MarkdownTextSplitter` to split the documents into sections.

- We can pass the `chunk_size` param to avoid lengthy chunks
- The `chunk_overlap` param is useful so you don't cut sentences at arbitrary points. This matters less with Markdown, where the splitter can break at structural boundaries such as headings

In [26]:
from langchain.text_splitter import MarkdownTextSplitter
md_text_splitter = MarkdownTextSplitter(chunk_size = 1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(88, 438)
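The roles of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over characters. This is a simplified sketch, not how `MarkdownTextSplitter` works internally; `chunk_with_overlap` is a hypothetical helper written for this example. As far as I know, the LangChain splitters measure `chunk_size` in characters by default, which would be consistent with the largest section above having 438 tokens rather than 1000.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows of at most chunk_size,
    where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both neighbours instead of being cut in half.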

In [27]:
# Let's look at the first section
Markdown(document_sections[0].page_content)

# Embeddings

Let's now use embeddings with a vector database retriever to find relevant documents for a query.


In [32]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# We will use the OpenAIEmbeddings to embed the text, and Chroma to store the vectors
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(document_sections, embeddings) # store the embeddings

In [33]:
# With the embeddings stored, we can now retrieve from the database


In [34]:
retriever = db.as_retriever(search_kwargs=dict(k=3))

In [35]:
query = "How can I share my W&B report with my team members in a public W&B project?"
docs = retriever.get_relevant_documents(query)

In [36]:
# Let's see the results
for doc in docs:
  print(doc.metadata["source"])

edu/llm-apps-course/docs_sample/collaborate-on-reports.md
edu/llm-apps-course/docs_sample/collaborate-on-reports.md
edu/llm-apps-course/docs_sample/teams.md
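Conceptually, a retriever like the one above ranks stored vectors by how close they are to the query embedding and returns the top `k`. The sketch below shows that idea with cosine similarity on toy 3-dimensional vectors. The vectors, the names in `toy_store`, and the helper functions are all made up for illustration, and the distance metric Chroma actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return the names of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical embeddings; real ones have many more dimensions.
toy_store = {
    "reports": [0.9, 0.1, 0.0],
    "teams": [0.7, 0.6, 0.1],
    "artifacts": [0.0, 0.2, 0.9],
    "sweeps": [0.1, 0.9, 0.2],
}
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_store, k=2)
print(results)
```

A vector database does the same kind of ranking, typically with an index that avoids comparing the query against every stored vector.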


# Stuff Prompt

We will take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass it into an LLM to obtain the answer.
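The stuffing step itself is plain string formatting. Here is a minimal sketch, assuming a hypothetical template and placeholder texts in place of the retrieved `page_content` values; the template wording is an illustration, not necessarily the prompt LangChain uses.

```python
# Hypothetical template: {context} receives the retrieved text, {question} the query.
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, say that you don't know.

{context}

Question: {question}
Helpful Answer:"""

def build_stuff_prompt(doc_texts, question):
    """Join all retrieved texts and insert them into the template with the question."""
    context = "\n\n".join(doc_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Placeholder strings standing in for doc.page_content of each retrieved document.
retrieved_texts = ["First retrieved section.", "Second retrieved section."]
prompt = build_stuff_prompt(retrieved_texts, "How can I share my W&B report?")
print(prompt)
```

Because every retrieved section goes into one prompt, the combined length of the sections, the question, and the expected answer has to stay within the model's token limit.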