<a href="https://colab.research.google.com/github/Chirag314/LLM/blob/main/LLM3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

UNDERSTANDING RETRIEVAL QUESTION ANSWERING

In [None]:
!pip install -Uqqq rich openai tiktoken wandb tenacity langchain unstructured tabulate pdf2image chromadb

In [None]:
import tiktoken
import wandb
from pprint import pprint
from wandb.integration.openai import autolog
import random

In [None]:
import os
import random
from getpass import getpass
import openai
from pathlib import Path
from rich.markdown import Markdown
import pandas as pd
from tenacity import(
    retry,
    stop_after_attempt,
    wait_random_exponential,
)


In [None]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(['COLAB' in x for x in os.environ.keys()]):
    print('Please enter password in the prompt at the top of your window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")
  openai.api_key = os.getenv("OPENAI_API_KEY", "")

Please enter password in the prompt at the top of your window!
Paste your OpenAI key from: https://platform.openai.com/account/api-keys
··········


LangChain
LangChain is a framework for developing applications powered by language models. Let's use it in the code.

In [None]:
#trace langchain with W&B
os.environ['LANGCHAIN_WANDB_TRACING']="true"

os.environ["WANDB_PROJECT"]="llmapps"

Parsing Documents: We will use a small sample of markdown documents in this notebook.

In [None]:
model_name="text-davinci-003"

In [None]:
# If the sample docs directory is missing (e.g. when running in Colab), clone the repo and copy the files over
if not os.path.exists("../docs_sample/"):
  !git clone https://github.com/wandb/edu.git
  !cp -r edu/llm-apps-course/docs_sample ../

Cloning into 'edu'...
remote: Enumerating objects: 2493, done.
remote: Counting objects: 100% (1036/1036), done.
remote: Compressing objects: 100% (386/386), done.
remote: Total 2493 (delta 723), reused 858 (delta 642), pack-reused 1457
Receiving objects: 100% (2493/2493), 22.60 MiB | 14.31 MiB/s, done.
Resolving deltas: 100% (1419/1419), done.


In [None]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
  "Find all markdown files in a directory (recursively) and return them as a list of LangChain documents"
  dl=DirectoryLoader(directory, glob="**/*.md")
  return dl.load()

documents=find_md_files('../docs_sample/')
len(documents)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


11
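To see what the `**/*.md` pattern selects, here is a minimal sketch using only the standard library. It is not how `DirectoryLoader` is implemented, and `list_markdown_files` and the temporary files are hypothetical names made up for the illustration. It only shows that the pattern matches Markdown files at any depth and skips everything else.

```python
import tempfile
from pathlib import Path


def list_markdown_files(directory):
    """Return sorted paths of all .md files under directory, searched recursively."""
    # "**/*.md" matches .md files in the directory itself and in any subdirectory.
    return sorted(Path(directory).glob("**/*.md"))


# Build a throwaway directory tree to show what the pattern picks up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / "intro.md").write_text("# Intro")
    (root / "guides" / "teams.md").write_text("# Teams")
    (root / "notes.txt").write_text("not markdown")

    found = [p.relative_to(root).as_posix() for p in list_markdown_files(root)]

print(found)
```

The loader in the notebook goes one step further and parses each matched file into a document object with `page_content` and `metadata`.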

In [None]:
# We will need to count tokens in the documents and for that we need the tokenizer
tokenizer=tiktoken.encoding_for_model(model_name)

In [None]:
#Function to count number of tokens
def count_tokens(documents):
  token_counts=[len(tokenizer.encode(document.page_content)) for document in documents]
  return token_counts

count_tokens(documents)

[763, 310, 1154, 2135, 2047, 2616, 665, 387, 2592, 2330, 1199]
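The counts above can be turned into a quick sanity check before deciding whether whole documents are small enough to pass to a model. The sketch below is only arithmetic on the numbers printed above. `CONTEXT_BUDGET` and `K` are hypothetical values chosen for illustration, not limits taken from the notebook.

```python
# Token counts per document, copied from the output above.
token_counts = [763, 310, 1154, 2135, 2047, 2616, 665, 387, 2592, 2330, 1199]

# Hypothetical budget for retrieved context; pick one that fits your model's window.
CONTEXT_BUDGET = 3000
K = 3  # hypothetical number of documents retrieved per query

total = sum(token_counts)
largest = max(token_counts)
# Worst case: the K largest documents are retrieved together.
worst_case = sum(sorted(token_counts, reverse=True)[:K])

print(f"total={total}, largest={largest}, worst case for k={K}: {worst_case}")
print("fits budget:", worst_case <= CONTEXT_BUDGET)
```

Under these assumed numbers the worst case does not fit, which is one reason to split the documents into smaller sections before retrieval.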

We will use LangChain's built-in MarkdownTextSplitter to split the documents into sections. Splitting Markdown without breaking its syntax is not easy. This splitter strips out syntax.
1. We can pass the chunk_size param to avoid lengthy chunks. By default it is measured in characters, not tokens, so chunk_size=1000 caps each chunk at 1000 characters (the largest chunk below is 438 tokens).
2. The chunk_overlap param is useful so you don't cut sentences at arbitrary points. This is less necessary with Markdown.
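To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. This is a deliberately simplified stand-in and not the algorithm `MarkdownTextSplitter` uses, since that splitter tries to break at Markdown structure rather than at fixed offsets. The function `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.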



In [None]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter=MarkdownTextSplitter(chunk_size=1000)
document_sections=md_text_splitter.split_documents(documents)
len(document_sections),max(count_tokens(document_sections))

(90, 438)

In [None]:
Markdown(document_sections[0].page_content)

Embeddings
Let's now use embeddings with a vector database retriever to find relevant documents for a query.

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# use OpenAI to embed the text and Chroma to store the vectors
embeddings=OpenAIEmbeddings()
db=Chroma.from_documents(document_sections,embeddings)

We can now create a retriever from the db. The k param sets how many of the most relevant sections the similarity search returns.
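The idea behind a top-k similarity search can be sketched in a few lines of plain Python. The vectors below are made-up three-dimensional toys, not real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers. Cosine similarity is also an assumption here: a vector store may rank by a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), name)
        for name, vec in indexed_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy index: section name -> made-up embedding.
toy_index = {
    "reports": [0.9, 0.1, 0.0],
    "teams": [0.7, 0.6, 0.1],
    "sweeps": [0.0, 0.2, 0.9],
    "artifacts": [0.1, 0.9, 0.2],
}
toy_query = [1.0, 0.2, 0.0]

nearest = top_k(toy_query, toy_index, k=2)
print(nearest)
# → ['reports', 'teams']
```

A real vector database does the same ranking over high-dimensional embeddings, usually with an index that avoids comparing the query against every stored vector.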

In [None]:
retriever=db.as_retriever(search_kwargs=dict(k=3))

In [17]:
query="How can I share my W&B report with my team members in a public W&B project?"
docs=retriever.get_relevant_documents(query)


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Streaming LangChain activity to W&B at https://wandb.ai/cdesai/llmapps/runs/mpkx96ka
wandb: `WandbTracer` is currently in beta.
wandb: Please report any issues to https://github.com/wandb/wandb/issues with the tag `langchain`.


In [18]:
# let's see which source files the retrieved sections come from
for doc in docs:
  print(doc.metadata['source'])

../docs_sample/collaborate-on-reports.md
../docs_sample/collaborate-on-reports.md
../docs_sample/teams.md


Stuff prompt
Take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass the result to an LLM to obtain the answer.
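Stuffing only works while the combined context stays within the model's input limit. The sketch below shows one way to guard against that, under stated assumptions: `stuff_prompt` and `count_words` are hypothetical helpers, the word count is a crude stand-in for a real tokenizer such as the tiktoken one used earlier, and the chunk texts are placeholder strings rather than content from the sample docs.

```python
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


def count_words(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def stuff_prompt(chunks, question, max_context_tokens, count=count_words):
    """Add chunks to the prompt in order, stopping before the budget is exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    prompt_text = TEMPLATE.format(context="\n\n".join(kept), question=question)
    return prompt_text, len(kept)


# Placeholder chunks, assumed to be ordered from most to least relevant.
toy_chunks = [
    "Reports can be shared with a link.",
    "Team members can edit reports together.",
    "Sweeps automate hyperparameter search.",
]
prompt_text, n_used = stuff_prompt(
    toy_chunks, "How do I share a report?", max_context_tokens=14
)
print(n_used)
print(prompt_text)
```

Because the retriever returns sections ordered by relevance, dropping from the end of the list discards the least relevant context first.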

In [20]:
from langchain.prompts import PromptTemplate
prompt_template="""Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

Helpful Answer:"""
PROMPT=PromptTemplate(
    template=prompt_template,input_variables=['context','question']
)

context="\n\n".join([doc.page_content for doc in docs])
prompt=PROMPT.format(context=context,question=query)

In [21]:
# use LangChain to call the OpenAI completion API with the stuffed prompt
from langchain.llms import OpenAI

llm=OpenAI()
response=llm.predict(prompt)
Markdown(response)

Using LangChain
LangChain gives us tools to do this efficiently in a few lines of code. Let's do the same using the RetrievalQA chain.

In [22]:
from langchain.chains import RetrievalQA
qa=RetrievalQA.from_chain_type(llm=OpenAI(),chain_type='stuff',retriever=retriever)
result=qa.run(query)
Markdown(result)
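Conceptually, a "stuff" retrieval QA chain composes the three steps performed by hand earlier: retrieve, fill the template, call the model. The sketch below illustrates that composition with hypothetical stand-ins (`make_retrieval_qa`, `fake_retrieve`, `fake_llm`) so it runs without an API key. It is not LangChain's implementation.

```python
def make_retrieval_qa(retrieve, llm, template):
    """Return a function that answers a question in three steps:
    retrieve chunks, stuff them into the template, and call the model."""
    def answer(question):
        chunks = retrieve(question)
        prompt = template.format(context="\n\n".join(chunks), question=question)
        return llm(prompt)
    return answer


# Hypothetical stand-ins for the retriever and the model.
def fake_retrieve(question):
    return ["chunk about reports", "chunk about teams"]


def fake_llm(prompt):
    # Report how many chunks reached the model instead of generating text.
    return f"received {prompt.count('chunk')} chunks"


qa_sketch = make_retrieval_qa(
    fake_retrieve,
    fake_llm,
    "{context}\n\nQuestion: {question}\n\nHelpful Answer:",
)
answer_text = qa_sketch("How can I share my report?")
print(answer_text)
# → received 2 chunks
```

Swapping `fake_retrieve` for a real retriever call and `fake_llm` for a real model call gives the same flow the chain runs for each query.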

In [23]:
import wandb
wandb.finish()