# 1 Prepare The Environment


In [None]:
!pip install --upgrade --quiet  docx2txt
!pip install --quiet unstructured

In [None]:
!pip install --upgrade langchain
!pip install langchain_openai
!pip install chromadb



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from google.colab import userdata
OPEN_AI_KEY= userdata.get('OPENAI_KEY')

# 2 LangChain Q&A over Documents - Streamlined

Example: A tool that would allow you to query a product catalog for items of interest.

In [None]:
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import Chroma
from IPython.display import display, Markdown
from langchain_openai import OpenAI

We mostly rely on LangChain:

- Manual: https://api.python.langchain.com/en/stable/langchain_api_reference.html
- Documentation: https://python.langchain.com/docs/modules/data_connection/

## 2.1 Loading and chunking the data

In [None]:
# document
file = "/content/drive/MyDrive/notes/week11/data/Raptor Contract.docx.txt"

In [None]:
with open(file, "r", encoding="utf-8") as f:
  # Read the contents of the file
  file_contents = f.read()

In [None]:
type(file_contents)

str

Language models have a token limit that we should not exceed. When we split the text into chunks, we therefore want to measure chunk size in tokens rather than characters.
It is best to count tokens with the same tokenizer that the language model uses.
Because we will use OpenAI later, we use [tiktoken](https://github.com/openai/tiktoken), a fast BPE tokenizer created by OpenAI.

- How the text is split: recursively, by the separator characters passed in.
- How the chunk size is measured: first in characters (`length_function=len`), then in tokens by the tiktoken tokenizer.
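To make the two bullets above concrete, here is a minimal sketch of recursive splitting with a pluggable length function. It is a simplified illustration, not LangChain's actual implementation: `recursive_split` and `count_words` are hypothetical helpers, separators are dropped at cut points, and word counting only stands in for a real token counter such as tiktoken.

```python
from typing import Callable, List


def recursive_split(
    text: str,
    separators: List[str],
    chunk_size: int,
    length: Callable[[str], int] = len,
) -> List[str]:
    """Split `text` so that each piece measures at most `chunk_size` under `length`."""
    if length(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing finer to split on; keep the piece as it is

    sep, finer = separators[0], separators[1:]
    parts = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(part) <= chunk_size:
            current = part
        else:
            # the part alone is too big: retry with the finer separators
            chunks.extend(recursive_split(part, finer, chunk_size, length))
    if current:
        chunks.append(current)
    return chunks


def count_words(s: str) -> int:
    # crude stand-in for a token counter (assumption: a real pipeline would count tiktoken tokens)
    return len(s.split())


text = "Section one.\n\nSection two is a bit longer than the first. It has two sentences.\n\nEnd."
seps = ["\n\n", "\n", ". ", " ", ""]
by_chars = recursive_split(text, seps, chunk_size=40)                      # size measured in characters
by_words = recursive_split(text, seps, chunk_size=5, length=count_words)   # size measured in "tokens"
print(by_chars)
print(by_words)
```

The only thing that changes between the two calls is the `length` function, which is the same idea as swapping `len` for a tokenizer-based counter.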

In [None]:
!pip install --upgrade --quiet  tiktoken

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import TokenTextSplitter

def text_splitter_chunks(doc: str):
  text_splitter = RecursiveCharacterTextSplitter(
      separators=["\n\n", "\n", ". ", " ", ""],
      chunk_size=1000, # checking here https://chunkviz.up.railway.app/ .. it seems optimal 700-1000 char
      chunk_overlap=0,
      length_function=len,
      is_separator_regex=False,
  )

  docs = text_splitter.create_documents([doc])
  return docs

# Look up encoding_name for tiktoken used in TokenTextSplitter here:
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
# "cl100k_base" for models "gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002",  and others.

def tiktoken_textsplitter(character_split_texts):
    token_splitter = TokenTextSplitter(encoding_name='cl100k_base', chunk_overlap=0, chunk_size=256) # uses tiktoken by default
    # token_split_texts = []
    # for text in character_split_texts:
    #     token_split_texts += token_splitter.split_text(text.page_content)
    token_split_texts = token_splitter.split_documents(character_split_texts)
    return token_split_texts

In [None]:
docs = text_splitter_chunks(file_contents)
print(f"Number of chunks after recursive chunking is {len(docs)}")
docs = tiktoken_textsplitter(docs)
print(f"Final number of chunks is {len(docs)}")

Number of chunks after recursive chunking is 357
Final number of chunks is 361


In [None]:
docs[0]

Document(page_content='\ufeffSTOCK PURCHASE AGREEMENT\nBY AND AMONG\n[BUYER],\n[TARGET COMPANY],\nTHE SELLERS LISTED ON SCHEDULE I HERETO\nAND\nTHE SELLERS’ REPRESENTATIVE NAMED HEREIN\nDated as of [●]\n\n\n[This document is intended solely to facilitate discussions among the parties identified herein.  Neither this document nor such discussions are intended to create, nor will either or both be deemed to create, a legally binding or enforceable offer or agreement of any type or nature, unless and until a definitive written agreement is executed and delivered by each of the parties hereto.\n\n\nThis document shall be kept confidential pursuant to the terms of the Confidentiality Agreement entered into by the parties and, if applicable, its affiliates with respect to the subject matter hereof.]')

**step 2: Vector Store**

Here we use an in-memory vector store that does not need to connect to an external database, which keeps this first example simple.

In [None]:
from langchain.indexes import VectorstoreIndexCreator

In [None]:
# Define the embedding used by vector store
# if the OPENAI_API_KEY environment variable is set, we don't need to define the embedding explicitly,
# because the VectorstoreIndexCreator used OpenAIEmbeddings by default
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=OPEN_AI_KEY)

In [None]:
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(docs, embeddings_model)

In [None]:
# Important: We can use this vector store to find "chunks" similar to any text

# an example query

query = " How much is the escrow amount?"

similar_docs = db.similarity_search(query, k=5) # returns the 5 most similar docs, default is 4

In [None]:
similar_docs

[Document(page_content='3. Escrow.'),
 Document(page_content='3. Escrow.'),
 Document(page_content='1. At Closing, Buyer will deposit the Escrow Amount in escrow on behalf of the Sellers in accordance with the Escrow Agreement.  The Escrow Amount shall be held and, subject to Section 2.07, released to the Company Securityholders in accordance with the provisions of the Escrow Agreement with the Company Securityholders being entitled to share in such released amounts in accordance with their Pro Rata Percentages'),
 Document(page_content='1. At Closing, Buyer will deposit the Escrow Amount in escrow on behalf of the Sellers in accordance with the Escrow Agreement.  The Escrow Amount shall be held and, subject to Section 2.07, released to the Company Securityholders in accordance with the provisions of the Escrow Agreement with the Company Securityholders being entitled to share in such released amounts in accordance with their Pro Rata Percentages'),
 Document(page_content='final Purcha

In [None]:
# define the retriever
retriever = db.as_retriever()

In [None]:
# # indexCreator helps us create a vector store easily. What is the role of indexer?

# index = VectorstoreIndexCreator(
#     vectorstore_cls=Chroma,
#     embedding= embeddings_model # OpenAIEmbeddings is the default, see note above
# ).from_documents(docs)

**step 3: choose an llm**

In [None]:
llm_model_name = 'gpt-3.5-turbo-instruct'
llm_model = OpenAI(temperature=0.0, model=llm_model_name, openai_api_key= OPEN_AI_KEY)

Setting temperature = 0 reduces randomness. We want the generative model to be precise and grounded in the provided documents rather than creative.
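As a rough illustration of why a lower temperature reduces randomness, the sketch below applies a temperature-scaled softmax to some made-up token scores. This is the textbook formulation, not a description of how OpenAI implements `temperature` internally; the function name, the logits, and the choice to treat temperature 0 as a pure argmax are all assumptions.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into sampling probabilities, sharpened by a low temperature."""
    if temperature <= 0:
        # limiting case (assumption): all probability mass goes to the highest score
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
p_warm = softmax_with_temperature(logits, 1.0)
p_cool = softmax_with_temperature(logits, 0.5)
p_zero = softmax_with_temperature(logits, 0.0)
for name, probs in (("T=1.0", p_warm), ("T=0.5", p_cool), ("T=0.0", p_zero)):
    print(name, [round(p, 3) for p in probs])
```

As the temperature drops, the most likely token takes a larger share of the probability, so repeated runs become more consistent.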

**step 4: Combine vector store + llm and query it**

In [None]:
# an example query

query = " How much is the escrow amount?"

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm_model,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

In [None]:
import langchain
langchain.debug = True

In [None]:
response = qa_stuff.invoke(query)

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": " How much is the escrow amount?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": " How much is the escrow amount?",
  "context": "3. Escrow.\n\n3. Escrow.\n\n1. At Closing, Buyer will deposit the Escrow Amount in escrow on behalf of the Sellers in accordance with the Escrow Agreement.  The Escrow Amount shall be held and, subject to Section 2.07, released to the Company Securityholders in accordance with the provisions of the Escrow Agreement with the Company Securityholders being entitled to share in such released amounts in accordance with their Pro Rata Percentages\n\n1. At Closing, Buyer will deposit the Escrow Amount in escrow on behalf of

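The RAG fails to answer the question. At first glance the retriever seems to be at fault, since it does not return the relevant chunks.

On closer inspection, the underlying issue seems to be the chunking: several of the retrieved chunks are bare headings such as "3. Escrow." or exact duplicates, which carry little information for the LLM.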
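One way to check the chunking hypothesis is to count duplicate and very short chunks. The sketch below is only a diagnostic idea: `chunk_diagnostics` and the `min_chars` threshold are hypothetical, and the sample strings are copied from the retrieval output above.

```python
from collections import Counter
from typing import Dict, List


def chunk_diagnostics(chunks: List[str], min_chars: int = 40) -> Dict[str, object]:
    """Summarize how many chunks are duplicated or too short to be informative."""
    counts = Counter(c.strip() for c in chunks)
    duplicates = {text: n for text, n in counts.items() if n > 1}
    short = [text for text in counts if len(text) < min_chars]
    return {
        "total": len(chunks),
        "unique": len(counts),
        "duplicates": duplicates,
        "short": short,
    }


clause = (
    "1. At Closing, Buyer will deposit the Escrow Amount in escrow on behalf of "
    "the Sellers in accordance with the Escrow Agreement."
)
sample_chunks = ["3. Escrow.", "3. Escrow.", clause, clause]
report = chunk_diagnostics(sample_chunks)
print(report["total"], report["unique"], report["short"])
```

On the real data this could be called as `chunk_diagnostics([d.page_content for d in docs])`, assuming `docs` is the list of chunks produced earlier.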

# 3 LangChain Q&A over Documents - Step by step

General idea: the goal is to combine an LLM with our documents. The problem is that LLMs have a limited "context window", which cannot fit large documents.

Embedding: a numeric representation of text that encodes its semantic meaning. Similar texts have similar embedding vectors, typically measured by cosine similarity.
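For reference, cosine similarity can be computed as follows. The three-dimensional vectors are made up for illustration; real embedding vectors have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# hypothetical embeddings, invented for this example
escrow_clause = [0.9, 0.1, 0.0]
escrow_question = [0.8, 0.2, 0.1]
unrelated_text = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(escrow_clause, escrow_question)
sim_unrelated = cosine_similarity(escrow_clause, unrelated_text)
print(round(sim_related, 3), round(sim_unrelated, 3))
```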

Vector store: a way to store embedding vectors. It is populated with "chunks" of text from the documents.
Why chunks, and not whole documents?
Because these chunks are what gets passed to the LLM later, so they have to be small enough to fit the LLM's context window.

Each chunk -> embedding vector -> stored in Vector DB

Query -> embedding vector of query -> compared to vectors in Vector DB -> returns most relevant vectors -> passed to LLM with query
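The two pipelines above can be sketched end to end with a toy in-memory store. Everything here is a stand-in: `embed` just counts words instead of calling an embedding model, `ToyVectorStore` is a hypothetical class rather than Chroma, and the chunks are short paraphrases written for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, int]:
    """Toy embedding: word counts (a real system would call an embedding model)."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (chunk, vector) pairs in a list and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Dict[str, int]]] = []

    def add(self, chunks: List[str]) -> None:
        # each chunk -> embedding vector -> stored
        for chunk in chunks:
            self._items.append((chunk, embed(chunk)))

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        # query -> embedding vector -> compared to stored vectors -> top k chunks
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "Buyer will deposit the Escrow Amount in escrow at Closing.",
    "The Sellers' Representative is named herein.",
    "This document shall be kept confidential.",
])
top = store.similarity_search("How much is the escrow amount?", k=1)
print(top)
```

The returned chunks are what would then be passed to the LLM together with the query.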



### Step 5: Define the Retriever

A retriever is a generic interface that takes a query and returns documents. It can be backed by any method that achieves this; a vector store is just one example. There are other methods, both simpler and more advanced than a vector store.

In [None]:
retriever = db.as_retriever()
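The "generic interface" idea can be sketched with a small protocol and a keyword-based retriever that uses no vector store at all. The names `Retriever`, `retrieve`, `KeywordRetriever`, and `build_context` are hypothetical and are not LangChain's API; the point is only that any object mapping a query to documents can fill the role.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that maps a query string to a list of documents."""

    def retrieve(self, query: str) -> List[str]:
        ...


class KeywordRetriever:
    """A simpler alternative to a vector store: rank documents by shared words."""

    def __init__(self, documents: List[str], k: int = 2) -> None:
        self.documents = documents
        self.k = k

    def retrieve(self, query: str) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self.documents]
        scored = [pair for pair in scored if pair[0] > 0]  # drop docs with no overlap
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


def build_context(retriever: Retriever, query: str) -> str:
    # downstream code depends only on the interface, not on how retrieval works
    return "\n\n".join(retriever.retrieve(query))


catalog = [
    "Silk womens top with short sleeves",
    "Black cotton top with long sleeves",
    "Waterproof zipped jacket",
]
keyword_retriever = KeywordRetriever(catalog)
context = build_context(keyword_retriever, "zipped jacket")
print(context)
```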

### Step : Use a retriever chain instead

Important: choose a retriever **chain type**.
*As far as I can tell, the chain type determines how the prompt is built from the query and the retrieved documents.*
1. stuff: here we use chain_type="stuff", the simplest option, which stuffs all relevant docs into the context.
   - pros: makes 1 call to the LLM
   - cons: runs into the context window limit if many or large docs are retrieved
2. map_reduce: passes each retrieved chunk + query to the LLM separately, then passes all the outputs back to the LLM to be combined into a final answer.
   - pros: takes any number of retrieved docs; parallelizable
   - cons: makes many calls to the LLM (expensive); treats all retrieved docs as **independent**, which might not be the case
3. refine: works iteratively. It passes the query + 1 chunk to the LLM, then passes the output of the previous step + query + another chunk to the LLM, and so on.
   - pros: leads to longer answers (pro or con?)
   - cons: takes longer; not parallelizable; makes many calls to the LLM
4. map_rerank [more experimental]: similar to map_reduce, but with each call you also ask the LLM to return a score (e.g., tell the LLM to score on relevance).
   - pros: calls are independent and can be batched; relatively fast
   - cons: makes many calls to the LLM
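The difference in the number of LLM calls can be seen in a small sketch of the first three strategies. The prompts are invented for illustration and are not the prompts LangChain uses, and `CountingLLM` is a fake model that only counts how often it is called.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def stuff(llm: LLM, query: str, docs: List[str]) -> str:
    # one prompt containing every retrieved doc
    prompt = "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {query}"
    return llm(prompt)


def map_reduce(llm: LLM, query: str, docs: List[str]) -> str:
    # one call per doc (independent, so these could run in parallel), then one combining call
    partials = [llm(f"Context:\n{doc}\n\nQuestion: {query}") for doc in docs]
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {query}")


def refine(llm: LLM, query: str, docs: List[str]) -> str:
    # sequential: each call sees the previous answer plus one more doc
    answer = ""
    for doc in docs:
        answer = llm(f"Existing answer: {answer}\nNew context:\n{doc}\n\nQuestion: {query}")
    return answer


class CountingLLM:
    """Fake LLM (assumption for this sketch) that records the number of calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer #{self.calls}"


docs = ["chunk A", "chunk B", "chunk C"]
call_counts = {}
for strategy in (stuff, map_reduce, refine):
    fake_llm = CountingLLM()
    strategy(fake_llm, "How much is the escrow amount?", docs)
    call_counts[strategy.__name__] = fake_llm.calls
print(call_counts)
```

With n retrieved chunks, this sketch makes 1 call for stuff, n + 1 for map_reduce, and n for refine.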

# 4 Evaluation
Outline:
- Example generation
- Manual evaluation (and debugging)
- LLM-assisted evaluation
- LangChain evaluation platform

## 4.1 The difficulty of Evaluating RAGs (particularly LLM)

When creating a QA RAG system, we want to evaluate its performance using an evaluation metric (such as accuracy). Especially during development, we may want to measure performance repeatedly (and ideally automatically) after every modification introduced into the system.
- We can check things by eye: run a test case through the pipeline and judge the result ourselves. However, this is limiting.
  - Repeated tests are time-consuming and inconvenient.
  - We can easily check the final result, but it is harder to check the intermediate steps; e.g., is the retrieved doc correct?
- Comparing the result of a RAG to the ground truth involves comparing strings for semantic matching.
  - For example: suppose we ask our RAG "Does the Earth have one moon?". Ground truth: "Yes". RAG's prediction: "The earth does have one moon". The two answers are equivalent, so the RAG got it right. However, the two strings do not match exactly, and they are not even similar as strings. They mean the same thing only in the context of this question; taken on their own they do not, so their embeddings would not have a high cosine similarity either.
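The moon example can be checked with plain string comparison. The sketch below uses exact matching and the character-level ratio from Python's standard `difflib`; the helper names are made up for this example.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, truth: str) -> bool:
    """Case-insensitive exact comparison after trimming whitespace."""
    return prediction.strip().lower() == truth.strip().lower()


def char_similarity(prediction: str, truth: str) -> float:
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, prediction.lower(), truth.lower()).ratio()


truth = "Yes"
prediction = "The earth does have one moon"

is_exact = exact_match(prediction, truth)
similarity = char_similarity(prediction, truth)
print(is_exact, round(similarity, 3))
```

Both checks would mark a correct answer as wrong, which is why string comparison alone is not enough here.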


Frameworks and tools that help evaluating LLM based applications:
- Visualizers and debuggers: help see what is going on
  - setting `langchain.debug = True`


> The difficulty in comparing strings is what makes evaluating LLMs hard in the first place. LLMs are used to do really open-ended tasks, generating text, which haven't been done before. Until recently, models weren't good enough to do them. Hence, a lot of evaluation metrics that existed so far aren't good enough. So, we're having to invent new ones, with new heuristics. One of the most popular heuristics is using an LLM to do the evaluation.

## 4.2 The RAG we will use

In [None]:
# We are going to use the same QA retriever chain setup as in section 2

# import needed libraries
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import CSVLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import OpenAI
from IPython.display import display, Markdown

# Data Source
file = "/content/drive/MyDrive/notes/week11/apparel_modified.csv"
loader = CSVLoader(file_path =file)
data = loader.load()

# Embedding
embeddings_model = OpenAIEmbeddings(openai_api_key=OPEN_AI_KEY)

# Retriever: Populate VectorStore
index = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding= embeddings_model # OpenAIEmbeddings is the default, see note above
).from_loaders([loader])

# Generator
llm_model_name = 'gpt-3.5-turbo-instruct'
llm_model = OpenAI(temperature=0.0, model=llm_model_name, openai_api_key= OPEN_AI_KEY)


# Chain:  chain_type= stuff
qa = RetrievalQA.from_chain_type(
    llm=llm_model,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

## 4.3 Generating Test Examples

In [None]:
# coming up with test data points
data[10]

Document(page_content=": 12\nname: Zipped Jacket\ndescription: Dark navy and light blue men's zipped waterproof jacket with an outer zipped chestpocket for easy storeage.", metadata={'source': '/content/drive/MyDrive/notes/week11/apparel_modified.csv', 'row': 10})

In [None]:
data[11]

Document(page_content=': 13\nname: Silk Summer Top\ndescription: Silk womens top with short sleeves and number pattern.', metadata={'source': '/content/drive/MyDrive/notes/week11/apparel_modified.csv', 'row': 11})

### 4.3.1 Hard-coded Examples

We make up a couple of question-answer pairs ourselves.

In [None]:
hardcoded_examples = [
    {
        "query": "Does the Zipped Jacket have a chestpocket?",
        "answer": "Yes."
    },
    {
        "query": "What kind of sleeves does the Silk Summer Top have?",
        "answer": "It has short sleeves."
    }
]

### 4.3.2 LLM Generated Examples

Use an LLM to generate examples.

*Q: How do we know that it is reliable? Does it need to be a better (more expensive) model, like GPT-4? Surely we cannot use the same model we used in the RAG to generate tests for its own performance!*

In [None]:
from langchain.evaluation.qa import QAGenerateChain

In [None]:
example_gen_chain = QAGenerateChain.from_llm(OpenAI(model=llm_model_name, openai_api_key= OPEN_AI_KEY))

In [None]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [None]:
new_examples[1]

{'qa_pairs': {'query': 'What is the name of the top described in the document?',
  'answer': 'The name of the top is "Classic Varsity Top."'}}

In [None]:
data[1]

Document(page_content=': 1\nname: Classic Varsity Top\ndescription: Womens casual varsity top, This grey and black buttoned top is a sport-inspired piece complete with an embroidered letter.', metadata={'source': '/content/drive/MyDrive/notes/week11/apparel_modified.csv', 'row': 1})

In [None]:
new_examples = [x['qa_pairs'] for x in new_examples] # make the examples format match [{'query': , 'answer': }]

In [None]:
# we combine the examples together
examples = hardcoded_examples + new_examples
len(examples)

7

## 4.4 Manual Evaluation
Evaluation by eye.

In [None]:
qa.run(examples[0]["query"])

  warn_deprecated(




> Entering new RetrievalQA chain...

> Finished chain.


' Yes, the Zipped Jacket has an outer zipped chestpocket for easy storage.'

In [None]:
examples[0]['answer']

'Yes.'

The answer is correct!

In [None]:
# using debug mode: this way we see which docs were retrieved, what prompt was used in generation .. etc
import langchain
langchain.debug = True

In [None]:
qa.run(examples[1]["query"])

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What kind of sleeves does the Silk Summer Top have?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What kind of sleeves does the Silk Summer Top have?",
  "context": ": 13\nname: Silk Summer Top\ndescription: Silk womens top with short sleeves and number pattern.<<<<>>>>>: 14\nname: Long Sleeve Cotton Top\ndescription: Black cotton womens top, with long sleeves, no collar and a thick hem.<<<<>>>>>: 5\nname: Floral White Top\ndescription: Stylish sleeveless white top with a floral pattern.<<<<>>>>>: 6\nname: Striped Silk Blouse\ndescription: Ultra-stylish black and red striped silk blouse with buckle collar and matching button pants."
}
[32;

' Short sleeves.'

In [None]:
examples[1]['answer']

'It has short sleeves.'

In [None]:
langchain.debug = False

## 4.5 LLM assisted evaluation

Use an LLM to judge the answers.

In [None]:
# loop through all the examples

predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [None]:
predictions[0]

{'query': 'Does the Zipped Jacket have a chestpocket?',
 'answer': 'Yes.',
 'result': ' Yes, the Zipped Jacket has an outer zipped chestpocket for easy storage.'}

In [None]:
# import LangChain's specialized chain for judging QA
from langchain.evaluation.qa import QAEvalChain

llm = OpenAI(temperature=0, model=llm_model_name, openai_api_key = OPEN_AI_KEY)
eval_chain = QAEvalChain.from_llm(llm)

In [None]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [None]:
graded_outputs[0]

{'results': ' CORRECT'}

In [None]:
# let's see what is going on for each example

for i, eg in enumerate(examples):
  print(f"Example {i}:")
  print("Question: " + predictions[i]['query'])
  print("Real Answer: " + predictions[i]['answer'])
  print("Predicted Answer: " + predictions[i]['result'])
  print("Predicted Grade: " + graded_outputs[i]['results'])
  print()

Example 0:
Question: Does the Zipped Jacket have a chestpocket?
Real Answer: Yes.
Predicted Answer:  Yes, the Zipped Jacket has an outer zipped chestpocket for easy storage.
Predicted Grade:  CORRECT

Example 1:
Question: What kind of sleeves does the Silk Summer Top have?
Real Answer: It has short sleeves.
Predicted Answer:  Short sleeves.
Predicted Grade:  CORRECT

Example 2:
Question: What is the name of the shirt described in the document?
Real Answer: The name of the shirt is Ocean Blue Shirt.
Predicted Answer:  The name of the shirt is Red Sports Tee.
Predicted Grade:  INCORRECT

Example 3:
Question: What is the name of the top described in the document?
Real Answer: The name of the top is "Classic Varsity Top."
Predicted Answer:  The name of the top is "Striped Skirt and Top".
Predicted Grade:  INCORRECT

Example 4:
Question: What is the name of the jumper described in the document?
Real Answer: The name of the jumper is "Yellow Wool Jumper".
Predicted Answer:  The name of the j

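Our pipeline got two questions wrong, but that is basically because they were terrible questions: a question like "What is the name of the shirt described in the document?" does not say which document is meant, so the retriever has no way to find the right one.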
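To turn the grades into a single number, they can be aggregated into an accuracy score. The helpers below are hypothetical, and the sample list reproduces only the first four grades printed above. One detail to watch: "CORRECT" is a substring of "INCORRECT", so a naive `in` check would count every answer as correct.

```python
from typing import Dict, List


def is_correct(grade: str) -> bool:
    # compare the whole normalized label; "CORRECT" in "INCORRECT" would be True
    return grade.strip().upper() == "CORRECT"


def accuracy(graded: List[Dict[str, str]], key: str = "results") -> float:
    """Fraction of graded outputs whose label is exactly CORRECT."""
    if not graded:
        return 0.0
    return sum(is_correct(g[key]) for g in graded) / len(graded)


# the first four grades from the output above
sample_grades = [
    {"results": " CORRECT"},
    {"results": " CORRECT"},
    {"results": " INCORRECT"},
    {"results": " INCORRECT"},
]
score = accuracy(sample_grades)
print(f"accuracy = {score:.2f}")
```

On the real run this could be called as `accuracy(graded_outputs)`, assuming the grader keeps returning the `results` key with these two labels.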