# LangChain: Evaluation
### Outline:
* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [30]:
import os
import getpass
import openai
import time
import markdown
from IPython.display import display, Markdown

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:········


# Create our QandA application

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

#PDF directory loader
from langchain.document_loaders import PyPDFDirectoryLoader

In [3]:
# file = 'GlobalPowerPlantDB_USonly.csv'
pdf_folder_path = "/Users/markc/Relevant Research/10Ks/NVIDIA"
# C:/Users/markc/Relevant Research/10Ks/NVIDIA
loader = PyPDFDirectoryLoader(pdf_folder_path)
data = loader.load()

In [4]:
data[4]

Document(page_content=' \n \n   \nGPU Business  \n   \nOur GPU business is comprised primarily of our GeFo rce discrete and chipset products that support desk top and notebook PCs, plus memory \nproducts. Our GPU business is focused on Microsoft Windows and Apple PC platforms.  GeForce GPUs power  PCs made by or distributed by most \nPC OEMs in the world for desktop PCs, notebook PCs,  and PCs loaded with Windows Media Center and other  media extenders such as the Apple \nTV.  GPUs enhance the user experience for playing v ideo games, editing photos, viewing and editing vid eos and high-definition, or HD, movies.  \n   \nWe believe we are in an era where visual computing is becoming increasingly important to consumers and  other end users of our \nproducts. Our strategy is to promote the GeForce br and as one of the most important processors due to its technology leadership, increasing \nprogrammability, and impressive content experience it enables.  In fiscal year 2011, our strategy w
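The page above shows typical PDF-extraction noise: runs of blank lines and doubled spaces. Below is a minimal sketch of a cleanup pass that could run on each page's text before indexing. `normalize_whitespace` is a hypothetical helper, not part of LangChain, and it cannot repair words that the extractor split mid-token (such as "GeFo rce").

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

# Shortened, made-up sample in the style of the page shown above
sample = " \n \n   \nGPU Business  \n   \nOur GPU business is focused on PC platforms.  "
print(normalize_whitespace(sample))
```

Collapsing newlines also discards paragraph breaks, so this may or may not help retrieval quality; it is worth comparing answers with and without it.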

In [5]:
# Efficient vectorstore method?
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [6]:
# Why does chain_type not accept "map_reduce"?
# (Possibly because the document_separator option below is specific to the "stuff" chain.)
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
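For intuition, the "stuff" chain type places all retrieved documents into a single prompt, joined by `document_separator`. Below is a rough sketch of that joining step in plain Python. `stuff_context` is a hypothetical stand-in for illustration only; the real chain also formats each document with a prompt template before joining.

```python
def stuff_context(page_texts, separator="<<<<>>>>>"):
    """Join retrieved page texts into one context string (hypothetical helper)."""
    return separator.join(page_texts)

# Placeholder texts standing in for retrieved pages
retrieved = ["GPU Business ...", "Warranty Liabilities ..."]
context = stuff_context(retrieved)
print(context)
```

A distinctive separator makes it easier to see where one retrieved page ends and the next begins when reading the prompt in debug output.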

# Coming up with test datapoints

In [7]:
data[41]

Document(page_content=' \n   \nAs of January 30, 2011, our inventory reserve was $ 152.0 million. As a percentage of our gross invento ry balance, our inventory reserve has \nranged between 15.0% and 30.6% during fiscal years 2011 and 2010. As of January 30, 2011, our inventor y reserve represented 30.6% of our gross \ninventory balance.  \n   \nWarranty Liabilities  \n \nCost of revenue includes the estimated cost of prod uct warranties that are calculated at the point of revenue recognition. Under limited \ncircumstances, we may offer an extended limited war ranty to customers for certain products.  Our produ cts are complex and may contain defects or \nexperience failures due to any number of issues in design, fabrication, packaging, materials and/or us e within a system. If any of our products or \ntechnologies contains a defect, compatibility issue  or other error, we may have to invest additional r esearch and development efforts to find and \ncorrect the issue.  Such efforts cou

In [8]:
len(data)

1759

# LLM-Generated examples

In [9]:
# The next four cells generate Q&A pairs to evaluate the model
from langchain.evaluation.qa import QAGenerateChain


In [10]:
# Pass in an OpenAI language model for the chain to use
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [11]:
# Get back a list of question/answer dictionaries to evaluate
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:10]]
)

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')).


In [12]:
new_examples[9]

{'query': 'What methods does NVIDIA rely on to protect its intellectual property?',
 'answer': 'NVIDIA primarily relies on a combination of patents, trademarks, trade secrets, employee and third-party nondisclosure agreements, and licensing arrangements to protect its intellectual property in the United States and internationally. They also rely on international treaties, organizations, and foreign laws to protect their intellectual property.'}
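The generation cell above only uses `data[:10]`, i.e. the first ten of the 1759 loaded pages. One possible way to get more varied questions is to sample pages spread across the whole corpus. Below is a minimal sketch; `evenly_spaced_indices` is a hypothetical helper, and whether this improves question quality here is untested.

```python
def evenly_spaced_indices(total: int, k: int) -> list:
    """Return up to k indices spread evenly across range(total)."""
    if total <= 0 or k <= 0:
        return []
    k = min(k, total)
    # Integer arithmetic avoids floating-point rounding surprises
    return [i * total // k for i in range(k)]

# 1759 pages, 10 samples
indices = evenly_spaced_indices(1759, 10)
print(indices)
```

The resulting indices could then select pages, for example `[{"doc": data[i]} for i in indices]`, in place of the `data[:10]` slice.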

In [13]:
examples = []
examples += new_examples

In [14]:
qa.run(examples[9]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'NVIDIA relies primarily on a combination of patents, trademarks, trade secrets, employee and third-party nondisclosure agreements, licensing arrangements, and the laws of the countries in which they operate to protect their intellectual property in the United States and internationally. They also continuously assess whether and where to seek formal protection for existing and new innovations and technologies. However, the laws of certain foreign countries may not protect their products or intellectual property rights to the same extent as the laws of the United States, which makes the possibility of piracy of their technology and products more likely.'

# Manual Evaluation

In [15]:
import langchain
langchain.debug = True

In [16]:
qa.run(examples[4]["query"])

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What are the two main components of NVIDIA's GPU business and what platforms are they focused on?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What are the two main components of NVIDIA's GPU business and what platforms are they focused on?",
  "context": "Table of ContentsOur Businesses\nOur\n two reportable segments - GPU and Tegra Processor - are based on a single underlying architecture. From our proprietary processors, we have createdplatforms that address \nfour large markets where our expertise is critical: Gaming, Professional Visualization, Datacenter, and Automotive.Businesses\n  NVIDIA Visual Computing and Accelerated Computing 

[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] [3.00s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "NVIDIA's GPU business has two main components: GPU and Tegra Processor. The GPU segment is focused on four large markets: Gaming, Professional Visualization, Datacenter, and Automotive. The GPU product brands include GeForce for PC gaming, Quadro for design professionals, Tesla for AI utilizing deep learning and accelerated computing, and GRID to provide the power of NVIDIA graphics through the cloud and data centers. The Tegra Processor segment is primarily designed to enable branded platforms - DRIVE PX and SHIELD.",
        "generation_info": null,
        "message": {
          "content": "NVIDIA's GPU business has two main components: GPU and Tegra Processor. The GPU segment is focused on four large markets: Gaming, Professional Visualization, Datacenter, and Automotive.

"NVIDIA's GPU business has two main components: GPU and Tegra Processor. The GPU segment is focused on four large markets: Gaming, Professional Visualization, Datacenter, and Automotive. The GPU product brands include GeForce for PC gaming, Quadro for design professionals, Tesla for AI utilizing deep learning and accelerated computing, and GRID to provide the power of NVIDIA graphics through the cloud and data centers. The Tegra Processor segment is primarily designed to enable branded platforms - DRIVE PX and SHIELD."

In [17]:
langchain.debug = False

# LLM-assisted evaluation

In [18]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [19]:
# QAEvalChain grades question/answer pairs
from langchain.evaluation.qa import QAEvalChain

In [20]:
# Create the evaluation chain with a language model; the LLM does the grading
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [21]:
# Get back graded outputs
graded_outputs = eval_chain.evaluate(examples, predictions)

In [22]:
# Everything printed below was produced by the language model
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: What is the exact name and address of the registrant specified in the document?
Real Answer: The exact name of the registrant specified in the document is NVIDIA CORPORATION and the address is 2701 San Tomas Expressway, Santa Clara, California 95050.
Predicted Answer: The exact name of the registrant specified in the document is NVIDIA CORPORATION and the address is 2701 San Tomas Expressway, Santa Clara, California 95050.
Predicted Grade: CORRECT

Example 1:
Question: What was the aggregate market value of the voting stock held by non-affiliates of the registrant as of August 1, 2010 and how was it calculated?
Real Answer: The aggregate market value of the voting stock held by non-affiliates of the registrant as of August 1, 2010 was approximately $5.0 billion. This was calculated based on the closing sales price of the registrant's common stock as reported by the NASDAQ Global Select Market on July 30, 2010, and excludes approximately 26,130,043 shares held by di
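Reading grades one by one gets slow as the example set grows. Below is a small sketch for tallying them and collecting the indices of anything not graded CORRECT. It assumes each graded output is a dict with a `'text'` key holding the grade label, as in the loop above; `summarize_grades` is a hypothetical helper, and the demo grades are made up rather than taken from this run.

```python
from collections import Counter

def summarize_grades(graded_outputs):
    """Count grade labels and collect indices of examples not graded CORRECT."""
    labels = [g["text"].strip().upper() for g in graded_outputs]
    counts = Counter(labels)
    failed = [i for i, label in enumerate(labels) if label != "CORRECT"]
    return counts, failed

# Made-up grades for illustration only, not results from this notebook
demo = [{"text": "CORRECT"}, {"text": " INCORRECT"}, {"text": "CORRECT"}]
counts, failed = summarize_grades(demo)
print(counts["CORRECT"], "correct; indices to review:", failed)
```

The `failed` indices point back into `examples` and `predictions`, which gives a short list of cases to inspect manually with `langchain.debug = True`.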

## Key Questions/Concerns for Dev

In [69]:
# What to do if predicted answers are incorrect?
# QAEvalChain returned the same question multiple times
# Try to access the UI that tracks what is going on
#  under the hood (from LangChain Plus)
# Goal: generate a flywheel of datapoints to learn from
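One way to handle repeated questions is to drop duplicates before evaluation. Below is a minimal sketch, assuming each example is a dict with a `'query'` key as in the generated examples earlier. `dedupe_examples` is a hypothetical helper; it only catches questions that match after lowercasing and whitespace normalization, not paraphrases.

```python
def dedupe_examples(examples):
    """Keep the first example for each distinct question (case- and whitespace-insensitive)."""
    seen = set()
    unique = []
    for ex in examples:
        key = " ".join(ex["query"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

# Made-up examples for illustration
demo_examples = [
    {"query": "What is the registrant's name?", "answer": "NVIDIA CORPORATION"},
    {"query": "what is the  registrant's name?", "answer": "NVIDIA CORPORATION"},
    {"query": "How is inventory reserved?", "answer": "..."},
]
print(len(dedupe_examples(demo_examples)))
```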



### Try to query this new vectorstore database

In [23]:
query = "Who are NVIDIA's main competitors, suppliers, and customers? \
Use a markdown bulleted list to describe."

In [25]:
query2 = "List and briefly describe NVIDIA's top strategic initiatives over the past 10 years."

In [27]:
query3 = "What has been the most challenging strategic initiative for NVIDIA? \
Use markdown to describe."

In [70]:
query4 = "How does hydrogen compare to alternatives for long-duration energy storage? \
Use markdown to describe."

In [74]:
query5 = "What are ideal conditions for hydrogen in use as long duration storage? \
Use markdown to describe in fewer than 800 words in bulleted list and list hydrogen's top competitors."

In [76]:
query6 = "You are an investor. What are ideal conditions and locations for hydrogen for use as long duration storage? \
Use markdown to describe in fewer than 800 words."

In [78]:
query7 = "Please describe how a PEM electrolyzer and alkaline electrolyzer work. \
Use markdown to describe in fewer than 800 words."

In [111]:
# Trying to gauge breadth and depth of model's knowledge

query8 = "Please tell me the most cost-effective ways to decarbonize with hydrogen and \
the applications which hydrogen is decarbonizing. Do this in markdown in a bulleted list."

In [109]:
query9 = "Please list the highest TRL hydrogen production methods and the range of their levelized cost of production in USD/kg. \
Do this in markdown in a bulleted list."

#Good response

In [113]:
query10 = "What clean energy technologies are most competitive with hydrogen? \
What technologies are complementary to hydrogen? Display results in markdown in a table."

#Poor response - probably due to knowledge base

In [31]:
query11 = "What are the main hydrogen provisions in the Clean Hydrogen Act? \
Create a bulleted list with short description. Display results in markdown in a table."

#Poor response - probably due to knowledge base

In [31]:
start_time = time.time()
response = index.query(query, llm=llm)
#response = qa_stuff.run(query)

run_time = time.time()-start_time
print(run_time)

display(Markdown(response))

KeyboardInterrupt: 
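The timing cell above only runs `query`. Below is a small sketch of a reusable timing wrapper so that each of the queries defined earlier could be timed the same way. `timed_call` and `fake_query` are hypothetical names; `fake_query` stands in for `index.query` so the sketch runs without an API key.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_query(question):
    """Stand-in for index.query; returns a canned markdown bullet."""
    return f"- answer to: {question}"

response, seconds = timed_call(fake_query, "Who are NVIDIA's main competitors?")
print(f"{seconds:.3f}s")
print(response)
```

In the notebook the call would presumably look like `timed_call(index.query, query, llm=llm)`, mirroring the cell above.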

In [36]:
display(Markdown(response))

According to the given context, the integration with electricity markets and cheap renewable power sources could help achieve low breakeven costs for electrolytic hydrogen. The lowest hydrogen breakeven cost could be achieved today via direct wholesale market participation, which is around $3/kg. However, direct wholesale access is currently prohibited in CAISO under state law, but it is permissible in other organized wholesale markets. The profitability of hydrogen production also depends on electrolyzer siting. The context does not mention whether this cost includes subsidies or not.

#### Key Questions/Concerns for Dev

In [None]:
# Getting unclear answers when combining all four reports
# Does not seem to be correctly parsing through the data
# Can it not read tables?
# Why can't it generate more accurate answers?

# QAGenerateChain.from_llm generates the same questions repeatedly. How to get higher-fidelity questions?