# LangChain: Evaluation
### Outline:
* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
import os
import getpass
import openai
import time
import markdown
from IPython.display import display, Markdown

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:········


# Create our QandA application

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

#PDF directory loader
from langchain.document_loaders import PyPDFDirectoryLoader

In [3]:
# file = 'GlobalPowerPlantDB_USonly.csv'
pdf_folder_path = "/Users/markc/Hydrogen/"
loader = PyPDFDirectoryLoader(pdf_folder_path)
data = loader.load()

In [4]:
data[110]

Document(page_content='v  LIST OF EXHIBITS  \nExhibit  ES-1. Case configuration summary  ................................ ................................ ...............  2 \nExhibit  ES-2. Performance summary and environmental profile for all cases  ......................  6 \nExhibit  ES-3. Net plant efficiency (HHV basis)  ................................ ................................ ..........  7 \nExhibit ES -4. CO 2e life cycle emissions for all cases  ................................ ................................ . 9 \nExhibit  ES-5. Cost summary for all cases  ................................ ................................ .................  11 \nExhibit  ES-6. LCOH by cost component  ................................ ................................ .................  12 \nExhibit 1 -1. Global merchant hydrogen fleet production route profile (by supplier)  ....... 16 \nExhibit 1-2. Global merchant hydrogen fleet capacity profile (by supplier)  .....................  1

In [5]:
# Is this an efficient way to build the vectorstore?
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [6]:
# Why does chain_type not accept map_reduce?
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
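The `"stuff"` chain type places every retrieved chunk into a single prompt, joined by `document_separator`. That is why the debug trace further down shows `<<<<>>>>>` between chunks in the `context` field. Below is a minimal pure-Python sketch of that joining step. It is not LangChain's implementation; `stuff_documents` and the sample chunk texts are hypothetical.

```python
def stuff_documents(page_contents, document_separator="\n\n"):
    """Join retrieved chunk texts into one context string, 'stuff'-style."""
    return document_separator.join(page_contents)


# Hypothetical chunk texts standing in for retrieved Document.page_content values
chunks = ["Chunk about electrolyzers.", "Chunk about fueling stations."]
context = stuff_documents(chunks, document_separator="<<<<>>>>>")
print(context)
```

Because `document_separator` belongs to this stuff-style joining step, passing it in `chain_type_kwargs` may be rejected by other chain types such as `map_reduce`. This is an assumption worth checking against the installed LangChain version.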

# Coming up with test datapoints

In [7]:
data[41]

Document(page_content='20\nAnnual Evaluation of Fuel Cell Electric Vehicle Deployment and Hydrogen Fuel Station Network DevelopmenttaBle 2: S tatiOn  netwOrk  and regiStered  FCev S witH reSpeCt  tO CluSter  deFintiOnS\nClusterNumber \nof Planned \nStations in \nClusterPlanned \nCapacity in \nCluster  \n(kg/day)Percent of \nPlanned \nStationsPercent of \nPlanned \nCapacityPercent \nof FCEV \nRegistrations  \nin Cluster\nExpanded Network 68 56,754 62% 64% 60%\nSouth San Francisco/ \nBay Area14 10,299 13% 12% 11%\nCoastal/South Orange County12 12,850 11% 15% 17%\nTorrance 6 2,066 5% 2% 5%\nBerkeley 6 4,903 5% 6% 2%\nWest Los Angeles/ Santa Monica4 1,396 4% 2% 5%\nAnalysis of Future On-The-Road FCEVs\nProjections of future on-the-road FCEVs incorporate both the DMV registration data and auto \nmanufacturer responses to the annual survey issued by CARB. CARB staff adjust submitted survey responses in three ways. First, CARB staff translate the responses provided in terms of model year into

In [8]:
len(data)

3356

# LLM-Generated examples

In [9]:
# The next four cells generate Q&A pairs used to evaluate the model
from langchain.evaluation.qa import QAGenerateChain


In [10]:
# Pass in an OpenAI language model for the chain to use
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [13]:
# Get back a list of question/answer dictionaries to evaluate
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:10]]
)
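The cell above uses only `data[:10]`, the first ten of the loaded pages, which are likely front matter. This may explain why the generated questions below ask about the report title and reviewers. One option is to sample pages from across the whole corpus instead. The sketch below is a stdlib-only illustration; `sample_pages` is a hypothetical helper and the `pages` list stands in for `data`.

```python
import random


def sample_pages(pages, k, seed=0):
    """Pick k pages uniformly at random (reproducibly) instead of the first k."""
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))


# Hypothetical stand-in for the 3356 loaded pages
pages = [f"page {i}" for i in range(3356)]
picked = sample_pages(pages, 10)

# Same input shape as the cell above: one {"doc": ...} dict per page
inputs = [{"doc": p} for p in picked]
print(len(inputs))
```

A fixed seed keeps the evaluation set stable between runs, so changes in grades reflect changes to the chain rather than to the questions.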

In [14]:
new_examples[9]

{'query': 'How many new hydrogen fueling stations achieved Open-Retail status in California since the 2021 Annual Evaluation was published?',
 'answer': 'Eight new hydrogen fueling stations achieved Open-Retail status in California since the 2021 Annual Evaluation was published.'}

In [15]:
examples = []
examples += new_examples

In [16]:
qa.run(examples[9]["query"])


> Entering new RetrievalQA chain...

> Finished chain.


"Since the 2021 Annual Evaluation was published, a total of 8 new Open-Retail hydrogen fueling stations have been added to California's network, 6 of which opened in 2022."

# Manual Evaluation

In [17]:
import langchain
langchain.debug = True

In [18]:
qa.run(examples[4]["query"])

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What is the source of the document and on which page can this information be found?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "What is the source of the document and on which page can this information be found?",
  "context": "information. Web posting. Summaries. \nVerDate Sep 11 2014 12:25 Dec 28, 2021 Jkt 029139 PO 00058 Frm 00314 Fmt 6580 Sfmt 6581 E:\\PUBLAW\\PUBL058.117 PUBL058whamilton on LAPJF8D0R2PROD with PUBLAW<<<<>>>>>information. \nVerDate Sep 11 2014 12:25 Dec 28, 2021 Jkt 029139 PO 00058 Frm 00174 Fmt 6580 Sfmt 6581 E:\\PUBLAW\\PUBL058.117 PUBL058whamilton on LAPJF8D0R2PROD with PUBLAW<<<<>>>>>x  \n \n \n \n \n \n \n \n \n 

'The source of the document is not provided and the information on the page numbers is incomplete as well. The document contains multiple pages and the given page numbers are not sufficient to locate the information.'

In [19]:
langchain.debug = False

In [20]:
predictions = qa.apply(examples)


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [21]:
# QAEvalChain evaluates question answer pairs
from langchain.evaluation.qa import QAEvalChain

In [22]:
# Create the evaluation chain from a language model; the LLM does the grading
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [23]:
# Get back graded outputs
graded_outputs = eval_chain.evaluate(examples, predictions)

In [24]:
# Everything printed below was produced by the language model
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: What is the title of the report mentioned in this document?
Real Answer: The title of the report mentioned in this document is "2022 Annual Evaluation of Fuel Cell Electric Vehicle Deployment and Hydrogen Fuel Station Network Development (Report Pursuant to Assembly Bill 8; Perea, Chapter 401, Statutes of 2013)."
Predicted Answer: I'm sorry, but the title of the report is not mentioned in this document.
Predicted Grade: CORRECT

Example 1:
Question: Who reviewed and approved the report for publication?
Real Answer: The staff of the California Air Resources Board (CARB) reviewed and approved the report for publication.
Predicted Answer: The report was reviewed by a panel of expert peer reviewers, including Professor Jack Brouwer, Professor Michael Kuby, Dr. Marc Melaina, Professor Joan Ogden, and Michael Penev, MBA, as well as several other reviewers listed in the context. However, it is not specified who approved the report for publication.
Predicted Grade: INCORRE
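Reading grades one example at a time gets slow as the example set grows, so it can help to tally them. The sketch below assumes each graded output is a dict with a `'text'` key holding the grade, as in the loop above. `summarize_grades` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="text"):
    """Count grades and return (counts, fraction graded CORRECT)."""
    grades = [g[key].strip().upper() for g in graded_outputs]
    counts = Counter(grades)
    accuracy = counts["CORRECT"] / len(grades) if grades else 0.0
    return counts, accuracy


# Hypothetical graded outputs, shaped like the list returned by eval_chain.evaluate
sample = [
    {"text": "CORRECT"},
    {"text": "INCORRECT"},
    {"text": " correct"},
    {"text": "INCORRECT"},
]
counts, accuracy = summarize_grades(sample)
print(counts["CORRECT"], counts["INCORRECT"], accuracy)
```

The tally is only as reliable as the grader. Example 0 above was graded CORRECT even though the predicted answer says the title is not mentioned, so spot-checking a few grades by hand is still worthwhile.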

## Key Questions/Concerns for Dev

In [69]:
# What should we do when predicted answers are incorrect?
# QAEvalChain returned the same question multiple times
# Try the UI that tracks what is going on under the hood
#  (from LangChain Plus)
# -- build a flywheel of datapoints to learn from



### Try to query this new vectorstore database

In [34]:
query = "What system design results in the lowest hydrogen breakeven cost? \
What is that cost in USD/kg? Is this with or without subsidies? Use markdown to describe."

In [37]:
query2 = "What is hydrogen? \
Use markdown to describe."

In [42]:
query3 = "What is the largest potential end-use market for hydrogen? \
Use markdown to describe."

In [70]:
query4 = "How does hydrogen compare to alternatives for long-duration energy storage? \
Use markdown to describe."

In [74]:
query5 = "What are ideal conditions for hydrogen in use as long duration storage? \
Use markdown to describe in fewer than 800 words in bulleted list and list hydrogen's top competitors."

In [76]:
query6 = "You are an investor. What are ideal conditions and locations for hydrogen for use as long duration storage? \
Use markdown to describe in fewer than 800 words."

In [78]:
query7 = "Please describe how a PEM electrolyzer and alkaline electrolyzer work. \
Use markdown to describe in fewer than 800 words."

In [111]:
# Trying to gauge breadth and depth of model's knowledge

query8 = "Please tell me the most cost-effective ways to decarbonize with hydrogen and \
the applications which hydrogen is decarbonizing. Do this in markdown in a bulleted list."

In [109]:
query9 = "Please list the highest TRL hydrogen production methods and the range of their levelized cost of production in USD/kg. \
Do this in markdown in a bulleted list."

#Good response

In [113]:
query10 = "What clean energy technologies are most competitive with hydrogen? \
What technologies are complementary to hydrogen? Display results in markdown in a table."

#Poor response - probably due to knowledge base

In [31]:
query11 = "What are the main hydrogen provisions in the Clean Hydrogen Act? \
Create a bulleted list with short description. Display results in markdown in a table."

#Poor response - probably due to knowledge base

In [32]:
start_time = time.time()
response = index.query(query11, llm=llm)
#response = qa_stuff.run(query)

run_time = time.time()-start_time
print(run_time)

display(Markdown(response))

11.703331708908081


| Provision | Description |
| --- | --- |
| Definition of Clean Hydrogen | Provides a statutory definition for the term "clean hydrogen" |
| Clean Hydrogen Strategy and Roadmap | Establishes a clean hydrogen strategy and roadmap for the United States |
| Clean Hydrogen Clearing House | Establishes a clearing house for clean hydrogen program information at the National Energy Technology Laboratory |
| Clean Hydrogen Supply Chain | Develops a robust clean hydrogen supply chain |
| Clean Hydrogen Research and Development Program | Establishes a clean hydrogen research and development program |
| Clean Hydrogen Production Standard | Establishes a standard of hydrogen production that achieves the standard developed under section 822(a), including interim goals towards meeting that standard |
| Clean Hydrogen Production and Use | Identifies clean hydrogen production and use from natural gas, coal, renewable energy sources, nuclear energy, and biomass |
| Transition to a Clean Hydrogen Economy | Identifies potential barriers, pathways, and opportunities, including Federal policy needs, to transition to a clean hydrogen economy |
| Economic Opportunities for Clean Hydrogen | Identifies economic opportunities for the production, processing, transport, storage, and use of clean hydrogen that exist in the major shale natural gas-producing regions of the United States and for merchant nuclear power plants operating in deregulated markets |
| Environmental Risks | Identifies environmental risks associated with potential clean hydrogen production and use |
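The timing pattern in the cell above can be wrapped in a small helper so each query is measured the same way. This is a sketch only: `timed` and `fake_query` are hypothetical, and `fake_query` stands in for a call such as `index.query(query11, llm=llm)`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_query(question):
    """Hypothetical stand-in for a real retrieval query."""
    return f"answer to: {question}"


response, seconds = timed(fake_query, "What is hydrogen?")
print(f"{seconds:.3f} s")
```

`time.perf_counter()` is used here rather than `time.time()` because it is a monotonic clock intended for measuring short intervals.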

In [36]:
display(Markdown(response))

According to the given context, the integration with electricity markets and cheap renewable power sources could help achieve low breakeven costs for electrolytic hydrogen. The lowest hydrogen breakeven cost could be achieved today via direct wholesale market participation, which is around $3/kg. However, direct wholesale access is currently prohibited in CAISO under state law, but it is permissible in other organized wholesale markets. The profitability of hydrogen production also depends on electrolyzer siting. The context does not mention whether this cost includes subsidies or not.

####  Key Question/Concerns for Dev

In [None]:
# Getting unclear answers when combining all four reports
# Does not seem to be correctly parsing through the data
# Can it not read tables?
# Why can't it generate more accurate answers?

# QAGenerateChain.from_llm -> generates same questions. How to get higher fidelity questions?
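One partial mitigation for repeated generated questions is to drop near-identical ones before evaluation. The sketch below assumes each example is a dict with `'query'` and `'answer'` keys, as shown earlier. `dedupe_examples` is a hypothetical helper. It only filters duplicates and does not by itself produce higher-fidelity questions.

```python
import re


def dedupe_examples(examples):
    """Drop examples whose normalized 'query' was already seen, keeping the first."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize: lowercase, then collapse punctuation and whitespace runs
        key = re.sub(r"[^a-z0-9]+", " ", ex["query"].lower()).strip()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Hypothetical generated examples containing one near-duplicate question
generated = [
    {"query": "What is the title of the report?", "answer": "A"},
    {"query": "what is the title of the report", "answer": "B"},
    {"query": "Who approved the report?", "answer": "C"},
]
unique_examples = dedupe_examples(generated)
print([ex["answer"] for ex in unique_examples])
```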