# LangChain: Evaluation
### Outline:
* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [44]:
import os
import getpass
import openai
import time
import markdown
from IPython.display import display, Markdown

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:········


# Create our QandA application

In [80]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

#PDF directory loader
from langchain.document_loaders import PyPDFDirectoryLoader

In [81]:
# file = 'GlobalPowerPlantDB_USonly.csv'
pdf_folder_path = "/Users/markc/Hydrogen/"
loader = PyPDFDirectoryLoader(pdf_folder_path)
data = loader.load()

In [47]:
data[110]

Document(page_content=' Global Hydrogen Review 2022   \nPAGE | 42  Hydrogen  demand  \n \nStock of fuel cell electric vehicles exceeded 50 000 in 2021 \nFuel cell electric vehicle stock by segment and region, 2017- June 2022  \n \nIEA. All rights reserved.  \nNote: US = United States; RoW = rest of world.  \nSources: Advanced Fuel Cells Technology Collabor ation Programme ; California Fuel Cell Partnership ; International Partnership for Hydrogen and Fuel Cells in the Economy ; US \nDepartment of Energy Hydrogen and Fuel Cell Technologies Office; Korea’s Ministry of Trade, Industry and Energy monthly automobile updates ; Clean Energy Ministerial Hydrogen \nInitiative country surveys. 10203040506070\n2017 2018 2019 2020 2021 Jun-22Fuel cell electric vehicles \n(thousand)\nTotal Cars Buses Commercial vehicles2017 2018 2019 2020 2021 Jun-22\nKorea US China Japan Europe RoW', metadata={'source': '\\Users\\markc\\Hydrogen\\GlobalHydrogenReview2022_IEA.pdf', 'page': 41})

In [82]:
# Is this an efficient way to build the vector store?
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [84]:
# Why does chain_type not accept map_reduce?
# (Likely cause: document_separator below is an option of the "stuff" chain only)
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

# Coming up with test datapoints

In [19]:
data[41]

Document(page_content=' \n31  \nThis report is available at no cost from the National Renewable Energy Laboratory at www.nrel.gov/publications . Hydrogen Production Potential Summary  \nTable 5 summarizes energy inputs required to produce 1 kilogr am of hydrogen from each \nresource. Conversion pathways and production efficiencies, on an LHV basis, are shown for each \nresource . The required coal value has been updated from the 2013 Resource Report to be \nconsistent with the NETL (2010) state- of-the- art coal case (2.2 case study) production process \nenergy efficiency . Key conversion factors are indicated in  Table 6. We assume wind, solar , water \npower, and geothermal resources produce hydrogen at a rate of 51.3 kWh/kg hydrogen via central electrolysis (see H2A case study).  \nHydrogen production potential estimates for the nuclear pathway include updated assumptions about uranium use that are not included in the most recent H2A case studies. Fo r HTE , we \nassume a nominal bu

In [85]:
len(data)

1771

# LLM-Generated examples

In [86]:
# Four boxes below generate Q&A pair to evaluate model
from langchain.evaluation.qa import QAGenerateChain


In [87]:
# Pass in an OpenAI chat model for the chain to use
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [88]:
# Get back a list of dictionaries, one question/answer pair per document
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:10]]
)
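
The first ten pages of a PDF corpus tend to be title pages, blank pages, and tables of contents, which may explain generic questions such as "What is the title of the document?". A minimal sketch of an alternative, assuming pages are sampled at random and near-empty pages are skipped. `Page`, `sample_pages`, and the `min_chars` threshold are hypothetical names and values for illustration, not part of LangChain; `Page` only mimics the `page_content` and `metadata` attributes seen on the loaded documents.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a loaded document (assumption: only these two fields matter)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def sample_pages(docs, k, min_chars=500, seed=0):
    """Pick up to k pages at random, skipping pages with little text."""
    candidates = [d for d in docs if len(d.page_content.strip()) >= min_chars]
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Toy corpus: one blank page followed by content pages.
toy = [Page("This page intentionally left blank", {"page": 0})] + [
    Page("Hydrogen production text. " * 40, {"page": i}) for i in range(1, 20)
]
picked = sample_pages(toy, k=5)
print([p.metadata["page"] for p in picked])
```

In this notebook the same helper could be applied to `data`, for example `[{"doc": t} for t in sample_pages(data, k=10)]`, in place of `data[:10]`.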

In [89]:
new_examples[9]

{'query': 'What is Exhibit 3-19 in the document about?',
 'answer': 'Exhibit 3-19 in the document is about the MDEA CO2 capture process typical flow diagram for reforming plants.'}

In [90]:
examples = []
examples += new_examples

In [91]:
qa.run(examples[9]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'The context provided does not include any information about Exhibit 3-19.'
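
An answer like this one suggests the page the question was generated from never reached the prompt. One way to check is to compare the metadata of the retrieved chunks with the metadata of the source page. The sketch below assumes metadata dictionaries with `source` and `page` keys, as in the `Document` output earlier; `source_was_retrieved` and the file names are hypothetical.

```python
def source_was_retrieved(retrieved_metadata, source_metadata):
    """True if any retrieved chunk comes from the same file and page as the source."""
    key = (source_metadata.get("source"), source_metadata.get("page"))
    return any((m.get("source"), m.get("page")) == key for m in retrieved_metadata)


# Hypothetical metadata for illustration only.
source = {"source": "report_a.pdf", "page": 9}
retrieved = [
    {"source": "report_b.pdf", "page": 3},
    {"source": "report_a.pdf", "page": 120},
]
print(source_was_retrieved(retrieved, source))  # False: right file, wrong page
```

In the notebook, `source_metadata` would be `data[9].metadata` (assuming `new_examples` keeps the order of `data[:10]`), and `retrieved_metadata` would come from the documents the retriever returns for the same query.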

# Manual Evaluation

In [92]:
import langchain
langchain.debug = True

In [93]:
qa.run(examples[4]["query"])

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What is the purpose of section 2.7 in the document?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "What is the purpose of section 2.7 in the document?",
  "context": "demonstrations are discussed in Section 6. Section 7 discusses areas of consensus, disagreement, and uncertainty in \nthe present literature and summarizes remaining research questions. \n6 \nThis report is available at no cost from the National Renewable Energy Laboratory (NREL) at www.nrel.gov/publications<<<<>>>>>This page intentionally left blank<<<<>>>>>the standard is to provide a reasonable level of safety, property preservation, and publ ic \nwelfare from the hazards cr

"I'm sorry, but there is no mention of section 2.7 in the given context. The document only mentions sections 6 and 7."

In [94]:
langchain.debug = False
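
In the debug trace, the `context` field appears to be the retrieved page texts joined by the `document_separator` configured on the chain (`<<<<>>>>>`). Splitting on that separator recovers the individual chunks, which makes it easier to see what was retrieved. The sketch below mirrors the "stuff" strategy in plain Python; it is not LangChain's implementation, and `stuff_context` and `split_context` are hypothetical helper names.

```python
SEPARATOR = "<<<<>>>>>"  # same value passed as document_separator in the notebook


def stuff_context(page_texts, separator=SEPARATOR):
    """Join retrieved page texts into one context string."""
    return separator.join(page_texts)


def split_context(context, separator=SEPARATOR):
    """Recover the individual chunks from a stuffed context string."""
    return context.split(separator)


# Shortened excerpts from the trace above.
chunks = [
    "demonstrations are discussed in Section 6.",
    "This page intentionally left blank",
    "the standard is to provide a reasonable level of safety",
]
context = stuff_context(chunks)
for i, chunk in enumerate(split_context(context)):
    print(i, len(chunk), chunk[:40])
```

Here one of the retrieved chunks is a blank page, which is a hint that retrieval quality, not the LLM, is the weak point for this question.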

In [95]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [96]:
# QAEvalChain evaluates question answer pairs
from langchain.evaluation.qa import QAEvalChain

In [97]:
# Create above chain with language model. LLM will help do evaluation
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [98]:
# Get back graded outputs
graded_outputs = eval_chain.evaluate(examples, predictions)

In [99]:
# Everything printed below was produced by a language model
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: What is the title of the document?
Real Answer: The title of the document is "COMPARISON OF COMMERCIAL, STATE-OF-THE-ART, FOSSIL-BASED HYDROGEN PRODUCTION TECHNOLOGIES".
Predicted Answer: I'm sorry, but there is no clear title for the document based on the given context. The first line says "This page intentionally left blank" and the last line says "Article x", but there is no clear indication of what the article is about or what its title is.
Predicted Grade: CORRECT

Example 1:
Question: Who funded the project mentioned in the document?
Real Answer: The project was funded by the United States Department of Energy, National Energy Technology Laboratory, in part, through a site support contract.
Predicted Answer: There are two different projects mentioned in the document, one funded by the European Commission and another funded by the United States Department of Energy. Can you please specify which project you are referring to?
Predicted Grade: CORRECT

Example 2:
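
Both examples above are graded CORRECT even though the predicted answers say the information could not be found, so the grades should not be taken at face value. A minimal sketch for tallying grades and flagging "CORRECT" grades on answers that look like refusals, assuming the `result` and `text` keys used in the loop above. The `REFUSAL_MARKERS` phrases are a hypothetical heuristic, and the sample records are shortened from the output above plus one invented INCORRECT row.

```python
from collections import Counter

# Hypothetical phrases that suggest the model could not answer.
REFUSAL_MARKERS = ("i'm sorry", "no mention", "does not include", "no clear")


def summarize_grades(predictions, graded_outputs):
    """Count grades and list indices graded CORRECT that look like refusals."""
    counts = Counter(g["text"].strip().upper() for g in graded_outputs)
    suspicious = [
        i
        for i, (p, g) in enumerate(zip(predictions, graded_outputs))
        if g["text"].strip().upper() == "CORRECT"
        and any(marker in p["result"].lower() for marker in REFUSAL_MARKERS)
    ]
    return counts, suspicious


sample_predictions = [
    {"result": "I'm sorry, but there is no clear title for the document."},
    {"result": "There are two different projects mentioned in the document."},
    {"result": "The cost is 10 USD/kg."},  # invented row for illustration
]
sample_grades = [{"text": "CORRECT"}, {"text": "CORRECT"}, {"text": "INCORRECT"}]

counts, suspicious = summarize_grades(sample_predictions, sample_grades)
print(dict(counts), suspicious)
```

Flagged indices are candidates for manual review; the heuristic will miss refusals that use other wording.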

## Key Questions/Concerns for Dev

In [69]:
# What to do if predicted answers incorrect?
# QAEvalChain returned the same question multiple times
# Try the tracing UI (LangChain Plus) to see what is going on
#  under the hood
# Goal: build a flywheel of datapoints to learn from
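
One simple way to start such a flywheel is to append every evaluated example to a JSON Lines file so that results accumulate across runs. This is a sketch under assumptions: the record layout, the file name, and the helpers `append_records` and `load_records` are hypothetical, and the dictionary keys follow those used in the grading loop above.

```python
import json
import os
import tempfile


def append_records(path, examples, predictions, graded_outputs):
    """Append one JSON line per evaluated example."""
    with open(path, "a", encoding="utf-8") as f:
        for ex, pred, grade in zip(examples, predictions, graded_outputs):
            record = {
                "query": ex["query"],
                "answer": ex["answer"],
                "result": pred["result"],
                "grade": grade["text"],
            }
            f.write(json.dumps(record) + "\n")


def load_records(path):
    """Read all records back as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy data written to a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "eval_log.jsonl")
toy_examples = [{"query": "What is hydrogen?", "answer": "An element."}]
toy_predictions = [{"result": "Hydrogen is a chemical element."}]
toy_grades = [{"text": "CORRECT"}]

append_records(log_path, toy_examples, toy_predictions, toy_grades)
append_records(log_path, toy_examples, toy_predictions, toy_grades)
records = load_records(log_path)
print(len(records), records[0]["grade"])
```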



### Try querying the new vector store

In [34]:
query = "What system design results in the lowest hydrogen breakeven cost? \
What is that cost in USD/kg? Is this with or without subsidies? Use markdown to describe."

In [37]:
query2 = "What is hydrogen? \
Use markdown to describe."

In [42]:
query3 = "What is the largest potential end-use market for hydrogen? \
Use markdown to describe."

In [70]:
query4 = "How does hydrogen compare to alternatives for long-duration energy storage? \
Use markdown to describe."

In [74]:
query5 = "What are ideal conditions for hydrogen in use as long duration storage? \
Use markdown to describe in fewer than 800 words in bulleted list and list hydrogen's top competitors."

In [76]:
query6 = "You are an investor. What are ideal conditions and locations for hydrogen for use as long duration storage? \
Use markdown to describe in fewer than 800 words."

In [78]:
query7 = "Please describe how a PEM electrolyzer and alkaline electrolyzer work. \
Use markdown to describe in fewer than 800 words."

In [111]:
# Trying to gauge breadth and depth of model's knowledge

query8 = "Please tell me the most cost-effective ways to decarbonize with hydrogen and \
the applications which hydrogen is decarbonizing. Do this in markdown in a bulleted list."

In [109]:
query9 = "Please list the highest TRL hydrogen production methods and the range of their levelized cost of production in USD/kg. \
Do this in markdown in a bulleted list."

#Good response

In [113]:
query10 = "What clean energy technologies are most competitive with hydrogen? \
What technologies are complementary to hydrogen? Display results in markdown in a table."

#Poor response - probably due to knowledge base

In [114]:
start_time = time.time()
response = index.query(query10, llm=llm)
#response = qa_stuff.run(query)

run_time = time.time()-start_time
print(run_time)

display(Markdown(response))

2.178929567337036


| Most Competitive with Hydrogen | Complementary to Hydrogen |
|--------------------------------|---------------------------|
| Solutions that directly use electricity | Diverse and complementary energy networks |
|                                   | Flexible complement to other low-carbon energy technologies such as batteries and renewables |

In [36]:
display(Markdown(response))

According to the given context, the integration with electricity markets and cheap renewable power sources could help achieve low breakeven costs for electrolytic hydrogen. The lowest hydrogen breakeven cost could be achieved today via direct wholesale market participation, which is around $3/kg. However, direct wholesale access is currently prohibited in CAISO under state law, but it is permissible in other organized wholesale markets. The profitability of hydrogen production also depends on electrolyzer siting. The context does not mention whether this cost includes subsidies or not.
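
Only one query is timed per run above. To compare all of the queries in one pass, the timing can be wrapped in a small helper. The sketch below assumes any callable that maps a query string to a response string; `time_queries` and `fake_ask` are hypothetical names, and `fake_ask` stands in for the real call so the example runs without an API key.

```python
import time


def time_queries(ask, queries):
    """Run each query through `ask` and record the elapsed wall-clock time."""
    rows = []
    for q in queries:
        start = time.perf_counter()
        response = ask(q)
        elapsed = time.perf_counter() - start
        rows.append({"query": q, "seconds": elapsed, "response": response})
    return rows


def fake_ask(query):
    """Stand-in for the real query call (assumption for illustration)."""
    return f"(answer to: {query})"


rows = time_queries(fake_ask, ["What is hydrogen?", "What is an electrolyzer?"])
for row in rows:
    print(f"{row['seconds']:.4f}s  {row['query']}")
```

In the notebook, `ask` could be `lambda q: index.query(q, llm=llm)`, with the list built from `query` through `query10`.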

#### Key Questions/Concerns for Dev

In [None]:
# Getting unclear answers when combining all four reports
# Does not seem to be correctly parsing through the data
# Can it not read tables?
# Why can't it generate more accurate answers?

# QAGenerateChain.from_llm -> generates same questions. How to get higher fidelity questions?
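
Until the generation step itself is improved, repeated questions can at least be filtered out before evaluation. A minimal sketch, assuming examples are dictionaries with a `query` key as returned earlier; `normalize` and `dedupe_examples` are hypothetical helpers, and the normalization (lowercase, punctuation removed) only catches near-identical wording.

```python
import re


def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", question.lower())
    return " ".join(cleaned.split())


def dedupe_examples(examples):
    """Keep the first example for each normalized question."""
    seen = set()
    unique = []
    for ex in examples:
        key = normalize(ex["query"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


toy_examples = [
    {"query": "What is the title of the document?", "answer": "A"},
    {"query": "what is the title of the document", "answer": "B"},
    {"query": "Who funded the project mentioned in the document?", "answer": "C"},
]
unique_examples = dedupe_examples(toy_examples)
print([ex["query"] for ex in unique_examples])
```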