# LangChain: Evaluation
### Outline:
* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [75]:
import os
import getpass
import openai
import time
import markdown
from IPython.display import display, Markdown

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:········


# Create our QandA application

In [76]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

#PDF directory loader
from langchain.document_loaders import PyPDFDirectoryLoader

In [77]:
# file = 'GlobalPowerPlantDB_USonly.csv'
pdf_folder_path = "/Users/markc/Hydrogen/"
loader = PyPDFDirectoryLoader(pdf_folder_path)
data = loader.load()

In [4]:
data[110]

Document(page_content='30 \nThis report is available at no cost from the National Renewable Energy Laboratory at www.nrel.gov/publications.  Appendix A. Hydrogen Breakeven Cost  Sensitivities  \n \nFigure A -1. Estimated 2019 hydrogen breakeven cost and optimal electrolyzer operation for the \n“direct wholesale market participation in CAISO” pathway and minimum electrolyzer costs \n(sensitivity) . EY denotes electrolyzer , and CF denotes ca pacity factor. The capacity of the electrolyzer is \n50 MW.  \n \nFigure A -2. Estimated 2019 hydrogen breakeven cost and optimal electrolyzer operation for the \n“direct wholesale market participation in CAISO” pathway and maximum electrolyzer costs \n(sensitivity) . EY denotes electrolyzer , and CF denotes capacity factor.  The capacity of the electrolyzer is \n50 MW.  \n', metadata={'source': '\\Users\\markc\\Hydrogen\\Guerra_HydrogenIntegration.pdf', 'page': 41})

In [78]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [79]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
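The `"stuff"` chain type places every retrieved chunk into a single prompt, and `document_separator` controls the string placed between chunks. The sketch below is a plain-Python illustration of that idea only: `stuff_documents`, `build_prompt`, and the prompt wording are hypothetical names and text, not LangChain's internals.

```python
# Illustrative sketch: mimics how a "stuff"-style chain could assemble its prompt.
# The helper names and the prompt template are assumptions, not LangChain code.

def stuff_documents(page_contents, separator="<<<<>>>>>"):
    """Join the retrieved chunks into one context string."""
    return separator.join(page_contents)


def build_prompt(question, page_contents, separator="<<<<>>>>>"):
    """Place the joined context and the question into a single prompt."""
    context = stuff_documents(page_contents, separator)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = ["BAU  business as usual", "CSP concentrating solar power"]
prompt = build_prompt('What does "BAU" stand for?', chunks)
print(prompt)
```

A distinctive separator such as `<<<<>>>>>` makes chunk boundaries easy to spot when reading the prompt in debug output.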

# Coming up with test datapoints

In [80]:
data[41]

Document(page_content=' \n31  \nThis report is available at no cost from the National Renewable Energy Laboratory at www.nrel.gov/publications . Hydrogen Production Potential Summary  \nTable 5 summarizes energy inputs required to produce 1 kilogr am of hydrogen from each \nresource. Conversion pathways and production efficiencies, on an LHV basis, are shown for each \nresource . The required coal value has been updated from the 2013 Resource Report to be \nconsistent with the NETL (2010) state- of-the- art coal case (2.2 case study) production process \nenergy efficiency . Key conversion factors are indicated in  Table 6. We assume wind, solar , water \npower, and geothermal resources produce hydrogen at a rate of 51.3 kWh/kg hydrogen via central electrolysis (see H2A case study).  \nHydrogen production potential estimates for the nuclear pathway include updated assumptions about uranium use that are not included in the most recent H2A case studies. Fo r HTE , we \nassume a nominal bu

In [81]:
len(data)

391

# LLM-Generated examples

In [82]:
# The next four cells generate Q&A pairs used to evaluate the model
from langchain.evaluation.qa import QAGenerateChain


In [83]:
# Pass in the OpenAI language model that the chain will use
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [89]:
# Get back a list of question/answer dictionaries to evaluate
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

In [92]:
new_examples[4]

{'query': 'What does the acronym "BAU" stand for in the document?',
 'answer': '"BAU" stands for "business as usual" in the document.'}
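Each generated example is a dictionary with `query` and `answer` keys. As a rough sketch of what the "parse" step could look like, the code below assumes the model replies in a `QUESTION: ... ANSWER: ...` layout; that layout, the regular expression, and the `parse_qa` helper are assumptions for illustration, not a description of the library's parser.

```python
import re

# Assumed raw reply layout: "QUESTION: ...\nANSWER: ..." (hypothetical).
QA_PATTERN = re.compile(r"QUESTION:\s*(.*?)\s*\n+ANSWER:\s*(.*)", re.DOTALL)


def parse_qa(text):
    """Return {'query': ..., 'answer': ...}, or None if the layout is not found."""
    match = QA_PATTERN.search(text)
    if match is None:
        return None
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}


raw = (
    'QUESTION: What does the acronym "BAU" stand for in the document?\n'
    'ANSWER: "BAU" stands for "business as usual" in the document.'
)
parsed = parse_qa(raw)
print(parsed)
```

Returning `None` for replies that do not match makes malformed generations easy to filter out before they are added to `examples`.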

In [93]:
examples = []
examples += new_examples

In [94]:
qa.run(examples[4]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'BAU stands for "business as usual" in the document.'

# Manual Evaluation

In [95]:
import langchain
langchain.debug = True

In [97]:
qa.run(examples[4]["query"])

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What does the acronym \"BAU\" stand for in the document?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "What does the acronym \"BAU\" stand for in the document?",
  "context": "iv \nThis report is available at no cost from the National Renewable Energy Laboratory at www.nrel.gov/publications . Acronyms  \nAEO  Annual Energy Outlook  \nAWARE  available water remaining  \nBAU  business as usual  \nBtu British thermal units  \nCSP concentrating solar power  \nDOE  U.S. Department of Energy  \nDRB  Demonstrated Reserve Base  \nEERE  Energy Efficiency and Renewable Energy   \nEGS  enhanced geothermal systems  \nEHA  \nEIA existing hydropower asse

'BAU stands for "business as usual" in the document.'

In [98]:
langchain.debug = False
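The debug trace above shows the retrieved context that reached the LLM, which helps separate two failure modes: the retriever never surfaced the right passage, or the LLM misread a passage it was given. A minimal sketch of that check follows; `answer_in_context` is a hypothetical helper, and a simple substring match is assumed to be a good enough signal.

```python
def answer_in_context(reference_phrase, context):
    """Case-insensitive, whitespace-insensitive substring check."""
    def normalize(text):
        return " ".join(text.lower().split())

    return normalize(reference_phrase) in normalize(context)


# Context copied from the start of the debug trace above.
context = "Acronyms  \nAEO  Annual Energy Outlook  \nBAU  business as usual  \nBtu British thermal units"

print(answer_in_context("business as usual", context))
print(answer_in_context("capacity factor", context))
```

If the key phrase is missing from the context, the problem likely lies in retrieval (chunking, embeddings, number of documents returned) rather than in the answering prompt.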

In [99]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...


Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..



> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [100]:
# QAEvalChain evaluates question answer pairs
from langchain.evaluation.qa import QAEvalChain

In [101]:
# Create the evaluation chain; the LLM grades each prediction against the reference answer
llm = ChatOpenAI(temperature=1)
eval_chain = QAEvalChain.from_llm(llm)

In [102]:
# Get back graded outputs
graded_outputs = eval_chain.evaluate(examples, predictions)

In [103]:
# All of the text below is output by the language model
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: What is NREL and who operates it?
Real Answer: NREL is a national laboratory of the U.S. Department of Energy and is operated by the Alliance for Sustainable Energy, LLC.
Predicted Answer: NREL stands for National Renewable Energy Laboratory, which is a national laboratory of the U.S. Department of Energy. It is operated by the Alliance for Sustainable Energy, LLC.
Predicted Grade: CORRECT

Example 1:
Question: What government agency is responsible for the National Renewable Energy Laboratory (NREL)?
Real Answer: The U.S. Department of Energy's Office of Energy Efficiency & Renewable Energy is responsible for the National Renewable Energy Laboratory (NREL).
Predicted Answer: The National Renewable Energy Laboratory (NREL) is a national laboratory of the U.S. Department of Energy.
Predicted Grade: CORRECT

Example 2:
Question: Who funded the work that authored this document?
Real Answer: The work was funded by the U.S. Department of Energy Office of Energy Efficienc

## Key Questions/Concerns for Dev

In [26]:
# What to do if predicted answers incorrect?
# QAEvalChain spit out the same question multiple times
# Try to access the UI that tracks what is going on 
#  under the hood (from langchain plus)
# --generate flywheel of datapoints to learn from!!!!
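One way to start on the questions above is to tally the grades, collect the predictions that were not graded correct, and flag repeated questions. The sketch below assumes the dictionary keys seen earlier in this notebook (`query`, `answer`, `result`, and `text` for the grade); `summarize_grades` and `duplicate_queries` are hypothetical helpers, and the sample data is made up for illustration.

```python
from collections import Counter


def summarize_grades(predictions, graded_outputs, grade_key="text"):
    """Count grades and collect the predictions that were not graded CORRECT."""
    counts = Counter()
    failures = []
    for pred, graded in zip(predictions, graded_outputs):
        grade = graded[grade_key].strip().upper()
        # Test INCORRECT first, because "CORRECT" is a substring of it.
        if "INCORRECT" in grade:
            label = "INCORRECT"
        elif "CORRECT" in grade:
            label = "CORRECT"
        else:
            label = "UNKNOWN"
        counts[label] += 1
        if label != "CORRECT":
            failures.append(pred)
    return counts, failures


def duplicate_queries(examples):
    """Return normalized questions that appear more than once."""
    seen = Counter(" ".join(ex["query"].lower().split()) for ex in examples)
    return [query for query, n in seen.items() if n > 1]


# Made-up sample data in the same shape as `predictions` and `graded_outputs`.
sample_predictions = [
    {"query": "What is NREL?", "answer": "A national laboratory.", "result": "A DOE lab."},
    {"query": "what is  NREL?", "answer": "A national laboratory.", "result": "A company."},
]
sample_grades = [{"text": "CORRECT"}, {"text": " INCORRECT"}]

counts, failures = summarize_grades(sample_predictions, sample_grades)
print(counts, failures, duplicate_queries(sample_predictions))
```

The `failures` list is a natural seed for the "flywheel" of datapoints: each entry pairs a question with the reference answer and the wrong prediction.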



### Try to query this new vector store database

In [118]:
query = "What system design results in the lowest hydrogen breakeven cost? \
What is that cost in USD/kg? Is this with or without subsidies? Use markdown to describe."

In [117]:
start_time = time.time()
response = index.query(query, llm=llm)
#response = qa_stuff.run(query)

run_time = time.time()-start_time
print(run_time)

5.869688510894775


In [119]:
display(Markdown(response))

According to the given context, the integration with electricity markets, such as via dynamic retail tariffs or direct wholesale market participation, and cheap renewable power sources could help achieve low breakeven costs for electrolytic hydrogen regardless of the pathway for electrolytic hydrogen production. For example, low hydrogen breakeven cost could be achieved via direct wholesale market participation, which would cost around $3/kg. However, direct wholesale access is currently prohibited in CAISO under state law, though it is permissible in other organized wholesale markets. The profitability of hydrogen production also depends on electrolyzer siting, and even as CAPEX and OPEX continue to drop, the siting would play an important role in achieving low hydrogen breakeven costs. Hence, the lowest hydrogen breakeven cost is achievable with the integration of electricity markets, cheap renewable power sources, optimal siting of electrolyzers, and direct wholesale market participation, which would cost approximately $3/kg without subsidies.

####  Key Question/Concerns for Dev

In [None]:
# Getting unclear answers when combining all four reports
# Does not seem to be correctly parsing through the data
# Can it not read tables?
# Why can't it generate more accurate answers?

# QAGenerateChain.from_llm -> generates same questions. How to get higher fidelity questions?
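One possible reason for repetitive generated questions is that `data[:5]` covers only the first pages of one report (title, funding, and acronym pages). A hedged sketch of an alternative is to sample pages spread across all 391 loaded pages and skip very short ones. `spread_indices`, `pick_pages`, and the `min_chars` threshold are hypothetical, and whether this improves question quality for these reports is untested.

```python
def spread_indices(n_docs, n_samples):
    """Evenly spaced indices across a corpus of n_docs items."""
    if n_docs <= 0 or n_samples <= 0:
        return []
    n_samples = min(n_samples, n_docs)
    step = n_docs / n_samples
    return [int(i * step) for i in range(n_samples)]


def pick_pages(page_texts, n_samples, min_chars=200):
    """Pick evenly spaced pages, ignoring pages with little text."""
    candidates = [i for i, text in enumerate(page_texts) if len(text.strip()) >= min_chars]
    return [candidates[j] for j in spread_indices(len(candidates), n_samples)]


print(spread_indices(391, 5))
```

With the objects from this notebook, the selection could be fed to the generation chain as `[{"doc": data[i]} for i in pick_pages([d.page_content for d in data], 5)]`.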