# LangChain: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [1]:
!pip install python-dotenv
!pip install pypdf
!pip install openai
!pip install tiktoken
!pip install --upgrade langchain
!pip install docarray



In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the video.

In [3]:
# Account for deprecation of older LLM models
import datetime

# Get the current date
current_date = datetime.datetime.now().date()

# Date after which the model choice may need to change
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date.
# Both branches currently select the same model.
if current_date > target_date:
    llm_model = "gpt-3.5-turbo-16k"
else:
    llm_model = "gpt-3.5-turbo-16k"
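
The date check above can be isolated in a small helper so that the cutoff and both model names are explicit arguments. This is only a sketch: `pick_model`, its parameter names, and `llm_model_sketch` are hypothetical, and the model names in the usage line simply repeat the value used in this notebook.

```python
import datetime


def pick_model(today, cutoff, before, after):
    """Return `after` once `today` is strictly past `cutoff`, else `before`."""
    return after if today > cutoff else before


# Same cutoff date as the cell above
cutoff = datetime.date(2024, 6, 12)

# Both names are identical here, mirroring the notebook; swap in other
# model names if your account needs a different fallback.
llm_model_sketch = pick_model(
    datetime.date.today(),
    cutoff,
    before="gpt-3.5-turbo-16k",
    after="gpt-3.5-turbo-16k",
)
```

Keeping the comparison in one function makes it easy to check the boundary behaviour (the cutoff day itself still selects `before`).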

## Create our Q&A application

In [4]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [5]:
#file = 'SurveyonVideoRec.pdf'
loader = PyPDFLoader("SurveyonVideoRec.pdf")
data = loader.load()

In [6]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [7]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

### Coming up with test datapoints

In [8]:
data[23]

Document(page_content='24\n[261] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-\nmodal classiﬁcation networks hard?” in Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition , 2020, pp.\n12 695–12 705.\n[262] S. Yan, Y . Xiong, and D. Lin, “Spatial temporal graph convolutional\nnetworks for skeleton-based action recognition,” in Thirty-second AAAI\nconference on artiﬁcial intelligence , 2018.\n[263] L. Shi, Y . Zhang, J. Cheng, and H. Lu, “Skeleton-based action\nrecognition with multi-stream adaptive graph convolutional networks,”\nIEEE Transactions on Image Processing , vol. 29, pp. 9532–9545, 2020.\n[264] Y .-F. Song, Z. Zhang, C. Shan, and L. Wang, “Stronger, faster and more\nexplainable: A graph convolutional baseline for skeleton-based action\nrecognition,” in proceedings of the 28th ACM international conference\non multimedia , 2020, pp. 1625–1633.\n[265] Y . Chen, Z. Zhang, C. Yuan, B. Li, Y . Deng, and W. Hu, “Channel-\nwise topology 

In [9]:
len(data)

26

### Hard-coded examples

In [10]:
examples = [
    {
        "query": "What are five application of video action recognition in sports?Dislay only a list of categories without any additional comments",
        "answer": "Training Aids, Game Assistance (Video Judge),  Video Highlights,  Automatic Sports News Generation (ASNG),  General Research Purposes"
    },
    {
        "query": "What are key challenges wof applying those action recognition baselines on sports videos in practical?Dislay only a list of categories without any additional comments",
        "answer": "Data Collection and Annotation; Camera Motion, Cut and Occlusion; Long-tailed Distribution and Imbalanced Data; Dense and Fast-moving Actions ;  Transfer, Few-shot and Zero-shot Learning;  Multi-camera and Multi-view Action Recognition "
    }
]

### LLM-Generated examples

In [11]:
from langchain.evaluation.qa import QAGenerateChain


In [12]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [13]:
# the warning below can be safely ignored

In [14]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": document} for document in data[:5]]
)




In [15]:
new_examples[0]

{'qa_pairs': {'query': 'What is the focus of the survey presented in the document?',
  'answer': 'The focus of the survey presented in the document is video action recognition in sports analytics.'}}

In [16]:
reformatted_examples = [
    {"query": qa_pair['qa_pairs']['query'], "answer": qa_pair['qa_pairs']['answer']}
    for qa_pair in new_examples
]

In [17]:
reformatted_examples[0]

{'query': 'What is the focus of the survey presented in the document?',
 'answer': 'The focus of the survey presented in the document is video action recognition in sports analytics.'}
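
The reformatting step assumes every generated item has the nested `qa_pairs` shape shown above. A slightly more defensive variant is sketched below. `flatten_qa_pairs` is a hypothetical helper, and the second and third sample items are invented to show the fallback and skip paths. It is not part of LangChain.

```python
def flatten_qa_pairs(generated):
    """Convert generator output into a list of {'query', 'answer'} dicts."""
    flat = []
    for item in generated:
        # Use the nested dict if present, otherwise treat the item as flat
        pair = item.get("qa_pairs", item)
        # Skip items that lack either field instead of raising KeyError
        if "query" in pair and "answer" in pair:
            flat.append({"query": pair["query"], "answer": pair["answer"]})
    return flat


sample = [
    # Shape taken from the notebook output above
    {"qa_pairs": {
        "query": "What is the focus of the survey presented in the document?",
        "answer": "The focus of the survey presented in the document is "
                  "video action recognition in sports analytics.",
    }},
    # Hypothetical already-flat item
    {"query": "Q2", "answer": "A2"},
    # Hypothetical malformed item, skipped by the helper
    {"qa_pairs": {"query": "incomplete"}},
]

flat = flatten_qa_pairs(sample)
print(len(flat))
```

Whether to skip malformed items silently or raise an error is a design choice; skipping keeps a long generation run from failing on one bad item.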

In [18]:
examples += reformatted_examples

In [26]:
examples

[{'query': 'What are five application of video action recognition in sports?Dislay only a list of categories without any additional comments',
  'answer': 'Training Aids, Game Assistance (Video Judge),  Video Highlights,  Automatic Sports News Generation (ASNG),  General Research Purposes'},
 {'query': 'What are key challenges wof applying those action recognition baselines on sports videos in practical?Dislay only a list of categories without any additional comments',
  'answer': 'Data Collection and Annotation; Camera Motion, Cut and Occlusion; Long-tailed Distribution and Imbalanced Data; Dense and Fast-moving Actions ;  Transfer, Few-shot and Zero-shot Learning;  Multi-camera and Multi-view Action Recognition '},
 {'query': 'What is the focus of the survey presented in the document?',
  'answer': 'The focus of the survey presented in the document is video action recognition in sports analytics.'},
 {'query': 'According to the document, what are some examples of individual sports me

## Manual Evaluation

In [19]:
import langchain
langchain.debug = True

In [None]:
qa.run(examples[3]["query"])

In [21]:
# Turn off the debug mode
langchain.debug = False

## LLM-assisted evaluation

In [22]:
predictions = qa.apply(reformatted_examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [23]:
from langchain.evaluation.qa import QAEvalChain

In [24]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [27]:
print(len(examples), len(predictions))


7 5


In [39]:
len(reformatted_examples)

5
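
The counts differ because `qa.apply` was run on `reformatted_examples` only, so the two hard-coded examples have no predictions. Before grading, it can help to confirm that the two lists line up. The sketch below uses toy data in place of the real lists; `check_alignment` is a hypothetical helper, not a LangChain function.

```python
def check_alignment(examples, predictions, key="query"):
    """Raise ValueError unless both lists have equal length and matching queries."""
    if len(examples) != len(predictions):
        raise ValueError(
            f"{len(examples)} examples but {len(predictions)} predictions"
        )
    for i, (ex, pred) in enumerate(zip(examples, predictions)):
        if ex[key] != pred[key]:
            raise ValueError(f"item {i}: example and prediction queries differ")


# Toy data standing in for reformatted_examples and the qa.apply(...) output
toy_examples = [
    {"query": "q1", "answer": "a1"},
    {"query": "q2", "answer": "a2"},
]
toy_predictions = [
    {"query": "q1", "answer": "a1", "result": "r1"},
    {"query": "q2", "answer": "a2", "result": "r2"},
]

check_alignment(toy_examples, toy_predictions)  # passes silently
```

Calling it with the full `examples` list and the 5 predictions would raise, which flags the mismatch before any grading call is made.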

In [34]:
graded_outputs = eval_chain.evaluate(reformatted_examples, predictions)

In [36]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
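
With more than a handful of examples, a tally is easier to read than the raw list. The following sketch counts the grades and computes the fraction marked `CORRECT`. `summarize_grades` is a hypothetical helper; it assumes each grade is stored under the `results` key as a bare label such as `CORRECT`, as in the output above.

```python
from collections import Counter


def summarize_grades(graded, key="results"):
    """Return (Counter of grade labels, fraction graded CORRECT)."""
    # Normalize whitespace and case so ' correct' and 'CORRECT' count together
    counts = Counter(item[key].strip().upper() for item in graded)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else 0.0
    return counts, accuracy


# Mirrors the five grades shown in the output above
graded_sample = [{"results": "CORRECT"} for _ in range(5)]

counts, accuracy = summarize_grades(graded_sample)
print(counts, accuracy)
```

If the grader returns longer text around the label, the exact-match count would need adjusting.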

In [40]:
for i, eg in enumerate(examples):
    if i < len(predictions):
        print(f"Example {i}:")
        print("Question: " + predictions[i]['query'])
        print("Real Answer: " + predictions[i]['answer'])
        print("Predicted Answer: " + predictions[i]['result'])

        # Access 'results' key from graded_outputs
        predicted_grade = graded_outputs[i].get('results', 'No grade available')
        print("Predicted Grade: " + predicted_grade)
        print()
    else:
        print(f"Example {i}: No prediction available")
        print()


Example 0:
Question: What is the focus of the survey presented in the document?
Real Answer: The focus of the survey presented in the document is video action recognition in sports analytics.
Predicted Answer: The focus of the survey presented in the document is on action recognition in sports videos. It discusses the challenges and techniques involved in recognizing actions in sports videos, as well as the existing datasets and benchmarks for evaluating performance. The survey also highlights the importance of data collection and annotation in this field and provides references to relevant research papers and resources.
Predicted Grade: CORRECT

Example 1:
Question: According to the document, what are some examples of individual sports mentioned?
Real Answer: Some examples of individual sports mentioned in the document are diving, tennis, gymnastics, and table tennis.
Predicted Answer: Some examples of individual sports mentioned in the document are diving, gymnastics, tennis, table t

In [37]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    #print("Predicted Grade: " + graded_outputs[i].get('text', 'No grade available'))
    print()

Example 0:
Question: What is the focus of the survey presented in the document?
Real Answer: The focus of the survey presented in the document is video action recognition in sports analytics.
Predicted Answer: The focus of the survey presented in the document is on action recognition in sports videos. It discusses the challenges and techniques involved in recognizing actions in sports videos, as well as the existing datasets and benchmarks for evaluating performance. The survey also highlights the importance of data collection and annotation in this field and provides references to relevant research papers and resources.


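KeyError: ignored

Note: this earlier version of the cell fails because the grader returns each grade under the `results` key, so `graded_outputs[i]['text']` raises a `KeyError`. Inspecting a single graded output confirms the key name.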

In [33]:
graded_outputs[0]

{'results': 'CORRECT'}
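
The `KeyError` above comes from reading a key that this output does not contain. One way to make the reporting loop tolerant of the key name, and of unequal list lengths, is sketched below. `get_grade` and `format_report` are hypothetical helpers; treating `text` as an alternative key is an assumption based on the failing cell, not a documented guarantee.

```python
def get_grade(graded_item, keys=("results", "text")):
    """Return the grade stored under the first matching key, or a placeholder."""
    for key in keys:
        if key in graded_item:
            return graded_item[key]
    return "No grade available"


def format_report(predictions, graded):
    """Build the per-example report; zip stops at the shorter list."""
    lines = []
    for i, (pred, grade) in enumerate(zip(predictions, graded)):
        lines.append(f"Example {i}:")
        lines.append("Question: " + pred["query"])
        lines.append("Real Answer: " + pred["answer"])
        lines.append("Predicted Answer: " + pred["result"])
        lines.append("Predicted Grade: " + get_grade(grade))
        lines.append("")
    return "\n".join(lines)


# Toy data standing in for predictions and graded_outputs
toy_predictions = [{"query": "q1", "answer": "a1", "result": "r1"}]
toy_graded = [{"results": "CORRECT"}]

report = format_report(toy_predictions, toy_graded)
print(report)
```

Iterating over `zip(predictions, graded)` rather than over `examples` avoids indexing past the end of the shorter list.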

## LangChain evaluation platform

The LangChain evaluation platform, LangChain Plus, can be accessed at https://www.langchain.plus/.  
Use the invite code `lang_learners_2023`

Reminder: Download your notebook to your local computer to save your work.