This repo contains cookbooks demonstrating how to evaluate AI agents with the judgeval package from Judgment Labs.
Before running these examples, make sure you have:

- Installed the latest version of the judgeval package:

  ```bash
  pip install judgeval
  ```

- Set up your Judgeval API key and organization ID as environment variables:

  ```bash
  export JUDGMENT_API_KEY="your_api_key"
  export JUDGMENT_ORG_ID="your_org_id"
  ```
To get your API key and Organization ID, make an account on the Judgment Labs platform.
This repository provides a collection of cookbooks to demonstrate various evaluation techniques and agent implementations using Judgeval.
These cookbooks feature agents that interact directly with LLM APIs (e.g., OpenAI, Anthropic), often implementing custom logic for tool use, function calling, and RAG.
- `multi-agent/`: A flexible multi-agent framework for orchestrating and evaluating the collaboration of multiple agents and tools on complex tasks such as financial analysis. The agents' outputs are evaluated for factual adherence to their retrieval context (see the sketch below).
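To give a flavor of the evaluation pattern these cookbooks use, here is a minimal sketch of scoring an agent's answer for faithfulness to its retrieval context. It follows the `JudgmentClient` / `Example` / `FaithfulnessScorer` pattern from judgeval's docs; exact signatures may vary between versions, and the task data below is hypothetical.

```python
# Minimal sketch: checking an answer's factual adherence to its retrieval
# context with judgeval. Names follow the judgeval docs at the time of
# writing; check your installed version. The example data is hypothetical.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

# Reads JUDGMENT_API_KEY / JUDGMENT_ORG_ID from the environment.
client = JudgmentClient()

example = Example(
    input="What was ACME Corp's revenue in Q2?",            # hypothetical task
    actual_output="ACME Corp reported $12.4M in Q2.",        # the agent's answer
    retrieval_context=["ACME Corp Q2 earnings: revenue of $12.4M, up 8% YoY."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.8)],
    model="gpt-4o",  # the judge model
)
print(results)
```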
These cookbooks showcase agents built using the LangGraph framework, demonstrating complex state management and chained operations.
- `langgraph_music_recommender/`: An agent that generates song recommendations based on a user's music taste (see the tracing sketch below).
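Because LangGraph pipelines chain many steps, judgeval's tracing is useful for capturing each step as a span. Below is a hedged sketch of the decorator-based tracing pattern from judgeval's docs; the project name, functions, and prompt are hypothetical placeholders, not the cookbook's actual code.

```python
# Sketch of judgeval's tracing pattern (per its docs; all names here are
# hypothetical). Each @observe-decorated function becomes a span in the
# trace, which helps inspect multi-step graphs like the music recommender.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

judgment = Tracer(project_name="music_recommender")  # hypothetical project name
client = wrap(OpenAI())  # wrapped so LLM calls are recorded in the trace

@judgment.observe(span_type="tool")
def fetch_user_taste(user_id: str) -> str:
    # Hypothetical tool: look up the user's listening history.
    return "melancholic indie folk"

@judgment.observe(span_type="function")
def recommend(user_id: str) -> str:
    taste = fetch_user_taste(user_id)
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Recommend two songs for a fan of {taste}."}],
    )
    return res.choices[0].message.content

print(recommend("user-123"))
```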
These cookbooks focus on implementing and using custom scorers:

- `custom_scorers/`: Examples of writing custom scorers that tailor evaluations to needs the built-in scorers don't cover (see the sketch below).
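As an illustration of the pattern those examples follow, here is a hedged sketch of a custom scorer. The `JudgevalScorer` base class and the `score_example` / `a_score_example` / `_success_check` hooks follow judgeval's custom-scorer docs, but the exact base-class and attribute names may differ in your installed version; the keyword-matching logic is purely illustrative.

```python
# Hedged sketch of a custom scorer, following the subclassing pattern in
# judgeval's docs (base-class and hook names may differ in your version).
from judgeval.scorers import JudgevalScorer

class KeywordScorer(JudgevalScorer):
    """Scores 1.0 if the output mentions a required keyword, else 0.0."""

    def __init__(self, keyword: str, threshold: float = 0.5):
        super().__init__(score_type="Keyword Match", threshold=threshold)
        self.keyword = keyword

    def score_example(self, example, *args, **kwargs):
        # Simple deterministic check against the agent's output.
        self.score = 1.0 if self.keyword.lower() in example.actual_output.lower() else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_score_example(self, example, *args, **kwargs):
        # No async work needed here, so delegate to the sync path.
        return self.score_example(example)

    def _success_check(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Match"
```

A custom scorer like this can then be passed to `client.run_evaluation(...)` alongside the built-in scorers shown earlier.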