## Evaluating CrewAI's `crew` (end-to-end)

This notebook demonstrates how to run end-to-end evaluations on a CrewAI crew, using a dataset from Confident AI and DeepEval's dataset iterator.

### Install dependencies:

In [None]:
!pip install -U deepeval crewai ipywidgets --quiet

### Set your OpenAI API key:

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
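Hard-coding a key in a notebook makes it easy to run the later cells with the placeholder still in place. As a minimal sketch, the check below fails early when a variable is unset or still holds a `<your-...>` placeholder. `require_api_key` is a hypothetical helper written for this notebook; it is not part of DeepEval or CrewAI.

```python
import os


def require_api_key(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unusable.

    Hypothetical helper for illustration; not provided by DeepEval or CrewAI.
    """
    value = os.environ.get(name, "").strip()
    # Treat the notebook's "<your-...>" placeholders the same as a missing key.
    is_placeholder = value.startswith("<") and value.endswith(">")
    if not value or is_placeholder:
        raise RuntimeError(f"{name} is not set to a real key")
    return value


# Example call (left commented out so this cell does not fail when no key is set):
# require_api_key("OPENAI_API_KEY")
```

Calling the helper right after setting the key surfaces a configuration mistake before any crew or evaluation call is made.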

### Create a crew:

This is a simple crew with a single agent and a single task.

In [1]:
from crewai import Task, Crew, Agent

agent = Agent(
    role="Consultant",
    goal="Write clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
)

task = Task(
    description="Explain the given topic: {topic}",
    expected_output="A clear and concise explanation.",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])

result = crew.kickoff(
    inputs={"topic": "What is the biggest open source database?"}
)
print(result)

The biggest open source database, in terms of popularity and widespread use, is MySQL. MySQL is a relational database management system (RDBMS) that is based on Structured Query Language (SQL). It was originally developed in the mid-1990s and is now owned by Oracle Corporation. MySQL is known for its speed, reliability, and ease of use, making it a preferred choice for web applications and various enterprise applications.

MySQL supports large databases, and when combined with various storage engines, it can efficiently manage large volumes of data. It is frequently used in scenarios where data integrity and performance are crucial, such as in content management systems like WordPress, e-commerce applications, and many other types of software applications.

The community-driven aspect of MySQL, along with its extensive documentation and support, contributes to its status as the largest and most popular open source data
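In the cell above, the `{topic}` placeholder in the task description is filled from the `inputs` dictionary passed to `kickoff`. The effect is roughly what Python's own `str.format` does, as the sketch below shows. This is an analogy using plain strings, not CrewAI's actual implementation.

```python
# Same template and inputs as the crew above, rendered with plain str.format.
description_template = "Explain the given topic: {topic}"
inputs = {"topic": "What is the biggest open source database?"}

# Each key in `inputs` replaces the placeholder of the same name.
rendered = description_template.format(**inputs)
print(rendered)
```

Because the key in `inputs` must match the placeholder name, a mismatch such as passing `{"subject": ...}` would leave `{topic}` unfilled.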

### Evaluate the agent

To evaluate CrewAI's `crew`:

1. Instrument the application (using `from deepeval.integrations.crewai import instrument_crewai`).
2. Supply metrics to `kickoff`.


> (Pro Tip) View your agent's traces and publish test runs on [Confident AI](https://www.confident-ai.com/). You also get an in-house dataset editor and more advanced tools to monitor and eventually improve your agent's performance. Get your API key [here](https://app.confident-ai.com/).


In [None]:
os.environ["CONFIDENT_API_KEY"] = "<your-confident-api-key>"

In [2]:
from deepeval.integrations.crewai import instrument_crewai

instrument_crewai()

Overriding of current TracerProvider is not allowed


### Using a dataset from Confident AI:

For demo purposes, we will use a public dataset from Confident AI. You can use your own dataset as well. Refer to the [docs](https://deepeval.com/docs/evaluation-end-to-end-llm-evals#setup-your-test-environment) to learn more about how to create your own dataset.

In [None]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="topic_agent_queries", public=True)

### Run evaluations:

We will use the `AnswerRelevancyMetric` to evaluate the crew. The dataset iterator yields one golden at a time, and each golden's `input` is passed to the crew as the topic.

In [None]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import trace

answer_relevancy_metric = AnswerRelevancyMetric()

# Run the crew once per golden; each run is traced and scored.
for golden in dataset.evals_iterator():
    with trace(trace_metrics=[answer_relevancy_metric]):
        result = crew.kickoff(
            inputs={"topic": golden.input}, metrics=[AnswerRelevancyMetric()]
        )

Congratulations! You have just evaluated your first CrewAI `crew` using DeepEval. Try changing the hyperparameters, agents, tasks, and metrics to see how your agent performs.