# 🤗 Welcome to DeepEval!

Thanks for trying us out! We're here to provide you with the best LLM evaluation experience you can dream of 😊 If you have any questions or concerns, [come talk to us on Discord](https://discord.com/invite/a3K9c8GRGt). We're always here to help!

# What is DeepEval?

DeepEval is the evaluation framework for LLMs. It takes the latest research in LLM evaluation (e.g., G-Eval, SelfCheckGPT, RAGAS) and implements it as metrics that anyone can easily plug in and use.

Our strength lies in simplicity and ease of use, while still being a full evaluation suite for LLMs. We hope you enjoy trying us out!

# Installation

Install `deepeval`. You can ignore the warnings at the end of the installation.

In [None]:
!pip install -U deepeval

^^ Please ignore the warnings after installation. If you're concerned, run `!pip install cohere` as well.

# Login (recommended step)

Log in to Confident AI to log evaluation results to the cloud. Later, you can use Confident AI to centralize evaluation datasets, perform real-time evaluations in production, and much more.

In [None]:
!deepeval login

# 😇 Create Your First Test Case

A test case in `deepeval` mimics a user interaction with your LLM (application). Here:
- `input` is what a user would input
- `actual_output` is the output of your LLM (application)
- `retrieval_context` is the retrieved context in your RAG pipeline

Note that only RAG metrics require `retrieval_context` when creating a test case. Visit the [test cases section in our docs](https://docs.confident-ai.com/docs/evaluation-test-cases) to learn how a test case works in `deepeval`.

In [None]:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
  input="What if these shoes don't fit?",
  # Replace this with the actual output of your LLM application
  actual_output="We offer a 30-day full refund at no extra cost.",
  # Replace this with the retrieval context (in the RAG pipeline) of your LLM application
  retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)

# 🤗 Create Your First Metric

A metric in `deepeval` evaluates based on the values of the parameters in a particular test case. `deepeval` incorporates the latest research into its metrics implementation so you don't have to.

## Set OpenAI API Key

Most of `deepeval`'s metrics are evaluated using LLMs. To begin, set your `OPENAI_API_KEY` below. IMPORTANT: don't wrap the key in quotation marks!

In [None]:
%env OPENAI_API_KEY=<your-openai-api-key>

(Note that you don't have to use OpenAI, although we highly recommend it, since evaluation requires a high level of reasoning. That being said, if you want to use a custom model such as Azure OpenAI, visit the [metrics section in our docs](https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm).)
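
If you want to double-check that the key was set without stray quotation marks, a quick sanity check in plain Python can help. This is a minimal sketch using only the standard library: `check_api_key` is a hypothetical helper, not part of `deepeval`, and it only inspects the environment variable (it does not validate the key with OpenAI).

```python
import os


def check_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return "missing", "quoted", or "ok" for the given environment variable."""
    value = os.environ.get(name, "").strip()
    if not value:
        return "missing"
    # Stray quotation marks usually mean the key was pasted with quotes.
    if value[0] in "\"'" or value[-1] in "\"'":
        return "quoted"
    return "ok"


print(check_api_key())
```

If this prints `quoted`, set the variable again without the surrounding quotation marks.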


## Create Your Metric

In this example, we create an `AnswerRelevancyMetric`, which measures the answer relevancy of a RAG-based LLM application. Not all metrics are RAG metrics. For a full list of metrics and an explanation of each, visit [the metrics section in our docs](https://docs.confident-ai.com/docs/metrics-introduction).

In [None]:
from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

# 🚀 Run Your First Evaluation

With a test case and metric ready, you can start using our `evaluate()` function to evaluate your LLM (application).

The `evaluate()` function accepts a list of test cases and a list of metrics. Under the hood, it evaluates each individual test case using the list of provided metrics. A test case only passes if all of its metrics pass. For more information, including how to use our Pytest integration for evaluation, visit the [evaluation section in our docs](https://docs.confident-ai.com/docs/evaluation-introduction).
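
The pass rule can be sketched in plain Python. This is an illustration only, not `deepeval`'s implementation: `MetricResult` and `case_passes` are hypothetical names, and the sketch assumes a metric passes when its score is at least its threshold.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    # Hypothetical record of one metric's outcome on one test case.
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a metric passes when score >= threshold.
        return self.score >= self.threshold


def case_passes(results: list[MetricResult]) -> bool:
    # A test case passes only if every metric passes.
    return all(result.passed for result in results)


results = [
    MetricResult("answer_relevancy", score=0.9, threshold=0.5),
    MetricResult("faithfulness", score=0.4, threshold=0.5),
]
print(case_passes(results))  # False: the second metric is below its threshold
```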

In [None]:
from deepeval import evaluate

evaluate([test_case], [answer_relevancy_metric])

# Using Standalone Metrics

`deepeval` offers a simple way for anyone to plug and use our metrics. This is especially useful if you're looking to build your own evaluation pipelines. Using the previous example:

In [None]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


test_case = LLMTestCase(
  input="What if these shoes don't fit?",
  # Replace this with the actual output of your LLM application
  actual_output="We offer a 30-day full refund at no extra cost.",
  # Replace this with the retrieval context (in the RAG pipeline) of your LLM application
  retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
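
For a custom pipeline, one possible shape is a loop that calls `measure()` on each test case and collects `score` and `reason`. The sketch below is self-contained so it runs without an API key: `SimpleCase` and `KeywordOverlapMetric` are hypothetical stand-ins that only mimic the `measure()` / `score` / `reason` interface used above, and the word-overlap scoring is a toy, not how `AnswerRelevancyMetric` works.

```python
from dataclasses import dataclass


@dataclass
class SimpleCase:
    # Hypothetical stand-in for LLMTestCase with just two fields.
    input: str
    actual_output: str


class KeywordOverlapMetric:
    # Toy stand-in exposing the same measure()/score/reason surface.
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = None
        self.reason = None

    def measure(self, case: SimpleCase) -> float:
        asked = set(case.input.lower().split())
        answered = set(case.actual_output.lower().split())
        shared = asked & answered
        # Fraction of input words that also appear in the output.
        self.score = len(shared) / len(asked) if asked else 0.0
        self.reason = f"{len(shared)} of {len(asked)} input words appear in the output"
        return self.score


def run_pipeline(cases, metric):
    # Measure every case and collect one result row per case.
    rows = []
    for case in cases:
        metric.measure(case)
        rows.append({
            "input": case.input,
            "score": metric.score,
            "passed": metric.score >= metric.threshold,
            "reason": metric.reason,
        })
    return rows


cases = [
    SimpleCase("refund policy for shoes", "shoes have a 30-day refund policy"),
    SimpleCase("refund policy for shoes", "we open at nine"),
]
for row in run_pipeline(cases, KeywordOverlapMetric(threshold=0.5)):
    print(row)
```

To use real metrics, swap the stand-ins for `LLMTestCase` and a `deepeval` metric; the loop itself stays the same.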

# Create Your First Evaluation Dataset

An evaluation dataset in `deepeval` is a collection of test cases. It offers a rich set of features for you to manipulate evaluation data. For more information, visit the [dataset section in our docs.](https://docs.confident-ai.com/docs/evaluation-datasets)

In [None]:
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case_1 = LLMTestCase(
  input="What if these shoes don't fit?",
  actual_output="We offer a 30-day full refund at no extra cost.",
  retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)
test_case_2 = LLMTestCase(
  input="What should you do if there's a fire?",
  actual_output="Drop and roll.",
  retrieval_context=["Don't use the elevator in the case of a fire."]
)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

dataset = EvaluationDataset(test_cases=[test_case_1, test_case_2])

# Same as before, using the evaluate function
evaluate(dataset, [answer_relevancy_metric])

# Or, use the evaluate method directly, they're exactly the same
# dataset.evaluate([answer_relevancy_metric])