# Human Evaluation using the Intelligence Layer
Even though there are many ways to automate the evaluation of LLM-based tasks, sometimes a human opinion is still necessary. To make this as painless as possible, we integrated an [Argilla Evaluator](https://argilla.io/) into the Intelligence Layer. This notebook serves as a quick-start guide.

## Environment setup
This notebook expects that you have added your Aleph Alpha token to your `.env` file. Additionally, you need to add the `ARGILLA_API_URL` and `ARGILLA_API_KEY` from `env.sample` to your `.env` file.
After this, run
```bash
docker-compose up -d
```
from the Intelligence Layer base directory.
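
Before running the cells below, it can help to confirm that the required variables are visible to the Python process. The following is a minimal sketch: `missing_env_vars` is a hypothetical helper (not part of the Intelligence Layer), and it assumes the token is stored under `AA_TOKEN`, the name used later in this notebook, and that the `.env` values have already been loaded into the environment.

```python
import os

# Names taken from this notebook; adjust them if your .env uses different keys.
REQUIRED_VARS = ("AA_TOKEN", "ARGILLA_API_URL", "ARGILLA_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```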

In [None]:
from intelligence_layer.core import ArgillaEvaluator, ArgillaEvaluationRepository, Example, InstructInput, Instruct, InMemoryDatasetRepository, InMemoryEvaluationRepository, PromptOutput, Runner
from intelligence_layer.connectors import LimitedConcurrencyClient, Question, ArgillaEvaluation, DefaultArgillaClient, Field, RecordData
from typing import Iterable, cast, Sequence
from datasets import load_dataset
import os
from pydantic import BaseModel

client = LimitedConcurrencyClient.from_token(os.getenv("AA_TOKEN"))

## Dataset Repository definition
First we need to define our dataset. Here we use an [Instruction Dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset?row=0) from Hugging Face. Before we can use it for human evaluation, we need to wrap it in an Intelligence Layer dataset repository.

In [None]:
dataset = load_dataset("HuggingFaceH4/instruction-dataset")["test"]

Let us explore the dataset a bit. It consists of prompts, example completions and metadata for 327 examples. Since we are doing human eval, for now we only need the prompt and corresponding id.

In [None]:
print(dataset)
print(dataset['meta'][0].keys())

We could now build a single example like this:

In [None]:
example = Example(
    input=InstructInput(instruction=dataset["prompt"][0], input=None), 
    expected_output=None,
    id=str(dataset["meta"][0]["id"])
)

For our dataset repository we can use either a `FileDatasetRepository` or an `InMemoryDatasetRepository`. Here we use the latter.

In [None]:
num_examples = 5
assert num_examples <= len(dataset)
dataset_repository = InMemoryDatasetRepository()
dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=InstructInput(instruction=dataset["prompt"][i], input=None),
            expected_output=None,
            id=str(dataset["meta"][i]["id"]),
        )
        for i in range(num_examples)
    ]
)

## Task Setup
We use the `Instruct` task for our instruction dataset. In addition, we define an `EvaluationRepository` to save the results and a `Runner` to generate the model completions for our dataset.

In [None]:
task = Instruct(client, model="luminous-base-control")
evaluation_repository = InMemoryEvaluationRepository()
runner = Runner(task, evaluation_repository, dataset_repository, "Instruct")
run_overview = runner.run_dataset(dataset_id)

## Evaluator Definition


At the end of our evaluation we want, for each question, a float score $s \in [1,5]$ describing the model performance. We collect these scores in `InstructAggregatedEvaluation`.

In [None]:
class InstructAggregatedEvaluation(BaseModel):
    general_rating: float | None
    fluency: float | None
    evaluated_examples: int

![Argilla Interface](../../assets/argilla_interface.png)
In the Argilla UI we will see our model input (Instruction) and output (Model Completion) on the left side. This is defined using the `fields` list. The field names have to match the content keys of the `RecordData` that we will define in our `InstructArgillaEvaluator`. On the right side of the UI we will see our rating interface, which can present a number of `Question`s to be rated. Currently only integer scales are accepted. The `name` property is used to access the human ratings in the aggregation step.

In [None]:
questions = [
    Question(
        name="general_rating",
        title="Rating",
        description="Rate this Instruct completion on a scale from 1 to 5",
        options=range(1,6),
    ),
    Question(
        name="fluency",
        title="Fluency",
        description="How fluent is the completion?",
        options=range(1,6),
    )
]

fields = [
    Field(name="input", title="Instruction"),
    Field(name="output", title="Model Completion"),
]
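
Because the field names must match the `content` keys of each record, a mismatch is an easy mistake to make. The sketch below illustrates such a check with plain lists and dictionaries standing in for `Field` and `RecordData`; `unmatched_keys` is a hypothetical helper, not part of the Intelligence Layer, and the record text is made up.

```python
def unmatched_keys(field_names, record_content):
    """Return (field names without content, content keys without a field)."""
    fields = set(field_names)
    keys = set(record_content)
    return sorted(fields - keys), sorted(keys - fields)


# Stand-ins for the `fields` list and one record's `content` dictionary.
field_names = ["input", "output"]
record_content = {"input": "Write a haiku.", "output": "An old silent pond..."}

missing_content, missing_fields = unmatched_keys(field_names, record_content)
# Both lists are empty when the names line up.
print(missing_content, missing_fields)
```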

We can now define our `InstructArgillaEvaluator`. It has to implement the two abstract methods `aggregate` and `_to_record`. Let's look at the documentation:

In [None]:
help(ArgillaEvaluator.aggregate)
help(ArgillaEvaluator._to_record)

In [None]:
class InstructArgillaEvaluator(
    ArgillaEvaluator[
        InstructInput,
        PromptOutput,
        None,
        InstructAggregatedEvaluation,
    ]
):
    def aggregate(
        self,
        evaluations: Iterable[ArgillaEvaluation],
    ) -> InstructAggregatedEvaluation:
        evaluations = list(evaluations)

        if len(evaluations) == 0:  # no evaluations were submitted, so there is nothing to average
            return InstructAggregatedEvaluation(
                general_rating=None,
                fluency=None,
                evaluated_examples=0,
            )
        
        general_rating = sum(
            cast(float, evaluation.responses["general_rating"]) for evaluation in evaluations
        ) / len(evaluations)

        fluency = sum(
            cast(float, evaluation.responses["fluency"]) for evaluation in evaluations
        ) / len(evaluations)

        return InstructAggregatedEvaluation(
            general_rating=general_rating,
            fluency=fluency,
            evaluated_examples=len(evaluations),
        )

    def _to_record(
        self,
        example: Example[InstructInput, None],
        output: PromptOutput,
    ) -> Sequence[RecordData]:
        return [RecordData(
            content={
                "input": example.input.instruction,
                "output": output.completion,
            },
            example_id=example.id,
        )]
    
argilla_client = DefaultArgillaClient()
workspace_id = argilla_client.create_workspace("test")

evaluator = InstructArgillaEvaluator(
    ArgillaEvaluationRepository(evaluation_repository, argilla_client),
    dataset_repository,
    workspace_id,
    fields,
    questions,
)
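
The `aggregate` method above computes a plain arithmetic mean per question. As a minimal sketch, assuming each human evaluation can be represented as a dictionary from question name to integer rating (a stand-in for `ArgillaEvaluation.responses`), the computation looks like this. `mean_rating` is a hypothetical helper and the ratings are made up for illustration.

```python
def mean_rating(responses, question):
    """Average the ratings for one question, or None if there are none."""
    ratings = [response[question] for response in responses]
    return sum(ratings) / len(ratings) if ratings else None


# Three made-up human evaluations on the 1-5 scale.
responses = [
    {"general_rating": 4, "fluency": 5},
    {"general_rating": 2, "fluency": 4},
    {"general_rating": 3, "fluency": 3},
]

print(mean_rating(responses, "general_rating"))  # 3.0
print(mean_rating(responses, "fluency"))  # 4.0
print(mean_rating([], "fluency"))  # None
```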

The `partial_evaluate_dataset` method posts the records created from a run to the Argilla instance.

In [None]:
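try:
    eval_overview = evaluator.partial_evaluate_dataset(run_overview.id)
    print(eval_overview)
except Exception as e:
    # Only HTTP errors carry a `response`; fall back to the exception itself otherwise.
    response = getattr(e, "response", None)
    print(response.json() if response is not None else e)
    raise  # without `eval_overview`, the following cells cannot run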

Once we have evaluated some examples in the Argilla UI, we can aggregate the submitted ratings and access the resulting statistics.

In [None]:
output = evaluator.aggregate_evaluation(eval_overview.id)
print(output.statistics)