# Setting up your own custom task

If the available task methodologies are not suitable for your use case, this guide explains how to set up your own task from scratch.
Using the `Task` interface also gives us built-in input and output logging and lets us use the evaluation framework.

For the purpose of this tutorial, we will set up a simple keyword extraction task.
To do so, we will leverage `luminous-base` and a few-shot prompt to generate matching keywords for variable input texts.
Next, we will build an evaluator to check how well our extractor performs.

Let's start with the interface of any generic task. The full `Task` interface can be found here: [../intelligence_layer/task.py](../intelligence_layer/task.py).
However, to initially set up a `Task`, there are only a few parts relevant to us. For now, we shall only care about the following part of the interface:

```python
Input = TypeVar("Input", bound=PydanticSerializable)
Output = TypeVar("Output", bound=PydanticSerializable)

class Task(ABC, Generic[Input, Output]):
    @abstractmethod
    def do_run(self, input: Input, task_span: TaskSpan) -> Output:
        """Executes the process for this use-case."""
        ...
```

For every task, we have to define an `Input`, an `Output` and how we would like to run it. Since these can vary so much, we make no assumptions about a `Task`'s implementation. 
We only require both input and output to be `PydanticSerializable`. The best way to guarantee this is to make them pydantic `BaseModel`s. For our keyword extraction task, we will define `Input` and `Output` as follows:

In [None]:
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()


class KeywordExtractionInput(BaseModel):
    """This is the text we will extract keywords from"""

    text: str


class KeywordExtractionOutput(BaseModel):
    """The matching set of keywords we aim to extract"""

    keywords: frozenset[str]

Now that we have our input and output defined, we will implement the actual task.

The steps that the task consists of are:
- Create a `Prompt` using the input text.
- Have `luminous-base` complete the prompt.
- Extract keywords from said completion.
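
The three steps above can be sketched without calling any API. This is a minimal, standard-library-only illustration: the shortened template, the `cut_at_stop_sequences` helper, and the raw completion string are assumptions made for the example, and the real truncation at stop sequences is done by the model API rather than by our code.

```python
# Shortened stand-in for the few-shot template (assumption for this sketch).
PROMPT_TEMPLATE = (
    "Identify matching keywords for each text.\n###\nText: {text}\nKeywords:"
)
STOP_SEQUENCES = ["\n", "###"]


def cut_at_stop_sequences(raw: str, stop_sequences: list[str]) -> str:
    """Keep only the text before the earliest stop sequence (hypothetical helper)."""
    cut = len(raw)
    for stop in stop_sequences:
        index = raw.find(stop)
        if index != -1:
            cut = min(cut, index)
    return raw[:cut]


def parse_keywords(completion: str) -> frozenset[str]:
    """Split on commas, trim whitespace, lowercase, and drop empty entries."""
    return frozenset(k.strip().lower() for k in completion.split(",") if k.strip())


# Step 1: create the prompt from the input text.
prompt = PROMPT_TEMPLATE.format(text="I really like pizza and sushi.")
# Step 2: hypothetical raw model output that keeps going past the keyword line.
raw_completion = " Pizza, Sushi\n###\nText: Something unrelated"
completion = cut_at_stop_sequences(raw_completion, STOP_SEQUENCES)
# Step 3: extract the keywords from the completion.
keywords = parse_keywords(completion)
print(keywords)
```

Without the stop sequences, a few-shot prompt like ours would likely invite the model to continue with further invented `Text:`/`Keywords:` pairs, which would pollute the parsed keywords.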

When a task is executed, we offer the possibility to log all intermediate steps and outputs.
This is crucial because large language models are inherently probabilistic.
Therefore, we might get unexpected answers.
This logging allows us to check the results afterwards and find out what went wrong.

For this, we shall inject an `InMemoryTracer` into the task. 
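
To build some intuition for what a tracer records, here is a toy sketch of nested logging. `ToySpan` and `toy_task` are hypothetical names invented for this illustration; this is not how `InMemoryTracer` or `TaskSpan` are implemented, only the general idea of logging inputs, intermediate steps, and outputs in a tree.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToySpan:
    """One traced step: a name, its logged entries, and nested child spans."""

    name: str
    entries: list[tuple[str, Any]] = field(default_factory=list)
    children: list["ToySpan"] = field(default_factory=list)

    def log(self, message: str, value: Any) -> None:
        self.entries.append((message, value))

    def span(self, name: str) -> "ToySpan":
        child = ToySpan(name)
        self.children.append(child)
        return child


def toy_task(text: str, tracer: ToySpan) -> list[str]:
    task_span = tracer.span("toy_task")
    task_span.log("Input", text)
    # Stand-in for the model call, traced as a nested step.
    model_span = task_span.span("fake_completion")
    completion = " Dolphins, Sharks"
    model_span.log("Completion", completion)
    output = [k.strip().lower() for k in completion.split(",")]
    task_span.log("Output", output)
    return output


root = ToySpan("root")
toy_task("This is a text about dolphins and sharks.", root)
print(root)
```

Because every intermediate value ends up in the tree, an unexpected final output can be traced back to the step that produced it.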

In [None]:
from aleph_alpha_client import Prompt
from intelligence_layer.core import Task, TaskSpan, CompleteInput
from intelligence_layer.core import AlephAlphaModel


class KeywordExtractionTask(Task[KeywordExtractionInput, KeywordExtractionOutput]):
    PROMPT_TEMPLATE: str = """Identify matching keywords for each text.
###
Text: The "Whiskey War" is an ongoing conflict between Denmark and Canada over ownership of Hans Island. The dispute began in 1973, when Denmark and Canada reached an agreement on Greenland's borders. However, no settlement regarding Hans Island could be reached by the time the treaty was signed. Since then both countries have used peaceful means - such as planting their national flag or burying liquor - to draw attention to the disagreement.
Keywords: Conflict, Whiskey War, Denmark, Canada, Treaty, Flag, Liquor
###
Text: I really like pizza and sushi.
Keywords: Pizza, Sushi
###
Text: NASA launched the Discovery program to explore the solar system. It comprises a series of expeditions that have continued from the program's launch in the 1990s to the present day. In the course of the 16 expeditions launched so far, the Moon, Mars, Mercury and Venus, among others, have been explored. Unlike other space programs, the Discovery program places particular emphasis on cost efficiency, true to the motto: "faster, better, cheaper".
Keywords: Space program, NASA, Expedition, Cost efficiency, Moon, Mars, Mercury, Venus
###
Text: {text}
Keywords:"""
    MODEL: str = "luminous-base"

    def __init__(
        self, model: AlephAlphaModel = AlephAlphaModel(name="luminous-base")
    ) -> None:
        super().__init__()
        self._model = model

    def _create_complete_input(self, text: str) -> CompleteInput:
        prompt = Prompt.from_text(self.PROMPT_TEMPLATE.format(text=text))
        # Stop at the first newline or "###" separator so that the completion
        # contains only the keyword line for our text, not another example.
        model_input = CompleteInput(
            prompt=prompt,
            stop_sequences=["\n", "###"],
            frequency_penalty=0.25,
            model=self._model.name,
        )
        return model_input

    def do_run(
        self, input: KeywordExtractionInput, task_span: TaskSpan
    ) -> KeywordExtractionOutput:
        completion_input = self._create_complete_input(input.text)
        completion = self._model.complete(completion_input, task_span)
        return KeywordExtractionOutput(
            keywords=frozenset(
                k.strip().lower() for k in completion.completion.split(",") if k.strip()
            )
        )

Now, we can run this `KeywordExtractionTask` like so:

In [None]:
from intelligence_layer.core import InMemoryTracer


task = KeywordExtractionTask()
text = "Computer vision describes the processing of an image by a machine using external devices (e.g., a scanner) into a digital description of that image for further processing. An example of this is optical character recognition (OCR), the recognition and processing of images containing text. Further processing and final classification of the image is often done using artificial intelligence methods. The goal of this field is to enable computers to process visual tasks that were previously reserved for humans."

tracer = InMemoryTracer()
output = task.run(KeywordExtractionInput(text=text), tracer)

print(output)

Looks great!

Now that our task is set up, we can start evaluating its performance.

For this, we will have to set up an evaluator. The evaluator requires an `EvaluationLogic` and an `AggregationLogic` object.  
The logic objects are responsible for how single examples are evaluated and how a list of evaluations is aggregated.
How these single examples are put together is the job of the `Evaluator`. This typically does not need to be changed and can just be used.

```python
class EvaluationLogic(ABC, Generic[Input, Output, ExpectedOutput, Evaluation]):
    @abstractmethod
    def do_evaluate(
        self,
        example: Example[Input, ExpectedOutput],
        *output: SuccessfulExampleOutput[Output],
    ) -> Evaluation:
        ...

class AggregationLogic(ABC, Generic[Evaluation, AggregatedEvaluation]):
    @abstractmethod
    def aggregate(self, evaluations: Iterable[Evaluation]) -> AggregatedEvaluation:
        ...
```

Notice that, just like our `Task`, the `EvaluationLogic` takes an `Input`. This input is the same as our task input.
However, we don't just want to run a task; we also want to evaluate the result. 
Therefore, our evaluation logic also depends on some `ExpectedOutput`, as well as `Evaluation`.
We will come back to the `AggregatedEvaluation` of the `AggregationLogic` at a later stage.

Let's build an evaluation that can check the performance of our keyword extraction methodology. For this, we need four things:
- An implementation of the task to be run (we suggest supplying this in the `Evaluator`'s `__init__`)
- An interface for our `ExpectedOutput`
- Some `Evaluation`, i.e., the output of the `do_evaluate` method
- An implementation of the `do_evaluate` function in the form of an `EvaluationLogic`.

In our case, we will measure the performance of our keyword extraction by calculating the proportion of correctly generated keywords compared to all expected keywords. 
This is also known as the "true positive rate". 
To calculate this, our evaluate function will need a set of the expected keywords.
In addition, we will record the missing keywords (false negatives) and the generated keywords that we did not expect (false positives).
This way, we can see how our task performs for a specific example, and we can check for unexpected results.
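
As a small worked example of this metric, consider the sets below. The generated keywords are made up for illustration; only the set arithmetic matters.

```python
expected = frozenset({"dolphins", "sharks"})
# Hypothetical task output: one hit, one miss, and one unexpected keyword.
generated = frozenset({"dolphins", "ocean"})

true_positives = generated & expected  # generated and expected
false_positives = generated - expected  # generated but not expected
false_negatives = expected - generated  # expected but not generated
true_positive_rate = len(true_positives) / len(expected)

print(true_positive_rate, false_positives, false_negatives)
```

Here one of the two expected keywords was found, so the true positive rate is 0.5, while "ocean" shows up as a false positive and "sharks" as a false negative.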


In [None]:
class KeywordExtractionExpectedOutput(BaseModel):
    """The expected output for an example run, against which the task output is compared.

    We will be evaluating our keyword extraction based on the expected keywords."""

    keywords: frozenset[str]


class KeywordExtractionEvaluation(BaseModel):
    """This is the interface for the metrics that are generated for each evaluation case"""

    true_positive_rate: float
    true_positives: frozenset[str]
    false_positives: frozenset[str]
    false_negatives: frozenset[str]

Accordingly, our evaluate function receives the `KeywordExtractionInput` together with the output that the task generated for it.
It then compares the generated output with the `KeywordExtractionExpectedOutput` to create the `KeywordExtractionEvaluation`.

```python
def do_evaluate(
    self,
    input: KeywordExtractionInput,
    output: KeywordExtractionOutput,
    expected_output: KeywordExtractionExpectedOutput,
) -> KeywordExtractionEvaluation:
    true_positives = output.keywords & expected_output.keywords
    false_positives = output.keywords - expected_output.keywords
    false_negatives = expected_output.keywords - output.keywords
    return KeywordExtractionEvaluation(
        true_positive_rate=len(true_positives) / len(expected_output.keywords),
        true_positives=true_positives,
        false_positives=false_positives,
        false_negatives=false_negatives,
    )
```

However, to quantitatively evaluate the performance of a task, we will need to run many different examples and calculate the metrics for each. 
To do this, we can use the `eval_and_aggregate_runs` function provided by the `Evaluator` base class. This takes a dataset, runs all the examples, and aggregates the metrics generated from the evaluation.

To set this up, we will first need to create an interface for the `AggregatedEvaluation` and implement the `aggregate` method.

In [None]:
class KeywordExtractionAggregatedEvaluation(BaseModel):
    """This is the interface for the aggregated metrics that are generated from running a number of examples"""

    average_true_positive_rate: float

Now that we have all parts in place, let's run our task which will produce the results for evaluation.

In [None]:
from intelligence_layer.core import NoOpTracer
from intelligence_layer.evaluation import (
    InMemoryDatasetRepository,
    InMemoryRunRepository,
    Runner,
    Example,
)
from statistics import mean
from typing import Iterable

dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()

runner = Runner(task, dataset_repository, run_repository, "keyword-extraction")
model_input = KeywordExtractionInput(text="This is a text about dolphins and sharks.")
expected_output = KeywordExtractionExpectedOutput(keywords=["dolphins", "sharks"])

single_example_dataset = dataset_repository.create_dataset(
    examples=[Example(input=model_input, expected_output=expected_output)],
    dataset_name="quickstart-task-single-example-dataset",
).id

run_overview = runner.run_dataset(single_example_dataset, NoOpTracer())

Now, let's build an evaluator.
For this, we need to implement a method doing the actual evaluation in an `EvaluationLogic` class.

In [None]:
from intelligence_layer.evaluation import (
    Evaluator,
    InMemoryEvaluationRepository,
    Example,
)
from intelligence_layer.evaluation.base_logic import SingleOutputEvaluationLogic


class KeywordExtractionEvaluationLogic(
    SingleOutputEvaluationLogic[
        KeywordExtractionInput,
        KeywordExtractionOutput,
        KeywordExtractionExpectedOutput,
        KeywordExtractionEvaluation,
    ]
):
    def do_evaluate_single_output(
        self,
        example: Example[KeywordExtractionInput, KeywordExtractionExpectedOutput],
        output: KeywordExtractionOutput,
    ) -> KeywordExtractionEvaluation:
        # Compare the generated keywords with the expected ones from the example.
        expected_keywords = example.expected_output.keywords
        true_positives = output.keywords & expected_keywords
        false_positives = output.keywords - expected_keywords
        false_negatives = expected_keywords - output.keywords
        return KeywordExtractionEvaluation(
            # Proportion of expected keywords that were actually generated.
            true_positive_rate=len(true_positives) / len(expected_keywords),
            true_positives=true_positives,
            false_positives=false_positives,
            false_negatives=false_negatives,
        )

And now, we can create an evaluator and run it on our data.

In [None]:
evaluation_repository = InMemoryEvaluationRepository()
evaluation_logic = KeywordExtractionEvaluationLogic()
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "keyword-extraction",
    evaluation_logic,
)

evaluation_overview = evaluator.evaluate_runs(run_overview.id)

To aggregate the evaluation results, we have to implement a method doing this in an `AggregationLogic` class.

In [None]:
from intelligence_layer.evaluation import (
    InMemoryAggregationRepository,
    Example,
    Aggregator,
)
from intelligence_layer.evaluation.base_logic import AggregationLogic


class KeywordExtractionAggregationLogic(
    AggregationLogic[
        KeywordExtractionEvaluation,
        KeywordExtractionAggregatedEvaluation,
    ]
):
    def aggregate(
        self, evaluations: Iterable[KeywordExtractionEvaluation]
    ) -> KeywordExtractionAggregatedEvaluation:
        eval_list = list(evaluations)
        true_positive_rate = (
            mean(e.true_positive_rate for e in eval_list) if eval_list else 0
        )
        return KeywordExtractionAggregatedEvaluation(
            average_true_positive_rate=true_positive_rate
        )
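
The core of this aggregation is a guarded mean, which can be tried out in isolation. In this sketch, `average_rate` is a hypothetical stand-alone function that mirrors the logic of `aggregate` using plain floats instead of evaluation objects.

```python
from statistics import mean
from typing import Iterable


def average_rate(rates: Iterable[float]) -> float:
    # Materialize first: the iterable may be a generator that can be consumed
    # only once, and mean() raises StatisticsError on empty input.
    rate_list = list(rates)
    return mean(rate_list) if rate_list else 0.0


print(average_rate([1.0, 0.5, 0.0]))
print(average_rate(iter([])))
```

Returning 0.0 for an empty input is a design choice: it keeps the aggregation from failing when no example was evaluated successfully, at the cost of making "no data" look like "no hits".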

Let's now create an aggregator and generate evaluation statistics from the previously generated evaluation results.

In [None]:
aggregation_repository = InMemoryAggregationRepository()
aggregation_logic = KeywordExtractionAggregationLogic()
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "keyword-extraction",
    aggregation_logic,
)

aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)

print("Statistics: ", aggregation_overview.statistics)

Now that we have implemented all required methods, let's run a dataset with some more examples.

In [None]:
from pprint import pprint

dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(input=model_input, expected_output=expected_output),
        Example(
            input=KeywordExtractionInput(
                text="Clinical psychology is an integration of human science, behavioral science, theory, and clinical knowledge for the purpose of understanding, preventing, and relieving psychologically-based distress or dysfunction and to promote subjective well-being and personal development."
            ),
            expected_output=KeywordExtractionExpectedOutput(
                keywords={"clinical psychology", "well-being", "personal development"}
            ),
        ),
        Example(
            input=KeywordExtractionInput(
                text="Prospect theory is a theory of behavioral economics, judgment and decision making that was developed by Daniel Kahneman and Amos Tversky in 1979.[1] The theory was cited in the decision to award Kahneman the 2002 Nobel Memorial Prize in Economics.[2]Based on results from controlled studies, it describes how individuals assess their loss and gain perspectives in an asymmetric manner (see loss aversion)."
            ),
            expected_output=KeywordExtractionExpectedOutput(
                keywords={
                    "prospect theory",
                    "behavioural economics",
                    "decision making",
                    "losses and gains",
                }
            ),
        ),
    ],
    dataset_name="human-evaluation-multiple-examples-dataset",
).id

run = runner.run_dataset(dataset_id)
evaluation_overview = evaluator.evaluate_runs(run.id)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)

pprint(aggregation_overview)

We have now run our first evaluation on this tiny dataset.
Let's have a more detailed look at the debug log of one example run.

In [None]:
examples = list(
    dataset_repository.examples(
        dataset_id, evaluator.input_type(), evaluator.expected_output_type()
    )
)
last_example_result = run_repository.example_trace(
    next(iter(aggregation_overview.run_overviews())).id, examples[-1].id
)
last_example_result.trace

Let's inspect this debug log from top to bottom to try and figure out what happened here.

1. **Input**: This corresponds to the `Input` we supplied for our task. In this case, it's just the text of the provided example.

2. **Completion request**: The request sent to the Aleph Alpha API. Here you can see the formatted prompt.

3. **The output of the `CompletionTask`**: This is the original completion created by the API.

4. **The output of our `KeywordExtractionTask`**: The output of our task. Here this is just a set of stripped, lowercase keywords.

5. **Metrics**: Several metrics generated by our `KeywordExtractionEvaluationLogic`.

Let's have a look at the evaluation results.
Here, we can see that the model returned "behavi*o*ral economics" as a keyword.
However, in the `false_negatives`, we can see that we did indeed expect this phrase, but with a different spelling: "behavi*ou*ral economics".
Thus, the debug log helped us easily identify this misalignment between our dataset and the model's generation.

In [None]:
last_example_result = evaluation_repository.example_evaluation(
    next(iter(aggregation_overview.evaluation_overviews)).id,
    examples[-1].id,
    KeywordExtractionEvaluation,
)
last_example_result.result

As you can see, we predicted "behavioral economics" but expected "behavioural economics"...
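
Near-misses like this one can also be surfaced automatically. The sketch below is an optional idea rather than part of the evaluation framework: `near_misses` is a hypothetical helper built on the standard library's `difflib`, and apart from the two spellings discussed above the sample keywords are invented.

```python
from difflib import get_close_matches

# "daniel kahneman" is an invented false positive for illustration.
false_negatives = frozenset({"behavioural economics", "losses and gains"})
false_positives = frozenset({"behavioral economics", "daniel kahneman"})


def near_misses(
    false_negatives: frozenset[str],
    false_positives: frozenset[str],
    cutoff: float = 0.85,
) -> dict[str, str]:
    """Pair each missed keyword with a very similar unexpected keyword, if any."""
    pairs: dict[str, str] = {}
    candidates = sorted(false_positives)
    for expected in sorted(false_negatives):
        matches = get_close_matches(expected, candidates, n=1, cutoff=cutoff)
        if matches:
            pairs[expected] = matches[0]
    return pairs


print(near_misses(false_negatives, false_positives))
```

A pair reported here suggests a spelling or formatting mismatch between dataset and model output rather than a genuinely missed keyword.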

**What does this tell us?**

Why did the British "ou" and the American "o" go to therapy?

They had behavioural differences!