# Setting up your own custom task

If none of the existing tasks fit your use case, this guide shows how to set up your own task from scratch.
As an example, we will build a simple keyword extraction task.

Keyword extraction is what the name suggests: extracting keywords from a piece of text.
An example use case could be blabla.
A full implementation can be found in blabla.

Let's start with the task interface.
The full Task interface can be found in [task.py](task.py).
However, only a few parts of it are relevant when implementing a task.
For now, the only part of the interface we need to worry about is the following:

```python
Input = TypeVar("Input", bound=PydanticSerializable)
Output = TypeVar("Output", bound=PydanticSerializable)

class Task(ABC, Generic[Input, Output]):
    @abstractmethod
    def run(self, input: Input, logger: DebugLogger) -> Output:
        """Executes the process for this use-case."""
        ...
```

To create our own task, we have to define our input, our output and how the task is run.
Since tasks can vary so much, no assumptions are made about the implementation of the task.
The only requirement is that the input and output are `PydanticSerializable`.
This lets us easily save our evaluation datasets.
For our keyword extraction task, the input and output are the following:

In [None]:
from typing import Sequence
from pydantic import BaseModel

class KeywordExtractionInput(BaseModel):
    """This is the text we will extract keywords from"""
    text: str

class KeywordExtractionOutput(BaseModel):
    keywords: set[str]

Now that we have our input and output defined, we can make the task.
The steps that the task consists of are:
- Create a prompt
- Send the prompt to the model
- Extract the keywords from the model's response

In [None]:
from aleph_alpha_client import Client, CompletionRequest, Prompt

from intelligence_layer.task import DebugLogger, Task


class KeywordExtractionTask(Task[KeywordExtractionInput, KeywordExtractionOutput]):
    PROMPT_TEMPLATE: str = """Identify matching keywords for each text.
###
Text: The "Whiskey War" is an ongoing conflict between Denmark and Canada over ownership of Hans Island. The dispute began in 1973, when Denmark and Canada reached an agreement on Greenland's borders. However, no settlement regarding Hans Island could be reached by the time the treaty was signed. Since then both countries have used peaceful means - such as planting their national flag or burying liquor - to draw attention to the disagreement.
Keywords: Conflict, Whiskey War, Denmark, Canada, Treaty, Flag, Liquor
###
Text: NASA launched the Discovery program to explore the solar system. It comprises a series of expeditions that have continued from the program's launch in the 1990s to the present day. In the course of the 16 expeditions launched so far, the Moon, Mars, Mercury and Venus, among others, have been explored. Unlike other space programs, the Discovery program places particular emphasis on cost efficiency, true to the motto: "faster, better, cheaper".
Keywords: Space program, NASA, Expedition, Cost efficiency, Moon, Mars, Mercury, Venus
###
Text: {text}
Keywords:"""
    MODEL: str = "luminous-base"
    client: Client

    def __init__(self, client: Client) -> None:
        super().__init__()
        self.client = client

    def run(self, input: KeywordExtractionInput, logger: DebugLogger) -> KeywordExtractionOutput:
        prompt = self._format_prompt(text=input.text, logger=logger)
        completion = self._complete(
            prompt, logger.child_logger("Generate Keywords")
        )
        return KeywordExtractionOutput(keywords=set(k.strip().lower() for k in completion.split(",") if k.strip()))

    def _format_prompt(self, text: str, logger: DebugLogger) -> Prompt:
        logger.log(
            "Prompt template/text", {"template": self.PROMPT_TEMPLATE, "text": text}
        )
        return Prompt.from_text(self.PROMPT_TEMPLATE.format(text=text))
    
    def _complete(self, prompt: Prompt, logger: DebugLogger) -> str:
        request = CompletionRequest(
            prompt=prompt,
            stop_sequences=["\n", "###"],
            frequency_penalty=0.25
        )
        response = self.client.complete(
            request=request,
            model=self.MODEL,
        )
        logger.log(
            "Original request & response", {"request": request.to_json(), "response": response.to_json()}
        )
        return response.completions[0].completion # grabs the string completion generated by the model
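
The prompt formatting and keyword parsing steps do not depend on the model, so they can be tried offline. The following is a minimal sketch, assuming a shortened stand-in template (`SHORT_TEMPLATE`) and a hard-coded completion string in place of a real model response; `parse_keywords` is a hypothetical helper that mirrors the set comprehension in `run`.

```python
# Stand-in for PROMPT_TEMPLATE, shortened for illustration (assumption).
SHORT_TEMPLATE = """Identify matching keywords for each text.
###
Text: {text}
Keywords:"""


def parse_keywords(completion: str) -> set[str]:
    """Split a comma-separated completion into normalized keywords."""
    # Strip whitespace, lowercase, and drop empty fragments (e.g. after a trailing comma).
    return {k.strip().lower() for k in completion.split(",") if k.strip()}


prompt_text = SHORT_TEMPLATE.format(text="A text about dolphins and sharks")

# Hard-coded example of what a model might return; not a real response.
fake_completion = " Dolphins, Sharks, dolphins, "
keywords = parse_keywords(fake_completion)
print(sorted(keywords))  # ['dolphins', 'sharks']
```

Because the keywords are lowercased, any expected keywords they are later compared against should be lowercase as well. Also note that `str.format` interprets `{` and `}` in the template itself, so literal braces in a template would need to be doubled.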

We can now run the task like so:

In [None]:
from os import getenv

from intelligence_layer.task import JsonDebugLogger


client = Client(getenv("AA_TOKEN"))
task = KeywordExtractionTask(client)
text = """Computer vision describes the processing of an image by a machine using external devices (e.g., a scanner) into a digital description of that image for further processing. An example of this is optical character recognition (OCR), the recognition and processing of images containing text. Further processing and final classification of the image is often done using artificial intelligence methods. The goal of this field is to enable computers to process visual tasks that were previously reserved for humans."""

input = KeywordExtractionInput(text=text)
logger = JsonDebugLogger(name="keyword extraction")
output = task.run(input, logger)

print(output)
logger


Looks great!
Now that our task is set up, we can start evaluating its performance.

To do so, we have to set up an evaluator.
The full interface for an evaluator can be found in [task.py](task.py).
We will go over it step by step, so for now all we have to worry about is this part of the interface:

```python
class Evaluator(ABC, Generic[Input, ExpectedOutput, Evaluation, AggregatedEvaluation]):
    @abstractmethod
    def evaluate(
        self,
        input: Input,
        logger: DebugLogger,
        expected_output: ExpectedOutput,
    ) -> Evaluation:
        """Executes the evaluation for this use-case."""
        pass
```

First of all, let's create our `KeywordExtractionEvaluator`.
The first type parameter of the evaluator is the same as the input of the task, so we can plug this one right in.

```python
class KeywordExtractionEvaluator(Evaluator[KeywordExtractionInput, ExpectedOutput, Evaluation, AggregatedEvaluation]):
    def evaluate(
        self,
        input: Input,
        logger: DebugLogger,
        expected_output: ExpectedOutput,
    ) -> Evaluation:
        """Executes the evaluation for this use-case."""
        pass
```

Now that we have our evaluator, we can start evaluating actual examples.
To evaluate a case, we need to define our `ExpectedOutput` and `Evaluation` types and implement the `evaluate` function.
In our case, we are interested in the proportion of correctly generated keywords compared to all expected keywords.
This is also known as the true positive rate (or recall).
To calculate it, the `evaluate` function needs the set of expected keywords.
These are defined in the `KeywordExtractionExpectedOutput` class.

In [None]:
class KeywordExtractionExpectedOutput(BaseModel):
    """The expected output for an example run, against which the output of the task is compared.

    We will be evaluating our keyword extraction based on the expected keywords."""
    keywords: set[str]

class KeywordExtractionEvaluation(BaseModel):
    """This is the interface for the metrics that are generated for each evaluation case"""
    true_positive_rate: float 
    true_positives: set[str]
    false_positives: set[str]
    false_negatives: set[str]

Our `evaluate` function takes an input for the task, runs the task on it and calculates the true positive rate.
Finally, it returns an instance of the `KeywordExtractionEvaluation` class with the rate and the correct and incorrect keywords.

```python
def evaluate(
        self,
        input: KeywordExtractionInput,
        logger: DebugLogger,
        expected_output: KeywordExtractionExpectedOutput,
    ) -> KeywordExtractionEvaluation:
        output = self.task.run(input, logger)
        # Keywords that were both generated and expected.
        true_positives = output.keywords & expected_output.keywords
        # Keywords that were generated but not expected.
        false_positives = output.keywords - expected_output.keywords
        # Keywords that were expected but not generated.
        false_negatives = expected_output.keywords - output.keywords
        return KeywordExtractionEvaluation(true_positive_rate=len(true_positives) / len(expected_output.keywords),
                                           true_positives=true_positives,
                                           false_positives=false_positives,
                                           false_negatives=false_negatives)
```
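
To make the set arithmetic concrete, here is a small worked example with made-up keyword sets (both sets are assumptions for illustration, not model output). It also shows one possible way to guard against an empty expected set, which would otherwise raise a `ZeroDivisionError`; returning `0.0` in that case is a choice made for this sketch only.

```python
# Made-up keyword sets for illustration.
expected = {"dolphins", "sharks", "ocean", "mammals"}
predicted = {"dolphins", "sharks", "fish"}

true_positives = predicted & expected   # generated and expected
false_positives = predicted - expected  # generated but not expected
false_negatives = expected - predicted  # expected but not generated

# Guard against division by zero when no keywords are expected (sketch-only choice).
true_positive_rate = len(true_positives) / len(expected) if expected else 0.0

print(sorted(true_positives))   # ['dolphins', 'sharks']
print(sorted(false_positives))  # ['fish']
print(sorted(false_negatives))  # ['mammals', 'ocean']
print(true_positive_rate)       # 0.5
```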

However, to evaluate the performance of a task, we need to try out many different examples.
To do this we can use the `evaluate_dataset` function provided by the `Evaluator` base class.
It takes a dataset, runs all the examples in it and aggregates the metrics generated by the evaluation.
To set this up, we need to create a dataset, define an interface for the aggregated metrics and implement the `aggregate` method.

In [None]:
class KeywordExtractionAggregatedEvaluation(BaseModel):
    """This is the interface for the aggregated metrics that are generated from running a number of examples"""
    average_true_positive_rate: float

The aggregate method takes as input a sequence of KeywordExtractionEvaluations that are generated by the `evaluate_dataset` method.
It is responsible for aggregating the metrics generated from running the dataset.

```python
def aggregate(self, evaluations: Sequence[KeywordExtractionEvaluation]) -> KeywordExtractionAggregatedEvaluation:
        """`Evaluator`-specific method for aggregating individual `Evaluations` into report-like `Aggregated Evaluation`."""
        pass
```
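
As a rough sketch of what the aggregation could compute, the snippet below averages per-example true positive rates with `statistics.mean`. It uses plain floats as stand-ins for `KeywordExtractionEvaluation` objects, and `average_rate` is a hypothetical helper name. Note that `mean` raises `StatisticsError` on an empty sequence, so the sketch falls back to `0.0` there; whether that fallback is appropriate depends on your use case.

```python
from statistics import mean
from typing import Sequence


def average_rate(rates: Sequence[float]) -> float:
    """Average per-example rates; return 0.0 for an empty sequence (sketch-only choice)."""
    return mean(rates) if rates else 0.0


rates = [1.0, 0.5, 0.0]
print(average_rate(rates))  # 0.5
print(average_rate([]))     # 0.0
```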

Now that we have discussed all of the parts that make up an evaluator, the full class is:

In [None]:
from statistics import mean
from intelligence_layer.task import Evaluator

class KeywordExtractionEvaluator(Evaluator[KeywordExtractionInput, KeywordExtractionExpectedOutput, KeywordExtractionEvaluation, KeywordExtractionAggregatedEvaluation]):
    def __init__(self, task: KeywordExtractionTask) -> None:
        """We recommend passing the task to the `__init__` method of the evaluator.

        This makes it easy to compare different implementations of the same task."""
        self.task = task


    def evaluate(
        self,
        input: KeywordExtractionInput,
        logger: DebugLogger,
        expected_output: KeywordExtractionExpectedOutput,
    ) -> KeywordExtractionEvaluation:
        output = self.task.run(input, logger)
        true_positives = output.keywords & expected_output.keywords
        false_positives = output.keywords - expected_output.keywords 
        false_negatives = expected_output.keywords - output.keywords
        return KeywordExtractionEvaluation(true_positive_rate=len(true_positives) / len(expected_output.keywords), 
                                           true_positives=true_positives, 
                                           false_positives=false_positives, 
                                           false_negatives=false_negatives)
         

    def aggregate(self, evaluations: Sequence[KeywordExtractionEvaluation]) -> KeywordExtractionAggregatedEvaluation:
        true_positive_rate = mean(e.true_positive_rate for e in evaluations)
        return KeywordExtractionAggregatedEvaluation(average_true_positive_rate=true_positive_rate)

Let's run this.

In [None]:
evaluator = KeywordExtractionEvaluator(task=task)

logger = JsonDebugLogger(name="Evaluation logger")
input = KeywordExtractionInput(text="A text about dolphins and sharks")
expected_output = KeywordExtractionExpectedOutput(keywords={"dolphins", "sharks"})
evaluation = evaluator.evaluate(input, 
                                logger, 
                                expected_output)
print(evaluation)

Now that we have implemented our aggregate method, let's run a dataset with some example data.

In [None]:
from intelligence_layer.task import Dataset, Example

dataset = Dataset(
    name="Keyword extraction dataset",
    examples=[
        Example(
            input=input,
            expected_output=expected_output
        ), 
        Example(
            input=KeywordExtractionInput(
                text="Clinical psychology is an integration of human science, behavioral science, theory, and clinical knowledge for the purpose of understanding, preventing, and relieving psychologically-based distress or dysfunction and to promote subjective well-being and personal development."
            ),
            expected_output=KeywordExtractionExpectedOutput(
                keywords={"clinical psychology", "well-being", "personal development"}
            )
        ),
        Example(
            input=KeywordExtractionInput(
                text="Prospect theory is a theory of behavioral economics, judgment and decision making that was developed by Daniel Kahneman and Amos Tversky in 1979. The theory was cited in the decision to award Kahneman the 2002 Nobel Memorial Prize in Economics. Based on results from controlled studies, it describes how individuals assess their loss and gain perspectives in an asymmetric manner (see loss aversion)."
            ),
            expected_output=KeywordExtractionExpectedOutput(
                keywords={"prospect theory", "behavioral economics", "decision making", "losses and gains"}
            )
        )
    ]
)
logger = JsonDebugLogger(name="Evaluate dataset debug logger")

aggregated_evaluations = evaluator.evaluate_dataset(dataset, logger)
print(aggregated_evaluations)
logger
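
Conceptually, evaluating a dataset means evaluating every example and then aggregating the results. The sketch below illustrates that flow with stand-in types and a stub keyword extractor, so it runs without an API token. It is not the library's implementation of `evaluate_dataset`; `ExampleSketch`, `stub_extract` and `evaluate_dataset_sketch` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class ExampleSketch:
    """Stand-in for an example: an input text plus its expected keywords."""
    text: str
    expected_keywords: frozenset[str]


def stub_extract(text: str) -> set[str]:
    """Stub 'task': treats every lowercased word longer than five characters as a keyword."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if len(w) > 5}


def evaluate_dataset_sketch(
    extract: Callable[[str], set[str]], examples: Sequence[ExampleSketch]
) -> float:
    """Evaluate each example, then aggregate the per-example rates into their mean."""
    rates = [
        len(extract(e.text) & e.expected_keywords) / len(e.expected_keywords)
        for e in examples
    ]
    return mean(rates)


examples = [
    ExampleSketch("A text about dolphins and sharks.", frozenset({"dolphins", "sharks"})),
    ExampleSketch("Prospect theory describes decisions.", frozenset({"prospect theory", "decisions"})),
]
average = evaluate_dataset_sketch(stub_extract, examples)
print(average)  # 0.75
```

The stub finds both keywords in the first example (rate 1.0) but only "decisions" in the second (rate 0.5), since it cannot produce multi-word keywords such as "prospect theory".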