# Classify

Classification is a methodology that tries to match a text to the correct label.

### Prompt-based classification

Prompt-based classification is a methodology that relies purely on prompting the LLM in a specific way.

### When should you use prompt-based classification?

Consider using this methodology when:
- The labels are easily understood (they don't require explanation or examples), for example sentiment analysis
- The labels are not recognized by their semantic meaning, e.g. "reasoning" tasks like classifying contradictions
- You don't have many examples

### Example snippet

Running the following code will instantiate a prompt-based classifier with a debug level for the log.
Then it will classify the text given in `ClassifyInput`.
The contents of the `debug_log` will be shown below.
It gives an overview of the steps taken to get the result.


In [None]:
from os import getenv

from aleph_alpha_client import Client

from intelligence_layer.single_label_classify import ClassifyInput, SingleLabelClassify
from intelligence_layer.task import Chunk, InMemoryDebugLogger

text_to_classify = Chunk("In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \n\
With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. \n\
As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.")
labels = ["happy", "angry", "sad"]
client = Client(getenv("AA_TOKEN"))
task = SingleLabelClassify(client)
input = ClassifyInput(
    chunk=text_to_classify,
    labels=labels
)

debug_log = InMemoryDebugLogger(name="classify")
output = task.run(input, debug_log)
for label, score in output.scores.items():
    print(f"{label}: {round(score, 4)}")
# debug_log


### How does this implementation work?

For prompt-based classification, we prompt the model once per class, each time with the text we want to classify and one of our classes.
Instead of letting the model generate the class it thinks fits the text best, we ask it for the probability for each class.

To further explain this, let's start with a more familiar case.
Intuitively, one would probably prompt a model like so:

In [None]:
from aleph_alpha_client import PromptTemplate

prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)
print(prompt_template.to_prompt(text=text_to_classify, label="").items[0].text)


The model would then answer our question and generate a class or label that it thinks fits the text best.

In the case of classification, however, we already know all possible classes beforehand.
Because of this, all we are interested in is the probability that the model would have generated our specific classes.
To get this probability, we can prompt the model with each of our classes and ask it to return the "logprobs" for the text.

In the case of prompt-based classification, the base prompt looks something like this:

In [None]:
prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)
print(prompt_template.to_prompt(text=text_to_classify, label=" " + labels[0]).items[0].text)


As you can see, we have the same prompt, but with a potential label candidate already filled in.

Now, we will ask the model to evaluate the likelihood of this completion.

Our request will now not generate any tokens, but instead return the log probability of this completion given the previous tokens.
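To illustrate what such log probabilities represent, here is a minimal sketch with made-up numbers. It assumes we receive one log probability per token of the candidate label; the helper `completion_probability` and the values are hypothetical and not part of `intelligence_layer` or the client.

```python
import math


def completion_probability(token_logprobs: list[float]) -> float:
    """Turn per-token log probabilities into the probability of the whole completion."""
    # Log probabilities of consecutive tokens add up;
    # exponentiating the sum gives the probability of the full sequence.
    return math.exp(sum(token_logprobs))


# Hypothetical logprobs for a label that consists of two tokens.
example_logprobs = [math.log(0.5), math.log(0.25)]
print(completion_probability(example_logprobs))  # approximately 0.125
```

Working with sums of logprobs rather than products of probabilities is a common way to keep the numbers stable for longer labels.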

Now that we have the logprobs, we need to turn them into final scores.

First, we normalize the probabilities.
For this, we utilize a probability tree.
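The idea of such a tree can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's `TreeNode` implementation: the labels, the probabilities, the whitespace "tokenization", and the helper `path_score` are all made up. Each node's children are normalized so that they sum to 1, and a label's score is the product of the normalized values along its path.

```python
from collections import defaultdict

# Made-up conditional probabilities per token; not real model output.
# Splitting labels on whitespace stands in for real tokenization.
label_token_probs = {
    "New York": [("New", 0.6), ("York", 0.3)],
    "New Jersey": [("New", 0.6), ("Jersey", 0.1)],
    "Boston": [("Boston", 0.2)],
}

# Build the tree: map every prefix to its child tokens, storing shared prefixes once.
children = defaultdict(dict)
for tokens in label_token_probs.values():
    prefix = ()
    for token, prob in tokens:
        children[prefix][token] = prob
        prefix += (token,)


def path_score(tokens):
    """Multiply the sibling-normalized probabilities along one label's path."""
    prefix = ()
    score = 1.0
    for token, prob in tokens:
        # Normalize so that the children of each node sum to 1.
        score *= prob / sum(children[prefix].values())
        prefix += (token,)
    return score


scores = {label: path_score(tokens) for label, tokens in label_token_probs.items()}
for label, score in scores.items():
    print(f"{label}: {round(score, 4)}")
```

With this kind of normalization, the scores of all labels add up to 1, which makes them easier to compare than the raw probabilities.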

In [None]:
from intelligence_layer.single_label_classify import TreeNode
from intelligence_layer.task import LogEntry

task_log = debug_log.logs[-1]
normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == "Normalized Probs"]
log = normalized_probs_logs[-1]

root = TreeNode()
for probs in log.values():
    root.insert_without_calculation(probs)


Finally, we multiply the probabilities along each label's path through the tree to get the following results:

In [None]:
for label, score in output.scores.items():
    print(f"{label}: {round(score, 5)}")


The example above is rather straightforward, but some situations are less obvious than labels that consist of a single token.

What if some of our classes share tokens?
In the following example, the classes overlap in some of their tokens, which makes the calculation a bit more complicated:

In [None]:
from intelligence_layer.single_label_classify import SingleLabelClassify, ClassifyInput
from intelligence_layer.task import LogEntry


labels = ["Space party", "Space exploration", "Space exploration party"]
task = SingleLabelClassify(client)
input = ClassifyInput(
    chunk=text_to_classify,
    labels=labels
)
logger = InMemoryDebugLogger(name="classify")
output = task.run(input, logger)
task_log = logger.logs[-1]
normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == "Normalized Probs"]
log = normalized_probs_logs.pop()

root = TreeNode()
for probs in log.values():
    root.insert_without_calculation(probs)

print("End scores:")
for label, score in output.scores.items():
    print(f"{label}: {round(score, 4)}")


Here, the three classes share some tokens, namely "Space" and "exploration".
"party" does not count as overlapping, because it occurs at two different positions (once after "Space" and once after "exploration").

Cool!
Now, let's evaluate how well our new methodology is working.
For this, we will first look for classification datasets to use.
We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on Hugging Face. Let's see if we can get an evaluation going!

In [None]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
data = list(dataset[test_set_name])[:10]  # this split has 573 datapoints, let's take a look at 10 for now


Next, we need to instantiate an evaluator that takes our classify methodology (`task`) and some datapoints and returns some evaluation metrics.
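To give a rough idea of what such an evaluation could compute, here is a minimal sketch. It rests on an assumption that the source does not confirm: a prediction counts as correct when the highest-scoring label is among the expected labels. The helper `is_correct` and the scores are hypothetical and not the actual `SingleLabelClassifyEvaluator` logic.

```python
def is_correct(scores: dict[str, float], expected: list[str]) -> bool:
    """Assumed rule: the top-scoring label must be one of the expected labels."""
    predicted = max(scores, key=scores.get)
    return predicted in expected


# Made-up classification results paired with their expected labels.
runs = [
    ({"positive": 0.9, "negative": 0.1}, ["positive"]),
    ({"positive": 0.3, "negative": 0.7}, ["positive"]),
]

results = [is_correct(scores, expected) for scores, expected in runs]
# Share of examples whose top label was expected.
percentage_correct = sum(results) / len(results)
print(percentage_correct)
```

Here the first made-up example counts as correct and the second does not, so half of the examples are classified correctly.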

First, let's evaluate a single example and see what happens.

In [None]:
from intelligence_layer.single_label_classify import SingleLabelClassifyEvaluator

evaluator = SingleLabelClassifyEvaluator(task)
classify_input = ClassifyInput(
    chunk=Chunk("This is good"),
    labels=frozenset({"positive", "negative"}),
)
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
expected_output = "positive"
evaluation = evaluator.evaluate(
    input=classify_input, logger=evaluation_logger, expected_output=[expected_output]
)

print("The task result:", evaluation.output.scores)
print("The expected output:", expected_output)
print("The eval result:", evaluation.correct)


We need to transform our dataset into the required format. 
Therefore, let's check out what it looks like.

In [None]:
data[1]


Accordingly, this must be translated into the interface of our `Evaluator`.

In [None]:
from intelligence_layer.task import Example, Dataset


all_labels = list(set(c for d in data for c in d["label_name"]))
dataset = Dataset(
    name="tweet topics",
    examples=[
        Example(
            input=ClassifyInput(
                chunk=Chunk(d["text"]),
                labels=all_labels
            ),
            expected_output=d["label_name"]
        ) for d in data
    ]
)


Ok, let's run this!

In [None]:
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)


Checking out the results...

In [None]:
print("Percentage correct:", result.percentage_correct)
print("First example:", result.evaluations[0])
