# Classify

Classification is a methodology that tries to match a text to the correct label. 

### Prompt-based classification 

Prompt-based classification is a methodology that relies purely on prompting the LLM in a specific way. 

### When should you use prompt-based classification 

Some situations in which you would use this methodology are:
- The labels are easily understood (they don't require explanation or examples)
    
    An example is sentiment analysis
- The labels are not recognized by their semantic meaning
    
    E.g. Reasoning tasks like classifying contradictions
- You don't have many examples

### Example snippet 
Running the following code instantiates a prompt-based classifier together with a debug logger.
It then classifies the text given in `ClassifyInput`.
The contents of the debug log are shown below the scores.
The debug log gives an overview of the steps taken to get the result.

In [3]:
from os import getenv
from aleph_alpha_client import Client
from intelligence_layer.single_label_classify import ClassifyInput, SingleLabelClassify
from intelligence_layer.task import InMemoryDebugLogger

text_to_classify = "In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \n\
    With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. \n\
    As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos."
labels = ["happy", "angry", "sad"]
client = Client(getenv("AA_TOKEN"))
task = SingleLabelClassify(client)
input = ClassifyInput(
    text=text_to_classify,
    labels=labels
)

debug_log = InMemoryDebugLogger(name="classify")
output = task.run(input, debug_log)
for label, score in output.scores.items():
    print(f"{label}: {round(score, 4)}")
debug_log


angry: 0.0065
sad: 0.2923
happy: 0.7012


### How does this implementation work
For prompt-based classification, we prompt the model multiple times, each time with the text we want to classify and one of our classes. 
Instead of letting the model generate the class it thinks fits the text best, we ask it for the probability for each class.

To further explain this, let's start with a more familiar case.
An intuitive way to ask an LLM to label a text could look something like this: 

In [4]:
from aleph_alpha_client import PromptTemplate

prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)
print(prompt_template.to_prompt(text=text_to_classify, label="").items[0].text)

### Instruction:
Identify a class that describes the text adequately.
Reply with only the class label.

### Input:
In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. 
    With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. 
    As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.

### Response:


The model would then answer our question, and give us a class that it thinks fits the text. 

In the case of classification, however, we already know the classes beforehand.
Because of this, all we are interested in is the probability that the model would have generated each of our specific classes.
To get this probability, we can prompt the model with each of our classes and ask the model to return the logprobs for the text. 

In the case of prompt-based classification, the prompt looks something like this:

In [5]:
prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)
print(prompt_template.to_prompt(text=text_to_classify, label=labels[0]).items[0].text)

### Instruction:
Identify a class that describes the text adequately.
Reply with only the class label.

### Input:
In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. 
    With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. 
    As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.

### Response:happy


As you can see, we have the same prompt, but with the class already filled in as a response.

Our request now generates no tokens. Instead, it returns the logprobs of the class tokens, given the preceding tokens.

```python
CompletionRequest(
    prompt=prompt_template.to_prompt(**kwargs),
    maximum_tokens=0,
    log_probs=0,
    tokens=True,
    echo=True,
)
```
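
To make the role of the echoed logprobs more concrete, here is a minimal sketch of how per-token logprobs for a label could be combined into a single probability. The logprob values and the helper name `label_probability` are hypothetical and only serve as an illustration; this is not the library's implementation. It relies on the fact that the probability of a token sequence is the product of the per-token probabilities, which equals the exponential of the summed logprobs.

```python
import math

# Hypothetical logprobs of the label tokens only (not the whole prompt).
# A label that spans several tokens has one logprob per token.
label_token_logprobs = {
    "happy": [-0.35],
    "sad": [-1.23],
    "angry": [-5.04],
    "very happy": [-2.0, -0.5],  # illustrative two-token label
}


def label_probability(token_logprobs):
    """Probability of the full label: exp of the summed token logprobs."""
    return math.exp(sum(token_logprobs))


raw_probs = {
    label: label_probability(logprobs)
    for label, logprobs in label_token_logprobs.items()
}

for label, prob in raw_probs.items():
    print(f"{label}: {prob:.4f}")
```

Note that these raw probabilities do not need to sum to 1, since the model could also have generated tokens that belong to none of our labels.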

For the labels used in the example snippet ("happy", "angry", "sad"), the raw logprobs per label can be retrieved from the debug log with the code snippet below. 

In [6]:
from intelligence_layer.task import LogEntry
result_objects = [log_entry for log_entry in debug_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == "Raw log probs per label"]
result_objects.pop()


Now that we have the logprobs, we need a few calculations to turn them into our final scores. 

First, we normalize the probabilities. This results in the following data structure:

In [7]:
from intelligence_layer.single_label_classify import TreeNode

normalized_probs_logs = [log_entry.value for log_entry in debug_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == "Normalized Probs"]
log = normalized_probs_logs.pop()

root = TreeNode()
for probs in log.values():
    root.insert_without_calculation(probs)
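
As a rough illustration of what normalizing means here, the sketch below rescales a set of competing probabilities so that they sum to 1. The input values and the helper name `normalize` are assumptions made for this example, and the snippet only covers the simple case where every label is a single token.

```python
def normalize(probs):
    """Rescale competing probabilities so that they sum to 1."""
    total = sum(probs.values())
    return {token: prob / total for token, prob in probs.items()}


# Hypothetical raw probabilities of three single-token labels.
raw = {"happy": 0.35, "sad": 0.146, "angry": 0.003}
normalized = normalize(raw)

for label, prob in normalized.items():
    print(f"{label}: {prob:.4f}")
```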

Finally, we multiply the normalized probabilities along each path from the root to a label to get the following results:

In [8]:
for label, score in output.scores.items():
    print(f"{label}: {round(score, 5)}")

angry: 0.00646
sad: 0.29231
happy: 0.70123


The example above is rather straightforward, but in some situations the calculation is not as obvious as comparing single tokens.

What if some of our classes overlap?
In the following example, some of the classes share tokens, which makes the calculation a bit more complicated:

In [9]:
from intelligence_layer.single_label_classify import SingleLabelClassify, ClassifyInput
from intelligence_layer.task import LogEntry

labels = ["Space party", "Space exploration", "Space exploration party"]
task = SingleLabelClassify(client)
input = ClassifyInput(
    text=text_to_classify,
    labels=labels
)
logger = InMemoryDebugLogger(name="classify")
output = task.run(input, logger)
normalized_probs_logs = [log_entry.value for log_entry in logger.logs if isinstance(log_entry, LogEntry) and log_entry.message == "Normalized Probs"]
log = normalized_probs_logs.pop()

root = TreeNode()
for probs in log.values():
    root.insert_without_calculation(probs)

print("End scores:")
for label, score in output.scores.items():
    print(f"{label}: {round(score, 4)}")


End scores:
Space exploration: 0.6788
Space party: 0.0
Space exploration party: 0.3211


Here, the three classes share some tokens. 
The graph above shows how the calculation is done in this case. 

1. At the top, you can see that when there is only one token to choose from, the normalized score will always be 1.

2. After that, the first choice is made between "exploration" and "party". 

3. If the choice of "exploration" is made, finally a choice has to be made between the "endoftext" token and "party".

    The "endoftext" token is a special token that large language models use internally to mark the end of a text.
    As an end user, you generally won't see this token.

    In our case, this translates to choosing between: "Space exploration party" and just "Space exploration".
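
The three steps above can be sketched in plain Python. This is one plausible reading of the calculation, not the library's code: the tokenization of the labels, the raw probabilities, the `END` placeholder and the helper names `sibling_totals` and `score` are all assumptions made for illustration. Tokens that share the same prefix are normalized against each other, and a label's score is the product of the normalized values along its path.

```python
from math import prod

END = "<|endoftext|>"  # placeholder name for the end-of-text token

# Hypothetical (token, raw probability) sequences per label.
label_paths = {
    "Space party": [("Space", 0.2), ("party", 0.0001), (END, 0.9)],
    "Space exploration": [("Space", 0.2), ("exploration", 0.5), (END, 0.6)],
    "Space exploration party": [
        ("Space", 0.2), ("exploration", 0.5), ("party", 0.3), (END, 0.9),
    ],
}


def sibling_totals(paths):
    """Sum the raw probabilities of the distinct tokens that follow each prefix."""
    seen = {}  # (prefix, token) -> raw probability, stored once per branch
    for path in paths.values():
        prefix = ()
        for token, prob in path:
            seen[(prefix, token)] = prob
            prefix += (token,)
    totals = {}
    for (prefix, _token), prob in seen.items():
        totals[prefix] = totals.get(prefix, 0.0) + prob
    return totals


def score(path, totals):
    """Multiply the sibling-normalized probabilities along one label's path."""
    prefix, factors = (), []
    for token, prob in path:
        factors.append(prob / totals[prefix])
        prefix += (token,)
    return prod(factors)


totals = sibling_totals(label_paths)
scores = {label: score(path, totals) for label, path in label_paths.items()}

for label, value in scores.items():
    print(f"{label}: {value:.4f}")
```

Because "Space" is the only option at the root, its normalized value is 1, and the final scores of all labels add up to 1.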


Now, let's evaluate how well our new methodology works. For this, we first need a classification dataset. We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on Hugging Face, so let's see if we can get an evaluation going!

In [10]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
data = list(dataset[test_set_name])[:10]  # this split has 573 datapoints, let's take a look at 10 for now


Next, we need to instantiate an evaluator that takes our classify methodology (`task`) and some datapoints and returns some evaluation metrics.

First, let's evaluate a single example and see what happens.

In [11]:
from intelligence_layer.single_label_classify import SingleLabelClassifyEvaluator

evaluator = SingleLabelClassifyEvaluator(task)
classify_input = ClassifyInput(
    text="This is good",
    labels=frozenset({"positive", "negative"}),
)
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
expected_output = "positive"
evaluation = evaluator.evaluate(
    input=classify_input, logger=evaluation_logger, expected_output=[expected_output]
)

print("The task result:", evaluation.output.scores)
print("The expected output:", expected_output)
print("The eval result:", evaluation.correct)

The task result: {'positive': 0.9989012689270754, 'negative': 0.001098731072924643}
The expected output: positive
The eval result: True
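
One way to read this result is that an example counts as correct when the highest-scoring label is among the expected labels. The sketch below illustrates that rule under this assumption. The helper name `is_correct` is hypothetical, and the actual logic inside `SingleLabelClassifyEvaluator` may differ.

```python
def is_correct(scores, expected_labels):
    """Return True if the top-scoring label is one of the expected labels."""
    predicted = max(scores, key=scores.get)
    return predicted in expected_labels


# Scores taken from the output above.
scores = {"positive": 0.9989012689270754, "negative": 0.001098731072924643}

print(is_correct(scores, ["positive"]))
print(is_correct(scores, ["negative"]))
```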


We need to transform our dataset into the required format. Therefore, let's check out what it looks like.

In [12]:
data[1]

{'text': 'COMING UP!! At the beginning of the pandemic, various models predicted a doomsday scenario for Nigeria & other African countries, yet fewer cases than expected have been recorded. For the latest on Nigeria’s response to #COVID19, join DG… {{URL}} VIA {@NCDC@} ',
 'date': '2020-08-17',
 'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 'label_name': ['news_&_social_concern'],
 'id': '1295296955732471809'}

Accordingly, this must be translated into the interface of our `Evaluator`.

In [13]:
from intelligence_layer.task import Example, Dataset 


all_labels = list(set(c for d in data for c in d["label_name"]))
dataset = Dataset(
    name="tweet topics",
    examples=[
        Example(
            input=ClassifyInput(
                text=d["text"],
                labels=all_labels
            ),
            expected_output=d["label_name"]
        ) for d in data
    ]
)

Ok, let's run this!

In [14]:
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)

Evaluating: 100%|██████████| 10/10 [00:03<00:00,  2.72it/s]


Checking out the results...

In [15]:
print("Percentage correct:", result.percentage_correct)
print("First example", result.evaluations[0])

Percentage correct: 0.7
First example correct=True output=ClassifyOutput(scores={'sports': 0.005584757402570252, 'celebrity_&_pop_culture': 1.4114353710610966e-05, 'science_&_technology': 0.00022389412262055315, 'film_tv_&_video': 0.03891240110363525, 'news_&_social_concern': 0.02119285283173574, 'arts_&_culture': 0.9338875121057757, 'fitness_&_health': 7.100400424303978e-05, 'other_hobbies': 0.00011346407570865306})
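
Despite its name, the reported value looks like a fraction between 0 and 1 rather than a percentage: with 10 examples, 0.7 would correspond to 7 correct classifications. A minimal sketch of such an aggregation is shown below. The helper name `fraction_correct` and the list of flags are assumptions for illustration, not the evaluator's actual code.

```python
def fraction_correct(correct_flags):
    """Share of evaluations that were marked as correct."""
    if not correct_flags:
        raise ValueError("need at least one evaluation")
    return sum(correct_flags) / len(correct_flags)


# Hypothetical per-example outcomes: 7 correct, 3 incorrect.
flags = [True] * 7 + [False] * 3
print("Fraction correct:", fraction_correct(flags))
```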
