# Classification marginalia


<img src="https://raw.githubusercontent.com/Pleias/marginalia/main/notebook/marginalia_logo.jpg" style="float:right;" alt="marginalia logo" width="300"/>


This notebook demonstrates marginalia on a classification task. marginalia is a lightweight application for retrieving corpus annotations with open LLMs such as OpenHermes 2.5 (Mistral). While very flexible, it consistently returns results as structured JSON that can easily be exported to a tabular format.

marginalia is currently only available on GitHub:

In [1]:
!python -m pip install git+https://github.com/Pleias/marginalia.git

Collecting git+https://github.com/Pleias/marginalia.git
  Cloning https://github.com/Pleias/marginalia.git to /tmp/pip-req-build-c81bdryp
  Running command git clone --filter=blob:none --quiet https://github.com/Pleias/marginalia.git /tmp/pip-req-build-c81bdryp
  Resolved https://github.com/Pleias/marginalia.git to commit 1aa86396d6e8ab14cd221252f43e87c7ac46e316
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: marginalia
  Building wheel for marginalia (setup.py) ... done
  Created wheel for marginalia: filename=marginalia-0.1.0-py3-none-any.whl size=4556 sha256=6678ee61daaea496390048c7d6650967ce3436d167bf61b60378829e95e5344d
  Stored in directory: /tmp/pip-ephem-wheel-cache-mp440ulm/wheels/32/db/49/703181bf805653e03da51802e4481682d35cb87276ab0493b3
Successfully built marginalia
Installing collected packages: marginalia
Successfully installed marginalia-0.1.0


## Preparation of the data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

#%cd "mistral"
%cd "/content/drive/Shareddrives/OPSCI/LLMs/mistral"

Mounted at /content/drive
/content/drive/Shareddrives/OPSCI/LLMs/mistral


marginalia works with any list of unstructured texts. It generates ids on the fly, based simply on the index of each text, and returns the unprocessed text as part of the JSON output.

For this demo, we aim to identify whether a user query requires additional sources. This is a common problem for LLM applications with a retrieval system: adding sources is considerably more costly at inference time, and it sometimes does not match the user's intent when the query is about performing a specific task, such as translating a text or solving a logic problem.

The classification is applied to a sample of the OpenHermes 2.5 dataset.

In [3]:
import pandas as pd

list_question = pd.read_json("https://github.com/Pleias/marginalia/raw/main/notebook/open_hermes_instruction_select.json")["instruction"].tolist()

In [4]:
unstructured = []

for question in list_question:
  question = question.replace("\n", " ")
  unstructured.append(question)

Let's have a look at the first questions:

In [5]:
unstructured[0:10]

['Rewrite the following sentence in a more formal tone. Hey, guys! Just wanna let ya know we aced the project and the boss is super happy with it.',
 'Create a C# function that accepts an integer array, sorts the array in ascending order, and returns the sorted array. int[] arr = { 8, 2, 4, 9, 1, 5 };',
 'I heard that Sasha asked Ash to come home. They were tired of arguing.  And I was wondering What will Ash want to do next? Available options: [A]. stay angry. [B]. make up. [C]. not talk about it. Answer:',
 'Problem: Solve 143*y + 30 = 128*y for y. And the answer is...',
 'This is vital to ensuring that people can make the right choices about their diet, and is one of the best ways we can tackle the diet-related diseases which are so prominent across the European Union.  Translate to German',
 'Can you provide a regular expression that matches the format "Firstname_Lastname"?',
 'Instructions: You are given a sentence in Hebrew. Your job is to translate the Hebrew sentence into Portu

marginalia aims to recover a *data scheme*. To create the scheme, you simply initialize a dictionary with fields and their definitions. The goal is to apply this data scheme to your set of unstructured texts wherever it fits.

In this case, the data scheme is very simple, since we are aiming for a binary classification: whether or not references are needed to answer the question.

As in a typical LLM zero-shot approach, we also add an initial field for reasoning and analysis. This "chain-of-thought" method not only yields better results but also makes the choice easier to verify: it is possible to check why a wrong choice was made and change the prompt accordingly.

In [26]:
data_scheme = {"reference_evaluate": "argument whether answering the question is about knowledge and require some references rather than a task like translation, with a few concise sentences",
               "reference_result": "indicate by yes or no if references are needed"}

The core of marginalia's functionality is instruction_set. This is where you pass the unstructured text, the data scheme and the prompt instructions.

In [32]:
from marginalia import instruction_set

instructions = instruction_set(data_scheme = data_scheme,
                               unstructured = unstructured,
                               system_prompt = "You are a powerful evaluator of user inputs",
                               input_prompt = "Assess whether theses questions require some encyclopedic references to back them up. References would be typically needed if the answer mandates external knowledge rather than a task to perform like translating two languages, reformulation or solving a math problem based on the element present in the instruction.",
                               definition_prompt = "Your answer should include the following fields:",
                               structure_prompt = "Return the results as a json structured like this :",
                               data_prompt = "Here is the list of questions :",
                               name_id = "question",
                               size_batch = 5)

As you can see, the prompt has six parts:

* System prompt: defines, in very broad terms, what kind of tool the LLM is.
* Input prompt: the actual task at hand.
* Definition prompt: introduces the list of definitions stored in the data scheme.
* Structure prompt: introduces an empty sample of the JSON structure.
* Data prompt: introduces the list of unstructured text samples.
* Name id: the name used to label each unstructured text sample.

Additionally, you can set the batch size with size_batch. Overall, the longer your text samples are, the smaller your batches should be, so as not to overflow the context window. In this case we have opted for batches of 5 elements, as the instructions can be relatively long.
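
To make the effect of size_batch and name_id more concrete, here is a minimal sketch of how a list of texts could be grouped into batches and labelled by index. It is not marginalia's actual implementation: the helper make_batches and the blank-line separator between items are assumptions made for illustration, and only the "question 0: …" labelling is taken from the printed prompt.

```python
def make_batches(texts, size_batch=5, name_id="question"):
    """Group texts into prompt-ready batches, labelling each text with its index.

    Hypothetical helper: marginalia's internals may differ.
    """
    batches = []
    for start in range(0, len(texts), size_batch):
        chunk = texts[start:start + size_batch]
        # Ids are positions in the full list, not positions inside the batch
        # (an assumption based on "id from the index of the text").
        lines = [f"{name_id} {start + offset}: {text}"
                 for offset, text in enumerate(chunk)]
        batches.append("\n\n".join(lines))
    return batches


sample = [f"text number {i}" for i in range(12)]
batches = make_batches(sample, size_batch=5)
print(len(batches))   # 12 texts in batches of 5 give 3 batches
print(batches[-1])    # the last batch only holds texts 10 and 11
```

With longer texts, lowering size_batch in such a scheme simply produces more, shorter prompts.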

Before launching the actual LLM-powered annotation, it is advisable to take a look at the data and check that everything is fine. You can do so with test_prompt:

In [28]:
instructions.test_prompt()

And then display the first prompt:

In [29]:
print(instructions.prompts[0])

<|im_start|>system
You are a powerful evaluator of user inputs
<|im_end|>
<|im_start|>user
Assess whether theses questions require some encyclopedic references to back them up. References would be typically needed if the answer mandates external knowledge rather than a task to perform like translating two languages, reformulation or solving a math problem based on the element present in the instruction.

Your answer should include the following fields: the question id ("id"), argument whether answering the question is about knowledge and require some references rather than a task like translation, with a few concise sentences ("reference_evaluate"), indicate by yes or no if references are needed ("reference_result")

Return the results as a json structured like this : {"id": "…", "reference_evaluate": "…", "reference_result": "…"}

Here is the list of questions :

question 0: Rewrite the following sentence in a more formal tone. Hey, guys! Just wanna let ya know we aced the project and
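
The definition and structure lines of the prompt above can be derived mechanically from the data scheme. The sketch below reproduces their format with two hypothetical helpers, describe_fields and empty_structure. These names are not part of marginalia, and the library may build the strings differently.

```python
import json


def describe_fields(data_scheme, name_id="question"):
    """Join the field definitions into one fragment, with the id field first."""
    parts = [f'the {name_id} id ("id")']
    parts += [f'{definition} ("{field}")'
              for field, definition in data_scheme.items()]
    return ", ".join(parts)


def empty_structure(data_scheme):
    """Return a JSON template with a placeholder value for every field."""
    template = {"id": "…"}
    template.update({field: "…" for field in data_scheme})
    # ensure_ascii=False keeps the ellipsis readable instead of escaping it.
    return json.dumps(template, ensure_ascii=False)


data_scheme = {
    "reference_evaluate": "argument whether answering the question is about knowledge",
    "reference_result": "indicate by yes or no if references are needed",
}

print("Your answer should include the following fields: " + describe_fields(data_scheme))
print("Return the results as a json structured like this : " + empty_structure(data_scheme))
```

Keeping the definitions in a dictionary means that adding a field to the scheme updates both the instructions and the expected JSON template at once.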

## Loading the model

To use the LLM, you then need to load it with vllm. This notebook provides a tested solution for Google Colab, but do not hesitate to check the vllm documentation:

In [10]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.3.0-cp310-cp310-manylinux1_x86_64.whl (38.0 MB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.9.2-cp310-cp310-manylinux2014_x86_64.whl (64.9 MB)
Collecting torch==2.1.2 (from vllm)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting transformers>=4.37.0 (from vllm)
  Downloading transformers-4.37.2-py3-n

In [11]:
from vllm import LLM, SamplingParams
import os

In [12]:
llm = LLM("teknium/OpenHermes-2.5-Mistral-7B")

INFO 02-14 14:45:16 llm_engine.py:72] Initializing an LLM engine with config: model='mistral-7b-hermes-2.5', tokenizer='mistral-7b-hermes-2.5', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 02-14 14:50:18 llm_engine.py:322] # GPU blocks: 8919, # CPU blocks: 2048
INFO 02-14 14:50:20 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-14 14:50:20 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-14 14:50:26 model_runner.py:698] Graph capturing finished in 6 secs.


In [13]:
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8000, presence_penalty = 0)

## Generate the annotations

At this point, the actual annotation takes a single command. You'll notice that marginalia makes several passes through vllm in order to resend any prompt that returned non-compliant JSON.

In [None]:
instructions.llm_generate_loop(llm, sampling_params)

A sample of the prompt:
 <|im_start|>system
You are a powerful evaluator of user inputs
<|im_end|>
<|im_start|>user
Assess whether theses questions require some encyclopedic references to back them up. References would be typically needed if the answer mandates external knowledge rather than a task to perform like translating two languages, reformulation or solving a math problem based on the element present in the instruction.

Your answer should include the following fields: the question id ("id"), argument whether answering the question is about knowledge and require some references rather than a task like translation, with a few concise sentences ("reference_evaluate"), indicate by yes or no if references are needed ("reference_result")

Return the results as a json structured like this : {"id": "…", "reference_evaluate": "…", "reference_result": "…"}

Here is the list of questions :

question 0: Rewrite the following sentence in a more formal tone. Hey, guys! Just wanna let ya kno

Processed prompts:  49%|████▉     | 975/1999 [02:32<03:22,  5.05it/s]
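
The idea of resending non-compliant outputs can be sketched as a small retry loop. This is an illustration rather than marginalia's code: generate_with_retries, max_passes and the stand-in flaky_generate are hypothetical, and the sketch simplifies by expecting a single JSON object per prompt, whereas a real batch holds several questions.

```python
import json


def generate_with_retries(prompts, generate, required_fields, max_passes=3):
    """Call generate on prompts and retry those whose output is not valid JSON.

    generate is any callable mapping a list of prompts to a list of strings.
    Returns the parsed records and the prompts that never produced valid JSON.
    """
    valid, pending = [], list(prompts)
    for _ in range(max_passes):
        if not pending:
            break
        retry = []
        for prompt, output in zip(pending, generate(pending)):
            try:
                record = json.loads(output)
            except json.JSONDecodeError:
                retry.append(prompt)
                continue
            # Valid JSON is not enough: every expected field must be present.
            if isinstance(record, dict) and all(f in record for f in required_fields):
                valid.append(record)
            else:
                retry.append(prompt)
        pending = retry
    return valid, pending


def flaky_generate(batch):
    """Stand-in for a model call: garbage on the first call, valid JSON afterwards."""
    flaky_generate.calls += 1
    if flaky_generate.calls == 1:
        return ["not json"] * len(batch)
    return [json.dumps({"id": str(i), "reference_result": "no"})
            for i in range(len(batch))]


flaky_generate.calls = 0

valid, failed = generate_with_retries(
    ["prompt A", "prompt B"], flaky_generate, ["id", "reference_result"]
)
print(len(valid), len(failed))
```

With vllm, the generate callable could wrap llm.generate(prompts, sampling_params) and read the text of the first completion of each returned output. Capping the number of passes avoids looping forever on a prompt the model cannot answer in the expected format.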

At the end of this process, you can check your JSON and export it to a dataframe:

In [31]:
import pandas as pd

result = pd.DataFrame(instructions.valid_json)[['original_source', 'reference_evaluate', 'reference_result']]
result


| | original_source | reference_evaluate | reference_result |
|---|---|---|---|
| 0 | Rewrite the following sentence in a more forma... | This question is about rewriting a sentence in... | no |
| 1 | Create a C# function that accepts an integer a... | This question is about creating a C# function ... | no |
| 2 | I heard that Sasha asked Ash to come home. The... | This question is about predicting what Ash wil... | no |
| 3 | Problem: Solve 143\*y + 30 = 128\*y for y. And t... | This question is about solving a math problem,... | no |
| 4 | This is vital to ensuring that people can make... | This question is about translating a sentence ... | no |
| ... | ... | ... | ... |
| 95 | In this task, you are given two strings A, B. ... | The question is about finding the longest comm... | Yes |
| 96 | What is the specific role of the viral protein... | The question is about the specific role of the... | Yes |
| 97 | You are the mayor of a major city and you need... | The question is about creating a budget plan f... | Yes |
| 98 | HAT a liberating feeling it is to cut the cord... | The question is about summarizing the content ... | Yes |
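
The labels in the table are not uniform ("no" next to "Yes"), so it can help to normalize them before counting or filtering. The sketch below assumes the annotations are available as a list of dictionaries, which is how instructions.valid_json appears to be used above. The records shown are made up for illustration, and normalize_label is a hypothetical helper.

```python
from collections import Counter


def normalize_label(value):
    """Map free-form yes/no answers to 'yes', 'no' or 'unknown'."""
    label = str(value).strip().rstrip(".").lower()
    return label if label in {"yes", "no"} else "unknown"


# Illustrative records only, not actual model output.
records = [
    {"id": "0", "reference_result": "no"},
    {"id": "1", "reference_result": "Yes"},
    {"id": "2", "reference_result": " YES."},
    {"id": "3", "reference_result": "maybe"},
]

counts = Counter(normalize_label(r["reference_result"]) for r in records)
print(counts)
```

Keeping an explicit "unknown" bucket makes it easy to spot answers that did not follow the yes/no instruction and may need a second look.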
