# Demo marginalia


<img src="https://raw.githubusercontent.com/Pleias/marginalia/main/notebook/marginalia_logo.jpg" style="float:right;" alt="marginalia logo" width="300"/>


This notebook provides an initial demo of marginalia, a lightweight application for retrieving corpus annotations with open LLMs such as Mistral OpenHermes 2.5.


For now, marginalia is only available on GitHub:

In [1]:
!python -m pip install git+https://github.com/Pleias/marginalia.git

Collecting git+https://github.com/Pleias/marginalia.git
  Cloning https://github.com/Pleias/marginalia.git to /tmp/pip-req-build-rib_5bn7
  Running command git clone --filter=blob:none --quiet https://github.com/Pleias/marginalia.git /tmp/pip-req-build-rib_5bn7
  Resolved https://github.com/Pleias/marginalia.git to commit 84984bccf93016e26aafd65624d17ac7c79c02d1
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: marginalia
  Building wheel for marginalia (setup.py) ... done
  Created wheel for marginalia: filename=marginalia-0.1.0-py3-none-any.whl size=3709 sha256=65cc4807d32355d68740b7fd355f4c403a3c1bcc2bfd13616b4924e0f218c708
  Stored in directory: /tmp/pip-ephem-wheel-cache-pyuldqtb/wheels/32/db/49/703181bf805653e03da51802e4481682d35cb87276ab0493b3
Successfully built marginalia
Installing collected packages: marginalia
Successfully installed marginalia-0.1.0


## Preparation of the data

marginalia works with any list of unstructured texts. It generates ids on the fly, based simply on the index of each text, and returns the unprocessed text as part of the JSON output.
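
To make the index-based ids concrete, here is a minimal sketch of the idea. The helper attach_ids is hypothetical and is not part of marginalia; it only illustrates pairing each text with its position in the list while keeping the raw text alongside it.

```python
def attach_ids(texts):
    """Pair each text with its list index and keep the raw text next to it.

    Hypothetical helper for illustration only, not marginalia's own code.
    """
    return [{"id": index, "original_source": text}
            for index, text in enumerate(texts)]


sample = ["Hobbes's Leviathan, very scarce.", "R. Barclay's Works, compleat."]
records = attach_ids(sample)
print(records[1])
```

Because the id is only the position in the list, it stays stable as long as the list is not reordered or filtered between runs.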

We are going to test marginalia on a set of 445 individual titles recommended by none other than Benjamin Franklin (*A catalogue of choice and valuable books, consisting of near 600 volumes*).

In [2]:
import pandas as pd

unstructured = pd.read_csv("https://raw.githubusercontent.com/Pleias/marginalia/main/notebook/franklin_library.tsv", sep = "\t")["text"].tolist()

Let's have a look at the first titles:

In [3]:
unstructured[0:10]

['1 FINE large Folio BIBLE, compleat, Oxford 1727.',
 '2 Ditto, with Maps, Notes, &c.',
 "3 Clarendon's History of the Rebellion, 3 Vols",
 "4 Bayley's universal etimologlcal Dictionary.",
 '5 Marlorati Thesaurus Scripturae.',
 "6 Wiquefort's compleat Ambassador, translated by Digby, finely bound.",
 "7 Hobbes's Leviathan, very scarce.",
 "8 R. Barclay's Works, compleat.",
 "9 D. Rogers's Lectures on Naaman the Syrian.",
 "10 Bunny's Head Corner-Stone."]

marginalia aims to recover a *data scheme*. To create the scheme, you simply initialize a dictionary that maps each field to its definition. The idea is to apply the data scheme to your unstructured set of texts, filling in each field whenever it fits.

In [4]:
data_scheme = {"author": "the author(s) of the book, which can be expressed with a possessive like Hobbe's",
               "title": "the title of the book",
               "translator": "the translator(s) of the book",
               "date": "the date of publication",
               "place": "the place of publication",
               "format": "any information related to the format such as volumes, folios",
               "other": "any other information related to the book"}

The core of marginalia's functionality is instruction_set. That is where you pass the unstructured text, the data scheme and the prompt instructions.

In [5]:
from marginalia import instruction_set

instructions = instruction_set(data_scheme = data_scheme,
                               unstructured = unstructured,
                               system_prompt = "You are a powerful annotator of bibliographic data",
                               input_prompt = "Transform this list of book entries into structured bibliographic data",
                               definition_prompt = "Extract the following bibliographic fields:",
                               structure_prompt = "Return the results as a json structured like this :",
                               data_prompt = "Here is the list of books :",
                               name_id = "book",
                               size_batch = 10)

As you can see, the prompt has six parts:

* System prompt: defines, in a very broad way, what kind of tool the LLM is.
* Input prompt: the actual task at hand.
* Definition prompt: the introductory prompt for the list of definitions stored in the data scheme.
* Structure prompt: the introductory prompt for an empty sample of the JSON structure.
* Data prompt: the introductory prompt for the list of unstructured text samples.
* Name id: the name used to qualify each unstructured text sample.

Additionally, you can define the size of each batch with size_batch. Overall, the longer your text samples are, the smaller your batches should be, so as not to overflow the context window.
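
As a rough illustration of what batching means here, the sketch below splits a list of texts into consecutive groups while keeping each text's global index. The function make_batches is a hypothetical name introduced for this example; it is an assumption about the general approach, not marginalia's actual implementation.

```python
def make_batches(texts, size_batch=10):
    """Split texts into consecutive batches of (global_index, text) pairs.

    Hypothetical helper for illustration only, not marginalia's own code.
    """
    indexed = list(enumerate(texts))
    return [indexed[start:start + size_batch]
            for start in range(0, len(indexed), size_batch)]


# Placeholder titles standing in for the 445 catalogue entries.
titles = [f"title {n}" for n in range(445)]
batches = make_batches(titles, size_batch=10)
print(len(batches), len(batches[-1]))
```

With 445 texts and a batch size of 10, this gives 44 full batches plus a final batch of 5. Halving the batch size doubles the number of prompts but roughly halves the length of each one.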

Before launching the actual LLM-powered annotation, it is advisable to take a look at the data and check that everything is fine. You can do it with test_prompt:

In [6]:
instructions.test_prompt()

And then print the first prompt:

In [7]:
print(instructions.prompts[0])

<|im_start|>system
You are a powerful annotator of bibliographic data
<|im_end|>
<|im_start|>user
Transform this list of book entries into structured bibliographic data

Extract the following bibliographic fields: the book id ("id"), the author(s) of the book, which can be expressed with a possessive like Hobbe's ("author"), the title of the book ("title"), the translator(s) of the book ("translator"), the date of publication ("date"), the place of publication ("place"), any information related to the format such as volumes, folios ("format"), any other information related to the book ("other")

Return the results as a json structured like this : {"id": "…", "author": "…", "title": "…", "translator": "…", "date": "…", "place": "…", "format": "…", "other": "…"}

Here is the list of books :

book 0: 1 FINE large Folio BIBLE, compleat, Oxford 1727.

book 1: 2 Ditto, with Maps, Notes, &c.

book 2: 3 Clarendon's History of the Rebellion, 3 Vols

book 3: 4 Bayley's universal etimologlcal Dic
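
The way the six parts combine into the prompt printed above can be approximated with the sketch below. The function build_prompt is hypothetical and only mirrors the visible layout of the printed prompt. The closing of the user turn and the opening of the assistant turn follow the usual ChatML convention and are an assumption, since the printed prompt is cut off before that point.

```python
def build_prompt(data_scheme, batch, system_prompt, input_prompt,
                 definition_prompt, structure_prompt, data_prompt, name_id):
    """Assemble a ChatML-style prompt (hypothetical helper, not marginalia's code)."""
    # Field definitions, with the id field placed before the user-defined ones.
    fields = [f'the {name_id} id ("id")']
    fields += [f'{definition} ("{field}")'
               for field, definition in data_scheme.items()]
    # Empty sample of the expected JSON structure.
    keys = ["id"] + list(data_scheme)
    template = "{" + ", ".join(f'"{key}": "…"' for key in keys) + "}"
    # One line per text sample, prefixed with the name id and the index.
    samples = "\n\n".join(f"{name_id} {index}: {text}" for index, text in batch)
    user = "\n\n".join([
        input_prompt,
        f"{definition_prompt} {', '.join(fields)}",
        f"{structure_prompt} {template}",
        data_prompt,
        samples,
    ])
    # Assumption: the turn is closed in the standard ChatML way.
    return (f"<|im_start|>system\n{system_prompt}\n<|im_end|>\n"
            f"<|im_start|>user\n{user}\n<|im_end|>\n<|im_start|>assistant\n")


scheme = {"author": "the author(s) of the book",
          "title": "the title of the book"}
prompt = build_prompt(
    scheme,
    [(0, "7 Hobbes's Leviathan, very scarce.")],
    system_prompt="You are a powerful annotator of bibliographic data",
    input_prompt="Transform this list of book entries into structured bibliographic data",
    definition_prompt="Extract the following bibliographic fields:",
    structure_prompt="Return the results as a json structured like this :",
    data_prompt="Here is the list of books :",
    name_id="book",
)
print(prompt)
```

Since the definitions are pasted into the prompt verbatim, a definition that reads naturally after the word "fields:" tends to produce a cleaner instruction.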

## Loading the model

Then, to use the LLM, you need to load it with vllm. This notebook provides a tested solution for Google Colab, but do not hesitate to check the vllm documentation:

In [8]:
!pip install lmdb
!pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 torchtext==0.15.2+cpu torchdata==0.6.1 --index-url https://download.pytorch.org/whl/cu118
!pip install transformers
!pip install vllm

Collecting lmdb
  Downloading lmdb-1.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (299 kB)
Installing collected packages: lmdb
Successfully installed lmdb-1.4.1
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch==2.0.1+cu118
  Downloading https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp310-cp310-linux_x86_64.whl (2267.3 MB)
Collecting torchvision==0.15.2+cu118
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.15.2%2Bcu118-cp310-cp310-linux_x86_64.whl (6.1 MB)
Collecting torchaudio==2.0.2
  Downloading https://download.pytorch.org/whl/cu118/to

In [9]:
from vllm import LLM, SamplingParams
import os

In [None]:
llm = LLM("teknium/OpenHermes-2.5-Mistral-7B")

In [None]:
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8000, presence_penalty = 0)

## Generate the annotations

At this point, the actual annotation is one command. You'll notice that marginalia makes several passes through vllm, re-sending any prompt whose output was not compliant JSON.

In [None]:
instructions.llm_generate_loop(llm, sampling_params)
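
The retry behaviour can be pictured with the sketch below. It is a simplified illustration of the general idea and not the code behind llm_generate_loop. The names generate_with_retries, max_passes and fake_generate are all hypothetical, and the model call is replaced by a stub so the example runs without a GPU.

```python
import json


def generate_with_retries(prompts, generate, max_passes=3):
    """Keep outputs that parse as JSON and re-send the prompts that failed.

    Hypothetical helper: `generate` is any callable that takes a list of
    prompts and returns one output string per prompt.
    """
    valid, pending = {}, dict(enumerate(prompts))
    for _ in range(max_passes):
        if not pending:
            break
        outputs = generate(list(pending.values()))
        still_pending = {}
        for (index, prompt), text in zip(pending.items(), outputs):
            try:
                valid[index] = json.loads(text)
            except json.JSONDecodeError:
                # Non-compliant output: queue the prompt for another pass.
                still_pending[index] = prompt
        pending = still_pending
    return valid, pending


calls = {"count": 0}


def fake_generate(batch):
    """Stub model: broken JSON on the first pass, valid JSON afterwards."""
    calls["count"] += 1
    if calls["count"] == 1:
        return ['{"id": 0, "title": ' for _ in batch]
    return ['[{"id": 0, "title": "Leviathan"}]' for _ in batch]


valid, failed = generate_with_retries(["prompt a", "prompt b"], fake_generate)
print(len(valid), len(failed), calls["count"])
```

With vllm, the stub could be swapped for a small wrapper that calls llm.generate(batch, sampling_params) and reads outputs[0].text from each result. Because the temperature is above zero, a second pass over the same prompt can return a different, well-formed answer.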

At the end of this process, you can check your JSON:

In [21]:
instructions.valid_json[0:6]

[{'id': 0,
  'author': None,
  'title': 'FINE large Folio BIBLE',
  'translator': None,
  'date': '1727',
  'place': 'Oxford',
  'format': 'compleat',
  'other': None,
  'original_source': 'FINE large Folio BIBLE, compleat, Oxford 1727.'},
 {'id': 1,
  'author': '',
  'title': 'Ditto, with Maps, Notes, &c.',
  'translator': '',
  'date': '',
  'place': '',
  'format': '',
  'other': '',
  'original_source': 'Ditto, with Maps, Notes, &c.'},
 {'id': 2,
  'author': 'Clarendon',
  'title': 'History of the Rebellion',
  'translator': '',
  'date': '1727',
  'place': 'Oxford',
  'format': '3 Vols',
  'other': '',
  'original_source': "Clarendon's History of the Rebellion, 3 Vols"},
 {'id': 3,
  'author': 'Bayley',
  'title': 'universal etimologlcal Dictionary',
  'translator': '',
  'date': '1727',
  'place': 'Oxford',
  'format': '',
  'other': '',
  'original_source': "Bayley's universal etimologlcal Dictionary."},
 {'id': 4,
  'author': 'Marlorati',
  'title': 'Thesaurus Scripturae',
  't