Welcome to the LostMa Workshop!

In this notebook, we will explore the data collected as part of the project and manipulate it together.

The data itself is available on [Heurist](https://heurist.huma-num.fr/heurist/?db=jbcamps_gestes), and the schema is described on the [website](https://lostma-erc.github.io/corpus/documentation).

# **I) Notebook setup**

To use this notebook, you first need to install the library we created to facilitate exploration of our Heurist database. This tool is based on the [Heurist-API](https://pypi.org/project/heurist-api) developed by Kelly Christensen. Thanks to this API, we can extract, transform and load the content of a Heurist database into a DuckDB database, which makes it easier to query with SQL. We then created a dedicated LostMa Python object to help explore the data.

Start by installing the library:

In [None]:
!pip install "git+https://github.com/LostMa-ERC/Heurist-analyser.git"

Collecting git+https://github.com/LostMa-ERC/Heurist-analyser.git
  Cloning https://github.com/LostMa-ERC/Heurist-analyser.git to /tmp/pip-req-build-4501u0r6
  Running command git clone --filter=blob:none --quiet https://github.com/LostMa-ERC/Heurist-analyser.git /tmp/pip-req-build-4501u0r6
  Resolved https://github.com/LostMa-ERC/Heurist-analyser.git to commit abba05a4f7b622af8720bcb3a1a5290c72643eee
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting heurist-api<0.3.0,>=0.2.5 (from heurist-analyser==0.1.0)
  Downloading heurist_api-0.2.5-py3-none-any.whl.metadata (27 kB)
Collecting duckdb<2.0.0,>=1.4.1 (from heurist-analyser==0.1.0)
  Downloading duckdb-1.4.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting pandas>=2.2.3 (from heurist-api<0.3.0,>=0.2.5->heurist-analyser==0.1.0)
  Downloading pandas-2.3.3-cp312-cp312-

You can then use your Heurist login details to download the LostMa database:

In [None]:
from lostma_db import LostmaDB

login = "your login"
pwd = "your password"
db = LostmaDB(login, pwd)
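
Typing a password directly into a notebook cell risks leaving it in a saved or shared copy. A minimal sketch of an alternative, using only the standard library: read the credentials from environment variables and fall back to an interactive prompt. The function name `read_credentials` and the variable names `HEURIST_LOGIN` and `HEURIST_PASSWORD` are hypothetical, chosen for this illustration; they are not part of the LostMa library.

```python
import os
from getpass import getpass


def read_credentials(env=None, ask=input, ask_secret=getpass):
    """Return (login, password), prompting only for values missing from env."""
    # HEURIST_LOGIN / HEURIST_PASSWORD are hypothetical names for this sketch.
    env = os.environ if env is None else env
    login = env.get("HEURIST_LOGIN") or ask("Heurist login: ")
    # getpass hides the password while it is being typed.
    password = env.get("HEURIST_PASSWORD") or ask_secret("Heurist password: ")
    return login, password


# In the notebook, the result would replace the hard-coded strings:
# login, pwd = read_credentials()
# db = LostmaDB(login, pwd)
```

The `ask` and `ask_secret` parameters exist only so the function can be exercised without a keyboard; in normal use the defaults are enough.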

In [None]:
db.sync()

# **II) Data access**

It is now possible to access the contents of tables (for example, the witness table here) with SQL queries:

In [None]:
db.sql("SELECT * FROM witness")

Unnamed: 0,H-ID,type_id,is_manifestation_of H-ID,observed_on_pages H-ID,last_observed_in_doc H-ID,is_unobserved,is_unobserved TRM-ID,claim_freetext,used_to_follow_fragment H-ID,used_to_follow_witness H-ID,...,number_of_hands,scribe_note,place_of_creation H-ID,place_of_creation_source,described_by_source H-ID,described_at_URL,reference_notes,review_status,review_status TRM-ID,review_note
0,54257,105,586.0,[52507],,No,9483,,,,...,,,[],,[],"[http://jonas.irht.cnrs.fr/manuscrit/72901, ht...",[],Open,9698,92:The status was marked as a citation or extr...
1,55200,105,48907.0,[51255],,No,9483,,,,...,,,[],,[],"[http://jonas.irht.cnrs.fr/manuscrit/73408, ht...",[],Action required,9697,198:The status was marked as a citation or ext...
2,53697,105,48859.0,[51238],,No,9483,,,,...,,,[],,[],"[http://jonas.irht.cnrs.fr/manuscrit/45508, ht...",[],Action required,9697,443:The status was marked as a citation or ext...
3,53719,105,773.0,[52713],,No,9483,,,,...,,,[],,[],"[http://jonas.irht.cnrs.fr/manuscrit/59028, ht...",[Concaténé avec Huon de Bordeaux ?],Action required,9697,462:The status was marked as a citation or ext...
4,54247,105,50237.0,[52196],,No,9483,,,,...,,,[],,[],"[http://jonas.irht.cnrs.fr/manuscrit/45517, ht...",[],Action required,9697,911:The status was marked as a citation or ext...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2561,48051,105,48049.0,[48052],,No,9483,,,,...,,,[],,[47483],[https://www.arlima.net/eh/huon_de_bordeaux/nl...,[],Action required,9697,Kienhorst zegt in BNM?
2562,47458,105,47293.0,[47459],47457.0,No,9483,,,,...,,,[],,[],[https://www.handschriftencensus.de/18870],[],Action required,9697,Ref.
2563,47454,105,47293.0,[47455],47453.0,No,9483,,,,...,,,[],,[],[],[],Action required,9697,Ref.
2564,47433,105,46322.0,[47434],47432.0,No,9483,,,,...,,,[],,[],[],[],Action required,9697,"Check, ref."


NB: If you know [SQL](https://www.w3schools.com/sql/), you can use the [data model](https://github.com/LostMa-ERC/DataArchitect2025/blob/main/heurist.jpg) to write your own queries and retrieve any data you need as a dataframe.

To help you, we created some predefined queries that return the most useful data (we recommend limiting the scope to the language corpora that are ready to use):
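
As an illustration of a custom query, the sketch below counts witnesses per review status. It runs against a tiny invented table in Python's built-in `sqlite3`, used here purely as a stand-in so the example works without the LostMa database; the rows are made up, and only the column names `H-ID` and `review_status` come from the output above. The same `SELECT` statement could presumably be passed to `db.sql(...)`, although this has not been checked against the real schema.

```python
import sqlite3

# Toy stand-in for the witness table: invented rows, two columns only.
conn = sqlite3.connect(":memory:")
# Column names containing a hyphen or a space must be double-quoted in SQL.
conn.execute('CREATE TABLE witness ("H-ID" INTEGER, review_status TEXT)')
conn.executemany(
    "INSERT INTO witness VALUES (?, ?)",
    [(1, "Open"), (2, "Action required"), (3, "Action required")],
)

# Count the witnesses in each review status, most frequent first.
QUERY = """
    SELECT review_status, COUNT(*) AS n_witnesses
    FROM witness
    GROUP BY review_status
    ORDER BY n_witnesses DESC
"""
rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Action required', 2), ('Open', 1)]
```

Many columns in the real tables, such as `is_manifestation_of H-ID`, contain a space, so they need the same double quotes when used in a query.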

In [None]:
available_languages = ["dum (Middle Dutch)", "enm (Middle English)", "fro (Old French)", "frm (Middle French)"]

In [None]:
db.texts(available_languages)

Unnamed: 0,H-ID,type_id,preferred_name,language_COLUMN,language_COLUMN TRM-ID,literary_form,literary_form TRM-ID,is_hypothetical,is_hypothetical TRM-ID,claim_freetext,...,is_adapted_by H-ID,author_freetext,place_of_creation H-ID,place_of_creation_source,described_by_source H-ID,described_at_URL,reference_notes,review_status,review_status TRM-ID,review_note
0,47788,101,Willem van Oringen I,dum (Middle Dutch),9728.0,verse,9549.0,No,9483,,...,[],,,,"[47485, 47482, 47484, 47483]",[https://lib.ugent.be/catalog/rug01:000990047],[],Open,9698.0,
1,55292,101,Alexander-compilatie,dum (Middle Dutch),9728.0,verse,9549.0,No,9483,,...,[],,,,[],[],[],Action required,9697.0,"Schoenaers, Breeus-Loos, Katz, and Sleiderink ..."
2,48337,101,Aspremont,dum (Middle Dutch),9728.0,verse,9549.0,No,9483,,...,[],,,,"[47485, 47483]",[],[],Open,9698.0,
3,47903,101,Aubri de Borgengoen,dum (Middle Dutch),9728.0,verse,9549.0,No,9483,,...,[],,,,"[47485, 47482, 47484, 47483]",[https://www.handschriftencensus.de/20269],[],Open,9698.0,
4,47529,101,Barlaam en Josaphat,dum (Middle Dutch),9728.0,verse,9549.0,No,9483,,...,[],,,,[47485],[],[],Open,9698.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,48714,101,Apollonius de Tyr,fro (Old French),9470.0,prose,9550.0,No,9483,,...,[],,,,[],"[https://arlima.net/no/12079, http://jonas.irh...","[ARLIMA: 361, https://arlima.net/no/361]",Open,9698.0,The story of Apollonius of Tyre has not yet be...
513,49488,101,Departement des enfans Aimeri,fro (Old French),9470.0,verse,9549.0,No,9483,,...,[],,,,[],"[https://arlima.net/no/13731, http://jonas.irh...",[JONAS : 13731],Open,9698.0,
514,48529,101,Eneas,fro (Old French),9470.0,prose,9550.0,No,9483,,...,[],,,,[],[http://jonas.irht.cnrs.fr/oeuvre/13086],[],Open,9698.0,
515,48814,101,L'âtre périlleux,fro (Old French),9470.0,verse,9549.0,No,9483,,...,[],,,,[],"[http://jonas.irht.cnrs.fr/oeuvre/5641, https:...",[],Open,9698.0,


In [None]:
db.witnesses(available_languages)

Unnamed: 0,H-ID,type_id,is_manifestation_of H-ID,observed_on_pages H-ID,last_observed_in_doc H-ID,is_unobserved,is_unobserved TRM-ID,claim_freetext,used_to_follow_fragment H-ID,used_to_follow_witness H-ID,...,place_of_creation_freetext,place_of_creation_source_2,described_by_source H-ID_2,described_at_URL_2,online_catalogue_URL,ARK,reference_notes_2,review_status_3,review_status TRM-ID_3,review_note_3
0,54257,105,586.0,[52507],,No,9483,,,,...,,,[],[],https://archivesetmanuscrits.bnf.fr/ark:/12148...,ark:/12148/cc51045t,[],Open,9698,
1,55200,105,48907.0,[51255],,No,9483,,,,...,,,[],[],https://archivesetmanuscrits.bnf.fr/ark:/12148...,ark:/12148/cc76887d,[],Open,9698,
2,53697,105,48859.0,[51238],,No,9483,,,,...,,,[],[],https://archivesetmanuscrits.bnf.fr/ark:/12148...,ark:/12148/cc45158c,[],Open,9698,
3,53719,105,773.0,[52713],,No,9483,,,,...,,,[],"[http://jonas.irht.cnrs.fr/manuscrit/59028, ht...",,,[],Action required,9697,This manuscript has two shelfmarks marked as c...
4,54246,105,50237.0,[52193],,No,9483,,,,...,,,[],[],https://archivesetmanuscrits.bnf.fr/ark:/12148...,ark:/12148/cc49846f,[],Open,9698,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2628,53436,105,48765.0,[51121],,No,9483,,,,...,,,[],[http://jonas.irht.cnrs.fr/manuscrit/83470],https://archivesetmanuscrits.bnf.fr/ark:/12148...,ark:/12148/cc125661b,[],Open,9698,
2629,53428,105,48765.0,[51113],,No,9483,,,,...,,,[],[http://jonas.irht.cnrs.fr/manuscrit/79278],,,[],Open,9698,
2630,53422,105,48765.0,[51109],,No,9483,,,,...,,,[],[],,,[],Open,9698,
2631,53439,105,48765.0,[51123],,No,9483,,,,...,,,[],[http://jonas.irht.cnrs.fr/manuscrit/83225],,,[],Open,9698,


NB: The witnesses output also contains the data from the text, part and document tables.

All the results of these functions can be downloaded as a CSV file for closer examination:

In [None]:
db.witnesses(available_languages).to_csv("witnesses.csv")

# **III) Data visualisation**

Once you have the data you need, you can explore it with dedicated libraries such as pandas and Plotly:

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
df_texts = db.texts(available_languages)
vc_texts = df_texts["language_COLUMN"].value_counts()
counts_texts = pd.DataFrame({
    "language": vc_texts.index,
    "count": vc_texts.values,
    "source": "texts",
})
df_witnesses = db.witnesses(available_languages)
vc_witnesses = df_witnesses["language_COLUMN"].value_counts()
counts_witnesses = pd.DataFrame({
    "language": vc_witnesses.index,
    "count": vc_witnesses.values,
    "source": "witnesses",
})

In [None]:
counts_all = pd.concat([counts_texts, counts_witnesses], ignore_index=True)
fig = px.bar(
    counts_all,
    x="language",
    y="count",
    color="source",
    barmode="group",
)
fig.show()

If you want to know which data is available for a given table within a corpus, we created a function that returns:

- the "completeness table" of each field
- the number of "total records" for this scope
- how many records are still marked "action required"

In [None]:
db.analyse("document", "frm (Middle French)")['completeness table']

Unnamed: 0,field,required statement,empty records,percentage empty
0,current_shelfmark,recommended,4,1.71
1,contents_of_record_without_shelfmark,optional,250,106.84
2,collection,optional,219,93.59
3,location H-ID,recommended,4,1.71
4,location_known,required,0,0.0
5,location_notes,optional,250,106.84
6,is_hypothetical,required,0,0.0
7,invented_label,recommended,254,108.55
8,claim_freetext,recommended,254,108.55
9,collection_of_fragments,required,0,0.0
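
To make the columns of this table concrete, here is a minimal sketch of how a comparable completeness table could be computed with pandas. It is not the implementation behind `db.analyse`; the records are invented and only the column names echo the output above. It also assumes that an empty field is stored as a missing value, so empty lists such as `[]` would not be counted.

```python
import pandas as pd

# Invented records; None marks a field left empty.
records = pd.DataFrame({
    "current_shelfmark": ["fr. 1", None, "fr. 3", "fr. 4"],
    "collection": [None, None, None, "Arsenal"],
    "location_known": ["Yes", "Yes", "No", "Yes"],
})

# Number of missing values per field.
empty = records.isna().sum()

# Share of empty values, relative to the same set of records.
completeness = pd.DataFrame({
    "field": empty.index,
    "empty records": empty.values,
    "percentage empty": (100 * empty / len(records)).round(2).values,
})
print(completeness)
```

With this definition the percentage always stays between 0 and 100, because the empty records and the total are counted over the same rows. Percentages above 100 would suggest that the two counts were taken over different sets of records, which is worth checking before interpreting them.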


You can then see how the data is distributed within a single field:

In [56]:
attr_col = "status_witness"

df_witnesses = db.witnesses(available_languages)
vc_attr = df_witnesses[attr_col].value_counts(dropna=False)
counts_attr = pd.DataFrame({
    "value": vc_attr.index,
    "count": vc_attr.values,
})

fig = px.bar(
    counts_attr,
    x="value",
    y="count",
)
fig.update_layout(
    xaxis_title=attr_col,
    yaxis_title="Number of records",
    xaxis_tickangle=-45,
)

In [46]:
fig.show()

Or how this field is distributed according to language:

In [53]:
lang_col = "language_COLUMN"

grouped = (
    df_witnesses
    .groupby([lang_col, attr_col])
    .size()
    .reset_index(name="count")
)

fig = px.bar(
    grouped,
    x=lang_col,
    y="count",
    color=attr_col,
    barmode="group"
)

fig.update_layout(
    title=f"Distribution of {attr_col} values by language",
    xaxis_title="Language",
    yaxis_title="Number of witnesses",
    xaxis_tickangle=-45,
)
fig.show()
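
Because the corpora differ in size, raw counts can make a large corpus dominate the chart. One possible complement, sketched here on invented data, is to compare proportions within each language using `pd.crosstab` with `normalize="index"`. The status values `complete` and `fragment` are hypothetical; only the column names are taken from the cells above.

```python
import pandas as pd

# Invented rows standing in for the witnesses dataframe.
toy = pd.DataFrame({
    "language_COLUMN": [
        "fro (Old French)", "fro (Old French)",
        "fro (Old French)", "dum (Middle Dutch)",
    ],
    "status_witness": ["complete", "fragment", "fragment", "fragment"],
})

# One row per language; each row sums to 1.
shares = pd.crosstab(
    toy["language_COLUMN"], toy["status_witness"], normalize="index"
)

# Long format, one row per (language, status) pair, convenient for plotting.
long = shares.reset_index().melt(
    id_vars="language_COLUMN", var_name="status_witness", value_name="share"
)
print(long)
```

The `long` dataframe can then be plotted in the same way as `grouped`, with `y="share"` instead of `y="count"`.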