Welcome to the LostMa Workshop!

This notebook explores the data collected as part of the project and lets us work with it together.

The data itself is available on [Heurist](https://heurist.huma-num.fr/heurist/?db=jbcamps_gestes), and the schema is described on the project [website](https://lostma-erc.github.io/corpus/documentation).

# **I) Notebook setup**

To use this notebook, you first need to install the library we created to make our Heurist database easier to explore. This tool is based on the [Heurist-API](https://pypi.org/project/heurist-api) developed by Kelly Christensen. Thanks to this API, we can extract, transform and load the content of a Heurist database into a DuckDB database, which makes it easier to query with SQL. We then created a specific LostMa Python object to help explore the data.

Start with the installation:

In [None]:
!pip install "git+https://github.com/LostMa-ERC/Heurist-analyser.git"

You can then use your Heurist login details to download the LostMa database:

In [None]:
from lostma_db import LostmaDB

login = "your login"
pwd = "your password"
db = LostmaDB(login, pwd)
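
Typing a password directly into a shared notebook makes it easy to leak. Here is a minimal sketch of an alternative, assuming you prefer to read the credentials from environment variables and fall back to an interactive prompt. The helper `read_credentials` and the variable names `HEURIST_LOGIN` and `HEURIST_PASSWORD` are hypothetical and not part of `lostma_db`:

```python
import os
from getpass import getpass


def read_credentials(env=os.environ, prompt=getpass):
    """Return (login, password) without storing them in the notebook.

    HEURIST_LOGIN and HEURIST_PASSWORD are hypothetical variable names
    chosen for this sketch; they are not defined by lostma_db.
    """
    login = env.get("HEURIST_LOGIN") or input("Heurist login: ")
    # getpass hides the typed password instead of echoing it.
    pwd = env.get("HEURIST_PASSWORD") or prompt("Heurist password: ")
    return login, pwd
```

You could then write `login, pwd = read_credentials()` before calling `LostmaDB(login, pwd)`.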

In [None]:
db.sync()

# **II) Data access**

It is now possible to access the contents of tables (for example, the witness table here) with SQL queries:

In [None]:
db.sql("SELECT * FROM witness")

NB: If you know [SQL](https://www.w3schools.com/sql/), you can use the [data model](https://github.com/LostMa-ERC/DataArchitect2025/blob/main/heurist.jpg) to extract any data you want and view it as a dataframe.

To help you, we created some predefined queries for exploring the most useful data (we recommend limiting the scope to the language corpora that are ready to use):

In [None]:
available_languages = ["dum (Middle Dutch)", "enm (Middle English)", "fro (Old French)", "frm (Middle French)"]

In [None]:
db.texts(available_languages)

In [None]:
db.witnesses(available_languages)

NB: The witnesses output also contains the data from the text, part and document tables.

All the results of these functions can be downloaded as a CSV file for closer examination:

In [None]:
db.witnesses(available_languages).to_csv("witnesses.csv")
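
By default, pandas `to_csv` also writes the row index as an unnamed first column. Here is a small sketch of how to leave it out, assuming the returned object is a pandas DataFrame (as the `to_csv` call suggests). The toy frame and its column names are invented for illustration:

```python
import io

import pandas as pd

# Toy stand-in for db.witnesses(...); the column names are invented.
toy = pd.DataFrame({"witness_id": [1, 2], "language": ["fro", "enm"]})

buffer = io.StringIO()
# index=False leaves out the row index, so the file has only data columns.
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the text back gives the same columns as the original frame.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

In the notebook, the equivalent call would be `db.witnesses(available_languages).to_csv("witnesses.csv", index=False)`.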

# **III) Data visualisation**

Once you have the data you need, you can explore it with dedicated libraries such as pandas and Plotly:

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
df_texts = db.texts(available_languages)
vc_texts = df_texts["language_COLUMN"].value_counts()
counts_texts = pd.DataFrame({
    "language": vc_texts.index,
    "count": vc_texts.values,
    "source": "texts",
})
df_witnesses = db.witnesses(available_languages)
vc_witnesses = df_witnesses["language_COLUMN"].value_counts()
counts_witnesses = pd.DataFrame({
    "language": vc_witnesses.index,
    "count": vc_witnesses.values,
    "source": "witnesses",
})
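
The cell above selects the language through a column called `language_COLUMN`. If that name does not exist in your output, pandas raises a `KeyError`. Here is a minimal sketch for locating the right name. `find_columns` is a hypothetical helper and the sample column names are invented:

```python
def find_columns(columns, keyword):
    """Return the column names that contain `keyword`, ignoring case."""
    return [name for name in columns if keyword.lower() in str(name).lower()]


# Invented column names standing in for df_texts.columns.
sample_columns = ["text_id", "Language_Code", "title", "language_name"]
matches = find_columns(sample_columns, "language")
print(matches)
```

With the real data, you could call `find_columns(df_texts.columns, "language")` and check which of the returned names holds the language labels.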

In [None]:
counts_all = pd.concat([counts_texts, counts_witnesses], ignore_index=True)
fig = px.bar(
    counts_all,
    x="language",
    y="count",
    color="source",
    barmode="group",
)
fig.show()

If you want to know which data is available for a given table within a corpus, we created a function that returns:

- the "completeness table" of each field
- the number of "total records" for this scope
- how many records have the "action required" field open

In [None]:
db.analyse("document", "frm (Middle French)")['completeness table']
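
To make the idea of a completeness table concrete, here is a rough sketch that computes, for each column of a toy pandas DataFrame, how many records have a value. This is only an assumption about what such a table measures. It is not the implementation behind `db.analyse`, and the column names are invented:

```python
import pandas as pd

# Toy records; None marks a field left empty.
toy = pd.DataFrame({
    "shelfmark": ["A1", "B2", None, "D4"],
    "date": [None, None, "1350", "1400"],
})

completeness = pd.DataFrame({
    "filled": toy.notna().sum(),          # records with a value
    "total": len(toy),                    # records in scope
    "percent": toy.notna().mean() * 100,  # share of filled records
})
print(completeness)
```

A field with a low percentage is one where the corpus is still sparsely described, which is worth knowing before plotting it.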

You can then see how the data is distributed within a single field:

In [None]:
attr_col = "status_witness"

df_witnesses = db.witnesses(available_languages)
vc_attr = df_witnesses[attr_col].value_counts(dropna=False)
counts_attr = pd.DataFrame({
    "value": vc_attr.index,
    "count": vc_attr.values,
})

In [None]:
fig = px.bar(
    counts_attr,
    x="value",
    y="count",
)
fig.update_layout(
    xaxis_title=attr_col,
    yaxis_title="Number of records",
    xaxis_tickangle=-45,
)
fig.show()

Or how this field is distributed according to language:

In [None]:
lang_col = "language_COLUMN"

grouped = (
    df_witnesses
    .groupby([lang_col, attr_col])
    .size()
    .reset_index(name="count")
)

fig = px.bar(
    grouped,
    x=lang_col,
    y="count",
    color=attr_col,
    barmode="group"
)

fig.update_layout(
    title=f"Distribution of {attr_col} values by language",
    xaxis_title="Language",
    yaxis_title="Number of records",
    xaxis_tickangle=-45,
)
fig.show()
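
Raw counts can be hard to compare when the corpora differ in size. As a sketch under that assumption, `pd.crosstab` with `normalize="index"` turns the counts into shares within each language. The toy data and the column names `language` and `status` are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "language": ["fro", "fro", "fro", "enm", "enm"],
    "status": ["complete", "fragment", "complete", "fragment", "fragment"],
})

# normalize="index" divides each row by its total,
# so every language row sums to 1.
shares = pd.crosstab(toy["language"], toy["status"], normalize="index")
print(shares.round(2))
```

The same call should work on the real dataframe by passing `df_witnesses[lang_col]` and `df_witnesses[attr_col]`.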

If you want to study the tradition of texts (e.g. with [simMAtree](https://github.com/LostMa-ERC/simMAtree)), you can use this function to export abundance distribution data:

In [None]:
db.tradition(available_languages).to_csv("tradition.csv")
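
As a rough illustration of what abundance data can look like, the sketch below counts the surviving witnesses for each text, then counts how many texts share each total. This is an assumption about the general idea, not the output format of `db.tradition`. The identifiers are invented:

```python
from collections import Counter

# Invented (text, witness) pairs: one row per surviving witness.
witness_rows = [
    ("text_A", "w1"), ("text_A", "w2"), ("text_A", "w3"),
    ("text_B", "w4"),
    ("text_C", "w5"), ("text_C", "w6"),
    ("text_D", "w7"),
]

# Step 1: number of witnesses for each text.
witnesses_per_text = Counter(text for text, _ in witness_rows)

# Step 2: number of texts having exactly k witnesses.
abundance = Counter(witnesses_per_text.values())

for k in sorted(abundance):
    print(f"{abundance[k]} text(s) with {k} witness(es)")
```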