FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models

This repository contains:

  • training and evaluation data from allographetic transcriptions of various Old French and Old Occitan manuscripts, in various states of correctness, in Kraken training format;
  • HTR models trained and tested using this data.

If you plan to use this data or the provided models in a publication, please cite the repository as:

Jean-Baptiste Camps (éd.), FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models, Paris: École nationale des chartes (PSL), 2018, https://github.com/Jean-Baptiste-Camps/FROC-MSS.

Data format

The data is organised as follows:

  • each line image in a .png file;
  • each transcription in a .gt.txt file.
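
As a minimal sketch of how this layout can be read, the Python snippet below pairs each line image with its transcription. It assumes that both files share a base name and sit in the same folder (for example `line_0001.png` and `line_0001.gt.txt`, which are hypothetical names); the function `load_pairs` is illustrative and not part of this repository.

```python
from pathlib import Path


def load_pairs(folder):
    """Return (image_path, transcription) pairs found in `folder`.

    Assumes each NAME.png has a sibling NAME.gt.txt (an assumption about
    the naming scheme); images without a transcription are skipped.
    """
    pairs = []
    for image in sorted(Path(folder).glob("*.png")):
        gt = image.with_name(image.stem + ".gt.txt")
        if gt.exists():
            # Drop the trailing newline so only the line's text remains.
            text = gt.read_text(encoding="utf-8").rstrip("\n")
            pairs.append((image, text))
    return pairs
```

Calling `load_pairs("some_folder")` on a folder of line images would give a list that can be inspected or filtered before training.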

Unicode NFD normalisation has been applied to the ground-truth text.
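
To illustrate what NFD normalisation does, here is a small sketch using Python's standard `unicodedata` module. The helper name `to_nfd` is hypothetical; the tool actually used to normalise this data set is not stated here.

```python
import unicodedata


def to_nfd(text):
    """Decompose precomposed characters into base letter + combining marks."""
    return unicodedata.normalize("NFD", text)


composed = "\u00e9"            # é as a single precomposed code point
decomposed = to_nfd(composed)  # e followed by COMBINING ACUTE ACCENT
print([unicodedata.name(c) for c in decomposed])
```

This is presumably why combining marks such as COMBINING ACUTE ACCENT and COMBINING TILDE show up as separate characters in the confusion table. Any text compared against the ground truth should be normalised in the same way first.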

Models

Summary and C.E.R.

The root folder contains a vanilla Kraken model (model_froc.mlmodel), trained with default settings and without any additional data (e.g. no artificially noised data).

The data was randomly split into 80% for training (train.txt), 10% for in-training validation (val.txt) and 10% for final testing of the model (test.txt).
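
A random 80/10/10 split of this kind can be sketched as follows. This is only an illustration: the function name `split_lines`, the fixed seed and the integer rounding are assumptions, and the procedure actually used to produce train.txt, val.txt and test.txt is not documented here.

```python
import random


def split_lines(items, seed=0):
    """Shuffle `items` and split them into 80% / 10% / 10% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test
```

Each line image ends up in exactly one of the three parts, so the test set stays unseen during training and validation.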

It achieved a character error rate (C.E.R.) of:

  • 8.11% on validation data (7.03% ignoring spaces);
  • 7.83% on test data (6.92% ignoring spaces).

Errors and most frequent confusions on test data

There were 13540 characters and 1061 errors on test data.

Overall, the errors break down as follows:

  • 536 characters from the ground truth were not predicted by the model;
  • 132 characters absent from the ground truth were wrongly predicted;
  • 393 character substitutions.
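
The three counts above add up to the total number of errors, and dividing that total by the number of characters gives the error rate. The sketch below assumes the usual definition of C.E.R. (deletions, insertions and substitutions over the ground-truth length) and assumes that the 13540 characters are counted on the ground truth; the function name is illustrative.

```python
def character_error_rate(deletions, insertions, substitutions, n_chars):
    """C.E.R. = (deletions + insertions + substitutions) / reference length."""
    return (deletions + insertions + substitutions) / n_chars


# Counts reported above for the test set.
errors = 536 + 132 + 393
cer = character_error_rate(536, 132, 393, 13540)
print(errors, cer)
```

This gives 1061 errors and a rate of roughly 7.8%, in line with the test C.E.R. reported above.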

The most frequent confusions concerned spacing.

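The 20 most frequent confusions are listed below; an empty {  } means that no character was aligned on that side (a deletion or an insertion):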

Errors	Ground Truth-Prediction
70	{ SPACE } - {  }
54	{  } - { SPACE }
48	{ ı } - {  }
43	{ n } - {  }
43	{ COMBINING ACUTE ACCENT } - {  }
27	{ e } - {  }
24	{ l } - {  }
24	{ u } - {  }
21	{ . } - {  }
20	{ u } - { n }
18	{ ſ } - {  }
18	{ a } - {  }
17	{ r } - {  }
14	{ t } - {  }
13	{ COMBINING TILDE } - {  }
13	{  } - { ı }
12	{ o } - { e }
12	{ o } - {  }
12	{ ı } - { m }
11	{ e } - { c }

List of manuscripts

The data comes from partial allographetic transcriptions of the following manuscripts:

  • Clermont-Ferrand, archives départementales, 1F2 (XIII 1/3, Anglo-Norman praegothica script; Chanson d'Aspremont); 52 lines.
  • Paris, Bibliothèque nationale de France, fr. 854 (XIII 4/4, Venice or the Venetian area; gothic textualis; Occitan chansonnier I); 1112 lines.
  • Cologny-Genève, fondation Martin-Bodmer, cod. Bodm. 168 (XIII 3/3, Anglo-Norman gothic textualis; Chanson d'Otinel); 1908 lines.
  • Oxford, Bodleian Library, Digby 23 (XII 1/2, Anglo-Norman praegothica; Chanson de Roland); 564 lines.

For these transcriptions, see: Jean-Baptiste Camps, La ‘Chanson d’Otinel’: édition complète du corpus manuscrit et prolégomènes à l’édition critique, PhD thesis, dir. Dominique Boutet, Paris-Sorbonne, 2016, DOI: https://doi.org/10.5281/zenodo.1116735.

License

This work is made available under the terms of the Creative Commons Attribution 4.0 International License.

Contribute

If you want to contribute training data or models, you can do so by cloning the repository and sending us a pull request, or by sending an email to jbcamps at hotmail.com.
