

GROBID-Dictionaries is a GROBID module, implemented as a Java machine learning library, for structuring digitised lexical resources and entry-based documents with encyclopedic or bibliographic content. It parses, extracts and structures the text information in such resources.


GROBID-Dictionaries is based on cascading models. The diagram below presents the architecture enabling the processing and the transfer of the text information through the models.

Figure: GROBID Dictionaries Structure

Dictionary Segmentation: This is the first model. Its goal is to segment each dictionary page into 3 main blocks: Headnote, Body and Footnote. Another block, "dictScrap", can be generated for text information that does not belong to the principal blocks.

Dictionary Body Segmentation: The second model takes the Body recognised by the first model and processes it to recognise the boundaries of each lexical entry.

Lexical Entry: The third model parses each lexical entry recognised by the second model and segments it into 4 main blocks: Form, Etymology, Senses and Related Entries. A "dictScrap" block is available here as well for unrecognised information.

The rest of the models: The same logic applies to each block recognised in a lexical entry: each one is processed by its own dedicated model.
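The cascade described above can be pictured as a small tree in which each model owns a set of labels and hands one of its blocks to the next model. The sketch below is only an illustration of that idea, not GROBID-Dictionaries code: the `Model` class, the `cascade_path` helper and the "Lexical Entry" label of the second model are hypothetical names introduced here, while the other labels are taken from the descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Model:
    """Hypothetical stand-in for one model of the cascade."""
    name: str
    labels: List[str]
    # Maps a label of this model to the model that processes that block next.
    children: Dict[str, "Model"] = field(default_factory=dict)


# Third model: segments one lexical entry into its main blocks.
lexical_entry = Model(
    "Lexical Entry",
    ["Form", "Etymology", "Senses", "Related Entries", "dictScrap"],
)

# Second model: finds entry boundaries inside the Body
# (the label name "Lexical Entry" is an assumption).
body_segmentation = Model(
    "Dictionary Body Segmentation",
    ["Lexical Entry"],
    {"Lexical Entry": lexical_entry},
)

# First model: segments a page into its main blocks.
dictionary_segmentation = Model(
    "Dictionary Segmentation",
    ["Headnote", "Body", "Footnote", "dictScrap"],
    {"Body": body_segmentation},
)


def cascade_path(root: Model, target_label: str) -> Optional[List[str]]:
    """Return the names of the models traversed to produce target_label."""
    if target_label in root.labels:
        return [root.name]
    for child in root.children.values():
        sub_path = cascade_path(child, target_label)
        if sub_path is not None:
            return [root.name] + sub_path
    return None


if __name__ == "__main__":
    print(cascade_path(dictionary_segmentation, "Etymology"))
```

Looking up "Etymology" walks through all three models, while "Headnote" is resolved by the first model alone, which mirrors how text information is passed down the cascade.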

N.B: The current architecture could change at any milestone of the project, as soon as new ideas or technical constraints emerge.


GROBID-Dictionaries takes as input a file in PDF or ALTO format. Each model of the components described above generates a TEI P5-encoded hierarchy of the text structures recognised at its level of the cascade. The final serialised output is in line with the new version of LMF (Romary et al. 2019) and the TEI-Lex-0 initiative (Romary and Tasovac 2018).
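To give a rough idea of what a TEI-encoded hierarchy for one lexical entry could look like, the sketch below serialises labelled blocks into nested XML elements. It is not the tool's actual serialiser or its exact output: the `entry_to_tei` function, the `BLOCK_TO_TEI` mapping and the sample entry are assumptions made for illustration. The element names `entry`, `form`, `etym`, `sense`, `re` and `dictScrap` do exist in the TEI P5 dictionaries module, but the output really produced by GROBID-Dictionaries may differ.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the block labels of the Lexical Entry model
# to TEI P5 element names (illustrative, not taken from the tool).
BLOCK_TO_TEI = {
    "Form": "form",
    "Etymology": "etym",
    "Senses": "sense",
    "Related Entries": "re",
}


def entry_to_tei(blocks):
    """Serialise (label, text) pairs, in document order, as one <entry>."""
    entry = ET.Element("entry")
    for label, text in blocks:
        # Anything without a known label falls back to <dictScrap>.
        tag = BLOCK_TO_TEI.get(label, "dictScrap")
        child = ET.SubElement(entry, tag)
        child.text = text
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical entry, invented for this example.
    sample = [("Form", "chat"), ("Senses", "cat"), ("Unknown", "...")]
    print(entry_to_tei(sample))
```

The fallback to `dictScrap` reflects the role that block plays in the cascade: text that fits none of the main blocks is kept rather than dropped.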


The most recent version of the system is available online. The models of this version are trained with samples from 5 different dictionaries that you can download and parse with GROBID-Dictionaries. This video illustrates a use case of different models of the system.

Docker Use

To skip the manual installation, follow the Docker manual to run the latest image of the tool.

To Cite

Mohamed Khemakhem, Luca Foppiano, Laurent Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands. hal-01508868v2

Mohamed Khemakhem, Axel Herold, Laurent Romary. Enhancing Usability for Automatically Structuring Digitised Dictionaries. GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan. hal-01708137v2

More Reading

Romary, Laurent et al. (2019). “LMF Reloaded”. In: AsiaLex 2019: Past, Present and Future. Istanbul, Turkey.

Romary, Laurent and Toma Tasovac (2018). “TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources”. In: TEI Conference and Members’ Meeting. Tokyo, Japan.

Hervé Bohbot, Francesca Frontini, Giancarlo Luxardo, Mohamed Khemakhem, Laurent Romary. Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré. GLOBALEX 2018 - Globalex workshop at LREC2018, May 2018, Miyazaki, Japan. hal-01728328

Mohamed Khemakhem, Carmen Brando, Laurent Romary, Frédérique Mélanie-Becquet, Jean-Luc Pinol. Fueling Time Machine: Information Extraction from Retro-Digitised Address Directories. JADH2018 "Leveraging Open Data", Sep 2018, Tokyo, Japan. hal-01814189

Mohamed Khemakhem, Laurent Romary, Simon Gabay, Hervé Bohbot, Francesca Frontini, et al. Automatically Encoding Encyclopedic-like Resources in TEI. The annual TEI Conference and Members Meeting, Sep 2018, Tokyo, Japan. hal-01819505

David Lindemann, Mohamed Khemakhem, Laurent Romary. Retro-digitizing and Automatically Structuring a Large Bibliography Collection. European Association for Digital Humanities (EADH) Conference, Dec 2018, Galway, Ireland. hal-01941534


For advanced and development usage, see the detailed documentation of the tool.


Mohamed Khemakhem, Laurent Romary

