
GROBID-Dictionaries


Purpose

GROBID-Dictionaries is a GROBID sub-module that implements a Java machine learning library for structuring digitized lexical resources. It supports the parsing, extraction and structuring of text information in such resources.

To Cite

Mohamed Khemakhem, Luca Foppiano, Laurent Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands.

    @inproceedings{khemakhem:hal-01508868,
      TITLE = {{Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields}},
      AUTHOR = {Khemakhem, Mohamed and Foppiano, Luca and Romary, Laurent},
      URL = {https://hal.archives-ouvertes.fr/hal-01508868},
      BOOKTITLE = {{electronic lexicography, eLex 2017}},
      ADDRESS = {Leiden, Netherlands},
      YEAR = {2017},
      MONTH = Sep,
      KEYWORDS = {digitized dictionaries ; automatic structuring ; CRF ; TEI ; machine learning},
      PDF = {https://hal.archives-ouvertes.fr/hal-01508868/file/eLex-2017-Template.pdf},
      HAL_ID = {hal-01508868},
    }

Mohamed Khemakhem, Axel Herold, Laurent Romary. Enhancing Usability for Automatically Structuring Digitised Dictionaries. GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan.

    @inproceedings{khemakhem:hal-01708137,
      TITLE = {{Enhancing Usability for Automatically Structuring Digitised Dictionaries}},
      AUTHOR = {Khemakhem, Mohamed and Herold, Axel and Romary, Laurent},
      URL = {https://hal.archives-ouvertes.fr/hal-01708137},
      BOOKTITLE = {{GLOBALEX workshop at LREC 2018}},
      ADDRESS = {Miyazaki, Japan},
      YEAR = {2018},
      MONTH = May,
      KEYWORDS = {Docker ; TEI ; digitised dictionaries ; electronic lexicography ; usability},
      PDF = {https://hal.archives-ouvertes.fr/hal-01708137/file/LREC-GLOBALEX2018.pdf},
      HAL_ID = {hal-01708137},
    }

Approach

GROBID-Dictionaries is based on cascading CRF models. The diagram below presents the architecture that enables the processing of text information and its transfer through the models.

[Figure: GROBID-Dictionaries model structure]

Dictionary Segmentation: This is the first model; its goal is to segment each dictionary page into three main blocks: Headnote, Body and Footnote. A further block, "dictScrap", can be generated for text information that does not belong to the principal blocks.

Dictionary Body Segmentation: The second model takes the Body recognized by the first model and processes it to recognize the boundaries of each lexical entry.

Lexical Entry: The third model parses each lexical entry recognized by the second model and segments it into four main blocks: Form, Etymology, Senses and Related Entries. A "dictScrap" block is available here as well for unrecognised information.

The rest of the models: The same logic applies, respectively, to the blocks recognised in a lexical entry, with a specific model dedicated to each of them.

N.B.: The current architecture may change at any milestone of the project, as soon as new ideas or technical constraints emerge.

Input/Output

GROBID-Dictionaries takes as input lexical resources digitized in PDF format. Each model of the aforementioned components generates a TEI P5-encoded hierarchy of the different recognized text structures at that specific cascading level.
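
To give an idea of the output format, the following is a minimal sketch of the kind of nested TEI P5 hierarchy the cascade can produce for a single lexical entry. The element names come from the TEI P5 dictionary module, but the exact structure and attributes depend on the input resource and the trained models, so this sample is illustrative rather than actual output of the tool.

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <text>
        <body>
          <!-- one <entry> per lexical entry recognised by the body segmentation model -->
          <entry>
            <form type="lemma">
              <orth>dictionary</orth>
            </form>
            <etym>from Medieval Latin dictionarium</etym>
            <sense>
              <def>a reference work listing the words of a language with their meanings</def>
            </sense>
            <!-- related entries -->
            <re>
              <form><orth>dictionarist</orth></form>
            </re>
          </entry>
        </body>
      </text>
    </TEI>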

Docker Use

To shortcut the installation of the tool, the Docker manual can be followed to run the latest image of the tool.
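
As a rough sketch of how such a setup is typically used (the image name, tag and exposed port below are assumptions to be verified against the Docker manual, not guaranteed values), pulling and running the image could look like this:

    # Pull the latest published image (image name assumed; check the Docker manual)
    docker pull medkhem/grobid-dictionaries:latest

    # Run the container and expose the web service on port 8080 (assumed default)
    docker run -t --rm -p 8080:8080 medkhem/grobid-dictionaries:latest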

Documentation

For more advanced and development uses, the detailed documentation of the tool is available here.

Contact

Mohamed Khemakhem (mohamed.khemakhem@inria.fr), Patrice Lopez (patrice.lopez@science-miner.com), Luca Foppiano (luca.foppiano@inria.fr)