EDH and Pelagios NER

sarahmiddle edited this page Jun 8, 2017 · 11 revisions

Participants: Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta

The aim of this group was to use Named Entity Recognition (NER) on the text of inscriptions from the Epigraphic Database Heidelberg (EDH) to identify placenames, which could then be linked to their equivalent terms in the Pleiades gazetteer and thereby integrated with Pelagios Commons.

  1. Work to be done on XML documents:

    a. Method 1: Strip the XML from the content of div[type=edition]/ab and run NER on the plain text. Script to extract the inscription text from the XML and strip out the XML tags: https://github.com/EpiDoc/OEDUc/blob/master/ExtractInscriptionTextFromEDHXML_V1_20170515.py

    b. Method 2: Tokenize everything and run NER process on tokens

  2. Run the NER process for Method 1, drawing on:

    a. Sunoikisis NER I class (https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-I)

    b. Sunoikisis NER II class (https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-II)

  3. Checking and refining, further training
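
The text-preparation step in 1 can be sketched roughly as follows. This is a minimal illustration, not a reproduction of the linked script: it assumes the EDH files are EpiDoc XML in the standard TEI namespace, the sample fragment is invented and heavily simplified, and the function names `extract_edition_text` and `tokenize` are hypothetical. The NER step itself is left out, since the tool used is not specified here.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: EDH EpiDoc files use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified EpiDoc fragment for illustration only.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="edition"><ab>'
    '<lb n="1"/><expan><abbr>I</abbr><ex>ovi</ex></expan> '
    '<expan><abbr>O</abbr><ex>ptimo</ex></expan>'
    '<lb n="2"/> <expan><abbr>M</abbr><ex>aximo</ex></expan>'
    '</ab></div>'
    '</body></text></TEI>'
)


def extract_edition_text(xml_string):
    """Method 1: return the plain text of div[type=edition]/ab, tags removed."""
    root = ET.fromstring(xml_string)
    parts = []
    for ab in root.findall(".//tei:div[@type='edition']/tei:ab", TEI_NS):
        # itertext() yields every text node inside <ab>, in document order.
        parts.append("".join(ab.itertext()))
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(" ".join(parts).split())


def tokenize(text):
    """Method 2 (simplified): split the text into word tokens."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    plain = extract_edition_text(SAMPLE)
    print(plain)
    print(tokenize(plain))
```

Joining the text nodes without a separator keeps expanded abbreviations together as single words, so the NER step would see `Iovi` rather than `I ovi`. Real inscriptions contain further markup (gaps, supplied text, alternative readings) that would need deliberate handling before either method gives reliable input.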