Named Entity Recognition data for Europeana Newspapers


Files

Name              Latest commit message                                                     Commit time
enp_DE.lft.bio    trim trailing whitespaces                                                 Sep 8, 2017
enp_DE.onb.bio    file naming                                                               Jul 27, 2016
enp_DE.sbb.bio    change comments from <-- --> style to # in enp_DE.sbb.bio (first step…    May 14, 2019
enp_FR.bnf.bio    file naming                                                               Jul 27, 2016
enp_NL.kb.bio     file naming                                                               Jul 27, 2016
CONTRIBUTING.md   Update CONTRIBUTING.md                                                    Nov 30, 2017
LICENSE.md        create LICENSE.md                                                         Oct 20, 2017
README.md         Update README.md                                                          Nov 30, 2017


ner-corpora

Named Entity Recognition corpora for Dutch, French and German from Europeana Newspapers.

Introduction

The corpora consist of one file per data provider, encoded in the IOB format (Ramshaw & Marcus, 1995). IOB is a simple text chunking format that puts a single token on each line, followed by whitespace and a tag that marks named entities. The most commonly used tag categories are PER (person), LOC (location) and ORG (organization). To mark named entities that span multiple tokens, each tag carries a prefix: B- (beginning of a named entity) or I- (inside a named entity). The tag O (outside of a named entity) marks tokens that are not part of a named entity.

Example:

The O
NBA B-ORG
player O
Michael B-PER
Jordan I-PER
is O
from O
the O
United B-LOC
States I-LOC
of I-LOC
America I-LOC
. O
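
Reading this format back into entity spans can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the corpora: the function names parse_iob and extract_entities are hypothetical, each non-blank line is assumed to hold exactly one token followed by one tag, and comment lines (such as those in enp_DE.sbb.bio) are not handled.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_iob(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs, skipping blank or malformed lines."""
    for line in lines:
        # Split on the last run of whitespace: token on the left, tag on the right.
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # assumption: lines without two fields carry no annotation
        yield parts[0], parts[1]


def extract_entities(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, category) tuples."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in pairs:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open entity, then open a new one.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)  # continuation of the open entity
        else:
            # An O tag (or anything else) ends the open entity.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


SAMPLE = """The O
NBA B-ORG
player O
Michael B-PER
Jordan I-PER
is O
from O
the O
United B-LOC
States I-LOC
of I-LOC
America I-LOC
. O"""

print(extract_entities(parse_iob(SAMPLE.splitlines())))
# [('NBA', 'ORG'), ('Michael Jordan', 'PER'), ('United States of America', 'LOC')]
```

An I- tag that does not continue an open entity of the same category is treated here as the start of a new entity. That is a lenient design choice for noisy data; a stricter reader could reject such sequences instead.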

Background

The IOB files in this repository are based on OCRed and manually annotated historical newspapers from these libraries:

To download the source ALTO OCR files or the trained CRF classifier binaries, please go here.

License

CC0

Attribution

Europeana Newspapers NER corpora
https://github.com/EuropeanaNewspapers/ner-corpora/
Europeana Newspapers Project, 2012-2015
http://www.europeana-newspapers.eu/

References

Known issues

Because of the way the corpora above were produced, additional work is required before the data can be used for tasks such as evaluation, where gold-standard quality is needed: the data still contains many OCR errors. In addition, parts of sentences with a high degree of noise were cut during post-processing, which makes it difficult to map the annotated texts back to the original newspaper articles and may have unintended effects on classification.

Further information on data quality issues, along with instructions for cleaning up the data, can be found in the wiki.
