
entity-detection-for-historical-dutch

Entity identification for historical documents in Dutch, developed within the Clariah+ project at VU Amsterdam.

While the primary use case is to process historical Dutch documents, the more general idea of this project is to develop an adaptive framework that can process any set of Dutch documents. This means, for instance, documents with or without recognized entities (gold NER or not); documents with entities that can be linked or not (in-KB entities or not), etc.

We achieve this flexibility in two ways:

  1. we create an unsupervised system based on recent techniques, such as BERT embeddings
  2. we involve human experts by allowing them to enrich or alter the tool's output

Current algorithm in a nutshell

The current solution is entirely unsupervised, and works as follows:

  1. Obtain documents (supported formats so far: mediawiki and NIF)
  2. Extract entity mentions (gold NER or by running SpaCy)
  3. Create initial NAF documents containing recognized entities too
  4. Compute BERT sentence+mention embeddings
  5. Enrich them with word2vec document embeddings
  6. Bucket mentions based on their similarity
  7. Cluster the embeddings within each bucket with the HAC (hierarchical agglomerative clustering) algorithm
  8. Run evaluation with a Rand index-based score (see the sketch below)
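
To make steps 4, 7, and 8 concrete, here is a minimal sketch of that part of the pipeline: mention embeddings from a Dutch BERT model, HAC over one bucket, and a Rand index-based score. The model name, the distance threshold, and the toy sentences and gold labels are illustrative assumptions, not the project's actual settings; the real implementation lives in bert_utils.py, algorithm_utils.py, and evaluation.py.

```python
# Illustrative sketch of steps 4, 7 and 8: BERT embeddings -> HAC -> Rand index.
# Model name, threshold, sentences and gold labels are assumptions for this example.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

MODEL = "GroNLP/bert-base-dutch-cased"  # assumed Dutch BERT model (BERTje)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(sentence: str) -> np.ndarray:
    """Mean-pool the last hidden layer as a crude sentence+mention embedding."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state.mean(dim=1).squeeze(0).numpy()

# One "bucket": sentences whose mentions share a (near-)identical surface form.
sentences = [
    "Amsterdam is de hoofdstad van Nederland.",
    "De gemeente Amsterdam presenteerde gisteren de begroting.",
    "FC Amsterdam speelde in de jaren zeventig in de eredivisie.",
]
gold = [0, 0, 1]  # hypothetical gold identities for these mentions

embeddings = np.stack([embed(s) for s in sentences])
hac = AgglomerativeClustering(n_clusters=None, distance_threshold=25.0,
                              linkage="average")  # threshold is a placeholder
predicted = hac.fit_predict(embeddings)

print("predicted clusters:", predicted)
print("adjusted Rand index:", adjusted_rand_score(gold, predicted))
```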

Baselines

We compare our identity clustering algorithm against five baselines (two of them are sketched in code after the list):

  1. string-similarity - forms that are identical or sufficiently similar are coreferential.
  2. one-form-one-identity - all occurrences of the same form refer to the same entity.
  3. one-form-and-type-one-identity - all occurrences of the same form, when this form is of the same semantic type, refer to the same entity.
  4. one-form-in-document-one-identity - all occurrences of the same form within a document are coreferential. All occurrences across documents are not.
  5. one-form-and-type-in-document-one-identity - all occurrences of the same form that have the same semantic type within a document are coreferential; the rest are not.
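
For illustration, here is a minimal sketch of two of these baselines, under an assumed data representation (it is not the project's baselines.py): each mention is a (document id, surface form) pair, and each baseline assigns cluster labels that can then be scored against gold identities.

```python
# Sketch of the "one-form-one-identity" and "one-form-in-document-one-identity"
# baselines, using an assumed (document_id, surface form) mention representation.
from typing import List, Tuple

Mention = Tuple[str, str]  # (document_id, surface form)

def one_form_one_identity(mentions: List[Mention]) -> List[str]:
    """All occurrences of the same form refer to the same entity."""
    return [form for _doc, form in mentions]

def one_form_in_document_one_identity(mentions: List[Mention]) -> List[str]:
    """Same form within a document is coreferential; across documents it is not."""
    return [f"{doc}#{form}" for doc, form in mentions]

mentions = [("doc1", "Amsterdam"), ("doc1", "Amsterdam"), ("doc2", "Amsterdam")]
print(one_form_one_identity(mentions))              # one cluster across documents
print(one_form_in_document_one_identity(mentions))  # one cluster per document
```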

Code structure

  • The scripts make_wiki_corpus.py and make_nif_corpus.py create a corpus (as Python classes) from the source data we download in mediawiki or NIF format, respectively. The script make_wiki_corpus.py expects the file data/input_data/nlwikinews-latest-pages-articles.xml as input, which is a collection of Wikinews documents in Dutch in XML format. The script make_nif_corpus.py expects the input file abstracts_nl{num}.ttl, where num is a number between 0 and 43, inclusive. These extraction scripts use some functions from pickle_utils.py and wiki_utils.py.

  • The script main.py executes the algorithm procedure described above. It relies on functions in several utility files: algorithm_utils.py, bert_utils.py, analysis_utils.py, pickle_utils.py, naf_utils.py.

  • Evaluation functions are stored in the file evaluation.py.

  • Baselines are run by executing the file baselines.py (with no arguments).

  • The classes we work with are defined in the file classes.py.

  • Configuration files are found in the folder cfg. These are loaded through the script config.py (a loading sketch follows this list).

  • All data is stored in the folder data.
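
As an illustration of the configuration setup, the snippet below shows how a file such as cfg/wikinews50.cfg could be read with Python's standard configparser. The actual loading logic is defined in config.py, and the section and option layout of the .cfg files is not reproduced here.

```python
# Hedged illustration only: read a .cfg file with configparser and print its
# contents. The project's real loading logic lives in config.py.
import configparser

config = configparser.ConfigParser()
config.read("cfg/wikinews50.cfg")  # wikinews50.cfg is the config mentioned in this README

for section in config.sections():
    for option, value in config.items(section):
        print(f"[{section}] {option} = {value}")
```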

Preparation to run the script: Install and download

To prepare your environment with the right packages, run bash install.sh.

Then download the corpora you would like to work with and store them in data/{corpus_name}/input_data. To reuse the config files found in cfg and run on wikinews or dbpedia abstracts, you can do the following:

  1. for wikinews, download nlwikinews-latest-pages-articles.xml, for example from here. Then store it in data/wikinews/input_data (make sure you unpack it).
  2. for dbpedia_abstracts, you can download .ttl files from this link. Each .ttl file contains many abstracts, so it is advisable to start with a single file to understand what is going on. Download and unpack the file, then store it in data/dbpedia_abstracts/input_data.

Then you should be able to run make_wiki_corpus.py and make_nif_corpus.py to load the corpora, and to run main.py directly to process the corpora with our tool. Make sure that you use the right config file in these scripts (e.g., wikinews50.cfg will let you process 50 files from Wikinews).

Authors
