Skip to content
Generic framework for information extraction tasks, including recognition of named entities, temporal expressions, spatial expressions and events.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
g419-corpus
g419-crete-cli
g419-crete-core
g419-external-dependencies
g419-lib-cli
g419-liner2-cli
g419-liner2-core
g419-liner2-daemon
g419-spatial-cli
g419-spatial-core
g419-toolbox
g419-tools-cli
gradle/wrapper
lib
stuff
.gitignore
.travis.yml
README.md
build.gradle
crete-cli
gradlew
gradlew.bat
install-dependencies-travisci-ubuntu1404.sh
liner2-cli
liner2-daemon
log4j.properties
settings.gradle
sonar-project.properties
spatial-cli
tools-cli

README.md

Liner2.6

Build Status Coverage Status License: LGPL v3

Copyright (C) Wrocław University of Science and Technology (PWr), 2010-2018. All rights reserved.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

  • Michał Marcińczuk (2010–present),
  • Jan Kocoń (2014–present),
  • Adam Kaczmarek (2014–2015),
  • Michał Krautforst (2013-2015),
  • Dominik Piasecki (2013),
  • Maciej Janicki (2011)

Citing

System architecture and KPWr NER models

Marcińczuk, Michał; Kocoń, Jan; Oleksy, Marcin. Liner2 — a Generic Framework for Named Entity Recognition In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 86–91, Valencia, Spain, 4 April 2017. Association for Computational Linguistics

[PDF]

[Bibtex]

@InProceedings{W17-1413,
  author = 	"Marci{\'{n}}czuk, Micha{\l}
		and Koco{\'{n}}, Jan
		and Oleksy, Marcin",
  title = 	"Liner2 --- a Generic Framework for Named Entity Recognition",
  booktitle = 	"Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing",
  year = 	"2017",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"86--91",
  location = 	"Valencia, Spain",
  doi = 	"10.18653/v1/W17-1413",
  url = 	"http://aclweb.org/anthology/W17-1413"
}

NKJP NER model

Marcińczuk, Michał; Kocoń, Jan; Gawor, Michał. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches Ogrodniczuk, Maciej; Kobyliński, Łukasz (Eds.): Proceedings of the PolEval 2018 Workshop, pp. 63-73, Institute of Computer Science, Polish Academy of Science, Warszawa, 2018.

[PDF]

[Bibtex]

@inproceedings{poldeepner2018,
  title     = "Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches",
  author    = "Marcińczuk, Michał and Kocoń, Jan and Gawor, Michał",
  year      = "2018",
  editor    = "Ogrodniczuk, Maciej and Kobyliński, Łukasz",
  booktitle = "Proceedings of the PolEval 2018 Workshop",
  location  = "Warsaw, Poland",
  pages     = "77--92",
  publisher = "Institute of Computer Science, Polish Academy of Science"
}

Requirements

Compilation

  • Java 8
  • C++ compiler (gcc 3.0 or higher) for CRF++

Runtime

Optional libraries:

Installation

Compile

If you do not have CRF++ installed then do the following steps:

cd g419-external-dependencies
tar -xvf CRF++-0.57.tar.gz
cd CRF++-0.57
./configure
make
sudo make install
sudo ldconfig

Then:

./gradlew jar

Runtime test

./liner2-cli

Output:

*-----------------------------------------------------------------------------------------------*
* A framework for multitask sequence labeling, including: named entities, temporal expressions. *
*                                                                                               *
* Authors: Michał Marcińczuk (2010–2016), Jan Kocoń (2014–2016), Adam Kaczmarek (2014–2015)     *
*    Past: Michał Krautforst (2013-2015), Dominik Piasecki (2013), Maciej Janicki (2011)        *
* Contact: michal.marcinczuk@pwr.wroc.pl                                                        *
*                                                                                               *
*          G4.19 Research Group, Wrocław University of Technology                               *
*-----------------------------------------------------------------------------------------------*


Use one of the following tools:
 - agreement           -- checks agreement (of annotations) between suplied documents
 - agreement2          -- compare sets of annotations for each pair of corpora. One set is
                          treated as a reference set and the other as a set to evaluate. It is a
                          refactored version of the agreement action.
 - annotations         -- generates an arff file with a list of annotations and their features
 - constituents-eval   -- evaluates normalizer against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT)
 - convert             -- converts documents from one format to another and applies defined
                          converters
 - curve               -- brak opisu
 - eval                -- evaluates chunkers against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT) #or perform cross validation (-i cv:{format})
 - eval-unique         -- evaluates chunkers against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT) #or perform cross validation (-i
                          cv:{format}). The evaluation is performed on the sets#with unique
                          annotations, i.e. annotations with the same orth/base are treated as a
                          single annotation
 - inplace             -- process documents in place
 - interactive         -- processes text entered directly into the terminal
 - lemmatize           -- ToDo
 - normalizer-eval3    -- processes data with given model
 - normalizer-validate -- Read all annotation and their metadata and look for errors.
 - pipe                -- processes data with given model
 - search              -- earches for a phrases matching given pattern based on a set of token
                          features
 - selection           -- todo
 - stats               -- prints corpus statistics
 - train               -- trains chunkers

usage: ./liner2-cli [action] [options]

Pre-trained models

KPWr NER for Polish

The package contains three models for recognition named entities according to KPWr NE guidelines.

  • nam — named entity boundaries,
  • top9 — coarse-grained categories,
  • n82 — fine-grained categories.

Resources:

Download the package:

cd Liner2
wget -O liner25_model_ner_rev1.7z https://clarin-pl.eu/dspace/bitstream/handle/11321/263/liner25_model_ner_rev1.7z 

Unpack the package:

7z x liner25_model_ner_rev1.7z

Process a sample CCL file:

./liner2-cli pipe -i ccl -o tuples -f stuff/resources/sample-sentence.xml -m liner25_model_ner_rev1/config-top9.ini

Expected output:

(4,11,nam_liv,"Ala Nowak")
(20,28,nam_loc,"Warszawie")

PolEval 2018 Task 2: Named Entity Recognition

Mirror: https://www.dropbox.com/s/wem3fp685zleuq6/liner26_model_ner_nkjp.zip?dl=0

DSpace page: https://clarin-pl.eu/dspace/handle/11321/598 (temporarily off-line)

Direct link to the package: https://clarin-pl.eu/dspace/bitstream/handle/11321/598/liner26_model_ner_nkjp.zip (temporarily off-line)

Liner2 participated in PolEval 2018 Task 2 on named entity recognition. It got a third place with the following scores:

Metric F1 score
Final 0.810
Exact 0.778
Overlap 0.818

Download the package with model:

cd Liner2
wget -O liner26_model_ner_nkjp.zip https://clarin-pl.eu/dspace/bitstream/handle/11321/598/liner26_model_ner_nkjp.zip 

Unpack the model:

unzip liner26_model_ner_nkjp.zip

Process a sample CCL file:

./liner2-cli pipe -i ccl -o tuples -f stuff/resources/sample-sentence.xml -m liner26_model_ner_nkjp/config-nkjp-poleval2018.ini

Expected output:

(4,6,null,persname_forename,"Ala","Ala")
(4,11,null,persName,"Ala Nowak","Ala Nowak")
(7,11,null,persname_surname,"Nowak","Nowak")
(20,28,null,placename_settlement,"Warszawie","Warszawie")

Service mode (using RabbitMQ)

Introduction

Liner2 can be run as a service which listen to a RabbitMQ queue for upcomming requests (liner2-input). and submit the results to another queue (liner2-output). The input message (send by the client) should have the following format:

ROUTE_KEY PATH

Where:

  • ROUTE_KEY — name of a route used to post the results to the output queue. The routing key is used by the client to receive the response for their request ignoring others,
  • PATH — an absolute path to the file to process.

For example:

client-001 /tmp/document.txt

The message send by the service will contain path to a file which contains the output of processing.

Running the service

./liner2-daemon rabbitmq -m liner26_model_ner_nkjp/config-nkjp-poleval2018.ini -i plain:wcrft

Expected output:

 INFO [Thread-1] (RabbitMqWorker.java:91) - Listing to RabbitMQ on channel liner2-input ...
Consumer amq.ctag-m6D9fIMI_Qsm61BH7HoxlA registered

It is possible to run more than one instance of ./liner2-daemon rabbitmq. However, all of them should use the same model and input format.

Testing

Folder stuff/python contains a Python script to test the communication with the service. The script takes a text to process, stores the texts in a temporal file, generates a routing key, send both to the liner2-input queue and listen to liner2-output. After receiving the response it reads the output file, removes both temporal files and prints the output.

python3 stuff/python/liner2rmq.py "Pani Ala Nowak mieszkw w Zielonej Górze"

The output should be as follows:

[INFO] Temp route: route-1DVRP4
[INFO] Temp input file: /tmp/amu7_3at
[INFO] Sent msg 'route-1DVRP4 /tmp/amu7_3at' to liner2-input
[INFO] Temp output file: b'/tmp/amu7_3at-ner.xml'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1">
  <sentence id="s1">
   <tok>
    <orth>Pani</orth>
    <lex disamb="1"><base>pani</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Ala</orth>
    <lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname" head="1">1</ann>
    <ann chan="persname_forename" head="1">1</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Nowak</orth>
    <lex disamb="1"><base>Nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <lex disamb="1"><base>nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <ann chan="persname">1</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname" head="1">1</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>mieszka</orth>
    <lex disamb="1"><base>mieszkać</base><ctag>fin:sg:ter:imperf</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>w</orth>
    <lex disamb="1"><base>w</base><ctag>prep:loc:nwok</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Zielonej</orth>
    <lex disamb="1"><base>zielony</base><ctag>adj:sg:loc:f:pos</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement" head="1">1</ann>
   </tok>
   <tok>
    <orth>Górze</orth>
    <lex disamb="1"><base>góra</base><ctag>subst:sg:loc:f</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">1</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>

Logs on the server side:

 INFO [pool-1-thread-5] (RabbitMqWorker.java:99) - Received path: '/tmp/amu7_3at'
 INFO [pool-1-thread-5] (RabbitMqWorker.java:108) - Output saved to /tmp/amu7_3at
 INFO [pool-1-thread-5] (RabbitMqWorker.java:121) - Sent /tmp/amu7_3at-ner.xml to liner2-output:route-1DVRP4'
 INFO [pool-1-thread-5] (RabbitMqWorker.java:84) - Request processing done
You can’t perform that action at this time.