Resources

David Campos edited this page Oct 13, 2016 · 2 revisions

Linguistic parsing

Neji supports two different linguistic parsers:

GDep

GDep is a linguistic parser optimized for biomedical texts, supporting tokenization, part-of-speech tagging, lemmatization, chunking and dependency parsing. As provided, GDep already achieves state-of-the-art results in several tasks. However, because words containing the symbols “/”, “-” or “.” are not always split into multiple tokens, we developed an optimized version.

The optimized version of GDep is already included in the distribution package.
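
To illustrate the tokenization issue described above, here is a minimal Python sketch that splits a word on “/”, “-” and “.”. It is not GDep's or Neji's code: the function name `split_on_symbols` is hypothetical, and keeping each symbol as its own token is an assumption made for the example.

```python
import re
from typing import List

# Symbols that the text above says are not always split by the stock parser.
_SPLIT_SYMBOLS = re.compile(r"([/\-.])")


def split_on_symbols(token: str) -> List[str]:
    """Split a token on '/', '-' and '.', keeping each symbol as a token.

    Hypothetical helper for illustration only; the real optimized GDep
    may apply different rules.
    """
    # The capturing group makes re.split return the separators as well;
    # empty strings (e.g. from a leading symbol) are dropped.
    return [piece for piece in _SPLIT_SYMBOLS.split(token) if piece]


if __name__ == "__main__":
    for word in ["IL-2/IL-4", "p53", "1.5"]:
        print(word, "->", split_on_symbols(word))
```

With this sketch, a word without any of the three symbols is returned unchanged as a single token.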

Apache OpenNLP

Apache OpenNLP is a linguistic parser for general texts, supporting sentence splitting, tokenization, part-of-speech tagging and chunking. OpenNLP also supports parsing texts in different languages, by providing different models for each task and language.

The following OpenNLP models are already included in the distribution package:

Language     Sentence splitting   Tokenization   Part-of-Speech tagging   Chunking
English      X                    X              X                        X
French       X                    X              X                        X
Portuguese   X                    X              X                        -
German       X                    X              X                        -
Dutch        X                    X              X                        -
Danish       X                    X              X                        -
Sami         X                    X              X                        -

Dictionaries

Dictionaries must be provided in TSV format, where each line is composed of two tab-separated values:

  1. identifier that contains 4 fields concatenated with a ":", following the template <source>:<id>:<type>:<group>;
  2. names concatenated with a "|".
UMLS:C0001327:T047:DISO acute laryngitis|acute laryngitis nos
UMLS:C0001339:T047:DISO acute pancreatitis|pancreatitis, acute
UMLS:C0001344:T047:DISO pharyngitis nos acute|acute pharyngitis|pharyngitis acute
UMLS:C0001360:T047:DISO acute thyroiditis|thyroiditis acute 
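
The line format above can be read with a few lines of Python. This is a sketch under stated assumptions, not Neji's own loader: the names `DictionaryEntry` and `parse_dictionary_line` are hypothetical, and the identifier is assumed to contain exactly four ":"-separated fields.

```python
from typing import List, NamedTuple


class DictionaryEntry(NamedTuple):
    """One dictionary line: <source>:<id>:<type>:<group> plus its names."""
    source: str
    concept_id: str
    semantic_type: str
    group: str
    names: List[str]


def parse_dictionary_line(line: str) -> DictionaryEntry:
    """Parse one TSV dictionary line (hypothetical helper)."""
    # First column is the identifier, second column holds the names.
    identifier, names_field = line.rstrip("\n").split("\t", 1)
    parts = identifier.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 identifier fields, got {len(parts)}: {identifier!r}")
    # Names are concatenated with "|"; surrounding whitespace is dropped.
    names = [name.strip() for name in names_field.split("|") if name.strip()]
    return DictionaryEntry(parts[0], parts[1], parts[2], parts[3], names)


if __name__ == "__main__":
    entry = parse_dictionary_line(
        "UMLS:C0001339:T047:DISO\tacute pancreatitis|pancreatitis, acute"
    )
    print(entry.group, entry.names)
```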

To specify the priority of the dictionaries, which may be used for disambiguation, provide a file with one dictionary file name per line:

jochem_CHED.tsv
Congenital_Abnormality_T019_DISO.tsv
Acquired_Abnormality_T020_DISO.tsv
Anatomical_Abnormality_T190_DISO.tsv
Sign_or_Symptom_T184_DISO.tsv
Cell_or_Molecular_Dysfunction_T049_DISO.tsv
Neoplastic_Process_T191_DISO.tsv
Mental_or_Behavioral_Dysfunction_T048_DISO.tsv
Disease_or_Syndrome_T047_DISO.tsv
Pathologic_Function_T046_DISO.tsv
Molecular_Function_FUNC.tsv
Biological_Process_PROC.tsv 

The priority file must be named "_priority" and be located in the same folder as the dictionaries:

_priority
jochem_CHED.tsv
Congenital_Abnormality_T019_DISO.tsv
Acquired_Abnormality_T020_DISO.tsv
Anatomical_Abnormality_T190_DISO.tsv
Sign_or_Symptom_T184_DISO.tsv
Cell_or_Molecular_Dysfunction_T049_DISO.tsv
Neoplastic_Process_T191_DISO.tsv
Mental_or_Behavioral_Dysfunction_T048_DISO.tsv
Disease_or_Syndrome_T047_DISO.tsv
Pathologic_Function_T046_DISO.tsv
Molecular_Function_FUNC.tsv
Biological_Process_PROC.tsv 
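
As a sketch of how such a folder could be read, the Python function below returns the dictionary files in the order listed in "_priority". The function name `read_priority` is hypothetical and this is not Neji's implementation; the result simply preserves the file order, and how that order is used for disambiguation is left to the caller.

```python
from pathlib import Path
from typing import List


def read_priority(folder: str) -> List[Path]:
    """Return dictionary paths in the order given by <folder>/_priority.

    Hypothetical helper: blank lines are skipped and every listed file
    is resolved relative to the dictionaries folder.
    """
    folder_path = Path(folder)
    priority_file = folder_path / "_priority"
    ordered: List[Path] = []
    for raw in priority_file.read_text(encoding="utf-8").splitlines():
        name = raw.strip()
        if name:
            ordered.append(folder_path / name)
    return ordered
```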

Models

Each model is described using a properties file, defining the following characteristics:

  • File: compressed file that contains the model;
  • Config: file that contains the feature configuration of the model. See the Gimli documentation for details;
  • Parsing: text parsing direction used to train the model;
  • Group: semantic group of the annotations generated by the model;
  • Normalization dictionaries: if normalization is required, provide the folder with the dictionaries, following the same approach previously described for dictionaries. If no normalization is required, this field should not be provided.

A model properties file should look like this (provided paths to files and folders are relative):

file=bc2_bw_o2_windows_model.gz
config=bc2_bw_o2_windows_model.config
parsing=BW
group=PRGE
dictionaries=normalization/
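
The following Python sketch reads such a properties file and checks that the expected keys are present. It is an illustration only: `load_model_properties` is a hypothetical name, the parser handles just simple `key=value` lines rather than the full Java properties syntax, and treating `dictionaries` as the only optional key is an assumption based on the list above.

```python
from typing import Dict

# Keys assumed mandatory; "dictionaries" is only needed for normalization.
REQUIRED_KEYS = {"file", "config", "parsing", "group"}


def load_model_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines from a model properties file."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    missing = REQUIRED_KEYS - props.keys()
    if missing:
        raise ValueError(f"missing properties: {sorted(missing)}")
    return props


if __name__ == "__main__":
    example = (
        "file=bc2_bw_o2_windows_model.gz\n"
        "config=bc2_bw_o2_windows_model.config\n"
        "parsing=BW\n"
        "group=PRGE\n"
        "dictionaries=normalization/\n"
    )
    print(load_model_properties(example))
```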

Just like dictionaries, models are provided using a priority file, where each line is the path to the properties file of one model:

model/model.properties 

Models and their configuration files can be organized as follows:

_priority
model
	model.properties
	model.gz
	model.config
	normalization
		_priority
		dictionary.tsv
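
The layout above could be traversed as in the Python sketch below. This is not Neji's loader: `load_models` is a hypothetical name, and resolving the relative paths against the folder that holds each properties file is an assumption, since the text only says that the paths are relative.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union

ModelInfo = Dict[str, Union[str, Path, None]]


def load_models(root: str) -> List[ModelInfo]:
    """Follow <root>/_priority and describe each listed model."""
    root_path = Path(root)
    models: List[ModelInfo] = []
    for raw in (root_path / "_priority").read_text(encoding="utf-8").splitlines():
        entry = raw.strip()
        if not entry:
            continue
        props_path = root_path / entry  # e.g. model/model.properties
        props: Dict[str, str] = {}
        for line in props_path.read_text(encoding="utf-8").splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                props[key.strip()] = value.strip()
        # Assumption: relative paths are resolved against the model folder.
        base = props_path.parent
        dictionaries: Optional[Path] = None
        if "dictionaries" in props:
            dictionaries = base / props["dictionaries"]
        models.append({
            "file": base / props["file"],
            "config": base / props["config"],
            "parsing": props["parsing"],
            "group": props["group"],
            "dictionaries": dictionaries,
        })
    return models
```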

Gene and Protein

Neji already provides a model for the recognition of Gene and Protein names. It was trained on the GENETAG corpus used in the BioCreative II Gene Mention (BC2) task, applying a rich feature set and optimized post-processing tasks. Further information can be found in the publication.

This model achieves the following high-performance results:

Precision   Recall   F-measure
90.24%      84.99%   87.54%
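
As a consistency check on these figures, the short Python sketch below recomputes the F-measure from precision and recall. It assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the page does not state explicitly.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall taken from the table above, as fractions.
    score = f_measure(0.9024, 0.8499)
    print(f"{100 * score:.2f}%")
```

Under that assumption, the recomputed value agrees with the reported 87.54% after rounding to two decimals.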

The Gene and Protein model is already included in the distribution package.