[PlanTL/medicine/terminological resource retrieval] A multilingual medical term extraction tool.

PlanTL-GOB-ES/CUTEXT

CUTEXT - Cvalue Used To EXtract Terms

Introduction

The heavy use of medical terms has motivated the construction of large terminological resources for English, such as the Unified Medical Language System (UMLS) or the Open Biological and Biomedical Ontology (OBO) ontologies. Purely manual construction of terminological resources is valuable in itself, but it is highly time-consuming, it does not guarantee that the included concepts or terms match the medical language actually used by healthcare professionals in clinical documents, and it requires constant updating and revision as biomedical concepts change and new ones emerge.

CUTEXT is a multilingual medical term extraction tool. It extracts terms from texts written in English, Spanish, Galician and Catalan.

The main characteristics of CUTEXT are the following:

  • It is implemented in Java, so it is multiplatform. It has been tested under Windows and Linux.
  • It is multilingual: it has been tested in English, Spanish, Catalan and Galician, and it can easily be adapted to other languages by changing the lexical tag text file configuration.
  • Input documents can be plain text or PDF.
  • It can be executed in graphic mode or from the console (command line).
  • It supports numerous configuration parameters; the most important are the language, the tagger, the frequency and C-value thresholds, and the input document(s).
  • The output is provided in plain text, in JSON format and/or in BioC (http://bioc.sourceforge.net/).

A more detailed description of the system can be found in the journal Procesamiento del Lenguaje Natural, published by the Sociedad Española para el Procesamiento del Lenguaje Natural (see Reference).
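
As background for the C-value threshold mentioned above, the following is a minimal sketch of the standard C-value measure (Frantzi, Ananiadou and Mima), not CUTEXT's own code. For a candidate term a of |a| words with frequency f(a), the measure is log2|a| * f(a) when a is not nested in any longer candidate, and log2|a| * (f(a) - (1/P) * sum of f(b)) otherwise, where the sum runs over the P longer candidates b that contain a. The class name, helper names and sample frequencies are hypothetical, and CUTEXT may treat single-word terms or nested counts differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch of the textbook C-value measure; not CUTEXT's actual code. */
final class CValueSketch {

    /** Number of whitespace-separated words in a candidate term. */
    static int length(String term) {
        return term.trim().split("\\s+").length;
    }

    /** True if 'inner' occurs as a contiguous word sequence inside the longer 'outer'. */
    static boolean isNestedIn(String inner, String outer) {
        return length(outer) > length(inner)
                && (" " + outer + " ").contains(" " + inner + " ");
    }

    /** C-value of 'term'; 'freq' must hold the frequency of every candidate, including 'term'. */
    static double cValue(String term, Map<String, Integer> freq) {
        double log2Len = Math.log(length(term)) / Math.log(2);
        int containers = 0;     // P: number of longer candidates that contain the term
        int containerFreq = 0;  // sum of f(b) over those longer candidates
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (isNestedIn(term, e.getKey())) {
                containers++;
                containerFreq += e.getValue();
            }
        }
        double f = freq.get(term);
        if (containers == 0) {
            return log2Len * f;  // term is not nested in any longer candidate
        }
        return log2Len * (f - (double) containerFreq / containers);
    }

    public static void main(String[] args) {
        // Hypothetical candidate frequencies, for illustration only.
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("cell carcinoma", 5);
        freq.put("basal cell carcinoma", 3);
        for (String term : freq.keySet()) {
            System.out.println(term + "\t" + cValue(term, freq));
        }
    }
}
```

With these sample counts, "cell carcinoma" scores 1 * (5 - 3) = 2.0, while "basal cell carcinoma" scores log2(3) * 3, about 4.75.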

Prerequisites

CUTEXT requires TreeTagger to be installed on your computer, and the path to the TreeTagger "bin" folder must be included in the PATH variable. If you are going to process medical or biomedical texts, it is also convenient, although not required, to install GeniaTagger (only valid for biomedical texts in English) or SPACCC_POS-TAGGER, based on FreeLing (only valid for biomedical texts in Spanish).

To convert PDF documents to plain text, we use a script that calls the "ExtractText" class of the Apache pdfbox API, which is packaged in the "pdfbox-app-2.0.5.jar" file inside the "jar_pdf" folder. Therefore, the path to this file must be included in the CLASSPATH variable.

Finally, you must also include in the CLASSPATH variable the path to the "cutext" folder, since CUTEXT is packaged in that folder.
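
The PATH and CLASSPATH requirements above can be checked before the first run. The sketch below is only an illustration: the class PrereqCheck and its helpers are hypothetical, and it assumes that the TreeTagger executable is named tree-tagger (tree-tagger.exe on Windows) and that the pdfbox jar appears in CLASSPATH under its own file name rather than through a wildcard entry.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: checks that the prerequisites look reachable from the environment. */
final class PrereqCheck {

    /** Splits a PATH-like variable into its non-empty entries (null gives an empty list). */
    static List<String> entries(String variable) {
        List<String> result = new ArrayList<>();
        if (variable == null) {
            return result;
        }
        for (String part : variable.split(Pattern.quote(File.pathSeparator))) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    /** Returns the first existing file dir/name over all dirs and names, or null if none exists. */
    static File findIn(List<String> dirs, String... names) {
        for (String dir : dirs) {
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    /** True if the last path component of some entry equals the given name. */
    static boolean hasEntryNamed(List<String> entries, String name) {
        for (String entry : entries) {
            if (new File(entry).getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> path = entries(System.getenv("PATH"));
        List<String> classpath = entries(System.getenv("CLASSPATH"));
        // Assumed executable names; adjust if your TreeTagger build differs.
        File tagger = findIn(path, "tree-tagger", "tree-tagger.exe");
        System.out.println("TreeTagger on PATH: " + (tagger != null ? tagger : "not found"));
        System.out.println("pdfbox jar on CLASSPATH: "
                + hasEntryNamed(classpath, "pdfbox-app-2.0.5.jar"));
    }
}
```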

Directory structure

The CUTEXT directory structure corresponds to a package nomenclature called cutext. Therefore, all packages are within that folder:

cutext/config_files/
includes files with tags, stop-words and punctuation marks in Spanish, Galician, Catalan and English.

cutext/filter/lin/
contains the Java classes that implement the linguistic filter.

cutext/filter/sta/
contains the Java classes that implement the statistical filter.

cutext/gui/
contains the Java classes that implement the graphical user interface (GUI).

cutext/in/
a possible place to put the input file.

cutext/intern/TT/in/
internal storage of the input file for TreeTagger.

cutext/intern/TT/out/
internal storage of the output file generated by TreeTagger.

cutext/intern/TT/x/
internal storage of the intermediate file for TreeTagger.

cutext/jar_pdf/
contains the "pdfbox-app-2.0.5.jar" file used to convert PDF documents to plain text.

cutext/main/
includes the main classes of CUTEXT as well as the file cutext.jar.

cutext/out/fileTextHashTerms/
stores the text output files.

cutext/out/serHashTerms/
stores serialized objects. 

cutext/postagger/
contains the class that invokes the tagger (TreeTagger). 

cutext/prepro/
contains the classes that preprocess the input corpus. 

cutext/properties/
contains the CUTEXT property file.

cutext/stemmer/
contains the classes that allow you to obtain the stem of the words.

cutext/textmode/
contains the classes that allow CUTEXT to be executed from the terminal.

cutext/util/
contains utility classes.

Usage

CUTEXT can be executed in graphic mode or in text mode. In both cases, it is assumed that it is executed from the main folder. If not, change the paths in the properties file cutext.properties and pass the path of this file as an input parameter when invoking CUTEXT. We assume that the .java files have been compiled and that the CLASSPATH variable has been set correctly. For a description of this process, see the installation file Intallation.md.

To execute CUTEXT in graphic mode:

java cutext.main.ExecCutext

To execute CUTEXT in text mode:

java cutext.main.ExecCutext -TM [Options] <-inputFile fileName>

Except for the input file, all options have default values, so it is not necessary to include them.

Options:

-TM
	Execute CUTEXT in text mode (TM).
-help
 	Show the line to execute CUTEXT, and the options.
-displayon 
	Show the messages at the standard output. Default TRUE (show).
-postagger 
	POS tagger used to tag the input file. TreeTagger (default) or GeniaTagger.
-language 
	SPANISH (default) or ENGLISH, CATALAN, GALICIAN.
-frecT 
	Frequency Threshold. Default 0.
-cvalueT 
	C-Value Threshold. Default 0.0.
-bioc 
	Create a BioC output. Default false.
-json 
	Create a JSON output. Default false.
-convert 
	If true then convert the input file into lower case. Default true.
-withoutcvalue 
	If true then execute only the linguistic filter. Default false.
-incremental 
	If true then execute one line of the file as an entire corpus. Default false.
-generateTextFile 
	If true then create one text file per hashTerms, from 'a' to 'z'. Also create a raw text file with terms 
	sorted by cvalue. Default false.
-routeHashTerms 
	Folder where you want to store the hash terms.
-routeTextFileHashTerms 
	Folder where you want to store the text file hash terms.
-routeconfigfiles 
	Folder where the config files are stored.
-routeinterntt 
	Temporary folder (TT).
-routeFreelingScript 
	File that executes Freeling Script (under Linux). Default cutext/postagger/scriptFreeling.sh.
-inputFile 
	The document to use.
-outputFile 
	The file to write the result to.
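
When CUTEXT is launched from another Java program, the text-mode command line can be assembled from the options listed above. This is a minimal sketch, assuming that each option takes its value as the next argument (as -generateTextFile true does in the Examples section); the class CutextCommand and the sample threshold values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: assembles a text-mode CUTEXT command line from option/value pairs. */
final class CutextCommand {

    /**
     * Builds the argument list. Option names are given without the leading dash;
     * the mandatory input file is appended last.
     */
    static List<String> build(String inputFile, Map<String, String> options) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("cutext.main.ExecCutext");
        cmd.add("-TM"); // text mode
        for (Map.Entry<String, String> option : options.entrySet()) {
            cmd.add("-" + option.getKey());
            cmd.add(option.getValue());
        }
        cmd.add("-inputFile");
        cmd.add(inputFile);
        return cmd;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("language", "ENGLISH");
        options.put("frecT", "2");      // hypothetical frequency threshold
        options.put("cvalueT", "1.5");  // hypothetical C-value threshold
        List<String> cmd = build("../in/in.txt", options);
        System.out.println(String.join(" ", cmd));
        // To launch it for real, something like the following could be used:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The sketch prints java cutext.main.ExecCutext -TM -language ENGLISH -frecT 2 -cvalueT 1.5 -inputFile ../in/in.txt. Passing the list to ProcessBuilder would launch it, provided CLASSPATH is set as described in Prerequisites.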

By default, at the start of each execution CUTEXT deletes the temporary files and the output files of previous executions. To avoid this, set the '-deleteFiles' flag to false. In particular, the parameters related to this deletion are the following:

-deleteFiles 
	If true delete the following files at the beginning. Default: TRUE

Output Files

-deletePosSer 
	Folder with the serialized hashTerms under the postagger folder. Default: cutext/postagger/serHashTerms/
-deletePosText 
	Path to the text folder under the postagger folder. Default: cutext/postagger/fileTextHashTerms/
-deletePosOutput 
	Path to the output folder under the postagger folder. Default: cutext/postagger/output/
-deleteOutSer 
	Path to the hashTerms folder under the cutext folder. Default: cutext/out/serHashTerms/
-deleteOutText 
	Path to the text folder under the cutext folder. Default: cutext/out/fileTextHashTerms/

Temporary Files

-deletePosInternOut 
	Path to the intern/out folder under the postagger folder. Default: cutext/postagger/intern/TT/out/
-deletePosInternIn 
	Path to the intern/in folder under the postagger folder. Default: cutext/postagger/intern/TT/in/
-deletePosInternX 
	Path to the intern/x folder under the postagger folder. Default: cutext/postagger/intern/TT/x/
-deleteInternOut 
	Path to the intern/out folder under the cutext folder. Default: cutext/intern/TT/out/
-deleteInternIn 
	Path to the intern/in folder under the cutext folder. Default: cutext/intern/TT/in/
-deleteInternX 
	Path to the intern/x folder under the cutext folder. Default: cutext/intern/TT/x/

Examples

Assume there is an input file in.txt in the folder in. To execute CUTEXT in text mode:

java cutext.main.ExecCutext -TM -generateTextFile true -inputFile ../in/in.txt

This generates the text files in the folder cutext/out/fileTextHashTerms and the serialized terms in the folder cutext/out/serHashTerms.

If you also want to obtain outputs in the BioC and JSON formats, execute CUTEXT by setting these parameters to TRUE, as in:

java cutext.main.ExecCutext -TM -generateTextFile true -bioc true -json true -inputFile ../in/in.txt

Execution via JAR file

The cutext.jar file allows CUTEXT to be executed directly from a terminal such as cmd, terminator, etc. To do this, type the following command line (from the directory where cutext.jar is located):

java -jar cutext.jar [options]

Where options are those shown in the 'Usage' section.

For example, if we type:

java -jar cutext.jar

CUTEXT will run the graphical interface.

To execute CUTEXT in text mode and with the same parameters as those included in the first example of the previous section, use the following command:

java -jar cutext.jar -TM -generateTextFile true -inputFile ../in/in.txt

The cutext.jar file is located at cutext/main/cutext.jar. If you move it to another directory, change the properties file (cutext/properties/cutext.properties) accordingly.

Finally, the input folder contains an example medical extract in Spanish (cutext/in/in.txt), and the output folder contains the output that CUTEXT generates for that input text (cutext/out/fileTextHashTerms/hsimpli.txt and cutext/out/fileTextHashTerms/terms_raw.txt). To reproduce the results, execute the following command (from cutext/main/):

java -jar cutext.jar -TM -incremental true -generateTextFile true -inputFile ../in/in.txt

Reference

Jesús Santamaría, Martin Krallinger: Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. Procesamiento del Lenguaje Natural, Revista no 61, septiembre de 2018, pp. 49-56. DOI: http://dx.doi.org/10.26342/2018-61-5

Contact

Jesús Santamaría (jsantamaria@cnio.es)

License

(This is the so-called MIT/X License)

Copyright (c) 2017-2018 Secretaría de Estado para el Avance Digital (SEAD)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
