The heavy use of medical terms has motivated the construction of large terminological resources for English, such as the Unified Medical Language System (UMLS) or the Open Biological and Biomedical Ontology (OBO) ontologies. Purely manual construction of terminological resources is valuable, but it has three drawbacks: it is highly time-consuming; it does not guarantee that the included concepts and terms match the medical language that healthcare professionals actually use in clinical documents; and it requires constant updating and revision as biomedical concepts change and new ones emerge.
CUTEXT is a multilingual medical term extraction tool. It extracts terms from texts written in English, Spanish, Galician and Catalan.
The main characteristics of CUTEXT are the following:
- It is implemented in Java, so it is cross-platform. It has been tested under Windows and Linux.
- It is multilingual: it has been tested with English, Spanish, Catalan and Galician, and it can easily be adapted to other languages by changing the lexical tag configuration text file.
- Input documents can be plain text or PDF.
- It can be run in graphic mode or from the console (command line).
- It supports numerous configuration parameters; the most important are the language, the tagger, the frequency and C-value thresholds, and the input document(s).
- The output is provided in plain text, in JSON format and/or in [BioC](http://bioc.sourceforge.net/).
A more detailed description of the system can be found in the journal Procesamiento del Lenguaje Natural, published by the Sociedad Española para el Procesamiento del Lenguaje Natural.
CUTEXT requires TreeTagger to be installed on your computer. The path to the TreeTagger "bin" folder must also be included in the PATH variable. If you are going to process medical or biomedical texts, it is also advisable, although not required, to install GeniaTagger (only valid for biomedical texts in English) or SPACCC_POS-TAGGER, based on FreeLing (only valid for biomedical texts in Spanish).
To convert PDF files to plain text, we use a script that calls the "ExtractText" class of the Apache PDFBox API, which is packaged in the "pdfbox-app-2.0.5.jar" file inside the "jar_pdf" folder. Therefore, the path to this file must be included in the CLASSPATH variable.
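As an illustration of the conversion step, the sketch below assembles the command line of the PDFBox command-line `ExtractText` tool. It is not CUTEXT's own script: the class `PdfToTextCommand`, the method `build` and the file locations are hypothetical and chosen only for the example, and the script shipped with CUTEXT may invoke PDFBox differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CUTEXT): builds the command that asks
// the PDFBox application jar to extract the text of a PDF file.
class PdfToTextCommand {

    // Returns: java -jar <pdfboxJar> ExtractText <inputPdf> <outputTxt>
    static List<String> build(String pdfboxJar, String inputPdf, String outputTxt) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(pdfboxJar);
        cmd.add("ExtractText");
        cmd.add(inputPdf);
        cmd.add(outputTxt);
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed locations, relative to the cutext folder.
        List<String> cmd = build("jar_pdf/pdfbox-app-2.0.5.jar", "in/in.pdf", "in/in.txt");
        System.out.println(String.join(" ", cmd));
        // To actually run the conversion (Java and the jar must be available):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The command is only printed here; uncommenting the `ProcessBuilder` line would launch the conversion.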
Finally, the CLASSPATH variable must also include the path to the "cutext" folder, since CUTEXT is packaged in that folder.
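Before running CUTEXT, it can help to check that these environment variables are set as described. The following is a minimal sketch, not part of CUTEXT: the class `PrerequisiteCheck` and its methods are hypothetical, and it assumes that the TreeTagger executable in the "bin" folder is called `tree-tagger` (`tree-tagger.exe` on Windows).

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CUTEXT): inspects PATH and CLASSPATH.
class PrerequisiteCheck {

    // Returns the first file called `name` found in one of the directories
    // listed in `pathVar`, or null if there is none.
    static File findInDirs(String pathVar, String separator, String name) {
        if (pathVar == null || pathVar.isEmpty()) return null;
        for (String dir : pathVar.split(Pattern.quote(separator))) {
            if (dir.isEmpty()) continue;
            File candidate = new File(dir, name);
            if (candidate.isFile()) return candidate;
        }
        return null;
    }

    // True if some entry of `classpathVar` has `fileName` as its last path component.
    static boolean hasEntryNamed(String classpathVar, String separator, String fileName) {
        if (classpathVar == null || classpathVar.isEmpty()) return false;
        for (String entry : classpathVar.split(Pattern.quote(separator))) {
            if (new File(entry).getName().equals(fileName)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String sep = File.pathSeparator;
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        // Assumption: name of the TreeTagger executable.
        String exe = windows ? "tree-tagger.exe" : "tree-tagger";

        File tagger = findInDirs(System.getenv("PATH"), sep, exe);
        System.out.println(tagger == null
                ? "TreeTagger not found on PATH"
                : "TreeTagger found: " + tagger);

        boolean pdfbox = hasEntryNamed(System.getenv("CLASSPATH"), sep, "pdfbox-app-2.0.5.jar");
        System.out.println("pdfbox-app-2.0.5.jar on CLASSPATH: " + pdfbox);
    }
}
```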
The CUTEXT directory structure corresponds to a Java package named cutext. Therefore, all packages are inside that folder:
- cutext/config_files/: includes files with tags, stop-words and punctuation marks in Spanish, Galician, Catalan and English.
- cutext/filter/lin/: contains the Java classes that implement the linguistic filter.
- cutext/filter/sta/: contains the Java classes that implement the statistical filter.
- cutext/gui/: contains the Java classes that implement the graphical user interface (GUI).
- cutext/in/: a possible place to put the input file.
- cutext/intern/TT/in/: internal storage of the input file for TreeTagger.
- cutext/intern/TT/out/: internal storage of the output file generated by TreeTagger.
- cutext/intern/TT/x/: internal storage of the intermediate file for TreeTagger.
- cutext/jar_pdf/: contains the "pdfbox-app-2.0.5.jar" file, used to convert PDF files to plain text.
- cutext/main/: includes the main classes of CUTEXT as well as the file cutext.jar.
- cutext/out/fileTextHashTerms/: stores the text output files.
- cutext/out/serHashTerms/: stores serialized objects.
- cutext/postagger/: contains the class that invokes the tagger (TreeTagger).
- cutext/prepro/: contains the classes that preprocess the input corpus.
- cutext/properties/: contains the CUTEXT property file.
- cutext/stemmer/: contains the classes that obtain the stem of the words.
- cutext/textmode/: contains the classes that allow CUTEXT to be executed from the terminal.
- cutext/util/: contains utility classes.
CUTEXT can be run in graphic mode or in text mode. In both cases, it is assumed that it will be executed
from the main folder. If not, change the paths in the properties file cutext.properties and include the path of
this file as an input parameter when invoking CUTEXT.
We assume that the .java files have been compiled and that the CLASSPATH variable has been set correctly.
For a description of this process, see the installation file Intallation.md.
To execute CUTEXT in graphic mode:
java cutext.main.ExecCutext
To execute CUTEXT in text mode:
java cutext.main.ExecCutext -TM [Options] <-inputFile fileName>
Except for the input file, all options have default values, so it is not necessary to include them.
Options:
- -TM: Execute CUTEXT in text mode (TM).
- -help: Show the command line used to execute CUTEXT, and the options.
- -displayon: Show the messages on the standard output. Default TRUE (show).
- -postagger: POS tagger used to tag the input file. TreeTagger (default) or GeniaTagger.
- -language: SPANISH (default) or ENGLISH, CATALAN, GALICIAN.
- -frecT: Frequency threshold. Default 0.
- -cvalueT: C-value threshold. Default 0.0.
- -bioc: Create a BioC output. Default false.
- -json: Create a JSON output. Default false.
- -convert: If true, convert the input file to lower case. Default true.
- -withoutcvalue: If true, execute only the linguistic filter. Default false.
- -incremental: If true, process each line of the file as an entire corpus. Default false.
- -generateTextFile: If true, create one text file per hashTerms, from 'a' to 'z'. Also create a raw text file with terms sorted by C-value. Default false.
- -routeHashTerms: Folder where you want to store the hash terms.
- -routeTextFileHashTerms: Folder where you want to store the text file hash terms.
- -routeconfigfiles: Folder where the config files are stored.
- -routeinterntt: Temporary folder (TT).
- -routeFreelingScript: File that executes the FreeLing script (under Linux). Default cutext/postagger/scriptFreeling.sh.
- -inputFile: The document to use.
- -outputFile: The file to write the result to.
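To give an idea of what the -frecT and -cvalueT thresholds act on, the sketch below computes C-value scores as the measure is usually defined in the term-extraction literature: C-value(a) = log2|a| · f(a) when the candidate a is not nested in a longer candidate, and log2|a| · (f(a) − (1/P(Ta)) · Σ f(b)) otherwise, where |a| is the number of words, f is the frequency, Ta is the set of longer candidates containing a and P(Ta) is its size. This is not CUTEXT's code: the class `CValueSketch`, its methods and the use of `>=` for both thresholds are assumptions made for the example, and CUTEXT's statistical filter may differ in its details.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not CUTEXT's implementation) of the textbook C-value score.
class CValueSketch {

    // Number of whitespace-separated words in a candidate term.
    static int wordCount(String term) {
        return term.trim().split("\\s+").length;
    }

    // True if `shorter` occurs as a contiguous word sequence inside `longer`.
    static boolean isNestedIn(String shorter, String longer) {
        return !shorter.equals(longer)
                && (" " + longer + " ").contains(" " + shorter + " ");
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Computes the C-value of every candidate from a term -> frequency map.
    static Map<String, Double> cValues(Map<String, Integer> freq) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> a : freq.entrySet()) {
            int containing = 0;     // P(Ta): number of longer candidates containing a
            int containingFreq = 0; // sum of f(b) over those candidates
            for (Map.Entry<String, Integer> b : freq.entrySet()) {
                if (isNestedIn(a.getKey(), b.getKey())) {
                    containing++;
                    containingFreq += b.getValue();
                }
            }
            double adjusted = a.getValue();
            if (containing > 0) {
                adjusted -= (double) containingFreq / containing;
            }
            result.put(a.getKey(), log2(wordCount(a.getKey())) * adjusted);
        }
        return result;
    }

    // Keeps candidates whose frequency and C-value reach the thresholds.
    // Assumption: both comparisons are inclusive (>=).
    static Map<String, Double> filter(Map<String, Integer> freq, int frecT, double cvalueT) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : cValues(freq).entrySet()) {
            if (freq.get(e.getKey()) >= frecT && e.getValue() >= cvalueT) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("basal cell carcinoma", 3);
        freq.put("cell carcinoma", 5);
        System.out.println(filter(freq, 0, 0.0));
    }
}
```

In this formulation a single-word candidate always scores 0, because log2(1) = 0; some variants add 1 inside the logarithm to avoid that.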
By default, CUTEXT deletes the temporary files and the output files of previous executions when it starts. To avoid this, set the '-deleteFiles' flag to false. The parameters related to this deletion are the following:
- -deleteFiles: If true, delete the following files at the start of the execution. Default: TRUE
Output Files
- -deletePosSer: Folder with serializable hashTerms, inside the postagger folder. Default: cutext/postagger/serHashTerms/
- -deletePosText: Path to the text folder inside the postagger folder. Default: cutext/postagger/fileTextHashTerms/
- -deletePosOutput: Path to the output folder inside the postagger folder. Default: cutext/postagger/output/
- -deleteOutSer: Path to the hashTerms folder inside the cutext folder. Default: cutext/out/serHashTerms/
- -deleteOutText: Path to the text folder inside the cutext folder. Default: cutext/out/fileTextHashTerms/
Temporary Files
- -deletePosInternOut: Path to the intern/out folder inside the postagger folder. Default: cutext/postagger/intern/TT/out/
- -deletePosInternIn: Path to the intern/in folder inside the postagger folder. Default: cutext/postagger/intern/TT/in/
- -deletePosInternX: Path to the intern/x folder inside the postagger folder. Default: cutext/postagger/intern/TT/x/
- -deleteInternOut: Path to the intern/out folder inside the cutext folder. Default: cutext/intern/TT/out/
- -deleteInternIn: Path to the intern/in folder inside the cutext folder. Default: cutext/intern/TT/in/
- -deleteInternX: Path to the intern/x folder inside the cutext folder. Default: cutext/intern/TT/x/
Let's assume there is an input file in.txt in the folder in. If we execute CUTEXT in text mode:
java cutext.main.ExecCutext -TM -generateTextFile true -inputFile ../in/in.txt
This generates the text files in the folder cutext/out/fileTextHashTerms and the serialized terms in the folder
cutext/out/serHashTerms.
If you also want to obtain outputs in the BioC and JSON formats, execute CUTEXT by setting these parameters to TRUE, as in:
java cutext.main.ExecCutext -TM -generateTextFile true -bioc true -json true -inputFile ../in/in.txt
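When CUTEXT is driven from another Java program, the two invocations above can be assembled programmatically. The sketch below only builds the argument list from options documented in this file; the class `CutextCommand` and the method `textMode` are hypothetical names, not part of CUTEXT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of CUTEXT): builds a text-mode command line.
class CutextCommand {

    // Options are given as name -> value pairs, without the leading '-'.
    static List<String> textMode(String inputFile, Map<String, String> options) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("cutext.main.ExecCutext");
        cmd.add("-TM");
        for (Map.Entry<String, String> option : options.entrySet()) {
            cmd.add("-" + option.getKey());
            cmd.add(option.getValue());
        }
        cmd.add("-inputFile");
        cmd.add(inputFile);
        return cmd;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("generateTextFile", "true");
        options.put("bioc", "true");
        options.put("json", "true");
        List<String> cmd = textMode("../in/in.txt", options);
        System.out.println(String.join(" ", cmd));
        // To launch CUTEXT (CLASSPATH must be set as described above):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

A `LinkedHashMap` is used so that the options appear on the command line in the order in which they were added.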
The cutext.jar file allows you to run CUTEXT directly from a terminal such as cmd, terminator, etc.
To do this, write the following command line (from the directory where cutext.jar is located):
java -jar cutext.jar [options]
Where options are those shown in the 'Usage' section.
For example, if we type:
java -jar cutext.jar
CUTEXT will run the graphical interface.
To execute CUTEXT in text mode and with the same parameters as those included in the first example of the previous section, use the following command:
java -jar cutext.jar -TM -generateTextFile true -inputFile ../in/in.txt
The cutext.jar file is located at cutext/main/cutext.jar. If you move it to another directory, change the
properties file (cutext/properties/cutext.properties) accordingly.
Finally, in the input folder (cutext/in/in.txt) we include an example of a medical extract in Spanish, and in the
output folder (cutext/out/fileTextHashTerms/hsimpli.txt and cutext/out/fileTextHashTerms/terms_raw.txt) we
include the output that CUTEXT generates for that input text. To reproduce the results, execute the following command
(from cutext/main/):
java -jar cutext.jar -TM -incremental true -generateTextFile true -inputFile ../in/in.txt
Jesús Santamaría, Martin Krallinger: Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. Procesamiento del Lenguaje Natural, Revista no 61, septiembre de 2018, pp. 49-56. DOI: http://dx.doi.org/10.26342/2018-61-5
Jesús Santamaría (jsantamaria@cnio.es)
(This is the so-called MIT/X License)
Copyright (c) 2017-2018 Secretaría de Estado para el Avance Digital (SEAD)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.