NLP in the biomedical domain
Several research groups are developing and applying natural language processing methodologies in biomedical informatics, and systems vary along several dimensions. The complexity of natural language dictates that semantic interpretation be focused in scope, typically by domain of discourse; many applications are designed to interpret clinical text of a certain type, for example discharge summaries or imaging reports, such as those for chest X-rays or mammograms. The majority of this work is knowledge-based, and the specific domain guides the type and amount of knowledge used [17]. Often this knowledge is drawn from existing resources, such as the Unified Medical Language System (UMLS) [18] or the GALEN ontology [19], but several systems rely solely on locally developed knowledge bases. Further, system restrictions may be imposed on the basis of syntactic structure; some systems process only noun phrases or just those phrases covered by a semantic grammar. Finally, various linguistic formalisms are used, including semantic grammars, definite clause and dependency grammars, as well as bottom-up chart parsers. Below, we discuss some of the NLP systems developed in the biomedical domain. (For more comprehensive reviews, see [11] and [12].)
MedLEE [20], [21] builds on semantic models derived from the Linguistic String Project (LSP) [22] and is guided by a semantic grammar that consists of patterns of semantic classes, such as “degree + change + finding”, which would match “mild increase in congestion”. These classes are defined in a semantic lexicon, and Friedman et al. [23] discuss the use of the UMLS in constructing such a lexicon. MedLEE has been evaluated for several clinical applications [5], [24] and [25].
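The idea of a semantic grammar pattern can be illustrated with a minimal sketch. This is not MedLEE's implementation: the toy lexicon, the list of skipped function words, and the names `LEXICON`, `SKIP`, `PATTERN`, `semantic_classes`, and `matches` are all assumptions introduced here for illustration. The sketch maps each content word to a semantic class and checks whether the resulting class sequence equals the pattern degree + change + finding.

```python
# Hypothetical toy lexicon mapping words to semantic classes.
# (Illustrative only; not the lexicon used by MedLEE.)
LEXICON = {
    "mild": "degree",
    "severe": "degree",
    "increase": "change",
    "decrease": "change",
    "congestion": "finding",
    "effusion": "finding",
}

# Function words ignored during matching (an assumption of this sketch).
SKIP = {"in", "of", "the"}

# The pattern of semantic classes named in the text.
PATTERN = ("degree", "change", "finding")


def semantic_classes(phrase):
    """Return the semantic class of each content word (None if unknown)."""
    words = [w for w in phrase.lower().split() if w not in SKIP]
    return [LEXICON.get(w) for w in words]


def matches(phrase, pattern=PATTERN):
    """Return True if the phrase's class sequence equals the pattern."""
    return tuple(semantic_classes(phrase)) == tuple(pattern)


if __name__ == "__main__":
    print(matches("mild increase in congestion"))
    print(matches("congestion increase mild"))
```

Under these assumptions, the phrase from the text matches the pattern, while a phrase with the same words in a different order, or with a word missing from the lexicon, does not. A real semantic grammar would also handle optional elements, nesting, and ambiguity, none of which is attempted here.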
The AQUA system [26] was developed to interpret natural language queries issued by users to an information retrieval system. The parser uses standard definite clause grammars enhanced by an operator grammar, with the support of a semantic lexicon compiled from the UMLS Metathesaurus and Semantic Network. The final semantic representation is in the form of conceptual graphs. Although AQUA was developed for clinical queries, it has recently been applied to clinical data and MEDLINE citations, which are ranked based on a conceptual graph-matching algorithm [2].
The RECIT system [27] concentrates on processing noun phrases and is composed of a proximity processor, a typology of concepts, a dictionary with syntactic and semantic information, a set of conceptual relationships, and a set of canonical concepts. The semantic information relies on the model developed by the GALEN project [28].
Rosario et al. [10] describe an approach to the semantic interpretation of noun phrases and nominal compounds based on the semantic information contained in a large lexical hierarchy, the National Library of Medicine’s Medical Subject Headings (MeSH). Part of the challenge addressed by their research is to determine the possible semantic relations that can obtain among the components of a nominal construction.
SymText [29] uses Bayesian networks to represent semantic types and relations. Syntactic knowledge comes from augmented transition networks, and the system depends on a set of reports to train the network for a specific medical domain. SymText has been evaluated for clinical applications [6], [30] and [31]. In a recent upgrade to SymText, called MPLUS, Bayesian networks are represented in an object-oriented format, and a bottom-up chart parser provides syntactic analysis. In addition, MPLUS uses an abstract semantic language to link Bayesian network types to each other in a predication format [32].
The MENELAS system [33] is a multilingual text understanding system (French, Dutch, and English) built to extract information from patient discharge summaries. Domain knowledge resides in a locally developed ontology, and linguistic relations are projected to the reference model using morphosyntactic analysis. Output is in the form of an annotated parse tree, which is passed to a semantic analyzer that heuristically selects the best representation using a semantic lexicon and semantic rules. MENELAS was evaluated for coding a subset of discharge summaries [34].
Hahn et al. [35] have developed a natural language processor called MEDSYNDIKATE to automatically acquire knowledge from medical reports. Grammatical knowledge comes from a lexicon and a fully specified dependency grammar. Conceptual knowledge comes from a locally developed ontology that consists of a set of axioms for concept roles with corresponding type restrictions for role fillers. In addition to sentence-level analysis, MEDSYNDIKATE uses a centering algorithm to resolve anaphoric expressions at the discourse level [36]. The system has recently been evaluated for semantic propositions in sample medical texts [37].
Our approach to natural language processing differs from those described here in two major ways: the linguistic formalism used and the source of the domain knowledge. As will be seen below, syntactic structures are represented by two mechanisms, a shallow categorial parser and an underspecified dependency grammar. Although both are incomplete, they apply to English syntax in general and are not crafted for the biomedical domain. The domain knowledge for our system is taken directly from the UMLS rather than being compiled manually. Although the UMLS knowledge sources were not intended as ontologies and will not ultimately support extensive inferencing without enhancement, the breadth of coverage they provide supports the application of our system to a variety of medical subdomains with a minimum of effort.