NLP in the biomedical domain
The Unified Medical Language System (UMLS) project [18] is a long-term National Library of Medicine research and development effort designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information sources. The UMLS has three components: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon. In addition to supporting information management applications, structured domain knowledge contained in these knowledge sources can be exploited for research in NLP, such as the effort described here to identify hypernymic predications in MEDLINE citations.
The SPECIALIST Lexicon and associated lexical access tools [45] provide syntactic information about terms in general and medical English. Both simple and multiword lexical entries are included, and each entry has been assigned one or more part-of-speech labels. Spelling variants, inflectional forms, and complement information for verbs further support NLP applications.
The Metathesaurus is a large repository of concepts (nearly 777,000 in the 2002 version) drawn from more than 60 vocabularies, classifications, and coding systems. During compilation, the structure of source terminologies is preserved; however, terms that have equivalent meanings are organized into unique concepts, which form the organizational core of the Metathesaurus. Associative and hierarchical relationships between concepts either come from the source terminologies or are added by editors. In this study, we make extensive use of these relationships in order to identify hypernymic propositions; the two arguments of such a proposition must be in a (direct or indirect) hierarchical relationship, loosely defined to include the Parent, Child, Broader, and Narrower relations.
It is important to note that due to varying semantics in source vocabularies, many of the relationships we use to support interpretation of hypernymic propositions are not strictly accurate for this purpose. For example, “Tylenol” is related to “Acetaminophen” by the Narrower relation in the Metathesaurus, although something like BRAND_OF would be more correct. In other instances, however, the relationship can be profitably construed as hierarchical. “Aspirin,” for example, is in a Broader relationship with “Analgesics,” “Salicylates,” and “Cyclooxygenase Inhibitors.” These limitations notwithstanding, it is our experience (supported by the evaluation of this project) that domain knowledge from the Metathesaurus can provide effective support for natural language processing directed at the interpretation of hypernymic propositions.
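The requirement that two concepts stand in a direct or indirect hierarchical relationship can be illustrated with a minimal sketch. It is not the implementation used in this work: the `BROADER` table is a hypothetical toy fragment, the convention of storing links from the more specific concept to the more general one is an assumption, and the link from "Acetaminophen" to "Analgesics" is added only to show an indirect relationship.

```python
from collections import deque

# Hypothetical toy fragment: each concept maps to the concepts directly above it.
# The Aspirin and Tylenol entries follow the examples in the text; the
# Acetaminophen entry is an illustrative assumption.
BROADER = {
    "Aspirin": {"Analgesics", "Salicylates", "Cyclooxygenase Inhibitors"},
    "Tylenol": {"Acetaminophen"},
    "Acetaminophen": {"Analgesics"},
}

def is_ancestor(general, specific, broader=BROADER):
    """Return True if `general` is reachable from `specific` via one or more upward links."""
    seen = set()
    queue = deque([specific])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, ()):
            if parent == general:
                return True
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False

def hierarchically_related(a, b, broader=BROADER):
    """Return True if either concept is a direct or indirect ancestor of the other."""
    return is_ancestor(a, b, broader) or is_ancestor(b, a, broader)
```

Under these assumptions, "Tylenol" and "Analgesics" count as hierarchically related through "Acetaminophen", whereas "Aspirin" and "Tylenol" do not, since neither lies above the other.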
Each Metathesaurus concept is also assigned one or more semantic types such as ‘Disease or Syndrome’ or ‘Pharmacologic Substance’ that categorize concepts in the biomedical domain. There are 134 semantic types in the 2002 release of the UMLS, and the Semantic Network [46] organizes these into two single-inheritance hierarchies, one for entities and one for events. In addition, associative relations are assigned between semantic types; these semantic propositions represent knowledge that is accepted as being valid in the biomedical domain, such as
(10)
‘Body Part, Organ, or Organ Component’ HAS_PART ‘Cell.’
‘Body Location or Region’ LOCATION_OF ‘Anatomical Abnormality.’
‘Pharmacologic Substance’ TREATS ‘Disease or Syndrome.’
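One way to picture these semantic propositions is as a set of (semantic type, relation, semantic type) triples that can be queried for a pair of concepts. The sketch below is illustrative only and is not a description of how the UMLS distributes the Semantic Network or of how our programs consult it; the function name `licensed_relations` is hypothetical, and the triples are limited to those in (10).

```python
# Triples copied from example (10); the real Semantic Network contains many more.
SEMANTIC_NETWORK = {
    ("Body Part, Organ, or Organ Component", "HAS_PART", "Cell"),
    ("Body Location or Region", "LOCATION_OF", "Anatomical Abnormality"),
    ("Pharmacologic Substance", "TREATS", "Disease or Syndrome"),
}

def licensed_relations(subject_types, object_types, network=SEMANTIC_NETWORK):
    """Return the relations the network allows between any pairing of the given types.

    Each argument is a set because a concept may carry more than one semantic type.
    """
    return {
        relation
        for (subj, relation, obj) in network
        if subj in subject_types and obj in object_types
    }
```

For example, calling `licensed_relations({"Pharmacologic Substance"}, {"Disease or Syndrome"})` yields the single relation TREATS, while swapping the two arguments yields the empty set, since the propositions are directional.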

Recent research by McCray et al. [47] aimed at reducing the conceptual complexity of the medical knowledge represented in the Semantic Network has resulted in the development of semantic groups. Subject to principles of semantic validity, parsimony, completeness, exclusivity, naturalness, and utility, such groups organize the 134 semantic types in the Semantic Network into 15 coarse-grained aggregates such as Anatomy, Activities and Behaviors, Living Beings, and Chemicals and Drugs. Based on the distribution of relationships in the Semantic Network, Perl et al. [48], [49], [50] have proposed alternative groups to those devised by McCray et al. In this work, we rely on the groups of McCray et al.; our methodology can accommodate other configurations, although results will differ.
We use semantic groups to constrain the identification of hypernymic propositions; the Metathesaurus concepts that serve as arguments of such propositions must have semantic types that belong to the same semantic group. (In addition, as noted above, the concepts must be in a hierarchical relationship.) In the version of the program discussed here, we used only the group Chemicals and Drugs. This group consists of 26 semantic types, a few examples of which are ‘Pharmacologic Substance,’ ‘Antibiotic,’ ‘Biologically Active Substance,’ ‘Hormone,’ ‘Enzyme,’ ‘Vitamin,’ ‘Steroid,’ and ‘Immunologic Factor.’
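The two constraints just stated can be combined in a short sketch. It is a simplified illustration under stated assumptions, not the program discussed here: `CHEMICALS_AND_DRUGS` lists only the eight semantic types named above rather than all 26, and `CONCEPT_TYPES`, `HIERARCHICAL_PAIRS`, and `may_be_hypernymic` are hypothetical names with toy contents.

```python
# Only the semantic types named in the text; the full group has 26 members.
CHEMICALS_AND_DRUGS = {
    "Pharmacologic Substance", "Antibiotic", "Biologically Active Substance",
    "Hormone", "Enzyme", "Vitamin", "Steroid", "Immunologic Factor",
}

# Hypothetical semantic type assignments, for illustration only.
CONCEPT_TYPES = {
    "Aspirin": {"Pharmacologic Substance"},
    "Analgesics": {"Pharmacologic Substance"},
    "Headache": {"Sign or Symptom"},
}

# Hypothetical pairs assumed to be hierarchically related in the Metathesaurus.
HIERARCHICAL_PAIRS = {("Aspirin", "Analgesics")}

def may_be_hypernymic(a, b, concept_types=CONCEPT_TYPES, pairs=HIERARCHICAL_PAIRS):
    """Return True if both concepts have a type in the group and are hierarchically related."""
    def in_group(concept):
        return bool(concept_types.get(concept, set()) & CHEMICALS_AND_DRUGS)

    related = (a, b) in pairs or (b, a) in pairs
    return in_group(a) and in_group(b) and related
```

Because only one group is used, the same-group requirement reduces here to both concepts having at least one semantic type in Chemicals and Drugs; supporting several groups would require mapping each type to its group and comparing the results.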
In the next section, we briefly describe how UMLS domain knowledge is used in SemRep, which forms the basis of SemSpec. In the subsequent section describing SemSpec, we discuss and illustrate the specific way that we exploit semantic groups and Metathesaurus hierarchical relationships to support effective semantic interpretation of hypernymic propositions.

The MENELAS system [33] is a multilingual text understanding system (French, Dutch, and English) built to extract information from patient discharge summaries. Domain knowledge resides in a locally developed ontology, and linguistic relations are projected to the reference model using morphosyntactic analysis. Output takes the form of an annotated parse tree, which is passed to a semantic analyzer that heuristically selects the best representation using a semantic lexicon and semantic rules. MENELAS was evaluated for coding a subset of discharge summaries [34].
Hahn et al. [35] have developed a natural language processor called MEDSYNDIKATE to automatically acquire knowledge from medical reports. Grammatical knowledge comes from a lexicon and a fully specified dependency grammar. Conceptual knowledge comes from a locally developed ontology that consists of a set of axioms for concept roles with corresponding type restrictions for role fillers. In addition to sentence level analysis, MEDSYNDIKATE uses a centering algorithm to resolve anaphoric expressions at the discourse level [36]. The system has recently been evaluated for semantic propositions in sample medical texts [37].
Our approach to natural language processing differs from those described here in two major ways: the linguistic formalism used and the source of the domain knowledge. As will be seen below, syntactic structures are represented by two mechanisms, a shallow categorial parser and an underspecified dependency grammar. It should be noted that although these are both incomplete, they apply to English syntax in general and are not crafted for the biomedical domain. The domain knowledge for our system is taken directly from the UMLS rather than being compiled manually. Although the UMLS knowledge sources were not intended as ontologies and will not ultimately support extensive inferencing without enhancement, the breadth of coverage they provide supports the application of our system to a variety of medical subdomains with a minimum of effort.