A Biomedical Information Extraction Primer for NLP Researchers

NER is more difficult in biomedical text. "Heart attack" and "myocardial infarction" are synonyms, for example, and proteins can be described in one of several formats.
Abbreviations are also difficult and context-dependent. RA could be any of "right atrium", "rheumatoid arthiritis", "renal artery", or more.
The field used to be heavily reliant on manual feature engineering - an old relation extraction technique simply relied on 8 regexes - but is being turned upside down by deep learning.

NELL: Never Ending Language Learner: continually crawl documents looking for correlated terms. Relies on seeds in each category. Ambiguities like the ones mentioned above are troublesome and cause semantic drift - feature engineering and preprocessing are necessary for medical datasets.
Pointwise mutual information: from information theory. If two random variables were independent, we'd expect them to occur together by chance sometimes. If they interact in any way, we'll observe a difference from what we'd expect. A PMI of 0 means they're independent. Two words that score highly together are "Puerto" and "Rico", which have a PMI of 10.

Provide feedback