Natural Language Preprocessing (NLPre)
Major version update! NLPre 2.0.0
- Backend NLP engine pattern.en has been replaced with spaCy v2.1.0. This is a major fix for some of the problems with pattern.en, including poor lemmatization (e.g., cytokine -> cytocow).
- Support for python 2 has been dropped
- Support for custom dictionaries in
- Option for suffix to be used instead of prefix in
- URL replacement can now remove emails
- token_replacement can remove symbols
NLPre is a text (pre)-processing library that helps smooth some of the inconsistencies found in real-world data. Correcting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data but is often left to the user.
While this library was developed by the Office of Portfolio Analysis at the National Institutes of Health to correct for historical artifacts in our data, we envision this module encompassing a broad spectrum of problems encountered in the preprocessing step of natural language processing.
NLPre is part of the
For the latest release, use
pip install nlpre
If installing the python 3 version on Ubuntu, you may need to use
sudo apt-get install libmysqlclient-dev
from nlpre import titlecaps, dedash, identify_parenthetical_phrases
from nlpre import replace_acronyms, replace_from_dictionary

text = ("LYMPHOMA SURVIVORS IN KOREA. Describe the correlates of unmet needs "
        "among non-Hodgkin lymphoma (NHL) surv- ivors in Korea and identify "
        "NHL patients with an abnormal white blood cell count.")

ABBR = identify_parenthetical_phrases()(text)
parsers = [dedash(), titlecaps(), replace_acronyms(ABBR),
           replace_from_dictionary(prefix="MeSH_")]

for f in parsers:
    text = f(text)

print(text)

''' lymphoma survivors in korea .
    Describe the correlates of unmet needs among non_Hodgkin_lymphoma
    ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma
    patients with an abnormal MeSH_Leukocyte_Count . '''
A longer example highlighting a "pipeline" of changes can be found here.
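To make the first two steps of the example concrete, here is a rough, standard-library-only sketch of the ideas behind dedash and titlecaps. It is not NLPre's implementation: the function names `dedash_sketch` and `titlecaps_sketch`, the tiny `WORDS` list, and the regular expressions are all assumptions made for illustration.

```python
import re

# Tiny stand-in word list (an assumption); a real list would hold many thousands of words.
WORDS = {"survivors", "treatment", "lymphoma"}


def dedash_sketch(text, words=WORDS):
    """Join 'surv- ivors' into 'survivors' when the joined form is a known word."""
    def join(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text if the joined form is not in the word list.
        return joined if joined.lower() in words else match.group(0)

    return re.sub(r"\b([A-Za-z]+)- ([A-Za-z]+)\b", join, text)


def titlecaps_sketch(text):
    """Lowercase any sentence in which every word is uppercase."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    fixed = []
    for sentence in sentences:
        # Only consider tokens that contain at least one letter.
        words = [w for w in sentence.split() if any(c.isalpha() for c in w)]
        if words and all(w.isupper() for w in words):
            sentence = sentence.lower()
        fixed.append(sentence)
    return " ".join(fixed)


sample = "LYMPHOMA SURVIVORS IN KOREA. Unmet needs among surv- ivors and well- being."
for step in (dedash_sketch, titlecaps_sketch):
    sample = step(sample)
print(sample)
# lymphoma survivors in korea. Unmet needs among survivors and well- being.
```

Note that "well- being" is left alone because "wellbeing" is not in the stand-in word list; the quality of the word list decides which hyphenations get repaired.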
To see a detailed log of the changes made, set the level to logging.INFO:
import nlpre, logging
nlpre.logger.setLevel(logging.INFO)
| Function | Description |
| --- | --- |
| replace_from_dictionary | Replaces phrases using an input dictionary. Matching is case-insensitive, and punctuation is handled correctly. The MeSH (Medical Subject Headings) dictionary is built in. |
| replace_acronyms | Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then all instances of that acronym in the document are replaced with the given phrase. If the document gives no explicit indication of what the phrase is, then the most common phrase associated with the acronym in the given counter is used. |
| identify_parenthetical_phrases | Identifies abbreviations of phrases found in parentheses. Returns a counter and can be passed directly into |
| separated_parenthesis | Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. A terminal period is added to parenthetical sentences if necessary. |
| pos_tokenizer | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the |
| unidecoder | Converts Unicode phrases into their ASCII equivalents. |
| dedash | Hyphenations are sometimes erroneously inserted when text is passed through a word processor. This module attempts to correct the hyphenation pattern by joining words if the joined form appears in an English word list. |
| decaps_text | We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. |
| titlecaps | Documents sometimes have sentences that are entirely in uppercase (commonly found in titles and abstracts of older documents). This parser identifies sentences where every word is uppercase and returns the document with these sentences converted to lowercase. |
| token_replacement | Simple token replacement. |
| separate_reference | Separates and optionally removes references that have been concatenated onto words. |
| url_replacement | Removes or replaces URLs. |
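The acronym handling described above (count phrase/acronym pairs found in parentheses, then prefer a phrase defined in the document and fall back to the most common phrase in a counter) can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not NLPre's code: the names `find_parenthetical_acronyms` and `expand_acronyms`, the `(phrase, acronym)` counter keys, and the first-letter matching rule are all hypothetical.

```python
import re
from collections import Counter


def find_parenthetical_acronyms(text):
    """Count (phrase, acronym) pairs such as 'non-Hodgkin lymphoma (NHL)'."""
    counts = Counter()
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words before the parenthesis; hyphens split words ('non-Hodgkin' -> 2 words).
        words = list(re.finditer(r"[A-Za-z]+", text[:match.start()]))
        candidate = words[-len(acronym):]
        initials = "".join(w.group()[0].upper() for w in candidate)
        if len(candidate) == len(acronym) and initials == acronym:
            phrase = text[candidate[0].start():candidate[-1].end()]
            counts[(phrase, acronym)] += 1
    return counts


def expand_acronyms(text, fallback_counts=None):
    """Replace acronyms, preferring phrases defined in the text itself."""
    best = {}
    # Document-local definitions are visited first, so they win over the fallback.
    for counts in (find_parenthetical_acronyms(text), fallback_counts or Counter()):
        for (phrase, acronym), _n in counts.most_common():
            best.setdefault(acronym, phrase)
    for acronym, phrase in best.items():
        pattern = r"\b%s\b" % re.escape(acronym)
        text = re.sub(pattern, lambda _m, p=phrase: p, text)
    return text


doc = "Unmet needs among non-Hodgkin lymphoma (NHL) survivors. NHL patients were surveyed."
print(expand_acronyms(doc))

# A counter built from other documents supplies phrases the text does not define.
prior = Counter({("white blood cell", "WBC"): 3, ("whole body counter", "WBC"): 1})
print(expand_acronyms("Abnormal WBC count.", prior))
```

The first call expands both the parenthetical "(NHL)" and the later bare "NHL"; the second call has no in-document definition for "WBC", so the most frequent phrase in `prior` is used.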
To run NLPre in parallel, create a small pipeline function and pass it to either multiprocessing or joblib. For example, continuing the example above:
from joblib import Parallel, delayed

def pipeline(x):
    for f in parsers:
        x = f(x)
    return x

docs = [text] * 500

# n_jobs=-1 uses all available CPU cores
with Parallel(n_jobs=-1) as MP:
    print(MP(delayed(pipeline)(x) for x in docs))
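The same pipeline function also fits the pool interface in the standard library. The sketch below is a minimal illustration that runs without NLPre installed: `collapse_spaces`, `lowercase`, and `demo_parsers` are hypothetical stand-ins for real parsers. It uses `multiprocessing.pool.ThreadPool`, which exposes the same `map` API as `multiprocessing.Pool`; for CPU-bound parsing, a process-based `Pool` created under an `if __name__ == "__main__":` guard would likely be the better fit.

```python
from multiprocessing.pool import ThreadPool


def collapse_spaces(text):
    """Stand-in parser: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def lowercase(text):
    """Stand-in parser: lowercase the whole document."""
    return text.lower()


# Hypothetical stand-ins for a list of NLPre parser instances.
demo_parsers = [collapse_spaces, lowercase]


def pipeline(x):
    # Apply each parser in order, feeding the output of one into the next.
    for f in demo_parsers:
        x = f(x)
    return x


docs = ["LYMPHOMA   SURVIVORS", "Unmet  Needs"] * 250

# map() preserves input order, so results[i] corresponds to docs[i].
with ThreadPool(4) as pool:
    results = pool.map(pipeline, docs)

print(len(results), results[:2])
```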
Citations and Acknowledgments
He, Jian and Chaomei Chen. Predictive Effects of Novelty Measured by Temporal Embeddings on the Growth of Scientific Literature. Frontiers in Research Metrics and Analytics, 3, 9. (2018).
He, Jian and Chaomei Chen. Temporal Representations of Citations for Understanding the Changing Roles of Scientific Publications. Front. Res. Metr. Anal. (2018).
Galea, Dieter et al. Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization. BioNLP (2018).
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.