:mod:`lexnlp.extract.en.distances`: Extracting distances
============================================================
The :mod:`lexnlp.extract.en.distances` module contains methods for extracting distance references from text. The distance units covered by default include:
- km
- kilometer
- mile
- miles
- mi
The full list of current unit test cases can be found here: https://github.com/LexPredict/lexpredict-lexnlp/tree/master/test_data/lexnlp/extract/en/tests/test_distances
.. currentmodule:: lexnlp.extract.en.distances
.. autofunction:: get_distances
Example

>>> import lexnlp.extract.en.distances
>>> text = "Within 50 miles of office."
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[(50.0, 'mile')]

Distance extraction can be customized. Three key module variables store the default configuration, and the compiled pattern used for matching is stored in ``DISTANCE_PTN_RE``:
- ``DISTANCE_TOKEN_MAP``: dictionary mapping tokens to standard distance types. See the customization example below.
- ``DISTANCE_SYMBOL_MAP``: dictionary mapping abbreviations to standard distance types. See the customization example below.
- ``DISTANCE_PTN``: string defining the regular expression pattern used to match distances.
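
To make the roles of these variables concrete, the following is a minimal stand-alone sketch that uses only the standard library ``re`` module, not LexNLP itself. The names ``TOKEN_MAP``, ``SYMBOL_MAP``, ``build_pattern`` and ``extract_distances`` are hypothetical, the map contents are illustrative rather than the module's actual defaults, and the sketch recognizes only digit-based numbers (no spelled-out amounts such as "fifteen").

```python
import re

# Hypothetical stand-ins for the module's configuration (illustrative values only).
TOKEN_MAP = {"kilometer": "kilometer", "kilometers": "kilometer",
             "mile": "mile", "miles": "mile"}
SYMBOL_MAP = {"km": "kilometer", "mi": "mile"}


def build_pattern(token_map, symbol_map):
    """Compile one regex matching '<number> <unit>' for every known unit."""
    # Try longer alternatives first so "miles" is attempted before "mi".
    units = sorted(list(token_map) + list(symbol_map), key=len, reverse=True)
    unit_ptn = "|".join(re.escape(u) for u in units)
    return re.compile(r"(\d+(?:\.\d+)?)\s*(" + unit_ptn + r")(?:\W|$)",
                      re.IGNORECASE)


def extract_distances(text, token_map=TOKEN_MAP, symbol_map=SYMBOL_MAP):
    """Yield (value, canonical_unit) pairs found in text."""
    pattern = build_pattern(token_map, symbol_map)
    lookup = {**symbol_map, **token_map}
    for match in pattern.finditer(text):
        # Normalize the matched token or symbol to its standard distance type.
        yield float(match.group(1)), lookup[match.group(2).lower()]


print(list(extract_distances("Within 50 miles of office.")))
# prints [(50.0, 'mile')]
```

In this sketch the two maps serve double duty: their keys feed the alternation inside the pattern, and their values normalize whatever was matched. Adding a unit therefore means updating a map and rebuilding the pattern.
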
The default behavior of this module can be customized by overriding the value of ``DISTANCE_PTN_RE`` with a new compiled regular expression. The example below adds a new distance unit, the yard:

>>> # Out-of-the-box behavior
>>> import lexnlp.extract.en.distances
>>> text = "This improvement shall extend for no more than fifteen yards."
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[]
>>> # Register the new tokens and symbol
>>> import regex as re
>>> import lexnlp.extract.en.amounts
>>> lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP["yard"] = "yard"
>>> lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP["yards"] = "yard"
>>> lexnlp.extract.en.distances.DISTANCE_SYMBOL_MAP["yd"] = "yard"
>>> # Rebuild the pattern string from the updated maps
>>> lexnlp.extract.en.distances.DISTANCE_PTN = r"""
... (({num_ptn})\s*
... ({distance_tokens}|{distance_symbols}))(?:\W|$)
... """.format(
...     num_ptn=lexnlp.extract.en.amounts.NUM_PTN.replace('(?:\\W|$)', '').replace('(?<=\\W|^)', ''),
...     distance_symbols='|'.join(lexnlp.extract.en.distances.DISTANCE_SYMBOL_MAP),
...     distance_tokens='|'.join(lexnlp.extract.en.distances.DISTANCE_TOKEN_MAP)
... )
>>> # Recompile the pattern used for matching
>>> lexnlp.extract.en.distances.DISTANCE_PTN_RE = re.compile(lexnlp.extract.en.distances.DISTANCE_PTN, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)
>>> # Run the method again to test
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[(15, 'yard')]
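
Assigning to module-level variables changes them for the rest of the running process, so a customization like this one also affects any later caller. One way to scope such a change is the standard library's ``unittest.mock`` helpers, which restore the original values on exit. The sketch below shows the idea against a hypothetical stand-in object (``distances`` here is a ``SimpleNamespace``, not the real module) with simplified digit-only patterns. Applying the same ``with`` block to the real module is an assumption that has not been verified here.

```python
import re
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module with mutable, module-level configuration.
distances = SimpleNamespace(
    DISTANCE_TOKEN_MAP={"mile": "mile", "miles": "mile"},
    DISTANCE_PTN_RE=re.compile(r"(\d+)\s*(miles|mile)(?:\W|$)", re.IGNORECASE),
)


def find_units(text):
    """Return the canonical units matched under the current configuration."""
    return [distances.DISTANCE_TOKEN_MAP[m.group(2).lower()]
            for m in distances.DISTANCE_PTN_RE.finditer(text)]


text = "A fence 15 yards long, 2 miles from town."
custom_re = re.compile(r"(\d+)\s*(miles|mile|yards|yard)(?:\W|$)", re.IGNORECASE)

# Both patches are undone when the with-block exits, even if an error is raised.
with mock.patch.dict(distances.DISTANCE_TOKEN_MAP, {"yard": "yard", "yards": "yard"}), \
        mock.patch.object(distances, "DISTANCE_PTN_RE", custom_re):
    inside = find_units(text)

outside = find_units(text)
print(inside)   # prints ['yard', 'mile']
print(outside)  # prints ['mile']
```

The trade-off is explicitness: a permanent override is simpler for a one-off script, while a scoped override is easier to reason about in tests or in code that shares the process with other users of the module.
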