Use crfsuite (python wrapper over Rust package bindings) to be compatible with Python 3.10
MicahLyle committed Oct 7, 2021
1 parent c473d3f commit 9f25055
Showing 4 changed files with 13 additions and 9 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -39,7 +39,7 @@ It also does not normalize the address. However, [this library built on top of u
```

## How to use this development code (for the nerds)
-usaddress uses [parserator](https://github.com/datamade/parserator), a library for making and improving probabilistic parsers - specifically, parsers that use [python-crfsuite](https://github.com/tpeng/python-crfsuite)'s implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crfsuite settings file) on labeled training data, and provides tools for adding new labeled training data.
+usaddress uses [parserator](https://github.com/datamade/parserator), a library for making and improving probabilistic parsers - specifically, parsers that use [crfsuite](https://github.com/chokkan/crfsuite)'s implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crf settings file) on labeled training data, and provides tools for adding new labeled training data.

### Building & testing the code in this repo

4 changes: 2 additions & 2 deletions setup.py
@@ -13,9 +13,9 @@
description='Parse US addresses using conditional random fields',
name='usaddress',
packages=['usaddress'],
-package_data={'usaddress': ['usaddr.crfsuite']},
+package_data={'usaddress': ['usaddr.crf']},
license='The MIT License: http://www.opensource.org/licenses/mit-license.php',
-install_requires=['python-crfsuite>=0.7',
+install_requires=['crfsuite>=0.3.1',
'future>=0.14',
'probableparsing'],
classifiers=[
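
Reviewer note: this hunk swaps the `python-crfsuite` requirement for `crfsuite`, so during a migration an environment might have either binding installed. A minimal sketch of how one could check which is importable, assuming only the two module names imported in this commit (`crfsuite` and `pycrfsuite`). The `find_backend` helper is hypothetical and is not part of usaddress or parserator.

```python
import importlib.util

# Module names taken from the import lines in this commit.
# The helper below is an illustrative assumption, not usaddress code.
CANDIDATE_BACKENDS = ("crfsuite", "pycrfsuite")


def find_backend(candidates=CANDIDATE_BACKENDS):
    """Return the first top-level module name that is importable, else None."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print(find_backend())
```

A check like this only reports which module can be imported. It says nothing about whether an existing model file can be read by that binding.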
4 changes: 2 additions & 2 deletions training/README.md
@@ -238,11 +238,11 @@ parserator train training/labeled.xml,training/new_addresses.xml usaddress
After running the command, you should see output that looks something like this:

```
-renaming old model: usaddress/usaddr.crfsuite -> usaddress/usaddr_2016_12_19_21286.crfsuite
+renaming old model: usaddress/usaddr.crf -> usaddress/usaddr_2016_12_19_21286.crf
training model on 1359 training examples from ['training/labeled.xml', 'trainingnew_addresses.xml']
-done training! model file created: usaddress/usaddr.crfsuite
+done training! model file created: usaddress/usaddr.crf
```

This output confirms that usaddress has learned from the new training data. Nice!
12 changes: 8 additions & 4 deletions usaddress/__init__.py
@@ -13,7 +13,7 @@
from ordereddict import OrderedDict
import warnings

-import pycrfsuite
+import crfsuite
import probableparsing

# The address components are based upon the `United States Thoroughfare,
@@ -52,7 +52,7 @@
PARENT_LABEL = 'AddressString'
GROUP_LABEL = 'AddressCollection'

-MODEL_FILE = 'usaddr.crfsuite'
+MODEL_FILE = 'usaddr.crf'
MODEL_PATH = os.path.split(os.path.abspath(__file__))[0] + '/' + MODEL_FILE

DIRECTIONS = set(['n', 's', 'e', 'w',
@@ -136,12 +136,16 @@


try:
-    TAGGER = pycrfsuite.Tagger()
-    TAGGER.open(MODEL_PATH)
+    MODEL = crfsuite.Model(MODEL_PATH)
+    TAGGER = crfsuite.Tagger(MODEL.model)
except IOError:
    warnings.warn('You must train the model (parserator train --trainfile '
                  'FILES) to create the %s file before you can use the parse '
                  'and tag methods' % MODEL_FILE)
+except Exception:
+    warnings.warn('(Generic `Exception`) You must train the model (parserator '
+                  'train --trainfile FILES) to create the %s file before you '
+                  'can use the parse and tag methods' % MODEL_FILE)


def parse(address_string):

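Reviewer note: the two `except` branches in `usaddress/__init__.py` emit nearly the same warning. One way the load-or-warn step could be factored is sketched below. `load_tagger` and its `loader` argument are hypothetical stand-ins. The sketch deliberately does not call the real `crfsuite` API, whose signatures are not confirmed here.

```python
import os
import warnings


def load_tagger(model_path, loader):
    """Build a tagger from `model_path`, or warn and return None on failure.

    `loader` is a hypothetical callable that takes a path and returns a
    tagger object; it stands in for whatever the CRF binding provides.
    """
    try:
        return loader(model_path)
    except Exception as exc:  # IOError is a subclass of Exception
        # Include the exception type so a missing file can be told apart
        # from any other loading failure.
        warnings.warn(
            'You must train the model (parserator train --trainfile FILES) '
            'to create the %s file before you can use the parse and tag '
            'methods (%s: %s)'
            % (os.path.basename(model_path), type(exc).__name__, exc)
        )
        return None
```

Catching `Exception` once covers the `IOError` case as well, and reporting the exception type keeps the information that the separate "(Generic `Exception`)" message was meant to convey.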
1 comment on commit 9f25055

@mlissner
This commit is from about six months ago, but assuming it works, would you be willing to open it as a PR? My organization relies on usaddress and our upgrade to Python 3.10 is blocked by the CRFSuite issue this seems to fix.

I'm not sure usaddress would accept a PR at this point, unfortunately. Datamade seems to have moved on, but it might be worth a try. I tried to spur things forward over here: datamade#320

Thanks for your consideration.
