parse_addresses is corrupting some addresses #105

Open
kmtracey opened this issue Aug 7, 2012 · 1 comment

kmtracey commented Aug 7, 2012

Noticed while tracing exactly what happens when a newsitem is geocoded: the base scraper class's geocode_if_needed passes (essentially) location_name or address_text into the parse_addresses function in ebdata.nlp.addresses, which corrupts some of them.

Examples:
120 J D Murphy Ln --> 120 J
10763 James B White Hwy S --> 10763 James
10376 Rough N Ready Rd --> 10376 Rough N
3578 Old 74 --> 3578 Old
100 John L Riegel Rd --> 100 John
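
To make the failure mode concrete, here is a minimal sketch of a check that flags addresses a parser does not return intact. The pattern below is a deliberately oversimplified toy, not the regular expression used by ebdata.nlp.addresses; it only happens to truncate the reported inputs in the same way. The names `toy_parse_addresses` and `find_corrupted` are hypothetical, and the sketch assumes a parser that returns a list of matched strings.

```python
import re

# Toy pattern, NOT the regex from ebdata.nlp.addresses: a house number, one
# capitalized word, and an optional single-letter compass direction.
TOY_ADDRESS_RE = re.compile(r"\d+ [A-Z][a-z]*(?: [NSEW]\b)?")


def toy_parse_addresses(text):
    """Hypothetical stand-in for parse_addresses; returns the matched strings."""
    return TOY_ADDRESS_RE.findall(text)


def find_corrupted(parser, addresses):
    """Return (input, parsed) pairs where the parser did not return the input intact."""
    corrupted = []
    for address in addresses:
        parsed = parser(address)
        if parsed != [address]:
            corrupted.append((address, parsed))
    return corrupted


# The inputs reported in this issue.
REPORTED = [
    "120 J D Murphy Ln",
    "10763 James B White Hwy S",
    "10376 Rough N Ready Rd",
    "3578 Old 74",
    "100 John L Riegel Rd",
]

if __name__ == "__main__":
    for address, parsed in find_corrupted(toy_parse_addresses, REPORTED):
        print(f"{address!r} --> {parsed!r}")
```

The same `find_corrupted` helper could be pointed at the real parser (adapting for whatever its return type actually is) to build a regression list of street names with initials, embedded direction letters, or numeric route names.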

It isn't immediately obvious to me how to correct the regular expression used by parse_addresses. What it is doing may even be "correct" for the case of extracting an address from a large block of text that may or may not contain one; I'm not sure. For our purposes, though, it seems things would work better if we only passed address_text through this nlp routine, so I'm going to change the geocode_if_needed in our DataDashboard scraper mixin to do that. Later we may decide it makes sense to push these changes back to base OpenBlock, but right now I don't know the code in this area well enough to be sure of that.
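
One way to read that workaround, as a rough sketch only: use location_name verbatim when it is present, and run the address parser only over free-form address_text. The helper name `pick_geocode_text`, its signature, the precedence of location_name, and the assumption that the parser returns a list of strings are all hypothetical; the real geocode_if_needed in OpenBlock or the DataDashboard mixin may be structured quite differently.

```python
def pick_geocode_text(location_name, address_text, parse):
    """Choose the string to hand to the geocoder.

    Hypothetical helper: `parse` is any callable that takes text and returns
    a list of address strings (an assumed shape, not the confirmed return
    type of parse_addresses).
    """
    location_name = (location_name or "").strip()
    if location_name:
        # Already a clean address: do not run it through the parser,
        # so it cannot be truncated.
        return location_name
    # Free-form text: let the parser try to extract an address from it.
    matches = parse(address_text or "")
    return matches[0] if matches else None
```

With this shape, a location_name such as "120 J D Murphy Ln" reaches the geocoder unchanged no matter how the parser behaves, while address extraction from longer text is still attempted when no location_name is available.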

@ghost ghost assigned kmtracey Aug 7, 2012
@sarahdooley

(worked around for now; to be raised with OpenBlock)

@kmtracey kmtracey removed their assignment Sep 23, 2015