Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standardize_address not accurate #756

Closed
tmcsys opened this issue Mar 3, 2024 · 3 comments
Closed

standardize_address not accurate #756

tmcsys opened this issue Mar 3, 2024 · 3 comments

Comments

@tmcsys
Copy link

tmcsys commented Mar 3, 2024

standardize_address returns different results from the example on page 210 of the 2nd edition of PostGIS In Action.

select * from standardize_address('us_lex', 'us_gaz', 'us_rules', 'ONE E PIMA ST STE 999, TUCSON, AZ’)

PimaPGIS

It fails to identify basic pre-directionals, instead including them as part of the house number.

I recently upgraded to PostgresApp 2.7.2 (PG 16, PostGIS 3.4.1) basically to be able to use debug_standardize_address.

I did a pg_dumpall from the PG15 image, installed 2.7.2 and restored the dump to the new instance. As final step of the upgrade I dropped and re-created the address_standardizer_data_us extension which I believe created would have created new (not restored) gaz and lex files.

I have some background in address normalization/standardization and had every expectation that the PostGIS implementation would solve all my problems. I think I’ve been naive.

standardize_address appears to be incapable of parsing the simplest address.

Take for example:

select * from standardize_address('us_lex', 'us_gaz', 'us_rules’, '8907 W TROPICAIRE BLVD,NORTH PORT,FL, 34291’)

The routine insists on including ‘W’ as part of the house_num when it is clearly a pre-directional (predir).

select_returns

debug_standardize_address identifies ‘W’ as a directional.

SELECT it->>'pos' AS position, it->>'word' AS word, it->>'stdword' AS standardized_word,
         it->>'token' AS token, it->>'token-code' AS token_code
    FROM jsonb(
         debug_standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD, NORTH PORT,FL, 34291')
    ) AS s, jsonb_array_elements(s->'input_tokens') AS it
debug_standardize_address

The only predir that appears to be recognized by standardize_address is ’S’ as ‘SOUTH’. These appear to be glaring shortcomings.

If it can be verified that this is not a PostgresApp issue I will raise it with the PostGIS committer and you can close this issue here. There seems to be very little recent work done on PostGIS address standardize functions.

@tbussmann
Copy link
Member

I'm not sure, I can be of much help here as I don't know much on the structure of US addresses nor the internals of PostGIS' address_standardizer.

If it can be verified that this is not a PostgresApp issue I will raise it with the PostGIS committer and you can close this issue here. There seems to be very little recent work done on PostGIS address standardize functions.

What I can confirm is that the make installcheck targets for the address_standardizer extension pass without an error. When I tried the sample queries you gave, I get the same results beside for the filed 'country' in the second query, which is filled with 'USA'. The results are the same with PostGIS 3.4.1 as well as with the upcoming version 3.4.2 (not yet released in Postgres.app)

My output looks like this:
# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', 'ONE E PIMA ST STE 999, TUCSON, AZ');
┌──────────┬───────────┬────────┬──────┬─────────┬──────┬─────────┬────────┬────────────┬───────┬────────┬─────────┬─────────┬──────────┬─────┬───────────┐
│ building │ house_num │ predir │ qual │ pretype │ name │ suftype │ sufdir │ ruralroute │ extra │  city  │  state  │ country │ postcode │ box │   unit    │
├──────────┼───────────┼────────┼──────┼─────────┼──────┼─────────┼────────┼────────────┼───────┼────────┼─────────┼─────────┼──────────┼─────┼───────────┤
│ ¤        │ 1 E       │ ¤      │ ¤    │ ¤       │ PIMA │ STREET  │ ¤      │ ¤          │ ¤     │ TUCSON │ ARIZONA │ USA     │ ¤        │ ¤   │ SUITE 999 │
└──────────┴───────────┴────────┴──────┴─────────┴──────┴─────────┴────────┴────────────┴───────┴────────┴─────────┴─────────┴──────────┴─────┴───────────┘
(1 row)

# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD,NORTH PORT,FL, 34291');
┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐
│ building │ house_num │ predir │ qual │ pretype │    name    │  suftype  │ sufdir │ ruralroute │ extra │    city    │  state  │ country │ postcode │ box │ unit │
├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤
│ ¤        │ 8907 W    │ ¤      │ ¤    │ ¤       │ TROPICAIRE │ BOULEVARD │ ¤      │ ¤          │ ¤     │ NORTH PORT │ FLORIDA │ USA     │ 34291    │ ¤   │ ¤    │
└──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘
(1 row)

# SELECT it->>'pos' AS position, it->>'word' AS word, it->>'stdword' AS standardized_word,
[more]2-#          it->>'token' AS token, it->>'token-code' AS token_code
[more]3-#     FROM jsonb(
[more]3(#          debug_standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD, NORTH PORT,FL, 34291')
[more]3(#     ) AS s, jsonb_array_elements(s->'input_tokens') AS it;
┌──────────┬────────────┬───────────────────┬────────┬────────────┐
│ position │    word    │ standardized_word │ token  │ token_code │
├──────────┼────────────┼───────────────────┼────────┼────────────┤
│ 0        │ 8907       │ 8907              │ NUMBER │ 0          │
│ 0        │ 8907       │ 8907              │ QUAD   │ 29         │
│ 1        │ W          │ W                 │ SINGLE │ 18         │
│ 1        │ W          │ WEST              │ DIRECT │ 22         │
│ 2        │ TROPICAIRE │ TROPICAIRE        │ WORD   │ 1          │
│ 3        │ BLVD       │ BOULEVARD         │ TYPE   │ 2          │
└──────────┴────────────┴───────────────────┴────────┴────────────┘
(6 rows)

# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 S TROPICAIRE BLVD,NORTH PORT,FL, 34291');
┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐
│ building │ house_num │ predir │ qual │ pretype │    name    │  suftype  │ sufdir │ ruralroute │ extra │    city    │  state  │ country │ postcode │ box │ unit │
├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤
│ ¤        │ 8907      │ SOUTH  │ ¤    │ ¤       │ TROPICAIRE │ BOULEVARD │ ¤      │ ¤          │ ¤     │ NORTH PORT │ FLORIDA │ USA     │ 34291    │ ¤   │ ¤    │
└──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘
(1 row)

As final step of the upgrade I dropped and re-created the address_standardizer_data_us extension which I believe created would have created new (not restored) gaz and lex files.

If you create a logical backup (pg_dump/pg_dumpall) the version information and data of extensions are not included, so you'll get them in the latest/default version after loading the dump file anyhow. If you want to update these extensions later, you can run ALTER EXTENSION address_standardizer UPDATE; ALTER EXTENSION address_standardizer_data_us UPDATE;
It should be possible to add your own rules to the us_rules, us_lex and us_gaz tables by specifying is_custom - these would then be included in a dump.

I'm afraid I can't do more than point you to the documentation or the PostGIS mailing lists for more help.

@tmcsys
Copy link
Author

tmcsys commented Mar 15, 2024

Confirmed PostGIS issue:

https://trac.osgeo.org/postgis/ticket/5695#ticket

Thanks!

@jakob
Copy link
Member

jakob commented Apr 12, 2024

It looks like this issue will be fixed upstream in PostGIS 3.4.3 and 3.5.0. I'm closing the issue. Thanks for investigating!

@jakob jakob closed this as completed Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants