Import Southwark #326
This data suffers from the same missing house names/flat numbers issue as a lot of the other data we've been seeing recently. For example:
and so on and so on... The affected postcodes are:

The best explanation of the problem is here: #282

For the moment, I've been manually removing all the addresses in any postcode where this is an issue. If someone searches for an affected postcode, we'd otherwise either present the same incomplete 'address' twice, each yielding different results, or (if we only remove the problem rows) present an incomplete address list. I think the latter is also a bad idea: if the user's address isn't listed, there's no obvious path to follow. Maybe we need a 'my address is not in the list' link, directing to the "we don't know - call your council" page or something, as an improvement so we don't have to exclude every address covered by postcodes containing a case of this.

As a longer term position, I want to be able to handle this in the import scripts, but the current one-record-at-a-time method makes this cumbersome. This is another reason to move to a method where we:
For the other import scripts you've done, the source data looks sensible, but I haven't actually run or checked the scripts.
  postcode=address_info['postcode'],
  polling_station_id=address_info['polling_station_id'],
- slug=slug
+ slug=slug,
This is fine, as long as we don't try and use UPRN as the slug - remember, multiple dwellings in a subdivided property share a UPRN as we discussed here: #222 (comment)
Good spot on the addresses, and I agree with your suggestion of loading the list in memory, validating and then importing. For the time being, I might add something that does a sense check – what are you running to generate the list above?
A horribly slow spreadsheet on the input data at the moment :( That's one of the reasons data with this problem is such a faff to deal with. Can share it if it helps, though.
No, it's ok – I think I'd rather try to work on a script that checks and flags problems. Are you grouping by address, postcode and polling district, and seeing if there are duplicates across districts?
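A rough sketch of the kind of check being asked about here, assuming each input row has already been parsed into a dict. The `postcode` and `polling_station_id` keys match the import script above; the `address` key, the function name and the sample rows are made up for illustration. It groups rows by address and postcode and flags any group that points at more than one polling station:

```python
from collections import defaultdict


def find_ambiguous_addresses(records):
    """Return {(address, postcode): [station ids]} for every address/postcode
    pair that is assigned to more than one polling station."""
    stations = defaultdict(set)
    for rec in records:
        # Normalise case and spacing so trivially different spellings group together.
        key = (
            rec["address"].strip().upper(),
            rec["postcode"].replace(" ", "").upper(),
        )
        stations[key].add(rec["polling_station_id"])
    return {key: sorted(ids) for key, ids in stations.items() if len(ids) > 1}


# Invented sample rows: the first two share an incomplete address
# but belong to different polling stations.
records = [
    {"address": "1 Example Road", "postcode": "SE1 1AA", "polling_station_id": "A"},
    {"address": "1 Example Road", "postcode": "SE1 1AA", "polling_station_id": "B"},
    {"address": "2 Example Road", "postcode": "SE1 1AA", "polling_station_id": "A"},
]

ambiguous = find_ambiguous_addresses(records)
for (address, postcode), ids in ambiguous.items():
    print(address, postcode, ids)
```

The set of affected postcodes is then just the second element of each flagged key, which could replace the spreadsheet step for generating that list.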
Haven't tried doing this as a post process (i.e. insert all the data, then delete the bad bits), but I think if you

In terms of other data lined up for import, Enfield, Islington and Hillingdon all definitely have the same problem. There are some that I've not looked at, so watch out for this - it's quite common.
Ok, that SQL works, and there are 142 fewer addresses for Southwark after re-importing with the latest fix in it (16296f5)
This fixes the issue of ambiguous addresses, as outlined in #282, by deleting duplicates after import.
This is based against addressbase_postcode_fixes (#308), so merging is blocked on that.

Fixes #320
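For illustration only, here is the same "delete the ambiguous rows after loading" idea done in memory rather than in SQL. This is not the code in this PR: the function name, the `address` key and the sample rows are assumptions, and it drops only the ambiguous rows, not every address in the affected postcode.

```python
from collections import defaultdict


def drop_ambiguous(records):
    """Return only the records whose (address, postcode) pair maps to
    exactly one polling station; ambiguous rows are all discarded."""
    stations = defaultdict(set)
    for rec in records:
        stations[(rec["address"], rec["postcode"])].add(rec["polling_station_id"])
    return [
        rec
        for rec in records
        if len(stations[(rec["address"], rec["postcode"])]) == 1
    ]


# Invented sample rows: "1 Example Road" appears under two stations.
records = [
    {"address": "1 Example Road", "postcode": "SE1 1AA", "polling_station_id": "A"},
    {"address": "1 Example Road", "postcode": "SE1 1AA", "polling_station_id": "B"},
    {"address": "2 Example Road", "postcode": "SE1 1AA", "polling_station_id": "A"},
]

kept = drop_ambiguous(records)
print(len(records) - len(kept), "ambiguous rows removed")
```

As discussed in the thread, removing only the problem rows leaves an incomplete address list for that postcode, so a sketch like this would need to be paired with some fallback for users whose address is missing.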