Import Southwark #326

Merged
merged 2 commits into from Jun 20, 2016
Conversation

symroe
Member

@symroe symroe commented Jun 17, 2016

This is based against addressbase_postcode_fixes (#308), so merging is blocked on that.

Fixes #320

@chris48s
Collaborator

This data suffers from the same missing house names/flat numbers issue as a lot of the other data we've been seeing recently. For example:

  • Townley Road, SE22 8SW is in districts VIL5 and VIL4
  • Nunhead Estate, SE15 3PQ is in districts TLN4 and TLN5
  • Grove Lane, SE5 8DB is in districts SCA1 and SCA5

and so on and so on...

The affected postcodes are:
SE1 0BB
SE1 0BL
SE1 0BU
SE1 0NR
SE1 2HG
SE1 3AG
SE1 3ES
SE1 3GG
SE1 3HB
SE1 4AD
SE1 4GR
SE1 4PA
SE1 4RF
SE1 4TW
SE1 4XY
SE1 4YY
SE1 5LJ
SE1 5UB
SE1 5UE
SE1 5UT
SE1 8EZ
SE1 8HA
SE1 8HU
SE15 1JB
SE15 1NG
SE15 1NL
SE15 2HH
SE15 2PL
SE15 3PQ
SE15 4NB
SE15 4NL
SE15 4TP
SE15 5BS
SE15 5EU
SE15 5PY
SE15 6GX
SE16 3PB
SE16 3QP
SE16 3RP
SE16 4EJ
SE16 4TT
SE16 5AB
SE17 1NE
SE17 2BT
SE17 2NJ
SE17 2TE
SE17 3LY
SE21 7LZ
SE21 7NA
SE22 8SW
SE22 9AX
SE22 9JH
SE22 9PE
SE5 0AW
SE5 8BG
SE5 8DB
SE5 8PX
SE5 8SY
SE8 5DJ
SE8 5DN

The best explanation of the problem is here: #282

For the moment, I've been manually removing all the addresses in any postcode where this is an issue. If someone searches for an affected postcode, we'd otherwise either present the same incomplete 'address' twice, each yielding different results, or, if we only remove the problem rows, present an incomplete address list. I think the latter is also a bad idea: if the user's address isn't listed, there's no obvious path to follow. Maybe we need a 'my address is not in the list' link, directing to the "we don't know - call your council" page or something, as an improvement so we don't have to exclude every address covered by postcodes containing a case of this.

As a longer term position, I want to be able to handle this in the import scripts, but the current one-record-at-a-time method makes this cumbersome. This is another reason to move to a method where we:

  • Build up data to insert in memory
  • Perform validation checks (such as looking for conditions like this)
  • Insert validated data
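
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's import code: `AddressRecord`, `find_ambiguous_postcodes` and `validated` are hypothetical names, the validation rule is the "same address and postcode in more than one district" condition described in this thread, and the third sample row is made up.

```python
from collections import defaultdict
from typing import NamedTuple


class AddressRecord(NamedTuple):
    # Hypothetical record shape; the real import scripts may differ.
    address: str
    postcode: str
    polling_district: str


def find_ambiguous_postcodes(records):
    """Return postcodes where one (address, postcode) pair spans 2+ districts."""
    districts = defaultdict(set)
    for rec in records:
        districts[(rec.address, rec.postcode)].add(rec.polling_district)
    return {postcode for (_, postcode), seen in districts.items() if len(seen) > 1}


def validated(records):
    """Drop every record in an affected postcode (the current manual approach)."""
    bad = find_ambiguous_postcodes(records)
    return [rec for rec in records if rec.postcode not in bad]


# 1. Build up data to insert in memory.
records = [
    AddressRecord("Townley Road", "SE22 8SW", "VIL5"),
    AddressRecord("Townley Road", "SE22 8SW", "VIL4"),
    AddressRecord("1 Example Street", "SE1 9ZZ", "ABC1"),  # made-up row
]

# 2. Perform validation checks.
clean = validated(records)

# 3. Insert validated data (stand-in: a real import would bulk-insert here).
print(f"{len(clean)} of {len(records)} records kept")
```

Because the whole batch is in memory before anything is written, the check can look across records, which the one-record-at-a-time approach cannot do easily.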

For the other import scripts you've done, the source data looks like it's probably sensible, but I've not actually run or checked those scripts.

  postcode=address_info['postcode'],
  polling_station_id=address_info['polling_station_id'],
- slug=slug
+ slug=slug,
Collaborator

This is fine, as long as we don't try and use UPRN as the slug - remember, multiple dwellings in a subdivided property share a UPRN as we discussed here: #222 (comment)

@symroe
Member Author

symroe commented Jun 19, 2016

Good spot on the addresses, and I agree with your suggestion of loading the list in memory, validating and then importing.

For the time being, I might add something that does a sense check – what are you running to generate the list above?

@chris48s
Collaborator

Horrible slow spreadsheet on the input data at the moment :( That's one of the reasons data with this problem is such a faff to deal with. Can share if it helps though..

@symroe
Member Author

symroe commented Jun 19, 2016

No, it's OK – I think I'd rather try to work on a script that checks and flags problems. Are you grouping by address, postcode and polling district, and seeing if there are duplicates across districts?

@chris48s chris48s mentioned this pull request Jun 19, 2016
2 tasks
@chris48s
Collaborator

Haven't tried doing this as a post-process (i.e. insert all the data, then delete the bad bits), but I think if you SELECT CONCAT(address, postcode), COUNT(*) ... GROUP BY CONCAT(address, postcode), anything where COUNT(*) > 1 is a problem? (untested) Then you grab the postcodes associated with any records matching those criteria and delete all records with that postcode (or maybe just the ones flagged by that query if we add the "my address is not in the list" button).
Does doing that for Southwark give you the same list I got?
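
As a rough illustration of that post-process idea, here is a self-contained sketch using Python's built-in `sqlite3` module. The table name `addresses`, its columns and the two "Example Street" rows are assumptions made for the example, not the project's actual schema or data, and it groups by the two columns directly rather than using `CONCAT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real schema may differ.
conn.execute(
    "CREATE TABLE addresses (address TEXT, postcode TEXT, polling_district TEXT)"
)
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?)",
    [
        ("Townley Road", "SE22 8SW", "VIL5"),
        ("Townley Road", "SE22 8SW", "VIL4"),
        ("2 Example Street", "SE22 8SW", "VIL5"),  # made-up row, same postcode
        ("1 Example Street", "SE1 9ZZ", "ABC1"),   # made-up row, unaffected
    ],
)

# Any (address, postcode) pair that appears more than once is ambiguous.
DUPLICATES = """
    SELECT address, postcode, COUNT(*)
    FROM addresses
    GROUP BY address, postcode
    HAVING COUNT(*) > 1
"""
bad_postcodes = sorted({row[1] for row in conn.execute(DUPLICATES)})

# Delete every record in an affected postcode, not just the flagged rows.
conn.executemany(
    "DELETE FROM addresses WHERE postcode = ?", [(p,) for p in bad_postcodes]
)
remaining = conn.execute("SELECT COUNT(*) FROM addresses").fetchone()[0]
print(bad_postcodes, remaining)
```

Using `COUNT(DISTINCT polling_district) > 1` in the `HAVING` clause would be a narrower variant that ignores exact duplicate rows within a single district.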

In terms of other data lined up for import, Enfield, Islington and Hillingdon all definitely have the same problem. There are some that I've not looked at, so watch out for this - it's quite common.

@symroe
Member Author

symroe commented Jun 19, 2016

OK, that SQL works, and there are 142 fewer addresses for Southwark after re-importing with the latest fix (16296f5).

This fixes the issue of ambiguous addresses, as outlined in #282, by deleting duplicates after import
@symroe symroe merged commit 13ff905 into master Jun 20, 2016
@symroe symroe removed the Review label Jun 20, 2016
@symroe symroe deleted the import_southwark branch June 20, 2016 07:40
Successfully merging this pull request may close these issues.

Import E09000028-Southwark
2 participants