Import Southwark #326

symroe · 2016-06-17T14:28:17Z

This is based on addressbase_postcode_fixes (#308), so merging is blocked until that is merged.

Fixes #320

chris48s · 2016-06-18T22:03:28Z

This data suffers from the same missing house names/flat numbers issue as a lot of the other data we've been seeing recently. For example:

Townley Road, SE22 8SW is in districts VIL5 and VIL4
Nunhead Estate, SE15 3PQ is in districts TLN4 and TLN5
Grove Lane, SE5 8DB is in districts SCA1 and SCA5

and so on and so on...

The affected postcodes are:
SE1 0BB
SE1 0BL
SE1 0BU
SE1 0NR
SE1 2HG
SE1 3AG
SE1 3ES
SE1 3GG
SE1 3HB
SE1 4AD
SE1 4GR
SE1 4PA
SE1 4RF
SE1 4TW
SE1 4XY
SE1 4YY
SE1 5LJ
SE1 5UB
SE1 5UE
SE1 5UT
SE1 8EZ
SE1 8HA
SE1 8HU
SE15 1JB
SE15 1NG
SE15 1NL
SE15 2HH
SE15 2PL
SE15 3PQ
SE15 4NB
SE15 4NL
SE15 4TP
SE15 5BS
SE15 5EU
SE15 5PY
SE15 6GX
SE16 3PB
SE16 3QP
SE16 3RP
SE16 4EJ
SE16 4TT
SE16 5AB
SE17 1NE
SE17 2BT
SE17 2NJ
SE17 2TE
SE17 3LY
SE21 7LZ
SE21 7NA
SE22 8SW
SE22 9AX
SE22 9JH
SE22 9PE
SE5 0AW
SE5 8BG
SE5 8DB
SE5 8PX
SE5 8SY
SE8 5DJ
SE8 5DN

The best explanation of the problem is here: #282

For the moment, I've been manually removing all the addresses in any postcode where this is an issue. If someone searches for an affected postcode, we'd otherwise either present the same incomplete 'address' twice, each yielding different results, or, if we only remove the problem rows, present an incomplete address list. I think the incomplete list is also a bad idea: if the user's address isn't listed, there's no obvious path to follow. Maybe we need a 'my address is not in the list' link, directing to the "we don't know - call your council" page or something, as an improvement so we don't have to exclude every address covered by a postcode containing a case of this.

As a longer term position, I want to be able to handle this in the import scripts, but the current one-record-at-a-time method makes this cumbersome. This is another reason to move to a method where we:

Build up data to insert in memory
Perform validation checks (such as looking for conditions like this)
Insert validated data
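
A minimal sketch of those three steps, assuming each record is a dict with `address`, `postcode` and `polling_station_id` keys. The function names are hypothetical and are not taken from the existing import scripts, and the second address and postcode in the sample data are made up for illustration:

```python
from collections import defaultdict


def find_ambiguous_postcodes(records):
    """Return postcodes where one (address, postcode) pair maps to several stations."""
    stations = defaultdict(set)
    for rec in records:
        key = (rec["address"], rec["postcode"])
        stations[key].add(rec["polling_station_id"])
    return {postcode for (_, postcode), ids in stations.items() if len(ids) > 1}


def validated_records(records):
    """Drop every record in an affected postcode, mirroring the manual clean-up."""
    bad = find_ambiguous_postcodes(records)
    return [rec for rec in records if rec["postcode"] not in bad]


# Step 1: build up the data in memory (illustrative rows only).
records = [
    {"address": "Townley Road", "postcode": "SE22 8SW", "polling_station_id": "VIL5"},
    {"address": "Townley Road", "postcode": "SE22 8SW", "polling_station_id": "VIL4"},
    {"address": "1 Example Street", "postcode": "SE99 9ZZ", "polling_station_id": "XXX1"},
]

# Step 2: validate. Step 3 would insert only what survives.
to_insert = validated_records(records)
```

Here `to_insert` keeps only the record in the unaffected postcode; the actual insert step is left out because it depends on the importer's models.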

For the other import scripts you've done, the source data looks sensible, but I've not actually run or checked the scripts.

chris48s · 2016-06-19T10:41:10Z

polling_stations/apps/data_collection/management/commands/__init__.py

-            postcode=address_info['postcode'],
-            polling_station_id=address_info['polling_station_id'],
-            slug=slug
+            slug=slug,


This is fine, as long as we don't try and use UPRN as the slug - remember, multiple dwellings in a subdivided property share a UPRN as we discussed here: #222 (comment)

symroe · 2016-06-19T10:45:22Z

Good spot on the addresses, and I agree with your suggestion of loading the list in memory, validating and then importing.

For the time being, I might add something that does a sense check – what are you running to generate the list above?

chris48s · 2016-06-19T10:48:55Z

Horrible slow spreadsheet on the input data at the moment :( That's one of the reasons data with this problem is such a faff to deal with. Can share if it helps though..

symroe · 2016-06-19T10:51:56Z

No, it's ok – I think I'd rather try to work on a script that checks and flags problems. Are you grouping by address, postcode and polling district, and seeing if there are duplicates across districts?

chris48s · 2016-06-19T12:03:55Z

Haven't tried doing this as a post process (i.e. insert all the data, then delete the bad bits), but I think if you SELECT CONCAT(address, postcode), COUNT(*) ... GROUP BY CONCAT(address, postcode), then anything where COUNT(*) > 1 is a problem? (untested) Then you grab the postcodes associated with any records matching that criteria and delete all records with that postcode (or maybe just the ones flagged by that query if we add the "my address is not in the list" button).
Does doing that for Southwark give you the same list I got?
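
A rough sketch of that post-process, run against an in-memory SQLite database so it is self-contained. The table name `residential_address` and its columns are assumptions for illustration, not the project's real schema, and the "Example" rows are made up. It groups by `address, postcode` rather than `CONCAT(...)`, which flags the same rows without relying on a `CONCAT` function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE residential_address "
    "(address TEXT, postcode TEXT, polling_station_id TEXT)"
)
conn.executemany(
    "INSERT INTO residential_address VALUES (?, ?, ?)",
    [
        ("Grove Lane", "SE5 8DB", "SCA1"),
        ("Grove Lane", "SE5 8DB", "SCA5"),
        ("2 Example Road", "SE5 8DB", "SCA1"),
        ("1 Example Street", "SE99 9ZZ", "XXX1"),
    ],
)

# Postcodes containing at least one (address, postcode) pair that occurs more than once.
AMBIGUOUS_POSTCODES = """
    SELECT DISTINCT postcode FROM (
        SELECT postcode
        FROM residential_address
        GROUP BY address, postcode
        HAVING COUNT(*) > 1
    )
"""
bad_postcodes = [row[0] for row in conn.execute(AMBIGUOUS_POSTCODES)]

# Delete every record in an affected postcode, not only the flagged rows.
conn.executemany(
    "DELETE FROM residential_address WHERE postcode = ?",
    [(postcode,) for postcode in bad_postcodes],
)
remaining = conn.execute(
    "SELECT address, postcode FROM residential_address"
).fetchall()
```

Note that `COUNT(*) > 1` also flags exact duplicate rows that point at the same station; if that matters, counting distinct `polling_station_id` values per group would be one way to narrow it.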

In terms of other data lined up for import, Enfield, Islington and Hillingdon all definitely have the same problem. There are some that I've not looked at, so watch out for this - it's quite common.

symroe · 2016-06-19T13:52:32Z

Ok that SQL works, and there are 142 fewer addresses for Southwark after re-importing with the latest fix in it (16296f5)

This fixes the issue of ambiguous addresses, as outlined in #282 by deleting duplicates after import

symroe added the Review label Jun 17, 2016

symroe force-pushed the import_southwark branch from 8e57a26 to 9647810 Compare June 17, 2016 14:29

chris48s reviewed Jun 19, 2016

chris48s mentioned this pull request Jun 19, 2016

AddressBase postcode fixes #308

Merged

2 tasks

symroe force-pushed the import_southwark branch from 9647810 to 16296f5 Compare June 19, 2016 13:51

This was referenced Jun 19, 2016

Import Tower Hamlets #333

Merged

Import Islington #334

Merged

Import Westminster #335

Merged

Import Bexley #336

Merged

Import Kensington and Chelsea #337

Merged

Import Enfield #338

Merged

symroe added 2 commits June 19, 2016 22:40

Import Southwark

e45d5c8

clean_ambiguous_addresses after import

8c39b75

This fixes the issue of ambiguous addresses, as outlined in #282 by deleting duplicates after import

symroe force-pushed the import_southwark branch from 16296f5 to 8c39b75 Compare June 19, 2016 21:40

symroe merged commit 13ff905 into master Jun 20, 2016

symroe removed the Review label Jun 20, 2016

symroe deleted the import_southwark branch June 20, 2016 07:40
