# Clean and Extract Address Components
With a particular focus on postcodes and countries.  For this project the tools we have are:
- spaCy (POS tagging)
- libpostal (which should automatically solve this problem, but has real trouble with misordered addresses, which is why we're doing this)
- WordNinja (splits strings with no whitespace into words using a probabilistic tool trained on Wikipedia)
- regex (of course!  In this case an extremely complicated one based on the UK postcode format)
- Any business rules we can dream up

To test how these packages might be used, we've got a dataset of mixed scraped and fake American addresses, embedded in nonsense text, described and available at https://onethinglab.com/2018/03/05/extracting-addresses-from-text/.  This isn't ideal because it's quite dirty - the real problem is explicitly entered addresses that people have mucked up while entering them.

I've also got the Companies House free dataset of UK registered company addresses.  These are structured, which means I can creatively "deconstruct" them to create dirty, messy, human-entered addresses.  A summary of the fields included can be found at http://resources.companieshouse.gov.uk/toolsToHelp/pdf/freeDataProductDataset.pdf.

Finally, a smaller dataset of addresses is available from HM Land Registry, covering prices paid for properties (1995 - 2012, no idea why that date window).  It can be downloaded from https://data.gov.uk/dataset/314f77b3-e702-4545-8bcb-9ef8262ea0fd/archived-price-paid-information-residential-property-1995-to-2012 and can be abused in the same way as the CH dataset above.  It has the advantage of being small enough to fit in memory, so I don't have to sub-sample to get something to experiment with.

In [1]:
import spacy
import wordninja
import re
import pickle

import numpy as np
import pandas as pd

from random import random, randrange, sample, shuffle
from postal.parser import parse_address  # Python wrapper for libpostal, which is a C library

In [2]:
# Regex courtesy of commenter "borrible" on the Stack Overflow thread https://stackoverflow.com/questions/6409948/php-extract-uk-postal-code-and-validate-it
# Note: it only matches upper-case letters and allows at most one space inside the postcode.
postcode_regex = re.compile(r"((GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) ?[0-9][ABD-HJLNP-UW-Z]{2}))")
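
As a quick sanity check, the pattern can be tried on a couple of address strings. This is a minimal, self-contained sketch: the helper name `find_postcodes` is made up for illustration and is not part of the notebook. It upper-cases the input first, because the pattern only matches capital letters.

```python
import re

# Same pattern as above, repeated so that this snippet runs on its own
postcode_regex = re.compile(r"((GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) ?[0-9][ABD-HJLNP-UW-Z]{2}))")


def find_postcodes(text):
    """Hypothetical helper: return every substring that looks like a UK postcode."""
    return [match.group(0) for match in postcode_regex.finditer(text.upper())]


# Lower-case input, postcode written with its usual single space
print(find_postcodes("9 irvine way york york yo24 2xq uk"))

# Postcode with the internal space missing
print(find_postcodes("105 HERON WAY TORQUAY TORBAY TQ27SU TAIWAN CHINA"))

# All whitespace missing: the pattern is unanchored, so it can swallow a
# neighbouring letter ("ES810AB" is returned here instead of "S810AB")
print(find_postcodes("31QUEENSWAYWORKSOPNOTTINGHAMSHIRES810ABUK"))
```

The last call shows why the "all whitespace missing" case needs more than the regex alone: with no word boundaries, the final letter of the county is a valid first letter of a postcode area, so the match starts one character too early.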

In [3]:
#with open("./data/IOB_tagged_addresses.pkl", "rb") as f:
#    data = pickle.load(f)

## Fetch Clean Addresses
From the Land Registry dataset.  It has all the components, accidental repeats, and valid UK postcodes!  The only thing it lacks is a country, which I will create by picking at random from a list of example country names/designations (despite the fact that the addresses are all in the UK).  I don't actually know whether there are any particular problems with this dataset.

In [4]:
countries = ['UK', 'U.K', 'U.K', 'CHINA', 'P.R.CHINA', 'TAIWAN CHINA', 'INDIA', 'IRAN']

In [5]:
lr_dat = pd.read_csv("./data/pp-2012-part1.csv", header=None, nrows=200)

In [6]:
lr_dat.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {BD0D075D-7818-47B3-9657-7651CBD02219} | 155000 | 2012-09-28 00:00 | YO24 2XQ | S | N | F | 9 |  | IRVINE WAY |  | YORK | YORK | YORK | A | A |
| 1 | {0FF70080-C8EA-4DDC-9C19-7651D9B540CC} | 264000 | 2012-07-18 00:00 | PO11 9SE | D | N | F | 16 |  | TRELOAR ROAD |  | HAYLING ISLAND | HAVANT | HAMPSHIRE | A | A |
| 2 | {80CC2177-EDBE-4ABB-9EAA-7651DAA0E4EE} | 20000 | 2012-04-25 00:00 | HU8 8ES | T | N | F | 8 |  | LABURNUM GROVE |  | HULL | CITY OF KINGSTON UPON HULL | CITY OF KINGSTON UPON HULL | A | A |
| 3 | {076B2969-50D6-4659-8956-7651ECAAB904} | 89950 | 2012-10-25 00:00 | S81 0AB | S | N | F | 31 |  | QUEENSWAY |  | WORKSOP | BASSETLAW | NOTTINGHAMSHIRE | A | A |
| 4 | {9A6B1BBC-3522-4B44-9FD6-765202F73E28} | 168000 | 2012-04-13 00:00 | TQ2 7SU | S | N | F | 105 |  | HERON WAY |  | TORQUAY | TORBAY | TORBAY | A | A |


In [7]:
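clean_addresses = []

# Iterate through the rows and pull out the address components.
# The file has no header, so columns are integer-labelled. Judging by the
# output above: 3 = postcode, 7-11 = house number/name, flat, street,
# locality and town, 13 = county.
for index, row in lr_dat.iterrows():
    address = {}
    address['postcode'] = "" if pd.isna(row[3]) else str(row[3])
    address['region'] = "" if pd.isna(row[13]) else str(row[13])
    # Join the street-level parts, skipping any that are missing
    address['street_address'] = " ".join(str(x) for x in row.iloc[7:12] if pd.notna(x))
    # Random country label (every address is really in the UK)
    address['country'] = sample(countries, 1)[0]

    clean_addresses.append(address)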

In [8]:
clean_addresses

[{'country': 'UK',
  'postcode': 'YO24 2XQ',
  'region': 'YORK',
  'street_address': '9 IRVINE WAY YORK'},
 {'country': 'P.R.CHINA',
  'postcode': 'PO11 9SE',
  'region': 'HAMPSHIRE',
  'street_address': '16 TRELOAR ROAD HAYLING ISLAND'},
 {'country': 'IRAN',
  'postcode': 'HU8 8ES',
  'region': 'CITY OF KINGSTON UPON HULL',
  'street_address': '8 LABURNUM GROVE HULL'},
 {'country': 'UK',
  'postcode': 'S81 0AB',
  'region': 'NOTTINGHAMSHIRE',
  'street_address': '31 QUEENSWAY WORKSOP'},
 {'country': 'TAIWAN CHINA',
  'postcode': 'TQ2 7SU',
  'region': 'TORBAY',
  'street_address': '105 HERON WAY TORQUAY'},
 {'country': 'U.K',
  'postcode': 'SY13 4AT',
  'region': 'CHESHIRE EAST',
  'street_address': 'BURLEYDAM HOUSE WHITCHURCH ROAD BURLEYDAM WHITCHURCH'},
 {'country': 'IRAN',
  'postcode': 'ME14 5HW',
  'region': 'KENT',
  'street_address': 'PARK HOUSE FLAT 21 PARK AVENUE MAIDSTONE'},
 {'country': 'U.K',
  'postcode': 'CH5 4DF',
  'region': 'FLINTSHIRE',
  'street_address': '98 HIGH S

## Build a Method to Turn Them into Dirty Addresses

Mistakes to generate:
- randomly missed whitespaces
- randomly inserted whitespaces (esp. in postcode)
- postcodes moved within address
- country moved within address
- postcode or country with parentheses around them (and often in the wrong place!)
- postcode or country with quotation marks around them
- all whitespace missing (special case for postcode extraction)
- "TEL=XXX..." or "FAX=XXX" with or without equals sign (might be space)
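
Two of the whitespace mistakes in this list (randomly inserted whitespace, and all whitespace missing) are easy to sketch on their own. The helpers below are hypothetical names invented for illustration and are not part of the notebook's code; `rng` is an assumed parameter that lets a seeded generator be passed in for repeatable experiments.

```python
import random


def insert_random_whitespace(text, chance=0.1, rng=random):
    """Hypothetical helper: after each character except the last,
    insert a space with probability `chance`."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(text) - 1 and rng.random() < chance:
            out.append(" ")
    return "".join(out)


def strip_all_whitespace(text):
    """Hypothetical helper: remove every whitespace character."""
    return "".join(text.split())


# A seeded generator makes the noisy output repeatable between runs
rng = random.Random(0)
print(insert_random_whitespace("YO24 2XQ", chance=0.3, rng=rng))
print(strip_all_whitespace("9 IRVINE WAY YORK YORK YO24 2XQ UK"))
```

Neither helper adds or removes non-space characters, so stripping whitespace from the noisy string always gives back the stripped original, which is handy when checking an extractor against the known clean postcode.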

In [9]:
def fubar_address(address, chance=0.1, deletions=False, delete_chance=0.1):
    """Return a deliberately messed-up, single-string version of a clean address dict."""
    dirty = address.copy()

    # Step 1 - wrap the postcode and/or country in parentheses or quotation marks
    # (strict "<" so that chance=0 really means "never")
    if random() < chance:
        if random() < 0.5:
            dirty['postcode'] = "(" + dirty['postcode'] + ")"
        else:
            dirty['postcode'] = "'" + dirty['postcode'] + "'"

    if random() < chance:
        if random() < 0.5:
            dirty['country'] = "(" + dirty['country'] + ")"
        else:
            dirty['country'] = "'" + dirty['country'] + "'"

    # Step 2 - choose how far to move the postcode and country from their usual slots
    postcode_shift = 0
    country_shift = 0
    if random() < chance:
        postcode_shift = randrange(-2, 1)  # -2, -1 or 0 (0 leaves it in place)
    if random() < chance:
        country_shift = randrange(-3, 0)   # -3, -2 or -1 (always moves it)

    # Step 3 - shuffle street address tokens (currently with the same chance as the other steps)
    if random() < chance:
        street_address = dirty['street_address'].split(" ")
        shuffle(street_address)
        dirty['street_address'] = " ".join(street_address)

    # Rebuild the address: the two empty strings are placeholders that keep the
    # insert positions stable, and are filtered out afterwards
    dirty_list = [dirty['street_address'], dirty['region'], "", ""]
    dirty_list.insert(2 + postcode_shift, dirty['postcode'])
    dirty_list.insert(3 + country_shift, dirty['country'])
    dirty_list = [x for x in dirty_list if x != ""]

    # Step 4 - append dummy telephone and fax numbers
    if random() < chance:
        dirty_list.append("TEL=987987 001203")

    if random() < chance:
        dirty_list.append("FAX=65465412")

    if deletions:
        # Step 5 - drop entire words at random, convert to string
        dirty_string = " ".join([x for x in " ".join(dirty_list).split() if random() >= delete_chance])

        # Step 6 - randomly delete characters (spaces included)
        dirty_string = "".join([x for x in dirty_string if random() >= delete_chance])

        # Step 7 - second pass over whitespace only, so spaces are deleted more often than other characters
        dirty_string = "".join([x for x in dirty_string if (x != " ") or (random() >= delete_chance)])

    else:
        dirty_string = " ".join(dirty_list)

    return dirty_string
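
The index arithmetic in the rebuild step of `fubar_address` is easier to follow in isolation. The sketch below repeats just that step with placeholder strings instead of real address parts; `place` is a hypothetical name used only for this illustration.

```python
def place(postcode_shift=0, country_shift=0):
    """Hypothetical helper that mirrors the rebuild step of fubar_address."""
    parts = ["STREET", "REGION", "", ""]
    parts.insert(2 + postcode_shift, "POSTCODE")
    parts.insert(3 + country_shift, "COUNTRY")
    return [p for p in parts if p != ""]


# Enumerate every shift combination the function can generate,
# plus the "no shift" defaults
for postcode_shift in (-2, -1, 0):
    for country_shift in (-3, -2, -1, 0):
        print(postcode_shift, country_shift, place(postcode_shift, country_shift))

# place()        -> ['STREET', 'REGION', 'POSTCODE', 'COUNTRY']
# place(-2, -3)  -> ['COUNTRY', 'POSTCODE', 'STREET', 'REGION']
```

Because the country is inserted after the postcode, its final position depends on where the postcode landed, so the two shifts are not independent.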

In [10]:
fubar_address(clean_addresses[3], chance=0.2, deletions=True, delete_chance=0)

"31 QUEENSWAY WORKSOP NOTTINGHAMSHIRE 'S81 0AB' UK"

## Try running addresses through libpostal
And record whether or not the postcode and country are successfully retrieved!

In [11]:
for address in clean_addresses:
    print(address)

    # Mess up the address (right now set to "no interference")
    address['dirty'] = fubar_address(address, chance=0)

    # parse_address returns (value, label) pairs; invert them into a label -> value dict
    # (if a label occurs more than once, only the last value is kept)
    address['parsed'] = {label: value for value, label in parse_address(address['dirty'])}

    # Detect if postcode retrieval worked (.get returns None when the label is missing)
    address['postcode_retrieved'] = address['postcode'].lower() == address['parsed'].get('postcode')

    # Detect if country retrieval worked
    address['country_retrieved'] = address['country'].lower() == address['parsed'].get('country')

    print(address['dirty'])

{'postcode': 'YO24 2XQ', 'region': 'YORK', 'street_address': '9 IRVINE WAY YORK', 'country': 'UK'}
9 IRVINE WAY YORK YORK YO24 2XQ UK
{'postcode': 'PO11 9SE', 'region': 'HAMPSHIRE', 'street_address': '16 TRELOAR ROAD HAYLING ISLAND', 'country': 'P.R.CHINA'}
16 TRELOAR ROAD HAYLING ISLAND HAMPSHIRE PO11 9SE P.R.CHINA
{'postcode': 'HU8 8ES', 'region': 'CITY OF KINGSTON UPON HULL', 'street_address': '8 LABURNUM GROVE HULL', 'country': 'IRAN'}
8 LABURNUM GROVE HULL CITY OF KINGSTON UPON HULL HU8 8ES IRAN
{'postcode': 'S81 0AB', 'region': 'NOTTINGHAMSHIRE', 'street_address': '31 QUEENSWAY WORKSOP', 'country': 'UK'}
31 QUEENSWAY WORKSOP NOTTINGHAMSHIRE S81 0AB UK
{'postcode': 'TQ2 7SU', 'region': 'TORBAY', 'street_address': '105 HERON WAY TORQUAY', 'country': 'TAIWAN CHINA'}
105 HERON WAY TORQUAY TORBAY TQ2 7SU TAIWAN CHINA
{'postcode': 'SY13 4AT', 'region': 'CHESHIRE EAST', 'street_address': 'BURLEYDAM HOUSE WHITCHURCH ROAD BURLEYDAM WHITCHURCH', 'country': 'U.K'}
BURLEYDAM HOUSE WHITCHURCH
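
One caveat with turning the parser output into a plain dict is that a repeated label keeps only its last value. A minimal sketch of an alternative that keeps every value is shown below. The list `example_pairs` is hand-written in the `(value, label)` shape that `parse_address` returns; it is an assumption for illustration, not real libpostal output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_label(parsed_pairs):
    """Hypothetical helper: collect all values for each label, in order."""
    grouped = defaultdict(list)
    for value, label in parsed_pairs:
        grouped[label].append(value)
    return dict(grouped)


# Hand-written example in the (value, label) shape; NOT real libpostal output
example_pairs = [
    ("9", "house_number"),
    ("irvine way", "road"),
    ("york", "city"),
    ("york", "city"),
    ("yo24 2xq", "postcode"),
    ("uk", "country"),
]

grouped = group_by_label(example_pairs)
print(grouped["city"])      # both "york" entries survive
print(grouped["postcode"])
```

Keeping lists makes it possible to spot cases where the parser splits one component into several pieces with the same label.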

## Examine those where postcode is not retrieved

In [12]:
for address in clean_addresses:
    if not address['postcode_retrieved']:
        print(address['street_address'], address['region'], address['postcode'], address['country'])
        print(address['dirty'])
        # Show whatever libpostal labelled as the postcode, if anything
        print(address['parsed'].get('postcode', "no postcode retrieved"))

105 HERON WAY TORQUAY TORBAY TQ2 7SU TAIWAN CHINA
105 HERON WAY TORQUAY TORBAY TQ2 7SU TAIWAN CHINA
tq2
1 RHODFA FLINT BODELWYDDAN RHYL DENBIGHSHIRE LL18 5WP U.K
1 RHODFA FLINT BODELWYDDAN RHYL DENBIGHSHIRE LL18 5WP U.K
ll18
45 GLYNFACH ROAD PORTH RHONDDA CYNON TAFF CF39 9LF CHINA
45 GLYNFACH ROAD PORTH RHONDDA CYNON TAFF CF39 9LF CHINA
cf39
ALBORAN APARTMENTS, 1 FLAT 302 SEVEN SEA GARDENS LONDON GREATER LONDON E3 3GU UK
ALBORAN APARTMENTS, 1 FLAT 302 SEVEN SEA GARDENS LONDON GREATER LONDON E3 3GU UK
3gu
4 STATION HALT SWINDON SWINDON SN3 4GL P.R.CHINA
4 STATION HALT SWINDON SWINDON SN3 4GL P.R.CHINA
no postcode retrieved
100 BROOK ROAD BUCKHURST HILL ESSEX IG9 5FE TAIWAN CHINA
100 BROOK ROAD BUCKHURST HILL ESSEX IG9 5FE TAIWAN CHINA
ig9
16 BOSGATE CLOSE BOZEAT WELLINGBOROUGH NORTHAMPTONSHIRE NN29 7JS TAIWAN CHINA
16 BOSGATE CLOSE BOZEAT WELLINGBOROUGH NORTHAMPTONSHIRE NN29 7JS TAIWAN CHINA
nn29
SKY APARTMENTS FLAT 42 HOMERTON ROAD LONDON GREATER LONDON E9 5FA TAIWAN CHINA
SKY APARTMEN
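
Most of the failures printed above are partial rather than total: the parser returns only the first half of the postcode (`tq2`, `ll18`) or only the second half (`3gu`). A simple equality check hides that difference. The sketch below is one possible way to classify the outcome; `classify_postcode_match` and its category names are invented for this illustration, and it assumes the second half of a UK postcode is always the last three characters.

```python
def classify_postcode_match(expected, parsed):
    """Hypothetical helper: describe how a parsed postcode relates to the expected one."""
    if not parsed:
        return "none"

    # Compare without whitespace or case differences
    expected_norm = "".join(expected.split()).lower()
    parsed_norm = "".join(parsed.split()).lower()

    if parsed_norm == expected_norm:
        return "full"

    # Split the expected postcode into its first part and its final three characters
    outward, inward = expected_norm[:-3], expected_norm[-3:]
    if parsed_norm == outward:
        return "outward_only"
    if parsed_norm == inward:
        return "inward_only"
    return "mismatch"


# Cases taken from the output above
print(classify_postcode_match("TQ2 7SU", "tq2"))   # outward_only
print(classify_postcode_match("E3 3GU", "3gu"))    # inward_only
print(classify_postcode_match("SN3 4GL", None))    # none
```

Counting the categories over the whole dataset would show whether the parser tends to lose the second half of the postcode or miss it entirely, which in turn suggests whether a regex fallback is worth adding.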