# Clean and Extract Address Components
With a particular focus on postcodes and countries.  For this project the tools we have are:
- spaCy (POS tagging)
- libpostal (which should solve this problem automatically, but has real trouble with misordered addresses, which is why we're doing this)
- WordNinja (splits strings with no whitespace into words using a probabilistic model trained on Wikipedia)
- regex (of course!  In this case an extremely complicated one based on the UK postcode format)
- Any business rules we can dream up

To test how these packages might be used, we've got a dataset of mixed scraped and fake American addresses, embedded in nonsense text, described and available at https://onethinglab.com/2018/03/05/extracting-addresses-from-text/.  This isn't ideal because it's quite dirty: the real problem is explicitly entered addresses that people have mucked up while entering them.

I've also gotten the Companies House free dataset of UK registered company addresses.  These are structured, which means I can creatively "deconstruct" them to create dirty messy human-entered addresses.  A summary of the fields included can be found at http://resources.companieshouse.gov.uk/toolsToHelp/pdf/freeDataProductDataset.pdf.

Finally, a smaller dataset of addresses is available from HM Land Registry: prices paid for properties (1995 - 2012, no idea why that date window).  It can be downloaded from https://data.gov.uk/dataset/314f77b3-e702-4545-8bcb-9ef8262ea0fd/archived-price-paid-information-residential-property-1995-to-2012 and can be abused in the same way as the Companies House (CH) dataset above.  It has the advantage of being small enough to fit in memory, so I don't have to sub-sample to get something to experiment with.

### To Do
- Postcodes with space between first letters and first numbers
- Spaces before and after postcode removed
- TEL/FAX/CONTACT NUMBER/EMAIL stripping
- Punctuation only separator between country and rest of sentence

### Results notes
- regex better than libpostal at getting UK postcodes BUT libpostal does at least get the area code of the postcode (when applied to clean data), which narrows area to sub-city.  Worth accounting for.
- Libpostal is easily confused by telephone and fax numbers accidentally included.  A regex that removes those would be worth developing.
- Libpostal doesn't seem to care much about parentheses, guess it filters those itself
- Libpostal gets the postcode more easily if it's at the end of the address, as expected.  If it's at the front, libpostal mistakes it for a house number.  Suggested solution: reorder the string with all mixed letter-number tokens shifted to the end (retaining their order) before attempting to retrieve the postcode.
- The postcode regex doesn't depend on whitespace, so maybe use it on a version of the address with the whitespace stripped out!  For normalisation before sending to BigMatch, strip all whitespace from postcodes field as well so no matter how they're retrieved they end up looking the same?
- For UK postcodes the only situation in which libpostal outperforms the regex is when there are missing characters (it can often then at least get the area code).
- It may well be worth trying libpostal's parse_address on different orderings to retrieve postcode and country, before removing the detected addresses/postcodes from the strings and re-parsing the remainder of the address.
- A country abbreviation ("UK" rather than "United Kingdom") is detected if the punctuation is removed ("U.K" -> "UK").
- Specific problem with "P.R.China": the system will recognise "China", but not when it appears as "P.R.China" or "PRChina".  Try parsing with punctuation removed OR substituted with a space each time?  Idea: substitute punctuation with spaces, then join any sequential isolated letters ("U.K" -> "U K" -> "UK", "P.R.China" -> "P R China" -> "PR China").  Solves for initials and for letters prepended to country names.  Should leave weird London postcodes ("X5 Y 3B") undamaged.

- A few of these suggestions essentially involve running libpostal over multiple variants.  I should check if there's a computational cost.
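The normalisation idea in the notes above (strip all whitespace from the postcode field so every retrieval route produces the same string) can be sketched as follows. This is a minimal illustration: the helper name `normalise_postcode` is hypothetical, and dropping punctuation as well as whitespace is an assumption that goes slightly beyond the note.

```python
import re


def normalise_postcode(raw):
    """Upper-case a postcode and drop everything except letters and digits."""
    # Assumption: stray punctuation (quotes, parentheses) should go as well as whitespace.
    return re.sub(r"[^0-9A-Z]", "", raw.upper())


# Differently formatted versions of one postcode collapse to the same key.
print(normalise_postcode("(so14 3hr)"))
print(normalise_postcode("SO1 4 3HR"))
```

Applying the same function to the regex result and to the libpostal result would make the two directly comparable, whichever route found the postcode.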

### Further suggestions
- Use ONS's business name buzzword cull list to strip out company name components that might mess things up (or take everything before a given buzzword and call it a company name?)
- Does BXM code treat house number and unit as both part of the start of a street address?
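The "take everything before a given buzzword and call it a company name" suggestion might look something like the sketch below. The buzzword tuple is a hypothetical stand-in (the ONS cull list is not reproduced here), and `split_company_name` is an illustrative name rather than anything that exists in the project.

```python
import re

# Hypothetical stand-in for the ONS buzzword list.
BUZZWORDS = ("LIMITED", "LTD", "PLC", "LLP")


def split_company_name(text, buzzwords=BUZZWORDS):
    """Return (company_name, remainder), splitting just after the first buzzword."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in buzzwords) + r")\b\.?"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        # No buzzword found: assume there is no company name to peel off.
        return "", text.strip()
    return text[:match.end()].strip(), text[match.end():].strip()


print(split_company_name("ACME WIDGETS LTD 3 MONTE WAY SO14 3HR"))
```

The remainder could then be passed to libpostal without the company name tokens getting in the way.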

In [1]:
import spacy
import wordninja
import re
import pickle

import numpy as np
import pandas as pd

from random import sample
from random import random
from random import randrange
from random import shuffle
from postal.parser import parse_address  # Python wrapper for libpostal, which is a C library

In [2]:
# Regex courtesy of commenter "borrible" on the Stack Overflow thread "https://stackoverflow.com/questions/6409948/php-extract-uk-postal-code-and-validate-it"
postcode_regex = re.compile(r"((GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) ?[0-9][ABD-HJLNP-UW-Z]{2}))")

In [3]:
#with open("./data/IOB_tagged_addresses.pkl", "rb") as f:
#    data = pickle.load(f)

## 0. Fetch Clean Addresses
From the Land Registry dataset.  It has all the components, accidental repeats, and valid UK postcodes!  The only thing it lacks is a country, which I will create by adding one at random from a list of example country names/designations (despite the fact that the addresses are all in the UK).  I don't actually know whether there are any particular problems with this dataset.

In [4]:
countries = ['UK', 'U.K', 'U.K', 'CHINA', 'P.R.CHINA', 'TAIWAN CHINA', 'INDIA', 'IRAN']

In [5]:
lr_dat = pd.read_csv("./data/pp-2012-part1.csv", header=None, nrows=500)

In [6]:
lr_dat.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,{BD0D075D-7818-47B3-9657-7651CBD02219},155000,2012-09-28 00:00,YO24 2XQ,S,N,F,9,,IRVINE WAY,,YORK,YORK,YORK,A,A
1,{0FF70080-C8EA-4DDC-9C19-7651D9B540CC},264000,2012-07-18 00:00,PO11 9SE,D,N,F,16,,TRELOAR ROAD,,HAYLING ISLAND,HAVANT,HAMPSHIRE,A,A
2,{80CC2177-EDBE-4ABB-9EAA-7651DAA0E4EE},20000,2012-04-25 00:00,HU8 8ES,T,N,F,8,,LABURNUM GROVE,,HULL,CITY OF KINGSTON UPON HULL,CITY OF KINGSTON UPON HULL,A,A
3,{076B2969-50D6-4659-8956-7651ECAAB904},89950,2012-10-25 00:00,S81 0AB,S,N,F,31,,QUEENSWAY,,WORKSOP,BASSETLAW,NOTTINGHAMSHIRE,A,A
4,{9A6B1BBC-3522-4B44-9FD6-765202F73E28},168000,2012-04-13 00:00,TQ2 7SU,S,N,F,105,,HERON WAY,,TORQUAY,TORBAY,TORBAY,A,A


In [7]:
clean_addresses = []

# Iterate through the rows and pull out the address components.
# The file has no header; judging by the head() output above, column 3 is the
# postcode, columns 7-11 run from house number/name to town, and 13 is the county.
for index, row in lr_dat.iterrows():
    address = {}
    address['postcode'] = "" if pd.isna(row[3]) else str(row[3])
    address['region'] = "" if pd.isna(row[13]) else str(row[13])
    address['street_address'] = " ".join([str(x) for x in row.iloc[7:12] if not pd.isna(x)])
    address['country'] = sample(countries, 1)[0]  # random country label
    
    clean_addresses.append(address)

In [8]:
clean_addresses

[{'country': 'IRAN',
  'postcode': 'YO24 2XQ',
  'region': 'YORK',
  'street_address': '9 IRVINE WAY YORK'},
 {'country': 'U.K',
  'postcode': 'PO11 9SE',
  'region': 'HAMPSHIRE',
  'street_address': '16 TRELOAR ROAD HAYLING ISLAND'},
 {'country': 'IRAN',
  'postcode': 'HU8 8ES',
  'region': 'CITY OF KINGSTON UPON HULL',
  'street_address': '8 LABURNUM GROVE HULL'},
 {'country': 'P.R.CHINA',
  'postcode': 'S81 0AB',
  'region': 'NOTTINGHAMSHIRE',
  'street_address': '31 QUEENSWAY WORKSOP'},
 {'country': 'TAIWAN CHINA',
  'postcode': 'TQ2 7SU',
  'region': 'TORBAY',
  'street_address': '105 HERON WAY TORQUAY'},
 {'country': 'U.K',
  'postcode': 'SY13 4AT',
  'region': 'CHESHIRE EAST',
  'street_address': 'BURLEYDAM HOUSE WHITCHURCH ROAD BURLEYDAM WHITCHURCH'},
 {'country': 'U.K',
  'postcode': 'ME14 5HW',
  'region': 'KENT',
  'street_address': 'PARK HOUSE FLAT 21 PARK AVENUE MAIDSTONE'},
 {'country': 'UK',
  'postcode': 'CH5 4DF',
  'region': 'FLINTSHIRE',
  'street_address': '98 HIGH 

## Build a Method to Turn Them into Dirty Addresses

Mistakes to generate:
- randomly missed whitespaces
- randomly inserted whitespaces (esp. in postcode)
- postcodes moved within address
- country moved within address
- postcode or country with parentheses around them (and often in the wrong place!)
- postcode or country with quotation marks around them
- all whitespace missing (special for post code extraction)
- "TEL=XXX..." or "FAX=XXX" with or without equals sign (might be space)
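Of the mistakes listed above, random whitespace insertion does not appear to be covered by `fubar_address` in the next cell. A minimal sketch of how it could be generated is below; the function name and the optional `rng` argument (used only to make runs repeatable) are assumptions for illustration.

```python
from random import Random


def insert_random_whitespace(text, chance=0.1, rng=None):
    """Insert a space between adjacent non-space characters with probability `chance`."""
    if rng is None:
        rng = Random()
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        is_last = i == len(text) - 1
        # Only split inside tokens; never double up an existing space.
        if ch != " " and not is_last and text[i + 1] != " " and rng.random() < chance:
            out.append(" ")
    return "".join(out)


# A seeded generator gives the same "mistakes" on every run.
print(insert_random_whitespace("YO24 2XQ", chance=0.3, rng=Random(0)))
```

Applying it to `dirty['postcode']` alone, with a higher `chance`, would give the "especially in the postcode" behaviour from the list.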

In [9]:
def fubar_address(address, chance=0.1, deletions=False, delete_chance=0.1):
    dirty = address.copy()
    
    # Step 1 - add parentheses or quotation marks (country, or postcode)
    if random() <= chance:
        if random() < 0.5:
            dirty['postcode'] = "(" + dirty['postcode'] + ")"
        else:
            dirty['postcode'] = "'" + dirty['postcode'] + "'"
    
    if random() <= chance:
        if random() < 0.5:
            dirty['country'] = "(" + dirty['country'] + ")"
        else:
            dirty['country'] = "'" + dirty['country'] + "'"
    
    # Step 2 - shuffle components (country or postcode)
    postcode_shift=0
    country_shift=0
    if random() <= chance:
        postcode_shift = randrange(-2, 1)
    if random() <= chance:
        country_shift = randrange(-3, 0)

    # Step 3 - shuffle street address tokens (note: currently uses the same `chance` as the other steps)
    if random() <= chance:
        street_address = dirty['street_address'].split(" ")
        shuffle(street_address)
        dirty['street_address'] = " ".join(street_address)
    
    # Rebuild the address
    dirty_list = [dirty['street_address'], dirty['region'], "", ""]
    dirty_list.insert(2+postcode_shift, dirty['postcode'])
    dirty_list.insert(3+country_shift, dirty['country'])
    dirty_list = [x for x in dirty_list if x!=""]

    # Step 4 - Append 'TEL=987987 001203'
    if random() <= chance:
        dirty_list.append("TEL=987987 001203")
        
    if random() <= chance:
        dirty_list.append("FAX=65465412")
        
    # Step 5 - drop entire words at random, convert to string
    if deletions:
        dirty_string = " ".join([x for x in " ".join(dirty_list).split() if random() >= delete_chance / 3.0])
        
        # Step 6 - randomly delete characters (with lesser probability)
        dirty_string = "".join([x for x in dirty_string if random() >= delete_chance / 10.0])
        
        # Step 7 - randomly delete whitespace
        dirty_string = "".join([x for x in dirty_string if (x != " ") or (random() >= delete_chance)])
        
    else:
        dirty_string = " ".join(dirty_list)
    
    return dirty_string

In [10]:
fubar_address(clean_addresses[3], chance=0.2, deletions=True, delete_chance=0)

'31 QUEENSWAY WORKSOP S81 0AB NOTTINGHAMSHIRE P.R.CHINA TEL=987987 001203'

## 1.  Improving Retrieval of Postcodes (starting with clean, unaltered addresses)
A regex pattern exists for UK postcodes at least, so we'll try that first.  Another option is to reorder the tokens in the string to make retrieval by libpostal easier.
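As a rough illustration of the regex route, the sketch below wraps the same pattern as `postcode_regex` in a small helper that returns the first match or `None`. The helper name `extract_postcode` is hypothetical; the pattern is repeated only so the block runs on its own.

```python
import re

# Same pattern as `postcode_regex` defined earlier in the notebook.
POSTCODE_RE = re.compile(r"((GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) ?[0-9][ABD-HJLNP-UW-Z]{2}))")


def extract_postcode(text):
    """Return the first UK-postcode-shaped substring of `text`, or None."""
    match = POSTCODE_RE.search(text.upper())
    return match.group(0) if match else None


print(extract_postcode("8 LABURNUM GROVE HULL CITY OF KINGSTON UPON HULL 'HU8 8ES' 'IRAN'"))
print(extract_postcode("3 MONTE WAY"))
```

Because the pattern is unanchored, quotes and parentheses around the postcode do not get in the way, but it will also accept any other substring of the right shape.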

In [11]:
for address in clean_addresses:
    print(address)
    
    # Mess up the address (chance=0.3 per mistake type, no deletions)
    address['dirty'] = fubar_address(address, chance=0.3, deletions=False, delete_chance=0.1)
    
    # Try to parse the address (libpostal returns (value, label) pairs)
    address['parsed'] = {y:x for x, y in parse_address(address['dirty'])}
    
    # Detect if postcode retrieval worked
    try:
        address['postcode_retrieved'] = address['postcode'].lower() == address['parsed']['postcode']
    except KeyError:
        # libpostal found no postcode component at all
        address['postcode_retrieved'] = False

{'postcode': 'YO24 2XQ', 'region': 'YORK', 'street_address': '9 IRVINE WAY YORK', 'country': 'IRAN'}
{'postcode': 'PO11 9SE', 'region': 'HAMPSHIRE', 'street_address': '16 TRELOAR ROAD HAYLING ISLAND', 'country': 'U.K'}
{'postcode': 'HU8 8ES', 'region': 'CITY OF KINGSTON UPON HULL', 'street_address': '8 LABURNUM GROVE HULL', 'country': 'IRAN'}
{'postcode': 'S81 0AB', 'region': 'NOTTINGHAMSHIRE', 'street_address': '31 QUEENSWAY WORKSOP', 'country': 'P.R.CHINA'}
{'postcode': 'TQ2 7SU', 'region': 'TORBAY', 'street_address': '105 HERON WAY TORQUAY', 'country': 'TAIWAN CHINA'}
{'postcode': 'SY13 4AT', 'region': 'CHESHIRE EAST', 'street_address': 'BURLEYDAM HOUSE WHITCHURCH ROAD BURLEYDAM WHITCHURCH', 'country': 'U.K'}
{'postcode': 'ME14 5HW', 'region': 'KENT', 'street_address': 'PARK HOUSE FLAT 21 PARK AVENUE MAIDSTONE', 'country': 'U.K'}
{'postcode': 'CH5 4DF', 'region': 'FLINTSHIRE', 'street_address': '98 HIGH STREET CONNAHS QUAY DEESIDE', 'country': 'UK'}
{'postcode': 'PO9 2RR', 'region':

{'postcode': 'HP7 0BG', 'region': 'BUCKINGHAMSHIRE', 'street_address': '31 STATION ROAD AMERSHAM', 'country': 'TAIWAN CHINA'}
{'postcode': 'PE2 9RP', 'region': 'CITY OF PETERBOROUGH', 'street_address': '12 VOKES STREET PETERBOROUGH', 'country': 'IRAN'}
{'postcode': 'PO18 0SQ', 'region': 'WEST SUSSEX', 'street_address': '7 LILLYWHITE ROAD WESTHAMPNETT CHICHESTER', 'country': 'TAIWAN CHINA'}
{'postcode': 'BH9 1TN', 'region': 'BOURNEMOUTH', 'street_address': '17A CORONATION AVENUE MOORDOWN BOURNEMOUTH', 'country': 'U.K'}
{'postcode': 'LS25 2BB', 'region': 'WEST YORKSHIRE', 'street_address': '22 SEVERN DRIVE GARFORTH LEEDS', 'country': 'U.K'}
{'postcode': 'HP13 7HR', 'region': 'BUCKINGHAMSHIRE', 'street_address': '161 HERBERT ROAD HIGH WYCOMBE', 'country': 'UK'}
{'postcode': 'LE19 2RA', 'region': 'LEICESTERSHIRE', 'street_address': '3 MCDOWELL WAY NARBOROUGH LEICESTER', 'country': 'CHINA'}
{'postcode': 'NW9 7NB', 'region': 'GREATER LONDON', 'street_address': '61 COOL OAK LANE LONDON', 'cou

{'postcode': 'SP10 5HA', 'region': 'HAMPSHIRE', 'street_address': '39 SPEY COURT ANDOVER', 'country': 'UK'}
{'postcode': 'NP13 3JB', 'region': 'BLAENAU GWENT', 'street_address': '26 TANGLEWOOD DRIVE BLAINA ABERTILLERY', 'country': 'U.K'}
{'postcode': 'BS21 7AZ', 'region': 'NORTH SOMERSET', 'street_address': 'HOLME VIEW WALTON BAY CLEVEDON', 'country': 'U.K'}
{'postcode': 'PL3 4HQ', 'region': 'CITY OF PLYMOUTH', 'street_address': '133 ALMA ROAD PLYMOUTH', 'country': 'CHINA'}
{'postcode': 'CM8 1SZ', 'region': 'ESSEX', 'street_address': '9 MORTIMER WAY WITHAM', 'country': 'TAIWAN CHINA'}
{'postcode': 'CB6 1FN', 'region': 'CAMBRIDGESHIRE', 'street_address': '3 LUPINS CLOSE LITTLEPORT ELY', 'country': 'U.K'}
{'postcode': 'CH2 2RA', 'region': 'CHESHIRE WEST AND CHESTER', 'street_address': '25 THE HEYWOODS CHESTER', 'country': 'P.R.CHINA'}
{'postcode': 'DN4 9QX', 'region': 'SOUTH YORKSHIRE', 'street_address': '119 SHEFFIELD ROAD WARMSWORTH DONCASTER', 'country': 'P.R.CHINA'}
{'postcode': 'EN9

### Testing Effect of Ordering of Components on Postcode Detection by libpostal

In [12]:
def shift_mixed_tokens(string):
    """
    Moves any tokens in a string with mixed numbers and letters to the end of the string.
    The libpostal library should be pretty tolerant of other address components that end
    up shifted so long as they aren't the last in the string.
    """
    is_mixed = []
    isnt_mixed = []
    
    # Split on whitespace, default assumption
    for each in string.upper().split():
        
        # if token contains both letters AND numbers
        if re.search("[0-9]+", each) and re.search("[A-Z]+", each):
            is_mixed.append(each)
        else:
            isnt_mixed.append(each)
            
    return(" ".join(isnt_mixed + is_mixed))

In [13]:
test = "TEL077891 (SO13 2BH) FLAT 3 ARDVARK AVENUE SOTON UK"
result = shift_mixed_tokens((test))
result

'FLAT 3 ARDVARK AVENUE SOTON UK TEL077891 (SO13 2BH)'

In [14]:
parse_address(result)

[('flat 3', 'unit'),
 ('ardvark avenue', 'road'),
 ('soton', 'city'),
 ('uk', 'country'),
 ('tel077891 so13 2bh', 'postcode')]

In [15]:
parse_address(test)

[('tel077891 so13 2bh', 'house'),
 ('flat 3', 'house_number'),
 ('ardvark avenue', 'road'),
 ('soton', 'city'),
 ('uk', 'country')]

### Examine those where postcode is not retrieved by Default

In [16]:
failed_addresses = []
for address in clean_addresses:
    if address['postcode_retrieved'] == False:
        
        if 'postcode' not in address['parsed']:
            address['parsed']['postcode'] = None
            
        # Try using the regex to get the postcode instead
        try:
            address['postcode_regexed'] = re.findall(postcode_regex, address['dirty'].upper())[0][0]
        except:
            address['postcode_regexed'] = None
        
        # Try reordering and then parsing
        reordered = shift_mixed_tokens(address['dirty'])
        reparsed = {y:x for x, y in parse_address(reordered)}
        try:
            address['postcode_shifted'] = reparsed['postcode'].upper()
        except:
            address['postcode_shifted'] = None
        
        failed_addresses.append(address)

In [17]:
print("Postcode | Libpostal_found | Regex_found | Shifted_found | Dirty_address")
for address in failed_addresses:
    print(address['postcode'],
          " | ",
          address['parsed']['postcode'],
          " | ",
          address['postcode_regexed'], 
          " | ",
          address['postcode_shifted'],
          " - ",
          address['dirty'])

Postcode | Libpostal_found | Regex_found | Shifted_found | Dirty_address
HU8 8ES  |  None  |  HU8 8ES  |  None  -  8 LABURNUM GROVE HULL CITY OF KINGSTON UPON HULL 'HU8 8ES' 'IRAN'
TQ2 7SU  |  tq2  |  TQ2 7SU  |  None  -  105 HERON WAY TORQUAY TORBAY (TQ2 7SU) 'TAIWAN CHINA'
PO9 2RR  |  None  |  PO9 2RR  |  001203  -  1 SOUTHLEIGH ROAD HAVANT 'PO9 2RR' HAMPSHIRE INDIA TEL=987987 001203
NG17 8PT  |  None  |  NG17 8PT  |  None  -  NG17 8PT 3 LILAC GROVE KIRKBY IN ASHFIELD NOTTINGHAM NOTTINGHAMSHIRE (CHINA)
SN12 6FU  |  None  |  SN12 6FU  |  None  -  20 SAXIFRAGE BANK MELKSHAM WILTSHIRE 'SN12 6FU' U.K
LL18 5WP  |  ll18  |  LL18 5WP  |  None  -  1 RHODFA FLINT BODELWYDDAN RHYL DENBIGHSHIRE (LL18 5WP) (CHINA) TEL=987987 001203
BS16 1FN  |  bs16  |  BS16 1FN  |  BS16  -  GIFFORD LOCKY 5 BRISTOL STOKE CLOSE LITTLE SOUTH GLOUCESTERSHIRE INDIA BS16 1FN FAX=65465412
DL3 9QD  |  dl3  |  DL3 9QD  |  DL3 9QD FAX=65465412  -  GROVE LAZENBY DARLINGTON 11 DL3 9QD IRAN DARLINGTON FAX=65465412
YO24 4DG 

### Test postcode retrieval without spaces
The postcode regex doesn't depend on whitespace, so here both it and libpostal are run on addresses with every space removed.
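Stripping whitespace can create spurious matches: the output further down includes `ES810AB`, picked up from "...SHIRE S81 0AB". One possible way to keep the real postcode in play is to collect every overlapping candidate with a lookahead, so that a later rule can choose between them. This is a sketch under that assumption; `postcode_candidates` is a hypothetical name and the pattern is the same one as `postcode_regex`.

```python
import re

# Same pattern as `postcode_regex`, wrapped in a lookahead so matches may overlap.
POSTCODE_PATTERN = r"((GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) ?[0-9][ABD-HJLNP-UW-Z]{2}))"
OVERLAPPING_RE = re.compile(r"(?=" + POSTCODE_PATTERN + r")")


def postcode_candidates(text):
    """List every postcode-shaped substring of the whitespace-stripped text."""
    stripped = re.sub(r"\s+", "", text.upper())
    # group(1) is the outermost capture group of the original pattern.
    return [m.group(1) for m in OVERLAPPING_RE.finditer(stripped)]


print(postcode_candidates("31 QUEENSWAY WORKSOP NOTTINGHAMSHIRE S81 0AB P.R.CHINA"))
```

A follow-up rule might, for example, prefer a candidate that also matches in the original spaced string, falling back to the stripped candidates only when nothing else is found.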

In [34]:
for address in failed_addresses:
    address['parsed_stripped'] = parse_address(address['dirty'].replace(" ", ""))
    print(address['dirty'].replace(" ", ""))
    print(address['parsed_stripped'])
    try:
        print(re.findall(postcode_regex, address['dirty'].replace(" ", ""))[0][0])
    except:
        pass
    
    print("\n")

16TRELOARROADHAYLINGISLANDHAMPSHIREPO119SEU.K
[('16treloarroadhaylingislandhampshirepo119seu.k', 'house')]
PO119SE


31QUEENSWAYWORKSOPNOTTINGHAMSHIRES810ABP.R.CHINA
[('31queenswayworksopnottinghamshires810abp.r.china', 'house')]
ES810AB


105HERONWAYTORQUAYTORBAYTQ27SUTAIWANCHINA
[('105heronwaytorquaytorbaytq27sutaiwanchina', 'house')]
TQ27SU


BURLEYDAMHOUSEWHITCHURCHROADBURLEYDAMWHITCHURCHCHESHIREEASTSY134ATU.K
[('burleydamhousewhitchurchroadburleydamwhitchurchcheshireeastsy134atu.k', 'house')]
SY134AT


PARKHOUSEFLAT21PARKAVENUEMAIDSTONEKENTME145HWU.K
[('parkhouseflat21parkavenuemaidstonekentme145hwu.k', 'house')]
AT21PA


20SAXIFRAGEBANKMELKSHAMWILTSHIRESN126FUU.K
[('20saxifragebankmelkshamwiltshiresn126fuu.k', 'house')]
SN126FU


31CASTLESTREETFRAMLINGHAMWOODBRIDGESUFFOLKIP139BPTAIWANCHINA
[('31castlestreetframlinghamwoodbridgesuffolkip139bptaiwanchina', 'house')]
IP139BP


1STANNSBARKINGGREATERLONDONIG117ALU.K
[('1stannsbarkinggreaterlondonig117alu.k', 'house')]
IG117AL


29PRIN

## 2. Improving Retrieval of Country Names
For the most part, libpostal expects country names or abbreviations to be at the end of the string.  Libpostal also doesn't deal well with punctuation.  I'd actually recommend removing everything except letters, numbers and spaces before parsing any of this stuff.
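Since removing punctuation and replacing it with a space can give different results ("U.K" becomes "UK" or "U K"), one option is to generate both variants and try each in turn. The sketch below only builds the variants; `punctuation_variants` is a hypothetical helper, and passing each variant to `parse_address` is left to the caller.

```python
import re


def punctuation_variants(text):
    """Return distinct versions of `text`: as-is, punctuation removed, punctuation spaced."""
    removed = " ".join(re.sub(r"[^0-9A-Za-z ]", "", text).split())
    spaced = " ".join(re.sub(r"[^0-9A-Za-z ]", " ", text).split())
    variants = []
    for candidate in (text, removed, spaced):
        # Keep the original order but skip duplicates.
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(punctuation_variants("S81 0AB P.R.CHINA"))
```

Addresses without punctuation yield a single variant, so the extra parsing cost is only paid where it might help.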

In [18]:
for address in clean_addresses:
    
    # Mess up the address (right now set to "no interference")
    address['dirty'] = fubar_address(address, chance=0)
    
    # Try to parse the address
    address['parsed'] = {y:x for x, y in parse_address(address['dirty'])}
        
    # Detect if country retrieval worked
    try:
        address['country_retrieved'] = address['country'].lower() == address['parsed']['country']
    except:
        address['country_retrieved'] = False
    
    print(address['dirty'])

9 IRVINE WAY YORK YORK YO24 2XQ IRAN
16 TRELOAR ROAD HAYLING ISLAND HAMPSHIRE PO11 9SE U.K
8 LABURNUM GROVE HULL CITY OF KINGSTON UPON HULL HU8 8ES IRAN
31 QUEENSWAY WORKSOP NOTTINGHAMSHIRE S81 0AB P.R.CHINA
105 HERON WAY TORQUAY TORBAY TQ2 7SU TAIWAN CHINA
BURLEYDAM HOUSE WHITCHURCH ROAD BURLEYDAM WHITCHURCH CHESHIRE EAST SY13 4AT U.K
PARK HOUSE FLAT 21 PARK AVENUE MAIDSTONE KENT ME14 5HW U.K
98 HIGH STREET CONNAHS QUAY DEESIDE FLINTSHIRE CH5 4DF UK
1 SOUTHLEIGH ROAD HAVANT HAMPSHIRE PO9 2RR INDIA
224 FILTON AVENUE HORFIELD BRISTOL CITY OF BRISTOL BS7 0AZ CHINA
LEY CROSSING BUNGALOW LEY LANE MINSTERWORTH GLOUCESTER GLOUCESTERSHIRE GL2 8JU IRAN
222 GORE ROAD NEW MILTON HAMPSHIRE BH25 5NQ CHINA
11A HILLVIEW ROAD CARLTON NOTTINGHAM NOTTINGHAMSHIRE NG4 1JX IRAN
3 LILAC GROVE KIRKBY IN ASHFIELD NOTTINGHAM NOTTINGHAMSHIRE NG17 8PT CHINA
20 SAXIFRAGE BANK MELKSHAM WILTSHIRE SN12 6FU U.K
1 RHODFA FLINT BODELWYDDAN RHYL DENBIGHSHIRE LL18 5WP CHINA
5 LITTLE LOCKY CLOSE STOKE GIFFORD BRISTOL SOU

38 MORRIS DRIVE BILLINGSHURST WEST SUSSEX RH14 9SF UK
4 STONELEIGH CLOSE NEWTON ABBOT DEVON TQ12 1QZ IRAN
18 MANOR ROAD POTTERS BAR HERTFORDSHIRE EN6 1DG INDIA
HILL HOUSE THE CRESCENT ROMSEY HAMPSHIRE SO51 7NG INDIA
40 HAYWAIN CLOSE SWINDON SWINDON SN25 4AB U.K
16 VINE STREET ALDERSHOT HAMPSHIRE GU11 3EU U.K
38 HILLSIDE PARK BARGOED CAERPHILLY CF81 8NL U.K
2 WENTWORTH STREET MIDDLESBROUGH MIDDLESBROUGH TS1 4ET U.K
79 DERWENT CRESCENT KETTERING NORTHAMPTONSHIRE NN16 8UH U.K
THE LONE PINE THE STREET CULFORD BURY ST EDMUNDS SUFFOLK IP28 6DN UK
12 POPLAR FARM LANE FARSLEY PUDSEY WEST YORKSHIRE LS28 5FF U.K
35 ST MARYS CRESCENT STANWELL STAINES SURREY TW19 7HY INDIA
19 BROAD ING CRESCENT KENDAL CUMBRIA LA9 6HA CHINA
36 WESTFIELD NEW ASH GREEN LONGFIELD KENT DA3 8QW IRAN
28 IVYDALE AVENUE SHELDON BIRMINGHAM WEST MIDLANDS B26 3SL P.R.CHINA
26 NORTHUMBERLAND STREET WIGAN GREATER MANCHESTER WN1 3PZ TAIWAN CHINA
31 CROSBY GARDENS NORTHALLERTON NORTH YORKSHIRE DL6 1AR TAIWAN CHINA
46 JAVA CRESCEN

18 NURSERY FIELDS SAWBRIDGEWORTH HERTFORDSHIRE CM21 0DH CHINA
8 NYMET AVENUE BOW CREDITON DEVON EX17 6LT UK
26 HIGH FIRS ROAD ROMSEY HAMPSHIRE SO51 5PZ TAIWAN CHINA
LANGLAND MANSIONS, 228 FLAT 8 FINCHLEY ROAD LONDON GREATER LONDON NW3 6QA INDIA
2 LON GADLAS ABERGELE CONWY LL22 7ET CHINA
55 BRIDGEWATER DRIVE NORTHAMPTON NORTHAMPTONSHIRE NN3 3AF U.K
21 DELL RISE PARK STREET ST ALBANS HERTFORDSHIRE AL2 2QJ INDIA
60 WEYMOUTH DRIVE HOUGHTON LE SPRING TYNE AND WEAR DH4 7TQ TAIWAN CHINA
97 DUGDALE HILL LANE POTTERS BAR HERTFORDSHIRE EN6 2DR P.R.CHINA
TREFOREST, 92 WATCHOUSE ROAD CHELMSFORD ESSEX CM2 8NH TAIWAN CHINA
16 RIDGE GROVE WHITEFIELD MANCHESTER GREATER MANCHESTER M45 8FE U.K
24 PECKLETON LANE DESFORD LEICESTER LEICESTERSHIRE LE9 9JU IRAN
23 VICTORIA STREET GOLDTHORPE ROTHERHAM SOUTH YORKSHIRE S63 9HS INDIA
20 CASHIO LANE LETCHWORTH GARDEN CITY HERTFORDSHIRE SG6 1AX INDIA
25 LAUREL ROAD BLABY LEICESTER LEICESTERSHIRE LE8 4DL U.K
3 VALENCE WOOD ROAD DAGENHAM GREATER LONDON RM8 3AT P.R.C

In [19]:
failed_addresses = []
for address in clean_addresses:
    if address['country_retrieved'] == False:        
        failed_addresses.append(address)

In [20]:
def clean_punctuation(string):
    """
    Clears a string of all periods, joins any left-over sequential
    single letters to one-another.
    """
    string_clean = re.sub(r"\.", " ", string)
    
    tokens = []
    
    for token in string_clean.split():
        if re.match("^[A-Za-z]$", token):
            tokens.append(token)
        else:
            tokens.append(" " + token + " ")
    
    return "".join(tokens).replace("  ", " ")

In [21]:
# Below demo encapsulates success and failure situations
clean_punctuation("here's a b long test 77 7 P.R.China")

" here's ab long test 77 7 PR China "

In [22]:
for address in failed_addresses:
    
    if 'country' not in address['parsed']:
        address['parsed']['country'] = None
        
    # Try parsing after removing all punctuation/special characters
    try:
        address_stripped = re.sub("[^0-9A-Za-z ]", "", address['dirty'])
        address_reparsed = {y:x for x, y in parse_address(address_stripped)}
        address['country_stripped'] = address_reparsed['country']
    except:
        address['country_stripped'] = None
    
    print("Country | Libpostal_found | Cleaning_found")
    print(address['country'],
          " | ",
          address['parsed']['country'],
          " | ",
          address['country_stripped'])

Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  

U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
TAIWAN CHINA  |  china  |  china
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_found | Cleaning_found
U.K  |  None  |  uk
Country | Libpostal_found | Cleaning_found
P.R.CHINA  |  None  |  None
Country | Libpostal_fo

In [26]:
# Playing with string cleaning
test = "SO143HR"
parse_address(test)

[('so143hr', 'house')]

## 3.  Regex for getting rid of telephone numbers and fax numbers

Some examples of the cases we'd like to extract out:

- TEL=09483 284932
- FAX = 08264 827494
- TEL=974893738389
- TEL 8897897 78878
- CONTACT NUMBER=34879097890970
- EMAIL=QUESTIONABLE.FORMATTING@GMAIL.COM

The commonalities are the marker (TEL, FAX, CONTACT NUMBER), a separator (a space, an equals sign, or a combination of the two) and then a run of digits which may or may not contain arbitrary spaces.

I think Paul said something about email addresses already being handled so we'll not look at that.

I reckon the best option here is again to perform the search on a version with the spaces removed!
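For comparison, here is a sketch of an alternative that works on the original spacing rather than the space-stripped string. It is not the approach tried in the cell below, and both the pattern and the name `strip_contact_numbers` are assumptions for illustration.

```python
import re

# Marker, optional separator, then at least three digits plus any
# further space-separated digit groups.
CONTACT_RE = re.compile(
    r"\b(?:TEL(?:EPHONE)?|FAX|CONTACT NUMBER)\s*=?\s*\d{3,}(?: \d+)*",
    re.IGNORECASE,
)


def strip_contact_numbers(text):
    """Remove TEL/FAX/CONTACT NUMBER markers and their digits, then tidy whitespace."""
    return " ".join(CONTACT_RE.sub(" ", text).split())


print(strip_contact_numbers("3 MONTE WAY TEL=87687689679 HAVERSHAM"))
print(strip_contact_numbers("3 MONTE WAY TELAVIV"))
```

One caveat: if a house number directly follows the phone number (for example "TEL=87687 689679 3 MONTE WAY"), this pattern would swallow it too, so it is safest where contact details sit at the end of the address.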

In [125]:
test1 = "3 MONTE WAY TEL=87687 689679"
test2 = "3 MONTE WAY TEL=87687689679 HAVERSHAM"
test3 = "3 MONTE WAY TEL = 87687 689679"
test4 = "3 MONTE WAY TEL 87687 689679"
test5 = "3 MONTE WAY TELEPHONE=87687 689679"
test6 = "3 MONTE WAY SO14 3HR"
test7 = "3 MONTE WAY FAX=87687 689679"
test8 = "3 MONTE WAY TELAVIV"

#contact_regex = "(TEL)[0-9*]|(FAX)[0-9*]|(CONTACT)[0-9*]"

contact_regex = "TEL[0-9]{3,}|FAX[0-9]{3,}"
try:
    # remove all non-alphanumeric characters (this includes whitespace)
    x = re.sub("[^0-9A-Za-z]", "", test6)
    print(x)
    print(re.findall(contact_regex, x)[0])
except Exception as e:
    print(e)

3MONTEWAYSO143HR
list index out of range
