# Parsing Fake Addresses Using Python : [Pros & Cons]

with **Mr Fugu Data Science**

`___________________________________________________________________________________________`


If you have never used [faker] before: it lets you create fake data for testing, for example to populate database applications

**visit**: [faker_website](https://faker.readthedocs.io/en/master/) 

We will be parsing fake addresses with [usaddress]

**visit**: [US address parser](https://usaddress.readthedocs.io/en/latest/)

U.S. & World-Wide Addresses website: [real addresses](http://results.openaddresses.io/)
These files are downloadable and are used in the first example.

In [1]:
# !pip install faker
# !pip install usaddress

In [2]:
import os # search directory[s] for files 
import pandas as pd 
import usaddress # used to parse: US addresses
from faker import Factory,Faker # create fake: addresses
import csv # create/export csv_files 

Faker.seed(4321) # setting a seed to ensure same results each time (older Faker releases used fake_data.seed(4321) on the instance)
fake_data=Faker()

# Using `os` to locate files in various directories:
+ `This will assist in two ways:`
    + using `os` to find files outside of your working directory
    + avoiding hardcoded paths

# Current directory locating files:

In [3]:
# Find files in working directory:

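def os_local_dir_search(file):
    # look for `file` in each immediate sub-directory of the working directory
    d = os.getcwd()
    usr_dir = [os.path.join(d,o) for o in os.listdir(d) if os.path.isdir(os.path.join(d,o))]

    for item in usr_dir:
        candidate = os.path.join(item, file.lstrip(os.sep)) # join adds the path separator
        if os.path.exists(candidate):
            return candidate
    return file # not found: return the name unchanged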

# Walking from your directory to search for files:
+ this could be extended beyond the working directory to a system-wide search

In [14]:
# Find a file outside your directory:

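def os_any_dir_search(file):
    # walk the working directory tree and load the first matching csv
    for p,n,f in os.walk(os.getcwd()):
        for a in f:
            if a.endswith(file): # `file` can be a full name or just an ending such as (.csv)
                print(a)
                print(p)
                return pd.read_csv(os.path.join(p,a)) # join path and the matched file name
    raise FileNotFoundError(file) # nothing matched under the working directory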

# the match is made on the end of the file name, so include the extension (e.g. .csv)
os_any_dir_search('berkeley.csv').head()


berkeley.csv
/Users/zatoichi59/Desktop/Projects/openaddr-collected-us_west/us/ca


|  | LON | LAT | NUMBER | STREET | UNIT | CITY | DISTRICT | REGION | POSTCODE | ID | HASH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.260714 | 37.863205 | 2550 | DANA ST |  | BERKELEY |  |  | 94704 | 055 183213100 | 29adde543ce77829 |
| 1 | -122.289434 | 37.855691 | 1012 | GRAYSON ST | A | BERKELEY |  |  | 94710 | 053 166004000 | c177338841c6ef79 |
| 2 | -122.289434 | 37.855691 | 1012 | GRAYSON ST | C | BERKELEY |  |  | 94710 | 053 166004200 | 8bc0fbe83c6df548 |
| 3 | -122.294703 | 37.870741 | 1813 | NINTH ST |  | BERKELEY |  |  | 94710 | 057 209002100 | 8faa530f19246223 |
| 4 | -122.28939 | 37.870799 | 1901 | CURTIS ST |  | BERKELEY |  |  | 94702 | 057 208303100 | 5c07b797bead947f |


In [49]:
# for p,n,f in os.walk(os.getcwd()):
#     print(p) # path of the directory currently being visited
#     print(n) # names of the sub-directories inside it
#     print(f) # names of the files inside it
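
To make the three values yielded by `os.walk` concrete, here is a small sketch that builds a throwaway directory tree and walks it. The helper name `find_files` and the file names are made up for illustration; only the standard library is used.

```python
import os
import tempfile

def find_files(root, suffix):
    """Return the full paths of all files under `root` whose names end with `suffix`."""
    matches = []
    # os.walk yields (current directory path, sub-directory names, file names)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# Build a small temporary tree: <tmp>/us/ca/berkeley.csv and <tmp>/notes.txt
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "us", "ca"))
    for rel in (os.path.join("us", "ca", "berkeley.csv"), "notes.txt"):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("placeholder\n")

    found = find_files(tmp, ".csv")
    # Show the matches relative to the temporary root
    relative = [os.path.relpath(p, tmp) for p in found]
    print(relative)
```

Returning every match, rather than reading the first one straight into pandas, leaves the choice of which file to load to the caller.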

# Ideal Situation: Real Addresses to parse: parsed two different ways
+ 1<sup>st</sup>: Using `usaddress.tag()`
+ 2<sup>nd</sup>: Using `usaddress.parse()`


In [20]:
berkeley_addr=os_any_dir_search('berkeley.csv')

berkeley_sub=berkeley_addr.loc[:,['NUMBER','STREET','UNIT','CITY','POSTCODE']].fillna(value='0').astype('str', copy=False)

berkeley_sub_lst=berkeley_sub.values.tolist()

berkeley.csv
/Users/zatoichi59/Desktop/Projects/openaddr-collected-us_west/us/ca


# Convert Real Addresses with [usaddress.tag()]
+ `usaddress.tag()` returns a tuple: an `OrderedDict()` mapping each label to its value, plus the detected address type

In [29]:
'''
def Convert(): converts a list of tuples into a dictionary of (key,[values])

def list_to_liststr(): converts a list of lists into a list of strings, the input usaddress needs for parsing

parse_tag: creates a list of OrderedDict() results, which is flattened into a list of tuples for the Convert() function

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dict_addr.items()])): 

iterates over the dictionary and builds a dataframe from it, padding columns of different lengths with NaN
'''

dictionary = {} 
def Convert(tup, di): 
    for a, b in tup: 
        di.setdefault(a, []).append(b) 
    return di 

#--------------------------------

def list_to_liststr(nest_list):
    # join each inner list into one string because the parser can't read nested lists directly
    return [' '.join(i) for i in nest_list]

#-----------------------------   

def parse_tag(address_stuff):
    h=[]
    for i in address_stuff:
        h.append(usaddress.tag(i)) # each result is (OrderedDict, address_type)
    pp=[]
    for o in h:
        for k in o[0].items(): # o[0] is the OrderedDict of (label, value) pairs
            pp.append(k)
    return pp
#--------------------------------

berk_=list_to_liststr(berkeley_sub_lst)

dict_addr=Convert(parse_tag(berk_),dictionary)

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dict_addr.items()])).head()

|  | AddressNumber | StreetName | StreetNamePostType | OccupancyIdentifier | ZipCode | PlaceName | StateName | StreetNamePostDirectional | BuildingName | StreetNamePreModifier | StreetNamePreType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2550 | DANA | ST | 0 BERKELEY | 94704 | A | BERKELEY | E | 60 THE UPLANDS 0 BERKELEY | EL | CAMINO |
| 1 | 1012 | GRAYSON | ST | 0 BERKELEY | 94710 | C | BERKELEY | N | 50 THE UPLANDS 0 BERKELEY | EL | CAMINO |
| 2 | 1012 | GRAYSON | ST | 0 BERKELEY | 94710 | A | BERKELEY | E | 70 THE UPLANDS 0 BERKELEY | EL | CAMINO |
| 3 | 1813 | NINTH | ST | 0 BERKELEY | 94710 | B | BERKELEY | E | 90 THE UPLANDS 0 BERKELEY | EL | CAMINO |
| 4 | 1901 | CURTIS | ST | 1 BERKELEY | 94702 | ST | BERKELEY | E | 30 THE UPLANDS 0 BERKELEY | EL | CAMINO |
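
One thing the table above shows is that grouping every `(label, value)` pair by label loses track of which address a value came from: each column is simply filled from the top, so row 0 can mix pieces of different addresses (its `PlaceName` of `A` belongs to the second address). A minimal sketch of an alternative, assuming you want one row per address, is to keep one dictionary per address. The sample pairs are copied from the `usaddress.tag()` output in this notebook; the helper name `to_records` is hypothetical.

```python
# Each inner list holds the (label, value) pairs for ONE address,
# copied from the usaddress.tag() output shown in this notebook.
tagged = [
    [("AddressNumber", "2550"), ("StreetName", "DANA"),
     ("StreetNamePostType", "ST"), ("OccupancyIdentifier", "0 BERKELEY"),
     ("ZipCode", "94704")],
    [("AddressNumber", "1012"), ("StreetName", "GRAYSON"),
     ("StreetNamePostType", "ST"), ("PlaceName", "A"),
     ("StateName", "BERKELEY"), ("ZipCode", "94710")],
]

def to_records(tagged_addresses):
    """Turn per-address pairs into one dict per address, filling gaps with None."""
    # Collect every label in first-seen order so all rows share the same columns
    columns = []
    for pairs in tagged_addresses:
        for label, _ in pairs:
            if label not in columns:
                columns.append(label)
    records = []
    for pairs in tagged_addresses:
        found = dict(pairs)
        # Labels missing from this address become None instead of shifting rows
        records.append({col: found.get(col) for col in columns})
    return records

records = to_records(tagged)
for row in records:
    print(row)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`, which keeps each address on its own row.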


In [30]:
h=[]
for i in berk_:
    h.append(usaddress.tag(i)) 
pp=[] # create the output list once, outside the loop
for o in h:
    for k in o[0].items():
        pp.append(k)
pp

[('AddressNumber', '2550'),
 ('StreetName', 'DANA'),
 ('StreetNamePostType', 'ST'),
 ('OccupancyIdentifier', '0 BERKELEY'),
 ('ZipCode', '94704'),
 ('AddressNumber', '1012'),
 ('StreetName', 'GRAYSON'),
 ('StreetNamePostType', 'ST'),
 ('PlaceName', 'A'),
 ('StateName', 'BERKELEY'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1012'),
 ('StreetName', 'GRAYSON'),
 ('StreetNamePostType', 'ST'),
 ('PlaceName', 'C'),
 ('StateName', 'BERKELEY'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1813'),
 ('StreetName', 'NINTH'),
 ('StreetNamePostType', 'ST'),
 ('OccupancyIdentifier', '0 BERKELEY'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1901'),
 ('StreetName', 'CURTIS'),
 ('StreetNamePostType', 'ST'),
 ('OccupancyIdentifier', '0 BERKELEY'),
 ('ZipCode', '94702'),
 ('AddressNumber', '1829'),
 ('StreetName', 'SIXTY-THIRD'),
 ('StreetNamePostType', 'ST'),
 ('OccupancyIdentifier', '0 BERKELEY'),
 ('ZipCode', '94703'),
 ('AddressNumber', '2641'),
 ('StreetName', 'WEBSTER'),
 ('StreetNamePostType', 'ST'

# Real Address parsing with [usaddress.parse()]:
+ This is `NOT using OrderedDict() formatting`: `usaddress.parse()` returns a list of (token, label) tuples

In [31]:
#------------- Need to iterate nested list and create list of strings for parsing ------------
def parse_w_parse(real_addr_lst):
    parse_lst=[]
    for i in real_addr_lst:
        parse_lst.append(','.join(i))
    wow=[]
    for h in parse_lst:
        wow.extend(usaddress.parse(h))
    tq=[(sub[1], sub[0]) for sub in wow]
    return tq

#--------------------------------
tq=parse_w_parse(berkeley_sub_lst)
# note: `dictionary` still holds the tag() results from above, so these pairs are appended to it
oui=Convert(tq,dictionary)
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in oui.items()])).head()


|  | AddressNumber | StreetName | StreetNamePostType | OccupancyIdentifier | ZipCode | PlaceName | StateName | StreetNamePostDirectional | BuildingName | StreetNamePreModifier | StreetNamePreType | OccupancyType | StreetNamePreDirectional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2550 | DANA | ST | 0 BERKELEY | 94704 | A | BERKELEY | E | 60 THE UPLANDS 0 BERKELEY | EL | CAMINO | BERKELEY, | WEST |
| 1 | 1012 | GRAYSON | ST | 0 BERKELEY | 94710 | C | BERKELEY | N | 50 THE UPLANDS 0 BERKELEY | EL | CAMINO | BERKELEY, | WEST |
| 2 | 1012 | GRAYSON | ST | 0 BERKELEY | 94710 | A | BERKELEY | E | 70 THE UPLANDS 0 BERKELEY | EL | CAMINO | BERKELEY, | WEST |
| 3 | 1813 | NINTH | ST | 0 BERKELEY | 94710 | B | BERKELEY | E | 90 THE UPLANDS 0 BERKELEY | EL | CAMINO | BERKELEY, | WEST |
| 4 | 1901 | CURTIS | ST | 1 BERKELEY | 94702 | ST | BERKELEY | E | 30 THE UPLANDS 0 BERKELEY | EL | CAMINO | BERKELEY, | WEST |


In [50]:
parse_lst=[]
for i in berkeley_sub_lst:
    parse_lst.append(','.join(i))
wow=[]
for h in parse_lst:
    wow.extend(usaddress.parse(h))
tq=[(sub[1], sub[0]) for sub in wow]



In [51]:
parse_lst


['2550,DANA ST,0,BERKELEY,94704',
 '1012,GRAYSON ST,A,BERKELEY,94710',
 '1012,GRAYSON ST,C,BERKELEY,94710',
 '1813,NINTH ST,0,BERKELEY,94710',
 '1901,CURTIS ST,0,BERKELEY,94702',
 '1829,SIXTY-THIRD ST,0,BERKELEY,94703',
 '2641,WEBSTER ST,1,BERKELEY,94705',
 '1314,HASKELL ST,A,BERKELEY,94702',
 '2522,DANA ST,101,BERKELEY,94704',
 '2122,JEFFERSON AVE,0,BERKELEY,94703',
 '1917,FRANCISCO ST,0,BERKELEY,94709',
 '1646,CORNELL AVE,0,BERKELEY,94702',
 '1645,CORNELL AVE,0,BERKELEY,94702',
 '1709,MILVIA ST,0,BERKELEY,94709',
 '2301,VIRGINIA ST,1,BERKELEY,94709',
 '1176,ARCH ST,0,BERKELEY,94708',
 '1178,EUCLID AVE,1,BERKELEY,94708',
 '2515,ASHBY AVE,2,BERKELEY,94705',
 '2928,FLORENCE ST,B,BERKELEY,94705',
 '2914,DEAKIN ST,A2,BERKELEY,94705',
 '928,CARLETON ST,0,BERKELEY,94710',
 '2723,ASHBY PL,B,BERKELEY,94705',
 '1412,OXFORD ST,0,BERKELEY,94709',
 '2006,ROSE ST,0,BERKELEY,94709',
 '2234,WARD ST,1,BERKELEY,94705',
 '1812,PARKER ST,0,BERKELEY,94703',
 '1411,SPRUCE ST,1,BERKELEY,94709',
 '933,ADDIS

In [52]:
wow

[('2550,', 'AddressNumber'),
 ('DANA', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('0,', 'OccupancyIdentifier'),
 ('BERKELEY,', 'OccupancyIdentifier'),
 ('94704', 'ZipCode'),
 ('1012,', 'AddressNumber'),
 ('GRAYSON', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('A,', 'PlaceName'),
 ('BERKELEY,', 'StateName'),
 ('94710', 'ZipCode'),
 ('1012,', 'AddressNumber'),
 ('GRAYSON', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('C,', 'PlaceName'),
 ('BERKELEY,', 'StateName'),
 ('94710', 'ZipCode'),
 ('1813,', 'AddressNumber'),
 ('NINTH', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('0,', 'OccupancyIdentifier'),
 ('BERKELEY,', 'OccupancyIdentifier'),
 ('94710', 'ZipCode'),
 ('1901,', 'AddressNumber'),
 ('CURTIS', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('0,', 'OccupancyIdentifier'),
 ('BERKELEY,', 'OccupancyIdentifier'),
 ('94702', 'ZipCode'),
 ('1829,', 'AddressNumber'),
 ('SIXTY-THIRD', 'StreetName'),
 ('ST,', 'StreetNamePostType'),
 ('0,', 'OccupancyIdentifier'),
 ('BER

In [53]:
tq

[('AddressNumber', '2550,'),
 ('StreetName', 'DANA'),
 ('StreetNamePostType', 'ST,'),
 ('OccupancyIdentifier', '0,'),
 ('OccupancyIdentifier', 'BERKELEY,'),
 ('ZipCode', '94704'),
 ('AddressNumber', '1012,'),
 ('StreetName', 'GRAYSON'),
 ('StreetNamePostType', 'ST,'),
 ('PlaceName', 'A,'),
 ('StateName', 'BERKELEY,'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1012,'),
 ('StreetName', 'GRAYSON'),
 ('StreetNamePostType', 'ST,'),
 ('PlaceName', 'C,'),
 ('StateName', 'BERKELEY,'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1813,'),
 ('StreetName', 'NINTH'),
 ('StreetNamePostType', 'ST,'),
 ('OccupancyIdentifier', '0,'),
 ('OccupancyIdentifier', 'BERKELEY,'),
 ('ZipCode', '94710'),
 ('AddressNumber', '1901,'),
 ('StreetName', 'CURTIS'),
 ('StreetNamePostType', 'ST,'),
 ('OccupancyIdentifier', '0,'),
 ('OccupancyIdentifier', 'BERKELEY,'),
 ('ZipCode', '94702'),
 ('AddressNumber', '1829,'),
 ('StreetName', 'SIXTY-THIRD'),
 ('StreetNamePostType', 'ST,'),
 ('OccupancyIdentifier', '0,'),
 ('Occ
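
Because the fields were joined with commas, `usaddress.parse()` keeps the trailing comma on each token (`'2550,'`, `'ST,'`). Below is a small sketch of swapping each `(token, label)` pair into `(label, token)` and stripping that punctuation in the same pass. The sample pairs are copied from the `wow` output above; the helper name `swap_and_clean` is hypothetical.

```python
# (token, label) pairs copied from the first address in the `wow` output
parsed = [
    ("2550,", "AddressNumber"),
    ("DANA", "StreetName"),
    ("ST,", "StreetNamePostType"),
    ("0,", "OccupancyIdentifier"),
    ("BERKELEY,", "OccupancyIdentifier"),
    ("94704", "ZipCode"),
]

def swap_and_clean(pairs):
    """Return (label, token) pairs with trailing commas and spaces removed from each token."""
    return [(label, token.rstrip(", ")) for token, label in pairs]

cleaned = swap_and_clean(parsed)
print(cleaned)
```

Joining the fields with a space instead of a comma before parsing would avoid the stray punctuation in the first place, though it may change how the parser labels the tokens.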

# Difficult Parsing: 2 examples of pseudo U.S. address formats
+ Similar to real U.S. addresses, with some unstructured data
    + 1<sup>st</sup>: working with a .csv dump, simulating possible web-scraped data
    + 2<sup>nd</sup>: Python `Faker` package `.address()` 

In [40]:
# Creating Fake Addresses: (ONLY USE if USING "FAKER" package)

def address_stuff(h):
    user_list=[]
    for i in range(h):
        user_list.append(fake_data.address())
#         user_list.append(fake_data.city())
#         user_list.append(fake_data.state_abbr())
#         user_list.append(fake_data.user_name())

    return user_list     # keep the return outside the loop: indented further, it returns after 1 address

In [54]:
fake_data.address()

'21353 Nicole Rue Suite 706\nPort David, MO 39157'
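
Faker separates the street line from the city line with a newline (`\n`), and that character shows up inside parsed tokens later on (for example `Ways\n`). One option, sketched here as an assumption rather than something this notebook does, is to flatten each address onto one line before parsing. The helper name `one_line` is hypothetical.

```python
def one_line(address):
    """Replace line breaks in a multi-line address with ', ' so it reads as a single line."""
    parts = [part.strip() for part in address.splitlines() if part.strip()]
    return ", ".join(parts)

# Sample address copied from the fake_data.address() output above
sample = "21353 Nicole Rue Suite 706\nPort David, MO 39157"
flat = one_line(sample)
print(flat)
```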

In [55]:
# Creating CSV and exporting to my directory with 6001 entries:

with open('fake_address.csv','w',newline='') as csvfile:
    CSV_fake_=csv.writer(csvfile)
    
    for i in address_stuff(6001):
        CSV_fake_.writerow([i])

# as a side note: if you don't use ('with') then you have to close() the file at the end
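
Since each fake address contains a newline, it is worth checking that it survives the trip through a CSV file. Here is a minimal sketch using an in-memory buffer instead of a real file: the `csv` module quotes a field that contains a line break, so reading it back yields one row per address. The second sample address is invented for illustration.

```python
import csv
import io

addresses = [
    "21353 Nicole Rue Suite 706\nPort David, MO 39157",
    "1 Example Road\nSampletown, CA 90000",  # made-up address for illustration
]

# Write one address per row; newline='' lets the csv module control line endings
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for addr in addresses:
    writer.writerow([addr])

# Read the rows back from the start of the buffer
buffer.seek(0)
rows = [row[0] for row in csv.reader(buffer)]
print(len(rows))
```

The same `newline=''` argument is what the `open()` call above passes for the real file.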

# Ex.1) Parse Addresses [.csv dump] using (usaddress.parse())
+ 1<sup>st</sup>: read the .csv dump and flatten it into a list of address strings
+ 2<sup>nd</sup>: parse addresses using [usaddress.parse()]
+ 3<sup>rd</sup>: swap each (token, label) tuple --> (label, token)
+ 4<sup>th</sup>: take list(tup()) --> Dict() with list(values()) and convert to df (account for *missing values*) [*this is important because there are mismatched rows*]

In [22]:
# Unparsed Raw addresses:

unparsed_data=pd.read_csv('fake_address.csv', header=None)

unparsed_df_tolist = unparsed_data.values.tolist()

hh=[]
for i in unparsed_df_tolist:
    hh.extend(i) # flatten the nested lists because the parser can't read them directly

h=[]
for j in hh:
    h.extend(usaddress.parse(j)) # list of (token, label) tuples


q=[(sub[1], sub[0]) for sub in h] # switching order to create (key:value) pairs
qq=Convert(q, {}) # fresh dict so results from other cells are not mixed in
uo=pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in qq.items()]))


In [44]:
uo.head()

|  | AddressNumber | StreetName | StreetNamePostType | PlaceName | StateName | ZipCode | OccupancyType | OccupancyIdentifier | SubaddressType | SubaddressIdentifier | USPSBoxType | USPSBoxID | StreetNamePostDirectional | Recipient | LandmarkName | StreetNamePreDirectional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 774 | Michael | Ways\n | Desireemouth | MO | 92256 | Suite | 563\n | PSC | 8070 | Box | 1131\n | North | USNV Bishop\n | USNS Hill\n | West |
| 1 | 0 | Ross | Circles\n | Clayland | IN | 94270 | Suite | 818\n | PSC | 5276 | Box | 7165\n | West | USNV Singleton\n | USS |  |
| 2 | 789 | Hernandez | Street | Anamouth | LA | 65211 | Apt. | 598\n | PSC | 3552 | Box | 5871\n | South | USS Anderson\n | Byrd\n |  |
| 3 | 202 | Adam | Park | Lake Martin | AK | 46226 | Apt. | 892\n | PSC | 6716 | Box | 3849\n | East | USNS James\n |  |  |
| 4 | 13819 | Darrell | Falls\n | APO | AA | 40334 | Suite | 132\n | PSC | 2301 | Box | 0854\n | South | USS Lopez\n |  |  |
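
The `pd.DataFrame(dict([(k, pd.Series(v)) ...]))` idiom works because wrapping each list in a `pd.Series` lets pandas pad the shorter columns with `NaN`. To see what that padding amounts to without pandas, here is a rough standard-library sketch that pads with `None` using `itertools.zip_longest`. The column lengths below are invented for illustration.

```python
from itertools import zip_longest

# Values collected per label across several addresses;
# not every address has every label, so the lists differ in length
columns = {
    "AddressNumber": ["774", "0", "789"],
    "StateName": ["MO", "IN", "LA", "AK"],
    "USPSBoxID": ["1131"],
}

# zip_longest walks all columns in step and fills exhausted ones with None
rows = [
    dict(zip(columns.keys(), values))
    for values in zip_longest(*columns.values(), fillvalue=None)
]

for row in rows:
    print(row)
```

The padding makes the table rectangular, but it does not realign values with the address they came from.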


# Ex. 2) Using Python Faker and parsing addresses: using (usaddress.tag())

In [57]:
''' def add_parse: 'tag' is a more robust parser but does have limitations;
                   with this method you are creating an OrderedDict()
        
    def Convert: creates a dictionary of (key:[values])
    
    list comprehension: used directly inside the dataframe call; it loops over (key:value) pairs
                        and accounts for NaN values. The dataset has mismatched rows
                        because not all columns apply to every address, which otherwise throws an error.
                        
    __One thing to note__: the more addresses you generate, the more NaN values and obscure addresses you get, 
                       and parsing becomes an issue. 
        '''

def add_parse(address_stuff,qty):
    g=address_stuff(qty)
    h=[]
    for i in g:
        h.append(usaddress.tag(i)) # each result is (OrderedDict, address_type)
    pp=[]
    for o in h:
        for k in o[0].items(): # o[0] is the OrderedDict of (label, value) pairs
            pp.append(k)
    return pp
#----------------------------------

pp=add_parse(address_stuff,600) # adding values such as 6000, throws an error! this is explained below
#-------------------------

dictionary = {} 
g=(Convert(pp, dictionary))

pd.DataFrame(dict([(k,pd.Series(v)) for k,v in g.items() ]))

|  | AddressNumber | StreetName | StreetNamePostType | PlaceName | StateName | ZipCode | StreetNamePostDirectional | OccupancyType | OccupancyIdentifier | SubaddressType | SubaddressIdentifier | USPSBoxType | USPSBoxID | Recipient | LandmarkName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99867 | Wilson | Forest\n | Barbaraburgh | NM | 00684 | North | Suite | 306\n | PSC | 2990 | Box | 0045\n | USS Morales\n | USNS West\n |
| 1 | 61391 | Lauren | Curve\n | Stevenside | ID | 10310 | North | Apt. | 515\n | PSC | 4606 | Box | 2190\n | USCGC Bender\n |  |
| 2 | 6311 | Richard | Ways | Shermanborough | KY | 78438 | East | Suite | 103\n | PSC | 8017 | Box | 7624\n | USNS Martinez\n |  |
| 3 | 749 | Figueroa | Turnpike | Port Derek | IN | 63986 | North | Apt. | 584\n | PSC | 1624 | Box | 7140\n | USNS Pope\n |  |
| 4 | 70932 | Baker | Place | Colemouth | GA | 18586 | South | Apt. | 935\n | PSC | 4325 | Box | 8062\n | USS Palmer\n |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 595 |  |  |  | Lopezville | TN | 33729 |  |  |  |  |  |  |  |  |  |
| 596 |  |  |  | Francistown | HI | 75706 |  |  |  |  |  |  |  |  |  |
| 597 |  |  |  | Samuelland | WI | 63502 |  |  |  |  |  |  |  |  |  |
| 598 |  |  |  | FPO | AA | 28555 |  |  |  |  |  |  |  |  |  |


In [58]:
# Dump file to csv (export)
y=pd.DataFrame(dict([(k,pd.Series(v)) for k,v in g.items() ]))
y.to_csv('ce.csv',index=False,header=False)

# This image allows you to see one of the problems with parsing:
+ This address and others like it are not exactly [REAL U.S. ADDRESSES], which creates problems for this package.
+ If you evaluate the data it 

![](address_parse_fail.jpg)

![](address_parse_fail.png)

# This is the error code:
+ This occurs when I store larger datasets of fake addresses: some of them are questionable addresses that the parser cannot parse

![](error_code.png)
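
One way to keep a large batch from stopping at the first bad address is to catch the parser's exception per address and set the failures aside. The sketch below is an assumption about how that could look, not what this notebook does: `tag_all`, `fake_tagger`, and `TagError` are hypothetical stand-ins so the example runs without `usaddress` installed. With the real library you would presumably pass `usaddress.tag` and `usaddress.RepeatedLabelError`, which `tag()` raises when the same label is assigned to separate parts of an address; the text does not confirm whether that is the error in the screenshot.

```python
class TagError(Exception):
    """Hypothetical stand-in for the parser's own exception type."""

def fake_tagger(address):
    """Hypothetical tagger: fails on any address that mentions 'Box' twice."""
    if address.count("Box") > 1:
        raise TagError(address)
    return ({"Raw": address}, "Ambiguous")

def tag_all(addresses, tagger, error_type):
    """Tag each address; return (successful results, addresses that raised error_type)."""
    tagged, failed = [], []
    for addr in addresses:
        try:
            tagged.append(tagger(addr))
        except error_type:
            # Keep the raw string so the failures can be inspected later
            failed.append(addr)
    return tagged, failed

sample = ["PSC 8070, Box 1131", "Box 12 Box 34 Nowhere"]
tagged, failed = tag_all(sample, fake_tagger, TagError)
print(len(tagged), len(failed))
```

Collecting the failures instead of discarding them makes it possible to see which address formats the parser struggles with.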

# Citations

https://www.geeksforgeeks.org/python-split-list-into-lists-by-particular-value/

https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths

https://www.geeksforgeeks.org/python-swap-tuple-elements-in-list-of-tuples/

https://stackoverflow.com/questions/49638992/generating-test-data-how-to-generate-a-valid-address-for-a-given-us-zipcode

https://stackoverflow.com/questions/9234560/find-all-csv-files-in-a-directory-using-python/38584736