# Regular Expressions Tutorial: Parse Addresses with Python


with `Mr. Fugu Data Science`

# (◕‿◕✿)


[Github](https://github.com/MrFuguDataScience) | [Youtube](https://www.youtube.com/channel/UCbni-TDI-Ub8VlGaP8HLTNw?view_as=subscriber)

[re module documentation](https://docs.python.org/3/library/re.html)

| Escape Sequences | Meaning | | Special Characters | Meaning |
|-----|-----|---|-----|-----|
| \A | matches start of string | | . | any character except (\n) |
| \b | word boundary | | ^ | start of string; in multiline mode it also matches directly after each (\n) |
| \B | NOT a word boundary | | $ | end of string or just before a trailing (\n) |
| \d | digits: (0-9) | | * | 0+ repetitions of the pattern before the (*) |
| \D | any character but NOT a digit | | + | 1+ repetitions of the pattern before the (+) |
| \s | whitespace: (\t, \n, space, etc.) | | ? | 0 or 1 repetitions of the preceding RE |
| \S | any character but NOT whitespace | | *?, +?, ?? | non-greedy forms of *, +, ? |
| \w | word characters: letters, digits, underscore | | {m} | exactly m repetitions |
| \W | anything except a word character | | {m,n} | from m to n repetitions, as MANY as possible |
| \Z | matches only at end of string | | {m,n}? | from m to n repetitions, as FEW as possible |
| ( ) | grouping | | [ ] | character set, matches any one character inside |
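
To make a few rows of the table concrete, here is a small sketch. The sample strings and variable names are made up for illustration and are not part of the notebook data.

```python
import re

# Hypothetical sample text, chosen only to exercise the table entries.
sample = "Apt. 344, Suite 12B"

# \d+ : one or more digits
digits = re.findall(r"\d+", sample)             # ['344', '12']

# \b\w+\b : whole "words" made of letters, digits, underscore
words = re.findall(r"\b\w+\b", sample)          # ['Apt', '344', 'Suite', '12B']

# {m,n} : between m and n repetitions, as many as possible
two_to_three = re.findall(r"\d{2,3}", sample)   # ['344', '12']

# Greedy vs. non-greedy on the same text
tagged = "<a><b>"
greedy = re.findall(r"<.+>", tagged)            # ['<a><b>']
lazy = re.findall(r"<.+?>", tagged)             # ['<a>', '<b>']
```

The greedy form grabs as much as it can before the final `>`, while the `?` suffix makes the quantifier stop at the first `>` it can.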


`---------------------------------------`

`Other Tips:`

| Look Ahead & Look Behind | Meaning |
|---------------------------|---------|
| ?: | non-capturing group: match but don't capture it |
| ?= | positive look ahead: match suffix but exclude it from the capture ex.) q(?=u), find "q" that IS followed by "u", BUT doesn't return "u" |
| ?! | negative look ahead: match if suffix is absent ex.) p(?!u), means find "p" NOT followed by "u" |
| ?<= | positive look behind: match only if the prefix is present, the prefix is excluded from the capture |
| ?<! | negative look behind: match only if the prefix is absent |
| (?=(regex)) | capture group inside a look ahead: use this to store the data from the look ahead |
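
A quick sketch of the four lookarounds from the table. The strings and variable names below are invented for the example.

```python
import re

text = "quit qatar iraq"  # hypothetical sample

# Positive look ahead: "q" that IS followed by "u" (the "u" is not consumed)
q_before_u = [m.start() for m in re.finditer(r"q(?=u)", text)]      # [0]

# Negative look ahead: "q" that is NOT followed by "u"
q_not_before_u = [m.start() for m in re.finditer(r"q(?!u)", text)]  # [5, 14]

address = "Lake Maria, MN 77577"

# Positive look behind: everything that comes after ", "
after_comma = re.findall(r"(?<=, ).+", address)                     # ['MN 77577']

# Negative look behind: five digits NOT preceded by another digit
zip_code = re.findall(r"(?<!\d)\d{5}\b", address)                   # ['77577']

# Capture group inside a look ahead: stores overlapping matches
overlapping = re.findall(r"(?=(\d\d))", "1234")                     # ['12', '23', '34']
```

Lookarounds only test a position, so the text they inspect never ends up in the match unless it sits inside a capture group, as in the last line.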

In [2]:
import re
import pandas as pd
# from bs4 import BeautifulStoneSoup as bsopa
# from selenium import webdriver
# import requests
import faker

In [3]:
fake_=faker.Faker()
faker.Faker.seed(413)  # class-level seed; recent Faker releases reject fake_.seed(413) on an instance
fake_personal_profile=[]
for i in range(500):
    fake_personal_profile.append(fake_.profile())
    


# use this only if you were taking 1 item from the list of dictionaries
# pd.DataFrame(fake_personal_profile,orient='index').transpose() 

# Creating Personal Profiles with Fake Data:

In [4]:
# Personal Data for Fake People:
fake_persons_data=pd.DataFrame(fake_personal_profile)

blood_type=fake_persons_data['blood_group']

fake_persons_data['residence'][4].split(' ')
# fake_persons_data

['36411', 'Schwartz', 'Shoals', 'Suite', '648\nEast', 'Denise,', 'NV', '66145']

In [5]:
fake_persons_data['residence'][0]

'74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577'

In [6]:
# Change BOX to PO BOX:

update_pobox=[]
for i in fake_persons_data['residence']: 
    op=re.sub(r'Box', 'PO BOX',i)
    update_pobox.append(op)
    
update_pobox[:10]

['74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577',
 '4035 Perez Pass Apt. 421\nCalderonton, KS 89435',
 '0086 Cohen Trafficway\nNorth Joseph, WV 50669',
 '4637 Mary Inlet Apt. 200\nNew Robert, ME 48035',
 '36411 Schwartz Shoals Suite 648\nEast Denise, NV 66145',
 '811 Johnson Garden Suite 506\nYolandafort, RI 63534',
 '976 Amanda Common Suite 647\nChristianchester, VT 25717',
 'USNV Maddox\nFPO AA 77402',
 '0259 Douglas Gardens\nNew Aliciaberg, NM 73609',
 '372 Wilson Common\nLake Latasha, MA 90202']
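
One thing to watch with a plain `re.sub(r'Box', ...)`: an address that already says "PO Box" would become "PO PO BOX", and a street such as "Boxwood" would be rewritten too. Below is a hedged sketch of a stricter substitution. The helper name `normalize_po_box` and the sample addresses are assumptions for illustration, not part of the generated data.

```python
import re

# Optional "PO"/"P.O." prefix, then the whole word "Box" (any letter case).
PO_BOX = re.compile(r"\b(?:P\.?\s?O\.?\s+)?Box\b", re.IGNORECASE)

def normalize_po_box(address):
    """Rewrite 'Box', 'PO Box' or 'P.O. Box' as 'PO BOX' (hypothetical helper)."""
    return PO_BOX.sub("PO BOX", address)

# Hypothetical samples: a bare Box, an existing PO Box, and a street named Boxwood.
samples = [
    "PSC 5627, Box 1234\nAPO AP 26601",
    "PO Box 77\nSpringfield, IL 62704",
    "12 Boxwood Lane\nSalem, OR 97301",
]
normalized = [normalize_po_box(s) for s in samples]
```

The trailing `\b` is what keeps "Boxwood" untouched, and the optional prefix group swallows an existing "PO" so it is not doubled.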

# Handling PO BOX:
+ There are two types of PO BOX here:

1.) `Foreign US Military PO BOX`: the string starts with (`PSC`) or (`Unit`) and contains one of (`APO, FPO, DPO`).
    This corresponds to the `USPS` formatting of Foreign Military Postal Codes. The position of
    the PO BOX inside the string is not accurate here, but it is good enough for this exercise.
   
2.) The other PO BOX entries represent somewhat normal addresses. 
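
The two PO BOX types above can be told apart by the last line of the address. This is a minimal sketch, assuming the military last line looks like `APO|FPO|DPO`, then `AA|AE|AP`, then a 5-digit code. The function name `po_box_kind` and the civilian sample are made up for the example.

```python
import re

# Military last line: post office type, region code, 5-digit ZIP.
MILITARY_LINE = re.compile(r"\b(APO|FPO|DPO)\s+(AA|AE|AP)\s+\d{5}\b")

def po_box_kind(address):
    """Label an address as 'military', 'civilian' or 'not a PO BOX' (hypothetical helper)."""
    if "PO BOX" not in address:
        return "not a PO BOX"
    return "military" if MILITARY_LINE.search(address) else "civilian"

kinds = [
    po_box_kind("PSC 5627, PO BOX 1234\nAPO AP 26601"),
    po_box_kind("PO BOX 77\nSpringfield, IL 62704"),   # hypothetical civilian PO BOX
    po_box_kind("372 Wilson Common\nLake Latasha, MA 90202"),
]
```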

In [7]:
# Take PO BOX and Everything After it, and stop at (\n)
po_box_with_num=re.compile(r'PO BOX.*')

# Take Everything Before (PO BOX)
unit_num_pobox=re.compile(r'.+?(?=PO BOX)')

# Check if String Contains a Specific Word: print sentence
match_word=re.compile(r'\b(.*?PO BOX).+(\s\w.+)\b')


h=[]
for i in update_pobox: 
    found=re.findall(match_word,i)  # run the pattern once per address
    if found:
        h.append(found)
h[:2]
# update_pobox

[[('PSC 5627, PO BOX', '\nAPO AP 26601')],
 [('Unit 7862 PO BOX', '\nDPO AA 94969')]]
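
The tuples above come from the two capture groups in `match_word`. Named groups can make the pieces easier to read. This is only a sketch: it assumes the box number is a run of digits right after `PO BOX`, and the names `po_box_parts`, `unit`, `box` and `last_line` are invented here.

```python
import re

po_box_parts = re.compile(
    r"^(?P<unit>.*?)\s*PO BOX\s+(?P<box>\d+)\n(?P<last_line>.+)$"
)

# Sample in the same shape as the output above; the box number is an assumption.
m = po_box_parts.search("PSC 5627, PO BOX 1234\nAPO AP 26601")
parts = m.groupdict() if m else {}
# parts -> {'unit': 'PSC 5627,', 'box': '1234', 'last_line': 'APO AP 26601'}
```

`groupdict()` returns a plain dictionary, which drops straight into a `pandas` row if you want a table of parsed PO BOX addresses.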

# Find Strings with Apartment or Suite:

In [8]:
strg_apt_ste=[]

h=[]
before_apt_ste=re.compile(r'\b.+?(?=Apt|Suite)\b')
for i in update_pobox: 
    found=re.findall(before_apt_ste,i)
    if found:
        
        # append everything that comes before apt/suite
        h.append(found)

        # append the full string if it has apt/ste
        strg_apt_ste.append(i)

strg_apt_ste[:10]
# h

['74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577',
 '4035 Perez Pass Apt. 421\nCalderonton, KS 89435',
 '4637 Mary Inlet Apt. 200\nNew Robert, ME 48035',
 '36411 Schwartz Shoals Suite 648\nEast Denise, NV 66145',
 '811 Johnson Garden Suite 506\nYolandafort, RI 63534',
 '976 Amanda Common Suite 647\nChristianchester, VT 25717',
 '667 Andrea Divide Suite 768\nRubenbury, RI 76156',
 '81186 Frank Ports Suite 654\nDebrafort, NE 47710',
 '6325 Morse Pine Apt. 801\nRobertoburgh, MN 50010',
 '5266 Hart Burgs Suite 937\nMoorebury, MO 58963']
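
An alternative sketch for the same step: compile the pattern once, use `search` to locate the unit marker, and slice the string at that position. The function name `split_on_unit` is hypothetical, and the pattern assumes units are always written `Apt.` or `Suite`, as in the output above.

```python
import re

UNIT_MARKER = re.compile(r"\b(?:Apt\.|Suite)\s")

def split_on_unit(addresses):
    """Return (addresses with a unit, street part before the unit)."""
    with_unit, street_parts = [], []
    for addr in addresses:
        m = UNIT_MARKER.search(addr)
        if m:
            with_unit.append(addr)
            # Slice up to where "Apt."/"Suite" starts, dropping the trailing space.
            street_parts.append(addr[:m.start()].rstrip())
    return with_unit, street_parts

with_unit, street_parts = split_on_unit([
    "74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577",
    "0086 Cohen Trafficway\nNorth Joseph, WV 50669",
    "36411 Schwartz Shoals Suite 648\nEast Denise, NV 66145",
])
# street_parts -> ['74187 Laurie Parkways', '36411 Schwartz Shoals']
```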

# Extracting everything that is NOT PO BOX:

In [9]:
no_pobox=[]

match_word_pobox=re.compile(r'\b(.*?PO BOX).+(\s\w.+)\b')
for i in update_pobox: 
    if not re.findall(match_word_pobox,i):
        no_pobox.append(i)

no_pobox[:10]

['74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577',
 '4035 Perez Pass Apt. 421\nCalderonton, KS 89435',
 '0086 Cohen Trafficway\nNorth Joseph, WV 50669',
 '4637 Mary Inlet Apt. 200\nNew Robert, ME 48035',
 '36411 Schwartz Shoals Suite 648\nEast Denise, NV 66145',
 '811 Johnson Garden Suite 506\nYolandafort, RI 63534',
 '976 Amanda Common Suite 647\nChristianchester, VT 25717',
 'USNV Maddox\nFPO AA 77402',
 '0259 Douglas Gardens\nNew Aliciaberg, NM 73609',
 '372 Wilson Common\nLake Latasha, MA 90202']

# If the Beginning of the Address is NOT a digit [0-9] sequence:
+ you have a military address, in this instance!

That holds because I already removed the PO BOX addresses.

In [10]:
store_address=[]
military_addr=[]
match_word_pobox=re.compile(r'^[\d]{1,6}.+[\s\w.+].+')

for i in no_pobox: 
    found=re.findall(match_word_pobox,i)
    if found:
        store_address.extend(found)
    else:
        military_addr.append(i)

military_addr
store_address[:10]


['74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577',
 '4035 Perez Pass Apt. 421\nCalderonton, KS 89435',
 '0086 Cohen Trafficway\nNorth Joseph, WV 50669',
 '4637 Mary Inlet Apt. 200\nNew Robert, ME 48035',
 '36411 Schwartz Shoals Suite 648\nEast Denise, NV 66145',
 '811 Johnson Garden Suite 506\nYolandafort, RI 63534',
 '976 Amanda Common Suite 647\nChristianchester, VT 25717',
 '0259 Douglas Gardens\nNew Aliciaberg, NM 73609',
 '372 Wilson Common\nLake Latasha, MA 90202',
 '667 Andrea Divide Suite 768\nRubenbury, RI 76156']
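
If all you need is the split itself, checking the first character is enough. Here is a minimal sketch, assuming (as stated above) that PO BOX addresses are already gone, so anything not starting with a digit is military. The function name is made up.

```python
import re

def partition_by_leading_digit(addresses):
    """Split addresses into (starts with a digit, everything else)."""
    street, other = [], []
    for addr in addresses:
        # re.match only looks at the start of the string
        if re.match(r"\d", addr):
            street.append(addr)
        else:
            other.append(addr)
    return street, other

street, other = partition_by_leading_digit([
    "976 Amanda Common Suite 647\nChristianchester, VT 25717",
    "USNV Maddox\nFPO AA 77402",
    "0259 Douglas Gardens\nNew Aliciaberg, NM 73609",
])
# other -> ['USNV Maddox\nFPO AA 77402']
```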

# Parse Addresses:

In [125]:
y=[]
for i in store_address:
    y.append(re.split(r'\n',i))
#     print(i)
y

# Digit address
primary_addr_num=re.compile(r'^[\d]+')

# Cardinal Direction: "Predirectional"
post_directional=re.compile(r'[^\d]+\w')

# Street Name
# street_name=

# Postdirection:
# post_dir =

# apt,ste,suite
secondary_adrr= re.compile(r'(?=((\bApt\.|\bSuite|#)\s?\w+))')

# City:

# State:

# Zipcode:

q='74187 Laurie Parkways Apt. 344 Lake Maria, MN 77577'
# re.findall(r'^[\d]+',q)

In [24]:
# Find Digit for Street Address:

q='74187 Laurie Parkways Apt. 344 Lake Maria, MN 77577'
re.findall(r'^[\d]+',q)

['74187']

In [126]:
# Pre-Directional for Address:

q='74187 S.W. Laurie Parkways Apt. 344\nLake Maria, MN 77577'
predir_=re.findall(r'^\d+\s(S(\s|\.|o|outh|[ewn]\s|e.|w.|n.)+|w(est|\.|\s)+|e(\.|ast)+)',
           q,flags=re.IGNORECASE)

predir_[0][0]
# re.findall(r'^[\d+].*\s[\w]\s',q)

'S.W. '
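
The pattern above is hard to read, so here is one possible tidier version. It is a sketch under my own assumptions: a pre-directional is a full word (`North`, `South`, `East`, `West`) or a single letter, optionally followed by a second `E`/`W` part, and it must be followed by whitespace. The name `PREDIRECTIONAL` and the extra sample streets are invented.

```python
import re

PREDIRECTIONAL = re.compile(
    r"^\d+\s+((?:north|south|east|west|[nsew])\.?(?:\s?(?:east|west|[ew])\.?)?)\s",
    re.IGNORECASE,
)

def find_predirectional(address):
    """Return the pre-directional after the house number, or None."""
    m = PREDIRECTIONAL.search(address)
    return m.group(1) if m else None

found = [
    find_predirectional("74187 S.W. Laurie Parkways Apt. 344\nLake Maria, MN 77577"),
    find_predirectional("74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577"),
    find_predirectional("12 East Main St"),   # hypothetical
    find_predirectional("12 Elm St"),         # hypothetical: "E" here is part of a name
]
# found -> ['S.W.', None, 'East', None]
```

Requiring whitespace right after the directional is what stops "Elm" from being read as "E" plus something else.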

In [71]:
# Find Apartment:


# re.findall(r'^[\d+].*\s[\w]\s',q)


io=re.findall(r'^[\d+]+\s\w\S\s\w+',q)

for i in io:
    print(re.findall(r'[^\d]+\w',i))

In [1242]:
re.findall(re.compile(r'(?=Apt|Suite)\w\S+'),q)

['Apt.']

In [73]:
# Apartment or Suite with number:

ui=re.findall(re.compile(r'\b(?=((Apt\.|Suite)\s\w+))'),q)
ui[0][0]
# ui

'Apt. 344'

In [76]:
# Look Behind (positive): State and Zipcode

re.split('\n',q)

re.findall(r'(?<=\,).+',q)
# q

[' MN 77577']

In [1559]:
re.findall(r'(?=\,).+',q)

[', MN 77577']

In [78]:
# Look Ahead: but there is a hidden issue to discuss on the line below!

re.findall(re.compile(r'.+?(?=\,)'),q)
# q

['Lake Maria']
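
One likely candidate for the hidden issue (my reading, not something stated in the cell): the pattern only returns the city because `.` does not match `\n`, so the match cannot start on the first line. If the newline is replaced by a space, the same pattern returns everything before the comma. The variable names below are made up.

```python
import re

before_comma = re.compile(r".+?(?=,)")

with_newline = "74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577"
single_line = with_newline.replace("\n", " ")

city_only = before_comma.findall(with_newline)
# ['Lake Maria']

whole_prefix = before_comma.findall(single_line)
# ['74187 Laurie Parkways Apt. 344 Lake Maria']

# A more explicit version: start of a line, then everything up to the comma.
explicit = re.findall(r"^([^,\n]+),", with_newline, flags=re.MULTILINE)
# ['Lake Maria']
```

So the look ahead works on this data only as long as every address keeps its `\n` between the street line and the city line.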

In [11]:
# City Addresses: first step

o=[]
for i in store_address:
    p=re.findall(re.compile(r'.+?(?=,)'),i)
    o.extend(p)
o[:10]

['Lake Maria',
 'Calderonton',
 'North Joseph',
 'New Robert',
 'East Denise',
 'Yolandafort',
 'Christianchester',
 'New Aliciaberg',
 'Lake Latasha',
 'Rubenbury']

In [80]:
# City: Second Step

streets=[]
for j in o:
    pp=re.sub('North|South|West|East','',j)
    streets.append(pp.lstrip())
#     print(j)
streets[:4]
# streets

['Lake Maria', 'Calderonton', 'Joseph', 'New Robert']
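
A caution on the substitution above: without anchors it removes the direction wherever it appears, even inside a longer name. A small sketch of the difference, using "Westfield" as a made-up city that is not in the generated data:

```python
import re

# Only strip a direction when it is a separate leading word.
LEADING_DIRECTION = re.compile(r"^(?:North|South|East|West)\s+")

cities = ["North Joseph", "East Denise", "Westfield", "Lake Maria"]

naive = [re.sub("North|South|West|East", "", c).lstrip() for c in cities]
# ['Joseph', 'Denise', 'field', 'Lake Maria']

anchored = [LEADING_DIRECTION.sub("", c) for c in cities]
# ['Joseph', 'Denise', 'Westfield', 'Lake Maria']
```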

In [12]:
# Divided State & Zipcode:

state_zip=[]
for i in store_address:
    state_zip.extend(re.findall(r'(?<=\,).+',i))
state_zip_=[]
for j in state_zip:
    state_zip_.append(j.split())  # split on whitespace: ' MN 77577' -> ['MN', '77577']
state_zip_[:10]
# state_zip

[['MN', '77577'],
 ['KS', '89435'],
 ['WV', '50669'],
 ['ME', '48035'],
 ['NV', '66145'],
 ['RI', '63534'],
 ['VT', '25717'],
 ['NM', '73609'],
 ['MA', '90202'],
 ['RI', '76156']]
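
The same state and zipcode split can be done in one pass with named groups. This is a sketch assuming a two-letter uppercase state and a 5-digit zipcode at the very end of the string. The names `STATE_ZIP` and `state_and_zip` are hypothetical.

```python
import re

STATE_ZIP = re.compile(r",\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$")

def state_and_zip(address):
    """Return [state, zipcode], or [] when the address does not end that way."""
    m = STATE_ZIP.search(address)
    return [m.group("state"), m.group("zip")] if m else []

pairs = [
    state_and_zip("74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577"),
    state_and_zip("4035 Perez Pass Apt. 421\nCalderonton, KS 89435"),
    state_and_zip("USNV Maddox\nFPO AA 77402"),  # no comma, so no match
]
# pairs -> [['MN', '77577'], ['KS', '89435'], []]
```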

# Difficult way to parse address Post Directionals:

In [89]:
# rp=[]
# j='North Denise'
# for i in o:
#     if (re.findall(r'(W(est|\.|\s))|(E(ast|\.|\s))|(N(orth|\.|\s))|(S(outh|\.|\s)).?',
#            i,re.IGNORECASE)):
#         rp.append((re.findall(r'(W(est|\.|\s))|(E(ast|\.|\s))|(N(orth|\.|\s))|(S(outh|\.|\s)).?',
#            i,re.IGNORECASE)))
#     else:
#         rp.append([])
# rp
# 

# Realistic Easy Way to Parse Post Directional:

In [95]:
# Post Directional:

post_dir=[]
directions=['North','South','East','West']  # not named dir, which would shadow the built-in dir()
# bv=['North','Joseph']
# set(bv)&set(directions)

y=[]
for i in o:
    y.extend([i.split(' ')])
for j in y:
    if set(j)&set(directions):
        post_dir.append(list(set(j)&set(directions)))
    else:
        post_dir.append([])
post_dir[:6]


[[], [], ['North'], [], ['East'], []]

In [13]:
# Full Apt/suite if occurs otherwise empty list:

apt_ste=[]
apt_ste_pattern=re.compile(r'\b(?=((Apt\.|Suite)\s\w+))')  # escaped dot: match a literal "Apt."
for i in store_address:
    ui=re.findall(apt_ste_pattern,i)
    if ui:
        apt_ste.extend(ui)
    else:
        apt_ste.append([])
        
apt_ste_=[]

for i in apt_ste:
    if i==[]:
        apt_ste_.append([])
    else:
        apt_ste_.append(i[0])
        
apt_ste_[:10]
# apt_ste

['Apt. 344',
 'Apt. 421',
 [],
 'Apt. 200',
 'Suite 648',
 'Suite 506',
 'Suite 647',
 [],
 [],
 'Suite 768']
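
Putting the pieces together: the separate steps so far (number, Apt/Suite, city, state, zipcode) could also be collected by one pattern with named groups. This is only a sketch for addresses shaped like the fake data here, two lines with an optional `Apt.`/`Suite`. The names `ADDRESS` and `parse_address` are mine, and real addresses would need far more cases.

```python
import re

ADDRESS = re.compile(
    r"""
    ^(?P<number>\d+)\s+                        # primary address number
    (?P<street>.+?)                            # street name, as short as possible
    (?:\s+(?P<unit>(?:Apt\.|Suite)\s+\w+))?    # optional Apt./Suite with its number
    \n
    (?P<city>[^,\n]+),\s*                      # city: everything up to the comma
    (?P<state>[A-Z]{2})\s+                     # two-letter state
    (?P<zip>\d{5})$                            # 5-digit zipcode
    """,
    re.VERBOSE,
)

def parse_address(address):
    """Return a dict of address parts, or None if the shape is different."""
    m = ADDRESS.search(address)
    return m.groupdict() if m else None

with_apt = parse_address("74187 Laurie Parkways Apt. 344\nLake Maria, MN 77577")
without_apt = parse_address("0086 Cohen Trafficway\nNorth Joseph, WV 50669")
military = parse_address("USNV Maddox\nFPO AA 77402")   # None: no leading digits
```

`re.VERBOSE` lets the pattern carry its own comments, and a missing unit simply shows up as `None` in the dictionary.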

# Military Addresses:

In [14]:
# for i in military_addr:
#     print(re.split(' ',i))
military_addr[:10]

['USNV Maddox\nFPO AA 77402',
 'USNV Lewis\nFPO AA 92743',
 'USNV Wu\nFPO AE 47182',
 'USNV Sloan\nFPO AP 06977',
 'USS Chen\nFPO AA 10270',
 'USNS Fischer\nFPO AE 45323',
 'USS Glover\nFPO AP 69577',
 'USNV Medina\nFPO AA 82238',
 'USCGC Lee\nFPO AA 38102',
 'USCGC Brady\nFPO AA 11924']

# `Now you can take the elements you need by index to create a Military Address File:`

In [15]:
parse_mil_adrr=[]
for i in military_addr:
    parse_mil_adrr.append(i.split('\n'))

k=[]
for i,j in parse_mil_adrr:
    k.append([i.split(' '),j.split(' ')])
k[:10]


[[['USNV', 'Maddox'], ['FPO', 'AA', '77402']],
 [['USNV', 'Lewis'], ['FPO', 'AA', '92743']],
 [['USNV', 'Wu'], ['FPO', 'AE', '47182']],
 [['USNV', 'Sloan'], ['FPO', 'AP', '06977']],
 [['USS', 'Chen'], ['FPO', 'AA', '10270']],
 [['USNS', 'Fischer'], ['FPO', 'AE', '45323']],
 [['USS', 'Glover'], ['FPO', 'AP', '69577']],
 [['USNV', 'Medina'], ['FPO', 'AA', '82238']],
 [['USCGC', 'Lee'], ['FPO', 'AA', '38102']],
 [['USCGC', 'Brady'], ['FPO', 'AA', '11924']]]
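
Instead of picking elements by index, a named-group pattern can label them directly. A minimal sketch, assuming every military address looks like the ones listed above (a prefix, a name, then `APO|FPO|DPO`, a region code and a 5-digit code). The group names are my own labels, not official USPS terms.

```python
import re

MILITARY = re.compile(
    r"^(?P<prefix>\w+)\s+(?P<name>.+)\n"
    r"(?P<post_office>APO|FPO|DPO)\s+(?P<region>AA|AE|AP)\s+(?P<zip>\d{5})$"
)

def parse_military(address):
    """Return a dict of the labelled parts, or None if the shape is different."""
    m = MILITARY.search(address)
    return m.groupdict() if m else None

parsed = parse_military("USNV Maddox\nFPO AA 77402")
# parsed -> {'prefix': 'USNV', 'name': 'Maddox', 'post_office': 'FPO',
#            'region': 'AA', 'zip': '77402'}
```

A list of these dictionaries can be passed to `pd.DataFrame` to build the military address table.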

# Citations:

https://pe.usps.com/text/pub28/28apd_002.htm (USPS postal code criteria)

https://github.com/scrapehero/yellowpages-scraper

https://docs.python.org/3/library/re.html

https://smartystreets.com/articles/regular-expressions-for-street-addresses

https://stackoverflow.com/questions/11160192/how-to-parse-freeform-street-postal-address-out-of-text-and-into-components

https://www.regular-expressions.info/wordboundaries.html

https://www.regular-expressions.info/lookaround.html