# Address Parser

Goal: As a data scientist, I want to build a model that extracts the street name, house number, postal code, and city from an arbitrary address string.

Approach:
- Construct simple, standardized training addresses
- Test the first iteration of the model on this training set
- Introduce random permutations of the addresses
- Test and iterate on the model until it handles the random permutations
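
The first and third steps can be illustrated with a minimal sketch. The `AddressPermutator` source is not shown in this notebook, so the helpers `standardize` and `perturb` below are hypothetical stand-ins, and the separator set is an assumption read off the sample output further down. They only show the idea: one clean target string per address, plus a noisy variant with shuffled fields and random separators.

```python
import random

def standardize(street, number, postcode, city):
    """Build 'street number, postcode city' in lower case, skipping empty fields."""
    left = " ".join(part for part in (street, number) if part)
    right = " ".join(part for part in (postcode, city) if part)
    return ", ".join(part for part in (left, right) if part).lower()

def perturb(street, number, postcode, city, rng):
    """Shuffle the field order and join the tokens with randomly chosen separators."""
    fields = [f.lower() for f in (street, number, postcode, city) if f]
    rng.shuffle(fields)
    tokens = [tok for field in fields for tok in field.split()]
    if not tokens:
        return ""
    separators = [" ", ", ", ". ", "| ", ": "]  # assumed noise characters
    noisy = tokens[0]
    for tok in tokens[1:]:
        noisy += rng.choice(separators) + tok
    return noisy

row = ("AV 25 DE ABRIL", "15", "2855-725", "CORROIOS")
print(standardize(*row))  # av 25 de abril 15, 2855-725 corroios
print(perturb(*row, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which helps when comparing model iterations on the same permuted set.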

Source for addresses: https://openaddresses.io/

In [3]:
import sys

import pandas as pd
import numpy as np
from tqdm import tqdm

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed, LeakyReLU
from tensorflow.keras.optimizers import Adam

from src.address_permutator import AddressPermutator

In [4]:
open_addresses = pd.read_csv('data/openaddr-collected-europe/pt/countrywide.csv').sample(250000)

In [5]:
sys.getsizeof(open_addresses)*1e-9

0.11706664000000001

### Create Addresses

In [6]:
open_addresses.head()

| Unnamed: 0 | LON | LAT | NUMBER | STREET | UNIT | CITY | DISTRICT | REGION | POSTCODE | ID | HASH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4971009 | -9.152346 | 38.645339 | 15 | AV 25 DE ABRIL | | CORROIOS | | | 2855-725 | pt.ine.add.PTCONT.4780616 | cf9036889c60e6c4 |
| 4369023 | -8.631629 | 41.129487 | 408 CAS | R MACHADO DOS SANTOS | | VILA NOVA DE GAIA | | | 4400-209 | pt.ine.add.PTCONT.4351055 | 8291c1bccacd19d1 |
| 2585996 | -9.17465 | 38.716513 | | AV CEUTA NORTE | | LISBOA | | | 1350-410 | pt.ine.add.PTCONT.2845706 | 4644f5cc01f66e70 |
| 4537409 | -8.740059 | 39.004315 | 103 A | R LENTRISQUEIRA | | FOROS DE SALVATERRA | | | 2120-216 | pt.ine.add.PTCONT.4513788 | 264652d0072855ad |
| 2685751 | -9.130571 | 38.74819 | 56 | AV ALM GAGO COUTINHO | | LISBOA | | | 1700-031 | pt.ine.add.PTCONT.3211988 | 73a5b578b78202c3 |


In [13]:
open_addresses = open_addresses.fillna('')
const_matrix = open_addresses[['STREET', 'NUMBER', 'POSTCODE', 'CITY']].values

In [14]:
permutator = AddressPermutator()

In [15]:
perm, standard = permutator.permutate(const_matrix.copy())

250000it [00:02, 106701.53it/s]


In [16]:
type(perm)

tuple

In [19]:
perm

('av 25 de abril 15, 2855-725 corroios',
 'r machado dos santos 408 cas, 4400-209 vila nova de gaia',
 'av ceuta norte, 1350-410 lisboa',
 '103  a. r  lentrisqueira. , 2120-216. foros  de, salvaterra',
 'av alm gago coutinho 56, 1700-031 lisboa',
 'r quinta do telheiro, 2460-052 alcobaça',
 'r| d| maria| ii: 2735-296, | 39: agualva-cacém',
 'r pinhalzinho 74, 2415-533 leiria',
 'r moçambique, 8500-608 portimão',
 'r pereiro 2, 3430-771 parada crs',
 'en 109 21, 3840-011 calvão vgs',
 '6270-554  av, 16, de  setembro| : seia| ',
 '65, r  principal  . 3105-153. louriçal',
 'qta. da. urtigueira, : 6420-654| torres',
 'r dr cristina torres 50, 3080-210 figueira da foz',
 'r dr joaquim carrusca 5, 7800-311 beja',
 'av 25 abril, 6100-621 sertã',
 'r beijoquinha 9a, 2725-510 mem martins',
 '596  r  voluntarios  , atouguia. 2490-081',
 '2266  av, republica. , 4430-196, vila, nova. de  gaia',
 '  r, poço. da. clara.   3150-256  ega',
 'r principal das praias do sado 223, 2910-345 setúbal',
 'av 

In [18]:
const_matrix

array([['AV 25 DE ABRIL', '15', '2855-725', 'CORROIOS'],
       ['R MACHADO DOS SANTOS', '408 CAS', '4400-209',
        'VILA NOVA DE GAIA'],
       ['AV CEUTA NORTE', '', '1350-410', 'LISBOA'],
       ...,
       ['CAM ARSENIO DE MENDONÇA', '21', '9100-048', 'GAULA'],
       ['R S JOÃO', '31', '2480-188', 'PORTO DE MÓS'],
       ['R COSTA', '', '3320-131', 'MACHIO']], dtype=object)

In [8]:
X, y = permutator.encode(perm, standard)
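
How `permutator.encode` works internally is not shown here. The model summary in this notebook reports a per-timestep shape of `(104, 64)`, which suggests fixed-length, character-level one-hot vectors. The sketch below is one plausible way to produce such tensors in plain Python; `build_alphabet`, `one_hot_encode`, and the choice of index 0 for padding are assumptions, not the author's implementation.

```python
def build_alphabet(texts):
    """Map every character seen in texts to an index; index 0 is reserved for padding."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def one_hot_encode(texts, alphabet, max_len):
    """Return a nested list of shape (len(texts), max_len, len(alphabet) + 1)."""
    depth = len(alphabet) + 1
    encoded = []
    for text in texts:
        rows = []
        for pos in range(max_len):
            row = [0] * depth
            # Positions past the end of the string are marked as padding (index 0);
            # strings longer than max_len are silently truncated.
            idx = alphabet[text[pos]] if pos < len(text) else 0
            row[idx] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded

standard_demo = ["av ceuta norte, 1350-410 lisboa"]
permuted_demo = ["1350-410 lisboa| av: ceuta. norte"]

# Inputs and targets share one alphabet so both have the same depth.
alphabet = build_alphabet(standard_demo + permuted_demo)
X_demo = one_hot_encode(permuted_demo, alphabet, max_len=40)
y_demo = one_hot_encode(standard_demo, alphabet, max_len=40)
```

Sharing the alphabet between input and target matches the model below, whose output layer has `X.shape[2]` units.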

### Simple Model

In [10]:
simple = Sequential()

simple.add(LSTM(512, return_sequences=True, input_shape=(X.shape[1], X.shape[2])))
simple.add(LeakyReLU())

simple.add(LSTM(256, return_sequences=True))
simple.add(LeakyReLU())

simple.add(TimeDistributed(Dense(128)))
simple.add(LeakyReLU())

simple.add(TimeDistributed(Dense(X.shape[2], activation='softmax')))

optimizer = Adam(learning_rate=0.01)

simple.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
simple.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 104, 512)          1181696   
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 104, 512)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 104, 256)          787456    
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 104, 256)          0         
_________________________________________________________________
time_distributed (TimeDistri (None, 104, 128)          32896     
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 104, 128)          0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 104, 64)           8
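
As a sanity check, the parameter counts in the summary can be reproduced with the standard formulas for LSTM and Dense layers. This sketch assumes an input depth of 64 characters, inferred from the final output shape `(None, 104, 64)`; the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

vocab = 64  # assumed from the (None, 104, 64) output shape

counts = {
    "lstm": lstm_params(512, vocab),           # 1181696, matches the summary
    "lstm_1": lstm_params(256, 512),           # 787456, matches the summary
    "time_distributed": dense_params(128, 256),  # 32896, matches the summary
    # The summary line for this layer is cut off in the notebook output;
    # this value is computed from the formula, not read from the output.
    "time_distributed_1": dense_params(vocab, 128),
}
print(counts)
```

`LeakyReLU` and `TimeDistributed` add no parameters of their own, which is why the activation rows show 0.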

In [11]:
history = simple.fit(
    X,
    y,
    batch_size=128,
    epochs=5,
    shuffle=True,
    validation_split=0.1
)

Train on 225000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [12]:
simple.save('./saved_model/pt.h5')
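
To read a prediction back as text, the per-timestep softmax output has to be mapped to characters. The notebook does not show this step, so the following is only a sketch: `decode` and `index_to_char` are hypothetical, and treating index 0 as padding is an assumption about the encoding.

```python
def decode(probabilities, index_to_char):
    """Pick the most likely class at each timestep and drop padding (index 0)."""
    chars = []
    for step in probabilities:
        best = max(range(len(step)), key=step.__getitem__)
        if best != 0:
            chars.append(index_to_char[best])
    return "".join(chars)

# Toy example with a three-symbol alphabet: padding, 'a', 'v'.
index_to_char = {1: "a", 2: "v"}
prediction = [
    [0.10, 0.80, 0.10],  # most likely 'a'
    [0.20, 0.10, 0.70],  # most likely 'v'
    [0.90, 0.05, 0.05],  # most likely padding
]
print(decode(prediction, index_to_char))  # av
```

With a trained model, `prediction` would be one row of `simple.predict(X)` converted to a list, and `index_to_char` the inverse of whatever character mapping the encoder used.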