# Address Parser

Goal: As a data scientist, I want to build a model that extracts the street name, house number, postal code, and city from an arbitrary address.

Approach:
- Construct simple, standardized training addresses
- Test a first iteration of the model on this training set
- Introduce random permutations of the addresses
- Test and iterate on the model until it handles the random permutations
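
The permutation step is handled below by `AddressPermutator` from `source.address_permutator`, whose implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below shuffles the fields of one address and keeps the standardized form as the target. The helper name `permute_address`, the space separator, and the canonical field order are assumptions made for this example, not the behaviour of the real class.

```python
import random

# Canonical field order assumed for this sketch.
FIELDS = ("street", "number", "postcode", "city")


def permute_address(address, rng):
    """Return (permuted, standard) strings for one address dict.

    Hypothetical helper: the real AddressPermutator may work differently.
    """
    # Standard form: non-empty fields in canonical order.
    standard = " ".join(address[f] for f in FIELDS if address[f])
    # Permuted form: the same non-empty fields in random order.
    order = [f for f in FIELDS if address[f]]
    rng.shuffle(order)
    permuted = " ".join(address[f] for f in order)
    return permuted, standard


rng = random.Random(0)  # seeded so the shuffle is reproducible
example = {
    "street": "R LAVADOURO",
    "number": "",
    "postcode": "3080-437",
    "city": "FIGUEIRA DA FOZ",
}
permuted, standard = permute_address(example, rng)
```

Here `standard` would serve as the training target and `permuted` as the model input; empty fields (such as the missing house number) are simply dropped.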

Source for addresses: https://openaddresses.io/

In [1]:
import sys

import pandas as pd
import numpy as np
from tqdm import tqdm

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, RNN, Bidirectional, TimeDistributed, LeakyReLU, ReLU
from tensorflow.keras.optimizers import Adam

from source.address_permutator import AddressPermutator

In [2]:
open_addresses = pd.read_csv('data/openaddr-collected-europe/pt/countrywide.csv').sample(250000)

In [3]:
# Approximate in-memory size of the sample in GB
sys.getsizeof(open_addresses) * 1e-9

0.117051391

### Create Addresses

In [4]:
open_addresses.head()

| Unnamed: 0 | LON | LAT | NUMBER | STREET | UNIT | CITY | DISTRICT | REGION | POSTCODE | ID | HASH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5135157 | -8.426725 | 41.824786 | | | | ARCOS DE VALDEVEZ | | | 4970-773 | pt.ine.add.PTCONT.5138620 | 7badd876ea4dba4b |
| 1244929 | -8.824561 | 40.191252 | | R LAVADOURO | | FIGUEIRA DA FOZ | | | 3080-437 | pt.ine.add.PTCONT.1246122 | 74630bc4de087779 |
| 1833424 | -7.530559 | 37.176477 | | AV MANUEL ROSA MENDES | | VILA NOVA DE CACELA | | | 8900-017 | pt.ine.add.PTCONT.1552920 | 26200a2b8aeb512c |
| 2327817 | -9.339334 | 38.736575 | CASA 2 | EN 249 | | SÃO DOMINGOS DE RANA | | | 2785-034 | pt.ine.add.PTCONT.3000242 | e8256078fd78b2e6 |
| 340273 | -8.752097 | 40.558268 | | R TRAS DAS ESCOLAS | | GAFANHA DA BOA HORA | | | 3840-255 | pt.ine.add.PTCONT.286741 | 6b6a2a498984026d |


In [5]:
open_addresses = open_addresses.fillna('')
const_matrix = open_addresses[['STREET', 'NUMBER', 'POSTCODE', 'CITY']].values

In [6]:
permutator = AddressPermutator(const_matrix.copy())

In [7]:
perm, standard = permutator.permutate()

250000it [00:02, 94516.21it/s]


In [8]:
X, y = permutator.encode(perm, standard)
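
The internals of `permutator.encode` are not shown here. The model below takes inputs of shape `(X.shape[1], X.shape[2])`, which is consistent with a padded, one-hot, character-level encoding (time steps by vocabulary size). The following is a minimal pure-Python sketch of such an encoding; the helpers `build_vocab` and `one_hot_encode`, the padding character, and the example strings are all assumptions for illustration.

```python
def build_vocab(texts, pad="\0"):
    """Map each character seen in texts to an index; index 0 is the padding symbol."""
    chars = sorted(set("".join(texts)) - {pad})
    return {c: i for i, c in enumerate([pad] + chars)}


def one_hot_encode(text, vocab, max_len, pad="\0"):
    """Return a max_len x len(vocab) nested list with exactly one 1 per row."""
    # Truncate or right-pad the text to a fixed number of time steps.
    padded = text[:max_len].ljust(max_len, pad)
    rows = []
    for ch in padded:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


texts = ["R LAVADOURO 3080-437", "3080-437 R LAVADOURO"]
vocab = build_vocab(texts)
encoded = [one_hot_encode(t, vocab, max_len=24) for t in texts]
```

Wrapping `encoded` in `np.asarray(...)` would give a 3-D array of shape `(samples, time steps, vocabulary size)`, the layout an LSTM input layer expects.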

### Simple Model

In [9]:
simple = Sequential()

simple.add(LSTM(512, return_sequences=True, input_shape=(X.shape[1], X.shape[2])))
simple.add(LeakyReLU())

simple.add(LSTM(256, return_sequences=True))
simple.add(LeakyReLU())

simple.add(LSTM(128, return_sequences=True))
simple.add(LeakyReLU())

simple.add(TimeDistributed(Dense(X.shape[2], activation='softmax')))

# `lr` is a deprecated alias; `learning_rate` is the supported argument name
optimizer = Adam(learning_rate=0.01)

simple.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
simple.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 106, 512)          1179648   
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 106, 512)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 106, 256)          787456    
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 106, 256)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 106, 128)          197120    
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 106, 128)          0         
_________________________________________________________________
time_distributed (TimeDistri (None, 106, 63)           8
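
The LSTM parameter counts in the summary can be checked by hand. A Keras LSTM layer with bias has four gates, each with an input kernel, a recurrent kernel, and a bias vector, giving `4 * (units * (input_dim + units) + units)` parameters. The sketch below applies this to the three layers, taking the input width of 63 from the summary's output shapes; `lstm_params` is a helper defined only for this check.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias."""
    # Four gates, each with: input kernel (input_dim x units),
    # recurrent kernel (units x units), and bias (units).
    return 4 * (units * (input_dim + units) + units)


counts = [
    lstm_params(63, 512),   # lstm:   63 input features -> 512 units
    lstm_params(512, 256),  # lstm_1: 512 -> 256 units
    lstm_params(256, 128),  # lstm_2: 256 -> 128 units
]
print(counts)  # [1179648, 787456, 197120]
```

These values match the `Param #` column for `lstm`, `lstm_1`, and `lstm_2` above; the `LeakyReLU` layers add no parameters.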

In [None]:
history = simple.fit(
    X,
    y,
    batch_size=128,
    epochs=5,
    shuffle=True,
    validation_split=0.1
)

Train on 225000 samples, validate on 25000 samples
Epoch 1/5