# Address Parser

Goal: As a data scientist, I want to build a model that extracts the street name, house number, postal code, and city from an arbitrary address.

Approach:
- Construct simple, standardized training addresses
- Test the first iteration of the model on this training set
- Introduce random permutations of the addresses
- Test and iterate on the model until it handles the random permutations
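
To make the idea of "standardized" versus "permuted" addresses concrete, here is a minimal sketch. The `standard_address` and `permute_address` helpers, the separators, and the field order are assumptions made for illustration only; they are not taken from the `AddressPermutator` class used in this notebook, which may work quite differently.

```python
import random

FIELDS = ("street", "number", "postcode", "city")

def standard_address(record):
    """Join the non-empty fields in a fixed order: street, number, postcode, city."""
    return " ".join(str(record[f]) for f in FIELDS if record[f] != "")

def permute_address(record, rng):
    """Return the same fields in random order with a random separator (hypothetical scheme)."""
    parts = [str(record[f]) for f in FIELDS if record[f] != ""]
    rng.shuffle(parts)
    separator = rng.choice([" ", ", ", " - "])
    return separator.join(parts)

record = {
    "street": "R VASCO DA GAMA",
    "number": "189",
    "postcode": "5370-481",
    "city": "MIRANDELA",
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible

print(standard_address(record))
print(permute_address(record, rng))
```

The model would then be trained to map the permuted string back to the standardized one, so that it learns which part of the input is the street, the number, the postal code, and the city.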

Source for addresses: https://openaddresses.io/

In [1]:
import sys

import pandas as pd
import numpy as np
from tqdm import tqdm

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, RNN, Bidirectional, TimeDistributed, LeakyReLU, ReLU
from tensorflow.keras.optimizers import Adam

from source.address_permutator import AddressPermutator

In [2]:
open_addresses = pd.read_csv('data/openaddr-collected-europe/pt/countrywide.csv').sample(250000)

In [3]:
# Approximate in-memory size of the sample in GB (bytes * 1e-9)
sys.getsizeof(open_addresses)*1e-9

0.11704186700000001

### Create Addresses

In [4]:
open_addresses.head()

| | LON | LAT | NUMBER | STREET | UNIT | CITY | DISTRICT | REGION | POSTCODE | ID | HASH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 437557 | -8.678451 | 37.661822 | | | | SÃO LUÍS | | | 7630-435 | pt.ine.add.PTCONT.442076 | 2ff689f9290ac435 |
| 2628735 | -9.11051 | 38.773483 | 120.0 | R SARG ARMANDO MONTEIRO FERREIRA | | LISBOA | | | 1800-329 | pt.ine.add.PTCONT.3001376 | 6a6569db76b2beca |
| 916867 | -7.18344 | 41.480391 | 189.0 | R VASCO DA GAMA | | MIRANDELA | | | 5370-481 | pt.ine.add.PTCONT.934703 | a3346176d19bbd1f |
| 3868521 | -8.384785 | 41.280149 | 37.0 | R NOVA DO SISTELO | | PAÇOS DE FERREIRA | | | 4590-177 | pt.ine.add.PTCONT.3729528 | 79645e8bbe7a6e4c |
| 5832954 | -25.590071 | 37.769852 | 23.0 | R REI D CARLOS | | PONTA DELGADA | | | 9500-606 | pt.ine.add.AC26.82807 | 1fd9f80d1fa1272c |


In [5]:
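# Replace missing values with empty strings so every field can be treated as text
open_addresses = open_addresses.fillna('')
# Keep only the four target fields, in the order street, number, postcode, city
const_matrix = open_addresses[['STREET', 'NUMBER', 'POSTCODE', 'CITY']].values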

In [6]:
permutator = AddressPermutator(const_matrix.copy())

In [7]:
perm, standard = permutator.permutate()

250000it [00:02, 103611.02it/s]


In [8]:
X, y = permutator.encode(perm, standard)
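
The layer shapes reported in the model summary of this notebook suggest that `X` is a 3-D tensor of shape `(samples, 104, 64)`, i.e. each address becomes a fixed-length sequence of one-hot vectors. The following is a minimal sketch of such a character-level encoding. The alphabet, the padding scheme, and the `one_hot_encode` helper are assumptions for illustration; the actual `permutator.encode` implementation is not shown here and may differ.

```python
import numpy as np

def one_hot_encode(texts, alphabet, max_len):
    """Encode strings as an array of shape (len(texts), max_len, len(alphabet) + 1).

    Index 0 is reserved for padding and for characters outside the alphabet.
    """
    char_to_index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(texts), max_len, len(alphabet) + 1), dtype=np.float32)
    for row, text in enumerate(texts):
        for pos in range(max_len):
            # Positions past the end of the string are marked as padding
            index = char_to_index.get(text[pos], 0) if pos < len(text) else 0
            encoded[row, pos, index] = 1.0
    return encoded

# Hypothetical alphabet: upper-case letters, digits, space and hyphen (38 symbols)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -"
X_demo = one_hot_encode(["R VASCO DA GAMA 189"], alphabet, max_len=104)
print(X_demo.shape)  # (1, 104, 39)
```

Padding every address to the same length is what lets the LSTM layers receive a fixed `input_shape` of `(X.shape[1], X.shape[2])`.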

### Simple Model

In [10]:
simple = Sequential()

simple.add(LSTM(512, return_sequences=True, input_shape=(X.shape[1], X.shape[2])))
simple.add(LeakyReLU())

simple.add(LSTM(256, return_sequences=True))
simple.add(LeakyReLU())

simple.add(TimeDistributed(Dense(128)))
simple.add(LeakyReLU())

simple.add(TimeDistributed(Dense(X.shape[2], activation='softmax')))

# `lr` is a deprecated alias; `learning_rate` is the supported argument name
optimizer = Adam(learning_rate=0.01)

simple.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
simple.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 104, 512)          1181696   
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 104, 512)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 104, 256)          787456    
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 104, 256)          0         
_________________________________________________________________
time_distributed (TimeDistri (None, 104, 128)          32896     
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 104, 128)          0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 104, 64)           8
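
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias, which gives `4 * (units * (input_dim + units) + units)` parameters; a dense layer has `input_dim * units + units`. The sketch below applies these formulas to the shapes in the summary. The input feature size of 64 is inferred from the first LSTM's parameter count and the final output shape rather than stated explicitly in the notebook, and the helper names are made up for this example.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input/output pair plus one bias per output unit."""
    return input_dim * units + units

print(lstm_params(64, 512))    # 1181696, matches lstm
print(lstm_params(512, 256))   # 787456, matches lstm_1
print(dense_params(256, 128))  # 32896, matches time_distributed
print(dense_params(128, 64))   # 8256, computed for a 128 -> 64 dense layer
```

`TimeDistributed` applies the same dense weights at every time step, so wrapping a layer in it does not change the parameter count.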

In [None]:
history = simple.fit(
    X,
    y,
    batch_size=128,
    epochs=5,
    shuffle=True,
    validation_split=0.1
)

Train on 225000 samples, validate on 25000 samples
Epoch 1/5