# Address parsing using Named Entity Recognition
Data & AI course, UC Leuven, 2021 Fall
### Project supervisors
- Tom Magerman
- Aimée Lynn Backiel

### Project team (Group 4)
- Karolis Medekša
- Pedro Teixeira Palma Rosa
- Hysa Mello de Alcântara
- Josep Jacob Chetrit Valdepeñas

## Goals
The goal of the assignment is to implement a solution for parsing individual parts of an address (street, house number, postal code, etc.) using a Natural Language Processing model. We can derive three subtasks from the assignment:
- converting the training data into a format accepted by the NLP library
- training a Named Entity Recognition pipeline and creating an NLP model with it
- evaluating how accurately the model makes predictions

## NLP tool
As the final solution requires a Python script, we chose [spaCy](https://spacy.io/), one of the most popular Natural Language Processing tools for Python. It features full support for Named Entity Recognition, which is required for solving the assignment.

## Preparing training data
The training data is provided as an Excel file in which, for each address, there are columns holding the extracted values for the different tokens:
![image.png](attachment:image.png)

However, NER model training requires the training data in a specific format, in which each entity is identified by its character positions in the string (start offset, end offset, label):
```python
# Example training data format
TRAIN_DATA = [
    ('Arkelsedijk 46,4206 AC Gorinchem', {
        'entities': [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]
    }), 
    ('SIGMA-TAU Industrie Farmaceutiche Riunite S.p.A.,Via Pontina, km 30,400 00040, Pomezia', {
        'entities': [(0, 48, 'co'), (49, 60, 'street'), (62, 67, 'nr'), (68, 77, 'postal'), (79, 86, 'city')]
    })
]
```
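
As a quick sanity check on this format, the sketch below slices the address with each `(start, end)` pair and confirms that the offsets recover the expected substrings. The end offset is exclusive, as in Python slicing. The helper name `spans_to_text` is illustrative and not part of the original notebook.

```python
def spans_to_text(text, entities):
    """Return (label, substring) pairs for character-offset annotations."""
    return [(label, text[start:end]) for start, end, label in entities]

address = 'Arkelsedijk 46,4206 AC Gorinchem'
entities = [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]

# Print each label next to the text its offsets select
for label, value in spans_to_text(address, entities):
    print(label, '->', repr(value))
```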

First we need code for reading the raw training data into a `pandas` `DataFrame`:

In [1]:
import pandas as pd

def read_DataFrame_from_file(filename: str, numberOfRows: int = None):
    return pd.read_excel(filename, nrows = numberOfRows, keep_default_na=False)

It is now possible to read the data and inspect its format:

In [2]:
DATA_INPUT_FILENAME = 'training_data.xlsx'

raw_data: pd.DataFrame = read_DataFrame_from_file(DATA_INPUT_FILENAME, 10)
raw_data.head(3)

Unnamed: 0,person_id,person_name,person_address,cln1,cln2,cln3,person_ctry_code,cnt,chr_len,chr_len_label,...,street,nr,area,postal,city,region,country,unclear,status,label
0,3540,PURAC Biochem BV,"Arkelsedijk 46,4206 AC Gorinchem",Arkelsedijk 46 4206 AC Gorinchem,46 4206,46,NL,1,32,2,...,Arkelsedijk,46,,4206 AC,Gorinchem,,,,,1
1,28753,"Tinti, Maria Ornella",SIGMA-TAU Industrie Farmaceutiche Riunite S.p....,SIGMA-TAU Industrie Farmaceutiche Riunite S.p....,30 400 00040,30,IT,1,86,1,...,Via Pontina,km 30,,400 00040,Pomezia,,,,,1
2,35108,"Isobe, Shin-ichi, c/o Int. Prop. Dpt., NTT DoC...","Sanno Park Tower, 11-1, Nagatacho 2-chome,Chiy...",Sanno Park Tower 11-1 Nagatacho 2-chome Chiyod...,111 2,11-,JP,1,59,1,...,Nagatacho 2-chome 11-1,,Chiyoda-ku,,Tokyo,,,,,1


The following code can be used to transform the `DataFrame` into a format supported by `spaCy`:


First we transform the data frame into a list of dictionaries, one per row and keyed by column name, using `DataFrame.to_dict('records')`. The function `map_to_training_entry` maps each dictionary to a `tuple` of the address and a dictionary containing the entity list. Finally, `get_entity_list` finds the location of each entity in the address.

In [3]:
import re
import json

TOKEN_TYPES: set = {'co', 'building', 'street', 'nr', 'area', 'postal', 'city', 'region', 'country'}

def get_entity_list(entry: dict, address: str):
    entities: list = []
    present_tokens = filter(lambda item: item[0] in TOKEN_TYPES and item[1] and str(item[1]).strip(), entry.items())

    for item in present_tokens:
        token_value = str(item[1]).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            span = match.span()
            entities.append((span[0], span[1], item[0]))
        else:
            # Try and resolve multiple tokens separated by ';'
            split_items = map(lambda token: token.strip(), token_value.split(';'))
            for token in split_items:
                split_match = re.search(re.escape(token), address)
                if split_match:
                    span = split_match.span()
                    entities.append((span[0], span[1], item[0]))
                else:
                    print('WARNING: could not find token "{}" in address "{}"'.format(token, address))
    
    return entities

def map_to_training_entry(entry: dict):
    address = entry['person_address']
    return (address, {
        'entities': get_entity_list(entry, address)
    })

train_data = list(
    map(map_to_training_entry, raw_data.to_dict('records'))
)

print("---- TRAINING DATA EXAMPLE ----")
print(train_data[0])

---- TRAINING DATA EXAMPLE ----
('Arkelsedijk 46,4206 AC Gorinchem', {'entities': [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]})
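
The two-step lookup in `get_entity_list` can be illustrated in isolation. The sketch below is a simplified stand-in, not the notebook's function: `locate` is a hypothetical helper, the address and cell value are invented for illustration, and, unlike the original, it skips empty parts after splitting.

```python
import re

def locate(token_value, address):
    """Find a cell value in the address; fall back to its ';'-separated parts."""
    match = re.search(re.escape(token_value), address)
    if match:
        return [match.span()]
    spans = []
    for part in token_value.split(';'):
        part = part.strip()
        if not part:
            continue  # ignore empty fragments such as a trailing ';'
        part_match = re.search(re.escape(part), address)
        if part_match:
            spans.append(part_match.span())
    return spans

# Hypothetical address and cell values, not taken from the training data
address = 'Unit 7, Harbour House, 12 Quay Road, Bristol'
print(locate('Harbour House; Unit 7', address))  # whole value not found, parts are
print(locate('Bristol', address))                # whole value found directly
```

Note that `re.search` returns only the first occurrence, so a value that appears more than once in an address is always mapped to its first position.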


For some entries, warnings are printed because a token value could not be found in the address string. We will discuss these problems later.

## Training the model
First we create a blank English `spaCy` model instance and add a `ner` pipeline component to it (the code in this notebook uses the spaCy v2 API). The address token types also need to be added as labels to the component.

In [4]:
import spacy 

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

When training the model we use the `minibatch` utility function to apply model updates in batches, which improves performance. The `compounding` utility increases the batch size over successive iterations. Batching more than doubles the training speed.

In [6]:
from spacy.util import minibatch, compounding
import random

optimizer = nlp.begin_training()
for itn in range(2):
    random.shuffle(train_data)
    losses = {}

    batches = minibatch(train_data, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

Iteration: 0 | Losses: {'ner': 89.77388048171997}
Iteration: 1 | Losses: {'ner': 78.99851471185684}
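
To make the batching scheme more concrete, here is a plain-Python sketch of the idea behind `compounding` and `minibatch`. It is an approximation written for illustration, not spaCy's implementation; the names `compounding_sizes` and `minibatches` are hypothetical, and a growth factor of 2.0 is used instead of 1.001 so that the effect is visible on a small input.

```python
import itertools

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

def minibatches(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    for size in sizes:
        batch = list(itertools.islice(iterator, int(size)))
        if not batch:
            return
        yield batch

# A growth factor of 2.0 (instead of 1.001) makes the effect easy to see
lengths = [len(b) for b in minibatches(range(70), compounding_sizes(4, 32, 2.0))]
print(lengths)  # [4, 8, 16, 32, 10]
```

With the factor 1.001 used in the training loop, the size grows much more slowly: it takes more than 200 batches before the truncated batch size goes from 4 to 5.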


The training works; however, we do get warnings that, for some entries, entities could not be used for training:
![image.png](attachment:image.png)

We will return to this issue in a later chapter. Also, if we try to train a model with all of the data, we run into an error:

In [7]:
%%capture

raw_data: pd.DataFrame = read_DataFrame_from_file(DATA_INPUT_FILENAME, 999)
train_data = list(
    map(map_to_training_entry, raw_data.to_dict('records'))
)

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

optimizer = nlp.begin_training()
for itn in range(1):
    random.shuffle(train_data)
    losses = {}

    batches = minibatch(train_data, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

ValueError: [E103] Trying to set conflicting doc.ents: '(0, 7, 'area')' and '(0, 7, 'city')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

### Conflicting entity errors
When the model is trained on all the records, an error indicating conflicting entities is raised. The issue will be addressed in a later chapter; for now, we can simply filter out entries that have conflicting entities:

In [9]:
def entities_overlap(entry):
    entities = entry[1]['entities']
    for first in entities:
        for second in entities:
            if first == second: continue
            # Two spans overlap when each one starts before the other ends
            if first[0] < second[1] and second[0] < first[1]:
                print('Entities {} and {} overlap in "{}"'.format(first, second, entry[0]))
                return True
    return False

train_data = list(filter(lambda entry: not entities_overlap(entry), train_data))

Entities (4, 74, 'co') and (4, 13, 'city') overlap in "5F, CHANGZHOU HIGH-TECH RESEARCH INSTITUTE OF NANJING UNIVERSITY CHANGZHOU, JIANGSU 213164 CHANGZHOU SCIENCE & EDUCATION TOWN, NO.801 MIDDLE CHANGWU ROAD"
Entities (23, 38, 'area') and (0, 38, 'area') overlap in "POLÍGONO INDUSTRIAL DE QUART DE POBLET, C/ LA PINAETA S/N, 46930 QUART DE POBLET VALENCIA ES"
Entities (0, 19, 'street') and (0, 2, 'nr') overlap in "14 Duntrune Terrace West Ferry,Dundee DD5 1LF Scotland"
Entities (0, 7, 'area') and (0, 7, 'city') overlap in "Enskede"
Entities (23, 31, 'postal') and (23, 25, 'region') overlap in "2 Sconsett Bluff,Avon, CT 06001"
Entities (48, 58, 'city') and (48, 54, 'region') overlap in "c/o Shinko El. Ind. Co.,Ltd. 80, Oshimada-machi,Nagano-shi Nagano 381-2287"
Entities (65, 74, 'area') and (65, 70, 'city') overlap in "c/o Taiki Corp., Ltd.,3-41 Nishiawaji 6-chome,Higashiyodogawa-ku,Osaka-shi Osaka 533-0031"
Entities (13, 33, 'area') and (13, 17, 'area') overlap in "4 NORMAN WAY,OVER IN
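
Several of these conflicts arise because one token value is a substring of another, and `re.search` maps both to the same position. The sketch below reproduces one of the reported cases; the token values `'CT 06001'` and `'CT'` are read off the reported spans, and `spans_overlap` is an illustrative helper rather than the notebook's function.

```python
import re

address = '2 Sconsett Bluff,Avon, CT 06001'

# Values recovered from the reported spans (23, 31) and (23, 25)
postal_span = re.search(re.escape('CT 06001'), address).span()
region_span = re.search(re.escape('CT'), address).span()
print(postal_span, region_span)  # (23, 31) (23, 25)

def spans_overlap(a, b):
    """Half-open spans overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

print(spans_overlap(postal_span, region_span))  # True
```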

Now we can divide the present data into training and validation sets:

In [22]:
from sklearn.model_selection import train_test_split
import numpy as np

train_sample, test_sample = train_test_split(
    train_data, test_size = 0.2, random_state = 420
)
print('train entries: {} | test entries: {}'.format(len(train_sample), len(test_sample)))
# train_sample = train_data.sample(frac=0.8, random_state=420)
# test_sample = train_data.drop(train_sample.index)

train entries: 778 | test entries: 195
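
For readers unfamiliar with `train_test_split`, the following standard-library sketch shows the underlying idea: shuffle with a fixed seed, then cut off a fraction as the test set. It is an illustration only; `split_entries` is a hypothetical helper and will not reproduce the exact partition that scikit-learn produces for the same seed.

```python
import math
import random

def split_entries(entries, test_fraction=0.2, seed=420):
    """Shuffle a copy of entries and split off a test set."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    # Round the test size up, so 20% of 973 entries gives 195
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 973 placeholder entries, matching the number left after filtering
train_part, test_part = split_entries(range(973))
print(len(train_part), len(test_part))  # 778 195
```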


We can now train a model using the training data. For a baseline model we will perform 20 iterations of training:

In [24]:
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(train_sample)
    losses = {}

    batches = minibatch(train_sample, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

Iteration: 0 | Losses: {'ner': 3047.68327928675}
Iteration: 1 | Losses: {'ner': 2880.2026164971303}
Iteration: 2 | Losses: {'ner': 2624.2195279695447}
Iteration: 3 | Losses: {'ner': 2549.3434972041164}
Iteration: 4 | Losses: {'ner': 2410.465751566776}
Iteration: 5 | Losses: {'ner': 2342.9147328278573}
Iteration: 6 | Losses: {'ner': 2198.874550244972}
Iteration: 7 | Losses: {'ner': 2201.1767211930755}
Iteration: 8 | Losses: {'ner': 2118.442391669805}
Iteration: 9 | Losses: {'ner': 2084.389958909136}
Iteration: 10 | Losses: {'ner': 2029.0070619526118}
Iteration: 11 | Losses: {'ner': 1953.75416000656}
Iteration: 12 | Losses: {'ner': 1958.9142796220924}
Iteration: 13 | Losses: {'ner': 1834.4957377857934}
Iteration: 14 | Losses: {'ner': 1816.372640061109}
Iteration: 15 | Losses: {'ner': 1786.3032062708623}
Iteration: 16 | Losses: {'ner': 1683.3750419441453}
Iteration: 17 | Losses: {'ner': 1678.1313704151987}
Iteration: 18 | Losses: {'ner': 1612.5993487595124}
Iteration: 19 | Losses: {'ner':

## Evaluating model performance
We can evaluate how well the model performs by computing the precision, recall and F1 score on the training and test data.

In [53]:
## Maps the results object into a DataFrame
def results_per_entity_to_df(res: dict):
    columns = ['Token', 'Precision', 'Recall', 'F1 score']
    # One row per entity type, followed by a row with the overall scores
    rows = [
        [token,
         res['ents_per_type'][token]['p'],
         res['ents_per_type'][token]['r'],
         res['ents_per_type'][token]['f']]
        for token in TOKEN_TYPES
    ]
    rows.append(['Total', res['ents_p'], res['ents_r'], res['ents_f']])
    return pd.DataFrame(rows, columns=columns)

We can use `spaCy`'s own `Scorer` utility to evaluate the performance for each entity type.

In [54]:
%%capture
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def map_to_evaluation_model(entry: tuple):
    return (entry[0], entry[1]['entities'])


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

train_results = evaluate(nlp, map(map_to_evaluation_model, train_sample))
test_results = evaluate(nlp, map(map_to_evaluation_model, test_sample))

In [56]:
from IPython.display import display, HTML

## TODO: Add bar charts
print('---- Results on train data ----')
display(HTML(results_per_entity_to_df(train_results).to_html(index=False)))
print('---- Results on test data ----')
display(HTML(results_per_entity_to_df(test_results).to_html(index=False)))

---- Results on train data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| area | 74.226804 | 53.731343 | 62.337662 |
| co | 48.979592 | 57.142857 | 52.747253 |
| region | 85.546875 | 95.633188 | 90.309278 |
| street | 81.018519 | 91.145833 | 85.784314 |
| city | 86.823105 | 97.368421 | 91.793893 |
| building | 50.0 | 5.405405 | 9.756098 |
| country | 100.0 | 52.631579 | 68.965517 |
| postal | 92.178771 | 97.633136 | 94.827586 |
| nr | 92.0 | 92.61745 | 92.307692 |
| Total | 84.884488 | 87.78157 | 86.308725 |


---- Results on test data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| area | 46.153846 | 27.906977 | 34.782609 |
| co | 44.444444 | 26.666667 | 33.333333 |
| region | 76.119403 | 82.258065 | 79.069767 |
| street | 50.746269 | 77.272727 | 61.261261 |
| city | 72.049689 | 85.925926 | 78.378378 |
| building | 0.0 | 0.0 | 0.0 |
| country | 66.666667 | 50.0 | 57.142857 |
| postal | 89.361702 | 87.5 | 88.421053 |
| nr | 73.809524 | 72.093023 | 72.941176 |
| Total | 69.030733 | 73.182957 | 71.046229 |


We can see that the model works fairly well for the house number, postal code, city, street and region tokens on the training data. On the test data, however, it performs noticeably worse: the total F1 score drops from 86.3 to 71.0. Performance on the area, company and country tokens is especially poor, and the building token is not recognized at all (F1 score of 0).
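
For reference, the sketch below shows how precision, recall and F1 can be computed for entity spans, assuming that a prediction only counts as correct when start, end and label all match the gold annotation. It illustrates the metrics and is not `spaCy`'s `Scorer`; the function name and the example prediction are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]
# Hypothetical prediction: one wrong boundary and one missed entity
predicted = [(0, 11, 'street'), (12, 14, 'nr'), (15, 19, 'postal')]
print(precision_recall_f1(gold, predicted))

# F1 is the harmonic mean of precision and recall; for the "Total" test row:
p, r = 69.030733, 73.182957
print(round(2 * p * r / (p + r), 3))  # 71.046
```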


We can try to improve the performance of the model by fixing issues with training data presentation and parsing.

## 