# Address parsing using Named Entity Recognition
Data & AI course, UC Leuven, 2021 Fall
### Project supervisors
- Tom Magerman
- Aimée Lynn Backiel

### Project team (Group 4)
- Karolis Medekša
- Pedro Teixeira Palma Rosa
- Hysa Mello de Alcântara
- Josep Jacob Chetrit Valdepeñas

## Goals
The goal of the assignment is to implement a solution for parsing the individual parts of an address (street, house number, postal code, etc.) using a Natural Language Processing model. We can derive three subtasks from the assignment:
- converting the training data into a format accepted by the NLP library
- training a Named Entity Recognition pipeline and creating an NLP model with it
- evaluating how accurately the model makes predictions

## NLP tool
As the final solution requires a Python script, it was decided to use [spaCy](https://spacy.io/), one of the most popular Natural Language Processing tools in Python. It also features full support for Named Entity Recognition, which is required for solving the assignment.

## Preparing training data
The training data is provided in an Excel file, where for each address there are columns with the extracted values for the different tokens:
![image-2.png](attachment:image-2.png)

However, NER model training requires the training data in a special format, which indicates the character positions of the different entities in the string:
```python
# Example training data format
TRAIN_DATA = [
    ('Arkelsedijk 46,4206 AC Gorinchem', {
        'entities': [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]
    }), 
    ('SIGMA-TAU Industrie Farmaceutiche Riunite S.p.A.,Via Pontina, km 30,400 00040, Pomezia', {
        'entities': [(0, 48, 'co'), (49, 60, 'street'), (62, 67, 'nr'), (68, 77, 'postal'), (79, 86, 'city')]
    })
]
```
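A quick way to sanity-check such annotations is to slice the address with each `(start, end)` pair and confirm that the expected substring comes out. The sketch below does this for the first example above; the helper name `show_entities` is only illustrative and is not part of the project code.

```python
# Minimal sketch: each entity is (start, end, label) with an exclusive end offset.
def show_entities(text, annotation):
    """Return (label, substring) pairs for every annotated entity."""
    return [(label, text[start:end]) for start, end, label in annotation['entities']]


example = ('Arkelsedijk 46,4206 AC Gorinchem', {
    'entities': [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]
})

for label, value in show_entities(*example):
    print('{:<7} -> {!r}'.format(label, value))
```

If a slice returns something other than the intended value, the offsets are wrong and the entry should not be used for training.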

First we need code for reading the raw training data into a `pandas` `DataFrame`:

In [1]:
import pandas as pd

def read_DataFrame_from_file(filename: str, numberOfRows: int = None):
    return pd.read_excel(filename, nrows = numberOfRows, keep_default_na=False)

It is now possible to read the data and inspect its format:

In [2]:
DATA_INPUT_FILENAME = '../files/training_data.xlsx'

raw_data: pd.DataFrame = read_DataFrame_from_file(DATA_INPUT_FILENAME, 10)
raw_data.head(3)

| Unnamed: 0 | person_id | person_name | person_address | cln1 | cln2 | cln3 | person_ctry_code | cnt | chr_len | chr_len_label | ... | street | nr | area | postal | city | region | country | unclear | status | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3540 | PURAC Biochem BV | Arkelsedijk 46,4206 AC Gorinchem | Arkelsedijk 46 4206 AC Gorinchem | 46 4206 | 46 | NL | 1 | 32 | 2 | ... | Arkelsedijk | 46 | | 4206 AC | Gorinchem | | | | | 1 |
| 1 | 28753 | Tinti, Maria Ornella | SIGMA-TAU Industrie Farmaceutiche Riunite S.p.... | SIGMA-TAU Industrie Farmaceutiche Riunite S.p.... | 30 400 00040 | 30 | IT | 1 | 86 | 1 | ... | Via Pontina | km 30 | | 400 00040 | Pomezia | | | | | 1 |
| 2 | 35108 | Isobe, Shin-ichi, c/o Int. Prop. Dpt., NTT DoC... | Sanno Park Tower, 11-1, Nagatacho 2-chome,Chiy... | Sanno Park Tower 11-1 Nagatacho 2-chome Chiyod... | 111 2 | 11- | JP | 1 | 59 | 1 | ... | Nagatacho 2-chome 11-1 | | Chiyoda-ku | | Tokyo | | | | | 1 |


The following code can be used to transform the `DataFrame` into a format supported by `spaCy`.

First we transform the data frame into a list of dictionaries, one per row, that map column names to values, using `DataFrame.to_dict('records')`. The function `map_to_training_entry` maps each dictionary to a `tuple` of the address and an object containing the entity list. Finally, `get_entity_list` finds the location of each entity in the address.

In [3]:
import re
import json

TOKEN_TYPES: set = {'co', 'building', 'street', 'nr', 'area', 'postal', 'city', 'region', 'country'}

def get_entity_list(entry: dict, address: str):
    entities: list = []
    present_tokens = filter(lambda item: item[0] in TOKEN_TYPES and item[1] and str(item[1]).strip(), entry.items())

    for item in present_tokens:
        token_value = str(item[1]).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            span = match.span()
            entities.append((span[0], span[1], item[0]))
        else:
            # Try and resolve multiple tokens separated by ';'
            split_items = map(lambda token: token.strip(), token_value.split(';'))
            for token in split_items:
                split_match = re.search(re.escape(token), address)
                if split_match:
                    span = split_match.span()
                    entities.append((span[0], span[1], item[0]))
                else:
                    print('WARNING: could not find token "{}" in address "{}"'.format(token, address))
    
    return entities

def map_to_training_entry(entry: dict):
    address = entry['person_address']
    return (address, {
        'entities': get_entity_list(entry, address)
    })

train_data = list(
    map(map_to_training_entry, raw_data.to_dict('records'))
)

print("---- TRAINING DATA EXAMPLE ----")
print(train_data[0])

---- TRAINING DATA EXAMPLE ----
('Arkelsedijk 46,4206 AC Gorinchem', {'entities': [(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')]})


For some records a warning is printed because an entity could not be found in the address string. We will discuss these problems later.

## Training the model
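First we create a blank English `spaCy` model instance and add a `NER` pipeline to it. The address token types also need to be added as labels to the pipeline. The calls used throughout this report (`nlp.create_pipe`, `nlp.begin_training`, `spacy.gold`) follow the spaCy 2.x API.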

In [4]:
import spacy 

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

When training the model we use the `minibatch` utility function to apply model updates in batches, which speeds up training. The `compounding` utility is used to increase the batch size as training progresses. Batching more than doubles the training speed.

In [6]:
from spacy.util import minibatch, compounding
import random

optimizer = nlp.begin_training()
for itn in range(2):
    random.shuffle(train_data)
    losses = {}

    batches = minibatch(train_data, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

Iteration: 0 | Losses: {'ner': 88.3712317943573}
Iteration: 1 | Losses: {'ner': 79.42872738838196}
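To illustrate how the `compounding(4, 32, 1.001)` schedule used above drives the batch size, here is a small pure-Python sketch of the idea. It is not spaCy's implementation: `compounding_sketch` and `minibatch_sketch` are stand-in names, written under the assumption that the schedule multiplies the size by a constant factor after every batch and caps it at the upper bound.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose lengths are taken from sizes."""
    iterator = iter(items)
    for size in sizes:
        batch = list(islice(iterator, int(size)))
        if not batch:
            return
        yield batch


# With the factor 1.001 the batch size grows very slowly.
print(list(islice(compounding_sketch(4, 32, 1.001), 3)))

# A larger factor (2, chosen only for illustration) makes the growth visible.
batches = list(minibatch_sketch(range(30), compounding_sketch(4, 32, 2)))
print([len(batch) for batch in batches])
```

Because the factor is so close to 1, the batch size stays near 4 for a data set of roughly a thousand records; a larger factor would reach the cap of 32 much sooner.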


The training works. However, we do get warnings that for some entries the entities could not be used for training:
![image-2.png](attachment:image-2.png)

We will return to this issue in a later chapter. Also, if we try to train a model with all of the data, we run into an error:

In [7]:
%%capture

raw_data: pd.DataFrame = read_DataFrame_from_file(DATA_INPUT_FILENAME, 999)
train_data = list(
    map(map_to_training_entry, raw_data.to_dict('records'))
)

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

optimizer = nlp.begin_training()
for itn in range(1):
    random.shuffle(train_data)
    losses = {}

    batches = minibatch(train_data, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

ValueError: [E103] Trying to set conflicting doc.ents: '(0, 19, 'street')' and '(0, 2, 'nr')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

### Conflicting entity errors
When the model is trained with all the records, an error indicating conflicting entities is encountered. The issue will be addressed in a later chapter. For now, we can simply filter out the entries that have conflicting entities:

In [8]:
def entities_overlap(entry):
    entities = entry[1]['entities']
    for first in entities:
        for second in entities:
            if (first == second): continue
            # Half-open spans [start, end) overlap when each one starts before the other one ends
            if first[0] < second[1] and second[0] < first[1]:
                print('Entities {} and {} overlap in "{}"'.format(first, second, entry[0]))
                return True
    return False

train_data = list(filter(lambda entry: not entities_overlap(entry), train_data))

Entities (0, 19, 'street') and (0, 2, 'nr') overlap in "14 Duntrune Terrace West Ferry,Dundee DD5 1LF Scotland"
Entities (110, 115, 'city') and (110, 115, 'region') overlap in "c/o Intellectual Property Department, NTT DoCoMo, Inc., Sanno Park Tower, 11-1, Nagatacho 2 chome, Chiyoda-ku,Tokyo Tokyo 100-6150"
Entities (48, 58, 'city') and (48, 54, 'region') overlap in "c/o Shinko El. Ind. Co.,Ltd. 80, Oshimada-machi,Nagano-shi Nagano 381-2287"
Entities (35, 44, 'area') and (35, 40, 'city') overlap in "1-201, Fukuzaki 3-chome, Minato-ku Osaka-shi,Osaka 552-0013"
Entities (72, 85, 'postal') and (72, 85, 'region') overlap in "Meritor Heavy Vehicle Braking Systems (UK) Limited,Grange Road,Cwmbran, Monmouthshire NP44 3XU"
Entities (23, 31, 'postal') and (23, 25, 'region') overlap in "2 Sconsett Bluff,Avon, CT 06001"
Entities (18, 31, 'area') and (18, 25, 'city') overlap in "48 Karewa parade, Papamoa Beach,Papamoa, 3118"
Entities (43, 56, 'city') and (43, 52, 'city') overlap in "c/o JMS CO. LT
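To see why spaCy rejects such records, it helps to slice the address with the two reported spans. The sketch below uses the first conflict from the output above; `spans_overlap` is an illustrative helper that treats spans as half-open `[start, end)` intervals.

```python
def spans_overlap(first, second):
    """True when two (start, end, label) spans share at least one character."""
    return first[0] < second[1] and second[0] < first[1]


address = '14 Duntrune Terrace West Ferry,Dundee DD5 1LF Scotland'
street = (0, 19, 'street')
nr = (0, 2, 'nr')

# The street value was extracted together with the house number,
# so both entities claim the leading characters of the address.
print(repr(address[street[0]:street[1]]))
print(repr(address[nr[0]:nr[1]]))
print(spans_overlap(street, nr))

# Spans that merely follow each other do not overlap, because the end offset is exclusive.
print(spans_overlap((0, 2, 'nr'), (3, 19, 'street')))
```

Since a token can belong to only one entity, one of the two annotations has to be corrected or the record has to be dropped.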

Now we can divide the present data into training and validation sets:

In [9]:
from sklearn.model_selection import train_test_split
import numpy as np

train_sample, test_sample = train_test_split(
    train_data, test_size = 0.2, random_state = 420
)
print('train entries: {} | test entries: {}'.format(len(train_sample), len(test_sample)))

train entries: 778 | test entries: 195


We can now train a model using the training data. For a baseline model we will perform 20 iterations of training:

In [11]:
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(train_sample)
    losses = {}

    batches = minibatch(train_sample, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

Iteration: 0 | Losses: {'ner': 3117.6665315444843}
Iteration: 1 | Losses: {'ner': 2875.7419534698547}
Iteration: 2 | Losses: {'ner': 2676.9454808718665}
Iteration: 3 | Losses: {'ner': 2600.0219918610032}
Iteration: 4 | Losses: {'ner': 2466.7652404810096}
Iteration: 5 | Losses: {'ner': 2377.9448701580277}
Iteration: 6 | Losses: {'ner': 2309.518235458995}
Iteration: 7 | Losses: {'ner': 2240.6525369716946}
Iteration: 8 | Losses: {'ner': 2179.359040672815}
Iteration: 9 | Losses: {'ner': 2096.6069855579262}
Iteration: 10 | Losses: {'ner': 2077.0795414905624}
Iteration: 11 | Losses: {'ner': 2057.6608067577868}
Iteration: 12 | Losses: {'ner': 1996.812438069488}
Iteration: 13 | Losses: {'ner': 1975.1112576306525}
Iteration: 14 | Losses: {'ner': 1844.4737492165639}
Iteration: 15 | Losses: {'ner': 1761.438860742149}
Iteration: 16 | Losses: {'ner': 1777.6964060375333}
Iteration: 17 | Losses: {'ner': 1739.1100974915487}
Iteration: 18 | Losses: {'ner': 1705.7191997554996}
Iteration: 19 | Losses: {'

## Evaluating model performance
We can evaluate how well the model performs by computing the precision, recall and F1 score on the train and test data.

In [12]:
## Maps the results object into a DataFrame
def results_per_entity_to_df(res: dict):
    columns = ['Token', 'Precision', 'Recall', 'F1 score']
    # One row per entity type, followed by a 'Total' row with the overall scores
    rows = [
        [token,
         res['ents_per_type'][token]['p'],
         res['ents_per_type'][token]['r'],
         res['ents_per_type'][token]['f']]
        for token in TOKEN_TYPES
    ]
    rows.append(['Total', res['ents_p'], res['ents_r'], res['ents_f']])
    return pd.DataFrame(rows, columns=columns)

We can use `spaCy`'s own `Scorer` utility to evaluate the performance for each entity type.

In [13]:
%%capture
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def map_to_evaluation_model(entry: tuple):
    return (entry[0], entry[1]['entities'])


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

train_results = evaluate(nlp, map(map_to_evaluation_model, train_sample))
test_results = evaluate(nlp, map(map_to_evaluation_model, test_sample))
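As a rough illustration of what these scores mean, the sketch below computes exact-match precision, recall and F1 for a single address by comparing two sets of `(start, end, label)` tuples. It is a simplified stand-in for the scorer, not its implementation; the function name `prf` and the predicted entities are made up for the example.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 (in percent) for two sets of (start, end, label)."""
    true_positives = len(gold & predicted)
    precision = 100.0 * true_positives / len(predicted) if predicted else 0.0
    recall = 100.0 * true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {(0, 11, 'street'), (12, 14, 'nr'), (15, 22, 'postal'), (23, 32, 'city')}
# Hypothetical prediction: the postal code is cut short and the city is missed.
predicted = {(0, 11, 'street'), (12, 14, 'nr'), (15, 19, 'postal')}

p, r, f = prf(gold, predicted)
print('P={:.1f} R={:.1f} F1={:.1f}'.format(p, r, f))
```

An entity only counts as correct when both its boundaries and its label match, so a postal code that is found but cut short lowers precision and recall at the same time.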

In [14]:
from IPython.display import display, HTML

## TODO: Add bar charts
print('---- Results on train data ----')
display(HTML(results_per_entity_to_df(train_results).to_html(index=False)))
print('---- Results on test data ----')
display(HTML(results_per_entity_to_df(test_results).to_html(index=False)))

---- Results on train data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| postal | 95.505618 | 97.701149 | 96.590909 |
| region | 90.277778 | 88.636364 | 89.449541 |
| country | 83.333333 | 71.428571 | 76.923077 |
| building | 80.0 | 10.810811 | 19.047619 |
| nr | 83.625731 | 91.666667 | 87.461774 |
| street | 78.636364 | 88.717949 | 83.373494 |
| city | 88.0 | 96.8 | 92.190476 |
| area | 74.025974 | 76.510067 | 75.247525 |
| co | 62.962963 | 69.387755 | 66.019417 |
| Total | 85.057471 | 88.740839 | 86.860124 |


---- Results on test data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| postal | 86.046512 | 86.046512 | 86.046512 |
| region | 90.566038 | 67.605634 | 77.419355 |
| country | 50.0 | 50.0 | 50.0 |
| building | 0.0 | 0.0 | 0.0 |
| nr | 76.923077 | 83.333333 | 80.0 |
| street | 53.846154 | 68.292683 | 60.215054 |
| city | 75.172414 | 84.496124 | 79.562044 |
| area | 33.333333 | 42.857143 | 37.5 |
| co | 11.111111 | 12.5 | 11.764706 |
| Total | 70.184697 | 73.278237 | 71.698113 |


We can see that the model works pretty well for the house number, postal code, city, street and region on the training data. On the test data, however, the model does not perform as well, and performance on the building, area, company and country tokens is exceptionally poor.


We can try to improve the performance of the model by fixing issues with training data presentation and parsing.

## Fixing data issues
### Incorrect data mappings in the training data
There were more than 120 instances in which a token could not be matched with the address string. Most of these are either spelling errors or extracted data positioned incorrectly:
![image.png](attachment:image.png)

However, street names are almost never matched for Japanese addresses, because they consist of two parts which are often not separated by a semicolon:
![image-2.png](attachment:image-2.png)

All of these issues were resolved in the training data file by hand.

### Fixing overlapping matches
There were multiple warnings about overlapping entity mappings in the mapped data:
![image-3.png](attachment:image-3.png)

Some of the occurrences are caused by incorrect data. For instance, in `2 Sconsett Bluff, Avon, CT 06001` the postal code is defined as `CT 06001` and the region is defined as `CT`, creating overlapping entities. These issues were resolved in the training data file by hand.


On the other hand, some of the issues were caused by a bug in the code. The current entity mapping solution just looks for the first match and uses it, which sometimes maps incorrectly. For instance, in the address `2-15, Meiwadori 3-chome, Hyogo-ku, Kobe-shi, Hyogo 652-0882` the city name `Hyogo` will be mapped as part of the area `Hyogo-ku` because it appears first.

Therefore, the entity mapping algorithm needs to be altered. It will now follow the pseudo-code:
```text
FUNCTION get_entity_list (entries, address)
BEGIN
    entities = []
    retry = []
    FOR EACH entry IN entries
    DO
        IF entry IS MATCHED IN address EXACTLY ONCE
            ADD entry TO entities
            REPLACE entry WITH "$" IN address
        ELSE
            ADD entry TO retry
    END

    FOR EACH entry IN retry
    DO
        IF entry IS MATCHED IN address
            ADD entry TO entities
            REPLACE entry WITH "$" IN address
    END

    RETURN entities
END
```

In [15]:
def get_entity_list(entry: dict, adr: str):
    address = str(adr)
    entities: list = []
    present_tokens = filter(lambda item: item[0] in TOKEN_TYPES and item[1] and str(item[1]).strip(), entry.items())

    ## tokens to retry matching
    retry_tokens: set = set()

    for item in present_tokens:
        token_value = str(item[1]).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            # If multiple occurrences can be matched, save the token to be matched later
            if (len(re.findall(re.escape(token_value), address)) > 1):
                retry_tokens.add((token_value, item[0]))
                continue
            span = match.span()
            entities.append((span[0], span[1], item[0]))
            # Replace matched entity with symbols, so that parts of it cannot be matched again
            address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
        else:
            # Try and resolve multiple tokens separated by ';'
            split_items = map(lambda token: token.strip(), token_value.split(';'))
            for token in split_items:
                split_match = re.search(re.escape(token), address)
                if split_match:
                    # If multiple occurrences can be matched, save the token to be matched later
                    if (len(re.findall(re.escape(token), address)) > 1):
                        retry_tokens.add((token, item[0]))
                        continue
                    span = split_match.span()
                    entities.append((span[0], span[1], item[0]))
                    # Replace matched entity with symbols, so that parts of it cannot be matched again
                    address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
                else:
                    print('WARNING: could not find token "{}" in address "{}"'.format(token, adr))
    
    # Try and match previously marked tokens, now that single-match entities were eliminated
    for token, tkn_type in retry_tokens:
        token_value = str(token).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            span = match.span()
            entities.append((span[0], span[1], tkn_type))
            address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
        else:
            print('WARNING: could not find token "{}" in address "{}"'.format(token, adr))

    return entities
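A minimal, self-contained illustration of the masking idea, using only the `re` module and the Japanese address mentioned earlier. The helper `mask` is an illustrative name; it mirrors the `'$'` replacement done in `get_entity_list` but is not the function itself.

```python
import re

address = '2-15, Meiwadori 3-chome, Hyogo-ku, Kobe-shi, Hyogo 652-0882'

# Naive lookup: the first hit for 'Hyogo' is the prefix of 'Hyogo-ku'.
naive = re.search(re.escape('Hyogo'), address)
print(repr(address[naive.start():naive.end() + 3]))


def mask(text, span):
    """Overwrite text[start:end] with '$' while keeping the length unchanged."""
    start, end = span
    return text[:start] + '$' * (end - start) + text[end:]


# 'Hyogo-ku' occurs exactly once, so it is matched and masked first ...
area = re.search(re.escape('Hyogo-ku'), address)
masked = mask(address, area.span())

# ... and the retry for 'Hyogo' can then only hit the later occurrence.
retry = re.search(re.escape('Hyogo'), masked)
print(repr(address[retry.start():]))
```

Masking with a same-length run of `$` keeps all character offsets valid, so spans found in the masked string can be used directly on the original address.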

### Tokenization issues
While training the model there were warnings that some entities could not be aligned with the addresses and would be ignored during training. This happens because the training process only supports entities that begin and end at token boundaries.

Therefore, if the postal code is defined as `04222` in the string `LT04222`, the instance will not be used for training because of a tokenization issue. These issues were fixed in the training data file by hand.
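The boundary requirement can be sketched without spaCy by approximating tokens very crudely. The code below is only a stand-in for the real tokenizer: it assumes that tokens are whitespace-separated chunks and that a trailing comma or semicolon is split off as its own token. The function names and the sample string `LT04222 Vilnius` are made up for the illustration.

```python
import re


def rough_token_spans(text):
    """Whitespace-separated chunks, with a trailing ',' or ';' split off as its own token."""
    spans = []
    for match in re.finditer(r'\S+', text):
        start, end = match.span()
        if end - start > 1 and text[end - 1] in ',;':
            spans.extend([(start, end - 1), (end - 1, end)])
        else:
            spans.append((start, end))
    return spans


def is_aligned(text, start, end):
    """True when start and end both coincide with token boundaries."""
    spans = rough_token_spans(text)
    return start in {s for s, _ in spans} and end in {e for _, e in spans}


print(is_aligned('LT04222 Vilnius', 2, 7))    # '04222' starts inside the token 'LT04222'
print(is_aligned('LT04222 Vilnius', 0, 7))    # the whole token 'LT04222'
print(is_aligned('Arkelsedijk 46,4206 AC Gorinchem', 12, 14))   # '46' inside '46,4206'
print(is_aligned('Arkelsedijk 46, 4206 AC Gorinchem', 12, 14))  # '46' once a space follows the comma
```

Under this approximation, an entity that starts or ends in the middle of a chunk such as `LT04222` or `46,4206` cannot be aligned, while the same entity becomes usable once the surrounding text is separated by a space.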

However, a significant number of addresses have entities separated by commas without spaces. This causes a problem, because a comma without any leading or trailing space is treated as part of a token rather than as a token separator. Because this applies to a large number of entities, we can use data pre-processing to add spaces after commas (and semicolons) in these special cases:

In [16]:
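def preprocess_data(data: pd.DataFrame):
    # Insert a space after a comma or semicolon that sits directly between two non-space characters.
    # Lookarounds leave the neighbouring characters unconsumed, so runs like 'a,b,c' are fully handled.
    for col in data.columns:
        data[col] = data[col].map(lambda value: re.sub(r'(?<=\S)([,;])(?=\S)', r'\1 ', str(value)))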
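The sketch below shows the effect of this kind of substitution on a sample address and compares two ways of writing the pattern: one that captures the neighbouring characters and one that uses lookarounds. The function names are illustrative only.

```python
import re

CAPTURING = re.compile(r'([^\s])([,;])([^\s])')
LOOKAROUND = re.compile(r'(?<=\S)([,;])(?=\S)')


def add_spaces_capturing(text):
    """Re-emit both neighbours around the separator and add a space after it."""
    return CAPTURING.sub(r'\1\2 \3', text)


def add_spaces_lookaround(text):
    """Only the separator itself is matched; the neighbours are merely inspected."""
    return LOOKAROUND.sub(r'\1 ', text)


for sample in ['Arkelsedijk 46,4206 AC Gorinchem', 'A,B,C']:
    print(repr(add_spaces_capturing(sample)), repr(add_spaces_lookaround(sample)))
```

Both variants agree on typical addresses. They differ only when single-character fields follow each other, as in `A,B,C`: the capturing pattern consumes the `B` and therefore skips the second comma, while the lookaround pattern handles both.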

### Performance improvements 
We now train the baseline model again, this time with the fixes described above, which should make the model more widely applicable:

In [17]:
raw_data: pd.DataFrame = read_DataFrame_from_file('../files/training_data_fixed.xlsx', 999)
preprocess_data(raw_data)

train_data = list(
    map(map_to_training_entry, raw_data.to_dict('records'))
)
train_data = list(filter(lambda entry: not entities_overlap(entry), train_data))

train_sample, test_sample = train_test_split(
    train_data, test_size = 0.2, random_state = 420
)
print('train entries: {} | test entries: {}'.format(len(train_sample), len(test_sample)))

train entries: 799 | test entries: 200


In [19]:
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for token in TOKEN_TYPES:
    ner.add_label(token)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(train_sample)
    losses = {}

    batches = minibatch(train_sample, size=compounding(4, 32, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  
            annotations,  
            drop=0.5,  
            sgd=optimizer,
            losses=losses)
    print('Iteration: {} | Losses: {}'.format(itn, losses))

Iteration: 0 | Losses: {'ner': 3983.50618808993}
Iteration: 1 | Losses: {'ner': 3659.7332529618507}
Iteration: 2 | Losses: {'ner': 3310.0797448898084}
Iteration: 3 | Losses: {'ner': 3035.548955115989}
Iteration: 4 | Losses: {'ner': 2894.2235998356905}
Iteration: 5 | Losses: {'ner': 2827.419132373608}
Iteration: 6 | Losses: {'ner': 2701.589535403672}
Iteration: 7 | Losses: {'ner': 2619.0961535145325}
Iteration: 8 | Losses: {'ner': 2559.954532203828}
Iteration: 9 | Losses: {'ner': 2430.748212510282}
Iteration: 10 | Losses: {'ner': 2359.27769671099}
Iteration: 11 | Losses: {'ner': 2359.7902659028878}
Iteration: 12 | Losses: {'ner': 2236.3310442938946}
Iteration: 13 | Losses: {'ner': 2188.4337061023584}
Iteration: 14 | Losses: {'ner': 2093.617415083377}
Iteration: 15 | Losses: {'ner': 2046.0620843171973}
Iteration: 16 | Losses: {'ner': 1987.325347316779}
Iteration: 17 | Losses: {'ner': 1962.98968307596}
Iteration: 18 | Losses: {'ner': 1826.5443054193606}
Iteration: 19 | Losses: {'ner': 189

In [20]:
%%capture
train_results = evaluate(nlp, map(map_to_evaluation_model, train_sample))
test_results = evaluate(nlp, map(map_to_evaluation_model, test_sample))

In [21]:
print('---- Results on train data ----')
display(HTML(results_per_entity_to_df(train_results).to_html(index=False)))
print('---- Results on test data ----')
display(HTML(results_per_entity_to_df(test_results).to_html(index=False)))

---- Results on train data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| postal | 95.930233 | 97.345133 | 96.632504 |
| region | 90.301003 | 90.0 | 90.15025 |
| country | 84.0 | 63.636364 | 72.413793 |
| building | 67.307692 | 47.297297 | 55.555556 |
| nr | 93.189964 | 89.655172 | 91.388401 |
| street | 84.653465 | 87.692308 | 86.146096 |
| city | 85.514612 | 96.974063 | 90.884537 |
| area | 83.571429 | 52.702703 | 64.640884 |
| co | 61.46789 | 65.048544 | 63.207547 |
| Total | 86.715867 | 86.503067 | 86.609337 |


---- Results on test data ----


| Token | Precision | Recall | F1 score |
|---|---|---|---|
| postal | 92.391304 | 92.391304 | 92.391304 |
| region | 75.308642 | 79.220779 | 77.21519 |
| country | 0.0 | 0.0 | 0.0 |
| building | 16.666667 | 10.0 | 12.5 |
| nr | 87.837838 | 80.246914 | 83.870968 |
| street | 60.0 | 68.041237 | 63.768116 |
| city | 69.863014 | 86.440678 | 77.272727 |
| area | 50.0 | 22.033898 | 30.588235 |
| co | 37.5 | 37.5 | 37.5 |
| Total | 71.563981 | 73.182553 | 72.364217 |


We can see that fixing some of the data errors barely changed the total precision, recall or F1 score of the model on test data (total F1 score 71.7 -> 72.4). On the other hand, the F1 score increased for street (60 -> 64), nr (80 -> 84), postal (86 -> 92) and co (12 -> 38), but decreased for country (50 -> 0) and area (38 -> 31). We can conclude that for rare entities, like company and country, the different train/test split could have had a strong impact: with the fixed data 999 records are usable instead of 973, so the shuffling is no longer the same.

## Exploring optimal training parameters
The model will perform differently depending on how many iterations it was trained for. Another important parameter is the dropout rate, which affects how likely the model is to memorise the training data. A higher dropout rate makes the model more generalised.

Several tests were made to see how total performance of the model changes along with iterations and the dropout rate increase. The findings are displayed in a table below:


We can notice a few patterns:
- As the drop parameter increases, performance on the training data becomes worse, because the model is less attached to the training data. Conversely, performance on the test data is usually better with a higher drop value (except when the value is unreasonably high).
- As the iteration count increases, we can notice a steady increase in performance on the training data. However, it is difficult to draw the same conclusion for the test data.
- Having a bigger set of test data would allow us to choose the optimal parameters better.

By training the model for too many iterations, we risk overfitting: the model performs very well on the training data but worse on unseen data. One example is the rather poor performance of a model trained using 50 iterations and 0.5 drop. Therefore, it was decided to play it safe and choose 25 iterations and a drop parameter of 0.5 for training the model.
![image.png](attachment:image.png)
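A parameter sweep of this kind can be organised as a simple grid loop. The sketch below is only a skeleton: `grid_search` and `dummy_train_and_score` are hypothetical names, and the dummy function returns invented numbers purely so that the loop can run. A real version would train a spaCy model with the given settings and return the measured F1 scores on the train and test samples.

```python
from itertools import product


def grid_search(train_and_score, iteration_counts, drop_rates):
    """Evaluate every (iterations, drop) pair and return the results sorted by test F1."""
    results = []
    for iterations, drop in product(iteration_counts, drop_rates):
        train_f1, test_f1 = train_and_score(iterations, drop)
        results.append({
            'iterations': iterations,
            'drop': drop,
            'train_f1': train_f1,
            'test_f1': test_f1,
        })
    return sorted(results, key=lambda row: row['test_f1'], reverse=True)


def dummy_train_and_score(iterations, drop):
    """Placeholder with made-up numbers, NOT real measurements."""
    train_f1 = min(100.0, 60.0 + iterations) - 20.0 * drop
    test_f1 = 70.0 - abs(iterations - 25) * 0.2 - abs(drop - 0.5) * 10.0
    return train_f1, test_f1


ranking = grid_search(dummy_train_and_score, [10, 25, 50], [0.2, 0.5, 0.8])
for row in ranking[:3]:
    print(row)
```

Passing the training routine in as a function keeps the loop independent of spaCy and makes it easy to repeat the sweep with several random train/test splits.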

## Final solution
### train.py
```python
import pandas as pd
import re
import spacy
import random
from utils import read_DataFrame_from_excel
from spacy.util import compounding, minibatch


TRAINING_DATA_FILENAME = '../files/training_data_fixed.xlsx'
TRAINING_ENTRIES_COUNT = 999
TRAINED_MODEL_FILENAME = 'trained_model'

TOKEN_TYPES: set = {'co', 'building', 'street', 'nr', 'area', 'postal', 'city', 'region', 'country'}

TRAIN_ITERATION_COUNT = 20
TRAIN_DROP_PROPERTY = 0.5


def preprocess_data(data: pd.DataFrame):
    """
    Performs data preprocessing by adding a space after each comma and semicolon if it is missing.
    The DataFrame is modified in place and all values are converted to strings.
    
    Args:
        data (pd.DataFrame): dataset to be processed

    Returns:
        None
    """
    # Lookarounds leave the neighbouring characters unconsumed, so runs like 'a,b,c' are fully handled
    for col in data.columns:
        data[col] = data[col].map(lambda value: re.sub(r'(?<=\S)([,;])(?=\S)', r'\1 ', str(value)))


def get_entity_list(entry: dict, adr: str):
    """
    Extracts an array of tuples, indicating positions of tokens in a provided address
    
    Args:
        entry (dict): dictionary, where keys are token types.
            Example:
            dict = {
                'city': 'Vilnius',
                'street': 'Ozo g.',
                'nr': 25
            }
        adr (str): an address string.
            Example: 
            adr = 'Ozo g. 25, Vilnius'

    Returns:
        Array of tuples, where tuples follow structure of (token_position_start, token_position_end, token)
    """
    address = str(adr)
    entities: list = []
    present_tokens = filter(lambda item: item[0] in TOKEN_TYPES and item[1] and str(item[1]).strip(), entry.items())

    ## tokens to retry matching
    retry_tokens: set = set()

    for item in present_tokens:
        token_value = str(item[1]).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            # If multiple occurrences can be matched, save the token to be matched later
            if (len(re.findall(re.escape(token_value), address)) > 1):
                retry_tokens.add((token_value, item[0]))
                continue
            span = match.span()
            entities.append((span[0], span[1], item[0]))
            # Replace matched entity with symbols, so that parts of it cannot be matched again
            address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
        else:
            # Try and resolve multiple tokens separated by ';'
            split_items = map(lambda token: token.strip(), token_value.split(';'))
            for token in split_items:
                split_match = re.search(re.escape(token), address)
                if split_match:
                    # If multiple occurrences can be matched, save the token to be matched later
                    if (len(re.findall(re.escape(token), address)) > 1):
                        retry_tokens.add((token, item[0]))
                        continue
                    span = split_match.span()
                    entities.append((span[0], span[1], item[0]))
                    # Replace matched entity with symbols, so that parts of it cannot be matched again
                    address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
                else:
                    print('WARNING: could not find token "{}" in address "{}"'.format(token, adr))
    
    # Try and match previously marked tokens, now that single-match entities were eliminated
    for token, tkn_type in retry_tokens:
        token_value = str(token).strip()
        match = re.search(re.escape(token_value), address)
        if match:
            span = match.span()
            entities.append((span[0], span[1], tkn_type))
            address = address[:span[0]] + '$' * (span[1] - span[0]) + address[span[1]:]
        else:
            print('WARNING: could not find token "{}" in address "{}"'.format(token, adr))

    return entities


def map_to_training_entry(entry: dict):
    """
    Maps an object of address tokens into a tuple of address string and an object containing entity list.
    
    Args:
        entry (dict): dictionary, where keys include token types.
            Example:
            dict = {
                'person_address': 'Ozo g. 25, Vilnius',
                'city': 'Vilnius',
                'street': 'Ozo g.',
                'nr': 25
            }

    Returns:
        A tuple, where first element is the address, and the second one is an object containing the entity list
    """
    address = entry['person_address']
    return (address, {
        'entities': get_entity_list(entry, address)
    })


def entities_overlap(entry):
    """
    Checks whether an entry contains overlapping entities
    
    Args:
        entry (tuple): a training entry, as returned by map_to_training_entry.
            Example:
            entry = ('Ozo g. 25, Vilnius', {
                'entities': [(0, 6, 'street'), (7, 9, 'nr'), (11, 18, 'city')]
            })

    Returns:
        True if any two entities of the entry overlap, False otherwise
    """
    entities = entry[1]['entities']
    for first in entities:
        for second in entities:
            if (first == second): continue
            # Half-open spans [start, end) overlap when each one starts before the other one ends
            if first[0] < second[1] and second[0] < first[1]:
                print('Entities {} and {} overlap in "{}"'.format(first, second, entry[0]))
                return True
    return False


if __name__ == '__main__':

    raw_data: pd.DataFrame = read_DataFrame_from_excel(TRAINING_DATA_FILENAME, TRAINING_ENTRIES_COUNT)
    preprocess_data(raw_data)

    train_data = map(map_to_training_entry, raw_data.to_dict('records'))
    train_data = list(filter(lambda entry: not entities_overlap(entry), train_data))

    nlp = spacy.blank('en')
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)

    for token in TOKEN_TYPES:
        ner.add_label(token)
    
    print('--- TRAINING THE MODEL IN {} ITERATIONS | DROP = {} ---'.format(TRAIN_ITERATION_COUNT, TRAIN_DROP_PROPERTY))
    optimizer = nlp.begin_training()

    for itn in range(TRAIN_ITERATION_COUNT):
        random.shuffle(train_data)
        losses = {}

        batches = minibatch(train_data, size=compounding(4, 32, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  
                annotations,  
                drop=TRAIN_DROP_PROPERTY,  
                sgd=optimizer,
                losses=losses)
        print('Iteration: {} | Losses: {}'.format(itn, losses))

    nlp.to_disk(TRAINED_MODEL_FILENAME)
```

### deploy.py
```python
import pandas as pd
import spacy
from train import TRAINED_MODEL_FILENAME, preprocess_data
from utils import read_dataFrame_from_csv, read_DataFrame_from_excel, write_DataFrame_to_excel


PARSED_DATA_FILENAME = '../files/parsed.xlsx'


def enrich_row_with_address_details(row, nlp):
    """
    Gets address properties from a DataFrame row using a NLP model
    
    Args:
        row: An object having a 'person_address' property
            Example:
            row = {
                'person_address': 'Ozo g. 25, Vilnius',
                'city': 'Vilnius',
                'street': 'Ozo g.',
                'nr': 25
            }
        nlp: A spacy NLP model

    Returns:
        an array indicating address's co, building, street, nr, area, postal, city, region, country in this order
    """

    obj = {
        'co': '',
        'building': '',
        'street': '',
        'nr': '',
        'area': '',
        'postal': '',
        'city': '',
        'region': '',
        'country': ''
    }

    doc = nlp(row['person_address'])
    for ent in doc.ents:
        if (len(obj[ent.label_])):
            obj[ent.label_] = '{}; {}'.format(obj[ent.label_], ent.text)
        else:
            obj[ent.label_] = ent.text

    return [
        obj['co'],
        obj['building'],
        obj['street'],
        obj['nr'],
        obj['area'],
        obj['postal'],
        obj['city'],
        obj['region'],
        obj['country']
    ]


def parse_addresses(frame: pd.DataFrame):
    """
    Parses addresses in a given DataFrame by adding co, building, street, nr, area, postal, city, region, country information
    
    Args:
        frame (DataFrame): A data frame having a 'person_address' property. This argument is not mutated by the code

    Returns:
        A DataFrame having extracted properties
    """

    data: pd.DataFrame = frame.copy()
    original_addresses = data['person_address']
    preprocess_data(data)

    nlp: spacy.language.Language = spacy.load(TRAINED_MODEL_FILENAME)

    data[['co', 'building', 'street', 'nr', 'area', 'postal', 'city', 'region', 'country']] = data.apply(
        lambda row: enrich_row_with_address_details(row, nlp),
        axis=1,
        result_type='expand'
    )
    data['person_address'] = original_addresses
    return data


if __name__ == '__main__':

    filename: str = input('Enter the filename of the address data that you wish to parse (.xlsx or .csv, e.g. input.xlsx):\n> ')
    
    data: pd.DataFrame = read_DataFrame_from_excel(filename) if filename.endswith('.xlsx') else read_dataFrame_from_csv(filename)
    data = parse_addresses(data)

    write_DataFrame_to_excel(data, PARSED_DATA_FILENAME)

```
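The way `enrich_row_with_address_details` merges repeated labels can be tried out without a trained model. The sketch below replaces `doc.ents` with plain `(label, text)` pairs; the function name `collect_entities` and the predicted entities are made up for the illustration.

```python
LABELS = ['co', 'building', 'street', 'nr', 'area', 'postal', 'city', 'region', 'country']


def collect_entities(entities):
    """Group (label, text) pairs by label, joining repeated labels with '; '."""
    collected = {label: '' for label in LABELS}
    for label, text in entities:
        if collected[label]:
            collected[label] = '{}; {}'.format(collected[label], text)
        else:
            collected[label] = text
    # Return the values in the fixed column order expected by the output DataFrame
    return [collected[label] for label in LABELS]


# Hypothetical model output for an address with a two-part street.
predicted = [('street', 'Nagatacho 2-chome'), ('street', '11-1'), ('area', 'Chiyoda-ku'), ('city', 'Tokyo')]
print(collect_entities(predicted))
```

Joining repeated labels with `'; '` matches the convention of the training data, where multi-part values are separated by a semicolon.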