# Unit tag and Span cleaning

This script combines two steps in the OCOD processing pipeline:

* Unit tagging
* Removing overlapping spans

These two processes are separated by the weak labelling in Humanloop, but as they are relatively simple they are included in a single script. The full pipeline is listed below, with the steps covered here in bold.

* **Raw CSV loaded and lightly processed**. Output: CSV of property address and unit tag columns
* Data labelled in programmatic. Output: JSON file of entities
* **Programmatic output JSON cleaned, ordered, and overlaps removed**. Output: JSON file
* Clean JSON converted to dataframe and multi-addresses expanded. Output: CSV
* Count and locate addresses
* Create address matcher and match businesses
* Classify address types

## Unit tagging

This part of the pipeline adds a binary value indicating whether the line contains flats, units, stores, etc., which are likely to have a unit-level ID. This is important because such addresses are likely to have both a unit ID and a street number, and so need to be treated with care.
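
As a minimal sketch of the tagging idea, the snippet below uses only the standard `re` module and a deliberately shortened road-type list. The helper name `has_unit_tag` and the sample addresses are illustrative assumptions, not part of the pipeline. It shows how a negative lookahead stops a street such as "earls court road" from being tagged as a building with units.

```python
import re

# Words that usually signal a unit-level ID.
unit_words = r"(?:flat|apartment|penthouse|unit)"
# Shortened road-type list, for illustration only.
road_types = r"(?:road|street|lane|avenue)"
# Building words count as unit indicators unless a road type follows,
# e.g. "earls court road" is a street, not a block of flats.
building_words = r"(?:mansions|villa|court)\b(?!\s" + road_types + r"\b)"

unit_pattern = re.compile(unit_words + "|" + building_words, flags=re.IGNORECASE)


def has_unit_tag(address: str) -> bool:
    """Return True when the address contains a likely unit indicator."""
    return unit_pattern.search(address) is not None


for address in [
    "flat 3, 10 high street, london",
    "5 stanley court, leeds",
    "12 earls court road, london",
]:
    print(address, "->", has_unit_tag(address))
```

Only the first two addresses are tagged. The third contains "court", but the lookahead sees the following road type and rejects the match.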

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import re
import random
from prep_helper_functions import *

In [2]:


ocod_data =  pd.read_csv('./data/' +
                    'OCOD_FULL_2022_02.csv',
                   encoding_errors= 'ignore').rename(columns = lambda x: x.lower().replace(" ", "_"))
ocod_data['postcode'] = ocod_data['postcode'].str.upper()
#empty addresses cannot be used. however there are only three so not a problem
ocod_data = ocod_data.dropna(subset = 'property_address')
ocod_data.reset_index(inplace = True, drop = True)
ocod_data['property_address'] = ocod_data['property_address'].str.lower()

#ensure there is a space after commas
#This is because some numbers are written as 1,2,3,4,5 which causes issues during tokenisation
ocod_data.property_address = ocod_data.property_address.str.replace(',', ', ', regex = False)
#remove multiple spaces
#regex = True is set explicitly because the default changed to False in pandas 2.0
ocod_data.property_address = ocod_data.property_address.str.replace(r'\s{2,}', ' ', regex = True)


#different words associated with unit ID's
flatregex = r"(flat|apartment|penthouse|unit)" #unit|store|storage these a

#This is not an exhaustive list of road names but it covers about 80% of all road types in the VOA business register.
#The cardinal directions are included as an option as they can appear after the road type. However they serve no real purpose in this particular regex and are
#included for completeness
road_regex  = r"((road|street|lane|way|gate|avenue|close|drive|hill|place|terrace|crescent|gardens|square|walk|grove|mews|row|view|boulevard|pleasant|vale|yard|chase|rise|green|passage|friars|viaduct|promenade|end|ridge|embankment|villas|circus))\b( east| west| north| south)?"
#These names may be followed by a road type e.g. Earls court road. A negative lookahead is used to prevent these roads being tagged as units.
flatregex2 = r"(mansions|villa|court)\b(?!(\s"+road_regex+"))"

#flat_tag is used for legacy reasons but refers to sub-units in general
ocod_data['flat_tag'] = ocod_data['property_address'].str.contains(flatregex + '|'+flatregex2, case = False)

ocod_data['commercial_park_tag'] = ocod_data['property_address'].str.contains(r"(retail|industrial|commercial|business|distribution|car)", case = False)

#typo in the data leads to a large number of fake flats
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("stanley court ", "stanley court, ")
#This typo leads to some rather silly addresses
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("100-1124", "100-112")
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("40a, 40, 40¨, 42, 44", "40a, 40, 40, 42, 44")


#only these columns are needed for the humanloop labelling process
ocod_data[['property_address', 'flat_tag', 'commercial_park_tag', 'title_number']].rename(columns = {'property_address':'text'}).to_csv('./data/property_address_only.csv')



In [3]:
#Create the index for the ground truth
random.seed(2017)
test_set = random.sample([*range(0, ocod_data.shape[0])], 1000) 

test_set = ocod_data.loc[test_set, 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

test_set.to_csv('./data/test_set_indices_space_cleaned_v2.csv')

In [4]:
#Belatedly create dev set
#This also needs to be manually labelled and so is also pretty small.

dev_set_all = ocod_data.loc[~ocod_data.title_number.isin(test_set.title_number),:]
random.seed(2017)
dev_set = random.sample(dev_set_all.title_number.to_list(), 2000)

dev_set = dev_set_all.loc[dev_set_all.title_number.isin(dev_set), 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

dev_set.to_csv('./data/dev_set.csv')


## Labelling in Humanloop

This part of the process uses the Humanloop programmatic app and is an external process. Once the labelling step is complete, the process outputs a JSON file containing the labels and spans, which is then cleaned in the next step.

## Removing overlapping spans

During the Humanloop tagging process the rules may result in the same words being tagged as part of multiple spans. This often occurs for road names made up of multiple parts,
e.g. Canberra Crescent Gardens may be tagged as Canberra Crescent and Canberra Crescent Gardens. The overlaps need to be removed before further processing.
For simplicity, the larger of any two overlapping spans is kept and the smaller is removed.
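
The rule can be sketched as follows. This is one possible reading of "keep the largest", not the code inside `clean_programmatic_for_humanloop`. The span dictionaries with `start`, `end` and `label` keys are an assumed format for illustration.

```python
def remove_overlapping_spans(spans):
    """Keep the longest span from each group of overlapping spans."""
    kept = []
    # Visit the longest spans first so they win any conflict.
    for span in sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True):
        overlaps_kept = any(
            span["start"] < k["end"] and k["start"] < span["end"] for k in kept
        )
        if not overlaps_kept:
            kept.append(span)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda s: s["start"])


text = "canberra crescent gardens, london"
spans = [
    {"start": 0, "end": 17, "label": "street_name"},  # "canberra crescent"
    {"start": 0, "end": 25, "label": "street_name"},  # "canberra crescent gardens"
    {"start": 27, "end": 33, "label": "city"},        # "london"
]
cleaned = remove_overlapping_spans(spans)
print([text[s["start"]:s["end"]] for s in cleaned])
```

Here the shorter "canberra crescent" span is dropped because it overlaps the longer street name, while the non-overlapping city span is untouched.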

In [1]:
#These libraries are specific to this part of the process
import json
import requests 
import config #contains hidden api key
import operator #used for sorting the label dictionaries by start point. This is the basis for removing overlaps

In [2]:
# Opening JSON file
with open("./data/test.json") as f:  #aggregate and download button
    # returns JSON object as
    # a dictionary
    data = json.load(f)

#this makes a list of all the observation rows. These refer to the row of the original observation text and so can be linked back to the original OCOD dataset.


## Clean the results

Take the non-denoised results and keep only the longest of any overlapping elements.

In [19]:

data_labels_dict = clean_programmatic_for_humanloop(data, ocod_data)
    
#Save the cleaned data back as a json file ready to be processed further  
with open('./data/full_dataset_no_overlaps.json', 'w') as f:
    json.dump(data_labels_dict, f)

count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000


### Uploading to humanloop cloud

This allows a sample of the data to be uploaded to the humanloop cloud so that an example model can be made.
The model provides another way to check the quality of the weak labelling. However, only 10k observations can be uploaded, so a sub-sample is used.
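
A reproducible sub-sample under the 10k limit could be drawn as in the sketch below. The function name `subsample` and the record format are assumptions. The seed 2017 simply mirrors the one used earlier in this notebook.

```python
import random


def subsample(records, max_records=10_000, seed=2017):
    """Return at most max_records items, chosen reproducibly and kept in order."""
    if len(records) <= max_records:
        return list(records)
    rng = random.Random(seed)  # a local generator leaves the global seed untouched
    chosen = sorted(rng.sample(range(len(records)), max_records))
    return [records[i] for i in chosen]


records = [{"datapoint_id": i} for i in range(25_000)]
sample = subsample(records)
print(len(sample))
```

Using a local `random.Random` instance means repeated calls give the same sample without disturbing any other seeded sampling in the notebook.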

In [8]:
with open('./data/full_dataset_no_overlaps.json') as f:
    data_labels_dict = json.load(f)

### Create new project with unlabelled data

This creates the JSON and uploads it as a new project to Humanloop. The data can then be hand labelled to create a ground truth.


In [11]:
jason_test_data ={
     "name": "test_set_24_04_22",
     "description": "the ground truth test set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
}
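#the data is passed positionally because a positional argument (config) cannot follow a keyword argument
#this assumes jason_data is the first parameter of upload_to_human_loop
upload_to_human_loop(jason_test_data, config, project_owner = "jonathan.s.bourne@gmail.com" )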
del jason_test_data

In [40]:
jason_dev_data ={
     "name": "dev_set_13_05_22",
     "description": "the ground truth dev set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
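     "data": [data_labels_dict[x] for x in dev_set.loc[:,'datapoint_id']]#, #upload only data from the dev set
}


#the data is passed positionally because a positional argument (config) cannot follow a keyword argument
#this assumes jason_data is the first parameter of upload_to_human_loop
upload_to_human_loop(jason_dev_data, config, project_owner = "jonathan.s.bourne@gmail.com" )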

del jason_dev_data

# spaCy

## Create spaCy format

The chunk below creates the spaCy format from the output of programmatic.

In [2]:
from prep_helper_functions import *

training_data, dev_data = create_spacy_training_set(programmatic_data_path = "./data/test.json",
                              dev_set_path = './data/ground_truth_dev_set_labels.json',
                              hmm_denoising = True,
                              alignment_mode_type = "expand",
                              save_folder = '/tf/enhance_ocod/'+"data/spacy_data/training/data_hmm_24_05_22")

loading data
aggregateResults
removing overlapping spans from training data
count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000
removing overlapping spans from dev data
creating training DocBin
creating dev DocBin


## spaCy training data debugging

The chunks below help debug the creation of the spaCy training data.

Errors are usually caused by tokenization issues. Several of these issues have been solved by changing the tokenizer 'infixes'.

In [63]:
training_data =  train_set
training_data[4908]
#The rule is splitting the unit id on the . but why? It shouldn't

{'datapoint_id': 5001,
 'text': 'units d.01 and d.02, trathen square, london (se10 0et)',
 'entities': [[0, 5, 'unit_type'],
  [6, 7, 'unit_id'],
  [8, 10, 'unit_id'],
  [21, 35, 'street_name'],
  [37, 43, 'city'],
  [45, 53, 'postcode']]}

In [66]:
#nlp = spacy.blank("en")
training_data = train_set
# the DocBin will store the example documents
print('alignment mode '+alignment_mode_type)
db = DocBin()
for i in range(i,i+1):
    current_set = training_data[i]
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        print(span)
        ents.append(span)
    doc.ents = ents

alignment mode expand
units
d.01
d.01
trathen square
london
se10 0et


ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing or outside.
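
The error above is consistent with the two `unit_id` spans, [6, 7] and [8, 10], both expanding to the single token `d.01` under the "expand" alignment mode, as the repeated `d.01` in the printed output suggests. The sketch below reproduces that check without spaCy. Whitespace splitting stands in for the spaCy tokenizer (an assumption, since the real tokenizer also splits punctuation), and the helper names are hypothetical.

```python
import re


def token_bounds(text):
    """Character (start, end) of each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def find_token_clashes(text, entities):
    """List tokens claimed by more than one entity after expanding to token edges."""
    bounds = token_bounds(text)
    owner = {}
    clashes = []
    for start, end, label in entities:
        # Every token that overlaps the character span is claimed by the entity.
        touched = [i for i, (s, e) in enumerate(bounds) if s < end and e > start]
        for tok in touched:
            if tok in owner:
                clashes.append((tok, owner[tok], (start, end, label)))
            else:
                owner[tok] = (start, end, label)
    return clashes


text = "units d.01 and d.02, trathen square, london (se10 0et)"
entities = [
    [0, 5, "unit_type"],
    [6, 7, "unit_id"],
    [8, 10, "unit_id"],
    [21, 35, "street_name"],
    [37, 43, "city"],
    [45, 53, "postcode"],
]
for tok, first, second in find_token_clashes(text, entities):
    print("token", tok, "claimed by", first, "and", second)
```

Running a check like this over the training set before building the `DocBin` would flag the offending rows without waiting for spaCy to raise.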

In [27]:
!python -m spacy init fill-config /tf/empty_homes_london/base_config.cfg /tf/empty_homes_london/config.cfg

✔ Auto-filled config with all values
✔ Saved config
/tf/empty_homes_london/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [24]:
#This is the code to run the training model
#!python -m spacy train config.cfg --paths.train /tf/data/spacy_data/train.spacy --paths.dev /tf/data/spacy_data/dev.spacy --output /tf/data/spacy_data/ --gpu-id 1
#!python -m spacy train config.cfg --paths.train /home/jonno/data/spacy_data/train.spacy --paths.dev /home/jonno/data/spacy_data/dev.spacy --output /home/jonno/data/spacy_data/cpu
#python -m spacy train ./spacy_config_files/cpu_config.cfg --paths.train ./data/spacy_data/train.spacy --paths.dev ./data/spacy_data/dev.spacy --output ./data/spacy_data/cpu3

#!python -m spacy train ./spacy_config_files/cpu_config.cfg --paths.train ./data/spacy_data/training/data_hmm_24_05_22/train.spacy --paths.dev ./data/spacy_data/training/data_hmm_24_05_22/dev.spacy --output ./data/spacy_data/training/data_hmm_24_05_22



# Predicting using spaCy

The chunks below are used to predict labels from the data using spaCy. There are some issues that appear to be related to a recent update of CUDA, which has caused a variety of problems. As such, this part of the code is being kept separate and some of the code choices may look very strange.

Interestingly, the performance of the spaCy model is pretty much identical to the labels created using rules. This suggests that the labelling is probably very good and also that spaCy is overfitting.

In [4]:
import json
import spacy

from address_parsing_helper_functions import load_and_prep_OCOD_data

import os
print(os.getcwd())

#spacy.require_gpu()
#spacy.prefer_gpu()

nlp1 = spacy.load("/tf/data/spacy_data/cpu/model-best") 

ocod_data = load_and_prep_OCOD_data('/tf/data/' +'OCOD_FULL_2022_02.csv')

/tf/empty_homes_london




In [10]:
#transformer takes about 85 minutes with cpu and 2.35 minutes with GPU
#However if there is a recent update to pytorch there can be problems with the GPU inference https://github.com/explosion/spaCy/issues/8229
#In addition if nvidia updates something Docker can have difficult to resolve bugs

import time

start_time = time.time()
#with torch.no_grad():
spacy_docs_list = list(nlp1.pipe(ocod_data.loc[1:100,'property_address']))
end_time = time.time()

print(end_time - start_time)

#This is a runtime comparison of spacy using cpu and gpu.
#GPU about 5-6 times faster for inference on a transformer
#https://github.com/BlueBrain/Search/issues/337

0.3141946792602539


In [26]:
spacy_model_path = "/tf/data/spacy_data/cpu/model-best"

def spacy_pred_fn(spacy_model_path, ocod_data, print_every =1000):
    nlp1 = spacy.load(spacy_model_path) 

    ocod_context = [(ocod_data.loc[x,'property_address'], {'datapoint_id':x}) for x in range(0,ocod_data.shape[0])]
    i = 0
    all_entities_json = []        
    for doc, context in list(nlp1.pipe(ocod_context[0:1000], as_tuples = True)):

        #This doesn't print as it is a stream not a conventional loop
        #if i%print_every==0: print("doc ", i, " of "+ str(ocod_data.shape[0]))
        #i = i+1

        temp = doc.to_json()
        temp.pop('tokens')
        temp.update({'datapoint_id':context['datapoint_id']})
        all_entities_json.append(temp)

    all_entities = pd.json_normalize(all_entities_json, record_path = "ents", meta= ['text', 'datapoint_id'])
    
    return all_entities

In [13]:
test = spacy_pred_fn(spacy_model_path, ocod_data, print_every =1000)

test

NameError: name 'spacy_pred_fn' is not defined

### This is designed to avoid the prediction crash that occurs possibly due to the nvidia upgrade
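
The cell below hard-codes five batches and writes each one to its own JSON file, presumably so that a crash costs at most one batch. A more general version of the same pattern might look like the sketch below. `chunk_bounds`, `predict_in_chunks` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for the spaCy model.

```python
import json
import os
import tempfile


def chunk_bounds(n_rows, chunk_size):
    """Return (lower, upper) index pairs that cover range(n_rows)."""
    return [(lo, min(lo + chunk_size, n_rows)) for lo in range(0, n_rows, chunk_size)]


def predict_in_chunks(records, predict_fn, chunk_size, out_dir):
    """Run predict_fn on each chunk, save it to JSON, then reload and merge."""
    paths = []
    for n, (lo, hi) in enumerate(chunk_bounds(len(records), chunk_size)):
        path = os.path.join(out_dir, f"pred_labels_{n}.json")
        with open(path, "w") as f:
            json.dump(predict_fn(records[lo:hi]), f)
        paths.append(path)
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f))
    return merged


# Hypothetical stand-in for the spaCy model: it just echoes each address.
def fake_predict(batch):
    return [{"text": address, "ents": []} for address in batch]


addresses = [f"{i} example road" for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    results = predict_in_chunks(addresses, fake_predict, chunk_size=4, out_dir=tmp)
print(len(results), chunk_bounds(10, 4))
```

Computing the bounds from the data length removes the need to edit the `lower` and `upper` lists by hand when the dataset size changes.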

In [2]:
target_data = ocod_data.loc[:,'property_address']
lower = [0, 20000,40000,60000,80000]
upper = [20000,40000,60000,80000,len(target_data)]

ocod_context = [(ocod_data.loc[x,'property_address'], {'datapoint_id':x}) for x in range(0,ocod_data.shape[0])]

print('begin getting labels')

for x in range(0,5):
    print(x)
    spacy_docs_list = []        
    for doc, context in list(nlp1.pipe(ocod_context[lower[x]:upper[x]], as_tuples = True)):
        temp = doc.to_json()
        temp.pop('tokens')
        temp.update({'datapoint_id':context['datapoint_id']})
        spacy_docs_list = spacy_docs_list + [temp]
        
    file_name = '/home/jonno/data/spacy_pred_labels' + str(x) + '.json'
    with open(file_name, 'w') as f:
        json.dump(spacy_docs_list, f)
        
all_entities_json = []
for x in range(0,5):
    print(x)
    file_name = '/home/jonno/data/spacy_pred_labels' + str(x) + '.json'
    with open(file_name) as f:
        all_entities_json = all_entities_json + json.load(f)
        
all_entities = pd.json_normalize(all_entities_json, record_path = "ents", meta= ['text', 'datapoint_id'])

all_entities.to_csv('/home/jonno/data/spacy_preds_normalised.csv')


begin getting labels
0
1
2
3
4


In [12]:
with open('/home/jonno/data/spacy_pred_labels.json', 'w') as f:
    json.dump(spacy_labels, f)