# Unit tagging and span cleaning

This script combines two steps in the OCOD processing pipeline:

* Unit tagging
* Removing overlapping spans

These two processes are separated by the weak labelling in Humanloop, but as they are relatively simple they are included in a single script. The full pipeline is listed below, with the steps covered here in bold.

* **Raw CSV loaded and lightly processed**. Output: CSV with two columns, property address and unit tag
* Data labelled in Humanloop programmatic. Output: JSON file of entities
* **Programmatic output JSON cleaned, ordered and overlaps removed**. Output: JSON file
* Clean JSON converted to dataframe and multi-addresses expanded. Output: CSV
* Count and locate addresses
* Create address matcher and match businesses
* Classify address types

## Unit tagging

This part of the pipeline adds a binary value indicating whether the line contains flats, units, stores etc., which are likely to have a unit-level ID. This is important because such addresses are likely to have a unit ID and a street number, and so need to be treated with care.
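
As a rough illustration of the tagging rule, the sketch below applies the same idea to three made-up addresses. The road-type list is shortened, and the names `ROAD_TYPES`, `UNIT_WORDS`, `BUILDING_WORDS` and `has_unit_tag` are hypothetical and do not appear in the notebook. The point is the negative lookahead: a building word such as "court" counts as a unit unless a road type follows it.

```python
import re

# Shortened road-type list for illustration only; the notebook's road_regex is longer.
ROAD_TYPES = r"(road|street|lane|avenue|close|gardens)\b"
# Words that directly name a sub-unit.
UNIT_WORDS = r"(flat|apartment|penthouse|unit)"
# Building words count as units unless a road type follows, e.g. "earls court road".
BUILDING_WORDS = r"(mansions|villa|court)\b(?!\s" + ROAD_TYPES + ")"
UNIT_PATTERN = re.compile(UNIT_WORDS + "|" + BUILDING_WORDS, flags=re.IGNORECASE)


def has_unit_tag(address: str) -> bool:
    """Return True when the address appears to contain a sub-unit."""
    return UNIT_PATTERN.search(address) is not None


# Made-up example addresses, not taken from the OCOD data.
examples = [
    "flat 3, 10 high street, london",
    "stanley court, 25 woodside lane, london",
    "21 earls court road, london",
]
tags = {address: has_unit_tag(address) for address in examples}
for address, tag in tags.items():
    print(address, "->", tag)
```

The first two addresses are tagged and the third is not, because "court" is immediately followed by the road type "road".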

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import re
import random
from helper_functions import *

In [2]:
ocod_data =  pd.read_csv('/tf/empty_homes_data/' +
                    'OCOD_FULL_2022_02.csv',
                   encoding_errors= 'ignore').rename(columns = lambda x: x.lower().replace(" ", "_"))
ocod_data['postcode'] = ocod_data['postcode'].str.upper()
#empty addresses cannot be used. however there are only three so not a problem
ocod_data = ocod_data.dropna(subset = 'property_address')
ocod_data.reset_index(inplace = True, drop = True)
ocod_data['property_address'] = ocod_data['property_address'].str.lower()

#different words associated with unit ID's
flatregex = r"(flat|apartment|penthouse|unit)" #unit|store|storage these a

#This is not an exhaustive list of road names but it covers about 80% of all road types in the VOA business register.
#The cardinal directions are included as an option as they can appear after the road type. However, they serve no real purpose in this particular regex and are
#included for completeness
road_regex  = r"((road|street|lane|way|gate|avenue|close|drive|hill|place|terrace|crescent|gardens|square|walk|grove|mews|row|view|boulevard|pleasant|vale|yard|chase|rise|green|passage|friars|viaduct|promenade|end|ridge|embankment|villas|circus))\b( east| west| north| south)?"
#These names may be followed by a road type e.g. Earls court road. A negative lookahead is used to prevent these roads being tagged as units.
flatregex2 = r"(mansions|villa|court)\b(?!(\s"+road_regex+"))"

#flat_tag is used for legacy reasons but refers to sub-units in general
ocod_data['flat_tag'] = ocod_data['property_address'].str.contains(flatregex + '|'+flatregex2, case = False)

ocod_data['commercial_park_tag'] = ocod_data['property_address'].str.contains(r"(retail|industrial|commercial|business|distribution|car)", case = False)

#typo in the data leads to a large number of fake flats
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("stanley court ", "stanley court, ")

#only the address text, the two tags and the title number are needed for the humanloop labelling process
ocod_data[['property_address', 'flat_tag', 'commercial_park_tag', 'title_number']].rename(columns = {'property_address':'text'}).to_csv('/tf/empty_homes_data/property_address_only.csv')



In [44]:
#Create the index for the ground truth
random.seed(2017)
test_set = random.sample([*range(0, ocod_data.shape[0])], 1000) 

test_set = ocod_data.loc[test_set, 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

test_set.to_csv('/tf/empty_homes_data/test_set_indices.csv')

## Labelling in Humanloop

This part of the process uses the Humanloop programmatic app and is an external process. Once the labelling step is complete, the process outputs a JSON file containing the labels and spans, which is then cleaned in the next step.

## Removing overlapping spans

During the Humanloop tagging process the rules may result in the same words being tagged as part of multiple spans. This often occurs for road names made up of multiple parts,
e.g. Canberra Crescent Gardens may be tagged as both Canberra Crescent and Canberra Crescent Gardens. The overlaps need to be removed before further processing.
For simplicity, the larger of any two overlapping spans is kept and the smaller is removed.
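
The rule can be sketched as follows. This is a minimal illustration, not the `remove_overlapping_spans2` helper from `helper_functions`, whose implementation is not shown here. The function name `drop_overlapping_spans` is hypothetical, and the sketch assumes each label is a dictionary with `start` and `end` character offsets, with `end` exclusive.

```python
def drop_overlapping_spans(spans):
    """Keep the longer span of any overlapping pair.

    `spans` is a list of dicts with 'start' and 'end' character offsets
    (end exclusive). The input list is not modified.
    """
    kept = []
    # Sort by start point; for equal starts the longer span comes first.
    ordered = sorted(spans, key=lambda s: (s["start"], -(s["end"] - s["start"])))
    for span in ordered:
        if kept and span["start"] < kept[-1]["end"]:
            # Overlaps the last kept span: retain whichever is longer.
            last = kept[-1]
            if span["end"] - span["start"] > last["end"] - last["start"]:
                kept[-1] = span
        else:
            kept.append(span)
    return kept


text = "canberra crescent gardens, london"
spans = [
    {"start": 0, "end": 17, "label": "street_name"},   # "canberra crescent"
    {"start": 0, "end": 25, "label": "street_name"},   # "canberra crescent gardens"
    {"start": 27, "end": 33, "label": "city"},         # "london"
]
cleaned = drop_overlapping_spans(spans)
print([text[s["start"]:s["end"]] for s in cleaned])
```

Because the spans are processed in start order, a replacement span can never reach back into an earlier kept span, so the result contains no overlaps.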

In [3]:
#These libraries are specific to this part of the process
import json
import requests 
import config #contains hidden api key
import operator #used for sorting the label dictionaries by start point. This is the basis for removing overlaps

In [38]:
# Opening JSON file
with open("/tf/empty_homes_data/humanloop_11_04_22_t0915.json") as f:  #aggregate and download button
    # returns JSON object as a dictionary
    data = json.load(f)

#this makes a list of all the observation rows. These refer to the row of the original observation text and so can be linked back to the original OCOD dataset.


In [68]:

datapoint_id_list = [*range(0,len(data['datapoints']))]

data_and_labels = []
data_labels_dict = []

count_it = 0
for i in datapoint_id_list:  #iterate in order so that the list position matches the datapoint_id

    count_it += 1
    if count_it % 100 == 0: 
        print('count = {}'.format(count_it))
        
    #single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
   # list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    list_of_labels_dict = results_list = data['datapoints'][i]['programmatic']['results']
    ##these labels are in dictionary form
#     list_of_labels_dict = [{'start': x['start'], 
#                             'end':x['end'], 
#                             'label': x['label'], 
#                             'label_text': x['text'] } for x in results_list]
    
    #this in-place sort uses operator.itemgetter to order the label dictionaries by start point. the sort is stable, so ties keep their original order
    #it requires the operator library
    list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
    list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)

    #create the NER dataset structure shown on the spacy website
   # data_and_labels = data_and_labels + [ ( ocod_data['property_address'][i], list_of_labels ) ]
    #create a list of dictionaries using a similar structure to save as a json
    data_labels_dict.append(
        {
            'text' : ocod_data['property_address'][i],
            'labels' : list_of_labels_dict,
            'datapoint_id': i,
            'title_number':ocod_data['title_number'][i]
        }
    )
    
#Save the cleaned data back as a json file ready to be processed further  
with open('/tf/empty_homes_data/full_dataset_no_overlaps.json', 'w') as f:
    json.dump(data_labels_dict, f)


### Uploading to humanloop cloud

This allows a sample of the data to be uploaded to the Humanloop cloud so that an example model can be made.
The model provides another way to check the quality of the weak labelling. However, only 10k observations can be uploaded, so a sub-sample is used.
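
A reproducible sub-sample under the upload limit could be drawn along these lines. This is only a sketch: the function name `subsample` and its parameters are hypothetical, and the notebook itself uploads the 1000-row test set rather than calling anything like this.

```python
import random


def subsample(records, limit=10_000, seed=2017):
    """Return at most `limit` records, chosen reproducibly.

    `records` is a list; a seeded generator is used so that repeated
    runs select the same records.
    """
    if len(records) <= limit:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, limit)


# Stand-in data: integers in place of the labelled address records.
all_records = list(range(25_000))
upload_records = subsample(all_records)
print(len(upload_records))
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the global random state.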

In [26]:

jason_test_data ={
     "name": "test_set_24_04_22",
     "description": "the ground truth test set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
   # "complete":False
}

#url = "https://api.humanloop.com/datasets"
url = "https://api.humanloop.com/projects/1369/add-tasks"
# replace payload with your actual dataset...
payload= json.dumps(jason_test_data)
headers = {
  'X-API-Key': config.api_key,#the api key is hidden in a config file
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

{"detail":{"loc":["data"],"msg":"The following fields were specified but did not exist in your dataset: span_tagging","description":"The following fields were specified but did not exist in your dataset: span_tagging","type":"value_error"}}


{'name': 'test_set_24_04_22',
 'description': 'the ground truth test set for labelling',
 'fields': [{'name': 'text', 'data_type': 'text'},
  {'name': 'labels', 'data_type': 'character_offsets'},
  {'name': 'datapoint_id', 'data_type': 'text'}],
 'data': [{'text': 'land on the south-east side of folly lane, east cowes',
   'labels': [{'start': 0,
     'end': 4,
     'text': 'land',
     'labelId': '15',
     'label': 'unit_type',
     'labellingFunctionId': 14,
     'groundTruthId': None},
    {'start': 31,
     'end': 41,
     'text': 'folly lane',
     'labelId': '12',
     'label': 'street_name',
     'labellingFunctionId': 23,
     'groundTruthId': None},
    {'start': 48,
     'end': 53,
     'text': 'cowes',
     'labelId': '3',
     'label': 'city',
     'labellingFunctionId': 13,
     'groundTruthId': None}],
   'datapoint_id': 26104,
   'title_number': 'IW82373'},
  {'text': '3 hyde park square mews, london',
   'labels': [{'start': 0,
     'end': 1,
     'text': '3',
     'la

In [62]:


datapoint_id_list = test_set.loc[:,'datapoint_id'].to_list()

data_and_labels = []
data_labels_dict2 = []

count_it = 0
for i in datapoint_id_list:

    count_it += 1
    if count_it % 100 == 0: 
        print('count = {}'.format(count_it))
        
    list_of_labels_dict = results_list = data['datapoints'][i]['programmatic']['results']
    
    #this in-place sort uses operator.itemgetter to order the label dictionaries by start point. the sort is stable, so ties keep their original order
    #it requires the operator library
    list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
    list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)
    #create a list of dictionaries using a similar structure to save as a json
    #append to data_labels_dict2 itself so that every test set record is accumulated
    data_labels_dict2.append(
        {
            'text' : ocod_data['property_address'][i],
            'entities' : list_of_labels_dict,
            'datapoint_id': i,
          #  'title_number':ocod_data['title_number'][i]
        }
    )



In [69]:
jason_test_data ={
     "name": "test_set_24_04_22",
     "description": "the ground truth test set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
}

### Create a new project with unlabelled data


In [72]:

"""
Step 1: Specify URL and headers for your API requests and some helper methods 
Reference: https://api.humanloop.com/docs#section/Authentication
Notes: 
    - If you don't already have a Humanloop account,
      signup @ https://app.humanloop.com/signup
    - Replace <INSERT YOUR API KEY HERE> with your users X-API-KEY 
      @ https://app.humanloop.com/profile
"""
base_url = "https://api.humanloop.com"
headers = {
    "Content-Type": "application/json",
    "X-API-KEY":  config.api_key,#the api key is hidden in a config file
}
# use the email associated to your Humanloop account
project_owner = "jonathan.s.bourne@gmail.com"


def get_field_id_by_name(name: str, fields):
    """Helper method for parsing field_id from dataset.fields given the name"""
    return [field for field in fields if field["name"] == name][0]["id"]


""" 
Step 2: Create a dataset
Reference: https://api.humanloop.com/docs#operation/upload_data_datasets_post
Notes:
    - It can be helpful to include your own unique identifiers for your data-points
      if available so that you can easily correlate any annotations and predictions 
      created by Humanloop back to your system.
    - If using large datasets (> 10k rows), you will have to upload it in multiple 
      batches using the API. Starting with the POST as shown below, then adding 
      subsequent batches using the PUT against the newly created dataset 
      (https://api.humanloop.com/docs#operation/update_data_datasets__id__put.)
"""

dataset_fields = requests.post(
    url=f"{base_url}/datasets", data=json.dumps(jason_test_data), headers=headers
).json()["fields"]


"""
Step 3: Create a project
Reference: https://api.humanloop.com/docs#operation/create_project_projects_post
Notes:
    - A Humanloop project is made up of one or more datasets, a team of annotators 
      and a model. As your team begin to annotate the data, a model is trained in real 
      time and used to prioritise what data your annotators should focus on next 
      (see https://humanloop.com/blog/why-you-should-be-using-active-learning).
    - The project inputs specify those dataset fields you wish to show to 
      your annotators and the model. 
    - The project output specifies the type of model you wish to train and the 
      corresponding label taxonomy. 
    - If your dataset has a field with existing annotations, you can use this to 
      warm start your project as shown in the following examples. 
      If you want your team to first review these existing annotations in Humanloop, 
      set "review_existing_annotations" to True, otherwise they will be used 
      automatically to train an initial model.
    - Both classification (single and multi-label) and extraction
      projects are supported.
    - You can update your project with more data by either connecting another dataset 
      or simply adding additional data-points to your existing dataset. 
      Alternatively, you can submit tasks for your model and/or team to complete
      (see our Human-in-the-loop tutorial for more information on this!).
"""


"""
Step 3b: Span extraction project 
"""
extraction_project_request = {
    "name": "Ground truth for offshore empties",
    "inputs": [
        {
            "name": "text",
            "data_type": "text",
            "description": "unparsed addresses",
            "data_sources": [
                {"field_id": get_field_id_by_name("text", dataset_fields)}
            ],
        },
        {
            "name": "datapoint_id",
            "data_type": "text",
            "description": "The original row the data is on",
            "display_only": True,
            "data_sources": [
                {"field_id": get_field_id_by_name("datapoint_id", dataset_fields)}
            ],
        },
    ],
    "outputs": [
        {
            "name": "labels",
            "description": "entities address parts",
            "task_type": "sequence_tagging",
            "data_sources": [
                {"field_id": get_field_id_by_name("labels", dataset_fields)}
            ],
            # which input you wish your model to extract from
            "input": "text",
        }
    ],
    "users": [project_owner],
    "guidelines": "Insert your markdown annotator guidelines here",
    "review_existing_annotations": True,
}

extraction_project_id = requests.post(
    url=f"{base_url}/projects",
    data=json.dumps(extraction_project_request),
    headers=headers,
).json()["id"]

print(f"Navigate to https://app.humanloop.com/projects/{extraction_project_id}")

Navigate to https://app.humanloop.com/projects/1388


## Create spaCy old format

In [58]:
datapoint_id_list = [*range(0,len(data['datapoints']))]
data_and_labels = []
data_labels_dict = []

count_it = 0
for i in datapoint_id_list:
    count_it += 1
    if count_it % 1000 == 0: 
        print('count = {}'.format(count_it))
        
    ##these labels are in tuple form
    results_list = data['datapoints'][i]['programmatic']['results']
    list_of_labels = [(x['start'], x['end'], x['label']) for x in results_list]
    
    #order the labels by start point before removing overlaps
    list_of_labels.sort(key=lambda y: y[0])
    
    list_of_labels = remove_overlapping_spans_tuples(list_of_labels)
    #create the NER dataset structure shown on the spacy website
    data_and_labels.append({
        'datapoint_id': i,
        'text': ocod_data['property_address'][i],
        'entities': list_of_labels
    })

#print only the first few records; printing the full list exceeds the notebook output rate limit
print(data_and_labels[:5])
#Save the cleaned data back as a json file ready to be processed further  
with open('/tf/empty_homes_data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_and_labels, f)
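
For comparison, the sketch below converts one cleaned record into the `(text, {"entities": [...]})` tuple used by spaCy v2 training examples, as far as that layout is understood here. The function name `to_spacy_v2` is hypothetical, and the record reuses the first address and spans shown in the upload output earlier in this notebook.

```python
def to_spacy_v2(record):
    """Convert one cleaned record to a (text, annotations) training tuple.

    `record` is a dict with 'text' and a 'labels' list of dicts holding
    'start', 'end' and 'label', as in the cleaned JSON above.
    """
    entities = [(lab["start"], lab["end"], lab["label"]) for lab in record["labels"]]
    return (record["text"], {"entities": entities})


record = {
    "text": "land on the south-east side of folly lane, east cowes",
    "labels": [
        {"start": 0, "end": 4, "label": "unit_type"},
        {"start": 31, "end": 41, "label": "street_name"},
        {"start": 48, "end": 53, "label": "city"},
    ],
}
text, annotations = to_spacy_v2(record)
print(text)
print(annotations["entities"])
```

Note that tuples are written as lists when saved with `json.dump`, so they need converting back to tuples if the loader expects them.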


