# Unit tagging and span cleaning

This script combines two steps in the OCOD processing pipeline:

* Unit tagging
* Removing overlapping spans

These two steps are separated by the weak labelling in Humanloop, but as they are relatively simple they are included in a single script. The full pipeline is listed below, with the steps covered by this script in bold.

* **Raw CSV loaded and lightly processed**. Output: CSV containing the property address and the unit tag.
* Data labelled in Humanloop programmatic. Output: JSON file of entities.
* **Programmatic output JSON cleaned, ordered, and overlaps removed**. Output: JSON file.
* Clean JSON converted to a dataframe and multi-addresses expanded. Output: CSV.
* Count and locate addresses
* Create address matcher and match businesses
* Classify address types

## Unit tagging

This part of the pipeline adds a binary value indicating whether the line contains flats, units, stores, etc., which are likely to have a unit-level ID. This is important because such addresses are likely to have both a unit ID and a street number, and so need to be treated with care.

In [5]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import re
import random
from helper_functions import *

In [6]:
ocod_data =  pd.read_csv('/tf/empty_homes_data/' +
                    'OCOD_FULL_2022_02.csv',
                   encoding_errors= 'ignore').rename(columns = lambda x: x.lower().replace(" ", "_"))
ocod_data['postcode'] = ocod_data['postcode'].str.upper()
#empty addresses cannot be used. however there are only three so not a problem
ocod_data = ocod_data.dropna(subset = 'property_address')
ocod_data.reset_index(inplace = True, drop = True)
ocod_data['property_address'] = ocod_data['property_address'].str.lower()

#different words associated with unit ID's
flatregex = r"(flat|apartment|penthouse|unit)" #unit|store|storage these a

#This is not an exhaustive list of road names but it covers about 80% of all road types in the VOA business register.
#The cardinal directions are included as an option as they can appear after the road type. However, they serve no real purpose in this particular regex and are
#included for completeness
road_regex  = r"((road|street|lane|way|gate|avenue|close|drive|hill|place|terrace|crescent|gardens|square|walk|grove|mews|row|view|boulevard|pleasant|vale|yard|chase|rise|green|passage|friars|viaduct|promenade|end|ridge|embankment|villas|circus))\b( east| west| north| south)?"
#These names may be followed by a road type e.g. Earls court road. A negative lookahead is used to prevent these roads being tagged as units.
flatregex2 = r"(mansions|villa|court)\b(?!(\s"+road_regex+"))"

#flat_tag is used for legacy reasons but refers to sub-units in general
ocod_data['flat_tag'] = ocod_data['property_address'].str.contains(flatregex + '|'+flatregex2, case = False)

ocod_data['commercial_park_tag'] = ocod_data['property_address'].str.contains(r"(retail|industrial|commercial|business|distribution|car)", case = False)

#typo in the data leads to a large number of fake flats
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("stanley court ", "stanley court, ")

#only the address text and the two tag columns are needed for the humanloop labelling process
ocod_data[['property_address', 'flat_tag', 'commercial_park_tag']].rename(columns = {'property_address':'text'}).to_csv('/tf/empty_homes_data/property_address_only.csv')
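
The effect of the negative lookahead can be checked on a few made-up addresses. The snippet below is a minimal sketch, not the notebook's exact code: it uses a shortened road list and non-capturing groups, and the helper name `has_unit_tag` and the sample addresses are assumptions introduced for illustration only.

```python
import re

# Shortened versions of the patterns above (illustrative only, not the full road list)
flat_regex = r"(?:flat|apartment|penthouse|unit)"
road_regex = r"(?:road|street|lane|way|avenue)\b(?: east| west| north| south)?"
# "court", "villa" and "mansions" only count as units when NOT followed by a road type
court_regex = r"(?:mansions|villa|court)\b(?!\s" + road_regex + ")"

unit_pattern = re.compile(flat_regex + "|" + court_regex, flags=re.IGNORECASE)


def has_unit_tag(address):
    """Hypothetical helper: True if the address looks like it contains sub-units."""
    return bool(unit_pattern.search(address))


# Made-up example addresses
examples = [
    "flat 2, 10 high street, leeds",
    "stanley court, 5 park lane, london",
    "earls court road, london",
    "10 high street, leeds",
]
for address in examples:
    print(address, "->", has_unit_tag(address))
```

Under these assumptions the first two addresses are tagged, while "earls court road" is not, because "court" is immediately followed by a road type.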



In [22]:
#Create the index for the ground truth
random.seed(2017)
test_set = random.sample([*range(0, ocod_data.shape[0])], 1000) 

test_set = ocod_data.loc[test_set, 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

test_set.to_csv('/tf/empty_homes_data/test_set_indices.csv')

## Labelling in Humanloop

This part of the process uses the Humanloop programmatic app and is an external process. Once the labelling step is complete, the process outputs a JSON file containing the labels and spans, which is then cleaned in the next step.

## Removing overlapping spans

During the Humanloop tagging process the rules may result in the same words being tagged as part of multiple spans. This often occurs for road names made up of multiple parts,
e.g. Canberra Crescent Gardens may be tagged as both Canberra Crescent and Canberra Crescent Gardens. The overlaps need to be removed before further processing.
For simplicity, the larger of any two overlapping spans is kept and the smaller is removed.
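
The rule can be sketched as follows. This is only an illustration of the "keep the larger span" idea, not the implementation of `remove_overlapping_spans2` in `helper_functions`, which is not shown here. The function name `drop_overlapping_spans`, the example spans, and the assumption that `end` is exclusive are all hypothetical.

```python
def drop_overlapping_spans(spans):
    """Hypothetical sketch: keep the longer of any two overlapping spans.

    Each span is a dict with 'start', 'end' (assumed exclusive) and 'label'.
    """
    kept = []
    # Sort by start (then end) so overlapping spans sit next to each other
    for span in sorted(spans, key=lambda s: (s["start"], s["end"])):
        if kept and span["start"] < kept[-1]["end"]:
            # Overlaps the last kept span: replace it only if the new span is longer
            last = kept[-1]
            if span["end"] - span["start"] > last["end"] - last["start"]:
                kept[-1] = span
        else:
            kept.append(span)
    return kept


# Made-up labels for the text "canberra crescent gardens, london"
example_spans = [
    {"start": 0, "end": 17, "label": "street_name"},   # "canberra crescent"
    {"start": 0, "end": 25, "label": "street_name"},   # "canberra crescent gardens"
    {"start": 27, "end": 33, "label": "city"},         # "london"
]
cleaned_spans = drop_overlapping_spans(example_spans)
print(cleaned_spans)
```

In this example only the longer street name span and the city span survive.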

In [3]:
#These libraries are specific to this part of the process
import json
import requests 
import config #contains hidden api key
import operator #used for sorting the label dictionaries by start point. This is the basis for removing overlaps

In [5]:
# Open the JSON file exported from humanloop (aggregate and download button)
with open("/tf/empty_homes_data/humanloop_02_04_22_t1400.json") as f:
    # json.load returns the JSON object as a dictionary
    data = json.load(f)

#this makes a list of all the observation rows. These refer to the row of the original observation text and so can be linked back to the original OCOD dataset.


In [11]:
data['datapoints']['data']

TypeError: list indices must be integers or slices, not str

In [None]:
datapoint_id_list = [x['datapointId'] for x in data]

In [None]:


data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):
    count_it += 1
    if count_it % 100 == 0: 
        print('count = {}'.format(count_it))
        
    single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
   # list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    
    ##these labels are in dictionary form
    list_of_labels_dict = [{'start': data[x]['start'], 
                            'end':data[x]['end'], 
                            'label': data[x]['label'], 
                            'label_text': data[x]['text'] } for x in single_id_index[0].tolist()]
    
    #this in-place sort orders the list of label dictionaries by start point. The sort is stable, so ties keep their original order
    #it requires the operator library
    list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
    list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)

    #create the NER dataset structure shown on the spacy website
   # data_and_labels = data_and_labels + [ ( ocod_data['property_address'][i], list_of_labels ) ]
    #create a list of dictionaries using a similar structure to save as a json
    data_labels_dict = data_labels_dict + [
        {
            'text' : ocod_data['property_address'][i],
            'labels' : list_of_labels_dict,
            'datapoint_id': i,
        }
    ]
    
#Save the cleaned data back as a json file ready to be processed further  
with open('/tf/empty_homes_data/full_dataset_no_overlaps.json', 'w') as f:
    json.dump(data_labels_dict, f)

### Uploading to humanloop cloud

This allows a sample of the data to be uploaded to the Humanloop cloud so that an example model can be made.
The model provides another way to check the quality of the weak labelling. However, only 10k observations can be uploaded, so a sub-sample is used.

In [None]:
#take a random sub sample of data
random.seed(10)
data_labels_dict2 = random.sample(data_labels_dict, 9999)

jason_test_data ={
     "name": "28_02_22_1010",
     "description": "example labelling structure",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": data_labels_dict2
}

url = "https://api.humanloop.com/datasets"

# replace payload with your actual dataset...
payload= json.dumps(jason_test_data)
headers = {
  'X-API-Key': config.api_key,#the api key is hidden in a config file
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

In [26]:
results_list = data['datapoints'][0]['programmatic']['results']
[(x['start'],x['end'],x['label'] ) for x in results_list]

[(41, 47, 'city'),
 (27, 39, 'street_name'),
 (48, 55, 'postcode'),
 (0, 25, 'building_name'),
 (27, 39, 'street_name')]
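
The output above is unordered and contains the `street_name` span twice. As a small illustration, assuming exact duplicates carry no extra information, the tuples can be de-duplicated and ordered by start point before any overlap handling. The variable names below are hypothetical.

```python
# Label tuples copied from the output above: (start, end, label)
raw_labels = [
    (41, 47, 'city'),
    (27, 39, 'street_name'),
    (48, 55, 'postcode'),
    (0, 25, 'building_name'),
    (27, 39, 'street_name'),
]

# set() drops exact duplicates; sorted() orders the tuples by start, then end, then label
unique_sorted_labels = sorted(set(raw_labels))
print(unique_sorted_labels)
```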

## Create spaCy old format

In [14]:
datapoint_id_list = [*range(0, 100, 1)]
data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):
    count_it += 1
    if count_it % 100 == 0: 
        print('count = {}'.format(count_it))
        
    single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
    list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    

    #create the NER dataset structure shown on the spacy website
    data_and_labels = data_and_labels + [ ( ocod_data['property_address'][i], list_of_labels ) ]

    
#Save the spacy format data as a json file ready to be processed further
#data_and_labels is saved here, as data_labels_dict is never filled in this cell
with open('/tf/empty_homes_data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_and_labels, f)

KeyError: 0