# Unit tag and Span cleaning

This script combines two steps in the OCOD processing pipeline:

* Unit tagging
* Removing overlapping spans

These two processes are separated by the weak labelling in Humanloop, but as they are relatively simple they are included in a single script. The steps of the full pipeline are listed below, with the two covered here in bold.

* **Raw CSV loaded and lightly processed**. Output: CSV of property addresses with their unit tags
* Data labelled in Programmatic. Output: JSON file of entities
* **Programmatic output JSON cleaned, ordered and overlaps removed**. Output: JSON file
* Clean json converted to dataframe and multi-addresses expanded. Output: CSV
* Count and locate addresses
* Create address matcher and match businesses
* Classify address types

## Unit tagging

This part of the pipeline adds a binary value indicating whether the line contains flats, units, stores, etc., which are likely to have a unit-level ID. This is important because such addresses are likely to have both a unit ID and a street number, and so need to be treated with care.
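
The tagging idea can be sketched with a shortened version of the patterns used in this notebook. The abbreviated road list, the sample addresses and the `has_unit` helper are assumptions made for illustration; they are not taken from OCOD or from the project's helper modules.

```python
import re

# Words that almost always indicate a sub-unit
unit_words = r"(?:flat|apartment|penthouse|unit)"
# Abbreviated road-type list, for illustration only
road_types = r"(?:road|street|lane|avenue)"
# Building words only count as a unit when NOT followed by a road type,
# so that "earls court road" is not tagged
building_words = r"(?:mansions|villa|court)\b(?!\s" + road_types + r"\b)"

unit_pattern = re.compile(unit_words + "|" + building_words, flags=re.IGNORECASE)


def has_unit(address: str) -> bool:
    """Return True when the address contains a likely sub-unit marker."""
    return unit_pattern.search(address) is not None


examples = [
    "flat 3, 12 high street, london",
    "14 earls court road, london",
    "5 stanley court, 20 park lane, leeds",
    "27 mill lane, york",
]
for address in examples:
    print(has_unit(address), address)
```

In this sketch the first and third addresses are tagged, while "earls court road" is skipped because the negative lookahead sees the road type that follows "court".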

In [1]:
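#display and HTML are imported from IPython.display as importing them from IPython.core.display is deprecated
from IPython.display import display, HTML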
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import re
import random
from helper_functions import *

In [2]:


ocod_data =  pd.read_csv('./data/' +
                    'OCOD_FULL_2022_02.csv',
                   encoding_errors= 'ignore').rename(columns = lambda x: x.lower().replace(" ", "_"))
ocod_data['postcode'] = ocod_data['postcode'].str.upper()
#empty addresses cannot be used. however there are only three so not a problem
ocod_data = ocod_data.dropna(subset = 'property_address')
ocod_data.reset_index(inplace = True, drop = True)
ocod_data['property_address'] = ocod_data['property_address'].str.lower()

#ensure there is a space after commas
#This is because some numbers are written as 1,2,3,4,5 which causes issues during tokenisation
ocod_data.property_address = ocod_data.property_address.str.replace(',', ', ', regex = False)
#remove multiple spaces
#regex = True is required as pandas 2.0 onwards treats the pattern as a literal string by default
ocod_data.property_address = ocod_data.property_address.str.replace(r'\s{2,}', ' ', regex = True)


#different words associated with unit ID's
flatregex = r"(flat|apartment|penthouse|unit)" #unit|store|storage these a

#This is not an exhaustive list of road names but it covers about 80% of all road types in the VOA business register.
#The cardinal directions are included as an option as they can appear after the road type. However, they serve no real purpose in this particular regex and are
#included for completeness
road_regex  = r"((road|street|lane|way|gate|avenue|close|drive|hill|place|terrace|crescent|gardens|square|walk|grove|mews|row|view|boulevard|pleasant|vale|yard|chase|rise|green|passage|friars|viaduct|promenade|end|ridge|embankment|villas|circus))\b( east| west| north| south)?"
#These names may be followed by a road type e.g. Earls court road. A negative lookahead is used to prevent these roads being tagged as units.
flatregex2 = r"(mansions|villa|court)\b(?!(\s"+road_regex+"))"

#flat_tag is used for legacy reasons but refers to sub-units in general
ocod_data['flat_tag'] = ocod_data['property_address'].str.contains(flatregex + '|'+flatregex2, case = False)

ocod_data['commercial_park_tag'] = ocod_data['property_address'].str.contains(r"(retail|industrial|commercial|business|distribution|car)", case = False)

#typo in the data leads to a large number of fake flats
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("stanley court ", "stanley court, ")
#This typo leads to some rather silly addresses
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("100-1124", "100-112")
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("40a, 40, 40¨, 42, 44", "40a, 40, 40, 42, 44")


#only these four columns are needed for the humanloop labelling process
ocod_data[['property_address', 'flat_tag', 'commercial_park_tag', 'title_number']].rename(columns = {'property_address':'text'}).to_csv('./data/property_address_only.csv')



In [3]:
#Create the index for the ground truth
random.seed(2017)
test_set = random.sample([*range(0, ocod_data.shape[0])], 1000) 

test_set = ocod_data.loc[test_set, 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

test_set.to_csv('./data/test_set_indices_space_cleaned_v2.csv')

In [4]:
#Belatedly create dev set
#This also needs to be manually labelled and so is also pretty small.

dev_set_all = ocod_data.loc[~ocod_data.title_number.isin(test_set.title_number),:]
random.seed(2017)
dev_set = random.sample(dev_set_all.title_number.to_list(), 2000)

dev_set = dev_set_all.loc[dev_set_all.title_number.isin(dev_set), 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

dev_set.to_csv('./data/dev_set.csv')


## Labelling in Humanloop

This part of the process uses the Humanloop Programmatic app and is an external process. Once the labelling step is complete, it outputs a JSON file containing the labels and spans, which is cleaned in the next step.

## Removing overlapping spans

During the Humanloop tagging process the rules may result in the same words being tagged as part of multiple spans. This often occurs for road names made up of multiple parts,
e.g. Canberra Crescent Gardens may be tagged as both Canberra Crescent and Canberra Crescent Gardens. The overlaps need to be removed before further processing.
For simplicity, the larger of any two overlapping spans is kept and the smaller is removed.
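
A minimal sketch of this rule is shown here. It is not the `remove_overlapping_spans2` function imported from `helper_functions`, whose implementation is not shown in this notebook; `drop_overlaps` and the sample spans are hypothetical and only illustrate the "keep the longer span" idea.

```python
def drop_overlaps(spans):
    """Keep the longer span whenever two spans overlap.

    `spans` is a list of dicts with 'start', 'end' (exclusive) and 'label'.
    """
    kept = []
    # Sort by start point; for equal starts the longer span comes first
    for span in sorted(spans, key=lambda s: (s["start"], -s["end"])):
        if kept and span["start"] < kept[-1]["end"]:
            # Overlaps the last kept span: keep whichever is longer
            last = kept[-1]
            if span["end"] - span["start"] > last["end"] - last["start"]:
                kept[-1] = span
        else:
            kept.append(span)
    return kept


text = "canberra crescent gardens, leeds"
spans = [
    {"start": 0, "end": 17, "label": "street_name"},   # "canberra crescent"
    {"start": 0, "end": 25, "label": "street_name"},   # "canberra crescent gardens"
    {"start": 27, "end": 32, "label": "city"},         # "leeds"
]
for span in drop_overlaps(spans):
    print(span["label"], "->", text[span["start"]:span["end"]])
```

Sorting by start point first means each span only has to be compared with the most recently kept span, which is why the notebook sorts the labels before removing overlaps.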

In [5]:
#These libraries are specific to this part of the process
import json
import requests 
import config #contains hidden api key
import operator #used for sorting the label dictionaries by start point. This is the basis for removing overlaps

In [8]:
# Opening JSON file
with open("./data/test.json") as f:  #aggregate and download button
    # returns JSON object as
    # a dictionary
    data = json.load(f)

#this makes a list of all the observation rows. These refer to the row of the original observation text and so can be linked back to the original OCOD dataset.


In [14]:

datapoint_id_list = [*range(0,len(data['datapoints']))]

data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):

    count_it += 1
    if count_it % 5000 == 0: 
        print('count = {}'.format(count_it))
        
    #single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
   # list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    list_of_labels_dict = results_list = data['datapoints'][i]['programmatic']['results']
    ##these labels are in dictionary form
#     list_of_labels_dict = [{'start': x['start'], 
#                             'end':x['end'], 
#                             'label': x['label'], 
#                             'label_text': x['text'] } for x in results_list]
    
    #this inplace sorting using operator orders the dictionary by the start point. ties are automatically broken
    #it required the operator library
    list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
    list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)

    #create the NER dataset structure shown on the spacy website
   # data_and_labels = data_and_labels + [ ( ocod_data['property_address'][i], list_of_labels ) ]
    #create a list of dictionaries using a similar structure to save as a json
    #appending in place avoids rebuilding the whole list on every iteration
    data_labels_dict.append(
        {
            'text' : ocod_data['property_address'][i],
            'labels' : list_of_labels_dict,
            'datapoint_id': i,
            'title_number':ocod_data['title_number'][i]
        }
    )
    
#Save the cleaned data back as a json file ready to be processed further  
with open('./data/full_dataset_no_overlaps.json', 'w') as f:
    json.dump(data_labels_dict, f)

count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000


### Uploading to humanloop cloud

This allows a sample of the data to be uploaded to the humanloop cloud so that an example model can be made.
The model provides another way to check the quality of the weak labelling. However, only 10k observations can be uploaded, so a sub-sample is used.

In [39]:
with open('./data/full_dataset_no_overlaps.json') as f:
    data_labels_dict = json.load(f)

In [11]:
jason_test_data ={
     "name": "test_set_24_04_22",
     "description": "the ground truth test set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
}

In [40]:
jason_dev_data ={
     "name": "dev_set_13_05_22",
     "description": "the ground truth dev set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in dev_set.loc[:,'datapoint_id']]#, #upload only data from the dev set
}

# Create new project with unlabelled data


In [44]:

"""
Step 1: Specify URL and headers for your API requests and some helper methods 
Reference: https://api.humanloop.com/docs#section/Authentication
Notes: 
    - If you don't already have a Humanloop account,
      signup @ https://app.humanloop.com/signup
    - Replace <INSERT YOUR API KEY HERE> with your users X-API-KEY 
      @ https://app.humanloop.com/profile
"""
base_url = "https://api.humanloop.com"
headers = {
    "Content-Type": "application/json",
    "X-API-KEY":  config.api_key,#the api key is hidden in a config file
}
# use the email associated to your Humanloop account
project_owner = "jonathan.s.bourne@gmail.com"


def get_field_id_by_name(name: str, fields):
    """Helper method for parsing field_id from dataset.fields given the name"""
    return [field for field in fields if field["name"] == name][0]["id"]


""" 
Step 2: Create a dataset
Reference: https://api.humanloop.com/docs#operation/upload_data_datasets_post
Notes:
    - It can be helpful to include your own unique identifiers for your data-points
      if available so that you can easily correlate any annotations and predictions 
      created by Humanloop back to your system.
    - If using large datasets (> 10k rows), you will have to upload it in multiple 
      batches using the API. Starting with the POST as shown below, then adding 
      subsequent batches using the PUT against the newly created dataset 
      (https://api.humanloop.com/docs#operation/update_data_datasets__id__put.)
"""

dataset_fields = requests.post(
    url=f"{base_url}/datasets", data=json.dumps(jason_dev_data), headers=headers ######################## CHANGE this depending on whether dev or test
).json()["fields"]


"""
Step 3: Create a project
Reference: https://api.humanloop.com/docs#operation/create_project_projects_post
Notes:
    - A Humanloop project is made up of one or more datasets, a team of annotators 
      and a model. As your team begin to annotate the data, a model is trained in real 
      time and used to prioritise what data your annotators should focus on next 
      (see https://humanloop.com/blog/why-you-should-be-using-active-learning).
    - The project inputs specify those dataset fields you wish to show to 
      your annotators and the model. 
    - The project output specifies the type of model you wish to train and the 
      corresponding label taxonomy. 
    - If your dataset has a field with existing annotations, you can use this to 
      warm start your project as shown in the following examples. 
      If you want your team to first review these existing annotations in Humanloop, 
      set "review_existing_annotations" to True, otherwise they will be used 
      automatically to train an initial model.
    - Both classification (single and multi-label) and extraction
      projects are supported.
    - You can update your project with more data by either connecting another dataset 
      or simply adding additional data-points to your existing dataset. 
      Alternatively, you can submit tasks for your model and/or team to complete
      (see our Human-in-the-loop tutorial for more information on this!).
"""


"""
Step 3b: Span extraction project 
"""
extraction_project_request = {
    "name": "Ground truth for offshore empties dev set for spacy training",
    "inputs": [
        {
            "name": "text",
            "data_type": "text",
            "description": "unparsed addresses",
            "data_sources": [
                {"field_id": get_field_id_by_name("text", dataset_fields)}
            ],
        },
        {
            "name": "datapoint_id",
            "data_type": "text",
            "description": "The original row the data is on",
            "display_only": True,
            "data_sources": [
                {"field_id": get_field_id_by_name("datapoint_id", dataset_fields)}
            ],
        },
    ],
    "outputs": [
        {
            "name": "labels",
            "description": "entities address parts",
            "task_type": "sequence_tagging",
            "data_sources": [
                {"field_id": get_field_id_by_name("labels", dataset_fields)}
            ],
            # which input you wish your model to extract from
            "input": "text",
        }
    ],
    "users": [project_owner],
    "guidelines": "Insert your markdown annotator guidelines here",
    "review_existing_annotations": True,
}

extraction_project_id = requests.post(
    url=f"{base_url}/projects",
    data=json.dumps(extraction_project_request),
    headers=headers,
).json()["id"]

print(f"Navigate to https://app.humanloop.com/projects/{extraction_project_id}")

Navigate to https://app.humanloop.com/projects/1651


# spaCy

## Create spaCy format

The chunk below creates the spaCy format from the output of Programmatic.
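
As a small illustration of the target structure, the sketch below converts one list of label dictionaries into sorted `(start, end, label)` tuples. The `results_list` here is a toy stand-in for one entry of `data['datapoints'][i]['programmatic']['results']`, rebuilt from the example record printed further down; it is an assumption that the real entries carry at least these three keys.

```python
# Toy stand-in for the labels of a single address
results_list = [
    {"start": 27, "end": 39, "label": "street_name"},
    {"start": 0, "end": 25, "label": "building_name"},
    {"start": 48, "end": 55, "label": "postcode"},
    {"start": 41, "end": 46, "label": "city"},
]
text = "westleigh lodge care home, nel pan lane, leigh (wn7 5jt)"

# Tuples sort by their first element, so this orders the labels by start point
entities = sorted((r["start"], r["end"], r["label"]) for r in results_list)
record = {"datapoint_id": 0, "text": text, "entities": entities}

# Slicing the text with each span is a quick sanity check on the offsets
for start, end, label in record["entities"]:
    print(label, "->", text[start:end])
```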

In [9]:
datapoint_id_list = [*range(0,len(data['datapoints']))]
data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):
    count_it += 1
    if count_it % 5000 == 0: 
        print('count = {}'.format(count_it))
        
    ##these labels are in tuple form
    #the unused np.where lookup over every datapoint id has been removed as it was recomputed on each iteration
    results_list = data['datapoints'][i]['programmatic']['results']
    list_of_labels =[(x['start'],x['end'],x['label'] ) for x in results_list]
    
    list_of_labels.sort(key=lambda y: y[0])
    
    list_of_labels = remove_overlapping_spans_tuples(list_of_labels)
    #print(list_of_labels)
    #create the NER dataset structure shown on the spacy website
    data_and_labels.append({
        'datapoint_id': i,
        'text':ocod_data['property_address'][i],
        'entities':list_of_labels})

#Save the cleaned data back as a json file ready to be processed further  
with open('./data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_and_labels, f)

count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000


In [14]:
data_and_labels[0]

{'datapoint_id': 0,
 'text': 'westleigh lodge care home, nel pan lane, leigh (wn7 5jt)',
 'entities': [(0, 25, 'building_name'),
  (27, 39, 'street_name'),
  (41, 46, 'city'),
  (48, 55, 'postcode')]}

This chunk creates the format using the output of the Humanloop cloud labelling, which acts as the ground truth.

In [11]:
with open('./data/ground_truth_dev_set_labels.json') as f:
    data_labels_dict_gt = json.load(f)

In [13]:
data_and_labels = []

count_it = 0
for i in range(0, len(data_labels_dict_gt)):
    count_it += 1
    if count_it % 5000 == 0: 
        print('count = {}'.format(count_it))
        
    inputs = data_labels_dict_gt[i]['inputs']

    labels = data_labels_dict_gt[i]['data']['labels']

    list_of_labels =[(x['start'],x['end'],x['label'] ) for x in labels]
    
    list_of_labels.sort(key=lambda y: y[0])
    
    #create the NER dataset structure shown on the spacy website
    data_and_labels.append({
        'datapoint_id': inputs['datapoint_id'],
        'text':inputs['text'],
        'entities':list_of_labels})

#sort once after the loop rather than on every iteration
data_and_labels.sort(key=lambda y: y.get('datapoint_id'))

#Save the cleaned data back as a json file ready to be processed further  
with open('./data/humanloop_spacy_format_gt.json', 'w') as f:
    json.dump(data_and_labels, f)

In [16]:
with open('./data/humanloop_spacy_format.json') as f:
    spacy_data = json.load(f)

with open('./data/humanloop_spacy_format_gt.json') as f:
    dev_set = json.load(f)

In [40]:
random.seed(2017)
#dev_set_indices = dev_set.loc[:, 'datapoint_id'].to_list()
#dev_set_indices = random.sample([*range(0, len(spacy_data))], 9400) 

from operator import itemgetter
dev_set_indices = list(map(itemgetter('datapoint_id'), dev_set))
dev_set_indices = list(map(int, dev_set_indices))

In [41]:
#dev_set = [spacy_data[x] for x in dev_set.loc[:, 'datapoint_id'].to_list()]
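#membership tests against a set are much faster than against a list
dev_set_index_set = set(dev_set_indices)
train_set = [spacy_data[x] for x in range(0, len(spacy_data)) if x not in dev_set_index_set]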

92088

# Create spaCy training data

This chunk creates the training data needed to train a spaCy model.

In [44]:
import spacy
from spacy.tokens import DocBin
from spacy.lang.char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from spacy.lang.char_classes import LIST_ICONS, HYPHENS, CURRENCY, UNITS
from spacy.lang.char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
from spacy.util import compile_infix_regex

alignment_mode_type = "expand"#"strict"

nlp = spacy.blank("en")


infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\,*^\(\)\/](?=[0-9-])", #added in / to break up 34/45 etc. added in , to break up 34,35 although this should now be removed in the cleaning stage
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/\(\)](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])[:<>=/\(\)](?=[{a}0-9])".format(a=ALPHA), #I added this one in to try and break things like "(odd)33-45"
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer



training_data = train_set
# the DocBin will store the example documents
db = DocBin()
for i in range(0, len(training_data)):
    current_set = training_data[i]
    #print(i) #printing is used for debugging
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./data/spacy_data/train.spacy")

training_data = dev_set
# the DocBin will store the example documents
db = DocBin()
for i in range(0, len(training_data)):
    current_set = training_data[i]
    #print(i)
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./data/spacy_data/dev.spacy")

## spacy training data debugging

The below chunks help debug the creation of the spacy training data.

Errors are usually caused by tokenization issues. Several of these issues have been solved by changing the 'infixes'.
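
To see why an infix rule matters, the sketch below checks whether a labelled span lines up with token boundaries. `rough_boundaries` is a deliberately crude, hypothetical stand-in for a tokenizer (whitespace splitting plus one infix pattern) and is not spaCy's tokenizer; it ignores prefix and suffix rules. The numeric infix pattern is copied from the tokenizer cell above.

```python
import re

# Numeric infix pattern copied from the tokenizer cell above
numeric_infix = re.compile(r"(?<=[0-9])[+\-\,*^\(\)\/](?=[0-9-])")


def rough_boundaries(text, infix=None):
    """Character offsets where a token may start or end.

    Splits on whitespace, then optionally on infix matches.
    """
    bounds = set()
    for chunk in re.finditer(r"\S+", text):
        bounds.update((chunk.start(), chunk.end()))
        if infix is not None:
            for m in infix.finditer(chunk.group()):
                bounds.update((chunk.start() + m.start(), chunk.start() + m.end()))
    return bounds


def aligned(span, bounds):
    """A span is aligned when both of its ends fall on a token boundary."""
    start, end = span
    return start in bounds and end in bounds


text = "flat 40/41, aldford house, park street, london (w1k 7lg)"
unit_id = (5, 7)  # the characters "40"

print(aligned(unit_id, rough_boundaries(text)))
print(aligned(unit_id, rough_boundaries(text, numeric_infix)))
```

Without the infix rule "40/41," is treated as one chunk, so offset 7 is not a boundary and the `unit_id` span cannot be aligned. Splitting on "/" creates the boundary the label needs.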

In [38]:
training_data = spacy_data

training_data[i]


{'datapoint_id': 1110,
 'text': 'flat 40/41, aldford house, park street, london (w1k 7lg)',
 'entities': [[0, 4, 'unit_type'],
  [5, 7, 'unit_id'],
  [12, 25, 'building_name'],
  [27, 38, 'street_name'],
  [40, 46, 'city'],
  [48, 55, 'postcode']]}

In [39]:
#nlp = spacy.blank("en")
training_data = spacy_data
# the DocBin will store the example documents
db = DocBin()
for i in range(i,i+1):
    current_set = training_data[i]
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        print(span)
        ents.append(span)
    doc.ents = ents

flat
None
aldford house
park street
london
w1k 7lg


TypeError: object of type 'NoneType' has no len()

In [27]:
!python -m spacy init fill-config /tf/empty_homes_london/base_config.cfg /tf/empty_homes_london/config.cfg

✔ Auto-filled config with all values
✔ Saved config
/tf/empty_homes_london/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
#This is the code to run the training model
#!python -m spacy train config.cfg --paths.train /tf/data/spacy_data/train.spacy --paths.dev /tf/data/spacy_data/dev.spacy --output /tf/data/spacy_data/ --gpu-id 1
#!python -m spacy train config.cfg --paths.train /home/jonno/data/spacy_data/train.spacy --paths.dev /home/jonno/data/spacy_data/dev.spacy --output /home/jonno/data/spacy_data/cpu
#python -m spacy train ./spacy_config_files/cpu_config.cfg --paths.train ./data/spacy_data/train.spacy --paths.dev ./data/spacy_data/dev.spacy --output ./data/spacy_data/cpu3



# Predicting using spacy

The chunks below are used to predict labels from the data using spaCy. There are some issues that appear to be related to a recent update of CUDA, which has caused a variety of problems. As such, this part of the code is being kept separate and some of the code choices may look very strange.

Interestingly, the performance of the spaCy model is pretty much identical to the labels created using rules. This suggests that the labelling is probably very good, and also that spaCy is overfitting.
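
The prediction cell further down processes the addresses in five slices using hard-coded `lower` and `upper` lists. As a hedged alternative, the slice bounds could be computed from the number of rows. `chunk_bounds` is a hypothetical helper and is not part of the project's helper modules.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (lower, upper) index pairs that together cover range(n_rows)."""
    return [
        (lower, min(lower + chunk_size, n_rows))
        for lower in range(0, n_rows, chunk_size)
    ]


# Small numbers are used here so the result is easy to inspect
for lower, upper in chunk_bounds(95, 20):
    print(lower, upper)
```

Each pair can be used directly as a slice, e.g. `ocod_context[lower:upper]`, so the last chunk is shortened automatically when the data does not divide evenly.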

In [4]:
import json
import spacy

from address_parsing_helper_functions import load_and_prep_OCOD_data

import os
print(os.getcwd())

#spacy.require_gpu()
#spacy.prefer_gpu()

nlp1 = spacy.load("/tf/data/spacy_data/cpu/model-best") 

ocod_data = load_and_prep_OCOD_data('/tf/data/' +'OCOD_FULL_2022_02.csv')

/tf/empty_homes_london




In [10]:
#transformer takes about 85 minutes with cpu and 2.35 minutes with GPU
#However if there is a recent update to pytorch there can be problems with the GPU inference https://github.com/explosion/spaCy/issues/8229
#In addition if nvidia updates something Docker can have difficult to resolve bugs

import time

start_time = time.time()
#with torch.no_grad():
spacy_docs_list = list(nlp1.pipe(ocod_data.loc[1:100,'property_address']))
end_time = time.time()

print(end_time - start_time)

#This is a runtime comparison of spaCy using CPU and GPU.
#GPU about 5-6 times faster for inference on a transformer
#https://github.com/BlueBrain/Search/issues/337

0.3141946792602539


In [26]:
spacy_model_path = "/tf/data/spacy_data/cpu/model-best"

def spacy_pred_fn(spacy_model_path, ocod_data, print_every =1000):
    nlp1 = spacy.load(spacy_model_path) 

    ocod_context = [(ocod_data.loc[x,'property_address'], {'datapoint_id':x}) for x in range(0,ocod_data.shape[0])]
    i = 0
    all_entities_json = []        
    for doc, context in list(nlp1.pipe(ocod_context[0:1000], as_tuples = True)):

        #This doesn't print as it is a stream not a conventional loop
        #if i%print_every==0: print("doc ", i, " of "+ str(ocod_data.shape[0]))
        #i = i+1

        temp = doc.to_json()
        temp.pop('tokens')
        temp.update({'datapoint_id':context['datapoint_id']})
        #accumulate into all_entities_json; the original referenced spacy_docs_list, which discarded earlier results
        all_entities_json.append(temp)

    all_entities = pd.json_normalize(all_entities_json, record_path = "ents", meta= ['text', 'datapoint_id'])
    
    return all_entities

In [13]:
test = spacy_pred_fn(spacy_model_path, ocod_data, print_every =1000)

test

NameError: name 'spacy_pred_fn' is not defined

In [2]:
target_data = ocod_data.loc[:,'property_address']
lower = [0, 20000,40000,60000,80000]
upper = [20000,40000,60000,80000,len(target_data)]

ocod_context = [(ocod_data.loc[x,'property_address'], {'datapoint_id':x}) for x in range(0,ocod_data.shape[0])]

print('begin getting labels')

for x in range(0,5):
    print(x)
    spacy_docs_list = []        
    for doc, context in list(nlp1.pipe(ocod_context[lower[x]:upper[x]], as_tuples = True)):
        temp = doc.to_json()
        temp.pop('tokens')
        temp.update({'datapoint_id':context['datapoint_id']})
        spacy_docs_list = spacy_docs_list + [temp]
        
    file_name = '/home/jonno/data/spacy_pred_labels' + str(x) + '.json'
    with open(file_name, 'w') as f:
        json.dump(spacy_docs_list, f)
        
all_entities_json = []
for x in range(0,5):
    print(x)
    file_name = '/home/jonno/data/spacy_pred_labels' + str(x) + '.json'
    with open(file_name) as f:
        all_entities_json = all_entities_json  + json.load(f)
        
all_entities = pd.json_normalize(all_entities_json, record_path = "ents", meta= ['text', 'datapoint_id'])

all_entities.to_csv('/home/jonno/data/spacy_preds_normalised.csv')


begin getting labels
0
1
2
3
4


In [9]:
#combine the five chunk files into a single list of labels
spacy_labels = []
for x in range(0,5):
    file_name = '/home/jonno/data/spacy_pred_labels' + str(x) + '.json'
    with open(file_name) as f:
        spacy_labels = spacy_labels + json.load(f)

In [12]:
with open('/home/jonno/data/spacy_pred_labels.json', 'w') as f:
    json.dump(spacy_labels, f)

In [4]:
!python ~/empty_homes_london/full_ocod_parse_process.py /home/jonno/data/

Loading the spaCy model
Traceback (most recent call last):
  File "/home/jonno/empty_homes_london/full_ocod_parse_process.py", line 21, in <module>
    all_entities = spacy_pred_fn(spacy_model_path = root_path+'full_dataset_no_overlaps.json', ocod_data = ocod_data)
  File "/home/jonno/empty_homes_london/address_parsing_helper_functions.py", line 524, in spacy_pred_fn
    nlp1 = spacy.load(spacy_model_path) 
  File "/home/jonno/parse_process/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/home/jonno/parse_process/lib/python3.8/site-packages/spacy/util.py", line 422, in load_model
    return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
  File "/home/jonno/parse_process/lib/python3.8/site-packages/spacy/util.py", line 484, in load_model_from_path
    meta = get_model_meta(model_path)
  File "/home/jonno/parse_process/lib/python3.8/site-packages/spacy/util.py", line 856, in g