# Unit tag and Span cleaning

This script combines two steps in the OCOD processing pipeline:

* Unit tagging
* Removing overlapping spans

These two processes are separated by the weak labelling step in Humanloop, but as they are relatively simple they are included in a single script. In the pipeline below, the steps covered by this script are shown in bold.

* **Raw CSV loaded and lightly processed**. Output: two-column CSV (property address, unit tag)
* Data labelled in Humanloop Programmatic. Output: JSON file of entities
* **Programmatic output JSON cleaned, ordered and overlaps removed**. Output: JSON file
* Clean JSON converted to dataframe and multi-addresses expanded. Output: CSV
* Count and locate addresses
* Create address matcher and match businesses
* Classify address types

## Unit tagging

This part of the pipeline adds a binary value indicating whether the line contains flats, units, stores, etc., which are likely to have a unit-level ID. This is important because such addresses are likely to have a unit ID AND a street number, and so need to be treated with care.
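As a small illustration of the tagging idea, the sketch below applies the same style of pattern used in this notebook to three made-up addresses. The road list is shortened for brevity and the names `unit_pattern`, `examples` and `flags` are only illustrative; the point is that the negative lookahead stops "court" from counting as a unit when it is followed by a road type.

```python
import re

# Patterns in the style of this notebook (road list shortened for brevity)
flatregex = r"(flat|apartment|penthouse|unit)"
road_regex = r"((road|street|lane|way|avenue|close|drive|gardens))\b( east| west| north| south)?"
# "court" etc. only count as a unit when NOT followed by a road type
flatregex2 = r"(mansions|villa|court)\b(?!(\s" + road_regex + "))"
unit_pattern = re.compile(flatregex + "|" + flatregex2, flags=re.IGNORECASE)

# Made-up example addresses
examples = [
    "flat 1, 1a canal street, manchester",
    "12 stanley court, london",
    "12 earls court road, london",
]

# True when the address looks like it contains a sub-unit
flags = {address: bool(unit_pattern.search(address)) for address in examples}
for address, has_unit in flags.items():
    print(has_unit, address)
```

Only the first two addresses are flagged; "earls court road" is treated as a road name rather than a building with units.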

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import re
import random
from helper_functions import *

In [2]:
ocod_data =  pd.read_csv('/tf/empty_homes_data/' +
                    'OCOD_FULL_2022_02.csv',
                   encoding_errors= 'ignore').rename(columns = lambda x: x.lower().replace(" ", "_"))
ocod_data['postcode'] = ocod_data['postcode'].str.upper()
#empty addresses cannot be used. however there are only three so not a problem
ocod_data = ocod_data.dropna(subset = 'property_address')
ocod_data.reset_index(inplace = True, drop = True)
ocod_data['property_address'] = ocod_data['property_address'].str.lower()

#ensure there is a space after commas
#This is because some numbers are written as 1,2,3,4,5 which causes issues during tokenisation
ocod_data.property_address = ocod_data.property_address.str.replace(',', ', ', regex=False)
#remove multiple spaces; regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
ocod_data.property_address = ocod_data.property_address.str.replace(r'\s{2,}', ' ', regex=True)


#different words associated with unit ID's
flatregex = r"(flat|apartment|penthouse|unit)" #unit|store|storage these a

#This is not an exhaustive list of road names but it covers about 80% of all road types in the VOA business register.
#The cardinal directions are included as an option as they can appear after the road type. However they serve no real purpose in this particular regex and are
#included for completeness
road_regex  = r"((road|street|lane|way|gate|avenue|close|drive|hill|place|terrace|crescent|gardens|square|walk|grove|mews|row|view|boulevard|pleasant|vale|yard|chase|rise|green|passage|friars|viaduct|promenade|end|ridge|embankment|villas|circus))\b( east| west| north| south)?"
#These names may be followed by a road type e.g. Earls court road. A negative lookahead is used to prevent these roads being tagged as units.
flatregex2 = r"(mansions|villa|court)\b(?!(\s"+road_regex+"))"

#flat_tag is used for legacy reasons but refers to sub-units in general
ocod_data['flat_tag'] = ocod_data['property_address'].str.contains(flatregex + '|'+flatregex2, case = False)

ocod_data['commercial_park_tag'] = ocod_data['property_address'].str.contains(r"(retail|industrial|commercial|business|distribution|car)", case = False)

#typo in the data leads to a large number of fake flats
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("stanley court ", "stanley court, ")
#This typo leads to some rather silly addresses
ocod_data.loc[:, 'property_address'] = ocod_data['property_address'].str.replace("100-1124", "100-112")

#only the columns needed for the humanloop labelling process are exported
ocod_data[['property_address', 'flat_tag', 'commercial_park_tag', 'title_number']].rename(columns = {'property_address':'text'}).to_csv('/tf/empty_homes_data/property_address_only.csv')



In [3]:
#Create the index for the ground truth
random.seed(2017)
test_set = random.sample([*range(0, ocod_data.shape[0])], 1000) 

test_set = ocod_data.loc[test_set, 'title_number'].reset_index().rename(columns = {'index':'datapoint_id'})

test_set.to_csv('/tf/empty_homes_data/test_set_indices_space_cleaned.csv')

In [14]:
test = "I like eating cheese and ham"

coords = [(m.start(), m.end()) for m in re.finditer("load of old trousers", test)]

coords

[]

## Labelling in Humanloop

This part of the process uses the Humanloop Programmatic app and is an external process. Once the labelling step is complete, it outputs a JSON file containing the labels and spans, which is cleaned in the next step.
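For orientation, the sketch below shows the layout of that JSON file as inferred from the keys accessed later in this notebook (`datapoints`, `programmatic`, `results`, and `start`/`end`/`label`/`text` on each result). The `export` dictionary is a hand-written stand-in, not real Humanloop output, and `spans_for` is a hypothetical helper.

```python
# Hand-written stand-in for the exported file; key names mirror those used
# later in this notebook, the values are illustrative only.
export = {
    "datapoints": [
        {
            "programmatic": {
                "results": [
                    {"start": 0, "end": 4, "label": "unit_type", "text": "flat"},
                    {"start": 5, "end": 6, "label": "unit_id", "text": "1"},
                ]
            }
        }
    ]
}


def spans_for(export, datapoint_id):
    """Return (start, end, label) tuples for one datapoint (hypothetical helper)."""
    results = export["datapoints"][datapoint_id]["programmatic"]["results"]
    return [(r["start"], r["end"], r["label"]) for r in results]


first_spans = spans_for(export, 0)
print(first_spans)
```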

## Removing overlapping spans

During the Humanloop tagging process the rules may tag the same words as part of multiple spans. This often occurs for road names made up of multiple parts,
e.g. Canberra Crescent Gardens may be tagged as both Canberra Crescent and Canberra Crescent Gardens. The overlaps need to be removed before further processing.
For simplicity, the larger of any two overlapping spans is kept and the smaller is removed.
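The rule can be sketched as follows. This is a minimal illustration and not the `remove_overlapping_spans2` function from `helper_functions`, whose implementation is not shown here; `drop_overlaps` is a hypothetical name and the example offsets are made up for the text "canberra crescent gardens, london".

```python
import operator


def drop_overlaps(spans):
    """Keep the longer span of each overlapping pair (hypothetical helper)."""
    kept = []
    # Sorting by start point means a span can only overlap the last kept span
    for span in sorted(spans, key=operator.itemgetter("start")):
        if kept and span["start"] < kept[-1]["end"]:
            previous = kept[-1]
            # Overlap found: replace the kept span only if the new one is longer
            if span["end"] - span["start"] > previous["end"] - previous["start"]:
                kept[-1] = span
        else:
            kept.append(span)
    return kept


labels = [
    {"start": 0, "end": 17, "label": "street_name"},  # "canberra crescent"
    {"start": 0, "end": 25, "label": "street_name"},  # "canberra crescent gardens"
    {"start": 27, "end": 33, "label": "city"},        # "london"
]
cleaned = drop_overlaps(labels)
print(cleaned)
```

The shorter "canberra crescent" span is dropped, while the non-overlapping city span is left untouched.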

In [4]:
#These libraries are specific to this part of the process
import json
import requests 
import config #contains hidden api key
import operator #used for sorting the label dictionaries by start point. This is the basis for removing overlaps

In [9]:
# Open the JSON file; the context manager closes it once it has been read
with open("/tf/empty_homes_data/test.json") as f:  #aggregate and download button
    # returns JSON object as
    # a dictionary
    data = json.load(f)

#this makes a list of all the observation rows. These refer to the row of the original observation text and so can be linked back to the original OCOD dataset.


In [10]:

datapoint_id_list = [*range(0,len(data['datapoints']))]

data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):

    count_it += 1
    if count_it % 5000 == 0: 
        print('count = {}'.format(count_it))
        
    #single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
   # list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    list_of_labels_dict = results_list = data['datapoints'][i]['programmatic']['results']
    ##these labels are in dictionary form
#     list_of_labels_dict = [{'start': x['start'], 
#                             'end':x['end'], 
#                             'label': x['label'], 
#                             'label_text': x['text'] } for x in results_list]
    
    #this in-place sort uses operator.itemgetter to order the label dictionaries by start point; Python's sort is stable, so ties keep their original order
    #it requires the operator library
    list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
    list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)

    #create the NER dataset structure shown on the spacy website
   # data_and_labels = data_and_labels + [ ( ocod_data['property_address'][i], list_of_labels ) ]
    #create a list of dictionaries using a similar structure to save as a json
    data_labels_dict = data_labels_dict + [
        {
            'text' : ocod_data['property_address'][i],
            'labels' : list_of_labels_dict,
            'datapoint_id': i,
            'title_number':ocod_data['title_number'][i]
        }
    ]
    
#Save the cleaned data back as a json file ready to be processed further  
with open('/tf/empty_homes_data/full_dataset_no_overlaps.json', 'w') as f:
    json.dump(data_labels_dict, f)

count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000


### Uploading to humanloop cloud

This allows a sample of the data to be uploaded to the humanloop cloud so that an example model can be made.
The model provides another way to check the quality of the weak labelling. However, only 10k observations can be uploaded, so a sub-sample is used.
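A reproducible sub-sample that respects the cap could be drawn along these lines. This is only a sketch: `sample_indices` and `MAX_UPLOAD` are hypothetical names, the seed simply reuses the value used for the ground-truth index earlier, and the row count of 90,000 is illustrative.

```python
import random

MAX_UPLOAD = 10_000  # upload cap described above


def sample_indices(n_rows, max_rows=MAX_UPLOAD, seed=2017):
    """Return a reproducible, sorted sample of at most max_rows row indices."""
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    return sorted(rng.sample(range(n_rows), min(n_rows, max_rows)))


subset = sample_indices(90_000)
print(len(subset))
```

Using a local `random.Random` instance keeps the sample stable across runs without affecting other code that relies on the global random state.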

In [26]:

# jason_test_data ={
#      "name": "test_set_24_04_22",
#      "description": "the ground truth test set for labelling",
#      "fields": [
#          {"name": "text", 
#           "data_type": "text"
#          },
#          {"name": "labels", 
#           "data_type": "character_offsets"},
#          {"name": "datapoint_id", 
#           "data_type": "text"
#          }
#      ],
#      "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
#    # "complete":False
# }

# #url = "https://api.humanloop.com/datasets"
# url = "https://api.humanloop.com/projects/1369/add-tasks"
# # replace payload with your actual dataset...
# payload= json.dumps(jason_test_data)
# headers = {
#   'X-API-Key': config.api_key,#the api key is hidden in a config file
#   'Content-Type': 'application/json'
# }

# response = requests.request("POST", url, headers=headers, data=payload)

# print(response.text)

{"detail":{"loc":["data"],"msg":"The following fields were specified but did not exist in your dataset: span_tagging","description":"The following fields were specified but did not exist in your dataset: span_tagging","type":"value_error"}}


In [62]:


# datapoint_id_list = test_set.loc[:,'datapoint_id'].to_list()

# data_and_labels = []
# data_labels_dict2 = []

# count_it = 0
# for i in datapoint_id_list:

#     count_it += 1
#     if count_it % 100 == 0: 
#         print('count = {}'.format(count_it))
        
#     list_of_labels_dict = results_list = data['datapoints'][i]['programmatic']['results']
    
#     #this inplace sorting using operator orders the dictionary by the start point. ties are automatically broken
#     #it required the operator library
#     list_of_labels_dict.sort(key=operator.itemgetter('start'))
    
#     list_of_labels_dict = remove_overlapping_spans2(list_of_labels_dict)
#     #create a list of dictionaries using a similar structure to save as a json
#     data_labels_dict2 = data_labels_dict + [
#         {
#             'text' : ocod_data['property_address'][i],
#             'entities' : list_of_labels_dict,
#             'datapoint_id': i,
#           #  'title_number':ocod_data['title_number'][i]
#         }
#     ]

count = 100
count = 200
count = 300
count = 400
count = 500
count = 600
count = 700
count = 800
count = 900
count = 1000


In [11]:
jason_test_data ={
     "name": "test_set_24_04_22",
     "description": "the ground truth test set for labelling",
     "fields": [
         {"name": "text", 
          "data_type": "text"
         },
         {"name": "labels", 
          "data_type": "character_offsets"},
         {"name": "datapoint_id", 
          "data_type": "text"
         }
     ],
     "data": [data_labels_dict[x] for x in test_set.loc[:,'datapoint_id']]#, #upload only data from the test set
}

# Create new project with unlabelled data


In [12]:

"""
Step 1: Specify URL and headers for your API requests and some helper methods 
Reference: https://api.humanloop.com/docs#section/Authentication
Notes: 
    - If you don't already have a Humanloop account,
      signup @ https://app.humanloop.com/signup
    - Replace <INSERT YOUR API KEY HERE> with your users X-API-KEY 
      @ https://app.humanloop.com/profile
"""
base_url = "https://api.humanloop.com"
headers = {
    "Content-Type": "application/json",
    "X-API-KEY":  config.api_key,#the api key is hidden in a config file
}
# use the email associated to your Humanloop account
project_owner = "jonathan.s.bourne@gmail.com"


def get_field_id_by_name(name: str, fields):
    """Helper method for parsing field_id from dataset.fields given the name"""
    return [field for field in fields if field["name"] == name][0]["id"]


""" 
Step 2: Create a dataset
Reference: https://api.humanloop.com/docs#operation/upload_data_datasets_post
Notes:
    - It can be helpful to include your own unique identifiers for your data-points
      if available so that you can easily correlate any annotations and predictions 
      created by Humanloop back to your system.
    - If using large datasets (> 10k rows), you will have to upload it in multiple 
      batches using the API. Starting with the POST as shown below, then adding 
      subsequent batches using the PUT against the newly created dataset 
      (https://api.humanloop.com/docs#operation/update_data_datasets__id__put.)
"""

dataset_fields = requests.post(
    url=f"{base_url}/datasets", data=json.dumps(jason_test_data), headers=headers
).json()["fields"]


"""
Step 3: Create a project
Reference: https://api.humanloop.com/docs#operation/create_project_projects_post
Notes:
    - A Humanloop project is made up of one or more datasets, a team of annotators 
      and a model. As your team begin to annotate the data, a model is trained in real 
      time and used to prioritise what data your annotators should focus on next 
      (see https://humanloop.com/blog/why-you-should-be-using-active-learning).
    - The project inputs specify those dataset fields you wish to show to 
      your annotators and the model. 
    - The project output specifies the type of model you wish to train and the 
      corresponding label taxonomy. 
    - If your dataset has a field with existing annotations, you can use this to 
      warm start your project as shown in the following examples. 
      If you want your team to first review these existing annotations in Humanloop, 
      set "review_existing_annotations" to True, otherwise they will be used 
      automatically to train an initial model.
    - Both classification (single and multi-label) and extraction
      projects are supported.
    - You can update your project with more data by either connecting another dataset 
      or simply adding additional data-points to your existing dataset. 
      Alternatively, you can submit tasks for your model and/or team to complete
      (see our Human-in-the-loop tutorial for more information on this!).
"""


"""
Step 3b: Span extraction project 
"""
extraction_project_request = {
    "name": "Ground truth for offshore empties V2 includes comma space",
    "inputs": [
        {
            "name": "text",
            "data_type": "text",
            "description": "unparsed addresses",
            "data_sources": [
                {"field_id": get_field_id_by_name("text", dataset_fields)}
            ],
        },
        {
            "name": "datapoint_id",
            "data_type": "text",
            "description": "The original row the data is on",
            "display_only": True,
            "data_sources": [
                {"field_id": get_field_id_by_name("datapoint_id", dataset_fields)}
            ],
        },
    ],
    "outputs": [
        {
            "name": "labels",
            "description": "entities address parts",
            "task_type": "sequence_tagging",
            "data_sources": [
                {"field_id": get_field_id_by_name("labels", dataset_fields)}
            ],
            # which input you wish your model to extract from
            "input": "text",
        }
    ],
    "users": [project_owner],
    "guidelines": "Insert your markdown annotator guidelines here",
    "review_existing_annotations": True,
}

extraction_project_id = requests.post(
    url=f"{base_url}/projects",
    data=json.dumps(extraction_project_request),
    headers=headers,
).json()["id"]

print(f"Navigate to https://app.humanloop.com/projects/{extraction_project_id}")

Navigate to https://app.humanloop.com/projects/1590


## Create spacy old format

In [14]:
with open('/tf/empty_homes_data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_labels, f)

NameError: name 'data_labels' is not defined

In [77]:
datapoint_id_list = [*range(0,len(data['datapoints']))]
data_and_labels = []
data_labels_dict = []

count_it = 0
for i in set(datapoint_id_list):
    count_it += 1
    if count_it % 5000 == 0: 
        print('count = {}'.format(count_it))
        
    single_id_index = np.where(np.array(datapoint_id_list)==i)
    ##these labels are in tuple form
    
    results_list = data['datapoints'][i]['programmatic']['results']
    list_of_labels =[(x['start'],x['end'],x['label'] ) for x in results_list]
    
    #list_of_labels = [(data[x]['start'], data[x]['end'], data[x]['label']) for x in single_id_index[0].tolist()]
    
    list_of_labels.sort(key=lambda y: y[0])
    
    list_of_labels = remove_overlapping_spans_tuples(list_of_labels)
    #print(list_of_labels)
    #create the NER dataset structure shown on the spacy website
    data_and_labels = data_and_labels + [ {
        'datapoint_id': i,
        'text':ocod_data['property_address'][i], 
                                           'entities':list_of_labels}  ]

#Save the cleaned data back as a json file ready to be processed further  
with open('/tf/empty_homes_data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_and_labels, f)

count = 5000
count = 10000
count = 15000
count = 20000
count = 25000
count = 30000
count = 35000
count = 40000
count = 45000
count = 50000
count = 55000
count = 60000
count = 65000
count = 70000
count = 75000
count = 80000
count = 85000
count = 90000


In [78]:
with open('/tf/empty_homes_data/humanloop_spacy_format.json', 'w') as f:
    json.dump(data_and_labels, f)

In [80]:
#reload the saved file; the context manager closes it once it has been read
with open('/tf/empty_homes_data/humanloop_spacy_format.json') as f:
    spacy_data = json.load(f)

In [33]:
spacy_data

[{'datapoint_id': 0,
  'text': 'westleigh lodge care home, nel pan lane, leigh (wn7 5jt)',
  'entities': [[0, 25, 'building_name'],
   [27, 39, 'street_name'],
   [41, 46, 'city'],
   [48, 55, 'postcode']]},
 {'datapoint_id': 1,
  'text': 'flat 1, 1a canal street, manchester (m1 3he)',
  'entities': [[0, 4, 'unit_type'],
   [5, 6, 'unit_id'],
   [8, 10, 'street_number'],
   [11, 23, 'street_name'],
   [25, 35, 'city'],
   [37, 43, 'postcode']]},
 {'datapoint_id': 2,
  'text': 'flat 201, 1 regent road, manchester (m3 4ay)',
  'entities': [[0, 4, 'unit_type'],
   [5, 8, 'unit_id'],
   [10, 11, 'street_number'],
   [12, 23, 'street_name'],
   [25, 35, 'city'],
   [37, 43, 'postcode']]},
 {'datapoint_id': 3,
  'text': 'land at 2a gerard street, ashton in makerfield, wigan (wn4 9aa)',
  'entities': [[0, 4, 'unit_type'],
   [11, 24, 'street_name'],
   [48, 53, 'city'],
   [55, 62, 'postcode']]},
 {'datapoint_id': 4,
  'text': 'unit 111, timber wharf, worsley street, manchester (m15 4nz)',
  

In [138]:
import spacy
from spacy.tokens import DocBin
from spacy.lang.char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from spacy.lang.char_classes import LIST_ICONS, HYPHENS, CURRENCY, UNITS
from spacy.lang.char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
from spacy.util import compile_infix_regex

alignment_mode_type = "strict"

nlp = spacy.blank("en")

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\,*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer


training_data = spacy_data
# the DocBin will store the example documents
db = DocBin()
for i in range(83, len(training_data)):
    current_set = training_data[i]
    print(i)
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("/tf/empty_homes_data/spacy_data/train.spacy")

83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196


TypeError: object of type 'NoneType' has no len()

In [86]:
training_data = [{
  'text': 'westleigh lodge care home, nel pan lane, leigh (wn7 5jt)',
  'entities': [[0, 25, 'building_name'],
   [27, 39, 'street_name'],
   [41, 46, 'city'],
   [48, 55, 'postcode']]},
{
  'text': 'land at 2a gerard street, ashton in makerfield, wigan (wn4 9aa)',
  'entities': [(0, 4, 'unit_type'),
   (11, 24, 'street_name'),
   (48, 53, 'city'),
   (55, 62, 'postcode')]},
 {
  'text': 'unit 111, timber wharf, worsley street, manchester (m15 4nz)',
  'entities': [(0, 4, 'unit_type'),
   (5, 8, 'unit_id'),
   (10, 23, 'building_name'),
   (24, 38, 'street_name'),
   (40, 50, 'city'),
   (52, 59, 'postcode')]}]

training_data[0]['entities']

db = DocBin()
for i in range(0, len(training_data)):
    current_set = training_data[i]
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)

In [139]:
training_data = spacy_data

training_data[i]


{'datapoint_id': 196,
 'text': '5, 7 and 9 seven sisters road, london (n7 6aj)',
 'entities': [[0, 1, 'street_number'],
  [2, 3, 'street_number'],
  [8, 9, 'street_number'],
  [10, 28, 'street_name'],
  [31, 37, 'city'],
  [38, 44, 'postcode']]}

In [140]:
#nlp = spacy.blank("en")
training_data = spacy_data
# the DocBin will store the example documents
db = DocBin()
for i in range(i,i+1):
    current_set = training_data[i]
    doc = nlp(current_set['text'])
    ents = []
    for start, end, label in current_set['entities']:
        span = doc.char_span(start, end, label=label, alignment_mode = alignment_mode_type )
        print(span)
        ents.append(span)
    doc.ents = ents

5
None
None
None
london
None


TypeError: object of type 'NoneType' has no len()

In [143]:
training_data[i]['text'][8:9]

' '
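The slice above shows that the stored offsets point at whitespace rather than at the street number, which is why `doc.char_span` returns `None` under strict alignment and assigning `doc.ents` then fails. A rough way to find such entities before building the `DocBin` is sketched below. It is a plain-Python proxy that checks for surrounding whitespace and mid-word cuts; it does not reproduce spaCy's tokenizer, and `misaligned_entities` is a hypothetical helper name.

```python
def misaligned_entities(text, entities):
    """Return entities whose character offsets do not sit on word boundaries."""
    bad = []
    for start, end, label in entities:
        piece = text[start:end]
        # The span begins or ends in the middle of an alphanumeric run
        starts_mid_word = start > 0 and text[start - 1].isalnum() and piece[:1].isalnum()
        ends_mid_word = end < len(text) and piece[-1:].isalnum() and text[end].isalnum()
        # Empty spans and spans with leading/trailing whitespace are also suspect
        if not piece or piece != piece.strip() or starts_mid_word or ends_mid_word:
            bad.append((start, end, label))
    return bad


text = "5, 7 and 9 seven sisters road, london (n7 6aj)"
entities = [
    (0, 1, "street_number"), (2, 3, "street_number"), (8, 9, "street_number"),
    (10, 28, "street_name"), (31, 37, "city"), (38, 44, "postcode"),
]
flagged = misaligned_entities(text, entities)
for start, end, label in flagged:
    print(label, repr(text[start:end]))
```

For this datapoint the check flags the same four entities that printed `None` above, leaving only the first street number and the city.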

In [141]:
doc = nlp(current_set['text'])
for token in doc:
    print(token.text)

5
,
7
and
9
seven
sisters
road
,
london
(
n7
6aj
)


In [1]:
import re
texts = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
rx = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')
for text in texts:
    m = re.search(rx, text)
    if m:
        print(m.group(1))

Major city
town
village on the river


In [22]:
import re
texts = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']

search_pattern = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')

texts = ocod_data['property_address'].tolist()

holding_list = []

for text in texts:

    match = re.search(search_pattern, text)

    if match:
        start = match.span(1)[0]
        end = match.span(1)[1]
        temp_tuple = (match.group(1), match.span(1)[0],match.span(1)[1], match.start(), match.end())
        #print((match.group(1), match.span(1)[0],match.span(1)[1], match.start(), match.end()))
        if start >=end:
            print()
            
        holding_list = holding_list +[temp_tuple]
   # if m:
   #     print(m.group(1))




In [21]:
holding_list

[('leigh (wn7 5jt)', 41, 56, 0, 56),
 ('manchester (m1 3he)', 25, 44, 0, 44),
 ('manchester (m3 4ay)', 25, 44, 0, 44),
 ('wigan (wn4 9aa)', 48, 63, 0, 63),
 ('manchester (m15 4nz)', 40, 60, 0, 60),
 ('peterborough (pe2 8ns)', 18, 40, 0, 40),
 ('peterborough (pe2 8nr)', 16, 38, 0, 38),
 ('peterborough', 19, 31, 0, 31),
 ('peterborough', 32, 44, 0, 44),
 ('peterborough', 27, 39, 0, 39),
 ('peterborough (pe2 8nr)', 17, 39, 0, 39),
 ('peterborough (pe2 8ns)', 19, 41, 0, 41),
 ('peterborough (pe2 8ne)', 17, 39, 0, 39),
 ('peterborough', 86, 98, 0, 98),
 ('peterborough (pe2 8ns)', 19, 41, 0, 41),
 ('peterborough (pe2 8nr)', 17, 39, 0, 39),
 ('peterborough (pe2 8ns)', 19, 41, 0, 41),
 ('cambridge', 20, 29, 0, 29),
 ('peterborough (pe2 8ns)', 19, 41, 0, 41),
 ('peterborough (pe2 8ny)', 19, 41, 0, 41),
 ('cambridge cb1 2fx', 24, 41, 0, 41),
 ('barnsley (s70 5xf)', 29, 47, 0, 47),
 ('barnsley (s70 5xf)', 29, 47, 0, 47),
 ('barnsley (s70 5tj)', 30, 48, 0, 48),
 ('barnsley (s70 5tj)', 30, 48, 0, 4