# Locating and classifying the expanded OCOD dataset

This notebook runs through the process of locating properties within the OA/LSOA system and classifying each property as one of five types (airspace, business, carpark, domestic, land) or 'unknown'.

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import pandas as pd
import numpy as np
import os
import re
import io
import zipfile
#from helper_functions import *
from locate_and_classify_helper_functions import *


In [2]:
print("load ONSPD")
# zip file handler (named onspd_zip so the built-in zip() is not shadowed)
root_path = "./data/"
onspd_zip = zipfile.ZipFile(root_path + 'ONSPD.zip')
# looks in the archive's Data folder for a csv file whose name begins with ONSPD
# This will break if the ONS change the archive structure
target_zipped_file = [i for i in onspd_zip.namelist() if re.search(r'^Data/ONSPD.+csv$', i)][0]
postcode_district_lookup = load_postocde_district_lookup(root_path + "ONSPD.zip", target_zipped_file)
print("load expanded ocod")
ocod_data =  pd.read_csv("./data/OCOD_cleaned_expanded3.csv")
print("pre-process expanded ocod data")
ocod_data = preprocess_expandaded_ocod_data(ocod_data, postcode_district_lookup)
print("load and pre-process the Land Registry price paid dataset")
price_paid_df = load_and_process_pricepaid_data("./data/price_paid_files/", postcode_district_lookup)
print("add in missing Local authority codes to the ocoda dataset")
#ocod_data = add_missing_lads_ocod(ocod_data, price_paid_df)
print("load and pre-process the voa business ratings list dataset")
#voa_businesses = load_voa_ratinglist('./data/' +'VOA_ratings.csv', postcode_district_lookup)
#del postcode_district_lookup

load ONSPD


  postcode_district_lookup = pd.read_csv(f)[['pcds','oslaua','oa11','lsoa11', 'msoa11', 'ctry']]


load expanded ocod
pre-process expanded ocod data
load and pre-process the Land Registry price paid dataset


FileNotFoundError: [Errno 2] No such file or directory: './data/price_paid_files/'

## Using price paid data to match names

The Land Registry does not use standardised LAD codes or names, and the LAD names it does use sometimes appear to be wrong. I need to know the LAD of each property so that road matching is only attempted within a single local authority, which minimises the chance of the same road name appearing twice. To get around this, I use the substantially larger price paid dataset to collect all the Land Registry district names and match them to the ONSPD using postcodes. This works because each district has a large number of sales and most of them have a postcode. In some cases the wrong district or postcode is recorded, so a single district name can map to two or more lad11cd values; to resolve this I simply take the lad11cd with the highest count.

The resulting OCOD data frame has a LAD11CD for each entry, which allows the road matching to work effectively.
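The majority-vote step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `majority_lad_per_district` and the toy sales records are hypothetical, and the notebook's own helper (not shown here) presumably does the equivalent with pandas.

```python
from collections import Counter, defaultdict

def majority_lad_per_district(sales):
    """Map each district name to its most frequent LAD code.

    `sales` is an iterable of (district_name, lad11cd) pairs; pairs whose
    LAD code is None (no postcode match) are ignored.
    """
    counts = defaultdict(Counter)
    for district, lad in sales:
        if lad is not None:
            counts[district][lad] += 1
    # most_common(1) returns the (code, count) pair with the highest count
    return {district: c.most_common(1)[0][0] for district, c in counts.items()}

# Hypothetical toy records: one CAMDEN sale carries the wrong LAD code
sales = [
    ("CAMDEN", "E09000007"),
    ("CAMDEN", "E09000007"),
    ("CAMDEN", "E09000033"),
    ("LEEDS", "E08000035"),
    ("LEEDS", None),
]
district_to_lad = majority_lad_per_district(sales)
```

The stray `E09000033` record is outvoted, so every CAMDEN entry would inherit `E09000007`.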

# Street and buildings to match lsoa

This section fills in missing lsoa11cd values using the LAD11CD and the streets within it. It draws on the price paid and VOA data.
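As a rough sketch of the idea, the code below fills a missing LSOA only when the street, within its LAD, sits in exactly one LSOA in the reference data. The function names, the record layout and the `LSOA_A`-style codes are assumptions made for illustration; the real `street_and_building_matching` helper is not shown in this notebook and may work differently.

```python
from collections import defaultdict

def build_street_lookup(reference_rows):
    """Return {(lad11cd, street): lsoa11cd} for streets found in exactly one LSOA."""
    seen = defaultdict(set)
    for lad, street, lsoa in reference_rows:
        seen[(lad, street.lower())].add(lsoa)
    # Ambiguous streets (more than one LSOA) are left out of the lookup
    return {key: next(iter(lsoas)) for key, lsoas in seen.items() if len(lsoas) == 1}

def fill_missing_lsoa(records, lookup):
    """Fill lsoa11cd where it is missing; returns exactly one row per input row."""
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input is not mutated
        if rec.get("lsoa11cd") is None and rec.get("street_name"):
            key = (rec["lad11cd"], rec["street_name"].lower())
            rec["lsoa11cd"] = lookup.get(key)
        filled.append(rec)
    return filled

# Hypothetical reference rows (LAD, street, LSOA), e.g. drawn from price paid data
reference = [
    ("E09000007", "Finchley Road", "LSOA_A"),
    ("E09000007", "Finchley Road", "LSOA_A"),
    ("E09000007", "High Street", "LSOA_A"),
    ("E09000007", "High Street", "LSOA_B"),
    ("E09000033", "High Street", "LSOA_C"),
]
records = [
    {"lad11cd": "E09000007", "street_name": "finchley road", "lsoa11cd": None},
    {"lad11cd": "E09000007", "street_name": "high street", "lsoa11cd": None},
    {"lad11cd": "E09000033", "street_name": "high street", "lsoa11cd": None},
    {"lad11cd": "E09000033", "street_name": None, "lsoa11cd": None},
]
filled = fill_missing_lsoa(records, build_street_lookup(reference))
```

Because the lookup holds at most one LSOA per street and LAD, the fill cannot multiply rows, which is the duplication problem the cell below is careful to avoid.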

In [3]:
##
##This process is quite convoluted and there is certainly a more efficient and pythonic way
## however the order within each filling method is important to ensure that there are no duplicates
## as this causes the OCOD dataset to grow with duplicates
##
ocod_data = street_and_building_matching(ocod_data, price_paid_df, voa_businesses)

replace the missing lsoa using street matching
replace the missing lsoa using building matching
insert newly ID'd LSOA and OA
update missing LSOA and OA for nested properties where at least one nested property has an OA or LSOA


## Matching at sub street level

Some streets lie on the boundary between LSOAs. This section uses the street number to match each property to the nearest LSOA.
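One way to read "nearest" is to take the LSOA of the known address on the same street whose number is closest. The sketch below assumes that interpretation; `nearest_lsoa_by_number`, the tie-break rule and the toy codes are hypothetical and are not taken from `substreet_matching`.

```python
def nearest_lsoa_by_number(street_number, known_numbers):
    """Return the LSOA of the known address with the closest street number.

    `known_numbers` is a list of (number, lsoa11cd) pairs for one street in one
    LAD. Ties are broken in favour of the lower street number (an assumption).
    """
    if not known_numbers:
        return None
    _, lsoa = min(known_numbers, key=lambda pair: (abs(pair[0] - street_number), pair[0]))
    return lsoa

# Hypothetical street that crosses from LSOA_A into LSOA_B
known = [(2, "LSOA_A"), (10, "LSOA_A"), (40, "LSOA_B"), (52, "LSOA_B")]
low_end = nearest_lsoa_by_number(12, known)
high_end = nearest_lsoa_by_number(38, known)
tie = nearest_lsoa_by_number(25, known)  # equally far from 10 and 40
```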

In [4]:
#This takes some time
ocod_data3 = substreet_matching(ocod_data, price_paid_df, voa_businesses)
#percent of dataset without lsoa
ocod_data['lsoa11cd'].isnull().sum()/ocod_data.shape[0]

lad  100  of 244
lad  200  of 244


0.13455019044751895

## Add in counts of businesses per oa and LSOA

In [5]:
#This function allows areas with no  businesses to automatically exclude business from the classification
ocod_data = counts_of_businesses_per_oa_lsoa(ocod_data, voa_businesses)


## What still doesn't have an LSOA?
Which entries still lack an LSOA, and what properties do they have?

In [19]:
pd.crosstab(ocod_data['postcode'].notnull(), ocod_data['lsoa_building'].notnull())

lsoa_building,False,True
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
False,49390,7231
True,81337,21754


In [None]:
test = ocod_data

In [None]:
#observations localised with lsoa and/or oa
pd.crosstab(test['lsoa11cd'].notnull(),  test['oa11cd'].notnull())/ocod_data.shape[0]

In [None]:
#this is definitely the problem then
pd.crosstab(test['lsoa_street'].notnull(),  test['lsoa_building'].notnull())

In [None]:
test2 = test[test['lsoa11cd'].isnull()]
pd.crosstab(test2['property_address'].str.startswith('land') , test2['lsoa_street'].notnull())

In [None]:
#this is definitely the problem then
pd.crosstab(test['street_name'].notnull(),  test['lsoa11cd'].notnull())

In [None]:
test[test['lsoa11cd'].isnull() & test['street_name'].notnull()].to_csv('/tf/data/delete_me.csv')

In [21]:
#95.5% of sets have only a single lsoa when grouped by street, town, district and locality
#when grouped by only street and district, this number is still 90%
#excluding town the number is still 95%, but dropping locality gives a match on 91%, therefore using locality is the key
temp = price_paid_df.groupby(['street', 'district', 'lsoa11cd']).size().reset_index().groupby(['street', 'district']).size()\
.reset_index().rename(columns = {0:'counts'})

#temp.groupby('counts').size()/temp.shape[0]


# VOA matching businesses

The below chunk matches addresses to known businesses

In [7]:
ocod_data = voa_address_match_all_data(ocod_data, voa_businesses)

address matched  0 lads of 331
address matched  50 lads of 331
address matched  100 lads of 331
address matched  150 lads of 331
address matched  200 lads of 331
address matched  250 lads of 331
address matched  300 lads of 331


In [None]:
pd.crosstab(ocod_data['oa_busi_building'].notnull(), ocod_data['business_address'].notnull())

# Classify property type

This section classifies the data into different property types.

## Classification type 1

The land is classified by the rules below, which search the address string or metadata using regex.
The classification is hierarchical, with the first match giving the classification type.
Therefore, if a property matches both rule 3 and rule 6, rule 3 takes precedence and the property is classed as airspace.

- Starts with land/plot (land)
- Parking spaces (carpark)
- Air space (airspace)
- Flats, penthouses, apartments (domestic)
- Address matched businesses (business)
- Keyword relating to business (business)
- Land with other words before it (land)
- Pubs (business)
- A business was matched in the same building (business)
- Is in the same address as a building (business)
- No business in the OA (domestic)
- No business in the LSOA (domestic)
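The first-match-wins behaviour can be sketched with an ordered list of regex rules. The patterns and keywords below are illustrative guesses, not the ones used by `classification_type1`, and only the rules that depend on the address string are included; the rules based on VOA matches and business counts need metadata and are left out.

```python
import re

# Ordered (pattern, class) pairs: earlier rules take precedence over later ones.
# All patterns are hypothetical stand-ins for the notebook's real rules.
RULES = [
    (re.compile(r"^(land|plot)\b"), "land"),                      # starts with land/plot
    (re.compile(r"\b(parking|car park|garage)\b"), "carpark"),    # parking spaces
    (re.compile(r"\bair\s?space\b"), "airspace"),                 # air space
    (re.compile(r"\b(flat|penthouse|apartment)s?\b"), "domestic"),
    (re.compile(r"\b(office|shop|warehouse|hotel)s?\b"), "business"),
    (re.compile(r"\bland\b"), "land"),                            # land with other words before it
]

def classify_address(address):
    """Return the class of the first rule that matches, else 'unknown'."""
    text = address.lower()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return "unknown"

examples = {
    "Land adjoining 4 High Street": classify_address("Land adjoining 4 High Street"),
    "Airspace above the shop at 12 Mill Lane": classify_address("Airspace above the shop at 12 Mill Lane"),
    "Flat 3, 10 King Street": classify_address("Flat 3, 10 King Street"),
    "10 King Street": classify_address("10 King Street"),
}
```

The second example matches both the airspace rule and the business keyword rule; because the airspace rule comes first, the address is classed as airspace.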

After classification, properties still classed as unknown are filled in using the classified properties that share the same title number.
This is possible because there are no conflicting property classes within a given title number, which also indicates the quality of the method.
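A minimal sketch of this fill-in step follows, assuming each record is a dict with `title_number` and `class` keys. The function name is hypothetical; excluding 'airspace' and 'carpark' as donor classes mirrors the filter in the cell below, and leaving titles with conflicting classes untouched is an assumption.

```python
from collections import defaultdict

def fill_unknown_from_title(records, excluded=("unknown", "airspace", "carpark")):
    """Give unknown properties the class shared by the rest of their title.

    A title only donates a class when its classified properties agree on
    exactly one class outside `excluded`.
    """
    classes_by_title = defaultdict(set)
    for rec in records:
        if rec["class"] not in excluded:
            classes_by_title[rec["title_number"]].add(rec["class"])

    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input is not mutated
        donors = classes_by_title.get(rec["title_number"], set())
        if rec["class"] == "unknown" and len(donors) == 1:
            rec["class"] = next(iter(donors))
        filled.append(rec)
    return filled

# Hypothetical titles: T1 agrees, T2 conflicts, T3 only has an excluded class
records = [
    {"title_number": "T1", "class": "domestic"},
    {"title_number": "T1", "class": "unknown"},
    {"title_number": "T2", "class": "business"},
    {"title_number": "T2", "class": "domestic"},
    {"title_number": "T2", "class": "unknown"},
    {"title_number": "T3", "class": "carpark"},
    {"title_number": "T3", "class": "unknown"},
]
filled_classes = [r["class"] for r in fill_unknown_from_title(records)]
```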

In [19]:
ocod_data = classification_type1(ocod_data)

In [8]:
multi_class_titles = ocod_data[~ocod_data['class'].isin(['unknown', 'airspace', 'carpark']) & (ocod_data['within_larger_title']==True)].groupby(['title_number', 'class']).\
size().reset_index().groupby('title_number').size().reset_index().rename(columns={0:'counts'})

#there are no within title-ids that have more than one class. This shows that this is a very accurate way of filling in missing class data
print(multi_class_titles[multi_class_titles['counts']>1])

multi_class_titles = multi_class_titles[multi_class_titles['counts']==1]
#multi_class_titles.groupby('counts').size()

ocod_data[ocod_data['title_number'].isin(multi_class_titles['title_number'])].groupby('class').size()
#[['street_number', 'street_name','property_address', "business_address"]]

     title_number  counts
10         126312       2
16         142155       2
17         146577       2
19         147442       2
20         148312       2
...           ...     ...
4287    WYK737596       2
4294    WYK792514       2
4299    WYK856042       2
4304     YEA16295       2
4320      YY38811       2

[669 rows x 2 columns]


class
business     8046
domestic    40157
land            9
dtype: int64

## Classification type 2

Classification type 2 only affects the properties of class 'unknown' in classification type 1.

These properties are assumed to be either domestic or business.
They are hierarchically classified into domestic or 'unknown' using the following rules

- Street match == TRUE, Street name is known AND street number is known (domestic)
- Street match is FALSE AND street name is known (domestic)
- Building name is known (domestic)

All remaining addresses do not contain enough information to be classified and are classed as unknown
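The three rules can be written as a short function. This is a sketch only: it assumes each record is a dict using the column names that appear in this notebook (`street_match`, `street_name`, `street_number`, `building_name`), with `None` standing for a missing value, and it is not the body of `classification_type2`.

```python
def classify_type2(rec):
    """Apply the type 2 rules to one record; only 'unknown' records are changed."""
    if rec["class"] != "unknown":
        return rec["class"]
    has_street = rec.get("street_name") is not None
    has_number = rec.get("street_number") is not None
    # Rule 1: street matched, and both street name and street number are known
    if rec.get("street_match") and has_street and has_number:
        return "domestic"
    # Rule 2: street not matched, but the street name is known
    if not rec.get("street_match") and has_street:
        return "domestic"
    # Rule 3: building name is known
    if rec.get("building_name") is not None:
        return "domestic"
    return "unknown"

# Hypothetical records covering each outcome
matched_with_number = {"class": "unknown", "street_match": True,
                       "street_name": "mill lane", "street_number": "4", "building_name": None}
matched_no_number = {"class": "unknown", "street_match": True,
                     "street_name": "mill lane", "street_number": None, "building_name": None}
building_only = {"class": "unknown", "street_match": False,
                 "street_name": None, "street_number": None, "building_name": "phoenix house"}
results = [classify_type2(r) for r in (matched_with_number, matched_no_number, building_only)]
```

A matched street without a street number falls through to 'unknown' unless a building name is present, which is the only case the rules leave unclassified when a street name exists.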

In [9]:
ocod_data = classification_type2(ocod_data)

In [None]:
#If there is a street match, and the property has a street and a street number OR a building name
#Then it is a domestic property

test = ocod_data[ocod_data['class2']=='unknown']
print(pd.crosstab((test['street_match']==True), (test['street_name'].notnull()==True) ))

ocod_data[(ocod_data['street_name'].isnull()==True) & (ocod_data['class2']=='unknown')].to_csv('./data/delete_me.csv')

In [None]:
ocod_data.groupby('class').size()

In [None]:
pd.crosstab(ocod_data['unit_type'],(ocod_data['class2']=="domestic"))

## Contracting the dataset
Businesses, car parks, airspace and so on are classed as a single address, regardless of how many components they are made of.
This chunk strips businesses that have been expanded back down to a single address.
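The contraction can be pictured as keeping every row for the classes that should stay expanded and only the first row per title for everything else. The sketch below assumes that the title number is the grouping key and that the first row is the one retained; neither detail is confirmed by the notebook, and the function name is hypothetical.

```python
def contract_after_classification(records, class_key="class2", keep_expanded=("domestic",)):
    """Collapse expanded rows back to one row per title, except for `keep_expanded` classes."""
    contracted = []
    seen = set()
    for rec in records:
        if rec[class_key] in keep_expanded:
            contracted.append(rec)  # domestic rows stay one per property
            continue
        key = (rec["title_number"], rec[class_key])
        if key not in seen:  # keep only the first row of each non-domestic title
            seen.add(key)
            contracted.append(rec)
    return contracted

# Hypothetical expanded data: T1 is three flats, T2 is a business split into three units
records = (
    [{"title_number": "T1", "class2": "domestic", "unit_id": str(i)} for i in range(1, 4)]
    + [{"title_number": "T2", "class2": "business", "unit_id": str(i)} for i in range(1, 4)]
)
contracted = contract_after_classification(records)
```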


In [10]:
ocod_data = contract_ocod_after_classification(ocod_data, class_type = 'class2', classes = ['domestic'] )


In [None]:
ocod_data.groupby('class2').size()

In [None]:
ocod_data.groupby('class2').size()/ocod_data.shape[0]

In [45]:
#none of the unknowns have a postcode. I guess this is obvious as if there is no matching VOA postcode you are classed as domestic
#pd.crosstab(ocod_data[ocod_data['class']=="unknown"].postcode.notnull(), ocod_data[ocod_data['class']=="unknown"].street_name.notnull())

In [46]:
pd.crosstab(ocod_data['tenure'], ocod_data['region'].str.lower())#.to_latex() #convert to copyable latex table

region,east anglia,east midlands,greater london,north,north west,south east,south west,wales,west midlands,yorks and humber
tenure,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Freehold,3418,4680,21783,3304,16221,17722,5754,3395,5050,8460
Leasehold,434,1209,40133,882,5741,5496,1702,693,1858,2793


# Saving the enhanced expanded dataset

In [48]:
ocod_data.to_csv("/tf/data/enhanced_ocod_dataset.csv")

#Save the test set indices to create the ground truth
#this is commented out to avoid overwriting

#ocod_data.loc[ocod_data.title_number.isin(pd.read_csv("./data/test_set_indices.csv")['title_number']) ,  
#              ['title_number','within_title_id','unit_type' ,'building_name', 'street_number', 'street_name','postcode' ,'property_address',  'lsoa11cd', 'class2']].to_csv('./data/parsed_ground_truth_raw.csv')

# Post creation analysis

In [18]:
pd.crosstab(ocod_data['class2'], ocod_data['region'].str.lower())#.to_latex() #convert to copyable latex table

region,east anglia,east midlands,greater london,north,north west,south east,south west,wales,west midlands,yorks and humber
class2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
airspace,50,162,94,15,215,185,266,290,94,303
business,550,1437,15806,799,2523,4453,1424,670,1423,1337
carpark,27,36,1597,24,518,194,89,9,50,57
domestic,2510,3791,48017,2742,16976,14306,4829,2607,4112,8439
land,639,919,2639,733,2215,4808,1270,629,1316,1314
unknown,64,115,734,157,319,453,207,89,150,133


In [19]:
pd.crosstab(ocod_data['class2'], ocod_data['tenure'])#.to_latex() #convert to copyable latex table

tenure,Freehold,Leasehold
class2,Unnamed: 1_level_1,Unnamed: 2_level_1
airspace,7,1667
business,20948,9474
carpark,252,2349
domestic,60744,47585
land,14619,1863
unknown,1534,887


In [20]:
temp_df = ocod_data[['title_number', 'tenure', 'within_larger_title']].drop_duplicates()

#most of the titles containing nested addresses are freehold, by about 3/2
pd.crosstab(temp_df['tenure'], temp_df['within_larger_title'])


within_larger_title,False,True
tenure,Unnamed: 1_level_1,Unnamed: 2_level_1
Freehold,46557,4208
Leasehold,41562,1381


In [21]:
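#The analysis is based on nested addresses being domestic
#.copy() avoids the 'setting on copy' warning when the is_flat column is added
temp_df = ocod_data.loc[ocod_data['within_larger_title']==True, ['title_number', 'tenure', 'property_address']].copy()
#the non-capturing group (?:...) avoids the pandas warning about match groups in str.contains
temp_df['is_flat'] = temp_df['property_address'].str.contains(r"(?:flat|apartment|penthouse|unit)", case = False)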

#pd.crosstab(temp_df['tenure'], temp_df['within_larger_title'])

temp_df.groupby('tenure').size()

#Of nested addresses freehold is more common by 3/2 50k to 24k
#most of the properties are not flats, however flats dominate the leasehold section
#flats are 1/3 of nested addresses but make up almost 3/4 of the leasehold nested addresses
#note this does not include items marked as units
pd.crosstab(temp_df.tenure, temp_df.is_flat)



is_flat,False,True
tenure,Unnamed: 1_level_1,Unnamed: 2_level_1
Freehold,43388,7113
Leasehold,5186,16455


## Largest nested addresses

In [5]:
#The largest nested address
ocod_data.within_title_id.max()
ocod_data[ocod_data.within_title_id==ocod_data.within_title_id.max()].reset_index()['property_address'][0]



'Ground to ninth Floor Flats being 101-114, 201-214, 301-314, 401-414, 501-514, 601-613 and 701-704 Alaska Building, 101-114, 201-214,301-314, 401-412, 501-506 and 601-605 Arizona Building, 101-114, 201-214, 301-314, 401-414, 501-514, 601-614, 701-708, 801-804, 901-903 California Building, 101-108,     201-208, 301-307, 401-408, 501-508, 601-608, 701-708, 801-808 and 901-903 Colorado Building, 1-4, 101-109, 201-210, 301-310, 401-410, 501-510 and 601-605 Dakota Building, 1-7, 101-108, 201-208, 301-308, 401-408, 501-506 and 601-604 Idaho Building, 102-112, 201-212, 301-312, 401-412, 501-508 and 601-604 Indiana Building, 1-15, 101-116, 201-216, 301-315, 401-416, 501-510 Montana Building, 101-108, 201-208, 301-308, 401-408, 501-506 and 601-604 Nebraska Building, 1-10, 101-110, 201-210, 301-310 and 402-403 Utah Building, 1-10 and 101-110 Boston Building, 1-6, 101-106, 201-206, 301-306, 401-408 and 501-507 Madison Building, Deals Gateway, London'

# Accuracy metrics

Checking the classification accuracy of the results.

In [4]:
ocod_data = pd.read_csv("./data/enhanced_ocod_dataset.csv")

In [6]:
ocod_data

Unnamed: 0.1,Unnamed: 0,title_number,nested_id,nested_title,unique_id,unit_id,unit_type,building_name,street_number,street_name,...,city,district,region,property_address,oa11cd,lsoa11cd,msoa11cd,lad11cd,class,class2
0,5583,100073,1,False,100073-1,,,,11,stanley crescent,...,london,KENSINGTON AND CHELSEA,GREATER LONDON,"11 stanley crescent, london (w11 2na)",E00014494,E01002882,E02000582,E09000020,unknown,domestic
1,2489,100396,3,True,100396-3,,,,1,crown passage,...,london,CITY OF WESTMINSTER,GREATER LONDON,"59 and 60 pall mall and 1 crown passage, st ja...",E00023939,E01004736,E02000977,E09000033,business,business
2,9447,100471,1,False,100471-1,,,,46,gerrard street,...,london,CITY OF WESTMINSTER,GREATER LONDON,"46 gerrard street, london (w1d 5qh)",E00023928,E01004734,E02000977,E09000033,business,business
3,5976,100990,1,False,100990-1,,,,13,great marlborough street,...,london,CITY OF WESTMINSTER,GREATER LONDON,"13 great marlborough street, london (w1f 7hp)",E00175191,E01033595,E02000972,E09000033,unknown,domestic
4,1001,101374,1,True,101374-1,,,,368,finchley road,...,london,CAMDEN,GREATER LONDON,"368, 370 and 372 finchley road, london (nw3 7aj)",E00004348,E01000884,E02000169,E09000007,unknown,domestic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137580,535,YY99699,1,False,YY99699-1,112,unit,sunbridge halls,178,sunbridge road,...,bradford,BRADFORD,YORKS AND HUMBER,"unit 112, sunbridge halls, 178 sunbridge road,...",E00176105,E01033691,E02002221,E08000032,unknown,domestic
137581,719,YY99701,1,False,YY99701-1,114,unit,,178,sunbridge road,...,bradford,BRADFORD,YORKS AND HUMBER,"unit 114 sunbridge halls, 178 sunbridge road, ...",E00176105,E01033691,E02002221,E08000032,unknown,domestic
137582,1987,YY99815,1,False,YY99815-1,,,phoenix house,,topcliffe lane,...,wakefield,LEEDS,YORKS AND HUMBER,"phoenix house, topcliffe lane, tingley, wakefi...",E00058211,E01011540,E02002431,E08000035,business,business
137583,583,YY99873,1,False,YY99873-1,,,capitol park west,,capitol boulevard,...,leeds,LEEDS,YORKS AND HUMBER,"capitol park west, capitol boulevard, tingley,...",,E01032491,,E08000035,unknown,domestic


In [3]:
ocod_data.loc[:, ['title_number', 'within_title_id', 'within_larger_title', 'unique_id', 'unit_id', 'unit_type',
       'building_name', 'street_number', 'street_name', 'postcode', 'city',
       'district',  'region', 'property_address', 'oa11cd', 'lsoa11cd',
       'msoa11cd',  'lad11cd', 'class', 'class2']].rename(columns={'within_title_id':'nested_id',
                                                                  'within_larger_title':'nested_title'})

KeyError: "['within_title_id', 'within_larger_title'] not in index"

In [3]:
ocod_data.columns

Index(['Unnamed: 0', 'title_number', 'within_title_id', 'unique_id',
       'within_larger_title', 'tenure', 'unit_id', 'unit_type',
       'building_name', 'street_number', 'street_name', 'postcode', 'city',
       'district', 'county', 'region', 'multiple_address_indicator',
       'price_paid', 'property_address', 'postcode2', 'oa11cd', 'lsoa11cd',
       'msoa11cd', 'street_number2', 'street_name2', 'lad11cd', 'lsoa_street',
       'lsoa_building', 'oa_building', 'oa_busi_building',
       'lsoa_busi_building', 'lsoa_nested', 'oa_nested', 'lsoa_nested2',
       'business_counts', 'lsoa_business_counts', 'street_match',
       'address_match', 'business_address', 'class', 'class2'],
      dtype='object')

In [7]:
[ 'title_number', 'within_title_id', 'unique_id',
       'within_larger_title', 'tenure', 'unit_id', 'unit_type',
       'building_name', 'street_number', 'street_name', 'postcode', 'city',
       'district',  'region', 'multiple_address_indicator',
       'price_paid', 'property_address', 'oa11cd', 'lsoa11cd',
       'msoa11cd',  'lad11cd', 'class', 'class2']

['title_number',
 'within_title_id',
 'unique_id',
 'within_larger_title',
 'tenure',
 'unit_id',
 'unit_type',
 'building_name',
 'street_number',
 'street_name',
 'postcode',
 'city',
 'district',
 'region',
 'multiple_address_indicator',
 'price_paid',
 'property_address',
 'oa11cd',
 'lsoa11cd',
 'msoa11cd',
 'lad11cd',
 'class',
 'class2']

In [8]:
ground_truth_df = pd.read_csv('./data/ground_truth_test_set_labels.csv')


#I only need a small number of the columns to be able to calculate the F1 score
#Everything else just makes it confusing.
#renaming is for consistency
ground_truth_df = ground_truth_df.loc[ground_truth_df.loc[:,'result_type']=="span",[ 'result_type', 'label',
       'start', 'end', 'text', 'input:text', 'input:datapoint_id']].rename(
    columns = {'input:text':'property_address',
              'input:datapoint_id':'datapoint_id',
              'text':'label_text'})

In [9]:
unit_park = (ocod_data.property_address.str.contains('unit') & ocod_data.property_address.str.contains('park'))

ocod_data.loc[(unit_park==True)  & (ocod_data.class2.isin(['unknown', 'domestic'])), ['property_address', 'class', 'class2']]

Unnamed: 0,property_address,class,class2
5291,"part of the ground, first, second, third and f...",unknown,domestic
6332,"block c2 and c3, boardwalk place, london, incl...",domestic,domestic
7497,"ground floor unit at building 11, chiswick par...",unknown,domestic
18694,"28 lucas house, coleridge gardens, london, par...",unknown,domestic
18700,"18 bailey house, coleridge gardens, london, pa...",unknown,domestic
...,...,...,...
126515,"the ground, first, second and third floors, pl...",unknown,domestic
126516,"the ground, first, second and third floors, pl...",unknown,domestic
126517,"the ground, first, second and third floors, pl...",unknown,domestic
126518,"the ground, first, second and third floors, pl...",unknown,domestic


In [10]:
from sklearn import metrics

In [11]:
gt_class = pd.read_csv('./data/parsed_ground_truth_complete.csv').loc[:, ['title_number', 'truth']].drop_duplicates().\
merge(ocod_data.loc[:, ['title_number', 'class2']].drop_duplicates(), how = 'left')

#get_class = gt_class.loc[(gt_class['class2']!='unknown'),:]
label_names = list(np.unique(gt_class.truth.to_list()))

performance_df = metrics.precision_recall_fscore_support(gt_class.truth.to_list(),
                                        gt_class['class2'].to_list(), 
                                        labels = label_names
                                                        )

#np.round replaces np.round_, which was removed in NumPy 2.0
performance_df = pd.DataFrame(np.round(np.transpose(performance_df),2), columns = ["precision", "recall", "fscore", "support"])
performance_df['class'] = list(np.unique(gt_class.truth.to_list()))
#print(performance_df[['class',"precision", "recall", "fscore", "support"]].to_latex(index = False))
performance_df[['class',"precision", "recall", "fscore", "support"]]

Unnamed: 0,class,precision,recall,fscore,support
0,airspace,1.0,0.93,0.96,14.0
1,business,0.97,0.81,0.88,287.0
2,carpark,1.0,0.96,0.98,26.0
3,domestic,0.89,0.98,0.93,483.0
4,land,1.0,0.99,1.0,179.0
5,unknown,0.0,0.0,0.0,9.0


In [12]:
gt_class = pd.read_csv('./data/parsed_ground_truth_complete.csv').loc[:, ['title_number', 'truth']].\
merge(ocod_data.loc[:, ['title_number', 'class2']].drop_duplicates(), how = 'left')

#get_class = gt_class.loc[(gt_class['class2']!='unknown'),:]
label_names = list(np.unique(gt_class.truth.to_list())) 
    
test = metrics.classification_report(gt_class.truth.to_list(),
                                        gt_class['class2'].to_list(), 
                                        labels = label_names
                                                        )
    
print(test)

              precision    recall  f1-score   support

    airspace       1.00      0.93      0.96        14
    business       0.97      0.79      0.87       311
     carpark       1.00      0.96      0.98        26
    domestic       0.93      0.99      0.96       918
        land       1.00      0.99      1.00       179
     unknown       0.00      0.00      0.00         9

    accuracy                           0.94      1457
   macro avg       0.82      0.78      0.79      1457
weighted avg       0.94      0.94      0.94      1457



# Weighting by the actual number of correctly classified properties

As opposed to weighting by the number of correctly classified title numbers.
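The difference between the two weightings can be seen on a toy example. The sketch below computes precision and recall by hand, first with one row per property and then with duplicate rows dropped so that each title counts once. The rows and the helper `precision_recall` are hypothetical; the notebook itself uses `sklearn.metrics` for the real figures.

```python
def precision_recall(truth, predicted, label):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical rows of (title_number, truth, predicted), one row per property.
# Title T1 holds three flats, so it counts three times when weighting by property.
rows = [
    ("T1", "domestic", "domestic"),
    ("T1", "domestic", "domestic"),
    ("T1", "domestic", "domestic"),
    ("T2", "business", "domestic"),
    ("T3", "business", "business"),
]
per_property = precision_recall([r[1] for r in rows], [r[2] for r in rows], "domestic")

# Dropping duplicate rows leaves one row per title
unique_rows = sorted(set(rows))
per_title = precision_recall([r[1] for r in unique_rows], [r[2] for r in unique_rows], "domestic")
```

In this toy case the domestic precision is 0.75 per property but 0.5 per title, because the large correctly classified title carries more weight when every property is counted.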

In [14]:
gt_class = pd.read_csv('./data/parsed_ground_truth_complete.csv').loc[:, ['title_number', 'truth']].\
merge(ocod_data.loc[:, ['title_number', 'class2']].drop_duplicates(), how = 'left')
label_names = list(np.unique(gt_class.truth.to_list()))

performance_df = metrics.precision_recall_fscore_support(gt_class.truth.to_list(),
                                        gt_class['class2'].to_list(), 
                                        labels = label_names)

#np.round replaces np.round_, which was removed in NumPy 2.0
performance_df = pd.DataFrame(np.round(np.transpose(performance_df),2), columns = ["precision", "recall", "fscore", "support"])
performance_df['class'] = list(np.unique(gt_class.truth.to_list()))
performance_df[['class',"precision", "recall", "fscore", "support"]]
print(performance_df[['class',"precision", "recall", "fscore", "support"]].to_latex(index = False))

\begin{tabular}{lrrrr}
\toprule
   class &  precision &  recall &  fscore &  support \\
\midrule
airspace &       1.00 &    0.93 &    0.96 &     14.0 \\
business &       0.97 &    0.79 &    0.87 &    311.0 \\
 carpark &       1.00 &    0.96 &    0.98 &     26.0 \\
domestic &       0.93 &    0.99 &    0.96 &    918.0 \\
    land &       1.00 &    0.99 &    1.00 &    179.0 \\
 unknown &       0.00 &    0.00 &    0.00 &      9.0 \\
\bottomrule
\end{tabular}





# Future work


The items below are primarily nice-to-haves and would not change the output or results in any significant way.

- I could re-insert the original street number into the address when contracting. This would be better for addresses that were expanded but shouldn't have been, but it definitely isn't very important.

- I could clean up the functions to remove the 'setting on copy' warning.
- I could create a verbose flag so that the functions' messages and printouts can be suppressed.