# Coordinates to Postcode
Convert the longitude and latitude in the Zoopla data to postcodes. Unfortunately we don't have the postcode or house number in the Zoopla data. So we will infer this from the property longitude and latitude. Datasets of all postcodes for a given area, including their mean latitude and longitude are available from Source: https://www.doogal.co.uk/AdministrativeAreas. We will then use the inferred property postcode to join other data (e.g. deprivation data, flood risk data, EPC data (averages for that postcode and road) and government historic property prices (averages for that postcode, house type, year, etc).

In [1]:
from collections import Counter
import os
import numpy as np
import pandas as pd
import re
pd.set_option('display.max_columns', 100)

### Read in mapping files

In [2]:
AREA = 'Hinckley'
DATA_FOLDER = os.path.join('data', 'raw')
SAVE_FOLDER = os.path.join('data', 'processed')

In [3]:
if AREA == 'Nuneaton':
    mapping_df_filename = 'Nuneaton and Bedworth postcodes.csv'
    zoopla_df_filename = 'zoopla_properties_nuneaton.csv'
elif AREA == 'Hinckley':
    mapping_df_filename = 'Hinckley and Bosworth postcodes.csv'
    zoopla_df_filename = 'zoopla_properties_hinckley.csv'
    
mapping_df = pd.read_csv(os.path.join(DATA_FOLDER, mapping_df_filename))
zoopla_df = pd.read_csv(os.path.join(DATA_FOLDER, zoopla_df_filename))

In [4]:
print(mapping_df.shape)
display(mapping_df.head())

(3515, 17)


Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,Ward,Parish,Introduced,Terminated,Altitude,Country,Last Updated,Quality,LSOA Code,LSOA Name
0,CV10 0RY,Yes,52.559033,-1.482983,435148,295814,SP351958,Twycross and Witherley with Sheepy,Witherley,1980-01-01,,81,England,2022-11-25,Within the building of the matched address clo...,E01025881,Hinckley and Bosworth 008D
1,CV10 0SB,Yes,52.56227,-1.487188,434860,296172,SP348961,Twycross and Witherley with Sheepy,Witherley,1980-01-01,,83,England,2022-11-25,Within the building of the matched address clo...,E01025881,Hinckley and Bosworth 008D
2,CV10 0TT,Yes,52.554782,-1.461007,436641,295352,SP366953,Ambien,Higham on the Hill,1980-01-01,,96,England,2022-11-25,Within the building of the matched address clo...,E01025818,Hinckley and Bosworth 008B
3,CV10 0TU,Yes,52.553685,-1.464958,436374,295228,SP363952,Ambien,Higham on the Hill,1998-12-01,,90,England,2022-11-25,Within the building of the matched address clo...,E01025818,Hinckley and Bosworth 008B
4,CV10 0TZ,Yes,52.54929,-1.449379,437434,294747,SP374947,Ambien,Higham on the Hill,1980-01-01,,104,England,2022-11-25,Within the building of the matched address clo...,E01025818,Hinckley and Bosworth 008B


### Map latitude and longitude to postcode

Create function that takes the longitude and latitude from the Zoopla data and find the closest one in the mapping data, returning the corresponding postcode in the mapping file.

In [5]:
def get_closest_postcode(latitude, longitude, map_df):
    
    """
    Find the nearest longitude and latitude in mapping file and get the postcode.
    This uses the Euclidean (rather than Haversine) distance, since the distances will be small
    such that the Earth's curvature need not be considered.
    An alternative could be to use geopy.reverse() to convert coordinates to postcodes.
    
    Parameters
    ----------
    - latitude (float): latitude of the Zoopla property
    - longitude (float): longitude of the Zoopla property
    - map_df (pandas.DataFrame): dataframe that maps coordinates to postcodes
    
    Returns
    -------
    - postcode (string): postcode closely mapping the Zoopla property
    """
    
    # squared euclidean distance between Zoopla property and each mapping dataset postcode mean
    # this uses array broadcasting in numpy
    sq_distances = (np.array(latitude) - np.array(map_df['Latitude']))**2 + \
        (np.array(longitude) - np.array(map_df['Longitude']))**2 
    
    min_sq_distance_row = np.argmin(sq_distances)
    
    return map_df.iloc[min_sq_distance_row]['Postcode']
    

In [6]:
zoopla_df['postcode_test_1'] = zoopla_df[['latitude', 'longitude']].apply(lambda x: get_closest_postcode(x[0], x[1], mapping_df), axis=1)

In [7]:
zoopla_df[['latitude', 'longitude', 'postcode_test_1']].head(10)

Unnamed: 0,latitude,longitude,postcode_test_1
0,52.53883,-1.396291,LE10 0NS
1,52.548298,-1.353169,LE10 1ND
2,52.552856,-1.375555,LE10 0EQ
3,52.534348,-1.392864,LE10 0LW
4,52.546017,-1.38644,LE10 0XB
5,52.547695,-1.388362,LE10 0TN
6,52.54538,-1.372886,LE10 1RP
7,52.53732,-1.375827,LE10 0PJ
8,52.542767,-1.394461,LE10 0XW
9,52.533012,-1.397817,LE10 0YL


This works OK but sometimes we get an adjacent postcode to the one we want. Instead, let's try combining it with the road name by loading the EPC data which has postcodes and names, and joining to the Zoopla data on the street name.

### Try getting postcode from road name using EPC data
Data available from https://epc.opendatacommunities.org/domestic/search

In [8]:
def get_most_common_postcode(postcode_list):
    
    """
    Return the most frequent value in a list
    """
    
    if not postcode_list:
        return None
    else:
        postcode_counter = Counter(postcode_list)
        return postcode_counter.most_common()[0][0]
    

In [9]:
def get_street_name(address_1, address_2):
    
    """
    Get street name from first two street address fields
    """
    
    street_and_road = re.compile(r'^\d,\s+')
    
    # if street name starts with a number (maybe followed by comma) and a space, likely next part is street name
    if street_and_road.match(address_1):
        street = re.split(street_and_road, address_1)[1].lower()
        
    # otherwise choose the second part of the address as the street name
    else:
        street = str(address_2).lower()
        
    return street


In [10]:
def get_postcode(latitude, longitude, street_name, method='closest'):
    
    """
    Convert longitude and latitude into a post code, using the street name to narrow 
    the possible post codes down. Two methods possible, discussed below.
    
    Parameters
    ----------
    - latitude (float): latitude of the Zoopla property
    - longitude (float): longitude of the Zoopla property
    - street_name (string): the street name for the Zoopla property
    - method (string) default='closest': algorithm for choosing the postcode
       - closest: uses Euclidean distance to calculate the
         nearest longitude and latitude in the mapping file to the Zoopla property and gets the postcode
       - frequency: gets the most common postcode for the property's street name
    
    Returns
    -------
    - postcode (string): postcode closely mapping the Zoopla property
    """
    
    # get all possible postcodes for the street name
    possible_postcodes = list(epc_df[epc_df['Street'].str.lower()==street_name.lower()]['POSTCODE'])
    
    if method == 'frequency':
        
        final_postcode = get_most_common_postcode(possible_postcodes)
        
        # if there is no most common postcode in the EPC data, use the 'closest' algorithm instead
        if not final_postcode:
            method = 'closest'
        
    if method == 'closest':
        
        possible_postcodes = set(possible_postcodes)
        
        # if set is empty, try all possible postcodes and find nearest one
        if not possible_postcodes:
            final_postcode = get_closest_postcode(latitude, longitude, mapping_df)
        
        # if set is filled, limit search to the postcodes in the set
        else:

            # get mapping dataframe of just the postcodes in the above set
            mapping_df_temp = mapping_df[mapping_df['Postcode'].isin(possible_postcodes)]
            
            # where mapping dataframe filled, otherwise, choose the postcode from the set 
            # whose location is closest to the property longitude and latitude
            if not mapping_df_temp.empty:
                final_postcode = get_closest_postcode(latitude, longitude, mapping_df_temp)
            
            # if the postcodes are not found in the mapping file (likely due to
            # that postcode falling outside the geo boundary of the mapping file),
            # set the postcode to the first element in the possible_postcodes set 
            else:
                final_postcode = sorted(possible_postcodes)[0]
            

    return final_postcode
  

In [11]:
if AREA == 'Nuneaton':
    epc_filename = 'epcs_nuneaton.csv'
elif AREA == 'Hinckley':
    epc_filename = 'epcs_hinckley.csv'

epc_df = pd.read_csv(os.path.join(DATA_FOLDER, epc_filename), dtype=str)

In [12]:
print(epc_df.shape)
display(epc_df.head())

(42663, 92)


Unnamed: 0,LMK_KEY,ADDRESS1,ADDRESS2,ADDRESS3,POSTCODE,BUILDING_REFERENCE_NUMBER,CURRENT_ENERGY_RATING,POTENTIAL_ENERGY_RATING,CURRENT_ENERGY_EFFICIENCY,POTENTIAL_ENERGY_EFFICIENCY,PROPERTY_TYPE,BUILT_FORM,INSPECTION_DATE,LOCAL_AUTHORITY,CONSTITUENCY,COUNTY,LODGEMENT_DATE,TRANSACTION_TYPE,ENVIRONMENT_IMPACT_CURRENT,ENVIRONMENT_IMPACT_POTENTIAL,ENERGY_CONSUMPTION_CURRENT,ENERGY_CONSUMPTION_POTENTIAL,CO2_EMISSIONS_CURRENT,CO2_EMISS_CURR_PER_FLOOR_AREA,CO2_EMISSIONS_POTENTIAL,LIGHTING_COST_CURRENT,LIGHTING_COST_POTENTIAL,HEATING_COST_CURRENT,HEATING_COST_POTENTIAL,HOT_WATER_COST_CURRENT,HOT_WATER_COST_POTENTIAL,TOTAL_FLOOR_AREA,ENERGY_TARIFF,MAINS_GAS_FLAG,FLOOR_LEVEL,FLAT_TOP_STOREY,FLAT_STOREY_COUNT,MAIN_HEATING_CONTROLS,MULTI_GLAZE_PROPORTION,GLAZED_TYPE,GLAZED_AREA,EXTENSION_COUNT,NUMBER_HABITABLE_ROOMS,NUMBER_HEATED_ROOMS,LOW_ENERGY_LIGHTING,NUMBER_OPEN_FIREPLACES,HOTWATER_DESCRIPTION,HOT_WATER_ENERGY_EFF,HOT_WATER_ENV_EFF,FLOOR_DESCRIPTION,FLOOR_ENERGY_EFF,FLOOR_ENV_EFF,WINDOWS_DESCRIPTION,WINDOWS_ENERGY_EFF,WINDOWS_ENV_EFF,WALLS_DESCRIPTION,WALLS_ENERGY_EFF,WALLS_ENV_EFF,SECONDHEAT_DESCRIPTION,SHEATING_ENERGY_EFF,SHEATING_ENV_EFF,ROOF_DESCRIPTION,ROOF_ENERGY_EFF,ROOF_ENV_EFF,MAINHEAT_DESCRIPTION,MAINHEAT_ENERGY_EFF,MAINHEAT_ENV_EFF,MAINHEATCONT_DESCRIPTION,MAINHEATC_ENERGY_EFF,MAINHEATC_ENV_EFF,LIGHTING_DESCRIPTION,LIGHTING_ENERGY_EFF,LIGHTING_ENV_EFF,MAIN_FUEL,WIND_TURBINE_COUNT,HEAT_LOSS_CORRIDOR,UNHEATED_CORRIDOR_LENGTH,FLOOR_HEIGHT,PHOTO_SUPPLY,SOLAR_WATER_HEATING_FLAG,MECHANICAL_VENTILATION,ADDRESS,LOCAL_AUTHORITY_LABEL,CONSTITUENCY_LABEL,POSTTOWN,CONSTRUCTION_AGE_BAND,LODGEMENT_DATETIME,TENURE,FIXED_LIGHTING_OUTLETS_COUNT,LOW_ENERGY_FIXED_LIGHT_COUNT,UPRN,UPRN_SOURCE
0,188e4e7604368b7386e5ff93771a037ccfb150c2861096...,5 Brockey Close,Barwell,,LE9 8BG,10003465551,D,B,68,87,Bungalow,Detached,2022-10-04,E07000132,E14000583,Leicestershire,2022-10-10,Stock condition survey,67,87,270,102,2.2,48,0.9,44,44,418,370,59,39,47,Single,Y,,,,,100.0,double glazing installed before 2002,Normal,0.0,3.0,3.0,100,0.0,From main system,Good,Good,"Solid, no insulation (assumed)",,,Fully double glazed,Average,Average,"Cavity wall, as built, insulated (assumed)",Good,Good,"Room heaters, mains gas",,,"Pitched, 270 mm loft insulation",Good,Good,"Boiler and radiators, mains gas",Good,Good,"Programmer, room thermostat and TRVs",Good,Good,Low energy lighting in all fixed outlets,Very Good,Very Good,mains gas (not community),0,,,2.3,0.0,N,natural,"5 Brockey Close, Barwell",Hinckley and Bosworth,Bosworth,LEICESTER,England and Wales: 1983-1990,2022-10-10 19:12:48,Rented (social),6,,100032074592,Energy Assessor
1,849605851212012102416531797929305,"2, Nob Hill",Norton juxta Twycross,,CV9 3QE,3560172078,D,C,67,80,House,Detached,2012-10-19,E07000132,E14000583,Leicestershire,2012-10-24,marketed sale,60,75,156,94,8.4,34,5.2,112,112,1463,1067,172,105,243,Single,N,NODATA!,,,2106.0,85.0,"double glazing, unknown install date",Normal,1.0,8.0,8.0,73,0.0,From main system,Good,Average,"Solid, no insulation (assumed)",,,Mostly double glazing,Poor,Poor,"Cavity wall, as built, partial insulation (ass...",Average,Average,"Room heaters, wood logs",,,"Pitched, 150 mm loft insulation",Good,Good,"Boiler and radiators, oil",Good,Good,"Programmer, room thermostat and TRVs",Good,Good,Low energy lighting in 73% of fixed outlets,Very Good,Very Good,oil (not community),0,NO DATA!,,,0.0,,natural,"2, Nob Hill, Norton juxta Twycross",Hinckley and Bosworth,Bosworth,ATHERSTONE,England and Wales: 1976-1982,2012-10-24 16:53:17,owner-occupied,26,19.0,100030495131,Address Matched
2,1062831709962013121817485087588537,"21, Barrie Road",,,LE10 0QU,3806087178,E,B,53,82,House,End-Terrace,2013-12-18,E07000132,E14000583,Leicestershire,2013-12-18,assessment for green deal,49,81,303,103,4.9,58,1.7,80,49,892,519,82,58,84,Single,Y,NODATA!,,,2107.0,100.0,"double glazing, unknown install date",Normal,1.0,5.0,5.0,36,0.0,From main system,Good,Good,"Suspended, no insulation (assumed)",,,Fully double glazed,Average,Average,"Solid brick, as built, no insulation (assumed)",Very Poor,Very Poor,"Room heaters, mains gas",,,"Pitched, 200 mm loft insulation",Good,Good,"Boiler and radiators, mains gas",Good,Good,"Programmer, TRVs and bypass",Average,Average,Low energy lighting in 36% of fixed outlets,Average,Average,mains gas (not community),0,NO DATA!,,,0.0,,natural,"21, Barrie Road",Hinckley and Bosworth,Bosworth,HINCKLEY,England and Wales: 1930-1949,2013-12-18 17:48:50,owner-occupied,11,4.0,100030497070,Address Matched
3,641449911152012091816382695920980,"69, Hinckley Road",Earl Shilton,,LE9 7LH,9039157868,D,C,64,79,Bungalow,Detached,2012-09-18,E07000132,E14000583,Leicestershire,2012-09-18,marketed sale,59,76,207,118,5.1,40,3.0,111,62,817,679,110,75,128,Single,Y,NODATA!,,,2106.0,100.0,double glazing installed during or after 2002,Normal,1.0,5.0,5.0,20,1.0,From main system,Good,Good,"Solid, no insulation (assumed)",,,Fully double glazed,Good,Good,"Cavity wall, filled cavity",Good,Good,"Room heaters, mains gas",,,"Pitched, 100 mm loft insulation",Average,Average,"Boiler and radiators, mains gas",Good,Good,"Programmer, room thermostat and TRVs",Good,Good,Low energy lighting in 20% of fixed outlets,Poor,Poor,mains gas (not community),0,NO DATA!,,,0.0,,natural,"69, Hinckley Road, Earl Shilton",Hinckley and Bosworth,Bosworth,LEICESTER,England and Wales: 1930-1949,2012-09-18 16:38:26,owner-occupied,15,3.0,100030519581,Address Matched
4,496648659922010061416010176908480,"8, Pickering Place",Burbage,,LE10 2FJ,8785576768,B,B,83,85,Flat,NO DATA!,2010-06-10,E07000132,E14000583,Leicestershire,2010-06-14,new dwelling,83,84,134,127,1.4,22,1.3,57,34,212,215,85,85,61,standard tariff,,mid floor,,,,,NO DATA!,NO DATA!,,,,3,,From main system,Good,Good,(other premises below),,,Fully double glazed,Good,Good,Average thermal transmittance 0.45 W/m?K,Good,Good,,,,(other premises above),,,"Boiler and radiators, mains gas",Good,Good,"Programmer, room thermostat and TRVs",Average,Average,Low energy lighting in 33% of fixed outlets,Average,Average,mains gas - this is for backwards compatibilit...,0,NO DATA!,,2.45,,,NO DATA!,"8, Pickering Place, Burbage",Hinckley and Bosworth,Bosworth,HINCKLEY,NO DATA!,2010-06-14 16:01:01,,9,3.0,10090026218,Address Matched


In [13]:
# get street name from addresses
epc_df['Street'] = epc_df[['ADDRESS1', 'ADDRESS2']].apply(lambda x: get_street_name(x[0], x[1]), axis=1)

In [14]:
# try both 'closest' and 'frequency' method to get postcode
zoopla_df['postcode_test_2'] = zoopla_df[['latitude', 'longitude', 'street_name']].apply(lambda x: get_postcode(x[0], x[1], x[2], method='closest'), axis=1)
zoopla_df['postcode_test_3'] = zoopla_df[['latitude', 'longitude', 'street_name']].apply(lambda x: get_postcode(x[0], x[1], x[2], method='frequency'), axis=1)

In [15]:
zoopla_df[['latitude', 'longitude', 'postcode_test_1', 'postcode_test_2', 'postcode_test_3']].head(30)

Unnamed: 0,latitude,longitude,postcode_test_1,postcode_test_2,postcode_test_3
0,52.53883,-1.396291,LE10 0NS,LE10 0NS,LE10 0NS
1,52.548298,-1.353169,LE10 1ND,LE10 1ND,LE10 1ND
2,52.552856,-1.375555,LE10 0EQ,LE10 0RH,LE10 0RH
3,52.534348,-1.392864,LE10 0LW,LE10 0LW,LE10 0LR
4,52.546017,-1.38644,LE10 0XB,LE10 0XB,LE10 0XB
5,52.547695,-1.388362,LE10 0TN,LE10 0TN,LE10 0TN
6,52.54538,-1.372886,LE10 1RP,LE10 1RP,LE10 1RP
7,52.53732,-1.375827,LE10 0PJ,LE10 0PJ,LE10 0PJ
8,52.542767,-1.394461,LE10 0XW,LE10 0XW,LE10 0XW
9,52.533012,-1.397817,LE10 0YL,LE10 0YL,LE10 0YJ


From checks on Google, the 'closest' method is most accurate so will be used for this dataset

In [16]:
zoopla_df.drop(columns=['postcode_test_1', 'postcode_test_3'], inplace=True)
zoopla_df = zoopla_df.rename(columns={'postcode_test_2': 'postcode'})

### Get the parish from the postcode

In [17]:
def get_parish(postcode):
    
    """
    Get the parish of the Zoopla property from the mapping dataframe (based on postcode)
    """
    
    try:
        return mapping_df[mapping_df['Postcode']==postcode].iloc[0]['Parish']
    except IndexError:
        if AREA == 'Nuneaton':
            return "Nuneaton and Bedworth, unparished area"
        elif AREA == 'Hinckley':
            return "Hinckley and Bosworth, unparished area"

In [18]:
zoopla_df['parish'] = zoopla_df['postcode'].apply(lambda x: get_parish(x))

In [19]:
zoopla_df['parish'].value_counts()

Hinckley and Bosworth, unparished area    279
Burbage                                    12
Sheepy                                      1
Name: parish, dtype: int64

Maybe not such a useful feature but keep for now!

### Save to csv file

In [20]:
zoopla_df.head()

Unnamed: 0,details_url,agent_phone,description,agent_address,latitude,longitude,outcode,country_code,num_bathrooms,listing_status,property_type,listing_id,num_recepts,post_town,displayable_address,floor_plan,image_url,street_name,agent_name,county,price_modifier,first_published_date,country,last_published_date,price,category,num_bedrooms,agent_logo,postcode,parish
0,https://www.zoopla.co.uk/for-sale/details/6388...,01455 886081,Attractive extended traditional bay fronted s...,"98 Castle Street, Hinckley",52.53883,-1.396291,LE10,gb,1,sale,Semi-detached house,63884099,2,Hinckley,"Langdale Road, Hinckley LE10",,https://lid.zoocdn.com/354/255/c9c518b9e9bd024...,Langdale Road,Scrivins & Co Estate Agents & Letting Agents,Leicestershire,,2023-02-04 11:38:12,England,2023-02-04 11:39:32,260000.0,Residential,3,https://st.zoocdn.com/zoopla_static_agent_logo...,LE10 0NS,"Hinckley and Bosworth, unparished area"
1,https://www.zoopla.co.uk/for-sale/details/6387...,01455 364814,** viewing essential ** A beautifully present...,"112 Castle Street, Hinckley",52.548298,-1.353169,LE10,gb,4,sale,Detached house,63878342,2,Hinckley,"Bradgate Gardens, Hinckley LE10",,https://lid.zoocdn.com/354/255/e2c80f945cd69da...,Bradgate Gardens,Castle Estates,Leicestershire,offers_over,2023-02-03 15:54:45,England,2023-02-04 09:42:14,450000.0,Residential,5,https://st.zoocdn.com/zoopla_static_agent_logo...,LE10 1ND,"Hinckley and Bosworth, unparished area"
2,https://www.zoopla.co.uk/for-sale/details/6387...,01455 364814,**viewing essential ** A well appointed semi ...,"112 Castle Street, Hinckley",52.552856,-1.375555,LE10,gb,1,sale,Semi-detached house,63874929,2,Hinckley,"York Road, Hinckley LE10",,https://lid.zoocdn.com/354/255/4fd00c679828a04...,York Road,Castle Estates,Leicestershire,offers_over,2023-02-03 11:20:13,England,2023-02-03 13:17:45,280000.0,Residential,3,https://st.zoocdn.com/zoopla_static_agent_logo...,LE10 0RH,"Hinckley and Bosworth, unparished area"
3,https://www.zoopla.co.uk/for-sale/details/6387...,01455 364026,An immaculately maintained tastefully decorat...,"28-30 New Buildings, Hinckley",52.534348,-1.392864,LE10,gb,1,sale,Semi-detached house,63871971,1,Hinckley,"Strathmore Road, Hinckley LE10",,https://lid.zoocdn.com/354/255/31105260d73c332...,Strathmore Road,Profiles,Leicestershire,offers_in_region_of,2023-02-02 22:07:06,England,2023-02-02 22:07:06,260000.0,Residential,3,https://st.zoocdn.com/zoopla_static_agent_logo...,LE10 0LW,"Hinckley and Bosworth, unparished area"
4,https://www.zoopla.co.uk/for-sale/details/6152...,01455 364871,You're sure to be impressed when you enter thi...,"84 Castle Steet, Hinckley",52.546017,-1.38644,LE10,gb,1,sale,Bungalow,61524804,1,Hinckley,"Aulton Way, Hinckley, Leicestershire LE10",,https://lid.zoocdn.com/354/255/88ccf710c285162...,Aulton Way,Your Move - Hinckley,Leicestershire,,2023-02-01 18:08:21,England,2023-02-01 18:08:21,315000.0,Residential,3,https://st.zoocdn.com/zoopla_static_agent_logo...,LE10 0XB,"Hinckley and Bosworth, unparished area"


In [21]:
try:
    os.mkdir(SAVE_FOLDER)
except OSError:
    pass

save_file = os.path.join(SAVE_FOLDER, f'zoopla_properties_with_postcode_{AREA.lower()}.csv')
    
zoopla_df.to_csv(save_file, index=False)