# Address Enricher Project

# Index

1. [Proclamation](#1)
2. [Installations & Imports](#2)
3. [Configuration File](#3)
4. [Import Data](#4)
5. [Table Configuration](#5)
6. [Google places API class creation](#6)
7. [Extract Values From Results](#7)
8. [Fill Missing Values](#8)
9. [Check Accuracy](#9)
10. [Neighbours](#10)
11. Maps
12. Export Data


# 1. Proclamation



<a id="1"></a>

This notebook enriches an address dataset of any form, shape or size. It depends on a configuration file that controls how the data is imported, how the table is transformed, which distance parameters are used and how the data is exported. A secondary function of the notebook is to cluster the neighbours of a given site, either by distance or by shared block, and to draw a map that visualizes these neighbours. Google Maps APIs are used to implement these functions.

[Return to index](#index)

# 2. Installations & Imports

Before anything else we need to ensure that we have the necessary packages imported and installed to run this notebook.

In [1]:
# !pip install googlemaps

In [2]:
# !pip install google

In [3]:
# !pip install geopy

In [4]:
# !pip install gmaps

In [5]:
# !pip install pyarrow

In [6]:
# !jupyter nbextension enable --py gmaps

In [7]:
# !jupyter nbextension enable --py widgetsnbextension

In [9]:
import numpy as np
import pandas as pd
import os
import googlemaps
from google.cloud import bigquery
import json
import string
import requests
import geopy.distance
import gmaps
import pyarrow

With the package dependencies imported, the next step is to load the configuration file and check that it contains the settings we want.

<a id="3"></a>

# 3. Configuration File

The configuration file is a JSON file that is loaded into the notebook as a dictionary.

In [10]:
def load_config(file):
    with open(file) as conf:
        config = json.load(conf)
    return config

In [11]:
file = input("Input json file to configure Notebook ")

Input json file to configure Notebook config2.json


In [12]:
config = load_config(file)
config

{'file': 'bcx-insights-6dfb9fabfb5b.json',
 'project_ID': 'bcx-insights',
 'table_ID': 'bcx_networkhealth.addresses_20191029',
 'ID': 'ENTITYID',
 'Country': '',
 'Province': 'PROVINCE',
 'Postal Code': 'POSTALCODE',
 'City': '',
 'Suburb': 'SUBURBCITY',
 'Street': 'STREETBOX',
 'Number': '',
 'Building': 'BUILDING',
 'Floor': 'FLOOR',
 'Room': 'ROOM',
 'Latitude': 'LATITUDE',
 'Longitude': 'LONGITUDE',
 'null': 'None',
 'Dict_key': 'City',
 'Max_Distance(km)': 2,
 'Map_Distance(km)': 5,
 'Map_site': '',
 'export_name': 'bcx_networkhealth.addresses_20191029_enriched',
 'export_path': 'Desktop/PROJECT'}

The first three keys are used to load the data and the last two are used to export it. The keys from 'ID' to 'null' are used to transform the table to the required schema. The 'Max_Distance(km)' key specifies the maximum distance for clustering nearest neighbours. See the configuration file user guide for more details on how to use the configuration file and how it affects this notebook: .....
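
As a rough illustration of how these key groups can be separated, the sketch below slices a configuration dictionary by position. `split_config` and the shortened `example` dictionary are hypothetical and not part of the notebook; the sketch assumes the key order of the config shown above and reuses the `[3:-7]` slice that `configure_table` applies.

```python
# Hypothetical helper: splits a config dictionary into the groups described above.
# It relies on the key order of the example config (dicts keep insertion order in Python 3.7+).
def split_config(config):
    items = list(config.items())
    return {
        "import": dict(items[:3]),    # file, project_ID, table_ID
        "schema": dict(items[3:-7]),  # ID ... Longitude (same slice as configure_table)
        "export": dict(items[-2:]),   # export_name, export_path
    }

# Shortened stand-in for the real config; the values are illustrative only.
example = {
    "file": "data.csv", "project_ID": "", "table_ID": "",
    "ID": "ENTITYID", "City": "", "Latitude": "LATITUDE", "Longitude": "LONGITUDE",
    "null": "None", "Dict_key": "City", "Max_Distance(km)": 2, "Map_Distance(km)": 5,
    "Map_site": "", "export_name": "out", "export_path": "Desktop/PROJECT",
}

groups = split_config(example)
print(list(groups["schema"]))
print(list(groups["export"]))
```

Because the groups are found by position, adding or removing a key in the configuration file shifts the slices, so the key order has to stay fixed.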

# 4. Import Data

This function can read data from four different sources (a Parquet file, a CSV file, a JSON file or a BigQuery table) and selects one of them based on the config file.

In [13]:
def df_read_files(config):
    
    # store the filename in an environment variable (used as the BigQuery credentials file)
    filename = config['file']
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = filename
    
    # if and elifs determining which source the dataframe will be created from
    if '.parquet' in filename:
        data = pd.read_parquet(filename, engine = 'pyarrow')
    elif '.csv' in filename:
        data = pd.read_csv(filename)
    elif '.json' in filename:
        if config['project_ID'] == '':
            data = pd.read_json(filename)
            
        else:
            project_id = config['project_ID']
            bigquery_client = bigquery.Client(project = project_id)
            table_id = config['table_ID']
            QUERY = "SELECT * FROM " +  "`" + table_id + "`"
            job = bigquery_client.query(QUERY)
            data = job.to_dataframe()
    else:
        # without this branch an unsupported file type would fail with an UnboundLocalError
        raise ValueError("Unsupported file type: " + filename)
        
    return data

df = df_read_files(config)
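
The source selection can also be looked at in isolation. The sketch below is a hypothetical helper (`pick_source` is not part of the notebook) that only names the reader a given config would select; it assumes the file extension is a reliable indicator and uses `os.path.splitext` instead of a substring test.

```python
import os

# Hypothetical helper: names the reader that a given config would select.
def pick_source(config):
    ext = os.path.splitext(config["file"])[1].lower()
    if ext == ".parquet":
        return "parquet"
    if ext == ".csv":
        return "csv"
    if ext == ".json":
        # a .json file is treated as BigQuery credentials when a project ID is set
        return "bigquery" if config.get("project_ID") else "json"
    raise ValueError("Unsupported file type: " + config["file"])

print(pick_source({"file": "bcx-insights-6dfb9fabfb5b.json", "project_ID": "bcx-insights"}))
print(pick_source({"file": "addresses.csv", "project_ID": ""}))
```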

In [14]:
df.tail()

Unnamed: 0,ENTITYID,STREETBOX,SUBURBCITY,POSTALCODE,CAREOF,BUILDING,FLOOR,ROOM,LATITUDE,LONGITUDE,PROVINCE
26529,33959,Marabastad Informal Trading Market,11th Street in Junction Street mogul Street,Marabastad :Pretoria,,,,,,,
26530,336483,Erf 3080,Cnr Heidelberg & Airport Roads,"Dalpark Ext 5, Brakpan",,,,Shop 111,,,
26531,59858,27 Murrayfield Boulevard,Pretoria - Silverlakes - Homeowners Association,Management Centre 0081,,,,,,,
26532,191114,Cnr R51 and Brizial,"Daveyton, Jhb",Boyas View North UJ Campus Daveyton,,,,no 90,,,
26533,437784,PO BOX 7655,Pretoria,speedpot@icon.co.za(Send invoices via email),,,,,,,Gauteng


# 5. Table Configuration

Now that the data is loaded into a dataframe, the dataframe must be configured to the desired schema.

In [15]:
def configure_table(df, config):
    '''This function uses a configuration file and tunes/formats the input DataFrame to a desired format
    that can be used to enrich the address dataset and get neighbours for each site
    '''
    
    df = df.copy() 
    df.replace(config['null'], '', inplace=True) 
    
    input_cols = df.columns.to_list()
    conf_vals = list(config.values())
    
    #drop unwanted columns
    for i in input_cols:
        if i not in conf_vals:
            df.drop(i, axis=1, inplace=True)
    
    #Get key value pairs in dictionary in a tuple format
    pairs = list(config.items())[3:-7]
    
    #For loop to either rename columns, create new and fill with empty string
    for i in pairs:
        if i[1] in input_cols:
            df.rename(columns={i[1]:i[0]}, inplace=True)
        elif i[1] == '':
            df[i[0]] = i[1]
       
    # Make sure all nans are changed to empty strings        
    df.replace(np.nan, '', inplace=True)
    
    # Change table order
    output_cols = list(config.keys())[3:-7]
    data = df[output_cols]
    
    
    return data

In [16]:
dfc = configure_table(df, config)
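
To make the rename-or-create decision easier to follow, here is a minimal sketch that works on plain dictionaries and lists rather than a DataFrame. `plan_columns` is a hypothetical helper, and the three-key `schema` is a shortened stand-in for the schema part of the config.

```python
# Hypothetical helper mirroring the rename/create decision in configure_table.
def plan_columns(schema, input_cols):
    rename, create = {}, []
    for target, source in schema.items():
        if source in input_cols:
            rename[source] = target  # an existing column gets the schema name
        elif source == "":
            create.append(target)    # a column with no source is created empty
    return rename, create

schema = {"ID": "ENTITYID", "Country": "", "Suburb": "SUBURBCITY"}
rename, create = plan_columns(schema, ["ENTITYID", "SUBURBCITY", "CAREOF"])
print(rename)
print(create)
```

Input columns that do not appear in the schema, such as `CAREOF` here, are neither renamed nor created, which matches the drop step in `configure_table`.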

In [19]:
dfc.head(10)

Unnamed: 0,ID,Country,Province,Postal Code,City,Suburb,Street,Number,Building,Floor,Room,Latitude,Longitude
0,183627,,Gauteng,,,"Loulardia, Centurion",100 Brakfontein Road,,Shopite corporate park,,,,
1,461123,,Gauteng,,,"Marshalltown, Johannesburg",5 Simmonds Street,,Standard Bank Centre,,,,
2,181822,,KwaZulu-Natal,,,Ivongo,C/o Shepstone Street and Collis Road,,1861 Rear Building,,,,
3,283018,,,,,"Highveld Technopark, Centurion",5 Bauhinia Str,,22 Cambridge Office Park,1st Floor,,,
4,297531,,,,,Pietermaritzburg,8 Grix Road,,"Subdivision 9 (of 4) of Lot 122,",,,,
5,67977,,,,,"Laser Park, Honeydew",Zeiss Road,,Kimbuilt Industrial Park,,Block B Unit 10,,
6,372671,,,,,"Rosebank, Gauteng",51 Bath Avenue,,Rosebank Mall,,Shop 327/328,,
7,445339,,,,,"Centurion, Pretoria","5 BauhiniaRoad, Highveld Technopark",,Cambridge Office Park - Building 17,1st floor,,,
8,458160,,Gauteng,,,Highveld Technopark Centurion,"5 Bauhinia Road,",,Building 22 Cambridge Office Park,,,,
9,234524,,,,,"Foreshore, Cpt",14 Christiaan Barnard Street,,Atlantic Centre,,,,


In [20]:
quant = input("WHOLE or SUBSET of data: ")

WHOLE or SUBSET of data: SUBSET


In [21]:
if quant == 'SUBSET':
    rows = input("Amount of rows: ")
    dfc = dfc.copy().head(int(rows))

Amount of rows: 20


# 6. Google places API class creation

Class GooglePlaces contains a constructor and two instance methods. The constructor takes the API key when an instance of the class is created. The method get_place_details_using_address() uses the address passed as an argument to make a geocode API request. The method get_place_details_using_coordinates() uses the coordinates passed as an argument to make a geocode API request. Both methods return the first API result, or an empty list if no results exist.

In [22]:
class GooglePlaces(object):
    
    #constructor
    def __init__(self, apiKey):
        super(GooglePlaces, self).__init__()
        self.apiKey = apiKey
        
    # method
    def get_place_details_using_address(self,address):
        endpoint_url = "https://maps.googleapis.com/maps/api/geocode/json"
        
        # parameters 
        params = {
            'region' : 'za',
            'address': address,
            'key': self.apiKey
            
        }
        
        # api request
        res = requests.get(endpoint_url, params = params)
        res =  json.loads(res.content)
        
        # create list
        empty = []
        
        # checks results
        # if no results, return empty list
        if len(res['results']) == 0:
            return empty
        # else, return results
        else:
            results = res['results'][0]
            return results
    
    # method
    def get_place_details_using_coordinates(self,coords):
        endpoint_url = "https://maps.googleapis.com/maps/api/geocode/json"
    
        # parameters
        params = {
            'region' : 'za',
            'latlng': coords,
            'key': self.apiKey
        }
        res = requests.get(endpoint_url, params = params)
        res =  json.loads(res.content)
        empty = []
        if len(res['results']) == 0:
            return empty
        else:
            results = res['results'][0]
            return results
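# read the API key from an environment variable instead of hardcoding it
# (the variable name GOOGLE_MAPS_API_KEY is an assumption)
gp = GooglePlaces(os.environ['GOOGLE_MAPS_API_KEY'])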
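
The "first result or empty list" rule used by both methods can be tried without a network call. The sketch below is an assumption-laden illustration: `first_result` is a hypothetical helper and the two payloads are made up, shaped like decoded Geocoding API responses.

```python
# Hypothetical helper: applies the same "first result or empty list" rule
# to an already decoded response.
def first_result(payload):
    results = payload.get("results", [])
    return results[0] if results else []

# Made-up payloads shaped like Geocoding API responses.
found = {"status": "OK", "results": [{"place_id": "abc123"}, {"place_id": "def456"}]}
missing = {"status": "ZERO_RESULTS", "results": []}

print(first_result(found))
print(first_result(missing))
```

Returning an empty list for "no result" lets callers test `len(results) > 0`, as the later cells do.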

# 7. Extract Values From Results

Function get_values takes in a dictionary of dictionaries containing the results of the API request. It loops through the values given in the dictionary containing the address_components of the API results and stores them in a new dictionary. Values are then extracted from the new dictionary, inserted into a list in a specific order then returned.

In [25]:
def get_values(d):
    
    """
    This function takes the results from the API request, searches it then extracts certain values and returns a list containing 
    these values in a specific order.
    
    """
   
    # new dictionary
    values = {}
        
    # length of the address components dictionary in the API results
    count = len(d['address_components']) - 1

    
    #loop through the items in address components to create a new dictionary
    while count >= 0:
        
        values.update({d['address_components'][count]['types'][0]: d['address_components'][count]['long_name']})
        count = count -1

    # create a list
    results = []
    
    #get country from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('country',np.nan))
    
    #get province from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('administrative_area_level_1',np.nan))
    
    #get city from dictionary and add to list, where no value exists, nan is added to the list
    if 'administrative_area_level_2' in values.keys() and 'locality' in values.keys():
        results.append(values.get('locality',np.nan))
    else:
        results.append(values.get('administrative_area_level_2',np.nan))
    #get suburb from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('political',np.nan))
    
    #get street from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('route',np.nan))
    
    #get number from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('street_number',np.nan))
    
    #get code from dictionary and add to list, where no value exists, nan is added to the list
    results.append(values.get('postal_code',np.nan)) 

    #get lat from the geometry section of the results and add to list
    results.append(d['geometry']['location']['lat'])
    
    #get lng from the geometry section of the results and add to list
    results.append(d['geometry']['location']['lng'])
       
    #get place id
    results.append(d['place_id'])
    
    # return list
    return results
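
The lookup step inside get_values can be illustrated on a small, made-up result. The `sample` dictionary below is hypothetical and only imitates the shape of one entry of the API results; the sketch indexes each address component by its first type, as get_values does.

```python
# Made-up result shaped like one entry of the Geocoding API "results" list.
sample = {
    "address_components": [
        {"long_name": "5", "types": ["street_number"]},
        {"long_name": "Simmonds Street", "types": ["route"]},
        {"long_name": "Johannesburg", "types": ["locality", "political"]},
        {"long_name": "Gauteng", "types": ["administrative_area_level_1", "political"]},
        {"long_name": "South Africa", "types": ["country", "political"]},
    ],
    "geometry": {"location": {"lat": -26.2093, "lng": 28.0394}},
    "place_id": "example-place-id",
}

# Index each component by its first type, as get_values does.
by_type = {c["types"][0]: c["long_name"] for c in sample["address_components"]}

print(by_type.get("route"))
print(by_type.get("postal_code", "missing"))
```

Only the first entry of `types` is used as the key, so a component is found under `political` only when that happens to be its first type.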

# 8. Fill Missing Values

Function fill_missing takes parameters dataframe and instance. 'dataframe' is a dataframe with the required format and 'instance' is an instance of the class GooglePlaces containing the API key in use. This function loops through every row of the dataframe and does the following with each iteration:

##### 1.Splits the columns

The row is split into two parts: one containing all the elements of an address and the other the coordinates of the address. 

##### 2.Checks which values exist in the row and calls the appropriate API request methods. 
Certain values of the address cannot be null. If this condition is met, all the available address information is passed as a single string to get_place_details_using_address(). If the condition is not met, get_place_details_using_coordinates() is called with the coordinates instead. When neither the address values nor the coordinates exist (null), the row is skipped and the index is incremented.

##### 3.Checks the results of the API request
When get_place_details_using_address() returns an empty list, get_place_details_using_coordinates() is called. If both methods return empty lists, the row is skipped and the index is incremented.
When either method (get_place_details_using_address() or get_place_details_using_coordinates()) returns results, function get_values is called and returns a list of values in a specific format.

##### 4.Compares values in the row to the values in the list
First the list is checked for null values, which are replaced with the values of the row. Then the row is checked for null values and mismatched values (values that do not match the values in the list), which are replaced accordingly.

##### 5.Updates dataframe
The row is then updated in the dataframe. The index is incremented and when the loop is completed, the dataframe is returned.
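
The choice made in step 2 can be sketched as a small standalone function. `choose_lookup` is a hypothetical helper, not part of the notebook, and it assumes that both `None` and empty strings count as missing, which is stricter than the `pd.notnull` checks in the cell below.

```python
# Hypothetical helper: decides which lookup a row would use, following step 2 above.
def choose_lookup(row):
    def has(key):
        # assumption: None and empty strings both count as missing
        return row.get(key) not in (None, "")

    address_ok = (
        (has("Number") and has("Street"))
        or (has("Room") and has("Building"))
        or (has("Street") and has("Building"))
    )
    if address_ok:
        return "address"
    if has("Latitude") and has("Longitude"):
        return "coordinates"
    return "skip"

print(choose_lookup({"Street": "5 Simmonds Street", "Building": "Standard Bank Centre"}))
print(choose_lookup({"Latitude": -26.2093, "Longitude": 28.0394}))
print(choose_lookup({"Suburb": "Ivongo"}))
```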

In [26]:
def fill_missing(dataframe,instance):
    
    dataframe = dataframe.copy()
    # create column to store the place_id 
    dataframe['Place_ID'] = ""
    
    #initialise control variable
    index = 0
    
    #create list of column names
    cols = dataframe.columns.tolist()
    
    
    while index < len(dataframe):

        row = dataframe.copy().iloc[index]
        
        # split dataframe
        #address elements
        address_df = row[['ID','Country','Province','City','Suburb','Street','Number','Postal Code','Building','Floor','Room']]
        #coordinates
        coordinates_df = row[['Latitude','Longitude']]
        
        # create a list of columns in a specific order
        add_df_cols = ['Room','Floor','Building','Number','Street','Suburb','Postal Code','City','Province','Country']
        

        # check if number and street or room and building has values
        if pd.notnull(address_df['Number']) and pd.notnull(address_df['Street']) or pd.notnull(address_df['Room']) and pd.notnull(address_df['Building']) or pd.notnull(address_df['Street']) and pd.notnull(address_df['Building']):

        #get existing values from the address df and store in a string

            address  = ""

            #loop through columns in the address df
            for col in add_df_cols:
                #where values are not null, add it to address string
                if pd.notnull(address_df[col]):
                    address += str(address_df[col])+" "

            # call get_details function to make a request to the api
            results = instance.get_place_details_using_address(address)

            # check api returned results
            if len(results) > 0:

                #format results of the geocode api request by calling get_values
                values = get_values(results)

                #inserts original values from the dataframe into the list
                values.insert(0,row.ID)
                values.insert(8,row.Building)
                values.insert(9,row.Floor)
                values.insert(10,row.Room) 
                row['Place_ID'] = values[-1]

            #if api results empty, use coordinates
            elif pd.notnull(coordinates_df['Latitude']) and pd.notnull(coordinates_df['Longitude']):
                
                #get coordinates from dataframe and store as string
                coordinates = str(coordinates_df.Latitude) + "," + str(coordinates_df.Longitude)
    
                # call get_details function to make a request to the api
                results = instance.get_place_details_using_coordinates(coordinates)

                # check api returned results
                if len(results) > 0:

                #format results of the geocode api request by calling get_values
                    values = get_values(results)

                    #inserts original values from the dataframe into the list
                    values.insert(0,row.ID)
                    values.insert(8,row.Building)
                    values.insert(9,row.Floor)
                    values.insert(10,row.Room)
                    row['Place_ID'] = values[-1]

                #if api results empty, skip row and move on to the next one
                else:
                    index += 1
                    continue
            else:
                index +=1
                continue
        
        # if street or room and building empty, use coordinates
        elif pd.notnull(coordinates_df['Latitude']) and pd.notnull(coordinates_df['Longitude']):
            
            #get coordinates from dataframe and store as string
            coordinates = str(coordinates_df.Latitude) + "," + str(coordinates_df.Longitude)
        
            # call get_details function to make a request to the api
            results = instance.get_place_details_using_coordinates(coordinates)
                       
            # check api returned results
            if len(results) > 0:
            
                #format results of the geocode api request by calling get_values
                values = get_values(results)


                values.insert(0,row.ID)
                values.insert(8,row.Building)
                values.insert(9,row.Floor)
                values.insert(10,row.Room)
                row['Place_ID'] = values[-1]
            
            #if api results empty, skip row and move on to the next one
            else:
                index +=1
                continue
        # if address elements empty and coordinates empty, skip row and move on to the next one
        else:
            index +=1
            continue
        
        #create dataframe with API results
        frame = pd.DataFrame(columns = cols, data = [values])

        row_2 = frame.copy().iloc[0]


        for col in cols:
            
            # check for nulls in the results of api request and change it to the value column
            if pd.isnull(row_2[col]):
                row_2[col] = row[col]

            # check for nulls and mismatched values in dataframe row and replace with results of api 
            if row[col] != row_2[col]:
                row[col] = row_2[col]

        #update row in dataframe
        dataframe.loc[index] = row

        #increment index
        index += 1
    
    return dataframe
dff = fill_missing(dfc,gp)

In [28]:
dff.head()

Unnamed: 0,ID,Country,Province,Postal Code,City,Suburb,Street,Number,Building,Floor,Room,Latitude,Longitude,Place_ID
0,183627,South Africa,Gauteng,Centurion,Louwlardia,Brakfontein Road,100,1683,Shopite corporate park,,,-25.9115,28.1659,ChIJWUXqQIdllR4RyRR2weZ-WAc
1,461123,South Africa,Gauteng,Johannesburg,Selby,Simmonds Street,5,2001,Standard Bank Centre,,,-26.2093,28.0394,ChIJkU1txKIOlR4RLYRcu8wOmss
2,181822,South Africa,KwaZulu-Natal,Margate,Margate,Ivongo,C/o Shepstone Street and Collis Road,4275,1861 Rear Building,,,-30.8499,30.3805,EjVDb2xsaXMgUmQgJiBTaGVwc3RvbmUgU3QsIE1hcmdhdG...
3,283018,South Africa,Gauteng,Centurion,Highveld Techno Park,Bauhinia Street,5,157,22 Cambridge Office Park,1st Floor,,-25.8811,28.1831,ChIJg_vaT8dllR4RJm6Dkglhdo4
4,297531,South Africa,KwaZulu-Natal,Pietermaritzburg,Willowton,Grix Road,8,3201,"Subdivision 9 (of 4) of Lot 122,",,,-29.5947,30.4175,Ej8xMjIsIDggR3JpeCBSZCwgV2lsbG93dG9uLCBQaWV0ZX...


# 9. Check Accuracy

Function check_accuracy compares values in the dataframe to the results of the geocoding API to see if they are the same. The function loops through every row in the dataframe and does the following with each iteration:

1. Gets address and coordinates
Builds an address and coordinates string using existing values in the dataframe.

2. API calls
Calls both API methods and passes the appropriate string to each.

3. Updates values
Checks the results of the API and updates the row accordingly

4. Compares values
Creates lists of the values in the dataframe and the results of the API and checks if they match or not. New columns are created to store this result.

5. Prints details
Results are stored in a dictionary and printed.
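
The comparison in step 4 and the percentage in step 5 can be sketched without pandas. `match_percentage` is a hypothetical helper, and the two row lists are made up; the sketch counts a row as a match only when every listed column is equal.

```python
# Hypothetical helper: percentage of rows whose listed columns are identical
# in the original and the enriched data.
def match_percentage(original_rows, enriched_rows, columns):
    if not original_rows:
        return 0.0
    matches = sum(
        all(o[c] == e[c] for c in columns)
        for o, e in zip(original_rows, enriched_rows)
    )
    return matches / len(original_rows) * 100

# Made-up rows: the first pair matches, the second differs in City.
original = [{"City": "Centurion", "Street": "Bauhinia Street"},
            {"City": "Cpt", "Street": "Bath Avenue"}]
enriched = [{"City": "Centurion", "Street": "Bauhinia Street"},
            {"City": "Cape Town", "Street": "Bath Avenue"}]

print(match_percentage(original, enriched, ["City", "Street"]))
```

An exact equality test is strict: differences in spelling, abbreviation or data type (for example a coordinate stored as text versus a float) all count as a mismatch.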

In [32]:
def check_accuracy(dataframe,instance):

    #initialise index
    index = 0
    
    #get column names and store in list
    cols = dataframe.columns.tolist()
    
    # create new dataframe
    frame = pd.DataFrame(columns = cols)
    
    # new dict
    d = {}
    
    # initialise count variable
    count_empty = 0

    while index < len(dataframe): 

        row = dataframe.copy().iloc[index]

        # split dataframe
        # address elements
        address_df = row[['ID','Country','Province','City','Suburb','Street','Number','Postal Code','Building','Floor','Room']]
        
        # coordinates
        coordinates_df = row[['Latitude','Longitude']]
        
        #get coordinates from dataframe and store as string
        coordinates = str(coordinates_df.Latitude) + "," + str(coordinates_df.Longitude)
        
        # create a list of columns in a specific order
        add_df_cols = ['Room','Floor','Building','Number','Street','Suburb','Postal Code','City','Province','Country']
        
        address  = ""

        #loop through columns in the address df
        for col in add_df_cols:
            #where values are not null, add it to address string
            if pd.notnull(address_df[col]):
                address += str(address_df[col])+" "       

        # call get_details method to make a request to the api using address
        results = instance.get_place_details_using_address(address)
     
        
        # call get_details method to make a request to the api using coordinates
        res = instance.get_place_details_using_coordinates(coordinates)
        
        # check if api results empty and increment counter if true
        if len(res) == 0:
            count_empty += 1
        if len(results) == 0:
            count_empty += 1
      
        # check if both api calls have results
        if len(results) > 0 and len(res) > 0:
           
            # call get_values function for both api results
            values = get_values(results)
            vals = get_values(res)
            
            # fill row with the results
            row.Latitude = values[7]
            row.Longitude = values[8]
            row.Number = vals[5]
            row.Street = vals[4]
            row.Suburb = vals[3]
            row.City = vals[2]
            row.Province = vals[1]
            row.Country = vals[0]
            row['Postal Code'] = vals[6]
            

        elif len(results) > 0 and len(res) == 0:
        
            #get results of the geocode api request
            values = get_values(results)

            
            row.Latitude = values[7]
            row.Longitude = values[8]
            row.Number = ""
            row.Street = ""
            row.Suburb = ""
            row.City = ""
            row.Province = ""
            row.Country = ""
            row['Postal Code'] = ""

        elif len(results) == 0 and len(res) > 0:
            
            vals = get_values(res)

            #fill row with available results
            row.Number = vals[5]
            row.Street = vals[4]
            row.Suburb = vals[3]
            row.City = vals[2]
            row.Province = vals[1]
            row.Country = vals[0]
            row['Postal Code'] = vals[6]
            row.Latitude = ""
            row.Longitude = ""
         
        else:

            # if no results available, add unchanged row to new dataframe and move on to the next row
            frame = frame.append(row,ignore_index=True)
            index += 1
            continue
            
        #update row in dataframe
        frame = frame.append(row, ignore_index=True)
        
        # create the result columns once, so results of earlier rows are not overwritten
        if 'Address_Match' not in frame.columns:
            frame['Address_Match'] = ""
            frame['Coordinates_Match'] = ""
        
        # copy row from both dataframes 
        row1 = dataframe.copy().iloc[index]
        row2 = frame.copy().iloc[index]
        
        cols_address = ['Country','Province','City','Suburb','Street','Number','Postal Code']
        cols_coordinates = ['Latitude','Longitude']

        # save address values in a list
        df_address_values = [row1[i] for i in cols_address]
        f_address_values = [row2[i] for i in cols_address]
         
        # compare values in the two lists and record the result for this row only
        if df_address_values == f_address_values:
            frame.loc[index, 'Address_Match'] = 'Match'
        else:
            frame.loc[index, 'Address_Match'] = 'Mismatch'
            
        # save coordinate values in a list
        df_coord_values = [row1[i] for i in cols_coordinates]
        f_coord_values = [row2[i] for i in cols_coordinates]
            
        # compare values in the two lists and record the result for this row only
        if df_coord_values == f_coord_values:
            frame.loc[index, 'Coordinates_Match'] = 'Match'
        else:
            frame.loc[index, 'Coordinates_Match'] = 'Mismatch'
            
        #increment index
        index += 1
        
    # filter dataframe where values match    
    a = frame[frame['Address_Match']== 'Match']
    b = frame[frame['Coordinates_Match']=='Match']
    
    # update dictionary
    # number of rows
    d.update({'# of rows': len(frame)})
    
    # number of empty api results
    d.update({'# of empty API results': count_empty})
    
    # address match % (count the matching rows; summing the 'Match' strings does not give a count)
    d.update({'Address Match %': len(a) / len(frame) *100})
    
    # coordinate match %
    d.update({'Coordinates Match %': len(b) / len(frame) *100})
    
    # print dictionary
    print(d)
check_accuracy(dff,gp)

{'# of rows': 20, '# of empty API results': 0, 'Address Match %': 0.0, 'Coordinates Match %': 0.0}


# 10. Neighbours

In this part of the notebook we find the neighbours of a given site, either by distance or by shared block. We start by creating a dictionary of dictionaries that holds the distances between sites in the same city.

In [40]:
def city_distances_h(df):
    """
    
    Creates a dictionary with each unique City as a key, storing the great-circle (haversine) distance
    in km from each site to every other site in that respective City
    
    RUN THIS CELL ONCE for a subset/entire dataset to store the distances
    
    """
    
    # Sort values by city
    df = df.sort_values(by=['City']).reset_index()
    cities = sorted(df['City'].unique())
    dic3 = {}
    
    # Create a dictionary for each city.
    for city in cities:
        data = df[df['City'] == city]
        data = data.reset_index()
        dic3[city] = {}
        
        # Nested for loop filling dictionary with values and keys (distances and IDs)
        for i in range(len(data)):
            start_point = (data['Latitude'][i], data['Longitude'][i]) # start
            dic3[city][data['ID'][i]] = {}
            
            for j in range(len(data)):
                end_point = (data['Latitude'][j], data['Longitude'][j]) #end
                # great-circle distance in km, rounded to two decimals
                distance = round(geopy.distance.great_circle(start_point, end_point).km, 2)
                dic3[city][data['ID'][i]][data['ID'][j]] = distance # dictionary imputing
                
    return dic3


In [41]:
dic3 = city_distances_h(dff)
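
For reference, the great-circle distance between two coordinate pairs can be approximated with the haversine formula using only the standard library. This is a minimal sketch and not the geopy implementation: `haversine_km` is a hypothetical helper and the Earth radius of 6371 km is an assumed mean value, so results may differ slightly from `geopy.distance.great_circle`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

# Hypothetical helper: haversine distance in km between two (lat, lon) pairs in degrees.
def haversine_km(start, end):
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Coordinates of the first two sites in the enriched table above.
print(round(haversine_km((-25.9115, 28.1659), (-26.2093, 28.0394)), 2))
```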

Thereafter the neighbours within a given distance are computed. Same-block neighbours are the sites with zero distance, the same building or the same Place_ID.

In [42]:
def neighbours(r, df, dic):
    """
    
    Pulls the site ID's that fall within the given range r and adds them to a column
    of a dataframe. Also adds another column for sites on the same block if the given
    latitude and longitude for any sites are the same
    
    """

    df = df.copy()
    
    # Useful variables and lists
    city = df['City']
    ID = df['ID']
    build = df['Building']
    places = df['Place_ID']
    d_nbs = []
    b_nbs = []
    
    # for loop iterating over dataframe
    for i in range(len(df)):
        x = dic[city[i]][ID[i]]
        y = build[i]
        z = places[i]
        
        ids = [key for key, val in x.items() if val <= r]
        
        blocks = [key for key, val in x.items() if val == 0]
        
        if build[i] != '':
            blocks2 = [Id for Id, building in zip(ID, build) if (building == y) and (Id not in blocks)]
            # extend keeps the list flat; append would nest blocks2 as a single element
            blocks.extend(blocks2)
            
        # Place_ID is stored as a string, so compare against the empty string
        if places[i] != '':
            blocks3 = [Id for Id, place in zip(ID, places) if (place == z) and (Id not in blocks)]
            blocks.extend(blocks3)
            
        d_nbs.append(ids)
        b_nbs.append(blocks)
        
    df['Max_Distance'] = [i for i in d_nbs]
    df['Same_Block'] = [c for c in b_nbs if c != []]
    
    for i in range(len(df)):
        if df['ID'][i] in df['Max_Distance'][i]:
            df['Max_Distance'][i].remove(df['ID'][i])
            
    for i in range(len(df)):
        if df['ID'][i] in df['Same_Block'][i]:
            df['Same_Block'][i].remove(df['ID'][i])
    
    df.drop('Place_ID', axis=1, inplace=True)
    
    return df

In [45]:
df = neighbours(config['Max_Distance(km)'], dff, dic3)
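
The distance filter at the heart of `neighbours` can be tried on a small nested dictionary. The sketch below is illustrative only: `within` is a hypothetical helper and the distances are made up, laid out in the same `{city: {site: {other_site: distance}}}` form that `city_distances_h` returns.

```python
# Made-up distances (km) in the nested {city: {site: {other_site: distance}}} layout.
distances = {
    "Centurion": {
        283018: {283018: 0.0, 458160: 0.0, 183627: 3.9},
        458160: {283018: 0.0, 458160: 0.0, 183627: 3.9},
        183627: {283018: 3.9, 458160: 3.9, 183627: 0.0},
    }
}

# Hypothetical helper: IDs of the other sites in the same city within r km of a site.
def within(distances, city, site, r):
    return [other for other, d in distances[city][site].items() if d <= r and other != site]

print(within(distances, "Centurion", 283018, 2))
print(within(distances, "Centurion", 183627, 2))
```

Excluding the site itself inside the filter avoids the separate clean-up loop that removes each ID from its own neighbour list.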

In [46]:
df.head(10)

Unnamed: 0,ID,Country,Province,Postal Code,City,Suburb,Street,Number,Building,Floor,Room,Latitude,Longitude,Max_Distance,Same_Block
0,183627,South Africa,Gauteng,Centurion,Louwlardia,Brakfontein Road,100,1683,Shopite corporate park,,,-25.9115,28.1659,[],[]
1,461123,South Africa,Gauteng,Johannesburg,Selby,Simmonds Street,5,2001,Standard Bank Centre,,,-26.2093,28.0394,[],[]
2,181822,South Africa,KwaZulu-Natal,Margate,Margate,Ivongo,C/o Shepstone Street and Collis Road,4275,1861 Rear Building,,,-30.8499,30.3805,[],[]
3,283018,South Africa,Gauteng,Centurion,Highveld Techno Park,Bauhinia Street,5,157,22 Cambridge Office Park,1st Floor,,-25.8811,28.1831,"[458160, 445339]","[458160, 445339]"
4,297531,South Africa,KwaZulu-Natal,Pietermaritzburg,Willowton,Grix Road,8,3201,"Subdivision 9 (of 4) of Lot 122,",,,-29.5947,30.4175,[],[]
5,67977,South Africa,Gauteng,Roodepoort,Laser Park,Zeiss Road,9,2040,Kimbuilt Industrial Park,,Block B Unit 10,-26.0776,27.915,[],[]
6,372671,South Africa,Gauteng,Johannesburg,Rosebank,Bath Avenue,51,2196,Rosebank Mall,,Shop 327/328,-26.1465,28.0412,[370241],[]
7,445339,South Africa,Gauteng,Centurion,Highveld Techno Park,Bauhinia Street,5,157,Cambridge Office Park - Building 17,1st floor,,-25.8811,28.1831,"[283018, 458160]","[283018, 458160]"
8,458160,South Africa,Gauteng,Centurion,Highveld Techno Park,Bauhinia Street,5,157,Building 22 Cambridge Office Park,,,-25.8811,28.1831,"[283018, 445339]","[283018, 445339]"
9,234524,South Africa,Western Cape,Cape Town,Foreshore,Christiaan Barnard Street,14,8000,Atlantic Centre,,,-33.9211,18.4332,[],[]


# 11. Maps

In [47]:
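# read the API key from an environment variable instead of hardcoding it
# (the variable name GOOGLE_MAPS_API_KEY is an assumption)
gmaps.configure(api_key=os.environ['GOOGLE_MAPS_API_KEY'])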

In [48]:
def single_id_map(r, ID, city, df, dictionary):
    
    df[['Latitude', 'Longitude']].astype(float)
    dic = dictionary
    sites = []
    x = dic[city][ID]
    ids = [key for key, val in x.items() if val <= r]
    sites.append(ids)
    
    coords = []
    indexes = []
    for s in sites[0]:
        row = df[df['ID'] == s]
        place = (float(row['Latitude'].item()), float(row['Longitude'].item()))
        ID = str(row['ID'].item())
        coords.append(place)
        indexes.append(ID)
        
    fig = gmaps.figure()
    markers = gmaps.marker_layer(coords, info_box_content=indexes)
    fig.add_layer(markers)
        
    return fig

In [49]:
single_id_map(5, 45338052, 'BELLVILLE', testing, testing_dic)

NameError: name 'testing' is not defined

In [0]:
def closest_neighbour(ID, df, dic):
    
    city = df[df['ID'] == ID]['City'].item()
    distances = dic[city][ID]
    
    lowest = min(filter(None, distances.values()))
    site = next(k for k, v in distances.items() if v == lowest)
    
    start_lat = df[df['ID'] == ID]['Latitude'].item()
    start_lon = df[df['ID'] == ID]['Longitude'].item()
    start_site = (float(start_lat), float(start_lon))
    
    end_lat = df[df['ID'] == site]['Latitude'].item()
    end_lon = df[df['ID'] == site]['Longitude'].item()
    end_site = (float(end_lat), float(end_lon))
    
    fig = gmaps.figure()
    route = gmaps.directions_layer(start_site, end_site,travel_mode = 'TRANSIT')
    fig.add_layer(route)
    
    
    return fig

In [0]:
closest_neighbour(45338052, testing, testing_dic)

Figure(layout=FigureLayout(height='420px'))

# 12. Export Data

In [0]:
def df_export_file(dataframe, file_name, file_path, project_ID, table_ID):
    
    if 'parquet' in file_name:
        # to_parquet is called on the dataframe itself, so no `self` argument is passed
        dataframe.to_parquet(file_name, engine='auto', compression='snappy', index=None, partition_cols=None)
    elif 'csv' in file_name:
        # os.path.join builds the path with the separator of the current operating system
        dataframe.to_csv(os.path.join(file_path, file_name + '.csv'))
    elif 'json' in file_name:
        dataframe.to_json(os.path.join(file_path, file_name + '.json'))
    elif table_ID in file_name:
        # to_gbq creates (or replaces) the destination table, so no separate CREATE TABLE query is needed
        dataframe.to_gbq(file_name,
                 project_ID,
                 if_exists='replace')
    else:
        raise ValueError("Unsupported export name: " + file_name)


df_export_file(test, config["export_name"], config["export_path"], config["project_ID"], config["table_ID"])

1it [00:04,  4.27s/it]
