## <center>Geocoding Addresses to Get Geo Data Fields (Postal Code, Lat, Lng)</center>

<center>Using the Google Geocoding API. The free quota is limited to 2,500 records per day per IP. Extra records cost USD 0.50 per 1,000, up to 100,000 records per day. More information about the quota can be found at: https://developers.google.com/maps/documentation/geocoding/usage-limits </center>
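
To get a feel for what a batch might cost, the quota figures above can be turned into a small calculation. This is only a sketch: the function name `estimate_extra_cost` and its parameters are made up for illustration, the defaults simply copy the numbers quoted above, and pro-rata billing of partial thousands is an assumption.

```python
def estimate_extra_cost(n_requests, free_quota=2500, usd_per_1000=0.5, daily_cap=100000):
    """Estimate the USD cost of geocoding n_requests records in one day.

    Assumes records beyond the free quota are billed pro rata.
    """
    if n_requests > daily_cap:
        raise ValueError('n_requests exceeds the daily cap of {}'.format(daily_cap))
    billable = max(0, n_requests - free_quota)
    return billable * usd_per_1000 / 1000.0


print(estimate_extra_cost(2000))    # 0.0 (within the free quota)
print(estimate_extra_cost(12500))   # 5.0 (10,000 billable records)
```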

## Importing packages

The documentation for the googlemaps Python package, including common usage, can be found at:
https://github.com/googlemaps/google-maps-services-python

Here we focus on using the API to get the postal code, longitude, latitude, country, and formatted address of each record.

In [1]:
import pandas as pd
import numpy as np
import googlemaps as gm
import json
import re

import warnings
warnings.filterwarnings('ignore')

## Initialize the googlemaps API client with a Google Maps API key

To get a free API key, you need to register on the Google website. More information can be found at:
https://developers.google.com/maps/documentation/geocoding/start#GeocodingRequests

In [2]:
gmaps = gm.Client(key='')  # paste your own API key here; do not leave a real key in a shared notebook
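
One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `GOOGLE_MAPS_API_KEY`; that name and the helper `load_api_key` are illustrative choices, not something the googlemaps package requires.

```python
import os


def load_api_key(var_name='GOOGLE_MAPS_API_KEY', environ=None):
    """Return the API key stored in an environment variable (name is an assumption)."""
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError('Set the {} environment variable first'.format(var_name))
    return key


# Usage (commented out because it needs a real key):
# gmaps = gm.Client(key=load_api_key())
```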

## Reading input files

In [3]:
#File that contains the address data to be geo-coded
FILE_INPUT = 'C:/Users/liuleo/Documents/KT/Ext_Data/Data/hdb_trx.csv'
df = pd.read_csv(FILE_INPUT)

# Name of the column that will contain the full address string
ADDRESS_COL = 'address_real'

# Name of the ID column (customer ID or policy ID); month is used here only as an illustration
ID_COL = 'month'

## Clean up address fields

In [4]:
# Remove all the non-alphanumeric characters in address related columns
# The name of the columns should be replaced by the column names in the hdb policy file from your database
df['block'] = df['block'].apply(lambda x : re.sub('[^A-Za-z0-9]+', ' ', x).lstrip())
df['street_name'] = df['street_name'].apply(lambda x : re.sub('[^A-Za-z0-9]+', ' ', x).lstrip())
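
To see what this cleaning step does to individual values, the same regular expression can be tried on a few strings. The helper name `clean_address_part` and the sample inputs are made up for illustration. Unlike the cell above, this sketch converts the value to a string first (in case a block number was read as an integer) and trims both ends rather than only the left.

```python
import re


def clean_address_part(value):
    """Replace each run of non-alphanumeric characters with one space, then trim."""
    return re.sub('[^A-Za-z0-9]+', ' ', str(value)).strip()


print(clean_address_part('  #123-A '))        # 123 A
print(clean_address_part("ST. GEORGE'S RD"))  # ST GEORGE S RD
print(clean_address_part(456))                # 456
```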

In [5]:
# Combine several columns to get address
# Better to add "block" in front of block number and add 'Singapore' after the address to improve the geocoding precision 
df[ADDRESS_COL] = 'block ' + df['block'] + ', ' + df['street_name'] + ', Singapore' 

In [6]:
# Take a sample to test first, for example 100 rows
SAMPLE_NB = 100
df = df[[ID_COL,ADDRESS_COL]].sample(SAMPLE_NB).reset_index(drop=True)
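
If the same address may appear on several rows, geocoding each distinct address only once saves quota. The sketch below shows the idea with plain Python; `geocode_unique` and `fake_geocode` are hypothetical names, and the fake function stands in for the real API call so that the example runs offline.

```python
def geocode_unique(addresses, geocode_fn):
    """Call geocode_fn once per distinct address and return {address: result}."""
    cache = {}
    for address in addresses:
        if address not in cache:
            cache[address] = geocode_fn(address)
    return cache


calls = []


def fake_geocode(address):
    # Stand-in for the real API call; records every request it receives
    calls.append(address)
    return [{'formatted_address': address.upper()}]


addresses = ['block 1, A ST, Singapore',
             'block 2, B ST, Singapore',
             'block 1, A ST, Singapore']
lookup = geocode_unique(addresses, fake_geocode)
print(len(calls))  # 2 requests for 3 rows
```

The resulting dictionary can then be mapped back onto the address column of the full table.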

## Define function for Geo Coding

In [7]:
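def google_geo_coding(x):
    # x is one DataFrame row; returns the list of matches from the Geocoding API
    try:
        results = gmaps.geocode(x[ADDRESS_COL])
    except Exception:
        # API or network error
        results = 'Not Valid Address'

    # An address with no match comes back as an empty list; flag it the same way
    if not results:
        results = 'Not Valid Address'

    return results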
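
A single failed request is not always a bad address; it may be a temporary network or quota error. A possible refinement is to retry a few times with a growing pause before giving up. This is a sketch under assumptions: `geocode_with_retry`, its parameters, and the `flaky` test function are hypothetical, and the geocoding call is passed in as `geocode_fn` so the example runs without an API key.

```python
import time


def geocode_with_retry(address, geocode_fn, max_attempts=3, base_delay=1.0,
                       sleep_fn=time.sleep):
    """Try geocode_fn up to max_attempts times, doubling the pause after each failure."""
    for attempt in range(max_attempts):
        try:
            results = geocode_fn(address)
        except Exception:
            if attempt == max_attempts - 1:
                return 'Not Valid Address'
            sleep_fn(base_delay * 2 ** attempt)
        else:
            # An empty result list is treated like an invalid address
            return results if results else 'Not Valid Address'


attempts = {'n': 0}


def flaky(address):
    # Fails twice, then succeeds, to simulate a temporary error
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise IOError('temporary failure')
    return [{'formatted_address': address}]


delays = []
result = geocode_with_retry('block 1, A ST, Singapore', flaky, sleep_fn=delays.append)
print(delays)  # [1.0, 2.0]
```

In the notebook, `geocode_fn` would be `gmaps.geocode`, and `sleep_fn` would be left at its default.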

## Geocoding chunk by chunk with Google API

In [8]:
# Define the step length for applying chunk by chunk
n = 10
chunks = []

# Geocode the address data chunk by chunk
for i in range(0, len(df), n):

    # Get the chunk data in each loop (copy to avoid modifying a view of df)
    print('Start GeoCoding first {} address'.format(i+n))
    temp_chunk = df[i:i+n].copy()

    # The geo_coding column contains the JSON object returned by the Google API
    temp_chunk['geo_coding'] = temp_chunk.apply(google_geo_coding, axis=1)
    chunks.append(temp_chunk)
    print('Finish GeoCoding first {} address'.format(i+n))

# Combine all chunks into one result table
if chunks:
    df_result = pd.concat(chunks, ignore_index=True)
else:
    df_result = pd.DataFrame([], columns=[ID_COL, ADDRESS_COL, 'geo_coding'])

Start GeoCoding first 10 address
Finish GeoCoding first 10 address
Start GeoCoding first 20 address
Finish GeoCoding first 20 address
Start GeoCoding first 30 address
Finish GeoCoding first 30 address
Start GeoCoding first 40 address
Finish GeoCoding first 40 address
Start GeoCoding first 50 address
Finish GeoCoding first 50 address
Start GeoCoding first 60 address
Finish GeoCoding first 60 address
Start GeoCoding first 70 address
Finish GeoCoding first 70 address
Start GeoCoding first 80 address
Finish GeoCoding first 80 address
Start GeoCoding first 90 address
Finish GeoCoding first 90 address
Start GeoCoding first 100 address
Finish GeoCoding first 100 address
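
The chunking pattern used in the loop can be isolated in a small generator, which also makes it easy to report the exact row range when the last chunk is shorter than the step length. `iter_chunks` is a hypothetical helper; it works on anything that supports `len()` and slicing, and a plain list is used here so the example runs on its own.

```python
def iter_chunks(items, size):
    """Yield (start_index, chunk) pairs covering items in steps of size."""
    if size <= 0:
        raise ValueError('size must be positive')
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


for start, chunk in iter_chunks(list(range(25)), 10):
    print('rows {}-{}'.format(start, start + len(chunk) - 1))
# rows 0-9
# rows 10-19
# rows 20-24
```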


## Get Geo fields from Google GEO JSON Object:

The Geo Information we need:
1. Postal code, long name (if the address cannot be resolved, this returns the country name: Singapore)
2. Postal code, short name (if the address cannot be resolved, this returns the country code: SG)
3. Longitude
4. Latitude
5. Country name (to filter out any foreign addresses)
6. Formatted address (can be reused later, for example in a campaign)

In [10]:
# The postal code is taken from the last address component and the country from the second to last
df_result['postal_code_long'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['address_components'][-1]['long_name'])

df_result['postal_code_short'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['address_components'][-1]['short_name'])

df_result['lat'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['geometry']['location']['lat'])

df_result['lng'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['geometry']['location']['lng'])

df_result['country'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['address_components'][-2]['long_name'])

df_result['format_add'] = df_result['geo_coding'].apply(
    lambda x: np.nan if x == 'Not Valid Address' else x[0]['formatted_address'])
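
The cell above picks components by position, which is why an unresolved address ends up with the country name in the postal code columns. Each address component also carries a `types` list, so an alternative is to look components up by type. The helper `component_by_type` is hypothetical, and `sample_result` is a hand-written record that only imitates the shape of a geocoding result; it is not real API output.

```python
def component_by_type(result, wanted_type, name_key='long_name'):
    """Return the name of the first address component tagged with wanted_type, or None."""
    for component in result.get('address_components', []):
        if wanted_type in component.get('types', []):
            return component.get(name_key)
    return None


# Made-up record for illustration only
sample_result = {
    'address_components': [
        {'long_name': '123', 'short_name': '123', 'types': ['street_number']},
        {'long_name': 'Example Street', 'short_name': 'Example St', 'types': ['route']},
        {'long_name': 'Singapore', 'short_name': 'SG', 'types': ['country', 'political']},
        {'long_name': '123456', 'short_name': '123456', 'types': ['postal_code']},
    ],
    'formatted_address': '123 Example St, Singapore 123456',
    'geometry': {'location': {'lat': 1.3, 'lng': 103.8}},
}

print(component_by_type(sample_result, 'postal_code'))            # 123456
print(component_by_type(sample_result, 'country', 'short_name'))  # SG
print(component_by_type(sample_result, 'locality'))               # None
```

With this approach a missing postal code shows up as `None` instead of silently taking the country name.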

## Save results to a text file

In [11]:
OUTPUT_FILE = 'C:/Users/liuleo/Documents/KT/TMP/geocoding_results.csv'
df_result.to_csv(OUTPUT_FILE,sep='|',index=False)
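
When the table is written to a text file, the `geo_coding` column is saved as the string form of a Python list, which is awkward to parse back. If the raw results are needed later, one option is to convert each result to JSON text before saving. The sketch below shows the round trip; `to_json_text`, `from_json_text`, and the sample record are illustrative only.

```python
import json


def to_json_text(geo_result):
    """Serialize one geocoding result (or the 'Not Valid Address' marker) to JSON text."""
    return json.dumps(geo_result, sort_keys=True)


def from_json_text(text):
    """Restore the original Python object from its JSON text."""
    return json.loads(text)


# Made-up record for illustration only
record = [{'formatted_address': '123 Example St, Singapore 123456',
           'geometry': {'location': {'lat': 1.3, 'lng': 103.8}}}]
text = to_json_text(record)
print(from_json_text(text) == record)  # True
```

In the notebook this could be applied with `df_result['geo_coding'].apply(to_json_text)` before calling `to_csv`.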