<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span></li></ul></div>

<h3>Overview</h3>

This is part of a series of notebooks used to impute missing location data and create ***lat*** (latitude) and ***lon*** (longitude) columns. The ***location*** column of the original dataset is a dictionary with geocoordinates and a business address. Many of these location entries are null. Where not null, the coordinates are pulled from the location column. Where no coordinates are present, a geocoding API is used to convert the address fields to lat and lon.

Notebook 1.1b **Extract Lat Lon From Points** performs the conversion by coordinates.

In [4]:
import pandas as pd
import numpy as np
import pandas_profiling
import json
import ast

import os
import sys
import re

module_path = os.path.abspath(os.path.join('./lib/'))

if module_path not in sys.path:   
    sys.path.append(module_path)


import datetime
from utilities import *
from sodapy_dataset_reader import *


import shapefile as shp
import matplotlib.pyplot as plt
import seaborn as sns

import time

import geopandas as gpd
from geopandas.tools import geocode
from shapely.geometry import Point

%matplotlib inline

In [5]:
df_by_loc = pd.read_csv('../tmp/by_location_201920195621191425.csv', low_memory=False)

In [6]:
df_by_loc.shape

(143883, 27)

In [7]:
df_by_loc.columns

Index(['business_zip', 'certificate_number', 'city', 'dba_name',
       'full_business_address', 'lic', 'lic_code_description', 'location',
       'mail_city', 'mail_state', 'mail_zipcode', 'mailing_address_1',
       'naic_code', 'naic_code_description',
       'neighborhoods_analysis_boundaries', 'ownership_name', 'parking_tax',
       'state', 'supervisor_district', 'transient_occupancy_tax', 'ttxid',
       'dba_start', 'dba_end', 'loc_start', 'loc_end', 'lat', 'lon'],
      dtype='object')

In [9]:
df_by_loc.lat.head()

0    0
1    0
2    0
3    0
4    0
Name: lat, dtype: int64

In [10]:
df_by_loc.lon.head()

0    0
1    0
2    0
3    0
4    0
Name: lon, dtype: int64

In [11]:
df_by_loc['lat'] = 0
df_by_loc['lon'] = 0

In [12]:
df_by_loc.location.isnull().sum()

0

In [13]:
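def convert_location(cert_id, name, loc):
    """Return (lat, lon) parsed from a location dict string, or (0, 0) on failure."""
    try:
        # The location column holds the string form of a dict; parse it safely
        loc = ast.literal_eval(loc)
    except (ValueError, SyntaxError, TypeError):
        print(f'an error occurred converting literal on {cert_id}, {name}')
        return 0, 0
    try:
        lat = float(loc['latitude'])
        lon = float(loc['longitude'])
    except (KeyError, TypeError, ValueError):
        # Coordinates are missing or not numeric; leave the row for the geocoder
        return 0, 0
    return lat, lon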
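
To illustrate the three outcomes of the function above, here is a minimal, self-contained sketch. `parse_lat_lon` is a hypothetical, compact restatement of `convert_location` (without the id and name arguments used for logging), and the sample strings are invented for illustration, modeled on the format of the location column.

```python
import ast

def parse_lat_lon(loc_str):
    """Hypothetical compact version of convert_location: (lat, lon) or (0, 0)."""
    try:
        loc = ast.literal_eval(loc_str)
        return float(loc['latitude']), float(loc['longitude'])
    except (ValueError, SyntaxError, KeyError, TypeError):
        return 0, 0

# Illustrative inputs, not rows taken from the dataset
with_coords = "{'latitude': '37.799823', 'longitude': '-122.430996'}"
address_only = "{'address': '3101 Laguna St'}"   # parses, but has no coordinates
malformed = "{'latitude': "                      # not a valid Python literal

print(parse_lat_lon(with_coords))
print(parse_lat_lon(address_only))
print(parse_lat_lon(malformed))
```

Only the first input yields real coordinates; the other two fall back to the (0, 0) placeholder, which marks the row for the later geocoding step.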

In [14]:
start_execution = time.time()

df_by_loc[['lat', 'lon']] = df_by_loc.apply(
    lambda row: convert_location(row.certificate_number, row.dba_name, row.location),
    axis=1, result_type='expand')
end_execution = time.time()
print(f'Elapsed time: {end_execution - start_execution}')

Elapsed time: 37.60917401313782


In [15]:
df_by_loc.head()

Unnamed: 0,business_zip,certificate_number,city,dba_name,full_business_address,lic,lic_code_description,location,mail_city,mail_state,...,state,supervisor_district,transient_occupancy_tax,ttxid,dba_start,dba_end,loc_start,loc_end,lat,lon
0,94123.0,28,San Francisco,3101 Laguna Apts,3101 Laguna St,,,"{'latitude': '37.799823', 'longitude': '-122.4...",San Francisco,CA,...,CA,2.0,False,0000028-02-001,1993-09-30,,1993-09-30,,37.799823,-122.430996
1,94116.0,52,San Francisco,Ideal Novak Corp,8 Mendosa Ave,,,"{'latitude': '37.748926', 'longitude': '-122.4...",San Francisco,CA,...,CA,7.0,False,0000052-01-001,1968-10-01,,1968-10-01,,37.748926,-122.465074
2,94123.0,71,San Francisco,Tournahu Arms,1842 Jefferson St,,,"{'latitude': '37.804734', 'longitude': '-122.4...",San Francisco,CA,...,CA,,False,0000071-01-001,1968-10-01,,1968-10-01,2013-12-31,37.804734,-122.442997
3,94123.0,71,San Francisco,3301 Broderick Apartments,3301 Broderick St,,,"{'latitude': '37.800876', 'longitude': '-122.4...",San Francisco,CA,...,CA,2.0,False,0000071-02-001,1968-10-01,,1988-05-01,2013-12-31,37.800876,-122.444757
4,94123.0,71,San Francisco,3301 Broderick Apartments,3301 Broderick St,,,"{'latitude': '37.800876', 'longitude': '-122.4...",San Francisco,CA,...,CA,2.0,False,0000071-02-003,1968-10-01,,1988-05-01,2013-12-31,37.800876,-122.444757


In [16]:
ts = get_timestamp()
df_by_loc.to_csv(f'../tmp/converted_by_loc_{ts}.csv', index=False)

In [17]:
df_by_loc.shape

(143883, 27)

Next step: if lat and lon are still 0, pass the address to the geocoder.
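
The rule can be sketched as a small predicate. This is a minimal illustration, assuming (0, 0) is the only marker of a missing coordinate pair; `needs_geocoding` and the sample tuples are hypothetical names and values, not part of the notebook.

```python
def needs_geocoding(lat, lon):
    """True when both coordinates still hold the 0 placeholder."""
    return lat == 0 and lon == 0

# Illustrative (lat, lon) pairs: one resolved, one still unresolved
pairs = [(37.799823, -122.430996), (0, 0)]
pending = [p for p in pairs if needs_geocoding(*p)]
print(len(pending))
```

Testing both values against the placeholder does not depend on the sign of the latitude, so the same check would also hold for data outside the northern hemisphere.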

In [18]:
from api_keys import *

In [19]:
def convert_address(addr):
    """Geocode an address string; return (lat, lon), or (0, 0) on failure."""
    time.sleep(1)  # wait one second between API requests
    try:
        geo = geocode(addr, provider='google', api_key=google_key)
        # geocode returns a GeoDataFrame; take the Point from its first row
        point = geo.geometry.iloc[0]
        lat = float(point.y)
        lon = float(point.x)
    except Exception:
        return 0, 0
    return lat, lon

NOTE: The following code calls the geocoding API. It was run iteratively until no more addresses could be translated. Because this was a manual process that consumed a lot of resources, the code is included in a commented-out state to avoid having to re-run it.

```

start_execution = time.time()

df_by_loc[['lat', 'lon']] = df_by_loc.apply(
    lambda row: [row.lat, row.lon] if row.lat > 0
    else convert_address(f'{row.full_business_address}, San Francisco, CA'),
    axis=1, result_type='expand')

end_execution = time.time()
print(f'Elapsed time: {end_execution - start_execution}')

df_by_loc.head()

ts = get_timestamp()
df_by_loc.to_csv(f'../tmp/converted_by_loc_then_address_final{ts}.csv', index=False)


```
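
Because the geocoding pass is slow and costly, one way to reduce the number of API calls is to memoize results per address string, since the same address can appear on several rows (as with 3301 Broderick St in the output above). The following is a minimal sketch, not the method used in this notebook: `fake_geocode` is a hypothetical stand-in for `convert_address` that returns a fixed placeholder pair so the example runs without an API key.

```python
from functools import lru_cache

call_count = 0  # counts how often the (stand-in) geocoder is actually invoked

def fake_geocode(addr):
    """Hypothetical stand-in for convert_address; returns placeholder coordinates."""
    global call_count
    call_count += 1
    return (37.800876, -122.444757)

@lru_cache(maxsize=None)
def cached_convert_address(addr):
    """Geocode each distinct address string at most once per process."""
    return fake_geocode(addr)

addresses = [
    '3301 Broderick St, San Francisco, CA',
    '3301 Broderick St, San Francisco, CA',  # repeated address: served from cache
    '8 Mendosa Ave, San Francisco, CA',
]
results = [cached_convert_address(a) for a in addresses]
print(call_count)
```

A cache like this would also remember (0, 0) failures for the lifetime of the process, so it would need clearing with `cached_convert_address.cache_clear()` before retrying failed addresses.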