## Geocoding & Creating Output Files

For this geocoding exercise we are going to use `OpenCage`, a geocoding service.
* Install the client library (and Jupyter) in your environment with: `pip install opencage jupyter`.
* Create an account on the [OpenCage website](http://opencagedata.com), and save the `API key` locally, somewhere you can retrieve it easily.

In [1]:
from opencage.geocoder import OpenCageGeocode, RateLimitExceededError
from pprint import pprint

The safest and easiest way to protect an API key is to keep it out of the notebook, in an environment file.
* Create an env file (`<filename>.env`):
    * open a terminal and run: `touch <filename>.env`
    * open the created file in a text editor
    * save the API key as `OPENCAGE=yourapikey`

To read the .env file you need to install python-dotenv: `pip install python-dotenv`

In [2]:
import os
from dotenv import load_dotenv, find_dotenv

In [18]:
load_dotenv('simply.env') # read the variables defined in simply.env into the environment
key = os.getenv('OPENCAGE') # retrieve the API key
geocoder = OpenCageGeocode(key) # create the geocoder client with the key
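
If the `.env` file is missing or the variable name is misspelled, `os.getenv` returns `None` and the problem only shows up later as a failed request. A minimal sketch of an earlier check, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or `opencage`):

```python
import os


def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise if unset/empty.

    `require_env` is a hypothetical helper written for this tutorial.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, '').strip()
    if not value:
        raise RuntimeError(f'{name} is not set; check your .env file')
    return value
```

After `load_dotenv('simply.env')`, calling `key = require_env('OPENCAGE')` could stand in for the plain `os.getenv` call.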

Geocoder directionality:
* __Forward__ : `geocode` converts an address (pseudo-spatial data) to a coordinate pair, often alongside other supplementary info.
* __Reverse__ : `reverse_geocode` converts a coordinate pair (spatial data) to an address.

## Reverse Geocoding

In [24]:
lat, long = 45.51597294921563, -122.68166028329473
results = geocoder.reverse_geocode(lat, long)

# returns a list of result dictionaries (parsed from the JSON response)
pprint(results)

[{'annotations': {'DMS': {'lat': "45° 30' 57.37032'' N",
                          'lng': "122° 40' 54.46776'' W"},
                  'FIPS': {'county': '41051', 'state': '41'},
                  'MGRS': '10TER2485340316',
                  'Maidenhead': 'CN85pm83et',
                  'Mercator': {'x': -13656875.124, 'y': 5672616.967},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=399825297#map=17/45.51594/-122.68180',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=17/45.51594/-122.68180&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=45.51594&mlon=-122.68180#map=17/45.51594/-122.68180'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']}

### Parsing the `Reverse` Geocoding result
The result is easy to parse. Let's retrieve the "formatted" place name, which includes the address. We want the first result (index 0) and the value of its `formatted` key.

In [25]:
formatted = results[0]['formatted']
pprint(formatted)

('The Sovereign, 710 Southwest Madison Street, Portland, OR 97205, United '
 'States of America')


Similarly, if we just want the street name, town and state, we can concatenate those values.

In [35]:
# concatenate multiple values
street = results[0]['components']['road']+ ', '+ results[0]['components']['city']+ ', '+results[0]['components']['state_code']

print(street)

# Or, using join...
street = ', '.join([results[0]['components']['road'], results[0]['components']['city'], results[0]['components']['state_code']])
print(street)


# Join in a cleaner way...
comp = results[0]['components']
street = ', '.join([comp['road'], comp['city'], comp['state_code']])
print(street)

Southwest Madison Street, Portland, OR
Southwest Madison Street, Portland, OR
Southwest Madison Street, Portland, OR
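
Not every result carries every component: a lookup outside a city may have no `road` or `city` key, and the direct indexing above would then raise `KeyError`. A small sketch that joins only the components that are present; `short_address` is a hypothetical helper, and the sample dictionary is illustrative rather than real API output:

```python
def short_address(components, keys=('road', 'city', 'state_code')):
    """Join the requested components, skipping any that are missing or empty."""
    parts = [str(components[k]) for k in keys if components.get(k)]
    return ', '.join(parts)


# Illustrative dictionary shaped like results[0]['components'], minus 'city'
sample = {'road': 'Southwest Madison Street', 'state_code': 'OR'}
print(short_address(sample))
```

With a real response, the call would be `short_address(results[0]['components'])`.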


## Forward Geocoding

In [38]:
query = '1207 SW Broadway, Portland, OR 97205'

results = geocoder.geocode(query)

pprint(results)

[{'annotations': {'DMS': {'lat': "45° 30' 57.23028'' N",
                          'lng': "122° 40' 53.81220'' W"},
                  'FIPS': {'county': '41051', 'state': '41'},
                  'MGRS': '10TER2486740311',
                  'Maidenhead': 'CN85pm83et',
                  'Mercator': {'x': -13656854.857, 'y': 5672610.816},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=219867064#map=17/45.51590/-122.68161',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=17/45.51590/-122.68161&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=45.51590&mlon=-122.68161#map=17/45.51590/-122.68161'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']}

Notice that the geocoder has returned 3 results, each slightly more general than the previous one. The first matches the address to a specific place (1207 Southwest Broadway, Portland, OR 97205), the second matches it to an address, and the third simply places the address in "Multnomah County" with zip code 97205.
* The first result is the most precise (highest score)
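
To compare the candidates side by side, one option is to list each result's `formatted` string next to its score. The sketch below assumes each result dictionary carries a `confidence` key, as OpenCage responses generally do; `summarize` is a hypothetical helper and the sample list is made up for illustration:

```python
def summarize(results):
    """Return (formatted, confidence) pairs, in the order the geocoder gave them."""
    return [(r.get('formatted', ''), r.get('confidence')) for r in results]


# Made-up stand-in for the list returned by geocoder.geocode(query)
sample_results = [
    {'formatted': 'Example Building, 1 Example Street', 'confidence': 10},
    {'formatted': 'Example County', 'confidence': 5},
]
for formatted, confidence in summarize(sample_results):
    print(confidence, formatted)
```

In the notebook, `summarize(results)` would be called on the list returned by the forward geocoding cell.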

### Parsing `forward` geocoding result
We can parse this result too. We could get the formatted address, the location's identifying [what3words](https://what3words.com/) words, or the lat and long corresponding to the address.

In [40]:
formatted = results[0]['formatted']
print(formatted)

what3words = results[0]['annotations']['what3words']['words']
print(what3words)

lat = results[0]['geometry']['lat']
long = results[0]['geometry']['lng']

print([lat, long])

1207 Southwest Broadway, Portland, OR 97205, United States of America
broad.asleep.raft
[45.5158973, -122.6816145]


## Batch Geocoding 
Let's read a CSV file, iterate over each row, geocode its address, and print the results.

In [51]:
# import CSV library
import csv

# provide location of csv file
address_file = 'data/som_lib.csv'
output_file = 'data/som_lib_geocoded.csv'

# you can also derive the output path from the input path, like:
output_file = os.path.splitext(address_file)[0] + '_geocoded.csv'

# open file...
try:
    # newline='' lets the csv module handle line endings itself
    with open(address_file, 'r', newline='') as incsv:
        # Open output_file in write mode ('w')
        with open(output_file, 'w', newline='') as outcsv:
            reader = csv.DictReader(incsv)
            # Add lat and lng to existing fieldnames
            field_names = reader.fieldnames + ['lat', 'lng']
            # Create the DictWriter - we'll use this to write rows
            writer = csv.DictWriter(outcsv, field_names)
            # Write the header row so the output keeps its column names
            writer.writeheader()
            # Iterate over the CSV rows...
            for line in reader:
                address = line['address'].strip()
                result = geocoder.geocode(address)
                # if the address was geocoded, store lat and lng; otherwise leave them empty
                if result:
                    line['lat'] = result[0]['geometry']['lat']
                    line['lng'] = result[0]['geometry']['lng']
                else:
                    line['lat'] = None
                    line['lng'] = None
                print(line)
                # write geocoded row to output file
                writer.writerow(line)
# Handle a missing input file
except FileNotFoundError:
    print(f'{address_file} does not exist.')
# Handle exceeding the rate limit (the free OpenCage service allows at most 2500 requests per day)
except RateLimitExceededError as ex:
    print(ex)

OrderedDict([('name', 'Somerville Public Library East Branch'), ('address', '115 Broadway, Somerville, MA 02145'), ('lat', 42.3893085), ('lng', -71.0869435)])
OrderedDict([('name', 'Somerville Public Library West Branch'), ('address', '40 College Ave, Somerville, MA 02144'), ('lat', 42.3981483), ('lng', -71.1217138)])
OrderedDict([('name', 'Somerville Public Library Central Branch'), ('address', '79 Highland Avenue, Somerville, MA 02143'), ('lat', 42.3857881), ('lng', -71.0944054)])
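
Besides a daily quota, geocoding services typically also cap requests per second, so a batch loop usually benefits from a short pause between calls and from not repeating identical addresses. A rough sketch under those assumptions: `geocode_all`, its `delay` parameter, and the injected `geocode_fn` are hypothetical names; in this notebook `geocode_fn` would be `geocoder.geocode`.

```python
import time


def geocode_all(addresses, geocode_fn, delay=1.0):
    """Geocode each distinct address once, pausing `delay` seconds between calls.

    Returns a dict mapping address -> (lat, lng), or (None, None) if no match.
    """
    cache = {}
    for address in addresses:
        address = address.strip()
        if address in cache:
            continue  # already looked up; no request needed
        if cache:
            time.sleep(delay)  # pause before every request except the first
        result = geocode_fn(address)
        if result:
            geometry = result[0]['geometry']
            cache[address] = (geometry['lat'], geometry['lng'])
        else:
            cache[address] = (None, None)
    return cache
```

Passing the geocoding function in as an argument keeps the helper independent of the API client, which also makes it easy to try out with a stand-in function before spending real requests.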


### Next Steps...

1. Let's now wrap everything up into a function for reusability
2. Place the geocoding operation in a workflow for the Pandas library
3. Use Geopandas to write the result to a GeoJSON file

In [57]:
def geocode(in_file, address_field, out_file=None):
    # default output path: <in_file name>_geocoded.csv
    if not out_file:
        out_file = os.path.splitext(in_file)[0] + '_geocoded.csv'

    # Create try/except block
    try:
        with open(in_file, 'r', newline='') as incsv:
            # open output file in write mode ('w')...
            with open(out_file, 'w', newline='') as outcsv:
                reader = csv.DictReader(incsv)
                # Add lat & lng to existing fieldnames
                field_names = reader.fieldnames + ['lat', 'lng']
                # Create the DictWriter - we'll use this to write rows
                writer = csv.DictWriter(outcsv, field_names)
                # Write the header row so the output keeps its column names
                writer.writeheader()
                # Iterate over the CSV rows...
                for line in reader:
                    # read the address from the column named by address_field
                    address = line[address_field].strip()
                    result = geocoder.geocode(address)
                    # if the address was geocoded, store lat and lng; otherwise leave them empty
                    if result:
                        line['lat'] = result[0]['geometry']['lat']
                        line['lng'] = result[0]['geometry']['lng']
                    else:
                        line['lat'] = None
                        line['lng'] = None
                    print(line)
                    # write geocoded row to output file
                    writer.writerow(line)
    # Handle a missing input file
    except FileNotFoundError:
        print(f'{in_file} does not exist.')
    # Handle exceeding the rate limit (the free OpenCage service allows at most 2500 requests per day)
    except RateLimitExceededError as ex:
        print(ex)

In [58]:
geocode(in_file='data/som_lib.csv', address_field='address')

OrderedDict([('name', 'Somerville Public Library East Branch'), ('address', '115 Broadway, Somerville, MA 02145'), ('lat', 42.3893085), ('lng', -71.0869435)])
OrderedDict([('name', 'Somerville Public Library West Branch'), ('address', '40 College Ave, Somerville, MA 02144'), ('lat', 42.3981483), ('lng', -71.1217138)])
OrderedDict([('name', 'Somerville Public Library Central Branch'), ('address', '79 Highland Avenue, Somerville, MA 02143'), ('lat', 42.3857881), ('lng', -71.0944054)])


Now, let's rewrite the process using `Pandas`
* We are going to use `df.apply(...)`. This behaves like a compact <i>"for loop"</i> (it is not truly vectorized): it applies an operation (in this case, the geocoding function) to each row (axis=1)


In [59]:
import pandas as pd

In [85]:
def geocode(row, query_field):
    '''
    params: 
    -------
        row: a row of the dataframe created in run().
        query_field : name of the column that holds the address in the csv file.
    
    Description:
    -----------
        Geocode the address in the given row and return the row with
        'lat' and 'lng' added (None when the address cannot be geocoded).
    '''
    try:
        address = row[query_field].strip()
        result = geocoder.geocode(address)
        if result:
            row['lat'] = result[0]['geometry']['lat']
            row['lng'] = result[0]['geometry']['lng']
        else:
            row['lat'] = None
            row['lng'] = None
        return row
    except RateLimitExceededError as ex:
        # report the problem, then re-raise so df.apply() stops instead of receiving None
        print(ex)
        raise
        

def run(in_file, query_field, out_file=None):
    '''
    params:
    -------
        in_file : csv file with data to be geocoded
        query_field : name of the column that holds the address in the csv file
        out_file : optional output path; defaults to <in_file name>_geocoded.csv
    
    Description: 
    -----------
        Create a dataframe from the csv file
        Apply geocode() to each row, producing a new dataframe
        Print the resulting dataframe
        Save the dataframe to a csv file
    '''
    if not out_file:
        out_file = os.path.splitext(in_file)[0] + '_geocoded.csv'
    try:
        df = pd.read_csv(in_file)
        df = df.apply(lambda x: geocode(x, query_field), axis=1)
        print(df)
        # index=False keeps the dataframe's row index out of the output file
        df.to_csv(out_file, index=False)
    except FileNotFoundError:
        print(f'{in_file} does not exist.')

In [86]:
# call run function

run(in_file='data/som_lib.csv', query_field='address')

                                       name  \
0     Somerville Public Library East Branch   
1     Somerville Public Library West Branch   
2  Somerville Public Library Central Branch   

                                    address        lat        lng  
0        115 Broadway, Somerville, MA 02145  42.389308 -71.086944  
1      40 College Ave, Somerville, MA 02144  42.398148 -71.121714  
2  79 Highland Avenue, Somerville, MA 02143  42.385788 -71.094405  


### Extension 3: Write a GeoJSON Using GeoPandas
Let's extend our geocoder to write a GeoJSON file automatically; this removes the step in which we interpret the latitudes and longitudes using GIS software. The key line is `row['geometry'] = Point(...)`: we're creating a point object which Geopandas can interpret (in the block beginning `gpd.GeoDataFrame()`).

In [64]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

In [92]:
def geocode(row, query_field):
    '''
    params: 
    -------
        row: a row of the dataframe created in read_data_return_geojson().
        query_field : name of the column that holds the address in the csv file.
    
    Description:
    -----------
        Geocode the address in the given row and return the row with a
        'geometry' Point added (None when the address cannot be geocoded).
    '''
    try:
        address = row[query_field].strip()
        result = geocoder.geocode(address)
        if result:
            # Point takes (x, y), i.e. (longitude, latitude)
            row['geometry'] = Point(result[0]['geometry']['lng'], result[0]['geometry']['lat'])
        else:
            row['geometry'] = None
        return row
    except RateLimitExceededError as ex:
        # report the problem, then re-raise so df.apply() stops instead of receiving None
        print(ex)
        raise
        
        
def read_data_return_geojson(in_file, query_field, out_file=None):
    '''
    params:
    -------
        in_file : csv file with data to be geocoded
        query_field : name of the column that holds the address in the csv file
        out_file : optional output path; defaults to <in_file name>_geocoded.geojson
    
    Description: 
    -----------
        Create a dataframe from the csv file
        Apply geocode() to each row and wrap the result in a GeoDataFrame
        Print the resulting GeoDataFrame
        Save the GeoDataFrame to a GeoJSON file
    '''
    if not out_file:
        out_file = os.path.splitext(in_file)[0] + '_geocoded.geojson'
    try:
        df = pd.read_csv(in_file)
        # the geocoder returns WGS84 longitude/latitude, hence EPSG:4326
        gdf = gpd.GeoDataFrame(df.apply(lambda x: geocode(x, query_field), axis=1),
                               geometry='geometry', crs='EPSG:4326')

        print(gdf)
        gdf.to_file(out_file, driver='GeoJSON')
    except FileNotFoundError:
        print(f'{in_file} does not exist.')

In [93]:
# define the main function
def main():
    """ 
    Runs read_data_return_geojson() function, which internally runs geocode()    
    """
    # let any error propagate unchanged so the real cause stays visible
    read_data_return_geojson(in_file='data/som_lib.csv', query_field='address')

In [94]:
# Call main function

if __name__ == "__main__":
    main()

                                       name  \
0     Somerville Public Library East Branch   
1     Somerville Public Library West Branch   
2  Somerville Public Library Central Branch   

                                    address                    geometry  
0        115 Broadway, Somerville, MA 02145  POINT (-71.08694 42.38931)  
1      40 College Ave, Somerville, MA 02144  POINT (-71.12171 42.39815)  
2  79 Highland Avenue, Somerville, MA 02143  POINT (-71.09441 42.38579)  
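
For reference, the file that `to_file(..., driver='GeoJSON')` writes is plain JSON in which each row becomes a `Feature` whose coordinates are ordered longitude first, which is why the code above builds `Point(lng, lat)`. A standard-library sketch of that structure, using coordinates printed earlier in this notebook; `to_feature_collection` is a hypothetical helper, not what GeoPandas runs internally:

```python
import json


def to_feature_collection(rows):
    """Build a GeoJSON FeatureCollection dict from rows with 'lat' and 'lng' keys."""
    features = []
    for row in rows:
        # everything except the coordinates becomes a feature property
        properties = {k: v for k, v in row.items() if k not in ('lat', 'lng')}
        features.append({
            'type': 'Feature',
            'properties': properties,
            # GeoJSON positions are [longitude, latitude]
            'geometry': {'type': 'Point', 'coordinates': [row['lng'], row['lat']]},
        })
    return {'type': 'FeatureCollection', 'features': features}


rows = [{'name': 'Somerville Public Library East Branch',
         'lat': 42.3893085, 'lng': -71.0869435}]
print(json.dumps(to_feature_collection(rows), indent=2))
```

Swapping the two coordinates is a common mistake: the file still loads, but the points land in the wrong part of the world.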
