<div align="right">Python 3.6</div>

## Testing The Google Maps Subclass
This notebook tests objects that extract information into a DataFrame using the Google Maps API.  It began as an effort to operationalize the useful parts of a messy procedure I wrote for some research.  The original use case was to look up just the address and add it to tables that already contained latitude and longitude.  This notebook may grow to test more related objects as they are developed, as well as expansions of the original code.

### Enrich or Change Larger Dataframe Section by Section

The purpose of the <font color=blue><b>DFBuilder</b></font> object is to scan a larger DataFrame a small number of rows at a time.  Custom code can then change each small batch and build up a new DataFrame from the results.  By design, the operation runs in a standard loop.  The original use case was to add a field with data fetched from a web API, where time delays (along with other logic) were necessary to prevent, or at least reduce the risk of, server timeouts during operation.

Scanning through the source a few rows at a time, performing the operation, and appending the result to the target DF creates a "caching effect": data is saved along the way, so a server timeout does not cost you everything.  The resulting DF can then be saved to a file, and a rerun of <font color=blue><b>buildOutDF()</b></font> on the remaining rows should let you pick up where you left off and add more data, instead of having to begin again.
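
The caching idea can be sketched without pandas or a web API. This is a minimal illustration rather than the notebook's implementation: `flaky_lookup`, `process_in_chunks`, and the plain lists standing in for DataFrames are all hypothetical, and the failure is simulated.

```python
def flaky_lookup(row, fail_on=None):
    """Stand-in for a web API call; raises on one chosen row to simulate an outage."""
    if row == fail_on:
        raise ConnectionError("simulated server timeout")
    return {"value": row, "enriched": row * 10}


def process_in_chunks(rows, out, chunk_size=5, fail_on=None):
    """Append enriched rows to `out` one chunk at a time.

    Returns how many rows of `rows` were processed. If a chunk fails, the
    chunks completed earlier are already in `out`, so only that chunk is lost.
    """
    i = 0
    while i < len(rows):
        end = min(i + chunk_size, len(rows))
        try:
            chunk = [flaky_lookup(r, fail_on) for r in rows[i:end]]
        except ConnectionError:
            return i           # resume point: first row of the failed chunk
        out.extend(chunk)      # "cache" the finished chunk
        i = end
    return i


rows = list(range(12))
out = []

# First run fails on row 7, inside the chunk covering rows 5-9.
resume_at = process_in_chunks(rows, out, chunk_size=5, fail_on=7)
print(resume_at, len(out))     # 5 5

# Second run continues from the resume point instead of starting over.
resume_at += process_in_chunks(rows[resume_at:], out, chunk_size=5)
print(resume_at, len(out))     # 12 12
```

At most one chunk of work is lost per failure, which is the trade-off the chunk size controls.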

The abstract class sets up the core logic and subclasses add in functions to modify the data in different ways and potentially using different APIs.  This notebook only tests the subclass designed for the Google Maps API.
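
The division of labor between the abstract class and its subclasses can be shown in miniature. The names `ChunkBuilder` and `UppercaseBuilder` below are hypothetical and work on plain lists; they only illustrate the pattern of a base class owning the loop while a subclass supplies the modification.

```python
from abc import ABC, abstractmethod


class ChunkBuilder(ABC):
    """Hypothetical base class: owns the loop, defers the per-chunk change."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.out = []

    @abstractmethod
    def modify_chunk(self, chunk):
        """Return a modified copy of `chunk`; subclasses must implement this."""

    def build(self, rows):
        # Walk the input chunk_size rows at a time and collect the results.
        for i in range(0, len(rows), self.chunk_size):
            self.out.extend(self.modify_chunk(rows[i:i + self.chunk_size]))
        return self.out


class UppercaseBuilder(ChunkBuilder):
    """Hypothetical subclass: the only code it adds is the modification."""

    def modify_chunk(self, chunk):
        return [s.upper() for s in chunk]


print(UppercaseBuilder(chunk_size=2).build(["a", "b", "c"]))   # ['A', 'B', 'C']

try:
    ChunkBuilder(chunk_size=2)    # abstract classes cannot be instantiated
except TypeError as err:
    print("TypeError:", err)
```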

### Libraries Needed
Import statements included in this notebook are for the main abstract object and a test object.

In [3]:
# general libraries
import pandas as pd

In [4]:
## required for Google Maps API code
import os

## for larger data and/or make many requests in one day - get Google API key and use these lines:
# os.environ["GOOGLE_API_KEY"] = "YOUR_GOOGLE_API_Key"
## for better security (PROD environments) - install key to server and use just this line to load it:
# os.environ.get('GOOGLE_API_KEY')

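# set up geocoder (geopy's Nominatim, which serves OpenStreetMap data)
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
import time

# note: geopy 2.x requires a user_agent, e.g. Nominatim(user_agent="your-app-name")
geolocator = Nominatim()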

# note: for now could do this ... used time because it is already in use


### Test Data
Input Data Set up Here

In [5]:
## Test code on a reasonably small DF
tst_lat_lon_df = pd.read_csv("testset_unique_lat_and_lon_vals.csv", index_col=0)

In [6]:
tst_lat_lon_df.describe()

Unnamed: 0,lat,lon
count,1160.0,1160.0
mean,41.232457,-74.04298
std,1.162332,1.463141
min,39.390049,-78.366203
25%,40.269619,-74.651615
50%,40.74292,-74.14307
75%,42.3616,-73.752644
max,44.950298,-70.187302


In [7]:
tst_lat_lon_df.tail()

Unnamed: 0,lat,lon
1155,43.233299,-70.911079
1156,43.233601,-70.911301
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


### Code Testing

The abstract class that follows is intended to be the "work horse" of this code.  The intent is that it gets the developer to the point where all they need to think about is what their final subclass will do to enrich the data.  The parent class sets up a loop that extracts a small number of rows from a larger input DF, operates on them in a temp DF, and then adds them to an output DF.  If something interrupts the process (a common event when dealing with web APIs), the rows modified before the incident are waiting in the output DF and can be extracted.  The code can then be restarted or continued to build up the rest of the DataFrame without losing previous work or having to go all the way back to the beginning.

In [8]:
# note: gmtime() produces results in Greenwich Mean Time
#       localtime() gets the local time from the computer (in my case EST)

from time import localtime, strftime

def getNow():
    return strftime("%Y-%m-%d %H:%M:%S", localtime())

In [9]:
getNow()

'2018-06-07 14:32:24'

In [10]:
from abc import ABCMeta, abstractmethod
import pandas as pd

class DFBuilder(object, metaclass=ABCMeta):       # sets up abstract class
    '''DataFrame Builder abstract class.  Sets up logic to be inherited by objects that need to loop over a DataFrame
and cache the results.  Original use case involves making API calls to the web which can get interrupted by 
errors and server timeouts.  This object stores all the logic to build up and save a DataFrame a small number of 
records at a time.  Then a subclass can define an abstract method in the base class as to what we want to do to 
the input data.  Original use case added in content extracted from the web to a new column.  But subclasses can 
be built to do more. Initialization arguments: endRw, time_delay. endRw = number of records to cache at a time
when building outDF.  time_delay is number of seconds delay between each cycle of the loop that builds outDF.'''
    def __init__(self,endRw,time_delay):          # abstract classes can be subclassed
        self.endRow=endRw                         # but cannot be instantiated
        self.delay=time_delay
        self.tmpDF=pd.DataFrame()   # temp DF will be endRow rows in length
        self.outDF=pd.DataFrame()   # final DF build in sets of endRow rows so all is not lost in a failure
        self.lastIndex = None
        self.statusMsgGrouping = 100
        
    def __str__(self):
        return ("Global Settings for this object: \n" +  
                "endRow: " + str(self.endRow) + "\n" + 
                "delay:  " + str(self.delay) + "\n" + 
                "statusMsgGrouping: " + str(self.statusMsgGrouping) + "\n" +
                "Length of outDF: " + str(len(self.outDF)) + "\n" +
                "nextIndex: " + str(self.lastIndex))       
                 # if continuing build process with last added table - index of next rec.
        
    @abstractmethod                               # abstract method definition in Python
    def _modifyTempDF_(self): pass                # This method will operate on TempDF inside the loop
    
    def set_statusMsgGrouping(self, newValue):
        '''Change number of records used to determine when to provide output messages during buildOutDF().
Default is 100 records.  newValue=x sets this to a new number. Note that if endRow is not a factor of 
statusMsgGrouping output may appear at unexpected intervals. endRow sets the number of rows to cache to
outDF in each iteration of the build loop.'''
        
        self.statusMsgGrouping = newValue
        print(self)
        
    def set_timeDelay(self, newValue):
        '''Change number of seconds in time delay between requests while creating outDF().
Default is 1 second.  newValue=x sets this to a new number.'''
        self.delay = newValue
        print(self)
        
    def set_endRow_OutDf_caching(self, newValue):
        '''Change value of endRow which controls how many rows to cache at a time within buildOutDF().
Default is 5.  If something goes wrong and you have to restart the process, this value also represents
the maximum number of requests you will lose.  The rest will have already been added to outDF.
newValue=x sets this to a new number.'''
        self.endRow = newValue
        print(self)
    
    def buildOutDF(self, inputDF):
        '''Scans inputDF using self.endRow rows (default of 5) at a time to do it.  It then calls in logic
from _modifyTempDF_() to make changes to each subset of rows and appends tiny tempDF onto an outDF.  When the 
subclass is using a web API, self.delay tells it how much time to delay each iteration of the loop. Should
this function fail in the middle, outDF will have all work up to the failure.  
This can be saved out to a DF or csv.  The function can be run again on a subset of the data 
(the records not encountered yet before the failure).'''
    
        lenDF = len(inputDF)
        print("Timestamp: ", getNow())
        print("Processing inputDF with length of: ", lenDF)
        print("Please wait ...")
        endIndx = 0

        i = 0
        while i < lenDF:
            # print("i: ", i)
            endIndx = i + self.endRow
            if endIndx > lenDF:
                endIndx = lenDF

            # print("Range to use: ", i, ":", endIndx)
            if i % self.statusMsgGrouping == 0:
                print(getNow(), "Now processing index: ", i)

            self.tmpDF = inputDF[i:endIndx].copy(deep=True)
            self._modifyTempDF_()
            time.sleep(self.delay)
            self.outDF = pd.concat([self.outDF, self.tmpDF])   # DataFrame.append() was removed in pandas 2.0
            self.lastIndex = endIndx 
            i = endIndx
            # print("i at end of loop: ", i)        
          
        self.reindex_OutDF()
        print("Process complete. %d records added to outDF." %(self.lastIndex))
        print("Timestamp: ", getNow())
        
    def reindex_OutDF(self):
        '''Reindex OutDF using same settings that are used internally for the index during its creation.
This is like doing: outDF.reset_index(drop=True, inplace=True).'''
        self.outDF.reset_index(drop=True, inplace=True)

In [11]:
class GMapsLoc_DFBuilder(DFBuilder): 
    '''This class inherits DFBuilder.buildOutDF() which makes use of data extraction and modification functions in
this subclass.  endRw sets number of rows to process at a time while building outDF (default=5). time_delay 
can set the time delay between loop iterations to help prevent licensing issues and related server timeouts.
Default is 1 second. Initialization arguments: endRw, time_delay, return_null.
 * endRw controls grouping: process endRow rows at a time and add to outDF (default is 5).
 * time_delay has default of 1 second and sets how much time to wait between requests while building outDF.
 * return_null, if False, records error text formatted as "_<errTxt>_" for records that failed to process.
   Set to True to have it return blank records when errors occur instead (default is False).'''
    def __init__(self, endRw=5,time_delay=1, return_null=False):           
        super().__init__(endRw,time_delay)
        self.rtn_null = return_null
        self.timeout = 10
        self.location = ""  # stores last location accessed using getGeoAddr
        
    def __str__(self):
        outStr = (super().__str__() + "\n" +
                  "rtn_null: " + str(self.rtn_null) + "\n" +
                  "timeout: "  + str(self.timeout) + "\n")
        if isinstance(self.location, (type(None), str)):
            outStr = outStr + "location (last obtained): " + str(self.location)
        else:
            outStr = outStr + "location (last obtained): " + str(self.location.raw)
            
        return outStr

    def set_ServerTimeout(self, newValue):
        '''Change number of seconds for the server timeout setting used during web requests.
Default is 10 seconds.  newValue=x sets this to a new number.'''
        self.timeout = newValue
        print(self)    
    
    def testConnection(self, lat=48.8588443, lon=2.2943506):
        '''Test getGeoAddr() function to prove connection to Google Maps is working.  Use this ahead of
performing much larger operations with Google Maps.'''
        return self.getGeoAddr(lat, lon)
        
    def getGeoAddr(self, lt, lng, timeout=10, test=False, rtn_null=False):
        '''Make call to Google Maps API to return back just the address from the json location record.  Errors
should result in text values to help identify why an address was not returned.  This can be turned off and 
records that failed can bring back just an empty field by setting rtn_null to True. timeout = server timeout 
and has a default that worked well during testing.  '''
        
        try:
            self.location = geolocator.reverse(str(lt) + ", " + str(lng), timeout=timeout)
            if test:
                print("===============================")
                print("Address:\n")
                print(self.location)
                print("===============================")
                rtnVal = self.location
            else:
                rtnVal = self.location.address
        except GeocoderTimedOut as gEoTo:
            print(type(gEoTo))
            print(gEoTo)
            self.location = None
            rtnVal = "_" + str(gEoTo).upper().replace(' ', '_').replace(':', '') + "_"
            ## old error text: "_TIME_OUT_ERROR_ENCOUNTERED_"
        except Exception as eee:
            print(type(eee))
            print(eee)  
            self.location = None
            rtnVal = "_" + str(eee).upper().replace(' ', '_').replace(':', '') + "_"
        finally:
            # time_delay is not included here and should be incorporated into
            # the loop that calls this function if desirable

            if rtn_null and self.location is None:
                return ""
            else:
                return rtnVal
        
    def _modifyTempDF_(self, test=False):
        '''Add Address Field to tempDF based on lat, lon (latitude/longitude field values in inputDF)'''
        self.tmpDF["Address"] = self.tmpDF.apply(lambda x: self.getGeoAddr(lt=x.lat,lng=x.lon, 
                                     timeout=self.timeout,test=False, rtn_null=self.rtn_null), axis=1)

### Testing of The Subclass
A different subclass was created in another notebook to test most, if not all, of the non-web-related logic of the abstract class.  Testing in this notebook can therefore focus on the code that produces final results and interacts with the Google Maps API.

This section shows how the code can build up outDF, adding addresses obtained from the Google Maps API to the latitude and longitude provided in the input data.  Tests show how errors are handled, either as text of the form "_<errorTxt>_" in the address field, or as empty strings if you set rtn_null to True.  Tests also show how data can be added to outDF by re-running the build function.  This allows adding more data to outDF, or adding data that was missed due to server timeout errors or other interruptions to the web process.
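
The two error-handling modes come down to a single string transformation. The helper below, `error_marker`, is a hypothetical name that isolates the expression used in the `except` blocks of `getGeoAddr()` so the result can be seen without making a web request.

```python
def error_marker(exc, rtn_null=False):
    """Return "" when rtn_null is True, otherwise the "_<ERROR_TEXT>_" marker."""
    if rtn_null:
        return ""
    # Same transformation as getGeoAddr(): upper-case, spaces to "_", drop ":".
    return "_" + str(exc).upper().replace(" ", "_").replace(":", "") + "_"


err = Exception("HTTP Error 429: Too Many Requests")
print(error_marker(err))                        # _HTTP_ERROR_429_TOO_MANY_REQUESTS_
print(repr(error_marker(err, rtn_null=True)))   # ''
```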

#### Test Main Logic with Error Handling Exposed
These tests were designed to show the error handling in action.  For the sake of brevity, earlier testing was deleted so that only the later tests are shown, in which errors are expected (due to exceeding the daily license allotment from Google).

In [10]:
## build main object using the defaults   
testObj = GMapsLoc_DFBuilder()

In [11]:
print(testObj)

Global Settings for this object: 
endRow: 5
delay:  1
statusMsgGrouping: 100
Length of outDF: 0
nextIndex: None
rtn_null: False
timeout: 10
location (last obtained): 


In [12]:
testObj.buildOutDF(tst_lat_lon_df)  ## some tests not shown performed ahead of this run
                                    ## errors should be result of exceeding daily record allotment
                                    ## for free Google Maps API license
                                    ## this code tests basic functioning and default error handling

Timestamp:  2018-06-06 14:44:58
Processing inputDF with length of:  1160
Please wait ...
2018-06-06 14:44:58 Now processing index:  0
2018-06-06 14:46:04 Now processing index:  100
2018-06-06 14:47:10 Now processing index:  200
2018-06-06 14:48:15 Now processing index:  300
2018-06-06 14:49:21 Now processing index:  400
2018-06-06 14:50:26 Now processing index:  500
2018-06-06 14:51:30 Now processing index:  600
2018-06-06 14:52:35 Now processing index:  700
2018-06-06 14:53:39 Now processing index:  800
2018-06-06 14:54:43 Now processing index:  900
2018-06-06 14:55:48 Now processing index:  1000
2018-06-06 14:56:52 Now processing index:  1100
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP E

In [14]:
testObj.outDF.head()

Unnamed: 0,lat,lon,Address
0,42.377602,-71.124702,"6, Berkeley Street, Old Cambridge, Cambridge, ..."
1,42.432098,-71.056099,"47, Bowers Avenue, Bryant Terrace Apartments, ..."
2,42.249298,-71.074501,"46, Buckingham Road, Milton, Norfolk County, M..."
3,42.3578,-71.062698,"Boston Athenaeum, Beacon Street, Downtown Cros..."
4,42.347,-71.074799,"303, Columbus Avenue, Chinatown, South End, Bo..."


In [15]:
testObj.outDF.tail()  ## this check shows default behavior
                      ## errors recorded in address field so user can find out why a particular location
                      ## failed to return results - in this case "too many requests" (for license allotment)
                      ## errors begin and end with "_" which an address will not.  
                      ## a query or filter of the data for addresses starting with "_" can inform the user
                      ## which records need to be run again

Unnamed: 0,lat,lon,Address
1155,43.233299,-70.911079,_HTTP_ERROR_429_TOO_MANY_REQUESTS_
1156,43.233601,-70.911301,_HTTP_ERROR_429_TOO_MANY_REQUESTS_
1157,43.233299,-70.910698,_HTTP_ERROR_429_TOO_MANY_REQUESTS_
1158,43.233398,-70.911003,_HTTP_ERROR_429_TOO_MANY_REQUESTS_
1159,43.233299,-70.910713,_HTTP_ERROR_429_TOO_MANY_REQUESTS_
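
The filter described in the comments above might look like the following. This is a sketch on a hypothetical three-row miniature of outDF (coordinates and strings copied from the outputs shown, with the first address left truncated as displayed), not a cell from the original run.

```python
import pandas as pd

# Hypothetical miniature of outDF.
out_df = pd.DataFrame({
    "lat": [42.377602, 43.233299, 43.233601],
    "lon": [-71.124702, -70.911079, -70.911301],
    "Address": [
        "6, Berkeley Street, Old Cambridge, Cambridge, ...",
        "_HTTP_ERROR_429_TOO_MANY_REQUESTS_",
        "_HTTP_ERROR_429_TOO_MANY_REQUESTS_",
    ],
})

# Rows whose Address holds an error marker rather than a real address.
failed = out_df[out_df["Address"].str.startswith("_")]
print(failed.index.tolist())   # [1, 2]

# With rtn_null=True the failures are empty strings, so the test becomes:
# failed = out_df[out_df["Address"] == ""]
```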


In [16]:
## change error handling and a few other default parameters
testObj.rtn_null = True           ## change error handling: bad records will now simply get blank Address values
testObj.set_statusMsgGrouping(10) ## get status message about every 10 records (this will be a small test)
testObj.set_timeDelay(0)          ## remove time delay (this increases risk of errors)
                                  ## note: each set_ function outputs current state of variables
                                  ##       each output begins with "Global settings ..."
                                  ##       last one is what these settings look like going into the next test

Global Settings for this object: 
endRow: 5
delay:  1
statusMsgGrouping: 10
Length of outDF: 1160
nextIndex: 1160
rtn_null: True
timeout: 10
location (last obtained): None
Global Settings for this object: 
endRow: 5
delay:  0
statusMsgGrouping: 10
Length of outDF: 1160
nextIndex: 1160
rtn_null: True
timeout: 10
location (last obtained): None


In [22]:
testObj.buildOutDF(tst_lat_lon_df[-25:])  ## redo end of DF .. should be entirely blank since we're out of licenses
                                          ## rtn_null = True told code to return empty cell instead of error text
                                          ## in production, it may be easier to just search for the nulls to get
                                          ## which records to redo, then delete nulls and add in missing records.

Timestamp:  2018-06-06 15:29:05
Processing inputDF with length of:  25
Please wait ...
2018-06-06 15:29:05 Now processing index:  0
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Requests
2018-06-06 15:29:09 Now processing index:  10
<class 'geopy.exc.GeocoderServiceError'>
HTTP Error 429: Too Many Reques

In [20]:
testObj.outDF.tail()

Unnamed: 0,lat,lon,Address
1195,43.233299,-70.911079,
1196,43.233601,-70.911301,
1197,43.233299,-70.910698,
1198,43.233398,-70.911003,
1199,43.233299,-70.910713,


In [21]:
print(testObj)  ## final look at settings for this object after process is complete

Global Settings for this object: 
endRow: 5
delay:  0
statusMsgGrouping: 10
Length of outDF: 1200
nextIndex: 40
rtn_null: True
timeout: 10
location (last obtained): None


#### Test Main Logic - Fresh Allotment of Licenses (No Errors Expected)
Note that an error could still occur due to a server timeout, a server being down (on Google's end), or some other unexpected event.  This test was set up to maximize the likelihood of showing what output can look like when no errors occur.  Since at least one error seems to occur in batches of 900 or more, the data is split roughly in half, with the second half added after the first.

In [34]:
### quick clean test with fresh allotment of license records for the day
#   illustrates adding more in later and a run with no errors in it
#   * do 600 initially 
#   * then add in the end of DF

testObj = GMapsLoc_DFBuilder()

In [36]:
print(testObj)

Global Settings for this object: 
endRow: 5
delay:  1
statusMsgGrouping: 100
Length of outDF: 0
nextIndex: None
rtn_null: False
timeout: 10
location (last obtained): 


In [37]:
testObj.buildOutDF(tst_lat_lon_df[0:600])

Timestamp:  2018-06-07 22:15:50
Processing inputDF with length of:  600
Please wait ...
2018-06-07 22:15:50 Now processing index:  0
2018-06-07 22:16:53 Now processing index:  100
2018-06-07 22:17:57 Now processing index:  200
2018-06-07 22:19:02 Now processing index:  300
2018-06-07 22:20:07 Now processing index:  400
2018-06-07 22:21:12 Now processing index:  500
Process complete. 600 records added to outDF.
Timestamp:  2018-06-07 22:22:18


In [38]:
testObj.buildOutDF(tst_lat_lon_df[600:])  ## end of the df added in 
                                          ## in this test, indices between input/output will match
                                          ## since every record was added in using the same sequence

Timestamp:  2018-06-08 03:07:25
Processing inputDF with length of:  560
Please wait ...
2018-06-08 03:07:25 Now processing index:  0
2018-06-08 03:08:29 Now processing index:  100
2018-06-08 03:09:33 Now processing index:  200
2018-06-08 03:10:36 Now processing index:  300
2018-06-08 03:11:40 Now processing index:  400
2018-06-08 03:12:44 Now processing index:  500
Process complete. 560 records added to outDF.
Timestamp:  2018-06-08 03:13:22


In [39]:
tst_lat_lon_df.tail() ## final records in the input

Unnamed: 0,lat,lon
1155,43.233299,-70.911079
1156,43.233601,-70.911301
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


In [40]:
testObj.outDF.tail()  ## final records in the output

Unnamed: 0,lat,lon,Address
1155,43.233299,-70.911079,"159, Long Hill Road, Dover, Strafford County, ..."
1156,43.233601,-70.911301,"159, Long Hill Road, Dover, Strafford County, ..."
1157,43.233299,-70.910698,"155, Long Hill Road, Dover, Strafford County, ..."
1158,43.233398,-70.911003,"158, Long Hill Road, Dover, Strafford County, ..."
1159,43.233299,-70.910713,"155, Long Hill Road, Dover, Strafford County, ..."


#### Experiment in Cleaning Up Results
This test was run with a fresh allotment of Google license records for the day.  It should have completed without error, but the server went down, producing 5 error records.  This test shows what to do in this scenario.

In [12]:
## do test with testObjDocs - resetting it to blank to start fresh
testObjDocs = GMapsLoc_DFBuilder()
print(testObjDocs)

Global Settings for this object: 
endRow: 5
delay:  1
statusMsgGrouping: 100
Length of outDF: 0
nextIndex: None
rtn_null: False
timeout: 10
location (last obtained): 


In [13]:
testObjDocs.buildOutDF(tst_lat_lon_df)

Timestamp:  2018-06-07 14:35:06
Processing inputDF with length of:  1160
Please wait ...
2018-06-07 14:35:06 Now processing index:  0
2018-06-07 14:36:09 Now processing index:  100
2018-06-07 14:37:13 Now processing index:  200
2018-06-07 14:38:16 Now processing index:  300
2018-06-07 14:39:20 Now processing index:  400
2018-06-07 14:40:25 Now processing index:  500
2018-06-07 14:41:29 Now processing index:  600
2018-06-07 14:42:36 Now processing index:  700
2018-06-07 14:43:41 Now processing index:  800
2018-06-07 14:44:45 Now processing index:  900
<class 'geopy.exc.GeocoderServiceError'>
[Errno 50] Network is down
<class 'geopy.exc.GeocoderServiceError'>
[Errno 50] Network is down
<class 'geopy.exc.GeocoderServiceError'>
[Errno 50] Network is down
<class 'geopy.exc.GeocoderServiceError'>
[Errno 50] Network is down
<class 'geopy.exc.GeocoderServiceError'>
[Errno 50] Network is down
2018-06-07 14:51:07 Now processing index:  1000
2018-06-07 14:52:18 Now processing index:  1100
Process

In [18]:
testObjDocs.outDF[975:1000]  ## spot check run on batches of 25 records to find the bad ones (between 900 and 1000)
                             ## bad records found here

Unnamed: 0,lat,lon,Address
975,43.00182,-76.33844,"2418, Falls Road, Marcellus, Town of Marcellus..."
976,43.015862,-76.166779,"301, Hutchinson Avenue, Elmwood, Syracuse, Ono..."
977,43.030029,-76.282669,"270, Emann Drive, Parson Farms, Town of Camill..."
978,43.099602,-76.162804,"321, Salina Meadows Parkway, The Meadows, Pitc..."
979,40.766701,-73.516502,"65, Clinton Street, Hicksville, Nassau County,..."
980,40.619202,-73.965599,"1222, East 10th Street, Midwood, BK, Kings Cou..."
981,40.799599,-73.970299,"875, West End Avenue, Upper West Side, Manhatt..."
982,40.76749,-73.964752,"121, East 67th Street, Lenox Hill, Manhattan C..."
983,41.145699,-73.833801,"Shawdow Tree Lane, Briarcliff Manor, Town of O..."
984,40.796982,-73.964752,"850, Columbus Avenue, Frederick Douglass House..."


In [20]:
len(testObjDocs.outDF)  # current length of DF

1160

In [22]:
## our first run: index of input will be same as index on output
## test showing that records on input match the problem range in output

tst_lat_lon_df[985:990]

Unnamed: 0,lat,lon
985,43.227901,-77.309402
986,43.246601,-77.183197
987,43.2001,-77.038498
988,42.472801,-73.274399
989,42.929901,-76.563698


In [23]:
testObjDocs.buildOutDF(tst_lat_lon_df[985:990])

Timestamp:  2018-06-07 15:03:45
Processing inputDF with length of:  5
Please wait ...
2018-06-07 15:03:45 Now processing index:  0
Process complete. 5 records added to outDF.
Timestamp:  2018-06-07 15:03:48


In [24]:
testObjDocs.outDF.tail(10)  ## new records on the end ... still need to delete the bad ones

Unnamed: 0,lat,lon,Address
1155,43.233299,-70.911079,"159, Long Hill Road, Dover, Strafford County, ..."
1156,43.233601,-70.911301,"159, Long Hill Road, Dover, Strafford County, ..."
1157,43.233299,-70.910698,"155, Long Hill Road, Dover, Strafford County, ..."
1158,43.233398,-70.911003,"158, Long Hill Road, Dover, Strafford County, ..."
1159,43.233299,-70.910713,"155, Long Hill Road, Dover, Strafford County, ..."
1160,43.227901,-77.309402,"1429, NY 104, Williamson Town, Wayne County, N..."
1161,43.246601,-77.183197,"6751, Pound Road, Williamson, Williamson Town,..."
1162,43.2001,-77.038498,"Sodus Center Road, Sodus Center, Sodus Town, W..."
1163,42.472801,-73.274399,"Ramsey Beach, Lakeway Drive, Pittsfield, Berks..."
1164,42.929901,-76.563698,"84, School Street, Auburn, Cayuga County, New ..."


In [27]:
testObjDocs.outDF.drop(testObjDocs.outDF.index[985:990], inplace=True)

In [28]:
testObjDocs.outDF[984:991]  ## as expected - bad rows dropped but we now have an indexing issue

Unnamed: 0,lat,lon,Address
984,40.796982,-73.964752,"850, Columbus Avenue, Frederick Douglass House..."
990,43.22591,-77.137512,"Richardson Road, Sodus Town, Wayne County, New..."
991,43.228649,-77.153687,"Ridge Road, East Williamson, Williamson Town, ..."
992,42.849998,-73.800003,"Roundabout forested area, Casablanca Court, Fl..."
993,43.2439,-77.164597,"Bear Swamp Road, Williamson, Williamson Town, ..."
994,43.226021,-77.137001,"Richardson Road, Sodus Town, Wayne County, New..."
995,43.159,-77.614998,"150 State St, 150, State Street, Rochester, Mo..."


In [29]:
testObjDocs.outDF.tail()

Unnamed: 0,lat,lon,Address
1160,43.227901,-77.309402,"1429, NY 104, Williamson Town, Wayne County, N..."
1161,43.246601,-77.183197,"6751, Pound Road, Williamson, Williamson Town,..."
1162,43.2001,-77.038498,"Sodus Center Road, Sodus Center, Sodus Town, W..."
1163,42.472801,-73.274399,"Ramsey Beach, Lakeway Drive, Pittsfield, Berks..."
1164,42.929901,-76.563698,"84, School Street, Auburn, Cayuga County, New ..."


In [30]:
## fix index:
testObjDocs.reindex_OutDF()
testObjDocs.outDF[984:991]

Unnamed: 0,lat,lon,Address
984,40.796982,-73.964752,"850, Columbus Avenue, Frederick Douglass House..."
985,43.22591,-77.137512,"Richardson Road, Sodus Town, Wayne County, New..."
986,43.228649,-77.153687,"Ridge Road, East Williamson, Williamson Town, ..."
987,42.849998,-73.800003,"Roundabout forested area, Casablanca Court, Fl..."
988,43.2439,-77.164597,"Bear Swamp Road, Williamson, Williamson Town, ..."
989,43.226021,-77.137001,"Richardson Road, Sodus Town, Wayne County, New..."
990,43.159,-77.614998,"150 State St, 150, State Street, Rochester, Mo..."


In [31]:
testObjDocs.outDF.tail()  ## note: all records are in here now but indices will be different from input DF

Unnamed: 0,lat,lon,Address
1155,43.227901,-77.309402,"1429, NY 104, Williamson Town, Wayne County, N..."
1156,43.246601,-77.183197,"6751, Pound Road, Williamson, Williamson Town,..."
1157,43.2001,-77.038498,"Sodus Center Road, Sodus Center, Sodus Town, W..."
1158,42.472801,-73.274399,"Ramsey Beach, Lakeway Drive, Pittsfield, Berks..."
1159,42.929901,-76.563698,"84, School Street, Auburn, Cayuga County, New ..."
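
The drop-and-reindex cleanup above leaves the retried rows at the end of outDF, so its indices no longer match the input. One possible alternative, sketched here under assumptions, overwrites the failed rows in place. `fake_lookup`, the `_SOME_ERROR_` marker, and the sample data are hypothetical placeholders; in the notebook the lookup would be a call such as `getGeoAddr()`.

```python
import pandas as pd


def fake_lookup(lat, lon):
    """Hypothetical stand-in for a successful reverse-geocoding call."""
    return "address for ({}, {})".format(lat, lon)


# Hypothetical miniature of outDF with two failed rows in the middle.
out_df = pd.DataFrame({
    "lat": [1.0, 2.0, 3.0, 4.0],
    "lon": [10.0, 20.0, 30.0, 40.0],
    "Address": ["addr A", "_SOME_ERROR_", "_SOME_ERROR_", "addr D"],
})

# Boolean mask of rows that hold an error marker.
failed = out_df["Address"].str.startswith("_")

# Retry only those rows and write the results back at the same index labels.
out_df.loc[failed, "Address"] = out_df[failed].apply(
    lambda r: fake_lookup(r.lat, r.lon), axis=1
)

print(out_df.index.tolist())   # [0, 1, 2, 3]
print(out_df["Address"].str.startswith("_").any())
```

Because nothing is dropped or appended, no reindexing is needed afterward and row positions still line up with the input DF.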


#### Testing of Enhanced Print() and set_ Functions
The object's print output now handles `location` in different ways: it shows `location.raw` when possible, and it knows what to do if `location` is `None` or an empty string.  This section also tests the constructor parameters for the first time, along with the new `set_` functions.

In [55]:
gMapAddrDat = GMapsLoc_DFBuilder(endRw=4, time_delay=3, return_null=True)

In [56]:
gMapAddrDat.set_statusMsgGrouping(12)

Global Settings for this object: 
endRow: 4
delay:  3
statusMsgGrouping: 12
Length of outDF: 0
nextIndex: None
rtn_null: True
timeout: 10
location (last obtained): 


In [57]:
gMapAddrDat.set_endRow_OutDf_caching(3)

Global Settings for this object: 
endRow: 3
delay:  3
statusMsgGrouping: 12
Length of outDF: 0
nextIndex: None
rtn_null: True
timeout: 10
location (last obtained): 


In [58]:
gMapAddrDat.set_timeDelay(0)

Global Settings for this object: 
endRow: 3
delay:  0
statusMsgGrouping: 12
Length of outDF: 0
nextIndex: None
rtn_null: True
timeout: 10
location (last obtained): 


In [59]:
gMapAddrDat.set_ServerTimeout(9)

Global Settings for this object: 
endRow: 3
delay:  0
statusMsgGrouping: 12
Length of outDF: 0
nextIndex: None
rtn_null: True
timeout: 9
location (last obtained): 


In [60]:
gMapAddrDat.buildOutDF(tst_lat_lon_df[0:50])

Timestamp:  2018-06-06 14:18:25
Processing inputDF with length of:  50
Please wait ...
2018-06-06 14:18:25 Now processing index:  0
2018-06-06 14:18:30 Now processing index:  12
2018-06-06 14:18:36 Now processing index:  24
2018-06-06 14:18:41 Now processing index:  36
2018-06-06 14:18:46 Now processing index:  48
Process complete. 50 records added to outDF.
Timestamp:  2018-06-06 14:18:47
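
The status lines above appear at 0, 12, 24, 36, and 48 because the loop index advances by endRow and a message prints only when that index is an exact multiple of statusMsgGrouping. The hypothetical helper `status_indices` below replays just that part of the loop to show what happens when endRow does not divide the grouping evenly.

```python
def status_indices(n_rows, end_row, status_grouping):
    """Indices at which buildOutDF()'s loop would print a status line."""
    printed = []
    i = 0
    while i < n_rows:
        if i % status_grouping == 0:
            printed.append(i)
        i = min(i + end_row, n_rows)   # same clamping as endIndx in the loop
    return printed


print(status_indices(50, end_row=3, status_grouping=12))   # [0, 12, 24, 36, 48]
print(status_indices(50, end_row=4, status_grouping=10))   # [0, 20, 40]
```

In the second call the index steps through multiples of 4, so it lands on a multiple of 10 only every 20 rows.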


In [61]:
print(gMapAddrDat)

Global Settings for this object: 
endRow: 3
delay:  0
statusMsgGrouping: 12
Length of outDF: 50
nextIndex: 50
rtn_null: True
timeout: 9
location (last obtained): {'place_id': '128659718', 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': '250265211', 'lat': '40.72450395', 'lon': '-73.9803511341422', 'display_name': '200, East 7th Street, Alphabet City, Manhattan Community Board 3, New York County, NYC, New York, 10009, United States of America', 'address': {'house_number': '200', 'road': 'East 7th Street', 'neighbourhood': 'Alphabet City', 'city': 'NYC', 'county': 'New York County', 'state': 'New York', 'postcode': '10009', 'country': 'United States of America', 'country_code': 'us'}, 'boundingbox': ['40.7243903', '40.7246029', '-73.980457', '-73.980246']}
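
The keys inside `raw['address']` vary by location (this record has `house_number` and `city`, while the cemetery record shown later in this notebook has `grave_yard` and `locality` instead), so code that picks out individual components should tolerate missing keys. A minimal sketch, with `pick_address_fields` as a hypothetical helper and the two dictionaries abbreviated from the outputs in this notebook:

```python
def pick_address_fields(raw, keys=("road", "city", "state", "postcode")):
    """Return selected address components, using None for keys that are absent."""
    address = raw.get("address", {})
    return {k: address.get(k) for k in keys}


# Abbreviated copies of two `raw` records shown in this notebook.
manhattan = {"address": {"house_number": "200", "road": "East 7th Street",
                         "city": "NYC", "state": "New York", "postcode": "10009"}}
elmont = {"address": {"grave_yard": "Beth David Cemetery", "road": "Bethel Avenue",
                      "locality": "Elmont", "state": "New York", "postcode": "11581"}}

print(pick_address_fields(manhattan))
print(pick_address_fields(elmont))    # 'city' comes back as None here
```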


In [62]:
gMapAddrDat.location = ""

In [63]:
print(gMapAddrDat)

Global Settings for this object: 
endRow: 3
delay:  0
statusMsgGrouping: 12
Length of outDF: 50
nextIndex: 50
rtn_null: True
timeout: 9
location (last obtained): 


#### Test of Other Internal Functions
These functions can be called directly, either for testing or to look up a single value.  The examples below show typical calls.

In [70]:
gMapAddrDat.testConnection()  ## uses default test record to just ensure connection is working

'Tour Eiffel, 5, Avenue Anatole France, Gros-Caillou, 7e, Paris, Île-de-France, France métropolitaine, 75007, France'

In [66]:
gMapAddrDat.getGeoAddr(40.699100, -73.703697, test=True)  ## function called by buildOutDF()
                                                          ## use test mode to obtain more information

Full Location Data Returned:

Beth David Cemetery, Bethel Avenue, Elmont, Nassau County, New York, 11581, United States of America


Location(Beth David Cemetery, Bethel Avenue, Elmont, Nassau County, New York, 11581, United States of America, (40.6987137, -73.7042976, 0.0))

In [68]:
tstLoc1 = gMapAddrDat.getGeoAddr(40.699100, -73.703697, test=True)  ## use .raw on output during testing
print(type(tstLoc1))                                                ## to view JSON structure of Location obj
tstLoc1.raw

Full Location Data Returned:

Beth David Cemetery, Bethel Avenue, Elmont, Nassau County, New York, 11581, United States of America
<class 'geopy.location.Location'>


{'address': {'country': 'United States of America',
  'country_code': 'us',
  'county': 'Nassau County',
  'grave_yard': 'Beth David Cemetery',
  'locality': 'Elmont',
  'postcode': '11581',
  'road': 'Bethel Avenue',
  'state': 'New York'},
 'boundingbox': ['40.6986137', '40.6988137', '-73.7043976', '-73.7041976'],
 'display_name': 'Beth David Cemetery, Bethel Avenue, Elmont, Nassau County, New York, 11581, United States of America',
 'lat': '40.6987137',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'lon': '-73.7042976',
 'osm_id': '357548338',
 'osm_type': 'node',
 'place_id': '2899917'}

In [69]:
tstLoc1 = gMapAddrDat.getGeoAddr(40.699100, -73.703697)  ## default: test=False
print(type(tstLoc1))                                     ## when called internally to build the address field
tstLoc1                                                  ## for outDF, it just returns an address string

<class 'str'>


'Beth David Cemetery, Bethel Avenue, Elmont, Nassau County, New York, 11581, United States of America'
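
The `.raw` dictionary shown above can also be mined for individual fields when the full address string is more than is needed.  The following is a minimal sketch, assuming a dictionary shaped like the output of `tstLoc1.raw`; the helper name `pick_address_fields` is hypothetical and is not part of `GMapsLoc_DFBuilder`.

```python
# Hypothetical helper (not part of GMapsLoc_DFBuilder): pull selected keys
# out of the nested 'address' dict found in a Location.raw result.
def pick_address_fields(raw, fields=("road", "state", "postcode")):
    address = raw.get("address", {})
    # Missing keys become empty strings so every record has the same shape.
    return {field: address.get(field, "") for field in fields}


# Trimmed copy of the raw output shown earlier in this notebook.
sample_raw = {
    "address": {
        "country": "United States of America",
        "country_code": "us",
        "county": "Nassau County",
        "postcode": "11581",
        "road": "Bethel Avenue",
        "state": "New York",
    },
    "lat": "40.6987137",
    "lon": "-73.7042976",
}

picked = pick_address_fields(sample_raw)
print(picked)
```

Returning empty strings for missing keys is a design choice made for this sketch; it keeps the result easy to place into DataFrame columns.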

### Documentation Tests

In [25]:
# create new object to test the docstrings and some more quick coding tweaks
testObjDocs = GMapsLoc_DFBuilder()
print(testObjDocs)

Global Settings for this object: 
endRow: 5
delay:  1
statusMsgGrouping: 100
Length of outDF: 0
nextIndex: None
rtn_null: False
timeout: 10
location (last obtained): 


In [26]:
help(testObjDocs)

Help on GMapsLoc_DFBuilder in module __main__ object:

class GMapsLoc_DFBuilder(DFBuilder)
 |  This class inherits DFBuilder.buildOutDF() which makes use of data extraction and modification functions in
 |  this subclass.  endRw sets number of rows to process at a time while building outDF (default=5). time_delay 
 |  can set the time delay between loop iterations to help prevent licensing issues and related server timeouts.
 |  Default is 1 second. Initialization arguments: endRw, time_delay, return_null.
 |   * endRw controls grouping: process endRow rows at a time and add to outDF (default is 5).
 |   * time_delay has default of 1 second and sets how much time to wait each request while building outDF.
 |   * return_null, if False, records error text formatted as "_<errTxt>_" for records that failed to process.
 |     Set to True to have it return blank records when errors occur instead (default is False).
 |  
 |  Method resolution order:
 |      GMapsLoc_DFBuilder
 |      DFBuilde

In [27]:
print(testObjDocs.__doc__)  # note: formatting is messed up if you do not use print() on the doc string

This class inherits DFBuilder.buildOutDF() which makes use of data extraction and modification functions in
this subclass.  endRw sets number of rows to process at a time while building outDF (default=5). time_delay 
can set the time delay between loop iterations to help prevent licensing issues and related server timeouts.
Default is 1 second. Initialization arguments: endRw, time_delay, return_null.
 * endRw controls grouping: process endRow rows at a time and add to outDF (default is 5).
 * time_delay has default of 1 second and sets how much time to wait each request while building outDF.
 * return_null, if False, records error text formatted as "_<errTxt>_" for records that failed to process.
   Set to True to have it return blank records when errors occur instead (default is False).
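
The `return_null` behaviour described in this docstring can be pictured as follows.  This is only a sketch under stated assumptions: `format_failure` is a hypothetical helper, and the real class may build its error text differently.

```python
# Hypothetical helper illustrating the two outcomes described for return_null.
def format_failure(err, return_null=False):
    """Return the value to store for a record that failed to process."""
    if return_null:
        return ""                 # blank record, as with return_null=True
    return "_{}_".format(err)     # error text wrapped as "_<errTxt>_"


try:
    raise TimeoutError("Service timed out")  # stand-in for a failed API call
except TimeoutError as exc:
    kept = format_failure(exc)
    blank = format_failure(exc, return_null=True)

print(repr(kept))
print(repr(blank))
```

Keeping the error text makes failed rows easy to find later, while blank records suit output that will be used without further cleanup.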


In [28]:
print(testObjDocs.buildOutDF.__doc__) # buildOutDF

Scans inputDF self.endRow rows (default of 5) at a time.  It then calls in logic
from _modifyTempDF()_ to make changes to each subset of rows and appends tiny tempDF onto an outDF.  When the 
subclass is using a web API, self.delay tells it how much time to delay each iteration of the loop. Should
this function fail in the middle, outDF will have all work up to the failure.  
This can be saved out to a DF or csv.  The function can be run again on a subset of the data 
(the records not encountered yet before the failure).
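
The looping and caching behaviour that this docstring describes can be sketched in a few lines of pandas.  This is a simplified illustration, not the class's actual code: `ChunkedBuilder`, `FlakyDoubler`, `modify_chunk`, `step`, and `fail_at` are hypothetical names, and the status messages and error handling of the real method are omitted.

```python
import time

import pandas as pd


class ChunkedBuilder:
    """Hypothetical stand-in for the abstract class: loops and caches."""

    def __init__(self, step=5, delay=0.0):
        self.step = step      # rows to process per iteration
        self.delay = delay    # seconds to pause between iterations
        self.out_df = None    # results accumulated so far

    def modify_chunk(self, chunk):
        raise NotImplementedError  # subclasses decide what to do to the rows

    def build(self, input_df):
        for start in range(0, len(input_df), self.step):
            chunk = input_df.iloc[start:start + self.step].copy()
            modified = self.modify_chunk(chunk)  # may raise part-way through
            # Stored on the object, so earlier chunks survive a later failure.
            pieces = [modified] if self.out_df is None else [self.out_df, modified]
            self.out_df = pd.concat(pieces)
            time.sleep(self.delay)
        return self.out_df


class FlakyDoubler(ChunkedBuilder):
    """Adds a column, and fails once at a chosen index to mimic a timeout."""

    def __init__(self, fail_at=None, **kwargs):
        super().__init__(**kwargs)
        self.fail_at = fail_at

    def modify_chunk(self, chunk):
        if self.fail_at is not None and self.fail_at in chunk.index:
            self.fail_at = None  # fail only the first time
            raise TimeoutError("simulated server timeout")
        chunk["doubled"] = chunk["value"] * 2
        return chunk


source = pd.DataFrame({"value": range(10)})
builder = FlakyDoubler(fail_at=7, step=3)
try:
    builder.build(source)
except TimeoutError:
    pass                            # work done before the failure is kept
saved = len(builder.out_df)         # rows cached before the failure
builder.build(source.iloc[saved:])  # rerun on the rows not yet processed
result = builder.out_df
print(saved, len(result))
```

In this sketch the resume point is taken from the length of the cached output; the object tested in this notebook reports a `nextIndex` value that could serve the same purpose.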


In [29]:
help(DFBuilder)

Help on class DFBuilder in module __main__:

class DFBuilder(builtins.object)
 |  DataFrame Builder abstract class.  Sets up logic to be inherited by objects that need to loop over a DataFrame
 |  and cache the results.  Original use case involves making API calls to the web which can get interrupted by 
 |  errors and server timeouts.  This object stores all the logic to build up and save a DataFrame a small number of 
 |  records at a time.  Then a subclass can define an abstract method in the base class as to what we want to do to 
 |  the input data.  Original use case added in content extracted from the web to a new column.  But subclasses can 
 |  be built to do more. Initialization arguments: endRw, time_delay. endRw = number of records to cache at a time
 |  when building outDF.  time_delay is number of seconds delay between each cycle of the loop that builds outDF.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, endRw, time_delay)
 |      Initialize self.  See help(type(

### Earlier Testing
Additional testing was performed to get to this point.  Those early tests are not included in this notebook.