<div align="right">Python 3.6</div>

## Testing The Abstract Base Class - Earlier Version of the Code

This notebook was created because it is easier to test out the core logic separately from the logic that makes web API calls and extracts data from Google Maps.  The intent was to test and debug as much of the logic (and/or similar logic) as might be needed ahead of bringing it all together in the final subclass.

Note that code and test results may differ from the final implementation.  Some issues were corrected, and the code was enhanced, during testing of the subclass that interacts with Google Maps (a different notebook).

### Enrich or Change Larger Dataframe Section by Section

The purpose of the <font color=blue><b>DFBuilder</b></font> object is to allow scanning of a larger dataframe, a small number of rows at a time.  It then allows code to be customized to make changes and build up a new dataframe from the results.  By design, the operation runs in a standard loop. The original use case was to add a field with data accessed from an API off the web, and time delays were necessary (as well as other logic) to prevent (or at least reduce the risk of) server timeouts during operation.

Scanning through the source a few lines at a time, performing the operation, and adding the results to the target DF creates a "caching effect": data is saved along the way, so in the event of a server timeout all is not lost.  The resulting DF can then be saved out to a file, the code modified, and a re-run can pick up the process where it left off instead of having to start over again.
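
The chunk-and-accumulate idea described above can be sketched on its own, without the class defined later in this notebook. This is only an illustration under stated assumptions: `scan_in_chunks()`, `add_flag()`, and the `fail_at` parameter are hypothetical names, and the failure is simulated rather than coming from a real server.

```python
import pandas as pd

def scan_in_chunks(source, chunk_size, enrich, done=None, fail_at=None):
    """Process `source` chunk_size rows at a time, collecting results in a list.

    `done` holds chunks finished by an earlier run; `fail_at` (hypothetical,
    for illustration only) raises an error at that row offset to mimic a timeout.
    """
    done = [] if done is None else done
    start = sum(len(chunk) for chunk in done)   # resume after rows already processed
    for i in range(start, len(source), chunk_size):
        if fail_at is not None and i >= fail_at:
            raise TimeoutError("simulated server timeout at row %d" % i)
        done.append(enrich(source.iloc[i:i + chunk_size].copy()))
    return done

def add_flag(chunk):
    # stand-in for the real enrichment step
    return chunk.assign(seen=True)

source = pd.DataFrame({"lat": [40.1, 40.2, 40.3, 40.4, 40.5, 40.6, 40.7]})

finished = []
try:
    scan_in_chunks(source, 3, add_flag, done=finished, fail_at=3)
except TimeoutError:
    pass                                   # the first chunk is still held in `finished`

rows_saved = sum(len(c) for c in finished)
scan_in_chunks(source, 3, add_flag, done=finished)   # second run picks up at row 3
result = pd.concat(finished, ignore_index=True)
```

Because the finished chunks live outside the function, a failure partway through leaves the earlier work intact, and the second call only processes the rows that were not reached.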

These tests use a subclass that will never be used in the real world and which does not communicate with the web.  The goal of these tests was to shake out problems with the other logic ahead of testing with web interaction.

In [1]:
who

Interactive namespace is empty.


### Libraries Needed
Import statements included in this notebook are for the main abstract object and a test object.

In [2]:
import pandas as pd
import time

In [3]:
## this entire cell may not be needed for this test but will be needed for the next test notebook of final objects
import os

## for larger data and/or many requests in one day - get Google API key and use these lines:
# os.environ["GOOGLE_API_KEY"] = "YOUR_GOOGLE_API_Key"
## for better security (PROD environments) - install key to server and use just this line to load it:
# os.environ.get('GOOGLE_API_KEY')

# set up geocode
from geopy.geocoders import Nominatim
geolocator = Nominatim()
from geopy.exc import GeocoderTimedOut


### Test Data
Input Data Set up Here

In [4]:
## Test code on a reasonably small DF
tst_lat_lon_df = pd.read_csv("testset_unique_lat_and_lon_vals.csv", index_col=0)

In [5]:
tst_lat_lon_df.describe()

Unnamed: 0,lat,lon
count,1160.0,1160.0
mean,41.232457,-74.04298
std,1.162332,1.463141
min,39.390049,-78.366203
25%,40.269619,-74.651615
50%,40.74292,-74.14307
75%,42.3616,-73.752644
max,44.950298,-70.187302


In [6]:
tst_lat_lon_df.tail()

Unnamed: 0,lat,lon
1155,43.233299,-70.911079
1156,43.233601,-70.911301
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


In [7]:
## Create smaller random sample from above DF for further testing

tst_lat_lon_df_sample = tst_lat_lon_df.sample(frac=0.1).copy(deep=True)
    # frac=0.1 for 10% or use n=100 to get 100 records
    # this variant seemed to create trouble with indexing of the DF in buildOutDF():
    #      tst_lat_lon_df.copy(deep=True).sample(frac=0.1)
    # also: options on reset_index given in next cell were needed as part of the fix
len(tst_lat_lon_df_sample)

116

In [8]:
tst_lat_lon_df_sample.reset_index(drop=True, inplace=True)  
tst_lat_lon_df_sample.iloc[[24,25,67]]  ## attempt to fix index and show 3 rows that will be manipulated for testing

Unnamed: 0,lat,lon
24,40.253269,-74.651611
25,40.633709,-74.407356
67,40.1259,-75.06131
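
The indexing issue mentioned in the comments above comes down to the fact that `sample()` keeps the original row labels. A minimal sketch on a made-up frame (the data and `random_state=0` are assumptions chosen only to make the example repeatable) shows what `reset_index(drop=True)` changes:

```python
import pandas as pd

df = pd.DataFrame({"lat": [40.0 + i for i in range(20)]})

# sample() keeps the original row labels, so they are generally out of order
# and non-contiguous in the sampled frame
sample = df.sample(frac=0.25, random_state=0).copy(deep=True)
labels_before = list(sample.index)

# reset_index(drop=True) relabels the rows 0..n-1 and discards the old labels
sample.reset_index(drop=True, inplace=True)
labels_after = list(sample.index)
```

With a clean 0..n-1 index, positional slices such as `sample[i:endIndx]` and label-based lookups refer to the same rows, which makes the later output easier to check by eye.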


In [9]:
# sample: sub_df.iloc[0]['A']
# creating some missing values for testing of error handling in the code
# note:  tried tst_lat_lon_df_sample.iloc[67]['lat'] = "" but it seems the pandas dataframe "protects itself"
#        the change failed to occur unless setting the numeric field to None
#        see notes on roundValue() function later in this document for similar pandas related behaviors

tst_lat_lon_df_sample.iloc[67]['lat'] = None
tst_lat_lon_df_sample.iloc[67]['lon'] = None
tst_lat_lon_df_sample.iloc[24]['lat'] = None
tst_lat_lon_df_sample.iloc[25]['lon'] = None
tst_lat_lon_df_sample.iloc[[24,25,67]]

Unnamed: 0,lat,lon
24,,-74.651611
25,40.633709,
67,,
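
The assignments above use chained indexing (`.iloc[67]['lat'] = ...`). pandas does not guarantee that this form writes back to the original frame; whether it does can depend on the pandas version and on the column dtypes. A minimal sketch of the same manipulation with a single `.loc` call follows. The small frame and its values are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"lat": [40.25, 40.63, 40.12], "lon": [-74.65, -74.40, -75.06]})

# one indexing call with both the row label and the column name writes to the frame itself
demo.loc[0, "lat"] = np.nan
demo.loc[1, "lon"] = np.nan
demo.loc[2, ["lat", "lon"]] = np.nan

missing_per_column = demo.isna().sum()
```

Using `np.nan` keeps both columns numeric, which matches the NaN values shown in the output above.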


### Code Testing

The abstract class that follows is intended to be the "workhorse" of this code.  The intent is to get the developer to the point where all they need to think about is what their final subclass will do to enrich the data.  The parent class sets up a loop that extracts a small number of rows from a larger input DF, operates on them in a temp DF, and then adds them to an output DF.  If something interrupts the process (a common event when dealing with web APIs), the modified rows created before the incident are waiting in the output DF and can be extracted.  The code can then be restarted or continued to build up the rest of the DataFrame without losing previous work or having to go all the way back to the beginning.

This test notebook sets up a subclass that will never be used in the real world.  There are more efficient ways to modify a DF with the example selected for this test.  The test's intent is simply to show that most of the core logic works before we test a subclass that is slower and more involved because it actually makes calls to a web API.

In [10]:
who

GeocoderTimedOut	 Nominatim	 geolocator	 os	 pd	 time	 tst_lat_lon_df	 tst_lat_lon_df_sample	 


In [82]:
from abc import ABCMeta, abstractmethod
import time                 # buildOutDF() calls time.sleep()
import pandas as pd

class DFBuilder(object, metaclass=ABCMeta):       # sets up abstract class
    def __init__(self,endRw,time_delay):          # abstract classes can be subclassed
        self.endRow=endRw                         # but cannot be instantiated
        self.delay=time_delay
        self.tmpDF=pd.DataFrame()   # temp DF will be endRow rows in length
        self.outDF=pd.DataFrame()   # final DF build in sets of endRow rows so all is not lost in a failure
        self.lastIndex = None
        # self.start=0
        
    def __str__(self):
        return ("Global Settings for this object: \n" +  
               "endRow: " + str(self.endRow) + "\n" + 
               "delay:  " + str(self.delay) + "\n" + 
               "Length of outDF: " + str(len(self.outDF)) + "\n" +
               "nextIndex: " + str(self.lastIndex))       # if continuing with last added table - index of next rec.
        
    @abstractmethod                               # abstract method definition in Python
    def _modifyTempDF_(self): pass                # This method will operate on TempDF inside the loop
    
    def buildOutDF(self, inputDF):
        '''Scans inputDF using self.endRow rows (default of 5) at a time to do it.  It then calls in logic
from _modifyTempDF()_ to make changes to each subset of rows and appends tiny tempDF onto an outDF.  When the 
subclass is using a web API, self.time_delay tells it how much time to delay each iteration of the loop.  All
parameters are set during initialization of the object.  Should this function fail in the middle, outDF will
have all work up to the failure.  This can be saved out to a DF or csv.  The function can be run again on
a subset of the data (the records not encountered yet before the failure).'''
    
        lenDF = len(inputDF)
        print("Processing inputDF of length: ", lenDF)
        endIndx = 0

        i = 0
        while i < lenDF:
            # print("i: ", i)
            endIndx = i + self.endRow
            if endIndx > lenDF:
                endIndx = lenDF

            # print("Range to use: ", i, ":", endIndx)
            self.tmpDF = inputDF[i:endIndx].copy(deep=True)
            self._modifyTempDF_()
            time.sleep(self.delay)
            self.outDF = pd.concat([self.outDF, self.tmpDF])   # DataFrame.append() was removed in pandas 2.0
            self.lastIndex = endIndx            
            i = endIndx
            # print("i at end of loop: ", i)        
          
        self.reindex_OutDF()
        
    def reindex_OutDF(self):
        self.outDF.reset_index(drop=True, inplace=True)
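
The comments in the class above note that an abstract class can be subclassed but not instantiated. A minimal sketch of that behavior, using hypothetical stand-in classes `Base` and `Concrete` rather than the notebook's own classes:

```python
from abc import ABCMeta, abstractmethod

class Base(metaclass=ABCMeta):            # hypothetical stand-in for DFBuilder
    @abstractmethod
    def _modify_(self): pass

class Concrete(Base):                     # supplies the abstract method
    def _modify_(self):
        return "modified"

try:
    Base()                                # abstract class: instantiation is refused
    base_error = None
except TypeError as err:
    base_error = err

concrete_result = Concrete()._modify_()
```

The same rule applies to a subclass that forgets to define the abstract method: it stays abstract and raises the same `TypeError` on instantiation.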

### Test Sub Class
Before creating, testing, and debugging a subclass that uses a Google maps API to enrich the data, it is desirable to start simpler.  This test object makes use of all of the same logic except for the API calls out to the web to get new data.  For this test, the "delay" will be set to 0 since it is not needed.  This code just shows that we can loop through an original DF, copy rows 5 at a time to a temporary DF, add columns to them using logic that looks at existing rows, and output to our output DF.

Stopping the code in the middle may allow us to test what happens if the code halts, showing what is stored in the dataframe in the object when this happens.

In [83]:
class TstModification_DFBuilder(DFBuilder): 
    '''Test of ability to scan a dataframe x rows at a time and add data columns to it.
There are more efficient ways to round cols in a DF; this object is a test of base logic from the abstract class
ahead of creating a more complex subclass that interacts with the web during the loop.  It builds a copy of the 
DF a small number of rows at a time and creates some new fields as it does so.  Input DF must have "lat" and 
"lon" cols. lat=Latitude / lon = Longitude. Defaults set delay to 0 seconds and rows processed at a time to 
5 for this test.'''
    
    def __init__(self, endRw=5,time_delay=0):           
        super().__init__(endRw,time_delay)
        
    def roundValue(self, value, dec_places=4, rtn_null=False):
        '''Takes arguments: value, dec_places. Rounds value to dec_places specified (if not specified, default=4.)
rtn_null defaults to False.  If True, error handling should result in an empty string being returned.   
If false __ErrType__ should be returned to help with debugging code and data by distinguishing why there is  
no rounded answer returned.  Testing shows that while using round() throws errors if input is not a number, 
applying it to a dataframe does not.  Try-except code left in for future research but does not seem to ever 
get triggered as of this writing.'''
        
        try:
            rtnVal = round(value, dec_places)
        except TypeError as terr:
            print(type(terr))
            print(terr)
            rtnVal = "__NAN__"
        except Exception as eerr:
            print(type(eerr))
            print(eerr)
            rtnVal = "__ERR__"

        # return after the try block: a return inside "finally" would hide unexpected exceptions
        if rtn_null and (rtnVal is None or isinstance(rtnVal, str)):
            return ""
        return rtnVal
        
    def _modifyTempDF_(self, dec_places=4, rtn_null=False):
        '''Create rounded lat and lon columns adding them to tempDF. Defaults round to 4 places and return error
strings in the column if the input value is not a number and cannot be rounded.  Note: error handling for 
roundValue() may never come into play due to interaction of apply/lambda/roundValue with DataFrames.  Attempts
to test this on a dataframe resulted in a dataframe with NaNs in it instead of error text.'''
        self.tmpDF["lat_rnd"] = self.tmpDF.apply(lambda x: self.roundValue(x.lat, dec_places, rtn_null), axis=1)
        self.tmpDF["lon_rnd"] = self.tmpDF.apply(lambda x: self.roundValue(x.lon, dec_places, rtn_null), axis=1)
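
One likely explanation for the behavior noted in the docstrings above: pandas stores a missing numeric value as the float NaN, and Python's `round()` accepts NaN when given a number of decimal places, returning NaN without raising. The `TypeError` branch would only run for values that are not numbers at all, which a float column does not hold. A small check in plain Python (no DataFrame involved):

```python
import math

nan_result = round(float("nan"), 4)      # no exception: the result is simply NaN

try:
    round(None, 4)                       # None has no __round__ method
    none_error = None
except TypeError as err:
    none_error = err

try:
    round("", 4)                         # strings cannot be rounded either
    str_error = None
except TypeError as err:
    str_error = err
```

This is consistent with the NaN values that appear in `lat_rnd` and `lon_rnd` later in the notebook instead of the `__NAN__` error text.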

#### Test Loop Math Issue
This section was created to debug an issue with the loop.  Successful runs of these tests show this problem was corrected.  

In [13]:
# del loopDFTst

In [14]:
# del tstDFbldr

In [15]:
who

ABCMeta	 DFBuilder	 GeocoderTimedOut	 Nominatim	 TstModification_DFBuilder	 abstractmethod	 geolocator	 os	 pd	 
time	 tst_lat_lon_df	 tst_lat_lon_df_sample	 


In [16]:
loopDFTst = TstModification_DFBuilder()

In [17]:
print(loopDFTst)

Global Settings for this object: 
endRow: 5
delay:  0
Length of outDF: 0
nextIndex: None


In [18]:
loopDFTst.buildOutDF(tst_lat_lon_df)

Processing inputDF of length:  1160


In [19]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  1160


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
1155,43.233299,-70.911079,43.2333,-70.9111
1156,43.233601,-70.911301,43.2336,-70.9113
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107


In [20]:
tst_lat_lon_df.iloc[[0,1,2,1157,1158,1159]]  # looking at start and end of source data

Unnamed: 0,lat,lon
0,42.377602,-71.124702
1,42.432098,-71.056099
2,42.249298,-71.074501
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


In [21]:
loopDFTst.outDF.iloc[[0,1,2,1157,1158,1159]] # comparing to output rows

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
0,42.377602,-71.124702,42.3776,-71.1247
1,42.432098,-71.056099,42.4321,-71.0561
2,42.249298,-71.074501,42.2493,-71.0745
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107


In [22]:
# reset the looptest object and check it again with two slices of original DF
del loopDFTst

In [23]:
loopDFTst = TstModification_DFBuilder()
print(loopDFTst)

Global Settings for this object: 
endRow: 5
delay:  0
Length of outDF: 0
nextIndex: None


In [24]:
loopDFTst.buildOutDF(tst_lat_lon_df[0:98])  # use number that does not divide evenly by 5 (endRow=5)
print(loopDFTst)

Processing inputDF of length:  98
Global Settings for this object: 
endRow: 5
delay:  0
Length of outDF: 98
nextIndex: 98
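
The loop arithmetic being tested here can be checked in isolation. The sketch below uses a hypothetical helper, `chunk_bounds()`, that mirrors the start/end calculation in `buildOutDF()` without touching any DataFrame:

```python
def chunk_bounds(length, step):
    """Return the (start, end) pairs a scan of `length` rows in steps of `step` would visit."""
    bounds = []
    i = 0
    while i < length:
        end = min(i + step, length)      # clamp the final chunk to the data length
        bounds.append((i, end))
        i = end
    return bounds

bounds_98 = chunk_bounds(98, 5)
```

For 98 rows and a step of 5, this gives 19 full chunks followed by a final chunk of 3 rows, `(95, 98)`, which matches the `nextIndex: 98` reported above.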


In [25]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  98


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
93,40.37822,-74.961792,40.3782,-74.9618
94,40.694191,-73.730133,40.6942,-73.7301
95,40.660992,-73.706703,40.661,-73.7067
96,40.6991,-73.703697,40.6991,-73.7037
97,40.674,-73.873001,40.674,-73.873


In [26]:
loopDFTst.buildOutDF(tst_lat_lon_df[98:100])  # test: missed few records scenario (in this case, adding just 2)

Processing inputDF of length:  2


In [27]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  100


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
95,40.660992,-73.706703,40.661,-73.7067
96,40.6991,-73.703697,40.6991,-73.7037
97,40.674,-73.873001,40.674,-73.873
98,40.827599,-73.896004,40.8276,-73.896
99,40.663101,-73.762199,40.6631,-73.7622


In [28]:
loopDFTst.buildOutDF(tst_lat_lon_df[100:105])  # test: add amount equal to internal self.endRow variable

Processing inputDF of length:  5


In [29]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  105


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
100,40.768501,-73.960701,40.7685,-73.9607
101,40.676601,-73.703796,40.6766,-73.7038
102,40.756401,-73.970299,40.7564,-73.9703
103,40.672798,-73.704201,40.6728,-73.7042
104,40.766479,-73.951233,40.7665,-73.9512


In [30]:
loopDFTst.buildOutDF(tst_lat_lon_df[105:205])  # test: add 100 rows (which is divisible by 5)

Processing inputDF of length:  100


In [31]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  205


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
200,40.744099,-74.143501,40.7441,-74.1435
201,40.269508,-74.482819,40.2695,-74.4828
202,40.269569,-74.483322,40.2696,-74.4833
203,40.26965,-74.483269,40.2696,-74.4833
204,40.26981,-74.483269,40.2698,-74.4833


In [32]:
loopDFTst.buildOutDF(tst_lat_lon_df[205:306])  # another add records test

Processing inputDF of length:  101


In [33]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  306


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
301,40.269489,-74.482903,40.2695,-74.4829
302,40.23085,-74.474373,40.2309,-74.4744
303,40.253269,-74.651611,40.2533,-74.6516
304,40.269241,-74.483482,40.2692,-74.4835
305,40.26944,-74.482887,40.2694,-74.4829


In [34]:
loopDFTst.buildOutDF(tst_lat_lon_df[306:])  # get the rest added

Processing inputDF of length:  854


In [35]:
print("Length of outDF: ", len(loopDFTst.outDF))
loopDFTst.outDF.tail()

Length of outDF:  1160


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
1155,43.233299,-70.911079,43.2333,-70.9111
1156,43.233601,-70.911301,43.2336,-70.9113
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107


In [36]:
## sanity checking:
loopDFTst.outDF.iloc[[0,1,2,1157,1158,1159]]

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
0,42.377602,-71.124702,42.3776,-71.1247
1,42.432098,-71.056099,42.4321,-71.0561
2,42.249298,-71.074501,42.2493,-71.0745
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107


In [37]:
tst_lat_lon_df.iloc[[0,1,2,1157,1158,1159]]

Unnamed: 0,lat,lon
0,42.377602,-71.124702
1,42.432098,-71.056099
2,42.249298,-71.074501
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


#### Simple Test Using Defaults

In [38]:
tstDFbldr = TstModification_DFBuilder()
print(tstDFbldr)  ## show defaults set during object build

Global Settings for this object: 
endRow: 5
delay:  0
Length of outDF: 0
nextIndex: None


In [39]:
tst_lat_lon_df.tail()

Unnamed: 0,lat,lon
1155,43.233299,-70.911079
1156,43.233601,-70.911301
1157,43.233299,-70.910698
1158,43.233398,-70.911003
1159,43.233299,-70.910713


In [40]:
tstDFbldr.buildOutDF(tst_lat_lon_df)  ## executes in under 1 second on 1160 rows

Processing inputDF of length:  1160


In [41]:
tstDFbldr.outDF.describe()            ## use .describe instead of .describe() to see source data ... rounding did work
                                      ## display adds zeros, but our rounded fields are rounded to 4 decimals

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
count,1160.0,1160.0,1160.0,1160.0
mean,41.232457,-74.04298,41.232457,-74.042981
std,1.162332,1.463141,1.162333,1.46314
min,39.390049,-78.366203,39.39,-78.3662
25%,40.269619,-74.651615,40.2696,-74.6516
50%,40.74292,-74.14307,40.7429,-74.14305
75%,42.3616,-73.752644,42.3616,-73.752625
max,44.950298,-70.187302,44.9503,-70.1873


In [42]:
tstDFbldr.outDF.tail()  ## tail of DF after first run of the function

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
1155,43.233299,-70.911079,43.2333,-70.9111
1156,43.233601,-70.911301,43.2336,-70.9113
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107


#### Add 5 Rows - Original Data Source
This test uses buildOutDF() to add to the outDF inside the object, repeating the first 5 rows of the source at the end of the DF.

In [43]:
tstDFbldr.buildOutDF(tst_lat_lon_df[0:5])  # add copy of 5 rows to the end (like adding more data later)
tstDFbldr.outDF.tail(10)                   # function added these new rows and fixed the index
                                           # this test is why the reset_index code was added

Processing inputDF of length:  5


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
1155,43.233299,-70.911079,43.2333,-70.9111
1156,43.233601,-70.911301,43.2336,-70.9113
1157,43.233299,-70.910698,43.2333,-70.9107
1158,43.233398,-70.911003,43.2334,-70.911
1159,43.233299,-70.910713,43.2333,-70.9107
1160,42.377602,-71.124702,42.3776,-71.1247
1161,42.432098,-71.056099,42.4321,-71.0561
1162,42.249298,-71.074501,42.2493,-71.0745
1163,42.3578,-71.062698,42.3578,-71.0627
1164,42.347,-71.074799,42.347,-71.0748
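
Why the index reset matters when rows are added a second time can be seen on a tiny made-up frame. This sketch uses `pd.concat` rather than `DataFrame.append`, which recent pandas releases no longer provide; the frame contents are assumptions for illustration.

```python
import pandas as pd

first = pd.DataFrame({"lat": [42.1, 42.2, 42.3]})
again = first[0:2]                                   # rows labeled 0 and 1, as in the source

stacked = pd.concat([first, again])                  # labels are kept: 0, 1, 2, 0, 1
labels_kept = list(stacked.index)

stacked.reset_index(drop=True, inplace=True)         # relabel 0..4, as reindex_OutDF() does
labels_reset = list(stacked.index)
```

Without the reset, the repeated labels would make label-based lookups ambiguous; with it, the added rows simply continue the numbering, as in the output above.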


In [44]:
tstDFbldr.outDF.head(10)  ## quick check of the head of the DF

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
0,42.377602,-71.124702,42.3776,-71.1247
1,42.432098,-71.056099,42.4321,-71.0561
2,42.249298,-71.074501,42.2493,-71.0745
3,42.3578,-71.062698,42.3578,-71.0627
4,42.347,-71.074799,42.347,-71.0748
5,42.2528,-71.129097,42.2528,-71.1291
6,42.344898,-71.101601,42.3449,-71.1016
7,42.424702,-71.111198,42.4247,-71.1112
8,42.297001,-71.054703,42.297,-71.0547
9,42.336201,-71.103699,42.3362,-71.1037


In [45]:
# tstDFbldr.outDF  ## uncomment to view whole DF

#### Repeat Testing With Random Sample of Data
This test is repeated on a smaller random sample of the original data, created as a deep copy.  Because the sample holds 10% of the rows, it should complete in roughly 1/10th of the time needed for the whole DF.  Spot testing of the code's features on the sample, however, revealed some odd quirks.

In [46]:
tstDFbldr2 = TstModification_DFBuilder()  
tstDFbldr2.buildOutDF(tst_lat_lon_df_sample)  ## create another object and run the function on it

Processing inputDF of length:  116


In [47]:
tstDFbldr2.outDF.iloc[[24,25,67]]             ## NaN produced from empty Lat/Lon values

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
24,,-74.651611,,-74.6516
25,40.633709,,40.6337,
67,,,,


In [48]:
# reset object by replacing with fresh blank one
tstDFbldr2 = TstModification_DFBuilder(time_delay=1)  ## set delay to 1 second so we can interrupt during processing

#### Simulated Interrupt
In the final object, it will be server timeouts from the web that may result in the code failing to complete.  The closest we can come to simulating this without introducing web API content is to set a time delay and interrupt the code run in the middle (manually) from within Jupyter.  That test, as well as some other logic checks, follows here.

In [49]:
tstDFbldr2.buildOutDF(tst_lat_lon_df_sample)          ## stop this test in middle for next set of tests

Processing inputDF of length:  116


KeyboardInterrupt: 
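
After an interruption like the one above, the partial output can be written out and used later to work out which rows still need processing. The following is a rough sketch, assuming rows are processed in source order and the source is unchanged between sessions. It uses an in-memory buffer in place of a real file, and the frame contents are made up.

```python
import io
import pandas as pd

source = pd.DataFrame({"lat": [40.1, 40.2, 40.3, 40.4, 40.5, 40.6, 40.7]})
partial = source.iloc[:5].assign(lat_rnd=lambda d: d["lat"].round(0))   # pretend 5 rows finished

# save the partial result; a real run would pass a file path instead of a buffer
buffer = io.StringIO()
partial.to_csv(buffer)

# later session: reload the checkpoint and work out what is left to process
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)
remaining = source.iloc[len(reloaded):]
```

Only `remaining` would then need to be passed through the builder again, and the two pieces could be combined afterwards.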

In [50]:
print(tstDFbldr2.delay  )    ## show delay used:  1 second
tstDFbldr2.outDF.describe()  ## describe resulting DF ... it has only a fraction of the expected rows
                             ## because we stopped the code early during testing ...

1


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
count,49.0,49.0,49.0,49.0
mean,41.159567,-74.40532,41.159565,-74.405316
std,1.197015,1.171229,1.197014,1.171228
min,39.4203,-77.178574,39.4203,-77.1786
25%,40.269619,-75.025757,40.2696,-75.0258
50%,40.72617,-74.464989,40.7262,-74.465
75%,42.091789,-73.889793,42.0918,-73.8898
max,44.583321,-70.537613,44.5833,-70.5376


In [51]:
tstDFbldr2.outDF.tail()  ## index values shown here were cleaned up by changing creation of sample DF
                         ## see comments in data preparation at start of NB

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
45,42.659409,-73.781357,42.6594,-73.7814
46,40.923328,-73.883057,40.9233,-73.8831
47,40.760601,-74.005501,40.7606,-74.0055
48,43.099098,-73.708313,43.0991,-73.7083
49,40.644901,-73.958298,40.6449,-73.9583


In [52]:
## work-around:  this code could be used to reset index with same options as buildOutDF()
# tstDFbldr2.reindex_OutDF()
# tstDFbldr2.outDF.tail()

In [53]:
tstDFbldr2.buildOutDF(tst_lat_lon_df_sample[-5:])   ## first attempt to add last 5 rows again from sample data

Processing inputDF of length:  5


In [54]:
tstDFbldr2.outDF.describe()  # count was unchanged from previous in earlier iteration of the code

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
count,54.0,54.0,54.0,54.0
mean,41.121034,-74.370418,41.121031,-74.370413
std,1.156245,1.169211,1.156245,1.16921
min,39.4203,-77.178574,39.4203,-77.1786
25%,40.269647,-75.005827,40.269625,-75.00585
50%,40.72588,-74.464958,40.7259,-74.46495
75%,41.497149,-73.88973,41.497125,-73.889725
max,44.583321,-70.537613,44.5833,-70.5376


In [55]:
tstDFbldr2.outDF.tail(10)  # now it seems to work right

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
45,42.659409,-73.781357,42.6594,-73.7814
46,40.923328,-73.883057,40.9233,-73.8831
47,40.760601,-74.005501,40.7606,-74.0055
48,43.099098,-73.708313,43.0991,-73.7083
49,40.644901,-73.958298,40.6449,-73.9583
50,41.353199,-72.038597,41.3532,-72.0386
51,40.666698,-74.208702,40.6667,-74.2087
52,40.049099,-75.079437,40.0491,-75.0794
53,40.365349,-74.946037,40.3653,-74.946
54,41.2827,-73.869102,41.2827,-73.8691


In [56]:
tstDFbldr2.buildOutDF(tst_lat_lon_df_sample[-5:])   ## Try it a second time

Processing inputDF of length:  5


In [57]:
tstDFbldr2.outDF.describe()                         ## note: count is 5 more than before

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
count,59.0,59.0,59.0,59.0
mean,41.089032,-74.341431,41.089029,-74.341425
std,1.12036,1.166732,1.12036,1.166729
min,39.4203,-77.178574,39.4203,-77.1786
25%,40.269674,-74.985897,40.26965,-74.9859
50%,40.72559,-74.464928,40.7256,-74.4649
75%,41.353199,-73.886383,41.3532,-73.8864
max,44.583321,-70.537613,44.5833,-70.5376


In [58]:
tstDFbldr2.outDF.tail(10)    ## comparison of tail helps confirm new records were added
                             ## but clean index reset occurs this time (as it is supposed to)

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
50,41.353199,-72.038597,41.3532,-72.0386
51,40.666698,-74.208702,40.6667,-74.2087
52,40.049099,-75.079437,40.0491,-75.0794
53,40.365349,-74.946037,40.3653,-74.946
54,41.2827,-73.869102,41.2827,-73.8691
55,41.353199,-72.038597,41.3532,-72.0386
56,40.666698,-74.208702,40.6667,-74.2087
57,40.049099,-75.079437,40.0491,-75.0794
58,40.365349,-74.946037,40.3653,-74.946
59,41.2827,-73.869102,41.2827,-73.8691


In [59]:
## Another investigation
'''Idea: If in doubt as to whether the dataframe being passed in for the second run is mutating correctly or not,
try making a deep copy in steps and resetting the index on the copy as shown here. '''

## problems this was investigating now appear to be fixed

tmpDF1 = tst_lat_lon_df_sample[-5:].copy(deep=True)
tmpDF1.reset_index(drop=True, inplace=True)

In [60]:
tmpDF1

Unnamed: 0,lat,lon
0,41.353199,-72.038597
1,40.666698,-74.208702
2,40.049099,-75.079437
3,40.365349,-74.946037
4,41.2827,-73.869102


In [61]:
tstDFbldr2.buildOutDF(tmpDF1)  ## initial test seems promising

Processing inputDF of length:  5


In [62]:
tstDFbldr2.outDF.tail(10)     ##  as shown here, multiple tests seem to add the new rows every time

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
55,41.353199,-72.038597,41.3532,-72.0386
56,40.666698,-74.208702,40.6667,-74.2087
57,40.049099,-75.079437,40.0491,-75.0794
58,40.365349,-74.946037,40.3653,-74.946
59,41.2827,-73.869102,41.2827,-73.8691
60,41.353199,-72.038597,41.3532,-72.0386
61,40.666698,-74.208702,40.6667,-74.2087
62,40.049099,-75.079437,40.0491,-75.0794
63,40.365349,-74.946037,40.3653,-74.946
64,41.2827,-73.869102,41.2827,-73.8691


In [63]:
tstDFbldr2.buildOutDF(tmpDF1)
tstDFbldr2.outDF.tail(10)

Processing inputDF of length:  5


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
60,41.353199,-72.038597,41.3532,-72.0386
61,40.666698,-74.208702,40.6667,-74.2087
62,40.049099,-75.079437,40.0491,-75.0794
63,40.365349,-74.946037,40.3653,-74.946
64,41.2827,-73.869102,41.2827,-73.8691
65,41.353199,-72.038597,41.3532,-72.0386
66,40.666698,-74.208702,40.6667,-74.2087
67,40.049099,-75.079437,40.0491,-75.0794
68,40.365349,-74.946037,40.3653,-74.946
69,41.2827,-73.869102,41.2827,-73.8691


In [64]:
print(tstDFbldr2)  ## note: we added tmpDF1 which ended on index 4.  This is why "nextIndex" now reads 5
                   ## nextIndex represents next index if we were to continue with the next record in the last
                   ## table we added to outDF using the buildOutDF() function

Global Settings for this object: 
endRow: 5
delay:  1
Length of outDF: 70
nextIndex: 5


#### Attempt To Replicate Earlier Problem
This problem first arose in another notebook that did not have all of the tests before it.  Strangely, removing some of the tests that preceded the one that was expected to work caused it to fail in an initial run of this notebook.  The code has since changed, and these tests now show that it works as expected (problem solved).

In [65]:
## to illustrate:  we try creating a fresh object to see if we can show that problem in this NB
## build 3 here ...

In [66]:
tstDFbld3 = TstModification_DFBuilder(time_delay=1)
tstDFbld3.buildOutDF(tst_lat_lon_df_sample)          ## stop this test in middle for next set of tests

Processing inputDF of length:  116


In [68]:
print(tstDFbld3.delay)      ## show delay used:  1 second
tstDFbld3.outDF.describe()  ## describe resulting DF ... counts of 114 are the 116 sample rows minus the 2 missing
                             ## values per column, so this run completed without being stopped early

1


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
count,114.0,114.0,114.0,114.0
mean,41.067327,-74.360352,41.067325,-74.360351
std,1.141535,1.334016,1.141537,1.334016
min,39.3904,-77.178574,39.3904,-77.1786
25%,40.269436,-75.088343,40.269425,-75.088325
50%,40.715551,-74.468445,40.71555,-74.46845
75%,42.087417,-73.88973,42.087425,-73.889725
max,44.583321,-70.537613,44.5833,-70.5376


In [69]:
tstDFbld3.outDF.tail()

Unnamed: 0,lat,lon,lat_rnd,lon_rnd
111,41.353199,-72.038597,41.3532,-72.0386
112,40.666698,-74.208702,40.6667,-74.2087
113,40.049099,-75.079437,40.0491,-75.0794
114,40.365349,-74.946037,40.3653,-74.946
115,41.2827,-73.869102,41.2827,-73.8691


In [70]:
tmpDF2 = tst_lat_lon_df_sample[-5:].copy(deep=True)
tmpDF2.reset_index(drop=True, inplace=True)
tmpDF2

Unnamed: 0,lat,lon
0,41.353199,-72.038597
1,40.666698,-74.208702
2,40.049099,-75.079437
3,40.365349,-74.946037
4,41.2827,-73.869102


In [71]:
tstDFbld3.buildOutDF(tmpDF2)   # starting with fresh object and fresh deep copy of the sample
tstDFbld3.outDF.tail()         # in an earlier iteration of the code the first attempt failed here;
                               # with the current code the 5 rows are added (index 116-120)

Processing inputDF of length:  5


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
116,41.353199,-72.038597,41.3532,-72.0386
117,40.666698,-74.208702,40.6667,-74.2087
118,40.049099,-75.079437,40.0491,-75.0794
119,40.365349,-74.946037,40.3653,-74.946
120,41.2827,-73.869102,41.2827,-73.8691


In [72]:
tstDFbld3.buildOutDF(tmpDF2) 
tstDFbld3.outDF.tail()          # second and subsequent attempts succeed

Processing inputDF of length:  5


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
121,41.353199,-72.038597,41.3532,-72.0386
122,40.666698,-74.208702,40.6667,-74.2087
123,40.049099,-75.079437,40.0491,-75.0794
124,40.365349,-74.946037,40.3653,-74.946
125,41.2827,-73.869102,41.2827,-73.8691


In [73]:
tstDFbld3.buildOutDF(tmpDF2) 
tstDFbld3.outDF.tail()

Processing inputDF of length:  5


Unnamed: 0,lat,lon,lat_rnd,lon_rnd
126,41.353199,-72.038597,41.3532,-72.0386
127,40.666698,-74.208702,40.6667,-74.2087
128,40.049099,-75.079437,40.0491,-75.0794
129,40.365349,-74.946037,40.3653,-74.946
130,41.2827,-73.869102,41.2827,-73.8691


### Documentation Tests

In [84]:
# create new object to test the docstrings
testObj1 = TstModification_DFBuilder()

In [85]:
help(testObj1)

Help on TstModification_DFBuilder in module __main__ object:

class TstModification_DFBuilder(DFBuilder)
 |  Test of ability to scan a dataframe x rows at a time and add data columns to it.
 |  There are more efficient ways to round cols in a DF; this object is a test of base logic from the abstract class
 |  ahead of creating a more complex subclass that interacts with the web during the loop.  It builds a copy of the 
 |  DF a small number of rows at a time and creates some new fields as it does so.  Input DF must have "lat" and 
 |  "lon" cols. lat=Latitude / lon = Longitude. Defaults set delay to 0 seconds and rows processed at a time to 
 |  5 for this test.
 |  
 |  Method resolution order:
 |      TstModification_DFBuilder
 |      DFBuilder
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, endRw=5, time_delay=0)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  roundValue(self, value, dec_places=4, rtn_null=False)
 |    

In [86]:
print(testObj1.__doc__)  # note: formatting is messed up if you do not use print() on the doc string

Test of ability to scan a dataframe x rows at a time and add data columns to it.
There are more efficient ways to round cols in a DF; this object is a test of base logic from the abstract class
ahead of creating a more complex subclass that interacts with the web during the loop.  It builds a copy of the 
DF a small number of rows at a time and creates some new fields as it does so.  Input DF must have "lat" and 
"lon" cols. lat=Latitude / lon = Longitude. Defaults set delay to 0 seconds and rows processed at a time to 
5 for this test.


In [87]:
print(testObj1.buildOutDF.__doc__) # buildOutDF

Scans inputDF using self.endRow rows (default of 5) at a time to do it.  It then calls in logic
from _modifyTempDF()_ to make changes to each subset of rows and appends tiny tempDF onto an outDF.  When the 
subclass is using a web API, self.time_delay tells it how much time to delay each iteration of the loop.  All
parameters are set during initialization of the object.  Should this function fail in the middle, outDF will
have all work up to the failure.  This can be saved out to a DF or csv.  The function can be run again on
a subset of the data (the records not encountered yet before the failure).
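
As a side note on the docstring checks above, the standard library's `inspect.getdoc()` returns a docstring with the common leading indentation removed, which can help when continuation lines are indented in the source. A minimal sketch with a hypothetical `Demo` class:

```python
import inspect

class Demo:
    def method(self):
        '''First line of the docstring.
        Second line, indented in the source.'''

clean = inspect.getdoc(Demo.method)   # common leading whitespace is stripped
echoed = repr(clean)                  # what a bare notebook expression would display
```

Printing `clean` renders the two lines separately, while the echoed form shows the line break as a literal `\n`, which is the formatting difference noted in the comment on `print()` above.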
