To run this notebook you will need:  
python 3.4.3  
pandas 0.18.1  
requests 2.9.0  
json 2.0.9  
numpy 1.11.0  
  
credit to: sev@thedatascout.com  

In [4]:
import pandas
import requests
import json
import numpy
from pandas.io.json import json_normalize
import re

print(pandas.__version__)
print(requests.__version__)
print(json.__version__)
print(numpy.__version__)

0.20.3
2.18.4
2.0.9
1.12.0


## Creating the RidbData object

In [84]:
import pandas as pd 
import requests
import json
from pandas.io.json import json_normalize
import numpy as np

class RidbData():
    def __init__(self, name, endpoint, url_params):
        self.df = pd.DataFrame()
        self.endpoint = endpoint
        self.url_params = url_params
        self.name = name
    
    def clean(self) :
        # by replacing '' with np.NaN we can use dropna to remove rows missing required data, like lat/longs
        self.df = self.df.replace('', np.nan)
    
        # normalize column names for lat and long. i.e. can be FacilityLatitude or RecAreaLatitude
        self.df.columns = self.df.columns.str.replace('.*Latitude', 'Latitude')
        self.df.columns = self.df.columns.str.replace('.*Longitude', 'Longitude')
        self.df = self.df.dropna(subset=['Latitude','Longitude'])

    def extract(self):
        # request the endpoint; the API key travels in the query string via params
        response = requests.get(url=self.endpoint, params=self.url_params, headers={})
        data = json.loads(response.text)
        self.df = json_normalize(data['RECDATA'])
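
The extract method assumes every response contains a 'RECDATA' key. A minimal sketch of a stricter parsing step follows; `parse_recdata` and both payload strings are hypothetical names and made-up data for illustration, not part of the RIDB API or of the class above.

```python
import json


def parse_recdata(response_text):
    """Return the list stored under 'RECDATA', or raise a descriptive error."""
    data = json.loads(response_text)
    if not isinstance(data, dict) or 'RECDATA' not in data:
        # Show the start of the body so a bad key or wrong URL is easy to spot.
        raise ValueError("response has no 'RECDATA' key: " + response_text[:80])
    return data['RECDATA']


# Made-up payloads for illustration; not real RIDB responses.
ok_text = '{"RECDATA": [{"FacilityID": 200001}], "METADATA": {}}'
error_text = '{"error": "Unauthorized"}'

records = parse_recdata(ok_text)
print(records)

try:
    parse_recdata(error_text)
except ValueError as exc:
    print('rejected:', exc)
```

Failing early with a readable message is usually easier to debug than a KeyError raised from inside json_normalize.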

In [5]:
? requests.get

Create an instance of RidbData to connect to the facilities endpoint. <br>
You can get a RIDB API key here: https://ridb.recreation.gov/?action=register

In [76]:
ridb_facilities_endpoint = 'https://ridb.recreation.gov/api/v1/facilities'
ridb_params = dict(apiKey= "CBE42AEF8619445F9823241CCB78E219")
ridb = RidbData('ridb', ridb_facilities_endpoint, ridb_params)

After running the extract method, the 'df' attribute holds the fetched data.

In [77]:
ridb.extract()

In [78]:
ridb.df.isnull().sum()

FacilityAdaAccess            0
FacilityDescription          0
FacilityDirections           0
FacilityEmail                0
FacilityID                   0
FacilityLatitude             0
FacilityLongitude            0
FacilityMapURL               0
FacilityName                 0
FacilityPhone                0
FacilityReservationURL       0
FacilityTypeDescription      0
FacilityUseFeeDescription    0
GEOJSON.COORDINATES          0
GEOJSON.TYPE                 0
Keywords                     0
LastUpdatedDate              0
LegacyFacilityID             0
OrgFacilityID                0
StayLimit                    0
dtype: int64
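
The dotted column names, such as GEOJSON.COORDINATES, come from json_normalize flattening nested objects, and the zero null counts arise because empty strings are not treated as missing values. The sketch below illustrates both points on made-up records that only mimic the column names above. It assumes pandas 1.0 or later, where the function is exposed as `pd.json_normalize`; the notebook itself imports it from `pandas.io.json`.

```python
import pandas as pd

# Made-up records shaped like the columns above; not real RIDB data.
sample_records = [
    {'FacilityID': 1, 'FacilityName': 'Example Library',
     'GEOJSON': {'TYPE': 'Point', 'COORDINATES': [-96.3, 30.6]}},
    {'FacilityID': 2, 'FacilityName': '',
     'GEOJSON': {'TYPE': 'Point', 'COORDINATES': [-77.0, 38.9]}},
]

# Nested dictionaries become columns named parent.child.
flat = pd.json_normalize(sample_records)
print(sorted(flat.columns))

# The empty FacilityName is a string, so isnull() does not count it.
total_missing = int(flat.isnull().sum().sum())
print(total_missing)
```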

In [79]:
ridb.df.describe()

Unnamed: 0,FacilityID,FacilityLatitude,FacilityLongitude
count,50.0,50.0,50.0
mean,201413.96,39.293978,-105.397023
std,934.63198,4.477295,14.192505
min,200001.0,30.285824,-122.2634
25%,200013.25,35.468325,-114.178798
50%,202053.0,39.4936,-111.868919
75%,202080.25,42.910205,-96.553334
max,202122.0,47.246301,-71.079722


In [80]:
ridb.df.head()

Unnamed: 0,FacilityAdaAccess,FacilityDescription,FacilityDirections,FacilityEmail,FacilityID,FacilityLatitude,FacilityLongitude,FacilityMapURL,FacilityName,FacilityPhone,FacilityReservationURL,FacilityTypeDescription,FacilityUseFeeDescription,GEOJSON.COORDINATES,GEOJSON.TYPE,Keywords,LastUpdatedDate,LegacyFacilityID,OrgFacilityID,StayLimit
0,True,"Like the other Presidential Libraries, the Geo...","See the map at <a href=""http://bushlibrary.tam...",Library.Bush@nara.gov,200001,30.612222,-96.331389,http://bushlibrary.tamu.edu/map.html,George Bush Presidential Library and Museum,979-691-4000,,Library,,"[-96.331389, 30.612222]",Point,,2007-02-26,,,
1,True,"The National Archives Building in Washington, ...",The National Archives Building is located betw...,,200002,38.892778,-77.023056,http://www.archives.gov/national_archives_expe...,National Archives Building,(866) 272-6272,,Archives,,"[-77.023056, 38.892778]",Point,,2016-03-21,,,
2,True,The National Archives at College Park opened f...,From I-495 (The Capital Beltway) take exit 28B...,,200003,38.9975,-76.925556,http://www.archives.gov/facilities/md/images/m...,National Archives at College Park,1-866-272-6272,,Archives,,"[-76.925556, 38.9975]",Point,,2007-02-26,,,
3,True,"Located in Atlanta, Georgia, the Jimmy Carter ...",The Jimmy Carter Library and Museum is located...,carter.library@nara.gov,200004,33.7675,-84.3553,http://www.jimmycarterlibrary.gov/images/map_a...,Jimmy Carter Presidential Library and Museum,(404) 865-7100,,Library,,"[-84.3553, 33.7675]",Point,,2007-02-26,,,
4,True,The Eisenhower Presidential Library is a natio...,Abilene is located on I-70 approximately 150 m...,eisenhower.library@nara.gov,200005,38.943889,-97.219167,,Dwight D. Eisenhower Presidential Library and ...,(785) 263-6700,,Library,,"[-97.219167, 38.943889]",Point,,2007-02-26,,,


In [81]:
ridb.df.shape

(50, 20)

Next, we will replace empty strings with np.nan and remove any entries that don't have a lat/long.

In [82]:
ridb.clean()

In [83]:
ridb.df.isnull().sum()

FacilityAdaAccess            32
FacilityDescription           0
FacilityDirections            0
FacilityEmail                 3
FacilityID                    0
Latitude                      0
Longitude                     0
FacilityMapURL               33
FacilityName                  0
FacilityPhone                 1
FacilityReservationURL       45
FacilityTypeDescription      35
FacilityUseFeeDescription    41
GEOJSON.COORDINATES           0
GEOJSON.TYPE                  0
Keywords                     43
LastUpdatedDate               0
LegacyFacilityID             38
OrgFacilityID                50
StayLimit                    32
dtype: int64
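
The three cleaning steps can be traced on a toy DataFrame. This is a sketch under stated assumptions: the data is made up, and `regex=True` is passed explicitly because recent pandas releases treat the pattern in `str.replace` as a literal string by default. That keyword does not exist in the older pandas versions listed at the top of this notebook, so drop it there.

```python
import numpy as np
import pandas as pd

# Made-up rows: B lacks a latitude and C lacks a longitude.
toy = pd.DataFrame({
    'FacilityName': ['A', 'B', 'C'],
    'FacilityLatitude': [30.6, '', 38.9],
    'FacilityLongitude': [-96.3, -77.0, ''],
})

# Step 1: empty strings become NaN so dropna can see them.
cleaned = toy.replace('', np.nan)

# Step 2: FacilityLatitude / RecAreaLatitude both collapse to 'Latitude'.
cleaned.columns = cleaned.columns.str.replace(r'.*Latitude', 'Latitude', regex=True)
cleaned.columns = cleaned.columns.str.replace(r'.*Longitude', 'Longitude', regex=True)

# Step 3: drop rows missing either coordinate.
cleaned = cleaned.dropna(subset=['Latitude', 'Longitude'])
print(cleaned)
```

Only row A survives, because it is the only row with both coordinates present.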

Compare the 'FacilityReservationURL' field from above with the cleaned-up column below. You'll see 'NaN' after the DataFrame has been cleaned.

In [43]:
ridb.df.head()

Unnamed: 0,FacilityAdaAccess,FacilityDescription,FacilityDirections,FacilityEmail,FacilityID,Latitude,Longitude,FacilityMapURL,FacilityName,FacilityPhone,FacilityReservationURL,FacilityTypeDescription,FacilityUseFeeDescription,GEOJSON.COORDINATES,GEOJSON.TYPE,Keywords,LastUpdatedDate,LegacyFacilityID,OrgFacilityID,StayLimit
0,True,"Like the other Presidential Libraries, the Geo...","See the map at <a href=""http://bushlibrary.tam...",Library.Bush@nara.gov,200001,30.612222,-96.331389,http://bushlibrary.tamu.edu/map.html,George Bush Presidential Library and Museum,979-691-4000,,Library,,"[-96.331389, 30.612222]",Point,,2007-02-26,,,
1,True,"The National Archives Building in Washington, ...",The National Archives Building is located betw...,,200002,38.892778,-77.023056,http://www.archives.gov/national_archives_expe...,National Archives Building,(866) 272-6272,,Archives,,"[-77.023056, 38.892778]",Point,,2016-03-21,,,
2,True,The National Archives at College Park opened f...,From I-495 (The Capital Beltway) take exit 28B...,,200003,38.9975,-76.925556,http://www.archives.gov/facilities/md/images/m...,National Archives at College Park,1-866-272-6272,,Archives,,"[-76.925556, 38.9975]",Point,,2007-02-26,,,
3,True,"Located in Atlanta, Georgia, the Jimmy Carter ...",The Jimmy Carter Library and Museum is located...,carter.library@nara.gov,200004,33.7675,-84.3553,http://www.jimmycarterlibrary.gov/images/map_a...,Jimmy Carter Presidential Library and Museum,(404) 865-7100,,Library,,"[-84.3553, 33.7675]",Point,,2007-02-26,,,
4,True,The Eisenhower Presidential Library is a natio...,Abilene is located on I-70 approximately 150 m...,eisenhower.library@nara.gov,200005,38.943889,-97.219167,,Dwight D. Eisenhower Presidential Library and ...,(785) 263-6700,,Library,,"[-97.219167, 38.943889]",Point,,2007-02-26,,,


Check the DataFrame shape after running clean to see if any entries were removed by dropping rows with missing lat/longs.

In [44]:
ridb.df.shape

(50, 20)

Great! We have a RIDB data object, but wouldn't it be easier to just write a function?

In [45]:
def get_ridb_data(endpoint,url_params):
    response = requests.get(url = endpoint, params = url_params)
    data = json.loads(response.text)
    df = json_normalize(data['RECDATA'])
    df = df.replace('', np.nan)
    df.columns = df.columns.str.replace('.*Latitude', 'Latitude')
    df.columns = df.columns.str.replace('.*Longitude', 'Longitude')
    df = df.dropna(subset=['Latitude','Longitude'])

    return df

In [16]:
ridb_df = get_ridb_data(ridb_facilities_endpoint, ridb_params)

Indeed, our function has produced the same result as the object above:

In [17]:
ridb_df.head()

Unnamed: 0,FacilityAdaAccess,FacilityDescription,FacilityDirections,FacilityEmail,FacilityID,Latitude,Longitude,FacilityMapURL,FacilityName,FacilityPhone,FacilityReservationURL,FacilityTypeDescription,FacilityUseFeeDescription,GEOJSON.COORDINATES,GEOJSON.TYPE,Keywords,LastUpdatedDate,LegacyFacilityID,OrgFacilityID,StayLimit
0,True,"Like the other Presidential Libraries, the Geo...","See the map at <a href=""http://bushlibrary.tam...",Library.Bush@nara.gov,200001,30.612222,-96.331389,http://bushlibrary.tamu.edu/map.html,George Bush Presidential Library and Museum,979-691-4000,,Library,,"[-96.331389, 30.612222]",Point,,2007-02-26,,,
1,True,"The National Archives Building in Washington, ...",The National Archives Building is located betw...,,200002,38.892778,-77.023056,http://www.archives.gov/national_archives_expe...,National Archives Building,(866) 272-6272,,Archives,,"[-77.023056, 38.892778]",Point,,2016-03-21,,,
2,True,The National Archives at College Park opened f...,From I-495 (The Capital Beltway) take exit 28B...,,200003,38.9975,-76.925556,http://www.archives.gov/facilities/md/images/m...,National Archives at College Park,1-866-272-6272,,Archives,,"[-76.925556, 38.9975]",Point,,2007-02-26,,,
3,True,"Located in Atlanta, Georgia, the Jimmy Carter ...",The Jimmy Carter Library and Museum is located...,carter.library@nara.gov,200004,33.7675,-84.3553,http://www.jimmycarterlibrary.gov/images/map_a...,Jimmy Carter Presidential Library and Museum,(404) 865-7100,,Library,,"[-84.3553, 33.7675]",Point,,2007-02-26,,,
4,True,The Eisenhower Presidential Library is a natio...,Abilene is located on I-70 approximately 150 m...,eisenhower.library@nara.gov,200005,38.943889,-97.219167,,Dwight D. Eisenhower Presidential Library and ...,(785) 263-6700,,Library,,"[-97.219167, 38.943889]",Point,,2007-02-26,,,


Let's create a second function to handle the slightly different clean method needed for media files:

In [86]:
def get_ridb_facility_media(endpoint, url_params):
    # endpoint = https://ridb.recreation.gov/api/v1/facilities/{facilityID}/media/
    response = requests.get(url = endpoint, params = url_params) 
    data = json.loads(response.text)
    df = json_normalize(data['RECDATA'])
    df = df[df['MediaType'] == 'Image']
    return df
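
One edge case worth considering: if a facility has no media, the normalized DataFrame may be empty and have no 'MediaType' column, in which case the filter above would raise a KeyError. A minimal sketch of a guarded filter follows; `filter_images` is a hypothetical helper and the rows are made up.

```python
import pandas as pd


def filter_images(df):
    """Keep only rows whose MediaType is 'Image'; tolerate a missing column."""
    if 'MediaType' not in df.columns:
        # Nothing to filter: return an empty frame with the same columns.
        return df.iloc[0:0]
    return df[df['MediaType'] == 'Image']


# Made-up media rows for illustration.
media = pd.DataFrame({
    'MediaType': ['Image', 'Video', 'Image'],
    'Title': ['a', 'b', 'c'],
})
images = filter_images(media)
empty = filter_images(pd.DataFrame())

print(list(images['Title']))
print(len(empty))
```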

The RIDB Media Endpoint is per facility, so we have to provide the facility ID in the endpoint URL:  
https://ridb.recreation.gov/api/v1/facilities/{facilityID}/media/  
We'll get data for the FacilityID 200006
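
Building the per-facility URL can be wrapped in a small helper. This is only a sketch; `media_endpoint` and `BASE` are hypothetical names introduced here, and the URL pattern is the one quoted above.

```python
BASE = 'https://ridb.recreation.gov/api/v1/facilities'


def media_endpoint(facility_id, base=BASE):
    """Return the media endpoint URL for one facility ID."""
    # rstrip avoids a double slash if the base is given with a trailing '/'.
    return '{}/{}/media/'.format(base.rstrip('/'), facility_id)


url = media_endpoint(200006)
print(url)
```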

In [87]:
ridb_media_endpoint = 'https://ridb.recreation.gov/api/v1/facilities/200006/media/'

In [88]:
ridb_df_media = get_ridb_facility_media(ridb_media_endpoint, ridb_params)

In [89]:
ridb_df_media

Unnamed: 0,Credits,Description,EmbedCode,EntityID,EntityMediaID,EntityType,Height,MediaID,MediaType,Subtitle,Title,URL,Width
0,,,,200006,309,Facility,0,0,Image,,Gerald Ford Presidential Library,https://ridb.recreation.gov/images/309.jpg,0


## Extending Objects
To accommodate the change in the media object clean method, we can extend the existing RidbData object. All we need to do is provide the new clean method. The rest of the code will be inherited from the RidbData object.

In [90]:
class RidbMediaData(RidbData):
    def clean(self) :
        self.df = self.df[self.df['MediaType'] == 'Image']
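
The same override pattern can be seen without pandas or network access. The sketch below uses hypothetical stand-in classes (`Source`, `MediaSource`) that hold plain lists of dictionaries; they are not the notebook's classes, only an illustration of which methods are inherited and which are replaced.

```python
class Source:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.data = []

    def extract(self):
        # Stand-in for the HTTP request: copy the rows we were given.
        self.data = list(self.rows)

    def clean(self):
        self.data = [r for r in self.data if r.get('Latitude') is not None]


class MediaSource(Source):
    # Only clean is redefined; __init__ and extract come from Source.
    def clean(self):
        self.data = [r for r in self.data if r.get('MediaType') == 'Image']


media = MediaSource('media', [{'MediaType': 'Image'}, {'MediaType': 'Video'}])
media.extract()
media.clean()
print(media.data)
print('extract' in MediaSource.__dict__)
```

The last line prints False because the subclass does not define extract itself; the call resolves to the parent's method.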

If we also wanted to enable the RidbMediaData object to fetch all images for a given set of facilities, we could provide a new extract method as well:

In [92]:
class RidbMediaData(RidbData):

    def clean(self) :
        self.df = self.df[self.df['MediaType'] == 'Image']
        
    def extract(self):
        for index, param_set in self.url_params.iterrows():
            facility_id = param_set['facilityID']
            req_url = self.endpoint + str(facility_id) + "/media"

            response = requests.get(url=req_url,params=dict(apiKey=param_set['apiKey']))
            data = json.loads(response.text)

            # append new records to self.df if any exist
            if data['RECDATA']:
                new_entry = json_normalize(data['RECDATA'])
                # pd.concat also works on pandas versions where DataFrame.append is unavailable
                self.df = pd.concat([self.df, new_entry])
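
Growing a DataFrame inside a loop copies the accumulated data on every iteration. An alternative sketch collects one frame per response and concatenates once at the end; the `batches` list stands in for the 'RECDATA' lists of successive responses and is made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for the 'RECDATA' list of three responses; one is empty.
batches = [
    [{'EntityID': 200002, 'MediaType': 'Image'}],
    [],
    [{'EntityID': 200003, 'MediaType': 'Image'}],
]

frames = []
for recdata in batches:
    if recdata:  # skip facilities that returned no media
        frames.append(pd.DataFrame(recdata))

# One concat at the end; ignore_index gives a clean 0..n-1 index.
combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(combined)
```

With ignore_index=True the rows are numbered consecutively instead of every row carrying index 0.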

To use this new method, we would need to make a change to the endpoint and url_params parameters we are passing to the constructor. The params object will now be a DataFrame containing the RIDB API key and the facilityIDs of interest.

In [93]:
media_url = 'https://ridb.recreation.gov/api/v1/facilities/'
media_params = pd.DataFrame({
    'apiKey':"CBE42AEF8619445F9823241CCB78E219",
    'facilityID':[200001, 200002, 200003, 200004, 200005, 200006, 200007, 200008]
    })
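
When a dictionary mixes a scalar with a list, pandas repeats the scalar on every row, which is why a single key string is enough. The sketch below shows this and the row-by-row iteration used by the extract method; `'YOUR_API_KEY'` is a placeholder, not a real key.

```python
import pandas as pd

params = pd.DataFrame({
    'apiKey': 'YOUR_API_KEY',               # scalar, repeated on each row
    'facilityID': [200001, 200002, 200003],
})

urls = []
for index, row in params.iterrows():
    # Each row carries its own facilityID and the shared apiKey.
    urls.append('https://ridb.recreation.gov/api/v1/facilities/'
                + str(row['facilityID']) + '/media')

print(params)
print(urls[0])
```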

In [94]:
ridb_media = RidbMediaData('media', media_url, media_params)

In [95]:
ridb_media.extract()

Let's take a look at what we have extracted. Note that EntityID = FacilityID.

In [96]:
ridb_media.df

Unnamed: 0,Credits,Description,EmbedCode,EntityID,EntityMediaID,EntityType,Height,MediaID,MediaType,Subtitle,Title,URL,Width
0,,,,200002,283,Facility,0,0,Image,,"National Archives Building, Washington, DC",https://ridb.recreation.gov/images/283.jpg,0
0,,,,200003,240,Facility,0,0,Image,,NARA's College Park Facility,https://ridb.recreation.gov/images/240.jpg,0
0,,,,200004,296,Facility,0,0,Image,,Entrance to the Carter Presidential Library,https://ridb.recreation.gov/images/296.jpg,0
0,,,,200006,309,Facility,0,0,Image,,Gerald Ford Presidential Library,https://ridb.recreation.gov/images/309.jpg,0
0,,,,200007,310,Facility,0,0,Image,,Gerald R. Ford Presidential Museum,https://ridb.recreation.gov/images/310.jpg,0


Run the clean method. It looks like all our media are images, so we don't expect to drop any records in this step.

In [97]:
ridb_media.clean()

In [98]:
ridb_media.df

Unnamed: 0,Credits,Description,EmbedCode,EntityID,EntityMediaID,EntityType,Height,MediaID,MediaType,Subtitle,Title,URL,Width
0,,,,200002,283,Facility,0,0,Image,,"National Archives Building, Washington, DC",https://ridb.recreation.gov/images/283.jpg,0
0,,,,200003,240,Facility,0,0,Image,,NARA's College Park Facility,https://ridb.recreation.gov/images/240.jpg,0
0,,,,200004,296,Facility,0,0,Image,,Entrance to the Carter Presidential Library,https://ridb.recreation.gov/images/296.jpg,0
0,,,,200006,309,Facility,0,0,Image,,Gerald Ford Presidential Library,https://ridb.recreation.gov/images/309.jpg,0
0,,,,200007,310,Facility,0,0,Image,,Gerald R. Ford Presidential Museum,https://ridb.recreation.gov/images/310.jpg,0


### Putting it all together
Now that we have RidbData objects with the same interface, we can use them to create a data extraction pipeline in just two lines!   
First we will set up our endpoints and objects.

In [30]:
facilities_endpoint = 'https://ridb.recreation.gov/api/v1/facilities/'
recareas_endpoint = 'https://ridb.recreation.gov/api/v1/recareas'
key_dict = dict(apiKey = "CBE42AEF8619445F9823241CCB78E219")
facilities = RidbData('facilities', facilities_endpoint, key_dict)
recareas = RidbData('recareas', recareas_endpoint, key_dict)
facility_media = RidbMediaData('facilitymedia', facilities_endpoint, media_params) 

ridb_data = [facilities,recareas,facility_media]

Here we go: because our objects have the same interface, we can call the same methods on each object in a list.

In [31]:
# extract and clean all the RIDB data
list(map(lambda x: x.extract(), ridb_data))
list(map(lambda x: x.clean(), ridb_data))

[None, None, None]
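
The [None, None, None] output appears because map is being used only for its side effects. A plain for loop expresses the same pipeline without building a throwaway list. The sketch below uses a hypothetical `StubSource` class that just records calls, so it runs without network access; any object with extract and clean methods would fit in its place.

```python
class StubSource:
    """Hypothetical stand-in that records which methods were called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def extract(self):
        self.log.append((self.name, 'extract'))

    def clean(self):
        self.log.append((self.name, 'clean'))


log = []
sources = [StubSource(n, log) for n in ('facilities', 'recareas', 'facilitymedia')]

# Each source is extracted and then cleaned before moving to the next one.
for source in sources:
    source.extract()
    source.clean()

print(log)
```

Note that this loop interleaves the steps per object, whereas the two map calls run every extract before any clean; the result is the same as long as the objects are independent of each other.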

All done! Let's check out the cleaned data.

In [102]:
facilities.df.shape

(50, 20)

In [32]:
facilities.df.head()

Unnamed: 0,FacilityAdaAccess,FacilityDescription,FacilityDirections,FacilityEmail,FacilityID,Latitude,Longitude,FacilityMapURL,FacilityName,FacilityPhone,FacilityReservationURL,FacilityTypeDescription,FacilityUseFeeDescription,GEOJSON.COORDINATES,GEOJSON.TYPE,Keywords,LastUpdatedDate,LegacyFacilityID,OrgFacilityID,StayLimit
0,True,"Like the other Presidential Libraries, the Geo...","See the map at <a href=""http://bushlibrary.tam...",Library.Bush@nara.gov,200001,30.612222,-96.331389,http://bushlibrary.tamu.edu/map.html,George Bush Presidential Library and Museum,979-691-4000,,Library,,"[-96.331389, 30.612222]",Point,,2007-02-26,,,
1,True,"The National Archives Building in Washington, ...",The National Archives Building is located betw...,,200002,38.892778,-77.023056,http://www.archives.gov/national_archives_expe...,National Archives Building,(866) 272-6272,,Archives,,"[-77.023056, 38.892778]",Point,,2016-03-21,,,
2,True,The National Archives at College Park opened f...,From I-495 (The Capital Beltway) take exit 28B...,,200003,38.9975,-76.925556,http://www.archives.gov/facilities/md/images/m...,National Archives at College Park,1-866-272-6272,,Archives,,"[-76.925556, 38.9975]",Point,,2007-02-26,,,
3,True,"Located in Atlanta, Georgia, the Jimmy Carter ...",The Jimmy Carter Library and Museum is located...,carter.library@nara.gov,200004,33.7675,-84.3553,http://www.jimmycarterlibrary.gov/images/map_a...,Jimmy Carter Presidential Library and Museum,(404) 865-7100,,Library,,"[-84.3553, 33.7675]",Point,,2007-02-26,,,
4,True,The Eisenhower Presidential Library is a natio...,Abilene is located on I-70 approximately 150 m...,eisenhower.library@nara.gov,200005,38.943889,-97.219167,,Dwight D. Eisenhower Presidential Library and ...,(785) 263-6700,,Library,,"[-97.219167, 38.943889]",Point,,2007-02-26,,,


In [103]:
recareas.df.shape

(50, 17)

In [33]:
recareas.df.head()

Unnamed: 0,GEOJSON.COORDINATES,GEOJSON.TYPE,Keywords,LastUpdatedDate,OrgRecAreaID,RecAreaDescription,RecAreaDirections,RecAreaEmail,RecAreaFeeDescription,RecAreaID,Latitude,Longitude,RecAreaMapURL,RecAreaName,RecAreaPhone,RecAreaReservationURL,StayLimit
0,"[-111.87832, 33.63307]",Point,,2015-05-01,,WestWorld is located in north Scottsdale at th...,Area is north of Frank Lloyd Wright Boulevard ...,,,4,33.63307,-111.87832,,WestWorld,480-312-6802,,
1,"[-112.32155, 33.86337]",Point,,2016-12-15,,This newly expanded reservoir has 114 miles of...,"Site is about 30 miles from Phoenix, Arizona. ...",lakepleasant@mail.maricopa.gov,,6,33.86337,-112.32155,http://www.maricopa.gov/parks/lake_pleasant/im...,Lake Pleasant,602-372-7460,,
2,"[-105.86078, 40.2168]",Point,,2015-05-01,,"<A HREF=""http://www.usbr.gov/dataweb/dams/co01...","North from Granby, Colorado on U.S. 34.",,,9,40.2168,-105.86078,,Shadow Mountain Lake,970-887-4100,,
3,"[-147.214, 65.546]",Point,,2017-06-29,,"Located just an hour's drive from Fairbanks, A...",Begin your trip at the BLM office or the Alask...,EasternInterior@blm.gov,,10,65.546,-147.214,www.blm.gov/maps,White Mountains National Recreation Area,800-437-7021 or 907-474-2200,https://www.recreation.gov/camping/white-moun...,10 days
4,"[-121.16976, 38.70519]",Point,,2016-11-14,,Recreation at Folsom Reservoir is managed by t...,Folsom Lake is located 20 miles northeast of S...,plucero@mp.usbr.gov,,11,38.70519,-121.16976,,Folsom Lake,916-988-0205,,


In [104]:
facility_media.df.shape

(5, 13)

In [100]:
facility_media.df.head()

Unnamed: 0,Credits,Description,EmbedCode,EntityID,EntityMediaID,EntityType,Height,MediaID,MediaType,Subtitle,Title,URL,Width
0,,,,200002,283,Facility,0,0,Image,,"National Archives Building, Washington, DC",https://ridb.recreation.gov/images/283.jpg,0
0,,,,200003,240,Facility,0,0,Image,,NARA's College Park Facility,https://ridb.recreation.gov/images/240.jpg,0
0,,,,200004,296,Facility,0,0,Image,,Entrance to the Carter Presidential Library,https://ridb.recreation.gov/images/296.jpg,0
0,,,,200006,309,Facility,0,0,Image,,Gerald Ford Presidential Library,https://ridb.recreation.gov/images/309.jpg,0
0,,,,200007,310,Facility,0,0,Image,,Gerald R. Ford Presidential Museum,https://ridb.recreation.gov/images/310.jpg,0
