# OpenFEMA API Tutorial: Part 5 - Getting Dataset Updates
  
## Quick Summary
- This tutorial explains and demonstrates how to check for dataset and record updates for an OpenFEMA dataset. 
- Checking for updates before downloading can reduce bandwidth use and computation.
- The OpenFEMA API has a special endpoint called "DataSets" that contains a lastDataSetRefresh value useful to determine when the dataset was last successfully updated in OpenFEMA.
- Many datasets contain a lastRefresh field indicating when the record was last updated or added. This can be used to retrieve only updated data rather than download a full dataset.
- A [Final Working Example](#Final-Working-Example) (at the bottom of this document) demonstrates checking a dataset update and retrieving only records added or modified since a previous time.
- The next couple of tutorials will cover basic data analysis, graphing, and mapping.

## Overview
In previous tutorials we demonstrated the basic use of the OpenFEMA API, parameters, retrieving more than 10,000 records by making multiple API calls, and metadata usage. This tutorial will demonstrate how to use a special field to retrieve only those records that have been added or changed since your last API call. 

While some users are engaged in an analysis or study of FEMA data, necessitating a retrieval of a full dataset, other users integrate FEMA data into their own applications or processes. While the most straightforward approach may seem to be downloading the full dataset at some periodic interval, it is possible to download only changes to datasets after an initial download. This reduces the amount of data and time needed to refresh a dataset.
 
The purpose of this notebook is to demonstrate how to perform periodic updates of data rather than download the full dataset. Important limitations and gotchas are also provided. The examples are presented using Python 3, but it should be easy to translate them to almost any programming language.


## Checking for Updates

Data updates are possible for some datasets and are useful in situations where a full set of historical data is required as well as any data added since the initial download. For the OpenFEMA API, it is important to note the distinction between the dataset being updated or refreshed versus the actual data or records being updated.

### Dataset Updates
The OpenFEMA data store will either reload or update a specific dataset according to a refresh interval based on the data owner’s recommendations, the source system's speed of data processing, the size of the dataset, and the complexity of retrieving the data. The refresh interval (expressed as an ISO 8601 Repeating Interval) for each dataset varies and can be found on the associated dataset page, as the following images illustrate:

![Update Frequency Example 1](img/frequency1_2.png)

![Update Frequency Example 2](img/frequency2_2.png)


A special metadata API endpoint exists to describe each OpenFEMA dataset: [OpenFemaDataSets](https://www.fema.gov/openfema-data-page/openfema-data-sets-v1) (a.k.a. DataSets). The “lastRefresh” element in this dataset indicates when the metadata record (i.e., the attributes representing a specific dataset) was refreshed, while the "lastDataSetRefresh" element indicates when the dataset _that the metadata record represents_ was last updated in the OpenFEMA data store. 

Let's call this endpoint and review some of the elements it returns.

<div class="alert alert-block alert-info">
    <b>Tip:</b> To see all the elements this metadata endpoint returns, review the data dictionary on the dataset page: https://www.fema.gov/openfema-data-page/openfema-data-sets-v1
</div>


In [2]:
# declare a URL handling module
import urllib.request
import json
import math

# define URL for the Data Sets endpoint and a subsequent query
baseUrl = "https://www.fema.gov/api/open/v1/OpenFemaDataSets"

# we only want to see a few of the elements offered by this endpoint for the purposes of this example
select = "?$select=name,version,lastRefresh,lastDataSetRefresh,accrualPeriodicity,recordCount"   

# we want to see metadata for the DisasterDeclarationsSummaries endpoint
# NOTE: we specified version 2 in the filter because version 1 is still available
filter = "&$filter=name%20eq%20%27DisasterDeclarationsSummaries%27%20and%20version%20eq%202"    

# we don't really need to see any metadata - this query will return 1 record
other = "&$metadata=off"   

# define the request and read the data 
request = urllib.request.urlopen(baseUrl + select + filter + other)
result = request.read()

# transform result to Python dictionary
jsonData = json.loads(result.decode('utf-8'))

# display the metadata object
print(json.dumps(jsonData, indent=2))

{
  "OpenFemaDataSets": [
    {
      "name": "DisasterDeclarationsSummaries",
      "version": 2,
      "lastRefresh": "2022-12-16T12:41:55.537Z",
      "lastDataSetRefresh": "2022-12-16T12:41:55.537Z",
      "accrualPeriodicity": "R/PT20M",
      "recordCount": 63788
    }
  ]
}


The "lastDataSetRefresh" value tells us the exact date and time that the DisasterDeclarationsSummaries data was last refreshed in the OpenFEMA data store. Remember, we are calling the metadata endpoint, so the "lastRefresh" value tells us when the *metadata* for this dataset was last refreshed. In many cases the two values are the same, but "lastDataSetRefresh" is the one that matters here. **Note: this does not tell us that the FEMA source system data changed at this time, just when it was refreshed in OpenFEMA.**

The "accrualPeriodicity" value is the update frequency in the ISO 8601 Repeating Interval format mentioned earlier. 
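
To compare this interval with your own polling schedule, it can help to turn the string into a Python timedelta. The following is a minimal sketch, not part of the OpenFEMA API: the helper name `parse_accrual_periodicity` is hypothetical, only week, day, hour, minute, and second components are handled, and "R/P1D" is an illustrative value rather than one taken from a specific dataset.

```python
import re
from datetime import timedelta

# Matches an ISO 8601 repeating interval such as "R/PT20M" or "R/P1D".
# Months and years are deliberately not supported because they have no fixed length.
_REPEATING_INTERVAL = re.compile(
    r"^R\d*/P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_accrual_periodicity(value):
    """Return the refresh interval as a timedelta, or None if it cannot be parsed."""
    match = _REPEATING_INTERVAL.match(value)
    if match is None:
        return None
    # keep only the components that were actually present in the string
    parts = {name: int(num) for name, num in match.groupdict().items() if num is not None}
    if not parts:
        return None
    return timedelta(**parts)

print(parse_accrual_periodicity("R/PT20M"))   # 0:20:00
print(parse_accrual_periodicity("R/P1D"))     # 1 day, 0:00:00
```

A scheduler could compare the returned timedelta with its own polling interval and warn when it polls more often than the dataset is refreshed.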

The "recordCount" value is often useful to verify the total count of a dataset.

<div class="alert alert-block alert-info">
    <b>Tip:</b> The following URL can also be used to return the same metadata as above: https://www.fema.gov/api/open/v1/DataSets?$select=name,version,lastRefresh,lastDataSetRefresh,accrualPeriodicity,recordCount&\$filter=name eq 'DisasterDeclarationsSummaries' and version eq 2
</div>

Prior to executing your own refresh or retrieval of a dataset, it is worthwhile to check its status. There are situations when the OpenFEMA data store is unable to refresh from the source data at the stated frequency. A call to this endpoint is faster than querying the dataset and trying to determine if records have changed. 

<div class="alert alert-block alert-warning">
    <b>Note:</b> Whether you are refreshing the entire dataset or just trying to add/update changed records since the last update, your refresh interval should not be more frequent than the data set refresh interval.
</div>

One possible technique is to store/save the date and time of your last successful dataset call. Compare this value with the retrieved lastDataSetRefresh value and only execute your next call if the lastDataSetRefresh is greater than that of your previous call.

In [4]:
# assume we have read this value from a file or log
myLastDataSetApiCall = "2022-11-23T03:02:03.724Z"

# do comparison
if (jsonData['OpenFemaDataSets'][0]['lastDataSetRefresh'] > myLastDataSetApiCall):
    print("The dataset has been refreshed since the last call. Do stuff.")
    
    # do stuff - call routine to re-query the OpenFEMA API for that dataset
else:
    print("The dataset has not been refreshed since the last call.")
    
    # skip issuing the next call - no need
    
# if successful, save the current lastDataSetRefresh


The dataset has been refreshed since the last call. Do stuff.
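
The saved value has to live somewhere between runs. As a rough sketch, assuming a small JSON state file is acceptable for your process, the helpers below read and write it. The function names and the file layout are hypothetical choices, not part of OpenFEMA. The default of "0" is used only because it sorts before any ISO 8601 timestamp in a string comparison.

```python
import json
import os
import tempfile

def load_last_refresh(path, default="0"):
    """Return the saved lastDataSetRefresh string, or default if no usable state file exists."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)["lastDataSetRefresh"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return default

def save_last_refresh(path, value):
    """Write to a temporary file first, then rename it over the old state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"lastDataSetRefresh": value}, f)
    os.replace(tmp_path, path)

# demonstration in a throwaway directory
state_path = os.path.join(tempfile.mkdtemp(), "dds_state.json")
print(load_last_refresh(state_path))    # no file yet, so the default is returned
save_last_refresh(state_path, "2022-12-16T12:41:55.537Z")
print(load_last_refresh(state_path))
```

Saving the value only after your own download succeeds means a failed run is simply retried on the next schedule.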


### Data or Record Updates
For some datasets, it is possible to retrieve *updated data*, meaning it is possible to identify a record by the date it was changed within an OpenFEMA dataset. If a dataset contains a field called “lastRefresh”, OpenFEMA is receiving and adding updates to the record set, and lastRefresh represents the date when the record was added to or updated in the dataset. Datasets that do not contain this field are refreshed by performing a full reload of all the data; there is no way to tell when an individual record was added or changed. **Note: the lastRefresh date does not represent when the data was modified in the source system, just the date when it changed in the OpenFEMA dataset.** Exceptions may exist. Such exceptions will be documented on the appropriate dataset page.

<div class="alert alert-block alert-info">
    <b>Tip:</b> While some datasets may not have a "lastRefresh" date, they may have a source field that contains a date that can be used in a similar fashion (e.g., the "sent" field in the IPAWS Archived Alerts dataset). 
</div>

For cases where this date exists, it is possible to query recent data only. The following shows the total count of DisasterDeclarationsSummaries records. We don't really need to return or look at the data, we just want the count.

In [5]:
# define URL for the Disaster Declarations Summaries endpoint and a subsequent query
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define the request and read the data 
request = urllib.request.urlopen(baseUrl + "?$inlinecount=allpages&$top=1")
result = request.read()

# transform result to Python dictionary
jsonData = json.loads(result.decode('utf-8'))

# display the record count only
print("Records found: " + str(jsonData['metadata']['count']))

Records found: 63788


Let's add a filter to the above and view only those records added or updated within the last day. For this example we use the day prior to now, but a saved date would work just as well. We add the datetime library to make it easier to manipulate dates and times. The lastRefresh dates are expressed in UTC, while datetime.now() returns local time, so production-level code should take time zone information into account.

<div class="alert alert-block alert-warning">
    <b>Note:</b> Even though most of the disaster related datasets are updated every 20 minutes, there are rarely more than a few records added each day.
</div>

In [6]:
# add library to assist with time math
from datetime import datetime, timedelta

# calculate 1 day prior and turn it into a string we can use in our filter
priorDay = datetime.strftime(datetime.now() + timedelta(days=-1),'%Y-%m-%dT%H:%M:%S.%fZ')
print("Now: " + datetime.strftime(datetime.now(),'%Y-%m-%dT%H:%M:%S.%fZ'))
print("Prior day: " + priorDay)

# define filter 
filter = "?$filter=lastRefresh%20gt%20%27" + priorDay + "%27"

# define the request and read the data 
request = urllib.request.urlopen(baseUrl + filter + "&$inlinecount=allpages&$top=5")
result = request.read()

# transform result to Python dictionary
jsonData = json.loads(result.decode('utf-8'))

# display the metadata object
print("Records found: " + str(jsonData['metadata']['count']))

# display some of the data
for rec in jsonData['DisasterDeclarationsSummaries']:
    print(str(rec['femaDeclarationString']) + ', ' + rec['declarationTitle'] + ', ' + rec['declarationDate'] + ', ' + rec['lastRefresh'])

Now: 2022-12-16T07:54:00.445380Z
Prior day: 2022-12-15T07:54:00.445380Z
Records found: 23
FM-5424-FL, 1707 ADKINS AVE FIRE, 2022-03-04T00:00:00.000Z, 2022-12-15T16:41:21.433Z
DR-1991-IL, SEVERE STORMS AND FLOODING, 2011-06-07T00:00:00.000Z, 2022-12-15T12:21:42.394Z
DR-1991-IL, SEVERE STORMS AND FLOODING, 2011-06-07T00:00:00.000Z, 2022-12-15T12:21:42.394Z
FM-5210-MT, HIGHWAY 200 FIRE COMPLEX, 2017-09-10T00:00:00.000Z, 2022-12-15T12:41:20.591Z
DR-1991-IL, SEVERE STORMS AND FLOODING, 2011-06-07T00:00:00.000Z, 2022-12-15T12:21:42.387Z


Please note that not all datasets have the lastRefresh value. This is often because the dataset is built from several source tables and the source system cannot provide a primary key. Without a key, the OpenFEMA data load process has no way to identify the specific records to update, so a full drop and reload occurs.

While many of the datasets containing lastRefresh come from transaction-based systems in which deletes cannot occur, others do not, and records may be deleted at the source. In these situations, OpenFEMA may issue a full reload to ensure no stray records remain in the dataset.

<div class="alert alert-block alert-warning">
    <b>Note:</b> Occasionally files that are normally updated must be fully reloaded. When this occurs, the lastRefresh date for every record is updated. Please see the <a href="https://www.fema.gov/about/openfema/api#:~:text=Special%20Dataset%20Fields">Special Dataset Fields</a> section of the OpenFEMA API Documentation page.
</div>
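
Because a full reload touches lastRefresh on every record, an incremental query issued after one will match the entire dataset. A simple heuristic, sketched below under the assumption that you already have the filtered count and the total count (for example from $inlinecount queries like those shown earlier), is to compare the two and replace your local copy when they match. The function name and the returned labels are hypothetical.

```python
def choose_refresh_strategy(changed_count, total_count):
    """Classify an update run from the filtered and total record counts."""
    if changed_count == 0:
        return "skip"            # nothing added or updated since the last call
    if changed_count >= total_count:
        return "full_reload"     # every record matched: replace the local copy
    return "incremental"         # only some records matched: merge them

# counts taken from the example outputs earlier in this tutorial
print(choose_refresh_strategy(23, 63788))       # incremental
print(choose_refresh_strategy(63788, 63788))    # full_reload
```

Note that a first-time download also falls into the "full_reload" case, which is usually the desired behavior.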

## Final Working Example
This example combines the two types of updates above. 

 - Purpose: Maintain a fresh set of Disaster Declarations Summaries data. 
 - This code is executed every 30 minutes by a Linux cron job (or Windows scheduled task).
 - Save data to a file. Often users will save directly to a database, but this simplifies the example.
 
<div class="alert alert-block alert-warning">
    <b>Note:</b> This is not meant to be a Python language tutorial. The point is to show the dataset update features discussed above. There are other, more Pythonic ways to do this. If you are writing production-quality code, follow industry best practices: replace embedded constants, evaluate returned values, add error/exception handling, add logging, clean up objects properly, build for resilience by retrying on failure, etc.
</div>
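
As one illustration of the retry advice in the note above, the following sketch wraps urllib.request.urlopen in a small retry loop. The function name, attempt count, delay, and 30 second timeout are arbitrary assumptions, and the `opener` parameter exists only so the loop can be exercised without a network connection.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, attempts=3, delay_seconds=2.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, retrying on URLError with a growing delay.

    Assumes attempts >= 1. HTTPError is a subclass of URLError, so HTTP error
    responses are retried too; production code may want to retry only some of them.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=30) as response:
                return response.read()
        except urllib.error.URLError as error:
            last_error = error
            if attempt < attempts:
                # wait a little longer after each failed attempt
                time.sleep(delay_seconds * attempt)
    raise last_error
```

A call such as `fetch_with_retries(endpointUrl + select + filter + other)` could stand in for the urlopen and read pair used in the example that follows.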

In [1]:
# declare a URL handling module
import urllib.request
import json
import math

# Small function to check dataset status - lastCallDatetime is a UTC string.
# A return value of true indicates the dataset has been refreshed and processing can continue.
def checkDatasetStatus(lastCallDatetime):
    # define URL for the Data Sets endpoint and a subsequent query
    endpointUrl = gBaseUrl + "v1/OpenFemaDataSets"
    
    # we only need the date
    select = "?$select=lastDataSetRefresh"   

    # ...for the dataset and version. We could use the date in the filter above instead of testing 
    #    in Python, but we would still have to check to see what was returned.
    filter = "&$filter=name%20eq%20%27" + gDatasetName + "%27%20and%20version%20eq%20" + str(gDatasetVersion)    

    # we don't really need to see any metadata - this query will return 1 record
    other = "&$metadata=off"   

    # define the request, read the data, transform to dictionary
    request = urllib.request.urlopen(endpointUrl + select + filter + other)
    result = request.read()
    jsonData = json.loads(result.decode('utf-8'))
    print(jsonData['OpenFemaDataSets'][0]['lastDataSetRefresh'])
    # do comparison - normally you should convert string dates to date objects before comparing, but since
    #    both values use the same strict ISO 8601 UTC format, comparing the strings alphabetically gives the
    #    same result as comparing the dates
    return jsonData['OpenFemaDataSets'][0]['lastDataSetRefresh'] > lastCallDatetime

# define a function to determine how many calls we must make to get all of our data (assuming max records of 1,000)
def determineLoopCount(filter):
    # define URL for the endpoint and a subsequent query
    endpointUrl = gBaseUrl + "v" + str(gDatasetVersion) + "/" + gDatasetName

    # define the request and read the data - since we only need count, request only 1 record and 1 field
    request = urllib.request.urlopen(endpointUrl + "?$inlinecount=allpages&$top=1&$select=id&" + filter)
    result = request.read()
    jsonData = json.loads(result.decode('utf-8'))

    # calculate the number of calls we will need to get all of our data (using the maximum of 1000)
    return math.ceil((jsonData['metadata']['count']) / 1000)

# define a function that will do paging - continue making calls until all data retrieved
def getDatasetRecords(filter, loopNum, outputFile):
    # define URL for the endpoint
    endpointUrl = gBaseUrl + "v" + str(gDatasetVersion) + "/" + gDatasetName

    # using jsona so $metadata off by default, $inlinecount defaults to none, no $select because we 
    #    want all, $top defaults to 1,000
    orderby = "&$orderby=id"     # order unimportant to me, so use id
    format = "&$format=jsona"    # lets use an array of json objects - easier

    # Initialize our file. Only doing this because of the type of file wanted. See the loop below.
    #   The root json entity is usually the name of the dataset, but you can use any name.
    outFile = open(outputFile, "a")
    outFile.write('{"' + gDatasetName + '":[')

    # Loop and call the API endpoint changing the record start each iteration
    skip = 0
    i = 0
    while (i < loopNum):
        # using jsona paging technique described in part 3 of the tutorial
        request = urllib.request.urlopen(endpointUrl + "?" + filter + orderby + format + "&$skip=" + str(skip))
        result = request.read()

        # Append results to file, trimming off first and last JSONA brackets, adding comma except for last call,
        #   AND root element terminating array bracket and brace to end unless on last call. The goal here is to 
        #   create a valid JSON file that contains ALL the records. This can be done differently.
        if (i == (loopNum - 1)):
            # on the last so terminate the single JSON object
            outFile.write(str(result[1:-1],'utf-8') + "]}")
        else:
            outFile.write(str(result[1:-1],'utf-8') + ",")

        # increment the loop counter and skip value
        i+=1
        skip = i * 1000

    outFile.close()
    
    return i


# SETTING A BUNCH OF GLOBAL VARIABLES HERE TO SIMPLIFY THE EXAMPLE - WOULD PROBABLY WANT TO DO DIFFERENTLY
gBaseUrl = "https://www.fema.gov/api/open/"
gDatasetName = "DisasterDeclarationsSummaries"
gDatasetVersion = 2

# assume we have read this value from a file or log
myLastDataSetApiCall = "2022-11-21T20:01:59.663Z"

# define filter - get all records since the last call
# to get all the data the first time, use a date value of "0"
filter = "$filter=lastRefresh%20gt%20%27" + myLastDataSetApiCall + "%27"

# do comparison
if (checkDatasetStatus(myLastDataSetApiCall)):
    print("The dataset has been refreshed since the last call. Do stuff.")
    
    # call function to get loop count - if 0 returned, no new records since last call
    loopNum = determineLoopCount(filter)
    if (loopNum > 0):
        # we have data to get
        iterationsDone = getDatasetRecords(filter, loopNum, "dds_output3.json")
        
        print(str(loopNum) + " iterations needed; " + str(iterationsDone) + " iterationsDone")
    else:
        print("The dataset has been refreshed since the last call, but no new records exist.")

else:
    print("The dataset has not been refreshed since the last call.")
    
    # skip issuing the next call - no need
    
# if successful, save the current date/time to a log or file

print("Done")

2022-12-16T12:21:56.921Z
The dataset has been refreshed since the last call. Do stuff.
1 iterations needed; 1 iterationsDone
Done
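
The example above writes only the changed records to a file. If you keep a full local copy, those records still have to be merged into it. The following is a minimal sketch, assuming each record carries a unique id field (the example orders by id) and that both sets are already loaded as lists of dictionaries. The helper name `merge_updates` and the sample records are hypothetical.

```python
def merge_updates(existing, updates, key="id"):
    """Return a new list in which updated records replace existing ones with the same key."""
    merged = {record[key]: record for record in existing}
    for record in updates:
        # overwrites a matching record, or appends a record that was not seen before
        merged[record[key]] = record
    return list(merged.values())

local_copy = [
    {"id": "a1", "declarationTitle": "SEVERE STORMS"},
    {"id": "b2", "declarationTitle": "FLOODING"},
]
changed = [
    {"id": "b2", "declarationTitle": "SEVERE STORMS AND FLOODING"},
    {"id": "c3", "declarationTitle": "HIGHWAY 200 FIRE COMPLEX"},
]
for record in merge_updates(local_copy, changed):
    print(record["id"], record["declarationTitle"])
```

This approach does not remove records that were deleted at the source; as noted earlier, catching those requires a full reload.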


## Where to go Next
The next couple of tutorials will cover basic data analysis, graphing, and mapping.

## Other Resources
- [OpenFEMA Homepage](https://www.fema.gov/open)
- [OpenFEMA API Documentation](https://www.fema.gov/about/openfema/api)
- [OpenFEMA Samples on GitHub](https://github.com/FEMA/openfema-samples)
- [ISO 8601 Repeating Intervals](https://en.wikipedia.org/wiki/ISO_8601#Repeating_intervals)