# OpenFEMA API Tutorial: Part 6 - File Formats
  
## Quick Summary
- This tutorial discusses and demonstrates the different file formats that are returned by the OpenFEMA API: JSON, JSONA, JSONL, GEOJSON, CSV, and Parquet.
- The OpenFEMA API Tutorial: Part 2 - Query Parameters explained the ```$format``` parameter used to specify the format of the data returned, but it did not explain in detail when one format should be chosen over another.
- Not all file formats are available for every dataset.
- Includes tips for working with larger data sets such as streaming files and using the undocumented ```$gzip``` parameter.
- Example quick links:
  * [JSON](#json)
  * [JSONA](#jsona)
  * [JSONL](#jsonl)
  * [GeoJSON](#geojson) (GeoJSON is focused on geographical data and will use a different dataset)
  * [CSV](#csv)
  * [Parquet](#parquet)
  * [Format Selection Guide](#whatformat)
  * [Working with Large Data](#largedata)
  * [Gzip](#gzip)

## Overview
In the previous tutorials we demonstrated the basic use of the OpenFEMA API and its available parameters. As seen on the dataset web pages, the "Full Data" links offer complete files in several formats. The ```$format``` parameter can also be used against the API to specify the return format of queried data. Each format has its own advantages and disadvantages.

The purpose of this notebook is to discuss the attributes of the different file formats and demonstrate their use. It will not cover the formats in great detail. The examples presented will return fewer than 10,000 records so as not to complicate the examples with "paging" issues. See the (tutorial name for paging).

## Formats Offered
By default, the API returns data in a JSON format. Python (and other programming languages) have an easy time manipulating JSON data, which is human readable as well. Any other specified format requires processing on the OpenFEMA servers to convert the JSON data to the desired format. For small datasets, this poses no issue. For very large datasets, the conversion may significantly slow the retrieval process. Some formats are more verbose than others, increasing the size of a data download. Supported ```$format``` values are:

 - json - Returns data in the JavaScript Object Notation format (default).
 - jsona - Returns data as a JavaScript Object Notation array format. There is no top-level object and each object is separated by a comma. The metadata object is automatically suppressed if this format is chosen.
 - jsonl - Also known as a json lines file. Returns one json object (for OpenFEMA datasets, generally a record) per line with a line feed as a delimiter. In other words, each line is a valid json object. This is a good format for streaming large amounts of data.
 - geojson - Returns data in a special json format designed to represent geospatial features. This format is only available for datasets that support it, such as the FemaRegions endpoint.
 - csv - Returns data in a Comma-Separated Value format. The metadata object is automatically suppressed if this format is chosen.
 - parquet - A columnar data format that is more compact than json and csv. A good choice for use with programming languages such as R, Python, and Julia.
 
<div class="alert alert-block alert-info">
<strong>Note:</strong> In previous tutorials we have tried to limit use of external libraries/dependencies. Using the standard Python libraries simplifies matters by not having to install new modules. This tutorial requires external modules to demonstrate different file format uses, including: requests, folium, pandas, pyarrow, and fastparquet.
</div>

## How do I Determine What Formats are Offered for Each Dataset?
Not all formats are offered for all datasets. Some datasets are so large (NFIP Policies - 15GB+ CSV file) that a verbose format such as JSON will simply make the resulting file too large. Some datasets do not contain data that can be rendered in a specific format. The Disaster Declarations Summaries dataset for example, does not contain geospatial data - a GEOJSON format would not be useful.

File formats for full file downloads are listed in the "Full Data" list of the dataset web pages. This list comes from the "distribution" object of the metadata:

In [1]:
# declare a URL handling module
import requests
import json

# define URL for the metadata endpoint
baseUrl = "https://www.fema.gov/api/open/v1/DataSets"

# define query parameters - limiting to one dataset 
queryParameters = {
    '$select': 'distribution',
    '$filter': "name eq 'DisasterDeclarationsSummaries' and version eq 2",  # plain text - requests URL-encodes the value for us
    '$metadata': 'off'
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        data = response.json()
        print(json.dumps(data, indent=2))
    
except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

{
  "DataSets": [
    {
      "distribution": [
        {
          "format": "csv",
          "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv",
          "datasetSize": "small (10MB - 50MB)"
        },
        {
          "format": "json",
          "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.json",
          "datasetSize": "small (10MB - 50MB)"
        },
        {
          "format": "jsona",
          "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.jsona",
          "datasetSize": "small (10MB - 50MB)"
        },
        {
          "format": "jsonl",
          "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.jsonl",
          "datasetSize": "small (10MB - 50MB)"
        }
      ]
    }
  ]
}
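
The list of offered formats can be pulled out of this metadata response with a few lines of code. The following is a minimal sketch, assuming the response has the shape printed above; the ```available_formats``` helper and the trimmed ```sample``` dictionary are illustrative names, not part of the OpenFEMA API.

```python
def available_formats(metadata):
    """Return the format names listed in a DataSets metadata response."""
    formats = []
    for dataset in metadata.get("DataSets", []):
        # "distribution" may be missing or null for some datasets
        for entry in dataset.get("distribution") or []:
            formats.append(entry["format"])
    return formats


# sample shaped like the response printed above (trimmed to two entries)
sample = {
    "DataSets": [
        {
            "distribution": [
                {
                    "format": "csv",
                    "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv",
                    "datasetSize": "small (10MB - 50MB)",
                },
                {
                    "format": "jsonl",
                    "accessURL": "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.jsonl",
                    "datasetSize": "small (10MB - 50MB)",
                },
            ]
        }
    ]
}

print(available_formats(sample))
```

With a live response, the same helper can be called as ```available_formats(data)``` using the ```data``` variable from the cell above.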


## When Should I Use the API and ```$format``` Parameter Versus the Full File Links?
While you can retrieve all the data using the API by specifying the endpoint with the appropriate ```$format``` parameter, this forces the API to do a little extra work to process the format. If you need the full file, use the full file download link - especially if the file is very large (e.g., NFIP Policies). The extremely large files are delivered through a more efficient mechanism. Retrieving the large files through the API will take the longest time.

## Does the File Size Have Any Bearing on the File Format I Specify?
Yes! The JSON file formats are verbose. They will add size to your download. This is the reason that the very large datasets are only offered in CSV and Parquet formats. We will be tweaking the Parquet format in the future so that it will be smaller in size than the CSV files. We will also be offering gzip-compressed files.

As the native format for the OpenFEMA data is JSON, specifying another format adds work to the API back-end. For most datasets, the slightly reduced performance is negligible. For very large datasets, this may significantly add to the time it takes to return data.
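
To get a feel for why JSON is more verbose, the sketch below serializes the same three records locally as JSON and as CSV and compares the byte counts. This is only an illustration using the standard library; it does not reproduce the exact bytes the API returns, and real size differences depend on the dataset.

```python
import csv
import io
import json

# three records taken from the examples in this tutorial
records = [
    {"disasterNumber": 4026, "state": "NH", "declarationType": "DR",
     "declarationDate": "2011-09-03T00:00:00.000Z", "incidentType": "Hurricane",
     "declarationTitle": "TROPICAL STORM IRENE"},
    {"disasterNumber": 5464, "state": "RI", "declarationType": "FM",
     "declarationDate": "2023-04-14T00:00:00.000Z", "incidentType": "Fire",
     "declarationTitle": "QUEENS RIVER FIRE"},
    {"disasterNumber": 5463, "state": "KS", "declarationType": "FM",
     "declarationDate": "2023-04-13T00:00:00.000Z", "incidentType": "Fire",
     "declarationTitle": "HADDAM FIRE"},
]

# JSON repeats every field name in every record
json_bytes = json.dumps(
    {"DisasterDeclarationsSummaries": records}, separators=(",", ":")
).encode("utf-8")

# CSV states the field names once, in the header row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_bytes = buffer.getvalue().encode("utf-8")

print(f"JSON: {len(json_bytes)} bytes, CSV: {len(csv_bytes)} bytes")
```

The gap grows with the number of records, because the JSON field names are repeated for each one while the CSV header is written only once.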

## Format Comparison
The following examples will retrieve the same data in different formats. Only a few records and fields will be returned to make it visually easier to see the differences.

### <a id="json"></a>JSON
JSON stands for JavaScript Object Notation and is a JavaScript native way of storing data. While it is verbose, JSON is "self-describing" and fairly easy to understand. It is the native format for OpenFEMA data delivery. OpenFEMA supports two variants of the basic JSON format and one that is tailored for geospatial data.

<div class="alert alert-block alert-info">
<strong>Note:</strong> None of these examples include writing or saving the data locally. Working examples of that can be found in <a href='https://github.com/FEMA/openfema-samples/tree/master/code-samples'>code-samples</a>
</div>

In [6]:
# A simple JSON data request and the resulting output
import requests
import json

# define URL for the dataset endpoint
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# Define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',   
    '$top': 3,
    '$format': 'json' # when requesting json, this parameter is optional since json is the default
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        print("JSON (raw - not printed in pretty format)")
        print(response.json())
        print("\nJSON (pretty format)")
        print(json.dumps(response.json(), indent=2))
    
except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

JSON (raw - not printed in pretty format)
{'DisasterDeclarationsSummaries': [{'disasterNumber': 4026, 'state': 'NH', 'declarationType': 'DR', 'declarationDate': '2011-09-03T00:00:00.000Z', 'incidentType': 'Hurricane', 'declarationTitle': 'TROPICAL STORM IRENE'}, {'disasterNumber': 5464, 'state': 'RI', 'declarationType': 'FM', 'declarationDate': '2023-04-14T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'QUEENS RIVER FIRE'}, {'disasterNumber': 5463, 'state': 'KS', 'declarationType': 'FM', 'declarationDate': '2023-04-13T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'HADDAM FIRE'}]}

JSON (pretty format)
{
  "DisasterDeclarationsSummaries": [
    {
      "disasterNumber": 4026,
      "state": "NH",
      "declarationType": "DR",
      "declarationDate": "2011-09-03T00:00:00.000Z",
      "incidentType": "Hurricane",
      "declarationTitle": "TROPICAL STORM IRENE"
    },
    {
      "disasterNumber": 5464,
      "state": "RI",
      "declarationType": "FM",
     

### <a id="jsona"></a> JSON Array (JSONA)
The JSON Array (JSONA) format differs slightly from the default JSON object. It does not have the outer "DisasterDeclarationsSummaries" object - it is just a comma-separated list of JSON records enclosed in square brackets. Technically, the top-level value is a JSON array rather than a JSON object, and each item in the array is a valid JSON object. Parsers such as the ```json()``` method used below read it directly into a list. A tool that insists on a top-level object will accept the data once the array is wrapped in one.

Since JSONA files do not have an outer object and are instead just a list of valid JSON objects, they can be easier to iterate through.

In [8]:
# Basic JSONA Example
import requests
import json

# define URL for the dataset endpoint
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',
    '$top': 3,
    '$format': 'jsona' # note that the format here is jsona instead of the default json
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        jsonData = response.json()
        print("JSONA (raw - not printed in pretty format).")
        print(jsonData)
        print("\nJSONA (pretty format)")
        print(json.dumps(jsonData, indent=2))
    
except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

JSONA (raw - not printed in pretty format).
[{'disasterNumber': 4026, 'state': 'NH', 'declarationType': 'DR', 'declarationDate': '2011-09-03T00:00:00.000Z', 'incidentType': 'Hurricane', 'declarationTitle': 'TROPICAL STORM IRENE'}, {'disasterNumber': 5464, 'state': 'RI', 'declarationType': 'FM', 'declarationDate': '2023-04-14T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'QUEENS RIVER FIRE'}, {'disasterNumber': 5463, 'state': 'KS', 'declarationType': 'FM', 'declarationDate': '2023-04-13T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'HADDAM FIRE'}]

JSONA (pretty format)
[
  {
    "disasterNumber": 4026,
    "state": "NH",
    "declarationType": "DR",
    "declarationDate": "2011-09-03T00:00:00.000Z",
    "incidentType": "Hurricane",
    "declarationTitle": "TROPICAL STORM IRENE"
  },
  {
    "disasterNumber": 5464,
    "state": "RI",
    "declarationType": "FM",
    "declarationDate": "2023-04-14T00:00:00.000Z",
    "incidentType": "Fire",
    "declarationTitl

#### Iterating Through JSONA
Iterating through the returned array or accessing individual elements is very easy:

In [9]:
# iterate through the array of objects
for disaster in jsonData:
    print(disaster)
    
# print only the disaster number
for disaster in jsonData:
    print(disaster['disasterNumber'])

{'disasterNumber': 4026, 'state': 'NH', 'declarationType': 'DR', 'declarationDate': '2011-09-03T00:00:00.000Z', 'incidentType': 'Hurricane', 'declarationTitle': 'TROPICAL STORM IRENE'}
{'disasterNumber': 5464, 'state': 'RI', 'declarationType': 'FM', 'declarationDate': '2023-04-14T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'QUEENS RIVER FIRE'}
{'disasterNumber': 5463, 'state': 'KS', 'declarationType': 'FM', 'declarationDate': '2023-04-13T00:00:00.000Z', 'incidentType': 'Fire', 'declarationTitle': 'HADDAM FIRE'}
4026
5464
5463


#### Streaming JSONA
Sometimes the data to be returned is too large for the available computer memory. JSON data can be streamed and handled in chunks. While splitting the data into consumable chunks requires less memory at any point in time, the "chunks" may not correspond to full record boundaries. See the output of the code below and notice where each chunk begins and ends. We have separated each chunk with a line feed for clarity.

From the requests documentation, "The chunk_size is the number of bytes that should be read into memory. This is not necessarily the length of each item returned as decoding can take place." This is fine if the data is saved to a file. However, if you wanted to act on each chunk, it would be necessary to ensure you have captured a full JSON record.

The requests library does have an ```iter_lines()``` method, but a JSONA file does not terminate each record with a line feed, so this will not work.

In [20]:
# Streaming Data/JSON Buffer Example
import requests

baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$top': 5,
    '$metadata': 'off',
    '$format': 'jsona'
} 

try:
    # issue the api call as a stream
    with requests.get(baseUrl, params=queryParameters, stream=True) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status() 
        
        print('JSONA Raw output - each line is one data "chunk"')
        for chunk in response.iter_content(chunk_size=64): 
            print('\n')
            print(chunk)
except requests.exceptions.RequestException as err:
    # any additional logic to do when a request fails, such as trying again goes here
    print(f'file could not be downloaded: {err}')


JSONA Raw output - each line is one data "chunk"


b'[{"disasterNumber":4026,"state":"NH","declarationType":"DR","declarationDate":"2011-09-03T00:00:00.000Z","inci'


b'dentType":"Hurricane","declarationTitle":"TROPICAL STORM IRENE"},{"disasterNumber":5464,"state":"RI","declarationType":"FM","declarationDate":"2023-04-14T00:00:00.000Z","incidentType":"Fire","declarationTitle":"QUEENS R'


b'IVER FIRE"},{"disasterNumber":5463,"state":"KS","declarationType":"FM","declarationDate":"2023-04-13T00:00:00.000Z","incidentType":"Fire","declarationTitle":"HADDAM FIRE"},{"disasterNumber":4026,"state":"NH","declarationType":"DR","declarationDate":"2011-09-03T00:00:00.000Z","incidentType":"Hurricane","declarationTitle":"TROPICAL STORM IRENE"},{"disasterNumber":4731,"state":"CO","declarationType":"DR","declarationDate":"2023-08-25T00:00:00.000Z","incidentType":"Flood","declarationTitle":"SEVERE STORMS, FLOODING, A'


b'ND TORNADOES"}]'
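
If each record needs to be acted on as it arrives, one option is to buffer the chunks and pull out complete records as soon as they can be parsed. The following is a rough sketch using only the standard library. The ```iter_json_array_records``` function is a hypothetical helper, and it assumes every record in the array is a JSON object, as is the case for the OpenFEMA data shown here. It re-parses the buffer on each attempt, so it favors simplicity over speed.

```python
import codecs
import json


def iter_json_array_records(chunks):
    """Yield complete records from byte chunks that together form one JSON array."""
    decoder = json.JSONDecoder()
    # incremental decoding copes with multi-byte characters split across chunks
    utf8 = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += utf8.decode(chunk)
        while True:
            # drop array punctuation and whitespace that precede the next record
            buffer = buffer.lstrip("[, \r\n\t")
            if not buffer or buffer.startswith("]"):
                break
            try:
                record, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # the record is incomplete - wait for the next chunk
                break
            yield record
            buffer = buffer[end:]


# simulated chunks that split records at arbitrary positions
sample_chunks = [
    b'[{"disasterNumber":4026,"sta',
    b'te":"NH"},{"disasterNumber":54',
    b'64,"state":"RI"}]',
]

for record in iter_json_array_records(sample_chunks):
    print(record["disasterNumber"], record["state"])
```

With a live request, the generator could be fed from ```response.iter_content(chunk_size=8192)``` inside the ```with``` block shown above.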


### <a id="jsonl"></a> JSON Lines (JSONL)
JSON Lines, or JSONL, is similar to JSONA in that each record is a JSON object. Instead of being enclosed in an array and delimited by commas, the records are delimited by line feeds. A JSONL file as a whole cannot be parsed as a single JSON object, but each line is valid JSON. This can be an easier format to iterate through, as many file operations and streams can be accessed line by line.

Let's view a JSONL response and see how it differs from JSON and JSONA. 

In [25]:
# Basic JSONL Example
import requests
import json

# define URL for the dataset endpoint
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',
    '$top': 3, 
    '$format': 'jsonl' # note that the format here is jsonl instead of the default json
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        jsonlData = response.content
        print("JSONL (raw - not printed in pretty format).")
        print(jsonlData)

except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

JSONL (raw - not printed in pretty format).
b'{"disasterNumber":4026,"state":"NH","declarationType":"DR","declarationDate":"2011-09-03T00:00:00.000Z","incidentType":"Hurricane","declarationTitle":"TROPICAL STORM IRENE"}\n{"disasterNumber":5464,"state":"RI","declarationType":"FM","declarationDate":"2023-04-14T00:00:00.000Z","incidentType":"Fire","declarationTitle":"QUEENS RIVER FIRE"}\n{"disasterNumber":5463,"state":"KS","declarationType":"FM","declarationDate":"2023-04-13T00:00:00.000Z","incidentType":"Fire","declarationTitle":"HADDAM FIRE"}'


<div class="alert alert-block alert-info">
<strong>Note:</strong> The requests call specifies "content" instead of "json()" as its return type. JSONL will not be recognized as valid JSON by the "json()" method. The content returned is in bytes. Each entry in the returned byte string is delimited by a line break "\n".
</div>

We can parse the resulting content and iterate through JSONL as follows:

In [26]:
# this will split each line into separate byte strings - each representing a record
jsonlist = jsonlData.splitlines()

# iterate - turn each line into a json object/dictionary (save if you want), and display one element
for json_str in jsonlist:
    jsonData = json.loads(json_str.decode('utf-8'))
    print(jsonData['disasterNumber'])


4026
5464
5463


The defining characteristic of the JSONL format is the line feed terminator, which makes it very useful for streaming data. Although the ```iter_lines()``` method still retrieves data in chunks of the given size, it hands the data back one line at a time. The benefit is the ability to process large files that may exceed available memory. In addition, if the data stream is interrupted or terminated early, there is no need to worry about partial or corrupt records among the lines already processed.

In [29]:
# Simple Streaming Data/JSONL Buffer Example
import requests

baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',
    '$top': 5, 
    '$format': 'jsonl' 
} 

try:
    # issue the api call as a stream
    with requests.get(baseUrl, params=queryParameters, stream=True) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status() 
        
        print('JSONL Raw output, Each line is one record')
        
        # the chunk_size can be adjusted according to available system memory
        for line in response.iter_lines(chunk_size=10240): 
            print(line)
except requests.exceptions.RequestException as err:
    # any additional logic to do when a request fails, such as trying again goes here
    print(f'file could not be downloaded: {err}')


JSONL Raw output, Each line is one record
b'{"disasterNumber":5530,"state":"NV","declarationType":"FM","declarationDate":"2024-08-12T00:00:00.000Z","incidentType":"Fire","declarationTitle":"GOLD RANCH FIRE"}'
b'{"disasterNumber":5529,"state":"OR","declarationType":"FM","declarationDate":"2024-08-09T00:00:00.000Z","incidentType":"Fire","declarationTitle":"LEE FALLS FIRE"}'
b'{"disasterNumber":5528,"state":"OR","declarationType":"FM","declarationDate":"2024-08-06T00:00:00.000Z","incidentType":"Fire","declarationTitle":"ELK LANE FIRE"}'
b'{"disasterNumber":5527,"state":"OR","declarationType":"FM","declarationDate":"2024-08-02T00:00:00.000Z","incidentType":"Fire","declarationTitle":"MILE MARKER 132 FIRE"}'
b'{"disasterNumber":5526,"state":"CO","declarationType":"FM","declarationDate":"2024-08-01T00:00:00.000Z","incidentType":"Fire","declarationTitle":"QUARRY FIRE"}'
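
Because every streamed line is a complete record, each one can be parsed and processed immediately without holding the whole response in memory. The sketch below wraps that idea in a small generator and tallies records by state. The ```iter_jsonl_records``` helper is an illustrative name, and the sample lines are trimmed versions of the output above.

```python
import json
from collections import Counter


def iter_jsonl_records(lines):
    """Parse an iterable of JSONL lines (bytes or str) into dictionaries."""
    for line in lines:
        # skip empty lines, which can appear in a streamed response
        if not line.strip():
            continue
        yield json.loads(line)


# sample lines shaped like the streamed output above (fields trimmed)
sample_lines = [
    b'{"disasterNumber":5530,"state":"NV","incidentType":"Fire"}',
    b'',
    b'{"disasterNumber":5529,"state":"OR","incidentType":"Fire"}',
    b'{"disasterNumber":5528,"state":"OR","incidentType":"Fire"}',
]

counts = Counter(record["state"] for record in iter_jsonl_records(sample_lines))
print(counts)
```

With a live request, ```sample_lines``` would be replaced by ```response.iter_lines(chunk_size=10240)``` inside the ```with``` block.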


<div class="alert alert-block alert-info">
<strong>Tip:</strong> While it is fairly easy to work with JSONL using the standard Python libraries, a module called "jsonlines" does exist to simplify these tasks.
</div>

### <a id="geojson"></a> GeoJSON
GeoJSON is a standardized way of storing a variety of types of geospatial data in a JSON format. See <a href="https://datatracker.ietf.org/doc/html/rfc7946">this reference to the GeoJSON standard</a>. Only a few OpenFEMA datasets currently support GeoJSON - those that contain geospatial data.

The following example retrieves one FEMA region in a GeoJSON format. Note that it is expressed in a JSON format. If you were to change the format to JSON, the result would look similar, with subtle differences. The GeoJSON file contains elements and element names as defined in the standard: FeatureCollection, features, geometry, Polygon, etc. The plain JSON will conform to the OpenFEMA data dictionary.

In [1]:
# GeoJSON example
import requests

baseUrl = "https://www.fema.gov/api/open/v2/FemaRegions"

# define query parameters - returning only 1 region to make example more simple
queryParameters = {
    '$metadata': 'off',
    '$top': 1, 
    '$format': 'geojson' # <--NOTE, we have specified the geojson format
} 

try:
    # issue the api call as a stream
    with requests.get(baseUrl, params=queryParameters, stream=True) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status() 
               
        geojsonData = response.content.decode('utf-8')
        print("GeoJSON (raw, but decoded).")
        print(geojsonData)
except requests.exceptions.RequestException as err:
    # any additional logic to do when a request fails, such as trying again goes here
    print(f'file could not be downloaded: {err}')

GeoJSON (raw, but decoded).
{"type": "FeatureCollection", "features": [{"type":"Feature","properties":{"name":"FEMA Regional 1 Headquarters","region":1,"address":"99 High Street","city":"Boston","state":"MA","zipCode":"2110","states":["ME","NH","VT","MA","CT","RI"],"lastRefresh":"2022-07-17T16:47:16.612Z","hash":"036bfec29c153b95636382264da9b6a31441fff5","id":"32dea6cc-ae94-4384-9ef5-1bd80d7fcca9"},"geometry":{"type":"GeometryCollection","geometries":[{"type":"Point","coordinates":[-71.054648822,42.354478452]},{"type":"Polygon","coordinates":[[[-69.204384,47.452389],[-69.20495,47.452824],[-69.205168,47.453439],[-69.205399,47.453663],[-69.211467,47.454758],[-69.212345,47.454785],[-69.213553,47.45543],[-69.215369,47.456536],[-69.216942,47.456712],[-69.219052,47.456933],[-69.219996,47.457161],[-69.22148,47.458011],[-69.22374,47.459298],[-69.22442,47.459687],[-69.231845,47.452523],[-69.23672,47.447819],[-69.243114,47.441653],[-69.249506,47.435487],[-69.261722,47.423543],[-69.273939,47.4115
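
Since GeoJSON is still JSON, the standard ```json``` module is enough to inspect its structure. The following sketch lists each feature's name and geometry types. The ```summarize_features``` helper is an illustrative name, and the sample is a hand-shortened stand-in for the response above: its polygon holds only a few points and is not the real region boundary.

```python
import json


def summarize_features(geojson_text):
    """Return (name, [geometry types]) for each feature in a FeatureCollection."""
    collection = json.loads(geojson_text)
    summary = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "GeometryCollection":
            # a GeometryCollection nests several geometries
            types = [g["type"] for g in geometry.get("geometries", [])]
        elif geometry:
            types = [geometry["type"]]
        else:
            types = []
        name = (feature.get("properties") or {}).get("name")
        summary.append((name, types))
    return summary


# shortened sample - the polygon is NOT the real region boundary
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "FEMA Regional 1 Headquarters", "region": 1},
        "geometry": {
            "type": "GeometryCollection",
            "geometries": [
                {"type": "Point", "coordinates": [-71.054648822, 42.354478452]},
                {"type": "Polygon", "coordinates": [[
                    [-69.204384, 47.452389], [-69.20495, 47.452824],
                    [-69.205168, 47.453439], [-69.204384, 47.452389],
                ]]},
            ],
        },
    }],
})

print(summarize_features(sample_geojson))
```

The same function can be applied to the ```geojsonData``` string from the cell above.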

The above example uses no special libraries. Normally, you would download the file and use some tool to visualize the geospatial data. We can, however, display the data within Python. There are a number of libraries that can help manipulate and display geospatial data including:
* Folium (folium is an interface to leaflet.js) for displaying a map
* GeoPandas for analysis of GeoJSON data

The following example takes the GeoJSON data we downloaded above and displays it on a map. Read the folium documentation for details on enhancing the map output with various labels, keys, colors, etc.

<div class="alert alert-block alert-warning">
<strong>Warning:</strong> If you are trying this code in a code editor instead of a Jupyter notebook, you will need to save the resulting html output to a file and view that file in a browser.
</div>
<div class="alert alert-block alert-info">
<strong>Note:</strong> If you would like to see an example of how to create a GeoJSON file, see the FEMA_Region_GeoJson.ipynb notebook found in the <a href='https://github.com/FEMA/openfema-samples/tree/master/analysis-examples'>OpenFEMA GitHub analysis-examples folder</a>.
</div>

In [6]:
# Displaying GeoJSON data with Folium.
import folium

# if you saved the above GeoJSON as a file, you would read it as follows in folium
#folium.GeoJson(open('myfile.geojson').read()).add_to(my_map)

# display map by providing lat, long center point - in this case, center of USA
my_map = folium.Map(location=[43.0,-100.0],zoom_start=4)

folium.GeoJson(geojsonData).add_to(my_map)
my_map

### <a id="csv"></a>CSV
CSV stands for Comma Separated Values and is a very common way of storing tabular data. Each row represents one record, with the first row commonly holding the column headers. CSV files are less verbose than JSON, making them better suited for storing large amounts of data. CSV is such a common format that almost every database and analytics tool is capable of reading and manipulating it.

In [36]:
# A simple CSV data request and the resulting output
import requests
import csv

# define URL for the dataset endpoint
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',
    '$top': 3, 
    '$format': 'csv' # note that we specify the desired format as csv
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        csvData = response.content
        print("CSV (raw - not printed in pretty format)")
        print(csvData)
    
except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

CSV (raw - not printed in pretty format)
b'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle\n5530,NV,FM,2024-08-12T00:00:00.000Z,Fire,GOLD RANCH FIRE\n5529,OR,FM,2024-08-09T00:00:00.000Z,Fire,LEE FALLS FIRE\n5528,OR,FM,2024-08-06T00:00:00.000Z,Fire,ELK LANE FIRE\n'


Iterating through a CSV file is straightforward. Since we saved the above in a variable and not a file, we do not have to first open a CSV file, but we do have to decode it. Be mindful of the amount of data you will download so you do not run out of memory.

While the initial decoded data below looks like separate rows, it is still one big string. Next, split the data into lines/rows, indicate the field delimiter and now you have a more usable list that can be iterated over.

In [39]:
csvDataDecoded = csvData.decode('utf-8')
print(csvDataDecoded)

# split the data into lines, convert to list, and print each row
lstCsvData = list(csv.reader(csvDataDecoded.splitlines(), delimiter=','))
for row in lstCsvData:
    print(row)

disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle
5530,NV,FM,2024-08-12T00:00:00.000Z,Fire,GOLD RANCH FIRE
5529,OR,FM,2024-08-09T00:00:00.000Z,Fire,LEE FALLS FIRE
5528,OR,FM,2024-08-06T00:00:00.000Z,Fire,ELK LANE FIRE

['disasterNumber', 'state', 'declarationType', 'declarationDate', 'incidentType', 'declarationTitle']
['5530', 'NV', 'FM', '2024-08-12T00:00:00.000Z', 'Fire', 'GOLD RANCH FIRE']
['5529', 'OR', 'FM', '2024-08-09T00:00:00.000Z', 'Fire', 'LEE FALLS FIRE']
['5528', 'OR', 'FM', '2024-08-06T00:00:00.000Z', 'Fire', 'ELK LANE FIRE']
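
As an alternative to plain lists, the standard library's ```csv.DictReader``` uses the header row to return each record as a dictionary keyed by field name. This is a small sketch using the same three rows shown above. Note that every CSV value arrives as a string, so numeric fields must be converted explicitly.

```python
import csv
import io

# the decoded CSV text from the example above
csv_text = (
    "disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle\n"
    "5530,NV,FM,2024-08-12T00:00:00.000Z,Fire,GOLD RANCH FIRE\n"
    "5529,OR,FM,2024-08-09T00:00:00.000Z,Fire,LEE FALLS FIRE\n"
    "5528,OR,FM,2024-08-06T00:00:00.000Z,Fire,ELK LANE FIRE\n"
)

# DictReader takes the field names from the first row
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV has no data types - convert strings to numbers where needed
disaster_numbers = [int(row["disasterNumber"]) for row in rows]

for row in rows:
    print(row["disasterNumber"], row["state"], row["declarationTitle"])
```

With the data from the cell above, ```io.StringIO(csvDataDecoded)``` can be passed to ```csv.DictReader``` in the same way.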


### <a id="parquet"></a> Parquet
Unlike CSV or JSON, Parquet is a column-based file format instead of a row-based format. The format was designed for efficient data storage and retrieval. It works well for complex data in large volumes, achieving smaller file sizes and faster queries/analysis.

Be aware, however, that the Parquet format is more memory intensive, sometimes requiring 5 times more memory than other formats. Also, it is not human-readable. To work with Parquet files in Python, it is necessary to utilize a module or library tailored to this format. The following libraries provide Parquet support (not an exhaustive list): Pandas, fastparquet, PyArrow, dask, and DuckDB.
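
The difference between a row-based and a column-based layout can be pictured with plain Python structures. The sketch below is purely conceptual: it is not how Parquet encodes data on disk (the real format adds row groups, encodings, and compression), but it shows why reading a single field across all records is cheap when values are stored by column.

```python
# row-based layout: one dictionary per record (as in JSON or CSV)
rows = [
    {"disasterNumber": 4026, "state": "NH", "incidentType": "Hurricane"},
    {"disasterNumber": 5464, "state": "RI", "incidentType": "Fire"},
    {"disasterNumber": 5463, "state": "KS", "incidentType": "Fire"},
]

# column-based layout: one list of values per field
columns = {name: [row[name] for row in rows] for name in rows[0]}

# reading one field touches a single list instead of every record
states = columns["state"]
print(states)
```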

<div class="alert alert-block alert-info">
<strong>Note:</strong> It is out of the scope of this document to delve into the details of Parquet files. If you want to know more about the actual format, please see the following links: <a href="https://parquet.apache.org/docs/">Apache Parquet Documentation</a>, <a href="https://data-mozart.com/parquet-file-format-everything-you-need-to-know/">Parquet File Format - Everything you need to know!</a>.
</div>

Because Parquet is a binary format, the raw response is not meaningful to the eye. The following example pulls some OpenFEMA data in a Parquet format, saves it in a variable, and displays the raw data.

<div class="alert alert-block alert-info">
<strong>Tip:</strong> If you will be working with Parquet files, it is strongly recommended you download and save the data to a file. Then you can open it locally and perform your analysis.
</div>

In [7]:
# Parquet file example
import requests

# define URL for the dataset endpoint
baseUrl = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"

# define query parameters - limit fields and records returned to make example more simple
queryParameters = {
    '$select': 'disasterNumber,state,declarationType,declarationDate,incidentType,declarationTitle',
    '$metadata': 'off',
    '$top': 1, 
    '$format': 'parquet' # note that we specify the desired format as parquet
} 

try:
    with requests.get(baseUrl, params=queryParameters) as response:
        # raises an exception when there is a server or network error
        response.raise_for_status()
        parquetData = response.content
        print("Parquet (raw format - first 100 bytes)")
        print(parquetData[:100])
    
except requests.exceptions.RequestException as err:
    # here is where you would add any logic for if the request fails
    print(f'file could not be downloaded: {err}')

Parquet (raw format - first 100 bytes)
b'PAR1\x15\x00\x15\x08\x15\x0c,\x15\x02\x15\x00\x15\x06\x15\x06\x1c\x18\x04\x9a\x15\x00\x00\x18\x04\x9a\x15\x00\x00\x16\x00\x16\x02\x18\x04\x9a\x15\x00\x00\x18\x04\x9a\x15\x00\x00\x00\x00\x00\x04\x0c\x9a\x15\x00\x00\x15\x02\x19%\x06\x00\x19\x18\x0edisasterNumber\x15\x02\x16\x02\x16j\x16j&\x08<\x18\x04\x9a\x15\x00\x00\x18\x04\x9a'
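
Notice that the raw bytes begin with ```PAR1```. Unencrypted Parquet files start and end with this 4-byte marker, which allows a quick sanity check before handing the data to a Parquet library. The ```looks_like_parquet``` helper below is a hypothetical convenience function, not part of any library, and it does not validate the file contents.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(data: bytes) -> bool:
    """Cheap check that a byte string has Parquet's leading and trailing marker."""
    # smallest possible layout: 4-byte marker + 4-byte footer length + 4-byte marker
    return (
        len(data) >= 12
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )


print(looks_like_parquet(b"PAR1" + b"\x00" * 8 + b"PAR1"))
print(looks_like_parquet(b"disasterNumber,state\n5530,NV\n"))
```

Calling ```looks_like_parquet(parquetData)``` on the response from the cell above is one way to catch the case where something other than Parquet data came back.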


#### The Parquet Schema
One of the advantages of Parquet is that it includes a schema that defines the data elements contained in the file. The following example will save the full DisasterDeclarationsSummaries file in a Parquet format and then access the schema without reading the entire file into memory. 

Rather than use the Requests library as shown in earlier examples to capture data in chunks and write to a file, we will use the built-in urllib library to retrieve the full file.

In [53]:
from urllib.request import urlretrieve

# get the full dataset as a parquet file and save to local directory
urlretrieve("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.parquet","ddsv2.parquet")

# read just schema, not entire file into memory
import pyarrow.parquet as pq

schema = pq.read_schema("ddsv2.parquet")
print(schema)

id: string not null
disasterNumber: int16 not null
state: string
femaDeclarationString: string
declarationType: string
declarationDate: date32[day]
fyDeclared: int16 not null
incidentType: string
declarationTitle: string
ihProgramDeclared: bool
iaProgramDeclared: bool
paProgramDeclared: bool
hmProgramDeclared: bool
incidentBeginDate: date32[day]
incidentEndDate: date32[day]
disasterCloseoutDate: date32[day]
tribalRequest: bool
fipsStateCode: string
fipsCountyCode: string
placeCode: string not null
designatedArea: string
declarationRequestNumber: string
lastIAFilingDate: date32[day]
incidentId: string
region: int16
designatedIncidentTypes: string
lastRefresh: timestamp[ms]
hash: string


The same can be done using the Pandas library. Note that additional information such as the memory used is displayed.

In [57]:
import pandas as pd

df = pd.read_parquet("ddsv2.parquet")

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66138 entries, 0 to 66137
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   id                        66138 non-null  object        
 1   disasterNumber            66138 non-null  int16         
 2   state                     66138 non-null  object        
 3   femaDeclarationString     66138 non-null  object        
 4   declarationType           66138 non-null  object        
 5   declarationDate           66138 non-null  object        
 6   fyDeclared                66138 non-null  int16         
 7   incidentType              66138 non-null  object        
 8   declarationTitle          66138 non-null  object        
 9   ihProgramDeclared         66138 non-null  bool          
 10  iaProgramDeclared         66138 non-null  bool          
 11  paProgramDeclared         66138 non-null  bool          
 12  hmProgramDeclared 

#### Analysis Example
After importing the Parquet file into a dataframe, you can perform whatever analysis you want. The following example takes the dataframe and counts the number of records by state.

In [58]:
# count records by state
df["state"].value_counts()

TX    5350
KY    2762
MO    2750
OK    2543
FL    2543
VA    2524
LA    2503
GA    2365
NC    2170
PR    2078
MS    1945
IA    1922
KS    1825
TN    1701
AL    1675
CA    1673
AR    1624
MN    1621
NY    1521
NE    1517
SD    1466
IN    1464
ND    1393
IL    1306
OH    1292
WV    1266
PA    1239
ME    1065
SC    1039
WA    1013
WI     892
MI     805
CO     664
MT     640
NJ     626
OR     618
NM     530
MD     447
MA     414
VT     401
ID     369
AZ     342
AK     333
NH     320
NV     284
CT     261
UT     255
WY     132
RI     123
HI     108
VI      84
MP      76
AS      76
DE      53
MH      53
FM      31
DC      23
GU      22
PW       1
Name: state, dtype: int64

## <a id="whatformat"></a> So What Format Should I Use?
What file format you choose is largely dependent on your individual use case. Most users will find JSON works well. It is a common format widely used to transfer information on the web and many tools and languages support the format. However, there are instances where other formats are more appropriate. When choosing a file format, ask the following questions:
* **Do my tools support it?** If you are using a spreadsheet to review data, CSV is undoubtedly the best format. The structure of the other offered formats is not tabular in nature.

* **Am I working with large amounts of data and/or large file sizes?** If so, you probably want to use a format more optimized for that. CSV and Parquet both offer smaller file sizes that can lead to faster downloads. You should also consider streaming the data instead of storing it all in memory. JSONL is particularly useful when saving data from a stream as it will store full records instead of equal byte chunks.

* **What do I plan to do with the data?** A simple summary report? A detailed statistical analysis of the data? Mapping data? Importing into a database? Most programming languages have either built-in methods or external libraries that work with all of the formats mentioned. Parquet is fast to read into a dataframe for analysis. CSV is probably the most common format and is ideal when compatibility is important. JavaScript is particularly suited for working with JSON files as it is a native format. GeoJSON works well for mapping and geospatial analysis.
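
To illustrate the JSONL point above: because every line is a complete JSON record, a stream can be processed one record at a time using only the standard library. This is a sketch. The `iter_jsonl` helper and the sample lines are illustrative, and in practice the lines would come from a file or an HTTP response, not an in-memory string.

```python
import io
import json
from collections import Counter


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSONL source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Illustrative stand-in for a JSONL file or stream (not real API output)
sample = io.StringIO(
    '{"disasterNumber": 1, "state": "TX"}\n'
    '{"disasterNumber": 2, "state": "NV"}\n'
    '{"disasterNumber": 3, "state": "TX"}\n'
)

# Only one record is parsed and held at a time while counting
counts = Counter(record["state"] for record in iter_jsonl(sample))
print(counts.most_common())
```

The same generator works on an open file handle, since iterating a text file also yields one line at a time.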

### Format Quick Reference 

| Format        | File Size     | Best For   | Downsides  |
| :------------- | :------------- | :------------- | :------------- | 
| JSON | Large | General use  | Verbose  |
| JSONA | Large | General use/Easy to iterate  | Verbose, file is not JSON - each object in array is  |
| JSONL | Large | Great for streaming/Easy to iterate  | Verbose, file is not JSON - each line is  |
| GeoJSON | N/A | Geospatial data  | Verbose, different tools work differently  |
| CSV | Medium | Data analytics, larger datasets, tabular data  | No schema  |
| Parquet | Small | Data analytics, large datasets, defined schema  | Memory intensive  |
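
Part of the size difference comes from how each format handles field names: JSON repeats every field name in every record, while CSV states them once in a header row. The following is a rough, standard-library illustration using made-up records. Actual sizes depend on the dataset and on any compression applied in transit.

```python
import csv
import io
import json

# Illustrative records only; field names follow DisasterDeclarationsSummaries
records = [
    {"disasterNumber": n, "state": "TX", "incidentType": "Flood"}
    for n in range(1000, 1100)
]

# JSON: field names appear in every record
json_text = json.dumps(records)

# CSV: field names appear once, in the header row
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

print("JSON bytes:", len(json_text.encode("utf-8")))
print("CSV bytes: ", len(csv_text.encode("utf-8")))
```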

## <a id="largedata"></a> Working With Large Datasets
While most of OpenFEMA's datasets are reasonably sized and easy to work with, a few (e.g., NFIP Policies, IHP Valid Registrations) are big enough that download times can be excessive and working with the entire file in memory is no longer feasible. Even after choosing a file format more suited to larger data, there are tips that can simplify use.

<div class="alert alert-block alert-info">
<strong>Tip:</strong> For more information on working with large files, see: <a href="https://www.fema.gov/about/openfema/working-with-large-data-sets">OpenFEMA Guide to Working with Large Datasets</a>.
</div>

### Streaming Data vs Storing in Memory
The simplest way to retrieve API data is through an HTTP GET request where the entire payload is retrieved, stored in memory, and then either analyzed or saved to file. In most cases with OpenFEMA data, this will not cause problems as most systems have sufficient memory resources. This will not be the case with larger files.

Streaming data allows chunks of the file to be worked on or written to file without necessitating the entire file be stored in memory. While it is slightly more complicated to implement, most major programming languages have libraries to assist with streaming data. Typically, working with a data stream will have the following components:
* Some sort of flag in your GET request to indicate that the response will be a data stream
* A check to make sure that the request was successful and did not return an error
* A loop to iterate through chunks of the file
* A step to analyze, process, or write each chunk of data to a file. When working with streaming data, it is sometimes necessary to use a write method specific to streaming data.
* Stopping the data stream and writing the last chunk of the stream as applicable

Some of these steps are handled automatically depending on the libraries and packages used. You can see additional examples in <a href='https://github.com/FEMA/openfema-samples/blob/master/code-samples'>code-samples</a>. The examples that utilize streaming data include "stream" in the file name.
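
The components listed above can be sketched with the standard library alone. Here a `urllib` response (or any readable binary stream) is read in fixed-size chunks and written to disk. With the Requests library, the equivalent flag is `stream=True` and the loop would iterate over `response.iter_content(...)`. The helper name `stream_to_file` and the chunk size are assumptions for illustration.

```python
def stream_to_file(source, dest_path, chunk_size=64 * 1024):
    """Copy a readable binary stream to dest_path one chunk at a time.

    Returns the total number of bytes written.
    """
    total = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)  # at most chunk_size bytes in memory
            if not chunk:                    # empty bytes marks the end of the stream
                break
            out.write(chunk)
            total += len(chunk)
    return total


# Usage against the API (not executed here); urlopen raises HTTPError on 4xx/5xx:
#   from urllib.request import urlopen
#   url = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.parquet"
#   with urlopen(url) as response:
#       stream_to_file(response, "ddsv2.parquet")
```

Leaving the `with` blocks closes both the connection and the output file, which covers the final "stop the stream" step.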

### <a id="gzip"></a> Utilizing GZIP
There is also an undocumented option to download files in a compressed format. Use the parameter and value ```$gzip=true``` to compress the entire file prior to download. This can reduce download time for large files. A comparison of downloading a file with and without the gzip flag can be found below.


In [4]:
# Sample code showing the comparison between download times with and without gzip - full files
from urllib.request import urlretrieve
import time
import os

# get the full dataset as a json gzip file and save to local directory (timed)
startTime = time.time()
urlretrieve("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$format=json&$gzip=true","gziptest.json.gzip")
gzipDownloadTime = time.time() - startTime

# get the full dataset as a json file and save to local directory
jsonDownloadTime = 0
startTime = time.time()
urlretrieve("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$format=json","gziptest.json")
jsonDownloadTime = time.time() - startTime

gzipFileSize = os.path.getsize("gziptest.json.gzip")
jsonFileSize = os.path.getsize("gziptest.json")

print('gzip File Size: ' + str(gzipFileSize) + ' Download Time:', gzipDownloadTime)
print('json File Size: ' + str(jsonFileSize) + ' Download Time:', jsonDownloadTime)


gzip File Size: 91538 Download Time: 0.6189210414886475
json File Size: 845615 Download Time: 0.34485745429992676


<div class="alert alert-block alert-warning">
<strong>Warning:</strong> The gzip file took longer to download than the uncompressed file. What gives? In most cases, the FEMA content delivery network (CDN) will compress data when transferring it. Forcing the OpenFEMA API to compress the file first takes more time and the CDN is transferring the same amount of data. We only recommend using the $gzip flag when downloading large files where the CDN does not appear to be honoring compression before transmission.
</div>

### Working with Gzip Files
Gzip files will need to be uncompressed prior to use, but Python and most other languages have libraries that can easily perform this operation. See the following example.

In [13]:
# reading a gzip json file

import json
import gzip

# Opening and reading the gzip file
with gzip.open('gziptest.json.gzip','r') as fp:        
    data = fp.read()

# decode the bytes to a string and parse it as JSON
jsonData = json.loads(data.decode('utf-8'))

# print only 1 record
print(json.dumps(jsonData['DisasterDeclarationsSummaries'][0], indent=2))

{
  "femaDeclarationString": "FM-5530-NV",
  "disasterNumber": 5530,
  "state": "NV",
  "declarationType": "FM",
  "declarationDate": "2024-08-12T00:00:00.000Z",
  "fyDeclared": 2024,
  "incidentType": "Fire",
  "declarationTitle": "GOLD RANCH FIRE",
  "ihProgramDeclared": false,
  "iaProgramDeclared": false,
  "paProgramDeclared": true,
  "hmProgramDeclared": true,
  "incidentBeginDate": "2024-08-11T00:00:00.000Z",
  "incidentEndDate": null,
  "disasterCloseoutDate": null,
  "tribalRequest": false,
  "fipsStateCode": "32",
  "fipsCountyCode": "031",
  "placeCode": "99031",
  "designatedArea": "Washoe (County)",
  "declarationRequestNumber": "24123",
  "lastIAFilingDate": null,
  "incidentId": "2024081201",
  "region": 9,
  "designatedIncidentTypes": "R",
  "lastRefresh": "2024-08-27T18:22:14.800Z",
  "hash": "5d07e7c51bb300bfbec94a699a1e1ab1d61a97cd",
  "id": "f15a7a79-f1c3-41bb-8a5c-c05fbae34423"
}
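
The example above reads the whole decompressed payload into memory, which is fine for a file of this size. For larger downloads, a sketch like the following decompresses to disk in chunks instead. `gunzip_file` is a hypothetical helper name, and the stand-in gzip file is created locally so the block runs without a download.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(src_path, dest_path):
    """Decompress a gzip file to dest_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as compressed, open(dest_path, "wb") as out:
        shutil.copyfileobj(compressed, out)  # copies in fixed-size chunks


# Build a small stand-in gzip file (illustrative content, not real API output)
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "sample.json.gzip")
with gzip.open(gz_path, "wb") as f:
    f.write(b'{"DisasterDeclarationsSummaries": []}')

json_path = os.path.join(workdir, "sample.json")
gunzip_file(gz_path, json_path)
print(os.path.getsize(json_path), "bytes after decompression")
```

With the file downloaded earlier, the call would be `gunzip_file("gziptest.json.gzip", "gziptest_unzipped.json")`, where the output name is again only an example.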


## Where to Go Next
The next tutorial will illustrate how to join data from different OpenFEMA datasets as well as external, non-FEMA data.

## Other Resources
- [OpenFEMA Homepage](https://www.fema.gov/open)
- [OpenFEMA API Documentation](https://www.fema.gov/about/openfema/api)
- [OpenFEMA Code Samples on GitHub](https://github.com/FEMA/openfema-samples/tree/master/code-samples)
- [GeoJSON Standard](https://datatracker.ietf.org/doc/html/rfc7946)
- [JSONL (JSON Lines) Standard](http://jsonlines.org/)
- [Apache Parquet Documentation](https://parquet.apache.org/docs/)
- [Parquet File Format - Everything you need to know!](https://data-mozart.com/parquet-file-format-everything-you-need-to-know/)