# Mass Downloader for FDSN Compliant Web Services
This package contains functionality to query and integrate data from any number of FDSN web service providers simultaneously. The package aims to formulate download requests in a way that is convenient for seismologists without having to worry about political and technical data center issues. It can be used by itself or as a library component integrated into a bigger project.

## Why Would You Want to Use This?
Using the FDSN web services directly, for example via the obspy.clients.fdsn client, is fine for small amounts of data but quickly becomes cumbersome for larger data sets. Many data centers provide tools to download larger amounts of data easily, but usually only from that one data center. Most seismologists do not care much where their data originates; they want the data for their use case, and often as much of it as they can get. As the number of FDSN compliant web services grows, collecting it by hand takes more and more effort. That is where this module comes in. You

1. specify the geographical region from which to download data,
2. define a number of other restrictions (temporal, data quality, ...),
3. and launch the download.

The mass downloader module will acquire all waveforms and associated station information across all known FDSN web service implementations producing a clean data set ready for further use. It works by

1. figuring out what stations each provider offers,
2. downloading MiniSEED and associated StationXML meta information in an efficient and data center friendly manner, and
3. dealing with the nasty real-world data issues such as missing or incomplete data and duplicate data across data centers, e.g.
    * Basic, optional automatic quality control that ensures the data has no gaps or overlaps, or that it is available for a certain percentage of the requested time span.
    * It can relaunch the download to acquire missing pieces, which might be needed, for example, if a data center has been offline.
    * It can ensure that there is always a corresponding StationXML file for the waveforms.
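
The quality-control idea in the list above can be pictured with a small stand-alone sketch. It is not the module's implementation: the function name `check_coverage` and the representation of the data as `(start, end)` tuples in seconds are assumptions made purely for illustration.

```python
def check_coverage(segments, span_start, span_end, tolerance=0.0):
    """Return (has_gaps_or_overlaps, covered_fraction) for a requested span.

    segments: list of (start, end) tuples in seconds (hypothetical format).
    """
    ordered = sorted(segments)
    # A gap or overlap exists if consecutive segments do not join exactly.
    irregular = any(
        abs(next_start - prev_end) > tolerance
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )
    # Sum the covered time, clipping overlaps so nothing is counted twice.
    covered = 0.0
    cursor = span_start
    for start, end in ordered:
        start = max(start, cursor)
        end = min(end, span_end)
        if end > start:
            covered += end - start
            cursor = end
    return irregular, covered / (span_end - span_start)


# One hour requested; two pieces with a 60 s gap in between.
has_gaps, fraction = check_coverage([(0, 1800), (1860, 3600)], 0, 3600)
```

In this sketch the trace covers more than 95 % of the hour, so a length threshold of 0.95 alone would keep it, while a strict no-gaps rule would discard it.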

## Usage Examples
Before delving into the nitty-gritty details of how it works and why it does things in a certain way, we'll demonstrate the usage of this module with two annotated examples. They can serve as templates for your own needs.

### Earthquake Data
The classic seismological data set consists of waveform recordings for a certain earthquake. This example downloads all data it can find for the Tohoku-Oki Earthquake from 5 minutes before the earthquake centroid time to 1 hour after. It will furthermore only download data with an epicentral distance between 70.0 and 90.0 degrees and some additional restrictions.<br> 
<span style="color:red; font-size:200%"> ***Be aware that this example will attempt to download data from all FDSN data centers that ObsPy knows of and combine it into one data set.*** </span>

In [None]:
import obspy
from obspy.clients.fdsn.mass_downloader import CircularDomain, \
    Restrictions, MassDownloader

# Centroid time of the Tohoku-Oki earthquake.
origin_time = obspy.UTCDateTime(2011, 3, 11, 5, 47, 32)

# Circular domain around the event location: keeps stations at epicentral
# distances between 70 and 90 degrees.
domain = CircularDomain(latitude=37.52, longitude=143.04,
                        minradius=70.0, maxradius=90.0)

restrictions = Restrictions(
    # Get data from 5 minutes before the event to one hour after the
    # event.
    starttime=origin_time - 5 * 60,
    endtime=origin_time + 3600,
    # You might not want to deal with gaps in the data.
    reject_channels_with_gaps=True,
    # And you might only want waveforms that have data for at least
    # 95 % of the requested time span.
    minimum_length=0.95,
    # No two stations should be closer than 10 km to each other.
    minimum_interstation_distance_in_m=10E3,
    # Only HH or BH channels. If a station has HH channels,
    # those will be downloaded, otherwise the BH. Nothing will be
    # downloaded if it has neither.
    channel_priorities=["HH[ZNE]", "BH[ZNE]"],
    # Location codes are arbitrary and there is no rule as to which
    # location is best.
    location_priorities=["", "00", "10"])

# No providers are specified, so all known ones will be queried.
# The call is wrapped in a string literal so that running this cell does
# not start the download; remove the triple quotes to run it.
'''
mdl = MassDownloader()
mdl.download(domain, restrictions, mseed_storage="waveforms",
             stationxml_storage="stations")
'''

### Continuous Request (Noise Data)
Noise studies are another use case requiring massive amounts of data. Ambient seismic noise correlations require continuous recordings from stations over a large time span. This example downloads data from within a certain geographical domain for a whole year. Individual MiniSEED files will be split per day. The download helpers will attempt to optimize the queries to the data centers and split up the files again if required.

In [None]:
import obspy
from obspy.clients.fdsn.mass_downloader import RectangularDomain, \
    Restrictions, MassDownloader

# Rectangular domain (30-50 N, 5-35 E) that includes southern Germany.
domain = RectangularDomain(minlatitude=30, maxlatitude=50,
                           minlongitude=5, maxlongitude=35)

restrictions = Restrictions(
    # Get data for a whole year.
    starttime=obspy.UTCDateTime(2012, 1, 1),
    endtime=obspy.UTCDateTime(2013, 1, 1),
    # Chunk it to have one file per day.
    chunklength_in_sec=86400,
    # Considering the enormous amount of data associated with continuous
    # requests, you might want to limit the data based on SEED identifiers.
    # If the location code is specified, the location priority list is not
    # used; the same is true for the channel argument and priority list.
    network="BW", station="A*", location="", channel="EH*",
    # The typical use case for such a data set are noise correlations where
    # gaps are dealt with at a later stage.
    reject_channels_with_gaps=False,
    # Same is true with the minimum length. All data might be useful.
    minimum_length=0.0,
    # Guard against the same station having different names.
    minimum_interstation_distance_in_m=100.0)

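# Restrict the number of providers if you know which serve the desired
# data. If in doubt just don't specify - then all providers will be
# queried.
# The call is wrapped in a string literal so that running this cell does
# not start the download; remove the triple quotes to run it.
'''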
mdl = MassDownloader(providers=["LMU", "GFZ"])
mdl.download(domain, restrictions, mseed_storage="waveforms",
             stationxml_storage="stations")
'''

## Usage
Using the download helpers requires the definition of three separate things, all of which are detailed in the following paragraphs.

1. Data Selection: The data to be downloaded can be defined by enforcing geographical or temporal constraints and a couple of other options.
2. Storage Options: Choosing where the final MiniSEED and StationXML files should be stored.
3. Start the Download: Choose from which provider(s) to download and then launch the downloading process.

### Step 1: Data Selection
Data set selection limits the download to data that is useful for the purpose at hand. It is handled by two objects: an instance of a Domain subclass and an instance of the Restrictions class.

The domain module currently defines three different domain types used to limit the geographical extent of the queried data: RectangularDomain, CircularDomain, and GlobalDomain. Subclassing Domain enables the construction of arbitrarily complex domains. Please see the domain module for more details. Instances of these classes will later be passed to the function that starts the download. A rectangular domain, for example, is defined like this:

In [None]:
from obspy.clients.fdsn.mass_downloader.domain import RectangularDomain
domain = RectangularDomain(minlatitude=-10, maxlatitude=10,
                           minlongitude=-10, maxlongitude=10)
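
A circular domain, by contrast, selects stations by their angular distance from a centre point. The stand-alone sketch below shows that kind of check with the haversine formula on a spherical Earth. The helper names `angular_distance_deg` and `in_annulus` are hypothetical, and the module itself may compute distances differently.

```python
import math


def angular_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle distance in degrees on a sphere (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the argument above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def in_annulus(center, station, minradius, maxradius):
    """True if the station lies between minradius and maxradius degrees."""
    distance = angular_distance_deg(*center, *station)
    return minradius <= distance <= maxradius


# A station a quarter of the way around the equator is 90 degrees away.
quarter = angular_distance_deg(0.0, 0.0, 0.0, 90.0)
```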

Additional restrictions like temporal bounds, SEED identifier wildcards, and other things are set with the help of the Restrictions class. Please refer to its documentation for a more detailed explanation of the parameters.

### Earthquake

In [None]:
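import obspy
from obspy.clients.fdsn.mass_downloader import Restrictions

# Placeholder event time; replace with the origin time of your event.
origin_time = obspy.UTCDateTime(2012, 1, 1)

restrictions = Restrictions(
    # Get data from 5 minutes before the event to one hour after the
    # event.
    starttime=origin_time - 5 * 60,
    endtime=origin_time + 3600,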
    # You might not want to deal with gaps in the data.
    reject_channels_with_gaps=True,
    # And you might only want waveforms that have data for at least
    # 95 % of the requested time span.
    minimum_length=0.95,
    # No two stations should be closer than 10 km to each other.
    minimum_interstation_distance_in_m=10E3,
    # Only HH or BH channels. If a station has HH channels,
    # those will be downloaded, otherwise the BH. Nothing will be
    # downloaded if it has neither.
    channel_priorities=["HH[ZNE]", "BH[ZNE]"],
    # Location codes are arbitrary and there is no rule as to which
    # location is best.
    location_priorities=["", "00", "10"])
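
The priority lists read as "take whatever matches the first pattern that matches anything". A minimal sketch of that selection rule, using the standard library's `fnmatch` for the wildcards, might look as follows. `select_by_priority` is a hypothetical helper written for this illustration, not a function of the mass downloader.

```python
import fnmatch


def select_by_priority(available, priorities):
    """Return the entries matching the first pattern that matches anything."""
    for pattern in priorities:
        matches = [code for code in available
                   if fnmatch.fnmatchcase(code, pattern)]
        if matches:
            return matches
    # No pattern matched: nothing would be selected.
    return []


channels = ["BHZ", "BHN", "BHE", "HHZ", "HHN", "HHE", "LHZ"]
selected = select_by_priority(channels, ["HH[ZNE]", "BH[ZNE]"])
```

Here `selected` holds only the HH channels; the BH channels would be picked only for a station that has no HH channels at all.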

### Noise Data
And the restrictions for downloading a noise data set might look similar to the following:

In [None]:
import obspy
from obspy.clients.fdsn.mass_downloader import Restrictions
restrictions = Restrictions(
    # Get data for a whole year.
    starttime=obspy.UTCDateTime(2012, 1, 1),
    endtime=obspy.UTCDateTime(2013, 1, 1),
    # Chunk it to have one file per day.
    chunklength_in_sec=86400,
    # Considering the enormous amount of data associated with
    # continuous requests, you might want to limit the data based on
    # SEED identifiers. If the location code is specified, the
    # location priority list is not used; the same is true for the
    # channel argument and priority list.
    network="BW", station="A*", location="", channel="BH*",
    # The typical use case for such a data set are noise correlations
    # where gaps are dealt with at a later stage.
    reject_channels_with_gaps=False,
    # Same is true with the minimum length. Any data during a day
    # might be useful.
    minimum_length=0.0,
    # Sanitize makes sure that each MiniSEED file also has an
    # associated StationXML file, otherwise the MiniSEED files will
    # be deleted afterwards. This is not desirable for large noise
    # data sets.
    sanitize=False,
    # Guard against the same station having different names.
    minimum_interstation_distance_in_m=100.0)
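
To get a feeling for what a chunk length of 86400 seconds means for a one-year request, the following stand-alone sketch splits a time span into consecutive windows. It uses plain `datetime` objects and a hypothetical helper `split_into_chunks`; how the download helpers split requests internally is not shown here.

```python
from datetime import datetime, timedelta


def split_into_chunks(starttime, endtime, chunklength_in_sec):
    """Yield consecutive (start, end) windows covering [starttime, endtime)."""
    step = timedelta(seconds=chunklength_in_sec)
    current = starttime
    while current < endtime:
        # The last window is clipped so it never extends past endtime.
        yield current, min(current + step, endtime)
        current += step


chunks = list(split_into_chunks(datetime(2012, 1, 1),
                                datetime(2013, 1, 1), 86400))
```

Since 2012 is a leap year, this yields 366 daily windows per channel, which gives a rough idea of how many files such a request can produce.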

The network, station, location, and channel codes are passed directly to the station service of each fdsn-ws implementation and can thus take comma-separated lists of codes as arguments, e.g.

In [None]:
restrictions = Restrictions(
    ...
    network="BW,G?", station="A*,B*",
    ...
    )

Not all fdsn-ws implementations support the direct exclusion of network or station codes. The exclude_networks and exclude_stations arguments should thus be used for that purpose to ensure compatibility across all data providers, e.g.

In [None]:
restrictions = Restrictions(
    ...
    network="B*,G*", station="A*,B*",
    exclude_networks=["BW", "GR"],
    exclude_stations=["AL??", "*O"],
    ...
    )
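
The combined effect of include patterns and exclusion lists can be sketched with the standard library's `fnmatch`. This is only an illustration of the filtering rule, assuming shell-style wildcard matching; `keep_code` is a hypothetical helper and not part of the mass downloader.

```python
import fnmatch


def keep_code(code, include_patterns, exclude_patterns):
    """True if code matches an include pattern and no exclude pattern."""
    included = any(fnmatch.fnmatchcase(code, p) for p in include_patterns)
    excluded = any(fnmatch.fnmatchcase(code, p) for p in exclude_patterns)
    return included and not excluded


networks = ["BW", "BE", "GR", "G", "GE", "CH"]
kept = [n for n in networks
        if keep_code(n, "B*,G*".split(","), ["BW", "GR"])]
```

With these inputs, `kept` contains BE, G, and GE: BW and GR are excluded explicitly, and CH never matched an include pattern.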

It is also possible to restrict the download to stations that are part of an existing inventory object, which can originate from a StationXML file or from other sources. Only stations that are part of the inventory object are kept; channels are still selected dynamically based on the other restrictions.<br><br> Keep in mind that all other restrictions still apply - passing an inventory just further restricts the data that may be downloaded.

In [None]:
restrictions = Restrictions(
    ...
    limit_stations_to_inventory=inv,
    ...
    )

### Step 2: Storage Options
After determining what to download, the helpers must know where to store the requested data. That requires some flexibility in case the mass downloader is to be integrated as a component into a bigger system. An example is a toolbox that has a database to manage its data.

A major concern is to not download pre-existing data. In order to enable such a use case the download helpers can be given functions that are evaluated when determining the file names of the requested data. Depending on the return value, the helper class will download the whole, part, or even none, of that particular piece of data.

#### Storing MiniSEED waveforms
The MiniSEED storage rules are set by the mseed_storage argument of the download() method of the MassDownloader class.

**Option 1: Folder Name**

In the simplest case it is just a folder name:

In [None]:
mseed_storage = "waveforms"

This will cause all MiniSEED files to be stored as

*waveforms/NETWORK.STATION.LOCATION.CHANNEL__STARTTIME__ENDTIME.mseed*

An example of this is

*waveforms/BW.FURT..BHZ__20141027T163723Z__20141027T163733Z.mseed*

which is rather general but also quite long.

**Option 2: String Template**

For more control use the second possibility and provide a string containing {network}, {station}, {location}, {channel}, {starttime}, and {endtime} format specifiers. These values will be interpolated to acquire the final filename. The start and end times will be formatted with strftime() with the specifier "%Y%m%dT%H%M%SZ" in an effort to avoid colons which are troublesome in file names on many systems.

In [None]:
mseed_storage = ("some_folder/{network}/{station}/"
                 "{channel}.{location}.{starttime}.{endtime}.mseed")

results in

*some_folder/BW/FURT/BHZ..20141027T163723Z.20141027T163733Z.mseed*

The download helpers will create any non-existing folders along the path.
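
The interpolation described above can be reproduced with plain `str.format` and `strftime`. The sketch below is a stand-alone illustration using `datetime` objects; `fill_template` is a hypothetical helper, not the function the download helpers use internally.

```python
from datetime import datetime

# Time format named in the text; it contains no colons.
TIME_FORMAT = "%Y%m%dT%H%M%SZ"


def fill_template(template, network, station, location, channel,
                  starttime, endtime):
    """Interpolate SEED codes and formatted times into a path template."""
    return template.format(
        network=network, station=station, location=location,
        channel=channel,
        starttime=starttime.strftime(TIME_FORMAT),
        endtime=endtime.strftime(TIME_FORMAT))


template = ("some_folder/{network}/{station}/"
            "{channel}.{location}.{starttime}.{endtime}.mseed")
path = fill_template(template, "BW", "FURT", "", "BHZ",
                     datetime(2014, 10, 27, 16, 37, 23),
                     datetime(2014, 10, 27, 16, 37, 33))
```

An empty location code simply leaves two adjacent dots in the resulting file name.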

**Option 3: Custom Function**

The most complex but also most powerful possibility is to use a function which will be evaluated to determine the filename. **If the function returns True, the MiniSEED file is assumed to already be available and will not be downloaded again; keep in mind that in that case no station data will be downloaded for that channel.** If it returns a string, the MiniSEED file will be saved to that path. Use closures to make any other parameters available inside the function. This hypothetical function checks if the file is already in a database and otherwise returns a string which will be interpreted as a filename.

In [None]:
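import os

# is_in_db() and ROOT are placeholders for your own database lookup and
# storage root folder.
def get_mseed_storage(network, station, location, channel, starttime,
                      endtime):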
    # Returning True means that neither the data nor the StationXML file
    # will be downloaded.
    if is_in_db(network, station, location, channel, starttime, endtime):
        return True
    # If a string is returned the file will be saved in that location.
    return os.path.join(ROOT, "%s.%s.%s.%s.mseed" % (network, station,
                                                     location, channel))
mseed_storage = get_mseed_storage
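
The closure approach mentioned above could look like the following sketch. A factory function binds the extra parameters and returns a storage function with the expected six-argument signature. The names `make_mseed_storage`, `root`, and `known_files` are assumptions for this example, and a plain set stands in for a real database.

```python
import os


def make_mseed_storage(root, known_files):
    """Build a storage function bound to a root folder and a lookup set."""
    def get_mseed_storage(network, station, location, channel,
                          starttime, endtime):
        key = (network, station, location, channel)
        if key in known_files:
            # Already available: signal that nothing should be downloaded.
            return True
        # Otherwise return the path the file should be written to.
        return os.path.join(root, "%s.%s.%s.%s.mseed" % key)
    return get_mseed_storage


storage = make_mseed_storage("archive", {("BW", "FURT", "", "BHZ")})
```

The returned `storage` function could then be passed as the mseed_storage argument in the same way as the function above.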

<span style="color:blue; font-size:200%">Note<br>
No matter which approach is chosen, if a file already exists, it will not be overwritten; it will be parsed and the download helper class will attempt to download matching StationXML files.</span>

#### Storing StationXML files
The same logic applies to the StationXML files. This time the rules are set by the stationxml_storage argument of the download() method of the MassDownloader class. StationXML files will be downloaded on a per-station basis thus all channels and locations from one station will end up in the same StationXML file.

**Option 1: Folder Name**

A simple string will be interpreted as a folder name. This example will save the files to "stations/NETWORK.STATION.xml", e.g. to "stations/BW.FURT.xml".

In [None]:
stationxml_storage = "stations"

**Option 2: String Template**

Another option is to provide a string formatting template, e.g.

In [None]:
stationxml_storage = "some_folder/{network}/{station}.xml"

will write to *"some_folder/NETWORK/STATION.xml"*, in this case for example to *"some_folder/BW/FURT.xml"*.

<span style="color:blue; font-size:200%">Note<br>
If the StationXML file already exists, it will be opened to see what is in the file. In case it does not contain all necessary channels, it will be deleted and **only those channels needed in the current run will be downloaded again.** Pass a custom function to the stationxml_storage argument if you require different behavior as documented in the following section.</span>

**Option 3: Custom Function**

As with the waveform data, the StationXML paths can also be set with the help of a function. The function in this case is a bit more complex than for the waveform case. It has to return a dictionary with three keys: "available_channels", "missing_channels", and "filename". "available_channels" is a list of channels that are already available as station information and that require no new download. Make sure to include all already available channels; this information is later used to discard MiniSEED files that have no corresponding station information. "missing_channels" is a list of channels for that particular station that must be downloaded and "filename" determines where to save these. Please note that in this particular case the StationXML file will be overwritten if it already exists and only the "missing_channels" will be downloaded to it, independent of what already exists in the file.

Alternatively, the function can also return a string, in which case the behaviour is the same as for the first two options of the stationxml_storage argument.

The next example illustrates a complex use case where the availability of each channel’s station information is queried in some database and only those channels that do not exist yet will be downloaded. Use closures to pass more arguments to the function.

In [None]:
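import os

# is_in_db() and ROOT are placeholders for your own database lookup and
# storage root folder.
def get_stationxml_storage(network, station, channels, starttime, endtime):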
    available_channels = []
    missing_channels = []
    for location, channel in channels:
        if is_in_db(network, station, location, channel, starttime,
                    endtime):
            available_channels.append((location, channel))
        else:
            missing_channels.append((location, channel))
    filename = os.path.join(ROOT, "%s.%s.xml" % (network, station))
    return {
        "available_channels": available_channels,
        "missing_channels": missing_channels,
        "filename": filename}
stationxml_storage = get_stationxml_storage
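
As with the waveform case, a closure can supply the extra arguments. The sketch below is one possible shape, with a set of known channels standing in for a database; `make_stationxml_storage`, `root`, and `known_channels` are hypothetical names chosen for this example.

```python
import os


def make_stationxml_storage(root, known_channels):
    """Build a StationXML storage function bound to root and a lookup set."""
    def get_stationxml_storage(network, station, channels,
                               starttime, endtime):
        # Channels already present in the lookup need no new download.
        available = [c for c in channels
                     if (network, station) + tuple(c) in known_channels]
        missing = [c for c in channels if c not in available]
        return {
            "available_channels": available,
            "missing_channels": missing,
            "filename": os.path.join(root, "%s.%s.xml" % (network, station)),
        }
    return get_stationxml_storage


storage = make_stationxml_storage(
    "stations_db", {("BW", "FURT", "", "BHZ")})
result = storage("BW", "FURT", [("", "BHZ"), ("", "BHN")], None, None)
```

In this run the BHZ channel is reported as available and only BHN is listed as missing, so only BHN would be requested for that station.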

### Step 3: Start the Download
The final step is to actually start the download. Pass the previously created domain, restrictions, and path settings and off you go. Two more parameters of interest are chunk_size_in_mb, which controls how much data is requested per thread, client, and request, and threads_per_client, which controls how many threads are used to download data in parallel per data center - 3 is a value in agreement with some data centers.

In [None]:
mdl = MassDownloader()  
mdl.download(domain, restrictions, chunk_size_in_mb=50,
             threads_per_client=3, mseed_storage=mseed_storage,
             stationxml_storage=stationxml_storage)