# Getting Cloud Native Data

This notebook details how to get cloud native seismic and geodetic data from EarthScope services. This section requires a working knowledge of Python. You should have an understanding of data types and string formatting, creating file paths and file naming, web requests, and creating Python functions. In addition to Python, you should have an understanding of GNSS data types and formats. 

The notebook material is informational and is useful for completing the [**final exercise**](./6_putting_it_all_together.ipynb).

![](images/cloud_native_data_access.png)

## Seismic Data in AWS S3

Object storage in the cloud is a cost-effective way to hold and distribute very large collections of data. Objects consist of the data, metadata, and a unique identifier and are accessible through an application programming interface, or API. EarthScope uses Amazon Web Services' (AWS) Simple Storage Service, or S3, to store and distribute seismic and geodetic data.

AWS S3 supports streaming data directly into memory. Streaming is a significant advantage when analyzing large amounts of data because writing data to a drive and reading it back can consume a large share of the total analysis time. When data is streamed directly into memory, it is immediately available for processing.

### Buckets and Keys

Objects in S3 are stored in containers called `buckets`. Each object is identified by a unique object identifier, or `key`. Objects are accessed using a combination of the web service endpoint, a bucket name, and a key. The combination is commonly written as an S3 URI of the form `s3://bucket/key`; the equivalent Amazon Resource Name (ARN) takes the form `arn:aws:s3:::bucket/key`. Unlike a hierarchical file system on your computer, S3 doesn't have directories; instead, it has prefixes, which act as filters that logically group data. Consider the following example and how its parts can be deciphered:

> s3://ncedc-pds/continuous_waveforms/BK/2022/2022.231/MERC.BK.HNZ.00.D.2022.231

- s3 - service name
- ncedc-pds - bucket name
- continuous_waveforms - prefix
- BK - (prefix) seismic network name 
- 2022 - (prefix) year 
- 2022.231 - (prefix) year and day of year
- MERC.BK.HNZ.00.D.2022.231 - (object name, the final part of the key) station.network.channel.location.D.year.day of year

Similar to a web service file URL, the bucket name and key together identify the object to request.
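The breakdown above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the `s3://bucket/key` URI form; `split_s3_uri` is a hypothetical helper written for this illustration and is not part of `boto3`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into bucket, key, prefixes, and object name."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    key = parsed.path.lstrip("/")
    # Everything before the last "/" is a chain of prefixes
    *prefixes, name = key.split("/")
    return {"bucket": parsed.netloc, "key": key, "prefixes": prefixes, "name": name}


parts = split_s3_uri(
    "s3://ncedc-pds/continuous_waveforms/BK/2022/2022.231/MERC.BK.HNZ.00.D.2022.231"
)
print(parts["bucket"])
print(parts["prefixes"])
print(parts["name"])
```

Note that the key is the full path after the bucket name, prefixes included; the prefixes are simply the leading parts of that string.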

### S3 Buckets with Public Read Access

S3 buckets can be configured for public read access, and you can access objects without providing credentials. The [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) Python package provides libraries for working with AWS services, including S3. Boto3 provides two interfaces for interacting with AWS services. The `client` interface is low-level and fine-grained, and closely follows the AWS API for a service. The `resource` interface is a high-level interface that wraps the `client` interface. AWS stopped development on the resource interface in `boto3` in 2023; for this reason, we will use the `client` interface when working with S3 resources.

The following example reads a miniSEED file from the Northern California Earthquake Data Center (NCEDC). The trace data covers the day of the 2014 Napa earthquake. GeoLab's default environment includes both the `boto3` and `obspy` packages, so we can import them without installation. We establish the connection to S3 by creating a client that specifies that requests are unsigned. Because the S3 bucket allows public read access, unsigned requests succeed without credentials. The client calls the `get_object` method with the bucket name and key for the miniSEED object.

Instead of writing the data to a file (as we did using web services), the S3 client sends it to an in-memory binary stream that can be read by [`obspy`](https://docs.obspy.org/). Streaming the data to in-memory objects is more efficient than downloading and reading files for analysis.

In [None]:
import io

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from obspy import read

# Public bucket: send unsigned (anonymous) requests, so no credentials are needed
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-west-2")

BUCKET_NAME = "ncedc-pds"
KEY = "continuous_waveforms/BK/2014/2014.236/PACP.BK.HHN.00.D.2014.236"

# Fetch the object and copy its bytes into an in-memory binary stream
response = s3.get_object(Bucket=BUCKET_NAME, Key=KEY)
data_stream = io.BytesIO(response["Body"].read())

# Parse the miniSEED bytes with ObsPy
st = read(data_stream)

# Print a summary of the ObsPy Stream
print(st)
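The key in the cell above follows the prefix layout described in the Buckets and Keys section, so it can be built from a station and a date instead of being typed by hand. The sketch below assumes that other keys in the bucket follow the same pattern as the two examples in this notebook; `ncedc_waveform_key` is a hypothetical helper, not an NCEDC or `boto3` function.

```python
import datetime as dt


def ncedc_waveform_key(network, station, channel, location, day):
    """Build a key that mirrors the layout of the examples in this notebook."""
    # Day of year is zero-padded to three digits, e.g. 2014.236
    year_doy = f"{day.year}.{day.timetuple().tm_yday:03d}"
    name = f"{station}.{network}.{channel}.{location}.D.{year_doy}"
    return f"continuous_waveforms/{network}/{day.year}/{year_doy}/{name}"


# 24 August 2014 is day 236 of the year
key = ncedc_waveform_key("BK", "PACP", "HHN", "00", dt.date(2014, 8, 24))
print(key)
```

The resulting string can be passed as the `Key` argument of `get_object` in place of the hard-coded `KEY`.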

## Getting RINEX Observations in Apache Arrow from EarthScope

Apache Arrow is a memory-based data format and framework that makes large-scale data analysis faster, more efficient, and compatible across different tools. Arrow is a columnar format, which means that data is held in columns instead of rows. File formats such as RINEX are row-based. The following example compares calculating the average L1 signal strength from a RINEX file and from an Arrow response.

**Row-based format (RINEX)**: 
>Each observation epoch has: (timestamp, satellite, obs_code, range, phase, snr, slip, flags, fcn, system, igs). To get snr, you would need to read every row and skip the other details, such as phase, slip, system, etc.
 
**Columnar format (Arrow)**: 
>Data is stored by columns: one column for snr, satellite, phase, etc. Since we're only interested in SNR, you can proceed directly to the SNR column without reviewing the other data.
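The difference between the two layouts can be illustrated with plain Python containers. This is only a sketch with made-up values: a list of dictionaries stands in for row-based records and a dictionary of lists stands in for columns. Real RINEX parsing and real Arrow tables are more involved.

```python
from statistics import mean

# Row-based: one record per observation, so every field travels together
rows = [
    {"satellite": 7, "obs_code": "1C", "phase": 1.10, "snr": 42.0},
    {"satellite": 7, "obs_code": "1C", "phase": 1.25, "snr": 44.0},
    {"satellite": 9, "obs_code": "1C", "phase": 0.95, "snr": 40.0},
]
# Every record has to be visited to pick out the one field of interest
row_mean = mean(record["snr"] for record in rows)

# Columnar: one sequence per field
columns = {
    "satellite": [7, 7, 9],
    "obs_code": ["1C", "1C", "1C"],
    "phase": [1.10, 1.25, 0.95],
    "snr": [42.0, 44.0, 40.0],
}
# Only the snr column is touched
col_mean = mean(columns["snr"])

print(row_mean, col_mean)
```

Both layouts give the same answer; the columnar layout reaches it without reading the fields that are not needed.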

Another feature of Arrow is that it is a memory format that supports zero-copy sharing. This means that libraries such as pandas, xarray, and NumPy can often use Arrow data directly in memory without translation. Analyzing data is faster and more efficient without the overhead of translation.

An important feature of Arrow is its vectorized operations, which process many values at the same time. Filtering, aggregating, and joining data are done quickly and in memory. Using the previous example, Arrow can filter SNR over a specified period of time and aggregate it into 15-minute intervals, reducing the amount of data to transfer and analyze.

These features make Arrow an ideal way to deliver large amounts of data efficiently.

### Requesting Data in Arrow

The EarthScope SDK supports requesting GNSS (RINEX) observations in Arrow through an EarthScope client. Because Arrow is an in-memory format, it's commonly converted to a Python data structure for analysis.

[Pandas](https://pandas.pydata.org/) is a popular package for working with tabular, or table, data. It enables loading, cleaning, exploring, transforming, and analyzing data efficiently. Arrow tables convert readily to pandas, which integrates with other analysis and visualization Python libraries and is commonly used for data exploration.

The following code demonstrates how to use the SDK to request data. As with other methods, an authorization token is needed to make the request. The EarthScope client checks for a token when making a request, which means that it isn't necessary to include it in the request. If you do not have a token, you can use the EarthScope CLI to get one; see the [**Getting a Token to Access EarthScope Services**](./3_authorization.ipynb) notebook.

The client has a `data.gnss_observations` method for requesting GNSS observation data. The method takes a number of parameters, but at minimum the start and end datetimes and the station are required.

Note that the request is for 30 hours of data: one full day plus three hours before and after it. If we used a web service request, we would have to download a separate file for each day the window touches, extract the data for the time period from each file, and join the pieces to get the dataset. This method does that automatically and returns only the data you need.

An Arrow table can be converted directly into a pandas [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which is a two-dimensional tabular data structure. Evaluating `df` on the last line of the cell displays the first and last five rows of the table.

In [None]:
import datetime as dt
import pandas as pd
from earthscope_sdk import EarthScopeClient

es = EarthScopeClient()

# Request 30 hours of data (1 day + 3 hour arcs on either side)
arrow_table = es.data.gnss_observations(
    start_datetime=dt.datetime(2025, 7, 20, 21),
    end_datetime=dt.datetime(2025, 7, 22, 3),
    station_name="AC60",
).fetch()

df = arrow_table.to_pandas()
df
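The start and end datetimes in the request above can be derived from the day of interest and the padding, which avoids arithmetic mistakes around midnight. This is a minimal sketch using only the standard library; `padded_window` and `days_touched` are hypothetical helpers and are not part of the EarthScope SDK.

```python
import datetime as dt


def padded_window(day, pad_hours):
    """Return (start, end) covering one full day plus padding on both sides."""
    midnight = dt.datetime.combine(day, dt.time())
    pad = dt.timedelta(hours=pad_hours)
    return midnight - pad, midnight + dt.timedelta(days=1) + pad


def days_touched(start, end):
    """List the calendar days overlapped by the half-open window [start, end)."""
    last = (end - dt.timedelta(microseconds=1)).date()
    count = (last - start.date()).days + 1
    return [start.date() + dt.timedelta(days=i) for i in range(count)]


start, end = padded_window(dt.date(2025, 7, 21), pad_hours=3)
print(start, end, end - start)
print(days_touched(start, end))
```

The resulting `start` and `end` match the values passed as `start_datetime` and `end_datetime` above. The window touches three calendar days (July 20, 21, and 22), which is the number of daily files a file-based workflow would have to handle.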

By specifying values for the columns, you can filter the request so it returns only the data you need. After the Arrow table has been converted to a dataframe, you can apply methods, such as `sort_values`, to make it easier to work with.

In [None]:
arrow_table = es.data.gnss_observations(
    start_datetime=dt.datetime(2025, 7, 20),
    end_datetime=dt.datetime(2025, 9, 20),
    station_name="AC60",
    session_name="A",
    system="G",
    obs_code="1C",
    satellite="7",
    field="snr",
).fetch()
df = arrow_table.to_pandas()
# sort_values returns a new dataframe, so assign the result
df = df.sort_values(by="timestamp")
df
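Once the observations are in a dataframe, the 15-minute aggregation mentioned earlier can be done in pandas. The sketch below uses made-up one-minute samples in place of an SDK response and assumes the dataframe has `timestamp` and `snr` columns; the column names in a real response may differ.

```python
import pandas as pd

# Made-up one-minute SNR samples standing in for the SDK response
sample_df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2025-07-20 00:00", periods=30, freq="1min"),
        "snr": [40.0] * 15 + [44.0] * 15,
    }
)

# Index by time, then average the SNR in 15-minute bins
snr_15min = (
    sample_df.sort_values("timestamp")
    .set_index("timestamp")["snr"]
    .resample("15min")
    .mean()
)
print(snr_15min)
```

Thirty one-minute samples reduce to two values, one per 15-minute bin.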

The `data.gnss_observations` method is the first of several cloud-native functions planned for the EarthScope SDK. These functions are more efficient than web services and can support large-scale research efforts.

## Summary

This section introduced several new concepts. First, we discussed object storage, i.e., AWS S3, which is an efficient way to store petabytes of data. Objects in S3 are organized with prefixes, which are analogous to a directory path. Objects are located using a combination of the service (s3), the bucket name, and the object key, which includes the prefixes. Objects can be streamed directly to memory, making them immediately available to packages such as obspy. A limitation of object storage is that a request addresses the object as it is stored, i.e., the contents of a miniSEED file. S3 can return a byte range of an object, but it has no knowledge of the times inside the file, so a request such as "data 10 minutes before and after an event" cannot be expressed directly; you read the object and trim the data afterwards.

In contrast, the EarthScope SDK supports selectively pulling GNSS observation data instead of pulling every RINEX file that covers the requested time period. The SDK returns the data in Apache Arrow, which is an efficient format for transporting data and converting it to a scientific computing data structure such as a pandas dataframe.

Boto3 and the EarthScope SDK are highly efficient methods for accessing data that enable scaling data processing and analysis.

## [< Previous](./5_web_services_exercises.ipynb)&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[Next >](./7_cloud_native_exercise.ipynb)