---
# **Traditional vs Cloud: Geophysics with GeoLab**

##### 🔶 This notebook demonstrates a side-by-side comparison between traditional geophysics research workflows and a modern, cloud-optimized approach through EarthScope’s GeoLab platform using GNSS 📡 data. By working through a typical data processing task in both environments, we highlight the practical differences in performance, scalability, and ease of use. GeoLab enables researchers to access open datasets, run compute-intensive workflows, and share reproducible analyses—all within a cloud-native ecosystem tailored to the needs of the geophysics community. 

*This demo is your hands-on guide to accessing and processing GNSS data—first through the traditional GAGE archive, and then using its cloud-based counterpart. In this notebook, you’ll follow a two-track journey:*

***1.🚶 Traditional Serial Processing**
   – Download RINEX files one by one to your local GeoLab environment*

***2.☁️ Cloud-Optimized Workflow**
   – Stream data directly from an AWS S3 bucket—no massive local downloads*

*Before diving into the notebook, you should be familiar with the
👉 [GAGE data archive](https://www.unavco.org/data/data.html) || 👉 [GNSS data products](https://www.unavco.org/data/gps-gnss/gps-gnss.html)*

---

# 1️⃣ **Traditional Serial Processing**
*In this section, we’ll explore the classic method of GNSS data handling: downloading RINEX files to your GeoLab environment and processing them one by one.*

##### 🔶 RINEX (Receiver Independent Exchange Format) is the open, ASCII-based standard for recording raw GNSS observations—packaging pseudorange, carrier-phase, navigation, and meteorological data into a human-readable, vendor-neutral text format. More about 👉 [RINEX](https://igs.org/wg/rinex/)

*First, we’ll install the `GeoRINEX` package into a temporary environment by running a shell command from the notebook.*

*For more details, 👉 [GeoRINEX GitHub repository](https://github.com/geospace-code/georinex/tree/main) || 👉 [Bash scripting inside notebook](https://stackoverflow.com/questions/58981651/how-to-run-a-shell-script-in-jupyter-notebook)*

In [None]:
!pip install georinex

*Now we need a function that authenticates and downloads a RINEX file: **`download_rinex`**. It lets you pull files from EarthScope's authenticated "gage-data" service in an automated script or pipeline. EarthScope’s “gage-data” portal requires `OAuth2` tokens; the SDK handles device-code flows, token caching, and token refreshing, so you never have to copy and paste access tokens by hand. It can also stream large files without exhausting memory. To learn more about how it works, 👉 [earthscope-sdk](https://docs.earthscope.org/projects/SDK/en/stable/content/usage.html)*
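
*To illustrate what streaming to disk means, the sketch below writes a body chunk by chunk, so only one chunk is held in memory at a time. The helper name `save_chunks` and the fake chunk source are illustrative assumptions; with an `httpx` streamed response the chunks could come from `iter_bytes()` (or `aiter_bytes()` in async code). How best to do this through the EarthScope SDK is not shown here.*

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], out_path: Path) -> int:
    """Write an iterable of byte chunks to out_path and return the bytes written."""
    total = 0
    with open(out_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)        # only the current chunk is held in memory
            total += len(chunk)
    return total


# Stand-in for a streamed HTTP body: three small chunks (hypothetical data)
fake_body = (b"abc", b"defg", b"hi")
demo_path = Path(tempfile.mkdtemp()) / "demo.bin"
n_bytes = save_chunks(fake_body, demo_path)
print(n_bytes, "bytes written to", demo_path.name)
```

*By contrast, reading a whole response body into a single bytes object needs as much memory as the file is large.*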

*We will download from the EarthScope's GAGE data archive 👉*
*[https://gage-data.earthscope.org/archive/gnss/rinex/obs/](https://gage-data.earthscope.org/archive/gnss/rinex/obs/)*
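
*Files in this archive follow the RINEX 2 short-name convention `ssssddd0.yyd.Z`: station code, day of year, a `0` for a full-day file, the two-digit year, `d` for Hatanaka-compressed observations, and `.Z` for Unix compression. A minimal sketch of building such a name and its URL follows; the helper names are hypothetical, and the `year/day-of-year` directory layout is taken from the download function in this notebook.*

```python
def rinex2_obs_filename(station: str, year: int, doy: int) -> str:
    """Daily Hatanaka-compressed RINEX 2 observation file name (ssssddd0.yyd.Z)."""
    return f"{station.lower()}{doy:03d}0.{year % 100:02d}d.Z"


def rinex2_obs_url(station: str, year: int, doy: int) -> str:
    """Archive URL, assuming the <year>/<doy>/ directory layout used in this notebook."""
    base = "https://gage-data.earthscope.org/archive/gnss/rinex/obs"
    return f"{base}/{year}/{doy:03d}/{rinex2_obs_filename(station, year, doy)}"


example_url = rinex2_obs_url("p057", 2024, 10)
print(example_url)
```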

In [None]:
# ____ Authentication & Downloading function __________________________________________________
import asyncio
from pathlib import Path

from earthscope_sdk import AsyncEarthScopeClient

async def download_rinex(station: str, year: int, doy: int, save_dir: str = ".") -> Path:
    """
    Download a daily Hatanaka-compressed RINEX 2 observation file (.YYd.Z)
    from EarthScope’s gage-data server.

    Parameters
    ----------
    station : str
        4-char station code, e.g. "p057"
    year : int
        Four-digit year, e.g. 2025
    doy : int
        Day-of-year (1–365 or 366)
    save_dir : str
        Directory where the file will be written (created if needed)

    Returns
    -------
    Path
        Path to the downloaded .YYd.Z file
    """
    # 1. Ensure output directory exists
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

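    # 2. Construct the RINEX 2 file name (ssssddd0.yyd.Z) and its archive URL
    #    The two-digit year in the name is derived from `year` instead of being hardcoded
    filename = f"{station}{doy:03d}0.{year % 100:02d}d.Z"
    url = f"https://gage-data.earthscope.org/archive/gnss/rinex/obs/{year}/{doy:03d}/{filename}"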

    # 3. Spin up the SDK’s async client (auto-auth, token refresh, retries)
    async with AsyncEarthScopeClient() as client:
        # Build & send via the SDK’s HTTPX client
        req = client.ctx.httpx_client.build_request("GET", url)
        resp = await client.ctx.httpx_client.send(req)
        resp.raise_for_status()

        # 4. Write out the file
        out_path = out_dir / Path(url).name
        out_path.write_bytes(resp.content)
        return out_path

### **1. Single Download**

*We will download GNSS data for a specific station, year, and day, and save it to a local directory.*

In [None]:
# ── Imports & Setup ─────────────────────────────────────────────────────────────
# necessary packages for collecting, analyzing, plotting, and saving data
import os
import time
import numpy as np
import matplotlib.pyplot as plt
import georinex as gr                    # GeoRINEX: convert RINEX -> xarray datasets (More on: https://github.com/geospace-code/georinex/blob/main/Readme_OBS.md)
from datetime import datetime, timedelta

# ── Configuration of input ──────────────────────────────────────────────────────────────
st_code   = 'p057'          # station code
yr        = 2024            # the year of the observation data
dy        = 10              # the day of the observation data
save_dir  = 'rinex_data'    # output directory to save the rinex data

# ── Single download ──────────────────────────────────────────────────────────────
rinex_path = await download_rinex(st_code, yr, dy, save_dir=save_dir)
print("Saved to", rinex_path)

### **2. Serial Download & Processing**

*Now we will download GNSS data for multiple days of the year and compute the mean SNR for each day.*
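
*The loop below works in day-of-year (DOY) numbers, while plots need calendar dates. A small sketch of the conversion in both directions, using only the standard library; the helper names are illustrative.*

```python
from datetime import date, timedelta


def doy_to_date(year: int, doy: int) -> date:
    """Calendar date for a 1-based day-of-year."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def date_to_doy(d: date) -> int:
    """1-based day-of-year for a calendar date."""
    return d.timetuple().tm_yday


print(doy_to_date(2024, 10))             # 10th day of 2024
print(date_to_doy(date(2024, 12, 31)))   # last day of a leap year
```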

In [None]:
# ── Serial Download & Processing Loop ──────────────────────────────────────────
doys = np.arange(1, 11)     # days of year (DOY) to process: the first 10 days of the year

# Lists to collect values inside the loop
snr_list = []
dates    = []

# Loop over day-of-year (doy)
for doy in doys:
    # 1️⃣ Download the file into rinex_path
    rinex_path = await download_rinex(st_code, yr, doy, save_dir=save_dir)
    print('_______')
    print("Saved to", rinex_path)

    # 2️⃣ Load the downloaded RINEX file into an xarray dataset
    obs = gr.load(rinex_path, use='G', meas=['S1'])  
    #    - use='G' restricts to GPS satellites || RINEX can contain observations from multiple GNSS constellations: 'G' = GPS; 'R' = GLONASS; 'E' = Galileo, etc. ||
    #    - meas=['S1'] loads only the L1C signal-to-noise ratio || RINEX files contain many different measurement 'meas' types for each satellite and epoch (C1, L1, S1, D1, etc.).

    # 3️⃣ Compute the daily average SNR across all epochs & satellites
    print("     Calculating mean ...")
    daily_mean_snr = obs['S1'].mean().values
    snr_list.append(daily_mean_snr)

    # 4️⃣ Record the corresponding calendar date
    date = datetime(yr, 1, 1) + timedelta(days=int(doy - 1))
    dates.append(date)

    # 5️⃣ Clean up: delete the RINEX file to save disk space
    os.remove(rinex_path)
    print("Removed ", rinex_path)
    print('_______')

# ── Visualization ──────────────────────────────────────────────────────────────
plt.figure(figsize=(10, 4))
plt.plot(dates, snr_list, marker='o', linestyle='-')
plt.axhline(np.mean(snr_list), linestyle='--', label=f"Mean SNR = {np.mean(snr_list):.2f}")
plt.title(f"{len(dates)} Days of L1C SNR Averages (Serial Download & Processing)")
plt.xlabel("Date")
plt.ylabel("Mean L1C SNR")
plt.xticks(rotation=45)
plt.legend(loc='best')
plt.tight_layout()
plt.show()

*To recap, the code above did the following:*

*Download GNSS observations (RINEX)    --> parse with `georinex` into an `xarray` dataset     --> slice/average with `xarray`*
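
*The daily average collapses a grid of epochs by satellites in which missing observations are stored as NaN; `xarray`'s `.mean()` skips NaN values by default for floating-point data. The pure-Python sketch below shows the same idea on made-up numbers. The function name and values are illustrative only.*

```python
import math


def nan_mean(grid):
    """Mean of all non-NaN values in a 2-D list (rows = epochs, columns = satellites)."""
    values = [v for row in grid for v in row if not math.isnan(v)]
    return sum(values) / len(values) if values else math.nan


nan = math.nan
snr_grid = [
    [40.0, nan, 45.0],   # epoch 1: the second satellite was not observed
    [42.0, 38.0, nan],   # epoch 2: the third satellite was not observed
]
daily_mean = nan_mean(snr_grid)
print(daily_mean)
```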

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Comparison_satellite_navigation_orbits.svg/250px-Comparison_satellite_navigation_orbits.svg.png" alt="GNSS" width="200"/>  <img src="https://raw.githubusercontent.com/asifashraf060/cloud-101-geolab/a3878c6335c72aab898d537011b3302c3c6c61f4/images/gr.png" alt="GeoRINEX" width="200"/> <img src="https://camo.githubusercontent.com/2ee0e1a7be6338f330d4a2e69f86c9e545e9dc39e60fee9d52ee79b141b14d6b/68747470733a2f2f646f63732e7861727261792e6465762f656e2f737461626c652f5f7374617469632f5861727261795f4c6f676f5f5247425f46696e616c2e737667" alt="Xarray" width="200"/>

# 2️⃣ **Cloud Optimized Workflow**


Cloud computing provides on-demand access to flexible compute and storage resources over the internet.

- **Compute**: Virtual machines (EC2) where your code runs; check out 👉 [EC2 documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html)

- **Storage**: Data lives on services like S3 (AWS); check out 👉 [S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html)

- **Cluster**: A group of machines working together on large tasks

These resources scale up/down in minutes, so you pay only for what you use. 

<img src="https://community.aws/_next/image?url=https%3A%2F%2Fassets.community.aws%2Fa%2F2YnihCpaNZkmFVuxyHKWrcxDSMT.png%3FimgSize%3D919x516&w=1080&q=75" alt="awsEC2" width="320"/> 
<img src="https://i.ytimg.com/vi/ecv-19sYL3w/mqdefault.jpg" alt="awsS3" width="320"/> <img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/06/09/Image-1a-VPC.png" alt="awsS3" width="315"/> 

*First, let's tell the `AWS CLI` to start a Single Sign-On login flow using the “es-dev” profile. This opens your browser to authenticate and caches temporary credentials for that profile.*

In [None]:
!aws sso login --profile es-dev

### **1. Cloud storage "objects" vs local files**
Amazon S3 is an object storage service that stores data as objects.

 💠 "bucket" = top-level container

 💠 "object" = data blob + metadata, addressed by a key (no hierarchical folders)

 💠 "file"   = local disk item (you only get one if you download it)

 In essence, Amazon S3 keeps each file you upload as an object (data + metadata) inside a bucket you name. Every object is addressed by a unique key (the full path string you assign).
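
*Because keys are flat strings, "folders" on S3 are only a convention: listing by key prefix (and grouping on a delimiter such as `/`) gives the appearance of directories. The sketch below emulates that with a plain list of made-up keys; it does not call AWS, and the helper names are hypothetical.*

```python
# Made-up object keys in one bucket; "/" is just a character inside each key
keys = [
    "gnss/2024/p057.dat",
    "gnss/2024/p058.dat",
    "gnss/2025/p057.dat",
    "readme.txt",
]


def list_keys(keys, prefix=""):
    """All keys that start with the given prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


def list_folders(keys, prefix="", delimiter="/"):
    """Emulated sub-folders: distinct key segments directly below the prefix."""
    folders = set()
    for k in keys:
        if k.startswith(prefix):
            rest = k[len(prefix):]
            if delimiter in rest:
                folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(folders)


print(list_keys(keys, "gnss/2024/"))
print(list_folders(keys, "gnss/"))
```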

 ***Traditional storage vs Object Storage⤵️***

 <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*VsX-1XW2EVYCwz2IRH_2FQ.png" alt="cloudStorage" width="500"/> 

To understand the **benefits of this object storage approach,**
read the blog 👉 [EarthScope's cloud optimized data storage](https://www.earthscope.org/news/meet-tiledb-one-key-to-cloud-optimizing-our-data-archives/)

##### **🚩🚩 📝 EXERCISE**

From the blog, answer the following:

**Q: What data format does EarthScope use to store its data in the cloud❓**

*A: TileDB*

**Q: Why is TileDB, when combined with object storage, well-suited for large-scale analysis❓**

*A:* 

*1. Sparse Arrays: Efficiently represent large datasets with many empty or default values.*

*2. Multi-Dimensional Querying: Enables fast access to data along one or more dimensions.*

*3. Versioning Support: Each cell update is versioned, allowing retrieval by specific points in time.*

*4. Parallel I/O: Supports concurrent read/write operations, ideal for multithreaded computing.*

*5. Separation of Metadata: Metadata is managed via a relational database, improving scalability.*

*6. Direct Data Access: Standardized array format allows direct loading into dataframes for immediate analysis.*

---

##### Build a URI (Uniform Resource Identifier) to access your GNSS file

In [None]:
# -----------------------------------------------------------------------------
# 1) Station & Cloud-storage identifiers
# -----------------------------------------------------------------------------
station_id  = "P057"                                                        # your GNSS station code
bucket_name = "repository-stage-us-east-2-mlmoghi3ooss/geolab-gnss-demo"    # EarthScope’s S3 bucket (object store)
object_key  = f"{station_id}.tdb"                                           # each .tdb is an immutable “object” in S3

# -----------------------------------------------------------------------------
# 2) Build a “URI” to refer to your object
# -----------------------------------------------------------------------------
s3_uri = f"s3://{bucket_name}/{object_key}" # This is EarthScope's bucket-centric architecture

##### **🚩🚩 📝 EXERCISE**

**Q: Explain the input parameters in the previous cell in terms of the object storage approach in AWS S3❓**

*A: bucket_name = bucket; object_key = object; station_id = file*

**Q: Compare the directory in HPC/Local vs the Cloud for the GNSS file❓**

*A: In traditional HPC you think “file /data/P057.tdb.” In the cloud it’s “object key P057.tdb in bucket geolab-gnss-demo.” There are no real directories on S3—just keys.*
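
*Under the usual `s3://` URI convention, the text between `s3://` and the first `/` is the bucket, and everything after it is the key (including any prefix that looks like a folder). A minimal sketch of splitting such a URI with the standard library follows; the bucket and prefix names are hypothetical.*

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://example-bucket/demo-prefix/P057.tdb")
print("bucket:", bucket)
print("key:   ", key)
```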

---

### **2. ARCO-style data concept**
*ARCO stands for Analysis-Ready, Cloud-Optimized. Each object is immutable and carries metadata (upload timestamp, tags). You refer to it by bucket+key, and you can attach custom metadata (e.g. station, observation type).*

In [None]:
from datetime import datetime, timedelta
# -----------------------------------------------------------------------------
# 2) Time‐window for your download
# -----------------------------------------------------------------------------
start_iso      = "2024-05-11"         # ISO format is standard in ARCO specs
duration_hours = 12

start_dt = datetime.fromisoformat(start_iso)
end_dt   = start_dt + timedelta(hours=duration_hours)

# -----------------------------------------------------------------------------
# 3) Observation parameters (per ARCO data dictionary)
# -----------------------------------------------------------------------------
# – constellation: 0=GPS, 1=GLONASS, 2=Galileo, …
# – obs_code:      numeric GNSS observation type
constellation_code = 0    # GPS
raw_obs_code       = 12611  # e.g. “GPS L1 C/A pseudorange” per ARCO spec

# Some client libraries expect obs_code packed as 2‐byte big-endian:
obs_code_bytes = raw_obs_code.to_bytes(2, "big")
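
*The byte packing above can be checked with a quick round trip. Note that the two big-endian bytes of 12611 are the ASCII characters `1C`, which matches the band/attribute suffix RINEX 3 uses for L1 C/A observations. Whether the array schema intends that encoding is an assumption, not something this notebook states.*

```python
raw_obs_code = 12611

packed_big    = raw_obs_code.to_bytes(2, "big")      # most significant byte first
packed_little = raw_obs_code.to_bytes(2, "little")   # least significant byte first
restored      = int.from_bytes(packed_big, "big")

print(hex(raw_obs_code), packed_big, packed_little, restored)
```

*Byte order matters: unpacking with the wrong order yields a different integer, so it is worth confirming which order a client library expects.*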

### **3. Access EarthScope's data lake in AWS S3**

*We will use `boto3` to start an AWS session, which is a friendly, Pythonic way to talk to AWS services.*

*It is the official AWS SDK for Python.*

*It handles all the gritty details of signing requests, managing credentials, and retrying failed network calls, so you can focus on your data rather than on AWS plumbing.*
*To learn more  👉  [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)*
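
*The hand-off from AWS credentials to a TileDB configuration can be factored into a small helper. The sketch below uses a stand-in namedtuple with the same `access_key`, `secret_key`, and `token` fields that the frozen credentials in this notebook expose, so it runs without AWS access. The helper name, the placeholder values, and the choice to omit the session token when it is empty are assumptions.*

```python
from collections import namedtuple

# Stand-in for frozen credentials (hypothetical values, no AWS call involved)
FrozenCreds = namedtuple("FrozenCreds", ["access_key", "secret_key", "token"])


def build_tiledb_s3_config(creds, region: str, io_threads: int = 12) -> dict:
    """Build a TileDB config dict for S3 access from a credentials snapshot."""
    config = {
        "vfs.s3.region": region,
        "sm.io_concurrency_level": io_threads,
        "vfs.s3.aws_access_key_id": creds.access_key,
        "vfs.s3.aws_secret_access_key": creds.secret_key,
    }
    # Temporary (SSO/STS) credentials carry a session token; long-lived keys do not
    if creds.token:
        config["vfs.s3.aws_session_token"] = creds.token
    return config


temp_cfg = build_tiledb_s3_config(FrozenCreds("AKIA_EXAMPLE", "SECRET", "TOKEN"), "us-east-2")
perm_cfg = build_tiledb_s3_config(FrozenCreds("AKIA_EXAMPLE", "SECRET", None), "us-east-2")
print(sorted(temp_cfg))
```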

In [None]:
# ============================================================================
# 4) Setup AWS credentials & TileDB config
# ============================================================================
import boto3
# ———————————————————————————————————————————————————————————————
# 4a) Load your AWS credentials via boto3 Session
#     This will look for ~/.aws/credentials or $AWS_PROFILE
# ———————————————————————————————————————————————————————————————
session    = boto3.Session(profile_name="es-dev")
creds      = session.get_credentials().get_frozen_credentials()  # Returns a “credentials provider” object and converts that into an immutable snapshot of raw values
# ———————————————————————————————————————————————————————————————
# 4b) Build a config dict for TileDB’s S3 virtual filesystem (VFS)
#     so TileDB knows how to authenticate & parallelize I/O.
# ———————————————————————————————————————————————————————————————
tdb_config = {
    # Which AWS region your bucket lives in
    "vfs.s3.region":                "us-east-2",
    # Upper bound on the number of threads TileDB uses for I/O-bound tasks (such as S3 reads)
    "sm.io_concurrency_level":      12,
    # Your access keys (pulled from the frozen credentials snapshot)
    "vfs.s3.aws_access_key_id":     creds.access_key,
    "vfs.s3.aws_secret_access_key": creds.secret_key,
    # If you’re using temporary session tokens, include this too
    "vfs.s3.aws_session_token":     creds.token,
}

### **4. Object read (streaming)**
- Open the array directly on S3
- Data flows over the network into memory
- No local file is ever created
- This is the opposite of a file download, which makes a full local copy

*In the following cell you will open a remote `tiledb` array on S3 and load the specified time‐range slice into a pandas DataFrame.*

*We will do it in two ways,*

*Option 1: One-shot slice --> single open & single request, fewer S3 round trips, lower per request overhead*

*Option 2: Incremental daily chunks --> repeated open & repeated request, more S3 round trips, stream small slices*
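
*For Option 2, the time range has to be cut into contiguous daily windows. The sketch below generates such windows as millisecond timestamps, assuming the array's time dimension is indexed in Unix epoch milliseconds and that naive datetimes represent UTC; both are assumptions, and the helper names are illustrative.*

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)   # naive datetimes are treated as UTC here


def to_epoch_ms(dt: datetime) -> int:
    """Milliseconds since the Unix epoch, using exact integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def daily_windows(start: datetime, n_days: int):
    """Yield (start_ms, end_ms) for n_days consecutive days; end_ms is the next midnight."""
    for i in range(n_days):
        day_start = start + timedelta(days=i)
        yield to_epoch_ms(day_start), to_epoch_ms(day_start + timedelta(days=1))


windows = list(daily_windows(datetime(2024, 1, 1), 3))
for lo, hi in windows:
    print(lo, hi)
```

*Each window ends exactly where the next begins. If the indexer you slice with treats ranges as inclusive on both ends, it is worth checking whether to pass `hi - 1` so that a sample at midnight is not counted in two consecutive days.*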

In [None]:
import tiledb
# ============================================================================
# 5) Object read (streaming)
#    — Good for pipelines that parse on-the-fly, without touching disk
# ============================================================================

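def unix_time_millis(dt):
    """Milliseconds since the Unix epoch for a naive datetime (treated as UTC).

    Assumes the array's time dimension is indexed in Unix epoch milliseconds.
    """
    return (dt - datetime(1970, 1, 1)) // timedelta(milliseconds=1)

# 5a) Option 1: One-shot slice
# Open the TileDB array in “read” mode with your AWS creds baked in
with tiledb.open(s3_uri, mode="r", config=tdb_config) as A:
    A.upgrade_version()  # Ensure compatibility if schema has changed

    # Slice out exactly the rows you need; TileDB streams them from S3
    df_stream = A.df[
        unix_time_millis(start_dt):unix_time_millis(end_dt),
        :  # keep all other dimensions
    ]

# 5b) Option 2: Incremental daily chunks
date_li = []   # calendar date of each daily chunk
snr_li  = []   # mean SNR of each daily chunk
for doy in np.arange(1, 51):
    # `yr` is the year set in the configuration cell of the traditional workflow
    start = datetime(yr, 1, 1) + timedelta(days=int(doy - 1))
    end   = start + timedelta(days=1)
    date_li.append(start)

    # Re-open the array for every chunk and read only that day's slice
    with tiledb.open(s3_uri, mode="r", config=tdb_config) as A:
        df = A.df[unix_time_millis(start):unix_time_millis(end), :]['snr']
        snr_li.append(df.mean())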

##### **🚩🚩 📝 EXERCISE**

**Q: How will Options 1 and 2 perform in the following use cases/scenarios❓**

A:

| Scenario                                      | Option 1: Single-shot slice                                                                | Option 2: Incremental daily chunks                                           |
| --------------------------------------------- | ------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| **One-off analysis over a continuous window** | ✔️ Fetch entire `start → end` range in one call<br>– Fewer S3 round-trips<br>– Simple code | ❌ Too many repeated opens/reads                                              |
| **Small-to-medium window**                    | ✔️ Efficient bulk read<br>– Minimal overhead                                               | ✔️ Works but incurs extra latency per chunk                                  |
| **Very large window (days/weeks/months)**     | ❌ May exhaust RAM if you load too much data at once                                        | ✔️ Streams day by day<br>– Caps memory use per slice                         |
| **Fine-grained per-day metrics**              | ❌ You’d need to slice & loop locally after download                                        | ✔️ Computes each day’s metric on-the-fly<br>– No large in-memory DataFrame   |
| **Low-latency needs (few requests)**          | ✔️ Single network hit<br>– Faster end-to-end for moderate windows                          | ❌ Many handshakes add up                                                     |
| **Minimal local storage footprint**           | ✔️ No disk writes                                                                          | ✔️ Only per-chunk in-memory; no full download                                |
| **Reuse same data repeatedly**                | ✔️ Good if you only need one contiguous pass                                               | ❌ Re-streaming same ranges wastes network and time                           |
| **Disk-based tools require a file path**      | ❌ Not applicable (no local file)                                                           | ❌ Streaming only; consider full download first (see “file download” pattern) |


---

##### **🚩🚩 📝 FINAL EXERCISE 🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩**
**Q: What is the difference in total time taken to complete the download and processing steps between the traditional workflow and the cloud-optimized workflow❓**
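
*One way to measure this is to wrap each workflow in a timer. The sketch below is a minimal timing helper built on `time.perf_counter`; the labels and the placeholder workloads are assumptions, to be replaced by the serial loop and the cloud read from this notebook.*

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("traditional"):
    sum(range(100_000))     # placeholder for the serial download & processing loop

with timed("cloud"):
    sum(range(1_000))       # placeholder for the TileDB read from S3

for label, seconds in timings.items():
    print(f"{label:12s} {seconds:.4f} s")
print(f"difference   {timings['traditional'] - timings['cloud']:.4f} s")
```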
