# Workshop 2: Accessing Remote Hydrological Data

Accessing data from remote servers is a common task in environmental and climate sciences, where large datasets are often stored on institutional or public repositories. Several tools and protocols are available to facilitate this, depending on the format, structure, and access restrictions of the data. In this notebook we explore some commonly used methods.


[2.1. Manual data downloading](#manual_downloading)

[2.2.   Command line downloading](#cl_downloading)

[2.3.   Data access without downloading](#remote_access)




In [None]:
%%capture
# Install the packages needed in addition to those preinstalled in Google Colab
!pip install zarr cftime s3fs netCDF4==1.6.0

# If you are running this notebook on a platform other than Google Colab, please use the following command to install all the required packages.
#!python -m pip install -r requirements.txt

<a id='manual_downloading'></a>
# 2.1. Manual Data Downloading
Direct data download refers to retrieving datasets from a remote server via a straightforward HTTP, HTTPS, or FTP/SFTP link. This method is commonly used when data are stored as static files (e.g., NetCDF, CSV, GRIB) and made accessible through a direct URL, so users can download files manually through a browser. While this approach is simple and widely supported, it becomes inefficient for large datasets or for repeated queries across many files, in which case more advanced methods may be preferable. For bulk downloads from FTP/SFTP servers, tools such as [FileZilla](https://filezilla-project.org/) are available.

***Demonstration examples:***

https://environment.data.gov.uk/hydrology/explore

https://portal.grdc.bafg.de/applications/public.html?publicuser=PublicUser#dataDownload/Home


<a id='cl_downloading'></a>
# 2.2. Command Line Downloading

Datasets hosted on remote servers can be downloaded via HTTP, HTTPS, or FTP links using command-line tools in Linux. These methods are especially useful for automating bulk downloads from static URLs. With Linux shell scripting, such downloads can also be parallelised to handle multiple files simultaneously. This method downloads the complete file behind each link; in most cases it cannot retrieve a subset of a file.

[Parallel computing](https://www.geeksforgeeks.org/computer-science-fundamentals/introduction-to-parallel-computing/) can significantly speed up the process of downloading large numbers of files from the command line. Instead of downloading files one at a time, multiple downloads can be done simultaneously, making efficient use of available CPU and network resources. This approach is especially useful when working with large datasets with multiple files or when accessing data from remote servers with high latency (i.e., delay in communication between your computer and the remote server).
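
As a minimal sketch of this idea, the following Python code downloads several files concurrently with a thread pool from the standard library. The function names (`filename_from_url`, `download_one`, `download_many`) and the default of four workers are assumptions made for illustration. Threads are used here because downloads are mostly limited by the network rather than by the CPU.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path component of a URL, used as the local file name."""
    return Path(urlparse(url).path).name


def download_one(url, out_dir="."):
    """Download a single URL into out_dir and return the local path."""
    target = Path(out_dir) / filename_from_url(url)
    urlretrieve(url, str(target))
    return target


def download_many(urls, out_dir=".", max_workers=4, fetch=download_one):
    """Download several URLs concurrently; results keep the order of the input list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda url: fetch(url, out_dir), urls))
```

Calling `download_many` with a list of file URLs saves each file into the current directory and returns the local paths in the same order as the input. A modest number of workers is usually sensible, since many simultaneous requests can overload the remote server.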

***Demonstration examples:***

https://catalogue.ceh.ac.uk/documents/dbf13dd5-90cd-457a-a986-f2f9dd97e93c

https://www.ncei.noaa.gov/pub/data/


*Note: The ! operator in Python-based Jupyter Notebooks allows users to execute Linux shell commands directly from within the notebook, effectively stepping out of the Python environment to run system-level commands.*
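
Outside a notebook the `!` shortcut is not available. A rough equivalent can be sketched with Python's standard `subprocess` module, as below. The helper name `run_command` is hypothetical, and the demonstration calls the Python interpreter itself so that it does not depend on `wget` being installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command (given as a list of arguments) and return its output as text."""
    # check=True raises an error if the command exits with a non-zero status
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


# Demonstration with a command that is always available: the running Python interpreter
print(run_command([sys.executable, "-c", "print('hello from a subprocess')"]))
```

In a plain Python script, a notebook line such as `!wget URL` would correspond roughly to `run_command(["wget", URL])`, assuming `wget` is installed on the system.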

### (i) Downloading a single file

In [None]:
!wget https://catalogue.ceh.ac.uk/datastore/eidchub/dbf13dd5-90cd-457a-a986-f2f9dd97e93c/GB/monthly/CEH_GEAR_monthly_GB_1894.nc

In [None]:
# For now we are deleting the data
!rm CEH_GEAR_monthly_GB_1894.nc

### (ii) Downloading restricted data

In [None]:
# wget can also download multiple files, or data from FTP servers that are password protected and linked to your account, using the following form
# For a detailed example of accessing password-protected servers please see: https://eidc.ac.uk/help/getdata/downloadData
# Add your own username and password to download the whole catalogue.
# Please do not try this during the training session as it would take a lot of time and storage space.
!wget --user=YOUR_USERNAME --password=YOUR_PASSWORD --auth-no-challenge https://catalogue.ceh.ac.uk/datastore/eidchub/dbf13dd5-90cd-457a-a986-f2f9dd97e93c

### (iii) Downloading multiple files

In [None]:
# Multiple files are sometimes bundled into a single compressed archive, as in the following example
!wget https://www.ncei.noaa.gov/pub/data/hourly_precip-3240/01/3240_01_1948-1998.tar.Z

In [None]:
%%capture
# There are command line methods to uncompress the datasets
!tar -zxvf 3240_01_1948-1998.tar.Z

In [None]:
# For now we are deleting the data
!rm 3240_*

In [None]:
# To download multiple files via wget, you can use a txt file that has a list of the URLs
# First as an example we create a txt file with urls we want to download
url_list = ['https://www.ncei.noaa.gov/pub/data/daily-grids/beta/by-month/1951/01/prcp-195101-cen-scaled.csv',
            'https://www.ncei.noaa.gov/pub/data/daily-grids/beta/by-month/1952/01/prcp-195201-cen-scaled.csv']

with open("urls.txt", "w") as outfile:
    outfile.write("\n".join(url_list))

!more urls.txt

In [None]:
# Then use the text file with the URL list to download multiple files
!wget -i urls.txt

In [None]:
# For now we are deleting the data
!rm prcp* urls.txt

<a id='remote_access'></a>
# 2.3. Access Without Downloading

Hydroclimate data can be accessed remotely without downloading entire datasets. Multiple platforms allow users to query, subset, and stream data directly into analysis environments such as Python, R, or MATLAB. Some methods, such as APIs and HTTP-based subsetting services, support dynamic data slicing using query parameters. More broadly, remote data access enables scalable, on-demand computing by allowing users to process and analyze data without the need for local storage. Such remote access methods are increasingly critical for handling the growing volume of high-resolution climate data.

## 2.3.1. Data Access Protocols/Servers

These servers are mostly designed to host and serve large multidimensional scientific datasets (e.g., NetCDF or HDF formats) and are not suitable for other formats like CSV or JSON. Examples include THREDDS (Thematic Real-time Environmental Distributed Data Services) and OPeNDAP (Open-source Project for a Network Data Access Protocol). They can be accessed via scientific tools and libraries (e.g., xarray, netCDF4, nccopy, Panoply), and protocols like OPeNDAP also allow client-side access via URLs (e.g., with wget). They offer some flexibility for accessing subsets of the data, but no functionality for filtering by metadata or for aggregation.

***Demonstration examples:***

https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6

https://ds.nccs.nasa.gov/thredds/catalog/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/catalog.html

https://ds.nccs.nasa.gov/thredds/catalog/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/catalog.html?dataset=AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc


### (i) Downloading via command line

In [None]:
# You can use the "HTTPServer" URL to directly download the whole file without any subsetting
#!wget https://ds.nccs.nasa.gov/thredds/fileServer/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc
#!rm pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc

In [None]:
# You can use the "NetcdfSubset" URL to directly download the file but you can also subset the data to your requirements
# Link here https://ds.nccs.nasa.gov/thredds/ncss/grid/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc/dataset.html
!wget -O test.nc "https://ds.nccs.nasa.gov/thredds/ncss/grid/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc?var=pr&north=40&west=66&east=100&south=8&horizStride=1&time=1950-12-31T12:00:00Z&&accept=netcdf3"

# The -O flag in wget sets the name of the file that the download is saved to
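
The subsetting request above is an ordinary URL whose query string carries the variable, bounding box, time and output format. A hedged sketch of assembling such a URL in Python follows. The helper name `build_ncss_url` is hypothetical, and the parameter names are simply those that appear in the `wget` command above.

```python
from urllib.parse import urlencode

# "NetcdfSubset" endpoint of the file used in this notebook
NCSS_BASE = (
    "https://ds.nccs.nasa.gov/thredds/ncss/grid/AMES/NEX/GDDP-CMIP6/MIROC6/"
    "historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc"
)


def build_ncss_url(base, var, north, west, east, south, time, fmt="netcdf3"):
    """Assemble a subsetting URL from a variable name, bounding box and time stamp."""
    params = {
        "var": var,
        "north": north,
        "west": west,
        "east": east,
        "south": south,
        "horizStride": 1,
        "time": time,
        "accept": fmt,
    }
    # safe=":" keeps the colons of the ISO time stamp readable in the URL
    return f"{base}?{urlencode(params, safe=':')}"


url = build_ncss_url(NCSS_BASE, "pr", 40, 66, 100, 8, "1950-12-31T12:00:00Z")
print(url)
```

Building the URL from named values makes it easier to change the region or date in a loop than editing a long string by hand.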

In [None]:
# Importing packages
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Plotting the downloaded data
f = xr.open_dataset("test.nc")
f['pr'][0].plot()

In [None]:
# Deleting the downloaded data
!rm test.nc

### (ii) Loading data into Python without downloading

In [None]:
# You can use the "OpenDAP" URL to directly load data into Python
# https://ds.nccs.nasa.gov/thredds/dodsC/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc.html
xr.open_dataset('https://ds.nccs.nasa.gov/thredds/dodsC/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc')

In [None]:
# The OpenDAP servers provide the functionality to subset the data as needed
xr.open_dataset("https://ds.nccs.nasa.gov/thredds/dodsC/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc?lat[0:1:100],pr[0:1:0][0:1:0][0:1:0],time[0:1:10],lon[0:1:100]")

In [None]:
# Or you can read the full data file and subset it using xarray
f = xr.open_dataset('https://ds.nccs.nasa.gov/thredds/dodsC/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_1950_v1.1.nc')
f.sel(lat=slice(8,40), lon=slice(66,100)).pr[0].plot()

### (iii) Reading in multiple files together

In [None]:
# Using the "OpenDAP" URLs for different files, you can read multiple files together without downloading

# For this you would need to first create a list of the file URLs
base_url = 'https://ds.nccs.nasa.gov/thredds/dodsC/AMES/NEX/GDDP-CMIP6/MIROC6/historical/r1i1p1f1/pr/pr_day_MIROC6_historical_r1i1p1f1_gn_'
years = range(1950, 1952)  # Adjust the range as needed

file_urls = [f"{base_url}{year}_v1.1.nc" for year in years]
file_urls

In [None]:
# Then use the list of URLs to open all of them directly using xarray function to open multiple files
f = xr.open_mfdataset(file_urls, combine='by_coords')
f.sel(lat=slice(8,40), lon=slice(66,100))

## 2.3.2. APIs (Application Programming Interfaces)
Nowadays, many data providers offer RESTful APIs that allow users to query and retrieve data programmatically. APIs are powerful for accessing dynamic content, filtering by time, location, or variable, and automating data workflows. The filtering can easily be done using Python scripts.
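
As a minimal sketch of what such a programmatic query looks like, the helper below assembles a request URL that filters by site, time range and variable. The function name `build_location_query` is hypothetical; the URL layout mirrors the COSMOS API request used later in this notebook and may differ for other providers.

```python
from datetime import datetime
from urllib.parse import urlencode


def build_location_query(base_url, collection, site, start, end, parameters):
    """Assemble a request URL that filters by site, time range and variable."""
    time_format = "%Y-%m-%dT%H:%M:%SZ"
    params = {
        "datetime": f"{start.strftime(time_format)}/{end.strftime(time_format)}",
        "parameter-name": ",".join(parameters),
    }
    # safe="/:," leaves the date-range separator, colons and commas unescaped
    query = urlencode(params, safe="/:,")
    return f"{base_url}/collections/{collection}/locations/{site}?{query}"


example_url = build_location_query(
    "https://cosmos-api.ceh.ac.uk",
    "1D",
    "ALIC1",
    datetime(2016, 1, 1),
    datetime(2022, 12, 31),
    ["ta_max"],
)
print(example_url)
```

Changing the site, dates or parameter list changes only the arguments, which is what makes API access easy to automate.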

***Demonstration examples:***

[Climate Data Store](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=overview)

[COSMOS API](https://cosmos-api.ceh.ac.uk/docs)

For the training session we will access station observations from the [COSMOS-UK network](https://www.ceh.ac.uk/our-science/projects/cosmos-uk). We will retrieve the observed Daily Maximum Air Temperature (TA_MAX) for 2016--2022 at one of the COSMOS stations, [Alice Holt (ALIC1)](https://cosmos.ceh.ac.uk/sites/ALIC1), directly from the API.
![COSMOS_ALIC1.png](https://raw.githubusercontent.com/NERC-CEH/UKCEH_Summer_School/refs/heads/main/content/COSMOS_ALIC1.png)

### (i) Importing required packages

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
from datetime import datetime
import io
import json
import requests
import zipfile
import matplotlib.pyplot as plt

### (ii) Pre-written functions

In [None]:
# Pre-written functions are generally given on the API webpage
# Here we are using these for accessing COSMOS data.
# Please see https://cosmos-api.ceh.ac.uk/python_examples for code examples
# Please see https://cosmos-api.ceh.ac.uk/docs for more details


def get_api_response(url, csv=False):
    """Helper function to send request to API and get the response

    :param str url: The URL of the API request
    :param bool csv: Whether this is a CSV request. Default False.
    :return: API response
    """
    # Send request and read response
    print(url)
    response = requests.get(url)

    if csv:
        return response
    else:
        # Decode from JSON to Python dictionary
        return json.loads(response.content)


def get_collection_parameter_info(params):
    """A function for wrangling the collection information into a more visually appealing format!"""
    df = pd.DataFrame.from_dict(params)
    df = df.T[["label", "description", "unit", "sensorInfo"]]

    df["unit_symbol"] = df["unit"].apply(lambda x: x["symbol"]["value"])
    df["unit_label"] = df["unit"].apply(lambda x: x["label"])
    df["sensor_depth"] = df["sensorInfo"].apply(
        lambda x: None if pd.isna(x) else x["sensor_depth"]["value"]
    )

    df = df.drop(["sensorInfo", "unit"], axis=1)

    return df


def format_datetime(dt):
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def read_json_collection_data(json_response):
    """Wrangle the response JSON from a COSMOS-API data collection request into a more usable format - in this case a Pandas Dataframe

    :param dict json_response: The JSON response dictionary returned from a COSMOS-API data collection request
    :return: Dataframe of data
    :rtype: pd.DataFrame
    """
    # The response is a list of dictionaries, one for each requested site

    # You can choose how you want to build your dataframes.  Here, I'm just loading all stations into one big dataframe.
    # But you could modify this for your own use cases.  For example you might want to build a dictionary of {site_id: dataframe}
    # to keep site data separate, etc.
    master_df = pd.DataFrame()

    for site_data in json_response["coverages"]:
        # Read the site ID
        site_id = site_data["dct:identifier"]

        # Read the time stamps of each data point
        time_values = pd.DatetimeIndex(site_data["domain"]["axes"]["t"]["values"])

        # Now read the values for each requested parameter at each of the time stamps
        param_values = {
            param_name: param_data["values"]
            for param_name, param_data in site_data["ranges"].items()
        }

        # And put everything into a dataframe
        site_df = pd.DataFrame.from_dict(param_values)
        site_df["datetime"] = time_values
        site_df["site_id"] = site_id

        site_df = site_df.set_index(["datetime", "site_id"])
        master_df = pd.concat([master_df, site_df])

    return master_df

### (iii) Accessing Station Observations

In [None]:
# We need to extract "ta_max" parameter for COSMOS station "ALIC1" over the period of 2016 -- 2022
start_date = format_datetime(datetime(2016, 1, 1))
end_date = format_datetime(datetime(2022, 12, 31))
query_date_range = f"{start_date}/{end_date}"
param_name = [
    "ta_max",
]
site_nm = "ALIC1"

In [None]:
# First we get the metadata for the COSMOS station
BASE_URL = "https://cosmos-api.ceh.ac.uk"
site_info_url = f"{BASE_URL}/collections/1D/locations"
site_info_response = get_api_response(site_info_url)

site_info = {}
for site in site_info_response["features"]:
    site_id = site["id"]
    site_name = site["properties"]["label"]
    coordinates = site["geometry"]["coordinates"]
    date_range = site["properties"]["datetime"]
    start_date, end_date = date_range.split("/")

    other_info = site["properties"]["siteInfo"]
    other_info = {key: d["value"] for key, d in other_info.items()}

    site_info[site_id] = {
        "site_name": site_name,
        "coordinates": coordinates,
        "start_date": start_date,
        "end_date": end_date,
    } | other_info

site_info_df = pd.DataFrame.from_dict(site_info).T
s_df = site_info_df[site_info_df.index == site_nm]
s_df

In [None]:
# Extracting the COSMOS station latitude and longitude from the whole metadata list
# COSMOS station latitude and longitude is required to calculate the nearest grid point on the CHESS grid to extract corresponding model data
# .iloc[0] selects the single row by position, since the index holds the site ID
site_latitude = s_df["coordinates"].iloc[0][0]
site_longitude = s_df["coordinates"].iloc[0][1]
print(
    "COSMOS Site "
    + site_nm
    + " Latitude: "
    + str(site_latitude)
    + " Longitude: "
    + str(site_longitude)
)

In [None]:
# Extracting COSMOS TA_MAX data for the station over the required period into a pandas dataframe
query_url = f'{BASE_URL}/collections/1D/locations/{site_nm}?datetime={query_date_range}&parameter-name={",".join(param_name)}'
resp = get_api_response(query_url)
df = read_json_collection_data(resp)
df = df.reset_index()
display(df)
print(df.shape)

In [None]:
# Calculating monthly climatological values of TA_MAX for the station over 2016--2022
df_site = (
    df.groupby(pd.PeriodIndex(df["datetime"], freq="M"))["ta_max"].mean().reset_index()
)
df_site["datetime"] = df_site.datetime.dt.to_timestamp()
df_site = df_site.groupby(df_site["datetime"].dt.month)["ta_max"].mean()
df_site
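
The calculation above averages in two steps: daily values to one mean per month of each year, then those monthly means to one value per calendar month. This gives every year the same weight within a calendar month, even if some months have missing days. The sketch below repeats the two steps on a synthetic series; the data values are assumptions chosen for illustration and are not COSMOS observations.

```python
import pandas as pd

# Synthetic daily series: 10 degC throughout 2016 and 12 degC throughout 2017
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
demo = pd.DataFrame({"datetime": dates, "ta_max": 10.0})
demo.loc[demo["datetime"].dt.year == 2017, "ta_max"] = 12.0

# Step 1: one mean per month of each year (24 values)
monthly = demo.groupby(pd.PeriodIndex(demo["datetime"], freq="M"))["ta_max"].mean()

# Step 2: average the monthly means across years (12 values, one per calendar month)
climatology = monthly.groupby(monthly.index.month).mean()
print(climatology)
```

With these synthetic values every calendar month has a climatological mean of 11, the average of the two years.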

## 2.3.3. Cloud Storage Access

Many datasets are now hosted in cloud storage, especially large-scale Earth observation data. Tools such as the AWS CLI, boto3 (Python), gsutil, and cloud-native file systems (e.g., s3fs, gcsfs) provide access to cloud-hosted data. Examples of cloud storage services are AWS S3, Google Cloud Storage, and Azure Blob Storage.

![Object_Store.png](https://raw.githubusercontent.com/NERC-CEH/UKCEH_Summer_School/refs/heads/main/content/Object_Store.png)


**OTHER RESOURCES:**
1. Please see the [GitHub repository](https://github.com/NERC-CEH/object_store_tutorial) with a guide on utilizing object storage.
2. Please see the video to hear more about [JASMIN Object Storage: Optimizing Performance for Climate Research](https://www.youtube.com/watch?v=xJ8qEXQAri0&list=PLhF74YhqhjqnXvjzFCKnw4TGAFnkVu7Qn&index=2) from the JASMIN User Conference 2023.

### (i) Importing required packages and pre-written functions

In [None]:
import fsspec
import s3fs
import zarr
import xarray as xr

In [None]:
def open_zarr_from_s3(endpoint_url: str,
                      store_path: str):
    """
    Open a Zarr dataset hosted on an S3‑compatible object store.

    Parameters
    ----------
    endpoint_url : str
        Base S3 endpoint, e.g. "https://chess-scape-o.s3-ext.jc.rl.ac.uk".
    store_path : str
        Path to the Zarr store inside the bucket, e.g.
        "ens01-year100kmchunk/tmax_01_year100km.zarr".

    Returns
    -------
    xr.Dataset
        The opened Zarr dataset.
    """
    # 1. Create an fsspec filesystem for the S3 endpoint
    fs = fsspec.filesystem("s3", asynchronous=True, anon=True,
                           endpoint_url=endpoint_url,)

    # 2. Wrap it in a Zarr store
    zstore = zarr.storage.FsspecStore(fs, path=store_path)

    # 3. Open the dataset with xarray
    ds = xr.open_zarr(zstore, decode_times=True, decode_cf=True)

    return ds

### (ii) Exploring the data in Object Store

In [None]:
# JASMIN Object Store tenancy we are using is chess-scape-o, the URL is as follows
jasmin_s3_url = "https://chess-scape-o.s3-ext.jc.rl.ac.uk"

In [None]:
# s3fs is a Python package that lets you explore the tenancy as well as read the data
# Here we use s3fs only to list the contents of a bucket; the data are read with the open_zarr_from_s3 function defined above
# For more information please see: https://pypi.org/project/s3fs/
s3 = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': jasmin_s3_url})
s3.ls('s3://ens01-year100kmchunk/')

# In the output you can see that, within the chess-scape-o tenancy, the bucket ens01-year100kmchunk
# holds 10 Zarr stores, one for each of 10 variables. These are for one of the chunk types tested.

### (iii) Accessing data and the associated metadata

In [None]:
# We are accessing TASMAX for the ensemble member #01 from the catalogue
chess_data_01 = open_zarr_from_s3(jasmin_s3_url, "ens01-year100kmchunk/tmax_01_year100km.zarr")
chess_data_01

In [None]:
# CHESS-SCAPE is on the British National Grid with Easting and Northing Coordinates.
# We also set the latitude and longitude as coordinates
chess_data_01 = chess_data_01.set_coords(("lat", "lon"))
chess_data_01

In [None]:
# Slicing for the time period 2016--2022
chess_data_01 = chess_data_01["tasmax"].sel(time=slice("2016-01-01", "2022-12-30"))
chess_data_01

In [None]:
%%capture
# Extracting data for the other ensemble members
# Ensemble member #04
chess_data_04 = open_zarr_from_s3(jasmin_s3_url, "ens04-year100kmchunk/tmax_04_year100km.zarr")
chess_data_04 = chess_data_04.set_coords(("lat", "lon"))
chess_data_04 = chess_data_04["tasmax"].sel(time=slice("2016-01-01", "2022-12-30"))

# Ensemble member #06
chess_data_06 = open_zarr_from_s3(jasmin_s3_url, "ens06-year100kmchunk/tmax_06_year100km.zarr")
chess_data_06 = chess_data_06.set_coords(("lat", "lon"))
chess_data_06 = chess_data_06["tasmax"].sel(time=slice("2016-01-01", "2022-12-30"))

# Ensemble member #15
chess_data_15 = open_zarr_from_s3(jasmin_s3_url, "ens15-year100kmchunk/tmax_15_year100km.zarr")
chess_data_15 = chess_data_15.set_coords(("lat", "lon"))
chess_data_15 = chess_data_15["tasmax"].sel(time=slice("2016-01-01", "2022-12-30"))

### (iv) Finding the grid point nearest to the observed station

In [None]:
# Function to derive the data for the nearest grid point to the station lat lon
def find_chess_tile(lat, lon, latlon_ref):
    """
    Created by Doran Khamis (dorkha@ceh.ac.uk)
    Function to calculate the nearest grid point
    of a given lat lon value within a gridded dataset
    The input data is the latitude, longitude of the station
    and the grid reference (latlon_ref) of the gridded dataset
    The function returns the y and x index for the gridded dataset
    which can be used to derive the nearest grid point
    This function assumes equal length lat/lon vectors in latlon_ref
    """
    dist_diff = np.sqrt(
        np.square(latlon_ref.lat.values - lat) + np.square(latlon_ref.lon.values - lon)
    )
    chesstile_yx = np.where(dist_diff == np.min(dist_diff))
    return chesstile_yx
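
To see how the nearest-point search behaves, the sketch below applies the same distance calculation to a small synthetic grid. It is a variant rather than the function above: the name `nearest_grid_index` is hypothetical, it uses `np.argmin` with `np.unravel_index` to return a single index pair instead of the arrays returned by `np.where`, and the grid values are made up for illustration. As in the function above, distance is measured in degrees, which is an approximation.

```python
import numpy as np


def nearest_grid_index(lat, lon, grid_lat, grid_lon):
    """Return the (y, x) index of the grid cell closest to a point, measured in degrees."""
    dist = np.sqrt(np.square(grid_lat - lat) + np.square(grid_lon - lon))
    # argmin works on the flattened array; unravel_index converts it back to 2D indices
    y, x = np.unravel_index(np.argmin(dist), dist.shape)
    return int(y), int(x)


# Synthetic 2D latitude/longitude arrays (illustrative values, not the CHESS-SCAPE grid)
grid_lon, grid_lat = np.meshgrid(np.arange(-3.0, 2.0, 1.0), np.arange(50.0, 54.0, 1.0))
print(nearest_grid_index(51.2, -0.9, grid_lat, grid_lon))
```

For this grid the point (51.2, -0.9) falls closest to the cell at latitude 51 and longitude -1, so the result is the index pair `(1, 2)`.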

In [None]:
# We create a temporary CHESS-SCAPE gridded dataset array
chess_tmp = chess_data_01[0, :, :]
chess_tmp

In [None]:
# Extracting the x and y indices which point to the nearest grid point of the COSMOS station
y, x = find_chess_tile(site_latitude, site_longitude, chess_tmp)
print(y,x)

In [None]:
# Deleting the temporary array
del chess_tmp

### (v) Extracting the model ensemble data for the grid point nearest to the observed station

In [None]:
# Creating arrays for day, month and year from the time index
day = np.array([i.day for i in chess_data_01.time.values])
month = np.array([i.month for i in chess_data_01.time.values])
year = np.array([i.year for i in chess_data_01.time.values])

In [None]:
# Indexing the CHESS-SCAPE data with the x and y coordinates nearest to the observed station
ens = ["ENS01", "ENS04", "ENS06", "ENS15"]
chess_site_data = np.zeros((len(ens), len(day)))
chess_site_data[0, :] = chess_data_01[:, y, x].squeeze().values
chess_site_data[1, :] = chess_data_04[:, y, x].squeeze().values
chess_site_data[2, :] = chess_data_06[:, y, x].squeeze().values
chess_site_data[3, :] = chess_data_15[:, y, x].squeeze().values

In [None]:
# Converting CHESS-SCAPE temperature from Kelvin to deg Celsius
chess_site_data = chess_site_data - 273.15

In [None]:
# Creating a pandas dataframe for CHESS-SCAPE ensemble TASMAX
f = np.vstack((year, month, day, chess_site_data))
df = pd.DataFrame(f.T, columns=["YEAR", "MONTH", "DAY"] + ens)
df

In [None]:
# Calculating monthly climatology of TASMAX for all the ensemble members
df_model = df.groupby(["YEAR", "MONTH"])[ens].mean()
df_model = df_model.groupby(["MONTH"])[ens].mean()
df_model

### (vi) Comparing observations against modelled ensemble projection

In [None]:
# List of months
months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",]

In [None]:
# Calculating model ensemble mean, minimum and maximum
df_model_max = df_model.max(axis=1)
df_model_min = df_model.min(axis=1)
df_model_mn = df_model.mean(axis=1)

In [None]:
# Plotting monthly climatology of Daily Maximum Air Temperature from COSMOS station ALIC1 and nearest grid point on CHESS-SCAPE averaged over 2016--2022
fig = plt.figure(figsize=(10, 6))
plt.plot(months, df_site.values, color="k", lw=3, label="OBSERVED")
plt.plot(months, df_model_mn.values, color="b", ls="--", lw=2, label="MODEL MEAN")
plt.fill_between(
    months,
    df_model_min.values,
    df_model_max.values,
    color="b",
    alpha=0.3,
    label="MODEL SPREAD",
)
plt.ylabel(r"Daily Maximum Air Temperature ($^\circ$C)", fontsize=15)
plt.yticks(np.arange(7, 26, 2), fontsize=15)
plt.xticks(fontsize=15)
plt.legend(loc="upper left", fontsize=15)
plt.title(site_nm + " - Monthly Climatology 2016 - 2022", fontsize=20)
plt.show()

## 2.3.4. Others

There are several other ways to access data remotely, and increasingly, most of these methods are being supported by cloud-based infrastructure.  

  - Emerging cloud-native standards, such as the Spatiotemporal Asset Catalog (STAC), enable efficient cataloging and discovery of satellite or gridded climate data across distributed systems. These are often integrated with cloud platforms, allowing users to search and access data programmatically with minimal overhead. For example, [ECMWF Data Stores STAC Catalogue API](https://cds.climate.copernicus.eu/stac-browser/?.language=en).
  - Sensor Observation Services (SOS) are also available and are particularly useful for accessing real-time or near-real-time observations, especially from in-situ sensor networks or environmental monitoring platforms. For example, [UK AIR](https://uk-air.defra.gov.uk/data/data-availability). In the near future, the [FDRI project](https://fdri.org.uk/) will provide monitoring data covering the whole hydrological system.
  - Finally, platforms such as [Google Earth Engine (GEE)](https://developers.google.com/earth-engine/datasets) and other open data cube frameworks enable users to query and analyze massive gridded datasets remotely, leveraging built-in computational resources—eliminating the need to download data locally. The GEE platform offers [comprehensive tutorials](https://developers.google.com/earth-engine/guides/getstarted) that guide users through its functionality. As a self-learning exercise, we have included a GitHub repository developed at UKCEH, linked in this workshop’s directory. It provides step-by-step training on [extracting drought indicators using GEE in Python](https://github.com/eugmag/Google_Earth_Engine_python_demo/tree/main), allowing you to follow along at your own pace.

# Thank you! Come talk to us about the sort of data you are looking to download and/or access.