<img src='https://www.icos-cp.eu/sites/default/files/2017-11/ICOS_CP_logo.png' width=400 align=right>

# ICOS Carbon Portal Python Libraries


# Example: Reading an ObsPack collection

We will showcase two different ways of reading the ObsPack data with Python:

1. Download the file, unpack and load the data
2. Access the collection using our Python library<br>
   Load data directly into memory, no need to download. This works on this Jupyter service from the ICOS Carbon Portal
   or locally (on your computer) after authentication (see the `how_to_authenticate.ipynb` notebook).


## Introduction
The goal of this notebook is to show you the ease of use of Jupyter Notebooks and to familiarize you with ObsPack collections and the files stored within these ObsPacks. This notebook will first show how to read in a local NetCDF file that originates from an ObsPack collection (for a specific station, tracer, and sampling height), and how to subset and plot the measurements within the file. 

In a next step, we will show how to read in an ObsPack file using the ICOS Python Library `icoscp_core`. This is an API that allows you to read in all data stored on the ICOS Carbon Portal without having to download the data, providing an efficient way to read in and process atmospheric measurements and model results in memory. 

Documentation of the library, including information on running it locally, can be found on [PyPI.org](https://pypi.org/project/icoscp_core/).


## Import python packages
Note: We chose to make use of the xarray package instead of the netCDF4 package, as the former is more versatile and better supported.

In [None]:
import os
import xarray as xr
import matplotlib.pyplot as plt

# Uncomment the next line to make the matplotlib figures interactive (zoom, pan, move, save, ...), at the cost of a less polished look
# %matplotlib widget

from pylab import plot, boxplot, setp
import seaborn as sns
color_pal = sns.color_palette()
import zipfile
import datetime
import numpy as np
import math
from tqdm.notebook import tqdm
import pandas as pd

# ICOS Library
from icoscp_core.icos import meta, data


## Define functions, used later in the notebook

### Label
Returns a 'label' containing the station information, location and sampling height.

In [None]:
def label(dobj, ret='string'):
    '''
    A function to extract information from a digital object containing the full set of metadata of a data object.
    For convenience, we extract some of the information to display.
    Parameters: dobj: metadata object, as returned from an icoscp_core library call
                ret: 'string' (default) or 'dictionary', the format of the returned value

    Returns a string with the station id, name, sampling height, and location of the specific data object,
    or a dictionary with the same information if ret='dictionary'.
    This can, for example, be added to the data frame for grouping, or used as a series title.
    '''
    info ={}
    # sampling height (in meters above ground)
    info['sh'] = dobj.specificInfo.acquisition.samplingHeight
    # station id & name
    info['id'] = dobj.specificInfo.acquisition.station.id
    info['name'] = dobj.specificInfo.acquisition.station.org.name
    # station location
    info['lat'] = dobj.specificInfo.acquisition.station.location.lat
    info['lon'] = dobj.specificInfo.acquisition.station.location.lon
    info['alt'] = dobj.specificInfo.acquisition.station.location.alt
    
    if ret=='dictionary':
        return info
    
    return f"{info['id']} {info['name']} {info['sh']} m (Lat: {info['lat']}, Lon: {info['lon']}, Alt: {info['alt']})"
    

### Data Features
To work with the dataset, we create a function that adds some conveniences:
- add features so we can group by month or year
- change the index to datetime, makes it much easier to plot time series data

In [None]:
def data_feature(dobj):
    '''
    A data wrangling function to return the data as pandas dataframe
    with index set to the timestamp for easy plotting
    and add some features for data analysis.
    

    Parameters
    ----------
    dobj : pandas DataFrame built from the columns returned by an icoscp_core library call
    
    Returns
    -------
    Pandas Dataframe    
    '''
    df = dobj
    # set index to datetime
    df.index = pd.to_datetime(df.time)
    
    # add columns to aggregate on
    df['day'] = df.index.day
    df['month'] = df.index.month
    df['year'] = df.index.year
    
    return df    
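
The effect of these features can be sketched on synthetic data. The example below is only an illustration: the hourly series, its seasonal cycle, and the names `frame` and `monthly` are assumptions made up for this sketch, not ObsPack data. It applies the same two steps as `data_feature` (datetime index, then a `month` column) and aggregates by month.

```python
import numpy as np
import pandas as pd

# Synthetic hourly record for one year (hypothetical values, not ObsPack data)
time = pd.date_range('2020-01-01', '2020-12-31 23:00', freq='h')
day_of_year = time.dayofyear.to_numpy()

frame = pd.DataFrame({
    'time': time,
    # illustrative seasonal cycle in mol mol-1: high in winter, low in summer
    'value': 410e-6 + 5e-6 * np.cos(2 * np.pi * day_of_year / 366),
})

# Same steps as in data_feature: datetime index plus a column to aggregate on
frame.index = pd.to_datetime(frame['time'])
frame['month'] = frame.index.month

# Monthly mean in ppm, one row per calendar month
monthly = frame.groupby('month')['value'].mean() * 1e6
print(monthly.round(2))
```

With a datetime index in place, grouping by `month` or `year` becomes a one-liner, which is what the plots later in the notebook rely on.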

### Rolling z-score
The z-score shows how many standard deviations a value is away from the mean.<br>
For seasonal data, calculating the z-score from a rolling mean and standard deviation is a reasonable way to find and display outliers.<br>
The following function is adapted from [https://stackoverflow.com/](https://stackoverflow.com/questions/47164950/compute-rolling-z-score-in-pandas-dataframe).
It relies on the built-in rolling mean and standard deviation of pandas, which is reported there to be much faster than `rolling().apply()` with a custom function.
    

In [None]:
def zscore(data, window=720):
    '''
    Calculate the z-score, i.e. how many standard deviations a data point is away from the mean.
    We will use this to plot outliers. For this example we know that we have hourly
    measurements; we have opted for a default window size of a month (24*30=720).

    Parameters
    ----------
    data : Pandas series
    
    Returns
    -------
    Pandas Series
    '''
    r = data.rolling(window=window)    
    m = r.mean().shift(1)
    #m = r.mean()
    s = r.std(ddof=0).shift(1)
    z = (data-m)/s
    return z
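
To see how the rolling z-score behaves, here is a minimal sketch on synthetic data. The series, the injected spike at position 300, the shorter window of 48 samples, and the name `rolling_zscore` are all assumptions for illustration; the calculation itself mirrors the function above.

```python
import numpy as np
import pandas as pd

def rolling_zscore(series, window):
    # Mean and std of the preceding window; shift(1) excludes the current point
    r = series.rolling(window=window)
    mean = r.mean().shift(1)
    std = r.std(ddof=0).shift(1)
    return (series - mean) / std

# Synthetic signal: noise around 400 ppm with one artificial spike (hypothetical data)
rng = np.random.default_rng(0)
values = pd.Series(400 + rng.normal(0, 1, 500))
values.iloc[300] += 15

z = rolling_zscore(values, window=48)

# Points more than 3 standard deviations away from the rolling mean
outliers = z[z.abs() > 3]
print(outliers)
```

Because of the shift, the first `window` values of the z-score are NaN, so outliers at the very start of a record cannot be flagged this way.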

# 1. Download the file, unpack and load the data

### How to read in a local NetCDF file coming from an ObsPack collection

#### Read in the data
The following NetCDF file will be used as an example: `co2_cbw_tower-insitu_445_allvalid-207magl.nc`. This file contains CO2 mole fraction observations from the Cabauw station at 207 meters above ground level (a.g.l.).
Files like this are available if you download the collection from the data portal ([https://doi.org/10.18160/X450-GTAY](https://doi.org/10.18160/X450-GTAY)); you can then work offline on your own computer. For this demo, we have made one of the files available in the 'data' folder.

In [None]:
# Give the path of local nc file
path = './data/co2_cbw_tower-insitu_445_allvalid-207magl.nc'

# Open the local nc file
cbw_data = xr.open_dataset(path)

# Explore some of the characteristics of the data
print("These are the coordinates of the data set:")
print(str(list(cbw_data.coords)) + "\n")

print("These are the variables of the data set:")
print(str(list(cbw_data.keys())))

### Explore the ObsPack entry

In [None]:
cbw_data

### Subset the data

In [None]:
# Select a time period of interest. Here: JJA 2018 (label-based slices include the end date)
cbw_subset = cbw_data.sel(time=slice('2018-06-01', '2018-08-31'))
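
Label-based time slices include both endpoints, and an end label given as a date without a time covers that whole day. The sketch below illustrates this with a synthetic hourly pandas series (an assumption for illustration; xarray's `.sel` with a `slice` on a time coordinate follows the same label-based rules), comparing an end label of 2018-09-01 with 2018-08-31.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series spanning the summer of 2018 (hypothetical data)
time = pd.date_range('2018-05-30', '2018-09-03 23:00', freq='h')
series = pd.Series(np.arange(len(time)), index=time)

# End label 2018-09-01: all 24 hours of 1 September are included
until_sep_1 = series.loc['2018-06-01':'2018-09-01']

# End label 2018-08-31: exactly June, July and August
jja = series.loc['2018-06-01':'2018-08-31']

print(until_sep_1.index.max())
print(jja.index.max())
print(len(until_sep_1) - len(jja), "extra hourly values")
```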

### Creating a CO2 mixing ratio timeseries

In [None]:
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(12,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, cbw_subset.value * 1e6, '-', zorder=2, label = 'CO2 dry mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + ' [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([380,460])
plt.xticks(rotation = 45)
plt.legend()

# if you want to save the image... uncomment the following line
# plt.savefig('co2_cabauw.png')

### Assigning a new variable to the ObsPack using xarray

In [None]:
## Create a synthetic "modelled" mole fraction by adding random noise to the observations
## (normally distributed, mean 2.1 ppm, standard deviation 1.5 ppm)
noise = np.random.normal(2.1e-6, 1.5e-6, len(cbw_subset.value))
model = cbw_subset.value + noise

## Insert the newly created variable into the ObsPack dataset
## (using the xarray Dataset.assign() method)
cbw_subset = cbw_subset.assign(model=model)
cbw_subset

### Adding the synthetic simulation results to the timeseries plot

In [None]:
## Make a timeseries plot similar to the one before, but now adding the synthetic model results
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(10,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, cbw_subset.value * 1e6, '-', zorder=2, label = 'Observed CO2 dry mole fraction')
plt.plot(cbw_subset.time, cbw_subset.model * 1e6, '-', zorder=2, label = 'Modelled CO2 dry mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + ' [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([380,460])
plt.xticks(rotation = 0)
plt.legend()

### Plotting a timeseries of the residuals

In [None]:
## Make a residual timeseries plot
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(12,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, (cbw_subset.model - cbw_subset.value) * 1e6, '-', c = 'r', zorder=2, label = 'Model error / residual CO2 mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + '\n residual [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([-7.5,7.5])
plt.xticks(rotation = 10, fontsize=8)
plt.legend()

In [None]:
## Calculate the RMSE between the observed and synthetic CO2 mole fractions
MSE = np.square(np.subtract(cbw_subset.value,cbw_subset.model)).mean() 
MSE_ppm = np.square(np.subtract(cbw_subset.value*1e6,cbw_subset.model*1e6)).mean() 
 
RMSE = math.sqrt(MSE)
RMSE_ppm = math.sqrt(MSE_ppm)
print("Root Mean Square Error:")
print(RMSE)

print("Root Mean Square Error (in ppm):")
print(RMSE_ppm)
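
As a cross-check on these numbers, the RMSE can be split into a systematic and a random part: the mean squared error equals the squared mean residual (bias) plus the variance of the residuals. The sketch below uses synthetic observations (an assumption, not the Cabauw data) together with the same noise settings as above, so the RMSE should come out near sqrt(2.1² + 1.5²) ≈ 2.58 ppm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "observations" in mol mol-1 (hypothetical stand-in for cbw_subset.value)
obs = 410e-6 + rng.normal(0, 3e-6, 2000)
# Same noise settings as in the notebook: mean 2.1 ppm, standard deviation 1.5 ppm
noise = rng.normal(2.1e-6, 1.5e-6, obs.size)
model = obs + noise

# Work in ppm so the numbers are easy to read
residual_ppm = (model - obs) * 1e6
rmse_ppm = np.sqrt(np.mean(residual_ppm ** 2))

bias_ppm = residual_ppm.mean()     # systematic offset
spread_ppm = residual_ppm.std()    # random part (population std, ddof=0)

print(f"RMSE:   {rmse_ppm:.2f} ppm")
print(f"bias:   {bias_ppm:.2f} ppm")
print(f"spread: {spread_ppm:.2f} ppm")
print(f"sqrt(bias^2 + spread^2): {np.hypot(bias_ppm, spread_ppm):.2f} ppm")
```

Since the RMSE scales linearly with the data, converting the residuals to ppm before or after taking the root gives the same result.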

# 2. Access the collection using our python library
Load the data directly into memory, no need to download. This works on this Jupyter service from the ICOS Carbon Portal or locally (on your computer) after authentication (see the `how_to_authenticate.ipynb` notebook).

## Read the collection of ObsPack files
Now we proceed to read the collection of ObsPack files directly. You can think of a collection as a zip file containing many files. **Be aware: there might be a LOT of files.** The following example contains more than 100 data objects, and loading them all into memory can make things slow. The ObsPack collection for CO2, CH4, and N2O is published with a DOI, [https://doi.org/10.18160/YDMA-2X1H](https://doi.org/10.18160/YDMA-2X1H), which is what we use to access the data and metadata.

Note that ObsPacks are published as separate *data objects* for CO2, CH4, and, since summer 2024, also N2O, each with its own DOI. In parallel, an ObsPack *collection* combining the data for all three gases is published with a single DOI. This *collection* allows easy access using our ICOS CP Python libraries.

### Resolve the DOI


In [None]:
import requests
doi = "10.18160/YDMA-2X1H"
coll_uri = requests.head(f"https://doi.org/{doi}").headers.get('Location')
coll_uri

### Read collection metadata

In [None]:
coll = meta.get_collection_meta(coll_uri)
coll.title

In [None]:
coll.latestVersion

### Fast-forward to the latest version
This step is optional; use it only if the code should always work with the latest version of the collection.

In [None]:
coll = meta.get_collection_meta(coll.latestVersion)
coll.title

### Citation

Note that a citation string is only available for collections that have a DOI.

In [None]:
coll.references.citationString

### Collection description

In [None]:
coll.description

### A list of all the 'files' or data sets included
We will display only the first 5, but there are many more.

In [None]:
# Print a list of the data objects in the collection, which includes the links to the landing pages and the filename for each dataset
print(f"Available datasets: {len(coll.members)}")
coll.members[:5]

### Select CO2 data sets only 

In [None]:
co2_coll = [s for s in coll.members if 'co2' in s.name]
print(f"Available datasets: {len(co2_coll)}")
co2_coll[:5]

## Look at one dataset
The collection already contains a list of data objects, as displayed above. A data object contains metadata about the dataset and allows direct access to the actual measurements.<br>
Please have a look at the documentation of the `icoscp_core` Python library for further details on how to work with collections and datasets.<br>

This time we will use seaborn and pandas to create box plots for the stations. But before we loop through all the files, we have a look at a single dataset.

In [None]:
dataset = co2_coll[37] # just choosing an arbitrary station out of the collection

# This is the landing page uri information that we need for access to metadata and data of the object.
dataset.res

### Get the metadata

Extract information on the selected station and the selected dataset

In [None]:
# Get the data object metadata
dobj_meta = meta.get_dobj_meta(dataset.res)
dobj_meta.references.title

In [None]:
# many more properties are available...
print([p for p in dir(dobj_meta) if not p.startswith('_')])

In [None]:
# custom-built function using these properties ... see beginning of notebook
label(dobj_meta)

### Get the data
This step requires authentication when running locally (that is, not on an ICOS-hosted Jupyter service).

NOTE: the following example is mostly appropriate for a single data object, or for inhomogeneous lists of data objects of different data types. For homogeneous lists, it is highly recommended to use the `batch_get_columns_as_arrays` method, which performs much better because it makes fewer server calls.

In [None]:
dobj_cols = data.get_columns_as_arrays(dobj_meta)
dobj_data = pd.DataFrame(dobj_cols)
dobj_data.head()

In [None]:
# For CO2 convert units mol mol-1 to ppm
dobj_data['CO2 [ppm]'] = dobj_data['value'] * 1e6

In [None]:
# add some features to the dataset (see the function at the beginning of the notebook); this helps later on to group the data
dobj_data = data_feature(dobj_data)
dobj_data.head()

### Simple Plot
Obviously one would need to work more on the figure, for example by adding units and labels,<br>
but for now we just quickly plot the data.

In [None]:
simplefig, ax = plt.subplots(figsize=(10,4))
ax.grid(color='grey', linestyle='--', linewidth=0.5)
ax.set_title(label(dobj_meta)) 
ax.scatter(dobj_data.index, dobj_data['CO2 [ppm]'], s=0.5) # s = size of marker
simplefig.show()

### Add rolling mean

In [None]:
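# add an exponentially weighted mean with a span of approximately one month (720 hours)
# (only the CO2 column is selected, so that non-numeric columns cannot break the mean)
rm = dobj_data[['CO2 [ppm]']].ewm(span = 720).mean()
ax.plot(rm.index, rm['CO2 [ppm]'], color='black', linestyle='dashed')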
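
The smoothing in this section uses `ewm`, an exponentially weighted mean, which is not quite the same as a fixed-window rolling mean. The sketch below compares the two on a synthetic hourly series (an assumption for illustration, not station data) using the same span/window of 720.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series of 90 days (hypothetical values)
time = pd.date_range('2021-01-01', periods=24 * 90, freq='h')
rng = np.random.default_rng(1)
co2 = pd.Series(415 + rng.normal(0, 2, len(time)), index=time)

window = 720  # roughly one month of hourly data

# Fixed window: equal weight for the last 720 values, NaN until the window is full
rolling_mean = co2.rolling(window=window).mean()

# Exponential weights: recent values count more, defined from the first value on
ewm_mean = co2.ewm(span=window).mean()

print("NaN values, rolling:", int(rolling_mean.isna().sum()))
print("NaN values, ewm:    ", int(ewm_mean.isna().sum()))
```

The exponentially weighted version has the practical advantage of producing a value for every timestamp, while the fixed window leaves the first `window - 1` entries empty.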

### Add outliers from a rolling mean
As you can see, there are some outliers in the dataset. We will highlight them using the z-score (see the functions at the beginning of the notebook). The function calculates, for each data point, its distance from the mean of a rolling window in units of standard deviations. First we add the z-score to the dataset (we could have done this in the data features function as well). Assuming the data follow a normal distribution, we highlight data points that are more than 3 standard deviations away from the rolling mean. By default, the window size is approximately one month.

In [None]:
# add the zscore to the data set
dobj_data['zscore'] = zscore(dobj_data['CO2 [ppm]'])
zfilter = (dobj_data.zscore > 3) | (dobj_data.zscore < -3) 
outliers = dobj_data[zfilter]
ax.plot(outliers.index, outliers['CO2 [ppm]'], 'ro', markersize=1)

In [None]:
simplefig

### Aggregate data plot

In [None]:
# create a plot with running average and aggregate by month
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle(label(dobj_meta), fontsize=12, y=1.05)

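# an exponentially weighted mean with a span of about one month (30 days x 24 hours = 720) makes the trend clearly visible
# (only the CO2 column is selected, so that non-numeric columns cannot break the mean)
dobj_data[['CO2 [ppm]']].ewm(span = 720).mean().plot(y='CO2 [ppm]', grid=True, ax=ax1)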
ax1.set_title('rolling mean')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)

# aggregate by month, to show seasonality
dobj_data.groupby(dobj_data.index.month, sort=False, group_keys=False ).mean().sort_index().plot(y='CO2 [ppm]',grid=True, ax=ax2)
ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)

fig.show()

### Boxplot

In [None]:
# create a more fancy plot with seaborn

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle(label(dobj_meta), fontsize=12, y=1.05)
sns.boxplot(data=dobj_data, x=dobj_data.index.year, y='CO2 [ppm]', palette='Blues', ax=ax1)
ax1.set_title('values by year')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=45)

sns.violinplot(data=dobj_data, x=dobj_data.index.month, y='CO2 [ppm]', palette='Blues', ax=ax2)
ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)
ax2.set_xlabel('month')

fig.show()

## Loop through the collection

Since we now have a better understanding of the underlying data, we can plot many datasets side by side to compare them.<br>
We demonstrate a simple filter implementation for country and/or station id.<br>
The following example finds all stations from **Switzerland** and adds the **'Schauinsland'** station from Germany and **'Weybourne'** from Great Britain.

### Create a list of metadata objects
Remember, above we created a list of objects for CO2, called **co2_coll**.<br>
We now use this list and load the metadata for each dataset.

In [None]:
all_co2_meta = [meta.get_dobj_meta(do.res) for do in tqdm(co2_coll)]

### Filter by country and/or station
We will save the resulting list of metadata for further analysis. 
- You can add multiple countries as a filter, like this: ['CH', 'PL']. If you want to plot **ALL** countries, set the country filter to ['ALL']. You will find the country code for each digital object in its metadata with ```all_co2_meta[0].specificInfo.acquisition.station.countryCode ```
- The same applies for the station id: ```all_co2_meta[0].specificInfo.acquisition.station.id ```


In [None]:
# for the country filter the following applies
# ['ALL'] -> find all countries, equivalent to no filter
# [] -> empty list, find 'nothing'

country = ['CH']
stationId = ['SSL', 'WAO']

In [None]:
country_list = []
station_list = []

if 'ALL' in country:
    country_list = all_co2_meta
elif country:
    country_list = [do for do in all_co2_meta if do.specificInfo.acquisition.station.countryCode in country]

if stationId and 'ALL' not in country:
    station_list = [do for do in all_co2_meta if do.specificInfo.acquisition.station.id in stationId]
    
filtered_co2_meta = country_list + station_list

for selected in filtered_co2_meta:
    print(label(selected))
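
One thing to keep in mind with this filter: the country matches and the station matches are concatenated, so a station that is listed in `stationId` and also lies in one of the selected countries would appear twice. A possible single-pass variant is sketched below. The helper names `make_meta` and `filter_meta` are hypothetical, and the metadata objects are simple stand-ins built with `SimpleNamespace` that only mimic the attribute path used above; they are not real `icoscp_core` objects.

```python
from types import SimpleNamespace

def make_meta(station_id, country_code):
    # Stand-in with the same attribute path as the real metadata objects (illustration only)
    station = SimpleNamespace(id=station_id, countryCode=country_code)
    acquisition = SimpleNamespace(station=station)
    return SimpleNamespace(specificInfo=SimpleNamespace(acquisition=acquisition))

def filter_meta(all_meta, countries, station_ids):
    # 'ALL' disables filtering; otherwise keep an object if country OR station id matches
    if 'ALL' in countries:
        return list(all_meta)
    selected = []
    for do in all_meta:
        station = do.specificInfo.acquisition.station
        if station.countryCode in countries or station.id in station_ids:
            selected.append(do)  # each object is added at most once
    return selected

demo_meta = [make_meta('JFJ', 'CH'), make_meta('SSL', 'DE'),
             make_meta('WAO', 'GB'), make_meta('HTM', 'SE')]

# 'JFJ' matches both the country and the station filter, but is returned only once
picked = filter_meta(demo_meta, ['CH'], ['SSL', 'WAO', 'JFJ'])
picked_ids = [do.specificInfo.acquisition.station.id for do in picked]
print(picked_ids)
```

Unlike the concatenation above, this variant keeps the objects in their original order instead of listing all country matches first.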

### Load data and create a plot

In [None]:
# create a plot with running average and aggregate by month
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle('Aggregate data', fontsize=12)

ax1.set_title('moving average')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)

ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)

for do_meta in tqdm(filtered_co2_meta):

    # extract the data frame
    df = pd.DataFrame(data.get_columns_as_arrays(do_meta))
    
    # For CO2 convert units mol mol-1 to ppm
    df['CO2 [ppm]'] = df['value'] * 1e6
    
    # add features for aggregation
    df = data_feature(df)
    
    # plot    
    serieslabel = label(do_meta)
    df[['CO2 [ppm]']].ewm(span = 3600).mean().plot(y='CO2 [ppm]', grid=True, ax=ax1, label=serieslabel)
    df = df.groupby(['month'], sort=False, group_keys=False ).mean().sort_index()
    df.plot(y='CO2 [ppm]',grid=True, ax=ax2, label=serieslabel, marker='o', linestyle='dashed')
    
    # move the legend below the plot
    ax1.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2),fancybox=True, shadow=True, ncol=1)
    ax2.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2),fancybox=True, shadow=True, ncol=1)
    
plt.show()
    

### BoxPlot many stations
We will use the same filter applied from above

In [None]:
# depending on how many stations you want to display, 
# you probably want to adjust the figure and font size
# for all 132 stations you can try the following values
#figuresize = [10,30] 
#fontsize = 9

figuresize = [8,8] 
fontsize = 12

# Initialize the figure
f, ax = plt.subplots(figsize=figuresize)
frames= []

#------------------------------------
for do_meta in tqdm(filtered_co2_meta):

    # extract the data frame
    df = pd.DataFrame(data.get_columns_as_arrays(do_meta))
    
    # For CO2 convert units mol mol-1 to ppm
    df['CO2 [ppm]'] = df['value'] * 1e6
    
    # add features for aggregation
    df = data_feature(df)
    df['label'] = label(do_meta)
    frames.append(df)

if frames:
    result = pd.concat(frames)
    result.boxplot('CO2 [ppm]',by='label',vert=False, ax=ax, fontsize=fontsize,flierprops={'marker': '.'},patch_artist = True)

    f.suptitle('ObsPack CO2 mole fraction observations')
    ax.set_xlabel('CO2 [ppm]')
    plt.show()
else:
    print('no results')
    f.clear()
    ax.clear()