<img src='https://www.icos-cp.eu/sites/default/files/2017-11/ICOS_CP_logo.png' width=400 align=right>

# ICOS Carbon Portal Python Library
## Example: Reading an ObsPack collection

We will showcase two different ways of reading the ObsPack data with Python:

1. Download the file, unpack it, and load the data.
2. Access the collection using our Python library, directly into memory, with no need to download. This works externally (on your own computer) or after you log in to this Jupyter Service from the ICOS Carbon Portal.

## Documentation
Full documentation for the ICOS Python library is available on the [project page](https://icos-carbon-portal.github.io/pylib/). Installation instructions and the wheel are on [pypi.org](https://pypi.org/project/icoscp/), and the source is available on [GitHub](https://github.com/ICOS-Carbon-Portal/pylib).

## Introduction
The goal of this notebook is to show you the ease of use of Jupyter Notebooks and to familiarize you with ObsPack collections and the files stored within these ObsPacks. This notebook will first show how to read in a local NetCDF file that originates from an ObsPack collection (for a specific station, tracer, and measurement height), and how to subset and plot the measurements within the file. 

In a next step, we will show how to read in an ObsPack file using the ICOS Python Library (i.e. the *pylib* Python package). This is an API that allows you to read all data stored on the ICOS Carbon Portal directly into memory without having to download the files to your local disk first, providing an efficient way to read in and process atmospheric measurements and model results.

Even though direct access of ObsPack collections is not yet supported by the *pylib* in a convenient way, we will show you an alternative way of accessing the ObsPack collections through the internal ICOS servers.

## Import python packages
Note: We chose to make use of the xarray package instead of the netCDF4 package, as the former is more versatile and better supported.

In [None]:
import os
import xarray as xr
import matplotlib.pyplot as plt
# Make the matplotlib figures interactive with 
# zoom, pan, move, save figures etc..
# %matplotlib widget
from pylab import plot, boxplot, setp
import seaborn as sns
color_pal = sns.color_palette()
import zipfile
import datetime
import numpy as np
import math
from tqdm.notebook import tqdm
import pandas as pd
from icoscp.collection import collection
from icoscp.cpb.dobj import Dobj

# 1. Download the file, unpack it, and load the data

### How to read in a local NetCDF file coming from an ObsPack collection

#### Read in the data
The following NetCDF file will be used as an example: co2_cbw_tower-insitu_445_allvalid-207magl.nc. This file contains CO2 mole fraction observations from the Cabauw station at 207 meters a.g.l.
Files like this are available if you download the collection from the data portal ( [https://www.icos-cp.eu/data-products/CEC4-CAGK](https://www.icos-cp.eu/data-products/CEC4-CAGK) ). You can then work offline on your own computer. For this demo, we have made one of the files available in the current 'data' folder.

In [None]:
# Give the path of local nc file
path = './data/co2_cbw_tower-insitu_445_allvalid-207magl.nc'

# Open the local nc file
cbw_data = xr.open_dataset(path)

# Explore some of the characteristics of the data
print("These are the coordinates of the data set:")
print(str(list(cbw_data.coords)) + "\n")

print("These are the data variables of the data set:")
print(str(list(cbw_data.data_vars)))
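
To make the distinction between coordinates and data variables concrete without the file at hand, the sketch below builds a tiny in-memory dataset. The variable names `time` and `value` mirror the ObsPack file; the numbers and the `site_name` attribute value are made up for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three made-up hourly CO2 mole fractions (mol/mol), not real observations
time = pd.date_range("2018-06-01", periods=3, freq=pd.Timedelta(hours=1))
toy = xr.Dataset(
    data_vars={"value": ("time", np.array([410e-6, 411e-6, 409e-6]))},
    coords={"time": time},
    attrs={"site_name": "Toy station"},
)

coord_names = list(toy.coords)        # coordinates label the axes
variable_names = list(toy.data_vars)  # data variables hold the measurements
dimension_sizes = dict(toy.sizes)     # length of each dimension

print(coord_names, variable_names, dimension_sizes)
```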

### Explore the ObsPack entry

In [None]:
cbw_data

### Subset the data

In [None]:
# Select a time period of interest. Here: JJA (June to August) 2018.
# Label-based slices include both endpoints, so the slice ends on 31 August.
cbw_subset = cbw_data.sel(time=slice('2018-06-01', '2018-08-31'))
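
Label-based time slices in xarray include both endpoints, and a date given without a time of day covers that whole day. The sketch below checks this on a synthetic hourly series; the constant value of 410 ppm and the date range are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic hourly series covering 100 days from 30 May 2018 (values are made up)
time = pd.date_range("2018-05-30", periods=24 * 100, freq=pd.Timedelta(hours=1))
ds = xr.Dataset({"value": ("time", np.full(time.size, 410e-6))}, coords={"time": time})

# Select June, July and August; the end label includes all of 31 August
jja = ds.sel(time=slice("2018-06-01", "2018-08-31"))

first = pd.Timestamp(jja.time.values[0])
last = pd.Timestamp(jja.time.values[-1])
n_hours = jja.sizes["time"]
print(first, last, n_hours)
```

Because the end label is inclusive, ending the slice on `'2018-09-01'` would also pull in the observations of 1 September.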

### Creating a CO2 mixing ratio timeseries

In [None]:
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(12,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, cbw_subset.value * 1e6, '-', zorder=2, label = 'CO2 dry mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + ' [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([380,460])
plt.xticks(rotation = 45)
plt.legend()

# if you want to save the image... uncomment the following line
# plt.savefig('co2_cabauw.png')

### Assigning a new variable to the ObsPack using xarray

In [None]:
## Create a synthetic "model" mole fraction by adding random noise
## (mean 2.1 ppm, standard deviation 1.5 ppm) to the observations
noise = np.random.normal(2.1e-6, 1.5e-6, len(cbw_subset.value))
model = cbw_subset.value + noise

## Insert the newly created variable into the ObsPack dataset
## (uses the Dataset.assign() method, which returns a new dataset)
cbw_subset = cbw_subset.assign(model=model)
cbw_subset

### Adding the synthetic simulation results to the timeseries

In [None]:
## Make a timeseries plot similar to the one before, but now adding the synthetic model results
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(10,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, cbw_subset.value * 1e6, '-', zorder=2, label = 'Observed CO2 dry mole fraction')
plt.plot(cbw_subset.time, cbw_subset.model * 1e6, '-', zorder=2, label = 'Modelled CO2 dry mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + ' [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([380,460])
plt.xticks(rotation = 0)
plt.legend()

### Plotting a timeseries of the residuals

In [None]:
## Make a residual timeseries plot
fig, ax = plt.subplots(nrows=1, ncols = 1, figsize=(12,4))
ax.grid(zorder=0)
plt.plot(cbw_subset.time, (cbw_subset.model - cbw_subset.value) * 1e6, '-', c = 'r', zorder=2, label = 'Model error / residual CO2 mole fraction')
ax.set_title(cbw_subset.site_name)
ax.set_ylabel(str(cbw_data.value.standard_name) + '\n residual [ppm]')
ax.set_xlabel('time')
ax.set_xlim([min(cbw_subset.time),max(cbw_subset.time)])
ax.set_ylim([-7.5,7.5])
plt.xticks(rotation = 10, fontsize=8)
plt.legend()

In [None]:
## Calculate the RMSE between the observed and synthetic CO2 mole fractions
MSE = np.square(np.subtract(cbw_subset.value,cbw_subset.model)).mean() 
MSE_ppm = np.square(np.subtract(cbw_subset.value*1e6,cbw_subset.model*1e6)).mean() 
 
RMSE = math.sqrt(MSE)
RMSE_ppm = math.sqrt(MSE_ppm)
print("Root Mean Square Error:")
print(RMSE)

print("Root Mean Square Error (in ppm):")
print(RMSE_ppm)
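
As a sanity check on the printed numbers: the residual here is exactly the added noise, so for normally distributed noise with mean $\mu$ and standard deviation $\sigma$ the RMSE should approach $\sqrt{\mu^2 + \sigma^2}$. The sketch below verifies this with a large synthetic sample; the fixed seed and the sample size are assumptions made for reproducibility and are not part of the notebook.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed, assumed here for reproducibility
bias, spread = 2.1e-6, 1.5e-6         # same mean and standard deviation as the noise above
noise = rng.normal(bias, spread, size=200_000)

# Residual = model - observation = noise, so RMSE is sqrt(mean(noise**2))
rmse = math.sqrt(np.mean(noise ** 2))
rmse_ppm = rmse * 1e6                 # scaling by 1e6 converts mol/mol to ppm

# For normally distributed noise: RMSE**2 = bias**2 + spread**2
expected_ppm = math.hypot(bias, spread) * 1e6
print(f"RMSE: {rmse_ppm:.2f} ppm, expected: {expected_ppm:.2f} ppm")
```

The RMSE in ppm is simply the RMSE in mol/mol multiplied by 1e6, so the two values printed in the cell above differ only by that factor.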

# 2. Access the collection using our Python library
The collection is read directly into memory, with no need to download. This works externally (on your own computer) or after you log in to this Jupyter Service from the ICOS Carbon Portal.

## Read the collection of ObsPack files
Now we proceed to read the collection of ObsPack files directly into memory. You can think of a collection as a zip file containing many files. **Be aware: there might be a LOT of files.** The following example contains 132 data objects, and loading them all into memory can make things slow. The ObsPack is published with a DOI, [https://doi.org/10.18160/CEC4-CAGK](https://doi.org/10.18160/CEC4-CAGK), which is what we use to access the data and metadata.

### Load the collection with the ICOS Python library

In [None]:
# official DOI
collection_doi = '10.18160/CEC4-CAGK'

# get the collection. Returns a collection object with metadata about the
# collection itself and a list of data objects (icoscp Dobj) in `.data`.
coll = collection.get(collection_doi)

### Display information about the collection

In [None]:
print(coll.title,' \n') 
print(coll.description,' \n') 
print(coll.citation,' \n') 

### A list of all the 'files' or data sets included
We will display only the first 5, but there are many more.

In [None]:
coll.datalink[0:5]

In [None]:
coll.data[0:5]

### Look at a dataset
The collection already contains a list of data objects, as displayed above. A data object contains metadata about the dataset and allows direct access to the actual measurements.
Please have a look at the documentation of the 'pylib' for further details on how to work with collections and datasets. For the moment we will just use the .data attribute, which provides a list of all data objects within the collection. Later on we will use pandas and seaborn to create box plots for the stations. But before we loop through all the files, we have a look at a single data set from the collection.

In [None]:
# configure all the digital objects, to make sure we have the correct timestamp
for do in coll.data:
    do.dateTimeConvert = False

In [None]:
dataset = coll.data[25] # just choosing a random station out of the collection

In [None]:
data = dataset.data
data.head()

### Dataframe adjustments
To work with the dataset, we create a feature function that adds some conveniences:
- add a few features (so we can group by month or year)
- change the index to datetime, which makes it much easier to plot time series data
- add a column with 'ppm' values

We also extract some metadata from the station with the label function:
- return a string with station id, name, and sampling height
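
The effect of these steps can be sketched on a small hand-made data frame. The column names `time` and `value` follow the data frame shown above; the three rows are invented for illustration and are not real measurements.

```python
import pandas as pd

# Hypothetical stand-in for `dataset.data`: timestamps plus mole fractions in mol/mol
df = pd.DataFrame({
    "time": ["2018-01-15 12:00:00", "2018-07-15 12:00:00", "2019-01-15 12:00:00"],
    "value": [412.0e-6, 402.5e-6, 415.0e-6],
})

df.index = pd.to_datetime(df["time"])   # datetime index for time series handling
df["month"] = df.index.month            # feature to group by month
df["year"] = df.index.year              # feature to group by year
df["ppm"] = df["value"] * 1e6           # mol/mol converted to ppm

# With the features in place, aggregation becomes a one-liner
yearly_mean = df.groupby("year")["ppm"].mean()
print(yearly_mean)
```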

In [None]:
def data_feature(df):
    '''
    Set the index of the data frame to the timestamp
    and add some features for data analysis.
    '''
    # set index to datetime
    df.index = pd.to_datetime(df.time)
    
    # add columns to aggregate on
    df['day'] = df.index.day
    df['month'] = df.index.month
    df['year'] = df.index.year
    
    # multiply the *value* by 10^6 to display 'ppm'
    df['ppm'] = df.value * 1e6
    
    return df

In [None]:
data = data_feature(data)
data.head()

The following function extracts some station metadata as a string. We will use this later on to group data.<br>

In [None]:
def label(do):
    '''
    A function to extract information from the metadata.
    A digital object (do) includes a very rich set of metadata.
    For convenience we extract some of the information to display.
    Returns a string, which is added to the data frame for grouping.
    '''
    info ={}
    # sampling height
    info['sh'] = do.meta['specificInfo']['acquisition']['samplingHeight']
    # station id & name
    info['id'] = do.meta['specificInfo']['acquisition']['station']['id']
    info['name'] = do.meta['specificInfo']['acquisition']['station']['org']['name']
    
    return f"{info['id']} {info['name']} {info['sh']} m"

In [None]:
# Print the first 20 stations in the collection
for do in coll.data[0:20]:
    print(label(do))
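
The nested lookups in `label` assume that every key is present. A more defensive variant, sketched here on a hand-written metadata dictionary, falls back to a placeholder when a key is missing. The function name `safe_label`, the placeholder strings, and the example dictionary are illustrative assumptions and not output from the library.

```python
def safe_label(meta):
    """Build 'id name height m' from nested metadata, tolerating missing keys."""
    acquisition = meta.get("specificInfo", {}).get("acquisition", {})
    station = acquisition.get("station", {})
    station_id = station.get("id", "unknown")
    name = station.get("org", {}).get("name", "unknown")
    height = acquisition.get("samplingHeight", "?")
    return f"{station_id} {name} {height} m"


# Hand-written example that mimics the structure used by `label`
example_meta = {
    "specificInfo": {
        "acquisition": {
            "samplingHeight": 207.0,
            "station": {"id": "CBW", "org": {"name": "Cabauw"}},
        }
    }
}

full_label = safe_label(example_meta)
empty_label = safe_label({})
print(full_label)
print(empty_label)
```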

### Simple plot

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
ax.grid(color='grey', linestyle='--', linewidth=0.5)
ax.set_title('Station: ' + str(dataset.station['location']))
ax.scatter(data.time, data.value, s=0.5) # s = size of marker
fig.show()

### Aggregate data plot

In [None]:
# create a plot with a running average and an aggregate by month
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle('Station: ' + str(dataset.station['location'])+ '\n', fontsize=12)

# an exponentially weighted moving average makes the long-term trend clearly visible
data.ewm(span = 3600).mean().plot(y='ppm', grid=True, ax=ax1)
ax1.set_title('moving average')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)

# aggregate by month, to show seasonality
data.groupby(['month'], sort=False, group_keys=False ).mean().sort_index().plot(y='ppm',grid=True, ax=ax2)
ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)

fig.show()
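
The smoothing in the left panel is an exponentially weighted moving average. In pandas, `span` translates into a smoothing factor `alpha = 2 / (span + 1)`, so `span = 3600` gives a very small `alpha` and therefore heavy smoothing. The sketch below reproduces the pandas result by hand for a three-point series; the values and the short span are made up to keep the arithmetic readable.

```python
import pandas as pd

series = pd.Series([400.0, 410.0, 420.0])  # made-up ppm values
span = 3
alpha = 2 / (span + 1)                     # smoothing factor derived from the span

smoothed = series.ewm(span=span).mean()

# With the default adjust=True, each point is a weighted mean of all earlier
# points, with weight (1 - alpha)**i for the value i steps in the past
weights = [(1 - alpha) ** i for i in range(3)]
manual_last = (
    weights[0] * series.iloc[2]
    + weights[1] * series.iloc[1]
    + weights[2] * series.iloc[0]
) / sum(weights)

print(smoothed.iloc[-1], manual_last)
```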

### Boxplot

In [None]:
# create a fancier plot with seaborn

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle('Station: ' + str(dataset.station['location'])+ '\n', fontsize=12)
sns.boxplot(data=data, x='year', y='ppm', palette='Blues', ax=ax1)
ax1.set_title('values by year')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=45)

sns.violinplot(data=data, x='month', y='ppm', palette='Blues', ax=ax2)
ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)

fig.show()

### Loop through the collection

Since we now have a better understanding of the underlying data, we can plot many datasets side by side to compare.<br>
We demonstrate a simple filter implementation for country and/or station id. If you want to have all (countries or stations), set the filter to an empty list.<br>
The following example finds all stations from Switzerland and adds the 'Schauinsland' station from Germany and 'Weybourne' from Great Britain.
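
The intended filter behaviour can be sketched in isolation: a station is kept if its country or its id is listed, and two empty lists mean that no filter is applied. The helper name `keep_station` and the list of (country code, station id) pairs are hypothetical and serve only to illustrate the logic.

```python
def keep_station(country_code, station_id, countries, station_ids):
    """Return True if the station passes the country/station filter."""
    if not countries and not station_ids:
        return True  # no filter set: keep everything
    return country_code in countries or station_id in station_ids


# Hypothetical (country code, station id) pairs for illustration
stations = [("CH", "JFJ"), ("DE", "SSL"), ("GB", "WAO"), ("NL", "CBW")]

selected = [sid for cc, sid in stations if keep_station(cc, sid, ["CH"], ["SSL", "WAO"])]
everything = [sid for cc, sid in stations if keep_station(cc, sid, [], [])]

print(selected)
print(everything)
```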

In [None]:
# filter by country, you can add multiple countries as filter, or leave empty for all
country = ['CH']

# filter by station id, you can add multiple stations as filter, or leave empty for all
stationid = ['SSL', 'WAO']

# create a plot with a running average and an aggregate by month
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))
fig.suptitle('Aggregate data', fontsize=12)

ax1.set_title('moving average')
ax1.grid(color='grey', linestyle='--', linewidth=0.5)


ax2.set_title('aggregate by month')
ax2.grid(color='grey', linestyle='--', linewidth=0.5)


for do in tqdm(coll.data):
    # filter by country and station id; if both filter lists are empty, keep every station
    if (country or stationid) and (do.station['specificInfo']['countryCode'] not in country) and (do.station['id'] not in stationid):
        continue 
        
    # plot
    data = data_feature(do.data)
    data.ewm(span = 3600).mean().plot(y='ppm', grid=True, ax=ax1, label=f"{do.station['id']}")
    data = data.groupby(['month'], sort=False, group_keys=False ).mean().sort_index()
    data.plot(y='ppm',grid=True, ax=ax2, label=do.station['id'],marker='o', linestyle='dashed')
    
plt.show()
    

### BoxPlot many stations

In [None]:
# filter by country, you can add multiple countries as filter, or leave empty for all
country = ['CH']

# filter by station id, you can add multiple stations as filter, or leave empty for all
stationid = ['SSL', 'WAO']

# depending on how many stations you want to display, 
# you probably want to adjust the figure and font size
# for all 132 stations you can try the following values
#figuresize = [10,24] 
#fontsize = 9

figuresize = [8,8] 
fontsize = 12

# Initialize the figure
f, ax = plt.subplots(figsize=figuresize)
frames= []

#------------------------------------
for do in tqdm(coll.data):
    
    # filter by country and station id; if both filter lists are empty, keep every station
    if (country or stationid) and (do.station['specificInfo']['countryCode'] not in country) and (do.station['id'] not in stationid):
        continue        
    
    # add features, like day, month, ppm for ease of use
    data = data_feature(do.data)
    # add some features from the metadata of the object
    data['label'] = label(do)
    frames.append(data)
   
if frames:
    result = pd.concat(frames)
    result.boxplot('ppm',by='label',vert=False, ax=ax, fontsize=fontsize,flierprops={'marker': '.'},patch_artist = True)

    f.suptitle('ObsPack CO2 mole fraction observations')
    ax.set_xlabel('ppm')
    plt.show()
else:
    print('no results')
    # close the empty figure so that nothing is displayed
    plt.close(f)
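
The box plot relies on stacking all selected data frames into one frame with a `label` column and then grouping by that column. The sketch below shows the same pattern on two invented frames and computes the quartiles that each box represents; the labels and ppm values are made up for illustration.

```python
import pandas as pd

# Two made-up frames standing in for the per-station data frames
frame_a = pd.DataFrame({"ppm": [400.0, 402.0, 404.0]})
frame_a["label"] = "AAA Station A 10 m"
frame_b = pd.DataFrame({"ppm": [410.0, 420.0, 430.0, 440.0]})
frame_b["label"] = "BBB Station B 50 m"

# Stack the frames; the label column keeps track of the origin of each row
result = pd.concat([frame_a, frame_b])

# The quantities a box plot draws for each group: quartiles and median
stats = result.groupby("label")["ppm"].quantile([0.25, 0.5, 0.75]).unstack()
print(stats)
```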