<img src='https://www.icos-cp.eu/sites/default/files/2017-11/ICOS_CP_logo.png' width=400 align=right>

# ICOS Carbon Portal Python Libraries

This example uses a foundational library called `icoscp_core`, which can be used to access time-series ICOS data that are <i>previewable</i> in the ICOS Data Portal. "Previewable" means that the data variables can be visualized in the preview plot. The library can also be used to access (meta-)data from the [ICOS Cities](https://citydata.icos-cp.eu/portal/) and [SITES](https://data.fieldsites.se/portal/) data repositories.

General information on all ICOS Carbon Portal Python libraries can be found on our [help pages](https://icos-carbon-portal.github.io/pylib/). 

Documentation of the `icoscp_core` library, including information on running it locally, can also be found on [PyPI.org](https://pypi.org/project/icoscp_core/).

Note that for running this example locally, authentication is required (see the `how_to_authenticate.ipynb` notebook).


# Example: Access and work with ecosystem data

## Import libraries

In [None]:
from icoscp_core.icos import data, meta, ECO_STATION
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

## List stations

In [None]:
# Stations specific for the ecosystem domain (see example 1a and 1c for examples for the atmosphere and ocean domains) 
stations = meta.list_stations(ECO_STATION)

# Filter stations by country (e.g., Sweden, 'SE')
filtered_stations = [
    s for s in stations
    if s.country_code == 'SE'
]

print("Filtered stations list:")
pd.DataFrame(filtered_stations)

## View metadata for a selected station 

The example shows how to access some of the metadata associated with the station. 
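
The staff table built in the cell below includes former staff members. As a minimal sketch that runs without network access, the following filters rows shaped like that table down to current staff. The rows are hypothetical placeholders, `current_staff` is an illustrative helper (not part of `icoscp_core`), and the rule that a missing end date means an ongoing role is an assumption.

```python
# Hypothetical rows shaped like the staff table in this notebook (placeholder values)
staff_rows = [
    {'last_name': 'Example-A', 'role': 'Role-A', 'start': '2015-01-01', 'end': '2019-12-31'},
    {'last_name': 'Example-B', 'role': 'Role-B', 'start': '2018-06-01', 'end': None},
]

def current_staff(rows):
    """Keep rows without an end date.

    ASSUMPTION: a role whose end date is missing (None) is still ongoing.
    """
    return [row for row in rows if row['end'] is None]

for row in current_staff(staff_rows):
    print(row['last_name'], row['role'])
```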

In [None]:
# Specify a station URI from the list above
station_uri = 'http://meta.icos-cp.eu/resources/stations/ES_SE-Deg'
station_meta = meta.get_station_meta(station_uri)

# Print the station name
print('Name:', station_meta.org.name)

# Print the ICOS labeling date (available on ICOS stations)
print('Got ICOS label on:', station_meta.specificInfo.labelingDate)

print('Known staff, possibly former:')
pd.DataFrame([
    {
        'first_name': memb.person.firstName,
        'last_name': memb.person.lastName,
        'role': memb.role.role.label,
        'start': memb.role.start,
        'end': memb.role.end
    }
    for memb in station_meta.staff
])

## See a list of data types

Data types are the most important classification category for data objects. A data type combines a number of other metadata elements, so a data object is associated with a single data type instead of being linked to each of these elements separately. In the following example, filters are applied so that only data types associated with ICOS Level 2 data from the ecosystem domain that are previewable are shown. More information is available [about data levels](https://www.icos-cp.eu/data-services/data-collection/data-levels-quality). Additional filters can be applied; please refer to the documentation for more details.
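
The filter conditions can be combined freely. As a minimal sketch that runs without network access, the following applies the same three conditions to hypothetical stand-in objects. `DataTypeStub` mirrors only the attributes used in this notebook (with `theme_uri` as a flattened stand-in for `dt.theme.uri`), and `select_datatypes` is an illustrative helper, not part of `icoscp_core`.

```python
from dataclasses import dataclass
from typing import Optional

ECOSYSTEM_THEME = 'http://meta.icos-cp.eu/resources/themes/ecosystem'

@dataclass
class DataTypeStub:
    """Hypothetical stand-in for the items returned by meta.list_datatypes()."""
    label: str
    has_data_access: bool
    data_level: int
    theme_uri: str  # flattened stand-in for dt.theme.uri

def select_datatypes(datatypes, level: Optional[int] = None,
                     theme_uri: Optional[str] = None,
                     require_data_access: bool = True):
    """Return the data types that satisfy every condition that was given."""
    return [
        dt for dt in datatypes
        if (not require_data_access or dt.has_data_access)
        and (level is None or dt.data_level == level)
        and (theme_uri is None or dt.theme_uri == theme_uri)
    ]

# Placeholder entries; only the ecosystem theme URI is taken from this notebook
stubs = [
    DataTypeStub('Fluxnet Product', True, 2, ECOSYSTEM_THEME),
    DataTypeStub('Example type A', False, 2, ECOSYSTEM_THEME),
    DataTypeStub('Example type B', True, 1, ECOSYSTEM_THEME),
    DataTypeStub('Example type C', True, 2, 'http://example.org/themes/other'),
]

for dt in select_datatypes(stubs, level=2, theme_uri=ECOSYSTEM_THEME):
    print(dt.label)
```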


In [None]:
all_data_types = meta.list_datatypes()

# filters applied:
# data types with data access (possible to view with Python)
# data types with level 2 data
# data types associated with the ecosystem theme
selected_datatypes = [
    dt for dt in all_data_types
    if dt.has_data_access and dt.data_level==2 and dt.theme.uri=='http://meta.icos-cp.eu/resources/themes/ecosystem'
]
for data_type in selected_datatypes:
    print(f"{data_type.label} ({data_type.uri})")
    

## Find data objects based on the selected station and a specified data type

This example shows how to get a list of data objects associated with the selected station and the data type "Fluxnet Product". In this case, only one object is available.

In [None]:
# Specify a data type from the list above 
data_type = 'http://meta.icos-cp.eu/resources/cpmeta/miscFluxnetProduct'

station_data_objects = meta.list_data_objects(datatype = data_type, 
                                         station = station_uri)

for station_data_object in station_data_objects:
    print(station_data_object.filename)

if len(station_data_objects) == 0:
    print(f'No available objects with data type {data_type} at station {station_uri}')

## Access data

This example shows how to access the data and metadata from FLX_SE-Deg_FLUXNET2015_FULLSET_HH_2001-2020_beta-3.csv.zip.


In [None]:
# Select a filename from the list above
filename = 'FLX_SE-Deg_FLUXNET2015_FULLSET_HH_2001-2020_beta-3.csv.zip'
selected_data_object = next(
    (dobj for dobj in station_data_objects if dobj.filename == filename),
    None
)

if selected_data_object:

    # Access full metadata associated with the data object
    # NOTE this is an expensive operation, ~ 100 ms
    dobj_meta = meta.get_dobj_meta(selected_data_object)
    
    # Access the object's data; relies on full metadata
    dobj_arrays = data.get_columns_as_arrays(dobj_meta)
    
    # Convert to a pandas dataframe
    df = pd.DataFrame(dobj_arrays)

    display(df)
else:
    print(f'No data object with filename "{filename}" was found; check the variable "filename"')


## Make a plot: single data column

The selected data object contains data for GPP (gross primary production), which can be calculated in different ways. In this example, we use the GPP stored in the "GPP_DT_VUT_REF" column of the dataframe `df` above. If you access a different data object, the data may be stored in a column with a different name. It is also worth noting that the names of the columns containing the observation timestamp and the quality flag may differ between data types, depending on the conventions used by the corresponding thematic center or scientific community.

<mark>Note that only the latest year of data is plotted</mark>. This selection was made because the full time series contains a very large number of data points.


Before the data is plotted, the "NEE_VUT_REF_QC" column is used to exclude data that has not undergone/passed manual quality control. Note that, by default, the data-fetching methods of the ICOS Python library automatically exclude data points that have been explicitly flagged as bad. This behaviour is controlled by the boolean parameter `keep_bad_data` of the methods `get_columns_as_arrays` and `batch_get_columns_as_arrays` (`False` by default).
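
Whether the comparison with `'0'` in the cell below matches any rows depends on the dtype of the flag column: in pandas, comparing a numeric column with a string yields `False` for every row, so the filter silently returns an empty frame. The following sketch uses synthetic data (the values are made up for illustration) and a hypothetical helper, `keep_accepted`, that converts the flags to numbers before comparing, so that it behaves the same whether the flags arrive as strings or as numbers.

```python
import pandas as pd

def keep_accepted(df: pd.DataFrame, flag_col: str, accepted) -> pd.DataFrame:
    """Keep rows whose quality flag equals `accepted`, regardless of flag dtype."""
    # flags that cannot be parsed as numbers become NaN and are dropped
    flags = pd.to_numeric(df[flag_col], errors='coerce')
    return df[flags == float(accepted)]

# Synthetic example data (not real measurements)
demo = pd.DataFrame({
    'TIMESTAMP': pd.date_range('2020-01-01', periods=4, freq='30min'),
    'GPP_DT_VUT_REF': [0.1, 0.2, 0.3, 0.4],
    'NEE_VUT_REF_QC': [0, 1, 0, 2],  # numeric flags
})

# Naive comparison of a numeric column with a string: no rows match
naive = demo[demo['NEE_VUT_REF_QC'] == '0']

# Comparison after numeric conversion: rows with flag 0 are kept
robust = keep_accepted(demo, 'NEE_VUT_REF_QC', '0')

print(len(naive), len(robust))
```

If the filtered frame in the notebook turns out empty, checking `df[quality_flag].dtype` is a quick way to see whether this mismatch is the cause.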

In [None]:
# helper functions to ensure the presence of one or more columns in a pandas DataFrame
def assert_col(df: pd.DataFrame, col: str) -> None:
    assert col in df.columns, f"Column '{col}' not found in data. Choose among {list(df.columns)}"

def assert_cols(df: pd.DataFrame, cols: list[str]) -> None:
    for col in cols:
        assert_col(df, col)

# List all the programmatically accessible columns
list(df.columns)

In [None]:
time_column = 'TIMESTAMP'
data_column = 'GPP_DT_VUT_REF'
quality_flag = 'NEE_VUT_REF_QC'
value_accept_quality = '0'

# make sure the required columns are present
assert_cols(df, [time_column, data_column, quality_flag])

# Find the latest year based on the time_column
latest_year = df[time_column].dt.year.max()

# Filter the DataFrame to include only rows from the latest year
df_latest_year = df[df[time_column].dt.year == latest_year]

# use the quality flag to keep only data marked as good after manual quality control
df_latest_year_quality = df_latest_year[df_latest_year[quality_flag] == value_accept_quality]

# metadata part specific to observational time series data collected at stations
# (dobj_meta was initialized in "Access data" section above)
ts_meta = dobj_meta.specificInfo

station_name = ts_meta.acquisition.station.org.name

# dictionary to look up value type information by dataset column
column_value_types = {
    col.label: col.valueType
    for col in ts_meta.columns
}

# find metadata associated with the selected columns (data_column and time_column)
x_value_type = column_value_types[time_column]
y_value_type = column_value_types[data_column]

# create axis labels based on the metadata
x_axis_label = f"{x_value_type.self.label} ({time_column})"
y_axis_label = f"{y_value_type.self.label} [{y_value_type.unit}]"

plot = df_latest_year_quality.plot(x=time_column, y=data_column, grid=True, title=station_name, style='o', markersize=3)
plot.set(xlabel=x_axis_label, ylabel=y_axis_label)
plt.show()

## Make a plot: two data columns on different axes

This example plots two data columns against the same time axis, each with its own y-axis. The two columns are set with the variables `data_column1` and `data_column2`, together with their associated quality flags.

<mark>Note that only the latest year of data is plotted</mark>. This selection was made because the full time series contains a very large number of data points.

Before the data is plotted, quality flags are applied to exclude poor data.
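
Both y-axis labels in the cell below are built the same way, so the metadata lookup could be factored out. A minimal sketch follows, using `SimpleNamespace` stand-ins in place of the real `dobj_meta.specificInfo.columns` entries. Only the attributes used in this notebook are mirrored, the labels and units are placeholders, and `axis_label` and `stub_column` are hypothetical helpers, not part of `icoscp_core`.

```python
from types import SimpleNamespace

def axis_label(columns_meta, column: str) -> str:
    """Build a 'label [unit]' string from the metadata of one column."""
    matches = [col for col in columns_meta if col.label == column]
    if not matches:
        raise KeyError(f"No metadata found for column '{column}'")
    value_type = matches[0].valueType
    return f"{value_type.self.label} [{value_type.unit}]"

def stub_column(label: str, description: str, unit: str) -> SimpleNamespace:
    """Hypothetical stand-in for one entry of dobj_meta.specificInfo.columns."""
    value_type = SimpleNamespace(unit=unit)
    setattr(value_type, 'self', SimpleNamespace(label=description))
    return SimpleNamespace(label=label, valueType=value_type)

# Placeholder metadata; real labels and units come from the data object
columns_meta = [
    stub_column('GPP_DT_VUT_REF', 'Placeholder A', 'unit-a'),
    stub_column('SW_IN_F', 'Placeholder B', 'unit-b'),
]

print(axis_label(columns_meta, 'GPP_DT_VUT_REF'))
print(axis_label(columns_meta, 'SW_IN_F'))
```

Raising a `KeyError` with the column name gives a clearer message than the `IndexError` produced by indexing an empty list with `[0]`.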

In [None]:
time_column = 'TIMESTAMP'
# Select two of the data columns in dataframe "df"
data_column1 = 'GPP_DT_VUT_REF'
quality_flag1 = 'NEE_VUT_REF_QC'
value_accept_quality1 = '0'

data_column2 = 'SW_IN_F'
quality_flag2 = 'SW_IN_F_QC'
value_accept_quality2 = '0'

# Find the latest year based on the time_column
latest_year = df[time_column].dt.year.max()

# Filter the DataFrame to include only rows from the latest year
df_latest_year = df[df[time_column].dt.year == latest_year]

# Set up the plot with the first variable
fig, ax1 = plt.subplots()

# Filter based on quality flag associated with selected data_column1
df_latest_year_quality = df_latest_year[df_latest_year[quality_flag1] == value_accept_quality1]

# dobj_meta from section "Access data"
columns_meta = dobj_meta.specificInfo.columns

# Find the value type (label and unit) for data column 1
dobj_value_type = [col for col in columns_meta if col.label==data_column1][0].valueType

# create label for y-axis based on the metadata
y_axis_label1 = f"{dobj_value_type.self.label} [{dobj_value_type.unit}]"

# "b" stands for blue and "." for point markers
ax1.plot(df_latest_year_quality[time_column], df_latest_year_quality[data_column1], 'b.')
ax1.set_xlabel('Time')
ax1.set_ylabel(y_axis_label1, color='b')
ax1.tick_params(axis='y', labelcolor='b')

# Find station name
# dobj_meta accessed in "Access data" section
station = dobj_meta.specificInfo.acquisition.station.org.name

# Set the title with the station name
ax1.set_title(station)

# Create a secondary y-axis for the second variable
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

# Filter based on quality flag associated with selected data_column2
df_latest_year_quality = df_latest_year[df_latest_year[quality_flag2] == value_accept_quality2]

# Find the value type (label and unit) for data column 2
dobj_value_type = [col for col in columns_meta if col.label==data_column2][0].valueType

# create label for y-axis based on the metadata
y_axis_label2 = f"{dobj_value_type.self.label} [{dobj_value_type.unit}]"

# "r" stands for red and "." for point markers
ax2.plot(df_latest_year_quality[time_column], df_latest_year_quality[data_column2], 'r.')
ax2.set_ylabel(y_axis_label2, color='r')
ax2.tick_params(axis='y', labelcolor='r')

# show the dates in this specific format (YYYY-MM-DD)
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

# Rotate the dates to fit better
ax1.tick_params(axis='x', rotation=45)

# Add grid
ax1.grid(True)

# Show the plot
fig.tight_layout()  # to make sure labels/axes don't overlap
plt.show()