<img src='https://www.icos-cp.eu/sites/default/files/2017-11/ICOS_CP_logo.png' width=400 align=right>

# ICOS Carbon Portal Python Libraries

This example uses a foundational library called `icoscp_core`, which can be used to access time-series ICOS data that are <i>previewable</i> in the ICOS Data Portal. "Previewable" means that the data variables can be visualized in the portal's preview plot. The library can also be used to access (meta-)data from the [ICOS Cities](https://citydata.icos-cp.eu/portal/) and [SITES](https://data.fieldsites.se/portal/) data repositories.

General information on all ICOS Carbon Portal Python libraries can be found on our [help pages](https://icos-carbon-portal.github.io/pylib/). 

Documentation of the `icoscp_core` library, including information on running it locally, can also be found on [PyPI.org](https://pypi.org/project/icoscp_core/).

Note that for running this example locally, authentication is required (see the `how_to_authenticate.ipynb` notebook).

# Example: Access and work with atmospheric data

## Import libraries

In [None]:
from icoscp_core.icos import data, meta, ATMO_STATION
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

## List stations

In [None]:
# Stations specific to the atmosphere domain (see examples 1b and 1c for the ecosystem and ocean domains)
stations = meta.list_stations(ATMO_STATION)

# Filter stations by country (e.g., Sweden, 'SE')
filtered_stations = [
    s for s in stations
    if s.country_code == 'SE'
]

print("Filtered stations list:")
pd.DataFrame(filtered_stations)

## View metadata for a selected station 

The example shows how to access some of the metadata associated with the station. 

In [None]:
# Specify a station uri from list above
station_uri = 'http://meta.icos-cp.eu/resources/stations/AS_HTM'
station_meta = meta.get_station_meta(station_uri)

# Print the station name
print('Name:', station_meta.org.name)

# Print the ICOS labeling date (available on ICOS stations)
print('Got ICOS label on:', station_meta.specificInfo.labelingDate)

print('Known staff, possibly former:')
pd.DataFrame([
    {
        'first_name': memb.person.firstName,
        'last_name': memb.person.lastName,
        'role': memb.role.role.label,
        'start': memb.role.start,
        'end': memb.role.end
    }
    for memb in station_meta.staff
])

## See a list of data types

Data types are the most important classification category for data objects. Each data type bundles a number of other metadata elements, so a data object is linked to a single data type rather than to each of those elements individually. In the following example, filters are applied so that only data types for previewable ICOS Level 2 data from the atmospheric domain are shown. More information is available [about data levels](https://www.icos-cp.eu/data-services/data-collection/data-levels-quality). Additional filters can be applied; please refer to the documentation for details.


In [None]:
all_data_types = meta.list_datatypes()

# filters applied:
# data types with data access (possible to view with Python)
# data types with level 2 data
# data types associated with the atmospheric theme
selected_datatypes = [
    dt for dt in all_data_types
    if dt.has_data_access
    and dt.data_level == 2
    and dt.theme.uri == 'http://meta.icos-cp.eu/resources/themes/atmosphere'
]
for data_type in selected_datatypes:
    print(f"{data_type.label} ({data_type.uri})")
    

## Find data objects based on the selected station and a specified data type

This example shows how to get a list of data objects associated with the selected station and the data type "ICOS ATC/CAL Flask Release". 

In [None]:
# Specify a data type from the list above 
data_type = 'http://meta.icos-cp.eu/resources/cpmeta/atcFlaskDataObject'

station_data_objects = meta.list_data_objects(datatype = data_type, 
                                         station = station_uri)

for station_data_object in station_data_objects:
    print(station_data_object.filename)

if len(station_data_objects) == 0:
    print(f'No available objects with data type {data_type} at station {station_uri}')

## Access data for a single data object 

This example shows how to access the data and metadata from ICOS_ATC_L2_L2-2024.1_HTM_150.0_CTS_FLASK_CO2.zip.

In [None]:

# Select a filename from the list above
filename = 'ICOS_ATC_L2_L2-2024.1_HTM_150.0_CTS_FLASK_CO2.zip'
selected_data_object = next(
    (dobj for dobj in station_data_objects if dobj.filename == filename),
    None
)

if selected_data_object:

    # Access full metadata associated with the data object
    # NOTE this is an expensive operation, ~ 100 ms
    dobj_meta = meta.get_dobj_meta(selected_data_object)
    
    # Access the object's data; relies on full metadata, which is expensive to fetch
    # see below for example of batch data fetching
    dobj_arrays = data.get_columns_as_arrays(dobj_meta)
    
    # Convert to a pandas dataframe
    df = pd.DataFrame(dobj_arrays)

    display(df)
else:
    print(f'No data object named "{filename}" was found, check the variable "filename"')


## Make a plot: single data column

The selected data object contains data for CO2 (stored in the "co2" column of the DataFrame above). For this particular data type (ICOS ATC/CAL Flask Release), different data objects contain data for different gases, which are stored in different columns. Note also that the names of the columns holding the observation timestamp and the quality flag may differ between data types, depending on the conventions used by the corresponding thematic center or scientific community.

Before the data is plotted, the "Flag" column is used to exclude data that has not undergone or passed manual quality control. Note that, by default, the data-fetching methods of the ICOS Python library already exclude data points that have been explicitly flagged as bad. This behaviour is controlled by the boolean parameter `keep_bad_data` of the methods `get_columns_as_arrays` and `batch_get_columns_as_arrays` (`False` by default).
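
The flag-based filtering described above can be tried out on synthetic data without any portal access. The following is a minimal sketch: the DataFrame contents are invented for illustration, the flag values other than `'O'` are hypothetical placeholders, and only the column names (`SamplingStart`, `co2`, `Flag`) and the accepted flag value `'O'` are taken from this notebook.

```python
import pandas as pd

# Synthetic stand-in for a flask data object (all values are made up)
example_df = pd.DataFrame({
    'SamplingStart': pd.to_datetime(
        ['2023-01-01', '2023-01-08', '2023-01-15', '2023-01-22']
    ),
    'co2': [420.1, 421.3, 419.8, 422.0],
    # 'U' and 'R' are placeholder values standing in for "not accepted" flags
    'Flag': ['O', 'U', 'O', 'R'],
})

# Keep only the rows whose flag equals the accepted value
example_quality = example_df[example_df['Flag'] == 'O']

# Count rows per flag value to see how much data the filter removes
flag_counts = example_df['Flag'].value_counts().to_dict()

print(flag_counts)
print(len(example_quality), 'of', len(example_df), 'rows kept')
```

Counting the flag values before filtering is a cheap way to check that the accepted value actually occurs in the data; an empty result usually means that the flag column or the accepted value differs for the chosen data type.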

In [None]:
# helper functions to ensure the presence of columns in a pandas DataFrame
def assert_col(df: pd.DataFrame, col: str) -> None:
    assert col in df.columns, f"Column '{col}' not found in data. Choose among {list(df.columns)}"

def assert_cols(df: pd.DataFrame, cols: list[str]) -> None:
    for col in cols:
        assert_col(df, col)

# List all the programmatically accessible columns
list(df.columns)

In [None]:
time_column = 'SamplingStart'
data_column = 'co2'
quality_flag = 'Flag'
value_accept_quality = 'O'

# make sure the required columns are present
assert_cols(df, [time_column, data_column, quality_flag])

# use the quality flag to keep only data marked as good after manual quality control
df_quality = df[df[quality_flag] == value_accept_quality]

# metadata part specific to observational time series data collected at stations
# (dobj_meta was initialized in "Access data" section above)
ts_meta = dobj_meta.specificInfo

station_name = ts_meta.acquisition.station.org.name

# dictionary to look up value type information by dataset column
column_value_types = {
    col.label: col.valueType
    for col in ts_meta.columns
}

# find metadata associated with the selected columns (data_column and time_column)
x_value_type = column_value_types[time_column]
y_value_type = column_value_types[data_column]

# create axis labels based on the metadata
x_axis_label = f"{x_value_type.self.label} ({time_column})"
y_axis_label = f"{y_value_type.self.label} [{y_value_type.unit}]"

plot = df_quality.plot(x=time_column, y=data_column, grid=True, title=station_name, style='o', markersize=3)
plot.set(xlabel=x_axis_label, ylabel=y_axis_label)
plt.show()

## Combine data for all data objects given the station and data type

This example combines selected data columns from the objects in the list "station_data_objects".

All objects are considered, but an object contributes to the result only if it has data stored in columns with the names listed in "data_columns".
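
The merging step can be illustrated in isolation. This is a minimal sketch, assuming two small synthetic DataFrames (invented values, and hypothetical column names carrying a station and sampling-height suffix) that share the `SamplingStart` time column. An outer merge keeps every timestamp from either object and fills the gaps with `NaN`.

```python
import pandas as pd

# Two synthetic "data objects" with partly overlapping timestamps (made-up values)
co2_df = pd.DataFrame({
    'SamplingStart': pd.to_datetime(['2023-01-01', '2023-01-08']),
    'co2_HTM_150.0': [420.1, 421.3],
})
n2o_df = pd.DataFrame({
    'SamplingStart': pd.to_datetime(['2023-01-08', '2023-01-15']),
    'n2o_HTM_150.0': [336.2, 336.4],
})

# Outer merge on the time column: the union of all timestamps is kept
combined = (
    pd.merge(co2_df, n2o_df, on='SamplingStart', how='outer')
    .sort_values('SamplingStart')
    .reset_index(drop=True)
)

print(combined)
```

Only the timestamp present in both frames has values in both data columns; the other two rows each contain one `NaN`.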

In [None]:
time_column = 'SamplingStart'
# In the example list of station_data_objects, additional columns include 'co', 'sf6', 'ch4', and 'h2'. These can be added to this list.
# For other selections, consider the print statements in the output of this cell.
data_columns = ['co2', '14C', 'n2o']
# all data objects have the same column and value for the quality flag 
quality_flag = 'Flag'
value_accept_quality = 'O'

# Save the names of the data_columns that are found in the object dataframes
# Their associated metadata is saved in y_axis_labels for use in later plots.
renamed_data_columns = []
y_axis_labels = []

# initiate the final df
merged_df = pd.DataFrame(columns=[time_column])

for dobj, arrs in data.batch_get_columns_as_arrays(station_data_objects):

    # Convert the arrays into a DataFrame
    df = pd.DataFrame(arrs)

    # use the quality flag to keep only best-quality data (marked "O" in column "Flag")
    df_quality = df[df[quality_flag] == value_accept_quality].copy()

    # Check if the time_column is available in the dataframe columns associated with the data object
    # If not, the users need to look at available columns and find correct column names
    if time_column not in df_quality.columns:
        print(f"The column given for time ('{time_column}') is not found in {dobj.filename}. Skipping this object.")
        print(f"Available columns are '{list(df_quality.columns)}'")
        continue

    # New column names for the final df_quality (to distinguish between the different data objects)
    # based on the station's id and sampling height (if available)
    station_id = dobj.station_uri.split('_')[-1]

    if dobj.sampling_height:
        suffix = f"_{station_id}_{dobj.sampling_height}"

    else:
        suffix = f"_{station_id}"

    # See which of the desired data_columns are available in this data object
    # Rename these for unique column names in the final merged_df that is updated with each iteration of the data objects
    # Save metadata associated with the found columns
    found_columns = []

    for data_column in data_columns:
        if data_column in df_quality.columns:

            found_columns.append(data_column)

            # Rename the column with the suffix
            df_quality.rename(columns={data_column: data_column + suffix}, inplace=True)
            
            if data_column + suffix not in renamed_data_columns:
                renamed_data_columns.append(data_column + suffix)
                
                # find y-axis label for column (used in graph)
                dobj_meta = meta.get_dobj_meta(dobj)
                columns_meta = dobj_meta.specificInfo.columns
                dobj_value_type = [col for col in columns_meta if col.label==data_column][0].valueType
                y_axis_labels.append(f"{dobj_value_type.self.label} [{dobj_value_type.unit}]")

    # If any of the columns were found, merge them with the merged DataFrame
    if found_columns:
        # Select only the relevant columns (timestamp + renamed columns)
        columns_to_merge = [time_column] + [col + suffix for col in found_columns]
        merged_df = pd.merge(merged_df, df_quality[columns_to_merge], on=time_column, how='outer')

    else:
        print(f"None of '{data_columns}' found for {dobj.filename}. Skipping this object.")
        print(f"Available columns are '{list(df_quality.columns)}'")

# time_column should be in datetime format. If it is not already, it is converted here:
merged_df[time_column] = pd.to_datetime(merged_df[time_column])

# Make sure it is in the right order
merged_df = merged_df.sort_values(time_column)

display(merged_df)

## Make a plot: multiple data columns

This type of plot is not well suited to showing different species together, as they often have very different value ranges. A better plot for our example selection follows further below.
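
A quick way to see why a shared y-axis is a poor fit is to compare each column's own value range with the span of an axis that has to cover all columns. This is a minimal sketch on synthetic data; the values and the column names are invented for illustration.

```python
import pandas as pd

# Synthetic columns sitting at very different levels (made-up values)
example = pd.DataFrame({
    'co2_HTM_150.0': [418.0, 421.5, 425.0],
    '14C_HTM_150.0': [-5.0, -2.5, 1.0],
})

col_min = example.min()
col_max = example.max()

# Span of a single y-axis that has to cover every column
shared_span = col_max.max() - col_min.min()

# Fraction of that shared axis actually used by each column
fraction_of_axis = (col_max - col_min) / shared_span

print(fraction_of_axis)
```

In this made-up case each series uses less than 2 % of the shared axis, so its variability would be nearly invisible in a combined plot.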

In [None]:
legend_labels = [f"{col} ({unit})" for col, unit in zip(renamed_data_columns, y_axis_labels)]

# Plot the data
ax = merged_df.plot(x=time_column, y=renamed_data_columns, grid=True, style='o', markersize=3)

# Update the legend with the new labels
ax.legend(legend_labels)

# Show the plot
plt.show()

### Zoomed in to latest year of data

In [None]:
# Find the latest year based on the time_column
latest_year = merged_df[time_column].dt.year.max()

# Filter the DataFrame to include only rows from the latest year
merged_df_latest_year = merged_df[merged_df[time_column].dt.year == latest_year]

# Plot the data
ax = merged_df_latest_year.plot(x=time_column, y=renamed_data_columns, grid=True, style='o', markersize=3)

# Update the legend with the new labels
ax.legend(legend_labels)

# Show the plot
plt.show()

## Make a plot: two data columns on different axes

This type of plot works for two data columns. Even if more are given, only the first two in the list "selected_data_columns" are used.

In [None]:
time_column = 'SamplingStart'

# Select two of the data columns in the DataFrame "merged_df"
selected_data_columns = ['14C_HTM_150.0', 'co2_HTM_150.0']

# Check if all selected columns are in renamed_data_columns
missing_columns = set(selected_data_columns) - set(renamed_data_columns)

if missing_columns or time_column not in merged_df.columns:
    
    print(f"One or more of the columns ({selected_data_columns}), or the time_column ({time_column}), are not in merged_df.")
    print(f"Available columns are: {list(merged_df.columns)}")
    
else:

    # Set up the plot with the first variable
    fig, ax1 = plt.subplots()

    # Plot the first variable on the primary y-axis

    # Find the unit 
    col_index_1 = renamed_data_columns.index(selected_data_columns[0])
    y_axis_label_1 = y_axis_labels[col_index_1]

    # b stands for blue and "." for point markers
    ax1.plot(merged_df[time_column], merged_df[selected_data_columns[0]], 'b.', markersize = 3)
    ax1.set_xlabel('Time')
    ax1.set_ylabel(y_axis_label_1, color='b')
    ax1.tick_params(axis='y', labelcolor='b')

    # Set the title with the station name
    ax1.set_title(station_name)

    if len(selected_data_columns) > 1:
        
        # Create a secondary y-axis for the second variable
        ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

        # Plot the second variable on the secondary y-axis
        col_index_2 = renamed_data_columns.index(selected_data_columns[1])
        y_axis_label_2 = y_axis_labels[col_index_2]

        # r stands for red and "." for point markers
        ax2.plot(merged_df[time_column], merged_df[selected_data_columns[1]], 'r.', markersize = 3)
        ax2.set_ylabel(y_axis_label_2, color='r')
        ax2.tick_params(axis='y', labelcolor='r')
        
    # show the dates in this specific format (YYYY-MM-DD)
    ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    
    # Rotate the dates to fit better
    ax1.tick_params(axis='x', rotation=45)

    # Add grid
    ax1.grid(True)

    # Show the plot
    fig.tight_layout()  # to make sure labels/axes don't overlap
    plt.show()
