# How to document and share your data?

## 1) Type and format of data

In geosciences, data come in different types and formats.

**For raster data:**
- TIFF format
- Geographic Tagged Image File Format (GeoTIFF): georeferenced raster imagery
- Cloud Optimized GeoTIFF (COG): to facilitate access to/from the cloud

**For multi-dimensional arrays:**
- Hierarchical Data Format (HDF, especially HDF5): general-purpose format
- Network Common Data Form (NetCDF, especially NetCDF-4): format designed for scientific data; NetCDF-4 is built on HDF5 and supports compression
- Zarr: designed for data in cloud environments; arrays can be split into chunks along their dimensions

**For tabulated data:**
- Comma-Separated Values (CSV): text file format that uses commas to separate values, and newlines to separate records
- xlsx (Excel)

**For text data:**
- American Standard Code for Information Interchange (ASCII): character encoding used by electronic equipment to handle text using the English alphabet, numbers, and other common symbols



And many [others](https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-formats#toc-geographic-tagged-image-file-format-geotiff)!
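
As a minimal sketch of the CSV layout described above (commas between values, one record per line), the following parses a small in-memory table with Python's standard `csv` module. The column names and values are made up for illustration and are not taken from `locations.csv`.

```python
import csv
import io

# Hypothetical CSV content: one header line, then one record per line.
csv_text = "name,latitude,longitude\nsite_a,45.19,5.72\nsite_b,48.85,2.35\n"

# DictReader uses the first line as column names.
records = list(csv.DictReader(io.StringIO(csv_text)))

for record in records:
    # Every value is read as text; numeric columns are converted explicitly.
    print(record["name"], float(record["latitude"]), float(record["longitude"]))
```

Because CSV stores everything as text, units and column meanings have to be documented elsewhere, for example in a README.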


<span style="color:darkorange">**Your turn!** Open different types of data and check what is inside </span>

In [None]:
#Open raster data
import rasterio
geotiff_path = "/mnt/data-summer-shared/documentation-data/data/2022-08-12-00_00_2022-08-22-23_59_Sentinel-2_L1C_True_color.tiff"
geotiff_data = rasterio.open(geotiff_path)
print(geotiff_data.meta)



In [None]:
#Open multi-dimensional data
import xarray as xr
nc_path = "/mnt/data-summer-shared/notebook-basics/data/surface_temperature_historical_1970_2014_IPSL_CM6A_LR.nc"
nc_data = xr.open_dataset(nc_path)
nc_data

In [None]:
#Open tabulated data
import pandas as pd
txt_path = '/mnt/data-summer-shared/documentation-data/data/locations.csv'
txt_data = pd.read_csv(txt_path)
print(txt_data.iloc[:2]) #print only the first two rows

## 2) Metadata and FAIR requirements

No matter the type of data and the format, we need to properly define our metadata.

The aim is to meet the [FAIR requirements](https://www.go-fair.org/fair-principles/): Findability, Accessibility, Interoperability, and Reusability.

Findability:
- Is the dataset assigned a globally unique and persistent identifier (e.g., DOI, Handle)?
- Is the dataset described with rich metadata (i.e., creator, title, data identifier, publisher, publication date, summary and keyword)?
- Does the metadata explicitly include the identifier of the dataset?
- Is the dataset (and metadata) indexed in a searchable resource (e.g., a public data repository like Zenodo, PANGAEA, or Dataverse)?

Accessibility:
- Can the data be retrieved using a standard protocol (e.g., HTTP, FTP, OAI-PMH, API)?
- Are the metadata accessible even if the data itself is no longer available?

Interoperability:
- Are you using standard formats (e.g., NetCDF, CSV) and vocabularies (e.g., CF conventions)?
- Are metadata schemas used (like Dublin Core, ISO19115)?

Reusability:
- Is it released with a clear and accessible data and code usage license (e.g., CC-BY, MIT)?
- Is it associated with detailed provenance (e.g., how, when, by whom data was generated)?
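
Part of this checklist can be automated. The sketch below checks only the descriptive fields listed under Findability; the key names and the example record are assumptions for illustration, since each repository uses its own metadata schema.

```python
# Descriptive fields listed under "Findability" above.
# The key names are assumptions; repositories spell them differently.
REQUIRED_FIELDS = (
    "creator", "title", "identifier", "publisher",
    "publication_date", "summary", "keywords",
)

def missing_fields(metadata):
    """Return the required fields that are absent or empty in `metadata`."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# Hypothetical record with placeholder values.
example = {
    "creator": "Jane Doe",
    "title": "Monthly mean surface air temperature",
    "identifier": "",  # no DOI assigned yet
    "publisher": "Zenodo",
    "publication_date": "2024-01-01",
    "keywords": ["temperature", "climate"],
}

print(missing_fields(example))
```

Here the check reports `identifier` (present but empty) and `summary` (absent). A check like this covers only the presence of fields, not their quality.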

<span style="color:darkorange">**Your turn!** Check if a dataset meets the FAIR requirements </span>

You can use an automated FAIR assessment tool to check your data or another dataset: https://www.f-uji.net/
If you don't have a dataset, you can check [ITS_LIVE data](https://nsidc.org/data/nsidc-0782/versions/1) or [PROTECT-MIP-Antarctica](https://data-protect-slr.univ-grenoble-alpes.fr/dataset/d4-7-protect-mip-antarctica).


## 3) Write your metadata

Your metadata need to be specified either inside your dataset (for example in your NetCDF file) or outside (in a README).

### Writing metadata inside a dataset. Example for a NetCDF file: the CF conventions

NetCDF files usually follow the Climate and Forecast Metadata Conventions (CF), explained [here](https://cfconventions.org/conventions.html).

<span style="color:darkorange">**Your turn!** Modify a NetCDF data file to match the CF conventions </span>


In [None]:
#Global attributes
nc_data.attrs["title"] = "Monthly mean surface air temperature"
nc_data.attrs["Conventions"] = "CF-1.8"
nc_data.attrs["author"] = ''
nc_data.attrs["institution"] = ''
nc_data.attrs["creation"] = '' #date of creation
nc_data.attrs["history"] = '' #what operations were done and when

In [None]:
#Variable attributes (units, long_name, standard_name) [MANDATORY]
nc_data["temp"].attrs["units"] = "Celsius"
nc_data["temp"].attrs["long_name"] = "Surface temperature" #long descriptive name, which may for example be used for labeling plots
nc_data["temp"].attrs["standard_name"] = "surface_temperature" #name identifying the physical quantity, taken from the CF standard name table; it contains no whitespace and is case sensitive

#Variable attributes (short_name, description) [OPTIONAL]
nc_data["temp"].attrs["short_name"] = "T"
nc_data["temp"].attrs["description"] = "the temperature at the surface" #clarifies the qualifiers of the quantity, such as which surface it is defined on or what the flux sign conventions are
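
The variable attributes above can be checked with a few lines of Python. This is a minimal sketch, not a CF compliance checker: the function name is hypothetical, it only tests that the three attributes are present and that the standard name has no whitespace, and it does not verify the name against the CF standard name table.

```python
def check_variable_attrs(attrs):
    """Return a list of problems found in one variable's attribute dict."""
    problems = []
    for name in ("units", "long_name", "standard_name"):
        if not attrs.get(name):
            problems.append(f"missing attribute: {name}")
    standard_name = attrs.get("standard_name", "")
    # A standard name contains no whitespace.
    if any(ch.isspace() for ch in standard_name):
        problems.append("standard_name contains whitespace")
    return problems

# Plain dicts stand in for `nc_data["temp"].attrs` here.
incomplete = {"units": "Celsius", "long_name": "Surface temperature"}
with_space = dict(incomplete, standard_name="surface temperature")

print(check_variable_attrs(incomplete))
print(check_variable_attrs(with_space))
```

Since xarray stores attributes in a dictionary, the same function could be called on `nc_data["temp"].attrs` directly.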

You also need to define the coordinate reference system you are using, usually with a variable named `crs`:

```
variables:
  int crs;
    crs:grid_mapping_name = "universal_transverse_mercator";
    crs:utm_zone_number = 32;
    crs:semi_major_axis = 6378137.0;
    crs:inverse_flattening = 298.257223563;
    crs:longitude_of_prime_meridian = 0.0;
    crs:datum = "WGS84";
    crs:epsg_code = "EPSG:32632";
    crs:spatial_ref = "PROJCS[\"WGS 84 / UTM zone 32N\", ... ]"; // Optional full WKT
```


This can be done automatically:

In [None]:
import rioxarray #registers the .rio accessor on xarray objects
print(f'EPSG before writing crs {nc_data.rio.crs}')
nc_data.rio.write_crs("EPSG:32632", inplace=True)#you can modify the EPSG code here
print(f'EPSG after writing crs {nc_data.rio.crs}')
# nc_data.to_netcdf("output.nc")#Save back to NetCDF

Check that you added a variable spatial_ref.

In [None]:
nc_data

You can also modify your metadata using [NCO](https://nco.sourceforge.net/) commands:
- to change the name of a variable: ```ncrename -v old_var,new_var input.nc ```
- to change the name of a dimension: ```ncrename -d old_dim,new_dim myfile.nc ```
- to rename an attribute of a variable: ```ncrename -a var_name@old_attr,new_attr myfile.nc ```
- to edit or add attributes: ```ncatted -a attribute_name,variable_name,mode,data_type,value input.nc ```
- to add a new attribute: ```ncatted -a units,temp,a,c,"Celsius" myfile.nc ```
- to delete an attribute: ```ncatted -a units,temp,d,, myfile.nc ```
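
If NCO is not installed, similar edits can be made in Python with xarray. The sketch below works on a small in-memory dataset; all variable, dimension and attribute names are placeholders, and the result would still need to be written out with `to_netcdf` to change a file on disk.

```python
import numpy as np
import xarray as xr

# Small in-memory dataset; all names are placeholders.
ds = xr.Dataset({"old_var": (("old_dim",), np.array([1.0, 2.0, 3.0]))})
ds["old_var"].attrs["old_attr"] = "value"

# Rename a variable (similar to `ncrename -v`).
ds = ds.rename_vars({"old_var": "temp"})
# Rename a dimension (similar to `ncrename -d`).
ds = ds.rename_dims({"old_dim": "time"})
# Rename an attribute (similar to `ncrename -a`).
ds["temp"].attrs["new_attr"] = ds["temp"].attrs.pop("old_attr")
# Add or overwrite an attribute (similar to `ncatted`).
ds["temp"].attrs["units"] = "Celsius"
# Delete an attribute.
del ds["temp"].attrs["new_attr"]

print(ds)
```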

### Writing a README.txt file

The same information previously defined can be included in a text file, called README.txt.
More information [here](https://data.research.cornell.edu/data-management/sharing/readme/#dataspecific).
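
As a minimal sketch, a README can be generated from the same attribute dictionaries used for the NetCDF file. The function name and the layout of the text are assumptions for illustration; the guide linked above describes what a complete README should contain.

```python
def build_readme(global_attrs, variables):
    """Format dataset and variable metadata as plain README text."""
    lines = ["DATASET"]
    for key, value in global_attrs.items():
        lines.append(f"{key}: {value}")
    lines.append("")
    lines.append("VARIABLES")
    for name, attrs in variables.items():
        details = ", ".join(f"{k}={v}" for k, v in attrs.items())
        lines.append(f"{name}: {details}")
    return "\n".join(lines) + "\n"

# Placeholder metadata; with xarray, `nc_data.attrs` and the `attrs`
# of each variable could be passed instead.
readme_text = build_readme(
    {"title": "Monthly mean surface air temperature", "Conventions": "CF-1.8"},
    {"temp": {"units": "Celsius", "long_name": "Surface temperature"}},
)

# Print the text; write it to README.txt once it looks right.
print(readme_text)
```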


## 4) Where to store your data

Different platforms can be used to store your data online and assign a DOI to the dataset:
- Gaia Data, which is composed of three infrastructures: Data Terra, CLIMERI-France and PNDB. Data Terra mainly covers observational data, for example land-surface observations through [Theia](https://www.theia-land.fr/en/homepage-en/). CLIMERI-France deals with climate simulation data, while PNDB focuses on biodiversity data. Data can be shared using [EasyData](https://www.easydata.earth/#/public/home).
- [Recherche Data Gouv](https://recherche.data.gouv.fr/en/category/9/guide/depositing-a-dataset)
- [Zenodo](https://help.zenodo.org/docs/deposit/create-new-upload/) (international, and widely used)
- [Dataverse](https://dataverse.org/)
- [PANGAEA](https://www.pangaea.de/submit/)


The most commonly used is probably Zenodo.

NB: You can set an embargo on your data on Zenodo and Recherche Data Gouv.

## 5) To go further

If you want to go further and create a data hub or portal, here are some useful tools:
- [CKAN](https://ckan.org/)
- [ERDDAP](https://www.ncei.noaa.gov/erddap/index.html)

You can also create an app to provide visualizations of your data, for example using the Python library [HoloViz](https://holoviz.org/).
An example is available [here](https://edu.oggm.org/en/latest/explorer.html).



