# 06: Creating multiple CF-NetCDF files in one go

In some cases, people will choose to break down their data into multiple smaller NetCDF files that they will publish in a single data collection. There are a number of good reasons to do this:
* The data user can access only the data they are interested in.
* Each file can be simpler, with potentially fewer dimensions and fewer missing values. Imagine you have 10 depth profiles that all sample a different set of depths. If these profiles were included in a single NetCDF file, the file would most likely have a single depth dimension and coordinate variable, which would need to account for all 10 depth profiles. Alternatively, 10 depth dimensions and coordinate variables could be included, but this is considered bad practice.
* Each individual file can be assigned a separate set of global attributes which describe the data more accurately. For example, each file could have global attributes for the coordinates and timestamp. If multiple depth profiles are stored in a single file, only the minimum and maximum coordinates and timestamp can be encoded into the global attributes.
* Imagine you are looking for data in a data centre. You want to find depth profiles in a certain area of interest on a map. Files that include a single depth profile will be presented as points on the map. Files that include multiple depth profiles will be presented as a bounding box on a map, and without opening up the file it could be unclear whether the file includes data for your area of interest.

One might think that this would involve more work for the data creator and user. However, if the files are similar (and they should be if they follow the CF and ACDD conventions), this is not necessarily the case.
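
To illustrate the point about missing values in the list above, here is a minimal sketch that combines two made-up profiles sampled at different pressures into a single dataset. The profile values and the `profile` dimension name are assumptions for illustration only. Because the two profiles must share one pressure coordinate, every pressure that was sampled by only one profile becomes a missing value in the other.

```python
import xarray as xr

# Two hypothetical profiles sampled at different pressures (dbar)
profile_a = xr.Dataset(
    coords={'pressure': [0, 10, 20]},
    data_vars={'Temperature': ('pressure', [25.5, 24.8, 23.9])},
)
profile_b = xr.Dataset(
    coords={'pressure': [5, 15, 25, 35]},
    data_vars={'Temperature': ('pressure', [25.1, 24.2, 23.0, 22.1])},
)

# Combining them along a new 'profile' dimension forces a shared pressure axis.
# join='outer' keeps every pressure that appears in either profile.
combined = xr.concat([profile_a, profile_b], dim='profile', join='outer')

n_total = combined['Temperature'].size
n_missing = int(combined['Temperature'].isnull().sum())
print(f'{n_missing} of {n_total} values are missing')
```

Here half of the values in the combined dataset are missing, whereas two separate files would contain no missing values at all.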

In this notebook we will look at how we can easily create multiple NetCDF files in one go using Python.

Let's start with some code to create one basic CF-NetCDF file. Here we are assuming that your data have been loaded from some tabular file (CSV, XLSX, CNV etc) into a pandas dataframe.

## Example for 1 depth profile

In [1]:
import pandas as pd
import xarray as xr
from datetime import datetime

# Get the current timestamp in UTC and format it in ISO8601
time_now = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')

# Creating dummy data
data = {
    'Pressure': [0, 10, 20, 30, 40], 
    'Temperature': [25.5, 24.8, 23.9, 22.5, 21.0], 
    'Salinity': [35.5, 35.6, 35.7, 35.8, 35.9]  
}
df = pd.DataFrame(data)
# df = pd.read_csv('/path/to/my/data.csv')

# Create an xarray dataset from the dataframe
xrds = xr.Dataset(
    coords={
        'pressure': df['Pressure']
    },
    data_vars={
        'Temperature': ('pressure', df['Temperature']),
        'Salinity': ('pressure', df['Salinity']),
    } 
)

# Add attributes to make it CF-compliant
xrds['Temperature'].attrs = {
    'standard_name': 'sea_water_temperature',
    'long_name': 'Temperature of sea water',
    'units': 'degrees_Celsius',
    'coverage_content_type': 'physicalMeasurement'
}
xrds['Salinity'].attrs = {
    'standard_name': 'sea_water_salinity',
    'long_name': 'Salinity of sea water',
    'units': 'PSU',
    'coverage_content_type': 'physicalMeasurement'
}
xrds['pressure'].attrs = {
    'standard_name': 'sea_water_pressure',
    'long_name': 'Sea water pressure',
    'units': 'dbar',
    'coverage_content_type': 'coordinate'
}

# Global attributes based on these requirements
# https://adc.met.no/node/4
# Based on the Attribute Convention for Data Discovery (ACDD) and Climate & Forecast (CF) conventions
xrds.attrs = {
    'id': 'your_unique_id_here',
    'naming_authority': 'institution that provides the id',
    'title': 'Depth Profile Data',
    'summary': 'This dataset contains depth profiles of temperature and salinity measurements.',
    'keywords': 'sea_water_temperature, sea_water_salinity',
    'keywords_vocabulary': 'CF:NetCDF COARDS Climate and Forecast Standard Names',
    'geospatial_lat_min': 80.6713,
    'geospatial_lat_max': 80.6713,
    'geospatial_lon_min': 31.2093,
    'geospatial_lon_max': 31.2093,
    'time_coverage_start': '2020-04-26T09:56:00Z',
    'time_coverage_end': '2020-04-26T09:56:00Z',
    'Conventions': 'ACDD-1.3, CF-1.11',
    'history': f'{time_now}: Modified by YourName using Python',
    'source': 'Measurement',
    'processing_level': 'Level of processing/quality control',
    'date_created': time_now,
    'creator_type': 'person',
    'creator_institution': 'Your Institution',
    'creator_name': 'Your Name',
    'creator_email': 'your@email.com',
    'creator_url': 'your_url_here', # OrcID is best practice, e.g. https://orcid.org/0000-0002-9746-544X
    'institution': 'Your Institution',
    'publisher_name': 'Publisher Name', # Data centre where your data will be published
    'publisher_email': 'publisher@email.com',
    'publisher_url': 'publisher_url_here',
    'project': 'Your Project Name',
    'instrument': 'CTD',
    'instrument_vocabulary': 'http://vocab.nerc.ac.uk/collection/L22/current/TOOL0001/',
    'license': 'https://creativecommons.org/licenses/by/4.0/',
    'featureType': 'profile',
    'station_name': 'S1' # Custom attribute, but you can add any attribute you like alongside the minimum requirements.
}
xrds

# Save the dataset to a NetCDF file
# xrds.to_netcdf('/path/to/your/depth_profile.nc')

## Creating multiple files using a for loop

You have multiple depth profiles. The two most likely ways your data are structured are:
* 1 tabular file per depth profile
* All profiles in 1 tabular file

In each case we will use for loops to create one NetCDF file per loop.

### 1 tabular file per depth profile

Hopefully all your tabular files are structured in the same way with the same column headers. If not, fix this first! 

Let's first show a basic for loop for those who are unfamiliar with them.

In [2]:
numbers = [1, 2, 3, 4, 5]

# Use a for loop to iterate through the list
for number in numbers:
    print('Printing number',number)

Printing number 1
Printing number 2
Printing number 3
Printing number 4
Printing number 5


Now imagine we are loading in each of your tabular files in turn. Imagine you have 3 files.

In [3]:
files = ['file1.csv','file2.csv','file3.csv']

#for file in files:
#    df = pd.read_csv(file) # This won't work because these files don't exist! But you get the idea.

Or imagine you have lots of files; it might be impractical to write down the name of each.

In [4]:
import glob

# Path to the folder containing CSV files
folder_path = '/path/to/your/folder/'

# Get a list of file paths for all CSV files in the folder
files = glob.glob(folder_path + '*.csv')

#for file in files:
#    df = pd.read_csv(file) # This won't work because these files don't exist! But you get the idea.
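
If you want to try this pattern with files that really exist, the sketch below first writes three small dummy CSV files to a temporary folder and then loads each one in a loop. The dummy values and the `dataframes` dictionary are assumptions for illustration; with your own data you would only need the `glob` and `read_csv` steps.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small dummy CSV files into a temporary folder
folder_path = tempfile.mkdtemp()
for i in range(1, 4):
    dummy = pd.DataFrame({
        'Pressure': [0, 10, 20],
        'Temperature': [5.0, 4.5, 4.0],
        'Salinity': [34.9, 35.0, 35.1]
    })
    dummy.to_csv(os.path.join(folder_path, f'file{i}.csv'), index=False)

# glob returns paths in arbitrary order, so sort them for reproducibility
files = sorted(glob.glob(os.path.join(folder_path, '*.csv')))

# Load each file in turn, keyed by its file name without the folder
dataframes = {}
for file in files:
    dataframes[os.path.basename(file)] = pd.read_csv(file)

print(list(dataframes))
```

Sorting the list is optional, but it makes the order of the loop predictable from one run to the next.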

Now let's build on this to create an xarray object. We will leave out the global attributes for now.

All we have to do is stick our code above inside the for loop to generate one NetCDF file per loop.

In [5]:
import numpy as np

files = ['file1.csv','file2.csv','file3.csv']

for file in files:
    # df = pd.read_csv(file)
    
    # Since we don't have the files here, let's create dummy profiles instead.
    pressures = np.random.uniform(0, 100, 10)  # 10 Random pressures between 0 and 100
    temperatures = np.random.uniform(2, 30, 10)  # Random temperatures between 2°C and 30°C
    salinities = np.random.uniform(30, 40, 10)  # Random salinity between 30 and 40 PSU
    
    # Creating a DataFrame for the depth profile
    df = pd.DataFrame({
        'Pressure': pressures,
        'Temperature': temperatures,
        'Salinity': salinities
    })
    
    # Create an xarray dataset from the dataframe
    xrds = xr.Dataset(
        coords={
            'pressure': df['Pressure']
        },
        data_vars={
            'Temperature': ('pressure', df['Temperature']),
            'Salinity': ('pressure', df['Salinity']),
        } 
    )

    # Add attributes to make it CF-compliant
    xrds['Temperature'].attrs = {
        'standard_name': 'sea_water_temperature',
        'long_name': 'Temperature of sea water',
        'units': 'degrees_Celsius',
        'coverage_content_type': 'physicalMeasurement'
    }
    xrds['Salinity'].attrs = {
        'standard_name': 'sea_water_salinity',
        'long_name': 'Salinity of sea water',
        'units': 'PSU',
        'coverage_content_type': 'physicalMeasurement'
    }
    xrds['pressure'].attrs = {
        'standard_name': 'sea_water_pressure',
        'long_name': 'Sea water pressure',
        'units': 'dbar',
        'coverage_content_type': 'coordinate'
    }
    
    print(xrds, '\n')
    # xrds.to_netcdf('/path/to/your/depth_profile.nc')

<xarray.Dataset>
Dimensions:      (pressure: 10)
Coordinates:
  * pressure     (pressure) float64 53.81 72.04 34.82 ... 58.14 66.28 91.99
Data variables:
    Temperature  (pressure) float64 6.481 17.95 19.32 ... 21.87 4.615 22.82
    Salinity     (pressure) float64 30.77 33.48 33.24 36.12 ... 30.02 36.6 35.63 

<xarray.Dataset>
Dimensions:      (pressure: 10)
Coordinates:
  * pressure     (pressure) float64 57.38 66.62 0.6772 ... 84.83 64.44 72.7
Data variables:
    Temperature  (pressure) float64 21.42 21.71 14.4 9.453 ... 27.34 18.14 13.12
    Salinity     (pressure) float64 36.21 30.26 30.89 ... 31.81 38.79 38.87 

<xarray.Dataset>
Dimensions:      (pressure: 10)
Coordinates:
  * pressure     (pressure) float64 20.04 11.64 8.974 ... 86.67 13.92 7.189
Data variables:
    Temperature  (pressure) float64 15.05 13.42 16.91 ... 20.75 23.64 26.98
    Salinity     (pressure) float64 35.75 33.13 35.69 ... 33.58 39.87 34.38 
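
As an aside, the body of the loop above can be kept short by moving the dataset-building code into a function. The sketch below is one possible way to do this; `create_profile_dataset` and `VARIABLE_ATTRIBUTES` are hypothetical names chosen for this example, and the profiles are again random dummy data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Variable attributes that are the same for every profile
VARIABLE_ATTRIBUTES = {
    'Temperature': {
        'standard_name': 'sea_water_temperature',
        'long_name': 'Temperature of sea water',
        'units': 'degrees_Celsius',
        'coverage_content_type': 'physicalMeasurement'
    },
    'Salinity': {
        'standard_name': 'sea_water_salinity',
        'long_name': 'Salinity of sea water',
        'units': 'PSU',
        'coverage_content_type': 'physicalMeasurement'
    },
    'pressure': {
        'standard_name': 'sea_water_pressure',
        'long_name': 'Sea water pressure',
        'units': 'dbar',
        'coverage_content_type': 'coordinate'
    }
}


def create_profile_dataset(df):
    """Build an xarray dataset for one depth profile from a dataframe."""
    xrds = xr.Dataset(
        coords={'pressure': df['Pressure'].to_numpy()},
        data_vars={
            'Temperature': ('pressure', df['Temperature'].to_numpy()),
            'Salinity': ('pressure', df['Salinity'].to_numpy()),
        }
    )
    for name, attributes in VARIABLE_ATTRIBUTES.items():
        # Copy the dictionary so that the datasets do not share one object
        xrds[name].attrs = dict(attributes)
    return xrds


files = ['file1.csv', 'file2.csv', 'file3.csv']
datasets = {}

for file in files:
    # df = pd.read_csv(file)
    # Dummy profile, since the files don't exist here
    df = pd.DataFrame({
        'Pressure': np.random.uniform(0, 100, 10),
        'Temperature': np.random.uniform(2, 30, 10),
        'Salinity': np.random.uniform(30, 40, 10)
    })
    datasets[file] = create_profile_dataset(df)

print(list(datasets))
```

Like the loop above, this sketch does not yet deal with the name of each output file.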



But there is a problem. In the example above, we are giving each NetCDF file the same file name. Therefore, we will just create the file during the first loop and then overwrite it with each loop that follows.

To assign a different file name for each NetCDF file, we need to include a variable in the filename.

We also need to assign a different set of global attributes for each file - though some attributes will be the same.
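
One possible source for the attributes that vary is a small metadata table with one row per input file, for example loaded from a sidecar CSV file. The sketch below assumes such a table; its column names, the station names, the positions, the timestamps and the `depth_profile_` filename prefix are all made up for illustration.

```python
import pandas as pd

# Hypothetical sidecar table with one row of metadata per input file.
# In practice this could be loaded with pd.read_csv instead.
metadata = pd.DataFrame({
    'file': ['file1.csv', 'file2.csv', 'file3.csv'],
    'station_name': ['S1', 'S2', 'S3'],
    'latitude': [80.6713, 80.9012, 81.2345],
    'longitude': [31.2093, 30.5578, 29.8841],
    'timestamp': [
        '2020-04-26T09:56:00Z',
        '2020-04-27T14:10:00Z',
        '2020-04-28T07:32:00Z'
    ]
}).set_index('file')

output_files = []

for file in metadata.index:
    row = metadata.loc[file]
    station_name = row['station_name']

    # Only the global attributes that differ between files are shown here
    global_attributes = {
        'title': f'Depth Profile Data from {station_name}',
        'geospatial_lat_min': float(row['latitude']),
        'geospatial_lat_max': float(row['latitude']),
        'geospatial_lon_min': float(row['longitude']),
        'geospatial_lon_max': float(row['longitude']),
        'time_coverage_start': row['timestamp'],
        'time_coverage_end': row['timestamp'],
        'station_name': station_name
    }

    # Including the station name makes each output filename unique
    filename = f'depth_profile_{station_name}.nc'
    output_files.append(filename)
    print(filename, '-', global_attributes['title'])
```

The attributes that are the same for every file can be written once, outside the loop, and merged with the per-file values inside it.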

In [6]:
files = ['file1.csv','file2.csv','file3.csv']

# Side car file? Information in headers? Perhaps need examples in the data folder.

for file in files:
    # df = pd.read_csv(file)
    
    # Since we don't have the files here, let's create dummy profiles instead.
    pressures = np.random.uniform(0, 100, 10)  # 10 Random pressures between 0 and 100
    temperatures = np.random.uniform(2, 30, 10)  # Random temperatures between 2°C and 30°C
    salinities = np.random.uniform(30, 40, 10)  # Random salinity between 30 and 40 PSU
    
    # Creating a DataFrame for the depth profile
    df = pd.DataFrame({
        'Pressure': pressures,
        'Temperature': temperatures,
        'Salinity': salinities
    })
    
    # Create an xarray dataset from the dataframe
    xrds = xr.Dataset(
        coords={
            'pressure': df['Pressure']
        },
        data_vars={
            'Temperature': ('pressure', df['Temperature']),
            'Salinity': ('pressure', df['Salinity']),
        } 
    )

    # Add attributes to make it CF-compliant
    xrds['Temperature'].attrs = {
        'standard_name': 'sea_water_temperature',
        'long_name': 'Temperature of sea water',
        'units': 'degrees_Celsius',
        'coverage_content_type': 'physicalMeasurement'
    }
    xrds['Salinity'].attrs = {
        'standard_name': 'sea_water_salinity',
        'long_name': 'Salinity of sea water',
        'units': 'PSU',
        'coverage_content_type': 'physicalMeasurement'
    }
    xrds['pressure'].attrs = {
        'standard_name': 'sea_water_pressure',
        'long_name': 'Sea water pressure',
        'units': 'dbar',
        'coverage_content_type': 'coordinate'
    }
    
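    # Get the current timestamp in UTC and format it in ISO8601 - different for each file
    time_now = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
    
    # Metadata that varies between files. These are placeholder values (assumptions for this example).
    # In practice, read them from a sidecar file or from the header of each tabular file.
    station_name = file.replace('.csv', '')  # e.g. 'file1'
    your_id = f'your_unique_id_for_{station_name}'
    latitude = 80.6713
    longitude = 31.2093
    timestamp_collected_iso8601 = '2020-04-26T09:56:00Z'
    
    # Global attributes including variables to vary them for each loop (and each NetCDF file)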
    xrds.attrs = {
        'id': your_id,
        'naming_authority': 'institution that provides the id',
        'title': f'Depth Profile Data from {station_name}',
        'summary': f'This dataset contains depth profiles of temperature and salinity measurements from {station_name}.',
        'keywords': 'sea_water_temperature, sea_water_salinity',
        'keywords_vocabulary': 'CF:NetCDF COARDS Climate and Forecast Standard Names',
        'geospatial_lat_min': latitude,
        'geospatial_lat_max': latitude,
        'geospatial_lon_min': longitude,
        'geospatial_lon_max': longitude,
        'time_coverage_start': timestamp_collected_iso8601,
        'time_coverage_end': timestamp_collected_iso8601,
        'Conventions': 'ACDD-1.3, CF-1.11',
        'history': f'{time_now}: Modified by YourName using Python',
        'source': 'Measurement',
        'processing_level': 'Level of processing/quality control',
        'date_created': time_now,
        'creator_type': 'person',
        'creator_institution': 'Your Institution',
        'creator_name': 'Your Name',
        'creator_email': 'your@email.com',
        'creator_url': 'your_url_here', # OrcID is best practice, e.g. https://orcid.org/0000-0002-9746-544X
        'institution': 'Your Institution',
        'publisher_name': 'Publisher Name', # Data centre where your data will be published
        'publisher_email': 'publisher@email.com',
        'publisher_url': 'publisher_url_here',
        'project': 'Your Project Name',
        'instrument': 'CTD',
        'instrument_vocabulary': 'http://vocab.nerc.ac.uk/collection/L22/current/TOOL0001/',
        'license': 'https://creativecommons.org/licenses/by/4.0/',
        'featureType': 'profile',
        'station_name': station_name # Custom attribute, but you can add any attribute you like alongside the minimum requirements.
    }
    
    # Build a different filename for each loop from the name of the input file
    filename = file.replace('.csv', '')  # e.g. 'file1'
    # xrds.to_netcdf(f'/path/to/your/{filename}.nc')