# Working with the `InputDataset` class

## Contents
1. [Introduction](#1.-Introduction)
2. [InputDataset subclasses and their instantiation](#2.-InputDataset-subclasses-and-their-instantiation)
3. [Working with different sources](#3.-Working-with-different-sources)
   - [Working with local, prepared (netCDF) sources](#3i.-Working-with-local,-prepared-netCDF-sources)
   - [Working with remote, prepared (netCDF) sources](#3ii.-Working-with-remote-prepared-netCDF-sources)
   - [Working with unprepared (yaml) sources](#3iii.-Working-with-unprepared-yaml-sources)

## 1. Introduction
In C-Star, the `InputDataset` holds information on, and offers methods relevant to, files containing numerical data required by a simulation (such as initial conditions). This can be compared with [the `AdditionalCode` class](LINK-TODO), which is related to text-based files needed by a simulation (such as lists of custom settings).

The `InputDataset` class is an abstract class and cannot be instantiated directly. Instead, the relevant subclass should be used.

## 2. InputDataset subclasses and their instantiation
C-Star currently supports the ROMS ocean model, for which there are five `InputDataset` subclasses:
```
InputDataset
 └── ROMSInputDataset
     ├── ROMSModelGrid
     ├── ROMSInitialConditions
     ├── ROMSTidalForcing
     ├── ROMSBoundaryForcing
     └── ROMSSurfaceForcing
```
As mentioned above, the `InputDataset` and `ROMSInputDataset` are abstract base classes, so one of these five subclasses must be instantiated.
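To illustrate what "abstract" means in practice, here is a minimal sketch using Python's standard `abc` module. The class names `SketchInputDataset` and `SketchModelGrid` are hypothetical stand-ins, not C-Star classes, and the sketch assumes nothing about how C-Star itself declares its abstract methods.

```python
from abc import ABC, abstractmethod


class SketchInputDataset(ABC):
    """Hypothetical stand-in for an abstract dataset base class."""

    def __init__(self, location: str):
        self.location = location

    @abstractmethod
    def get(self, local_dir: str) -> None:
        """Subclasses define how a working copy is produced."""


class SketchModelGrid(SketchInputDataset):
    """Hypothetical concrete subclass that implements the abstract method."""

    def get(self, local_dir: str) -> None:
        print(f"would fetch {self.location} into {local_dir}")


# Instantiating the abstract base class fails with a TypeError...
try:
    SketchInputDataset(location="grid.nc")
except TypeError as err:
    print(f"TypeError: {err}")

# ...whereas the concrete subclass can be created as usual.
grid = SketchModelGrid(location="grid.nc")
```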

The parameters required to create an `InputDataset` instance vary depending on the source. Let's consider each in turn:

## 3. Working with different sources
### 3i. Working with local, prepared (netCDF) sources
In the simplest case, the input dataset already exists, in a ROMS-compatible (netCDF) format, on the local filesystem. In this case, we only need to provide the `location` parameter, with a path to the file:

In [1]:
from cstar.roms import ROMSModelGrid
my_grid = ROMSModelGrid(location="~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc")
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: ~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a working version with `InputDataset.get()`:
In the above example, we see that `Working path` is `None` and that we should call `InputDataset.get()` to change this. In the case of a local `netCDF` file, whose contents cannot be tampered with by C-Star, calling `get()` creates a symbolic link in the working directory to the source file:

In [2]:
my_grid.get(local_dir = "~/Code/my_c_star/examples/input_dataset_example")

In [3]:
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: ~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc
Working path: /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc (exists)
Local hash: {PosixPath('/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc'): '8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e'}


After calling `get()`, there is additional information associated with this `InputDataset`: the source location, as before, but also the `Working path` (in this case a symbolic link to the source location) and a `Local hash`, a checksum of the file used to guard against changes to, or tampering with, it.
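The link-then-hash behaviour described above can be approximated with the standard library. This is only a sketch under stated assumptions: `link_into_working_dir` is a hypothetical helper, not part of C-Star, and the file written here contains placeholder bytes rather than real netCDF data.

```python
import hashlib
import tempfile
from pathlib import Path


def link_into_working_dir(source: Path, local_dir: Path) -> Path:
    """Create a symlink to `source` inside `local_dir` and return its path."""
    local_dir.mkdir(parents=True, exist_ok=True)
    working_path = local_dir / source.name
    if not working_path.is_symlink():
        working_path.symlink_to(source.resolve())
    return working_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder source file standing in for a prepared netCDF file.
    source = Path(tmp) / "source" / "sample_grd.nc"
    source.parent.mkdir()
    source.write_bytes(b"placeholder bytes, not real netCDF")

    working_path = link_into_working_dir(source, Path(tmp) / "work")
    is_link = working_path.is_symlink()

    # Hashing through the link reads the source file's bytes.
    local_hash = hashlib.sha256(working_path.read_bytes()).hexdigest()
    source_hash = hashlib.sha256(source.read_bytes()).hexdigest()

print(is_link, local_hash == source_hash)
```

Because the working path is only a link, no data is duplicated on disk, and the recorded hash is that of the source file itself.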

### 3ii. Working with remote, prepared (netCDF) sources
In this case, as above, the input dataset already exists, in a ROMS-compatible (netCDF) format, but this time is stored at a remote location. Now, the `location` parameter will be a URL, and we also need to provide a value for the `file_hash` parameter. 

<div class="alert alert-info">

**Note**
    
The `file_hash` parameter is a string digest of the entire file. It is used for security with remote binary files (such as netCDF) to verify that any download by C-Star corresponds exactly to the expected data. C-Star uses SHA-256 (256-bit) checksums for hashes.

If you do not know the file hash, ask the creator of the file to compute it from their local copy.

</div>

In [4]:
from cstar.roms import ROMSModelGrid
my_grid = ROMSModelGrid(location="https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc",
                       file_hash="8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e")
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc
Source file hash: 8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a local copy with `InputDataset.get()`:
As before, we see that `Working path` is `None` and that we should call `InputDataset.get()` to change this. In the case of a _remote_ `netCDF` file, calling `get()` downloads a copy of the source file to the working directory:

In [5]:
my_grid.get(local_dir = "~/Code/my_c_star/examples/input_dataset_example")

In [6]:
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc
Source file hash: 8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e
Working path: /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc (exists)
Local hash: {PosixPath('/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc'): '8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e'}


### 3iii. Working with unprepared (yaml) sources
C-Star also supports creating input datasets from plaintext instructions in `.yaml` format, by interfacing with the `roms-tools` python package. For more information on creating datasets to export in this format, see [the `roms-tools` documentation](https://roms-tools.readthedocs.io/en/latest/). 

As we are working with plain text (rather than binary files as in the examples above) we don't need to verify remote downloads, and so the process for using local or remote files is the same: we simply provide the `location` parameter, either a URL or local path.

As we are creating the dataset from scratch, depending on the type of dataset, we also need some additional information. In particular, the `start_date` and `end_date` parameters allow C-Star to tell `roms-tools` the dates between which the dataset is required (if any - the grid is time-invariant, for instance).

In [7]:
from cstar.roms import ROMSSurfaceForcing
my_surface_forcing = ROMSSurfaceForcing(
    location="~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml",
    start_date="2012-01-01 12:00:00",
    end_date="2012-01-02 12:00:00"
)

print(my_surface_forcing)

------------------
ROMSSurfaceForcing
------------------
Source location: ~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml
start_date: 2012-01-01 12:00:00
end_date: 2012-01-02 12:00:00
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a prepared copy with `InputDataset.get()`:

In [None]:
my_surface_forcing.get(local_dir = "~/Code/my_c_star/examples/input_dataset_example")

INFO - Data will be interpolated onto fine grid.
INFO - Writing the following NetCDF files:
/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/roms_frc_201201.nc


Saving roms-tools dataset created from ~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml...
[###                                     ] | 8% Completed | 4hr 53mss

Besides `location`, the optional parameters are:
- `file_hash` (`str`): The SHA-256 checksum associated with the file found at `location`. More information can be found [below](#Remote-vs-local-input-datasets).
- `start_date` (`str` or `datetime`): For spatiotemporal datasets, this is the earliest date associated with the dataset. More information can be found [below](#Prepared-vs-unprepared-input-datasets).
- `end_date` (`str` or `datetime`): For spatiotemporal datasets, this is the latest date associated with the dataset. More information can be found [below](#Prepared-vs-unprepared-input-datasets).

## Prepared vs unprepared input datasets
C-Star users have two options for the source of an `InputDataset`:
- `netCDF` files, which are already prepared and ready to be provided directly to the model.
- `yaml` files, which contain plaintext instructions to _create_ a `netCDF` file using the `roms-tools` package.

`netCDF` files are typically very large (often `TB` in total for a meaningful simulation) whereas `yaml` files are only a few `kB`, making them easier to work with when preparing or obtaining a remotely hosted simulation. However, `yaml` files necessitate generating the corresponding `netCDF` locally, a process that can have a large memory footprint and additionally [requires an available copy of any datasets that `roms-tools` requires.](https://roms-tools.readthedocs.io/en/latest/datasets.html)

Instantiating the `InputDataset` is the same in either case - the `location` simply points to a file in the chosen format. If the user points to a `yaml` file and also provides a `start_date` or `end_date`, the corresponding entries in the `yaml` file will be updated to these values such that the final produced dataset is defined on the correct date range. If the user points to a `netCDF` file, this is not possible - the `start_date` and `end_date` are simply used for internal checks within C-Star.
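As a rough illustration of the date override, the sketch below works on a plain dictionary shaped like the `SurfaceForcing` section of the `yaml` file printed in the example at the end of this document, where `start_time` and `end_time` are `null`. The helper `with_date_range` is hypothetical, and the assumption that the overridden entries are `start_time` and `end_time` is inferred from that printed file, not from C-Star's source.

```python
from copy import deepcopy
from datetime import datetime
from typing import Optional

# Abbreviated stand-in for the parsed yaml content (assumed structure).
settings = {
    "SurfaceForcing": {
        "start_time": None,
        "end_time": None,
        "type": "physics",
    }
}


def with_date_range(
    settings: dict,
    section: str,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> dict:
    """Return a copy of `settings` with the section's date entries overridden."""
    updated = deepcopy(settings)
    if start_date is not None:
        updated[section]["start_time"] = start_date.isoformat()
    if end_date is not None:
        updated[section]["end_time"] = end_date.isoformat()
    return updated


updated = with_date_range(
    settings,
    "SurfaceForcing",
    start_date=datetime(2012, 1, 1, 12),
    end_date=datetime(2012, 1, 2, 12),
)
print(updated["SurfaceForcing"]["start_time"])  # 2012-01-01T12:00:00
```

Working on a copy leaves the original settings untouched, which mirrors the idea that the source `yaml` file is not modified on disk.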

## Remote vs local input datasets
C-Star users further have the choice between working with source files accessible to C-Star locally, or remote files (that must be downloaded). Instantiation is the same in both cases for `yaml` files, which can be read directly from a remote source. 

When the source is a remote netCDF file, the `file_hash` parameter must be specified. This is for security purposes - if the provided hash does not match that of the downloaded file, the file will be deleted and C-Star will raise an error containing the expected and received file hashes.
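The check described above can be sketched with the standard library as follows. The function `verify_or_delete` is a hypothetical illustration, not C-Star's implementation, and the "downloaded" file is simulated with placeholder bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_or_delete(path: Path, expected_hash: str) -> None:
    """Delete `path` and raise ValueError if its SHA-256 digest differs."""
    received_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    if received_hash != expected_hash.lower():
        path.unlink()
        raise ValueError(
            f"Hash mismatch for {path.name}: "
            f"expected {expected_hash}, received {received_hash}"
        )


with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "sample_grd.nc"
    downloaded.write_bytes(b"placeholder bytes, not real netCDF")

    # A deliberately wrong hash triggers deletion and an error.
    try:
        verify_or_delete(downloaded, expected_hash="0" * 64)
    except ValueError as err:
        print(err)
    still_exists = downloaded.exists()

print(still_exists)
```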

You can calculate the hash of a local file using `cstar.base.utils.get_sha256_hash`, for instance if you were planning to upload the file to a remote location and [create/share a blueprint](LINK-TODO) that uses it. It is not possible in general to calculate the file hash of a remote file without first downloading it.
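If you want to cross-check a hash without C-Star, a SHA-256 digest can be computed with Python's `hashlib`. The function `sha256_of_file` below is a hypothetical helper shown for illustration; it is not the implementation of `cstar.base.utils.get_sha256_hash`. It reads the file in chunks, which keeps memory use low for large netCDF files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.nc"
    example.write_bytes(b"placeholder bytes, not real netCDF")
    example_hash = sha256_of_file(example)

print(example_hash)
```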


## Fetching `InputDataset`s via `InputDataset.get()` or `Case.setup()`
`InputDataset` instances in C-Star at first point to their source location, as described above. When `InputDataset.get()` is called, a local, model-ready version of the dataset is established, tracked separately under the `working_path` attribute.

The realization of the local version depends on the source:
- for local or remote `yaml` files, the file is loaded into memory, modified as necessary, and passed to `roms-tools` to generate a netCDF file at the local location
- for remote netCDF files, the file is downloaded and stored at the local location
- for local netCDF files, a symbolic link to the source file is created at the local location to minimize storage demand
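The three cases above amount to a decision based on the file type and on whether the source is remote. The sketch below expresses that decision with the standard library only. `describe_get_action` is a hypothetical function, and treating `http`/`https` URLs as "remote" and the `.nc`, `.yaml`, and `.yml` extensions as the distinguishing feature are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def describe_get_action(location: str) -> str:
    """Return a short description of how a working copy would be produced."""
    parsed = urlparse(location)
    is_remote = parsed.scheme in ("http", "https")
    suffix = PurePosixPath(parsed.path).suffix.lower()

    if suffix in (".yaml", ".yml"):
        # Local and remote yaml sources are handled the same way.
        return "generate netCDF with roms-tools"
    if suffix == ".nc":
        return "download" if is_remote else "symlink"
    raise ValueError(f"Unrecognised source type: {location}")


for location in (
    "~/input_data/sample_grd_riv.nc",
    "https://example.com/data/sample_grd_riv.nc",
    "~/input_datasets_yaml/roms_frc.yaml",
):
    print(location, "->", describe_get_action(location))
```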

When calling `Case.setup()` on a `Case` with one or more `InputDataset`s, `InputDataset.get()` is called on each, fetching them to the caseroot.

### `yaml` example:

In [1]:
from cstar.roms import ROMSSurfaceForcing
my_forcing = ROMSSurfaceForcing(
    location="/Users/dafyddstephenson/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml",
    start_date="2012-01-15 12:00:00",
    end_date="2012-01-16 12:00:00"
    )
print(my_forcing)

------------------
ROMSSurfaceForcing
------------------
Source location: /Users/dafyddstephenson/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml
start_date: 2012-01-15 12:00:00
end_date: 2012-01-16 12:00:00
Working path: None ( does not yet exist. Call InputDataset.get() )


In [2]:
my_forcing.get("/Users/dafyddstephenson/Downloads")


---
roms_tools_version: 2.2.1
---
Grid:
  N: 20
  center_lat: 52.4
  center_lon: -4.1
  hc: 300.0
  hmin: 5.0
  nx: 30
  ny: 30
  rot: 0
  size_x: 240
  size_y: 240
  theta_b: 2.0
  theta_s: 5.0
  topography_source:
    name: ETOPO5
SurfaceForcing:
  correct_radiation: true
  end_time: null
  model_reference_date: '2000-01-01T00:00:00'
  source:
    climatology: false
    name: ERA5
    path: /Users/dafyddstephenson/code/roms_tools_datasets/ERA5_2012-01.nc
  start_time: null
  type: physics
  use_coarse_grid: false
Saving roms-tools dataset created from /Users/dafyddstephenson/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml...
[########################################] | 100% Completed | 28.70 s
