# Working with the `InputDataset` class

## Contents
1. [Introduction](#1.-Introduction)
2. [InputDataset subclasses and their instantiation](#2.-InputDataset-subclasses-and-their-instantiation)
3. [Working with different sources](#3.-Working-with-different-sources)
   - [Working with local, prepared (netCDF) sources](#3i.-Working-with-local,-prepared-netCDF-sources)
   - [Working with remote, prepared (netCDF) sources](#3ii.-Working-with-remote-prepared-netCDF-sources)
   - [Working with unprepared (yaml) sources](#3iii.-Working-with-unprepared-yaml-sources)
4. [Partitioning input datasets for use with ROMS](#4.-Partitioning-input-datasets-for-use-with-ROMS)
5. [Summary](#5.-Summary)

## 1. Introduction
In C-Star, the `InputDataset` class holds information on, and offers methods relevant to, files containing numerical data required by a simulation (such as initial conditions). This can be compared with [the `AdditionalCode` class](LINK-TODO), which handles text-based files needed by a simulation (such as lists of custom settings).

The `InputDataset` class is an abstract class, and cannot be instantiated directly. Instead, the relevant subclass should be used.

## 2. InputDataset subclasses and their instantiation
C-Star currently supports the ROMS ocean model, for which there are five `InputDataset` subclasses:
```
InputDataset
 └── ROMSInputDataset
     ├── ROMSModelGrid
     ├── ROMSInitialConditions
     ├── ROMSTidalForcing
     ├── ROMSBoundaryForcing
     └── ROMSSurfaceForcing
```
As mentioned above, the `InputDataset` and `ROMSInputDataset` are abstract base classes, so one of these five subclasses must be instantiated.
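
To illustrate what "abstract" means in practice, here is a minimal sketch using Python's standard `abc` module. The class names `BaseDataset` and `GridDataset` are hypothetical stand-ins, and we are only assuming that C-Star's classes behave in a comparable way; this is not C-Star's own code.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Hypothetical stand-in for an abstract base class such as InputDataset."""

    def __init__(self, location: str):
        self.location = location

    @abstractmethod
    def get(self, local_dir: str) -> None:
        """Concrete subclasses must say how to obtain a working copy."""


class GridDataset(BaseDataset):
    """Hypothetical concrete subclass, playing the role of ROMSModelGrid."""

    def get(self, local_dir: str) -> None:
        print(f"Would place {self.location} in {local_dir}")


# Instantiating the abstract class fails with a TypeError...
try:
    BaseDataset(location="grid.nc")
    base_instantiated = True
except TypeError as err:
    base_instantiated = False
    print(f"Cannot instantiate the abstract class: {err}")

# ...whereas the concrete subclass can be created and used.
grid = GridDataset(location="grid.nc")
grid.get(local_dir="work")
```

The same pattern applies to the tree above: only the five leaf classes are meant to be created by users.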

The parameters required to create an `InputDataset` instance vary depending on the source. Let's consider each in turn:

## 3. Working with different sources
### 3i. Working with local, prepared (netCDF) sources
In the simplest case, the input dataset already exists, in a ROMS-compatible (netCDF) format, on the local filesystem. In this case, we only need to provide the `location` parameter, with a path to the file:

In [1]:
from cstar.roms import ROMSModelGrid
my_grid = ROMSModelGrid(location="~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc")
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: ~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a working version with `InputDataset.get()`:
<div class="alert alert-info">

**Note**
    
Most users will not need to use the `get()` method: if your `InputDataset` is part of a `ROMSSimulation` instance, then C-Star will call `get()` automatically as part of any `ROMSSimulation.setup()` call.

</div>

In the example above, `Working path` is `None`, and the output prompts us to call `InputDataset.get()` to change this. For a local `netCDF` file, whose contents C-Star does not modify, calling `get()` creates a symbolic link to the source file in the working directory:



In [2]:
my_grid.get(local_dir = "~/Code/my_c_star/examples/input_dataset_example")

In [3]:
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: ~/Code/my_ucla_roms/Examples/input_data/sample_grd_riv.nc
Working path: /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc (exists)
Local hash: {PosixPath('/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc'): '8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e'}


After calling `get()`, there is additional information associated with this `InputDataset`: the source location, as before, but also the `Working path` (in this case a symbolic link to the source location) and a `Local hash`, a checksum of the file that allows later changes to it, or tampering with it, to be detected.
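
The linking step can be pictured with the standard library alone. The sketch below assumes a POSIX-like system where symbolic links can be created freely, and uses placeholder file and directory names inside a temporary directory; it illustrates the idea of linking rather than copying, and is not C-Star's implementation.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Stand-in for a prepared source file (the contents are placeholder bytes)
    source = tmp_path / "source" / "sample_grd.nc"
    source.parent.mkdir()
    source.write_bytes(b"placeholder data")

    # Stand-in for the working directory passed as `local_dir`
    working_dir = tmp_path / "work"
    working_dir.mkdir()

    # Link rather than copy: the source file itself is left untouched
    working_path = working_dir / source.name
    working_path.symlink_to(source)

    is_link = working_path.is_symlink()
    same_file = working_path.resolve() == source.resolve()
    print(f"symlink: {is_link}, points at source: {same_file}")
```

Linking avoids duplicating large files on disk while still giving the simulation a path inside its own working directory.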

### 3ii. Working with remote, prepared (netCDF) sources
In this case, as above, the input dataset already exists, in a ROMS-compatible (netCDF) format, but this time is stored at a remote location. Now, the `location` parameter will be a URL, and we also need to provide a value for the `file_hash` parameter. 

<div class="alert alert-info">

**Note**
    
The `file_hash` parameter is a checksum computed from the entire contents of the file. It is used for security with remote binary files (such as netCDF) to verify that any download by C-Star corresponds exactly to the expected data. C-Star uses SHA-256 (256-bit) hashes.

If you do not know the file hash, ask the creator of the file to compute it from their local copy.

</div>
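
For reference, a SHA-256 checksum can be computed with Python's standard `hashlib` module. The helper name `sha256_of_file` and the temporary example file below are our own illustrative choices; the sketch only shows how such a hash is obtained, not how C-Star computes or checks it internally.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use low for very large files
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file rather than a real dataset
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.bin"
    example.write_bytes(b"abc")
    example_hash = sha256_of_file(example)

print(example_hash)
print(len(example_hash))  # 64 hexadecimal characters, i.e. 256 bits
```

On the command line, `shasum -a 256 <file>` (macOS) or `sha256sum <file>` (Linux) should give the same string.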

In [4]:
from cstar.roms import ROMSModelGrid
my_grid = ROMSModelGrid(location="https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc",
                       file_hash="8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e")
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc
Source file hash: 8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a local copy with `InputDataset.get()`:
<div class="alert alert-info">

**Note**
    
Most users will not need to use the `get()` method: if your `InputDataset` is part of a `ROMSSimulation` instance, then C-Star will call `get()` automatically as part of any `ROMSSimulation.setup()` call.

</div>

As before, `Working path` is `None`, and the output prompts us to call `InputDataset.get()` to change this. For a _remote_ `netCDF` file, calling `get()` downloads a copy of the source file to the working directory:

In [5]:
my_grid.get(local_dir = "~/Code/my_c_star/examples/input_dataset_example")

Downloading file 'sample_grd_riv.nc' from 'https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc' to '/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example'.


In [6]:
print(my_grid)

-------------
ROMSModelGrid
-------------
Source location: https://github.com/dafyddstephenson/ucla_roms_examples_input_data/raw/main/sample_grd_riv.nc
Source file hash: 8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e
Working path: /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc (exists)
Local hash: {PosixPath('/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/sample_grd_riv.nc'): '8e2f1ca3135ac7f5696d3eaec79b035a1bae15c8a34e751a7f9d925787ab3f6e'}


### 3iii. Working with unprepared (yaml) sources
C-Star also supports creating input datasets from plaintext instructions in `.yaml` format, by interfacing with the `roms-tools` Python package. `netCDF` files are typically very large (often `TB` in total for a meaningful simulation), whereas `yaml` files are only a few `kB`, making them easier to work with when preparing or obtaining a remotely hosted simulation. However, a `yaml` file requires the corresponding `netCDF` file to be generated locally, a process that can have a large memory footprint and that also [requires an available copy of any source datasets that `roms-tools` needs](https://roms-tools.readthedocs.io/en/latest/datasets.html). For more information on creating datasets to export in `yaml` format, see [the `roms-tools` documentation](https://roms-tools.readthedocs.io/en/latest/).

As we are working with plain text (rather than binary files as in the examples above) we don't need to verify remote downloads, and so the process for using local or remote files is the same: we simply provide the `location` parameter, either a URL or local path.
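
As a rough illustration of how a single `location` string can cover both cases, the sketch below tells URLs and local paths apart with the standard library. The helper `describe_location` and the example strings are hypothetical; how C-Star actually makes this distinction is not shown here.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe_location(location: str) -> str:
    """Classify a location string as a remote URL or a local path (hypothetical helper)."""
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        return "url"
    return "path"


remote_kind = describe_location("https://example.com/roms_frc.yaml")
local_kind = describe_location("~/blueprints/roms_frc.yaml")
print(remote_kind, local_kind)

# A local path written with "~" can be expanded to the user's home directory
expanded = Path("~/blueprints/roms_frc.yaml").expanduser()
print(expanded.is_absolute())
```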

As we are creating the dataset from scratch, we may also need to provide some additional information, depending on the type of dataset. In particular, the `start_date` and `end_date` parameters allow C-Star to tell `roms-tools` the dates between which the dataset is required. These parameters do not apply to every dataset type: the grid, for instance, is time-invariant.
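
The date strings used in this guide follow the `YYYY-MM-DD HH:MM:SS` pattern. As a small sanity check before building a dataset, such strings can be parsed and compared with the standard `datetime` module. This is a sketch of a check a user might run themselves; whether C-Star performs the same validation is an assumption we do not rely on.

```python
from datetime import datetime

# Format matching strings such as "2012-01-01 12:00:00"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

start_date = datetime.strptime("2012-01-01 12:00:00", DATE_FORMAT)
end_date = datetime.strptime("2012-01-04 12:00:00", DATE_FORMAT)

# A dataset covering an empty or negative time span would not be useful
if end_date <= start_date:
    raise ValueError("end_date must be later than start_date")

duration = end_date - start_date
print(f"Forcing is required for {duration.days} days")
```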

In [7]:
from cstar.roms import ROMSSurfaceForcing
my_surface_forcing = ROMSSurfaceForcing(
    location="~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml",
    start_date="2012-01-01 12:00:00",
    end_date = "2012-01-04 12:00:00"
)

print(my_surface_forcing)

------------------
ROMSSurfaceForcing
------------------
Source location: ~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml
start_date: 2012-01-01 12:00:00
end_date: 2012-01-04 12:00:00
Working path: None ( does not yet exist. Call InputDataset.get() )


### Creating a prepared copy with `InputDataset.get()`:
<div class="alert alert-info">

**Note**
    
Most users will not need to use the `get()` method: if your `InputDataset` is part of a `ROMSSimulation` instance, then C-Star will call `get()` automatically as part of any `ROMSSimulation.setup()` call.

</div>

In [8]:
my_surface_forcing.get(local_dir="~/Code/my_c_star/examples/input_dataset_example/")

INFO - Data will be interpolated onto fine grid.
INFO - Writing the following NetCDF files:
/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/roms_frc_201201.nc


Saving roms-tools dataset created from ~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml...
[########################################] | 100% Completed | 2.98 s


## 4. Partitioning input datasets for use with ROMS

<div class="alert alert-info">

**Note**
    
Most users will not need to use the `partition()` method: if your `InputDataset` is part of a `ROMSSimulation` instance, then C-Star will call `partition()` automatically as part of any `ROMSSimulation.pre_run()` call.

</div>

ROMS requires that input datasets are "partitioned" - i.e., split into several smaller files such that each processor in a parallel run works with a subset of the entire domain. To perform this action, call `InputDataset.partition()`. 

The `np_xi` and `np_eta` parameters of this method correspond to the number of processors in the `xi` and `eta` directions (roughly corresponding to East-West and North-South, depending on grid rotation):

In [11]:
my_surface_forcing.partition(np_xi=3,np_eta=3)

Partitioning /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/roms_frc_201201.nc into (3,3)




We can see that the method executed successfully, as C-Star is now additionally tracking the `Partitioned files`:

In [12]:
print(my_surface_forcing)

------------------
ROMSSurfaceForcing
------------------
Source location: ~/Code/my_c_star/blueprints/cstar_blueprint_roms_marbl_example/input_datasets_yaml/roms_frc.yaml
start_date: 2012-01-01 12:00:00
end_date: 2012-01-04 12:00:00
Working path: /Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/roms_frc_201201.nc (exists)
Local hash: {PosixPath('/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/roms_frc_201201.nc'): '9cb484309c1ca9d5c0e185a3f50f59e7ee837dd1b183014093f7027f8235a60b'}
Partitioned files: ['/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/PARTITIONED/roms_frc_201201.0.nc',
                    '/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/PARTITIONED/roms_frc_201201.1.nc',
                       ...
                    '/Users/dafyddstephenson/Code/my_c_star/examples/input_dataset_example/PARTITIONED/roms_frc_201201.8.nc'] <9 items>
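
The nine files listed above correspond to `np_xi` × `np_eta` = 3 × 3 tiles. To give a feel for what partitioning involves, the sketch below splits a hypothetical 10 × 7 index space into near-equal contiguous tiles using plain Python. The functions `split_range` and `tile_domain`, the grid size, and the tile ordering are all assumptions made for illustration; the real partitioning also has to deal with details such as boundary points that are not modelled here.

```python
def split_range(n: int, parts: int) -> list[tuple[int, int]]:
    """Split range(n) into `parts` contiguous (start, stop) chunks of near-equal size."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional point each
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def tile_domain(n_xi: int, n_eta: int, np_xi: int, np_eta: int):
    """Return one ((xi_start, xi_stop), (eta_start, eta_stop)) pair per tile."""
    tiles = []
    for eta_bounds in split_range(n_eta, np_eta):
        for xi_bounds in split_range(n_xi, np_xi):
            tiles.append((xi_bounds, eta_bounds))
    return tiles


# Hypothetical 10 x 7 domain split across 3 x 3 processors
tiles = tile_domain(n_xi=10, n_eta=7, np_xi=3, np_eta=3)

# One file per tile, following the numbering pattern seen in the output above
file_names = [f"roms_frc_201201.{i}.nc" for i in range(len(tiles))]

for name, (xi_bounds, eta_bounds) in zip(file_names, tiles):
    print(name, "xi:", xi_bounds, "eta:", eta_bounds)
```

Each tile covers a distinct block of the domain, and together the tiles cover every point exactly once, which is why the number of partitioned files equals `np_xi * np_eta`.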


## 5. Summary
In this guide, we have considered:
- The different subclasses of `ROMSInputDataset`
- How to instantiate these different subclasses when input datasets have different sources

And optionally, for users working outside of the context of a `ROMSSimulation`:
- How to use `get()` to create a working copy of (or link to) the dataset, whether its source is local, remote, or a `yaml` file
- How to partition the dataset such that it is ROMS-ready