# Tutorial #1: Generate an in situ dataset for `pkg/obsfit`

In this tutorial, we demonstrate how to construct a dataset of in situ observations suitable for use with `pkg/obsfit` in the lat-lon-cap (llc) grid configuration. The main hurdle is that the llc grid is curvilinear and is partitioned into chunks for parallelized simulation. A specific routine in MITgcm determines which observations belong to which chunk, and the data must be formatted with knowledge of that routine. This is where `obsprep` comes in handy!

### Load libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd

import obsprep as op
from obsprep.utils import generate_random_points

np.random.seed(3)

### Load grid dataset
In this example, we use a premade dataset containing grid information for the $1^\circ$ llc grid, `llc90`. To generate this information for your own model configuration, we recommend [xmitgcm](https://xmitgcm.readthedocs.io/en/latest/usage.html).

In [None]:
grid_ds = xr.open_dataset('../datasets/ECCO_llc90.nc')

### Generate random observation locations
At this step, the user would normally provide their own *ungridded* observational data. For simplicity, we generate random values of `lons`, `lats`, and `depths`.

In [None]:
nobs = 10
lons, lats, depths = generate_random_points(nobs)
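
For readers who want to see what such synthetic inputs might look like, here is a minimal sketch of a stand-in generator. The function `make_random_points`, its `max_depth` and `seed` parameters, and the value ranges are hypothetical and chosen for illustration only; they are not taken from `obsprep`, and `generate_random_points` may behave differently.

```python
import numpy as np

def make_random_points(nobs, max_depth=1000.0, seed=None):
    """Hypothetical stand-in: draw nobs random (lon, lat, depth) triples.

    Assumptions (not from obsprep): longitudes in [-180, 180),
    latitudes in [-90, 90), depths in [0, max_depth) in meters.
    Latitudes are drawn uniformly in degrees, so points are NOT
    uniformly distributed over the sphere's surface area.
    """
    rng = np.random.default_rng(seed)
    lons = rng.uniform(-180.0, 180.0, size=nobs)
    lats = rng.uniform(-90.0, 90.0, size=nobs)
    depths = rng.uniform(0.0, max_depth, size=nobs)
    return lons, lats, depths

example_lons, example_lats, example_depths = make_random_points(10, seed=3)
```

Real observations would simply replace these three one-dimensional arrays, one entry per observation.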

### Initialize the `obsprep` object

In [None]:
OP = op.Prep('obsfit')

### Get spatial interpolation information
Here, we endow our synthetic dataset with interpolation information so that the model knows where to position our in situ data. The user specifies how their grid is partitioned for parallel execution through the parameters `sNx` and `sNy` (see [here](https://mitgcm.readthedocs.io/en/latest/phys_pkgs/exch2.html#exch2-size-h-and-multiprocessing) for more details).

In [None]:
OP.get_obs_point(lons, lats, depths, grid_type='llc', grid_ds=grid_ds, num_interp_points=1, sNx=30, sNy=30)
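
To build intuition for `sNx` and `sNy`, the sketch below counts how many chunks a given tile size produces. It assumes the `llc90` grid can be viewed as 13 square faces of 90 x 90 points and that the tile size must divide the face size evenly; the helper `count_tiles` is hypothetical and is not part of `obsprep` or MITgcm. It also ignores any land-only tiles that a particular model setup may omit.

```python
def count_tiles(sNx, sNy, face_nx=90, face_ny=90, n_faces=13):
    """Hypothetical helper: number of sNx-by-sNy tiles covering the grid.

    Assumes the grid is made of n_faces faces of face_nx x face_ny points
    (13 faces of 90 x 90 for llc90) and that no tiles are dropped.
    """
    if face_nx % sNx != 0 or face_ny % sNy != 0:
        raise ValueError("sNx and sNy must evenly divide the face dimensions")
    tiles_per_face = (face_nx // sNx) * (face_ny // sNy)
    return n_faces * tiles_per_face

print(count_tiles(30, 30))  # 117
```

With `sNx=30` and `sNy=30`, as used in this tutorial, each 90 x 90 face splits into 9 tiles, for 117 in total under these assumptions.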

The observational dataset being generated is an `xarray.Dataset` object, stored as the `ds` attribute of our `Prep` object (`OP.ds`).

### Generate temporal metadata
Additionally, the `obsfit` package requires temporal fields so that it can organize observations according to when they occur in model simulation time. As with our observation locations, we generate random time fields, add them to the dataset under the name `obs_datetime`, and then obtain the temporal fields in the format the model expects, namely as the fields `obs_YYYYMMDD`, `obs_HHMMSS`, and `obs_date`.

In [None]:
def generate_datetime(nobs, start_date='1992-01-01', end_date='1993-01-01'):
    # draw nobs random daily timestamps (with replacement) between start_date and end_date
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    return np.random.choice(pd.date_range(start=start_date, end=end_date, freq='D'), size=nobs)

obs_datetime = generate_datetime(nobs)
OP.ds['obs_datetime'] = xr.DataArray(obs_datetime, dims=['iSAMPLE'])

OP.get_obs_datetime(time_var='obs_datetime')
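
As an illustration of the kind of encoding the field names `obs_YYYYMMDD` and `obs_HHMMSS` suggest, the sketch below packs a single timestamp into two integers. This is an assumption inferred from the field names, not a description of what `get_obs_datetime` does internally; the helper `split_datetime` is hypothetical, and the `obs_date` field is not reproduced here.

```python
from datetime import datetime

def split_datetime(dt):
    """Hypothetical helper: encode a datetime as (YYYYMMDD, HHMMSS) integers."""
    yyyymmdd = dt.year * 10000 + dt.month * 100 + dt.day
    hhmmss = dt.hour * 10000 + dt.minute * 100 + dt.second
    return yyyymmdd, hhmmss

example = datetime(1992, 3, 15, 6, 30, 0)
print(split_datetime(example))  # (19920315, 63000)
```

Note that leading zeros are lost in the integer form, so 06:30:00 becomes `63000`.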

### Set observational data and its `sample_type`
Lastly, we create synthetic data in the dataset's `obs_val` field and indicate to the model the type of observations using integers according to the table below.

| Field | Number | Name                    |
|-------|--------|-------------------------|
| T     | 1      | Temperature             |
| S     | 2      | Salinity                |
| U     | 3      | Zonal Velocity          |
| V     | 4      | Meridional Velocity     |
| SSH   | 5      | Sea Surface Height      |

Note that dummy data like this is for illustration only. In practice, the user formats their actual observational data appropriately and inserts it into the dataset at this step.
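
The table above can be expressed as a simple lookup, sketched below. The dictionary `SAMPLE_TYPE_CODES` and the helper `sample_type_code` are hypothetical names used for illustration; they mirror the table and are not the `obsprep` implementation of `get_sample_type`.

```python
# Field-to-integer mapping copied from the table above.
SAMPLE_TYPE_CODES = {"T": 1, "S": 2, "U": 3, "V": 4, "SSH": 5}

def sample_type_code(field):
    """Hypothetical helper: return the integer code for an observation field."""
    try:
        return SAMPLE_TYPE_CODES[field]
    except KeyError:
        valid = ", ".join(SAMPLE_TYPE_CODES)
        raise ValueError(f"Unknown field {field!r}; expected one of: {valid}")

print(sample_type_code("T"))  # 1
```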

In [None]:
theta_data = np.random.uniform(low=25, high=32, size=nobs)
OP.ds['obs_val'] = xr.DataArray(theta_data, dims=['iSAMPLE'])

# assign the integer value with knowledge that this is temperature data
# this function populates the sample_type field
OP.get_sample_type('T')

### Write out the dataset!
Our dataset now lives within the `OP` object, so we write it out to a NetCDF file.

In [None]:
# get the dataset back from the preprocessing constructor
ds = OP.ds.copy()
ds.to_netcdf('../datasets/obsfit_llc_example.nc')

# Extras
Here are a couple more ways to explore the data used by the `obsfit` package.

### 1. Plot
We can plot where our random observations landed, as well as the depths we assigned them.

In [None]:
import cartopy.crs as ccrs
plt.figure(figsize=(10, 4))
ax = plt.axes(projection=ccrs.PlateCarree())
sc = ax.scatter(ds['sample_lon'], ds['sample_lat'], c=ds['sample_depth'], cmap='viridis', s=100, edgecolor='k')
ax.set_global()
ax.coastlines()
cbar = plt.colorbar(sc, orientation='vertical')
cbar.set_label('Depth (m)')
plt.title('Synthetic obsfit data', fontsize=20)
plt.show()

### 2. Repeat tutorial beginning with an observational dataset
It may be more straightforward to first assemble a dataset with ungridded observation points and data, then feed it into the `Prep` object:

In [None]:
ds_in = xr.Dataset(
    {
        'sample_lon': (('iSAMPLE',), lons),
        'sample_lat': (('iSAMPLE',), lats),
        'sample_depth': (('iSAMPLE',), depths),
        'obs_datetime': (('iSAMPLE',), obs_datetime),
        'obs_val': (('iSAMPLE',), theta_data)
    },
)

OP = op.Prep('obsfit', ds_in)
OP.get_obs_point(grid_type='llc', grid_ds=grid_ds, num_interp_points=1, sNx=30, sNy=30)
OP.get_obs_datetime(time_var='obs_datetime')
OP.get_sample_type('T')

ds_out = OP.ds.copy()