# missing data interpolation

statistics is the answer to everything

Use this notebook to gap-fill a saved NetCDF file.

### potential shenanigans

"Several techniques have been used to fill the gaps in either the UWLS or OI derived total vector maps.

These are implemented using covariance derived from normal mode analysis (Lipphardt et al. 2000), open-boundary modal analysis (OMA) (Kaplan and Lekien 2007), and empirical orthogonal function (EOF) analysis (Beckers and Rixen 2003; Alvera-Azcárate et al. 2005); and using idealized or smoothed observed covariance (Davis 1985)."

- normal mode analysis
- open-boundary modal analysis (OMA)
- empirical orthogonal function analysis (EOF)
- use idealized/smoothed observed covariance

---

### other ideas

DINEOF (could only find an implementation in R)

to be honest I don't understand any of these methods but they look cool

### currently implemented:

copy values straight from the lower resolution data into areas where the high resolution data is considered missing
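
A minimal sketch of that idea, using plain NumPy arrays rather than the `InterpolationStep` class (whose internals are not shown in this notebook). The function names and the `(lat, lon, values)` tuple layout are assumptions made for illustration. Each reference is bilinearly interpolated onto the target grid and only fills cells that are still NaN, so earlier (higher resolution) references take priority.

```python
import numpy as np


def bilinear_to_grid(ref_lat, ref_lon, ref_values, lat, lon):
    """Bilinearly interpolate ref_values (shape lat x lon) onto the target lat/lon grid.

    Bilinear interpolation is separable, so it is done as two passes of 1D
    linear interpolation. Note that np.interp clamps to the edge values outside
    the reference coordinates instead of raising, so the reference domain
    should cover the target domain.
    """
    # pass 1: interpolate every reference row along lon
    along_lon = np.array([np.interp(lon, ref_lon, row) for row in ref_values])
    # pass 2: interpolate every resulting column along lat
    return np.array([np.interp(lat, ref_lat, col) for col in along_lon.T]).T


def fill_from_references(target, lat, lon, references):
    """Fill NaN cells of target from a list of (ref_lat, ref_lon, ref_values).

    references is assumed to be ordered from highest to lowest resolution.
    Observed (non-NaN) target values are never overwritten.
    """
    filled = np.array(target, dtype=float)
    for ref_lat, ref_lon, ref_values in references:
        gaps = np.isnan(filled)
        if not gaps.any():
            break  # nothing left to fill
        candidate = bilinear_to_grid(ref_lat, ref_lon, ref_values, lat, lon)
        # NaNs in the reference stay NaN here and are left for the next reference
        filled[gaps] = candidate[gaps]
    return filled
```

Cells that are NaN in every reference remain NaN after this step.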

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import numpy as np

import utils
from parcels_utils import xr_dataset_to_fieldset, HFRGrid
from constants import *
from gapfilling import InterpolationStep, SmoothnStep, Gapfiller

### target and interp_references

#### Change these variables

`target` is the data you are interpolating.

`interp_references` is a list of reference datasets to interpolate from. A few requirements:
- order them from most accurate to least accurate (highest to lowest resolution)
- the time domain must be identical to or larger than the target's
- the lat and lon domain should be larger than the target's to prevent any out-of-bounds complications

`mask_nc` must have exactly the same lat and lon dimensions as the target
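
The requirements above could be checked up front with something like the sketch below. It works on bare coordinate arrays, since the attributes exposed by `HFRGrid` are not shown in this notebook. `covers` and `same_grid` are hypothetical helper names, and `same_grid` reads "same dimensions" as identical coordinate arrays, which is stricter than equal lengths.

```python
import numpy as np


def covers(ref_coord, target_coord):
    """True if the reference coordinate range contains the whole target range.

    Works for lat, lon, and time (numeric or datetime64) coordinate arrays.
    """
    ref_coord = np.asarray(ref_coord)
    target_coord = np.asarray(target_coord)
    return bool(
        ref_coord.min() <= target_coord.min()
        and ref_coord.max() >= target_coord.max()
    )


def same_grid(mask_lat, mask_lon, target_lat, target_lon):
    """True if the mask and the target share identical lat and lon coordinates."""
    return bool(
        np.array_equal(mask_lat, target_lat)
        and np.array_equal(mask_lon, target_lon)
    )
```

Running `covers` on the time, lat, and lon coordinates of every reference before gap-filling would catch a reference that is too small before it produces out-of-bounds values.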

In [None]:
files_root = Path("current_netcdfs")

target = HFRGrid(files_root / "tj_plume_2020-02_ThreddsCode.USWC_1KM_HOURLY.nc")

interp_references = [
    HFRGrid(files_root / "tj_plume_2020-02_ThreddsCode.USWC_2KM_HOURLY.nc"),
    HFRGrid(files_root / "tj_plume_2020-02_ThreddsCode.USWC_6KM_HOURLY.nc"),
]

# mask_nc can be None if you do not need to filter out land currents;
# in that case, comment out the second assignment below
mask_nc = None
mask_nc = HFRGrid(files_root / "tj_sample_ThreddsCode.USWC_1KM_HOURLY.nc", init_fs=False)

In [None]:
gapfiller = Gapfiller()
gapfiller.add_steps(InterpolationStep(interp_references), SmoothnStep(mask=mask_nc))

### interpolation type

more information can be found in the `tutorial_interpolation` notebook

EDIT: just use `linear`

## nan values and parcels

note that when this xarray Dataset is passed into parcels, all the NaN values change to 0, so the mask generation (which relies on NaN marking missing data) won't work anymore

so the Dataset is copied first, and the copy is used for the FieldSet instead
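
A small NumPy stand-in for the problem, assuming the zero-fill behaves like `np.nan_to_num` (parcels itself is not called here, and the array values are made up). It shows why the gaps have to be recorded from an untouched copy: after zero-filling, a former NaN cannot be told apart from a genuine 0.0 current.

```python
import numpy as np

# toy u-velocity grid: one missing cell and one real zero
u = np.array([[0.1, np.nan],
              [0.0, -0.2]])

# record the gaps from the original data first
missing = np.isnan(u)

# zero-filled copy, standing in for what ends up inside the FieldSet;
# np.nan_to_num returns a new array, so u keeps its NaNs
u_zero_filled = np.nan_to_num(u, nan=0.0)

# a mask built from the zero-filled copy finds no gaps at all
missing_after = np.isnan(u_zero_filled)
```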

### use of Parcels Field for interpolation

indexing Field values goes [time, depth, lat, lon]

Field does linear interpolation automatically when indexing values between its coordinate values

### linear interpolation using lower resolution data

## even more filling with PLS and smoothing with DCT shenanigans

uses the MATLAB Engine for Python and `smoothn.m` (PLS = penalized least squares, DCT = discrete cosine transform)

https://www.mathworks.com/help/matlab/matlab-engine-for-python.html

https://www.mathworks.com/matlabcentral/fileexchange/25634-smoothn
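
For intuition only, here is a 1D NumPy sketch of the DCT-based penalized least squares idea behind smoothn. It is not a port of `smoothn.m`: the smoothing parameter `s` is fixed by hand instead of being chosen automatically, there is no robust reweighting, and the iteration count is arbitrary. The function names are made up for this example. Missing values get weight 0, and the update `z = IDCT(gamma * DCT(w * (y - z) + z))` is repeated so that the gaps are filled by the smooth estimate.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT, C.T @ X is the inverse."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    c[0, :] = np.sqrt(1.0 / n)
    return c


def smooth_fill_1d(y, s=1.0, n_iter=200):
    """Smooth a 1D series and fill its NaN gaps with a DCT-based PLS iteration.

    Assumes at least one finite value in y. Larger s means stronger smoothing.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    observed = np.isfinite(y)
    w = observed.astype(float)           # weight 0 on missing samples
    y0 = np.where(observed, y, 0.0)      # placeholder value where w == 0
    c = dct_matrix(n)
    # eigenvalues of the second-difference operator with reflective boundaries
    lam = -2.0 + 2.0 * np.cos(np.pi * np.arange(n) / n)
    gamma = 1.0 / (1.0 + s * lam ** 2)   # low-pass filter in DCT space
    # start from the observed values, with gaps set to the observed mean
    z = np.where(observed, y, y[observed].mean())
    for _ in range(n_iter):
        z = c.T @ (gamma * (c @ (w * (y0 - z) + z)))
    return z
```

With `s=0` the filter is the identity and observed samples are returned unchanged. With `s > 0` the observed samples are smoothed as well, which matches the notebook's description of this step as both filling and smoothing.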

### formatting and saving

In [None]:
target_interped_xrds = gapfiller.execute(target)

In [None]:
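# build "<original name>_interped.nc" next to the source file
target_path = Path(target.path)
save_path = target_path.with_name(target_path.stem + "_interped.nc")
target_interped_xrds.to_netcdf(save_path)
print(f"saved to {save_path}")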

### display field to see if interpolation worked

In [None]:
fs_interp = xr_dataset_to_fieldset(target_interped_xrds)
target.fieldset.U.show()  # uninterpolated
fs_interp.U.show()  # interpolated, gapfilled, smoothed
fs_interp.V.show()  # interpolated, gapfilled, smoothed