
xarray wrapper for linear_regression #119

Closed
mathause opened this issue Dec 8, 2021 · 6 comments · Fixed by #123

mathause commented Dec 8, 2021

#116 introduces a numpy version of linear_regression - as discussed with @znicholls this morning/evening, we should add an xarray wrapper for it. I created a prototype, but there are two open decisions:

  • should the intercept be a separate DataArray?
  • should we use xr.apply_ufunc or not?

proof of concept implementation
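The snippets below import the numpy-level `linear_regression` from #116, which is not shown here. For readers without mesmer installed, a minimal stand-in is sketched below - this is a hypothetical implementation, assuming only the signature used in this thread: predictors of shape (n_samples, n_predictors), target of shape (n_samples, n_targets), optional sample weights, returning an array of shape (n_targets, n_predictors + 1) with the intercept in the first column.

```python
import numpy as np

def linear_regression(predictors, target, weights=None):
    # hypothetical stand-in for mesmer.core.linear_regression (see #116)
    predictors = np.asarray(predictors)  # (n_samples, n_predictors)
    target = np.asarray(target)          # (n_samples, n_targets)

    # prepend a column of ones for the intercept
    A = np.hstack([np.ones((predictors.shape[0], 1)), predictors])

    if weights is not None:
        # weighted least squares: scale rows by sqrt of the sample weights
        w = np.sqrt(np.asarray(weights, dtype=float))[:, np.newaxis]
        A = A * w
        target = target * w

    coefs, _, _, _ = np.linalg.lstsq(A, target, rcond=None)
    return coefs.T  # (n_targets, n_predictors + 1), intercept first
```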

create test data

import numpy as np
import xarray as xr

from mesmer.core import linear_regression

rng = np.random.default_rng(0)


n_timesteps = 30
n_lat, n_lon = 5, 4
n_cells = n_lat * n_lon

time = np.arange(n_timesteps)


data = 0.1 * time + rng.normal(scale=0.1, size=n_timesteps)
pred0 = xr.DataArray(
    data, dims="time", coords={"time": time}, name="tas_globmean"
)

data = rng.normal(scale=0.25, size=n_timesteps)
pred1 = xr.DataArray(
    data, dims="time", coords={"time": time}, name="hfds_globmean"
)


data = 0.25 * time + rng.normal(scale=0.1, size=(n_timesteps, n_cells)).T

LON, LAT = np.meshgrid(np.arange(n_lon), np.arange(n_lat))

coords = {"time": time, "lon": ("cells", LON.flatten()), "lat": ("cells", LAT.flatten())}
tgt = xr.DataArray(data, dims=("cells", "time"), coords=coords, name="tas")


pred_combined = xr.concat([pred0, pred1], dim="predictor")

weights = xr.ones_like(pred0.time)

Without xr.apply_ufunc:

def linear_regression_xr(predictors, target, weights=None, split=True, verbose=False):

    if "predictor" not in predictors.dims:
        predictors = predictors.expand_dims("predictor")

    p = predictors.transpose("time", "predictor")
    t = target.transpose("time", "cells")
    w = weights

    if verbose:
        print(f"{p.shape = }")
        print(f"{t.shape = }")
        if weights is not None:
            print(f"{w.shape = }")

    out = linear_regression(p, t, w)

    dims = ("cells", "predictor")
    # keep the target's cell coords (e.g. lat / lon), dropping the time dim
    coords = target.isel(time=0, drop=True).coords

    if split:
        out = xr.Dataset(
            {"intercept": ("cells", out[:, 0]), "coefficients": (dims, out[:, 1:])},
            coords=coords,
        )
        return out

    out = xr.Dataset({"coefficients": (dims, out)}, coords=coords)
    return out


print(linear_regression_xr(pred0, tgt))
print(linear_regression_xr(pred0, tgt, weights))

print(linear_regression_xr(pred0, tgt, split=False))
print(linear_regression_xr(pred0, tgt, weights, split=False))

print(linear_regression_xr(pred_combined, tgt))
print(linear_regression_xr(pred_combined, tgt, weights))

print(linear_regression_xr(pred_combined, tgt, split=False))
print(linear_regression_xr(pred_combined, tgt, weights, split=False))

With xr.apply_ufunc:

def _linear_regression_wrapper(
    predictors, target, weights=None, split=True, verbose=False
):

    target = target.T

    if verbose:
        print(f"{predictors.shape = }")
        print(f"{target.shape = }")

        if weights is not None:
            print(f"{weights.shape = }")

    out = linear_regression(predictors, target, weights=weights)

    # this is only for prototyping purposes
    if split:
        return out[:, 0], out[:, 1:]
    else:
        return out


def linear_regression_xr_ufunc(
    predictors, target, weights=None, split=True, verbose=False
):

    if "predictor" not in predictors.dims:
        predictors = predictors.expand_dims("predictor")

    args = (predictors, target)
    input_core_dims = (("time", "predictor"), ("time",))

    if weights is not None:
        args += (weights,)
        input_core_dims += (("time",),)

    output_core_dims = ((), ("predictor",)) if split else (("predictor",),)
    exclude_dims = set() if split else {"predictor"}

    out = xr.apply_ufunc(
        _linear_regression_wrapper,
        *args,
        exclude_dims=exclude_dims,
        output_core_dims=output_core_dims,
        input_core_dims=input_core_dims,
        kwargs={"split": split, "verbose": verbose},
    )

    if split:
        return xr.Dataset(dict(intercept=out[0], coefficients=out[1]))
    return out.to_dataset(name="coefficients")



print(linear_regression_xr_ufunc(pred0, tgt))
print(linear_regression_xr_ufunc(pred0, tgt, weights))

print(linear_regression_xr_ufunc(pred0, tgt, split=False))
print(linear_regression_xr_ufunc(pred0, tgt, weights, split=False))

print(linear_regression_xr_ufunc(pred_combined, tgt))
print(linear_regression_xr_ufunc(pred_combined, tgt, weights))

print(linear_regression_xr_ufunc(pred_combined, tgt, split=False))
print(linear_regression_xr_ufunc(pred_combined, tgt, weights, split=False))

observations

  • Not using xr.apply_ufunc requires less code - it needs to be tested how well this works with more real-world coords
  • returning the intercept separately seems a bit more intuitive - this will depend on how we write the estimate function.
  • the functions can be simplified once we decide whether the intercept is saved separately

edit: renamed the test variables

@znicholls
Collaborator

Nice. My thoughts (but happy to go the other way if you'd prefer):

  • Use xr.apply_ufunc. Yes, it's more code for this simple example, but it's also a good habit to build and the sooner we get used to it the better.
  • return the intercept separately. To be honest, we could even return all the coefficients separately because they'll all have different units...


znicholls commented Dec 9, 2021

Reading a bit more about ufunc, perhaps not using ufunc is the better choice. The docs (https://numpy.org/doc/stable/reference/ufuncs.html) say, "A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion". Our regression doesn't operate element by element, it operates 'all at once'. That suggests to me that ufunc might actually be inappropriate in this case (on the other hand maybe it doesn't matter that we're doing all at once, using xarray's ufunc just does the co-ordinate handling for us and we shouldn't get too worried about whether it's the exact perfect use case or not).
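For what it's worth, xr.apply_ufunc does not require an element-by-element function: input_core_dims moves the named dimensions to the end of each array before the function is called, so "whole-axis" functions work fine. A minimal sketch with a hypothetical demean function (names are for illustration only):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("cells", "time"))

def demean(arr):
    # operates on a whole axis at once, not element-by-element
    return arr - arr.mean(axis=-1, keepdims=True)

# apply_ufunc moves the core dim ("time") to the last axis before calling demean
out = xr.apply_ufunc(
    demean, da, input_core_dims=[["time"]], output_core_dims=[["time"]]
)
```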


mathause commented Dec 9, 2021

on the other hand maybe it doesn't matter that we're doing all at once, using xarray's ufunc just does the co-ordinate handling for us and we shouldn't get too worried about whether it's the exact perfect use case or not

Agree - we should not get hung up on the semantics. Having said that - it might be a generalized ufunc, which operates vector-wise. Or maybe not - because it works only along a very specific axis...
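For comparison, numpy's generalized ufuncs do exactly this: they consume whole core dimensions and broadcast over any leading "loop" dimensions. np.matmul is a built-in example (its documented gufunc signature is "(n?,k),(k,m?)->(n?,m?)"), which is close in spirit to a regression applied along one axis:

```python
import numpy as np

# np.matmul is a generalized ufunc: it consumes whole core dimensions
# and broadcasts over any leading "loop" dimensions
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 3, 4))  # 10 stacked (3, 4) matrices
b = rng.normal(size=(4, 2))

out = np.matmul(a, b)  # loops over the leading dimension of `a`
```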


Thanks for the feedback - next round:

  • I think not using apply_ufunc leads to clearer code here.
  • I agree to return every coefficient separately - this helps with understanding the return values & for error checking for the predict step.
  • I am not yet happy with the name.
  • Currently I am passing DataArray objects - not sure this is optimal (round-trip land-only unstructured grid #117 requires a Dataset).
  • I am passing predictors as dict to get the names.
from typing import Mapping, Optional

def linear_regression_xr(
    predictors: Mapping[str, xr.DataArray],
    target: xr.DataArray,
    dim: str,
    weights: Optional[xr.DataArray] = None,
):

    # TODO: check predictors are 1D & contain `dim`
    # TODO: check target is 2D and contains `dim`
    # TODO: check weights is 1D and contains `dim`

    predictors_concat = xr.concat(tuple(predictors.values()), dim="predictor")

    target_dim = list(set(target.dims) - {dim})[0]

    out = linear_regression(
        predictors_concat.transpose(dim, "predictor"),
        target.transpose(dim, target_dim),
        weights,
    )

    keys = ["intercept"] + list(predictors)
    dataarrays = {key: (target_dim, out[:, i]) for i, key in enumerate(keys)}
    out = xr.Dataset(dataarrays, coords=target.coords).drop_vars(dim)

    if weights is not None:
        out["weights"] = weights

    return out

Example:

linear_regression_xr(predictors={"tas": pred0}, target=tgt, dim="time")
linear_regression_xr(predictors={"tas": pred0, "hfds": pred1}, target=tgt, dim="time")

linear_regression_xr(
    predictors={"tas": pred0, "hfds": pred1},
    target=tgt,
    weights=weights,
    dim="time",
)

@znicholls
Collaborator

Agree - we should not get hung up on the semantics. Having said that - it might be a generalized ufunc, which operates vector-wise.

Nice, I think that is what it is (I'm not sure it matters that it's meant to be used on a specific axis, it could in theory be used on any axis).

Currently I am passing DataArray objects - not sure this is optimal

I agree. Passing DataSets is probably better, particularly for unit handling.

I am passing predictors as dict to get the names.

This would be solved if we used DataSets instead right?

@mathause
Member Author

I also think Dataset makes sense - however they can have one or several data_vars (DataArrays). We need to know which data_var to use - or enforce that only one is on the ds... Or how do you see this?

I am passing predictors as dict to get the names.

This would be solved if we used DataSets instead right?

Yes - would you pass all predictors in one Dataset or as several Datasets? I have a slight preference to pass every predictor individually.

predictors={"tas": pred0, "hfds": pred1}
predictors=xr.Dataset({"tas": pred0, "hfds": pred1})
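If all predictors come in one Dataset, the names come for free: Dataset.to_array stacks the data_vars along a new dimension whose coordinate holds the variable names - essentially the pred_combined / dict-of-names step in one call. A small sketch (the variable names are just for illustration):

```python
import numpy as np
import xarray as xr

time = np.arange(4)
ds = xr.Dataset(
    {
        "tas": ("time", np.linspace(0.0, 1.0, 4)),
        "hfds": ("time", np.zeros(4)),
    },
    coords={"time": time},
)

# stack all data_vars along a new "predictor" dimension;
# the "predictor" coordinate holds the variable names
stacked = ds.to_array(dim="predictor")
names = list(stacked["predictor"].values)
```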

particularly for unit handling

Shouldn't the unit attribute be on the DataArray? Or why is it better for unit handling?

@znicholls
Collaborator

I also think Dataset makes sense - however they can have one or several data_vars (DataArrays). We need to know which data_var to use - or enforce that only one is on the ds... Or how do you see this?

Yes in my head we would tell it which variables to use

Yes - would you pass all predictors in one Dataset or as several Datasets? I have a slight preference to pass every predictor individually.

One dataset was what I was thinking (just so they share co-ordinates...)

Shouldn't the unit attribute be on the DataArray? Or why is it better for unit handling?

Yes, ignore me

@mathause mathause added this to the v0.9.0 milestone Dec 13, 2022