
xarray wrapper for linear_regression #119

Closed
mathause opened this issue Dec 8, 2021 · 6 comments · Fixed by #123

mathause commented Dec 8, 2021

#116 introduces a numpy version of linear_regression - as discussed with @znicholls this morning/evening, we should add an xarray wrapper for it. I created a prototype, but there are two open decisions:

  • should the intercept be a separate DataArray?
  • should we use xr.apply_ufunc or not?

proof of concept implementation
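The snippets below import the numpy-level `linear_regression` from #116, which is not shown here. For readers without mesmer installed, a minimal stand-in is sketched below - this is a hypothetical implementation, assuming only the signature used in this thread: predictors of shape (n_samples, n_predictors), target of shape (n_samples, n_targets), optional sample weights, returning an array of shape (n_targets, n_predictors + 1) with the intercept in the first column.

```python
import numpy as np

def linear_regression(predictors, target, weights=None):
    # hypothetical stand-in for mesmer.core.linear_regression (see #116)
    predictors = np.asarray(predictors)  # (n_samples, n_predictors)
    target = np.asarray(target)          # (n_samples, n_targets)

    # prepend a column of ones for the intercept
    A = np.hstack([np.ones((predictors.shape[0], 1)), predictors])

    if weights is not None:
        # weighted least squares: scale rows by sqrt of the sample weights
        w = np.sqrt(np.asarray(weights, dtype=float))[:, np.newaxis]
        A = A * w
        target = target * w

    coefs, _, _, _ = np.linalg.lstsq(A, target, rcond=None)
    return coefs.T  # (n_targets, n_predictors + 1), intercept first
```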

create test data

import numpy as np
import xarray as xr

from mesmer.core import linear_regression

rng = np.random.default_rng(0)


n_timesteps = 30
n_lat, n_lon = 5, 4
n_cells = n_lat * n_lon

time = np.arange(n_timesteps)


data = 0.1 * time + rng.normal(scale=0.1, size=n_timesteps)
pred0 = xr.DataArray(
    data, dims="time", coords={"time": time}, name="tas_globmean"
)

data = rng.normal(scale=0.25, size=n_timesteps)
pred1 = xr.DataArray(
    data, dims="time", coords={"time": time}, name="hfds_globmean"
)


data = 0.25 * time + rng.normal(scale=0.1, size=(n_timesteps, n_cells)).T

LON, LAT = np.meshgrid(np.arange(n_lon), np.arange(n_lat))

coords = {"time": time, "lon": ("cells", LON.flatten()), "lat": ("cells", LAT.flatten())}
tgt = xr.DataArray(data, dims=("cells", "time"), coords=coords, name="tas")


pred_combined = xr.concat([pred0, pred1], dim="predictor")

weights = xr.ones_like(pred0.time)

Without xr.apply_ufunc:

def linear_regression_xr(predictors, target, weights=None, split=True, verbose=False):

    if "predictor" not in predictors.dims:
        predictors = predictors.expand_dims("predictor")

    p = predictors.transpose("time", "predictor")
    t = target.transpose("time", "cells")
    w = weights

    if verbose:
        print(f"{p.shape = }")
        print(f"{t.shape = }")
        if weights is not None:
            print(f"{w.shape = }")

    out = linear_regression(p, t, w)

    dims = ("cells", "predictor")
    # keep the target's cell coords (e.g. lat / lon), dropping the time dim
    coords = target.isel(time=0, drop=True).coords

    if split:
        out = xr.Dataset(
            {"intercept": ("cells", out[:, 0]), "coefficients": (dims, out[:, 1:])},
            coords=coords,
        )
        return out

    out = xr.Dataset({"coefficients": (dims, out)}, coords=coords)
    return out


print(linear_regression_xr(pred0, tgt))
print(linear_regression_xr(pred0, tgt, weights))

print(linear_regression_xr(pred0, tgt, split=False))
print(linear_regression_xr(pred0, tgt, weights, split=False))

print(linear_regression_xr(pred_combined, tgt))
print(linear_regression_xr(pred_combined, tgt, weights))

print(linear_regression_xr(pred_combined, tgt, split=False))
print(linear_regression_xr(pred_combined, tgt, weights, split=False))

With xr.apply_ufunc:

def _linear_regression_wrapper(
    predictors, target, weights=None, split=True, verbose=False
):

    target = target.T

    if verbose:
        print(f"{predictors.shape = }")
        print(f"{target.shape = }")

        if weights is not None:
            print(f"{weights.shape = }")

    out = linear_regression(predictors, target, weights=weights)

    # this is only for prototyping purposes
    if split:
        return out[:, 0], out[:, 1:]
    else:
        return out


def linear_regression_xr_ufunc(
    predictors, target, weights=None, split=True, verbose=False
):

    if "predictor" not in predictors.dims:
        predictors = predictors.expand_dims("predictor")

    args = (predictors, target)
    input_core_dims = (("time", "predictor"), ("time",))

    if weights is not None:
        args += (weights,)
        input_core_dims += (("time",),)

    output_core_dims = ((), ("predictor",)) if split else (("predictor",),)
    exclude_dims = set() if split else {"predictor"}

    out = xr.apply_ufunc(
        _linear_regression_wrapper,
        *args,
        exclude_dims=exclude_dims,
        output_core_dims=output_core_dims,
        input_core_dims=input_core_dims,
        kwargs={"split": split, "verbose": verbose},
    )

    if split:
        return xr.Dataset(dict(intercept=out[0], coefficients=out[1]))
    return out.to_dataset(name="coefficients")



print(linear_regression_xr_ufunc(pred0, tgt))
print(linear_regression_xr_ufunc(pred0, tgt, weights))

print(linear_regression_xr_ufunc(pred0, tgt, split=False))
print(linear_regression_xr_ufunc(pred0, tgt, weights, split=False))

print(linear_regression_xr_ufunc(pred_combined, tgt))
print(linear_regression_xr_ufunc(pred_combined, tgt, weights))

print(linear_regression_xr_ufunc(pred_combined, tgt, split=False))
print(linear_regression_xr_ufunc(pred_combined, tgt, weights, split=False))

observations

  • Not using xr.apply_ufunc requires less code - it needs to be tested how well this works with more real-world coords
  • returning the intercept separately seems a bit more intuitive - this will depend on how we write the estimate function.
  • the functions can be simplified once we decide whether the intercept is saved separately

edit: renamed the test variables

@znicholls
Collaborator

Nice. My thoughts (but happy to go the other way if you'd prefer):

  • Use xr.apply_ufunc. Yes, it's more code for this simple example, but it's also a good habit to build and the sooner we get used to it the better.
  • return the intercept separately. To be honest, we could even return all the coefficients separately because they'll all have different units...


znicholls commented Dec 9, 2021

Reading a bit more about ufunc, perhaps not using ufunc is the better choice. The docs (https://numpy.org/doc/stable/reference/ufuncs.html) say, "A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion". Our regression doesn't operate element by element, it operates 'all at once'. That suggests to me that ufunc might actually be inappropriate in this case (on the other hand maybe it doesn't matter that we're doing all at once, using xarray's ufunc just does the co-ordinate handling for us and we shouldn't get too worried about whether it's the exact perfect use case or not).
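For what it's worth, xr.apply_ufunc does not require an element-by-element function: input_core_dims moves the named dimensions to the end of each array before the function is called, so "whole-axis" functions work fine. A minimal sketch with a hypothetical demean function (names are for illustration only):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("cells", "time"))

def demean(arr):
    # operates on a whole axis at once, not element-by-element
    return arr - arr.mean(axis=-1, keepdims=True)

# apply_ufunc moves the core dim ("time") to the last axis before calling demean
out = xr.apply_ufunc(
    demean, da, input_core_dims=[["time"]], output_core_dims=[["time"]]
)
```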


mathause commented Dec 9, 2021

on the other hand maybe it doesn't matter that we're doing all at once, using xarray's ufunc just does the co-ordinate handling for us and we shouldn't get too worried about whether it's the exact perfect use case or not

Agree - we should not get hung up on the semantics. Having said that - it might be a generalized ufunc, which operates vector-wise. Or maybe not - because it works only along a very specific axis...
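For comparison, numpy's generalized ufuncs do exactly this: they consume whole core dimensions and broadcast over any leading "loop" dimensions. np.matmul is a built-in example (its documented gufunc signature is "(n?,k),(k,m?)->(n?,m?)"), which is close in spirit to a regression applied along one axis:

```python
import numpy as np

# np.matmul is a generalized ufunc: it consumes whole core dimensions
# and broadcasts over any leading "loop" dimensions
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 3, 4))  # 10 stacked (3, 4) matrices
b = rng.normal(size=(4, 2))

out = np.matmul(a, b)  # loops over the leading dimension of `a`
```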


Thanks for the feedback - next round:

  • I think not using apply_ufunc leads to clearer code here.
  • I agree to return every coefficient separately - this helps with understanding the return values & for error checking for the predict step.
  • I am not yet happy with the name.
  • Currently I am passing DataArray objects - not sure this is optimal (round-trip land-only unstructured grid #117 requires a Dataset).
  • I am passing predictors as dict to get the names.
from typing import Mapping, Optional

def linear_regression_xr(
    predictors: Mapping[str, xr.DataArray],
    target: xr.DataArray,
    dim: str,
    weights: Optional[xr.DataArray] = None,
):

    # TODO: check predictors are 1D & contain `dim`
    # TODO: check target is 2D and contains `dim`
    # TODO: check weights is 1D and contains `dim`

    predictors_concat = xr.concat(tuple(predictors.values()), dim="predictor")

    target_dim = list(set(target.dims) - {dim})[0]

    out = linear_regression(
        predictors_concat.transpose(dim, "predictor"),
        target.transpose(dim, target_dim),
        weights,
    )

    keys = ["intercept"] + list(predictors)
    dataarrays = {key: (target_dim, out[:, i]) for i, key in enumerate(keys)}
    out = xr.Dataset(dataarrays, coords=target.coords).drop_vars(dim)

    if weights is not None:
        out["weights"] = weights

    return out

Example:

linear_regression_xr(predictors={"tas": pred0}, target=tgt, dim="time")
linear_regression_xr(predictors={"tas": pred0, "hfds": pred1}, target=tgt, dim="time")

linear_regression_xr(
    predictors={"tas": pred0, "hfds": pred1},
    target=tgt,
    weights=weights,
    dim="time",
)

@znicholls
Collaborator

Agree - we should not get hung up on the semantics. Having said that - it might be a generalized ufunc, which operates vector-wise.

Nice, I think that is what it is (I'm not sure it matters that it's meant to be used on a specific axis, it could in theory be used on any axis).

Currently I am passing DataArray objects - not sure this is optimal

I agree. Passing DataSets is probably better, particularly for unit handling.

I am passing predictors as dict to get the names.

This would be solved if we used DataSets instead right?

@mathause
Member Author

I also think Dataset makes sense - however they can have one or several data_vars (DataArrays). We need to know which data_var to use - or enforce that only one is on the ds... Or how do you see this?

I am passing predictors as dict to get the names.

This would be solved if we used DataSets instead right?

Yes - would you pass all predictors in one Dataset or as several Datasets? I have a slight preference to pass every predictor individually.

predictors={"tas": pred0, "hfds": pred1}
predictors=xr.Dataset({"tas": pred0, "hfds": pred1})
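If all predictors come in one Dataset, the names come for free: Dataset.to_array stacks the data_vars along a new dimension whose coordinate holds the variable names - essentially the pred_combined / dict-of-names step in one call. A small sketch (the variable names are just for illustration):

```python
import numpy as np
import xarray as xr

time = np.arange(4)
ds = xr.Dataset(
    {
        "tas": ("time", np.linspace(0.0, 1.0, 4)),
        "hfds": ("time", np.zeros(4)),
    },
    coords={"time": time},
)

# stack all data_vars along a new "predictor" dimension;
# the "predictor" coordinate holds the variable names
stacked = ds.to_array(dim="predictor")
names = list(stacked["predictor"].values)
```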

particularly for unit handling

Shouldn't the unit attribute be on the DataArray? Or why is it better for unit handling?

@znicholls
Collaborator

I also think Dataset makes sense - however they can have one or several data_vars (DataArrays). We need to know which data_var to use - or enforce that only one is on the ds... Or how do you see this?

Yes in my head we would tell it which variables to use

Yes - would you pass all predictors in one Dataset or as several Datasets? I have a slight preference to pass every predictor individually.

One dataset was what I was thinking (just so they share co-ordinates...)

Shouldn't the unit attribute be on the DataArray? Or why is it better for unit handling?

Yes, ignore me

@mathause mathause added this to the v0.9.0 milestone Dec 13, 2022