
first attempt at adding basic bias correction code and fixing tests #2

Merged: 16 commits into ClimateImpactLab:main on Dec 24, 2020

Conversation

@dgergel (Member) commented Dec 14, 2020

This PR adds some super basic bias correction code, as well as a dependency on a forked version of scikit-downscale with an interim fix for a current bug in PointWiseDownscaler.
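For context, a rough sketch of the kind of core function this PR adds. The signature, argument names, and variable selection are illustrative only, pieced together from the traceback discussed below, not the exact dodola implementation:

import xarray as xr
from skdownscale.pointwise_models import PointWiseDownscaler, BcsdTemperature


def bias_correct_bcsd(gcm_training_ds, obs_training_ds, gcm_predict_ds, variable):
    """Fit a pointwise BCSD temperature model and bias correct future GCM data."""
    model = PointWiseDownscaler(BcsdTemperature())
    # PointWiseDownscaler expects DataArrays, so pull out the variable of interest.
    model.fit(gcm_training_ds[variable], obs_training_ds[variable])
    predicted = model.predict(gcm_predict_ds[variable])
    return predicted.to_dataset(name=variable)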

@brews added the "enhancement" label on Dec 14, 2020
@brews (Member) commented Dec 15, 2020

Interesting, you're getting this error because of https://github.com/dgergel/xsd/blob/2f76a3a02a5cccf962d3c801f1603c545ced967b/skdownscale/pointwise_models/core.py#L46 where it's trying to get the .data attribute from an xr.Dataset.

Stacktrace just before this error looks like:

...
  /Users/bmalevich/Projects/dodola/dodola/core.py(35)bias_correct_bcsd()
-> model.fit(gcm_training_ds, obs_training_ds)
  /Users/bmalevich/miniconda3/envs/dodola_dev/lib/python3.8/site-packages/skdownscale/pointwise_models/core.py(177)fit()
-> self._models = _fit_wrapper(X, *args, **kws)
  /Users/bmalevich/miniconda3/envs/dodola_dev/lib/python3.8/site-packages/skdownscale/pointwise_models/core.py(74)_fit_wrapper()
-> ydf = y[index].pipe(_da_to_df, feature_dim)
  /Users/bmalevich/miniconda3/envs/dodola_dev/lib/python3.8/site-packages/xarray/core/common.py(634)pipe()
-> return func(self, *args, **kwargs)
> /Users/bmalevich/miniconda3/envs/dodola_dev/lib/python3.8/site-packages/skdownscale/pointwise_models/core.py(47)_da_to_df()
-> data = da.transpose('time', ...).data
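The distinction behind that failure: xr.DataArray has a .data attribute but xr.Dataset does not, so passing a Dataset down into _da_to_df() blows up. A minimal standalone illustration (not from the PR itself):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims=["time"], name="fakevariable")
ds = da.to_dataset()

print(da.transpose("time", ...).data)  # fine: returns the underlying numpy array
# ds.transpose("time", ...).data       # AttributeError: 'Dataset' object has no attribute 'data'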

@brews (Member) commented Dec 15, 2020

Looks like inputs to the *.fit() and *.predict() methods for a PointWiseDownscaler need to be an xr.DataArray, or maybe an xr.Dataset if we're careful. I should have read the docs on PointWiseDownscaler more carefully.

It also looks like dimensions are transposed in the output...

So if we hackishly change tests.test_services._datafactory() to

def _datafactory(x, start_time="1950-01-01"):
    """Populate xr.Dataset with synthetic data for testing"""
    start_time = str(start_time)
    if x.ndim != 1:
        raise ValueError("'x' needs dim of one")

    out = xr.Dataset(
        {"fakevariable": (["time", "lon", "lat"], x[:, np.newaxis, np.newaxis])},
        coords={
            "time": xr.cftime_range(
                start=start_time, freq="D", periods=len(x), calendar="noleap"
            ),
            "lon": (["lon"], [1.0]),
            "lat": (["lat"], [1.0]),
        },
    )
    return out["fakevariable"]

The test completes but output values differ from the goal:

E       AssertionError: Left and right DataArray objects are not equal
E       
E       Differing values:
E       L
E           array([[[-726]],
E           
E                  [[-725]],
E           
E                  ...,
E           
E                  [[2549]],
E           
E                  [[2551]]])
E       R
E           array([[[   0]],
E           
E                  [[   1]],
E           
E                  ...,
E           
E                  [[1823]],
E           
E                  [[1824]]])

What say you, @dgergel? You're more familiar with the method - does this L array match your expectations given the input?

Edit: If this is correct, then we need to change the inputs to DataArrays and change the docstrings to reflect this.

@dgergel (Member Author) commented Dec 16, 2020

@brews inputs to the *.fit() and *.predict() methods for the PointWiseDownscaler do need to be DataArrays unless we set it up a bit more cleverly. I should have paid closer attention to that in the test code when I was looking at it earlier.

As for what's happening with the outputs, I'm going to do a bit more exploring/testing myself to figure that out - I don't immediately know what's going on. But I don't think those outputs are right. More soon.

@dgergel (Member Author) commented Dec 17, 2020

@brews I've been digging into this and I think something is going wrong with how the bias correction code is ingesting our test data; it seems to be due to a mismatch between what it expects and what we're giving it. For example, given that our training and testing data is 0 - 1825 +/- 10, our bias-corrected values should not be negative (or equal to -725). I tried a couple of things, for example changing the time index in case the time grouper was struggling with a cftime index, e.g.

def _datafactory(x, start_time="1950-01-01"):
    """Populate xr.Dataset with synthetic data for testing"""
    start_time = str(start_time)
    if x.ndim != 1:
        raise ValueError("'x' needs dim of one")

    time = pd.date_range(start=start_time, freq="D", periods=len(x))
    
    out = xr.Dataset(
        {"fakevariable": (["time", "lon", "lat"], x[:, np.newaxis, np.newaxis])},
        coords={"index": time, "time": time, "lon": (["lon"], [1.0]), "lat": (["lat"], [1.0])})
    
    return out["fakevariable"]

But that didn't do it. I tested the branch of scikit-downscale that we're using with the global bias correction workflow and looked at a timestep of the output before we tried it here, so I'm thinking there is a problem between what PointWiseDownscaler is expecting and the test data we're giving it, as I mentioned earlier. I think removing PointWiseDownscaler and just testing on a gridcell with a test DataFrame and the BcsdTemperature model is probably the best next step, unless you have other ideas?
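For concreteness, a single-gridcell test along those lines might look roughly like the sketch below. It assumes the forked skdownscale's BcsdTemperature follows the sklearn-style fit/predict pattern used above and accepts single-column DataFrames with a datetime index; the data values are placeholders:

import numpy as np
import pandas as pd
from skdownscale.pointwise_models import BcsdTemperature

time = pd.date_range("1950-01-01", periods=1825, freq="D")
obs = pd.DataFrame({"fakevariable": np.linspace(0, 10, 1825)}, index=time)
gcm_train = pd.DataFrame({"fakevariable": np.linspace(0, 10, 1825) + 2.0}, index=time)
gcm_predict = pd.DataFrame({"fakevariable": np.linspace(0, 10, 1825) + 2.0}, index=time)

model = BcsdTemperature()
model.fit(gcm_train, obs)  # same argument order as the model.fit() call in the traceback above
corrected = model.predict(gcm_predict)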

@brews (Member) commented Dec 17, 2020

@dgergel So if I follow you right, you're saying this is an upstream package problem with PointWiseDownscaler?

In that case, I think that trying to get this to work without PointWiseDownscaler might be the way to go. We can continue to prototype this and come back to this later once things are fixed upstream. That's why we test things like this 👍

Given that our fake data for this test case is only a single grid point anyways... can we make the test work with an xr.DataArray?

Regardless, I say go ahead and drop PointWiseDownscaler and do what you need to do to get the test to run. I suggest any input and output data for your public core.py functions be an xr data structure. If the BcsdTemperature model needs a pd.DataFrame, then I suggest we make the conversion in your core.py functions.
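One possible shape for that conversion, as a hypothetical helper inside core.py (the name and exact handling are illustrative, not dodola's actual API):

import xarray as xr


def _da_to_frame(da: xr.DataArray):
    """Squeeze a single-grid-cell DataArray down to a one-column pandas DataFrame."""
    series = da.squeeze(drop=True).to_series()
    return series.to_frame(name=da.name or "variable")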

(And stop me anytime if you want to walk through anything or you have any questions)

@brews (Member) commented Dec 17, 2020

I tried something different for this failing test. I fed it np.linspace(0, 10, n) (giving a range of 0 to 10) instead of np.arange(0, n) (a range of 0 to 1825) to give it a "realistic" temperature range, and got:

E       Differing dimensions:
E           (time: 1825, lon: 1, lat: 1) != (lon: 1, lat: 1, time: 1825)
E       Differing values:
E       L
E           array([[[-3.985746]],
E           
E                  [[-3.977522]],
E           
E                  ...,
E           
E                  [[13.977522]],
E           
E                  [[13.985746]]])
E       R
E           array([[[0.000000e+00, 5.482456e-03, ..., 9.994518e+00, 1.000000e+01]]])
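If the transposed dimensions end up being the only structural difference, reordering them before comparison is a one-liner; a small self-contained illustration (not from the PR itself):

import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((1, 1, 5)), dims=["lon", "lat", "time"])
print(arr.transpose("time", "lon", "lat").dims)  # ('time', 'lon', 'lat')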

@brews (Member) commented Dec 19, 2020

From our meeting today it sounds like this is actually working as expected, and the unexpected numbers stem from quantile mapping in addition to the bias correction.

Potential solutions are:

  • Adjust the data we input to the service to a case where we can directly calculate the expected output DataArray and compare the expected and actual output DataArrays in their entirety. Only do this if we can easily create the input data and easily calculate the expected output.
  • Continue with the test input data we have and then check the output DataArray dimensions/shape plus the head/tail 3-5 values via a slice or .sel() (sketched below).

Choose whichever is less work at this point.
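A rough sketch of what that second option could look like in the test. The helper name and tolerances here are hypothetical, not existing dodola test code:

import numpy as np
import xarray as xr


def check_bias_corrected(result: xr.DataArray, expected_head, expected_tail):
    """Spot-check output dimensions plus the first and last few values."""
    assert result.dims == ("time", "lon", "lat")
    head = result.isel(time=slice(0, 3)).values.ravel()
    tail = result.isel(time=slice(-3, None)).values.ravel()
    np.testing.assert_allclose(head, expected_head, rtol=1e-5)
    np.testing.assert_allclose(tail, expected_tail, rtol=1e-5)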

In addition, we need to ensure that the input data used to fit the model is passed in as a DataArray, not a Dataset as in the original setup.

Brainstorming for later work: at some point in the future we're going to need to allow users to choose which variable name to extract from input data. The alternative is to allow only a single variable in each input dataset, but this feels like it might be clunky and too implicit. Do we allow users to have a different variable name for each input file, or does the variable name need to be the same for all files? Just things to chew on for later.

@brews self-requested a review on December 24, 2020 00:38
@brews (Member) left a comment

Nice work @dgergel ! Thanks again for putting this together.

The automated tests passed and that's your goal!

I have a few suggestions but nothing essential to fix before this merges. I don't think any of this is going to immediately break anything -- and we're still prototyping everything -- so it's okay if we get some of this in a later PR. Your call.

Go ahead and merge when ready!

Edit: Also let me know if anything is unclear or if you want to discuss things further.

dodola/core.py Outdated
observation data for building quantile map
gcm_predict_ds : Dataset
future model data to be bias corrected
predicted : Dataset
@brews (Member):

I think this "predicted" arg was renamed or removed. If so, I'd remove the two docstring lines for predicted.

@dgergel (Member Author):

good catch - it should be ds_predicted instead of predicted, thanks!

dodola/core.py Outdated
# import numpy as np
# import xarray as xr
# from skdownscale.pointwise_models import PointWiseDownscaler, BcsdTemperature
import numpy as np
@brews (Member):

I think we can remove the numpy and xarray imports - they're not used.

dodola/core.py Outdated


def morenerdymathstuff(*args):
    raise NotImplementedError
    # TO-DO: implement additional bias correction functionality
    return None
@brews (Member):

I think you're set to just remove morenerdymathstuff() altogether. 👍

@dgergel (Member Author):

agreed - done 👍

Other review comments on setup.py and dodola/tests/test_services.py were resolved.
@dgergel (Member Author) commented Dec 24, 2020

@brews all of your comments should be addressed, so this is ready to merge (I'd do it myself, but it looks like I don't have merge privileges - can you update that, btw?)

@brews merged commit 103fe43 into ClimateImpactLab:main on Dec 24, 2020