
Add prototype of class structure #109

Draft · znicholls wants to merge 28 commits into main

Conversation

znicholls
Collaborator

  • Closes #xxx
  • Tests added
  • Passes isort . && black . && flake8
  • Fully documented, including CHANGELOG.rst

@znicholls
Collaborator Author

@mathause this is sort of what I was thinking (I didn't get up to actually writing tests, so that will have to wait for another day). There's a lot of boilerplate code around the actual calibration. If we can move to some sane classes, it might become much clearer what is actually doing calibration and what is just being copied around because there isn't enough utility code available (e.g. this loop

for run in np.arange(nr_runs):

and this loop

for run in np.arange(nr_runs):

are the same idea, but that's really hard to see at the moment). That will hopefully make it much easier to see how to scale things, how to add new things and where utility code is required instead.
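(Not code from this PR: a hypothetical sketch of the kind of utility code meant here, with invented names. Both duplicated loops could be served by a single shared helper:)

import numpy as np


def iterate_runs(values, nr_runs):
    # shared helper: both legacy loops iterate over runs in the same way,
    # so the iteration itself can live in one place
    for run in np.arange(nr_runs):
        yield run, values[run]


# usage: for run, run_vals in iterate_runs(data, nr_runs): ...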

@codecov-commenter

codecov-commenter commented Oct 21, 2021

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (89a9c20) 87.88% compared to head (8f19cd5) 88.80%.

Files                                     Patch %   Lines
mesmer/prototype/calibrate.py             89.36%    5 Missing ⚠️
mesmer/prototype/calibrate_multiple.py    95.57%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   87.88%   88.80%   +0.91%     
==========================================
  Files          40       42       +2     
  Lines        1742     1902     +160     
==========================================
+ Hits         1531     1689     +158     
- Misses        211      213       +2     
Flag        Coverage Δ
unittests   88.80% <93.75%> (+0.91%) ⬆️

Flags with carried forward coverage won't be shown.


@znicholls
Collaborator Author

znicholls commented Oct 24, 2021

@mathause I implemented an actual test that passes. Have a look and see if it makes any sense to you. The idea is that almost everything in the legacy implementation is I/O and reshaping. If we can get our classes set up properly, we can hopefully make it easy to see how the regression is actually done using the _regress_single_point (or whatever name we end up with) method. I think some of the wrapping could be done better, but the test at least gives us a starting point to see what we're replacing and to make sure the answer stays the same.

@mathause
Member

Thanks a lot for getting this started & implementing an example - that helps me understand your idea! Some preliminary comments after a quick look.

  • Your MesmerCalibrateTargetPredictor defines the interface (calibrate), but all its other methods are @staticmethod - so there is not much advantage to having a class. The call could just as well be linear_regression.calibrate(...).
  • What we should think about is how the signatures of our functions should look. Once we know this, the rest is easy ;-) I'll try to do this later.
  • E.g. which of these should it be (see the skeletons after this list)?
LinearRegression().calibrate(esm_tas, predictors={...})
LinearRegression(esm_tas).calibrate(predictors={...})
LinearRegression(predictors={...}).calibrate(esm_tas)
  • If we define LinearRegression().calibrate(), should we also define LinearRegression().emulate()? I see why we would want to separate them. Yet keeping them together would force us to implement both.

  • I had not thought of using an outer join of hist and proj as a single DataArray. It looks elegant but also a bit wasteful (in my analysis pipeline I just concatenate them - so I'm not really allowed to say anything ;-))
  • How we do this will very much depend on internal data structure #106

  • Do I understand your idea correctly: you flatten your input arrays and then loop over each gridpoint?
  • Could we use xr.apply_ufunc instead?
    • I think this could simplify some of the logic - internally and for users: if there is a vectorized function in xarray, let's use it; if not, users write a function that works on one gridpoint and xr.apply_ufunc takes care of the rest. Admittedly, xr.apply_ufunc is not trivial either.
    • Or is there something I am missing?
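To make the three candidate signatures concrete, here are hypothetical skeletons (invented for illustration, not code from this PR):

import xarray as xr


class LinearRegressionStateless:
    # variant 1: everything is passed at call time
    def calibrate(self, target: xr.DataArray, predictors: dict):
        ...


class LinearRegressionBoundTarget:
    # variant 2: the target is bound at construction
    def __init__(self, target: xr.DataArray):
        self._target = target

    def calibrate(self, predictors: dict):
        ...


class LinearRegressionBoundPredictors:
    # variant 3: the predictors are bound at construction
    def __init__(self, predictors: dict):
        self._predictors = predictors

    def calibrate(self, target: xr.DataArray):
        ...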

@mathause
Member

I played with the code for a while and now understand it a bit better (& can answer many of my own questions ;-)). I see now that you construct a stacked_coord over scenario x time and only then loop over the points.

Here is what the flattened arrays look like:

<xarray.DataArray (gridpoint: 2, stacked_coord: 7)>
array([...])
Coordinates:
  * gridpoint      (gridpoint) int64 0 1
    lat            (gridpoint) int64 -60 60
    lon            (gridpoint) int64 120 240
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300

<xarray.DataArray 'emulator_tas' (stacked_coord: 7, predictor: 4)>
array([[...]])
Coordinates:
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300
  * predictor      (predictor) MultiIndex
  - variable       (predictor) object 'emulator_tas' ... 'global_variability'

Thus, you could only call apply_ufunc after flattening the arrays in some way. I tried out how this last step could be implemented with apply_ufunc, using your target_flattened and predictors_flattened. It looks more complicated than your solution...

import numpy as np
import sklearn.linear_model
import xarray as xr


def _regress_single_group(target_point, predictor, weights=None):

    # this is the method that actually does the regression
    args = [predictor.T, target_point.reshape(-1, 1)]
    if weights is not None:
        args.append(weights)
    reg = sklearn.linear_model.LinearRegression().fit(*args)
    a = np.concatenate([reg.intercept_, *reg.coef_])

    return a


xr.apply_ufunc(
    _regress_single_group,
    target_flattened,
    predictors_flattened,
    input_core_dims=[["stacked_coord"], ["predictor", "stacked_coord"]],
    output_core_dims=(("pred",),),
    vectorize=True,
)
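A possible follow-up (my sketch, not from the PR): naming the result and labelling the new pred dimension makes the output self-describing - the intercept comes first, then one coefficient per predictor:

coefficients = xr.apply_ufunc(
    _regress_single_group,
    target_flattened,
    predictors_flattened,
    input_core_dims=[["stacked_coord"], ["predictor", "stacked_coord"]],
    output_core_dims=(("pred",),),
    vectorize=True,
)
# dims are ("gridpoint", "pred"); pred has length n_predictors + 1
coefficients = coefficients.assign_coords(
    pred=["intercept"] + list(predictors_flattened["variable"].values)
)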

@znicholls
Collaborator Author

I think you've got it!

target_point.reshape(-1, 1)

Maybe a comment on this would help, e.g. "Make the data a flat array with two dimensions so sklearn behaves".

Looks more complicated than your solution...

A little, but the performance improvements are probably worth it!

@znicholls
Collaborator Author

I see now that you construct a stacked_coord over scenario x time and only then loop over the points.

Yep, though I'm not sure if this is the smartest way, or if there should be an extra layer. I.e. should the layer I've just written assume that things are already stacked, with an extra layer on top that handles the stacking for the user, or should we use what we have here? I am tempted to add the extra layer because I think it will give us greater control as we add new features. One way it could look is sketched below.
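One way the extra layer could look (a hypothetical sketch, all names invented): the layer written so far assumes stacked input, and a thin wrapper does the stacking for the user:

def calibrate_with_stacking(calibrator, target, predictors):
    # hypothetical wrapper: build the (scenario, time) stacked coordinate the
    # low-level layer expects, dropping the NaNs introduced by the outer join
    target = target.stack(stacked_coord=("scenario", "time")).dropna("stacked_coord")
    predictors = {
        name: pred.stack(stacked_coord=("scenario", "time")).dropna("stacked_coord")
        for name, pred in predictors.items()
    }
    return calibrator.calibrate(target, predictors)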

@znicholls
Collaborator Author

  • If we define LinearRegression().calibrate() should we also define LinearRegression().emulate()? I see why we would want to separate them. Yet, keeping them together would enforce to implement both.

This is true. I think it's something to think about once we have a few more pieces in place. At this point we're so low level that we shouldn't worry about emulation just yet, because the process for how calibration and emulation fit together is a bit complicated (you have to calibrate multiple different models, and then make sure they join together properly, before you can actually make emulations).

@znicholls
Collaborator Author

@mathause I just pushed an attempt to also do the global variability calibration (we can always just pick the commits of interest once we decide on a good direction). It was a good learning exercise, but it's not completely clear to me how we can make a coherent structure out of all this. One to discuss this morning.

@znicholls
Collaborator Author

znicholls commented Oct 26, 2021

The notes I made on where we landed with train_lv (so we don't lose them and have to work it out again):

  1. calculating distances between points and the Gaspari-Cohn correlation functions (sketched below)
  2. autoregression for each gridpoint (weighting each scenario, ensemble member choices)
  3. calculating the localization radius, the empirical covariance matrix, and the localized empirical covariance matrix
  4. calculating innovations (something like realisations from the spatial covariance matrix, i.e. the random element of the draws)
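For context on point 1 (the sketch referenced above; not code from this PR): the Gaspari-Cohn function is the standard compactly supported, fifth-order piecewise-rational correlation function - it is 1 at distance 0 and falls to 0 at twice the localisation radius. A minimal NumPy version:

import numpy as np


def gaspari_cohn(r):
    # r is the distance divided by the localisation radius
    r = np.abs(np.asarray(r, dtype=float))
    out = np.zeros_like(r)

    inner = r < 1
    ri = r[inner]
    out[inner] = 1 - 5 / 3 * ri**2 + 5 / 8 * ri**3 + 1 / 2 * ri**4 - 1 / 4 * ri**5

    transition = (r >= 1) & (r < 2)
    rt = r[transition]
    out[transition] = (
        4 - 5 * rt + 5 / 3 * rt**2 + 5 / 8 * rt**3
        - 1 / 2 * rt**4 + 1 / 12 * rt**5 - 2 / (3 * rt)
    )

    return out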

@mathause
Member

I am not saying it's a good idea, but if you want to get rid of duplicated loops you can use yield:

def _loop(target, scenario_level, ensemble_member_level):
    # flatten the nested scenario / ensemble-member loops into one generator
    for _, scenario_vals in target.groupby(scenario_level):
        for _, em_vals in scenario_vals.groupby(ensemble_member_level):
            yield em_vals


def _select_auto_regressive_process_order(...):

    for em_vals in _loop(target, scenario_level, ensemble_member_level):
        em_orders = AutoRegression1DOrderSelection().calibrate(
            em_vals, maxlag=maxlag, ic=ic
        )
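A toy check of that pattern (invented data; uses the _loop generator from the snippet above):

import numpy as np
import xarray as xr

target = xr.DataArray(
    np.arange(4.0),
    dims="sample",
    coords={
        "scenario": ("sample", ["hist", "hist", "ssp126", "ssp126"]),
        "member": ("sample", ["r1", "r2", "r1", "r2"]),
    },
)

# one ensemble member of one scenario per iteration, no nesting at the call site
for em_vals in _loop(target, "scenario", "member"):
    print(em_vals.values)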

@znicholls
Collaborator Author

I am not saying it's a good idea but if you want to get rid of duplicated loops you can use yield

Nice, I tidied up a bit. Still the reimplementation of training local variability to go; let's see if that happens this week or not.

@mathause mentioned this pull request Oct 27, 2021
@mathause
Member

I am not sure where to add this comment, so I'll add it here.

I think one thing that bugs me is that we need stacked coords of time x scenario because of the different time axes of hist and proj. If all scenarios had the same time vector we could use a nice 3D array... gridpoint x time x scenario.

@znicholls
Collaborator Author

znicholls commented Oct 27, 2021

I think one thing that bugs me is that we need stacked coords of time x scenario because of the different time axes of hist and proj. If all scenarios had the same time vector we could use a nice 3D array... gridpoint x time x scenario.

Yes, I don't love this either, but I don't have a solution, given that the scenarios have different numbers of time points...

Options I've thought about (sadly, none has jumped out as great):

  • concatenate history onto every scenario (cons: duplicates data, which can be memory intensive, and also gives history extra weight/forces us to drop history internally, i.e. we end up doing the same operation anyway)
  • force the user to do the flattening first (cons: seems against the spirit of 'just working' with CMIP6)
  • calibrate one scenario at a time (I don't think this works because the regressions need everything at once)
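A toy sketch (invented data, not from the PR) of the outer-join-then-stack approach, which is what makes the differing time axes workable:

import numpy as np
import xarray as xr

hist = xr.DataArray(
    np.random.randn(3),
    dims="time",
    coords={"time": [1850, 1950, 2014], "scenario": "hist"},
)
ssp126 = xr.DataArray(
    np.random.randn(2),
    dims="time",
    coords={"time": [2015, 2100], "scenario": "ssp126"},
)

# the outer join over time gives a 2D (scenario x time) array padded with NaN...
joined = xr.concat([hist, ssp126], dim="scenario")

# ...which flattens to the 1D stacked_coord used above once the invalid
# (scenario, time) combinations are dropped
stacked = joined.stack(stacked_coord=("scenario", "time")).dropna("stacked_coord")
assert stacked.sizes["stacked_coord"] == 5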

@leabeusch
Collaborator

Based on this comment of @mathause #106 (comment), I guess you've moved beyond @znicholls' original option proposals, but I nevertheless want to stress that I'm very much against option 3, i.e.,

calibrate one scenario at a time (I don't think this works because the regressions need everything at once)

as @znicholls already points out himself: this really goes against the general idea of MESMER, which is calibrated on a broad range of scenarios simultaneously so that the resulting parameters can successfully emulate a broad range of scenarios too, i.e. what we analysed in the GMD paper. How good or bad the single-scenario approach would be always depends on which scenario is used for calibration, and so on... but please don't kill MESMER's overall capability to be trained on multiple scenarios at once in this whole refactoring exercise. 😅

@znicholls
Collaborator Author

Lessons learnt so far from this re-write:

  1. Keeping calibration and emulation next to each other would make it much easier to understand what is going on with each part of the model/bit of code.
  2. Ensuring that methods/functions are no more than 10 lines (with very few exceptions) would make it much easier to see what is going on. Functions with more than 10 lines carry too much context, which makes it hard to focus on what is actually happening.
  3. Using long, descriptive names also makes a massive difference. Abbreviations make it really hard to keep track of what is going on (and, combined with long functions, almost impossible).
  4. Doing a re-write alongside the original author would probably be much simpler, as they know the theory (I had to do a lot of googling to work out what I was actually looking at).

I have train local variability left to sort out; then I'll close this PR and start with some much smaller steps. I think doing this has given me enough experience to have a sense of how to start on a new structure.

@leabeusch
Collaborator

Doing a re-write alongside the original author would probably be much simpler, as they know the theory (I had to do a lot of googling to work out what I was actually looking at).

I haven't properly been keeping up with all the refactoring progress lately, but in case I could be of help with this point, I assume you'd let me know? ^^'

@mathause mentioned this pull request Jun 10, 2022
@mathause mentioned this pull request Sep 25, 2023