
[ENH] Speed up evotuning and improve evotuning ergonomics #57

Merged: 30 commits from numbajit into master on Jul 18, 2020

Conversation

@ericmjl (Collaborator) commented on Jun 28, 2020

PR Description

Before you read on, please ignore the branch name. I thought originally that I could use numba to speed up things, but it turned out that once again, with some careful profiling, I found we didn't have to.

This PR does a few things:

  1. Adds a pre-commit configuration.
  2. Adds an installation script that makes it easy to install JAX on GPU.
  3. Adds backend specification of the device (GPU/CPU); a usage sketch follows below.
  4. Switches the preparation of sequences as input-output pairs to run exclusively on CPU, for speed.
  5. Adds ergonomic UI features (progress bars!) that improve the user experience.
  6. Adds docs on the recommended batch size and its relationship to GPU RAM consumption.
  7. Switches from an exact calculation of the train/holdout loss to an estimated one.

In any case, this PR closes #56.
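For orientation, here is a minimal usage sketch of how these pieces fit together. The exact signature of fit is not reproduced in this thread, so the keyword names below (n_epochs, batch_size, backend, holdout_seqs) are assumptions drawn from the discussion rather than the definitive API:

```python
# Hypothetical usage sketch only; keyword names are assumptions based on this
# PR's discussion (backend kwarg, batch size vs. GPU RAM, holdout_seqs=None).
from jax_unirep.evotuning import fit

sequences = ["MKVLAAGV", "MKLVAAGI"]  # toy training sequences

evotuned_params = fit(
    sequences,
    n_epochs=25,        # e.g. the 25-epoch run reported below
    batch_size=100,     # this PR documents batch size vs. GPU RAM consumption
    backend="cpu",      # sane default; switch to "gpu" when one is available
    holdout_seqs=None,  # holdout sequences default back to None in this PR
)
```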

Checklist

General

  1. I have made the PR off a new branch from my fork
    (<your_username>:<feature-branch_name>), not
    <your_username>:master.
  2. I have added my changes to the CHANGELOG.md file at the top.
  3. I have made any necessary changes to the documentation in the README.

Code checks

  1. If there are new features implemented, add suitable tests
    in the tests directory.
  2. If any new dependencies are introduced through the new features,
    add the packages, pinned to a version, to environment.yml.
  3. Run make test in a console in the top level directory
    to make sure all the tests pass.
  4. Run make format in a console in the top level directory
    to make the code comply with the formatting standards.

@ericmjl changed the title from "Numbajit" to "[ENH] Speed up evotuning and improve evotuning ergonomics" on Jun 28, 2020
@codecov-commenter commented on Jun 28, 2020

Codecov Report

Merging #57 into master will increase coverage by 4.06%.
The diff coverage is 96.29%.


@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   89.27%   93.33%   +4.06%     
==========================================
  Files          11       11              
  Lines         522      540      +18     
==========================================
+ Hits          466      504      +38     
+ Misses         56       36      -20     
Impacted Files            Coverage Δ
jax_unirep/params.py      100.00% <ø> (+71.42%) ⬆️
jax_unirep/evotuning.py   96.95% <96.00%> (+3.03%) ⬆️
jax_unirep/utils.py       91.59% <100.00%> (+0.14%) ⬆️
jax_unirep/sampler.py     92.75% <0.00%> (+1.44%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 661a31b...b43b92e.

@ericmjl (Collaborator, Author) commented on Jun 28, 2020

Hmmm, I'm a little confused as to how the code coverage was impacted. I think I need a second opinion on whether stuff could be refactored a bit better. @ElArkk?

@ericmjl requested a review from @ElArkk on June 28, 2020 22:10
@ElArkk (Owner) commented on Jun 29, 2020

Wow, great work @ericmjl! Very clever to optionally move the average loss computation off of the GPU, while still using the GPU (if available) for the work-intensive weight updates!

As for the coverage, one thing I think could be responsible for the decrease on evotuning.py is that we do not supply holdout seqs in the execution test of evotuning or fit?

tests/test_params.py (review thread resolved)
tests/test_params.py (review thread resolved, outdated)
global evotune_loss # this is necessary for JIT to reference evotune_loss
evotune_loss_jit = jit(evotune_loss, backend=backend)

def batch_iter(xs: np.ndarray, ys: np.ndarray, batch_size: int = 25):
@ElArkk (Owner) commented on the diff:

Does it make sense to set a default value for batch_size here, when this function is only used inside avg_loss, and you then supply another value (100) further below?

@ericmjl (Collaborator, Author) replied:

Yeah, good catch! We should make it uniform for simplicity; that way there is much less unnecessary complexity.
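For reference, a minimal sketch of a random batch iterator along the lines being discussed; the signature mirrors the quoted diff, but the body is an illustration rather than the PR's actual implementation:

```python
import numpy as np

def batch_iter(xs: np.ndarray, ys: np.ndarray, batch_size: int = 25):
    """Yield (x, y) mini-batches in a shuffled order.

    Illustrative sketch only; the real function lives in jax_unirep/evotuning.py.
    """
    indices = np.random.permutation(len(xs))  # one shuffle per pass
    for start in range(0, len(xs), batch_size):
        batch_idx = indices[start : start + batch_size]
        yield xs[batch_idx], ys[batch_idx]
```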

jax_unirep/evotuning.py (review thread resolved)
jax_unirep/evotuning.py (review thread resolved)
jax_unirep/evotuning.py (review thread resolved, outdated)
jax_unirep/evotuning.py (review thread resolved, outdated)
ericmjl and others added 8 commits on June 29, 2020
@ericmjl (Collaborator, Author) commented on Jun 30, 2020

@ivanjayapurna I wanted to get a second pair of eyes on the code before we merge. Can you and @ElArkk independently test the numbajit branch code in, say, Colab on the GPU runtime? I've been working off my home GPU tower to speed up development, but I want to make sure what I've done is "generally useful".

UPDATE 30 June 2020, 9:20 AM: I just tried it out on your notebook, @ivanjayapurna, and everything runs smoothly and fast. By default, I have set it to dump parameters on every epoch. Storage is cheap; human time is not.

@ElArkk (Owner) commented on Jul 1, 2020

I was just thinking: now that we only use one random batch of sequences to calculate the average loss on the dataset, does it still make sense to have an argument for which backend to use? The actual training always uses the GPU if available, and it needs to be able to handle the same or even larger batch sizes than the average loss calculation does. So if GPU memory were a problem in the average loss calculation, it would also be a problem in training anyway? Maybe I'm missing something here, @ericmjl?

@ericmjl (Collaborator, Author) commented on Jul 2, 2020

Sometimes, setting the backend explicitly can help with debugging. For example, while debugging the memory allocation issue, I found it handy to be able to switch freely between the CPU and GPU backends. Hence, I think keeping the backend kwarg while setting a sane default (CPU) makes a lot of sense, as it gives both convenience and flexibility.
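As a concrete illustration of that switch, jax.jit accepts a backend argument (as in the diff quoted earlier). The toy_loss function below is a stand-in for the real evotune_loss, purely for demonstration:

```python
import jax.numpy as jnp
from jax import jit

def toy_loss(params, x):
    # Stand-in for evotune_loss; the real loss lives in jax_unirep/evotuning.py.
    return jnp.mean((x - params) ** 2)

# Compile the same function against different backends; handy when chasing
# down memory-allocation problems.
loss_cpu = jit(toy_loss, backend="cpu")
loss_gpu = jit(toy_loss, backend="gpu")  # only usable when a GPU backend is present

print(loss_cpu(0.5, jnp.arange(4.0)))
```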

@ivanjayapurna (Contributor) commented:

Just wanted to add on the testing side - trained on the TEM-1 sequences for 25 epochs on AWS with no memory issues

@ivanjayapurna (Contributor) commented:

Results from 25-epoch training:
[image: loss plots from the 25-epoch run]

The same results but with "epoch 0" plotted:
[image: the same loss plots, including "epoch 0"]

The plots on the left are from before random batching, and the plots on the right are the results from this PR. It is clear that this change made a big difference in the model's ability to learn. Two things:

1. If I'm understanding it correctly, epoch 1 is equivalent to un-evotuned UniRep. Epoch 0 here, I believe, is the final set of weights saved without a "step" input to dump_params, which is a little confusing. Perhaps we should consider changing this? It's a simple change: if no step is given, save the weights without an index in the name, or with "final" appended, or something similar (a naming sketch follows below). Perhaps epochs should be 0-indexed as well, to reflect that epoch 0 is before any training has begun.

2. There still seems to be minimal learning after the 1st epoch; perhaps the learning rate we used was too high.
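To make the suggestion concrete, here is a hedged sketch of the naming behaviour proposed above. dump_params exists in the codebase, but its real signature is not shown in this thread, so this wrapper and its file names are hypothetical:

```python
import pickle
from pathlib import Path
from typing import Optional

def dump_params_sketch(params, out_dir: str, step: Optional[int] = None) -> Path:
    """Hypothetical naming scheme: 'epoch_<step>.pkl' when a step is given,
    'final.pkl' otherwise, so the unindexed dump is no longer ambiguous."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    filename = "final.pkl" if step is None else f"epoch_{step}.pkl"
    path = out / filename
    with open(path, "wb") as f:
        pickle.dump(params, f)
    return path
```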

@ElArkk (Owner) commented on Jul 7, 2020

@ivanjayapurna thank you for the thorough testing! What learning rate and batch size did you use here?

As for epoch 0, I'm wondering where it stems from, since the fit function never seems to call dump_params without the step argument. But maybe I missed it?

It could be a good idea to change epoch indexing back to 0, since right now the loss calculations at epoch 1 actually correspond to 'before any training has happened'. What do you think, @ericmjl?

@ericmjl (Collaborator, Author) commented on Jul 7, 2020

> It could be a good idea to change epoch indexing back to 0, since right now the loss calculations at epoch 1 actually correspond to 'before any training has happened'. What do you think @ericmjl?

Yes, let's do that. I think I messed up the epoch calculation when I put this PR together. @ElArkk, do you have a spare cycle to handle it? (If not, no worries; I can get to this later in the week.)

@ElArkk (Owner) commented on Jul 7, 2020

@ericmjl @ivanjayapurna I did a quick rework of the epoch calculation; let me know if you think it makes sense this way.

@ericmjl (Collaborator, Author) commented on Jul 16, 2020

It works for me. Anything else blocking this PR?

@ElArkk (Owner) commented on Jul 17, 2020

No, I guess this is ready :)

I just have one concern still: let's say someone wants to evotune on 50k-100k sequences. With a batch size of ~100, the loss after each epoch would be calculated on just 0.2% or 0.1% of the whole dataset, respectively. Do you think this is enough to estimate the overall loss, @ericmjl?

@ericmjl (Collaborator, Author) commented on Jul 17, 2020

Possibly not for only a few epochs, but in the limit of many epochs it should not be too much of an issue. The key showstopper that I think we should not compromise on is the interactive feel.
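For intuition, the estimate under discussion is just the loss over one random batch per epoch instead of over the full dataset. A sketch under that assumption, with loss_fn standing in for the library's per-batch loss:

```python
import numpy as np

def estimated_epoch_loss(xs, ys, loss_fn, batch_size: int = 100, rng=None):
    """Estimate the dataset loss from a single random batch.

    With 50k-100k sequences and batch_size=100, each estimate covers only
    ~0.1-0.2% of the data, but the noise averages out over many epochs.
    `loss_fn` is a placeholder for the library's per-batch loss function.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(xs), size=min(batch_size, len(xs)), replace=False)
    return loss_fn(xs[idx], ys[idx])
```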

@ElArkk (Owner) commented on Jul 17, 2020

By interactive feel, do you mean not having to wait too long between epochs for the loss calculation?

In any case, we shouldn't delay merging the sped-up and more stable evotuning any longer! If we see any problems with the average loss calculation, we can always come back to it. @ericmjl, what do you think?

@ericmjl (Collaborator, Author) commented on Jul 18, 2020

Agreed, hit that button when it’s done! (And go to bed soon, it’s awfully late there for you to be responding! 😸)

@ElArkk (Owner) commented on Jul 18, 2020

Hitting that button after a good night's sleep 😄

@ElArkk merged commit 54ab8e6 into master on Jul 18, 2020
@ElArkk deleted the numbajit branch on July 18, 2020 09:23
@ericmjl (Collaborator, Author) commented on Jul 19, 2020

NOICE! (Dude, you have no idea - I was knocked out for 5 hours this afternoon. I’m lacking sleep myself haha.)

ElArkk added a commit that referenced this pull request Oct 14, 2021
* Adding pre-commit

* Fixed up GPU memory allocation, and added docstrings.

* Adding a bash script that makes it easy to install JAX on GPU.

- The script builds a conda environment first.
- Then it clobbers over with the GPU-based installation
based on instructions given by JAX's developers.

* Update fit docstring

* Set backend to "cpu" by default

* Removed parallel kwarg

* Switched back to non-Numba-compatible dictionary definition

* Add pyproject TOML config file

Primarily to add black config

* Applied black

* Add flake8 to pre-commit hooks

* Remove flake8 from pre-commit

* Attempting to increase coverage without doing any actual work ^_^

* Add tests for params

- One unit test
- One lazy man's execution test

* Update changelog

* Fix test

* Add validate_mLSTM1900_params

This can be used as part of the test suite. I should have remembered!

Co-authored-by: Arkadij Kummer <43340666+ElArkk@users.noreply.github.com>

* Used validate_mLSTM1900_params as part of test

h/t @ElArkk

Co-authored-by: Arkadij Kummer <43340666+ElArkk@users.noreply.github.com>

* Fix batch_size in avg_loss function

* Flat is better than nested

At least I tried.

* Fix test

* Make format

* Change holdout_seqs to default back to None

* Set sane defaults for mLSTM1900 layer

* Changed to dumping every epoch by default.

* add backend explanation

* add backend to fit example

* change default batching method of fit function to random

* fix epoch calculations

* fix seq length choice for holdout seqs

Co-authored-by: Arkadij Kummer <43340666+ElArkk@users.noreply.github.com>
Co-authored-by: ElArkk <arkadij.kummer@gmail.com>
Development

Successfully merging this pull request may close these issues: Evotuning pairs can be sped up by switching to original NumPy

4 participants