sumx: no valid experiments found #15

Closed
svengiegerich opened this issue May 18, 2021 · 6 comments

Comments

@svengiegerich

Hey,
first thanks for this nice, light-weight tool! Very helpful.

Just one thing: I can't get sumx running.
With version 0.0.10, running python -m runx.sumx config_simple gives the following error:
No valid experiments found for /Users/svengiegerich/runx/config_simple

However, the path is correct and there are two successful runs (folders) in it; each subfolder contains a metrics.csv looking like this,

start,start/step,0,timestamp,1621378323.1818151
val,loss,0.013856839059957424,epoch,1,timestamp,1621378337.119943
val,loss,0.00520349506242475,epoch,2,timestamp,1621378344.595231

metrics are added by the following code lines,

metrics_val = {'loss': epoch_loss}
logx.metric(phase='val', metrics=metrics_val, epoch=epoch_i + 1)

Every other logging works smoothly, e.g. logx.add_scalar() for tensorboard.

-> Any idea what's wrong here?


My .runx,

LOGROOT: /Users/svengiegerich/runx
CODE_IGNORE_PATTERNS: '*.git,data/raw*,.*,results*'

FARM: bigfarm

# Farm resource needs
bigfarm:
    SUBMIT_CMD: 'submit_job'
    RESOURCES:
        image: mydocker-image-big:1.0
        gpu: 8
        cpu: 64
        mem: 450

and the config_simple.yml,

CMD: 'python train.py'

HPARAMS: [
  {
    logdir: LOGDIR,
    epochs: [1,2],
    RUNX.TAG: 'transformer',
    arch: 'transformer',
  }
]
@ajtao
Collaborator

ajtao commented May 19, 2021

Hi @svengiegerich! sumx looks in /Users/svengiegerich/runx for directories that contain both metrics.csv and hparams.json. It sounds like you've confirmed that the metrics.csv files exist. Do you also see the hparams.json files there too?
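
A quick way to check this condition by hand is to scan the log root and list run directories that have one of the two files but not the other. The following is only a minimal sketch of that check, not sumx's actual discovery code; the function name `find_incomplete_runs` is made up for illustration, and the only facts taken from this thread are the two file names.

```python
from pathlib import Path

# The two files that, per the comment above, a run directory must contain.
REQUIRED_FILES = ("metrics.csv", "hparams.json")


def find_incomplete_runs(logroot):
    """Map each partially complete run directory to the files it lacks.

    Hypothetical helper: a directory counts as a "run" here if it holds
    at least one of the required files.
    """
    missing = {}
    for run_dir in sorted(Path(logroot).rglob("*")):
        if not run_dir.is_dir():
            continue
        present = {name for name in REQUIRED_FILES if (run_dir / name).is_file()}
        # Skip directories with none of the files (not runs) or all of them (fine).
        if present and len(present) < len(REQUIRED_FILES):
            missing[str(run_dir)] = [n for n in REQUIRED_FILES if n not in present]
    return missing
```

Calling `find_incomplete_runs("/Users/svengiegerich/runx")` would then return one entry per run folder that has metrics.csv but no hparams.json (or the reverse).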

@svengiegerich
Author

Ah, no, hparams.json is indeed missing. I run in interactive mode (python -m runx.sumx config_simple -i) because I don't have access to a farm, so that is probably the issue?
Is it possible to configure .runx so that I can use runx non-interactively but without a farm? In other words, can I only use sumx with a farm?

Reading #9, I tried to modify .runx but with no success.

# not working
FARM: fake

fake:
    SUBMIT_CMD: na
    RESOURCES:
        dummy: na

@ajtao
Collaborator

ajtao commented May 19, 2021

So first off, I plan to release better support for the 'no farm' mode, where you shouldn't have to define the FARM.

But as a hack, the .runx you show above should actually work. I just confirmed this myself.

What sort of failure are you seeing?

@svengiegerich
Author

Running python -m runx.runx config_simple.yml, I get:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/thesis/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 394, in <module>
    main()
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 387, in main
    run_experiment(args.exp_yml)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 380, in run_experiment
    run_yaml(experiment_copy, runroot)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 330, in run_yaml
    cmd = build_farm_cmd(cmd, job_name, resource_copy, logdir)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/farm.py", line 126, in build_farm_cmd
    raise f'Unsupported farm: {cfg.FARM}'
TypeError: exceptions must derive from BaseException

And if I rename the farm to FARM: ngc, I get:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/thesis/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 394, in <module>
    main()
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 387, in main
    run_experiment(args.exp_yml)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/runx.py", line 361, in run_experiment
    experiment = read_config(args.farm, args.exp_yml)
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/utils.py", line 122, in read_config
    cfg.NGC_LOGROOT = read_config_item(experiment, 'NGC_LOGROOT')
  File "/opt/anaconda3/envs/thesis/lib/python3.7/site-packages/runx/utils.py", line 72, in read_config_item
    raise f'can\'t find {key} in config'
TypeError: exceptions must derive from BaseException 

Thanks for your time & help!
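
Both tracebacks end in the same TypeError rather than in the intended message, because `raise` is given an f-string instead of an exception instance. The sketch below reproduces that pattern in isolation and shows one possible fix. The two function names are hypothetical stand-ins modeled on the `read_config_item` call in the traceback, and `KeyError` is just one reasonable choice of exception type, not necessarily what runx uses.

```python
def read_config_item_broken(config, key):
    """Mimics the failing pattern: raising a plain string."""
    if key not in config:
        # Bug: an f-string evaluates to a str, which is not an exception,
        # so Python raises TypeError and the message itself is lost.
        raise f"can't find {key} in config"
    return config[key]


def read_config_item_fixed(config, key):
    """Same lookup, but wraps the message in a real exception class."""
    if key not in config:
        raise KeyError(f"can't find {key} in config")
    return config[key]
```

With the fixed variant, a missing `NGC_LOGROOT` key surfaces as a KeyError that names the key, instead of "exceptions must derive from BaseException".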

@ajtao
Collaborator

ajtao commented May 19, 2021

Hi Sven, I appreciate your patience!

I've updated the pypi runx to 0.0.11. Can you please pip install it and try it out? Now your .runx should only need LOGROOT defined, and all that fake FARM stuff isn't needed anymore. Please let me know if it works.

I've been trying to improve the examples a little for this case. It could certainly be improved :).

@svengiegerich
Author

Hey, thanks for the update!

Going through the examples again, I found my issue: I didn't pass the hparams=vars(args) argument to logx.initialize(). Now everything works smoothly. As feedback, it would have helped me as a user if this argument were explained in the README; however, I may also have missed it.
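
For anyone landing here with the same problem, this is roughly what the fix looks like. `vars(args)` turns the argparse namespace into a plain dict of hyperparameters, and passing that dict as `hparams` is what resolved the missing hparams.json in this thread. The argument names below mirror the config_simple.yml above, but the parser itself is an assumed sketch, and the `logx.initialize` call is left commented out so the snippet runs without runx installed; its `logdir` keyword is my recollection of the runx examples rather than something confirmed in this thread.

```python
import argparse

# Hypothetical training-script arguments, mirroring the HPARAMS in config_simple.yml.
parser = argparse.ArgumentParser()
parser.add_argument("--logdir", type=str, default="./logs")
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--arch", type=str, default="transformer")

# Parse a fixed argument list so the sketch is reproducible.
args = parser.parse_args(["--logdir", "/tmp/run0", "--epochs", "2"])

# vars() exposes the namespace as a dict: one entry per parsed argument.
hparams = vars(args)

# With runx installed, the dict would be handed over at initialization
# (assumed call shape, following the runx examples):
# from runx.logx import logx
# logx.initialize(logdir=args.logdir, hparams=hparams)
```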

Thanks again for this package!


[Just a side note: the "syntax" of metrics.csv seems to be inconsistent across rows (also in your example). Right now, the first line is two cells short, probably because there is no validation score. At least for me, a "consistent" format, with the first line also containing 7 cells, would simplify analyzing these metrics.csv files.]
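
Until the format is made uniform, rows of different length can still be read by treating each line as a phase name followed by alternating key/value cells. That layout is only inferred from the three sample rows at the top of this issue, not from any runx specification, and `parse_metrics` is a hypothetical helper name.

```python
import csv
import io

# The three sample rows quoted at the top of this issue.
SAMPLE = """start,start/step,0,timestamp,1621378323.1818151
val,loss,0.013856839059957424,epoch,1,timestamp,1621378337.119943
val,loss,0.00520349506242475,epoch,2,timestamp,1621378344.595231
"""


def parse_metrics(text):
    """Parse metrics.csv text into one dict per row.

    Assumes each row is: phase, key1, value1, key2, value2, ...
    so rows may carry a different number of cells.
    """
    rows = []
    for cells in csv.reader(io.StringIO(text)):
        if not cells:
            continue  # ignore blank lines
        record = {"phase": cells[0]}
        rest = cells[1:]
        # Pair up alternating key/value cells; all sample values are numeric.
        for key, value in zip(rest[0::2], rest[1::2]):
            record[key] = float(value)
        rows.append(record)
    return rows
```

`parse_metrics(SAMPLE)` yields three dicts; the first has only `start/step` and `timestamp`, while the two `val` rows also carry `loss` and `epoch`.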
