weird cdscan error #70

Closed
doutriaux1 opened this Issue Dec 21, 2016 · 33 comments

@doutriaux1
Member

doutriaux1 commented Dec 21, 2016

For @mcenerney1 and on the Travis systems, cdscan chokes on the missing_value attribute. No idea why. The same cdscan version on my Mac and Linux systems works on the file that fails for @mcenerney1.

see: https://travis-ci.org/UV-CDAT/uvcmetrics/builds/185660014

it's got the log details.

cdscan itself is easy to fix, as shown in @dnadeau4's PR

but we need to understand why this error is triggered only on SOME systems.

In case the Travis log disappears, here is the gist of the error:

2: XML NAME: /home/travis/test_data/cam_output/c_t_b30.009.cam2.h0_csad836dce7d9e4045bcb5c184366d8bc0.xml
2: RUNNNIG CDSCAN
2: CDSCAN RUN ERROR len() of unsized object
2: END ERROR LOG
2:   File "/home/travis/miniconda/envs/travis/lib/python2.7/site-packages/metrics/computation/reductions.py", line 2864, in run_cdscan
2:     cdscan.main(cdscan_line)
2:   File "/home/travis/miniconda/envs/travis/lib/python2.7/site-packages/cdms2/cdscan.py", line 1635, in main
2:     cleanupAttrs(attrs)
2:   File "/home/travis/miniconda/envs/travis/lib/python2.7/site-packages/cdms2/cdscan.py", line 464, in cleanupAttrs
2:     if len(attval)==1:
2: cdscan_line= ['cdscan', '-q', '-x', '/tmp/travis/uvcmetrics/c_t_b30.009.cam2.h0_csad836dce7d9e4045bcb5c184366d8bc0.xml', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-01.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-02.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-03.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-04.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-05.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-06.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-07.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-08.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-09.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-10.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-11.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0600-12.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-01.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-02.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-03.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-04.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-05.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-06.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-07.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-08.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-09.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-10.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-11.nc', '/home/travis/test_data/cam_output/c_t_b30.009.cam2.h0.0601-12.nc']
@mcenerney1

Contributor

mcenerney1 commented Dec 21, 2016

It looks like your travis_fix almost worked. diags_test_06 & meta_diags failed. the rest passed.

@mcenerney1

Contributor

mcenerney1 commented Dec 22, 2016

Is it possible that there is some cache file or build directory that needs to be deleted?

@doutriaux1

Member

doutriaux1 commented Dec 22, 2016

no it's bad... really really bad... scary bad...

@doutriaux1

Member

doutriaux1 commented Dec 22, 2016

In short, the regridder and masking appear to produce different results on your system and the Travis systems than on any other system we have tested. I'm suspecting NC4. But why? I even used Travis' own VMs and couldn't reproduce your behavior. It has to be env variable related...

@mcenerney1

Contributor

mcenerney1 commented Dec 22, 2016

YIKES!

@williams13

williams13 commented Dec 22, 2016

@mcenerney1

Contributor

mcenerney1 commented Dec 22, 2016

I've been plagued with a cdscan-type problem, and Travis's site has had problems running the metrics test. Charles has been tracking this one and it "appears" they are the same problem. The crazy thing is that it works fine everywhere else. My system has been scrubbed and it still has issues.

@doutriaux1

Member

doutriaux1 commented Dec 22, 2016

We have no idea, that's what's scary. Somehow, suddenly, on Travis (and on Jim's machine) cdscan is following a different code path. Something to do with missing_value. That was easy to fix, but now the regridder output ends up a bit different, with more masking. I suspect this is because the missing value is handled a bit differently in the XML that cdscan generates. I tried using Travis VMs via Docker but I can't reproduce the error there either. I'm suspecting an env issue. We have to figure out why cdscan behaves differently on these two systems...

@doutriaux1

Member

doutriaux1 commented Dec 22, 2016

@mcenerney1 could you yank your anaconda altogether, try installing 2.6 rather than 2.8, and see if the problem persists?

@mcenerney1

Contributor

mcenerney1 commented Dec 22, 2016

reinstalled anaconda and created new env 2.6_1 with
conda create -n 2.6_1 -c conda-forge -c uvcdat uvcdat=2.6.1
getting
ImportError: No module named vtkGeovisCorePython
What's missing? vtk is
vtk 7.1.0.2.6 uvcdat_master uvcdat

@zshaheen

Member

zshaheen commented Dec 23, 2016

@mcenerney1 Don't use the conda-forge channel when installing 2.6.1.

@mcenerney1

Contributor

mcenerney1 commented Jan 3, 2017

With
conda create -n 2.6_1 -c uvcdat uvcdat=2.6.1
I get import error

import cdms2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mcenerney1/anaconda/envs/2.6_1/lib/python2.7/site-packages/cdms2/__init__.py", line 17, in <module>
    from cdmsobj import CdArray, CdChar, CdByte, CdDouble, CdFloat, CdFromObject, CdInt, CdLong, CdScalar, CdShort, CdString
  File "/Users/mcenerney1/anaconda/envs/2.6_1/lib/python2.7/site-packages/cdms2/cdmsobj.py", line 5, in <module>
    import cdmsNode
  File "/Users/mcenerney1/anaconda/envs/2.6_1/lib/python2.7/site-packages/cdms2/cdmsNode.py", line 10, in <module>
    import cdtime
ImportError: dlopen(/Users/mcenerney1/anaconda/envs/2.6_1/lib/python2.7/site-packages/cdtime.so, 2): Library not loaded: @rpath/libjpeg.9.dylib
  Referenced from: /Users/mcenerney1/anaconda/envs/2.6_1/lib/libjasper.1.0.0.dylib
  Reason: image not found

@durack1

Member

durack1 commented Jan 3, 2017

@doutriaux1 @dnadeau4 it also seems a similar issue is now happening with the systematic xml generation on crunchy - it would appear to be a 2.8 specific issue, as 2.6 seems to work:

So 2.8:

(uvcdat) bash-4.1$ /usr/local/anaconda2/envs/2.8/bin/cdscan -x /export/durack1/Desktop/test2p8.xml /cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/*.nc
Finding common directory ...
Common directory: /cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/
Scanning files ...
/cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/sftof_fx_ACCESS1-0_1pctCO2_r0i0p0.nc
Traceback (most recent call last):
  File "/usr/local/anaconda2/envs/2.8/bin/cdscan", line 1681, in <module>
    main(sys.argv)
  File "/usr/local/anaconda2/envs/2.8/bin/cdscan", line 1635, in main
    cleanupAttrs(attrs)
  File "/usr/local/anaconda2/envs/2.8/bin/cdscan", line 464, in cleanupAttrs
    if len(attval)==1:
TypeError: len() of unsized object

And 2.6:

(uvcdat) bash-4.1$ /usr/local/anaconda2/envs/2.6/bin/cdscan -x /export/durack1/Desktop/test2p6.xml /cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/*.nc
Finding common directory ...
Common directory: /cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/
Scanning files ...
/cmip5_css02/data/cmip5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/fx/ocean/fx/r0i0p0/sftof/1/sftof_fx_ACCESS1-0_1pctCO2_r0i0p0.nc
/export/durack1/Desktop/test2p6.xml written
@painter1

Contributor

painter1 commented Jan 3, 2017

@durack1, the cleanupAttrs() error is described, along with a fix, in issue CDAT/cdat#2145. I think that @dnadeau4 fixed the problem about two weeks ago. There are more problems with cdscan.

@durack1

Member

durack1 commented Jan 3, 2017

@painter1 thanks for the heads up.. The 2.8 version above is bombing on most of the CMIP5 data, so whatever the issue is, it should be solved pronto.. I'd be happy to test this so I can get my own code back up and running.. It'd also be good to test cdscan against a bunch of the CMIP5 files to make sure a similar issue doesn't recur in the future
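
A sweep of the kind suggested here could be sketched roughly as follows. This is only an illustration: `sweep` and `cdscan_cmd` are hypothetical names, the per-file XML name is made up, and the only cdscan options used are `-q` and `-x`, both of which appear in the command lines quoted earlier in this thread.

```python
import subprocess


def cdscan_cmd(nc_file, xml_file):
    # Only -q and -x are used; both appear in command lines quoted in this thread.
    return ["cdscan", "-q", "-x", xml_file, nc_file]


def sweep(nc_files, build_cmd=cdscan_cmd):
    """Run one scan per file; return {file: last stderr line} for each failure."""
    failures = {}
    for index, nc_file in enumerate(nc_files):
        xml_file = "sweep_%d.xml" % index  # hypothetical output name
        proc = subprocess.Popen(
            build_cmd(nc_file, xml_file),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        _, err = proc.communicate()
        if proc.returncode != 0:
            lines = err.decode("utf-8", "replace").strip().splitlines()
            failures[nc_file] = lines[-1] if lines else "exit %d" % proc.returncode
    return failures
```

Scanning one file at a time keeps a single bad file from hiding the others, and the last stderr line is usually the exception message (for example the `TypeError` above). `build_cmd` is a parameter so the harness can be exercised without cdscan installed.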

@PeterCaldwell

PeterCaldwell commented Jan 4, 2017

Thanks for the effort on this. I'd also like to see a solution actually made operational ASAP because one of my projects is totally stalled out until cdscan works on CMIP5 data again.

@doutriaux1

Member

doutriaux1 commented Jan 4, 2017

@durack1 @PeterCaldwell so this has been fixed since before Xmas. @durack1 I still need to update crunchy, though. So we're good. I have a branch with @painter1's additional fix. Once I've added the test to the branch I will merge it into master and update crunchy, unless @durack1 needs this ASAP on crunchy.

@doutriaux1

Member

doutriaux1 commented Jan 4, 2017

As for the really bad error (the different numbers and all), I wasted a few days tracking it down before I realized this branch had bad baselines in it... Hence the different numbers when I was testing my branch on Travis... duh...

@durack1

Member

durack1 commented Jan 4, 2017

thanks @doutriaux1 this issue has been considerably complicated by changes to the permissions on the cron jobs, I presume implemented by a network bot.. This will need tweaking in addition to the update of cdscan/uvcdat on the machine

@durack1

Member

durack1 commented Jan 4, 2017

@PeterCaldwell @doutriaux1 I have just updated the cron job to run against the 2014-03-31 version of UV-CDAT/cdscan so as long as there are no other system issues it should lead to a successful xml run that completes late next week

@PeterCaldwell

PeterCaldwell commented Jan 4, 2017

Thanks @durack1 ! So it sounds like you've just reverted back to an old version of cdat until @doutriaux1 can rebuild and verify the bugfixed version on crunchy? I'm cool with that.

@dnadeau4

Contributor

dnadeau4 commented Jan 4, 2017

@durack1 your problem has been solved and is in master.

#69

@durack1

Member

durack1 commented Jan 4, 2017

@PeterCaldwell yep, my approach to this stuff.. If it works.. And I have no intention of ever changing it again, unless of course somehow IT fiddling changes everything around again.. I do hope it solves the issue, we'll find out next week

@PeterCaldwell

PeterCaldwell commented Jan 4, 2017

Thanks guys! Fingers crossed.

@durack1

Member

durack1 commented Jan 9, 2017

@doutriaux1 it seems that spawned processes aren't inheriting the environment (and the cdscan path) from the parent.. So if you can update the UV-CDAT install and the file found at /usr/local/uvcdat/latest/bin/cdscan that should solve the problem..
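
One way to sidestep this kind of inheritance problem is for the parent to hand each child an explicit environment instead of relying on whatever the child happens to inherit. A minimal sketch, assuming the children are launched from Python; `run_with_env` and its arguments are hypothetical and not part of the actual cron setup.

```python
import os
import subprocess


def run_with_env(argv, bin_dir, extra_env=None):
    """Run argv with bin_dir prepended to PATH in the child's environment."""
    env = dict(os.environ)  # copy, so the parent's environment is untouched
    env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    if extra_env:
        env.update(extra_env)
    proc = subprocess.Popen(argv, env=env, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")
```

Passing the full path to the executable as `argv[0]` (for example the cdscan path mentioned above) avoids depending on PATH lookup at all; the explicit `env` then only matters for anything the child itself spawns.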

@doutriaux1 doutriaux1 closed this Jan 9, 2017

@durack1

Member

durack1 commented Jan 9, 2017

@doutriaux1 did you update the crunchy UV-CDAT installation? An updated installation will hopefully solve my persistent problem

@doutriaux1

Member

doutriaux1 commented Jan 9, 2017

@durack1 try again now.

@durack1

Member

durack1 commented Jan 9, 2017

@doutriaux1 thanks, I've kicked it off again. I'll check back in an hour and see if xmls are being written with the update..

@durack1

Member

durack1 commented Jan 10, 2017

@doutriaux1 still looks like there is a problem..

@durack1

Member

durack1 commented Jan 10, 2017

@doutriaux1 looks like your latest cdscan is pointing to a directory 2017-01-09-nox that is not accessible

$ ls -al /usr/local/anaconda2/envs
total 56
drwxrwxrwx 17 doutriaux1 climate 4096 Jan  9 13:24 2017-01-09
drwx------ 15 doutriaux1 climate 4096 Jan  9 13:32 2017-01-09-nox

lrwxrwxrwx  1 root       root      10 Jan  9 13:35 latest -> 2017-01-09
lrwxrwxrwx  1 root       root      14 Jan  9 13:35 latest-nox -> 2017-01-09-nox
[durack1@crunchy cmip5]$ head -n 10 /usr/local/anaconda2/envs/latest/bin/cdscan
#!/usr/local/anaconda2/envs/2017-01-09-nox/bin/python

import sys

@durack1 durack1 reopened this Jan 10, 2017

@doutriaux1

Member

doutriaux1 commented Jan 10, 2017

What do you mean, not accessible? It's 777. I'll re-chmod the whole /usr/local/anaconda2 just to be safe.
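
The mode bits in the `ls -al` listing quoted earlier can also be checked from Python rather than by eye. A small sketch, assuming a POSIX system and Python 3.3+ (for `stat.filemode`); the helper names are made up for illustration.

```python
import os
import stat


def others_can_enter(path):
    """True if users outside the owner and group may list and traverse path."""
    mode = os.stat(path).st_mode  # follows symlinks such as latest-nox
    needed = stat.S_IROTH | stat.S_IXOTH
    return stat.S_ISDIR(mode) and (mode & needed) == needed


def describe(path):
    # Same string as the first column of `ls -l`, e.g. "drwx------"
    return stat.filemode(os.stat(path).st_mode)
```

A directory shown as `drwx------` fails this check, while one shown as `drwxrwxrwx` passes it.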

@dnadeau4 dnadeau4 closed this Jan 10, 2017

@durack1

Member

durack1 commented Jan 10, 2017

@doutriaux1 I changed the perms on 2017-01-09-nox last night to get myself up and running.. My point was that the executables that include a shebang (e.g. cdscan) located in the 2017-01-09 install are actually pointing to 2017-01-09-nox.

This is bad, and should be fixed
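
A mismatch like this can be detected mechanically by reading each script's first line and comparing the interpreter path against the environment the script is installed in. A rough sketch under that assumption; both function names are hypothetical, and symlinks such as `latest` are deliberately not resolved.

```python
import os


def shebang_interpreter(script_path):
    """Return the interpreter path from a script's #! line, or None."""
    with open(script_path, "rb") as handle:
        first = handle.readline().decode("utf-8", "replace").strip()
    if not first.startswith("#!"):
        return None
    parts = first[2:].split()
    return parts[0] if parts else None


def shebang_matches_env(script_path, env_prefix):
    """True if the script's interpreter lives under env_prefix."""
    interpreter = shebang_interpreter(script_path)
    if interpreter is None:
        return False
    # The trailing separator keeps "2017-01-09" from matching "2017-01-09-nox".
    prefix = os.path.join(os.path.normpath(env_prefix), "")
    return os.path.normpath(interpreter).startswith(prefix)
```

Run over every file in an environment's `bin` directory, this would flag any script (cdscan included) whose shebang points into a different environment.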

@durack1 durack1 reopened this Jan 10, 2017

@dnadeau4

Contributor

dnadeau4 commented Jan 10, 2017

@durack1 the problem was solved with #69 and was merged into master. Look at line 464 and you will see that the 0-dim variables are now read correctly. I think your "nox" problem should be another ticket.
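
For readers who have not met this failure before: `len() of unsized object` is the `TypeError` NumPy raises when `len()` is called on a zero-dimensional array, which is what the `len(attval)==1` test in the tracebacks above runs into. The sketch below shows one way such a value could be guarded. It is not the code from #69; `attr_as_scalar_or_array` is a hypothetical helper written only to illustrate the idea.

```python
import numpy


def attr_as_scalar_or_array(attval):
    """Collapse 0-d and length-1 values to a plain scalar; leave the rest alone."""
    if isinstance(attval, str):
        return attval
    arr = numpy.asarray(attval)
    if arr.ndim == 0 or arr.size == 1:
        # arr.ndim == 0 is the case where len(attval) would raise
        # "TypeError: len() of unsized object".
        return arr.reshape(-1)[0].item()
    return arr
```

Checking `ndim` (or `size`) instead of calling `len()` works for Python scalars, 0-d arrays and ordinary arrays alike, so the same branch covers a `missing_value` stored either way.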

@dnadeau4 dnadeau4 closed this Jan 10, 2017

@doutriaux1 doutriaux1 modified the milestone: 2.10 May 5, 2017

@doutriaux1 doutriaux1 added the bug label May 8, 2017
