MAJA loops and the size of PMC_LxREPT.EEF continuously increases (multiple TB) #55
Comments
Hi, could you try removing this product from the list of products to be processed and then processing the next one?
Dear Olivier, thanks, that helps. I hacked together a test case, skipped that product, and processing continued correctly. Actually, I think I also had that error last year. I simply ignored it because I did not necessarily need those products. But last year MAJA did not loop and fill the hard disk.
This is one of the recent errors: The bad news is that, in case of an error, MAJA still loops and fills the HDD with TBs of PMC_LxREPT.EEF. Any ideas how to resolve this issue, or at least how to detect the bad products before MAJA hangs?
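One way to at least contain the disk usage is to run MAJA under a small watchdog that polls the size of PMC_LxREPT.EEF and terminates the process once a limit is exceeded. The following is a minimal Python sketch, not part of MAJA: `run_with_size_limit` and its parameters are hypothetical names, and the launch command and the location of the report file have to be filled in for your own setup.

```python
import subprocess
import time
from pathlib import Path


def run_with_size_limit(cmd, watched_file, max_bytes, poll_seconds=5.0):
    """Run cmd and terminate it if watched_file grows beyond max_bytes.

    Hypothetical helper: cmd is the argument list used to launch the
    process, watched_file is the path of the report file to monitor.
    Returns (returncode, killed).
    """
    watched = Path(watched_file)
    proc = subprocess.Popen(cmd)
    killed = False
    while proc.poll() is None:
        try:
            size = watched.stat().st_size
        except FileNotFoundError:
            size = 0  # the file has not been created yet
        if size > max_bytes:
            proc.terminate()  # ask politely first
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()  # force the process down if it ignores SIGTERM
                proc.wait()
            killed = True
            break
        time.sleep(poll_seconds)
    return proc.returncode, killed
```

A call such as `run_with_size_limit(maja_cmd, eef_path, max_bytes=10 * 1024**3)` would stop the run at roughly 10 GB, where `maja_cmd` and `eef_path` stand for your own command list and report path. This does not fix the loop itself; it only keeps a looping run from filling the disk.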
Hi, As Olivier said, we encountered this case some time ago, but it has not resurfaced since then. The fact that the logfile explodes to multiple TB is especially problematic. I will see with the devs what needs to be modified inside Maja in order to counter at least this latter behavior. Not sure whether we will be able to output the product, though. Where did you download your L1C data from - PEPS/SciHub? Kind regards,
The products are from SciHub. Yesterday I updated to MAJA version 3.3.2 (from 3.3.0). Have there been any changes that might influence the problem? Many thanks for the help!
I am not sure if v3.3.2 will solve it, but it's definitely a good shot. I will be using the most recent version, v3.3.5, in order to recreate the error on our platform. So far I have all the info that I need. Should I need something else, I will contact you here. In the meantime, I would be happy to hear about your results with the more recent version! Peter
O.k., fine, I will drop a note in case I have news. 3.3.5 is not officially released, is it? At least there is no TM version, which is the newer format, as I read. Are there any changes from 3.3.2 to 3.3.5 that might help?
I just realized that 3.3.5 [TM] was still masked in the list of downloads... It should appear now. Thank you for reminding me. Peter
O.k., fine, that means there is no need to update immediately. If I can help, or you need input, just drop a note.
Hi Johann, I ran multiple tests with the same inputs during the night, and in none of the cases did Maja loop indefinitely. Maja does fail with the error message described, but that is - as mentioned earlier - out of our reach for now. Can you please tell me what OS you're running on and under which circumstances you call Maja? Kind regards,
Hi Peter, first of all, thanks for the tests. We are using: I also thought that the loop might come from our download script, but it seems that the Maja process is not interrupted or restarted, and there is no command-line output. That's why I thought the loop is within Maja. I no longer have the PMC_LxREPT.EEF file; I deleted it to free disk space. But I will try to reproduce it with a broken product. It will take a little time, because I think it makes sense to start from scratch and do the backward initialization before the broken product. Thanks!
Dear Peter, I could reproduce the behavior on our machine. MAJA produces a 3.5 TB PMC_LxREPT.EEF file. I tested it with the product: S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE top output: the log file: the first part of the PMC_LxREPT.EEF file: Thanks!
Hi Peter, some more news. If you have a look at the first 20 MB of the file, only the first 2000 lines are XML; the remaining MB contain nothing but blanks:
Hi Johann, I tried multiple systems this weekend and I cannot reproduce this problem on my end. Neither CentOS 6 nor CentOS 7 (the same 7.6.1810 you have) shows the same behavior. Do you run start_maja with sudo rights or in a docker container? Kind regards,
Dear Peter, first of all, thanks for your investigations. Our system is not in a docker container but in a jail root (chroot) on a Debian host. I tried to debug it to find out where (and maybe in which third-party lib) it loops. For this, I manually prepared a test setup with preprocessed products so that only the broken one is processed. But my test failed: the loop did not happen, and MAJA correctly threw an exception. It seems that the loop is an edge case, a combination of environment and libs, ... I will try the debugging experiment from scratch and drop a note. Bests!
Dear Peter, I found another strange behavior:
the next 4 products had log files with zero size (no content). Can that happen? Do you have an idea about the reason? Thanks!
Hi Johann, While the log is empty, what about the L2A products themselves? Were they empty as well? I have never encountered any of the behaviors you are showing me above - I have a slight suspicion that, as you mentioned, your environment might be the cause. Peter
Hi Peter, sorry for the inconvenience. Reproducing the behavior with a debugger also failed. It seems to be some nasty edge case. At least I now know about the broken products and can handle them properly. I have to do a cleanup anyway. I will install a fresh system, and hopefully the loop does not come up again. Normally I use Ubuntu, but at the time I installed the system, MAJA did not work out of the box on Ubuntu, so I switched to CentOS. Which is your preferred system? What do you normally use?
Maja should be working fine with both CentOS >6 and Ubuntu >16.04, which is what we use internally for testing. If you have any problems with Ubuntu, then I recommend opening a new issue related to the error.
Dear Peter, we made a step forward: we could reproduce the loop and found the point where it loops. It seems to be a multi-threading problem. Maja tries to write to the log file PMC_LxREPT.EEF from multiple threads using pugixml v1.5, which seems not to be thread-safe according to the documentation. Could that be the case? But let's go one step back. My colleague Chris attached a debugger to Maja when it was looping and got the following backtrace:
pugi::impl::(anonymous namespace)::node_output loops continuously, which is from pugixml v1.5: https://github.com/zeux/pugixml/blob/v1.5/src/pugixml.cpp, right? It loops when it tries to write a "Message" node, an xml_node_struct, which is defined as:
The node at which it loops is at [0x47ee9a60] with name = 'Message'; its first_child is [0x47ee9ab0] with name = 'Date_Time'. It seems that the log message is written from libMajaDataCommon vns::ReportLogger::BuildDedicatedFormattedEntry. That's where we ended up, because there were no more debug symbols. But because writing XML seems to be straightforward, we think it is a multi-threading problem that occurs when multiple threads try to write to the XML DOM. According to the documentation, it seems writing with pugixml is not thread-safe:
In our case, multiple active threads loop in pugi::impl::(anonymous namespace)::node_output, which is one more hint towards a multi-threading problem. Does that make sense? Is Maja multi-threaded when it writes the error to the log file? Best regards!
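To illustrate the kind of synchronization that a shared XML DOM needs when several threads append log entries, here is a small Python analogy. It is neither Maja's C++ code nor the pugixml API: `LockedXmlLog`, the `Log` root element, and the `Text` element are hypothetical, and only the `Message` and `Date_Time` names are taken from the debugger output described above. The point is just the pattern: every modification and the final serialization go through the same lock.

```python
import threading
import xml.etree.ElementTree as ET


class LockedXmlLog:
    """Hypothetical logger that serializes all access to one XML tree."""

    def __init__(self):
        self._root = ET.Element("Log")  # assumed root element name
        self._lock = threading.Lock()

    def add_message(self, date_time, text):
        # Only one thread at a time may modify the tree.
        with self._lock:
            msg = ET.SubElement(self._root, "Message")
            ET.SubElement(msg, "Date_Time").text = date_time
            ET.SubElement(msg, "Text").text = text

    def to_bytes(self):
        # Serialization walks the tree, so it takes the same lock.
        with self._lock:
            return ET.tostring(self._root)
```

In a C++ code base the equivalent would be a mutex held around both the code that builds nodes and the code that saves the document, so that the writer never traverses a tree that another thread is changing.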
Hi Johann, First of all sorry for the late answer, here are my findings from the last weeks:
We will watch for this behavior in Maja4, hoping that it won't show up again. Keep an eye on the project page to check when it is available. You should be able to upgrade without many changes, as the interfaces will stay the same (Params, GIPP, input folder etc.). I hope that helped you! Kind regards,
Hi Peter, sounds good. Meanwhile, I switched from CentOS to Ubuntu 20.04 and Python 3 without any trouble. I also improved my Python script to avoid retrying broken products multiple times. That seems to have minimized the chance of looping. At least it has not shown up for a while. I also integrated CAMS data, based on the provided download scripts. These scripts still seem to be in Python 2.7 style (print and one or two other issues). Should I prepare pull requests, or is someone else already working on it? Thanks for all the debugging sessions and discussion :-)
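The script itself is not shown in this thread, but the idea of not retrying broken products can be sketched as follows. This is a minimal example under assumptions: `FailedProductRegistry`, `products_to_process`, and the JSON file used for persistence are hypothetical names and choices, not part of MAJA or of the script mentioned above.

```python
import json
from pathlib import Path


class FailedProductRegistry:
    """Hypothetical persistent list of products that already failed."""

    def __init__(self, path):
        self._path = Path(path)
        if self._path.exists():
            self._failed = set(json.loads(self._path.read_text()))
        else:
            self._failed = set()

    def is_failed(self, product_name):
        return product_name in self._failed

    def mark_failed(self, product_name):
        # Persist immediately so the information survives a crash or kill.
        self._failed.add(product_name)
        self._path.write_text(json.dumps(sorted(self._failed), indent=2))


def products_to_process(all_products, registry):
    """Return the products that are not known to be broken, in order."""
    return [p for p in all_products if not registry.is_failed(p)]
```

After a failed run, the driver script would call `mark_failed` with the name of the offending .SAFE product, and every later run would filter its input list through `products_to_process` before starting MAJA.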
Good to hear that it seems to be working now, and even better with CAMS! Updating the remaining scripts to Python 3 is on our list until release, so we should be covered. Thank you for your help and the investigation!
For two tiles (33UYQ, 32TQT), which were correctly processed before, MAJA fails now. It does not proceed, it loops and the size of PMC_LxREPT.EEF continuously increases. Other tiles are still processed correctly.
MAJA Version 3.3.0
Error message: vns::Business::ERROR: ComputeScatteringCorrectionImageFilter(0x20cf500): For band id '3' (nb channel='4') ' no miniLUT has been generated for this angle zone (detector) : '4'. Angles zones are [0,5,6,7,8,9,10,] !!!,Check coherency between metadata and input zone mask source. [vnsComputeScatteringCorrectionImageFilter.txx:ThreadedGenerateData:238]
Last correctly processed product: S2A_MSIL1C_20200310T094031_N0209_R036_T33UYQ_20200310T101526.SAFE
Product where MAJA fails: S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE
and does not further continue.
Log-file for the product where MAJA fails:
S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE.log
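To flag affected products automatically, a driver script could scan the MAJA log for the error message quoted above. The sketch below is an assumption-laden example: `find_minilut_error` is a hypothetical helper, the pattern is derived only from the single message shown in this report, and the wording or spacing may differ in other MAJA versions. It can only detect the problem after the error has been logged, not before MAJA starts.

```python
import re

# Pattern built from the one error message quoted in this issue.
MINILUT_ERROR = re.compile(
    r"For band id '(?P<band>\d+)'.*?"
    r"no miniLUT has been generated for this angle zone \(detector\) : "
    r"'(?P<detector>\d+)'\.\s*Angles zones are \[(?P<zones>[\d,\s]*)\]"
)


def find_minilut_error(log_text):
    """Return band, detector and known zones if the error occurs, else None."""
    match = MINILUT_ERROR.search(log_text)
    if match is None:
        return None
    # The zone list in the log ends with a trailing comma, so skip empty items.
    zones = [int(z) for z in match.group("zones").split(",") if z.strip()]
    return {
        "band": int(match.group("band")),
        "detector": int(match.group("detector")),
        "zones": zones,
    }
```

For the message in this report, the helper reports band 3, detector 4, and the zones 0, 5, 6, 7, 8, 9 and 10, which makes the mismatch between the requested detector and the available angle zones directly visible.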