New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong filename in output due to name conflict #4101
Comments
The workaround for this should be to use the CWL workflow's output JSON, and get the But we should also fix this, because we do want to preserve the original basenames. |
Maybe it's good I add a link to a discussion on Gitter that might be related to this issue: https://gitter.im/bd2k-genomics-toil/Lobby?at=628b49284aa6c31dca16791a |
I bisected this problem, and it seems this behavior is caused by the changes in commit 4286352. |
@gmloose Good find! |
I'm guessing the issue is at/near 4286352#diff-0de692d823906d36705d7d2c02a0bc99e191abc8b48f258401f9611184f339a2R862 Maybe this pathmapper contains entries from multiple steps, or multiple elements in a scatter? Hence the conflict. Maybe we can narrow the test further to prevent this. |
I tried to understand the logic behind the code in lines 859-870, but I don't get. I'm confused about line 861. Here, the variable If I understand the intended logic correctly, then the test in line 862 is crucial and the lines 861, 863 and 864 could be removed completely. What's more, I think that line 864 is not only unnecessary, but plain wrong. Furthermore, I don't understand why the counter in line 865 starts at 2 and not at 1, but that is of lesser importance. I also figured out the probable cause of the incorrectly detected name conflict. In my case there are two different files with extension Lastly, I noticed in my debugging sessions, that line 862 never evaluates to true in my case. So, if that line is indeed the crucial one, then all the detected naming conflicts are false alarms. |
I can make a pull request with the changes I suggested above. But you may want to check first if my assessment of the problem is correct. I already verified that the scatter tests in |
Maybe @kannon92 could have a look? He committed the original change. |
I think order is important here. If the file with extension |
I created a PR. I'm not sure how to write a test that properly tests this fix, because I haven't been able to create a small example. |
Added a test that verifies that the fix in DataBiosphere#4101 avoids incorrect filename conflict resolution.
* Fix wrong filename in output due to name conflict The logic that determined a potential filename conflict contained a flaw. This commit fixes that. * Add test for filename conflict resolution Added a test that verifies that the fix in #4101 avoids incorrect filename conflict resolution.
Exclude Toil version 5.6, because it contains a nasty file renaming bug, that is (sometimes) triggered in scatter jobs operating on Measurement Sets. See DataBiosphere/toil#4101 for details.
Fixed in #4139 |
As discussed on Gitter with Michael Crusoe.
I have an issue with some output files having a wrong name, which only occurs on some machines. One of the outputs has a trailing
_2
in its filename. I traced it down to filename conflict resolution intoil/cwl/cwltoil.py
around line 860. However, I have no clue what triggers this name conflict, and I also don't understand why this results in my output files being renamed to something unexpected. I'm using Toil 5.6.0 and cwltool 3.1.20211107152837.Attached is an example of the generated output JSON file. According to MC the problem is in lines 972 & 1969. There's a false positive for a name conflict. Yes, the files have the same name, but they are in different directories
predict.json.gz
.
┆Issue is synchronized with this Jira Story
┆friendlyId: TOIL-1170
The text was updated successfully, but these errors were encountered: