Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong filename in output due to name conflict #4101

Closed
gmloose opened this issue May 17, 2022 · 11 comments
Closed

Wrong filename in output due to name conflict #4101

gmloose opened this issue May 17, 2022 · 11 comments

Comments

@gmloose
Copy link
Contributor

gmloose commented May 17, 2022

As discussed on Gitter with Michael Crusoe.

I have an issue with some output files having a wrong name, which only occurs on some machines. One of the outputs has a trailing _2 in its filename. I traced it down to filename conflict resolution in toil/cwl/cwltoil.py around line 860. However, I have no clue what triggers this name conflict, and I also don't understand why this results in my output files being renamed to something unexpected. I'm using Toil 5.6.0 and cwltool 3.1.20211107152837.

Attached is an example of the generated output JSON file. According to MC the problem is in lines 972 & 1969. There's a false positive for a name conflict. Yes, the files have the same name, but they are in different directories
predict.json.gz
.

┆Issue is synchronized with this Jira Story
┆friendlyId: TOIL-1170

@adamnovak
Copy link
Member

The workaround for this should be to use the CWL workflow's output JSON, and get the File objects from the workflow you are running, and use those to find the actual data. We try to preserve the original basenames, but I'm not sure that the CWL spec necessarily requires us to.

But we should also fix this, because we do want to preserve the original basenames.

@gmloose
Copy link
Contributor Author

gmloose commented Jun 2, 2022

Maybe it's good I add a link to a discussion on Gitter that might be related to this issue: https://gitter.im/bd2k-genomics-toil/Lobby?at=628b49284aa6c31dca16791a

@gmloose
Copy link
Contributor Author

gmloose commented Jun 7, 2022

I bisected this problem, and it seems this behavior is caused by the changes in commit 4286352.

@mr-c
Copy link
Contributor

mr-c commented Jun 8, 2022

@gmloose Good find!

@mr-c
Copy link
Contributor

mr-c commented Jun 8, 2022

I'm guessing the issue is at/near 4286352#diff-0de692d823906d36705d7d2c02a0bc99e191abc8b48f258401f9611184f339a2R862

Maybe this pathmapper contains entries from multiple steps, or multiple elements in a scatter? Hence the conflict. Maybe we can narrow the test further to prevent this.

@gmloose
Copy link
Contributor Author

gmloose commented Jun 10, 2022

I tried to understand the logic behind the code in lines 859-870, but I don't get. I'm confused about line 861. Here, the variable new_target is created unconditionally, but we only need it if the condition in line 862 is true, right?. Furthermore, the way the value for new_target is determined seems to assume that the part after the last _ was added by Toil, which is nonsense.

If I understand the intended logic correctly, then the test in line 862 is crucial and the lines 861, 863 and 864 could be removed completely. What's more, I think that line 864 is not only unnecessary, but plain wrong. Furthermore, I don't understand why the counter in line 865 starts at 2 and not at 1, but that is of lesser importance.

I also figured out the probable cause of the incorrectly detected name conflict. In my case there are two different files with extension .f4 and .f4_TSM0, respectively. In line 861 the _TSM0 part is stripped off the name, and therefore the condition in line 864 evaluates to true. What confuses me, though, is that I also have two different files with extensions .f2 and f2_TSM0, but there the renaming is not being triggered.

Lastly, I noticed in my debugging sessions, that line 862 never evaluates to true in my case. So, if that line is indeed the crucial one, then all the detected naming conflicts are false alarms.

@gmloose
Copy link
Contributor Author

gmloose commented Jun 10, 2022

I can make a pull request with the changes I suggested above. But you may want to check first if my assessment of the problem is correct. I already verified that the scatter tests in src/toil/test/cwl/cwlTest.py still pass with the proposed changes.

@gmloose
Copy link
Contributor Author

gmloose commented Jun 10, 2022

Maybe @kannon92 could have a look? He committed the original change.

@gmloose
Copy link
Contributor Author

gmloose commented Jun 10, 2022

I also figured out the probable cause of the incorrectly detected name conflict. In my case there are two different files with extension .f4 and .f4_TSM0, respectively. In line 861 the _TSM0 part is stripped off the name, and therefore the condition in line 864 evaluates to true. What confuses me, though, is that I also have two different files with extensions .f2 and f2_TSM0, but there the renaming is not being triggered.

I think order is important here. If the file with extension .f4_TSM0 is added first and the file with extension .f4 second, then the effect of lines 861 and 864 is to mark the second file as a duplicate. If the order is reversed, nothing bad happens. So that might explain, though I would expect this behavior to be somewhat random, and not deterministic. 🤔

@gmloose
Copy link
Contributor Author

gmloose commented Jun 14, 2022

I created a PR. I'm not sure how to write a test that properly tests this fix, because I haven't been able to create a small example.

gmloose added a commit to gmloose/toil that referenced this issue Jun 15, 2022
Added a test that verifies that the fix in DataBiosphere#4101 avoids incorrect filename conflict resolution.
mr-c pushed a commit that referenced this issue Jun 15, 2022
* Fix wrong filename in output due to name conflict

The logic that determined a potential filename conflict contained a
flaw. This commit fixes that.

* Add test for filename conflict resolution

Added a test that verifies that the fix in #4101 avoids incorrect filename conflict resolution.
gmloose added a commit to darafferty/rapthor that referenced this issue Jun 16, 2022
Exclude Toil version 5.6, because it contains a nasty file renaming bug,
that is (sometimes) triggered in scatter jobs operating on Measurement Sets.
See DataBiosphere/toil#4101 for details.
@mr-c
Copy link
Contributor

mr-c commented Jun 20, 2022

Fixed in #4139

@mr-c mr-c closed this as completed Jun 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants