-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid runtime error during merge, no headers in table #113
Avoid runtime error during merge, no headers in table #113
Conversation
For unique column names during merge, the ".x" or ".y" suffix is added but this is not sufficient for repeated merges when one side alread has a column name that ends with ".x" or ".y". This patch adds unique column names to avoid such clashes. Since headers are not informative, neither with this patch nor before when these were just variants of "V2", the non-sensical headers should not be part of the table as it is written into a file. Also, the tools downstream cannot read the table with the default export format that lists headers but drops the first field where there are row.names.
Thank you for the patch! I wonder: could we add a test to ensure we don't run into similar issues in the future? |
@smoe I think I wasn't careful when reviewing this PR. The merging procedure has to assume that there aren't duplicate column names. The changes you propose actually breaks the pipeline :) The count files for each sample has to consist of two columns: 'V1', and unique sample name. The sample names must be unique for each count file, so when merging multiple data.frames by In your proposed solution, we lose all sample names, so it breaks the pipeline. Could you send me a use case so that I can look into the problem? I would need the sample sheet. Also, it is important to see how the individual count files look in your pipeline output. Could you show some examples e.g. from |
@borauyar I addressed two problems in one with this patch. The first was that there were already not sample names for the input of merge and hence the mapping of too many v1.x and v1.y onto each other. Maybe that is already a bug that surfaced on my end for some reason. And yes, this is also a problem of the merge routine which should not rename onto something that is already existing but instead come up with v1.y.y or v1.y2 or whatever. Or is this about the write.table after the merge? All the column names are non-informative but the order prevails. The default write.table had the v1... written as headers but left no empty header over the row.names which then irritated the downstream routine about the first row having fewer columns than the others. My hunch was that some default attributes have changed with the advent of R 4.1 and that this was why there are differences. But for that write.table ... I would admittedly have thought that my fix would be R-version independent and was wondering a bit that it just all worked with the new guix installation. And I am surprised even more since the guix install pigx-rnaseq apparently also works with a 4.1 version of R. Puzzled. |
The files that are merged should look like the following: When you read each file using File -1
File -2
This code below will read each file introducing a new column
and the merging code will merge by If the per sample count files don't look like above examples, then we could look for the error where the sample-specific count files are saved. These are for hisat2/star based counts only. If you have issues for Salmon-based counts, |
The issue could also be about which library's
|
@borauyar , @rekado , I am afraid we need to revisit this one. The problem is that the column names are not set when the data files are read:
and the reason is that the name of that sample is a number that the defeault
I do not have an exact overview from where this routine is called, but since the merge fails if headers were not read, I find the current implementation to miss being explicit here. Please:
Above change fixed it for me. Sidenote: These numerical sample names were not my idea - the sequencing center came up with them. |
This is the PR to address what likely is an incompatibility with recent versions of R as found the hard way in Issue #111 .