Avoid runtime error during merge, no headers in table #113

smoe · 2021-10-22T21:01:18Z

This is the PR to address what likely is an incompatibility with recent versions of R as found the hard way in Issue #111 .

For unique column names during merge, the ".x" or ".y" suffix is added but this is not sufficient for repeated merges when one side alread has a column name that ends with ".x" or ".y". This patch adds unique column names to avoid such clashes. Since headers are not informative, neither with this patch nor before when these were just variants of "V2", the non-sensical headers should not be part of the table as it is written into a file. Also, the tools downstream cannot read the table with the default export format that lists headers but drops the first field where there are row.names.

rekado · 2021-10-23T18:32:47Z

Thank you for the patch! I wonder: could we add a test to ensure we don't run into similar issues in the future?

borauyar · 2021-11-04T12:47:52Z

@smoe I think I wasn't careful when reviewing this PR. The merging procedure has to assume that there aren't duplicate column names. The changes you propose actually breaks the pipeline :) The count files for each sample has to consist of two columns: 'V1', and unique sample name. The sample names must be unique for each count file, so when merging multiple data.frames by V1, it should create a table with unique column names that correspond to the list of samples found in the sample sheet. This should be compatible with the colData file which is important because the colData file and the merged counts table are used for downstream analysis. This means the counts file must contain the header, which should consist of sample names from the sample sheet.

In your proposed solution, we lose all sample names, so it breaks the pipeline. Could you send me a use case so that I can look into the problem? I would need the sample sheet. Also, it is important to see how the individual count files look in your pipeline output. Could you show some examples e.g. from mapped_reads/hisat2/<sampe name>.read_counts.tsv?

smoe · 2021-11-04T13:49:31Z

@borauyar I addressed two problems in one with this patch. The first was that there were already not sample names for the input of merge and hence the mapping of too many v1.x and v1.y onto each other. Maybe that is already a bug that surfaced on my end for some reason. And yes, this is also a problem of the merge routine which should not rename onto something that is already existing but instead come up with v1.y.y or v1.y2 or whatever.

Or is this about the write.table after the merge? All the column names are non-informative but the order prevails. The default write.table had the v1... written as headers but left no empty header over the row.names which then irritated the downstream routine about the first row having fewer columns than the others.

My hunch was that some default attributes have changed with the advent of R 4.1 and that this was why there are differences. But for that write.table ... I would admittedly have thought that my fix would be R-version independent and was wondering a bit that it just all worked with the new guix installation. And I am surprised even more since the guix install pigx-rnaseq apparently also works with a 4.1 version of R.

Puzzled.

borauyar · 2021-11-04T14:34:07Z

The files that are merged should look like the following:

When you read each file using data.table::fread it should be a table of 2 columns, where first column name will be V1 and the second column name will the name of the sample, for which the counting was executed.

File -1

,UHR_Rep1
ENSG00000056487,15
ENSG00000100347,0
ENSG00000138944,158
ENSG00000138964,17

File -2

,UHR_Rep2
ENSG00000056487,16
ENSG00000100347,0
ENSG00000138944,107
ENSG00000138964,18

This code below will read each file introducing a new column V1 because it is a data.table and keep the second column that has the sample name.

counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})

and the merging code will merge by V1 as it is supposed to be common for all the tables read in.

If the per sample count files don't look like above examples, then we could look for the error where the sample-specific count files are saved.

These are for hisat2/star based counts only. If you have issues for Salmon-based counts,
then we need to look into the script counts_matrix_from_SALMON.R where all transcript/gene level files are read and combined using sample names extracted from colData file.

borauyar · 2021-11-04T14:38:47Z

The issue could also be about which library's merge function is used at run time. To avoid this ambiguity, we could replace merge with data.table::merge.data.table in:

counts_all <- as.data.frame(Reduce(function(dtf1, dtf2)
  data.table::merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE),
       counts))

smoe · 2021-12-15T22:41:48Z

@borauyar , @rekado , I am afraid we need to revisit this one. The problem is that the column names are not set when the data files are read:

> counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})
> length(counts)
[1] 24
> names(counts)
NULL
> head(counts[[1]])
                   V1     V2
1:                    158320
2: ENSOCUG00000000004   2259
3: ENSOCUG00000000005   1919
4: ENSOCUG00000000006   1081
5: ENSOCUG00000000007     36
6: ENSOCUG00000000008      9

and the reason is that the name of that sample is a number that the defeault header="auto" does not recognize as a header:

$ head .../output/mapped_reads/hisat2/158368.read_counts.csv
,158368
ENSOCUG00000000004,2464
ENSOCUG00000000005,1517
ENSOCUG00000000006,382
ENSOCUG00000000007,13
ENSOCUG00000000008,1
ENSOCUG00000000009,6327
ENSOCUG00000000010,6505
ENSOCUG00000000012,6659
ENSOCUG00000000013,294

I do not have an exact overview from where this routine is called, but since the merge fails if headers were not read, I find the current implementation to miss being explicit here. Please:

counts <- lapply(count_files, function(f) {
-  data.table::fread(f)
+  data.table::fread(f,header=TRUE)
})

Above change fixed it for me.

Sidenote: These numerical sample names were not my idea - the sequencing center came up with them.

borauyar merged commit 87a6379 into BIMSBbioinfo:master Oct 26, 2021

borauyar mentioned this pull request Oct 26, 2021

Spreadsheet-generated sample_sheet not properly importable - early warning? #111

Closed

borauyar added a commit that referenced this pull request Nov 4, 2021

revert changes from #113

e38e365

smoe mentioned this pull request Dec 14, 2021

counts from salmon / tximport crash when adding new samples to project #81

Closed

smoe mentioned this pull request Dec 16, 2021

scripts/collate: header is always needed, not 'auto' #124

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid runtime error during merge, no headers in table #113

Avoid runtime error during merge, no headers in table #113

smoe commented Oct 22, 2021

rekado commented Oct 23, 2021

borauyar commented Nov 4, 2021 •

edited

Loading

smoe commented Nov 4, 2021

borauyar commented Nov 4, 2021

borauyar commented Nov 4, 2021

smoe commented Dec 15, 2021 •

edited

Loading

Avoid runtime error during merge, no headers in table #113

Avoid runtime error during merge, no headers in table #113

Conversation

smoe commented Oct 22, 2021

rekado commented Oct 23, 2021

borauyar commented Nov 4, 2021 • edited Loading

smoe commented Nov 4, 2021

borauyar commented Nov 4, 2021

borauyar commented Nov 4, 2021

smoe commented Dec 15, 2021 • edited Loading

borauyar commented Nov 4, 2021 •

edited

Loading

smoe commented Dec 15, 2021 •

edited

Loading