Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid runtime error during merge, no headers in table #113

Merged

Conversation

smoe
Copy link
Contributor

@smoe smoe commented Oct 22, 2021

This is the PR to address what likely is an incompatibility with recent versions of R as found the hard way in Issue #111 .

For unique column names during merge, the ".x" or ".y" suffix is added
but this is not sufficient for repeated merges when one side alread has
a column name that ends with ".x" or ".y". This patch adds unique
column names to avoid such clashes.

Since headers are not informative, neither with this patch nor before
when these were just variants of "V2", the non-sensical headers should
not be part of the table as it is written into a file. Also, the tools
downstream cannot read the table with the default export format that
lists headers but drops the first field where there are row.names.
@rekado
Copy link
Member

rekado commented Oct 23, 2021

Thank you for the patch! I wonder: could we add a test to ensure we don't run into similar issues in the future?

@borauyar
Copy link
Member

borauyar commented Nov 4, 2021

@smoe I think I wasn't careful when reviewing this PR. The merging procedure has to assume that there aren't duplicate column names. The changes you propose actually breaks the pipeline :) The count files for each sample has to consist of two columns: 'V1', and unique sample name. The sample names must be unique for each count file, so when merging multiple data.frames by V1, it should create a table with unique column names that correspond to the list of samples found in the sample sheet. This should be compatible with the colData file which is important because the colData file and the merged counts table are used for downstream analysis. This means the counts file must contain the header, which should consist of sample names from the sample sheet.

In your proposed solution, we lose all sample names, so it breaks the pipeline. Could you send me a use case so that I can look into the problem? I would need the sample sheet. Also, it is important to see how the individual count files look in your pipeline output. Could you show some examples e.g. from mapped_reads/hisat2/<sampe name>.read_counts.tsv?

borauyar added a commit that referenced this pull request Nov 4, 2021
@smoe
Copy link
Contributor Author

smoe commented Nov 4, 2021

@borauyar I addressed two problems in one with this patch. The first was that there were already not sample names for the input of merge and hence the mapping of too many v1.x and v1.y onto each other. Maybe that is already a bug that surfaced on my end for some reason. And yes, this is also a problem of the merge routine which should not rename onto something that is already existing but instead come up with v1.y.y or v1.y2 or whatever.

Or is this about the write.table after the merge? All the column names are non-informative but the order prevails. The default write.table had the v1... written as headers but left no empty header over the row.names which then irritated the downstream routine about the first row having fewer columns than the others.

My hunch was that some default attributes have changed with the advent of R 4.1 and that this was why there are differences. But for that write.table ... I would admittedly have thought that my fix would be R-version independent and was wondering a bit that it just all worked with the new guix installation. And I am surprised even more since the guix install pigx-rnaseq apparently also works with a 4.1 version of R.

Puzzled.

@borauyar
Copy link
Member

borauyar commented Nov 4, 2021

The files that are merged should look like the following:

When you read each file using data.table::fread it should be a table of 2 columns, where first column name will be V1 and the second column name will the name of the sample, for which the counting was executed.

File -1

,UHR_Rep1
ENSG00000056487,15
ENSG00000100347,0
ENSG00000138944,158
ENSG00000138964,17

File -2

,UHR_Rep2
ENSG00000056487,16
ENSG00000100347,0
ENSG00000138944,107
ENSG00000138964,18

This code below will read each file introducing a new column V1 because it is a data.table and keep the second column that has the sample name.

counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})

and the merging code will merge by V1 as it is supposed to be common for all the tables read in.

If the per sample count files don't look like above examples, then we could look for the error where the sample-specific count files are saved.

These are for hisat2/star based counts only. If you have issues for Salmon-based counts,
then we need to look into the script counts_matrix_from_SALMON.R where all transcript/gene level files are read and combined using sample names extracted from colData file.

@borauyar
Copy link
Member

borauyar commented Nov 4, 2021

The issue could also be about which library's merge function is used at run time. To avoid this ambiguity, we could replace merge with data.table::merge.data.table in:

counts_all <- as.data.frame(Reduce(function(dtf1, dtf2)
  data.table::merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE),
       counts))

@smoe
Copy link
Contributor Author

smoe commented Dec 15, 2021

@borauyar , @rekado , I am afraid we need to revisit this one. The problem is that the column names are not set when the data files are read:

> counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})
> length(counts)
[1] 24
> names(counts)
NULL
> head(counts[[1]])
                   V1     V2
1:                    158320
2: ENSOCUG00000000004   2259
3: ENSOCUG00000000005   1919
4: ENSOCUG00000000006   1081
5: ENSOCUG00000000007     36
6: ENSOCUG00000000008      9

and the reason is that the name of that sample is a number that the defeault header="auto" does not recognize as a header:

$ head .../output/mapped_reads/hisat2/158368.read_counts.csv
,158368
ENSOCUG00000000004,2464
ENSOCUG00000000005,1517
ENSOCUG00000000006,382
ENSOCUG00000000007,13
ENSOCUG00000000008,1
ENSOCUG00000000009,6327
ENSOCUG00000000010,6505
ENSOCUG00000000012,6659
ENSOCUG00000000013,294

I do not have an exact overview from where this routine is called, but since the merge fails if headers were not read, I find the current implementation to miss being explicit here. Please:

counts <- lapply(count_files, function(f) {
-  data.table::fread(f)
+  data.table::fread(f,header=TRUE)
})

Above change fixed it for me.

Sidenote: These numerical sample names were not my idea - the sequencing center came up with them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants