-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add process to generate bulk metadata file for each project #84
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I have a couple of suggestions about tracking when processing was done and combining the combining steps.
As we discussed, I think we should implement skipping already completed mapping, but that should be a separate issue/PR.
modules/bulk-salmon.nf
Outdated
@@ -70,6 +70,32 @@ process group_tximport { | |||
""" | |||
} | |||
|
|||
process bulk_metadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this now, I think it probably makes the most sense to just merge this with the group_tximport
process above. They are using mostly the same inputs, shoudl run really fast, and are putting their outputs in the same place, so it would probably be most efficient to do both in a single process/machine. I might name the outputs so that the final emitted values can more easily be accessed.
bin/generate_bulk_metadata.R
Outdated
salmon_version = cmd_info$salmon_version, | ||
mapping_index = cmd_info$index, | ||
total_reads = meta_info$num_processed, | ||
mapped_reads = meta_info$num_mapped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my suggestion for getting the processing date from the salmon output. At least as of the last thing I looked at, salmon used a really odd format for dates (why?) so here I am parsing that explicitly. I kind of hate it. I'm open to other ideas.
mapped_reads = meta_info$num_mapped | |
mapped_reads = meta_info$num_mapped, | |
# salmon outputs an odd format for dates: eg. Thu Aug 19 21:05:14 2021 | |
date_processed = lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y") |> | |
lubridate::format_ISO8601(usetz = TRUE) |
One idea would be to pull this logic out a bit, and add a fallback, something like this:
date_processed <- lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y")
# if meta_info is not recorded or the format has changed, use the modification time of the file
if is.na(date_processed){
date_processed <- file.info(meta_info_file)$mtime
}
and then use just the following in the data frame creation:
date_processed = lubridate::format_ISO8601(date_processed, usetz = TRUE)
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Thanks for taking a look at this! I went ahead and merged the two processes that create the bulk counts file and the bulk metadata file together. Let me know if you are okay with the new process name of I also addressed the concern about the date by grabbing it from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I had a few minor comments, and a suggestion on naming, since you asked about it!
modules/bulk-salmon.nf
Outdated
@@ -49,16 +49,20 @@ process salmon{ | |||
|
|||
} | |||
|
|||
process group_tximport { | |||
process group_output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you asked about this, how about merge_bulk_quants
or something like that? group_output
seems too generic to me, and the grouping is happening in the workflow, whereas this is taking groups and merging them.
modules/bulk-salmon.nf
Outdated
bulk_counts = group_output.out.bulk_counts | ||
bulk_metadata = group_output.out.bulk_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required? I feel like if we just output group_output.out
(or whatever you name the process) it would do the same thing? Or we could go back to outputting just the one part. I don't think it really matters, as we don't use it downstream, but this seems redundant.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
This PR adds a process to the bulk workflow to generate one metadata file containing information about each bulk library that has been processed in that project. This results in one
bulk_quant.tsv
and one matchingbulk_metadata.tsv
where the rows ofbulk_metadata.tsv
contain the library ID's that are present in the columns ofbulk_quant.tsv
.To do this, I used the same grouped salmon channel that we group by project ID after performing mapping with salmon. This contains the project ID and all associated directories to salmon outputs for that project. I also am inputting the metadata file that contains the metadata information for each library like seq unit and technology.
Inside the process I am running an Rscript that reads in the library metadata file and list of salmon directories, subsets the library metadata file to only those library ids that have been processed and are part of this project and selects the metadata columns that we would like to retain in the output. I then followed a similar format to the other metadata files that we have for single-cell libraries and added columns containing the
date_processed
,geneome_assembly
,workflow
,workflow_version
, andworkflow_commit
.Finally, for each salmon output directory, I am grabbing the processing information from the json files in that directory that contain the
salmon_version
,index
, and then number of reads processed and mapped.I then write out the file as a
tsv
file as was discussed in our multi-team planning meeting. I have attached an example of what this file would look like.SCPCP000001_bulk_metadata.tsv.zip
An alternative to this, would be creating individual
json
files for each library that is included in bulk, but was unsure we wanted to have that many individual metadata files when we only have one output file for bulk.