Add process to generate bulk metadata file for each project #84

Merged
allyhawkins merged 8 commits into main from allyhawkins/bulk-metadata on Feb 4, 2022

Conversation

allyhawkins (Member)

This PR adds a process to the bulk workflow that generates one metadata file containing information about each bulk library that has been processed in a project. The result is one bulk_quant.tsv and one matching bulk_metadata.tsv per project, where the rows of bulk_metadata.tsv correspond to the library IDs present in the columns of bulk_quant.tsv.

To do this, I reused the grouped salmon channel that we group by project ID after performing mapping with salmon. It contains the project ID and all of the associated salmon output directories for that project. I am also inputting the library metadata file, which contains information for each library such as seq unit and technology.

Inside the process I am running an Rscript that reads in the library metadata file and the list of salmon directories, subsets the library metadata to only those library IDs that have been processed and are part of this project, and selects the metadata columns that we would like to retain in the output. I then followed a format similar to the other metadata files we have for single-cell libraries and added columns containing the date_processed, genome_assembly, workflow, workflow_version, and workflow_commit.
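
Roughly, that subsetting and column selection could look something like the sketch below (this is an illustration only, not the contents of bin/generate_bulk_metadata.R; the paths, column names, and workflow values are hypothetical):

library(dplyr)
library(readr)

# hypothetical inputs: the library metadata file and the salmon output directories
library_metadata_file <- "library_metadata.tsv"                 # hypothetical path
salmon_dirs <- list.dirs("salmon_results", recursive = FALSE)   # hypothetical path
processed_ids <- basename(salmon_dirs)  # assumes directories are named by library ID

bulk_metadata <- read_tsv(library_metadata_file) |>
  # keep only libraries that were processed and belong to this project
  filter(scpca_library_id %in% processed_ids) |>  # assumed column name
  # retain only the metadata columns we want in the output (assumed names)
  select(scpca_project_id, scpca_sample_id, scpca_library_id, seq_unit, technology) |>
  # add processing information, following the single-cell metadata format
  mutate(
    date_processed = lubridate::format_ISO8601(lubridate::now(), usetz = TRUE),
    genome_assembly = "Homo_sapiens.GRCh38.104",  # hypothetical value
    workflow = "scpca-nf",                        # hypothetical value
    workflow_version = "v0.2.0",                  # hypothetical value
    workflow_commit = "abc1234"                   # hypothetical value
  )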

Finally, for each salmon output directory, I am grabbing the processing information from the json files in that directory, which contain the salmon_version, index, and the number of reads processed and mapped.
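
Continuing the sketch above, pulling those values out of each salmon output directory could look roughly like this (cmd_info.json and aux_info/meta_info.json are standard salmon outputs; everything else here is illustrative):

library(jsonlite)
library(purrr)
library(dplyr)
library(tibble)

salmon_dirs <- list.dirs("salmon_results", recursive = FALSE)  # hypothetical path

salmon_info <- salmon_dirs |>
  map_dfr(\(salmon_dir) {
    cmd_info <- read_json(file.path(salmon_dir, "cmd_info.json"))
    meta_info <- read_json(file.path(salmon_dir, "aux_info", "meta_info.json"))
    tibble(
      scpca_library_id = basename(salmon_dir),  # assumes directories are named by library ID
      salmon_version = cmd_info$salmon_version,
      mapping_index = cmd_info$index,
      total_reads = meta_info$num_processed,
      mapped_reads = meta_info$num_mapped
    )
  })

# join onto the subset library metadata from the previous sketch and write the final table
bulk_metadata <- left_join(bulk_metadata, salmon_info, by = "scpca_library_id")
readr::write_tsv(bulk_metadata, "SCPCP000001_bulk_metadata.tsv")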

I then write out the file as a tsv, as was discussed in our multi-team planning meeting. I have attached an example of what this file would look like.
SCPCP000001_bulk_metadata.tsv.zip

An alternative would be creating individual json files for each library included in bulk, but I was unsure whether we wanted that many individual metadata files when we only have one output file for bulk.
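
For comparison only, the per-library JSON alternative might have looked something like this (not what this PR does; directory and file names are hypothetical, and it reuses the bulk_metadata table from the sketches above):

library(jsonlite)
library(purrr)

out_dir <- "bulk_metadata_json"  # hypothetical output directory
dir.create(out_dir, showWarnings = FALSE)

# write one small metadata JSON per bulk library instead of a single TSV
bulk_metadata |>
  split(bulk_metadata$scpca_library_id) |>
  iwalk(\(library_row, library_id) {
    write_json(
      as.list(library_row),
      path = file.path(out_dir, paste0(library_id, "_metadata.json")),
      auto_unbox = TRUE
    )
  })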

@jashapiro (Member) left a comment


Looks good! I have a couple of suggestions about tracking when processing was done and about combining the two combining steps into one process.

As we discussed, I think we should implement skipping already completed mapping, but that should be a separate issue/PR.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
@@ -70,6 +70,32 @@ process group_tximport {
"""
}

process bulk_metadata {
@jashapiro (Member)

Looking at this now, I think it probably makes the most sense to just merge this with the group_tximport process above. They use mostly the same inputs, should run really fast, and put their outputs in the same place, so it would probably be most efficient to do both in a single process/machine. I might also name the outputs so that the final emitted values can be accessed more easily.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
salmon_version = cmd_info$salmon_version,
mapping_index = cmd_info$index,
total_reads = meta_info$num_processed,
mapped_reads = meta_info$num_mapped
@jashapiro (Member)

Here is my suggestion for getting the processing date from the salmon output. At least as of the last version I looked at, salmon used a really odd format for dates (why?), so here I am parsing that explicitly. I kind of hate it. I'm open to other ideas.

Suggested change
-mapped_reads = meta_info$num_mapped
+mapped_reads = meta_info$num_mapped,
+# salmon outputs an odd format for dates: eg. Thu Aug 19 21:05:14 2021
+date_processed = lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y") |>
+  lubridate::format_ISO8601(usetz = TRUE)

One idea would be to pull this logic out a bit, and add a fallback, something like this:

date_processed <- lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y")
# if end_time is not recorded or the format has changed, use the modification time of the file
if (is.na(date_processed)) {
  date_processed <- file.info(meta_info_file)$mtime
}

and then use just the following in the data frame creation:

  date_processed = lubridate::format_ISO8601(date_processed, usetz = TRUE)
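
For reference, the fallback version can be run end to end on the example date above (a sketch; meta_info_file is a hypothetical path and is only consulted if parsing fails):

library(lubridate)

# example of salmon's odd date format, taken from meta_info.json's end_time field
end_time <- "Thu Aug 19 21:05:14 2021"
meta_info_file <- "aux_info/meta_info.json"  # hypothetical path for the fallback

date_processed <- as_datetime(end_time, format = "%a %b %d %T %Y")
if (is.na(date_processed)) {
  # if the format ever changes or end_time is missing, use the file modification time
  date_processed <- file.info(meta_info_file)$mtime
}

# render as an ISO 8601 string for the metadata table
format_ISO8601(date_processed, usetz = TRUE)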

bin/generate_bulk_metadata.R (outdated review thread, resolved)
@allyhawkins (Member, Author)

Thanks for taking a look at this! I went ahead and merged the two processes that create the bulk counts file and the bulk metadata file. Let me know if you are okay with the new process name, group_output. Another option I had thought of was group_counts_metadata.

I also addressed the concern about the date by grabbing it from the meta_info.json file from the salmon output. I tested the changes and everything worked as expected.

@jashapiro (Member) left a comment


This looks good to me. I had a few minor comments, and a suggestion on naming, since you asked about it!

@@ -49,16 +49,20 @@ process salmon{

}

-process group_tximport {
+process group_output {
@jashapiro (Member)

Since you asked about this, how about merge_bulk_quants or something like that? group_output seems too generic to me, and the grouping is happening in the workflow, whereas this is taking groups and merging them.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
Comment on lines 109 to 110
bulk_counts = group_output.out.bulk_counts
bulk_metadata = group_output.out.bulk_metadata
@jashapiro (Member) commented on Feb 4, 2022

Is this required? I feel like if we just output group_output.out (or whatever you name the process) it would do the same thing? Or we could go back to outputting just the one part. I don't think it really matters, as we don't use it downstream, but this seems redundant.

allyhawkins and others added 2 commits February 4, 2022 11:41
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
allyhawkins merged commit 7db4985 into main on Feb 4, 2022
jashapiro mentioned this pull request on Feb 4, 2022
allyhawkins deleted the allyhawkins/bulk-metadata branch on March 23, 2023 at 19:20