Add process to generate bulk metadata file for each project #84

Merged
allyhawkins merged 8 commits into main from allyhawkins/bulk-metadata on Feb 4, 2022

Conversation

allyhawkins (Member)

This PR adds a process to the bulk workflow that generates one metadata file containing information about each bulk library that has been processed in a project. The result is one bulk_quant.tsv and one matching bulk_metadata.tsv per project, where the rows of bulk_metadata.tsv correspond to the library IDs present in the columns of bulk_quant.tsv.

To do this, I reused the grouped salmon channel that we group by project ID after performing mapping with salmon. It contains the project ID and all of the associated salmon output directories for that project. I am also inputting the library metadata file, which contains information for each library such as seq unit and technology.

Inside the process I am running an Rscript that reads in the library metadata file and the list of salmon directories, subsets the library metadata to only those library IDs that have been processed and are part of this project, and selects the metadata columns that we would like to retain in the output. I then followed a format similar to the other metadata files we have for single-cell libraries and added columns containing the date_processed, genome_assembly, workflow, workflow_version, and workflow_commit.
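
Roughly, that subsetting and column selection could look something like the sketch below (this is an illustration only, not the contents of bin/generate_bulk_metadata.R; the paths, column names, and workflow values are hypothetical):

library(dplyr)
library(readr)

# hypothetical inputs: the library metadata file and the salmon output directories
library_metadata_file <- "library_metadata.tsv"                 # hypothetical path
salmon_dirs <- list.dirs("salmon_results", recursive = FALSE)   # hypothetical path
processed_ids <- basename(salmon_dirs)  # assumes directories are named by library ID

bulk_metadata <- read_tsv(library_metadata_file) |>
  # keep only libraries that were processed and belong to this project
  filter(scpca_library_id %in% processed_ids) |>  # assumed column name
  # retain only the metadata columns we want in the output (assumed names)
  select(scpca_project_id, scpca_sample_id, scpca_library_id, seq_unit, technology) |>
  # add processing information, following the single-cell metadata format
  mutate(
    date_processed = lubridate::format_ISO8601(lubridate::now(), usetz = TRUE),
    genome_assembly = "Homo_sapiens.GRCh38.104",  # hypothetical value
    workflow = "scpca-nf",                        # hypothetical value
    workflow_version = "v0.2.0",                  # hypothetical value
    workflow_commit = "abc1234"                   # hypothetical value
  )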

Finally, for each salmon output directory, I am grabbing the processing information from the json files in that directory, which contain the salmon_version, index, and the number of reads processed and mapped.
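
Continuing the sketch above, pulling those values out of each salmon output directory could look roughly like this (cmd_info.json and aux_info/meta_info.json are standard salmon outputs; everything else here is illustrative):

library(jsonlite)
library(purrr)
library(dplyr)
library(tibble)

salmon_dirs <- list.dirs("salmon_results", recursive = FALSE)  # hypothetical path

salmon_info <- salmon_dirs |>
  map_dfr(\(salmon_dir) {
    cmd_info <- read_json(file.path(salmon_dir, "cmd_info.json"))
    meta_info <- read_json(file.path(salmon_dir, "aux_info", "meta_info.json"))
    tibble(
      scpca_library_id = basename(salmon_dir),  # assumes directories are named by library ID
      salmon_version = cmd_info$salmon_version,
      mapping_index = cmd_info$index,
      total_reads = meta_info$num_processed,
      mapped_reads = meta_info$num_mapped
    )
  })

# join onto the subset library metadata from the previous sketch and write the final table
bulk_metadata <- left_join(bulk_metadata, salmon_info, by = "scpca_library_id")
readr::write_tsv(bulk_metadata, "SCPCP000001_bulk_metadata.tsv")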

I then write out the file as a tsv, as was discussed in our multi-team planning meeting. I have attached an example of what this file would look like.
SCPCP000001_bulk_metadata.tsv.zip

An alternative would be creating individual json files for each library included in bulk, but I was unsure whether we wanted that many individual metadata files when we only have one output file for bulk.
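
For comparison only, the per-library JSON alternative might have looked something like this (not what this PR does; directory and file names are hypothetical, and it reuses the bulk_metadata table from the sketches above):

library(jsonlite)
library(purrr)

out_dir <- "bulk_metadata_json"  # hypothetical output directory
dir.create(out_dir, showWarnings = FALSE)

# write one small metadata JSON per bulk library instead of a single TSV
bulk_metadata |>
  split(bulk_metadata$scpca_library_id) |>
  iwalk(\(library_row, library_id) {
    write_json(
      as.list(library_row),
      path = file.path(out_dir, paste0(library_id, "_metadata.json")),
      auto_unbox = TRUE
    )
  })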

@jashapiro (Member) left a comment


Looks good! I have a couple of suggestions about tracking when processing was done and about combining the two combining steps into one process.

As we discussed, I think we should implement skipping already completed mapping, but that should be a separate issue/PR.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
@@ -70,6 +70,32 @@ process group_tximport {
"""
}

process bulk_metadata {
@jashapiro (Member)

Looking at this now, I think it probably makes the most sense to just merge this with the group_tximport process above. They use mostly the same inputs, should run really fast, and put their outputs in the same place, so it would probably be most efficient to do both in a single process/machine. I might also name the outputs so that the final emitted values can be accessed more easily.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
salmon_version = cmd_info$salmon_version,
mapping_index = cmd_info$index,
total_reads = meta_info$num_processed,
mapped_reads = meta_info$num_mapped
@jashapiro (Member)

Here is my suggestion for getting the processing date from the salmon output. At least as of the last version I looked at, salmon used a really odd format for dates (why?), so here I am parsing that explicitly. I kind of hate it. I'm open to other ideas.

Suggested change
-mapped_reads = meta_info$num_mapped
+mapped_reads = meta_info$num_mapped,
+# salmon outputs an odd format for dates: eg. Thu Aug 19 21:05:14 2021
+date_processed = lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y") |>
+  lubridate::format_ISO8601(usetz = TRUE)

One idea would be to pull this logic out a bit, and add a fallback, something like this:

date_processed <- lubridate::as_datetime(meta_info$end_time, format = "%a %b %d %T %Y")
# if end_time is not recorded or the format has changed, use the modification time of the file
if (is.na(date_processed)) {
  date_processed <- file.info(meta_info_file)$mtime
}

and then use just the following in the data frame creation:

  date_processed = lubridate::format_ISO8601(date_processed, usetz = TRUE)
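
For reference, the fallback version can be run end to end on the example date above (a sketch; meta_info_file is a hypothetical path and is only consulted if parsing fails):

library(lubridate)

# example of salmon's odd date format, taken from meta_info.json's end_time field
end_time <- "Thu Aug 19 21:05:14 2021"
meta_info_file <- "aux_info/meta_info.json"  # hypothetical path for the fallback

date_processed <- as_datetime(end_time, format = "%a %b %d %T %Y")
if (is.na(date_processed)) {
  # if the format ever changes or end_time is missing, use the file modification time
  date_processed <- file.info(meta_info_file)$mtime
}

# render as an ISO 8601 string for the metadata table
format_ISO8601(date_processed, usetz = TRUE)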

bin/generate_bulk_metadata.R (outdated review thread, resolved)
@allyhawkins (Member, Author)

Thanks for taking a look at this! I went ahead and merged the two processes that create the bulk counts file and the bulk metadata file. Let me know if you are okay with the new process name, group_output. Another option I had thought of was group_counts_metadata.

I also addressed the concern about the date by grabbing it from the meta_info.json file from the salmon output. I tested the changes and everything worked as expected.

@jashapiro (Member) left a comment


This looks good to me. I had a few minor comments, and a suggestion on naming, since you asked about it!

@@ -49,16 +49,20 @@ process salmon{

}

-process group_tximport {
+process group_output {
@jashapiro (Member)

Since you asked about this, how about merge_bulk_quants or something like that? group_output seems too generic to me, and the grouping is happening in the workflow, whereas this is taking groups and merging them.

bin/generate_bulk_metadata.R (outdated review thread, resolved)
Comment on lines 109 to 110
bulk_counts = group_output.out.bulk_counts
bulk_metadata = group_output.out.bulk_metadata
@jashapiro (Member) commented on Feb 4, 2022

Is this required? I feel like if we just output group_output.out (or whatever you name the process) it would do the same thing? Or we could go back to outputting just the one part. I don't think it really matters, as we don't use it downstream, but this seems redundant.

allyhawkins and others added 2 commits February 4, 2022 11:41
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
allyhawkins merged commit 7db4985 into main on Feb 4, 2022
jashapiro mentioned this pull request on Feb 4, 2022
allyhawkins deleted the allyhawkins/bulk-metadata branch on March 23, 2023 at 19:20