Addition of `01-filter-across-types.R` script #54

cbethell · 2019-08-12T20:56:48Z

This PR is the first step of the analysis addressing issue #5

01-filter-across-types.R performs the initial filtering of the dataset.
The script produces a PDF of a barplot representing the sample distribution across cancer types, as well as a TSV with the raw values used to do so.
The script also produces TSV files of the cancer type distribution across primary sites of the brain (where primary_sites.tsv displays a table of all the cancer types expressed within each site and primary_sites_counts.tsv displays the total number of types, along with the highest and second highest expressed types for each unique site).

Questions

Are there any additional data filtering steps that may need to be included for this particular dataset?
Are the results/plots suffice for this particular issue? If not, what other forms of tables or plots would you like to see included? Please keep in mind that the interactive treemap and pie chart are produced in the upcoming script.

Pull Request Check List:

Run a linter
Set the seed (if applicable).
Comments and/or documentation up to date.
Double check your paths.
Check headline numbering.
Spell check any Rmd file or md file.
Restart R and run all notebooks fresh and save.

`01-pbta-analysis-5.R` performs the initial filtering of the dataset. It produces a PDF and TSV of the sample distribution across cancer types, and TSV files of the cancer type distribution across primary sites of the brain.

jharenza

Thanks for this analysis! A few thoughts:

Can you update the code to include the data release V2, which will change the name of the clinical file?
One potentially relevant thing to do for primary site may be to create a new variable for region, where we separate into midbrain and hemispheric. I would use Thalamus|Pons|Spinal Cord|Brain Stem|Midbrain to denote midbrain and the various lobes|cerebellum|hippocampus to be hemispheric, and we can call the remaining as other.

cbethell · 2019-08-13T12:39:07Z

@jharenza

Can you update the code to include the data release V2, which will change the name of the clinical file?

One potentially relevant thing to do for primary site may be to create a new variable for region, where we separate into midbrain and hemispheric. I would use Thalamus|Pons|Spinal Cord|Brain Stem|Midbrain to denote midbrain and the various lobes|cerebellum|hippocampus to be hemispheric, and we can call the remaining as other.

Great, thank you for your suggestions! I can work on both of those today and notify you when I've committed the changes.

cgreene · 2019-08-13T12:49:47Z

@jharenza can you see about addressing #50? Without that file in the md5sum, the download script won't acquire it.

cgreene · 2019-08-13T12:51:12Z

Also, @jharenza : can you add that additional variable to the clinical metadata file? It seems better to have it in one place as opposed to defined separately in many scripts, since that grouping may be used throughout the project.

jaclyn-taroni · 2019-08-13T13:01:04Z

I agree with updating the code to include the data release v2 file which will be called pbta-histologies.tsv once #50 is addressed.

As far as the region point, I agree with @cgreene that region information should be in the histologies file to serve as a single source of truth across the project. Anything that will be used across the project in multiple places should be defined in as few places as possible and as part of the files obtained with bash download-data.sh where possible.

jharenza · 2019-08-13T13:47:56Z

Also, @jharenza : can you add that additional variable to the clinical metadata file? It seems better to have it in one place as opposed to defined separately in many scripts, since that grouping may be used throughout the project.

Yes - I can do this - we can get this updated after our next meeting.

- Updated the code to include the clinical file name which will be included the data release V2. - Added `brain_stem_medulla` and `brain_stem_midbrain` to `primary_sites`(instead of grouping them into `brain_stem`) to provide a more accurate representation of the separate parts of the brain stem.

…nalysis

jharenza · 2019-08-13T16:07:50Z

Just a note - I will add the regions to the clinical file in the next version, since I want to double check my classifications with an MD. The md5sum is now included.

jharenza · 2019-08-13T21:03:43Z

Referring to #57 for the brain region classification. Angela can get to this by Friday.

`01-pbta-analysis-5.R` was renamed `01-filter-across-types.R` to parallel changes made to `sample-distribution-analysis/README.md`

jaclyn-taroni

Thanks @cbethell! I think this script can be improved by making sure all of the references to the packages that need to be there are there (e.g., dplyr::, ggplot2::) and it can probably be simplified. I've left comments about specific simplifications I'd like you to look into. I haven't tested every suggestion I've included, so when you try making those changes you should be sure to do so.

jaclyn-taroni · 2019-08-15T16:00:28Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Create a bar plot of sample distribution across cancer types
+gg_types <- disease_expression %>%
+  ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) +


Suggested change

ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) +

ggplot2::ggplot(ggplot2::aes(x = disease_type_new, y = count, fill = count)) +

otherwise:

Error in aes(x = disease_type_new, y = count, fill = count) : could not find function "aes"

jaclyn-taroni · 2019-08-15T16:05:37Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# data.frame with the location where each cancer type in the dataset is expressed 
+# sorted to show highest expression
+brain_location <- df2 %>%


I think this can be simplified quite a bit to:

brain_location_jt <- df2 %>% dplyr::select(disease_type_new, primary_site) %>% dplyr::group_by(disease_type_new, primary_site) %>% dplyr::tally() %>% dplyr::arrange(desc(n))

> all.equal(brain_location_jt, brain_location) [1] TRUE

jaclyn-taroni · 2019-08-15T16:07:55Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+# data.frame with the count of each unique cancer type expression
+disease_expression <- df2 %>%
+  dplyr::group_by(disease_type_new) %>%
+  dplyr::filter(!is.na(disease_type_new)) %>%


You performed this filtering step at line 47, do you need it here?

jaclyn-taroni · 2019-08-15T16:08:13Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+  dplyr::group_by(disease_type_new, primary_site) %>%
+  dplyr::tally() %>%
+  dplyr::arrange(desc(n)) %>%
+  dplyr::filter(!is.na(disease_type_new))


Same comment about filtering this already as above.

jaclyn-taroni · 2019-08-15T16:09:15Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+disease_expression <- df2 %>%
+  dplyr::group_by(disease_type_new) %>%
+  dplyr::filter(!is.na(disease_type_new)) %>%
+  dplyr::count() %>%


I think you can change this to

Suggested change

dplyr::count() %>%

dplyr::count(name = "count") %>%

And then drop line 55.

jaclyn-taroni · 2019-08-15T16:11:31Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Create a percent variable
+disease_expression <- disease_expression %>%
+  dplyr::mutate(percent = (count / sum_count))


Suggested change

dplyr::mutate(percent = (count / sum_count))

dplyr::mutate(percent = formattable::percent(count / sum_count))

Would allow you to remove lines 65-66. Or I think:

Suggested change

dplyr::mutate(percent = (count / sum_count))

dplyr::mutate(percent = paste0((count / sum_count), "%"))

would be equivalent and negate the need to install formattable. (Untested.)

Out of the above suggestions, this is what worked: dplyr::mutate(percent = paste0(((count / sum_count) * 100).
Would this still be a better solution compared to the inclusion of lines 65-66?

Ah yes good call. Wouldn't you still need to add the "%" or am I misunderstanding the other part of what formattable::percent does?

Ah yes, good catch. I have the "%" in the script but left it out when I copied the line here.

I would say since we will have a monolithic Docker container for this project, any place where we can get rid of dependencies we should try and do that. This seems like something where we don't need a well-used function from a package to do it.

Understood. I'll remove the use of that package and keep this in mind before implementing any other new packages in other scripts.

jaclyn-taroni · 2019-08-15T16:15:31Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+  )
+
+# Write to tsv file
+readr::write_tsv(primary_sites, file.path("results", "primary_sites.tsv"))


How do you intend for people to use this data.frame and TSV?

The idea behind this data.frame was to provide an easy visualization of this particular subset for the brain graphic. However, it isn't completely necessary given that primary_sites_counts specifies the highest and second highest expressed types, along with the counts (all the information that seems to be needed for the brain graphic).

👍 I would take it out in that case. Doing so will facilitate the suggested changes.

jaclyn-taroni · 2019-08-15T16:57:13Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+  dplyr::arrange(primary_site) %>%
+  dplyr::group_by(disease_type_new, primary_site) %>%
+  dplyr::tally() %>%
+  dplyr::arrange(desc(n)) %>%


Does desc need to be dplyr::desc?

Yes, good catch

jaclyn-taroni · 2019-08-15T17:01:05Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+  dplyr::filter(!is.na(disease_type_new))
+
+
+# Use the location_fn to create a data.frame for each unique primary site


I also think this can be simplified. Let's say we change location_fn:

location_fn <- function(location) { disease_type_vector <- brain_location %>% dplyr::arrange(dplyr::desc(n)) %>% dplyr::filter(stringr::str_detect(primary_site, location)) %>% dplyr::pull(disease_type_new) unique(disease_type_vector) }

and we make a vector of the locations (truncated here):

primary_sites_vector <- c( "Basal Ganglia", "Brain Stem- Midbrain", "Brain Stem-Medulla" ) # this names step helps us with melting later names(primary_sites_vector) <- primary_sites_vector

Then I think we can do the following to get primary_sites_counts without the need for the padded data.frame:

# for each string in primary sites vector use location_fn to get the vector of disease types, # it's sorted by cancer_types_list <- lapply(primary_sites_vector, location_fn) # count the disease types for each primary site by taking the length of each element of the list cancer_types_counts <- lapply(cancer_types_list, length) # turn the list of counts into a data.frame and relabel, reorder columns primary_sites_counts <- reshape2::melt(cancer_types_counts, value.name = "number_of_types") %>% dplyr::rename(primary_site = L1) %>% dplyr::select(primary_site, number_of_types) %>% # get the max_type, it is the first element of the cancer_types_list dplyr::mutate(max_type = unlist(lapply(cancer_types_list, function(x) x[1])), # get the second element for second_max_type second_max_type = unlist(lapply(cancer_types_list, function(x) x[2]))) %>% # reorder the rows dplyr::arrange(dplyr::desc(number_of_types))

What is missing here is the style change Basal Ganglia -> basal_ganglia, but I'm not sure you need it. This also would allow you to get rid of the other two custom functions.

jaclyn-taroni · 2019-08-15T18:18:33Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+
+# Load/install packages. This will be removed once a Dockerfile is established.
+source("00-install-packages.R")


I am looking forward to having this removed because if you are installing a new package, it is also being loaded into the namespace which may or may not be what you want to do.

-Added the missing references to the `ggplot2` and `dplyr` packages - Simplified `brain_location` filtering step - Removed unnecessary instances of `dplyr::filter(!is.na(disease_type_new))` - Removed the implementation of the `formattable` package - Simplified the `location_fn` function and removed the `make_padded_data_frame`function - Removed `primary_sites.tsv` as the necessary information can be found in `primary_sites_counts.tsv`

jaclyn-taroni · 2019-08-16T19:40:52Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Create a percent variable
+disease_expression <- disease_expression %>%
+  dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%"))


Now that you've pushed the changes. I wanted to note that the numbers look ever so slightly different to make sure you are aware of that and I think the last thing that function was doing was performing a rounding step, which can likely be accomplished with something like:

Suggested change

dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%"))

dplyr::mutate(percent = paste0((round(count / sum_count, 2) * 100), "%"))

Ah, very true. I will make that change now and check the numbers before committing the change.

- Added the function `round` to round the count values to 4 decimal places (in turn producing percentage values with 2 decimal places) - The changed values are reflected in `disease_expression.tsv` and in `distribution_across_cancer_types.pdf`

- Removed `source(install-packages.R)` and `sessionInfo()` to parallel the changes suggested and made to PR AlexsLemonade#55.

jaclyn-taroni · 2019-08-20T21:26:05Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Function to filter based on primary_site 
+location_fn <- function(location) {
+  disease_type_vector <- brain_location %>%


Can you add more documentation to this function? I tend to favor the old R Google Style Guide recommendations for function documentation. Here's an example in the create-subset-files analysis: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/create-subset-files/01-create_subset_files.Rmd#L47

jaclyn-taroni · 2019-08-21T21:14:00Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+# of key variables within the dataset.
+#
+# Chante Bethell for CCDL 2019
+


Can you add a statement about how this script is intended to be run? An example would be:

#### USAGE This script is intended to be run via the command line from the top directory of the repository: Rscript analyses/sample-distribution-analysis/01-filter-across-types.R

jaclyn-taroni · 2019-08-21T21:17:11Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+  ggplot2::labs(x = "Cancer Types", y = "Count",
+                title = "Sample Distribution Across Cancer Types") +
+  ggplot2::scale_y_continuous(breaks = seq(0, 500, by = 100)) +
+  ggplot2::theme(axis.text.x = element_text(


Suggested change

ggplot2::theme(axis.text.x = element_text(

ggplot2::theme(axis.text.x = ggplot2::element_text(

Based on the test failure

jaclyn-taroni · 2019-08-21T21:17:21Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+    hjust = 1,
+    size = 8
+  ),
+  panel.grid = element_blank()) +


Suggested change

panel.grid = element_blank()) +

panel.grid = ggplot2::element_blank()) +

- Also added a function definition to `01` script

jaclyn-taroni

If you assign your output directories for plots and results to a variable as I've suggested throughout, you will be able to change where all that output goes by making a single change instead multiple changes throughout the script. If you make those changes and the tests pass, this LGTM 👍

jaclyn-taroni · 2019-08-21T22:06:32Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Define the file.path to output directories
+outputDir <- file.path("analyses", "sample-distribution-analysis")
+


If you add this here:

results_dir <- file.path(outputDir, "results") plots_dir <- file.path(outputDir, "plots")

jaclyn-taroni · 2019-08-21T22:06:49Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+outputDir <- file.path("analyses", "sample-distribution-analysis")
+
+# Create directories to hold the output.
+if (!dir.exists(file.path(outputDir, "results"))) {


Suggested change

if (!dir.exists(file.path(outputDir, "results"))) {

if (!dir.exists(results_dir)) {

jaclyn-taroni · 2019-08-21T22:07:11Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Create directories to hold the output.
+if (!dir.exists(file.path(outputDir, "results"))) {
+  dir.create(file.path(outputDir, "results"))


Suggested change

dir.create(file.path(outputDir, "results"))

dir.create(results_dir)

jaclyn-taroni · 2019-08-21T22:07:33Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+if (!dir.exists(file.path(outputDir, "results"))) {
+  dir.create(file.path(outputDir, "results"))
+}
+if (!dir.exists(file.path(outputDir, "plots"))) {


Would change these to plots_dir -- same pattern as results_dir

jaclyn-taroni · 2019-08-21T22:07:50Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Write to tsv file
+readr::write_tsv(disease_expression,
+                 file.path(outputDir, "results",


Suggested change

file.path(outputDir, "results",

file.path(results_dir,

jaclyn-taroni · 2019-08-21T22:08:14Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+# Save plot
+ggplot2::ggsave(
+  gg_types,
+  file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"),


Could then change this to

Suggested change

file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"),

file = file.path(plots_dir, "distribution_across_cancer_types.pdf"),

jaclyn-taroni · 2019-08-21T22:08:33Z

analyses/sample-distribution-analysis/01-filter-across-types.R

+
+# Write to tsv file
+readr::write_tsv(primary_sites_counts,
+                 file.path(outputDir, "results", "primary_sites_counts.tsv"))


Suggested change

file.path(outputDir, "results", "primary_sites_counts.tsv"))

file.path(results_dir, "primary_sites_counts.tsv"))

jaclyn-taroni · 2019-08-21T23:43:18Z

@jharenza the change to the histologies file you requested has been made and the second point you raised is now tracked at #57. If this looks good to you, please feel free to merge.

jharenza · 2019-08-22T12:54:32Z

Just a note - I will add the regions to the clinical file in the next version, since I want to double check my classifications with an MD. The md5sum is now included.

I talked with Angela Waanders and she recommended we only do this for gliomas (and only if regions are not ambiguous - ie tumor is not in both midline and hemispheric regions) and not all samples, so will add that later.

cgreene · 2019-08-22T13:03:41Z

@jharenza: Ok! It also looks like you still have this PR marked as changes requested. Is there anything else that needs to be changed in the code before it can be merged?

jharenza

all good!

cbethell · 2019-08-22T15:50:27Z

@jharenza Thank you for that confirmation!

cgreene · 2019-08-22T15:52:53Z

Going to merge this 👍

Addition of 01-pbta-analysis-5.R script

8684714

`01-pbta-analysis-5.R` performs the initial filtering of the dataset. It produces a PDF and TSV of the sample distribution across cancer types, and TSV files of the cancer type distribution across primary sites of the brain.

cbethell requested a review from jharenza August 12, 2019 21:01

jharenza suggested changes Aug 12, 2019

View reviewed changes

Merge branch 'master' into 01-pbta-analysis

8b47fe6

cbethell added 2 commits August 13, 2019 10:26

Merge remote-tracking branch 'origin/01-pbta-analysis' into 01-pbta-a…

20a3c16

…nalysis

jaclyn-taroni mentioned this pull request Aug 13, 2019

Addition of 00-README.md and 00-install-packages.R to sample-distribution-analysis #52

Merged

7 tasks

Renamed script

e72af61

`01-pbta-analysis-5.R` was renamed `01-filter-across-types.R` to parallel changes made to `sample-distribution-analysis/README.md`

cbethell changed the title ~~Addition of 01-pbta-analysis-5.R script~~ Addition of 01-filter-across-types.R script Aug 14, 2019

jaclyn-taroni self-requested a review August 15, 2019 13:57

jaclyn-taroni suggested changes Aug 15, 2019

View reviewed changes

jaclyn-taroni reviewed Aug 16, 2019

View reviewed changes

jaclyn-taroni mentioned this pull request Aug 19, 2019

Addition of 02-multilayer-pie.R and plots #55

Merged

7 tasks

cbethell and others added 2 commits August 19, 2019 11:41

Removed source script and sessionInfo()

cbb3436

- Removed `source(install-packages.R)` and `sessionInfo()` to parallel the changes suggested and made to PR AlexsLemonade#55.

Merge branch 'master' into 01-pbta-analysis

60e3e04

jaclyn-taroni reviewed Aug 20, 2019

View reviewed changes

cbethell and others added 3 commits August 21, 2019 16:43

Merge branch 'master' into 01-pbta-analysis

cc6982e

Addition of 01-filter-across-types.R to .circleci/config.yml

ea5280d

Edited file.path to pbta-histologies.tsv

35b36ac

jaclyn-taroni reviewed Aug 21, 2019

View reviewed changes

Edit output directories file.path & renamed analysis in config.yml

32e01bd

- Also added a function definition to `01` script

jaclyn-taroni approved these changes Aug 21, 2019

View reviewed changes

Assign output directory file paths to results_dir and plots_dir

e413f4b

jaclyn-taroni requested a review from jharenza August 21, 2019 23:41

jharenza approved these changes Aug 22, 2019

View reviewed changes

jaclyn-taroni mentioned this pull request Aug 22, 2019

How we add completed analysis modules to CI #77

Closed

cgreene merged commit af1794b into AlexsLemonade:master Aug 22, 2019

cbethell mentioned this pull request Aug 23, 2019

Addition of run-sample-distribution.sh #78

Merged

2 tasks

jaclyn-taroni mentioned this pull request Aug 27, 2019

Planned Analysis: Figure showing the distribution of samples by cancer type #5

Closed

cbethell deleted the 01-pbta-analysis branch December 18, 2019 19:11

	ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) +
	ggplot2::ggplot(ggplot2::aes(x = disease_type_new, y = count, fill = count)) +

	dplyr::mutate(percent = (count / sum_count))
	dplyr::mutate(percent = formattable::percent(count / sum_count))

	dplyr::mutate(percent = (count / sum_count))
	dplyr::mutate(percent = paste0((count / sum_count), "%"))

		dplyr::filter(!is.na(disease_type_new))


		# Use the location_fn to create a data.frame for each unique primary site



		# Load/install packages. This will be removed once a Dockerfile is established.
		source("00-install-packages.R")

	dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%"))
	dplyr::mutate(percent = paste0((round(count / sum_count, 2) * 100), "%"))

	ggplot2::theme(axis.text.x = element_text(
	ggplot2::theme(axis.text.x = ggplot2::element_text(

	panel.grid = element_blank()) +
	panel.grid = ggplot2::element_blank()) +


		# Define the file.path to output directories
		outputDir <- file.path("analyses", "sample-distribution-analysis")

	if (!dir.exists(file.path(outputDir, "results"))) {
	if (!dir.exists(results_dir)) {

	dir.create(file.path(outputDir, "results"))
	dir.create(results_dir)

	file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"),
	file = file.path(plots_dir, "distribution_across_cancer_types.pdf"),

	file.path(outputDir, "results", "primary_sites_counts.tsv"))
	file.path(results_dir, "primary_sites_counts.tsv"))

Addition of 01-filter-across-types.R script #54

Addition of 01-filter-across-types.R script #54

Conversation

cbethell commented Aug 12, 2019 • edited Loading

Questions

Pull Request Check List:

jharenza left a comment

Choose a reason for hiding this comment

cbethell commented Aug 13, 2019

cgreene commented Aug 13, 2019

cgreene commented Aug 13, 2019

jaclyn-taroni commented Aug 13, 2019

jharenza commented Aug 13, 2019

jharenza commented Aug 13, 2019

jharenza commented Aug 13, 2019

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaclyn-taroni commented Aug 21, 2019

jharenza commented Aug 22, 2019

cgreene commented Aug 22, 2019

jharenza left a comment

Choose a reason for hiding this comment

cbethell commented Aug 22, 2019

cgreene commented Aug 22, 2019

Addition of `01-filter-across-types.R` script #54

Addition of `01-filter-across-types.R` script #54

cbethell commented Aug 12, 2019 •

edited

Loading