-
Notifications
You must be signed in to change notification settings - Fork 67
Addition of 01-filter-across-types.R
script
#54
Conversation
`01-pbta-analysis-5.R` performs the initial filtering of the dataset. It produces a PDF and TSV of the sample distribution across cancer types, and TSV files of the cancer type distribution across primary sites of the brain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this analysis! A few thoughts:
- Can you update the code to include the data release V2, which will change the name of the clinical file?
- One potentially relevant thing to do for primary site may be to create a new variable for region, where we separate into midbrain and hemispheric. I would use
Thalamus|Pons|Spinal Cord|Brain Stem|Midbrain
to denote midbrain and thevarious lobes|cerebellum|hippocampus
to be hemispheric, and we can call the remaining as other.
Great, thank you for your suggestions! I can work on both of those today and notify you when I've committed the changes. |
Also, @jharenza : can you add that additional variable to the clinical metadata file? It seems better to have it in one place as opposed to defined separately in many scripts, since that grouping may be used throughout the project. |
I agree with updating the code to include the data release v2 file which will be called As far as the region point, I agree with @cgreene that region information should be in the histologies file to serve as a single source of truth across the project. Anything that will be used across the project in multiple places should be defined in as few places as possible and as part of the files obtained with |
Yes - I can do this - we can get this updated after our next meeting. |
- Updated the code to include the clinical file name which will be included the data release V2. - Added `brain_stem_medulla` and `brain_stem_midbrain` to `primary_sites`(instead of grouping them into `brain_stem`) to provide a more accurate representation of the separate parts of the brain stem.
Just a note - I will add the regions to the clinical file in the next version, since I want to double check my classifications with an MD. The md5sum is now included. |
Referring to #57 for the brain region classification. Angela can get to this by Friday. |
`01-pbta-analysis-5.R` was renamed `01-filter-across-types.R` to parallel changes made to `sample-distribution-analysis/README.md`
01-pbta-analysis-5.R
script01-filter-across-types.R
script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cbethell! I think this script can be improved by making sure all of the references to the packages that need to be there are there (e.g., dplyr::
, ggplot2::
) and it can probably be simplified. I've left comments about specific simplifications I'd like you to look into. I haven't tested every suggestion I've included, so when you try making those changes you should be sure to do so.
|
||
# Create a bar plot of sample distribution across cancer types | ||
gg_types <- disease_expression %>% | ||
ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) + | |
ggplot2::ggplot(ggplot2::aes(x = disease_type_new, y = count, fill = count)) + |
otherwise:
Error in aes(x = disease_type_new, y = count, fill = count) :
could not find function "aes"
|
||
# data.frame with the location where each cancer type in the dataset is expressed | ||
# sorted to show highest expression | ||
brain_location <- df2 %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified quite a bit to:
brain_location_jt <- df2 %>%
dplyr::select(disease_type_new, primary_site) %>%
dplyr::group_by(disease_type_new, primary_site) %>%
dplyr::tally() %>%
dplyr::arrange(desc(n))
> all.equal(brain_location_jt, brain_location)
[1] TRUE
# data.frame with the count of each unique cancer type expression | ||
disease_expression <- df2 %>% | ||
dplyr::group_by(disease_type_new) %>% | ||
dplyr::filter(!is.na(disease_type_new)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You performed this filtering step at line 47, do you need it here?
dplyr::group_by(disease_type_new, primary_site) %>% | ||
dplyr::tally() %>% | ||
dplyr::arrange(desc(n)) %>% | ||
dplyr::filter(!is.na(disease_type_new)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about filtering this already as above.
disease_expression <- df2 %>% | ||
dplyr::group_by(disease_type_new) %>% | ||
dplyr::filter(!is.na(disease_type_new)) %>% | ||
dplyr::count() %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can change this to
dplyr::count() %>% | |
dplyr::count(name = "count") %>% |
And then drop line 55.
|
||
# Create a percent variable | ||
disease_expression <- disease_expression %>% | ||
dplyr::mutate(percent = (count / sum_count)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dplyr::mutate(percent = (count / sum_count)) | |
dplyr::mutate(percent = formattable::percent(count / sum_count)) |
Would allow you to remove lines 65-66. Or I think:
dplyr::mutate(percent = (count / sum_count)) | |
dplyr::mutate(percent = paste0((count / sum_count), "%")) |
would be equivalent and negate the need to install formattable
. (Untested.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of the above suggestions, this is what worked: dplyr::mutate(percent = paste0(((count / sum_count) * 100)
.
Would this still be a better solution compared to the inclusion of lines 65-66?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes good call. Wouldn't you still need to add the "%"
or am I misunderstanding the other part of what formattable::percent
does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, good catch. I have the "%"
in the script but left it out when I copied the line here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say since we will have a monolithic Docker container for this project, any place where we can get rid of dependencies we should try and do that. This seems like something where we don't need a well-used function from a package to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I'll remove the use of that package and keep this in mind before implementing any other new packages in other scripts.
) | ||
|
||
# Write to tsv file | ||
readr::write_tsv(primary_sites, file.path("results", "primary_sites.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you intend for people to use this data.frame and TSV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind this data.frame was to provide an easy visualization of this particular subset for the brain graphic. However, it isn't completely necessary given that primary_sites_counts
specifies the highest and second highest expressed types, along with the counts (all the information that seems to be needed for the brain graphic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I would take it out in that case. Doing so will facilitate the suggested changes.
dplyr::arrange(primary_site) %>% | ||
dplyr::group_by(disease_type_new, primary_site) %>% | ||
dplyr::tally() %>% | ||
dplyr::arrange(desc(n)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does desc
need to be dplyr::desc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch
dplyr::filter(!is.na(disease_type_new)) | ||
|
||
|
||
# Use the location_fn to create a data.frame for each unique primary site |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this can be simplified. Let's say we change location_fn
:
location_fn <- function(location) {
disease_type_vector <- brain_location %>%
dplyr::arrange(dplyr::desc(n)) %>%
dplyr::filter(stringr::str_detect(primary_site, location)) %>%
dplyr::pull(disease_type_new)
unique(disease_type_vector)
}
and we make a vector of the locations (truncated here):
primary_sites_vector <- c(
"Basal Ganglia",
"Brain Stem- Midbrain",
"Brain Stem-Medulla"
)
# this names step helps us with melting later
names(primary_sites_vector) <- primary_sites_vector
Then I think we can do the following to get primary_sites_counts
without the need for the padded data.frame:
# for each string in primary sites vector use location_fn to get the vector of disease types,
# it's sorted by
cancer_types_list <- lapply(primary_sites_vector, location_fn)
# count the disease types for each primary site by taking the length of each element of the list
cancer_types_counts <- lapply(cancer_types_list, length)
# turn the list of counts into a data.frame and relabel, reorder columns
primary_sites_counts <- reshape2::melt(cancer_types_counts, value.name = "number_of_types") %>%
dplyr::rename(primary_site = L1) %>%
dplyr::select(primary_site, number_of_types) %>%
# get the max_type, it is the first element of the cancer_types_list
dplyr::mutate(max_type = unlist(lapply(cancer_types_list, function(x) x[1])),
# get the second element for second_max_type
second_max_type = unlist(lapply(cancer_types_list, function(x) x[2]))) %>%
# reorder the rows
dplyr::arrange(dplyr::desc(number_of_types))
What is missing here is the style change Basal Ganglia
-> basal_ganglia
, but I'm not sure you need it. This also would allow you to get rid of the other two custom functions.
|
||
|
||
# Load/install packages. This will be removed once a Dockerfile is established. | ||
source("00-install-packages.R") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am looking forward to having this removed because if you are installing a new package, it is also being loaded into the namespace which may or may not be what you want to do.
-Added the missing references to the `ggplot2` and `dplyr` packages - Simplified `brain_location` filtering step - Removed unnecessary instances of `dplyr::filter(!is.na(disease_type_new))` - Removed the implementation of the `formattable` package - Simplified the `location_fn` function and removed the `make_padded_data_frame`function - Removed `primary_sites.tsv` as the necessary information can be found in `primary_sites_counts.tsv`
|
||
# Create a percent variable | ||
disease_expression <- disease_expression %>% | ||
dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you've pushed the changes. I wanted to note that the numbers look ever so slightly different to make sure you are aware of that and I think the last thing that function was doing was performing a rounding step, which can likely be accomplished with something like:
dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%")) | |
dplyr::mutate(percent = paste0((round(count / sum_count, 2) * 100), "%")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, very true. I will make that change now and check the numbers before committing the change.
- Added the function `round` to round the count values to 4 decimal places (in turn producing percentage values with 2 decimal places) - The changed values are reflected in `disease_expression.tsv` and in `distribution_across_cancer_types.pdf`
- Removed `source(install-packages.R)` and `sessionInfo()` to parallel the changes suggested and made to PR AlexsLemonade#55.
|
||
# Function to filter based on primary_site | ||
location_fn <- function(location) { | ||
disease_type_vector <- brain_location %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add more documentation to this function? I tend to favor the old R Google Style Guide recommendations for function documentation. Here's an example in the create-subset-files
analysis: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/create-subset-files/01-create_subset_files.Rmd#L47
# of key variables within the dataset. | ||
# | ||
# Chante Bethell for CCDL 2019 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a statement about how this script is intended to be run? An example would be:
#### USAGE
This script is intended to be run via the command line from the top directory of the repository:
Rscript analyses/sample-distribution-analysis/01-filter-across-types.R
ggplot2::labs(x = "Cancer Types", y = "Count", | ||
title = "Sample Distribution Across Cancer Types") + | ||
ggplot2::scale_y_continuous(breaks = seq(0, 500, by = 100)) + | ||
ggplot2::theme(axis.text.x = element_text( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggplot2::theme(axis.text.x = element_text( | |
ggplot2::theme(axis.text.x = ggplot2::element_text( |
Based on the test failure
hjust = 1, | ||
size = 8 | ||
), | ||
panel.grid = element_blank()) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panel.grid = element_blank()) + | |
panel.grid = ggplot2::element_blank()) + |
- Also added a function definition to `01` script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you assign your output directories for plots and results to a variable as I've suggested throughout, you will be able to change where all that output goes by making a single change instead multiple changes throughout the script. If you make those changes and the tests pass, this LGTM 👍
|
||
# Define the file.path to output directories | ||
outputDir <- file.path("analyses", "sample-distribution-analysis") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add this here:
results_dir <- file.path(outputDir, "results")
plots_dir <- file.path(outputDir, "plots")
outputDir <- file.path("analyses", "sample-distribution-analysis") | ||
|
||
# Create directories to hold the output. | ||
if (!dir.exists(file.path(outputDir, "results"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!dir.exists(file.path(outputDir, "results"))) { | |
if (!dir.exists(results_dir)) { |
|
||
# Create directories to hold the output. | ||
if (!dir.exists(file.path(outputDir, "results"))) { | ||
dir.create(file.path(outputDir, "results")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dir.create(file.path(outputDir, "results")) | |
dir.create(results_dir) |
if (!dir.exists(file.path(outputDir, "results"))) { | ||
dir.create(file.path(outputDir, "results")) | ||
} | ||
if (!dir.exists(file.path(outputDir, "plots"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would change these to plots_dir
-- same pattern as results_dir
|
||
# Write to tsv file | ||
readr::write_tsv(disease_expression, | ||
file.path(outputDir, "results", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file.path(outputDir, "results", | |
file.path(results_dir, |
# Save plot | ||
ggplot2::ggsave( | ||
gg_types, | ||
file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could then change this to
file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"), | |
file = file.path(plots_dir, "distribution_across_cancer_types.pdf"), |
|
||
# Write to tsv file | ||
readr::write_tsv(primary_sites_counts, | ||
file.path(outputDir, "results", "primary_sites_counts.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file.path(outputDir, "results", "primary_sites_counts.tsv")) | |
file.path(results_dir, "primary_sites_counts.tsv")) |
I talked with Angela Waanders and she recommended we only do this for gliomas (and only if regions are not ambiguous - ie tumor is not in both midline and hemispheric regions) and not all samples, so will add that later. |
@jharenza: Ok! It also looks like you still have this PR marked as changes requested. Is there anything else that needs to be changed in the code before it can be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all good!
@jharenza Thank you for that confirmation! |
Going to merge this 👍 |
This PR is the first step of the analysis addressing issue #5
01-filter-across-types.R
performs the initial filtering of the dataset.primary_sites.tsv
displays a table of all the cancer types expressed within each site andprimary_sites_counts.tsv
displays the total number of types, along with the highest and second highest expressed types for each unique site).Questions
Are there any additional data filtering steps that may need to be included for this particular dataset?
Are the results/plots suffice for this particular issue? If not, what other forms of tables or plots would you like to see included? Please keep in mind that the interactive treemap and pie chart are produced in the upcoming script.
Pull Request Check List: