Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Addition of 01-filter-across-types.R script #54

Merged
merged 14 commits into from
Aug 22, 2019

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Aug 12, 2019

This PR is the first step of the analysis addressing issue #5

  • 01-filter-across-types.R performs the initial filtering of the dataset.
  • The script produces a PDF of a barplot representing the sample distribution across cancer types, as well as a TSV with the raw values used to do so.
  • The script also produces TSV files of the cancer type distribution across primary sites of the brain (where primary_sites.tsv displays a table of all the cancer types expressed within each site and primary_sites_counts.tsv displays the total number of types, along with the highest and second highest expressed types for each unique site).

Questions

  1. Are there any additional data filtering steps that may need to be included for this particular dataset?

  2. Are the results/plots suffice for this particular issue? If not, what other forms of tables or plots would you like to see included? Please keep in mind that the interactive treemap and pie chart are produced in the upcoming script.

Pull Request Check List:

  • Run a linter
  • Set the seed (if applicable).
  • Comments and/or documentation up to date.
  • Double check your paths.
  • Check headline numbering.
  • Spell check any Rmd file or md file.
  • Restart R and run all notebooks fresh and save.

`01-pbta-analysis-5.R` performs the initial filtering of the dataset. It produces a PDF and TSV of the sample distribution across cancer types, and TSV files of the cancer type distribution across primary sites of the brain.
@cbethell cbethell requested a review from jharenza August 12, 2019 21:01
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this analysis! A few thoughts:

  • Can you update the code to include the data release V2, which will change the name of the clinical file?
  • One potentially relevant thing to do for primary site may be to create a new variable for region, where we separate into midbrain and hemispheric. I would use Thalamus|Pons|Spinal Cord|Brain Stem|Midbrain to denote midbrain and the various lobes|cerebellum|hippocampus to be hemispheric, and we can call the remaining as other.

@cbethell
Copy link
Contributor Author

@jharenza

  • Can you update the code to include the data release V2, which will change the name of the clinical file?
  • One potentially relevant thing to do for primary site may be to create a new variable for region, where we separate into midbrain and hemispheric. I would use Thalamus|Pons|Spinal Cord|Brain Stem|Midbrain to denote midbrain and the various lobes|cerebellum|hippocampus to be hemispheric, and we can call the remaining as other.

Great, thank you for your suggestions! I can work on both of those today and notify you when I've committed the changes.

@cgreene
Copy link
Collaborator

cgreene commented Aug 13, 2019

@jharenza can you see about addressing #50? Without that file in the md5sum, the download script won't acquire it.

@cgreene
Copy link
Collaborator

cgreene commented Aug 13, 2019

Also, @jharenza : can you add that additional variable to the clinical metadata file? It seems better to have it in one place as opposed to defined separately in many scripts, since that grouping may be used throughout the project.

@jaclyn-taroni
Copy link
Member

I agree with updating the code to include the data release v2 file which will be called pbta-histologies.tsv once #50 is addressed.

As far as the region point, I agree with @cgreene that region information should be in the histologies file to serve as a single source of truth across the project. Anything that will be used across the project in multiple places should be defined in as few places as possible and as part of the files obtained with bash download-data.sh where possible.

@jharenza
Copy link
Collaborator

Also, @jharenza : can you add that additional variable to the clinical metadata file? It seems better to have it in one place as opposed to defined separately in many scripts, since that grouping may be used throughout the project.

Yes - I can do this - we can get this updated after our next meeting.

- Updated the code to include the clinical file name which will be included the data release V2.
- Added `brain_stem_medulla` and `brain_stem_midbrain` to `primary_sites`(instead of grouping them into `brain_stem`)  to provide a more accurate representation of the separate parts of the brain stem.
@jharenza
Copy link
Collaborator

Just a note - I will add the regions to the clinical file in the next version, since I want to double check my classifications with an MD. The md5sum is now included.

@jharenza
Copy link
Collaborator

Referring to #57 for the brain region classification. Angela can get to this by Friday.

`01-pbta-analysis-5.R` was renamed `01-filter-across-types.R` to parallel changes made to `sample-distribution-analysis/README.md`
@cbethell cbethell changed the title Addition of 01-pbta-analysis-5.R script Addition of 01-filter-across-types.R script Aug 14, 2019
@jaclyn-taroni jaclyn-taroni self-requested a review August 15, 2019 13:57
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cbethell! I think this script can be improved by making sure all of the references to the packages that need to be there are there (e.g., dplyr::, ggplot2::) and it can probably be simplified. I've left comments about specific simplifications I'd like you to look into. I haven't tested every suggestion I've included, so when you try making those changes you should be sure to do so.


# Create a bar plot of sample distribution across cancer types
gg_types <- disease_expression %>%
ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggplot2::ggplot(aes(x = disease_type_new, y = count, fill = count)) +
ggplot2::ggplot(ggplot2::aes(x = disease_type_new, y = count, fill = count)) +

otherwise:

Error in aes(x = disease_type_new, y = count, fill = count) : 
  could not find function "aes"


# data.frame with the location where each cancer type in the dataset is expressed
# sorted to show highest expression
brain_location <- df2 %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified quite a bit to:

brain_location_jt <- df2 %>%
  dplyr::select(disease_type_new, primary_site) %>%
  dplyr::group_by(disease_type_new, primary_site) %>%
  dplyr::tally() %>%
  dplyr::arrange(desc(n))
> all.equal(brain_location_jt, brain_location)
[1] TRUE

# data.frame with the count of each unique cancer type expression
disease_expression <- df2 %>%
dplyr::group_by(disease_type_new) %>%
dplyr::filter(!is.na(disease_type_new)) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You performed this filtering step at line 47, do you need it here?

dplyr::group_by(disease_type_new, primary_site) %>%
dplyr::tally() %>%
dplyr::arrange(desc(n)) %>%
dplyr::filter(!is.na(disease_type_new))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about filtering this already as above.

disease_expression <- df2 %>%
dplyr::group_by(disease_type_new) %>%
dplyr::filter(!is.na(disease_type_new)) %>%
dplyr::count() %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can change this to

Suggested change
dplyr::count() %>%
dplyr::count(name = "count") %>%

And then drop line 55.


# Create a percent variable
disease_expression <- disease_expression %>%
dplyr::mutate(percent = (count / sum_count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dplyr::mutate(percent = (count / sum_count))
dplyr::mutate(percent = formattable::percent(count / sum_count))

Would allow you to remove lines 65-66. Or I think:

Suggested change
dplyr::mutate(percent = (count / sum_count))
dplyr::mutate(percent = paste0((count / sum_count), "%"))

would be equivalent and negate the need to install formattable. (Untested.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of the above suggestions, this is what worked: dplyr::mutate(percent = paste0(((count / sum_count) * 100).
Would this still be a better solution compared to the inclusion of lines 65-66?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good call. Wouldn't you still need to add the "%" or am I misunderstanding the other part of what formattable::percent does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, good catch. I have the "%" in the script but left it out when I copied the line here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say since we will have a monolithic Docker container for this project, any place where we can get rid of dependencies we should try and do that. This seems like something where we don't need a well-used function from a package to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I'll remove the use of that package and keep this in mind before implementing any other new packages in other scripts.

)

# Write to tsv file
readr::write_tsv(primary_sites, file.path("results", "primary_sites.tsv"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you intend for people to use this data.frame and TSV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind this data.frame was to provide an easy visualization of this particular subset for the brain graphic. However, it isn't completely necessary given that primary_sites_counts specifies the highest and second highest expressed types, along with the counts (all the information that seems to be needed for the brain graphic).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I would take it out in that case. Doing so will facilitate the suggested changes.

dplyr::arrange(primary_site) %>%
dplyr::group_by(disease_type_new, primary_site) %>%
dplyr::tally() %>%
dplyr::arrange(desc(n)) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does desc need to be dplyr::desc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch

dplyr::filter(!is.na(disease_type_new))


# Use the location_fn to create a data.frame for each unique primary site
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this can be simplified. Let's say we change location_fn:

location_fn <- function(location) {
  disease_type_vector <- brain_location %>%
    dplyr::arrange(dplyr::desc(n)) %>%
    dplyr::filter(stringr::str_detect(primary_site, location)) %>%
    dplyr::pull(disease_type_new)
  unique(disease_type_vector)
}

and we make a vector of the locations (truncated here):

primary_sites_vector <- c(
  "Basal Ganglia",
  "Brain Stem- Midbrain",
  "Brain Stem-Medulla"
)
# this names step helps us with melting later
names(primary_sites_vector) <- primary_sites_vector

Then I think we can do the following to get primary_sites_counts without the need for the padded data.frame:

# for each string in primary sites vector use location_fn to get the vector of disease types,
# it's sorted by 
cancer_types_list <- lapply(primary_sites_vector, location_fn)
# count the disease types for each primary site by taking the length of each element of the list
cancer_types_counts <- lapply(cancer_types_list, length)
# turn the list of counts into a data.frame and relabel, reorder columns
primary_sites_counts <- reshape2::melt(cancer_types_counts, value.name = "number_of_types") %>%
  dplyr::rename(primary_site = L1) %>%
  dplyr::select(primary_site, number_of_types) %>%
  # get the max_type, it is the first element of the cancer_types_list
  dplyr::mutate(max_type = unlist(lapply(cancer_types_list, function(x) x[1])), 
                # get the second element for second_max_type
                second_max_type = unlist(lapply(cancer_types_list, function(x) x[2]))) %>%
  # reorder the rows
  dplyr::arrange(dplyr::desc(number_of_types))

What is missing here is the style change Basal Ganglia -> basal_ganglia, but I'm not sure you need it. This also would allow you to get rid of the other two custom functions.



# Load/install packages. This will be removed once a Dockerfile is established.
source("00-install-packages.R")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am looking forward to having this removed because if you are installing a new package, it is also being loaded into the namespace which may or may not be what you want to do.

-Added the missing references to the `ggplot2` and `dplyr` packages 
- Simplified `brain_location` filtering step
- Removed unnecessary instances of `dplyr::filter(!is.na(disease_type_new))`
- Removed the implementation of the `formattable` package
- Simplified the `location_fn` function and removed the `make_padded_data_frame`function
- Removed `primary_sites.tsv` as the necessary information can be found in `primary_sites_counts.tsv`

# Create a percent variable
disease_expression <- disease_expression %>%
dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you've pushed the changes. I wanted to note that the numbers look ever so slightly different to make sure you are aware of that and I think the last thing that function was doing was performing a rounding step, which can likely be accomplished with something like:

Suggested change
dplyr::mutate(percent = paste0(((count / sum_count) * 100), "%"))
dplyr::mutate(percent = paste0((round(count / sum_count, 2) * 100), "%"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very true. I will make that change now and check the numbers before committing the change.

- Added the function `round` to round the count values to 4 decimal places (in turn producing percentage values with 2 decimal places)
- The changed values are reflected in `disease_expression.tsv` and in `distribution_across_cancer_types.pdf`
cbethell and others added 2 commits August 19, 2019 11:41
- Removed `source(install-packages.R)` and `sessionInfo()` to parallel the changes suggested and made to PR AlexsLemonade#55.

# Function to filter based on primary_site
location_fn <- function(location) {
disease_type_vector <- brain_location %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more documentation to this function? I tend to favor the old R Google Style Guide recommendations for function documentation. Here's an example in the create-subset-files analysis: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/create-subset-files/01-create_subset_files.Rmd#L47

# of key variables within the dataset.
#
# Chante Bethell for CCDL 2019

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a statement about how this script is intended to be run? An example would be:

#### USAGE
This script is intended to be run via the command line from the top directory of the repository:
Rscript analyses/sample-distribution-analysis/01-filter-across-types.R

ggplot2::labs(x = "Cancer Types", y = "Count",
title = "Sample Distribution Across Cancer Types") +
ggplot2::scale_y_continuous(breaks = seq(0, 500, by = 100)) +
ggplot2::theme(axis.text.x = element_text(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggplot2::theme(axis.text.x = element_text(
ggplot2::theme(axis.text.x = ggplot2::element_text(

Based on the test failure

hjust = 1,
size = 8
),
panel.grid = element_blank()) +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
panel.grid = element_blank()) +
panel.grid = ggplot2::element_blank()) +

- Also added a function definition to `01` script
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you assign your output directories for plots and results to a variable as I've suggested throughout, you will be able to change where all that output goes by making a single change instead multiple changes throughout the script. If you make those changes and the tests pass, this LGTM 👍


# Define the file.path to output directories
outputDir <- file.path("analyses", "sample-distribution-analysis")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add this here:

results_dir <- file.path(outputDir, "results")
plots_dir <- file.path(outputDir, "plots")

outputDir <- file.path("analyses", "sample-distribution-analysis")

# Create directories to hold the output.
if (!dir.exists(file.path(outputDir, "results"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!dir.exists(file.path(outputDir, "results"))) {
if (!dir.exists(results_dir)) {


# Create directories to hold the output.
if (!dir.exists(file.path(outputDir, "results"))) {
dir.create(file.path(outputDir, "results"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dir.create(file.path(outputDir, "results"))
dir.create(results_dir)

if (!dir.exists(file.path(outputDir, "results"))) {
dir.create(file.path(outputDir, "results"))
}
if (!dir.exists(file.path(outputDir, "plots"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would change these to plots_dir -- same pattern as results_dir


# Write to tsv file
readr::write_tsv(disease_expression,
file.path(outputDir, "results",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file.path(outputDir, "results",
file.path(results_dir,

# Save plot
ggplot2::ggsave(
gg_types,
file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could then change this to

Suggested change
file = file.path(outputDir, "plots", "distribution_across_cancer_types.pdf"),
file = file.path(plots_dir, "distribution_across_cancer_types.pdf"),


# Write to tsv file
readr::write_tsv(primary_sites_counts,
file.path(outputDir, "results", "primary_sites_counts.tsv"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file.path(outputDir, "results", "primary_sites_counts.tsv"))
file.path(results_dir, "primary_sites_counts.tsv"))

@jaclyn-taroni
Copy link
Member

@jharenza the change to the histologies file you requested has been made and the second point you raised is now tracked at #57. If this looks good to you, please feel free to merge.

@jharenza
Copy link
Collaborator

Just a note - I will add the regions to the clinical file in the next version, since I want to double check my classifications with an MD. The md5sum is now included.

I talked with Angela Waanders and she recommended we only do this for gliomas (and only if regions are not ambiguous - ie tumor is not in both midline and hemispheric regions) and not all samples, so will add that later.

@cgreene
Copy link
Collaborator

cgreene commented Aug 22, 2019

@jharenza: Ok! It also looks like you still have this PR marked as changes requested. Is there anything else that needs to be changed in the code before it can be merged?

Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all good!

@cbethell
Copy link
Contributor Author

@jharenza Thank you for that confirmation!

@cgreene
Copy link
Collaborator

cgreene commented Aug 22, 2019

Going to merge this 👍

@cgreene cgreene merged commit af1794b into AlexsLemonade:master Aug 22, 2019
@cbethell cbethell deleted the 01-pbta-analysis branch December 18, 2019 19:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants