This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Analysis: Figure showing the distribution of samples by cancer type #5

Closed
cgreene opened this issue Jul 11, 2019 · 20 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach!

Comments

@cgreene
Collaborator

cgreene commented Jul 11, 2019

It is often helpful for part of the first figure of a dataset landscape paper to show how the samples are distributed across cancer types, to characterize the overall dataset. We would like a figure that summarizes the content of the dataset "at a glance."

@PichaiRaman
Contributor

This could include a graphic of a brain to show the origin / location of the different brain tumor types.

@cgreene added the "good first issue" (Good for newcomers) label Jul 14, 2019
@cbethell self-assigned this Jul 24, 2019
@cbethell
Contributor

cbethell commented Jul 26, 2019

To address the first part of this, would a bar plot with the cancer type on the x axis and the frequency on the y axis be well suited for this purpose? Possibly, a more complex version of this:
MSI-Fig1-860x685 from Dung et al.

or part A of this:
1-s2 0-S2211124718304376-gr1 from Knijnenburg et al.

@cgreene
Collaborator Author

cgreene commented Jul 29, 2019

@cbethell I think that the bar chart from the first paper would be aligned with what I was thinking here! I'm pretty sure the vision would be to use Column O in https://docs.google.com/spreadsheets/d/1Sa0hNX1lje40HdBpWiLUBY6-53UzvQczFLYb3uIm2jw/edit#gid=1663834900 .

@cbethell
Contributor

@cgreene Great! Thank you for the clarification.

@jharenza
Collaborator

@cbethell agree with @cgreene on the barplot and Column O!

@cbethell
Contributor

The graph below is a rough draft of my interpretation of the first figure suggested by @cgreene.

image

As is, the bars are colored by count, but this can later be changed to reflect molecular_subtype. The y-axis also represents the raw sample count, which can be changed to percentage if preferred. What are the thoughts on the above bar plot thus far?
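For reference, the counting behind a bar plot like this, and the switch from raw counts to percentages, can be sketched in a few lines. This is only an illustrative sketch in Python (the code referenced elsewhere in this thread is in R). The toy records are made up, and using `disease_type_new` as the cancer type column is an assumption, not a statement about which column the draft plot uses.

```python
from collections import Counter

# Toy data: the cancer types and counts are made up for illustration.
types = ["Low-grade glioma"] * 4 + ["Medulloblastoma"] * 3 + ["Ependymoma"]
samples = [
    {"sample_id": f"S{i}", "disease_type_new": t}
    for i, t in enumerate(types, start=1)
]


def count_by_type(records, column="disease_type_new"):
    """Return (cancer type, count) pairs, most frequent first."""
    counts = Counter(rec[column] for rec in records)
    # Sort by descending count, then by name so ties have a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_percentages(counts):
    """Convert (cancer type, count) pairs to (cancer type, percent) pairs."""
    total = sum(n for _, n in counts)
    return [(name, 100.0 * n / total) for name, n in counts]


counts = count_by_type(samples)
percentages = to_percentages(counts)
for (name, n), (_, pct) in zip(counts, percentages):
    print(f"{name}: {n} samples ({pct:.1f}%)")
```

Either list can feed the bar heights directly; sorting before plotting keeps the bars in descending order regardless of which y-axis is chosen.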

@cgreene
Collaborator Author

cgreene commented Aug 1, 2019

This is really helpful to me. It's nice to understand the characteristics of the data. @jharenza: does this align with what you were expecting the distributions to look like in the dataset?

@jharenza
Collaborator

jharenza commented Aug 2, 2019

This looks right. The bars would be fine all black, or they could later be colored by subtype, although that may make the legend complex. You can also use https://pedcbioportal.kidsfirstdrc.org/study/summary?id=pbta_cbttc as a guide for this dataset. I was also thinking a multi-layer pie chart could be useful here, with the inner ring being the broad histology, the next ring out being these unique histologies, and a further ring for molecular subtype. An example is the html image here. I have some code for this (made in R); I will push it to my last paper repo tomorrow and share. We could also consider assigning a specific color to each histology and using it throughout the paper for consistency. See some of the TCGA papers for color scheme examples.
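To make the ring idea concrete, here is a rough sketch of the aggregation a multi-layer pie chart needs: one set of counts per ring, where each wedge is identified by its path from the center. It is written in Python for illustration only and is not the R code mentioned above. The column names `broad_histology`, `short_histology`, and `molecular_subtype` are assumed to correspond to the three rings, and the records are placeholders.

```python
from collections import Counter

# Assumed ring order, innermost first.
LEVELS = ("broad_histology", "short_histology", "molecular_subtype")

# Placeholder records; the labels carry no biological meaning.
records = [
    {"broad_histology": "Broad A", "short_histology": "Short A1", "molecular_subtype": "Subtype x"},
    {"broad_histology": "Broad A", "short_histology": "Short A1", "molecular_subtype": "Subtype y"},
    {"broad_histology": "Broad A", "short_histology": "Short A2", "molecular_subtype": "Subtype x"},
    {"broad_histology": "Broad B", "short_histology": "Short B1", "molecular_subtype": "Subtype z"},
]


def ring_counts(records, levels=LEVELS):
    """Return one Counter per ring, keyed by the path from the innermost level."""
    rings = [Counter() for _ in levels]
    for rec in records:
        path = ()
        for depth, level in enumerate(levels):
            # Extend the path so a wedge is always tied to its parent wedge.
            path += (rec[level],)
            rings[depth][path] += 1
    return rings


rings = ring_counts(records)
for depth, ring in enumerate(rings):
    print(LEVELS[depth], dict(ring))
```

Keying each wedge by its full path keeps identically named subtypes under different histologies separate, and the counts in every ring sum to the total number of samples.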

@jharenza
Collaborator

jharenza commented Aug 2, 2019

@cbethell I pushed the code here: https://github.com/marislab/create-pptc-pdx-pie - let me know if you have any issues!

@cbethell
Contributor

cbethell commented Aug 2, 2019

@jharenza Thank you! I'll take a look at it now.

@cbethell
Contributor

cbethell commented Aug 8, 2019

In addition to the draft PR I filed, I have put together a graphic of the brain showing the sites of various cancer types as suggested by @PichaiRaman.

Given the relatively large number of unique cancer types in the dataset, I thought it may be more feasible to label the graphic with the most frequent cancer type at each primary site.

Let me know your thoughts on this concept (I'm sure the graphic itself can be further manipulated for better presentation so please feel free to provide input in this area as well):

pbta-disease-types
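Picking the most common cancer type at each primary site could be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the graphic. The column names `primary_site` and `disease_type_new` are assumptions, the records are placeholders, and breaking ties alphabetically is an arbitrary choice made only so the result is deterministic.

```python
from collections import Counter, defaultdict

# Placeholder records; site and type names are not real annotations.
records = [
    {"primary_site": "Site 1", "disease_type_new": "Type A"},
    {"primary_site": "Site 1", "disease_type_new": "Type A"},
    {"primary_site": "Site 1", "disease_type_new": "Type B"},
    {"primary_site": "Site 2", "disease_type_new": "Type C"},
    {"primary_site": "Site 2", "disease_type_new": "Type B"},
]


def top_type_per_site(records, site_col="primary_site", type_col="disease_type_new"):
    """Map each site to its (most frequent cancer type, count)."""
    by_site = defaultdict(Counter)
    for rec in records:
        by_site[rec[site_col]][rec[type_col]] += 1
    top = {}
    for site, counts in by_site.items():
        # Highest count wins; ties are broken alphabetically.
        top[site] = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return top


labels = top_type_per_site(records)
print(labels)
```

A single label per site necessarily hides every other type found there, so keeping the full per-site counts alongside the labels would make that loss visible.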

@jharenza
Collaborator

jharenza commented Aug 8, 2019

Hi @cbethell - we actually have a graphic designer on staff who can handle this brain region figure. I think this is a good start, but we are missing major tumor types (e.g., medulloblastoma), gliomas will have to be narrowed further, and some types occur in regions that are not represented (e.g., DIPG also in brainstem and thalamus). This may be difficult to represent, so maybe we should think about how to do this a bit more. I saw a figure in a paper I liked where they had a b&w brain and the number of tumors per region shown in circles, with each circle's size proportional to the sample N of that region and its color indicating either molecular subtype or tumor type, so maybe we can iterate around that a bit. (I can't remember the paper offhand, but I think it was some GBM/HGG subtyping paper.)
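As a small sketch of the proportional circles idea, the radii can be derived from per-region sample counts. This Python snippet is illustrative only. Scaling the area (rather than the radius) with N is an assumption about what "proportional in size" should mean, and the region names and counts are placeholders.

```python
import math


def circle_radii(region_counts, max_radius=1.0):
    """Scale radii so each circle's area is proportional to its sample count."""
    if not region_counts:
        raise ValueError("region_counts must not be empty")
    largest = max(region_counts.values())
    if largest <= 0:
        raise ValueError("at least one region needs a positive count")
    # Area grows with radius squared, so the radius scales with sqrt(count).
    return {
        region: max_radius * math.sqrt(n / largest)
        for region, n in region_counts.items()
    }


# Placeholder counts per region.
radii = circle_radii({"Region A": 100, "Region B": 25, "Region C": 4})
for region, radius in radii.items():
    print(f"{region}: radius {radius:.2f}")
```

Scaling by area avoids visually exaggerating large regions: here a region with four times the samples gets a circle with twice the radius, not four times.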

@jaclyn-taroni added the "in progress" (Someone is working on this issue, but feel free to propose an alternative approach!) label Aug 9, 2019
@cbethell
Contributor

cbethell commented Aug 9, 2019

Hi @jharenza - this seems like a good way to present the major tumor types. I will look for the paper and for other diagrams similar to what you describe above, and prepare the data for use in this manner.

@jaclyn-taroni
Member

jaclyn-taroni commented Aug 9, 2019

@cbethell This issue may be a better spot for discussion around the plots included as part of #40. Can you post what the plots look like here? For the interactive plot, you can probably create a repo of your own that includes the HTML file generated in 857fd81 (if I understand correctly) -> turn on GitHub pages. Then you can post a link here for easy access.

@cbethell
Contributor

cbethell commented Aug 9, 2019

@jaclyn-taroni per your suggestion, here are the plots included in draft PR #40 :

  • sample-distribution-analyses/plots/distribution_across_cancer_types.pdf
    distribution_across_cancer_types-1

I modified the above plot to show percentages above each bar instead of the raw count. Thoughts?

  • sample-distribution-analyses/plots/treemap.pdf
    treemap-1

Above is a treemap plot used to display broad_histology, short_histology, and disease_type_new. disease_type_new will be replaced with molecular_subtype once determined. What are the thoughts on including the treemap?

@cbethell
Contributor

cbethell commented Aug 9, 2019

As noted in draft PR #40, the main ideas I would like input on include:

  1. Does the multilayer sunburst pie chart meet expectations? (In terms of the presentation and the data representation)
  2. Are there any additional data filtering steps that may need to be included for this particular dataset?
  3. Do the results/plots suffice for this particular issue? If not, what other forms of tables or plots would you like to see included?

@jaclyn-taroni
Member

For me, the treemap is less valuable as a static image. That is not to say that an interactive treemap would be better than the interactive pie chart you linked above.

@cbethell
Contributor

cbethell commented Aug 9, 2019

For me, the treemap is less valuable as a static image. That is not to say that an interactive treemap would be better than the interactive pie chart you linked above.

Good point. I can make the treemap interactive and we can decide how useful it may be from there.

@cbethell
Contributor

cbethell commented Aug 9, 2019

For me, the treemap is less valuable as a static image. That is not to say that an interactive treemap would be better than the interactive pie chart you linked above.

Find the link to the interactive treemap here.

Its value still does not appear to exceed that of the interactive pie chart. That said, what are the thoughts on including it now?

@jaclyn-taroni
Member

I believe the analysis portion of this has been satisfied by #52, #54, and #55 with the exception of the changes in glioma brain region classification tracked in #57.

I've opened AlexsLemonade/OpenPBTA-manuscript#38 to track the final assembly of the figure. If anything related to analyses/sample-distribution-analyses comes up on that manuscript issue, we can either reopen this or create a new analysis ticket.


5 participants