Planned Analysis: Tumor Mutation Burden #3
Comments
This may be a simplistic question, but in our case |
@PichaiRaman : I think this came from you - how were you planning to do this analysis? |
@cansav09 I think we are looking for something like this - Figure 1 from the Grobner, 2018 landscape, but for brain tumors: For reference, here is the second pan-pediatric cancer landscape from 2018: |
Now that I've gotten somewhat familiar with the SNV data, I can do this analysis as well. It looks like Grobner says they used Here's my tentative plan, let me know what you think: Step 1) Start with Mutect2 results only. (Once I make the script for this, and then later after #30 is determined, I can easily go back and change it). |
Hi @cansav09! I think @afarrel may have some input here, as he has worked with a lot of TMB data. People do this many different ways. I had previously used only non-synonymous mutations to be reflective of somatic burden. Looks like there is a harmonization effort here, but I'm not sure what they have decided. One other note: if you use coding mutations only, then you should calculate the size of the coding region in Mb from the WGS/WES used for this. Currently, I think much of the data you have was derived from WGS, but some incoming data will be WES and we can provide the WES capture region for this. I think #s 1-4 look good! |
This is interesting. Glad someone is working on making things uniform, but yeah, hard to tell what the conclusion is from this. |
I think the first part of this analysis will be determining if there are differences between different methods of calculations of TMB. My thought is I should calculate TMBs and compare for the following 2 sets of variables:
So this would result in four different methods of calculating TMB for each sample, all expressed as mutations per Mb.
After calculating these TMBs I will test the differences/similarities of the calculations by:
How does this sound? |
TMB typically does not include CNV data, so I would recommend against using it. Otherwise, I would also add a comparison to adult tumors, as mentioned above in the one reference. |
Just to reiterate what @jharenza said, typically we include SNVs and small indels in TMB calls. With regard to TMB, you need to define whether you are counting all SNVs or only non-synonymous mutations. In some of our applications, we perform TMB calls using tumor-only methods because we don't have matched normal data, and we exclude synonymous mutation calls to reduce false positives and better reflect somatic mutation burden. |
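To make the "all SNVs vs. only non-synonymous" choice concrete, here is a minimal sketch of counting non-synonymous calls per sample from MAF-style records. The column names `Variant_Classification` and `Tumor_Sample_Barcode` follow the MAF format; which classes count as non-synonymous (the `NONSYNONYMOUS` set) and the function name `count_nonsynonymous` are assumptions for illustration, not a rule anyone in this thread has settled on.

```python
# Assumed set of MAF Variant_Classification values treated as non-synonymous.
# Adjust to whatever definition the analysis settles on.
NONSYNONYMOUS = {
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Translation_Start_Site",
    "Splice_Site",
}


def count_nonsynonymous(records):
    """Count non-synonymous calls per sample.

    `records` is an iterable of dicts, one per MAF row, each with the
    keys "Tumor_Sample_Barcode" and "Variant_Classification".
    """
    counts = {}
    for rec in records:
        if rec["Variant_Classification"] in NONSYNONYMOUS:
            sample = rec["Tumor_Sample_Barcode"]
            counts[sample] = counts.get(sample, 0) + 1
    return counts


# Hypothetical example rows (sample names are made up).
example = [
    {"Tumor_Sample_Barcode": "S1", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S1", "Variant_Classification": "Silent"},
    {"Tumor_Sample_Barcode": "S2", "Variant_Classification": "Frame_Shift_Del"},
    {"Tumor_Sample_Barcode": "S2", "Variant_Classification": "Intron"},
]
nonsyn_counts = count_nonsynonymous(example)
```

Samples with no non-synonymous calls are absent from the result, so they need to be filled in with zero before dividing by the region size.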
Is there a more exact way to determine the number of megabases sequenced for each sample? I can’t find anyone’s exact code for determining TMB, so I am unsure what number(s) to use for the “per coding megabase” part. |
Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done? And what other filters may have been applied to this data? Can we see the command that was given to run Strelka2? I did notice the methods so far say that Strelka2 was run using best practices so I will investigate that a bit more to see if anything there makes more sense of the data. |
For WES data, it is calculated using the bed regions file and for WGS, you can use the size of the whole genome - enlisting @migbro for this reference file and a size, as I just noticed that the links to the code are not in the manuscript... |
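As a rough illustration of "calculated using the bed regions file", the sketch below sums the interval lengths of a BED file after merging overlaps, so that no base is counted twice. It assumes standard BED coordinates (0-based start, exclusive end); the helper names `parse_bed_lines` and `bed_size_bp` are hypothetical.

```python
def parse_bed_lines(lines):
    """Yield (chrom, start, end) from BED text lines, skipping headers/blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        yield chrom, int(start), int(end)


def bed_size_bp(intervals):
    """Total number of distinct bases covered by the intervals."""
    total = 0
    prev_chrom, prev_end = None, 0
    for chrom, start, end in sorted(intervals):
        if chrom != prev_chrom:
            prev_chrom, prev_end = chrom, 0
        # Clip away any part already counted from an earlier interval.
        start = max(start, prev_end)
        if end > start:
            total += end - start
            prev_end = end
    return total


# Hypothetical three-line BED with one overlap on chr1.
bed_text = ["chr1\t0\t100", "chr1\t50\t150", "chr2\t10\t20"]
size_bp = bed_size_bp(parse_bed_lines(bed_text))
size_mb = size_bp / 1_000_000
```

To use it on a real file, pass an open file handle to `parse_bed_lines` in place of the list of strings.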
The only filtering done was |
For WGS, will we need to calculate the percent of the reference genome covered in order to get an exact genome size in Mb? Or do we just use a general approximation? Have the BED region files already been prepared? Also, I've set up the TCGA data, and I will generally attempt to calculate TMB the same way. Do you have an idea of the easiest way to obtain genome sizes for the TCGA samples? Has this been calculated elsewhere? |
Here's a preview of the data as it is now, before dividing by genome size. I'm not sure which disease labels I should be using for both the TCGA and PBTA data. For the TCGA data, I have only used what were brain tumors so hopefully it's more comparable. |
Hi @cansav09! I've previously done TMB analysis for a subset of our samples for a paper (https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656587.full.pdf). I calculated TMB = (total variants in coding region / length of exome BED in bp) * 1,000,000, where variants in the coding region are variants overlapping an inclusive coding region file: https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg38/bed/Exome-AZ_V2.bed and the denominator is the calculated length of the coding region (which for this BED file was 159697302 bp). |
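The formula above can be written out as a short sketch. The BED length of 159697302 bp is the value quoted in the comment; the function name `tmb_per_mb` and the example count of 320 variants are made up for illustration.

```python
# Length of the coding-region BED quoted above, in base pairs.
CODING_BED_SIZE_BP = 159_697_302


def tmb_per_mb(n_variants, region_size_bp=CODING_BED_SIZE_BP):
    """Mutations per megabase: variants in the region / region size (bp) * 1e6."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return n_variants / region_size_bp * 1_000_000


# Hypothetical sample with 320 variants overlapping the coding BED.
example_tmb = tmb_per_mb(320)
```

The numerator should contain only variants that overlap the same BED file used for the denominator; otherwise the rate is not comparable across samples.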
@kgaonkar6 Thanks so much for the code and paper link! This is super helpful! So am I correct in saying that you only used the size of the reference genome for the denominator for all your samples? (As opposed to calculating the percent of the reference exome that each sample had covered and then obtaining a megabase count per sample.) |
I'm trying to look through the code and determine why the WGS samples don't have any intergenic mutations at all (okay, there are 2 out of all the samples), but it isn't clear to me yet. The reason I want to know this is for determining the genome size in Mb to divide by. It seems like we would only want to use the size of the whole genome as our denominator for TMB if we also had a chance of finding intergenic mutations in this workflow, i.e., if the workflow is intentionally selecting only coding mutations for WGS, it seems like the denominator for TMB should reflect that. If we are intentionally only trying to look at the coding mutations for WGS so that it is more comparable to WES, then that makes sense, but I'm wondering why we would still want to use the whole genome size as a denominator. Or do we? |
Hmm, that's a good point - I would expect to see intergenic, noncoding mutations as well - @yuankunzhu and @migbro, do you know if the pipelines selectively retained only coding variants? If we only look at coding mutations, then yes, we would only use those within the coding genome/WES regions (as @kgaonkar6 mentioned) as the denominator. |
Also, @cansav09, looking at the figure example, it does look like they restrict to coding SNVs. Your plot looks like you are on the right track, so maybe these new calculations will help - we expect to see a lower TMB in pediatric samples compared to adult. If you also wanted to order the x-axis by median for the groups and add the y-intercepts for 2, 10, and 100 muts/Mb, then we can mimic that figure and assess hypermutated samples in our cohort! |
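For the plotting suggestion above, here is a minimal sketch of the two pieces of logic involved: ordering groups by median TMB for the x-axis, and checking which of the 2, 10, and 100 muts/Mb reference lines a sample reaches. It deliberately leaves out the plotting itself; the group names and TMB values are invented, and the function names are hypothetical.

```python
from statistics import median

# Reference lines suggested above, in mutations per Mb.
THRESHOLDS = (2, 10, 100)


def order_groups_by_median(tmb_by_group):
    """Return group names sorted by ascending median TMB."""
    return sorted(tmb_by_group, key=lambda group: median(tmb_by_group[group]))


def highest_threshold_reached(tmb, thresholds=THRESHOLDS):
    """Return the largest reference line that `tmb` meets or exceeds, else None."""
    reached = [t for t in thresholds if tmb >= t]
    return max(reached) if reached else None


# Invented example data: group label -> per-sample TMB values.
example_groups = {
    "group_A": [0.4, 0.9, 1.5],
    "group_B": [0.1, 0.2, 0.3],
    "group_C": [1.0, 12.0, 150.0],
}
x_axis_order = order_groups_by_median(example_groups)
```

The resulting order can be passed to whatever plotting library is in use as the category order, with horizontal lines drawn at each value in `THRESHOLDS`.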
Sounds good, @jharenza. Should I just make a coding BED file from the hg38 reference genome, or is this something you guys have prepared and can share with me? |
I think you can safely use the one listed above, provided by @kgaonkar6: select variants in the regions within that BED, and use that BED's size. Sorry, the fact that we did this previously escaped me! |
@cansav09 @jharenza , this depends on what you are using as your input data. If you got it from Ped cBioportal, the import process basically removes intergenic variants. I'd suggest using the unaltered mafs. Strelka2 used these intervals for WGS calls:
Mutect2, based on their recommended calling regions, with the addition of chromosome M: wgs_canonical_calling_regions.hg38.txt |
Perfect! Thanks, @migbro, @jharenza, and @kgaonkar6, all this information helps a lot! Will hopefully have an updated plot tomorrow. |
One more question, any idea where I could find the coding regions used for TCGA WES samples? I’ve been digging around a bit and haven’t figured out what the best approach is to get the information. |
@cansavvy you can definitely plot on log10 scale! This is nice - there are some known HGAT that are hypermutated. |
Cool! A couple other questions, @jharenza, to calculate TMB for the TCGA samples, I used the same genome windows and size I did for the PBTA samples. I filtered out any mutations that were not in the WGS windows used for TCGA, however all of them turned out to be inside the windows so the filter was irrelevant. Two questions:
|
Hi @cansavvy!
|
@jharenza : TCGA is complicated as noted by @allisonheath in #3 (comment). My impression is that the MC3 calls, which I understand based on the conversation above to be what we're using, follow the BED file here: https://gdc.cancer.gov/about-data/publications/mc3-2017 If we're using MC3, then we should use that interval. It doesn't matter that we're only looking at brain tumors if MC3 only called in those regions anyway. It would be good to check that all of the calls that we're getting from TCGA are in fact in those regions @cansavvy. |
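The check suggested above (are all calls inside the called regions?) could be sketched as follows. This assumes BED intervals are 0-based with an exclusive end and that call positions are 1-based, as MAF `Start_Position` is; the function names and example coordinates are hypothetical.

```python
from bisect import bisect_right


def build_index(intervals):
    """Merge intervals per chromosome; return chrom -> (starts, ends) lists."""
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        starts, ends = [], []
        for start, end in sorted(ivs):
            if ends and start <= ends[-1]:
                # Overlapping or adjacent: extend the previous merged interval.
                ends[-1] = max(ends[-1], end)
            else:
                starts.append(start)
                ends.append(end)
        index[chrom] = (starts, ends)
    return index


def in_regions(index, chrom, pos):
    """True if the 1-based position `pos` falls inside any interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    pos0 = pos - 1  # convert to 0-based to match BED coordinates
    i = bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]


def calls_outside_regions(index, calls):
    """Return the (chrom, pos) calls that are not covered by the regions."""
    return [(c, p) for c, p in calls if not in_regions(index, c, p)]


# Made-up regions and calls for illustration.
regions = build_index([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)])
outside = calls_outside_regions(regions, [("chr1", 101), ("chr1", 301), ("chr3", 5)])
```

An empty `outside` list would indicate that every call lies within the BED, which is the outcome reported earlier in the thread for the windows that were tried.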
The core of this analysis has been done. The next question for this analysis is whether to use/apply the molecular subtyping labels to these data. This issue has been tracked here on #335 so we can revisit that when molecular subtyping tickets have all been addressed. |
The analyses described in this issue can be found in: See Tumor Mutation Burden Calculation section of the |
@cansavvy @jaclyn-taroni @jashapiro @yuankunzhu - commenting on this issue with some additional background I found, from one of the earlier papers that identified an association between PMS2 mutations and hypermutation. Of note, they do additional filtering, and I am not yet sure (I have to read further) whether TCGA did the same (which may be a reason we see higher TMBs in pediatric samples with some algorithms). Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden Link Methods Results Conclusions TMB calculations
WES of 29 samples which also had panel seq TCGA |
For the PPTC PDX paper, we had removed germline variants by way of a panel of normals, since we had tumor-only samples. However, we do mention in the paper that our TMBs are a bit higher because we likely did not remove all of the germline variants in that regard. In OpenPBTA, we should be removing these with the paired normal, but I also wonder if more are coming through for any reason. |
@jharenza file a new issue that uses either the proposed analysis or updated analysis template and link to this one please. The additional structure in those templates has proven to be helpful. |
ok sure - saw it was the one referenced in the PR, so I plopped that info here. I can create a new one :) |
We have planned to include an analysis of tumor mutation burden as well as a comparison to the tumor mutation burden of adult tumors in TCGA. This issue is for fleshing out the analysis. The manuscript issue for the methods description is: AlexsLemonade/OpenPBTA-manuscript#7