This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Analysis: Tumor Mutation Burden #3

Closed
cgreene opened this issue Jul 11, 2019 · 48 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! snv Related to or requires SNV data

Comments

@cgreene
Collaborator

cgreene commented Jul 11, 2019

We have planned to include an analysis of tumor mutation burden as well as a comparison to the tumor mutation burden of adult tumors in TCGA. This issue is for fleshing out the analysis. The manuscript issue for the methods description is: AlexsLemonade/OpenPBTA-manuscript#7

@cansavvy
Collaborator

This may be a simplistic question, but in our case is TMB = median(# coding indels + # coding base substitutions) / megabase? Second question: I am presuming we need to do some kind of analysis that looks at differences between synonymous and nonsynonymous mutations? This perhaps may need to be related to age at diagnosis?

@cgreene
Collaborator Author

cgreene commented Jul 29, 2019

@PichaiRaman : I think this came from you - how were you planning to do this analysis?

@jharenza
Collaborator

@cansav09 I think we are looking for something like this - Figure 1 from the Grobner, 2018 landscape, but for brain tumors:
[Image: Figure 1 from Gröbner et al., 2018]

For reference, here is the second pan-pediatric cancer landscape from 2018:
Ma, 2018

@jharenza jharenza reopened this Jul 30, 2019
@cansavvy
Collaborator

cansavvy commented Aug 6, 2019

Now that I've gotten somewhat familiar with the SNV data, I can do this analysis as well. It looks like Grobner says they used the number of coding SNVs per Mb? Just to clarify, does this mean we include synonymous and non-synonymous mutations alike in this number? They don't mention distinguishing between these in their Methods, but I've seen other papers that look into it.

Here's my tentative plan, let me know what you think:

Step 1) Start with Mutect2 results only. (Once I make the script for this, and then later after #30 is determined, I can easily go back and change it).
Step 2) Calculate total number of mutations per Mb
Step 3) Graph this in a jitterplot like above with the disease type as the x axis.
Step 4) Split this out into a jitterplot that examines nonsynonymous vs synonymous.
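Steps 1-2 above could be sketched roughly as below. This is a hypothetical illustration, not the project's actual code: the `Tumor_Sample_Barcode` column name follows MAF conventions, and the 3,100 Mb genome size is a placeholder assumption for WGS.

```python
# Hedged sketch of "count mutations per sample, divide by genome size in Mb".
from collections import Counter

GENOME_SIZE_MB = 3100  # assumed WGS callable size; a real denominator is discussed below

def mutations_per_mb(maf_rows, genome_size_mb=GENOME_SIZE_MB):
    """Return {sample: mutations/Mb} from an iterable of MAF-like dicts."""
    counts = Counter(row["Tumor_Sample_Barcode"] for row in maf_rows)
    return {sample: n / genome_size_mb for sample, n in counts.items()}

rows = ([{"Tumor_Sample_Barcode": "S1"}] * 620
        + [{"Tumor_Sample_Barcode": "S2"}] * 3100)
tmb = mutations_per_mb(rows)
print(tmb["S2"])  # 1.0 mutations/Mb
```

The per-sample values would then feed the jitterplot in step 3.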

@jharenza
Collaborator

jharenza commented Aug 6, 2019

Hi @cansav09! I think @afarrel may have some input here, as he has worked with a lot of TMB data. People do this many different ways. I had previously used only non-synonymous mutations, to be reflective of somatic burden. Looks like there is a harmonization effort here, but I'm not sure what they have decided. One other note: if you use coding mutations only, then you should calculate the size of the coding region in Mb from the WGS/WES used for this. Currently, I think much of the data you have was derived from WGS, but some incoming data will be WES, and we can provide the WES capture region for this.

I think #s1-4 look good!

@cansavvy
Collaborator

cansavvy commented Aug 7, 2019

Looks like there is a harmonization effort here, but not sure what they have decided.

This is interesting. Glad someone is working on making things uniform, but yeah, hard to tell what the conclusion is from this.

@cansavvy
Collaborator

I think the first part of this analysis will be determining if there are differences between different methods of calculations of TMB.

My thought is I should calculate TMBs and compare for the following 2 sets of variables:

  1. Coding mutations only vs all mutations.
  2. SNVs + CNVs vs only using SNVs - this however requires that we have an integration of CNV data - Planned Analysis: Integrated CNV and SV analyses and chromothripsis #27 (Let me know if there is something I can do to help with this).

So this would result in four different methods of calculating TMBs for each sample. All being Mutations per Mb.

|             | All mutations | Coding variants only |
|-------------|---------------|----------------------|
| SNVs + CNVs |               |                      |
| SNVs only   |               |                      |

After calculating these TMBs I will test the differences/similarities of the calculations by:

  1. Correlating samples' TMB calculations (probably both Spearman's and Pearson's).
  2. Comparing TMB distributions (as well as comparing means and medians).

How does this sound?
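The correlation step of that plan could be sketched as below; the TMB values are invented for illustration, and this pure-Python version (no tie handling in the ranking) stands in for whatever correlation routine the analysis actually uses.

```python
# Sketch of comparing two TMB calculation methods with Pearson and Spearman.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r  # note: no tie handling in this sketch

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

tmb_coding_only = [0.2, 0.5, 1.1, 3.4, 9.8]    # made-up per-sample TMBs, method A
tmb_all_mutations = [0.9, 1.6, 3.0, 8.8, 25.0]  # made-up per-sample TMBs, method B
print(round(spearman(tmb_coding_only, tmb_all_mutations), 3))  # 1.0
```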

@jharenza
Collaborator

TMB typically does not include CNV data, so I would recommend against using it. Otherwise, I would add to compare to adult tumors, as mentioned above in the one reference.

@afarrel

afarrel commented Aug 23, 2019

Just to reiterate what @jharenza said: typically we include SNVs and small indels in TMB calls. For TMB, you need to define whether you are counting all SNVs or only non-synonymous mutations. In some of our applications, we perform TMB calls using tumor-only methods because we don't have matched normal data, and we exclude synonymous mutation calls to reduce false positives and better reflect somatic mutation burden.
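One hedged way to restrict to non-synonymous calls before counting, using standard MAF `Variant_Classification` values; the exact set of classes to keep is a project decision, not something this thread prescribes.

```python
# Hypothetical filter: keep only non-synonymous MAF records.
NONSYNONYMOUS = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins",
    "In_Frame_Del", "In_Frame_Ins", "Splice_Site",
}

def keep_nonsynonymous(maf_rows):
    return [r for r in maf_rows if r["Variant_Classification"] in NONSYNONYMOUS]

rows = [
    {"Variant_Classification": "Missense_Mutation"},
    {"Variant_Classification": "Silent"},            # synonymous, dropped
    {"Variant_Classification": "Frame_Shift_Del"},
]
print(len(keep_nonsynonymous(rows)))  # 2
```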

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Sep 8, 2019
@cansavvy
Collaborator

cansavvy commented Sep 9, 2019

Is there a more exact way to determine the number of megabases sequenced per sample? I can't find anyone's exact code for determining TMB, so I am unsure what number(s) to use for the "per coding megabase" part.

@cansavvy
Collaborator

cansavvy commented Sep 9, 2019

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done? And what other filters may have been applied to this data? Can we see the command that was given to run Strelka2? I did notice the methods so far say that Strelka2 was run using best practices so I will investigate that a bit more to see if anything there makes more sense of the data.

@jharenza
Collaborator

jharenza commented Sep 9, 2019

Is there a more exact way to determine the number of megabases sequenced per sample? I can't find anyone's exact code for determining TMB, so I am unsure what number(s) to use for the "per coding megabase" part.

For WES data, it is calculated using the bed regions file and for WGS, you can use the size of the whole genome - enlisting @migbro for this reference file and a size, as I just noticed that the links to the code are not in the manuscript...

@jharenza
Collaborator

jharenza commented Sep 9, 2019

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done? And what other filters may have been applied to this data? Can we see the command that was given to run Strelka2? I did notice the methods so far say that Strelka2 was run using best practices so I will investigate that a bit more to see if anything there makes more sense of the data.

The only filtering done was PASS for variants. The somatic workflows are here.

@cansavvy
Collaborator

For WGS, will we need to calculate the percent of the reference genome covered in order to get an exact genome size in Mb? Or do we just use a general approximation? Are the bed regions files already prepared previously?

Also, I've set up the TCGA data, and I will generally attempt to calculate TMB the same way. Do you have an idea of what is the easiest way to obtain genome sizes for the TCGA samples? Has this been calculated elsewhere?

@cansavvy
Collaborator

cansavvy commented Sep 11, 2019

Here's a preview of the data as it is now, before dividing by genome size. I'm not sure which disease labels I should be using for both the TCGA and PBTA data.

For the TCGA data, I have only used brain tumors, so hopefully it's more comparable.

log2_coding_mutations.pdf

@cansavvy
Collaborator

[Screenshot of the preview plot]

@kgaonkar6
Collaborator

Hi @cansav09! I've previously done TMB analysis for a subset of our samples for a paper (https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656587.full.pdf). I calculated TMB = total variants in coding region / length of exome BED (bp) * 1,000,000, where "variants in coding region" are variants overlapping an inclusive coding region file (https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg38/bed/Exome-AZ_V2.bed), and the calculated length of the coding region for this BED file was 159,697,302 bp.
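That formula, written out as a minimal sketch (the function name is invented; the 159,697,302 bp length is the one quoted above for the Exome-AZ_V2 BED):

```python
# TMB = coding variants / coding-region length (bp) * 1e6
CODING_BED_BP = 159_697_302  # length quoted above for Exome-AZ_V2.bed

def tmb(n_coding_variants, region_bp=CODING_BED_BP):
    """Mutations per megabase over the coding BED."""
    return n_coding_variants / region_bp * 1_000_000

print(round(tmb(320), 3))  # 2.004 mutations/Mb for 320 coding variants
```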

@cansavvy
Collaborator

I've previously done TMB analysis for a subset of our samples for a paper (https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656587.full.pdf). I calculated TMB = total variants in coding region / length of exome BED (bp) * 1,000,000, where "variants in coding region" are variants overlapping an inclusive coding region file (https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg38/bed/Exome-AZ_V2.bed), and the calculated length of the coding region for this BED file was 159,697,302 bp.

@kgaonkar6 Thanks so much for the code and paper link! This is super helpful! So am I correct in saying that you used the same fixed coding-region length as the denominator for all your samples? (As opposed to calculating the percent of the reference exome that each sample actually had covered and then obtaining a per-sample megabase count.)

@cansavvy
Collaborator

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done?

The only filtering done was PASS for variants. The somatic workflows are here.

I'm trying to look through the code to determine why the WGS samples have almost no intergenic mutations (there are 2 across all the samples), but it isn't clear to me yet.

The reason I want to know this: for determining the genome size in Mb to divide by, it seems like we would only want to use the size of the whole genome as the denominator for TMB if we also had a chance of finding intergenic mutations in this workflow. That is, if the workflow is intentionally selecting only coding mutations for WGS, it seems like the denominator for TMB should reflect that.

If we are intentionally looking only at coding mutations for WGS so that it is more comparable to WES, that makes sense, but then I'm wondering why we would still want to use the whole genome size as a denominator. Or do we?

@jharenza
Collaborator

Hmm, that's a good point - I would expect to see intergenic, noncoding mutations as well - @yuankunzhu and @migbro, do you know if the pipelines selectively retained only coding variants?

If we only look at coding mutations, then yes, we would only use those within the coding genome/WES regions (as @kgaonkar6 mentioned) as the denominator.

@jharenza
Collaborator

Also, @cansav09, looking at the figure example, it does look like they restrict to coding SNVs. Your plot looks like you are on the right track, so maybe these new calculations will help - we expect to see a lower TMB in pediatric samples compared to adult ones. If you also order the x-axis by each group's median and add horizontal lines at 2, 10, and 100 muts/Mb, then we can mimic that figure and assess hypermutated samples in our cohort!
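The suggested x-axis ordering (groups sorted by median TMB) could be computed like this; the disease groups and values are invented, and the jitter plot itself with reference lines at 2, 10, and 100 muts/Mb is left to whatever plotting library the analysis uses.

```python
# Order disease groups by median TMB for the x-axis of the jitterplot.
from statistics import median

tmb_by_group = {           # hypothetical per-group TMB values
    "Medulloblastoma": [0.1, 0.3, 0.2],
    "HGAT": [1.5, 90.0, 3.2],
    "Ependymoma": [0.4, 0.6, 0.5],
}

order = sorted(tmb_by_group, key=lambda g: median(tmb_by_group[g]))
print(order)  # ['Medulloblastoma', 'Ependymoma', 'HGAT']
```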

@cansavvy
Collaborator

cansavvy commented Sep 11, 2019

Sounds good, @jharenza. Should I just make a coding BED file from the hg38 reference genome, or is this something you all have prepared and can share with me?

@jharenza
Collaborator

I think you can safely use the one listed above, provided by @kgaonkar6: select variants in the region within that BED, and use that BED's size. Sorry, the fact that we did this previously escaped me!

@migbro
Contributor

migbro commented Sep 11, 2019

@cansav09 @jharenza , this depends on what you are using as your input data. If you got it from Ped cBioportal, the import process basically removes intergenic variants. I'd suggest using the unaltered mafs. Strelka2 used these intervals for WGS calls:

chr1	0	248956422
chr2	0	242193529
chr3	0	198295559
chr4	0	190214555
chr5	0	181538259
chr6	0	170805979
chr7	0	159345973
chr8	0	145138636
chr9	0	138394717
chr10	0	133797422
chr11	0	135086622
chr12	0	133275309
chr13	0	114364328
chr14	0	107043718
chr15	0	101991189
chr16	0	90338345
chr17	0	83257441
chr18	0	80373285
chr19	0	58617616
chr20	0	64444167
chr21	0	46709983
chr22	0	50818468
chrX	0	156040895
chrY	0	57227415
chrM	0	16569

Mutect2, based on their recommended calling regions, with the addition of chromosome M:

wgs_canonical_calling_regions.hg38.txt
It's actually a bed file, but I gave it the txt extension so that this platform would allow the upload.
Hope this helps!
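Turning an interval list like the one above into a denominator amounts to summing the (end - start) lengths of the BED lines; a minimal sketch (BED intervals are 0-based, half-open, so end minus start is the interval length):

```python
# Sum BED interval lengths to get the callable size in bp.
bed_text = """\
chr1\t0\t248956422
chr2\t0\t242193529
chrM\t0\t16569
"""

def bed_size_bp(text):
    total = 0
    for line in text.strip().splitlines():
        chrom, start, end = line.split("\t")
        total += int(end) - int(start)
    return total

print(bed_size_bp(bed_text))  # 491166520 bp for these three contigs
```

Dividing by 1,000,000 gives the size in Mb for the TMB denominator.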

@cansavvy
Collaborator

Perfect! Thanks, @migbro, @jharenza, and @kgaonkar6, all this information helps a lot! Will hopefully have an updated plot tomorrow.

@cansavvy
Collaborator

One more question, any idea where I could find the coding regions used for TCGA WES samples? I’ve been digging around a bit and haven’t figured out what the best approach is to get the information.

@cansavvy
Collaborator

cansavvy commented Sep 12, 2019

Also, @migbro or @jharenza is it possible to get the unaltered mafs in v5 of these data?

@cansavvy
Collaborator

cansavvy commented Nov 8, 2019

[Screenshot of the updated plot]

Here's the updated plot. There are some outliers of sorts. Is it acceptable to plot this on a log scale? Or what do you recommend?

@jharenza
Collaborator

jharenza commented Nov 8, 2019

@cansavvy you can definitely plot on log10 scale! This is nice - there are some known HGAT that are hypermutated.

@cansavvy
Collaborator

cansavvy commented Nov 8, 2019

Cool! A couple of other questions, @jharenza: to calculate TMB for the TCGA samples, I used the same genome windows and size as I did for the PBTA samples. I filtered out any mutations that were not in the WGS windows used for TCGA; however, all of them turned out to be inside the windows, so the filter had no effect.

Two questions:

  1. Is this method okay for calculating the TCGA TMB? Or do you suggest something else like using some other genome window for TCGA data?
  2. All the mutations end up being within the WGS_strelka2 window, so do you still want me to keep in the filter steps? Or can I just write up this explanation and remove the code since no mutations are actually being filtered out?

@jharenza
Collaborator

jharenza commented Nov 8, 2019

Hi @cansavvy!

  1. I think you can use the same window in both, especially since we do not have access to the TCGA interval files. The only exception I can think of is if TCGA is WES (is it WES or WGS? I probably should know this); in that case we would not use a whole-genome size for those samples - we would have to use a window and subset to mutations in that window. Otherwise we would underestimate TMB, since we would be missing all noncoding TCGA mutations while using the whole genome as the denominator.
  2. Probably good practice to leave the filter in, but I defer to 1 regarding the window.

@cgreene
Collaborator Author

cgreene commented Nov 9, 2019

@jharenza : TCGA is complicated as noted by @allisonheath in #3 (comment).

My impression is that the MC3 calls, which I understand based on the conversation above to be what we're using, follow the BED file here: https://gdc.cancer.gov/about-data/publications/mc3-2017

If we're using MC3, then we should use that interval. It doesn't matter that we're only looking at brain tumors if MC3 only called in those regions anyway. It would be good to check that all of the calls that we're getting from TCGA are in fact in those regions @cansavvy.
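That sanity check (every TCGA call falls inside the MC3 BED regions) could be sketched as below. This is a hypothetical helper, not project code; it assumes 1-based variant positions and 0-based half-open BED intervals, per the usual conventions.

```python
# Check variant positions against sorted BED intervals with binary search.
import bisect

def build_index(bed_intervals):
    """bed_intervals: iterable of (chrom, start, end) -> {chrom: (starts, ends)}"""
    by_chrom = {}
    for chrom, start, end in sorted(bed_intervals):
        starts, ends = by_chrom.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return by_chrom

def in_regions(index, chrom, pos):
    """pos is 1-based; intervals are 0-based half-open."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect.bisect_right(starts, pos - 1) - 1
    return i >= 0 and pos - 1 < ends[i]

idx = build_index([("chr1", 100, 200), ("chr1", 500, 600)])
print(in_regions(idx, "chr1", 150), in_regions(idx, "chr1", 300))  # True False
```

Running every TCGA call through `in_regions` and flagging any `False` would confirm (or refute) that the MC3 interval is the right denominator.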

@cansavvy
Collaborator

The core of this analysis has been done. The next question for this analysis is whether to use/apply the molecular subtyping labels to these data. This issue has been tracked here on #335 so we can revisit that when molecular subtyping tickets have all been addressed.

@jaclyn-taroni
Member

The analyses described in this issue can be found in: analyses/snv-callers for TMB calculation and analyses/tmb-compare-tcga for the comparison to TCGA.

See Tumor Mutation Burden Calculation section of the analyses/snv-callers README for the summary of how TMB is calculated for this project.

Any remaining points have now been subsumed by #257, #335.

@jharenza
Collaborator

jharenza commented Feb 25, 2020

@cansavvy @jaclyn-taroni @jashapiro @yuankunzhu - commenting on this issue with some additional background I found, from one of the earlier papers that identified an association between PMS2 mutations and hypermutation. Of note, they do additional filtering, which I am not yet sure (I still have to read) whether TCGA had done - this may be a reason we see higher TMBs in pediatric samples with some algorithms.

Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden
Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 2017 Apr 19;9(1):34.

Link
Abstract
Background
High tumor mutational burden (TMB) is an emerging biomarker of sensitivity to immune checkpoint inhibitors and has been shown to be more significantly associated with response to PD-1 and PD-L1 blockade immunotherapy than PD-1 or PD-L1 expression, as measured by immunohistochemistry (IHC). The distribution of TMB and the subset of patients with high TMB has not been well characterized in the majority of cancer types.

Methods
In this study, we compare TMB measured by a targeted comprehensive genomic profiling (CGP) assay to TMB measured by exome sequencing and simulate the expected variance in TMB when sequencing less than the whole exome. We then describe the distribution of TMB across a diverse cohort of 100,000 cancer cases and test for association between somatic alterations and TMB in over 100 tumor types.

Results
We demonstrate that measurements of TMB from comprehensive genomic profiling are strongly reflective of measurements from whole exome sequencing and model that below 0.5 Mb the variance in measurement increases significantly. We find that a subset of patients exhibits high TMB across almost all types of cancer, including many rare tumor types, and characterize the relationship between high TMB and microsatellite instability status. We find that TMB increases significantly with age, showing a 2.4-fold difference between age 10 and age 90 years. Finally, we investigate the molecular basis of TMB and identify genes and mutations associated with TMB level. We identify a cluster of somatic mutations in the promoter of the gene PMS2, which occur in 10% of skin cancers and are highly associated with increased TMB.

Conclusions
These results show that a CGP assay targeting ~1.1 Mb of coding genome can accurately assess TMB compared with sequencing the whole exome. Using this method, we find that many disease types have a substantial portion of patients with high TMB who might benefit from immunotherapy. Finally, we identify novel, recurrent promoter mutations in PMS2, which may be another example of regulatory mutations contributing to tumorigenesis.

TMB calculations
Panel
TMB was defined as the number of somatic, coding, base substitution, and indel mutations per megabase of genome examined. All base substitutions and indels in the coding region of targeted genes, including synonymous alterations, are initially counted before filtering as described below. Synonymous mutations are counted in order to reduce sampling noise. While synonymous mutations are not likely to be directly involved in creating immunogenicity, their presence is a signal of mutational processes that will also have resulted in nonsynonymous mutations and neoantigens elsewhere in the genome. Non-coding alterations were not counted. Alterations listed as known somatic alterations in COSMIC and truncations in tumor suppressor genes were not counted, since our assay genes are biased toward genes with functional mutations in cancer. Alterations predicted to be germline by the somatic-germline-zygosity algorithm were not counted. Alterations that were recurrently predicted to be germline in our cohort of clinical specimens were not counted. Known germline alterations in dbSNP were not counted. Germline alterations occurring with two or more counts in the ExAC database were not counted. To calculate the TMB per megabase, the total number of mutations counted is divided by the size of the coding region of the targeted territory. The nonparametric Mann–Whitney U-test was subsequently used to test for significance in difference of means between two populations.

Alterations predicted to be germline by the somatic-germline-zygosity algorithm were not counted. Alterations that were recurrently predicted to be germline in our cohort of clinical specimens were not counted. Known germline alterations in dbSNP were not counted. Germline alterations occurring with two or more counts in the ExAC database were not counted.
Here, they remove germline or predicted-germline alterations using dbSNP annotations and ExAC frequencies.

WES of 29 samples which also had panel seq
WES was performed on 29 samples as previously described for which CGP had also been performed. Briefly, tumors were sequenced using Agilent’s exome enrichment kit (Sure Select V4; with >50% of baits above 25× coverage). The matched blood-derived DNA was also sequenced. Base calls and intensities from the Illumina HiSeq 2500 were processed into FASTQ files using CASAVA. The paired-end FASTQ files were aligned to the genome (to UCSC’s hg19 GRCh37) with BWA (v0.5.9). Duplicate paired-end sequences were removed using Picard MarkDuplicates (v1.35) to reduce potential PCR bias. Aligned reads were realigned for known insertion/deletion events using SRMA (v0.1.155). Base quality scores were recalibrated using the Genome Analysis Toolkit (v1.1-28). Somatic substitutions were identified using MuTect (v1.1.4). Mutations were then filtered against common single-nucleotide polymorphisms (SNPs) found in dbSNP (v132), the 1000 Genomes Project (Feb 2012), a 69-sample Complete Genomics data set, and the Exome Sequencing Project (v6500). <- extra filtering step

TCGA
TCGA data were obtained from public repositories. For this analysis, we used the somatic called variants as determined by TCGA as the raw mutation count. We used 38 Mb as the estimate of the exome size. For the downsampling analysis, we simulated the observed number of mutations/Mb 1000 times using the binomial distribution at whole exome TMB = 100 mutations/Mb, 20 mutations/Mb, and 10 mutations/Mb and did this for megabases of exome sequenced ranging from 0–10 Mb. Melanoma TCGA data were obtained from dbGap accession number phs000452.v1.p1.
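The downsampling analysis they describe (simulating observed mutations/Mb when sequencing only part of a 38 Mb exome, using a binomial draw) could be sketched like this; the function name and parameters are illustrative, not the paper's code.

```python
# Simulate observed TMB when sequencing mb_sequenced Mb of an exome_mb exome,
# given a true whole-exome TMB; each mutation lands in the panel with
# probability mb_sequenced / exome_mb.
import random

random.seed(0)  # deterministic for the example

def simulate_observed_tmb(true_tmb, mb_sequenced, exome_mb=38, n_sims=1000):
    total_mutations = round(true_tmb * exome_mb)
    p = mb_sequenced / exome_mb
    sims = []
    for _ in range(n_sims):
        observed = sum(random.random() < p for _ in range(total_mutations))
        sims.append(observed / mb_sequenced)
    return sims

sims = simulate_observed_tmb(true_tmb=10, mb_sequenced=1.1)
mean_tmb = sum(sims) / len(sims)  # close to the true TMB of 10, with wide spread
```

The spread of `sims` widens as `mb_sequenced` shrinks, which is the paper's point about panels below ~0.5 Mb.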

@jharenza
Collaborator

For the PPTC PDX paper, we removed germline variants by way of a panel of normals, since we had tumor-only samples. However, we do mention in the paper that our TMBs are a bit higher because we likely did not remove all of the germline variants that way. In OpenPBTA, we should be removing these with the paired normal, but I also wonder if more are coming through for some reason.

@jharenza jharenza reopened this Feb 25, 2020
@jaclyn-taroni
Member

jaclyn-taroni commented Feb 25, 2020

@jharenza file a new issue that uses either the proposed analysis or updated analysis template and link to this one please. The additional structure in those templates has proven to be helpful.

@jharenza
Collaborator

ok sure - saw it was the one referenced in the PR, so I plopped that info here. I can create a new one :)

cbethell added a commit to cbethell/OpenPBTA-analysis that referenced this issue Mar 5, 2020
- add `distinct` function where needed
- update comments
- fix function to display LGAT focal results
- rerun nb
jaclyn-taroni added a commit that referenced this issue Mar 13, 2020
* Add notebook comparing GISTIC cn calls to our focal cn calls

- Add `02-GISTIC-gene-level-cn-status-comparison.Rmd` to compare GISTIC's CN calls with the calls we prepared in the `focal-cn-file-preparation` module
- add analysis to shell script

* display full results table in nb and re-run

* remove filtering out of sample ids 

- first attempt to fix the inconsistency in final counts tables

* attempt 2 to fix data inconsistency in final table of results

- the focal CN calls for LGAT samples do not contain amplification calls (so the data.frame empty due to the way the data is being joined, this will be fixed in an upcoming commit)

* attempt #3 to fix data presented in results table

- add `distinct` function where needed
- update comments
- fix function to display LGAT focal results
- rerun nb

* use `inner_join` to merge GISTIC calls with our focal calls

- change `left_join` to `inner_join`

* Propagate changes from WIP PR #614

- fix functions formatting GISTIC data files in functions script
- change output_filename function arguments to output_filepath
- remove `distinct` functions (where I _believe_ they are not needed)
- rerun module

* Rework prepare_gene_level_gistic

* Propagate prepare_gene_level_gistic changes

* Rework tallying

* Some ideas about a heatmap visualization

* Rename notebook, comment out plotting for now

Some tweaks for CI, too

* Missing parenthesis

* Fix file name in shell script; add eval=FALSE, too

* Response to @cansavvy comments

* Update analyses/compare-gistic/util/GISTIC-comparison-functions.R

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Use same language for second TSV

Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Co-authored-by: Candace Savonen <cansav09@gmail.com>
@jharenza jharenza mentioned this issue May 19, 2022

9 participants