Skip to content

SebastianHollizeck/PhDThesis

Repository files navigation

Development of new methods for accurate estimation of tumour heterogeneity

Table of Contents

“Begin at the beginning,“ the King said, very gravely, “and go on till you come to the end: then stop.“

Introduction

This first introduction chapter contains all the necessary background information as well as an overview of the work discussed in this thesis. It summarises basic biological properties of DNA and cell biology as well as the respective technologies to read, analyse and measure these biological concepts and then how to evaluate the output of these methods. Section 1.1 delineates the role DNA plays for the cell and then section 1.2 shows how these standards are changed in the tumour and cell free context. Section 1.3 introduces the current technologies used to measure and detect DNA and its variations. Section 1.4 covers the computational methods developed to read out changes in the DNA. Then section 3.1.1 relates how these changes lead to cancer and what we can learn from them. The introduction concludes with section 1.6 as an overview of the aims of the thesis and my work in addressing them in the following chapters.

DNA as an information storage unit

Overview of DNA structure and the nucleobases, which form DNA strands. Nucleotides are split into Purines and Pyrimidines by the structure of the nitrogen ring; complementary pairing of bases is shown as shapes of the bases as well as with 2D structures; Hydrogen (H) bonds are shown as dotted lines; Phosphates are shown as P; 3’ and 5’ ends are defined by the internal number of the carbon atom of the sugar which is exposed; Adapted from “DNA structure“ by BioRender.com (2021) Retrieved from https://app.biorender.com/biorender-templates
Overview of DNA structure and the nucleobases, which form DNA strands. Nucleotides are split into Purines and Pyrimidines by the structure of the nitrogen ring; complementary pairing of bases is shown as shapes of the bases as well as with 2D structures; Hydrogen (H) bonds are shown as dotted lines; Phosphates are shown as P; 3’ and 5’ ends are defined by the internal number of the carbon atom of the sugar which is exposed; Adapted from “DNA structure“ by BioRender.com (2021) Retrieved from https://app.biorender.com/biorender-templates

It is a widely accepted fact, that Deoxyribonucleic acid (DNA) serves as the long term information storage molecule of our cells. This information is protected and allows correction of simple errors through its double helix structure (Watson and Crick 1953; Liang et al. 1998). The nucleotides, which consist of a deoxyribose sugar (hence the name), a phosphate group and the nitrogenous base, are joined together by phosphate groups. Even though there are six common naturally occurring nitrogenous bases: Adenine (A), Thymine (T), Guanine (G), Cytosine (C), Uracil (U) and nicotinamide, only the first four are used to encode the genetic information into DNA. Each of the strands mirrors the other, so that an adenine will be paired up with a thymine forming two hydrogen bonds. Similarly, cytosine will pair with guanine forming an even stronger connection with three hydrogen bonds. While other pairings which do not follow those rules are chemically possible, they are mostly observed in ribonucleic acid (RNA) (Sinden 1994). These very strict bonding rules enable the DNA to be similar to a hard drive with backup on a computer. And as only one strand contains all the information, the DNA polymerase enzyme does only need access to one strand, which allows parallel replication during cell division, but also error corrections, by proof reading the newly synthesised strand with the template. In order to be able to distinguish the two strands, they were assigned the names 3’ and 5’ depending on the numbering of the carbon atom in the sugar, which is exposed ([fig:DNAstructure]).

The entirety of the DNA encoding the organism is commonly called “the genome“ with all genes, which consist of introns and exons are called “the exome“. Unicellular organisms usually have a very small number of introns, which to current knowledge only provide limited information and are only responsible for structure. In vertebrates, introns as well as intergeneic DNA (the DNA between genes) contribute most of the DNA in the genome. For example in humans, only <math display="inline">1\%</math> of the genetic material is considered to be exonic, whereas introns contribute <math display="inline">\approx 24\%</math> and the rest is intergeneic (<math display="inline">\approx 75\%</math>) (Venter et al. 2001).

Structural overview of the metaphase condensed chromosome: DNA is first wrapped around Histones to form nucleosome, which then associate with each other to form the chromatin fiber, which in the metaphase of the cell cycle is condensed even more into the X-shaped chromosome
Structural overview of the metaphase condensed chromosome: DNA is first wrapped around Histones to form nucleosome, which then associate with each other to form the chromatin fiber, which in the metaphase of the cell cycle is condensed even more into the X-shaped chromosome

The DNA in eukaryotes however is not free floating around in the nucleus of a cell, but rather in most eukrayotic organisms, it is highly condensed and structured, first wrapped around nucleosomes like thread on a spool, then organised around histones, into either open (accessible) or closed chromatin, which then can be even further condensed into chromosomes, which have a X-like shape, with two shorter and two longer arms ([fig:chromsomestructure]). This allows some DNA to be accessible whereas the use of other areas can be restricted (Hammond et al. 2017). Through this restriction, the availability of certain genes, which are the sections of the DNA, which encode for short term storage molecules like RNA, can be controlled. This restriction plays an important role in cell fate and cell viability. Ultimately, all information stored to create a new highly complex organism is stored in just the DNA of one cell. Whichever parts are used and how they are used decides the function and the identity of the cell (Bonev and Cavalli 2016).

Individual chromosomes occupy a subspace in the nucleus called chromosome territories. Chromosome territories can be further partitioned to distinct A and B compartments, which are enriched for active and repressed chromatin, respectively. Genomic regions within topologically associating domains (TADs) display increased interactions, while their interactions with neighbouring regions outside the TADs are rather limited.
Individual chromosomes occupy a subspace in the nucleus called chromosome territories. Chromosome territories can be further partitioned to distinct A and B compartments, which are enriched for active and repressed chromatin, respectively. Genomic regions within topologically associating domains (TADs) display increased interactions, while their interactions with neighbouring regions outside the TADs are rather limited.

Even though the X-like structure is the most commonly used and known structure, the DNAs 3D structure is usually very different and only takes this shape for a very short time in the cell cycle. Most of the time, the chromosomes are unravelled into something resembling a ball of yarn, where the “open“ chromatin regions are on the outside and the “closed“ regions are “hidden“ in the inside and each chromosome establishes its own “territory“ inside the nucleus ([fig:chromosometerretories]). This structure allows another DNA cross over with non-sister chromosomes, which is called a chiasma.

Ploidy - it is good to have a backup, if you do it right

Similar to the already discussed organisation of DNA in two strands, another concept of data security, involving ploidy (the number of complete chromosome sets in a cell) is also implemented. Most eukaryotic organisms have at least two of each chromosome (diploid) with some species reaching up to septaploid (Tateoka 1975). However, this concept is not the only reason for the ploidy of somatic cells. For sexually reproducing organisms, at least a diploid set of chromosomes is necessary to enable information to be joined from both parents. Germline cells (sperm and egg) are generally monoploid, such that the resulting cell will be diploid. However, the ploidy of the somatic cells is not as uniform within a species, where it can vary between organisms based on gender or rank (Trivers and Hare 1976). In most organisms, a change in ploidy is fatal (Otto 2007) and only partial ploidy changes like extra copies of chromosome 17 (Gottlieb et al. 1962), chromosome 18 (Cereda and Carey 2012) and chromosome 21 (Hultén et al. 2008) are tolerated. These syndromes can occur when there is an uneven split of chromosomes during cell division.

However, this concept is not the only reason for the ploidy of somatic cell. The additional advantage, apart from sexual reproduction, is that a second almost identical copy of a chromosome allows repair of DNA, even when both strands are damaged, for example in a double strand break. In this case, the information from the sister chromosome will be used, by first cutting the double strand break ends to have an overhang (resection). This overhang will then merge with the sister chromosome’s mirrored strand. In this state, the two chromosomes are fused together in a Holliday junction, which allows the missing part from the resection and the double strand break to be repaired (Lilley 2000). During this process, which is part of the homology directed repair (HDR) machinery, the sister chromosomes exchange parts of their DNA, when resolving the Holliday junction. As these stretches of DNA do not need to be 100% identical, this plays an important role in evolution and diversity (Hanage, Fraser, and Spratt 2006; Kong et al. 2013).

“Fantastical“ mutations and where to find them

Even though the DNA is highly stable, and error correction methods are constantly working to not introduce any changes in the DNA, the source of evolution and adaptation of species relates to a steady mutation rate (Darwin 2010; Sprouffske et al. 2018). These changes in normal tissue, known as somatic mutations, are mostly irrelevant to the organism as a whole and will not be passed on to the next generation. Somatic mutations accumulate linearly over the course of the lifespan of the cell and are not bound to just cell divisions (Ludmil B. Alexandrov et al. 2015; Moore et al. 2021; Cagan et al. 2022). In contrast, if one of those mutations occurs in the germline cell, e.g. sperm or egg producing cells, these mutations will be propagated to all offspring and be present in all cells of that organism and in turn all its offspring. These mutations are called germline mutations. These mutations are also called germline variants, as they establish in the population and represent a variation of the organism. Mutations can also be classified depending on their size, ranging from single nucleotide polymorphisms (SNPs) to small insertions or deletions (InDels) and large structural changes, like the deletion of parts of or even a whole chromosome arm. As previously described with ploidy changes, usually smaller changes have less impact on the overall fitness of the organism, however even SNPs can lead to changes which are not compatible with life (Shamseldin et al. 2015; Frey et al. 2021).

Cell free DNA is more than just bits and pieces

When a cell from a multicellular organism dies, through which ever method, there will be numerous enzymes involved, which clear the debris and recycle material. This means that proteases digest proteins into amino acids, which will later be used for either building new proteins or possibly even digested further for energy production. The same happens with the DNA in the cell when it is released following cell death. However, as discussed in the previous [intro-sec:DNA] the DNA is wrapped around histones and organised in structures called nucleosomes. These protect the DNA from being cut by DNAases by hindering the access to the DNA, similar to how they stopped the access for transcription into RNA. This then in turn leads to the DNA being cut in the linker regions between nucleosomes into fragments mainly in the length of 167 base pairs (bp).

These DNA fragments, which are called cell free DNA (cfDNA), can then be readily detected in bodily fluids, like blood, urine or even stool. By analysing these fragments, non invasive tests for prenatal care have been made possible, as the DNA of the developing foetus can be detected in the mother’s blood (Dan et al. 2012; Nicolaides et al. 2013).

Similar to this process, a cancer also sheds DNA, titled “circulating tumour DNA“ (ctDNA), when its cells die, either through intervention of the immune system, cancer therapies or other processes. These ctDNA fragments can similarly be analysed and molecular properties measured, without even knowing the exact location of the tumour. As a blood test can be routinely performed in the clinic, the monitoring of cancer progression is significantly easier and safer than through other measures. Of course, it is, similar to the prenatal test, acting as a proxy for the cells which are still alive, which have have not yet shed their DNA. Additionally, the amount of shedded DNA is highly variable between tumours, with a general higher amount seen in later stages when tumour burden is high (Diehl et al. 2008; Schwarzenbach, Hoon, and Pantel 2011).

Fragment size distribution of 5 high grade serous ovarian carcinoma (HGSOC) patients and panel of healthy controls. The vertical dashed lines are placed on the fragment sizes between 52 and 172 bp where 10 bp periodicity is observed. The vertical lines at 240 and 324 bp show the range at which a shift in the di-nucleosomal peak occurs between HGSOC patients and healthy controls. The inset plot enlarges the density profile in the di-nucleosomal fragment length range. Figure adopted from Markus et al. (2022)
Fragment size distribution of 5 high grade serous ovarian carcinoma (HGSOC) patients and panel of healthy controls. The vertical dashed lines are placed on the fragment sizes between 52 and 172 bp where 10 bp periodicity is observed. The vertical lines at 240 and 324 bp show the range at which a shift in the di-nucleosomal peak occurs between HGSOC patients and healthy controls. The inset plot enlarges the density profile in the di-nucleosomal fragment length range. Figure adopted from Markus et al. (2022)

Due to the different biological processes which can lead to the release of DNA from cancer cells, in addition to apoptosis, which is the main source of cfDNA from healthy cells, ctDNA has several biological differences to cfDNA. The most prominent features observed are distinct ctDNA fragmentation size patterns, fragment start sites, and fragment ends motives. While the preferred end motive and start site are heavily correlated, restricting the analysis to the lower tail of the mono- and di-nuclesomal peak of the fragment size distribution (74-155bp and 240-325bp) allowed a ctDNA tumour signal enrichment of at least 28%. This is enrichment and the high periodicity of the distribution showed a high dependence on nucleosomal placement. All ctDNA features combined were shown to be able to predict the presence or absence of tumour DNA in a samples regardless of tumour type ([fig:ctdnaFragSize], (Mouliere et al. 2018; Markus et al. 2022)).

DNA sequencing - when is next generation sequencing the current generation?

As we know the building blocks that make DNA as well as the processes and the enzymes responsible, we can synthesise DNA in vitro. By chemically modifying the nucleotides supplied to the synthesis process, the sequence of the copied strand can be analysed. The first method to make use of this used the lambda phage to fuse known ends for the primers needed for the reaction to the piece of DNA and supplied labelled nucleotides (Padmanabhan, Jay, and Wu 1974). This method was then superseded by "Sanger sequencing" after Frederick Sanger who with colleagues published this method in 1977. Through adding dideoxynucleotides in a low concentration, the polymerase chain reaction would terminate trying to integrate these nucleotides and by labelling them radioactively or flourescently, a gel could then be used to determine the sequence of the piece of DNA (Sanger and Coulson 1975; Sanger, Nicklen, and Coulson 1977), which made the method better suited for larger scale projects.

However, this method had multiple issues for modern research questions. Mostly, that it was fairly labour intensive and time consuming to analyse multiple pieces of DNA at the same time, making it very challenging to sequence all the DNA of an organism. The human genome project, which was started in 1990 used machines which automated the Sanger sequencing procedure and it still took hundreds of researchers 13 years to complete the DNA sequence of just one human (Lander et al. 2001; Venter et al. 2001). Even though this was a very long project, it laid the ground work for the usage of the current sequencing technologies.

Library preparation - what we learned from using phages

Adapter ligation during library preparation. The adapters are added to the DNA insert during library preparation. A. The DNA insert is prepared by adding an A-tail and phosphorylation. B. The adapter complex which includes the P5/P7 flow cell binding adapter is added to the DNA insert. C. The DNA insert is ready for sequencing. D. The DNA insert binds to the flow cell for sequencing. Primers bind to the DNA insert to generate reads; Figure adapted from "How short inserts affect sequencing performance" (Illumina 2020)
Adapter ligation during library preparation. The adapters are added to the DNA insert during library preparation. A. The DNA insert is prepared by adding an A-tail and phosphorylation. B. The adapter complex which includes the P5/P7 flow cell binding adapter is added to the DNA insert. C. The DNA insert is ready for sequencing. D. The DNA insert binds to the flow cell for sequencing. Primers bind to the DNA insert to generate reads; Figure adapted from "How short inserts affect sequencing performance" (Illumina 2020)

Library preparation is the name of the preprocessing step, which is done before it is sequenced with the current technologies. The first step to sequence DNA is to obtain the DNA, which is done by lysing the cells of interest, which disrupts the cell membrane and therefore spills all its contents. The then spilled DNA is fragmented into smaller pieces, by either restriction enzymes or sonication, to have a size of about between 200-800bp. These steps are not necessary when preparing sequencing of ctDNA, as discussed in [intro-sec:ctDNA], as the DNA is unbound and already digested into short fragments. Once the DNA is ready, it is phosphoryelated and an A-tail is added, before the adapter complex is ligated. This enables the DNA to bind to the flow cell which is covered with the reverse complement of the adapter ([fig:libraryprep]).

Next generation sequencing

The Illumina sequencing-by-synthesis approach. Cluster strands created by bridge amplification are primed and all four fluorescently labeled, 3’-OH blocked nucleotides are added to the flow cell with DNA polymerase. The cluster strands are extended by one nucleotide. Following the incorporation step, the unused nucleotides and DNA polymerase molecules are washed away, a scan buffer is added to the flow cell, and the optics system scans each lane of the flow cell by imaging units called tiles. Once imaging is completed, chemicals that effect cleavage of the fluorescent labels and the 3’-OH blocking groups are added to the flow cell, which prepares the cluster strands for another round of fluorescent nucleotide incorporation; Figure adapted from Mardis (2008) (Mardis 2008)
The Illumina sequencing-by-synthesis approach. Cluster strands created by bridge amplification are primed and all four fluorescently labeled, 3’-OH blocked nucleotides are added to the flow cell with DNA polymerase. The cluster strands are extended by one nucleotide. Following the incorporation step, the unused nucleotides and DNA polymerase molecules are washed away, a scan buffer is added to the flow cell, and the optics system scans each lane of the flow cell by imaging units called tiles. Once imaging is completed, chemicals that effect cleavage of the fluorescent labels and the 3’-OH blocking groups are added to the flow cell, which prepares the cluster strands for another round of fluorescent nucleotide incorporation; Figure adapted from Mardis (2008) (Mardis 2008)

Next generation sequencing (NGS) is the coined term for basically any standard high-throughput sequencing performed, which includes exome, genome, transcriptome, protein-dna interactions (ChIP) and other epigenome studies. The term NGS is still widely used, even though it has been more than 10 years since the first NGS approach was commercially available. While in the beginning of next generation sequencing there were multiple approaches, the current lion share (80% of sequencing data) of protocols use the Illumina short read sequencing by synthesis approach ([fig:sequencingbysynthesis]) (Mardis 2008; Straiton et al. 2019), which is based on the concept of alternating integration of flourescently labelled nucleotides and imaging with a microscope ([fig:sequencingbysynthesis]) as well as multiplexing, where a DNA fragment is ligated to an index, which allows sequencing of multiple samples at once (G. M. Church and Gilbert 1984; G. Church and Kieffer-Higgins 1988) as it is shown in [fig:libraryprep]. This method enables highly accurate determination of the sequence of a DNA fragment and depending on the flow cell and sequencing machine allows one to sequence a whole genome in just 24h.

Long read sequencing - the "third" generation sequencing

By now, multiple methods which broke free of the size limitations of NGS exist, which are commonly referred to as long read sequencing. Most of the current methods trade the very high accuracy of the second generation NGS methods for the capability of sequencing huge continuous strands of DNA (current record 2.3 Million bp (Payne et al. 2018)) with normal library preparation ranging between 10-30 Kbp. These methods are expected to revolutionise our understanding of the highly repetitive elements that exist in the genome, such as the centromeres of chromosomes. Methods such as the direct molecule sequencing approach by Oxford Nanopore are even able to distinguish post transcriptional modifications on RNA (Pratanwanich et al. 2021). However, for ctDNA, which is highly fragmented, these methods offer only limited advantages over the short read sequencing.

DNA analysis - what to do with the sequence

The types of analysis that can be done with the output from the sequencing machine stretches far, however, all methods need to first infer the location in the genome, the sequenced piece of DNA originated from. As the current methods randomly fragment the DNA ([intro-sec:libraryprep]), the genomic location information is completely lost. This process is referred to as mapping.

Mapping - inferring genomic location of a read

In this process, the fragments of DNA, which were sequenced, are assigned a genomic coordinate on the reference genome. This is only possible due to the fact that we have resolved genome sequences ([intro-sec:sequencing]) for a high number of species. The location a sequenced piece of DNA matches to the reference genome might be unique, but it could also align to multiple locations, due to highly repetitive regions or due to the existence of pseudo genes with almost 100% identity concordance. In addition to this, the reference genome might not accurately reflect the genome of the organism that has been sequenced. Each mapping position is therefore assigned a quality score, which reflects how likely it is to be the actual position of the sequence. As Illumina sequencers have the ability to sequence both ends of the DNA fragment, the position of the ends (read 1 and read 2) relative to each other can also be used to infer the quality, as they should be within a reasonable distance to each other ([fig:libraryprep])

As this process is time-consuming and the exact location of the fragment might not be as important, there exists a subset of tools called pseudo-mappers, which are based on <math display="inline">k</math>-mers, which are predefined DNA sequences of length <math display="inline">k</math>, which help to identify certain regions of interest. These tools are especially common for RNAseq, where the exact location of a read does not matter, only that the read is within a gene (Bray et al. 2016; Patro et al. 2017), but also for methods that estimate similarity between sequences (DNA, RNA or protein) (Ondov et al. 2016; Luczak, James, and Girgis 2017).

For this work however, the exact position of reads is crucial, so only precise mapping methods like BWA (Heng Li 2013) or Bowtie 2 (Langmead et al. 2018), which are optimised for short reads from Illumina systems, provide the necessary functions.

Variant calling - “Spotting the difference“

As intra-species genetic variation is intended for adaptation and evolution, there will be places where the DNA sequence of the subject will differ from the sequence of the reference (see [intro-sec:mutations]). In the context of cancer, these variants can give insights into the development and progression of cancer, and treatment options for patients and can even be used to guide family planning. Depending on the type of variation that is of interest, a different set of computational methods are needed, as germline and somatic variants have different properties.

Germline variant calling - the cards you have been dealt at birth

The most common source of DNA used for germline variant analysis is the mono nuclear layer from the blood of the subject, but really almost any cell can be used for this process, as all cells in the organism will share all germline variants ([intro-sec:mutations]). The only important input on top of the DNA sequence from the sequencer are the reference genome of the organism, as all variant nomenclature is based on the reference and the ploidy of the organism ([intro-sec:ploidy]). The ploidy is key to understand the ranges of allele frequencies at which a variant can biologically occur. For example in a human diploid genome, germline variants can occur either in one or both chromosomes, which mean we assume reads should show an allele frequency of around 50% and 100%, whereas the hexaploid commercial wheat (Mayer et al. 2014) allele frequency for variants would be 16%, 33%, 50%, 66%, 0.83% and 100%. Due to the random sampling and possible sequencing errors, the observed allele frequencies may differ from the theoretical values. Most state of the art germline variant calling methods will also use haplotype reconstructions through de-Bruijn graphs, which features a remapping of reads in relation to each other (Garrison and Marth 2012; Lai et al. 2016; Kim et al. 2018; Benjamin et al. 2019; Cooke, Wedge, and Lunter 2021) where the original mapping location assigned by the aligner ([intro-sec:mapping]) is only used as a guideline. This allows one to resolve even complex haplotypes of the sample by not restricting the method to the linear setup of the reference genome.

Somatic variant calling - life is ever-changing

In contrast to germline variant calling, somatic variant calling methods cannot rely on allele frequency, as not all cells sequenced are expected to have the change in nucleotide. The allele frequency is instead a measure of the sub clonal size. A subclone is defined as the set of cells, which were derived from the cell, which originally acquired the somatic mutation. Depending on the selective advantage, random drift and also the time point when the variant was introduced, these clones can be very variable in size and therefore their contribution within the DNA sequenced could vary. As not all cells have the variants, the selection of the tissue for library preparation is very important, unlike for germline calling. In the setting of cancer, somatic variant calling is important in understanding the genetic landscape of caner samples, where the main question is, which changes are present in the tumour and which lead to disease.

The ideal scenario for tumour somatic variant calling is when a biopsy of the tumour as well as a normal sample of the patient is available. In most clinical cases, this will be the diagnostic biopsy as well as the mono nuclear layer from blood, so true somatic variants can be characterised ([intro-sec:germlinecalling]). These two samples are then analysed together and only changes that are only in the somatic tumour sample and not in the normal sample are reported. Even though this concept sounds simple, there are some pitfalls (GATK Team 2021b). First, there might be some tumour “contamination“ in the normal sample, which needs to be adjusted for (Kim et al. 2018; Taylor-Weiner et al. 2018). Second, there might be normal “contamination“ in the tumour sample, which means that not all cells in the tumour sample are actually tumour. This means that the signal of the tumour specific changes may be reduced and harder to find.

All of these issues are amplified in the scenario, when there is no “normal“ sample available, either because the patient did not consent, due to other medical issues, or because for diagnostic tests thereis often no need for a germline sample. In this case, there is the option for “tumour only“ variant calling, which utilises a database of germline variants in the population, to distinguish between somatic and germline variants, as the variant calling is very similar to just germline variant calling ([intro-sec:germlinecalling]) without the restriction of the ploidy. However, even with an extensive database like gnomAD (Karczewski et al. 2020) it is unlikely to be able to remove all germline variants from the analysis and as there is no direct comparison, the precision of the “tumour only“ method is significantly lower (Karimnezhad et al. 2020).

Cancer

For a long time in human history, the origin of cancer as a disease was a mystery and a multitude of theories, starting in ancient Egypt, were developed. These theories ranged from a curse to chemical imbalance over parasites to trauma. In this section I will outline both the history of cancer as a disease and the treatments starting with ancient times leading up right until the current times. While the first steps are very wide, because the biology itself was not understood, it is quite curious how often people with more knowledge came to worse conclusions and theories, than were already known thousands of years ago.

Around 3000BC the Egyptians describe the bulging tumour of the breast as an incurable disease (Breasted 1930), even then they already had some ointments, which were used, including resection, cauterisation and salting of the affected areas, all of which were still used up until the 19<math display="inline">^{th}</math> century (Steven I. Hajdu 2004). This papyrus document is considered the oldest evidence of cancer in humanity.

When the ancient Greeks laid the foundation for modern medicine with Hippocrates, the first hypothesis about natural causes of cancers was formulated and the terms “cancer“ and “carcinoma“ were coined. The abundance and accumulation of “black bile“ in the body was thought to be the cause of the cancers. However, the treatment was still the same as before, with resection and lotions (Chadwick and Mann 1950).

Following Hippocrates, the Roman physician Celsus progressed the understanding of cancers significantly, by describing metastatic relapse of treated breast cancer in neighbouring armpits and even the spread to distant organs. He also was aware, that the outcome of patients was better, if the tumours were removed early and aggressively (Celsus 1939).

With the destruction of the western Roman Empire, the Middle East became well known for their strong advances towards modern medicine and the court physician to the Emperor of Constantinople Aetius had success with the first total mastectomy and generally was an advocate of the total excision of tumours (Browne 2011).

Sadly, while both the understanding of cancer and the treatment were steadily improving, the Pope prohibited bloodshed as well as surgeries and this lead to a slow-down of advances, especially because autopsies were also forbidden a hundred years later in 1305. However, there were still illegal experiments conducted and the general classification which is still used currently was first proposed, by Henri de Mondeville, who started classifying tumours by their anatomical site (Pilcher 1895).

After the end of the “dark ages“, the wide availability of older medical works from both the Greeks and Romans due to the book print invention, led to the re-emergence of the use of chemical ointments and lotions on cancer lesions. Paracelsus laid the ground-work for the modern chemotherapy by promoting the usage of chemicals, which he himself warned were poisonous in the wrong concentration, for the treatment of cancer (Paracelsus (Bombastus von Hohenheim TPA) 1562).

When the dissection of corpses was no longer banned by the church, more and more cases of “hidden“ causes of death were found post mortem, which were often cancers on internal organs, like the brain. The detection of malignant versus benign tumours was also major breakthrough. This lead to the understanding, that benign tumours might turn malignant after some time and many physicians suggested removal of the benign growths (Severino 1632).

Due to the apparent genetic disposition of cancer, especially breast cancer, two independent physicians (Zacutus Lusitani and Nicholas Tulp) came to the conclusion, that cancer is contagious and proposed isolation of patients (Lusitani 1649; Tulpii 1652) which shows, that while the treatment of cancer was improving steadily, the origin of the disease was still a mystery. It took until 1700 when Deshaies-Gendron (1701) described cancer as a transformation of a normal body part, which continues to grow without control and while he was aware of metastatic disease, he suggested no treatment, as he did not believe cancer to be curable with drugs (Deshaies-Gendron 1701).

However, it took almost 150 years after the theory of cancer being contagious for Nooth (1804) to conduct experiments trying to infect himself with cancer pieces resected from another person, which proved that cancers generally are not infectious (Nooth 1804).

Another ground-breaking work published in the same year was the collection of almost three thousand autopsy reports and their clinical history, which contained a number of detailed cancer cases including: brain, head and neck, lung, breast, esophagus, stomach, colon, liver, pancreas, kidney, uterus, cervix, bladder, and prostate. Many of the terms used by Theophilus Boneti to describe the cases are still in use and the work itself was the first step toward tumour pathology (Steven I. Hajdu 2010a)

With the invention and consequently common use of the microscope in pathology, more and more causes of deaths were identified as caused by cancer. An example is the connection of a chronic cough to lung cancer and swollen joints with sarcoma (Etmüller 2018).

After more and more autopsies of cancer patients, surgeons like Heister (1747) found that breast cancer resection needs to include the breast, the axillary lymph node and the pectoralis major muscle which got to be known as the Halsted radical mastectomy and was the standard of care for a long time.

While the treatment of cancers (mostly surgical) was getting more and more advanced, the origin and cause of cancer in patients was still very much debated. As the aetiology of cancer is complex, as we now know, it is maybe not surprising that it took longer, but by the middle of the 18<math display="inline">^{th}</math> century chronic inflammation as a cause of cancer initiation was hypothesised (Steven I. Hajdu 2010b).

The next big step was taken, when in 1838 the concepts of cells as fundamental building blocks of organisms was published. In the following years, many cancers were dissected and microscopically analysed. This revealed that tumour cells look vastly different from normal cells, and it was thought that their morphological features could serve to identify their fate and became known for defining the cellular origin of benign and malignant tumours. And while Müller (1838) described the tumours as a collection of abnormal cells with stroma, he thought cancer to arise from newly generated cells from a diseased organ and thought the underlying cause to be “amorphous embryonal blastema“ (Müller 1838).

With this foundation, over the next hundred years, lots of advances were made into the morphology of different tumours and many previously undetected ones, like leukemia, were found and extensively characterised. However, even then, there were researchers, which understood that the heterogeneity of cancers is so vast, that while he was convinced that the microscope will be a mandatory instrument to diagnose cancers, more effort to collect and study specimen is necessary to have a complete picture (Bennett 1849).

As many shared the view of Bennett (1849), the second half of the 19<math display="inline">^{th}</math> century was a rich source of surgical pathology and the oncology literature in general. Most outstanding was Rudolf Virchow’s “Die krankhaften Geschwülste“ (Virchow 1863) which is a first landmark book on the classification of cancers, and is still a well of knowledge. From his work, the terms “hyperplasia“ and “metaplasia“ were derived, as pre-cancerous states of cells. He also was one of the first to hypothesise the presence of growth stimulating substances around cancers, which lead to their uncontrolled growth. Virchow (1863) was the first oppose the “amorphous embryonal blastema“ theory and instead was convinced that tumour cells were just abnormally changed cells, which he called “chronic irritation theory“ and believed that metastases were seeded by the original lesion (like in this melanoma [fig:drawing]). He also had major scientific impact in a number of other fields like Parasitology, Forensic and Anthropology[1].

Drawing of central nervous system metastasis from page 121, Volume 2 of “Die krankhaften Geschwülste“ Virchow (1863); translated original caption: Fig. 128: Multiple melanoma of the Pia mater basilaris, most pronounced around the Medulla oblongata, the Pons, the Fossa Sylvii, Fissura longit (sample No. 256a from 1858); Fig. 129: Lower end of spinal cord of Fig. 128 with multiple melanoma of the soft skin with node like growths at the nerve roots (sample No. 256b from 1858)
Drawing of central nervous system metastasis from page 121, Volume 2 of “Die krankhaften Geschwülste“ Virchow (1863); translated original caption: Fig. 128: Multiple melanoma of the Pia mater basilaris, most pronounced around the Medulla oblongata, the Pons, the Fossa Sylvii, Fissura longit (sample No. 256a from 1858); Fig. 129: Lower end of spinal cord of Fig. 128 with multiple melanoma of the soft skin with node like growths at the nerve roots (sample No. 256b from 1858)

While the search of possible cancer causing substances started to gain more and more interest, only one real cause was thought to be found in the ore of the central European mountains, where miners would have a higher prevalence of lung cancers. Nevertheless, this was later found to be caused by radio active material and not by the inhaled dust of minerals as expected. Similar, many parasites and bacteria were found as potential causes of cancer, but none of the findings could produce proof.

While all these steps were getting closer together in time up until the beginning of the 20<math display="inline">^{th}</math> century, they were still fairly minor in contrast to the high speed of discoveries that the last hundred years brought with it. While technically Willhelm Röntgen discovered the X-rays just before the change of the century (Röntgen 1898), both its impact on the body and cancer were only clear a few years into the last century (Frieben 1902; Scholtz 1902). However, similar to how X-rays can cause cancers, researcher also found quickly, that it can also treat cancer and thus the field of radiotherapy was created. This then was the first major change in cancer treatment for around five thousand years, which also could treat inoperable cancers.

The next invention, that I want to highlight within the vast amount of advances made in the advent of the 20<math display="inline">^{th}</math> century, is the cutting needle aspiration syringe, which allowed a non-traumatic biopsy of internal organs for microscopy study. This made it possible to not have exploratory surgery and instead allowed improved diagnosis and planning of necessary surgery.

The next major step in the treatment of cancers comes in the form of chemotherapy, when Ehrlich (1909) published his work “Beiträge zur experimentellen Pathologie und Chemotherapy“ where he injected animals with different toxins in order to destroy cancer cells. Although, it still took another 30 years until after the second world war when the discovery, that a chemical designed for warfare also had a potent anti-tumour effect.

In the meantime, the first long term tissue cultures of animal cancer cell lines were established and further insights like the Warburg effect (Warburg 1928) found, which showed, that cancer cells use glucose at a higher rate than healthy cells. This effect ultimately to the discovery of the positron emission tomography (PET) scan, which allowed a significantly more granular diagnosis and localisation of cancerous lesions than before.

With the success of growing human cell lines in vitro, the USA embarked on a massive experiment to test any potential source of chemical carcinogenesis. But at the same time, multiple viruses were identified to cause cancers in the 1950s, when electron microscopy was invented (Claude, Porter, and Pickels 1947).

Only a few more years later, the biggest advance in the understanding of biology was made, when the structure of DNA was discovered (Watson and Crick 1953) ([intro-sec:DNA])and subsequently lead to numerous new experiments and breakthroughs. When studying how viruses are able to reverse transcribe their RNA and insert a new gene into a healthy cell, which then transformed into a cancer cell, the term “oncogene“ was coined (Huebner and Todaro 1969; Baltimore 1970; Temin, Mizutami, et al. 1970) and the foundation for the understanding of how genes influence the emergence of cancers was laid. This also lead to the understanding, that heritable changes in the genome could predispose a person to cancer, which was previously hypothesised (F. P. Li and Fraumeni Jr 1969). And while the discovery of DNA was a substantial boost for the understanding of cancer, the diagnostic capabilities increased at a similar speed, with urine tests for biomarkers of certain cancers as well as antigen detection.

And this is when we arrive at the “current“ times, when a few years ago next generation sequencing (NGS) ([intro-sec:sequencing]) was introduced and sped up data generation to improve our understanding of the genetic landscapes of different cancers and the subsequent development of genomic tests based on the molecular profile of certain cancers. The completion of the “Human Genome Project“ ([intro-sec:sequencing]) allowed the accurate resolution of common mutations in cancers and consortiums coordinated large efforts to build databases of all somatic variation like TCGA, ICGC and PCAWG (The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020).

However, with the genetic classification and investigation of cancers, researchers observed genetic heterogeneity in addition to the already well established inter-patient heterogeneity (Swanton 2012). Not all cancer cells in a patient shared the same genetic setup and this allowed the disease to adjust and adapt to treatments and ultimately caused resistance (Burrell and Swanton 2014).

Genetic pathology nevertheless enabled the identification of cancer vulnerabilities that then allowed the application of highly specific drugs, like tyrosine kinase inhibitors (TKI)s, which are tailored to target a specific alteration in the genome of a cancer cell, and genetically engineered antibodies which can be hone in on the cancer.

While the therapeutic world is quickly evolving, many of the questions from previous times are still the same. We still don’t know how and when the heterogeneity in cancers occurs, we just know it is a major source of resistance to treatment. In the majority of cancer types we also still do not have an answer to the “cell of origin“ question that has been asked for so long.

So instead of trying to answer these questions directly, there has been an effort to define fundamental features of malignancies, very similar to the early pathology descriptions. The original characteristics comprise

Sustaining proliferative signalling Evading growth suppressors Activating invasion and metastasis Enabling replicative immortality Inducing angiogenesis Resisting cell death

([fig:oldhallmarks]).

Acquired capabilities of cancer; Functional capabilities acquired by most cancers during their development; Figure adapted from Hanahan and Weinberg (2000)
Acquired capabilities of cancer; Functional capabilities acquired by most cancers during their development; Figure adapted from Hanahan and Weinberg (2000)

These hallmarks were for a while considered the core of tumour development and the authors themselves hypothesised, that these core mechanics allow us to condense the complexity that cancer displays, both in the clinic and in labs, with a small set of rules, which all cancers have to obey (Hanahan and Weinberg 2000). In their exact words: “We foresee cancer research developing into a logical science, where the complexities of the disease, described in the laboratory and clinic, will become understandable in terms of a few underlying principles“

However, with 11 years of additional research into the topic, more hallmarks have been found, and the original list was revised by the authors to contain additional characteristics, namely

Avoiding immune destruction Tumour-promoting inflammation Genome instability and mutation Deregulating cellular energetics

(Hanahan and Weinberg 2011). And even then a few years later, even more hallmarks e.g. metabolic rewiring is now considered a part of the characteristics of cancer (Fouad and Aanei 2017).

Emerging hallmarks and enabling characteristics of cancer; updated version of the hallmarks figure ([fig:oldhallmarks], (Hanahan and Weinberg 2000)); Figure adapted from Hanahan (2022); Left, the Hallmarks of Cancer currently embody eight hallmark capabilities and two enabling characteristics. In addition to the six acquired capabilities – Hallmarks of Cancer – proposed in 2000 ([fig:oldhallmarks]), the two provisional “emerging hallmarks” introduced in 2011 (Hanahan and Weinberg 2011) –cellular energetics (now described more broadly as “reprogramming cellular metabolism”) and “avoiding immune destruction” – have been sufficiently validated to be considered part of the core set. Given the growing appreciation that tumors can become sufficiently vascularized either by switching on angiogenesis or by co-opting normal tissue vessels (Kuczynski et al. 2019), this hallmark is also more broadly defined as the capability to induce or otherwise access, principally by invasion and metastasis, vasculature that supports tumor growth. The 2011 sequel further incorporated “tumor-promoting inflammation” as a second enabling characteristic, complementing overarching “genome instability and mutation,” which together were fundamentally involved in activating the eight hallmark (functional) capabilities necessary for tumor growth and progression. Right, this review incorporates additional proposed emerging hallmarks and enabling characteristics involving “unlocking phenotypic plasticity,” “nonmutational epigenetic reprogramming,” “polymorphic microbiomes,” and “senescent cells.”
Emerging hallmarks and enabling characteristics of cancer; updated version of the hallmarks figure ([fig:oldhallmarks], (Hanahan and Weinberg 2000)); Figure adapted from Hanahan (2022); Left, the Hallmarks of Cancer currently embody eight hallmark capabilities and two enabling characteristics. In addition to the six acquired capabilities – Hallmarks of Cancer – proposed in 2000 ([fig:oldhallmarks]), the two provisional “emerging hallmarks” introduced in 2011 (Hanahan and Weinberg 2011) –cellular energetics (now described more broadly as “reprogramming cellular metabolism”) and “avoiding immune destruction” – have been sufficiently validated to be considered part of the core set. Given the growing appreciation that tumors can become sufficiently vascularized either by switching on angiogenesis or by co-opting normal tissue vessels (Kuczynski et al. 2019), this hallmark is also more broadly defined as the capability to induce or otherwise access, principally by invasion and metastasis, vasculature that supports tumor growth. The 2011 sequel further incorporated “tumor-promoting inflammation” as a second enabling characteristic, complementing overarching “genome instability and mutation,” which together were fundamentally involved in activating the eight hallmark (functional) capabilities necessary for tumor growth and progression. Right, this review incorporates additional proposed emerging hallmarks and enabling characteristics involving “unlocking phenotypic plasticity,” “nonmutational epigenetic reprogramming,” “polymorphic microbiomes,” and “senescent cells.”

And even during the time of my PhD, further research revealed additional hallmarks, which got characterised by Hanahan (2022). The newest version adds another two characteristics and hallmarks, specifically:

unlocking phenotypic plasticity nonmutational epigenetic reprogramming polymorphic microbiomes senescent cells

(see [fig:newesthallmarks]).

The evolution of these hallmarks shows, that even though lots of time and effort was invested into cancer research for multiple centuries, there still is no unifying definition and treatment for cancer. The vast heterogeneity not only between cancer types, but also between patients makes it very hard to study. But even within patient there is third type of heterogeneity, which is the main cause of treatment resistance and relapse (Dagogo-Jack and Shaw 2017). Spatial heterogeneity refers to the non-uniform distribution of cancer subclones within different sites of disease. Even when the primary site of disease is known, metastatic sites may show a very different genetic landscape. In contrast, temporal heterogeneity describes the changes of a lesion over time. These changes can be caused by cancer progression or treatment ([fig:heterogeneities]). These two concepts however are not mutually exclusive, but can occur at the same time.

Types of intra patient heterogeneity in cancer: Spatial uneven distribution of subclones in different sites (left); Temporal heterogeneity with regards to genetic landscape within one site, either through progression of tumour or adaptation to treatment; Cells with rough membrane depict cancer cells, smooth membrane are normal cell. Colours show subclonal differences in the cancer. Figure adapted from Dagogo-Jack and Shaw (2017)
Types of intra patient heterogeneity in cancer: Spatial uneven distribution of subclones in different sites (left); Temporal heterogeneity with regards to genetic landscape within one site, either through progression of tumour or adaptation to treatment; Cells with rough membrane depict cancer cells, smooth membrane are normal cell. Colours show subclonal differences in the cancer. Figure adapted from Dagogo-Jack and Shaw (2017)

And while we know, that this diversity exists and efforts have been made to measure and classify them (Noorbakhsh et al. 2018), there is still a lack of methods to directly analyse this heterogeneity and utilise this information to guide clinical approaches directly.

Thesis overview and aims

While genomic tumour heterogeneity is now a well accepted concept, there remains a need for computational methods to assess and visualise this heterogeneity. This work aims to contribute to this unmet need by developing three different custom made tools to infer and monitor genomic tumour heterogeneity. I have completed the work in the following three parts:

  1. Development of two joint somatic variant calling workflows and the impact of these on downstream analysis ([ch:variantcalling]).
  2. Analysis of five rapid autopsy probands with state of the art methods to investigate multi-regional tumour heterogeneity and development of a mitochondrial based phylogeny reconstruction method ([ch:cascade]).
  3. Development of a read-centric method to detect somatic mutational signatures from low coverage whole genome sequencing ([ch:mmf]).
“It is the main source of our mistakes, when making making decision, that we only look at life piece by piece and not as a whole.“

Joint somatic variant calling - if germline can do it, so can we

Introduction

In 2018, at the start of the work presented in this thesis, we observed a difference in methodology between germline and somatic variant calling methods. Where all "modern" germline variant callers, like Strelka2 (Kim et al. 2018), HaplotypeCaller (Poplin et al. 2017), DRAGEN (Miller et al. 2015) and DeepVariant (Poplin et al. 2018), have the built-in capability to jointly call multiple related samples, for example from family trios, virtually no somatic variant caller had this functionality.

The joint analysis of smaller cohorts improves the performance of germline variant calling methods significantly, by allowing to assess technical artefacts, which might be unique for the individual sequencing machine or the researcher handling the DNA (Schirmer et al. 2016; Stoler and Nekrutenko 2021). Additionally, as certain parts of the genome are more problematic to sequence ([intro-sec:sequencing]) and map ([intro-sec:mapping]), a “control“ sample can help to distinguish if a certain observed change occurring frequently is a technical issue or in fact a real change.

For somatic variant calling, this concept has been adopted on in the genome analysis toolkit (GATK) (Brian O’Connor 2020) to allow the use of a panel of normals (PON), which contain frequently seen changes in healthy (“normal“) individuals analysed with the same sequencing technology (GATK Team 2021a). Although, in contrast to the more intricate model for the germline equivalent, this is a post processing step of the analysis. Mutect2, which is the most recent somatic variant calling algorithm provided by the Broad institute, also provides a multi-sample mode, for which all tumour samples need to be from the same patient, either related longitudinally or spatially (GATK Team 2020). However, this mode is not very well publicised and all tutorials released by the developers state that “there is currently no way to perform joint calling for somatic variant discovery“ (GATK Team 2021b). So while all methods in the GATK are considered a beta feature, the multi sample mode needs to be used with care.

There are only two methods currently, which have documented and published capabilities to jointly analyse tumour samples from the same patient to call somatic variants. The first one is a specialised method built on a joint bayesian model for single nucleotide variants (SNV)s that occur in multiple samples called multiSNV (Josephidou, Lynch, and Tavaré 2015). However, it has multiple shortcomings, which make it not usable for our data. First, as the name suggests, the method can only jointly evaluate SNVs and completely ignores INDELs and structural variants, which would be acceptable for the superior performance it provides. However, multiSNV was optimised only for WES and not for the very deep WGS that is now widely available and form a major part of this thesis. This mismatch of input types means exceptionally high runtimes on WGS data. Even with custom parallelisation that was attempted in this work, the predicted runtime for just one multi sample patient would have been longer than 3 years. This shows, that while multiSNV was a great step forward at the time, there is a real need for new methods to stem the tide of sequencing data available due to the ever decreasing sequencing cost.

multiSNV had been the only software available for multi sample analysis for almost five years, but during this work, superFreq (Flensburg et al. 2020) was also published. It combines all standard analysis steps for tumour analysis, like quality assessment, variant calling, copy number analysis and clonal deconvolution, into one program and is even able to jointly analyse samples. However, similar to multiSNV, its focus during optimisation and development was on WES and RNAseq data, so when applied to WGS data, we could not find a server node with enough memory to execute the workflow.

This then prompted us to develop a novel workflow to enable the analysis of high depth WGS, which we estimate to become more and more routine, with the ever dropping prices of sequencing. The following sections will show the development and validation of the joint variant calling methods as described in Hollizeck et al. (2021) ([variantcalling-sec:publication]), and additional analysis on the impact of the joint variant calling on downstream multi-regional tumour analysis ([variantcalling-sec:downstream]), longitudinal sample analysis ([variantcalling-sec:longitudinal]) and clonal deconvolution ([variantcalling-sec:clonal]). The final section provides current information on the usage of the methods by others in the research community ([variantcalling-sec:usage]).

Publication

The full publication related to the joint somatic variant calling can be found at https://doi.org/10.1093/bioinformatics/btab606 and a formatted version is also attached as [ch:appendixManuscript] with all supplementary methods.

References to supplementary data will be prefixed with the letter 6 for tables and methods taken from the publication, supplementary figures from the publication are integrated into the main text and not shown in the appendix. Prefix 7 was used for additional data generated for this work.

Summary

To enable highly sensitive, fast and accurate variant detection from multiple related tumour samples, we have developed joint variant calling extensions to two widely used single-sample algorithms, FreeBayes (Garrison and Marth 2012) and Strelka2 (Kim et al. 2018). Using both simulated and clinical sequencing data, we show that these workflows are highly accurate and can detect variants at much lower variant allele frequencies than other commonly used methods.

FreeBayesSomatic workflow

The original FreeBayes algorithm can jointly evaluate multiple samples, but routinely it does not perform somatic variant calling on tumour-normal pairs. We introduce FreeBayesSomatic which allows concurrent analysis of multiple tumour samples by adapting concepts from SpeedSeq (Chiang et al. 2015) which differentiates the likelihood of a variant between tumour and normal samples instead of imposing an absolute filter for all variants called in the normal. Hence, for each genotype (GT) at SNV sites, FreeBayesSomatic first calculates the difference in likelihoods (LOD) between the normal ([eq:01]) and the tumour ([eq:02]) samples genotype likelihoods (GL) with g<math display="inline">_{0}</math> describing the reference genotype.

<math display="block">\begin{aligned} \text{LOD}_{\text{normal}} &= \max_{g_i \in \text{GT}} \left( \text{GL}(g_0) - \text{GL}(g_i) \right) \label{eq:01}\\ \text{LOD}_{\text{tumour}} &= \min_{s \in \text{Samples}} \left( \min_{g_i \in \text{GT}} \left( \text{GL}_s(g_i) - \text{GL}_s(g_0) \right) \right) \label{eq:02}\\ \text{somaticLOD} & := \left( \text{LOD}_{\text{normal}} \geq 3.5 \land \text{LOD}_{\text{tumour}} \geq 3.5 \right) \label{eq:03}\end{aligned}</math>

Next, the variant allele frequencies (VAF) in both the tumour and the normal samples are compared at each site.

<math display="block">\begin{aligned} \text{VAF}_{\text{tumour}} &= \max_{s \in \text{Samples}} ( \text{VAF}_s) \label{eq:04}\\ \text{somaticVAF} & := \left( \text{VAF}_{\text{normal}} \leq 0.001 ~\lor \right. \nonumber \\ & \left. ( \text{VAF}_{\text{tumour}} \geq 2.7 \cdot \text{VAF}_{\text{normal}}) \right) \label{eq:05}\end{aligned}</math>

A variant is classified as somatic when both somatic LOD as well as somatic VAF pass the criteria somaticLOD ([eq:03]) and somaticVAF ([eq:05]).

The thresholds chosen for both LOD (3.5) and VAF (0.001 and 2.7) calculations were previously fitted by the blue-collar bioinformatics workflow for the “DREAM synthetic 3“ dataset using the SpeedSeq likelihood difference approach (Chapman et al. 2021) and were selected to identify high confidence variants.

Strelka2Pass workflow

In contrast to FreeBayes, whilst Strelka2 has a multiple-sample mode for germline analysis and tumour-normal pair somatic variant calling capabilities, it cannot jointly analyse multiple related tumour samples. We enable this feature by adapting a two-pass strategy previously used for RNA-seq data (Veeneman et al. 2015). First, somatic variants are called from each tumour-normal pair. All detected variants across the cohort are then used as input for the second pass of the analysis, where we re-iterate through each tumour-normal pair but assess allelic information for all input genomic sites.

The method re-evaluates the likelihood of each variant, by integrating every genotype from each tumour-normal pair. This step can "call" a variant (<math display="inline">v</math>) in a sample that initially did not present enough evidence to pass the Strelka2 internal filtering using two conditions: 1) if this variant was called as a proper "PASS" by Strelka2 in any other tumour sample, or 2) if the integrated evidence for this variant across all tumour-normal pairs reached a sufficiently high level. The second condition was based on the somatic evidence score (SomEVS) reported by Strelka2, which is the logarithm of the probability of the variant <math display="inline">v</math> being an artefact.

<math display="block">p_{error}(v) = 10^{\left( \frac{-\text{SomEVS}(v)}{10} \right)} \label{eq:06}</math>

While the germline sample is shared between all processes, we can approximate these individual probabilities as being independent, since one variant calling process is agnostic of the other. Hence, we derive the following:

<math display="block">p_{error}(v_{s_1},v_{s_2},\ldots,v_{s_n}) = \prod_{s \in \text{Samples}} p_{error}(v_{s}) \label{eq:07}</math>

And therefore:

<math display="block">\text{SomEVS}(v_{s_1},v_{s_2},\ldots,v_{s_n}) = \sum_{s \in \text{Samples}} \text{SomEVS}(v_{s}) \label{eq:08}</math>

This allows the summation ([eq:08]) of the SomEVS score across all supporting variants to assign a "PASS" filter, if it reached a joint SomEVS score threshold. This threshold can be set by the user and is 20 by default, which corresponds to an estimated error rate of 1%. These "recovered" variants need to pass a set of additional quality metrics related to depth of coverage, mapping quality and read position rank sum score.

As an additional improvement, we also built multiallelic support into Strelka2 which originally only reports the most prevalent variant at a specific site. Within the two-pass analysis, we reconstruct the available evidence for a multiallelic variant at a called site from the allele-specific read counts and report the minor allele at this site, if there is sufficient support from other samples. This method allows recovery of minor alleles only if another sample has this variant called by Strelka2, as SomEVS scores are not available for minor alleles.

Validation

While the development of new methods can challenge previous assumptions, all methods need be validated against the current gold standard methods in the field, with data which allows objective evaluation. For germline variant calling, there have been multiple community led challenges and specifically designed test datasets, but there is currently no somatic variant calling equivalent. This issue is even more pronounced for our method, as we do not only need a tumour-normal pair, but we need the multiple tumour samples in the dataset to be related. To allow a fair comparison, we first generated a fully synthetic dataset, where every variant is known and fully defined ([variantcalling-sec:simdata]) to allow a general performance assessment of the methods. Then to ensure that these metrics also hold true in real world data, we then analysed a previously published dataset which had orthogonal validation of a subset of SNVs through targeted amplicon sequencing (TAS) in [variantcalling-sec:realdata].

image

Simulated data

Characteristics of simulated data: A) Simulated phylogeny of samples B) Number of simulated germline and somatic variants per sample C) Variant allele frequency distribution of simulated variants per sample D) Distance to nearest variant in each sample.
Characteristics of simulated data: A) Simulated phylogeny of samples B) Number of simulated germline and somatic variants per sample C) Variant allele frequency distribution of simulated variants per sample D) Distance to nearest variant in each sample.

We first simulated a phylogeny with somatic and germline variants from ten tumour samples and one normal ([fig:varcalling:fig1]A and [A:fig:S01]A, B). Germline variants were simulated at a uniform allele frequency of <math display="inline">0.5</math>. Somatic VAFs were sampled from a custom distribution, modelled to favour low allele frequency variants to closely represent real world data (min VAF: <math display="inline">0.001</math>; max VAF: <math display="inline">1</math>; Fig. S1C, D). Paired-end sequencing reads with realistic error profiles were simulated for WGS data at 160X average coverage using the ART-MountRainier software (Huang et al. 2011). The simulated reads were aligned to GRCh38 and both germline and somatic variants from the phylogeny were spiked into the aligned reads using Bamsurgeon (Ewing et al. 2015). We compared the workflows for FreeBayes and Strelka2 with and without our extensions for joint variant calling on the simulated datasets. The performance of Mutect2 joint variant calling was also assessed using its proposed best practice workflow. As both Mutect2 and FreeBayes do not return a verdict for each individual sample, we needed to assign each sample in the multi-sample VCF its own FILTER value. We called a somatic variant as present in a sample, if there were at least two reads supporting it for this sample and the overall FILTER showed a “PASS“, which was the same cut-off used in the refiltering step in the Strelka2-pass workflow.

Performance of workflows using simulated data: A) Precision and B) Recall of Mutect2, FreeBayes and Strelka2, run in single tumour-normal paired and joint calling configurations.
Performance of workflows using simulated data: A) Precision and B) Recall of Mutect2, FreeBayes and Strelka2, run in single tumour-normal paired and joint calling configurations.

While the precision of each method without our extensions was greater than <math display="inline">99.8\%</math>, they all missed at least 25% of all variants in the samples (i.e. recall <math display="inline">\leq 75\%</math>). In contrast, the recall of the modified workflows increased to <math display="inline">\approx 95\%</math> with only a minute decrease in the precision for both FreeBayes and Strelka2 ([A:fig:S02]). Mutect2 had virtually no change in precision, but the recall actually decreased from <math display="inline">\approx 75\%</math> to <math display="inline">\approx 41\%</math> when analysing the samples jointly ([A:fig:S02]B). Additionally, with our modified workflows, true positive variants were called with VAFs as low as 0.008 (median detected VAF <math display="inline">\geq 0.14</math> for joint sample analysis and <math display="inline">\geq 0.21</math> for single tumour-normal pair analysis), enabling improved distinction between true variants and technical errors ([A:fig:S03]). This improvement in performance for Strelka2 is only achieved after the refiltering step and not just a result of the second pass ([A:fig:S04], [A:varcalling:steps]).

Variant allele frequencies (VAF) of variants detected by joint sample analysis; A) VAF distribution of true positive variants additionally detected by Strelka2pass B) and FreeBayesSomatic C) VAF distribution of false positive variants additionally detected by FreeBayesSomatic D) and Strelka2pass E) VAF distribution of false negatives not called by FreeBayesSomatic F) and Strelka2pass.
Variant allele frequencies (VAF) of variants detected by joint sample analysis; A) VAF distribution of true positive variants additionally detected by Strelka2pass B) and FreeBayesSomatic C) VAF distribution of false positive variants additionally detected by FreeBayesSomatic D) and Strelka2pass E) VAF distribution of false negatives not called by FreeBayesSomatic F) and Strelka2pass.

Performance of individual steps in the Strelka2pass workflow using the simulated data: A) Precision and B) Recall of tumour-normal paired analysis, two-pass step without refiltering (supplying variants from all tumour-normal pairs for evaluation) and two-pass step with refiltering (the final workflow)
Performance of individual steps in the Strelka2pass workflow using the simulated data: A) Precision and B) Recall of tumour-normal paired analysis, two-pass step without refiltering (supplying variants from all tumour-normal pairs for evaluation) and two-pass step with refiltering (the final workflow)

Summary of variant filters assigned by Mutect2; The counts for each filter type are denoted by black boxplots with white circles depicting the median values. The fitted distribution of variant counts outlines each boxplot; A) Counts of filter assignments for false negative variants and B) true negative variants called by Mutect2 C) Filter assignment for all variants reported for sequenced patient data sequenced with WGS or D) WES.
Summary of variant filters assigned by Mutect2; The counts for each filter type are denoted by black boxplots with white circles depicting the median values. The fitted distribution of variant counts outlines each boxplot; A) Counts of filter assignments for false negative variants and B) true negative variants called by Mutect2 C) Filter assignment for all variants reported for sequenced patient data sequenced with WGS or D) WES.

The performance of joint variant calling in Mutect2 was inferior compared to all other methods ([A:fig:S02]A, B). This was primarily due to the "clustered_events" filter in Mutect2, which excluded the majority of false negative variants, with negligible contribution to the exclusion of true negative variants ([A:fig:S05]A, B). This result was unexpected as the simulated variants were evenly distributed along the genome and the corresponding allele frequencies were sampled randomly ([A:fig:S01]D).

Assessing the performance of different workflows using tumour samples with different evolutionary relationships in the simulated data; A) Simulated phylogeny highlighting two samples with high evolutionary distance (sim-a and sim-j) where MRCA denotes the most recent common ancestor. B) Simulated phylogeny highlighting two samples with low evolutionary distance (sim-a and sim-b).C) Precision and E) Recall of FreeBayes and Strelka, run in individual tumour-normal paired and joint calling configurations using two (sim-a and sim-j), three (sim-a, sim-g and sim-j), five (sim-a, sim-c, sim-f, sim-h and sim-j) and all ten tumour samples D) Precision and F) Recall estimates for FreeBayes and Strelka run in individual tumour-normal paired and joint calling configurations. Comparing the performance of these workflows when using two evolutionary distant samples (sim-a and sim-j), two evolutionary close samples (sim-a and sim-b) and all ten tumour samples.
Assessing the performance of different workflows using tumour samples with different evolutionary relationships in the simulated data; A) Simulated phylogeny highlighting two samples with high evolutionary distance (sim-a and sim-j) where MRCA denotes the most recent common ancestor. B) Simulated phylogeny highlighting two samples with low evolutionary distance (sim-a and sim-b).C) Precision and E) Recall of FreeBayes and Strelka, run in individual tumour-normal paired and joint calling configurations using two (sim-a and sim-j), three (sim-a, sim-g and sim-j), five (sim-a, sim-c, sim-f, sim-h and sim-j) and all ten tumour samples D) Precision and F) Recall estimates for FreeBayes and Strelka run in individual tumour-normal paired and joint calling configurations. Comparing the performance of these workflows when using two evolutionary distant samples (sim-a and sim-j), two evolutionary close samples (sim-a and sim-b) and all ten tumour samples.

Since the extent of the improvement in our joint calling workflows is bound by the number of shared variants between samples, we sub-sampled the simulated dataset, to show the effect of incomplete sampling on our methods, which is more likely in clinical settings. Furthermore, the evolutionary distance between the related samples in addition to the number of samples, has a major impact on the number of shared variants, as only variants acquired between the germline and the most recent common ancestor (MRCA), will benefit from the joint analysis. Therefore, we selected three sample subsets which included two, three and five samples with high evolutionary distance to show the minimum expected improvement ([fig:varcalling:fig1]A, B). There was a clear linear improvement for both FreeBayesSomatic and Strelka2Pass when increasing the number of samples, even if they had a distant evolutionary relationship. In contrast, when using only two samples with a small evolutionary distance, the increase in performance was almost as large as when jointly analysing all 10 available samples. This shows that samples with a high number of shared variants will perform better in joint calling workflows ([A:fig:S06]).

Clinical data

To validate the performance of our new workflows, we then analysed WGS and whole-exome sequencing (WES) data of multi-region tumour samples from eight patients, with late stage melanoma, who had multiple tumour samples (average 7 samples per patient; total number of samples 55) collected following enrollment in a rapid autopsy program (CASCADE) conducted at the Peter MacCallum Cancer Centre ([A:tab:S1] and [A:varcalling:clinical]) (Solomon et al. 2020; Vergara et al. 2021). The published studies had multiple somatic variants from the clinical samples orthogonally validated through targeted amplicon sequencing (TAS). We used these TAS-validated variants as the gold standard to evaluate the performance of different workflows, acknowledging that the technical biases inherent to TAS data are different to those present in WGS and WES ([A:fig:S07]) and that there would be sampling biases depending on different tumour cells analysed in each data type.

Correlation of variant allele frequencies (VAF) from WES and WGS data against targeted amplicon sequencing VAF values with fitted violin plots of each individual distribution. Grey background shows 95% confidence interval for the fit of the linear model (dotted line).
Correlation of variant allele frequencies (VAF) from WES and WGS data against targeted amplicon sequencing VAF values with fitted violin plots of each individual distribution. Grey background shows 95% confidence interval for the fit of the linear model (dotted line).

Performance of the different workflows using clinical samples from eight cancer patients: A) Number of variants called by Strelka2 run in the tumour-normal paired (grey) and joint calling configurations, which have been validated by targeted amplicon sequencing (TAS). The same for C) FreeBayes and E) Mutect2 workflows. Precision of tumour-normal paired and joint analysis of TAS validated clinical data for B) Strelka2, D) FreeBayes and F) Mutect2; [A:tab:S1] provides the sample naming map to the original publications.
Performance of the different workflows using clinical samples from eight cancer patients: A) Number of variants called by Strelka2 run in the tumour-normal paired (grey) and joint calling configurations, which have been validated by targeted amplicon sequencing (TAS). The same for C) FreeBayes and E) Mutect2 workflows. Precision of tumour-normal paired and joint analysis of TAS validated clinical data for B) Strelka2, D) FreeBayes and F) Mutect2; [A:tab:S1] provides the sample naming map to the original publications.

In concordance with the results of the simulated data, our improved workflows found additional variants in all but one patient ([fig:varcalling:fig1]D, [A:fig:S08]) (total additional variants Strelka2Pass: <math display="inline">64</math>; FreeBayesSomatic: <math display="inline">85</math>) with only a slight drop in precision for FreeBayesSomatic (mean: <math display="inline">0.94</math> vs. <math display="inline">0.88</math>) and Strelka2Pass (mean: <math display="inline">0.97</math> vs. <math display="inline">0.92</math>). Since the panel of variants validated by TAS was limited (<math display="inline">7108</math> bp for patients CA-B through -H), this increase in detected variants suggests that a high number of shared variants in samples are missed with current approaches, which in turn leads to an overestimation of tumour heterogeneity between samples, as these variants are thought to not be present rather than undetected.

Correlation between cellularity and proportion of variants found only with joint calling using FreeBayesSomatic. Grey background shows 95% confidence interval for fit of linear model (dotted line)
Correlation between cellularity and proportion of variants found only with joint calling using FreeBayesSomatic. Grey background shows 95% confidence interval for fit of linear model (dotted line)

Even though the number of shared variants is a major influencing factor when jointly calling variants, low cellularity samples benefit more from the joint calling, as conventional methods cannot reliably distinguish low allele frequency variants from noise. Through a joint analysis approach, the number of recovered variants is higher in low cellularity samples, which indicates, that especially for clinical samples with variable tumour purity, joint analysis can have a major impact on improving performance ([fig:varcalling:fig1]E, [A:fig:S09]).

Improvement in recall using FreeBayesSomatic and Strelka2pass over Mutect2 in the clinical samples.
Improvement in recall using FreeBayesSomatic and Strelka2pass over Mutect2 in the clinical samples.

Mutect2 in contrast, did not show significant improvement in any sample in its joint calling configuration, but showed inferior performance compared to the tumour-normal pairwise approach in two samples ([A:fig:S08]E), similar to its decreased performance in the simulated data ([A:fig:S02]). This was due to true variants being removed by the internal filters of the tool ([A:fig:S05]C, D). This is in stark contrast to our novel workflows, where the joint analysis preserves all called sites from the pairwise method and finds additional variants. Overall, Mutect2 found less validated variants in all patients than both Strelka2Pass (mean: <math display="inline">2.2</math>) and FreeBayesSomatic (mean: <math display="inline">2.5</math>) with comparable levels of precision ( A@[A:fig:S08]  2.8 @@ and [A%

fig:S10] , [A%
fig:S10] @ ) but longer run times ([A:tab:S2]).
Our improved workflow also enabled the discovery of multiallelic variants with Strelka2, which led to the discovery of on average <math display="inline">42</math> additional variants (min: <math display="inline">1</math>; max: <math display="inline">535</math>) in the analysed WES and <math display="inline">987</math> additional variants in the WGS (min: <math display="inline">81</math>; max <math display="inline">2329</math>). These variants are strong indicators of sub clonal structure and are invaluable for the study of evolutionary trajectories in cancer, as shown in the following sections.

Effects of additional somatic variants on downstream analysis

The ability to find additional shared variants has significant impact on our understanding of cancer evolution and the timing of initiation and metastatic seeding. Recent work has shown, that similar to the well known genetic heterogeneity, there is heterogeneity when it comes to the timing of metastatic seeding. While traditionally it was thought that tumours only metastasise after they reach a certain size, to escape the restrictions of the niche, like reduced nutrition, recent publications have shown, that often there is also very early metastatic seeding (Hu et al. 2019). For all methods analysing this heterogeneity, evolutionary timing and history are fully reliant on the somatic variants found in the data. Therefore, if we improve the input provided to these analysis methods, we can expect a clearer and possibly more granular result.

The following section will quantify the effect of additionally found variants on phylogenetic reconstruction and clonal decomposition, which use somatic variants as input.

Phylogenetic reconstruction

As this work is not about the advantages and shortcomings of different phylogenetic reconstruction tools, we have not performed a comprehensive comparison of tools, but rather focused on the results of using additional variants discovered using our novel joint somatic variant calling workflow. For this reason, we chose to use neighbour joining (NJ) (Saitou and Nei 1987), because it is fast, readily available in most phylogenetic reconstruction tool kits and if the input distance is correct, the output will be correct. Furthermore, even if the distance is not 100% correct, if the distance is “nearly additive“ and the input distances are not far off from the real distance, the tree topology will still be reconstructed correctly (Mihaescu, Levy, and Pachter 2007). Lastly, in contrast to many other methods like UPGMA and WPGMA (Sokal and Michener 1958), NJ does not assume an equal mutation rate of each sample, because we know, that the molecular clock hypothesis (Zuckerkandl and Pauling 1962) is not valid for different lineages of cancers (Shibata 2010).

The only thing that NJ requires as an input is a distance matrix of all samples, so the next step was the selection of the right distance metric. While there are many distance measures for DNA sequences, which allow accounting for different probabilities of transitions and transversions as well as uneven base composition, models like F81 (Felsenstein 1981) or HKY85 (Hasegawa, Kishino, and Yano 1985) are only really designed for germline mutations and are not easily applicable for subclonal somatic mutations. For this reason, we decided to first transform the variants present in all samples into a binary occurrence vector and then calculate the Hamming distance (Hamming 1950) between all samples. This generates a maximum parsimony approach and the branch length of the trees will be directly translatable to the amount of variants which are different between samples.

[fig:ca9phylo] shows both the reconstructed phylogenies of the autopsy samples of the previously described late stage melanoma patient “CA-F“ from the manuscript ([ch:appendixManuscript], [A:tab:S1]), using the variants found with the default tumour-normal method on the left and our improved joint method on the right. The exact same reconstruction methodology was otherwise used.

Reconstructed phylogenies of a patient with multiple spatially distinct samples; Neighbour joining on Hamming distance on variant occurrence vector. Tip labels describe the location of the sample in the patient. Trees are shown as unrooted with germline as fixated origin point; black line ruler shows the length of an edge with 2000 mutations; LN = lymph node
Reconstructed phylogenies of a patient with multiple spatially distinct samples; Neighbour joining on Hamming distance on variant occurrence vector. Tip labels describe the location of the sample in the patient. Trees are shown as unrooted with germline as fixated origin point; black line ruler shows the length of an edge with 2000 mutations; LN = lymph node

There are several obvious differences, first in the longer edge connecting the germline to all other samples, which we consider as the state of no somatic variants. This shows that there are many more shared mutations in all samples, than what would have been anticipated with the default method, which corresponds to an overestimation of the heterogeneity of the samples. As the accumulation of somatic variants is still used as a proxy for timing and cell divisions, when assuming a high mutation rate for lung cancer (<math display="inline">5.3 \cdot 10^{-8}</math> from Werner et al. (2020) (Werner et al. 2020)) this difference of <math display="inline">\approx 36000</math> variants is equivalent to <math display="inline">\approx 2000</math> cell divisions. While the cell doubling rate of cancers is highly dependent on the type (Arai et al. 1994), this change makes a substantial difference when assessing the timing of the tumour initiation and further evolution.

Secondly, there have been topological changes, which generate a longer bifuricating edge between the olive coloured “right liver lobe“ and the “right parietal lobe“ lesions showing a bottle neck in cancer evolution, which corresponded with the clinical history, where the patient lived with stable disease for almost ten years, before rapid disease progression just prior to death. The extreme distance of the primary/diagnostic sample from the rest of the samples could be in part due to a difference in sequencing quality, as this was an FFPE tumour biopsy. However, as this feature is preserved between both the joint and the pairwise analysis, it does not appear to be an effect of our new method.

Side by side view of the reconstructed trees from [fig:ca9phylo]; internal edges, which are distinct between trees are shown as dotted lines; common subtree is shown in red Tree labels have been sorted to minimise distance between labels; LN = lymph node ; Visualisation generated with dendextend (Galili 2015)
Side by side view of the reconstructed trees from [fig:ca9phylo]; internal edges, which are distinct between trees are shown as dotted lines; common subtree is shown in red Tree labels have been sorted to minimise distance between labels; LN = lymph node ; Visualisation generated with dendextend (Galili 2015)

[fig:tanglePhyloCA9] shows a topology focused view of the two phylogenetic trees, which highlights the differences between the two reconstructions (Vienne 2018). The common subtrees are coloured the same on both sides and connecting lines show identical labels. This format shows that while the trees look quite similar at first glance, they show vastly different topologies.

One example of this is the “small bowel“ tumour sample which was originally connected to the red common subtree, but is now much closer to the “right cerebral lobe“ lesion, forming a parallel clade with the “right liver lobe“ lesion. In general, whereas the pairwise tree shows a very linear topology, with minimal branching and no disjunct subclades, these features are clearly present in the joint reconstruction. ([fig:tanglePhyloCA9]).

Longitudinal analysis

In addition to the analysis of spatial heterogeneity through multi-regional tumour sampling, we were also interested in the use of our novel joint variant calling workflow for the analysis of temporal heterogeneity through longitudinal samples collected from the same patient over time. Examples of longitudinal analysis could include the joint analysis of diagnostic and relapse tumour tissue samples, or even the repeated serial testing of ctDNA. In the following section, we present work using the published workflows on a longitudinal dataset, which highlights the flexibility and widespread usability of the new methods. Similar to the spatially related samples, the joint analysis can improve the performance, which then in turn enables improved detection of lower allele frequency variants, particularly in the setting of low tumour burden as is commonly seen with ctDNA analysis.

In addition to their autopsy which resected nine distinct tumour sites ([fig:CA-Fschematic]), Patient “CA-F“ also had three longitudinal blood samples taken, from which ctDNA was extracted and WES performed. These blood samples were taken as non-invasive surveillance seven, five and two months before the death of the patient ([fig:CA-Ftimeline]).

Timeline from diagnosis till death for patient CA-F: 1.9mm melanoma removed after diagnosis but with negative sentinel lymph node biopsy; : PET scan and subsequent liver biopsy confirm relapse with wide spread metastases; trametinib treatment from Oct. 2012 till Jan. 2013 with minor response; blood plasma samples during treatment (1 and 2) as well as after progression (3); death and rapid autopsy of nine metastatic sites (, [fig:CA-Fschematic]); Tumour fraction in plasma samples was estimated via digital droplet PCR quantification of the original driver mutation (BRAF:K601E)
Timeline from diagnosis till death for patient CA-F: 1.9mm melanoma removed after diagnosis but with negative sentinel lymph node biopsy; : PET scan and subsequent liver biopsy confirm relapse with wide spread metastases; trametinib treatment from Oct. 2012 till Jan. 2013 with minor response; blood plasma samples during treatment (1 and 2) as well as after progression (3); death and rapid autopsy of nine metastatic sites (, [fig:CA-Fschematic]); Tumour fraction in plasma samples was estimated via digital droplet PCR quantification of the original driver mutation (BRAF:K601E)

To show that even in longitudinal data, the joint analysis can boost the signal, we jointly variant called variants in the diagnostic biopsy sample with the three ctDNA samples and compared them with the results from the pairwise analysis. On average, we found 2905 additional variants in each of the ctDNA samples, which is more than doubles the average number of variants found with the pairwise analysis (2414). Out of those, we found 534 (<math display="inline">\approx 20\%</math>) variants in the ctDNA samples, which were found as a high confidence variant in the diagnostic sample, indicating that these findings were high quality calls.

Improved somatic variant calling in longitudinal data: Variant allele frequency (VAF) of variants found additionally through joint variant calling which were found as high confidence variants in the primary sample; Variants with less than 0.1 VAF in the primary are coloured grey; “T1 recovered“ shows variants, which were high confidence in all ctDNA samples but T1 and were only found through joint calling there; Axis label show the date of blood collection
Improved somatic variant calling in longitudinal data: Variant allele frequency (VAF) of variants found additionally through joint variant calling which were found as high confidence variants in the primary sample; Variants with less than 0.1 VAF in the primary are coloured grey; “T1 recovered“ shows variants, which were high confidence in all ctDNA samples but T1 and were only found through joint calling there; Axis label show the date of blood collection

As in the spatially analysis, in longitudinal data lower tumour purity samples benefit more from the joint analysis. We see that time point 2 (T2) had the highest number of recovered variants (377) which were found as high confidence variants in both other time points ([fig:longitudinalVAFsctDNA] A vs. B vs. C) and T2 also has the lowest ctDNA fraction recorded (T1: 60%; T2: 20%; T3: 60%). A total of 106 variants were not found in the ctDNA samples with the pairwise analysis, even though they were high confidence variants in the primary sample ([fig:longitudinalVAFsctDNA]F). These variants usually showed a lower depth of coverage (dp) in the ctDNA samples, which is likely the explanation as to why they were below the limit of detection of the pairwise analysis.

Longitudinal data informs diagnostic variant calling: VAFs of variants additionally found through joint calling in the primary samples; A) “Primary recovered by PASS“ shows variants which were high confidence in at least one ctDNA sample but not found in the primary; B) “Primary recovered joint“ shows variant which were low confidence in all samples in the pairwise analysis individually but were called as high confidence in the joint analysis; Axis label show the date of blood collection
Longitudinal data informs diagnostic variant calling: VAFs of variants additionally found through joint calling in the primary samples; A) “Primary recovered by PASS“ shows variants which were high confidence in at least one ctDNA sample but not found in the primary; B) “Primary recovered joint“ shows variant which were low confidence in all samples in the pairwise analysis individually but were called as high confidence in the joint analysis; Axis label show the date of blood collection

Finally, we can also find 398 additional variants in the primary sample. 396 of these variants were discarded in the pairwise analysis due to low evidence in the tissue sample, but could be found with a high confidence in the longitudinal data. The last two variants were could be found in the joint analysis, as all 4 samples showed evidence for this variants just below the limit of detection ([fig:longitudinalVAFsprimary]).

This shows that both spatially and longitudinal related samples should be analysed jointly, as it substantially increases the amount of true variants found, which can have a large impact on the downstream analysis of the samples.

Clonal deconvolution

One of the most important pieces of information that can be derived from multiple related samples from the same patient is the clonal deconvolution, where subclonal reoccurring patterns of mutations (clones) are resolved both spatially and longitudinally. These reoccurring clones can be linked to either parallel evolution through positive selection pressure, like a targeted drug, or due to the process of developing metastases where a part of the cancer disseminates and grows at a different site. In contrast to the lack of options for joint somatic variant calling, there is a plethora of algorithms and methods available for clonal deconvolution. Since 2015 PhyloWGS (Deshwar et al. 2015), Canopy (Y. Jiang et al. 2016), CLOE (Marass et al. 2016), CloneFinder (Miura et al. 2018), MACHINA (El-Kebir, Satas, and Raphael 2018) and MOBSTER (Caravagna et al. 2020) were published, to name a few. Underlying all of these models is the ability to cluster variants with similar variant allele frequencies together, to reduce the combinatorial space and enhance the confidence in the signal (Tarabichi et al. 2021). Due to the high number of tools, it is very challenging to select the right tool, especially since all of them have advantages and disadvantages (Miura et al. 2020). In this work we decided to use PhylogicNDT (Leshchiner et al. 2018) as it has been shown to work well on clinical samples (Gerstung et al. 2020) and does not have the restriction for the input to be from copy number neutral areas which many of the other tools have.

Both the variants found with the default pairwise as well as with the new joint workflows were annotated with their local allele specific copy number to form a MAF like file format which is required by PhylogicNDT. While PhylogicNDT allows the user to supply the cancer cell fraction for every variant, the program can also estimate them from the supplied allelic counts and the copy number. Local copynumber calls were derived from copy number segment calls made by Sequenza by intersecting the chromosomal location of each variant with the copy number segment containing the variants location. This requires multiple steps and the source code is shown in [lst-jvcAppendix:parseVcf] (parsing VCF), [lst-jvcAppendix:cnv] and [lst-jvcAppendix:convertMAF] (convert to MAF format). Variants which couldnt be annotated with copy number information, because their genomic location did not overlap with any called copy number segment, were discarded for this analysis.

Reconstructed clonal trees from PhylogicNDT; Blue circle at top depicts the germline/normal state. The coloured edges with the same coloured circle represents a distinct subclone of the parent from which the edge emerges; The number in braces next to the edge is the number of mutations which define this subclone with an added gene symbol added, if there was a cancer driver gene mutation. The left part shows the result when using the default pairwise method of Strelka2 and the right side uses the results from the Strelka2Pass workflow
Reconstructed clonal trees from PhylogicNDT; Blue circle at top depicts the germline/normal state. The coloured edges with the same coloured circle represents a distinct subclone of the parent from which the edge emerges; The number in braces next to the edge is the number of mutations which define this subclone with an added gene symbol added, if there was a cancer driver gene mutation. The left part shows the result when using the default pairwise method of Strelka2 and the right side uses the results from the Strelka2Pass workflow

[fig:clonaldeconv] shows the highest parsimony clonal tree reconstructed by PhylogicNDT for the pairwise as well as the joint variant calling for patient CA-F. As the copynumber calling information is the same for both inputs, the only difference is in the called variants. While there was no subclonal structure detected at all for the pairwise analysis, there is a highly complex branching structure detected using the jointly called variants with multiple subclones originating from the ancestral clone. As this is a clinical sample, we cannot be certain that the more branched model is the actual truth, but it is biologically more logical that a late stage cancer has developed several subclones, rather than it being a very homogeneous disease at all of the 10 sites at autopsy with no evolution over ten years of disease (Gerstung et al. 2020). It is of particular interest, that the CDC27 gene was mutated at different time points in different clones (clone 8 vs. clone 4), which is a clear indicator of convergent evolution, which would definitely be missed without the joint analysis.

Longitudinal enriched phylogeny

Of course it is finally also possible to build a phylogeny which incorporates data from both the spatial tumour tissue and longitudinal ctDNA analysis. However, as the ctDNA can provide a holistic view of all cancer metastases ([intro-sec:ctDNA]) the interpretation needs to accommodate for that.

Reconstructed phylogeny with longitudinal ctDNA samples: Tree from [fig:ca9phylo] with three additional ctDNA samples from different time points approximately one year prior to death. The ruler shows the equivalent of 2000 mutations; LN = lymph node
Reconstructed phylogeny with longitudinal ctDNA samples: Tree from [fig:ca9phylo] with three additional ctDNA samples from different time points approximately one year prior to death. The ruler shows the equivalent of 2000 mutations; LN = lymph node

The addition of the ctDNA samples led to a further bipartition edge, which separates the “right liver lobe“, “small bowel“ and “right cerebral lobe“ lesions from the rest of the tree ([fig:phyloCA9ctDNA]). This was already inferable from the topology of the previous tree in [fig:tanglePhyloCA9] “joint“, but is even more pronounced with the inclusion of the ctDNA samples.

This shows that the addition of more samples helps to refine and improve the trajectory and history of cancer samples and it is vital to do this analysis jointly to generate the optimal result.

Usage statistics and uptake

Ultimately when choosing research software, publication and citations are not a good metric to evaluate the quality of a method (Gardner et al. 2022). Many published software packages are not maintained or not even functional even though they are published. While we developed these joint somatic variant calling workflows to deal with a challenge we faced, the interest of others has been continuously expressed by bother members of the bioinformatics community in the short period of time since publication.

Cumulative download numbers of the “dawsontoolkit“ docker container since publication of the manuscript; Actual counts are shown as dots, with smoothed trajectory depicted as dotted line with the 95% confidence interval shown as a grey background; confidence interval has been adjusted with exponential decay of prediction accuracy with distance from the last data point; Start date 7<math display="inline">^{th}</math> September 2021 (publication of method); End point prediction 31<math display="inline">^{th}</math> December 2022 (End of current year); Numbers were recorded daily from the DockerHub API
Cumulative download numbers of the “dawsontoolkit“ docker container since publication of the manuscript; Actual counts are shown as dots, with smoothed trajectory depicted as dotted line with the 95% confidence interval shown as a grey background; confidence interval has been adjusted with exponential decay of prediction accuracy with distance from the last data point; Start date 7<math display="inline">^{th}</math> September 2021 (publication of method); End point prediction 31<math display="inline">^{th}</math> December 2022 (End of current year); Numbers were recorded daily from the DockerHub API

To have some proxy of the usage statistics of the workflows, we recorded the download numbers of the “dawsontoolkit“ docker container after the publication of the manuscript. The container only consists of software for refiltering and joint analysis of the workflows. Obviously, this is an imperfect measurement, as people can reuse a downloaded container as often as they want, which would not appear in the count and similarly, just because the container was downloaded, the analysis might not have been used. Nevertheless, it still shows an interaction and an interest in the methods. The download numbers show a sustained and stable increase in downloads ([fig:usageStats]). This suggests, that there is a need in the methods, rather than a simple curiosity after publication, which hopefully will facilitate a higher quality analysis of future projects and therefore lead to a better understanding of cancer evolution and heterogeneity.

“Death is a release from and an end of all pains: beyond it our sufferings cannot extend: it restores us to the peaceful rest in which we lay before we were born“

CASCADE - Late stage lung cancer in the spotlight

Introduction

As tumour heterogeneity is seen as one of the major causes of resistance to treatment and ultimately relapse, much cell line based research has been conducted to solve tissue of origin and evolutionary trajectories via bulk and single cell sequencing paired with cellular barcoding (Fennell et al. 2021; Penter, Gohil, and Wu 2022). However, while cell line models are a great resource for high throughput methods and allow easier reproducibility of results, they are no real substitute for primary patient cells. With the increased availability of patient samples through bio-banking efforts like the UK BioBank (Sudlow et al. 2015) and the Victorian Cancer BioBank (Cancer Council Victoria 2006), both patient derived xenografts (PDX) and organoids have gained more and more traction (Yoshida 2020) as specialised models to grow primary patient cells in an environment which closely resembles the body of the patient. While this method is superior in many aspects, there are some significant drawbacks. The culturing of the cells requires more effort and is not as easily scalable. These methods also require fresh patient samples, which are not always readily available.

While it is fairly easy to collect diagnostic specimens from tumour biopsies for storage and research, late stage tumour biopsies are rare. Due to the deteriorating health status of the patient biopsies can be dangerous and often an unnecessary burden for the patient. However, these samples are especially critical when answering the question of how the cancer was able to evade treatment and lead to death, as it may reveal an unappreciated insight into spatial and temporal heterogeneity.

To try to combat this issue the cancer tissue collection after death (CASCADE) program was initiated. It recruits cancer patients close to the end of life and enrolls them in a rapid autopsy program. These autopsies are carried out at any time of the day to minimise the impact on the sample to allow high quality assessment including DNA and RNA sequencing of the frozen cells (Alsop et al. 2016). While the program collects cancer patients unconditionally of their type of disease, the analysis of this thesis is restricted to five lung cancer patients available at the time of this work. Currently there are no extra lung cancer patients enrolled, but recruitment is still ongoing. Four of these five patients had an Epidermal growth factor receptor (EGFR) based cancer and one had a RET Proto-Oncogene (RET) fusion with KIF5B. Each of those patients had on average 30 specimens resected and put into a bio bank. We then continued to sequence, on average, eight of these samples with either whole genome sequencing (WGS) or whole exome sequencing (WES) to deeply analyse and classify the underlying resistance and driver mechanisms of each patient and their heterogeneity.

Lung cancer

With around 1.6 million deaths world-wide each year, lung cancer is the number one cause of cancer death (Siegel, Miller, and Jemal 2018). Every year about twelve thousand Australians get diagnosed with lung cancer. These cases can be generally split into two groups: small cell lung cancers (SCLC) and non-small cell lung cancers (NSCLC), which account for around 15% and 85% of cases, respectively. The majority of NSCLC are either lung adenocarcinoma or lung squamous cell carcinoma (Molina et al. 2008). Even though smoking is highly associated with lung cancers, there is a big group of never smokers, with a high risk of lung cancers in East Asia, especially women, which is correlated with outside influences like pollution and occupational carcinogens and paired with genetic susceptibility (Sophie Sun, Schiller, and Gazdar 2007). This group usually shows EGFR (epidermal growth factor receptor) driven tumours. EGFR is a transmembrane receptor tyrosine kinase, which is usually only expressed in epithelial, mesenchymal, and neurogenic tissue, but its overexpression in other tissues is a hallmark of many human malignancies, not just NSCLC.

Even with those strict classifications in place, it is widely accepted, that cancer is a heterogeneous disease, which needs to be accounted for when developing treatments (Suda et al. 2018). The ongoing research of lung cancer has led to a shift from cytotoxic chemotherapy to a more personalized approach by accounting for the genetic background of each patient’s disease (Lindeman et al. 2018). But not only the inter-patient heterogeneity needs to be taken into account, but also the heterogeneity between different sites of the disease in the same patient (Leong et al. 2018; Savas et al. 2016). This makes the choice of treatment for one single patient more and more difficult, as some sites might respond to treatment, where others might not. This means, in order to design the perfect treatment regime for a patient, a deep understanding of the overall complexity of the disease is needed. By studying a diverse background of driver mechanisms of lung cancers and their respective treatment and resistance modes, a general insight in the biologic background is possible. Analysing not only one, but several metastases of the same patients paints a much clearer picture of disease progression and the process behind the resistance to treatment that ultimately led to death.

Publications

This chapter includes and reproduces data analysis that contributed to two publications, however as they were not sole first author publications, they are only mentioned here instead of included in full. The first publication features the resistance mechanism of small cell transformation seen in patient CA-L ([cascade-sec:CA86]) (' Burr et al. (2019)) and the second shows the discovery of emerging novel resistance mutations in a RET-fusion driven NSCLC in patient CA-A ([cascade-sec:CA99]) (' Solomon et al. (2020)).

Patient level analysis

This section outlines the analyses performed for each patient and highlights work specifically done for certain patients due to their unique clinical features. However, most of the analysis was streamlined with the same workflow applied to each patient. The following sections expand on the individual steps.

  1. Quality control: Each sample of a patient is checked for kinship and sequencing quality
  2. Read mapping
  3. Joint somatic variant calling: SNPs, InDels and SVs are called jointly
  4. Copy number calling
  5. Variant effect annotation: short and structural variants are annotated with possible biological effects
  6. Phylogenetic reconstruction
  7. Clonal deconvolution

Analysis workflow

This section summarises the primary analysis performed for each patient in detail. Specific analysis are discussed in the individual patient sections.

Quality control

When multiple samples per patient are available, the possibility of sample mix-ups is higher than when just dealing with a tumour normal pair, so in addition to the standard read depth, sequencing quality and reads-on-target analysis that is routinely performed after sequencing, we performed an additional step of kinship detection. We use concepts commonly employed in germline cohort analysis, like child and parents (trio) or even large databases (gnomAD). As most germline variants are due to mendelian inheritance, we can use the percentage of shared homo- and heterozygous germline variants to estimate the relatedness of two samples. For our analysis we used NGSCheckMate (Lee et al. 2017) and all the results shown in later sections are based on it, however we also used Somalier (Pedersen et al. 2020) on two patient samples with surprising kinship results and Somalier confirmed the result.

While this analysis is very useful to detect samples which do not belong to a patient, either through mislabelling or similar, it does not protect from mix-ups within a patient’s samples. However, only orthogonal validation will be able to discern these errors.

Other quality controls were performed with fastQC (Andrews 2010) for read integrity and ‘CollectWgsMetrics’ from Picard (Broad Institute 2019) for WGS samples and ‘samtools flagstat’ (Danecek et al. 2021) for on-target estimation for WES samples.

Read mapping

For highest mapping performance, reads were aligned alternative contig aware with BWA (Heng Li 2013) (v0.7.17) to GRCh38 (GCA_000001405.15) with alternative contigs but no decoy regions. Initial mapping was post-processed with ‘bwa-postalt.js’ from bwa-kit to adjust the mapping assignment and quality mapping both to alternative and canonical contigs. Finally reads were duplicate marked with ‘MarkDuplicates’ from the Picard-toolkit.

Joint somatic variant calling

For short variants (SNPs and InDels), the workflows presented in [ch:variantcalling] were used and while the Strelka2Pass workflow generates structural variant calls, they are not jointly called over all samples. Instead for the structural variants (SVs) we used GRIDSS2 (Cameron et al. 2021), which has a calling model for multiple related tumour samples and as GRIDSS2 is also a prerequisite for copy number calling with PURPLE ([cascade-sec:cnv]) using the same structural variants allows a higher conformity of analysis.

Copy number analysis

After somatic variant calling, copy number analysis is critical when dissecting the resistance and driver alterations of a tumour sample. While lung cancers are known for their high mutational burden (Ludmil B. Alexandrov et al. 2020), often genetic amplifications can be found as driver or resistance mechanism. One of the more common resistance mechanisms is a high EGFR or MET amplification which significantly affect transcription (Bjaanæs et al. 2021). And while copy number alterations are often shared between metastases (Ni et al. 2013), the same heterogeneity that can be found in variant calling analysis also affects copy number analysis. Many modern copy number calling methods will use the B-allele frequency, the allele frequency of a heterozygous germline variant, to gain allele specific copy number calls (Favero et al. 2015; Talevich et al. 2016; Cameron et al. 2019). Although each of those methods will only use the input of one tumour and one germline sample. As described in [ch:variantcalling] we can actually improve the performance by analysing all tumour samples jointly. So far only HATCHet (Zaccaria and Raphael 2020) has a joint copy number calling method, but requires significant time investment for installation and subjective manual parameter optimisation on a per patient basis. In contrast both sequenza and PURPLE have very easy installation and usage procedures. To ensure low subjectivity and high reproducibility of our result, we chose to not use HATCHet, and instead use the clinically used and approved PURPLE workflow for all WGS samples and sequenza for WES seeing its successful use in similar situations (Leong et al. 2018; Vergara et al. 2021) and because PURPLE is not suitable for WES data. In spite of the potentially higher accuracy of HATCHet, virtually no downstream analysis was equipped to utilise multi clone resolution copy number calls.

Variant effect annotation

For small variants (SNPs and InDels) “Variant Effect Predictor“ (VEP) version 92 (McLaren et al. 2016) was used to assign possible effects. As a variant can affect multiple genes due to overlapping gene boundaries, effects within a curated list of lung cancer related genes ([A:cas:tab:lungcancergenes]) were assigned an impact in line with the VEP provided impact values of ‘LOW’, ‘MODERATE’  and ‘HIGH’ . To only have one effect per variant, only the variant with the highest impact was returned. In cases of multiple transcripts being affected with the same impact level, the putative canonical transcript result is used.

For structural variants, the effect annotation depends on the type of the structural variants. For amplifications and deletions, the genes within the variant are compiled and returned as a list. The effect of inversions and similar structural changes are assumed to be fusion based, so the breakpoint is annotated with the gene hit by both breakpoints and a potential fusion gene is returned.

Phylogenetic reconstruction

Variants called in any sample were transformed into a binary presence/absence vector with a pure absence vector as the germline native state. The vectors were then concatenated into a string representation, and for each pair the Hamming distance were computed (Hamming 1950). The distance matrix was used as input for the neighbour joining algorithm and visualised with ape (Paradis and Schliep 2018).

Clonal deconvolution

Clonal deconvolution for each patient was done with PhylogicNDT Cancer cell fractions (CCF) were left to PhylogicNDT with the option ‘–maf_input_type calc_ccf’ by supplying the allele specific local copy number call for each variant the same way as shown in [variantcalling-sec:clonal] with the copy number calls from [cascade-sec:cnv]. If no copy number was reported for a variant, it was removed from the analysis.

For the clustering of variants all variants with no known protein changing function were included, by removing all variants with the VEP consequence “intergenic variant“, “intron variant“, “upstream gene variant“, or “downstream gene variant“. While these variants certainly might distinguish clones within the sample, they could only arise through random genetic drift and did not relate to resistance mechanisms.

As PhylogicNDT can only visualise 58 distinct clusters, due to the lacking number of distinct colours, we restricted the analysis after clustering all mutations. Clusters with a variant in one of the 319 driver genes suggested by PhylogicNDT were always retained. All other clusters were automatically removed, if the number of variants <math display="inline">n</math> supporting the cluster was smaller than 10, or the CCF value in each sample was too homogeneous, or the confidence interval <math display="inline">CI</math> of the CCF value was too high.

The homogeneity of the CCF value of a cluster <math display="inline">c</math> was assessed by calculating the z-score of each sample <math display="inline">s</math> CCF in respect to all other samples CCF. If one samples z-score indicted a fold change of less than 1.5, the cluster was removed ([eq:clusterRestriction]).

<math display="block">Inclusion(c) = \left\{ \begin{array}{@{}lcr@{}} \text{FALSE,} & for & n_c(vars) < 10 \\ \text{FALSE,} & for & \forall s:\ | \text{z-score}(CCF_s) | < 1.5 \\ \text{FALSE,} & for &\forall s:\ CI(CCF_s) < 0.1 \\ \text{TRUE,} & else &\\ \end{array} \right. \label{eq:clusterRestriction}</math>

Patient CA-A

This patient was a 61 year old male with a metastatic RET-KIF5B fusion positive NSCLC. After failure of Carboplatin, Pemetrexed, and Pembrolizumab followed by the multi kinase inhibitor Lenvatinib, he then received compassionate access to the RET tyrosine kinase inhibitor Selpercatinib ([fig:ca99timeline]). He experienced almost immediate improvement following Selpercatinib with decreased levels of carcinoembryonic antigen and almost 100% reduction of RET fusion positive ctDNA after one month ([fig:ca99ctDNA]). Similar to the ctDNA analysis, Positron emission tomography (PET) and computed tomography (CT) imaging revealed significantly reduced tracer uptake in multiple sites and partial response to treatment ([fig:ca99pet]).

Timeline of patient CA-A from diagnosis until death: Diagnostic biopsy detected <i>KIF5B-RET</i> positive lung adenocarcinoma; SRS: stereotactic radiosurgery; WRBT: whole brain radiation therapy; a total of six blood samples were taken just before and during the selpercatinib treatment.
Timeline of patient CA-A from diagnosis until death: Diagnostic biopsy detected KIF5B-RET positive lung adenocarcinoma; SRS: stereotactic radiosurgery; WRBT: whole brain radiation therapy; a total of six blood samples were taken just before and during the selpercatinib treatment.

Serial sampling of the plasma of the patient and analysis with the commercial Guardant360 assay (Talasaz et al. 2014) revealed a previously undetected RET G810S resistance mutation after three months of treatment . While at this point the driver mutation allele frequency was still dropping in the plasma, by month four the abundance of RET G810S had increased and was accompanied by additional mutations in the same site (RET G810R, C, and V). In addition there was an increase of fusion positive ctDNA this suggesting the development of acquired resistance to Selpercatinib. While the patient initially was responsive to the treatment, repeat PET scans showed progressive disease after six months, which ultimately led to the death of the patient ([fig:ca99pet]).

Allelic frequencies of driver and emerging resistance mutations during Selpercatinib treatment (11 months after diagnosis); <i>KIF5B-RET</i> fusion is the initiating driver with RET G810R/S/C/V the emerging resistance SNPs
Allelic frequencies of driver and emerging resistance mutations during Selpercatinib treatment (11 months after diagnosis); KIF5B-RET fusion is the initiating driver with RET G810R/S/C/V the emerging resistance SNPs

PET scans of patient CA-A before and during Selpercatinib treatment
PET scans of patient CA-A before and during Selpercatinib treatment

Schematic of tumour lesions in patient CA-A: Primary diagnostic sample shown in red; All 24 autopsy samples were coloured by organ they were collected from: Brain (7), left lung (2), right lung (1), liver (9), T8 bone (1), ascitic fluid (1), adrenal gland (2), kidney (1); Additionally to the post mortem blood sample, six serial blood samples were taken ([fig:ca99timeline])
Schematic of tumour lesions in patient CA-A: Primary diagnostic sample shown in red; All 24 autopsy samples were coloured by organ they were collected from: Brain (7), left lung (2), right lung (1), liver (9), T8 bone (1), ascitic fluid (1), adrenal gland (2), kidney (1); Additionally to the post mortem blood sample, six serial blood samples were taken ([fig:ca99timeline])

At autopsy, 24 tumour tissue biopsies and a post mortem blood sample were collected and eight of them were selected for WGS at 130x coverage ([fig:ca99schematic], [tab:ca99wgsSamples]) and analysed with the standard workflow ([cascade-sec:workflow]).

Autopsy samples sequenced for patient CA-A: Sample number is the internal sample collection during CASCADE autopsy, the organ of the sample, the fraction of tumour cells from H& E stain and the pathology of the tumour sample.
Sample number Organ H&E Type
11 right occipital lobe 0.7
26 right liver lobe 0.6
31 left lower lung 0.2
41 left liver lobe 0.2
47 left liver lobe 0.5
55 left liver lobe 0.4
57 right liver lobe 0.6
59 right pleura 0.7

Somatic variant calling revealed substantial spatial heterogeneity, where both the occipital lobe and the right pleura sample only contained RET G810S, the right liver lobe harboured predominantly RET G810R with either G810S and G810C as minor clones and lastly, the left liver samples showed almost an even mix between G810C and G810S clones but no G810R presence. The emergence of these mutations in multiple different sites at different allele frequencies, especially in already established sites in the liver, suggests that these mutations are the result of parallel evolution under positive selection through therapy, rather than seeding from one resistant clone. Apart from the mutations changing RET G810 no other variants affecting RET or any other lung cancer genes were found in multiple samples. The occipital lobe sample also contained a BRCA1 V939A mutation and one left liver sample (47) showed a synonymous KIT S967%3D mutation. Additionally, no other variant found in non cancer related genes allowed the same explanation of resistance ([fig:ca99heatmap]).

Phylogeny based on the short variants showed a clear clustering of the right (26 and 57) and left liver samples (41, 47, and 55) with the ocipital lobe (11) and pleural sample (59) sharing the most mutations as a hint towards the longest evolutionary trajectory and final progression. There was also a bifurcating line separating several progression (11, 41,47,55; top right) and stable disease sites. The much lower and higher number of both samples 31 and 41 may have been contributed to by the low tumour purity of those samples ( t@[tab:ca99wgsSamples]  3.1 @@ and [t% ab:ca99cnv] , [t% ab:ca99cnv] @ , [fig:ca99phylo]).

Phylogeny of autopsy samples from patient CA-A; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference
Phylogeny of autopsy samples from patient CA-A; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference

Heatmap of driver gene variants in patient CA-A: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.
Heatmap of driver gene variants in patient CA-A: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.

The structural variant calling with GRIDSS2 showed consistent presence of the KIF5B-RET fusion at high allele frequency (min: 0.27 max: 0.535), consistent with a cancer cell fraction of 1 when correcting for local copy number changes (min: 2 max: 3; [fig:ca99.11circos]), in all but sample 31, which might have been due to the low purity of the sample ([tab:ca99wgsSamples]). While there was a high number of structural variants present in each sample, consistent with the genomic instability commonly seen in late stages of cancer (Gerstung et al. 2020) almost of these rearrangements were sub-clonal and therefore not the main cause of resistance or cancer initiation and rather the result of progressive tumour evolution. To allow a more focused look at structural events and their effect, we restricted the visualisation to events with an allele frequency of 0.2 or higher ( f@[fig:ca99.11circos]  3.7 @@ and [f% ig:ca99.11circosnoAF] , [f% ig:ca99.11circosnoAF] @ ).

While this change also has a minute effect on the PURPLE copy number calls, which are informed by structural variants, these structural variants will also be sub-clonal and therefore removal will result in a cleaner clonal copy number profile.

Circos plot of patient CA-A with somatic structural variants with allele frequency <math display="inline">> 0.2</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-A with somatic structural variants with allele frequency <math display="inline">> 0.2</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Circos plot of patient CA-A with all somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-A with all somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

f@[fig:ca99.11circos]  3.7 f@ and [f% ig:ca99.26circos] , [f% ig:ca99.26circos] f ig:ca99.41circos,fig:ca99.47circos,fig:ca99.55circos,fig:ca99.57circos,fig:ca99.59circos,@ show very similar copy number profiles of all tumour samples of patient CA-A, with loss of heterozygosity on almost all chromosomes apart from chromosome 5 and 9. Only [fig:ca99.26circos] showed a less granular mix of gain and loss of the minor allele, which was most likely due to the lower tumour purity of the sample. However, all samples exhibited high copy number gain levels consistent with whole genome duplication ( t@[tab:ca99wgsSamples]  3.1 @@ and [t% ab:ca99cnv] , [t% ab:ca99cnv] @ ). All samples showed a copy number gain in chromosome 7 at the EGFR locus leading to EGFR amplification (min: 3.9 max: 9.9), which is a known resistance mechanism to Levantinib (Jin et al. 2021) and was therefore most likely a result of previous treatment ([fig:ca99timeline]). Just as expected from the results of the Guardant360 ctDNA results leading up to the death of the patient, there was no MET amplification present in the patient at any site.

With the absence of any other plausible mechanism, of SNV, SV or CNV nature, the only possible explanation of resistance to Selpercatinib in the patient was the solvent front mutations RET G810C, G810R, G810S, and G810V as described in Solomon et al. (2020). However, the autopsy revealed substantial spatial heterogeneity of the emerged resistance mechanisms within the patient which was previously unappreciated.

Copy number analysis results for patient CA-A: results are taken from the best fit result of PURPLE; WG: whole genome
Sample number purity ploidy polyclonal % WG duplication
11 0.73 2.86 29.75
26 0.61 3.00 26.57
31 0.21 5.40 49.86
41 0.28 4.30 42.43
47 0.46 2.86 29.72
55 0.38 2.86 38.41
57 0.61 3.00 28.55
59 0.69 2.60 37.11

With the high percentage of likely sub-clonality of samples ( @@[tab:ca99cnv]  3.2 w@ and [@% )] , [@% )] w hich far exceeded the maximum amount PhylogicNDT was able to visualise. In [fig:ca99.ccfCluster] the clonal abundance per sample for the three clones 64, 70 and 139 was shown, which corresponded to RET G810C, RET G810R, and RET G810S respectively, exactly as seen in the longitudinal data ([fig:ca99ctDNA]). The clonal abundance in each sample was highly variable and followed an exclusion pattern, where if one resistant clone was present another was not. Only sample 31 and 55 showed signs of mixed populations. We attributed these patterns to parallel evolution due to selection pressure of the drug on already established lesions rather than metastatic seeding.

Cancer cell fraction of mutation clusters for patient CA-A; transparent polygons show the 95% confidence intervals. Clusters were generated with PhylogicNDT
Cancer cell fraction of mutation clusters for patient CA-A; transparent polygons show the 95% confidence intervals. Clusters were generated with PhylogicNDT

We attempted to build a clonal tree from the clustered mutations with PhylogicNDT with the restricted clusters using [eq:clusterRestriction], however no output was generated after 480 CPU hours. In contrast all other patients required less than 100 CPU hours. We assumed the high subclonality of the data made it impossible to converge to a solution. This again highlights the necessity of computational methods to close the gap between current algorithms and the available datasets.

Patient CA-I

This 56 year old female never smoker presented with an EGFR exon 19 deletion positive NSCLC in stage IV with metastatic involvement. After an initial good response to Gefitinib treatment, the patient showed progressive nodal disease and was treated with Carboplatin/Gemcitabine chemotherapy with mixed response. After the change to the tyrosine kinase inhibitor Afatinib small intra-cranial metastases were detected and biopsy of a parasternal mass revealed small cell transformation in addition to the EGFR T790M resistance mutation. Subsequent treatment changes to Carboplatin/Etoposide as well as CAV (cyclophosphamide, doxorubicin, vincristine) and finally Nivolumab were not successful and the patient died 40 months after diagnosis ([fig:ca51timeline]).

Timeline of patient CA-I from diagnosis until death: Diagnostic biopsy detected EGFR exon 19 deletion lung adenocarcinoma; Second biopsy after 24 months revealed additional EGFR T790M mutation and small cell transformation
Timeline of patient CA-I from diagnosis until death: Diagnostic biopsy detected EGFR exon 19 deletion lung adenocarcinoma; Second biopsy after 24 months revealed additional EGFR T790M mutation and small cell transformation

At autopsy six lesions and one blood sample were collected and biobanked ([fig:ca51schematic]). After quality assessment by pathology with H&E stain, all autopsy samples and the initial diagnostic biopsy were sequenced with WES ([tab:ca51wesSamples]) and analysed with the standard workflow ([cascade-sec:workflow]). The secondary small cell confirmation biopsy at 29 months did not contain enough tissue for sequencing.

Schematic of tumour lesions in patient CA-I: Primary diagnostic sample shown in red; All six autopsy samples were coloured by organ they were collected from: Parasternal (1), left lung (2), right lung (1), diaphragm (1), liver (1); Additionally a post mortem blood sample was taken
Schematic of tumour lesions in patient CA-I: Primary diagnostic sample shown in red; All six autopsy samples were coloured by organ they were collected from: Parasternal (1), left lung (2), right lung (1), diaphragm (1), liver (1); Additionally a post mortem blood sample was taken

Autopsy samples sequenced for patient CA-I: Sample number is the internal sample collection during CASCADE autopsy, the organ of the sample, the fraction of tumour cells from H& E stain and the pathology of the tumour sample. Dx: diagnostic sample
Sample number Organ H&E Type
Dx right VATS - adenocarcinoma
557 parasternal mass 0.9
559 left diaphragm 0.9
566 right liver lobe 0.6
573 right hilar lymph node 0.9
579 left lung lobe 0.8
583 left pleura 0.9

Somatic variant calling showed very little genetic heterogeneity. The original EGFR exon 19 deletion was present in all sequenced samples from the diagnostic sample to the 40 months later autopsy samples. No other protein altering somatic mutations were detected at a purity corrected allele frequency <math display="inline">\geq 0.33</math>. While the diagnostic sample presented with a TERT H687Q mutation at 25% VAF and sufficient local copy number amplification for 100% cancer cell fraction, none of the autopsy samples showed any support for this variant. Generally, the number of somatic protein altering SNVs and InDels (min: 988 max: 1236 mean: 1090.5) was very close to the average lung adenocarcinoma with an observed tumour mutational burden of 18.75 (Ludmil B. Alexandrov et al. 2013). In contrast, the diagnostic sample showed a much higher number of mutations, which we attribute to the formalin-fixed paraffin-embedded (FFPE) preservation, which is known to cause DNA damage (Hongdo Do and Dobrovic 2015). This DNA damage could have led to a higher rate of called somatic variants ([fig:ca51numVars]). In order to appreciate the relationship of the autopsy samples, we removed the diagnostic sample from the phylogenetic analysis. The full phylogeny can be seen in [fig:ca51phyloWithDx]. The phylogeny of SNVs and InDels shows no internal hierarchical structure, but rather shows that both the stem of tumour initiation and additional private mutations accumulated during the disease progression were approximately equal in number, with the stem being slightly longer. Only a very limited number of somatic variants were shared between cancer samples, which were not part of the initial stem ([fig:ca51phylo]). This was consistent with the clinical history, where there was never any clinical remission to treatment, but only mixed response in a subset of already established lesions.

Phylogeny of autopsy samples from patient CA-I; reconstructed with all somatic SNVs and InDels. Ruler symbolises 2000 variants difference. The phylogeny with diagnostic sample can be found in [fig:ca51phyloWithDx]
Phylogeny of autopsy samples from patient CA-I; reconstructed with all somatic SNVs and InDels. Ruler symbolises 2000 variants difference. The phylogeny with diagnostic sample can be found in [fig:ca51phyloWithDx]

In keeping with the small cell transformation, we could highlight an intronic TP53 mutations, which was present at almost 100% VAF in all autopsy samples. However, the same variant was also found in the diagnostic sample, which was reportedly adenocarcinoma. The variant was therefore not sufficient to drive the transformation, but could have been a predisposition for the patient ([fig:ca51heatmap]).

Heatmap of driver gene variants in patient CA-I: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.
Heatmap of driver gene variants in patient CA-I: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.

When comparing the diagnostic sample with the samples at autopsy the most striking difference was the lower overall copy number gain. The cause of this difference was the amplified minor allele in the diagnostic sample, which was almost completely absent in all autopsy samples. However, this amplification in the diagnostic sample could be rooted in the FFPE DNA damage combined with the lower purity of the sample, which in turn was used as a sign of amplification of the minor allele after purity correction through sequenza. The consistent feature between all samples, diagnostic and autopsy, was the loss of heterozygosity on chromosomes 2, 4, 10 through 13, and 19, with consistent copy number gains at the end of chromosome 1, 3 and 7 and all of chromosome 5 and 6. The high amplification of chromosome 7 is consistent with the origin of the EGFR driven primary tumour and the copy number loss of the start of chromosome 17 for all autopsy samples, but only a loss of heterozygosity in the primary sample is consistent with the genetic prerequisites for small cell transformation. ([fig:ca51.dxcircos] vs. f@[fig:ca51.557circos]  3.15 f@ and [f% ig:ca51.559circos] , [f% ig:ca51.559circos] f ig:ca51.566circos,fig:ca51.573circos,fig:ca51.579circos,fig:ca51.583circos,@ and [tab:ca51cnv]).

Circos plot of patient CA-I sample dx with somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-I sample dx with somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Circos plot of patient CA-I sample 557 with somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-I sample 557 with somatic structural variants: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Copy number analysis results for patient CA-I: results are taken from the best fit result of sequenza
Sample number purity ploidy WG duplication
Dx 0.29 5.5 True
557 0.93 2.6 False
559 0.96 2.6 False
566 0.69 2.7 False
573 0.94 2.6 False
579 0.95 2.6 False
583 0.95 2.6 False

Clonal evolutionary tree of patient CA-I; Highest support tree for clustered ccf clones generated with PhylogicNDT; Support for clone is shown in parenthesis; Major driver alterations of clones were annotated; Clusters with less than 5 supporting variants were discarded; Cluster with 2000 supporting variants only present in sample Dx was discarded as FFPE artefact
Clonal evolutionary tree of patient CA-I; Highest support tree for clustered ccf clones generated with PhylogicNDT; Support for clone is shown in parenthesis; Major driver alterations of clones were annotated; Clusters with less than 5 supporting variants were discarded; Cluster with 2000 supporting variants only present in sample Dx was discarded as FFPE artefact

Clonal deconvolution of somatic variants with PhylogicNDT revealed a linear, most likely longitudinal, development of clones adjusting to the changing treatment, with the initial clone 1 containing the exon 19 deletion and individual subclones branching off. While a HLA-A disrupting variant in cluster 13 was observed at high frequency in the diagnostic sample, it was out-competed by other clones at autopsy, and seen as transient cluster 4. Due to the small cell transformation, which correlates with down regulation of major histocompability complex (MHC) components, a direct disruption of HLA-A was likely not necessary anymore. Some clones also were only observed in specific sites, like cluster 15 or not found at certain sites (cluster 5) which pointed to high heterogeneity of disease at autopsy ( f@[fig:ca51.clonalTree]  3.16 @@ and [f% ig:ca51.ccfCluster] , [f% ig:ca51.ccfCluster] @ ).

These results agree with the phylogenetic reconstruction which showed initial shared somatic evolution of all sites, with strong individual evolution and limited substructure.

Cancer cell fraction of mutation clusters of clonal tree for patient CA-I; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca51.clonalTree]
Cancer cell fraction of mutation clusters of clonal tree for patient CA-I; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca51.clonalTree]

Patient CA-J

Patient CA-J was a 65 year old female never smoker, who presented with moderately differentiated lung adenocarcinoma (Stage IIIB). Molecular pathology revealed metastatic EGFR L858R positive disease. Initial treatment with both Carboplatin/Paclitaxel and radiotherapy was halted after detection of metastatic disease with bone, left adrenal gland and bilateral lung lesions and she was changed to the tyrosine kinase inhibitor Erlotinib. Progressive pulmonary disease and subsequent left lung core biopsy showed an additional BRAF V600E mutation. Treatment was adjusted to Carboplatin and Pemetrexed, however further disease progression was evident. The patient was finally enrolled in the EVICT trial involving treatment with the BRAF inhibitor Vemurafenib in combination with Erlotinib, which led to stable bone metastasis after one month, but ultimately led to progression of both pulmonary and bone metastases. The patient died 29 months after initial diagnosis ([fig:ca80timeline]).

Timeline of patient CA-J from diagnosis until death: Diagnostic biopsy detected EGFR L858R positive stage IIIB lung adenocarcinoma; Second diagnosis after 3 months revealed additional brain, bone and lung metastasis with a reclassification to stage IV; Biopsy at the end of erlotinib treatment revealed additional BRAF V600E mutation; one blood sample was taken during the second round of chemotherapy and three more during the time the patient was enrolled in the EVICT trial
Timeline of patient CA-J from diagnosis until death: Diagnostic biopsy detected EGFR L858R positive stage IIIB lung adenocarcinoma; Second diagnosis after 3 months revealed additional brain, bone and lung metastasis with a reclassification to stage IV; Biopsy at the end of erlotinib treatment revealed additional BRAF V600E mutation; one blood sample was taken during the second round of chemotherapy and three more during the time the patient was enrolled in the EVICT trial

Serial plasma sampling Just before and during the enrollment in the EVICT trial allowed us to monitor the genomic landscape of the disease during treatment via specially designed ddPCR analysis. After the second round of chemotherapy a <math display="inline">\approx 60\%</math> variant allele fraction of EGFR L858R was found, suggesting a high ctDNA fraction. The after initial partial response to the change to Vemurafenib and Erlotinib in the EVICT trial, accompanied with a substantial drop in detectable EGFR L858R, the patient relapsed. This progression could also be observed in the steady increase in the TP53 “stop gained“ and BRAF V600E mutation, which were detectable at higher levels than before the EVICT trial ([fig:ca80plasma]).

Blood plasma analysis of patient CA-J: Three putative driver mutations were analysed at four time points during during treatment. progression and partial response were assigned by clinicians based independent of ctDNA analysis; Y-axis was broken from 25-60 for visibility
Blood plasma analysis of patient CA-J: Three putative driver mutations were analysed at four time points during during treatment. progression and partial response were assigned by clinicians based independent of ctDNA analysis; Y-axis was broken from 25-60 for visibility

At autopsy 18 sites of disease were resected and biobanked. A representative six samples from different organs and sites were selected and WGS was performed after H& E staining confirmed high enough tumour purity ([tab:ca80wgsSamples], [fig:ca80schematic]). All WGS samples were analysed with the standard analysis workflow ([cascade-sec:workflow]).

Schematic of analysed tumour lesions in patient CA-J: Primary diagnostic sample shown in red; All 18 autopsy samples were coloured by organ they were collected from: skull (1), left lung(5), right lung (6), diaphragm (2), liver(2), adrenal gland (2); Additionally to the post mortem blood sample, four serial blood samples were taken ([fig:ca80timeline])
Schematic of analysed tumour lesions in patient CA-J: Primary diagnostic sample shown in red; All 18 autopsy samples were coloured by organ they were collected from: skull (1), left lung(5), right lung (6), diaphragm (2), liver(2), adrenal gland (2); Additionally to the post mortem blood sample, four serial blood samples were taken ([fig:ca80timeline])

Autopsy samples sequenced for patient CA-J: Sample number is the internal sample collection during CASCADE autopsy, the organ of the sample, the fraction of tumour cells from H& E stain and the pathology of the tumour sample. Dx: diagnostic sample
Sample number Organ H&E Type
Dx left lung core -
2 adrenal gland 0.5
20 right lower lung 0.7
24 left upper lung 0.9
28 left middle lung 0.5
32 right upper lung 0.5
42 base of skull 0.4

Somatic variant calling revealed heterogeneity of resistance and driver mutations. While the initial driver variant EGFR L858R was present in all autopsy samples, sample 2 presented with an allele frequency of 0.13 with clonal presence in all other sample. The secondary BRAF V600E mutation, which was detected in the progression biopsy 22 months after diagnosis, was not present in sample 2 and only at low allele frequency (0.13) in sample 28. Only sample 32 showed the BRAF mutation at 100% VAF. While the absence in sample 2 could be explained by the overall low tumour purity of the sample, both 24 and 42 had a higher than 50% estimated tumour purity ([tab:ca80cnv]) and showed a lower VAF, therefore suggesting a lesser involvement of the mutation in resistance.

Additionally to the BRAF mutation, some sites (20, 24, 32, and 42) developed a “stop gained“ mutation in TP53 (TP53 G38Ter) at 100% VAF. While both the BRAF and TP53 mutations were present at similar allele fractions in the diagnostic sample (Dx) suggesting a clonal structure, the TP53 variant was more prevalent at autopsy in multiple samples. So while BRAF and TP53 mutations were correlated in the diagnostic sample, the TP53 “stop gained“ developed independently. Sample 28 in spite of showing traces of BRAF V600E did not develop a TP53 mutation and sample 2, which did not contain a BRAF change instead exhibited two different additional putative driver events (FLT4 V1097M and KEAP1 Q282H) which were not observed in any other sample. Finally, Sample 32 also contained a subclonal KRAS L5Q mutation in addition to both BRAF and TP53 mutations ([fig:ca80heatmap]).

The emergence of the TP53 mutations was related to expansion of samples 20, 24, 32, and 42 differentiating them from the adrenal gland (2) and the original site of disease. This very early split and seeding and the low abundance of the putative EGFR driver mutation suggests at very different disease trajectories ([fig:ca80phylo]).

Phylogeny of autopsy samples from patient CA-J; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference.
Phylogeny of autopsy samples from patient CA-J; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference.

Heatmap of driver gene variants in patient CA-J: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.
Heatmap of driver gene variants in patient CA-J: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.

Similar to the short variants, there were some structural variants present in all samples. The inversions on chromosome 12 as well as the co-located break and fusion with the start of chromosome 5 could be observed in all samples, even those with very low tumour purity, but the inversions and fusions of chromosome 7, 8, 9, and 11 could only be seen in the higher purity samples 20, 24, 32, and 42 and most were subclonal, as they only had a median allele frequency of 14.3% (min: 10.3%, max: 99.7%) in all samples. While multiple samples exhibited gene fusions with lung cancer driver and resistance genes like BRAF, FGFR1 and GNAS these fusions were only present at subclonal levels <math display="inline">\leq 10\%</math>.

All samples apart from sample 2 showed whole genome duplication and high polyclonality, which suggests that in addition to the heterogeneity observed through short and structural variants, there was an additional level of heterogeneity of copy number alterations. The lower purity of sample 2 might have been a confounding factor, however both sample 28 and 32 showed lower purities, but a much higher polyclonality and genome duplication. While sample 2 had several minor focal amplifications in PMS2, STK11 and GADD45B, no major copy number amplification was found. All other samples showed amplifications in KRAS (min: 2.9, max: 5.0, median: 4.6), CDK4 (min: 3.2, max: 24.4, median: 21.7) and BRAF (min: 2.1, max: 6.0, median: 3.9) in addition to the highly amplified EGFR (min: 10.6, max: 266.7, median: 197.4) and MET (min: 4.2, max: 6.3, median: 4.6) locus. Both EGFR and MET copy number gain most likely were the resistance mechanism to the initial treatment with the tyrosine kinase inhibitor Erlotinib. ( f@[fig:ca80.2circos]  3.23 f@ and [f% ig:ca80.20circos] , [f% ig:ca80.20circos] f ig:ca80.24circos,fig:ca80.28circos,fig:ca80.32circos,fig:ca80.42circos,@ , [tab:ca80cnv]).

Circos plot of patient CA-J sample 2 with somatic structural variants with allele frequency <math display="inline">\geq 0.1</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-J sample 2 with somatic structural variants with allele frequency <math display="inline">\geq 0.1</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Circos plot of patient CA-J sample 20 with somatic structural variants with allele frequency <math display="inline">\geq 0.1</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-J sample 20 with somatic structural variants with allele frequency <math display="inline">\geq 0.1</math>: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Copy number analysis results for patient CA-J: results are taken from the best fit result of PURPLE; WG: whole genome
Sample number purity ploidy polyclonal % WG duplication
2 0.16 2.18 22.53 False
20 0.39 4.80 43.36
24 0.73 3.70 30.32
28 0.18 3.90 42.28
32 0.25 4.75 48.45
42 0.52 3.35 41.63

Both the somatic variants as well as copy number analysis showed clear signs of a BRAF driven tumour with both BRAF mutations and amplifications as well as amplification of CDK4. However the patient also displayed potential alternative methods of resistance, like KEAP1 and FLT4 mutations which could only be appreciated by analysing multiple sites of the cancer.

Cancer cell fractions of individual mutations of cluster 2 containing TP53 mutations for patient CA-J : each dot represented a distinct variant. Variants were connected with a dotted line to the same variant in other samples. Mutations were clustered with PhylogicNDT;
Cancer cell fractions of individual mutations of cluster 2 containing TP53 mutations for patient CA-J : each dot represented a distinct variant. Variants were connected with a dotted line to the same variant in other samples. Mutations were clustered with PhylogicNDT;

PhylogicNDT analysis revealed a high degree of subclonality, consistent with the results from PURPLE. However, the low tumour purity of samples 2 and 28 lead to unrealistic clustering of variants in these samples. While the TP53 mutation was not found in either sample 2 or sample 28 ([fig:ca80heatmap]), the cluster containing this mutation was assigned a 100% cancer cell fraction overall. As the individual mutations do show multiple substructures in this cluster, for example connecting 2 and 20 at low ccf as well as high, and parameter tuning did not lead to a more granular representation, we considered the results to be low quality and not interpretable ([tab:ca80cnv], [fig:ca80ccfMuts]).

Patient CA-K

This 69 year old was a male patient who presented with multifocal lung adenocarcinoma without distant metastasis. The most PET avid location (left upper lung) was designated the primary site over the two other less avid locations (right upper lung and left lower lung). Initial treatment with the tyrosine kinase inhibitor Gefitinib was stopped when the dominant lung lesion and a hilar node lesion showed signs of progression and he was changed to Afatinib. A CT scan after 50 months showed a mild increase at the primary site, stable disease in satellite nodules and no new metastatic sites. After short treatment with Erlotinib, molecular pathology of a biopsy revealed the acquired EGFR T790M resistance mutation. Enrolment in the CLOVIS trial involving treatment with the EGFR tyrosine kinase inhibitor Rociletinib and chemotherapy was ceased due to disease progression. Biopsy 2 confirmed the T790M mutation and therapy with Osimertinib was started, which led to slowly progressive disease. Due to new intracranial disease, and increase in lung and renal disease treatment was switched to the PD-1 inhibitor Nivolumab, but no remission was achieved and the patient died after 103 months ([fig:ca82timeline]).

Timeline of patient CA-K from diagnosis until death: Diagnostic biopsy detected EGFR L858R positive lung adenocarcinoma; Biopsy 1 after 66 months showed additional EGFR T790M mutation; Biopsy 2 showed no additional variants; one blood sample was taken towards the end of Osimertinib treatment and one second one during PD-1 checkpoint blockade treatment. E: Erlotinib; R: Rociletinib; P: PD-1 inhibitor
Timeline of patient CA-K from diagnosis until death: Diagnostic biopsy detected EGFR L858R positive lung adenocarcinoma; Biopsy 1 after 66 months showed additional EGFR T790M mutation; Biopsy 2 showed no additional variants; one blood sample was taken towards the end of Osimertinib treatment and one second one during PD-1 checkpoint blockade treatment. E: Erlotinib; R: Rociletinib; P: PD-1 inhibitor

Analysis of the plasma sample collected five months prior to the death of the patient with the AVENIO commercial kit revealed an AKT, a KRAS, and several EGFR resistance mutations with high confidence at very low allele frequency. Out of the reported low confidence somatic variants, several more known EGFR resistance alterations could be validated in the autopsy samples. Due to the overall low VAF of found variants, we assumed a low ctDNA fraction of the sample. Nevertheless, the multiple EGFR mutations suggested a high polyclonality of the disease and a purely genomically EGFR driven cancer ([tab:ca82plasma]). While these results suggest that a high degree of the final heterogeneity was already present, the longitudinal distance of the plasma sample to the autopsy samples suggests subsequent evolution, which could also be seen absence of the APC mutation in the plasma ([fig:ca82heatmap]).

Somatic variants found in plasma with AVENIO sequencing for patient CA-K: all high confidence variants are shown; low confidence variants also seen at autopsy in any sample were also selected
Gene Change VAF (%) High confidence Found at autopsy
AKT1 Glu17Lys 0.88 True
KRAS Gly12Val 0.18
Asp761Tyr 0.05
Thr790Met 0.71
Cys797Ser 0.26
Leu858Arg 1.03
Leu718Gln 0.69
Ser720Thr 0.67

At autopsy 18 sites were resected and biobanked and seven high quality representative samples from different organs were selected for WGS ([fig:ca82schematic], [tab:ca82wgsSamples]) and analysed with the standard workflow ([cascade-sec:workflow]).

Schematic of analysed tumour lesions in patient CA-K: Primary diagnostic sample shown in red; All 17 autopsy samples were coloured by organ they were collected from: Brain (6), left lung (4), right lung (4), kidney (3); Additionally to the post mortem blood sample, two serial blood samples were taken ([fig:ca82timeline])
Schematic of analysed tumour lesions in patient CA-K: Primary diagnostic sample shown in red; All 17 autopsy samples were coloured by organ they were collected from: Brain (6), left lung (4), right lung (4), kidney (3); Additionally to the post mortem blood sample, two serial blood samples were taken ([fig:ca82timeline])

Autopsy samples sequenced for patient CA-K: Sample number is the internal sample collection during CASCADE autopsy, the organ of the sample, the fraction of tumour cells from H& E stain and the pathology of the tumour sample. Dx: diagnostic sample
Sample number Organ H&E Type
Dx left lung core 0.8
1 right kidney 0.7
4 right upper lung 0.8
5 right lower lung 0.7
6 right middle lung 0.7
8 left lower lung 0.9
9 left upper lung 0.5
13 left brain 0.5

Joint somatic variant calling on all autopsy samples revealed significant genetic heterogeneity of resistance mechanisms present at each site. While the initial EGFR L858R mutation was still present in all sequenced lesions, the left lower lung sample (8) was the only one with a homozygous variant. All other samples presented with 50% VAF for the activating mutation. Biopsy 1, 66 months after diagnosis, showed the additional EGFR T790M mutation, but at autopsy sample 1 (adrenal gland) did not show evidence of the mutation at all and samples 8, 9, and 13 (left lung and brain) exhibited the variant at subclonal frequencies (<math display="inline"><30\%</math> VAF). Either the mutation was already subclonal at biopsy, or the resistance was outcompeted by a different clone due to the Osimertinib treatment which targets T790M. The adrenal gland lesion, which did not contain the T790M mutation, instead presented with two other clonal EGFR mutations (S720T and L718Q) which are both known resistant mechanisms to Osimertinib (Johnson et al. 2010; Bersanelli et al. 2016).

Additionally the mutations AKT1 G17L and APC E190Ter bifurcated the the autopsy samples into two groups, because the variants were mostly mutually exclusive, where only samples 8 and 9 showed both the stop gained APC mutation at 100% VAF with very low VAF of the AKT1 mutation. Furthermore, several EGFR mutations also showed different spatial clonal abundance. Multiple samples exhibited different C797 substitutions both of which are known resistance mechanisms to Osimertinib (Shuhang Wang et al. 2016; Leonetti et al. 2019): Sample 4 contained the EGFR C797G mutation at 7% VAF whereas samples 6, 8, and 9 had a C797S mutation at 1%, 14%, and 22% VAF respectively. However, while 8 shared the sample protein change (C797S) the genomic change was different to both sample 6 and 9. Only sample 4 contained EGFR L792H as a subclonal mutation at 11% VAF and both sample 1 and 6 contained a subclonal EGFR S720T mutation, another known resistance mechanisms (Johnson et al. 2010; Zhang et al. 2018). Additionally to the adrenal sample both lower and middle lower lung samples contained EGFR L718Q. Lastly, all but sample 8 contain the SMAD4 frameshift mutation at a median cancer cell fraction of 73% (min: 16%, max: 100%) ([fig:ca82heatmap]).

These genomic changes grouped the samples according to their anatomical location, with right kidney (1) and left brain (13) as outliers, but left (8 and 9) and right lung (4, 5, and 6) samples clustering together ([fig:ca82phylo]).

Phylogeny of autopsy samples from patient CA-K; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference
Phylogeny of autopsy samples from patient CA-K; reconstructed with all somatic SNVs and InDels. Ruler symbolises 4000 variants difference

Heatmap of driver gene variants in patient CA-K: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.
Heatmap of driver gene variants in patient CA-K: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.

Similar to the short variants, structural variants and copy number changes also showed a difference between samples 8 and 9 compared to the rest. While all samples showed inversions on chromosome 6, 8 and 9 with fusions between chromosome 1 and 18, chromosome 6 and 8 and chromosome 8 and 9, samples 8 and 9 also showed a fusion of chromosome 3 with 18 and additional inversions on chromosome 18. The additional inversions and haploinsufficiency directly affect SMAD4. These structural changes complemented the “missing“ SMAD4 frameshift mutation in these samples and suggested a key role of SMAD4 in the resistance to treatment.

Samples 8 and 9 were the only samples in the patient with significantly amplified copy numbers resulting in a whole genome duplication, however they still exhibited the same pattern of loss of heterozygosity. In all samples we observed a copy number gain in the q arm of chromosome 1 amplifying both NTKR1 and DDR2 with a loss of heterozygosity for NRAS and MTOR on the p arm of the same chromosome. The loss of heterozygosity presented in all samples on chromosome 3 reduced the representation of TM4SF1, PIK3CA, and USP13, however in both sample 8 and 9 the other chromosome was amplified leading to a copy number neutral area. The loss of heterozygosity on chromosome 5 leads to a haploinsufficiency for APC, PIK3R1, CSF1R, PDGFRB, and FLT4 for all samples, which combined with the stop mutation in samples 8 and 9 is suggestive of a common resistance pathway. The loss of heterozygosity on chromosome 8 affected TUSC3 and FGFR1. No lung cancer driver genes were affected by the loss of heterozygosity on chromosome 15. The seemingly heterozygous loss of chromosome X resulted in the loss of ARAF and AR as a consequence, but the loss must have been subclonal given the sex of the patient was male. Lastly, the additional copy number loss only present in both samples 8 and 9 affected JAK2, CD274, and PDCD1LG2 and these samples also showed EGFR amplification in contrast to all other samples ( f@[fig:ca82.1circos]  3.30 f@ and [f% ig:ca82.4circos] , [f% ig:ca82.4circos] f ig:ca82.5circos,fig:ca82.6circos,fig:ca82.8circos,fig:ca82.9circos,fig:ca82.13circos,@, [tab:ca82cnv]).

Circos plot of patient CA-K sample 1: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-K sample 1: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Circos plot of patient CA-K sample 8: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-K sample 8: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Copy number analysis results for patient CA-K: results are taken from the best fit result of PURPLE; WG: whole genome
Sample number purity ploidy polyclonal % WG duplication
1 0.78 1.84 7.62 False
4 0.48 1.84 4.80 False
5 0.58 1.88 1.02 False
6 0.79 1.86 6.93 False
8 0.30 3.40 6.44 True
9 0.69 3.45 8.32 True
13 0.47 1.90 0.05 False

Cancer cell fraction of mutation clusters of clonal tree for patient CA-K; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca82.clonalTree]
Cancer cell fraction of mutation clusters of clonal tree for patient CA-K; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca82.clonalTree]

Cancer cell fraction of mutation clusters of clonal tree for patient CA-K; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca82.clonalTree]
Cancer cell fraction of mutation clusters of clonal tree for patient CA-K; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca82.clonalTree]

Similar to the phylogeny, the clonal deconvolution with PhylogicNDT revealed a higher complexity of disease after a bottle neck, where the right side lung samples (samples 4, 5, and 6) all presented with a very high prevalence of cluster 11 and followed by clonal diversification after the CNBD1 stop-gained mutation. In contrast samples 1 and 13 (kidney and brain) showed an additional EGFR S720W mutation separating the distant sites from the original lung disease. Finally, the original site of disease in the left lung lobe displayed less diversification, however the early split of cluster 1 into 10 and 3 splitting left lung and all other sites suggests early metastatic seeding ( f@[fig:ca82.clonalTree]  [fig:ca82.clonalTree] @@ and [f% ig:ca82.ccfCluster] , [f% ig:ca82.ccfCluster] @ ).

Patient CA-L

This 68 year old female ex-smoker presented with EGFR mutant NSCLC, however after 12 months of the treatment with the EGFR inhibitor Erlotinib a transformation to small cell lung cancer (SCLC) was detected. While previously it was thought that the different subsets of lung cancers are distinct, more and more evidence is found showing neuroendocrine transformation as a resistance mechanism to targeted therapies not only in lung but also in prostate cancers (Oser et al. 2015; Aggarwal et al. 2018). The treatment was altered to chemotherapy and then PD-1 inhibition, however due to the loss of MHC-I antigen presentation of small cell lung cancer, the tumour failed to respond (Burr et al. 2019) and the patient died after 29 months ([fig:ca86timeline]).

Timeline of patient CA-L from diagnosis until death: Diagnostic biopsy detected EGFR exon 19 deletion positive lung adenocarcinoma; Biopsy after 15 months Erlotinib treatment showed signs of small cell transformation; blood samples were taken at the end of the first Erlotinib treatment, during the chemotherapy treatment and 28 and 29 months after the initial diagnosis.
Timeline of patient CA-L from diagnosis until death: Diagnostic biopsy detected EGFR exon 19 deletion positive lung adenocarcinoma; Biopsy after 15 months Erlotinib treatment showed signs of small cell transformation; blood samples were taken at the end of the first Erlotinib treatment, during the chemotherapy treatment and 28 and 29 months after the initial diagnosis.

During autopsy 25 lesions were resected and biobanked and representative samples, both adeno- and small cell carcinoma according to histology, were selected for WES ([fig:ca86schematic], [tab:ca86wesSamples]) and analysed with the standard workflow ([cascade-sec:workflow]). For granular analysis of the transition from adeno- to small cell carcinoma, the progression sample after 15 months (P) was dissected to the individual types based on histology staining.

Schematic of analysed tumour lesions in patient CA-L: Primary diagnostic sample shown in red; Samples are coloured by organ they were collected from: left lung (6), right lung (9), sternum (1), diaphragm (2), liver (3), adrenal gland (1) kidney (2), anal (1); Additionally to the post mortem blood sample, five serial blood samples were taken ([fig:ca86timeline])
Schematic of analysed tumour lesions in patient CA-L: Primary diagnostic sample shown in red; Samples are coloured by organ they were collected from: left lung (6), right lung (9), sternum (1), diaphragm (2), liver (3), adrenal gland (1) kidney (2), anal (1); Additionally to the post mortem blood sample, five serial blood samples were taken ([fig:ca86timeline])

Autopsy samples sequenced for patient CA-L: Sample number is the internal sample collection during CASCADE autopsy, the organ of the sample, the fraction of tumour cells from H& E stain and the pathology of the tumour sample. P.1/2: micro-dissected progression biopsy (80% small cell, 20% adeno)
Sample number Organ H&E Type
P.1 right lung core small cell
P.2 right lung core adenocarcinoma
8 right upper lung 0.9 small cell
17A left lower lung - poorly differentiated adeno
26 right kidney 1 adenocarcinoma

Even though not all samples had changed from adeno- to small cell carcinoma, all samples showed the TP53 “stop gained“ mutation at 100% VAF, showing that the TP53 mutation was not sufficient for the histological transformation (Offin et al. 2019). Unsurprisingly, the samples that remained adeno (17A and 26) show a higher dependency on EGFR which led to higher clonal abundance of the initial EGFR exon 19 deletion and a subsequently higher VAF of EGFR T790M, while the small cell transformed lung sample (8) acquired a secondary TP53 M40I mutation. No additional variants were close to clonal representation ([fig:ca86heatmap]).

Surprisingly even though the samples were taken at different times (one at progression and one at autopsy) the small cell transformed samples P.1 and 8 were evolutionarily more closely related and shared more variants, than to the sample P.2 which was taken at the same time. In contrast, the two adenocarcinoma samples taken at autopsy were clustered together. Additionally, the split of sites of small cell transformation and adenocarcinoma already had happened before the progression sample and the small cell transformed samples appeared to be evolutionary different from samples 17A and 26. In general, the phylogeny suggested the presence of at least 3 distinct trajectories: one giving rise to the adeno sample at progression (P.2), one resulting in the small cell sample P.1 and its longitudinal successor sample 8 and lastly the two adenocarcinoma samples 17A and 26 ([fig:ca86phylo]).

Phylogeny of samples from patient CA-L; reconstructed with all somatic SNVs and InDels. Ruler symbolises 2000 variants difference.
Phylogeny of samples from patient CA-L; reconstructed with all somatic SNVs and InDels. Ruler symbolises 2000 variants difference.

Heatmap of driver gene variants in patient CA-L: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.
Heatmap of driver gene variants in patient CA-L: Protein altering mutations are highlighted with their HGVSp notation; non protein altering mutations are grouped as “other“.

Copy number analysis with sequenza revealed a high prevalence of loss of heterozygosity in all samples,but both sample P.2 and 8 showed almost no copy number gains on chromosome 9 and 10 with sample 8 even extending through to chromosome 12. In general small cell transformed samples showed a higher level of copy number gain than the original adenocarcinoma. The difference in copy number in the two spatially intertwined types of cancers can only be attributed to the small cell transformation. Additionally to the increased overall ploidy of the small cell sample P.1 over P.2 ([tab:ca86cnv]), P.2 also lost chromosome X completely ([fig:ca86.p1circos] vs. [fig:ca86.p2circos]). Interestingly, the small cell samples still had the same high amplification level of EGFR seen in the adenocarcinoma samples (min: 6 max: 13) suggesting the transformation retained EGFR based signalling. While commonly small cell transformation is associated with RB1 loss, the locus was amplified in all samples with a loss of heterozygosity. As this patient’s sequencing was restricted to exonic regions, we could not rule out a regulatory defect. Interestingly, while the small cell transformed part of the progression sample (P.1) showed a heterozygous loss of chromosome X, the adenocarcinoma part (P.1) showed an almost complete loss of chromosome X, which could not be observed in any of the autopsy samples, which instead showed and amplification. This indicated, that the small cell transformation happened at multiple sites instead of being spread after the transformation ( f@[fig:ca86.p1circos]  3.38 f@ and [f% ig:ca86.p2circos] , [f% ig:ca86.p2circos] f ig:ca86.8circos,fig:ca86.17Acircos,fig:ca86.26circos,@), .

Circos plot of patient CA-L sample P.1: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-L sample P.1: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Circos plot of patient CA-L sample P.2: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.
Circos plot of patient CA-L sample P.2: outer first ring shows the canonical chromosomes with gaps (centromere, heterochromatin,...) highlighted as darker areas; second ring visualises all somatic SNVs corrected for tumour purity and scaled from 0 to 1, the colour representing the base change of SNV like in Ludmil B. Alexandrov et al. (2013); vertical lines directly under the SNVs symbolise InDels, with yellow for insertions and red for deletions; the third ring shows the total copy number alterations, with green showing a copy number gain and red a loss, dots at the outer border show a copy number greater than four; the last ring shows the minor copy number, with blue depicting a gain and orange a loss, this ring allows the detection of copy number neutral changes, like loss of heterozygosity; the center shows all structural variants: translocations in blue, deletions in red, insertions in yellow, tandem duplications in green and inversions in black.

Copy number analysis results for patient CA-L: results are taken from the best fit result of sequenza
Sample number purity ploidy WG duplication
P.1 0.86 3.3 True
P.2 0.27 2.1 False
8 0.96 3.1 True
17A 0.18 4.2 True
26 0.28 3.7 True

Clonal evolutionary tree of patient CA-L; Highest support tree for clustered ccf clones generated with PhylogicNDT; Support for clone is shown in parenthesis; Major driver alterations of cluster were annotated; Clusters with less than 5 supporting variants were discarded.
Clonal evolutionary tree of patient CA-L; Highest support tree for clustered ccf clones generated with PhylogicNDT; Support for clone is shown in parenthesis; Major driver alterations of cluster were annotated; Clusters with less than 5 supporting variants were discarded.

Cancer cell fraction of mutation clusters of clonal tree for patient CA-L; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca86.clonalTree]
Cancer cell fraction of mutation clusters of clonal tree for patient CA-L; transparent polygons show the 95% confidence intervals. Clusters and cluster colours are taken from [fig:ca86.clonalTree]

Clonal deconvolution with PhylogicNDT showed a split into clusters 6 and 2 from the initial clone with both the initiating EGFR deletion and the TP53 “stop gained“ mutation. While cluster 6 is present in all samples, the small cell transformed samples P.1 and 8 show a much lower cancer cell fraction, suggesting a correlation. Cluster 2 was then split again in to cluster 4 and a successive evolution of cluster 7 to cluster 30. While cluster 4 shows a similar pattern to cluster 6, the high variability of confidence intervals made assessment challenging. Cluster 4 and 6 could possibly be resolved into the same clone with deeper sequencing. Finally cluster 5 and 13, the progeny of cluster 30 allowed the perfect split of samples into small cell transformed and adenocarcinoma samples. Samples P.2, 17A, and 26, where the presence of cluster 13 could be observed were adenocarcinoma. However, the presence of cluster 5, which was only present in the small cell samples at autopsy suggested an incomplete micro-dissection of the progression sample ( f@[fig:ca86.clonalTree]  3.40 @@ and [f% ig:ca86.ccfCluster] , [f% ig:ca86.ccfCluster] @ ). Surprisingly, there was no known genetic determinant found, which explained the split into EGFR T790M positive and small cell transformed lung cancer. In contrast, it is likely that the priming for for transformation or remaining adenocarcinoma was epigenetic (Fennell et al. 2021; Suva, Riggi, and Bernstein 2013).

Mitochondrial phylogenetic reconstruction - the power house of the phylogenies

While phylogenetic reconstruction is a well established method for genetic variants from canonical chromosomes to study metastatic progression and timing of evolutionary divergence (Deshwar et al. 2015; Brown et al. 2017; Hu et al. 2019), there are multiple issues. In [variantcalling-sec:phylo] and [variantcalling-sec:clonal] we showed how important the proper variant calling method is to accurately recover phylogenies and clonal patterns. In addition, using somatic variants to reconstruct phylogenies is a flawed concept to begin with.

Most models studying genetic variation assume neutral evolution of the DNA loci (Kimura 1968; M. Lynch 1989), but cancers almost exclusively exhibit positive selection (Cannataro and Townsend 2018). And while passenger mutations might not directly affect fitness of the cell, they only exist due to the link to the driver mutation and therefore provide little to no additional information gain in addition to the driver. Furthermore, while in small populations genetic drift as a stochastic process overpowers selective processes (fitness coefficient <math display="inline">s</math>) and can therefore be assumed to be neutral, in larger populations <math display="inline">N_e</math> (effective population size) where [mmf-eq:neutralSelection] does not hold true, mutations are under selective pressure (Eyre-Walker and Keightley 2007). <math display="block">N_e \cdot s \ll 1 \label{mmf-eq:neutralSelection}</math>

In summary, we can assume that with cancer cell growth, positive selection through treatment and tumour micro environmental niches, almost all assumptions of the coalescent theory (Kingman 1982) are not applicable for tumour samples and therefore methods using somatic variants and their respective results need to be selected and evaluated carefully.

To tackle this issue, and assist with the interpretation of phylogenetic reconstruction results, we adjusted a method used in single cell sequencing to track clonal expansion with mitochondrial somatic mutations (Ludwig et al. 2019) to be usable for standard bulk sequencing. Mitochondrial variants are an ideal source of clonality information, because the mutation rate is significantly higher than nuclear DNA, due to the missing proof reading and repair mechanisms, which allows very granular separation in a shorter time period. Additionally, while there are several diseases caused by defects in mitochondria such as Kearns-Sayre syndrome (Harvey and Barnett 1992), MERRF (Adam et al. 1993) and MELAS (Hirano et al. 1992), they usually follow a mendelian inheritance pattern and are hereditary and not somatically acquired. In the cancer context, somatic mutations in mitochondrial DNA are assumed to be approximately neutral with a possible selection pressure towards healthy ageing and negative selection in cancer (Rodell et al. 2013; Yuan et al. 2020).

Method

First a pileup of all mitochondrial positions was performed. Before the pileup we preselected reads which uniquely mapped to the mitochondrial genome and only retained high mapping quality reads. Then the nucleotide counts in each position were transformed into a MultiEssayExperiment (Ramos et al. 2017) for final analysis in R. The preprocessing code can be found in [lst-cascadeAppendix:mitoPreProcessing].

The final MultiEssayExperiment is then read into R and quality metrics applied to exclude samples with not enough coverage on the mitochondrial contig. Our analysed WGS samples showed an extensive coverage of mitochondrial DNA, however WES library preparation procedures might restrict coverage. Patient CA-I had a coverage of more than 100x for all but the germline sample which only had an overall coverage of 17x. Similarly, patient CA-L showed lower depth for the germline sample (127x) but a generally high coverage for all tumour samples (mean: 543x, min: 138x). All other Patients (CA-A/J/K) where samples were sequenced with WGS exhibited a coverage of more than 200 even for low performing samples with a median depth of 67916, 45603, and 49726 per sample ([fig:mtCoverage]).

Average coverage of mitochondrial DNA of CASCADE patients: Orange squares show germline sample for each patient; black points show tumour samples; horizontal red dotted line shows quality cut off suggested by Ludwig et al. (2019)
Average coverage of mitochondrial DNA of CASCADE patients: Orange squares show germline sample for each patient; black points show tumour samples; horizontal red dotted line shows quality cut off suggested by Ludwig et al. (2019)

This proved, that even without specifically enriching for mitochondrial DNA, most samples will contained enough tumour reads for this analysis.

To ensure optimal results, we excluded all samples with an average coverage of less than 50x. This means we removed the germline sample for patient CA-I, however as we expected the germline sample to be the ancestral state for all samples, this has virtually no effect on the reconstruction procedure. Additionally, we were more interested in the relationships between the tumour sample, which was still accessible even with the removed germline sample.

In contrast to the simple Hamming distance used for the presence-absence vector representation of canonical somatic variants ([cascade-sec:phylo]), for mitochondrial variants we employed an allele frequency (<math display="inline">vaf</math>) based distance ([eq:mitoDist]) of two samples <math display="inline">s_i</math> and <math display="inline">s_j</math>. The difference in read support was normalised with the product of the total allelic depth <math display="inline">cov</math> and summed up at all sites of variation <math display="inline">v</math>.

<math display="block">mitoDist(s_i,s_j) = \sum_{v \in Variants} \left| \frac{vaf_{s_i}(v) \cdot cov_{s_i}(v) - vaf_{s_j}(v) \cdot cov_{s_j}(v)}{cov_{s_i}(v) \cdot cov_{s_j}(v)} \right| \label{eq:mitoDist}</math>

This distance was only calculated for variant sites where both samples had at least a coverage of 100x to have a representative sampling of the allelic prevalence in each sample, as a human cell usually has more than 100 mitochondria (Cole 2016).

Results

While the mitochondrial variants analysis only used a fraction of the size of the genomic DNA loci and therefore most likely violates the infinite sites assumption (Kimura 1969), it was still able to generate an orthogonal view of the heterogeneity and trajectory of the multi-regional samples in each patient.

Patient CA-A

While the separation of progression (11, 47, 55, and 59) and stable (26, 31, 41, and 57) disease sites was already visible in the somatic phylogeny, the bottle neck of treatment and new metastasis is more obvious in the mitochondrial phylogeny. However the individual resolution of splits appeared to be lower for the mitochondrial reconstruction, as seen in [fig:CA99mitoPhylo].

Mitochondrial and somatic phylogenetic reconstruction of CA-A: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)
Mitochondrial and somatic phylogenetic reconstruction of CA-A: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)

Patient CA-I

Neither the somatic variants nor the mitochondrial variants resolved the evolutionary trajectory in a granular fashion. The slightly longer stem of shared variants in the mitochondrial phylogeny was most likely due to the low coverage of the germline sample. Similar to all other patients, the substructure of the samples was changed. While using the somatic variants showed sample 566 as the closest to the germline sample, the mitochondrial variant phylogeny instead indicated sample 559 as the closest ( f@[fig:mtCoverage]  3.42 @@ and [f% ig:CA51mitoPhylo] , [f% ig:CA51mitoPhylo] @ ).

Mitochondrial and somatic phylogenetic reconstruction of CA-I: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)
Mitochondrial and somatic phylogenetic reconstruction of CA-I: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)

Patient CA-J

The mitochondrial reconstruction presented sample 2 as a member of the more genetically complex samples 20, 24, 32, and 42 in spite of the missing TP53 mutation. Sample 28 on the other hand, which showed almost no evolutionary distance to the normal sample in the somatic analysis presented as a substantial outlier. This shows that the TP53 mutation of samples 20, 24, 32, and 42 was likely acquired after the seeding of the distant sites like sample 2 in the adrenal gland. The difference in distance to the normal sample fore sample 28 was likely due to a “cold“ primary site of disease with little cell proliferation, which however still accumulated mitochondrial mutations (Abascal et al. 2021) ([tab:ca80cnv], [fig:CA80mitoPhylo]).

Mitochondrial and somatic phylogenetic reconstruction of CA-J: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)
Mitochondrial and somatic phylogenetic reconstruction of CA-J: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)

Patient CA-K

In contrast to the somatic variant phylogeny, which showed an outgroup of samples 8 and 9, with a second cluster of samples 4, 5, and 6, the mitochondrial data supported a different split into two groups. These groups almost perfectly bifurcated the samples into those derived from the left and right sided disease sites, with sample 6 being the only sample from the right side clustered with the left lung and brain samples 8, 9, and 13. These data suggested that while only samples 8 and 9 showed a whole genome duplication and the APC “stop gained“ mutation, they were more closely related to the other samples than assumed from the somatic variant analysis and probably were seeded by the same cells ([tab:ca82cnv], f@[fig:ca82heatmap]  3.29 @@ and [f% ig:CA82mitoPhylo] , [f% ig:CA82mitoPhylo] @ ).

Mitochondrial and somatic phylogenetic reconstruction of CA-K: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)
Mitochondrial and somatic phylogenetic reconstruction of CA-K: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)

Patient CA-L

While the somatic variants linked the small cell carcinoma samples P.1 and 8 together, the mitochondrial analysis showed that the closest relative to P.1 was P.2. As both of the progression samples were taken 14 months ahead of the death of the patient, this agreed with the clinical history of the samples better. Additionally, instead of grouping the the adenocarcinoma sample 17A and 26 together, the mitochondrial phylogeny suggested, that while they share a common resistance mechanism (EGFR T790M), it might have been acquired in parallel instead of being seeded from the same lesion, as all samples other than the P.1/2 samples were not grouped together. Lastly, the closeness of sample 8 and the germline sample possibly indicated the presence of small cell disease already “before“ the progression samples were collected. However, the FFPE conservation of the P samples could have altered the molecular clock and influenced the branching site on the tree ( f@[fig:ca86heatmap]  3.37 @@ and [f% ig:CA86mitoPhylo] , [f% ig:CA86mitoPhylo] @ ).

Mitochondrial and somatic phylogenetic reconstruction of CA-L: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)
Mitochondrial and somatic phylogenetic reconstruction of CA-L: Somatic variants based reconstruction (A) and mitochondrial variants based reconstruction (B)

Summary

With the analysis of the mitochondrial history of samples, we could shed some light on the timing of lesions and the development of resistance mechanisms, which is not heavily influenced by the treatment and its selection pressure. While the infinite sites hypothesis does not hold true for mitochondrial DNA, due to the limited sites and reduced repair mechanisms, the selection pressure of treatment and their resistance mechanisms parallel evolution bias in the analysis of multiple related tumour samples also violates multiple assumptions for phylogenetic reconstruction when using somatic variants.

Neither options come without pitfalls and caveats, but this method could offer an alternate view on the history and seeding time of lesions and their kinship using data that was previously discarded but was abundantly available. This approach is also available at virtually no additional cost, as mitochondrial variants can be readily detected from standard WGS and WES. Ideally both somatic variants and mitochondrial variants would be integrated into a holistic approach, however due to the substantial difference in scale between nuclear DNA and mitochondrial DNA, the process is intricate and outside the scope of this work.

Cohort level analysis

Even though there were only five lung cancer patients who have participated in the CASCADE program to date, each of the patients had a high number of samples analysed, revealing the complexity of intra-patient heterogeneity in NSCLC ([cascade-sec:patientLevel]). Despite this heterogeneity, there were several parallels between the patients showing similarities in disease trajectories. The process of small cell transformation (SCT) is still significantly under explored and understood, due to the rarity of the transformation as well as the lower overall survival in comparison to other resistance mechanisms. In the following section, the patients who developed small cell carcinoma transformation were compared and contrasted with the adenocarcinoma cases to further explore this mechanism of resistance.

The generally accepted hallmarks of SCT, apart from the histological changes such as high MKI67 expression and down regulation of major histocompatibility complex I and II, are a much higher mortality, a high prevalence of FHIT and MAD1L1 deletions or loss, and TP53 and RB1 mutations (Meerbeeck, Fennell, and De Ruysscher 2011; Raso, Bota-Rabassedas, and Wistuba 2021). However, while in patient CA-L all samples showed a TP53 “stop gained“ mutation, patient CA-I’s transformation did not occur in the setting of TP53 loss. Additionally, neither of the patients presented with a RB1, FHIT, or MAD1L1 loss ( f@[fig:ca51heatmap]  3.13 @@ and [f% ig:ca86heatmap] , [f% ig:ca86heatmap] @ ).

In agreement with recent literature showing whole genome doubling (WGD) for SCLC (Zhou et al. 2021), we observed chromosomal arm amplification in patient CA-I and full WGD for patient CA-L. However, all NSCLCs patient also showed at least one sample with complete WGD, casting doubt on WGD being a distinguishing feature of SCT ( t@[tab:ca99cnv]  3.2 t@ and [t% ab:ca51cnv] , [t% ab:ca51cnv] t ab:ca80cnv,tab:ca82cnv,tab:ca86cnv,@). Additionally, the overall loss of heterozygosity could be seen in both NSCLC and SCLC and this seems to be a general feature of late stage lung cancers rather than NSCLC ([fig:cascadeLOH]) suggesting that copy number alterations are not the main drivers of SCT.

Percentage of LOH per chromosome in CASCADE patients: Per chromosome violin plots are shown in grey with median as white dot; individual percentages per sample are displayed grouped with colour per patient; LOH was called with a major copy number <0.6 and a major copy number > 0.6
Percentage of LOH per chromosome in CASCADE patients: Per chromosome violin plots are shown in grey with median as white dot; individual percentages per sample are displayed grouped with colour per patient; LOH was called with a major copy number <0.6 and a major copy number > 0.6

The most prominent difference of NSCLC and SCLC in our patients was the reconstructed phylogeny. While the NSCLC showed substructure and meaningful evolutionary splits, the SCLC patients phylogenies showed a distinct “star shaped“ pattern. Each sample branch was very close to the others and with substantial amounts of private variants in each sample ( f@[fig:CA51mitoPhylo]  3.44 @@ and [f% ig:CA86mitoPhylo] , [f% ig:CA86mitoPhylo] @ vs. f@[fig:CA99mitoPhylo]  3.43 f@ and [f% ig:CA80mitoPhylo] , [f% ig:CA80mitoPhylo] f ig:CA82mitoPhylo,@). Even though the shared variants in CA-L seemed to provide the ability to transform, they do not necessitate the transformation, as both samples CA-L 17A and 26 remained NSCLC with virtually no known genetic determinant of status. This in term suggested either a currently undetected genetic determinant or potential epigenetic regulation to explain the SCT ([fig:ca86heatmap]).

In contrast to CA-L, who presented with both NSCLC and SCLC, CA-I’s samples were completely transformed. The biopsied adenocarcinoma, which already had an MHC-I disrupting mutation was completely out-competed by a secondary clone, which did not present with any additional genetic driver alterations. Similar to patient CA-L this suggested, that the genetic prerequisites for SCT were already present in the clonal population, but not sufficient to drive transformation ( f@[fig:ca51.clonalTree]  3.16 @@ and [f% ig:ca51.ccfCluster] , [f% ig:ca51.ccfCluster] @ ).

Gene fusions or regulatory genetic variants leading to aberrant transcription could have been the cause for this phenomenon observed in both SCT cases. These could be detected in RNA sequencing of the biobanked fresh autopsy samples to exclude genetic causes which were not picked up with the performed WES, or detect transcription alterations. However, this analysis is outside the scope of this work.

Outlook

In this chapter we described both the high inter and intra patient heterogeneity of late stage lung cancer patients and showed that mitochondrial variant based phylogeny could help to resolve the sample relationships in the context of selection pressure through treatment. Additionally we uncovered a different disease trajectory for cases showing SCT where the cancers appeared genetically primed for SCT but genetic primed, but genetic alterations alone were on sufficient to drive transformation. This suggests that genetic analysis alone will not allow the prediction or early detection of SCT as a resistance mechanism. This uncoupling of genetic evolution and disease histology was also reported recently in a study focusing on multi-region analysis of treatment naïve SCLC cases (Zhou et al. 2021).

While there exist several multi-region lung cancer studies up to date, their focus was on early stage and treatment naïve disease (Jamal-Hanjani et al. 2017; Leong et al. 2018). While these studies showed ubiquitous intra-tumour heterogeneity and copy number alterations in early stage disease, there is an unmet need for the assessment of late stage lung cancers.

With this work we took a first step towards understanding and measuring the heterogeneity of tumour samples and treatment resistance mechanisms in late stage NSCLC, but many unanswered questions remain. The epigenetic marks and their inheritance patterns in cancer are a massive unexplored field which increases the heterogeneity of cancer even more (Easwaran, Tsai, and Baylin 2014). Additionally, we could only hypothesise and reconstruct the longitudinal trajectory of the disease from autopsy samples. The next step to validate and further explore these findings would be to analyse temporally spaced samples from the same disease.

“When the sum is already greater than the parts, there is room to make it greater still.“

MisMatchFinder - detection of mutational signatures from low coverage WGS

Introduction

Early researchers and physicians realised that cancers can have different morphologies and clinical progression depending on the primary occurrence of the tumour ([intro-sec:cancer]), with the extensive sequencing of cancer specimens over the last two decades, the mutational signatures of cancers came into focus. These signatures are specific and characteristic combinations of mutations, which stem from distinct biological processes. These processes include exposure to DNA damaging agents like chemotherapy treatment, tobacco and UV radiation, and biological intrinsic pathway errors in DNA-replication or -repair. As each of those processes has a more or less distinct profile of mutations (Hollstein et al. 1991; Kucab et al. 2019) the analysis and deconvolution of the signatures contributing to a patients mutational landscape can help to diagnose and treat a patient. While many signatures occur at a background level and are related to “normal“ cellular processes like ageing (Ludmil B. Alexandrov et al. 2013), others can point to defective mismatch repair or gain of function mutations in specific pathways, which can lead to new avenues of therapy for a patient (ONeil, Bailey, and Hieter 2017).

Supplementary information and plots for this chapter are attached in the appendix and prepended with 9.

Mutational signature analysis

Traditionally cancer mutational signature analysis entails a somatic variant calling process ([intro-sec:variantcalling]) followed by a counting and deconstruction step, which assigns weights to the individual signatures. These signatures are a precompiled list of mutation count relations ([fig:sig7a]). While individual SNPs already contains valuable information, there is an improvement in signature granularity when also counting the base up and downstream of the nucleotide change. This expands the feature space of counts from the six base classes of SNPs (C>A, C>T, C>G, T>C, T>A, and T>G) to 96 unique trinucleotide contexts (Ludmil B. Alexandrov et al. 2013). While there technically are six more base changes and several more trinucleotide contexts combinatorially possible, they can be collapsed into the afore mentioned 96 by using the reverse complement of the change.

Trinculeotide count contributions for SBS signature 7a (UV exposure); values taken from Ludmil B. Alexandrov et al. (2020)
Trinculeotide count contributions for SBS signature 7a (UV exposure); values taken from Ludmil B. Alexandrov et al. (2020)

Additionally to the single base substitution (SBS) there exists doublet base substitution signatures (DBS) and InDel signatures for somatic mutations of cancers (Ludmil B. Alexandrov et al. 2020), which are all based on the same principle and enable a higher precision for stratification of similar cancer subtypes and DNA damaging agents.

Restrictions and pitfalls of standard signature analysis

Especially for cancer samples, the focus when analysing mutational signatures, is on somatic variants of the sample. This requires deep sequencing of the tumour sample with at least WES or WGS. For optimal results, a matched germline sample for tumour-normal variant calling is required ([intro-sec:somaticcalling]). This means the cost of the assay is surprisingly high for the diffuse result of signature contributions of the variants in the sample. This is especially relevant when it comes to clinical diagnostic tools, where every biopsy of the patient is precious and not easily obtained. Additionally, the matched germline sample might not be available. For an analysis, which is based on the averaged and aggregated somatic variants, to require a high quality input could be seen as counter-intuitive. Especially, as the current gold standard analysis will report signatures, even if there are virtually no variants reported. We therefore developed a method which can be adapted for low coverage whole genome sequencing and requires no prior knowledge of the cancer or a germline sample.

Overview

This chapter describes a newly developed method, which allowed the detection of somatic signatures from low coverage WGS of cfDNA. This method, with further optimisation and validation, has the potential to provide a novek approach for non-invasive monitoring of patients and screening of at-risk individuals in a clinical setting.

Methods

With the change from a variant focused approach to a read based method, our method MisMatchFinder analysed “mismatches“ of a read from the reference genome, rather than a genomic locus. This had the advantage of not requiring a matched normal and could theoretically be used for virtually any sequencing data source like TAS, WES, WGS or even RNA sequencing. However, it also meant, that the error suppression methods, which are usually used by variant calling methods like read position ranks sum (RPRS) or strand bias were not applicable. To solve the problem of high background noise we developed multiple filtering steps, which are presented in the following sections.

Mathematical concept

A mismatch in this work was considered as any position in an aligned read, which did not show the same base as the reference at the aligned position. The mismatch inherited all the metrics of the read such as mapping quality, base quality and read position.

Ultimately, there were three sources of mismatches in a read, which are somatic variants, germline variants and sequencing errors ([mmf-eq:1]). <math display="block">n(mismatches) = n(somatic~var.) + n(germline~var.) + n(seq.~ error) \label{mmf-eq:1}</math>

With the sequencing error being a function of the sequencing machine and chemistry, the error rate was assumed to be stable, almost constant, when using the same type of sequencing machine and chemistry (Schirmer et al. 2016; Stoler and Nekrutenko 2021). We therefore reduced [mmf-eq:1] to <math display="block">n(mismatches) = n(som.~var.) + n(germ.~var.) + c_{seq.~err.} \label{mmf-eq:2}</math>

Secondly, the number of germline variants was assumed to be approximately the same between two people (Auton et al. 2015), which again simplified [mmf-eq:2]:

<math display="block">n(mismatches) = n(som.~var.) + c_{germ.~var.} + c_{seq. err.} \label{mmf-eq:3}</math>

Of course, [mmf-eq:3] was a crude estimate and instead the constants exhibited variability and were not real constants. To better approximate the inherent variableness of sequencing error and number of germline variants, we instead used Gaussian distributions

<math display="block">n(mismatches) = n(som.~var.) + \mathcal{N}(\mu_{germ.~var.}, \sigma_{germ.~var.}^{2}) + \mathcal{N}(\mu_{seq.~err.}, \sigma_{seq.~err.}^{2}) \label{mmf-eq:4}</math>

However, both [mmf-eq:3] and [mmf-eq:4] allowed the same conclusion, that with small enough values for either <math display="inline">c_{germ.~var}/c_{seq.~err.}</math> or <math display="inline">\mu_{germ.~var}/\mu_{seq.~err.}</math> and <math display="inline">\sigma_{germ.~var}/\sigma_{seq.~err.}</math> respectively, there exists a linear correlation between the amount of mismatches on a read and the somatic variants it contained:

<math display="block">n(mismatches) \sim n(som.~var.) \label{mmf-eq:final}</math>

With the help of [mmf-eq:final] we were theoretically able to approximate tumour mutational burden and signatures from individual reads with MisMatchFinder. The method therefore was independent of read depth and required no matched normal sample for somatic variant calling.

Data preprocessing

As MisMatchFinder employs multiple internal measures to filter and process sequencing data, the steps used for preprocessing were minimal: The reads were aligned to the GRCh38 reference genome ([intro-sec:mapping]). For optimal mapping and additional noise reduction, paired end sequencing of at least 75 bp is required to ensure a few bases overlap on the standard fragment length of less than 155bp of ctDNA ([intro-sec:ctDNA]).

All datasets in this work were sequenced with 100bp paired end, aligned with BWA to GRCh38 and optical duplicates were marked with Picard unless further specified.

Mismatch detection

In contrast to conventional variant calling approaches, which find regions of interest through pileups (position wise) and then realign reads in the surrounding area, to accurately estimate the most likely event that led to the observed haplotype ([intro-sec:variantcalling]), with MisMatchFinder we evaluated every individual read pair as a separate entity to fully span the heterogeneity of all cells and their genetic background. The “MD“- and “CIGAR“-tag of sequencing reads from the preprocessed BAM file were used to reconstruct the sequence of the read and the positions, where the read showed a different base than the reference. These potential mismatch sites were then filtered in multiple steps to reduce the impact of both germline variants as well as sequencing errors, to ensure the assumptions of [mmf-sec:concept] held true.

Filtering steps

Apart from the standard filters, which most variant callers employ, like mapping quality (MQ) and base quality (BQ), which were used to ignore reads as well as read positions respectively, the method also internally filters out common sequencing errors next to homopolymer regions (Heydari et al. 2019). While we set default cutoffs, for optimal performance on our data (MQ=20, BQ=55, homopolyLength=5), the program allows the user to adjust them to their liking. This is also possible for both the region of interest (ROI) bed-file which was used to restrict the analysis to only highly mappable regions of the genome ([ch-mmfAppendix:bedfiles]), as well as for multiple other parameters. These include minimum average base quality, minimum and maximum number of mismatches per read and/or fragment, and the minimum and maximum length of a fragment (Hudecova et al. 2021). If any of these values were not within the specified range, the read was discarded from the analysis. No read flagged as secondary or duplicate was included in the analysis.

Consensus reads - what happens when the sequencer isn’t sure

When paired end sequencing of ctDNA is analysed, the fraction of fragments where reads overlap is higher, than with “normal“ tissue based sequencing, due to the shorter fragment length of ctDNA ([intro-sec:ctDNA]). This allowed a fragment internal consensus calculation, by adjusting for differences between forward and reverse reads. In many variant calling methods, these differences are used by measuring the “strand bias“ (Guo et al. 2012; Saunders et al. 2012; GATK Team 2019) or “strand balance probability“ (Garrison and Marth 2012) by looking at a specific locus and evaluating the discrepancy of all forward and all reverse reads. As our method examined each read/fragment independently, the bias could not be calculated, however in the overlapping region of both reads, a consensus was generated. If both reads agreed on the mismatch, the BQ of both reads were summed to emphasise the increased evidence for this variant. In contrast, if they disagreed the base of the higher quality was used and its quality was decreased by half of the BQ of the lower quality base ([fig:mmf-consensus] bottom). To increase the stringency of the method, the user can also enable the ‘–strictOverlap’ option, which will only consider a mismatch, if both reads agree with each other. As we were only interested in mismatches from the reference, all positions where both agree with the reference were irrelevant for the analysis and were discarded ([fig:mmf-consensus] top). For the most stringent analysis, MisMatchFinder can additionally be configured to only use mismatches in the overlap part of a fragment (‘–onlyOverlap’), which significantly reduced the number of sequencing errors which were retained in the final analysis ([mmf-sec:cleanSim]).

Schematic of consensus computation method for overlapping reads in MisMatchFinder; Read 1 and Read 2 depict two overlapping paired end reads aligned to the reference sequence; Positions in the overlap are numbered for later referral; Read positions agreeing with the reference are coloured black, positions differing from the reference but agreeing in both reads are coloured purple (position 3) and differences between reads are coloured in the respective read colours (blue and red, position 6); Calculation for the resulting base quality (BQ<math display="inline">_{cons}</math> for each possibility is shown as formulas)
Schematic of consensus computation method for overlapping reads in MisMatchFinder; Read 1 and Read 2 depict two overlapping paired end reads aligned to the reference sequence; Positions in the overlap are numbered for later referral; Read positions agreeing with the reference are coloured black, positions differing from the reference but agreeing in both reads are coloured purple (position 3) and differences between reads are coloured in the respective read colours (blue and red, position 6); Calculation for the resulting base quality (BQ<math display="inline">_{cons}</math> for each possibility is shown as formulas)

This option however also reduced the available data by restricting the analysis to areas where reads were overlapping. Due to the fragment size distribution of ctDNA a paired end sequencing with 100bp read length will statistically in most cases lead to an overlap of at least 45 nucleotides ([intro-sec:ctDNA]) and with 150bp read length most ctDNA fragments will be almost entirely covered by both reads. However due to soft-clipping and incomplete alignment, this number was lower in reality. In our tests, the restriction led to approximately 18 nucleotides (min: 14bp max: 25bp std.dev.: 1.45) being retained in the analysis out of 100. Nevertheless, this meant that with a read depth of 8-10x <math display="inline">\approx</math>80% of the genome was covered by the overlap of at least one read pair.

Germline filtering - exclusion of normal variation

To further ensure the [mmf-sec:concept] assumption that the germline is a very small constant, we needed to remove as many mismatches as possible, which originated from germline variants. For this purpose, we built a custom zarr (Miles et al. 2021) based storage system from the gnomAD database (v.3.1) (Karczewski et al. 2020) using scikit-alel (Murillo R.; Peter Ralph; Nick Harding; Rahul Pisupati; Summer Rae; Tim Millar 2021). An in-depth explanation of the generation as well as a script for an end user can be found in [ch-mmfAppendix:germlineFilter].

It allowed precise filtering of known germline variant sites from the analysis. The method enables the specification of an allele frequency cut off, but as a default all sites, which were detected in any sample in gnomAD were filtered. We even included sites with low quality variants in the database, as these were sites of sequencing or mapping complications, which most likely interfered with our analysis as well.

Count normalisation

After the filtering steps, all remaining mismatches were aggregated to oligo-nucleotide counts. With this step we also classified directly neighbouring mismatches as DBS, which were counted as separate entities. SBS and DBS both can be used to identify underlying biological mutational processes, but they have very different signatures associated with them (Ludmil B. Alexandrov et al. 2020).

The counts formed this way were influenced by the background frequency of their reference oligo-nucleotides in the analysed genomic region. As the frequencies of di- and tri-nucleotides are not uniformly distributed in the genome, the chance for a mismatch found in an “AAA“ reference context is almost seven times higher than a mismatch in “CGC“ ([A:mmf:tab:tricounts], [A:mmf:tab:dicounts]). To reduce this bias towards high frequency oligo-nucleotides, we implemented a count normalisation step.

First the di- and tri-nucleotides in the analysed regions were counted using the reference without any black-listed and/or only in white-listed regions. These counts were then either

used to directly weight the observed mismatch counts, which led to a more uniform distribution of mismatches, or by building a fraction of observed oligo-nucleotides and the total counts in the genome

([A:mmf:tab:tricounts], [A:mmf:tab:dicounts]), the weighting achieved an approximation of how the counts would have been distributed if the whole genome was analysed. These two options are available with ‘–normaliseCounts’  for the approximation to full genome. By also adding ‘–flatNormalisation’  only the observed counts are used for normalisation.

All analysis presented in this work was not normalised, as the white listed regions used showed no significant difference in oligo-nucleotide abundance.

Signature deconvolution - finding the original signal

The deconvolution of the involved signatures from a known set of signatures is equivalent to finding the minimal distance between <math display="inline">m</math> as the observed number of mismatches in each oligo-nucleotide context (a vector of length 96) and <math display="inline">\textbf{S}w</math>, where <math display="inline">\textbf{S}</math> is the matrix of oligo-nucleotide defined contributions for each signature, resulting in a matrix of <math display="inline">96 \times k</math> with <math display="inline">k</math> being the number of known signatures. Lastly, <math display="inline">w</math> is the vector of weights of each signature, which we aimed to estimate.

<math display="block">\begin{aligned} \text{minimise:} & \quad (m - \textbf{S}w)^T(m - \textbf{S}w) = m^Tm - w^T\textbf{S}^Tm - m^T\textbf{S}w + w^T\textbf{S}^T\textbf{S}w \label{mmf-eq:optim1}\\ \text{with:} & \quad \sum_j w_j = 1 \quad \text{and} \quad \boldsymbol{\forall} j ~ w_j \geq 0 \label{mmf-eq:requirements}\end{aligned}</math>

[mmf-eq:optim1] can then be written as

<math display="block">\text{minimise:} \quad - m^T\textbf{S}w + \frac{1}{2}w^T\textbf{S}^T\textbf{S}w \label{mmf-eq:optim2}</math>

with the same restrictions as shown in [mmf-eq:requirements]. These equations and the idea to solve them with quadratic programming (QP) have been taken from A. G. Lynch (2016), and the iterative linear models (ILM) solving approach was adapted from deconstructSigs (Rosenthal 2016). Both methods were reimplemented in python in MisMatchFinder, using the quadprog package (McGibbon et al. 2021) for QP and a translation of the R code of deconstructSigs for ILM.

MisMatchFinder allows the use of either QP or ILM, as they in many cases produce very similar results (A. G. Lynch 2016). However, the default method is QP, even though ILM is the more interpretable and more parsimonious method, because of the increased number of signatures, in the latest work by Ludmil B. Alexandrov et al. (2020), ILM did not resolve the correct signatures if the signal was not strong enough. QP on the other hand showed more stable solutions ([fig:mmf-ILMerror]).

Distances of the estimated weights generated with ILM and QP from the true weight used as input; Truth is a synthetic count sample with (SBS1: 0.25; SBS3: 0.05; SBS5: 0.46; SBS7a: 0.1; SBS19: 0.03; SBS21: 0.01; SBS31: 0.08; SBS57: 0.02;)
Distances of the estimated weights generated with ILM and QP from the true weight used as input; Truth is a synthetic count sample with (SBS1: 0.25; SBS3: 0.05; SBS5: 0.46; SBS7a: 0.1; SBS19: 0.03; SBS21: 0.01; SBS31: 0.08; SBS57: 0.02;)

The combinatorial problem in ILM, already shown by A. G. Lynch (2016), was exacerbated with “wide“ signatures like SBS3 ([fig:sig7a]) and low signature contribution weights. As we were interested in analysing low tumour purity samples with low somatic signature signals from ctDNA, ILM was less sensitive in our test. Especially for SBS3, the contribution of the signature was only assigned by ILM with sufficient signal (15 and 20 mutations per megabase respectively for for SBS7a and SBS3) where QP allowed for a more linear increase in signal, even at lower levels. On the other hand, ILM will assign an overall higher proportional weight than QP once the signal reaches a certain threshold ([fig:mmf-methodDifferences]). ILM was therefore better suited for high confidence signal, but less effective for the more subtle differences we expected from ctDNA.

The deconvolution method could be an area for further optimisation by creating a custom deconvolution system adjusted for ctDNA detection and the signatures present.

For the rest of this work, unless specified differently, the results shown were generated using the QP deconstruction method.

Signature detection

Signature deconvolution with QP resulted in non-negative signature weights for almost all of the signatures when using MisMatchFinder derived counts, however a positive signature did not necessarily indicate the activity of this process in a tumour capacity due to the normal somatic mutation background and germline residual signal ([mmf-sec:germlineArtifacts]). To enable calling of significantly active signatures in samples, we developed a z-score like system, which used the distribution of each signature weight in the healthy population as a background.

As the weight values after deconvolution were between 0 and 1 inclusive, with a high enrichment for 0 and 0-adjacent weights, we chose the beta distribution with probability density function (PDF, [eq:betaPDF]) with shape parameters <math display="inline">\alpha</math> and <math display="inline">\beta</math> and normalisation constant <math display="inline">\mathrm{B}</math> to ensure the cumulative density function (CDF) sums up to 1. <math display="block">f(x; \alpha ,\beta)= \frac{1}{\mathrm{B}(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1} \label{eq:betaPDF}</math>

To enable a z-score like estimation, we calculated the <math display="inline">\lambda</math>-quantile of the cumulative density function of the beta distribution ([eq:betaCDF]) for each patient <math display="inline">p</math> samples signature weight <math display="inline">w_s(p)</math> of signature <math display="inline">s</math> with healthy sample fitted shape parameters <math display="inline">\alpha_s</math> and <math display="inline">\beta_s</math> per signature by solving for <math display="inline">\lambda</math> resulting in the inverse beta cumulative density function ([eq:invBetaCDF]).

<math display="block">F(x; \alpha ,\beta) = \frac{\displaystyle\int_{0}^{x} t^{\alpha-1}(1-t)^{\beta-1} dt}{\mathrm{B}(\alpha, \beta)} \label{eq:betaCDF}</math>

<math display="block">x = F^{-1}(\lambda; \alpha ,\beta) = \left\{ x: F(x; \alpha, \beta) = \lambda \right\} \label{eq:invBetaCDF}</math>

This allowed us to estimate how many healthy samples would have a signature weight less than the patient sample for the respective signature <math display="inline">s</math> ([eq:sigCDFScore]). While we did not see significant down regulation of signatures from the background, the method could support both over- and under-representation analysis. Ten percent of all samples were removed from both tails of the distribution of each signature before fitting the shape parameters to minimise the impact of outliers and instead fit the core distribution.

<math display="block">\forall s \in Signatures,\ \forall p \in Patients:\ \text{CDF-score}(s,p) := \frac{w_s(p)}{F^{-1}(\lambda; \alpha_s ,\beta_s)} \label{eq:sigCDFScore}</math>

We used <math display="inline">\lambda = 0.99</math> to calculate CDF-scores for all signatures and assumed a CDF-score of more than 2 to be significantly different.

This allowed the prioritization of significantly changed signatures in the tumour samples with regards to our healthy background depending on the desired application. A higher CDF-score cut off could increase specificity at the cost of sensitivity.

While most signature weights could be estimated very well with the moment matching estimation of fitdistrplus (Delignette-Muller and Dutang 2015) some signatures did not show an ideal fit. However, in these cases, the fit resulted in a wider distribution with higher theoretical quantiles than empirically observed, which reduced the predictive power of the signature. These signatures also showed a double peak distribution indicative of a population sub structure in the healthy population ( A@[A:fig:fittedDistsSBS3]  9.5 @@ and [A%

fig:fittedDistsSBS17a] , [A%
fig:fittedDistsSBS17a] @ vs. A@[A:fig:fittedDistsSBS12]  9.7 @@ and [A%
fig:fittedDistsSBS46] , [A%
fig:fittedDistsSBS46] @ ).

Tumour detection

With the calling of “active“ signatures in [mmf-sec:sigdetection] we were subsequently interested in the classification of samples into “healthy“ and perturbed (cancer) signature states. The classification as perturbed would allow a short-listing of samples for a more in-depth analysis with orthogonal validation.

For this purpose we trained a glmnet classifier with the assigned signature weights vector of each sample with an <math display="inline">\alpha</math> value of 0.7 due to the high correlation of the data.

The only other method which enables the classification of lcWGS of cfDNA into healthy and tumour (perturbed) samples was ichorCNA, which uses the copy number state of the sample to infer tumour purity (Adalsteinsson et al. 2017). ichorCNA is currently considered the gold standard and we used their default method as a reference to compare our results to. Any sample with a tumour purity of <math display="inline">\geq</math> 3% was considered a positive and everything else a negative sample, as suggested by the ichorCNA developers.

The clinical status of disease was used as the ground truth for each sample. Every sample which was taken during a time, when the patient had active disease was assigned the tumour class and every healthy individual the healthy class.

Results

This section presents the results of applying MisMatchFinder to multiple distinct datasets with different configuration. [mmf-sec:simulated] is the evaluation of the method on simulated data, which allowed accurate and definitive insight into the sensitivity of MisMatchFinder and served as proof of concept. Then, [mmf-sec:realworld] summarises the results from multiple real world datasets, demonstrating the methods performance on real world data and the associated clinical insights.

Simulated Data - the validation promised land

Just like in [ch:variantcalling], the novelty of the approach led to the issue of an absence of a gold standard dataset, with which to evaluate the performance of our new method. While there are low coverage WGS datasets of cancer patients, none of them had validated signatures associated with them. So we simulated data, to allow both optimisation of parameters of our method and granular detection of technical artefacts.

Sequencing errors filtering

To judge the ability of our approach to filter out sequencing errors, we first simulated “clean“ sequencing reads with neither germline or somatic variants with the ART simulation suite (Huang et al. 2011). As current estimates of Illumina sequencing error rates were in the range of 1 in 666 to 1 in 1149 (Stoler and Nekrutenko 2021) which was significantly higher than even the highest tumour mutational burdens of cancers (melanoma: 1 in 5k; tobacco smoking lung cancer: 1 in 100k) it was very important to be able to eliminate as much of the background errors as possible.

Mismatchrate of different filtering methods on sequencing data simulated with ART(Huang et al. 2011) for both 10x and 3x coverage; Mismatches correspond to simulated sequencing errors; all: no filters, overlap: only use the overlapping parts of paired end reads with consensus building ([mmf-sec:consensus]), strict OL: overlap but reads <i>must</i> agree, high BQ strict OL: strict OL with high BQ in both variants; A) Absolute counts B) counts from A normalised by the number of analysed bases all: all aligned bases, other: number of bases in read overlap
Mismatchrate of different filtering methods on sequencing data simulated with ART(Huang et al. 2011) for both 10x and 3x coverage; Mismatches correspond to simulated sequencing errors; all: no filters, overlap: only use the overlapping parts of paired end reads with consensus building ([mmf-sec:consensus]), strict OL: overlap but reads must agree, high BQ strict OL: strict OL with high BQ in both variants; A) Absolute counts B) counts from A normalised by the number of analysed bases all: all aligned bases, other: number of bases in read overlap

By only using high base quality mismatches, where both reads agree on the mismatch 99.98% of all sequencing errors could be eliminated and only 1 mismatch in 10M bases would be wrongly counted as a variant ([fig:mmf-mismatchrate]). This false discovery rate was multiple orders of magnitude lower than without consensus computation and the remaining error rate was lower than most tumour mutational burden estimates (Ludmil B. Alexandrov et al. 2020; Lawrence et al. 2013). We therefore considered our assumption of constant and small sequencing error contribution to be correct ([mmf-sec:concept]).

Spike-in signature detection

With the technical errors eliminated in simulated data, we used a similar method in real world data. However, to also establish a baseline for the detection limit and sensitivity of the method, we decided to first use a hybrid approach. We spiked somatic variants into genuine cfDNA low coverage WGS sequencing data of a healthy control, reducing the amount of unknown variability from other published datasets.

While it would have been possible to simulate the variants completely de novo, without any prior knowledge, we know that somatic mutations follow a certain pattern and there are mutational hotspots (T. Chen et al. 2016; Moore et al. 2021), so we decided to instead use the COSMIC database (Tate et al. 2018; Wellcome Sanger Institute 2021) as the catalogue to select mutations from. This allowed us to randomly draw mutations, which occurred in a specific cancer subtype. By using COSMIC variants our simulations were less synthetic. The in-depth protocol is shown in [ch-mmfAppendix:spikein]. The downside of this method is that the spike-ins were not predominantly introduced on shorter fragments, as would be the case with real ctDNA.

The following section discusses the results for the simulation of the very distinct SBS7a UV signature (see [fig:sig7a]) which is predominantly present in Melanoma and secondly the much flatter and more uniform SBS3 ([A:fig:sig3]), which is a sign of defective homologous recombination in breast and other cancers.

Signature analysis results of spiked-in somatic variants; signatures with a weight less than 1% were collated into “unknown“; The original spike-in signature was coloured in green (SBS3) and purple (SBS7a), unrelated signatures are coloured white and signatures corresponding to sequencing artefacts are coloured in lightgrey; r0.1 corresponds to approximately 0.1 variants per mega base; Weights were generated with deconstructSigs (Rosenthal 2016)
Signature analysis results of spiked-in somatic variants; signatures with a weight less than 1% were collated into “unknown“; The original spike-in signature was coloured in green (SBS3) and purple (SBS7a), unrelated signatures are coloured white and signatures corresponding to sequencing artefacts are coloured in lightgrey; r0.1 corresponds to approximately 0.1 variants per mega base; Weights were generated with deconstructSigs (Rosenthal 2016)

The spike-in was done at multiple different ratios, to simulate varying tumour purities and tumour mutational burden (TMB). [fig:mmf-spikeinsanity] shows the signature analysis result of the lowest spike-in ratio “r0.1“ which corresponded to 0.1 somatic variants per mega base and resulted in approximately 300 variants for the whole genome. As the spike-in process had to satisfy certain quality measures, not all candidate variants could be used. As such, the final simulated BAM contained 264 additional variants for the SBS3 simulation and 287 for the SBS7a equivalent. The variants corresponded to 304 and 364 “tumour“ reads respectively within the <math display="inline">\approx</math> 261 million reads of the simulated BAM. With increasing ratio, the spike-in signatures showed decreasing weights for other signatures, which likely got introduced due to the incomplete spike-in process ([ch-mmfAppendix:spikein]).

Signature weight differences for different deconvolution methods; Methods are the quadratic programming (QP) and iterative linea model (ILM); deconvolution was performed on the same counts generated with MisMatchFinder on 7 simulated dataset with increasing mutational burden from 5 to 100 mutations per mega base spike-in; for 0 mutations per mega base, the normal sample used for the spike-in was used
Signature weight differences for different deconvolution methods; Methods are the quadratic programming (QP) and iterative linea model (ILM); deconvolution was performed on the same counts generated with MisMatchFinder on 7 simulated dataset with increasing mutational burden from 5 to 100 mutations per mega base spike-in; for 0 mutations per mega base, the normal sample used for the spike-in was used

Melanoma - UV exposure (SBS7a)

With melanoma, the previously reported normal TMB ranges from 0.1 to 100 mutations per mega base (Ludmil B. Alexandrov et al. 2020). Melanoma is usually seen as a cancer with very high mutational load, which made it the ideal target for our new mutation based tool. With only the strict overlap ([mmf-sec:consensus]) and the germline ([mmf-sec:germlinefiltering]) filtering enabled, we could see that already from r5, which represented 16899 mutated reads (of 260 Mio.), we could detect the UV signature SBS7a. While this signal would likely be too low to trust in a clinical setting, with r10, the signature weight wass already 2% and well established. Additionally, the method was very specific on this dataset. Only SBS7a showed an increase with higher spike-in rate, with minor contributions from other other C>T heavy signatures like SBS2 and SBS30 ( f@[fig:mmf-methodDifferences]  4.6 @@ and [f% ig:mmf-spikeSBS7asignatures] , [f% ig:mmf-spikeSBS7asignatures] @ ), which partly already stemmed from the spike-in process, which was slightly enriched for SBS2 ([fig:mmf-spikeinsanity]B “unknown“). All other signatures, which were present in the normal sample showed a decrease. This decrease was to accommodate an additional signature, as all signature weights need to sum up to 1.

Signature weights differences from normal for SBS7a spike-in; Weights were de-constructed with QP method in MisMatchFinder and the weights assigned to the normal sample used for the spike-in were subtracted; Only Signatures with <math display="inline">\text{original weight}\geq 1\%</math> or a minimum difference of 0.5% are shown. The full weights can be seen in [A:fig:sbs7aspikeindifferences]; r0.1 corresponded to 0.1 mutations per mega base (287 variants) and r100 was the equivalent of 100 mutations per mega base (286974 variants)
Signature weights differences from normal for SBS7a spike-in; Weights were de-constructed with QP method in MisMatchFinder and the weights assigned to the normal sample used for the spike-in were subtracted; Only Signatures with <math display="inline">\text{original weight}\geq 1\%</math> or a minimum difference of 0.5% are shown. The full weights can be seen in [A:fig:sbs7aspikeindifferences]; r0.1 corresponded to 0.1 mutations per mega base (287 variants) and r100 was the equivalent of 100 mutations per mega base (286974 variants)

Defective homologous recombination-based DNA damage repair (SBS3)

Just as with the SBS7a signatures, even for the much more diffuse signature SBS3, MisMatchFinder specifically revealed the spike-in signature and did not assign the additional mismatches to any other signature. There was a small increase in SBS4 for the very low mutation rate simulations, where no SBS3 was detected. Unsurprisingly, the detection limit for SBS3 was slightly higher than for SBS7a (5 vs. 15 mutations per mega base), because of its more uniform profile. Exactly as with SBS7a, all other signatures showed a slight decrease, to accommodate the additional signature weight ( f@[fig:mmf-methodDifferences]  4.6 @@ and [f% ig:mmf-spikeSBS3signatures] , [f% ig:mmf-spikeSBS3signatures] @ ). While the detection threshold was slightly higher than the currently assumed median TMB in breast cancer, especially triple negative breast cancer (TNBC) has shown a higher TMB, which was at comparable levels to the limit of detection we saw in this simulated dataset (Barroso-Sousa et al. 2020).

Signature weights differences from normal for SBS3 spike-in; Weights were deconstructed with QP method in MisMatchFinder and the weights assigned to the normal sample used for the spike-in were substracted; Only Signatures with <math display="inline">\text{original weight}\geq 1\%</math> or a minimum difference of 0.5% are shown. The full weights can be seen in [A:fig:sbs3spikeindifferences]; r0.1 corresponds to 0.1 mutations per megabase (264 variants) and r100 is the equivalent of 100 mutations per megabase (285367 variants)
Signature weights differences from normal for SBS3 spike-in; Weights were deconstructed with QP method in MisMatchFinder and the weights assigned to the normal sample used for the spike-in were substracted; Only Signatures with <math display="inline">\text{original weight}\geq 1\%</math> or a minimum difference of 0.5% are shown. The full weights can be seen in [A:fig:sbs3spikeindifferences]; r0.1 corresponds to 0.1 mutations per megabase (264 variants) and r100 is the equivalent of 100 mutations per megabase (285367 variants)

Germline filtering

In order to ensure our assumptions also hold true in real patient data, we evaluated the effect of removing germline variants from the analysis. We utilised the same simulated samples from [mmf-sec:simSignatures], where the reads were original cfDNA sequencing reads from a healthy person. These reads had a known natural background germline variant profile as any arbitrary sample would.

Signature analysis without germline variant filtering; Weights were deconstructed with QP method in MisMatchFinder, but in contrast to [fig:mmf-methodDifferences], the filter removing all known germline variants was disabled
Signature analysis without germline variant filtering; Weights were deconstructed with QP method in MisMatchFinder, but in contrast to [fig:mmf-methodDifferences], the filter removing all known germline variants was disabled

In stark contrast to the previous analysis ([fig:mmf-methodDifferences]), when retaining mismatches in known germline variant sites, the sensitivity of the method reduced significantly. Only for the SBS7a spike-in at the very highest mutation frequency (100 mutations per mega base) was a signal detected. This signal was still weaker than what was previously found with just 10 mutations per mega base. Unsurprisingly SBS3 performed worse, just as before, and no signal was detected at any frequency ([fig:mmf-noGermlineFilterAnalysis]).

This extreme change was caused by the much higher number of mismatches which were used in the analysis (<math display="inline">\approx 1.8</math> Mio without germline filter and <math display="inline">\approx 130</math>K with germline filter). This increase in mismatches in the analysis diluted the spike-in variants. [fig:mmf-percentIncrease] showed that without the germline filter the additional found mismatches never exceeded 5% of all mismatches, which seemed to be the detection threshold for SBS7a. With germline filtering this threshold corresponded perfectly with the increase of SBS7a weight in those samples ([fig:mmf-methodDifferences]).

Percent increase of mismatches in analysis with and without germline filter; Values are normalised to the number of mismatches found in the normal sample (depicted as 0 mutations per mega base); dotted grey line shows the maximum increase in the left panel (without germline filter)
Percent increase of mismatches in analysis with and without germline filter; Values are normalised to the number of mismatches found in the normal sample (depicted as 0 mutations per mega base); dotted grey line shows the maximum increase in the left panel (without germline filter)

While we had already established that the spike-in variants could not be detected when retaining germline variant sites, the computed signature weights in the normal sample were vastly different as well. Without the germline filter, the most prevalent signatures were SBS1 and SBS5 which are thought to be molecular clock like signatures, related to the age of the individual (Ludmil B. Alexandrov et al. 2020). In the germline filtered analysis the most prevalent signatures were SBS4 (tobacco smoking), SBS12 (unknown) and SBS46 (sequencing artefact). In general it seemed like the germline filter removed predominantly SBS1 and SBS5, while most other signatures remained the same ([fig:mmf-noGermlinePie]).

As the sample was acquired through a healthy donor blood bank, we had no way to verify if the individual was a smoker.

Signature weights of the normal sample with and without germline filter; MisMatchFinder derived signature weights with and without germline filter; weights below 1% contribution are accumulated in “unknown“ (darkgrey), lightgrey signatures show sequencing artefact signatures, yellow shows smoking related signatures and blue depicts APOBEC signatures
Signature weights of the normal sample with and without germline filter; MisMatchFinder derived signature weights with and without germline filter; weights below 1% contribution are accumulated in “unknown“ (darkgrey), lightgrey signatures show sequencing artefact signatures, yellow shows smoking related signatures and blue depicts APOBEC signatures

This convinced us that germline filtering, additionally to the consensus overlap analysis, was fundamentally important for the method to recover signal. In the following sections, unless further specified, the germline filter was enabled for all analysis.

Real world data - analysis of patient data

While simulated data is perfect to ensure the method performs as expected in edge cases and to estimate detection limits, only real world data allows the final examination to understand if the model used for analysis can mirror biological concepts. To show our new method is usable for a variety of datasets, we used a mixture of different cancer types with different library preparation. In [mmf-sec:healthy] we focused on the analysis of 60 healthy individuals. We generated a background noise model and excluded aging as one of the sources of variation from our data. Then we analysed two metastatic breast cancer patients with BRCA1 positive disease, comparing matched tumour-normal sequencing with MisMatchFinder. This dataset allowed us to evaluate how efficient germline filtering was ([mmf-sec:germlineArtifacts]) and how accurate and sensitive our method was when compared to the current gold standard of tumour-normal tissue analysis ([mmf-sec:matchedMBCB]). A second dataset containing samples of two cfDNA time points and corresponding tissue from a patient with metastatic melanoma allowed us to validate performance of MisMatchFinder in a different cancer context ([mmf-sec:melpatients]). Lastly we analysed a dataset of 79 tumour only cfDNA samples both melanoma (40) and breast cancer (39) patients to show the potential clinical application of MisMatchFinder ([mmf-sec:tumourDetection]). The mean tumour purity assigned to each sample by ichorCNA was 16.2% (min: 0.3% max: 71.5% sd: 17.0%) displaying a clinically expected range of high and low ctDNA fractions. The mean age of patients was 53.4 (min: 34 max: 74) for the breast cancer patients and 60.2 (min: 34 max: 81) for the melanoma patients which is very closely age matched to our healthy samples ([mmf-sec:healthy]).

In the following section, we showed that MisMatchFinder exhibits barely any technical bias.

Bias detection

A dataset with healthy samples is key to detect biases, because any variability that cannot be accounted to either age or gender is unwanted and will affect the cancer samples in the same way. We expected an increased mismatch rate in the older individuals do to the accumulation of somatic mutations due to “clock like signatures“ (Abascal et al. 2021). In contrast, tumour samples should be biased based on tumour purity, as higher amount of tumour reads would result in a higher amount of mismatches from somatic mutations. To verify that our assumptions were correct, we performed a principle component analysis (PCA) of the raw tri-nucleotide mismatch count numbers, of all 79 tumour only and 60 healthy samples, which MisMatchFinder can report alongside the weights of signatures.

PCA (PC1 and PC2) of tri-nucleotide mismatch counts of healthy donor and tumour samples (melanoma and metastatic breast cancer) of varying purity; PCA was conducted on scaled and centered data
PCA (PC1 and PC2) of tri-nucleotide mismatch counts of healthy donor and tumour samples (melanoma and metastatic breast cancer) of varying purity; PCA was conducted on scaled and centered data

Neither age, nor sex of the sample seemed to have any influence on the mismatches of the sample [fig:mmf-pca1v2]A, B. In contrast, there appeared to be a batch effect with regards to the used flowcell on the sequencer and library preparation ([fig:mmf-pca1v2]C, D). As these two are strongly intertwined, it was not possible to differentiate the two effects, however the flowcell bias was evident across multiple library preparations and therefore more likely to be the source of the bias. Flowcell 1 contained samples from both ‘g’ and ‘h’ and flowcell 3 both ‘a’ and ‘d’, suggesting that the flowcell had more influence than the preparation. This is consistent with recent literature, which suggests, that there are more and less error prone flow cells (Stoler and Nekrutenko 2021).

While there was a slight bias towards higher PC1 values for higher DNA input samples, which might be due to a higher library complexity when sequencing, it was minor and the same bias exists for every other method using de-duplicated sequencing data as it might have an effect on the non-redundant data available ([fig:mmf-pca1v2]E). Similarly, the very low coverage samples were at the very left of PC1, but there was a substantial spread along the axis for the higher coverage samples as well ([fig:mmf-pca1v2]F).

As expected from the model and the biology, there was a trend of separation for both tumour type and tumour purity, with the higher purity samples oriented toward the top right and the healthy and lower purity samples at the left ([fig:mmf-pca1v2]G, H).

These tendencies can also be observed when looking at PC2 and PC3 ([A:fig:mmf-pca2v3]). PC3 still accounts for <math display="inline">\approx \text{9\%}</math> of the observed variability and PC6 is the first component with an variance value smaller than 1 suggesting that only the first five components should be retained for further analysis (explaining a cumulative 97.3% of the variance). However, as we use the counts for signature deconstruction the PCA analysis only served as a quality control.

Healthy cohort

We sequenced the 60 healthy samples, from varying age groups (range: 24 yrs.-70 yrs. median: 48.5 yrs.) with 24 males and 36 females, in the exact same way as the tumour only samples for breast cancer and melanoma to an effective average coverage of 8x WGS, with mixed healthy samples and cancer samples on sequencing flow cells to account for and minimise batch effects.

With recent literature indicating a linear relationship of aging and somatic mutations (Martincorena et al. 2018; Abascal et al. 2021; Cagan et al. 2022), we were interested if the MisMatchFinder method could identify the accumulation of somatic mutations with age. While the samples between 30 and 59 showed the expected linear increase in mean mismatch rate, both the 20-29 and 60-69 year old samples showed the inverse trend, where young individuals exhibited a higher mismatch rate than older. This discrepancy could have been rooted in the germline filtering step as shown in [mmf-sec:germlineArtifacts], or it may have been a sign of loss of heterogeneity in the hematopoiesis as shown in aging mice (Ganuza et al. 2019). Lastly, with only 40 healthy samples, we might not have had enough representation in each age group to detect subtle changes.Whatever the source, MisMatchFinder was not able to infer the age of a sample with the default settings in this dataset.

Mismatchrates of healthy samples by age: distribution of sample is shown as violin plots on the left hand of each group, with a boxplot and the mean indicated as a black box and a white dot respectively; Observed values were displayed as black dots on the right hand of each group
Mismatchrates of healthy samples by age: distribution of sample is shown as violin plots on the left hand of each group, with a boxplot and the mean indicated as a black box and a white dot respectively; Observed values were displayed as black dots on the right hand of each group

Black list generation

With the strong effect filtering of both technical errors ([mmf-sec:cleanSim]) and germline variants ([mmf-sec:germlinefiltering]) as background noise had on our method, we hypothesised that a blacklist of mismatches found in our healthy individuals would help us further cut down on unwanted background signal and refine the somatic mismatch calls. We therefore ran MisMatchFinder with significantly relaxed quality cut-offs to capture as much variation as possible. This included a reduction in mapping quality and base quality as well as not restricting the analysis to the highly accurate overlap part of the paired end reads. However we still restricted the analysis to the same highly mappable areas of the genome the same as for the tumour analysis as well as filtering already known germline variants for a better estimation of the impact of this filter step.

The site files generated through MisMatchFinder were then concatenated and aggregated to multiple blacklists with cut-offs of a variant present in at least 3, 5 or 10 times. The code used for the post processing of MisMatchFinder site files can be found in [lst-mmf:blacklist].

Blacklisted mismatches from healthy individuals: A) Histogram of frequency of mismatches found by at least n samples (multiplicity) B) Percentage of mismatches in healthy samples found in blacklist of mismatches found in at least three healthy samples; violin plot with median as white dot (left) and real values (right)
Blacklisted mismatches from healthy individuals: A) Histogram of frequency of mismatches found by at least n samples (multiplicity) B) Percentage of mismatches in healthy samples found in blacklist of mismatches found in at least three healthy samples; violin plot with median as white dot (left) and real values (right)

While there was a substantial number of shared variants in the reduced parameter analysis out of the <math display="inline">\approx 15</math> million mismatches, only 3413 were in more than 50% of all samples ([fig:mmf-BlackListStats]A), and the blacklist filtering did not lead to a performance boost. When we used the default settings for the healthy samples and used the generated blacklist as a filter on average only 2% of mismatches were removed by the blacklist assignment (min: 1%, max: 3%) when at least three samples showed the mismatch ([fig:mmf-BlackListStats]B). Utilising the ten sample blacklist, the mean filtering amount dropped to 0.1% of mismatches of the sample (min: 0.04% max: 0.3%). We therefore concluded that the blacklist filtering step has no additional benefit and it was not included in the default analysis of MisMatchFinder.

BRCA1 mutation positive breast cancer patient samples

The first dataset of cancer patients involved two BRCA1 mutation positive females. The data contained matched tumour, germline and cfDNA sequencing with high depth (<math display="inline">\approx 80x</math>) WGS for all samples. With the matched normal, we used the current standard protocol of somatic mutational pattern analysis ([mmf-sec:signatureanalysis]) and compared it with our new method ([mmf-sec:matchedMBCB]).

As the sequencing data had a much higher depth than what is used in standard clinical practice for plasma sequencing, we down-sampled the data to 8x coverage for MismtachFinder analysis, bringing it in line with the simulated data. By using several different seeds for the sampling, we generated pseudo technical replicates of the sequencing ([ch-mmfAppendix:subsampling]), which in term gave an approximation of the stability of the results of MisMatchFinder.

Germline artifacts

As discussed above, the germline filter step is vital to boost the signal of somatic variants ([mmf-sec:germline]). We were interested in how many germline variants were not filtered out with our filtering step and were still contributing to background noise. The high depth matched healthy WGS samples of the breast cancer patients was used for this analysis. We called germline variants on the matched normal using Strelka2 and compared the called variants with the sites reported by MisMatchFinder as somatic (retained after germline filtering) on the sub-sampled data. All variants with any quality filter assigned by Strelka2 were considered for this analysis, such that possible clonal hematopoiesis (CH) variants were still considered. [tab:mmf-germlineArtifacts] showed that on average 2100 germline variants were not filtered out per run. However, this only equated to 0.9% for patient 1 (MBCB196) and 1.5% for patient 2 (MBCB298) of all sites found to be mutated. While the exact numbers of any arbitrary sample will depend on the strictness of the parameters of the analysis as well as the mutation rate of the sample, with default parameters a similar result should be expected with other samples, suggesting the germline filter was removing virtually all mismatches caused by germline variants.

The germline variant removal was therefore very effective filtering the 3.75 (MBCB196) and 3.76 (MBCB298) million germline SNPs called by Strelka2 to less than 0.05% of the original number. As the genetic background in gnomAD is not balanced and shows a lack of non-european ancestry data (Tiao and Goodrich 2020), this filtering could become less effective when analysing samples from indigenous or otherwise genetically less characterised patients, as their germline variants might not be comprehensively represented.

Nevertheless, this result combined with the effective filtering of technical errors ([mmf-sec:cleanSim]) suggested, that nearly all of the remaining sites were somatic mutations of either the healthy tissue or the cancer cells.

Germline variants retained after germline filtering with MisMatchFinder analysis; Default parameters were used when running MisMatchFinder with gnomAD zarr for filtering. seed column showed the seed used to subsample the high depth sequencing BAM, “mismatch sites“ column contains number of sites found to be changed, “germline sites“ contains the number of sites also found with germline variant calling, fraction shows fraction of column 4 and 3
Patient ID seed mismatch sites germline sites fraction
1007 216950 2107 0.0097
1234 217145 2073 0.0095
1337 216823 2080 0.0096
1717 217593 2089 0.0096
2358 217317 2097 0.0096
3311 217219 2046 0.0094
5229 216876 2062 0.0095
6060 217388 2080 0.0096
MBCB196 9876 217656 2008 0.0092
1756 148495 2168 0.0146
3599 149901 2224 0.0148
4117 149382 2277 0.0152
4306 149549 2248 0.0150
4359 149805 2205 0.0147
5788 150103 2241 0.0149
5887 150099 2287 0.0152
8387 149533 2248 0.0150
MBCB298 9754 149547 2229 0.0149

Additionally, we were interested to see if the filtering did remove true somatic signal, by removing somatic variants which mirrored a known germline variant. We therefore annotated high confidence variants called with Strelka2 in both the patient tumour tissue and cfDNA samples with a germline flag when that variant was previously found in gnomAD. Surprisingly we found that <math display="inline">\approx 50\%</math> of somatic variants found in any sample were also found in gnomAD. To ensure these samples were not outliers, we also analysed the tissue samples from [ch:cascade] and saw that all samples had half of the somatic variants called already present in gnomAD ([fig:mmf-germlineStatus]A). While gnomAD is known to contain potential somatic contamination, these overlaps were most likely due to error prone sites in the genome and the higher rate of deaminations and transitions over transversions (Meyerson et al. 2020). To ensure our filtering did not skew the results we performed signature deconvolution on the somatic variants of a sample, which were not found in gnomAD and variants which were found in gnomAD. While there were minor signatures, which were clinically relevant (SBS13 and SBS87) in the “germline“ partition, the majority of signatures were identified using variants not contained in gnomAD. Specifically SBS13’s contribution was larger in the non-germline selection even though it was found in both analyses ([fig:mmf-germlineStatus]B vs C, [A:fig:signaturesGermlineSplit]).

Analysis of somatic variants with respect to germline database gnomAD: A) percentage of somatic variants called with Strelka2 found also in gnomAD; Violin plot with median as white dot (left) and values (right) B) signature analysis of variants not found in gnomAD for sample MBCB196 cfDNA c) signature analysis of variants also found in gnomAD for sample MBCB196 cfDNA; Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)
Analysis of somatic variants with respect to germline database gnomAD: A) percentage of somatic variants called with Strelka2 found also in gnomAD; Violin plot with median as white dot (left) and values (right) B) signature analysis of variants not found in gnomAD for sample MBCB196 cfDNA c) signature analysis of variants also found in gnomAD for sample MBCB196 cfDNA; Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)

In summary, the germline filtering was a very powerful step, which boosted the performance of our signature detection method significantly with minimal signal deterioration. The high overlap of germline database variants and somatic variants poses a challenge for perfect concordance between our method and the current gold standard and would be an ideal area for further optimisation especially if ageing signatures like SBS1 and SBS5, which are caused by deamination, are of interest. These variants will be predominantly removed and were the most likely reason for the absence of ageing signatures seen on healthy individuals ([mmf-sec:healthyBias]). A potential solution would be to incorporate a machine learning model to distinguish germline from somatic variants (Spinella et al. 2016; Sahraeian et al. 2022).

Matched tissue and cfDNA WGS samples

Signature weights for the WGS of two BRCA1 mutation positive breast cancer patients: Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)
Signature weights for the WGS of two BRCA1 mutation positive breast cancer patients: Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)

The matched tissue-normal WGS for the two metastatic breast cancer patients allowed a concordance analysis with a known true signature. We called variants with Strelka2 and performed default signature deconvolution with sigminer to obtain weights using default parameters for the GRCh38 genome build and QP deconstruction method (Shixiang Wang et al. 2021). Due to their known germline BRCA1 mutations we expected a contribution of SBS3. Both the tissue and the ctDNA samples of patient MBCB196 show a greater than 3% SBS3 as expected, however neither samples for MBCB298 showed SBS3 ([fig:mmf-mbcbWGSsigPie]).

Signature weights for subsampled BRCA1 positive patients: each quadrant represents one high depth WGS sample downsampled to 8x with different seed (x-axis) signature weights per downsampling were shown in the columns in each quadrant. Colours represent clinically relevant signatures: blue (APOBEC), green (HRD), red (UV radiation); light grey and white show sequencing artefact and signatures of unknown significance respectively. Only signatures considered to be active ([mmf-sec:sigdetection]) were displayed
Signature weights for subsampled BRCA1 positive patients: each quadrant represents one high depth WGS sample downsampled to 8x with different seed (x-axis) signature weights per downsampling were shown in the columns in each quadrant. Colours represent clinically relevant signatures: blue (APOBEC), green (HRD), red (UV radiation); light grey and white show sequencing artefact and signatures of unknown significance respectively. Only signatures considered to be active ([mmf-sec:sigdetection]) were displayed

When applying the default parameter MisMatchFinder analysis to the downsampled WGS, we observed consistent results with any seed, suggesting that the signal was present in all reads and subsampling did not skew the results. Secondly, the expected SBS3 signature was found at <math display="inline">\approx 18\%</math> (min: 14% max: 21%) in the tissue and <math display="inline">\approx 24\%</math> (min: 20% max: 26%) in the cfDNA, recapitulating the results from the standard analysis ([mmf-sec:sigdetection]). The chemotherapy related signatures SBS25, SBS31, and SBS87 seen in both MBCB196 and MBCB298, were not reported by MisMatchFinder. This could have been caused by the germline filtering step, however these signatures were not clinically actionable and expected after chemotherapy treatment so their exclusion was not of major concern.

Surprisingly, even though the APOBEC signatures SBS2 and SBS13 were observed in the normal signature deconvolution of patient 1 and 2, MisMatchFinder only found very low levels of SBS13 contribution in the cfDNA sample (mean: 0.4% (min: 0.1% max: 0.7%) and no contribution in the tissue samples. However, MisMatchFinder was able to detect a high prevalence of SBS2 in both tissue (mean: 12% min: 11% max: 12%) and cfDNA (mean: 69% min: 65% max 74%) samples of MBCB298, indicating that MisMatchFinder was capable of detecting APOBEC signatures.

Importantly, even thought the WGS analysis with sigminer did not result in a SBS3 positive deconvolution, MisMatchFinder was able to recover the expected signature in the tissue samples (mean: 28%, min: 27%, max: 29%). As the tissue sample in this case was a FFPE sample, with a much higher number of somatic variants called and very low tumour purity (<math display="inline">\leq 13\%</math>) the standard analysis could have been overwhelmed with FFPE artefacts, which were successfully filtered out using MisMatchFinder. Interestingly, the cfDNA sample did not show any SBS3, but extensively high APOBEC activity. The absence of SBS3 could potentially be explained by reversion mutations in the cancer reenabling BRCA1/2 (Lin et al. 2018), while the high APOBEC activity has been shown in some breast cancer types with Kataegis (Ludmil B. Alexandrov et al. 2020; Rebhandl et al. 2015). Both the tissue samples of MBCB196 and the cfDNA sample of MBCB298 showed activity of UV signatures SBS38 and SBS7b respectively. While it is unlikely that these were caused by UV exposure, the APOBEC deamination signature has similarities to UV exposure. We therefore thought the UV signatures may have represented a leaky APOBEC signature rather than genuine UV exposure.

While MisMatchFinder did not detect the expected SBS3 signature in all samples, we could show a high concordance with the current gold standard analysis and MisMatchFinder revealed clinically relevant signatures SBS2 and SBS3.

Melanoma patient samples

For melanoma real world data, we analysed samples from a single melanoma patient containing tumour-normal matched tissue WES with two additional longitudinal time points of cfDNA sequenced both through WES and lcWGS. This allowed us to compare the current standard signature analysis for both time points with our novel low coverage method that MisMatchFinder was developed for.

With the help of the WES of both the tissue and the cfDNA samples, we established the gold standard to compare the MisMatchFinder result to. Strelka2 was used to call somatic variants and the high confidence somatic SNPs were used to generate signature weights with sigminer.

As expected from a skin melanoma, the tissue as well as both WES cfDNA samples revealed a high contribution of SBS7a and SBS7b, with approximately 50% of all somatic variants associated with these two signatures in the tissue and time point 1 and 27% in time point 2. While several other signatures were found to be present in all samples, none of them had clinical relevance ([fig:mmf-melaWESsigPie]).

Signature weights for the WES of two melanoma patients; First column shows the results for the tissue baseline and middle and right column show the cfDNA; Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)
Signature weights for the WES of two melanoma patients; First column shows the results for the tissue baseline and middle and right column show the cfDNA; Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)

This high exposure of SBS7a and SBS7b was highly concordant with our analysis of the lcWGS sample of time point 1, where more than 50% of all somatic mismatches signatures, which were called active ([mmf-sec:sigdetection]), were attributed to SBS7a and SBS7b. While the proportion of SBS7a to SBS7b was different between the WES result and the lcWGS, the weights for those signatures were very similar. Time point 2 on the other hand showed less agreement between WES (sigminer) and lcWGS (MisMatchFinder). While SBS7b was still detected at similar levels between both methods, MisMatchFinder failed to detect SBS7a, which was the highest contributing signature at time point 2 using the somatic variants from WES. Additionally both lcWGS samples show a high prevalence of SBS3, which was not detected in the WES. While SBS3 is usually accredited to BRCA1/2 positive breast cancers, there have been reports of homology repair deficiency in melanomas. These mismatches might have been caused by subclonal variants, which are below detection threshold for conventional variant calling ( f@[fig:mmf-melaWESsigPie]  4.18 @@ and [f% ig:mmf-melaMMFsigPie] , [f% ig:mmf-melaMMFsigPie] @ ).

Signature weights of lcWGS of two melanoma samples Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)
Signature weights of lcWGS of two melanoma samples Colours show cancer associated signatures: blue (APOBEC), red (UV exposure), orange (tobacco), purple (chemotherapy), light grey (sequencing artefacts), dark grey (everything below 1% weight)

Tumour detection analysis

The comparison of our glmnet based analysis of MisMatchFinder signatures showed similar results to ichorCNA, which is the most commonly used method to call tumour presence (i.e. ctDNA detection) from low coverage sequencing of cfDNA. The two methods classified 21 and 23 true positives and 55 and 60 true negatives which resulted in a sensitivity of <math display="inline">0.525</math> for MisMatchFinder and <math display="inline">0.575</math> for ichorCNA and a specificity of <math display="inline">0.91</math> for MisMatchFinder and <math display="inline">1</math> for ichorCNA on the melanoma samples ( t@[tab:mmf-looMatMMFmela]  [tab:mmf-looMatMMFmela] @@ and [t% ab:mmf-looMatichorCNAmela] , [t% ab:mmf-looMatichorCNAmela] @ ).

vlines, colspec=cccc, cell33 = LimeGreen, cell44 = LimeGreen, vline1 = 1,2fg=white & 1-2 & True class & 1-4
2-1 & 2-2 & positive & negative
Prediction & positive & 21 & 5
4-1 & negative & 19 & 55
vlines, colspec=cccc, cell33 = LimeGreen, cell44 = LimeGreen, vline1 = 1,2fg=white & 1-2 & True class & 1-4
2-1 & 2-2 & positive & negative
Prediction & positive & 23 & 0
4-1 & negative & 17 & 60

While MisMatchFinder was inferior to ichorCNA, there were 5 melanoma samples, which showed very low tumour purity with ichorCNA (mean: 0.6% min: 0.4% max: 1.1%) but MisMatchFinder correctly identified them as cancerous samples. As ichorCNA is purely based on copy number alterations, these samples most likely only had mutationally driven disease, which is why MisMatchFinder could detect altered mutational signatures. Interestingly, only one of the patients exhibited melanoma associated signature SBS7c, while all others had a very high contribution of SBS4 and SBS29, which are thought to play a role in tobacco related cancers.

This suggested to us, that while MisMatchFinder is able to detect the right signature if it is present ([mmf-sec:melaSim]) the the detection of ctDNA in a sample is not purely dependant on the presence of a single dominant signature, and rather is influenced by the combination of signatures detected in the sample. We therefore restricted the detection of the melanoma samples to only melanoma associated signatures SBS7a through SBS7d and SBS38, to validate our hypothesis. With only the melanoma related signatures, only 2 samples were classified as tumour positive, which showed, that the interplay of signatures contains additional data and potentially more signatures than previously thought are related to cancer.

vlines, colspec=cccc, cell33 = LimeGreen, cell44 = LimeGreen, vline1 = 1,2fg=white & 1-2 & True class & 1-4
2-1 & 2-2 & positive & negative
Prediction & positive & 1 & 0
4-1 & negative & 38 & 60

Finally, when comparing our method to ichorCNA on the breast cancer samples, ichorCNA’s performance was superior to the melanoma samples [tab:mmf-looMatichorCNAbreast], but MisMatchFinder preformed considerably worse [tab:mmf-looMatMMFbreast]. The increased sensitivity of ichorCNA (0.74 vs 0.575) on the breast cancer samples was expected, as breast cancers are known to accumulate specific copy number alterations during their evolution (Dawson et al. 2013) and these changes have been shown to be able to stratify and diagnose breast cancer samples (Russnes et al. 2010; Curtis et al. 2012). MisMatchFinder on the other hand requires a mutationally driven cancer signal, which is considerably weaker in breast cancers (Ludmil B. Alexandrov et al. 2020).

vlines, colspec=cccc, cell33 = LimeGreen, cell44 = LimeGreen, vline1 = 1,2fg=white & 1-2 & True class & 1-4
2-1 & 2-2 & positive & negative
Prediction & positive & 29 & 0
4-1 & negative & 10 & 60

However, when using information gain feature selection on the signature weights reported for the breast cancer samples by MisMatchFinder, only signatures SBS3 and SBS5 showed a reduced entropy. We therefore used only these signatures in the training and subsequently boosted the sensitivity from 2% to 18%. Finally, by just using a cut-off approach for weights in SBS3, we achieved the highest sensitivity (0.36%) with a minute drop in specificity (0.95%). We used twice the standard deviation added to the mean of SBS3 weights observed in healthy samples as a cut-off ([fig:mmf-SBS3distribution]).

SBS3 signature weight distribution in healthy and breast cancer samples: MisMatchFinder reported signature weights are shown as both a violinplot with boxplot in black and mean shown as white point on the left and the real measurements on the right; red line depicts classification cut-off twice at 0.003
SBS3 signature weight distribution in healthy and breast cancer samples: MisMatchFinder reported signature weights are shown as both a violinplot with boxplot in black and mean shown as white point on the left and the real measurements on the right; red line depicts classification cut-off twice at 0.003

With samples presenting with a SBS3 signature weight above 0.003 considered positive, we found 12 out of 39 (30%) breast cancer samples and only misclassified 3 healthy samples as cancer positive. With only BRCA1/2 mutation positive cancers having a specific signature we only expected to be able to detect a subset of breast cancer patients. With homologous recombination deficiency (SBS3) being reported in up to to 40% of breast cancers in previous studies (Akashi-Tanaka et al. 2015), our lower sensitivity for breast cancer could be caused by the lack of mutational signatures in the other 60% of cases.

Summary

In this chapter we presented a new method to detect signatures from low coverage whole genome sequencing data as a complementary method to ichorCNA to detect and classify patient samples. Simulated data showed a high specificity of our method for deriving signatures. Real patient data confirmed that while there is a discrepancy between the current standard method of signature detection and MisMatchFinder, the most prevalent clinically relevant signatures were detected. Furthermore, MisMatchFinder was able to detect tumour presence in samples without additional input or need for matched normals.

As the concept has so far only been explored for SNVs, the method can be extended to also analyse DBS and even InDel signatures, which could potentially have an impact on the accuracy and sensitivity of the method. And with more and more insight into the biological features of ctDNA additional enrichment steps, like fragment size selection, could be included to boost the signal of low tumour purity samples (Mouliere et al. 2018; Markus et al. 2022).

With some improvements and optimisations, MisMatchFinder could be seamlessly integrated into the ctDNA analysis workflows involving low coverage WGS to give insight into the mutational processes and resistance mechanisms (Homburger et al. 2019; M. Chen et al. 2021).

“As you think, so you become. Our busy minds are forever jumping to conclusions, manufacturing and interpreting signs that aren’t there.“

Conclusion

This thesis explored different computational methods to assess the genetic heterogeneity of multiple patient samples using DNA sequencing. With sequencing costs now trending towards $100 per genome, many clinical studies are now accumulating data at an unprecedented rate (Stephens et al. 2015), but computational methods have not kept up with the development. With multiple spatial and longitudinal samples sequenced, the known concept of spatial and tissue heterogeneity could be assessed and insights into disease trajectory and evolution generated. In the past, the accurate molecular diagnosis of a patients cancer led to the discovery of targeted therapies and a massive improvement in cancer care. So it is likely, that similar advances can be made with the accurate analysis of the diversity and history of cancer clones within a patient. While single cell sequencing has already highlighted and described new cell states and types, the methods are far from being able to be used in a clinical situation. Additionally, the amount of data already generated for collaborative efforts like TCGA (The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020) or PCAWG are cause enough to optimise and develop methods to facilitate further research in the cancer space.

The contributions of this work include the development of multiple proof of principle methods, which show that the analysis of bulk sequencing requires further research and has unrealised potential for both diagnostic and research questions.

This thesis presents three distinct but related projects, which explore the analysis of tumour heterogeneity at different levels and depth, with a focus on method development.

In [ch:variantcalling] we presented the work we conducted to improve the detection of somatic variants at very low allele frequencies. When multiple samples, separated in time or space, of the same patient were available, we were able to improve the detection threshold of variants substantially. These low abundance variants are invaluable in a clinical setting, where they can indicate an arising resistance mechanism, or relapse of disease. With the improved sensitivity of our method, treatment of patients can be adjusted earlier and more accurately. With the increase in multi-region analysis in cancer research, several research projects are already using the joint somatic variant calling approach developed. In the future these concepts could be adjusted for the use in single cell sequencing and to employ evolutionary modelling to allow usage of priors for the spatial or temporal distance of samples which are analysed jointly.

[ch:cascade] illustrates the in-depth analysis of resistance mechanisms and evolutionary history of five lung cancer patients enrolled in the CASCADE autopsy program. Using the joint somatic variant calling from [ch:variantcalling] as a basis, we explored various ways to describe and visualise the disease trajectory of each patient from diagnosis till death. Additionally to the variants, we used copy number analysis and structural variants to contrast and compare each sample within a patient to generate phylogenies to visualise the evolutionary distances and to generate a pseudo time scale for the timing of mutations. In order to further clarify the grouping and distances of samples, we implemented a distance measure based on mitochondrial variants. With this method we could assess the effect of selection pressure on established variants and their timing in the nuclear variant derived phylogenies. This work has already lead to two publications, describing two resistance mechanisms: a novel resistance mechanism to a RET-fusion targeted treatment in patient CA-A (Solomon et al. 2020), and the mechanism of small cell transformation in patient CA-L (Burr et al. 2019). Across the cohort with unsurprisingly found few unifying features apart from loss of heterozygosity and large scale genomic amplification. Clear genomic determinants of treatment resistance were identified for the three NSCLC cases which did not show phenotypic switching. The diversity of these genomic mechanisms were profoundly highlighting the true extend of inter patient heterogeneity. In contrast to the NSCLC cases, the two cases with small cell transformation showed distinct evolutionary trajectories , with similarity in their phylogenies, both nuclear and mitochondrial, suggesting initial shared evolution with high private mutations and no clear genetic discriminant for the small cell transformation. Additionally, the small cell transformations did not exhibit the previously hypothesised genomic hallmarks of TP53 and RB1 loss. These findings moved to focus on the importance of characterising non-genomic evolution in parallel to genomic changes in the study of treatment resistance. To fully explore the heterogeneity of disease in these patients and generate a complete landscape, further RNA and epigenetic profiling will be necessary to shed light on the mechanism leading to small cell phenotypic switching.

With [ch:mmf] we explored the avenue of measuring tumour evolution over time through the use of ctDNA, as it is often not feasible to serially biopsy a patient during treatment. We tailored a method to be fully tumour agnostic which can be easily be applied to the clinical setting by using low coverage whole genome sequencing, which has already been used for diagnostics in some cancers. The method uses highly specialised filtering steps to eliminate the background noise from the “normal contamination“ and sequencing errors in these data. We showed that the method can accurately detect specific cancer related signatures at low tumour purity and tumour burden in simulated and patient data for melanoma and breast cancer. MisMatchFinder was able to detect signatures, which were not causing clonal fixation, but rather were present in all clones at different levels. This same effect was also observed when we built a machine learning model to predict the presence or absence of cancer in a plasma sample. Tumour samples were accurately classified without their commonly associated mutational signatures and instead showed substantially perturbed mutational patterns. The predictive ability of MisMatchFinder was already comparable to the current gold standard ichorCNA, and there are multiple possible ways to improve MisMatchFinder further. The restriction of the analysis to fragments of a size which enriches for tumour signal showed substantial performance improvements in our data and recent literature suggests a similar effect for fragment start sites and sequence. These additional filters should increase sensitivity and help reduce the amount of false positive classifications in our predictions. Additionally, MisMatchFinder and predictor could be extended to use double base substitutions and InDel signatures, which are known to be very specific for certain cancer types. Lastly, the classifier could be retrained to classify cancer type in addition to cancer presence with additional data from public datasets.

With the introduction of further sequencing platforms and technologies apart from Illumina (Singular Genomics 2021; Ultima Genomics 2022; Element Biosciences Inc. 2022), the ability to generate high quality multi-region/-sample sequencing datasets is more accessible than ever before. These new datasets will require ongoing development of computational tools to analyse the amount of data, learning from the methods already deployed in other fields like population genetics. Ever growing cancer cohort studies will require substantial optimisation and development of methods as can be seen with the rapid development of single cell analysis.

These methods will also require investment of time and money to generate gold standard datasets to further the improvement of algorithms with objective performance measures. As discussed in this thesis, while simulated datasets are ideal to test certain aspects of methods, approaches like “Genome in a Bottle“ are necessary for somatic variant calling methods to continue improving inn the same way that germline variant calling field has evolved (Valle-Inclan et al. 2022).

While somatic variant calling in cancer often has the flair of being “solved“ every improvement in the field leads to further understanding of concepts and biological processes. In particular, the field of multi-sample variant calling analysis is still very young and requires new guidelines and best practices to evolve, as it challenges current knowledge and hypothesis. The contribution of this work to the field was focused on methods to maximise insight into cancer biology in a multitude of cases. We showed that jointly calling somatic variants is superior and able to uncover previously unappreciated genetic variation. Secondly, we contributed to the biomedical field by detecting and describing novel modes of resistance and lastly, we developed a proof of concept approach to detect somatic mutational patterns from plasma samples. Overall, we believe the work done during this PhD has significantly contributed to the knowledge and capabilities in the field of genetic cancer heterogeneity.

About

Source repository for my PhD thesis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages