# An introduction to simulations with msprime

We've split this first workshop into three parts:
 
 1. Introduction to tree sequences (Georgia)
 2. Introduction to simulating with `msprime` (Georgia)
 3. Processing simulation results (Jerome)
 
#### Creative Commons Licence
 These materials were designed for SMBE Speciation 2019. They can be re-used, but please say where you got them from!
 
#### Presenter details
 
*Jerome Kelleher* (`jerome.kelleher` at `well.ox.ac.uk`) is a Senior Statistical Programmer at the Big Data Institute and the University of Oxford, UK.

*Georgia Tsambos* (`gtsambos` at `student.unimelb.edu.au`) is a PhD student in statistical genetics at Melbourne Integrative Genomics, which is part of the University of Melbourne, Australia.

# 1. Introduction to tree sequences

This notebook provides a 30-minute introduction to the tree sequence data structure that underlies `msprime`, and shows you how to use basic features of the `tskit` package.

### Things we'll cover in this notebook
 - 1.1 [Why use tree sequences?](#why_use_ts) 
 - 1.2 [Trees](#trees)
 - 1.3 [Tree sequences](#tree_sequences)
 - 1.4 [Table representation](#table_representation)
 - 1.5 [Variation](#variation)

#### Main reference

[1] Kelleher, J., Etheridge, A. M., & McVean, G. (2016). Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology, 12(5), e1004842. https://doi.org/10.1371/journal.pcbi.1004842

<a id='why_use_ts'></a>
## 1.1 Why use tree sequences?

Genetic sequences are BIG and VERY REPETITIVE:

```
   ...GTAACGCGATAAGAGATTAGCCCAAAAACACAGACATGGAAATAGCGTA...
   ...GTAACGCGATAAGAGATTAGCCCAAAAACACAGACATGGAAATAGCGTA...
   ...GTAACGCGATAAGATATTAGCCCAAAAACACAGACATGGAAATAGCGTA...
   ...GTAACGCGATAAGATATTAGCCCAAAAACACAGACATGGAAATAGCGTA...
   ...GTAACGCGATAAGATATTAGCCCAAAAACACAGACATGGAAATAGCGTA...
   ...GTAACGCGATAAGATATTAGCCCAAAAACACAGACATGGTAATAGCGTA...
   ...GTAACGCGATAAGATATTAGCCCAAAAACACAGACATGGTAATAGCGTA...
```
Because of this, you are probably used to storing your data in a compressed format, and decompressing it only when you need to perform analyses or query the data. Doing this can be time-consuming and computationally expensive, however.

### The key idea
Common haplotypes in a sample are often simply a consequence of some common history. So if we know this history (as we always do in simulations!), storing it directly is often more convenient and efficient than storing the raw haplotypes.  
A *tree sequence* is an encoding of the complete genealogy for a sample of chromosomes at each chromosomal location [1].
Tree sequences offer a few benefits to population geneticists compared with traditional genetic file formats:

- They can store large simulated datasets extremely compactly. (Often more than 100 times smaller than VCFs for realistically sized datasets!)

- As they hold rich detail about the history of the sample, many important processes can be observed directly from the tree structure. So a tree sequence is often more informative than raw genotype/haplotype data, even though it is also more compact.

- They can be queried and modified extremely quickly. In later workshops we will see that this enables quick calculation of many important population statistics.

In this first part of our workshop, we'll introduce you to tree sequences, and show you how to extract basic information from tree sequence files with the `tskit` package.
 

<a id='trees'></a>
## 1.2 Trees

At a single nucleotide base, the genealogy of a sample can be represented by a single tree.

The tree consists of *nodes*, which represent the alleles held by different chromosomes in the history of the sample, and *edges*, which represent genealogical relationships between the alleles.

Suppose we have a sample of 4 alleles, each from a different DNA sequence, and we wish to understand their history. We could represent this history with a tree like this one:

<p><img src="pics/simple-tree.png" alt="" width="45%"/></p>
 
The 4 *sample nodes* are those labelled 0 - 3 at the leaves of the tree. The other nodes are *ancestral nodes*: these are the alleles held by individuals that are ancestral to the sample.

The height of the nodes in the tree indicates the age of the node, and an edge joining a pair of nodes is used to indicate that the allele of the lower node is descended from the allele of the upper node.
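
As a rough illustration of this structure, here is a minimal plain-Python sketch of a tree stored as a child-to-parent map together with node ages. The node ids, the times and the `ancestors` and `mrca` helpers are made up for this example (this is not the tree in the figure, and it is not how `tskit` stores trees internally).

```python
# Hypothetical toy tree: samples 0, 1, 2 and ancestral nodes 3 and 4.
# The "height" of a node is its age, with samples at time 0.
node_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 2.0}

# Each edge joins a child (lower node) to its parent (upper node).
# The root (node 4) has no parent, so it has no entry here.
parent = {0: 3, 1: 3, 3: 4, 2: 4}


def ancestors(node):
    """Return the path from `node` up to the root, including `node` itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(u, v):
    """Return the most recent common ancestor of nodes `u` and `v`."""
    seen = set(ancestors(u))
    for node in ancestors(v):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


for u, v in [(0, 1), (0, 2)]:
    a = mrca(u, v)
    print(f"MRCA of {u} and {v} is node {a}, at time {node_time[a]}")
```

In `tskit`, a `Tree` object answers the same kinds of questions directly, for example with `tree.parent(u)`, `tree.time(u)` and `tree.mrca(u, v)`.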

<a id='tree_sequences'></a>
## 1.3 Tree sequences

 - [tskit: a toolkit for tree sequences](#tree_sequences:tskit)
 - [Iterating through tree sequences](#tree_sequences:iteration)

The sample history encoded by a tree at a single base will typically also apply to some interval of neighbouring bases. However, due to recombinations in the history of the sample, the genealogy will typically be different at more distant locations on the chromosome, and so must be represented by a different tree. Thus, the history of a sample of sequences can be encoded in a sequence of trees - *a tree sequence*!

<p><img src="pics/tree-sequence.png" alt="" width="70%"/></p>

The endpoints of the intervals are the locations where recombination has occurred in the history of the sample. 

Notice that the adjacent trees look very similar to each other. This makes sense: each recombination should correspond to a single "tree edit" (or "subtree-prune-and-regraft" operation). Because these recombinations are specific to particular lineages, many genealogical relationships are unaffected by a given recombination. This means that topological features are often shared over many neighbouring trees.
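
To make the idea of shared structure concrete, here is a small plain-Python sketch in which every edge carries the genomic interval over which it applies. The edge rows and the `tree_at` helper are hypothetical and chosen only for illustration; they are not the example in the figure.

```python
# Hypothetical toy data: (left, right, parent, child) rows. Each edge applies
# to the half-open genomic interval [left, right).
edges = [
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 5, 4, 3),
    (0, 5, 4, 2),
    (5, 10, 5, 3),
    (5, 10, 5, 2),
]


def tree_at(position):
    """Return the child -> parent map of the tree covering `position`."""
    return {
        child: parent
        for left, right, parent, child in edges
        if left <= position < right
    }


left_tree = tree_at(2)
right_tree = tree_at(7)

# Edges whose child has the same parent in both trees.
shared = {c: p for c, p in left_tree.items() if right_tree.get(c) == p}

print("Tree at position 2:", left_tree)
print("Tree at position 7:", right_tree)
print("Shared edges (child -> parent):", shared)
```

In this toy example the two trees differ only in the node at the root (node 4 on the left, the older node 5 on the right), so the edges joining samples 0 and 1 to node 3 are common to both.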


<a id='tree_sequences:tskit'></a>
### tskit: a toolkit for tree sequences

`tskit` is a Python package with a bunch of useful tools for working with tree sequences. Online documentation for `tskit`, including installation information, can be found [here](https://tskit.readthedocs.io/en/latest/introduction.html).

We'll also need the `SVG` class from `IPython.display` to plot our tree sequences nicely, and the `io` module for an exercise later on.

In [None]:
import tskit
from IPython.display import SVG, display
import io

Tree sequence files can be loaded using the `load` function...

In [None]:
ts = tskit.load("example-1.trees")

... and plotted using the imported `SVG` class, as long as they are fairly simple.

In [None]:
display(SVG(ts.draw_svg()))

Many key properties can be queried directly from the tree sequence object. For example, the total number of nodes and the total number of edges give a rough measure of the 'size' of the tree sequence, which often indicates how quickly calculations on it are likely to run:

In [None]:
ts.num_nodes

In [None]:
ts.num_edges

Trees and tree sequence objects have many useful inbuilt methods and attributes. See the official documentation for [tree sequences](https://tskit.readthedocs.io/en/latest/python-api.html?highlight=SVG#tskit.TreeSequence) and [trees](https://tskit.readthedocs.io/en/latest/python-api.html?highlight=SVG#tskit.Tree) for a fuller description of these.

<a id='tree_sequences:iteration'></a>
### Iterating through the trees

The trees in the tree sequence can be accessed in a few different ways.
If you wish to access each of the trees sequentially, you can use the `trees()` iterator:

In [None]:
for tree in ts.trees():
    print("Tree on interval", tree.interval)
    display(SVG(tree.draw()))
    print()

We can also access the tree that spans a given genomic position:

In [None]:
tree_at_loc6 = ts.at(6)
SVG(tree_at_loc6.draw())

<a id='table_representation'></a>
## 1.4 Table representation

 - [Tables in tskit](#table_representation:tskit)

It turns out that each tree sequence object can be entirely specified by a set of tables. Instead of storing each tree individually, each individual topological feature is stored as a row in a relevant table.

Our example tree sequence can be represented with the following NodeTable and EdgeTable:

<p><img src="pics/tree-sequence-with-tables.png" alt="" width="40%"/></p>
<p><img src="pics/tree-sequence.png" alt="" width="70%"/></p>

There are other tables like Mutations, Sites, Populations, etc. that we will see later in this workshop.

Any topological feature that is common to several trees needs to be recorded only once in the corresponding collection of tables. For instance, all of the trees in our example tree sequence have an edge joining nodes 1 and 5, and this edge is recorded just once in the table.

<p><img src="pics/tables-with-highlights.png" alt="" width="25%"/></p>
<p><img src="pics/tree-sequence-with-highlights.png" alt="" width="70%"/></p>


This *succinctness* is one of the main reasons why the tree sequence format is so compact!
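
The saving can be counted directly. The following plain-Python sketch uses a hypothetical edge table (the `Edge` tuple and its rows are made up for illustration and are not the table in the figures) and compares the number of rows stored with the number of edges we would need if every tree were stored separately.

```python
from collections import namedtuple

# Hypothetical stand-in for a row of an edge table.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

edge_table = [
    Edge(0, 10, 3, 0),  # spans both trees, but is stored only once
    Edge(0, 10, 3, 1),  # spans both trees, but is stored only once
    Edge(0, 5, 4, 3),
    Edge(0, 5, 4, 2),
    Edge(5, 10, 5, 3),
    Edge(5, 10, 5, 2),
]

# Tree boundaries are the distinct left and right coordinates of the edges.
breakpoints = sorted({e.left for e in edge_table} | {e.right for e in edge_table})
intervals = list(zip(breakpoints[:-1], breakpoints[1:]))

# An edge belongs to a tree if it covers the whole of that tree's interval.
edges_per_tree = [
    sum(1 for e in edge_table if e.left <= start and end <= e.right)
    for start, end in intervals
]

print("Tree intervals:", intervals)                      # [(0, 5), (5, 10)]
print("Edges in each tree:", edges_per_tree)             # [4, 4]
print("Edges if trees were stored separately:", sum(edges_per_tree))  # 8
print("Rows in the edge table:", len(edge_table))        # 6
```

With only two trees the saving is small, but it grows as more neighbouring trees share the same edges.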

<a id='table_representation:tskit'></a>
### With tskit: TableCollections

The set of tables representing a `TreeSequence` object is stored in its `tables` attribute as a `TableCollection` object. See the [official documentation](https://tskit.readthedocs.io/en/latest/python-api.html?highlight=SVG#tables) for more details.

In [None]:
tables = ts.tables
print(tables)

In [None]:
print(tables.nodes)

In [None]:
print(tables.edges)

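These collections of tables are the 'guts' of the tree sequence format. If you ever need to modify a tree sequence, you'll have to extract the relevant tables, make changes and convert the tables back into a tree sequence. In `tskit`, `ts.dump_tables()` returns an editable copy of the tables, and the `tree_sequence()` method of a `TableCollection` converts it back into a tree sequence.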

**Exercise:** Can you modify the following Table Collection until the corresponding tree sequence looks like the one in this plot? 

<p><img src="pics/tree-sequence-exercise.png" alt="" width="50%"/></p>

In [None]:
# Modify this code.
nodes_ex = io.StringIO("""\
id      is_sample   population      time
0       1       0               0.0
1       1       0               0.0
2       1       0               0.0
3       1       0               0.0
4       1       0               0.0
5       0       0               1.0
6       0       0               1.5
7       0       0               2.0
8       0       0               3.0
9       0       0               3.5
10      0       0               4.0
""")
edges_ex = io.StringIO("""\
id      left            right           parent  child
0       0.00000000      1.00000000      5       0
1       0.00000000      1.00000000      5       1
2       0.00000000      1.00000000      6       2
3       0.00000000      1.00000000      6       3
4       0.00000000      0.50000000      7       5
5       0.50000000      1.00000000      8       5
6       0.50000000      1.00000000      8       6
7       0.00000000      0.50000000      9       6
8       0.00000000      0.50000000      9       7
9       0.50000000      1.00000000      10      8
""")

# Load the tree sequence.
ts_ex = tskit.load_text(nodes=nodes_ex, edges=edges_ex, strict=False)

# Test by plotting it.
for tree in ts_ex.trees():
    print('Tree on interval:', tree.interval)
    display(SVG(tree.draw()))
    print()

<a id='variation'></a>
## 1.5 Variation

 - [Mutations in tskit](#mutations:tskit)
 
Variation is a consequence of mutations in the history of the sample. Thus, by adding some information about mutations to our tree sequences, we can use them to encode full haplotype data for each of our samples. 

We need to store the genomic location of each variant site, as well as the lineage of the tree affected by the mutation (this corresponds to an edge in the tree sequence). The subsample of haplotypes with the mutated type is simply the subset of sample nodes in the part of the tree that descends from the mutation.

<p><img src="pics/tree-sequence-with-mutations.png" alt="" width="70%"/></p>

This information is encoded with two extra tables: a site table showing the locations of the variant sites, and a mutation table showing the lineage of the tree affected by each mutation.

<p><img src="pics/tables-with-mutations.png" alt="" width="70%"/></p>

The mutation and alleles highlighted in pink correspond to the highlighted rows of the site and mutation tables.

Note that even if we had 400 or 400 000 samples, we would still only need two rows to store information about the alleles held by the samples at this position!
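
Here is a minimal plain-Python sketch of how a single mutation determines the alleles of all the samples. The tree, the mutation and the `carries_mutation` helper are hypothetical, and the sketch assumes a single mutation at the site, with `0` for the ancestral allele and `1` for the mutated one.

```python
# Hypothetical toy tree at the site of interest, as a child -> parent map.
samples = [0, 1, 2]
parent = {0: 3, 1: 3, 3: 4, 2: 4}

# The mutation sits on the branch directly above node 3, so it is inherited
# by node 3 and by everything that descends from it.
mutation_node = 3


def carries_mutation(sample):
    """Walk from `sample` towards the root, looking for the mutated branch."""
    node = sample
    while True:
        if node == mutation_node:
            return True
        if node not in parent:  # reached the root without passing the mutation
            return False
        node = parent[node]


genotypes = [int(carries_mutation(s)) for s in samples]
print(genotypes)  # [1, 1, 0]
```

However many samples descend from node 3, the mutation itself is still described by one site position and one branch. In `tskit`'s mutation table, that branch is identified by the node directly below the mutation.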

<a id='mutations:tskit'></a>
### With tskit: mutations

Here's a tree sequence file corresponding to our example above.
By default, the `draw` method will plot each mutation on the relevant edge of the relevant tree.

In [None]:
ts = tskit.load("example-2.trees")

for tree in ts.trees():
    print("Tree on interval", tree.interval)
    display(SVG(tree.draw(width=500)))
    print()

Information about mutations is stored in a `SiteTable` and a `MutationTable`:

In [None]:
tables = ts.tables
print(tables.sites)

In [None]:
print(tables.mutations)

Information about the variants can be accessed with the `variants()` iterator:

In [None]:
for var in ts.variants():
    print(var.genotypes)

You can also access all genotypes at once (but beware, this can be big!):

In [None]:
ts.genotype_matrix()
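
To see where such a matrix comes from, here is a rough plain-Python sketch that rebuilds one from hypothetical edge, site and mutation rows. It is only a conceptual illustration, not the algorithm `tskit` uses, and it assumes at most one mutation per site. As in `tskit`, each row of the result is a site and each column is a sample.

```python
# Hypothetical toy tables.
samples = [0, 1, 2]
edges = [  # (left, right, parent, child), applying on [left, right)
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 5, 4, 3),
    (0, 5, 4, 2),
    (5, 10, 5, 3),
    (5, 10, 5, 2),
]
sites = [2.0, 7.0]            # position of each site; the index is the site id
mutations = [(0, 3), (1, 2)]  # (site id, node directly below the mutation)


def tree_at(position):
    """Return the child -> parent map of the tree covering `position`."""
    return {c: p for left, right, p, c in edges if left <= position < right}


def genotypes_at(site_id):
    """Return 0/1 alleles for every sample at one site."""
    tree = tree_at(sites[site_id])
    mutated = {node for s, node in mutations if s == site_id}
    row = []
    for sample in samples:
        node, state = sample, 0
        while node is not None:
            if node in mutated:
                state = 1
                break
            node = tree.get(node)  # None once we step above the root
        row.append(state)
    return row


matrix = [genotypes_at(i) for i in range(len(sites))]
for position, row in zip(sites, matrix):
    print(position, row)
```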

**Exercise:** Trees and tree sequence objects have many useful inbuilt methods and attributes. Have a play around with these in the time before the next part of the workshop.

See the official documentation for [tree sequences](https://tskit.readthedocs.io/en/latest/python-api.html?highlight=SVG#tskit.TreeSequence) and [trees](https://tskit.readthedocs.io/en/latest/python-api.html?highlight=SVG#tskit.Tree).