# Bayesian Phylogenetics in Linguistics explained

This is a Jupyter notebook that shows you a very simplified example of Bayesian phylogenetics in linguistics, step by step, from the initial data to a resulting summary tree.

## Program libraries
The code here is written in Python. Python comes with a lot of libraries to deal with numerical processes, trees, plotting, and so on; I load them here.

In [2]:
%load_ext autoreload
%autoreload 2
import numpy, ete3, newick, pandas as pd
DF = pd.DataFrame
from matplotlib import pyplot as plt
from helpers import roll_die

## Language data

Now let us do some Bayesian phylogenetics, starting from a very simple and constructed example. Let us say we have some data like this.

I use lexical data in my example because it is the data type best understood for computational models. The methodology is not specific to lexical data, as I will explain below: given a different type of data and a corresponding model of language change, the rest of the methodology applies in the same way.


In [3]:
data = DF({
  "Kalang": ['tuˈmun', 'pak', 'ˈrata'],
  "Ta'e": ['ˈana', 'ˈʋula', 'ˈrata'],
  "Sunggama": ['anˈa', 'wuˈla', 'deˈna'],
  "Nda'o": ['ˈana', 'ˈwuɹa', 'ˈn͡dena']},
  index=["child", "moon", "flat"])
data

       Kalang    Nda'o Sunggama   Ta'e
child  tuˈmun     ˈana     anˈa   ˈana
moon      pak    ˈwuɹa    wuˈla  ˈʋula
flat    ˈrata  ˈn͡dena    deˈna  ˈrata

I will not explain automatic cognate coding in this notebook, so let us for now assume that it can be done and that automatic or manual cognate coding gives us the following cognate table corresponding to the one above.


In [4]:
data = DF({
  "Kl": [1, 1, 1],
  "Ta": [2, 2, 1],
  "Sg": [2, 2, 2],
  "Nd": [2, 2, 2]})
data

   Kl  Nd  Sg  Ta
0   1   2   2   2
1   1   2   2   2
2   1   2   2   1
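To make the reading of this table concrete: two languages share a cognate for a meaning exactly when they carry the same number in that row. The following is a minimal sketch, not part of the original notebook, that counts shared cognates for every pair of languages. It copies the table into a plain dictionary; the names `cognates`, `shared_cognates` and `shared` are my own.

```python
from itertools import combinations

# Cognate classes per language, one entry per meaning (child, moon, flat),
# copied from the cognate table above.
cognates = {
    "Kl": [1, 1, 1],
    "Ta": [2, 2, 1],
    "Sg": [2, 2, 2],
    "Nd": [2, 2, 2],
}

def shared_cognates(a, b):
    """Count the meanings for which two languages are in the same cognate class."""
    return sum(x == y for x, y in zip(a, b))

# One entry per unordered pair of languages.
shared = {
    (l1, l2): shared_cognates(cognates[l1], cognates[l2])
    for l1, l2 in combinations(cognates, 2)
}
for pair, n_shared in shared.items():
    print(pair, n_shared)
```

Counts like these are only a descriptive summary of the table; the inference described in this notebook works with explicit stochastic models rather than with such distances.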

## Stochastic processes for creating trees

The mathematics underlying this kind of inference treats the data as generated by a stochastic process. What might a stochastic process of language change look like? Let us imagine the following graphic process of how languages might split up. This process is a bit bold and simple, but it is the best compromise I could imagine between the mathematical intuition behind the structure called the “Yule tree prior” and descriptive population dynamics.

Imagine an island region where navigation between the islands is difficult. At first one island, and later more, supports an isolated language community that has no contact with any other island.

Once in a generation, each inhabited island sends out a boat to try to cross the turbulent seas and settle a new island. Five out of six boats sink in the process, but one in six boats reaches an uninhabited shore and founds a new village there, following the same tradition.

How could we simulate this process?

First, we need a function that describes the turbulence of the seas. Let us use a roll of a six-sided die, like this. Press Ctrl-Enter in the following cell to re-roll the die.


In [12]:
print(roll_die())

⚀


With that, we can model the crossing of the sea by “Roll a die. Iff you roll a ⚀, the crossing is successful and a new village/speech community is founded.” We can now write a function that takes the language tree at a given point in time and propagates it by one generation by rolling the die for each village. For now, I shall use “v” for the ancestral village, “•” for descendants of a village that stayed and “Δ” for successful sailors.


In [156]:
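def extend_by_one_generation(tree, success="⚀"):
    """Advance every current village (leaf) of the tree by one generation."""
    for village in tree.get_leaves():
        if roll_die() in success:
            # The crossing succeeded: the village splits into those who
            # stayed (•) and the successful sailors (Δ).
            village.add_descendant(newick.Node("{:}•".format(village.name), "1"))
            village.add_descendant(newick.Node("{:}Δ".format(village.name), "1"))
        else:
            # The boat sank: the village just grows one generation older.
            village.name += "•"
            village.length = village.length + 1
    return tree

def many_generations_later(tree=None, n_generations=7, success="⚀"):
    """Grow a tree for n_generations, starting from one ancestral village "v"."""
    if tree is None:
        tree = newick.Node("v", "1")
    for i in range(n_generations):
        extend_by_one_generation(tree, success)
    return tree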

In [157]:
tree = many_generations_later()
print(tree.ascii_art())

──v•••••••


In [158]:
print(tree.newick)

v•••••••:8.0


This now describes a stochastic process of population spread. A different model might be that the sea is much less dangerous and the crossing succeeds in half of the cases, which we model as a die roll of ⚀⚁⚂.

We still assume that migration is always to a new, uninhabited place, with no contact back. Both assumptions make the mathematical calculations easy enough that I can both show them here and use them in the computer models I run.


In [159]:
print(many_generations_later(success="⚀⚁⚂").ascii_art())

                                 ┌─v•••••••
                      ┌─v••••••──┤
                      │          └─v••••••Δ
           ┌─v•••─────┤
           │          │          ┌─v•••Δ•••
           │          └─v•••Δ••──┤
           │                     └─v•••Δ••Δ
──v••──────┤
           │                     ┌─v••Δ••••
           │          ┌─v••Δ•••──┤
           │          │          └─v••Δ•••Δ
           └─v••Δ•────┤
                      │          ┌─v••Δ•Δ••
                      └─v••Δ•Δ───┤
                                 └─v••Δ•ΔΔ•


If you re-run these notebook cells a few times (Ctrl-Enter), or if you have a good mathematical intuition, you will notice that the second tree usually ends up far bigger than the first tree.
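The intuition can be made explicit. In each generation a village either stays a single village or, with probability p, becomes two, so the expected number of villages is multiplied by (1 + p) per generation. After 7 generations that gives (7/6)^7, roughly 2.9 villages, for the ⚀ model and 1.5^7, roughly 17.1 villages, for the ⚀⚁⚂ model. The following sketch, which is not part of the original notebook, compares this formula with a simulation that only counts villages and does not build the tree. The function names are my own.

```python
import random

def expected_leaves(p_success, n_generations):
    """Expected number of villages when each village splits with probability p_success."""
    return (1 + p_success) ** n_generations

def simulate_leaves(p_success, n_generations, rng):
    """Simulate only the number of villages, without building the tree."""
    villages = 1
    for _ in range(n_generations):
        # Every successful crossing adds exactly one new village.
        villages += sum(rng.random() < p_success for _ in range(villages))
    return villages

rng = random.Random(0)
for label, p in [("⚀", 1 / 6), ("⚀⚁⚂", 1 / 2)]:
    runs = [simulate_leaves(p, 7, rng) for _ in range(10000)]
    print(label, expected_leaves(p, 7), sum(runs) / len(runs))
```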

## Bayes' Theorem

Now, how does this help us reconstruct language trees? The example lets me show you how to use a mathematical property of probabilities, known as “Bayes' Theorem” after the 18th-century scholar Thomas Bayes. The theorem relates conditional probabilities. For our context here, “conditional probability” should be read as “How compatible are two facts?”.

For example, a high probability P(A|B) means that the fact A is very compatible with the fact B. A low ‘marginal’ or ‘prior’ probability P(B) can be read as a quantitative way of phrasing ‘I doubt B’.

So let us assume we are really unsure whether the sea is difficult (⚀⚁⚂) or nigh impossible (⚀) to cross, so we take P(⚀⚁⚂) = P(⚀) = 0.5.

Now we go to the region, and after some research there, we gather the data given above, which convinces us [P(X)=1] of the following two facts.

- There are exactly these four languages spoken in the island region, no more.
- There was a single ancestral village with the tradition described above 7 generations ago.

How compatible is “the sea is difficult to cross” or “the sea is nigh impossible to cross” with this new data?

Bayes' Theorem tells us that

     P(S|D) = P(D|S) * P(S) / P(D)

In this formula, S represents the possible options for the sea, or more generally our model. D is the data, in this case “there are four villages”.
The P(D|S) in that formula is the reason we built the computer model above: This conditional probability is not just a measure of compatibility of beliefs, it is also a repeatable experiment like you might think of when you hear the term “probability”: We can run our two models many, many times and count how often we see D, that is, how often the tree has exactly 4 leaves.



In [164]:
# P(4 languages | 7 generations, ⚀) * P(⚀), with the prior P(⚀) = 0.5
p_one = numpy.mean([len(many_generations_later(n_generations=7, success="⚀").get_leaves())==4 for _ in range(10000)]) * 0.5
p_one

0.05985

In [165]:
# P(4 languages | 7 generations, ⚀⚁⚂) * P(⚀⚁⚂), with the prior P(⚀⚁⚂) = 0.5
p_three = numpy.mean([len(many_generations_later(n_generations=7, success="⚀⚁⚂").get_leaves())==4 for _ in range(10000)]) * 0.5
p_three


0.0136
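For a process this simple, the simulated estimates can be cross-checked with an exact calculation. The sketch below is not part of the original notebook. It assumes that only the number of villages matters: with k villages, the number of successful crossings in one generation follows a binomial distribution with k trials. It then tracks the full distribution of village counts from generation to generation. The function name `leaf_distribution` is my own.

```python
from math import comb

def leaf_distribution(p_success, n_generations):
    """Exact distribution of the number of villages after n_generations.

    Returns a dict mapping a number of villages to its probability.
    """
    dist = {1: 1.0}  # we start with the single ancestral village
    for _ in range(n_generations):
        new_dist = {}
        for k, prob_k in dist.items():
            # With k villages, the number of successes is Binomial(k, p_success).
            for s in range(k + 1):
                prob_s = comb(k, s) * p_success**s * (1 - p_success) ** (k - s)
                new_dist[k + s] = new_dist.get(k + s, 0.0) + prob_k * prob_s
        dist = new_dist
    return dist

for label, p in [("⚀", 1 / 6), ("⚀⚁⚂", 1 / 2)]:
    likelihood = leaf_distribution(p, 7)[4]
    print(label, "P(4 villages | 7 generations):", likelihood,
          "times prior 0.5:", 0.5 * likelihood)
```

The values in the last column should land close to the two simulated numbers, up to Monte Carlo noise. For realistic tree and language models such an exact calculation is out of reach, which is why simulation-based approaches are used.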

The two numbers above are P(D|S) * P(S) for each of the two models. We do not actually have P(D), but we can still calculate the P(S|D) values, the “posterior probabilities”, because we know they are probabilities, so their sum must be one.


In [167]:
p_one, p_three = p_one / (p_one + p_three), p_three / (p_one + p_three)
p_one, p_three

(0.8148400272294077, 0.18515997277059223)


After looking at the data, we see that the ⚀ model is much more convincing than the ⚀⚁⚂ model.
Where we were really unsure before, we are now fairly convinced that ⚀ is correct.
We can even put a number on how much better it is: because both models started with the same prior probability, the ratio of their posterior probabilities is the “Bayes Factor” in favour of ⚀, which is


In [168]:
p_one / p_three

4.400735294117647
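The same steps can be wrapped into a small function that also works when the priors are not equal. This is a sketch, not part of the original notebook. The likelihoods are recovered from the simulation outputs above by dividing out the prior of 0.5, and the second, skewed prior is a purely hypothetical choice for illustration.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' Theorem to a finite set of models.

    priors and likelihoods are dicts keyed by model name. P(D) is obtained
    by summing P(D|S) * P(S) over all models.
    """
    joint = {model: likelihoods[model] * priors[model] for model in priors}
    p_data = sum(joint.values())
    return {model: value / p_data for model, value in joint.items()}

# Likelihood estimates: the simulation outputs divided by the prior of 0.5.
likelihoods = {"⚀": 0.05985 / 0.5, "⚀⚁⚂": 0.0136 / 0.5}

equal = posterior({"⚀": 0.5, "⚀⚁⚂": 0.5}, likelihoods)
skewed = posterior({"⚀": 0.1, "⚀⚁⚂": 0.9}, likelihoods)  # hypothetical prior

# The Bayes factor is the ratio of the likelihoods and ignores the priors.
bayes_factor = likelihoods["⚀"] / likelihoods["⚀⚁⚂"]
print(equal, skewed, bayes_factor)
```

The posterior probabilities change with the prior, but the Bayes factor does not: it measures only how much the data shift our belief from one model towards the other.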

## Models of language evolution

Now that we have seen the basic principle of Bayes' Theorem in action, we can turn to language data.

Just like for the split of speaker communities, we need an explicit stochastic model that describes how languages change over time. Let us take a stochastic version of the basic assumption of glottochronology and assume that words are replaced by new, unrelated words at some underlying constant speed.

This is an obvious simplification, chosen for the purposes of illustration. The models actually used in phylogenetic inference are also always vast simplifications, but hopefully at least somewhat more robust than the one presented here.
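As a rough sketch of what such a replacement process could look like in code, consider the function below. It is my own illustration under stated assumptions, not necessarily the implementation used in this notebook: time is counted in generations, each word is replaced independently with a fixed probability per generation, and every replacement starts a brand-new cognate class. The replacement rate of 1/6 and all names here are hypothetical.

```python
import random
from itertools import count

def evolve_words(cognates, n_generations, replacement_rate, new_classes, rng):
    """Evolve a word list along one branch of the tree.

    In every generation, each word is independently replaced, with
    probability replacement_rate, by a word from a brand-new cognate class.
    """
    cognates = list(cognates)  # do not modify the ancestor's word list
    for _ in range(n_generations):
        for i in range(len(cognates)):
            if rng.random() < replacement_rate:
                cognates[i] = next(new_classes)
    return cognates

rng = random.Random(1)
new_classes = count(start=2)  # class 1 is the ancestral form of every meaning
ancestor = [1, 1, 1]          # child, moon, flat

# Two daughter languages evolving independently for 7 generations.
stayed = evolve_words(ancestor, 7, 1 / 6, new_classes, rng)
sailed = evolve_words(ancestor, 7, 1 / 6, new_classes, rng)
print(stayed, sailed)
```

Wherever both daughter languages still show class 1, the word has survived on both branches and the two forms are cognate. A replaced word always gets a fresh class number, so unrelated replacements can never look cognate by accident.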


