## Discrete states of DNA

In [None]:
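# A, G, C, T: transitions and transversions. Note that a "transition" in the biological sense (a purine-to-purine or pyrimidine-to-pyrimidine substitution) is different from a "transition" in the probabilistic sense (any change of state in a Markov chain).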

In [307]:
import toyplot
import itertools
import random

edges = [i for i in itertools.permutations('ACGT', r=2)]
transitions = {("A","G"),("C","T")}
colors = []
for e in edges:
    sorted_e = tuple(sorted(list(e)))
    if sorted_e in transitions:
        colors.append("blue")
    else:
        colors.append("red")

        
canvas, axes, graph = toyplot.graph(
        [i[0] for i in edges],
        [i[1] for i in edges],
        ecolor=colors,
        ewidth=5,
        eopacity=0.35,
        width=350,
        height=350,
        margin=0,
        tmarker=">", 
        vsize=50,
        vcoordinates=[(-1,1),(-1,-1),(1,1), (1,-1)],
        vstyle={"stroke": "black", "stroke-width": 2, "fill": "none"},
        vlstyle={"font-size": "20px"},
        layout=toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges())
)

f_style = {"font-size": "14px"} 
canvas.text(175,40, "Transitions", style=f_style)
canvas.text(175,310, "Transitions", style=f_style)
canvas.text(40,175, "Transversions", angle=90, style=f_style)
canvas.text(175,175, "Transversions", style=f_style)
canvas.text(310,175, "Transversions", angle=270, style=f_style);
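
The coloring rule used for the edges above can also be written as a small helper. This is only an illustrative sketch: `classify_substitution` is a name introduced here, not part of the original notebook, and it assumes the usual grouping of purines (A, G) and pyrimidines (C, T).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(base_from, base_to):
    """Return 'transition', 'transversion', or 'identical' for a pair of bases."""
    if base_from == base_to:
        return "identical"
    pair = {base_from, base_to}
    # a transition stays within the same chemical class (purine or pyrimidine)
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```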

These discrete states can be modelled with Markov chains.

## Markov chains

<div class="markov">
  <iframe width="800" height="500" src="https://setosa.io/markov/index.html#%7B%22tm%22%3A%5B%5B0.5%2C0.2%2C0.1%2C0.2%5D%2C%5B0.1%2C0.5%2C0.3%2C0.1%5D%2C%5B0.05%2C0.2%2C0.7%2C0.05%5D%2C%5B0%2C0.05%2C0.05%2C0.9%5D%5D%7D"></iframe>
</div>

- parts of a Markov chain
- property (memoryless)
- stationary distribution, also called the equilibrium state or simply π


A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.


Markov property: the memoryless property of a stochastic process.

Graph

To process:


For some authors, "Markov chain" and "Markov process" are synonyms.


A next-state stochastic process variable, dependent only on 
the current state, with the property of "forgetting" all states 
before, has the Markov property.


Markov process
A continuous stochastic process with the Markov property is 
called a Markov process. The probability of change from 
one state to the subsequent state is governed by a 
Markov propagator.


Markov chain
A discrete process with the Markov property is called a 
Markov chain. The probability of change from one state to 
the subsequent state is called the transition probability.

In [None]:
# A graph like the previous one can be represented as a matrix like the following

In [2]:
import numpy as np

In [308]:
                               #  A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]   # T
                             ])
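
Each row of a transition matrix is the probability distribution of the next state given the current one, so every row has to sum to 1 and no entry can be negative. A quick sanity check, as a minimal sketch (the helper name `is_row_stochastic` is introduced here for illustration):

```python
import numpy as np

def is_row_stochastic(matrix, tol=1e-9):
    """Check that all entries are non-negative and every row sums to 1."""
    matrix = np.asarray(matrix, dtype=float)
    return bool(np.all(matrix >= 0) and np.allclose(matrix.sum(axis=1), 1.0, atol=tol))

                               #  A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]   # T
                             ])

print(is_row_stochastic(transition_matrix))
```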

Using the transition probabilities in this matrix, we can do a random walk, moving from one state to another for a given number of steps (time).

In [12]:
def print_random_walk(states, steps, initial_state, transition_matrix):
    current_state = initial_state
    result_sequence = [current_state]

    step = 1
    while step < steps:
        previous_state = current_state
        previous_idx = states.index(previous_state)
        
        # Choose one random base using the probabilities in the matrix row of the previous state
        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        result_sequence.append(current_state)
        step += 1
    
    return "".join(result_sequence)

In [309]:
sequence = print_random_walk(states="ACGT", 
                  steps=100, 
                  initial_state="G", 
                  transition_matrix=transition_matrix)

sequence

'GGGGGGTTCCCACCTTTTTTTTCCGGGGGGGTTTTTTCCCGGGGGGACCCTTTTTTTTTTTTTTGGGGGGGGAGGCCTTTTTTTTTTTTCCGGGCCCTTT'

Notice that this random walk generated some patterns in the sequence. These patterns are shaped by the transition probabilities of the Markov model because, as mentioned before, the probability of a given future state depends on the previous state.

Even though the walk is random, some of its properties are predictable. We can predict, for example, how many T's are expected in a sequence given a set of transition probabilities.

That predictability is captured by the stationary distribution, or equilibrium, of the Markov model.


### Stationary distributions or Equilibrium

This is commonly denoted π (pi).
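
Stated as an equation, for a chain with transition matrix $P$ the stationary distribution is the row vector $\pi$ that is left unchanged by one more step of the chain:

$$\pi P = \pi, \qquad \sum_{i} \pi_i = 1, \qquad \pi_i \ge 0$$

For the DNA example, $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$ gives the long-run frequency of each base.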

In [310]:
# we can see how the distribution of each base converges and stops changing with time (steps)
sequence = print_random_walk(states="ACGT", 
                  steps=1000, 
                  initial_state="G", 
                  transition_matrix=transition_matrix)
sequence

'GGTCTTGGGGGGGCCCCCCCGGGACTTTTCCAAAGAAAAGGGGGTTTTTTTTTTTTTTTTTGCATTTTCCCAATTTTTTTTTTTTTCCCAAAAAGGCCCCGGATTCCGAAAAAACCTTTTTTTTTTTTTTTTTTTTTTTTTTGCCACCCCCGGCGGGCGGTTTTTTTTTTTTTTGCCCCGCTTTTTTTTTTTTTTCTTTTTTTTTTTGGGGCGGGGGGCCGGGGGCATTTTTTTTTTCCGGGCGCCAAAGGGGGCGGGGCCCCGGCTTTTTTTTTTTTTTTTTTTTGCGATTTTTGGCCAATTTGGGCGGAGACCCCCTTTTTTTTTGGGGCCCGGGCCATTTTTGGGGGGGGCGGGGGCACAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGTTTTTTGGCCCCAGCCCTTTGGGGTTTTTTTTTTTCGGGGGGGCCCGCCGGGGGGGGCCGCTTTTTTTTCTCCTTTTTTTTTCAATTTTCCCAACCGGGATCCCGTTTCCGGCAAACGGGGTTTTCCCCCCCCCGGCTTGATCCCCTTTGGCCCTTTTTCCGGGGGGGTTTTTTTTGCCGGGGGGGCCACGGGCCCCCCAACGGCCCACCCCCCTTTTTTTTTTCCCGGGCTTTTTTTTTTTTTTGCCCTTTTTTTTTTCCCGGGGCCCGGGACCGCCGGGGCCCCCCTCTTTTTTTTCAACCCGTTTTTTTTTTTTTTTTTTTTTTTTCCGCAGCCAATTTTTTTTTTTTTTTTTTTTTTTTTGGGCCCGGCGAAATTTTTTTTTTTTTTTTTTTCATTGGGGCCCTTTTTTTCCCGGGCCCGGGGGCCCCGGTTTTTTTTGGGGGAAACAAATTGAAATTTTTTTTTTTTTTTTTTTTTTTCCGGGGGGGGGGGGGGCCCCTTTGCCGGGGCGCCCGGGGGCTTTTTTTTCTTTTTGGGCCGACCCGCGGGCCAATTCCCTTGGGGGGCCAATTTTTCGGGGGGGCACCCCCGGGGGCG

In [311]:
empiric_frequencies = {}
for base in "ACGT":
    empiric_frequencies[base] = sequence.count(base) / len(sequence)
    
empiric_frequencies  

{'A': 0.079, 'C': 0.233, 'G': 0.253, 'T': 0.435}

Let me illustrate that with a plot.

In [183]:
import toyplot

def plot_convergency(states, steps, initial_state, transition_matrix):

    current_state = initial_state
    sequence = [current_state]
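    empiric_frequencies = np.zeros((int(steps), len(states)))
    # at step 0 the only observed state is the initial one
    empiric_frequencies[0, states.index(initial_state)] = 1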

    step = 1
    while step < steps:
        previous_state = current_state
        previous_idx = states.index(previous_state)

        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        sequence.append(current_state)

        step_freq = []
        for base in states:
            step_freq.append(sequence.count(base) / len(sequence))

        empiric_frequencies[step] = step_freq
        step += 1


    label_style = {"text-anchor":"start", "-toyplot-anchor-shift":"5px"}
    canvas, axes, mark = toyplot.plot(empiric_frequencies)

    for i in states:
        axes.text(steps, empiric_frequencies[-1,states.index(i)], i, style=label_style)
    
    return canvas, axes, mark

In [312]:
plot_convergency(states="ACGT", initial_state="A", steps= 10000, transition_matrix=transition_matrix);

We can find this stationary distribution using different approaches. Two very simple ones are Monte Carlo simulation and repeated matrix multiplication:


In [162]:
def get_stationary_dist_monte_carlo(states, steps, initial_state, transition_matrix):
    # Create empty array to put the frequency of a given state
    frequencies = np.zeros(len(states)) 
    
    # do first step
    initial_idx = states.index(initial_state)
    frequencies[initial_idx] = 1
    previous_idx = initial_idx
    
    # do other steps
    step = 1
    while step < steps:
        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        current_idx = states.index(current_state)
        frequencies[current_idx] += 1
        previous_idx = current_idx
        step += 1
    
    stationary_distribution = frequencies / steps   # π
    return stationary_distribution


In [163]:
get_stationary_dist_monte_carlo(states="ACGT", 
                          steps=1e6, 
                          initial_state="A", 
                          transition_matrix=transition_matrix)

array([0.065744, 0.185359, 0.284671, 0.464226])

In [165]:
def get_stationary_dist_matrix_multiplication(states, iterations, transition_matrix):
    """Repeated matrix multiplication method"""
    current_matrix = transition_matrix
    
    iteration = 1
    while iteration < iterations:
        # Calculate the Matrix product of two arrays
        current_matrix = np.matmul(current_matrix, transition_matrix)
        iteration += 1
    
    stationary_distribution = current_matrix[0]
    return stationary_distribution

In [313]:
get_stationary_dist_matrix_multiplication("ACGT", iterations=100, transition_matrix=transition_matrix)

array([0.06593407, 0.18681319, 0.28571429, 0.46153846])
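
A third option, shown here only as a sketch, is to solve $\pi P = \pi$ directly: $\pi$ is the left eigenvector of the transition matrix with eigenvalue 1, normalized to sum to 1. The function name `get_stationary_dist_eigen` is introduced here for illustration, and the approach assumes the chain has a single stationary distribution.

```python
import numpy as np

def get_stationary_dist_eigen(transition_matrix):
    """Stationary distribution as the left eigenvector with eigenvalue 1."""
    # left eigenvectors of P are the right eigenvectors of P transposed
    eigenvalues, eigenvectors = np.linalg.eig(np.asarray(transition_matrix, dtype=float).T)
    # pick the eigenvector whose eigenvalue is closest to 1
    idx = np.argmin(np.abs(eigenvalues - 1))
    vector = np.real(eigenvectors[:, idx])
    # rescale so the entries sum to 1 (this also fixes the arbitrary sign)
    return vector / vector.sum()

                               #  A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]   # T
                             ])

pi = get_stationary_dist_eigen(transition_matrix)
print(pi)
# the defining property: one more step leaves the distribution unchanged
print(np.allclose(pi @ transition_matrix, pi))
```

The result should agree with the repeated-multiplication output above, without needing to choose a number of iterations.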

## Homology in alignments

In [None]:
# reminder about homology in molecular world

<html>
<head>
</head>
<body>
	<table>
		<thead>
			<tr>
				<th>Ind.</th>
				<th>N1</th>
				<th>N2</th>
				<th>N3</th>
				<th>N4</th>
				<th>N5</th>
				<th>N6</th>
				<th>N7</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>Ind. 1</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#FC8D62;">C</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
			</tr>
			<tr>
				<td>Ind. 2</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
			</tr>
			<tr>
				<td>Ind. 3</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
			</tr>
			<tr>
				<td>Ind. 4</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
			</tr>          
		</tbody>
	</table>
</body>
</html>

The distance calculated between two sequences is usually highly underestimated, especially because of superimposed substitutions.


Superimposed substitutions ("multiple hits"): the occurrence of two or more substitutions at the same site.

<html>
<head>
</head>
<body>
	<table>
		<thead>
			<tr>
				<th>Time</th>
				<th>N1</th>
				<th>N2</th>
				<th>N3</th>
				<th>N4</th>
				<th>N5</th>
				<th>N6</th>
				<th>N7</th>
				<th>N8</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>1</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#FC8D62;">C</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
			</tr>
			<tr>
				<td>2</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#FC8D62;">C</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
			</tr>
			<tr>
				<td>3</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
			</tr>
			<tr>
				<td>4</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
			</tr>
			<tr>
				<td>5</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
			</tr>
		</tbody>
	</table>
    
    
</body>
</html>

Critical to models of nucleotide evolution is the realization that because there are only four possible character states, it is expected that as genetic distance increases, some sites will undergo multiple superimposed substitutions.

In this case, some sites that have undergone change will have reverted to the state they were originally in.
To reliably reconstruct evolutionary relationships among divergent sequences, this expected reversion must be taken into account.
Simple measures of distance that do not take multiple substitutions into account are said to be uncorrected. Corrected distances use one of several models of sequence evolution to estimate the number of sites that have undergone multiple substitutions.
Uncorrected methods tend to underestimate genetic distance.



<html>
<head>
</head>
<body>
	<table>
		<thead>
			<tr>
				<th>Species</th>
				<th>N1</th>
				<th>N2</th>
				<th>N3</th>
				<th>N4</th>
				<th>N5</th>
				<th>N6</th>
				<th>N7</th>
				<th>N8</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>Sp. 1</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#FC8D62;">C</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#E78AC3;">T</td>
			</tr>
			<tr>
				<td>Sp. 2</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#E78AC3;">T</td>
				<td style="background-color:#66C2A5;">A</td>
				<td style="background-color:#8DA0CB;">G</td>
				<td style="background-color:#66C2A5;">A</td>
			</tr>
		</tbody>
	</table>
    
    
</body>
</html>
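
To make the underestimation concrete, the sketch below replays the history from the time table (times 1 to 5) and compares the number of substitutions that actually happened with the number of differences visible between the first and last sequences, which are the two species in the table above. The variable and function names are chosen here for illustration.

```python
# States of the same sequence at times 1..5, copied from the time table
history = ["AGATTCGT",
           "AGATTCGT",
           "AGGTTGGT",
           "AGGTTAGT",
           "AGATTAGA"]

def count_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

# substitutions that actually occurred: sum of changes between consecutive times
actual_substitutions = sum(count_differences(a, b) for a, b in zip(history, history[1:]))
# differences we can observe when only the first and last sequences are available
observed_differences = count_differences(history[0], history[-1])

print(actual_substitutions, observed_differences)  # 5 2
```

Five substitutions occurred, but only two differences remain visible: site N3 reverted to its original state and site N6 changed twice.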

Models are important because they allow us to correct, to some extent, the bias of underestimating differences when we calculate distances between sequences from observed data.

show plot of expected vs observed. Here is where substitution models are important.
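
As a rough sketch of that comparison, assuming the Jukes-Cantor model described later in this notebook, the expected proportion of observed differences $d$ after an actual distance of $D$ substitutions per site is $d = \frac{3}{4}\left(1 - e^{-4D/3}\right)$. The code below tabulates both quantities with NumPy only (no plot); `expected_observed_jc` is a name introduced here.

```python
import numpy as np

def expected_observed_jc(actual_distance):
    """Expected proportion of differing sites under Jukes-Cantor
    for a given actual distance (substitutions per site)."""
    actual_distance = np.asarray(actual_distance, dtype=float)
    return 0.75 * (1 - np.exp(-4 / 3 * actual_distance))

actual = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0])
observed = expected_observed_jc(actual)

for D, d in zip(actual, observed):
    print(f"actual = {D:.1f}   expected observed = {d:.3f}")
```

The observed value grows more slowly than the actual distance and levels off below 0.75, which is why uncorrected distances underestimate divergence more severely as sequences get more distant.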



## Substitution models

Assume that the sequences evolved under a certain model: for example, that all sites change with the same probability, or that some sites are more likely to change than others. Based on these assumptions, we can compare the observed pattern against the expectations predicted by each model and correct the distance.

Substitution models are Markov models.



They have two parts:
- Stationary distribution
- Transition matrix


Based on these we can derive a formula to correct the distance bias, usually based on some known distribution, for example the Poisson distribution.
This distribution is convenient in some cases because it fits well where events are independent, continuous in time, and rare.

### Jukes-Cantor model

Basic model, proposed by Jukes and Cantor (1969).
It has two assumptions:
- There is only one substitution rate, and it is equal for all bases
- The frequencies of all bases are equal (in the stationary distribution)

This model is also known as a one-parameter model, because it needs only one parameter.

In [399]:
# Define the only parameter
rate = 0.333  # α (rate of substitution) 

In [387]:
# Transition matrix for Jukes-Cantor model
                 #  A        C     G     T
jc_tm = np.array([[1-3*rate, rate, rate, rate],  # A
                  [rate, 1-3*rate, rate, rate],  # C
                  [rate, rate, 1-3*rate, rate],  # G
                  [rate, rate, rate, 1-3*rate]   # T
                 ])

In [388]:
jc_tm

array([[0.001, 0.333, 0.333, 0.333],
       [0.333, 0.001, 0.333, 0.333],
       [0.333, 0.333, 0.001, 0.333],
       [0.333, 0.333, 0.333, 0.001]])

In [389]:
get_stationary_dist_matrix_multiplication("ACGT", iterations=100, transition_matrix=jc_tm)

array([0.25, 0.25, 0.25, 0.25])

In [397]:
plot_convergency(states="ACGT", initial_state="C", steps= 1000, transition_matrix=jc_tm);

The formula to correct the distance is:
$$D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$$

Where ***D*** is the corrected distance and ***d*** is the observed distance between two sequences (the proportion of sites that differ).
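
The correction is obtained by inverting the Jukes-Cantor expectation for the proportion of differing sites. Under this model a true distance $D$ is expected to leave an observed distance

$$d = \frac{3}{4}\left(1 - e^{-\frac{4}{3}D}\right)$$

Rearranging step by step:

$$1 - \frac{4}{3}d = e^{-\frac{4}{3}D} \quad\Longrightarrow\quad \ln\left(1 - \frac{4}{3}d\right) = -\frac{4}{3}D \quad\Longrightarrow\quad D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$$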

In [516]:
def get_distances_jc(sequences, pair=(0,1)):
    # transform list of strings into arrays (useful to connect with the random walk function)
    # and convert sequences into ASCII values for easy diff calculation
    sequences_array = np.asarray([[ord(base) for base in sequence] for sequence in sequences])
    
    # simplify the alignment: keep only the unique site patterns (columns)
    # together with the number of sites that show each pattern
    patterns, pattern_frequencies = np.unique(sequences_array, axis=1, return_counts=True)
    
    # get differences per pattern (non-zero means the two sequences differ)
    differences = patterns[pair[0]] - patterns[pair[1]]
    # calculate the distance: proportion of sites that differ, weighting each
    # differing pattern by the number of sites where it occurs
    distance = pattern_frequencies[differences != 0].sum() / sequences_array.shape[1]
    
    # correct distance using jc model
    corrected_distance = -3/4 * np.log(1 - 4/3 * distance)
    
    return corrected_distance, distance

In [521]:
# Simple definition of sequences
sequences = ["GCTTCTGATTAACCTGCT",
             "GCTTCTGATTTCTCTGCC",
]

In [522]:
get_distances_jc(sequences, (0, 1))

(0.2635484151284164, 0.2222222222222222)
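
The correction is only defined while the observed distance stays below 3/4. At or above that value the logarithm has no real result; this is the saturation point, where the two sequences look as different as two random sequences would. A guarded variant, as a sketch (the name `jc_correct` and the choice of returning infinity at saturation are assumptions made here, not part of the notebook):

```python
import math

def jc_correct(observed_distance):
    """Jukes-Cantor corrected distance; returns infinity at saturation."""
    if not 0 <= observed_distance <= 1:
        raise ValueError("observed distance must be a proportion between 0 and 1")
    if observed_distance >= 0.75:
        # the logarithm is undefined here: the distance cannot be estimated
        return math.inf
    return -0.75 * math.log(1 - 4 / 3 * observed_distance)

print(jc_correct(4 / 18))   # the pair above: 4 differences in 18 sites
print(jc_correct(0.80))     # inf
```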

Let's generate a sequence under the Jukes-Cantor model.

In [523]:
sequence = print_random_walk(states="ACGT", 
                  steps=1000, 
                  initial_state="C", 
                  transition_matrix=jc_tm)
sequence

'CTATAGCTCTGCGAGATGCATGTACGCTACATCTCTCTATAGCGCTAGATATCTGTACACATGATCTCTGCGCATACACTGACAGCTCTCGTGCAGCGCTCTGATGTATCATGTCGAGAGTGCGCTCTGACGACACTAGCGCACTAGTGTGCGCGCATGCATGAGTATCTCACTCTGTGATACGCAGTCATACTCACGACTCTGCGAGTATCTATGACGTATGTCTGTATCTCACATCTACACGCTATGATATGACTATGATACAGCGACTAGTCTGCTACGCGTGTATGTGCTAGAGCGCACGTCACTGTACGTACTCATCTCTCATGACATCACACAGTCTGAGTACTCTCATCATGAGCTATAGCTGCTAGCAGTACTGATCACGATGAGCAGCATCGTATAGCAGCATACTGACAGTGATGATATCACATGTCATCGTCGACAGCGATAGCTAGACATGTACTGCTCGAGAGCATATAGCTCAGCATATGCATGCTATCGATCATAGTCACACGCGCTACAGTGACAGAGACGTCTCAGCAGTGATCGCACTCTGACTCAGCTCAGAGCTGCTCGACATCAGTGTACGTGCAGTCAGAGCACGTATACTCATGCGCACACGATATGATGTCTACTCACGATCGCGTGATACATGTGACTCGCGTGATCGCGATGCGATGTGTGCATCGTGATCAGATAGTCAGTATAGCAGTCGCAGTAGCAGTGCGCATCGATGATGATAGCTACACGTCAGCTCGCAGTGCACTCAGCTACTACTATGCTCATATATAGCTCTACACGATAGCATCGACACTGAGTCGTGCTCACGTACTGACGTCTAGCAGAGCGCTCTACGCTACTGTGTATGTACGCTCTGACAGTGTATGACGACGCTACATACGCTCACACGCATGTGTAGACTCACGCGCTGCTATCACGCGACGTCTCATGAGTGTGTCTCACGCAGATACTGAGACTGACGACAGCTACGTCGCA

Now introduce a random mutation 

In [None]:
for i in range(100):
    sequence
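
One possible way to complete this step, given only as a sketch under stated assumptions: each mutation picks a random site and replaces its base with one of the three other bases, chosen uniformly, as the Jukes-Cantor model assumes. The function name `mutate_sequence`, the seed, and the short example sequence are all introduced here for illustration.

```python
import numpy as np

def mutate_sequence(sequence, n_mutations, states="ACGT", rng=None):
    """Apply n_mutations random substitutions; sites may be hit more than once."""
    rng = np.random.default_rng() if rng is None else rng
    bases = list(sequence)
    for _ in range(n_mutations):
        site = rng.integers(len(bases))
        # choose uniformly among the three bases different from the current one
        alternatives = [b for b in states if b != bases[site]]
        bases[site] = alternatives[rng.integers(len(alternatives))]
    return "".join(bases)

rng = np.random.default_rng(42)           # fixed seed, only for reproducibility
original = "GCTTCTGATTAACCTGCT" * 10      # short illustrative sequence (180 sites)
mutated = mutate_sequence(original, n_mutations=100, rng=rng)

observed = sum(a != b for a, b in zip(original, mutated))
print(f"substitutions applied: 100, differences observed: {observed}")
```

Because some sites are hit more than once, the number of observed differences can never exceed the number of substitutions applied, and it is usually smaller.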

### Kimura two-parameter model

### Gamma

### Invariant sites

# Archive

In [159]:
def get_probability_sequence(sequence, transition_matrix, stationary_distribution):
    initial_state = sequence[0]
    probability = stationary_distribution[initial_state]  # this variant expects dict-like inputs keyed by base
    previous_state = initial_state

    for i in range(1, len(sequence)):
        current_state = sequence[i]
        probability *= transition_matrix[previous_state][current_state]
        previous_state = current_state
        
    return probability

In [159]:
def get_probability_sequence(sequence, transition_matrix, stationary_distribution, states="ACGT"):
    # the probability of the first base comes from the stationary distribution
    previous_idx = states.index(sequence[0])
    probability = stationary_distribution[previous_idx]

    # multiply by the transition probability of every consecutive pair of bases
    for current_state in sequence[1:]:
        current_idx = states.index(current_state)
        probability *= transition_matrix[previous_idx][current_idx]
        previous_idx = current_idx
        
    return probability

In [None]:
get_probability_sequence("ACCCCGGCCCCAAAGGGGGTATATAA", transition_matrix=transition_matrix, stationary_distribution=[0.25,0.25,0.25,0.25])
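
The product of many probabilities quickly becomes too small to represent as a float, so it is common to sum logarithms instead. A minimal sketch, assuming array inputs ordered as `states` (the function name `get_log_probability_sequence` is introduced here for illustration):

```python
import numpy as np

def get_log_probability_sequence(sequence, transition_matrix, stationary_distribution, states="ACGT"):
    """Log-probability of a sequence under a first-order Markov chain."""
    idx = [states.index(base) for base in sequence]
    with np.errstate(divide="ignore"):   # log(0) becomes -inf for impossible steps
        log_pi = np.log(np.asarray(stationary_distribution, dtype=float))
        log_tm = np.log(np.asarray(transition_matrix, dtype=float))
    # first base from the initial distribution, then one term per consecutive pair
    log_probability = log_pi[idx[0]]
    for previous, current in zip(idx, idx[1:]):
        log_probability += log_tm[previous, current]
    return log_probability

                               #  A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]   # T
                             ])
uniform = [0.25, 0.25, 0.25, 0.25]

log_p_acgt = get_log_probability_sequence("ACGT", transition_matrix, uniform)
log_p_gta = get_log_probability_sequence("GTA", transition_matrix, uniform)
print(log_p_acgt)
print(log_p_gta)   # -inf: the step T -> A has probability 0 in this matrix
```

With this particular matrix, any sequence that contains a T followed by an A has probability 0, because the T-to-A entry is 0.00.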

In [14]:
## comparing our distribution with a totally random sequence

In [None]:
# The base frequencies will be very different if the sequence does not fit our Markov model. For example:

In [5]:
absolutly_random_sequence = "".join(np.random.choice(list("AGCT"), size=1000))
absolutly_random_sequence

'TTCAGTTCGATCGACAATACGTGGCTGTTACGCTTACGCGACGACAGTCTTACACAGGCACAAGTGCGTACGATCTGAGGCGTGGCATACGATCCTCGCGTGGGAGTGTACCGCGACTGATCGCCGCCAATCCTTTTAAGGCAGCGTGAGCTACCATATGACGGCGAAGGGGTAACATTCGTGTCGGACCGGTAATTTTGGATGGAGGACGTTGTCCGGTTTGGATAGTGGGACTTCGGGATGTTTACCACCGAACGTACTGACGAGGTCAGGGGAGTCGACTCAATATCAGAGGCGTACCTGATAAGGTTCCTACTACTCCTCGCGATTCCGGCGCAAAAGGATGTTCTAGGATAATGAATATTCTACCGGCAGCCGTGGACCGGTTGTCCGCATTTAGTAGATAGGGTGAATACATAACATAAGCCTCGTGGGGCCGCCCGGCGCTGAGTCACAGGCAAGGTAACTTAGTAGAGGGCGCGTGTGAACCCTGATCGATCGCGTGTTAGATCCGCCGGCGAGCGATATTCGGGCTAGATCATATGGACGCATTTGTTAACCGTTCGAGTCAGGGCCAAGAAGGTGGATCACATGTCGAGTTGTTTCATTTCGCCGACCCGTCTCAGCCGCAATGGCGAAAGCAAGGTGGTCTCGTGCGAGGCAGCGCCAAAAATCGGGTGGGACGACCCCTGGAACAATGTCGGGATGACATAAAACCAATTTAATGCCATACACGACGGCGCGAAGAGTTTGTGCAGCCCGCAATCGTATAAAAGGCCAGAAGCGTACGGCGTGATCCTGTTTGGTCACTATGCCCCATGCTTAGGGCGCACGCTTTTCCTCACCCGCTGGGAATACGGTATAAGCGCGCGAGCCCCCGCCGCGTTCTTAGGTCCCTAACTTAAGCTATTGCCCCCTGGATGAAGGGCGGTTCGTCACTTCTGTCCGGTCCCCTGGGCGCTTCATCTCCTACAGCATTTGGCTTCATGCGGGATCCCG

In [167]:
empiric_frequencies = {}
for base in "AGCT":
    empiric_frequencies[base] = absolutly_random_sequence.count(base) / len(absolutly_random_sequence)
    
empiric_frequencies  

{'A': 0.223, 'G': 0.291, 'C': 0.257, 'T': 0.229}

In [374]:

from __future__ import division
import math
import numpy as np

def JC69_Q():
    a = 0.01/3
    Q = np.asarray(
    [  -3*a, a, a, a,
       a, -3*a, a, a,
       a, a, -3*a, a,
       a, a, a, -3*a ])
    Q.shape = (4,4)
    return Q
    
def TN93_values():
    pi = [0.2,0.3,0.2,0.3]
    T,C,A,G = pi
    Y = T + C
    R = A + G
    a1 = 2   # Y transitions
    a2 = 1   # R transitions
    b = 0.2    # transversions
    return T,C,A,G,Y,R,a1,a2,b
    
def TN93_Q():
    T,C,A,G,Y,R,a1,a2,b = TN93_values()
    Q = np.asarray(
        [ -(a1*C + b*R), a1*C, b*A, b*G,
          a1*T, -(a1*T + b*R), b*A, b*G,
          b*T, b*C, -(a2*G + b*Y), a2*G,
          b*T, b*C, a2*A, -(a2*A + b*Y) ])
    Q.shape = (4,4)
    return Q
    
def TN93_P(t):
    # order TCAG
    T,C,A,G,Y,R,a1,a2,b = TN93_values()
    e2 = math.e**(-b*t)
    e3 = math.e**(-(R*a2 + Y*b)*t)
    e4 = math.e**(-(Y*a1 + R*b)*t)

    TY,CY,AR,GR = T/Y,C/Y,A/R,G/R
    M = [
        T + TY*R*e2 + CY*e4, C + CY*R*e2 - CY*e4,
        A*(1 - e2), G*(1 - e2),
        T + TY*R*e2 - TY*e4, C + CY*R*e2 + TY*e4,
        A*(1 - e2), G*(1 - e2),
        T*(1 - e2), C*(1 - e2),
        A + AR*Y*e2 + GR*e3, G + GR*Y*e2 - GR*e3,
        T*(1 - e2), C*(1 - e2),
        A + AR*Y*e2 - AR*e3, G + GR*Y*e2 + AR*e3  ]
    P = np.array(M)
    P.shape = (4,4)
    return P
        
def convert(Q,t,debug=False):
    if debug:
        print ('Q', '\n', Q)
    evals, evecs = np.linalg.eig(Q)
    if debug:
        print  ('evals', '\n', evals)
    L = np.diag([math.e**(k*t) for k in evals])
    if debug:
        print  ('L', '\n', L)
    U = evecs
    if debug:
        print  ('U', '\n', U)
    iU = np.linalg.inv(U)
    if debug:
        print  ('iU', '\n', iU)
    P = np.dot(U,np.dot(L,iU))
    if debug:
        print  ('P', '\n', P)
    return P
    
def long_time(P,N=5000):
    M = P
    for i in range(N):
        M = np.dot(P,M)
    return M

In [None]:
convert(JC69_Q(), 10000, debug=True)

Q 
 [[-0.01        0.00333333  0.00333333  0.00333333]
 [ 0.00333333 -0.01        0.00333333  0.00333333]
 [ 0.00333333  0.00333333 -0.01        0.00333333]
 [ 0.00333333  0.00333333  0.00333333 -0.01      ]]
evals 
 [-1.33333333e-02  4.33680869e-19 -1.33333333e-02 -1.33333333e-02]
L 
 [[1.24184982e-58 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 1.24184982e-58 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.24184982e-58]]
U 
 [[-0.8660254   0.5        -0.2747774  -0.12962158]
 [ 0.28867513  0.5         0.7841548   0.40424231]
 [ 0.28867513  0.5         0.04519959 -0.76264153]
 [ 0.28867513  0.5        -0.55457698  0.4880208 ]]
iU 
 [[-0.8660254  -0.01093835  0.42842131  0.44854245]
 [ 0.5         0.5         0.5         0.5       ]
 [-0.          0.77008713 -0.05158606 -0.71850107]
 [-0.          0.36930849 -0.82431531  0.45500682]]
P 
 [[0.25 0.25 0.25 0.25]
 [0.25 0.25 0.

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]])
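
For the Jukes-Cantor rate matrix, the result of `convert` can be checked against a closed form: with off-diagonal rate $a$, $P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-4at}$ and $P_{ij}(t) = \frac{1}{4} - \frac{1}{4}e^{-4at}$ for $i \ne j$. The sketch below is self-contained, so it rebuilds the matrix exponential with an eigendecomposition (`np.linalg.eigh`, which applies here because this Q is symmetric); the function names are introduced here for illustration.

```python
import numpy as np

def jc69_q(a=0.01 / 3):
    """Jukes-Cantor rate matrix with off-diagonal rate a."""
    Q = np.full((4, 4), a)
    np.fill_diagonal(Q, -3 * a)
    return Q

def p_from_eigh(Q, t):
    """P(t) = U exp(Lambda t) U^T for a symmetric rate matrix Q."""
    evals, evecs = np.linalg.eigh(Q)
    return evecs @ np.diag(np.exp(evals * t)) @ evecs.T

def jc69_p_closed_form(t, a=0.01 / 3):
    """Closed-form Jukes-Cantor transition probabilities."""
    e = np.exp(-4 * a * t)
    P = np.full((4, 4), 0.25 - 0.25 * e)
    np.fill_diagonal(P, 0.25 + 0.75 * e)
    return P

for t in (1, 100, 10000):
    numeric = p_from_eigh(jc69_q(), t)
    print(t, np.allclose(numeric, jc69_p_closed_form(t)))
```

For large `t` both versions approach 0.25 in every entry, the stationary distribution of the model.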