# **S**ubstitution **I**nferences  using **M**ath and **B**ayesian **A**pproximation (**SIMBA**)

SIMBA is a modification of the FUBAR analysis. Instead of maximum likelihood, it uses the counting method of Nei and Gojobori (1986) to calculate the number of synonymous and non-synonymous substitutions. Markov Chain Monte Carlo (MCMC) Bayesian inference is then used to estimate the most probable numbers of synonymous and non-synonymous substitutions, from which the substitution rates are calculated.

It is worth noting that different approaches usually reach close values, especially when working with similar codons. Like most evolutionary approaches before it, SIMBA treats each codon as an independent unit undergoing selection, without effects from neighbouring codons or from homologous codons in other sequences. With highly divergent sequences the estimates may be biased; however, because SIMBA is intended for the Spike genes of SARS-CoV-2, this is not expected to be the case.

On close inspection, SIMBA follows the same routine as Fast Unconstrained Bayesian AppRoximation (FUBAR). The difference is that it skips the computationally expensive step of first finding maximum likelihood estimates, which is useful for a generalised tool but unnecessary for a tool that analyses a specific set of similar sequences. Think of SIMBA as a lightweight version of FUBAR specific to the Spike gene of SARS-CoV-2.

The following choices are made to ensure faster inferences:
1. Using a mathematical approach that is shorter than the maximum likelihood approach in FUBAR.
2. Assuming that codons are independent evolutionary units.
3. Excluding weighting, since the tool will be used on very similar amino acid sequences.
4. Basing the MCMC sampler on pymc3.

## Introduction

Synonymous substitutions may be used as a molecular clock for dating the evolutionary time of closely related species.
 
 <p align ='center'>
 <img
    src = 'DNA triplet code.png'
>
</p>

 
The genetic code table indicates that all substitutions at the second nucleotide position of codons result in an amino acid change, whereas a fraction of the nucleotide changes at the first and third positions are synonymous.

Under the assumption of equal nucleotide frequencies and random substitution, this fraction is ~5 % for the first position and ~72 % for the third position.

> $f_i$ = fraction of synonymous changes at the _i_-th position of a given codon (_i_ = 1, 2, 3)

> _s_ = number of synonymous sites

> _n_ = number of non-synonymous sites

The _n_ and _s_ for this codon are then given by:

$$
\begin{align}
s = \sum_{i=1}^{3} f_i \quad \text{and} \quad n = 3 - s
\end{align}
$$

> using the __Leu__ codon TTA as an example,

$f_1 = \frac{1}{3}$ (T $\to$ C)

Using the genetic code table, only 1 of the 3 possible changes at the first position (T $\to$ C) is synonymous.

$f_2 = 0$

$f_3 = \frac{1}{3}$ (A $\to$ G), thus

$s = \frac{1}{3} + 0 + \frac{1}{3} = \frac{2}{3}$

$n = 3 - s = \frac{2}{3} + \frac{3}{3} + \frac{2}{3} = \frac{7}{3}$
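
The per-codon calculation can be checked with a short script. This is a minimal sketch rather than the SIMBA implementation: the helper names (`synonymous_fractions`, `synonymous_sites`) are hypothetical, the standard genetic code is assumed, and changes that would produce a stop codon are left out of the count, which is an assumption made here so that the averages come out close to the ~5 % and ~72 % quoted earlier.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_fractions(codon):
    """Return (f_1, f_2, f_3) for one sense codon."""
    fractions = []
    for i in range(3):
        # All single-base changes at position i
        mutants = [codon[:i] + b + codon[i + 1:] for b in BASES if b != codon[i]]
        # Assumption: changes that create a stop codon are not counted
        mutants = [m for m in mutants if CODON_TABLE[m] != "*"]
        same = sum(CODON_TABLE[m] == CODON_TABLE[codon] for m in mutants)
        fractions.append(Fraction(same, len(mutants)) if mutants else Fraction(0))
    return tuple(fractions)


def synonymous_sites(codon):
    """Return (s, n) for one codon, with n = 3 - s."""
    s = sum(synonymous_fractions(codon))
    return s, 3 - s


# Leu (TTA) example from the text
s, n = synonymous_sites("TTA")
print(f"s = {s}, n = {n}")

# Average f_1 and f_3 over the 61 sense codons
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
avg_f1 = sum(synonymous_fractions(c)[0] for c in sense_codons) / len(sense_codons)
avg_f3 = sum(synonymous_fractions(c)[2] for c in sense_codons) / len(sense_codons)
print(f"average f_1 = {float(avg_f1):.3f}, average f_3 = {float(avg_f3):.3f}")
```

`Fraction` is used so that the result for TTA is exactly 2/3 and 7/3 rather than a rounded decimal.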


### Spike Gene Cutter

So far there is no simple process for cutting and cleaning Spike gene sequences from SARS-CoV-2 whole genome sequences. Users can cut out their genes using this block of code, and should then align the extracted spike genes with an alignment tool of their choice.

In [None]:
from Bio import SeqIO    # We use Biopython's SeqIO parser to load our sequences


def spike_gene_cutter_cleaner(input_file, file_format):
    '''Cut the approximate spike gene out of every genome in input_file and clean the result'''

    # Extraction of the approximate S gene
    spike_genes = []        # All extracted spike genes will be stored in this list

    for seq_record in SeqIO.parse(input_file, file_format):
        spike_cut = seq_record[21500:25500]      # Slicing a SeqRecord keeps its id and description
        spike_genes.append(spike_cut)

    # Next part removes spike genes with a number of Unidentified nucleotides more than 2
    clean_spike_genes = []  # This list will store cleaned spike genes
    kept_sequences = set()  # Sequence strings already kept, used for removing repeat sequences

    for spike_record in spike_genes:
        # A string object of the sequence is used for counting, with all letters in upper case
        string_spike = str(spike_record.seq).upper()

        if string_spike.count('N') > 2:          # Unidentified nucleotides more than 2
            continue
        if string_spike in kept_sequences:       # Removing repeat sequences
            continue

        kept_sequences.add(string_spike)
        clean_spike_genes.append(spike_record)   # spike_record is appended instead of string_spike so it does not lose its record attributes

    # Only the cleaned spike genes are written to file
    SeqIO.write(clean_spike_genes, 'Spike_gene_file.fasta', 'fasta')
    return clean_spike_genes


# The next section asks for the file that holds the SARS-CoV-2 sequences.
input_file = input("Enter full 'input' filename and filetype eg input_file.fasta ")
file_format = input("Enter 'file format' eg fasta ")

spike_gene_cutter_cleaner(input_file, file_format)
print('Please check current folder for the Spike gene sequences which were extracted')


# In case the user prefers to see extracted sequences
print_sample = input('Do you want to view snips of extracted sequences: Y/N ')

if print_sample.upper() == 'Y':
    for spike in SeqIO.parse('Spike_gene_file.fasta', 'fasta'):
        print(f'Length of approximate spike_gene = {len(spike)}')
        print(f'Representative spike sequence = {repr(spike.seq)}')
        print(f'Sequence id = {spike.id}')

elif print_sample.upper() != 'N':
    print("Please enter 'Y' for 'Yes' and 'N' for 'No'")

# User can use alignment tool of choice and it is recommended to align against the GISAID Ref Seq for more accurate
# downstream processes

The next step is to define the codon table which will be used to compare any substitutions to the GISAID reference sequence.

In [None]:
from Bio import SeqIO

ref_seq = SeqIO.read('GISAID Ref Seq.fasta', 'fasta')     # Reading the GISAID reference sequence into memory

# Triplet codes that code for specific amino acids based on the DNA triplet code table
AMINO_ACID_CODONS = {
    'F': ('TTT', 'TTC'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'I': ('ATT', 'ATC', 'ATA'),
    'M': ('ATG',),
    'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'),
    'A': ('GCT', 'GCC', 'GCA', 'GCG'),
    'Y': ('TAT', 'TAC'),
    '*': ('TAA', 'TAG', 'TGA'),
    'H': ('CAT', 'CAC'),
    'Q': ('CAA', 'CAG'),
    'N': ('AAT', 'AAC'),
    'K': ('AAA', 'AAG'),
    'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'),
    'C': ('TGT', 'TGC'),
    'W': ('TGG',),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'G': ('GGT', 'GGC', 'GGA', 'GGG'),
}

# Inverted table with one entry per triplet code, eg CODON_TABLE['ATG'] is 'M'
CODON_TABLE = {codon: amino_acid
               for amino_acid, codons in AMINO_ACID_CODONS.items()
               for codon in codons}


def triplet_code_table(triplet_code):

    '''Return the amino acid coded by a triplet code, based on the DNA triplet code table'''

    # Triplet codes come from codon-aligned sequences, loaded using AlignIO of Biopython.
    # All loaded DNA sequences must be in upper case and in string format
    triplet_code = str(triplet_code).upper()

    if triplet_code in CODON_TABLE:
        return CODON_TABLE[triplet_code]

    elif 'N' in triplet_code:          # Unidentified nucleotide in the codon
        return '?'

    else:
        print('Please check if all bases are A, C, T or G')
        return None

Now that we have defined the codon table, it is time to calculate base substitution frequencies in one codon.

In [None]:
def sum_of_codon_synonymous_substitutions(triplet_code, ref_codon, lower=0, upper=3):
    # ref_codon is the codon at the same position in the reference sequence.
    # Fraction of change denotes the chance that a mutation at a codon position is synonymous,
    # eg changing the first codon nucleotide has a ~5% chance of keeping the same amino acid.
    # The second position is 0 because every change there results in another amino acid.
    fraction_of_change = (0.05, 0, 0.72)      # first, second and third codon position

    sum_of_synonymous_fractions = 0

    # lower represents the lower bound in our sum and upper is the upper bound.
    # For python index, the first value would be 0
    for i in range(lower, upper):
        if triplet_code[i] != ref_codon[i]:       # base differs from the reference at position i
            sum_of_synonymous_fractions += fraction_of_change[i]

    return sum_of_synonymous_fractions            # sum of fraction of synonymous changes over the positions


def sum_of_non_synonymous_substitutions(sum_of_synonymous_fractions):
    sum_of_non_synonymous_fractions = 3 - sum_of_synonymous_fractions   # to find sum of non-synonymous fraction we subtract the sum of 
    return sum_of_non_synonymous_fractions                               #synonymous fraction from 3


## Increasing number of codons

For a DNA sequence of __r__ codons, the total number of synonymous and non-synonymous sites is therefore given by:

$$
\begin{align*}
S = \sum_{j=1}^{r} s_j \quad \text{and} \quad N = 3r - S \\
\text{where } s_j \text{ is the value of } s \text{ for the } j\text{-th codon} \\
\text{and } j \text{ is the position of the codon in a DNA sequence with } r \text{ codons}
\end{align*}
$$


When two sequences are compared, the averages of __S__ and __N__ are used.
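
As a small worked example, take a hypothetical two-codon sequence __TTA__ __GTT__ (chosen for illustration only, $r = 2$). For __TTA__ (_Leu_), $s = \frac{2}{3}$ as shown earlier. For __GTT__ (_Val_), no change at the first or second position is synonymous and every change at the third position is synonymous, so $s = 0 + 0 + 1 = 1$. Therefore:

$$
\begin{align*}
S &= \frac{2}{3} + 1 = \frac{5}{3} \\
N &= 3r - S = 6 - \frac{5}{3} = \frac{13}{3}
\end{align*}
$$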

## Computing nucleotide differences between a pair of homologous sequences

To compute the number of synonymous and non-synonymous nucleotide differences between a pair of homologous sequences, we compare the two sequences codon by codon and count the number of synonymous and non-synonymous nucleotide differences for each pair of codons compared.

When there is only one nucleotide difference, we can immediately decide whether the substitution is synonymous or non-synonymous.
> eg, if the codon pairs compared are __GTT__ (_Val_) and __GTA__ (_Val_), there is one synonymous difference.

We denote by $s_d$ and $n_d$ the number of synonymous and non-synonymous differences per codon, respectively. In the present case, $s_d = 1$ and $n_d = 0$.

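When there are two nucleotide differences between the two codons compared, there are two possible pathways by which the differences could have arisen. For example, in the comparison of __TTT__ and __GTA__, the two pathways are as follows: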

> _Pathway I_:
> > __TTT__ (_Phe_) $\to$ __GTT__ (_Val_) $\to$ __GTA__ (_Val_)

> _Pathway II_:
> > __TTT__ (_Phe_) $\to$ __TTA__ (_Leu_) $\to$ __GTA__ (_Val_)

Pathway I involves one synonymous and one non-synonymous substitution whereas Pathway II involves two non-synonymous  substitutions.

We assume that pathway I and II occur with equal probability.

The $s_d$ and $n_d$ then become 0.5 and 1.5 respectively.

When there are three nucleotide differences between the codons compared, there are six different possible pathways between the codons, and each pathway has three mutation steps. In general, for x nucleotide differences there are x! pathways, each with x mutation steps.

It is now clear that the total number of synonymous and non-synonymous differences can be obtained by summing up these values over all codons i.e.
$$
\begin{align*}
S_d = \sum_{j=1}^{r} s_{dj} \\
N_d = \sum_{j=1}^{r} n_{dj} \\
\text{where} \ s_{dj} \text{\  and } n_{dj} \text{\  are } s_d \text{\ and } n_d \text{ for the j-th codon respectively} \\
\text{and r is the number of codons compared}
\end{align*}
$$
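
The pathway averaging and the sums over codons can be sketched as follows. This is an illustrative sketch, not the SIMBA code: the function names `codon_differences` and `sequence_differences` are hypothetical, the standard genetic code is assumed, both inputs are assumed to be codon-aligned strings with only A, C, G and T, and pathways that pass through a stop codon are skipped (an assumption made here).

```python
from fractions import Fraction
from itertools import permutations, product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_differences(codon_a, codon_b):
    """Return (s_d, n_d) averaged over all mutation pathways from codon_a to codon_b."""
    positions = [i for i in range(3) if codon_a[i] != codon_b[i]]
    counts = []  # one (synonymous, non-synonymous) pair per usable pathway

    # Each ordering of the differing positions is one pathway (x! in total)
    for order in permutations(positions):
        current, syn, nonsyn, usable = codon_a, 0, 0, True
        for i in order:
            step = current[:i] + codon_b[i] + current[i + 1:]
            if CODON_TABLE[step] == "*":      # assumption: skip pathways through a stop codon
                usable = False
                break
            if CODON_TABLE[step] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = step
        if usable:
            counts.append((syn, nonsyn))

    if not counts:
        raise ValueError(f"no pathway without a stop codon between {codon_a} and {codon_b}")
    s_d = Fraction(sum(s for s, _ in counts), len(counts))
    n_d = Fraction(sum(n for _, n in counts), len(counts))
    return s_d, n_d


def sequence_differences(seq_a, seq_b):
    """Return (S_d, N_d), the sums of s_d and n_d over all r codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length and a whole number of codons")
    S_d = N_d = Fraction(0)
    for j in range(0, len(seq_a), 3):
        s_dj, n_dj = codon_differences(seq_a[j:j + 3], seq_b[j:j + 3])
        S_d += s_dj
        N_d += n_dj
    return S_d, N_d


print(codon_differences("GTT", "GTA"))   # one synonymous difference
print(codon_differences("TTT", "GTA"))   # average of Pathway I and Pathway II
```

Codons that contain `N` or gap characters are not handled here and would need to be filtered out before calling these functions.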




## Estimating the proportion of synonymous and non-synonymous differences

We can therefore estimate the proportion of synonymous ($p_s$) and non-synonymous ($p_n$) differences by the following equations:
$$
\begin{align}
p_s = S_d / S \\
p_n = N_d / N
\end{align}
$$

To estimate the number of synonymous substitutions ($d_S$) and non-synonymous substitutions ($d_N$) per site, the following formula developed by _Jukes and Cantor (1969)_ is used:

$$
\begin{align}
d = -\frac{3}{4}\ln\left(1-\frac{4}{3}p\right) \\
\text{where } p \text{ is either } p_s \text{ or } p_n
\end{align}
$$

This method gives only approximate estimates of $d_S$ and $d_N$; it is most accurate when the sequences compared are closely related.
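
The two steps can be written as a few lines of Python. This is a minimal sketch with hypothetical function names (`proportion_of_differences`, `jukes_cantor_distance`), and the numbers in the usage lines are made up for illustration rather than taken from any Spike gene data.

```python
import math


def proportion_of_differences(differences, sites):
    """p = differences / sites, eg p_s = S_d / S or p_n = N_d / N."""
    if sites <= 0:
        raise ValueError("the number of sites must be positive")
    return differences / sites


def jukes_cantor_distance(p):
    """Jukes and Cantor (1969) correction: d = -3/4 * ln(1 - 4/3 * p)."""
    # The logarithm is only defined while 1 - 4p/3 > 0, that is p < 0.75
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Illustrative values only: 3 synonymous differences over 60 synonymous sites
p_s = proportion_of_differences(3, 60)
d_S = jukes_cantor_distance(p_s)
print(f"p_s = {p_s:.3f}, d_S = {d_S:.4f}")
```

For small $p$ the corrected distance stays close to $p$ itself, which is why the estimate behaves well for closely related sequences; as $p$ approaches 0.75 the formula is no longer defined.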