# Gene Mutation

## Introduction

**Gene mutations** are permanent alterations in the DNA sequence that makes up a _gene_, such that the sequence differs from that of its parent.

> **Citation:** Genetics Home Reference. n.d. “What Is a Gene Mutation and How Do Mutations Occur?” Genetics Home Reference. https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation.

In this assignment, we explore this phenomenon in more detail as we develop a Python program that biotechnology companies could use in their genetic research projects.

A _gene_ is the basic physical and functional unit of heredity. DNA bases make up gene strings, and each base takes a value from the set $\{A, C, G, T\}$. Mutations are relatively rare, but there are small probabilities of _inserting_ a new character ($p_{i}$), _deleting_ an existing character ($p_{d}$), or _changing_ a character to a new random one ($p_{c}$).

If a starting _gene string_ reproduces asexually and creates two child strings, any of these mutation processes may occur in each child. Each child may in turn produce two children of its own, which may also have mutated. Repeated over generations, this forms a binary genealogical tree.
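To make the model concrete, here is a minimal simulation sketch. The names `mutate` and `build_family` are hypothetical and not part of the assignment's `helpers` module. It also assumes that each base is treated independently, that deletion and change are mutually exclusive at a position (so $p_{d} + p_{c} \le 1$), and that a change always picks a different base; the text above does not fix any of these details.

```python
import random

BASES = "ACGT"


def mutate(gene, p_i, p_d, p_c, rng):
    """Return a mutated copy of `gene` (assumed model, one decision per base)."""
    child = []
    for base in gene:
        if rng.random() < p_i:
            # Insertion: a random base appears before the current one.
            child.append(rng.choice(BASES))
        r = rng.random()
        if r < p_d:
            # Deletion: the current base is dropped.
            continue
        if r < p_d + p_c:
            # Change: the current base is replaced by a different base.
            base = rng.choice([b for b in BASES if b != base])
        child.append(base)
    return "".join(child)


def build_family(root, generations, p_i, p_d, p_c, seed=0):
    """Return the strings of a full binary tree in level order (root first)."""
    rng = random.Random(seed)
    family = [root]
    # Every node outside the last generation is a parent with two children.
    n_parents = 2 ** generations - 1
    for index in range(n_parents):
        parent = family[index]
        family.append(mutate(parent, p_i, p_d, p_c, rng))
        family.append(mutate(parent, p_i, p_d, p_c, rng))
    return family
```

Calling `build_family(root, 2, p_i, p_d, p_c)` yields 7 strings (1 + 2 + 4), the same population size as the example set used in this assignment.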

Our task is to reconstruct this tree from the 7 given strings. The example we will work with is `Set_Strings`:
``` python
('a', 'AGTTATGTGTCAGAGCAAAAGATTCCTCATCTAGCGGTCGCAAGTCATTGCC'),
('b', 'AAGTTATTTGCTCACAGGGAACGAATCCAGCTCTGCGGTCGAGGCCACATTGCC'),
('c', 'AGTTATTTTCAGAGAAATGATTCCTTCTCACCGGTCGAGCCAGTGCC'),
('d', 'AGTTTATGTGTCAGAAGCAAAAGATTACTAATCTACGCGTCGCAAGGTCTATTCC'),
('e', 'ACAGCTTATATAGCTCATAGGGAGCGAAATCCAGCCCCGCGGTGCGAGGCCCCTTGTCGC'),
('f', 'AAGTATATGGCACGAGGGAACAGTATCAGCTCTTCGGATAAAGGCCACAGTGCC'),
('g', 'AGTTATGTGTCACAGGCAAAAGATCCTTCTCTGCGGTCGAACCCATTGCC')
```

## Longest Common Subsequence

Before we look into the mutations, we need to compare two strings and find their _longest common subsequence_ (LCS). The length of the LCS indicates how closely one string resembles another.

[#DynamicProgramming](https://seminar.minerva.kgi.edu/app/outcome-index/learning-outcomes/cs110-DynamicProgramming?course_id=989) is a good fit here because the problem has _overlapping subproblems_: the LCS of two strings is built from the LCS of their prefixes, and the same pairs of prefixes come up repeatedly. The problem also has _optimal substructure_: an optimal LCS of two strings contains optimal LCSs of their prefixes. That means we can use solutions to the subproblems to solve the main problem.

In [1]:
``` python
# Carrying out tests of our LCS function
import helpers
import problem_unittests

problem_unittests.test_lcs(helpers.longest_common_subsequence)
```

```
Tests Passed!
```


The _time complexity_ of the `longest_common_subsequence` function is $O(mn)$, where $m$ and $n$ are the lengths of the two strings, since it solves $m \times n$ subproblems in constant time each. For two strings of similar length $n$, this is $O(n^{2})$.

Its _space complexity_ is also $O(mn)$, since we store the solution of each subproblem in an $m \times n$ matrix.
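The body of `helpers.longest_common_subsequence` is not reproduced in this report, so the following is only a sketch of a standard bottom-up LCS table that has the complexity described above. The name `lcs_length` is hypothetical, and the sketch returns only the length of the LCS, not the subsequence itself.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (bottom-up DP)."""
    m, n = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Matching characters extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last character of one string or the other.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` returns 4, corresponding to the common subsequence `GTAB`. The two nested loops visit each of the $m \times n$ cells once, which is where both the time and space bounds come from.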

In [2]:
``` python
# Let's save a matrix C containing the LCS of
# each set string compared with each other
C = helpers.generate_lcs_matrix(verbose=True)
```

```
      a     b     c     d     e     f     g
a  52.0  40.0  41.0  48.0  39.0  38.0  45.0
b  40.0  54.0  38.0  38.0  47.0  44.0  43.0
c  41.0  38.0  47.0  39.0  36.0  36.0  39.0
d  48.0  38.0  39.0  55.0  38.0  37.0  42.0
e  39.0  47.0  36.0  38.0  60.0  39.0  40.0
f  38.0  44.0  36.0  37.0  39.0  54.0  40.0
g  45.0  43.0  39.0  42.0  40.0  40.0  50.0
```


## Local strategy

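Let us first rank the values in each row of the matrix. Rank 1 marks the string with the largest LCS in that row, tied values share the average of their ranks, and the diagonal is set to 0.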

In [3]:
``` python
R = helpers.generate_rank_matrix(C, verbose=True)
```

```
     a    b    c    d    e    f    g
a  0.0  4.0  3.0  1.0  5.0  6.0  2.0
b  4.0  0.0  5.5  5.5  1.0  2.0  3.0
c  1.0  4.0  0.0  2.5  5.5  5.5  2.5
d  1.0  4.5  3.0  0.0  4.5  6.0  2.0
e  3.5  1.0  6.0  5.0  0.0  3.5  2.0
f  4.0  1.0  6.0  5.0  3.0  0.0  2.0
g  1.0  2.0  6.0  3.0  4.5  4.5  0.0
```
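The ranking in the table can be reproduced row by row with a small helper. This is a sketch, not the code of `helpers.generate_rank_matrix`: the function name `rank_row` is hypothetical, and the tie rule (tied values share the average of their positions) is inferred from the printed ranks such as 5.5 and 2.5.

```python
def rank_row(values, skip):
    """Rank values in descending order (1 = largest); ties share the average rank.

    The entry at index `skip` (a string compared with itself) gets rank 0.
    """
    others = sorted((v for k, v in enumerate(values) if k != skip), reverse=True)
    ranks = []
    for k, v in enumerate(values):
        if k == skip:
            ranks.append(0.0)
            continue
        first = others.index(v) + 1         # best position held by this value
        last = first + others.count(v) - 1  # worst position held by this value
        ranks.append((first + last) / 2)
    return ranks


# Rows "a" and "b" of the LCS matrix C.
row_a = [52, 40, 41, 48, 39, 38, 45]
row_b = [40, 54, 38, 38, 47, 44, 43]
print(rank_row(row_a, skip=0))  # [0.0, 4.0, 3.0, 1.0, 5.0, 6.0, 2.0]
print(rank_row(row_b, skip=1))  # [4.0, 0.0, 5.5, 5.5, 1.0, 2.0, 3.0]
```

Both printed rows agree with rows `a` and `b` of the rank matrix.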


Based on the above rankings, we can devise a _local strategy_: begin at any starting string and point it to its two most related strings, unless another string already points to one of them. In that case, the string with the closer relationship keeps it.

Here is an example:

![Local strategy](images/local_strategy.JPG "Using a local strategy")
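A simplified version of this strategy can be sketched in code. Two assumptions are made here and may differ from the figure: the tree is grown breadth-first, and a string that has already been claimed is never reassigned (first come, first served), instead of comparing which parent is closer. The function name `local_tree` is hypothetical, and the matrix is copied from the LCS output above.

```python
from collections import deque

NAMES = "abcdefg"
# LCS matrix C (rows and columns in the order a..g).
C = [
    [52, 40, 41, 48, 39, 38, 45],
    [40, 54, 38, 38, 47, 44, 43],
    [41, 38, 47, 39, 36, 36, 39],
    [48, 38, 39, 55, 38, 37, 42],
    [39, 47, 36, 38, 60, 39, 40],
    [38, 44, 36, 37, 39, 54, 40],
    [45, 43, 39, 42, 40, 40, 50],
]


def local_tree(start):
    """Greedy tree: each node claims its two closest unclaimed strings."""
    claimed = {start}
    children = {}
    queue = deque([start])
    while queue:
        parent = queue.popleft()
        row = C[NAMES.index(parent)]
        # Strings not yet in the tree, most related to `parent` first.
        free = [n for n in NAMES if n not in claimed]
        free.sort(key=lambda n: row[NAMES.index(n)], reverse=True)
        picks = free[:2]
        if picks:
            children[parent] = picks
        claimed.update(picks)
        queue.extend(picks)
    return children


print(local_tree("g"))
```

Starting from `g`, this variant gives `g` the children `a` and `b`, then `a` the children `d` and `c`, and `b` the children `e` and `f`. A different starting string generally produces a different tree, which is the main weakness of a purely local rule.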

## Global strategy

For the global strategy, we first compute the relative LCS value of each string pair: the LCS length divided by the length of the row string (the diagonal entry of $C$). We then sum these values row by row to find the string with the highest cumulative relative LCS. We take that string to be the earliest ancestor, since it is the most closely related to the rest of the population overall.

In [4]:
``` python
M = helpers.generate_relative_lcs(C, verbose=True)
sums = helpers.obtain_row_sums(M)
```

```
          a         b         c         d         e         f         g
a  1.000000  0.769231  0.788462  0.923077  0.750000  0.730769  0.865385
b  0.740741  1.000000  0.703704  0.703704  0.870370  0.814815  0.796296
c  0.872340  0.808511  1.000000  0.829787  0.765957  0.765957  0.829787
d  0.872727  0.690909  0.709091  1.000000  0.690909  0.672727  0.763636
e  0.650000  0.783333  0.600000  0.633333  1.000000  0.650000  0.666667
f  0.703704  0.814815  0.666667  0.685185  0.722222  1.000000  0.740741
g  0.900000  0.860000  0.780000  0.840000  0.800000  0.800000  1.000000
        Sum
a  5.826923
b  5.629630
c  5.872340
d  5.400000
e  4.983333
f  5.333333
g  5.980000
```
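The numbers above are consistent with dividing each row of $C$ by its diagonal entry, for example $40 / 52 \approx 0.769231$ for the pair (a, b). The sketch below reproduces that calculation; the function name `relative_lcs` is hypothetical and the actual `helpers` implementation may differ.

```python
NAMES = "abcdefg"
# LCS matrix C (rows and columns in the order a..g).
C = [
    [52, 40, 41, 48, 39, 38, 45],
    [40, 54, 38, 38, 47, 44, 43],
    [41, 38, 47, 39, 36, 36, 39],
    [48, 38, 39, 55, 38, 37, 42],
    [39, 47, 36, 38, 60, 39, 40],
    [38, 44, 36, 37, 39, 54, 40],
    [45, 43, 39, 42, 40, 40, 50],
]


def relative_lcs(matrix):
    """Divide every row by its diagonal entry (the row string's own length)."""
    return [[value / row[i] for value in row] for i, row in enumerate(matrix)]


M = relative_lcs(C)
# One (name, row sum) pair per string; the sum includes the diagonal 1.0.
sums = [(name, sum(row)) for name, row in zip(NAMES, M)]
```

Because each row is scaled by a different length, `M` is not symmetric: a short string such as `g` scores high against everyone, which partly explains why it has the largest row sum.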


Before we move on to the next step and generate a heap that forms the binary tree of our genealogy, we need to sort the list in descending order, so that the most related strings always sit at the top of the tree.

In [5]:
``` python
sums_sorted = sorted(sums, key=lambda x: x[1], reverse=True)
sums_sorted
```

```
[('g', 5.9799999999999995),
 ('c', 5.872340425531914),
 ('a', 5.826923076923077),
 ('b', 5.62962962962963),
 ('d', 5.3999999999999995),
 ('f', 5.333333333333333),
 ('e', 4.983333333333333)]
```
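A list sorted in descending order already satisfies the max-heap property, so one way to read a tree off it is the usual array-backed heap layout. The heap-building code itself is not shown in this report, so the sketch below is an assumption about that step, and `heap_children` is a hypothetical name.

```python
# Order taken from the sorted output above (most related string first).
order = ["g", "c", "a", "b", "d", "f", "e"]


def heap_children(items):
    """Map each item to its children in the implicit binary-heap layout."""
    tree = {}
    for i, item in enumerate(items):
        # In an array-backed heap, node i has children at 2i + 1 and 2i + 2.
        kids = items[2 * i + 1 : 2 * i + 3]
        if kids:
            tree[item] = kids
    return tree


print(heap_children(order))
# {'g': ['c', 'a'], 'c': ['b', 'd'], 'a': ['f', 'e']}
```

Under this reading, `g` is the root, `c` and `a` form the second generation, and `b`, `d`, `f`, `e` are the leaves.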

## Appendix

### Appendix A: CS 110 Dashboard

!["CS 110 Dashboard"](images/my_dashboard_apr29.png "CS110 Dashboard")