# Gene Mutation

## Introduction

A **gene mutation** is a permanent alteration in the DNA sequence that makes up a _gene_, such that the sequence differs from that of its parent.

> **Citation:** Genetics Home Reference. n.d. “What Is a Gene Mutation and How Do Mutations Occur?” Genetics Home Reference. https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation.

In this assignment, we explore this phenomenon in more detail by developing a Python program that biotechnology companies could use in their genetic research projects.

A _gene_ is the basic physical and functional unit of heredity. Gene strings are made up of DNA bases, each of which takes a value from the set $\{A, C, G, T\}$. Mutations are relatively rare, but there are small probabilities of _inserting_ a new character ($p_{i}$), _deleting_ an existing character ($p_{d}$), or _changing_ a character to a new random one ($p_{c}$).
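
The three mutation events can be sketched as a small simulation. This is only an illustration under stated assumptions, not the assignment's code: the function name `mutate` is hypothetical, each position is assumed to undergo at most one of deletion or change, an insertion is assumed to happen independently before the current base, and a "change" is assumed to pick a base different from the current one.

```python
import random

BASES = "ACGT"


def mutate(gene, p_i, p_d, p_c, rng=random):
    """Return a mutated copy of `gene` (hypothetical helper for illustration)."""
    child = []
    for base in gene:
        # Insertion (assumed independent): add a random base before this one.
        if rng.random() < p_i:
            child.append(rng.choice(BASES))
        r = rng.random()
        if r < p_d:
            # Deletion: skip the current base entirely.
            continue
        if r < p_d + p_c:
            # Change (assumed to always produce a different base).
            child.append(rng.choice([b for b in BASES if b != base]))
        else:
            # No mutation at this position.
            child.append(base)
    return "".join(child)


if __name__ == "__main__":
    random.seed(0)  # fixed seed so repeated runs give the same children
    parent = "AGTTATGTGTCAGAGCAAAAGATTCC"
    print(mutate(parent, p_i=0.02, p_d=0.02, p_c=0.05))
```

With all three probabilities set to zero the child is identical to the parent, which is a convenient sanity check before trying realistic values.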

Suppose a starting _gene string_ reproduces asexually and creates two child strings; any of the mutation processes may occur in each child. Each child may in turn produce two children of its own, which may also have undergone mutation. This forms a binary genealogical tree.

Our task is to build this tree given its 7 strings. The example we will work with is `Set_Strings`:
``` python
# Each entry is a (label, gene string) pair
Set_Strings = [
    ('a', 'AGTTATGTGTCAGAGCAAAAGATTCCTCATCTAGCGGTCGCAAGTCATTGCC'),
    ('b', 'AAGTTATTTGCTCACAGGGAACGAATCCAGCTCTGCGGTCGAGGCCACATTGCC'),
    ('c', 'AGTTATTTTCAGAGAAATGATTCCTTCTCACCGGTCGAGCCAGTGCC'),
    ('d', 'AGTTTATGTGTCAGAAGCAAAAGATTACTAATCTACGCGTCGCAAGGTCTATTCC'),
    ('e', 'ACAGCTTATATAGCTCATAGGGAGCGAAATCCAGCCCCGCGGTGCGAGGCCCCTTGTCGC'),
    ('f', 'AAGTATATGGCACGAGGGAACAGTATCAGCTCTTCGGATAAAGGCCACAGTGCC'),
    ('g', 'AGTTATGTGTCACAGGCAAAAGATCCTTCTCTGCGGTCGAACCCATTGCC'),
]
```

### Longest Common Subsequence

Before we look into the mutations, we need to compare two strings and find their _longest common subsequence_ (LCS). The length of the LCS is useful for measuring how closely one string resembles another.

[#DynamicProgramming](https://seminar.minerva.kgi.edu/app/outcome-index/learning-outcomes/cs110-DynamicProgramming?course_id=989) is a good fit for this problem because it has _overlapping subproblems_: we repeatedly need the LCS of a prefix of string a and a prefix of string b. The problem also has _optimal substructure_: the LCS of two strings is built from the LCS of shorter prefixes. That means we can use solutions to the subproblems to solve the main problem.
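
A minimal sketch of the bottom-up table is shown here to make the two properties concrete. It is not the `helpers.longest_common_subsequence` implementation, which is not reproduced in this section; the name `lcs_length` is hypothetical, and the sketch returns only the length of the LCS.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (illustrative sketch)."""
    m, n = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Matching characters extend the LCS of the two shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise reuse the better of the two neighbouring subproblems.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

Each cell is filled once from cells that are already known, which is where the reuse of overlapping subproblems shows up.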

In [1]:
``` python
# Carrying out tests of our LCS function
import helpers
import problem_unittests

problem_unittests.test_lcs(helpers.longest_common_subsequence)
```

```
Tests Passed!
```


The _time complexity_ of the `longest_common_subsequence` function is $O(mn)$, since the number of subproblems we go through is $m \times n$, where $m$ and $n$ are the lengths of the two strings. For two strings of similar length $n$, this is $O(n^{2})$.

Its _space complexity_ is also $O(mn)$, or $O(n^{2})$ for strings of similar length, since we store the solution of each subproblem in an $m \times n$ matrix.
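
When only the length of the LCS is needed, the full matrix is not strictly required, because each row depends only on the row before it. The sketch below illustrates that idea; it is an alternative shown for comparison, not the approach used in `helpers`, and the name `lcs_length_two_rows` is hypothetical. It keeps two rows at a time, so the extra space is proportional to the shorter string, while the running time stays $O(mn)$.

```python
def lcs_length_two_rows(x, y):
    """LCS length using two rows of the table instead of the full matrix."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter string along the row
    previous = [0] * (len(y) + 1)
    for ch in x:
        current = [0]  # column 0: LCS with an empty prefix is 0
        for j, other in enumerate(y, start=1):
            if ch == other:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```

The trade-off is that the subsequence itself can no longer be recovered by backtracking, since the earlier rows are discarded.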

In [2]:
``` python
# Let's save a matrix C containing the LCS of
# each set string compared with each other
C = helpers.generate_lcs_matrix(verbose=True)
```

```
      a     b     c     d     e     f     g
a  52.0  40.0  41.0  48.0  39.0  38.0  45.0
b  40.0  54.0  38.0  38.0  47.0  44.0  43.0
c  41.0  38.0  47.0  39.0  36.0  36.0  39.0
d  48.0  38.0  39.0  55.0  38.0  37.0  42.0
e  39.0  47.0  36.0  38.0  60.0  39.0  40.0
f  38.0  44.0  36.0  37.0  39.0  54.0  40.0
g  45.0  43.0  39.0  42.0  40.0  40.0  50.0
```
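
One simple way to read this matrix is to look up, for each string, the other string with which it shares the longest common subsequence. The sketch below copies the values from the output above into a plain nested list; the function name `closest_relative` is hypothetical and is not part of `helpers`. Raw LCS lengths are not normalised by string length, so this is only a rough first look at resemblance.

```python
labels = ["a", "b", "c", "d", "e", "f", "g"]

# LCS lengths copied from the matrix printed above.
C = [
    [52, 40, 41, 48, 39, 38, 45],
    [40, 54, 38, 38, 47, 44, 43],
    [41, 38, 47, 39, 36, 36, 39],
    [48, 38, 39, 55, 38, 37, 42],
    [39, 47, 36, 38, 60, 39, 40],
    [38, 44, 36, 37, 39, 54, 40],
    [45, 43, 39, 42, 40, 40, 50],
]


def closest_relative(matrix, names):
    """Map each name to (most similar other name, shared LCS length)."""
    result = {}
    for i, name in enumerate(names):
        # Skip the diagonal, which compares a string with itself.
        others = (j for j in range(len(names)) if j != i)
        best = max(others, key=lambda j: matrix[i][j])
        result[name] = (names[best], matrix[i][best])
    return result


if __name__ == "__main__":
    for name, (other, score) in closest_relative(C, labels).items():
        print(f"{name} is closest to {other} (LCS length {score})")
```

For the values above, this pairs a with d (48) and b with e (47), for example.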


## Appendix

### Appendix A: CS110 Dashboard

!["CS110 Dashboard"](images/my_dashboard_apr14.png "CS110 Dashboard")