# Bioinformatics Problems on Rosalind.info

## 6. Counting Point Mutations

A mutation is a mistake that occurs during the creation or copying of a nucleic acid, in particular DNA. Because nucleic acids are vital to cellular functions, mutations tend to cause a ripple effect throughout the cell. Although mutations are technically mistakes, a very rare mutation may equip the cell with a beneficial attribute. In fact, the macro effects of evolution are attributable to the accumulated result of beneficial microscopic mutations over many generations.

The simplest and most common type of nucleic acid mutation is a **point mutation**, which replaces one base with another at a single nucleotide. In the case of DNA, a point mutation must change the complementary base accordingly.


**Problem**

Given two strings _s_ and _t_ of equal length, the Hamming distance between _s_ and _t_, denoted dH(s,t), is the number of corresponding symbols that differ in _s_ and _t_. 

Given: Two DNA strings _s_ and _t_ of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t).

In [6]:
# read the two sequences (one per line)
with open("data/rosalind_hamm.txt", 'r') as f:
    raw_data = f.readlines()

mutation_count = 0 # initialize mutations

# iterate through the sequences position by position
# note: readlines() keeps each line's trailing newline, so len() below may count it
for a, b in zip(raw_data[0], raw_data[1]):
    if a != b:
        mutation_count += 1

print("length of sequence =", len(raw_data[0]))
print("number of mutations =", mutation_count)

length of sequence = 932
number of mutations = 469
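
The same count can be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the name `hamming_distance` and the two short example strings are illustrative assumptions, and the length check reflects the problem's requirement that _s_ and _t_ have equal length.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    # zip pairs up corresponding symbols; True counts as 1 in sum()
    return sum(a != b for a, b in zip(s, t))

# Illustrative sequences (assumed, not the contents of rosalind_hamm.txt)
s = "GATTACA"
t = "GACTATA"
print(hamming_distance(s, t))
```

When the strings come from a file, calling `.strip()` on each line first keeps trailing newlines out of both the comparison and the reported length.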


## 7. Mendel's First Law

Also known as the law of segregation, Mendel's first law states that:

- Every organism possesses a pair of alleles for a given factor.
- If an individual's two alleles for a given factor are the same, then it is homozygous for the factor; if the alleles differ, then the individual is heterozygous.
- For any factor, an organism randomly passes one of its two alleles to each offspring, so that an individual receives one allele from each parent.
- Any factor corresponds to only two possible alleles, the dominant and recessive alleles.
- An organism only needs to possess one copy of the dominant allele to display the trait represented by the dominant allele.

We may encode the dominant allele of a factor by a capital letter (e.g., A) and the recessive allele by a lower case letter (e.g., a). Because a heterozygous organism can possess a recessive allele without displaying the recessive form of the trait, we henceforth define an organism's genotype to be its precise genetic makeup and its phenotype as the physical manifestation of its underlying traits.

The different possibilities describing an individual's inheritance of two alleles from its parents can be represented by a Punnett square; see Figure below for an example.

![](http://rosalind.info/media/problems/iprb/220px-Punnett_Square.svg.png)

A Punnett square representing the possible outcomes of crossing a heterozygous organism (Yy) with a homozygous recessive organism (yy); here, the dominant allele Y corresponds to yellow pea pods, and the recessive allele y corresponds to green pea pods.


**Problem**

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

In [13]:
# HD = homozygous dominant (AA)
# HT = heterozygous (Aa)
# HR = homozygous recessive (aa)
def dominant_phenotype(HD, HT, HR):
    tot = HD + HT + HR
    # Each term below is an unnormalized weight (number of ordered parent pairs
    # times the chance of an aa offspring); dividing by tot*(tot-1) gives a probability.
    # both parents HR: offspring is always aa
    pRecc_2HR = HR * (HR-1)
    
    # both parents heterozygous: offspring is aa with probability 1/4
    pRecc_2HT = (0.5*HT) * (0.5*(HT-1))
    
    # one heterozygous and one HR parent: offspring is aa with probability 1/2
    # double because two scenarios (e.g. Aa * aa & aa * Aa)
    pRecc_1HT_1HR = 2.0 * HR * (0.5*HT)
    
    pRecc  = (pRecc_2HR + pRecc_2HT + pRecc_1HT_1HR)/(tot*(tot-1))
    return 1-pRecc

dominant_phenotype(2, 2, 2)

0.7833333333333333
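
One way to cross-check the closed-form calculation is to enumerate every ordered pair of parents and average the chance that their offspring carries a dominant allele. This is a sketch under stated assumptions rather than part of the original solution: the function name `dominant_probability_bruteforce` is hypothetical, and it assumes each ordered pair of distinct organisms is equally likely to mate.

```python
from fractions import Fraction
from itertools import permutations

def dominant_probability_bruteforce(k, m, n):
    """Exact probability of a dominant-phenotype offspring, by enumeration."""
    # Genotype of every organism in the population
    population = ["AA"] * k + ["Aa"] * m + ["aa"] * n
    total = Fraction(0)
    pairs = 0
    # permutations() treats each organism as distinct, giving all ordered pairs
    for first, second in permutations(population, 2):
        pairs += 1
        # Each parent passes either of its two alleles with probability 1/2,
        # so the four allele combinations are equally likely
        dominant = sum(1 for x in first for y in second if "A" in (x, y))
        total += Fraction(dominant, 4)
    return total / pairs

print(float(dominant_probability_bruteforce(2, 2, 2)))
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared against the floating-point value above without rounding concerns inside the loop.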

## 8. Translating RNA into Protein

Just as nucleic acids are polymers of nucleotides, proteins are chains of smaller molecules called amino acids; 20 amino acids commonly appear in every species. Just as the primary structure of a nucleic acid is given by the order of its nucleotides, the primary structure of a protein is the order of its amino acids. Some proteins are composed of several subchains called polypeptides, while others are formed of a single polypeptide.

The notion that protein is always created from RNA, which in turn is always created from DNA, forms the central dogma of molecular biology. Like all dogmas, it does not always hold; however, it offers an excellent approximation of the truth.



## Enumerating Gene Order

Point mutations can create changes in populations of organisms from the same species, but they lack the power to create and differentiate entire species. This more arduous work is left to larger mutations called genome rearrangements, which move around huge blocks of DNA. Rearrangements cause major genomic change, and most rearrangements are fatal or seriously damaging to the mutated cell and its descendants (many cancers derive from rearrangements). For this reason, rearrangements that come to influence the genome of an entire species are very rare.

Because rearrangements that affect species evolution occur infrequently, two closely related species will have very similar genomes. Thus, to simplify comparison of two such genomes, researchers first identify similar intervals of DNA from the species, called synteny blocks; over time, rearrangements have created these synteny blocks and heaved them around across the two genomes (often separating blocks onto different chromosomes, see Figure below).

![](https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-8-82/MediaObjects/12859_2006_1454_MOESM2_ESM.png)

A pair of synteny blocks from two different species are not strictly identical (they are separated by the action of point mutations or very small rearrangements), but for the sake of studying large-scale rearrangements, we consider them to be equivalent. As a result, we can label each synteny block with a positive integer; when comparing two species' genomes or chromosomes, we then only need to specify the order of the numbered synteny blocks in each.
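
With blocks labelled by positive integers, a gene order is just an ordering of those labels. As a minimal sketch, assuming the blocks are labelled 1 through n and that each block appears exactly once, the possible orders can be listed with the standard library; the function name `gene_orders` is an illustrative choice.

```python
from itertools import permutations

def gene_orders(n):
    """Return every ordering of synteny blocks labelled 1..n."""
    return list(permutations(range(1, n + 1)))

orders = gene_orders(3)
print(len(orders))      # number of distinct orders
for order in orders:
    print(*order)       # one order per line, labels separated by spaces
```

The number of orders grows as n factorial, so listing them all is only practical for small n.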

