Question 1: Computing GC Content

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
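As a quick check of the definition above, here is a minimal sketch that computes GC-content and confirms the reverse-complement claim on the example string. The helper names `gc_content` and `reverse_complement` are illustrative choices, not part of the problem statement.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of symbols in seq that are 'G' or 'C'."""
    if not seq:
        raise ValueError("GC-content is undefined for an empty string")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


def reverse_complement(seq: str) -> str:
    """Swap A<->T and C<->G, then reverse the string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps each G with a C and each A with a T, so the combined count of G and C is unchanged, and reversing does not alter counts at all.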

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, each string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself, which may span several lines; the next line to begin with '>' marks the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
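To make the format concrete, the following sketch parses FASTA text held in memory into a dictionary of label to sequence. The function name `parse_fasta` and the two short records are made up for illustration; they are not Rosalind data.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA label (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            label = line[1:]      # a header line starts a new record
            records[label] = ""
        elif label is not None:
            records[label] += line  # sequence lines are concatenated
    return records


# Hypothetical input: two short records, the first split over two lines.
sample = """>Rosalind_0001
ACGT
ACGT
>Rosalind_0002
GGCC
"""
records = parse_fasta(sample.splitlines())
print(records)  # {'Rosalind_0001': 'ACGTACGT', 'Rosalind_0002': 'GGCC'}
```

Because a file object is also an iterable of lines, the same function could be called as `parse_fasta(f)` inside a `with open(...)` block.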

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default absolute error of 0.001 in all decimal answers unless otherwise stated.
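One way to read the error allowance is as a bound on the absolute difference between a submitted value and the expected one. The sketch below assumes that interpretation; `within_tolerance` is a hypothetical helper, and Rosalind's actual checker is not shown here.

```python
def within_tolerance(answer: float, expected: float, tol: float = 0.001) -> bool:
    """Return True if answer differs from expected by at most tol (absolute error)."""
    return abs(answer - expected) <= tol


expected = 60.919540
print(within_tolerance(60.9195, expected))  # True
print(within_tolerance(60.92, expected))    # True, the difference is about 0.00046
print(within_tolerance(60.9, expected))     # False, the difference is about 0.0195
```

Under this reading, a percentage should be printed with at least three decimal places to stay safely inside the allowance.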



In [2]:
path = './test.fasta'

counts = {}
seq = ''
label = ''
with open(path, 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith('>'):  # Header line; startswith() also tolerates blank lines
            if label and seq: # if label and seq exist calculate gc
                gc = (seq.count('G') + seq.count('C')) / len(seq) * 100
                counts[label] = gc # add to dictionary
            label = line[1:] # remove the >
            seq = '' # reinitialize seq
        else:
            seq += line
    # The last sequence has no following header, so record it here
    if label and seq:
        gc = (seq.count('G') + seq.count('C')) / len(seq) * 100
        counts[label] = gc

max_id = max(counts, key=counts.get)
print(max_id)
print(f"{counts[max_id]:.6f}")  # print the highest GC-content to 6 decimal places

Rosalind_0808
60.919540


Question 2: Sex-Linked Inheritance

The conditional probability of an event A given another event B, written Pr(A∣B), is equal to Pr(A and B) divided by Pr(B).

Note that if A and B are independent, then Pr(A and B) must be equal to Pr(A)×Pr(B), which results in Pr(A∣B)=Pr(A). This equation offers an intuitive view of independence: the probability of A, given the occurrence of event B, is simply the probability of A (which does not depend on B).
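The definition and the independence property can be checked by brute-force enumeration. The sketch below uses two fair dice, an example chosen here for illustration and not taken from the problem statement, and exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def pr(event) -> Fraction:
    """Probability of an event, given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def pr_given(a, b) -> Fraction:
    """Conditional probability Pr(a | b) = Pr(a and b) / Pr(b)."""
    return pr(lambda o: a(o) and b(o)) / pr(b)


def first_is_six(o):
    return o[0] == 6


def second_is_even(o):
    return o[1] % 2 == 0


def sum_at_least_ten(o):
    return o[0] + o[1] >= 10


print(pr(first_is_six))                          # 1/6
print(pr_given(first_is_six, second_is_even))    # 1/6, independent events
print(pr_given(first_is_six, sum_at_least_ten))  # 1/2, dependent events
```

Conditioning on the second die leaves the probability at 1/6, while conditioning on a large sum raises it to 1/2, which is what dependence looks like in these terms.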

In the context of sex-linked traits, genetic equilibrium requires that the alleles for a gene k are uniformly distributed over the males and females of a population. In other words, the distribution of alleles is independent of sex.

Given: An array A of length n for which A[k] represents the proportion of males in a population exhibiting the k-th of n total recessive X-linked genes. Assume that the population is in genetic equilibrium for all n genes.

Return: An array B of length n in which B[k] equals the probability that a randomly selected female will be a carrier for the k-th gene.
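A short derivation of the carrier probability, sketched under two assumptions: a "carrier" is a female heterozygous for the recessive allele, and Hardy-Weinberg genotype proportions hold. The symbols q_k and p_k are introduced here for the recessive and dominant allele frequencies; they do not appear in the problem statement.

```latex
% Males carry a single X chromosome, so a male exhibits the recessive
% trait exactly when that one X carries the recessive allele:
q_k = A[k], \qquad p_k = 1 - q_k .

% Equilibrium makes allele frequencies the same in both sexes. A female
% draws two X chromosomes, giving genotype frequencies
% p_k^2 (non-carrier), 2 p_k q_k (carrier), q_k^2 (affected).
B[k] = 2\, p_k\, q_k = 2\, A[k]\,\bigl(1 - A[k]\bigr).

% Example: A[k] = 0.1 gives B[k] = 2 \cdot 0.1 \cdot 0.9 = 0.18 .
```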

In [None]:
A = [0.1, 0.5, 0.8] # proportion of affected males per gene (equals the recessive allele frequency)
B = [] # list of carrier frequencies
for k in A:
    Bk = 2 * k * (1 - k) # Hardy-Weinberg heterozygote (carrier) frequency 2pq, with q = k and p = 1 - k
    Bk = round(Bk, 3) # 3 decimal places keeps the rounding error within the 0.001 allowance
    B.append(Bk)
print(B)

[0.18, 0.5, 0.32]
