## <span style ='color: green'> Counting DNA Nucleotides</span>

### <span style ='color: blue'>Problem</span>
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

**Given: A DNA string s of length at most 1000 nt**

**Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s**

##### Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

##### Sample Output
20 12 17 21

### <span style="color: blue;"> Solution</span>

In [16]:
codes="AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
code_dict={'A':0,'C':0,'G':0,'T':0} #initialize the dictionary
for i in codes:
    code_dict[i]+=1 #calculate the frequency of each of the nucleotides

for value in code_dict.values():
    print(value,end=' ') #print the four counts (A, C, G, T) on a single line

20 12 17 21 
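
The same tally can be sketched with `collections.Counter` from the standard library. The function name `count_nucleotides` is only illustrative and is not part of the original solution; the sketch assumes the input contains nothing but the four DNA symbols.

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of 'A', 'C', 'G' and 'T' in that order."""
    counts = Counter(dna)  # symbols that never occur simply count as 0
    return tuple(counts[base] for base in "ACGT")

sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(*count_nucleotides(sample))
```

Reading the counts in the fixed order "ACGT" keeps the output order independent of the order in which symbols first appear in the string.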

## <span style='color:green'>Transcribing DNA into RNA</span>

### <span style ='color:blue'>Problem</span>
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

**Given: A DNA string t having length at most 1000 nt**

**Return: The transcribed RNA string of t.**

##### Sample Dataset
GATGGAACTTGACTACGTAAATT

##### Sample Output
GAUGGAACUUGACUACGUAAAUU

### <span style="color: blue;"> Solution</span>

In [47]:
dna="GATGGAACTTGACTACGTAAATT"
rna=dna.replace('T','U') #replace every T with U in the DNA string
print(rna)

GAUGGAACUUGACUACGUAAAUU


## <span style ='color:green'>Complementing a Strand of DNA</span>

### <span style='color:blue'>Problem</span> 
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").
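
As a minimal sketch of this definition, the complement step can be expressed with a translation table built by `str.maketrans`, followed by a slice that reverses the string. The names `COMPLEMENT` and `reverse_complement` are illustrative, and the sketch assumes upper-case input made up only of 'A', 'C', 'G' and 'T'.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every symbol, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("GTCA"))  # the example from the text
```

Complementing first and reversing second gives the same string as reversing first, since the two steps act independently on each symbol.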

**Given: A DNA string s of length at most 1000 bp**.

**Return: The reverse complement sc of s.**

##### Sample Dataset
AAAACCCGGT

##### Sample Output
ACCGGGTTTT

### <span style="color: blue;"> Solution</span>

In [46]:
sample = "AAAACCCGGT"
complement_template = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

#the list comprehension below reverses the sample, and returns the complement of each nucleotide
reverse_complement = ''.join([complement_template[code] for code in reversed(sample)])

print(reverse_complement)

ACCGGGTTTT


## <span style='color:green'>Computing GC Content</span>
### <span style ='color:blue'>Problem</span>
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
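
The definition above can be checked with a short sketch. The helper names `gc_content` and `reverse_complement` are illustrative, and the code assumes a non-empty, upper-case DNA string.

```python
def gc_content(dna):
    """Percentage of symbols in dna that are 'G' or 'C'."""
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)

def reverse_complement(dna):
    """Helper used only to illustrate the remark about reverse complements."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

example = "AGCTATAG"
print(gc_content(example))
print(gc_content(reverse_complement(example)))
```

Both calls give the same percentage because complementing swaps every 'G' with a 'C' and vice versa, leaving their combined count unchanged.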

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
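
Reading this format does not require any external library. The sketch below is one possible parser, not Rosalind's own; the function name `parse_fasta` and the two record IDs in the example are made up for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA label to its full sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            label = line[1:]  # drop the leading '>'
            records[label] = []
        elif label is not None:
            records[label].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}

# Hypothetical two-record input
example = ">Rosalind_0001\nACGT\nAC\n>Rosalind_0002\nGG\n"
print(parse_fasta(example))
```

Collecting the lines in a list and joining them once per record avoids repeated string concatenation for sequences that span several lines.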

**Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each)**.

**Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below**

##### Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

##### Sample Output
Rosalind_0808
60.919540

### <span style="color: blue;"> Solution</span>

In [42]:
sample_dataset={
    'Rosalind_6404':'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG',
    'Rosalind_5959':'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC',
    'Rosalind_0808':'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT'
}

#compute the GC-content of each string as a percentage and store it in a dictionary
g_c_content={x:100*(sample_dataset[x].count('C') + sample_dataset[x].count('G'))/len(sample_dataset[x]) for x in sample_dataset}

best_id=max(g_c_content,key=g_c_content.get) #ID of the string with the highest GC-content
print(best_id)
print(f'{g_c_content[best_id]:.6f}') #GC-content of that string, to six decimal places

Rosalind_0808
60.919540


## <span style ='color:green'>Counting Point Mutations</span>
### <span style='color:blue'>Problem</span>
**Given two strings s and t of equal length, the Hamming distance between s and t, denoted dH(s,t), is the number of corresponding symbols that differ in s and t.**
<img src='https://rosalind.info/media/problems/hamm/Hamming_distance.png' style='height: 50px'/>
<img src='https://rosalind.info/media/problems/hamm/point_mutation.png' style='height:300px'/>

**Given: Two DNA strings s and t of equal length (not exceeding 1 kbp). Return: The Hamming distance dH(s,t).**
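
The definition translates almost directly into a `zip`-based sketch that pairs up corresponding symbols. The function name `hamming_distance` and the explicit length check are illustrative additions.

```python
def hamming_distance(s, t):
    """Number of positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # True counts as 1 when summed, so this counts the mismatching pairs.
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))
```

The length check matters because `zip` silently stops at the shorter input, which would otherwise hide a malformed pair of strings.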

##### Sample Dataset
GAGCCTACTAACGGGAT

CATCGTAATGACGGCCT

##### Sample Output
7

### <span style="color: blue;"> Solution</span>

In [48]:
s="GAGCCTACTAACGGGAT"
t="CATCGTAATGACGGCCT"

difference=0
for x in range(len(s)):
    if s[x]!=t[x]:
        difference+=1
print(f'Hamming distance is = {difference}')

Hamming distance is = 7


## <span style='color:green'> Mendel's First Law</span>

<img src= "https://rosalind.info/media/problems/iprb/balls_tree.png" style ="height: 200px"/>

### <span style='color:blue'>Problem</span>
**Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.**

**Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.**
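
One way to reason about this problem is through the complement: an offspring lacks a dominant allele only if both parents pass on a recessive one. That happens with probability 1 for a recessive × recessive pair, 1/2 for a heterozygous × recessive pair, and 1/4 for a heterozygous × heterozygous pair. The sketch below applies this with exact fractions; the function name `dominant_probability` is illustrative, and the sketch assumes the two parents are drawn without replacement.

```python
from fractions import Fraction

def dominant_probability(k, m, n):
    """Probability that the offspring of two random parents shows the dominant phenotype."""
    total = k + m + n
    ordered_pairs = total * (total - 1)  # ways to pick a first and a second parent
    recessive = (
        Fraction(n * (n - 1), ordered_pairs)                       # recessive x recessive
        + Fraction(2 * m * n, ordered_pairs) * Fraction(1, 2)      # heterozygous x recessive, either order
        + Fraction(m * (m - 1), ordered_pairs) * Fraction(1, 4)    # heterozygous x heterozygous
    )
    return 1 - recessive

print(dominant_probability(2, 2, 2))
print(float(dominant_probability(2, 2, 2)))
```

Working with `Fraction` avoids floating-point rounding until the final conversion, which makes the result easy to compare against a hand calculation.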

### <span style="color: blue;"> Solution</span>

In [1]:
def probability (k,m,n):
    summation=k+m+n
    pr_k=k/summation
    pr_m=m/summation
    pr_n=n/summation
    
    pr_k_k=pr_k * ((k-1)/(summation-1))*1
    pr_k_m=pr_k * ((m)/(summation-1))*1
    pr_k_n=pr_k * ((n)/(summation-1))*1
    
    pr_m_m=pr_m * ((m-1)/(summation-1))*.75
    pr_m_k=pr_m * ((k)/(summation-1))*1
    pr_m_n=pr_m * ((n)/(summation-1))*.5
    
    pr_n_n=pr_n * ((n-1)/(summation-1))*0
    pr_n_m=pr_n * ((m)/(summation-1))*.5
    pr_n_k=pr_n * ((k)/(summation-1))*1
    
    
    answer=pr_k_k + pr_k_m + pr_k_n + pr_m_m + pr_m_k + pr_m_n + pr_n_n + pr_n_m + pr_n_k
    return answer

In [30]:
probability(2,2,2)

0.7833333333333333

#### Test data set

In [35]:
probability(22,19,21)

0.7608408249603387

## <span style ='color:green'>Translating RNA into Protein</span>

### <span style ='color:blue'>Problem</span>
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
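
For readers without Biopython, the standard codon table can be built in a few lines. In the sketch below the 64 codons are generated in the order U, C, A, G for each position and paired with the amino acids of the standard genetic code, with '*' marking stop codons. The names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes the RNA string starts in frame.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard genetic code, listed for codons UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate rna codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))
```

Any trailing one or two bases that do not form a full codon are ignored by the loop bounds.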

**Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).**

**Return: The protein string encoded by s.**

##### Sample Dataset
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
##### Sample Output
MAMAPRTEINSTRING

### <span style="color: blue;"> Solution</span>

In [70]:
from Bio.Seq import Seq

rna=Seq('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')
print(rna.translate(to_stop=True)) #to_stop=True ends translation at the first stop codon, so no trailing '*' is printed

MAMAPRTEINSTRING


### Solution with the downloaded dataset

In [84]:
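with open ('rosalind_prot.txt','r') as f:
    file=f.read().strip() #strip the trailing newline so only the RNA letters remain
rna=Seq(file)
print(rna.translate(to_stop=True)) #stop at the first stop codon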

MCHLYMSESLTQAARVSPLLLGICVDPCFDLTSQWLSSWTDLVGLELLSASEASMRSRARTGPNSTGKLSHNRKQLNSLTGTPLQVVSKSCYNATRRSRWRILTYDDHLTGRSGTEVPCSYRRLLTFGPNYFTSYGYVVDGTLLDSVATQGDGFIRPFTISAAHVSLFCLDSWSTTTLAVSRAVARLNLPLQAPSNHVLEGDLRWDCGEACCRKRAGVPIFLAHEKWVLDPDIHTFYLRRTAEYPNLETRHVILGTGRVPLTAAFESRGHEGNERSRKYSSCCTPHYLYPTGESVVSQGRGHNLHVTCPSLRPPENAGFCLAASLVTISKASCLLFIDGTAVNLDDFWKRRALHRQRASHVLYSLESRPKICALTRPNIDRTGTLVPTGGHSVGSERPNQCRNIATHRLRFCATRHNNKAEIPWPLPVSEGSVSPVICIFTLGAPTHLVLTIGQSLRSCKLWDAKYHMLPKCAHARWGTMRGLTPVNNLLPFFPALTCVQTNSRLRHPTALRGLSQPVIILPRLSRYRHTAGLPLVYTALISKFERKKEKPLESYACQNLNHPDALGNQLISYNDRSVIYHCRTKHVYDRESNHLLNRTFLFRVNRRKVNRDRVIIFRNFVRNTLSLGRASLKTVIRKKILPIPTSTLPIAVVRPRVGCGRAPCPVFRCRRFCLLSCWSVIRAGSLCGGWHYNESTTLRLMRLSTLHWSRSPKKHNPAVINEVNSCSKKSLLRYLMPMGSFVILILAKRPSGEAHHQGVVLSQHIQELRLTSATSRWYYCIKGPEYNRHDTDLSWRITTGDLYLGMFGPWSLVLVEINSSAGQCWHFIRTSHEWTPNLLVHSLWKCRFRHTNLIYDPSLNLSRSGDADGDRTGIRISVAPILNGTRPRRHTSDNLPIMKLCCIGRYADWQEPETIEHYEYKSCRLDHTYYLSLRKMYGLDIVSATRGAVGEHFDGRTTCPVLPLYISAGGYTWFLPVRKYRAALGTMRNTVYTYDDCVPE



**RNA codon table from Rosalind website** [Link](https://rosalind.info/glossary/rna-codon-table/)

## <span style='color:green'>Finding a Motif in DNA</span>

### <span style ='color:blue'>Problem</span>
Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k] , where j and k represent the starting and ending positions of the substring in s; for example, 

if s = "AUGCUUCAGAAAGGUCUUACG", 

then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
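
Note that these positions are 1-based and the range s[j:k] includes both ends, whereas Python indexes from 0 and excludes the end of a slice. The sketch below shows the conversion and finds overlapping occurrences with `str.find`; the function name `find_locations` is illustrative, and t is assumed to be non-empty.

```python
def find_locations(s, t):
    """Return the 1-based starting positions of every occurrence of t in s."""
    locations = []
    start = s.find(t)
    while start != -1:
        locations.append(start + 1)   # convert Python's 0-based index to a 1-based position
        start = s.find(t, start + 1)  # advance by one so overlapping matches are found
    return locations

s = "AUGCUUCAGAAAGGUCUUACG"
print(s[1:5])  # the text's s[2:5] corresponds to the Python slice s[1:5]
print(*find_locations("GATATATGCATATACTT", "ATAT"))
```

Advancing the search by a single character rather than by the length of t is what allows the overlapping matches at positions 2 and 4 in the sample to both be reported.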

**Given: Two DNA strings s and t (each of length at most 1 kbp).**

**Return: All locations of t as a substring of s.**

##### Sample Dataset
GATATATGCATATACTT

ATAT
##### Sample Output
2 4 10

### <span style="color: blue;"> Solution</span>

In [89]:
s = 'GATATATGCATATACTT'
t = 'ATAT'

positions = [start + 1 for start in range(len(s)) if s[start:start + len(t)] == t]

print(*positions)

2 4 10


### Solution with the downloaded dataset

In [128]:
with open ('rosalind_subs.txt','r') as f:
    file=f.read().split('\n')
dna=file[0]
sub_string=file[1]
positions = [start + 1 for start in range(len(dna)) if dna[start:start + len(sub_string)] == sub_string]
print(*positions)

63 92 101 111 159 166 175 190 243 301 337 353 432 530 545 552 587 613 628 651 658 724 743 847


## <span style='color:green'>Consensus and Profile</span>
### <span style='color:blue'>Problem</span>
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write $A_{i,j}$ to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which $P_{1,j}$ represents the number of times that 'A' occurs in the jth position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
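
A compact way to sketch these definitions is to transpose the collection with `zip(*strings)`, so that each tuple holds one column. The function names `profile_matrix` and `consensus_string` are illustrative; ties are resolved in favour of the symbol that comes first in "ACGT", which is one of the permitted answers.

```python
def profile_matrix(strings):
    """Map each nucleotide to its per-column counts across equal-length strings."""
    columns = list(zip(*strings))  # columns[j] holds the jth symbol of every string
    return {base: [column.count(base) for column in columns] for base in "ACGT"}

def consensus_string(profile):
    """Pick the most frequent symbol in each column of the profile."""
    length = len(profile["A"])
    return "".join(
        max("ACGT", key=lambda base: profile[base][j])  # first maximum wins on ties
        for j in range(length)
    )

strings = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
           "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile = profile_matrix(strings)
print(consensus_string(profile))
for base, counts in profile.items():
    print(f"{base}: {' '.join(map(str, counts))}")
```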

**Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.**

**Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)**

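**Sample Dataset**

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

**Sample Output**

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6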

### <span style ='color:blue'>Solution</span>

In [259]:
import re
dna=('''>Rosalind_1ATCCAGCT>Rosalind_2GGGCAACT>Rosalind_3ATGGATCT>Rosalind_4AAGCAACC>Rosalind_5TTGGAACT>Rosalind_6ATGCCATT>Rosalind_7ATGGCACT''')

dna=re.split(r'>Rosalind_\d',dna)[1:] #split on the FASTA labels and drop the empty text before the first label

In [238]:
def count(Motifs):
    count = {}
    k = len(Motifs[0])
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
            count[symbol].append(0)
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count

def consensus(Motifs):
    k = len(Motifs[0])
    motif_count = count(Motifs)
    consensus = ""
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if motif_count[symbol][j] > m:
                m = motif_count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus

In [261]:
motif=[x.replace('\n','') for x in dna]

print(consensus(motif))
for key, value in count(motif).items():
    print(f"{key}: {' '.join(map(str,value))}")

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6


### Solution with the downloaded dataset

In [252]:
with open ('rosalind_cons.txt','r') as file:
    motif=re.split(r'>Rosalind_\d*',file.read())[1:] #split on the FASTA labels and drop the empty text before the first label

In [253]:
no_line_motif=[x.replace('\n','') for x in motif]

print(consensus(no_line_motif))
data=count(no_line_motif)
for letter, counts in data.items():
    print(f"{letter}: {' '.join(map(str, counts))}")

CTCTGCCTAACACGCATGGGTAGTAACCGCTCTGGGCCGAGTCTGATTCGAGAACTCGTGTCCCCCGCTGTGGACCACCCAGGGTCACGCATTAACGATTGGGTTAGGATCTGAGCTAGACAGAAGTGCCATATGCGAATCGGCGCTCATTATTGGGAGAAAAGCGCACGCAACGAGATTGGCAAGTTGCTACGCAGATCGTGCTCTGTCCATTAGCCGGCAAGTACGGTAAAACAACCAATTCAGCTACAGTAACAACTAGACATCCCTAACCTCCGTGCGAAAAGCGGCGAGACATAGTGCGCAGAAATACTGCGATCCAAAAGACACCGCTTCCAGCATCCCAAGACCCGAAGTGAAAGTTAACAAGCTGTCTCGGCAAAGATAACAATAGACCAGGGACAGGAGCTGATCTTGCGTATCTAAAGAAACAACGAGGACAGCGACGTGAAAGAAGAAAAGGAAGCAGGACAGCCAGACTAACTGATTGAATCGAATCAGGCATCTACCAACCATCTGTCCGCATGGAACACGAGGGCCTCGAAAGGATAACAAACAACTGAAATGAGAGGAATAGAACACCCGGCATCCGCGACAAGGGAGATGAGCCAAATCCGAACACAAAAAGACTGACGAAGCACAGCGAGGAGACACCGTGTACCTCCCCTGTCCGCGTACAGTCTCAGCAACTGCATAGCGCGAAGATTAATAAGAGAAGAGGCCAGAGCCCGCGATATTAAACAACGGCACACATCAAATAGCCAAGGAGCGGACAAAATAAGGTAGGTAAGAGCTCTGAGTTGAGCGTAACGTTCACTCGACATGTCGACCAAACGACATAAGCGTGGCTGAGGTACGAGGGAGCGAGTACACCGACGAGATGAGAAAGCACCGTGGACACTGTAGCGCTGAAAAATGAACAC
A: 2 3 2 2 2 1 2 1 4 4 3 3 1 2 0 3 1 2 1 1 0 3 2 3 3 4 0 0 3 3 1 1 2 1 2 1 2