#Intro

Hello! Welcome to this notebook!
My name is Hubert and I am a biotechnologist and a bioinformatician.

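This `.ipynb` file contains a collection of 20 [Rosalind Problems](https://rosalind.info/problems/list-view/) that I solved using a single third-party library, `pandas`. The one exception is `mprt()`, which also imports `requests` to download sequences from UniProt.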

I begin with my Setup: import `pandas` and define two helpful functions.






`parse_fasta()` is useful for working with FASTA files. It transforms a string containing names and sequences in FASTA format into a `pandas.DataFrame` with two columns: `'name'`, which holds the sequence names without the `>` symbol, and `'seq'`, which holds the sequences as uppercase strings. Several of the problem functions rely on it.

`workflow()` works only when this file is opened in Google Colab. Rosalind provides a `.txt` dataset for testing each function; `workflow()` lets you upload that file to Google Colab, pass it to a function of your choice, and download the result as a `.txt` file in a single step. The resulting file can be uploaded directly to Rosalind to check the function.

All Rosalind Problems are linked in their titles.

Please note that some functions call other functions defined earlier, e.g. `orf()` calls both `rna()` and `prot()` in its body.

#Setup

##Import `pandas`

In [7]:
import pandas as pd

##`parse_fasta` function

In [39]:
def parse_fasta(fasta_file):
  """
  Transforms a FASTA string into a pandas DataFrame with names and sequences
  Given: string of sequences in FASTA format
  Return: a pandas dataframe with 'name' and 'seq' columns with sequence data
  """
  names, seqs = [], []
  for token in fasta_file.split():
    if token.startswith(">"):
      # header token: start a new record
      names.append(token[1:])
      seqs.append("")
    elif names:
      # sequence token: append it to the current record
      seqs[-1] += token.upper()

  # build the DataFrame once instead of concatenating inside the loop
  return pd.DataFrame({"name": names, "seq": seqs})

In [40]:
df = parse_fasta('>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC >Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT')
df

| | name | seq |
|---|---|---|
| 0 | Rosalind_6404 | CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGG... |
| 1 | Rosalind_5959 | CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGC... |
| 2 | Rosalind_0808 | CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTC... |


## `workflow` function

In [None]:
def workflow(func:str):
  """
  Upload/Download workflow for Google Colab
  Given: Rosalind Problem function name as string
  Return: none

  Uploads a txt file provided by Rosalind from the local
  machine, passes its content to the function for that Problem,
  and downloads the function's output as a txt file.
  """
  output_filename = f'rosalind_{func}_output.txt'
  function = eval(func)

  from google.colab import files

  def string_from_uploaded_file():
    uploaded = files.upload()
    for bytesObject in uploaded.values():
      return bytesObject.decode('utf-8')

  print(f'First download a dataset from Rosalind https://rosalind.info/problems/{func}/\n')
  print('Next upload this dataset by clicking the [Choose Files] button below:\n')
  input_string = string_from_uploaded_file()

  output_string = function(input_string)

  print(output_string)

  with open(output_filename, 'w') as f:
    f.write(output_string)

  files.download(output_filename)
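
Outside Google Colab, the same read/solve/write loop can be approximated with plain file I/O. The sketch below is only an illustration: `workflow_local` and its parameters are hypothetical names that do not appear elsewhere in this notebook, and it takes the solver function itself rather than its name, so no `eval` is needed.

```python
from pathlib import Path


def workflow_local(func, input_path, output_path=None):
    """Hypothetical local counterpart of workflow(): read, solve, write."""
    input_path = Path(input_path)
    if output_path is None:
        # mirror the output naming scheme used by workflow()
        output_path = input_path.with_name(f"rosalind_{func.__name__}_output.txt")

    # read the dataset downloaded from Rosalind
    input_string = input_path.read_text(encoding="utf-8")

    # run the solver and store its result next to the input file
    output_string = func(input_string)
    Path(output_path).write_text(output_string, encoding="utf-8")
    return output_string
```

Assuming a dataset saved as `rosalind_dna.txt`, a call such as `workflow_local(dna, "rosalind_dna.txt")` would write `rosalind_dna_output.txt` to the same folder.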

#Rosalind Problems

##  DNA: [Counting DNA Nucleotides](https://rosalind.info/problems/dna/)

In [50]:
def dna(sequence):
  """
  Solves: https://rosalind.info/problems/dna/
  Given: A DNA string sequence of length at most 1000 nt.
  Return: A string with four numbers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in sequence.
  """
  sequence = sequence.upper()

  nt_count = [str(sequence.count(nt)) for nt in "ACGT"]
  # Rosalind expects the four counts separated by single spaces
  nt_count_str = " ".join(nt_count)
  return nt_count_str

In [52]:
dna_sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
dna(dna_sample) == "20 12 17 21"

True

## RNA: [Transcribing DNA into RNA](https://rosalind.info/problems/rna/)

In [53]:
def rna(dna_seq):
  """
  Solves: https://rosalind.info/problems/rna/
  Given: A DNA string dna_seq having length at most 1000 nt.
  Return: The transcribed RNA string of dna_seq
  """
  dna_seq = dna_seq.upper()
  return dna_seq.replace("T", "U")

In [54]:
rna_sample = "GATGGAACTTGACTACGTAAATT"

rna(rna_sample) == "GAUGGAACUUGACUACGUAAAUU"

True

## REVC: [Complementing a Strand of DNA](https://rosalind.info/problems/revc/)

In [55]:
def revc(dna_strand):
  """
  Solves: https://rosalind.info/problems/revc/
  Given: A DNA string dna_strand of length at most 1000 nt.
  Return: The reverse complement of dna_strand
  """
  reverse_strand = dna_strand[::-1]
  complement_dict = {"A":"T", "T":"A", "C":"G", "G":"C"}
  complementary_strand = "".join([complement_dict[nt] for nt in reverse_strand])

  return complementary_strand

In [56]:
revc_sample = "AAAACCCGGT"
revc(revc_sample) == "ACCGGGTTTT"

True
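
As an illustrative alternative to the dictionary lookup, the reverse complement can also be built with a translation table. `revc_translate` is a hypothetical name used only for this sketch, and it additionally uppercases its input.

```python
# translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revc_translate(dna_strand):
    """Reverse complement via str.translate (illustrative alternative)."""
    # complement every nucleotide, then reverse the whole string
    return dna_strand.upper().translate(COMPLEMENT)[::-1]


print(revc_translate("AAAACCCGGT"))  # ACCGGGTTTT
```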

## GC: [Computing GC Content](https://rosalind.info/problems/gc/)

In [63]:
def gc(dna_fasta):
  """
  Solves: https://rosalind.info/problems/gc/
  Given: DNA strings in FASTA format.
  Return: A string that consists of: The ID of the sequence having the highest GC-content, followed by the GC-content of that string [%]
  """
  seq_df = parse_fasta(dna_fasta)

  def gc_proc(seq):
    return (seq.count("G")+seq.count("C"))/len(seq)*100

  seq_df['gc_content'] = [gc_proc(seq) for seq in seq_df['seq']]

  max_gc = seq_df[seq_df['gc_content'] == seq_df['gc_content'].max()]

  return max_gc['name'].values[0] +'\n'+ str(max_gc['gc_content'].values[0])


In [65]:
gc_sample = '>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC >Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT'
print(gc(gc_sample))


Rosalind_0808
60.91954022988506


## HAMM: [Counting Point Mutations](https://rosalind.info/problems/hamm/)

In [68]:
def hamm(seq, seq_alt):
  """
  Solves https://rosalind.info/problems/hamm/
  Given: Two DNA strings seq and seq_alt of equal length.
  Return: The Hamming distance dH(seq, seq_alt) as a string, which corresponds to the number of point mutations
  """
  seq = seq.upper()
  seq_alt = seq_alt.upper()
  mutation_count = sum([seq[n]!=seq_alt[n] for n in range(len(seq))])
  return str(mutation_count)

In [69]:
hamm_sample1 = 'GAGCCTACTAACGGGAT'
hamm_sample2 = 'CATCGTAATGACGGCCT'
hamm(hamm_sample1, hamm_sample2) == '7'

True
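
The same count can be written by pairing the two strings with `zip`. This is a sketch under the assumption that unequal lengths should raise an error; `hamming_distance` is a hypothetical helper name, not a function from this notebook.

```python
def hamming_distance(seq, seq_alt):
    """Number of positions at which two equal-length strings differ."""
    if len(seq) != len(seq_alt):
        raise ValueError("sequences must have equal length")
    # zip pairs up the nucleotides position by position
    return sum(a != b for a, b in zip(seq.upper(), seq_alt.upper()))


print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```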

## SUBS: [Finding a Motif in DNA](https://rosalind.info/problems/subs/)

In [70]:
def subs(seq, motif):
  """
  Solves: https://rosalind.info/problems/subs/
  Given: two DNA strings: seq and motif
  Return: All locations of motif in seq as a string
  """
  motif_length = len(motif)
  position_list = []
  # the last possible start index is len(seq) - motif_length, so include it
  for nt in range(len(seq) - motif_length + 1):
    if seq[nt:nt+motif_length] == motif:
      # Rosalind positions are 1-based
      position_list.append(str(nt+1))
  return " ".join(position_list)

In [71]:
subs_sample_seq = "GATATATGCATATACTT"
subs_sample_motif = "ATAT"

print(subs(subs_sample_seq, subs_sample_motif))

2 4 10
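
Overlapping occurrences can also be located with a zero-width lookahead from the standard `re` module. This is a sketch of an alternative technique; `subs_regex` is a hypothetical name.

```python
import re


def subs_regex(seq, motif):
    """1-based start positions of motif in seq, overlaps included."""
    # a lookahead matches without consuming characters,
    # so overlapping occurrences are all reported
    pattern = re.compile(f"(?={re.escape(motif)})")
    return " ".join(str(match.start() + 1) for match in pattern.finditer(seq))


print(subs_regex("GATATATGCATATACTT", "ATAT"))  # 2 4 10
```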


## GRPH: [Overlap Graphs](https://rosalind.info/problems/grph/)

In [74]:
def grph(fasta, k=3):
  """
  Solves: https://rosalind.info/problems/grph/
  Given: A collection of DNA strings in FASTA format.
  Return: The adjacency list corresponding to O3 as a string
  """

  name_list = parse_fasta(fasta)['name']
  seq_list = parse_fasta(fasta)['seq']
  output_string=""

  n_1 = 0
  for sequence1 in seq_list:
    tail = sequence1[-k:]

    n_2 = 0
    for sequence2 in seq_list:
      if sequence1 != sequence2:
        head = sequence2[:k]
        if tail == head:
          output_string = output_string + name_list[n_1] + " " + name_list[n_2] + "\n"
      n_2 += 1
    n_1 += 1

  return(output_string[:-1])

In [75]:
grph_sample = ">Rosalind_0498\nAAAT\nAAA\n>Rosalind_2391\nAAATTTT\n>Rosalind_2323\nTTTTCCC\n>Rosalind_0442\nAAATCCC\n>Rosalind_5013\nGGGTGGG"
print(grph(grph_sample))

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323


## LCSM: [Finding a Shared Motif](https://rosalind.info/problems/lcsm/)

In [78]:
def lcsm(fasta):
  """
  Solves: https://rosalind.info/problems/lcsm/
  Given: A collection of DNA strings (max 100) of length at most 1 kbp each in FASTA format
  Return: A longest common substring of the collection (if more are possible, only one is given)
  """

  def get_common_subseq(seq_0, sequence_list, length):
    for start in range(len(seq_0) - length + 1):
        subsequence = seq_0[start:start+length]
        for sequence in sequence_list:
            if subsequence not in sequence:
                break
        else:
            return subsequence
    return ""

  def get_longest_common_subseq(sequence_list):
    seq_0 = sequence_list.pop(0)
    left = 0
    right = len(seq_0) + 1
    while left + 1 < right:
        mid = int((left + right) / 2)
        if get_common_subseq(seq_0, sequence_list, mid) != "":
            left = mid
        else:
            right = mid
    return get_common_subseq(seq_0, sequence_list, left)

  sequence_list = parse_fasta(fasta)['seq']

  return get_longest_common_subseq(sequence_list)

In [79]:
lcsm_sample = """>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA"""
lcsm(lcsm_sample) in ("AC", "CA", "TA")

True

##REVP: [Locating Restriction Sites](https://rosalind.info/problems/revp/)

In [104]:
def revp(fasta_seq):
  """
  Solves: https://rosalind.info/problems/revp/
  Given: A DNA string in FASTA format
  Return: The position and length of every reverse palindrome in the string having length between 4 and 12
  """

  pos_len_df = pd.DataFrame(columns=['pos', 'len'])
  sequence = parse_fasta(fasta_seq)['seq'][0]

  for motif_length in range(4, 13):
    for nt_index in range(len(sequence)-motif_length+1):
      motif = sequence[nt_index : nt_index + motif_length]
      if motif == revc(motif):
        motif_df = pd.DataFrame({'pos':[(nt_index+1)], 'len':[motif_length]})
        pos_len_df = pd.concat([pos_len_df, motif_df], ignore_index=True)
        #pos_len_dict[nt_index+1] = motif_length

  output_string =""
  sorted_df = pos_len_df.sort_values(by="pos")
  for index, row in sorted_df.iterrows():
    output_string = output_string + str(row['pos']) +" "+ str(row['len']) + " \n"

  return(output_string)

In [105]:
revp_sample = ">Rosalind_24\nTCAATGCATGCG\nGGTCTATATGCAT"
print(revp(revp_sample))

4 6 
5 4 
6 6 
7 4 
17 4 
18 4 
20 6 
21 4 



##PROT: [Translating RNA into Protein](https://rosalind.info/problems/prot/)

In [106]:
def prot(RNA_seq):
  """
  Solves: https://rosalind.info/problems/prot/
  Given: An RNA string corresponding to a strand of mature mRNA
  Return: The protein string encoded by the mRNA
  """
  genetic_code = {
    'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
    'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
    'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
    'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
    'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
    'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
    'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
    'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'
  }
  protein=''
  n=0
  for nt in RNA_seq[:-2:3]:
    codon = RNA_seq[n: n+3]
    translation = genetic_code[codon]
    if translation == 'Stop':
      break
    protein += translation
    n+=3
  return protein


In [None]:
prot_sample = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
prot(prot_sample)

'MAMAPRTEINSTRING'

## MPRT: [Finding a Protein Motif](https://rosalind.info/problems/mprt/)


In [107]:
def mprt(UniProtIDs):
  """
  Solves: https://rosalind.info/problems/mprt/
  Given: At most 15 UniProt Protein Database access IDs
  Return: For each protein possessing the N-glycosylation motif,
          output its given access ID followed by a list of locations
          in the protein string where the motif can be found
  """

  name_list = UniProtIDs.split()
  name_seq_dict = {}
  output_string = ""

  import requests as rq

  #get a name:sequence dictionary
  for name in name_list:
    id = name.split("_")[0]  # the accession is the part before the first underscore

    url = f"http://www.uniprot.org/uniprot/{id}.fasta"
    response = rq.get(url)
    fasta_string = response.text

    fasta_list = fasta_string.split("\n")
    fasta_list = fasta_list[1:]

    sequence = "".join(fasta_list)
    name_seq_dict[name] = sequence

  name_Nglyc_dict = {}

  for protein in name_seq_dict:
    name_Nglyc_dict[protein] = ""
    n = 0
    prot_seq = name_seq_dict[protein]
    for aminoacid in prot_seq[:-3]:  # the motif spans 4 residues, so the last valid start is len - 4
      if aminoacid == "N":
        if (prot_seq[n+1] != "P") and ((prot_seq[n+2] == "S") or (prot_seq[n+2] =="T")) and (prot_seq[n+3] != "P"):
          name_Nglyc_dict[protein] = name_Nglyc_dict[protein] + str(n+1) +" "
      n+=1

  for name, positions in name_Nglyc_dict.items():
    if positions != '':
      output_string = output_string + name + "\n"
      output_string = output_string + positions[:-1] +"\n"

  output_string = output_string[:-1]
  return output_string


In [108]:
mprt_sample = """
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
"""
print(mprt(mprt_sample))

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
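
The N-glycosylation motif `N{P}[ST]{P}` can also be expressed as a regular expression. The sketch below works on a plain protein string and needs no network access; `n_glyc_positions` is a hypothetical helper, and the test strings are made up for illustration.

```python
import re

# N, then anything but P, then S or T, then anything but P;
# wrapped in a lookahead so overlapping motifs are all found
N_GLYC_MOTIF = re.compile(r"(?=N[^P][ST][^P])")


def n_glyc_positions(prot_seq):
    """1-based start positions of the N-glycosylation motif."""
    return [match.start() + 1 for match in N_GLYC_MOTIF.finditer(prot_seq)]


print(n_glyc_positions("NNTSA"))     # [1, 2]
print(n_glyc_positions("NPSANASP"))  # []
```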


## PRTM: [Calculating Protein Mass](https://rosalind.info/problems/prtm/)

In [111]:
def prtm(prot_seq):
  """
  Solves: https://rosalind.info/problems/prtm/
  Given: A protein string prot_seq of length at most 1000 aa.
  Return: The total weight of the protein according to the
          monoisotopic weights given here:
          https://rosalind.info/glossary/monoisotopic-mass-table/
  """
  aa_mass = {
    'A': 71.03711,
    'C': 103.00919,
    'D': 115.02694,
    'E': 129.04259,
    'F': 147.06841,
    'G': 57.02146,
    'H': 137.05891,
    'I': 113.08406,
    'K': 128.09496,
    'L': 113.08406,
    'M': 131.04049,
    'N': 114.04293,
    'P': 97.05276,
    'Q': 128.05858,
    'R': 156.10111,
    'S': 87.03203,
    'T': 101.04768,
    'V': 99.06841,
    'W': 186.07931,
    'Y': 163.06333
  }

  prot_seq = prot_seq.replace("\r", "")
  prot_seq = prot_seq.replace("\n", "")

  molecular_weight = sum([aa_mass[aa] for aa in prot_seq])

  return str(round(molecular_weight, 3))

In [112]:
prtm_sample = "SKADYEK"
print(prtm(prtm_sample))

821.392


## SPLC: [RNA Splicing](https://rosalind.info/problems/splc/)

In [115]:
def splc(fasta):
  """
  Solves: https://rosalind.info/problems/splc/
  Given: A DNA string (of length at most 1 kbp) and a collection of
         its substrings acting as introns. All strings are given in FASTA format
  Return: A protein string resulting from transcribing and translating
          the exons of the given DNA sequence
  """

  DNA_seq = parse_fasta(fasta).iloc[0,1]
  introns = parse_fasta(fasta).iloc[1:,1]

  for intron in introns:
    DNA_seq = DNA_seq.replace(intron, "")

  mRNA_seq = rna(DNA_seq)

  return prot(mRNA_seq)

In [116]:
splc_sample = """
>Rosalind_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Rosalind_12
ATCGGTCGAA
>Rosalind_15
ATCGGTCGAGCGTGT
"""

splc(splc_sample)

'MVYIADKQHVASREAYGHMFKVCA'

##ORF: [Open Reading Frames](https://rosalind.info/problems/orf/)

In [137]:
def orf(fasta):
  """
  Solves: https://rosalind.info/problems/orf/
  Given: A DNA string of length at most 1 kbp in FASTA format
  Return: Every distinct candidate protein string that can be translated from ORFs
  """

  sequence = parse_fasta(fasta).iloc[0,1]
  rev_compl = revc(sequence)

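  stop_codons = ("TAA", "TAG", "TGA")

  def has_stop(dna):
    # True if an in-frame stop codon follows the first codon of dna
    return any(dna[i:i+3] in stop_codons for i in range(0, len(dna) - 2, 3))

  prot_seq = []

  # scan both strands; an ORF must start with ATG and end with a stop codon
  for strand in (sequence, rev_compl):
    for n in range(len(strand)-2):
      if strand[n:n+3] == "ATG" and has_stop(strand[n:]):
        prot_seq.append(prot(rna(strand[n:])))

  # dict.fromkeys drops duplicates while keeping the first-seen order
  return "\n".join(dict.fromkeys(prot_seq))

In [138]:
orf_sample = """
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
"""

print(orf(orf_sample))

M
MGMTPRLGLESLLE
MTPRLGLESLLE
MLLGSFRLIPKETLIQVAGSSPCNLS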


##MRNA: [Inferring mRNA from Protein](https://rosalind.info/problems/mrna/)

In [None]:
def mrna(protein_seq):
  """Solves: https://rosalind.info/problems/mrna/

  Given:  A protein string of length at most 1000 aa.
  Return: The total number of different RNA strings
          from which the protein could have been translated,
          modulo 1,000,000. Stop codons are included.
  """

  genetic_code = {
        'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
        'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
        'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
        'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
        'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
        'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
        'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
        'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
        'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
        'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
        'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
        'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
        'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
        'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
        'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
        'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'
  }
  aminoacid_count = {}
  for aminoacid in genetic_code.values():
    if aminoacid not in aminoacid_count:
      aminoacid_count[aminoacid] = 1
    else:
      aminoacid_count[aminoacid] += 1

  protein_seq = protein_seq.replace("\r", "")
  protein_seq = protein_seq.replace("\n", "")

  combinations = 3
  for aminoacid in protein_seq:
    combinations *= aminoacid_count[aminoacid]

  return str(combinations%1000000)



In [None]:
mrna_sample = "MA"
mrna(mrna_sample)

'12'
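
The result for `"MA"` follows from multiplying codon counts: M has 1 codon, A has 4, and there are 3 stop codons, so 1 × 4 × 3 = 12. The sketch below spells this out with a hard-coded count table and reduces modulo 1,000,000 at every step; `count_mrna` and the constant names are hypothetical.

```python
# number of RNA codons per amino acid in the standard genetic code
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}
STOP_CODONS = 3  # UAA, UAG, UGA


def count_mrna(protein_seq, modulus=1_000_000):
    """Number of RNA strings encoding protein_seq, modulo modulus."""
    total = STOP_CODONS
    for aminoacid in protein_seq.strip():
        # reduce at every step so the running product stays small
        total = (total * CODON_COUNTS[aminoacid]) % modulus
    return total


print(count_mrna("MA"))  # 12
```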

##PERM: [Enumerating Gene Orders](https://rosalind.info/problems/perm/)

In [None]:
def perm(number):
  """Solves: https://rosalind.info/problems/perm/

  Given:  A positive integer n, no larger than 7.
  Return: The total number of permutations of length n,
          followed by a list of such permutations.
  """
  number = int(number)

  def find_permutations(n):
    """
    Helper function that builds all permutations of 1..n
    Given: a positive integer n
    Returns: a list of permutations, each a list of integers
    """
    if n == 1:
      return [[1]]
    else:
        prev_perm_list = find_permutations(n-1)
        permutations = []
        for prev_perm in prev_perm_list:
            for i in range(n):
                current_perm = prev_perm[:i] + [n] + prev_perm[i:]
                permutations.append(current_perm)
        return permutations


  if number == 1:
    return "1\n1"

  perm_list = find_permutations(number)

  output_string = str(len(perm_list)) +'\n'

  for permutation in perm_list:
    output_string += " ".join(str(digit) for digit in permutation) + '\n'

  return output_string.strip()

In [None]:
print(perm(3))

6
3 2 1
2 3 1
2 1 3
3 1 2
1 3 2
1 2 3
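
The recursive result can be cross-checked against the standard library. This sketch uses `itertools.permutations`; `perm_itertools` is a hypothetical name, and the permutations come out in a different order than above.

```python
from itertools import permutations


def perm_itertools(n):
    """Count and list all permutations of 1..n using the standard library."""
    orders = list(permutations(range(1, n + 1)))
    lines = [str(len(orders))]
    lines += [" ".join(map(str, order)) for order in orders]
    return "\n".join(lines)


print(perm_itertools(3))
```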


##IPRB: [Mendel's First Law](https://rosalind.info/problems/iprb/)

In [None]:
def iprb(numbers : str):
  """Solves https://rosalind.info/problems/iprb/

  Given:  Three positive integers k, m, and n, representing a population
          containing k+m+n organisms: k individuals are homozygous dominant
          for a factor, m are heterozygous, and n are homozygous recessive.
  Return: The probability that two randomly selected mating organisms will
          produce an individual possessing a dominant allele (and thus displaying
          the dominant phenotype). Assume that any two organisms can mate.
  """

  number_list = numbers.split()
  k = int(number_list[0])
  m = int(number_list[1])
  n = int(number_list[2])
  pop = k + m + n

  prob_AA_AA = k/pop * (k-1)/(pop-1)
  prob_AA_Aa = k/pop * m/(pop-1)
  prob_AA_aa = k/pop * n/(pop-1)

  prob_Aa_AA = m/pop * k/(pop-1)
  prob_Aa_Aa = m/pop * (m-1)/(pop-1) * 0.75
  prob_Aa_aa = m/pop * n/(pop-1) * 0.5

  prob_aa_AA = n/pop * k/(pop-1)
  prob_aa_Aa = n/pop * m/(pop-1) * 0.5

  total_dom_offspring_prob = prob_AA_AA + prob_AA_Aa + prob_AA_aa + prob_Aa_AA + prob_Aa_Aa + prob_Aa_aa + prob_aa_AA + prob_aa_Aa

  return(str(round(total_dom_offspring_prob, 5)))

In [None]:
help(iprb)

Help on function iprb in module __main__:

iprb(numbers: str)
    Solves https://rosalind.info/problems/iprb/
    
    Given:  Three positive integers k, m, and n, representing a population 
            containing k+m+n organisms: k individuals are homozygous dominant 
            for a factor, m are heterozygous, and n are homozygous recessive.
    Return: The probability that two randomly selected mating organisms will 
            produce an individual possessing a dominant allele (and thus displaying 
            the dominant phenotype). Assume that any two organisms can mate.



In [None]:
iprb_sample = "2 2 2"
iprb(iprb_sample)

'0.78333'
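
The same probability can be reached through the complement: compute the chance of a homozygous recessive offspring and subtract it from 1. The sketch below is an alternative derivation, and `dominant_probability` is a hypothetical name.

```python
def dominant_probability(k, m, n):
    """P(offspring shows the dominant phenotype) = 1 - P(offspring is aa)."""
    pop = k + m + n
    ordered_pairs = pop * (pop - 1)  # ways to draw two parents in order
    recessive = (
        n * (n - 1)            # aa x aa: offspring is always aa
        + 2 * m * n * 0.5      # Aa x aa in either order: half are aa
        + m * (m - 1) * 0.25   # Aa x Aa: a quarter are aa
    ) / ordered_pairs
    return 1 - recessive


print(round(dominant_probability(2, 2, 2), 5))  # 0.78333
```

For k = m = n = 2 the recessive weight is 2 + 4 + 0.5 = 6.5 out of 30 ordered pairs, giving 1 − 6.5/30 ≈ 0.78333.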

In [None]:
workflow("iprb")

First download a dataset from Rosalind https://rosalind.info/problems/iprb/

Next upload this dataset by clicking the [Choose Files] button below:



Saving 17 rosalind_iprb.txt to 17 rosalind_iprb.txt
0.8001



##IEV: [Calculating Expected Offspring](https://rosalind.info/problems/iev/)

In [None]:
def iev(couple_numbers : str):
  """Solves: https://rosalind.info/problems/iev/

  Given:  Six nonnegative integers, each of which does not exceed 20,000.
          The integers correspond to the number of couples in a population
          possessing each genotype pairing for a given factor. In order,
          the six given integers represent the number of couples having the
          following genotypes: AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa.
  Return: The expected number of offspring displaying the dominant phenotype
          in the next generation, under the assumption that every couple has
          exactly two offspring.
  """
  counts = couple_numbers.split()

  def double(number):
    number = int(number)
    return number*2

  counts = list(map(double, counts))

  counts[3] = counts[3]*0.75    # Aa + Aa
  counts[4] = counts[4]*0.5     # Aa + aa
  counts[5] = 0                 # aa + aa

  return str(sum(counts))


In [None]:
iev_sample = "1 0 0 1 0 1"
iev(iev_sample)

'3.5'

##FIB: [Rabbits and Recurrence Relations](https://rosalind.info/problems/fib/)

In [None]:
def fib(numbers):
  """Solves: https://rosalind.info/problems/fib/

  Given:  Positive integers n <= 40 and k <= 5 as a string.
  Return: The total number of rabbit pairs that will be present after n months,
          if we begin with 1 pair and in each generation, every pair of
          reproduction-age rabbits produces a litter of k rabbit pairs
          (instead of only 1 pair).
  """

  n = int(numbers.split()[0])
  k = int(numbers.split()[1])

  def fib2(n, k):
    # F(1) = F(2) = 1 and F(n) = F(n-1) + k*F(n-2): every pair alive two
    # months ago is of reproduction age and adds k new pairs.
    previous, current = 1, 1
    for _ in range(n - 2):
      previous, current = current, current + k * previous
    return current

  return str(fib2(n, k))

In [None]:
fib_sample = "5 3"

fib(fib_sample) == "19"

True
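
The recurrence behind this problem, worked through for the sample input n = 5, k = 3: each pair alive in the previous month survives, and each pair alive two months earlier produces k new pairs.

$$
F_1 = F_2 = 1, \qquad F_n = F_{n-1} + k\,F_{n-2} \quad (n \ge 3)
$$

$$
F_3 = 1 + 3 \cdot 1 = 4, \qquad F_4 = 4 + 3 \cdot 1 = 7, \qquad F_5 = 7 + 3 \cdot 4 = 19
$$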

In [None]:
workflow("fib")

First download a dataset from Rosalind https://rosalind.info/problems/fib/

Next upload this dataset by clicking the [Choose Files] button below:



Saving 19 rosalind_fib.txt to 19 rosalind_fib.txt
2863311531



##CONS: [Consensus and Profile](https://rosalind.info/problems/cons/)

In [139]:
def cons(fasta):
  """ Solves: https://rosalind.info/problems/cons/

  Given:  A collection of at most 10 DNA strings of equal length (at most 1 kbp)
          in FASTA format.
  Return: A consensus string and profile matrix for the collection.
          (If several possible consensus strings exist, then you may return
          any one of them.)
  """

  seq_list = parse_fasta(fasta).iloc[:,1]
  dna_length = len(seq_list[0])

  matrix = []
  for i in range(dna_length):
    matrix.append([0]*4)

  nt_dict = {"A":0, "C":1, "G":2, "T":3}
  rev_nt_dict = {0:"A", 1:"C", 2:"G", 3:"T"}

  for position in range(dna_length):
    for sequence in seq_list:
      nt = sequence[position]
      matrix[position][nt_dict[nt]] += 1

  text_list =["A:", "\nC:", "\nG:", "\nT:"]

  for position in range(dna_length):
    for n in range(4):
      text_list[n] += " " + str(matrix[position][n])

  output_string = "".join(text_list)
  consensus_sequence = ""

  for position_column in matrix:
    max_index = position_column.index(max(position_column))
    consensus_sequence += rev_nt_dict[max_index]

  output_string = consensus_sequence + "\n"+ output_string

  return output_string

In [140]:
cons_sample = """>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""

print(cons(cons_sample))


ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
