# Homework: Strings and Regular Expressions (Solutions)

**Q1**. Use regular expressions to find potential Open Reading Frames (ORFs) in chromosome 1 of Saccharomyces cerevisiae strain S288C (Baker's yeast). Here, we define an ORF to have the following properties:

- START codon is ATG
- STOP codon is TGA, TAA or TAG
- The sequence between START and STOP
    - Must have a length that is a multiple of 3
    - Must contain at least 10 codons
    - Must not contain any other START or STOP codon
    
A codon is a sequence of 3 non-overlapping bases that codes for an amino acid. For example, the first ORF in chromosome 1 of Saccharomyces cerevisiae strain S288C is `ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA` which consists of 21 codons. Each ORF should include the START codon (which codes for methionine) but not the STOP codon.
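As a quick check of the codon definition above, the following minimal sketch splits the example ORF into non-overlapping triplets and counts them. The helper name `split_codons` is hypothetical and not part of the assignment.

```python
def split_codons(orf):
    """Split a DNA string into consecutive, non-overlapping 3-base codons."""
    if len(orf) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]


# The first ORF quoted in the problem statement
first_orf = "ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA"
codons = split_codons(first_orf)

print(len(codons))  # 21
print(codons[0])    # ATG (the START codon)
# No in-frame STOP codon appears inside the ORF
print(any(c in ("TGA", "TAA", "TAG") for c in codons))  # False
```

Note that the substring `TGA` does occur in this ORF (in `AATGAT`), but it straddles two codons, so it does not count as an in-frame STOP.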

- 2.1 Write code to download the FASTA file from `http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa`   (5 points) 
- 2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points) 
- 2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with '>' and do some processing to generate a single DNA sequence. (20 points) 
- 2.4 How many ORFs are found in total? (5 points) 
- 2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this [translation table](http://rosalind.info/glossary/dna-codon-table/). Express the peptide sequence as a string of single-letter codes (SLC), one for each amino acid. What is the peptide sequence for the last ORF found? (50 points)

2.1 Write code to download the FASTA file from `http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa`   (5 points) 

In [1]:
import urllib.request

urllib.request.urlretrieve("http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa", "chr01.fsa")

('chr01.fsa', <http.client.HTTPMessage at 0x7eff0fdcb908>)

2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points) 

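ATG(?:(?:(?!TGA|TAA|TAG).{3}){10,}?)(?=T(?:GA|AA|AG))

Apply this pattern to the sequence after its newline characters have been removed, because `.` does not match a newline by default. The negative lookahead `(?!TGA|TAA|TAG)` rejects in-frame STOP codons only, so a match may still contain additional in-frame ATG codons. The final lookahead `(?=T(?:GA|AA|AG))` requires a STOP codon immediately after the match without including it.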
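To see how the pattern behaves, here is a small sketch on made-up toy sequences (the sequences and the name `ORF_PATTERN` are illustrative assumptions, not data from the chromosome). The trailing lookahead is written as `(?=TGA|TAA|TAG)`, which is equivalent to `(?=T(?:GA|AA|AG))`.

```python
import re

ORF_PATTERN = re.compile(r"ATG(?:(?:(?!TGA|TAA|TAG).{3}){10,}?)(?=TGA|TAA|TAG)")

# ATG + 10 codons + STOP, followed by ATG + 3 codons + STOP (too short to qualify)
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "ATG" + "GCT" * 3 + "TGA"
found = ORF_PATTERN.findall(toy)
print(found == ["ATG" + "GCT" * 10])  # True: only the long ORF is reported

# Only 9 codons after START: rejected by the {10,} quantifier
too_short = "ATG" + "GCT" * 9 + "TAG"
print(ORF_PATTERN.findall(too_short))  # []

# An in-frame ATG inside the ORF is not rejected, because the
# negative lookahead only tests for STOP codons
inner_atg = "ATG" + "GCT" * 5 + "ATG" + "GCT" * 5 + "TAG"
inner_found = ORF_PATTERN.findall(inner_atg)
print(inner_found == [inner_atg[:-3]])  # True: one match, STOP excluded
```

Because the pattern has no capturing groups, `findall` returns the full matched strings, and because matches do not overlap, the inner ATG in the last example does not start a second ORF.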

2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with '>' and do some processing to generate a single DNA sequence. (20 points) 

In [2]:
import re

# Read the FASTA file, skip the comment (header) lines that start with '>',
# and join the remaining lines into a single DNA sequence without newlines
with open('chr01.fsa', 'r') as sequence_data:
    sequenceText = ''.join(line.strip() for line in sequence_data if not line.startswith('>'))

In [3]:
regExpression = re.compile("ATG(?:(?:(?!TGA|TAA|TAG).{3}){10,}?)(?=TGA|TAA|TAG)") # Compiles our regular expression
matches = regExpression.findall(sequenceText) # Finds a list of strings that matches the criteria given in our RegEx
print(matches)

['ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA', 'ATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCAC', 'ATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCA', 'ATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGG', 'ATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGCAATAATACGG', 'ATGATACAATTATATCTTATTTCCATTCCCATATGC', 'ATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCT', 'ATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTG', 'ATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTA', 'ATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCG', 'ATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATT', 'ATGTGGAGTACTGTTTTATGGCGCTTATGT

2.4 How many ORFs are found in total? (5 points) 

In [4]:
print("There are", len(matches), "Open Reading Frames.")

There are 890 Open Reading Frames.


2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this [translation table](http://rosalind.info/glossary/dna-codon-table/). Express the peptide sequence as a string of single-letter codes (SLC), one for each amino acid. What is the peptide sequence for the last ORF found? (50 points)

In [10]:
def DNAtoprotein(toBeTranslated):
    """DNAtoprotein takes a DNA sequence (an ORF) and translates it to a peptide, returned as a string of single-letter amino acid codes"""
    
    codons = []
    # First we need to split the reading frame into codons
    for index in range(0, len(toBeTranslated), 3):
        codons.append(toBeTranslated[index:index+3])
    
    amino_acid_table = {'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TAA': '#',
                       'TAG': '#',  'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y', 'TGT': 'C', 'TGC': 'C', 'TGG': 'W', 'CTT': 'L', 
                       'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CAT': 'H',
                       'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'ATT': 'I',
                       'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 
                       'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GTT': 'V',
                       'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAT': 'D',
                       'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', 'TGA': '#'}
    
    seq=""
    for codon in codons:
        seq+=amino_acid_table[codon]
    
    return seq

print(matches[-1])
print(DNAtoprotein(matches[-1]))

#sequence_data.close()

ATGAGTGGTAGTGAGAGTTGGATAAGATATATTGGGCAGGGGATAGATGGTTGTTGGGGTGTGGTGATGGATAGTGAGTGGATAGTGAGTGGATGGATGGTGGAGTGGGGGAATGAGACAGGGCATGGGGTGGTGAGG
MSGSESWIRYIGQGIDGCWGVVMDSEWIVSGWMVEWGNETGHGVVR
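As an alternative to typing out all 64 entries by hand, the codon table can be generated from the compact 64-letter form of the standard genetic code, listed with bases in T, C, A, G order. This is a sketch rather than the solution's own method; the names `CODON_TABLE` and `translate` are hypothetical, and STOP codons are written as `*` here instead of the `#` used above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate a DNA string codon by codon, ignoring any incomplete trailing codon."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))


# The last ORF reported above
last_orf = (
    "ATGAGTGGTAGTGAGAGTTGGATAAGATATATTGGGCAGGGGATAGATGGTTGTTGGGGTGTGGTGATG"
    "GATAGTGAGTGGATAGTGAGTGGATGGATGGTGGAGTGGGGGAATGAGACAGGGCATGGGGTGGTGAGG"
)
peptide = translate(last_orf)
print(peptide)  # MSGSESWIRYIGQGIDGCWGVVMDSEWIVSGWMVEWGNETGHGVVR
```

Building the dictionary this way avoids transcription errors in individual entries, at the cost of relying on the fixed T, C, A, G ordering of the compact string.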
