# **Conditionals Homework**
## Drew Honson

Restriction enzymes provide an easy way to clone sequences of interest into plasmid vectors, but each restriction enzyme cuts only at its own specific recognition sequence. Because biologists have discovered and mass-produced hundreds of different restriction enzymes, manually searching a DNA sequence for a useful site is impractical. Most sequence analysis and manipulation software therefore includes a restriction enzyme database and a search engine that finds and annotates the restriction sites within a provided sequence. The program below provides a simple function that identifies the sites of five common enzymes in a user-provided sequence.

The function uses conditionals in two places: the input check that raises the error message, and the enzyme search.
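The two conditionals can be sketched in isolation as follows. This is only a minimal illustration: the helper names `check_bases` and `has_site` and the constant `VALID_BASES` are hypothetical and do not appear in the homework function itself.

```python
# Hypothetical helpers for illustration; not part of the homework function.
VALID_BASES = set("ACGT")

def check_bases(seq):
    """Raise ValueError if seq contains anything other than A, C, G, or T."""
    for base in seq.upper():
        # Conditional 1: reject any letter that is not a DNA base
        if base not in VALID_BASES:
            raise ValueError("There are non-base letters in your sequence")

def has_site(seq, site):
    """Return True if the recognition site occurs anywhere in seq."""
    # Conditional 2: substring membership test on the upper-cased sequence
    if site in seq.upper():
        return True
    return False

check_bases("gaattc")                     # valid input, passes silently
print(has_site("ttGAATTCaa", "GAATTC"))   # checks for an EcoRI site
```

Keeping the validation separate from the search means a malformed sequence is rejected before any enzyme is tested.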

In [10]:
def restriction(x):
    """Searches a sequence for the presence of EcoRI, BamHI, HindIII, XhoI, and NheI
    then prints the enzymes found and the locations of their sites"""
    #Import packages for easy substring indexing and more readable print out
    import re
    import pprint
    
    #Format sequence for program
    seq = x.upper()
    
    #Define the restriction enzyme dictionary (enzyme name: recognition site)
    renzymes = {'EcoRI':'GAATTC', 'BamHI':'GGATCC', 'HindIII':'AAGCTT', 
                'XhoI':'CTCGAG', 'NheI':'GCTAGC'}
    bases = ['A','C','T','G']
    
    #Search the string for non-base letters and return an error if they're found 
    #First Conditional
    for i in seq:
        if i not in bases:
            raise ValueError('There are non-base letters in your sequence')
            
    #Set up empty lists for the enzymes found and their locations
    enz_in_seq = []
    loc = []
    
    #Search the string for the digest site, append the enzyme to enz_in_seq
    #Second Conditional
    for k,v in renzymes.items():
        if v in seq:
            enz_in_seq.append(k)
            
    #Append the start:end location of each digest site for an enzyme to indloc
    #Append the indloc for each enzyme to loc
    for i in enz_in_seq:
        indloc = []
        for m in re.finditer(renzymes[i], seq):
            indloc.append(str(m.start()) + ':' + str(m.end()))
        loc.append(indloc)

    #Make a dictionary of the enzymes and their locations
    endict = dict(zip(enz_in_seq, loc))
    
    #Use pretty print to format the program output
    #(pprint.pprint prints the dictionary and returns None)
    pprint.pprint(endict, width=1)
    

In [12]:
#Trying the program on a small section of one of my vectors
#It detected an NheI site I hadn't accounted for, which I then confirmed
#in another program. Unfortunately, this means I need to change my cloning
#strategy.
restriction('GAATTCTAGAACTGTTTCTGTAGACCAGGTTGGCCTCAAAATCAGAGTTGCTAGCTTCTGCCTCCCCAATACTAGGAGTAAAGCCCCATTGCAAATTCTCCTCGAGATCTGCGATCTAAGTAAGCTTGGCATTCCGGTACTGT')

{'EcoRI': ['0:6'],
 'HindIII': ['121:127'],
 'NheI': ['49:55'],
 'XhoI': ['100:106']}
