# Simulation of DNA Sequence Evolution

In this assignment, we'll write code to simulate the evolution of a DNA sequence through time, changing the nucleotide at a single site at each time step.

AACGT -> AAGGT -> ATGGT -> ATGGC -> ...

To do this, we'll need some functions from both the `random` and `copy` modules.

In [227]:
# Import needed modules
import random
import copy

First, we'll need to create a random starting sequence.

In [228]:
# The four nucleotides
nucs = ["A","C","G","T"]

In [229]:
# Length of sequence
seqLen = 24

# Use a for loop and random.choice() to create a random DNA sequence (startSeq) with length seqLen
# startSeq should be a list

#startSeq = ...

# Code here...

In [230]:
startSeq = []
for _ in range(seqLen):
    startSeq.append( random.choice(nucs) )
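The loop above can also be written more compactly. The sketch below shows two alternative forms, not a required solution: `random.choices()` draws all `seqLen` nucleotides in one call, and a list comprehension expresses the same idea as the original loop. The seeded `random.Random` instance and the seed value 42 are assumptions added here so that re-running the cell gives the same sequence.

```python
import random

nucs = ["A", "C", "G", "T"]
seqLen = 24

# A dedicated, seeded generator makes the draw reproducible (the seed is arbitrary)
rng = random.Random(42)

# Draw seqLen nucleotides with replacement; choices() returns a list
startSeq = rng.choices(nucs, k=seqLen)

# Same idea as the for loop, written as a list comprehension
startSeq2 = [rng.choice(nucs) for _ in range(seqLen)]

print("".join(startSeq))
print("".join(startSeq2))
```

Both forms produce a list, which matters later because individual positions are updated in place.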

In [231]:
# Set the number of time steps to simulate
steps = 20

In [232]:
# Create a list to hold the set of sequences from each time step in the simulation
seqs = []
seqs.append( startSeq )

In [233]:
# Use a for loop to simulate changes to the sequence across time steps
for x in range(steps):
    # Randomly pick an index in your sequence to change
    # Code here...
    index = random.randrange(seqLen)
    
    # Create a copy of the last sequence in seqs using copy.copy()
    # Code here...
    lastSeq=copy.copy(seqs[-1])
    
    # Draw a new nucleotide and make sure that it differs from the one currently at that position
    # Hint: a while loop would be very useful here
    # Code here...
    nucleotide=lastSeq[index]
    newNuc=random.choice(nucs)
    while newNuc == nucleotide:
        newNuc=random.choice(nucs)
        
    # Update the randomly chosen position in the new sequence with the new nucleotide
    # Code here...
    lastSeq[index]=newNuc
    
    # Add the updated sequence to the seqs list
    # Code here...
    seqs.append(lastSeq)
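The `copy.copy()` call in the loop is what keeps earlier time steps intact. The short demonstration below, using a made-up four-base sequence, contrasts plain assignment (which only creates a second name for the same list) with a shallow copy. The variable names `alias`, `new_seq`, and `after_alias` are illustrative only.

```python
import copy

# Case 1: plain assignment. Both names refer to the same list object.
seqs = [["A", "C", "G", "T"]]
alias = seqs[-1]
alias[0] = "T"
after_alias = list(seqs[0])  # snapshot: the stored sequence was changed too
print(after_alias)

# Case 2: copy.copy() makes a new list, so the stored sequence is untouched.
seqs = [["A", "C", "G", "T"]]
new_seq = copy.copy(seqs[-1])
new_seq[0] = "T"
seqs.append(new_seq)
print(seqs)
```

A shallow copy is enough here because the list elements are strings, which are immutable.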

In [234]:
# Print all sequences: the starting sequence plus one sequence per time step
for seq in seqs:
    print(seq)

['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'T', 'A', 'A', 'A', 'G', 'T', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'A', 'A', 'G', 'T', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'A', 'T', 'G', 'T', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'G', 'T', 'G', 'T', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'G', 'T', 'A', 'T', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'G', 'T', 'A', 'C', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'C', 'A', 'A', 'G', 'T', 'C', 'A', 'A', 'T', 'A', 'C', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', 'G', 'T', 'C', 'G', 'T', 'A', 'A', 'G', 'T', 'C', 'A', 'A', 'T', 'A', 'C', 'G', 'C', 'G']
['T', 'A', 'G', 'A', 'G', 'G', '

In [235]:
# Print the starting sequence as a string
startSeqStr = "".join(startSeq)
print(startSeqStr)

# Print the ending sequence as a string
endSeq = seqs[-1]
lastSeqStr = "".join(endSeq)
print(lastSeqStr)

# Print the Hamming distance between the starting and ending sequences
dist = 0
for i in range(len(startSeq)):
    if startSeq[i] != endSeq[i]:
        dist += 1

print("Hamming distance between first and last sequences is: " + str(dist))

TAGAGGGTCGCAAGTTAAAGTGCG
CAGACGGTACTACCTCTGGGACCC
Hamming distance between first and last sequences is: 14


The Hamming distance between two sequences is simply the number of nucleotide positions at which they differ. What is the Hamming distance between your starting and ending sequences? How does this value compare if you re-run the simulation several times, first with a small number of steps and next with a much larger number of steps?
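As a small illustration of the definition, the Hamming distance can be packaged as a reusable function. This is a sketch, and `hamming_distance` is a name chosen here rather than part of the assignment. It is checked against the example sequences from the introduction and the start and end sequences printed above.

```python
def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    # zip pairs up the nucleotides position by position; True counts as 1
    return sum(a != b for a, b in zip(seq_a, seq_b))

# First and last sequences from the introductory example
print(hamming_distance("AACGT", "ATGGC"))  # 3

# Start and end sequences from the run shown above
print(hamming_distance("TAGAGGGTCGCAAGTTAAAGTGCG",
                       "CAGACGGTACTACCTCTGGGACCC"))  # 14
```

The function works on both strings and lists of single-character strings, since it only iterates over the elements.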

In [236]:
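# Results: steps=20, Hd=11; re-run with steps=10, Hd=7; re-run with steps=40, Hd=16; re-run with steps=100, Hd=21
# In general, the Hamming distance increases with the number of steps, but it can never exceed the sequence length (24 here),
# and it grows more slowly once the same sites start being hit more than once.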
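Single runs are noisy, so one way to compare step counts is to average the Hamming distance over many replicate simulations. The sketch below is one possible setup, not the assignment's required approach. The helper names `simulate` and `hamming_distance`, the seed, the number of replicates, and the list of step counts are all assumptions made for illustration. The new nucleotide is drawn from the three alternatives directly, which replaces the `while` loop used earlier.

```python
import random

NUCS = ["A", "C", "G", "T"]

def simulate(start_seq, steps, rng):
    """Apply `steps` single-site substitutions and return the final sequence."""
    seq = list(start_seq)  # work on a copy so the starting sequence is untouched
    for _ in range(steps):
        index = rng.randrange(len(seq))
        # Choose among the three other nucleotides, so no while loop is needed
        seq[index] = rng.choice([n for n in NUCS if n != seq[index]])
    return seq

def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

rng = random.Random(1)   # arbitrary seed, for reproducibility
seq_len = 24
replicates = 200         # arbitrary number of repeat simulations
mean_dist = {}

for steps in (10, 20, 40, 100, 1000):
    total = 0
    for _ in range(replicates):
        start = rng.choices(NUCS, k=seq_len)
        total += hamming_distance(start, simulate(start, steps, rng))
    mean_dist[steps] = total / replicates
    print(steps, round(mean_dist[steps], 2))
```

Under this model the mean distance should rise quickly at first and then level off at roughly three quarters of the sequence length (about 18 for 24 sites), because a site that is hit repeatedly can change back to its original nucleotide.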

In [237]:
# Here is a dictionary that translates between codons in a DNA sequence and amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

# Translate your ending DNA sequence into an amino acid sequence
# Make sure your sequence length is a multiple of 3!

# Code here...

# Make sure the sequence length is a multiple of 3 before translating
if len(lastSeqStr) % 3 == 0:

    # Create a list to store the codons
    codons = []

    # Take nucleotides 1 to 3, 4 to 6, 7 to 9, etc. and add each set of three (a codon) to the list
    for x in range(0, len(lastSeqStr), 3):
        codons.append(lastSeqStr[x : x + 3])

    # Create an empty string to store the translated amino acid sequence
    aaStr = ""

    # Look up each codon in gencode and append its amino acid to aaStr
    for codon in codons:
        aaStr += gencode[codon]
    print(aaStr)
else:
    print("Sequence length is not a multiple of 3; cannot translate.")

QTVLPLGP
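The translation step can also be wrapped in a function so it can be reused on any sequence. The sketch below is one possible version under a few assumptions: the names `translate` and `CODON_TABLE` are hypothetical, the codon table is rebuilt compactly from the standard genetic code (with `_` marking stop codons, as in `gencode` above) so that the example is self-contained, and a length that is not a multiple of 3 raises an error instead of being silently skipped.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order at each codon position
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY__CC_W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA sequence (string or list of bases) into amino acids."""
    dna = "".join(dna)
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Step through the sequence one codon (three bases) at a time
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

# Ending sequence from the run shown above
print(translate("CAGACGGTACTACCTCTGGGACCC"))  # QTVLPLGP
```

Passing the list form of a sequence, for example `translate(seqs[-1])`, works as well because the function joins its input into a string first.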
