# Simulation of DNA Sequence Evolution

In this assignment, we'll write code to simulate the evolution of a DNA sequence through time, changing exactly one site at each time step.

AACGT -> AAGGT -> ATGGT -> ATGGC -> ...

To do this, we'll need some functions from both the `random` and `copy` modules.

In [41]:
# Import needed modules
import random
import copy

First, we'll need to create a random starting sequence.

In [42]:
# The four nucleotides
nucs = ["A","C","G","T"]

In [43]:
# Length of sequence
seqLen = 24

# Use a for loop and random.choice() to create a random DNA sequence (startSeq) with length seqLen
# startSeq should be a list



# Code here...

In [44]:
startSeq = []
for _ in range(seqLen):
    startSeq.append( random.choice(nucs) )
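
For comparison, the same kind of random starting sequence can be drawn in a single call. This is a minimal sketch, assuming Python 3.6 or later for `random.choices()`; the name `altSeq` and the fixed seed are illustrative and not part of the assignment.

```python
import random

nucs = ["A", "C", "G", "T"]
seqLen = 24

# Seed only so this example is repeatable; omit it for a different sequence each run
random.seed(1)

# random.choices() samples with replacement and returns a list of length k
altSeq = random.choices(nucs, k=seqLen)
print("".join(altSeq))
```

The for-loop version above does the same job one nucleotide at a time, which is what the assignment asks for.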

In [45]:
# Set the number of time steps to simulate
steps = 20

In [46]:
# Create a list to hold the set of sequences from each time step in the simulation
seqs = []
seqs.append( startSeq )

In [47]:
# Use a for loop to simulate changes to the sequence across time steps
for _ in range(steps):
    # Randomly pick an index in your sequence to change
    # Code here...
    select = random.randrange(seqLen)  # valid index for any seqLen, not only 24
    
    # Create a copy of the last sequence in seqs using copy.copy()
    # Code here...
    newSeq = copy.copy(seqs[-1])
    # Draw a new nucleotide and make sure that it differs from the one currently at that position
    # Hint: a while loop would be very useful here
    # Code here...
    newNuc = random.choice(nucs)
    while newNuc == newSeq[select]:
        newNuc = random.choice(nucs)
    # Update the randomly chosen position in the new sequence with the new nucleotide
    # Code here...
    newSeq[select] = newNuc
    # Add the updated sequence to the seqs list
    # Code here...
    seqs.append(newSeq)
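
The copy step matters because assigning a list to a new name does not duplicate it. The sketch below illustrates the difference on a short, made-up list; the names `original`, `alias`, and `shallow` are hypothetical and used only for this demonstration.

```python
import copy

original = ["A", "C", "G", "T"]

alias = original                # a second name for the same list object
shallow = copy.copy(original)   # a new list holding the same elements

alias[0] = "T"    # also changes original, because both names share one list
shallow[1] = "G"  # changes only the copy

print(original)  # ['T', 'C', 'G', 'T']
print(shallow)   # ['A', 'G', 'G', 'T']
```

Without the copy, every entry in `seqs` would refer to one and the same list, and all time steps would show the final sequence. A shallow copy is enough here because the list elements are immutable strings.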

In [48]:
# Print all sequences, including the starting sequence
for seq in seqs:
    print(seq)

['T', 'T', 'A', 'C', 'C', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'A', 'T', 'T', 'A', 'A', 'A', 'A', 'T']
['T', 'T', 'T', 'C', 'C', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'A', 'T', 'T', 'A', 'A', 'A', 'A', 'T']
['T', 'T', 'T', 'C', 'C', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'A', 'T', 'C', 'A', 'A', 'A', 'A', 'T']
['T', 'T', 'T', 'C', 'T', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'A', 'T', 'C', 'A', 'A', 'A', 'A', 'T']
['T', 'T', 'T', 'C', 'T', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'A', 'A', 'A', 'A', 'T']
['T', 'T', 'T', 'C', 'T', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'A', 'A', 'A', 'C', 'T']
['T', 'T', 'T', 'C', 'T', 'A', 'G', 'A', 'T', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'T', 'A', 'A', 'C', 'T']
['T', 'T', 'T', 'C', 'T', 'A', 'G', 'A', 'G', 'T', 'T', 'T', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'T', 'A', 'A', 'C', 'T']
['T', 'T', 'T', 'C', 'T', 'A', '

In [49]:
# Print the starting sequence as a string
startSeqStr = ""
for n in startSeq:
    startSeqStr += n
print(startSeqStr)

# Print the ending sequence as a string
endSeq = seqs[-1]
lastSeqStr = ""
for n in endSeq:
    lastSeqStr += n
print(lastSeqStr)

# Print the Hamming distance between the starting and ending sequences
dist = 0
for i in range( len(startSeq) ):
    if startSeq[i] != endSeq[i]:
        dist += 1

print( "Hamming distance between first and last sequences is: " + str(dist) )

TTACCAGATTTTCACTATTAAAAT
TTACTAGGGTTACAGTTAGTACTT
Hamming distance between first and last sequences is: 11


The Hamming distance between two sequences is simply the number of nucleotide positions at which they differ. What is the Hamming distance between your starting and ending sequences? How does this value compare if you re-run the simulation several times, first with a small number of steps and next with a much larger number of steps?

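Answer: Across repeated runs with the same settings, the Hamming distance varies by a few positions. It can never exceed the number of steps or the sequence length. With a small number of steps, the distance stays close to the number of steps, because most changes land on different sites. With a much larger number of steps, the same sites are hit repeatedly and some change back to their original nucleotide, so the distance stops growing and levels off at roughly three quarters of the sequence length (about 18 for a length of 24).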
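
One way to explore this question is to repeat the simulation many times and average the distance for several step counts. The sketch below is one possible setup, not part of the assignment: `simulate()`, `hamming()`, `reps`, and the chosen step counts are illustrative names and values.

```python
import random

nucs = ["A", "C", "G", "T"]

def simulate(seqLen, steps):
    """Return (start, end) sequences after `steps` single-site changes."""
    start = [random.choice(nucs) for _ in range(seqLen)]
    seq = list(start)
    for _ in range(steps):
        site = random.randrange(seqLen)
        # Choose only among the three nucleotides that differ from the current one
        seq[site] = random.choice([n for n in nucs if n != seq[site]])
    return start, seq

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

random.seed(0)  # fixed seed so repeated runs of this example agree
reps = 200
for steps in (5, 20, 200, 2000):
    mean = sum(hamming(*simulate(24, steps)) for _ in range(reps)) / reps
    print(steps, round(mean, 1))
```

Each call builds its own starting sequence, so the averages describe the process rather than one particular start.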

In [50]:
# Here is a dictionary that translates between codons in a DNA sequence and amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

# Translate your ending DNA sequence into an amino acid sequence
# Make sure your sequence length is a multiple of 3!

# Code here...
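# textwrap.wrap() splits the string into 3-character chunks (the string contains no whitespace)
from textwrap import wrap
aminos = []
if len(lastSeqStr) % 3 == 0:
    codons = wrap(lastSeqStr, 3)
    for codon in codons:
        aminos.append(gencode[codon])
    print(aminos)
else:
    print("Sequence length error")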


['L', 'L', 'G', 'L', 'Q', 'L', 'V', 'L']
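
An alternative that avoids `textwrap` is to step through the string three characters at a time with slicing. This is a sketch under stated assumptions: `translate()` and `codonSubset` are hypothetical names, and `codonSubset` contains only the codons found in the example ending sequence printed above, not the full genetic code.

```python
# Subset of the codon table, limited to the codons in the example sequence
codonSubset = {"TTA": "L", "CTA": "L", "GGG": "G", "CAG": "Q", "GTA": "V", "CTT": "L"}

def translate(dna, table):
    """Translate a DNA string codon by codon; the length must be a multiple of 3."""
    if len(dna) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of 3")
    # Step through the string three characters at a time
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "".join(table[c] for c in codons)

print(translate("TTACTAGGGTTACAGTTAGTACTT", codonSubset))  # LLGLQLVL
```

In the notebook itself, the full `gencode` dictionary would be passed in place of `codonSubset`.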
