# Simulation of DNA Sequence Evolution

In this assignment, we'll write code to simulate the evolution of a DNA sequence through time, with exactly one site changing at each time step.

AACGT -> AAGGT -> ATGGT -> ATGGC -> ...

To do this, we'll need some functions from both the `random` and `copy` modules.

In [1]:
# Import needed modules
import random
import copy

First, we'll need to create a random starting sequence.

In [2]:
# The four nucleotides
nucs = ["A","C","G","T"]

In [3]:
# Length of sequence
seqLen = 24

# Use a for loop and random.choice() to create a random DNA sequence (startSeq) with length seqLen
# startSeq should be a list

startSeq = []
for i in range(seqLen):
    startSeq.append( random.choice(nucs) )
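
The same kind of random sequence can also be built without an explicit `for` loop. The sketch below shows two alternatives, a list comprehension and `random.choices()` (available in Python 3.6 and later). The names `seq_a` and `seq_b` are illustrative only and are not part of the assignment.

```python
import random

nucs = ["A", "C", "G", "T"]
seqLen = 24

# List comprehension: one random.choice() call per position
seq_a = [random.choice(nucs) for _ in range(seqLen)]

# random.choices() draws seqLen items with replacement in a single call
seq_b = random.choices(nucs, k=seqLen)

print("".join(seq_a))
print("".join(seq_b))
```

Both produce a list of single-character strings, so either could stand in for `startSeq` in the cells that follow.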

In [5]:
# Set the number of time steps to simulate
steps = 20

In [6]:
# Create a list to hold the set of sequences from each time step in the simulation
seqs = []
seqs.append( startSeq )

In [8]:
# Use a for loop to simulate changes to the sequence across time steps
for step in range(steps):
    # Randomly pick an index in your sequence to change
    indexChange = random.randrange(seqLen)

    # Create a copy of the last sequence in seqs using copy.copy()
    newSeq = copy.copy(seqs[-1])

    # Draw a new nucleotide and make sure that it's different than what you started with
    # Hint: a while loop would be very useful here
    newNuc = random.choice(nucs)
    while newNuc == newSeq[indexChange]:
        newNuc = random.choice(nucs)

    # Update the randomly chosen position in the new sequence with the new nucleotide
    newSeq[indexChange] = newNuc

    # Add the updated sequence to the seqs list
    seqs.append(newSeq)

    # Print the sequence stored at this step's index
    # (the newest sequence is appended to seqs but not printed here)
    print(seqs[step])

['C', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'A', 'A', 'G', 'G', 'T', 'G', 'A', 'G']
['C', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'A', 'A', 'G', 'G', 'T', 'G', 'C', 'G']
['C', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'A', 'A', 'T', 'G', 'T', 'G', 'C', 'G']
['C', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'A', 'A', 'T', 'G', 'T', 'T', 'C', 'G']
['C', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'T', 'A', 'T', 'G', 'T', 'T', 'C', 'G']
['A', 'C', 'C', 'G', 'G', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'T', 'A', 'T', 'G', 'T', 'T', 'C', 'G']
['A', 'C', 'C', 'G', 'A', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'T', 'A', 'T', 'G', 'T', 'T', 'C', 'G']
['A', 'C', 'C', 'G', 'A', 'T', 'T', 'G', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'T', 'A', 'T', 'G', 'G', 'T', 'C', 'G']
['A', 'C', 'C', 'G', 'A', 'T', '
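
The loop above copies the last sequence with `copy.copy()` before changing it. The minimal sketch below, using a short made-up sequence, illustrates why: plain assignment only creates a second name for the same list, so editing it would also overwrite the sequence already stored in `seqs`. The names `alias` and `new_seq` are illustrative.

```python
import copy

seqs = [["A", "A", "C", "G", "T"]]

# Plain assignment does NOT copy: both names refer to the same list object
alias = seqs[-1]
alias[2] = "G"
print(seqs[0])   # the stored sequence has changed as well

# copy.copy() creates a new list, so edits leave the stored sequence alone
seqs = [["A", "A", "C", "G", "T"]]
new_seq = copy.copy(seqs[-1])
new_seq[2] = "G"
print(seqs[0])   # the stored sequence is unchanged
print(new_seq)   # only the copy carries the change
```

A shallow copy is enough here because the list elements are strings, which cannot be modified in place.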

In [9]:
# Print all sequences -- if needed, you can uncomment the code below
#for seq in seqs:
#    print(seq)

In [10]:
# Print the starting sequence as a string
startSeqStr = "".join(startSeq)
print(startSeqStr)

# Print the ending sequence as a string
endSeq = seqs[-1]
lastSeqStr = "".join(endSeq)
print(lastSeqStr)

# Print the Hamming distance between the starting and ending sequences
dist = 0
for i in range(len(startSeq)):
    if startSeq[i] != endSeq[i]:
        dist += 1

print("Hamming distance between first and last sequences is: " + str(dist))

CCCGGTTGACCGTGCAAAGGTGAG
TGGATGGGTCATAACCTATGGCTG
Hamming distance between first and last sequences is: 18


The Hamming distance between two sequences is simply the number of nucleotide positions at which they differ. What is the Hamming distance between your starting and ending sequences? How does this value compare if you re-run the simulation several times, first with a small number of steps and next with a much larger number of steps?

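My first Hamming distance was 18.
Re-running the simulation several times with a small number of steps gives a small Hamming distance, because the distance can never exceed the number of steps (and it is lower whenever the same site is hit more than once or changes back to its original nucleotide). Re-running with a much larger number of steps gives a distance that no longer grows with the number of steps: it is capped by the length of the sequence (24 here) and in practice fluctuates around three quarters of that length, since a site that has changed many times still matches its original nucleotide about one time in four by chance.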
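
To check this pattern, the sketch below re-runs the simulation many times for several step counts and compares the average Hamming distance with a closed-form expectation. It is a standalone illustration, not part of the assignment: the function names, the seed, and the number of replicates are all assumptions. The formula comes from following a single site. With sequence length L, the probability p that a site matches its original nucleotide obeys p_next = p(1 - 4/(3L)) + 1/(3L), which gives an expected distance of (3/4) L (1 - (1 - 4/(3L))^steps).

```python
import random

NUCS = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def simulate(seq_len, steps, rng):
    """Return (start, end) sequences after `steps` single-site changes."""
    start = [rng.choice(NUCS) for _ in range(seq_len)]
    seq = list(start)
    for _ in range(steps):
        i = rng.randrange(seq_len)
        # Choose uniformly among the three nucleotides that differ from the current one
        seq[i] = rng.choice([n for n in NUCS if n != seq[i]])
    return start, seq

def expected_distance(seq_len, steps):
    """Expected Hamming distance under the one-change-per-step model."""
    return 0.75 * seq_len * (1 - (1 - 4 / (3 * seq_len)) ** steps)

rng = random.Random(1)  # arbitrary seed, used only to make runs repeatable
for steps in (1, 5, 20, 100, 1000):
    dists = [hamming(*simulate(24, steps, rng)) for _ in range(500)]
    mean = sum(dists) / len(dists)
    # Columns: steps, simulated mean distance, expected distance
    print(steps, round(mean, 2), round(expected_distance(24, steps), 2))
```

Choosing directly among the three other nucleotides is equivalent to the `while` loop used earlier, which redraws until the new nucleotide differs from the old one.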


In [11]:
# Here is a dictionary that translates between codons in a DNA sequence and amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

# Translate your ending DNA sequence into an amino acid sequence
# Make sure your sequence length is a multiple of 3!
if len(lastSeqStr) % 3 == 0:  # translate only if the sequence length is a multiple of 3
    # Break the sequence up into codons
    codons = []
    for i in range(0, len(lastSeqStr), 3):
        codons.append(lastSeqStr[i:i+3])
    # Translate each codon into its amino acid
    aminoStr = ""
    for codon in codons:
        aminoStr += gencode[codon]
    print("Amino acid sequence: " + aminoStr)
else:
    print("Sequence length is not a multiple of 3.")

Amino acid sequence: WMGHNLWL
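
If translation is needed for more than one sequence, it can be wrapped in a function. The sketch below is one possible version. To stay self-contained it rebuilds the standard genetic code from a compact 64-letter string in T, C, A, G codon order rather than reusing the `gencode` dictionary; the names `BASES`, `AMINO`, `CODON_TABLE`, and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon order,
# with "_" marking stop codons to match the gencode dictionary above
BASES = "TCAG"
AMINO = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, table=CODON_TABLE):
    """Translate a DNA string into a one-letter amino acid string."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    # Step through the string three letters at a time and look up each codon
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna), 3))

# The ending sequence from the run shown above
print(translate("TGGATGGGTCATAACCTATGGCTG"))  # prints WMGHNLWL
```

Raising an error for a bad length, instead of printing a message, lets the calling code decide how to handle it.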
