# Simulation of DNA Sequence Evolution

In this assignment, we're going to write code to simulate the evolution of a DNA sequence through time, with exactly one site changing at each time step.

AACGT -> AAGGT -> ATGGT -> ATGGC -> ...

To do this, we'll need some functions from both the `random` and `copy` modules.

In [57]:
# Import needed modules
import random
import copy

First, we'll need to create a random starting sequence.

In [58]:
# The four nucleotides
nucs = ["A","C","G","T"]

In [59]:
# Length of sequence
seqLen = 24

In [60]:
startSeq = []
for _ in range(seqLen):
    startSeq.append( random.choice(nucs) )

In [61]:
# Set the number of time steps to simulate
steps = 20

In [62]:
# Create a list to hold the set of sequences from each time step in the simulation
seqs = []
seqs.append( startSeq )

In [63]:
# Use a for loop to simulate changes to the sequence across time steps
for _ in range(steps):
    # Randomly pick an index in the sequence to change (derived from seqLen, not hard-coded)
    index = random.randint(0, seqLen - 1)
    # Copy the last sequence in seqs so earlier time steps are not modified
    newseq = copy.copy(seqs[-1])
    # Look up the nucleotide currently at that index
    nuc = newseq[index]
    # Draw a new nucleotide, re-drawing until it differs from the current one
    newnuc = random.choice(nucs)
    while newnuc == nuc:
        newnuc = random.choice(nucs)
    # Apply the change and add the new sequence to the seqs list
    newseq[index] = newnuc
    seqs.append(newseq)
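
The loop above can also be packaged as a pair of reusable functions. This is only a minimal sketch and not part of the original assignment: the names `mutate_once` and `simulate` and the `rng` parameter are hypothetical. It assumes that choosing the replacement from the three other nucleotides is an acceptable substitute for the re-draw loop, and it uses a seeded `random.Random` instance so that a run can be repeated.

```python
import random

NUCS = ["A", "C", "G", "T"]

def mutate_once(seq, rng):
    """Return a copy of seq with exactly one site changed to a different nucleotide."""
    new_seq = list(seq)  # a shallow copy is enough for a flat list of strings
    index = rng.randrange(len(new_seq))
    # Choose among the three nucleotides that differ from the current one
    alternatives = [n for n in NUCS if n != new_seq[index]]
    new_seq[index] = rng.choice(alternatives)
    return new_seq

def simulate(start_seq, steps, rng):
    """Return the list of sequences: the start plus one new sequence per step."""
    seqs = [list(start_seq)]
    for _ in range(steps):
        seqs.append(mutate_once(seqs[-1], rng))
    return seqs

rng = random.Random(42)  # hypothetical seed, chosen only for repeatability
start = [rng.choice(NUCS) for _ in range(24)]
history = simulate(start, 20, rng)
print(len(history))  # steps + 1 sequences, because the start is included
```

Because each step copies the previous sequence before changing it, every entry in `history` is an independent list, and consecutive entries differ at exactly one position.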

In [64]:
# Print all sequences, including the starting sequence and the final one
for seq in seqs:
    print(seq)

['A', 'T', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'T', 'G', 'C', 'C', 'G', 'A', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'T', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'T', 'G', 'C', 'C', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'T', 'G', 'C', 'C', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'G', 'G', 'T', 'G', 'C', 'C', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'G', 'G', 'T', 'G', 'C', 'T', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'G', 'G', 'G', 'G', 'C', 'T', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'G']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'G', 'G', 'G', 'G', 'C', 'T', 'G', 'T', 'T', 'T', 'A', 'C', 'G', 'C']
['A', 'C', 'C', 'G', 'T', 'A', 'A', 'A', 'T', 'A', 'G', 'G', 'G', 'G', 'C', 'T', 'G', 'A', 'T', 'T', 'A', 'C', 'G', 'C']
['A', 'C', 'C', 'G', 'T', 'A', '

In [65]:
# Print the starting sequence as a string
startSeqStr = ""
for n in startSeq:
    startSeqStr += n
print(startSeqStr)

# Print the ending sequence as a string
lastSeqStr = ""
for n in seqs[-1]:
    lastSeqStr += n
print(lastSeqStr)

# Print the Hamming distance between the starting and ending sequences
dist = 0
endSeq = seqs[-1]
for i in range( len(startSeq) ):
    if startSeq[i] != endSeq[i]:
        dist += 1

print( "Hamming distance between first and last sequences is: " + str(dist) )

ATCGTAAATAAGTGCCGATTACGG
CTAGTGAAAAGGGGGTGAAGGTGG
Hamming distance between first and last sequences is: 12


The Hamming distance between two sequences is simply the number of nucleotide positions at which they differ. What is the Hamming distance between your starting and ending sequences? How does this value compare if you re-run the simulation several times, first with a small number of steps and next with a much larger number of steps?

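With 20 steps, the Hamming distance was 11 in one run (the run shown above gives 12). With only 5 steps it was 3, and with 60 steps it was 21. More steps produce more mutations, so the distance tends to grow with the number of steps. It can be smaller than the number of steps, however, because the same site can be hit more than once or can mutate back to its original nucleotide, and it can never exceed the sequence length of 24.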
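
Since a single run is noisy, one way to make this comparison more robust is to average the distance over many runs for each step count. The sketch below is an illustration rather than part of the assignment: the helper names `hamming` and `distance_after`, the seed, and the choice of 200 repetitions are all assumptions.

```python
import random

NUCS = "ACGT"

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def distance_after(steps, seq_len, rng):
    """Simulate `steps` single-site changes and return the start-to-end distance."""
    start = [rng.choice(NUCS) for _ in range(seq_len)]
    seq = list(start)
    for _ in range(steps):
        i = rng.randrange(seq_len)
        # Replace the nucleotide at site i with one of the three others
        seq[i] = rng.choice([n for n in NUCS if n != seq[i]])
    return hamming(start, seq)

rng = random.Random(0)  # hypothetical seed for repeatability
for steps in (5, 20, 60, 500):
    runs = [distance_after(steps, 24, rng) for _ in range(200)]
    print(steps, sum(runs) / len(runs))
```

Under this model the average distance should rise quickly at first and then level off. Once every site has been hit many times, each site differs from its starting nucleotide with probability 3/4, so the mean distance should approach about 0.75 × 24 = 18 instead of growing indefinitely.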

In [66]:
# Here is a dictionary that translates between codons in a DNA sequence and amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

protein = ''  # empty string that will hold the amino acid sequence
if len(lastSeqStr) % 3 == 0:  # only translate if the sequence length is a multiple of 3
    for x in range(0, len(lastSeqStr), 3):  # step through the sequence one codon (3 nucleotides) at a time
        codon = lastSeqStr[x:x + 3]  # slice out the three nucleotides starting at position x
        protein += gencode[codon]  # append the amino acid for this codon to the protein string
print(protein)  # print the translated protein (empty if the length check failed)

LVKRG_RW
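
The translation step can also be written as a small function. The following is a hedged sketch, not the assignment's required solution: the name `translate` is hypothetical, and the codon table is rebuilt compactly from the standard genetic code (bases in `TCAG` order) with `_` marking stop codons, as in `gencode` above. Unlike the cell above, this version assumes that any trailing nucleotides that do not form a complete codon should simply be ignored.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position;
# '_' marks a stop codon, matching the gencode dictionary above.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring any incomplete final codon."""
    usable = len(dna) - len(dna) % 3  # length of the longest prefix made of whole codons
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("CTAGTGAAAAGGGGGTGAAGGTGG"))  # LVKRG_RW
```

Note that this sketch, like the original cell, translates straight through stop codons instead of ending the protein at the first `_`.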
