# Simulation of DNA Sequence Evolution

#Giovani Hernandez Canchola
#Ass 9

In this assignment, we're going to write code to simulate the evolution of a DNA sequence through time, with one change at a site at each time step.

AACGT -> AAGGT -> ATGGT -> ATGGC -> ...

To do this, we'll need some functions from both the `random` and `copy` modules.

In [None]:
# Import needed modules
import random
import copy

In [None]:
#The four nucleotides
nucs = ["A","C","G","T"]

#Length of sequence
seqLen = 24

# Use a for loop and random.choice() to create a random DNA sequence (startSeq) with length seqLen
# startSeq should be a list
startSeq = []
for _ in range(0,seqLen):
    startSeq.append(random.choice(nucs))

In [None]:
# Set the number of time steps to simulate
steps = 20

# Create a list to hold the set of sequences from each time step in the simulation
seqs = []
seqs.append(startSeq)

In [None]:
# Use a for loop to simulate changes to the sequence across time steps
for _ in range(0,steps):
    # Randomly pick an index in your sequence to change
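    # random.randrange(seqLen) returns an index from 0 to seqLen-1, each equally likely
    # (randint(0,seqLen) minus 1 could give -1, which made the last position twice as likely)
    ranPos = random.randrange(seqLen)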
    
    # Create a copy of the last sequence in seqs using copy.copy()
    lastSeq=copy.copy(seqs[-1])
    
    # Draw a new nucleotide and make sure that it's different than what you started with
    # Hint: a while loop would be very useful here
    newNuc = random.choice(nucs)
    while lastSeq[ranPos] == newNuc:
        newNuc = random.choice(nucs)
    
    # Update the randomly chosen position in the new sequence with the new nucleotide
    lastSeq[ranPos]=newNuc
    
    # Add the updated sequence to the seqs list
    seqs.append(lastSeq)
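
The copy step matters because Python lists are mutable, and plain assignment only creates a second name for the same list. Without the copy, every entry in `seqs` would point to one list and all of them would show the final sequence. A minimal sketch of the difference, using the hypothetical names `original`, `alias`, and `snapshot` (not part of the assignment):

```python
import copy

original = ["A", "C", "G", "T"]

# Plain assignment: both names refer to the same list object
alias = original
alias[0] = "T"
print(original)   # the change made through alias shows up here too

# copy.copy() builds a new list, so later edits leave the source untouched
snapshot = copy.copy(original)
snapshot[1] = "G"
print(original)
print(snapshot)
```

A shallow copy is enough here because the list only holds strings, which are immutable.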

In [None]:
# Print all sequences, one per line, from the starting sequence to the last time step
for i in range(0,steps+1):
    print(seqs[i])

In [None]:
# Print the starting sequence as a string
startSeqStr = ""
for n in startSeq:
    startSeqStr += n
print(startSeqStr)

# Print the ending sequence as a string
lastSeqStr = ""
for n in seqs[len(seqs)-1]:
    lastSeqStr += n
print(lastSeqStr)

# Print the Hamming distance between the starting and ending sequences
dist = 0
endSeq = seqs[-1]
for i in range( len(startSeq) ):
    if startSeq[i] != endSeq[i]:
        dist += 1

print( "Hamming distance between first and last sequences is: " + str(dist) )

The Hamming distance between two sequences is simply the number of nucleotide positions at which they differ. What is the Hamming distance between your starting and ending sequences? How does this value compare if you re-run the simulation several times, first with a small number of steps and next with a much larger number of steps?

-The first time I ran this script (steps=20), the Hamming distance was 10.

-With more steps, the Hamming distance between the first and last sequences tends to be larger, although it can never exceed the sequence length.
If the number of steps is very large, some positions may still show the same nucleotide as in the starting sequence. That does not mean the nucleotide at those positions never changed: it may have changed several times and returned to the original nucleotide.
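
One way to explore this question is to wrap the simulation in a function and average the Hamming distance over many runs. The sketch below is not part of the assignment, and the names `simulate_distance` and `expected_distance` are hypothetical. The closed-form expression assumes the model used above: one uniformly chosen site per step, replaced by one of the three other nucleotides with equal probability.

```python
import random

NUCS = ["A", "C", "G", "T"]

def simulate_distance(seq_len, steps):
    """Run one simulation and return the Hamming distance between start and end."""
    start = [random.choice(NUCS) for _ in range(seq_len)]
    seq = list(start)
    for _ in range(steps):
        pos = random.randrange(seq_len)
        # Draw from the three nucleotides that differ from the current one
        seq[pos] = random.choice([n for n in NUCS if n != seq[pos]])
    return sum(1 for x, y in zip(start, seq) if x != y)

def expected_distance(seq_len, steps):
    """Expected Hamming distance under the one-change-per-step model (an assumption)."""
    return 0.75 * seq_len * (1 - (1 - 4 / (3 * seq_len)) ** steps)

# Compare the simulated average with the expected value for several step counts
for steps in (5, 20, 100, 1000):
    runs = [simulate_distance(24, steps) for _ in range(500)]
    print(steps, sum(runs) / len(runs), round(expected_distance(24, steps), 2))
```

Under this model the expected distance rises quickly at first and then levels off near three quarters of the sequence length (18 for `seqLen = 24`). After many steps each site is equally likely to hold any of the four nucleotides, so it matches the starting nucleotide about one time in four.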

In [None]:
# Here is a dictionary that translates between codons in a DNA sequence and amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

# Translate your ending DNA sequence into an amino acid sequence
# Make sure your sequence length is a multiple of 3!

#Create a list of nucleotides from last sequence
lastSeq=seqs[len(seqs)-1]
#Create an integer variable with the number of codons in the sequence
codLen=seqLen//3
#Create 3 lists, with first, second and third positions, respectively
a=lastSeq[0:seqLen:3]
b=lastSeq[1:seqLen:3]
c=lastSeq[2:seqLen:3]
#Create an empty list that will store triplets
cod=[]
#Generate all triplets (first elements of the "a", "b" and "c" lists, then second elements, etc.)
#In each step, convert the triplet to a string and append it to the "cod" list
for n in range(0,codLen):
    abc=[]
    abc.extend(a[n])
    abc.extend(b[n])
    abc.extend(c[n])
    Mabc = ""
    for m in abc:
        Mabc += m
    cod.append(Mabc)
#Create a list with the translated amino acid of each triplet (using the "gencode" dictionary)
aa=[]
for l in range(0,codLen):
    preAA=gencode[cod[l]]
    aa.append(preAA)
#Print the translated sequence as a string
aaStr = ""
for k in aa:
    aaStr += k
print(aaStr)
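
The same translation can be written more compactly by slicing the sequence three nucleotides at a time. A minimal sketch, assuming a hypothetical helper `translate` and a small stand-in codon table `demo_table`; in the notebook, the full `gencode` dictionary above would be passed instead:

```python
def translate(seq, table):
    """Translate a list (or string) of nucleotides into an amino acid string."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Join each run of three nucleotides into one codon string
    codons = ["".join(seq[i:i + 3]) for i in range(0, len(seq), 3)]
    return "".join(table[codon] for codon in codons)

# Stand-in table holding only the codons used in this example
demo_table = {"ATG": "M", "GCA": "A", "TAA": "_"}
print(translate(list("ATGGCATAA"), demo_table))   # prints MA_
```

With the notebook's variables, the call would be `translate(seqs[-1], gencode)`. The explicit length check turns the "multiple of 3" reminder into an error instead of silently dropping trailing nucleotides.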