# Bioinformatics challenges with Python
### Reachout for Healthcare Science (King's Health Partners, June 2018)

In genetics labs, bioinformaticians write programs that help diagnose diseases from a patient's DNA sequence. The sequence is stored in digital files, with the DNA bases represented by the letters 'A', 'T', 'G' and 'C'. Below are some bioinformatics head-scratchers. Try to fix the errors in the code!

### Run the cell below before attempting the challenges

In [2]:
# This cell contains functions and variables used by the challenges.

answers = ['TATATCGGATAAATACGCGTATAACCG', 0.35, 'UACAUCAGCGCUAGAUAGCGACUACUAUCGAGUCAUAUAGGAUCUAGGCUAU']

def test(result, answer):
    if result == answer:
        print('Code complete! Correct answer: {}'.format(answer))
    else:
        print('Not quite! Your answer is incorrect: {}'.format(result))

gencode = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}        

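def transcribe(seq):
    # Despite its name, this helper performs translation: it converts a DNA coding
    # sequence into a protein string (one letter per amino acid, '_' for a stop codon).
    # It returns None if the sequence is not valid.
    output = []
    # .get() avoids a KeyError when the first or last codon is missing or contains invalid symbols.
    if (len(seq)%3 != 0) or (gencode.get(seq[0:3])!='M') or (gencode.get(seq[-3:])!='_'):
        print('Error! This is not a valid RNA sequence. Have you chosen correct start and stop codons?')
        return None
    # Read the sequence three bases (one codon) at a time.
    for i in range(0, len(seq), 3):
        try:
            output.append(gencode[seq[i:i+3]])
        except KeyError:
            print('Error! Your DNA molecule should only contain symbols for Thymine, Adenine, Cytosine or Guanine.')
            # Return None instead of calling exit(), which would shut down the notebook kernel.
            return None
    return "".join(output)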

## 1. Complementary base pairing
The bases on the two strands of DNA are linked by hydrogen bonds and follow complementary base pairing. This means Adenine always pairs with Thymine, while Guanine always pairs with Cytosine. We can represent the double-stranded molecule as follows:
```
Strand 1: ATCGCTCGATCGATCATCATAT
          ||||||||||||||||||||||
Strand 2: TAGCGAGCTAGCTAGTAGTATA
```
Strand 2 is known as the complementary strand. Follow the comments to fix the code below and find the complementary strand of the sequence.

In [5]:
# The code in this box finds the complementary strand of a DNA sequence.
# Follow the instructions in comments like this to fix it.

def complement(strand1):
    strand2 = []
    # This function calculates the complement strand for a sequence. The line below works as a lookup table,
    # where 1,2,3,4 should be replaced with the correct complementary DNA bases (in quotes).
    base_pair = {'A':'1', 'G':'2', 'C':'3', 'T':'4'}
    for base in strand1:
        strand2.append(base_pair[base])
    return "".join([str(i) for i in strand2])

seq = 'ATATAGCCTATTTATGCGCATATTGGC'
test(complement(seq), answers[0])

Not quite! Your answer is incorrect: 141412334144414232314144223
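
Once you have attempted the challenge, the pairing rule can be checked against the two strands drawn at the start of this section. The following is a minimal sketch rather than the intended solution: the names `PAIRS` and `complement_sketch` are made up for illustration, and it uses Python's built-in `str.maketrans` and `str.translate` instead of a loop over a dictionary.

```python
# Illustrative sketch: PAIRS and complement_sketch are hypothetical names,
# not part of the notebook's helper cell.

# Translation table mapping each base to its partner (A<->T, G<->C).
PAIRS = str.maketrans("ATGC", "TACG")

def complement_sketch(strand):
    """Return the complementary strand, base by base (no reversal)."""
    return strand.translate(PAIRS)

strand1 = "ATCGCTCGATCGATCATCATAT"
strand2 = complement_sketch(strand1)
print(strand2)  # matches Strand 2 in the diagram for this section

# Complementing twice should give back the original strand.
assert complement_sketch(strand2) == strand1
```

Note that `str.translate` leaves unrecognised characters unchanged, whereas the dictionary lookup in `complement` raises a `KeyError` for them.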


## 2. GC Content
The GC content is the proportion of bases in the sequence that are **'G'** or **'C'**. G-C pairs are held together by three hydrogen bonds and A-T pairs by two, so GC content can indicate the stability of a DNA molecule. The human genome has both GC-rich and GC-poor regions, but the overall GC content can still be used as one measure of the quality of a human sequence.

Follow the comments to fix the code below and calculate the GC content of the sequence.

In [90]:
# The code below calculates the GC content of a given sequence.
# Follow the instructions in comments like this to fix it.

def gccontent(seq):
    # Change the line below so the correct bases are counted for GC content
    return ((seq.count('G')/seq.count('A'))/len(seq))

# Edit the DNA sequence below until it has a GC content of 0.35. Run the code to check your answer.
seq = 'ATATTTATAAATAGCATAGCGC'
test(gccontent(seq), answers[1])

Not quite! Your answer is incorrect: 0.01515151515151515
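
As a worked example of the definition at the start of this section, the sketch below computes the GC content of a short made-up sequence. The function name `gc_fraction` and the example sequence are assumptions for illustration only. The final check uses `math.isclose` because the results of floating-point division are not always safe to compare with `==`.

```python
import math

def gc_fraction(seq):
    """Proportion of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)

example = "GGCCAATTAT"   # 4 of the 10 bases are G or C
result = gc_fraction(example)
print(result)
print(math.isclose(result, 0.4))
```

Counting G and C separately and dividing their sum by the sequence length mirrors the definition directly: bases of interest over total bases.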


## 3. The second Nucleic Acid

Within the nucleus of our cells, DNA is transcribed into RNA, which carries the instructions for building proteins in our bodies. The two molecules are closely related; however, in RNA, Uracil (U) replaces the Thymine (T) found in DNA.

In [6]:
# The code below converts a DNA sequence into its corresponding RNA sequence.
# (The function is named translate, but the biological process is called transcription.)
# Follow the instructions in comments like this to fix it.

def translate(seq):
    # Change the line below so the Thymine bases in the sequence (T) are correctly replaced with Uracil (U)
    # Hint, in seq.replace('1','2'), the base being replaced is '1'.
    return complement(seq).replace('1','2')

seq = 'ATGTAGTCGCGATCTATCGCTGATGATAGCTCAGTATATCCTAGATCCGATA'
test(translate(seq), answers[2])

Not quite! Your answer is incorrect: 2424224323224342432342242242234322424243342224332242
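
After attempting the challenge, it can help to see the two steps separately on a short sequence. This sketch assumes, as the `translate` function above does, that the input is the strand being copied, so the RNA is its complement with U in place of T. `DNA_PAIRS` and `dna_to_rna_sketch` are hypothetical names used only for illustration.

```python
# Hypothetical helper for illustration; not part of the notebook's helper cell.
DNA_PAIRS = str.maketrans("ATGC", "TACG")

def dna_to_rna_sketch(template):
    """Complement the DNA template strand, then write U wherever T appears."""
    complementary_dna = template.translate(DNA_PAIRS)   # step 1: base pairing
    return complementary_dna.replace("T", "U")          # step 2: T becomes U

print(dna_to_rna_sketch("ATGTAG"))  # prints UACAUC
```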


## 4. Transcription and the triplet code

Proteins are synthesised from the genetic code in a process known as **translation**. Ribosomes in our cells read RNA bases in sets of three, known as **codons**. Each codon corresponds to a specific **amino acid**, the building block of a protein. Just like DNA and RNA sequences, we can represent protein sequences as letters, where each letter corresponds to an amino acid.

The triplet code is redundant: different codons can carry the same meaning. For example, three different codons signal STOP, whereas Methionine, which also acts as the start signal, has only one:

```
Methionine (M) = ATG
STOP codons = TAG, TAA or TGA
```

In [7]:
# The code below returns the amino acid sequence encoded by a DNA sequence.
# Follow the instructions in comments like this to fix it.

# Edit the DNA sequence below to create a 'valid' protein sequence.
# Hint: A valid sequence must begin with a start codon (Methionine), end with a stop codon,
# and have a length that is a multiple of three so it can be read as triplets.
DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'

protein = transcribe(DNA)  # call the helper once and reuse the result
if protein:
    print('Code complete! Your protein sequence is:\n{}'.format(protein))

Error! This is not a valid RNA sequence. Have you chosen correct start and stop codons?
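
The checks behind the message above can be sketched step by step. The function names below (`split_codons`, `looks_like_coding_sequence`) are made up for illustration, and the sketch relies only on the start and stop codons listed in this section. It does not replace the notebook's `transcribe` helper.

```python
START_CODON = "ATG"                    # Methionine
STOP_CODONS = {"TAG", "TAA", "TGA"}

def split_codons(dna):
    """Cut a DNA string into consecutive, non-overlapping groups of three bases."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def looks_like_coding_sequence(dna):
    """True if dna is made of whole triplets, starts with ATG and ends with a stop codon."""
    if len(dna) == 0 or len(dna) % 3 != 0:
        return False
    codons = split_codons(dna)
    return codons[0] == START_CODON and codons[-1] in STOP_CODONS

print(split_codons("ATGAAATAG"))                 # ['ATG', 'AAA', 'TAG']
print(looks_like_coding_sequence("ATGAAATAG"))   # True
print(looks_like_coding_sequence("AAAAAAA"))     # False: 7 bases is not a whole number of triplets
```

Checking the length first means the start and stop tests only ever see complete codons.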
