# Basic sequence manipulations with Python

In this workshop you will learn how to load a text file containing a DNA sequence, convert it to RNA and translate it to protein. The workshop will build on what you learned in your [first workshop](https://cistron.github.io/splats/python1/).

Make sure you fully read all instructions - **skim reading won't cut it** - and progress code cell by code cell. Some you can just execute, others you will have to edit.

## Downloading the sequence file

We start off by downloading the sequence of the [human adenosine receptor A1 (*ADORA1*)](https://www.ncbi.nlm.nih.gov/nuccore/CR541749). This file contains just the cleaned coding sequence from the linked GenBank entry.

In Jupyter notebooks (such as this one), commands prepended with an exclamation mark (`!`) are executed by the command line. Hence, we can download the data with the `curl` programme. This also means you can easily use other command line programmes, such as `cat`, to inspect the file contents.

**Unfortunately, the code below will only work on Mac and Linux machines. On Windows PCs, please download [cds_adora1.txt](https://cistron.github.io/splats/data/python2/cds_adora1.txt) manually and save it in the same folder as this notebook.**

In [None]:
# use program curl to download data
!curl -O https://cistron.github.io/splats/data/python2/cds_adora1.txt
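
As an alternative to `curl` that should work on any operating system, the download can also be sketched with Python's standard library. This is only an illustration and not part of the workshop's required steps; the helper name `download_file` is made up for this example, and the actual download call is left commented out.

```python
from urllib.request import urlretrieve


def download_file(url, filename):
    """Download `url` and save it as `filename` (hypothetical helper)."""
    # urlretrieve copies the remote resource into a local file
    saved_path, _headers = urlretrieve(url, filename)
    return saved_path


# Uncomment to download the workshop file into the current folder:
# download_file("https://cistron.github.io/splats/data/python2/cds_adora1.txt",
#               "cds_adora1.txt")
```
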

In [None]:
# use program cat to print out file contents
!cat cds_adora1.txt

Ultimately, we want this output to be read into a single `string` variable we can manipulate (i.e. convert to RNA and translate to protein). For this, the file has to be opened with Python and we have to deal with the line breaks, which are hidden `\n` characters at the end of each line.

## Reading the sequence file

To access file contents, Python provides the built-in `open` function. It returns an iterable file object (also called a handle), assigned to the variable `file` below.

```python
# opens a file for reading
file = open('dna_sequence.txt',mode='r')
# do stuff with the file contents
file.close()
```

To free up memory, make sure to `close` the connection to the file again.

In [None]:
# reading the sequence file 1
# edit the code to read the text file you previously downloaded
file = open( ... , mode='r')
# Print the file contents line by line with a for-loop
for line in file:
    print(line, end='')
# close the file
file.close()

The best way to avoid leaving files open is to use an approach that doesn't explicitly require us to call the `close()` method. With a `with` block, the file is implicitly closed after the indented code has been executed.

```python
with open("dna_sequence.txt", mode='r') as file:
   # perform file operations
```
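
To convince yourself that the file really is closed on leaving the `with` block, you can inspect the file object's `closed` attribute. The sketch below writes a small throwaway file first so that it runs anywhere; the file name `example.txt` and its content are made up for illustration.

```python
import os
import tempfile

# create a small throwaway file so the example runs anywhere (made-up content)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="w") as file:
    file.write("acgt\n")

with open(path, mode="r") as file:
    first_line = file.readline()
    print(file.closed)  # False: we are still inside the with block

print(file.closed)  # True: the file was closed automatically
```
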

In [None]:
# reading the sequence file 2
# Use the `with` statement to print the contents of cds_adora1.txt
...
    for line in file:
        print(line, end='')

File objects come with several methods providing more control over read-out. Below are a few of them. You can find more information on file reading and writing in [this part of the Python documentation](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files).

* `file.read(n)` returns up to `n` characters as a string
* `file.read()` returns the complete file (from the current position) as a string
* `file.readline(n)` returns up to `n` characters of the current line as a string
* `file.readline()` returns the current line as a string
* `file.readlines(n)` returns complete lines as a list, stopping once the total number of characters read reaches about `n` (a size hint, not a line number)
* `file.readlines()` returns all the lines as elements in a list

The following two methods let you query and change your position within a file.

* `file.seek(n)` moves the position to offset `n` in the file (`file.seek(0)` returns to the start)
* `file.tell()` returns the current position
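
The methods listed above can be tried out on a small throwaway file. The sketch below is only an illustration: the file `example.txt` and its three-line content are made up, and the file is written to a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

# write a tiny three-line file to experiment with (made-up content)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="w") as file:
    file.write("atg\ncct\ntga\n")

with open(path, mode="r") as file:
    first_two = file.read(2)        # 'at'
    rest_of_line = file.readline()  # 'g\n' (remainder of the first line)
    position = file.tell()          # current position within the file
    remaining = file.readlines()    # ['cct\n', 'tga\n']
    file.seek(0)                    # jump back to the start
    everything = file.read()        # 'atg\ncct\ntga\n'

print(first_two, repr(rest_of_line), remaining)
```

Note how each read continues from where the previous one stopped, until `seek(0)` rewinds the file.
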

In [None]:
# reading the sequence file 3
# Read the whole file content into a list using the `readlines()` method
with open('cds_adora1.txt','r') as file:
     lines = ...
print(lines)

## Removing line-breaks

In your last output, you should see that each item in the `lines` list contains the previously mentioned `\n` line break character. A quick and easy way to remove it is the `.rstrip()` method, which removes any trailing whitespace (the newline character is considered whitespace). If you are dealing with leading whitespace, `.lstrip()` is its partner method, and `.strip()` removes whitespace from both ends.
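
The difference between the three methods is easiest to see on a string with whitespace at both ends. The string `padded` below is made up for illustration; `repr()` is used so that spaces and `\n` become visible in the output.

```python
padded = "  acgt \n"  # made-up example string

print(repr(padded.rstrip()))  # '  acgt'
print(repr(padded.lstrip()))  # 'acgt \n'
print(repr(padded.strip()))   # 'acgt'
```
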

To strip each item's whitespace:
* create a new empty list to receive the chunks of coding sequence (`cds_list`)
* iterate over the `lines` list using a `for` loop
* in the loop, apply `.rstrip()` to each list item (`line`) and use the `.append()` method to add it to `cds_list`

In [None]:
# removing line-breaks 1
# complete the code below to strip off the newline characters from each item of `lines`
cds_list = []
for line in lines:
    line_without_whitespace = ...
    ...append(line_without_whitespace)
print (cds_list)

With the newline characters (`\n`) removed, the list of coding sequence chunks (`cds_list`) can be joined into one continuous string using the `.join()` method.

In [None]:
# Execute this cell to see the effect of `.join()`
a_list = ['a', 'b', 'c']
with_separator = '-'.join(a_list)
without_separator = ''.join(a_list)
print(with_separator)
print(without_separator)

As you can see from the example above, the string to which the `.join()` method is applied functions as a separator. If left empty `''`, the items in the list are joined directly to each other.

In [None]:
# removing line breaks 2
# complete the code below to join `cds_list` into a single string `cds`
cds = ...
print(cds)

## From DNA to RNA

Great, the coding sequence is in one continuous string. To translate it into an amino acid sequence, we need a codon table, such as the dictionary in the code cell below. There are, however, two problems when comparing the table with our lowercase DNA sequence:

1. The codons are all in uppercase and
2. all codons are RNA.

You could either convert the codon table to lowercase DNA or convert the coding sequence to uppercase RNA (the less tedious option).

In [None]:
# Execute cell to store the codon table as a dictionary
codon_table  = {
    "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A",
    "UGC":"C", "UGU":"C", "GAC":"D", "GAU":"D",
    "GAA":"E", "GAG":"E", "UUC":"F", "UUU":"F",
    "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G",
    "CAC":"H", "CAU":"H", "AUA":"I", "AUC":"I",
    "AUU":"I", "AAA":"K", "AAG":"K", "UUA":"L",
    "UUG":"L", "CUA":"L", "CUC":"L", "CUG":"L",
    "CUU":"L", "AUG":"M", "AAC":"N", "AAU":"N",
    "CCA":"P", "CCC":"P" ,"CCG":"P", "CCU":"P",
    "CAA":"Q", "CAG":"Q", "AGA":"R", "AGG":"R",
    "CGA":"R", "CGC":"R", "CGU":"R", "CGG":"R",
    "AGC":"S", "AGU":"S", "UCA":"S", "UCC":"S",
    "UCG":"S", "UCU":"S", "ACA":"T", "ACC":"T",
    "ACG":"T", "ACU":"T", "GUA":"V", "GUC":"V",
    "GUG":"V", "GUU":"V", "UGG":"W", "UAC":"Y",
    "UAU":"Y", "UAG":"!", "UAA":"!", "UGA":"!"
}

Converting a string to all uppercase is easy. The string method `.upper()` does the trick.

In [None]:
# Execute this cell to see the effect of the `.upper()` function.
lowercase = "plz, makez me all upper case - kthxbai!"
uppercase = lowercase.upper()
print(uppercase)

Replacing letters in a string is equally straightforward with the `.replace('old','new')` method. (If you are in a complicated mood, you could also iterate over the string using a `for`-loop and use `if`-statements.)

In [None]:
# Execute this cell to see the effect of the replace function.
bad_grammar = "It was there football after all."
proper_grammar = bad_grammar.replace('there','their')
print(f"Incorrect: {bad_grammar}")
print(f"Correct: {proper_grammar}")
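
For comparison, the more complicated loop-based route mentioned above might look roughly like the sketch below. The word `banana` and the swapped letters are arbitrary choices for illustration; the last line checks that the loop gives the same result as `.replace()`.

```python
word = "banana"  # arbitrary example string
result = ""      # empty string to collect the new characters

for character in word:
    if character == "a":
        result += "o"        # swap every 'a' for an 'o'
    else:
        result += character  # keep all other characters unchanged

print(result)                            # bonono
print(result == word.replace("a", "o"))  # True
```
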

These string manipulation methods can also be chained together, e.g. `string.upper().replace('m','n')`. Now have a go at converting the coding sequence (`cds`) to uppercase RNA.

In [None]:
# complete the code below to convert the coding sequence (`cds`) to uppercase RNA
rna = ...
print(rna)

## Chopping RNA into codons

Now that you have an RNA template, it's time to slice it into triplets of nucleotides (codons), which can then be translated into amino acids using the `codon_table` dictionary.

Slicing is a fundamentally important operation in Python. The (most basic) slice syntax `vegetable[n:m]` will cut before the indices `n` and `m` in the slice operator; this means the `m`th index will not (!) be included in the slice. The indices may refer to characters in a string or items in a list, but slicing can be applied to many of Python's data structures. Remember that Python indexes begin with 0 for the first element.

In [None]:
# Execute this cell to see slicing in action
futurama = 'Bender is great'
print(futurama[0:6]) # characters 0 to 5
print(futurama[:6]) # from beginning of string to character 5
print(futurama[7:]) # from character 7 to end of the string
print(futurama[:]) # makes a copy (handy for creating copies of lists)
print(futurama[:-1]) # one can also count from the back, here clipping off the last character
print(futurama[::2]) # an optional third parameter is steplength ...
print(futurama[::-1]) # ... which can be used to even reverse a string

In [None]:
# use this code cell to have some fun with slicing

Now that you are familiar with slicing, you can think about breaking the RNA sequence down into triplets. Doing this manually (`rna[0:3]`, `rna[3:6]`, ...) would be rather silly. So let's use a `for`-loop (though there's always more than one way to get to the cheezburger).

Just a reminder about basic syntax of `for`-loops.

```python
for index in iterator:
    # do stuff
```

Here, the iterator can be any of a whole range of iterable data objects, such as the output of the `range()` function.

In [None]:
# Hint: the range function happily creates a number every nth step.
n = 4
for i in range(0,50,n):
    print(i, end=' ')

In [None]:
# edit code to slice RNA into triplets
rna_length = len(rna)
codons = [] # an empty list to store the codons
for base_index in range(...,...,...):
    # slicey-dicey using base_index
    codon = ...
    codons.append(codon)
print(codons)

## Translating codons into amino acids

Using the `codon_table` dictionary, individual codons can be decoded into the corresponding amino acids. Dictionaries contain information stored as `'key':'value'` pairs and values can be looked up via the key: `my_dictionary['key']`.

In [None]:
# This should return methionine (M)
print(codon_table['AUG'])
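
Looking up a key that is not in the dictionary raises a `KeyError`, which could happen, for example, with an incomplete codon at the end of a sequence. The sketch below uses a made-up two-entry table (`mini_table`) to show how to check for a key with `in` and how `.get()` returns a fallback value instead of raising an error; the `?` placeholder is an arbitrary choice.

```python
mini_table = {"AUG": "M", "UGG": "W"}  # made-up two-entry table for illustration

print(mini_table["AUG"])          # M
print("AU" in mini_table)         # False: an incomplete codon is not a key
print(mini_table.get("AU", "?"))  # ? (fallback value instead of a KeyError)
```
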

All that is left to do is to iterate over the `codons` list with a `for`-loop and store the amino acid chain.

In [None]:
# edit code to translate codons into amino acids
protein = "" # an empty string to store the aa (could also use a list) 
for ...:
    # concatenate dictionary output
print(protein)

## Make it functional

So far we have only written spaghetti code - one long script (well, broken up by comments). While this is fine for a short project such as this, it quickly becomes hard to read, hard to edit and hard to debug: unmanageable. Organise your code into suitable functions, e.g. one to load your file into a string, one to prepare your string for translation and one to translate your RNA string.

Brief reminder: a function is a block of reusable code that is used to perform a single action, with the following syntax:

```python
def functionname( parameters ):
   """function_docstring tells others what this function does"""
   # do something
   return something_else
```
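
As a concrete instance of this template, here is a minimal sketch of a small function. The function `count_base` is a made-up example and not one of the functions you are asked to write below.

```python
def count_base(sequence, base):
    """Return how often `base` occurs in `sequence`, ignoring case."""
    # convert both to uppercase so that 'g' and 'G' are counted alike
    return sequence.upper().count(base.upper())


print(count_base("atgGca", "g"))  # 2
```

Once defined, the function can be reused with any sequence, which is exactly what makes functions easier to maintain than one long script.
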

In [None]:
# load file
def file_to_string (filename):
    """Loads a textfile of DNA sequence, strips any right side
    white spaces and returns the sequence as a string"""
    ...

In [None]:
# upper case and RNA
def dna_to_upper_rna (dna):
    """Converts a string of DNA into uppercase RNA"""
    ...

In [None]:
# translate RNA
def translate_rna (rna):
    """Translates a string of uppercase RNA
    returns a string of amino acids"""
    ...

In [None]:
# call the functions
cds = file_to_string('cds_adora1.txt')
rna = dna_to_upper_rna(cds)
protein = translate_rna(rna)
print (protein)

## Not for the faint-hearted: list comprehension makes this super-short

Below are 2-ish lines of code, which carry out the same operation you just pieced together. How? List comprehension and method chaining. Have a read on [list comprehension in the Python documentation](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) and see whether you can make sense of the code below. Have a play with list comprehension and replace some of the `for`-loops of the code you have previously written.

In [None]:
# everything in a few lines (not counting the codon table)
with open('cds_adora1.txt','r') as file:
    rna = ''.join([line.rstrip().upper().replace('T','U') for line in file.readlines()])
protein = ''.join([codon_table[rna[i:i+3]] for i in range(0, len(rna), 3)])
print(protein)
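
If the comprehensions above look cryptic, it may help to compare a `for`-loop with its list comprehension equivalent on a much smaller case. The list `chunks` below is made up for illustration.

```python
chunks = ["atg", "cct", "tga"]  # made-up example list

# for-loop version: build the new list step by step
upper_loop = []
for chunk in chunks:
    upper_loop.append(chunk.upper())

# list comprehension version: the same operation in one line
upper_comprehension = [chunk.upper() for chunk in chunks]

print(upper_comprehension)                # ['ATG', 'CCT', 'TGA']
print(upper_loop == upper_comprehension)  # True
```
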