# Practical 1

## Python programming

### Objectives 

This practical aims to familiarise you with the key elements of the Python programming language. Understanding and writing code is essential to effective problem solving in bioinformatics. To use the materials in this course, both practicals and lectures, you will need a basic grasp of programming, as expected from the course prerequisites.

In this practical we:
* review Python programming basics
* look at some file processing using Python
* provide some examples of [what Python Notebooks are](#Appendix:-An-introduction-to-notebooks)



---

## Variables and simple data types ##

A variable is like a box into which you can put stuff (i.e. you assign a value to a variable). The stuff can have different types, such as integer, float, string or other [type objects].

You assign a value to a variable using the `=` operator.

A variable can have almost any name, except that it cannot be a reserved Python keyword and cannot start with a number.

[type objects]: https://docs.python.org/3/library/types.html

In [None]:
# will not work. Starts with a number
7_up = 'this'

# or this. if is a reserved word
if = 'this'

### Integers ###

In [None]:
# Integers are numbers without a decimal point
myvar = 12
print (type(myvar))
print (myvar)

### Floats

In [None]:
# Floats are numbers with decimal points
myvar = 12.2
print (type(myvar))
print (myvar)

### Strings ###

In [None]:
# Strings are text-snippets, i.e. arbitrarily long sequences of characters
myvar = "12"
print (type(myvar))
print (myvar)

### Take-aways ###

 * Strings are defined explicitly by surrounding the text between double or single quotes. 
    * Why are there these two options?
 * "12" is not equal to 12. "12" is the concatenation of the characters "1" and "2", while 12 is an integer type.

In [None]:
# Asking is "12" equal to 12 (string vs int)
"12" == 12

* You can convert types from one to another. Examples:

In [None]:
print (type(int("12")), int("12"))
print (type(float("12")), float("12"))
print (type(float(11)), float(11))
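
Not every conversion works directly. The following is a minimal sketch (the variable names are only illustrative) showing that text containing a decimal point has to go through `float()` before it can become an integer:

```python
text = "12.2"
as_float = float(text)   # the string "12.2" becomes the float 12.2
as_int = int(as_float)   # int() drops the fractional part
as_text = str(as_int)    # and str() turns a number back into text
print(as_float, as_int, as_text)
# int(text) would raise a ValueError, because "12.2" is not a whole number
```

Chaining the conversions like this makes each step explicit, which helps when a parsed value does not have the type you expected.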

---

## Working with variables and simple data types ##

**Incrementing variables**.

Notice how you can add or multiply the content of a variable and store the result in the same variable:

In [None]:
# Use '#' for comments.
myvar = 12        # assign a value to a variable
myvar = myvar + 1 # increment the value assigned, by another value (1)
print (myvar)     # print it out

myvar += 1        # Handy shorthand notation for incrementing a value assigned
print (myvar)

In [None]:
myvar = 14
myvar = myvar * 2 # Double the value of myvar
print (myvar)

myvar *= 2        # This statement is equivalent to the "doubling" above 
print (myvar)

**String methods and getting help.**

Strings have many useful predefined functions:

In [None]:
# Define a variable containing a short (DNA) string
dna = "ATCGTTTTCGATC"

In [None]:
# Count the occurrences of character A in the (DNA) variable
print (dna.count("A"))

In [None]:
# How long is the (DNA) string
print (len(dna))

In [None]:
print (dna.replace("T","U"))

In [None]:
# The function "replace" does not change the string in-place
print (dna)

In [None]:
rna = dna.replace("T","U")
print (rna)
# or reassign the result to the same variable (the original string itself is never modified)
dna = dna.replace("T","U")
print (dna)

In [None]:
# Checking membership with 'in'
print ("U" in rna)
print ("T" in rna)

In [None]:
# Accessing elements in string by position index. Notice counting begins with 0.
print (dna)
print (dna[0])
print (" "+dna[1])
print ("   "+dna[3])
# Get last element
print ("            "+dna[-1])

In [None]:
# Adding some 'bases'
dna += "AAAAAA"
print (dna)

In [None]:
# Listing available 'string methods'
print (dir(str))

In [None]:
help(str)  # help() prints the documentation itself, so no print() is needed
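
The string methods shown above can be combined. As a small sketch, here is one way to compute the fraction of G and C bases in the DNA string used earlier; the names `seq`, `gc_count` and `gc_fraction` are just illustrative choices:

```python
seq = "ATCGTTTTCGATC"

# count() and len() together give the G+C fraction
gc_count = seq.count("G") + seq.count("C")
gc_fraction = gc_count / len(seq)
print(gc_count, len(seq), round(gc_fraction, 2))

# Methods can be chained, because each call returns a new string
rna_lower = seq.lower().replace("t", "u")
print(rna_lower)
```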

---

## More advanced data types: lists and dictionaries ##

A variable can also contain other "boxes", with the two most useful [types] of box containers
being lists and dictionaries.

[types]: https://docs.python.org/2/library/types.html


**Lists:** ordinal index

In [None]:
# Delimiters: square brackets
a_list = [1,2,12.2,"P23456"]

List representation: 

Index | Value
--- | ---
0 | 1
1 | 2
2 | 12.2
3 | "P23456"


In [None]:
print (type(a_list))

In [None]:
# 0 based numbering
print (a_list[3])

In [None]:
# Updating elements
print (a_list)
a_list[3] = 15
print (a_list)

In [None]:
# Adding elements at the end
a_list.append(20.0)
print (a_list)

In [None]:
# 'slicing a list'
print (a_list[1:3])

In [None]:
# List length. Like for strings
print (len(a_list))

In [None]:
# Membership testing
print (5 in a_list)
print (20 in a_list)
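
Slices and negative indices are easy to get wrong at first. The sketch below uses a separate list (`codes`, an illustrative name) to show which elements a slice includes and how to delete an element by value:

```python
codes = ["O13572", "Q12346", "O13583", "P32329"]

middle = codes[1:3]    # elements at index 1 and 2; index 3 is excluded
first_two = codes[:2]  # from the start up to, but not including, index 2
last = codes[-1]       # negative indices count from the end
print(middle, first_two, last)

codes.remove("Q12346")  # delete the first element equal to this value
print(len(codes), codes)
```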

**Dictionaries**: alphanumerical keys

In [None]:
# Delimiters: curly brackets
my_dict={"name" :"hemoglobin",
         100: "PRO2979", 
         "length": 147}

Dictionary representation:

Key | Value
--- | ---
"name" | "hemoglobin"
100 | "PRO2979"
"length" | 147


In [None]:
print (type(my_dict))

In [None]:
print (my_dict["name"])

In [None]:
print (my_dict[100])

In [None]:
my_dict["length"] = 120
print (my_dict["length"])

In [None]:
print (my_dict)
my_dict["ID"] = "P69891"
print (my_dict["ID"])
print (my_dict)

In [None]:
print (my_dict.keys())
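
Asking for a key that does not exist raises a `KeyError`. As a minimal sketch (using a separate dictionary named `protein` for illustration), membership testing and the `.get()` method are two common ways to avoid that:

```python
protein = {"name": "hemoglobin", "length": 147}

# 'in' tests the keys, not the values
has_name = "name" in protein
has_value = "hemoglobin" in protein
print(has_name, has_value)

# .get() returns a default instead of raising KeyError for a missing key
organism = protein.get("organism", "unknown")
print(organism)

# .values() gives the values in the same order as .keys()
print(list(protein.values()))
```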

---

## Control structures

The constructs `if` and `for` are the most common control structures used to execute instructions according to certain conditions (`if`) or repetitions (`for`). The instructions under control are identified by their common level of indentation, which is indicated by tabs or 4 blank spaces.


### Conditional statements ###

**if, elif, and else**

The comparison operators are `==`, `!=`, `>`, `<`, `>=`, `<=`, `is` and `in` (note that a single `=` is assignment, not a comparison). Any function returning boolean values (`True` or `False`) could also be used as a condition. Several conditions can be combined using the `and` and `or` keywords. For example:

```python
   if (age < 30) and (weight > 150):
       pass # do nothing
```

**if, elif & else block**

```python
if condition1:
    do something if condition1 is True
    do more things if condition1 is True
    do even more things if condition1 is True
elif condition2:
    do something if condition2 is True and condition1 is False
else:
    do something if condition1 and condition2 are False
    do more things if condition1 and condition2 are False
do something regardless of the conditions (outside the if block)
```

**Example:**

In [None]:
mycodon = "ATG"

if "T" in mycodon:
    print ("The codon is DNA, changing to RNA")
    mycodon = mycodon.replace("T","U")
else:
    print ("The codon is possibly RNA")

if mycodon == "AUG":
    print ("This is the start codon!")

print ("The analysed codon is", mycodon)
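
The example above does not use `elif` or combined conditions. The following sketch, with illustrative variable names, classifies an RNA codon using the three stop codons of the standard genetic code:

```python
codon = "UAA"
stop_codons = ["UAA", "UAG", "UGA"]

if codon == "AUG":
    kind = "start"
elif codon in stop_codons:   # only tested when the first condition is False
    kind = "stop"
else:
    kind = "other"
print(codon, "is a", kind, "codon")

# Conditions can be combined with 'and' / 'or'
looks_like_rna = (len(codon) == 3) and ("T" not in codon)
print(looks_like_rna)
```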

---

### Control flow ###

**for, in**: executing repeatedly on elements

The for loop:

```python
for index_variable in list:
    do something using index_variable
    do something different inside the loop
do something outside of the loop
```

**Example:**

In [None]:
hydrophobics = "FLIMVPAWG"
sequence = ["M","H","K","L"]

for aa in sequence:
    print ("The amino acid is",aa)
    if aa in hydrophobics:
        print (" It is a hydrophobic residue")
print ("End of the program")

---

### Advanced control structure topics

**The break statement allows termination of a for loop prematurely:**

```python
   for element in list:
       if element == 'this is what we want':
           print ('found it')
           break
```
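
A runnable version of the `break` pattern might look like the sketch below. The list `checked` is added only to make visible which elements were examined before the loop stopped:

```python
sequence = ["M", "H", "K", "L"]
checked = []

for aa in sequence:
    checked.append(aa)
    if aa == "K":
        print("found it")
        break   # leave the loop; "L" is never examined
print(checked)
```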

**Looping over the keys of a dictionary using:**

```python
   for key in dict:
```

**Looping over the dictionary keys and values simultaneously using:**

```python
   for key, value in dict.items():
```
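
As a small sketch of both dictionary loops, assuming an illustrative dictionary named `protein`:

```python
protein = {"name": "hemoglobin", "ID": "P69891", "length": 147}

# Looping over a dictionary visits its keys
for key in protein:
    print(key, protein[key])

# .items() yields (key, value) pairs, unpacked into two variables
pairs = []
for key, value in protein.items():
    pairs.append(str(key) + "=" + str(value))
print(pairs)
```

In current Python versions (3.7 and later) the keys are visited in the order in which they were inserted.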

**Executing repeatedly over several lists simultaneously:**

Imagine you have several lists of the same length, which is the equivalent of a spreadsheet where the columns are stored in different lists. How can you simultaneously list all the elements of each list?

Index | List 1 | List 2 | List 3 | ...
--- | --- | --- | --- | ---
0 | List1[0] | List2[0] | List3[0] | ...
1 | List1[1] | List2[1] | List3[1] | ...
2 | List1[2] | List2[2] | List3[2] | ...
... |  ... | ... | ... | ...


Here are three simple strategies that you could use to loop simultaneously over values in lists of the same length. 

Let's take the example of two lists containing some UniProt protein codes and their sequence length, respectively:


**Strategy 1**: Incrementing an index using range


In [None]:
prot_code = ['O13572','Q12346','O13583','P32329','Q12303','P53057']
prot_length = [173,211,123,569,508,165]
for element in range(len(prot_code)):
    print (element, prot_code[element], prot_length[element])

**Strategy 2**: Using enumerate(list): returns two values, the index and the value at this index

In [None]:
for index, code in enumerate(prot_code):
    print (index, code, prot_length[index])

**Strategy 3**: Using zip(list1,list2): pairs up the values of several lists, element by element

In [None]:
# But we aren't giving the index here...
for code, length in zip(prot_code, prot_length):
    print (code, length)
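
The strategies can also be combined. The sketch below, which repeats shortened versions of the two lists so that it runs on its own, wraps `zip` in `enumerate` to recover the index, and builds a dictionary from the paired values (`length_of` is an illustrative name):

```python
prot_code = ['O13572', 'Q12346', 'O13583']
prot_length = [173, 211, 123]

# enumerate(zip(...)) gives the index as well as the paired values
for index, (code, length) in enumerate(zip(prot_code, prot_length)):
    print(index, code, length)

# dict(zip(keys, values)) turns two parallel lists into a lookup table
length_of = dict(zip(prot_code, prot_length))
print(length_of['Q12346'])
```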

---

## Functions ##

A function is a block of organised, reusable code that is used to perform a single, related action.

There are several good reasons why you should organise your code by writing functions:
 
 * it allows you to reuse your code in different places,
 * it makes your code easier to understand, and
 * it allows you to identify bugs more easily (and, by not duplicating code, prevents new bugs).


**Function syntax**

```python
   def afunction (param1, param2, ...):
       do something with the parameters
       do more things with the parameters
       return value

   # Calling the function
   var = afunction(10, 2e-3, "PE1290")
```

**Example:**

In [None]:
def is_DNA(seq):
    not_DNA = "QWERYUIOPSDFHJKLZXVBNM"
    for aa in not_DNA:
        if aa in seq:
            return False
    return True

print (is_DNA("GTTCGACCA"))

**About return**

A value (integer, float, string, object...) is sent back from the function using the `return` statement.

The `return` statement could happen at any point in the function definition and will terminate the function execution (the example above has two `return` statements, i.e. two possible termination points of the function).
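
As a further sketch of early termination, the hypothetical function `first_stop` below returns as soon as it finds an in-frame stop codon, and returns `None` when the loop finishes without finding one:

```python
def first_stop(rna):
    """Return the index of the first in-frame stop codon, or None."""
    stop_codons = ["UAA", "UAG", "UGA"]
    # Step through the sequence three letters at a time
    for start in range(0, len(rna) - 2, 3):
        if rna[start:start + 3] in stop_codons:
            return start   # leaves the function immediately
    return None            # reached only if no stop codon was found

print(first_stop("AUGGCCUAAGGG"))
print(first_stop("AUGGCC"))
```

A function that ends without reaching a `return` statement also returns `None`, so the last line is optional; writing it out makes the intention explicit.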


---

## Reading, parsing and writing formatted text files ##

A common task in bioinformatics is to write scripts that parse data stored in formatted files. Here are some standard procedures to read and write files, as well as some tips on how to parse formatted (tab, csv, etc.) files.

The syntax for opening text files, reading text files and writing to them is relatively standard. You could use the following examples as recipes.


### Reading text files ###

Recipe:

```python
    in_file = open("filename", 'r')
    idx = 0
    for aline in in_file:
        idx += 1
        print (idx, aline, end="")  # aline already ends with a newline
    in_file.close()
```

The built-in function `open` opens the file called `filename` in read (`r`) mode and returns a file handle, which is stored in the variable `in_file`. The `for` loop then reads the file line by line, storing each line in turn in the variable `aline`. Inside the loop, the index of the line (`idx`) and the line itself (the contents of `aline`) are printed. Finally, the file has to be closed using `.close()`.
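
An alternative that is hard to get wrong is the `with` statement, which closes the file automatically when the indented block ends. The sketch below first creates a small throwaway file (the path and contents are made up for the example) so that it runs on its own:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as out_file:
    out_file.write("first line\nsecond line\n")

numbered = []
with open(path, "r") as in_file:   # closed automatically after the block
    for idx, aline in enumerate(in_file, start=1):
        numbered.append((idx, aline.rstrip("\n")))
        print(idx, aline, end="")

print(in_file.closed)
```

Here `enumerate(in_file, start=1)` replaces the manual `idx += 1` counter.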


### Writing text files ###

Writing text into files uses very similar syntax to reading files. Instead of printing via `print`, you use the `.write()` method of an open file handle (opened in write, `w`, mode).

Recipe: the following code writes each value found in the list `data` onto a different line to the output file named `outfile.txt`

```python
   data = ["P1231", "P234234", "O89098"]
   fh = open("outfile.txt", "w")
   for element in data:
       fh.write(element+"\n")
   fh.close()
```

**Gotchas**

The newline character `\n` (i.e. go to the next line) has to be explicitly appended to each line of output. This differs from the `print` function, which adds a newline character automatically.

Only strings can be passed to the `.write()` function. Data of other types will need to be converted into strings (e.g. `str(100.0)`).
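
Both gotchas appear in the following sketch, which writes a protein code and its length on each line, separated by a tab. The temporary output path is an assumption made so that the example does not overwrite any of your files:

```python
import os
import tempfile

prot_code = ["O13572", "Q12346"]
prot_length = [173, 211]

path = os.path.join(tempfile.mkdtemp(), "lengths.txt")
with open(path, "w") as fh:
    for code, length in zip(prot_code, prot_length):
        # length is an int, so convert it before joining with "\t" and "\n"
        fh.write(code + "\t" + str(length) + "\n")

# Read the file back to check what was written
with open(path) as fh:
    content = fh.read()
print(content, end="")
```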


### Parsing structured text from files ###

Parsing is about reading data in one format so that you can use it to your needs.

The `split()` method of strings is very useful for parsing files in which the data is organised into columns, i.e. data fields separated by a delimiter such as spaces, tabs or commas.

Below is an example of a column formatted text file (`cyclotide.txt`):

```
1 kalata_B1 GLPVCGETCVGGTCNTPGCTCSWPVCTRN
2 cycloviolacin_O1 GIPCAESCVYIPCTVTALLGCSCSNRVCYN
4 kalata_B2 GLPVCGETCFGGTCNTPGCSCTWPICTRD
5 palicourein GDPTFCGETCRVIPVCTYSAALGCTCDDRSDGLCKRN
6 vhr1 GIPCAESCVWIPCTVTALLGCSCSNKVCYN
7 tricyclon_A GGTIFDCGESCFLGTCYTKGCSCGEWKLCYGTN
8 circulin_A GIPCGESCVWIPCISAALGCSCKNKVCYRN
21 cycloviolacin_O2 GIPCGESCVWIPCISSAIGCSCKSKVCYRN
24 kalata_B6 GLPTCGETCFGGTCNTPGCSCSSWPICTRN
25 kalata_B3 GLPTCGETCFGGTCNTPGCTCDPWPICTRD
26 kalata_B7 GLPVCGETCTLGTCYTQGCTCSWPICKRN
27 cycloviolacin_O8 GTLPCGESCVWIPCISSVVGCSCKSKVCYKN
28 cycloviolacin_O11 GTLPCGESCVWIPCISAVVGCSCKSKVCYKN
30 kalata_B4 GLPVCGETCVGGTCNTPGCTCSWPVCTRD
31 vodo_M GAPICGESCFTGKCYTVQCSCSWPVCTRN
32 cyclopsychotride_A SIPCGESCVFIPCTVTALLGCSCKSKVCYKN
33 cycloviolacin_H1 GIPCGESCVYIPCLTSAIGCSCKSKVCYRN
...
```

You could use the following code to extract the data in each column:

```python 
   in_file = open("cyclotide.txt", 'r')
   for line in in_file:
       fields = line.strip().split()
       print(fields[0], fields[1], fields[2])
   in_file.close()
```


The line `fields = line.strip().split()` performs the parsing of each line. The `strip()` method removes leading and trailing whitespace, including the newline (`\n` character) at the end of the line. The `split()` method then creates a list containing each whitespace-separated element in the line.

Read the help provided for `strip()` and `split()` functions:

In [None]:
help(str)

The first iteration of the for loop and the parsing logic will result in a list, `fields` of length 3 with:
 * fields[0] containing "1",
 * fields[1] containing "kalata_B1", and 
 * fields[2] containing "GLPVCGETCVGGTCNTPGCTCSWPVCTRN"

Note that the index in `fields[0]` is a `str`. You will most likely want to convert it to an `int` using `fields[0] = int(fields[0])`.
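
Rather than only printing the fields, you will usually want to store them. As a minimal sketch, the code below parses two lines copied from the listing above (held in a list so that the example runs without the file) and records the sequence length for each name; `seq_length` is an illustrative name:

```python
# Two lines copied from the listing, standing in for the open file
lines = [
    "1 kalata_B1 GLPVCGETCVGGTCNTPGCTCSWPVCTRN\n",
    "2 cycloviolacin_O1 GIPCAESCVYIPCTVTALLGCSCSNRVCYN\n",
]

seq_length = {}
for line in lines:
    fields = line.strip().split()
    number = int(fields[0])      # convert the text "1" into the integer 1
    name, seq = fields[1], fields[2]
    seq_length[name] = len(seq)
    print(number, name, len(seq))
```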

**Importance of choosing delimiters**

This recipe will fail if a field itself contains blank spaces ("kalata B1" instead of kalata_B1), as the `.split()` function will interpret it as several fields. If the fields are separated by something other than blank spaces, for example tabs, then the fields can be retrieved by passing the delimiter to the `split()` function. 

For example the following file is tab separated:

(Note: this command will not work if you are working in Windows - we recommend instead using a text editor to view the file)

In [None]:
%cat data.txt

You could modify the previous example to parse the elements of each line ("\t" is the value for a 'tab' in Python):

In [None]:
in_file = open("data.txt")
for line in in_file:
    fields = line.strip().split("\t")
    print (fields[1], int(fields[3])-int(fields[2]))
in_file.close()

In the first iteration of the `for` loop, `fields` will contain 
```
["P8096HYR","Transcription factor from plant","1","21","MREVHLLLLLVL"]
```
See how the second element contains spaces but was not split into several fields by `.split("\t")`. Note also how the `int` function was used to convert `str` field elements into integers. Remember that elements extracted by `split()` are of type `str`; if you want to use the numerical values that have been extracted, you need to convert the strings into integer or float values.
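
The effect of the delimiter can be seen by splitting the same text both ways. This sketch uses the first four fields of the example line above; the second string, `row`, is made up to show one further caveat with `strip()`:

```python
line = "P8096HYR\tTranscription factor from plant\t1\t21\n"

by_whitespace = line.strip().split()   # splits on spaces AND tabs
by_tab = line.strip().split("\t")      # splits on tabs only
print(len(by_whitespace), by_whitespace)
print(len(by_tab), by_tab)

# Caveat: strip() also removes trailing tabs, so empty last columns are lost.
row = "P1\t\t\n"
stripped = row.strip().split("\t")
kept = row.rstrip("\n").split("\t")    # removes only the newline
print(stripped, kept)
```

If a file may contain empty columns, `rstrip("\n")` is therefore the safer choice.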

For more information on special string literals such as `'\t'` see section 2.4.1 in the [Python documentation].

[Python documentation]: https://docs.python.org/3.6/reference/lexical_analysis.html

**Parsing complex files**

Databases from EBI or NCBI provide data as flat files. For example, a UniProt flat file looks like this:
(Note: this command will not work if you are working in Windows - we recommend instead using a text editor to view the file)

In [None]:
%cat uniprot_example.embl

Accessing information from these types of files is slightly more complicated than from column formatted or tab separated text. Here the lines have to be tested against certain criteria (using `if` statements) to pinpoint the lines of interest. For example, to extract the identifier, the organism, the molecular weight and the sequence from the previous file, we could inspect the first two characters of each line, as these identify the lines of interest.

**Simple EMBL format parser:**

In [None]:
ident =""
organism = ""
m_weight = 0
seq = ""

in_file = open("uniprot_example.embl")
for line in in_file:
    fields = line.strip().split()
    if line[0:2] == "ID":
        ident = fields[1]
    if line[0:2] == "OS":
        organism = line[5:].strip()
    if line[0:2] == "SQ":
        m_weight = int(fields[4])
    if line[0:2] == "  ":
        seq += line.replace(" ","").strip()
in_file.close()

print(ident, organism, m_weight, seq)

You could also use `str.startswith()`. Investigate how it works:

In [None]:
help(str)
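
One way the parser could look with `str.startswith()` is sketched below. To keep it self-contained it reads from a list of lines; that record is invented and heavily shortened, and only mimics the layout of a flat file, so do not treat its values as real data:

```python
# Invented, shortened record that only mimics the flat-file layout
record = [
    "ID   EXAMPLE_PROT   Reviewed;   12 AA.\n",
    "OS   Example organism.\n",
    "SQ   SEQUENCE   12 AA;  1234 MW;  0000000000000000 CRC64;\n",
    "     MREVHLLLLL VL\n",
    "//\n",
]

ident, organism, m_weight, seq = "", "", 0, ""
for line in record:
    fields = line.split()
    if line.startswith("ID"):
        ident = fields[1]
    elif line.startswith("OS"):
        organism = line[5:].strip()
    elif line.startswith("SQ"):
        m_weight = int(fields[4])
    elif line.startswith("  "):
        seq += "".join(fields)   # drop the spaces between sequence blocks
print(ident, organism, m_weight, seq)
```

Using `elif` here means that each line is tested only until one of the prefixes matches.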


---

## Using Python libraries ##

Python libraries are Python files that contain definitions of functions and objects. These functions and
objects can be used within your code by importing them with the `import` keyword. 

There are a large number of Python libraries that you can use. Some of the standard Python libraries (which come with Python by default) that you should consider learning are `sys` (system), `os` (operating system) and `re` (regular expressions). We have also written libraries that we will use during this course; please see `reference_guide.ipynb` for a brief introduction.

One of the most popular and well used Python bioinformatics libraries is [BioPython]. You can search for publicly available libraries on the [Python Package Index] (PyPI).

[BioPython]: http://biopython.org/
[Python Package Index]: https://pypi.python.org/pypi

In [None]:
# Help on the sys module. 
# You need to import it first.
import sys
help(sys)
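
As a brief taste of two other standard libraries, the sketch below uses `os.path` to take a file path apart and `re` to search a sequence for a simple pattern. The path is hypothetical and the pattern was chosen only for illustration:

```python
import os
import re

# os.path helpers split a path into useful pieces (hypothetical path)
path = "data/uniprot_example.embl"
filename = os.path.basename(path)
stem, extension = os.path.splitext(filename)
print(filename, stem, extension)

# re.findall returns every non-overlapping match of a pattern:
# here, a C, then any one capital letter, then another C
seq = "GLPVCGETCVGGTCNTPGCTCSWPVCTRN"
matches = re.findall("C[A-Z]C", seq)
print(matches)
```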

---

<a id='appendix_notebooks'></a>
## Appendix: An introduction to notebooks

We assume that you have installed a Python distribution with IPython, and that you are now in an IPython/Jupyter notebook. Type in a cell the following command, and press `Shift+Enter` to validate it:

In [None]:
print("Hello world!")

A notebook contains a linear succession of **cells** and **output areas**. A cell contains Python code, in one or multiple lines. The output of the code is shown in the corresponding output area.

Now, we do a simple arithmetic operation.

In [None]:
2+2

The result of the operation is shown in the output area. Let's be more precise. The output area not only displays text that is printed by any command in the cell, it also displays a text representation of the last returned object. Here, the last returned object is the result of 2+2, i.e. 4.

In the next cell, we can recover the value of the last returned object with the `_` (underscore) special variable. In practice, it may be more convenient to assign objects to named variables, as in `myresult = 2+2`.

In [None]:
_ * 3

IPython not only accepts Python code, but also shell commands. Those are defined by the operating system (Windows, Linux, Mac OS X, etc.). We first type ! in a cell before typing the shell command. Here, we get the list of notebooks in the current directory.

In [None]:
!ls *.ipynb

IPython comes with a library of **magic commands**. Those commands are convenient shortcuts to common actions. They all start with `%` (percent character). You can get the list of all magic commands with `%lsmagic`.

In [None]:
%lsmagic

Cell magics have a `%%` prefix: they apply to an entire cell in the notebook.

For example, the `%%writefile` cell magic lets you create a text file easily. This magic command accepts a filename as argument. All remaining lines in the cell are directly written to this text file. Here, we create a file `test.txt` and we write `Hello world!` in it.

In [None]:
%%writefile test.txt
Hello world!

In [None]:
# Let's check what this file contains.
with open('test.txt', 'r') as f:
    print(f.read())

As you can see in the output of `%lsmagic`, there are many magic commands in IPython. You can find more information about any command by adding a `?` after it. For example, here is how we get help about the `%run` magic command:

In [None]:
%run?

We covered the basics of IPython and the notebook. Let's now turn to the rich display and interactive features of the notebook. Until now, we only created code cells, i.e. cells that contain... code. There are other types of cells, notably Markdown cells. Those contain rich text formatted with Markdown, a popular plain text formatting syntax. This format supports normal text, headers, bold, italics, hypertext links, images, mathematical equations in LaTeX, code, HTML elements, and other features, as shown below.

### New paragraph

This is *rich* **text** with [links](http://ipython.org), equations:

$$\hat{f}(\xi) = \int_{-\infty}^{+\infty} f(x)\, \mathrm{e}^{-i \xi x}\, \mathrm{d}x$$

code with syntax highlighting:

```python
print("Hello world!")
```

and images:

![This is an image](http://ipython.org/_static/IPy_header.png)

By combining code cells and Markdown cells, you can create a standalone interactive document that combines computations (code), text and graphics.