<a href="https://colab.research.google.com/github/SantosRAC/intro_python_ismb2022/blob/main/ISBM_2022_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Python programming for bioscientists

Tutorial at the 30th Conference on Intelligent Systems for Molecular Biology (ISMB 2022)

**Instructors**

 * Renato Augusto Corrêa dos Santos, PhD
 * PhD candidate Hemanoel Passarelli
 * Master's student Pedro Ilídio
 * Vinícius Frasceschini

**Learning objectives**

 * Introduce Google Colab digital notebooks;
 * Present the basic logic and data structures in Python;
 * Provide hands-on experience in analyzing biological sequences using Biopython.




# Day 1 (July 6th, 2022)

## Block 1

**Timing**: 1h30min

**Learning objectives**

 * Run basic Python code and introduce students to documentation with Markdown on Google Colab notebooks
 * Introduce the basics of variables, numeric operations, strings, and booleans
 * Introduce built-in functions (print)
 * Show how to import a library and use functions beyond the built-in functions (math)

**Quick view of Google Colab's functionalities**

**What variables are and how to define variables in Python**

Variables have:
 * Type
 * Value

There are also rules we must follow to name variables:
 * They can contain letters, numbers, or underscores (`_`), but cannot start with a number
 * They must NOT contain whitespace
 * They must NOT be reserved words of the Python language (we'll see some of these reserved words soon)

We will see later that the operators we use in several tasks (such as `+` or `-`) also cannot be part of a variable name.
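
The naming rules above can be checked programmatically. The following is a minimal sketch that uses the standard-library `keyword` module and the string method `isidentifier`; the candidate names are made up for illustration and are not part of the original notebook.

```python
import keyword

# Hypothetical candidate names, chosen only to illustrate the rules
candidate_names = ['sample_1', '1sample', 'my sample', 'for']

name_validity = {}
for candidate in candidate_names:
    # A usable name must be a valid identifier AND must not be a reserved word
    is_valid = candidate.isidentifier() and not keyword.iskeyword(candidate)
    name_validity[candidate] = is_valid
    print(candidate, is_valid)
```

Only `sample_1` passes: `1sample` starts with a number, `my sample` contains whitespace, and `for` is a reserved word.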

The `=` sign is used to assign values to variables.

In [4]:
name = "Renato"

In [13]:
number = 10

**Best practices in Python programming and the PEP8 conventions**

Try your best to provide names that are as descriptive as possible.

**Numbers**

We have three different types of numbers in Python:
 * Integers
 * Floating-point numbers
 * Complex numbers

This is an integer in Python:

In [16]:
10

10

Remember that variables have a type and a value.

We can check the type of a value or a variable by using the `type` function (we will use this function several times over the two days).

In [18]:
type(10)

int
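
The same check works for the other two numeric types. This is a minimal sketch; the variable names and values are illustrative and do not come from the original notebook.

```python
# Illustrative values (hypothetical names)
genome_count = 10        # integer
gc_content = 0.38        # floating-point number
complex_value = 2 + 3j   # complex number: real part 2, imaginary part 3

print(type(genome_count))
print(type(gc_content))
print(type(complex_value))
```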

**Python can be used in arithmetic operations**

Like in mathematics, Python can also be used to carry out many different arithmetic operations:

| **Operator** | **Description** | **Example** |
|------|------|------|
|  +   | Add the values  |  |
|  -   | Subtract the right value from the left value |  |
|  *   | Multiply the left and right values |  |
|  /   | Divide the left value by the right value |  |
|  //  | Integer division |  |
|  %   | Modulus - returns any remainder |  |
|  **  | Exponent (or power of) |  |

(Table was obtained from Hunt, 2021)
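
As a quick illustration of the operators in the table, here is a minimal sketch with two made-up values (`left` and `right` are hypothetical names, not part of the original notebook):

```python
left = 7
right = 2

addition = left + right         # 9
subtraction = left - right      # 5
multiplication = left * right   # 14
division = left / right         # 3.5 (always a floating-point number)
integer_division = left // right  # 3 (the fractional part is discarded)
remainder = left % right        # 1
power = left ** right           # 49

print(addition, subtraction, multiplication, division,
      integer_division, remainder, power)
```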

**Looking for help**

**Strings in Python**

 * Sequence (or series) of ordered characters
 * Immutable

We can define a string in Python using either single (`'`) or double (`"`) quotation marks:

In [6]:
author = "Renato"  # correct
instructor = "Renato"  # correct
Author = "Renato"  # runs, but goes against PEP8 conventions (variable names should be lowercase)


Opening a string with one kind of quotation mark and closing it with the other is not possible:

In [7]:
instructor = "Renato'

SyntaxError: EOL while scanning string literal (3551669578.py, line 1)

The main reason both kinds of quotation marks are accepted is that sometimes we want to use quotes within a string:

In [10]:
person = "Renato's student"

Let's check the type of this variable:

In [12]:
person

"Renato's student"

In [11]:
type(person)

str

`str` not only indicates that the variable points to a string; it also works as a function that converts an object of another type to a string. We had previously created the `number` variable. Let's convert it to a string:

In [14]:
number = 10
type(number)

int

In [15]:
str(number)

'10'

**Booleans in Python**

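 * A Boolean (`bool`) value is either `True` or `False`
 * Comparisons (such as `10 > 5`) return Booleans, and they drive the conditionals we will use later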
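
A minimal sketch of Boolean values, assuming the usual Python meaning (`True` and `False`, often produced by comparisons). The variable names and the length value are illustrative only.

```python
is_protein = True          # a Boolean assigned directly
sequence_length = 120      # hypothetical value
is_long = sequence_length > 1000   # a comparison returns a Boolean

print(type(is_protein))
print(is_long)
print(is_protein and not is_long)  # Booleans can be combined with and, or, not
```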





## Block 2

**Timing**: 1h

**Learning objectives**

* Introduce basic functions and methods for string manipulation and a comparative analysis with sequences (lists and tuples)
* Introduce dictionaries and a comparative analysis between sequence indexing and dictionary keys

**Functions and methods in Python for string handling**
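
A minimal sketch of a few string functions and methods (`len`, `upper`, `count`, `replace`), plus a comparison with a list. The choice of methods and the short peptide fragment are assumptions made for illustration.

```python
peptide = 'mfvflvllpl'  # illustrative fragment, lowercase on purpose

peptide_length = len(peptide)             # built-in function
peptide_upper = peptide.upper()           # method: returns a NEW string
leucine_count = peptide.count('l')        # how many times 'l' occurs
peptide_marked = peptide.replace('l', 'L')

print(peptide_length, peptide_upper, leucine_count, peptide_marked)
print(peptide[0], peptide[-1])            # strings are indexed like lists and tuples

# Lists are mutable, so an element can be changed in place
residue_list = list(peptide)
residue_list[0] = 'M'
# peptide[0] = 'M' would raise a TypeError, because strings are immutable
print(residue_list[:3])
```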


**Functions and methods in Python for sequence handling**

**Dictionaries in Python**

 * Key <-> value associations
 * Values are accessed through their keys, not through numeric positions (different from lists and strings); since Python 3.7, dictionaries keep their keys in insertion order
 * Mutable (different from other objects we studied; e.g., strings and tuples are immutable)

In [None]:
{'var1': 'mut1',  # Note the whitespace, a PEP8 convention (https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements)
 'var2': 'mut2'}

Like in previous cases, we can create a variable to "store" our dictionary.

Let's create a variable for using later in our practical sessions:

In [None]:
sarscov2vars = {'var1': 'mut1',
                'var2': 'mut2'}

The `dict` function can take different data structures as input to build a dictionary. Tuples and lists are examples that we have already studied.

We can use a list of tuples or a tuple of lists!
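
Both options can be sketched as follows, reusing the placeholder keys and values from the dictionary above (the variable names are hypothetical):

```python
# Each inner pair is read as (key, value)
pairs_as_list_of_tuples = [('var1', 'mut1'), ('var2', 'mut2')]
pairs_as_tuple_of_lists = (['var1', 'mut1'], ['var2', 'mut2'])

from_list = dict(pairs_as_list_of_tuples)
from_tuple = dict(pairs_as_tuple_of_lists)

print(from_list)
print(from_list == from_tuple)  # both inputs build the same dictionary
print(from_list['var1'])        # values are retrieved by key
```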

## Block 3

**Timing**: 1h

**Learning objectives**

* Show how to build functions
* Introduce file input/output

Remember that we previously used Python built-in functions such as `str` and `input`.

We also showed you how to use data structures, functions, and methods from a library (`math`) for calculations that are not available as Python built-ins.

We can also create our own functions using the `def` keyword.
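
A minimal sketch of a user-defined function written with Python's `def` keyword. The function name `count_residue` and the short example sequence are hypothetical and serve only as an illustration.

```python
def count_residue(sequence, residue):
    """Return how many times `residue` occurs in `sequence`."""
    return sequence.count(residue)


example_sequence = "MFVFLVLLPLVSSQCVNL"  # illustrative fragment
leucine_total = count_residue(example_sequence, "L")
print(leucine_total)
```

Once defined, the function can be called as many times as needed with different sequences and residues, just like a built-in function.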

**Reading files**

In bioinformatics we constantly work with sequence files, whether downloaded from NCBI databases or from organism-specific sources, and files are usually how we store information locally for further analysis. Therefore, one of the important things we're going to learn today is how to read and write files (input/output).

 * File `Wuhan-Hu-1_19A.fasta` contains the sequence of the reference strain in FASTA format
 * `spike_proteins.fasta` contains additional FASTA sequences that correspond to other virus variants

In [1]:
open('Data/Wuhan-Hu-1_19A.fasta', 'r')

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>

`r` means that you are opening the file for reading. Other options are writing to the file (`w`, which overwrites any existing content) and appending information to it (`a`). We can also read and write at the same time (`r+`).
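
The difference between the modes can be sketched with a throwaway file. This example is an illustration under stated assumptions: the file name `notes.txt` is hypothetical, and a temporary directory is used so that no real data file is touched. It relies on the `with` statement, which this notebook explains a little further on.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    notes_path = os.path.join(tmp_dir, 'notes.txt')  # hypothetical file name

    with open(notes_path, 'w') as notes_file:   # 'w' creates or overwrites
        notes_file.write('first line\n')

    with open(notes_path, 'a') as notes_file:   # 'a' adds to the end
        notes_file.write('second line\n')

    with open(notes_path, 'r') as notes_file:   # 'r' reads
        notes_content = notes_file.read()

print(notes_content)
```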

If a string with the relative file path is passed but no mode argument is provided, Python will assume we are only reading the file.

In [4]:
open('Data/Wuhan-Hu-1_19A.fasta')

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>

Like other Python objects, the file object returned by `open` can be assigned to a variable.

In [2]:
ref_sequence_file = open('Data/Wuhan-Hu-1_19A.fasta', 'r')

In [3]:
ref_sequence_file

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>

Lines in text files (like in `FASTA` format) are usually read as `strings`.

A good practice when reading files (or lines in text files) is to use the `with` keyword (see the link at the bottom of this notebook). As you saw a few lines before, we have to `open` a file, and it is important to ensure it is closed after reading. The `with` statement closes the file automatically when its block ends, even if an error occurs.

In [6]:
ref_sequence_file.close()

In [37]:
ref_sequence_file.closed

True

In [32]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    file = ref_sequence_file.read()
    print(file)

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICA

We learned previously many `functions` and `methods` that can be used to manipulate `strings`.

Now, we are going to convert the multiple lines representing the sequence of one single SARS-CoV-2 variant into a single string. For this purpose, we must:

 * Create a variable to store a string (our virus sequence)
 * Read each line of the file
   * For each line, check whether it starts with a `>`
     * If it does not, add that line to the variable we've just created

Now we are going to use two additional methods:
 * `startswith`
 * `strip`

For this, we will practice `for loops` and `conditionals`.

In [44]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1



After Python opens the file, every time `readline` is called it returns the next line in that file, until the end of the file is reached or the file is closed (which happens when the block of code inside the `with` statement finishes).

In [45]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())
    print(ref_sequence_file.readline())

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS



The `readlines` method returns a list with all the remaining lines of the file, which allows us to use the `for loop` we learned previously to analyze each line as a string:

In [47]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if line.startswith('>'):
            print(line)

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1



Notice that the method `readlines` (with an `s`) was used here, not `readline`.
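
To make the difference between the two methods concrete, here is a minimal sketch that uses an in-memory file (`io.StringIO`) holding a made-up record instead of the real FASTA file:

```python
import io

# A tiny made-up FASTA record standing in for a real file
toy_file = io.StringIO('>toy_record\nMFVF\nLVLL\n')

first_line = toy_file.readline()        # one line, returned as a string
remaining_lines = toy_file.readlines()  # all lines left, returned as a list

print(repr(first_line))
print(remaining_lines)
```

`readline` returns a single string, while `readlines` returns a list of strings; both keep the newline character at the end of each line.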

In [48]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            print(line)

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS

NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV

NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE

GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT

LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK

CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN

CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD

YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC

NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN

FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP

GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY

ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI

SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE

VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC

LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM

QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN

TLVKQLSS

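Each line read from the file ends with a newline character (`\n`), and `print` adds another one, which is why blank lines appear in the output above. There are at least two methods that can be used to remove the additional newline after reading each line: `strip` and `replace`. We already learned how to use `replace`. Let's try `strip`.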

In [49]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            print(line.strip('\n'))

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRL
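
For comparison, a minimal sketch of how `strip` and `replace` behave on newline characters, using made-up strings rather than the real file:

```python
line = 'MFVFLVLLPL\n'  # a single line, as returned by readline or readlines

stripped = line.strip('\n')        # removes newlines at the START and END only
replaced = line.replace('\n', '')  # removes EVERY newline in the string

print(stripped == replaced)        # identical for a single line

# The two methods differ when a newline sits in the middle of the string
two_lines = 'MFVF\nLVLL\n'
print(repr(two_lines.strip('\n')))
print(repr(two_lines.replace('\n', '')))
```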

We are almost there!

Now, let's go back to the first step in our algorithm and create a variable able to store this information:

In [50]:
wuhan_variant_seq = ''

with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            wuhan_variant_seq = wuhan_variant_seq + line.strip('\n')

In [51]:
wuhan_variant_seq

'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITG
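
Since this block is also about building functions, the steps above can be wrapped into a reusable function. This is a minimal sketch, not the notebook's own implementation: the function name `read_single_fasta_sequence` is hypothetical, it assumes a file with a single record, and it is demonstrated on a made-up in-memory record (`io.StringIO`) rather than on the real file.

```python
import io


def read_single_fasta_sequence(fasta_handle):
    """Join all non-header lines of a single-record FASTA file into one string."""
    sequence = ''
    # Looping over an open file yields one line at a time
    for line in fasta_handle:
        if not line.startswith('>'):
            sequence = sequence + line.strip('\n')
    return sequence


# A tiny made-up record standing in for a real file
toy_fasta = io.StringIO('>toy_record example\nMFVF\nLVLL\n')
toy_sequence = read_single_fasta_sequence(toy_fasta)
print(toy_sequence)
```

With a real file, the same function could be called inside a `with open(...) as handle:` block, passing `handle` as the argument.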

# References and links

## Books

* Libeskind-Hadas, Ran, and Eliot Bush. Computing for biologists: Python programming and principles. Cambridge University Press, 2014.
* Hunt, John. "A Beginners Guide to Python 3 Programming." (2021).

## Pages of the official Python documentation

* [Reading and writing files](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
