# Introduction to Python programming for bioscientists
Tutorial in the 30th Conference on Intelligent Systems for Molecular Biology (ISMB 2022)

### Instructors
- Hemanoel Passarelli Araujo
    * <font size=2>PhD candidate, Federal University of Minas Gerais, Brazil (passarelli@ufmg.br)</font>
- Pedro de Carvalho Braga Ilídio Silva
    * <font size=2>Master student, University of São Paulo, Brazil (ilidio@alumni.usp.br)</font>
- Renato Augusto Corrêa dos Santos 
    * <font size=2>PhD, University of Campinas, Brazil (renatoacsantos@gmail.com)</font>
- Vinícius Henrique Franceschini dos Santos
    * <font size=2>University of São Paulo, Brazil (vinicius6.santos@usp.br)</font>

### Learning Objectives for Tutorial

Programming skills have become crucial for bioscientists. In this tutorial, we will introduce basic Python concepts and compare SARS-CoV-2 genomes to show how powerful the Biopython toolkit can be for analyzing biological sequences.

The main objectives are:

- To introduce Google Colab digital notebooks;
- To present the basic logic and data structures in Python;
- To provide hands-on experience in analyzing biological sequences using Biopython.

<div id='table-of-contents' />

### Notebook structure
This notebook is structured into six modules (M):

**First day**
- [M1](#m1): Introduction to Python and data structures (study-load: 80 minutes);
- [M2](#m2): Logical operations (study-load: 50 minutes);
- [M3](#m3): Loops and iteration (study-load: 50 minutes);

**Second day**
- [M4](#m4): Functions (study-load: 80 minutes);
- [M5](#m5): Interacting with the operating system (study-load: 50 minutes);
- [M6](#m6): Biopython, file parsing, and multiple sequence analysis (study-load: 50 minutes);

We assume that you are already familiar with Google Colab notebooks. Check our notebook on this topic: <link to "how to use colab notebook">.


## Practical Project: COVID-19 and SARS-CoV-2

Coronaviruses are RNA viruses able to infect both humans and other animals. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes the coronavirus disease 2019 (COVID-19), and new viral lineages continued to emerge in 2022.

The large amount of data generated during the pandemic has allowed us to better understand the SARS-CoV-2 genome and the main genetic mechanisms of virus transmission. It is normal for viruses to change over time and accumulate mutations. The set of mutations in a genome can be used to define a viral lineage.

The SARS-CoV-2 genome comprises about 30 kb and encodes four structural proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N). SARS-CoV-2 relies on its S protein to interact with the human ACE2 receptor, enter the cell, and start the infection. The S protein has two subunits: S1 and S2. The S1 subunit is located at the N-terminus of the S protein and engages the human ACE2 receptor, while the S2 subunit mediates fusion with the host cell membrane.

The combination of mutations in the S protein is usually employed to discriminate SARS-CoV-2 lineages. See in the image below the main variants of concern of SARS-CoV-2:

![sars-cov-2](sars-cov-2-aln.jpg)
image source: https://viralzone.expasy.org/9556

In this tutorial, we will use the spike protein sequences of several SARS-CoV-2 strains to explore Python's potential for working with biological data.


## First day - July 6th, 2022

<div id='m1' />

### M1: Introduction to Python and data structures
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 80 minutes

**Learning objectives**

 * Run basic Python code and introduce students to documentation with Markdown on Google Colab notebooks
 * Introduce the basics of variables, numeric operations, strings, and booleans
 * Introduce built-in functions (print)
 * Show how to import a library and use functions beyond the built-in functions (math)

**Quick view of Google Colab's functionalities**

**What variables are and how to define variables in Python**

Variables have:
 * Type
 * Value

There are also rules we must follow to name variables:
 * They can have letters, numbers, or even underscores (`_`)
 * They must NOT have whitespace
 * They must NOT be reserved, special words of the Python language (we'll see some of these reserved words soon)

We will see later that operators that we use in several tasks also cannot be used for naming variables.

The `=` sign is used to assign values to variables.

In [None]:
name = "Renato"

In [None]:
number = 10
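The naming rules above can also be checked programmatically. The sketch below relies on the standard `str.isidentifier` method and the `keyword` module; the candidate names and the `validity` dictionary are made up for illustration only.

```python
import keyword

# Hypothetical candidate names, chosen only for illustration
candidates = ["spike_seq", "seq2", "2seq", "spike seq", "spike-seq", "for"]

validity = {}
for candidate in candidates:
    # isidentifier() rejects whitespace, leading digits, and operator characters;
    # iskeyword() rejects reserved words such as `for` or `if`
    validity[candidate] = candidate.isidentifier() and not keyword.iskeyword(candidate)
    print(candidate, validity[candidate])
```

Only `spike_seq` and `seq2` pass both checks.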

**Best practices in Python programming and the PEP8 conventions**

Try your best to provide names that are as descriptive as possible.

**Numbers**

We have three different types of numbers in Python:
 * Integers
 * Floating points
 * Complex numbers

This is an integer in Python:

In [None]:
10

10

Remember that variables have a type and a value.

We can check what type our newly created variable is by using the `type` function (we will use this function several times over the two days).

In [None]:
type(10)

int
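As a small sketch, the same `type` function can be applied to the three kinds of numbers listed above. The variable names and values are illustrative assumptions, not data from the tutorial.

```python
n_variants = 5        # int: a whole number (illustrative value)
gc_fraction = 0.38    # float: a number with a decimal point (illustrative value)
z = 2 + 3j            # complex: real part 2, imaginary part 3

print(type(n_variants))
print(type(gc_fraction))
print(type(z))
```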

**Python can be used in arithmetic operations**

Like in mathematics, Python can also be used to carry out many different arithmetic operations:

| **Operator** | **Description** | **Example** |
|------|------|------|
|  +   | Add the values  |  |
|  -   | Subtract the right value from the left value |  |
|  *   | Multiply the left and right values |  |
|  /   | Divide the left value by the right value |  |
|  //  | Integer division |  |
|  %   | Modulus - returns any remainder |  |
|  **  | Exponent (or power of) |  |

(Table was obtained from Hunt, 2021)
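A short sketch of the operators in the table, applied to two arbitrary integers (`a` and `b` are illustrative names):

```python
a = 7
b = 2

addition = a + b
subtraction = a - b
multiplication = a * b
division = a / b             # always returns a float
integer_division = a // b    # rounds the result down to a whole number
remainder = a % b
power = a ** b

print(addition, subtraction, multiplication)
print(division, integer_division, remainder, power)
```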

**Looking for help**

**Strings in Python**

 * Sequence (or series) of ordered characters
 * Immutable

We can define a string in Python using either a single (`'`) or a double quotation mark (`"`):

In [None]:
author = "Renato"  # correct
instructor = 'Renato'  # correct (single quotes work too)
Author = "Renato"  # runs, but capitalized variable names go against PEP8 conventions


Mixing single and double quotation marks to delimit the same string is not possible:

In [None]:
instructor = "Renato'

SyntaxError: EOL while scanning string literal (3551669578.py, line 1)

Both kinds of quotation marks are available mainly because sometimes we want to use quotes within a string:

In [None]:
person = "Renato's student"

Let's check the type of this variable:

In [None]:
person

"Renato's student"

In [None]:
type(person)

str

`str` not only indicates that the variable points to a string; it also works as a function that converts an object of another type to a string. We had previously created the `number` variable. Let's convert it to a string:

In [None]:
number = 10
type(number)

int

In [None]:
str(number)

'10'
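Strings were described above as ordered and immutable. A brief sketch of what that means in practice, using the first six residues of the spike protein that appears later in this notebook (the variable names are illustrative):

```python
seq = "MFVFLV"

print(seq[0])     # characters are ordered, so each one has a position
print(len(seq))

try:
    seq[0] = "A"  # strings are immutable: item assignment is not allowed
except TypeError as error:
    print("TypeError:", error)

# To "change" a string, we build a new one instead
new_seq = "A" + seq[1:]
print(new_seq)
```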

**Booleans in Python**

 * A Boolean (`bool`) takes one of two values: `True` or `False`
 * Booleans result from comparisons and logical operations
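A minimal sketch of Booleans in practice, reusing the `number` variable from above; the other variable names are illustrative.

```python
number = 10

is_large = number > 5                 # a comparison returns a Boolean
is_small_too = is_large and number < 8
print(is_large, type(is_large))
print(not is_large)
print(is_small_too)

# Other objects can be converted to Booleans: empty strings are False
print(bool(""), bool("Renato"))
```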





<div id='m2' />

### M2: Logical operations
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 50 minutes

**Learning objectives**
* Introduce basic functions and methods for string manipulation and a comparative analysis with sequences (lists and tuples)
* Introduce dictionaries and a comparative analysis between sequence indexing and dictionary keys

#### Functions and methods in Python for string handling

#### Functions and methods in Python for sequence handling

#### Dictionaries in Python
 * Key <-> value associations
 * Items are accessed by key, not by numeric position (different from lists and strings); since Python 3.7, dictionaries keep their insertion order
 * Mutable (different from other objects we studied; e.g., strings and tuples are immutable)

In [None]:
{'alpha': 'B.1.1.7', # See whitespaces, PEP8 convention (https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements)
 'beta': 'B.1.351',
 'gamma': 'P1',
 'delta': 'B.1.617.2',
 'omicron': 'BA.1'}

Like in previous cases, we can create a variable to "store" our dictionary.

Let's create a variable for using later in our practical sessions:

In [None]:
vars_id = {
    'alpha': 'B.1.1.7',
    'beta': 'B.1.351',
    'gamma': 'P1',
    'delta': 'B.1.617.2'
    }

We can access the values from each key with the notation: `dict[key]`.

For example, to get the ID of the Gamma variant, we run:

In [None]:
vars_id['gamma']

'P1'

Some built-in functions and `dict` methods let us inspect its content.

In [None]:
# returns the number of items in the dict
len(vars_id)

4

In [None]:
# returns the keys of the dict
vars_id.keys()

dict_keys(['alpha', 'beta', 'gamma', 'delta'])

> Note that the output above is in an unusual format. We can convert it to a list for easier interpretation.

In [None]:
# returns list of all keys
list(vars_id.keys())

['alpha', 'beta', 'gamma', 'delta']

In [None]:
# returns list of all values
list(vars_id.values())

['B.1.1.7', 'B.1.351', 'P1', 'B.1.617.2']

To add a key/value pair to the dict, we use the following:

In [None]:
vars_id['omicron'] = 'BA.1'

vars_id

{'alpha': 'B.1.1.7',
 'beta': 'B.1.351',
 'gamma': 'P1',
 'delta': 'B.1.617.2',
 'omicron': 'BA.1'}
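A few more dictionary operations, sketched on the same `vars_id` content. The key `'zeta'` is a hypothetical example of a key that is absent from the dictionary.

```python
vars_id = {
    'alpha': 'B.1.1.7',
    'beta': 'B.1.351',
    'gamma': 'P1',
    'delta': 'B.1.617.2',
    'omicron': 'BA.1'
    }

# Membership tests look at the keys
has_gamma = 'gamma' in vars_id
has_zeta = 'zeta' in vars_id
print(has_gamma, has_zeta)

# get() returns a default value instead of raising KeyError for a missing key
zeta_id = vars_id.get('zeta', 'unknown')
print(zeta_id)

# items() gives key/value pairs, which is convenient in a for loop
for name, lineage in vars_id.items():
    print(name, '->', lineage)

# pop() removes a key and returns its value (dictionaries are mutable)
removed = vars_id.pop('omicron')
print(removed, len(vars_id))
```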

The values in `vars_id` were strings. But the values of dictionaries in Python can be of any type, even `dict`.

A dictionary of dictionaries (or a nested dictionary) is a good way to structure data in Python:

In [None]:
vars_mutation = {
    'alpha': {
        'var_id': 'B.1.1.7',
        'ntd': 2,
        'rbd': 1,
        'cleavage': 1,
        'fusion_peptides': 0,
        'heptad_rep1': 0,
        'heptad_rep2': 0
        },
    'beta': {
        'var_id': 'B.1.351',
        'ntd': 4,
        'rbd': 3,
        'cleavage': 0,
        'fusion_peptides': 0,
        'heptad_rep1': 0,
        'heptad_rep2': 0
        },
    'gamma': {
        'var_id': 'P1',
        'ntd': 5,
        'rbd': 3,
        'cleavage': 1,
        'fusion_peptides': 0,
        'heptad_rep1': 0,
        'heptad_rep2': 1
        },
    'delta': {
        'var_id': 'B.1.617.2',
        'ntd': 8, 
        'rbd': 3,
        'cleavage': 1,
        'fusion_peptides': 0,
        'heptad_rep1': 1,
        'heptad_rep2': 0
        },
    'omicron': {
        'var_id': 'BA.1',
        'ntd': 8,
        'rbd': 15,
        'cleavage': 2,
        'fusion_peptides': 1,
        'heptad_rep1': 2,
        'heptad_rep2': 0
        }
    }



In [None]:
vars_mutation['omicron']

{'var_id': 'BA.1',
 'ntd': 8,
 'rbd': 15,
 'cleavage': 2,
 'fusion_peptides': 1,
 'heptad_rep1': 2,
 'heptad_rep2': 0}

Note that the result above is also a `dict`. We can access the values of the nested dict using the same `[]` notation:

In [None]:
# returns the number of mutations in the RBD of the Gamma variant
vars_mutation['gamma']['rbd']

3
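Nested dictionaries can also be traversed with a loop. The sketch below copies only three fields per variant from `vars_mutation` to stay short and looks for the variant with the largest `rbd` count; `most_rbd_variant` and `most_rbd_count` are illustrative names.

```python
vars_mutation = {
    'alpha': {'var_id': 'B.1.1.7', 'ntd': 2, 'rbd': 1},
    'beta': {'var_id': 'B.1.351', 'ntd': 4, 'rbd': 3},
    'gamma': {'var_id': 'P1', 'ntd': 5, 'rbd': 3},
    'delta': {'var_id': 'B.1.617.2', 'ntd': 8, 'rbd': 3},
    'omicron': {'var_id': 'BA.1', 'ntd': 8, 'rbd': 15}
    }

most_rbd_variant = None
most_rbd_count = -1

# Each value of the outer dict is itself a dict with the variant's data
for variant, info in vars_mutation.items():
    print(variant, info['var_id'], info['rbd'])
    if info['rbd'] > most_rbd_count:
        most_rbd_variant = variant
        most_rbd_count = info['rbd']

print(most_rbd_variant, most_rbd_count)
```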

<div id='m3' />

### M3: Loops and iteration
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 50 minutes

**Learning objectives**

* Show how to build functions
* File input/ output

Remember that we previously used Python built-in functions such as `str` and `input`.

We also showed you how to use data structures, functions, and methods from a library for doing calculations that were not included as a built-in Python implementation (`math`).

We can also create our own functions using the `def` keyword.
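As a minimal sketch of a user-defined function, the hypothetical `count_residue` below counts how many times one residue occurs in a sequence. The function name and the short test sequence are illustrative assumptions.

```python
def count_residue(sequence, residue):
    """Return how many times `residue` occurs in `sequence`."""
    count = 0
    for letter in sequence:
        if letter == residue:
            count += 1
    return count


# First 14 residues of the spike protein used in this notebook
n_leucines = count_residue("MFVFLVLLPLVSSQ", "L")
print(n_leucines)
```

The built-in string method `count` gives the same result; writing the loop by hand only shows the parts of a function: its name, its parameters, its body, and the value it returns.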

**Reading files**

We are constantly working with sequence files in Bioinformatics - sequences downloaded from NCBI databases, from organism-specific sources, and this is usually how we store information locally, for further analysis. Therefore, one of the important things we're going to learn today is how to read and write files (input/output).

 * File `Wuhan-Hu-1_19A.fasta` contains the FASTA file for the reference strain
 * `spike_proteins.fasta` contains additional FASTA sequences, that correspond to other virus variants

In [None]:
open('Data/Wuhan-Hu-1_19A.fasta', 'r')

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>

`r` means that you are opening the file for reading. Other options are writing to the file (`w`) and appending information to it (`a`). We can also read and write at the same time (`r+`).

If a string with the relative file path is passed but no additional argument is provided, Python will assume we are only reading the file.

In [None]:
open('Data/Wuhan-Hu-1_19A.fasta')

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>
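The modes can be tried safely on a throwaway file. This sketch works in a temporary folder created with the standard `tempfile` module; the file name `example.fasta` and its content are hypothetical.

```python
import os
import tempfile

# Work in a temporary folder so that no real data file is touched
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, 'example.fasta')

# 'w' creates the file (or overwrites it if it already exists)
out_file = open(example_path, 'w')
out_file.write('>toy_sequence\n')
out_file.close()

# 'a' appends to the end of the existing file
out_file = open(example_path, 'a')
out_file.write('MFVFLV\n')
out_file.close()

# 'r' is the default mode, so it can be omitted when reading
in_file = open(example_path)
content = in_file.read()
in_file.close()

print(content)
```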

Like other Python objects, the file object returned by `open` can be assigned to a variable.

In [None]:
ref_sequence_file = open('Data/Wuhan-Hu-1_19A.fasta', 'r')

In [None]:
ref_sequence_file

<_io.TextIOWrapper name='Data/Wuhan-Hu-1_19A.fasta' mode='r' encoding='UTF-8'>

Lines in text files (like those in `FASTA` format) are read as strings.

A good practice when reading files (or lines in text files) is to use the `with` keyword (see the link at the bottom of this notebook). As you saw a few lines before, we have to `open` a file, and it is important to ensure it is closed after reading. The `with` statement takes care of closing it for us.

In [None]:
ref_sequence_file.close()

In [None]:
ref_sequence_file.closed

True
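A small sketch of what `with` does for us, again on a throwaway file (the path and content are hypothetical): the file is open inside the indented block and closed automatically as soon as the block ends.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy.fasta')

with open(toy_path, 'w') as toy_file:
    toy_file.write('>toy\nMFVFLV\n')
    closed_inside = toy_file.closed   # still open inside the block

closed_outside = toy_file.closed      # closed automatically after the block

print(closed_inside, closed_outside)
```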

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    file = ref_sequence_file.read()
    print(file)

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICA

We learned previously many `functions` and `methods` that can be used to manipulate `strings`.

Now, we are going to convert the multiple lines representing the sequence of a single SARS-CoV-2 variant into a single string. For this purpose, we must:

 * Create a variable to store a string (our virus sequence)
 * Read each line of the file
   * For each line, check whether it starts with a `>`
     * If it does not, add that line to the variable we've just created

Now we are going to use two additional methods:
 * `startswith`
 * `strip`

For this, we will practice `for` loops and conditionals.

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1



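After Python opens the file, every time `readline` is called it returns the next line in that file, until the end of the file is reached or the file is closed (which happens when the block of code inside the `with` statement finishes). The blank lines in the output appear because each line already ends with a newline character and `print` adds another one.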

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())
    print(ref_sequence_file.readline())

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS



The `readlines` method returns a list with all the lines in the file, which allows us to use the `for` loop we learned previously to analyze each line as a string:

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if line.startswith('>'):
            print(line)

>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1



Notice that the `readlines` and `startswith` methods were used here.

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            print(line)

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS

NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV

NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE

GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT

LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK

CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN

CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD

YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC

NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN

FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP

GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY

ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI

SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE

VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC

LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM

QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN

TLVKQLSS

There are at least two methods that can be used to remove the additional newline after reading each line: `strip` and `replace`. We already learned how to use `replace`. Let's try the former.

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            print(line.strip('\n'))

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRL
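For comparison, a small sketch of how `strip`, `rstrip`, and `replace` behave on toy lines (the strings are made up). `repr` is used so that the newline and the spaces become visible.

```python
line = 'MFVFLV\n'
padded = '  MFVFLV \n'

print(repr(line))                      # the newline is part of the string
print(repr(line.strip('\n')))          # removes newlines from both ends
print(repr(line.replace('\n', '')))    # removes every newline in the string
print(repr(line.rstrip()))             # removes any whitespace on the right

# strip('\n') only removes newlines; strip() removes all surrounding whitespace
print(repr(padded.strip('\n')))
print(repr(padded.strip()))
```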

We are almost there!

Now, let's go back to the first step in our algorithm and create a variable able to store this information:

In [None]:
wuhan_variant_seq = ''

with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file.readlines():
        if not line.startswith('>'):
            wuhan_variant_seq = wuhan_variant_seq + line.strip('\n')

In [None]:
wuhan_variant_seq

'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITG
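An equivalent sketch of the same steps, shown here on a toy FASTA file written to a temporary folder (file name and content are hypothetical). It iterates over the file object directly instead of calling `readlines`, collects the pieces in a list, and joins them once at the end with `''.join`.

```python
import os
import tempfile

# Create a small FASTA file to work with
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy.fasta')
with open(toy_path, 'w') as toy_file:
    toy_file.write('>toy_variant description\nMFVFLV\nLLPLVS\nSQ\n')

pieces = []
with open(toy_path) as toy_file:
    # A file object can be looped over line by line
    for line in toy_file:
        if not line.startswith('>'):
            pieces.append(line.strip())

toy_seq = ''.join(pieces)
print(toy_seq, len(toy_seq))
```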





## Second day - July 7th, 2022

<div id='m4' />

### M4: Functions
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 80 minutes

**Learning objectives**
* Functions and modules

Note that in previous cells we used a lot of repeated, copy-pasted code. If you briefly revisit the last cells, you will notice many very similar lines, differing only in the filename or maybe in a variable's name.

An important goal we should have while writing programs is to avoid repetition like that as much as we can. The reason is that repeated code is often much harder to adapt, correct, and improve, both for yourself and for other programmers who might use your code in the future.
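As a sketch of how a function removes this kind of repetition, the file-reading loop can be written once and reused for any file. The function name `read_single_sequence` and the toy files are assumptions made for illustration; they are not part of the tutorial's data.

```python
import os
import tempfile


def read_single_sequence(fasta_path):
    """Return the sequence of a single-record FASTA file as one string."""
    sequence = ''
    with open(fasta_path) as fasta_file:
        for line in fasta_file:
            if not line.startswith('>'):
                sequence += line.strip()
    return sequence


# Two toy FASTA files in a temporary folder
tmp_dir = tempfile.mkdtemp()
toy_files = {
    'toy_a.fasta': '>toy_a\nMFVF\nLVLL\n',
    'toy_b.fasta': '>toy_b\nMFVF\nLPLV\n'
    }
for file_name, content in toy_files.items():
    with open(os.path.join(tmp_dir, file_name), 'w') as out_file:
        out_file.write(content)

# The same function is reused; only the argument changes
toy_a_seq = read_single_sequence(os.path.join(tmp_dir, 'toy_a.fasta'))
toy_b_seq = read_single_sequence(os.path.join(tmp_dir, 'toy_b.fasta'))
print(toy_a_seq)
print(toy_b_seq)
```

If a better way of reading the file is found later, only the body of the function needs to change.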

Consider the case we find a clearly better way of {{task}}. The more direct way would be to simply {...}

<div id='m5' />

### M5: Interacting with the operating system
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 50 minutes

**Learning objectives**
* Comparing biological sequences

#### Reading spike protein sequences
In this block we are going to take a look at how we can compare biological sequences using these powerful tools Python provides us.

Let's revisit the sequences of different SARS-CoV-2's spike proteins we've been using. We now have the knowledge to import those sequences from a fasta file using a `for` loop and a dictionary named `spike_proteins` to store them:

In [None]:
# Create an empty dictionary and store it in a variable called spike_proteins
spike_proteins = {}

# Open the fasta file and store it in the spike_proteins_fasta variable
with open('Data/spike_proteins.fasta') as spike_proteins_fasta:
    # Iterate across the file lines
    for line in spike_proteins_fasta:
        # If the current line is a description line
        if line.startswith('>'):
            # Get the variant's name from the description line
            variant_name = line.strip(' >\n').split()[0]
            # Create a new dictionary item with variant_name as the key and an
            # empty string as the value
            spike_proteins[variant_name] = ''
        else:
            # Append the sequence piece to the corresponding dictionary item
            spike_proteins[variant_name] += line.strip()

We can, as before, define a function to wrap this process of reading a fasta file into a dictionary, and make it easy for us to read other fastas thereafter.

In [None]:
def read_fasta(fasta_path):
    spike_proteins = {}

    with open(fasta_path) as spike_proteins_fasta:
        for line in spike_proteins_fasta:
            if line.startswith('>'):
                variant_name = line.strip(' >\n').split()[0]
                spike_proteins[variant_name] = ''
            else:
                spike_proteins[variant_name] += line.strip()

    return spike_proteins
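To see what the expression `line.strip(' >\n').split()[0]` does, this sketch applies it step by step to a shortened version of the description line shown earlier (the intermediate variable names are illustrative).

```python
header = '>Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein\n'

# 1. Remove spaces, '>' and newline characters from both ends
cleaned = header.strip(' >\n')

# 2. Split the remaining text on whitespace
fields = cleaned.split()

# 3. Keep only the first field, which is the variant's name
variant_name = fields[0]

print(cleaned)
print(fields)
print(variant_name)
```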

See how our dictionary came out:

In [None]:
spike_proteins = read_fasta('Data/spike_proteins.fasta')
print(spike_proteins)

{'Alpha_B.1.1.7': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIDDTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILARL

Now that we have those sequences in a convenient format to our explorations, how can we begin extracting useful information from them? How can we determine conserved regions and important residue insertions, deletions and substitutions?

A first approach might look like the following function.

In [None]:
def seq_align_1(seq1, seq2):
    alignment = ''

    for residue1, residue2 in zip(seq1, seq2):
        if residue1 == residue2:
            alignment += residue1
        else:
            alignment += '.'

    return alignment 

The function receives two strings representing the pair of protein sequences we intend to compare. We compare the first residue of the first protein with the first residue of the second, then the second residue of each, then the third, and so on. If the compared residues match at a given iteration, we add their one-letter representation to the `alignment` string. Otherwise, we add a `.` character to indicate the mismatch. At the end, the returned `alignment` string represents a consensus sequence between the two proteins, showing matching residues and replacing mismatched ones with a `.` in their corresponding place.

In [None]:
seq_align_1("ABCDEFG", "ACBDEFP")

'A..DEF.'
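The returned string can be summarized with a couple of extra lines. In this sketch the function is repeated so that the cell runs on its own; the names `n_mismatches`, `identity`, and `shorter` are illustrative, and counting `.` characters assumes that the sequences themselves contain no dots.

```python
def seq_align_1(seq1, seq2):
    alignment = ''

    for residue1, residue2 in zip(seq1, seq2):
        if residue1 == residue2:
            alignment += residue1
        else:
            alignment += '.'

    return alignment


alignment = seq_align_1("ABCDEFG", "ACBDEFP")

# Each '.' marks one mismatched position
n_mismatches = alignment.count('.')
identity = (len(alignment) - n_mismatches) / len(alignment)
print(alignment, n_mismatches, round(identity, 2))

# zip stops at the end of the shorter sequence, so extra residues are ignored
shorter = seq_align_1("ABCDEFG", "ABC")
print(shorter)
```

The last call shows a limitation worth keeping in mind: when the sequences differ in length, the trailing residues of the longer one are silently dropped.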

But how will our function perform on real-world biological sequences? Let's test it on SARS-CoV-2 data. The labels of the available sequences can be shown with:

In [None]:
spike_proteins.keys()

dict_keys(['Alpha_B.1.1.7', 'Beta_B.1.351', 'Gamma_P1', 'Delta_B.1.617.2', 'Omicron_BA.1', 'Omicron_BA.2'])

So let's compare the **Beta_B.1.351** variant with **Alpha_B.1.1.7**:

In [None]:
alignment = seq_align_1(spike_proteins['Beta_B.1.351'], spike_proteins['Alpha_B.1.1.7'])
print(alignment)

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAI................................................................F.............K............................L....G...N.....................................L..L..............LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTG.IADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGV.GFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDI.DTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNS.RRARSVASQSIIAYTMSLG.ENSVAYSNNSIAIP.NFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDIL.RLDKVEAEVQIDRLITGRLQS

Although the alignment shows a large mismatching region between both proteins, a more careful inspection of the beginning of this region reveals that they are not as different as one might think:

In [None]:
start, end = 50, 150
print(alignment[start:end])
print(spike_proteins['Beta_B.1.351'][start:end])
print(spike_proteins['Alpha_B.1.1.7'][start:end])

TQDLFLPFFSNVTWFHAI................................................................F.............K...
TQDLFLPFFSNVTWFHAIHVSGTNGTKRFANPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNK
TQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKSWM


More specifically, if we consider the histidine-valine pair at positions 69 and 70 (Python indices 68 and 69) as an insertion and realign the sequences from that point on, we see a much higher resemblance between both proteins:

In [None]:
start, end = 50, 150
istart, iend = 68, 70  # Insertion start, insertion end
seqB = spike_proteins['Beta_B.1.351'] 
seqA = spike_proteins['Alpha_B.1.1.7']
blank = ' ' * (iend-istart+2)  # For printing purposes

new_alignment = seq_align_1(seqB[iend:], seqA[istart:])

print(alignment[start:istart] + blank + new_alignment[:end-iend])
print(seqB[start:istart] + '(' + seqB[istart:iend]+ ')' + seqB[iend:end])
print(seqA[start:istart] + blank + seqA[istart:end-2])

TQDLFLPFFSNVTWFHAI    SGTNGTKRF.NPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVY...N..
TQDLFLPFFSNVTWFHAI(HV)SGTNGTKRFANPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNK
TQDLFLPFFSNVTWFHAI    SGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKS
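The effect of a single insertion can be reproduced on two short fragments taken from the region printed above. This sketch repeats `seq_align_1` so that it runs on its own; the fragment boundaries and the index values are chosen for illustration.

```python
def seq_align_1(seq1, seq2):
    alignment = ''

    for residue1, residue2 in zip(seq1, seq2):
        if residue1 == residue2:
            alignment += residue1
        else:
            alignment += '.'

    return alignment


seq_b = "TWFHAIHVSGTNG"   # fragment containing the extra "HV" pair
seq_a = "TWFHAISGTNG"     # same region without the pair

# Position-by-position comparison: everything after the insertion mismatches
naive = seq_align_1(seq_b, seq_a)

# Skipping the two inserted residues puts the remaining ones back in register
istart, iend = 6, 8       # indices of "HV" within the fragment
realigned = seq_align_1(seq_b[:istart] + seq_b[iend:], seq_a)

print(naive)
print(realigned)
```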


The same issue seems to occur with the tyrosine insertion next to the end of the region shown above, and has a much more drastic effect when comparing other variants such as **Gamma_P1** and **Alpha_B.1.1.7**:

In [None]:
seq_align_1(spike_proteins['Gamma_P1'], spike_proteins['Alpha_B.1.1.7'])

'MFVFLVLLPLVSSQCVN.T.RTQLP.AYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAI................................................................F.............K............................L....G...N.....................................L..L..............L..................A.....Y....................D.....L.....T...............................N..........................................F............L.........................G.......................N..............L.R...........................G......................Y.......................................G.............F..F.....D..D..........L.I..............T...N...................A.............S............................................R...S.....................S.......T...................................................................Q............F...........PSK.S............................................L..L..............A.............A..................................N.....I.....S..S....L........Q....L.....................L..................

We then feel the need for more sophisticated alignment functions, capable of

1. Taking insertions and deletions into consideration;

2. Weighting substitutions differently, based on their probability of occurring in nature. For instance, an alanine to proline substitution is likely to have much more impact on the protein's structure than an alanine to valine substitution;

3. Performing local alignments, i.e., identifying highly matching regions instead of a single global alignment; and

4. Performing not only pairwise but also multiple sequence alignments at once.
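To make the first point concrete, here is a minimal sketch of a global alignment that allows gaps, in the style of the Needleman-Wunsch algorithm. The function name `global_align` and the scores (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary choices for illustration; this is not the method used in the rest of the tutorial.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Globally align two strings, marking gaps with '-'."""
    n, m = len(seq1), len(seq2)

    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align both residues
                              score[i - 1][j] + gap,        # gap in seq2
                              score[i][j - 1] + gap)        # gap in seq1

    # Walk back from the bottom-right corner to recover one best alignment
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = None
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if pair is not None and score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append('-')
            i -= 1
        else:
            aligned1.append('-')
            aligned2.append(seq2[j - 1])
            j -= 1

    return ''.join(reversed(aligned1)), ''.join(reversed(aligned2))


top, bottom = global_align("TWFHAIHVSGTNG", "TWFHAISGTNG")
print(top)
print(bottom)
```

On these two toy fragments the gaps land opposite the `HV` pair, so every other residue lines up again. Weighted substitutions, local alignments, and multiple sequence alignments (points 2 to 4) are not covered by this sketch.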

Coming up with algorithms or searching for them in the scientific literature and implementing them as Python functions would cost us considerable time. Luckily, a lot of people have run into this same task, so ready-to-use implementations have already been meticulously developed and are available to everyone who needs them.

Functions made by other people to expand Python capabilities are usually distributed in what we call **packages**. From this moment on we will be exploring some functions of the `biopython` package, that provides plenty of tools for reading and processing biological data.

<div id='m6' />

### M6: Biopython, file parsing, and multiple sequence analysis
[Back to table of contents](#table-of-contents)

**Estimated study-load**: 50 minutes

**Learning objectives**:
* Introduce Biopython for computational molecular biology;
* Demonstrate how to parse a FASTA file;
* Align protein sequences;
* Extract alignment regions.


The official documentation of Biopython is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

#### What is Biopython and how to install it?

Biopython is a set of freely available tools for computational molecular biology, written in Python by an international team of developers. You can use Biopython to parse several bioinformatics file formats, including FASTA, GenBank (gbk), and BLAST output.

It is very important for a bioinformatician to become familiar with Biopython, as it works like a Swiss Army knife that can help you in many situations. By this point in the tutorial, you have noticed that there are several file formats that we can use to store information. A classic one is the FASTA format.

By "parsing" a FASTA file, we mean extracting the information it contains and storing it in a form that gives us more control over how to process it. However, before we start exploring the potential of Biopython to handle files, let's install it.

In [None]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp39-cp39-macosx_10_9_x86_64.whl (2.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.79


In [None]:
import Bio
print(Bio.__version__)

1.79
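
Before handing the work over to Biopython, it may help to see what "parsing" a FASTA file amounts to. The sketch below is a simplified, plain-Python illustration and not how Biopython does it internally; the function name `parse_fasta` and the two toy records are made up for this example.

```python
import io

def parse_fasta(handle):
    """Return {sequence id: sequence string} from an open FASTA handle."""
    chunks = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The id is the first word after ">"
            current_id = line[1:].split()[0]
            chunks[current_id] = []
        elif current_id is not None:
            # Sequence lines are collected and joined at the end
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}

# Made-up records, used only to exercise the function
example = io.StringIO(">seq_one first toy record\nMFVF\nLVLL\n>seq_two\nMKT\n")
records = parse_fasta(example)
print(records)  # {'seq_one': 'MFVFLVLL', 'seq_two': 'MKT'}
```

A real parser also has to keep the full description line and cope with malformed input, which is one reason to prefer a well-tested package.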


#### Parsing a fasta file

Let's read the fasta file containing the spike protein for seven SARS-CoV-2 lineages.

In [None]:
from Bio import SeqIO

file_path = "spike_proteins.fasta"

for seq_record in SeqIO.parse(file_path, "fasta"):
    print(seq_record.id)  # sequence name after each ">"
    print(repr(seq_record.seq))  # abbreviated view of the protein sequence
    print(len(seq_record))  # sequence length

Wuhan-Hu-1_19A
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Alpha_B.1.1.7
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Beta_B.1.351
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Gamma_P1
Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Delta_B.1.617.2
Seq('MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1271
Omicron_BA.1
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Omicron_BA.2
Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')
1270


We did it! We have now gathered the information contained in the FASTA file much faster than in the previous modules. But we can do better: let's store this information in a Python dictionary.

In [None]:
with open(file_path, "r") as fh: #fh = file handle
    record_dict = SeqIO.to_dict(SeqIO.parse(fh, "fasta"))

In [None]:
record_dict

{'Wuhan-Hu-1_19A': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Wuhan-Hu-1_19A', name='Wuhan-Hu-1_19A', description='Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1', dbxrefs=[]),
 'Alpha_B.1.1.7': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Alpha_B.1.1.7', name='Alpha_B.1.1.7', description='Alpha_B.1.1.7 tr|A0A7T8KZF1|A0A7T8KZF1_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=3 SV=1', dbxrefs=[]),
 'Beta_B.1.351': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Beta_B.1.351', name='Beta_B.1.351', description='Beta_B.1.351 QRN78347.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[]),
 'Gamma_P1': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Gamma_P1', name='G

In this dictionary, each sequence name is a key and the corresponding `SeqRecord`, with all its related information, is the value.

In [None]:
# How many sequences have we read?
print(len(record_dict))

7


In [None]:
# Getting sequence names
print(record_dict.keys())

dict_keys(['Wuhan-Hu-1_19A', 'Alpha_B.1.1.7', 'Beta_B.1.351', 'Gamma_P1', 'Delta_B.1.617.2', 'Omicron_BA.1', 'Omicron_BA.2'])


In [None]:
# Sequence information for Omicron BA.2 variant

record_dict["Omicron_BA.2"]

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT'), id='Omicron_BA.2', name='Omicron_BA.2', description='Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])

The `SeqRecord` object offers a lot of information as attributes, including:

- `.seq`: the sequence itself.
- `.id`: the primary ID used to identify the sequence.
- `.name`: similar to `.id`.
- `.description`: an expanded name of the FASTA sequence in a more readable presentation.


In [None]:
# Retrieving the sequence as a Seq object
record_dict["Omicron_BA.2"].seq

Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')

In [None]:
# Sequence description
record_dict["Omicron_BA.2"].description

'Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]'
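
Once the records sit in a dictionary, simple comparisons take only a few lines. As a minimal sketch, the hypothetical helper below reports how much longer or shorter each sequence is than a chosen reference; it works on plain strings, and the `toy_sequences` are made-up stand-ins for the real spike proteins.

```python
def length_differences(sequences, reference_name):
    """Length of every sequence minus the length of the reference."""
    reference_length = len(sequences[reference_name])
    return {
        name: len(seq) - reference_length
        for name, seq in sequences.items()
        if name != reference_name
    }

# Made-up stand-ins for real records
toy_sequences = {
    "reference": "MFVFLVLLPL",
    "variant_a": "MFVFLVL",
    "variant_b": "MFVFLVLLPL",
}
print(length_differences(toy_sequences, "reference"))  # {'variant_a': -3, 'variant_b': 0}
```

In the notebook, a comparable dictionary could be built with `{name: str(record.seq) for name, record in record_dict.items()}`. With the lengths printed earlier (1273 for Wuhan-Hu-1_19A and 1270 for Alpha_B.1.1.7), the Alpha entry would come out as -3.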

#### Multiple sequence alignment (MSA)

A multiple sequence alignment (MSA) is the alignment of three or more biological sequences (DNA or protein). We can use the output to infer evolutionary relationships. In this section, we will use an MSA of spike proteins to explore mutations.

In [None]:
from Bio import AlignIO

msa_file = "clustal_spike_msa.txt"
spike_align = AlignIO.read(msa_file, "clustal")
print(spike_align)

Alignment with 7 rows and 1275 columns
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Omicron_BA.1
MFVFLVLLPLVSSQCVNLITRTQ---SYTNSFTRGVYYPDKVFR...HYT Omicron_BA.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Alpha_B.1.1.7
MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFR...HYT Gamma_P1
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Delta_B.1.617.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Wuhan-Hu-1_19A
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Beta_B.1.351


The gaps inserted in the Omicron_BA.2 sequence show that we now have a multiple sequence alignment. Let's explore the `Bio.Align.MultipleSeqAlignment` object. First, let's look again at the spike protein alignment.

![sars-cov-2](sars-cov-2-aln.jpg)
image source: https://viralzone.expasy.org/9556

In [None]:
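# Position 501 is 1-based, while Python indexes from 0. Gap columns earlier in
# this alignment shift the column as well, so N501 sits in column 502 here.
start = 501 + 1
end = 503
print(spike_align[:, start:end])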

Alignment with 7 rows and 1 columns
Y Omicron_BA.1
Y Omicron_BA.2
Y Alpha_B.1.1.7
Y Gamma_P1
N Delta_B.1.617.2
N Wuhan-Hu-1_19A
Y Beta_B.1.351


You can check in the image that all lineages except Delta B.1.617.2 and Wuhan-Hu-1 carry a mutation that changes asparagine to tyrosine at position 501 (N501Y). You can also create a simple function to retrieve a specific position in the alignment.

In [None]:
# Function to get a single alignment column

def get_aln_position(aln, start, end):
    # The +1 offset matches the manual slice used for position 501 above
    return aln[:, start+1:end]

# Retrieving position 417
print(get_aln_position(aln=spike_align, start=417, end=419))

Alignment with 7 rows and 1 columns
N Omicron_BA.1
N Omicron_BA.2
K Alpha_B.1.1.7
T Gamma_P1
K Delta_B.1.617.2
K Wuhan-Hu-1_19A
N Beta_B.1.351
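
The fixed `+1` offset above works for this particular alignment, but the column that holds a given residue depends on how many gap characters precede it in the reference row. A more general approach, sketched below under the assumption that gaps are written as `-`, is to walk along the aligned reference and count residues. The function names and the toy alignment are hypothetical and work on plain strings.

```python
def reference_to_column(aligned_reference, position, gap_char="-"):
    """0-based alignment column holding residue `position` (1-based) of the reference."""
    residues_seen = 0
    for column, char in enumerate(aligned_reference):
        if char != gap_char:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError(f"reference has fewer than {position} residues")

def substitutions_at(aligned_rows, reference_name, position):
    """Label every row that differs from the reference at `position`, e.g. 'T3R'."""
    reference = aligned_rows[reference_name]
    column = reference_to_column(reference, position)
    ref_residue = reference[column]
    return {
        name: f"{ref_residue}{position}{row[column]}"
        for name, row in aligned_rows.items()
        if row[column] != ref_residue
    }

# Made-up alignment: the reference has two gap columns before its third residue
toy_alignment = {
    "reference": "MK--TAY",
    "variant_a": "MKLPRAY",
    "variant_b": "MK--TAY",
}
print(reference_to_column(toy_alignment["reference"], 3))  # 4
print(substitutions_at(toy_alignment, "reference", 3))     # {'variant_a': 'T3R'}
```

With the real data, the aligned reference string could be taken from the Wuhan-Hu-1_19A row of `spike_align` (for example with `str(record.seq)` while looping over the alignment), and the returned column used as `spike_align[:, column]`.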


#### Distance tree from MSA

We can also use Python to construct a simple tree based on sequence distances.

In [None]:
from Bio.Phylo.TreeConstruction import DistanceCalculator

# create a calculator that scores sequences by identity
calculator = DistanceCalculator('identity')

# calculate distance matrix (dm) from SARS-CoV-2 spike sequences

dm = calculator.get_distance(spike_align)
print(dm)

Omicron_BA.1	0
Omicron_BA.2	0.02117647058823524	0
Alpha_B.1.1.7	0.029803921568627434	0.027450980392156876	0
Gamma_P1	0.03450980392156866	0.026666666666666616	0.014117647058823568	0
Delta_B.1.617.2	0.0337254901960784	0.025882352941176467	0.013333333333333308	0.015686274509803977	0
Wuhan-Hu-1_19A	0.03137254901960784	0.02431372549019606	0.007843137254901933	0.009411764705882342	0.007843137254901933	0
Beta_B.1.351	0.0337254901960784	0.026666666666666616	0.01254901960784316	0.0117647058823529	0.014117647058823568	0.007843137254901933	0
	Omicron_BA.1	Omicron_BA.2	Alpha_B.1.1.7	Gamma_P1	Delta_B.1.617.2	Wuhan-Hu-1_19A	Beta_B.1.351
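
To make the numbers above less abstract, the sketch below computes an identity-based distance by hand: one minus the fraction of alignment columns in which both rows carry the same residue. It assumes that columns containing a gap (`-`) or a stop (`*`) never count as matches; Biopython's own treatment of such columns may differ in detail, and the function name and toy rows are made up. For orientation, the smallest off-diagonal value printed above, 0.0078..., is consistent with 10 differing columns out of 1275.

```python
def identity_distance(row_a, row_b, skip_chars="-*"):
    """1 minus the fraction of columns where both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(
        a == b and a not in skip_chars
        for a, b in zip(row_a, row_b)
    )
    return 1 - matches / len(row_a)

# Toy aligned rows of length 10
print(identity_distance("MKTAYIAKQR", "MKTAYIAKQR"))  # identical rows
print(identity_distance("MKTAYIAKQR", "MRTAYIAKQR"))  # one substitution
print(identity_distance("MK-AYIAKQR", "MKTAYIAKQR"))  # one gap column
```

The last two calls both give a distance of about 0.1, because one column out of ten fails to match in each case.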


In [None]:
# Constructing the distance tree
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
tree.root_with_outgroup("Wuhan-Hu-1_19A", outgroup_branch_length=0.002)

# Visualize the distance tree
Phylo.draw_ascii(tree)

                                       ______________________ Omicron_BA.2
            __________________________|
          _|                          |______________________ Omicron_BA.1
         | |
        _| |_______________ Gamma_P1
       | |
      _| |_____________ Delta_B.1.617.2
     | |
  ___| |__________ Alpha_B.1.1.7
 |   |
_|   |_______ Beta_B.1.351
 |
 |___ Wuhan-Hu-1_19A
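
The idea behind UPGMA can be sketched in a few lines: repeatedly merge the two closest clusters and average their distances to every other cluster, weighted by cluster size. The version below is a simplified illustration that returns only the grouping as nested tuples and ignores branch lengths; it is not Biopython's implementation, and the function name and toy distances are made up.

```python
def upgma_topology(names, matrix):
    """Group taxa by UPGMA; `matrix` is a full symmetric list of lists."""
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> number of taxa
    dist = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        merged_size = sizes[i] + sizes[j]
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance to the new cluster
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / merged_size
        # Drop every distance that involves the two merged clusters
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = merged_size
        del sizes[i], sizes[j]
        next_id += 1
    return next(iter(trees.values()))

# Made-up distances: A and B are close, C and D are close
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 10, 10],
    [2, 0, 10, 10],
    [10, 10, 0, 4],
    [10, 10, 4, 0],
]
print(upgma_topology(toy_names, toy_matrix))  # (('A', 'B'), ('C', 'D'))
```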



In this course module, we explored the potential of Biopython to work with biological sequences. We saw how to parse fasta files, how to work with multiple sequence alignments, and how to build a simple distance tree using SARS-CoV-2 as a model organism.

There are many utilities that you can further explore with Biopython. This course has shown you some of the possibilities. Keep up with your studies.

See you soon.

## What next?
[Back to table of contents](#table-of-contents)

{Directions to keep studying Python and bioinformatics in a motivating way}.

### References and links

#### Books

* Libeskind-Hadas, Ran, and Eliot Bush. Computing for biologists: Python programming and principles. Cambridge University Press, 2014.
* Hunt, John. A Beginners Guide to Python 3 Programming. 2021.

#### Pages of the official Python documentation

* [Reading and writing files](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
