# Introduction to Python programming for bioscientists
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SantosRAC/intro_python_ismb2022/blob/main/ISBM_2022_Python.ipynb)

Tutorial at the 30th Conference on Intelligent Systems for Molecular Biology (ISMB 2022)

### Instructors
- Hemanoel Passarelli Araujo
    * PhD candidate, Federal University of Minas Gerais, Brazil (passarelli@ufmg.br)
- Pedro de Carvalho Braga Ilídio Silva
    * Master's student, University of São Paulo, Brazil (ilidio@alumni.usp.br)
- Dr. Renato Augusto Corrêa dos Santos
    * Post-doctoral researcher, University of São Paulo, Brazil (renatoacsantos@gmail.com)
- Vinícius Henrique Franceschini dos Santos
    * University of São Paulo, Brazil (vinicius6.santos@usp.br)

### Learning Objectives for Tutorial

Programming skills have become crucial for bioscientists. In this tutorial, we will introduce basic Python concepts and compare SARS-CoV-2 sequences to show how powerful the Biopython toolkit can be for analyzing biological data.

The main objectives are:

- To introduce digital notebooks on Google Colab;
- To present the basic logic and data structures in Python;
- To provide hands-on experience in analyzing biological sequences using Biopython.

### Notebook structure
<div id='table-of-contents' />
This notebook is structured into six modules (M):

**First day**
- [M1](#m1): Introduction to Python variables and basic types (study-load: 80 minutes);
- [M2](#m2): Logical operations and additional data structures (study-load: 50 minutes);
- [M3](#m3): Loading content from files (study-load: 50 minutes);

**Second day**
- [M4](#m4): Functions (study-load: 80 minutes);
- [M5](#m5): Interacting with the operating system (study-load: 50 minutes);
- [M6](#m6): Biopython, file parsing, and multiple sequence analysis (study-load: 50 minutes);


## First day - July 6th, 2022

### Practical Project: COVID-19 and SARS-CoV-2

Coronaviruses are RNA viruses able to infect humans and other animals. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19), and new virus lineages continue to emerge in 2022.

The large amount of data generated during the pandemic allowed us to better understand the SARS-CoV-2 genome and the main genetic mechanisms of virus transmission. It is normal for viruses to change over time and accumulate mutations, and the set of mutations in a genome can be used to define a viral lineage.

The SARS-CoV-2 genome is about 30 kb long and encodes four structural proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N). SARS-CoV-2 viruses rely on their S protein to interact with the human ACE2 receptor, making it possible for the viral particles to enter the cell and start the infection. The S protein has two subunits: S1 and S2. The S1 subunit is located at the N-terminus of the S protein and engages with the human ACE2 receptor, while the S2 subunit mediates the fusion of the viral envelope with the host cell's membrane.

The combination of mutations in the S protein is especially important for discriminating between SARS-CoV-2 lineages. The image below shows the main SARS-CoV-2 variants of concern:

<!--![sars-cov-2](sars-cov-2-aln.jpg)-->
![sars-cov-2](https://viralzone.expasy.org/resources/Variants_graph.svg)
**Image and sequences source:** https://viralzone.expasy.org/9556

In this tutorial, we will use the spike protein sequences of several SARS-CoV-2 strains to explore Python's potential for working with biological data.


<div id='m1' />

### M1: Introduction to Python variables and basic types

[Back to table of contents](#table-of-contents)

**Estimated study load**: 80 minutes

**Learning objectives**

 * Running minimal Python code;
 * Basics of Python notebooks;
 * Basics of variables, numeric operations and strings;
 * `type()` and `print()` built-in functions.

Imagine the following scenario. You work for a biomedical laboratory with an ongoing research project on sequencing and cataloging the genome of SARS-CoV-2 samples collected from the local population. You receive a file containing the spike protein sequence of a newly sequenced sample. You are asked to determine which virus strain infected the patient from whom the sample was taken, providing data for subsequent epidemiological analyses.

The sequenced protein is represented in the code cell below. Pieces of textual information in the code cells will always be represented between single or double quotes (`'` or `"`).

In [None]:
"MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"

As mentioned in the project introduction, you know that what defines different viral lineages are specific mutations in the spike protein, especially in some important regions such as the RBD, the receptor-binding domain. An idea then strikes you: you only need to find a specific sequence motif to determine the strain you are dealing with. For the Omicron variant, for example, the following sequence segment must be present in the sequenced protein:

In [None]:
"RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV"

'RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV'

However, manually searching for subsequences, and doing so for all the main variants, can be very time-consuming. Fortunately, in Python it becomes just a matter of asking: "is A in B?". We simply write `(sequence A) in (sequence B)`, and a `True` or `False` output tells us whether the statement holds.

In [None]:
"RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV" in "MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"

False

It must not be the Omicron variant then!

Since the very long lines make it difficult for us to read and reuse the sequences, we can define "aliases" for each piece of text using the "equal" sign (`=`):

In [None]:
omicron_rbd_segment = "RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV"
sequenced_protein = "MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"

The previous test, checking whether the Omicron sequence segment is present in our sequenced spike protein, can now be rewritten much more concisely:

In [None]:
omicron_rbd_segment in sequenced_protein

False

If you find such sequence segments for all SARS-CoV-2 variants, you can achieve our original goal. You then go ahead and search in public databases for those motifs, naming them as below. 

In [None]:
alpha_rbd_segment = "RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV"
beta_rbd_segment = "RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV"
gamma_rbd_segment = "YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT"
delta_rbd_segment = "YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG"

You are ready to check each variant.

In [None]:
alpha_rbd_segment in sequenced_protein

False

In [None]:
beta_rbd_segment in sequenced_protein

False

In [None]:
gamma_rbd_segment in sequenced_protein

False

In [None]:
delta_rbd_segment in sequenced_protein

True

You've found it! The source patient has likely been infected with a Delta-strain SARS-CoV-2.
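The five checks above repeat the same test with a different motif each time. As a preview of the dictionaries and `for` loops introduced in M2, here is a minimal sketch that runs all of them in one cell. To keep it short, `sequenced_excerpt` is only a fragment of the sequenced protein (the stretch that covers the motifs), and the names `variant_segments` and `matching_variants` are our own, chosen for illustration.

```python
# Fragment of the sequenced spike protein covering the RBD motifs
# (an excerpt, used here only to keep the example short).
sequenced_excerpt = "RGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG"

# Motifs copied from the cells above, stored under the variant name.
variant_segments = {
    "alpha": "RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV",
    "beta": "RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV",
    "gamma": "YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT",
    "delta": "YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG",
    "omicron": "RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV",
}

# Collect the name of every variant whose motif occurs in the excerpt.
matching_variants = []
for variant_name, segment in variant_segments.items():
    if segment in sequenced_excerpt:
        matching_variants.append(variant_name)

print(matching_variants)
```

Only the Delta motif is found, in agreement with the cell-by-cell checks.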

At this point, we are only scratching the surface of what a programming language (Python, in our case) can do for you. Throughout this tutorial, we will show several examples of biological analyses that can be efficiently automated, gently introducing you to computer programming, so that you can unlock the capabilities of your personal computer and let it do the tedious computational jobs for you.

> **NOTE:** Programming empowers you.

#### Quick view of Google Colab's functionalities

At this stage, we will follow the steps on the complementary notebook about digital notebooks on Google Colaboratory.

#### What variables are and how to define them in Python

What we called "aliases" in the previous section are formally called **variables**. Variables are identifiers that "point to" or "store" many types of information.

Variables have:
* **Name:** the identifier we give to them. E.g. `delta_rbd_segment`, `omicron_rbd_segment`;
    
* **Type:** the kind of information it represents. E.g. an integer number, a decimal number, a piece of text, etc.;
    
* **Value:** the actual information it is holding. E.g. `3.14159`, `256` or maybe `"Hi there!"`.


> **NOTE:** Variables have **name**, **value** and **type**.

There are some rules we must follow when naming variables:
* Variable names can only have letters, numbers and underscores;
* Consequently, they must NOT have whitespaces;
* Although they can contain numbers, they CANNOT start with them; 
* They must NOT be reserved, special words of the Python language (we will see some of these reserved words soon).
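The rules above can also be checked programmatically. The sketch below is only an illustration: it uses the string method `str.isidentifier()` and the standard-library function `keyword.iskeyword()`, and the list `candidate_names` is made up for this example. Note that `isidentifier()` is slightly more permissive than the rules as stated, since it also accepts non-ASCII letters.

```python
import keyword

# Hypothetical candidate names, chosen to illustrate each rule.
candidate_names = ["genome size", "3D_structure", "for", "coronavirus_genome_size"]

name_is_valid = {}
for name in candidate_names:
    # isidentifier() checks the characters used and the first character;
    # iskeyword() tells us whether the name is a reserved word.
    name_is_valid[name] = name.isidentifier() and not keyword.iskeyword(name)

for name in candidate_names:
    print(name, "->", name_is_valid[name])
```

Only `coronavirus_genome_size` passes both checks: the first name contains a whitespace, the second starts with a number, and `for` is a reserved word.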

Here are some examples using **incorrect** variable names, so that the code cells will naturally throw an error when we run them. As we have seen earlier, the `=` sign is what we use to assign values to variables.

In [None]:
genome size = 29000

SyntaxError: ignored

In [None]:
number-of-colonies-in-our-plate = 32

In [None]:
my_solution_concentration_in_g/cm^3 = 0.45

In [None]:
3D_structure_of_our_protein = "our_amazing_protein.pdb"

See if you can come up with valid alternatives for the variable names above.

Multiple possibilities exist to visually separate words when defining a variable, namely what we call `camelCase`, `PascalCase` or `snake_case`. We conventionally adopt the last option as the preferred one.

In [None]:
coronavirusGenomeSize = 29000  # camelCase

In [None]:
CoronavirusGenomeSize = 29000  # PascalCase

In [None]:
coronavirus_genome_size = 29000  # snake_case

> **Side note:** All text written after the `#` character in each line is ignored by Python. This is very useful for writing **comments** in our code, explaining to other people how some of our confusing lines of code work.

> **Side note:** Even if shorter names are easier to write multiple times, try your best to provide variable names that are as descriptive as possible, so that your code is always easy for other programmers to understand.

#### PEP8 conventions

The most accepted Python code style conventions (such as using `snake_case` for variables) are defined in a document called [PEP8](https://peps.python.org/pep-0008/). These are the standards we will follow in this tutorial. If you have any doubts regarding spacing, indentation, and other seemingly arbitrary code formatting options, you can always search for what [PEP8](https://peps.python.org/pep-0008/) states about it.

> **NOTE**: [PEP8](https://peps.python.org/pep-0008/) is the main standard for Python code formatting.

#### Working with numbers

There are three different types of numbers in Python:
* Integers;
* Floating-point numbers;
* Complex numbers (we are not going to provide examples of how to work with complex numbers in this activity)

This is an integer in Python:

In [None]:
10

10

Remember that variables have a type and a value.

We can check the type of a value by using the `type` function (we will use this function several times over the two days).

In [None]:
type(10)

int

> **Side note:** As you might have noticed, we use functions by writing their names directly followed by parentheses. We will see them in detail tomorrow (even learning how to build our own!), but for now it is enough to know that they take input values specified inside the parentheses (`10` in this case) and output some new information about them. In our case, the function `type` outputs the type of the value we provide to it. We will revisit functions many times throughout the tutorial, so don't worry about them for now.

> **NOTE:** Numbers in Python can be mainly integers (represented by `int`) or decimal (a.k.a. floating-point, `float`).
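As a small illustration, the sketch below asks for the type of one value of each numeric kind. The variable names are our own, and the complex number is included only for completeness. Because we use `print` here, the types are shown as `<class 'int'>` and so on, rather than the shorter `int` displayed when `type(...)` is the last line of a cell.

```python
genome_length_bases = 29903   # an integer
genome_length_kb = 29.903     # a floating-point number
some_complex_number = 2 + 3j  # a complex number, written with the j suffix

print(type(genome_length_bases))
print(type(genome_length_kb))
print(type(some_complex_number))
```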

##### Arithmetic operations

Like a calculator, Python can also be used to carry out many different arithmetic operations:

| **Operator** | **Description** | **Example** |
|------|------|------|
| `+` | Add the values | `7 + 2` gives `9` |
| `-` | Subtract the right value from the left value | `7 - 2` gives `5` |
| `*` | Multiply the left and right values | `7 * 2` gives `14` |
| `/` | Divide the left value by the right value | `7 / 2` gives `3.5` |
| `//` | Integer division | `7 // 2` gives `3` |
| `%` | Modulus - returns any remainder | `7 % 2` gives `1` |
| `**` | Exponent (or power of) | `7 ** 2` gives `49` |

(Table was obtained from Hunt, 2021, but similar tables are easily found on the internet)

The following are gene positions on the Wuhan [reference genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/) (with a total size of 29,903 bp). Remember we are looking at an enveloped, positive-sense, single-stranded RNA virus.

| start | end | gene name |
|-------|-----|-----------|
| 25393 | 26220 | ORF3a |
| 26245 | 26472 | E |
| 26523 | 27191 | M |
| 27202 | 27387 | ORF6 |
| 27394 | 27759 | ORF7a |
| 27756 | 27887 | ORF7b |
| 27894 | 28259 | ORF8 |
| 28274 | 29533 | N |
| 29558 | 29674 | ORF10 |

Let's work with ORF3a. What is the gene size given the start and end position in the genome?

In [None]:
26220 - 25393

827

Notice that we must add `1` to the result, since start and end positions are considered part of the gene. For instance, the number of integers from `2` to `5` is `4` (they are `2`, `3`, `4` and `5`) and not `5 - 2 = 3`.

In [None]:
26220 - 25393 + 1

828

How many amino acids will the translated peptide sequence have?

In [None]:
828 / 3

276.0

Almost. If you check the [NCBI identifier](https://www.ncbi.nlm.nih.gov/protein/1796318599) for this protein, the product of ORF3a, you will notice that it actually has 275 aa. We forgot to exclude the stop codon from our calculations!

In [None]:
828 / 3 - 1

275.0

There we go.
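The same calculation can be written with named variables, which makes each step easier to follow. This is just a sketch: the variable names are our own, and we use integer division (`//`) instead of `/` so that the results stay integers rather than becoming floats such as `275.0`.

```python
orf3a_start = 25393
orf3a_end = 26220

# Start and end positions are both part of the gene, hence the + 1.
orf3a_length = orf3a_end - orf3a_start + 1

# Three bases per codon; integer division keeps the result an int.
codon_count = orf3a_length // 3

# The stop codon does not code for an amino acid.
amino_acid_count = codon_count - 1

print(orf3a_length, codon_count, amino_acid_count)
```

This prints `828 276 275`, matching the values obtained step by step above.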

> **NOTE:** Basic arithmetic operations can be carried out in Python with `+`, `-`, `*`, `/`, `**` and others.

Here is one more example: let's calculate ORF3a's length relative to the total genome size (29,903 bp):


In [None]:
26220 - 25393 + 1 / 29903

827.0000334414607

Wait, that is not the fraction we wanted. We actually calculated:

In [None]:
26220 - 25393 + (1 / 29903)

827.0000334414607

As usual in arithmetic, multiplication/division operations are calculated before addition/subtraction. We must use parentheses to specify a different order:

In [None]:
(26220 - 25393 + 1) / 29903

0.027689529478647626

Converting to a percentage value:

In [None]:
((26220 - 25393 + 1) / 29903) * 100

2.7689529478647628

Visually, it seems to make sense:

![SARS-CoV-2 genome (Viralzone EXPASy)](https://viralzone.expasy.org/resources/SARS_genome.png)

Source: [SARS-CoV-2 genome (Viralzone EXPASy)](https://viralzone.expasy.org/resources/SARS_genome.png)

> **NOTE:** Usual operation precedence rules are always followed in Python arithmetic expressions.
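Storing intermediate results in variables is another way to avoid precedence mistakes, because each line performs a single step. The sketch below repeats the percentage calculation this way. The variable names are our own, and `round` is a Python built-in function (not otherwise covered in this section) that we use to keep two decimal places.

```python
genome_length = 29903
orf3a_length = 26220 - 25393 + 1

# One step per line, so no parentheses are needed.
orf3a_fraction = orf3a_length / genome_length
orf3a_percent = round(orf3a_fraction * 100, 2)

print(orf3a_percent)
```

The printed value is `2.77`, the same percentage as above, rounded.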

#### Looking for help

Python provides us with a `help` function that can be called on any object to display information about it.

In [None]:
help(290)

Help on int object:

class int(object)
 |  int([x]) -> integer
 |  int(x, base=10) -> integer
 |  
 |  Convert a number or string to an integer, or return 0 if no arguments
 |  are given.  If x is a number, return x.__int__().  For floating point
 |  numbers, this truncates towards zero.
 |  
 |  If x is not a number or if base is given, then x must be a string,
 |  bytes, or bytearray instance representing an integer literal in the
 |  given base.  The literal can be preceded by '+' or '-' and be surrounded
 |  by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
 |  Base 0 means to interpret the base from the string as an integer literal.
 |  >>> int('0b100', base=0)
 |  4
 |  
 |  Methods defined here:
 |  
 |  __abs__(self, /)
 |      abs(self)
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __ceil__(...)
 |      Ceiling of an Integral retur

In [None]:
help(int) # or help(int())

Help on class int in module builtins:

class int(object)
 |  int([x]) -> integer
 |  int(x, base=10) -> integer
 |  
 |  Convert a number or string to an integer, or return 0 if no arguments
 |  are given.  If x is a number, return x.__int__().  For floating point
 |  numbers, this truncates towards zero.
 |  
 |  If x is not a number or if base is given, then x must be a string,
 |  bytes, or bytearray instance representing an integer literal in the
 |  given base.  The literal can be preceded by '+' or '-' and be surrounded
 |  by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
 |  Base 0 means to interpret the base from the string as an integer literal.
 |  >>> int('0b100', base=0)
 |  4
 |  
 |  Methods defined here:
 |  
 |  __abs__(self, /)
 |      abs(self)
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __ceil__(...)
 |      Ceiling of

In [None]:
help(type(int))

Help on class type in module builtins:

class type(object)
 |  type(object_or_name, bases, dict)
 |  type(object) -> the object's type
 |  type(name, bases, dict) -> a new type
 |  
 |  Methods defined here:
 |  
 |  __call__(self, /, *args, **kwargs)
 |      Call self as a function.
 |  
 |  __delattr__(self, name, /)
 |      Implement delattr(self, name).
 |  
 |  __dir__(self, /)
 |      Specialized __dir__ implementation for types.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __instancecheck__(self, instance, /)
 |      Check if an object is an instance.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __setattr__(self, name, value, /)
 |      Implement setattr(self, name, value).
 |  
 |  __sizeof__(self, /)
 |      Return memory consumption of the type object.
 |  
 |  __subclasscheck__(self, subclass, /)
 |     

However, as Python beginners, you will likely find a lot of confusing information in the output, so alternative ways of searching for help can be more useful.

Whenever you have doubts about something you want to do with Python, or if you don't understand code written somewhere on the internet, you will probably Google it.

For example, let's try "how to define variable python".

One of the first results will show a [Stack Overflow page on "python-variable-declaration"](https://stackoverflow.com/questions/11007627/python-variable-declaration). Stack Overflow is a widely known forum for asking programming-related questions and, since Python is one of the most popular languages out there, chances are high that your question has already been asked and answered by other people.

There are many other useful resources as results in the same Google search. Here are only some examples:
 * [Software Carpentry](https://software-carpentry.org/lessons/index.html)
 * [w3schools](https://www.w3schools.com)
 * [Tutorials Point](https://www.tutorialspoint.com)
 * [Guru99](https://www.guru99.com)
 * [Learn Python](https://www.learnpython.org)

As you get used to programming, you will more often just need to recall a very specific detail about a function or any other object you are using. In these cases, the [official Python documentation](https://docs.python.org/) can be very useful, as it always provides concise but detailed information about all the built-in objects available. It may not always be a good learning resource, but it is certainly a very useful reference text.

#### Strings in Python

The "pieces of text" we are often using in our code are usually called **strings** in the programming realm, since they are understood as sequences (or strings) of characters.

We can define a string in Python using either a single (`'`) or a double quotation mark (`"`):

In [None]:
virus_acronym = "SARS-CoV-2"  # Correct
virus_acronym = 'SARS-CoV-2'  # Correct

Nevertheless, note that a string must be closed with the same kind of quotation mark that opened it. Mixing a double and a single quotation mark is not possible:

In [None]:
virus_acronym = "SARS-CoV-2'

SyntaxError: ignored

The main reason both kinds of quotation marks are allowed is that sometimes we want to use quotes within a string:

In [None]:
dna_synthesis_direction = "5'-AGATGTA-3'"

Let's check the value and the type of this variable:

In [None]:
dna_synthesis_direction

"5'-AGATGTA-3'"

In [None]:
type(dna_synthesis_direction)

str

`str` not only indicates that the variable is pointing to a string, but it also can work as a function that converts an object from one type to a string. Let's convert an integer variable to a string:

In [None]:
ORF3a_start_position = 25393
type(ORF3a_start_position)

int

In [None]:
str(ORF3a_start_position)

'25393'

In [None]:
ORF3a_start_position_as_text = str(ORF3a_start_position)
type(ORF3a_start_position_as_text)

str
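The conversion also works in the opposite direction: as the `help(int)` output shown earlier states, `int` can convert a string representing an integer into a number. The following sketch (with variable names of our own) shows why the distinction matters: `+` adds numbers but concatenates strings.

```python
position_as_text = "25393"
position_as_number = int(position_as_text)

# With integers, + is arithmetic addition.
print(position_as_number + 1)

# With strings, + glues the pieces of text together.
print(position_as_text + "1")
```

The first line printed is `25394`, while the second is `253931`.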

> **NOTE**: Strings represent textual information in programming languages.

Several operations can be carried out with strings. Let's see some examples.

As we said previously, elements in strings are ordered and can be accessed using indices:

In [None]:
virus_strain = "respiratory syndrome coronavirus 2"

In [None]:
virus_strain[0]  # In Python, indexing starts at zero

'r'

We can also extract substrings from our original string. This is the basic syntax:

In [None]:
virus_strain[0:5]  # 'respiratory syndrome coronavirus 2'

'respi'

Note that the initial value is inclusive (`index 0` or 'r') and the stop value is exclusive (`index 5` in this string is 'r').

Let's see what happens if we leave the start or the stop value blank:

In [None]:
virus_strain[5:]

'ratory syndrome coronavirus 2'

In [None]:
virus_strain[:5]

'respi'

Notice that rules regarding `inclusive` and `exclusive` positions are maintained.

Additionally, a third argument can be used to indicate steps:

In [None]:
# virus_strain[From:To:Step]
virus_strain[0:5:2]  # 'respiratory syndrome coronavirus 2'

'rsi'
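Slicing is particularly handy for biological sequences. As a minimal sketch, the cell below splits a short nucleotide sequence into codons. The sequence `dna_fragment` is a made-up 9-base example, not taken from the SARS-CoV-2 genome.

```python
# Hypothetical 9-base sequence, used only for illustration.
dna_fragment = "ATGCCAGTT"

first_codon = dna_fragment[0:3]   # positions 0, 1 and 2
second_codon = dna_fragment[3:6]  # positions 3, 4 and 5
third_codon = dna_fragment[6:]    # position 6 up to the end

# A step of 3 picks the first base of each codon.
first_base_of_each_codon = dna_fragment[0:9:3]

print(first_codon, second_codon, third_codon)
print(first_base_of_each_codon)
```

This prints `ATG CCA GTT` followed by `ACG`.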

Some of the operators we saw previously for integers and floats can also be used with strings. Let's suppose we want to add an acronym to our string.

Creating a new string:

In [None]:
virus_strain_acronym = "(SARS-CoV-2)"

We can concatenate these variables to create a new string:

In [None]:
virus_strain + virus_strain_acronym

'respiratory syndrome coronavirus 2(SARS-CoV-2)'

An important consideration is that strings are not mutable, so we do not expect this operation to change anything in the original strings (`virus_strain` and `virus_strain_acronym`).

In [None]:
virus_strain

'respiratory syndrome coronavirus 2'

In [None]:
virus_strain_acronym

'(SARS-CoV-2)'

We could save this string to a new variable (`full_virus_name`, let's say) if we want to use it later:

In [None]:
full_virus_name = virus_strain + virus_strain_acronym

In [None]:
full_virus_name

'respiratory syndrome coronavirus 2(SARS-CoV-2)'

> **NOTE:** Strings are immutable!

Notice that no space separates the name from the acronym. Whitespaces are also characters in strings, so we have to add one explicitly:

In [None]:
full_virus_name = virus_strain + " " + virus_strain_acronym

In [None]:
full_virus_name

'respiratory syndrome coronavirus 2 (SARS-CoV-2)'

Similarly, it is not possible to change a given character in a string:

In [None]:
virus_strain[0]

'r'

In [None]:
virus_strain[0] = 'R'

TypeError: ignored

Strings can also be repeated with the `*` operator:

In [None]:
virus_strain * 3

'respiratory syndrome coronavirus 2respiratory syndrome coronavirus 2respiratory syndrome coronavirus 2'

Several functions are available to manipulate strings (see the language documentation for [text sequences](https://docs.python.org/3/library/stdtypes.html#string-methods)). Notice that the way we use them is slightly different, with a dot (`.`) separating the function from the target string.

> **Side note:** These functions we use with dots receive a different name - they are called **methods**, as we will better explain tomorrow.

In [None]:
virus_strain.upper()

'RESPIRATORY SYNDROME CORONAVIRUS 2'

In [None]:
virus_strain.capitalize()

'Respiratory syndrome coronavirus 2'
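A few other string methods are especially convenient for sequence data. The sketch below uses `count`, `find` and `replace`, all standard `str` methods, on a made-up lowercase sequence (`dna_fragment` is hypothetical and used only for illustration).

```python
# Hypothetical sequence, used only for illustration.
dna_fragment = "atgccagtt"

dna_upper = dna_fragment.upper()              # convert to uppercase
thymine_count = dna_upper.count("T")          # how many times "T" occurs
cca_position = dna_upper.find("CCA")          # index where "CCA" starts
rna_fragment = dna_upper.replace("T", "U")    # swap every T for a U

print(dna_upper, thymine_count, cca_position, rna_fragment)
```

This prints `ATGCCAGTT 3 3 AUGCCAGUU`. Since strings are immutable, each method returns a new string and `dna_fragment` itself is left unchanged.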

We have not learned what a list is yet, but here is another option for concatenating our two strings using whitespace as the separator:

In [None]:
' '.join([virus_strain, virus_strain_acronym])

'respiratory syndrome coronavirus 2 (SARS-CoV-2)'

In [None]:
full_virus_name = ' '.join([virus_strain, virus_strain_acronym])

In [None]:
full_virus_name

'respiratory syndrome coronavirus 2 (SARS-CoV-2)'

#### `print` and some other built-in functions

Let's define some variables.

In [None]:
virus1 = 'alpha'
virus2 = 'beta'
virus3 = 'delta'

Now, let's check their values:

In [None]:
virus1

'alpha'

Maybe checking two variables in the same cell would be nice, to avoid clicking "run" so many times:

In [None]:
virus1
virus2

'beta'

Notice, however, that only the value of the last line is displayed as the output of a cell. One way to solve this problem is to use a function called `print`.

In [None]:
print(virus1)
print(virus2)

alpha
beta


`print` can take as many values as you want. Pass them separated by commas, and `print` will output them in the order you specified, using a whitespace as the separator.

In [None]:
print(virus1, virus2)

alpha beta


In [None]:
ORF3a_start_position = 25393
str(ORF3a_start_position)

'25393'

In [None]:
print("ORF3a starts in position", ORF3a_start_position, "of the SARS-CoV-2 reference genome")

ORF3a starts in position 25393 of the SARS-CoV-2 reference genome



> **NOTE**: `print` can be useful to force cells into showing the output we desire.
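The `str` conversion shown above becomes necessary when we build the message ourselves with `+`, because a string and an integer cannot be concatenated directly (Python raises a `TypeError`). The sketch below, in which the variable `message` is our own, contrasts the two approaches.

```python
ORF3a_start_position = 25393

# Concatenation with + requires every piece to be a string.
message = "ORF3a starts in position " + str(ORF3a_start_position)
print(message)

# print accepts mixed types and inserts the separating whitespace itself.
print("ORF3a starts in position", ORF3a_start_position)
```

Both calls print the same line: `ORF3a starts in position 25393`.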

<div id='m2' />

### M2: Logical operations and additional data structures
[Back to table of contents](#table-of-contents)

**Estimated study load**: 50 minutes

**Learning objectives**
* Introduce booleans in Python
* Introduce sequences and compare lists and tuples with strings
* Introduce `for` loops and control structures with `if`
* Introduce dictionaries and a comparative analysis between sequence indexing and dictionary keys

#### Booleans in Python

**Boolean** variables are variables that represent either validity or invalidity of an expression. They hold the information either that something is true, or that something is false. Only these two values are possible, and often will result from logical expressions that we will see ahead. These two possible values are represented in Python by:
 * `True`
 * `False`


In [None]:
False

False

In [None]:
python_is_awesome = True
print(python_is_awesome)

Notice that the boolean values are not textual information, and thus are not surrounded by quotes as if they were strings. They are reserved keywords of Python to represent those specific on/off, true/false, yes/no meanings. As such, Python will raise an exception if we try to use them as variable names, for example.

In [None]:
'False'  # This is not a boolean value! It's a string.

'False'

In [None]:
False = "We don't like bioinformatics."

SyntaxError: ignored

> **Side note:** Since Python is case-sensitive, make sure to write these tokens capitalised. Deviations in most cases will cause Python to interpret the word as an undefined variable name, generating errors.

In [None]:
FALSE

NameError: ignored

In [None]:
python_is_awesome = true

NameError: ignored

But how would we use boolean variables? As we mentioned, they often come up as results of logical expressions.

Several operators can be used to test if a condition is `True`. For instance, let's first remember we previously assigned the value `29000` to our variable `coronavirus_genome_size`. We did it like so:

In [None]:
coronavirus_genome_size = 29000

Let's convert this number of bases to kilobases (kb).

In [None]:
coronavirus_genome_size_kb = 29000 / 1000
coronavirus_genome_size_kb

29.0

Are these viral RNA genomes large?

A [recent study](https://doi.org/10.1016/j.tibs.2021.05.006) analyzing factors involved in the increase of genome size lists genome sizes for several virus families:

 * Flaviviridae (e.g., DENV-1): 10-11 kb
 * Arteriviridae (e.g., EAV): 12-16 kb
 * Coronaviridae (e.g., SARS-CoV-2): 21-32 kb
 * Mononiviridae (e.g., PSCNV): > 41 kb

In [None]:
DENV1 = 10.5
EAV = 14
PSCNV = 41.0

Assuming the values assigned here for different viruses, we can make comparisons to ask whether the SARS-CoV-2 genome is larger than some of these other viruses:

In [None]:
coronavirus_genome_size_kb > DENV1

True

In [None]:
coronavirus_genome_size_kb > EAV

True

In [None]:
coronavirus_genome_size_kb > PSCNV

False

In [None]:
SARSCoV2_variant1 = 29
SARSCoV2_variant2 = 29

There are also the comparison operators `==`, `>=` and `<=` to represent "equal to", "greater than or equal to" and "less than or equal to", respectively.

In [None]:
SARSCoV2_variant1 == SARSCoV2_variant2

True

In [None]:
SARSCoV2_variant1 >= SARSCoV2_variant2

True

In [None]:
SARSCoV2_variant1 <= SARSCoV2_variant2

True

In [None]:
SARSCoV2_variant1 = 29.01
SARSCoV2_variant2 = 29.03

In [None]:
SARSCoV2_variant1 == SARSCoV2_variant2

False

> **Side note:** a common mistake we sometimes make even after gaining some programming experience is to write `=` (variable assignment) when we meant `==` (comparison). If we had written ```SARSCoV2_variant1 = SARSCoV2_variant2``` in the cell above, we would have redefined the first variable to have the same value as the second!

In [None]:
SARSCoV2_variant1 >= SARSCoV2_variant2

False

In [None]:
SARSCoV2_variant1 <= SARSCoV2_variant2

True

> **NOTE**: Booleans are results of comparisons.
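Because a comparison produces a boolean, its result can be stored in a variable like any other value. The following is a minimal sketch; the variable names are illustrative and are not part of the notebook's state.

```python
# Illustrative genome sizes in kilobases (kb); the names are hypothetical.
sars_cov_2_kb = 29.0
denv1_kb = 10.5

# The comparison is evaluated first; its boolean result is then assigned.
is_larger = sars_cov_2_kb > denv1_kb

print(is_larger)        # True
print(type(is_larger))  # <class 'bool'>
```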

Evaluations can get very complicated. We will not go deep into them in this course, but let's see briefly some other operators that can be combined to answer more complex questions:

In [None]:
# Using 'and', for the output to be True, the results of the two tests MUST be true:
(DENV1 < coronavirus_genome_size_kb) and (EAV < coronavirus_genome_size_kb)

True

In [None]:
(DENV1 < coronavirus_genome_size_kb) and (EAV > coronavirus_genome_size_kb)

False

In [None]:
print(DENV1 < coronavirus_genome_size_kb)
print(EAV > coronavirus_genome_size_kb)

True
False


In [None]:
# Using 'or', for the output to be True, the results of AT LEAST ONE of
# the two tests MUST be true:
(DENV1 < coronavirus_genome_size_kb) or (EAV < coronavirus_genome_size_kb)

True

In [None]:
(DENV1 < coronavirus_genome_size_kb) or (EAV > coronavirus_genome_size_kb)

True

In [None]:
print(DENV1 < coronavirus_genome_size_kb)
print(EAV > coronavirus_genome_size_kb)

True
False


In [None]:
(DENV1 > coronavirus_genome_size_kb) or (EAV > coronavirus_genome_size_kb)

False
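Two further tools fit naturally next to `and` and `or`: the `not` operator, which inverts a boolean, and chained comparisons, which test whether a value lies between two others. The sketch below repeats the genome sizes under hypothetical names so that it runs on its own.

```python
# Genome sizes in kb, repeated here so the cell is self-contained.
sars_cov_2_kb = 29.0
denv1_kb = 10.5
pscnv_kb = 41.0

# 'not' inverts a boolean value.
not_larger_than_pscnv = not (sars_cov_2_kb > pscnv_kb)
print(not_larger_than_pscnv)  # True

# A chained comparison is shorthand for two comparisons joined by 'and'.
is_between = denv1_kb < sars_cov_2_kb < pscnv_kb
print(is_between)  # True
```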

#### Sequences in Python (we will focus on lists now)

Python sequences are objects ordered by position.
They are iterable, meaning it is possible to get their elements one at a time.

Python sequences include:
 * Strings (i.e., we can read each character at a time) - strings were previously studied and are considered 'text sequences'
 * Lists
 * Tuples
 * Ranges


Besides continuing to learn about strings, in the present tutorial we will solely focus on **lists**, since we consider it the most flexible and general-purpose Python sequence type. This is what a Python list looks like:

In [None]:
sarscov2_proteins = ["spike protein", "envelope protein", "membrane protein", "nucleocapsid protein"]

A good practice stated in PEP 8 is to limit lines to a maximum of 79 characters. Let's split our list declaration across several lines:

In [None]:
sarscov2_proteins = ["spike protein",
                     "envelope protein",
                     "membrane protein",
                     "nucleocapsid protein"]

In [None]:
sarscov2_proteins

['spike protein',
 'envelope protein',
 'membrane protein',
 'nucleocapsid protein']

Like in string manipulation, we can access elements by index:

In [None]:
sarscov2_proteins[1]

'envelope protein'

However, unlike strings, lists are mutable, so we can alter their elements.

In [None]:
sarscov2_proteins[1] = "Envelope protein"

In [None]:
sarscov2_proteins

['spike protein',
 'Envelope protein',
 'membrane protein',
 'nucleocapsid protein']

What if we wanted to alter all elements in the list to have the initial letter capitalized?

1. Remember we are dealing with strings (each element of our list is an independent string)
2. Since we have strings, we can use string methods to reach our objectives

In [None]:
sarscov2_proteins[0].capitalize()

'Spike protein'

In [None]:
sarscov2_proteins[0] = sarscov2_proteins[0].capitalize()

In [None]:
sarscov2_proteins[0] = sarscov2_proteins[0].capitalize()
sarscov2_proteins[1] = sarscov2_proteins[1].capitalize()
sarscov2_proteins[2] = sarscov2_proteins[2].capitalize()
sarscov2_proteins[3] = sarscov2_proteins[3].capitalize()

In [None]:
sarscov2_proteins

['Spike protein',
 'Envelope protein',
 'Membrane protein',
 'Nucleocapsid protein']

Even though this option is much easier than changing each element by assigning a new value individually, it is still a tedious task when the list contains hundreds or thousands of elements.

We are now going to learn a new concept - the `for` loop. As we said at the beginning of this section, sequences are iterable. Let's see what it means.

In [None]:
# PEP8 expects four whitespaces in indentation
for protein in sarscov2_proteins:
    print(protein)

Spike protein
Envelope protein
Membrane protein
Nucleocapsid protein


Under the `for` statement, we can add whatever we want to do for each element in our list, as long as the code we add is aligned with the same indentation:

In [None]:
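# Reset the list so the effect of capitalize() is visible below
sarscov2_proteins = ["spike protein", "Envelope protein",
                     "membrane protein", "nucleocapsid protein"]
for protein in sarscov2_proteins:
    print(protein)
    print(protein.capitalize())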

spike protein
Spike protein
Envelope protein
Envelope protein
membrane protein
Membrane protein
nucleocapsid protein
Nucleocapsid protein


There is a function called `enumerate` that can be used to iterate through elements in our list and recover the index and value associated with each list position:

In [None]:
for position, protein in enumerate(sarscov2_proteins):
    print(position)
    print(protein)
    print(protein.capitalize())


0
spike protein
Spike protein
1
Envelope protein
Envelope protein
2
membrane protein
Membrane protein
3
nucleocapsid protein
Nucleocapsid protein


We can finally assign new values to each position:

In [None]:
for position, protein in enumerate(sarscov2_proteins):
    print(position)
    print(protein)
    print(protein.capitalize())
    sarscov2_proteins[position] = protein.capitalize()

0
spike protein
Spike protein
1
Envelope protein
Envelope protein
2
membrane protein
Membrane protein
3
nucleocapsid protein
Nucleocapsid protein


In [None]:
sarscov2_proteins

['Spike protein',
 'Envelope protein',
 'Membrane protein',
 'Nucleocapsid protein']
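The loop above changes the list in place. When the original list should stay untouched, one option is to collect the results in a new list with the list method `append`, which this tutorial has not introduced so far. A minimal sketch with hypothetical variable names:

```python
# A short, hypothetical list; it is left unchanged by the loop below.
original_names = ["spike protein", "envelope protein"]

# Start with an empty list and add one capitalized name per iteration.
capitalized_names = []
for name in original_names:
    capitalized_names.append(name.capitalize())

print(original_names)     # ['spike protein', 'envelope protein']
print(capitalized_names)  # ['Spike protein', 'Envelope protein']
```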

Since strings are also iterable, we can loop through their characters in the same way we looped through list elements:

In [None]:
full_virus_name = "respiratory syndrome coronavirus 2 (SARS-CoV-2)"
full_virus_name

In [None]:
for letter in full_virus_name:
    print(letter)

r
e
s
p
i
r
a
t
o
r
y
 
s
y
n
d
r
o
m
e
 
c
o
r
o
n
a
v
i
r
u
s
 
2
 
(
S
A
R
S
-
C
o
V
-
2
)


But remember: you can't change elements in strings because they are immutable!

What if we wanted to check whether a particular object is on our list?

For instance, suppose we wanted to check if an accessory protein, ORF8, is in the list of structural proteins:

In [None]:
sarscov2_proteins

['Spike protein',
 'Envelope protein',
 'Membrane protein',
 'Nucleocapsid protein']

The same keyword `in` we used at the very beginning comes in handy again:

In [None]:
"ORF8 protein" in sarscov2_proteins

False

In [None]:
"spike protein" in sarscov2_proteins

False

It is important to remember that Python is case-sensitive. Tests like this are also case-sensitive and return `True` only if the strings match exactly. We previously capitalized the first letter in each element:

In [None]:
"Spike protein" in sarscov2_proteins

True
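When letter case should not matter, a common approach is to convert both sides to lower case before testing membership. The sketch below assumes a small list with hypothetical variable names and uses the list method `append` to build a lower-case copy.

```python
proteins = ["Spike protein", "Envelope protein",
            "Membrane protein", "Nucleocapsid protein"]
query = "spike protein"

# Build a lower-case copy of the list, leaving the original unchanged.
proteins_lower = []
for protein in proteins:
    proteins_lower.append(protein.lower())

exact_match = query in proteins                       # case-sensitive test
case_insensitive_match = query.lower() in proteins_lower

print(exact_match)             # False
print(case_insensitive_match)  # True
```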

As we saw previously, tests like these return a boolean, `True` or `False`, depending on the result of the operation for the given inputs.

We will see more loops soon!

#### Control flow using `if` statements


When we need our code to make decisions based on the results of evaluations in our code, we use control flow. The keyword for doing so in Python is `if`.

Let's make a statement that checks whether some of the genomes we discussed so far are smaller than that of SARS-CoV-2.

In [None]:
coronavirus_genome_size = 29.0
DENV1_genome_size = 10.5
EAV_genome_size = 14
PSCNV_genome_size = 41.0

In [None]:
if DENV1_genome_size < coronavirus_genome_size:
    print("Genome of DENV1 is smaller than of SARS-CoV-2")

Genome of DENV1 is smaller than of SARS-CoV-2


In [None]:
DENV1_genome_size < coronavirus_genome_size

True

In [None]:
if PSCNV_genome_size < coronavirus_genome_size:
    print("Genome of PSCNV is smaller than of SARS-CoV-2")

In [None]:
PSCNV_genome_size < coronavirus_genome_size

False

The `if` statement can be paired with an `else` clause, which specifies what to do when the condition evaluates to `False`.

In [None]:
if DENV1_genome_size > coronavirus_genome_size:
    print("Genome of DENV1 is larger than of SARS-CoV-2")
else:
    print("Genome of DENV1 is smaller than of SARS-CoV-2")

Genome of DENV1 is smaller than of SARS-CoV-2
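An `if`/`else` pair covers only two outcomes, so two genomes of equal size would be reported as "smaller" by the cell above. A third branch can be added with `elif` ("else if"). A minimal sketch, using hypothetical values and names:

```python
# Hypothetical genome sizes in kb, chosen to be equal on purpose.
reference_kb = 29.0
other_kb = 29.0

# Branches are tested from top to bottom; only the first match runs.
if other_kb > reference_kb:
    verdict = "larger than"
elif other_kb == reference_kb:
    verdict = "the same size as"
else:
    verdict = "smaller than"

print(f"The other genome is {verdict} the reference.")
```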


Control flow can be incorporated in `for` loops. For example, we can check whether each of the values in a list of genome sizes is smaller than the SARS-CoV-2 genome size:

In [None]:
genome_sizes = [DENV1_genome_size,
                PSCNV_genome_size,
                EAV_genome_size]

In [None]:
for genome_size in genome_sizes:
    print(genome_size)

10.5
41.0
14


In [None]:
for genome_size in genome_sizes:
    if genome_size < coronavirus_genome_size:
        print(f'{genome_size} is smaller than {coronavirus_genome_size}')
    else:
        print(f'{genome_size} is bigger than {coronavirus_genome_size}')

10.5 is smaller than 29.0
41.0 is bigger than 29.0
14 is smaller than 29.0


Here we only have a list with the genome sizes. Using only lists, there is no simple way to also print which virus each genome size belongs to. A natural way of doing so is to use another data structure that we now introduce, dictionaries:

In [None]:
genome_sizes_dict = {'DENV1': 10.5, 'EAV': 14, 'PSCNV': 41.0}
                     
for virus, size in genome_sizes_dict.items():
    if size < coronavirus_genome_size:
        print(f"{virus}'s genome is smaller than coronavirus'.")
    else:
        print(f"{virus}'s genome is bigger than coronavirus'.")

DENV1's genome is smaller than coronavirus'.
EAV's genome is smaller than coronavirus'.
PSCNV's genome is bigger than coronavirus'.


#### Dictionaries in Python
Dictionaries are collections of paired elements. Each pair is an **item**. The first element of each pair is an immutable object (such as a string or a number, but not a list) that serves as the item's identifier. We call it the **key** of the pair. The second element is the actual **value** we are interested in storing.

 * Keys: immutable identifiers;
 * Values: objects to store;
 * Items: combined key-value pairs.




In [None]:
# Dictionary anatomy:
{"first key": "first value",    # First item
 "second key": "second value",  # Second item
 "third key": "third value"}    # Third and final item

# You could also write it in a single line, if not too difficult to read:
{"first key": 100, "second key": 200, "third key": 300}

But let's revisit our very first example to demonstrate how useful dictionaries can be.

In [None]:
{'alpha': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV',  # See whitespaces, PEP8 convention (https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements)
 'beta': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV',
 'gamma': 'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT',
 'delta': 'YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG'}

We put all RBD segments we found grouped together in a single dictionary, identifying each segment as we previously did with variables. We can create a single variable to store our whole dictionary!

In [None]:
rbd_segment = {'alpha': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV',
               'beta': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV',
               'gamma': 'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT',
               'delta': 'YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG'}

We can access the values from each item with the notation: `dict[key]`.

For example, to get the RBD sequence of the Gamma variant, we run:

In [None]:
rbd_segment['gamma']

'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT'
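Asking for a key that is not in the dictionary with the `dict[key]` notation raises a `KeyError`. Two common ways to avoid that are to test membership with `in` first, or to use the `dict.get()` method, which returns `None` (or a default of our choice) for missing keys. The dictionary below is a shortened, hypothetical stand-in for `rbd_segment`.

```python
# Shortened, hypothetical sequences; only the keys matter here.
rbd_demo = {"alpha": "RLFRKSNLKP", "beta": "RLFRKSNLKP"}

# 'in' applied to a dictionary tests its keys.
has_gamma = "gamma" in rbd_demo
print(has_gamma)  # False

# get() returns None for a missing key instead of raising KeyError.
missing = rbd_demo.get("gamma")
print(missing)  # None

# A second argument sets the value returned for missing keys.
fallback = rbd_demo.get("gamma", "not recorded")
print(fallback)  # not recorded
```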

To add a key/value pair to the dict, we use the following syntax, very similar to how we use variables:

In [None]:
rbd_segment['omicron'] = "RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV"

Let's check our dictionary again:

In [None]:
rbd_segment

{'alpha': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV',
 'beta': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV',
 'delta': 'YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG',
 'gamma': 'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT',
 'omicron': 'RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV'}

If you notice, dictionaries work a lot like groups of variables, where each variable's name is associated with its value. But using dictionary items instead has some key advantages over defining multiple variables:

 * **Dictionaries are iterable:** We can easily repeat the same procedure for all items;
 * **They are more flexible:** Dictionary keys are much less restricted than variables' names. They can be strings (with any whitespace or unusual character) or even floats, for example;
 * **They can be automated:** Dictionary items can be created programmatically, in a `for` loop, for instance. We cannot create variables with different names in a `for` loop.
 
**Side note:** We actually can, but it is highly discouraged in general and much more complicated, way beyond the scope of this course.
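To illustrate the last point, the sketch below fills an initially empty dictionary inside a `for` loop, creating one item per variant. The sequences are shortened and the variable names are hypothetical.

```python
# Shortened, hypothetical segments keyed by variant name.
segments = {"alpha": "RLFRKSNLKP", "delta": "YRLFRKSNLKPFE"}

# Each iteration creates a new item: variant name -> segment length.
segment_lengths = {}
for variant, segment in segments.items():
    segment_lengths[variant] = len(segment)

print(segment_lengths)  # {'alpha': 10, 'delta': 13}
```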
 

Now, look how simply we can solve the introductory problem, using the same idea as before:

In [None]:
rbd_segment  # In case you need a refresher on what our dict looks like.

{'alpha': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV',
 'beta': 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV',
 'delta': 'YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG',
 'gamma': 'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT',
 'omicron': 'RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV'}

In [None]:
for variant, segment in rbd_segment.items():
    if segment in sequenced_protein:
        print(f"The sample possibly contains the {variant.capitalize()} "
               "variant's genome!")

The sample possibly contains the Delta variant's genome!


Using the `dict.items()` method, we can access each key and value for each item. They will be stored in the variables defined in the loop statement. That is:

In [None]:
my_dict = {"first key": 100, "second key": 200, "third key": 300}

for key, value in my_dict.items():
    print(f"My key is '{key}'. My value is {value}.")

My key is 'first key'. My value is 100.
My key is 'second key'. My value is 200.
My key is 'third key'. My value is 300.


There are several useful functions and methods like `dict.items()` to deal with dictionaries. Here are some examples:

In [None]:
# Return the number of items in the dictionary
len(rbd_segment)

5

In [None]:
# Return the keys of the dictionary
rbd_segment.keys()

dict_keys(['alpha', 'beta', 'gamma', 'delta', 'omicron'])

Notice that the output above is in an unusual format (a `dict_keys` view object), but don't mind it in this course. We can convert it to a list for easier interpretation.

In [None]:
# Return list of all keys
list(rbd_segment.keys())

['alpha', 'beta', 'gamma', 'delta', 'omicron']

In [None]:
# Return list of all values
list(rbd_segment.values())

['RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV',
 'RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV',
 'YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT',
 'YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG',
 'RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV']

The values of the `rbd_segment` were strings. But remember that dictionary values in Python can be of any type.

Just as an example, let's create a dictionary of lists, with the RBD sequence as the first item of the list, and the variant ID as the second item:

In [None]:
rbd_segment_id = {'alpha': ['RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV', 'B.1.1.7'],
                  'beta': ['RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV', 'B.1.351'],
                  'gamma': ['YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT', 'P1'],
                  'delta': ['YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG', 'B.1.617.2']}

In [None]:
rbd_segment_id['delta']

['YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG', 'B.1.617.2']

Notice that the result of the above is a `list`. We can access the values of this list by its index, using the `[]` notation.

> Remember that the first item has an index of 0.

In [None]:
# Return the ID of the gamma variant
rbd_segment_id['gamma'][1]

'P1'
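The same indexing works inside a loop over `dict.items()`, which gives access to both elements of every list in turn. A minimal sketch with shortened, hypothetical sequences:

```python
# Each value is a list: [sequence, variant ID]. Sequences are shortened.
variants = {"alpha": ["RLFRK", "B.1.1.7"],
            "delta": ["YRLFRKS", "B.1.617.2"]}

for name, record in variants.items():
    sequence = record[0]    # first element of the list
    variant_id = record[1]  # second element of the list
    print(f"{name}: ID {variant_id}, {len(sequence)} residues")
```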

Here is a more complicated example so you can better visualize all the possibilities:

In [None]:

complicated_dictionary = {
    'First key': 'My first value!',  # This is the first item
    'Second key': 222222,            # This is the second item
    'Third...': ['A whole list could be here.', 3, 'Three!'],
    4: 'Fourth key is an integer!',
    5.023: "I'm item 5. My key is a float!",
    'Sixth': {"Wait..": "Another dictionary!"}
}

<div id='m3' />

### M3: Loading content from files
[Back to table of contents](#table-of-contents)

**Estimated study load**: 50 minutes

**Learning objectives**

* File input/output

Using small chunks of information at a time makes it easy to just create lists, dictionaries or strings as we have been doing. But as the information amount grows, it becomes much more convenient to store this data separately from our program, in their own **files**. Take for instance the DNA sequences at the beginning of this tutorial that create very long cumbersome lines of code. They would greatly benefit from being stored in FASTA files.

We are constantly working with sequence files in Bioinformatics - sequences downloaded from NCBI databases, from organism-specific sources, and this is usually how we store information locally, for further analysis. Therefore, one of the important things we're going to learn today is how to read and write files (input/output).

At this point, please go to the repository https://github.com/SantosRAC/intro_python_ismb2022, download the following files and upload them to this notebook by clicking `Browse` after running the next cell.

* File `Wuhan-Hu-1_19A.fasta` is a FASTA file containing a spike protein sequence of the reference strain;
* `spike_proteins.fasta` contains additional FASTA sequences, that correspond to other virus variants.

In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
open('Data/Wuhan-Hu-1_19A.fasta', 'r')

`r` means that you are opening the file for reading. Other modes are `w`, which erases any existing content and writes to the file, and `a`, which appends new information to the end of it. The `r+` mode opens the file for both reading and writing.

If only the file path is passed, with no mode argument, Python assumes we are opening the file for reading.

In [None]:
open('Data/Wuhan-Hu-1_19A.fasta')

As with other Python objects, we can assign the file object returned by `open` to a variable. Note that the variable holds a handle to the open file, not the file's content.

In [None]:
ref_sequence_file = open('Data/Wuhan-Hu-1_19A.fasta', 'r')

In [None]:
ref_sequence_file

Lines in text files (like in FASTA format) are usually read as `strings`.

A good practice when reading files (or lines in text files) is to use the `with` keyword (see link at the bottom of this notebook). As you saw a few lines before, we have to `open` a file and it is important to ensure it is closed after reading. The `with` statement takes charge of doing it properly when its inner block of code ends.

In [None]:
ref_sequence_file.closed

In [None]:
ref_sequence_file.close()

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    file = ref_sequence_file.read()
    print(file)
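The file modes and the closing behaviour of `with` can be tried out safely on a throwaway file. The sketch below uses the standard `tempfile` and `os` modules, which this tutorial has not introduced at this point; the file name and its content are hypothetical.

```python
import os
import tempfile

# Work inside a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta")  # hypothetical file name

    # 'w' creates the file (or erases it if it already exists).
    with open(path, "w") as handle:
        handle.write(">example_sequence\n")

    # 'a' keeps the existing content and adds to the end.
    with open(path, "a") as handle:
        handle.write("MFVFLVLLPLVSSQ\n")

    # No mode given: the file is opened for reading.
    with open(path) as handle:
        content = handle.read()

    # Once the inner 'with' block has ended, the file is closed.
    closed_after_block = handle.closed

print(content)
print(closed_after_block)  # True
```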

We learned previously many functions and methods that can be used to manipulate `strings`.

Now, we are going to convert the multiple lines representing the sequence of one single SARS-CoV-2 variant into a single string. For this purpose, we must:

 * Create a variable to store a string (our virus sequence)
 * Read each line of the file
  * For each line, we have to check if it starts with a `>`
   * If it does not, then we have to add that line to a variable we've just created

Now we are going to see additional methods:
 * `startswith`
 * `strip`

For this, we will practice `for` loops and `conditionals`.

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())

After Python opens the file, each call to `readline` returns the next line in that file, which we print here. This continues until the file is closed, which happens automatically when the block of code inside the `with` statement finishes.


In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    print(ref_sequence_file.readline())
    print(ref_sequence_file.readline())

The file object `ref_sequence_file` we created can be iterated through with a `for` loop, similarly to how we did earlier with lists, dictionaries and strings, to analyze each line of the file as a string:

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file:
        if not line.startswith('>'):
            print(line)


Each line read from a file ends with a newline character, and `print` adds another one. There are at least two methods that can be used to remove the trailing newline from each line: `strip` and `replace`. We already learned how to use `replace`. Let's try the former.

In [None]:
with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file:
        if not line.startswith('>'):
            print(line.strip('\n'))
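The effect of both methods is easier to see on a single string with `repr()`, which shows invisible characters such as `\n` explicitly. The line below is a hypothetical example, not taken from the FASTA file.

```python
line = "MFVFLVLLPLVSSQ\n"  # hypothetical sequence line with a newline

stripped = line.strip('\n')       # removes newlines at both ends
replaced = line.replace('\n', '') # removes every newline in the string

print(repr(line))      # 'MFVFLVLLPLVSSQ\n'
print(repr(stripped))  # 'MFVFLVLLPLVSSQ'
print(repr(replaced))  # 'MFVFLVLLPLVSSQ'

# Without arguments, strip() removes any surrounding whitespace.
header = "  >example_id \n".strip()
print(repr(header))    # '>example_id'
```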

We are almost there!

Now, let's go back to the first step in our algorithm and create a variable able to store this information:

In [None]:
wuhan_variant_seq = ''

with open('Data/Wuhan-Hu-1_19A.fasta', 'r') as ref_sequence_file:
    for line in ref_sequence_file:
        if not line.startswith('>'):
            wuhan_variant_seq = wuhan_variant_seq + line.strip('\n')

In [None]:
wuhan_variant_seq

It works!

The code above is already working really well for files containing a single sequence. However, to read multi-sequence FASTA files, some refinements are needed.

In the multi-FASTA format, different sequences are separated with description lines (those starting with the `>` character), each description line marking the beginning of a new sequence and providing an identifier and a short description text to it. Therefore, each time we encounter a line starting with `>` we must extract the sequence's name (identifier) from it and proceed to record the following lines as the content of that sequence specifically. 

To store each sequence's content separately, we assign them to different entries of a dictionary, as we did before.

In [None]:
# Create an empty dictionary and store it in a variable called spike_proteins
spike_proteins = {}

# Open the fasta file and store it in the spike_proteins_fasta variable
with open('Data/spike_proteins.fasta') as spike_proteins_fasta:
    # Iterate across the file lines
    for line in spike_proteins_fasta:
        # If the current line is a description line
        if line.startswith('>'):
            # Get the variant's name from the description line
            variant_name = line.strip(' >\n').split()[0]
            # Create a new dictionary item with variant_name as the key and an
            # empty string as the value
            spike_proteins[variant_name] = ''
        else:
            # Append the sequence piece to the corresponding dictionary item
            spike_proteins[variant_name] += line.strip()

Now we can read multi-FASTA files as well!
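To see what the resulting dictionary looks like without needing the file, the same logic can be run on a list of strings that mimics the lines of a small multi-FASTA file. The records below are hypothetical and much shorter than real spike proteins.

```python
# Hypothetical lines, as they would be read from a multi-FASTA file.
fasta_lines = [
    ">variant_A first example record\n",
    "MFVF\n",
    "LVLL\n",
    ">variant_B second example record\n",
    "MFIF\n",
]

sequences = {}
for line in fasta_lines:
    if line.startswith('>'):
        # Keep only the first word of the description as the key.
        name = line.strip(' >\n').split()[0]
        sequences[name] = ''
    else:
        # Join the sequence pieces that belong to the current record.
        sequences[name] += line.strip()

print(sequences)  # {'variant_A': 'MFVFLVLL', 'variant_B': 'MFIF'}
```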

But what if several cells of code later we realize there are some more files we need to read? Based on what we've been doing, one might be tempted to simply copy-paste the cell above. However, having to do this multiple times can get highly impractical. Tomorrow we will revisit this issue in more detail and provide ways for you to get around it.

## Second day - July 7th, 2022

<div id='m4' />

### M4: Functions
[Back to table of contents](#table-of-contents)

**Estimated study load**: 80 minutes

**Learning objectives**
* Functions and modules

Note that in previous cells we used a lot of repeated, copy-pasted code. Revisiting them you will notice a lot of very similar cells, differing only regarding a few lines, a file path, or maybe a variable's name.

Consider the case in which you want to read several fasta files, saving each one's data into its variable, uniquely named. Following the previous cells, one might be tempted to do this as follows:

In [None]:
envelope_fasta = 'Data/envelope_protein.fasta'
nucleocapsid_fasta = 'Data/nucleocapsid_phosphoprotein.fasta'
surface_fasta = 'Data/surface_glycoprotein.fasta'

proteins = {}

with open(envelope_fasta) as proteins_fasta:
    for line in proteins_fasta:
        if line.startswith('>'):
            variant_name = line.strip(' >\n').split()[0]
            proteins[variant_name] = ''
        else:
            proteins[variant_name] += line.strip()

with open(nucleocapsid_fasta) as proteins_fasta:
    for line in proteins_fasta:
        if line.startswith('>'):
            variant_name = line.strip(' >\n').split()[0]
            proteins[variant_name] = ''
        else:
            proteins[variant_name] += line.strip()

with open(surface_fasta) as proteins_fasta:
    for line in proteins_fasta:
        if line.startswith('>'):
            variant_name = line.strip(' >\n').split()[0]
            proteins[variant_name] = ''
        else:
            proteins[variant_name] += line.strip()



Or, as shown yesterday, at least read all the files with a single `for` loop:

In [None]:
fasta_files = [
    'Data/envelope_protein.fasta',
    'Data/nucleocapsid_phosphoprotein.fasta',
    'Data/surface_glycoprotein.fasta',
]

proteins = {}

for file_path in fasta_files:
    with open(file_path) as proteins_fasta:
        for line in proteins_fasta:
            if line.startswith('>'):
                variant_name = line.strip(' >\n').split()[0]
                proteins[variant_name] = ''
            else:
                proteins[variant_name] += line.strip()

Even so, we still had to copy that code from the previous cell and will need to do it again if more FASTA files come up to be read.

However, an important goal we should have while writing programs is to avoid repetitions like that as much as we can. Repeated code is often much harder to adapt, correct, and improve, both for yourself and for other programmers who might use your code in the future, because each tiny change has to be reproduced across all copy-pasted sections, making updating your code a laborious and error-prone task.

For instance, imagine you realize that the sequences you have been reading contain soft-masked regions, i.e. lower-case characters, which by convention usually mark repetitive or low-complexity regions. If you intend to ignore this information and use upper case for easier sequence comparison afterwards, you would then modify the FASTA reading procedure to the following:

In [None]:
fasta_files = [
    'Data/envelope_protein.fasta',
    'Data/nucleocapsid_phosphoprotein.fasta',
    'Data/surface_glycoprotein.fasta',
]

proteins = {}

for file_path in fasta_files:
    with open(file_path) as proteins_fasta:
        for line in proteins_fasta:
            if line.startswith('>'):
                variant_name = line.strip(' >\n').split()[0]
                proteins[variant_name] = ''
            else:
                # The only modified line:
                proteins[variant_name] += line.strip().upper()

Note that every single part of the code in which you have read a fasta file now needs this small modification, and now you have to go through all your code, searching for all of them, worrying about missing some.

Given there are only a few small additions needed throughout this notebook, you might not be convinced yet of the burden's size. Imagine though you have written a huge set of programs, with dozens of FASTA reads spread across multiple files. The work a simple update will cost you may quickly escalate if your code is full of those redundancies. So yes,

> **NOTE:** Redundant code is much more laborious and error-prone to correct and update.

But how can we solve that? A crucial concept in almost every programming language to deal with this kind of situation is called **functions**. Functions enable us to wrap and concisely reuse important pieces of code. Let's wrap the fasta reading procedure we came up with earlier.

In [None]:
def read_fasta(filepath):
    result = {}

    with open(filepath) as protein_fasta:
        for line in protein_fasta:
            if line.startswith('>'):
                sequence_name = line.strip(' >\n').split()[0]
                result[sequence_name] = ''
            else:
                result[sequence_name] += line.strip().upper()

    return result


Now, reading three different fasta files is as simple as doing:

In [None]:
envelope_seq = read_fasta(filepath='Data/envelope_protein.fasta')
nucleocapsid_seq = read_fasta(filepath='Data/nucleocapsid_phosphoprotein.fasta')
# Specifying 'filepath=' is optional.
surface_seq = read_fasta('Data/surface_glycoprotein.fasta')

> **Note:** Functions can make your code much cleaner, more concise, and more readable.

But let's dive into how the function syntax works.

The `def` keyword is used to define functions. We then provide a name for our function (before parentheses) and a bunch of names (inside parentheses) for variables we would like to change each time we run the code below them. These variables are what we call **arguments** or **parameters** of a function. This first line of code is called the function **header**. The indented lines below the header (the function's **body**) will be executed every time we write the function's name followed by parentheses and a set of parameter values (a function **call**).

In [None]:
# Anatomy of a function definition.

# Function's header.
def function_name(parameter_1, parameter_2, parameter_3):
    # Function's body.
    # Code...
    # More code...
    # Doing a lot of code here.
    return result_of_our_code
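As a concrete instance of this anatomy, here is a small function with two parameters. The function `count_residue` and the example sequence are hypothetical and serve only to show a header, a body, a `return`, and two equivalent calls.

```python
def count_residue(sequence, residue):
    """Return how many times `residue` occurs in `sequence`."""
    count = 0
    for character in sequence:
        if character == residue:
            count += 1
    return count


# Call with parameter names stated explicitly...
n_keyword = count_residue(sequence="RLFRKSNLKPF", residue="K")
# ...or with values only, matched to the parameters by their order.
n_positional = count_residue("RLFRKSNLKPF", "K")

print(n_keyword)     # 2
print(n_positional)  # 2
```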

Normally, every variable we create inside a function cannot be accessed outside of it. To "export" the results of our computation, we must then use the `return` keyword, terminating the function execution and exposing, back to the main program, the values we intended to get in the first place when calling the function. Let's take a look at one more example.

In [None]:
def dna_to_rna(dna_sequence):
    rna_sequence = ''
    
    for bp in dna_sequence:
        if bp == 't':
            rna_sequence += 'u'
        elif bp == 'T':
            rna_sequence += 'U'
        else:
            rna_sequence += bp

    return rna_sequence

We have defined a function to convert a DNA string to an RNA equivalent. Now we can call our new function on a random example sequence:

In [None]:
rna = dna_to_rna(dna_sequence='ggatgtggtgaGATGAGtagtGATGGATGATGT')
print(rna)

ggauguggugaGAUGAGuaguGAUGGAUGAUGU


What happens can be imagined as if the value returned from the function body (using the `return` keyword) would "replace" the whole function call in the "external" code. In our example, if the variable `rna_sequence` inside the function scope holds the returned value, we could understand what happens as if we were running the following code:

In [None]:
# We call the dna_to_rna function.
rna = dna_to_rna('ggatgtggtgaGATGAGtagtGATGGATGATGT')
# NOTE: in the function call, we can provide only the values we want to
# set to our parameters (only one in this case), without specifying their
# names. Their identity is inferred from the order in which they are
# presented.

# >>>> Entering the function scope.
# Inside the function body, that's what will happen.

# The function parameters receive their values based on the function call.
dna_sequence = 'ggatgtggtgaGATGAGtagtGATGGATGATGT'

# The function body is executed with the defined parameters.
rna_sequence = ''

for bp in dna_sequence:
    if bp == 't':
        rna_sequence += 'u'
    elif bp == 'T':
        rna_sequence += 'U'
    else:
        rna_sequence += bp

# return rna_sequence
# <<<< Exiting the function scope.

# We exit the function and return to the first code line of this cell,
# where the function was called, substituting the call by the returned
# value.
rna = rna_sequence
# Continue to the remainder of the code after the function call.
print(rna)

ggauguggugaGAUGAGuaguGAUGGAUGAUGU


The above thus represents the following "transformation":

In [None]:
rna = dna_to_rna('ggatgtggtgaGATGAGtagtGATGGATGATGT')
#         |    |    |
#         V    V    V
rna = 'ggauguggugaGAUGAGuaguGAUGGAUGAUGU'

Functions are everywhere. Some functions we have been using, such as `print()` or `enumerate()`, are pre-defined in every Python program and available whenever we need them. For that reason, we call them *built-in* functions, as we saw earlier. Other functions exist "inside" another object in Python. In such cases, we use the dot notation to indicate that an object belongs to or is contained inside another:

In [None]:
list(rbd_segment.keys())

['alpha', 'beta', 'gamma', 'delta', 'omicron']


The function `keys()` lives "inside" or "belongs to" the dictionary we stored earlier in the variable `rbd_segment`. When called, it returns the keys of the dictionary in `rbd_segment`.

We give those "functions belonging to objects" the slightly different name **methods**, but their role remains essentially the same: wrapping and reusing useful code. We thus say `keys()` is a method of dictionaries, in the same way that `replace()` and `join()` are string methods.


> **Note:** Methods are functions "inside" or "belonging to" Python objects.
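As a small illustration of the difference in how the two are called, the sketch below applies one built-in function and a few string methods to a made-up example sequence. For this input, the chained `replace()` calls give the same result as the `dna_to_rna` function defined earlier.

```python
sequence = 'ggatgtggtgaGATGAG'

# Built-in functions are called on their own and receive the object as a value.
print(len(sequence))

# Methods are called with the dot notation, on the object they belong to.
print(sequence.upper())
print(sequence.count('g'))

# Method calls can be chained: each replace() returns a new string.
rna = sequence.replace('t', 'u').replace('T', 'U')
print(rna)
```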

The concepts of functions and modules will be revisited in the following sections.

<div id='m5' />

### M5: Comparing biological sequences
[Back to table of contents](#table-of-contents)

**Estimated study load**: 50 minutes

**Learning objectives**:
* Comparing biological sequences

#### Reading spike protein sequences
In this block, we will see how we can compare biological sequences using some of the powerful tools Python provides.

Let's revisit the sequences of different SARS-CoV-2 spike proteins we have been using. We now have a function to import those sequences from a FASTA file and store them in a dictionary:

In [None]:
spike_proteins = read_fasta('Data/spike_proteins.fasta')
print(spike_proteins)

{'Alpha_B.1.1.7': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIDDTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILARL

In [None]:
spike_proteins = {'Alpha_B.1.1.7': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIDDTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILARLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTHNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT', 
                  'Beta_B.1.351': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFANPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRGLPQGFSALEPLVDLPIGINITRFQTLHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGVENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT', 
                  'Gamma_P1': 'MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNYPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLSEFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGTIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAAIKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASFVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT', 
                  'Delta_B.1.617.2': 'MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT', 
                  'Omicron_BA.1': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHVISGTNGTKRFDNPVLPFNDGVYFASIEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPIIVREPEDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFDEVFNATRFASVYAWNRKRISNCVADYSVLYNLAPFFTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNKLDSKVSGNYNYLYRLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLKGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFKGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNHNAQALNTLVKQLSSKFGAISSVLNDIFSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT',
                  'Omicron_BA.2': 'MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLGRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFDEVFNATRFASVYAWNRKRISNCVADYSVLYNFAPFFAFKCYGVSPTKLNDLCFTNVYADSFVIRGNEVSQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNKLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYGFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNHNAQALNTLVKQLSSKFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT'}

for var, s in spike_proteins.items():
    print(var.lower().split('_')[0], '=', s[450:500])

alpha = RLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTYGV
beta = RLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGV
gamma = YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPT
delta = YRLFRKSNLKPFERDISTEIYQAGSKPCNGVEGFNCYFPLQSYGFQPTNG
omicron = RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGV
omicron = RLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYGFRPTYGV


Now that we have those sequences in a convenient format for our explorations, how can we begin extracting useful information from them? How can we determine conserved regions and important residue insertions, deletions, and substitutions?

A first attempt might look like the following function.

In [None]:
def seq_align_1(seq1, seq2):
    alignment = ''

    for residue1, residue2 in zip(seq1, seq2):
        if residue1 == residue2:
            alignment += residue1
        else:
            alignment += '.'

    return alignment 

The function receives two strings representing the pair of protein sequences we intend to compare. We then compare the first residue of the first protein with the first residue of the second, then the second residue of each, then the third, and so on. If the compared residues at a given iteration match, we add their one-letter representation to the `alignment` string. Otherwise, we add a `.` character to indicate the mismatch. In the end, the returned `alignment` string represents a consensus between the two compared proteins, showing matching residues and replacing mismatched ones with a `.` in their corresponding place.

In [None]:
seq_align_1("ABCDEFG", "ACBDEFP")

'A..DEF.'
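One detail of this approach is worth checking: `zip()` stops as soon as the shorter of its inputs runs out, so residues beyond that point are silently left out of the result. The sketch below repeats the function so that the cell runs on its own, and uses made-up toy strings:

```python
def seq_align_1(seq1, seq2):
    alignment = ''
    for residue1, residue2 in zip(seq1, seq2):
        if residue1 == residue2:
            alignment += residue1
        else:
            alignment += '.'
    return alignment

# Sequences of different lengths: only the first 4 positions are compared.
short_alignment = seq_align_1("ABCDEFG", "ABXD")
print(short_alignment)

# Counting the '.' characters gives the number of mismatched positions.
toy_alignment = seq_align_1("ABCDEFG", "ACBDEFP")
print(toy_alignment.count('.'))
```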

But how will our function perform on real-world biological sequences? Let's test it on SARS-CoV-2 data. The labels of the available sequences can be shown with:

In [None]:
spike_proteins.keys()

dict_keys(['Alpha_B.1.1.7', 'Beta_B.1.351', 'Gamma_P1', 'Delta_B.1.617.2', 'Omicron_BA.1', 'Omicron_BA.2'])

So let's compare the **Beta_B.1.351** variant with **Alpha_B.1.1.7**:

In [None]:
alignment = seq_align_1(spike_proteins['Beta_B.1.351'], spike_proteins['Alpha_B.1.1.7'])
print(alignment)

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAI................................................................F.............K............................L....G...N.....................................L..L..............LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTG.IADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGV.GFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDI.DTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNS.RRARSVASQSIIAYTMSLG.ENSVAYSNNSIAIP.NFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDIL.RLDKVEAEVQIDRLITGRLQS

Although the alignment shows a large mismatching region between both proteins, a more careful inspection of the beginning of this region reveals they are not as different as one might think:

In [None]:
start, end = 50, 150
print(alignment[start:end])
print(spike_proteins['Beta_B.1.351'][start:end])
print(spike_proteins['Alpha_B.1.1.7'][start:end])

TQDLFLPFFSNVTWFHAI................................................................F.............K...
TQDLFLPFFSNVTWFHAIHVSGTNGTKRFANPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNK
TQDLFLPFFSNVTWFHAISGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKSWM


More specifically, if we treat the histidine-valine pair at indices 68 and 69 of the Beta sequence (residues 69 and 70 in the usual 1-based numbering) as an insertion and realign the sequences from index 70 on, we see a much higher resemblance between the two proteins:

In [None]:
start, end = 50, 150
istart, iend = 68, 70  # Insertion start, insertion end
seqB = spike_proteins['Beta_B.1.351'] 
seqA = spike_proteins['Alpha_B.1.1.7']
blank = ' ' * (iend-istart+2)  # For printing purposes

new_alignment = seq_align_1(seqB[iend:], seqA[istart:])

print(alignment[start:istart] + blank + new_alignment[:end-iend])
print(seqB[start:istart] + '(' + seqB[istart:iend]+ ')' + seqB[iend:end])
print(seqA[start:istart] + blank + seqA[istart:end-2])

TQDLFLPFFSNVTWFHAI    SGTNGTKRF.NPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVY...N..
TQDLFLPFFSNVTWFHAI(HV)SGTNGTKRFANPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNK
TQDLFLPFFSNVTWFHAI    SGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYHKNNKS


The same issue seems to occur with the tyrosine insertion near the end of the region shown above, and it has a much more drastic effect when comparing other variants such as **Gamma_P1** and **Alpha_B.1.1.7**:

In [None]:
seq_align_1(spike_proteins['Gamma_P1'], spike_proteins['Alpha_B.1.1.7'])

'MFVFLVLLPLVSSQCVN.T.RTQLP.AYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAI................................................................F.............K............................L....G...N.....................................L..L..............L..................A.....Y....................D.....L.....T...............................N..........................................F............L.........................G.......................N..............L.R...........................G......................Y.......................................G.............F..F.....D..D..........L.I..............T...N...................A.............S............................................R...S.....................S.......T...................................................................Q............F...........PSK.S............................................L..L..............A.............A..................................N.....I.....S..S....L........Q....L.....................L..................

We then feel the need for more sophisticated alignment functions, capable of

1. Taking insertions and deletions into consideration;

2. Weighting substitutions differently, based on their probability of occurring in nature. For instance, an alanine to proline substitution is likely to have much more impact on the protein's structure than an alanine to valine substitution;

3. Performing local alignments, i.e. identifying highly matching regions instead of a single global alignment; and

4. Performing not only pairwise but multiple sequence alignments at once.
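To make the first requirement more concrete, here is a minimal sketch of a global alignment that allows gaps, in the spirit of the classic Needleman-Wunsch dynamic programming algorithm. The function name `global_align` and the scores (`match=1`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not values taken from this tutorial or from any library; practical tools use substitution matrices and more refined gap penalties.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Toy global alignment with a linear gap penalty (illustrative only)."""
    n, m = len(seq1), len(seq2)

    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align both residues
                              score[i - 1][j] + gap,       # gap in seq2
                              score[i][j - 1] + gap)       # gap in seq1

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append('-')
            i -= 1
        else:
            aligned1.append('-')
            aligned2.append(seq2[j - 1])
            j -= 1

    return ''.join(reversed(aligned1)), ''.join(reversed(aligned2))


# Short fragments around the histidine-valine pair discussed above.
alpha_fragment = 'FHAISGTNG'
beta_fragment = 'FHAIHVSGTNG'
aligned_alpha, aligned_beta = global_align(alpha_fragment, beta_fragment)
print(aligned_alpha)
print(aligned_beta)
```

With these scores, the two extra residues of the longer fragment end up facing gap characters (`-`) in the shorter one, instead of shifting every downstream position out of register as happened with `seq_align_1`.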

Coming up with algorithms or searching for them in the scientific literature and implementing them as Python functions would cost us considerable time. Luckily, a lot of people have run into this same task, so ready-to-use implementations were already meticulously developed and are available to everyone who needs them.

These pre-made functions come in what we call **modules**. Modules can be understood as groups of functions, constants, or any other objects ready to be added, or **imported** to our code at any moment.

For instance, if you ever need the $\pi$ constant or the $\log(x)$ mathematical function, they can be found inside the built-in module `math`. We can load the `math` module with the `import` keyword as shown below.

In [None]:
# Modules are imported with the 'import' keyword.
import math

# We can access constants and functions inside imported modules
# using the dot ('.') notation. 
print(math.pi)
print(math.log(1000))

# We can use them directly though, if we employ the 'from' keyword as well.
from math import pi, log

print(pi)
print(log(1000))

3.141592653589793
6.907755278982137
3.141592653589793
6.907755278982137


This module contains several mathematical functions and constants ready to use, saving us the trouble of writing these (often pretty complicated) functions ourselves.

Functions and modules made by other people to expand Python capabilities are further wrapped and distributed in what we call **packages**, that can be downloaded and installed. Packages are identical to modules in the way we access them from our programs in Python, exposing their inner objects with dots.  However, among the objects they hold, we frequently find whole modules. Thus, they can be interpreted as "modules of modules". If modules are files containing objects in Python code (similar to the code cells we have been using), packages are folders containing these files.
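The standard library itself contains packages, which allows a quick check of this idea without installing anything. In the sketch below, `urllib` is a package and `urllib.parse` is one of the modules inside it; the `__path__` attribute, which only packages have, points to the folder holding those modules. The URL is used only as sample text to be split into parts.

```python
import math
import urllib
from urllib.parse import urlparse  # A module ('parse') inside a package ('urllib').

# Plain modules have no '__path__' attribute, whereas packages do.
print(hasattr(math, '__path__'))
print(hasattr(urllib, '__path__'))

# Objects inside the inner module are used like any other imported function.
address = urlparse('http://biopython.org/DIST/docs/tutorial/Tutorial.html')
print(address.netloc)
```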

From this moment on, we will explore some functionalities of the `biopython` package, which provides plenty of tools for reading and processing biological data.

<div id='m6' />

### M6: Biopython, file parsing, and multiple sequence analysis
[Back to table of contents](#table-of-contents)

**Estimated study load**: 50 minutes

**Learning objectives**:
* Introduce Biopython to work with computational molecular biology;
* Demonstrate how to parse a FASTA file;
* Align protein sequences;
* Extract alignment regions.


The official documentation of Biopython is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

#### What is Biopython and how do I install it?

Biopython is a set of free tools for computational molecular biology written in Python by an international team of developers. You can use Biopython to parse several bioinformatics file formats, including FASTA, GenBank (GBK), and BLAST output.

It is very important for a bioinformatician to become familiar with Biopython, as it is a Swiss Army knife that can help you in many situations. At this point in the tutorial, you have noticed that there are several file formats that we can use to store information. A classic one is FASTA.

By "parsing" a FASTA file, we mean extracting its information and storing it in a form that gives us more control over how we process it. However, before we start exploring the potential of Biopython to handle files, let's install it.

In [None]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp39-cp39-macosx_10_9_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 1.0 MB/s eta 0:00:01
Installing collected packages: biopython
Successfully installed biopython-1.79


> **Side note:** `pip` is a recursive acronym for *pip installs packages*. It is the standard tool for managing Python packages on your machine.

We are now ready to import the Biopython package! Note that the Biopython developers established that when importing Biopython, we should call it just `Bio`, for short.

In [None]:
import Bio

# Inside Bio, there is a string variable named __version__, that
# identifies the version of the biopython package we have installed.
print(Bio.__version__)

1.79


#### Parsing a FASTA file

Let's read the FASTA file containing the spike protein for seven SARS-CoV-2 lineages.

In [None]:
# 'SeqIO' (sequence input/output) is a module contained in Biopython.
from Bio import SeqIO

file_path = "Data/spike_proteins.fasta"

# 'parse' is a function inside the module 'SeqIO'
for seq_record in SeqIO.parse(file_path, "fasta"):
    print(seq_record.id)         # Sequence name after each ">"
    print(repr(seq_record.seq))  # Part of protein sequence
    print(len(seq_record))       # Sequence length

Wuhan-Hu-1_19A
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Alpha_B.1.1.7
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Beta_B.1.351
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Gamma_P1
Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Delta_B.1.617.2
Seq('MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1271
Omicron_BA.1
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Omicron_BA.2
Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')
1270


We did it! Now we have gathered the information contained in the FASTA file much faster than in the previous tutorial blocks. And we can do better! Let's store this information in a Python dictionary.

In [None]:
with open(file_path, "r") as fasta_file:
    record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta"))

In [None]:
record_dict

{'Wuhan-Hu-1_19A': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Wuhan-Hu-1_19A', name='Wuhan-Hu-1_19A', description='Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1', dbxrefs=[]),
 'Alpha_B.1.1.7': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Alpha_B.1.1.7', name='Alpha_B.1.1.7', description='Alpha_B.1.1.7 tr|A0A7T8KZF1|A0A7T8KZF1_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=3 SV=1', dbxrefs=[]),
 'Beta_B.1.351': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Beta_B.1.351', name='Beta_B.1.351', description='Beta_B.1.351 QRN78347.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[]),
 'Gamma_P1': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Gamma_P1', name='G

In this dictionary, we have each sequence name as a key and all related information as values. 

In [None]:
# How many sequences have we read?
len(record_dict)

7


In [None]:
# Getting sequences names
list(record_dict.keys())

['Wuhan-Hu-1_19A', 'Alpha_B.1.1.7', 'Beta_B.1.351', 'Gamma_P1', 'Delta_B.1.617.2', 'Omicron_BA.1', 'Omicron_BA.2']

In [None]:
# Sequence information for Omicron BA.2 variant
record_dict["Omicron_BA.2"]

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT'), id='Omicron_BA.2', name='Omicron_BA.2', description='Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])

In [None]:
type(record_dict["Omicron_BA.2"])

The `SeqRecord` object offers a lot of information as attributes, including:
  - `.seq`: the sequence itself.
  - `.id`: the primary ID used to identify the sequence.
  - `.name`: a common name for the sequence; for our FASTA records, it is the same as `.id`.
  - `.description`: the full header line of the FASTA record, in a more readable presentation.


In [None]:
# Retrieving the sequence as a Seq object
record_dict["Omicron_BA.2"].seq

Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')

In [None]:
# Sequence description
record_dict["Omicron_BA.2"].description

'Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]'
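For FASTA files, these attributes come from the header line of each record. As a rough pure-Python sketch of that relationship (not Biopython's actual code, and assuming the header line in the file matches the description printed above), the identifier is the first word of the header and the description is the whole line. The function name `split_fasta_header` is a hypothetical helper introduced only for this illustration.

```python
def split_fasta_header(header_line):
    """Return (identifier, description) from a FASTA header line."""
    # Drop the leading '>' and any surrounding whitespace.
    description = header_line.lstrip('>').strip()
    # The identifier is the first whitespace-separated word.
    identifier = description.split()[0]
    return identifier, description

header = ('>Omicron_BA.2 UJE45220.1 surface glycoprotein '
          '[Severe acute respiratory syndrome coronavirus 2]')
record_id, record_description = split_fasta_header(header)
print(record_id)
print(record_description)
```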

#### Multiple sequence alignment (MSA)

A multiple sequence alignment (MSA) is the alignment of three or more biological sequences (DNA, RNA, or protein). We can use the output to infer evolutionary relationships. In this section, we will use an MSA of spike proteins to explore mutations.

In [None]:
from Bio import AlignIO

msa_file = "Data/clustal_spike_msa.txt"
spike_align = AlignIO.read(msa_file, "clustal")
print(spike_align)

Alignment with 7 rows and 1275 columns
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Omicron_BA.1
MFVFLVLLPLVSSQCVNLITRTQ---SYTNSFTRGVYYPDKVFR...HYT Omicron_BA.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Alpha_B.1.1.7
MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFR...HYT Gamma_P1
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Delta_B.1.617.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Wuhan-Hu-1_19A
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Beta_B.1.351


The gaps (`-`) inserted in the Omicron_BA.2 sequence show that we now have a multiple sequence alignment. Let's explore the `Bio.Align.MultipleSeqAlignment` object that holds it. First, let's take another look at the spike protein variants.

![sars-cov-2](https://viralzone.expasy.org/resources/Variants_graph.svg)
image source: https://viralzone.expasy.org/9556

In [None]:
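# Alignment column holding residue 501. The +1 offset is specific to this
# alignment: 0-based indexing (-1) plus two gap columns upstream (+2).
start = 501 + 1
end = 503
print(spike_align[:, start:end])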

Alignment with 7 rows and 1 columns
Y Omicron_BA.1
Y Omicron_BA.2
Y Alpha_B.1.1.7
Y Gamma_P1
N Delta_B.1.617.2
N Wuhan-Hu-1_19A
Y Beta_B.1.351


You can check in the image that all lineages, except Delta B.1.617.2 and Wuhan-Hu-1, present a mutation that changes asparagine to tyrosine in position 501 (N501Y). You can also create a simple function to retrieve a specific position in the alignment.

In [None]:
def get_alignment_position(alignment, start, end):
    # The +1 is the same alignment-specific offset used in the previous cell.
    return alignment[:, start+1:end]

# Retrieving position 417
print(get_alignment_position(spike_align, start=417, end=419))

Alignment with 7 rows and 1 columns
N Omicron_BA.1
N Omicron_BA.2
K Alpha_B.1.1.7
T Gamma_P1
K Delta_B.1.617.2
K Wuhan-Hu-1_19A
N Beta_B.1.351
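The fixed offset used above works for these positions in this particular alignment, but the number of gap columns before a residue generally varies along an alignment and between rows. A more general approach is to count the non-gap characters in the row of a reference sequence. The helper below, `residue_to_column`, is a hypothetical sketch written for this explanation (it is not a Biopython function) and works on a plain string in which gaps are written as `-`:

```python
def residue_to_column(aligned_row, residue_number):
    """Return the 0-based alignment column of a 1-based residue number."""
    residues_seen = 0
    for column, symbol in enumerate(aligned_row):
        if symbol == '-':
            continue  # Gap columns do not count as residues.
        residues_seen += 1
        if residues_seen == residue_number:
            return column
    raise ValueError('The row has fewer residues than requested.')

# Toy aligned row with two gap columns (made-up data).
toy_row = 'MF--VFLV'
column = residue_to_column(toy_row, 3)
print(column, toy_row[column])
```

With the alignment loaded earlier, the reference row could be obtained as a string with something like `str(spike_align[5].seq)`, where index 5 is the Wuhan-Hu-1_19A row in the printed alignment.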


#### Distance tree from MSA

We can also use Python to construct a simple tree based on sequence distance.

In [None]:
from Bio.Phylo.TreeConstruction import DistanceCalculator

# Create a distance calculator based on sequence identity
calculator = DistanceCalculator('identity')

# Calculate distance matrix (dm) from SARS-CoV-2 spike sequences
dm = calculator.get_distance(spike_align)
print(dm)

Omicron_BA.1	0
Omicron_BA.2	0.02117647058823524	0
Alpha_B.1.1.7	0.029803921568627434	0.027450980392156876	0
Gamma_P1	0.03450980392156866	0.026666666666666616	0.014117647058823568	0
Delta_B.1.617.2	0.0337254901960784	0.025882352941176467	0.013333333333333308	0.015686274509803977	0
Wuhan-Hu-1_19A	0.03137254901960784	0.02431372549019606	0.007843137254901933	0.009411764705882342	0.007843137254901933	0
Beta_B.1.351	0.0337254901960784	0.026666666666666616	0.01254901960784316	0.0117647058823529	0.014117647058823568	0.007843137254901933	0
	Omicron_BA.1	Omicron_BA.2	Alpha_B.1.1.7	Gamma_P1	Delta_B.1.617.2	Wuhan-Hu-1_19A	Beta_B.1.351
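The values above are consistent with a simple definition: the fraction of alignment columns in which two rows differ (for instance, 0.0078 × 1275 columns is about 10 differing columns). The sketch below computes such a distance for two plain strings. It is an illustration rather than Biopython's implementation: the function name `identity_distance` is hypothetical, the rows are made-up data, and it assumes that any column containing a gap counts as a difference.

```python
def identity_distance(row1, row2):
    """Fraction of alignment columns in which two aligned rows differ."""
    if len(row1) != len(row2):
        raise ValueError('Aligned rows must have the same length.')
    # Assumption: a column containing a gap ('-') counts as a difference.
    differences = sum(1 for a, b in zip(row1, row2)
                      if a != b or a == '-')
    return differences / len(row1)

# Toy aligned rows: one substitution (F/Y) and one gap column.
row_a = 'MFVFLV-LP'
row_b = 'MFVYLVLLP'
print(identity_distance(row_a, row_b))
```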


In [None]:
# Constructing the distance tree
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
tree.root_with_outgroup("Wuhan-Hu-1_19A", outgroup_branch_length=0.002)

# Visualize the distance tree
Bio.Phylo.draw_ascii(tree=tree)

                                       ______________________ Omicron_BA.2
            __________________________|
          _|                          |______________________ Omicron_BA.1
         | |
        _| |_______________ Gamma_P1
       | |
      _| |_____________ Delta_B.1.617.2
     | |
  ___| |__________ Alpha_B.1.1.7
 |   |
_|   |_______ Beta_B.1.351
 |
 |___ Wuhan-Hu-1_19A



In this course module, we explored the potential of Biopython to work with biological sequences. We saw how to parse FASTA files, how to work with multiple sequence alignments, and how to build a simple distance tree using SARS-CoV-2 as a model organism.

There are many utilities that you can further explore with Biopython. This course has shown you only some of the possibilities.

> **NOTE:** Biopython has lots of useful stuff!


## Conclusion
[Back to table of contents](#table-of-contents)

At this point, we hope you have a better understanding of how a programming language can be a useful tool for analyzing biological data. We saw some basic Python data types (numbers, booleans, and strings), some useful collections of objects (lists and dictionaries), how to read files, how to write our own functions, and how external packages can help us (with special attention to Biopython). We intended to provide you with the minimum set of tools to get you efficiently started in solving computational problems.

However, there are still many interesting concepts and ideas that we could not cover in the time available. We therefore list a handful of Python and bioinformatics learning resources below, hoping to keep you motivated on this journey of computer programming.

All the material we developed for this tutorial will be publicly and permanently available at [this GitHub repository](https://github.com/SantosRAC/intro_python_ismb2022).

We would appreciate it if you could give us feedback on the organization and execution of this tutorial:

> **[Feedback survey](https://docs.google.com/forms/d/e/1FAIpQLSftVbI5O-P6EidL-PBgmqjdVE9QX3SfsgGKqkX6DDxJzGvrfQ/viewform)**

Thank you!


### What next?
[Biology Meets Programming: Bioinformatics for Beginners](https://www.coursera.org/learn/bioinformatics) is a great course for learning Python directly in a biology setting. The instructors are the authors of the [Bioinformatics Algorithms](https://www.bioinformaticsalgorithms.org/) textbook, another recommended resource developed with a hands-on approach in mind.

If you would like to further practice and develop your programming skills, [Rosalind](https://rosalind.info/problems/locations/) is a nice place to visit from time to time. It is an open bank of bioinformatics problems that can be solved with any programming language. The problems start with beginner-friendly introductory exercises and progress to more complex and interesting applications.

Biology aside, one of the most popular ways of learning Python today is through the [Codecademy](https://www.codecademy.com/learn/learn-python-3) platform. Many consider this interactive style of lessons to be the best way of learning how to code.

These and some other resources that may interest you can be found below.

### Interactive tutorials
- [Codecademy - Learn Python 3](https://www.codecademy.com/learn/learn-python-3)
- [Codecademy - All Python courses](https://www.codecademy.com/catalog/language/python)
 
### Bioinformatics problems database
- [Rosalind](https://rosalind.info/problems/locations/)

### Video resources
- [Coursera - Biology Meets Programming: Bioinformatics for Beginners](https://www.coursera.org/learn/bioinformatics)
- [Coursera - Bioinformatics Specialization](https://www.coursera.org/specializations/bioinformatics)
- [DataCamp - Introduction to Python](https://www.datacamp.com/courses/intro-to-python-for-data-science)
- [YouTube - StatQuest's high throughput sequencing playlist](https://www.youtube.com/playlist?list=PLblh5JKOoLUJo2Q6xK4tZElbIvAACEykp)

### Textbooks
- Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming - Eric Matthes, 2019
- [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/)
- [Bioinformatics Algorithms - Phillip Compeau, Pavel Pevzner, 2018](https://www.bioinformaticsalgorithms.org/)
- [Mastering Python for Bioinformatics - Ken Youens-Clark, 2021](https://www.oreilly.com/library/view/mastering-python-for/9781098100872/)
- Python for Bioinformatics - Sebastian Bassi, 2018
- [Bioinformatics Programming Using Python - Mitchell L Model, 2009](https://www.oreilly.com/library/view/bioinformatics-programming-using/9780596804725/)
- Bioinformatics Algorithms: Design and Implementation in Python - Rocha and Ferreira, 2018
- Bioinformatics with Python Cookbook, Second Edition: Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology - Tiago Antao, 2018



### References and links

#### Books

* Libeskind-Hadas, Ran, and Eliot Bush. Computing for Biologists: Python Programming and Principles. Cambridge University Press, 2014.
* Hunt, John. A Beginners Guide to Python 3 Programming. 2021.

#### Pages of the official Python documentation

* [Reading and writing files](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
