# Introduction to Python programming for bioscientists - ISMB 2022

Organizer(s):
- Hemanoel Passarelli Araujo, Federal University of Minas Gerais, Brazil (passarelli@ufmg.br)

- Pedro de Carvalho Braga Ilídio Silva, University of São Paulo, Brazil (ilidio@alumni.usp.br)

- Renato Augusto Corrêa dos Santos, University of Campinas, Brazil (renatoacsantos@gmail.com)

- Vinícius Henrique Franceschini dos Santos, University of São Paulo, Brazil (vinicius6.santos@usp.br)


# Learning Objectives for Tutorial:

Programming skills have become crucial for bioscientists. In this tutorial, we will introduce basic Python concepts and compare SARS-CoV-2 genomes to show how powerful the Biopython toolkit can be for analyzing biological sequences.

The main objectives are:

- To introduce Google Colab digital notebooks;
- To present the basic logic and data structures in Python;
- To provide hands-on experience in analyzing biological sequences using Biopython.

# Notebook structure

This notebook is structured into six modules (M):

- M1: Introduction to Python and data structures (study-load: 80 minutes);

- M2: Logical operations (study-load: 50 minutes);

- M3: Loops and iteration (study-load: 50 minutes);

- M4: Functions (study-load: 80 minutes);

- M5: Interacting with the operating system (study-load: 50 minutes);

- M6: Biopython, file parsing, and multiple sequence analysis (study-load: 50 minutes).

We assume that you are already familiar with Google Colab notebooks. If you are not, check our notebook on this topic: <link to "how to use colab notebook">.


# Practical Project: COVID-19 and SARS-CoV-2

Coronaviruses are RNA viruses able to infect both humans and other animals. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19), and new viral lineages were still emerging in 2022.

The large amount of data generated during the pandemic has allowed us to better understand the SARS-CoV-2 genome and the main genetic mechanisms of virus transmission. It is normal for viruses to change over time and accumulate mutations. The set of mutations in a genome can be used to define a viral lineage.

The SARS-CoV-2 genome comprises about 30 kb and encodes four structural proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N). SARS-CoV-2 relies on its S protein to interact with the human ACE2 receptor, enter the cell, and start the infection. The S protein has two subunits, S1 and S2. The S1 subunit is located at the N-terminus of the S protein and engages the human ACE2 receptor, while the S2 subunit mediates fusion with the host cell membrane.

The combination of mutations in the S protein is commonly used to discriminate SARS-CoV-2 lineages. The image below shows the main SARS-CoV-2 variants of concern:

![sars-cov-2](sars-cov-2-aln.jpg)
image source: https://viralzone.expasy.org/9556

In this tutorial, we will use the spike protein sequences of several SARS-CoV-2 strains to explore Python's potential for working with biological data.


# M1: Introduction to Python and data structures

Learning objective: <Insert here the learning objective for this module.>

Estimated study-load: 80 minutes

## Python

## Variables and native functions

# M6: Biopython, file parsing, and multiple sequence analysis

Learning objectives: 

- Introduce Biopython for computational molecular biology;
- Demonstrate how to parse a FASTA file;
- Align protein sequences;
- Extract alignment regions.

Estimated study-load: 50 minutes

The official Biopython documentation is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

## What is Biopython and how to install it?

Biopython is a set of freely available tools for computational molecular biology written in Python by an international team of developers. You can use Biopython to parse several bioinformatics file formats, including FASTA, GenBank (gbk), and BLAST output.

It is very important for a bioinformatician to become familiar with Biopython: it works like a Swiss Army knife that can help you in many situations. By this point in the tutorial, you have noticed that there are several file formats for storing information. A classic one is the FASTA format.

When we talk about "parsing" a FASTA file, we mean extracting its information and storing it in Python objects so that we have more control over how to process it. However, before we start exploring the potential of Biopython for handling files, let's install it.

In [2]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp39-cp39-macosx_10_9_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 1.0 MB/s eta 0:00:01
Installing collected packages: biopython
Successfully installed biopython-1.79


In [3]:
import Bio
print(Bio.__version__)

1.79


## Parsing a FASTA file

Let's read the FASTA file containing the spike protein sequences of seven SARS-CoV-2 lineages.

In [22]:
from Bio import SeqIO

file_path = "spike_proteins.fasta"

# SeqIO.parse yields one SeqRecord per entry in the file
for seq_record in SeqIO.parse(file_path, "fasta"):
    print(seq_record.id)         # sequence name: first word after each ">"
    print(repr(seq_record.seq))  # abbreviated view of the protein sequence
    print(len(seq_record))       # sequence length

Wuhan-Hu-1_19A
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Alpha_B.1.1.7
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Beta_B.1.351
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Gamma_P1
Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1273
Delta_B.1.617.2
Seq('MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1271
Omicron_BA.1
Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT')
1270
Omicron_BA.2
Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')
1270
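
To make the idea of "parsing" more concrete, here is a minimal sketch of what a FASTA parser does, written with plain Python only. It is not Biopython's implementation: the function names `parse_fasta` and `_make_record` and the toy sequences are made up for illustration. It shows how each header line (starting with ">") opens a new record, how the first word of the header becomes the id, and how the following lines are joined into one sequence.

```python
def _make_record(header, chunks):
    # The id is the first word of the header; the description is the whole header.
    parts = header.split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    return seq_id, header, "".join(chunks)


def parse_fasta(text):
    """Yield (id, description, sequence) for each record in FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)  # close the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        yield _make_record(header, chunks)  # do not forget the last record


# Toy data, made up for illustration (not the real spike sequences)
toy_fasta = """>seqA toy protein one
MFVFLV
LLPLVS
>seqB toy protein two
MFVF
"""

records = list(parse_fasta(toy_fasta))
for seq_id, description, sequence in records:
    print(seq_id, len(sequence), description)
```

In practice you should keep using `SeqIO.parse`, which handles many more formats and edge cases than this sketch.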


We did it! We have now gathered the information contained in the FASTA file much faster than in the previous modules. We can do even better: let's store this information in a Python dictionary.

In [7]:
with open(file_path, "r") as fh: #fh = file handle
    record_dict = SeqIO.to_dict(SeqIO.parse(fh, "fasta"))

In [8]:
record_dict

{'Wuhan-Hu-1_19A': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Wuhan-Hu-1_19A', name='Wuhan-Hu-1_19A', description='Wuhan-Hu-1_19A sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1', dbxrefs=[]),
 'Alpha_B.1.1.7': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Alpha_B.1.1.7', name='Alpha_B.1.1.7', description='Alpha_B.1.1.7 tr|A0A7T8KZF1|A0A7T8KZF1_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=3 SV=1', dbxrefs=[]),
 'Beta_B.1.351': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Beta_B.1.351', name='Beta_B.1.351', description='Beta_B.1.351 QRN78347.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[]),
 'Gamma_P1': SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT'), id='Gamma_P1', name='G

In this dictionary, each sequence name is a key and the corresponding SeqRecord, with all its related information, is the value.
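
The key/value idea can be sketched with plain Python. The example below is a simplified stand-in, not Biopython code: records are represented as hypothetical `(id, description, sequence)` tuples, and `to_record_dict` is a made-up helper. It also shows why identifiers must be unique: a dictionary cannot hold two values under the same key, so the sketch raises a `ValueError` on duplicates (as far as we know, `SeqIO.to_dict` behaves similarly).

```python
def to_record_dict(records):
    """Build {id: record} from (id, description, sequence) tuples."""
    record_dict = {}
    for record in records:
        seq_id = record[0]
        if seq_id in record_dict:
            # Two records with the same id would silently overwrite each other
            raise ValueError(f"Duplicate key '{seq_id}'")
        record_dict[seq_id] = record
    return record_dict


# Toy records, made up for illustration
toy_records = [
    ("seqA", "seqA toy protein one", "MFVFLVLLPLVS"),
    ("seqB", "seqB toy protein two", "MFVF"),
]

toy_dict = to_record_dict(toy_records)
print(len(toy_dict))        # number of records
print(list(toy_dict))       # the keys, in insertion order
print(toy_dict["seqB"][2])  # direct lookup by name, no loop needed
```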

In [9]:
# How many sequences have we read?
print(len(record_dict))

7


In [15]:
# Getting the sequence names
print(record_dict.keys())

dict_keys(['Wuhan-Hu-1_19A', 'Alpha_B.1.1.7', 'Beta_B.1.351', 'Gamma_P1', 'Delta_B.1.617.2', 'Omicron_BA.1', 'Omicron_BA.2'])


In [17]:
# Sequence information for Omicron BA.2 variant

record_dict["Omicron_BA.2"]

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT'), id='Omicron_BA.2', name='Omicron_BA.2', description='Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])

The SeqRecord object exposes a lot of information as attributes, including:

- .seq: the sequence itself;
- .id: the primary ID used to identify the sequence;
- .name: a common name for the sequence, here identical to the id;
- .description: the full FASTA header line, which gives a more readable description of the sequence.


In [18]:
# Retrieving the sequence as a Seq object
record_dict["Omicron_BA.2"].seq

Seq('MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLP...HYT')

In [20]:
# Sequence description
record_dict["Omicron_BA.2"].description

'Omicron_BA.2 UJE45220.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]'

## Multiple sequence alignment (MSA)

A multiple sequence alignment (MSA) is the alignment of three or more biological sequences (DNA, RNA, or protein). We can use it to infer evolutionary relationships. In this section, we will use an MSA of spike proteins to explore mutations.

In [68]:
from Bio import AlignIO

msa_file = "clustal_spike_msa.txt"
spike_align = AlignIO.read(msa_file, "clustal")
print(spike_align)

Alignment with 7 rows and 1275 columns
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Omicron_BA.1
MFVFLVLLPLVSSQCVNLITRTQ---SYTNSFTRGVYYPDKVFR...HYT Omicron_BA.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Alpha_B.1.1.7
MFVFLVLLPLVSSQCVNFTNRTQLPSAYTNSFTRGVYYPDKVFR...HYT Gamma_P1
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Delta_B.1.617.2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Wuhan-Hu-1_19A
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR...HYT Beta_B.1.351


The gaps inserted in the Omicron_BA.2 sequence show that we now have a multiple sequence alignment. Let's explore the Bio.Align.MultipleSeqAlignment object. First, let's look again at the spike protein alignment.

![sars-cov-2](sars-cov-2-aln.jpg)
image source: https://viralzone.expasy.org/9556

In [70]:
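# Python is 0-indexed, so residue 501 of an ungapped sequence sits at index 500.
# In this alignment, gap columns shift it two columns to the right, to index 502.
start = 501 + 1
end = 503
print(spike_align[:, start:end])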

Alignment with 7 rows and 1 columns
Y Omicron_BA.1
Y Omicron_BA.2
Y Alpha_B.1.1.7
Y Gamma_P1
N Delta_B.1.617.2
N Wuhan-Hu-1_19A
Y Beta_B.1.351


You can check in the image that all lineages except Delta (B.1.617.2) and Wuhan-Hu-1 carry a mutation that changes asparagine to tyrosine at position 501 (N501Y). You can also create a simple function to retrieve a specific position in the alignment.

In [78]:
# function to get a position

def get_aln_position(aln, start, end):
    """Return the alignment columns from index start + 1 up to, but not including, end."""
    return aln[:, start+1:end]

# Retrieving position 417
print(get_aln_position(aln=spike_align, start=417, end=419))

Alignment with 7 rows and 1 columns
N Omicron_BA.1
N Omicron_BA.2
K Alpha_B.1.1.7
T Gamma_P1
K Delta_B.1.617.2
K Wuhan-Hu-1_19A
N Beta_B.1.351
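
The offset between a residue number and its alignment column depends on how many gaps come before it, so a fixed "+1" only works for this particular alignment. As a more general sketch, the plain-Python helpers below count non-gap characters in a reference row to find the right column, and then name the substitutions found there. The function names and the toy alignment are hypothetical and only serve as an illustration.

```python
def residue_to_column(aligned_seq, position):
    """Return the 0-based alignment column of the 1-based residue `position`."""
    seen = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":          # gaps do not count as residues
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"sequence has only {seen} residues")


def substitutions_at(rows, reference, position):
    """Name the changes relative to `reference` at a residue position, e.g. 'N501Y'."""
    column = residue_to_column(rows[reference], position)
    ref_residue = rows[reference][column]
    changes = {}
    for name, aligned_seq in rows.items():
        residue = aligned_seq[column]
        if name != reference and residue != ref_residue:
            label = "del" if residue == "-" else residue
            changes[name] = f"{ref_residue}{position}{label}"
    return changes


# Toy alignment, made up for illustration: row name -> aligned sequence
toy_rows = {
    "ref":  "MK--TNV",
    "var1": "MKAATYV",
    "var2": "MK--TNV",
}

print(residue_to_column(toy_rows["ref"], 4))  # column of the 4th residue of "ref"
print(substitutions_at(toy_rows, "ref", 4))
```

With a Biopython alignment, a dictionary like `toy_rows` could be built from the row ids and `str(record.seq)` of each record.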


### Distance tree from MSA

We can also use Python to construct a simple tree based on sequence distances.

In [79]:
from Bio.Phylo.TreeConstruction import DistanceCalculator

# create a calculator that scores distance by sequence identity
calculator = DistanceCalculator('identity')

# calculate distance matrix (dm) from SARS-CoV-2 spike sequences

dm = calculator.get_distance(spike_align)
print(dm)

Omicron_BA.1	0
Omicron_BA.2	0.02117647058823524	0
Alpha_B.1.1.7	0.029803921568627434	0.027450980392156876	0
Gamma_P1	0.03450980392156866	0.026666666666666616	0.014117647058823568	0
Delta_B.1.617.2	0.0337254901960784	0.025882352941176467	0.013333333333333308	0.015686274509803977	0
Wuhan-Hu-1_19A	0.03137254901960784	0.02431372549019606	0.007843137254901933	0.009411764705882342	0.007843137254901933	0
Beta_B.1.351	0.0337254901960784	0.026666666666666616	0.01254901960784316	0.0117647058823529	0.014117647058823568	0.007843137254901933	0
	Omicron_BA.1	Omicron_BA.2	Alpha_B.1.1.7	Gamma_P1	Delta_B.1.617.2	Wuhan-Hu-1_19A	Beta_B.1.351
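
An identity-based distance is the fraction of alignment columns in which two rows differ. For example, the value 0.0212 between Omicron_BA.1 and Omicron_BA.2 corresponds to 27 differing columns out of 1275. The sketch below reproduces the idea in plain Python on a toy alignment. The function name `identity_distance` is hypothetical, and counting any column that contains a gap as a difference is an assumption of this sketch, which may not match Biopython's behavior in every detail.

```python
def identity_distance(seq1, seq2):
    """Fraction of alignment columns where two aligned rows differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A column counts as a match only if both rows carry the same residue
    matches = sum(a == b and a != "-" for a, b in zip(seq1, seq2))
    return 1 - matches / len(seq1)


# Toy alignment, made up for illustration
toy_rows = {
    "ref":  "MKTNVA",
    "var1": "MKTYVA",
    "var2": "MK-YVA",
}

# Print a lower-triangular matrix, similar in layout to the output above
names = list(toy_rows)
for i, name in enumerate(names):
    values = [identity_distance(toy_rows[name], toy_rows[other]) for other in names[: i + 1]]
    print(name, *(f"{value:.3f}" for value in values), sep="\t")
```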


In [94]:
# Constructing the distance tree
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
tree.root_with_outgroup("Wuhan-Hu-1_19A", outgroup_branch_length=0.002)

# visualize the distance tree
Phylo.draw_ascii(tree)

                                       ______________________ Omicron_BA.2
            __________________________|
          _|                          |______________________ Omicron_BA.1
         | |
        _| |_______________ Gamma_P1
       | |
      _| |_____________ Delta_B.1.617.2
     | |
  ___| |__________ Alpha_B.1.1.7
 |   |
_|   |_______ Beta_B.1.351
 |
 |___ Wuhan-Hu-1_19A
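
UPGMA builds a tree by repeatedly merging the two closest clusters and averaging their distances to everything else, weighted by cluster size. The sketch below shows that logic in plain Python. It is a simplified illustration, not Biopython's implementation: the function name `upgma_topology` and the toy distances are made up, and branch lengths are left out, so only the branching order is returned (as nested tuples).

```python
from itertools import combinations


def upgma_topology(names, distances):
    """Cluster `names` with UPGMA and return the branching order as nested tuples.

    `distances` maps frozenset({a, b}) to the distance between a and b.
    """
    if not names:
        raise ValueError("need at least one name")
    sizes = {name: 1 for name in names}  # cluster -> number of leaves it contains
    dist = dict(distances)               # working copy, extended as clusters merge
    while len(sizes) > 1:
        # 1. pick the closest pair of clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # 2. distance from the new cluster to the others: size-weighted average
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Toy distances, made up for illustration
toy_names = ["A", "B", "C", "D"]
toy_distances = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10,
    frozenset(("B", "D")): 10,
    frozenset(("C", "D")): 10,
}

toy_tree = upgma_topology(toy_names, toy_distances)
print(toy_tree)
```

Here A and B are merged first because they are the closest pair, then C joins them, and D joins last.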



In this course module, we explored the potential of Biopython to work with biological sequences. We saw how to parse fasta files, how to work with multiple sequence alignments, and how to build a simple distance tree using SARS-CoV-2 as a model organism.

There are many utilities that you can further explore with Biopython. This course has shown you some of the possibilities. Keep up with your studies.

See you soon.