# Lecture 5 - Multiple Sequence Alignment

You can use multiple sequence alignment (MSA) to compare homologous gene and protein sequences and understand their evolution.

## Software installation

Let's begin by installing a few packages that will be required.

In [None]:
!mamba install clustalo fasttree -c bioconda -y 

In [None]:
!mamba install -c etetoolkit -y ete3 

### Exercise 1

**Glucose-6-phosphate isomerase** (GPI) is an enzyme of the first (preparatory) phase of glycolysis. It converts glucose-6-phosphate into fructose-6-phosphate:

![GPI reaction](files/R00771.png)

Inside the `files/` folder you will find the file [gpi.faa](files/gpi.faa), which contains protein sequences for this enzyme retrieved from different organisms. Let's run an MSA and use it to build a phylogenetic tree.
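Before aligning, it can help to check what a FASTA file contains. The sketch below is a minimal standard-library reader, not part of the original exercise: `read_fasta` is a hypothetical helper and the two toy records are made-up sequences, used only so the example runs without the course files.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input (made-up sequences) standing in for a real FASTA file.
toy = StringIO(">seq_A example one\nMKTAYIAK\nQRQIS\n>seq_B example two\nMKTAHIAK\n")
records = list(read_fasta(toy))
for header, seq in records:
    # Print the identifier (first word of the header) and the sequence length.
    print(header.split()[0], len(seq))
```

To inspect the real data, the same function could be called on `open('files/gpi.faa')` instead of the toy input.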

#### Option 1: 

- Go to [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/)
- Upload the fasta file and submit
- Go to *Phylogenetic Tree* and *Download Phylogenetic Tree Data*


#### Option 2:

- Run the command below:

Here is a quick explanation: this command runs Clustal Omega (`clustalo`) and sends the resulting alignment directly to FastTree (`fasttree`) through a Unix pipe (`|`). The output of FastTree is then redirected (`>`) to a file in Newick (`*.nw`) format, which is commonly used to represent phylogenetic trees. Don't worry if this sounds too complicated.

In [None]:
!clustalo -i files/gpi.faa | fasttree -quiet > files/gpi_tree.nw 
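For readers who find the one-line pipeline hard to follow, the same two steps can be written out in Python. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helper names, and the code assumes (as the shell command above does) that `clustalo` writes the alignment to standard output and `fasttree` reads it from standard input. The tools are only run if they and the input file are available.

```python
import os
import shutil
import subprocess


def build_commands(fasta_path, nucleotide=False):
    """Return the two argument lists that the shell pipeline chains together."""
    align_cmd = ["clustalo", "-i", fasta_path]
    tree_cmd = ["fasttree", "-quiet"]
    if nucleotide:
        tree_cmd.append("-nt")  # nucleotide input instead of protein
    return align_cmd, tree_cmd


def run_pipeline(fasta_path, tree_path, nucleotide=False):
    """Run the alignment, pass it to the tree builder, and save the tree."""
    align_cmd, tree_cmd = build_commands(fasta_path, nucleotide)
    # Step 1 (left of the pipe): capture the alignment from standard output.
    alignment = subprocess.run(
        align_cmd, capture_output=True, text=True, check=True
    ).stdout
    # Step 2 (right of the pipe, plus the redirect): feed the alignment to
    # the tree builder and write its output to the tree file.
    with open(tree_path, "w") as out:
        subprocess.run(tree_cmd, input=alignment, stdout=out, text=True, check=True)


align_cmd, tree_cmd = build_commands("files/gpi.faa")
print(" ".join(align_cmd), "|", " ".join(tree_cmd))

# Only run the tools when they are installed and the input file exists.
tools_found = shutil.which("clustalo") and shutil.which("fasttree")
if tools_found and os.path.exists("files/gpi.faa"):
    run_pipeline("files/gpi.faa", "files/gpi_tree.nw")
```

Splitting the steps this way also makes it easy to save or inspect the intermediate alignment, which the piped version never writes to disk.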

We can now load this tree using the [ETE Toolkit](http://etetoolkit.org/), a powerful Python library for building and visualizing phylogenetic trees.

In [None]:
from ete3 import Tree

t = Tree('files/gpi_tree.nw')
t.render('%%inline', w=500)
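The Newick file loaded above is plain text: nested parentheses describe the branching order, and the number after each `:` is a branch length. The sketch below illustrates this on a toy tree using only the standard library. The species names and branch lengths are made up, and the regular expressions are a rough illustration, not a full Newick parser (use ETE for real work).

```python
import re

# Toy Newick string (made-up names and branch lengths).
newick = "((human:0.10,mouse:0.12):0.05,(yeast:0.40,ecoli:0.90):0.07);"

# Leaf names come right after "(" or ","; labels of internal nodes,
# if present, come after ")" and are therefore not matched here.
leaves = re.findall(r"[(,]\s*([^():,;\s]+)", newick)

# Every number that follows a ":" is a branch length.
lengths = [float(x) for x in re.findall(r":([0-9.eE+-]+)", newick)]

print(leaves)
print(len(lengths), "branch lengths")
```

Opening `files/gpi_tree.nw` in a text editor should show the same kind of structure, with the sequence identifiers from the FASTA file as leaf names.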

It is possible to configure several visualization parameters. Let's display our tree as a circular tree instead:

In [None]:
from ete3 import TreeStyle

ts = TreeStyle()
ts.mode = 'c'
ts.show_scale = False
t.render('%%inline', w=500, tree_style=ts)

**Discussion point:**
- Consider the connections in the tree. Is this what you would expect?

----------

### Exercise 2:

**ATP synthase** is a transmembrane protein that uses an electrochemical proton gradient to store energy as ATP. It is a complex protein composed of multiple subunits.

![ATP synthase](files/atps.png)

Inside the `files/` folder you will find the file [atps_a.faa](files/atps_a.faa), which contains protein sequences for **subunit a** of the enzyme complex retrieved from different organisms. Let's repeat the previous exercise using this protein. Again, you have two options:

#### Option 1: 

- Go to [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/)
- Upload the fasta file and submit
- Go to *Phylogenetic Tree* and *Download Phylogenetic Tree Data*


#### Option 2:

- Run the command below:

In [None]:
!clustalo -i files/atps_a.faa | fasttree -quiet > files/atps_tree.nw 

Now it is your turn to load and display the tree using **ETE Toolkit**:

In [None]:
# type your code here...

**Discussion points**:

- Does the phylogenetic tree look similar to the previous one?
- Can you say something about the evolution of the species by looking at the evolution of individual genes?

-----------
### Exercise 3

Ribosomes are universal molecular complexes present in all life forms. They are *"molecular machines"* composed of a mixture of RNA (orange) and protein subunits (blue):

![ribosome](files/ribosome.png)

The (non-protein-coding) ribosomal RNA sequences are typically used for phylogenetic identification of species because they evolve slowly and are widely conserved. In particular, we use the 16S rRNA in prokaryotes and the 18S rRNA in eukaryotes, both of which belong to the small ribosomal subunit.

Inside the `files/` folder you will find the file [18S.fna](files/18S.fna), which contains 18S rRNA sequences for the same organisms we analysed before. Let's once again build a phylogenetic tree.

> Keep in mind that this time we are working with an RNA sequence.

#### Option 1: 

- Go to [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/)
- **Change the input format to RNA**
- Upload the fasta file and submit
- Go to *Phylogenetic Tree* and *Download Phylogenetic Tree Data*


#### Option 2:

- Run the command below (note the `-nt` flag, which tells FastTree that the alignment contains nucleotide sequences):

In [None]:
!clustalo -i files/18S.fna | fasttree -quiet -nt > files/18S_tree.nw 
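If you are unsure whether a file holds nucleotide or protein sequences, a quick alphabet check can help you decide whether `-nt` is needed. The function below is a rough heuristic and not part of either tool: `looks_like_nucleotide` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the two test strings are made-up fragments. Both `T` and `U` are accepted, since rRNA sequences are often stored in DNA notation.

```python
def looks_like_nucleotide(seq, threshold=0.9):
    """Heuristic: True if at least `threshold` of the letters are A, C, G, T, U or N."""
    letters = [c for c in seq.upper() if c.isalpha()]  # ignore gaps and digits
    if not letters:
        return False
    hits = sum(c in "ACGTUN" for c in letters)
    return hits / len(letters) >= threshold


# Made-up fragments: one rRNA-like, one protein-like.
rna_like = looks_like_nucleotide("UACCUGGUUGAUCCUGCCAG")
protein_like = looks_like_nucleotide("MKTAYIAKQRQISFVKSHFSRQ")
print(rna_like, protein_like)
```

A short protein made up mostly of alanine, cysteine, glycine and threonine would fool this check, so treat the result as a hint only.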

Load and display the 18S-based phylogenetic tree using **ETE Toolkit**:

In [None]:
# type your code here...

**Discussion points**:

- Does this tree look like what you expected?
- Can you think of a better way to build a phylogenetic *species* tree? 