# Alignment and Phylogenetic trees

## Introduction

Online bioinformatics tools and databases are widely available and very useful for understanding genomic data. Typical uses include finding similarities between DNA, RNA, or protein sequences, understanding evolutionary events, and comparing proteins between different species.

Many bioinformatics databases and tools are available to the community. [Here](https://docs.google.com/document/d/1oYkS-5gRVQzKOinPwTguxWvcqLHc4quv6H9PuOEJp4k/edit?usp=sharing) is a list of some of the most commonly used databases and services, including the ones needed for this exercise.



First, we will investigate similarity between [innexin family proteins](https://pfam.xfam.org/family/Innexin), and visualize it with [phylogenetic trees](https://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956/).

## Challenge

21 **Hirudo verbana (HVE)** innexins were painstakingly extracted from an early draft genome. 
Two pairs have strong similarity (9A and 9B; 11A and 11B). 


Another leech genome, **Helobdella robusta (HRO)**, also has 21 annotated innexins.  For this exercise, they each have a letter from *A* to *U*.

Our goal is to associate each Helobdella innexin with a Hirudo innexin number. To do this, we will:

1. Build a multiple sequence alignment and a phylogenetic tree for the Hirudo innexins. This will give us a baseline understanding of the relationships between the Hirudo innexins: which ones are closest to which others?
2. Build a multiple sequence alignment and a tree for all 42 innexins. We will use this tree, if possible, to assign Hirudo numbers to the Helobdella innexins.

## Exercise

## Multiple Sequence Alignment (MSA)

**Step 1**

First, let's use Clustal, an online multiple sequence alignment tool hosted by the EBI. You can find out more about the Clustal tools [here](http://www.clustal.org/clustal2/).

Click [here](https://www.ebi.ac.uk/Tools/msa/clustalo/) to go to Clustal and start doing a multiple sequence alignment.

**Step 2** 

We will first use the [Hirudo innexin protein sequences](https://drive.google.com/file/d/1qiHLVyFIN3_yaSXKM-FuNiI-Waeg1fh2/view?usp=sharing) and try to align them to each other. Open the file with a text editor on your computer, like TextEdit.app on Mac. Cut and paste the sequences from this file into Clustal, keep the default parameters, and submit the alignment.
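
Before pasting, it can help to confirm that the file really holds the expected number of sequences (21 for Hirudo). The snippet below is a minimal sketch, not part of the original exercise: `read_fasta` is a hypothetical helper, and the two records are made-up stand-ins for the contents of the real file.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records for illustration; paste the real file contents instead.
example = """>HVE_inx_1
MLKV
AAGT
>HVE_inx_2
MLRV
"""

records = read_fasta(example)
print(len(records), "sequences")
for name, seq in records.items():
    print(name, len(seq))
```

If the count is not what you expect, a header line is probably missing its leading `>` or part of the file was not copied.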

Once Clustal has finished the multiple sequence alignment (it might take a few seconds), you will see something like this:

<img src="https://drive.google.com/uc?id=1cAq0RdrL-UmWZ6v6euBhxF_3tsmjGUm2">

**Question :** What part of the protein is the most conserved?


*Hint:* Clustal marks a column with " * " if all amino acids are identical, " : " if they are strongly similar, and " . " if they are weakly similar.
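
For a long alignment, scanning the marker line by eye is tedious. The following sketch recomputes only the identity marks from a set of aligned sequences and reports the longest fully conserved stretch. It is a simplification under stated assumptions: it ignores Clustal's similarity groups, the function names are hypothetical, and the three aligned sequences are made up.

```python
def identical_columns(aligned):
    """Return a string with '*' where every sequence has the same residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        same = first != "-" and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)


def longest_run(marks, symbol="*"):
    """Return (start, length) of the longest run of `symbol` (0-based start)."""
    best_start, best_len = 0, 0
    start = None
    # A trailing None acts as a sentinel that closes a run at the end.
    for i, ch in enumerate(list(marks) + [None]):
        if ch == symbol:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


# Made-up aligned sequences ('-' is a gap).
aligned = [
    "MKT-YPWCLLG",
    "MRTAYPWCILG",
    "MKTAYPWCLVG",
]
marks = identical_columns(aligned)
print(repr(marks))
print(longest_run(marks))
```

Mapping the reported column range back to the protein (for example, to a transmembrane segment or a loop) is still up to you.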


*Your answer here*

**Step 3** 

Now let's take a look at the tree that Clustal built. In the results tabs, click on the phylogenetic tree for these innexins.

**Question :**  Which innexins are closest to each other?  Is there an extreme outlier?  How do outlier sequences look back in the MSA?

*Your answer here*

**Step 4**

Repeat Steps 2 and 3 with the [Hirudo innexin mRNA sequences](https://drive.google.com/file/d/1_fZq7I3PimJyyEhmsKtrWIIzXQhQOwN4/view?usp=sharing).

**Question :** Does the mRNA tree have the same topology as the protein tree?  Do any closest pairs change?  What happens to innexin 15?

*Your answer here*

**Step 5**

Now, let's do a multiple sequence alignment (MSA) combining both innexin protein sequence files. Cut and paste all [42 Hirudo + Helobdella Innexins](https://drive.google.com/file/d/1Z9tQGP57PQWf93VbIBtdTXQOdeg1vAHi/view?usp=sharing) into Clustal.

For the alignment, use default parameters, view the multiple sequence alignment (MSA) and inspect the tree.

**Question :** The Helobdella innexins have letters (A through U).

1. Which Helobdella letters go with which Hirudo numbers? Build a correspondence table.
2. Are any Hirudo innexins left out?
3. Are any Helobdella innexins left out?
4. What do you recommend for HRO_inx_S?
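
As a rough cross-check on a table read from the tree, one can compute pairwise percent identity directly from the aligned sequences and list the best Hirudo match for each Helobdella sequence. This is a cruder criterion than the tree and may disagree with it. The sketch assumes sequence names start with `HVE` or `HRO`; the helper names and the toy alignment are made up.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def best_matches(aligned, query_prefix="HRO", target_prefix="HVE"):
    """Map each query sequence to its most similar target sequence."""
    table = {}
    for q_name, q_seq in aligned.items():
        if not q_name.startswith(query_prefix):
            continue
        scores = {
            t_name: percent_identity(q_seq, t_seq)
            for t_name, t_seq in aligned.items()
            if t_name.startswith(target_prefix)
        }
        best = max(scores, key=scores.get)
        table[q_name] = (best, round(scores[best], 1))
    return table


# Made-up aligned sequences; real input would come from the combined MSA.
aligned = {
    "HVE_inx_1": "MKTAYPWC",
    "HVE_inx_2": "MRSAYGWC",
    "HRO_inx_A": "MKTAYPWA",
    "HRO_inx_B": "MRS-YGWC",
}
for query, (target, identity) in best_matches(aligned).items():
    print(f"{query} -> {target} ({identity}%)")
```

A sequence whose best match has low identity, or whose best match is already claimed by another sequence, deserves a closer look in the tree.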

*Your answer here*

**Step 6**

For the next step, we renumbered the combined innexins. Cut and paste our [renumbered innexins](https://drive.google.com/file/d/15xSsOXs-zUPsyzl4P-wuqXSxhuU-iTX5/view?usp=sharing) into Clustal.

**Question :** Do the numbers correspond?  Which innexins have diverged the most?

*Your answer here*

## More on Phylogenetic Trees

Clustal shows trees in a traditional "cladogram" layout.

[T-REX](http://www.trex.uqam.ca) from the Université du Québec à Montréal (UQAM) has an option for a **radial layout**.  
Let's use the dendrogram files (.dnd) to display trees in T-REX.

**Step 7**

Open [T-REX](http://www.trex.uqam.ca) and select **Tree Viewer** under Main Menu.

**Step 8**

Cut and paste every character from the [combined innexins dendrogram](https://drive.google.com/file/d/1X9Pqc81IpHcLeqy5wAQxsOsQCroP4s-i/view?usp=sharing) into the Newick viewer input window.

The screen should look something like this:

<img src="https://drive.google.com/uc?id=1phQMuOS7AP30tnTDNmNUmYS7ziieHSqx">

After pasting the characters, click on View Tree.
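
The dendrogram file is plain Newick text, so leaf labels and their terminal branch lengths can also be listed without a viewer. The sketch below uses a simple regular expression rather than a full Newick parser: it assumes unquoted labels, and the tree string, labels, and branch lengths are made up for illustration.

```python
import re


def leaf_branch_lengths(newick):
    """Map each leaf label to its terminal branch length.

    Leaves are labels that directly follow '(' or ','. Labels after ')'
    belong to internal nodes and are skipped. Quoted labels are not handled.
    """
    pattern = r"[(,]\s*([^\s():,;]+)\s*:\s*([-+0-9.eE]+)"
    return {name: float(length) for name, length in re.findall(pattern, newick)}


# Toy tree with made-up labels and branch lengths.
tree = """(
(HVE_inx_1:0.10,HRO_inx_A:0.12):0.05,
(HVE_inx_2:0.20,HRO_inx_S:0.90):0.02);"""

lengths = leaf_branch_lengths(tree)
# Print leaves from longest to shortest terminal branch.
for name, length in sorted(lengths.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{length:.2f}")
```

Sorting by terminal branch length is one quick way to spot leaves that sit far from everything else before looking for them in the drawing.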

The default tree view is "horizontal". Take a look.

**Question :** Where is HRO_inx_S ?

*Your answer here*

**Step 9**

Select "Radial View" for the tree.

**Question :** What do you notice with the radial view that was obscured in the horizontal view?  Which innexins are closest to each other?

*Your answer here*

## Alignment with Octopus genome

For this part of the challenge, let's add the **Octopus vulgaris** innexins and see how they fit in.

We'll label them with OVU once we find them.


**Step 10**

Open [UniProtKB](https://www.uniprot.org) and enter the query below:

> organism:"Octopus vulgaris (Common octopus) [6645]"



**Question :** How many Octopus proteins are there?

*Your answer here*

**Step 11**

Now, let's refine the query.

We will now restrict the results to proteins with "innexin" in the protein name and reject proteins with "fusion" in the name.

Enter query:

> name:innexin NOT fusion AND organism:"Octopus vulgaris (Common octopus) [6645]"

Then display 100 entries by clicking "show" (to the right).
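
The same search can be expressed as a URL for scripted use. The sketch below only builds the URL and does not contact any server. It assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/search` and the field names `protein_name` and `organism_id`, which differ from the website query syntax shown above; check both against the current UniProt documentation. `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint for the UniProt REST API; verify before relying on it.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(name, exclude, taxon_id, fmt="fasta", size=100):
    """Build a search URL for proteins named `name`, excluding `exclude`."""
    # Field names protein_name and organism_id are assumptions (see lead-in).
    query = f"protein_name:{name} NOT {exclude} AND organism_id:{taxon_id}"
    params = {"query": query, "format": fmt, "size": size}
    return BASE_URL + "?" + urlencode(params)


# 6645 is the Octopus vulgaris taxon number used in the query above.
url = build_search_url("innexin", "fusion", 6645)
print(url)

# Decode the URL again to confirm the query survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Opening the printed URL in a browser should return FASTA text if the assumed endpoint and field names are right.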

**Question :** How many Octopus proteins have been labeled as "Innexin"?

*Your answer here*

**Step 12**

Download the Octopus innexins.
Click on "download", select the *FASTA (canonical)* format, and choose uncompressed.

**Question :** What is the name of the file that lands on your computer?

*Your answer here*

**Step 13**

Let's inspect the downloaded file.

Then, cut and paste the content into [Clustal](https://www.ebi.ac.uk/Tools/msa/clustalo/). Launch a multiple sequence alignment and click on the Phylogenetic Tree.

**Question :** How does the Octopus innexin MSA compare to the leech MSAs?

*Your answer here*

**Question :** Are any of the Octopus innexins nearly identical to each other (tiny leaf distance)?

*Hint:* Look at the distances by clicking on "real"

**Step 14**

Scan from top to bottom of the tree and count each "leaf" that is distinct, i.e., count nearly identical sequences as just one.

Stop counting where a large group of Octopus sequences seems to be a flat series of leaves all by themselves. 
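
The counting rule above can be approximated in code by greedily grouping aligned sequences whose identity exceeds a threshold and counting one representative per group. This is a sketch under stated assumptions, not the procedure used to prepare the exercise: the 95% threshold is arbitrary, the function names are hypothetical, and the aligned sequences are made up.

```python
def identity(a, b):
    """Fraction of identical residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def distinct_representatives(aligned, threshold=0.95):
    """Greedily keep one representative per group of near-identical sequences."""
    representatives = []
    for name, seq in aligned.items():
        close = any(identity(seq, aligned[rep]) >= threshold for rep in representatives)
        if not close:
            representatives.append(name)
    return representatives


# Made-up aligned sequences for illustration.
aligned = {
    "OVU_1": "MKTAYPWCLG",
    "OVU_2": "MKTAYPWCLG",  # identical to OVU_1
    "OVU_3": "MKTAYPWCLA",  # 90% identical to OVU_1
    "OVU_4": "MRSGYQWNLA",
}
reps = distinct_representatives(aligned, threshold=0.95)
print(len(reps), reps)
```

The result depends on the threshold and on the order in which sequences are visited, so treat it as a sanity check on the count you get from the tree.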


**Question :** How many distinct Octopus innexins seem to be there?

*Your answer here*

**Step 15**

We used the initial Octopus innexin tree to label highly similar sequences with the same letter.

Cut and paste the relabeled [Octopus innexin](https://drive.google.com/file/d/1gCIkvN0vASR1RpmRJj8N0QACnbZw-8iu/view?usp=sharing) sequences into Clustal.

**Question :** Are sequences that are close in the tree labeled with the same letter?

**Step 16**

Let's add in the rest of the annotated Octopus innexins.
As before, run a query in UniProt:

> name:innexin NOT fusion taxonomy:"Cephalopoda (cephalopods) [6605]" NOT organism:vulgaris

Download the resulting Octopus bimaculoides (and one squid) sequences as *FASTA (canonical)*, not compressed.

**Question :** How many sequences? What is the name of the file that landed on your computer?

*Your answer here*

**Step 17**

As a final step, use a text editor to open the previous file and delete the squid sequence.

Then, cut and paste the **Octopus bimaculoides** sequences into Clustal. Add the filtered **Octopus vulgaris** sequences.
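
Removing the squid record by hand works for one sequence. The sketch below does the same thing programmatically by keeping only records whose organism belongs to the genus Octopus. It assumes UniProt-style FASTA headers in which the organism name sits between `OS=` and `OX=`; the function names are hypothetical, and the accessions, the squid organism name, and the sequences are made up.

```python
import re


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def organism(header):
    """Extract the organism name from an 'OS=... OX=...' style header."""
    match = re.search(r"OS=(.+?)\s+OX=", header)
    return match.group(1) if match else ""


def keep_genus(text, genus="Octopus"):
    """Return FASTA text with only the records whose organism is in `genus`."""
    kept = [
        (header, seq)
        for header, seq in parse_fasta(text)
        if organism(header).split(" ")[0] == genus
    ]
    return "".join(f">{header}\n{seq}\n" for header, seq in kept)


# Hypothetical records: accessions, organisms, and sequences are made up.
example = """>tr|X00001|X00001_OCTBM Innexin OS=Octopus bimaculoides OX=0 PE=3 SV=1
MKTAYPWC
>tr|X00002|X00002_SQUID Innexin OS=Examplesquid demo OX=0 PE=3 SV=1
MRSGYQWN
"""
filtered = keep_genus(example)
print(filtered)
```

Each kept sequence is written on a single line, which Clustal accepts as input.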

Take a look at the MSA and the tree from the results.

**Question :** Do the two sets of Octopus sequences pair up?

*Your answer here*

## Final Observation

The innexins from the two leech genomes pair up (more or less), as do the innexins from the two octopus genomes. There was a radiation within the clades that is not preserved from one type of organism to the next. 

The best way to find the innexin family within a new genome is to find the first one. Take a look at the following paper and this quote:

> "These [innexin] proteins share structural features that are characterized by four transmembrane domains, two extracellular loops, and intracellular N- and C-terminal domains."



**Question :** If you had no innexin sequences for a newly sequenced genome and no nearby organisms to use for comparison, how could you go about finding a first innexin protein, given the observation above?

*Your answer here*