# Research Project - Impact of functional divergence on divergence time estimation

## Dataset

- **Betts et al. (2018) deep phylogeny dataset - 102 taxa, 29 genes**

- **35_noEF2 dataset**

- **49gene_concat_4_removed dataset**

## 1. Effect of Functional Divergence on Tree Topology

### a. Comparing Tree Topology before and after filtering most functionally divergent sites.

#### - Introduction

GroupSim (Capra, 2008) was used to score functionally divergent (FD) sites based on amino acid usage. A pipeline was built to create the input files for GroupSim and to use its output (FD scores) to separate the MSA into the most and least functionally divergent sites according to a user-specified percentage. The pipeline can be found in `/GroupSim/pipeline`.

The grouping is **eukaryotes vs archaea + bacteria**.
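The splitting step can be sketched as below. This is a minimal illustration, not the code in `/GroupSim/pipeline`: the function names are hypothetical, the scores are assumed to be already parsed into one float per alignment column, and a higher score is assumed to mean a more divergent site.

```python
def split_by_fd_score(scores, percent):
    """Return (most_divergent, least_divergent) column indices.

    scores  : one FD score per alignment column (assumed: higher = more divergent)
    percent : share of columns to place in each subset, e.g. 50
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    n_keep = int(len(scores) * percent / 100)
    # Column indices ordered from highest to lowest score; ties keep alignment order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    most = sorted(order[:n_keep])
    least = sorted(order[len(order) - n_keep:])
    return most, least


def take_columns(msa, columns):
    """Build a sub-alignment from the chosen columns of {name: sequence}."""
    return {name: "".join(seq[i] for i in columns) for name, seq in msa.items()}


# Toy example with made-up scores and sequences.
scores = [0.1, 0.9, 0.5, 0.3]
msa = {"taxon_a": "ACDE", "taxon_b": "ACDF"}
most, least = split_by_fd_score(scores, 50)
most_msa = take_columns(msa, most)    # columns 1 and 2
least_msa = take_columns(msa, least)  # columns 0 and 3
```

Columns are returned in alignment order so that the sub-alignments keep the original site order.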

### Unfiltered Tree (Betts)

![betts_unfiltered](GroupSim/Betts/iqtree/102_taxa_unfiltered.png)

### Filtered Tree - 50% Least Divergent Sites (Betts)

![betts_filtered_50](GroupSim/Betts/iqtree/alignment_102_taxa_l_filtered50.png)

### Filtered Tree - 50% Most Divergent Sites (Betts)

![betts_filtered_50](GroupSim/Betts/iqtree/alignment_102_taxa_m_filtered50.png)

> - Green - Bacteria
> - Red - Archaea
> - Blue - Eukaryotes

### Filtered Tree - 50% Least Divergent Sites with LG+C60 (Betts)

![betts_filtered_50_C60](GroupSim/Betts/iqtree/filtered_50_LG_C60_G.png)

The least divergent sites recover a three-domain (3D) tree in which archaea are a sister clade to eukaryotes; the tree also suggests that eukaryotes branch earlier than archaea. Running IQ-TREE with the C60 profile mixture model gives a similar overall topology.

The most divergent sites recover a topology similar to that of the unfiltered tree, with bacteria and archaea as the two primary domains and eukaryotes branching off the latter. This is interesting because different subsets of the data suggest different relationships, which raises the question of which portion of the data is more representative of the underlying evolutionary process.

Similar results are observed with other datasets.

### Unfiltered Tree - (35_noEF2)

![35_noef2_01](GroupSim/35_noEF2/iqtree/35_eu/35_noEF2_unfiltered.png)

### Filtered Tree - 50% Least Divergent Sites with LG+C60 (35_noEF2)

![35_noef2_02](GroupSim/35_noEF2/iqtree/35_eu/35_l_filtered50_C60.png)

### Filtered Tree - 50% Most Divergent Sites with LG+C60 (35_noEF2)

![35_noef2_03](GroupSim/35_noEF2/iqtree/35_eu/35_m_filtered50_C60.png)

The same pattern is observed with the 49_gene dataset (in the 49_gene_4removed folder).

### b. Exploring the effect of different groupings on functional divergence calculation and tree building

### - **eukaryotes + archaea vs bacteria**

### Filtered Tree - 50% Least Divergent Sites with LG+C60 (35_noEF2)

![35__euarch](GroupSim/35_noEF2/iqtree/35_euarch/35_l_euarch.png)

### Filtered Tree - 50% Most Divergent Sites with LG+C60 (35_noEF2)

![35__euarch](GroupSim/35_noEF2/iqtree/35_euarch/35_m_euarch.png)

### - **eukaryotes + TACK archaea vs other archaea + bacteria**

### Filtered Tree - 50% Least Divergent Sites with LG+C60 (35_noEF2)

![35__euarch](GroupSim/35_noEF2/iqtree/35_tack/35_l_tack.png)

### Filtered Tree - 50% Most Divergent Sites with LG+C60 (35_noEF2)

![35__euarch](GroupSim/35_noEF2/iqtree/35_tack/35_m_tack.png)

The grouping appears to have a strong effect on the topology. When eukaryotes and archaea are grouped together, the two trees resolve a similar relationship, with archaea as a sister clade to bacteria and eukaryotes branching off the former. However, the length of the branch separating the two clades differs substantially between the two trees.

When eukaryotes are grouped with TACK archaea, the results are similar to those from the eukaryotes vs archaea + bacteria grouping. The least divergent subset again resolves eukaryotes and bacteria as sister clades.

### c. Goldilocks zone

We also filtered out the top and bottom 25% of sites by FD score, keeping the middle 50% as the less extreme portion of the MSA.
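A minimal sketch of this middle-band filter is shown below. The function name `goldilocks_columns` and the `trim_percent` parameter are illustrative assumptions, and the scores are assumed to be one float per alignment column.

```python
def goldilocks_columns(scores, trim_percent=25):
    """Return indices of columns left after trimming both score extremes.

    trim_percent is removed from the bottom AND from the top of the
    score ranking, so 25 keeps the middle 50% of columns.
    """
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be in [0, 50)")
    n = len(scores)
    n_trim = int(n * trim_percent / 100)
    # Column indices ordered from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    return sorted(order[n_trim:n - n_trim])


# Toy example with made-up scores for eight columns.
scores = [0.9, 0.1, 0.5, 0.4, 0.2, 0.8, 0.3, 0.7]
middle = goldilocks_columns(scores)  # keeps columns 2, 3, 6 and 7
```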

### Filtered Tree - 50% Goldilocks zone with LG+C60 (35_noEF2)

![35__euarch](GroupSim/35_noEF2/iqtree/35_goldilocks/35_middle.png)

Although the tree is broadly two-domain (2D), it seems to contradict other lines of evidence suggesting that Lokiarchaea are the closest known relatives of eukaryotes. In this tree, eukaryotes also form an early-branching clade along with two other clades of archaea.

## 2. Relationship between site rate and functional divergence score

Here we investigate whether the functional divergence score is related to site rate (IQ-TREE `--rate`), and whether more divergent sites show a distinctive pattern of evolutionary rate.

![35__euarch](GroupSim/35_noEF2/iqtree/rate/rate_vs_fd_score.png)

There does not seem to be a relationship between site rate and FD score.
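One way to put a number on the scatter plot above is a rank correlation. The sketch below computes Spearman's rho with the standard library only. It is an illustrative check rather than part of the original analysis, and it assumes the site rates and FD scores have already been loaded as two equal-length lists, one value per site. The input values are made up.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


site_rates = [0.2, 1.5, 0.7, 3.1, 0.9]
fd_scores = [0.40, 0.10, 0.35, 0.20, 0.05]
rho = spearman(site_rates, fd_scores)
```

A value of `rho` near zero would be consistent with the absence of a monotonic relationship, although it would not rule out other kinds of dependence.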

## 3. Heterotachy metrics

Partitioning the MSA into subsets by GHOST class and inferring a tree for each subset gives different time estimates (James's analysis).

### a. Variation over mean

For each site in the MSA, we compute a score: the sum of the Euclidean distances of all subtrees from the mean tree, weighted by the probability that the site belongs to each GHOST class.
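A minimal sketch of this score follows, under assumptions the text above does not state: every class tree is represented as a vector of branch lengths over the same topology and in the same branch order, the mean tree is the unweighted mean of those vectors, and the class probabilities are given as one row per site. All names and numbers are illustrative.

```python
from math import sqrt


def distances_from_mean(branch_lengths):
    """Euclidean distance of each class tree from the mean tree.

    branch_lengths: one list per GHOST class, all in the same branch order
    (assumed shared topology).
    """
    n_trees = len(branch_lengths)
    # Assumption: the mean tree is the unweighted branch-wise mean.
    mean = [sum(col) / n_trees for col in zip(*branch_lengths)]
    return [
        sqrt(sum((b - m) ** 2 for b, m in zip(tree, mean)))
        for tree in branch_lengths
    ]


def site_scores(class_probs, branch_lengths):
    """Per-site score: class-tree distances weighted by class probability."""
    dists = distances_from_mean(branch_lengths)
    return [sum(p * d for p, d in zip(probs, dists)) for probs in class_probs]


# Toy example: three classes, two branches, three sites.
trees = [[0.0, 0.0], [6.0, 8.0], [3.0, 4.0]]
probs = [
    [1.0, 0.0, 0.0],  # site fully in class 1
    [0.0, 0.0, 1.0],  # site fully in class 3, which equals the mean tree
    [0.5, 0.0, 0.5],  # site split between classes 1 and 3
]
scores = site_scores(probs, trees)  # [5.0, 0.0, 2.5]
```

Under this definition, a site scores highest when it is assigned mostly to classes whose trees lie far from the mean tree.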

![heterotachy](heterotachy/35_gene_hscore.png)

![heterotachy](heterotachy/35_gene_variation.png)

### b. Site Rate Shift

Another metric implemented is the site rate shift, which compares the rates of sites in two trees derived from two groups. Two trees are first inferred separately for two different groups of species on the same genes (same MSA), along with their respective site rates. The site rates on the two trees are then ranked, and the differences in rank are scored. We postulate that sites with relatively larger changes in site rate are more heterotachous, and that filtering them out would minimise the effect of heterotachy.
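The ranking and scoring steps can be sketched as follows. The exact scoring used in this project is not spelled out above, so this sketch assumes the score is the absolute difference in rank and that ties are broken by site order. The function names and the example rates are hypothetical.

```python
def ordinal_ranks(values):
    """0-based ranks from lowest to highest; ties broken by site order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def rate_shift_scores(rates_a, rates_b):
    """Absolute rank difference per site between the two groups' site rates."""
    if len(rates_a) != len(rates_b):
        raise ValueError("both rate lists must cover the same sites")
    ranks_a = ordinal_ranks(rates_a)
    ranks_b = ordinal_ranks(rates_b)
    return [abs(a - b) for a, b in zip(ranks_a, ranks_b)]


def keep_lowest(scores, keep_percent=75):
    """Indices of the keep_percent of sites with the lowest shift score."""
    n_keep = int(len(scores) * keep_percent / 100)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


# Toy example: made-up site rates for the same four sites in two groups.
rates_group_a = [0.1, 0.5, 1.0, 2.0]
rates_group_b = [2.0, 0.4, 1.1, 0.2]
shift = rate_shift_scores(rates_group_a, rates_group_b)  # [3, 0, 0, 3]
kept = keep_lowest(shift, keep_percent=50)               # [1, 2]
```

Working on ranks rather than raw rates makes the score insensitive to overall differences in rate scale between the two trees.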

### Filtered Tree - 75% sites with lowest rate shift score LG+C60 (35_noEF2)

![srs_35](heterotachy/35_rs_filter25_bootstrap.png)

### Filtered Tree - 75% sites with lowest rate shift score LG+C60 (Betts)

![](heterotachy/102_rs_filter25.png)

Next step: run the filtered MSA through GHOST and dating (waiting for James).

## 4. Diversity analysis (MFP dataset)

![](heterotachy/round3/round3_rs_filtered.png)

Filtering has little effect on the diversity of bacteria relative to archaea. 