# Phylogeny of 13 ray-finned fish based on gene order data

This notebook documents the code and procedure we used to establish the **gene-order-based phylogeny of ray-finned fish** presented in **[the bowfin genome paper](https://www.researchsquare.com/article/rs-92055/v1)**. This phylogeny supports the **Holostei hypothesis** of ray-finned fish evolution, which places **the bowfin and spotted gar as sister groups**. The notebook is a mixed-language Jupyter notebook containing both **Python and R** code.

# Table of contents

- [Libraries and packages](#libraries-and-packages)
    - [External dependencies](#external-dependencies)
    - [Custom python modules](#custom-python-modules)
- [Input data description](#input-data-description)
- [Neighbor-joining gene order phylogeny](#gene-order-phylogeny)
    - [Marker genes selection](#marker-genes-selection)
    - [Adjacencies extraction](#adjacencies-extraction)
    - [Distance matrix computation](#distance-matrix-computation)
    - [Neighbor-joining tree reconstruction](#neighbor-joining-tree-reconstruction)
    - [Bootstrap support](#bootstrap-support)
- [Maximum parsimony gene order phylogeny](#gene-order-phylogeny-pars)


## Libraries and packages <a name="libraries-and-packages"></a>

All external dependencies needed to run the notebook are listed in the `binder/environment.yml` file, which includes both Python packages and R libraries. Custom Python modules are stored in the `modules/` folder.

### External dependencies <a name="external-dependencies"></a>

In [None]:
#Standard imports
import os
import itertools
import glob
import bz2
from collections import defaultdict
import random
random.seed(1234)

In [None]:
#Gene tree manipulation
from Bio import Phylo
from Bio.Phylo.Consensus import majority_consensus,get_support
from ete3 import Tree, NodeStyle, TreeStyle,TextFace, AttrFace,faces

In [1]:
#Data loading and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_style("white")


In [None]:
#Use R in python
%load_ext rpy2.ipython

### Custom python modules <a name="custom-python-modules"></a>

In [None]:
#Load my python code
%run modules/genomes.py
%run modules/adjacencies.py
%run modules/matrix.py
%run modules/plot_trees.py

## Input data description  <a name="input-data-description"></a>

path to data + description how they were obtained (Genomicus V2)

## Neighbor-joining gene order phylogeny <a name="gene-order-phylogeny"></a>

Explain

In [None]:
CORRECT_FOR_FRAC_BIAS = False #set to True to correct for post-WGD fractionation bias

### Marker genes selection <a name="marker-genes-selection"></a>

Description

In [None]:
#global variable with all species names
ALL_SPECIES = ['Gallus gallus','Xenopus tropicalis','Lepisosteus oculatus', 'Amia calva', 'Paramormyrops kingsleyae','Scleropages formosus', 'Astyanax mexicanus','Danio rerio','Gasterosteus aculeatus', 'Tetraodon nigroviridis', 'Takifugu rubripes', 'Oreochromis niloticus', 'Oryzias latipes', 'Poecilia formosa', 'Xiphophorus maculatus']

#global variable with name of non-duplicated species (non-teleost)
NON_DUP = ['Lepisosteus oculatus', 'Amia calva', 'Gallus gallus', 'Xenopus tropicalis']

#Build a dict mapping each study species to the set of its genes
#random_chr collects the genes returned as lying on "random" contigs
all_sp_genes, random_chr = {}, []
for sp in ALL_SPECIES:
    sp = sp.replace(' ','.')
    all_sp_genes[sp], genes_on_random_contig = extract_all_genes(sp)
    all_sp_genes[sp] = set(all_sp_genes[sp])
    random_chr += genes_on_random_contig

In [None]:
#Filter gene families to retain (1-to-1 or strict 1-to-2 with duplicated species)
filter_families(all_sp_genes, NON_DUP, 'ancGenes.Euteleostomi.filtered')
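
`filter_families` is defined in the custom modules and is not shown in this notebook. As a minimal sketch of the rule stated in the comment above, the function below keeps a family only if it has exactly one gene in every non-duplicated species and one or two genes in every other species. The name `keep_family`, its arguments and this exact reading of "strict 1-to-2" are assumptions made for illustration; the real criterion may be stricter.

```python
from collections import Counter

def keep_family(family_genes, gene_to_species, non_dup, all_species):
    """Hypothetical filter: 1 copy in non-duplicated species, 1 or 2 copies elsewhere."""
    # Count how many genes of the family belong to each species
    counts = Counter(gene_to_species[g] for g in family_genes if g in gene_to_species)
    for sp in all_species:
        n = counts.get(sp, 0)
        if sp in non_dup:
            if n != 1:  # non-duplicated species must have exactly one copy
                return False
        elif n not in (1, 2):  # duplicated species may retain one or both copies
            return False
    return True

# Toy example: species "A" is non-duplicated, species "B" is duplicated
gene_to_species = {"a1": "A", "b1": "B", "b2": "B"}
print(keep_family(["a1", "b1", "b2"], gene_to_species, {"A"}, ["A", "B"]))
```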

In [None]:
#Write reduced genomes, i.e. containing only retained families, in fasta format
name_families, unRAND = read('ancGenes.Euteleostomi.filtered_freeze') 
write_genomes(name_families,'MyGenomes_freeze.fa')

### Adjacencies extraction <a name="adjacencies-extraction"></a>

In [None]:
d_seq = load_genomes("MyGenomes_freeze.fa")
adj_list, adj_list_rev = save_all_adj(d_seq)
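
`save_all_adj` comes from `modules/adjacencies.py` and its internals are not shown here. To illustrate what a gene adjacency is, the sketch below collects every pair of marker families that are immediate neighbours on the same chromosome. The function name and the input layout (a dict of chromosome to ordered marker list) are assumptions, and gene orientation is ignored for simplicity, whereas the notebook's code returns both a forward and a reverse list.

```python
def extract_adjacencies(chromosomes):
    """Return the set of unordered neighbouring marker pairs in one genome."""
    adjacencies = set()
    for markers in chromosomes.values():
        # Pair each marker with the next one along the chromosome
        for left, right in zip(markers, markers[1:]):
            adjacencies.add(frozenset((left, right)))
    return adjacencies

# Toy genome with two chromosomes
genome = {"chr1": ["fam1", "fam2", "fam3"], "chr2": ["fam4", "fam5"]}
for adj in sorted(sorted(pair) for pair in extract_adjacencies(genome)):
    print(adj)
```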

In [None]:
#Filter out Tetraodon adjacencies on chromosome unRANDOM that are found in no other species
#unRANDOM consists of scaffolds assembled into a single contig
all_adj,to_ign = make_matrix(adj_list, adj_list_rev,unRAND)
adj_list, adj_list_rev = save_all_adj(d_seq, to_ignore=to_ign)

### Distance matrix computation <a name="distance-matrix-computation"></a>

In [None]:
make_distance_matrix(adj_list, adj_list_rev, SP_DICT, name=NAME)
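
The distance used by `make_distance_matrix` is implemented in `modules/matrix.py` and is not shown in this notebook. As a rough illustration of how pairwise distances can be derived from adjacency sets, the sketch below uses one minus the fraction of shared adjacencies (a Jaccard distance). This metric and both function names are assumptions chosen for illustration, not necessarily what the notebook computes.

```python
import itertools

def adjacency_distance(adj_a, adj_b):
    """Jaccard distance between two sets of adjacencies (assumed metric)."""
    union = adj_a | adj_b
    if not union:
        return 0.0
    return 1.0 - len(adj_a & adj_b) / len(union)

def distance_matrix(adj_by_species):
    """Symmetric nested dict of pairwise distances between species."""
    species = sorted(adj_by_species)
    dist = {sp: {sp: 0.0} for sp in species}
    for a, b in itertools.combinations(species, 2):
        d = adjacency_distance(adj_by_species[a], adj_by_species[b])
        dist[a][b] = dist[b][a] = d
    return dist

# Toy example: adjacencies are represented by integer identifiers
toy = {"sp1": {1, 2, 3}, "sp2": {2, 3, 4}, "sp3": {5}}
print(distance_matrix(toy)["sp1"]["sp2"])
```

Species that share more adjacencies get a smaller distance, which is the property a distance-based tree method relies on.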

### Neighbor-joining tree reconstruction <a name="neighbor-joining-tree-reconstruction"></a>

In [None]:
%%R 
library('ape') 

nj_tree <- function(name){
    dist_mat <- read.table(paste('dist_mat_', name,sep=''), header = TRUE, sep = "", skip = 0)
    rownames(dist_mat) <- colnames(dist_mat)
    dist_mat <- as.matrix(dist_mat)

    a.nj <- bionj(dist_mat) # tree construction with BIONJ, a neighbor-joining variant
    write.tree(a.nj, file=paste('bionj_', name, '.nwk', sep=''))
    plot(a.nj, "phylo") # plot the resulting tree (unrooted)
    nodelabels()
}

### Bootstrap support  <a name="bootstrap-support"></a>

In [None]:
all_adj, t = make_matrix(adj_list, adj_list_rev, unRAND)
bootstrap_matrix(all_adj, name=NAME)
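
`bootstrap_matrix` is part of the custom modules and is not shown here. The usual bootstrap idea is to resample the adjacencies (the columns of the presence/absence matrix) with replacement and to recompute a distance matrix from each resampled set. The sketch below shows only the resampling step; the function name, its arguments and the seed are illustrative assumptions, while the default of 100 replicates matches the `0:99` loop of the R cell that follows.

```python
import random

def bootstrap_columns(adjacency_ids, n_replicates=100, seed=1234):
    """Resample adjacency identifiers with replacement, once per replicate."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    n = len(adjacency_ids)
    return [rng.choices(adjacency_ids, k=n) for _ in range(n_replicates)]

# Each replicate has as many adjacencies as the original data;
# some are drawn several times, others not at all
replicates = bootstrap_columns(["adj1", "adj2", "adj3", "adj4"], n_replicates=3)
for rep in replicates:
    print(rep)
```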

In [None]:
%%R 
library('ape') 

nj_tree <- function(name){
    # loop over the 100 bootstrap replicates of the adjacencies
    for (i in 0:99){
      infile = paste('bootstrap_', name, '/dist_mat_', name,as.character(i), '.txt', sep='')
      outfile = paste('bootstrap_', name, '/bionj_', name, as.character(i), '.nwk', sep='')
      dist_mat <- read.table(infile, header = TRUE, sep = "", skip = 0)
      rownames(dist_mat) <- colnames(dist_mat)
      dist_mat <- as.matrix(dist_mat)
        
      # neighbour joining tree construction for each bootstrap replicate
      a.nj <- bionj(dist_mat)
      write.tree(a.nj, file=outfile)
    }
}
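
Once one tree per replicate has been written, the bootstrap support of a branch is the percentage of replicate trees in which the same grouping of species appears. The notebook imports `majority_consensus` and `get_support` from `Bio.Phylo.Consensus`, which presumably serve this purpose. The sketch below is a dependency-free illustration of the counting principle only: trees are written as nested tuples (a hypothetical representation) and are treated as rooted, which is a simplification because BIONJ trees are unrooted.

```python
def clades(tree):
    """Return the leaf set of every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is a species name
            return frozenset([node])
        leafset = frozenset().union(*(leaves(child) for child in node))
        found.add(leafset)
        return leafset

    leaves(tree)
    return found

def clade_support(reference, replicates):
    """Percentage of replicate trees containing each clade of the reference tree."""
    replicate_clades = [clades(t) for t in replicates]
    return {
        clade: 100.0 * sum(clade in rc for rc in replicate_clades) / len(replicates)
        for clade in clades(reference)
    }

# Toy example with three taxa and two replicates
reference = (("A", "B"), "C")
replicates = [(("A", "B"), "C"), (("A", "C"), "B")]
support = clade_support(reference, replicates)
print(support[frozenset(("A", "B"))])
```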

## Maximum parsimony gene order phylogeny <a name="gene-order-phylogeny-pars"></a>