Updated link to zenodo. Fixed typos.

ThibaultLatrille · Jan 17, 2023 · 0a38323 · 0a38323
1 parent 38b9e49
commit 0a38323
Show file tree

Hide file tree

Showing 7 changed files with 51 additions and 58 deletions.
diff --git a/OrthoMam/Snakefile b/OrthoMam/Snakefile
@@ -140,6 +140,6 @@ rule gene_table:
         bayescode=expand(rules.read_bayescode.output,CDS=CDS_list,model=MODELS,chain=config['CHAINS']),
         script=f"{ROOT}/scripts/table_gene_omega.py"
     output:
-        csv=f"{FOLDER}/GeneTable.csv.gz"
+        tsv=f"{FOLDER}/GeneTable.tsv"
     shell:
-        "python3 {input.script} --div_folder {EXP_FOLDER} --xml {XML_FOLDER} --species Homo_sapiens Chlorocebus_sabaeus Bos_taurus Equus_caballus Ovis_aries Capra_hircus Canis_familiaris --output {output.csv}"
+        "python3 {input.script} --div_folder {EXP_FOLDER} --xml {XML_FOLDER} --species Homo_sapiens Chlorocebus_sabaeus Bos_taurus Equus_caballus Ovis_aries Capra_hircus Canis_familiaris --output {output.tsv}"
diff --git a/README.md b/README.md
@@ -14,8 +14,8 @@ The pipeline consist of three main steps, each with its own folder and Snakemake
 - [II. Divergence](https://github.com/ThibaultLatrille/AdaptaPop#ii-divergence---run-bayescode-on-orthomam)
 - [III. Contrast polymorphism and divergence](https://github.com/ThibaultLatrille/AdaptaPop#iii-run-global-analysis-contrasting-polymorphism-and-divergence)
 
-Section I and II are independent from each other and can be run in parallel, while section III depends on the results of I and II.
-Alternatively, the results of each section can be downloaded at https://doi.org/10.5281/zenodo.7107234.
+Section I and II are independent of each other and can be run in parallel, while section III depends on the results of I and II.
+Alternatively, the results of each section can be downloaded at https://doi.org/10.5281/zenodo.7107233.
 
 If problems and/or questions are encountered, feel free to [open issues](https://github.com/ThibaultLatrille/AdaptaPop/issues).
 
@@ -30,10 +30,10 @@ cd AdaptaPop
 
 ### General dependencies
 
-Install python3 packages
+Install python3 (>=3.9) and python3 packages
 ```
 sudo apt install -qq -y python3-dev python3-pip
-pip3 install snakemake scipy numpy matplotlib pandas ete3 bio statsmodels --user
+pip3 install --user snakemake numpy==1.23 scipy matplotlib pandas ete3 bio statsmodels seaborn rpy2 
 ```
 
 ### I. Polymorphism - Download and filter .vcf files 
@@ -53,33 +53,26 @@ for FOLDER in ./*/ do
 done
 ```
 
-Alternatively, you can download the filtered .vcf files at https://doi.org/10.5281/zenodo.7107234 (`Polymorphism.zip`) and unzip them in the folder `Polymorphism`.
+Alternatively, you can download the filtered .vcf files at https://doi.org/10.5281/zenodo.7107233 (`Polymorphism.zip`) and unzip them in the folder `Polymorphism`.
 
 ### II. Divergence - Run BayesCode on OrthoMam
-This section is independent from the previous one ([I. Polymorphism](https://github.com/ThibaultLatrille/AdaptaPop#i-polymorphism---download-and-filter-vcf-files)).
+This section is independent of the previous one ([I. Polymorphism](https://github.com/ThibaultLatrille/AdaptaPop#i-polymorphism---download-and-filter-vcf-files)).
 
-To run on OrthoMam (mammalian orthologs), you must download the alignments and trees at https://doi.org/10.5281/zenodo.7107234 (`OrthoMam.zip`) and unzip them in the folder `OrthoMam`. The input data will be already filtered and ready to be used in the folder `OrthoMam/Datasets`.
+To run on OrthoMam (mammalian orthologs), you must download the alignments and trees at https://doi.org/10.5281/zenodo.7107233 (`OrthoMam.zip`) and unzip them in the folder `OrthoMam`. The input data will be already filtered and ready to be used in the folder `OrthoMam/Datasets`.
 
+Install *BayesCode* from https://github.com/ThibaultLatrille/bayescode (see instructions there).
 
 ```
 cd Orthomam
 ```
 
-Install the compiling toolchains:
-```
-sudo apt install -qq -y make cmake clang openmpi-bin openmpi-common libopenmpi-dev
-```
-Clone and compile the C++ code for *BayesCode*
-```
-git clone https://github.com/ThibaultLatrille/bayescode && cd bayescode && git checkout dev && make release && cd ..
-```
 Run using snakemake, this requires access to large computation facilities:
 ```
 snakemake -j 128
 ```
 Or use the script `snakeslurm.sh` if run on a cluster (slurm) to submit the jobs.
 
-Alternatively, you can download the results of BayesCode on OrthoMam at https://doi.org/10.5281/zenodo.7107234 (`OrthoMam.zip`) and unzip them in the folder `OrthoMam`. The results will be located in the folder `OrthoMam/Experiments`.
+Alternatively, you can download the results of BayesCode on OrthoMam at https://doi.org/10.5281/zenodo.7107233 (`OrthoMam.zip`) and unzip them in the folder `OrthoMam`. The results will be located in the folder `OrthoMam/Experiments`.
 Moreover, you can use this pipeline as a stand-alone to run `BayesCode` on your own set of alignments and trees by changing the `Snakefile` and/or the `config.yaml`.
 
 ### III. Run global analysis contrasting polymorphism and divergence
@@ -98,7 +91,7 @@ snakemake -j 128
 ```
 Or use the script `snakeslurm.sh` if run on a cluster (slurm) to submit the jobs.
 
-Alternatively, results of this section can be downloaded at https://doi.org/10.5281/zenodo.7107234 (`VCF.zip`). The archive file `VCF.zip` contains a vcf file for every population. Each vcf file contains SNPs for which is was possible to infer the ancestral and derived codon, hence to also predict its selection coefficient based on the difference in fitness between amino-acid estimated at the mammalian scale.
+Alternatively, at the gene level, the rate of adaptation at the phylogenetic scale and at the population scale (McDonald & Kreitman) can be downloaded at https://doi.org/10.5281/zenodo.7107234 (`GeneTable.tsv` and `MK_statistics.gz`).
 
 ## 3. Add features or debug in the python scripts
 You made modifications to one of the python script, a notebook, this README.md, or you added new features.

diff --git a/manuscript/diff.tex b/manuscript/diff.tex
@@ -1,7 +1,7 @@
 %! BibTeX Compiler = biber
 %DIF LATEXDIFF DIFFERENCE FILE
 %DIF DEL main-biorxiv.tex    Fri Dec  2 16:24:48 2022
-%DIF ADD main-revision.tex   Fri Jan 13 15:41:04 2023
+%DIF ADD main-revision.tex   Tue Jan 17 11:26:05 2023
 %TC:ignore
 \documentclass{article}
 
@@ -113,7 +113,7 @@
     \large
     \textbf{T. {Latrille}$^{1,2,3}$, N. {Rodrigue}$^{4}$, N. {Lartillot}$^{1}$}\\
     \normalsize
-    $^{1}$Université de Lyon, CNRS, LBBE UMR 5558, Villeurbanne, France\\
+    $^{1}$Université de Lyon, \DIFdelbegin \DIFdel{CNRS, LBBE UMR 5558}\DIFdelend \DIFaddbegin \DIFadd{Université Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Evolutive, UMR5558}\DIFaddend , Villeurbanne, France\\
     $^{2}$École Normale Supérieure de Lyon, Université de Lyon, Lyon, France\\
     $^{3}$Department of Computational Biology, Université de Lausanne, Lausanne, Switzerland\\
     $^{4}$Department of Biology, Institute of Biochemistry, and School of Mathematics and Statistics, Carleton University, Ottawa, Canada \\
@@ -304,11 +304,11 @@
     $42$ ontologies are observed with a p-value ($p_{\mathrm{v}}$) corrected for multiple comparison (Holm–Bonferroni correction, $p_{\mathrm{v}}^{\mathrm{adj}}$) lower than the risk $\alpha=0.05$ (\DIFdelbegin \DIFdel{see }\DIFdelend table~S1).
     At a finer scale, we weighted genes by their proportion of sites considered under adaptation with a $\omega$-based site model ($\omega > 1$, \DIFdelbegin \DIFdel{see }\DIFdelend table~S2) or with a mutation-selection model ($\omega > \omega_{0}$, \DIFdelbegin \DIFdel{see }\DIFdelend table~S3).
     For each ontology, the proportion of sites under adaptation is compared between the set of genes sharing this given ontology and the rest of the genes (Mann-Whitney U test).
-    The statistical test based on the the first criterion ($\omega>1$) is correlated with ontologies related to immune processes, while the statistical test based on the the second criterion ($\omega > \omega_{0}$) is also correlated with ontologies related to the external membrane and cellular adhesion.
+    The statistical test based on the \DIFdelbegin \DIFdel{the }\DIFdelend first criterion ($\omega>1$) is correlated with ontologies related to immune processes, while the statistical test based on the the second criterion ($\omega > \omega_{0}$) is also correlated with ontologies related to the external membrane and cellular adhesion.
 
     \subsection*{Congruence between phylogeny- and population-based methods}
     Finally, we investigated whether the phylogeny-based and the population-based methods give congruent results in terms of detection of adaptive evolution (fig.~\ref{fig:method}).
-    To do so, population genomic data were collected for 29 populations across 7 genera.
+    To do so, population genomic data were collected for 29 populations across 7 genera \DIFaddbegin \DIFadd{(fig.~S2)}\DIFaddend .
     For each population, $\rateApop$ as proposed by McDonald \& Kreitman (MK)\cite{mcdonald_adaptative_1991} was computed on the concatenate of the 822 genes classified as adaptive by the phylogeny-based method (red dots in fig.~\ref{fig:method} and~\ref{fig:unfolded-MK}).
     This result was compared to a null distribution obtained by computing $\rateApop$ over sets of 822 genes that were randomly sampled ($1,000$ replicates) among the genes classifed as nearly-neutral according to the mutation-selection model (green violins in fig.~\ref{fig:method} and~\ref{fig:unfolded-MK}).
     Importantly, the terminal lineages over which the population-genetic method was applied were not included in the phylogenetic analysis\DIFaddbegin \DIFadd{, such as to avoid fallacy and circularity in the estimation of $\rateApop$ and $\rateAphy$}\DIFaddend .
@@ -366,9 +366,9 @@
 , detecting adaptation as $\omega > \omega_{0}$ instead of $\omega > 1$ is always a more sensitive statistical test.
     We thus argue that mutation-selection provides both a theoretical and practical improvement over the current method of testing to detect adaptation.
     }\DIFaddend The application of this approach exome-wide across placental mammals suggests that $822$ out of $14,509$ proteins are under a long-term evolutionary Red-Queen, with ontology terms related to immune processes and the external membrane of cells.
-    Enrichment \DIFdelbegin \DIFdel{of }\DIFdelend \DIFaddbegin \DIFadd{in }\DIFaddend ontologies related to immune processes is expected, as found by \DIFdelbegin \DIFdel{many studies\mbox{%DIFAUXCMD
+    Enrichment \DIFdelbegin \DIFdel{of }\DIFdelend \DIFaddbegin \DIFadd{in }\DIFaddend ontologies related to immune processes is expected, as found \DIFdelbegin \DIFdel{by many studies\mbox{%DIFAUXCMD
 \cite{kosiol_patterns_2008, enard_viruses_2016, ebel_high_2017}}\hspace{0pt}%DIFAUXCMD
-}\DIFdelend \DIFaddbegin \DIFadd{previous studies\mbox{%DIFAUXCMD
+}\DIFdelend \DIFaddbegin \DIFadd{in previous studies\mbox{%DIFAUXCMD
 \cite{kosiol_patterns_2008}}\hspace{0pt}%DIFAUXCMD
 }\DIFaddend .
     However, we also detect an enrichment \DIFdelbegin \DIFdel{with }\DIFdelend \DIFaddbegin \DIFadd{in }\DIFaddend ontologies related to the external membrane and cell adhesion\DIFdelbegin \DIFdel{, which are }\DIFdelend \DIFaddbegin \DIFadd{.
@@ -391,7 +391,7 @@
     We thus showed empirically that the mutation-selection codon model provides a null (nearly-neutral) model from which we can disentangle purifying and adaptive evolution.
     However, our procedure still has some limitations.
 
-    \DIFaddbegin \DIFadd{First, synonymous mutations are considered neutral in our implementation of mutation-selection codon models.
+    \DIFdelbegin \DIFdel{Mutation-selection }\DIFdelend \DIFaddbegin \DIFadd{First, synonymous mutations are considered neutral in our implementation of mutation-selection codon models.
     In other words, we assumed no synonymous codon usage bias (CUB), or a weak effect as it is suggested in mammals\mbox{%DIFAUXCMD
 \cite{plotkin_synonymous_2011}}\hspace{0pt}%DIFAUXCMD
 .
@@ -403,7 +403,7 @@
 .
     Such a model could be used to detect adaptation in a more general context for which CUB cannot be overlooked (e.g.~}\textit{\DIFadd{Drosophila}}\DIFadd{).
     Further work will thus be required to relax the assumption that synonymous mutations are neutral and to assess the sensitivity of $\rateAphy$ to CUB.
-    Second, }\DIFaddend Mutation-selection codon models assume a constant effective population size while it has been established that its fluctuations has a major effect on selection dynamics\cite{lanfear_population_2014, platt_protein_2018}.
+    Second, mutation-selection }\DIFaddend codon models assume a constant effective population size while it has been established that its fluctuations has a major effect on selection dynamics\cite{lanfear_population_2014, platt_protein_2018}.
     Estimating changes in effective population size in a mutation-selection framework is possible\cite{latrille_inferring_2021}, although too computational intensive in its current implementation to be performed genome-wide.
     \DIFdelbegin \DIFdel{Second}\DIFdelend \DIFaddbegin \DIFadd{Third}\DIFaddend , epistasis is not modeled while it can have a large effect on the response of the rate of evolution with change in population size\cite{latrille_quantifying_2021}.
     More generally, pervasive epistasis generates an entrenchment of the amino acids\cite{goldstein_evolutionary_2004, goldstein_nonadaptive_2015, goldstein_sequence_2017}, resulting in a slowing down of the rate evolution\cite{rodrigue_detecting_2017, patel_epistasis_2022} or a standstill\cite{youssef_evolution_2022}.
@@ -445,15 +445,15 @@
     \subsection*{Adaptation in phylogeny-based method}
     Classical codon models estimates a parameter $\omega=\dnds$, namely the ratio of the non-synonymous over the synonymous substitution rates\cite{muse_likelihood_1994,goldman_codonbased_1994}.
     In the so-called site models, $\omega$ is allowed to vary across sites\cite{yang_codonsubstitution_2000, huelsenbeck_dirichlet_2006}.
-    In \textit{Bayescode}, site-specific $\omega^{(i)}$ (fig.~\ref{fig:scatterplot}B, y-axis) \DIFdelbegin \DIFdel{are independent identically distributed }\DIFdelend \DIFaddbegin \DIFadd{as independent and identically distributed random effects }\DIFaddend from a gamma distribution\cite{lartillot_phylobayes_2013}.
+    In \textit{Bayescode}, site-specific $\omega^{(i)}$ (fig.~\ref{fig:scatterplot}B, y-axis) are \DIFdelbegin \DIFdel{independent identically distributed }\DIFdelend \DIFaddbegin \DIFadd{modeled as independent and identically distributed random effects }\DIFaddend from a gamma distribution\cite{lartillot_phylobayes_2013}.
     \DIFaddbegin \DIFadd{Therefore, this corresponds to a continuous version of the M5 model of \mbox{%DIFAUXCMD
 \textcite{yang_codonsubstitution_2000}}\hspace{0pt}%DIFAUXCMD
 .
     }\DIFaddend In a second step, the average over sites is calculated, giving estimates of $\omega$ for each protein-coding sequence (fig.~\ref{fig:scatterplot}A, y-axis).
     \DIFaddbegin \DIFadd{To assess the sensitivity of our estimate of $\omega$ to the prior as implemented in }\textit{\DIFadd{BayesCode}}\DIFadd{, we ran }\textit{\DIFadd{CODEML}} \DIFadd{site model (F1x4 MG, 8 discrete categories of $\omega$) from the PAML software\mbox{%DIFAUXCMD
 \cite{yang_paml_2007} }\hspace{0pt}%DIFAUXCMD
 on a sample of 100 random genes.
-    Both $\omega$ estimates from }\textit{\DIFadd{BayesCode}} \DIFadd{and }\textit{\DIFadd{CODEML}} \DIFadd{are strongly correlated at the site level ($r^2=0.985$, slope of $0.92$ in fig.~S1) and the gene level ($r^2=0.991$, slope of $0.9$ in fig.~S2).
+    Both $\omega$ estimates from }\textit{\DIFadd{BayesCode}} \DIFadd{and }\textit{\DIFadd{CODEML}} \DIFadd{are strongly correlated at the site level ($r^2=0.985$, slope of $0.92$ in fig.~S1.A) and at the gene level ($r^2=0.991$, slope of $0.9$ in fig.~S1.B).
 }\DIFaddend 
 
     In contrast \DIFaddbegin \DIFadd{to $\omega$-based codon models}\DIFaddend , mutation-selection models assume that the protein-coding sequence is at mutation-selection balance under a fixed fitness landscape, which is itself characterized by a fitness vector over the $20$ amino acid at each site\cite{yang_mutationselection_2008, halpern_evolutionary_1998, rodrigue_mechanistic_2010}.
@@ -520,15 +520,15 @@
 
     \section{Data availability}\label{sec:data-availability}
     The data underlying this article are available at \DIFdelbegin %DIFDELCMD < \url{10.5281/zenodo.7107234}%%%
-\DIFdelend \DIFaddbegin \url{https://doi.org/10.5281/zenodo.7107234}\DIFaddend .
+\DIFdelend \DIFaddbegin \url{https://doi.org/10.5281/zenodo.7107233}\DIFaddend .
     Scripts and instructions necessary to reproduce the empirical experiments on the original dataset or with user-specified datasets is available at \url{https://github.com/ThibaultLatrille/AdaptaPop}.
 
     \section{Acknowledgements}\label{sec:acknowledgements}
     We gratefully also acknowledge the help of Nicolas Galtier and Julien Joseph for their advice and review concerning this manuscript.
     This work was performed using the computing facilities of the CC LBBE/PRABI.
     This study makes use of data generated by the NextGen Consortium.
     The European Union’s Seventh Framework Programme (FP7/2010-2014) provided funding for the project under grant agreement no 244356 - “NextGen”.
-    \DIFdelbegin \DIFdel{Funding: }\DIFdelend \DIFaddbegin \textbf{\DIFadd{Funding:}} \DIFaddend Université de Lausanne.
+    Funding: Université de Lausanne.
     Agence Nationale de la Recherche, Grant ANR-15-CE12-0010-01 / DASIRE.
     Agence Nationale de la Recherche, Grant ANR-19-CE12-0019 / HotRec.
     Natural Sciences and Engineering Research Council of Canada.