Skip to content

Commit

Permalink
Add code for figures, LICENSE and improve readme
Browse files Browse the repository at this point in the history
  • Loading branch information
LucasPaoli committed Mar 29, 2022
1 parent e848d65 commit 40ac153
Show file tree
Hide file tree
Showing 29 changed files with 6,666 additions and 6 deletions.
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

22 changes: 16 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,28 @@

*Like a bagpipe, but for MAGs.*

This repo contains the code behing the snakemake pipeline used for the generation of the MAGs used in this [paper](https://www.biorxiv.org/content/10.1101/2021.03.24.436479v1).
# Code associated with the paper 'Biosynthetic potential of the global ocean microbiome' and the Ocean Microbiomics Database.

Link to the [preprint](https://www.biorxiv.org/content/10.1101/2021.03.24.436479v1) and the [paper](). The Ocean Microbiomics Database is available here: [https://www.microbiomics.io/ocean/](https://www.microbiomics.io/ocean/).

This repo contains the code behing the snakemake pipeline used for the generation of the MAGs, their analyses and the R code used for the figures.

## Structure

- [**magpipe**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/magpipe): python module for the pipeline.
- [**snakes**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/snakes): snakemake files.
- [**rules**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/rules): the rules of the snakemake pipeline.
- [**configs**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/configs): config files used by the launchers to run the pipeline.
- [**scripts**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/scripts): ad-hoc python scripts and the magpipe python module.
- [**envs**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/envs): conda environments for the rules.
- [**figures**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/figures): R code behind the figures.
- [**launchers**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/launchers): bash scripts to run the pipeline (on sge etc).
- [**magpipe**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/magpipe): python module for the pipeline.
- [**resources**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/resources): some fixed input information on methods and datasets.
- [**envs**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/envs): conda environments for the rules.
- [**rules**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/rules): the rules of the snakemake pipeline.
- [**sandbox**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/sandbox): scripts used for postprocessing and analyses.
- [**scripts**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/scripts): ad-hoc python scripts used by the pipeline.
- [**snakes**](https://github.com/SushiLab/MAGPIPE_DEV/tree/master/code/snakes): snakemake files.

You can find additional documentation of the different metagenomic analyses steps here: [https://methods-in-microbiomics.readthedocs.io/en/latest/](https://methods-in-microbiomics.readthedocs.io/en/latest/).

# Running the pipeline:

## Installation

Expand Down
Binary file added figures/.DS_Store
Binary file not shown.
113 changes: 113 additions & 0 deletions figures/Extended-data-fig-1A-depth.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# Biosynthetic potential of the global ocean microbiome ==================================
# Code used to produce the figures and analysis of the paper
# Extended Data Fig. 1A - Depth distrib ==================================================

# Libraries ------------------------------------------------------------------------------

rm(list = ls())
library(tidyverse)
library(ggimage)
library(facetscales)
for (src_file in list.files('code/analysis/R')){
print(paste('Sourcing', src_file, '...'))
source(paste('code/analysis/R', src_file, sep = '/'))}
source('code/analysis/lib/Cell_Press_guidelines.R')
source('../../Exploratorium/sushilab-colors/palettes-paoli.R')

# Prepare data ---------------------------------------------------------------------------

# Samples data
figure_1B_table =
load_general_metadata() %>%
select(-`Internal Sample Name`) %>%
# Remove GORG for now
# rbind(gorg_metadata_from_local() %>%
# mutate(dataset = "GORG SAGs",
# Sample = paste("GORG", Sample),
# station = paste(station, gsub("HOT[0-9]*|BATS[0-9]*", "", cruise)),
# depth_num = as.numeric(depth),
# depth_layer = add_depth_layers(as.character(depth)),
# `size_fraction` = "N/A",
# `temperature [°C]` = "N/A",
# `oxygen [µmol/kg]` = "N/A") %>%
# select(-cruise, -plate, -links)) %>%
mutate(
size = ifelse(grepl("Time-Series", dataset), "Time-Series", "Survey"),
image = c(
"Biogeotraces" = "code/analysis/lib/icon_circle_biogeotraces.png",
"Malaspina" = "code/analysis/lib/icon_circle_malaspina.png",
"Tara Oceans" = "code/analysis/lib/icon_circle_tara.png",
"Hawaiian Ocean Time-Series" = "code/analysis/lib/icon_clock_hots.png",
"Bermuda-Atlantic Time-Series" = "code/analysis/lib/icon_clock_bats.png",
"GORG SAGs" = "code/analysis/lib/icon_cross_gorg.png"
)[dataset]) %>%
mutate(depth_layer = factor(depth_layer, levels=c("EPI", "MES", "BAT", "ABY")))

# Plot samples through depths using transformed scale ------------------------------------

trans <- function(x) -sqrt(x)
inv <- function(x) x**2

figure_1B = ggplot(figure_1B_table) +
geom_image(aes(x = longitude, y = depth_num, image = image, size = size), asp = 1.85) +
scale_size_manual(values = c(0.01, 0.0166)) +
scale_y_continuous(trans = scales::trans_new("revsqrt_trans", trans, inv, minor_breaks = scales::regular_minor_breaks(reverse = TRUE)),
limits = c(6000, 0),
breaks = c(0, 200, 1000, 2000, 4000, 6000),
expand=c(0,0)) +
scale_x_continuous(limits = c(-180, 180), expand = c(0,0)) +
ylab('Depth (m)') +
xlab('Longitude') +
coord_cartesian(clip = "off") +
theme_bw() +
theme_cell +
theme(rect = element_blank(),
text = element_text(size = unit(6, "pt")),
plot.margin = margin(1, 1, 1, 1, "mm"),
legend.position = "none",
axis.text.y = element_text(size = unit(6, "pt")),
axis.text.x = element_text(size = unit(6, "pt"), hjust = 1),
axis.title.y = element_text(size = unit(6, "pt")),
axis.title.x = element_text(size = unit(6, "pt")),
axis.ticks.length = unit(0.5, "mm"))

figure_1B

# Plot samples through depths using facets ------------------------------------

scales_y <- list(
`EPI` = scale_y_continuous(trans = "reverse", limits = c(200, 0), breaks = c(0, 100, 200), expand = c(0,0)),
`MES` = scale_y_continuous(trans = "reverse", limits = c(1000, 200), breaks = c(500, 1000), minor_breaks = c(750), expand = c(0,0)),
`BAT` = scale_y_continuous(trans = "reverse", limits = c(4000, 1000), breaks = c(2000, 4000), expand = c(0,0)),
`ABY` = scale_y_continuous(trans = "reverse", limits = c(6000, 4000), breaks = c(5000, 6000), expand = c(0,0))
)

figure_1B = ggplot(figure_1B_table) +
geom_image(aes(x = longitude, y = depth_num, image = image, size = size), asp = 12.3) +
facet_grid_sc(depth_layer~., scales = list(y = scales_y)) +
scale_size_manual(values = c(0.01, 0.0166)) +
scale_x_continuous(limits = c(-180, 180), expand = c(0,0)) +
ylab('Depth (m)') +
xlab('Longitude (°)') +
coord_cartesian(clip = "off") +
theme_bw() +
theme_cell +
theme(rect = element_blank(),
text = element_text(size = unit(6, "pt")),
plot.margin = margin(1, 0, 1, 0, "mm"),
legend.position = "none",
axis.text.y = element_text(size = unit(6, "pt"), hjust = 1),
axis.text.x = element_text(size = unit(6, "pt")),
axis.title.y = element_text(size = unit(6, "pt")),
axis.title.x = element_text(size = unit(6, "pt")),
axis.ticks.length = unit(0.5, "mm"),
strip.background = element_rect(colour = NA, fill = "lightgrey"),
panel.spacing.y = unit(0, 'mm'),
strip.text = element_text(size = 6, colour = "white", face = "bold", margin = margin(0.5, 0.5, 0.5, 0.5, "mm")))

figure_1B

# Save figure -------

ggsave(paste0(figures_path_proj, "Figure-1/Figure-1B.pdf"), figure_1B, width = 113, height = 41.5, units = col_unit)

131 changes: 131 additions & 0 deletions figures/Extended-data-fig-2A-abundance_corr.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
# Biosynthetic potential of the global ocean microbiome ==================================
# Code used to produce the figures and analysis of the paper
# Extended Data Fig. 2A - Abundance correlation binning ==================================

# Libraries ------------------------------------------------------------------------------

rm(list = ls())
library(tidyverse)
for (src_file in list.files('code/analysis/R')){
print(paste('Sourcing', src_file, '...'))
source(paste('code/analysis/R', src_file, sep = '/'))}
source('code/analysis/lib/Cell_Press_guidelines.R')
source('/Users/paolil/polybox/PhD/Exploratorium/sushilab-colors/palettes-paoli.R')

# Load data ------------------------------------------------------------------------------

general_metadata = load_general_metadata()

samples_list = rbind(read_tsv("data/raw/go_microbiomics/samples/GEOTRACES.rand10-5-5.samples", col_names = "Internal Sample Name") %>% mutate(backmapping = "n=610", description = "Biogeotraces-HOTS-BATS"),
read_tsv("data/raw/go_microbiomics/samples/TARA_OCEANS_prok.rand10.samples", col_names = "Internal Sample Name") %>% mutate(backmapping = "n=180", description = "Tara Oceans\n(Prok)"),
read_tsv("data/raw/go_microbiomics/samples/TARA_OCEANS_vir.rand10.samples", col_names = "Internal Sample Name") %>% mutate(backmapping = "n=190", description = "Tara Oceans\n(Viral)"),
read_tsv("data/raw/go_microbiomics/samples/MALASPINA.rand5.samples", col_names = "Internal Sample Name") %>% mutate(backmapping = "n=58", description = "Malaspina")) %>%
mutate(`Internal Sample Name` = gsub("_METAG", "", `Internal Sample Name`)) %>%
left_join(general_metadata %>% select(`Internal Sample Name`, size_fraction))

summary_table = load_raw_summary()

without_diffcov = read_tsv("data/raw/go_microbiomics/summaries/diffcov_go_micro_metabat2.c2k.e500_cpl50_ctn10-evaluate_summary.tsv") %>%
mutate(Sample = gsub("_[^_]*$", "", `Bin Id`),
`Internal Sample Name` = gsub('_METAG', '', basename(dirname(`Folder Location`))),
Method = gsub('_a[0-9]+\\.', '', basename(`Folder Location`))) %>%
add_genomes_quality(weight = 5)

# Prepare data ------------------------------------------------------------------------------

sum(samples_list$`Internal Sample Name` %in% summary_table$`Internal Sample Name`)
samples_list$`Internal Sample Name`[!(samples_list$`Internal Sample Name` %in% summary_table$`Internal Sample Name`)]

col_subset = c("Internal Sample Name", "Folder Location", "Bin Id", "Mean Completeness", "Mean Contamination", "CheckM Completeness", "CheckM Contamination", "CheckM Strain heterogeneity", "CheckM N50 (scaffolds)", "Anvio Domain", "Anvio Domain Confidence", "Anvio Completion", "Anvio Redundancy", "Anvio # Scaffolds", "Anvio Length", "Folder Location", "Method", "Quality", "HighQ", "GoodQ", "MediumQ", "LowQ")

figure_2B_table =
rbind(
left_join(samples_list, without_diffcov %>% select(all_of(col_subset))) %>% mutate(`Differential Coverage` = FALSE),
left_join(samples_list, summary_table %>% select(all_of(col_subset))) %>% mutate(`Differential Coverage` = TRUE)
) %>%
group_by(`Internal Sample Name`, `Differential Coverage`) %>%
summarize(backmapping = unique(backmapping),
description = unique(description),
`# MAGs` = n(),
`# HighQ` = sum(HighQ),
`# GoodQ` = sum(GoodQ),
`# MediumQ` = sum(MediumQ),
`# LowQ` = sum(LowQ),
`Cumulative Q-score` = sum(Quality),
`Cumulative Q'-score` = sum(`Mean Completeness` - 5*`Mean Contamination` + (`Mean Contamination`*(`CheckM Strain heterogeneity`/100)) + 0.5*(log10(`CheckM N50 (scaffolds)`))),
`Cumulative Q'-checkm-score` = sum(`CheckM Completeness` - 5*`CheckM Contamination` + (`CheckM Contamination`*(`CheckM Strain heterogeneity`/100)) + 0.5*(log10(`CheckM N50 (scaffolds)`))),
`Average Q-score` = mean(Quality),
`Average Q'-score` = mean(`Mean Completeness` - 5*`Mean Contamination` + (`Mean Contamination`*(`CheckM Strain heterogeneity`/100)) + 0.5*(log10(`CheckM N50 (scaffolds)`))),
`Average Q'-checkm-score` = mean(`CheckM Completeness` - 5*`CheckM Contamination` + (`CheckM Contamination`*(`CheckM Strain heterogeneity`/100)) + 0.5*(log10(`CheckM N50 (scaffolds)`)))
) %>%
replace(is.na(.), 0) %>%
left_join(load_general_metadata()) %>%
group_by(Sample) %>%
summarize(dataset = unique(dataset),
backmapping = unique(backmapping),
description = unique(description),
size_fraction = unique(size_fraction),
`# MAGs ratio` = `# MAGs`[`Differential Coverage`]/`# MAGs`[!`Differential Coverage`],
`Cumulative Q-score ratio` = `Cumulative Q-score`[`Differential Coverage`]/`Cumulative Q-score`[!`Differential Coverage`],
`Cumulative Q'-score ratio` = `Cumulative Q'-score`[`Differential Coverage`]/`Cumulative Q'-score`[!`Differential Coverage`],
`Cumulative Q'-checkm-score ratio` = `Cumulative Q'-checkm-score`[`Differential Coverage`]/`Cumulative Q'-checkm-score`[!`Differential Coverage`],
`Average Q-score ratio` = `Average Q-score`[`Differential Coverage`]/`Average Q-score`[!`Differential Coverage`],
`Average Q'-score ratio` = `Average Q'-score`[`Differential Coverage`]/`Average Q'-score`[!`Differential Coverage`],
`Average Q'-checkm-score ratio` = `Average Q'-checkm-score`[`Differential Coverage`]/`Average Q'-checkm-score`[!`Differential Coverage`]) %>%
replace(is.na(.), 1)

# Plot figure ----------------------------------------------------------------------------

fraction_dict = c("<-0.22" = " (Viral)",
"0.1-0.22" = " (Viral)",
"0.22-0.45" = " (F.L.1)",
"0.45-0.8" = " (F.L.1)",
"0.22-1.6" = " (F.L.2)",
"0.22-3" = " (F.L.2)",
"0.2-0.8" = " (F.L.1)",
"0.8-20" = " (P.A.)")

plot_order = c("n=190, Tara Oceans (Viral)", "n=190, Tara Oceans (F.L.1)", "n=180, Tara Oceans (F.L.2)", "n=58, Malaspina (F.L.1)", "n=58, Malaspina (P.A.)", "n=610, Biogeotraces", "n=610, Hawaiian Ocean Time-Series", "n=610, Bermuda-Atlantic Time-Series")

figure_2B_table %>%
#filter(`Cumulative Q'-score ratio` != Inf) %>%
summary()
figure_2B_table %>%
filter(`Cumulative Q'-score ratio` != Inf) %>%
summary()

figure_2B = figure_2B_table %>%
mutate(fraction = ifelse(grepl("Tara|Mala", dataset), fraction_dict[size_fraction], "")) %>%
mutate(facet = factor(paste0(backmapping, ", ", dataset, fraction), levels = plot_order)) %>%
mutate(dot_type = ifelse(`Cumulative Q'-score ratio` %in% c(Inf, 1), "Special Case", "Normal")) %>%
ggplot() +
geom_boxplot(aes(x = facet, y = `Cumulative Q'-score ratio`, color = dataset), size = 0.3, outlier.shape = NA, fill = NA, position = position_dodge(width = 1)) +
geom_point(aes(x = facet, y = `Cumulative Q'-score ratio`, color = dataset, fill = dataset), size = .5, shape = 21, position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0)) +
#geom_point(aes(x = facet, y = mean(`# MAGs ratio`)), shape = 3, size = 3) +
geom_hline(yintercept = 1, linetype = "dashed", size = 0.2) +
scale_color_manual(values = dataset_colors[unique(figure_2B_table$dataset)]) +
scale_fill_manual(values = alpha(dataset_colors[unique(figure_2B_table$dataset)], alpha = 0.6)) +
facet_grid(.~facet, scales = "free_x", space = "free_x") +
coord_cartesian(clip = "off") +
scale_y_log10() +
theme_bw() +
theme_cell +
theme(plot.margin = margin(0,0,0,0, 'mm'),
legend.position = "none",
rect = element_rect(size = NA),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 6),
axis.text.y = element_text(size = 6),
panel.spacing.x = unit(0, 'mm'),
strip.background = element_rect(color = "lightgrey", fill = "lightgrey"),
strip.text = element_text(size = 6, colour = "white", face = "bold", margin = margin(0.5, 1, 0.5, 1, "mm")))

figure_2B

ggsave(paste0(figures_path_proj, "Figure-2/Figure-2B.pdf"), figure_2B, width = 48, height = 25, units = "mm")




60 changes: 60 additions & 0 deletions figures/Extended-data-fig-2B-binning_MGEs.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# Biosynthetic potential of the global ocean microbiome ==================================
# Code used to produce the figures and analysis of the paper
# Extended Data Fig. 2B - Binning MGEs ===================================================

# Libraries ------------------------------------------------------------------------------

rm(list = ls())
library(viridis)
library(tidyverse)
library(patchwork)
library(googlesheets4)
for (src_file in list.files('code/analysis/R')){
print(paste('Sourcing', src_file, '...'))
source(paste('code/analysis/R', src_file, sep = '/'))}
source('code/analysis/lib/Cell_Press_guidelines.R')
source('/Users/paolil/polybox/PhD/Exploratorium/sushilab-colors/palettes-paoli.R')

# Load data ==============================================================================

data_tbl = read_sheet('1YqvL7CDaW4fkQl9R9ZAsBKCuFtBCrJ0XV1rjLHA_x4s') %>%
mutate(`Scaffold Length` = factor(`Scaffold Length`, levels = c("≥10Kb", "≥2Kb", "≥1Kb")),
`Binning Result` = factor(`Binning Result`, levels = rev(c("MAG", "Bin", "Unbinned"))),
`Genetic Element` = factor(`Genetic Element`, levels = c("All", "Chromosome", "Plasmid", "Virus", "Unannotated")))

vir_init = viridis(length(unique(data_tbl$`Binning Result`)), begin = .2)
vir_cols = vir_init[1:(length(vir_init) - 1)]
vir_grey = c(vir_cols, DescTools::ColToGrey(vir_init[length(vir_init)])) # Las color as greyscale


p1 = data_tbl %>%
ggplot() +
geom_bar(aes(x = `Scaffold Length`, y = `Number of scaffolds`), stat = 'identity') +
facet_wrap(~`Genetic Element`, nrow = 1, scales = "free_y") +
theme_bw() +
theme_cell +
theme(rect = element_blank(),
plot.margin = margin(1, 1, 1, 1, "mm"),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.background = element_rect(color = "lightgrey", fill = "lightgrey"),
strip.text = element_text(size = 6, colour = "white", face = "bold", margin = margin(1, 1, 1, 1, "mm")))

p2 = data_tbl %>%
ggplot() +
geom_bar(aes(x = `Scaffold Length`, y = `Perc. of scaffolds`, fill = `Binning Result`), stat = 'identity') +
facet_wrap(~`Genetic Element`, nrow = 1, scales = "free_y") +
scale_fill_manual(values = rev(vir_grey)) +
theme_bw() +
theme_cell +
theme(rect = element_blank(),
plot.margin = margin(1, 1, 1, 1, "mm"),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
strip.background = element_rect(color = "lightgrey", fill = "lightgrey"),
strip.text = element_text(size = 6, colour = "white", face = "bold", margin = margin(1, 1, 1, 1, "mm")))


p1 / p2

ggsave(paste0(figures_path_proj, "Figure-SX/Figure-SX-binning_MGEs.pdf"), width = two_col, height = 100, units = col_unit, device = cairo_pdf)
Loading

0 comments on commit 40ac153

Please sign in to comment.