# Execute metagenome functional profile
Jacobo de la Cuesta-Zuluaga. August 2024.

The aim of this notebook is to obtain the functional profile from metagenome samples.

This notebook uses an alternative pipeline to the one in `02_Run_Functional.ipynb`. Instead of the database-dependent approach used by `mifaser`, it uses `nf-core/metatdenovo`, which is assembly-based. [You can find the pipeline's documentation here](https://nf-co.re/metatdenovo).

The pipeline assembles the metagenome or metatranscriptome samples, performs gene calling and aligns the reads to the assembled metagenomes/metatranscriptomes. This has the advantage that it doesn't require a database of genes against which the reads are mapped. On the other hand, results from separate runs of the pipeline might not be directly comparable, so all samples to be used in a given analysis should ideally be processed together and in the same way.

## Before we start
This notebook assumes that the sequences already went through QC. In this case, we're using the output files from the `taxprofiler` pipeline, which performs sequence quality control and removal of host sequences. See notebook 01 for that. 

## Load libraries and set paths

First, we'll load the libraries and set up the working directories where we'll save our files.

In [1]:
# Libraries
library(tidyverse)
library(conflicted)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

[conflicted] Will prefer dplyr::filter over any other package.


The following chunk will define the directories where the data is stored and where the output will be saved. The present example assumes everything will be contained in the same directory: `base_dir`. 

In [10]:
# Directories
# Base directory
base_dir = "/mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test"

# Data
data_dir = file.path(base_dir, "data")
dir.create(data_dir)

# Sequences
seq_dir = file.path(data_dir, "taxprofiler/analysis_ready_fastqs")

# Out
out_dir = file.path(data_dir, "Metadenovo")
dir.create(out_dir)

# sheets dir
sheets_dir = file.path(data_dir, "sheets")
dir.create(sheets_dir)

# Software
bin_dir = file.path(base_dir, "bin")
dir.create(bin_dir)
conda_env = "nextflow"

“'/mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data' already exists”


“'/mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data/sheets' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/bin' already exists”
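
The warnings above only indicate that the directories already existed when the cell was re-run. As a minimal sketch, assuming you would rather not see them, `dir.create()` can be called with `showWarnings = FALSE`. The helper name `ensure_dir` and the temporary example path are hypothetical and not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook):
# create a directory if needed, staying silent when it already exists
ensure_dir = function(path) {
    # recursive = TRUE also creates missing parent directories
    dir.create(path, showWarnings = FALSE, recursive = TRUE)
    # Return the path invisibly so the call can be used in an assignment
    invisible(path)
}

# Illustration in a temporary location, not in the project directory
example_dir = ensure_dir(file.path(tempdir(), "data", "sheets"))

# Calling it a second time produces no warning
example_dir = ensure_dir(example_dir)
dir.exists(example_dir)
```

In the notebook itself, this would amount to something like `data_dir = ensure_dir(file.path(base_dir, "data"))`.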


## Create samples file
Similar to the file we passed to taxprofiler, we'll need to create a file with the name of the sample and the files corresponding to forward and reverse reads.

**Note** that in this case we'll need the clean reads, not the raw reads.

In [4]:
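# List clean sequences
clean_seq_list = list.files(seq_dir,  
        pattern = "merged\\.fastq\\.gz$",
        full.names = TRUE)
# Forward reads (match the suffix so a "_1" inside a sample name is not picked up)
forward_reads = clean_seq_list %>%
    str_subset("_1\\.merged")
# Reverse reads
reverse_reads = clean_seq_list %>%
    str_subset("_2\\.merged")

clean_seq_list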

In [5]:
# Combine lists of files to create a data frame
reads_tax_df = data.frame(fastq_1 = forward_reads, # Full path of forward reads
        fastq_2 = reverse_reads) %>% # Full path of reverse reads
    mutate(sample = basename(fastq_1), # Sample name from the file
        sample = str_remove(sample, "_[0-9]\\.merged.*")) %>%
    relocate(sample, fastq_1, fastq_2) # Reorder columns

reads_tax_df %>%
    head()

| | sample | fastq_1 | fastq_2 |
|---|---|---|---|
| | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | MI-142-H | /mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data/taxprofiler/analysis_ready_fastqs/MI-142-H_1.merged.fastq.gz | /mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data/taxprofiler/analysis_ready_fastqs/MI-142-H_2.merged.fastq.gz |
| 2 | MI-237-H | /mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data/taxprofiler/analysis_ready_fastqs/MI-237-H_1.merged.fastq.gz | /mnt/lustre/groups/maier/maide581/projects/Small_projects/Metemgee_test/data/taxprofiler/analysis_ready_fastqs/MI-237-H_2.merged.fastq.gz |
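
Because the forward and reverse files are combined by position, it may be worth checking that they really pair up before writing the samples file. The sketch below illustrates one way to do that. The file names are made up for the example and only assume the `<sample>_<1|2>.merged.fastq.gz` naming seen above; all `example_*` objects are hypothetical.

```r
library(stringr)

# Made-up file names following the <sample>_<1|2>.merged.fastq.gz pattern.
# The second sample has "_1" inside its name on purpose.
example_files = c("/tmp/S1_1.merged.fastq.gz", "/tmp/S1_2.merged.fastq.gz",
    "/tmp/S_10_1.merged.fastq.gz", "/tmp/S_10_2.merged.fastq.gz")

# Anchor the read tag to the file suffix so "_1" inside a sample name is ignored
example_forward = str_subset(example_files, "_1\\.merged\\.fastq\\.gz$")
example_reverse = str_subset(example_files, "_2\\.merged\\.fastq\\.gz$")

# Sample names recovered independently from each list
forward_ids = str_remove(basename(example_forward), "_1\\.merged\\.fastq\\.gz$")
reverse_ids = str_remove(basename(example_reverse), "_2\\.merged\\.fastq\\.gz$")

# Stop early if a mate is missing or the two lists are in a different order
stopifnot(length(example_forward) == length(example_reverse),
    identical(forward_ids, reverse_ids))

forward_ids
```

With these made-up names, an unanchored `str_subset(example_files, "_1")` would also return `S_10_2.merged.fastq.gz`, which is the kind of mismatch the check is meant to catch.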


In [6]:
# Write samples file
Metadenovo_samplesfile = file.path(sheets_dir, "samples_file_Metadenovo.csv")
write_csv(reads_tax_df,
    file = Metadenovo_samplesfile)

## Execute pipeline

In [19]:
# Path to eukulele database
eukulele_path = "/mnt/lustre/groups/maier/databases/EUKulele"

# Host genomes
host_genome = "/mnt/lustre/groups/maier/databases/Host_genomes/hg19_main_mask_ribo_animal_allplant_allfungus.fa"

In [20]:
# Create command
# Base command
Metadenovo_cmd = str_glue(
  "conda activate {{conda_env}} && \\
  cd {{out_dir}} && \\
  nextflow run nf-core/metatdenovo -r 1.0.1 \\
  -profile m3c \\
  --input {{samples_sheet}} \\
  --outdir {{out_dir}} \\
  --eukulele_db gtdb \\
  --eukulele_dbpath {{eukulele_path}}")

In [21]:
# Fill command
Clean_tax_cmd = str_glue(Metadenovo_cmd,
    conda_env = conda_env,
    samples_sheet = Metadenovo_samplesfile,
    out_dir = out_dir, 
    eukulele_path = eukulele_path)

Clean_tax_cmd
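
The command is built in two passes: in the first `str_glue()` call, the doubled braces are escaped and left behind as single-brace placeholders, and the second call fills them in from the named arguments. A reduced sketch of that mechanism, using a shortened command and made-up values (`example_template`, `example_filled`, `samples.csv` and `results` are all hypothetical), could look like this:

```r
library(stringr)

# First pass: "{{name}}" is an escaped brace pair, so it becomes the literal "{name}"
example_template = str_glue(
    "nextflow run nf-core/metatdenovo --input {{samples_sheet}} --outdir {{out_dir}}")
example_template

# Second pass: the single-brace placeholders are replaced by the named arguments
example_filled = str_glue(example_template,
    samples_sheet = "samples.csv",
    out_dir = "results")
example_filled
```

Keeping the template separate from the values makes it possible to reuse the same command text with a different samples file or output directory.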