# Execute metagenome functional profile
Jacobo de la Cuesta-Zuluaga. July 2024.

The aim of this notebook is to obtain the functional profile from metagenome samples.

## Before we start
This notebook assumes that the sequences have already gone through QC. Here we use the output files from the `taxprofiler` pipeline, which performs sequence quality control and removes host sequences. See notebook 01 for details.

In addition, you need a `conda` environment with Python 3.8 to run `mifaser`, the functional profiler.
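
If you don't have that environment yet, one way to create it is sketched below. This assumes that `conda` is available in your terminal and that the environment is called `mifaser` (the name used later in this notebook); `env_name` and `conda_create_cmd` are names chosen only for this example. The cell just builds the command, so you can copy it and run it in a terminal.

```r
# Name of the environment (assumption: same name used later in this notebook)
env_name = "mifaser"

# Build the command to create an environment with Python 3.8
conda_create_cmd = sprintf("conda create --name %s python=3.8", env_name)

conda_create_cmd
```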

## Load libraries and set paths

First, we'll load the libraries and set the directories where we'll save our files.

In [None]:
# Libraries
library(tidyverse)
library(conflicted)

In [None]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

In [None]:
# Directories
# Base directory
base_dir = "/PATH/TO/YOUR/PROJECT/FOLDER"

# Data
data_dir = file.path(base_dir, "data")
dir.create(data_dir)

# Sequences
seq_dir = file.path(data_dir, "taxprofiler/analysis_ready_fastqs")

# Out
out_dir = file.path(data_dir, "mifaser")
dir.create(out_dir)

# sheets dir
sheets_dir = file.path(data_dir, "sheets")
dir.create(sheets_dir)

# Software
bin_dir = file.path(base_dir, "bin")
dir.create(bin_dir)
conda_env = "mifaser"

## Download `mifaser`

Next, we'll download the repo of the functional profiler. I have found this is the easiest way, since it comes with all the software and databases needed.

In [None]:
# Download mifaser repo
# Directory
mifaser_dir = file.path(bin_dir, "mifaser/")

# Command
git_cmd = str_glue("git clone https://bitbucket.org/bromberglab/mifaser.git {mifaser_dir}",
    mifaser_dir = mifaser_dir)

git_cmd
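
Note that the cell above only builds and prints the command; it does not clone anything. A minimal sketch of how you could run it from R, only when the folder is still missing, is shown below. The temporary folder is a stand-in for `mifaser_dir`, and the `system()` call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical location for this example; in the notebook this is `mifaser_dir`
mifaser_dir = file.path(tempdir(), "bin", "mifaser")

# Same command as above, quoting the path in case it contains spaces
git_cmd = sprintf("git clone https://bitbucket.org/bromberglab/mifaser.git %s",
    shQuote(mifaser_dir))

# Only clone when the target folder does not exist yet
needs_clone = !dir.exists(mifaser_dir)

if (needs_clone) {
    message("Run in a terminal, or uncomment the next line: ", git_cmd)
    # system(git_cmd)
}
```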

## Create samples file
Similar to the file we passed to taxprofiler, we'll need to create a file with the name of the sample and the files corresponding to forward and reverse reads.

Importantly, this file needs to have a first column called `ArrayTaskID` with the number of the sample (1 for first sample, 2 for second and so on).

**Note** that in this case we'll need the clean reads, not the raw reads.

In [None]:
# Expected columns: ArrayTaskID     Sample_name     Forward Reverse
# List clean sequences
clean_seq_list = list.files(seq_dir,
        pattern = "merged\\.fastq\\.gz$",
        full.names = TRUE)

# Forward reads
forward_reads = clean_seq_list %>%
    str_subset("_1\\.merged")

# Reverse reads
reverse_reads = clean_seq_list %>%
    str_subset("_2\\.merged")

clean_seq_list

In [None]:
data.frame(forward = forward_reads, # Full path of forward reads
        reverse = reverse_reads)

In [None]:
# Combine
reads_tax_df = data.frame(Forward = forward_reads, # Full path of forward reads
        Reverse = reverse_reads) %>% # Full path of reverse reads
    mutate(Sample_name = basename(Forward), # Sample name from the file
        Sample_name = str_remove(Sample_name, "_[0-9]\\.merged.*"),
        ArrayTaskID = row_number()) %>%
    relocate(ArrayTaskID, Sample_name, Forward, Reverse) # Reorder columns

reads_tax_df %>%
    head()
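
Because the table is built by placing the forward and reverse vectors side by side, it relies on both having the same length and the same sample order. A quick check along these lines may help catch a missing file before submitting the jobs. This is only a sketch: the file names are made up, and `sample_from_path()` is a hypothetical helper that assumes the `_1.merged` / `_2.merged` naming used above.

```r
# Toy paths standing in for `forward_reads` and `reverse_reads` (made-up names)
forward_reads = c("/seqs/sampleA_1.merged.fastq.gz", "/seqs/sampleB_1.merged.fastq.gz")
reverse_reads = c("/seqs/sampleA_2.merged.fastq.gz", "/seqs/sampleB_2.merged.fastq.gz")

# Strip the read suffix to recover the sample name from each file
sample_from_path = function(path) {
    sub("_[0-9]\\.merged.*$", "", basename(path))
}

forward_names = sample_from_path(forward_reads)
reverse_names = sample_from_path(reverse_reads)

# Both vectors must have the same length and the same sample in each position
pairs_ok = length(forward_reads) == length(reverse_reads) &&
    all(forward_names == reverse_names)

# Sample names should be unique, since they are used to name the output folders
names_unique = !any(duplicated(forward_names))

# Stop with an error if any of the checks fails
stopifnot(pairs_ok, names_unique)
```

With the real data, drop the two toy vectors and run the checks on the `forward_reads` and `reverse_reads` objects created above.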

In [None]:
# Write samples file
mifaser_samplesfile = file.path(sheets_dir, "samples_file_mifaser.tsv")
write_tsv(reads_tax_df,
    file = mifaser_samplesfile)

## Create slurm script

In [None]:
mifaser_slurm_raw = str_glue(.open = "[", .close = "]",
"#!/bin/bash
##############################
#       Parameters           #
##############################

# This section tells the cluster which resources your job needs.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=[[job_name]]

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=16

# Specify the total memory required per node
#SBATCH --mem=64G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=[[array_jobs]]

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
CONDA_PATH='[[conda_install]]'
echo ${CONDA_PATH}
source ${CONDA_PATH}/etc/profile.d/conda.sh

# Specify the path to the config file
config=[[samples_file]]

# Extract the sample name for the current $SLURM_ARRAY_TASK_ID
sample=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)

# Extract the path to the forward read for the current $SLURM_ARRAY_TASK_ID
forward=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $3}' $config)

# Extract the path to the reverse read for the current $SLURM_ARRAY_TASK_ID
reverse=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $4}' $config)

# Print a message to the job output file with the current $SLURM_ARRAY_TASK_ID, sample name and read files
echo This is array task ${SLURM_ARRAY_TASK_ID}, the sample name is ${sample} the forward read is ${forward} and the reverse is ${reverse}

# do your real computation
conda activate [[conda_env]]
cd [[mifaser_repo]]
python -m mifaser --lanes ${forward} ${reverse} -o [[out_dir]]/${sample}_out -d GS-21-all -c 16
")

In [None]:
mifaser_slurm = str_glue(mifaser_slurm_raw,
        job_name = "mifaser_run", 
        array_jobs = str_c("1-", nrow(reads_tax_df)), # number of array jobs should be expressed as 1-<number of samples to run>, if 10 samples, 1-10
        conda_install = "/mnt/lustre/groups/maier/maide581/bin/miniconda3", # Path to your conda installation
        samples_file = mifaser_samplesfile, # Samples file we created above
        mifaser_repo = mifaser_dir, # Path to the mifaser git repo
        out_dir = out_dir,
        conda_env = conda_env, # Name of the conda environment to run mifaser, defined above
        .open = "[", .close = "]")

mifaser_slurm %>%
        print()
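
The script is filled in two passes, which is why the placeholders are written with double brackets. With `.open = "["` and `.close = "]"`, a doubled delimiter is an escape, so the first `str_glue()` call turns `[[job_name]]` into `[job_name]`, and the second call replaces it with the value. Shell variables such as `${SLURM_ARRAY_TASK_ID}` pass through untouched because curly braces are no longer delimiters. A small sketch with a one-line template (`template_raw` and `template_filled` are names used only here):

```r
library(stringr)

# First pass: doubled brackets are escapes, so [[job_name]] becomes [job_name]
template_raw = str_glue(.open = "[", .close = "]",
    "echo ${SLURM_ARRAY_TASK_ID} [[job_name]]")

# Second pass: the single-bracket placeholder is replaced by the value
template_filled = str_glue(template_raw,
    job_name = "mifaser_run",
    .open = "[", .close = "]")

print(template_raw)
print(template_filled)
```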

In [None]:
# Write file
mifaser_slurmfile = file.path(bin_dir, "mifaser_slurm.sh")
write_lines(mifaser_slurm, mifaser_slurmfile)
mifaser_slurmfile
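
Once the script is written, it still has to be submitted to the cluster. A sketch of the submission command, assuming a standard Slurm setup where `sbatch` is available on the login node; the path below is a placeholder for `mifaser_slurmfile`, and `sbatch_cmd` is a name used only for this example.

```r
# Placeholder path; in the notebook this is `mifaser_slurmfile`
mifaser_slurmfile = "/PATH/TO/YOUR/PROJECT/FOLDER/bin/mifaser_slurm.sh"

# Build the command to submit the array job
sbatch_cmd = sprintf("sbatch %s", mifaser_slurmfile)

sbatch_cmd
```

Run the printed command in a terminal on the cluster. Since the script sets `--output=%x-%j.out` without a folder, the log files should end up in the directory from which you submit the job, and `squeue -u $USER` lists the tasks that are pending or running.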