# Metagenome functional profiling with mifaser
Jacobo de la Cuesta-Zuluaga. June 2025.

The aim of this notebook is to obtain the functional profile from metagenome samples.

## Before we start
This notebook assumes that the sequences already went through QC. In this case, we're using the output files from the `taxprofiler` pipeline, which performs sequence quality control and removal of host sequences. See notebook 01 for that. 

In addition, you need to have a `conda` environment with `python v.3.8` to run `mifaser`, the functional profiler. [See their repo here.](https://bitbucket.org/bromberglab/mifaser)
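Before going further, it can help to confirm that the environment exists. The sketch below assumes the environment is called `mifaser` (the name used later in this notebook) and that `conda env list` prints one environment per line with its name in the first column. `env_in_listing` is a hypothetical helper, not part of `conda` or `mifaser`, and the listing is made up for illustration.

```r
# Hypothetical helper: check whether an environment name appears in the
# text printed by `conda env list`
env_in_listing = function(listing, env_name) {
    # Drop empty lines and comment lines (starting with '#')
    listing = trimws(listing)
    listing = listing[listing != "" & !startsWith(listing, "#")]
    # The environment name is the first whitespace-separated field
    env_names = sub("\\s.*$", "", listing)
    env_name %in% env_names
}

# Example with a made-up listing; on the cluster you could instead use
# listing = system2("conda", c("env", "list"), stdout = TRUE)
example_listing = c("# conda environments:",
    "#",
    "base                  *  /home/user/miniconda3",
    "mifaser                  /home/user/miniconda3/envs/mifaser")

env_in_listing(example_listing, "mifaser")
```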

## Load libraries and set paths

First, we'll set up the libraries and the work directory where we'll save our files.

In [14]:
# Libraries
library(tidyverse)
library(conflicted)

In [15]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

[1m[22m[90m[conflicted][39m Removing existing preference.
[1m[22m[90m[conflicted][39m Will prefer [1m[34mdplyr[39m[22m::filter over any other package.


The following chunk will define the directories where the data is stored and where the output will be saved. The present example assumes everything will be contained in the same directory: `base_dir`. 

In [16]:
# Directories
# Base directory
base_dir = "/mnt/lustre/groups/maier/maide581/projects/Metemgee"

# Data
data_dir = file.path(base_dir, "data")

# Sequences
seq_dir = file.path(data_dir, "taxprofiler/analysis_ready_fastqs")

# Out
mifaser_dir = file.path(data_dir, "mifaser")
dir.create(mifaser_dir)

out_dir = file.path(mifaser_dir, "output")
dir.create(out_dir)

# sheets dir
sheets_dir = file.path(mifaser_dir, "sheets")
dir.create(sheets_dir)

# Software
bin_dir = file.path(base_dir, "bin")
dir.create(bin_dir)
conda_env = "mifaser"

“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/mifaser' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/mifaser/output' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/mifaser/sheets' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/bin' already exists”
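The warnings above appear because `dir.create()` complains when a directory already exists. They are harmless here. If you prefer a quieter run, a minimal sketch of an alternative follows; `make_dir` is a hypothetical helper, demonstrated on a temporary directory rather than on `base_dir`.

```r
# Hypothetical helper: create a directory only if it is missing
make_dir = function(path) {
    if (!dir.exists(path)) {
        # recursive = TRUE also creates missing parent directories
        dir.create(path, recursive = TRUE)
    }
    # Return the path invisibly so the call can be used in an assignment
    invisible(path)
}

# Demonstration in a temporary location
demo_dir = file.path(tempdir(), "mifaser_demo", "output")
make_dir(demo_dir)
make_dir(demo_dir) # Second call is silent: the directory is already there
dir.exists(demo_dir)
```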


## Download `mifaser`

Next, we'll download the repo of the functional profiler. I have found this is the easiest way, since it comes with all the software and databases needed.

The chunk below runs the command directly. Alternatively, you can paste the generated command in the terminal to download the repo.

In [17]:
# Download mifaser repo
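# Directory
# Note: this overwrites `mifaser_dir`, which from here on points to the repo
# instead of the data directory defined above
mifaser_dir = file.path(bin_dir, "mifaser/")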

# Command
git_cmd = str_glue("git clone https://bitbucket.org/bromberglab/mifaser.git {mifaser_dir}",
    mifaser_dir = mifaser_dir)

system(git_cmd)

## Create samples file
Similar to the file we passed to taxprofiler, we'll need to create a file with the name of the sample and the files corresponding to forward and reverse reads.

Importantly, this file needs to have a first column called `ArrayTaskID` with the number of the sample (1 for first sample, 2 for second and so on).

**Note** that in this case we'll need the clean reads, not the raw reads.

In [18]:
# List clean sequences
clean_seq_list = list.files(seq_dir,
        pattern = "merged\\.fastq\\.gz$",
        full.names = TRUE)
# Forward reads
forward_reads = clean_seq_list %>%
    str_subset("_1\\.merged")
# Reverse reads
reverse_reads = clean_seq_list %>%
    str_subset("_2\\.merged")

clean_seq_list

In [19]:
# Combine lists of files to create a data frame
reads_tax_df = data.frame(Forward = forward_reads, # Full path of forward reads
        Reverse = reverse_reads) %>% # Full path of reverse reads
    mutate(Sample_name = basename(Forward), # Sample name from the file
        Sample_name = str_remove(Sample_name, "_[0-9]\\.merged.*"),
        ArrayTaskID = row_number()) %>%
    relocate(ArrayTaskID, Sample_name, Forward, Reverse) # Reorder columns

reads_tax_df %>%
    head()

| | ArrayTaskID | Sample_name | Forward | Reverse |
|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | MI-142-H | /mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-142-H_1.merged.fastq.gz | /mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-142-H_2.merged.fastq.gz |
| 2 | 2 | MI-237-H | /mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-237-H_1.merged.fastq.gz | /mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-237-H_2.merged.fastq.gz |


In [20]:
# Write samples file
mifaser_samplesfile = file.path(sheets_dir, "samples_file_mifaser.tsv")
write_tsv(reads_tax_df,
    file = mifaser_samplesfile)
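Because forward and reverse files are paired purely by their order in the two lists, a quick consistency check can catch a missing file before any job is submitted. The following sketch assumes the `<sample>_1.merged.fastq.gz` / `<sample>_2.merged.fastq.gz` naming shown above; `sample_from_path` is a hypothetical helper and the file names are toy examples.

```r
# Hypothetical helper: strip the directory and the read suffix from a path
sample_from_path = function(paths) {
    sub("_[12]\\.merged.*$", "", basename(paths))
}

# Toy file names following the naming pattern used in this notebook
forward_demo = c("/data/MI-142-H_1.merged.fastq.gz",
    "/data/MI-237-H_1.merged.fastq.gz")
reverse_demo = c("/data/MI-142-H_2.merged.fastq.gz",
    "/data/MI-237-H_2.merged.fastq.gz")

# Both lists should have the same length and the same sample names, in order
stopifnot(length(forward_demo) == length(reverse_demo))
stopifnot(identical(sample_from_path(forward_demo),
    sample_from_path(reverse_demo)))

sample_from_path(forward_demo)
```

With the real data, the same two `stopifnot()` calls can be run on `forward_reads` and `reverse_reads`.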

## Create slurm script

To make use of the HPC, we need to create a bash script to submit the jobs using slurm. The following chunks will create and fill the script based on the template; you don't need to modify anything.
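The template in the next chunk uses doubled square brackets such as `[[job_name]]` on purpose. With `.open = "["` and `.close = "]"`, `str_glue()` treats a doubled delimiter as an escape, so the first pass leaves a literal `[job_name]` behind and does not touch the `${...}` variables that bash needs. A second pass then fills in the values. The toy template below is a sketch of that two-pass behaviour; `toy_raw` and `toy_filled` are illustrative names only.

```r
library(stringr)

# First pass: doubled square brackets are escaped and come out as single
# brackets; ${...} is untouched because braces are not delimiters here
toy_raw = str_glue(.open = "[", .close = "]",
    "#SBATCH --job-name=[[job_name]]
echo ${SLURM_JOB_ID}")

# Second pass: the single brackets are now filled with real values
toy_filled = str_glue(toy_raw,
    job_name = "mifaser_run",
    .open = "[", .close = "]")

toy_filled
```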

In [None]:
mifaser_slurm_raw = str_glue(.open = "[", .close = "]",
"#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=[[job_name]]

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=16

# Specify the total memory required per node
#SBATCH --mem=64G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=[[array_jobs]]%10

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
source $HOME/.bashrc

# Specify the path to the config file
config=[[samples_file]]

# Extract the sample name for the current $SLURM_ARRAY_TASK_ID
sample=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)

# Extract the path to the forward read for the current $SLURM_ARRAY_TASK_ID
forward=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $3}' $config)

# Extract the path to the reverse read for the current $SLURM_ARRAY_TASK_ID
reverse=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $4}' $config)

# Print to a file a message that includes the current $SLURM_ARRAY_TASK_ID and sample name
echo This is array task ${SLURM_ARRAY_TASK_ID}, the sample name is ${sample} the forward read is ${forward} and the reverse is ${reverse}

# do your real computation
conda activate [[conda_env]]
cd [[mifaser_repo]]
python -m mifaser --lanes ${forward} ${reverse} -o [[out_dir]]/${sample}_out -d GS-21-all -c 16
")

In [22]:
mifaser_slurm = str_glue(mifaser_slurm_raw,
        job_name = "mifaser_run", 
        array_jobs = str_c("1-", nrow(reads_tax_df)), # number of array jobs should be expressed as 1-<number of samples to run>, if 10 samples, 1-10
        samples_file = mifaser_samplesfile, # Samples file we created above
        mifaser_repo = mifaser_dir, # Path to the mifaser git repo
        out_dir = out_dir,
        conda_env = conda_env, # Name of conda environment to run mifaser, defined above
        .open = "[", .close = "]") 

mifaser_slurm %>%
        print()

#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=mifaser_run

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=16

# Specify the total memory required per node
#SBATCH --mem=64G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=1-2

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
source $HOME/.bashrc

# Specify the path to the config file
config=/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/mifaser/sheets/samples_file_mifaser.tsv

# Extract the sample name f

In [23]:
# Write file
mifaser_slurmfile = file.path(base_dir, "bin/mifaser_slurm.sh")
write_lines(mifaser_slurm, mifaser_slurmfile)
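Each array task finds its own sample by matching `$SLURM_ARRAY_TASK_ID` against the first column of the samples file. To check what a given task would pick up before submitting anything, the same lookup can be sketched in R. The snippet assumes a tab-separated file with the four columns created above; the toy file and its paths are made up for illustration.

```r
# Toy samples file with the same columns as the one written above
toy_samples = data.frame(ArrayTaskID = 1:2,
    Sample_name = c("MI-142-H", "MI-237-H"),
    Forward = c("/data/MI-142-H_1.merged.fastq.gz",
        "/data/MI-237-H_1.merged.fastq.gz"),
    Reverse = c("/data/MI-142-H_2.merged.fastq.gz",
        "/data/MI-237-H_2.merged.fastq.gz"))
toy_file = tempfile(fileext = ".tsv")
write.table(toy_samples, toy_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Equivalent of the awk lookups in the script, for a given task number
task_id = 2
config = read.delim(toy_file, stringsAsFactors = FALSE)
task_row = config[config$ArrayTaskID == task_id, ]

task_row$Sample_name
task_row$Forward
task_row$Reverse
```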

Finally, you can execute `mifaser` using:

In [24]:
# Command
str_glue("cd {sheets_dir} && sbatch {slurmfile}",
         sheets_dir = sheets_dir,
         slurmfile = mifaser_slurmfile)

In [25]:
stop("Downstream steps are to be done after mifaser finished executing")

ERROR: Error: Downstream steps are to be done after mifaser finished executing
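To see which samples already have results, you can inspect the output directory. The sketch below assumes, as in the slurm script and the merging step of this notebook, that each sample writes to `<out_dir>/<sample>_out` and that a finished run contains a file with `analysis` in its name. `finished_samples` is a hypothetical helper, shown on a temporary directory with a toy result file.

```r
# Hypothetical helper: return the samples whose output directory
# contains at least one file with "analysis" in its name
finished_samples = function(out_dir, sample_names) {
    has_result = vapply(sample_names, function(sample) {
        sample_dir = file.path(out_dir, paste0(sample, "_out"))
        length(list.files(sample_dir, pattern = "analysis")) > 0
    }, logical(1))
    sample_names[has_result]
}

# Demonstration: only the first toy sample has a result file
demo_out = file.path(tempdir(), "mifaser_check")
dir.create(file.path(demo_out, "MI-142-H_out"),
    recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_out, "MI-237-H_out"),
    recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_out, "MI-142-H_out", "analysis.tsv"))

finished_samples(demo_out, c("MI-142-H", "MI-237-H"))
```

With the real data, the call would be `finished_samples(out_dir, reads_tax_df$Sample_name)`.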


## Merge tables
The output of `mifaser` is one table per sample. To generate a single merged table with annotations, run the following chunks.

In [26]:
# Download EC annotation file
# Retrieved from the HUMAnN 3 repo
EC_table = "https://github.com/biobakery/humann/raw/a9f181f32b3c66b66b73cabc611ff3ac55d87033/humann/data/utility_DEMO/map_level4ec_name.txt.gz" %>%
    read_tsv(col_names = c("EC_Number", "Annot"))

Rows: 7957 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): EC_Number, Annot

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [27]:
# Read output files and create a single table
EC_table_long = out_dir %>%
    list.files(full.names = TRUE, recursive = TRUE,pattern = "analysis") %>%
    map_df(function(filename){
        # Name of sample: the parent directory is named <sample>_out
        sample_name = filename %>%
            dirname() %>%
            basename() %>%
            str_remove("_out$")
            
        # Read tables and add sample name
        filename %>%
            read_tsv(skip = 1,col_names = c("EC_Number", "Count")) %>%
            mutate(Sample = sample_name)
            }) %>%
    left_join(EC_table) %>%
    select(Sample, EC_Number, Annot, Count)

Rows: 1516 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): EC_Number
dbl (1): Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1425 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): EC_Number
dbl (1): Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(EC_Number)`


In [32]:
# Create wide table
EC_table = EC_table_long %>%
    pivot_wider(id_cols = c(EC_Number, Annot),
    names_from = Sample,   
    values_from = Count, 
    values_fill = 0)

EC_table %>%
    head()

| EC_Number | Annot | MI-142-H | MI-237-H |
|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 1.1.1.1 | Alcohol dehydrogenase | 3056 | 567 |
| 1.1.1.2 | Alcohol dehydrogenase (NADP(+)) | 44 | 72 |
| 1.1.1.3 | Homoserine dehydrogenase | 1519 | 1895 |
| 1.1.1.4 | (R,R)-butanediol dehydrogenase | 262 | 11 |
| 1.1.1.5 | Transferred entry: 1.1.1.303 and 1.1.1.304 | 18 | 8 |
| 1.1.1.6 | Glycerol dehydrogenase | 2354 | 1317 |


In [33]:
# Write table
# You can change the output directory or the name of the file if you wish
# By default it is located in the mifaser output directory (out_dir)
out_file = file.path(out_dir, "Merged_mifaser_out.tsv.gz")
write_tsv(EC_table, out_file)
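A note on the wide table created above: `values_fill = 0` matters because an EC number that is absent from a sample has no row for that sample in the long table, and would otherwise show up as `NA` after pivoting. A toy example with made-up sample names and counts sketches the effect:

```r
library(tidyr)

# Long table where EC 1.1.1.2 was only detected in sample A
toy_long = data.frame(Sample = c("A", "A", "B"),
    EC_Number = c("1.1.1.1", "1.1.1.2", "1.1.1.1"),
    Count = c(10, 5, 7))

# Missing sample/EC combinations become 0 instead of NA
toy_wide = pivot_wider(toy_long,
    id_cols = EC_Number,
    names_from = Sample,
    values_from = Count,
    values_fill = 0)

toy_wide
```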