# Execute metagenome functional profile with Paladin
Jacobo de la Cuesta-Zuluaga. June 2025.

The aim of this notebook is to obtain the functional profile from metagenome samples.


## Before we start
This notebook assumes that the sequences already went through QC. In this case, we're using the output files from the `taxprofiler` pipeline, which performs sequence quality control and removal of host sequences. See notebook 01 for that. 

In addition, you need to have a `conda` environment with `paladin` installed. [See their repo here.](https://github.com/ToniWestbrook/paladin)

## Load libraries and set paths

In [77]:
# Libraries
library(tidyverse)
library(conflicted)

In [78]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

[1m[22m[90m[conflicted][39m Removing existing preference.
[1m[22m[90m[conflicted][39m Will prefer [1m[34mdplyr[39m[22m::filter over any other package.


In [79]:
# Directories
# Base directory
base_dir = "/mnt/lustre/groups/maier/maide581/projects/Metemgee"

# Data
data_dir = file.path(base_dir, "data")

# Sequences
seq_dir = file.path(data_dir, "taxprofiler/analysis_ready_fastqs")

# Out
paladin_dir = file.path(data_dir, "paladin")
dir.create(paladin_dir)

# Paladin output
out_dir = file.path(paladin_dir, "output")
dir.create(out_dir)

# Sheets dir
sheets_dir = file.path(paladin_dir, "sheets")
dir.create(sheets_dir)

# tmp dir
tmp_dir = tempdir()

# paladin index
paladin_index = "/mnt/lustre/groups/maier/databases/Paladin/uhgp-90.faa"

conda_env = "paladin"

“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/paladin' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/paladin/output' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/paladin/sheets' already exists”


We will use the sequences we processed previously: two samples that were quality-controlled with the `nf-core/taxprofiler` pipeline. For instructions on how to retrieve the data and perform QC, see the `01_Run_QC_Taxprofiler.ipynb` notebook.

## Execute Paladin

To execute `paladin` we'll need an indexed reference. For general usage with human gut metagenome samples, we can use the Unified Human Gastrointestinal Genome (UHGG) protein catalog (here, `uhgp-90`). To see how the index was created, or to create your own, see the notebook in the `Metemgee/helper_scripts/paladin_index` folder.

### Create samples file
Similar to the file we passed to taxprofiler, we'll need to create a file with the name of each sample and the path to its reads. In this notebook only the forward reads (`_1` files) are passed to `paladin`, so the file has a single `Forward` column.

Importantly, this file needs to have a first column called `ArrayTaskID` with the number of the sample (1 for first sample, 2 for second and so on).

**Note** that in this case we'll need the clean reads, not the raw reads.
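
As a minimal sketch of how this file is consumed, the toy example below writes a samples file to a temporary location and then looks up the row for one array task, mirroring what each job does with `SLURM_ARRAY_TASK_ID`. The sample names and paths are made up, the `lookup_task` helper is hypothetical, and only base R is used so that it runs without extra packages.

```r
# Toy samples table; sample names and paths are made up for illustration
toy_samples = data.frame(
    ArrayTaskID = 1:2,
    Sample_name = c("sample_A", "sample_B"),
    Forward = c("/path/to/sample_A_1.merged.fastq.gz",
                "/path/to/sample_B_1.merged.fastq.gz"))

# Write it as a tab-separated file, as the job array expects
toy_file = tempfile(fileext = ".tsv")
write.table(toy_samples, toy_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical helper: return the row matching a given array task ID
lookup_task = function(samples_file, task_id) {
    samples = read.delim(samples_file, stringsAsFactors = FALSE)
    # ArrayTaskID must run from 1 to the number of samples, without gaps
    stopifnot(identical(as.integer(samples$ArrayTaskID), seq_len(nrow(samples))))
    samples[samples$ArrayTaskID == task_id, ]
}

task_row = lookup_task(toy_file, task_id = 2)
print(task_row$Sample_name)
print(task_row$Forward)
```

If the `ArrayTaskID` column has gaps or does not start at 1, the helper stops with an error, which is cheaper to find here than in a failed array job.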

In [80]:
# Create samples file
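# List clean sequences (forward reads only)
# The pattern is a regular expression, so dots are escaped and the end is anchored
clean_seq_list = list.files(seq_dir,
        pattern = "_1\\.merged\\.fastq\\.gz$",
        full.names = TRUE)

# Create a data frame with one row per sample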
reads_df = data.frame(Forward = clean_seq_list) %>%
    mutate(Sample_name = basename(Forward), # Sample name from the file
        Sample_name = str_remove(Sample_name, "_[0-9]\\.merged.*"),
        ArrayTaskID = row_number()) %>%
    relocate(ArrayTaskID, Sample_name, Forward) # Reorder columns

reads_df %>%
    head()

ArrayTaskID,Sample_name,Forward
<int>,<chr>,<chr>
1,MI-142-H,/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-142-H_1.merged.fastq.gz
2,MI-237-H,/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-237-H_1.merged.fastq.gz


In [81]:
# Write samples file
paladin_samplesfile = file.path(data_dir, "samples_file_paladin.tsv")
write_tsv(reads_df,
    file = paladin_samplesfile)

In [None]:
paladin_array_slurm_raw = str_glue(.open = "[", .close = "]",
"#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=[[job_name]]

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=[[cpu]]

# Specify the total memory required per node
#SBATCH --mem=[[mem]]

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=[[array_jobs]]%10

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
source $HOME/.bashrc

# Specify the path to the config file
config=[[samples_file]]

# Extract the sample name for the current $SLURM_ARRAY_TASK_ID
sample=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)

# Extract the path to the forward read for the current $SLURM_ARRAY_TASK_ID
forward=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $3}' $config)

# Print to a file a message that includes the current $SLURM_ARRAY_TASK_ID and sample name
echo This is array task ${SLURM_ARRAY_TASK_ID}, the sample name is ${sample} the forward read is ${forward}

# do your real computation
# Activate conda
conda activate [[conda_env]]
cd [[out_dir]]

# Create tmp dir
base_tmp='[[tmp_dir]]'
tmp_dir=${base_tmp}/${sample}'_tmp'
mkdir -p ${tmp_dir}

# Execute paladin and create sorted bam file
paladin align -t [[cpu]] [[index]] ${forward} | \
    samtools view -@ [[cpu]] -b - | \
    samtools sort -@ [[cpu]] - > ${tmp_dir}/${sample}.sorted.bam

# Extract counts
samtools index -@ [[cpu]] ${tmp_dir}/${sample}.sorted.bam
samtools idxstats -@ [[cpu]] ${tmp_dir}/${sample}.sorted.bam | gzip > ${sample}.counts.gz
")

In [83]:
paladin_array_slurm = str_glue(paladin_array_slurm_raw,
        job_name = "paladin_array", 
        array_jobs = str_c("1-", nrow(reads_df)), # number of array jobs should be expressed as 1-<number of samples to run>, if 10 samples, 1-10
        samples_file = paladin_samplesfile, # Samples file we created above
        index = paladin_index,
        out_dir = out_dir,
        tmp_dir = tmp_dir,
        cpu = 16,
        mem = "64G",
        conda_env = conda_env, # Name of conda environment, defined above
        .open = "[", .close = "]") 

paladin_array_slurm %>%
        print()

#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=paladin_array

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=16

# Specify the total memory required per node
#SBATCH --mem=64G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=1-2

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
source $HOME/.bashrc

# Specify the path to the config file
config=/mnt/lustre/groups/maier/maide581/projects/Metemgee/data/samples_file_paladin.tsv

# Extract the sample name for the curren
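
The script template uses doubled square brackets because it goes through `str_glue` twice. The sketch below uses a made-up two-line template to illustrate the behaviour. With `[` and `]` as delimiters, a doubled delimiter is an escape, so the first pass turns `[[cpu]]` into `[cpu]`, and the second pass replaces `[cpu]` with a value. Shell variables such as `${SLURM_JOB_ID}` are left untouched because curly braces are no longer special.

```r
library(stringr)

# First pass: doubled delimiters are escapes and collapse to single brackets
toy_template = str_glue(.open = "[", .close = "]",
"#SBATCH --cpus-per-task=[[cpu]]
echo ${SLURM_JOB_ID}")
print(toy_template)

# Second pass: single brackets are now placeholders and get filled in
toy_script = str_glue(toy_template, cpu = 16, .open = "[", .close = "]")
print(toy_script)
```

Keeping the raw template and the filled-in script as separate objects means the same template can be reused with different resources without editing the script text.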

In [84]:
# Write file
array_slurmfile = file.path(sheets_dir, "array_slurm.sh")
write_lines(paladin_array_slurm, array_slurmfile)

In [85]:
# Command to submit the job array; run it in a terminal on the cluster
str_glue("cd {sheets_dir} && sbatch {slurmfile}",
         sheets_dir = sheets_dir,
         slurmfile = array_slurmfile)

In [86]:
stop("Downstream steps are to be done after paladin finished executing")

ERROR: Error: Downstream steps are to be done after paladin finished executing


## Merge tables
The previous step produces one counts table per sample (`<sample>.counts.gz`). To generate a single merged counts table and the matching annotations, run the following chunks.
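
Before merging, it can help to confirm that every sample in the samples file produced a counts file. The sketch below illustrates such a check on a temporary directory with made-up sample names. `find_missing_counts` is a hypothetical helper, and with the real data one would pass `reads_df$Sample_name` and `out_dir` instead.

```r
# Hypothetical helper: which expected samples lack a <sample>.counts.gz file?
find_missing_counts = function(sample_names, counts_dir) {
    found_files = list.files(counts_dir, pattern = "\\.counts\\.gz$")
    found_samples = sub("\\.counts\\.gz$", "", found_files)
    setdiff(sample_names, found_samples)
}

# Toy demonstration: two expected samples, only one has an output file
toy_dir = file.path(tempdir(), "toy_paladin_output")
dir.create(toy_dir, showWarnings = FALSE)
file.create(file.path(toy_dir, "sample_A.counts.gz"))

missing_samples = find_missing_counts(c("sample_A", "sample_B"), toy_dir)
print(missing_samples)
```

An empty result means every expected sample has a counts file. Any name that is returned points to an array task whose log is worth inspecting.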

In [91]:
# Read output files and create a single table
UGHH_table_raw = out_dir %>%
    list.files(full.names = TRUE, pattern = "\\.counts\\.gz$") %>%
    map_df(function(filename){
        # Name of sample
        sample_name = basename(filename) %>% 
            str_remove("\\.counts\\.gz$")
            
        # Read tables and add sample name
        # Column names follow the `samtools idxstats` output
        filename %>%
            read_tsv(col_names = c("Gene", "Length", "Mapped", "Unmapped"), show_col_types = FALSE) %>%
            mutate(Sample = sample_name)
            })

In [92]:
# Number of mapped reads
Mapped_reads = UGHH_table_raw %>% 
    filter(Gene != "*") %>% 
    group_by(Sample) %>% 
    summarise(Total_mapped = sum(Mapped)) %>% 
    ungroup()

# Percentage of unmapped reads
# In the idxstats output, the "*" row holds the reads that did not map to any gene
Unmapped_reads = UGHH_table_raw %>% 
    filter(Gene == "*") %>% 
    select(Sample, Unmapped) %>% 
    left_join(Mapped_reads, by = "Sample") %>% 
    mutate(Unmapped_per = round((Unmapped/(Unmapped + Total_mapped)*100), 2))


Unmapped_reads

Sample,Unmapped,Total_mapped,Unmapped_per
<chr>,<dbl>,<dbl>,<dbl>
MI-142-H,1263266,20446606,5.82
MI-237-H,477818,11920522,3.85
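
As a quick check of the percentage column, the value for MI-142-H in the table above can be recomputed by hand: unmapped reads divided by the sum of unmapped and mapped reads, times 100.

```r
# Values for MI-142-H taken from the table above
unmapped = 1263266
total_mapped = 20446606

# Percentage of reads that did not map to the protein catalog
unmapped_per = round(unmapped / (unmapped + total_mapped) * 100, 2)
print(unmapped_per)
```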


In [93]:
# Counts table
# Keep genes with at least one mapped read and pivot to one column per sample
UGHH_table_wide = UGHH_table_raw %>% 
    filter(Mapped > 0) %>% 
    pivot_wider(id_cols = Gene, 
                names_from = Sample, 
                values_from = Mapped, 
                values_fill = 0)

In [96]:
# UHGG eggNOG file
# Only contains the genes detected in the samples
eggNOG_annotation = "/mnt/lustre/groups/maier/databases/UHGG/Protein_catalog/uhgp-90/uhgp-90_eggNOG.tsv" %>% 
    read_tsv() %>% 
    rename("Gene" = "#query")  %>% 
    filter(Gene %in% UGHH_table_wide$Gene) %>% 
    select(Gene, eggNOG_OG = eggNOG_OGs, COG_category, Description, Preferred_name, EC, KEGG_ko, KEGG_Module, KEGG_Pathway) %>% 
    mutate(eggNOG_OG = str_extract(eggNOG_OG, ".*root"),
           eggNOG_OG = str_remove_all(eggNOG_OG, fixed("@1|root")))

Rows: 10271996 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (19): #query, seed_ortholog, eggNOG_OGs, max_annot_lvl, COG_category, De...
dbl  (2): evalue, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [98]:
# Write counts table
# You can change the output directory or the name of the file if you wish
# By default it is located in the paladin output directory
count_file = file.path(out_dir, "Merged_paladin_counts.tsv.gz")
write_tsv(UGHH_table_wide, count_file)

# Write annotation table
annotation_file = file.path(out_dir, "Merged_paladin_annotation.tsv.gz")
write_tsv(eggNOG_annotation, annotation_file)
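
The two tables share the `Gene` column, so they can be combined downstream. The sketch below is a minimal illustration with made-up gene identifiers, counts and KEGG orthologs. None of these values come from the real data, and base R `merge` is used only so the example runs without extra packages.

```r
# Toy counts table in the same wide layout: one row per gene, one column per sample
toy_counts = data.frame(
    Gene = c("gene_1", "gene_2", "gene_3"),
    sample_A = c(10, 0, 5),
    sample_B = c(3, 7, 0))

# Toy annotation table; gene_3 has no annotation
toy_annotation = data.frame(
    Gene = c("gene_1", "gene_2"),
    KEGG_ko = c("ko:K00001", "ko:K00002"))

# Left join on Gene keeps every gene in the counts table
toy_merged = merge(toy_counts, toy_annotation, by = "Gene", all.x = TRUE)
print(toy_merged)
```

Genes without an annotation keep their counts and get `NA` in the annotation columns, so no reads are dropped by the join.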