# Execute metagenome functional profile with Paladin
Jacobo de la Cuesta-Zuluaga. June 2025.

The aim of this notebook is to obtain the functional profile from metagenome samples.


## Before we start
This notebook assumes that the sequences have already gone through quality control (QC). In this case, we're using the output files from the `taxprofiler` pipeline, which performs sequence quality control and removes host sequences. See notebook 01 for that.

In addition, you need to have a `conda` environment with `paladin` installed. [See their repo here.](https://github.com/ToniWestbrook/paladin)

## Load libraries and set paths

In [2]:
# Libraries
library(tidyverse)
library(conflicted)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

[conflicted] Will prefer dplyr::filter over any other package.


In [4]:
# Directories
# Base directory
base_dir = "/PATH/TO/YOUR/PROJECT/FOLDER"

# Data
data_dir = file.path(base_dir, "data")

# Out
paladin_dir = file.path(data_dir, "paladin")
dir.create(paladin_dir, recursive = TRUE, showWarnings = FALSE)

# Paladin output
out_dir = file.path(paladin_dir, "output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Sheets dir
sheets_dir = file.path(paladin_dir, "sheets")
dir.create(sheets_dir, recursive = TRUE, showWarnings = FALSE)

# tmp dir (tempdir() already exists, so there is nothing to create)
tmp_dir = tempdir()

# Software
bin_dir = file.path(base_dir, "bin")
dir.create(bin_dir, recursive = TRUE, showWarnings = FALSE)

conda_env = "paladin"


We will use the sequences we have previously processed. These are two samples that were quality-controlled with the `nf-core/taxprofiler` pipeline. For instructions on how to retrieve the sequences and perform QC, see the `01_Run_QC_Taxprofiler.ipynb` notebook.

In [8]:
# Sequences
seq_dir = "/mnt/lustre/groups/maier/maide581/projects/Small_projects/diamond_Metemgee/data/taxprofiler/analysis_ready_fastqs"
list.files(seq_dir)

## Execute Paladin

To execute `paladin` we'll need an indexed reference. For general usage with human metagenome samples, we can use the Unified Human Gastrointestinal Genome (UHGG) protein catalog. To see how the index was created, or to create your own, see the notebook in the `Metemgee/helper_scripts/paladin_index` folder.

### Create samples file
Similar to the file we passed to taxprofiler, we'll need to create a file with the name of each sample and the path to its reads. Here we only list the forward-read file (`_1.merged.fastq.gz`) of each sample.

Importantly, this file needs to have a first column called `ArrayTaskID` with the number of the sample (1 for first sample, 2 for second and so on).

**Note** that in this case we'll need the clean reads, not the raw reads.

In [None]:
# Create samples file
# List clean sequences (forward reads only)
clean_seq_list = list.files(seq_dir,
        pattern = "_1\\.merged\\.fastq\\.gz$", # Escape the dots and anchor at the end of the name
        full.names = TRUE)

# Create a data frame with one row per sample
reads_df = data.frame(Forward = clean_seq_list) %>%
    mutate(Sample_name = basename(Forward), # Sample name from the file
        Sample_name = str_remove(Sample_name, "_[0-9]\\.merged.*"),
        ArrayTaskID = row_number()) %>%
    relocate(ArrayTaskID, Sample_name, Forward) # Reorder columns

reads_df %>%
    head()

| | ArrayTaskID | Sample_name | Forward |
|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` |
| 1 | 1 | MI-142-H_run | /mnt/lustre/groups/maier/maide581/projects/Small_projects/diamond_Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-142-H_run_1.merged.fastq.gz |
| 2 | 2 | MI-237-H_run | /mnt/lustre/groups/maier/maide581/projects/Small_projects/diamond_Metemgee/data/taxprofiler/analysis_ready_fastqs/MI-237-H_run_1.merged.fastq.gz |


In [7]:
# Write samples file
paladin_samplesfile = file.path(data_dir, "samples_file_paladin.tsv")
write_tsv(reads_df,
    file = paladin_samplesfile)

ERROR: Error: Cannot open file for writing:
* '/PATH/TO/YOUR/PROJECT/FOLDER/data/samples_file_paladin.tsv'
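
Before submitting anything, it can help to check the samples sheet. The job script further down looks up each sample with `awk`, which splits columns on whitespace by default, so a space in a sample name or path would shift the columns. The following is a minimal sketch of such a check in base R; the function name `check_samples_sheet` and the toy sample names and paths are hypothetical and only serve as illustration.

```r
# Return a character vector describing problems found in a samples sheet
# (hypothetical helper, not part of paladin or taxprofiler)
check_samples_sheet = function(samples_df) {
    problems = character(0)

    # Array task IDs must run from 1 to the number of samples, in order
    if (!identical(as.integer(samples_df$ArrayTaskID), seq_len(nrow(samples_df)))) {
        problems = c(problems, "ArrayTaskID is not the sequence 1..n")
    }

    # Sample names must be unique, otherwise output files overwrite each other
    if (anyDuplicated(samples_df$Sample_name) > 0) {
        problems = c(problems, "Duplicated values in Sample_name")
    }

    # awk splits fields on whitespace by default, so spaces break the lookup
    has_space = grepl("[[:space:]]", samples_df$Sample_name) |
        grepl("[[:space:]]", samples_df$Forward)
    if (any(has_space)) {
        problems = c(problems, "Whitespace in Sample_name or Forward")
    }

    problems
}

# Toy sheet with hypothetical sample names and paths; the second path has a space
toy_df = data.frame(ArrayTaskID = 1:2,
    Sample_name = c("sample_A", "sample_B"),
    Forward = c("/toy/path/sample_A_1.merged.fastq.gz",
        "/toy/path/my reads/sample_B_1.merged.fastq.gz"))

check_samples_sheet(toy_df)
```

An empty result means none of these checks failed. In the notebook, the same function could be applied to `reads_df` before writing the file.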


In [None]:
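# SLURM array script template
# Placeholders are written as [[name]]: this first str_glue() pass turns them into [name],
# and a second str_glue() pass fills them in with the actual values
paladin_array_slurm_raw = str_glue(.open = "[", .close = "]",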
"#!/bin/bash
##############################
#       Parameters           #
##############################

# This section tells the cluster which resources your job needs.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=[[job_name]]

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=[[cpu]]

# Specify the total memory required per node
#SBATCH --mem=[[mem]]

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# Specify number of array jobs
#SBATCH --array=[[array_jobs]]

# job information
scontrol show job ${SLURM_JOB_ID}

# per node
# prep
source $HOME/.bashrc

# Specify the path to the config file
config=[[samples_file]]

# Extract the sample name for the current $SLURM_ARRAY_TASK_ID
sample=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)

# Extract the path to the forward read for the current $SLURM_ARRAY_TASK_ID
forward=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $3}' $config)

# Print to a file a message that includes the current $SLURM_ARRAY_TASK_ID and sample name
echo This is array task ${SLURM_ARRAY_TASK_ID}, the sample name is ${sample} the forward read is ${forward}

# do your real computation
# Activate conda
conda activate [[conda_env]]
cd [[out_dir]]

# Create tmp dir
base_tmp='[[tmp_dir]]'
tmp_dir=${base_tmp}/${sample}'_tmp'
mkdir -p ${tmp_dir}

# Execute paladin and create sorted bam file
paladin align -t [[cpu]] [[index_dir]] ${forward} | \
    samtools view -@ [[cpu]] -b - | \
    samtools sort -@ [[cpu]] - > ${tmp_dir}/${sample}.sorted.bam

# Extract counts
samtools index -@ [[cpu]] ${tmp_dir}/${sample}.sorted.bam
samtools idxstats -@ [[cpu]] ${tmp_dir}/${sample}.sorted.bam > ${sample}.counts

rm -rf ${tmp_dir}
")
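
The template uses `[` and `]` as delimiters instead of the default braces, so that shell variables such as `${sample}` pass through untouched. Doubling a delimiter escapes it, which is why placeholders are written as `[[name]]`. The toy example below sketches that two-pass behaviour on a three-line template; the values `toy_job` and `4` are hypothetical.

```r
library(stringr)

# First pass: "[[name]]" is an escaped delimiter, so it becomes the literal "[name]"
template = str_glue("#SBATCH --job-name=[[job_name]]\n#SBATCH --cpus-per-task=[[cpu]]\necho ${SLURM_ARRAY_TASK_ID}",
    .open = "[", .close = "]")

# Second pass: "[name]" is now a real placeholder and is replaced by its value
filled = str_glue(template,
    job_name = "toy_job", # hypothetical value
    cpu = 4, # hypothetical value
    .open = "[", .close = "]")

print(template)
print(filled)
```

Because the content between the delimiters is evaluated as R code in the second pass, a placeholder that contains a literal value instead of a name is inserted as that value.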

In [None]:
paladin_array_slurm = str_glue(paladin_array_slurm_raw,
        job_name = "paladin_array", 
        array_jobs = str_c("1-", nrow(reads_df)), # number of array jobs should be expressed as 1-<number of samples to run>, if 10 samples, 1-10
        samples_file = paladin_samplesfile, # Samples file we created above
        index_dir = Large_unannot, # Path to the indexed reference; `Large_unannot` must be defined before running this cell
        out_dir = out_dir,
        tmp_dir = tmp_dir,
        cpu = 16,
        mem = "64G",
        conda_env = conda_env, # Name of conda environment, defined above
        .open = "[", .close = "]") 

paladin_array_slurm %>%
        print()

In [None]:
# Write file
array_slurmfile = file.path(bin_dir, "array_slurm.sh")
write_lines(paladin_array_slurm, array_slurmfile)
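
Once the array job has finished, each task leaves a `<sample>.counts` file in `out_dir`. These files are the output of `samtools idxstats`, which is tab-separated with four columns: reference sequence name, sequence length, number of mapped reads and number of unmapped reads, plus a final `*` row for reads without a reference. The following is a minimal sketch of how these files could be combined into one long table. The function name `read_paladin_counts`, the column names and the toy data are assumptions made for illustration.

```r
library(readr)
library(dplyr)
library(purrr)
library(stringr)

# Read every <sample>.counts file in a folder into one long table
# (hypothetical helper; column names are chosen here, idxstats writes no header)
read_paladin_counts = function(counts_dir) {
    counts_files = list.files(counts_dir, pattern = "\\.counts$", full.names = TRUE)
    stopifnot(length(counts_files) > 0)

    counts_files %>%
        set_names(str_remove(basename(counts_files), "\\.counts$")) %>%
        map(function(x) read_tsv(x,
            col_names = c("Protein", "Length", "Mapped", "Unmapped"),
            col_types = "ciii")) %>%
        list_rbind(names_to = "Sample_name") %>%
        dplyr::filter(Protein != "*") # The "*" row holds reads without a reference
}

# Toy data: two made-up counts files written to a temporary folder
toy_dir = file.path(tempdir(), "toy_counts")
dir.create(toy_dir, showWarnings = FALSE)
write_lines(c("protein_1\t300\t12\t0", "protein_2\t150\t0\t0", "*\t0\t0\t5"),
    file.path(toy_dir, "sample_A.counts"))
write_lines(c("protein_1\t300\t7\t0", "protein_2\t150\t3\t0", "*\t0\t0\t2"),
    file.path(toy_dir, "sample_B.counts"))

toy_counts = read_paladin_counts(toy_dir)
toy_counts
```

In the notebook, the same function could be called as `read_paladin_counts(out_dir)`. The long format keeps one row per sample and protein, which can later be reshaped into a wide count matrix if needed.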