# Index Paladin Database

Jacobo de la Cuesta-Zuluaga. June 2025.

The aim of this notebook is to index the `UHGG` protein catalog for use with `Paladin`.

## Load libraries and set paths

In [1]:
# Libraries
library(tidyverse)
library(conflicted)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Solve conflicts
conflicts_prefer(dplyr::filter)

[1m[22m[90m[conflicted][39m Will prefer [1m[34mdplyr[39m[22m::filter over any other package.


In [3]:
# Directories
# Base directory
databases_dir = "/mnt/lustre/groups/maier/databases"

# UHGG database
uhgg_faa = file.path(databases_dir, "UHGG/Protein_catalog/uhgp-90/uhgp-90.faa")

# Output directory for the Paladin index
index_dir = file.path(databases_dir, "Paladin")
dir.create(index_dir)

# Conda
conda_env = "paladin"

“'/mnt/lustre/groups/maier/databases/Paladin' already exists”


## Index protein database

In [4]:
# Copy the UHGG FASTA to the Paladin folder
file.copy(from = uhgg_faa, to = index_dir, overwrite = FALSE)
uhgg_ref_faa = file.path(index_dir, "uhgp-90.faa")
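
Because `file.copy()` with `overwrite = FALSE` returns `FALSE` without raising an error when the destination already exists, a truncated file left by an interrupted earlier copy would go unnoticed. The following is a minimal sketch of a sanity check, not part of the original notebook; `copy_ok()` is a hypothetical helper name, and comparing file sizes is only a cheap proxy for an intact copy (a checksum would be stricter).

```r
# Hypothetical helper (not in the original notebook):
# TRUE only if the destination exists and has the same size in bytes as the source.
copy_ok = function(src, dest) {
  isTRUE(file.exists(dest) && file.size(src) == file.size(dest))
}
```

In this notebook it could be called as `stopifnot(copy_ok(uhgg_faa, uhgg_ref_faa))` before building the index.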

In [5]:
index_slurm_raw = str_glue(.open = "[", .close = "]",
"#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=[[job_name]]

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=32

# Specify the total memory required per node
#SBATCH --mem=256G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# job information
scontrol show job ${SLURM_JOB_ID}

# do your real computation
source $HOME/.bashrc
conda activate [[conda_env]]
cd [[index_dir]]
paladin index -r3 [[Protein_reference]]
")

In [6]:
index_slurm = str_glue(index_slurm_raw,
        job_name = "paladin_index", 
        index_dir = index_dir,
        Protein_reference = uhgg_ref_faa,
        conda_env = conda_env,
        .open = "[", .close = "]") 

index_slurm %>%
        print()

#!/bin/bash
##############################
#       Parameters           #
##############################

# This section will tell the cluster what are the resources your job will need.
# Change the parameters accordingly and carefully!
# The parameters here are a sensible start.

# Name of the job
#SBATCH --job-name=paladin_index

# Generate an output file and give it a name
#SBATCH --output=%x-%j.out

# Number of tasks
#SBATCH --ntasks=1

# Number of cpus that this task will need
#SBATCH --cpus-per-task=32

# Specify the total memory required per node
#SBATCH --mem=256G

# Specify the maximum time this job can take to run before being killed (hh:mm:ss)
#SBATCH --time=23:59:59

# job information
scontrol show job ${SLURM_JOB_ID}

# do your real computation
source $HOME/.bashrc
conda activate paladin
cd /mnt/lustre/groups/maier/databases/Paladin
paladin index -r3 /mnt/lustre/groups/maier/databases/Paladin/uhgp-90.faa
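
The template above is filled in two passes, and the bracket delimiters are what keep shell syntax such as `${SLURM_JOB_ID}` untouched. As a small illustration of the mechanism (the strings and the `template`/`filled` names here are made up for the example): `glue` treats a doubled delimiter as an escape, so the first pass turns `[[name]]` into the literal `[name]`, and the second pass substitutes it.

```r
library(stringr)

# Pass 1: doubled brackets are escapes, so [[name]] becomes the literal [name].
# ${HOME} is left alone because braces are not the delimiters here.
template = str_glue("echo ${HOME} [[name]]", .open = "[", .close = "]")

# Pass 2: the single-bracket placeholder is filled from the named argument.
filled = str_glue(template, name = "paladin_index", .open = "[", .close = "]")

print(template)
print(filled)
```

Keeping the raw template separate from the filled script means the same template can be reused with a different job name or reference file.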


In [8]:
# Write file
index_slurmfile = file.path(index_dir, "index_slurm.sh")
write_lines(index_slurm, index_slurmfile)
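
The notebook stops after writing the script and does not show how the job is submitted. One possible way to submit it from R is sketched below, assuming the session runs on a node where the SLURM `sbatch` command is on the `PATH`; `submit_slurm()` is a hypothetical helper and not part of the original workflow.

```r
# Hypothetical helper (assumption: `sbatch` is available in this R session).
# Stops early if the script is missing, then returns whatever sbatch prints.
submit_slurm = function(script) {
  stopifnot(file.exists(script))
  system2("sbatch", args = shQuote(script), stdout = TRUE)
}
```

It could then be called as `submit_slurm(index_slurmfile)`. Note that `--output=%x-%j.out` is resolved relative to the directory the job is submitted from, so the log file lands wherever the R session's working directory is at that point.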