# Assembly of nanopore sequences

Jacobo de la Cuesta-Zuluaga, June 2025.

The aim of this notebook is to run the `nf-core` pipeline `bacass` to assemble a bacterial genome sequenced using nanopore technology.

## Libraries

In [1]:
library(tidyverse)
library(conflicted)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
conflicts_prefer(dplyr::filter)

[conflicted] Will prefer dplyr::filter over any other package.


## Directories

In [11]:
# Directories
# Base directory
base_dir = "/mnt/lustre/groups/maier/maide581/projects/Huequito"

# Data
data_dir = file.path(base_dir, "data")

# fastq files
fastq_dir = file.path(data_dir, "fastq_files")

# sheets dir
sheets_dir = file.path(data_dir, "sheets")
dir.create(sheets_dir)

# assembly dir
assembly_dir = file.path(data_dir, "assembly")
dir.create(assembly_dir)

# Kraken db
k2_db = "/mnt/lustre/groups/maier/databases/Kraken_Bracken/k2_standard_16gb/k2_standard_16gb_20240605.tar.gz"

# Software
conda_env = "nextflow"

“'/mnt/lustre/groups/maier/maide581/projects/Huequito/data/sheets' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Huequito/data/assembly' already exists”
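
The "already exists" messages above are harmless warnings from `dir.create()` when the notebook is re-run. As a minimal sketch of a quieter variant, the block below wraps the call in a small helper. The helper name `ensure_dir` and the demo paths under `tempdir()` are assumptions made for illustration and are not part of the original notebook.

```r
# Hypothetical helper: create a directory only if it is missing
ensure_dir = function(path) {
    if (!dir.exists(path)) {
        dir.create(path, recursive = TRUE, showWarnings = FALSE)
    }
    # Return the path invisibly so the call can be used in an assignment
    invisible(path)
}

# Stand-in for base_dir so the sketch runs anywhere
demo_base = file.path(tempdir(), "Huequito_demo")
demo_sheets = ensure_dir(file.path(demo_base, "data", "sheets"))

# A second call is safe and produces no warning
ensure_dir(demo_sheets)
dir.exists(demo_sheets)
```

With `recursive = TRUE`, missing parent directories such as `data` are created as well, which the plain `dir.create()` calls above would not do.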


## Prepare tables

In [4]:
# Create sample sheets
raw_reads = list.files(fastq_dir, full.names = TRUE)

# Keep only gzipped fastq files; the dots are escaped so they match literally
F_reads = raw_reads %>%
    str_subset("\\.fastq\\.gz$")

reads_sheet = data.frame(LongFastQ = F_reads) %>%
    mutate(ID = "S_dysgalactiae", 
           R1 = NA,
           R2 = NA,
           Fast5 = NA,
           GenomeSize = NA) %>%  
    relocate(ID) %>% 
    relocate(LongFastQ,.after = R2)

reads_sheet

| ID | R1 | R2 | LongFastQ | Fast5 | GenomeSize |
|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<lgl>` | `<chr>` | `<lgl>` | `<lgl>` |
| S_dysgalactiae | NA | NA | /mnt/lustre/groups/maier/maide581/projects/Huequito/data/fastq_files/MMC234_202311.fastq.gz | NA | NA |


In [5]:
# Write file
Sdysgalactiae_samplessheet = file.path(sheets_dir, "Sdysgalactiae_samples.tsv")

reads_sheet %>%
    write_tsv(Sdysgalactiae_samplessheet)
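
Before launching the pipeline, it can help to read the sheet back and confirm that it looks as intended. The following is a minimal sketch rather than part of the original analysis. It uses throwaway files in `tempdir()` and hypothetical names (`demo_fastq`, `demo_sheet`, `demo_tsv`) instead of the real paths, and it assumes the six-column layout built above.

```r
library(readr)

# Hypothetical stand-in for a real fastq file
demo_fastq = file.path(tempdir(), "demo_reads.fastq.gz")
invisible(file.create(demo_fastq))

# Sheet with the same columns, in the same order, as reads_sheet
demo_sheet = data.frame(ID = "demo_sample", R1 = NA, R2 = NA,
                        LongFastQ = demo_fastq, Fast5 = NA, GenomeSize = NA)

demo_tsv = file.path(tempdir(), "demo_samples.tsv")
write_tsv(demo_sheet, demo_tsv)

# Read everything back as text, keeping "NA" as a literal string
sheet_check = read_tsv(demo_tsv,
                       col_types = cols(.default = "c"),
                       na = character())

# Check the column order and that every long-read file exists
expected_cols = c("ID", "R1", "R2", "LongFastQ", "Fast5", "GenomeSize")
stopifnot(identical(colnames(sheet_check), expected_cols))
stopifnot(all(file.exists(sheet_check$LongFastQ)))

sheet_check
```

By default `write_tsv()` writes missing values as the string `NA`, so the empty short-read and Fast5 columns appear as `NA` in the file. Whether the pipeline version in use accepts that exact convention is worth confirming against the bacass usage documentation.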

## Execute pipeline

In [6]:
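# Create command template
# str_glue() treats doubled braces as escapes, so this first call leaves
# single-brace placeholders (e.g. {conda_env}) that are filled in a later call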
bacass_cmd = str_glue(
  "conda activate {{conda_env}} && \\
  cd {{out_dir}} && \\
  nextflow run nf-core/bacass -r 2.3.1 \\
    -profile m3c \\
    --input {{samples_sheet}} \\
    --outdir {{assemblies_dir}} \\
    --kraken2db {{kraken_db}} \\
    --annotation_tool prokka \\
    --assembly_type long \\
    --skip_kmerfinder")
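
The command is built in two passes, relying on how `str_glue()` handles doubled braces. As a small illustration with made-up values (the names `template`, `filled`, `pipeline` and `outdir` are hypothetical and exist only in this sketch):

```r
library(stringr)

# Pass 1: doubled braces are escaped, so nothing is interpolated yet
template = str_glue("nextflow run {{pipeline}} --outdir {{outdir}}")
template
# nextflow run {pipeline} --outdir {outdir}

# Pass 2: named arguments fill in the single-brace placeholders
filled = str_glue(template, pipeline = "nf-core/bacass", outdir = "results")
filled
# nextflow run nf-core/bacass --outdir results
```

Keeping the template separate from its values makes it possible to reuse the same command text for other sample sheets or output directories.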

In [12]:
assembly_cmd = str_glue(bacass_cmd,
                        conda_env = conda_env,
                        out_dir = assembly_dir,
                        samples_sheet = Sdysgalactiae_samplessheet,
                        assemblies_dir = assembly_dir,
                        kraken_db = k2_db)

assembly_cmd
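
The notebook stops at printing the command. One possible way to run a command string from R is sketched below, assuming `bash` is available on the system. It is not necessarily how the pipeline was launched here. The sketch uses a harmless `echo` stand-in (`demo_cmd`, a hypothetical name) instead of the real `assembly_cmd`, so it is safe to execute anywhere.

```r
# Hypothetical stand-in command; replace with assembly_cmd for a real run
demo_cmd = "echo pipeline would start here"

# Save the command as a small shell script to keep a record of what was run
script_path = file.path(tempdir(), "run_demo.sh")
writeLines(c("#!/usr/bin/env bash",
             "set -euo pipefail",
             demo_cmd),
           script_path)

# Run the script and capture its output as a character vector
run_output = system2("bash", shQuote(script_path), stdout = TRUE, stderr = TRUE)
run_output
```

For the real command, note that `conda activate` may not work in a non-interactive shell unless conda's shell hook has been set up there. This depends on how conda is initialised on the cluster.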