# Compare E5 coral sRNA-seq trimming options

Simple adapter trimming vs. adapter trimming plus trimming to expected sRNA lengths.


### List computer specs

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Wed May 24 10:44:03 PDT 2023
------------



No LSB modules are available.


Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic

------------
HOSTNAME: 
raven

------------
Computer Specs:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
Stepping:            7
CPU MHz:             1000.108
CPU max MHz:         4000.0000
CPU min MHz:         1000.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdts

### Set variables
- `%env` sets an environment variable that is available to the `%%bash` cells

- A line without `%env` sets a Python variable

In [2]:
# Set directories, input/output files
%env data_dir=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq
%env analysis_dir=/home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons
analysis_dir="/home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons"

%env R1_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
%env R2_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz

# Set CPU threads
%env threads=40

# Max read length
%env max_read_length=50

# Set program locations
%env fastqc=/home/shared/FastQC/fastqc
%env flexbar=/home/shared/flexbar-3.5.0-linux/flexbar

# Set some formatting stuff
%env break_line=--------------------------------------------------------------------------

env: data_dir=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq
env: analysis_dir=/home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons
env: R1_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
env: R2_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz
env: threads=40
env: max_read_length=50
env: fastqc=/home/shared/FastQC/fastqc
env: flexbar=/home/shared/flexbar-3.5.0-linux/flexbar
env: break_line=--------------------------------------------------------------------------
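
Before running the trimming cells, it can help to confirm that the variables above are actually visible to bash. The following is a minimal sketch, not part of the original analysis; the `check_env_vars` helper is a hypothetical name, and the variable names passed to it are the ones set in the cell above.

```bash
# Hypothetical helper: report any named environment variable that is unset
# or empty. Returns non-zero if at least one is missing.
check_env_vars() {
  local missing=0
  local name
  for name in "$@"; do
    # printenv only sees exported variables, which is what %env creates
    if [ -z "$(printenv "${name}")" ]; then
      echo "MISSING: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Check the variable names used in this notebook
if check_env_vars data_dir analysis_dir R1_fastq R2_fastq threads \
     max_read_length fastqc flexbar; then
  echo "All variables are set"
else
  echo "Some variables are missing (see messages above)"
fi
```

Python-only variables (those set without `%env`) will be reported as missing, since they are never exported to the shell.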


### Create analysis directory

In [3]:
%%bash
# Make analysis and data directories, if they don't exist
mkdir --parents "${analysis_dir}"

mkdir --parents "${data_dir}"

# Adapter-only trimming

### Inspect NEB Adapter FastA

Adapter sequences are from the protocol for the NEB sRNA kit, which Azenta used for library construction.

In [4]:
%%bash
cat "${data_dir}/NEB-adapters.fasta"

>first
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>second
GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT
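
As a rough check that these adapters are present in the raw reads, one could count how many of the first reads contain part of an adapter sequence. This is only a sketch: `count_adapter_hits` is a hypothetical helper, the 13-base prefix and the default of 100,000 reads are arbitrary choices for illustration, and exact string matching will miss adapters that contain sequencing errors.

```bash
# Hypothetical helper: among the first N records of a gzipped FASTQ, count
# the reads whose sequence line contains the given string.
# Assumes standard 4-line FASTQ records.
count_adapter_hits() {
  local fastq_gz="$1"
  local adapter="$2"
  local n_reads="${3:-100000}"
  gzip -dc "${fastq_gz}" \
    | head -n "$(( n_reads * 4 ))" \
    | awk 'NR % 4 == 2' \
    | grep -c -F -- "${adapter}" \
    || true   # grep -c exits non-zero when the count is 0
}

# Example: search R1 for the first 13 bases of the "first" adapter.
# Skipped silently if the R1_fastq variable does not point to a file.
if [ -f "${R1_fastq:-}" ]; then
  count_adapter_hits "${R1_fastq}" "AGATCGGAAGAGC"
fi
```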


### Trim adapters

Options:

- `-ap ON`: Turns on overlap detection for paired-end reads; recommended by NEB

- `-qf i1.8`: Sets quality format to Illumina v1.8

- `-qt 25`: Sets the minimum quality threshold for quality trimming to 25

- `--target`: Sets the prefix for output filenames

- `--zip-output GZ`: Sets gzip compression for trimmed files
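
For context on the quality options: the Illumina 1.8 format uses Phred+33 encoding, so each quality character corresponds to its ASCII code minus 33. The sketch below only illustrates that encoding; `phred33_scores` is a hypothetical helper, and it does not reproduce how flexbar applies its threshold internally.

```bash
# Hypothetical helper: print the Phred score for each character of a
# Phred+33 (Illumina 1.8+) quality string, space-separated on one line.
phred33_scores() {
  local qual="$1"
  local i ch code
  local out=""
  for (( i = 0; i < ${#qual}; i++ )); do
    ch="${qual:i:1}"
    # printf with a leading quote yields the character's ASCII code
    code=$(printf '%d' "'${ch}")
    out+="$(( code - 33 )) "
  done
  # Drop the trailing space before printing
  echo "${out% }"
}

phred33_scores 'II5+!'   # prints: 40 40 20 10 0
```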

In [5]:
%%bash
cd "${analysis_dir}"

"${flexbar}" \
-r "${R1_fastq}" \
-p "${R2_fastq}" \
-a "${data_dir}/NEB-adapters.fasta" \
-ap ON \
-qf i1.8 \
-qt 25 \
--threads "${threads}" \
--target sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only \
--zip-output GZ

ls -lh

total 633M
-rw-rw-r-- 1 sam sam 311M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_1.fastq.gz
-rw-rw-r-- 1 sam sam 322M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_2.fastq.gz
-rw-rw-r-- 1 sam sam 2.6K May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log
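
File sizes alone say little about how many reads survived trimming. A minimal sketch for counting reads is shown below, assuming standard 4-line FASTQ records; `count_fastq_reads` is a hypothetical helper and was not run as part of the original analysis.

```bash
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes each record spans exactly 4 lines.
count_fastq_reads() {
  local fastq_gz="$1"
  local lines
  lines=$(gzip -dc "${fastq_gz}" | wc -l)
  echo $(( lines / 4 ))
}

# Compare the raw R1 file with the adapter-trimmed R1 output.
# Entries that are not existing files are skipped.
for fq in "${R1_fastq:-}" "${analysis_dir:-.}"/*adapter_trim_only_1.fastq.gz; do
  [ -f "${fq}" ] || continue
  printf '%s\t%s\n' "$(basename "${fq}")" "$(count_fastq_reads "${fq}")"
done
```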


### Check log file

In [6]:
%%bash
cd "${analysis_dir}"

cat sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log


               ________          __              
              / ____/ /__  _  __/ /_  ____ ______
             / /_  / / _ \| |/ / __ \/ __ `/ ___/
            / __/ / /  __/>  </ /_/ / /_/ / /    
           /_/   /_/\___/_/|_/_.___/\__._/_/     

Flexbar - flexible barcode and adapter removal, version 3.5.0
Developed with SeqAn, the library for sequence analysis

Available on github.com/seqan/flexbar


Local time:            Wed May 24 10:44:10 2023

Number of threads:     40
Bundled fragments:     256

Target name:           sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only
File type:             fastq
Reads file:            /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
Reads file 2:          /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz   (paired run)
Adapter file:          /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/NEB-adapters.fasta

max-uncalled:          0
min-read-length:       18

adapter-

# Adapter and length trimming

### Trim adapters and set max length (trimmed from 3' end)

Options:

- `-ap ON`: Turns on overlap detection for paired-end reads; recommended by NEB

- `-qf i1.8`: Sets quality format to Illumina v1.8

- `-qt 25`: Sets the minimum quality threshold for quality trimming to 25

- `--post-trim-length`: Trims reads from the 3' end to the specified length, after adapter and quality trimming

- `--target`: Sets the prefix for output filenames

- `--zip-output GZ`: Sets gzip compression for trimmed files

In [7]:
%%bash
cd "${analysis_dir}"

"${flexbar}" \
-r "${R1_fastq}" \
-p "${R2_fastq}" \
-a "${data_dir}/NEB-adapters.fasta" \
-ap ON \
-qf i1.8 \
-qt 25 \
--post-trim-length "${max_read_length}" \
--threads "${threads}" \
--target sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50 \
--zip-output GZ

ls -lh

total 1.2G
-rw-rw-r-- 1 sam sam 292M May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_1.fastq.gz
-rw-rw-r-- 1 sam sam 293M May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_2.fastq.gz
-rw-rw-r-- 1 sam sam 2.7K May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50.log
-rw-rw-r-- 1 sam sam 311M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_1.fastq.gz
-rw-rw-r-- 1 sam sam 322M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_2.fastq.gz
-rw-rw-r-- 1 sam sam 2.6K May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log
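
One way to see the effect of length trimming directly is to tabulate read lengths in each trimmed file. The sketch below assumes 4-line FASTQ records; `read_length_histogram` is a hypothetical helper, and the loop over `*_1.fastq.gz` is just one possible way to select the R1 outputs.

```bash
# Hypothetical helper: print "length<TAB>count" for every read length
# found in a gzipped FASTQ file, sorted by length.
read_length_histogram() {
  gzip -dc "$1" \
    | awk 'NR % 4 == 2 { n[length($0)]++ }
           END { for (len in n) print len "\t" n[len] }' \
    | sort -n
}

# Tabulate lengths for each trimmed R1 file in the analysis directory.
# Entries that are not existing files are skipped.
for fq in "${analysis_dir:-.}"/*_1.fastq.gz; do
  [ -f "${fq}" ] || continue
  echo "== $(basename "${fq}")"
  read_length_histogram "${fq}"
done
```

If length trimming behaved as described in the options above, the length-trimmed file should contain no reads longer than `max_read_length`.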


### Check log file

In [8]:
%%bash
cd "${analysis_dir}"

cat sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50.log


               ________          __              
              / ____/ /__  _  __/ /_  ____ ______
             / /_  / / _ \| |/ / __ \/ __ `/ ___/
            / __/ / /  __/>  </ /_/ / /_/ / /    
           /_/   /_/\___/_/|_/_.___/\__._/_/     

Flexbar - flexible barcode and adapter removal, version 3.5.0
Developed with SeqAn, the library for sequence analysis

Available on github.com/seqan/flexbar


Local time:            Wed May 24 10:48:12 2023

Number of threads:     40
Bundled fragments:     256

Target name:           sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50
File type:             fastq
Reads file:            /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
Reads file 2:          /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz   (paired run)
Adapter file:          /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/NEB-adapters.fasta

max-uncalled:          0
post-trim-length:      50
min-r
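
To compare the two runs without reading both logs in full, individual settings can be pulled out of the `key:   value` lines seen above. This is a sketch only: `log_param` is a hypothetical helper, it only handles single-word keys such as `post-trim-length` (not keys containing spaces), and it assumes the log layout shown in the output above.

```bash
# Hypothetical helper: print the value for a single-word key from a log
# file whose lines look like "key:      value". Prints nothing if absent.
log_param() {
  local log_file="$1"
  local key="$2"
  awk -v key="${key}:" '
    $1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }
  ' "${log_file}"
}

# Summarize two settings for every log in the analysis directory.
# An empty value means the key was not found in that log.
for log in "${analysis_dir:-.}"/*.log; do
  [ -f "${log}" ] || continue
  printf '%s\tpost-trim-length=%s\tmin-read-length=%s\n' \
    "$(basename "${log}")" \
    "$(log_param "${log}" post-trim-length)" \
    "$(log_param "${log}" min-read-length)"
done
```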

# FastQC

In [9]:
%%bash
cd "${analysis_dir}"

# Collect all trimmed FastQ files in an array
trimmed_fastq_array=(*.fastq.gz)

# Pass array elements to FastQC as separate arguments
"${fastqc}" \
"${trimmed_fastq_array[@]}" \
--threads "${threads}" \
--outdir ./ \
--quiet
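
Because FastQC runs with `--quiet`, a missing report is easy to overlook. The sketch below checks that each trimmed FastQ file has a matching HTML report. It assumes FastQC's usual `<name>_fastqc.html` naming and that reports were written next to the input files; `fastqc_report_name` is a hypothetical helper.

```bash
# Hypothetical helper: derive the expected FastQC HTML report name from a
# FASTQ path by stripping the directory and the .gz/.fastq/.fq suffixes.
fastqc_report_name() {
  local base
  base=$(basename "$1")
  base="${base%.gz}"
  base="${base%.fastq}"
  base="${base%.fq}"
  echo "${base}_fastqc.html"
}

# Report any trimmed FASTQ file that lacks a FastQC report.
# Entries that are not existing files are skipped.
for fq in "${analysis_dir:-.}"/*.fastq.gz; do
  [ -f "${fq}" ] || continue
  report="$(dirname "${fq}")/$(fastqc_report_name "${fq}")"
  [ -f "${report}" ] || echo "No FastQC report found for ${fq}"
done
```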

# MultiQC

In [10]:
%%bash
cd "${analysis_dir}"

multiqc .


  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 14/14
|           flexbar | Found 2 logs
|            fastqc | Found 4 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete


### Document program options

In [11]:
%%bash
${flexbar} -hh


flexbar - flexible barcode and adapter removal

SYNOPSIS
    flexbar -r reads [-b barcodes] [-a adapters] [options]

DESCRIPTION
    The program Flexbar preprocesses high-throughput sequencing data
    efficiently. It demultiplexes barcoded runs and removes adapter sequences.
    Several adapter removal presets for Illumina libraries are included.
    Flexbar computes exact overlap alignments using SIMD and multicore
    parallelism. Moreover, trimming and filtering features are provided, e.g.
    trimming of homopolymers at read ends. Flexbar increases read mapping
    rates and improves genome as well as transcriptome assemblies. Unique
    molecular identifiers can be extracted in a flexible way. The software
    supports data in fasta and fastq format from multiple sequencing
    platforms. Refer to the manual on github.com/seqan/flexbar/wiki or contact
    Johannes Roehr on github.com/jtroehr for support with this application.

OPTIONS
    -h, --help
          Display the help me