# Scaffolding 10X contigs using long reads

* Gillenia genome assembly: initial contigs were assembled from 10x reads. Nanopore MinION and PromethION long reads (> 30 kb) will be used to scaffold the contigs
* Supernova generated two phased assembly sets for scaffolding
* Long read scaffolding programs: **LINKS** (recommended by Cecilia) and **SLR** will be tested in this notebook
* Bioinformaticians: Chen Wu and Cecilia Deng


## 0. Input file paths

In [2]:
pwd

/powerplant/workspace/hraczw/github/GA/Gillenia_genome


In [3]:
mkdir 004.Scaffolding_10Xcontigs

In [1]:
WORKDIR=/powerplant/workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs

In [3]:
ASSEMBLY_P1=/output/genomic/plant/Gillenia/trifoliata/31_ReleaseS4/01_Release/G3_2_S4.1.2KB.fasta
ASSEMBLY_P2=/output/genomic/plant/Gillenia/trifoliata/31_ReleaseS4/01_Release/G3_2_S4.2.2KB.fasta

In [2]:
ONT_M=/workspace/hraczw/github/GA/Gillenia_genome/Gillenia_MinNION.fastq.gz
ONT_P=/workspace/hraczw/github/GA/Gillenia_genome/Gillenia_PromethION.fastq.gz

## 1. Filtering out short nanopore reads (length < 30 kb)

In [11]:
bsub -J unzip \
-o $WORKDIR/unzip_M.out \
-e $WORKDIR/unzip_M.err \
"gunzip $ONT_M"

Job <374964> is submitted to default queue <normal>.


In [70]:
bsub -J preseq \
-o $WORKDIR/preseq.out \
-e $WORKDIR/preseq.err \
"perl /workspace/hraczw/github/programs/prinseq-lite-0.20.4/prinseq-lite.pl \
-fastq /workspace/hraczw/github/GA/Gillenia_genome/Gillenia_MinNION.fastq \
-out_good $WORKDIR/Gillenia_MinNION.ml30k \
-out_bad $WORKDIR/Gillenia_MinNION.shortFrom30k \
-min_len 30000"

Job <391586> is submitted to default queue <normal>.
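The `-min_len 30000` setting keeps only reads of at least 30,000 bases. As a rough illustration of what that filter does (not prinseq's own implementation), the same rule can be sketched in awk. The helper name `filter_fastq_by_length` is hypothetical, and the sketch assumes strict four-line FASTQ records.

```shell
# Hypothetical helper: keep FASTQ records whose sequence has at least MIN_LEN bases.
# ASSUMPTION: strict 4-line records (sequence and quality lines are not wrapped).
filter_fastq_by_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}

# Toy demonstration with a 5-base threshold instead of 30000.
toy_fastq=$(printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n')
kept=$(printf '%s\n' "$toy_fastq" | filter_fastq_by_length 5)
printf '%s\n' "$kept"
```

Because the helper reads from standard input, it could in principle be fed from `gunzip -c` so that the compressed file is not unpacked in place.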


In [None]:
bsub -J preseq \
-o $WORKDIR/preseq_promethion.out \
-e $WORKDIR/preseq_promethion.err \
"perl /workspace/hraczw/github/programs/prinseq-lite-0.20.4/prinseq-lite.pl \
-fastq /workspace/hraczw/github/GA/Gillenia_genome/Gillenia_PromethION.fastq \
-out_good $WORKDIR/Gillenia_PromethION.ml30k \
-out_bad $WORKDIR/Gillenia_PromethION.shortFrom30k \
-min_len 30000"

MinION

	Input sequences: 1,093,884
	Input bases: 15,503,735,959
	Input mean length: 14173.11
	Good sequences: 138,583 (12.67%)
	Good bases: 5,513,046,647
	Good mean length: 39781.55

PromethION

	Input sequences: 4,010,868
	Input bases: 57,044,858,479
	Input mean length: 14222.57
	Good sequences: 519,509 (12.95%)
	Good bases: 21,341,619,410
	Good mean length: 41080.37
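The two summaries can be cross-checked with a little arithmetic. The sketch below recomputes the retained read percentages from the counts reported above and sums the retained bases across both platforms. The helper name `retained_pct` is hypothetical.

```shell
# Hypothetical helper: percentage of KEPT out of TOTAL, to two decimals.
retained_pct() {   # usage: retained_pct KEPT TOTAL
    awk -v kept="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * kept / total }'
}

# Counts copied from the prinseq summaries above.
minion_reads_pct=$(retained_pct 138583 1093884)
promethion_reads_pct=$(retained_pct 519509 4010868)
minion_bases_pct=$(retained_pct 5513046647 15503735959)
promethion_bases_pct=$(retained_pct 21341619410 57044858479)

# Total bases available for scaffolding after filtering.
total_good_bases=$((5513046647 + 21341619410))

echo "MinION:     reads kept ${minion_reads_pct}%, bases kept ${minion_bases_pct}%"
echo "PromethION: reads kept ${promethion_reads_pct}%, bases kept ${promethion_bases_pct}%"
echo "Total good bases: ${total_good_bases}"
```

Only about 13% of reads pass the filter on either platform, but they carry a much larger share of the bases, which is why the retained set is still substantial.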

## 2. LINKS

In [68]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) links/v1.8.5
  2) texlive/20151117   5) perlbrew/0.76      8) bwa/0.7.17
  3) pandoc/1.19.2      6) asub/2.1           9) samtools/1.9


In [4]:
module load links
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) links/v1.8.6
  3) pandoc/1.19.2      6) perl/5.28.0


In [9]:
LINKS --help

/software/bioinformatics/links_v1.8.6/bin/LINKS version [unknown] calling Getopt::Std::getopts (version 1.12 [paranoid]),
running under Perl version 5.26.2.

Usage: LINKS [-OPTIONS [-MORE_OPTIONS]] [--] [PROGRAM_ARG1 ...]

The following single-character options are accepted:
	With arguments: -f -s -d -k -e -l -a -v -b -t -p -o -r -x -m -z

Options may be merged together.  -- stops processing of options.
Space is not required between options and their arguments.
  [Now continuing due to backward compatibility and excessive paranoia.
   See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]
Usage: /software/bioinformatics/links_v1.8.6/bin/LINKS [v1.8.6]
-f  sequences to scaffold (Multi-FASTA format, required)
-s  file-of-filenames, full path to long sequence reads or MPET pairs [see below] (Multi-FASTA/fastq format, required)
-m  MPET reads (default -m 1 = yes, default = no, optional)
	! DO NOT SET IF NOT USING MPET. WHEN SET, LINKS WILL EXPECT A SPECIAL FORMAT UNDER -s
	!

: 255

**NOTE: running LINKS v1.8.6 fails with the error message below:**

Can't locate object method "new" via package "BloomFilter::BloomFilter" (perhaps you forgot to load "BloomFilter::BloomFilter"?) at /software/bioinformatics/links_v1.8.6/bin/LINKS line 215.

Amali clarified that this had been fixed before, but it turns out that something is wrong with the module again.

A comparison of v1.8.5 and v1.8.6 showed no apparent difference in the algorithm or the basic parameter settings, so we decided to go with v1.8.5.

### 2.1 Scaffolding the phase 1 assembly

**NOTE** Preparing the input long read file: create an `ONT_data_files.txt` listing the full paths of all filtered long read files (MinION + PromethION)
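A minimal sketch of building that file-of-filenames follows. It assumes prinseq wrote its `-out_good` output with a `.fastq` suffix (as the PromethION path used later in this notebook suggests) and falls back to a scratch directory when `WORKDIR` is not set.

```shell
# Fall back to a scratch directory so the sketch also runs outside the cluster.
WORKDIR=${WORKDIR:-$(mktemp -d)}
mkdir -p "$WORKDIR"
ONT_LIST="$WORKDIR/ONT_data_files.txt"

# One full path per line, as LINKS expects under -s.
# ASSUMPTION: prinseq appended ".fastq" to the -out_good prefixes used above.
printf '%s\n' \
    "$WORKDIR/Gillenia_MinNION.ml30k.fastq" \
    "$WORKDIR/Gillenia_PromethION.ml30k.fastq" \
    > "$ONT_LIST"

cat "$ONT_LIST"
```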

* Parameter settings: Cecilia recommended testing the paper's Arabidopsis parameters (four rounds of scaffolding; k: k-mer size, d: distance between a k-mer pair, t: step of the sliding window).
1.       -d 5000 -t 20 -k 21
2.       -d 10000 -t 5 -k 21
3.       -d 15000 -t 5 -k 21
4.       -d 20000 -t 5 -k 21

Notes from the author:
* ~700-900 GB RAM is the sweet spot for LINKS. The memory panic error is, unfortunately, a Perl out-of-memory error. Perl is not efficient with memory allocation and management, so this will occur even on a 2 TB machine.
* The only way to curb memory usage is to increase the spacing between k-mer pairs (with the skip parameter -t). It is best to choose values of -t that put peak memory usage in the 700-900 GB zone (machine permitting). This is not an ideal solution, as it makes troubleshooting challenging and decreases the available k-mer support. That said, you have ~125X ONT coverage of your genome. That is overkill for LINKS, and increasing -t is not going to discard too many k-mer pairs (the redundancy in the data should capture additional supporting linkages in other reads, which is a better type of scaffolding support).
* Alternatively, you could decrease the read coverage to 50X and decrease -t. LINKS will run faster on a smaller ONT data set, but if runtime isn't an issue, you should just keep the whole ONT set and increase -t.
* Also, if you increase -d at each round, you can decrease -t by a similar factor and stay within the available RAM.
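To get a feel for why `-t` and `-d` interact, here is a back-of-envelope sketch. It is not how LINKS accounts for memory. It only assumes that a read of length L offers about (L - d) / t window positions for a k-mer pair spaced d apart, that every filtered read is longer than d, and that no pairs are discarded. The helper name `estimate_windows` is hypothetical.

```shell
# Totals for the filtered reads, summed from the prinseq summaries in this notebook.
GOOD_READS=$((138583 + 519509))
GOOD_BASES=$((5513046647 + 21341619410))

# Rough upper bound on sliding-window positions across all reads.
# ASSUMPTION: each read of length L contributes about (L - d) / t positions.
estimate_windows() {   # usage: estimate_windows D T
    echo $(( (GOOD_BASES - GOOD_READS * $1) / $2 ))
}

echo "d=5000  t=20 -> $(estimate_windows 5000 20) window positions"
echo "d=10000 t=10 -> $(estimate_windows 10000 10) window positions"
echo "d=10000 t=5  -> $(estimate_windows 10000 5) window positions"
```

Under these assumptions, halving `-t` doubles the number of candidate k-mer pairs, while a larger `-d` trims every read's contribution, which is consistent with the author's advice to trade one against the other.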

In [7]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76
  3) pandoc/1.19.2      6) perl/5.28.0


In [6]:
module unload links/v1.8.6
module load links/v1.8.5

In [7]:
LINKS -h

Unknown option: h
Usage: /software/bioinformatics/links_v1.8.5/LINKS [v1.8.5]
-f  sequences to scaffold (Multi-FASTA format, required)
-s  file-of-filenames, full path to long sequence reads or MPET pairs [see below] (Multi-FASTA/fastq format, required)
-m  MPET reads (default -m 1 = yes, default = no, optional
	! DO NOT SET IF NOT USING MPET. WHEN SET, LINKS WILL EXPECT A SPECIAL FORMAT UNDER -s
	! Paired MPET reads in their original outward orientation <- -> must be separated by ":"
	  >template_name
	  ACGACACTATGCATAAGCAGACGAGCAGCGACGCAGCACG:ATATATAGCGCACGACGCAGCACAGCAGCAGACGAC
-d  distance between k-mer pairs (ie. target distances to re-scaffold on. default -d 4000, optional)
	Multiple distances are separated by comma. eg. -d 500,1000,2000,3000
-k  k-mer value (default -k 15, optional)
-t  step of sliding window when extracting k-mer pairs from long reads (default -t 2, optional)
	Multiple steps are separated by comma. eg. -t 10,5
-o  offset position for extracting k-mer pairs (de

: 255

In [11]:
mkdir $WORKDIR/k21_t20_d5000

#### Round 1

In [12]:
bsub -J links \
-m wkoppb50 \
-o $WORKDIR/k21_t20_d5000/links.out \
-e $WORKDIR/k21_t20_d5000/links.err \
"LINKS \
-f $ASSEMBLY_P1 \
-s $WORKDIR/ONT_data_files.txt \
-d 5000 \
-k 21 \
-t 20 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t20_d5000"

Job <737780> is submitted to default queue <lowpriority>.


In [22]:
module load samtools
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    5) perlbrew/0.76      9) bwa/0.7.17
  2) texlive/20151117   6) perl/5.28.0       10) samtools/1.9
  3) pandoc/1.19.2      7) asub/2.1
  4) git/2.21.0         8) links/v1.8.5


In [23]:
samtools faidx $WORKDIR/k21_t20_d5000/scaffolds_k21_t20_d5000.scaffolds.fa
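The `.fai` index written by `samtools faidx` lists each scaffold's length in its second column, which is enough to compute contiguity statistics such as N50 after each round. A minimal sketch on a toy index follows. The helper name `n50_from_fai` and the toy lengths are illustrative assumptions, not results from this assembly.

```shell
# Hypothetical helper: N50 from a samtools .fai index (column 2 = sequence length).
n50_from_fai() {
    cut -f2 "$1" | sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            # Walk from the longest sequence down until half the total is covered.
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) { print len[i]; exit }
            }
        }'
}

# Toy index with four sequences of 80, 70, 30 and 20 bases (total 200).
toy_fai=$(mktemp)
printf 's1\t80\t4\t60\t61\ns2\t70\t90\t60\t61\ns3\t30\t165\t60\t61\ns4\t20\t200\t60\t61\n' > "$toy_fai"
toy_n50=$(n50_from_fai "$toy_fai")
echo "toy N50: $toy_n50"
```

On the real data, the same helper would be pointed at the `.scaffolds.fa.fai` file produced by the command above.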

In [4]:
ls $WORKDIR/SLR

align-self.bam		   samtools_view_bwa_align.err
align-withLongReads.bam    samtools_view_bwa_align.out
align-withLongReads.sam    samtools_view.err
ambiguous-contig-set.fa    samtools_view.out
bwa_align.err		   scaffold_set.fa
bwa_align.out		   scaffold_set.fa.fai
bwa_index.err		   scaffold_tag.fa
bwa_index.out		   shortContigOptimizePathInLongRead.fa
graph.fa		   shortContigPathInLongRead.fa
graph.GFA2		   slr.err
optimizePathInLongRead.fa  slr.out
originalPathInLongRead.fa  unique-contig-set.fa


The d10-k21-t5 run (`-d 10000 -k 21 -t 5`) failed with an error message, so round 2 uses `-t 10`.


#### Round 2

In [25]:
bsub -J links \
-m aklppb34 \
-o $WORKDIR/k21_t20_d5000/links_d10k21t10.out \
-e $WORKDIR/k21_t20_d5000/links_d10k21t10.err \
"LINKS \
-f /workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/k21_t20_d5000/scaffolds_k21_t20_d5000.scaffolds.fa \
-s $WORKDIR/ONT_data_files.txt \
-d 10000 \
-k 21 \
-t 10 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t10_d10000"

Job <739248> is submitted to default queue <lowpriority>.


**NOTE**: this round failed with `-t 5`, so `-t 10` was used instead.

#### Round 3

In [35]:
bsub -J links \
-m aklppb34 \
-o $WORKDIR/k21_t20_d5000/links_d15k21t10.out \
-e $WORKDIR/k21_t20_d5000/links_d15k21t10.err \
"LINKS \
-f /workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/k21_t20_d5000/scaffolds_k21_t10_d10000.scaffolds.fa \
-s $WORKDIR/ONT_data_files.txt \
-d 15000 \
-k 21 \
-t 10 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t10_d15000"

Job <743665> is submitted to default queue <lowpriority>.


**NOTE**: this round failed with `-t 5`, so `-t 10` was used instead.

#### Round 4

In [38]:
bsub -J links \
-m aklppb34 \
-o $WORKDIR/k21_t20_d5000/links_d20k21t10.out \
-e $WORKDIR/k21_t20_d5000/links_d20k21t10.err \
"LINKS \
-f /workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/k21_t20_d5000/scaffolds_k21_t10_d15000.scaffolds.fa \
-s $WORKDIR/ONT_data_files.txt \
-d 20000 \
-k 21 \
-t 10 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t10_d20000"

Job <744627> is submitted to default queue <lowpriority>.


#### Round 5

In [8]:
bsub -J links \
-m aklppb34 \
-o $WORKDIR/k21_t20_d5000/links_d25k21t10.out \
-e $WORKDIR/k21_t20_d5000/links_d25k21t10.err \
"LINKS \
-f /workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/k21_t20_d5000/scaffolds_k21_t10_d20000.scaffolds.fa \
-s $WORKDIR/ONT_data_files.txt \
-d 25000 \
-k 21 \
-t 10 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t10_d25000"

Job <746221> is submitted to default queue <lowpriority>.


#### Round 6

In [13]:
bsub -J links \
-m aklppb34 \
-o $WORKDIR/k21_t20_d5000/links_d30k21t10.out \
-e $WORKDIR/k21_t20_d5000/links_d30k21t10.err \
"LINKS \
-f /workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/k21_t20_d5000/scaffolds_k21_t10_d25000.scaffolds.fa \
-s $WORKDIR/ONT_data_files.txt \
-d 30000 \
-k 21 \
-t 10 \
-b $WORKDIR/k21_t20_d5000/scaffolds_k21_t10_d30000"

Job <746466> is submitted to default queue <lowpriority>.
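The six rounds above form a chain: each round's `.scaffolds.fa` output becomes the next round's `-f` input, with `-d` increasing by 5000 each time. A dry-run sketch that only prints the corresponding LINKS commands is given below. The placeholder paths and the function name `build_links_rounds` are assumptions for illustration, and nothing is submitted.

```shell
# Placeholders so the sketch runs anywhere; on the cluster these are already set.
WORKDIR=${WORKDIR:-/path/to/004.Scaffolding_10Xcontigs}
ASSEMBLY_P1=${ASSEMBLY_P1:-/path/to/G3_2_S4.1.2KB.fasta}
OUTDIR="$WORKDIR/k21_t20_d5000"

# Print one LINKS command per round, feeding each output into the next round.
build_links_rounds() {
    local input
    local d
    local t
    local base
    input="$ASSEMBLY_P1"
    for d in 5000 10000 15000 20000 25000 30000; do
        # Round 1 used -t 20; later rounds used -t 10 because -t 5 failed.
        if [ "$d" -eq 5000 ]; then t=20; else t=10; fi
        base="$OUTDIR/scaffolds_k21_t${t}_d${d}"
        echo "LINKS -f $input -s $WORKDIR/ONT_data_files.txt -d $d -k 21 -t $t -b $base"
        input="$base.scaffolds.fa"
    done
}

build_links_rounds
```

Each round has to finish before the next one starts, so in practice the printed commands would be submitted one at a time, or chained with the scheduler's job-dependency mechanism.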


## 3. SLR

SLR uses alignment-based scaffolding and can classify ambiguous contigs.

In [37]:
module load bwa
module load samtools
module load bamtools
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    5) perlbrew/0.76      9) bwa/0.7.17
  2) texlive/20151117   6) perl/5.28.0       10) samtools/1.9
  3) pandoc/1.19.2      7) asub/2.1          11) bamtools/2.4.0
  4) git/2.21.0         8) links/v1.8.5


In [33]:
export PATH=/workspace/hraczw/github/programs/SLR/:$PATH

In [36]:
mkdir 004.Scaffolding_10Xcontigs/SLR

In [10]:
WORKDIR_SLR=/powerplant/workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/SLR

### 3.1 Indexing the contig FASTA

In [18]:
module load bwa
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) links/v1.8.5
  3) pandoc/1.19.2      6) perl/5.28.0        9) bwa/0.7.17


In [45]:
bsub -J bwa \
-o $WORKDIR_SLR/bwa_index.out \
-e $WORKDIR_SLR/bwa_index.err \
"bwa index $ASSEMBLY_P1"

Job <376496> is submitted to default queue <normal>.


### 3.2 Contig self-alignment

In [55]:
ls /output/genomic/plant/Gillenia/trifoliata/31_ReleaseS4/01_Release/

G3_2_S4.1.2KB.fasta	 G3_2_S4.1.2KB.fasta.bwt  G3_2_S4.1.2KB.fasta.sa
G3_2_S4.1.2KB.fasta.amb  G3_2_S4.1.2KB.fasta.fai  G3_2_S4.2.2KB.fasta
G3_2_S4.1.2KB.fasta.ann  G3_2_S4.1.2KB.fasta.pac  G3_2_S4.2.2KB.fasta.fai


In [15]:
ASSEMBLY_PATH=/output/genomic/plant/Gillenia/trifoliata/31_ReleaseS4/01_Release/

In [62]:
bsub -J bwa \
-n 12 \
-o $WORKDIR_SLR/bwa_index.out \
-e $WORKDIR_SLR/bwa_index.err \
"bwa mem -t 12 -a $ASSEMBLY_PATH/G3_2_S4.1.2KB.fasta $ASSEMBLY_PATH/G3_2_S4.1.2KB.fasta > $WORKDIR_SLR/align-self.sam"

Job <376523> is submitted to default queue <normal>.


In [64]:
module load samtools
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) links/v1.8.5
  2) texlive/20151117   5) perlbrew/0.76      8) bwa/0.7.17
  3) pandoc/1.19.2      6) asub/2.1           9) samtools/1.9


In [65]:
bsub -J view \
-o $WORKDIR_SLR/samtools_view.out \
-e $WORKDIR_SLR/samtools_view.err \
"samtools view -Sb $WORKDIR_SLR/align-self.sam > $WORKDIR_SLR/align-self.bam"

Job <376526> is submitted to default queue <normal>.


### 3.3 Aligning long reads to contigs

In [14]:
LR_P=/workspace/hraczw/github/GA/Gillenia_genome/004.Scaffolding_10Xcontigs/Gillenia_PromethION.ml30k.fastq

In [19]:
bsub -J align \
-o $WORKDIR_SLR/bwa_align.out \
-e $WORKDIR_SLR/bwa_align.err \
-n 40 \
"bwa mem -t40 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y \
$ASSEMBLY_P1 \
$LR_P \
> $WORKDIR_SLR/align-withLongReads.sam"

Job <737965> is submitted to default queue <lowpriority>.


In [24]:
bsub -J view \
-n 40 \
-o $WORKDIR_SLR/samtools_view_bwa_align.out \
-e $WORKDIR_SLR/samtools_view_bwa_align.err \
"samtools view -Sb -@ 40 $WORKDIR_SLR/align-withLongReads.sam > $WORKDIR_SLR/align-withLongReads.bam"

Job <739151> is submitted to default queue <lowpriority>.


In [33]:
bsub -J SLR \
-m aklppb34 \
-o $WORKDIR_SLR/slr.out \
-e $WORKDIR_SLR/slr.err \
"SLR -c $ASSEMBLY_P1 \
-r $WORKDIR_SLR/align-withLongReads.bam \
-d $WORKDIR_SLR/align-self.bam \
-p $WORKDIR_SLR"

Job <740622> is submitted to default queue <lowpriority>.
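SLR depends on two BAM files that are produced by earlier, asynchronous LSF jobs, so it can be worth confirming that both exist and are non-empty before submitting. A minimal preflight sketch follows. The helper name `require_nonempty` is hypothetical, and the demonstration uses dummy files in a scratch directory rather than the real alignments.

```shell
# Hypothetical helper: succeed only if every listed file exists and is non-empty.
require_nonempty() {
    local f
    local rc
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Toy demonstration: one populated file, one empty file simulating an unfinished job.
demo_dir=$(mktemp -d)
printf 'x\n' > "$demo_dir/align-self.bam"
: > "$demo_dir/align-withLongReads.bam"

if require_nonempty "$demo_dir/align-self.bam" "$demo_dir/align-withLongReads.bam" 2>/dev/null; then
    preflight=ok
else
    preflight=failed
fi
echo "preflight: $preflight"
```

On the cluster, the same check would be run against `$WORKDIR_SLR/align-self.bam` and `$WORKDIR_SLR/align-withLongReads.bam`.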


In [11]:
ls $WORKDIR_SLR

align-self.bam		     samtools_view_bwa_align.out
align-withLongReads.bam      samtools_view.err
align-withLongReads.sam      samtools_view.out
ambiguous-contig-set.fa      scaffold_set.fa
bwa_align.err		     scaffold_set.fa.fai
bwa_align.out		     scaffold_set_renamedForBusco.fasta
bwa_index.err		     scaffold_tag.fa
bwa_index.out		     shortContigOptimizePathInLongRead.fa
graph.fa		     shortContigPathInLongRead.fa
graph.GFA2		     slr.err
optimizePathInLongRead.fa    slr.out
originalPathInLongRead.fa    unique-contig-set.fa
samtools_view_bwa_align.err
