# Assembling the *Vaccinium corymbosum* 'Nui' Genome

## Data sources

- HiC Data for Nui is here:
    - /input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX


- 10X data for Nui and M7 here:
    - /input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF18813_H7JY3DRXX


- ONT PromethION Nui (BB2020 and BB2020-2 are the same sample) here:
    - /input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020


- ONT MinION Nui (BB2020) here:
    - /input/genomic/plant/Vaccinium/corymbosum/CAGRF21436/20200224_MinION/AGRF_CAGRFF21436_FAL87845_BB2020/


- 10X Supernova Assembly for 10X data here:
    - /output/genomic/plant/Vaccinium/corymbosum/2021_GenomeAssembly/Nui/01_Supernova

### Plan 
- Basecall the ONT samples with Guppy v5.
- Filter out MinION reads <1 kb.
- Cecilia has already done the Supernova assembly for the 10X data.
- Use Flye to assemble the ONT FASTQ reads.
- Use quickmerge to merge the Supernova contigs and the ONT contigs.
- Use Salsa to improve the assembly.
- Tetraploid haplotyping, gene annotation, etc.



## PromethION Basecalling
These notes cover the PromethION dataset CAGRF21436, which is Nui. BB2020 and BB2020-2 are the same individual; the sample had to be resent.

Run info here: https://storage.powerplant.pfr.co.nz/input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020/AGRF_CAGRF21436-2_PAE71986_BB2020/report_PAE71108_20200325_0131_8043ab86.pdf


In [5]:
module avail guppy
module load guppy/5.0.7

---------- /software/OSutils/modules-4.7.1/share/Modules/modulefiles -----------
guppy/3.2.4  guppy/3.5.2  guppy/4.2.2  guppy/5.0.7  
guppy/3.4.4  guppy/3.6.1  guppy/4.4.1  

Key:
modulepath  
Loading guppy/5.0.7
  Loading requirement: singularity/3


In [3]:
WKDIR=/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT
INDIR2=/input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020/AGRF_CAGRF21436-2_PAE71986_BB2020-2/fast5_pass
INDIR1=/input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020/AGRF_CAGRF21436-2_PAE71986_BB2020/fast5_pass
FAST5DIR=/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/fast5
ls $WKDIR


BB2020_MinION_1kb    BB2020_PromethION_Fastq  FastQC	  log
BB2020_MinION_Fastq  fast5		      gunzip1.sh


In [4]:
#mkdir $WKDIR/
#mkdir $WKDIR/log
mkdir $FAST5DIR
#mkdir $WKDIR/BB2020_Fastq
ls $WKDIR


BB2020_Fastq  fast5  FastQC  log


In [67]:
ls $INDIR2/PAE71986_pass_9ec138bf_0* | head -n 10

/input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020/AGRF_CAGRF21436-2_PAE71986_BB2020-2/fast5_pass/PAE71986_pass_9ec138bf_0.fast5.gz
/workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/fast5


### Fast5 files
All of the gzipped raw fast5 files were decompressed into `$FAST5DIR` under /workspace/hraijc.
Interestingly, the uncompressed fast5 files were slightly smaller than the gzipped ones, which was unexpected.
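
One possible explanation for the size observation, offered as an assumption that was not checked against these files: fast5 is an HDF5 container whose signal data is usually already compressed internally, and gzip cannot shrink already-compressed bytes, so it only adds its own header and block overhead. The sketch below reproduces that effect on a throwaway file of random bytes standing in for a fast5 file; the file names are hypothetical.

```shell
# Sketch: gzip an incompressible payload and compare byte counts.
# Random bytes stand in for an internally compressed fast5 file (assumption).
tmp=$(mktemp -d)

head -c 1048576 /dev/urandom > "$tmp/demo.fast5"      # 1 MiB demo payload
gzip -c "$tmp/demo.fast5" > "$tmp/demo.fast5.gz"

raw_bytes=$(wc -c < "$tmp/demo.fast5" | tr -d ' ')
gz_bytes=$(wc -c < "$tmp/demo.fast5.gz" | tr -d ' ')

echo "raw: $raw_bytes bytes, gzipped: $gz_bytes bytes"
rm -rf "$tmp"
```

If the real files behave the same way, the gzipped copy is expected to come out a little larger than the raw one.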


In [49]:
cd $INDIR2
for file in *.gz
do 
    echo gunzip -c $INDIR2/$file '>' $FAST5DIR/${file%.gz}
done > $WKDIR/gunzip.sh

In [50]:
cd $WKDIR
chmod +x gunzip.sh
bsub -J gzip2 -o $WKDIR/log/gzip2.out -e $WKDIR/log/gzip2.err -n 1 "$WKDIR/gunzip.sh"

Job <285969> is submitted to default queue <lowpriority>.


In [73]:
cd $INDIR1
for file in *.gz
do 
    echo gunzip -c $INDIR1/$file '>' $FAST5DIR/${file%.gz}
done > $WKDIR/gunzip1.sh

In [74]:
cd $WKDIR
chmod +x gunzip1.sh
bsub -J gzip1 -o $WKDIR/log/gzip1.out -e $WKDIR/log/gzip1.err -n 8 "$WKDIR/gunzip1.sh"

Job <285975> is submitted to default queue <lowpriority>.
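
The generated `gunzip.sh` scripts above work for these file names, but they do not quote paths or check archive integrity. A minimal sketch of a more defensive variant is shown here; `decompress_dir` is a hypothetical helper and the demo files are throwaway stand-ins, not the project data.

```shell
# decompress_dir SRC DEST: gunzip every *.gz in SRC into DEST, leaving SRC untouched.
decompress_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for gz in "$src"/*.gz; do
        [ -e "$gz" ] || continue              # glob matched nothing
        name=$(basename "$gz" .gz)
        # Check the archive first so a truncated file is reported instead of written.
        if gzip -t "$gz" 2>/dev/null; then
            gunzip -c "$gz" > "$dest/$name"
        else
            echo "corrupt archive skipped: $gz" >&2
        fi
    done
}

# Demo on temporary files (hypothetical names).
tmp=$(mktemp -d)
mkdir "$tmp/in"
printf 'signal-a\n' | gzip -c > "$tmp/in/read_0.fast5.gz"
printf 'signal-b\n' | gzip -c > "$tmp/in/read_1.fast5.gz"
printf 'not gzip\n' > "$tmp/in/broken.fast5.gz"

decompress_dir "$tmp/in" "$tmp/fast5"
n_out=$(ls "$tmp/fast5" | wc -l | tr -d ' ')
first=$(cat "$tmp/fast5/read_0.fast5")
echo "decompressed $n_out files"
rm -rf "$tmp"
```

On the cluster the same function could be wrapped in a `bsub` call in place of the generated script.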


### Basecalling with Guppy v5
Basecalling was previously done by AGRF with Guppy v3. Here it is redone with Guppy v5, using the dna_r9.4.1_450bps_hac_prom.cfg config file on GPU (noted as 2 GPUs, although the command below specifies only `cuda:0`).

In [111]:
bsub -J guppyBB20 -o $WKDIR/log/guppyBB20.out -e $WKDIR/log/guppyBB20.err -R 'gpu' "guppy_basecaller --input_path $FAST5DIR --save_path $WKDIR/BB2020_Fastq -c dna_r9.4.1_450bps_hac_prom.cfg -x 'cuda:0'"

Job <286027> is submitted to default queue <lowpriority>.


In [16]:
#mkdir /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/guppy5
#mv /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/* /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/guppy5
#cat /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/pass/*.fastq > /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/Nui_BB2020_guppy5.fastq

In [1]:
#Cleanup and Concat Guppy Fastq files together

bsub -J concat5 -o $WKDIR/log/concat5.out -e $WKDIR/log/concat5.err "cat /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/pass/*.fastq > /workspace/hraijc/Blueberry_Antho_QLT/092021_3MbReassembly/Nui_BB2020_ONT/BB2020_Fastq/BB2020_gupppy5.fastq"

mkdir: cannot create directory ‘/BB2020_Fastq/guppylogs’: No such file or directory
mv: cannot stat ‘/BB2020_Fastq/*.log’: No such file or directory
Job <286420> is submitted to default queue <lowpriority>.


In [3]:
mkdir $WKDIR/BB2020_Fastq/guppylogs
mv $WKDIR/BB2020_Fastq/*.log $WKDIR/BB2020_Fastq/guppylogs
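
After concatenating the per-batch FASTQ files, it may be worth confirming that no reads were lost. The sketch below counts records in the parts and in the merged file and compares them. It assumes strict four-line FASTQ records; `count_reads` and the demo file names are hypothetical.

```shell
# count_reads FILE...: number of FASTQ records, assuming 4 lines per record.
count_reads() {
    cat "$@" | awk 'END { printf "%d\n", NR / 4 }'
}

# Demo: two small batches merged into one file outside the pass/ directory.
tmp=$(mktemp -d)
mkdir "$tmp/pass"
printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' > "$tmp/pass/batch_0.fastq"
printf '@r3\nTTTAAA\n+\nIIIIII\n' > "$tmp/pass/batch_1.fastq"
cat "$tmp/pass"/*.fastq > "$tmp/merged.fastq"

parts=$(count_reads "$tmp/pass"/*.fastq)
merged=$(count_reads "$tmp/merged.fastq")
if [ "$parts" -eq "$merged" ]; then status=ok; else status=mismatch; fi
echo "parts=$parts merged=$merged status=$status"
rm -rf "$tmp"
```

Writing the merged file outside `pass/` keeps it from being picked up by the `*.fastq` glob on a rerun.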

### Check PromethION with MinIONQC

In [5]:
ls 

BB2020_gupppy5.fastq  guppylogs  sequencing_summary.txt
fail		      pass	 sequencing_telemetry.js


In [7]:
module load R/4.0.0
#Rscript /workspace/hraijc/git_clones/MinIONQC.R -h

Loading R/4.0.0
  Loading requirement: unixODBC/2.3.0 JAGS/4.2.0 gdal/2.4.0 proj/5.2.0


In [8]:
bsub -J minqc  -o $WKDIR/log/minqc.out -e $WKDIR/log/minqc.err -n 1 \
"Rscript /workspace/hraijc/git_clones/MinIONQC.R -i $WKDIR/BB2020_PromethION_Fastq/sequencing_summary.txt -o $WKDIR/BB2020_PromethION_Fastq/minionqc -c TRUE"

Job <635046> is submitted to default queue <lowpriority>.
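
For a quick look at a run before the MinIONQC plots are ready, the summary file can also be tallied directly with awk. This is only a sketch: it assumes the tab-separated `sequencing_summary.txt` has columns named `passes_filtering`, `sequence_length_template` and `mean_qscore_template`, and it runs on a tiny made-up table, not the real file. `summarise_summary` is a hypothetical helper.

```shell
# summarise_summary FILE: per passes_filtering value, print
# flag, read count, total bases, mean of the per-read mean Q score.
summarise_summary() {
    awk -F '\t' '
        NR == 1 {                                  # map column names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            flag = $(col["passes_filtering"])
            n[flag]++
            bases[flag] += $(col["sequence_length_template"])
            q[flag] += $(col["mean_qscore_template"])
        }
        END {
            for (f in n)
                printf "%s\t%d\t%.0f\t%.2f\n", f, n[f], bases[f], q[f] / n[f]
        }
    ' "$1" | sort
}

# Demo table with made-up reads.
tmp=$(mktemp)
{
    printf 'read_id\tpasses_filtering\tsequence_length_template\tmean_qscore_template\n'
    printf 'a\tTRUE\t1200\t11.0\n'
    printf 'b\tTRUE\t800\t9.0\n'
    printf 'c\tFALSE\t300\t5.5\n'
} > "$tmp"

summary=$(summarise_summary "$tmp")
printf '%s\n' "$summary"
rm -f "$tmp"
```

Looking columns up by header name keeps the sketch independent of column order, which can differ between basecaller versions.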


## MinION basecalling and size filtering
### Basecalling

In [9]:
module load guppy/5.0.7

In [10]:
WKDIR=/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT
INDIR3=/input/genomic/plant/Vaccinium/corymbosum/CAGRF21436/20200224_MinION/AGRF_CAGRFF21436_FAL87845_BB2020/fast5
FAST5DIR=/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/fast5
ls $WKDIR

BB2020_MinION_Fastq	 BB2020_PromethION_Fastq  FastQC      log
BB2020_MinION_Fastq_106  fast5			  gunzip1.sh


In [19]:
cd $INDIR3
for file in *.gz
do 
    echo gunzip -c $INDIR3/$file '>' $FAST5DIR/${file%.gz}
done > $WKDIR/gunzip1.sh

In [20]:
cd $WKDIR
chmod +x gunzip1.sh
bsub -J gunzip1 -o $WKDIR/log/gunzip1.out -e $WKDIR/log/gunzip1.err -n 10 \
"parallel -a ${WKDIR}/gunzip1.sh"

Job <471668> is submitted to default queue <lowpriority>.
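
Running the command file through `parallel -a` executes each line as its own job. Where GNU parallel is not available, a similar effect can be sketched with `xargs -P`, which is assumed here to be supported (it is in GNU and BSD xargs, but is not part of POSIX). The demo below uses temporary files with made-up names.

```shell
# Build a small command file, one gunzip command per line, like gunzip1.sh.
tmp=$(mktemp -d)
for i in 0 1 2 3; do
    printf 'payload-%s\n' "$i" | gzip -c > "$tmp/f$i.gz"
    echo "gunzip -c $tmp/f$i.gz > $tmp/f$i.out"
done > "$tmp/jobs.sh"

# Run up to 4 lines at a time; each whole line is handed to sh -c,
# so the '>' redirection inside the line is interpreted by that shell.
xargs -P 4 -I{} sh -c '{}' < "$tmp/jobs.sh"

n_done=$(ls "$tmp"/*.out | wc -l | tr -d ' ')
third=$(cat "$tmp/f2.out")
echo "finished $n_done jobs"
rm -rf "$tmp"
```

Whichever tool is used, the number of simultaneous jobs should probably match the slots requested with `bsub -n`.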


In [4]:
# Printout the Kit and Flowcell codes for Guppy
bsub -J gtest -o $WKDIR/log/gtest.out -e $WKDIR/log/gtest.err -R 'gpu' \
"guppy_basecaller --print_workflows"

guppy_basecaller: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory


: 127

In [6]:
grep LSK109 $WKDIR/log/gtest.out | grep FLO-MIN

FLO-MIN111 SQK-LSK109                dna_r10.3_450bps_hac           2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-MIN111 SQK-LSK109-XL             dna_r10.3_450bps_hac           2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-MIN110 SQK-LSK109                dna_r10_450bps_hac             unknown
FLO-MIN110 SQK-LSK109-XL             dna_r10_450bps_hac             unknown
FLO-MIN106 SQK-LSK109                dna_r9.4.1_450bps_hac          2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
FLO-MIN106 SQK-LSK109-XL             dna_r9.4.1_450bps_hac          2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
FLO-MIN107 SQK-LSK109                dna_r9.5_450bps                unknown
FLO-MIN111 SQK-LSK109                dna_r10.3_450bps_hac           2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-MIN111 SQK-LSK109-XL             dna_r10.3_450bps_hac           2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-MIN110 SQK-LSK109                dna_r10_450bps_hac             unkn

Not sure about the flowcell type.
AGRF says the flowcell ID was FAL87845 and the library prep kit was LSK109. The sequencing was done in 2020.
My guess is that it used the R9.4.1 chemistry, not the R10 chemistry. The most recent config with R9 chemistry is dna_r9.4.1_450bps_hac.
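
Once a flowcell type is settled on, the matching config can be pulled out of the saved `--print_workflows` listing without reading it by eye. The sketch below does this with awk on an excerpt of the table above. FLO-MIN106 is the guess from these notes, not a value confirmed by AGRF, and `lookup_config` is a hypothetical helper.

```shell
# lookup_config FLOWCELL KIT FILE: print the config name for an exact
# flowcell/kit pair from a saved --print_workflows listing.
lookup_config() {
    awk -v fc="$1" -v kit="$2" '$1 == fc && $2 == kit { print $3; exit }' "$3"
}

# Excerpt of the listing shown above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
FLO-MIN111 SQK-LSK109                dna_r10.3_450bps_hac           2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-MIN110 SQK-LSK109                dna_r10_450bps_hac             unknown
FLO-MIN106 SQK-LSK109                dna_r9.4.1_450bps_hac          2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
FLO-MIN106 SQK-LSK109-XL             dna_r9.4.1_450bps_hac          2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
EOF

# Assumed flowcell type (a guess, see the note above).
cfg=$(lookup_config FLO-MIN106 SQK-LSK109 "$tmp")
echo "config file: $cfg.cfg"
rm -f "$tmp"
```

Matching on the exact kit string keeps `SQK-LSK109` from also selecting the `SQK-LSK109-XL` rows, which the `grep LSK109` above does not distinguish.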

In [10]:
# Basecalling with Guppy V5
bsub -J guppyMinIONBB20 -o $WKDIR/log/guppyMinIONBB20.out -e $WKDIR/log/guppyMinIONBB20.err -R 'gpu' \
"guppy_basecaller --input_path $FAST5DIR --save_path $WKDIR/BB2020_MinION_Fastq -c dna_r9.4.1_450bps_hac.cfg -x 'cuda:0'"

Job <627864> is submitted to default queue <lowpriority>.


In [15]:
# Basecalling with Guppy v5
# Tried again with an R10 config (dna_r10.3_450bps_hac, listed above for FLO-MIN111).
# Definitely not this one.
#bsub -J guppyMinIONBB20 -o $WKDIR/log/guppyMinIONBB20.out -e $WKDIR/log/guppyMinIONBB20.err -R 'gpu' \
#"guppy_basecaller --input_path $FAST5DIR --save_path $WKDIR/BB2020_MinION_Fastq -c dna_r10.3_450bps_hac.cfg -x 'cuda:0'"

Job <628152> is submitted to default queue <lowpriority>.


### MinION QC

In [2]:
module load R/4.0.0
mkdir /workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_Fastq/minionqc5
mkdir /workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_Fastq/minionqc3

Loading R/4.0.0
  Loading requirement: unixODBC/2.3.0 JAGS/4.2.0 gdal/2.4.0 proj/5.2.0


In [13]:
bsub -J minqc3  -o $WKDIR/log/minqc3.out -e $WKDIR/log/minqc3.err -n 1 \
"Rscript /workspace/hraijc/git_clones/MinIONQC.R -i /input/genomic/plant/Vaccinium/corymbosum/CAGRF21436/20200224_MinION/AGRF_CAGRFF21436_FAL87845_BB2020/sequencing_summary/sequencing_run_sequencing_summary.txt -o /workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_Fastq/minionqc3 -c TRUE"

Job <628072> is submitted to default queue <lowpriority>.


In [3]:
bsub -J minqc5  -o $WKDIR/log/minqc5.out -e $WKDIR/log/minqc5.err -n 1 \
"Rscript /workspace/hraijc/git_clones/MinIONQC.R -i /workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_Fastq/sequencing_summary.txt -o /workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_Fastq/minionqc5 -c TRUE"

Job <628671> is submitted to default queue <lowpriority>.


In [9]:
echo $WKDIR

/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT


## Filter MinION reads <1kb

In [10]:
mkdir $WKDIR/BB2020_MinION_1kb

In [27]:
bsub -J cat1  -o $WKDIR/log/cat1.out -e $WKDIR/log/cat1.err -n 2 \
"cat $WKDIR/BB2020_MinION_Fastq/pass/*.fastq > $WKDIR/BB2020_MinION_1kb/all.fastq"

Job <628693> is submitted to default queue <lowpriority>.


In [1]:
module load seqkit

In [29]:
bsub -J seqtkit1  -o $WKDIR/log/seqtkit1.out -e $WKDIR/log/seqtkit1.err -n 2 \
"seqkit seq -m 1000 $WKDIR/BB2020_MinION_1kb/all.fastq > $WKDIR/BB2020_MinION_1kb/BB2020_MinION_1kb.fastq"

Job <628696> is submitted to default queue <lowpriority>.


In [36]:
bsub -J seqtkit1  -o $WKDIR/log/seqtkit1.out -e $WKDIR/log/seqtkit1.err -n 2 \
"seqkit seq -m 5000 $WKDIR/BB2020_MinION_1kb/all.fastq > $WKDIR/BB2020_MinION_1kb/BB2020_MinION_5kb.fastq"

Job <628701> is submitted to default queue <lowpriority>.
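
To make explicit what the `seqkit seq -m` filter does, here is a minimal awk sketch of the same idea: keep a record only if its sequence is at least the minimum length. It assumes strict four-line FASTQ records and is meant as an illustration, not a replacement for seqkit. `filter_min_len` and the demo reads are hypothetical.

```shell
# filter_min_len MIN FILE: print FASTQ records whose sequence length is >= MIN.
filter_min_len() {
    awk -v min="$1" '
        { rec[NR % 4] = $0 }                       # 1=header 2=sequence 3=plus 0=quality
        NR % 4 == 0 && length(rec[2]) >= min {
            print rec[1]; print rec[2]; print rec[3]; print rec[0]
        }
    ' "$2"
}

# Demo: reads of length 4, 10 and 12, filtered at a minimum of 10.
tmp=$(mktemp)
printf '@s\nACGT\n+\nIIII\n' > "$tmp"
printf '@m\nACGTACGTAC\n+\nIIIIIIIIII\n' >> "$tmp"
printf '@l\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' >> "$tmp"

kept_ids=$(filter_min_len 10 "$tmp" | awk 'NR % 4 == 1' | paste -sd, -)
echo "kept: $kept_ids"
rm -f "$tmp"
```

The threshold is inclusive, which is consistent with the `min_len` of 1,000 that `seqkit stats` reports for the 1 kb file.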


In [37]:
bsub -J seqtkit2  -o $WKDIR/log/seqtkit2.out -e $WKDIR/log/seqtkit2.err -n 11 \
"seqkit stats -j 10 $WKDIR/BB2020_MinION_1kb/*.fastq"

Job <628702> is submitted to default queue <lowpriority>.


In [38]:
head -n 4 $WKDIR/log/seqtkit2.out

file                                                                                        format  type  num_seqs        sum_len  min_len   avg_len  max_len
/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_1kb/all.fastq                FASTQ   DNA    866,892  8,411,741,214       84   9,703.3  153,794
/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_1kb/BB2020_MinION_1kb.fastq  FASTQ   DNA    801,411  8,368,838,005    1,000  10,442.6  153,794
/workspace/hraijc/BB_Nui_Assembly/Nui_BB2020_ONT/BB2020_MinION_1kb/BB2020_MinION_5kb.fastq  FASTQ   DNA    429,039  7,311,167,339    5,000  17,040.8  153,794
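
The effect of each length cutoff is easier to judge as a percentage of the unfiltered data. The sketch below works these out from the `num_seqs` and `sum_len` values in the table above; `pct` is a hypothetical helper.

```shell
# pct PART TOTAL: PART as a percentage of TOTAL, two decimal places.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

# Values copied from the seqkit stats output above.
reads_1kb=$(pct 801411 866892)
bases_1kb=$(pct 8368838005 8411741214)
reads_5kb=$(pct 429039 866892)
bases_5kb=$(pct 7311167339 8411741214)

echo ">=1 kb keeps ${reads_1kb}% of reads and ${bases_1kb}% of bases"
echo ">=5 kb keeps ${reads_5kb}% of reads and ${bases_5kb}% of bases"
```

The 1 kb cutoff drops under a tenth of the reads while keeping over 99% of the bases. The 5 kb cutoff drops about half the reads but still keeps most of the bases.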
