# Hoki ONT reads and assembly

* **Project background**: This work is for a hoki population genetics study. There were two rounds of analysis. The first round used 360 fish, with the genome assembled from in-house MinION data only. The second round used 510 fish, with the genome assembled from in-house MinION data plus AGRF MinION data.
* **Scientists**: Emily Koot, Maren Wellenreuther and David Chagne.
* **Lab Scientist**: Elena Hilario
* **Data download**: Roy Storey
* **Variant calling**: Chen Wu

## 1. Data preparation

### 1.1 Basecalling the in-house MinION data produced by Elena Hilario

In [31]:
HOKI_Heart_short=/input/genomic/fish/Macruronus/novaezelandiae/ONT/GA_HOKI_Heart_short/20200918_0444_MN19482_FAO33682_c5967de7/fast5
HOKI_LIVER_QGT100=/input/genomic/fish/Macruronus/novaezelandiae/ONT/GA_HOKI_LIVER_QGT100/20200928_0347_MN19482_FAO09986_5229a3d6/fast5
HOKI_LIVER_QGT100_2nd_LIB=/input/genomic/fish/Macruronus/novaezelandiae/ONT/GA_HOKI_LIVER_QGT100_2nd_LIB/20200929_0355_MN19482_FAO33717_202f38df/fast5
HOKI_LIVER_BBLOB_30G=/input/genomic/fish/Macruronus/novaezelandiae/ONT/GA_HOKI_LIVER_BBLOB_30G/20201005_0430_MN19482_FAO33717_b24b075c/fast5
HOKI_LIVER_BBLOB30G_dsFragmentase=/input/genomic/fish/Macruronus/novaezelandiae/ONT/GA_HOKI_LIVER_BBLOB30G-dsFragmentase/20201007_0510_MN19482_FAO09986_955392c0/fast5

In [32]:
export HOKI_Heart_short
export HOKI_LIVER_QGT100
export HOKI_LIVER_QGT100_2nd_LIB
export HOKI_LIVER_BBLOB_30G
export HOKI_LIVER_BBLOB30G_dsFragmentase

In [19]:
module load guppy/4.2.2
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) singularity/3
  3) pandoc/1.19.2      6) perl/5.28.0        9) guppy/4.2.2


In [4]:
mkdir 001.baseCalling

In [3]:
WORKDIR=001.baseCalling

In [33]:
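python << EOF

import os

# The heredoc delimiter is unquoted, so the shell substitutes WORKDIR before Python runs.
# samples.txt lists one sample name per line; each name is also an exported
# shell variable holding the fast5 directory for that sample.
with open('$WORKDIR/samples.txt') as f:
    for line in f:
        sampleName = line.strip()
        if not sampleName:
            continue  # skip blank lines
        # Submit one GPU basecalling job per sample.
        os.system('bsub -J guppy_' + sampleName + ' \
               -o $WORKDIR/' + sampleName + '_guppy4.2.2.out \
               -e $WORKDIR/' + sampleName + '_guppy4.2.2.err \
               -R gpu \
               guppy_basecaller \
               --compress_fastq \
               --input_path ' + os.environ[sampleName] + ' \
               --save_path $WORKDIR/' + sampleName + ' \
               --flowcell FLO-MIN106 --kit SQK-LSK109 -x "cuda:0"')

EOF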

Job <415565> is submitted to default queue <lowpriority>.
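
The submission loop above builds each `bsub` command by string concatenation. As a minimal sketch of the same idea, the command can be assembled as an argument list and printed for inspection before anything is submitted. The function name `build_guppy_job` and the example path are hypothetical; the flags are the ones used in the cell above.

```python
import shlex


def build_guppy_job(sample, fast5_dir, workdir="001.baseCalling"):
    """Return the bsub argument list for one sample (nothing is submitted)."""
    log_prefix = f"{workdir}/{sample}_guppy4.2.2"
    return [
        "bsub", "-J", f"guppy_{sample}",
        "-o", f"{log_prefix}.out",
        "-e", f"{log_prefix}.err",
        "-R", "gpu",
        "guppy_basecaller",
        "--compress_fastq",
        "--input_path", fast5_dir,
        "--save_path", f"{workdir}/{sample}",
        "--flowcell", "FLO-MIN106",
        "--kit", "SQK-LSK109",
        "-x", "cuda:0",
    ]


# Hypothetical mapping that stands in for the exported shell variables.
samples = {"HOKI_Heart_short": "/path/to/HOKI_Heart_short/fast5"}

for name, fast5_dir in samples.items():
    cmd = build_guppy_job(name, fast5_dir)
    # Dry run: print the command; pass cmd to subprocess.run to submit it.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check log file names and paths for every sample before the jobs reach the queue.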


In [18]:
module load pycoqc
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) pycoqc/2.5.0.21
  3) pandoc/1.19.2      6) perl/5.28.0


In [25]:
pycoQC -f \
001.baseCalling/HOKI_Heart_short/sequencing_summary.txt \
001.baseCalling/HOKI_LIVER_QGT100/sequencing_summary.txt \
001.baseCalling/HOKI_LIVER_QGT100_2nd_LIB/sequencing_summary.txt \
001.baseCalling/HOKI_LIVER_BBLOB_30G/sequencing_summary.txt \
001.baseCalling/HOKI_LIVER_BBLOB30G_dsFragmentase/sequencing_summary.txt \
-o pycoQC_HOKI_summary_run

Checking arguments values
Check input data files
Parse data files
Merge data
Cleaning data
	Discarding lines containing NA values
		0 reads discarded
	Filtering out zero length reads
		0 reads discarded
	Sorting run IDs by decreasing throughput
		Run-id order ['015cc04062d5b433207361dc0578bf20b04010b2', '4c14a006972263ab1bb814630fac2cbc501726a8', 'b8e5a956557d82be8c97c2ee66254101001e0214', '9365aae54ec542733282cc010dadb3b4e17dad34', '236fcb04aa138cf1feea0a0cda7783405e53a5fe']
	Reordering runids
		Processing reads with Run_ID 015cc04062d5b433207361dc0578bf20b04010b2 / time offset: 0
		Processing reads with Run_ID 4c14a006972263ab1bb814630fac2cbc501726a8 / time offset: 68689.9965
		Processing reads with Run_ID b8e5a956557d82be8c97c2ee66254101001e0214 / time offset: 151180.41175
		Processing reads with Run_ID 9365aae54ec542733282cc010dadb3b4e17dad34 / time offset: 296202.4565
		Processing reads with Run_ID 236fcb04aa138cf1feea0a0cda7783405e53a5fe / time offset: 371726.99049999996
	Cast va

### 1.2 AGRF MinION data

* **Notes from Christopher Noune**: Library prep used the LSK109 kit. Sequencing used an R9.4.1 MinION flow cell. Basecalling used Guppy version 4.4.0 for fast accuracy and version 4.4.1 for high accuracy.

In [1]:
AGRF_CAGRF20062892_Blobblob=/input/genomic/fish/Macruronus/novaezelandiae/ONT/AGRF_CAGRF20062892_Blobblob/fast5_pass

### 1.3 Merging all fastq files from in-house and AGRF MinION data

In [1]:
ll 

total 80
drwxrwsr-x. 9 hraczw powerplant  846 Oct  8 15:39 001.baseCalling
-rw-rw-r--. 1 hraczw powerplant 3851 Oct 12 11:11 03-10-2020_baseCalling.ipynb


In [4]:
bsub -J merge -o merge.out -e merge.err "cat $WORKDIR/HOKI_Heart_short/*.fastq.gz \
$WORKDIR/HOKI_LIVER_QGT100/*.fastq.gz \
$WORKDIR/HOKI_LIVER_QGT100_2nd_LIB/*.fastq.gz \
$WORKDIR/HOKI_LIVER_BBLOB_30G/*.fastq.gz \
$WORKDIR/HOKI_LIVER_BBLOB30G_dsFragmentase/*.fastq.gz > HOKI_ont.fastq.gz"

Job <417748> is submitted to default queue <lowpriority>.


In [9]:
bsub -J unzip -o unzip.out -e unzip.err "gunzip HOKI_ont.fastq.gz"

Job <417749> is submitted to default queue <lowpriority>.


In [1]:
# merge with AGRF minion fastq_pass files
AGRF_CAGRF20062892_Blobblob_fastq=/input/genomic/fish/Macruronus/novaezelandiae/ONT/AGRF_CAGRF20062892_Blobblob/fastq_pass
bsub -J merge -o merge_agrf.out -e merge_agrf.err "cat $AGRF_CAGRF20062892_Blobblob_fastq/*.fastq | gzip > AGRF_HOKI_ont.fastq.gz"

Job <820028> is submitted to default queue <lowpriority>.


In [2]:
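# HOKI_ont.fastq.gz must exist at this point; the earlier gunzip step removes the .gz file unless run with -k
bsub -J merge -o merge_all.out -e merge_all.err "cat HOKI_ont.fastq.gz AGRF_HOKI_ont.fastq.gz > ALL_HOKI_ont.fastq.gz"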

Job <820060> is submitted to default queue <lowpriority>.
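
The merge steps above join `.fastq.gz` files with `cat` without decompressing them first. This works because the gzip format allows several compressed members in one file. The sketch below illustrates that property with two made-up reads; the read names and sequences are illustrative only.

```python
import gzip

# Two tiny, made-up FASTQ records standing in for two sequencing runs.
run_a = b"@read1\nACGT\n+\n!!!!\n"
run_b = b"@read2\nTTGA\n+\n####\n"

# Compress each run separately, then join the compressed bytes, as cat does.
merged = gzip.compress(run_a) + gzip.compress(run_b)

# Decompressing the joined stream yields both runs, in order.
restored = gzip.decompress(merged)
print(restored.decode())
```

Tools that read gzip input in the usual way see the merged file as one continuous FASTQ stream.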


## 2. Genome assembly

### 2.1 Shasta

* With the in-house MinION data only, the Shasta assembly was worse than the Flye assembly, so Shasta was not used for the second round of analysis.

In [7]:
mkdir 002.Shasta

In [8]:
WORKDIR=002.Shasta

In [11]:
ll

total 3872248
drwxrwsr-x. 9 hraczw powerplant        846 Oct  8 15:39 001.baseCalling
drwxrwsr-x. 2 hraczw powerplant          0 Oct 12 11:22 002.Shasta
-rw-rw-r--. 1 hraczw powerplant      34009 Oct 12 11:27 03-10-2020_baseCalling.ipynb
-rw-rw-r--. 1 hraczw powerplant 3297951666 Oct 12 11:15 HOKI_ont.fastq
-rw-rw-r--. 1 hraczw powerplant        336 Oct 12 11:13 merge.err
-rw-rw-r--. 1 hraczw powerplant       2268 Oct 12 11:15 merge.out
-rw-rw-r--. 1 hraczw powerplant          0 Oct 12 11:25 unzip.err
-rw-rw-r--. 1 hraczw powerplant        920 Oct 12 11:26 unzip.out


In [5]:
export PATH=/workspace/hraczw/github/programs/Shasta/:$PATH

In [16]:
bsub -J shasta -n 36 -m wkoppb50 -o shasta.out -e shasta.err \
"shasta-Linux-0.6.0 \
--input HOKI_ont.fastq \
--assemblyDirectory $WORKDIR/shasta \
--threads 36"

Job <417755> is submitted to default queue <lowpriority>.


In [14]:
shasta-Linux-0.6.0

Shasta Release 0.6.0

To run an assembly, use the "--input" option to specify the input files. Use the "--help" option for a description of the other options and parameters.

Default values of assembly parameters are not recommended for any specific application and mostly reflect approximate compatibility with previous releases.See the shasta/conf or shasta-install/conf directory for sample configuration files containing assembly parameters for specific applications.

For more information about the Shasta assembler, see
https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here:
https://chanzuckerberg.github.io/shasta

Options allowed only on the command line:
  -h [ --help ]                         Write a help message.
  -v [ --version ]                      Identify the Shasta version.
  --config arg                          Configuration file name.
  --input arg                           Names of input files containing reads.


### 2.2 Flye

In [5]:
export PATH=/workspace/hraczw/github/programs/FLYE_2-8-1/Flye/bin/:$PATH

#### 2.2.1 In-house MinION data only

In [23]:
bsub -J flye -n 36 -m wkoppb50 \
-o flye.out -e flye.err \
"flye --nano-raw HOKI_ont.fastq \
--genome-size 700m \
-o Flye_All -t 36 -i 1"

Job <418171> is submitted to default queue <lowpriority>.


In [3]:
module load assemblathon_stats/14dfdab

In [4]:
assemblathon_stats.pl Flye_All/assembly.fasta


---------------- Information for assembly 'Flye_All/assembly.fasta' ----------------


                                         Number of scaffolds       3339
                                     Total size of scaffolds  105375225
                                            Longest scaffold     180552
                                           Shortest scaffold        163
                                 Number of scaffolds > 1K nt       3166  94.8%
                                Number of scaffolds > 10K nt       2832  84.8%
                               Number of scaffolds > 100K nt         20   0.6%
                                 Number of scaffolds > 1M nt          0   0.0%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size      31559
                                        Median scaffold size      29688
                                         N50 scaffold length      40544
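
For readers unfamiliar with the N50 figure reported above, here is a minimal sketch of the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative, and `assemblathon_stats.pl` may differ in edge cases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that brings the running total to half is the N50.
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Toy example: total is 20, so the N50 is reached at the second-longest sequence.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

The total assembly size reported above (105,375,225 nt) is roughly 15% of the 700 Mb genome size passed to Flye, which is worth keeping in mind when reading the N50.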
             

#### 2.2.2 In-house plus AGRF MinION data

In [7]:
module load pfr-python3

In [8]:
# assembly with PFR and AGRF data

bsub -q priority -P P/536002/01 -J flye -n 36 -m wkoppb50 \
-o flye_PFR_AGRF.out -e flye_PFR_AGRF.err \
"flye --nano-raw ALL_HOKI_ont.fastq.gz \
--genome-size 700m \
-o Flye_PFR_AGRF -t 36 -i 1"

Job <820400> is submitted to queue <priority>.


#### 2.2.3 In-house plus AGRF MinION and PromethION data

In [1]:
AGRF_CAGRF20062892_Blobblob_PromethION_fastq=/input/genomic/fish/Macruronus/novaezelandiae/ONT/AGRF_CAGRF20062892_Blobblob_PromethION/fastq_pass

In [3]:
bsub -J merge -o merge_agrf_p.out -e merge_agrf_p.err "cat $AGRF_CAGRF20062892_Blobblob_PromethION_fastq/*.fastq.gz > AGRF_HOKI_ont_P.fastq.gz"

Job <33259> is submitted to default queue <lowpriority>.


In [4]:
bsub -J merge -o merge_all_plusP.out -e merge_all_plusP.err "cat ALL_HOKI_ont.fastq.gz AGRF_HOKI_ont_P.fastq.gz > ALL_HOKI_ont_plusP.fastq.gz"

Job <33260> is submitted to default queue <lowpriority>.


In [8]:
bsub -J flye -n 36 -m wkoppb50 \
-o flye_all_plusP.out -e flye_all_plusP.err \
"flye --nano-raw ALL_HOKI_ont_plusP.fastq.gz \
--genome-size 700m \
-o Flye_All_plusP -t 36 -i 1"

Job <33262> is submitted to default queue <lowpriority>.


In [None]:
bsub -q priority -P P/536002/01 -J flye -n 36 -m wkoppb50 \
-o flye_all_plusP_m1500.out -e flye_all_plusP_m1500.err \
"flye --nano-raw ALL_HOKI_ont_plusP.fastq.gz \
--genome-size 700m \
-m 1500 \
--keep-haplotypes \
-o Flye_All_plusP_m1500 -t 36 -i 1"

In [None]:
bsub -q priority -P P/536002/01 -J flye -n 36 -m wkoppb50 \
-o flye_all_plusP_m2000.out -e flye_all_plusP_m2000.err \
"flye --nano-raw ALL_HOKI_ont_plusP.fastq.gz \
--genome-size 700m \
-m 2000 \
--keep-haplotypes \
-o Flye_All_plusP_m2000 -t 36 -i 1"
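
The last two cells differ only in the Flye minimum overlap (`-m`) and the matching output names. As a sketch, the Flye command for each overlap value could be generated from one template so the settings stay in step. The helper name `flye_command` is hypothetical; the flags and file names are the ones used above.

```python
def flye_command(min_overlap, reads="ALL_HOKI_ont_plusP.fastq.gz", threads=36):
    """Build the Flye command string for one minimum-overlap setting."""
    out_dir = f"Flye_All_plusP_m{min_overlap}"
    return (
        f"flye --nano-raw {reads} --genome-size 700m "
        f"-m {min_overlap} --keep-haplotypes "
        f"-o {out_dir} -t {threads} -i 1"
    )


# Print the command for each overlap value tried above; wrap in bsub to submit.
for m in (1500, 2000):
    print(flye_command(m))
```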