## Data sources

- HiC Data for Nui is here:
    - /input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX


- 10X data for Nui and M7 here:
    - /input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF18813_H7JY3DRXX


- ONT PromethION Nui (BB2020 and BB2020-2 are the same sample) here:
    - /input/genomic/plant/Vaccinium/corymbosum/Blueberry_PromethION_Apr2020


- ONT MinION Nui (BB2020) here:
    - /input/genomic/plant/Vaccinium/corymbosum/CAGRF21436/20200224_MinION/AGRF_CAGRFF21436_FAL87845_BB2020/


- Supernova assembly of the 10X data here:
    - /output/genomic/plant/Vaccinium/corymbosum/2021_GenomeAssembly/Nui/01_Supernova

### Plan 
- Base-call the ONT samples using Guppy v5.
- Filter out MinION reads shorter than 1 kb (the cutoff may be raised).
- Cecilia has already done the Supernova assembly of the 10X data.
- Use Flye to assemble the ONT FASTQ reads.
- Use quickmerge to merge the Supernova contigs with the ONT contigs.
- Use SALSA to improve the merged assembly.
- Tetraploid haplotyping, gene annotation, etc.
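
The read-length filter in the plan can be sketched with awk. This is only a minimal sketch, not necessarily what 01_basecalling_ONT does: the function name `filter_fastq_by_length` and the toy reads are hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```bash
#!/usr/bin/env bash
# Keep only FASTQ records whose sequence is at least MIN_LEN bases long.
# Reads FASTQ on stdin, writes the surviving records to stdout.
filter_fastq_by_length() {
    local min_len=$1
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}

# Toy demonstration: reads of 4 and 12 bases, filtered at 10 bases.
printf '@short\nACGT\n+\nIIII\n@long\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' \
    | filter_fastq_by_length 10
```

On real data the function would sit in a pipe, for example `zcat reads.fastq.gz | filter_fastq_by_length 1000 | gzip > reads.min1kb.fastq.gz`, where both file names are placeholders.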



**See 01_basecalling_ONT.ipynb for ONT steps**

**See 02_flye.ipynb for ONT Assembly**

**See 03_CombineAssemblies.ipynb for quickmerge steps**


## Before you begin:

### Using the quickmerge8 Assembly for HiC scaffolding.

In [18]:
#Set Variables.


In [25]:
ASSEMBLY=/workspace/hraijc/BB_Nui_Assembly/Hybrid_assembly/quickmerge8/Nui_quickmerge8.fasta
APREFIX=qm8_HiC
HiC_RAW=/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX
WKDIR=/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_qm8
TEMPDIR=${WKDIR}/temp

In [26]:
cd ${WKDIR}

In [21]:
#See /workspace/hraijc/Raspberry/TestingOmniC/Dependencies.ipynb for building and installing dependencies.
module load bedtools/2.30.0
module load bwa/0.7.17
module load samtools/0.1.19


In [10]:
#Make directories
#mkdir ${TEMPDIR}
#mkdir ${WKDIR}/log

## Pre-Alignment

In [11]:
#Index the genome assembly (writes ${ASSEMBLY}.fai).
samtools faidx $ASSEMBLY

In [12]:
#Create genome file from genome index.
cut -f1,2 ${ASSEMBLY}.fai > ${ASSEMBLY%.fasta}.genome

In [13]:
head -n 2 ${ASSEMBLY%.fasta}.genome

contig_1	231433
contig_10000	28597
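
A quick sanity check on the genome file is to count the contigs and sum their lengths. The sketch below assumes the two-column, tab-separated layout shown above; `genome_summary` is a hypothetical helper name, and the toy file simply reuses the two lines printed by head.

```bash
#!/usr/bin/env bash
# Summarise a genome file (name<TAB>length): contig count, total bases, longest contig.
genome_summary() {
    awk -F'\t' '
        { n++; total += $2; if ($2 > max) max = $2 }
        END { printf "contigs=%d total_bp=%.0f longest_bp=%.0f\n", n, total, max }
    ' "$1"
}

# Toy demonstration using the two lines shown by head above.
tmp=$(mktemp)
printf 'contig_1\t231433\ncontig_10000\t28597\n' > "$tmp"
genome_summary "$tmp"
rm -f "$tmp"
```

Run against the real file it would be `genome_summary ${ASSEMBLY%.fasta}.genome`.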


In [14]:
#Create bwa index.
bsub -J indbwa -o ${WKDIR}/log/indbwa.log -e ${WKDIR}/log/indbwa.err -n 1 \
"bwa index ${ASSEMBLY}"

Job <649792> is submitted to default queue <lowpriority>.


In [15]:
du -sh ${HiC_RAW}/*

8.2G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R1.fastq.gz
12G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R2.fastq.gz
26K	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/checksums.exf
26K	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/checksums.md5
264K	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/DataValidation.pdf
9.9G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/Pepino_HiC_HJWHFDRXX_CAGATC_L002_R1.fastq.gz
11G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/Pepino_HiC_HJWHFDRXX_CAGATC_L002_R2.fastq.gz
48K	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/README.md
22G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX/Rewarewa_HiC_HJWHFDRXX_TCAAGA_L002_R1.fastq.gz
23G	/input/genomic/plant/Vaccinium/corymbosum/AGRF_CA

## Alignment

In [16]:
cd ${WKDIR}
cp ${ASSEMBLY}* .

In [16]:
#Align reads with BWA. Use -5SP for Hi-C reads.
bsub -J bwamem -o ${WKDIR}/log/bwamem.log -e ${WKDIR}/log/bwamem.err -n 25 \
"bwa mem -5SP -T0 -t24 ${ASSEMBLY} ${HiC_RAW}/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R1.fastq.gz ${HiC_RAW}/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R2.fastq.gz -o ${APREFIX}.sam" 


Job <709733> is submitted to default queue <lowpriority>.


In [17]:
grep "Success" ${WKDIR}/log/bwamem.log
du -sh ${WKDIR}/${APREFIX}.sam

Successfully completed.
du: cannot access ‘/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_qm8/qm8_HiC.sam’: No such file or directory
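
The output above shows that the LSF log can report success while the expected file is not where du looks for it. A combined check is one way to catch this before submitting the next job. This is a sketch under assumptions: `job_output_ok` is a hypothetical helper, and the success phrase is taken from the log line shown above.

```bash
#!/usr/bin/env bash
# Return 0 only if the log reports success AND the output file exists and is non-empty.
job_output_ok() {
    local log=$1 output=$2
    grep -q "Successfully completed" "$log" 2>/dev/null && [ -s "$output" ]
}

# Toy demonstration in a scratch directory.
demo=$(mktemp -d)
printf 'Successfully completed.\n' > "$demo/job.log"
if job_output_ok "$demo/job.log" "$demo/missing.sam"; then
    echo "ready for next step"
else
    echo "log or output file not OK"
fi
rm -rf "$demo"
```

In this notebook the call would look like `job_output_ok ${WKDIR}/log/bwamem.log ${WKDIR}/${APREFIX}.sam`.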



### Flag PCR Duplicates
samblaster is a fast and flexible program for marking duplicates in read-id grouped paired-end SAM files. It can also optionally output discordant read pairs and/or split read mappings to separate SAM files, and/or unmapped/clipped reads to a separate FASTQ file. When marking duplicates, samblaster will require approximately 20MB of memory per 1M read pairs.
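
The memory figure quoted above (about 20MB per 1M read pairs) turns into a rough estimate with shell arithmetic. This is a sketch only: both function names are hypothetical, and the 300 million read pairs in the demonstration is a made-up number, not a count from this dataset.

```bash
#!/usr/bin/env bash
# Count FASTQ records on stdin (assumes 4 lines per record).
count_fastq_records() {
    awk 'END { printf "%.0f\n", NR / 4 }'
}

# Estimate samblaster memory in MB from a read-pair count,
# using ~20 MB per 1M read pairs and rounding up.
estimate_samblaster_mem_mb() {
    local read_pairs=$1
    echo $(( (read_pairs * 20 + 999999) / 1000000 ))
}

# Hypothetical example: 300 million read pairs.
estimate_samblaster_mem_mb 300000000
```

The real pair count could be obtained with `zcat R1.fastq.gz | count_fastq_records`, where R1.fastq.gz stands for the forward-read file.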

In [23]:
module load samtools/0.1.19

In [24]:
#First mark duplicates with samblaster, then filter with samtools in the next cell.
bsub -J samblaster -o ${WKDIR}/log/samblaster.log -e ${WKDIR}/log/samblaster.err \
"/workspace/hraijc/git_clones/samblaster/samblaster -i ${WKDIR}/${APREFIX}.sam -o ${APREFIX}_marked_byread.sam"

Job <650240> is submitted to default queue <lowpriority>.


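Samtools flag 2316 (a read is excluded by -F 2316 if any of these bits is set):
+ read unmapped (0x4)
+ mate unmapped (0x8)
+ not primary alignment (0x100)
+ supplementary alignment (0x800)

The duplicate bit (0x400) is not part of 2316, so reads marked by samblaster stay in the output BAM with their duplicate flag set.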
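
The composition of flag 2316 can be checked with bitwise arithmetic in the shell. The sketch below uses the bit meanings from the SAM specification; `decode_sam_flag` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the SAM flag bits that are set in a combined flag value.
decode_sam_flag() {
    local flag=$1
    local bits=(1 2 4 8 16 32 64 128 256 512 1024 2048)
    local names=("paired" "proper pair" "read unmapped" "mate unmapped"
                 "read reverse strand" "mate reverse strand"
                 "first in pair" "second in pair" "not primary alignment"
                 "fails quality checks" "PCR or optical duplicate"
                 "supplementary alignment")
    local i
    for i in "${!bits[@]}"; do
        if (( flag & bits[i] )); then
            printf '0x%x\t%s\n' "${bits[i]}" "${names[i]}"
        fi
    done
}

decode_sam_flag 2316
```

Running `decode_sam_flag 1024` shows the duplicate bit on its own; adding it to the filter would give 3340 (2316 + 1024), which is mentioned here only for the arithmetic, not as a change to the commands in this notebook.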

In [25]:
#Then drop unmapped, secondary and supplementary alignments (flag 2316). Marked duplicates are kept.
bsub -J samview9 -o ${WKDIR}/log/samview9.log -e ${WKDIR}/log/samview9.err -n 17 \
"samtools view -S -b -@ 16 -F 2316 ${APREFIX}_marked_byread.sam > ${APREFIX}_dedup.bam"

Job <650302> is submitted to default queue <lowpriority>.


In [26]:
du -sh ${APREFIX}*am

32G	qm8_HiC_dedup.bam
161G	qm8_HiC_marked_byread.sam
160G	qm8_HiC.sam


In [28]:
rm -rf *.sam
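
Since the SAM files are large and slow to regenerate, one option is to delete them only after confirming that the BAM was written. This is a sketch under assumptions: `remove_sams_if_bam_ok` is a hypothetical helper, and a non-empty file is only a weak check of BAM integrity.

```bash
#!/usr/bin/env bash
# Delete the listed files only if the BAM exists and is non-empty.
remove_sams_if_bam_ok() {
    local bam=$1
    shift
    if [ -s "$bam" ]; then
        rm -f -- "$@"
    else
        echo "Refusing to delete: ${bam} is missing or empty" >&2
        return 1
    fi
}

# Toy demonstration in a scratch directory.
demo=$(mktemp -d)
printf 'x' > "$demo/toy.sam"
remove_sams_if_bam_ok "$demo/toy.bam" "$demo/toy.sam" || echo "SAM kept"
rm -rf "$demo"
```

Here the call would be `remove_sams_if_bam_ok ${APREFIX}_dedup.bam ${APREFIX}.sam ${APREFIX}_marked_byread.sam`. Newer samtools releases also provide `samtools quickcheck` for a stronger test, but it is not available in the 0.1.19 module loaded above.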

In [27]:
module unload samtools

## QC HiC data with hic_qc

In [29]:
module load conda

In [30]:
conda activate hraijc_hic_qc2


In [31]:
bsub -J hic_qc1M -o ${WKDIR}/log/hic_qc1M.log -e ${WKDIR}/log/hic_qc1M.err -n 1 \
"python /workspace/hraijc/git_clones/hic_qc/hic_qc.py -n 10000000 -b ${APREFIX}_dedup.bam -o hicqc_10M"

Job <650314> is submitted to default queue <lowpriority>.

In [32]:
conda deactivate
module unload conda

## Make HiC contact map before the SALSA run, for later comparison

In [4]:
module load pfr-python2

In [34]:
cp ${ASSEMBLY} .

In [35]:
bsub -J agpgen1 -o ${WKDIR}/log/agpgen1.log -e ${WKDIR}/log/agpgen1.err -n 2 \
"python2 /powerplant/workspace/hraijc/git_clones/juicebox_scripts/juicebox_scripts/makeAgpFromFasta.py ${ASSEMBLY} ${ASSEMBLY%.fasta}.agp"

Job <650315> is submitted to default queue <lowpriority>.


In [36]:
bsub -J juiceboxt1 -o ${WKDIR}/log/juiceboxt1.log -e ${WKDIR}/log/juiceboxt1.err -n 2 \
"python2 /powerplant/workspace/hraijc/git_clones/juicebox_scripts/juicebox_scripts/agp2assembly.py ${ASSEMBLY%.fasta}.agp ${ASSEMBLY%.fasta}.assembly"

Job <650316> is submitted to default queue <lowpriority>.


In [37]:
cp ${ASSEMBLY%.fasta}* .

cp: omitting directory ‘/workspace/hraijc/BB_Nui_Assembly/Hybrid_assembly/quickmerge8/Nui_quickmerge8_busco’



In [38]:
bsub -J matlock1 -o ${WKDIR}/log/matlock1.log -e ${WKDIR}/log/matlock1.err -n 2 \
"/powerplant/workspace/hraijc/git_clones/matlock/bin/matlock bam2 juicer ${APREFIX}_dedup.bam ${APREFIX}_dedup.links.txt"


Job <650318> is submitted to default queue <lowpriority>.


In [39]:
bsub -J cmap1 -o ${WKDIR}/log/cmap1.log -e ${WKDIR}/log/cmap1.err -n 2 \
"sort -k2,2 -k6,6 ${APREFIX}_dedup.links.txt > ${APREFIX}_dedup.sorted.links.txt"
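
Sorting a links file of this size spills to temporary files, and the TEMPDIR variable set earlier can be handed to sort with -T. The sketch below keeps the same sort keys as the command above and adds a check with sort -c. Both function names are hypothetical, and the toy rows only mimic the column positions used as sort keys, not real matlock output.

```bash
#!/usr/bin/env bash
# Sort a links file on columns 2 and 6, writing temporary files to a chosen directory.
sort_links() {
    local in=$1 out=$2 tmpdir=${3:-${TMPDIR:-/tmp}}
    sort -T "$tmpdir" -k2,2 -k6,6 "$in" > "$out"
}

# Return 0 if the file is already sorted on the same keys.
links_sorted() {
    sort -c -k2,2 -k6,6 "$1" 2>/dev/null
}

# Toy demonstration in a scratch directory.
demo=$(mktemp -d)
printf '0 ctgB 10 0 0 ctgA 5 1\n0 ctgA 7 0 0 ctgB 3 1\n0 ctgA 2 0 0 ctgA 9 1\n' > "$demo/toy.links.txt"
sort_links "$demo/toy.links.txt" "$demo/toy.sorted.links.txt" "$demo"
links_sorted "$demo/toy.sorted.links.txt" && echo "sorted"
rm -rf "$demo"
```

For the real file the call would be `sort_links ${APREFIX}_dedup.links.txt ${APREFIX}_dedup.sorted.links.txt ${TEMPDIR}`. The sort and the check need to run under the same locale for the check to be meaningful.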

Job <650349> is submitted to default queue <lowpriority>.


In [8]:
bsub -J 3ddna1 -o ${WKDIR}/log/3ddna1.log -e ${WKDIR}/log/3ddna1.err -n 2 \
"/powerplant/workspace/hraijc/git_clones/matlock/3d-dna/visualize/run-assembly-visualizer.sh -p false ${ASSEMBLY%.fasta}.assembly ${APREFIX}_dedup.sorted.links.txt"

Job <650654> is submitted to default queue <lowpriority>.


/powerplant/workspace/hraijc/BB_Nui_Assembly





# Running again on the Flye04 Assembly for HiC scaffolding.

In [5]:
#Set Variables.


In [6]:
ASSEMBLY=/workspace/hraijc/BB_Nui_Assembly/ONT_Assemly/FLYE04/Flye04_assembly.fasta
APREFIX=flye4_HiC
HiC_RAW=/input/genomic/plant/Vaccinium/corymbosum/AGRF_CAGRF21434_HJWHFDRXX
WKDIR=/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_flye4
TEMPDIR=${WKDIR}/temp
cd $WKDIR

In [19]:
#See /workspace/hraijc/Raspberry/TestingOmniC/Dependencies.ipynb for building and installing dependencies.
module load bedtools/2.30.0
module load bwa/0.7.17
module load samtools/0.1.19


In [20]:
#Make directories (-p: no error if they already exist)
mkdir -p ${TEMPDIR} ${WKDIR}/log

## Pre-Alignment

In [4]:
#Index the genome assembly (writes ${ASSEMBLY}.fai).
bsub -I "samtools faidx $ASSEMBLY"

Job <709652> is submitted to default queue <lowpriority>.
<<Waiting for dispatch ...>>
<<Starting on aklppg31>>


In [5]:
#Create genome file from genome index.
bsub -I "cut -f1,2 ${ASSEMBLY}.fai > ${ASSEMBLY%.fasta}.genome"

Job <709653> is submitted to default queue <lowpriority>.
<<Waiting for dispatch ...>>
<<Starting on aklppb43>>


In [6]:
head -n 2 ${ASSEMBLY%.fasta}.genome

contig_1	231433
contig_10000	28597


In [7]:
#Create bwa index.
bsub -J indbwa2 -o ${WKDIR}/log/indbwa2.log -e ${WKDIR}/log/indbwa2.err -n 1 \
"bwa index ${ASSEMBLY}"

Job <709654> is submitted to default queue <lowpriority>.


## Alignment

In [13]:
cd ${WKDIR}
cp ${ASSEMBLY}* .

In [21]:
#Align reads with BWA. Use -5SP for Hi-C reads.
bsub -J bwamem -o ${WKDIR}/log/bwamem.log -e ${WKDIR}/log/bwamem.err -n 25 \
"bwa mem -5SP -T0 -t24 ${ASSEMBLY} ${HiC_RAW}/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R1.fastq.gz ${HiC_RAW}/BlueberryNui_HiC_HJWHFDRXX_GTACGA_L002_R2.fastq.gz -o ${APREFIX}.sam" 


Job <709734> is submitted to default queue <lowpriority>.


In [33]:
grep "Success" ${WKDIR}/log/bwamem.log
du -sh ${WKDIR}/${APREFIX}.sam

Successfully completed.
Successfully completed.
158G	/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_flye4/flye4_HiC.sam


### Flag PCR Duplicates
samblaster is a fast and flexible program for marking duplicates in read-id grouped paired-end SAM files. It can also optionally output discordant read pairs and/or split read mappings to separate SAM files, and/or unmapped/clipped reads to a separate FASTQ file. When marking duplicates, samblaster will require approximately 20MB of memory per 1M read pairs.

In [34]:
module load samtools/0.1.19

In [35]:
#First mark duplicates with samblaster, then filter with samtools in the next cell.
bsub -J samblaster -o ${WKDIR}/log/samblaster.log -e ${WKDIR}/log/samblaster.err \
"/workspace/hraijc/git_clones/samblaster/samblaster -i ${WKDIR}/${APREFIX}.sam -o ${APREFIX}_marked_byread.sam"

Job <709736> is submitted to default queue <lowpriority>.


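Samtools flag 2316 (a read is excluded by -F 2316 if any of these bits is set):
+ read unmapped (0x4)
+ mate unmapped (0x8)
+ not primary alignment (0x100)
+ supplementary alignment (0x800)

The duplicate bit (0x400) is not part of 2316, so reads marked by samblaster stay in the output BAM with their duplicate flag set.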

In [40]:
#Then drop unmapped, secondary and supplementary alignments (flag 2316). Marked duplicates are kept.
bsub -J samview9 -o ${WKDIR}/log/samview9.log -e ${WKDIR}/log/samview9.err -n 17 \
"samtools view -S -b -@ 16 -F 2316 ${APREFIX}_marked_byread.sam > ${APREFIX}_dedup.bam"

Job <709738> is submitted to default queue <lowpriority>.


In [47]:
du -sh ${APREFIX}*am

32G	flye4_HiC_dedup.bam
158G	flye4_HiC_marked_byread.sam
158G	flye4_HiC.sam


In [48]:
rm -rf *.sam

In [49]:
module unload samtools

In [50]:
pwd

/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_flye4


In [None]:
#Filtered BAM used in the steps below: /workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_flye4/flye4_HiC_dedup.bam

## Make HiC assembly and contact files to visualise with Juicebox
### Assembly file generation

In [10]:
ASSEMBLY=/workspace/hraijc/BB_Nui_Assembly/Hi-C_mapping_flye4/Flye04_assembly.fasta

In [11]:
ml load pfr-python2/2.7.13

In [12]:
cd $WKDIR

In [13]:
bsub -J juiceboxt1a -o ${WKDIR}/log/juiceboxt1a.log -e ${WKDIR}/log/juiceboxt1a.err -n 2 \
"python2 /powerplant/workspace/hraijc/git_clones/juicebox_scripts/juicebox_scripts/makeAgpFromFasta.py ${ASSEMBLY} ${ASSEMBLY%.fasta}.agp"


Job <713171> is submitted to default queue <lowpriority>.


In [14]:
bsub -J juiceboxt1b -o ${WKDIR}/log/juiceboxt1b.log -e ${WKDIR}/log/juiceboxt1b.err -n 2 \
"python2 /powerplant/workspace/hraijc/git_clones/juicebox_scripts/juicebox_scripts/agp2assembly.py ${ASSEMBLY%.fasta}.agp ${ASSEMBLY%.fasta}.assembly"

Job <713174> is submitted to default queue <lowpriority>.


#### Make contact map

In [15]:
#Make links file
bsub -J matlock5 -o ${WKDIR}/log/matlock5.log -e ${WKDIR}/log/matlock5.err -n 2 \
"/powerplant/workspace/hraijc/git_clones/matlock/bin/matlock bam2 juicer  ${APREFIX}_dedup.bam  ${APREFIX}.links.txt"

Job <713175> is submitted to default queue <lowpriority>.


In [16]:
#Sort links file
bsub -J linksort5 -o ${WKDIR}/log/linksort5.log -e ${WKDIR}/log/linksort5.err -n 2 \
"sort -k2,2 -k6,6 ${APREFIX}.links.txt > ${APREFIX}.sorted.links.txt"

Job <713180> is submitted to default queue <lowpriority>.


In [24]:
#Make the .hic contact map for Juicebox
bsub -J 3ddna5 -o ${WKDIR}/log/3ddna5.log -e ${WKDIR}/log/3ddna5.err -n 2 \
"/powerplant/workspace/hraijc/git_clones/matlock/3d-dna/visualize/run-assembly-visualizer.sh -p false ${ASSEMBLY%.fasta}.assembly ${APREFIX}.sorted.links.txt"

Job <713185> is submitted to default queue <lowpriority>.


### hic_qc of HiC reads mapped to the Flye04 assembly

In [19]:
module load conda

In [20]:
conda activate hraijc_hic_qc2


In [21]:
mkdir hicqc_Nui_flye4


In [22]:
bsub -J hic_qc2M5 -o ${WKDIR}/log/hic_qc2M5.log -e ${WKDIR}/log/hic_qc2M5.err -n 1 \
"python /workspace/hraijc/git_clones/hic_qc/hic_qc.py -n 10000000 -b flye4_HiC_dedup.bam -o hicqc_Nui_flye4"

Job <713183> is submitted to default queue <lowpriority>.

In [23]:
conda deactivate
module unload conda