# Hi-C data scaffolding using ALLHiC

## Dataset
* **HiC data:** The library was prepared by Elena Hilario using the restriction enzyme from the Phase Genomics kit version 1.0. Shawn Sullivan clarified that the enzyme used for cutting is Sau3AI, which cuts at GATC (the same as for Gillenia).
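
Since Sau3AI cuts at GATC, the number of GATC sites per contig gives a rough idea of how much Hi-C signal a contig can carry (ALLHiC_partition later filters on a minimum number of restriction sites). The sketch below counts GATC sites per sequence in a FASTA file. The toy sequences and the variable names `toy_fasta` and `site_counts` are made up for illustration; a real assembly path would replace the temporary file.

```shell
# Toy FASTA (hypothetical sequences, for illustration only).
toy_fasta=$(mktemp)
cat > "$toy_fasta" <<'EOF'
>contig1
AAGATCTTGATC
GGGATCAA
>contig2
TTTTGATCAA
EOF

# Concatenate the lines of each record, then count motif occurrences.
# gsub() returns the number of substitutions made, i.e. the number of sites.
site_counts=$(awk -v motif="GATC" '
  function report() { if (name != "") print name "\t" gsub(motif, motif, seq) }
  /^>/ { report(); name = substr($1, 2); seq = ""; next }
  { seq = seq toupper($0) }
  END { report() }
' "$toy_fasta")

printf '%s\n' "$site_counts"
rm -f "$toy_fasta"
```

Joining the lines of a record before counting matters because a site can span a line break in a wrapped FASTA file.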

## Other information
* **Chr number:** 12
* **Genome size:** ~600Mb
* **Total data:** 163 million reads (intended 120 million). AGRF mixed up Gillenia and Pepino for sequencing, which resulted in more reads for Gillenia but fewer reads for Pepino (actual: 100 M; intended: 180 M).
* **Read type:** 150 PE
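
As a rough sanity check, the figures above can be turned into an approximate Hi-C coverage. This is only a sketch: it assumes that "163 million reads" counts individual 150 bp reads (not read pairs) and that the genome is about 600 Mb; if the figure refers to pairs, the result doubles.

```shell
reads=163000000      # total reads, taken from the list above
read_len=150         # bp per read (150 PE)
genome=600000000     # approximate genome size in bp (~600 Mb)

# coverage = reads * read length / genome size
coverage=$(awk -v n="$reads" -v l="$read_len" -v g="$genome" \
  'BEGIN { printf "%.2f", n * l / g }')

echo "Approximate Hi-C coverage: ${coverage}x"
```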


## 0. Input

### HiC data

In [14]:
HICR1_RAW=/workspace/hraczw/github/GA/Bilberry_genome/002.Fastp.trimming/R1.cleaned.specifiedAdapter.short.Q15.fq.gz
HICR2_RAW=/workspace/hraczw/github/GA/Bilberry_genome/002.Fastp.trimming/R2.cleaned.specifiedAdapter.short.Q15.fq.gz

### Assembly 

In [2]:
ASSEMBLY_ML1000=/workspace/hraczw/github/GA/Bilberry_genome/100.all.assemblies/Shasta_pilon-1_corrected_pilon-i2_corrected.ml1000.fasta.fasta

In [1]:
ASSEMBLY_SHASTA_RI1=/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Shasta_racon_i1_pilon-2.noBacteria.ml1000.fasta
ASSEMBLY_SHASTA_RI3=/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Shasta_racon_i3_pilon-2.noBacteria.ml1000.fasta
ASSEMBLY_FLYE_RI3=/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Flye_racon_i3_pilon-2.noBacteria.ml1000.fasta

## 1. Mapping - following the steps on the ALLHiC wiki

In [2]:
module load bwa
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) bwa/0.7.17
  3) pandoc/1.19.2      6) perl/5.28.0


In [5]:
mkdir 012.HiC.ALLHIC

In [3]:
WORKDIR=/workspace/hraczw/github/GA/Bilberry_genome/012.HiC.ALLHIC

In [4]:
bsub -J bwa \
-o $WORKDIR/bwa_index_shasta_i1.out \
-e $WORKDIR/bwa_index_shasta_i1.err \
"bwa index $ASSEMBLY_SHASTA_RI1"

Job <243177> is submitted to default queue <lowpriority>.


In [5]:
bsub -J bwa \
-o $WORKDIR/bwa_index_shasta_i3.out \
-e $WORKDIR/bwa_index_shasta_i3.err \
"bwa index $ASSEMBLY_SHASTA_RI3"

Job <243178> is submitted to default queue <lowpriority>.


In [6]:
bsub -J bwa \
-o $WORKDIR/bwa_index_flye_i3.out \
-e $WORKDIR/bwa_index_flye_i3.err \
"bwa index $ASSEMBLY_FLYE_RI3"

Job <243179> is submitted to default queue <lowpriority>.
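
Before mapping, it can be worth confirming that each indexing job finished. `bwa index` writes five files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), so a small check like the one below may help. The function name `bwa_index_complete` and the demo directory are hypothetical, and the demo uses placeholder files rather than a real index.

```shell
# Return 0 if all five BWA index files exist and are non-empty.
bwa_index_complete() {
  for ext in amb ann bwt pac sa; do
    [ -s "$1.$ext" ] || { echo "missing: $1.$ext" >&2; return 1; }
  done
  return 0
}

# Demo with placeholder files (not a real index).
idx_demo=$(mktemp -d)
for ext in amb ann bwt pac sa; do
  echo placeholder > "$idx_demo/toy.fasta.$ext"
done

if bwa_index_complete "$idx_demo/toy.fasta"; then index_ok_before=yes; else index_ok_before=no; fi
rm "$idx_demo/toy.fasta.sa"
if bwa_index_complete "$idx_demo/toy.fasta" 2>/dev/null; then index_ok_after=yes; else index_ok_after=no; fi

echo "complete index: $index_ok_before; after deleting .sa: $index_ok_after"
rm -rf "$idx_demo"
```

In this notebook the call would be, for example, `bwa_index_complete "$ASSEMBLY_SHASTA_RI1"`.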


In [7]:
bwa mem


Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

       -t INT        number of threads [1]
       -k INT        minimum seed length [19]
       -w INT        band width for banded alignment [100]
       -d INT        off-diagonal X-dropoff [100]
       -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
       -y INT        seed occurrence for the 3rd round seeding [20]
       -c INT        skip seeds with more than INT occurrences [500]
       -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
       -W INT        discard a chain if seeded bases shorter than INT [0]
       -m INT        perform at most INT rounds of mate rescues for each read [50]
       -S            skip mate rescue
       -P            skip pairing; mate rescue performed unless -S also in use

Scoring options:

       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
     


In [8]:
module load samtools
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) bwa/0.7.17
  3) pandoc/1.19.2      6) perl/5.28.0        9) samtools/1.9


In [15]:
bsub -J map \
-n 20 \
-o $WORKDIR/map_shasta_i1.out \
-e $WORKDIR/map_shasta_i1.err \
"bwa mem \
-t 20 \
$ASSEMBLY_SHASTA_RI1 \
$HICR1_RAW \
$HICR2_RAW | \
samtools sort \
-@ 20 \
-o $WORKDIR/mapped_shasta_i1.bam -"

Job <243193> is submitted to default queue <lowpriority>.


In [16]:
bsub -J map \
-n 20 \
-o $WORKDIR/map_shasta_i3.out \
-e $WORKDIR/map_shasta_i3.err \
"bwa mem \
-t 20 \
$ASSEMBLY_SHASTA_RI3 \
$HICR1_RAW \
$HICR2_RAW | \
samtools sort \
-@ 20 \
-o $WORKDIR/mapped_shasta_i3.bam -"

Job <243194> is submitted to default queue <lowpriority>.


In [17]:
bsub -J map \
-n 20 \
-o $WORKDIR/map_flye_i3.out \
-e $WORKDIR/map_flye_i3.err \
"bwa mem \
-t 20 \
$ASSEMBLY_FLYE_RI3 \
$HICR1_RAW \
$HICR2_RAW | \
samtools sort \
-@ 20 \
-o $WORKDIR/mapped_flye_i3.bam -"

Job <243195> is submitted to default queue <lowpriority>.
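
The three mapping submissions differ only in the assembly variable and the label used in the file names, so they could be generated from one template. The following is a dry-run sketch: it only prints the `bsub` command lines (with the shell variables left unexpanded) and does not submit anything. The helper name `map_cmd` is hypothetical.

```shell
# Print one bsub command line for a given label and assembly variable name.
# $1 = label used in file names, $2 = name of the variable holding the assembly path
map_cmd() {
  printf 'bsub -J map -n 20 -o $WORKDIR/map_%s.out -e $WORKDIR/map_%s.err "bwa mem -t 20 $%s $HICR1_RAW $HICR2_RAW | samtools sort -@ 20 -o $WORKDIR/mapped_%s.bam -"\n' \
    "$1" "$1" "$2" "$1"
}

# Collect the three command lines used in this notebook.
map_cmds=$(
  map_cmd shasta_i1 ASSEMBLY_SHASTA_RI1
  map_cmd shasta_i3 ASSEMBLY_SHASTA_RI3
  map_cmd flye_i3 ASSEMBLY_FLYE_RI3
)

printf '%s\n' "$map_cmds"
```

Printing first and submitting only after review keeps the label and the assembly path in sync across the output, error, and BAM file names.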


In [18]:
export PATH=/workspace/hraczw/github/programs/ALLHiC/bin/:$PATH

In [19]:
export PATH=/workspace/hraczw/github/programs/ALLHiC/scripts/:$PATH

In [20]:
ls /workspace/hraczw/github/programs/ALLHiC/scripts

ALLHiC2ALLMAPS.pl  filterBAM_forHiC.pl	       partition.pl
bam2CLM.pl	   getFalen.pl		       PreprocessSAMs.pl
bam2CLM_simple.pl  gmap2AlleleTable.pl	       prune.pl
bam2net.pl	   link_superscaffold.pl       release3DDNA.pl
bam_HiCplotter.py  make_bed_around_RE_site.pl  remove_reads.pl
blastn_parse.pl    mc_bam.pl		       simuCTG.pl
classify.pl	   partition_gmap.pl	       statAGP.pl


In [21]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) bwa/0.7.17
  3) pandoc/1.19.2      6) perl/5.28.0        9) samtools/1.9


In [22]:
module load bedtools
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    5) perlbrew/0.76      9) samtools/1.9
  2) texlive/20151117   6) perl/5.28.0       10) bedtools/2.27.1
  3) pandoc/1.19.2      7) asub/2.1
  4) git/2.21.0         8) bwa/0.7.17


In [24]:
PreprocessSAMs.pl


PreprocessSAMs.pl: A script to prepare SAM or BAM files for use with Lachesis.

Syntax: /workspace/hraczw/github/programs/ALLHiC/scripts/PreprocessSAMs.pl <sam-or-bam-filename> <draft-assembly-fasta> enzyme(HINDIII/MBOI)



In [25]:
bsub -J preprocess \
-o $WORKDIR/preprocess_shasta_i1.out \
-e $WORKDIR/preprocess_shasta_i1.err \
"PreprocessSAMs.pl $WORKDIR/mapped_shasta_i1.bam $ASSEMBLY_SHASTA_RI1 MBOI"

Job <243754> is submitted to default queue <lowpriority>.


In [26]:
bsub -J preprocess \
-o $WORKDIR/preprocess_shasta_i3.out \
-e $WORKDIR/preprocess_shasta_i3.err \
"PreprocessSAMs.pl $WORKDIR/mapped_shasta_i3.bam $ASSEMBLY_SHASTA_RI3 MBOI"

Job <243755> is submitted to default queue <lowpriority>.


In [27]:
bsub -J preprocess \
-o $WORKDIR/preprocess_flye_i3.out \
-e $WORKDIR/preprocess_flye_i3.err \
"PreprocessSAMs.pl $WORKDIR/mapped_flye_i3.bam $ASSEMBLY_FLYE_RI3 MBOI"

Job <243756> is submitted to default queue <lowpriority>.
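
`PreprocessSAMs.pl` only accepts `HINDIII` or `MBOI`, while the library was cut with Sau3AI. `MBOI` is used here because both enzymes are listed in this notebook with the same GATC site. A small lookup such as the one below makes that mapping explicit; the function name `re_site` is hypothetical and only the enzymes mentioned in this notebook are covered.

```shell
# Map an enzyme name (case-insensitive) to its recognition site.
re_site() {
  case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
    HINDIII)     echo AAGCTT ;;
    MBOI|SAU3AI) echo GATC ;;
    *)           echo "unknown enzyme: $1" >&2; return 1 ;;
  esac
}

echo "Sau3AI -> $(re_site Sau3AI); MboI -> $(re_site MboI)"
```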


## 2. ALLHiC scaffolding

In [23]:
ALLHiC_partition -h

Unknown option: h
************************************************************************
    Usage: ALLHiC_partition -r draft.asm.fasta -e enzyme_sites -k Num of groups
      -h : help and usage.
      -b : prunned bam (optional, default prunning.bam)
      -r : draft.sam.fasta
      -e : enzyme_sites (HindIII: AAGCTT; MboI: GATC)
      -k : number of groups (user defined K value)
      -m : minimum number of restriction sites (default, 25)
************************************************************************



In [24]:
cd $WORKDIR
ls

bwa_index.err  mapped.bam			    mapped.sam
bwa_index.out  mapped.REduced.bam		    preprocess.err
map.err        mapped.REduced.paired_only.bam	    preprocess.out
map.out        mapped.REduced.paired_only.flagstat


In [25]:
bsub -J partition \
-o partition.out \
-e partition.err \
"ALLHiC_partition -b mapped.REduced.paired_only.bam -r $ASSEMBLY_ML1000 -e GATC -k 12"

Job <391717> is submitted to default queue <normal>.


In [26]:
bsub -J extract \
-o extract.out \
-e extract.err \
"allhic extract mapped.REduced.paired_only.bam $ASSEMBLY_ML1000 --RE GATC"

Job <391751> is submitted to default queue <normal>.


In [27]:
for i in {1..12}; \
do bsub -J allhicOpt -o allhicOpt_g${i}.out -e allhicOpt_g${i}.err \
"allhic optimize mapped.REduced.paired_only.counts_GATC.12g$i.txt \
mapped.REduced.paired_only.clm"; done

Job <391807> is submitted to default queue <normal>.
Job <391808> is submitted to default queue <normal>.
Job <391809> is submitted to default queue <normal>.
Job <391810> is submitted to default queue <normal>.
Job <391811> is submitted to default queue <normal>.
Job <391812> is submitted to default queue <normal>.
Job <391813> is submitted to default queue <normal>.
Job <391814> is submitted to default queue <normal>.
Job <391815> is submitted to default queue <normal>.
Job <391816> is submitted to default queue <normal>.
Job <391817> is submitted to default queue <normal>.
Job <391818> is submitted to default queue <normal>.
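
The twelve optimize jobs run independently, so it may be useful to confirm that every group produced a `.tour` file before running `ALLHiC_build`. The sketch below lists any missing files, using the file naming seen in this notebook. The function name `missing_tours` is hypothetical, and the demo runs against a temporary directory in which group 7 is deliberately left out.

```shell
# List expected .tour files that do not exist.
# $1 = directory to check, $2 = number of groups (k)
missing_tours() {
  i=1
  while [ "$i" -le "$2" ]; do
    f="$1/mapped.REduced.paired_only.counts_GATC.${2}g${i}.tour"
    [ -e "$f" ] || echo "$f"
    i=$((i + 1))
  done
}

# Demo: create empty .tour files for every group except group 7.
tour_demo=$(mktemp -d)
n=1
while [ "$n" -le 12 ]; do
  [ "$n" -eq 7 ] || : > "$tour_demo/mapped.REduced.paired_only.counts_GATC.12g${n}.tour"
  n=$((n + 1))
done

missing=$(missing_tours "$tour_demo" 12)
echo "missing: $missing"
```

In the working directory of this notebook the call would be `missing_tours . 12`; empty output means all groups finished.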


In [28]:
ALLHiC_build $ASSEMBLY_ML1000

1. tour format to agp ...
Processing mapped.REduced.paired_only.counts_GATC.12g1.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g10.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g11.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g12.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g2.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g3.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g4.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g5.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g6.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g7.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g8.tour ...
Processing mapped.REduced.paired_only.counts_GATC.12g9.tour ...


In [47]:
perl /workspace/hraczw/github/programs/ALLHiC/scripts/getFalen.pl \
-i groups.asm.fasta \
-o len.txt

In [52]:
grep 'mapped.REduced.paired_only.counts_GATC' len.txt > chrn.list

In [53]:
cat chrn.list

mapped.REduced.paired_only.counts_GATC.12g1	106190990
mapped.REduced.paired_only.counts_GATC.12g10	9469300
mapped.REduced.paired_only.counts_GATC.12g11	7367141
mapped.REduced.paired_only.counts_GATC.12g12	917959
mapped.REduced.paired_only.counts_GATC.12g2	82140996
mapped.REduced.paired_only.counts_GATC.12g3	61383071
mapped.REduced.paired_only.counts_GATC.12g4	60208030
mapped.REduced.paired_only.counts_GATC.12g5	58071397
mapped.REduced.paired_only.counts_GATC.12g6	39353152
mapped.REduced.paired_only.counts_GATC.12g7	37222306
mapped.REduced.paired_only.counts_GATC.12g8	30748869
mapped.REduced.paired_only.counts_GATC.12g9	27719198
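
The group lengths above can be summarised to see how much sequence was anchored and how evenly it is spread. The sketch below copies the values from `chrn.list` into a temporary file, sums them, and compares the total with the approximate genome size of 600 Mb stated at the top of this notebook (an estimate, so the percentage is only indicative). The variable names are hypothetical.

```shell
# Group lengths copied from chrn.list above.
chrn_demo=$(mktemp)
cat > "$chrn_demo" <<'EOF'
mapped.REduced.paired_only.counts_GATC.12g1 106190990
mapped.REduced.paired_only.counts_GATC.12g10 9469300
mapped.REduced.paired_only.counts_GATC.12g11 7367141
mapped.REduced.paired_only.counts_GATC.12g12 917959
mapped.REduced.paired_only.counts_GATC.12g2 82140996
mapped.REduced.paired_only.counts_GATC.12g3 61383071
mapped.REduced.paired_only.counts_GATC.12g4 60208030
mapped.REduced.paired_only.counts_GATC.12g5 58071397
mapped.REduced.paired_only.counts_GATC.12g6 39353152
mapped.REduced.paired_only.counts_GATC.12g7 37222306
mapped.REduced.paired_only.counts_GATC.12g8 30748869
mapped.REduced.paired_only.counts_GATC.12g9 27719198
EOF

# Total anchored length and its share of the ~600 Mb genome estimate.
total_bp=$(awk '{ s += $2 } END { printf "%d", s }' "$chrn_demo")
pct=$(awk -v t="$total_bp" -v g=600000000 'BEGIN { printf "%.1f", 100 * t / g }')

# Largest group by length (column 2, numeric, descending).
largest=$(sort -k2,2nr "$chrn_demo" | head -n 1 | awk '{ print $1 }')

echo "total: $total_bp bp (${pct}% of ~600 Mb); largest group: $largest"
rm -f "$chrn_demo"
```

The groups range from about 0.9 Mb (12g12) to about 106 Mb (12g1), so the twelve groups are far from equal in size.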


In [55]:
module load pfr-python3/3.6.6

In [56]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core     5) perlbrew/0.76       9) samtools/1.9
  2) texlive/20151117    6) perl/5.28.0        10) bedtools/2.27.1
  3) pandoc/1.19.2       7) asub/2.1           11) pfr-python3/3.6.6
  4) git/2.21.0          8) bwa/0.7.17


In [57]:
bsub -J plot \
-o plot.out \
-e plot.err \
"ALLHiC_plot mapped.REduced.paired_only.bam groups.agp chrn.list 500k pdf"

Job <391892> is submitted to default queue <normal>.


In [48]:
ls

allhicOpt_g10.err   mapped.REduced.paired_only.bam
allhicOpt_g10.out   mapped.REduced.paired_only.clm
allhicOpt_g11.err   mapped.REduced.paired_only.clusters.txt
allhicOpt_g11.out   mapped.REduced.paired_only.counts_GATC.12g10.tour
allhicOpt_g12.err   mapped.REduced.paired_only.counts_GATC.12g10.txt
allhicOpt_g12.out   mapped.REduced.paired_only.counts_GATC.12g11.tour
allhicOpt_g1.err    mapped.REduced.paired_only.counts_GATC.12g11.txt
allhicOpt_g1.out    mapped.REduced.paired_only.counts_GATC.12g12.tour
allhicOpt_g2.err    mapped.REduced.paired_only.counts_GATC.12g12.txt
allhicOpt_g2.out    mapped.REduced.paired_only.counts_GATC.12g1.tour
allhicOpt_g3.err    mapped.REduced.paired_only.counts_GATC.12g1.txt
allhicOpt_g3.out    mapped.REduced.paired_only.counts_GATC.12g2.tour
allhicOpt_g4.err    mapped.REduced.paired_only.counts_GATC.12g2.txt
allhicOpt_g4.out    mapped.REduced.paired_only.counts_GATC.12g3.tour
allhicOpt_g5.err    mapped.REduced.paired_only.counts_GATC.12g3.txt
allhicOpt_