Skip to content

theaidenlab/AGWG-merge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

3D de novo assembly (3D-DNA) pipeline with AGWG-specific modifications

This is a version of the 3D-DNA pipeline (Dudchenko et al., Science, 2017) that was used to help generate AaegL5 genome assembly for the mosquito Aedes aegypti.

In addition to the scripts contained here, the Hi-C portion of the workflow for AaegL5 used tools and methods from Juicer (Durand & Shamim et al., Cell Systems, 2016), Juicebox (Durand & Robinson et al., Cell Systems, 2016) and Juicebox Assembly Tools (Dudchenko et al., bioRxiv, 2018). The in situ Hi-C experiments were performed as described in (Rao & Huntley et al., Cell, 2014). Interactive figures incorporated with this description were created using Juicebox.js software (Robinson et al., Cell Systems, 2018).

For the latest release of 3D-DNA see 3d-dna.

Feel free to post your questions and comments at: http://www.aidenlab.org/forum.html

Workflow, Commands and Intermediate Results

The detailed description of the workflow is hosted here. In brief, scaffolding and alternative haplotype removal for AaegL5 genome assembly can be broken down into 7 steps:

  • 1: Automatic preliminary scaffolding;
  • 2: Curated misjoin correction based on preliminary scaffolding results;
  • 3: Automatic scaffolding of misjoin-corrected contigs;
  • 4: Curated polishing and chromosome splitting;
  • 5: Automatic pairwise alignment of contigs that were placed in nearby positions along the pre-merge assembly
  • 6: Curated arrangement of contigs into 'merge groups' based on alignment data
  • 7: Automatic merging of overlapping contigs in individual merge groups

Listed below are commands and intermediate processing files associated with these steps, as well as input data (included as step 0: Draft contigs) and final assembly (included as step 8: Final NCBI submission).

Step 0: Draft contigs

For data associated with the draft FALCON-Unzip contigs see the following files:

Commands used to generate the files are as follows:

# prep: generate assembly file from falcon full output fasta
awk -f ${path_to_3ddna}/utils/generate-assembly-file-from-fasta.awk AGWG.draft.fasta > AGWG.draft.assembly
ln -sf AGWG.draft.mnd.txt AGWG.mnd.txt
# generate a high-resolution Hi-C map to visualize the draft with default mapping quality threshold (mapq>=1)
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.draft.assembly AGWG.mnd.txt
# generate supplementary Hi-C maps to visualize draft with mapq>=0 and mapq>=30
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 0 -m temp.AGWG.draft.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.draft.assembly AGWG.mnd.txt
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 30 -m temp.AGWG.draft.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.draft.assembly AGWG.mnd.txt
Step 1: Preliminary scaffolding

The preliminary scaffolding step results include the following files:

Commands used to generate these files are listed below:

# prep
awk -f ${path_to_3ddna}/utils/convert-assembly-to-cprop-and-asm.awk AGWG.draft.assembly
ln -sf AGWG.draft.cprops AGWG.0.cprops
ln -sf AGWG.mnd.txt AGWG.mnd.0.txt
# order and orient scaffolds larger than 20000
bash ${path_to_3ddna}/scaffold/run-liger-assembler.sh -s 20000 AGWG.0.cprops AGWG.mnd.0.txt
# build high-resolution hic map with mapq >= 1 (default)
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.0.cprops AGWG.0.asm AGWG.mnd.0.txt
# build supplementary high-resolution hic maps with reads mapping quality >=0 and >=30
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 0 -m temp.AGWG.0.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.0.cprops AGWG.0.asm AGWG.mnd.0.txt
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 30 -m temp.AGWG.0.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.0.cprops AGWG.0.asm AGWG.mnd.0.txt
Step 2: Curated misjoin correction

Misassembly correction has been performed by surveying the preliminary scaffolding results in Juicebox Assembly Tools. Regions annotated as misassembled are shared as a 2D annotation track. For convenience, the data is also presented in a form of an input 2D annotation track and a bed track as generated by 3D-DNA (see below).

The annotated regions can be surveyed using this interactive figure, which shows a Hi-C contact map from Step 1: Preliminary scaffolding (AGWG.0_0.hic). The heatmap indicates the frequency of contact between pairs of loci in the assembly as measured by the Hi-C experiments, on a scale from white to red (pure white indicates no observed contacts, whereas pure red indicates a maximum number of contacts which is indicated in the top left corner of the interactive figure). 2D squares along the diagonal indicate the boundaries of input contigs as ordered and oriented in Step 1 (AGWG.0_asm.scaffold_track.txt). Shown on top and to the left of the map, along the X and Y axes, is a .bed track (suspect.at.step.0.bed) highlighting the annotated misjoin positions: horizontal and vertical blue bars align with diagonal points inside the 2D annotation squares that are associated with a reduced number of contacts between sequences immediately upstream and downstream than would be expected in the case of correctly joined sequences. By default the figure focuses on a representative interval; by scrolling readers can explore the complete, genome-wide data and annotations.

The command used to edit the input contigs based on the annotation file is included below:

( (awk 'NR==1{print > "/dev/stderr";next}1' <(awk -f ${path_to_3ddna}/lift/lift-asm-annotations-to-input-annotations.awk AGWG.0.cprops AGWG.0.asm suspect_2D.at.step.0.txt) | sort -k 2,2n -k 3,3n)  2>&1 ) > AGWG.edits.txt
Step 3: Automatic scaffolding of misjoin-corrected contigs

The scaffolding step using misjoin-corrected contigs as input is documented with the following files:

Commands used to generate the above files are listed below:

# prep
awk -f ${path_to_3ddna}/edit/edit-cprops-according-to-annotations.awk AGWG.edits.txt AGWG.0.cprops > AGWG.1.cprops
bash ${path_to_3ddna}/edit/edit-mnd-according-to-new-cprops.sh AGWG.1.cprops AGWG.mnd.txt > AGWG.mnd.1.txt
# order and orient scaffolds larger than 20000
bash ${path_to_3ddna}/scaffold/run-liger-assembler.sh -s 20000 AGWG.1.cprops AGWG.mnd.1.txt
# build high-resolution hic map with mapq >= 1 (default)
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.1.cprops AGWG.1.asm AGWG.mnd.1.txt
# build supplementary high-resolution hic maps with reads mapping quality >=0 and >=30
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 0 -m temp.AGWG.1.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.1.cprops AGWG.1.asm AGWG.mnd.1.txt
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 30 -m temp.AGWG.1.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.1.cprops AGWG.1.asm AGWG.mnd.1.txt
Step 4: Curated polishing and chromosome splitting

The assembly from step 3 was manually polished using Juicebox Assembly Tools to generate three raw chromosomal scaffolds. Shared below is the reviewed assembly file as exported by the Juicebox Assembly Tools as well as the Hi-C contact maps associated with the step:

Commands used to generate the above files are as follows:

# prep
ln -sf AGWG.1.review.assembly AGWG.rawchrom.assembly
awk -f ${path_to_3ddna}/utils/convert-assembly-to-cprop-and-asm.awk AGWG.rawchrom.assembly
bash ${path_to_3ddna}/edit/edit-mnd-according-to-new-cprops.sh AGWG.rawchrom.cprops AGWG.mnd.txt > AGWG.mnd.rawchrom.txt
# generate a high-resolution Hi-C map to visualize pre-merge assembly with default mapping quality threshold (mapq>=1)
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom.assembly AGWG.mnd.rawchrom.txt
# generate supplementary Hi-C maps to visualize pre-merge assembly with mapq>=0 and mapq>=30
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 0 -m temp.AGWG.rawchrom.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom.assembly AGWG.mnd.rawchrom.txt
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 30 -m temp.AGWG.rawchrom.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.rawchrom.assembly AGWG.mnd.rawchrom.txt

The progression from draft to raw chromosomal scaffolds can be interactively reviewed at http://bit.ly/2vfsKgZ. This shows four contact maps, AGWG.draft.hic, AGWG.0.hic, AGWG.1.hic and AGWG.rawchrom.hic, all generated using the same Hi-C dataset. The pixel intensity in the contact maps indicates how often a pair of loci colocate in the nucleus, on a scale from white to red (pure white indicates no observed contacts, whereas pure red indicates a maximum number of contacts which is indicated in the left top corner). The maps become 'cleaner', with less off-diagonal signal showing, as order and orientation of contigs changes from essentially random in the draft to locally correct in raw chromosomal scaffolds. The 18 off-diagonal peaks remaining in the raw chromosomal map are a manifestation of telomere-to-telomere and centromere-to-centromere clustering in A. aegypti. By default the interactive figure shows a representative interval; by scrolling readers can explore the complete, genome-wide data.

Step 5: Automatic pairwise alignment of nearby contigs

We then used LASTZ to do pairwise alignment of all contigs that fall within a sliding window of 3Mb along the raw chromosomal scaffolds. The alignment data allows one to identify long nearby stretches of high sequence identity suggestive of undercollapsed heterozygosity. The intermediate files associated with this step include:

For reproducing the above files, see the following commands:

# prep
awk -f ${path_to_3ddna}/utils/convert-assembly-to-cprop-and-asm.awk AGWG.rawchrom.assembly
awk -f ${path_to_3ddna}/edit/edit-fasta-according-to-new-cprops.awk AGWG.rawchrom.cprops AGWG.draft.fasta > AGWG.rawchrom.fasta
mkdir faSplit && cd faSplit && awk -f ${path_to_3ddna}/merge/split-fasta-by-cname.awk AGWG.rawchrom.cprops AGWG.rawchrom.fasta && cd ..
# perform pairwise alignment of all input contigs/scaffolds within set distance and filter using alignment scores, length and identity to select probable alternative haplotype sequences
bash ${path_to_3ddna}/merge/align-nearby-sequences-and-filter-overlaps.sh AGWG.rawchrom.cprops AGWG.rawchrom.asm faSplit
# tile based on filtered alignment results
bash ${path_to_3ddna}/merge/tile-assembly-based-on-overlaps.sh AGWG.rawchrom.cprops AGWG.rawchrom.asm overlaps_2D_input.txt
# generate a high-resolution Hi-C map to visualize pre-merge assembly with default mapping quality threshold (mapq>=1)
cat <(awk '{$0=">"$0}1' AGWG.rawchrom.cprops) <(awk '{gsub("{|}","")}1' AGWG.rawchrom_tiled.asm) > AGWG.rawchrom_tiled.assembly
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom_tiled.assembly AGWG.mnd.rawchrom.txt
# generate supplementary Hi-C maps to visualize pre-merge assembly with mapq>=0 and mapq>=30
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 0 -m temp.AGWG.rawchrom.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom_tiled.assembly AGWG.mnd.rawchrom.txt
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -q 30 -m temp.AGWG.rawchrom.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.rawchrom_tiled.assembly AGWG.mnd.rawchrom.txt
# lift overlap annotations to new positions in the tiled assembly
awk -f ${path_to_3ddna}/supp/lift-input-annotations-to-assembly-annotations.awk AGWG.rawchrom_tiled.assembly overlaps_2D_input.txt > overlaps_2D_asm_post_tiling.txt

For interactive review of results associated with step #5 compare http://bit.ly/2JIu3Z7 and http://bit.ly/2Hn3kmw. The links show, respectively, the contact maps before tiling (AGWG.rawchrom_0.hic) and after tiling (AGWG.rawchrom_tiled_0.hic), with two sets of annotations overlaid over each of the maps. As before, the heatmaps indicate the frequency of contact between pairs of loci as measured by the Hi-C experiment, on a scale from white to red; pure white indicates no observed contacts, whereas pure red indicates a maximum number of contacts which is indicated in the left top corner. The same scale is used for both maps by default.

On the left panel of each of the interactive figures, 2D square annotations along the diagonal indicate the boundaries of contigs, encoded in AGWG.rawchrom_asm.scaffold_track.txt and AGWG.rawchrom_tiled_asm.scaffold_track.txt, respectively. On the right, the same area of the map is shown with overlaid annotations of tentative overlaps as identified by pairwise LASTZ alignments (overlaps_2D_asm.txt and overlaps_2D_asm_post_tiling.txt).

One can see that as a rule the regions of high sequence identity match off-diagonal stretches of high contact intensity parallel or perpendicular to the diagonal in http://bit.ly/2JIu3Z7. This is brought about by the accumulation of contacts associated with Hi-C reads aligning with low mapping quality (mapq~0). The latter appear when the aligner, unable to distinguish between the haplotypes, assigns a large number of Hi-C contacts associated with true 1D proximity to one of the alternative haplotypes present in the assembly at random, effectively creating mirror images of the diagonal contact band.

The 'parallel' stretches correspond to cases when the alternative haplotypes are in the same orientation in the raw chromosomal scaffolds. The 'perpendicular' off-diagonal stretches correspond to cases where orientation of contigs as suggested by pairwise alignment does not match that suggested by the scaffolding step. These differences are reconciled in favor of the alignment signal in http://bit.ly/2Hn3kmw, where all annotated stretches of enriched Hi-C signal are in parallel to the diagonal.

By default the interactive figures focus on a representative interval (same across both maps); by scrolling readers can explore the complete, genome-wide data and annotations.

Step 6: Curation of merge groups

The automatically identified merge groups from Step 5 have been curated using Assembly Tools to remove false positive alignments, remove ambiguously aligning contigs and reconcile ordering and orientation disagreements in merge groups. The files shared in association with this review are listed below:

Commands relevant to this step are as follows:

# prep
awk -f ${path_to_3ddna}/utils/convert-assembly-to-cprops-and-asm.awk AGWG.rawchrom_tiled.curated.assembly
bash ${path_to_3ddna}/edit/edit-mnd-according-to-new-cprops.sh AGWG.rawchrom_tiled.curated.cprops AGWG.mnd.txt > AGWG.rawchrom_tiled.curated.txt
# build high-resolution hic map with mapq >= 1 (default)
bash ${path_to_3ddna}/visualize/run-assembly-visualizer.sh -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom_tiled.curated.assembly AGWG.rawchrom_tiled.curated.txt
# build supplementary high-resolution hic maps with reads mapping quality >=0 and >=30
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 0 -m temp.AGWG.rawchrom_tiled.curated.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 AGWG.rawchrom_tiled.curated.assembly AGWG.rawchrom_tiled.curated.txt
bash ${path_to_3ddna}/visualize/run-asm-visualizer.sh -q 30 -m temp.AGWG.rawchrom_tiled.curated.asm_mnd.txt -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,1000,500 -c AGWG.rawchrom_tiled.curated.assembly AGWG.rawchrom_tiled.curated.txt

For interactive review of this step see http://bit.ly/2JLGzaq. This shows the AGWG.rawchrom_tiled.review_0.hic with AGWG.rawchrom_tiled.review_asm.scaffold_track.txt boundary annotations along the diagonal. The heatmap indicates the frequency of contact between pairs of loci as measured by the Hi-C experiment, on a scale from white to red, and the square annotations along the diagonal indicate the ordering and orientation of contigs after tiling curation. By default, the map shows the neighborhood of the same contigs that have been the focus of interactive illustration for Step 6; by scrolling readers can explore the complete, genome-wide data and annotations.

Step 7: Automatic merging of overlapping contigs

Merging of chained sequences was performed iteratively by the 3D-DNA merge module. The results are shared as follows:

To reproduce the above files see the following commands:

# create new haploid contigs by merging connected components defined in the tiled assembly file
bash ${path_to_3ddna}/merge/merge-tiled-asm.sh -a AGWG.rawchrom_tiled.review_asm.scaffold_track.txt AGWG.rawchrom.cprops AGWG.rawchrom_tiled.review.asm faSplit
# parse iterative merge results for comparison with pairwise alignment and contig contribution classification
awk 'NR==1||$7=="255,255,0"' AGWG.rawchrom/overlap_qc_track_2D_asm.txt > contributing_2D_asm.txt
awk 'NR==1||$7=="255,255,255"' AGWG.rawchrom/overlap_qc_track_2D_asm.txt > not_contributing_2D_asm.txt
# finalize output from haploid contigs to haploid chromosome-length scaffolds
cp AGWG.rawchrom/AGWG.rawchrom_merged.asm AGWG.final.asm
ln -sf AGWG.rawchrom/merged_AGWG.rawchrom.fa AGWG.final.fasta
awk -f ${path_to_3ddna}/utils/generate-cprops-file.awk AGWG.final.fasta > AGWG.final.cprops
bash ${path_to_3ddna}/finalize/finalize-output.sh -l AGWG AGWG.final.cprops AGWG.final.asm AGWG.final.fasta final
ln -sf AGWG.FINAL.fasta AGWG.HiC.fasta

For review of iterative alignment results see http://bit.ly/2JECh4H showing AGWG.rawchrom_tiled.review_0.hic heatmap with three types of overlaid annotations. The heatmap indicates the frequency of contact between pairs of loci as measured by the Hi-C experiment, on a scale from white to red, with the red indicating the maximal number of contacts (indicated in the top left corner).

On the left, the diagonal annotations (AGWG.rawchrom_tiled.review_asm.scaffold_track.txt) indicate the boundaries of draft contigs as ordered and oriented by the assembly at this stage.

In the center, not_contributing_2D_asm.txt 2D annotation file highlights contigs that will, in their entirety, be annotated as alternate alleles. These contigs will not contribute any sequence to the 'main' haploid reference as the sequence they contain is already included as part of larger contigs. The annotations are a subset of diagonal squares from the left panel shifted up and left to match the genomic positions of overlapping sequences upstream in the assembly. As expected (and analogous to the pairwise alignment annotation in Step 5) these annotation enclose stretches of enriched contact frequency parallel to the diagonal, brought about by random assignment of short Hi-C reads with low mapping quality across haplotypes.

Overlaid on top of the map on the right is the contributing_2D_asm.txt track. This track highlights contigs that only partly overlap with the contigs upstream in the assembly. As such, these contigs will contribute sequence to the final haploid reference. The annotations represent a subset of diagonal squares from the left panel, shifted up and left to genomic positions of matching sequences upstream in the assembly. The parallel-to-diagonal stretches of enriched signal associated with the overlap are seen in the top left corner of each annotation.

By default the figure indicates a representative interval; by scrolling readers can explore the complete, genome-wide data and annotations. To explore the corresponding location of the contact map after the merge see http://bit.ly/2JA7khR (see Step 8 for details).

Step 8: Final NCBI submission

The Hi-C chromosomal scaffolds from Step 7 have been then polished and gap-filled using PBJelly, see (Matthews, Dudchenko & Kingan et al., bioRxiv, 2017) for details. Additionally, mitochondrial sequences have been extracted and gaps normalized to 100N, and the resulting sequences as well as alternate alleles submitted to NCBI. Hi-C maps associated with the final output are as follows:

  • Liverpool1.hic, Liverpool1_30.hic - Hi-C contact maps created by realigning the Hi-C data from SRR6294665 experiment to the haploid chromosome-length AaegL5 reference, as generated by the Juicer pipeline, thresholded at mapping quality 1 and 30, respecitvely;
  • Liverpool2.hic, Liverpool2_30.hic - Hi-C contact maps created by realigning the Hi-C data from SRR6294663 experiment to the haploid chromosome-length AaegL5 reference, as generated by the Juicer pipeline, thresholded at mapping quality 1 and 30, respectively;
  • Liverpool3.hic, Liverpool3_30.hic - Hi-C contact maps created by realigning the Hi-C data from SRR6294664 experiment to the haploid chromosome-length AaegL5 reference, as generated by the Juicer pipeline, thresholded at mapping quality 1 and 30, respectively;
  • mega.hic, mega_30.hic - contact maps, created by combining data from all three Hi-C experiments listed above, as generated by the mega.sh script from the Juicer suite, thresholded at mapping quality 1 and 30, respectively.

For details on how to run the Juicer pipeline see the Juicer GitHub page and Wiki.

Explore the genome-wide contact map mega.hic via http://bit.ly/2JGNmSN. This interactive heatmap indicates the frequency of contacts between pairs of loci in AaegL5 at multiple resolutions as measured by three Hi-C experiments conducted for this study, on a scale from white to red; pure white indicates no observed contacts, whereas pure red indicates a maximum number of contacts (indicated in the top left corner). By default the genome-wide map is shown; by zooming in and scrolling readers can explore the complete, genome-wide data for each of the three chromosomes of A. aegypti.

Citations and licensing

If you use AaegL5 or any of the intermediate assembly results in your work, please cite the following paper:

Matthews, B.J.,* Dudchenko O.,* Kingan S.,* Koren S., Antoshechkin I., Crawford J.E., Glassford W.J., Herre M., Redmond S.N., Rose N.H., et al. (2017). Improved Aedes aegypti mosquito reference genome assembly enables biological discovery and vector control. bioRxiv. doi: https://doi.org/10.1101/240747.

For Hi-C assembly methods please cite:

Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. Apr 7; 356(6333):92-95. doi: https://doi.org/10.1126/science.aal3327. Epub 2017 Mar 23.

This software is distributed under The MIT License (MIT).

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published