# Fixing gaps in Dovetail-assembled haplotype fastas

Pairwise dotplots of the haplotypes revealed a few minor gaps in some of the chromosomal scaffolds that could likely be filled by currently unplaced scaffolds. The assembler could not fill these gaps because they seem to occur in large repetitive regions, which likely have little PacBio and Hi-C read support. However, mapping the haplotypes against one another (i.e., aligning with `minimap2` and visualizing the alignments as dotplots) helped identify the missing pieces and where they should be placed.

The general approach to filling the gaps is as follows:

1. Identify the gapped chromosome and the scaffold that fills the gap using the dotplots
2. Load the AGP file for the gapped chromosome, which includes the coordinates for the gap. Dovetail's AGP files only have a single gap where the scaffold could be placed, so this makes our life easier.
3. Place the missing scaffold in the middle of the run of 100 n's currently used to represent the chromosome gap, and add 50 more n's on each side of it. The inserted scaffold then has 100 n's on either side, consistent with the number of n's inserted by Dovetail's pipeline.
4. Write a new AGP file using a script written by the Juicebox developers (note: this will be done in a separate Snakemake rule)
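
Step 3 can be sketched as a small helper. This is a minimal illustration on plain Python strings rather than the code used in the cells below; the function `insert_into_gap` and its parameter names are hypothetical, and it assumes that `gap_start` has already been converted to a 0-based index and that the gap has an even length.

```python
def insert_into_gap(chrom_seq, gap_start, scaffold_seq, gap_len=100):
    """Insert scaffold_seq at the midpoint of an existing gap.

    chrom_seq    -- chromosome sequence containing a run of gap_len n's
    gap_start    -- 0-based index of the first n of the gap
    scaffold_seq -- sequence of the unplaced scaffold
    """
    half = gap_len // 2
    gap_mid = gap_start + half
    # Pad with half a gap on each side so that, together with the two halves
    # of the original gap, the scaffold is flanked by gap_len n's on both sides.
    pad = "n" * half
    return chrom_seq[:gap_mid] + pad + scaffold_seq + pad + chrom_seq[gap_mid:]


# Toy example: a 4 bp gap instead of 100 bp, to keep the output readable
chrom = "ACGT" + "nnnn" + "TTAA"
print(insert_into_gap(chrom, gap_start=4, scaffold_seq="GGCC", gap_len=4))
# ACGTnnnnGGCCnnnnTTAA
```

With the default `gap_len=100`, the same arithmetic gives the 50 + 50 split used for the mid-chromosome gaps in this notebook.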

## Setup

In [2]:
# Load modules
import pandas as pd
from Bio import SeqIO

### Functions

In [3]:
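def write_fasta(record_dict, fasta_out):
    """Write each record as a FASTA entry, using the dict key as the header."""
    with open(fasta_out, 'w') as fout:
        for scaff, rec in record_dict.items():
            # Sequence is written on a single, unwrapped line
            fout.write(f">{scaff}\n{rec.seq}\n")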
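
As a quick illustration of the output format, the sketch below runs `write_fasta` on toy data. The `SimpleNamespace` objects are hypothetical stand-ins for Biopython `SeqRecord` objects, used only to keep the example free of dependencies; the scaffold names and sequences are made up.

```python
import os
import tempfile
from types import SimpleNamespace


def write_fasta(record_dict, fasta_out):
    with open(fasta_out, 'w') as fout:
        for scaff, rec in record_dict.items():
            fout.write(f">{scaff}\n{rec.seq}\n")


# Hypothetical stand-ins for SeqRecord objects; only .seq is needed here
toy_records = {
    "Scaffold_A": SimpleNamespace(seq="ACGTnnnnACGT"),
    "Scaffold_B": SimpleNamespace(seq="GGCC"),
}

# Write to a temporary file and read it back to inspect the result
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.fasta")
    write_fasta(toy_records, path)
    with open(path) as fin:
        fasta_text = fin.read()

print(fasta_text)
```

Each record becomes a header line taken from the dictionary key followed by the full sequence on one line, so any description text in the original FASTA headers is kept only if it is part of the key.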

### Inputs and outputs

#### Inputs

In [4]:
# AGP Files
names = ['object', 'object_beg', 'object_end', 'part_number', 'component_type', 
         'component_id', 'gap_type', 'linkage', 'orientation']
occ1_agp = pd.read_csv(snakemake.input[0], delimiter='\t', names=names, skiprows=2)
occ2_agp = pd.read_csv(snakemake.input[1], delimiter='\t', names=names, skiprows=2)
pall1_agp = pd.read_csv(snakemake.input[2], delimiter='\t', names=names, skiprows=2)
pall2_agp = pd.read_csv(snakemake.input[3], delimiter='\t', names=names, skiprows=2)

# Assembled Dovetail haplotype fastas. Load as sequence records dictionary
occ1_records_dict = SeqIO.to_dict(SeqIO.parse(snakemake.input[4], 'fasta'))
occ2_records_dict = SeqIO.to_dict(SeqIO.parse(snakemake.input[5], 'fasta'))
pall1_records_dict = SeqIO.to_dict(SeqIO.parse(snakemake.input[6], 'fasta'))
pall2_records_dict = SeqIO.to_dict(SeqIO.parse(snakemake.input[7], 'fasta'))

#### Outputs

In [5]:
occ1_fasta_out = snakemake.output[0] 
occ2_fasta_out = snakemake.output[1] 
pall1_fasta_out = snakemake.output[2] 
pall2_fasta_out = snakemake.output[3] 

## Occ1 Haplotype

### Issue 1: Missing chunk in middle of Occ1_S2

Dotplot using Occ2 as the query and Occ1 as the target suggests that the gap in the middle of Occ1_S2 could be filled by Occ1_S9.

In [6]:
# Filter Occ1 AGP for Scaffold 2
occ1_S2_agp = occ1_agp[occ1_agp["object"].str.contains("Scaffold_2_")]
occ1_S2_agp

In [7]:
# Coordinates of gap in chromosome
# Subtract 1 from start position since Python is 0-based and AGP is 1-based
# The 1-based inclusive AGP end equals the 0-based exclusive Python end
# .item() raises unless there is exactly one gap ('U') row
occ1_S2_gap = occ1_S2_agp[occ1_S2_agp['component_type'] == 'U']
occ1_S2_gapStart = int(occ1_S2_gap['object_beg'].item()) - 1
occ1_S2_gapEnd = int(occ1_S2_gap['object_end'].item())
occ1_S2_gapMid = occ1_S2_gapStart + 50  # Gap is always 100 n's, so the midpoint is 50 bp in
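
The conversion from 1-based AGP coordinates to 0-based Python slice indices can be checked on a toy example. The AGP line below is made up for illustration (it is not taken from the Dovetail files), and the variable names are hypothetical; the sketch only assumes the standard nine tab-separated AGP columns.

```python
# Hypothetical AGP gap ('U') line with the nine tab-separated AGP columns
agp_line = "Scaffold_X\t1001\t1100\t2\tU\t100\tscaffold\tyes\tmap"
fields = agp_line.split("\t")

object_beg = int(fields[1])  # 1-based, inclusive
object_end = int(fields[2])  # 1-based, inclusive

gap_start = object_beg - 1   # 0-based index of the first n
gap_end = object_end         # 0-based, exclusive end of the gap
gap_mid = gap_start + (gap_end - gap_start) // 2

# Toy chromosome: 1000 bp, a 100 bp gap, then another 1000 bp
chrom = "A" * 1000 + "n" * 100 + "C" * 1000
assert chrom[gap_start:gap_end] == "n" * 100
print(gap_start, gap_end, gap_mid)
# 1000 1100 1050
```

Slicing the chromosome at `gap_mid` leaves 50 n's on each side of the cut, which is the position where the unplaced scaffold is inserted.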

In [8]:
# Get relevant records
occ1_S2_record = occ1_records_dict['Scaffold_2__2_contigs__length_42515810']
occ1_S9_record = occ1_records_dict['Scaffold_9__1_contigs__length_12292107']

In [9]:
# Assemble new sequence and replace Occ1_S2 record sequence in dict
occ1_S2_newSeq = occ1_S2_record.seq[:occ1_S2_gapMid] + (50*'n') + occ1_S9_record.seq + (50*'n') + occ1_S2_record.seq[occ1_S2_gapMid:]
occ1_records_dict[occ1_S2_record.id].seq = occ1_S2_newSeq
del occ1_records_dict[occ1_S9_record.id] # Remove scaffold since now placed

### Issue 2: Missing chunk in the middle of Occ1_S8

Dotplot using Occ2 as the query and Occ1 as the target suggests that the gap in the middle of Occ1_S8 could be filled by Occ1_S10.

In [10]:
# Filter Occ1 AGP for Scaffold 8
occ1_S8_agp = occ1_agp[occ1_agp["object"].str.contains("Scaffold_8_")]
occ1_S8_agp

In [11]:
# Coordinates of gap in chromosome
# Subtract 1 from start position since Python is 0-based and AGP is 1-based
# The 1-based inclusive AGP end equals the 0-based exclusive Python end
# .item() raises unless there is exactly one gap ('U') row
occ1_S8_gap = occ1_S8_agp[occ1_S8_agp['component_type'] == 'U']
occ1_S8_gapStart = int(occ1_S8_gap['object_beg'].item()) - 1
occ1_S8_gapEnd = int(occ1_S8_gap['object_end'].item())
occ1_S8_gapMid = occ1_S8_gapStart + 50  # Gap is always 100 n's, so the midpoint is 50 bp in

In [12]:
# Get relevant records
occ1_S8_record = occ1_records_dict['Scaffold_8__2_contigs__length_53096940']
occ1_S10_record = occ1_records_dict['Scaffold_10__1_contigs__length_6038839']

In [13]:
# Assemble new sequence and replace Occ1_S8 record sequence in dict
occ1_S8_newSeq = occ1_S8_record.seq[:occ1_S8_gapMid] + (50*'n') + occ1_S10_record.seq + (50*'n') + occ1_S8_record.seq[occ1_S8_gapMid:]
occ1_records_dict[occ1_S8_record.id].seq = occ1_S8_newSeq
del occ1_records_dict[occ1_S10_record.id] # Remove scaffold since now placed

### Issue 3: Missing chunk at end of Occ1_S3

Dotplot using Occ1 as the query and Occ2 as the target suggests that Occ1_S3 is missing a segment at its end (the mapping ends prematurely along Occ2_S3). Dotplot of Pall1 against Occ2 suggests that Pall1_S9 is complementary to the beginning of Occ2_S3, which corresponds to the segment missing from Occ1_S3 since Occ1_S3 and Occ2_S3 are in reverse orientation. Therefore, Pall1_S9 needs to be reverse complemented and placed at the end of Occ1_S3.

In [14]:
# Filter Occ1 AGP for Scaffold 3
occ1_S3_agp = occ1_agp[occ1_agp["object"].str.contains("Scaffold_3_")]
occ1_S3_agp

In [15]:
# Get relevant records
occ1_S3_record = occ1_records_dict['Scaffold_3__2_contigs__length_53941476']
pall1_S9_record = pall1_records_dict['Scaffold_9__1_contigs__length_10370967']

In [16]:
# Assemble new sequence and replace Occ1_S3 record sequence in dict
# Pall1_S9 needs to be reverse complemented
pall1_S9_record_revComp = pall1_S9_record.reverse_complement()
occ1_S3_newSeq = occ1_S3_record.seq + (100*'n') + pall1_S9_record_revComp.seq
occ1_records_dict[occ1_S3_record.id].seq = occ1_S3_newSeq
del pall1_records_dict[pall1_S9_record.id] # Remove scaffold since now placed
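
The append step can be illustrated on toy strings. This is a minimal sketch, not the notebook's code: `revcomp` is a hypothetical helper standing in for Biopython's `reverse_complement()`, and the sequences are made up. A scaffold assembled on the opposite strand has to be reverse complemented before it is joined, so that the joined sequence reads in one consistent orientation.

```python
# Translation table for complementing DNA, preserving case and n's
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Toy chromosome end and a toy scaffold that sits on the opposite strand
chrom_end = "ACGTAC"
scaffold = "AAGG"

# Append with a 100 n spacer between the chromosome and the placed scaffold
new_seq = chrom_end + "n" * 100 + revcomp(scaffold)
print(revcomp(scaffold))
# CCTT
```

Because no gap exists at the chromosome end, the full 100 n spacer is added here, rather than the 50 + 50 padding used when inserting into an existing gap.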

### Issue 4: Missing chunk at end of Occ1_S4

The original dotplot with the Pall and Occ Hap1s mapped against all Hap2s suggested an extra fragment at the end of Pall1_S4 that should be reverse-complemented and added to the end of Occ1_S4.

In [19]:
# Get relevant contigs
pall1_S4_record = pall1_records_dict['Scaffold_4__1_contigs__length_63763842']
occ1_S4_record = occ1_records_dict['Scaffold_4__2_contigs__length_59527430']

In [20]:
# Where the telomeric duplication starts
# Obtained from the PAF file where the pall1S4 mapping jumps back to mapping at the end of Pall1S5
pall1S4_dupStart = 58113889 - 1

# Get duplicated sequence and create new pall1S4 sequence without duplication
pall1_S4_dupSeq = pall1_S4_record.seq[pall1S4_dupStart:]
pall1_S4_newSeq = pall1_S4_record.seq[:pall1S4_dupStart]

In [21]:
occ1_S4_newSeq = occ1_S4_record.seq + (100*'n') + pall1_S4_dupSeq.reverse_complement()
occ1_records_dict[occ1_S4_record.id].seq = occ1_S4_newSeq  # Replace Occ1_S4 
pall1_records_dict[pall1_S4_record.id].seq = pall1_S4_newSeq # Replace pall1_S4
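
The split used to move the fragment can be sketched on a toy string. The helper `split_at` is hypothetical and the sequence is made up; the sketch only shows that converting a 1-based start position to a 0-based index yields two pieces that together reproduce the original sequence, so no bases are lost or duplicated by the trim.

```python
def split_at(seq, start_1based):
    """Split seq into (kept, tail), where tail begins at a 1-based position."""
    idx = start_1based - 1  # convert to a 0-based slice index
    return seq[:idx], seq[idx:]


toy = "AAAAACCCCC"
kept, tail = split_at(toy, 6)
print(kept, tail)
# AAAAA CCCCC
```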

## Occ2 Haplotype

No revisions to this haplotype are necessary. However, I'll still load and re-write the fasta. 

## Pall1 Haplotype

Pall1_S9 was removed from the Pall1 sequence records above (Occ1 Issue 3), and the fragment at the end of Pall1_S4 was trimmed off (Occ1 Issue 4). These are the only changes to this haplotype fasta.

## Pall2 Haplotype

### Issue 5: Missing chunk in middle of Pall2_S7

Dotplot using Pall1 as the query and Pall2 as the target suggests that the gap in the middle of Pall2_S7 could be filled by Pall2_S9, though Pall2_S9 needs to be reverse complemented.

In [22]:
# Filter Pall2 AGP for Scaffold 7
pall2_S7_agp = pall2_agp[pall2_agp["object"].str.contains("Scaffold_7_")]
pall2_S7_agp

In [23]:
# Coordinates of gap in chromosome
# Subtract 1 from start position since Python is 0-based and AGP is 1-based
# The 1-based inclusive AGP end equals the 0-based exclusive Python end
# .item() raises unless there is exactly one gap ('U') row
pall2_S7_gap = pall2_S7_agp[pall2_S7_agp['component_type'] == 'U']
pall2_S7_gapStart = int(pall2_S7_gap['object_beg'].item()) - 1
pall2_S7_gapEnd = int(pall2_S7_gap['object_end'].item())
pall2_S7_gapMid = pall2_S7_gapStart + 50  # Gap is always 100 n's, so the midpoint is 50 bp in

In [24]:
# Get relevant records
pall2_S7_record = pall2_records_dict['Scaffold_7__2_contigs__length_55122380']
pall2_S9_record = pall2_records_dict['Scaffold_9__1_contigs__length_11749539']

In [25]:
# Assemble new sequence and replace pall2_S7 record sequence in dict
# Pall2_S9 needs to be reverse complemented
pall2_S9_record_revComp = pall2_S9_record.reverse_complement()
pall2_S7_newSeq = pall2_S7_record.seq[:pall2_S7_gapMid] + (50*'n') + pall2_S9_record_revComp.seq + (50*'n') + pall2_S7_record.seq[pall2_S7_gapMid:]
pall2_records_dict[pall2_S7_record.id].seq = pall2_S7_newSeq
del pall2_records_dict[pall2_S9_record.id] # Remove record since now placed

## Write fastas

In [27]:
write_fasta(occ1_records_dict, occ1_fasta_out)
write_fasta(occ2_records_dict, occ2_fasta_out)
write_fasta(pall1_records_dict, pall1_fasta_out)
write_fasta(pall2_records_dict, pall2_fasta_out)