Skip to content
This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

Using FALCON Assembled Contigs To Generate Draft Calls of Structure Variations

Jason Chin edited this page Oct 8, 2016 · 1 revision

Detect Structure Variation from de novo assembly contigs

For high coverage data, if one can do de novo assembly, the best way to call structure variations (SVs) probably just do the whole genome alignment between the assembly and the reference genome and detect all gaps in the assembly. Unfortunately, there is not much handy "push-button" kind of tools for arm-chair or talk-only bioinformatists. One can follow UCSC Chain-Net method to do the whole genome alignment and find where the alignment gaps are to call SV. What I have been doing is to use the excellent Mummer3 tool initially developed by Adam Phillippy and others to find all Maximum Unique Matches (MUM) and then I use mgaps to cluster the MUMs. Then I scan the clusters to find big gaps or breaking points to call SVs (or translocation or mis-assemblies). However, the default mummer will take whole day to find all MUMs. One can split the reference and do it in parallel to speed it up. I used a different multithreading code developed by Douglas G. Scofield to find MUMs.

Douglas's mummer is very fast and take about 10 to 20 minutes to generate the MUM hits. It takes about an hour to build the index of the GRC38 reference. However, there no facility to save the index. Namely, you will need to compute the index every time you use it. Brett Hannigan from DNAnexus wrote a wrapper to dump the index using Boost serialization library to a file so one does not have to build the index every time. This is my preferred version.

The following snippet shows how to create all "MUM" hits using Brett's branch of sparseMEM-big to align the assembly contigs consensus.fasta to GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta. (Surely, you will need to go over the Boost building process to build Brett's version. If you don't know how to do it, you can learn a lot of Boost manual.)

./create_mummer_index GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta 
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.mum GCA_000001405.15_GRCh38_no_alt_analysis_set.mum.idx
~/build/sparseMEM-big/mummer -b -l 24 -qthreads 48 -mum /read_data/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.mum.idx ./consensus.fasta > conensus.fasta.mum

I wrote a simple script called multi_mgap_p.py to split the conensus.fasta.mum by contigs to run mgaps from Mummer3 in parallel.

$ cat multi_mgap_p.py
import sys
import os

def cluster_mum(inputs):
    mum_data, contig_id = inputs
    outputs = []
    for ch in mum_data:
        outputs.append( "> %s:%s" % (contig_id, ch) )
        for l in mum_data[ch]:
            outputs.append( "\t".join(l) )
    return "\n".join(outputs)

fn = sys.argv[1]
mum_data = {}
contig_id = ""
n = 32
tmp_mum_out = []
for i in range(n):
    tmp_mum_out.append( open("tmp_out_%02d.mum" % i,"w") )

with open(fn) as f:
    for l in f:
        l = l.strip().split()
        if l[0] == ">":
            if len(mum_data) != 0:
                output_data = cluster_mum( (mum_data, contig_id) )
                i = hash(contig_id) % 32
                print >> tmp_mum_out[i], output_data
            mum_data = {}
            contig_id = "_".join(l[1:])
        else:
            mum_data.setdefault(l[0],[])
            mum_data[l[0]].append(l[1:])


    if len(mum_data) != 0:
        output_data = cluster_mum( (mum_data, contig_id) )
        i = hash(contig_id) % 32
        print >> tmp_mum_out[i], output_data

for i in range(n):
    tmp_mum_out[i].close()

for i in range(n):
    os.system("cat tmp_out_%02d.mum | mgaps -d 1000 -l 10000 -f 1 -s 100000 > tmp_mgaps_out_%02d &" % (i,i))

Note the parameters that one passes to mgaps might affect the sensitivity and specificity of SV calling.

The next step is to concatenate the mgaps output.

cat tmp_mgaps* >  consensus.fasta.mgaps

By parsing the outputted conensus.fasta.mgaps to find the gaps or breaking points in the alignments, one can do some quick SV calls.

Here is an example script to generate a bed file of all the SV location called.

$ cat mgaps_to_SV_bed.py

from falcon_kit.FastaReader import FastaReader

ctg_seq = {}
f = FastaReader("./consensus.fasta")
for r in f:
    ctg_seq[r.name] = r.sequence

mum_hits = {}

with open("./consensus.fasta.mgaps") as f:
    aln_data = []
    pre_chunk = None
    for l in f:
        l = l.strip().split()
        if l[0] == ">":
            try:
                if len(aln_data) != 0:
                    mum_hits[(ctg, chr_)] = aln_data
                ctg, chr_ = l[1].split(":")
                aln_data = []
            except:
                print l
        else:
            if l[0] != "#":
                try:
                    chr_pos, ctg_pos, aln_len, dummy, chr_gap, ctg_gap = l
                    chr_pos = int(chr_pos)
                    ctg_pos = int(ctg_pos)
                    aln_len = int(aln_len)
                    chr_gap = int(chr_gap) if chr_gap != "-" else "-"
                    ctg_gap = int(ctg_gap) if ctg_gap != "-" else "-"
                    cur_chunk = ( chr_pos, ctg_pos, aln_len, chr_gap, ctg_gap )
                    if chr_gap != "-" and ctg_gap != "-":
                        if (chr_gap > 100 or ctg_gap > 100) and abs(chr_gap - ctg_gap) > 100 :
                            aln_data.append( (pre_chunk, cur_chunk) )
                    pre_chunk = cur_chunk
                except:
                    print l
    if len(aln_data) != 0:
        mum_hits[(ctg, chr_)] = aln_data

flanking_size = 2500

with open("consensus.fasta.SV.bed","w") as f:
    for k in mum_hits.keys():
        ctg_id, chr_id = k
        #ctg_num = ctg_id.split("|")[0].split("_")[1]
        ctg_num = ctg_id
        for pre_chunk, gap_chunk in mum_hits[k]:
            if pre_chunk == None:
                continue
            #print k, pre_chunk, gap_chunk
            chr_pos1, ctg_pos1, aln_len1, chr_gap1, ctg_gap1 = pre_chunk
            chr_pos2, ctg_pos2, aln_len2, chr_gap2, ctg_gap2 = gap_chunk
            bed_line = [chr_id, chr_pos2 - chr_gap2 - flanking_size, chr_pos2 + flanking_size, str(chr_pos2)+":"+ctg_num+"_"+str(chr_gap2)+":"+str(ctg_gap2),100]
            if ctg_id[-7:] == "Reverse":
                bed_line += ["-"]
            else:
                bed_line += ["+"]
            bed_line += [ chr_pos2 - chr_gap2, chr_pos2 ]
            if ctg_id[-7:] == "Reverse":
                bed_line += ["255,0,0"]
            else:
                bed_line += ["0,0,255"]
             print >>f, "\t".join([str(c) for c in bed_line])

Thees lines

 if (chr_gap > 100 or ctg_gap > 100) and abs(chr_gap - ctg_gap) > 100 :
                            aln_data.append( (pre_chunk, cur_chunk) )

filer out gaps < 100 bp or the difference of the gap size < 100 bp. You can adjust the logic to get more sensitive calls but you should consider other filtering criteria to filter out false positives.

You can also generate the whole genome alignment from the mgaps file. Here is a script I use for that:

$ cat dump_ctg_aln_bed.py

from falcon_kit.FastaReader import FastaReader

def get_q_cov_base(aln_data):
    return sum([x[2] for x in aln_data])

def get_q_ave_idt(aln_data, id_):
    return 1.0*sum([x[-1] for x in aln_data[id_]])/len(aln_data[id_])

def get_contig_extent(aln_data):
    ad = aln_data[:]
    #ad = [x for x in ad if x[-1]>98 and x[-2] > 10000]
    ad.sort(key=lambda x:x[0])
    return ad[0][0],ad[-1][2]+ad[-1][0], ad[0][1], ad[-1][2] + ad[-1][1]

ctg_seq = {}

f = FastaReader("./consensus.fasta")
for r in f:
    r_id = r.name.split()[0]
    ctg_seq[r_id] = r.sequence

with open("./consensus.fasta.mgaps") as f:
    aln_data = []

    for l in f:
        l = l.strip().split()
        if l[0] == ">" or l[0] == "#":
            if len(aln_data) != 0:
                q_cov = get_q_cov_base(aln_data)
                if ctg[-7:] == "Reverse":
                    ctg2 = ctg[:-8]
                else:
                    ctg2 = ctg
                ctg_len = len(ctg_seq[ctg2])
                ctg_num = ctg.split("|")[0]
                #ctg_num = ctg.split("|")[1]
                s, e, qs, qe = get_contig_extent(aln_data)
                q_cov_pct = int(100.0*q_cov/ctg_len)
                r_cov_pct = int(100.0*q_cov/(e-s))

                
                bed_line = [chr_, s, e, ctg_num + ":" + str(ctg_len)+"_" + str(s) + ":" + str(e) + ":" + "%d" % r_cov_pct, q_cov_pct]
                if ctg[-7:] == "Reverse":
                    bed_line += ["-"]
                else:
                    bed_line += ["+"]

                bed_line += [ s, e ]
                if ctg[-7:] == "Reverse":
                    bed_line += ["255,0,0"]
                else:
                    bed_line += ["0,0,255"]
                print "\t".join( [str(c) for c in bed_line] )
            if l[0] != "#":
                ctg, chr_ = l[1].split(":")
            aln_data = []
        else:
            chr_pos, ctg_pos, aln_len, dummy, chr_gap, ctg_gap = l
            chr_pos = int(chr_pos)
            ctg_pos = int(ctg_pos)
            aln_len = int(aln_len)
            cur_chunk = ( chr_pos, ctg_pos, aln_len )
            aln_data.append(  cur_chunk )

    if len(aln_data) != 0:
        q_cov = get_q_cov_base(aln_data)
        if ctg[-7:] == "Reverse":
            ctg2 = ctg[:-8]
        else:
            ctg2 = ctg
        ctg_num = ctg.split("|")[0]
        ctg_len = len(ctg_seq[ctg2])
        s, e, qs, qe = get_contig_extent(aln_data)
        q_cov_pct = int(100.0*q_cov/ctg_len)
        r_cov_pct = int(100.0*q_cov/(e-s))

        
        bed_line = [chr_, s, e, ctg_num + ":" + str(ctg_len)+"_" + str(s) + ":" + str(e) + ":" + "%d" % r_cov_pct, q_cog_pct]
        if ctg[-7:] == "Reverse":
            bed_line += ["-"]
        else:
            bed_line += ["+"]

        bed_line += [ s, e ]
        if ctg[-7:] == "Reverse":
            bed_line += ["255,0,0"]
        else:
            bed_line += ["0,0,255"]
        print "\t".join( [str(c) for c in bed_line] )

Detect Structure Variation from Error Corrected Reads (p-reads) without An Whole Genome de novo Assembly

Sometimes, whole genome assembly and post-assembly alignment might have some errors. It might be useful to call SVs without the process. For some smaller indel, one can call within some error corrected reads. For larger one, some local re-assmebly will help to resolve them. Here is some possible way to call SV using this p-read with whole genome assembly.

  • Aligne p-reads (error corrected reads) to a reference genome with bwa mem ** Identify “soft or hard clips”/“large indels” by examining the CIGAR string in the SAM file.
  • For small local SVs, one can call with the p-reads vs. reference alignment
  • For larger local SVs, some local assembly might be needed
  • For non local SVs (e.g. translocations), one can simply check where the different segments of the reads mapped to
  • To avoid artifacts caused by the error correction process, one can always align the raw sequencing to confirm a call

In fact, it may not hurt to assemble the reads that has big part got clipped by bwa men alignment. By doing assembly, it helps to reduce the redundant sequences. In this case, the contig should contain the regions of structure variations and their flanking regions. You might still need to use the MUM alignment approach above to generate a call set.

Detect Structure Variation from Raw PacBio Reads without whole genome error correction

Whole genome error correction can be still expansive to compute. Another strategy, like using the p-reads along, is to find all raw reads that does not align fully to the references. We can generate a set of such reads. You can then use FALCON to do error correction and assembly of those reads. After that, you can call SVs by align those contigs to the reference.