Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GMAP alignment format question -- minimap2 support #79

Closed
nextgenusfs opened this issue May 31, 2018 · 28 comments
Closed

GMAP alignment format question -- minimap2 support #79

nextgenusfs opened this issue May 31, 2018 · 28 comments

Comments

@nextgenusfs
Copy link
Contributor

Hi @brianjohnhaas, since there seems to be problems with PASA and many versions of GMAP (I'm getting users of funannotate that end up having GMAP version incompatibility errors), I'm wondering if I should just swap out GMAP for minimap2 alignment in PASA. I've done this swap for mapping transcript evidence to the genome in funannotate because it is much much faster and results are nearly identical. To do this I wrote a bam2gff3 parser (in Python), so I think I can pretty much re-use that code but perhaps with some modifications specific for PASA.

Before I wrote this up, I wanted to get your thoughts. So my question is related to the output format of GMAP - specifically is there anything "special" about the GFF3 format that PASA is expecting? Below is the first few lines from a test set. Is the Gap= flag necessary?

##gff-version   3
# Generated by GMAP version 2017-06-20 using call:  gmap.avx2 -D /home/linuxbrew/data/genome6/training -d genome.fasta.gmap -f 3 -n 0 -x 50 -t 3 -B 5 --max-intronlength-middle=3000 --max-intronlength-ends=3000 /home/linuxbrew/data/genome6/training/trinity.fasta.clean
NW_017263654.1	genome.fasta.gmap	cDNA_match	111713	111995	100	+	.	ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283
###
NW_017263654.1	genome.fasta.gmap	cDNA_match	66497	66599	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 1 103;Gap=M103
NW_017263654.1	genome.fasta.gmap	cDNA_match	66651	66731	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 104 184;Gap=M81
NW_017263654.1	genome.fasta.gmap	cDNA_match	66784	67198	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 185 599;Gap=M415
###
NW_017263654.1	genome.fasta.gmap	cDNA_match	82710	83308	99	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 7 605;Gap=M599
NW_017263654.1	genome.fasta.gmap	cDNA_match	83360	83420	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 606 666;Gap=M61
NW_017263654.1	genome.fasta.gmap	cDNA_match	83470	83985	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 667 1182;Gap=M516
NW_017263654.1	genome.fasta.gmap	cDNA_match	84051	85267	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 1183 2399;Gap=M1217

And then next question is if I first run minimap2 alignment, I can pass this to PASA using the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 flag and then setting --ALIGNERS blat would then effectively use both the minimap2 alignments and then have PASA run BLAT, is that correct?

Thanks,
Jon

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented May 31, 2018 via email

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented May 31, 2018 via email

@nextgenusfs
Copy link
Contributor Author

nextgenusfs commented May 31, 2018

Yeah I saw that blog as well, which prompted me to take a look. The speed increase of minimap2 is amazing. So what I'm currently doing to map transcripts is this:

I'm running minimap2 through a small little shell script so that I can pipe the data to samtools and still use multiple threads (the reason is that it isn't easy in Python to pipe commands, maybe easier in Perl?), looks like this:
sam2bam.sh

#!/bin/bash

#simple wrapper for running aligner program and piping output to samtools view/sort

if [ -z "$3" ]; then
    echo 'Usage: sam2bam.sh "aligner_command" bam_threads bam_output'
    echo '**The double quotes are required around aligner command**'
    exit
fi

#construct the command
cmd="$1 | samtools view -@ $2 -bS - | samtools sort -@ $2 -o $3 -"

#run the command
eval $cmd

I'm then running minimap2 with this function (runSubprocess is just my little wrapper for subprocess/logging in python).

def minimap2Align(transcripts, genome, cpus, intron, output):
    '''
    function to align transcripts to genome using minimap2
    huge speed increase over gmap + blat
    '''
    bamthreads = round(int(cpus) / 2)
    if bamthreads > 4:
        bamthreads = 4
    minimap2_cmd = ['minimap2', '-ax', 'splice', '-t', str(cpus), '--cs', '-u', 'b', '-G', str(intron), genome, transcripts]
    cmd = [os.path.join(parentdir, 'util', 'sam2bam.sh'), " ".join(minimap2_cmd), str(bamthreads), output]
    runSubprocess(cmd, '.', log)

I'm then using the PyBam native python BAM parser to parse the data and convert BAM to GFF3 (as there isn't a GFF3 output for minimap2). Probably there is an equivalent BAM parser in Perl that might fit better with your code, but at any rate the bam2gff3 function looks like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn

def bam2gff3(input, output):
    import pybam
    with open(output, 'w') as gffout:
        gffout.write('##gff-version 3\n')
        for aln in pybam.read(os.path.realpath(input)):
            if aln.sam_flag == 0:
                strand = '+'
            elif aln.sam_flag == 16:
                strand = '-'
            else:
                continue
            cs = None
            nm = None
            tags = aln.sam_tags_string.split('\t')
            for x in tags:
                if x.startswith('cs:'):
                    cs = x.replace('cs:Z:', '')
                if x.startswith('NM:'):
                    nm = int(x.split(':')[-1])
            #print(aln.sam_qname, nm, cs)
            if nm is None or cs is None:
                continue
            matches = 0
            ProperSplice = True
            splitter = []
            exons = [int(aln.sam_pos1)]
            position = int(aln.sam_pos1)
            query = [1]
            querypos = 0
            num_exons = 1
            gaps = 0
            splitter = tokenizeString(cs, [':','*','+', '-', '~'])
            for i,x in enumerate(splitter):
                if x == ':':
                    matches += int(splitter[i+1])
                    position += int(splitter[i+1])
                    querypos += int(splitter[i+1])
                elif x == '-':
                    gaps += 1
                elif x == '+':
                    gaps += 1
                    querypos += len(splitter[i+1])
                elif x == '~':
                    if aln.sam_flag == 0:
                        if splitter[i+1].startswith('gt') and splitter[i+1].endswith('ag'):
                            ProperSplice = True
                        elif splitter[i+1].startswith('at') and splitter[i+1].endswith('ac'):
                            ProperSplice = True
                        else:
                            ProperSplice = False
                    elif aln.sam_flag == 16:
                        if splitter[i+1].startswith('ct') and splitter[i+1].endswith('ac'):
                            ProperSplice = True
                        elif splitter[i+1].startswith('gt') and splitter[i+1].endswith('at'):
                            ProperSplice = True
                        else:
                            ProperSplice = False
                    num_exons += 1
                    exons.append(position)
                    query.append(querypos)
                    query.append(querypos+1)
                    intronLen = int(splitter[i+1][2:-2])
                    position += intronLen
                    exons.append(position)
            #add last Position
            exons.append(position)
            query.append(aln.sam_l_seq)
            #convert exon list into list of exon tuples
            exons = zip(exons[0::2], exons[1::2])
            queries = zip(query[0::2], query[1::2])
            if ProperSplice:
                mismatches = nm - gaps
                pident = 100 * (matches / (matches + mismatches))
                if pident < 80:
                    continue
                for i,exon in enumerate(exons):
                    start = exon[0]
                    end = exon[1]-1
                    if strand == '+':
                        qstart = queries[i][0]
                        qend = queries[i][1]
                    else:
                        qstart = aln.sam_l_seq - queries[i][1] + 1
                        qend = aln.sam_l_seq - queries[i][0] + 1
                    gffout.write('{:}\t{:}\t{:}\t{:}\t{:}\t{:.2f}\t{:}\t{:}\tID={:};Target={:} {:} {:}\n'.format(aln.sam_rname,'genome','cDNA_match',start,end,pident,strand,'.',aln.sam_qname,aln.sam_qname,qstart,qend))

Heng made a "new" flag in minimap2 called the cs flag, which is a little bit easier to parse than the CIGAR string.

UPDATE (6/25/2018): I updated the bam2gff3 function using PyBAM - previous one was not dealing with crick aligned transcripts correctly. Note this only writes alignments to GFF3 if pident is > 80%.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented May 31, 2018 via email

@nextgenusfs
Copy link
Contributor Author

Hi Brian,
I have this a try and looks like I need to also have a lend an rend (left and right end) in the GFF3 file format:

error parsing match coordinates lend[] rend[]from last field: 
ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b,
line: 
CP022967.1	genome	cDNA_match	702	1436	91.42	.	.	ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b

Just a quick question as to what coords is that referring -- I'm assuming this is the query lend and rend?

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Jun 10, 2018 via email

@nextgenusfs
Copy link
Contributor Author

Okay, so the cDNA coordinates... any idea how I can get that out of SAM format?

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Jun 10, 2018 via email

@marchoeppner
Copy link

Hi,
just wondering if there has been any progress on integrating minimap2 with Pasa over the past year? ;) Seems like a worthwhile addition.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@marchoeppner
Copy link

Thanks and fair enough! If I get around to it, I'll make a pull request.

@nextgenusfs
Copy link
Contributor Author

I’m using the custom alignments approach which is working well once I got the format figured out.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@nextgenusfs
Copy link
Contributor Author

The python method above is what I’m using from ~1 year ago. An equivalent perl converter would better for PASA.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@marchoeppner
Copy link

Seems like Minimap2 also supports PAF output, which is basically tab-delimited blast-type formatting https://github.com/lh3/miniasm/blob/master/PAF.md. Just thinking about porting code to Perl that relies on PySam...

@nextgenusfs
Copy link
Contributor Author

Right, paf probably easier. My use case was a general BAM to GFF3 conversion. SAM format would also work. You need to run minimap2 with the —cs flag to get the splicing information.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@nextgenusfs
Copy link
Contributor Author

Okay, I just used your sample data from sample_data folder and ran this:

minimap2 -ax splice --cs genome_sample.fasta.gz all_transcripts.fasta | samtools sort -o minimap2.alignments.bam -

And then I used the method above to convert to GFF3 (this is in my funannotate package):

funannotate util bam2gff3 -i minimap2.alignments.bam -o minimap2.alignments.gff3

Attachment has both the bam file and the resulting GFF3 file.
minimap2.zip

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@nextgenusfs
Copy link
Contributor Author

The cs flag description is on the manual page of minimap2: https://lh3.github.io/minimap2/minimap2.html. I think probably could also get this information from the CIGAR string, but for me the cs was easier for me to understand. But you can also output CIGAR string in the BAM file with minimap2 by passing the `-c' flag.

-c | Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag.
--cs[=STR]  | Output the cs tag. STR can be either short or long. If no STR is given, short is assumed.

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 8, 2019 via email

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 9, 2019 via email

@marchoeppner
Copy link

marchoeppner commented Aug 9, 2019

I think this will then do the same? Can't use the dev branch for my project, so needed to encapsulate the same function in a standalone script (basically just stole Brian's script and one function) - this requires Samtools to be in $PATH, which is a prereq for PASA anyway:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# This is a rip-off from Brian Haas's Sam_to_gtf.pl converter bundled with PASA
# Needed this as a standalone version.

my %PATH_COUNTER;

my $bam = $ARGV[0] ;

open BAM,"samtools view $bam |";

        while(<BAM>){
        next if(/^(\@)/);  ## skipping the header lines (if you used -h in the samools command)
        s/\n//;  s/\r//;  ## removing new line
        my @sam = split(/\t+/);  ## splitting SAM line into array

	my %entry = ( "qname" => $sam[0], "flag" => $sam[1], "rname" => $sam[2], "pos" => $sam[3], "mapq" => $sam[4], 
		"cigar" => $sam[5], "rnext" => $sam[6], "pnext" => $sam[7], "tlen" => $sam[8], "seq" => $sam[9], "qual" => $sam[10] );

	my $num_mismatches = 0;

	if ( join("\t",@sam) =~ /NM:i:(\d+)/) {
		$num_mismatches = $1;
        }

	
	my $strand = ($entry{"flag"} & 16) ? "-" : "+" ;
	$entry{"strand"} = $strand ;
	
	my $read_name = $entry{"qname"} ;
	my $scaff_name = $entry{"rname"};
	
	my ($genome_coords_aref, $query_coords_aref) = get_aligned_coords(%entry);

	my $align_len = 0;

	foreach my $coordset (@$genome_coords_aref) {
                $align_len += abs($coordset->[1] - $coordset->[0]) + 1;
        }
	# Check this...
	next if ($align_len eq 0);

	my $per_id = sprintf("%.1f", 100 - $num_mismatches/$align_len * 100); 

	my $align_counter = "$read_name.p" . ++$PATH_COUNTER{$read_name};

	my @genome_n_trans_coords;
        
        while (@$genome_coords_aref) {
            my $genome_coordset_aref = shift @$genome_coords_aref;
            my $trans_coordset_aref = shift @$query_coords_aref;

            my ($genome_lend, $genome_rend) = @$genome_coordset_aref;
            my ($trans_lend, $trans_rend) = sort {$a<=>$b} @$trans_coordset_aref;

            push (@genome_n_trans_coords, [ $genome_lend, $genome_rend, $trans_lend, $trans_rend ] );

        }

	my @merged_coords;
        push (@merged_coords, shift @genome_n_trans_coords);

        my $MERGE_DIST = 10;
        while (@genome_n_trans_coords) {
            my $coordset_ref = shift @genome_n_trans_coords;
            my $last_coordset_ref = $merged_coords[$#merged_coords];
            
            if ($coordset_ref->[0] - $last_coordset_ref->[1] <= $MERGE_DIST) {
                # merge it.
                $last_coordset_ref->[1] = $coordset_ref->[1];

                if ($strand eq "+") {
                    $last_coordset_ref->[3] = $coordset_ref->[3];
                } else {
                    $last_coordset_ref->[2] = $coordset_ref->[2];
                }
            }
            else {
                # not merging.
                push (@merged_coords, $coordset_ref);
            }
        }

	foreach my $coordset_ref (@merged_coords) {
            my ($genome_lend, $genome_rend, $trans_lend, $trans_rend) = @$coordset_ref;
            print join("\t",
                       $scaff_name,
                       "genome",
                       "cDNA_match",
                       $genome_lend, $genome_rend,
                       $per_id,
                       $strand,
                       ".",
                       "ID=$align_counter;Target=$read_name $trans_lend $trans_rend") . "\n";
        }
        #print "\n";
        
        
}


sub get_aligned_coords {

	my %entry = @_;

	my $genome_lend = $entry{"pos"};

	my $alignment = $entry{"cigar"};
	my $query_lend = 0;

	my @genome_coords;
	my @query_coords;

	$genome_lend--;

	while ($alignment =~ /(\d+)([A-Z])/g) {

		my $len = $1;
		my $code = $2;
		
		unless ($code =~ /^[MSDNIH]$/) {
			die  "Error, cannot parse cigar code [$code] ";
		}
		
		#print "parsed $len,$code\n";
		
		if ($code eq 'M') { # aligned bases match or mismatch
			
			my $genome_rend = $genome_lend + $len;
			my $query_rend = $query_lend + $len;
			
			push (@genome_coords, [$genome_lend+1, $genome_rend]);
			push (@query_coords, [$query_lend+1, $query_rend]);
			
			# reset coord pointers
			$genome_lend = $genome_rend;
			$query_lend = $query_rend;
		}
		elsif ($code eq 'D' || $code eq 'N') { # insertion in the genome or gap in query (intron, perhaps)
			$genome_lend += $len;
			
		}

		elsif ($code eq 'I'  # gap in genome or insertion in query 
               ||
               $code eq 'S' || $code eq 'H')  # masked region of query
        { 
            $query_lend += $len;

		}
	}

	 ## see if reverse strand alignment - if so, must revcomp the read matching coordinates.
    	if ($entry{"strand"} eq '-') {

        my $read_len = length($entry{"seq"});
        unless ($read_len) {
            die "Error, no read length obtained from entry";
        }

        my @revcomp_coords;
        foreach my $coordset (@query_coords) {
            my ($lend, $rend) = @$coordset;

            my $new_lend = $read_len - $lend + 1;
            my $new_rend = $read_len - $rend + 1;

            push (@revcomp_coords, [$new_lend, $new_rend]);
        }

        @query_coords = @revcomp_coords;

    }

	return(\@genome_coords, \@query_coords);
}
;

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 9, 2019 via email

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 16, 2019 via email

@marchoeppner
Copy link

Wasn't the ID field, as it turns out - I forgot to filter out all SAM mappings that had "*" as a sequence, which produced a bunch of non-sensical GFF entries with negative coordinates. After fixing that, all is well. So just a not-so-helpful warning from PASA ;)

@brianjohnhaas
Copy link
Contributor

brianjohnhaas commented Aug 16, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants