GMAP alignment format question -- minimap2 support #79

nextgenusfs · 2018-05-31T14:36:03Z

Hi @brianjohnhaas, since there seems to be problems with PASA and many versions of GMAP (I'm getting users of funannotate that end up having GMAP version incompatibility errors), I'm wondering if I should just swap out GMAP for minimap2 alignment in PASA. I've done this swap for mapping transcript evidence to the genome in funannotate because it is much much faster and results are nearly identical. To do this I wrote a bam2gff3 parser (in Python), so I think I can pretty much re-use that code but perhaps with some modifications specific for PASA.

Before I wrote this up, I wanted to get your thoughts. So my question is related to the output format of GMAP - specifically is there anything "special" about the GFF3 format that PASA is expecting? Below is the first few lines from a test set. Is the Gap= flag necessary?

##gff-version   3
# Generated by GMAP version 2017-06-20 using call:  gmap.avx2 -D /home/linuxbrew/data/genome6/training -d genome.fasta.gmap -f 3 -n 0 -x 50 -t 3 -B 5 --max-intronlength-middle=3000 --max-intronlength-ends=3000 /home/linuxbrew/data/genome6/training/trinity.fasta.clean
NW_017263654.1	genome.fasta.gmap	cDNA_match	111713	111995	100	+	.	ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283
###
NW_017263654.1	genome.fasta.gmap	cDNA_match	66497	66599	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 1 103;Gap=M103
NW_017263654.1	genome.fasta.gmap	cDNA_match	66651	66731	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 104 184;Gap=M81
NW_017263654.1	genome.fasta.gmap	cDNA_match	66784	67198	100	+	.	ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 185 599;Gap=M415
###
NW_017263654.1	genome.fasta.gmap	cDNA_match	82710	83308	99	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 7 605;Gap=M599
NW_017263654.1	genome.fasta.gmap	cDNA_match	83360	83420	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 606 666;Gap=M61
NW_017263654.1	genome.fasta.gmap	cDNA_match	83470	83985	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 667 1182;Gap=M516
NW_017263654.1	genome.fasta.gmap	cDNA_match	84051	85267	100	+	.	ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 1183 2399;Gap=M1217

And then next question is if I first run minimap2 alignment, I can pass this to PASA using the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 flag and then setting --ALIGNERS blat would then effectively use both the minimap2 alignments and then have PASA run BLAT, is that correct?

Thanks,
Jon

The text was updated successfully, but these errors were encountered:

brianjohnhaas · 2018-05-31T14:57:36Z

Hi Jon, That should all be fine. I can work on integrating minimap2 support for the aligners option. If you already have code to contribute, I'd be happy to integrate it. Also, we (Nathan, I, and a few others) want to write up the PASA2 soon. You're welcome to join us. best, ~b

…

On Thu, May 31, 2018 at 10:36 AM, Jon Palmer ***@***.***> wrote: Hi @brianjohnhaas <https://github.com/brianjohnhaas>, since there seems to be problems with PASA and many versions of GMAP (I'm getting users of funannotate that end up having GMAP version incompatibility errors), I'm wondering if I should just swap out GMAP for minimap2 alignment in PASA. I've done this swap for mapping transcript evidence to the genome in funannotate because it is much much faster and results are nearly identical. To do this I wrote a bam2gff3 parser (in Python), so I think I can pretty much re-use that code but perhaps with some modifications specific for PASA. Before I wrote this up, I wanted to get your thoughts. So my question is related to the output format of GMAP - specifically is there anything "special" about the GFF3 format that PASA is expecting? Below is the first few lines from a test set. Is the Gap= flag necessary? ##gff-version 3 # Generated by GMAP version 2017-06-20 using call: gmap.avx2 -D /home/linuxbrew/data/genome6/training -d genome.fasta.gmap -f 3 -n 0 -x 50 -t 3 -B 5 --max-intronlength-middle=3000 --max-intronlength-ends=3000 /home/linuxbrew/data/genome6/training/trinity.fasta.clean NW_017263654.1 genome.fasta.gmap cDNA_match 111713 111995 100 + . ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283 ### NW_017263654.1 genome.fasta.gmap cDNA_match 66497 66599 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 1 103;Gap=M103 NW_017263654.1 genome.fasta.gmap cDNA_match 66651 66731 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 104 184;Gap=M81 NW_017263654.1 genome.fasta.gmap cDNA_match 66784 67198 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 185 599;Gap=M415 ### NW_017263654.1 genome.fasta.gmap cDNA_match 82710 83308 99 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 7 605;Gap=M599 NW_017263654.1 genome.fasta.gmap cDNA_match 83360 83420 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 606 666;Gap=M61 NW_017263654.1 genome.fasta.gmap cDNA_match 83470 83985 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 667 1182;Gap=M516 NW_017263654.1 genome.fasta.gmap cDNA_match 84051 85267 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 1183 2399;Gap=M1217 And then next question is if I first run minimap2 alignment, I can pass this to PASA using the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 flag and then setting --ALIGNERS blat would then effectively use both the minimap2 alignments and then have PASA run BLAT, is that correct? Thanks, Jon — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHMVX9wz69Jn3n6xcS2xDC4ZNQPqRjjYks5t3__dgaJpZM4UVKXs> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

brianjohnhaas · 2018-05-31T15:00:29Z

also interesting: http://bioinfo.zesoi.fer.hr/index.php/en/blog-en/56-gmap-vs-minimap2 On Thu, May 31, 2018 at 10:57 AM, Brian Haas <bhaas@broadinstitute.org> wrote:

…

Hi Jon, That should all be fine. I can work on integrating minimap2 support for the aligners option. If you already have code to contribute, I'd be happy to integrate it. Also, we (Nathan, I, and a few others) want to write up the PASA2 soon. You're welcome to join us. best, ~b On Thu, May 31, 2018 at 10:36 AM, Jon Palmer ***@***.***> wrote: > Hi @brianjohnhaas <https://github.com/brianjohnhaas>, since there seems > to be problems with PASA and many versions of GMAP (I'm getting users of > funannotate that end up having GMAP version incompatibility errors), I'm > wondering if I should just swap out GMAP for minimap2 alignment in PASA. > I've done this swap for mapping transcript evidence to the genome in > funannotate because it is much much faster and results are nearly > identical. To do this I wrote a bam2gff3 parser (in Python), so I think I > can pretty much re-use that code but perhaps with some modifications > specific for PASA. > > Before I wrote this up, I wanted to get your thoughts. So my question is > related to the output format of GMAP - specifically is there anything > "special" about the GFF3 format that PASA is expecting? Below is the first > few lines from a test set. Is the Gap= flag necessary? > > ##gff-version 3 > # Generated by GMAP version 2017-06-20 using call: gmap.avx2 -D /home/linuxbrew/data/genome6/training -d genome.fasta.gmap -f 3 -n 0 -x 50 -t 3 -B 5 --max-intronlength-middle=3000 --max-intronlength-ends=3000 /home/linuxbrew/data/genome6/training/trinity.fasta.clean > NW_017263654.1 genome.fasta.gmap cDNA_match 111713 111995 100 + . ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283 > ### > NW_017263654.1 genome.fasta.gmap cDNA_match 66497 66599 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 1 103;Gap=M103 > NW_017263654.1 genome.fasta.gmap cDNA_match 66651 66731 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 104 184;Gap=M81 > NW_017263654.1 genome.fasta.gmap cDNA_match 66784 67198 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 185 599;Gap=M415 > ### > NW_017263654.1 genome.fasta.gmap cDNA_match 82710 83308 99 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 7 605;Gap=M599 > NW_017263654.1 genome.fasta.gmap cDNA_match 83360 83420 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 606 666;Gap=M61 > NW_017263654.1 genome.fasta.gmap cDNA_match 83470 83985 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 667 1182;Gap=M516 > NW_017263654.1 genome.fasta.gmap cDNA_match 84051 85267 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 1183 2399;Gap=M1217 > > And then next question is if I first run minimap2 alignment, I can pass > this to PASA using the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 flag and then > setting --ALIGNERS blat would then effectively use both the minimap2 > alignments and then have PASA run BLAT, is that correct? > > Thanks, > Jon > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#79>, or mute the > thread > <https://github.com/notifications/unsubscribe-auth/AHMVX9wz69Jn3n6xcS2xDC4ZNQPqRjjYks5t3__dgaJpZM4UVKXs> > . > -- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

nextgenusfs · 2018-05-31T15:19:38Z

Yeah I saw that blog as well, which prompted me to take a look. The speed increase of minimap2 is amazing. So what I'm currently doing to map transcripts is this:

I'm running minimap2 through a small little shell script so that I can pipe the data to samtools and still use multiple threads (the reason is that it isn't easy in Python to pipe commands, maybe easier in Perl?), looks like this:
sam2bam.sh

#!/bin/bash

#simple wrapper for running aligner program and piping output to samtools view/sort

if [ -z "$3" ]; then
    echo 'Usage: sam2bam.sh "aligner_command" bam_threads bam_output'
    echo '**The double quotes are required around aligner command**'
    exit
fi

#construct the command
cmd="$1 | samtools view -@ $2 -bS - | samtools sort -@ $2 -o $3 -"

#run the command
eval $cmd

I'm then running minimap2 with this function (runSubprocess is just my little wrapper for subprocess/logging in python).

def minimap2Align(transcripts, genome, cpus, intron, output):
    '''
    function to align transcripts to genome using minimap2
    huge speed increase over gmap + blat
    '''
    bamthreads = round(int(cpus) / 2)
    if bamthreads > 4:
        bamthreads = 4
    minimap2_cmd = ['minimap2', '-ax', 'splice', '-t', str(cpus), '--cs', '-u', 'b', '-G', str(intron), genome, transcripts]
    cmd = [os.path.join(parentdir, 'util', 'sam2bam.sh'), " ".join(minimap2_cmd), str(bamthreads), output]
    runSubprocess(cmd, '.', log)

I'm then using the PyBam native python BAM parser to parse the data and convert BAM to GFF3 (as there isn't a GFF3 output for minimap2). Probably there is an equivalent BAM parser in Perl that might fit better with your code, but at any rate the bam2gff3 function looks like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn

def bam2gff3(input, output):
    import pybam
    with open(output, 'w') as gffout:
        gffout.write('##gff-version 3\n')
        for aln in pybam.read(os.path.realpath(input)):
            if aln.sam_flag == 0:
                strand = '+'
            elif aln.sam_flag == 16:
                strand = '-'
            else:
                continue
            cs = None
            nm = None
            tags = aln.sam_tags_string.split('\t')
            for x in tags:
                if x.startswith('cs:'):
                    cs = x.replace('cs:Z:', '')
                if x.startswith('NM:'):
                    nm = int(x.split(':')[-1])
            #print(aln.sam_qname, nm, cs)
            if nm is None or cs is None:
                continue
            matches = 0
            ProperSplice = True
            splitter = []
            exons = [int(aln.sam_pos1)]
            position = int(aln.sam_pos1)
            query = [1]
            querypos = 0
            num_exons = 1
            gaps = 0
            splitter = tokenizeString(cs, [':','*','+', '-', '~'])
            for i,x in enumerate(splitter):
                if x == ':':
                    matches += int(splitter[i+1])
                    position += int(splitter[i+1])
                    querypos += int(splitter[i+1])
                elif x == '-':
                    gaps += 1
                elif x == '+':
                    gaps += 1
                    querypos += len(splitter[i+1])
                elif x == '~':
                    if aln.sam_flag == 0:
                        if splitter[i+1].startswith('gt') and splitter[i+1].endswith('ag'):
                            ProperSplice = True
                        elif splitter[i+1].startswith('at') and splitter[i+1].endswith('ac'):
                            ProperSplice = True
                        else:
                            ProperSplice = False
                    elif aln.sam_flag == 16:
                        if splitter[i+1].startswith('ct') and splitter[i+1].endswith('ac'):
                            ProperSplice = True
                        elif splitter[i+1].startswith('gt') and splitter[i+1].endswith('at'):
                            ProperSplice = True
                        else:
                            ProperSplice = False
                    num_exons += 1
                    exons.append(position)
                    query.append(querypos)
                    query.append(querypos+1)
                    intronLen = int(splitter[i+1][2:-2])
                    position += intronLen
                    exons.append(position)
            #add last Position
            exons.append(position)
            query.append(aln.sam_l_seq)
            #convert exon list into list of exon tuples
            exons = zip(exons[0::2], exons[1::2])
            queries = zip(query[0::2], query[1::2])
            if ProperSplice:
                mismatches = nm - gaps
                pident = 100 * (matches / (matches + mismatches))
                if pident < 80:
                    continue
                for i,exon in enumerate(exons):
                    start = exon[0]
                    end = exon[1]-1
                    if strand == '+':
                        qstart = queries[i][0]
                        qend = queries[i][1]
                    else:
                        qstart = aln.sam_l_seq - queries[i][1] + 1
                        qend = aln.sam_l_seq - queries[i][0] + 1
                    gffout.write('{:}\t{:}\t{:}\t{:}\t{:}\t{:.2f}\t{:}\t{:}\tID={:};Target={:} {:} {:}\n'.format(aln.sam_rname,'genome','cDNA_match',start,end,pident,strand,'.',aln.sam_qname,aln.sam_qname,qstart,qend))

Heng made a "new" flag in minimap2 called the cs flag, which is a little bit easier to parse than the CIGAR string.

UPDATE (6/25/2018): I updated the bam2gff3 function using PyBAM - previous one was not dealing with crick aligned transcripts correctly. Note this only writes alignments to GFF3 if pident is > 80%.

brianjohnhaas · 2018-05-31T15:41:49Z

Thx! I'll look into this soon

…

On Thu, May 31, 2018, 11:19 AM Jon Palmer ***@***.***> wrote: Yeah I saw that blog as well, which prompted me to take a look. The speed increase of minimap2 is amazing. So what I'm currently doing to map transcripts is this: I'm running minimap2 through a small little shell script so that I can pipe the data to samtools and still use multiple threads (the reason is that it isn't easy in Python to pipe commands, maybe easier in Perl?), looks like this: sam2bam.sh #!/bin/bash #simple wrapper for running aligner program and piping output to samtools view/sort if [ -z "$3" ]; then echo 'Usage: sam2bam.sh "aligner_command" bam_threads bam_output' echo '**The double quotes are required around aligner command**' exit fi #construct the command cmd="$1 | samtools view -@ $2 -bS - | samtools sort -@ $2 -o $3 -" #run the command eval $cmd I'm then running minimap2 with this function (runSubprocess is just my little wrapper for subprocess/logging in python). def minimap2Align(transcripts, genome, cpus, intron, output): ''' function to align transcripts to genome using minimap2 huge speed increase over gmap + blat ''' bamthreads = round(int(cpus) / 2) if bamthreads > 4: bamthreads = 4 minimap2_cmd = ['minimap2', '-ax', 'splice', '-t', str(cpus), '--cs', '-u', 'b', '-G', str(intron), genome, transcripts] cmd = [os.path.join(parentdir, 'util', 'sam2bam.sh'), " ".join(minimap2_cmd), str(bamthreads), output] runSubprocess(cmd, '.', log) I'm then using the PyBam <https://github.com/JohnLonginotto/pybam> native python BAM parser to parse the data and convert BAM to GFF3 (as there isn't a GFF3 output for minimap2). Probably there is an equivalent BAM parser in Perl that might fit better with your code, but at any rate the bam2gff3 function looks like this: def bam2gff3(input, output): import pybam with open(output, 'w') as gffout: gffout.write('##gff-version 3\n') for aln in pybam.read(input): cs = None tags = aln.sam_tags_string.split('\t') for x in tags: if x.startswith('cs:'): cs = x.replace('cs:Z:', '') matches = 0 mismatches = 0 ProperSplice = True splitter = [] exons = [int(aln.sam_pos1)] position = int(aln.sam_pos1) num_exons = 1 if cs: splitter = tokenizeString(cs, [':','*','+', '-', '~']) for i,x in enumerate(splitter): if x == ':': matches += int(splitter[i+1]) position += int(splitter[i+1]) elif x == '*' or x == '+': mismatches += (len(splitter[i+1]) / 2) elif x == '-': mismatches += (len(splitter[i+1]) / 2) elif x == '~': if splitter[i+1].startswith('gt') and splitter[i+1].endswith('ag'): ProperSplice = True elif splitter[i+1].startswith('at') and splitter[i+1].endswith('ac'): ProperSplice = True else: ProperSplice = False break num_exons += 1 exons.append(position) intronLen = int(splitter[i+1][2:-2]) position += intronLen exons.append(position) #add last Position exons.append(position) #convert exon list into list of exon tuples exons = zip(exons[0::2], exons[1::2]) if ProperSplice: pident = 100 * (matches / (matches + mismatches)) if pident < 80: continue strand = '.' if aln.sam_flag == 0: strand = '+' elif aln.sam_flag == 16: strand = '-' for i,exon in enumerate(exons): start = exon[0] end = exon[1]-1 gffout.write('{:}\t{:}\t{:}\t{:}\t{:}\t{:.2f}\t{:}\t{:}\tID={:};Target={:}\n'.format(aln.sam_rname, 'genome', 'cDNA_match', start, end, pident, strand, '.',aln.sam_qname,aln.sam_qname)) Heng made a "new" flag in minimap2 called the cs flag <https://github.com/lh3/minimap2#the-cs-optional-tag>, which is a little bit easier to parse than the CIGAR string. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHMVX81R7PBo3RBuR-dp__bVdh4-qj0Dks5t4AoMgaJpZM4UVKXs> .

nextgenusfs · 2018-06-10T13:07:13Z

Hi Brian,
I have this a try and looks like I need to also have a lend an rend (left and right end) in the GFF3 file format:

error parsing match coordinates lend[] rend[]from last field: 
ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b,
line: 
CP022967.1	genome	cDNA_match	702	1436	91.42	.	.	ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b

Just a quick question as to what coords is that referring -- I'm assuming this is the query lend and rend?

brianjohnhaas · 2018-06-10T15:20:31Z

Oh - this is looking for the matching coordinates at the level of the cDNA sequence, which gmap bundles into the last field of the description line. Take a look the gmap.gff3 file that PASA generates and you'll see what I mean. Just compare the last words in each gff3 line. best, ~b

…

On Sun, Jun 10, 2018 at 9:07 AM, Jon Palmer ***@***.***> wrote: Hi Brian, I have this a try and looks like I need to also have a lend an rend (left and right end) in the GFF3 file format: error parsing match coordinates lend[] rend[]from last field: ID=f7f48989-482b-4409-a232 <https://maps.google.com/?q=482b-4409-a232&entry=gmail&source=g>-144592d4318b;Target=f7f48989-482b-4409-a232 <https://maps.google.com/?q=482b-4409-a232&entry=gmail&source=g>-144592d4318b, line: CP022967.1 genome cDNA_match 702 1436 91.42 . . ID=f7f48989-482b-4409-a232 <https://maps.google.com/?q=482b-4409-a232&entry=gmail&source=g>-144592d4318b;Target=f7f48989-482b-4409-a232 <https://maps.google.com/?q=482b-4409-a232&entry=gmail&source=g>-144592d4318b Just a quick question as to what coords is that referring -- I'm assuming this is the query lend and rend? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHMVXwLYw10dcUVdQn_Lma2dOQlYI8fsks5t7RoDgaJpZM4UVKXs> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

nextgenusfs · 2018-06-10T16:24:33Z

Okay, so the cDNA coordinates... any idea how I can get that out of SAM format?

brianjohnhaas · 2018-06-10T16:41:13Z

In the Trinity codebase, I have: trinityrnaseq/util/misc/SAM_toString.pl as an example of how you might go about making a SAM to GFF3 converter. There are probably other better examples to go with, though. If you're into python, you could likely leverage pysam to do something pretty similar. You'd only want to report segments that are based on separating introns, and not short indels though, when making the gff. ~b

…

On Sun, Jun 10, 2018 at 12:24 PM, Jon Palmer ***@***.***> wrote: Okay, so the cDNA coordinates... any idea how I can get that out of SAM format? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHMVX1D758JjvA1xsrOcucTH4ggg8pH1ks5t7UhBgaJpZM4UVKXs> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner · 2019-08-08T09:10:06Z

Hi,
just wondering if there has been any progress on integrating minimap2 with Pasa over the past year? ;) Seems like a worthwhile addition.

brianjohnhaas · 2019-08-08T11:11:39Z

Hi, If you can make a gtf file from it, you can import the alignments in as 'custom' alignments and it should work. I don't have time to work on pasa these days to do more than minor tech support and minimal maintenance. Contributions for enhancements are always welcome! best, ~b

…

On Thu, Aug 8, 2019 at 5:10 AM marchoeppner ***@***.***> wrote: Hi, just wondering if there has been any progress on integrating minimap2 with Pasa over the past year? ;) Seems like a worthwhile addition. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX2XV7A3ZGWIBJNFKSLQDPPG7A5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD327SMY#issuecomment-519436595>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXY3XV4GRGQDT4VMG23QDPPG7ANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner · 2019-08-08T12:33:11Z

Thanks and fair enough! If I get around to it, I'll make a pull request.

nextgenusfs · 2019-08-08T12:44:44Z

I’m using the custom alignments approach which is working well once I got the format figured out.

brianjohnhaas · 2019-08-08T12:48:46Z

awesome. If you have a minimap format converter or some documentation to add for how to do it, that would be great. It's easy enough to tack it on to our wiki and/or add to the code repo. best, ~b

…

On Thu, Aug 8, 2019 at 8:44 AM Jon Palmer ***@***.***> wrote: I’m using the custom alignments approach which is working well once I got the format figured out. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX4ETO7AQIJLHBHE2E3QDQIL3A5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD33PU5Y#issuecomment-519502455>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKX3DQANIM3YU74Q4PJ3QDQIL3ANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

nextgenusfs · 2019-08-08T13:30:29Z

The python method above is what I’m using from ~1 year ago. An equivalent perl converter would better for PASA.

brianjohnhaas · 2019-08-08T13:40:29Z

ah - got it. ok, let's keep this thread around to help others. One of these days, I'll see about doing direct integration, unless someone else contributes it first. Python integration would be fine, though. No one likes perl anymore except me and like three others. ;-)

…

On Thu, Aug 8, 2019 at 9:30 AM Jon Palmer ***@***.***> wrote: The python method above is what I’m using from ~1 year ago. An equivalent perl converter would better for PASA. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX277DMZ4QEBA6SPVTLQDQNXNA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD33TURA#issuecomment-519518788>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKX3CSXFU5PT56PA22MLQDQNXNANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner · 2019-08-08T13:48:23Z

Seems like Minimap2 also supports PAF output, which is basically tab-delimited blast-type formatting https://github.com/lh3/miniasm/blob/master/PAF.md. Just thinking about porting code to Perl that relies on PySam...

nextgenusfs · 2019-08-08T14:03:00Z

Right, paf probably easier. My use case was a general BAM to GFF3 conversion. SAM format would also work. You need to run minimap2 with the —cs flag to get the splicing information.

brianjohnhaas · 2019-08-08T14:05:14Z

I could write a bam-to-gtf converter pretty quick based on the existing code there in pasa. Just send me an example minimap2 bam file so I don't have to experiment myself. ;-)

…

On Thu, Aug 8, 2019 at 9:48 AM marchoeppner ***@***.***> wrote: Seems like Minimap2 also supports PAF output, which is basically tab-delimited blast-type formatting https://github.com/lh3/miniasm/blob/master/PAF.md. Just thinking about porting code to Perl that relies on PySam... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX67AXVTOZ5N4PU7UELQDQP2RA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD33VJ4Q#issuecomment-519525618>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXYJOVMHP4T6SSAWPPLQDQP2RANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

nextgenusfs · 2019-08-08T14:46:32Z

Okay, I just used your sample data from sample_data folder and ran this:

minimap2 -ax splice --cs genome_sample.fasta.gz all_transcripts.fasta | samtools sort -o minimap2.alignments.bam -

And then I used the method above to convert to GFF3 (this is in my funannotate package):

funannotate util bam2gff3 -i minimap2.alignments.bam -o minimap2.alignments.gff3

Attachment has both the bam file and the resulting GFF3 file.
minimap2.zip

brianjohnhaas · 2019-08-08T14:56:57Z

Thanks, Jon! I'll work on this shortly and we can compare notes. best, ~b

…

On Thu, Aug 8, 2019 at 10:46 AM Jon Palmer ***@***.***> wrote: Okay, I just used your sample data from sample_data folder and ran this: minimap2 -ax splice --cs genome_sample.fasta.gz all_transcripts.fasta | samtools sort -o minimap2.alignments.bam - And then I used the method above to convert to GFF3 (this is in my funannotate package): funannotate util bam2gff3 -i minimap2.alignments.bam -o minimap2.alignments.gff3 Attachment has both the bam file and the resulting GFF3 file. minimap2.zip <https://github.com/PASApipeline/PASApipeline/files/3482302/minimap2.zip> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX4EPDKD2755V3SXXZTQDQWUTA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD333LDA#issuecomment-519550348>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKX2U5TOU5KG343DGDSTQDQWUTANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

nextgenusfs · 2019-08-08T15:10:02Z

The cs flag description is on the manual page of minimap2: https://lh3.github.io/minimap2/minimap2.html. I think probably could also get this information from the CIGAR string, but for me the cs was easier for me to understand. But you can also output CIGAR string in the BAM file with minimap2 by passing the `-c' flag.

-c | Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag.
--cs[=STR]  | Output the cs tag. STR can be either short or long. If no STR is given, short is assumed.

brianjohnhaas · 2019-08-08T15:18:34Z

Thanks, Jon! I'll be digging into it in a few hours. best, ~b

…

On Thu, Aug 8, 2019 at 11:10 AM Jon Palmer ***@***.***> wrote: The cs flag description is on the manual page of minimap2: https://lh3.github.io/minimap2/minimap2.html. I think probably could also get this information from the CIGAR string, but for me the cs was easier for me to understand. But you can also output CIGAR string in the BAM file with minimap2 by passing the `-c' flag. -c | Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag. --cs[=STR] | Output the cs tag. STR can be either short or long. If no STR is given, short is assumed. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX3BK5ZR5OZUXK6PNGLQDQZMVA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD335Y3Y#issuecomment-519560303>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXYF6UD4ERSMBH7F2ITQDQZMVANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

brianjohnhaas · 2019-08-09T00:31:24Z

Hi guys, I added this to the 'devel' branch of the pasapipeline repo: https://github.com/PASApipeline/PASApipeline/blob/devel/misc_utilities/SAM_to_gtf.pl You can run it on any bam file and it should make a corresponding gtf file. It shouldn't be minimap2 specific and shouldn't need custom tags. Jon - the coordinates look different from what you were generating. I checked one of the differences and mine matched up with blat or gmap. I didn't do any rigorous testing, though... Note, all the coordinates are determined based on the cigar string using one of my home-grown libraries that ships with pasa. Please let me know if there are any issues with it. Using the bam file that Jon provided was perfect for this. best, ~b

…

On Thu, Aug 8, 2019 at 11:17 AM Brian Haas ***@***.***> wrote: Thanks, Jon! I'll be digging into it in a few hours. best, ~b On Thu, Aug 8, 2019 at 11:10 AM Jon Palmer ***@***.***> wrote: > The cs flag description is on the manual page of minimap2: > https://lh3.github.io/minimap2/minimap2.html. I think probably could > also get this information from the CIGAR string, but for me the cs was > easier for me to understand. But you can also output CIGAR string in the > BAM file with minimap2 by passing the `-c' flag. > > -c | Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag. > > --cs[=STR] | Output the cs tag. STR can be either short or long. If no STR is given, short is assumed. > > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#79?email_source=notifications&email_token=ABZRKX3BK5ZR5OZUXK6PNGLQDQZMVA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD335Y3Y#issuecomment-519560303>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ABZRKXYF6UD4ERSMBH7F2ITQDQZMVANCNFSM4FCUUXWA> > . > -- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner · 2019-08-09T10:54:57Z

I think this will then do the same? Can't use the dev branch for my project, so needed to encapsulate the same function in a standalone script (basically just stole Brian's script and one function) - this requires Samtools to be in $PATH, which is a prereq for PASA anyway:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# This is a rip-off from Brian Haas's Sam_to_gtf.pl converter bundled with PASA
# Needed this as a standalone version.

my %PATH_COUNTER;

my $bam = $ARGV[0] ;

open BAM,"samtools view $bam |";

        while(<BAM>){
        next if(/^(\@)/);  ## skipping the header lines (if you used -h in the samools command)
        s/\n//;  s/\r//;  ## removing new line
        my @sam = split(/\t+/);  ## splitting SAM line into array

	my %entry = ( "qname" => $sam[0], "flag" => $sam[1], "rname" => $sam[2], "pos" => $sam[3], "mapq" => $sam[4], 
		"cigar" => $sam[5], "rnext" => $sam[6], "pnext" => $sam[7], "tlen" => $sam[8], "seq" => $sam[9], "qual" => $sam[10] );

	my $num_mismatches = 0;

	if ( join("\t",@sam) =~ /NM:i:(\d+)/) {
		$num_mismatches = $1;
        }

	
	my $strand = ($entry{"flag"} & 16) ? "-" : "+" ;
	$entry{"strand"} = $strand ;
	
	my $read_name = $entry{"qname"} ;
	my $scaff_name = $entry{"rname"};
	
	my ($genome_coords_aref, $query_coords_aref) = get_aligned_coords(%entry);

	my $align_len = 0;

	foreach my $coordset (@$genome_coords_aref) {
                $align_len += abs($coordset->[1] - $coordset->[0]) + 1;
        }
	# Check this...
	next if ($align_len eq 0);

	my $per_id = sprintf("%.1f", 100 - $num_mismatches/$align_len * 100); 

	my $align_counter = "$read_name.p" . ++$PATH_COUNTER{$read_name};

	my @genome_n_trans_coords;
        
        while (@$genome_coords_aref) {
            my $genome_coordset_aref = shift @$genome_coords_aref;
            my $trans_coordset_aref = shift @$query_coords_aref;

            my ($genome_lend, $genome_rend) = @$genome_coordset_aref;
            my ($trans_lend, $trans_rend) = sort {$a<=>$b} @$trans_coordset_aref;

            push (@genome_n_trans_coords, [ $genome_lend, $genome_rend, $trans_lend, $trans_rend ] );

        }

	my @merged_coords;
        push (@merged_coords, shift @genome_n_trans_coords);

        my $MERGE_DIST = 10;
        while (@genome_n_trans_coords) {
            my $coordset_ref = shift @genome_n_trans_coords;
            my $last_coordset_ref = $merged_coords[$#merged_coords];
            
            if ($coordset_ref->[0] - $last_coordset_ref->[1] <= $MERGE_DIST) {
                # merge it.
                $last_coordset_ref->[1] = $coordset_ref->[1];

                if ($strand eq "+") {
                    $last_coordset_ref->[3] = $coordset_ref->[3];
                } else {
                    $last_coordset_ref->[2] = $coordset_ref->[2];
                }
            }
            else {
                # not merging.
                push (@merged_coords, $coordset_ref);
            }
        }

	foreach my $coordset_ref (@merged_coords) {
            my ($genome_lend, $genome_rend, $trans_lend, $trans_rend) = @$coordset_ref;
            print join("\t",
                       $scaff_name,
                       "genome",
                       "cDNA_match",
                       $genome_lend, $genome_rend,
                       $per_id,
                       $strand,
                       ".",
                       "ID=$align_counter;Target=$read_name $trans_lend $trans_rend") . "\n";
        }
        #print "\n";
        
        
}


sub get_aligned_coords {

	my %entry = @_;

	my $genome_lend = $entry{"pos"};

	my $alignment = $entry{"cigar"};
	my $query_lend = 0;

	my @genome_coords;
	my @query_coords;

	$genome_lend--;

	while ($alignment =~ /(\d+)([A-Z])/g) {

		my $len = $1;
		my $code = $2;
		
		unless ($code =~ /^[MSDNIH]$/) {
			die  "Error, cannot parse cigar code [$code] ";
		}
		
		#print "parsed $len,$code\n";
		
		if ($code eq 'M') { # aligned bases match or mismatch
			
			my $genome_rend = $genome_lend + $len;
			my $query_rend = $query_lend + $len;
			
			push (@genome_coords, [$genome_lend+1, $genome_rend]);
			push (@query_coords, [$query_lend+1, $query_rend]);
			
			# reset coord pointers
			$genome_lend = $genome_rend;
			$query_lend = $query_rend;
		}
		elsif ($code eq 'D' || $code eq 'N') { # insertion in the genome or gap in query (intron, perhaps)
			$genome_lend += $len;
			
		}

		elsif ($code eq 'I'  # gap in genome or insertion in query 
               ||
               $code eq 'S' || $code eq 'H')  # masked region of query
        { 
            $query_lend += $len;

		}
	}

	 ## see if reverse strand alignment - if so, must revcomp the read matching coordinates.
    	if ($entry{"strand"} eq '-') {

        my $read_len = length($entry{"seq"});
        unless ($read_len) {
            die "Error, no read length obtained from entry";
        }

        my @revcomp_coords;
        foreach my $coordset (@query_coords) {
            my ($lend, $rend) = @$coordset;

            my $new_lend = $read_len - $lend + 1;
            my $new_rend = $read_len - $rend + 1;

            push (@revcomp_coords, [$new_lend, $new_rend]);
        }

        @query_coords = @revcomp_coords;

    }

	return(\@genome_coords, \@query_coords);
}
;

brianjohnhaas · 2019-08-09T11:14:36Z

No prob. Just let me know if you see any bugs. I didn't have time to check it carefully.

…

-via googleFi

On Fri, Aug 9, 2019, 6:54 AM marchoeppner ***@***.***> wrote: I think this will then do the same? Can't use the dev branch for my project, so needed to encapsulate the same function in a standalone script (basically just stole Brian's script and one function) - this requires Samtools to be in $PATH, which is a prereq for PASA anyway: #!/bin/env perl use strict; use warnings; my %PATH_COUNTER; my $bam = $ARGV[0] ; open BAM,"samtools view $bam |"; while(<BAM>){ next if(/^(\@)/); ## skipping the header lines (if you used -h in the samools command) s/\n//; s/\r//; ## removing new line my @sam = split(/\t+/); ## splitting SAM line into array my %entry = ( "qname" => $sam[0], "flag" => $sam[1], "rname" => $sam[2], "pos" => $sam[3], "mapq" => $sam[4], "cigar" => $sam[5], "rnext" => $sam[6], "pnext" => $sam[7], "tlen" => $sam[8], "seq" => $sam[9], "qual" => $sam[10] ); my $num_mismatches = 0; if ( ***@***.***) =~ /NM:i:(\d+)/) { $num_mismatches = $1; } my $strand = ($entry{"flag"} == 0) ? "+" : "-" ; my $read_name = $entry{"qname"} ; my $scaff_name = $entry{"rname"}; my ($genome_coords_aref, $query_coords_aref) = get_aligned_coords(%entry); my $align_len = 0; foreach my $coordset (@$genome_coords_aref) { $align_len += abs($coordset->[1] - $coordset->[0]) + 1; } my $per_id = sprintf("%.1f", 100 - $num_mismatches/$align_len * 100); my $align_counter = "$read_name.p" . ++$PATH_COUNTER{$read_name}; my @genome_n_trans_coords; while (@$genome_coords_aref) { my $genome_coordset_aref = shift @$genome_coords_aref; my $trans_coordset_aref = shift @$query_coords_aref; my ($genome_lend, $genome_rend) = @$genome_coordset_aref; my ($trans_lend, $trans_rend) = sort {$a<=>$b} @$trans_coordset_aref; push ***@***.***_n_trans_coords, [ $genome_lend, $genome_rend, $trans_lend, $trans_rend ] ); } my @merged_coords; push ***@***.***_coords, shift @genome_n_trans_coords); my $MERGE_DIST = 10; while ***@***.***_n_trans_coords) { my $coordset_ref = shift @genome_n_trans_coords; my $last_coordset_ref = $merged_coords[$#merged_coords]; if ($coordset_ref->[0] - $last_coordset_ref->[1] <= $MERGE_DIST) { # merge it. $last_coordset_ref->[1] = $coordset_ref->[1]; if ($strand eq "+") { $last_coordset_ref->[3] = $coordset_ref->[3]; } else { $last_coordset_ref->[2] = $coordset_ref->[2]; } } else { # not merging. push ***@***.***_coords, $coordset_ref); } } foreach my $coordset_ref ***@***.***_coords) { my ($genome_lend, $genome_rend, $trans_lend, $trans_rend) = @$coordset_ref; print join("\t", $scaff_name, "genome", "cDNA_match", $genome_lend, $genome_rend, $per_id, $strand, ".", "ID=$align_counter;Target=$read_name $trans_lend $trans_rend") . "\n"; } print "\n"; } sub get_aligned_coords { my %entry = @_; my $genome_lend = $entry{"pos"}; my $alignment = $entry{"cigar"}; my $query_lend = 0; my @genome_coords; my @query_coords; $genome_lend--; while ($alignment =~ /(\d+)([A-Z])/g) { my $len = $1; my $code = $2; unless ($code =~ /^[MSDNIH]$/) { exit 1; "Error, cannot parse cigar code [$code] "; } # print "parsed $len,$code\n"; if ($code eq 'M') { # aligned bases match or mismatch my $genome_rend = $genome_lend + $len; my $query_rend = $query_lend + $len; push ***@***.***_coords, [$genome_lend+1, $genome_rend]); push ***@***.***_coords, [$query_lend+1, $query_rend]); # reset coord pointers $genome_lend = $genome_rend; $query_lend = $query_rend; } elsif ($code eq 'D' || $code eq 'N') { # insertion in the genome or gap in query (intron, perhaps) $genome_lend += $len; } elsif ($code eq 'I' # gap in genome or insertion in query || $code eq 'S' || $code eq 'H') # masked region of query { $query_lend += $len; } } ***@***.***_coords, ***@***.***_coords); } — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX7XTYWITAZTUOY2MDTQDVEIFA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD36KZUY#issuecomment-519875795>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXZAOVETNBEG5JPPBPLQDVEIFANCNFSM4FCUUXWA> .

brianjohnhaas · 2019-08-16T12:26:20Z

hmm.... It should be using the Target=(.....) to identify the transcript, and the ID=(.....) as a unique identifier for the alignment. ie., check out the sample_data/custom_alignments.gff3 Does your revised code make changes to the output from what I had provided there?

…

On Fri, Aug 16, 2019 at 3:18 AM marchoeppner ***@***.***> wrote: Quick question - in the log file I am getting these errors: ERROR, transcript T01B4.7.p2 found with alignment but not recognized in the transcript database So does Pasa parse the ID field of the GFF entry to match that to the original transcript name in the SQL database? If so, the running appendix ".p{num}" for multi-mapped transcripts seems to break this...? My full command looks as follows: $PASAHOME/Launch_PASA_pipeline.pl -c pasa_DB.config -C -R -t transcripts.fa.clean -I 10000 -g genome.rm.fa --IMPORT_CUSTOM_ALIGNMENTS_GFF3 minimap.transcripts.gff --CPU 8 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX76JIL2HXBVM4I5753QEZICVA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4N4BVI#issuecomment-521912533>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXZJMB6XVP55KLRBMYLQEZICVANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner · 2019-08-16T12:27:49Z

Wasn't the ID field, as it turns out - I forgot to filter out all SAM mappings that had "*" as a sequence, which produced a bunch of non-sensical GFF entries with negative coordinates. After fixing that, all is well. So just a not-so-helpful warning from PASA ;)

brianjohnhaas · 2019-08-16T12:34:27Z

oh - my sam to gtf converter should have filtered out those entries. I remember encountering this too. yes, the error message wasn't so helpful there. ;-)

…

On Fri, Aug 16, 2019 at 8:27 AM marchoeppner ***@***.***> wrote: Wasn't the ID field, as it turns out - I forgot to filter out all SAM mappings that had "*" as a sequence, which produced a bunch of non-sensical GFF entries with negative coordinates. After fixing that, all is well. So just a not-so-helpful warning from PASA ;) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#79?email_source=notifications&email_token=ABZRKX6BTQKVB6SI4ZTJANLQE2MMNA5CNFSM4FCUUXWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4OPPHY#issuecomment-521992095>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABZRKXY2KBDUTOGPLIXEITDQE2MMNANCNFSM4FCUUXWA> .

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>

marchoeppner mentioned this issue Aug 29, 2019

PASA error during transdecoder stage #116

Closed

brianjohnhaas closed this as completed Nov 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GMAP alignment format question -- minimap2 support #79

GMAP alignment format question -- minimap2 support #79

nextgenusfs commented May 31, 2018

brianjohnhaas commented May 31, 2018 via email

brianjohnhaas commented May 31, 2018 via email

nextgenusfs commented May 31, 2018 •

edited

brianjohnhaas commented May 31, 2018 via email

nextgenusfs commented Jun 10, 2018

brianjohnhaas commented Jun 10, 2018 via email

nextgenusfs commented Jun 10, 2018

brianjohnhaas commented Jun 10, 2018 via email

marchoeppner commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

marchoeppner commented Aug 8, 2019

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

marchoeppner commented Aug 8, 2019

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

brianjohnhaas commented Aug 9, 2019 via email

marchoeppner commented Aug 9, 2019 •

edited

brianjohnhaas commented Aug 9, 2019 via email

brianjohnhaas commented Aug 16, 2019 via email

marchoeppner commented Aug 16, 2019

brianjohnhaas commented Aug 16, 2019 via email

GMAP alignment format question -- minimap2 support #79

GMAP alignment format question -- minimap2 support #79

Comments

nextgenusfs commented May 31, 2018

brianjohnhaas commented May 31, 2018 via email

brianjohnhaas commented May 31, 2018 via email

nextgenusfs commented May 31, 2018 • edited

brianjohnhaas commented May 31, 2018 via email

nextgenusfs commented Jun 10, 2018

brianjohnhaas commented Jun 10, 2018 via email

nextgenusfs commented Jun 10, 2018

brianjohnhaas commented Jun 10, 2018 via email

marchoeppner commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

marchoeppner commented Aug 8, 2019

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

marchoeppner commented Aug 8, 2019

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

nextgenusfs commented Aug 8, 2019

brianjohnhaas commented Aug 8, 2019 via email

brianjohnhaas commented Aug 9, 2019 via email

marchoeppner commented Aug 9, 2019 • edited

brianjohnhaas commented Aug 9, 2019 via email

brianjohnhaas commented Aug 16, 2019 via email

marchoeppner commented Aug 16, 2019

brianjohnhaas commented Aug 16, 2019 via email

nextgenusfs commented May 31, 2018 •

edited

marchoeppner commented Aug 9, 2019 •

edited