GMAP alignment format question -- minimap2 support #79
Comments
Hi Jon,
That should all be fine. I can work on integrating minimap2 support for the
aligners option. If you already have code to contribute, I'd be happy to
integrate it.
Also, we (Nathan, I, and a few others) want to write up the PASA2 soon.
You're welcome to join us.
best,
~b
…On Thu, May 31, 2018 at 10:36 AM, Jon Palmer ***@***.***> wrote:
Hi @brianjohnhaas <https://github.com/brianjohnhaas>, since there seem
to be problems with PASA and many versions of GMAP (users of funannotate
keep running into GMAP version incompatibility errors), I'm
wondering if I should just swap out GMAP for minimap2 alignment in PASA.
I've done this swap for mapping transcript evidence to the genome in
funannotate because it is much, much faster and the results are nearly
identical. To do this I wrote a bam2gff3 parser (in Python), so I think I
can pretty much re-use that code, perhaps with some modifications
specific to PASA.
Before I write this up, I wanted to get your thoughts. My question is
about the output format of GMAP: is there anything
"special" about the GFF3 format that PASA is expecting? Below are the first
few lines from a test set. Is the Gap= attribute necessary?
##gff-version 3
# Generated by GMAP version 2017-06-20 using call: gmap.avx2 -D /home/linuxbrew/data/genome6/training -d genome.fasta.gmap -f 3 -n 0 -x 50 -t 3 -B 5 --max-intronlength-middle=3000 --max-intronlength-ends=3000 /home/linuxbrew/data/genome6/training/trinity.fasta.clean
NW_017263654.1 genome.fasta.gmap cDNA_match 111713 111995 100 + . ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283
###
NW_017263654.1 genome.fasta.gmap cDNA_match 66497 66599 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 1 103;Gap=M103
NW_017263654.1 genome.fasta.gmap cDNA_match 66651 66731 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 104 184;Gap=M81
NW_017263654.1 genome.fasta.gmap cDNA_match 66784 67198 100 + . ID=Trinity_GG_2_c0_g1_i1.path1;Name=Trinity_GG_2_c0_g1_i1;Target=Trinity_GG_2_c0_g1_i1 185 599;Gap=M415
###
NW_017263654.1 genome.fasta.gmap cDNA_match 82710 83308 99 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 7 605;Gap=M599
NW_017263654.1 genome.fasta.gmap cDNA_match 83360 83420 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 606 666;Gap=M61
NW_017263654.1 genome.fasta.gmap cDNA_match 83470 83985 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 667 1182;Gap=M516
NW_017263654.1 genome.fasta.gmap cDNA_match 84051 85267 100 + . ID=Trinity_GG_4_c0_g1_i1.path1;Name=Trinity_GG_4_c0_g1_i1;Target=Trinity_GG_4_c0_g1_i1 1183 2399;Gap=M1217
My next question: if I first run the minimap2 alignment, pass the result
to PASA using the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 flag, and set
--ALIGNERS blat, would PASA then effectively use both the minimap2
alignments and its own BLAT run? Is that correct?
Thanks,
Jon
--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>
|
also interesting:
http://bioinfo.zesoi.fer.hr/index.php/en/blog-en/56-gmap-vs-minimap2
|
Heng made a "new" tag in minimap2 called the cs tag, which is a little bit easier to parse than the CIGAR string. UPDATE (6/25/2018): I updated the bam2gff3 function that uses PyBam; the previous one was not dealing with Crick-strand (reverse-strand) transcripts correctly. Note this only writes alignments to GFF3 if pident is > 80%. |
Thx! I'll look into this soon
…On Thu, May 31, 2018, 11:19 AM Jon Palmer ***@***.***> wrote:
Yeah I saw that blog as well, which prompted me to take a look. The speed
increase of minimap2 is amazing. So what I'm currently doing to map
transcripts is this:
I'm running minimap2 through a small little shell script so that I can
pipe the data to samtools and still use multiple threads (the reason is
that it isn't easy in Python to pipe commands, maybe easier in Perl?),
looks like this:
sam2bam.sh
    #!/bin/bash
    # simple wrapper for running an aligner program and piping its output to samtools view/sort
    if [ -z "$3" ]; then
        echo 'Usage: sam2bam.sh "aligner_command" bam_threads bam_output'
        echo '**The double quotes are required around aligner command**'
        exit 1
    fi
    # construct the command
    cmd="$1 | samtools view -@ $2 -bS - | samtools sort -@ $2 -o $3 -"
    # run the command
    eval "$cmd"
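For comparison, the same kind of pipe can be built in Python with `subprocess.Popen`, without a shell wrapper. This is only a minimal sketch: `run_pipeline` is a hypothetical helper name (it is not part of funannotate or PASA), and the minimap2/samtools command shown in the trailing comment uses placeholder file names.

```python
import subprocess


def run_pipeline(commands, output_path=None):
    """Chain commands so each one's stdout feeds the next one's stdin.

    `commands` is a list of argument lists. If `output_path` is given, the
    last command's stdout is written to that file. Returns the exit codes.
    """
    procs = []
    prev_stdout = None
    out_handle = open(output_path, "wb") if output_path else None
    try:
        for i, cmd in enumerate(commands):
            last = i == len(commands) - 1
            if not last:
                stdout = subprocess.PIPE
            else:
                stdout = out_handle  # None means "inherit the parent's stdout"
            proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=stdout)
            if prev_stdout is not None:
                # Close our copy so the upstream process gets SIGPIPE
                # if the downstream process exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        return [p.wait() for p in procs]
    finally:
        if out_handle:
            out_handle.close()


# Hypothetical usage mirroring sam2bam.sh (placeholder file names):
# run_pipeline([
#     ["minimap2", "-ax", "splice", "--cs", "-t", "4", "genome.fasta", "transcripts.fasta"],
#     ["samtools", "sort", "-@", "2", "-o", "alignments.bam", "-"],
# ])
```

Passing argument lists instead of one shell string avoids the quoting requirement of the wrapper script, at the cost of a little more code.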
I'm then running minimap2 with this function (runSubprocess is just my
little wrapper for subprocess/logging in python).
    def minimap2Align(transcripts, genome, cpus, intron, output):
        '''
        function to align transcripts to genome using minimap2
        huge speed increase over gmap + blat
        '''
        # needs "import os"; parentdir, log and runSubprocess are defined elsewhere (not shown)
        # samtools gets half the CPUs, capped at 4
        bamthreads = round(int(cpus) / 2)
        if bamthreads > 4:
            bamthreads = 4
        minimap2_cmd = ['minimap2', '-ax', 'splice', '-t', str(cpus), '--cs', '-u', 'b', '-G', str(intron), genome, transcripts]
        cmd = [os.path.join(parentdir, 'util', 'sam2bam.sh'), " ".join(minimap2_cmd), str(bamthreads), output]
        runSubprocess(cmd, '.', log)
I'm then using the PyBam <https://github.com/JohnLonginotto/pybam> native
python BAM parser to parse the data and convert BAM to GFF3 (as there isn't
a GFF3 output for minimap2). Probably there is an equivalent BAM parser in
Perl that might fit better with your code, but at any rate the bam2gff3
function looks like this:
    def bam2gff3(input, output):
        # tokenizeString is a helper defined elsewhere (not shown); it splits
        # the cs string into operators and their values
        import pybam
        with open(output, 'w') as gffout:
            gffout.write('##gff-version 3\n')
            for aln in pybam.read(input):
                # pull the cs tag out of the optional SAM tags
                cs = None
                tags = aln.sam_tags_string.split('\t')
                for x in tags:
                    if x.startswith('cs:'):
                        cs = x.replace('cs:Z:', '')
                matches = 0
                mismatches = 0
                ProperSplice = True
                splitter = []
                exons = [int(aln.sam_pos1)]
                position = int(aln.sam_pos1)
                num_exons = 1
                if cs:
                    splitter = tokenizeString(cs, [':','*','+', '-', '~'])
                    for i,x in enumerate(splitter):
                        if x == ':':
                            matches += int(splitter[i+1])
                            position += int(splitter[i+1])
                        elif x == '*' or x == '+':
                            mismatches += (len(splitter[i+1]) / 2)
                        elif x == '-':
                            mismatches += (len(splitter[i+1]) / 2)
                        elif x == '~':
                            # only accept GT-AG or AT-AC introns
                            if splitter[i+1].startswith('gt') and splitter[i+1].endswith('ag'):
                                ProperSplice = True
                            elif splitter[i+1].startswith('at') and splitter[i+1].endswith('ac'):
                                ProperSplice = True
                            else:
                                ProperSplice = False
                                break
                            num_exons += 1
                            exons.append(position)
                            intronLen = int(splitter[i+1][2:-2])
                            position += intronLen
                            exons.append(position)
                    #add last Position
                    exons.append(position)
                    #convert exon list into list of exon tuples
                    exons = zip(exons[0::2], exons[1::2])
                    if ProperSplice:
                        pident = 100 * (matches / (matches + mismatches))
                        if pident < 80:
                            continue
                        strand = '.'
                        if aln.sam_flag == 0:
                            strand = '+'
                        elif aln.sam_flag == 16:
                            strand = '-'
                        for i,exon in enumerate(exons):
                            start = exon[0]
                            end = exon[1]-1
                            gffout.write('{:}\t{:}\t{:}\t{:}\t{:}\t{:.2f}\t{:}\t{:}\tID={:};Target={:}\n'.format(aln.sam_rname, 'genome', 'cDNA_match', start, end, pident, strand, '.',aln.sam_qname,aln.sam_qname))
Heng made a "new" flag in minimap2 called the cs flag
<https://github.com/lh3/minimap2#the-cs-optional-tag>, which is a little
bit easier to parse than the CIGAR string.
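To make the cs-tag idea concrete, here is a small standalone sketch that walks a short-form or long-form cs string and returns exon intervals on the reference. It is an illustration under stated assumptions, not the funannotate code: `cs_to_exons` is a hypothetical name, indel bases are counted as mismatches by choice, and the reference position is advanced for substitutions and deletions as well as matches.

```python
import re

# One token per cs operation: match run, substitution, long-form match /
# insertion / deletion, or intron.
CS_OP = re.compile(r"(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+|~[a-z]{2}[0-9]+[a-z]{2})")


def cs_to_exons(cs, ref_start):
    """Return (exons, matches, mismatches) for a minimap2 cs string.

    `ref_start` is the 1-based leftmost reference position (SAM POS).
    Exons are 1-based inclusive (start, end) reference intervals.
    """
    exons = []
    exon_start = ref_start
    pos = ref_start  # next reference base to be consumed
    matches = 0
    mismatches = 0
    for op in CS_OP.findall(cs):
        kind, body = op[0], op[1:]
        if kind == ":":      # run of identical bases (short form)
            matches += int(body)
            pos += int(body)
        elif kind == "=":    # run of identical bases (long form)
            matches += len(body)
            pos += len(body)
        elif kind == "*":    # substitution consumes one reference base
            mismatches += 1
            pos += 1
        elif kind == "-":    # deletion: bases present in the reference only
            mismatches += len(body)
            pos += len(body)
        elif kind == "+":    # insertion: bases present in the query only
            mismatches += len(body)
        elif kind == "~":    # intron: close the exon, skip the intron length
            exons.append((exon_start, pos - 1))
            pos += int(body[2:-2])
            exon_start = pos
    exons.append((exon_start, pos - 1))
    return exons, matches, mismatches
```

For a three-exon alignment like the Trinity_GG_2 example at the top of the thread, `cs_to_exons(":103~gt51ag:81~gt52ag:415", 66497)` gives the exons `(66497, 66599)`, `(66651, 66731)` and `(66784, 67198)`.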
|
Oh - this is looking for the matching coordinates at the level of the cDNA
sequence, which gmap bundles into the last field of the description line.
Take a look at the gmap.gff3 file that PASA generates and you'll see what I
mean. Just compare the last words in each gff3 line.
best,
~b
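As an illustration of the format described here, the sketch below checks whether the last GFF3 column carries `Target=<transcript> <lend> <rend>`. It is not PASA's parser; `parse_target` and its regular expression are assumptions made for this example, following the GFF3 convention that Target holds an ID, a start, an end and an optional strand.

```python
import re

# Target=<id> <start> <end> [strand], delimited by ';' or the ends of the column.
TARGET_RE = re.compile(r"(?:^|;)Target=(\S+) (\d+) (\d+)(?: [+-])?(?:;|$)")


def parse_target(attributes):
    """Return (transcript, lend, rend) from a GFF3 attribute column.

    Returns None when the Target attribute has no coordinates, which is the
    situation behind the "lend[] rend[]" error quoted in this thread.
    """
    m = TARGET_RE.search(attributes)
    if m is None:
        return None
    a, b = int(m.group(2)), int(m.group(3))
    return m.group(1), min(a, b), max(a, b)
```

With the GMAP line from the top of the thread, `parse_target("ID=Trinity_GG_1_c0_g1_i1.path1;Name=Trinity_GG_1_c0_g1_i1;Target=Trinity_GG_1_c0_g1_i1 4 286;Gap=M283")` returns `("Trinity_GG_1_c0_g1_i1", 4, 286)`, while a Target with only a name returns `None`.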
…On Sun, Jun 10, 2018 at 9:07 AM, Jon Palmer ***@***.***> wrote:
Hi Brian,
I gave this a try, and it looks like I also need to have a lend and rend (left
and right end) in the GFF3 file:
error parsing match coordinates lend[] rend[]from last field:
ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b,
line:
CP022967.1 genome cDNA_match 702 1436 91.42 . . ID=f7f48989-482b-4409-a232-144592d4318b;Target=f7f48989-482b-4409-a232-144592d4318b
Just a quick question as to which coords that refers to -- I'm assuming
this is the query lend and rend?
|
Okay, so the cDNA coordinates... any idea how I can get that out of SAM format? |
In the Trinity codebase, I have:
trinityrnaseq/util/misc/SAM_toString.pl
as an example of how you might go about making a SAM to GFF3 converter.
There are probably other better examples to go with, though.
If you're into python, you could likely leverage pysam to do something
pretty similar. You'd only want to report segments that are based on
separating introns, and not short indels though, when making the gff.
~b
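The advice above (split on introns, not on short indels) can be sketched with plain CIGAR parsing, so no BAM library is needed for the illustration. `cigar_to_segments` is a hypothetical name; query coordinates are counted along the read as stored in the SAM record, so for reverse-strand alignments they would still need to be flipped relative to the original transcript.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_segments(pos, cigar):
    """Split an alignment into exon segments at introns (N) only.

    `pos` is the 1-based SAM POS. Returns a list of
    (genome_lend, genome_rend, query_lend, query_rend) tuples, 1-based
    inclusive. Short insertions and deletions stay inside a segment.
    """
    segments = []
    g = pos      # next genome base
    q = 1        # next query base (clipped bases are counted)
    seg = None   # currently open segment
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":          # aligned bases extend the open segment
            if seg is None:
                seg = [g, g + n - 1, q, q + n - 1]
            else:
                seg[1] = g + n - 1
                seg[3] = q + n - 1
            g += n
            q += n
        elif op == "D":          # short deletion: genome advances only
            g += n
        elif op in "ISH":        # insertion or clipping: query advances only
            q += n
        elif op == "N":          # intron: close the segment, skip the intron
            if seg is not None:
                segments.append(tuple(seg))
                seg = None
            g += n
    if seg is not None:
        segments.append(tuple(seg))
    return segments
```

For example, `cigar_to_segments(111713, "3S283M")` yields `[(111713, 111995, 4, 286)]`, which matches the genome and Target coordinates of the first GMAP line quoted at the top of the thread.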
|
Hi,
If you can make a gtf file from it, you can import the alignments in as
'custom' alignments and it should work.
I don't have time to work on pasa these days to do more than minor tech
support and minimal maintenance. Contributions for enhancements are always
welcome!
best,
~b
…On Thu, Aug 8, 2019 at 5:10 AM marchoeppner ***@***.***> wrote:
Hi,
just wondering if there has been any progress on integrating minimap2 with
Pasa over the past year? ;) Seems like a worthwhile addition.
|
Thanks and fair enough! If I get around to it, I'll make a pull request. |
I’m using the custom alignments approach which is working well once I got the format figured out. |
awesome. If you have a minimap format converter or some documentation to
add for how to do it, that would be great. It's easy enough to tack it on
to our wiki and/or add to the code repo.
best,
~b
|
The Python method above is what I've been using since ~1 year ago. An equivalent Perl converter would be better for PASA. |
ah - got it. ok, let's keep this thread around to help others. One of
these days, I'll see about doing direct integration, unless someone else
contributes it first.
Python integration would be fine, though. No one likes perl anymore except
me and like three others. ;-)
|
Seems like Minimap2 also supports PAF output, which is basically tab-delimited blast-type formatting https://github.com/lh3/miniasm/blob/master/PAF.md. Just thinking about porting code to Perl that relies on PySam... |
Right, PAF is probably easier. My use case was a general BAM to GFF3 conversion. SAM format would also work. You need to run minimap2 with the --cs flag to get the splicing information. |
I could write a bam-to-gtf converter pretty quick based on the existing
code there in pasa. Just send me an example minimap2 bam file so I don't
have to experiment myself. ;-)
|
Thanks, Jon! I'll work on this shortly and we can compare notes.
best,
~b
…On Thu, Aug 8, 2019 at 10:46 AM Jon Palmer ***@***.***> wrote:
Okay, I just used your sample data from the sample_data folder and ran this:
    minimap2 -ax splice --cs genome_sample.fasta.gz all_transcripts.fasta | samtools sort -o minimap2.alignments.bam -
And then I used the method above to convert to GFF3 (this is in my
funannotate package):
    funannotate util bam2gff3 -i minimap2.alignments.bam -o minimap2.alignments.gff3
Attachment has both the BAM file and the resulting GFF3 file.
minimap2.zip
<https://github.com/PASApipeline/PASApipeline/files/3482302/minimap2.zip>
|
|
Thanks, Jon! I'll be digging into it in a few hours.
best,
~b
…On Thu, Aug 8, 2019 at 11:10 AM Jon Palmer ***@***.***> wrote:
The cs tag is described on the minimap2 manual page:
https://lh3.github.io/minimap2/minimap2.html. I think you could probably also
get this information from the CIGAR string, but the cs tag was easier
for me to understand. You can also output the CIGAR string in the BAM file
with minimap2 by passing the `-c` flag.
-c | Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag.
--cs[=STR] | Output the cs tag. STR can be either short or long. If no STR is given, short is assumed.
|
Hi guys,
I added this to the 'devel' branch of the pasapipeline repo:
https://github.com/PASApipeline/PASApipeline/blob/devel/misc_utilities/SAM_to_gtf.pl
You can run it on any bam file and it should make a corresponding gtf
file. It shouldn't be minimap2 specific and shouldn't need custom tags.
Jon - the coordinates look different from what you were generating. I
checked one of the differences and mine matched up with blat or gmap. I
didn't do any rigorous testing, though... Note, all the coordinates are
determined based on the cigar string using one of my home-grown libraries
that ships with pasa. Please let me know if there are any issues with it.
Using the bam file that Jon provided was perfect for this.
best,
~b
|
|
No prob. Just let me know if you see any bugs. I didn't have time to
check it carefully.
On Fri, Aug 9, 2019, 6:54 AM marchoeppner ***@***.***> wrote:
I think this will then do the same? Can't use the dev branch for my
project, so needed to encapsulate the same function in a standalone script
(basically just stole Brian's script and one function) - this requires
Samtools to be in $PATH, which is a prereq for PASA anyway:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    my %PATH_COUNTER;
    my $bam = $ARGV[0];
    open(BAM, "samtools view $bam |") or die "Error, cannot run samtools view on $bam: $!";
    while (<BAM>) {
        next if (/^(\@)/); ## skipping the header lines (if you used -h in the samtools command)
        s/\n//; s/\r//; ## removing new line
        my @sam = split(/\t+/); ## splitting SAM line into array
        my %entry = ( "qname" => $sam[0], "flag" => $sam[1], "rname" => $sam[2], "pos" => $sam[3], "mapq" => $sam[4],
                      "cigar" => $sam[5], "rnext" => $sam[6], "pnext" => $sam[7], "tlen" => $sam[8], "seq" => $sam[9], "qual" => $sam[10] );
        ## number of mismatches comes from the NM tag, if present
        my $num_mismatches = 0;
        if (join("\t", @sam) =~ /NM:i:(\d+)/) {
            $num_mismatches = $1;
        }
        ## bit 0x10 of the flag marks a reverse-strand alignment
        my $strand = ($entry{"flag"} & 0x10) ? "-" : "+";
        my $read_name = $entry{"qname"};
        my $scaff_name = $entry{"rname"};
        my ($genome_coords_aref, $query_coords_aref) = get_aligned_coords(%entry);
        my $align_len = 0;
        foreach my $coordset (@$genome_coords_aref) {
            $align_len += abs($coordset->[1] - $coordset->[0]) + 1;
        }
        my $per_id = sprintf("%.1f", 100 - $num_mismatches/$align_len * 100);
        my $align_counter = "$read_name.p" . ++$PATH_COUNTER{$read_name};
        my @genome_n_trans_coords;
        while (@$genome_coords_aref) {
            my $genome_coordset_aref = shift @$genome_coords_aref;
            my $trans_coordset_aref = shift @$query_coords_aref;
            my ($genome_lend, $genome_rend) = @$genome_coordset_aref;
            my ($trans_lend, $trans_rend) = sort {$a<=>$b} @$trans_coordset_aref;
            push (@genome_n_trans_coords, [ $genome_lend, $genome_rend, $trans_lend, $trans_rend ] );
        }
        ## merge segments separated by short gaps (indels rather than introns)
        my @merged_coords;
        push (@merged_coords, shift @genome_n_trans_coords);
        my $MERGE_DIST = 10;
        while (@genome_n_trans_coords) {
            my $coordset_ref = shift @genome_n_trans_coords;
            my $last_coordset_ref = $merged_coords[$#merged_coords];
            if ($coordset_ref->[0] - $last_coordset_ref->[1] <= $MERGE_DIST) {
                # merge it.
                $last_coordset_ref->[1] = $coordset_ref->[1];
                if ($strand eq "+") {
                    $last_coordset_ref->[3] = $coordset_ref->[3];
                } else {
                    $last_coordset_ref->[2] = $coordset_ref->[2];
                }
            }
            else {
                # not merging.
                push (@merged_coords, $coordset_ref);
            }
        }
        foreach my $coordset_ref (@merged_coords) {
            my ($genome_lend, $genome_rend, $trans_lend, $trans_rend) = @$coordset_ref;
            print join("\t",
                       $scaff_name,
                       "genome",
                       "cDNA_match",
                       $genome_lend, $genome_rend,
                       $per_id,
                       $strand,
                       ".",
                       "ID=$align_counter;Target=$read_name $trans_lend $trans_rend") . "\n";
        }
        print "\n";
    }
    sub get_aligned_coords {
        my %entry = @_;
        my $genome_lend = $entry{"pos"};
        my $alignment = $entry{"cigar"};
        my $query_lend = 0;
        my @genome_coords;
        my @query_coords;
        $genome_lend--;
        while ($alignment =~ /(\d+)([A-Z])/g) {
            my $len = $1;
            my $code = $2;
            unless ($code =~ /^[MSDNIH]$/) {
                die "Error, cannot parse cigar code [$code] ";
            }
            # print "parsed $len,$code\n";
            if ($code eq 'M') { # aligned bases match or mismatch
                my $genome_rend = $genome_lend + $len;
                my $query_rend = $query_lend + $len;
                push (@genome_coords, [$genome_lend+1, $genome_rend]);
                push (@query_coords, [$query_lend+1, $query_rend]);
                # reset coord pointers
                $genome_lend = $genome_rend;
                $query_lend = $query_rend;
            }
            elsif ($code eq 'D' || $code eq 'N') { # insertion in the genome or gap in query (intron, perhaps)
                $genome_lend += $len;
            }
            elsif ($code eq 'I' # gap in genome or insertion in query
                   ||
                   $code eq 'S' || $code eq 'H') # masked region of query
            {
                $query_lend += $len;
            }
        }
        return (\@genome_coords, \@query_coords);
    }
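For readers who prefer Python, the merge step and the percent-identity calculation from the Perl script can be sketched as below. This is an approximation rather than a port: the function names are hypothetical, and the query coordinates are merged with `min`/`max` instead of the strand-specific update used in the Perl.

```python
def merge_close_segments(segments, merge_dist=10):
    """Merge consecutive segments whose genomic gap is at most `merge_dist`.

    Each segment is (genome_lend, genome_rend, query_lend, query_rend),
    sorted by genome_lend. Small gaps are treated as indels, not introns.
    """
    merged = []
    for g_l, g_r, q_l, q_r in segments:
        if merged and g_l - merged[-1][1] <= merge_dist:
            last = merged[-1]
            last[1] = g_r
            last[2] = min(last[2], q_l)
            last[3] = max(last[3], q_r)
        else:
            merged.append([g_l, g_r, q_l, q_r])
    return [tuple(s) for s in merged]


def percent_identity(segments, num_mismatches):
    """Percent identity from the NM count over the summed aligned-block length."""
    aligned = sum(g_r - g_l + 1 for g_l, g_r, _, _ in segments)
    return round(100.0 - 100.0 * num_mismatches / aligned, 1)
```

As in the Perl, the identity would be computed on the unmerged aligned blocks and the merge applied afterwards, so the order of the two calls matters.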
|
hmm.... It should be using the Target=(.....) to identify the
transcript, and the ID=(.....) as a unique identifier for the alignment.
i.e., check out sample_data/custom_alignments.gff3
Does your revised code make changes to the output from what I had provided
there?
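A small sketch of the ID/Target convention described here: Target names the transcript (plus its matching coordinates), while ID only needs to be unique per alignment, so a per-transcript counter suffix is fine. The function name and the input tuple layout are assumptions for this example, and the source column is simply set to "genome" as in the scripts above.

```python
from collections import defaultdict


def gff3_match_lines(alignments):
    """Yield cDNA_match GFF3 lines with unique IDs and transcript Targets.

    `alignments` is an iterable of (contig, strand, per_id, transcript,
    segments), where segments are (genome_lend, genome_rend, query_lend,
    query_rend) tuples.
    """
    path_counter = defaultdict(int)
    for contig, strand, per_id, transcript, segments in alignments:
        # One ID per alignment path; every segment of the path shares it.
        path_counter[transcript] += 1
        align_id = "{}.p{}".format(transcript, path_counter[transcript])
        for g_l, g_r, q_l, q_r in segments:
            attrs = "ID={};Target={} {} {}".format(align_id, transcript, q_l, q_r)
            yield "\t".join([contig, "genome", "cDNA_match", str(g_l), str(g_r),
                             "{:.1f}".format(per_id), strand, ".", attrs])
```

A transcript that maps twice then gets IDs ending in `.p1` and `.p2`, while both alignments keep the same unmodified name in Target.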
…On Fri, Aug 16, 2019 at 3:18 AM marchoeppner ***@***.***> wrote:
Quick question - in the log file I am getting these errors:
    ERROR, transcript T01B4.7.p2 found with alignment but not recognized in the transcript database
So does PASA parse the ID field of the GFF entry to match that to the
original transcript name in the SQL database? If so, the running suffix
".p{num}" for multi-mapped transcripts seems to break this...?
My full command looks as follows:
    $PASAHOME/Launch_PASA_pipeline.pl -c pasa_DB.config -C -R -t transcripts.fa.clean -I 10000 -g genome.rm.fa --IMPORT_CUSTOM_ALIGNMENTS_GFF3 minimap.transcripts.gff --CPU 8
|
Wasn't the ID field, as it turns out - I forgot to filter out all SAM mappings that had "*" as a sequence, which produced a bunch of nonsensical GFF entries with negative coordinates. After fixing that, all is well. So just a not-so-helpful warning from PASA ;) |
oh - my sam to gtf converter should have filtered out those entries. I
remember encountering this too.
yes, the error message wasn't so helpful there. ;-)
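A pre-filter along these lines could be applied to each SAM line before conversion. Only the `*` sequence case is reported in the thread; the other checks (unmapped flag, missing position or CIGAR) are assumptions added for this sketch, and `usable_sam_record` is a hypothetical name.

```python
def usable_sam_record(line):
    """Return True if a SAM line looks safe to convert to a GFF3 entry."""
    if line.startswith("@"):
        return False                       # header line
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return False                       # truncated record
    flag, pos = int(fields[1]), int(fields[3])
    cigar, seq = fields[5], fields[9]
    if flag & 0x4:
        return False                       # unmapped read (assumed check)
    if pos < 1 or cigar == "*":
        return False                       # no usable coordinates (assumed check)
    if seq == "*":
        return False                       # the case reported in the thread
    return True
```

Filtering before the CIGAR walk keeps records without coordinates from turning into negative or empty intervals in the output.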
|