Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues regarding flair #1

Closed
ashokpatowary opened this issue Oct 15, 2018 · 12 comments
Closed

Issues regarding flair #1

ashokpatowary opened this issue Oct 15, 2018 · 12 comments

Comments

@ashokpatowary
Copy link

Hi @anbrooks @belgravia

I tried to run the flair pipeline on my MinION RNAseq dataset. I have faces some issues.

  1. while trying align I have came across the following error

[ERROR] unknown preset ' splice '

So I mapped the reads independently with minimap2. Thereafter while using the collapse options python /u/home/Resource/flair/flair.py collapse -r test.fastq -g GRCh37.primary_assembly.genome.fa -q test_strand_corrected.psl -m minimap2 -f GenCode_v28.gtf , it gives me a blank output file. Also I tried using psl_to_sequence.py tools separately, it gives a blank output file.

Am I doing anything wrong?

@belgravia
Copy link
Collaborator

I am taking a guess that what's causing psl_to_sequence.py to fail is an inability to match the chromosome names in the reference genome fasta with the names in the psl. I have pushed a new version of bin/psl_to_sequence.py that may help. Otherwise, I'd need to take a quick look at what your files look like.

In regards to the minimap2 usage error, that's been fixed with the newest flair.py wrapper script I just pushed, thank you for notifying me of that. That step is definitely fine to do outside of the wrapper script, as long as you convert to psl moving forward.

@ashokpatowary
Copy link
Author

Hi @belgravia

Thanks for your quick response. I will try the updated one now. Just one more query. While running the last version, I was having 723040 entries in my input *.psl, 722688 entries in *_strand.psl; 129339 entries in corrected.gp; and only 10941 entries in *_strand_corrected.psl. Does it looks ok?

Thanks

@belgravia
Copy link
Collaborator

There is an expected drop in entries after correction since any reads that contain a splice junction not conforming to GT-AG splice motif and isn't within window of an annotated GT-AG is removed. A >40% drop in entries is larger than what I've normally seen.

If *_strand_corrected.psl is your output file name from the flair correct step, then what may be happening is:

  1. Many entries are being filtered out for uncorrectable splice sites (--> corrected.gp)
  2. More entries are being filtered out for containing novel, but valid sites (these filtering steps produce strand_corrected.psl)

Potential fixes are:
1a) You could increase the window size (-w) or use a more comprehensive annotation
1b) Your genepred genome annotation may be misformatted for how FLAIR expects it. Have you compared with the genepred in the example files yet?
2a) May be fixed with a better annotation file (1b)
2b) If you choose to move forward with novel, valid splice sites you can now specify just -n in the correct command and those will be retained (please update your flair.py script in that case)

And there should be little to no drop for the first strand/gene inference step. To address that, I'd have to look at a the missing entries since it's unclear to me what could be causing that.

@ashokpatowary
Copy link
Author

Hi @belgravia

I am still having issues with running it. Also I tried to check the intermidiate files. Run4_strand_corrected_novels.collapse1.psl is a blank file

Inferring strand for reads in PSL
Correcting reads in Run4_strand.psl
chrY
chrX
chr13
chr12
chr11
chr10
chr17
chr16
chr15
chr14
chr19
chr18
chrM
chr22
chr20
chr21
chr7
chr6
chr5
chr4
chr3
chr2
chr1
chr9
chr8
Converting output gp to PSL
Collapsing isoforms
Read data extracted, collapsing
Traceback (most recent call last):
File "/u/home/flair/bin/collapse_isoforms_precise.py", line 303, in
ends = find_best_sites(isoforms[chrom][jset]['tss'], isoforms[chrom][jset]['tss_tes'], chrom)
File "/u/home/Resource/flair/bin/collapse_isoforms_precise.py", line 124, in find_best_sites
found_tss = find_tsss(sites_tss_all, finding_tss=True, max_results=max_results, chrom=chrom)
File "/u/home/Resource/flair/bin/collapse_isoforms_precise.py", line 105, in find_tsss
if t in annotends[chrom] and t - bestsite[0] < closest_annotated - bestsite[0]:
KeyError: 'chrY'
Filtering isoforms
Renaming isoforms
Aligning reads to first-pass isoform reference
[M::mm_idx_gen::0.0017.01] collected minimizers
[M::mm_idx_gen::0.003
5.19] sorted minimizers
[M::main::0.0035.06] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.003
4.92] mid_occ = 0
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.0034.82] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::23.135
0.76] mapped 453269 sequences
[M::main] Version: 2.12-r847-dirty
[M::main] CMD: minimap2 -a -t 4 --secondary=no /u/home/MinION/Run_04_Part/correction_output/Run4_strand_corrected_novels.collapse1.fa /u/project/
/nanopore/test.fastq
[M::main] Real time: 23.195 sec; CPU: 17.540 sec; Peak RSS: 0.845 GB
Counting isoform expression
Filtering isoforms by read coverage
Removing intermediate files/done!

@belgravia
Copy link
Collaborator

After the FLAIR v1..0 release, I had pushed an updated version of collapse_isoforms_precise.py that takes an optional gtf that also contained the bug that you found. I have fixed it now, so you can either use the most recent version (commit aa25f2c) or just use the version in the v1.0 release (no gtf support).

On another note, I hope you have fixed your issue with the loss of reads after novel junction filtering!

@ashokpatowary
Copy link
Author

@belgravia

It working perfect with updates. -novel option solve the issues related to loss of reads. Thank you for your help. I am having one last query. During collapse the minimap2 command use was CMD: minimap2 -a -t 4 --secondary=no /u/home/MinION/Run_04_Part/correction_output/Run4_strand_corrected_novels.collapse1.fa /u/project/nanopore/big.fastq

Are you considering the collapse.fa as the ref genome?

Regards

@belgravia
Copy link
Collaborator

I'm glad it's working for you now.
Yes, that step is aligning the reads directly to the first-pass isoforms to assign an isoform to each read. Isoforms that surpass some number threshold of reads supporting them are retained in the final isoform output.

@ashokpatowary
Copy link
Author

@belgravia

Thanks for the explanation and also for this nice tools. It makes life easy.

@ashokpatowary
Copy link
Author

ashokpatowary commented Nov 26, 2018 via email

@belgravia
Copy link
Collaborator

The fasta files should include all isoforms, including the annotated ones. If you run bin/identify_similar_annotated_isoforms.py on your final isoform psl, it should rename the isoforms that are annotated to their names in the annotation. This script is already run in the flair.py wrapper.

So for human data, the annotated isoforms should be renamed to 'ENST....' and the novel ones should have the nanopore read name as their name.

Also I'm going to be pushing a faster version of this script and others quite soon, if you can wait a little bit.

@ashokpatowary
Copy link
Author

ashokpatowary commented Dec 5, 2018 via email

@belgravia
Copy link
Collaborator

Sorry for delay in response! Yes, that script is correct. I combined scripts and there's now a script called bin/identify_gene_isoform.py. You can run it with a command like python identify_gene_isoform.py yourisoforms.psl annotation.gtf yourisoforms_named.psl. The script will output a file using the name of the last argument with the name field rewritten with ensemble gene IDs and transcript IDs. Isoforms that do not have a transcript ID would be novel isoforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants