This repository has been archived by the owner on May 3, 2024. It is now read-only.

confused about when/how to demux in iso-seq workflow #167

Closed
kmattioli opened this issue Jul 19, 2021 · 16 comments

@kmattioli

Hello,

I'm a bit confused about when and how to properly demux in the Iso-seq workflow in the command line. Wondering what the best practice is to get from multiplexed ccs reads all the way through to HQ isoforms + abundances per sample.

The tl;dr is: should I demux when running the Iso-seq workflow itself using lima as alluded to in the tutorial here? And if so, how do I get from the mapped, collapsed, filtered reads (which are aggregated across samples) to demuxed abundances? Or should I perform Iso-seq without demuxing (so skip the lima step), and demux later, using the demux_isoseq_with_genome.py script as described here?

More info:
I have data with 5 pooled samples (and thus 5 unique primers). I followed the steps outlined in the tutorial here, which are:

  1. call ccs on entire subreads.bam file [output file: ccs.bam]
  2. remove primers and demux with lima [output files: several fl.primer_5p--*_3p.bam depending on primers present]
  3. refine FL transcripts with isoseq3 refine [output files: several *flnc.bam and *flnc.report.csv, one per demuxed sample]
  4. combine all demuxed FLNC files [output file: all.fofn list of above files] (I combined inputs per the documentation here)
  5. cluster superset of FLNC reads using isoseq3 cluster on all.fofn (so no longer de-muxed) [output file: clustered.hq.fasta (and others)]
  6. align clustered isoforms to genome using minimap2 [output file: clustered.hq.fasta.sam]
  7. collapse isoforms using collapse_isoforms_by_sam.py [output file: clustered.hq.collapsed.rep.fa and others]
  8. filter isoforms using filter_by_count.py [output files now end in min_fl_2]
  9. filter out 5p degraded transcripts using filter_away_subset.py on the ABOVE prefix (so on the files already filtered to min_fl_2) [output files now end in filtered]

Now, according to your answer here, I should run the demux script. However, I only have a flnc.report.csv for each demuxed sample (from step 3 above), and it seems like I need the overall (not-demuxed) report to run demux_isoseq_with_genome.py. If I run the following command, for example:

demux_isoseq_with_genome.py --mapped_fafq clustered.hq.collapsed.min_fl_2.filtered.rep.fa --read_stat clustered.hq.collapsed.read_stat.txt --classify_csv sample1.flnc.report.csv -o test

I get an error:

Reading sample1.flnc.report.csv....
Reading clustered.hq.collapsed.read_stat.txt....
Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 82, in main
    info = read_read_stat(read_stat, classify_info)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 50, in read_read_stat
    p = classify_info[r['id']]
KeyError: 'm64049_191003_010409/108004392/ccs'

...ostensibly because this read from the read_stat.txt file (which covers all samples) is not in the sample1.flnc.report.csv file?
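A minimal sketch of the lookup that appears to be failing (hypothetical dictionary contents, mirroring the IDs in the traceback above): the script builds a read-to-primer map from the classify CSV, then looks up every read listed in read_stat.txt.

```python
# classify_info maps read id -> primer, built from ONE sample's flnc.report.csv;
# read_stat.txt spans reads from ALL pooled samples.
classify_info = {
    "m64049_191003_010409/126/ccs": "primer_5p--HCC_3p",   # present in sample1
}
read_stat_ids = [
    "m64049_191003_010409/126/ccs",        # sample1 read: lookup succeeds
    "m64049_191003_010409/108004392/ccs",  # another sample's read: KeyError
]

def demux_lookup(read_id):
    # Raises KeyError if the read came from a sample whose report
    # was not included in the classify CSV.
    return classify_info[read_id]

print(demux_lookup(read_stat_ids[0]))
try:
    demux_lookup(read_stat_ids[1])
except KeyError as e:
    print("missing:", e)
```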

I think I'm missing something about how to run this pipeline correctly, but I'm not sure what.

Thanks so much in advance!

@Magdoll (Owner) commented Jul 23, 2021

Hi @kmattioli ,

Your listed steps (1)-(9) look good to me!
What does your sample1.flnc.report.csv look like? Can you show me the first few lines of it?

@kmattioli (Author)

Yep! Here are the first 10 lines of sample1.flnc.report.csv:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m64049_191003_010409/126/ccs,+,69,31,134,587,primer_5p--HCC_3p
m64049_191003_010409/318/ccs,+,69,31,65,1374,primer_5p--HCC_3p
m64049_191003_010409/539/ccs,+,69,30,137,1801,primer_5p--HCC_3p
m64049_191003_010409/697/ccs,+,30,69,35,998,primer_5p--HCC_3p
m64049_191003_010409/752/ccs,+,69,32,52,917,primer_5p--HCC_3p
m64049_191003_010409/883/ccs,+,31,69,114,2649,primer_5p--HCC_3p
m64049_191003_010409/978/ccs,+,31,70,40,1519,primer_5p--HCC_3p
m64049_191003_010409/1053/ccs,+,30,69,176,820,primer_5p--HCC_3p
m64049_191003_010409/1063/ccs,+,30,70,181,528,primer_5p--HCC_3p

And for good measure, here's sample2.flnc.report.csv:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m64049_191003_010409/25/ccs,+,60,30,76,919,primer_5p--HME1_3p
m64049_191003_010409/37/ccs,+,69,27,119,249,primer_5p--HME1_3p
m64049_191003_010409/51/ccs,+,31,69,44,487,primer_5p--HME1_3p
m64049_191003_010409/63/ccs,+,31,70,149,599,primer_5p--HME1_3p
m64049_191003_010409/76/ccs,+,31,68,92,719,primer_5p--HME1_3p
m64049_191003_010409/117/ccs,+,70,31,29,830,primer_5p--HME1_3p
m64049_191003_010409/211/ccs,+,67,28,102,261,primer_5p--HME1_3p
m64049_191003_010409/220/ccs,+,71,31,145,1132,primer_5p--HME1_3p
m64049_191003_010409/232/ccs,+,67,31,113,487,primer_5p--HME1_3p

(There are 3 more analogous flnc.report.csv files)

@Magdoll (Owner) commented Jul 25, 2021

Hi @kmattioli ,

You've already pooled the data, so you need to provide a "joint" classify report (flnc.report.csv). Can you concatenate all the [sample].flnc.report.csv files (remember not to keep redundant headers) and pass that to demux?
-Liz
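The joint report can be built with a small script along these lines, writing the header row only once (file names below are placeholders for the per-sample reports):

```python
import csv

def concat_flnc_reports(report_paths, out_path):
    """Concatenate per-sample flnc.report.csv files, keeping a single header."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in report_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)       # each file starts with id,strand,...
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)    # data rows only

# Usage (placeholder file names):
# concat_flnc_reports(
#     ["sample1.flnc.report.csv", "sample2.flnc.report.csv"],
#     "all.flnc.report.csv",
# )
```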

@kmattioli (Author)

Hi @Magdoll - that worked!! Thank you for your guidance!

@kmattioli (Author)

Hi @Magdoll -- sorry, one more demuxing related question!

I ran the fusion_finder.py script on my samples and now I'd like to demux the fusion counts. I tried running the following:

python demux_isoseq_with_genome.py --mapped_fafq clustered.hq.fusions.rep.fa --read_stat clustered.hq.fusions.read_stat.txt --classify_csv all.flnc.report.csv -o clustered.hq.fusions.abundance.demuxed.txt

Where clustered.hq.fusions.rep.fa and clustered.hq.fusions.read_stat.txt are output by fusion_finder.py, and all.flnc.report.csv is the concatenated flnc report that you suggested above (which worked to demux non-fusions).

But now I see this error:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 101, in main
    raise Exception("Expected ID format PB.X.Y but found {0}!".format(r.id))
Exception: Expected ID format PB.X.Y but found PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/41957!

Is this the correct way to demux fusions?

@Magdoll (Owner) commented Jul 30, 2021

Hi @kmattioli ,

Yeah, that's my fault for requiring the IDs to look like PB.X.Y... back when I was very strict about ID formats :-)

Can you do me a favor and test this? Change line 13

from

mapped_id_rex = re.compile('(PB.\d+.\d+)')

to

mapped_id_rex = re.compile('(PB\S*.\d+.\d+)')

If that works I'll update the code in the GitHub repo

-Liz

@kmattioli (Author)

Hi @Magdoll -

That didn't work; it returned the following error:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 105, in main
    f.write(",{0}".format(info[pbid][p]))
KeyError: 'PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/49462'

So I did some regex testing and the following line 13 appears to work on my end for both non-fusion and fusion transcript de-muxing:

mapped_id_rex = re.compile('(PB(fusion)*\.\d+\.*\d*)')

However, I then ended up with a new error in the fusion de-muxing that I can't trace back. The PB ID is parsed correctly but now the error is:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 105, in main
    f.write(",{0}".format(info[pbid][p]))
KeyError: 'PBfusion.32'

Fusions 1-29 before it appeared to work fine, so I'm not sure what the issue with this one is...
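For reference, the revised line-13 pattern can be sanity-checked against both ID styles (the fusion ID below is taken from the tracebacks above; the classic ID is a made-up example):

```python
import re

# Pattern proposed above: matches both PB.X.Y and PBfusion.X (with or without .Y)
mapped_id_rex = re.compile(r'(PB(fusion)*\.\d+\.*\d*)')

classic = "PB.12.3"
fusion = "PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/41957"

# group(1) captures only the PB/PBfusion prefix, stopping before the '|'
print(mapped_id_rex.match(classic).group(1))   # PB.12.3
print(mapped_id_rex.match(fusion).group(1))    # PBfusion.1
```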

@Magdoll (Owner) commented Aug 3, 2021

Hi @kmattioli, well crap, now I realize you're giving it fusions - I'm not sure fusions have ever been run through demux, and without testing I can't be sure it works properly.

Are you willing to share data confidentially for me to make some code changes? If so, please share an email for me to send file request.

Thanks,
-Liz

@kmattioli (Author)

No problem! kaia.mattioli [at] gmail.com

@Magdoll (Owner) commented Aug 4, 2021 via email

@Magdoll (Owner) commented Aug 6, 2021

Hi @kmattioli ,

I fixed the PBfusion ID format support issue but also noticed there was an actual error because PBfusion.32 was not in your read_stat.txt file. For now I've modified the script to output a warning instead of failing, so users know that this sequence will not have a demuxed count file. This new version is pushed to Cupcake v26.0.0 and I've also dropped off the modified result to you. Please confirm!

-Liz

@kmattioli (Author)

Thank you @Magdoll! I received the files. One question -- is there a reason why that fusion fails to demux? I didn't do anything special to the read_stat.txt file -- just got it from running the standard fusion finder script.

@Magdoll (Owner) commented Aug 6, 2021

Hi @kmattioli ,

There were two reasons it failed (and I fixed/addressed them both in Cupcake v26.2.0 and above).

The first is that the sequence IDs were expected to be PB.X.Y but the fusion IDs were PBfusion.X|xxxxxx. The new script now recognizes PBfusion ID formats.

The second is that you had a sequence, PBfusion.32, that was present only in the fasta and not in the read_stat file. I'm letting the script throw a warning rather than fail, so you can still get counts for the other sequences.
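That warn-and-continue behaviour can be sketched as follows (a hypothetical simplification, not the actual Cupcake code):

```python
import sys

def write_demux_counts(fasta_ids, counts_by_pbid):
    """Collect counts for every sequence in the fasta; warn (rather than raise
    KeyError) when a sequence has no entry derived from read_stat."""
    rows = []
    for pbid in fasta_ids:
        if pbid not in counts_by_pbid:
            print(f"WARNING: Could not find {pbid} in read_stat. "
                  "No demux output for this sequence!", file=sys.stderr)
            continue  # skip instead of failing the whole run
        rows.append((pbid, counts_by_pbid[pbid]))
    return rows

# PBfusion.32 is missing from the counts, so it is skipped with a warning:
rows = write_demux_counts(["PBfusion.1", "PBfusion.32"], {"PBfusion.1": 5})
```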

Does this make sense?

@Magdoll (Owner) commented Aug 6, 2021

NOTE: please do not use Cupcake v26.0.0 or v26.1.0, only v26.2.0 or above (in case I find another bug in that ID format matching).

@kmattioli (Author)

Yup, makes sense! I was just wondering if it was expected for PBfusion.32 to be present in fasta but not read_stat, but it sounds like that may happen. Thank you for your help!

@Magdoll closed this as completed Aug 7, 2021
@Yuzhang0102

Hi, I used the same pipeline as @kmattioli to process my Iso-Seq data, but I hit an error when running demux_isoseq_with_genome.py.

WARNING: Could not find PB.9996.2|3:104721362-104790835(-)|transcript/114028 in clustered.hq.fasta.collapsed.read_stat.txt. No demux output for this sequence! WARNING: Could not find PB.9996.3|3:104780666-104790835(-)|transcript/83742 in clustered.hq.fasta.collapsed.read_stat.txt. No demux output for this sequence!

It warns that every transcript is missing from the read_stat.txt, and indeed the transcript IDs in read_stat.txt differ from the IDs in the filtered.rep.fa. But both files are output from the pipeline. What's the problem with my command?

demux_isoseq_with_genome.py --mapped_fafq clustered.hq.fasta.collapsed.filtered.rep.fa --read_stat clustered.hq.fasta.collapsed.read_stat.txt --classify_csv all_classify.report.csv -o clustered.hq.fasta.collapsed.filtered.mapped_fl_count.txt

clustered.hq.fasta.collapsed.filtered.rep.fa is from get_abundance_post_collapse.py after collapse_isoforms_by_sam.py,
clustered.hq.fasta.collapsed.read_stat.txt is from collapse_isoforms_by_sam.py, and
all_classify.report.csv is from merging each sample's .flnc.report.csv from the isoseq3 refine step.
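One way I tried to diagnose this is to compare the PB IDs parsed from the rep.fa headers against the pbid values in the read_stat file (a hypothetical helper; it assumes read_stat.txt is tab-delimited with a 'pbid' column and that fasta headers begin with the PB id):

```python
import re

# Matches PB.X.Y and PBfusion.X(.Y) at the start of a fasta header
pbid_rex = re.compile(r'(PB(fusion)?\.\d+(\.\d+)?)')

def fasta_pbids(fasta_path):
    """PB IDs parsed from '>' header lines of a rep.fa file."""
    ids = set()
    with open(fasta_path) as f:
        for line in f:
            if line.startswith(">"):
                m = pbid_rex.match(line[1:])
                if m:
                    ids.add(m.group(1))
    return ids

def read_stat_pbids(read_stat_path):
    """pbid column values from a tab-delimited read_stat file."""
    ids = set()
    with open(read_stat_path) as f:
        header = f.readline().rstrip("\n").split("\t")
        col = header.index("pbid")
        for line in f:
            ids.add(line.rstrip("\n").split("\t")[col])
    return ids

# missing = fasta_pbids("...filtered.rep.fa") - read_stat_pbids("...read_stat.txt")
```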

Thanks!
