This repository has been archived by the owner on May 3, 2024. It is now read-only.

confused about when/how to demux in iso-seq workflow #167

Closed
kmattioli opened this issue Jul 19, 2021 · 16 comments

@kmattioli

Hello,

I'm a bit confused about when and how to properly demux in the Iso-seq workflow in the command line. Wondering what the best practice is to get from multiplexed ccs reads all the way through to HQ isoforms + abundances per sample.

The tl;dr is: should I demux when running the Iso-seq workflow itself using lima as alluded to in the tutorial here? And if so, how do I get from the mapped, collapsed, filtered reads (which are aggregated across samples) to demuxed abundances? Or should I perform Iso-seq without demuxing (so skip the lima step), and demux later, using the demux_isoseq_with_genome.py script as described here?

More info:
I have data with 5 pooled samples (and thus 5 unique primers). I followed the steps outlined in the tutorial here, which are:

  1. call ccs on entire subreads.bam file [output file: ccs.bam]
  2. remove primers and demux with lima [output files: several fl.primer_5p--*_3p.bam depending on primers present]
  3. refine FL transcripts with isoseq3 refine [output files: several *flnc.bam and *flnc.report.csv, one per demuxed sample]
  4. combine all demuxed FLNC files [output file: all.fofn list of above files] (I combined inputs per the documentation here)
  5. cluster superset of FLNC reads using isoseq3 cluster on all.fofn (so no longer de-muxed) [output file: clustered.hq.fasta (and others)]
  6. align clustered isoforms to genome using minimap2 [output file: clustered.hq.fasta.sam]
  7. collapse isoforms using collapse_isoforms_by_sam.py [output file: clustered.hq.collapsed.rep.fa and others]
  8. filter isoforms using filter_by_count.py [output files now end in min_fl_2]
  9. filter out 5p degraded transcripts using filter_away_subset.py on the ABOVE prefix (so on the files already filtered to min_fl_2) [output files now end in filtered]

Now, according to your answer here, I should run the demux script. However, I only have a flnc.report.csv for each demuxed sample (from step 3 above), and it seems like I need the overall (not-demuxed) report to run demux_isoseq_with_genome.py. If I run the following command, for example:

demux_isoseq_with_genome.py --mapped_fafq clustered.hq.collapsed.min_fl_2.filtered.rep.fa --read_stat clustered.hq.collapsed.read_stat.txt --classify_csv sample1.flnc.report.csv -o test

I get an error:

Reading sample1.flnc.report.csv....
Reading clustered.hq.collapsed.read_stat.txt....
Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 82, in main
    info = read_read_stat(read_stat, classify_info)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 50, in read_read_stat
    p = classify_info[r['id']]
KeyError: 'm64049_191003_010409/108004392/ccs'

...ostensibly because this read from the read_stat.txt file (which covers all samples) is not in the sample1.flnc.report.csv file?
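A minimal sketch of the lookup that appears to be failing (hypothetical dictionary contents, mirroring the IDs in the traceback above): the script builds a read-to-primer map from the classify CSV, then looks up every read listed in read_stat.txt.

```python
# classify_info maps read id -> primer, built from ONE sample's flnc.report.csv;
# read_stat.txt spans reads from ALL pooled samples.
classify_info = {
    "m64049_191003_010409/126/ccs": "primer_5p--HCC_3p",   # present in sample1
}
read_stat_ids = [
    "m64049_191003_010409/126/ccs",        # sample1 read: lookup succeeds
    "m64049_191003_010409/108004392/ccs",  # another sample's read: KeyError
]

def demux_lookup(read_id):
    # Raises KeyError if the read came from a sample whose report
    # was not included in the classify CSV.
    return classify_info[read_id]

print(demux_lookup(read_stat_ids[0]))
try:
    demux_lookup(read_stat_ids[1])
except KeyError as e:
    print("missing:", e)
```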

I think I'm missing something about how to run this pipeline correctly, but I'm not sure what.

Thanks so much in advance!

@Magdoll (Owner) commented Jul 23, 2021

Hi @kmattioli ,

Your listed steps (1)-(9) look good to me!
What does your sample1.flnc.report.csv look like? Can you show me the first few lines of it?

@kmattioli (Author)

Yep! Here are the first 10 lines of sample1.flnc.report.csv:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m64049_191003_010409/126/ccs,+,69,31,134,587,primer_5p--HCC_3p
m64049_191003_010409/318/ccs,+,69,31,65,1374,primer_5p--HCC_3p
m64049_191003_010409/539/ccs,+,69,30,137,1801,primer_5p--HCC_3p
m64049_191003_010409/697/ccs,+,30,69,35,998,primer_5p--HCC_3p
m64049_191003_010409/752/ccs,+,69,32,52,917,primer_5p--HCC_3p
m64049_191003_010409/883/ccs,+,31,69,114,2649,primer_5p--HCC_3p
m64049_191003_010409/978/ccs,+,31,70,40,1519,primer_5p--HCC_3p
m64049_191003_010409/1053/ccs,+,30,69,176,820,primer_5p--HCC_3p
m64049_191003_010409/1063/ccs,+,30,70,181,528,primer_5p--HCC_3p

And for good measure, here's sample2.flnc.report.csv:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m64049_191003_010409/25/ccs,+,60,30,76,919,primer_5p--HME1_3p
m64049_191003_010409/37/ccs,+,69,27,119,249,primer_5p--HME1_3p
m64049_191003_010409/51/ccs,+,31,69,44,487,primer_5p--HME1_3p
m64049_191003_010409/63/ccs,+,31,70,149,599,primer_5p--HME1_3p
m64049_191003_010409/76/ccs,+,31,68,92,719,primer_5p--HME1_3p
m64049_191003_010409/117/ccs,+,70,31,29,830,primer_5p--HME1_3p
m64049_191003_010409/211/ccs,+,67,28,102,261,primer_5p--HME1_3p
m64049_191003_010409/220/ccs,+,71,31,145,1132,primer_5p--HME1_3p
m64049_191003_010409/232/ccs,+,67,31,113,487,primer_5p--HME1_3p

(There are 3 more analogous flnc.report.csv files)

@Magdoll (Owner) commented Jul 25, 2021

Hi @kmattioli ,

You've already pooled the data, so you need to provide a "joint" classify report (flnc.report.csv). Can you concatenate all the [sample].flnc.report.csv files (remember not to keep redundant headers) and pass that to demux?
-Liz
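The joint report can be built with a small script along these lines, writing the header row only once (file names below are placeholders for the per-sample reports):

```python
import csv

def concat_flnc_reports(report_paths, out_path):
    """Concatenate per-sample flnc.report.csv files, keeping a single header."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in report_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)       # each file starts with id,strand,...
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)    # data rows only

# Usage (placeholder file names):
# concat_flnc_reports(
#     ["sample1.flnc.report.csv", "sample2.flnc.report.csv"],
#     "all.flnc.report.csv",
# )
```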

@kmattioli (Author)

Hi @Magdoll - that worked!! Thank you for your guidance!

@kmattioli (Author)

Hi @Magdoll -- sorry, one more demuxing related question!

I ran the fusion_finder.py script on my samples and now I'd like to demux the fusion counts. I tried running the following:

python demux_isoseq_with_genome.py --mapped_fafq clustered.hq.fusions.rep.fa --read_stat clustered.hq.fusions.read_stat.txt --classify_csv all.flnc.report.csv -o clustered.hq.fusions.abundance.demuxed.txt

Where clustered.hq.fusions.rep.fa and clustered.hq.fusions.read_stat.txt are output by fusion_finder.py, and all.flnc.report.csv is the concatenated flnc report that you suggested above (which worked to demux non-fusions).

But now I see this error:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 101, in main
    raise Exception("Expected ID format PB.X.Y but found {0}!".format(r.id))
Exception: Expected ID format PB.X.Y but found PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/41957!

Is this the correct way to demux fusions?

@Magdoll (Owner) commented Jul 30, 2021

Hi @kmattioli ,

Yeah, that's my fault for requiring the IDs to look like PB.X.Y... back when I was very strict about ID formats :-)

Can you do me a favor and test this? Change line 13

from

mapped_id_rex = re.compile('(PB.\d+.\d+)')

to

mapped_id_rex = re.compile('(PB\S*.\d+.\d+)')

If that works I'll update the code in the GitHub repo

-Liz

@kmattioli (Author)

Hi @Magdoll -

That didn't work; it returned the following error:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 105, in main
    f.write(",{0}".format(info[pbid][p]))
KeyError: 'PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/49462'

So I did some regex testing and the following line 13 appears to work on my end for both non-fusion and fusion transcript de-muxing:

mapped_id_rex = re.compile('(PB(fusion)*\.\d+\.*\d*)')

However, I then ended up with a new error in the fusion de-muxing that I can't trace back. The PB ID is parsed correctly but now the error is:

Traceback (most recent call last):
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 131, in <module>
    main(args.job_dir, args.mapped_fafq, args.read_stat, args.classify_csv, args.output, primer_names)
  File "/PHShome/kz659/bin/cDNA_Cupcake/post_isoseq_cluster/demux_isoseq_with_genome.py", line 105, in main
    f.write(",{0}".format(info[pbid][p]))
KeyError: 'PBfusion.32'

Fusions 1-29 before it appeared to work fine, so I'm not sure what the issue with this one is...
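For reference, the revised line-13 pattern can be sanity-checked against both ID styles (the fusion ID below is taken from the tracebacks above; the classic ID is a made-up example):

```python
import re

# Pattern proposed above: matches both PB.X.Y and PBfusion.X (with or without .Y)
mapped_id_rex = re.compile(r'(PB(fusion)*\.\d+\.*\d*)')

classic = "PB.12.3"
fusion = "PBfusion.1|chr1:17479-29348(-)+chrX:51756724-51758031(-)|transcript/41957"

# group(1) captures only the PB/PBfusion prefix, stopping before the '|'
print(mapped_id_rex.match(classic).group(1))   # PB.12.3
print(mapped_id_rex.match(fusion).group(1))    # PBfusion.1
```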

@Magdoll (Owner) commented Aug 3, 2021

Hi @kmattioli, well crap, now I realize you're giving it fusions - I'm not sure fusions have ever been run through demux, and without testing I can't be sure it works properly.

Are you willing to share data confidentially for me to make some code changes? If so, please share an email for me to send file request.

Thanks,
-Liz

@kmattioli (Author)

No problem! kaia.mattioli [at] gmail.com

@Magdoll (Owner) commented Aug 4, 2021 via email

@Magdoll (Owner) commented Aug 6, 2021

Hi @kmattioli ,

I fixed the PBfusion ID format support issue but also noticed there was an actual error because PBfusion.32 was not in your read_stat.txt file. For now I've modified the script to output a warning instead of failing, so users know that this sequence will not have a demuxed count file. This new version is pushed to Cupcake v26.0.0 and I've also dropped off the modified result to you. Please confirm!

-Liz

@kmattioli (Author)

Thank you @Magdoll! I received the files. One question -- is there a reason why that fusion fails to demux? I didn't do anything special to the read_stat.txt file -- just got it from running the standard fusion finder script.

@Magdoll (Owner) commented Aug 6, 2021

Hi @kmattioli ,

There were two reasons it failed (and I fixed/addressed them both in Cupcake v26.2.0 and above).

The first is that the sequence IDs were expected to be PB.X.Y but the fusion IDs were PBfusion.X|xxxxxx. The new script now recognizes PBfusion ID formats.

The second is that you had a sequence, PBfusion.32, that was present only in the fasta and not in the read_stat file. I'm letting the script throw a warning rather than fail, so you can still get counts for the other sequences.
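That warn-and-continue behaviour can be sketched as follows (a hypothetical simplification, not the actual Cupcake code):

```python
import sys

def write_demux_counts(fasta_ids, counts_by_pbid):
    """Collect counts for every sequence in the fasta; warn (rather than raise
    KeyError) when a sequence has no entry derived from read_stat."""
    rows = []
    for pbid in fasta_ids:
        if pbid not in counts_by_pbid:
            print(f"WARNING: Could not find {pbid} in read_stat. "
                  "No demux output for this sequence!", file=sys.stderr)
            continue  # skip instead of failing the whole run
        rows.append((pbid, counts_by_pbid[pbid]))
    return rows

# PBfusion.32 is missing from the counts, so it is skipped with a warning:
rows = write_demux_counts(["PBfusion.1", "PBfusion.32"], {"PBfusion.1": 5})
```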

Does this make sense?

@Magdoll (Owner) commented Aug 6, 2021

NOTE: please do not use Cupcake v26.0.0 or v26.1.0, only v26.2.0 or above (in case I find another bug in that ID format matching).

@kmattioli (Author)

Yup, makes sense! I was just wondering if it was expected for PBfusion.32 to be present in fasta but not read_stat, but it sounds like that may happen. Thank you for your help!

@Magdoll closed this as completed Aug 7, 2021
@Yuzhang0102

Hi, I used the same pipeline as @kmattioli to process my Iso-Seq data, but I hit an error when running demux_isoseq_with_genome.py.

WARNING: Could not find PB.9996.2|3:104721362-104790835(-)|transcript/114028 in clustered.hq.fasta.collapsed.read_stat.txt. No demux output for this sequence! WARNING: Could not find PB.9996.3|3:104780666-104790835(-)|transcript/83742 in clustered.hq.fasta.collapsed.read_stat.txt. No demux output for this sequence!

It warns that every transcript is missing from the read_stat.txt, and indeed the transcript IDs in read_stat.txt differ from the IDs in the filtered.rep.fa. But both files are output from the pipeline. What's the problem with my command?

demux_isoseq_with_genome.py --mapped_fafq clustered.hq.fasta.collapsed.filtered.rep.fa --read_stat clustered.hq.fasta.collapsed.read_stat.txt --classify_csv all_classify.report.csv -o clustered.hq.fasta.collapsed.filtered.mapped_fl_count.txt

clustered.hq.fasta.collapsed.filtered.rep.fa is from get_abundance_post_collapse.py after collapse_isoforms_by_sam.py,
clustered.hq.fasta.collapsed.read_stat.txt is from collapse_isoforms_by_sam.py, and
all_classify.report.csv is from merging each sample's .flnc.report.csv from the isoseq3 refine step.
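One way I tried to diagnose this is to compare the PB IDs parsed from the rep.fa headers against the pbid values in the read_stat file (a hypothetical helper; it assumes read_stat.txt is tab-delimited with a 'pbid' column and that fasta headers begin with the PB id):

```python
import re

# Matches PB.X.Y and PBfusion.X(.Y) at the start of a fasta header
pbid_rex = re.compile(r'(PB(fusion)?\.\d+(\.\d+)?)')

def fasta_pbids(fasta_path):
    """PB IDs parsed from '>' header lines of a rep.fa file."""
    ids = set()
    with open(fasta_path) as f:
        for line in f:
            if line.startswith(">"):
                m = pbid_rex.match(line[1:])
                if m:
                    ids.add(m.group(1))
    return ids

def read_stat_pbids(read_stat_path):
    """pbid column values from a tab-delimited read_stat file."""
    ids = set()
    with open(read_stat_path) as f:
        header = f.readline().rstrip("\n").split("\t")
        col = header.index("pbid")
        for line in f:
            ids.add(line.rstrip("\n").split("\t")[col])
    return ids

# missing = fasta_pbids("...filtered.rep.fa") - read_stat_pbids("...read_stat.txt")
```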

Thanks!
