filter_away_subset.py error #15

markopetek · 2017-08-23T12:53:22Z

For S1 sample the filter_away_subset.py finished without a problem but for the second sample (S2) I get this error:

Traceback (most recent call last):
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/filter_away_subset.py", line 4, in
import('pkg_resources').run_script('cupcake==3.1', 'filter_away_subset.py')
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/init.py", line 744, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/init.py", line 1499, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 150, in
main()
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 141, in main
r = d[k]
KeyError: 'PB.1.1'

this is the command I used for running the script (in anaCogent environment):

filter_away_subset.py all_sizes.quivered_hq_MPE_S2.fastq.collapsed

the input and generated output files are available here:
https://www.dropbox.com/sh/z4wps7y6gvyg4w2/AACDiAk4X-SyyoeWYwTqL0i2a?dl=0

Magdoll · 2017-08-23T19:09:37Z

Hi @markopetek ,

Your count file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt is somehow incomplete. It is missing, for example, the count information for PB.1.1, which is present in other collapse output such as all_sizes.quivered_hq_MPE_S2.fastq.collapsed.GFF.

The file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.group.txt shows that you have 9703 isoforms, this means there should be 9703+15 (for header)=9718 lines in all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt but it has only 4339.

--Liz

markopetek · 2017-08-25T16:39:26Z

Hi Liz,

I thought this incompleteness of the abundance file might be caused by some glitch in the gmap produced sam file so I've mapped the reads using STARlong, sorted and rerun the collapse and filer script again but I get the same erorr. Would you suggest any parameter change when running the collapse script in order to avoid this outcome? Maybe the -c 0.99 -i 0.95 are too strict since I've mapped cultivated potato transcripts to a reference which is a different subspecies - that might be why the gff has more isoforms than the abundance file.

Magdoll · 2017-08-25T18:19:33Z

Hi @markopetek ,
Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect.

markopetek · 2017-08-28T06:46:40Z

Dear Liz, this are the fastq and the cluster report file that you've requested: https://www.dropbox.com/sh/kyn9j8txg6vqlnd/AAB9nLr8_GybALY5N1NKCo9Ia?dl=0 Let me know what you find out. With kind regards, Marko

…

-------------------- Lep pozdrav Marko Petek 2017-08-25 20:19 GMT+02:00 Magdoll <notifications@github.com>:

Hi @markopetek <https://github.com/markopetek> , Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#15 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMN1X0Dvz_3zvZZFrRDwTfXhZyC0R7ZJks5sbxA1gaJpZM4O_9m0> .

Magdoll · 2017-08-28T18:11:39Z

Hi @markopetek ,

I believe I have identified the issue. Somehow some of your IDs had an extra underline in your FASTQ file, but they were not there in the cluster_report.csv (the cluster report is what is read to generate counts).

The group txt looked like this:

PB.1.1  i0_HQ__S2|c22385/f3p0/1042
PB.2.1  i0_HQ__S2|c218058/f3p12/1377
PB.2.2  i0_HQ__S2|c113604/f3p17/1237
PB.2.3  i0_HQ__S2|c356155/f2p6/825
PB.3.1  i0_HQ__S2|c212424/f2p1/1373
PB.3.2  i0_HQ__S2|c4103/f3p1/1275
PB.4.1  i1_HQ_S2|c114684/f2p18/3644
PB.4.2  i1_HQ_S2|c14710/f2p15/2321
PB.5.1  i1_HQ_S2|c161525/f14p22/3663

where you notice some of the IDs had two underlines __ instead of one before S2. This was consistent with the HQ fastq IDs but were inconsistent with the cluster_report.csv, which all had only a single underline:

cluster_id,read_id,read_type
i3_ICE_S2|c0,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/22583/37_16585_CCS,FL
i3_ICE_S2|c1,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/34221/37_14492_CCS,FL
i3_ICE_S2|c2,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/127040/12966_55_CCS,FL

(don't be alarmed to see _ICE_ instead of _HQ_ in cluster_report.csv; it's supposed to be that way because at the end of ICE, we don't know whether clusters become HQ or LQ; the script get_abundance_post_collapse.py understands this difference)

Once I changed the group file to remove the extra underline and ran get_abundance_post_collapse.py again, I got the right results and subsequently filter_away_subset.py worked as well.

I've put the fixed group file and filtered results here:
https://www.dropbox.com/s/ec7uhd3sbllenkb/fixed.tar.gz?dl=0

markopetek · 2017-08-29T12:58:48Z

Thanks for the help Liz. I see that in the fastq file the sequences starting with @i0_HQ are followed by two underlines, those starting with @i1_HQ or @i2_HQ have only one. I'll rerun the gmap mapping in order to include the chloroplast and mitochondrial genomes besides the nuclear genome. I'll fix the input fastq before running the collapse script by replacing all @i0_HQ__S2 with @i0_HQ_S2. Is that OK or should I rather fix the group.txt file as you did?

Magdoll · 2017-08-30T16:48:24Z

Fixing the HQ FASTQ file header itself is easier because that's the "root" of the problem. Fixing the group.txt is also fine, you just have to remember to do it every time you re-run collapse.

What's not clear to me is why for @i0_HQ there is two underlines. I've never seen it happen. Hopefully it's just a human error somewhere.
--Liz

Magdoll closed this as completed Sep 27, 2017

gabbycoffing mentioned this issue Jun 9, 2021

The number of PBID records disagree #164

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filter_away_subset.py error #15

filter_away_subset.py error #15

markopetek commented Aug 23, 2017

Magdoll commented Aug 23, 2017

markopetek commented Aug 25, 2017

Magdoll commented Aug 25, 2017

markopetek commented Aug 28, 2017 via email

Magdoll commented Aug 28, 2017

markopetek commented Aug 29, 2017

Magdoll commented Aug 30, 2017

filter_away_subset.py error #15

filter_away_subset.py error #15

Comments

markopetek commented Aug 23, 2017

Magdoll commented Aug 23, 2017

markopetek commented Aug 25, 2017

Magdoll commented Aug 25, 2017

markopetek commented Aug 28, 2017 via email

Magdoll commented Aug 28, 2017

markopetek commented Aug 29, 2017

Magdoll commented Aug 30, 2017