Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

filter_away_subset.py error #15

Closed
markopetek opened this issue Aug 23, 2017 · 7 comments
Closed

filter_away_subset.py error #15

markopetek opened this issue Aug 23, 2017 · 7 comments

Comments

@markopetek
Copy link

For S1 sample the filter_away_subset.py finished without a problem but for the second sample (S2) I get this error:

Traceback (most recent call last):
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/filter_away_subset.py", line 4, in
import('pkg_resources').run_script('cupcake==3.1', 'filter_away_subset.py')
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/init.py", line 744, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/init.py", line 1499, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 150, in
main()
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 141, in main
r = d[k]
KeyError: 'PB.1.1'

this is the command I used for running the script (in anaCogent environment):

filter_away_subset.py all_sizes.quivered_hq_MPE_S2.fastq.collapsed

the input and generated output files are available here:
https://www.dropbox.com/sh/z4wps7y6gvyg4w2/AACDiAk4X-SyyoeWYwTqL0i2a?dl=0

@Magdoll
Copy link
Owner

Magdoll commented Aug 23, 2017

Hi @markopetek ,

Your count file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt is somehow incomplete. It is missing, for example, the count information for PB.1.1, which is present in other collapse output such as all_sizes.quivered_hq_MPE_S2.fastq.collapsed.GFF.

The file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.group.txt shows that you have 9703 isoforms, this means there should be 9703+15 (for header)=9718 lines in all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt but it has only 4339.

--Liz

@markopetek
Copy link
Author

Hi Liz,

I thought this incompleteness of the abundance file might be caused by some glitch in the gmap produced sam file so I've mapped the reads using STARlong, sorted and rerun the collapse and filer script again but I get the same erorr. Would you suggest any parameter change when running the collapse script in order to avoid this outcome? Maybe the -c 0.99 -i 0.95 are too strict since I've mapped cultivated potato transcripts to a reference which is a different subspecies - that might be why the gff has more isoforms than the abundance file.

@Magdoll
Copy link
Owner

Magdoll commented Aug 25, 2017

Hi @markopetek ,
Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect.

@markopetek
Copy link
Author

markopetek commented Aug 28, 2017 via email

@Magdoll
Copy link
Owner

Magdoll commented Aug 28, 2017

Hi @markopetek ,

I believe I have identified the issue. Somehow some of your IDs had an extra underline in your FASTQ file, but they were not there in the cluster_report.csv (the cluster report is what is read to generate counts).

The group txt looked like this:

PB.1.1  i0_HQ__S2|c22385/f3p0/1042
PB.2.1  i0_HQ__S2|c218058/f3p12/1377
PB.2.2  i0_HQ__S2|c113604/f3p17/1237
PB.2.3  i0_HQ__S2|c356155/f2p6/825
PB.3.1  i0_HQ__S2|c212424/f2p1/1373
PB.3.2  i0_HQ__S2|c4103/f3p1/1275
PB.4.1  i1_HQ_S2|c114684/f2p18/3644
PB.4.2  i1_HQ_S2|c14710/f2p15/2321
PB.5.1  i1_HQ_S2|c161525/f14p22/3663

where you notice some of the IDs had two underlines __ instead of one before S2. This was consistent with the HQ fastq IDs but were inconsistent with the cluster_report.csv, which all had only a single underline:

cluster_id,read_id,read_type
i3_ICE_S2|c0,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/22583/37_16585_CCS,FL
i3_ICE_S2|c1,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/34221/37_14492_CCS,FL
i3_ICE_S2|c2,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/127040/12966_55_CCS,FL

(don't be alarmed to see _ICE_ instead of _HQ_ in cluster_report.csv; it's supposed to be that way because at the end of ICE, we don't know whether clusters become HQ or LQ; the script get_abundance_post_collapse.py understands this difference)

Once I changed the group file to remove the extra underline and ran get_abundance_post_collapse.py again, I got the right results and subsequently filter_away_subset.py worked as well.

I've put the fixed group file and filtered results here:
https://www.dropbox.com/s/ec7uhd3sbllenkb/fixed.tar.gz?dl=0

@markopetek
Copy link
Author

Thanks for the help Liz. I see that in the fastq file the sequences starting with @i0_HQ are followed by two underlines, those starting with @i1_HQ or @i2_HQ have only one. I'll rerun the gmap mapping in order to include the chloroplast and mitochondrial genomes besides the nuclear genome. I'll fix the input fastq before running the collapse script by replacing all @i0_HQ__S2 with @i0_HQ_S2. Is that OK or should I rather fix the group.txt file as you did?

@Magdoll
Copy link
Owner

Magdoll commented Aug 30, 2017

Fixing the HQ FASTQ file header itself is easier because that's the "root" of the problem. Fixing the group.txt is also fine, you just have to remember to do it every time you re-run collapse.

What's not clear to me is why for @i0_HQ there is two underlines. I've never seen it happen. Hopefully it's just a human error somewhere.
--Liz

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants