UnicodeDecodeError during linked_fragment_file generation #77

Open
chunlinxiao opened this issue Jun 30, 2019 · 27 comments

@chunlinxiao

Traceback (most recent call last):
  File "./HapCUT2/utilities/LinkFragments.py", line 492, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "./HapCUT2/utilities/LinkFragments.py", line 327, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "./HapCUT2/utilities/LinkFragments.py", line 212, in read_fragment_matrix
    for line in fm:
  File "./HapCUT2/myenv3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 644: invalid start byte

Step 1 (extractHAIRS, generating the unlinked_fragment_file) runs fine.

But step 2 (LinkFragments.py, generating the linked_fragment_file) fails with the error above.

Do you have any idea? Thanks

@FrickTobias

I have the same problem.

I've also identified and removed the row causing the error (by opening the unlinked.txt file in binary mode and using f.seek(7010) and f.read(10)), but that just produced the same error at position 7021 instead of 7010, so it seems there is a bigger problem here.
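In case it helps with reproducing, here is a minimal sketch of how to locate the offending byte (a hypothetical helper, not part of HapCUT2; it decodes the whole file in memory):

def find_bad_byte(path):
    # Return the absolute offset of the first byte that is not valid UTF-8,
    # or None if the file decodes cleanly.
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        return e.start
    return None

print(find_bad_byte("unlinked.txt"))  # prints the byte offset, e.g. 7010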

error message

cat BLR_out/hapcut2_linkfragments.log 
Linking 10X fragments on chromosome: chr1
Traceback (most recent call last):
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 7021: invalid start byte

Commands used to get input files

freebayes \
    -f genome_reference \
    input.bam \
    vars.vcf
vcffilter \
    -f "AC > 0" \
    vars.vcf \
    > vars-filt.vcf
extractHAIRS \
         --10X 1 \
         --bam input.bam \
         --VCF vars-filt.vcf \
         --out unlinked.txt
LinkFragments.py \
         --bam input.bam \
         -v vars-filt.vcf \
         --fragments unlinked.txt \
         --out linked.txt

@pjedge
Collaborator

pjedge commented Nov 25, 2019

What happens if you change to the "utilities" directory and run test_LinkFragment.py? (Please change the value of the EXTRACTHAIRS variable in the test script to the path of the extractHAIRS binary.)
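A sketch of that edit (the variable name is from the comment above; the path is just an example):

# In utilities/test_LinkFragment.py, point EXTRACTHAIRS at your binary:
EXTRACTHAIRS = "/path/to/HapCUT2/build/extractHAIRS"  # example path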

@FrickTobias

FrickTobias commented Nov 26, 2019

Thanks for taking the time to answer.

I'm running it through conda, so I am not 100% sure my workaround is equivalent to your description. Therefore I'll include a description of what I did.

Results

It seems to finish without any errors.

cd HapCUT2-1.2/utilities/
pytest .
pytest HapCUT2-1.2/utilities/
================================================ test session starts =================================================
platform linux -- Python 3.6.7, pytest-5.2.2, py-1.8.0, pluggy-0.12.0
rootdir: /domus/h1/tobiasf
collected 31 items                                                                                                   

HapCUT2-1.2/utilities/test_LinkFragment.py ...............................                                     [100%]

================================================= 31 passed in 1.20s =================================================

What I did

To run test_LinkFragment.py I downloaded the latest release's source files (v1.2) and then changed the EXTRACTHAIRS variable to the output of which extractHAIRS. Everything was run in the same environment as in the original post, and I also double-checked the conda HapCUT2 version distribution.

conda search -f hapcut2
Loading channels: done
# Name                       Version           Build  Channel             
hapcut2                          1.1      hed695b0_0  bioconda            
hapcut2                          1.2      hed50d52_0  bioconda

@pjedge
Collaborator

pjedge commented Nov 26, 2019

Gotcha. So this bug is not caught by the existing tests. Can you show me which row of the unlinked fragment file was "causing" the crash?

@FrickTobias

From the first error where it couldn't decode position 7010:

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)
b'0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGT'

@pjedge
Collaborator

pjedge commented Nov 27, 2019

This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

@pjedge
Collaborator

pjedge commented Nov 27, 2019

I think I reproduced this bug with a 10x dataset of my own. Should be fixed soon.

@FrickTobias

FrickTobias commented Nov 27, 2019

That's great, I'll leave the closing of the issue to you when it's been implemented. Let me know if you need something more.

Thank you so much for the help! If I could rate this interaction I'd give it a 10/10!

@FrickTobias

This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

You'll find the row posted before within the dump (it's the one beginning with 1 A00621:130:HN5HWDSXX:4:2467:8359:1595).

with open("unlinked.txt", "rb") as f:
    f.seek(6500)
    print(f.read(1000))

b':HN5HWDSXX:4:2549:22426:20071 2 CTTGTGTGCTTCCTTGTTAG -1 1122 00 88\n1 A00621:130:HN5HWDSXX:4:1226:5538:10708 2 AAAACAACGATCGTTCTGCG -1 1132 0 F\n1 A00621:130:HN5HWDSXX:4:2229:6560:19304 2 TTAAGAAAGGCGTTAGCAAC -1 1134 0 F\n1 A00621:130:HN5HWDSXX:4:1557:6650:27007 2 CTTCGAACCTTGCGCACTTC -1 1141 0 F\n1 A00621:130:HN5HWDSXX:4:2365:19479:22169 2 CTAATTAATTTCGGTGTGTG -1 1145 0 F\n1 A00621:130:HN5HWDSXX:4:1502:12246:4695 2 GGTGGAAGTAAGCGCAGTAA -1 1145 00 FF\n1 A00621:130:HN5HWDSXX:4:2207:31331:23375 2 GAACGTTATTCGTGTAGAAA -1 1151 0 F\n1 A00621:130:HN5HWDSXX:4:2509:13485:19022 2 CTTAGGTCCGTCGTTCCTCA -1 1168 0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGTTCTGCG -1 1178 00000000 FFFFFFFF\n1 A00621:130:HN5HWDSXX:4:2628:4589:9439_FILTERED 2 NULL -1 1189 0 :\n1 A00621:130:HN5HWDSXX:4:2314:5529:24220 2 GTAGGTAGGAAACTCCTTCC -1 1197 10 F:\n1 A00621:130:HN5HWDSXX:4:1145:28646:27743 2 GTACTTTATGTCTGCGCGCC -1 1198 0 <\n1 A00621:130:HN5HWDSXX:4:2341:10375:19210 2 CACGTGCATACAGTCGTATG -1 1198 0 ?\n1 A0062'
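For reference, an equivalent way to dump a readable window around the reported offset, with undecodable bytes replaced so the line structure stays visible (the offset 7021 is taken from the error above):

offset = 7021  # byte position reported by the UnicodeDecodeError
with open("unlinked.txt", "rb") as f:
    f.seek(max(0, offset - 500))
    window = f.read(1000)
print(window.decode("utf-8", errors="replace"))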

@pjedge
Collaborator

pjedge commented Nov 27, 2019

So I didn't actually reproduce this same issue, it was a separate utf-8 issue that looked similar.

Would it be possible for you to send your entire unlinked.txt (just that file) to me? edge dot peterj at gmail dot com

@FrickTobias

Alright, I've mailed the file to you.

subject: HapCUT2 issue #77 file
sender: tobias.frick@scilifelab.se

@FrickTobias

A little bit down the road...

Fixed by two commits:

There was also a downstream problem with buffer sizes that wasn't part of the original issue but was patched in order to verify the initial utf-8 fix. It turned out to be rooted in a hard-coded buffer size that needed to be increased.
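As an aside, stray bytes like this explain the exact wording of the error. A minimal demonstration, illustrative only and not taken from the HapCUT2 code:

try:
    b"AAAACAAC\xb8GATCGT".decode("utf-8")
except UnicodeDecodeError as e:
    # 0xb8 has the bit pattern 10111000, a UTF-8 continuation byte, so it
    # can never begin a character, hence the "invalid start byte" message.
    print(e)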

I've just finished running my data through all the steps without any further hiccups, so I consider the issue closed.

Thanks to @pjedge for his excellent perseverance in resolving this issue.

@FrickTobias

<Anyone who has author privileges, feel free to close this issue>

@vibansal vibansal closed this as completed Feb 7, 2020
@chunlinxiao
Author

thanks

@chunlinxiao
Author

still have the same errors:

LinkFragments.py 10X ...
Linking 10X fragments on chromosome: scaffold_1
Traceback (most recent call last):
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/usr/local/HapMAP2/1.2/venv/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 4911: invalid start byte

@vibansal vibansal reopened this Feb 20, 2020
@chunlinxiao
Author

Our systems have the v1.2 release installed, which may not include all the recent fixes. I guess this may be why I still got the same error.

@DrPintoThe2nd

Hi,

I hit this same issue on the conda install, and when I tried a quick install from GitHub directly to troubleshoot, make generated this error:

gcc -g -O3 -Wall -D_GNU_SOURCE -Ihtslib -c hairs-src/bamread.c -lhts -o build/bamread.o
In file included from hairs-src/bamread.c:1:0:
hairs-src/bamread.h:10:10: fatal error: htslib/sam.h: No such file or directory
 #include "htslib/sam.h"
          ^~~~~~~~~~~~~~
compilation terminated.
Makefile:33: recipe for target 'build/bamread.o' failed
make: *** [build/bamread.o] Error 1

htslib v1.9 is in my path.

Best,
bjp

@vibansal
Owner

vibansal commented Apr 9, 2020

If you add the actual path to 'htslib' to the Makefile (the HTSLIB variable), the code should compile.

I pushed a minor change to LinkFragments.py that could potentially fix the UnicodeDecodeError. If you could try it and let me know, that would be really helpful.
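For context, one generic shape such a fix can take (a sketch only, not necessarily the pushed change; hairs_file and fm are the names from the tracebacks above, and process() is a hypothetical stand-in for the existing per-line parsing):

with open(hairs_file, "r", encoding="utf-8", errors="backslashreplace") as fm:
    for line in fm:
        process(line)  # hypothetical stand-in for the existing parsing

The backslashreplace handler keeps stray bytes visible as escape sequences instead of raising UnicodeDecodeError mid-file.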

@DrPintoThe2nd

I downloaded the new LinkFragments.py and successfully installed htslib and HapCUT2 from GitHub, and everything appears to be working! I'll post confirmation when it finishes completely, but it has passed the place where it used to error.

Thanks so much for your quick response and fix to this issue!! Really needed a win today!
Best,
bjp

@DrPintoThe2nd

Hi Bansal lab,
It appears that LinkFragments.py has been successfully patched, although it's still running (going on 10 days) and is only about a third of the way through the genome (purely by scaffold number). Is this a normal run-time for this script?
Thanks again!
Best,
bjp

@pjedge
Collaborator

pjedge commented Apr 19, 2020 via email

@DrPintoThe2nd

1.8 Gb genome, ~3,000 scaffolds (with a scaffold L90 of 112) and ~1.7 billion 10X reads.

Currently approaching scaffold 1,000 without any notable increase in speed (although these scaffolds are mighty small).

@olavurmortensen

I'm experiencing the same error as @chunlinxiao and @FrickTobias with v1.3.

I installed HapCUT2 v1.3.2 via conda (https://anaconda.org/bioconda/hapcut2). It's installed in one of my projects here. The recipe for the bioconda package can be found here: https://bioconda.github.io/recipes/hapcut2/README.html

Just to be clear, I don't think this is an issue with the bioconda recipe; the recipe YAML clearly uses v1.3.2 as the source.

@vibansal
Owner

Can you please share a dataset for reproducing this error? I can provide an upload link if needed. Thanks.

@olavurmortensen

olavurmortensen commented Feb 19, 2021

@vibansal If I use the method used above to find the particular line that causes the problem, will that suffice? I cannot share the full dataset.

EDIT:

The "method used above" being the following.

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)

@vibansal
Owner

Yes, having a few lines around the problematic line will be helpful.

@olavurmortensen

Hi, sorry about leaving this hanging. The problem solved itself. Most likely there was some versioning confusion.

You can close this issue.
