@Changaco
Member

Fixes #17.

Changaco added 2 commits July 16, 2017 23:16
lxml applies basic UTF-8 decoding to each chunk, which fails when a chunk ends in the middle of a UTF-8 sequence

```
Traceback (most recent call last):
  File "legi/tar2sqlite.py", line 510, in <module>
    main()
  File "legi/tar2sqlite.py", line 486, in main
    process_archive(db, args.directory + '/' + archive_name)
  File "legi/tar2sqlite.py", line 262, in process_archive
    xml.feed(block)
  File "src/lxml/parser.pxi", line 1217, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114563)
  File "src/lxml/parser.pxi", line 1339, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114436)
  File "src/lxml/parser.pxi", line 586, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:105777)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105896)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107604)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106458)
  File "<string>", line 227
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 EOF, line 227, column 289
```
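
The `Bytes: 0xC3 EOF` line in the traceback shows a chunk that ended on the lead byte of a two-byte UTF-8 sequence. As a rough illustration of one possible workaround (not necessarily the change made in this pull request), the sketch below holds back an unfinished trailing sequence and prepends it to the next chunk. The helper names `split_incomplete_tail` and `utf8_safe_chunks` are hypothetical and only used for this example.

```python
def split_incomplete_tail(chunk):
    """Split chunk into (complete, tail), where tail is an unfinished
    UTF-8 sequence at the end of the chunk (or b"" if there is none)."""
    # A UTF-8 sequence is at most 4 bytes long, so an unfinished one
    # can occupy at most the last 3 bytes.
    for back in range(1, min(3, len(chunk)) + 1):
        byte = chunk[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for the lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1  # plain ASCII byte
        if needed > back:
            # The sequence started here is cut off by the chunk boundary.
            return chunk[:-back], chunk[-back:]
        break
    return chunk, b""


def utf8_safe_chunks(chunks):
    """Re-chunk an iterable of byte strings so that no yielded chunk
    ends in the middle of a UTF-8 sequence."""
    pending = b""
    for chunk in chunks:
        complete, pending = split_incomplete_tail(pending + chunk)
        if complete:
            yield complete
    if pending:
        # Truncated input: pass it through so the consumer can report it.
        yield pending


data = "caf\u00e9 cr\u00e8me".encode("utf-8")
# Fixed-size chunks of 4 bytes; the first one ends on the lead byte 0xC3.
raw_chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
safe_chunks = list(utf8_safe_chunks(raw_chunks))
```

Each chunk produced this way could then be passed to the feed parser (the `xml.feed(block)` call seen in the traceback) without ending mid-character, assuming the archive content really is UTF-8.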
@Changaco
Member Author

For the lxml bug, I added a comment to https://bugs.launchpad.net/lxml/+bug/1671109.

@Changaco Changaco merged commit 15fd400 into master Jul 17, 2017
@Changaco Changaco deleted the fix-duplicates branch July 17, 2017 08:22
Seb35 pushed a commit to Seb35/legi.py that referenced this pull request Jan 8, 2020