XMLSyntaxError: switching encoding: encoder error #1
Comments
Hi there,
Note that the script has been able to download perfectly a short thread (just a few DM, no images no nothing). |
Hello Laurent, Are you also using macOS? It seems there is an error with the lxml library when it reaches a message with accented characters. Could you confirm there is no accented characters for the short thread which is working for you? It's quite difficult for me to identify the exact cause because I do not own a Mac to debug it. It works properly on Windows and Linux. I keep looking for a possible fix for macOS. There's a command to run for the UTF-8 support in the Terminal which should be executed before the script but I'm not sure it would make a difference here: |
Hi, I tried to execute the command you gave, but the problem is still there. Thanx for the help, it would be really cool to have this script work. |
I'm going to add a raw mode to fetch JSON responses without using the parser. I will also add a verbose mode and add proper error handling. I hope it will help us to find the root cause. Thanks for the tests. |
Zupa. Keep up the good work, looking forward to testing it :) |
(BTW, just tested the Windows exe on a basic Windows 10 Family, worked perfectly fine with every kind of DM thread… good job) |
Yep. I've already updated the script to use local time instead of UTC. It has not been pushed to GitHub yet. And for the error, it confirms the issue is related to the macOS setup. |
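For readers wondering what the UTC-to-local change amounts to, here is a minimal sketch using only the standard library. `utc_to_local` is a hypothetical helper written for illustration; it is not taken from the script.

```python
from datetime import datetime, timezone


def utc_to_local(timestamp):
    """Convert a POSIX timestamp to an aware datetime in the local time zone.

    Hypothetical helper, for illustration only.
    """
    # Build an aware UTC datetime first, then let astimezone() pick the
    # time zone configured on the machine.
    utc_dt = datetime.fromtimestamp(timestamp, timezone.utc)
    return utc_dt.astimezone()


if __name__ == '__main__':
    # The printed value depends on the time zone of the machine.
    print(utc_to_local(1478595420).strftime('%Y-%m-%d %H:%M:%S %Z'))
```

The instant in time is unchanged; only the displayed offset differs from one machine to another.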
Thanks to a friend of mine with a Mac, I've been able to track down what seems to be the root cause of this bug. The parsing fails when a tweet contains an emoji. The generated code will look like this for the image. It contains the With this new information, I've found this bug ticket with a similar issue: Additional tests have been done on macOS and no issue has been identified with multiple kinds of accented characters or URL. This issue only seems to occur with emoji unicode. Consequently, I'm going to do the following:
or simpler alternative
|
\o/ |
Could you just confirm there was no emoji for the thread you've been able to parse on macOS, Laurent? |
Yes, it was an old and short thread with no emojis at the time… |
Having the exact same problem. Happy to hear you're working on a fix! \o/ |
I think I have a fix in b7c316a for the Mac OS users but I need confirmation guys.
|
I did. Got a little further this time: 3 images (instead of 0), 0 text files. Error:
Maybe something went wrong with the update? I got this:
|
@muesliq: It seems you're using the wrong version of Python (2.7 instead of 3.5). Could you try with That's my fault. It's mandatory to specify And I guess you've been able to download more images only because those images have been uploaded recently, without emojis in tweets in or after them. |
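A quick way to rule out this kind of interpreter mix-up is to check the version at startup. The sketch below assumes 3.5 is the minimum, based only on the version mentioned above; `python_is_supported` is a hypothetical helper, not part of the script.

```python
import sys

# Assumption: 3.5 is taken from the comment above, not from the package metadata.
MINIMUM_VERSION = (3, 5)


def python_is_supported(version_info=None, minimum=MINIMUM_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == '__main__':
    if not python_is_supported():
        print('This interpreter is Python {0}.{1}; Python {2}.{3} or later is expected.'.format(
            sys.version_info[0], sys.version_info[1],
            MINIMUM_VERSION[0], MINIMUM_VERSION[1]))
```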
Updated, thanks! Better now but not fixed yet. Thousands of tweets processed, 129 images, yet still 0 text files.
|
Ok, thanks. I've added exception handling to print the ID of the tweet that raises the exception. The script should now continue, even when a tweet is causing issues. You can upgrade with This is a poor, temporary solution but the raw HTML of the offending tweets will also be output in the log file as a [DMConversationEntry] with a [ParseError] tag. It will help me to understand what's causing the issue. The only weird situation I saw is a random position of the img attributes that makes the regex fail. I've seen |
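The "log and continue" behaviour described above can be sketched as follows. This is an illustration under assumptions, not the script's code: `parse_all`, `parse_one` and the exact layout of the log line are hypothetical, and only the two tags come from the comment above.

```python
def parse_all(tweets, parse_one):
    """Parse every (tweet_id, raw_html) pair without stopping on errors.

    `parse_one` is any callable that turns raw HTML into a parsed entry and
    may raise. Returns the parsed entries and the log lines for failures.
    """
    parsed = []
    error_log = []
    for tweet_id, raw_html in tweets:
        try:
            parsed.append(parse_one(raw_html))
        except Exception as exc:  # broad on purpose: keep crawling whatever happens
            print('Unexpected error for tweet {0}: {1}'.format(tweet_id, exc))
            # Hypothetical log layout; keeps the raw HTML for later diagnosis.
            error_log.append('[DMConversationEntry] [ParseError] {0} {1}'.format(
                tweet_id, raw_html))
    return parsed, error_log


if __name__ == '__main__':
    def fake_parser(raw_html):
        # Stand-in for the real parser: fails on one specific input.
        if 'broken' in raw_html:
            raise ValueError('cannot parse')
        return raw_html.upper()

    entries, errors = parse_all([('1', '<p>ok</p>'), ('2', '<p>broken</p>')], fake_parser)
    print(entries)
    print(errors)
```

Catching every exception hides the root cause, which is why keeping the raw HTML in the log matters.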
Now the upgrade doesn't seem to work:
|
I had the same issue. It's quite strange. Maybe a temporary issue with PyPI? I've been able to uninstall it and reinstall it with the latest version (0.0.10). To exclude caching issues for package download, I've also deleted the following folder on Windows: For Unix, it seems to be |
Hi! One error though, with one thread. Had a lot of The Twitter user has an emoji in her username (see below the beginning of the file that has been written)
I guess it might be the problem..? We're getting there! |
Two tweets (out of 12620) had an "unexpected error". The first one contained the letter 𝜋. The second had the following tweet embedded (which contained lots of emoji): https://twitter.com/magnifier661/status/787044538145574912 |
Thanks a lot @LaurentLC and @muesliq! 👍 You've been able to identify 3 currently not properly handled cases:
I'm not sure yet how I will be able to find proper workarounds. The bug is in the lxml lib for Mac OS. Identifying emojis with regex does not seem possible. The error with 𝜋 (U+1D70B 𝜋 MATHEMATICAL ITALIC SMALL PI) also means that the issue will not be limited to emojis. It's only a simple character so it could mean the script cannot handle non-ASCII characters at all on Mac OS... :-/ Update: My guess is the error is related to code points encoded on four bytes.
Emojis are also encoded in Plane 1 (1F000–1FFFF) so I may drop all content in the range 10000-2FFFF (Planes 1 & 2). It contains mainly ancient Egyptian characters, mathematical symbols and emojis. For reference: |
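To check the four-byte hypothesis on a given message, one can list the characters whose UTF-8 encoding takes four bytes, together with their Unicode plane. A minimal sketch; `four_byte_chars` is a hypothetical helper written for this illustration.

```python
def four_byte_chars(text):
    """Return (character, code point, plane) for each character of `text`
    that needs four bytes in UTF-8, i.e. any code point above U+FFFF."""
    found = []
    for char in text:
        if len(char.encode('utf-8')) == 4:
            code_point = ord(char)
            # Each plane spans 0x10000 code points, so the plane number is
            # the code point shifted right by 16 bits.
            found.append((char, 'U+{0:04X}'.format(code_point), code_point >> 16))
    return found


if __name__ == '__main__':
    for entry in four_byte_chars('caf\u00e9 \u20ac \U0001D70B \U0001F604'):
        print(entry)
```

Accented letters (two bytes) and symbols such as € (three bytes) are not reported, which matches the observation that accented characters parse fine on macOS.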
By the way: Fantastic little piece of software. Thank you! |
Happy to help. 😄 I have implemented in 073a358 a more general solution as a "fix" for this issue. On Mac OS X, all the Unicode characters encoded on 4 bytes are now replaced by "□" before the lxml parsing. Consequently, it should fix all the encountered issues and allow a flawless parsing. 😄 To celebrate this, I've bumped the version to 0.1.0. 😉 |
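For anyone who needs the same workaround elsewhere, the replacement can be sketched like this. It is a minimal illustration that assumes a regular expression over the code points above U+FFFF (the ones UTF-8 encodes on four bytes); `replace_non_bmp` is a hypothetical name and this is not necessarily the code in 073a358.

```python
import re

# Every code point outside the Basic Multilingual Plane needs four bytes in UTF-8.
NON_BMP_PATTERN = re.compile('[\U00010000-\U0010FFFF]')


def replace_non_bmp(text, placeholder='\u25a1'):
    """Replace each four-byte character with `placeholder` (a white square by default)."""
    return NON_BMP_PATTERN.sub(placeholder, text)


if __name__ == '__main__':
    # The emoji and the mathematical pi are replaced; the accented letter is kept.
    print(replace_non_bmp('caf\u00e9 \U0001F604 \U0001D70B'))
```

The price of this approach is that the emoji are lost in the archive, but the rest of the message is preserved.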
Rejoice Mac users, I've been able to make a precompiled executable for macOS. It should be a lot easier for non-technical users to use. 😄 |
Fixed in 073a358 |
OMGoodness I was so excited it was backing up messages with this new On Fri, Nov 4, 2016 at 10:53 AM, Julien Ehrhart notifications@github.com
|
this is what it looked like as it was running before it got the error On Mon, Nov 7, 2016 at 8:42 PM, Ronnie Sussman sussron@gmail.com wrote:
|
now i got this screen On Mon, Nov 7, 2016 at 8:44 PM, Ronnie Sussman sussron@gmail.com wrote:
|
oh it didn't let me attach the 5MB file of the one particular message But here are all the various threads that were in the command screen. The Last login: Mon Nov 7 20:43:14 on ttys000 Ronnies-MacBook-Pro:~ ronniesussman$ Enter your username or email: beckybulldognj Enter your password (characters will not be displayed): Authentication succeedeed. Conversation ID not specified. Retrieving all the threads. Starting crawl of '629006352329760768' Begin of thread reached Total processed tweets: 49899 Writing conversation to 629006352329760768.txt [Truncated for confidentiality reasons] logout Saving session... ...copying shared history... ...saving history...truncating history files... ...completed. [Process completed] On Mon, Nov 7, 2016 at 9:09 PM, Ronnie Sussman sussron@gmail.com wrote:
|
Wow so i tried it a second time and WOW!! it ran through the process. I'm Trust me so i'm excited to have any of these, even in text version without Julien, thanks so much. On Mon, Nov 7, 2016 at 9:09 PM, Ronnie Sussman sussron@gmail.com wrote:
|
I'm not sure all the messages were backed up. i'm looking for 2 particular Thanks! On Mon, Nov 7, 2016 at 9:12 PM, Ronnie Sussman sussron@gmail.com wrote:
|
Does seem it didn't capture all the conversations or go to the first line. Thanks On Nov 7, 2016 9:37 PM, "Ronnie Sussman" sussron@gmail.com wrote:
|
Hello Ronnie, Glad to see you're getting better results. However, I am still not sure I understand which error message you're talking about. There is no known limitation on the thread size. If there is an error, it should appear in the generated file. Messages deleted by the users cannot be recovered. If you want to download images and GIFs from your specific conversation (629006352329760768), you should try to run the command with the following parameters: dmarchiver -id "629006352329760768" -di -dg You should also be careful with the information sent on this site. The conversation ID for a conversation between two people is "userid1-userid2," so it could be possible to know who you're talking to on Twitter. |
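To illustrate why a two-person conversation ID is sensitive, here is a minimal sketch that splits it back into the two user IDs. `participants` is a hypothetical helper, and the assumption that an ID without a hyphen is not a two-person conversation comes only from the examples in this thread.

```python
def participants(conversation_id):
    """Return the two user IDs of a "userid1-userid2" conversation ID.

    Returns None when the ID does not have that shape (assumption: such IDs,
    like the one quoted above, do not expose the participants).
    """
    parts = conversation_id.split('-')
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return parts
    return None


if __name__ == '__main__':
    # Made-up IDs, for illustration only.
    print(participants('1234567-7654321'))
    print(participants('629006352329760768'))
```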
Thanks for the message Julien. That great script you wrote was awesome I could put in my name and password Thanks On Nov 8, 2016 2:57 AM, "Julien Ehrhart" notifications@github.com wrote:
|
On some rare occasions, the script may have an error due to a connection issue. Just open a Terminal (command screen) and copy paste the following: The script will download again the 50,000 messages of your thread but this time, a folder will be created with images and GIFs. It could take a bit longer to download. 😄 For the missing message, I'm interested to know if it has something special that could explain why you do not find it in the generated file (special characters, emojis, large message...). |
For the missing threads It's actually not a very large long message. Thanks Julien On Nov 8, 2016 9:09 AM, "Julien Ehrhart" notifications@github.com wrote:
|
You cannot specify a specific message id, the tool can only accept a conversation (or "thread") id. Try to run the command I've sent to you in my previous message and check if you've been able to download a complete conversation, with images this time. |
oh i meant conversation not message, but let me try doing that inspect On Tue, Nov 8, 2016 at 11:03 AM, Julien Ehrhart notifications@github.com
|
Ok so it's running now on a single thread and looks to be processing more On Nov 8, 2016 11:07 AM, "Julien Ehrhart" notifications@github.com wrote:
|
I wouldn't have guessed people have such crazy conversations going on thanks to Twitter DMs. 😝 You're pushing the limits of the tool. Tell me how many tweets have been archived at the end of this thread. 😄 You can already check the downloaded images in your "Downloads" folder, a new folder "629006352329760768" should have been created with the pictures and GIFs (as MP4 files). |
127,555 messages in one conversation thread On Tue, Nov 8, 2016 at 2:25 PM, Julien Ehrhart notifications@github.com
|
you did it. you did it!!!! On Tue, Nov 8, 2016 at 2:35 PM, Ronnie Sussman sussron@gmail.com wrote:
|
i tried another one, but got this error, do you know what it means? Ronnies-MacBook-Pro:~ ronniesussman$ Enter your username or email: beckybulldognj Enter your password (characters will not be displayed): Authentication succeedeed. Conversation ID specified (629006352329760768). Retrieving only one thread. Starting crawl of '629006352329760768' Failed to execute script cmdline Traceback (most recent call last): File "dmarchiver/cmdline.py", line 70, in File "dmarchiver/cmdline.py", line 62, in main File "dmarchiver/core.py", line 468, in crawl File "requests/models.py", line 826, in json File "json/init.py", line 319, in loads File "json/decoder.py", line 339, in decode File "json/decoder.py", line 357, in raw_decode json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) Ronnies-MacBook-Pro:~ ronniesussman$ On Tue, Nov 8, 2016 at 2:51 PM, Ronnie Sussman sussron@gmail.com wrote:
|
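A note on the traceback above: "Expecting value: line 1 column 1 (char 0)" is what Python's json module reports when the text it receives is empty or is not JSON at all (an HTML error page, for example). A defensive decode could be sketched as below; `decode_json_or_none` is a hypothetical helper and not the fix applied in the script.

```python
import json


def decode_json_or_none(body):
    """Return the decoded JSON document, or None when `body` is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so this also
        # covers an empty body or an HTML page returned instead of JSON.
        return None


if __name__ == '__main__':
    for body in ('{"items": []}', '', '<html>Rate limited</html>'):
        print(repr(body), '->', decode_json_or_none(body))
```

Returning None lets the caller decide whether to retry the request or stop with a clearer message.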
Ronnie, I've created another specific issue for this error because I consider this one solved. Could you go there and check for the questions I have regarding this new error message? Thank you. |
Edited by Mincka on August 10th 2017:
For anybody Googling for this error message
XMLSyntaxError: switching encoding: encoder error
Possible workarounds:
STATIC_DEPS=true
(Python 3.5 - Unable to build DOM tree. lorien/grab#199 (comment)). However, I cannot guarantee this will work. Using multiple Python versions on macOS is such a huge pain. 😞
Original message:
My setup: