This repository has been archived by the owner on Feb 28, 2023. It is now read-only.

XMLSyntaxError: switching encoding: encoder error #1

Closed
denisjacquemin opened this issue Oct 15, 2016 · 45 comments

@denisjacquemin

denisjacquemin commented Oct 15, 2016

Edited by Mincka on August 10th 2017:
For anybody Googling for this error message XMLSyntaxError: switching encoding: encoder error:

  • It may be related to the parsing in lxml of emojis or specific ranges of Unicode characters (like 𝜋) which are four-byte characters
  • The issue is specific to macOS and Python 3.5
  • A bug ticket is open, but nobody seems to be working on it (https://bugs.launchpad.net/lxml/+bug/1538213)

Possible workarounds:

  1. Strip the emojis on macOS before the parsing, see this implementation in 073a358
  2. Downgrade to Python 3.4 if you can. I attempted to upgrade to Python 3.6 but hit other compatibility issues, this time with pyinstaller, so I was unable to move forward. Downgrading to Python 3.4 allowed my tool to work perfectly on all platforms.
  3. Remove the lxml package and reinstall it with STATIC_DEPS=true (see "Python 3.5 - Unable to build DOM tree", lorien/grab#199 (comment)). However, I cannot guarantee this will work. Using multiple Python versions on macOS is such a huge pain. 😞

Original message:
My setup:

  • Python 3.5.2
  • macOS Sierra 10.12
$ dmarchiver
Enter your username or email: myusername
Enter your password (characters will not be displayed): 
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '################'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.5', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 67, in main
    crawler.crawl(thread_id, args.download_images, args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 443, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 357, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
@LaurentLC

LaurentLC commented Oct 26, 2016

Hi there,
I basically have the same error, trying to download a huge DM thread:

Conversation ID specified (xxxxx). Retrieving only one thread.
Starting crawl of 'xxxxx'
Traceback (most recent call last):
  File "/usr/local/bin/dmarchiver", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
    tweets, download_images, download_gif)
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Note that the script was able to download a short thread perfectly (just a few DMs, no images, nothing else).

@Mincka
Owner

Mincka commented Oct 26, 2016

Hello Laurent,

Are you also using macOS? It seems there is an error with the lxml library when it reaches a message with accented characters. Could you confirm there are no accented characters in the short thread that works for you?

It's quite difficult for me to identify the exact cause because I do not own a Mac to debug it. It works properly on Windows and Linux. I'll keep looking for a possible fix for macOS.

There's a command to run for the UTF-8 support in the Terminal which should be executed before the script but I'm not sure it would make a difference here:
export PYTHONIOENCODING=utf-8

@LaurentLC

LaurentLC commented Oct 27, 2016

Hi,
As I said via email (I thought it would be also posted here, whatever), I do have more or less the same conf: Mac OS 10.11.6, Python 3.5, lxml 3.6.4.
Unfortunately, the short thread that worked also contains accented characters (damn French people), so that's probably not the cause…

I tried to execute the command you gave, but the problem is still there.

Thanx for the help, it would be really cool to have this script work.

@Mincka
Owner

Mincka commented Oct 27, 2016

I'm going to add a raw mode to fetch JSON responses without using the parser. I will also add a verbose mode and add proper error handling. I hope it will help us to find the root cause. Thanks for the tests.

@LaurentLC

LaurentLC commented Oct 27, 2016

Zupa. Keep up the good work, looking forward to testing it :)

@LaurentLC

LaurentLC commented Oct 29, 2016

(BTW, just tested the Windows exe on a basic Windows 10 Family, worked perfectly fine with every kind of DM thread… good job)
(but it seems that the time zone offset is not correct: the French +2 hours are missing)

@Mincka
Owner

Mincka commented Oct 29, 2016

Yep. I've already updated the script to use local time instead of UTC. It has not been pushed to GitHub yet. As for the error, this confirms the issue is related to the macOS setup.

@Mincka
Owner

Mincka commented Nov 1, 2016

Thanks to a friend of mine with a Mac, I've been able to track down what seems to be the root cause of this bug.

The parsing fails when a tweet contains an emoji. The generated code will look like this for the image. <img title="Visage avec des larmes de joie" class="Emoji Emoji--forText" draggable="false" aria-label="Emoji: Visage avec des larmes de joie" alt="😂" src="https://abs.twimg.com/emoji/v2/72x72/1f602.png">

It contains the alt attribute with the unicode character of the smiley (😂).

With this new information, I've found this bug ticket with a similar issue:
https://bugs.launchpad.net/lxml/+bug/1538213

Additional tests have been done on macOS and no issue has been identified with multiple kinds of accented characters or URL. This issue only seems to occur with emoji unicode.

Consequently, I'm going to do the following:

  1. Implement a platform-specific workaround for macOS with platform detection.
import re
from sys import platform

# macOS lxml bug workaround
if platform == "darwin":
    # Inject each emoji's title into its alt attribute, replacing the Unicode emoji
    # to prevent the lxml encoding error while keeping a coherent alt attribute.
    # Raw strings are required so that \1 and \2 are treated as group references.
    value = re.sub(r'(title="(.*?)".*?class="Emoji.*?alt=")(.*?)"', r'\1\2"', value)

or a simpler alternative:

if platform == "darwin":
    # Clear the alt attributes of emojis
    value = re.sub(r'(class="Emoji.*?)alt=".*?"', r'\g<1> alt=""', value)
  2. Add a proper try / except around the parsing
  3. Complete the bug ticket

@LaurentLC

\o/

@Mincka
Owner

Mincka commented Nov 2, 2016

Could you just confirm there was no emoji for the thread you've been able to parse on macOS, Laurent?

@LaurentLC

Yes, it was an old and short thread with no emojis at the time…

@muesliq

muesliq commented Nov 2, 2016

Having the exact same problem. Happy to hear you're working on a fix!

\o/
(not using emoji in order not to break anything ;-)

@Mincka
Owner

Mincka commented Nov 2, 2016

I think I have a fix in b7c316a for the Mac OS users but I need confirmation guys.
You can now upgrade the package and test again. 😄

$ pip3 install dmarchiver --upgrade
$ dmarchiver

@muesliq

muesliq commented Nov 2, 2016

I did. Got a little further this time: 3 images (instead of 0), 0 text files. Error:

Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.7', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Maybe something went wrong with the update? I got this:

Collecting dmarchiver
Downloading dmarchiver-0.0.8.zip
Collecting requests>=2.11.1 (from dmarchiver)
Using cached requests-2.11.1-py2.py3-none-any.whl
Collecting lxml>=3.6.4 (from dmarchiver)
Using cached lxml-3.6.4.tar.gz
Collecting cssselect>=0.9.2 (from dmarchiver)
Using cached cssselect-1.0.0-py2.py3-none-any.whl
Installing collected packages: requests, lxml, cssselect, dmarchiver
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 849, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 1062, in move_wheel_files
    isolated=self.isolated,
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/requests'

@Mincka
Owner

Mincka commented Nov 2, 2016

@muesliq: It seems you're using the wrong version of Python (2.7 instead of 3.5). Could you try with pip3 install dmarchiver --upgrade?

That's my fault. It's mandatory to specify pip3 on Mac OS X because both versions are installed. I've updated my previous post.

And I guess you were able to download more images only because they were uploaded recently, with no emojis in their tweets or in the tweets after them.

@muesliq

muesliq commented Nov 2, 2016

Updated, thanks! Better now but not fixed yet. Thousands of tweets processed, 129 images, yet still 0 text files.

Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.8', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 470, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 384, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

@Mincka
Owner

Mincka commented Nov 2, 2016

Ok, thanks. I've added exception handling that prints the ID of the tweet raising the exception. The script should now continue, even when a tweet is causing issues.

You can upgrade with pip3 install dmarchiver --upgrade.

This is a poor, temporary solution, but the raw HTML of the offending tweets will also be written to the log file as a [DMConversationEntry] with a [ParseError] tag. It will help me understand what's causing the issue.
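
For readers who want to reproduce this kind of "log and continue" handling, here is a minimal sketch. It is not the code from DMArchiver: `process_tweets`, its parameters, and the dictionary of tweets are hypothetical names chosen for illustration. The `parse` argument stands in for any parser that raises on bad input, such as `lxml.html.fragment_fromstring`. The messages mirror the ones quoted elsewhere in this thread.

```python
def process_tweets(tweets, parse, log):
    """Parse each tweet's HTML, logging failures instead of aborting.

    tweets: dict mapping tweet IDs to raw HTML strings (hypothetical shape).
    parse:  any callable that raises on bad input.
    log:    list that collects the lines destined for the log file.
    """
    parsed = {}
    for tweet_id, raw_html in tweets.items():
        try:
            parsed[tweet_id] = parse(raw_html)
        except Exception:  # broad on purpose: one bad tweet must not stop the crawl
            print("Unexpected error for tweet '{0}', but still I continue.".format(tweet_id))
            # Keep the raw HTML so the failing input can be inspected later.
            log.append(
                "[DMConversationEntry] [ParseError] Parsing of tweet '{0}' failed. "
                "Raw HTML: {1}".format(tweet_id, raw_html)
            )
    return parsed
```

Catching `Exception` this broadly hides real bugs, which is why it only makes sense as a temporary diagnostic measure.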

The only weird situation I saw is a random position of the img attributes that makes the regex fail. I've seen title before alt on a computer and after alt on another... Maybe that's the same here with class or it's possible it could be emoji used in cards or other content types.
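
One way to make the substitution independent of attribute order is to match each img tag first and then look up its attributes separately. The sketch below is only an illustration under assumptions: `replace_emoji_alt` is a hypothetical helper, the attribute values are assumed to be double-quoted and free of `>` characters, and it does nothing for emojis outside emoji images (for example in a username).

```python
import re

# Each pattern looks at one attribute, so their relative order does not matter.
IMG_TAG = re.compile(r'<img\b[^>]*>')
EMOJI_CLASS = re.compile(r'(?<=\s)class="[^"]*\bEmoji\b[^"]*"')
TITLE_ATTR = re.compile(r'(?<=\s)title="([^"]*)"')
ALT_ATTR = re.compile(r'(?<=\s)alt="[^"]*"')


def replace_emoji_alt(html):
    """Replace the alt text of emoji images with their title text."""

    def fix_tag(match):
        tag = match.group(0)
        if not EMOJI_CLASS.search(tag):
            return tag  # not an emoji image: leave untouched
        title = TITLE_ATTR.search(tag)
        new_alt = 'alt="{0}"'.format(title.group(1) if title else "")
        # A function replacement avoids interpreting backslashes in the title.
        return ALT_ATTR.sub(lambda m: new_alt, tag, count=1)

    return IMG_TAG.sub(fix_tag, html)
```

With this approach, a tag with `title` before `alt` and a tag with `alt` before `title` are rewritten the same way.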

@muesliq

muesliq commented Nov 2, 2016

Now the upgrade doesn't seem to work:

pip3 install dmarchiver --upgrade
Requirement already up-to-date: dmarchiver in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages
Requirement already up-to-date: requests>=2.11.1 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: lxml>=3.6.4 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: cssselect>=0.9.2 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)

@Mincka
Owner

Mincka commented Nov 2, 2016

I had the same issue. It's quite strange. Maybe a temporary issue with PyPI?

I've been able to uninstall it and reinstall it with the latest version (0.0.10).

To exclude caching issues for package download, I've also deleted the following folder on Windows:
C:\Users\[User]\AppData\Local\pip\cache

For Unix, it seems to be ~/.pip/cache/ but I'm not sure.

@Mincka Mincka added the bug label Nov 2, 2016
@Mincka Mincka self-assigned this Nov 2, 2016
@LaurentLC

Hi !
No problem with the upgrade here, and I was able to archive a few DM threads, including big ones with emoji, pictures… Nice!

One error though, with one thread. I got a lot of
Unexpected error for tweet 'xxxx', but still I continue.

The Twitter user has an emoji in her username (see below the beginning of the file that was written)

[DMConversationEntry] [ParseError] Parsing of tweet 'xxxx' failed. Raw HTML: <div class="DirectMessage
            DirectMessage--received



            clearfix dm js-dm-item"
            data-quick-reply-json="null"
            data-message-id="xxxx"
            data-item-id="xxxx"

            data-card-component="dm_existing_conversation_dialog"

            data-component-context="dm_existing_conversation_dialog">

  <div class="DirectMessage-container">
    <div class="DirectMessage-avatar">
      <a href="/xxxx" class="js-action-profile js-user-profile-link" data-user-id="xxxx">
  <div class="DMAvatar DMAvatar--1 u-chromeOverflowFix">
    <span class="DMAvatar-container">
      <img class="DMAvatar-image" src="xxxx alt="SabineLC 🎃">
    </span>
</div>

I guess it might be the problem..?

We're getting there!

@muesliq

muesliq commented Nov 3, 2016

pip3 install dmarchiver --upgrade --ignore-installed seems to have done the trick. And it works just fabulous! You managed to fix the bugs, kudos!

Two tweets (out of 12620) had an "unexpected error". The first one contained the letter 𝜋. The second had the following tweet embedded (which contained lots of emoji): https://twitter.com/magnifier661/status/787044538145574912

@Mincka
Owner

Mincka commented Nov 3, 2016

Thanks a lot @LaurentLC and @muesliq! 👍

You've been able to identify three cases that are not handled properly yet:

  • Emoji in a username;
  • Emoji in an embedded tweet;
  • Other encoding errors due to special characters.

I'm not sure yet how I will be able to find proper workarounds. The bug is in the lxml lib for Mac OS. Identifying emojis with regex does not seem possible. The error with 𝜋 (U+1D70B 𝜋 MATHEMATICAL ITALIC SMALL PI) also means that the issue will not be limited to emojis. It's only a simple character so it could mean the script cannot handle non-ASCII characters at all on Mac OS... :-/

Update: My guess is the error is related to code points encoded on four bytes.
https://en.wikipedia.org/wiki/Unicode

Code points in Planes 1 through 16 (supplementary planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.

Emojis are also encoded in Plane 1 (1F000–1FFFF), so I may drop all content in the range 10000–2FFFF (Planes 1 & 2). It contains mainly ancient Egyptian characters, mathematical symbols and emojis.

For reference:
http://stackoverflow.com/a/13752628/3049282
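
To check whether a given character belongs to this four-byte category, a small sketch like the one below can help. `describe` is a hypothetical helper written for this illustration. It relies only on built-in string methods and reports the code point, its Unicode plane, and its encoded length in UTF-8 and UTF-16.

```python
def describe(char):
    """Report where a single character sits in Unicode and how it encodes."""
    code_point = ord(char)
    return {
        "code_point": "U+{0:04X}".format(code_point),
        # Each plane holds 0x10000 code points, so the plane is the value above bit 16.
        "plane": code_point >> 16,
        "utf8_bytes": len(char.encode("utf-8")),
        # UTF-16 code units are two bytes each; two units means a surrogate pair.
        "utf16_units": len(char.encode("utf-16-le")) // 2,
    }


for sample in ("é", "😂", "𝜋"):
    print(sample, describe(sample))
```

An accented letter such as "é" stays in Plane 0 and needs two UTF-8 bytes, while "😂" (U+1F602) and "𝜋" (U+1D70B) are both in Plane 1 and need four, which matches the failing cases reported above.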

@muesliq

muesliq commented Nov 3, 2016

By the way: Fantastic little piece of software. Thank you!

@Mincka
Owner

Mincka commented Nov 3, 2016

Happy to help. 😄

I have implemented in 073a358 a more general solution as a "fix" for this issue. On Mac OS X, all the Unicode characters encoded on 4 bytes are now replaced by "□" before the lxml parsing.

Consequently, it should fix all the encountered issues and allow a flawless parsing. 😄
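
For anyone who needs the same workaround in another project, the idea can be sketched as follows. This is not the code from commit 073a358: `strip_four_byte_chars` and `prepare_for_lxml` are hypothetical names, and the `platform` parameter exists only so the behaviour can be exercised on any OS.

```python
import re
import sys

# Every code point above the Basic Multilingual Plane (U+10000 and up)
# takes four bytes in UTF-8, so this class covers all of them.
FOUR_BYTE_CHARS = re.compile("[\U00010000-\U0010FFFF]")


def strip_four_byte_chars(text, placeholder="□"):
    """Replace each four-byte character with a placeholder."""
    return FOUR_BYTE_CHARS.sub(placeholder, text)


def prepare_for_lxml(text, platform=sys.platform):
    """Apply the replacement only on macOS, where the lxml error occurs."""
    if platform == "darwin":
        return strip_four_byte_chars(text)
    return text


print(strip_four_byte_chars("pi 𝜋 and joy 😂"))
```

The trade-off is that the archived text loses the original emojis and symbols on macOS, in exchange for parsing that no longer fails.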

To celebrate this, I've bumped the version to 0.1.0. 😉

@Mincka
Owner

Mincka commented Nov 4, 2016

Rejoice Mac users, I've been able to make a precompiled executable for macOS. It should be a lot easier for non-technical users to use. 😄
https://github.com/Mincka/DMArchiver/releases/tag/0.1.0

@Mincka
Owner

Mincka commented Nov 4, 2016

Fixed in 073a358

@Mincka Mincka closed this as completed Nov 4, 2016
@sussron

sussron commented Nov 8, 2016

OMGoodness I was so excited it was backing up messages with this new
download and it all looked to be going and then i got an error screen, do
you know what this means?

@sussron

sussron commented Nov 8, 2016

this is what it looked like as it was running before it got the error

@sussron

sussron commented Nov 8, 2016

now i got this screen

@sussron

sussron commented Nov 8, 2016

oh it didn't let me attach the 5MB file of the one particular message
thread.

But here are all the various threads that were in the command screen. The
most important one is the Starting crawl of '629006352329760768'

Last login: Mon Nov 7 20:43:14 on ttys000

Ronnies-MacBook-Pro:~ ronniesussman$
/Users/ronniesussman/Downloads/dmarchiver ; exit;

Enter your username or email: beckybulldognj

Enter your password (characters will not be displayed):

Authentication succeedeed.

Conversation ID not specified. Retrieving all the threads.

Starting crawl of '629006352329760768'

Begin of thread reached

Total processed tweets: 49899

Writing conversation to 629006352329760768.txt

[Truncated for confidentiality reasons]

logout

Saving session...

...copying shared history...

...saving history...truncating history files...

...completed.

[Process completed]

@sussron

sussron commented Nov 8, 2016

Wow so i tried it a second time and WOW!! it ran through the process. I'm
so very very excited!!! here is the number of message threads it found and
backed up ( pasted it to a word document). I noticed the message threads
don't go back to inception, just a certain date. For example the one i'm
attaching starts May 2016 and the conversation was started August 2015,
does this have a time limit?

Trust me so i'm excited to have any of these, even in text version without
the images videos or photos in any capacity(although with photos and videos
would be INCREDIBLE), I was just curious.

Julien, thanks so much.
Ronnie from New Jersey

@sussron

sussron commented Nov 8, 2016

I'm not sure all the messages were backed up. i'm looking for 2 particular
ones that i can't find, but i'm going to go through all the txt files and
see that i didn't miss it.

Thanks!
Ronnie

@sussron

sussron commented Nov 8, 2016

Does seem it didn't capture all the conversations or go to the first line.
Will note which message id if I can locate it on the source element page.

Thanks
Ronnie

On Nov 7, 2016 9:37 PM, "Ronnie Sussman" sussron@gmail.com wrote:

I'm not sure all the messages were backed up. I'm looking for 2 particular
ones that I can't find, but I'm going to go through all the txt files to
check that I didn't miss them.

Thanks!
Ronnie


@Mincka
Owner

Mincka commented Nov 8, 2016

Hello Ronnie,

Glad to see you're getting better results. However, I am still not sure I understand which error message you're talking about. There is no known limit on thread size. If there is an error, it should appear in the generated file. Messages deleted by users cannot be recovered.

If you want to download images and GIFs from your specific conversation (629006352329760768), you should try to run the command with the following parameters:

dmarchiver -id "629006352329760768" -di -dg

You should also be careful about the information you post on this site. The conversation ID for a conversation between two people is "userid1-userid2", so it could be possible to work out who you're talking to on Twitter.
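
To illustrate why such an ID is sensitive, here is a minimal Python sketch that assumes the "userid1-userid2" format described above. The function name `participants_from_conversation_id` is hypothetical and not part of dmarchiver, and the IDs in the example are made up. An ID without a hyphen is treated here as revealing nothing about the participants.

```python
def participants_from_conversation_id(conversation_id):
    """Return the user IDs embedded in a one-to-one conversation ID.

    Assumes the "userid1-userid2" format; returns an empty list otherwise.
    """
    parts = conversation_id.split("-")
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return parts
    return []


# Made-up IDs, for illustration only.
print(participants_from_conversation_id("1111-2222"))
print(participants_from_conversation_id("629006352329760768"))
```

Anyone who can read a two-person conversation ID can recover both user IDs in the same way, which is why it is worth masking such IDs before posting logs publicly.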

@Mincka Mincka reopened this Nov 8, 2016
@sussron

sussron commented Nov 8, 2016

Thanks for the message, Julien.
The error happened the first time, but then it ran. I can see the DM
messages in my Twitter account, so they aren't deleted. I can take a
screenshot to show you. For the one long conversation, it just takes a long
time to scroll back.

That great script you wrote was awesome: I could put in my name and password
and it just went and did its thing. So cool! How would I now run it just
for one conversation with images? Do I just go to the command screen and type
that line instead of using the zip link I downloaded?

Thanks
Ronnie


@Mincka
Owner

Mincka commented Nov 8, 2016

On some rare occasions, the script may have an error due to a connection issue.

Just open a Terminal (command screen) and copy and paste the following:
/Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg

The script will download the 50,000 messages of your thread again, but this time a folder will be created with the images and GIFs. It could take a bit longer to download. 😄

For the missing message, I'm interested to know whether it has something special that could explain why you can't find it in the generated file (special characters, emojis, a large message...).

@sussron

sussron commented Nov 8, 2016

For the missing threads, it's actually not a very long message.
That's what's weird. Maybe I'll see if I can find the message ID
and try it individually instead of as part of the group.

Thanks Julien
Ronnie


@Mincka
Owner

Mincka commented Nov 8, 2016

You cannot specify an individual message ID; the tool only accepts a conversation (or "thread") ID.

Try running the command I sent in my previous message and check whether you were able to download a complete conversation, with images this time.

@sussron

sussron commented Nov 8, 2016

Oh, I meant conversation, not message, but let me try doing that inspect
elements thing to see if I can find the missing messages. Thanks so much
for your patience and for helping me learn.
Ronnie


@sussron

sussron commented Nov 8, 2016

OK, so it's running now on a single thread and looks to be processing more
tweets (this one is up to 75,000 now and counting). That may have done the
trick. I'm so stinkin excited!!
Thank you thank you thank you!
You rock!
Ronnie


@Mincka
Owner

Mincka commented Nov 8, 2016

I wouldn't have guessed people had such crazy conversations going on in Twitter DMs. 😝 You're pushing the limits of the tool.

Tell me how many tweets have been archived once this thread finishes. 😄

You can already check the downloaded images in your "Downloads" folder: a new folder "629006352329760768" should have been created with the pictures and GIFs (as MP4 files).
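
As a quick way to check what has been downloaded so far, here is a small Python sketch that counts the files in that folder by extension. It assumes the folder sits directly under "Downloads" in your home directory, as described above. The helper name `count_media` is hypothetical and not part of dmarchiver.

```python
from collections import Counter
from pathlib import Path


def count_media(folder):
    """Count the files directly inside `folder` by lowercase extension."""
    return Counter(
        path.suffix.lower() for path in Path(folder).iterdir() if path.is_file()
    )


if __name__ == "__main__":
    # Assumed location: the "Downloads" folder mentioned above.
    target = Path.home() / "Downloads" / "629006352329760768"
    if target.is_dir():
        for extension, total in sorted(count_media(target).items()):
            print(extension or "(no extension)", total)
    else:
        print("Folder not found:", target)
```

Since GIFs are saved as MP4 files, they would show up under `.mp4` in this count.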

@sussron

sussron commented Nov 8, 2016

127,555 messages in one conversation thread


@sussron

sussron commented Nov 8, 2016

you did it. you did it!!!!
Woo hoo!!!!! That conversation means the world to me, you can't even begin
to know. thank you soo much


@sussron

sussron commented Nov 8, 2016

I tried another one but got this error. Do you know what it means?

Ronnies-MacBook-Pro:~ ronniesussman$ /Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg
Enter your username or email: beckybulldognj
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID specified (629006352329760768). Retrieving only one thread.
Starting crawl of '629006352329760768'
Failed to execute script cmdline
Traceback (most recent call last):
  File "dmarchiver/cmdline.py", line 70, in <module>
  File "dmarchiver/cmdline.py", line 62, in main
  File "dmarchiver/core.py", line 468, in crawl
  File "requests/models.py", line 826, in json
  File "json/__init__.py", line 319, in loads
  File "json/decoder.py", line 339, in decode
  File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Ronnies-MacBook-Pro:~ ronniesussman$
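
For readers who hit the same traceback: `Expecting value: line 1 column 1 (char 0)` is what Python's `json` module raises when the text it is given does not start with a valid JSON value, for example an empty body or an HTML page. What the server actually returned here is not shown in this thread. The following is a minimal sketch using only the standard library; `parse_json_or_none` is a hypothetical helper, not part of dmarchiver.

```python
import json


def parse_json_or_none(body):
    """Return the decoded JSON value, or None if `body` is not valid JSON.

    Note: a body containing the JSON literal null also decodes to None.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # error.msg and error.pos describe where decoding stopped.
        print("Not JSON:", error.msg, "at char", error.pos)
        return None


print(parse_json_or_none('{"ok": true}'))
print(parse_json_or_none(""))
print(parse_json_or_none("<html>error page</html>"))
```

Guarding the decode step like this lets a script log the unexpected response and retry or stop cleanly, rather than ending with an unhandled exception.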


@Mincka
Owner

Mincka commented Nov 8, 2016

Ronnie,

I've created another specific issue for this error because I consider this one solved. Could you go there and check for the questions I have regarding this new error message? Thank you.

#7

@Mincka Mincka closed this as completed Nov 8, 2016
Repository owner locked and limited conversation to collaborators Nov 8, 2016