macOS-specific lxml crash: LookupError: unknown encoding: 'b'latin1'' #157

Open
ivan opened this issue Jul 30, 2019 · 1 comment

@ivan
Contributor

commented Jul 30, 2019

Reported on IRC by systwi on 2019-07-15

This happens on macOS (using either nixpkgs or homebrew), but not on Linux.

$ grab-site 'http://www.deeplysimple.net/' --delay '275-375'
psutil: No module named 'psutil'. Resource monitoring will be unavailable.
Manhole[66103:1563175940.3431]: Patched <built-in function fork> and <built-in function forkpty>.
Manhole[66103:1563175940.3451]: Manhole UDS path: /tmp/manhole-66103
Manhole[66103:1563175940.3452]: Waiting for new connection (in pid:66103) ...
Created lmdb db with map_size=1099511627776
Imported /Volumes/1TB Storage/warc/www.deeplysimple.net-2019-07-15-73713550/igsets
Using these 191 ignores:
	%25252525
	/%22%20\+[^/]+\+%20%22
	/%22\+[^/]+\+%22
	/%27%20\+[^/]+\+%20%27
	/%27\+[^/]+\+%27
	/%5C/%5C/
	/'\+[^/]+\+'
	/(%5C)+(%22|%27)
	/App_Themes/.+/App_Themes/
	/\\+(%22|%27)
	/\\+["']
	/\\/\\/
	/bxSlider/.+/bxSlider/
	/bxSlider/bxSlider/
	/clientscript/.+/clientscript/clientscript/
	/clientscript/clientscript/.+/clientscript/
	/clientscript/clientscript/clientscript/
	/css/.+/css/css/
	/css/css/.+/css/
	/css/css/css/
	/images/.+/images/images/
	/images/images/.+/images/
	/images/images/images/
	/img/.+/img/img/
	/img/img/.+/img/
	/img/img/img/
	/js/.+/js/js/
	/js/js/.+/js/
	/js/js/js/
	/lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
	/scripts/.+/scripts/scripts/
	/scripts/scripts/.+/scripts/
	/scripts/scripts/scripts/
	/slides/.+/slides/slides/
	/slides/slides/.+/slides/
	/slides/slides/slides/
	/styles/.+/styles/styles/
	/styles/styles/.+/styles/
	/styles/styles/styles/
	^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
	^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
	^https?://([^\./]+\.)?stream\.publicradio\.org/
	^https?://([^\.]+\.)?pinterest\.com/pin/create/
	^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
	^https?://(apis|plusone)\.google\.com/_/\+1/
	^https?://(audio\d?|nfw)\.video\.ria\.ru/
	^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
	^https?://(www\.)?(megaupload|filesonic|wupload)\.com/
	^https?://(www\.)?digg\.com/submit\?
	^https?://(www\.)?facebook\.com/(plugins/(share_button|like(box)?)\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
	^https?://(www\.)?facebook\.com/v[\d\.]+/plugins/like\.php
	^https?://(www\.)?friendfeed\.com/share\?
	^https?://(www\.)?instapaper\.com/hello2\?
	^https?://(www\.)?myspace\.com/Modules/PostTo/
	^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
	^https?://(www\.)?technorati\.com/faves/?\?add=
	^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
	^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
	^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
	^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
	^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
	^https?://.+/.+/disqus\.com/forums/$
	^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
	^https?://.+/js/chartbeat\.js$
	^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
	^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
	^https?://[^/]*musicproxy\.s12\.de/
	^https?://[^/]+/.+/CaptchaImage\.axd
	^https?://[^/]+/anony/mjpg\.cgi$
	^https?://[^/]+/mjpg/video\.mjpg
	^https?://[^/]+\.akadostream\.ru(:\d+)?/
	^https?://[^/]+\.corp\.ne1\.yahoo\.com/
	^https?://[^/]+\.facebook\.com/login\.php
	^https?://[^/]+\.gaduradio\.pl/
	^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
	^https?://[^/]+\.rastream\.com(:\d+)?/
	^https?://[^/]+\.services\.livejournal\.com/ljcounter
	^https?://[^/]+\.streamtheworld\.com/
	^https?://[^/]+\.xiti\.com/hit\.xiti\?
	^https?://[^\./]+\.radioscoop\.(com|net):\d+/
	^https?://[^\./]+\.streamchan\.org:\d+/
	^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
	^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
	^https?://[a-z0-9]+\.cdn\.dvmr\.fr(:\d+)?/.+\.mp3
	^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.pn[gj]$
	^https?://accounts\.google\.com/(SignUp|ServiceLogin|AccountChooser|a/UniversalLogin)
	^https?://add\.my\.yahoo\.com/(rss|content)\?
	^https?://air\.radiorecord\.ru(:\d+)?/
	^https?://alb\.reddit\.com/
	^https?://api\.addthis\.com/
	^https?://audio\d?\.radioreference\.com/
	^https?://audiots\.scdn\.arkena\.com/
	^https?://av\.rasset\.ie/av/live/
	^https?://b\.hatena\.ne\.jp/add\?
	^https?://b\.scorecardresearch\.com/
	^https?://beacon\.wikia-services\.com/
	^https?://bookmark\.naver\.com/post\?
	^https?://bufferapp\.com/add\?
	^https?://connect\.mail\.ru/share\?
	^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
	^https?://del\.icio\.us/post\?
	^https?://delicious\.com/(save|post)\?
	^https?://download\.ted\.com/
	^https?://flattr.com/submit/auto\?
	^https?://gcnplayer\.gcnlive\.com/.+
	^https?://geo\.yahoo\.com/b\?
	^https?://getpocket\.com/(save|edit)/?\?
	^https?://i\.dev\.cdn\.turner\.com/
	^https?://imageshack\.com/lost$
	^https?://iwiw\.hu/pages/share/share\.jsp\?
	^https?://mail\.google\.com/mail/
	^https?://media\.opb\.org/clips/embed/.+\.js$
	^https?://medium\.com/_/(vote|bookmark|subscribe)/
	^https?://memori(\.qip)?\.ru/link/\?
	^https?://mp3\.ffh\.de/
	^https?://mp3tslg\.tdf-cdn\.com/
	^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
	^https?://news\.ycombinator\.com/submitlink\?
	^https?://p\.opt\.fimserve\.com/
	^https?://photobucket\.com/.+/albums/.+/albums/
	^https?://pixel\.(quantserve|wp)\.com/
	^https?://pixel\.blog\.hu/
	^https?://pixel\.redditmedia\.com/pixel/
	^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
	^https?://play(\d+)?\.radio13\.ru:8000/
	^https?://plus\.google\.com/share\?
	^https?://posterous\.com/share\?
	^https?://prod-preview\.wired\.com/
	^https?://pub(\d+)?\.di\.fm/
	^https?://r-a-d\.io/.+\.mp3$
	^https?://r-login\.wordpress\.com/remote-login\.php
	^https?://relay\.broadcastify\.com/
	^https?://reporter\.es\.msn\.com/\?fn=contribute
	^https?://s\d+\.sitemeter\.com/(js/counter\.js|meter\.asp)
	^https?://service\.weibo\.com/share/share\.php\?
	^https?://share\.flipboard\.com/bookmarklet/popout\?
	^https?://social-plugins\.line\.me/lineit/share
	^https?://sphinn\.com/index\.php\?c=post&m=submit&
	^https?://static\.licdn\.com/sc/p/.+/f//
	^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
	^https?://stream(\d+)?\.media\.rambler\.ru/
	^https?://telegram\.me/share/url\?
	^https?://tm\.uol\.com\.br/h/.+/h/
	^https?://tmz\.vo\.llnwd\.net/
	^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
	^https?://video-subtitle\.tedcdn\.com/
	^https?://vkontakte\.ru/share\.php\?
	^https?://vuible\.com/pins-settings/
	^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
	^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
	^https?://www\.addthis\.com/bookmark\.php\?
	^https?://www\.addtoany\.com/(add_to/|share_save\?)
	^https?://www\.amazon\.com/.+/logging/log-action\.html
	^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
	^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
	^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
	^https?://www\.deeplysimple\.net(/.*|/)page/%d/$
	^https?://www\.deeplysimple\.net/(wp-admin/|wp-login\.php\?)
	^https?://www\.deeplysimple\.net/.*%5Cx26route=/archive
	^https?://www\.deeplysimple\.net/.*&amp;amp;amp;
	^https?://www\.deeplysimple\.net/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
	^https?://www\.deeplysimple\.net/.*amp%3Bamp%3Bamp%3B
	^https?://www\.deeplysimple\.net/.+/%3Ca%20href=
	^https?://www\.deeplysimple\.net/.+/jetpack-comment/\?blogid=\d+&postid=\d+
	^https?://www\.deeplysimple\.net/.+/plugins/ultimate-social-media-plus/.+/like/like/
	^https?://www\.deeplysimple\.net/.+/quote-comment-\d+/$
	^https?://www\.deeplysimple\.net/.+[\?&](replyto(com)?|like_comment)=\d+
	^https?://www\.deeplysimple\.net/.+[\?&]mode=reply
	^https?://www\.deeplysimple\.net/.+[\?&]share=[a-z]{4,}
	^https?://www\.deeplysimple\.net/.+\?showComment(=|%5C)\d+
	^https?://www\.deeplysimple\.net/search(/label/[^\?]+|\?q=[^&]+|)[\?&]updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
	^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
	^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
	^https?://www\.flickr\.com/change_language\.gne
	^https?://www\.google\.com/(reader/link\?|buzz/post\?)
	^https?://www\.google\.com/accounts/AccountChooser
	^https?://www\.google\.com/bookmarks/mark\?
	^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
	^https?://www\.infomous\.com/cloud_widget/lib/lib/
	^https?://www\.khaleejtimes\.com/.+/images/.+/images/
	^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
	^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
	^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
	^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
	^https?://www\.netvibes\.com/subscribe\.php\?
	^https?://www\.newsvine\.com/_wine/save\?
	^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
	^https?://www\.warnerbros\.com/\d+$
	^https?://www\.youtube\.com/.*\[\[.+\]\]
	^https?://www\.youtube\.com/.*\{\{.+\}\}
	^https?://zakladki\.yandex\.ru/newlink\.xml\?
Imported /Volumes/1TB Storage/warc/www.deeplysimple.net-2019-07-15-73713550/ignores
Using these 191 ignores:
	%25252525
	/%22%20\+[^/]+\+%20%22
	/%22\+[^/]+\+%22
	/%27%20\+[^/]+\+%20%27
	/%27\+[^/]+\+%27
	/%5C/%5C/
	/'\+[^/]+\+'
	/(%5C)+(%22|%27)
	/App_Themes/.+/App_Themes/
	/\\+(%22|%27)
	/\\+["']
	/\\/\\/
	/bxSlider/.+/bxSlider/
	/bxSlider/bxSlider/
	/clientscript/.+/clientscript/clientscript/
	/clientscript/clientscript/.+/clientscript/
	/clientscript/clientscript/clientscript/
	/css/.+/css/css/
	/css/css/.+/css/
	/css/css/css/
	/images/.+/images/images/
	/images/images/.+/images/
	/images/images/images/
	/img/.+/img/img/
	/img/img/.+/img/
	/img/img/img/
	/js/.+/js/js/
	/js/js/.+/js/
	/js/js/js/
	/lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
	/scripts/.+/scripts/scripts/
	/scripts/scripts/.+/scripts/
	/scripts/scripts/scripts/
	/slides/.+/slides/slides/
	/slides/slides/.+/slides/
	/slides/slides/slides/
	/styles/.+/styles/styles/
	/styles/styles/.+/styles/
	/styles/styles/styles/
	^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
	^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
	^https?://([^\./]+\.)?stream\.publicradio\.org/
	^https?://([^\.]+\.)?pinterest\.com/pin/create/
	^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
	^https?://(apis|plusone)\.google\.com/_/\+1/
	^https?://(audio\d?|nfw)\.video\.ria\.ru/
	^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
	^https?://(www\.)?(megaupload|filesonic|wupload)\.com/
	^https?://(www\.)?digg\.com/submit\?
	^https?://(www\.)?facebook\.com/(plugins/(share_button|like(box)?)\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
	^https?://(www\.)?facebook\.com/v[\d\.]+/plugins/like\.php
	^https?://(www\.)?friendfeed\.com/share\?
	^https?://(www\.)?instapaper\.com/hello2\?
	^https?://(www\.)?myspace\.com/Modules/PostTo/
	^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
	^https?://(www\.)?technorati\.com/faves/?\?add=
	^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
	^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
	^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
	^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
	^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
	^https?://.+/.+/disqus\.com/forums/$
	^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
	^https?://.+/js/chartbeat\.js$
	^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
	^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
	^https?://[^/]*musicproxy\.s12\.de/
	^https?://[^/]+/.+/CaptchaImage\.axd
	^https?://[^/]+/anony/mjpg\.cgi$
	^https?://[^/]+/mjpg/video\.mjpg
	^https?://[^/]+\.akadostream\.ru(:\d+)?/
	^https?://[^/]+\.corp\.ne1\.yahoo\.com/
	^https?://[^/]+\.facebook\.com/login\.php
	^https?://[^/]+\.gaduradio\.pl/
	^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
	^https?://[^/]+\.rastream\.com(:\d+)?/
	^https?://[^/]+\.services\.livejournal\.com/ljcounter
	^https?://[^/]+\.streamtheworld\.com/
	^https?://[^/]+\.xiti\.com/hit\.xiti\?
	^https?://[^\./]+\.radioscoop\.(com|net):\d+/
	^https?://[^\./]+\.streamchan\.org:\d+/
	^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
	^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
	^https?://[a-z0-9]+\.cdn\.dvmr\.fr(:\d+)?/.+\.mp3
	^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.pn[gj]$
	^https?://accounts\.google\.com/(SignUp|ServiceLogin|AccountChooser|a/UniversalLogin)
	^https?://add\.my\.yahoo\.com/(rss|content)\?
	^https?://air\.radiorecord\.ru(:\d+)?/
	^https?://alb\.reddit\.com/
	^https?://api\.addthis\.com/
	^https?://audio\d?\.radioreference\.com/
	^https?://audiots\.scdn\.arkena\.com/
	^https?://av\.rasset\.ie/av/live/
	^https?://b\.hatena\.ne\.jp/add\?
	^https?://b\.scorecardresearch\.com/
	^https?://beacon\.wikia-services\.com/
	^https?://bookmark\.naver\.com/post\?
	^https?://bufferapp\.com/add\?
	^https?://connect\.mail\.ru/share\?
	^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
	^https?://del\.icio\.us/post\?
	^https?://delicious\.com/(save|post)\?
	^https?://download\.ted\.com/
	^https?://flattr.com/submit/auto\?
	^https?://gcnplayer\.gcnlive\.com/.+
	^https?://geo\.yahoo\.com/b\?
	^https?://getpocket\.com/(save|edit)/?\?
	^https?://i\.dev\.cdn\.turner\.com/
	^https?://imageshack\.com/lost$
	^https?://iwiw\.hu/pages/share/share\.jsp\?
	^https?://mail\.google\.com/mail/
	^https?://media\.opb\.org/clips/embed/.+\.js$
	^https?://medium\.com/_/(vote|bookmark|subscribe)/
	^https?://memori(\.qip)?\.ru/link/\?
	^https?://mp3\.ffh\.de/
	^https?://mp3tslg\.tdf-cdn\.com/
	^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
	^https?://news\.ycombinator\.com/submitlink\?
	^https?://p\.opt\.fimserve\.com/
	^https?://photobucket\.com/.+/albums/.+/albums/
	^https?://pixel\.(quantserve|wp)\.com/
	^https?://pixel\.blog\.hu/
	^https?://pixel\.redditmedia\.com/pixel/
	^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
	^https?://play(\d+)?\.radio13\.ru:8000/
	^https?://plus\.google\.com/share\?
	^https?://posterous\.com/share\?
	^https?://prod-preview\.wired\.com/
	^https?://pub(\d+)?\.di\.fm/
	^https?://r-a-d\.io/.+\.mp3$
	^https?://r-login\.wordpress\.com/remote-login\.php
	^https?://relay\.broadcastify\.com/
	^https?://reporter\.es\.msn\.com/\?fn=contribute
	^https?://s\d+\.sitemeter\.com/(js/counter\.js|meter\.asp)
	^https?://service\.weibo\.com/share/share\.php\?
	^https?://share\.flipboard\.com/bookmarklet/popout\?
	^https?://social-plugins\.line\.me/lineit/share
	^https?://sphinn\.com/index\.php\?c=post&m=submit&
	^https?://static\.licdn\.com/sc/p/.+/f//
	^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
	^https?://stream(\d+)?\.media\.rambler\.ru/
	^https?://telegram\.me/share/url\?
	^https?://tm\.uol\.com\.br/h/.+/h/
	^https?://tmz\.vo\.llnwd\.net/
	^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
	^https?://video-subtitle\.tedcdn\.com/
	^https?://vkontakte\.ru/share\.php\?
	^https?://vuible\.com/pins-settings/
	^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
	^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
	^https?://www\.addthis\.com/bookmark\.php\?
	^https?://www\.addtoany\.com/(add_to/|share_save\?)
	^https?://www\.amazon\.com/.+/logging/log-action\.html
	^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
	^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
	^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
	^https?://www\.deeplysimple\.net(/.*|/)page/%d/$
	^https?://www\.deeplysimple\.net/(wp-admin/|wp-login\.php\?)
	^https?://www\.deeplysimple\.net/.*%5Cx26route=/archive
	^https?://www\.deeplysimple\.net/.*&amp;amp;amp;
	^https?://www\.deeplysimple\.net/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
	^https?://www\.deeplysimple\.net/.*amp%3Bamp%3Bamp%3B
	^https?://www\.deeplysimple\.net/.+/%3Ca%20href=
	^https?://www\.deeplysimple\.net/.+/jetpack-comment/\?blogid=\d+&postid=\d+
	^https?://www\.deeplysimple\.net/.+/plugins/ultimate-social-media-plus/.+/like/like/
	^https?://www\.deeplysimple\.net/.+/quote-comment-\d+/$
	^https?://www\.deeplysimple\.net/.+[\?&](replyto(com)?|like_comment)=\d+
	^https?://www\.deeplysimple\.net/.+[\?&]mode=reply
	^https?://www\.deeplysimple\.net/.+[\?&]share=[a-z]{4,}
	^https?://www\.deeplysimple\.net/.+\?showComment(=|%5C)\d+
	^https?://www\.deeplysimple\.net/search(/label/[^\?]+|\?q=[^&]+|)[\?&]updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
	^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
	^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
	^https?://www\.flickr\.com/change_language\.gne
	^https?://www\.google\.com/(reader/link\?|buzz/post\?)
	^https?://www\.google\.com/accounts/AccountChooser
	^https?://www\.google\.com/bookmarks/mark\?
	^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
	^https?://www\.infomous\.com/cloud_widget/lib/lib/
	^https?://www\.khaleejtimes\.com/.+/images/.+/images/
	^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
	^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
	^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
	^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
	^https?://www\.netvibes\.com/subscribe\.php\?
	^https?://www\.newsvine\.com/_wine/save\?
	^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
	^https?://www\.warnerbros\.com/\d+$
	^https?://www\.youtube\.com/.*\[\[.+\]\]
	^https?://www\.youtube\.com/.*\{\{.+\}\}
	^https?://zakladki\.yandex\.ru/newlink\.xml\?
Connected to ws://127.0.0.1:29000
Imported /Volumes/1TB Storage/warc/www.deeplysimple.net-2019-07-15-73713550/max_content_length
http://www.deeplysimple.net/ ...
Imported /Volumes/1TB Storage/warc/www.deeplysimple.net-2019-07-15-73713550/delay
Imported /Volumes/1TB Storage/warc/www.deeplysimple.net-2019-07-15-73713550/concurrency
/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
http://www.deeplysimple.net/robots.txt ...
http://www.deeplysimple.net/sitemap.xml ...
https://lh5.googleusercontent.com/proxy/jGCPN6dpGSPrk94-N44Lll3LSmEqI1huhPD_buopBC7Gigb5s0Q2G6N3igU_tZPZnMTpwjVKAbNBKlFZlf1vIAtp=s0-d ...
http://pagead2.googlesyndication.com/pagead/show_ads.js ...
http://www.blogblog.com/dynamicviews/4224c15c4e7c9321/js/comments.js ...
http://www.deeplysimple.net/2015/09/remove-noise-or-hiss-from-recordings.html ...
http://www.deeplysimple.net/2013/12/how-to-create-password-thats-hard-to.html ...
http://1.bp.blogspot.com/-ZQKhMlDYEZ8/UpjrFVOJERI/AAAAAAAAAoc/in4fyLD42KM/s1600/Anno.tif ...
http://photos1.blogger.com/img/43/1633/320/13539953_0384ccecf9.jpg ...
http://photos1.blogger.com/blogger/3709/485/1600/arabic-flag.gif ...
https://lh6.googleusercontent.com/proxy/TKvq7-K0Je8CD3cwSlU8DvxkxhJ0xcC4xz48jlH2ZxruSG4T41iYKPKUtG4hUkCjW-rZ5M6moMo=s0-d ...
https://www.blogger.com/static/v1/widgets/1501421786-widgets.js ...
http://jobsearch.naukri.com/mynaukri/mn_newsmartsearch.php?xz=7_0_5&qc=5202&tem=hyundai ...
http://www.deeplysimple.net/2013/08/beware-of-scams-at-job-search-sites.html ...
http://3.bp.blogspot.com/-Jnx8VhofPWs/Vc9qdHS7-bI/AAAAAAAADa4/VN0ZIZc1ono/s1600/Search.jpg ...
http://verify.naukri.com/captcha/?path=http://www.naukri.com/mynaukri/mn_newsmartsearch.php?xz=7_0_5&qc=5202&tem=hyundai&xz=7_0_5&qc=5202&tem=hyundai ...
ERROR Fatal exception.
Traceback (most recent call last):
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/application/tasks/download.py", line 421, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 91, in process
    return (yield from session.process())
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 185, in process
    yield from self._process_loop()
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 244, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 308, in _fetch_one
    action = self._handle_response(request, response)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 423, in _handle_response
    self._processing_rule.scrape_document(self._item_session)
  File "/nix/store/0kqhy8f743yirbvcznvkz4s6bl0b2llj-grab-site-2.1.16/lib/python3.7/site-packages/libgrabsite/wpull_tweaks.py", line 55, in scrape_document
    super().scrape_document(item_session)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/rule.py", line 527, in scrape_document
    item_session.url_record.link_type
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/base.py", line 186, in scrape_info
    scrape_result = scraper.scrape(request, response, link_type)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/html.py", line 114, in scrape
    elements, response, base_url, link_contexts
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/html.py", line 139, in _process_elements
    for element in elements:
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/htmlparse/lxml_.py", line 22, in parse
    parser_type = self.detect_parser_type(file, encoding=encoding)
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/htmlparse/lxml_.py", line 88, in detect_parser_type
    doctype = cls.parse_doctype(file, encoding=encoding) or ''
  File "/nix/store/cdjd2cps3ygvmajphjagkadhhp63lhhb-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/htmlparse/lxml_.py", line 71, in parse_doctype
    parser = lxml.etree.XMLParser(encoding=lxml_encoding, recover=True)
  File "src/lxml/parser.pxi", line 1520, in lxml.etree.XMLParser.__init__
  File "src/lxml/parser.pxi", line 823, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'latin1''
CRITICAL Sorry, Wpull unexpectedly crashed.
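
One possible reading of the final message, offered as an assumption rather than a confirmed diagnosis: the doubled quotes in `'b'latin1''` look like a bytes value (`b'latin1'`) interpolated into the error text, which would mean the name that was rejected is plain `latin1`. The sketch below uses only the Python standard library to show how a message of that shape can arise and how an alias can be mapped to Python's canonical codec name. `format_unknown_encoding` and `canonical_encoding_name` are hypothetical helpers for illustration, not wpull or lxml functions.

```python
import codecs


def format_unknown_encoding(encoding):
    """Build a message of the same shape as the one in the traceback.

    Assumption: the raising code interpolates the encoding value as-is,
    so a bytes value shows up with its b'...' repr inside the quotes.
    """
    return "unknown encoding: '%s'" % (encoding,)


def canonical_encoding_name(name):
    """Return Python's canonical codec name for an encoding alias.

    Hypothetical helper: accepts str or ASCII bytes and raises
    LookupError if Python itself does not know the alias.
    """
    if isinstance(name, bytes):
        name = name.decode("ascii")
    return codecs.lookup(name).name


if __name__ == "__main__":
    # A bytes value reproduces the doubled-quote shape seen in the report.
    print(format_unknown_encoding(b"latin1"))
    # A str value would have produced single quotes only.
    print(format_unknown_encoding("latin1"))
    # Python's own codec registry resolves the alias.
    print(canonical_encoding_name(b"latin1"))
```

Python's codec registry does know the `latin1` alias, so if this reading is right, the lookup that failed happened in the parser library rather than in Python. Whether passing a canonical name instead would be accepted on the affected macOS install is untested here.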
@ivan

Contributor Author

commented Jul 30, 2019

I could not reproduce this on macOS 10.14 (Homebrew install) just now, but systwi says it still happens on 10.13.6 (install method unknown).
