Error when the page's character encoding is Shift_JIS
ArchiveBox fails to download some Japanese pages with a UTF-8 decoding error.
Steps to reproduce
The error occurs on these pages:
https://www.mbc.co.jp/news/
http://167.86.112.42/hello_sjis.html
Screenshots or log output
$ ./archive https://www.mbc.co.jp/news/
[*] [2019-08-16 18:53:59] Downloading https://www.mbc.co.jp/news/
[!] Failed to download https://www.mbc.co.jp/news/
'utf-8' codec can't decode byte 0x8e in position 181: invalid start byte
The failing page appears to be encoded in Shift_JIS (an older Japanese character encoding).
$ curl -s https://www.mbc.co.jp/news/ | grep -i charset=
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
<script type="text/javascript" src="js/scrollsmoothly.js" charset="utf-8"></script>
<link rel="stylesheet" type="text/css" href="/css/mbc_menu_import.css" charset="Shift-JIS">
<SCRIPT language="JavaScript" src="/js/mbcmenu.js" charset="Shift-JIS"></SCRIPT>
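The output above spells the charset two ways, "Shift_JIS" and "Shift-JIS". As a quick check, assuming the labels are handed to Python's standard codec registry (which may not be how ArchiveBox handles them), both spellings resolve to the same built-in codec:

```python
import codecs

# Labels as they appear in the page's markup; codecs.lookup() normalizes
# case and hyphens before searching the codec registry.
labels = ["Shift_JIS", "Shift-JIS", "utf-8"]
resolved = {label: codecs.lookup(label).name for label in labels}

for label, name in resolved.items():
    print(f"{label} -> {name}")
```

So the declared charset is something Python can decode directly once it is taken into account.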
The same error occurs with a minimal Shift_JIS page that I created:
$ echo '<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
</head>
<body>
こんにちは
</body>
</html>' | iconv -f UTF8 -t SJIS > hello_sjis.html
$ ./archive http://167.86.112.42/hello_sjis.html
[*] [2019-08-16 19:02:10] Downloading http://167.86.112.42/hello_sjis.html
[!] Failed to download http://167.86.112.42/hello_sjis.html
'utf-8' codec can't decode byte 0x82 in position 103: invalid start byte
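The failure can be reproduced without ArchiveBox. The sketch below builds the same minimal page in memory, shows that decoding it as UTF-8 raises the error from the log, and then decodes it using the charset declared in the `<meta>` tag. `decode_html` and `META_CHARSET_RE` are hypothetical names used only for illustration; this is one possible workaround, not ArchiveBox's actual code.

```python
import re

# Hypothetical helper, not part of ArchiveBox: find a charset declaration
# such as charset=Shift_JIS near the top of the document.
META_CHARSET_RE = re.compile(rb'charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Decode HTML bytes using the declared charset, else the default."""
    match = META_CHARSET_RE.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: substitute undecodable bytes
        # rather than aborting the whole download.
        return raw.decode(default, errors="replace")


# Same content as hello_sjis.html, encoded as Shift_JIS.
page = (
    "<html>\n<head>\n"
    '<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />\n'
    "</head>\n<body>\nこんにちは\n</body>\n</html>\n"
).encode("shift_jis")

try:
    page.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_error = err
    print(err)  # the same kind of failure as in the ArchiveBox log

text = decode_html(page)
print(text)
```

Reading the declaration from the first 2048 bytes is an arbitrary choice for this sketch. A fuller fix would presumably also honor the charset in the HTTP `Content-Type` header.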
Software versions
- OS: Debian GNU/Linux 10 (buster) amd64
- ArchiveBox version: e2b054a
- Python version: 3.7.3 (Debian package 3.7.3-1)
- Chrome version: 73.0.3683.75 (Debian package 73.0.3683.75-1)