Error if the character code is Shift_JIS #257

@matoken

ArchiveBox fails to download some Japanese pages with a UTF-8 decode error.

Steps to reproduce

Archiving either of these pages triggers the error:

https://www.mbc.co.jp/news/
http://167.86.112.42/hello_sjis.html

Screenshots or log output

$ ./archive https://www.mbc.co.jp/news/
[*] [2019-08-16 18:53:59] Downloading https://www.mbc.co.jp/news/
[!] Failed to download https://www.mbc.co.jp/news/

	 'utf-8' codec can't decode byte 0x8e in position 181: invalid start byte

The page that fails appears to be encoded in Shift_JIS (an older Japanese character encoding).

$ curl -s https://www.mbc.co.jp/news/ | grep -i charset=
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
<script type="text/javascript" src="js/scrollsmoothly.js" charset="utf-8"></script>
<link rel="stylesheet" type="text/css" href="/css/mbc_menu_import.css" charset="Shift-JIS">
<SCRIPT language="JavaScript" src="/js/mbcmenu.js" charset="Shift-JIS"></SCRIPT>
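One way a downloader could cope with pages like this is to read the declared charset before decoding. The following is only a minimal sketch under that assumption, not ArchiveBox's actual code; `sniff_charset` and `decode_page` are hypothetical names, and a real implementation would probably also consult the HTTP `Content-Type` header.

```python
import codecs
import re

# Matches charset=... in a <meta> tag; works on raw bytes because the
# declaration itself is plain ASCII even in a Shift_JIS document.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the first declared charset in the leading bytes, else the default."""
    match = CHARSET_RE.search(raw[:2048])
    if not match:
        return default
    name = match.group(1).decode("ascii")
    try:
        # Normalise aliases such as "Shift_JIS" or "Shift-JIS" to Python's codec name.
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_page(raw: bytes) -> str:
    """Decode with the sniffed charset, replacing any bytes that still do not fit."""
    return raw.decode(sniff_charset(raw), errors="replace")


page = (
    '<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />'
    "こんにちは"
).encode("shift_jis")
print(sniff_charset(page))
print(decode_page(page))
```

Taking the first match is a simplification: on the page above the `<meta>` declaration comes first, but a `charset` attribute on an earlier `<script>` tag would be picked up instead.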

The same error occurs with a minimal Shift_JIS page that I created:

$ echo '<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
</head>
<body>
こんにちは
</body>
</html>' | iconv -f UTF8 -t SJIS > hello_sjis.html
$ ./archive http://167.86.112.42/hello_sjis.html
[*] [2019-08-16 19:02:10] Downloading http://167.86.112.42/hello_sjis.html
[!] Failed to download http://167.86.112.42/hello_sjis.html

     'utf-8' codec can't decode byte 0x82 in position 103: invalid start byte
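The failure can be reproduced in plain Python without ArchiveBox. This is a small illustration, assuming the downloader decodes the response body as UTF-8; the variable names are made up for the example.

```python
# Hypothetical stand-in for the page body: the same greeting, encoded as Shift_JIS.
raw = "こんにちは".encode("shift_jis")

try:
    raw.decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    # 0x82 is the lead byte of hiragana in Shift_JIS but cannot start a UTF-8 sequence.
    message = str(exc)

print(message)

# Decoding with the correct codec recovers the original text.
decoded = raw.decode("shift_jis")
print(decoded)
```

The byte position differs from the log above because this snippet has no surrounding HTML, but the offending byte is the same.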

Software versions

  • OS: Debian GNU/Linux 10 (buster) amd64
  • ArchiveBox version: e2b054a
  • Python version: 3.7.3 (Debian package 3.7.3-1)
  • Chrome version: 73.0.3683.75 (Debian package 73.0.3683.75-1)
