This repository contains PyTorch LSTM neural networks that classify byte sequences as either UTF-8 or Windows-1252.

Training and evaluation sets were created from the News Crawl 2009 corpus (available at http://www.statmt.org/wmt11/translation-task.html), which consists of English-language newswire data.
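The README does not say how corpus sentences were turned into labelled byte sequences, so the following is only a sketch of one plausible approach. The function name `labelled_examples`, the label strings, and the decision to skip ASCII-only sentences (whose bytes are identical in both encodings) are assumptions made for illustration.

```python
def labelled_examples(text):
    """Return [(bytes, label), ...] for one sentence, or [] if unusable.

    Hypothetical helper: the original data pipeline is not documented.
    """
    if text.isascii():
        # ASCII-only text produces identical bytes in both encodings,
        # so it carries no signal for telling them apart.
        return []
    try:
        cp1252_bytes = text.encode("cp1252")
    except UnicodeEncodeError:
        # The sentence contains a character Windows-1252 cannot represent.
        return []
    return [
        (text.encode("utf-8"), "UTF-8"),
        (cp1252_bytes, "Windows-1252"),
    ]


if __name__ == "__main__":
    for data, label in labelled_examples("The café’s menu cost £5."):
        print(label, data)
```

Each usable sentence yields one example per encoding, which keeps the two classes balanced. A real pipeline might instead sample a single encoding per sentence or cut sentences into fixed-length windows.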

Chardet (version 4.0.0), an existing Python character encoding detector, was used for comparison.
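A comparison of this kind can be sketched as an accuracy function that accepts any detector. The code below is an assumption-laden illustration, not the evaluation script used here: `accuracy`, `chardet_label`, and `decodes_as_utf8` are hypothetical names, and the mapping from Chardet's reported encoding to the two labels (with every other guess counted as wrong) is a guess at a reasonable scoring rule. The only Chardet call used is `chardet.detect`, which returns a dict with an `"encoding"` key.

```python
def accuracy(detect, examples):
    """Fraction of (bytes, label) pairs for which detect(bytes) == label."""
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(1 for data, label in examples if detect(data) == label)
    return correct / len(examples)


def chardet_label(data):
    """Map Chardet's guess onto the two labels (assumed mapping)."""
    import chardet  # third-party; imported lazily so the rest runs without it

    guess = chardet.detect(data)["encoding"]
    if guess is None:
        return None
    guess = guess.lower()
    if guess in ("utf-8", "utf-8-sig"):
        return "UTF-8"
    if guess == "windows-1252":
        return "Windows-1252"
    return None  # any other guess is scored as incorrect


def decodes_as_utf8(data):
    """Stand-in detector for the demo: UTF-8 if the bytes decode cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "Windows-1252"
    return "UTF-8"


if __name__ == "__main__":
    examples = [(b"caf\xc3\xa9", "UTF-8"), (b"caf\xe9", "Windows-1252")]
    print(accuracy(decodes_as_utf8, examples))
    # With Chardet installed: accuracy(chardet_label, examples)
```

Passing the detector as a callable lets the same evaluation set score the LSTM models and Chardet identically.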

Selected results:

Accuracy of the model with 32 hidden units: 0.9770

Chardet accuracy: 0.7340
