PyTorch LSTM neural networks that classify byte sequences as either UTF-8 or Windows-1252.
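
A minimal sketch of what such a classifier could look like. This is not the repository's actual model: the class name `ByteLSTMClassifier`, the helper `bytes_to_tensor`, the embedding layer, and its size are assumptions made for illustration. Only the hidden size of 32 is taken from the results reported in this README.

```python
import torch
import torch.nn as nn


class ByteLSTMClassifier(nn.Module):
    """Hypothetical sketch: embed each byte, run an LSTM, classify from the final hidden state."""

    def __init__(self, embed_dim: int = 16, hidden_size: int = 32, num_classes: int = 2):
        super().__init__()
        # One embedding vector per possible byte value (0-255). embed_dim is an assumption.
        self.embedding = nn.Embedding(256, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        # Two output logits: one per encoding class.
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(byte_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (num_layers, batch, hidden_size)
        return self.out(h_n[-1])              # (batch, num_classes)


def bytes_to_tensor(data: bytes) -> torch.Tensor:
    """Turn a byte string into a (1, seq_len) tensor of byte values."""
    return torch.tensor(list(data), dtype=torch.long).unsqueeze(0)


model = ByteLSTMClassifier()
logits = model(bytes_to_tensor("café".encode("utf-8")))
```

The model here is untrained, so `logits` only demonstrates the input and output shapes.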

Training and evaluation sets were created from the News Crawl 2009 corpus (available at http://www.statmt.org/wmt11/translation-task.html). This corpus consists of English-language newswire data.
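
One way labeled examples could be derived from corpus text is sketched below. The function `make_examples` and its filtering rules are assumptions, not the procedure actually used here: lines that are pure ASCII are skipped because they produce identical bytes in both encodings, and lines that Windows-1252 cannot represent are skipped as well.

```python
def make_examples(lines):
    """Hypothetical sketch: build (bytes, label) pairs from text lines."""
    examples = []
    for line in lines:
        if line.isascii():
            # Same bytes in UTF-8 and Windows-1252, so the label would be ambiguous.
            continue
        try:
            cp1252_bytes = line.encode("cp1252")
        except UnicodeEncodeError:
            # The line contains characters that Windows-1252 cannot represent.
            continue
        examples.append((line.encode("utf-8"), "utf-8"))
        examples.append((cp1252_bytes, "windows-1252"))
    return examples


examples = make_examples(["plain ascii", "café", "日本語"])
```

Only `"café"` survives the filters in this call, yielding one UTF-8 example and one Windows-1252 example.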

Chardet (version 4.0.0), an existing Python character encoding detector, was used for comparison.
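
A comparison of this kind could be scored roughly as follows. This is a sketch under assumptions: `normalize_label`, `chardet_predict`, and `accuracy` are hypothetical helpers, examples are assumed to be `(bytes, label)` pairs, and any Chardet answer other than UTF-8 or Windows-1252 is counted as a miss, which may differ from how the reported numbers were computed. The only Chardet call used is `chardet.detect`, which returns a dict containing an `encoding` entry.

```python
def normalize_label(name):
    """Map a detector's encoding name to one of the two classes, or 'other'."""
    if name is None:
        return "other"
    name = name.lower()
    if name in ("utf-8", "windows-1252"):
        return name
    return "other"  # assumption: every other answer counts as wrong


def chardet_predict(data: bytes) -> str:
    """Classify a byte string with Chardet (third-party; imported lazily)."""
    import chardet
    return normalize_label(chardet.detect(data)["encoding"])


def accuracy(predict, examples):
    """Fraction of (bytes, label) pairs for which predict(bytes) equals label."""
    correct = sum(1 for data, label in examples if predict(data) == label)
    return correct / len(examples)
```

With Chardet installed, `accuracy(chardet_predict, examples)` would score the baseline, and the same `accuracy` function can score any other predictor that maps bytes to a label.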

Selected results:

32 hidden unit model accuracy: 0.9770

Chardet accuracy: 0.7340