Data presented in the paper "From Web Crawl to Clean Register-Annotated Corpora"
The data consists of 20 texts in Swedish and in French, annotated line by line to distinguish the lines belonging to the body text (#1#) from the rejected lines (#0#).
There are four types of files for each language: the raw HTML data retrieved from the URLs used, the trafilatura output for those URLs, and line-wise annotations for each of the two.
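
As an illustration of how the line-wise annotations might be read, here is a minimal Python sketch. It assumes that each annotated line begins with its marker (#1# or #0#) followed by the line content; the exact position of the marker in the files is not specified here, so the pattern may need adjusting. The function names and the sample lines are hypothetical and are not taken from the data.

```python
import re

# Assumption: the marker sits at the start of each annotated line,
# e.g. "#1# some body text" or "#0# some rejected text".
LABEL_RE = re.compile(r"^#([01])#\s?(.*)$")


def parse_annotated_lines(lines):
    """Return (label, text) pairs; lines without a marker are skipped."""
    pairs = []
    for line in lines:
        match = LABEL_RE.match(line.rstrip("\n"))
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def body_text(lines):
    """Join the text of all lines labelled as body text (#1#)."""
    return "\n".join(
        text for label, text in parse_annotated_lines(lines) if label == 1
    )


# Made-up sample lines, for illustration only.
sample = [
    "#0# Cookie settings\n",
    "#1# First paragraph of the article.\n",
    "#1# Second paragraph.\n",
    "#0# Share this page\n",
]

print(body_text(sample))
```

The same functions can be applied to a file by passing an open file handle, since iterating over it yields one line at a time.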