TurkuNLP/WAC-XII

WAC-XII

Data presented in the paper "From Web Crawl to Clean Register-Annotated Corpora"

The data consists of 20 texts in Swedish and in French, annotated line by line: lines belonging to the body of the text are marked #1# and rejected lines are marked #0#.

There are four types of files for each language: the raw HTML data retrieved from the source URLs, the trafilatura output for the same URLs, and line-wise annotations for each of these two.
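As a rough illustration of how the line-wise annotations could be consumed, the Python sketch below splits an annotated text into body lines and rejected lines. It assumes that each annotated line starts with its marker (#1# or #0#); the exact position of the marker in the files is not stated here, so check it against the data before relying on this. The function name `split_annotated_lines` and the sample text are hypothetical and are not part of the repository.

```python
from typing import List, Tuple

BODY_MARKER = "#1#"      # line belongs to the body of the text
REJECTED_MARKER = "#0#"  # line was rejected


def split_annotated_lines(text: str) -> Tuple[List[str], List[str]]:
    """Split annotated text into (body lines, rejected lines).

    Assumption: the marker is a prefix of each annotated line.
    Lines that carry no marker are skipped.
    """
    body: List[str] = []
    rejected: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(BODY_MARKER):
            body.append(line[len(BODY_MARKER):].strip())
        elif line.startswith(REJECTED_MARKER):
            rejected.append(line[len(REJECTED_MARKER):].strip())
    return body, rejected


# Made-up sample, not taken from the data set.
sample = "\n".join([
    "#0# Cookie settings",
    "#1# First paragraph of the article.",
    "#1# Second paragraph of the article.",
    "#0# Share this page",
])

body, rejected = split_annotated_lines(sample)
print(f"{len(body)} body lines, {len(rejected)} rejected lines")
```

Keeping the two lists separate makes it easy to compare, for example, the share of rejected lines in the raw HTML annotations against the share in the trafilatura output annotations.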
