WAC-XII

Data presented in the paper "From Web Crawl to Clean Register-Annotated Corpora"

The data consists of 20 texts in Swedish and in French that have been annotated line by line: lines belonging to the body of the text are marked #1#, and rejected lines are marked #0#.

There are four types of files for each language: the raw HTML data retrieved from the URLs, the trafilatura output for the same URLs, and line-wise annotations of both.
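
As a rough illustration of how the #1# / #0# labels might be read, here is a minimal Python sketch. It assumes that each annotated line carries its marker as a literal token at the very start or very end of the line; the exact position is not described in this README, so adjust the pattern after inspecting the files. The names `split_label` and `keep_body` are hypothetical helpers, not part of the dataset.

```python
import re

# Assumption: the label is a literal "#1#" or "#0#" token at the very start
# or the very end of a line. Check the annotated files and adapt if needed.
LABEL_RE = re.compile(r"^\s*#([01])#\s*|\s*#([01])#\s*$")


def split_label(line):
    """Return (label, text) for one line; label is None if no marker is found."""
    stripped = line.rstrip("\n")
    match = LABEL_RE.search(stripped)
    if match is None:
        return None, stripped
    # Exactly one of the two groups is set, depending on the marker position.
    label = int(match.group(1) or match.group(2))
    text = stripped[:match.start()] + stripped[match.end():]
    return label, text


def keep_body(lines):
    """Keep only the text of lines labeled #1# (body text)."""
    return [text for label, text in map(split_label, lines) if label == 1]
```

To apply it to one of the annotated files, open the file (UTF-8 encoding is an assumption) and pass the file object to `keep_body`, for example `keep_body(open("sv_trafilatura_annotated.txt", encoding="utf-8"))`.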

Files in this repository:

README.md
fr_html_annotated.txt
fr_html_raw.txt
fr_trafilatura_annotated.txt
fr_trafilatura_output.txt
sv_html_annotated.txt
sv_html_raw.txt
sv_trafilatura_annotated.txt
sv_trafilatura_output.txt
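
The data file names listed above follow a language prefix (fr or sv) followed by the file type. The following sketch builds the four file names for a given language; the function name `data_files` and the dictionary keys are hypothetical and chosen only for illustration.

```python
LANGUAGES = ("fr", "sv")


def data_files(language):
    """Map each of the four file types to its file name for one language."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language prefix: {language!r}")
    return {
        "html_raw": f"{language}_html_raw.txt",
        "html_annotated": f"{language}_html_annotated.txt",
        "trafilatura_output": f"{language}_trafilatura_output.txt",
        "trafilatura_annotated": f"{language}_trafilatura_annotated.txt",
    }
```

For example, `data_files("sv")["trafilatura_annotated"]` gives the name of the annotated trafilatura output for Swedish.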
