Common Crawl Domain Names

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

Breaking domain names such as "openresearch" into their component words "open" and "research" is important for applications such as text-to-speech synthesis and web search. Common Crawl is an open repository of web crawl data that anyone can access and analyze. We scraped its plaintext (WET) extracts for domain names in URLs that contained mixed letter casing (e.g. "OpenBSD"). In that example the letter casing makes segmentation trivial, but casing does not always reveal the word boundaries (e.g. "NASA"), so we manually annotated the data. The distribution of trivial and non-trivial examples is outlined below.

If you plan to use this dataset as part of a publication, please cite:

@inproceedings{zrs2020urlsegmentation,
  title={Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities},
  author={Hao Zhang and Jae Ro and Richard William Sproat},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Dataset Description

The dataset is stored as a plaintext file in which each line is one example: the space-separated segments of a domain name. Examples are stored in their original letter casing, but harder and more interesting examples can be generated by lowercasing the input first.

Open B S D
NASA
ASAP Workouts
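
As a minimal sketch of how such lines might be turned into training pairs, the snippet below splits each line on whitespace to get the target segments and joins them back together to get the unsegmented input. The function name `parse_example` and the `lowercase` flag are illustrative assumptions, not part of the dataset or its tooling.

```python
def parse_example(line, lowercase=True):
    """Turn one dataset line into an (input, segments) pair.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    # Each line holds the space-separated segments of one domain name.
    segments = line.split()
    if lowercase:
        # Lowercasing removes the casing cue and makes the example harder.
        segments = [segment.lower() for segment in segments]
    # The unsegmented input is the segments joined without spaces.
    return "".join(segments), segments


lines = ["Open B S D", "NASA", "ASAP Workouts"]
pairs = [parse_example(line) for line in lines]
for domain, segments in pairs:
    print(domain, segments)
```

Passing `lowercase=False` keeps the original casing, which leaves the easier, casing-assisted variant of each example.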

split	size	trivial	avg_input_length	avg_segments
train	17572	13718	12.63	2.65
eval	1953	1536	12.77	2.67
test	2170	1714	12.63	2.66
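
The per-split statistics could be recomputed roughly as follows. This is a sketch under stated assumptions: `avg_input_length` is taken to be the mean character count of the unsegmented domain name, and `avg_segments` the mean number of space-separated segments per line. The `trivial` column is not reproduced, since its exact criterion is not specified here. The function name `split_stats` is hypothetical.

```python
def split_stats(lines):
    """Compute size, average input length, and average segment count.

    Assumes input length means the number of characters in the
    unsegmented domain name; this definition is an assumption.
    """
    # Skip blank lines, then split each example into its segments.
    examples = [line.split() for line in lines if line.strip()]
    size = len(examples)
    total_chars = sum(len("".join(segments)) for segments in examples)
    total_segments = sum(len(segments) for segments in examples)
    return {
        "size": size,
        "avg_input_length": round(total_chars / size, 2),
        "avg_segments": round(total_segments / size, 2),
    }


sample = ["Open B S D", "NASA", "ASAP Workouts"]
print(split_stats(sample))
```

On a real split, `lines` would be read from the corresponding file in the data directory.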

Contact Us

If you have any additional questions regarding the dataset or how the data was scraped, please create an issue.
