wiki-clustering

Code for creating clustering benchmarks for arbitrary languages using Wikipedia.

Installation

The project uses poetry for dependency management, so make sure you have it installed (see this guide for instructions). To install the dependencies, run:

poetry install

Usage

Config files

Adding a new language takes two steps: (a) downloading the right files from the Wikipedia dump and (b) writing a configuration file called {prefix}-config.json and storing it in language_configs/. The structure of the config file can be found in src/config.py.
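As a rough illustration of the naming convention above, the following sketch resolves a language prefix to its config file and loads it as plain JSON. The helper names `config_path` and `load_language_config` are hypothetical and not part of the repository; the actual schema lives in src/config.py and is not reproduced or validated here.

```python
import json
from pathlib import Path

# Directory named in the README; everything else here is illustrative.
CONFIG_DIR = Path("language_configs")


def config_path(prefix, config_dir=CONFIG_DIR):
    """Return the expected path of the config file for a language prefix."""
    return Path(config_dir) / f"{prefix}-config.json"


def load_language_config(prefix, config_dir=CONFIG_DIR):
    """Load {prefix}-config.json as a dict, failing loudly if it is missing.

    No schema check is done: the real structure is defined in src/config.py.
    """
    path = config_path(prefix, config_dir)
    if not path.exists():
        raise FileNotFoundError(f"No config for '{prefix}': expected {path}")
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_language_config("da")` would read language_configs/da-config.json, assuming that file exists.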

Scripts

Several scripts run the different parts of the pipeline. The main ones are:

parse_articles.py: Parses the articles to create a JSON file with the first paragraphs and the categories for the first 300,000 articles of the wiki dump.
parse_sql_gz.py: Parses the SQL dump of the Wikipedia to get the categories of the articles as well as their ids. This includes the top-level categories.
join_categories.py: Joins the categories from the SQL dump with the parsed articles. Specifically, this maps each article's categories to the top-level categories, as defined by the corresponding-language version of the Main topic classifications article.
create_categories.py: Creates the actual dataset by sampling from the articles and the corresponding categories.
upload_hf.py: Uploads the dataset to Hugging Face. NB: Currently this can only be done by the author (me!).
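To make the join and sampling steps more concrete, here is a minimal sketch of the general idea, not the repository's implementation. The data shapes (`articles`, `page_categories`, `top_level_of`), both function names, and the choice to keep only articles with a single top-level category when sampling are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def join_top_level(articles, page_categories, top_level_of):
    """Attach top-level categories to each article.

    Hypothetical input shapes:
      articles:        [{"id": int, "text": str}, ...]
      page_categories: {page_id: [category, ...]}
      top_level_of:    {category: top_level_category}
    """
    joined = []
    for article in articles:
        cats = page_categories.get(article["id"], [])
        top = sorted({top_level_of[c] for c in cats if c in top_level_of})
        if top:  # drop articles that map to no top-level category
            joined.append({**article, "top_level": top})
    return joined


def sample_per_label(joined, n_per_label, seed=0):
    """Sample up to n_per_label articles for each top-level category.

    Assumption: only articles with exactly one top-level category are used,
    so that every sampled text has a single cluster label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in joined:
        if len(row["top_level"]) == 1:
            by_label[row["top_level"][0]].append(row)
    sample = []
    for label in sorted(by_label):
        rows = by_label[label]
        sample.extend(rng.sample(rows, min(n_per_label, len(rows))))
    return sample
```

Fixing the seed keeps the sample reproducible between runs, which is usually desirable for a benchmark.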

Running the pipeline

For convenience, there are two helper scripts for running the pipeline: run_for_lang.sh and run_and_upload_all.sh. The former runs the pipeline for a single language, while the latter runs the pipeline for all languages in the language_configs/ directory.
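The all-languages helper can be pictured as a loop over the config files. The Python sketch below illustrates that idea under one loud assumption: that run_for_lang.sh takes the language prefix as its only positional argument, which this README does not confirm. The function names are hypothetical, and `dry_run=True` only prints the commands.

```python
import subprocess
from pathlib import Path

SUFFIX = "-config.json"


def language_prefixes(config_dir="language_configs"):
    """List the prefixes of all {prefix}-config.json files in config_dir."""
    paths = Path(config_dir).glob(f"*{SUFFIX}")
    return sorted(p.name[: -len(SUFFIX)] for p in paths)


def build_commands(prefixes, script="./run_for_lang.sh"):
    """Build one command per language.

    Assumption: the script accepts the prefix as a single positional argument.
    """
    return [[script, prefix] for prefix in prefixes]


def run_all(config_dir="language_configs", dry_run=True):
    """Run (or, with dry_run, just print) the per-language command for every config."""
    commands = build_commands(language_prefixes(config_dir))
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing language
    return commands
```

Calling `run_all()` from the repository root lists what would be executed; passing `dry_run=False` actually invokes the script for each language.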

TODO:

Create a README-like file on HF a la this one
Simple documentation on how the data was created.

Languages

Signs

x: all done
c: config file written
d: downloaded
r: run
e: evaluated
h: uploaded and updated on HF

Languages

da
lv
gv
sq
[d] ku
[d] sco
[d] mt
[d] bs
[d] ca
[d] eu
[d] wa
[d] cs
[d] ilo
[d] min

Rysias/wiki-clustering
