In [None]:
#| hide
import logging

from squeakily.core import *

# Turn off logging for datasets
logging.getLogger("datasets").setLevel(logging.ERROR)

# squeakily

> A library for squeakily cleaning and filtering language datasets.

This repository is heavily inspired by BigScience's [ROOTS project](https://github.com/bigscience-workshop/data-preparation) and EleutherAI's [The Pile](https://github.com/EleutherAI/the-pile).

The overall pipeline is as follows:

```{mermaid}
flowchart LR
  A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
  B --> C(Defining Cleaners <br/>per Datasource)
```

In this library, filtering means removing data instances from the dataset based on some criteria, and cleaning means modifying data instances in some way.
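
The distinction can be illustrated in plain Python, independent of `squeakily` itself. In this minimal sketch, `is_long_enough` and `collapse_spaces` are hypothetical helpers, not functions from the library: the filter decides whether a document is kept, while the cleaner returns a modified copy of it.

```python
# Hypothetical helpers for illustration only; they are not part of squeakily.


def is_long_enough(text: str, min_words: int = 3) -> bool:
    """Filter: return True to keep the document, False to drop it."""
    return len(text.split()) >= min_words


def collapse_spaces(text: str) -> str:
    """Cleaner: return a modified copy of the document."""
    return " ".join(text.split())


docs = ["too short", "this   one has    enough words"]

# Filtering changes how many documents remain.
filtered = [d for d in docs if is_long_enough(d)]

# Cleaning changes the content of the documents that remain.
cleaned = [collapse_spaces(d) for d in filtered]

print(cleaned)
```

Here the first document is removed by the filter, and the second survives with its runs of spaces collapsed.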

## Install

```sh
pip install squeakily
```

## How to use

### Using the API

First, we need to define a datasource. `squeakily` accepts any `Dataset` object from the [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) library. For example, we can use the [wikitext](https://huggingface.co/datasets/wikitext) dataset:

In [None]:
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:10%]")

To describe a datasource, wrap the `Dataset` object in a dictionary that holds the dataset itself (`"dataset"`), the names of the columns to process (`"columns"`), the filters to apply (`"filters"`), and the cleaners to apply (`"cleaners"`). For example:

In [None]:
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]

:::{.callout-warning}
Note: The order of the filters and cleaning functions matters. Filters and cleaners are applied in the order in which they are listed.
:::
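
To see why ordering matters, consider two toy cleaners applied in both orders. This is only a sketch: `lowercase`, `drop_the`, and `apply_in_order` are hypothetical names introduced for illustration and do not come from `squeakily`.

```python
# Hypothetical cleaners for illustration only; they are not part of squeakily.


def lowercase(text: str) -> str:
    """Convert the whole document to lowercase."""
    return text.lower()


def drop_the(text: str) -> str:
    """Remove tokens that are exactly the lowercase word 'the'."""
    return " ".join(word for word in text.split() if word != "the")


def apply_in_order(text: str, cleaners) -> str:
    """Feed the output of each cleaner into the next one, in list order."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


doc = "The cat saw the dog"

lower_then_drop = apply_in_order(doc, [lowercase, drop_the])
drop_then_lower = apply_in_order(doc, [drop_the, lowercase])

print(lower_then_drop)  # cat saw dog
print(drop_then_lower)  # the cat saw dog
```

Because `drop_the` is case-sensitive, it only removes the leading "The" when `lowercase` has already run.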

:::{.callout-important}
Note: As of now, we only use the first column of the given column names. This is because the `squeakily` library is designed to work with language datasets, which usually have a single column of text. Future versions will support multiple columns.
:::

Finally, we can apply the filters and cleaners to the datasources using a `Pipeline` object:

In [None]:
from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()

:::{.callout-note}
Note: If you want to run cleaners first, you can pass `cleaning_first=True` to the `run` function.

```python
pipeline.run(cleaning_first=True)
```
:::
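
Whether cleaning or filtering runs first can change which documents survive. The sketch below mimics the idea in plain Python; `run_sketch`, `normalize_spaces`, and `has_min_chars` are hypothetical stand-ins and say nothing about how `Pipeline.run` is actually implemented.

```python
# Hypothetical stand-ins for illustration only; not squeakily's implementation.


def normalize_spaces(text: str) -> str:
    """Cleaner: collapse whitespace runs and trim both ends."""
    return " ".join(text.split())


def has_min_chars(text: str, min_chars: int = 10) -> bool:
    """Filter: keep documents with at least `min_chars` characters."""
    return len(text) >= min_chars


def run_sketch(docs, cleaning_first: bool = False):
    """Apply one filter and one cleaner in the requested order."""
    if cleaning_first:
        docs = [normalize_spaces(d) for d in docs]
        return [d for d in docs if has_min_chars(d)]
    docs = [d for d in docs if has_min_chars(d)]
    return [normalize_spaces(d) for d in docs]


docs = ["hi" + " " * 20, "a longer   sentence here"]

filter_first = run_sketch(docs)
clean_first = run_sketch(docs, cleaning_first=True)

print(filter_first)  # ['hi', 'a longer sentence here']
print(clean_first)   # ['a longer sentence here']
```

When filtering runs first, the padded document passes the length check on its raw text and is only shortened afterwards. When cleaning runs first, the same document is reduced to two characters and then dropped.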