In [None]:
#| hide
import logging

from squeakily.core import *

# Turn off logging for datasets
logging.getLogger("datasets").setLevel(logging.ERROR)


# squeakily

> A library for squeakily cleaning and filtering language datasets.

This repository is heavily inspired by BigScience's [ROOTS project](https://github.com/bigscience-workshop/data-preparation) and EleutherAI's [The Pile](https://github.com/EleutherAI/the-pile).

The overall pipeline is as follows:

```{mermaid}
flowchart LR
  A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
  B --> C(Defining Cleaners <br/>per Datasource)
```

In this library, *filtering* means removing data instances from the dataset based on some criterion, and *cleaning* means modifying data instances in some way while keeping them in the dataset.
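
As a minimal, library-independent sketch of that distinction: a filter decides whether an instance stays, while a cleaner returns a modified instance. The helper names `is_long_enough` and `collapse_spaces` are hypothetical and are not part of `squeakily`.

```python
# Hypothetical helpers for illustration only; not part of squeakily.
docs = ["hello   world", "hi", "a  much   longer   sentence"]


def is_long_enough(text: str, min_chars: int = 5) -> bool:
    """Filter: return True to keep the instance, False to drop it."""
    return len(text) >= min_chars


def collapse_spaces(text: str) -> str:
    """Cleaner: return a modified copy of the instance."""
    return " ".join(text.split())


# Filtering changes how many instances remain, not their content.
filtered = [d for d in docs if is_long_enough(d)]

# Cleaning changes the content, not the number of instances.
cleaned = [collapse_spaces(d) for d in docs]
```

Here `filtered` has fewer entries than `docs`, while `cleaned` has the same number of entries with normalized spacing.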

## Install

```sh
pip install squeakily
```

## How to use

### Using the API

First, we need to define a datasource. `squeakily` accepts any `Dataset` object from the [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) library. For example, we can use the [wikitext](https://huggingface.co/datasets/wikitext) dataset:

In [None]:
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")

Next, we describe each datasource with a dictionary that holds the `Dataset` object, a name for the datasource, the columns to process, and the filters and cleaners to apply. The datasources are collected in a list. For example:

In [None]:
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]

:::{.callout-warning}
Note: The order of the filters and cleaning functions matters. Filters and cleaners are applied in the order they are defined.
:::
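
To see why ordering can change the result, here is a small sketch with two hypothetical cleaners (`drop_empty_lines` and `strip_lines`, written only for this illustration and not taken from `squeakily`). Applying them in opposite orders gives different text:

```python
# Hypothetical cleaners for illustration only; not squeakily's implementations.
def drop_empty_lines(text: str) -> str:
    """Remove lines that are exactly empty (whitespace-only lines survive)."""
    return "\n".join(line for line in text.split("\n") if line != "")


def strip_lines(text: str) -> str:
    """Strip leading and trailing whitespace from every line."""
    return "\n".join(line.strip() for line in text.split("\n"))


def apply_in_order(text: str, cleaners) -> str:
    """Apply each cleaner to the output of the previous one."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


doc = "first\n   \nsecond"

# The whitespace-only line is not empty yet, so it survives and is then stripped.
a = apply_in_order(doc, [drop_empty_lines, strip_lines])

# Stripping first turns the whitespace-only line into an empty one, which is then dropped.
b = apply_in_order(doc, [strip_lines, drop_empty_lines])
```

In this sketch `a` still contains a blank line between the two words, while `b` does not.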

:::{.callout-important}
Note: As of now, we only use the first column of the given column names. This is because the `squeakily` library is designed to work with language datasets, which usually have a single column of text. Future versions will support multiple columns.
:::

Finally, we can apply the filters and cleaners to the datasources using a `Pipeline` object:

In [None]:
from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()

:::{.callout-note}
Note: If you want to run cleaners first, you can pass `cleaning_first=True` to the `run` function.

```python
pipeline.run(cleaning_first=True)
```
:::
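
The effect of running cleaners before filters can be sketched without the library. The functions below (`normalize_spaces`, `short_enough`, and `run`) are hypothetical stand-ins and do not reproduce how `Pipeline.run` is implemented. They only show that a document can pass a filter after cleaning even though it would fail before:

```python
# Hypothetical stand-ins; squeakily's own filters and cleaners may behave differently.
def normalize_spaces(text: str) -> str:
    """Cleaner: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def short_enough(text: str, max_chars: int = 12) -> bool:
    """Filter: keep documents of at most max_chars characters."""
    return len(text) <= max_chars


def run(docs, cleaning_first: bool = False):
    """Apply one cleaner and one filter in the requested order."""
    if cleaning_first:
        docs = [normalize_spaces(d) for d in docs]
        return [d for d in docs if short_enough(d)]
    docs = [d for d in docs if short_enough(d)]
    return [normalize_spaces(d) for d in docs]


docs = ["a    b    c    d", "ok"]

filter_then_clean = run(docs)                       # the padded document is too long and is dropped
clean_then_filter = run(docs, cleaning_first=True)  # after cleaning it is short enough to keep
```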

If you need to run a filter or cleaner at the dataset level rather than the example level, you can pass `global_filters` or `global_cleaners` to the `Pipeline.run` function. For example:

In [None]:
#|output: false
from squeakily.filter import minhash_dedup

pipeline.run(global_filters=[minhash_dedup])


:::{.callout-note}
Note: If you use global filters or cleaners, all datasets must have a common column name in order to properly concatenate them.
:::
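
The following sketch illustrates why a dataset-level step needs a shared column. It is an assumption-laden simplification: plain lists of dictionaries stand in for `Dataset` objects, and exact-match deduplication (the hypothetical `exact_dedup`) stands in for MinHash-based near-duplicate detection. It does not show how `minhash_dedup` works internally.

```python
# Illustration only: exact-match deduplication stands in for MinHash-based
# near-duplicate detection, and lists of dicts stand in for Dataset objects.
source_a = [{"text": "the cat sat"}, {"text": "hello world"}]
source_b = [{"text": "hello world"}, {"text": "another doc"}]


def exact_dedup(rows, column):
    """Dataset-level filter: keep the first occurrence of each distinct value."""
    seen = set()
    kept = []
    for row in rows:
        if row[column] not in seen:
            seen.add(row[column])
            kept.append(row)
    return kept


# Both sources expose a "text" column, so they can be concatenated and
# compared against each other as a single collection.
combined = source_a + source_b
deduped = exact_dedup(combined, column="text")
```

An example-level filter could not catch the duplicate here, because the two copies of `"hello world"` live in different sources and each looks fine in isolation.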

:::{.callout-note}
Note: You can also exclude a specific dataset from the global filters and cleaners by setting the `skip_global` parameter to `True` when defining the datasource.

```python
datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
        "skip_global": True,
    },
    # ...
]
```
:::
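
Conceptually, the flag splits the datasources into those that take part in the dataset-level step and those that are left alone. The sketch below uses plain dictionaries with a hypothetical `rows` key, and it assumes that a missing `skip_global` key is treated as `False`. Neither detail is confirmed by the library's documentation above.

```python
# Plain dictionaries stand in for real datasources; "rows" is a hypothetical key.
datasources = [
    {"name": "wiki", "rows": ["a", "b"], "skip_global": True},
    {"name": "web", "rows": ["b", "c"]},
]

# Assumption: a missing "skip_global" key is treated as False.
participating = [d for d in datasources if not d.get("skip_global", False)]
skipped = [d for d in datasources if d.get("skip_global", False)]

participating_names = [d["name"] for d in participating]
skipped_names = [d["name"] for d in skipped]
```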

Additionally, you can run the pipeline in dry run mode by passing `dry_run=True` to the `run` function. This makes no modifications to the datasets' documents, but adds columns to the datasets containing the results of the filters and cleaners. For example, if you ran the pipeline with the `check_char_repetition` filter, you would get a new column called `check_char_repetition` holding a float between 0 and 1 that indicates the fraction of characters that are repeated in the document.
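
The idea behind a dry run can be sketched in plain Python. The scoring function `repeated_char_fraction` and the helper `dry_run` below are hypothetical: they are not `squeakily`'s `check_char_repetition` or its dry run implementation, and the real metric may be computed differently. The sketch only shows scores being attached as new columns while the text stays untouched.

```python
# Hypothetical scorer and helper for illustration; not squeakily's implementation.
def repeated_char_fraction(text: str) -> float:
    """Fraction of characters that equal the character immediately before them."""
    if not text:
        return 0.0
    repeats = sum(1 for prev, cur in zip(text, text[1:]) if prev == cur)
    return repeats / len(text)


def dry_run(rows, column, scorers):
    """Return copies of rows with one extra column per scorer; the text is unchanged."""
    return [
        {**row, **{scorer.__name__: scorer(row[column]) for scorer in scorers}}
        for row in rows
    ]


rows = [{"text": "aaab"}, {"text": "abcd"}]
scored = dry_run(rows, "text", [repeated_char_fraction])
```

Inspecting the added column before committing to a threshold is the main reason to run in this mode: no rows are removed and the original `rows` list is left as it was.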

In [None]:
pipeline = Pipeline(datasources)
pipeline.run(dry_run=True)
pipeline.datasources[0]["dataset"].features