In [None]:
#| hide
from squeakily.core import *

# squeakily

> A library for squeakily cleaning and filtering language datasets.

This repository is heavily inspired by BigScience's [ROOTS project](https://github.com/bigscience-workshop/data-preparation) and EleutherAI's [The Pile](https://github.com/EleutherAI/the-pile).

The overall pipeline is as follows:

```{mermaid}
flowchart LR
  A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
  B --> C(Defining Cleaning Functions <br/>per Datasource)
```

In this library, *filtering* means removing data instances from the dataset based on some criterion, while *cleaning* means modifying data instances and keeping them in the dataset.
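As a minimal illustration of this distinction, the sketch below uses plain Python lists rather than `squeakily` itself. The helpers `is_long_enough` and `collapse_whitespace` are hypothetical names introduced only for this example.

```python
# Hypothetical helpers for illustration only; they are not part of squeakily.


def is_long_enough(text: str, min_words: int = 3) -> bool:
    """Filter criterion: keep only instances with at least `min_words` words."""
    return len(text.split()) >= min_words


def collapse_whitespace(text: str) -> str:
    """Cleaning step: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


docs = [
    "A   short    example  sentence.",
    "Too short",
    "Another\tvalid   one here.",
]

# Filtering removes instances: the dataset gets smaller.
filtered = [d for d in docs if is_long_enough(d)]

# Cleaning modifies instances: the dataset keeps its size.
cleaned = [collapse_whitespace(d) for d in filtered]
```

Here `filtered` has fewer entries than `docs`, while `cleaned` has the same number of entries as `filtered` with their contents changed.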

## Install

```sh
pip install squeakily
```

## How to use

### Using the API

First, we need to define a datasource. `squeakily` accepts any `Dataset` object from the [Hugging Face Datasets](https://huggingface.co/datasets) library. For example, we can use the `wikitext` dataset:

In [None]:
from datasets import load_dataset

# Request a single split so that `ds` is a `Dataset` rather than a `DatasetDict`.
ds = load_dataset("wikitext", "wikitext-103-v1", split="train")

Next, we wrap the `Dataset` object in a dictionary keyed by the name of the datasource. Each value is itself a dictionary holding the `Dataset` object, the columns to process, and the filter and cleaning functions to apply. For example:

In [None]:
from squeakily.filter import exact_match, flagged
from squeakily.clean import remove_empty_lines, normalize_whitespace

datasources = {
    "wikitext": {
        "dataset": ds,
        "columns": ["text"],
        "filters": [exact_match, flagged],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
}

Finally, we can apply the filters and cleaning functions to the datasources using a `Pipeline` object:

In [None]:
from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()

:::{.callout-note}
To run the cleaning functions before the filters, pass `cleaning_first=True` to the `run` function.

```python
pipeline.run(cleaning_first=True)
```
:::
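The order of filtering and cleaning can change the result. The sketch below is not `squeakily`'s implementation; it is a plain-Python stand-in that assumes a filter returns `True` to keep a text and a cleaner returns the modified text. The names `run_sketch`, `has_min_chars`, and `strip_heading_markers` are hypothetical and exist only for this example.

```python
from typing import Callable, List

# Assumed signatures for this sketch (not confirmed by squeakily's docs):
# a filter returns True to keep a text, a cleaner returns the modified text.
Filter = Callable[[str], bool]
Cleaner = Callable[[str], str]


def run_sketch(
    texts: List[str],
    filters: List[Filter],
    cleaners: List[Cleaner],
    cleaning_first: bool = False,
) -> List[str]:
    """Apply filters and cleaners to a list of texts in the requested order."""

    def apply_filters(items: List[str]) -> List[str]:
        # Keep a text only if every filter accepts it.
        return [t for t in items if all(f(t) for f in filters)]

    def apply_cleaners(items: List[str]) -> List[str]:
        # Run each text through all cleaners in sequence.
        out = []
        for t in items:
            for c in cleaners:
                t = c(t)
            out.append(t)
        return out

    if cleaning_first:
        return apply_filters(apply_cleaners(texts))
    return apply_cleaners(apply_filters(texts))


def has_min_chars(text: str) -> bool:
    """Hypothetical filter: keep texts with at least 10 characters."""
    return len(text) >= 10


def strip_heading_markers(text: str) -> str:
    """Hypothetical cleaner: strip leading/trailing '=' and space characters."""
    return text.strip("= ")


texts = ["= = Title = =", "A longer paragraph of text."]

filter_then_clean = run_sketch(texts, [has_min_chars], [strip_heading_markers])
clean_then_filter = run_sketch(
    texts, [has_min_chars], [strip_heading_markers], cleaning_first=True
)
```

With filtering first, the heading passes the length check in its raw form and is then shortened by cleaning. With cleaning first, the heading is shortened before the length check and is dropped.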