PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines. Data transformations include encoding, binning, scaling, string replacement, NaN filling, column type conversion, and data anonymization.
You can install PyTrousse in your Python virtual environment by cloning this repository:
$ git clone https://github.com/HK3-Lab-Team/pytrousse.git
and by running the following commands:
$ cd pytrousse
$ pip install .
PyTrousse transformations are progressively wrapped internally with the data, thus linking all stages of data preprocessing for future reproducibility.
Along with the processed data, every Dataset object documents how the user performed the analysis, so that it can be reproduced later and questions about how it was carried out can be answered months or years after the fact.
The traced data path can be inspected through the operations_history attribute.
>>> dataset.operations_history
[FillNA(
    columns=["column_with_nan"],
    value=0,
    derived_columns=["column_filled"],
), ReplaceSubstrings(
    columns=["column_invalid_values"],
    replacement_map={",": ".", "°": ""},
    derived_columns=["column_valid_values"],
)]
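To make the idea of a traced data path concrete, here is a minimal sketch in plain Python of a container that stores data together with the list of operations applied to it. It is not PyTrousse's implementation: `TracedTable` and `FillNASketch` are hypothetical names, the table is a plain dict of lists, and `None` stands in for NaN.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class FillNASketch:
    """Hypothetical stand-in for a fill-NaN operation (not PyTrousse's class)."""

    column: str
    value: Any
    derived_column: str

    def apply(self, data: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
        # Copy the mapping so the input table is left untouched.
        result = dict(data)
        result[self.derived_column] = [
            self.value if item is None else item for item in data[self.column]
        ]
        return result


@dataclass
class TracedTable:
    """Hypothetical container pairing data with the operations applied to it."""

    data: Dict[str, List[Any]]
    history: List[Any] = field(default_factory=list)

    def apply(self, operation: Any) -> "TracedTable":
        # Return a new table whose history is extended by one entry.
        return TracedTable(operation.apply(self.data), self.history + [operation])


table = TracedTable({"column_with_nan": [1, None, 3]})
table = table.apply(FillNASketch("column_with_nan", 0, "column_filled"))

print(table.data["column_filled"])  # [1, 0, 3]
print(table.history)
```

Because each operation is a small immutable object with a readable repr, the history doubles as a record that could be replayed on fresh data.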
Wouldn't it be cool to have full column data type detection for your data?
PyTrousse expands Pandas tools for data type inference. Automatic identification is provided on an enlarged set of types (categorical, numerical, boolean, mixed, strings, etc.) using heuristic algorithms.
>>> import trousse
>>> dataset = trousse.read_csv("path/to/csv")
>>> dataset.str_categorical_columns
{"dog_breed", "fur_color"}
You can also get the names of boolean columns, numerical columns (i.e. those containing integer and float values), or constant columns.
>>> dataset.bool_columns
{"is_vaccinated"}
>>> dataset.numerical_columns
{"weight", "age"}
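The kind of heuristic grouping described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (a dict-of-lists table, `None` for missing values, hashable cell values); the function `classify_columns` and its group names are hypothetical, and PyTrousse's actual heuristics may differ.

```python
from typing import Any, Dict, List, Set


def classify_columns(table: Dict[str, List[Any]]) -> Dict[str, Set[str]]:
    """Group column names by a coarse, heuristic type (illustrative only)."""
    groups: Dict[str, Set[str]] = {
        "bool": set(),
        "numerical": set(),
        "string": set(),
        "mixed": set(),
        "constant": set(),
    }
    for name, values in table.items():
        present = [v for v in values if v is not None]  # ignore missing values
        if len(set(present)) <= 1:
            groups["constant"].add(name)
        elif all(isinstance(v, bool) for v in present):
            groups["bool"].add(name)
        elif all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            # bool is a subclass of int in Python, so it is excluded explicitly.
            groups["numerical"].add(name)
        elif all(isinstance(v, str) for v in present):
            groups["string"].add(name)
        else:
            groups["mixed"].add(name)
    return groups


pets = {
    "dog_breed": ["beagle", "poodle", "beagle"],
    "is_vaccinated": [True, False, True],
    "weight": [9.5, 20.1, 11.0],
    "age": [3, 7, None],
    "species": ["dog", "dog", "dog"],
    "notes": ["ok", 4, None],
}
groups = classify_columns(pets)
print(sorted(groups["numerical"]))  # ['age', 'weight']
```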
What about having an easy API for all those boring data preprocessing steps?
Along with the common preprocessing utilities (for encoding, binning, scaling, etc.), PyTrousse provides tools for noisy data handling and for data anonymization.
>>> from trousse.feature_operations import Compose, FillNA, ReplaceSubstrings
>>> fillna_replacestrings = Compose(
...     [
...         FillNA(
...             columns=["column_with_nan"],
...             value=0,
...             derived_columns=["column_filled"],
...         ),
...         ReplaceSubstrings(
...             columns=["column_invalid_values"],
...             replacement_map={",": ".", "°": ""},
...             derived_columns=["column_valid_values"],
...         ),
...     ]
... )
>>> dataset = fillna_replacestrings(dataset)
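For readers curious about the composition pattern itself, the following plain-Python sketch chains preprocessing steps so that they run in order on one input. It does not use or reproduce PyTrousse's `Compose`; the helpers `compose`, `strip_whitespace` and `replace_substrings` are hypothetical, and the table is a dict of string lists.

```python
from functools import reduce
from typing import Callable, Dict, List

Table = Dict[str, List[str]]
Step = Callable[[Table], Table]


def compose(steps: List[Step]) -> Step:
    """Chain steps so they run left to right on a single input."""

    def pipeline(table: Table) -> Table:
        return reduce(lambda current, step: step(current), steps, table)

    return pipeline


def strip_whitespace(column: str) -> Step:
    """Build a step that trims surrounding whitespace in `column`."""

    def step(table: Table) -> Table:
        return {**table, column: [value.strip() for value in table[column]]}

    return step


def replace_substrings(
    column: str, replacement_map: Dict[str, str], derived_column: str
) -> Step:
    """Build a step that writes a cleaned copy of `column` to `derived_column`."""

    def step(table: Table) -> Table:
        cleaned = []
        for value in table[column]:
            for old, new in replacement_map.items():
                value = value.replace(old, new)
            cleaned.append(value)
        return {**table, derived_column: cleaned}

    return step


raw = {"temperature": [" 36,6°", "37,2° "]}
pipeline = compose(
    [
        strip_whitespace("temperature"),
        replace_substrings("temperature", {",": ".", "°": ""}, "temperature_clean"),
    ]
)
result = pipeline(raw)
print(result["temperature_clean"])  # ['36.6', '37.2']
```

Each step returns a new dict rather than mutating its input, so the raw table stays available for comparison after the pipeline has run.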
PyTrousse aids automated testing by inverting the data transformation operators. Generation of testing fixtures and injection of errors are automatically available.
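One possible reading of "inverting" an operator for testing is sketched below: start from the clean values a test expects, run a substring replacement backwards to inject errors, and check that the forward operation recovers the expected values. This is an assumption about the general technique, not PyTrousse's testing API; `replace_substrings` and `inject_errors` are hypothetical helpers.

```python
from typing import Dict, List


def replace_substrings(values: List[str], replacement_map: Dict[str, str]) -> List[str]:
    """Forward operation: apply each replacement, in order, to every value."""
    cleaned = []
    for value in values:
        for old, new in replacement_map.items():
            value = value.replace(old, new)
        cleaned.append(value)
    return cleaned


def inject_errors(values: List[str], replacement_map: Dict[str, str]) -> List[str]:
    """Inverse direction: turn clean values into dirty ones.

    Replacements that map to an empty string cannot be inverted and are skipped.
    """
    inverse_map = {new: old for old, new in replacement_map.items() if new}
    return replace_substrings(values, inverse_map)


replacement_map = {",": "."}
expected = ["36.6", "37.2"]

# Generate a dirty fixture from the expected output, then clean it again.
dirty = inject_errors(expected, replacement_map)
recovered = replace_substrings(dirty, replacement_map)

print(dirty)  # ['36,6', '37,2']
```

The round trip only holds when the replacement is unambiguous, which is why the sketch skips deletions such as removing "°".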