PyTrousse

WIP ⚠️

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines. Data transformations include encoding, binning, scaling, string replacement, NaN filling, column type conversion, and data anonymization.

Getting started

To install PyTrousse in a Python virtual environment, first clone this repository:

$ git clone https://github.com/HK3-Lab-Team/pytrousse.git

then run the following commands:

$ cd pytrousse
$ pip install .

Main Features

Tracing the path from raw data

PyTrousse progressively records each transformation internally alongside the data it is applied to, thus linking all stages of data preprocessing for future reproducibility.

Along with the processed data, every Dataset object documents how the user performed the analysis, so that it can be reproduced later and so that questions about how it was carried out can be answered months or years after the fact.

The traced data path can be inspected through the operations_history attribute.

>>> dataset.operations_history

[FillNA(
    columns=["column_with_nan"],
    value=0,
    derived_columns=["column_filled"],
), ReplaceSubstrings(
    columns=["column_invalid_values"],
    replacement_map={",": ".", "°": ""},
    derived_columns=["column_valid_values"],
)]
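To make the idea of tracing concrete, here is a minimal, self-contained sketch of how a data container could carry its own operation log. It is not PyTrousse's implementation: the TracedTable class, its apply method, and the fill_missing helper are hypothetical names invented for illustration, and plain Python lists stand in for real columns.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Columns = Dict[str, List[Any]]


@dataclass
class TracedTable:
    """Hypothetical stand-in for a dataset that remembers what was done to it."""

    columns: Columns
    history: List[str] = field(default_factory=list)

    def apply(self, description: str, func: Callable[[Columns], Columns]) -> "TracedTable":
        # Work on a copy so the raw data is never mutated in place.
        copied = {name: list(values) for name, values in self.columns.items()}
        # The new object carries the old history plus the step just applied.
        return TracedTable(func(copied), self.history + [description])


def fill_missing(columns: Columns) -> Columns:
    # Derive a new column instead of overwriting the original one.
    columns["column_filled"] = [0 if v is None else v for v in columns["column_with_nan"]]
    return columns


raw = TracedTable({"column_with_nan": [1, None, 3]})
processed = raw.apply("FillNA(columns=['column_with_nan'], value=0)", fill_missing)

print(processed.history)
print(processed.columns["column_filled"])
```

Because every step returns a new object with an extended log, the raw table stays untouched and the log alone is enough to replay the preprocessing.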

Automatic column data type detection

Wouldn't it be cool to have full column data type detection for your data?

PyTrousse extends the pandas tools for data type inference. It automatically identifies a broader set of column types (categorical, numerical, boolean, mixed, string, etc.) using heuristic algorithms.

>>> import trousse
>>> dataset = trousse.read_csv("path/to/csv")

>>> dataset.str_categorical_columns

{"dog_breed", "fur_color"}

You can also get the names of boolean columns, numerical columns (i.e. those containing integer and float values), or constant columns.

>>> dataset.bool_columns

{"is_vaccinated"}

>>> dataset.numerical_columns

{"weight", "age"}
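The kind of heuristic involved can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the algorithm PyTrousse uses: the classify_column and columns_by_type functions, the type labels, and the 0.5 distinct-value threshold for categorical strings are all made up for this example.

```python
from typing import Any, Dict, List, Set


def classify_column(values: List[Any]) -> str:
    """Assign a coarse type label to a column using simple heuristics."""
    present = [v for v in values if v is not None]
    if not present:
        return "empty"
    # bool is a subclass of int in Python, so test for it first.
    if all(isinstance(v, bool) for v in present):
        return "bool"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numerical"
    if all(isinstance(v, str) for v in present):
        # Assumed threshold: few distinct values relative to rows means categorical.
        distinct_ratio = len(set(present)) / len(present)
        return "str_categorical" if distinct_ratio <= 0.5 else "string"
    return "mixed"


def columns_by_type(table: Dict[str, List[Any]]) -> Dict[str, Set[str]]:
    """Group column names by their detected type label."""
    result: Dict[str, Set[str]] = {}
    for name, values in table.items():
        result.setdefault(classify_column(values), set()).add(name)
    return result


pets = {
    "dog_breed": ["beagle", "poodle", "beagle", "poodle"],
    "is_vaccinated": [True, False, True, None],
    "weight": [10.5, 22, 9.8, 20.1],
    "notes": ["shy", 3, "friendly", None],
}
detected = columns_by_type(pets)
print(detected)
```

Missing values are ignored before classification, so a boolean column with gaps is still reported as boolean rather than mixed.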

Composable data transformations

What about having an easy API for all those boring data preprocessing steps?

Along with the common preprocessing utilities (for encoding, binning, scaling, etc.), PyTrousse provides tools for noisy data handling and for data anonymization.

>>> from trousse.feature_operations import Compose, FillNA, ReplaceSubstrings

>>> fillna_replacestrings = Compose(
...     [
...         FillNA(
...             columns=["column_with_nan"],
...             value=0,
...             derived_columns=["column_filled"],
...         ),
...         ReplaceSubstrings(
...             columns=["column_invalid_values"],
...             replacement_map={",": ".", "°": ""},
...             derived_columns=["column_valid_values"],
...         ),
...     ]
... )

>>> dataset = fillna_replacestrings(dataset)
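For readers curious about what composition amounts to, the sketch below chains two steps into a single callable. It is a simplified model and not the trousse.feature_operations code: compose, fill_na, and replace_substrings are hypothetical functions that operate on a plain dictionary of lists, chosen to mirror the example above.

```python
from typing import Any, Callable, Dict, List, Sequence

Table = Dict[str, List[Any]]
Step = Callable[[Table], Table]


def compose(steps: Sequence[Step]) -> Step:
    """Chain steps into one callable that applies them left to right."""

    def run(table: Table) -> Table:
        for step in steps:
            table = step(table)
        return table

    return run


def fill_na(column: str, value: Any, derived: str) -> Step:
    def step(table: Table) -> Table:
        filled = [value if v is None else v for v in table[column]]
        # Return a new table with the derived column added.
        return {**table, derived: filled}

    return step


def replace_substrings(column: str, replacement_map: Dict[str, str], derived: str) -> Step:
    def step(table: Table) -> Table:
        cleaned = []
        for text in table[column]:
            for old, new in replacement_map.items():
                text = text.replace(old, new)
            cleaned.append(text)
        return {**table, derived: cleaned}

    return step


pipeline = compose([
    fill_na("column_with_nan", 0, "column_filled"),
    replace_substrings("column_invalid_values", {",": ".", "°": ""}, "column_valid_values"),
])

result = pipeline({
    "column_with_nan": [1, None, 3],
    "column_invalid_values": ["36,6°", "37,1°"],
})
print(result["column_filled"])
print(result["column_valid_values"])
```

Each step writes to a derived column and leaves its input column intact, so the composed pipeline can be rerun on the same raw table without side effects.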

Integrated tools for synthetic data generation

PyTrousse aids automated testing by inverting the data transformation operators. Generation of testing fixtures and injection of errors are automatically available (more information here).
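One way to picture operator inversion is to run a substring replacement backwards so that clean values become dirty test fixtures. The sketch below assumes this reading of the feature and does not reflect how PyTrousse generates synthetic data: invert_replacement_map, inject_errors, and clean are hypothetical helpers written for this example.

```python
from typing import Dict, List


def invert_replacement_map(replacement_map: Dict[str, str]) -> Dict[str, str]:
    """Swap keys and values, skipping entries that cannot be undone."""
    # Replacing with an empty string deletes text, so it has no unique inverse.
    return {new: old for old, new in replacement_map.items() if new}


def clean(values: List[str], replacement_map: Dict[str, str]) -> List[str]:
    """Forward operator: replace invalid substrings with valid ones."""
    cleaned = []
    for text in values:
        for old, new in replacement_map.items():
            text = text.replace(old, new)
        cleaned.append(text)
    return cleaned


def inject_errors(clean_values: List[str], replacement_map: Dict[str, str]) -> List[str]:
    """Inverse operator: turn valid values into the invalid form the cleaner expects."""
    return clean(clean_values, invert_replacement_map(replacement_map))


replacement_map = {",": ".", "°": ""}
expected = ["36.6", "37.1"]

# Build a dirty fixture from known-good values, then check the round trip.
dirty = inject_errors(expected, replacement_map)
print(dirty)
print(clean(dirty, replacement_map) == expected)
```

Generating fixtures this way gives a test both the dirty input and the expected output, because the expected output is the data the fixture was derived from.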
