Feature: Add support for mixed-types datasets #168

Closed
arita37 opened this issue Oct 31, 2019 · 3 comments


@arita37 arita37 commented Oct 31, 2019

Raw data is a mix of
text columns, numerical columns, date columns, and category (string-based) columns.

Pre-processed data is numerical data with:
one-hot encoding of text columns
one-hot encoding of category data
one-hot encoding of numerical data (binning).

  1. The pipeline should manage the enlargement of the initial columns,
    and different types of pre-processing based on different data types.

  2. Keeping track of column name changes is useful (but complicated to handle).

The key part of AutoML is mostly the pre-processing.
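
To make the request above concrete, here is a minimal, dependency-free sketch of the idea: categorical columns are one-hot encoded, numerical columns are binned and then one-hot encoded, and a mapping from each source column to its generated columns is kept. All names (`one_hot_categories`, `one_hot_bins`, `preprocess`, the `name=value` and `name_binN` naming schemes) are hypothetical and chosen for illustration; this is not Neuraxle's API, and text and date columns are left out of the sketch.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot_categories(values: Sequence[str], name: str) -> Tuple[List[List[int]], List[str]]:
    """One-hot encode a string column; return the encoded rows and the new column names."""
    categories = sorted(set(values))
    names = [f"{name}={c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, names


def one_hot_bins(values: Sequence[float], name: str,
                 edges: Sequence[float]) -> Tuple[List[List[int]], List[str]]:
    """Bin a numerical column on sorted edges, then one-hot encode the bin index."""
    n_bins = len(edges) + 1
    names = [f"{name}_bin{i}" for i in range(n_bins)]
    rows = []
    for v in values:
        # The bin index is the number of edges that the value has reached.
        idx = sum(1 for e in edges if v >= e)
        rows.append([1 if i == idx else 0 for i in range(n_bins)])
    return rows, names


def preprocess(data: Dict[str, list], categorical: List[str],
               numerical: Dict[str, List[float]]):
    """Return (matrix, output column names, source column -> output columns)."""
    n_rows = len(next(iter(data.values())))
    matrix: List[List[int]] = [[] for _ in range(n_rows)]
    out_names: List[str] = []
    lineage: Dict[str, List[str]] = {}
    for col in data:
        if col in categorical:
            rows, names = one_hot_categories(data[col], col)
        elif col in numerical:
            rows, names = one_hot_bins(data[col], col, numerical[col])
        else:
            continue  # text and date columns are out of scope for this sketch
        for target, encoded in zip(matrix, rows):
            target.extend(encoded)
        lineage[col] = names
        out_names.extend(names)
    return matrix, out_names, lineage


data = {"city": ["paris", "rome", "paris"], "age": [15, 40, 70]}
matrix, names, lineage = preprocess(data, ["city"], {"age": [18, 65]})
```

The `lineage` dictionary is the part that addresses point 2: after the columns are enlarged, each original column can still be traced to the columns it produced.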

@guillaume-chevalier guillaume-chevalier changed the title Feature : Add documenation in mix-dataset Feature: Add support for mixed-types datasets Nov 1, 2019

Member

@guillaume-chevalier guillaume-chevalier commented Nov 1, 2019

@arita37 Thanks for your interest and ideas. We already coded EXACTLY THAT in another related project we have, and we will soon move the code that does what you suggest to Neuraxle. It's cool and motivating that you got this idea by yourself and that we already did it in parallel.

@alexbrillant Let's soon move our ColumnTransformer to Neuraxle! Ours has a better design than the one already provided by sklearn.

@brucelightyear Please notice the current issue 😃


Author

@arita37 arita37 commented Nov 2, 2019

Can you make sure that boilerplate is kept to a minimum?
This is quite a tricky one (code design).


Member

@guillaume-chevalier guillaume-chevalier commented Nov 7, 2019

@arita37 We fixed this issue by merging the following PR:
#184

Example usage:

        ColumnChooser([
            (range(0, 2), CyclicTimes()),
            (3, CategoricalEnum(categories_count=5, starts_at_zero=True)),
            (4, CategoricalEnum(categories_count=5, starts_at_zero=True)),
            ([10, 13, 15], CategoricalEnum(categories_count=5, starts_at_zero=True)),
        ], n_dimension=3)

Note that with n_dimension=3, it can process data of shape [batch_size, ?, n_features_columns], where "?" can be replaced by a sequence length for instance.

You could also use n_dimension=2 to process regular data of shape [batch_size, n_features_columns].

Higher dimension values are possible (e.g., n_dimension=4 to process image features).
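
To illustrate what the n_dimension argument means, here is a rough sketch using plain nested lists: the first n_dimension - 1 axes are walked through unchanged, and the column choices are applied to each innermost feature vector. The helpers `apply_on_last_axis` and `choose_columns` are hypothetical names for this illustration, and the transforms are simple stand-in functions; this is an assumption about the behavior described above, not the code merged in the PR.

```python
from typing import Callable, List, Sequence, Tuple, Union

ColumnSpec = Union[int, Sequence[int]]
Transform = Callable[[list], list]


def apply_on_last_axis(data, n_dimension: int, fn: Transform):
    """Recurse through the leading axes and apply fn to each innermost feature vector."""
    if n_dimension == 1:
        return fn(data)
    return [apply_on_last_axis(item, n_dimension - 1, fn) for item in data]


def choose_columns(data, choices: List[Tuple[ColumnSpec, Transform]], n_dimension: int):
    """Apply each (columns, transform) pair on the last axis and concatenate the results."""

    def transform_features(features: list) -> list:
        out = []
        for columns, transform in choices:
            # A single int selects one column; any other sequence selects several.
            idx = [columns] if isinstance(columns, int) else list(columns)
            out.extend(transform([features[i] for i in idx]))
        return out

    return apply_on_last_axis(data, n_dimension, transform_features)


def double(xs: list) -> list:
    return [2 * x for x in xs]


def negate(xs: list) -> list:
    return [-x for x in xs]


choices = [(range(0, 2), double), (2, negate)]

# Shape [batch_size, n_features_columns]
result_2d = choose_columns([[1, 2, 3], [4, 5, 6]], choices, n_dimension=2)

# Shape [batch_size, sequence_length, n_features_columns]
result_3d = choose_columns([[[1, 2, 3], [4, 5, 6]]], choices, n_dimension=3)
```

The same `choices` list works for both calls; only the number of leading axes to skip changes.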
