OneHotEncoder can accidentally create columns with same name #201

Closed
lars-reimann opened this issue Apr 17, 2023 · 4 comments · Fixed by #271

@lars-reimann
Member

Describe the bug

The OneHotEncoder uses the schema <old_column_name>_<value> to name the created columns. However, two different combinations of column name and value can map to the same name, which leads to conflicts.

To Reproduce

Run this program:

from safeds.data.tabular.containers import Table
from safeds.data.tabular.transformation import OneHotEncoder

if __name__ == '__main__':
    table = Table.from_dict({"a_b": ["c"], "a": ["b_c"]})
    transformed_table = OneHotEncoder().fit_and_transform(table)

    print(transformed_table)

It raises an exception:

ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements

The issue is that two columns with the same name (a_b_c) get created.
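
The collision can be shown without the library by applying the naming schema by hand. This is only a minimal sketch, not Safe-DS code; `naive_name` is a hypothetical helper that mimics the <old_column_name>_<value> schema.

```python
def naive_name(column_name: str, value: str) -> str:
    """Hypothetical helper mimicking the <old_column_name>_<value> schema."""
    return f"{column_name}_{value}"


# Same data as in the reproduction above.
data = {"a_b": ["c"], "a": ["b_c"]}

generated = [
    naive_name(column_name, value)
    for column_name, values in data.items()
    for value in values
]

print(generated)  # ['a_b_c', 'a_b_c']
```

Both pairs produce the same string, so the single underscore cannot tell the reader (or the encoder) where the column name ends and the value begins.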

Expected behavior

No exception. The names of all created columns should be unique and should not conflict with existing columns in the Table. This can be done by detecting conflicts, either between two created columns or between a created column and an existing, unchanged column, and appending a suffix _<counter> to the names of the created columns (e.g. a_b_c_1 vs. a_b_c_2).

Screenshots (optional)

No response

Additional Context (optional)

No response

@lars-reimann lars-reimann added the bug 🪲 Something isn't working label Apr 17, 2023
@lars-reimann lars-reimann changed the title from "OneHotEncoder can create columns with same name" to "OneHotEncoder can accidentally create columns with same name" Apr 17, 2023
@lars-reimann
Member Author

lars-reimann commented Apr 28, 2023

For better readability, I'd suggest using the schema <column_name>__<value>(#<counter>)? (e.g. color__blue or color__red#2) for the names of the columns created by the OneHotEncoder. Double underscores should be rare, so in many cases we won't even need a counter. It also makes it easier for users to figure out what is the column name and what is the value if either contains single underscores.

Only the names of duplicates need a counter. The first occurrence needn't be changed. Counting should start at two. Example:

  • color__red
  • color__red#2
  • color__red#3
  • ...
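
The proposed schema could be sketched roughly as follows. This is an illustration only, not the code that was later merged; the function name `unique_column_names` and its parameters are assumptions made for this example.

```python
def unique_column_names(pairs, existing_names=()):
    """Build <column_name>__<value> names, adding #<counter> only on conflict.

    Hypothetical helper: `pairs` is an iterable of (column_name, value)
    tuples, `existing_names` are names of columns that stay unchanged.
    """
    used = set(existing_names)
    result = []
    for column_name, value in pairs:
        base = f"{column_name}__{value}"
        candidate = base
        counter = 2  # the first occurrence keeps the plain name
        while candidate in used:
            candidate = f"{base}#{counter}"
            counter += 1
        used.add(candidate)
        result.append(candidate)
    return result


names = unique_column_names(
    [("color", "red"), ("a_", "b"), ("a", "_b")],
    existing_names=["color__red"],
)
print(names)  # ['color__red#2', 'a___b', 'a___b#2']
```

The example covers both kinds of conflict: one with an existing, unchanged column (color__red) and one between two created columns (a___b).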

@zzril
Contributor

zzril commented Apr 28, 2023

We decided that we will implement the OneHotEncoder ourselves instead of using the one from scikit-learn.

We should also add performance tests to verify that our implementation is as efficient as the one in scikit-learn. The tests should be performed on several large datasets.
(These tests do not need to be run by pytest automatically.)
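
A manually run performance check could look roughly like the sketch below. It is only an assumption about how such a test might be structured: `best_runtime` is a hypothetical helper, and the workload is a small synthetic stand-in, not one of the large datasets mentioned above. To compare two encoders, each would be wrapped in a zero-argument callable and timed the same way.

```python
import time
from typing import Callable


def best_runtime(function: Callable[[], object], repetitions: int = 5) -> float:
    """Return the shortest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        function()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Synthetic stand-in data: 10,000 values drawn from 50 categories.
values = [f"category_{i % 50}" for i in range(10_000)]
categories = sorted(set(values))


def pure_python_workload():
    # Naive one-hot rows, used here only as something to measure.
    return [[1 if v == c else 0 for c in categories] for v in values]


if __name__ == "__main__":
    print(f"best of 5: {best_runtime(pure_python_workload):.4f} s")
```

Taking the minimum over several repetitions reduces noise from other processes; keeping the script outside the pytest test directory is one way to make sure it is not collected automatically.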

@zzril zzril linked a pull request May 5, 2023 that will close this issue
lars-reimann pushed a commit that referenced this issue May 10, 2023
Closes #201.

### Summary of Changes

Changed OneHotEncoder to manually implement the encoding.
(Breaking) Changed the format of newly generated columns to use two
underscores as separator. In case of naming conflicts, a hash and a
unique ID will be appended to the column name.
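
As a rough illustration of what a manual encoding with this naming format involves, here is a minimal sketch on a plain dict of lists. It is not the implementation from the pull request; `one_hot_encode` and its parameters are hypothetical and Safe-DS `Table` objects are deliberately not used.

```python
def one_hot_encode(data: dict[str, list], column_names: list[str]) -> dict[str, list]:
    """Replace each listed column with one 0/1 column per distinct value."""
    # Names of unchanged columns are reserved so created names avoid them.
    used = {name for name in data if name not in column_names}
    result: dict[str, list] = {}
    for name, values in data.items():
        if name not in column_names:
            result[name] = list(values)
            continue
        # dict.fromkeys drops duplicates while keeping first-seen order.
        for value in dict.fromkeys(values):
            base = f"{name}__{value}"
            new_name, counter = base, 2
            while new_name in used:
                new_name = f"{base}#{counter}"
                counter += 1
            used.add(new_name)
            result[new_name] = [1 if v == value else 0 for v in values]
    return result


data = {"a_b": ["c", "d"], "a": ["b_c", "b_c"], "n": [1, 2]}
encoded = one_hot_encode(data, ["a_b", "a"])
print(encoded)
# {'a_b__c': [1, 0], 'a_b__d': [0, 1], 'a__b_c': [1, 1], 'n': [1, 2]}
```

With the double underscore, the column names from the original report (a_b with value c, a with value b_c) no longer collide, so no counter is needed in this case.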

---------

Co-authored-by: zzril <>
Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>
lars-reimann pushed a commit that referenced this issue May 11, 2023
## [0.12.0](v0.11.0...v0.12.0) (2023-05-11)

### Features

* add `learning_rate` to AdaBoost classifier and regressor. ([#251](#251)) ([7f74440](7f74440)), closes [#167](#167)
* add alpha parameter to `lasso_regression` ([#232](#232)) ([b5050b9](b5050b9)), closes [#163](#163)
* add parameter `lasso_ratio` to `ElasticNetRegression` ([#237](#237)) ([4a1a736](4a1a736)), closes [#166](#166)
* Add parameter `number_of_tree` to `RandomForest` classifier and regressor ([#230](#230)) ([414336a](414336a)), closes [#161](#161)
* Added `Table.plot_boxplots` to plot a boxplot for each numerical column in the table ([#254](#254)) ([0203a0c](0203a0c)), closes [#156](#156) [#239](#239)
* Added `Table.plot_histograms` to plot a histogram for each column in the table ([#252](#252)) ([e27d410](e27d410)), closes [#157](#157)
* Added `Table.transform_table` method which returns the transformed Table ([#229](#229)) ([0a9ce72](0a9ce72)), closes [#110](#110)
* Added alpha parameter to `RidgeRegression` ([#231](#231)) ([1ddc948](1ddc948)), closes [#164](#164)
* Added Column#transform ([#270](#270)) ([40fb756](40fb756)), closes [#255](#255)
* Added method `Table.inverse_transform_table` which returns the original table ([#227](#227)) ([846bf23](846bf23)), closes [#111](#111)
* Added parameter `c` to `SupportVectorMachines` ([#267](#267)) ([a88eb8b](a88eb8b)), closes [#169](#169)
* Added parameter `maximum_number_of_learner` and `learner` to `AdaBoost` ([#269](#269)) ([bb5a07e](bb5a07e)), closes [#171](#171) [#173](#173)
* Added parameter `number_of_trees` to `GradientBoosting` ([#268](#268)) ([766f2ff](766f2ff)), closes [#170](#170)
* Allow arguments of type pathlib.Path for file I/O methods ([#228](#228)) ([2b58c82](2b58c82)), closes [#146](#146)
* convert `Schema` to `dict` and format it nicely in a notebook ([#244](#244)) ([ad1cac5](ad1cac5)), closes [#151](#151)
* Convert between Excel file and `Table` ([#233](#233)) ([0d7a998](0d7a998)), closes [#138](#138) [#139](#139)
* convert containers for tabular data to HTML ([#243](#243)) ([683c279](683c279)), closes [#140](#140)
* make `Column` a subclass of `Sequence` ([#245](#245)) ([a35b943](a35b943))
* mark optional hyperparameters as keyword only ([#296](#296)) ([44a41eb](44a41eb)), closes [#278](#278)
* move exceptions back to common package ([#295](#295)) ([a91172c](a91172c)), closes [#177](#177) [#262](#262)
* precision metric for classification ([#272](#272)) ([5adadad](5adadad)), closes [#185](#185)
* Raise error if an untagged table is used instead of a `TaggedTable` ([#234](#234)) ([8eea3dd](8eea3dd)), closes [#192](#192)
* recall and F1-score metrics for classification ([#277](#277)) ([2cf93cc](2cf93cc)), closes [#187](#187) [#186](#186)
* replace prefix `n` with `number_of` ([#250](#250)) ([f4f44a6](f4f44a6)), closes [#171](#171)
* set `alpha` parameter for regularization of `ElasticNetRegression` ([#238](#238)) ([e642d1d](e642d1d)), closes [#165](#165)
* Set `column_names` in `fit` methods of table transformers to be required ([#225](#225)) ([2856296](2856296)), closes [#179](#179)
* set learning rate of Gradient Boosting models ([#253](#253)) ([9ffaf55](9ffaf55)), closes [#168](#168)
* Support vector machine for regression and for classification ([#236](#236)) ([7f6c3bd](7f6c3bd)), closes [#154](#154)
* usable constructor for `Table` ([#294](#294)) ([56a1fc4](56a1fc4)), closes [#266](#266)
* usable constructor for `TaggedTable` ([#299](#299)) ([01c3ad9](01c3ad9)), closes [#293](#293)

### Bug Fixes

* OneHotEncoder no longer creates duplicate column names ([#271](#271)) ([f604666](f604666)), closes [#201](#201)
* selectively ignore one warning instead of all warnings ([#235](#235)) ([3aad07d](3aad07d))
@lars-reimann
Member Author

🎉 This issue has been resolved in version 0.12.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@lars-reimann lars-reimann added the released Included in a release label May 11, 2023