fix: OneHotEncoder no longer creates duplicate column names #271

@zzril zzril commented May 5, 2023

Closes #201.

Summary of Changes

Changed OneHotEncoder to manually implement the encoding.
(Breaking) Changed the format of newly generated columns to use two underscores as separator. In case of naming conflicts, a hash and a unique ID will be appended to the column name.
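The naming scheme described above can be sketched as follows. This is only an illustration, not the code from this PR: the function name `one_hot_column_names` and its input format are hypothetical, and it assumes that "a hash and a unique ID" means a literal `#` character followed by a running counter.

```python
from __future__ import annotations


def one_hot_column_names(
    unique_values: dict[str, list[object]],
) -> dict[tuple[str, object], str]:
    """Map each (column, value) pair to a unique one-hot column name.

    Hypothetical helper: names are built as "<column>__<value>"; on a
    conflict, "#<counter>" is appended (assumed reading of the PR summary).
    """
    used: set[str] = set()
    result: dict[tuple[str, object], str] = {}
    for column, values in unique_values.items():
        for value in values:
            base_name = f"{column}__{value}"
            name = base_name
            counter = 1
            # Keep bumping the counter until the name is really unused.
            while name in used:
                counter += 1
                name = f"{base_name}#{counter}"
            used.add(name)
            result[(column, value)] = name
    return result


# Column "a" with value "b__c" and column "a__b" with value "c" would
# both map to "a__b__c" without the conflict handling.
names = one_hot_column_names({"a": ["b__c"], "a__b": ["c"]})
```

In this sketch the first pair keeps the plain name and the second one receives the suffix, so the resulting table never contains two columns with the same name.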

zzril and others added 4 commits April 28, 2023 12:03
also count number of occurrences of column names
Tests run through, column name format not yet as specified
(one "_" instead of two)
Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
@zzril zzril linked an issue May 5, 2023 that may be closed by this pull request

lars-reimann commented May 5, 2023

🦙 MegaLinter status: ✅ SUCCESS

| Descriptor | Linter | Files | Fixed | Errors | Elapsed time |
|---|---|---|---|---|---|
| ✅ PYTHON | black | 7 | 0 | 0 | 0.86s |
| ✅ PYTHON | mypy | 7 | | 0 | 1.8s |
| ✅ PYTHON | ruff | 7 | 0 | 0 | 0.05s |
| ✅ REPOSITORY | git_diff | yes | | no | 0.03s |

See detailed report in MegaLinter reports
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff

MegaLinter is graciously provided by OX Security

zzril and others added 4 commits May 5, 2023 13:38
Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
reverse_transform still missing
Still need to fix one test which checks the wrapped_encoder.
Still need to change single to double underscore and update tests accordingly.

Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
@zzril zzril changed the title 201 onehotencoder can accidentally create columns with same name Fix #201: onehotencoder can accidentally create columns with same name May 5, 2023
@zzril zzril changed the title Fix #201: onehotencoder can accidentally create columns with same name fix #201: onehotencoder can accidentally create columns with same name May 8, 2023
@zzril zzril changed the title fix #201: onehotencoder can accidentally create columns with same name fix #201: OneHotEncoder no longer creates duplicate column names May 9, 2023
@zzril zzril changed the title fix #201: OneHotEncoder no longer creates duplicate column names fix: OneHotEncoder no longer creates duplicate column names May 9, 2023

codecov bot commented May 9, 2023

Codecov Report

Merging #271 (07e6adc) into main (8db5914) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #271   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           43        43           
  Lines         1761      1786   +25     
=========================================
+ Hits          1761      1786   +25     
| Impacted Files | Coverage Δ |
|---|---|
| src/safeds/data/tabular/containers/_table.py | 100.00% <ø> (ø) |
| src/safeds/exceptions/__init__.py | 100.00% <ø> (ø) |
| ...ds/data/tabular/transformation/_one_hot_encoder.py | 100.00% <100.00%> (ø) |
| src/safeds/exceptions/_data.py | 100.00% <100.00%> (ø) |

@zzril zzril marked this pull request as ready for review May 9, 2023 14:13
@zzril zzril requested a review from a team as a code owner May 9, 2023 14:13

zzril commented May 9, 2023

Note that this breaks code that depends on the old column renaming schema (single underscore as separator). Not sure if the keyword in the PR message is enough for that.
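To illustrate what breaks, here is a minimal sketch of how downstream code that hard-codes generated column names might build a rename map. The helper `build_rename_map` is hypothetical, and the old format `<column>_<value>` is an assumption based on the "single underscore as separator" description above.

```python
from __future__ import annotations


def build_rename_map(pairs: list[tuple[str, object]]) -> dict[str, str]:
    """Map assumed old-style names to new-style names.

    Hypothetical helper: old names are assumed to be "<column>_<value>",
    new names "<column>__<value>". Conflict suffixes are not handled here.
    """
    return {
        f"{column}_{value}": f"{column}__{value}"
        for column, value in pairs
    }


# Code that previously looked up "color_red" would now need "color__red".
rename_map = build_rename_map([("color", "red"), ("color", "blue")])
```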

Also note that this PR does not yet include performance tests.

lars-reimann commented May 9, 2023

> Note that this breaks code that depends on the old column renaming schema (single underscore as separator). Not sure if the keyword in the PR message is enough for that.

While the version of this library is in the 0.y.z range, we don't need to pay much attention to breaking changes (see this). Still good to mention this as you've done.

lars-reimann commented May 9, 2023

Once #301 is implemented, we can shorten the implementation of transform and inverse_transform a little. But no need to wait for that.

@lars-reimann lars-reimann left a comment

Looks great, thanks!

@lars-reimann lars-reimann merged commit f604666 into main May 10, 2023
11 checks passed
@lars-reimann lars-reimann deleted the 201-onehotencoder-can-accidentally-create-columns-with-same-name branch May 10, 2023 18:19
lars-reimann pushed a commit that referenced this pull request May 11, 2023
## [0.12.0](v0.11.0...v0.12.0) (2023-05-11)

### Features

* add `learning_rate` to AdaBoost classifier and regressor. ([#251](#251)) ([7f74440](7f74440)), closes [#167](#167)
* add alpha parameter to `lasso_regression` ([#232](#232)) ([b5050b9](b5050b9)), closes [#163](#163)
* add parameter `lasso_ratio` to `ElasticNetRegression` ([#237](#237)) ([4a1a736](4a1a736)), closes [#166](#166)
* Add parameter `number_of_tree` to `RandomForest` classifier and regressor ([#230](#230)) ([414336a](414336a)), closes [#161](#161)
* Added `Table.plot_boxplots` to plot a boxplot for each numerical column in the table ([#254](#254)) ([0203a0c](0203a0c)), closes [#156](#156) [#239](#239)
* Added `Table.plot_histograms` to plot a histogram for each column in the table ([#252](#252)) ([e27d410](e27d410)), closes [#157](#157)
* Added `Table.transform_table` method which returns the transformed Table ([#229](#229)) ([0a9ce72](0a9ce72)), closes [#110](#110)
* Added alpha parameter to `RidgeRegression` ([#231](#231)) ([1ddc948](1ddc948)), closes [#164](#164)
* Added Column#transform ([#270](#270)) ([40fb756](40fb756)), closes [#255](#255)
* Added method `Table.inverse_transform_table` which returns the original table ([#227](#227)) ([846bf23](846bf23)), closes [#111](#111)
* Added parameter `c` to `SupportVectorMachines` ([#267](#267)) ([a88eb8b](a88eb8b)), closes [#169](#169)
* Added parameter `maximum_number_of_learner` and `learner` to `AdaBoost` ([#269](#269)) ([bb5a07e](bb5a07e)), closes [#171](#171) [#173](#173)
* Added parameter `number_of_trees` to `GradientBoosting` ([#268](#268)) ([766f2ff](766f2ff)), closes [#170](#170)
* Allow arguments of type pathlib.Path for file I/O methods ([#228](#228)) ([2b58c82](2b58c82)), closes [#146](#146)
* convert `Schema` to `dict` and format it nicely in a notebook ([#244](#244)) ([ad1cac5](ad1cac5)), closes [#151](#151)
* Convert between Excel file and `Table` ([#233](#233)) ([0d7a998](0d7a998)), closes [#138](#138) [#139](#139)
* convert containers for tabular data to HTML ([#243](#243)) ([683c279](683c279)), closes [#140](#140)
* make `Column` a subclass of `Sequence` ([#245](#245)) ([a35b943](a35b943))
* mark optional hyperparameters as keyword only ([#296](#296)) ([44a41eb](44a41eb)), closes [#278](#278)
* move exceptions back to common package ([#295](#295)) ([a91172c](a91172c)), closes [#177](#177) [#262](#262)
* precision metric for classification ([#272](#272)) ([5adadad](5adadad)), closes [#185](#185)
* Raise error if an untagged table is used instead of a `TaggedTable` ([#234](#234)) ([8eea3dd](8eea3dd)), closes [#192](#192)
* recall and F1-score metrics for classification ([#277](#277)) ([2cf93cc](2cf93cc)), closes [#187](#187) [#186](#186)
* replace prefix `n` with `number_of` ([#250](#250)) ([f4f44a6](f4f44a6)), closes [#171](#171)
* set `alpha` parameter for regularization of `ElasticNetRegression` ([#238](#238)) ([e642d1d](e642d1d)), closes [#165](#165)
* Set `column_names` in `fit` methods of table transformers to be required ([#225](#225)) ([2856296](2856296)), closes [#179](#179)
* set learning rate of Gradient Boosting models ([#253](#253)) ([9ffaf55](9ffaf55)), closes [#168](#168)
* Support vector machine for regression and for classification ([#236](#236)) ([7f6c3bd](7f6c3bd)), closes [#154](#154)
* usable constructor for `Table` ([#294](#294)) ([56a1fc4](56a1fc4)), closes [#266](#266)
* usable constructor for `TaggedTable` ([#299](#299)) ([01c3ad9](01c3ad9)), closes [#293](#293)

### Bug Fixes

* OneHotEncoder no longer creates duplicate column names ([#271](#271)) ([f604666](f604666)), closes [#201](#201)
* selectively ignore one warning instead of all warnings ([#235](#235)) ([3aad07d](3aad07d))
@lars-reimann

🎉 This PR is included in version 0.12.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@lars-reimann lars-reimann added the released Included in a release label May 11, 2023
Development

Successfully merging this pull request may close these issues.

OneHotEncoder can accidentally create columns with same name
3 participants