Primitive for normalizing (feature scaling) input data #82

Open
itinawi opened this issue Feb 1, 2019 · 2 comments
Labels
new primitives: A new primitive is being requested
Pending Review: The bug is not confirmed or the feature request is being considered

Comments

@itinawi
Contributor

itinawi commented Feb 1, 2019

I want to create a primitive for normalizing data (feature scaling) so that the input is rescaled to [-1, 1].

The formula for rescaling is
X_rescaled = 2 * (X - X.min) / (X.max - X.min) - 1

The primitive input arguments are:

  • data (pandas DataFrame)
  • column (string): the column to be rescaled
  • trim_percentage (float): percentage of values to trim from the bottom and the top
  • inplace (bool, default True): if False, creates a new column called rescaled_input
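A minimal sketch of how a primitive with these arguments could look. The function name `rescale`, the reading of trim_percentage as a fraction cut from each tail (0.01 means 1%), and the choice to clip trimmed values to the bounds are all assumptions made for illustration, not a settled design. It uses the [-1, 1] form of the min-max formula.

```python
import pandas as pd


def rescale(data, column, trim_percentage=0.0, inplace=True):
    """Rescale data[column] to [-1, 1] with min-max scaling.

    Assumption: trim_percentage is a fraction (0.01 == 1%) removed from
    each tail before the min and max are computed.
    """
    if not 0.0 <= trim_percentage < 0.5:
        raise ValueError("trim_percentage must be in [0, 0.5)")

    values = data[column].astype(float)

    # Bounds come from the trimmed distribution, not the raw extremes.
    low = values.quantile(trim_percentage)
    high = values.quantile(1.0 - trim_percentage)
    if high == low:
        raise ValueError("column is constant after trimming; cannot rescale")

    # Values outside [low, high] are clipped so the output stays in [-1, 1].
    rescaled = 2.0 * (values.clip(low, high) - low) / (high - low) - 1.0

    # Overwrite the column, or write to a new one when inplace is False.
    target = column if inplace else "rescaled_input"
    data[target] = rescaled
    return data


df = pd.DataFrame({"value": [0.0, 5.0, 10.0]})
print(rescale(df, "value", inplace=False))
```

With trim_percentage left at 0, the bounds are simply the column minimum and maximum, so the three sample values map to -1, 0 and 1.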

Here are some potential issues with the implementation:

  • Online use. The implementation will run on historic data. If this primitive were used in an online system, we would have to either rescale the dataset dynamically, or automatically flag values outside the [min, max] range seen so far as anomalies, or simply clip the input so that out-of-range values are assigned -1 or 1. The last option makes the most sense, in my opinion.

  • Distribution of the data. The data may contain an outlier (a very large value, e.g. 1234) while the remaining values lie somewhere between -10 and 10. Min-max rescaling would then squeeze almost all values into a narrow band. One solution is to trim the lowest and highest 1% of values, but what if those were outliers?
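The two concerns above can be illustrated with a short sketch. The helper names `fit_bounds` and `rescale_clipped` are hypothetical, and fitting the bounds once on historic data and then clipping new values is only one of the options listed, shown here under that assumption.

```python
import numpy as np


def fit_bounds(history, trim_fraction=0.0):
    """Return (low, high) from historic data, optionally trimming each tail."""
    low, high = np.quantile(history, [trim_fraction, 1.0 - trim_fraction])
    return float(low), float(high)


def rescale_clipped(values, low, high):
    """Map values to [-1, 1] using fixed bounds; out-of-range values saturate."""
    clipped = np.clip(np.asarray(values, dtype=float), low, high)
    return 2.0 * (clipped - low) / (high - low) - 1.0


history = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
low, high = fit_bounds(history)

# New values arriving after fitting: 1234 and -50 lie outside the historic
# range, so they saturate at 1 and -1 instead of escaping the interval.
online = rescale_clipped([0.0, 10.0, 1234.0, -50.0], low, high)
print(online)

# If the outlier is part of the data used to fit the bounds, the largest
# "normal" value (10) is pushed to roughly -0.97, next to the minimum.
skewed = np.append(history, 1234.0)
raw_low, raw_high = fit_bounds(skewed)
squeezed = rescale_clipped([10.0], raw_low, raw_high)
print(squeezed)
```

Clipping keeps the output bounded but hides how far a value fell outside the fitted range, so a system that also needs to flag such values would have to compare against the bounds before clipping.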

@itinawi itinawi added the new primitives label Feb 1, 2019
@csala csala unassigned kveerama and csala Feb 1, 2019
@csala
Contributor

csala commented Feb 1, 2019

This functionality seems to be mostly covered by scikit-learn's MinMaxScaler, so I think it would be better to integrate that instead.

The only thing that would really be new is the trimming of the values, but this could go in a dedicated primitive.

@itinawi if you agree with that, would you mind editing the title and description so that this issue is dedicated only to the trimming, and opening a separate issue for the MinMaxScaler integration?
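For reference, a minimal sketch of the scikit-learn route mentioned in this comment. It uses `MinMaxScaler` with `feature_range=(-1, 1)` to match the range requested in the issue; the sample data is made up, and how the scaler would be wrapped as a primitive is not shown.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Historic data: one feature column with values between -10 and 10.
X = np.array([[-10.0], [0.0], [10.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(X)
print(scaled.ravel())

# The fitted scaler does not trim: with these settings a later value far
# outside the fitted range maps well outside [-1, 1].
later = scaler.transform(np.array([[1234.0]]))
print(later.ravel())
```

This covers the rescaling itself, which is why only the trimming step would still need a dedicated primitive.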

@csala
Contributor

csala commented Feb 7, 2019

The MinMaxScaler will be covered in #94

This can be kept for the trimming.

@csala csala added the approved and Pending Review labels and removed the approved label Feb 25, 2019

3 participants