One important missing feature is the ability to index strings with discrete values (e.g. cities, product IDs, or names). If the data has a relatively high cardinality but is not ordinal (e.g. Torino > Milano feels right, but it is not meaningful), we can simply use a hash function. The problems are:
Define the formats that we support, not only by type but also by any other characteristics of the data that we should preserve and consider (like cardinality).
How do we define which columns are cardinal and which are nominal?
Does this change require modifying the CubeID/CubeIterator? Right now, when we have a number, we map it to (0, 1), then multiply it by a factor and round it (more or less). With a hashed value, however, this is no longer necessary. (Right, @alexeiakimov?)
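To make the difference concrete: a hashed nominal value can land directly in the unit interval, with no min-max scaling step. A minimal sketch (the object and method names are illustrative, not existing qbeast-spark API):

```scala
import scala.util.hashing.MurmurHash3

// Sketch: mapping a nominal string into [0, 1] via hashing.
// Unlike ordinal numbers, there is no min-max scaling: the hash
// itself already spreads values across the space.
object HashMapping {
  def hashToUnitInterval(value: String): Double = {
    // Mask the sign bit so the hash is non-negative, then normalize.
    val h = MurmurHash3.stringHash(value) & Int.MaxValue
    h.toDouble / Int.MaxValue.toDouble
  }
}
```

Note that the mapping is deterministic (the same city always lands on the same point) but carries no ordinal meaning, which is exactly the property we want for nominal columns.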
For the first problem, I can propose an alternative. Either we define the types (nominal/categorical, ordinal, ratio/percentage, etc.) and then define how to manage each of them, or we allow the user to define how to transform the value into an indexable column using a specific transformation.
Alternatively, we can allow the user to specify which implementation of the Transformation trait to use for each specific class. (By default, it should be LinearTransformation.)
It would also be possible to define something in between, using different symbols (e.g. city@Hash or city/nominal), but it might get too complex to be practical.
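As a sketch of what parsing that in-between syntax could look like (the spec format and all names here are hypothetical, assuming "@" as the separator and linear as the default):

```scala
// Sketch: parsing a user column spec such as "city@Hash" or "price".
// The transformation kinds and names are illustrative, not an existing API.
sealed trait TransformationKind
case object LinearKind extends TransformationKind
case object HashKind extends TransformationKind

final case class ColumnSpec(name: String, kind: TransformationKind)

object ColumnSpec {
  def parse(spec: String): ColumnSpec = spec.split('@') match {
    case Array(name)         => ColumnSpec(name, LinearKind) // default
    case Array(name, "Hash") => ColumnSpec(name, HashKind)
    case _ =>
      throw new IllegalArgumentException(s"Bad column spec: $spec")
  }
}
```

The complexity concern shows up quickly: every new transformation type adds a case to the parser and a symbol the user must remember.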
Talking with @alexeiakimov, we realized that there might be many different types of mapping, and having a specific separate implementation to manage hashed columns is probably not worthwhile.
I suggest we start by extending the Transformation trait in a way that ensures we can convert any type to Double (to then use in the CubeID). The Transformation implementation should also manage:
How to calculate the transformation parameters from a dataset (e.g. calculating min-max). This should be done in a way that makes it possible to compute it in a single pass for all columns.
How to map a specific type to Double
How to check whether a new revision needs to be created.
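The single-pass requirement above could look like a fold that accumulates per-column statistics in one traversal. A sketch, where rows and columns are simplified stand-ins for the real dataset API:

```scala
// Sketch: computing min-max for all numeric columns in a single pass.
final case class MinMax(min: Double, max: Double) {
  def update(v: Double): MinMax =
    MinMax(math.min(min, v), math.max(max, v))
}

object SinglePassStats {
  // One traversal of the rows updates every column's accumulator.
  def collect(rows: Seq[Map[String, Double]]): Map[String, MinMax] =
    rows.foldLeft(Map.empty[String, MinMax]) { (acc, row) =>
      row.foldLeft(acc) { case (a, (col, v)) =>
        a.updated(col, a.get(col).fold(MinMax(v, v))(_.update(v)))
      }
    }
}
```

In Spark this would naturally become a single aggregation with one min/max pair per column, rather than one scan per column.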
To do so, I would split the logic into two classes. ColumnTransformer: this is a class generated from the metadata, and it
keeps the name of the column of interest
keeps any configuration parameters
provides the methods required to calculate the statistics necessary to configure the Transformation
given the current transformation and the calculated statistics, computes a new transformation if necessary
Transformation
Stores the statistics necessary to configure the transformation (e.g. min-max, std-dev, etc.)
Transforms a value of the given type to Double
This is the class that gets serialized into the revision.
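The proposed split could be sketched as follows. All signatures here are assumptions for illustration (LinearTransformation exists in the project, but this shape of it, and the needsNewRevision helper, are hypothetical):

```scala
// Sketch of the two-class split described above.
// Transformation: serialized into the revision; maps a value to Double.
trait Transformation extends Serializable {
  def transform(value: Any): Double
}

// Linear min-max scaling into [0, 1]; the statistics are the parameters.
final case class LinearTransformation(min: Double, max: Double)
    extends Transformation {
  def transform(value: Any): Double = value match {
    case n: Number => (n.doubleValue() - min) / (max - min)
    case other =>
      throw new IllegalArgumentException(s"Not numeric: $other")
  }
}

// ColumnTransformer: built from the metadata; knows the column name and
// how to derive a (possibly new) Transformation from fresh statistics.
final case class ColumnTransformer(columnName: String) {
  def makeTransformation(min: Double, max: Double): Transformation =
    LinearTransformation(min, max)

  // A new revision is needed when values fall outside the current range.
  def needsNewRevision(current: LinearTransformation,
                       newMin: Double, newMax: Double): Boolean =
    newMin < current.min || newMax > current.max
}
```

Keeping the statistics-gathering logic in ColumnTransformer means only the small, immutable Transformation needs to be serialized into the revision.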