
Indexing categorical types #36

Closed
cugni opened this issue Oct 26, 2021 · 1 comment · Fixed by #39

cugni commented Oct 26, 2021

One important missing feature is the ability to index strings with discrete values (e.g. cities, products_id, or names). If the data has relatively high cardinality but is not ordinal (e.g. Torino > Milano reads naturally but is not meaningful), we can simply use a hash function. The open problems are:

  • Define the formats we support: not only the types, but also any other characteristics of the data that we should preserve and consider (such as cardinality).
  • How do we declare which columns are ordinal/nominal?
  • Does this change require modifying the CubeID/CubeIterator? Right now, when we have a number, we map it to (0,1), multiply it by a factor, and round it (more or less). With a hashed value, however, this mapping is no longer necessary. (Right, @alexeiakimov?)

For the first problem, I can propose two alternatives: either we define the types (nominal/categorical, ordinal, ratio/percentage, etc.) and then define how to manage each of them, or we allow the user to define how to transform the values into an indexable column using a specific transformation.

The first case would look like:

parquetDf.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", "ss_cdemo_sk:ordinal,ss_cdemo_sk:ordinal,city:nominal")
    .save(qbeastTablePath)

Alternatively, we can allow the user to specify which implementation of the Transformation trait to use for each specific column (by default, it should be LinearTransformation):

parquetDf.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", "ss_cdemo_sk:Linear,ss_cdemo_sk:Gaussian,city:Hash")
    .save(qbeastTablePath)

It would also be possible to define something in between, using different symbols (e.g. city@Hash or city/nominal), but it might get too complex to be practical.
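Whichever syntax we pick, parsing the columnsToIndex value could follow a sketch like this. The helper parseColumnsToIndex is hypothetical, assuming ':' as the separator and a linear default when no transformation is given:

```scala
// Hypothetical sketch of parsing the "columnsToIndex" option value,
// assuming ':' as the separator and a linear default.
case class ColumnSpec(name: String, transformation: String)

def parseColumnsToIndex(option: String): Seq[ColumnSpec] =
  option.split(',').toSeq.map { entry =>
    entry.trim.split(':') match {
      case Array(name, kind) => ColumnSpec(name, kind)
      case Array(name)       => ColumnSpec(name, "linear") // default mapping
    }
  }

// parseColumnsToIndex("ss_cdemo_sk:ordinal,city:nominal")
// → Seq(ColumnSpec("ss_cdemo_sk", "ordinal"), ColumnSpec("city", "nominal"))
```

Keeping the default implicit means existing option strings without a transformation suffix stay valid.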

@cugni cugni added the type: enhancement New feature or request label Oct 26, 2021
@cugni cugni self-assigned this Oct 26, 2021

cugni commented Oct 26, 2021

Talking with @alexeiakimov, we realized that there may be many different kinds of mappings, and having a separate implementation just to manage the hashed columns is probably not worthwhile.

I suggest we start by extending the Transformation trait so that it guarantees the conversion of any type to Double (to be used then in the CubeID). The Transformation implementation should also manage:

  1. How to calculate the transformation parameters from a dataset (e.g. calculating min-max). This should be done in a way that makes it possible to compute it in a single pass over all columns.
  2. How to map a specific type to Double.
  3. Whether a new revision needs to be created.

In order to do so, I would split the logic into two classes:
ColumnTransformer: this class is generated from the metadata, and it keeps:

  • the name of the column of interest
  • any configuration parameters
  • the methods required to calculate the statistics necessary to configure the Transformation
  • given the current transformation and the calculated statistics, the logic to compute a new transformation if necessary.

Transformation

  • stores the statistics necessary to configure the transformation (e.g. min-max, std-dev, etc.)
  • transforms a value of its type to Double
  • is the class that gets serialized into the revision.
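The split described above could look roughly like the following sketch. All names and signatures here are illustrative assumptions, not the final qbeast-spark API; a toy LinearTransformation is included only to show how the pieces fit:

```scala
// Hypothetical sketch of the proposed two-class split; names are illustrative.

/** Simplified placeholder for per-column statistics computed in one pass. */
case class ColumnStats(min: Double, max: Double)

/** Serialized into the revision: holds the statistics and maps values to Double. */
trait Transformation extends Serializable {
  /** Map a value of the column's type into [0, 1) for use in the CubeID. */
  def transform(value: Any): Double
  /** Whether the given statistics fall outside this transformation's domain,
    * so a new revision (with a new Transformation) would be required. */
  def isOutdated(stats: ColumnStats): Boolean
}

/** Built from the metadata: knows which column it watches and how to turn
  * freshly computed statistics into a (possibly new) Transformation. */
trait ColumnTransformer {
  def columnName: String
  def makeTransformation(stats: ColumnStats): Transformation
}

/** Toy numeric implementation: min-max scaling into [0, 1). */
case class LinearTransformation(min: Double, max: Double) extends Transformation {
  def transform(value: Any): Double = {
    val v = value.asInstanceOf[Number].doubleValue()
    (v - min) / (max - min)
  }
  def isOutdated(stats: ColumnStats): Boolean =
    stats.min < min || stats.max > max
}
```

With this shape, the ColumnTransformer stays on the write path (computing statistics), while only the small, serializable Transformation ends up stored in the revision.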

@osopardo1 osopardo1 added this to the No Bug November milestone Nov 16, 2021
@osopardo1 osopardo1 linked a pull request Nov 24, 2021 that will close this issue
10 tasks
@cugni cugni closed this as completed in #39 Dec 7, 2021