
Indexing categorical types #36

Closed
cugni opened this issue Oct 26, 2021 · 1 comment · Fixed by #39

cugni commented Oct 26, 2021

One important missing feature is the ability to index strings with discrete values (e.g. cities, products_id, or names). If the data has relatively high cardinality but is not ordinal (e.g. Torino > Milano reads naturally but is not meaningful), we can simply use a hash function. The open problems are:

  • Define the formats we support: not only the types, but also any other characteristics of the data that we should preserve and consider (such as cardinality).
  • How do we declare which columns are ordinal/nominal?
  • Does this change require modifying the CubeID/CubeIterator? Right now, when we have a number, we map it to (0,1), multiply it by a factor, and round it (more or less). With a hashed value, however, this mapping is no longer necessary. (Right, @alexeiakimov?)

For the first problem, I can propose two alternatives: either we define the types (nominal/categorical, ordinal, ratio/percentage, etc.) and then define how to manage each of them, or we allow the user to define how to transform the values into an indexable column using a specific transformation.

The first case would look like:

parquetDf.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", "ss_cdemo_sk:ordinal,ss_cdemo_sk:ordinal,city:nominal")
    .save(qbeastTablePath)

Alternatively, we can allow the user to specify which implementation of the Transformation trait to use for each specific column (by default, it should be LinearTransformation):

parquetDf.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", "ss_cdemo_sk:Linear,ss_cdemo_sk:Gaussian,city:Hash")
    .save(qbeastTablePath)

It would also be possible to define something in between, using different symbols (e.g. city@Hash or city/nominal), but it might get too complex to be practical.
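Whichever syntax we pick, parsing the columnsToIndex value could follow a sketch like this. The helper parseColumnsToIndex is hypothetical, assuming ':' as the separator and a linear default when no transformation is given:

```scala
// Hypothetical sketch of parsing the "columnsToIndex" option value,
// assuming ':' as the separator and a linear default.
case class ColumnSpec(name: String, transformation: String)

def parseColumnsToIndex(option: String): Seq[ColumnSpec] =
  option.split(',').toSeq.map { entry =>
    entry.trim.split(':') match {
      case Array(name, kind) => ColumnSpec(name, kind)
      case Array(name)       => ColumnSpec(name, "linear") // default mapping
    }
  }

// parseColumnsToIndex("ss_cdemo_sk:ordinal,city:nominal")
// → Seq(ColumnSpec("ss_cdemo_sk", "ordinal"), ColumnSpec("city", "nominal"))
```

Keeping the default implicit means existing option strings without a transformation suffix stay valid.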

@cugni cugni added the type: enhancement New feature or request label Oct 26, 2021
@cugni cugni self-assigned this Oct 26, 2021

cugni commented Oct 26, 2021

Talking with @alexeiakimov, we realized that there may be many different kinds of mappings, and having a separate implementation just to manage the hashed columns is probably not worthwhile.

I suggest we start by extending the Transformation trait so that it guarantees the conversion of any type to Double (to be used then in the CubeID). The Transformation implementation should also manage:

  1. How to calculate the transformation parameters from a dataset (e.g. calculating min-max). This should be done in a way that makes it possible to compute it in a single pass over all columns.
  2. How to map a specific type to Double.
  3. Whether a new revision needs to be created.

In order to do so, I would split the logic into two classes:
ColumnTransformer: this class is generated from the metadata, and it keeps:

  • the name of the column of interest
  • any configuration parameters
  • the methods required to calculate the statistics necessary to configure the Transformation
  • given the current transformation and the calculated statistics, the logic to compute a new transformation if necessary.

Transformation

  • stores the statistics necessary to configure the transformation (e.g. min-max, std-dev, etc.)
  • transforms a value of its type to Double
  • is the class that gets serialized into the revision.
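The split described above could look roughly like the following sketch. All names and signatures here are illustrative assumptions, not the final qbeast-spark API; a toy LinearTransformation is included only to show how the pieces fit:

```scala
// Hypothetical sketch of the proposed two-class split; names are illustrative.

/** Simplified placeholder for per-column statistics computed in one pass. */
case class ColumnStats(min: Double, max: Double)

/** Serialized into the revision: holds the statistics and maps values to Double. */
trait Transformation extends Serializable {
  /** Map a value of the column's type into [0, 1) for use in the CubeID. */
  def transform(value: Any): Double
  /** Whether the given statistics fall outside this transformation's domain,
    * so a new revision (with a new Transformation) would be required. */
  def isOutdated(stats: ColumnStats): Boolean
}

/** Built from the metadata: knows which column it watches and how to turn
  * freshly computed statistics into a (possibly new) Transformation. */
trait ColumnTransformer {
  def columnName: String
  def makeTransformation(stats: ColumnStats): Transformation
}

/** Toy numeric implementation: min-max scaling into [0, 1). */
case class LinearTransformation(min: Double, max: Double) extends Transformation {
  def transform(value: Any): Double = {
    val v = value.asInstanceOf[Number].doubleValue()
    (v - min) / (max - min)
  }
  def isOutdated(stats: ColumnStats): Boolean =
    stats.min < min || stats.max > max
}
```

With this shape, the ColumnTransformer stays on the write path (computing statistics), while only the small, serializable Transformation ends up stored in the revision.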

@osopardo1 osopardo1 added this to the No Bug November milestone Nov 16, 2021
@osopardo1 osopardo1 linked a pull request Nov 24, 2021 that will close this issue
10 tasks
@cugni cugni closed this as completed in #39 Dec 7, 2021