
Does sampling 5000 rows from a dataset lead to consistent ppscore matrices? #19

Open
Dyex719 opened this issue Jun 2, 2020 · 3 comments


Dyex719 commented Jun 2, 2020

In the README it is mentioned:

In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (ppscore.RANDOM_SEED). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar.

Which datasets was this claim tested on?
It seems highly unlikely that sampling 5,000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
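For context, the sampling described in the README can be enlarged or skipped entirely through the API. A minimal sketch, assuming the `sample` keyword that recent ppscore versions accept on `matrix` (and forward to `score`):

```python
# Minimal sketch of controlling the sampling via the API. The `sample`
# keyword is an assumption based on the README; passing None is assumed
# to disable sampling, and a larger number to enlarge the sample.
import pandas as pd
import ppscore as pps

df = pd.read_csv("my_large_dataset.csv")  # hypothetical dataset with millions of rows

matrix_default = pps.matrix(df)                 # default: random subset of 5,000 rows
matrix_full = pps.matrix(df, sample=None)       # skip the sampling, use every row
matrix_bigger = pps.matrix(df, sample=50_000)   # or simply enlarge the sample
```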

8080labs (Owner) commented Jun 2, 2020

In the future we will provide more in-depth documentation (and a paper) covering all the tests and their results.

In short, here are the thoughts and observations:

  • General note 1: the sampling is a heuristic to reduce computation time, and it has drawbacks. There are other methods for reducing computation time as well.

  • General note 2: a single ppscore calculation only takes two columns into account, and there are only so many patterns that can exist between two columns.

  • If you have two numeric columns, it often does not matter much how many rows you have; a sample of 5,000 is already plenty. That said, there are edge cases where this is not true.

  • If you have categoric columns with many unique values (say, more than 500), then there is a good chance that the sampling becomes a problem. And if your dataset has millions of rows, there is a good chance that some columns have more than 500 unique categoric values. However, many categoric distributions are highly skewed, and in that case there is hardly any problem.

Which observations have you made so far? Would you like to share some of them?
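One way to check this on a concrete dataset is to recompute the matrix on several different random subsamples and look at how far each score moves. A hedged sketch, assuming a ppscore version whose `matrix` returns a long-format dataframe with `x`, `y` and `ppscore` columns and forwards `sample`/`random_seed` to `score`:

```python
# Hedged sketch: measure how much each ppscore moves when the 5,000-row
# sample changes. Assumes pps.matrix returns a long-format dataframe with
# "x", "y" and "ppscore" columns and forwards sample/random_seed to score.
import pandas as pd
import ppscore as pps

def sampling_spread(df, n_runs=5, sample=5_000):
    """Max-min spread of each pairwise ppscore across repeated random samples."""
    runs = []
    for seed in range(n_runs):
        m = pps.matrix(df, sample=sample, random_seed=seed)
        runs.append(m.set_index(["x", "y"])["ppscore"])
    scores = pd.concat(runs, axis=1)
    return (scores.max(axis=1) - scores.min(axis=1)).sort_values(ascending=False)
```

Column pairs with a large spread are the ones where the 5,000-row heuristic actually hurts.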

Dyex719 (Author) commented Jun 6, 2020

I agree with what you said.
In my dataset I saw differences of ~0.5 in some scores, which is huge.
These were mostly in columns with a high number of unique values, but I will need to dig deeper to see whether this is always the case.

I think it would be useful to surface this information, maybe as a warning for columns that have a large number of unique values.
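A rough sketch of what such a warning could look like; the 500-unique-values threshold comes from the discussion above, not from ppscore itself:

```python
# Illustrative check, not part of ppscore: warn before computing scores when
# a categoric column has more unique values than a fixed sample can cover well.
import warnings
import pandas as pd

HIGH_CARDINALITY_THRESHOLD = 500  # threshold taken from this thread

def warn_on_high_cardinality(df: pd.DataFrame, sample: int = 5_000) -> None:
    for col in df.select_dtypes(include=["object", "category"]).columns:
        n_unique = df[col].nunique()
        if n_unique > HIGH_CARDINALITY_THRESHOLD:
            warnings.warn(
                f"Column '{col}' has {n_unique} unique values; "
                f"ppscores based on a sample of {sample} rows may be unstable."
            )
```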

FlorianWetschoreck (Collaborator) commented

It would be great if you could share some of the datasets where you observed those huge differences; it most likely has to do with the high number of unique values.

And I agree that it would be good to show a warning when a large number of unique values threatens a valid calculation result.
We could also push this even further and automatically adjust the sample size based on the distribution of the target/feature values.
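A sketch of what that automatic adjustment could look like; the rows-per-category heuristic is purely illustrative and not existing ppscore behaviour:

```python
# Illustrative heuristic: grow the sample so that every category of the
# given column is expected to appear a minimum number of times on average.
import pandas as pd

def suggest_sample_size(df: pd.DataFrame, column: str,
                        base_sample: int = 5_000,
                        min_rows_per_category: int = 20) -> int:
    needed = df[column].nunique() * min_rows_per_category
    return min(len(df), max(base_sample, needed))
```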
