
Does sampling 5000 rows from a dataset lead to consistent ppscore matrices? #19

Open
Dyex719 opened this issue Jun 2, 2020 · 3 comments


Dyex719 commented Jun 2, 2020

In the README it is mentioned:

In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (ppscore.RANDOM_SEED). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar.

Which datasets was this claim tested on?
It seems highly unlikely that sampling 5,000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
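For context, the sampling described in the README can be enlarged or skipped entirely through the API. A minimal sketch, assuming the `sample` keyword that recent ppscore versions accept on `matrix` (and forward to `score`):

```python
# Minimal sketch of controlling the sampling via the API. The `sample`
# keyword is an assumption based on the README; passing None is assumed
# to disable sampling, and a larger number to enlarge the sample.
import pandas as pd
import ppscore as pps

df = pd.read_csv("my_large_dataset.csv")  # hypothetical dataset with millions of rows

matrix_default = pps.matrix(df)                 # default: random subset of 5,000 rows
matrix_full = pps.matrix(df, sample=None)       # skip the sampling, use every row
matrix_bigger = pps.matrix(df, sample=50_000)   # or simply enlarge the sample
```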

8080labs (Owner) commented Jun 2, 2020

In the future we will provide more in-depth documentation (and a paper) covering all the tests and their results.

In short, here are the thoughts and observations:

  • General note 1: the sampling is a heuristic to reduce computation time, and it has drawbacks. There are other methods for reducing computation time as well.

  • General note 2: a single ppscore calculation only takes two columns into account, and there are only so many patterns that can exist between two columns.

  • If you have two numeric columns, it often does not matter much how many rows you have; a sample of 5,000 is already plenty. That said, there are edge cases where this is not true.

  • If you have categoric columns with many unique values (say, more than 500), then there is a good chance that the sampling becomes a problem. And if your dataset has millions of rows, there is a good chance that some columns have more than 500 unique categoric values. However, many categoric distributions are highly skewed, and in that case there is hardly any problem.

Which observations have you made so far? Would you like to share some of them?
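One way to check this on a concrete dataset is to recompute the matrix on several different random subsamples and look at how far each score moves. A hedged sketch, assuming a ppscore version whose `matrix` returns a long-format dataframe with `x`, `y` and `ppscore` columns and forwards `sample`/`random_seed` to `score`:

```python
# Hedged sketch: measure how much each ppscore moves when the 5,000-row
# sample changes. Assumes pps.matrix returns a long-format dataframe with
# "x", "y" and "ppscore" columns and forwards sample/random_seed to score.
import pandas as pd
import ppscore as pps

def sampling_spread(df, n_runs=5, sample=5_000):
    """Max-min spread of each pairwise ppscore across repeated random samples."""
    runs = []
    for seed in range(n_runs):
        m = pps.matrix(df, sample=sample, random_seed=seed)
        runs.append(m.set_index(["x", "y"])["ppscore"])
    scores = pd.concat(runs, axis=1)
    return (scores.max(axis=1) - scores.min(axis=1)).sort_values(ascending=False)
```

Column pairs with a large spread are the ones where the 5,000-row heuristic actually hurts.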

Dyex719 (Author) commented Jun 6, 2020

I agree with what you said.
In my dataset I saw differences of ~0.5 in some scores, which is huge.
These were mostly in columns with a high number of unique values, but I will need to dig deeper to see whether this is always the case.

I think it would be useful to surface this information, maybe as a warning for columns that have a large number of unique values.
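A rough sketch of what such a warning could look like; the 500-unique-values threshold comes from the discussion above, not from ppscore itself:

```python
# Illustrative check, not part of ppscore: warn before computing scores when
# a categoric column has more unique values than a fixed sample can cover well.
import warnings
import pandas as pd

HIGH_CARDINALITY_THRESHOLD = 500  # threshold taken from this thread

def warn_on_high_cardinality(df: pd.DataFrame, sample: int = 5_000) -> None:
    for col in df.select_dtypes(include=["object", "category"]).columns:
        n_unique = df[col].nunique()
        if n_unique > HIGH_CARDINALITY_THRESHOLD:
            warnings.warn(
                f"Column '{col}' has {n_unique} unique values; "
                f"ppscores based on a sample of {sample} rows may be unstable."
            )
```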

FlorianWetschoreck (Collaborator) commented

It would be great if you could share some of the datasets where you observed those huge differences; it most likely has to do with the high number of unique values.

And I agree that it would be good to show a warning when a large number of unique values threatens a valid calculation result.
We could also push this even further and automatically adjust the sample size based on the distribution of the target/feature values.
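A sketch of what that automatic adjustment could look like; the rows-per-category heuristic is purely illustrative and not existing ppscore behaviour:

```python
# Illustrative heuristic: grow the sample so that every category of the
# given column is expected to appear a minimum number of times on average.
import pandas as pd

def suggest_sample_size(df: pd.DataFrame, column: str,
                        base_sample: int = 5_000,
                        min_rows_per_category: int = 20) -> int:
    needed = df[column].nunique() * min_rows_per_category
    return min(len(df), max(base_sample, needed))
```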
