Does sampling 5000 rows from a dataset lead to consistent ppscore matrices? #19
Comments
In the future, we will provide more in-depth documentation (and a paper) covering all the tests and their results. In short, here are our thoughts and observations:
What observations have you made so far? Would you like to share some of them?
I agree with what you said. I think it would be useful to provide this information, perhaps as a warning for columns that have a large number of unique values.
It would be great if you could share some of the datasets where you observed those huge differences. They most likely stem from a high number of unique values. I also agree that it would be good to show a warning when a large number of unique values threatens the validity of the calculation.
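To make the warning idea concrete, here is a minimal sketch of what such a check could look like. It is not part of ppscore: the helper name `warn_if_high_cardinality` and the 0.5 threshold are assumptions chosen purely for illustration.

```python
import warnings


def warn_if_high_cardinality(values, column_name, max_unique_ratio=0.5):
    """Warn when the share of distinct values in a column is above a threshold.

    Hypothetical helper: the name and the default threshold are illustrative
    assumptions, not ppscore behavior.
    Returns True if a warning was issued, False otherwise.
    """
    values = list(values)
    if not values:
        return False  # nothing to check in an empty column

    unique_ratio = len(set(values)) / len(values)
    if unique_ratio > max_unique_ratio:
        warnings.warn(
            f"Column '{column_name}' has {unique_ratio:.0%} unique values; "
            "scores computed from a sample may be unreliable.",
            UserWarning,
        )
        return True
    return False
```

A check like this could run once per column before scoring. A sensible threshold would have to be derived from actual experiments.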
In the readme it is mentioned,
Which datasets was this claim tested on?
It seems highly unlikely that sampling 5000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
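One way to check this empirically is to draw several independent 5000-row samples and look at how much a score varies between them. The sketch below assumes the rows are held in a list and that `score_fn` is any function mapping a list of sampled rows to a single number. Both `score_spread` and `score_fn` are hypothetical names, and no ppscore call is made here.

```python
import random
import statistics


def score_spread(rows, score_fn, sample_size=5000, repeats=10, seed=0):
    """Evaluate score_fn on repeated random samples and summarize the spread.

    Hypothetical helper for illustration; score_fn stands in for whatever
    score is being examined.
    """
    rng = random.Random(seed)            # fixed seed so the check is repeatable
    n = min(sample_size, len(rows))      # cannot sample more rows than exist
    scores = [score_fn(rng.sample(rows, n)) for _ in range(repeats)]
    return {
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.pstdev(scores),
    }
```

A small gap between `min` and `max` over many repeats would support the consistency claim for that dataset, while a large gap would point to the kind of instability described above.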