Time required for sample_all function? #118

Closed
clj2567 opened this issue Sep 23, 2019 · 6 comments

clj2567 commented Sep 23, 2019

The sample_all function never returns anything; it seems to keep running in a loop. Is there any solution for this? In one instance it ran for 2 hours and still produced no output.

JDTheRipperPC (Contributor) commented

Thank you for reporting this @clj2567.

To be able to help you, we need some additional information about the data and metadata you are trying to sample. Could you also provide the code snippet that you are using to sample?

clj2567 (Author) commented Sep 24, 2019

The data is the dataset referenced in the paper: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data

This is the metadata file that I am using: meta_sample.txt

Here is the code snippet:

import os

from sdv import CSVDataLoader
from sdv import Modeler
from sdv import Sampler

# Output directory. The snippet as posted did not define artifact_path,
# so this value is a placeholder.
artifact_path = 'artifacts'
os.makedirs(artifact_path, exist_ok=True)

# Load the tables described by the metadata file and fit the model.
data_loader = CSVDataLoader('data/meta_sample.json')
data_navigator = data_loader.load_data()
transformed_data = data_navigator.transform_data()
modeler = Modeler(data_navigator)
modeler.model_database()

# Sample each table in turn and write the result to CSV.
sampler = Sampler(data_navigator, modeler)
sampled_all = sampler.sample_rows('Countries', 10)
sampled_all.to_csv(artifact_path + '/countries_sampled.csv', index=False)
print(sampled_all)
sampled_all = sampler.sample_rows('Users', 1000)
sampled_all.to_csv(artifact_path + '/users_sampled.csv', index=False)
print(sampled_all)
sampled_all = sampler.sample_rows('Sessions', 1000)
sampled_all.to_csv(artifact_path + '/sessions_sampled.csv', index=False)
print(sampled_all)
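
To narrow down which of the calls above stalls, each one could be wrapped in a small timing helper. This is only a sketch: `timed` and `fake_sample_rows` are hypothetical names and not part of SDV, and the stand-in function exists solely so the example runs without SDV installed.

```python
import time


def timed(label, func, *args, **kwargs):
    """Announce the call, run func(*args, **kwargs), report the elapsed time."""
    # Printing before the call means the last "started" line identifies
    # the call that is stuck if it never returns.
    print(f'{label}: started', flush=True)
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f'{label}: finished in {elapsed:.2f} s', flush=True)
    return result


# Hypothetical stand-in for a sampling call.
def fake_sample_rows(table_name, num_rows):
    return [table_name] * num_rows


rows = timed('Countries', fake_sample_rows, 'Countries', 10)
```

With the snippet above, the equivalent call would be `timed('Users', sampler.sample_rows, 'Users', 1000)`.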

csala (Contributor) commented Oct 17, 2019

Hi @clj2567

There is a bug in the implementation of categorical value sampling that causes this behavior: the time it takes to sample increases exponentially with the number of categorical columns found in the dataset.

The fix for this is included in the issue_120_compatibility_with_rdt_issue72 branch and will be released soon.
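
Since the slowdown depends on the number of categorical columns, a quick way to see how exposed a table is would be to list its non-numeric columns. The following is a rough sketch using pandas, not anything from SDV: the helper name `categorical_cardinalities` and the sample table are made up for illustration, and "non-numeric" is used here only as a proxy for "categorical".

```python
import pandas as pd


def categorical_cardinalities(df):
    """Map each non-numeric column to its number of distinct values."""
    return {
        column: int(df[column].nunique())
        for column in df.columns
        if not pd.api.types.is_numeric_dtype(df[column])
    }


# Tiny made-up table; real data would come from pd.read_csv(...).
users = pd.DataFrame({
    'age': [25, 31, 47, 31],
    'gender': ['F', 'M', 'F', 'M'],
    'language': ['en', 'fr', 'en', 'de'],
})

summary = categorical_cardinalities(users)
print(summary)
```

The length of the returned dictionary is the count of categorical columns for that table.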

csala added the bug (Something isn't working) label Oct 17, 2019
csala added this to the 0.2.0 milestone Oct 17, 2019

clj2567 (Author) commented Oct 17, 2019

Hi @csala,

Thanks for the update. Is there any timeline around this?

csala (Contributor) commented Oct 21, 2019

> Thanks for the update. Is there any timeline around this?

Yes. This will most likely be released this week.

JDTheRipperPC (Contributor) commented

This should have been fixed in PR #121

JonathanDZiegler pushed a commit to JonathanDZiegler/SDV that referenced this issue Feb 7, 2022
* Bump version: 0.3.1.dev0 → 0.3.1.dev1

* Validates discrete columns

* Fix lint