Time required for sample_all function? #118

clj2567 · 2019-09-23T19:03:18Z

The sample_all function never returns anything; it just keeps running in a loop. Is there any solution for this? In one instance it ran for 2 hours and still produced no output.

JDTheRipperPC · 2019-09-24T09:01:51Z

Thank you for reporting this @clj2567.

To be able to help you, we need some additional information about the data and metadata you are trying to sample. Also, could you provide us with the code snippet you are using to sample?

clj2567 · 2019-09-24T11:46:28Z

The data is the dataset referenced in the paper: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data

This is the metadata file that I am using: meta_sample.txt

Here is the code snippet:

from sdv import CSVDataLoader
from sdv import Modeler
from sdv import Sampler

# Output directory for the sampled CSVs. It was not defined in the
# original snippet, so this value is only a placeholder.
artifact_path = 'artifacts'

# Load the tables described by the metadata file and fit the model.
data_loader = CSVDataLoader('data/meta_sample.json')
data_navigator = data_loader.load_data()
transformed_data = data_navigator.transform_data()
modeler = Modeler(data_navigator)
modeler.model_database()

# Sample each table separately and write the result to disk.
sampler = Sampler(data_navigator, modeler)
sampled_all = sampler.sample_rows('Countries', 10)
sampled_all.to_csv(artifact_path + '/countries_sampled.csv', index=False)
print(sampled_all)
sampled_all = sampler.sample_rows('Users', 1000)
sampled_all.to_csv(artifact_path + '/users_sampled.csv', index=False)
print(sampled_all)
sampled_all = sampler.sample_rows('Sessions', 1000)
sampled_all.to_csv(artifact_path + '/sessions_sampled.csv', index=False)
print(sampled_all)

csala · 2019-10-17T12:00:06Z

Hi @clj2567

There is a bug in the categorical value sampling implementation that causes this behavior: the time it takes to sample grows exponentially with the number of categorical columns in the dataset.

The fix for this is in the issue_120_compatibility_with_rdt_issue72 branch and will be released soon.
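
For readers wondering how sampling time can blow up exponentially with the number of categorical columns, here is a small illustrative sketch. It is not SDV's code, and the thread does not describe the actual mechanism. It assumes a hypothetical sampler that redraws a whole row whenever any categorical column comes back invalid, with each column being valid independently with probability `accept_probability`. Both function names and that parameter are made up for the illustration.

```python
import random


def expected_attempts(num_categorical_columns, accept_probability=0.5):
    """Expected number of row draws until every column is valid at once.

    Assumption: each column is valid independently with the same probability,
    so a whole row is valid with probability accept_probability ** k.
    """
    if not 0 < accept_probability <= 1:
        raise ValueError("accept_probability must be in (0, 1]")
    return (1.0 / accept_probability) ** num_categorical_columns


def simulated_attempts(num_categorical_columns, accept_probability=0.5, seed=0):
    """Count how many whole-row redraws one simulated sample actually needs."""
    rng = random.Random(seed)
    attempts = 0
    while True:
        attempts += 1
        # The row is kept only if every categorical column is valid.
        if all(rng.random() < accept_probability
               for _ in range(num_categorical_columns)):
            return attempts


if __name__ == "__main__":
    for k in (1, 5, 10, 15):
        print(k, expected_attempts(k), simulated_attempts(k))
```

Under these assumptions the expected cost per row doubles with every extra categorical column, which would be enough to turn a 1000-row sample over a wide table into a run that never appears to finish.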

clj2567 · 2019-10-17T15:14:49Z

Hi @csala,

Thanks for the update. Is there any timeline around this?

csala · 2019-10-21T12:27:18Z

> Thanks for the update. Is there any timeline around this?

Yes. This will most likely be released this week.

JDTheRipperPC · 2019-11-11T14:45:45Z

This should have been fixed in PR #121.

* Bump version: 0.3.1.dev0 → 0.3.1.dev1
* Validates discrete columns
* Fix lint

csala assigned JDTheRipperPC Oct 17, 2019

csala added the bug label Oct 17, 2019

csala added this to the 0.2.0 milestone Oct 17, 2019

JDTheRipperPC closed this as completed Nov 11, 2019

JonathanDZiegler pushed a commit to JonathanDZiegler/SDV that referenced this issue Feb 7, 2022

Validate discrete column (sdv-dev#118)

5ac5161

