
Generating samples is taking a lot of time. Is there any way to speed up sample generation? #103

Closed
imsitu opened this issue Jul 4, 2019 · 11 comments

@imsitu

imsitu commented Jul 4, 2019

  • SDV version: 0.1.1
  • Python version: 3.6
  • Operating System: Mac Mojave

Description

I am trying to set up automated test data generation for my testing.
I generated the metadata JSON for the table and fit the model with it.
Since the sampler is a dict, I am storing the sampler from data_vault as a pickle.
The goal is to store this pickled sampler in a database or on a remote server and generate test data wherever and whenever necessary.

The samples are taking too much time to generate: for a table of 29 columns and 1800 rows, generating 10 samples takes 5 minutes. I tried to generate the whole 1800 rows, but it never completed and I had to kill the process.

Please let me know if I am doing things the wrong way, or if there is anything I need to tweak to get a faster response.

What I Did

    import pickle

    from sdv import SDV

    # Build SDV from the table metadata and fit the model.
    data_vault = SDV(self.findMeta())
    data_vault.fit()

    # Persist the fitted sampler so it can be reloaded elsewhere.
    with open('sampler.pkl', 'wb') as output:
        pickle.dump(data_vault.sampler, output)

    # Reload the sampler and generate samples.
    with open('sampler.pkl', 'rb') as infile:
        sampler = pickle.load(infile)

    samples = sampler.sample_all(100)
@ManuelAlvarezC
Contributor

Hi @imsitu, and thanks for your question.

I'm not sure what may be causing this, but there are some details that you could share with us that will help us figure it out. Can you please share:

  • Specs of the computer you are using
  • Metadata you generated
  • Actual data you use to fit (If it's possible)

Beyond that, there is a little detail that you mention:

As sampler is a dict, I am storing sampler from data_vault as pickle.

I'm not really sure what you mean by sampler is a dict, but if you are interested in storing a fitted version of SDV, you can do it easily using the save and load methods of the SDV class.

@imsitu
Author

imsitu commented Jul 9, 2019

Specs:
MacBook Pro
Processor: 2.6 GHz Intel i7
Memory: 16 GB 2000 MHz DDR4

The metadata is attached.

Sorry, I couldn't share the actual data.

Forget about the dictionary part; my bad, it caused confusion.

I am saving the SDV object in a pickle file and loading it back (which is what SDV's load and save do internally),
and from the loaded pickle I'm generating samples using

sdv_pickle.sample_all()

meta_ref.txt

I think SDV takes more time when there are more columns and more categorical columns.
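One way to check that scaling hunch is to time sampling at a few sizes and see whether runtime grows linearly or faster. A minimal harness (hypothetical helper, not part of SDV) might look like:

```python
import time


def time_sampling(sample_fn, sizes):
    """Time a sampling callable at several sizes to see whether
    runtime grows linearly or faster (hypothetical helper)."""
    results = {}
    for n in sizes:
        start = time.perf_counter()
        sample_fn(n)
        results[n] = time.perf_counter() - start
    return results


# Example with a loaded SDV sampler (names are illustrative):
# timings = time_sampling(lambda n: sdv_pickle.sample_all(n), [5, 50, 500])
```

If the timings grow much faster than the sample count, the bottleneck is super-linear in the number of requested samples, not just per-row cost.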

@csala csala added the question General question about the software label Jul 9, 2019
@csala
Contributor

csala commented Jul 9, 2019

Hi @imsitu, just out of curiosity: why are you using pickle yourself instead of calling save and load?

@imsitu
Author

imsitu commented Jul 9, 2019

@csala It's basically the same code underneath, and besides that I want to use multiprocessing to speed things up.
I think whichever way the pickle file is generated, the result should be the same, right?

@imsitu
Author

imsitu commented Jul 10, 2019

@csala, @ManuelAlvarezC,
Just to add another data point, for a single table of 29 columns and 1800 rows:

sampler = Sampler(new_datanavigator, modeler)

sampler.sample_rows('ref_table', 1800) and sampler.sample_table('ref_table') generate samples within 50 seconds.

BUT generating the default 5 samples takes 173 seconds with sample_all():

sampler.sample_all(5)

Why would sample_all take more time even when there are no child tables and no foreign-key relations?

@DataDoctorNG

@csala, @ManuelAlvarezC, and @imsitu, I am also having issues with sample_all. I am attaching the meta file along with my CSV as an xlsx. Is there any way to speed this up?
Meta.txt
model_data.xlsx

@kveerama
Contributor

@imsitu I have a question for you. Is it possible to reach you via email? Can you email me at kalyanv@mit.edu?

@imsitu
Author

imsitu commented Jul 19, 2019

@kveerama you can reach me @ situ.wantsyou@gmail.com

@csala csala added approved bug Something isn't working and removed question General question about the software labels Nov 11, 2019
@csala csala added this to the 0.2.0 milestone Nov 11, 2019
@csala
Contributor

csala commented Nov 11, 2019

This has been resolved in v0.2.0

@csala csala closed this as completed Nov 11, 2019
@imsitu
Author

imsitu commented Nov 12, 2019

@csala May I know the commit ID or PR number, just to understand the fix?

@csala
Contributor

csala commented Nov 12, 2019

It was done in PR #121, but unfortunately I cannot tell you the exact commit, as the change is buried among a lot of other big refactoring changes.

But I can explain and point you at the cause of the problem in the old code-base: https://github.com/HDI-Project/SDV/blob/v0.1.2/sdv/sampler.py#L470

The problem was that the previous categorical encoding implementation required the internally sampled values to be exactly between 0 and 1, and the way to get there was a loop in which out-of-range values were dropped and re-sampled until all the values were valid.
Because of how the old CatTransformer from RDT worked, negative values were very likely to show up when sampling, which meant that there was a huge number of re-sampling attempts. And this number grew at a more-than-linear rate with the total number of samples requested.

The CategoricalTransformer from RDT does not have this [0, 1] requirement, so that validate-and-discard loop was removed altogether from the Sampler implementation.
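The validate-and-discard pattern described above can be sketched like this (a simplified illustration of the mechanism, not the actual SDV code):

```python
import random


def rejection_sample(sample_fn, n, low=0.0, high=1.0):
    """Draw n values with sample_fn, dropping and re-drawing any that
    fall outside [low, high]. Returns the accepted values and the
    total number of draws, which blows up when sample_fn often lands
    out of range."""
    values = []
    total_draws = 0
    while len(values) < n:
        batch = sample_fn(n - len(values))
        total_draws += len(batch)
        values.extend(v for v in batch if low <= v <= high)
    return values, total_draws
```

If sample_fn lands out of range half the time, total_draws averages roughly 2n; with a high rejection rate, as in the old transformer, this loop dominates the sampling runtime.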

JonathanDZiegler pushed a commit to JonathanDZiegler/SDV that referenced this issue Feb 7, 2022