infer_distribution() for string attributes fails to sort index of varying types #24

Closed
raids opened this issue Jul 14, 2020 · 2 comments

raids commented Jul 14, 2020

  • DataSynthesizer version: 0.1.1
  • Python version: 3.8.2
  • Operating System: MacOS

Describing a dataset in independent attribute mode can fail during infer_distribution() for String attributes if a subset of the values is inferred as numerical. sort_index() is then called on a pd.Series whose index mixes str and float labels, which raises the following TypeError:

Traceback (most recent call last):
  File "main.py", line 76, in <module>
    args.func(args)
  File "main.py", line 40, in synthesise
    d = synthesise(mode=mode, sample=sample)
  File "/Users/raids/synth/synthesise/synthesiser.py", line 104, in synthesise
    self.describer.describe_dataset_in_independent_attribute_mode(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/DataDescriber.py", line 123, in describe_dataset_in_independent_attribute_mode
    column.infer_distribution()
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/datatypes/StringAttribute.py", line 49, in infer_distribution
    distribution.sort_index(inplace=True)
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/series.py", line 3156, in sort_index
    indexer = nargsort(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/sorting.py", line 274, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'str' and 'float'
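
The same failure can be reproduced outside DataSynthesizer. The following is a minimal sketch that assumes only pandas; the sample values are made up for illustration and are not taken from the reporter's dataset.

```python
import pandas as pd

# Hypothetical column: mostly strings, but one value is a float.
values = pd.Series(["apple", "banana", 3.5, "apple"], dtype=object)

# Counting works, but the resulting index mixes str and float labels.
distribution = values.value_counts()

try:
    distribution.sort_index()
    sort_failed = False
except TypeError as exc:
    # Sorting has to compare a str label with a float label, which Python 3 rejects.
    sort_failed = True
    print(f"TypeError: {exc}")
```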

I ran into this with a specific string attribute and patched infer_distribution() with a single line (the commented line below):

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.index = distribution.index.map(str)  # patch to fix index type
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna_len, bins=self.histogram_size)
            self.distribution_bins = distribution[1][:-1]
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])
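
To see the effect of the patched line in isolation, here is a small sketch using only pandas. The sample values are hypothetical; only the `index.map(str)` and `sort_index(inplace=True)` calls mirror the snippet above.

```python
import pandas as pd

# Hypothetical column with one float among the strings.
values = pd.Series(["apple", "banana", 3.5, "apple"], dtype=object)
distribution = values.value_counts()

# Same idea as the patch: turn every index label into a str before sorting.
distribution.index = distribution.index.map(str)
distribution.sort_index(inplace=True)

labels = list(distribution.index)
counts = [int(c) for c in distribution]
print(labels, counts)
```

With every label a string, the sort falls back to plain lexicographic ordering, so "3.5" sorts before "apple".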

Happy to fork and raise a PR for this change and the other attributes if you think it's an appropriate fix. I'm not sure whether the other data types would require anything like this, though, or whether non-categorical attributes would (when execution falls into the else branch above).

Let me know what you think and if you need anything from my side.

Cheers

raids changed the title from "infer_distribution() in for string attributes fails to sort index of varying types" to "infer_distribution() for string attributes fails to sort index of varying types" on Jul 15, 2020
haoyueping added a commit that referenced this issue Jul 19, 2020
haoyueping (Collaborator) commented

Thanks for your help in debugging this error. I made an alternative change in the code that should fix this error. Please update to DataSynthesizer 0.1.2 to see if it works.

The solution is essentially the same as your suggestion. Since a StringAttribute is assumed to hold string values, I added self.data_dropna = self.data_dropna.astype(str) in StringAttribute.__init__().
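
The two approaches can differ in one edge case, sketched below with plain pandas and made-up values (this is not DataSynthesizer code). Casting the index after value_counts() can leave duplicate labels when a column holds both the float 1.0 and the string "1.0", while casting the values first merges them into one category.

```python
import pandas as pd

# Hypothetical column holding both the string "1.0" and the float 1.0.
raw = pd.Series(["1.0", 1.0, "a"], dtype=object)

# Approach A: count first, then cast the index labels to str.
counts_a = raw.value_counts()
counts_a.index = counts_a.index.map(str)

# Approach B: cast the values to str first, then count and sort.
counts_b = raw.astype(str).value_counts().sort_index()

print(counts_a.index.is_unique)
print(counts_b.to_dict())
```

For columns without such collisions, both approaches should yield the same sorted distribution.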

raids commented Jul 20, 2020

Neat, thanks!

raids closed this as completed Jul 20, 2020