infer_distribution() for string attributes fails to sort index of varying types #24

raids · 2020-07-14T15:51:33Z

DataSynthesizer version: 0.1.1
Python version: 3.8.2
Operating System: MacOS

Describing a dataset in independent attribute mode can fail during infer_distribution() for string attributes if a subset of the values could be inferred as numerical. sort_index() is then called on a pd.Series whose index mixes str and float labels, which raises the following TypeError:

Traceback (most recent call last):
  File "main.py", line 76, in <module>
    args.func(args)
  File "main.py", line 40, in synthesise
    d = synthesise(mode=mode, sample=sample)
  File "/Users/raids/synth/synthesise/synthesiser.py", line 104, in synthesise
    self.describer.describe_dataset_in_independent_attribute_mode(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/DataDescriber.py", line 123, in describe_dataset_in_independent_attribute_mode
    column.infer_distribution()
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/datatypes/StringAttribute.py", line 49, in infer_distribution
    distribution.sort_index(inplace=True)
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/series.py", line 3156, in sort_index
    indexer = nargsort(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/sorting.py", line 274, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'str' and 'float'
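
For reference, the same failure can be reproduced outside DataSynthesizer with a few lines of pandas. This is only a minimal sketch, assuming the column ends up holding both str and float values; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical column: mostly strings, plus one value held as a float.
values = pd.Series(["apple", "banana", 1.5, "apple"], dtype=object)

# Counting works, but the resulting index now mixes str and float labels.
distribution = values.value_counts()

try:
    distribution.sort_index()
    sort_failed = False
except TypeError as exc:
    # Sorting has to compare a str label with a float label, which Python 3 rejects.
    sort_failed = True
    print(f"sort_index() raised TypeError: {exc}")
```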

I ran into this with a specific string attribute and patched infer_distribution() with a single line; see line with comment below:

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.index = distribution.index.map(str) # patch to fix index type 
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna_len, bins=self.histogram_size)
            self.distribution_bins = distribution[1][:-1]
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])

Happy to fork and raise a PR for this change and the other attributes if you think it's an appropriate fix. I'm not sure whether the other data types would require anything like this, though, or whether non-categorical attributes would (when it falls into the else above).

Let me know what you think and if you need anything from my side.

Cheers


haoyueping · 2020-07-19T21:08:48Z

Thanks for your help in debugging this error. I made an alternative change in the code that should fix this error. Please update to DataSynthesizer 0.1.2 to see if it works.

The solution is essentially the same as your suggestion. Since a StringAttribute is assumed to hold string values, I added self.data_dropna = self.data_dropna.astype(str) in StringAttribute.__init__().
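
The effect of that cast can be sketched in isolation. The snippet below is an illustration with made-up values, not the actual DataSynthesizer code: casting the data to str before counting should leave the index with a single label type, so sorting no longer has to compare a str with a float.

```python
import pandas as pd

# Hypothetical column mixing str and float values.
data_dropna = pd.Series(["apple", "banana", 1.5, "apple"], dtype=object)

# Cast every value to str up front, mirroring the astype(str) call described above.
data_dropna = data_dropna.astype(str)

# The index now contains only strings, so sort_index() can order it.
distribution = data_dropna.value_counts()
distribution.sort_index(inplace=True)

print(list(distribution.index))
```

One consequence of this approach is that numeric-looking labels are ordered as text, so "10" sorts before "9".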

raids · 2020-07-20T09:33:54Z

Neat, thanks!

raids changed the title ~~infer_distribution() in for string attributes fails to sort index of varying types~~ infer_distribution() for string attributes fails to sort index of varying types Jul 15, 2020

haoyueping added a commit that referenced this issue Jul 19, 2020

Fix bug #24

ce9d98a

raids closed this as completed Jul 20, 2020
