Describing a dataset in independent attribute mode can fail during infer_distribution() for String attributes if a subset of the values can be inferred as numerical. sort_index() is then called on a pd.Series whose index mixes str and float labels, which raises the following TypeError:
Traceback (most recent call last):
  File "main.py", line 76, in <module>
    args.func(args)
  File "main.py", line 40, in synthesise
    d = synthesise(mode=mode, sample=sample)
  File "/Users/raids/synth/synthesise/synthesiser.py", line 104, in synthesise
    self.describer.describe_dataset_in_independent_attribute_mode(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/DataDescriber.py", line 123, in describe_dataset_in_independent_attribute_mode
    column.infer_distribution()
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/datatypes/StringAttribute.py", line 49, in infer_distribution
    distribution.sort_index(inplace=True)
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/series.py", line 3156, in sort_index
    indexer = nargsort(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/pandas/core/sorting.py", line 274, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'str' and 'float'
I ran into this with a specific string attribute and patched infer_distribution() with a single line; see the commented line below:
def infer_distribution(self):
    if self.is_categorical:
        distribution = self.data_dropna.value_counts()
        for value in set(self.distribution_bins) - set(distribution.index):
            distribution[value] = 0
        distribution.index = distribution.index.map(str)  # patch to fix index type
        distribution.sort_index(inplace=True)
        self.distribution_probabilities = utils.normalize_given_distribution(distribution)
        self.distribution_bins = np.array(distribution.index)
    else:
        distribution = np.histogram(self.data_dropna_len, bins=self.histogram_size)
        self.distribution_bins = distribution[1][:-1]
        self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])
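For anyone who wants to reproduce this outside DataSynthesizer, here is a minimal sketch using only pandas. The sample values are made up for illustration; the point is just that value_counts() on a column mixing str and float values yields a mixed-type index, and that mapping the index to str makes it sortable again.

```python
import pandas as pd

# Hypothetical column: mostly strings, with one value that was read as a float.
data = pd.Series(["a12", 3.5, "b7", "a12"], dtype=object)

# value_counts() keeps the original Python objects as index labels,
# so the index now contains both str and float.
distribution = data.value_counts()

try:
    distribution.sort_index()
    sort_failed = False
except TypeError as exc:
    # Mixed str/float labels cannot be compared with '<'.
    sort_failed = True
    print(f"sort_index() failed: {exc}")

# The one-line patch: cast every index label to str before sorting.
distribution.index = distribution.index.map(str)
sorted_labels = list(distribution.sort_index().index)
print(sorted_labels)
```

Note that after the cast the labels sort lexicographically as strings, so a numeric-looking label such as "10" would sort before "9".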
Happy to fork and raise a PR for this change and the other attributes if you think it's an appropriate fix. I'm not sure whether the other data types would require anything like this, though, nor whether non-categorical attributes would (when execution falls into the else branch above).
Let me know what you think and if you need anything from my side.
Cheers
raids changed the title from "infer_distribution() in for string attributes fails to sort index of varying types" to "infer_distribution() for string attributes fails to sort index of varying types" on Jul 15, 2020
Thanks for your help in debugging this error. I made an alternative change in the code that should fix this error. Please update to DataSynthesizer 0.1.2 to see if it works.
The solution is essentially the same as your suggestion. Since StringAttribute is assumed to hold string values, I added self.data_dropna = self.data_dropna.astype(str) in StringAttribute.__init__().
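As a rough illustration of why casting in the constructor has the same effect, here is a small sketch. StringColumnSketch is a hypothetical stand-in written for this example, not DataSynthesizer's actual StringAttribute class, and the plain division at the end is only an assumed substitute for utils.normalize_given_distribution().

```python
import pandas as pd


class StringColumnSketch:
    """Hypothetical, simplified stand-in for a string attribute."""

    def __init__(self, data: pd.Series):
        # Cast once up front so every later step sees only str values.
        self.data_dropna = data.dropna().astype(str)

    def infer_distribution(self) -> pd.Series:
        distribution = self.data_dropna.value_counts()
        # The index holds only str labels now, so sorting is safe.
        distribution.sort_index(inplace=True)
        # Assumed normalization: counts divided by their total.
        return distribution / distribution.sum()


# Made-up sample data mixing strings, a float, and a missing value.
column = StringColumnSketch(pd.Series(["x1", 42.0, None, "x1", "y2"], dtype=object))
probabilities = column.infer_distribution()
print(probabilities)
```

Casting at construction time means any other method that reads data_dropna also sees uniform string values, whereas the index-level patch only affects infer_distribution().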