# Issues with using categorical data in vaex

- [1. category_values](#Sec1)
- [2. category_labels](#Sec2)
- [3. _category_dictionary](#Sec3)
- [4. numpy_dtype](#Sec4)

## 1. category_values <a id='Sec1'></a>
The function category_values raises a KeyError because it looks up a dict item that does not exist: self._categories[column]['values'].

The function is not used internally (as far as GitHub search results show).
What to do?
- Replace self._categories[column]['values'] with self.columns[column] or self[column].values in the category_values function.
- If the function is not needed, mark it as deprecated and raise a warning or an error.
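
The two options above could be combined roughly as follows. This is a minimal sketch, not vaex code: CategoricalFrameSketch is a hypothetical stand-in class that mimics only the two attributes mentioned above (columns and _categories), and the warning text is made up for illustration.

```python
import warnings


class CategoricalFrameSketch:
    """Hypothetical stand-in for a vaex DataFrame (illustration only)."""

    def __init__(self, columns, categories):
        self.columns = columns          # column name -> stored data
        self._categories = categories   # column name -> category metadata

    def category_values(self, column):
        # Option 2: flag the method as deprecated.
        warnings.warn(
            "category_values is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        # Option 1: read the stored column data instead of the
        # missing self._categories[column]['values'] entry.
        return self.columns[column]


frame = CategoricalFrameSketch(
    columns={"year": [2012, 2015, 2019]},
    # Assumption: the metadata holds labels but no 'values' key.
    categories={"year": {"labels": list(range(2012, 2020))}},
)

# Record the warning so it can be inspected alongside the result.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = frame.category_values("year")

print(values)
print([w.category.__name__ for w in caught])
```

In the real method, self.columns[column] would return the underlying array rather than a Python list; the list here only keeps the sketch dependency-free.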

In [1]:
import vaex

In [2]:
# test data
df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [3]:
from vaex.utils import _ensure_string_from_expression
column = _ensure_string_from_expression(df.year)
column

'year'

In [4]:
df.category_values(column)

KeyError: 'values'

## 2. category_labels <a id='Sec2'></a>

The aslist=True functionality is needed only for Arrow dictionaries. It works fine (according to the [discussion](https://github.com/vaexio/vaex/pull/1476#issuecomment-887478906), there was a need for the check).

```python
def category_labels(self, column, aslist=True):
    column = _ensure_string_from_expression(column)
    
    # if the column is a Vaex categorical, return its list of labels
    if column in self._categories:
        return self._categories[column]['labels']
    
    # if not, it may be an Arrow dictionary:
    # when a dictionary is found, honor the aslist argument
    dictionary = self._category_dictionary(column)
    if dictionary is not None:
        if aslist:
            dictionary = dictionary.to_pylist()
        return dictionary
    
    # if neither a Vaex categorical nor an Arrow dict, raise an error
    else:
        raise ValueError(f'Column {column} is not a categorical')
```


In [5]:
# If data are Vaex categorical: function returns list in any case
df.category_labels(column, aslist=True)

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

In [6]:
df.category_labels(column, aslist=False)

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

In [7]:
# Arrow

In [8]:
import pyarrow as pa

In [9]:
# test data
indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
dictionary = pa.array(['foo', 'bar', 'baz'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
dfa = vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8])
dfa

#,x,y
0,foo,1
1,bar,2
2,foo,3
3,bar,4
4,baz,5
5,foo,6
6,--,7
7,baz,8


In [10]:
from vaex.utils import _ensure_string_from_expression
column2 = _ensure_string_from_expression(dfa.x)
column2

'x'

In [11]:
# If aslist=True: function returns list
# (function converts pyarrow.lib.StringArray object to list with to_pylist())
dfa.category_labels(column2, aslist=True)

['foo', 'bar', 'baz']

In [12]:
# If aslist=False: function returns pyarrow.lib.StringArray
dfa.category_labels(column2, aslist=False)

<pyarrow.lib.StringArray object at 0x000002A91D893E80>
[
  "foo",
  "bar",
  "baz"
]

## 3. _category_dictionary <a id='Sec3'></a>

The function should raise a meaningful error when the column passed is not of an Arrow type. Proposed:
- a simple check whether the column is an ndarray

In [13]:
import numpy as np

def _category_dictionary_TEST(df, column):
    '''Return the dictionary for a column if it is an arrow dict type'''
    if column in df.columns:
        x = df.columns[column]
        
        # First raise error if ndarray
        if isinstance(x, np.ndarray):
            raise ValueError(f'Column {column} is not an arrow dict type')
        
        # If not ndarray proceed
        arrow_type = x.type
        # duplicate code in array_types.py
        if isinstance(arrow_type, pa.DictionaryType):
            # we're interested in the type of the dictionary or the indices?
            if isinstance(x, pa.ChunkedArray):
                # take the first dictionary
                x = x.chunks[0]
            dictionary = x.dictionary
            return dictionary

In [14]:
# Proposed error for an ndarray column
_category_dictionary_TEST(df,column)

ValueError: Column year is not an arrow dict type

In [15]:
# Arrow dictionary output expected for arrow dict
_category_dictionary_TEST(dfa,column2)

<pyarrow.lib.StringArray object at 0x000002A91D893E80>
[
  "foo",
  "bar",
  "baz"
]

In [16]:
# Error currently
df._category_dictionary(column)

AttributeError: 'numpy.ndarray' object has no attribute 'type'

## 4. numpy_dtype <a id='Sec4'></a>

For an Arrow dictionary, a call to numpy_dtype results in an error. This looks like an Arrow bug, but I am not sure.

I suggest using the commented-out part of the code in
https://github.com/vaexio/vaex/blob/ca7927a19d259576ca0403ee207a597aaef6adc2/packages/vaex-core/vaex/array_types.py#L212-L221

```python
# I don't there is a reason anymore to return this type, the to_pandas_dtype should
# handle that
# if isinstance(arrow_type, pa.DictionaryType):
#     # we're interested in the type of the dictionary or the indices?
#     if isinstance(x, pa.ChunkedArray):
#         # take the first dictionary
#         x = x.chunks[0]
#     return numpy_dtype(x.dictionary)
# if arrow_type in string_types:
#     return arrow_type
```

In [17]:
dfa.x.dtype

dictionary<values=string, indices=int64, ordered=0>

In [18]:
from vaex import array_types
from vaex.utils import _ensure_string_from_expression

x_n = _ensure_string_from_expression(dfa.x)
x = dfa.columns[x_n]

# Errors
array_types.numpy_dtype(x)

NotImplementedError: dictionary<values=string, indices=int64, ordered=0>