# Categorical arrays in Vaex
28.7. - 30.7.2021

In Vaex there are two ways to construct categorical columns:
- from an Arrow dictionary array via `from_arrays` (set aside for now)
- using `categorize()` on a column in the dataframe

Whether a column is categorical is checked at the dataframe level rather than on the column's dtype, which is why a metadata constructor was added to the `_Column` and `_DataFrame` classes.

There are also two ways to construct a categorical column with `categorize()`:
- with min & max
- with labels

The difference matters when defining codes, values and categories, because the three are not used consistently across the two cases:
- **min & max:** codes need to be calculated; values are a subset of categories
- **labels:** codes = values; categories can be string or numeric
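
To make the two cases concrete, here is a minimal NumPy sketch of how the codes could be derived. It assumes that `categorize()` with `min_value`/`max_value` treats every integer in that range as a category, which is what the `_categories` output further down in this notebook suggests. The helper names are hypothetical and not part of Vaex.

```python
import numpy as np


def codes_from_min_max(values, min_value):
    """Min & max case: categories are min_value..max_value, so code = value - min_value."""
    return np.asarray(values) - min_value


def codes_from_labels(values):
    """Labels case: the stored values already are the codes."""
    return np.asarray(values)


year_codes = codes_from_min_max([2012, 2015, 2019], min_value=2012)
weekday_codes = codes_from_labels([0, 4, 6])
print(year_codes)     # [0 3 7]
print(weekday_codes)  # [0 4 6]
```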

The code that calculates the codes was added to `get_data_buffer()`.

A boolean flag `bool_c` was also added. It tells `dtype_from_vaexdtype` whether the dtype is needed for the `_VaexColumn` definition or for the `from_dataframe` calculation, where `get_data_buffer()` is called and the buffer has to be described. The difference:
- for `_VaexColumn`: categorical data must be labeled as categorical
- for `get_data_buffer()`: data must be labeled with the dtype of the values in the column
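
The two labelings can be illustrated with a small sketch. This is not the code from the implementation files: `DtypeKind`, `dtype_tuple` and the `for_column` flag are hypothetical stand-ins (the flag plays the role of `bool_c`, and which way round the real flag works is not stated here). Only integer storage is handled, and the kind numbers 0 and 23 are taken from the outputs shown later in this notebook.

```python
import enum

import numpy as np


class DtypeKind(enum.IntEnum):
    # Only the two kinds that appear in this notebook's outputs.
    INT = 0
    CATEGORICAL = 23


def dtype_tuple(np_dtype, is_categorical, for_column):
    """Return (kind, bit width, format string, byte order) for a column or its buffer."""
    np_dtype = np.dtype(np_dtype)
    if is_categorical and for_column:
        kind = DtypeKind.CATEGORICAL  # column level: report "categorical"
    else:
        kind = DtypeKind.INT          # buffer level: report the storage dtype
    return (kind, np_dtype.itemsize * 8, np_dtype.str, np_dtype.byteorder)


col = dtype_tuple("int32", is_categorical=True, for_column=True)
buf = dtype_tuple("int32", is_categorical=True, for_column=False)
print(col)
print(buf)
```

The same integer storage is described twice: once as kind 23 for the column, once as kind 0 for the buffer that actually holds the numbers.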

Categorization is done at the dataframe level. That is why, when data is read back from the buffers, information about the categories must be kept and `categorize()` must be called in `_from_dataframe()`. A check was also added in `_from_dataframe()` to get the columns in the correct form (values or categories, depending on which form of `categorize()` was used at construction: min & max, or labels).

# Implementation change for categorical dtypes

In [1]:
%run vaex_implementation_v1.py

## Research: categorize()
First I have to do some research on the Vaex dataframe and categorical variables.

In [2]:
# test data
df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [3]:
df.is_category('year')

True

## Testing from_dataframe method

In [4]:
df_test = from_dataframe_to_vaex(df)
df_test

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [5]:
df_test.is_category('year')

False

## Researching data types and categories

In [6]:
df['year'].dtype

int32

In [7]:
df.is_category('year')

True

In [8]:
df.year

Expression = year
Length: 3 dtype: int32 (column)
-------------------------------
0  2012
1  2015
2  2019

In [9]:
from vaex.utils import _ensure_string_from_expression
column = _ensure_string_from_expression(df.year)

In [10]:
df.columns[column]

array([2012, 2015, 2019])

In [11]:
df._categories

{'year': {'labels': [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
  'N': 8,
  'min_value': 2012},
 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
  'N': 7,
  'min_value': 0}}

In [12]:
df._categories[column]

{'labels': [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
 'N': 8,
 'min_value': 2012}

In [13]:
column in df._categories

True

In [14]:
df.is_local()

True

In [15]:
column in df.columns

True

In [16]:
df.category_labels(column)

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

In [17]:
df._categories[column]['labels']

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

In [18]:
df.category_count(column)

8

In [19]:
column2 = _ensure_string_from_expression(df.weekday)
column2

'weekday'

In [20]:
df.category_labels(column2)

['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

In [21]:
df[column2].values

array([0, 4, 6])

In [22]:
mapping = {ix: val for ix, val in enumerate(df[column2].values)}
mapping

{0: 0, 1: 4, 2: 6}

In [23]:
vaex.__version__

{'vaex-core': '4.3.0.post1',
 'vaex-viz': '0.5.0',
 'vaex-hdf5': '0.8.0',
 'vaex-server': '0.5.0',
 'vaex-astro': '0.8.2',
 'vaex-jupyter': '0.6.0',
 'vaex-ml': '0.12.0'}

In [24]:
df.category_values(column)

KeyError: 'values'

## Researching pyarrow dictionaries

In [59]:
df[column]

Expression = year
Length: 3 dtype: int32 (column)
-------------------------------
0  2012
1  2015
2  2019

In [60]:
df[column].dtype

int32

In [61]:
# Checking connection between categorize and pa.DictionaryType
# --> there is no connection

isinstance(df[column].dtype, pa.DictionaryType)

False

In [62]:
df.is_category(column)

True

**Constructing category from arrow dictionary and see how to get it into Vaex**

In [63]:
indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
dictionary = pa.array(['foo', 'bar', 'baz'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

In [64]:
dict_array.type

DictionaryType(dictionary<values=string, indices=int64, ordered=0>)

In [65]:
dict_array

<pyarrow.lib.DictionaryArray object at 0x00000189FCB37F20>

-- dictionary:
  [
    "foo",
    "bar",
    "baz"
  ]
-- indices:
  [
    0,
    1,
    0,
    1,
    2,
    0,
    null,
    2
  ]

In [66]:
# Is this of type pa.DictionaryType?
isinstance(dict_array.type, pa.DictionaryType)

True

In [67]:
# Can I get it into Vaex with from_arrays? - Yes
# If yes, does it stay categorical? - Yes
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8]).is_category('x')

True

In [68]:
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8]).dtypes

x    dictionary<values=string, indices=int64, order...
y                                                int32
dtype: object

In [69]:
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8])

#,x,y
0,foo,1
1,bar,2
2,foo,3
3,bar,4
4,baz,5
5,foo,6
6,--,7
7,baz,8


In [70]:
# But the _categories dict is empty!
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8])._categories

{}

In [71]:
# Same for a dataframe built from NumPy arrays (no Arrow dictionary)
x = np.array([1.5, 2.5, 3.5])
y = np.array([9.2, 10.5, 11.8])
vaex.from_arrays(x=x, y=y)._categories

{}

In [72]:
vaex.from_arrays(x=x, y=y).is_category('x')

False

In [73]:
# Try .dictionary function
# Gives labels for arrow dictionary array
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8]).columns['x'].dictionary

<pyarrow.lib.StringArray object at 0x00000189FCB4A400>
[
  "foo",
  "bar",
  "baz"
]

In [74]:
# How about indices?
# Returns the indices (codes) into the dictionary
vaex.from_arrays(x = dict_array, y = [1,2,3,4,5,6,7,8]).columns['x'].indices

<pyarrow.lib.Int64Array object at 0x00000189FCB4A280>
[
  0,
  1,
  0,
  1,
  2,
  0,
  null,
  2
]
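
How `dictionary` and `indices` relate can be shown in plain Python, without any Arrow call. This is only an illustration of the encoding, using the same data as `dict_array` above; it reproduces the `x` column displayed earlier, with `None` where Vaex shows `--`.

```python
indices = [0, 1, 0, 1, 2, 0, None, 2]
dictionary = ["foo", "bar", "baz"]

# Each index selects an entry of the dictionary; a null index stays null.
decoded = [dictionary[i] if i is not None else None for i in indices]
print(decoded)
# ['foo', 'bar', 'foo', 'bar', 'baz', 'foo', None, 'baz']
```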

## How to construct metadata dict for the implementation

In [75]:
# test data
dfa = vaex.from_arrays(xt = dict_array, yt = [1,2,3,4,5,6,7,8], zt = [1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5])
dfa = dfa.categorize('yt', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun','None'])

# Start with empty dict
dict_is_cat = {}
# for every column write down if the column is categorical or not
for i in dfa.get_names():
    dict_is_cat[i] = dfa.is_category(i)

In [76]:
dict_is_cat

{'xt': True, 'yt': True, 'zt': False}

In [77]:
# How to access the name of the expression
dfa['xt'].expression

'xt'

In [78]:
# Accessing the bool of the column (is it categorical or not)
dict_is_cat[dfa['zt'].expression]

False

In [79]:
dict_is_cat[dfa['xt'].expression]

True

In [80]:
dict_is_cat[dfa['yt'].expression]

True

## How does Pandas work with categories?

In [81]:
import pandas as pd

In [82]:
# First the implementation constructs a Categorical array
pdf = pd.Categorical([1, 2, 3, 1, 2, 3])
pdf

[1, 2, 3, 1, 2, 3]
Categories (3, int64): [1, 2, 3]

In [83]:
# Then it turns it into a series
series = pd.Series(pdf)
series

0    1
1    2
2    3
3    1
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]

**Checking datatypes and attributes used in the implementation (codes, categories.values)**

In [84]:
series.dtype

CategoricalDtype(categories=[1, 2, 3], ordered=False)

In [85]:
pdf.dtype

CategoricalDtype(categories=[1, 2, 3], ordered=False)

In [86]:
pdf.codes

array([0, 1, 2, 0, 1, 2], dtype=int8)

In [87]:
pdf.categories

Int64Index([1, 2, 3], dtype='int64')

In [88]:
pdf.categories.values

array([1, 2, 3], dtype=int64)

In [89]:
series.values

[1, 2, 3, 1, 2, 3]
Categories (3, int64): [1, 2, 3]

In [90]:
series.values.codes

array([0, 1, 2, 0, 1, 2], dtype=int8)

In [91]:
# In the implementation we construct a mapping of the categories
categories = pdf.categories.values
mapping = {ix: val for ix, val in enumerate(categories)}
mapping

{0: 1, 1: 2, 2: 3}
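
Going the other way, codes and categories are enough to rebuild the categorical. A short sketch using `pd.Categorical.from_codes`, with the arrays copied from the outputs above; whether the implementation rebuilds the column this way or through the mapping is not shown in this notebook.

```python
import numpy as np
import pandas as pd

codes = np.array([0, 1, 2, 0, 1, 2], dtype="int8")
categories = np.array([1, 2, 3])

# Rebuild the categorical from its two parts.
rebuilt = pd.Categorical.from_codes(codes, categories=categories)

# The same values via the explicit code -> category mapping.
mapping = {ix: val for ix, val in enumerate(categories)}
via_mapping = [mapping[c] for c in codes.tolist()]

print(list(rebuilt))
print(via_mapping)
```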

## Research metadata PR for Pandas and implementation

In [92]:
pdf2 = pd.DataFrame({'A': [1, 2, 3, 4],'B': [1, 2, 3, 4]})

# Check the metadata from the dataframe
expected = {"pandas.index": pdf2.index}
for key in expected:
    print(key)
    print(expected[key])

pandas.index
RangeIndex(start=0, stop=4, step=1)


## Workflow for categories implementation
Going through the errors after the metadata implementation and correcting them one by one. Working only with Vaex-categorized data first; Arrow dictionaries come later. Topics:
- _VaexDataFrame metadata
- get_data_buffer

In [93]:
%run vaex_implementation_v2.py

In [94]:
# How to call metadata from _DataFrame
_VaexDataFrame(dfa).metadata

{'vaex.cetagories_bool': {'xt': True, 'yt': True, 'zt': False},
 'vaex.cetagories': {'xt': ['foo', 'bar', 'baz'],
  'yt': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'None']}}
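
A dictionary of this shape could be assembled roughly as follows. This is a sketch, not the code in `vaex_implementation_v2.py`: `build_metadata` is a hypothetical helper that relies only on `get_names()`, `is_category()` and `category_labels()`, and I have not verified here that `category_labels()` covers Arrow dictionary columns. `StandInFrame` is a tiny stand-in object so the example runs without Vaex.

```python
def build_metadata(df):
    """Collect, per column, whether it is categorical and, if so, its labels."""
    is_cat = {name: df.is_category(name) for name in df.get_names()}
    labels = {
        name: list(df.category_labels(name))
        for name, flag in is_cat.items()
        if flag
    }
    # Key spelling copied from the output above.
    return {"vaex.cetagories_bool": is_cat, "vaex.cetagories": labels}


class StandInFrame:
    """Stand-in exposing the three dataframe methods used by build_metadata."""

    def __init__(self, names, labels_by_column):
        self._names = names
        self._labels = labels_by_column

    def get_names(self):
        return list(self._names)

    def is_category(self, name):
        return name in self._labels

    def category_labels(self, name):
        return self._labels[name]


frame = StandInFrame(names=["yt", "zt"], labels_by_column={"yt": ["Mon", "Tue"]})
metadata = build_metadata(frame)
print(metadata)
```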

In [95]:
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [96]:
# Kind vs type

In [97]:
df.weekday.dtype

int32

In [98]:
df.weekday.dtype.kind

'i'

In [99]:
df.year.dtype.numpy.itemsize

4

In [100]:
df['year'].dtype.numpy.itemsize

4

In [101]:
%run pandas_implementation.py

In [102]:
series

0    1
1    2
2    3
3    1
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]

In [103]:
# Checking to see the output of _PandasColumn().get_data_buffer
# and then matching _VaexColumn().get_data_buffer
_PandasColumn(series).get_data_buffer()

(PandasBuffer({'bufsize': 6, 'ptr': 1692090374288, 'device': 'CPU'}),
 (<_DtypeKind.INT: 0>, 8, '|i1', '|'))

In [104]:
# But the dtype in the _PandasColumn() is different - implement in Vaex also!
# <_DtypeKind.INT: 0> in buffer vs 23 in _PandasColumn class

_PandasColumn(series).dtype

(23, 64, '|O08', '=')

In [105]:
%run vaex_implementation_v2.py

In [106]:
# Making it work the same for Vaex
# Changes needed to be made in _VaexBuffer and in _VaexColumn (dtype and _dtype_from_vaexdtype)

a = _VaexDataFrame(df).metadata
a

{'vaex.cetagories_bool': {'year': True, 'weekday': True},
 'vaex.cetagories': {'year': [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
  'weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']}}

In [107]:
_VaexColumn(df.year, a).is_cat

True

In [108]:
_VaexColumn(df.year, a).get_data_buffer()

(VaexBuffer({'bufsize': 12, 'ptr': 1692088539872, 'device': 'CPU'}),
 (<_DtypeKind.INT: 0>, 32, '<i4', '='))

In [109]:
_VaexColumn(df.year, a).dtype

(23, 32, '<i4', '=')

In [110]:
%run pandas_implementation.py

In [111]:
# What kind of input should be given to buffer_to_ndarray?
# Compare to _PandasColumn

cp1, cp2 = _PandasColumn(series).get_data_buffer()

In [112]:
cp1, cp2

(PandasBuffer({'bufsize': 6, 'ptr': 1692090374288, 'device': 'CPU'}),
 (<_DtypeKind.INT: 0>, 8, '|i1', '|'))

In [113]:
buffer_to_ndarray(cp1, cp2)

array([0, 1, 2, 0, 1, 2], dtype=int8)

In [114]:
# Let's compare to Vaex
%run vaex_implementation_v2.py

In [115]:
cv1, cv2 = _VaexColumn(df.year, a).get_data_buffer()

In [116]:
cv1, cv2 # Works yeeey!

(VaexBuffer({'bufsize': 12, 'ptr': 1692088539872, 'device': 'CPU'}),
 (<_DtypeKind.INT: 0>, 32, '<i4', '='))

In [117]:
buffer_to_ndarray(cv1, cv2)

array([2012, 2015, 2019])

In [118]:
buffer_to_ndarray(cv1, cv2).dtype

dtype('<i4')
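
For reference, this is roughly what turning a `(ptr, bufsize)` pair plus a dtype back into an array involves. It is a sketch under my own assumptions, not the body of `buffer_to_ndarray`: `ndarray_from_pointer` is a hypothetical helper, and the pointer is taken from a NumPy array created on the spot rather than from a `VaexBuffer`.

```python
import ctypes

import numpy as np


def ndarray_from_pointer(ptr, bufsize, np_dtype):
    """View `bufsize` bytes starting at address `ptr` as a 1-D array of `np_dtype`."""
    np_dtype = np.dtype(np_dtype)
    ctypes_type = np.ctypeslib.as_ctypes_type(np_dtype)
    data_pointer = ctypes.cast(ptr, ctypes.POINTER(ctypes_type))
    n_items = bufsize // np_dtype.itemsize
    return np.ctypeslib.as_array(data_pointer, shape=(n_items,))


# The source array owns the memory and must stay alive while the view is used.
source = np.array([2012, 2015, 2019], dtype="int32")
ptr = source.__array_interface__["data"][0]

view = ndarray_from_pointer(ptr, source.nbytes, "int32")
print(view, source.nbytes)
```

Three `int32` items take 12 bytes, which matches the `bufsize` reported for the `year` column above.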

In [119]:
# How to construct data frame from columns and keep categorical info

In [120]:
dfv = _VaexDataFrame(df)
dfv

<__main__._VaexDataFrame at 0x189fcbb9a30>

In [121]:
dfv.column_names()

['year', 'weekday']

In [122]:
# Check what kind of an output we get for columns and labels in _from_dataframe()

columns = dict()
lable = dict()
_k = _DtypeKind
for name in dfv.column_names():
    col = dfv.get_column_by_name(name)
    if col.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
        # Simple numerical or bool dtype, turn into numpy array
        columns[name] = convert_column_to_ndarray(col)
    elif col.dtype[0] == _k.CATEGORICAL:
        values, categories = convert_categorical_column(col)
        columns[name] = values
        lable[name] = categories
    else:
        raise NotImplementedError(f"Data type {col.dtype[0]} not handled yet")

In [123]:
columns

{'year': array([2012, 2015, 2019]), 'weekday': array([0, 4, 6])}

In [124]:
lable

{'year': array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]),
 'weekday': array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], dtype='<U3')}

In [125]:
dataframe = vaex.from_dict(columns)
dataframe

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [126]:
for cat in lable:
    dataframe = dataframe.categorize(cat, labels = lable[cat])
    print(cat)
    print(lable[cat])

year
[2012 2013 2014 2015 2016 2017 2018 2019]
weekday
['Mon' 'Tue' 'Wed' 'Thu' 'Fri' 'Sat' 'Sun']


In [127]:
dataframe

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [128]:
dataframe.is_category('year')

True

In [129]:
# LOOKS GOOD =)
# Let's try THE method
df_vaex = from_dataframe_to_vaex(df)
df_vaex

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [130]:
# And let's check the properties (categorical)

In [131]:
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [132]:
df_vaex.is_category('year')

True

In [133]:
df_vaex.is_category('weekday')

True

In [134]:
df_vaex.category_labels('weekday')

array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], dtype='<U3')

In [135]:
# IT WOOORKS! =)

In [136]:
# Arrow dicts:
# Doesn't work yet. There is trouble with the Arrow conversion to_pandas_dtype!

## Roundtrip Pandas - Vaex

In [137]:
%run vaex_implementation_v2.py
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [138]:
df1 = from_dataframe_to_vaex(df)
df1

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [139]:
%run pandas_implementation.py
df2 = from_dataframe(df1)
df2

# There is a problem in convert_categorical_column
# I changed the method for Vaex so much it isn't compatible with Pandas anymore
# Have to correct it so it understands the codes!

# INFO
# categories = labels in Vaex
# values vs codes = not clear in Vaex
## categorize with labels -> codes are the values, labels are the categories
## categorize with min & max -> values are taken from the categories and codes need to be constructed!
### done in the get_data_buffer

IndexError: index 2012 is out of bounds for axis 0 with size 8
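
The error is consistent with the consumer looking up `categories[codes]` while the buffer still holds the raw values. I am assuming that is what happens inside `convert_categorical_column` on the Pandas side; the sketch below only reproduces the same kind of failure with plain NumPy and shows that shifting by the minimum fixes the lookup.

```python
import numpy as np

categories = np.array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
values = np.array([2012, 2015, 2019])  # what the buffer currently holds
codes = values - categories[0]         # what a consumer expects: [0, 3, 7]

try:
    categories[values]                 # values used as if they were codes
    failed = False
except IndexError as err:
    failed = True
    print(err)

print(categories[codes])               # [2012 2015 2019]
```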

In [140]:
# Let's research mapping of _VaexDataFrame

In [141]:
%run vaex_implementation_v2.py

In [142]:
# What is the output of .describe_categorical
# No codes passed here
o, i, m = _VaexColumn(df.weekday, a).describe_categorical

In [143]:
o, i, m

(False,
 True,
 {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'})

In [144]:
# cv1, cv2 = _VaexColumn(df.year, a).get_data_buffer()
cv1

VaexBuffer({'bufsize': 12, 'ptr': 1692088539872, 'device': 'CPU'})

In [145]:
cv2

(<_DtypeKind.INT: 0>, 32, '<i4', '=')

In [146]:
# This should pass codes not values!
buffer_to_ndarray(cv1, cv2)

array([2012, 2015, 2019])

In [147]:
# Check the data I have again

In [148]:
df.category_labels(column2)

['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

In [149]:
df.category_labels(column)

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

In [150]:
df[column2].values

array([0, 4, 6])

In [151]:
df[column].values

array([2012, 2015, 2019])

In [152]:
df[column].values[1] in df.category_labels(column)

True

In [153]:
df[column2].values[1] in df.category_labels(column2)

False

In [154]:
# Maybe I can decide which option needs code calculation by checking
# whether a value of the categorical column is in the list of categories (categorize with boundaries) -> then I need to calculate the code
# if the value is not found in the list of categories (categorize with a labels list) -> it is already a code

codes = df[column].values
for i in df[column].values:
    print(np.where(df.category_labels(column) == i)) # if values are same as labels

(array([0], dtype=int64),)
(array([3], dtype=int64),)
(array([7], dtype=int64),)
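
The idea from this cell can be wrapped in one helper. This is a sketch of the heuristic only; `values_to_codes` is a hypothetical name and not the code that ended up in `get_data_buffer()`. Note that the check is ambiguous when numeric labels overlap with valid codes (for example labels `[1, 2, 3]` with stored values `[1, 2]`), so it should be treated as a starting point.

```python
import numpy as np


def values_to_codes(values, categories):
    """Return codes for `values`, computing them only if the values are categories."""
    position = {cat: ix for ix, cat in enumerate(categories)}
    values = list(values)
    if all(v in position for v in values):
        # Values are categories (min & max case): look up each position.
        return np.array([position[v] for v in values])
    # Values are not categories (labels case): they already are the codes.
    return np.asarray(values)


year = values_to_codes([2012, 2015, 2019], list(range(2012, 2020)))
weekday = values_to_codes([0, 4, 6], ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
print(year)     # [0 3 7]
print(weekday)  # [0 4 6]
```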


In [155]:
codes

array([2012, 2015, 2019])

In [156]:
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [157]:
# There was a problem with joining the columns back into the Vaex dataframe in _from_dataframe_to_vaex
# I need to check again which kind of values I get
## numeric when categorize was done with boundaries / values are not of string type
## string when categorize was done with labels

In [158]:
# Checking if I can make labels numeric also! - Yes
df5 = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df5 = df5.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
df5

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [159]:
df5 = df5.categorize('year', labels=[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])

In [160]:
df5

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [161]:
df5.is_category('weekday')

True

In [162]:
df5.is_category('year')

True

In [163]:
# I will need codes, values and categories to be in output!

In [164]:
convert_categorical_column(_VaexColumn(df.year, a))

(array([2012, 2015, 2019]),
 array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]))

In [165]:
convert_categorical_column(_VaexColumn(df.weekday, a))

(array([0, 4, 6]),
 array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], dtype='<U3'))

In [166]:
%run vaex_implementation_v3.py

In [167]:
convert_categorical_column(_VaexColumn(df.year, a))

(array([0, 3, 7]),
 array([2012, 2015, 2019]),
 array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]))

In [168]:
convert_categorical_column(_VaexColumn(df.weekday, a))

(array([0, 4, 6]),
 array(['Mon', 'Fri', 'Sun'], dtype='<U3'),
 array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], dtype='<U3'))

In [169]:
# Better! =)

# Now it should work!

In [242]:
%run vaex_implementation_v3.py

# test data
df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
df

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [243]:
df_vaex = from_dataframe_to_vaex(df)
df_vaex

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [244]:
df_vaex.is_category('year')

True

In [245]:
df_vaex.is_category('weekday')

True

## Hooray!
Now all we need to do is check the round trip Vaex -> Pandas, Pandas -> Vaex.

In [260]:
# Vaex -> Pandas

df_pandas = from_dataframe(df_vaex)
df_pandas

Unnamed: 0,year,weekday
0,2012,Mon
1,2015,Fri
2,2019,Sun


In [250]:
df_pandas.dtypes

year       category
weekday    category
dtype: object

In [259]:
# Pandas -> Vaex

dfp_vaex = from_dataframe_to_vaex(df_pandas)
dfp_vaex.dtypes

year       int64
weekday     int8
dtype: object

In [254]:
dfp_vaex

#,year,weekday
0,2012,0
1,2015,4
2,2019,6


In [256]:
dfp_vaex.is_category('year')

True

In [261]:
# Again Vaex -> Pandas

from_dataframe(dfp_vaex)

Unnamed: 0,year,weekday
0,2012,Mon
1,2015,Fri
2,2019,Sun


In [262]:
# THE END #