# **Dataframe interchange protocol implementation for Vaex library** <br>Example Notebook

## Protocol Description
>The purpose of the **Dataframe interchange protocol (`__dataframe__`)** is to enable data interchange. I.e., a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).

With the protocol implemented in dataframe libraries, we will be able to write code that accepts any kind of dataframe 🎉 <br>
For more information visit the [RFC blog post](https://data-apis.org/blog/dataframe_protocol_rfc/) or the [official site](https://data-apis.org/dataframe-protocol/latest/index.html).
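
As a rough illustration of what "code that accepts any kind of dataframe" could look like, the sketch below relies only on the `__dataframe__` method and on `num_rows()` / `num_columns()`, both of which are used later in this Notebook. `ToyFrame`, `_ToyInterchange` and `shape_of` are hypothetical names made up for the example; they are not part of Vaex or of the protocol.

```python
class _ToyInterchange:
    """Hypothetical stand-in for a protocol object such as `_VaexDataFrame`."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of values

    def num_columns(self):
        return len(self._columns)

    def num_rows(self):
        # All columns are assumed to have the same length
        return len(next(iter(self._columns.values()), []))


class ToyFrame:
    """Hypothetical dataframe class that opts into the protocol."""

    def __init__(self, **columns):
        self._columns = columns

    def __dataframe__(self):
        return _ToyInterchange(self._columns)


def shape_of(df):
    """Return (rows, columns) for any object exposing `__dataframe__`."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    xdf = df.__dataframe__()
    return xdf.num_rows(), xdf.num_columns()


print(shape_of(ToyFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])))  # (3, 2)
```

The function never checks which library the object comes from, which is the point of the protocol.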

## Vaex description
>**Vaex library** is a high performance Python library for lazy Out-of-Core DataFrames, to visualize and explore big tabular datasets.

More about the Vaex library is available on the [official site](https://vaex.io/docs/index.html) and [blog](https://vaex.io/blog).

**Implementing the protocol for Vaex therefore means giving the Vaex dataframe class a `__dataframe__` method that returns an object following the interface specified by the Consortium for Python Data API Standards.**

## Content

1. [dataframe attribute](#dataframe)
2. [from_dataframe method](#from_df)

<center><img src="Blog_picture_2.png" width="800"></center>

___

## `__dataframe__` attribute <a name="dataframe"></a>
The base class for the `__dataframe__` method consists of three separate classes: `_Buffer`, `_Column` and `_DataFrame`. Each of them has required and additional methods to construct and describe a dataframe. <br>Let's see some of them.
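
To make the nesting of the three classes easier to picture, here is a minimal sketch in plain Python. It is not the Vaex implementation: `ToyBuffer`, `ToyColumn` and `ToyDataFrame` are hypothetical, they only handle float columns, and `get_buffers` is simplified to return the buffer without its dtype.

```python
import array


class ToyBuffer:
    """A contiguous block of memory: address, size in bytes and device."""

    device = "CPU"

    def __init__(self, data):
        self._data = data  # keeping a reference keeps the memory alive

    @property
    def bufsize(self):
        return len(self._data) * self._data.itemsize

    @property
    def ptr(self):
        return self._data.buffer_info()[0]


class ToyColumn:
    """A single column, backed here by one float64 buffer."""

    def __init__(self, values):
        self._data = array.array("d", values)

    @property
    def size(self):
        return len(self._data)

    def get_buffers(self):
        # Simplification: no validity mask and no offsets in this sketch
        return {"data": ToyBuffer(self._data), "validity": None, "offsets": None}


class ToyDataFrame:
    """The top level: a collection of named columns."""

    def __init__(self, columns):
        self._columns = {name: ToyColumn(vals) for name, vals in columns.items()}

    def num_columns(self):
        return len(self._columns)

    def num_rows(self):
        return next(iter(self._columns.values())).size

    def get_column_by_name(self, name):
        return self._columns[name]


frame = ToyDataFrame({"x": [1.5, 2.5, 3.5]})
buf = frame.get_column_by_name("x").get_buffers()["data"]
print(frame.num_rows(), buf.bufsize, buf.device)  # 3 24 CPU
```

The dataframe hands out columns, and each column hands out the buffers that hold its memory.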

In [1]:
# First I will import the Vaex protocol implementation for demo purposes
%run vaex_implementation.py

In [2]:
# Then I will construct a diverse Vaex dataframe
indices = pa.array([0, 1, 2, 1, 2])
dictionary = pa.array(['foo', 'bar', 'baz'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

df = vaex.from_arrays(
    numpy_int=np.array([1, 2, 3, 4, 0]), # Numpy int
    numpy_float=np.array([1.5, 2.5, 3.5, 4.5, 0]), # Numpy float
    numpy_bool=np.array([True, False, True, True, True]), # Numpy bool
    
    numpy_int_m=np.ma.array([1, 2, 3, 4, 0], mask=[0, 0, 0, 1, 1], dtype=int), # Numpy masked int
    numpy_float_m=np.ma.array([1.5, 2.5, 3.5, 4.5, 0], mask=[False, True, True, True, False], dtype=float), # Numpy masked float
    numpy_bool_m=np.ma.array([True, False, True, True, True], mask=[1, 0, 0, 1, 0], dtype=bool), # Numpy masked bool
    
    arrow_int = pa.array([0, 1, 2, 3, 0]), # Arrow int
    arrow_float = pa.array([0.5, 1.5, 2.5, 3.5, 0.5]), # Arrow float
    arrow_bool = pa.array([True, False, True, False, True]), # Arrow bool
    
    arrow_int_m = pa.array([0, 1, 2, None, 0], mask=np.array([0, 0, 0, 1, 1], dtype=bool)), # Arrow masked int
    arrow_float_m = pa.array([0.5, 1.5, 2.5, None, 0.5], mask=np.array([0, 0, 0, 1, 0], dtype=bool)), # Arrow masked float
    arrow_bool_m = pa.array([True, False, True, None, True], mask=np.array([0, 0, 1, 1, 0], dtype=bool)), # Arrow masked bool
    
    arrow_dict = pa.DictionaryArray.from_arrays(pa.array([0, 1, 2, 0, 1]), pa.array(['aap', 'noot', 'mies'])), # Arrow dictionary
    arrow_dict_m = pa.DictionaryArray.from_arrays(pa.array([0, 1, 2, 0, 1], mask=np.array([0, 1, 1, 0, 0], dtype=bool)), pa.array(['aap', 'noot', 'mies'])) # Arrow dict masked
)

# And print it out to visualize
df

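| # | numpy_int | numpy_float | numpy_bool | numpy_int_m | numpy_float_m | numpy_bool_m | arrow_int | arrow_float | arrow_bool | arrow_int_m | arrow_float_m | arrow_bool_m | arrow_dict | arrow_dict_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.5 | True | 1 | 1.5 | -- | 0 | 0.5 | True | 0 | 0.5 | True | aap | aap |
| 1 | 2 | 2.5 | False | 2 | -- | False | 1 | 1.5 | False | 1 | 1.5 | False | noot | -- |
| 2 | 3 | 3.5 | True | 3 | -- | True | 2 | 2.5 | True | 2 | 2.5 | -- | mies | -- |
| 3 | 4 | 4.5 | True | -- | -- | -- | 3 | 3.5 | False | -- | -- | -- | aap | aap |
| 4 | 0 | 0.0 | True | -- | 0.0 | True | 0 | 0.5 | True | -- | 0.5 | True | noot | noot |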


In [3]:
# Let's first see how to call `__dataframe__` on a Vaex dataframe
df.__dataframe__()

<__main__._VaexDataFrame at 0x20e0ba0ceb0>

We can see that a `_VaexDataFrame` class instance is generated. We can now explore its attributes:

In [4]:
# Let's see the number of columns
df.__dataframe__().num_columns()

14

In [5]:
# and the number of rows
df.__dataframe__().num_rows()

5

In [6]:
# We can also get/select columns from the dataframe
# The methods that can be used are: get_column, get_column_by_name, get_columns, select_columns, select_columns_by_name
df.__dataframe__().get_column(4)

<__main__._VaexColumn at 0x20e0ba04340>

As we can see, selecting index 4 returned the fifth column, 'numpy_float_m', as a `_VaexColumn` instance. <br>We can now look at some of the methods and properties of this column as well:

In [7]:
# Let's save the column and explore its methods
col = df.__dataframe__().get_column(4)

# We can get the size of the column
col.size

array(5, dtype=int64)

In [8]:
# Null count
col.null_count

3

The description of the data type is especially useful. Let's see!

In [9]:
col.dtype

(<_DtypeKind.FLOAT: 2>, 64, 'g', '=')

This means the fifth column of the `__dataframe__` instance of dataframe df is of float type, and each element of the column takes up 64 bits in memory. The tuple also stores the Apache Arrow format string; in this case 'g' means float64. The whole list of format strings is available in the [Arrow C data interface documentation](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings). The last element of the tuple is the byte order, as described in the [NumPy documentation](https://numpy.org/doc/stable/reference/generated/numpy.dtype.byteorder.html).
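
To make the four fields of the dtype tuple more tangible, here is a small sketch that turns such a tuple into a readable sentence. `DtypeKind`, `ARROW_FORMATS`, `BYTE_ORDERS` and `describe_dtype` are defined here only for illustration. The `FLOAT` and `BOOL` codes match the outputs shown in this Notebook; the remaining codes and the format strings are taken from the protocol and Arrow specifications as I understand them, and the format table is deliberately incomplete.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    """Kind codes; FLOAT and BOOL match the reprs shown in this Notebook."""
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# A few Arrow C data interface format strings (not the full list)
ARROW_FORMATS = {
    "b": "bool", "c": "int8", "C": "uint8", "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32", "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64", "u": "utf8 string",
}

# NumPy-style byte order characters
BYTE_ORDERS = {
    "=": "native byte order",
    "<": "little-endian",
    ">": "big-endian",
    "|": "byte order not applicable",
}


def describe_dtype(dtype):
    """Spell out a (kind, bitwidth, format string, byteorder) tuple."""
    kind, bitwidth, fmt, byteorder = dtype
    arrow_name = ARROW_FORMATS.get(fmt, "unknown to this sketch")
    order = BYTE_ORDERS.get(byteorder, "unknown byte order")
    return f"{DtypeKind(kind).name}, {bitwidth} bits, Arrow format {fmt!r} ({arrow_name}), {order}"


print(describe_dtype((DtypeKind.FLOAT, 64, "g", "=")))
# FLOAT, 64 bits, Arrow format 'g' (float64), native byte order
print(describe_dtype((DtypeKind.BOOL, 8, "b", "|")))
# BOOL, 8 bits, Arrow format 'b' (bool), byte order not applicable
```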

For each column we can also get its buffers as a dictionary. The `data` entry is the data buffer, `validity` is the mask of the data and `offsets` is the offsets buffer. Let's see the output:

In [10]:
col.get_buffers()

{'data': (VaexBuffer({'bufsize': 40, 'ptr': 2259304034864, 'device': 'CPU'}),
  (<_DtypeKind.FLOAT: 2>, 64, 'g', '=')),
 'validity': (VaexBuffer({'bufsize': 5, 'ptr': 2259304342976, 'device': 'CPU'}),
  (<_DtypeKind.BOOL: 20>, 8, 'b', '|')),
 'offsets': {}}

We can also see in the output that, besides the `_VaexBuffer` instance, we get the dtype that is needed to produce an array from the buffer at transfer.
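
The following sketch shows why a pointer, a size in bytes and a dtype are enough to rebuild the values on the receiving side. It uses only `ctypes` from the standard library, plain dictionaries instead of `_VaexBuffer` objects, and a hypothetical helper `read_buffer`. It also assumes that a mask byte of 1 marks a missing value, as in NumPy masked arrays; a real consumer would have to ask the producer which convention applies.

```python
import ctypes

# Producer side: five float64 values and a one-byte-per-element mask
values = (ctypes.c_double * 5)(1.5, 2.5, 3.5, 4.5, 0.0)
mask = (ctypes.c_uint8 * 5)(0, 1, 1, 1, 0)  # assumption: 1 marks a missing value

data_buffer = {"bufsize": ctypes.sizeof(values),
               "ptr": ctypes.addressof(values),
               "device": "CPU"}
validity_buffer = {"bufsize": ctypes.sizeof(mask),
                   "ptr": ctypes.addressof(mask),
                   "device": "CPU"}


def read_buffer(buffer, bitwidth, ctype):
    """Consumer side: view the memory at `ptr` as `ctype` elements."""
    n_elements = buffer["bufsize"] * 8 // bitwidth
    return list((ctype * n_elements).from_address(buffer["ptr"]))


data = read_buffer(data_buffer, 64, ctypes.c_double)
missing = read_buffer(validity_buffer, 8, ctypes.c_uint8)
column = [None if m else v for v, m in zip(data, missing)]

print(data_buffer["bufsize"], validity_buffer["bufsize"])  # 40 5
print(column)  # [1.5, None, None, None, 0.0]
```

The buffer sizes, 40 and 5 bytes, line up with the `bufsize` values in the output above: five 64-bit floats and five 8-bit mask entries.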

---

## `from_dataframe` method <a name="from_df"></a>
The general method to move between dataframes is called `from_dataframe`. It iterates through the dictionary of columns (and chunks), calls the correct methods and transfers the column to the desired type.
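
In rough outline, such a conversion loop could look like the sketch below. It is not the code behind `from_dataframe_to_vaex`: the target is a plain dictionary of lists, the toy columns expose a hypothetical `values` attribute instead of raw buffers, and only integer and float kinds are handled. `get_chunks` and `column_names` follow the protocol specification as I understand it; all `Toy*` names and `from_dataframe_to_dict` are made up for the example.

```python
KIND_INT, KIND_FLOAT = 0, 2  # kind codes as used in the protocol's dtype tuples


class ToyColumn:
    def __init__(self, values, dtype):
        self.values = values  # hypothetical shortcut instead of get_buffers()
        self.dtype = dtype    # (kind, bitwidth, format string, byteorder)


class ToyChunk:
    def __init__(self, columns):
        self._columns = columns

    def column_names(self):
        return list(self._columns)

    def get_column_by_name(self, name):
        return self._columns[name]


class ToyFrame:
    def __init__(self, chunks):
        self._chunks = chunks

    def __dataframe__(self):
        return self

    def get_chunks(self):
        return iter(self._chunks)


# One converter per dtype kind
CONVERTERS = {
    KIND_INT: lambda col: [int(v) for v in col.values],
    KIND_FLOAT: lambda col: [float(v) for v in col.values],
}


def from_dataframe_to_dict(df):
    """Walk chunks and columns, convert each column, and stitch the chunks."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    result = {}
    for chunk in df.__dataframe__().get_chunks():
        for name in chunk.column_names():
            col = chunk.get_column_by_name(name)
            kind = col.dtype[0]
            if kind not in CONVERTERS:
                raise NotImplementedError(f"dtype kind {kind} is not handled in this sketch")
            result.setdefault(name, []).extend(CONVERTERS[kind](col))
    return result


int64, float64 = (KIND_INT, 64, "l", "="), (KIND_FLOAT, 64, "g", "=")
source = ToyFrame([
    ToyChunk({"a": ToyColumn([1, 2], int64), "c": ToyColumn([1.5, 2.5], float64)}),
    ToyChunk({"a": ToyColumn([3], int64), "c": ToyColumn([3.5], float64)}),
])
print(from_dataframe_to_dict(source))  # {'a': [1, 2, 3], 'c': [1.5, 2.5, 3.5]}
```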

In this example Notebook I will show how a Pandas dataframe can easily be transformed into a Vaex dataframe.

In [11]:
# First I need to import Pandas implementation of the protocol
%run pandas_implementation.py

# And construct the example dataframe
dfp = pd.DataFrame(data=dict(a=[1, 2, 3], b=[3, 4, 5],
                                c=[1.5, 2.5, 3.5], d=[9, 10, 11]))
dfp["b"] = dfp["b"].astype("category")
dfp.at[1, 'b'] = np.nan

# Let's first print the Pandas dataframe
dfp

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 3.0 | 1.5 | 9 |
| 1 | 2 | NaN | 2.5 | 10 |
| 2 | 3 | 5.0 | 3.5 | 11 |


In [12]:
# Let's see the `__dataframe__` instance
dfp.__dataframe__()

<__main__._PandasDataFrame at 0x20e0ba194c0>

`dfp` is a Pandas dataframe, and calling `__dataframe__` on it returns its protocol object, a `_PandasDataFrame` instance. Now let's transfer it to a Vaex dataframe.

In [13]:
dfp_vaex = from_dataframe_to_vaex(dfp)
dfp_vaex

| # | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 3 | 1.5 | 9 |
| 1 | 2 | -- | 2.5 | 10 |
| 2 | 3 | 5 | 3.5 | 11 |


If we check the `__dataframe__` instance of the transferred dataframe, we see that it is now a Vaex dataframe!

In [14]:
dfp_vaex.__dataframe__()

<__main__._VaexDataFrame at 0x20e0cffcb20>

We can now check some attributes of both classes to see whether the information stayed the same.

Let's check three things:
- missing values are preserved
- the float data type is preserved
- categories are preserved

In [15]:
# Number of missing values in the second column ('b') of dfp
dfp.__dataframe__().get_column_by_name('b').null_count

1

In [16]:
# And the number of missing values in the second column ('b') of dfp_vaex
dfp_vaex.__dataframe__().get_column_by_name('b').null_count

1

In [17]:
# Data type of the 'c' column in dfp
dfp.__dataframe__().get_column_by_name('c').dtype

(<_DtypeKind.FLOAT: 2>, 64, '<f8', '=')

In [18]:
# Data type of the 'c' column in dfp_vaex
dfp_vaex.__dataframe__().get_column_by_name('c').dtype

(<_DtypeKind.FLOAT: 2>, 64, 'g', '=')

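Here the only difference is in the format string: the Vaex implementation returns the Apache Arrow format string 'g', while the Pandas implementation does not have that implemented yet and returns the NumPy-style string '<f8' instead.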
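
Because the format strings differ, comparing the two dtype tuples with `==` reports a mismatch even though both describe a 64-bit float. One possible way to compare them, sketched here with a hypothetical helper `same_logical_dtype`, is to ignore the format string and look only at kind, bit width and byte order.

```python
FLOAT = 2  # kind code shown in the outputs above


def same_logical_dtype(dtype_a, dtype_b):
    """Compare kind, bit width and byte order, ignoring the format string."""
    kind_a, bits_a, _fmt_a, order_a = dtype_a
    kind_b, bits_b, _fmt_b, order_b = dtype_b
    return (int(kind_a), bits_a, order_a) == (int(kind_b), bits_b, order_b)


pandas_dtype = (FLOAT, 64, "<f8", "=")  # as returned for dfp
vaex_dtype = (FLOAT, 64, "g", "=")      # as returned for dfp_vaex

print(pandas_dtype == vaex_dtype)                    # False
print(same_logical_dtype(pandas_dtype, vaex_dtype))  # True
```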

In [19]:
# Categorical description of the 'b' column in dfp
dfp.__dataframe__().get_column_by_name('b').describe_categorical

(False, True, {0: 3, 1: 4, 2: 5})

In [20]:
# Categorical description of the 'b' column in dfp_vaex
dfp_vaex.__dataframe__().get_column_by_name('b').describe_categorical

(False, True, {0: 3, 1: 4, 2: 5})
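
To show how such a description can be used, the sketch below decodes integer category codes back into values. It assumes the tuple reads as (is_ordered, is_dictionary, mapping) and that missing entries are stored with the code -1, which is the sentinel Pandas uses for categorical data. `decode_categorical` and the `codes` list are hypothetical and only mirror column 'b' from the example.

```python
def decode_categorical(codes, description, missing_code=-1):
    """Map category codes to their values using the dictionary mapping."""
    is_ordered, is_dictionary, mapping = description
    if not is_dictionary:
        raise ValueError("this sketch needs a dictionary-style mapping")
    return [None if code == missing_code else mapping[code] for code in codes]


description = (False, True, {0: 3, 1: 4, 2: 5})
codes = [0, -1, 2]  # assumed codes for column 'b': 3, missing, 5

print(decode_categorical(codes, description))  # [3, None, 5]
```

Category 4 stays in the mapping even though no row uses it any more, which is why both outputs above still list three categories.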

---

<img src="Blog_picture_5.png" width="700">

Thank you for reading through the Notebook!

If you are interested in the topic and want to follow the information on and the development of the dataframe protocol project, please visit https://data-apis.org/blog/ or https://data-apis.org/dataframe-protocol/latest/index.html.