# Working steps
First working steps in getting to know the DataFrame Interchange Protocol and implementing it in the Vaex library for simple int and float columns, without missing values or chunking.

- **2021/07/12**
Researching `test_float_only()` from `pandas_implementation.py`. Getting to know Vaex by loading data into a Vaex dataframe and researching the possible ways to do it. I started with two options: constructing a dataframe from `np.array` (`vaex.from_arrays()`) and from `dict` (`vaex.from_dict()`).

- **2021/07/13**
Adding a `from_dataframe()` method that calls `_from_dataframe()` if `df` is not a `vaex` dataframe. A dataframe in `vaex` is of type `vaex.dataframe.DataFrame`. `DataFrameObject` is of type `Any` (from the `typing` library). `_VaexDataFrame` was also added.

- **2021/07/14**
Adding `_from_dataframe()` method.

    The method goes through the columns one by one, by column name. It checks the data type of each column and then converts the column to a type that `vaex` supports. I will have to check which arrays are supported in `vaex` (ndarray, Arrow array, ...). The two methods used in the Pandas implementation are `convert_column_to_ndarray()` and `convert_categorical_column()`. I will deal with both for `vaex` in the next step.

    For now I set up the `_VaexColumn` class, which takes `expressions` as columns. Two methods added to the `_VaexDataFrame` class are `column_names()`, which gives a list of column names, and `get_column_by_name()`, which selects an `expression/column` by its `name`.

    The `dtype` property of the `_VaexColumn` class is also set, copy/pasted from the Pandas implementation. I will see if I get into trouble later; I still need to check the possible dtypes of Vaex.

- **2021/07/15**
Adding the check for more than one chunk of data, which is set to 1 by default in the current implementation (line 36 and 246).

    I started with the `convert_column_to_ndarray()` method, where numerical and boolean columns are turned into NumPy arrays just like in the Pandas implementation. All the columns are then joined together by `vaex.from_dict(columns)`, where `columns` is a dictionary of all the columns in the dataframe, keyed by name. I will deal with categorical columns later.
    
    The attributes `offset` and `describe_null` are set by default to 0 and a `(kind, value)` tuple respectively by this implementation of the protocol (line 68 and 71). I will probably need to change the tuple depending on how Vaex handles missing data. For now I am keeping it as it is.

- **2021/07/16**
Buffers!! Hurray! 😏

    When we want to go from the `__dataframe__` class to the `vaex` dataframe class, we iterate over the columns and define them as NumPy arrays using NumPy buffers, then join all the columns from NumPy arrays back into a Vaex dataframe. There is no need to locate or understand Vaex buffers. I copy/pasted from Ralf's Pandas implementation.

    The transformation of a Vaex dataframe with float columns back into a Vaex dataframe is successful. The values are the same, the structure is the same. I will learn to use pytest in Jupyter Notebook to test equality of values and also buffer location.

    Testing buffer location isn't very clear to me. It looks like there is a copy being made even in the Pandas implementation. I have to check manually if the values in both dataframes change together.
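
One way to do the manual check described above is to change a value in the source array and see whether the reconstructed array changes with it. The sketch below uses NumPy only (no Vaex or Pandas), and `view_from_pointer` is a hypothetical helper that mirrors what `buffer_to_ndarray` does for a single column.

```python
import ctypes
import numpy as np

def view_from_pointer(ptr: int, n_items: int, np_dtype) -> np.ndarray:
    """Hypothetical helper: wrap the memory at `ptr` as a 1-D ndarray without copying."""
    ctypes_type = np.ctypeslib.as_ctypes_type(np_dtype)
    data_pointer = ctypes.cast(ptr, ctypes.POINTER(ctypes_type))
    return np.ctypeslib.as_array(data_pointer, shape=(n_items,))

source = np.array([1.5, 2.5, 3.5])
ptr = source.__array_interface__['data'][0]
view = view_from_pointer(ptr, source.size, np.float64)

# Same start address means the view points at the source's memory.
same_pointer = view.__array_interface__['data'][0] == ptr

# If no copy was made, a change in the source is visible through the view.
source[0] = 100.0
values_change_together = bool(view[0] == 100.0)

print(same_pointer, values_change_together)
```

If a copy had been made anywhere between `source` and `view`, both flags would come out `False`.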

**2021/07/19&20**
Kshiteej confirmed copy is being made somewhere. I will have to find where the copy is being made.

- **buffer_to_ndarray**
The copy is not being made in the `buffer_to_ndarray` method (tested by collecting the data manually).

- **to dataframe!**
The other possibility is that the copy is made when the dataframe is assembled from columns that are built out of buffers. And that is the case! Even `pd.DataFrame` makes a copy unless the argument `copy=False` is added and the version of Pandas is 1.3.0 (it doesn't work in 1.2.5).
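
To narrow down which step introduces a copy, it can help to compare data pointers after every step. The sketch below is a NumPy-only analogue and does not test Pandas or Vaex; `data_pointer` and `is_zero_copy` are hypothetical helper names.

```python
import numpy as np

def data_pointer(arr: np.ndarray) -> int:
    """Address of the first element of `arr`."""
    return arr.__array_interface__['data'][0]

def is_zero_copy(before: np.ndarray, after: np.ndarray) -> bool:
    """True if both arrays start at the same memory address."""
    return data_pointer(before) == data_pointer(after)

x = np.array([1.5, 2.5, 3.5])

# Each entry is one candidate step in a conversion pipeline.
steps = {
    "np.asarray(x)": np.asarray(x),   # returns the input ndarray unchanged
    "x[:]": x[:],                     # a view on the same memory
    "np.array(x)": np.array(x),       # copies by default
    "x.astype(np.float64)": x.astype(np.float64),  # astype copies by default
}

report = {name: is_zero_copy(x, result) for name, result in steps.items()}
for name, zero_copy in report.items():
    print(f"{name:22s} zero-copy: {zero_copy}")
```

The same two helpers could be applied to the arrays obtained before and after each stage of `from_dataframe_to_vaex`, assuming each stage exposes a NumPy array.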

**2021/07/21**
Changed the Vaex dataframe protocol so that it doesn't make a copy either, then tested a round trip: Vaex -> Pandas -> Vaex. It works!!! Hurray! 🎉

**2021/07/22**
Testing that the protocol works for int, float and boolean. 
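
For int, float and boolean columns the protocol describes a dtype by its kind and bit width, and the consumer maps that pair back to a NumPy dtype. The following is a minimal sketch of that round trip with NumPy dtypes only; `DtypeKind`, `describe_dtype` and `restore_dtype` are illustrative names, and the format string and endianness fields are left out.

```python
import enum
import numpy as np

class DtypeKind(enum.IntEnum):
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20

# NumPy kind character -> protocol kind
NP_KIND_TO_PROTOCOL = {'i': DtypeKind.INT, 'u': DtypeKind.UINT,
                       'f': DtypeKind.FLOAT, 'b': DtypeKind.BOOL}

# (protocol kind, bit width) -> NumPy scalar type
PROTOCOL_TO_NP = {
    DtypeKind.INT: {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64},
    DtypeKind.UINT: {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64},
    DtypeKind.FLOAT: {32: np.float32, 64: np.float64},
    DtypeKind.BOOL: {8: np.bool_},
}

def describe_dtype(np_dtype):
    """Producer side: NumPy dtype -> (kind, bit width)."""
    np_dtype = np.dtype(np_dtype)
    kind = NP_KIND_TO_PROTOCOL.get(np_dtype.kind)
    if kind is None:
        raise NotImplementedError(f"Data type {np_dtype} not handled in this sketch")
    return kind, np_dtype.itemsize * 8

def restore_dtype(kind, bitwidth):
    """Consumer side: (kind, bit width) -> NumPy dtype."""
    return np.dtype(PROTOCOL_TO_NP[kind][bitwidth])

for name in ("int8", "int64", "uint16", "float32", "float64", "bool"):
    kind, bitwidth = describe_dtype(name)
    print(name, kind.name, bitwidth, restore_dtype(kind, bitwidth) == np.dtype(name))
```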

In [4]:
import vaex
import pyarrow as pa
import numpy as np

import enum
import collections
import ctypes
from typing import Any, Optional, Tuple, Dict, Iterable, Sequence

DataFrameObject = Any
ColumnObject = Any

def from_dataframe_to_vaex(df : DataFrameObject) -> vaex.dataframe.DataFrame:
    """
    Construct a vaex DataFrame from ``df`` if it supports ``__dataframe__``
    """
    # NOTE: commented out for roundtrip testing
    # if isinstance(df, vaex.dataframe.DataFrame):
    #     return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe_to_vaex(df.__dataframe__())

def _from_dataframe_to_vaex(df : DataFrameObject) -> vaex.dataframe.DataFrame:
    """
    Note: not all cases are handled yet, only ones that can be implemented with
    only Pandas. Later, we need to implement/test support for categoricals,
    bit/byte masks, chunk handling, etc.
    """
    # Check number of chunks, if there's more than one we need to iterate
    # Alenka: it is set to 1 anyways??
    if df.num_chunks() > 1:
        raise NotImplementedError

    # We need a dict of columns here, with each column being a numpy/arrow array (at
    # least for now, deal with non-numpy/arrow dtypes later).
    columns = dict()
    _k = _DtypeKind
    for name in df.column_names():
        col = df.get_column_by_name(name)
        if col.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
            # Simple numerical or bool dtype, turn into numpy array
            columns[name] = convert_column_to_ndarray(col)
        #elif col.dtype[0] == _k.CATEGORICAL:
        #    columns[name] = convert_categorical_column(col)
        else:
            raise NotImplementedError(f"Data type {col.dtype[0]} not handled yet")

    return vaex.from_dict(columns)

class _DtypeKind(enum.IntEnum):
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21   # UTF-8
    DATETIME = 22
    CATEGORICAL = 23
    
def convert_column_to_ndarray(col : ColumnObject) -> np.ndarray:
    """
    Convert an int, uint, float or bool column to a numpy array
    """
    if col.offset != 0:
        raise NotImplementedError("column.offset > 0 not handled yet")

    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")

    _buffer, _dtype = col.get_data_buffer()
    return buffer_to_ndarray(_buffer, _dtype)

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    # Handle the dtype
    kind = _dtype[0]
    bitwidth = _dtype[1]
    _k = _DtypeKind
    if _dtype[0] not in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
        raise RuntimeError("Not a boolean, integer or floating-point dtype")

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    # No DLPack yet, so need to construct a new ndarray from the data pointer
    # and size in the buffer plus the dtype on the column
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))

    # NOTE: `x` does not own its memory, so the caller of this function must
    #       either make a copy or hold on to a reference of the column or
    #       buffer! (not done yet, this is pretty awful ...)
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth//8),))

    return x

def __dataframe__(self, nan_as_null : bool = False) -> "_VaexDataFrame":
    """
    The public method to attach to vaex.dataframe.DataFrame
    We'll attach it via monkeypatching here for demo purposes. If vaex adopts
    the protocol, this will be a regular method on vaex.dataframe.DataFrame.
    ``nan_as_null`` is a keyword intended for the consumer to tell the
    producer to overwrite null values in the data with ``NaN`` (or ``NaT``).
    This currently has no effect; once support for nullable extension
    dtypes is added, this value should be propagated to columns.
    """
    return _VaexDataFrame(self, nan_as_null=nan_as_null)

# Monkeypatch the Vaex DataFrame class to support the interchange protocol
vaex.dataframe.DataFrame.__dataframe__ = __dataframe__

# Implementation of interchange protocol
# --------------------------------------

class _VaexBuffer:
    """
    Data in the buffer is guaranteed to be contiguous in memory.
    """

    def __init__(self, x : np.ndarray) -> None:
        """
        Handle only regular columns (= numpy arrays) for now.
        """
        if not x.strides == (x.dtype.itemsize,):
            # Array is not contiguous - this is possible to get in Pandas,
            # there was some discussion on whether to support it. Some extra
            # complexity for libraries that don't support it (e.g. Arrow),
            # but would help with numpy-based libraries like Pandas.
            raise RuntimeError("Design needs fixing - non-contiguous buffer")

        # Store the numpy array in which the data resides as a private
        # attribute, so we can use it to retrieve the public attributes
        self._x = x
    
    @property
    def bufsize(self) -> int:
        """
        Buffer size in bytes
        """
        return self._x.size * self._x.dtype.itemsize
    
    @property
    def ptr(self) -> int:
        """
        Pointer to start of the buffer as an integer
        """
        return self._x.__array_interface__['data'][0]

class _VaexColumn:
    """
    A column object, with only the methods and properties required by the
    interchange protocol defined.
    A column can contain one or more chunks. Each chunk can contain either one
    or two buffers - one data buffer and (depending on null representation) it
    may have a mask buffer.
    Note: this Column object can only be produced by ``__dataframe__``, so
          doesn't need its own version or ``__column__`` protocol.
    """

    def __init__(self, column : vaex.expression.Expression) -> None:
        """
        Note: doesn't deal with extension arrays yet, just assume an 
        expression for now.
        """
        if not isinstance(column, vaex.expression.Expression):
            raise NotImplementedError("Columns of type {} not handled "
                                      "yet".format(type(column)))

        # Store the column as a private attribute
        self._col = column
        
    @property
    def offset(self) -> int:
        """
        Offset of first element. Always zero.
        """
        return 0
        
    @property
    def dtype(self) -> Tuple[enum.IntEnum, int, str, str]:
        """
        Dtype description as a tuple ``(kind, bit-width, format string, endianness)``
        Kind :
            - INT = 0
            - UINT = 1
            - FLOAT = 2
            - BOOL = 20
            - STRING = 21   # UTF-8
            - DATETIME = 22
            - CATEGORICAL = 23
        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format.
        Endianness : currently only native endianness (``=``) is supported
        Notes:
            - Kind specifiers are aligned with DLPack where possible (hence the
              jump to 20, leave enough room for future extension)
            - Masks must be specified as boolean with either bit width 1 (for bit
              masks) or 8 (for byte masks).
            - Dtype width in bits was preferred over bytes
            - Endianness isn't too useful, but included now in case in the future
              we need to support non-native endianness
            - Went with Apache Arrow format strings over NumPy format strings
              because they're more complete from a dataframe perspective
            - Format strings are mostly useful for datetime specification, and
              for categoricals.
            - For categoricals, the format string describes the type of the
              categorical in the data buffer. In case of a separate encoding of
              the categorical (e.g. an integer to string mapping), this can
              be derived from ``self.describe_categorical``.
            - Data types not included: complex, Arrow-style null, binary, decimal,
              and nested (list, struct, map, union) dtypes.
        """
        dtype = self._col.dtype
        return self._dtype_from_vaexdtype(dtype)
    
    def _dtype_from_vaexdtype(self, dtype) -> Tuple[enum.IntEnum, int, str, str]:
        """
        See `self.dtype` for details
        """
        # Alenka TODO!!
        
        
        # Note: 'c' (complex) not handled yet (not in array spec v1).
        #       'b', 'B' (bytes), 'S', 'a', (old-style string) 'V' (void) not handled
        #       datetime and timedelta both map to datetime (is timedelta handled?)
        _k = _DtypeKind
        _np_kinds = {'i': _k.INT, 'u': _k.UINT, 'f': _k.FLOAT, 'b': _k.BOOL,
                     'U': _k.STRING,
                     'M': _k.DATETIME, 'm': _k.DATETIME}
        kind = _np_kinds.get(dtype.kind, None)
        if kind is None:
            # Not a NumPy dtype. The Pandas implementation checks for
            # pd.CategoricalDtype here; pandas is not imported in this
            # notebook and categoricals are not handled for Vaex yet.
            raise ValueError(f"Data type {dtype} not supported by exchange "
                             "protocol")

        if kind not in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL, _k.CATEGORICAL):
            raise NotImplementedError(f"Data type {dtype} not handled yet")

        bitwidth = dtype.numpy.itemsize * 8
        format_str = dtype.numpy.str
        endianness = dtype.byteorder if not kind == _k.CATEGORICAL else '='
        return (kind, bitwidth, format_str, endianness)
    
    @property
    def describe_null(self) -> Tuple[int, Any]:
        """
        Return the missing value (or "null") representation the column dtype
        uses, as a tuple ``(kind, value)``.
        Kind:
            - 0 : non-nullable
            - 1 : NaN/NaT
            - 2 : sentinel value
            - 3 : bit mask
            - 4 : byte mask
        Value : if kind is "sentinel value", the actual value. None otherwise.
        """
        _k = _DtypeKind
        kind = self.dtype[0]
        value = None
        if kind == _k.FLOAT:
            null = 1  # np.nan
        elif kind == _k.DATETIME:
            null = 1  # np.datetime64('NaT')
        elif kind in (_k.INT, _k.UINT, _k.BOOL):
            # TODO: check if extension dtypes are used once support for them is
            #       implemented in this protocol code
            null = 0  # integer and boolean dtypes are non-nullable
        elif kind == _k.CATEGORICAL:
            # Null values for categoricals are stored as `-1` sentinel values
            # in the category data (e.g., `col.values.codes` is int8 np.ndarray)
            null = 2
            value = -1
        else:
            raise NotImplementedError(f'Data type {self.dtype} not yet supported')

        return null, value
    
    def get_data_buffer(self) -> Tuple[_VaexBuffer, Any]:  # Any is for self.dtype tuple
        """
        Return the buffer containing the data.
        """
        _k = _DtypeKind
        if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
            buffer = _VaexBuffer(self._col.to_numpy())
            dtype = self.dtype
        elif self.dtype[0] == _k.CATEGORICAL:
            codes = self._col.values.codes
            buffer = _VaexBuffer(codes)
            dtype = self._dtype_from_vaexdtype(codes.dtype)
        else:
            raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

        return buffer, dtype
    
class _VaexDataFrame:
    """
    A data frame class, with only the methods required by the interchange
    protocol defined.
    Instances of this (private) class are returned from
    ``vaex.dataframe.DataFrame.__dataframe__`` as objects with the methods and
    attributes defined on this class.
    """
    def __init__(self, df : vaex.dataframe.DataFrame, nan_as_null : bool = False) -> None:
        """
        Constructor - an instance of this (private) class is returned from
        `vaex.dataframe.DataFrame.__dataframe__`.
        """
        self._df = df
        # ``nan_as_null`` is a keyword intended for the consumer to tell the
        # producer to overwrite null values in the data with ``NaN`` (or ``NaT``).
        # This currently has no effect; once support for nullable extension
        # dtypes is added, this value should be propagated to columns.
        self._nan_as_null = nan_as_null
        
    def num_chunks(self) -> int:
        return 1
        
    def column_names(self) -> Iterable[str]:
        return self._df.get_column_names()
    
    def get_column_by_name(self, name: str) -> _VaexColumn:
        return _VaexColumn(self._df[name])

---
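
The contiguity check and the `bufsize` computation in `_VaexBuffer` can be tried out on plain NumPy arrays. This is a small sketch with hypothetical helper names (`is_contiguous_1d`, `bufsize_in_bytes`); it only covers the 1-D case that the class above handles.

```python
import numpy as np

def is_contiguous_1d(x: np.ndarray) -> bool:
    """Same test as in _VaexBuffer: one stride equal to the item size."""
    return x.strides == (x.dtype.itemsize,)

def bufsize_in_bytes(x: np.ndarray) -> int:
    """Buffer size in bytes, computed as in the bufsize property."""
    return x.size * x.dtype.itemsize

a = np.arange(6, dtype=np.int64)
every_other = a[::2]                         # a view that skips elements
packed = np.ascontiguousarray(every_other)   # a contiguous copy of that view

print(is_contiguous_1d(a), a.strides)                      # contiguous
print(is_contiguous_1d(every_other), every_other.strides)  # not contiguous
print(is_contiguous_1d(packed), packed.strides)            # contiguous again
print(bufsize_in_bytes(a), a.nbytes)
```

Making the data contiguous with `np.ascontiguousarray` costs a copy, which is why the class raises an error for such input instead of converting it silently.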

Adding some checks while constructing the methods.

**Construct `vaex` dataframe from `np.array` and checking if it is a `vaex` dataframe.** <br>
**Testing for different types of columns.** <br>
**Testing `.dtype` attribute.**

In [5]:
x = np.array([1.5, 2.5, 3.5])
y = np.array([9.2, 10.5, 11.8])
df = vaex.from_arrays(x=x, y=y)
df.x

# Print if df is vaex dataframe
text = ""
if isinstance(df, vaex.dataframe.DataFrame):
    text = "is" 
else:
    text = "is not"
print(f"This dataframe {text} of vaex type.")
    
# Print if a column is expression (array, arrow, ...)
text2 = ""
if isinstance(df.x, vaex.expression.Expression):
    text2 = "is" 
else:
    text2 = "is not"
print(f"This column {text2} an expression.")

# Print dtype
print(f"data type of a column: {df.x.dtype}")

This dataframe is of vaex type.
This column is an expression.
data type of a column: float64


**Constructing `vaex` dataframe from dictionary of numpy arrays.**

In [3]:
columns = dict()
for name in df.get_column_names():
    columns[name] = df[name].values
vaex.from_dict(columns)

#,x,y
0,1.5,9.2
1,2.5,10.5
2,3.5,11.8


# Unit tests

**Testing `from_dataframe()` method.**

In [4]:
from_dataframe_to_vaex(df)

#,x,y
0,1.5,9.2
1,2.5,10.5
2,3.5,11.8


In [6]:
df_transf = from_dataframe_to_vaex(df)

# Print if df is vaex dataframe
text = ""
if isinstance(df_transf, vaex.dataframe.DataFrame):
    text = "is" 
else:
    text = "is not"
print(f"This dataframe {text} of vaex type.")
    
# Print if a column is expression (array, arrow, ...)
text2 = ""
if isinstance(df_transf.x, vaex.expression.Expression):
    text2 = "is" 
else:
    text2 = "is not"
print(f"This column {text2} an expression.")

# Print the call of dtype
print(f"data type of a column: {df_transf.x.dtype}")

This dataframe is of vaex type.
This column is an expression.
data type of a column: float64


# Using pytest in Notebook with ipytest.

## 1. testing `from_dataframe_to_vaex`

In [7]:
import ipytest
ipytest.autoconfig()

In [8]:
# Researching how Maarten's code checks for dataframe equality (https://stackoverflow.com/questions/59875975/vaex-check-for-equality-between-two-frames)
# In Vaex testing, expressions are turned into lists and then the lists are asserted (I will use this in later versions)
df_1 = vaex.from_dict(data=dict(a=[1.5, 2.5, 3.5], b=[9.2, 10.5, 11.8]))
df_2 = from_dataframe_to_vaex(df_1)

df_ = df_1.join(df_2, rprefix='rhs_')  # join based on rows number
print(df_)

# Simple print to see the result
print((df_['a'] != df_["rhs_" + 'a']).values)
print((df_['a'] != df_["rhs_" + 'a']).sum() == 0)

  #    a     b    rhs_a    rhs_b
  0  1.5   9.2      1.5      9.2
  1  2.5  10.5      2.5     10.5
  2  3.5  11.8      3.5     11.8
[False False False]
True


In [9]:
# Defining above code as a function to use it
# NOTE: it might neglect missing values!

def func_equality_dataframes(df1, df2):
    df = df1.join(df2, rprefix='rhs_')  # join based on rows number
    column_names = df1.get_column_names()
    
    check = all((df[name] != df["rhs_" + name]).sum() == 0 for name in column_names)
    return check 
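
The note about missing values matters because, in NumPy, `NaN != NaN` is `True`, so counting `!=` results reports a mismatch even when both columns hold `NaN` in the same row. The sketch below shows a NaN-aware comparison on plain NumPy arrays. `columns_equal` is a hypothetical helper, and whether Vaex expressions follow the same semantics is an assumption that is not verified here.

```python
import numpy as np

def columns_equal(left, right, equal_nan: bool = True) -> bool:
    """Compare two 1-D columns; optionally treat NaN in the same position as equal."""
    left = np.asarray(left)
    right = np.asarray(right)
    if left.shape != right.shape:
        return False
    if equal_nan and left.dtype.kind == 'f' and right.dtype.kind == 'f':
        both_nan = np.isnan(left) & np.isnan(right)
        return bool(np.all((left == right) | both_nan))
    return bool(np.all(left == right))

a = np.array([1.5, np.nan, 3.5])
b = a.copy()

# The "count the mismatches" approach sees the NaN row as a difference.
naive_mismatches = int((a != b).sum())

# The NaN-aware comparison treats the two columns as equal.
nan_aware_equal = columns_equal(a, b)

print(naive_mismatches, nan_aware_equal)
```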

In [11]:
%%run_pytest[clean]
# testing from_dataframe on vaex.from_arrays

def test_dataframe_float1():
    df1 = vaex.from_arrays(x=x, y=y)
    df2 = from_dataframe_to_vaex(df1)

    assert func_equality_dataframes(df1,df2)

.                                                                                                                [100%]
1 passed in 0.04s


In [12]:
%%run_pytest[clean]
# testing from_dataframe on vaex.from_dict

def test_dataframe_float2():
    df1 = vaex.from_dict(data=dict(a=[1.5, 2.5, 3.5], b=[9.2, 10.5, 11.8]))
    df2 = from_dataframe_to_vaex(df1)
    
    assert func_equality_dataframes(df1,df2)

.                                                                                                                [100%]
1 passed in 0.02s



## 2. testing two different Vaex dataframes


In [13]:
%%run_pytest[clean]

def test_dataframe_float3():
    y2 = np.array([9.2, 10.5, 11])
    df1 = vaex.from_arrays(x=x, y=y)
    df2 = vaex.from_arrays(x=x, y=y2)
    
    assert not func_equality_dataframes(df1,df2)

.                                                                                                                [100%]
1 passed in 0.04s


### Testing buffer pointer

In [14]:
%%run_pytest[clean]
# Testing buffer pointer for complete from_dataframe method

def test_buffer_location():
    b1 = df.x.to_numpy().__array_interface__['data'][0]
    b2 = from_dataframe_to_vaex(df).x.to_numpy().__array_interface__['data'][0]

    assert b1 == b2

.                                                                                                                [100%]
1 passed in 0.02s


In [15]:
%%run_pytest[clean]
# Testing buffer pointer for buffer_to_ndarray part of the method

def test_buffer_to_ndarray():

    # Test data
    #x = np.array([1.5, 2.5, 3.5])
    #y = np.array([9.2, 10.5, 11.8])
    #df = vaex.from_arrays(x=x, y=y)
    
    # _buffer.ptr
    buffer_start = df.x.to_numpy().__array_interface__['data'][0]
    print(buffer_start)
    
    # get_data_buffer
    buffer = _VaexBuffer(df.x.to_numpy())
    dtype = df.x.dtype
    
    print(dtype.kind)
    print(dtype.numpy.itemsize)
    print(dtype.numpy.str)
    
    # column_type
    kind = 2  # dtype.kind = 'f' -> _DtypeKind.FLOAT = 2
    bitwidth = dtype.numpy.itemsize * 8

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    print(column_dtype)
    
    # bufsize
    bufsize = df.x.values.size*df.x.dtype.numpy.itemsize
    
    # getting array from buffer 
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(buffer_start, ctypes.POINTER(ctypes_type))
    x_new = np.ctypeslib.as_array(data_pointer,
                              shape=(bufsize // (bitwidth//8),))
    
    # What is the pointer?
    x_new.__array_interface__['data'][0]
    
    df_new = vaex.from_arrays(x=x_new, y=y)
    buffer_end = df_new.x.to_numpy().__array_interface__['data'][0]
    
    assert buffer_start == buffer_end

.                                                                                                                [100%]
1 passed in 0.01s
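
The `NOTE` in `buffer_to_ndarray` says that the returned array does not own its memory, so the caller has to either copy it or keep the original column or buffer alive. The sketch below shows both options with NumPy and `ctypes` only; `ColumnView` is a hypothetical wrapper and not part of Vaex or the protocol.

```python
import ctypes
import numpy as np

owner = np.array([1.5, 2.5, 3.5])
ptr = owner.__array_interface__['data'][0]

# Rebuild an array from the raw pointer, as buffer_to_ndarray does.
data_pointer = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_double))
view = np.ctypeslib.as_array(data_pointer, shape=(owner.size,))
view_owns_data = bool(view.flags['OWNDATA'])

# Option 1: make a copy, which owns its memory but is no longer zero-copy.
safe_copy = view.copy()
copy_owns_data = bool(safe_copy.flags['OWNDATA'])
copy_is_elsewhere = safe_copy.__array_interface__['data'][0] != ptr

# Option 2: stay zero-copy and hold a reference to the owner next to the view.
class ColumnView:
    """Hypothetical wrapper that keeps the owning array alive with its view."""
    def __init__(self, values: np.ndarray, owner: np.ndarray) -> None:
        self.values = values
        self._owner = owner  # prevents the memory from being freed

kept = ColumnView(view, owner)

print(view_owns_data, copy_owns_data, copy_is_elsewhere)
```

Option 2 keeps the pointer test above meaningful, since the data never moves, at the price of tying the lifetime of the source array to the result.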
