TODO:
- Implement operator overloading where you choose an operator to allow chaining of operations
- Implement a number of example dplyr operations such as `select` (choose columns), `filter` (choose rows), and `summarise`. We can probably reuse a lot of the pandas functions that already exist.
- Compare the performance of the syntactic sugar with plain pandas; it should make no real difference.
- Implement the operations in more depth, e.g.:
    - Support helpers like `ends_with` in `select`. This should allow calls such as `select(name, ends_with('color'))`, which selects the column `name` plus all columns whose names end with the string `color`.
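
As a rough sketch of how the `ends_with` idea from the list above could work, the snippet below resolves a mix of plain column names and selector callables against a list of column names. The helper `resolve_columns` and the example column names are assumptions made for illustration only; nothing here is implemented in the notebook yet.

```python
def ends_with(suffix):
    """Return a selector that keeps the column names ending in `suffix`."""
    return lambda columns: [c for c in columns if c.endswith(suffix)]

def resolve_columns(columns, *selectors):
    """Expand plain names and selector callables into a list of column names.

    `resolve_columns` is a hypothetical helper, not part of pandas or dplyr.
    """
    selected = []
    for sel in selectors:
        names = sel(columns) if callable(sel) else [sel]
        for name in names:
            if name not in selected:  # keep the first occurrence, preserve order
                selected.append(name)
    return selected

columns = ['name', 'hair_color', 'skin_color', 'height']
print(resolve_columns(columns, 'name', ends_with('color')))
# ['name', 'hair_color', 'skin_color']
```

A `select` worker could then pass `resolve_columns(df.columns, ...)` to `df[...]`.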

# Operator overloading
We plan to use `>>` as the pipe symbol, like this:

    data >> function(arg1, arg2, etc)

Let's start with a simple operation, `select`. First we load some data to play around with:

In [5]:
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


`select` takes a pandas dataframe, and returns only the columns we need. The following function performs this operation outside the pipe chain:

In [52]:
def select_worker(df, cols):
    return df[cols]

select_worker(iris, ['sepal_length', 'sepal_width']).head()

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


Now we want to use the function above in a chain that has the following syntax:

    custom_df >> select(['sepal_length', 'sepal_width'])

To get this running we need to overload the `>>` operator. We do this by creating a custom class that subclasses the pandas dataframe:

In [53]:
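class custom_data_frame(pd.DataFrame):
    @property
    def _constructor(self):
        # pandas uses this to build the results of its own operations.
        # It points at SubclassedDataFrame (defined in a separate cell), not at this class.
        return SubclassedDataFrame
    def __rshift__(self, other):
        return other(self)   # Call the function on the RHS of the `>>` on the object on the LHS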

Note that this is a very simple class that adds nothing beyond the pipe operator, but for now this will suffice. The magic is in the `__rshift__` method, which defines what happens when `>>` is used with a `custom_data_frame` on the left-hand side.
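
To see the `__rshift__` mechanism in isolation, here is a minimal sketch that does not involve pandas at all. The class name `Piped` is made up for this illustration; it wraps any value and applies whatever callable appears on the right-hand side of `>>`.

```python
class Piped:
    """Hypothetical wrapper: `Piped(x) >> f` evaluates `f(x)` and re-wraps the result."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # Python calls this method for `self >> func`
        return Piped(func(self.value))

result = Piped([3, 1, 2]) >> sorted >> (lambda xs: xs[:2])
print(result.value)  # [1, 2]
```

Because every step returns a `Piped` again, the chain can be extended indefinitely.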

OK, let's give this a go:

In [54]:
custom_df >> select_worker(['sepal_length', 'sepal_width'])

TypeError: select_worker() missing 1 required positional argument: 'cols'

We run into the problem that `select_worker` expects the data as an explicit first argument, whereas in the pipe syntax the data is passed implicitly from the left-hand side of the expression. We can solve this with a functional programming trick, partial application: `select` fixes the `cols` argument and returns a function that only needs the dataframe:

In [57]:
from functools import partial

def select(cols):
    # Fix `cols` now; the dataframe is supplied later by `>>`
    return partial(select_worker, cols=cols)

custom_df = custom_data_frame(iris)

custom_df >> select(['sepal_length', 'sepal_width'])

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4
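
Writing a worker plus a `partial` wrapper for every operation gets repetitive. One possible way to cut the boilerplate is a small decorator that builds the wrapper automatically. This is only a sketch: the names `pipeable` and `take_first` are invented for this example and are not part of the notebook or of pandas.

```python
from functools import wraps

def pipeable(worker):
    """Turn worker(data, *args, **kwargs) into a factory whose result takes only data."""
    @wraps(worker)
    def factory(*args, **kwargs):
        # Defer the real call until the data arrives from the left-hand side of `>>`
        return lambda data: worker(data, *args, **kwargs)
    return factory

@pipeable
def take_first(items, n):
    """Hypothetical example operation: keep the first n items."""
    return items[:n]

step = take_first(2)          # no data yet, just the arguments
print(step([10, 20, 30]))     # [10, 20]
```

With such a decorator, `select` could be defined directly on top of the `select_worker` body without a separate wrapper function.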


This works, nice. Now let's add a second operation, `filter`, that filters rows:

In [58]:
def filter_worker(df, filters):
    return df.query(filters)

def filter(filters):
    return partial(filter_worker, filters=filters)

custom_df >> select(['sepal_length', 'sepal_width']) >> filter('sepal_length > 5.5')

TypeError: unsupported operand type(s) for >>: 'SubclassedDataFrame' and 'functools.partial'

Hmm, this doesn't work anymore. As the error message shows, the result of `select` is a `SubclassedDataFrame` and not the custom dataframe that supports `>>`, so the second `>>` fails. We can fix that by rewriting the worker functions to return the custom dataframe we defined:

In [59]:
def select_worker(df, cols):
    return custom_data_frame(df[cols]) # Note we return custom_data_frame to still have access to the `>>` operator later down the chain

def select(cols):
    return partial(select_worker, cols=cols)

def filter_worker(df, filters):
    return custom_data_frame(df.query(filters))

def filter(filters):
    return partial(filter_worker, filters=filters)

custom_df >> select(['sepal_length', 'sepal_width']) >> filter('sepal_length > 5.5')

Unnamed: 0,sepal_length,sepal_width
14,5.8,4.0
15,5.7,4.4
18,5.7,3.8
50,7.0,3.2
51,6.4,3.2
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


Re-wrapping the results fixes the issue. Next we want to add a pattern that is often used: `group_by` combined with some kind of summary function. First we add the `summarise` function:

In [60]:
def summarise_worker(df, sum_funcs):
    return custom_data_frame(df.agg(sum_funcs))

def summarise(sum_funcs):
    return partial(summarise_worker, sum_funcs=sum_funcs)

(custom_df 
        >> select(['sepal_length', 'sepal_width']) 
        >> filter('sepal_length > 5.5') 
        >> summarise('min'))

Unnamed: 0,0
sepal_length,5.6
sepal_width,2.2


In [61]:
def group_by_worker(df, by):
    return df.groupby(by)

def group_by(by):
    return partial(group_by_worker, by=by)

(custom_df 
    >> select(['sepal_length', 'sepal_width', 'species']) 
    >> filter('sepal_length > 5.5') 
    >> group_by('species') 
    >> summarise('min'))

TypeError: unsupported operand type(s) for >>: 'DataFrameGroupBy' and 'functools.partial'
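
The chain breaks because `df.groupby` returns a `DataFrameGroupBy` object, which has no `>>` support. One possible workaround, sketched below under the assumption that a thin wrapper is acceptable, is to hold the grouped object in a small class that defines `__rshift__` and to let `summarise` unwrap it. The names `PipeFrame` and `PipeGroupBy` and the toy data are made up for this illustration.

```python
import pandas as pd

class PipeFrame(pd.DataFrame):
    """Hypothetical stand-in for the custom dataframe with `>>` support."""
    @property
    def _constructor(self):
        return PipeFrame
    def __rshift__(self, func):
        return func(self)

class PipeGroupBy:
    """Hypothetical wrapper so a pandas GroupBy object can sit on the left of `>>`."""
    def __init__(self, grouped):
        self.grouped = grouped
    def __rshift__(self, func):
        return func(self)

def group_by(by):
    return lambda df: PipeGroupBy(df.groupby(by))

def summarise(sum_funcs):
    def worker(obj):
        # Aggregate either the wrapped GroupBy or a plain dataframe
        target = obj.grouped if isinstance(obj, PipeGroupBy) else obj
        return PipeFrame(target.agg(sum_funcs))
    return worker

df = PipeFrame({'species': ['a', 'a', 'b'], 'sepal_length': [5.7, 5.6, 6.0]})
result = df >> group_by('species') >> summarise('min')
print(result['sepal_length'].to_dict())  # {'a': 5.6, 'b': 6.0}
```

The trade-off is that every operation that should follow `group_by` needs to know about the wrapper.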

In [36]:
print(custom_df 
        >> select(['sepal_length', 'sepal_width', 'species']) 
        >> filter('sepal_length > 5.5') 
        >> group_by('species') 
        >> summarise('min') 
        >> select('sepal_length'))

species
setosa        5.7
versicolor    5.6
virginica     5.6
Name: sepal_length, dtype: float64


In [1]:
class A:
    def __init__(self, a):
        self.a = a
    def __gt__(self, other):
        # Called for `self > other`; compare the wrapped values
        return self.a > other.a

ob1 = A(2)
ob2 = A(3)
if(ob1>ob2):
    print("ob1 is greater than ob2")
else:
    print("ob2 is greater than ob1")

ob2 is greater than ob1


In [2]:
import pandas as pd

class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries
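
The same `_constructor` pattern can be combined with `__rshift__`. In the sketch below, `_constructor` returns the class itself, so dataframes produced by pandas operations such as column selection or `query` should keep their `>>` support without each worker having to re-wrap its result. The class name `PipeFrame` and the toy data are assumptions made for this example.

```python
import pandas as pd

class PipeFrame(pd.DataFrame):
    """Hypothetical dataframe subclass that survives pandas operations."""

    @property
    def _constructor(self):
        # Tell pandas to build dataframe results as PipeFrame again
        return PipeFrame

    def __rshift__(self, func):
        return func(self)

df = PipeFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
result = df >> (lambda d: d[['a']]) >> (lambda d: d.query('a > 1'))
print(type(result).__name__)  # PipeFrame
print(result['a'].tolist())   # [2, 3]
```

Operations that return something other than a dataframe, such as `groupby`, still need separate handling.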

