# Demonstration of Arkouda DataFrame vs Pandas DataFrame
The Arkouda DataFrame is intended to offer the same functionality as the pandas DataFrame.

Functionality Currently Supported by Both 
- Drop
- Index Reset
- Renaming Columns
- Head/Tail Functionality
- Groupby
- Sorting
- Permutations

Only supported by Arkouda
- Filter by Range
- Automatic reindexing
    - This is great for simplicity, but there may be cases where a user does not want this automatic reindexing.
    - Suggest making this a flag instead

Pandas Features to add to Arkouda
- inplace argument - allows the user to specify whether an action should update the calling object or return a new object
    - Drop
    - Renaming
    - Index Reset
    - Sorting
    
- Missing Groupby Functionality
    - First
    - Access Group By Name

- Renaming
    - Row/Index Renaming

- Reset Index
    - Track original indexes

- Sorting
    - Sort on Rows

- Drop
    - Provide axis, which allows for row or column to be dropped (https://github.com/Bears-R-Us/arkouda/issues/1165)
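
As a point of reference for the missing pieces listed above, the following pandas-only sketch shows `first()`, `get_group()`, and row (index) renaming. The frame and the label `'row_a'` are illustrative assumptions, not data from this notebook, and nothing here describes how Arkouda would implement these features.

```python
import pandas as pd

# Illustrative frame; names and values are assumptions for this sketch.
pdf = pd.DataFrame({'user_name': ['Alice', 'Bob', 'Alice', 'Carol'],
                    'amount': [0.5, 0.6, 1.1, 1.2]})

grouped = pdf.groupby('user_name')

# first(): first non-null value of each column within every group
first_rows = grouped.first()

# get_group(): access the rows of a single group by its key
alice = grouped.get_group('Alice')

# rename(index=...): rename a row label instead of a column
renamed_rows = pdf.rename(index={0: 'row_a'})
```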
    

In [None]:
import arkouda as ak
ak.connect(connect_url="tcp://localhost:5555")

In [None]:
import pandas as pd
import numpy as np

from IPython.display import display, clear_output
from ipywidgets import HTML, HBox, VBox, Output

In [None]:
def reset_ak_df():
    username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
    userid = ak.array([111, 222, 111, 333, 222, 111])
    item = ak.array([0, 0, 1, 1, 2, 0])
    day = ak.array([5, 5, 6, 5, 6, 6])
    amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
    df = ak.DataFrame({'user_name': username, 'user_id': userid,
                    'item': item, 'day': day, 'amount': amount})
    return df

def reset_pd_df():
    username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
    userid_pd = [111, 222, 111, 333, 222, 111]
    item_pd = [0, 0, 1, 1, 2, 0]
    day_pd = [5, 5, 6, 5, 6, 6]
    amount_pd = [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]
    pd_df = pd.DataFrame({'user_name': username_pd, 'user_id': userid_pd,
                    'item': item_pd, 'day': day_pd, 'amount': amount_pd})
    return pd_df

## Create a DataFrame
The only visible difference is that Arkouda prints the dimensions when displaying the data.

In pandas, the shape of the data can be accessed via the .shape property.
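
A minimal pandas-only sketch of the `.shape` property (the frame `toy` is a made-up example, not part of the notebook data):

```python
import pandas as pd

# Illustrative frame; the name `toy` and its values are assumptions.
toy = pd.DataFrame({'userName': ['Alice', 'Bob', 'Carol'],
                    'amount': [0.5, 0.6, 1.2]})

# .shape is a (number_of_rows, number_of_columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
```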

In [None]:
#Arkouda DataFrame
username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
userid = ak.array([111, 222, 111, 333, 222, 111])
item = ak.array([0, 0, 1, 1, 2, 0])
day = ak.array([5, 5, 6, 5, 6, 6])
amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
df = ak.DataFrame({'userName': username, 'userID': userid,
                   'item': item, 'day': day, 'amount': amount})

#Pandas DataFrame
username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
userid_pd = [111, 222, 111, 333, 222, 111]
item_pd = [0, 0, 1, 1, 2, 0]
day_pd = [5, 5, 6, 5, 6, 6]
amount_pd = [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]
pd_df = pd.DataFrame({'userName': username_pd, 'userID': userid_pd,
                   'item': item_pd, 'day': day_pd, 'amount': amount_pd})

#it is easy to access the shape in pandas
#display(pd_df.shape)

#display configuration
ak_create_out = Output()
pd_create_out = Output()
with ak_create_out:
    display(df)
with pd_create_out:
    display(pd_df)
    display(pd_df.shape)
ak_box = VBox([HTML("<h3>Arkouda</h3>"), ak_create_out])
pd_box = VBox([HTML("<h3>Pandas</h3>"), pd_create_out])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Create DataFrame</h1>
    <p>Creation of ak.DataFrame and pandas.DataFrame is very similar. It is worth noting that Arkouda will automatically supply the shape when displaying, but Pandas does not. It is easy enough to view the shape in Pandas using DataFrame.shape. This is done below to show the same functionality for both.</p>
'''

parent_container = VBox([HTML(html), container])
display(parent_container)

## Drop Functionality
It is worth noting that pandas has more drop functionality:
- Drop on both axes (drop row or column)
    - Able to drop by index and name (Arkouda can only drop by name currently)
- Option to return a new `pd.DataFrame` object or perform the drop on the existing object

### Arkouda
- Drop an entire column of data
    - Please note that the `index` field cannot be dropped
- Drop duplicate entries
    - Notice that here entries are automatically reindexed
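
The pandas options described above can be sketched as follows. This is a pandas-only illustration on a made-up frame; it does not show Arkouda behavior.

```python
import pandas as pd

# Illustrative frame; values are assumptions for this sketch.
pdf = pd.DataFrame({'userName': ['Alice', 'Bob', 'Carol'],
                    'userID': [111, 222, 333],
                    'item': [0, 1, 2]})

# axis=1 drops a column by name and returns a new frame
no_name = pdf.drop('userName', axis=1)

# axis=0 (the default) drops a row by its index label
no_first_row = pdf.drop(0, axis=0)

# inplace=True mutates the calling object and returns None
result = pdf.drop('item', axis=1, inplace=True)
```

After the last call, `pdf` itself no longer has the `item` column, while `no_name` and `no_first_row` are unaffected because they were created as new objects.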

In [None]:
#Drop entire column - arkouda
df.drop('userName')

#drop entire column - pandas
pd_df.drop('userName', axis=1, inplace=True)

#Display
ak_drop_output = Output()
pd_drop_output = Output()

with ak_drop_output:
    display(df)
with pd_drop_output:
    display(pd_df)
    display(pd_df.shape)

ak_drop_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>The "index" column cannot be dropped</li>
    </ul>
''')

pd_drop_info = HTML('''
    <h3>Pandas</h3>
    <ul>
        <li>Drop rows and columns using the axis parameter. 0 - drop rows, 1 - drop columns. Defaults to 0.</li>
        <li>inplace parameter allows for the action to be applied to the calling object or the result returned as a new object.</li>
    </ul>
''')

ak_box = VBox([ak_drop_info, ak_drop_output])
pd_box = VBox([pd_drop_info, pd_drop_output])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Drop Functionality</h1>
    <p>In its current configuration, ak.DataFrame does not have the full feature support of pandas.DataFrame. These features are in the process of being added.</p>
    <p>In both examples below, the userName column has been dropped.</p>
'''

parent_container = VBox([HTML(html), container])
display(parent_container)

In [None]:
#create dataframe with duplicate rows - Arkouda
username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
userid = ak.array([111, 222, 111, 333, 222, 111])
item = ak.array([0, 1, 0, 2, 1, 0])
day = ak.array([5, 5, 5, 5, 5, 5])
df = ak.DataFrame({'userName': username, 'userID': userid,
                    'item': item, 'day': day})

#deduplicate the dataset - Arkouda
dedup = df.drop_duplicates()

#create DataFrame with duplicate rows - pandas
username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
userid_pd = [111, 222, 111, 333, 222, 111]
item_pd = [0, 1, 0, 2, 1, 0]
day_pd = [5, 5, 5, 5, 5, 5]
pd_df = pd.DataFrame({'userName': username_pd, 'userID': userid_pd,
                    'item': item_pd, 'day': day_pd})

dedup_pd = pd_df.drop_duplicates()

#Configure display
ak_dedup_before = Output()
ak_dedup_after = Output()
pd_dedup_before = Output()
pd_dedup_after = Output()

with ak_dedup_before:
    display(df)
with ak_dedup_after:
    display(dedup)

ak_dedup_compare = HBox([VBox([HTML("<h4>Before</h4>"), ak_dedup_before]), VBox([HTML("<h4>After</h4>"), ak_dedup_after])])
ak_dedup_info = '''
    <h3>Arkouda</h3>
    <ul>
        <li>Rows are automatically reindexed after deduplication</li>
    </ul>
'''

with pd_dedup_before:
    display(pd_df)
    display(pd_df.shape)
with pd_dedup_after:
    display(dedup_pd)
    display(dedup_pd.shape)

pd_dedup_compare = HBox([VBox([HTML("<h4>Before</h4>"), pd_dedup_before]), VBox([HTML("<h4>After</h4>"), pd_dedup_after])])
pd_dedup_info = '''
    <h3>Pandas</h3>
    <ul>
        <li>Indexes are the same as in the original DataFrame.</li>
    </ul>
'''

ak_box = VBox([HTML(ak_dedup_info), ak_dedup_compare])
pd_box = VBox([HTML(pd_dedup_info), pd_dedup_compare])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>De-duplication</h1>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)
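
For comparison, pandas can produce a rebased index after de-duplication when asked to. The sketch below is pandas-only and uses a made-up frame; it is meant to mirror the reindexed result shown for Arkouda, not to describe Arkouda internals.

```python
import pandas as pd

# Illustrative frame with one duplicated row (rows 0 and 2).
pdf = pd.DataFrame({'userName': ['Alice', 'Bob', 'Alice', 'Carol'],
                    'userID': [111, 222, 111, 333]})

# Default: surviving rows keep their original index labels (0, 1, 3)
kept = pdf.drop_duplicates()

# ignore_index=True relabels the result 0..n-1
rebased = pdf.drop_duplicates(ignore_index=True)
```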

## Reset Index
This is useful when dealing with slices or other subsets of a DataFrame.

It is worth noting that there are situations where `ak.DataFrame` will automatically perform reindexing for you.
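
A short pandas-only sketch of resetting the index on a subset, using a made-up frame. By default pandas keeps the old labels in a new `index` column; `drop=True` discards them.

```python
import pandas as pd

# Illustrative frame; values are assumptions for this sketch.
pdf = pd.DataFrame({'userName': ['Alice', 'Bob', 'Alice', 'Carol'],
                    'amount': [0.5, 0.6, 1.1, 1.2]})

# A positional subset keeps the original labels 1 and 3
subset = pdf.iloc[[1, 3]]

# Default reset: labels become 0..n-1, old labels move to an 'index' column
tracked = subset.reset_index()

# drop=True: labels become 0..n-1 and the old labels are discarded
dropped = subset.reset_index(drop=True)
```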

In [None]:
df = reset_ak_df()

#next line is for display purposes only
ak_idx_before = Output()

#taking a slice will keep the indexes as they are
slice_df = df[[1, 3, 5]]
#display
with ak_idx_before:
    display(slice_df)

#resetting the index will rebase at 0 - Arkouda
slice_df.reset_index()

#for pandas we reuse the dedup result from pandas in the previous step
re_idx = dedup_pd.reset_index(inplace=False)


#Configure display
ak_idx_after = Output()
pd_idx_before = Output()
pd_idx_after = Output()
with ak_idx_after:
    display(slice_df)

ak_idx_compare = HBox([VBox([HTML("<h4>Before</h4>"), ak_idx_before]), VBox([HTML("<h4>After</h4>"), ak_idx_after])])
ak_idx_info = '''
    <h3>Arkouda</h3>
    <ul>
        <li>Arkouda will automatically reindex for you in some situations.</li>
    </ul>
'''

with pd_idx_before:
    display(dedup_pd)
    display(dedup_pd.shape)
with pd_idx_after:
    display(re_idx)
    display(re_idx.shape)

pd_idx_compare = HBox([VBox([HTML("<h4>Before</h4>"), pd_idx_before]), VBox([HTML("<h4>After</h4>"), pd_idx_after])])
pd_idx_info = '''
    <h3>Pandas</h3>
    <ul>
        <li>Option to return a new object or perform on the calling object.</li>
        <li>Generates an additional column named "index" to store the original index</li>
    </ul>
'''

ak_box = VBox([HTML(ak_idx_info), ak_idx_compare])
pd_box = VBox([HTML(pd_idx_info), pd_idx_compare])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Reset Indexes</h1>
    <p>Resetting the index rebases the indexes at 0. This is useful in situations when you are dealing with subsets of the original data.</p>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)

## Renaming Columns

In [None]:
#next 3 lines are for display purposes only
ak_rename_before = Output()
with ak_rename_before:
    display(df)

#rename userName to user_name and userID to user_id - Arkouda
df.rename({'userName':'user_name', 'userID': 'user_id'})

#reset back to original pandas dataframe
pd_df = reset_pd_df()

#rename columns - Pandas
renamed_pd = pd_df.rename(columns={'userName':'user_name', 'userID':'user_id'}, inplace=False)

#Configure display
ak_rename_after = Output()
pd_rename_before = Output()
pd_rename_after = Output()
with ak_rename_after:
    display(df)

with pd_rename_before:
    display(pd_df)
    display(pd_df.shape)
with pd_rename_after:
    display(renamed_pd)
    display(renamed_pd.shape)

ak_rename_compare = HBox([VBox([HTML("<h4>Before</h4>"), ak_rename_before]), VBox([HTML("<h4>After</h4>"), ak_rename_after])])
ak_rename_info = '''
    <h3>Arkouda</h3>
'''

pd_rename_compare = HBox([VBox([HTML("<h4>Before</h4>"), pd_rename_before]), VBox([HTML("<h4>After</h4>"), pd_rename_after])])
pd_rename_info = '''
    <h3>Pandas</h3>
    <ul>
        <li>Option to return a new object or perform on the calling object.</li>
        <li>Allows for index (row) renaming.</li>
    </ul>
'''

ak_box = VBox([HTML(ak_rename_info), ak_rename_compare])
pd_box = VBox([HTML(pd_rename_info), pd_rename_compare])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Column Renaming</h1>
    <p>This functionality essentially works the same in both modules. Pandas does provide some additional functionality at this time.</p>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)

## Head/Tail Functionality

In [None]:
#compute head - arkouda
df_head = df.head(3)

#compute head - pandas
pd_head = pd_df.head(3)

#Display
ak_head_output = Output()
pd_head_output = Output()

with ak_head_output:
    display(df_head)
with pd_head_output:
    display(pd_head)
    display(pd_head.shape)

ak_head_info = HTML('''
    <h3>Arkouda</h3>
''')

pd_head_info = HTML('''
    <h3>Pandas</h3>
''')

ak_box = VBox([ak_head_info, ak_head_output])
pd_box = VBox([pd_head_info, pd_head_output])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Head of Dataset</h1>
'''

parent_container = VBox([HTML(html), container])
display(parent_container)


In [None]:
#tail - Arkouda
df_tail = df.tail(2)

#tail - pandas
pd_tail = pd_df.tail(2)

#Configure Display
ak_tail_output = Output()
pd_tail_output = Output()

with ak_tail_output:
    display(df_tail)
with pd_tail_output:
    display(pd_tail)
    display(pd_tail.shape)

ak_tail_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>Indexes automatically rebased to 0.</li>
    </ul>
''')

pd_tail_info = HTML('''
    <h3>Pandas</h3>
    <ul>
        <li>No automatic reindexing. Indexes are the same as in the original dataset.</li>
    </ul>
''')

ak_box = VBox([ak_tail_info, ak_tail_output])
pd_box = VBox([pd_tail_info, pd_tail_output])

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Tail of Dataset</h1>
'''

parent_container = VBox([HTML(html), container])
display(parent_container)

## GroupBy
Please note:
- Output here differs between `Arkouda` and `Pandas`
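
To make the difference in output shape concrete, the sketch below uses NumPy as a stand-in for the "parallel arrays" style of result (one array of keys, one array of counts) and compares it with the pandas result. `np.unique` is only an analogy chosen for this illustration; it is not how Arkouda computes its GroupBy.

```python
import numpy as np
import pandas as pd

names = np.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])

# Parallel arrays: keys[i] occurs counts[i] times
keys, counts = np.unique(names, return_counts=True)

# pandas returns a single labeled object instead
sizes = pd.DataFrame({'user_name': names}).groupby('user_name').size()

# The two arrays can be combined into a frame for a pandas-like view
as_frame = pd.DataFrame({'user_name': keys, 'count': counts})
```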

In [None]:
#Single Column Group By - Arkouda
gb = df.GroupBy('user_name')
keys, count = gb.count()

#Single Column Group By - Pandas 
pd_gb = pd_df.groupby('user_name')

#Configure Display
ak_gb_output = Output()
pd_gb_output = Output()

with ak_gb_output:
    display(HTML(f'<b>GroupBy Keys:</b> {keys}'))
    display(HTML(f'<b>GroupBy Counts:</b> {count}'))
with pd_gb_output:
    display(pd_gb.first())

ak_gb_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>Stores groupings as two parallel arrays. The first holds the distinct values of the column grouped on; these are the keys. The second holds the number of records for each key. The indexes of the two arrays correspond, i.e. count[i] is the number of records whose key is keys[i].</li>
        <li>Currently, there is no visual representation equivalent to that of the pandas module.</li>
    </ul>
''')

pd_gb_info = HTML('''
    <h3>Pandas</h3>
    <ul>
        <li>Data is able to be viewed in a format that represents a DataFrame.</li>
        <li>More robust access options: first(), last(), get_group()</li>
        <li>Options to keep a specific occurrence of the record.</li>
    </ul>
''')

ak_box = VBox([ak_gb_info, ak_gb_output])
pd_box = VBox([pd_gb_info, pd_gb_output])
ak_box.layout.max_width = '50%'

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>GroupBy - Single Column</h1>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)

In [None]:
#Multiple Column Grouping - Arkouda
gb = df.GroupBy(['user_name', 'day'])
keys, count = gb.count()

#Multiple Column Grouping - Pandas
pd_gb = pd_df.groupby(['user_name', 'day'])

#Configure Display
ak_gb_output = Output()
pd_gb_output = Output()

with ak_gb_output:
    display(HTML(f'<b>GroupBy Keys:</b> {keys}'))
    display(HTML(f'<b>GroupBy Counts:</b> {count}'))
with pd_gb_output:
    display(pd_gb.first())

ak_gb_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>Here, keys is a list of ak.pdarray, one per column specified in the groupby call. Think of each key as a tuple: given [array1, array2, ..., arrayN], the key at index 0 is (array1[0], array2[0], ..., arrayN[0]).</li>
    </ul>
''')

pd_gb_info = HTML('''
    <h3>Pandas</h3>
''')

ak_box = VBox([ak_gb_info, ak_gb_output])
pd_box = VBox([pd_gb_info, pd_gb_output])
ak_box.layout.max_width = '50%'

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>GroupBy - Multiple Columns</h1>
    <p>Grouping on multiple columns works the same way as grouping on a single column, but the result has a different structure in Arkouda.</p>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)

## Sorting

In [None]:
#arg/coarg sorting in Arkouda
arg_s = df.argsort(key='user_name')

coarg_s = df.coargsort(keys=['user_id', 'amount'])

ak_arg_output = Output()
with ak_arg_output:
    display(HTML(f'<b>Arg Sort Permutation:</b> {arg_s}'))
    display(HTML(f'<b>Coarg Sort Permutation:</b> {coarg_s}'))

html = HTML('''
    <h1>Arkouda Arg/Coarg Sorting</h1>
    <p>Returns the permutation of indexes for the resulting sort.</p>
    <p><b>Coarg sort is not guaranteed for string values.</b></p>
''')

parent_container = VBox([html, ak_arg_output])
clear_output()
display(parent_container)

In [None]:
#Sort values - arkouda
s = df.sort_values(by='user_id')
s_mult = df.sort_values(by=['user_id', 'amount'])

#sort values pandas
s_pd = pd_df.sort_values(by=['user_id'])
s_pd_mult = pd_df.sort_values(by=['user_id', 'amount'])

ak_sort_output = Output()
pd_sort_output = Output()

with ak_sort_output:
    display(HTML('<b>Single Column</b>'), s)
    display(HTML('<b>Multiple Column</b>'), s_mult)

with pd_sort_output:
    display(HTML('<b>Single Column</b>'), s_pd, s_pd.shape)
    display(HTML('<b>Multiple Column</b>'), s_pd_mult, s_pd_mult.shape)

ak_sort_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>This function relies on argsort and coargsort. However, it goes one step further in applying the permutation to the DataFrame allowing for visualization similar to pandas.</li>
        <li>Automatic reindexing to rebase at 0.</li>
    </ul>
''')

pd_sort_info = HTML('''
    <h3>Pandas</h3>
    <ul>
        <li>Sorting can be done on both the rows and columns. Use the axis parameter to specify.</li>
        <li>Ascending and descending sorts</li>
        <li>Sorting can be performed on the calling object or return a new object</li>
        <li>Indexes remain same as the original dataframe.</li>
    </ul>
''')

ak_box = VBox([ak_sort_info, ak_sort_output])
pd_box = VBox([pd_sort_info, pd_sort_output])
ak_box.layout.max_width = '50%'

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Sorting</h1>
    <p>This functionality automatically chooses between argsort and coargsort.</p>
    <p>Single column sorting performed on the user_id column.</p>
    <p>Multiple column sorting is performed on the user_id and amount columns.</p>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)
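
The idea of "compute a permutation, then apply it" can be sketched with NumPy and pandas. Here `np.argsort` and `np.lexsort` stand in for a single-key and a multi-key argsort; they are analogies picked for this illustration, not Arkouda's implementation. The frame reuses the `user_id` and `amount` values from this notebook.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'user_id': [111, 222, 111, 333, 222, 111],
                    'amount': [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]})

# Single key: a stable argsort returns the row order of the sorted data
perm_single = np.argsort(pdf['user_id'].to_numpy(), kind='stable')

# Multiple keys: np.lexsort treats the LAST key as the primary key,
# so the keys are passed in reverse order of priority
perm_multi = np.lexsort((pdf['amount'].to_numpy(),
                         pdf['user_id'].to_numpy()))

# Apply the permutation and rebase the index at 0
sorted_rebased = pdf.iloc[perm_multi].reset_index(drop=True)

# pandas sort_values keeps the original index labels instead
sorted_pandas = pdf.sort_values(by=['user_id', 'amount'])
```

The index of `sorted_pandas` is the same sequence as `perm_multi`, which is one way to see that a sort is a permutation of the original rows.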


## Permutations

### Arkouda
- Arkouda permutations will automatically reindex 
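
A pandas-only sketch of the two behaviors on a made-up frame: `reindex` reorders rows and keeps their labels, while positional selection followed by `reset_index(drop=True)` gives a result whose index is rebased at 0.

```python
import pandas as pd

# Illustrative frame; values are assumptions for this sketch.
pdf = pd.DataFrame({'user_name': ['Alice', 'Bob', 'Carol'],
                    'amount': [0.5, 0.6, 1.2]})
perm = [1, 0, 2]

# reindex returns a new frame ordered by the given labels; labels are kept
reordered = pdf.reindex(perm)

# Positional reorder, then rebase the index at 0
rebased = pdf.iloc[perm].reset_index(drop=True)
```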

In [None]:
#apply permutation - Arkouda
df.apply_permutation(ak.array([1, 0, 2, 3, 4, 5]))

#apply permutation - Pandas
perm = pd_df.reindex([1, 0, 2, 3, 4, 5])

ak_perm_output = Output()
pd_perm_output = Output()

with ak_perm_output:
    display(df)

with pd_perm_output:
    display(perm)

ak_perm_info = HTML('''
    <h3>Arkouda</h3>
    <ul>
        <li>Automatically reindexes rows</li>
    </ul>
''')

pd_perm_info = HTML('''
    <h3>Pandas</h3>
    <ul>
        <li>No automatic reindexing performed.</li>
        <li>reindex returns a new object with the permutation applied; the calling object is left unchanged.</li>
    </ul>
''')

ak_box = VBox([ak_perm_info, ak_perm_output])
pd_box = VBox([pd_perm_info, pd_perm_output])
ak_box.layout.max_width = '50%'

container = HBox([ak_box, pd_box])
container.layout.display = 'flex'
container.layout.justify_content = 'space-around'

html = '''
    <h1>Permutations</h1>
    <p>Order the data based on the order of the indexes supplied. Pandas refers to this as 'reindexing'.</p>
'''

parent_container = VBox([HTML(html), container])
clear_output()
display(parent_container)

## Filter By Range - Arkouda
- Filter on the count of the groupby. Returns boolean pdarray 
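
pandas has no method of this name, but a rough equivalent can be sketched by computing each row's group size and testing it against the range. The sketch assumes both bounds are inclusive, which is an assumption made for this illustration rather than something stated in this notebook.

```python
import pandas as pd

pdf = pd.DataFrame({'user_id': [111, 222, 111, 333, 222, 111]})

# Size of the group that each row belongs to, aligned with the rows
group_size = pdf.groupby('user_id')['user_id'].transform('size')

# Boolean mask: True where the group size lies in [low, high]
# (inclusive bounds are an assumption of this sketch)
low, high = 1, 2
mask = group_size.between(low, high)
```

Here `user_id` 111 appears three times, so its rows are `False`; 222 (twice) and 333 (once) fall inside the range.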

In [None]:
gb = df.GroupBy('user_id')
keys, count = gb.count()
#True for rows whose user_id group count falls in the given range
in_range = df.filter_by_range(keys=['user_id'], low=1, high=2)

ak_filter_output = Output()
with ak_filter_output:
    display(HTML(f'<b>GroupBy Keys:</b> {keys}'))
    display(HTML(f'<b>GroupBy Counts:</b> {count}'))
    display(HTML(f'<b>Filter Results:</b> {in_range}'))

html = HTML('''
    <h1>Arkouda Filter By Range</h1>
    <p>Run groupby on the keys specified and return True/False if the corresponding count falls in the range specified. Returns a list of booleans indicating the status of the key at each index.</p>
''')

parent_container = VBox([html, ak_filter_output])
clear_output()
display(parent_container)