# Demonstration of Arkouda DataFrame vs Pandas DataFrame

### For the purposes of this notebook, Arkouda will be referred to as `ak` and pandas will be referred to as `pd`

The Arkouda DataFrame is intended to provide the same functionality as the pandas DataFrame.

Functionality Currently Supported by Both 
- Drop
- Index Reset
- Renaming Columns
- Head/Tail Functionality
- Groupby
- Sorting
- Permutations

Only supported by Arkouda
- Filter by Range

Pandas Features to add to Arkouda
- Missing Groupby Functionality
    - First
    - Access Group By Name

- Reset Index
    - Track original indexes

- Sorting
    - Sort on Rows
    - `inplace` argument, allowing the user to specify whether the action should update the calling object or return a new object

## Import and connect to Arkouda

In [None]:
import arkouda as ak
ak.connect(connect_url="tcp://localhost:5555")

## Import Other Requirements

In [None]:
import pandas as pd
import numpy as np

#needed to prevent outputs from overwriting each other
from IPython.display import display

## Add Reset Functions
These helpers rebuild the example DataFrames so that the setup code does not have to be repeated in every section.

Note that the creation section does not use them, because it demonstrates how a DataFrame is created.

In [None]:
def reset_ak_df():
    username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
    userid = ak.array([111, 222, 111, 333, 222, 111])
    item = ak.array([0, 0, 1, 1, 2, 0])
    day = ak.array([5, 5, 6, 5, 6, 6])
    amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
    df = ak.DataFrame({'userName': username, 'userID': userid,
                    'item': item, 'day': day, 'amount': amount})
    return df

def reset_pd_df():
    username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
    userid_pd = [111, 222, 111, 333, 222, 111]
    item_pd = [0, 0, 1, 1, 2, 0]
    day_pd = [5, 5, 6, 5, 6, 6]
    amount_pd = [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]
    pd_df = pd.DataFrame({'userName': username_pd, 'userID': userid_pd,
                    'item': item_pd, 'day': day_pd, 'amount': amount_pd})
    return pd_df

## DataFrame Creation
`ak.DataFrame` and `pd.DataFrame` creation is very similar, but there are a few differences to note.

### Key Differences 
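- The main difference to take note of is that `ak` uses Arkouda arrays created with `ak.array` for defining column data (`ak.pdarray`, or `ak.Strings` for string data such as `userName`), while `pd` accepts plain Python lists.
- When displaying a `DataFrame`, `ak` displays the shape of the object; `pd` does not. The shape of a `pd.DataFrame` can be easily accessed through its `shape` attribute, e.g. `pd_df.shape`.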
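As a minimal, pandas-only sketch of the shape remark: `shape` is read from the DataFrame instance rather than from the `pd` module. The name `example_pd` and its data are illustrative and not part of the notebook.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the notebook's).
example_pd = pd.DataFrame({'userName': ['Alice', 'Bob', 'Carol'],
                           'userID': [111, 222, 333]})

# shape is an attribute of the DataFrame instance: (rows, columns)
n_rows, n_cols = example_pd.shape   # (3, 2)
```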

In [None]:
#Arkouda DataFrame
username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
userid = ak.array([111, 222, 111, 333, 222, 111])
item = ak.array([0, 0, 1, 1, 2, 0])
day = ak.array([5, 5, 6, 5, 6, 6])
amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
df = ak.DataFrame({'userName': username, 'userID': userid,
                   'item': item, 'day': day, 'amount': amount})

#Pandas DataFrame
username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
userid_pd = [111, 222, 111, 333, 222, 111]
item_pd = [0, 0, 1, 1, 2, 0]
day_pd = [5, 5, 6, 5, 6, 6]
amount_pd = [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]
pd_df = pd.DataFrame({'userName': username_pd, 'userID': userid_pd,
                   'item': item_pd, 'day': day_pd, 'amount': amount_pd})

display(df)
display(pd_df)

## Drop Columns
`ak` and `pd` behave the same when dropping columns; the output is essentially identical (with the exception of the differences noted in the `Creation` section). `pd` has some functionality that is not currently supported in `ak`. The `index` column cannot be dropped.

### Pandas Only Functionality
- Drop both rows and columns. For both, the `axis` parameter indicates which axis to perform the drop on: `0` (default) drops the specified rows, `1` drops the specified columns.
- Both allow the `drop` to be performed on the calling object or to return a new object. This is controlled by the boolean parameter `inplace`.
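The `axis` and `inplace` behaviour on the pandas side can be sketched as follows. This is an illustration with made-up data; the variable names are assumptions and do not appear elsewhere in the notebook.

```python
import pandas as pd

frame = pd.DataFrame({'userName': ['Alice', 'Bob', 'Carol'],
                      'userID': [111, 222, 333],
                      'amount': [0.5, 0.6, 1.2]})

# axis=0 (the default) drops rows by index label and returns a new object.
without_row = frame.drop(1, axis=0)

# axis=1 drops columns by name; frame itself is still unchanged here.
without_col = frame.drop('amount', axis=1)

# inplace=True modifies the calling object and returns None.
result = frame.drop('userName', axis=1, inplace=True)
```

After the last call, `frame` holds only `userID` and `amount`, while `without_row` and `without_col` are independent objects.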

In [None]:
#Drop entire column - arkouda
df.drop('userName', axis=1, inplace=True)
display(df)

#drop entire column - pandas
pd_df.drop('userName', axis=1, inplace=True)
display(pd_df)

## De-duplication
De-duplication removes any duplicate rows within the dataset. Both modules return new objects when dropping duplicates.
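A small pandas-only sketch of the "returns a new object" behaviour, using hypothetical data. It also shows that the surviving rows keep their original index labels, which is why an index reset is often applied afterwards.

```python
import pandas as pd

purchases = pd.DataFrame({'userName': ['Alice', 'Bob', 'Alice', 'Carol'],
                          'item': [0, 1, 0, 2]})

# Row 2 repeats row 0, so it is removed; the first occurrence is kept.
unique_rows = purchases.drop_duplicates()

# The original frame still has 4 rows; the result has 3,
# and its index labels are [0, 1, 3] (label 2 is missing).
remaining_labels = list(unique_rows.index)
```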

In [None]:
#create dataframe with duplicate rows - Arkouda
username = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
userid = ak.array([111, 222, 111, 333, 222, 111])
item = ak.array([0, 1, 0, 2, 1, 0])
day = ak.array([5, 5, 5, 5, 5, 5])
df = ak.DataFrame({'userName': username, 'userID': userid,
                    'item': item, 'day': day})

#deduplicate the dataset - Arkouda
dedup = df.drop_duplicates()
display(dedup)

#create DataFrame with duplicate rows - pandas
username_pd = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
userid_pd = [111, 222, 111, 333, 222, 111]
item_pd = [0, 1, 0, 2, 1, 0]
day_pd = [5, 5, 5, 5, 5, 5]
pd_df = pd.DataFrame({'userName': username_pd, 'userID': userid_pd,
                    'item': item_pd, 'day': day_pd})

#deduplicate the dataset - Pandas
dedup_pd = pd_df.drop_duplicates()
display(dedup_pd)

## Reset Index
This functionality rebases the indexes at 0. This ensures that the first row has index 0 and the nth row has index `n-1`. `ak` will automatically perform this action in certain situations.
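On the pandas side, the difference between tracking and discarding the original index can be sketched like this. The data and variable names are illustrative assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'userID': [111, 222, 333, 222]})
subset = frame.iloc[[1, 3]]              # keeps index labels 1 and 3

# Default: the old labels are preserved in a new 'index' column.
tracked = subset.reset_index()

# drop=True discards the old labels and simply rebases at 0.
rebased = subset.reset_index(drop=True)
```

`tracked` has the columns `index` and `userID`, so the original positions remain available after the rebase.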

In [None]:
df = reset_ak_df()
#taking a slice will keep the indexes as they are
slice_df = df[ak.array([1, 3, 5])]

#resetting the index will rebase at 0 - Arkouda
ak_re_idx = slice_df.reset_index(inplace=False)
display(ak_re_idx)

#for pandas we reuse the dedup result from pandas in the previous step
re_idx = dedup_pd.reset_index(inplace=False)
display(re_idx)

## Column Renaming

In [None]:
#rename userName to user_name and userID to user_id - Arkouda
renamed_ak = df.rename({'userName':'user_name', 'userID': 'user_id'}, axis=1, inplace=False)
display(renamed_ak)

#reset back to original pandas dataframe
pd_df = reset_pd_df()

#rename columns - Pandas
renamed_pd = pd_df.rename(columns={'userName':'user_name', 'userID':'user_id'}, inplace=False)
display(renamed_pd)

## Head/Tail of Dataset
Note: `ak.DataFrame.tail()` will perform a `reset_index()` automatically. 

In [None]:
#compute head - arkouda
df_head = renamed_ak.head(3)
display(df_head)

#compute head - pandas
pd_head = renamed_pd.head(3)
display(pd_head)

#tail - Arkouda
df_tail = renamed_ak.tail(2)
display(df_tail)

#tail - pandas
pd_tail = renamed_pd.tail(2)
display(pd_tail)

## GroupBy
`GroupBy` is the first instance where `ak` and `pd` outputs differ substantially. The sections below note things to be aware of with each module.

### Arkouda
- Stores groupings as two arrays. The first holds the values from the column grouped on; these are the keys. The second holds the number of records with the associated key. The two arrays are aligned by index, i.e. `keys[i]` is the key with `x` records, where `x` is the value in `count[i]`.
    - When grouping on multiple columns, the keys are returned as a list of `ak.pdarray`, one for each column specified in the groupby call. Think of this as a tuple for your key: given `[array1, array2, ..., arrayN]`, the key at index 0 is equivalent to `(array1[0], array2[0], ..., arrayN[0])`.
    - In the example below we are displaying `gb.count()`, which allows us to access the structure described here.
- Currently, there is no visual representation equivalent to that of the pandas module.
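The keys/counts layout described above can be mimicked in plain Python. This is only a sketch of the data structure, not Arkouda code: it does not call the Arkouda API, and the keys are sorted here purely to make the result deterministic, which may not match the order Arkouda returns.

```python
from collections import Counter

user_name = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice']
day = [5, 5, 6, 5, 6, 6]

# Single column: two aligned sequences, keys[i] <-> counts[i].
single = Counter(user_name)
keys = sorted(single)                      # ['Alice', 'Bob', 'Carol']
counts = [single[k] for k in keys]         # [3, 2, 1]

# Multiple columns: one key sequence per grouping column.
multi = Counter(zip(user_name, day))
pairs = sorted(multi)
name_keys = [name for name, _ in pairs]    # ['Alice', 'Alice', 'Bob', 'Bob', 'Carol']
day_keys = [d for _, d in pairs]           # [5, 6, 5, 6, 5]
multi_counts = [multi[p] for p in pairs]   # [1, 2, 1, 1, 1]

# The key at index 1 is the tuple (name_keys[1], day_keys[1]) == ('Alice', 6).
```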

### Pandas
- Data is able to be viewed in a format that represents a DataFrame.
- More robust access options: first(), last(), get_group()
- Options to keep a specific occurrence of the record.
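The pandas access options listed above can be sketched as follows, using a small hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'user_name': ['Alice', 'Bob', 'Alice', 'Carol'],
                      'amount': [0.5, 0.6, 1.1, 1.2]})
grouped = frame.groupby('user_name')

first_rows = grouped.first()          # first non-null value per column in each group
last_rows = grouped.last()            # last non-null value per column in each group
alice = grouped.get_group('Alice')    # all rows of one group, accessed by name
```

`first_rows` and `last_rows` are indexed by `user_name`, while `alice` keeps the original row labels (0 and 2).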


In [None]:
#Single Column Group By - Arkouda
gb = renamed_ak.GroupBy('user_name')
display(gb.count())

#Single Column Group By - Pandas 
pd_gb = renamed_pd.groupby('user_name')
display(pd_gb.first())

#Multiple Column Grouping - Arkouda
gb = renamed_ak.GroupBy(['user_name', 'day'])
display(gb.count())

#Multiple Column Grouping - Pandas
pd_gb = renamed_pd.groupby(['user_name', 'day'])
display(pd_gb.first())

## Arkouda Arg/Coarg Sorting
Returns the permutation of indices that would sort the dataframe.

Note: `coargsort` only guarantees grouping for string values, not lexicographical sorting.
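To illustrate what "the permutation of indices that would sort" means, here is a plain-Python sketch for numeric columns. The helpers `argsort_sketch` and `coargsort_sketch` are hypothetical stand-ins written for this illustration; they are not Arkouda functions and do not reproduce Arkouda's behaviour for strings.

```python
user_id = [111, 222, 111, 333, 222, 111]
amount = [0.5, 0.6, 1.1, 1.2, 4.3, 0.6]

def argsort_sketch(values):
    """Indices that would sort `values` (stable)."""
    return sorted(range(len(values)), key=lambda i: values[i])

def coargsort_sketch(columns):
    """Indices that sort rows by the first column, then the next, and so on."""
    return sorted(range(len(columns[0])),
                  key=lambda i: tuple(col[i] for col in columns))

perm_single = argsort_sketch(user_id)             # [0, 2, 5, 1, 4, 3]
perm_multi = coargsort_sketch([user_id, amount])  # [0, 5, 2, 1, 4, 3]
```

Reading `user_id` in the order given by `perm_single` yields `[111, 111, 111, 222, 222, 333]`; `perm_multi` additionally orders rows with equal `user_id` by `amount`.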

In [None]:
#arg/coarg sorting in Arkouda
arg_s = renamed_ak.argsort(key='user_name')
display(arg_s)

coarg_s = renamed_ak.coargsort(keys=['user_id', 'amount'])
display(coarg_s)

## Sorting
This functionality automatically selects whether to use `argsort` or `coargsort`.

Single column sorting performed on the user_id column.
Multiple column sorting is performed on the user_id and amount columns.

### Arkouda Notes
- This function relies on `argsort` and `coargsort`. However, it goes one step further by applying the permutation to the DataFrame, allowing for visualization similar to pandas.
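The two steps described here, computing a sorting permutation and then applying it to every column, can be sketched in plain Python. This illustrates the idea only; it is not Arkouda's implementation, and the column data is made up.

```python
columns = {
    'user_id': [111, 222, 111, 333],
    'amount': [1.1, 0.6, 0.5, 1.2],
}

# Step 1: permutation that sorts by user_id, then by amount.
n_rows = len(columns['user_id'])
perm = sorted(range(n_rows),
              key=lambda i: (columns['user_id'][i], columns['amount'][i]))

# Step 2: apply the same permutation to every column so rows stay aligned.
sorted_columns = {name: [values[i] for i in perm]
                  for name, values in columns.items()}
# perm == [2, 0, 1, 3]
```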

### Pandas Notes
- Sorting can be done on both the rows and columns. Use the axis parameter to specify.
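A short pandas sketch of the `axis` parameter, with hypothetical numeric data so that values within a row are comparable:

```python
import pandas as pd

scores = pd.DataFrame({'b': [2, 9], 'a': [3, 1], 'c': [1, 5]})

# axis=0 (default): reorder the rows by the values in column 'a'.
by_rows = scores.sort_values(by='a')

# axis=1: reorder the columns by the values in the row labelled 0.
by_cols = scores.sort_values(by=0, axis=1)
```

`by_rows` has its rows in the order `[1, 0]`, and `by_cols` has its columns in the order `c`, `b`, `a`.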

In [None]:
#Sort values - arkouda
s = renamed_ak.sort_values(by='user_id')
display(s)
s_mult = renamed_ak.sort_values(by=['user_id', 'amount'])
display(s_mult)

#sort values pandas
s_pd = renamed_pd.sort_values(by=['user_id'])
display(s_pd)
s_pd_mult = renamed_pd.sort_values(by=['user_id', 'amount'])
display(s_pd_mult)

## Permutations
A permutation reorders the data according to the order of the indexes you supply. `pd` refers to this as reindexing.
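As a plain-Python sketch of the idea, assuming the convention that position `k` of the result holds the row at `perm[k]`: a permutation can be applied and then undone with its inverse. The row labels below are placeholders.

```python
rows = ['r0', 'r1', 'r2', 'r3']
perm = [2, 0, 3, 1]

# Applying a permutation: position k of the result holds rows[perm[k]].
permuted = [rows[i] for i in perm]          # ['r2', 'r0', 'r3', 'r1']

# The inverse permutation restores the original order.
inverse = [0] * len(perm)
for position, source in enumerate(perm):
    inverse[source] = position              # inverse == [1, 3, 0, 2]
restored = [permuted[i] for i in inverse]   # ['r0', 'r1', 'r2', 'r3']
```

This is the same mechanism that lets a sorting permutation be reversed to "unsort" data.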

### Pandas Only Functionality
- `pd.DataFrame.reindex` returns a new object and leaves the calling object unchanged. In the example below, `ak.DataFrame.apply_permutation` is instead called without assignment and the calling object is displayed afterwards.

In [None]:
#apply permutation - Arkouda
renamed_ak.apply_permutation(ak.array([1, 0, 2, 3, 4, 5]))
display(renamed_ak)

#apply permutation - Pandas
perm = renamed_pd.reindex([1, 0, 2, 3, 4, 5])
display(perm)

## Filter By Range
This is only available in `ak`. `ak.DataFrame.filter_by_range()` runs `ak.DataFrame.GroupBy()` on the columns specified and returns True/False depending on whether the corresponding count falls in the range specified. The result is a list of booleans indicating the status of the key at each index.
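The semantics can be sketched in plain Python. This is an illustration, not Arkouda code, and it assumes that both `low` and `high` are inclusive bounds.

```python
from collections import Counter

user_id = [111, 222, 111, 333, 222, 111]
low, high = 1, 2

# Count how often each key occurs (the GroupBy count step).
counts = Counter(user_id)      # 111 -> 3, 222 -> 2, 333 -> 1

# One boolean per row: is the count of that row's key within [low, high]?
mask = [low <= counts[value] <= high for value in user_id]
# mask == [False, True, False, True, True, False]
```

Rows with `user_id` 111 are excluded because that key occurs three times, which is above `high`.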

The cell below displays the `GroupBy` counts first for better understanding.

In [None]:
gb = renamed_ak.GroupBy('user_id')
display(gb.count())

#named in_range to avoid shadowing the Python builtin filter
in_range = renamed_ak.filter_by_range(keys=['user_id'], low=1, high=2)
display(in_range)