# Exploring the New York City Taxi Data with Arkouda + Pandas/NumPy

This notebook shows some examples of how to interoperate between Pandas and Arkouda at a small scale, with a few GB of data on a workstation. The same notebook would run against a multi-node Arkouda instance on an HPC system with terabytes of data.

Arkouda is not trying to replace Pandas but to allow Pandas-style operations at a much larger scale. In our experience, Pandas can handle dataframes of up to about **500 million rows** on a sufficiently capable compute server before performance becomes a real issue. Arkouda breaks the shared-memory paradigm and scales its operations to distributed dataframes with **hundreds of billions of rows**, maybe even a trillion. In practice, we have run Arkouda server operations on columns of one trillion elements across 512 compute nodes, which yielded a **>20TB dataframe** in Arkouda.
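
For a sense of scale, here is a rough back-of-the-envelope calculation. It assumes 8-byte values (true for `int64` and `float64` columns only) and ignores string columns and any server overhead, so treat the numbers as an order-of-magnitude sketch rather than a measurement.

```python
# Back-of-the-envelope memory estimate.
# Assumption: every value takes 8 bytes (int64 / float64 columns only).
BYTES_PER_VALUE = 8
TB = 10**12
GB = 10**9

def column_terabytes(n_rows: int) -> float:
    """Size in TB of one 8-byte numeric column with n_rows elements."""
    return n_rows * BYTES_PER_VALUE / TB

one_column = column_terabytes(10**12)   # a single one-trillion-element column
per_node_gb = 20 * TB / 512 / GB        # a 20 TB dataframe spread over 512 nodes

print(f"one trillion-element column: {one_column:.0f} TB")
print(f"20 TB over 512 nodes: {per_node_gb:.1f} GB per node")
```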

**Outline**
- Data Preparation
  - Get Data
  - Convert Data
  - Load Data
- Data Exploration
  - Summarization
  - Histograms
  - Logical Indexing/Filtering
  - Time Data
  - Lookup Tables
  - GroupBy-Aggregate
  - Broadcast
  - Integrate with Pandas

# Data Preparation

## Download New York City Taxi Data
----------------------------------
[Yellow Trips Data Dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)

[NYC Yellow Taxi Trip Records Jan 2020](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv)

[Green Trips Data Dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf)

[NYC Green Taxi Trip Records Jan 2020](https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2020-01.csv)

[NYC Taxi Zone Lookup Table](https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv)

[NYC Taxi Zone Shapefile](https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip)

## Convert and Load Data

Currently, `ak.DataFrame` does not have a `from_pandas()` method. Because of this, we load the CSV with Pandas, then loop over the columns to build a dictionary that is used to construct an Arkouda DataFrame. In the future, these methods will be available.

#### Additional Dtypes in Arkouda
Raw data can be cast or converted to these types after loading:
* bool
* Datetime (from int64)
* Timedelta (from int64)

#### Prefer Integers!
They are fast and versatile (usable with GroupBy, Datetime, Timedelta, bit ops, etc.)
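
As a rough illustration of why integer storage works well for time data, the sketch below uses NumPy as a stand-in for Arkouda, so no server is needed. It shows that a NumPy timestamp is a count of nanoseconds since the Unix epoch and survives a round trip through `int64`. The sample timestamps are made up.

```python
import numpy as np

# Made-up pickup times; datetime64[ns] stores nanoseconds since 1970-01-01
pickups = np.array(['2020-01-01T00:15:00', '2020-01-01T08:30:00'],
                   dtype='datetime64[ns]')

as_int = pickups.astype('int64')             # raw nanosecond counts
restored = as_int.astype('datetime64[ns]')   # cast back to timestamps

print(as_int)
print((restored == pickups).all())
```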

### Describe Data Format

In [None]:
!head /Users/ethandebandi/Documents/test_data/green_tripdata_2020-01.csv

In [None]:
%%file NYCTaxi_format.py

import numpy as np

OPTIONS = {}

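def YNint(yn):
    # Map 'Y'/'YES' (any case) to 1 and anything else, including blanks, to 0
    return int(yn.strip().upper() in ('Y', 'YES'))

def nullint(x):
    # Parse an integer field; blank or malformed values become -1
    try:
        return np.int64(x)
    except (ValueError, TypeError):
        return np.int64(-1)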

yellow_format = {'sep': ',',
                 'header': 0,
                 'parse_dates':['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
                 'infer_datetime_format': True,
                 'converters': {'store_and_fwd_flag': YNint,
                                'VendorID': nullint,
                                'RatecodeID': nullint,
                                'PULocationID': nullint,
                                'DOLocationID': nullint,
                                'passenger_count': nullint,
                                'payment_type': nullint,
                                'trip_type': nullint}}

OPTIONS['yellow'] = yellow_format

green_format = yellow_format.copy()
green_format['parse_dates'] = ['lpep_dropoff_datetime', 'lpep_pickup_datetime']
OPTIONS['green'] = green_format
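
To see what per-column converters of this kind do to blank fields, here is a small standard-library sketch. The helper names `flag_to_int` and `int_or_missing` and the three-row sample are hypothetical and only mirror the idea of the format file; they are not part of it.

```python
import csv
import io

def flag_to_int(text: str) -> int:
    """Hypothetical converter: 1 for 'Y'/'YES' (any case), 0 otherwise."""
    return int(text.strip().upper() in ('Y', 'YES'))

def int_or_missing(text: str) -> int:
    """Hypothetical converter: parse an integer, using -1 for blank or bad fields."""
    try:
        return int(text)
    except ValueError:
        return -1

# Made-up sample with one blank VendorID and one blank flag
sample = "VendorID,store_and_fwd_flag\n2,N\n,Y\n1,\n"
rows = [
    {'VendorID': int_or_missing(r['VendorID']),
     'store_and_fwd_flag': flag_to_int(r['store_and_fwd_flag'])}
    for r in csv.DictReader(io.StringIO(sample))
]
print(rows)

# Substring membership accepts the empty string; tuple membership does not
print('' in 'YES', '' in ('Y', 'YES'))
```

Mapping missing values to a sentinel such as `-1` keeps the column a plain integer type, at the cost of having to filter the sentinel out before computing statistics.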

### CSV --> Pandas --> Arkouda
Pandas has a very good CSV reader, so we will use that.

Arkouda could eventually provide wrapper functions for the steps below, such as `ak.from_pandas()` and `ak.from_csv()`.

In [None]:
import pandas as pd
import NYCTaxi_format as taxi
import arkouda as ak
ak.connect(connect_url="tcp://localhost:5555")

In [None]:
pdgreen = pd.read_csv('/Users/ethandebandi/Documents/test_data/green_tripdata_2020-01.csv', **taxi.OPTIONS['green'])

In [None]:
def ak_from_pandas(df):
    # Build a dictionary mapping column names to Arkouda arrays
    ak_dict = {}
    for cname in df.keys():
        if df[cname].dtype.name == 'object':
            # Object columns hold Python strings, so request a string conversion
            ak_dict[cname] = ak.from_series(df[cname], dtype=str)
        else:
            ak_dict[cname] = ak.from_series(df[cname])

    return ak.DataFrame(ak_dict)


In [None]:
#Create an arkouda DataFrame
ak_green = ak_from_pandas(pdgreen)
ak_green

# Data Exploration

## Descriptive Statistics

In [None]:
def describe(x):
    fmt = 'mean: {}\nstd : {}\nmin : {}\nmax : {}'
    if x.dtype == ak.float64:
        fmt = fmt.format(*['{:.2f}' for _ in range(4)])
    print(fmt.format(x.mean(), x.std(), x.min(), x.max()))

In [None]:
describe(ak_green['fare_amount'])
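
The `describe` helper formats its template twice: the first pass fills the empty `{}` placeholders with format specs, and the second pass fills those specs with values. A plain-Python sketch of the same trick, using made-up numbers:

```python
# Stage 1: fill the empty placeholders with format specs
template = 'mean: {}\nstd : {}'
with_specs = template.format('{:.2f}', '{:.2f}')

# Stage 2: fill the resulting placeholders with the actual values
result = with_specs.format(13.14159, 2.71828)

print(with_specs)
print(result)
```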

## Histograms

In [None]:
import numpy as np
from matplotlib import pyplot as plt

def hist(x, bins, log=True):
    assert bins > 0
    # Compute histogram counts in arkouda
    h = ak.histogram(x, bins)
    # Compute bins in numpy
    if isinstance(x, ak.Datetime):
        # Matplotlib has trouble plotting np.datetime64 and np.timedelta64
        bins = ak.date_range(x.min(), x.max(), periods=bins).to_ndarray().astype('int')
    elif isinstance(x, ak.Timedelta):
        bins = ak.timedelta_range(x.min(), x.max(), periods=bins).to_ndarray().astype('int')
    else:
        bins = np.linspace(x.min(), x.max(), bins+1)[:-1]
    # Bring h over to numpy for plotting
    plt.bar(bins, h.to_ndarray(), width=bins[1]-bins[0])
    if log:
        plt.yscale('log')

In [None]:
hist(ak_green['fare_amount'], 100)
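
For numeric data, `hist` pairs the counts with the left edge of each bin, computed as `np.linspace(min, max, bins+1)[:-1]`. The NumPy-only sketch below checks that relationship on a handful of made-up values. It uses `np.histogram` as a local stand-in and makes no claim about what the Arkouda call returns.

```python
import numpy as np

values = np.array([0.0, 1.0, 1.5, 2.0, 9.0, 10.0])   # made-up fares
n_bins = 5

# np.histogram returns the counts and all n_bins + 1 bin edges
counts, edges = np.histogram(values, bins=n_bins)

# Left edge of each bin, computed the same way as in hist()
left_edges = np.linspace(values.min(), values.max(), n_bins + 1)[:-1]

print(counts)
print(left_edges)
```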

## Logical Indexing (Filters)
Find non-negative fares

In [None]:
nonneg = ak_green['fare_amount'] >= 0
print(f'{nonneg.sum() / nonneg.size :.1%} of fares are non-negative')

Select only non-negative fares for computation

In [None]:
describe(ak_green['fare_amount'][nonneg])

Make new data dict with only non-negative fares

In [None]:
data_nonneg = {k:v[nonneg] for k, v in ak_green.items()}
data_nonneg
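
The same filtering pattern can be tried locally with NumPy, which uses the same boolean-mask syntax. This is only a sketch with made-up columns: build one mask, then apply it to every column so the rows stay aligned.

```python
import numpy as np

# Made-up columns standing in for the taxi data
data = {
    'fare_amount': np.array([12.5, -3.0, 7.0, 0.0, -52.0]),
    'passenger_count': np.array([1, 2, 1, 4, 1]),
}

nonneg = data['fare_amount'] >= 0          # boolean mask, one entry per row
share = nonneg.sum() / nonneg.size         # True counts as 1 when summed

# Apply the same mask to every column
data_nonneg = {k: v[nonneg] for k, v in data.items()}

print(f'{share:.1%} of fares are non-negative')
print(data_nonneg)
```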

## Time Data

In [None]:
# TODO - add new column once append is merged
ride_duration = ak_green['lpep_dropoff_datetime'] - ak_green['lpep_pickup_datetime']
ride_duration

In [None]:
ride_duration.min(), ride_duration.max()

In [None]:
hist(ride_duration, 100)
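
Subtracting two datetime columns gives a timedelta column, which can then be rescaled to a convenient unit. A NumPy sketch of the same idea with made-up pickup and dropoff times:

```python
import numpy as np

# Made-up rides, including one that crosses midnight
pickup = np.array(['2020-01-01T10:00:00', '2020-01-01T23:50:00'],
                  dtype='datetime64[ns]')
dropoff = np.array(['2020-01-01T10:12:30', '2020-01-02T00:05:00'],
                   dtype='datetime64[ns]')

duration = dropoff - pickup                    # timedelta64[ns]
minutes = duration / np.timedelta64(1, 'm')    # durations as float minutes

print(duration.dtype)
print(minutes)
```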

## Taxi Zone Lookup Table

In [None]:
def cvt_to_string(v):
    try:
        if v == '':
            return 'N/A'
        else:
            return str(v)
    except:
        return 'N/A'

# read the taxi-zone-lookup-table
cvt = {'Borough':cvt_to_string, 'Zone':cvt_to_string, 'service_zone':cvt_to_string}
tzlut = pd.read_csv("/Users/ethandebandi/Documents/test_data/taxi+_zone_lookup.csv",converters=cvt)

# TODO - use ak.DataFrame once concat is merged

# location id is 1-based, index is 0-based
# fix it up to be aligned with index in data frame
# which means add row zero
top_row = pd.DataFrame({'LocationID': [0], 'Borough': ['N/A'], 'Zone': ['N/A'], 'service_zone': ['N/A']})
tzlut = pd.concat([top_row, tzlut]).reset_index(drop = True)

ak_tzlut = ak_from_pandas(tzlut)

In [None]:
ak_tzlut

### Apply Lookup Table

In [None]:
(ak_tzlut['LocationID'] == ak.arange(ak_tzlut['LocationID'].size)).all()

In [None]:
# TODO - add column to ak_green once concat/append merged to codebase
pu_borough = ak_tzlut['Borough'][ak_green['PULocationID']]
do_borough = ak_tzlut['Borough'][ak_green['DOLocationID']]

In [None]:
# TODO - add column to ak_green once concat/append merged to codebase
pu_zone = ak_tzlut['Zone'][ak_green['PULocationID']]
do_zone = ak_tzlut['Zone'][ak_green['DOLocationID']]

In [None]:
# TODO - print ak_green with data appended
pu_borough
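
Applying the lookup table is a gather: because row `i` of the table describes `LocationID == i`, indexing a table column with an array of location IDs returns one label per ride. A NumPy sketch with a toy table and made-up ride data:

```python
import numpy as np

# Toy lookup table; row 0 is the 'N/A' padding row so that position == LocationID
borough = np.array(['N/A', 'EWR', 'Queens', 'Bronx', 'Manhattan'])
location_id = np.arange(borough.size)

# Made-up pickup location IDs, one per ride
pu_location = np.array([4, 2, 4, 0, 3])

# The gather is only valid if IDs and row positions line up
aligned = bool((location_id == np.arange(location_id.size)).all())

pu_borough = borough[pu_location]   # one borough label per ride
print(aligned, pu_borough)
```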

## GroupBy: Construct a Graph

Directed graph from PULocationID --> DOLocationID

In [None]:
byloc = ak.GroupBy([ak_green['PULocationID'], ak_green['DOLocationID']])
byloc.unique_keys

Edge weight is number of rides

Aggregation methods of `GroupBy` return a tuple of (unique_keys, aggregate_values)

In [None]:
(u, v), w = byloc.count()
u, v, w
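
To make the edge list concrete, here is a NumPy sketch of the same computation on five made-up rides: each distinct (pickup, dropoff) pair becomes one edge, and its weight is the number of rides with that pair. `np.unique` stands in for the grouping step.

```python
import numpy as np

# Made-up pickup and dropoff location IDs for five rides
pu = np.array([1, 1, 2, 1, 2])
do = np.array([3, 3, 3, 4, 3])

# Treat each ride as a (pickup, dropoff) row and count the distinct rows
pairs = np.stack([pu, do], axis=1)
edges, weights = np.unique(pairs, axis=0, return_counts=True)

u, v = edges[:, 0], edges[:, 1]     # edge endpoints
print(u, v, weights)
```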

## Broadcast: Find Rides with Anomalous Fares

Compute mean and std of fare by (pickup, dropoff)

In [None]:
_, mf = byloc.mean(ak_green['fare_amount'])

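# Group variance is E[x^2] - E[x]^2; clip rounding noise below zero, then take the root
var = (byloc.sum(ak_green['fare_amount']**2)[1] / w) - mf**2
sf = ak.where(var > 0, var, 0.0) ** 0.5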
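
The expression built from the sum of squared fares relies on the usual identity for the variance within a group $g$, where $n_g$ is the ride count (`w`), $\mu_g$ the group mean (`mf`), and $x_i$ the individual fares:

$$
\sigma_g^2 \;=\; \frac{1}{n_g}\sum_{i \in g} (x_i - \mu_g)^2 \;=\; \frac{1}{n_g}\sum_{i \in g} x_i^2 \;-\; \mu_g^2,
\qquad
\sigma_g \;=\; \sqrt{\sigma_g^2}
$$

The difference of the two terms is the variance, so a standard deviation requires the final square root. In floating point the difference can come out slightly negative for groups whose fares are all equal.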

Broadcast group values back to ride dataframe to compute z-scores of rides

In [None]:
# TODO - add column to ak_green once concat/append merged to codebase
fare_mean = byloc.broadcast(mf, permute=True)
fare_std = byloc.broadcast(sf, permute=True)

fare_z = (ak_green['fare_amount'] - fare_mean) / (fare_std + 1)

hist(fare_z, 100)
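
Broadcasting copies each group's aggregate back to every row that belongs to that group, so per-ride and per-group values can be combined elementwise. A NumPy sketch of the whole z-score computation on made-up data, where `np.unique(..., return_inverse=True)` plays the role of the grouping and `means[inverse]` the role of the broadcast:

```python
import numpy as np

# Made-up data: a group id per ride (standing in for a pickup/dropoff pair) and its fare
group = np.array([0, 1, 0, 1, 1])
fare = np.array([10.0, 20.0, 14.0, 22.0, 60.0])

# inverse[i] is the position of ride i's group among the unique keys
keys, inverse, counts = np.unique(group, return_inverse=True, return_counts=True)

means = np.bincount(inverse, weights=fare) / counts
var = np.bincount(inverse, weights=fare**2) / counts - means**2
std = np.sqrt(np.maximum(var, 0.0))      # clip tiny negative rounding errors

# "Broadcast": index the per-group values by each ride's group position
z = (fare - means[inverse]) / (std[inverse] + 1)   # +1 as in the cell above
print(means, std, z)
```

The `+ 1` in the denominator keeps the score finite for groups whose fares are all identical, at the cost of shrinking every z-score slightly.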

## Bring Small Result Set Back to Pandas

In [None]:
# TODO - add once data is all merged in ak_green. See NYCTaxi_small.ipynb

## Disconnect from the server or shutdown the server

In [None]:
#ak.disconnect()
ak.shutdown()