# Using New York City Taxi Data to Illustrate Arkouda with Pandas/NumPy

This notebook shows how to interoperate between Pandas and Arkouda. The examples are kept small so that the notebook can run on a 16GB laptop.
Remember, Arkouda is not trying to replace Pandas; it aims to allow some Pandas-style operations at a much larger scale. In our experience, Pandas can handle dataframes of up to about **500 million rows** before performance becomes a real issue, provided that you run on a sufficiently capable compute server. Arkouda breaks the shared-memory paradigm and scales its operations to dataframes with over **200 billion rows**, maybe even a trillion. In practice, we have run Arkouda server operations on columns of one trillion elements on 512 compute nodes, which yielded a **>20TB dataframe** in Arkouda.

- Import Arkouda package and connect to the Arkouda server
- Import other useful packages
- Define some Python helper functions for ETL (Extract/Transform/Load)
- Define a Python function to transfer dataframes from Pandas to Arkouda
- Read NYC taxi csv into Pandas
- Put dataframe columns into the Arkouda server
- Compute taxi ride duration in Pandas and in Arkouda and histogram data
- Read NYC Taxi Zone Lookup Table (tzlut) into Pandas
- Transfer tzlut to Arkouda
- Compute something with Groupby/aggregate in Pandas and in Arkouda
  - Groupby on pickup and dropoff location ids
  - use groupby/aggregate on edge list to compute different things
    - min/max/mean/std distance between location ids
    - min/max/mean/std time between location ids
    - number of trips between location ids
    - other things
  - other things
- model number of taxis at a given time
- model taxis as specific entities (Kalman filter?)
  - use time and location ids
  - probability of paths of taxis
  - ...
- other things?

New York City Taxi Data
----------------------------------
[Yellow Trips Data Dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)

[NYC Yellow Taxi Trip Records Jan 2020](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv)

[Green Trips Data Dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf)

[NYC Green Taxi Trip Records Jan 2020](https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2020-01.csv)

[NYC Taxi Zone Lookup Table](https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv)

[NYC Taxi Zone Shapefile](https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip)

### Import Arkouda package and connect to the server

In [None]:
import arkouda as ak
ak.connect(connect_url="tcp://localhost:5555")

### Import the other packages

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import gc

### Conversion functions for ETL

In [None]:
# conversion from csv field to int64
# try to convert to int64; on failure (empty or non-numeric string) return 0
def cvt_to_int64(v):
    try:
        return np.int64(v)
    except (ValueError, TypeError, OverflowError):
        return np.int64(0)

# conversion from csv field to string
# empty fields (or anything that cannot be converted) become 'N/A'
def cvt_to_string(v):
    try:
        if v == '':
            return 'N/A'
        else:
            return str(v)
    except Exception:
        return 'N/A'

# conversion from csv field (Y, N, empty) to bool
# on Y convert to True, on N or empty convert to False
def cvt_YN_to_bool(v):
    return v == 'Y'
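
To see how converters like these behave, the following sketch feeds a tiny in-memory CSV through `pd.read_csv`. The sample rows and the names `sample_csv` and `sample_df` are made up for illustration, and two of the converters are repeated so that the block runs on its own.

```python
import io

import numpy as np
import pandas as pd

def cvt_to_int64(v):
    # empty or non-numeric fields fall back to 0
    try:
        return np.int64(v)
    except (ValueError, TypeError, OverflowError):
        return np.int64(0)

def cvt_YN_to_bool(v):
    # only 'Y' maps to True; 'N' and empty map to False
    return v == 'Y'

# hypothetical sample: the second row has an empty passenger_count and flag
sample_csv = io.StringIO(
    "passenger_count,store_and_fwd_flag,trip_distance\n"
    "1,Y,2.5\n"
    ",,0.7\n"
    "2,N,1.1\n"
)

sample_df = pd.read_csv(
    sample_csv,
    converters={'passenger_count': cvt_to_int64,
                'store_and_fwd_flag': cvt_YN_to_bool},
)
print(sample_df)
print(sample_df.dtypes)
```

Columns without a converter, such as `trip_distance` here, are still parsed by Pandas in the usual way.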

### Define function to transfer dataframe to dictionary of Arkouda arrays

In [None]:
# check all objects in iterable for instance of str
# this is a crutch to get a pandas column of str into Arkouda
def is_all_str(a):
    return all(isinstance(v, str) for v in a)

# put data frame columns into arkouda server and return a dict of the pdarrays
# convert some columns into data types the server can understand
def ak_create_akdict_from_df(df):
    akdict = {}
    for cname in df.keys():
        a = df[cname].values
            
        # int64, float64, and bool should go over fine
        if a.dtype in [np.int64, np.float64, np.bool_]:
            akdict[cname] = ak.array(a)
            print(cname, " : ", a.dtype, "->", akdict[cname].dtype)
        # time needs to be converted to int64
        elif a.dtype in ["datetime64[ns]"]:
            akdict[cname] = ak.array(a.astype(np.int64))
            print(cname, " : ", a.dtype, "->", akdict[cname].dtype)
        # convert to arkouda Strings object if whole column is instance of str
        elif is_all_str(a):
            # convert to python list of str then ak.array can convert to ak.Strings object
            akdict[cname] = ak.array(list(a))
            print(cname, " : ", a.dtype, "->", 'ak.Strings')
        # something I don't understand how to convert to a server data type
        else:
            print("don't know how to convert ", a.dtype, " !!!")
    return akdict
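
The dtype dispatch above can be tried without a running server. The sketch below is an assumption-laden stand-in: the function `conversion_path` and the `demo` frame are hypothetical, no Arkouda call is made, and datetimes are detected with `np.issubdtype` rather than by comparing against `"datetime64[ns]"`, so other time resolutions are recognised too.

```python
import numpy as np
import pandas as pd

def conversion_path(a):
    """Name the branch a NumPy column would take in a dtype dispatch like the one above."""
    if a.dtype in (np.int64, np.float64, np.bool_):
        return "direct"
    if np.issubdtype(a.dtype, np.datetime64):
        return "datetime -> int64"
    if all(isinstance(v, str) for v in a):
        return "strings"
    return "unsupported"

# hypothetical frame with one column per branch
demo = pd.DataFrame({
    'PULocationID': np.array([1, 2], dtype=np.int64),
    'fare_amount': [5.0, 7.5],
    'store_and_fwd_flag': [True, False],
    'pickup': pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:10:00']),
    'Zone': ['Zone A', 'Zone B'],
    'mixed': [1, 'two'],
})

paths = {name: conversion_path(demo[name].to_numpy()) for name in demo.columns}
for name, path in paths.items():
    print(name, ':', demo[name].dtype, '->', path)
```

A column that mixes strings with other objects, like `mixed` here, falls through every branch, which corresponds to the "don't know how to convert" case.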

### Helper functions

In [None]:
def ns_to_min(v):
    return (v / (1e9 * 60.0))
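
As a small worked example of why this helper is needed: once `datetime64[ns]` values are cast to `int64`, they are nanoseconds since the Unix epoch, so a difference of two timestamps is a nanosecond count. The timestamps below are made up, and the sketch uses NumPy only.

```python
import numpy as np

def ns_to_min(v):
    # nanoseconds -> minutes
    return v / (1e9 * 60.0)

# made-up pickup and dropoff times
pickup = np.array(['2020-01-01T00:00:00', '2020-01-01T08:15:00'], dtype='datetime64[ns]')
dropoff = np.array(['2020-01-01T00:01:30', '2020-01-01T08:45:00'], dtype='datetime64[ns]')

# the same cast that is applied before sending time columns to the server
pickup_ns = pickup.astype(np.int64)
dropoff_ns = dropoff.astype(np.int64)

minutes = ns_to_min(dropoff_ns - pickup_ns)
print(pickup_ns[0])
print(minutes)
```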

## Yellow taxi trip data

### Read in the data

In [None]:
# Read in yellow taxi data
# per yellow data dictionary convert to data types Arkouda can handle
# int64, float64, bool
cvt = {'VendorID': cvt_to_int64, 'passenger_count': cvt_to_int64, 'RatecodeID': cvt_to_int64,
       'store_and_fwd_flag': cvt_YN_to_bool,
       'PULocationID': cvt_to_int64, 'DOLocationID':cvt_to_int64, 'payment_type': cvt_to_int64}
# explicitly parse date-time fields
parse_dates_lst = ['tpep_pickup_datetime','tpep_dropoff_datetime']
# call read_csv to parse data with these options
ydf = pd.read_csv("../Downloads/yellow_tripdata_2020-01.csv",
                  converters=cvt, header=0, low_memory=False,
                  parse_dates=parse_dates_lst)

### Print out the dataframe

In [None]:
#print out dataframe
ydf

### Which columns did we read in?

In [None]:
# see which keys we read in from first line of csv data file
#print(ydf.keys())
print(ydf.columns)

### Convert the dataframe to a dictionary of Arkouda arrays

In [None]:
# put data frame columns into arkouda server
# convert some columns into data types the server can understand
akdict = ak_create_akdict_from_df(ydf)

### Show which columns got transferred to the Arkouda server

In [None]:
# which keys made it over to the server
print(akdict.keys())

### Look at the symbol table of the Arkouda server

In [None]:
# print out the arkouda server symbol table
print(ak.info(ak.AllSymbols))

### Use the store-and-forward flag as a boolean mask on pickup time in the Arkouda server

In [None]:
# how many records made it to the server?
numTotal = akdict['tpep_pickup_datetime'].size

# index tpep_pickup_datetime with the negated store_and_fwd_flag column
# to count the records where the flag is False
numFalse = akdict['tpep_pickup_datetime'][~akdict['store_and_fwd_flag']].size

# index tpep_pickup_datetime with the store_and_fwd_flag column
# to count the records where the flag is True
numTrue = akdict['tpep_pickup_datetime'][akdict['store_and_fwd_flag']].size

# every record should land in exactly one of the two subsets
numTotal == numFalse+numTrue
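
The same check can be written in plain NumPy, which may help when comparing against the server-side result. This is a minimal sketch on made-up values; `pickup_ns` and `flag` are hypothetical stand-ins for the two columns.

```python
import numpy as np

# made-up stand-ins for the pickup-time and store-and-forward columns
pickup_ns = np.array([10, 20, 30, 40, 50], dtype=np.int64)
flag = np.array([True, False, False, True, False])

num_total = pickup_ns.size
num_true = pickup_ns[flag].size     # rows where the flag is True
num_false = pickup_ns[~flag].size   # rows where the flag is False

print(num_true, num_false, num_total == num_true + num_false)

# counting the mask directly gives the same numbers without building the subsets
print(int(flag.sum()), int((~flag).sum()))
```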

### Ride duration in Pandas

In [None]:
# Pandas ride duration
ride_duration = ydf['tpep_dropoff_datetime'] - ydf['tpep_pickup_datetime']
# pull out ride duration in minutes
ride_duration = ride_duration.dt.total_seconds() / 60 # in minutes

print("min = ", ride_duration.min(),"max = ", ride_duration.max())
print("mean = ",ride_duration.mean(),"stdev = ",ride_duration.std())

# how long was the min/max ride to the next integer minute
min_ride = math.floor(ride_duration.min())
print("min_ride = ", min_ride)
max_ride = math.ceil(ride_duration.max())
print("max_ride = ", max_ride)

# histogram the ride durations, using roughly one bin per minute
nBins = max_ride - min_ride
cnts,binEdges = np.histogram(ride_duration, bins=nBins)

print(cnts.size,    "cnts      = ", cnts)
print(binEdges.size,"bin edges = ", binEdges)

# plot the histogram of ride durations
plt.plot(binEdges[:-1],cnts)
plt.yscale('log')
plt.xscale('linear')
plt.show()

### Ride duration in Arkouda

In [None]:
# take delta for ride duration
ride_duration = akdict['tpep_dropoff_datetime'] - akdict['tpep_pickup_datetime']
# pull out ride duration in minutes
ride_duration = ns_to_min(ride_duration)

print("min = ", ride_duration.min(),"max = ", ride_duration.max())
print("mean = ",ride_duration.mean(),"stdev = ",ride_duration.std())

# how long was the min/max ride to the next integer minute
min_ride = math.floor(ride_duration.min())
print("min_ride = ", min_ride)
max_ride = math.ceil(ride_duration.max())
print("max_ride = ", max_ride)

# histogram the ride durations, using roughly one bin per minute
nBins = max_ride - min_ride
cnts = ak.histogram(ride_duration, bins=nBins)

# create bin edges because ak.histogram here returns only the counts
binEdges = np.linspace(ride_duration.min(), ride_duration.max(), nBins+1)

print(cnts.size,    "cnts      = ", cnts)
print(binEdges.size,"bin edges = ", binEdges)

# plot the histogram of ride durations
plt.plot(binEdges[:-1],cnts.to_ndarray())
plt.yscale('log')
plt.xscale('linear')
plt.show()
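
Building the bin edges with `np.linspace` matches what `np.histogram` does when it is given an integer bin count: it spans the data from minimum to maximum with equal-width bins. The sketch below checks this on made-up durations. Note that the bins are `(max - min) / nBins` wide, which is close to one minute only when the minimum and maximum are close to whole minutes.

```python
import numpy as np

# made-up ride durations in minutes
durations = np.array([0.5, 1.2, 1.8, 2.4, 3.9, 7.0])
n_bins = 4

# edges as computed by NumPy
cnts, np_edges = np.histogram(durations, bins=n_bins)

# edges built by hand, as in the cell above
manual_edges = np.linspace(durations.min(), durations.max(), n_bins + 1)

print(cnts)
print(np_edges)
print(np.allclose(np_edges, manual_edges))
```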

### Compute something with trip distance in Pandas

In [None]:
print(ydf['trip_distance'].min(), ydf['trip_distance'].max())
print(ydf['trip_distance'].mean(), ydf['trip_distance'].std())

plt.figure(figsize=(8,6))
plt.hist(ydf['trip_distance'],bins=2000)
#ax = plt.gca()
#ax.set_xlim((ydf['trip_distance'].min(),ydf['trip_distance'].max()))
plt.yscale('log')
plt.xscale('log')
plt.show()

### Compute something with trip distance in Arkouda

## Taxi Zone Lookup Table

### Read in the data

In [None]:
# read the taxi-zone-lookup-table
cvt = {'Borough':cvt_to_string, 'Zone':cvt_to_string, 'service_zone':cvt_to_string}
tzlut = pd.read_csv("../Downloads/taxi+_zone_lookup.csv",converters=cvt)
# print out the tzlut which was read from file
print(tzlut)

# location id is 1-based, index is 0-based
# fix it up to be aligned with index in data frame
# which means add row zero
top_row = pd.DataFrame({'LocationID': [0], 'Borough': ['N/A'], 'Zone': ['N/A'], 'service_zone': ['N/A']})
tzlut = pd.concat([top_row, tzlut]).reset_index(drop = True)
# print fixed up tzlut
tzlut
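
The point of the extra row is that `LocationID` then equals the row index, so an array of location ids can be used directly as positions into any column. A minimal sketch with a made-up three-row table (the zone names and the `lut` variable are hypothetical):

```python
import pandas as pd

# made-up stand-in for the lookup table, ids start at 1
lut = pd.DataFrame({'LocationID': [1, 2, 3],
                    'Borough': ['Borough X', 'Borough Y', 'Borough Y'],
                    'Zone': ['Zone A', 'Zone B', 'Zone C'],
                    'service_zone': ['S1', 'S2', 'S2']})

# prepend row zero so that LocationID lines up with the 0-based index
top_row = pd.DataFrame({'LocationID': [0], 'Borough': ['N/A'],
                        'Zone': ['N/A'], 'service_zone': ['N/A']})
lut = pd.concat([top_row, lut]).reset_index(drop=True)

# every id now sits at the position given by its own value
aligned = bool((lut['LocationID'].to_numpy() == lut.index.to_numpy()).all())

# location ids can therefore be used as positional indices
ids = [3, 1, 0]
zones = lut['Zone'].to_numpy()[ids]
print(aligned, list(zones))
```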

### Check the columns for all strings

In [None]:
# check the columns for all strings
print(["{} is all str -> {}".format(name, is_all_str(tzlut[name].values)) for name in tzlut.keys()])

### Convert dataframe to dictionary of Arkouda arrays

In [None]:
# convert data frame with strings and int64 data
aktzlut = ak_create_akdict_from_df(tzlut)

### What did we get on the server side?

In [None]:
print(aktzlut)

### GroupBy pickup and dropoff location id
- Compute something with Groupby/aggregate in Pandas and in Arkouda
  - Groupby on pickup and dropoff location ids

In [None]:
# groupby pickup (PU) and dropoff (DO) location ids (LID)
byLIDs = ak.GroupBy((akdict['PULocationID'], akdict['DOLocationID']))
# print unique keys
print('unique_keys (PU,DO): ', byLIDs.unique_keys)

### Use groupby/aggregate on edge list to compute a condensed graph

In [None]:
# create a condensed graph of LID (PU,DO) pairs with different weights
graphLID = {}
# vertex names (integer)
graphLID['V']    = aktzlut['LocationID']
# unique edges
graphLID['E']    = byLIDs.unique_keys
# edge weight: count of each unique edge
graphLID['W_CT'] = byLIDs.count()[1]
# edge weight: mean trip distance per edge
graphLID['W_TD'] = byLIDs.mean(akdict['trip_distance'])[1]
# edge weight: mean ride duration per edge
graphLID['W_RD'] = byLIDs.mean(ride_duration)[1]
# edge weight: mean fare amount per edge
graphLID['W_FA'] = byLIDs.mean(akdict['fare_amount'])[1]
# print the graph
print("graphLID = ", graphLID)
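
For comparison, a Pandas version of the same condensed edge list could look like the sketch below. It runs on a made-up five-trip frame rather than the real data, and the `trips` and `edges` names are hypothetical; the weight names mirror the ones used above.

```python
import pandas as pd

# made-up trips: (PU, DO) pairs with a distance and a fare
trips = pd.DataFrame({
    'PULocationID':  [1, 1, 2, 2, 2],
    'DOLocationID':  [2, 2, 1, 3, 3],
    'trip_distance': [1.0, 3.0, 2.5, 4.0, 6.0],
    'fare_amount':   [6.0, 10.0, 9.0, 15.0, 21.0],
})

# one row per unique (PU, DO) edge with count and mean weights
edges = (trips.groupby(['PULocationID', 'DOLocationID'])
              .agg(W_CT=('trip_distance', 'count'),
                   W_TD=('trip_distance', 'mean'),
                   W_FA=('fare_amount', 'mean'))
              .reset_index())
print(edges)
```

Each row of `edges` is one unique (PU, DO) pair, so the frame plays the same role as the `E` and `W_*` entries of the dictionary built above.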

### Ideas for further work
- use join-with-dt
- other things
- model number of taxis at a given time
- model taxis as specific entities (Kalman filter?)
  - use time and location ids
  - probability of paths of taxis
  - ...
- other things?

### Disconnect from the server or shut down the server

In [None]:
# disconnect or shutdown the server
#ak.disconnect()
#ak.shutdown()