# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to set up [BlazingSQL](https://blazingsql.com) and the [RAPIDS AI](https://rapids.ai/) suite.
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

![Impression](https://www.google-analytics.com/collect?v=1&tid=UA-39814657-5&cid=555&t=event&ec=guides&ea=taxi_fare_prediction&dt=taxi_fare_prediction)


## Setup
### Environment Sanity Check 

RAPIDS packages (BlazingSQL included) require Pascal+ architecture to run. For Colab, this translates to a T4 GPU instance. 

The cell below will let you know what type of GPU you've been allocated, and how to proceed.

In [1]:
# tag specs
colab_smi = !nvidia-smi

# focus GPU type
try:
    my_gpu = ' '.join(colab_smi[7].split()[2:4])
# not on gpu acceleration 
except:
    raise Exception("\nPlease make sure you've configured Colab to request a GPU instance type.\n\n"
                    "At top of Colab, try: Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save\n")

# not allocated compatable GPU
if (my_gpu != b'Tesla T4') and (my_gpu != 'Tesla P100-PCIE...') and (my_gpu != 'GeForce GTX'):
    # allocated K80
    if my_gpu == 'Tesla K80':
        raise Exception("\nYou've been allocated a K80 instance\n\n"
                    "Unfortunately, this demo requires a T4 instance\n\n"
                    "At top of Colab, try: Runtime -> Reset all runtimes...\n")
    else:
        raise Exception(f"\nYou've achieved wizardy.\nyour GPU is {my_gpu}\nPlease inform info@blazingsql.com")

# allocated compatable GPU
else:
    print('Woo! You got the right kind of GPU!')

Woo! You got the right kind of GPU!


## Installs 

Below you will find three code blocks:
1. The first installs miniconda.
2. The second installs RAPIDS AI and sets up the system environment. 
3. The third installs BlazingSQL.

### Miniconda

In [2]:
# # intall miniconda
# !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
# !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
# !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

### RAPIDS AI

In [3]:
# # install RAPIDS packages
# !conda install -q -y --prefix /usr/local -c nvidia -c rapidsai \
#   -c numba -c conda-forge -c pytorch -c defaults \
#   cudf=0.9 cuml=0.9 cugraph=0.9 python=3.6 cudatoolkit=10.0

# # set environment vars
# import sys, os, shutil
# sys.path.append('/usr/local/lib/python3.6/site-packages/')
# os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
# os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# # copy .so files to current working dir
# for fn in ['libcudf.so', 'librmm.so']:
#     shutil.copy('/usr/local/lib/'+fn, os.getcwd())

### BlazingSQL

In [4]:
# # Install BlazingSQL for CUDA 10.0
# ! conda install -q -y --prefix /usr/local -c conda-forge -c defaults -c nvidia -c rapidsai \
#    -c blazingsql/label/cuda10.0 -c blazingsql \
#    blazingsql-calcite blazingsql-orchestrator blazingsql-ral blazingsql-python

# !pip install flatbuffers

# Import Packages

In [5]:
# Import RAPIDS AI stack
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

BlazingContext ready


# Download Data

In [6]:
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv

# ETL: Read and Join CSVs


In [7]:
# set attribute column names 
column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
# and the type of each column
column_types = ['date64', 'float32', 'float32', 'float32', 
                'float32', 'float32', 'float32']

# load first csv 
gdf_00 = cudf.read_csv('taxi_00.csv', delimiter= ',', dtype = column_types, names = column_names)
# load second csv
gdf_01 = cudf.read_csv('taxi_01.csv', delimiter= ',', dtype = column_types, names = column_names)
# load third csv
gdf_02 = cudf.read_csv('taxi_01.csv', delimiter= ',', dtype = column_types, names = column_names)
# load fourth csv
gdf_03 = cudf.read_csv('taxi_01.csv', delimiter= ',', dtype = column_types, names = column_names)

# combine all those dataframes into one master dataframe
gdf = cudf.concat([gdf_00,gdf_01, gdf_02, gdf_03])

# what's it look like?
gdf.head()

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2012-02-02 22:30:19.002,8.9,-73.988708,40.758804,-73.986519,40.737202,1.0
1,2014-09-20 07:19:24.001,4.0,-73.990204,40.746708,-73.994728,40.750515,1.0
2,2013-02-23 07:18:05.001,5.5,-74.016762,40.709438,-74.009003,40.719498,3.0
3,2015-04-18 23:49:27.009,13.5,-74.002708,40.73373,-73.986099,40.734776,1.0
4,2010-03-04 08:15:59.001,10.5,-73.988365,40.737663,-74.012459,40.713932,1.0


# ETL: Create Table

In [8]:
%time
# make a table from the combined df
bc.create_table('taxi', gdf)

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.05 µs


<pyblazing.apiv2.sql.Table at 0x7f171c0c8a20>

# ETL: Query Tables for Training Data

In [9]:
# define the query
query = '''
        SELECT hour(key) as hours, month(key) as months, year(key) - 2000 as years,  
        dropoff_longitude - pickup_longitude as longitude_distance, 
        dropoff_latitude - pickup_latitude as latitude_distance, 
        passenger_count FROM main.taxi
        '''
# run query on table
X_train = bc.sql(query).get()

In [10]:
# extract dataframe
X_train_gdf = X_train.columns

# how's that look?
X_train_gdf.head()

Unnamed: 0,$f0,$f1,$f2,$f3,$f4,passenger_count
0,22,2,12,0.00219,-0.021603,1.0
1,7,9,14,-0.004524,0.003807,1.0
2,7,2,13,0.007759,0.010059,3.0
3,23,4,15,0.016609,0.001045,1.0
4,8,3,10,-0.024094,-0.023731,1.0


# Current status brief
- table is being made
  - but column names are not being coppied as expected 
- `X_train` is a whole thing 

In [11]:
# temp fix to columns not translating as expected
X_train_gdf.columns = ['hours', 'months', 'years', 
                       'longitude_distance', 'latitude_distance', 'passenger_count']

X_train_gdf.head()

Unnamed: 0,hours,months,years,longitude_distance,latitude_distance,passenger_count
0,22,2,12,0.00219,-0.021603,1.0
1,7,9,14,-0.004524,0.003807,1.0
2,7,2,13,0.007759,0.010059,3.0
3,23,4,15,0.016609,0.001045,1.0
4,8,3,10,-0.024094,-0.023731,1.0


In [12]:
X_train_gdf['longitude_distance'] = X_train_gdf['longitude_distance'].fillna(0).astype('float32')
X_train_gdf['latitude_distance'] = X_train_gdf['latitude_distance'].fillna(0).astype('float32')
X_train_gdf['passenger_count'] = X_train_gdf['passenger_count'].fillna(0).astype('float32')
X_train_gdf['months'] = X_train_gdf['months'].astype('float32') 
X_train_gdf['years'] = X_train_gdf['years'].astype('float32') 
X_train_gdf['hours'] = X_train_gdf['hours'].astype('float32')
        
X_train_gdf.head()

Unnamed: 0,hours,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,3.0,10.0,-0.024094,-0.023731,1.0


In [13]:
X_train_gdf.isnull().sum(), len(X_train_gdf)

(hours                 0
 months                0
 years                 0
 longitude_distance    0
 latitude_distance     0
 passenger_count       0
 dtype: int64, 20000000)

In [14]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM main.taxi').get()
# extract dataframe
y_train_gdf = y_train.columns
# shrink to single column
# y_train_gdf = y_train_gdf['fare_amount']

y_train_gdf.head()

Unnamed: 0,fare_amount
0,8.9
1,4.0
2,5.5
3,13.5
4,10.5


# Install cuML on Colab

In [15]:
# !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
# !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
# !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local
# import sys
# sys.path.append('/usr/local/lib/python3.6/site-packages/')

In [16]:
# !conda install -c rapidsai -c nvidia -c conda-forge \
#     -c defaults cuml=0.10 python=3.7 cudatoolkit=10.0 -y

# Linear Regression: Train Model

In [17]:
%%time

import cuml
from cuml import LinearRegression

#create model
lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = "eig")
#train model
reg = lr.fit(X_train_gdf,y_train_gdf)
#print results
print("Coefficients:")
print(reg.coef_)
print(" ")
print(" Y intercept:")
print(reg.intercept_)

Exception ignored in: <finalize object at 0x7f1711f4cb10; dead>
Traceback (most recent call last):
  File "/home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/numba/utils.py", line 669, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/rmm/rmm.py", line 220, in finalizer
    librmm.rmm_free(handle, stream)
  File "rmm/_lib/lib.pyx", line 229, in rmm._lib.lib.rmm_free
  File "rmm/_lib/lib.pyx", line 218, in rmm._lib.lib.c_free
  File "rmm/_lib/lib.pyx", line 49, in rmm._lib.lib.check_error
rmm.rmm.RMMError: RMM_ERROR_CUDA_ERROR


RuntimeError: Exception occured! file=/conda/conda-bld/libcuml_1571339163826/work/cpp/src_prims/common/cuml_allocator.hpp line=109: FAIL: call='cudaMalloc(&ptr, n)'. Reason:out of memory

Obtained 64 stack frames
#0 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon9Exception16collectCallStackEv+0x3e) [0x7f162412983e]
#1 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon9ExceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x80) [0x7f162412a350]
#2 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon22defaultDeviceAllocator8allocateEmP11CUstream_st+0x10e) [0x7f162412a63e]
#3 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon11buffer_baseIfNS_15deviceAllocatorEEC2ESt10shared_ptrIS1_EP11CUstream_stm+0xa5) [0x7f1624131a75]
#4 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon6LinAlg8lstsqEigIfEEvPT_iiS3_S3_P17cusolverDnContextP13cublasContextSt10shared_ptrINS_15deviceAllocatorEEP11CUstream_st+0x1ac) [0x7f1624200dbc]
#5 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitIfEEvRKNS_15cumlHandle_implEPT_iiS6_S6_S6_bbP11CUstream_sti+0x677) [0x7f16242047c7]
#6 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitERKNS_10cumlHandleEPfiiS4_S4_S4_bbi+0x9d) [0x7f16241c255d]
#7 in /home/rodrigo/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cuml/linear_model/linear_regression.cpython-37m-x86_64-linux-gnu.so(+0x11dc3) [0x7f1710cbadc3]
#8 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyObject_FastCallKeywords+0x3fb) [0x55b450a4916b]
#9 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x53ae) [0x55b450aae49e]
#10 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#11 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyEval_EvalCodeEx+0x44) [0x55b4509ef7e4]
#12 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyEval_EvalCode+0x1c) [0x55b4509ef80c]
#13 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(+0x1e0c70) [0x55b450ab8c70]
#14 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9) [0x55b450a405f9]
#15 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyCFunction_FastCallKeywords+0x21) [0x55b450a40891]
#16 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x47d4) [0x55b450aad8c4]
#17 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#18 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallDict+0x1d5) [0x55b4509ef9f5]
#19 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x1f4c) [0x55b450aab03c]
#20 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#21 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0x387) [0x55b450a3ff87]
#22 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x416) [0x55b450aa9506]
#23 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#24 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallDict+0x400) [0x55b4509efc20]
#25 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyObject_Call_Prepend+0x63) [0x55b450a0ee23]
#26 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyObject_Call+0x6e) [0x55b450a0151e]
#27 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x1f4c) [0x55b450aab03c]
#28 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55b450a3fcfb]
#29 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x6f0) [0x55b450aa97e0]
#30 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#31 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyEval_EvalCodeEx+0x44) [0x55b4509ef7e4]
#32 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyEval_EvalCode+0x1c) [0x55b4509ef80c]
#33 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(+0x1e0c70) [0x55b450ab8c70]
#34 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9) [0x55b450a405f9]
#35 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyCFunction_FastCallKeywords+0x21) [0x55b450a40891]
#36 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x47d4) [0x55b450aad8c4]
#37 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyGen_Send+0x2a2) [0x55b450a49ea2]
#38 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x1acc) [0x55b450aaabbc]
#39 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyGen_Send+0x2a2) [0x55b450a49ea2]
#40 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x1acc) [0x55b450aaabbc]
#41 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyGen_Send+0x2a2) [0x55b450a49ea2]
#42 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c) [0x55b450a4059c]
#43 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyMethodDescr_FastCallKeywords+0x4f) [0x55b450a48cdf]
#44 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x4cbc) [0x55b450aaddac]
#45 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55b450a3fcfb]
#46 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x416) [0x55b450aa9506]
#47 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55b450a3fcfb]
#48 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x6f0) [0x55b450aa97e0]
#49 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55b4509ee929]
#50 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallDict+0x400) [0x55b4509efc20]
#51 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyObject_Call_Prepend+0x63) [0x55b450a0ee23]
#52 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(PyObject_Call+0x6e) [0x55b450a0151e]
#53 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x1f4c) [0x55b450aab03c]
#54 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x5da) [0x55b4509eec0a]
#55 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0x387) [0x55b450a3ff87]
#56 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x14dc) [0x55b450aaa5cc]
#57 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(+0x171969) [0x55b450a49969]
#58 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9) [0x55b450a405f9]
#59 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyCFunction_FastCallKeywords+0x21) [0x55b450a40891]
#60 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x47d4) [0x55b450aad8c4]
#61 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalCodeWithName+0x5da) [0x55b4509eec0a]
#62 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyFunction_FastCallKeywords+0x387) [0x55b450a3ff87]
#63 in /home/rodrigo/anaconda3/envs/rapidsenv/bin/python(_PyEval_EvalFrameDefault+0x6f0) [0x55b450aa97e0]


# Linear Regression: Use Model to Predict Future Taxi Fares 

For this we are using a second dataset with data but no fare amount. We will predict fare_amounts. 

Here is a public link to that file: https://drive.google.com/file/d/1UG5-dXNPsAWZb0bJgquEg1qZsm12jzZW/view?usp=sharing

You will need to download that file and upload it to the colab. 

In [18]:
# Create Test Data Table

column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
column_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# !wget 'https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv'

gdf2 = cudf.read_csv('test.csv', delimiter= ',', dtype = column_types, names = column_names)

bc.create_table('test', gdf2)


--2019-10-25 03:50:30--  https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv
Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.120.145
Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.120.145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982916 (960K) [text/csv]
Saving to: ‘test.csv.1’


2019-10-25 03:50:31 (10.6 MB/s) - ‘test.csv.1’ saved [982916/982916]



<pyblazing.apiv2.sql.Table at 0x7f17110340b8>

In [19]:
# Query Test Data Table to Create GDF
X_test = bc.sql('SELECT hour(key) as hours, month(key) as months, year(key) - 2000 as years,  dropoff_longitude - pickup_longitude as longitude_distance, dropoff_latitude - pickup_latitude as latitude_distance , passenger_count FROM main.test').get()
X_test_gdf = X_test.columns

X_test_gdf['longitude_distance'] = X_test_gdf['longitude_distance'].fillna(0).astype('float32')
X_test_gdf['latitude_distance'] = X_test_gdf['latitude_distance'].fillna(0).astype('float32')
X_test_gdf['passenger_count'] = X_test_gdf['passenger_count'].fillna(0).astype('float32')
X_test_gdf['months'] = X_test_gdf['months'].astype('float32') 
X_test_gdf['years'] = X_test_gdf['years'].astype('float32') 
X_test_gdf['hours'] = X_test_gdf['hours'].astype('float32')

print(X_test_gdf) #this is the data we will use to predict future ride costs

KeyError: 'longitude_distance'

In [None]:
# Predict Fare Amounts 
predictions = lr.predict(X_test_gdf)
print(predictions)

In [None]:
#Combine into a table of table points and predictions
X_test_gdf['predicted_fare'] = predictions
print(X_test_gdf)

 ##  Predict Cost from Grand Central Station to Samsung Next NYC at 7:00 AM on May 15th, 2020

In [None]:
samsung_ride = cudf.DataFrame([('hours', 7.0), ('days',15.0), ('months', 5.0), ('years', 20.0), ('longitude_distance', 0.012727), ('latitude_distance', 0.008484), ('passenger_count', 1.0)])

samsung_ride['hours'] = samsung_ride['hours'].astype('float32')
samsung_ride['days'] = samsung_ride['days'].astype('float32')
samsung_ride['months'] = samsung_ride['months'].astype('float32')
samsung_ride['years'] = samsung_ride['years'].astype('float32')
samsung_ride['longitude_distance'] = samsung_ride['longitude_distance'].astype('float32')
samsung_ride['latitude_distance'] = samsung_ride['latitude_distance'].astype('float32')
samsung_ride['passenger_count'] = samsung_ride['passenger_count'].astype('float32')

samsung_prediction = lr.predict(samsung_ride)
print(samsung_prediction)