# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses publicly available [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to read and query multiple CSV files with BlazingSQL.
- How to implement a linear regression model with cuML.

### Imports
This next cell will import all packages you need to run this notebook end-to-end.

In [1]:
import os
import urllib.request
from cuml import LinearRegression
from blazingsql import BlazingContext

## Create BlazingContext
You can think of the BlazingContext much like a Spark Context: it stores information such as the FileSystems you have registered and the Tables you have created. If you have issues running this cell, restart the runtime and try running it again.

In [2]:
# connect to BlazingSQL
bc = BlazingContext()

BlazingContext ready


### Download Data
For this demo we will train our model with 25,000,000 rows of data from 5 CSV files (5M rows each).

The cell below will check if you already have them, and, if you don't, will download them from AWS for you. 

In [3]:
%%time
# download taxi data
base_url = 'https://blazingsql-colab.s3.amazonaws.com/taxi_data/'
# make sure the data directory exists before writing into it
os.makedirs('data', exist_ok=True)
for i in range(0, 5):
    fn = 'taxi_0' + str(i) + '.csv'
    # check if we already have the file
    if not os.path.isfile('data/' + fn):
        # we don't, so let us know we're downloading it now
        print(f'Downloading {base_url + fn} to data/{fn}')
        # download file
        urllib.request.urlretrieve(base_url + fn, 'data/' + fn)
    # we already have data
    else:
        # let us know
        print(f'data/{fn} already downloaded')

Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv to data/taxi_00.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv to data/taxi_01.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv to data/taxi_02.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv to data/taxi_03.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_04.csv to data/taxi_04.csv
CPU times: user 4.19 s, sys: 5.16 s, total: 9.36 s
Wall time: 26.8 s


## Extract, transform, load
In order to train our Linear Regression model, we must first perform ETL to prepare our data.

BlazingSQL currently requires the full file path to create tables. The cell below will identify that path for you.

In [4]:
# identify current working directory
cwd = os.getcwd()
# add path to data w/ wildcard (*) so BSQL can read all 5 files at once
data_path = cwd + '/data/taxi_0*.csv'
# how's it look?
data_path

'/home/jupyter-winston/bsql-demos/data/taxi_0*.csv'
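
The wildcard in `data_path` is a shell-style pattern. As a rough illustration of which file names such a pattern picks up, the sketch below uses Python's standard `glob` module on a throwaway directory. This only mimics the pattern matching; BlazingSQL resolves the wildcard itself, and the helper name `matching_files` and the temporary files are made up for this example.

```python
import glob
import os
import tempfile


def matching_files(directory, pattern='taxi_0*.csv'):
    """Return the sorted base names in `directory` that match the glob `pattern`."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(p) for p in paths)


# build a throwaway directory that loosely mimics the demo's data/ folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ['taxi_00.csv', 'taxi_01.csv', 'taxi_04.csv', 'test.csv']:
        # create an empty placeholder file
        open(os.path.join(tmp, name), 'w').close()
    matched = matching_files(tmp)

# test.csv is left out because it does not start with 'taxi_0'
print(matched)
```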

### ETL: Create Table 
In this next cell we will create a single BlazingSQL table from all 5 CSVs.

In [5]:
%%time
# tag column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
             'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32',
             'float32', 'float32', 'float32']

# create a table from all 5 taxi files at once
bc.create_table('train_taxi', data_path, names=col_names, dtype=col_types, header=0)

CPU times: user 3.13 ms, sys: 2.44 ms, total: 5.57 ms
Wall time: 4.66 ms


<pyblazing.apiv2.context.BlazingTable at 0x7fc0b06a4b70>

### ETL: Query Tables for Training Data

In [7]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            dropoff_longitude - pickup_longitude as longitude_distance, 
            dropoff_latitude - pickup_latitude as latitude_distance, 
            passenger_count 
        from 
            train_taxi
            '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill any null values 
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

# how's it look? 
X_train.head()

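|   | hours | days | months | years | longitude_distance | latitude_distance | passenger_count |
|---|-------|------|--------|-------|--------------------|-------------------|-----------------|
| 0 | 20.0 | 10.0 | 9.0 | 13.0 | 0.049057 | 0.003063 | 1.0 |
| 1 | 20.0 | 22.0 | 11.0 | 9.0 | 0.003464 | 0.007088 | 1.0 |
| 2 | 21.0 | 4.0 | 12.0 | 9.0 | 0.003151 | 0.007584 | 1.0 |
| 3 | 22.0 | 6.0 | 5.0 | 15.0 | 0.007141 | 0.011543 | 1.0 |
| 4 | 23.0 | 27.0 | 4.0 | 9.0 | -0.01487 | -0.033161 | 1.0 |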
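
To make the query's output columns concrete, here is a minimal CPU-only sketch that derives the same seven features for a single ride in plain Python. The function name `extract_features`, the timestamp format, and the sample coordinates are assumptions made for illustration; the notebook itself computes these columns on the GPU through `bc.sql`.

```python
from datetime import datetime


def extract_features(key, pickup_lon, pickup_lat, dropoff_lon, dropoff_lat,
                     passenger_count):
    """Mirror the SQL query's columns for one ride (all values as floats)."""
    # assumed timestamp layout; the real CSV format may differ
    ts = datetime.strptime(key, '%Y-%m-%d %H:%M:%S')
    return {
        'hours': float(ts.hour),
        'days': float(ts.day),
        'months': float(ts.month),
        'years': float(ts.year - 2000),
        'longitude_distance': dropoff_lon - pickup_lon,
        'latitude_distance': dropoff_lat - pickup_lat,
        # a missing passenger count becomes 0, like fillna(0) in the cell above
        'passenger_count': float(passenger_count) if passenger_count is not None else 0.0,
    }


# hypothetical ride, not taken from the dataset
row = extract_features('2013-09-10 20:15:00', -73.99, 40.75, -73.94, 40.76, 1)
print(row)
```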


In [8]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM train_taxi')
# how's it look?
y_train.head()

|   | fare_amount |
|---|-------------|
| 0 | 17.0 |
| 1 | 3.3 |
| 2 | 4.1 |
| 3 | 6.0 |
| 4 | 8.9 |


## Linear Regression
To learn more about cuML's LinearRegression model, check out [Beginner’s Guide to Linear Regression in Google Colab with cuML](https://medium.com/future-vision/beginners-guide-to-linear-regression-in-python-with-cuml-30e2709c761?source=friends_link&sk=1da35920b9e2ffea59d5cb3c998bfeae).

### LR: Train Model

In [9]:
%%time
# call & create cuML model
lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")

# train Linear Regression model 
reg = lr.fit(X_train, y_train)

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

Coefficients:
0   -0.027069
1    0.003295
2    0.107198
3    0.636705
4    0.000932
5   -0.000494
6    0.092028
dtype: float32

Y intercept:
3.3608126640319824

CPU times: user 892 ms, sys: 412 ms, total: 1.3 s
Wall time: 2.25 s
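
For intuition about what a fit with `fit_intercept=True` produces, the sketch below solves ordinary least squares for a single feature in plain Python. It is not how cuML computes the result on the GPU; the function name `fit_simple_ols` and the toy data are made up for this example.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the intercept makes the line pass through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# toy data generated from y = 2x + 1
slope, intercept = fit_simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)
```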


### LR: Use Model to Predict Future Taxi Fares 

#### Download Test Data
The cell below will check to see if you've already got the Test data, and, if you don't, will download it for you.

In [10]:
%%time
# do we have Test taxi file?
if not os.path.isfile('data/test.csv'):
    !wget -P data https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv
else:
    print('test data already downloaded')

--2020-01-23 04:49:37--  https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv
Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.116.137
Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.116.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982916 (960K) [text/csv]
Saving to: ‘data/test.csv’


2020-01-23 04:49:38 (2.22 MB/s) - ‘data/test.csv’ saved [982916/982916]

CPU times: user 8.09 ms, sys: 26.9 ms, total: 35 ms
Wall time: 902 ms


In [11]:
%%time
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# tag path to test data
test_path = cwd + '/data/test.csv'

# create test table directly from CSV
bc.create_table('test_taxi', test_path, names=col_names, dtype=col_types)

CPU times: user 1.68 ms, sys: 5.19 ms, total: 6.87 ms
Wall time: 5.42 ms


<pyblazing.apiv2.context.BlazingTable at 0x7fc0b95790b8>

In [12]:
%%time
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            dropoff_longitude - pickup_longitude as longitude_distance, 
            dropoff_latitude - pickup_latitude as latitude_distance, 
            passenger_count
        from 
            test_taxi
            '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

CPU times: user 61.8 ms, sys: 1.41 ms, total: 63.2 ms
Wall time: 36.9 ms


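|   | hours | days | months | years | longitude_distance | latitude_distance | passenger_count |
|---|-------|------|--------|-------|--------------------|-------------------|-----------------|
| 0 | 13.0 | 27.0 | 1.0 | 15.0 | -0.00811 | -0.01997 | 1.0 |
| 1 | 13.0 | 27.0 | 1.0 | 15.0 | -0.012024 | 0.019814 | 1.0 |
| 2 | 11.0 | 8.0 | 10.0 | 11.0 | 0.002869 | -0.005119 | 1.0 |
| 3 | 21.0 | 1.0 | 12.0 | 12.0 | -0.009277 | -0.016178 | 1.0 |
| 4 | 21.0 | 1.0 | 12.0 | 12.0 | -0.022537 | -0.045345 | 1.0 |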


In [13]:
# predict fares 
predictions = lr.predict(X_test)

# display predictions
predictions

0       12.847689
1       12.847666
2       11.257179
3       11.814514
4       11.814518
5       11.814510
6       11.223505
7       11.223265
8       11.223516
9       12.234369
10      12.234383
11      12.234411
12       9.695659
13       9.695644
14      11.467134
15      11.467148
16      11.460003
17      11.460035
18      11.460011
19      11.460001
20      13.480091
21      12.704147
22      12.704123
23      12.704136
24      12.704132
25      12.704119
26      12.704292
27      12.704145
28      12.704140
29      12.704115
          ...    
9884    12.641771
9885    12.641808
9886    12.641790
9887    12.641766
9888    12.641785
9889    12.641790
9890    12.641781
9891    12.641809
9892    12.641788
9893    12.641804
9894    12.641783
9895    12.641851
9896    12.641764
9897    13.446104
9898    13.204254
9899    14.129877
9900    13.363419
9901    13.627535
9902    14.162102
9903    13.824402
9904    13.664045
9905    13.252615
9906    14.129101
9907    13.444111
9908    13

In [14]:
# add predictions to test dataframe
X_test['predicted_fare'] = predictions

# how's that look?
X_test.head()

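|   | hours | days | months | years | longitude_distance | latitude_distance | passenger_count | predicted_fare |
|---|-------|------|--------|-------|--------------------|-------------------|-----------------|----------------|
| 0 | 13.0 | 27.0 | 1.0 | 15.0 | -0.00811 | -0.01997 | 1.0 | 12.847689 |
| 1 | 13.0 | 27.0 | 1.0 | 15.0 | -0.012024 | 0.019814 | 1.0 | 12.847666 |
| 2 | 11.0 | 8.0 | 10.0 | 11.0 | 0.002869 | -0.005119 | 1.0 | 11.257179 |
| 3 | 21.0 | 1.0 | 12.0 | 12.0 | -0.009277 | -0.016178 | 1.0 | 11.814514 |
| 4 | 21.0 | 1.0 | 12.0 | 12.0 | -0.022537 | -0.045345 | 1.0 | 11.814518 |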
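
A linear regression prediction is the intercept plus the dot product of the coefficients and the feature row. As a sanity check, the sketch below recomputes the first prediction by hand from the coefficients and intercept printed in the training cell and the first test row shown above. The values are copied from the rounded printouts, so the result should land close to, but not exactly on, the model's `12.847689`. The helper name `predict_row` is made up for this example.

```python
# coefficients and intercept as printed by the training cell (rounded)
coef = [-0.027069, 0.003295, 0.107198, 0.636705, 0.000932, -0.000494, 0.092028]
intercept = 3.3608126640319824

# first row of X_test: hours, days, months, years,
# longitude_distance, latitude_distance, passenger_count
first_row = [13.0, 27.0, 1.0, 15.0, -0.00811, -0.01997, 1.0]


def predict_row(features, coef, intercept):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coef, features))


manual_prediction = predict_row(first_row, coef, intercept)
print(round(manual_prediction, 3))
```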
