# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to set up [BlazingSQL](https://blazingsql.com) and the [RAPIDS AI](https://rapids.ai/) suite.
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

![Impression](https://www.google-analytics.com/collect?v=1&tid=UA-39814657-5&cid=555&t=event&ec=guides&ea=taxi_fare_prediction&dt=taxi_fare_prediction)


## Import packages and create Blazing Context
You can think of the BlazingContext much like a Spark Context (i.e. where information such as FileSystems you have registered and Tables you have created will be stored). If you have issues running this cell, restart runtime and try running it again.

In [2]:
# Import RAPIDS AI stack
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

lo
BlazingContext ready


### Download Data
For this demo we will train our model with 20,000,000 rows of data from 4 csv files (5,000,000 rows each). 

The cell below will download them from AWS for you.

In [2]:
!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv
!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv
!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv
!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv

## Extract, transform, load
In order to train our Linear Regression model, we must first preform ETL so to prepare our data.

### ETL: Read and Join CSVs

In [4]:
# set column names and types
column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
column_types = ['date64', 'float32', 'float32', 'float32', 
                'float32', 'float32', 'float32']

# load first csv 
gdf_00 = cudf.read_csv('taxi_00.csv', delimiter=',', dtype=column_types, names=column_names)
# load second csv
gdf_01 = cudf.read_csv('taxi_01.csv', delimiter=',', dtype=column_types, names=column_names)
# load third csv
gdf_02 = cudf.read_csv('taxi_01.csv', delimiter=',', dtype=column_types, names=column_names)
# load fourth csv
gdf_03 = cudf.read_csv('taxi_01.csv', delimiter=',', dtype=column_types, names=column_names)

# combine all those dataframes into one master dataframe
gdf = cudf.concat([gdf_00,gdf_01, gdf_02, gdf_03])

# what's it look like?
gdf.head()

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2012-02-02 22:30:19.002,8.9,-73.988708,40.758804,-73.986519,40.737202,1.0
1,2014-09-20 07:19:24.001,4.0,-73.990204,40.746708,-73.994728,40.750515,1.0
2,2013-02-23 07:18:05.001,5.5,-74.016762,40.709438,-74.009003,40.719498,3.0
3,2015-04-18 23:49:27.009,13.5,-74.002708,40.73373,-73.986099,40.734776,1.0
4,2010-03-04 08:15:59.001,10.5,-73.988365,40.737663,-74.012459,40.713932,1.0


### ETL: Create Table

In [5]:
%time
# make a table from the combined df
bc.create_table('taxi', gdf, column_names=column_names)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs


<pyblazing.apiv2.context.BlazingTable at 0x7f495c01be10>

### ETL: Query Tables for Training Data

In [8]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill null values 
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

# how's it look? 
X_train.head()

20196


Unnamed: 0,hours,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,3.0,10.0,-0.024094,-0.023731,1.0


In [14]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM main.taxi')

y_train.head()

20196


Unnamed: 0,fare_amount
0,8.9
1,4.0
2,5.5
3,13.5
4,10.5


## Linear Regression
### LR: Train Model

In [15]:
%%time

import cuml
from cuml import LinearRegression

#create model
# lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = "eig")
lr = LinearRegression()

# train model on the first 1,700,000 rows (most my memory can fit)
reg = lr.fit(X_train[:17000000], y_train[:17000000])

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

ModuleNotFoundError: No module named 'cuml'

### LR: Use Model to Predict Future Taxi Fares 

For this we are using a second dataset with no fare amount. The cell below will download this dataset for you.

In [18]:
# download test data
!wget 'https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv'

--2019-11-25 20:21:20--  https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv
Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.120.129
Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.120.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982916 (960K) [text/csv]
Saving to: ‘test.csv’


2019-11-25 20:21:21 (4.48 MB/s) - ‘test.csv’ saved [982916/982916]



In [19]:
# set column names and types
column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
column_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# use cuDF to make GDF
gdf2 = cudf.read_csv('test.csv', delimiter=',', dtype=column_types, names=column_names)

# create test table from GDF
bc.create_table('test', gdf2)

<pyblazing.apiv2.context.BlazingTable at 0x7f4901d64ba8>

In [20]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            test
        '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

20196


Unnamed: 0,hours,months,years,longitude_distance,latitude_distance,passenger_count
0,13.0,1.0,15.0,-0.00811,-0.01997,1.0
1,13.0,1.0,15.0,-0.012024,0.019814,1.0
2,11.0,10.0,11.0,0.002869,-0.005119,1.0
3,21.0,12.0,12.0,-0.009277,-0.016178,1.0
4,21.0,12.0,12.0,-0.022537,-0.045345,1.0


In [21]:
# predict fares 
predictions = lr.predict(X_test_gdf)

# display predictions
predictions

NameError: name 'lr' is not defined

In [19]:
# combine into a table of table points and predictions
X_test['predicted_fare'] = predictions

# how's that look?
X_test.head()

Unnamed: 0,hours,months,years,longitude_distance,latitude_distance,passenger_count,predicted_fare
0,13.0,1.0,15.0,-0.00811,-0.01997,1.0,12.778599
1,13.0,1.0,15.0,-0.012024,0.019814,1.0,12.778547
2,11.0,10.0,11.0,0.002869,-0.005119,1.0,11.284673
3,21.0,12.0,12.0,-0.009277,-0.016178,1.0,11.864964
4,21.0,12.0,12.0,-0.022537,-0.045345,1.0,11.864986


## Real Life Example  
Predict cost of a ride from Grand Central Station to Samsung Next NYC at 7:00 AM on May 15th, 2020.

In [20]:
# build a dataframe with cuDF
samsung_ride = cudf.DataFrame([('hours', 7.0), ('months', 5.0), 
                               ('years', 20.0), ('longitude_distance', 0.012727), 
                               ('latitude_distance', 0.008484), ('passenger_count', 1.0)])

samsung_ride['hours'] = samsung_ride['hours'].astype('float32')

samsung_ride['months'] = samsung_ride['months'].astype('float32')
samsung_ride['years'] = samsung_ride['years'].astype('float32')
samsung_ride['longitude_distance'] = samsung_ride['longitude_distance'].astype('float32')
samsung_ride['latitude_distance'] = samsung_ride['latitude_distance'].astype('float32')
samsung_ride['passenger_count'] = samsung_ride['passenger_count'].astype('float32')

# make prediction
samsung_prediction = lr.predict(samsung_ride)

# output fare prediction
samsung_prediction

0    16.517778
dtype: float32