# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

## Imports

In [1]:
import os
from cuml import LinearRegression 
from blazingsql import BlazingContext 

## Create BlazingContext
You can think of the BlazingContext much like a SparkContext, this is where information such as FileSystems you have registered and Tables you have created will be stored. 

In [2]:
# start up BlazingSQL 
bc = BlazingContext()

BlazingContext ready


### Download Data
For this demo we will train our model with 25,000,000 rows of data from 5 csv files (5,000,000 rows each). 

The cell below will download them from AWS for you. This can take a few minutes.

In [3]:
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv
# !wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_04.csv    

## Extract, transform, load
To train our Linear Regression model, we must first ETL our data into GPU memory. BlazingContext achieves this simply with .[create_table()](https://docs.blazingdb.com/docs/using-blazingsql#section-create-tables), which locates data via the full file path. The cell below identifies the path to this notebook, then adds a wildcard `*` which allows BlazingContext to read all 5 CSVs at once.

In [4]:
# find current working directory 
cwd = os.getcwd()
# add / & taxi wildcard to this directory
data_path = cwd + '/taxi*'
# what's the final path?
data_path

'/home/winston/bsql-demos/taxi*'

### ETL: Read and Join CSVs

In [5]:
%%time
# tag column names & types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
             'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32',
             'float32', 'float32', 'float32']

# create a table from all 'taxi*' files
bc.create_table('taxi', data_path, names=col_names, dtype=col_types)

CPU times: user 9.55 ms, sys: 0 ns, total: 9.55 ms
Wall time: 8.05 ms


<pyblazing.apiv2.context.BlazingTable at 0x7f7d7dcde550>

In [6]:
# query the whole table & display last 5 rows
bc.sql("select * from taxi").tail()  # note: BlazingSQL queries return cuDF DataFrame results

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
25001080,2011-02-24 16:06:26.001,6.9,-73.966537,40.804974,-73.949043,40.804226,2.0
25001081,2009-09-22 19:20:22.009,9.7,-73.980064,40.752533,-74.006432,40.739613,1.0
25001082,2012-04-19 02:17:32.001,14.1,-73.998512,40.745308,-73.953186,40.799362,2.0
25001083,2012-06-08 11:09:47.006,3.3,-73.953636,40.778801,-73.946068,40.775555,1.0
25001084,2009-06-21 11:07:00.036,6.5,-73.981583,40.772572,-73.963326,40.762135,1.0


### ETL: Query Table for Training Data
BlazingSQL allows but does not require capitalized SQL statements, mismatched combinations also work.  
A few examples:
- SELECT colA FROM table WHERE condition
- select colA from table where condition
- select colA from table WHERE condition

In [7]:
%%time
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days,
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill (1254, 1254 & 1085) null values 
print(len(X_train.loc[X_train['longitude_distance'].isna()==True]))
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
print(len(X_train.loc[X_train['latitude_distance'].isna()==True]))
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
print(len(X_train.loc[X_train['passenger_count'].isna()==True]))
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

1254
1254
1085
CPU times: user 2.93 s, sys: 1.89 s, total: 4.82 s
Wall time: 2.7 s


In [8]:
# how's it look?
X_train.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,3.0,12.0,4.0,9.0,0.051445,0.050167,3.0
1,16.0,28.0,1.0,10.0,0.0382,0.056992,1.0
2,20.0,15.0,8.0,10.0,0.080971,-0.006474,1.0
3,10.0,6.0,2.0,10.0,-0.011604,-0.009991,1.0
4,20.0,22.0,4.0,10.0,0.008423,-0.00927,3.0


In [12]:
%%time
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM taxi')
# fill (1164) null values 
y_train = y_train.fillna(0)

CPU times: user 2.25 s, sys: 1.68 s, total: 3.93 s
Wall time: 2.1 s


In [13]:
# how's it look?
y_train.head()

Unnamed: 0,fare_amount
0,14.6
1,16.9
2,5.7
3,4.9
4,6.9


## Linear Regression
### LR: Train Model

In [14]:
%%time
#create model
lr = LinearRegression(algorithm="eig", fit_intercept=True, normalize=False)

# train model on the first 1,700,000 rows (most my memory can fit)
reg = lr.fit(X_train, y_train)

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

Coefficients:
0   -0.026974
1    0.003284
2    0.105881
3    0.613533
4    0.000939
5   -0.000498
6    0.091220
dtype: float32

Y intercept:
3.642357349395752

CPU times: user 450 ms, sys: 155 ms, total: 605 ms
Wall time: 847 ms


### LR: Use Model to Predict Future Taxi Fares 

For this we are using a second dataset with no fare amount. The cell below will download this dataset for you.

In [16]:
# download test data
# !wget 'https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv'

In [17]:
# create test table test CSV
bc.create_table('test', cwd+'/test.csv', names=col_names, dtype=col_types)

<pyblazing.apiv2.context.BlazingTable at 0x7f7d7ddd9208>

In [30]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days,
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            test
        '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,13.0,27.0,1.0,15.0,-0.00811,-0.01997,1.0
1,13.0,27.0,1.0,15.0,-0.012024,0.019814,1.0
2,11.0,8.0,10.0,11.0,0.002869,-0.005119,1.0
3,21.0,1.0,12.0,12.0,-0.009277,-0.016178,1.0
4,21.0,1.0,12.0,12.0,-0.022537,-0.045345,1.0


### Make Predictions 
- check csv for actual values
- are we going to compare or just predict?

In [31]:
# predict fares 
predictions = lr.predict(X_test)

# display a few predictions
predictions.to_pandas().sample(3)

1526    10.809132
1331    13.092863
5002    11.181375
dtype: float32

In [28]:
# combine into a table of table points and predictions
X_test['predicted_fare'] = predictions

# how's that look? (pd sample)
X_test.to_pandas().sample(5)

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count,predicted_fare
9672,21.0,20.0,11.0,12.0,0.021873,0.015663,5.0,12.12476
7422,13.0,24.0,5.0,13.0,0.005226,-0.006298,2.0,12.058273
3373,21.0,20.0,11.0,12.0,-0.015305,-0.002697,1.0,11.759855
9625,7.0,9.0,12.0,10.0,-0.022896,0.008629,5.0,11.345058
3266,21.0,20.0,11.0,12.0,-0.218193,0.091305,1.0,11.759619


## Real Life Example  
Predict cost of a ride from Grand Central Station to Samsung Next NYC at 7:00 AM on May 15th, 2020.
- needs adjusting in instance

In [55]:
# build a dataframe with cuDF
import cudf

# samsung_ride = cudf.DataFrame()

# samsung_ride['hours'] = 7.0 float()
# samsung_ride['days'] = 15.0
# samsung_ride['months'] = 5.0
# samsung_ride['years'] = 20.0
# samsung_ride['longitude_distance'] = 0.012727
# samsung_ride['latitude_distance'] = 0.008484
# samsung_ride['passenger_count'] = 1.0

# # tag column names and values 
cols = ['hours', 'days', 'months', 'years', 'longitude_distance',
       'latitude_distance', 'passenger_count']
vals = [7.0, 15.0, 5.0, 20.0, 0.012727, 0.008484, 1.0]
samsung_ride = cudf.DataFrame(data=vals, columns=cols)

samsung_ride['hours'] = samsung_ride['hours'].astype('float32')
samsung_ride['days'] = samsung_ride['days'].astype('float32')
samsung_ride['months'] = samsung_ride['months'].astype('float32')
samsung_ride['years'] = samsung_ride['years'].astype('float32')
samsung_ride['longitude_distance'] = samsung_ride['longitude_distance'].astype('float32')
samsung_ride['latitude_distance'] = samsung_ride['latitude_distance'].astype('float32')
samsung_ride['passenger_count'] = samsung_ride['passenger_count'].astype('float32')

# # make prediction
samsung_prediction = lr.predict(samsung_ride)

# # output fare prediction
# samsung_prediction

KeyError: 'hours'

In [50]:
samsung_ride

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,7.0,15.0,5.0,20.0,0.012727,0.008484,1.0


In [45]:
samsung_ride

Unnamed: 0,0
0,7.0
1,15.0
2,5.0
3,20.0
4,0.012727
5,0.008484
6,1.0


In [54]:
X_train.columns.values

array(['hours', 'days', 'months', 'years', 'longitude_distance',
       'latitude_distance', 'passenger_count'], dtype=object)