# NYC fare prediction

[This notebook](TBD) shows my approach to predicting the fare amount of a taxi ride in NYC, given the pickup and dropoff locations of the passengers, for the [New York City Taxi Fare Prediction Challenge](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction).



---

This notebook is separated into sections that follow the common data science workflow (except for the hypothesis and data collection steps, which were already done).


0.   Previous Commits
1.   Setup and Check Infrastructure
2.   Having a first look at the Data (EDA)
3.   Data Cleaning (Feature Engineering)
4.   Linear Regression
5.   Ridge Regression
6.   Model ...
7.   Evaluation and Discussion

---



# 0. Previous Commits - Comparison of different Models:



Unless a `data_size` is listed, predictions were made using the whole data set.

### Linear Regression

**Commit 1 (Baseline) Score: 5.67093**
 

### Ridge Regression

**Commit 2 Score: 12.54277**

params_ridge
  - alpha = loguniform(1e-5, 1e0)
  - solver = ['eig', 'cd']
  - n_iter = 100
  - cv = 5
  - verbose = 0
  - n_jobs = 1

ridge_params
  - alpha = 0.240960447726532
  - fit_intercept = True
  - normalize = False
  - solver = 'eig'

R^2 score (`r2_score`) for Ridge Regression: -7.212150573730469


### K-Nearest Neighbor 
**Commit 3 Score: 4.86790**
  - n_neighbors = 4
  - data_size = 5.5 million rows (10 %)

### Random Forest 
**Commit 4 Score: 5.95518**
  - n_estimators=10
  - n_jobs=-1
  - data_size = 1 million rows



# 1. Setup and Check Infrastructure

In [2]:
## Switch from Kaggle to Colab easily
environment='Colab'

## when True only 50,000 rows are used for debugging purposes. Set to False when doing real training
debug_mode=True 

## choose how many rows of the training data to use (only applies when debug_mode=False); the maximum is 55423480
rows_datasample=5542348

In [3]:
if environment == 'Kaggle':
  env_submission_path='./'
  env_path='../input/new-york-city-taxi-fare-prediction/'
  print('The environment and paths were successfully setup for Kaggle')
elif environment == 'Colab':
  env_submission_path='/content/drive/My Drive/Colab Notebooks/'
  env_path='/content/drive/My Drive/Colab Notebooks/'

  from google.colab import drive
  drive.mount('/content/drive')

  print('The environment and paths were successfully setup for Colab')

else:
  print('Something went wrong here, please choose one of the options for path completion: Kaggle or Colab (or implement your own thing)')


Mounted at /content/drive
The environment and paths were successfully setup for Colab


Check for GPU

In [4]:
!nvidia-smi

Mon Sep 21 15:28:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
if environment == 'Kaggle':
  import sys
  !cp ../input/rapids/rapids.0.14.0 /opt/conda/envs/rapids.tar.gz
  !cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
  sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
  sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
  sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
  !cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/


elif environment == 'Colab':
  # Install RAPIDS and Dask_ml
  !pip install dask_ml
  !pip install dask_cuda

  !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
  !bash rapidsai-csp-utils/colab/rapids-colab.sh stable


  import sys, os

  dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
  sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
  sys.path
  exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

else:
  print('Something went wrong here, please choose one of the options for path completion: Kaggle or Colab (or implement your own thing). If Kaggle failed please make sure you added the RAPIDS file on Kaggle to your Input!')

Collecting dask_ml
  Downloading https://files.pythonhosted.org/packages/fb/68/03a7bc7ff2378c865b7a5c6344947d790e4c69c0b38b5c7fa4fd1fb939e0/dask_ml-1.6.0-py3-none-any.whl (140kB)
Collecting distributed>=2.4.0
  Downloading https://files.pythonhosted.org/packages/bf/e4/ee17c167321d95bc35e7e379eea9257663509c95327a379c927fb5486565/distributed-2.27.0-py3-none-any.whl (652kB)
Collecting multipledispatch>=0.4.9
  Downloading https://files.pythonhosted.org/packages/89/79/429ecef45fd5e4504f7474d4c3c3c4668c267be3370e4c2fd33e61506833/multipledispatch-0.6.0-py3-none-any.whl
Collecting scikit-learn>=0.23
  Downloading https://files.pythonhosted.org/packages/5c/a1/273def87037a7fb010512bbc5901c31cfddfca8080bc63b42b26e3cc55b3/scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Collec

In [6]:
import nvstrings
import numpy as np
import cudf, cuml
import dask_cudf
import io, requests
import math
import gc
import cupy as cp
import pandas as pd

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from scipy.stats import uniform

# Linear Models https://github.com/rapidsai/cuml/tree/branch-0.13/notebooks
from cuml.linear_model import LinearRegression # Linear
from cuml.linear_model import LogisticRegression # Logistic
from cuml.linear_model import ElasticNet # Elastic
from cuml.linear_model import Ridge # Ridge
from cuml.linear_model import Lasso # Lasso
from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor # Mini Batch SGD Regressor

from cuml.solvers import SGD as cumlSGD # Stochastic Gradient Descent
from cuml.ensemble import RandomForestRegressor as cuRF # Random Forest
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF # RandomForest

from cuml.neighbors import KNeighborsRegressor as cumlKNR # Nearest Neighbours
from cuml.svm import SVC # Support Vector Machines

from cuml import ForestInference
import xgboost as xgb

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from cuml.metrics.regression import r2_score
from cuml.metrics.accuracy import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score as sk_acc
from sklearn.utils.fixes import loguniform

  """Entry point for launching an IPython kernel.
Detected GPU 0: Tesla K80
Detected Compute Capability: 3.7
  + str(minor_version)
  import pandas.util.testing as tm


If you run these notebooks on Colab as well as on Kaggle, you might find this placeholder setup helpful. The only thing you need to change is the `environment` variable (Kaggle or Colab).
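The environment switch can also be written as a lookup table, which keeps the paths in one place and fails loudly on a typo. This is only a sketch: `ENV_PATHS` and `resolve_paths` are hypothetical names that do not appear in the notebook, and the Colab drive mount is deliberately left out.

```python
# Hypothetical helper (not part of the original notebook): maps an
# environment name to the submission and data path prefixes used above.
ENV_PATHS = {
    "Kaggle": {
        "submission": "./",
        "data": "../input/new-york-city-taxi-fare-prediction/",
    },
    "Colab": {
        "submission": "/content/drive/My Drive/Colab Notebooks/",
        "data": "/content/drive/My Drive/Colab Notebooks/",
    },
}


def resolve_paths(environment):
    """Return (env_submission_path, env_path) or raise on an unknown name."""
    try:
        paths = ENV_PATHS[environment]
    except KeyError:
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of {sorted(ENV_PATHS)}"
        ) from None
    return paths["submission"], paths["data"]


env_submission_path, env_path = resolve_paths("Kaggle")
print(env_path)
```

Raising an exception instead of printing a message stops the notebook at the misconfigured cell rather than several cells later.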

# 2. First look at the Data

In [8]:
cudf.set_allocator("managed")
dtype = {'fare_amount': 'float32',
              'pickup_datetime':'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}

usecols = list(dtype.keys())

In [3]:
%%time
# debug mode uses a subset of 50,000 rows; the full training set has 55423480 rows

if not debug_mode:
  ## use rows_datasample rows (5542348 by default, about 10 % of the training data)
  nrows = rows_datasample
else:
  ## using a very small sample just for testing
  nrows = 50000

test = cudf.read_csv(env_path+'test_taxi.csv', usecols=usecols, dtype=dtype)
train = cudf.read_csv(env_path+'train.csv', nrows=nrows, usecols=usecols, dtype=dtype)
submission = cudf.read_csv(env_path+'sample_submission.csv', usecols=usecols, dtype=dtype)

NameError: ignored

In [None]:
train.head(5)

In [None]:
test.head(5)

# 3. Data Cleaning

In [1]:
#Drop NaN values
# nans_to_nulls() returns a new DataFrame, so the result has to be assigned
train = train.nans_to_nulls()
train = train.dropna()

NameError: ignored

In [2]:
#Checking shape of the data
print("Train: " + str(train.shape))
print("Test: " + str(test.shape))

NameError: ignored

In [None]:
#Changing the data format of pickup_datetime and adding additional information about pickup time
train['pickup_datetime'] = train['pickup_datetime'].astype('datetime64[ns]')

train["hour"] = train.pickup_datetime.dt.hour
train["weekday"] = train.pickup_datetime.dt.weekday
train["month"] = train.pickup_datetime.dt.month
train["year"] = train.pickup_datetime.dt.year


test['pickup_datetime'] = test['pickup_datetime'].astype('datetime64[ns]')

test["hour"] = test.pickup_datetime.dt.hour
test["weekday"] = test.pickup_datetime.dt.weekday
test["month"] = test.pickup_datetime.dt.month
test["year"] = test.pickup_datetime.dt.year
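To make the four derived time features concrete, here is a small standard-library sketch for a single timestamp. The sample string and its `YYYY-MM-DD HH:MM:SS UTC` format are assumptions for illustration, and `time_features` is a hypothetical helper; the notebook itself does this column-wise through `.dt` accessors.

```python
from datetime import datetime


def time_features(pickup_datetime):
    """Extract hour, weekday (Monday=0), month and year from one timestamp.

    Assumes the string looks like 'YYYY-MM-DD HH:MM:SS UTC'.
    """
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S UTC")
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),
        "month": ts.month,
        "year": ts.year,
    }


features = time_features("2009-06-15 17:26:21 UTC")
print(features)
# {'hour': 17, 'weekday': 0, 'month': 6, 'year': 2009}
```

`weekday()` counts from Monday as 0, which is the same convention as the `dt.weekday` accessor used above.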

In [None]:
#calculate trip distance in miles
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
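The function above is the haversine formula with the Earth's diameter (12742 km) converted to miles. A quick sanity check with plain NumPy, using made-up coordinates: identical points should be 0 miles apart, and one degree of latitude should come out at roughly 69 miles. The name `haversine_miles` is only used here to keep the sketch self-contained.

```python
import numpy as np


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, same formula as distance() above."""
    p = np.pi / 180.0  # degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p)
         * (1 - np.cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))


same_point = haversine_miles(40.75, -73.98, 40.75, -73.98)
one_degree_north = haversine_miles(40.0, -73.98, 41.0, -73.98)
print(same_point, one_degree_north)
```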

In [None]:
train['distance'] = distance(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude'] )
test['distance'] = distance(test['pickup_latitude'], test['pickup_longitude'], test['dropoff_latitude'], test['dropoff_longitude'] )
train['distance'].describe()

In [None]:
#check if everything worked
train.head(10)

In [None]:
test.head(2)

In [None]:
print("Average fare amount: " + str(train['fare_amount'].mean()))
print("Standard deviation fare amount: " + str(train['fare_amount'].std()))
print("Average distance: " + str(train['distance'].mean()) + " miles")
print("Standard deviation distance: " + str(train['distance'].std()) + " miles")

In [None]:
train.describe()

Visualization of the data <br>

The following things were noticed (while using 500k data points):
*   The minimum fare_amount is negative. As this is not realistic, these rows are dropped from the dataset.
*   Some of the minimum and maximum longitude/latitude coordinates are way off. These rows are also removed from the dataset (a bounding box is defined for this).
*   The average fare_amount is about 9.79 USD with a standard deviation of 7.48 USD. Always predicting the average fare would give an RMSE of about 7.48 USD, so a useful predictive model has to achieve an RMSE below that.
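The link between the standard deviation and the baseline can be checked directly: a model that always predicts the mean has an RMSE equal to the (population) standard deviation of the target. The sketch below uses synthetic fares, not the competition data, and the `rmse` helper is a hypothetical name.

```python
import numpy as np

# Synthetic fares for illustration only (not the competition data).
rng = np.random.default_rng(0)
fares = rng.gamma(shape=2.0, scale=5.0, size=10_000)


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Baseline: predict the mean fare for every ride.
baseline_pred = np.full_like(fares, fares.mean())
baseline_rmse = rmse(fares, baseline_pred)

# np.std uses ddof=0 by default, so both numbers agree up to rounding.
print(baseline_rmse, fares.std())
```

DataFrame-style `.std()` methods usually use `ddof=1`, so the value they report is marginally larger; for hundreds of thousands of rows the difference is negligible.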



In [None]:
train = train[train.fare_amount>=0]
train = train[(train['distance'] < 30) & (train['distance'] >=0 )]

In [None]:
fare_amount = train['fare_amount'].to_array()
passenger_count = train['passenger_count'].to_array()
# named distance_values so the distance() function defined above is not overwritten
distance_values = train['distance'].to_array()

In [None]:
plt.figure(figsize=(8,5))
sns.kdeplot(fare_amount).set_title("Distribution of the fare amount")

In [None]:
plt.figure(figsize=(8,5))
sns.kdeplot(distance_values).set_title("Distance")

In [None]:
#check max latitude and max longitude of test data
print("Max lat pickup: " + str(test['pickup_latitude'].max()))
print("Max lat dropoff: " + str(test['dropoff_latitude'].max()))
print("Max lon pickup: " + str(test['pickup_longitude'].max()))
print("Max lon dropoff: " + str(test['dropoff_longitude'].max()))
print("")
print("Min lat pickup: " + str(test['pickup_latitude'].min()))
print("Min lat dropoff: " + str(test['dropoff_latitude'].min()))
print("Min lon pickup: " + str(test['pickup_longitude'].min()))
print("Min lon dropoff: " + str(test['dropoff_longitude'].min()))

Bounding box for New York:
<table>
  <tr>
    <th></th>
    <th>Dropoff</th>
    <th>Pickup</th>
  </tr>
  <tr>
    <td>Max Long</td>
    <td>-72.99096</td>
    <td>-72.986534</td>
   </tr>
   <tr>
    <td>Max Lat</td>
    <td>41.696682</td>
    <td>41.709553</td>
   </tr>
   <tr>
    <td>Min Long</td>
    <td>-74.26323</td>
    <td>-74.25219</td>
    </tr>
   <tr>
    <td>Min Lat</td>
    <td>40.568974</td>
    <td>40.57314</td>
   </tr>
</table>
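A bounding box like this boils down to four comparisons per point. The following NumPy sketch shows the idea on three made-up coordinates; `BOUNDS` holds illustrative limits rounded from the table, and `inside_box` is a hypothetical helper rather than code from the notebook.

```python
import numpy as np

# Illustrative (min, max) limits, rounded from the table above.
BOUNDS = {"longitude": (-74.26, -72.98), "latitude": (40.56, 41.71)}


def inside_box(lon, lat, bounds=BOUNDS):
    """Boolean mask that is True where (lon, lat) lies strictly inside the box."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    lon_min, lon_max = bounds["longitude"]
    lat_min, lat_max = bounds["latitude"]
    return (lon > lon_min) & (lon < lon_max) & (lat > lat_min) & (lat < lat_max)


# One point in Manhattan, one at (0, 0), one too far north.
lon = np.array([-73.98, 0.0, -73.95])
lat = np.array([40.75, 0.0, 45.00])
mask = inside_box(lon, lat)
print(mask)  # [ True False False]
```

The same mask can be applied to pickup and dropoff coordinates separately and combined with `&` before filtering the rows.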



In [None]:
train.head(2)

In [None]:
#Parts of train data are too far away, so they can be dropped
train = train[(train['pickup_longitude'] > -74.25) & (train['pickup_longitude'] < -72.98)]
train = train[(train['pickup_latitude'] > 40.57) & (train['pickup_latitude'] < 41.70)]
train = train[(train['dropoff_longitude'] < -72.99) & (train['dropoff_longitude'] > -74.26)]
train = train[(train['dropoff_latitude'] > 40.56) & (train['dropoff_latitude'] < 41.69)]

In [None]:
dropoff_longitude = train['dropoff_longitude'].to_array()
dropoff_latitude = train['dropoff_latitude'].to_array()

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

plt.figure(figsize=(10,6))
plt.scatter(dropoff_longitude, dropoff_latitude,
                color='green', 
                s=.02, alpha=.6)
plt.title("Dropoffs")

plt.ylim(city_lat_border)
plt.xlim(city_long_border)

In [None]:
unnecessary_columns=['pickup_datetime','dropoff_latitude','pickup_latitude','dropoff_longitude','pickup_longitude']
train=train.drop(unnecessary_columns,axis=1)
test=test.drop(unnecessary_columns,axis=1)

In [None]:
train.head(2)

In [None]:
test.head(2)

# 4. Linear Regression

In [30]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  39130
Number of records in validation data  9782
(39130, 6)
(9782, 6)
(39130,)
(9782,)


In [31]:
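lm = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")
lm.fit(X_train,y_train)
y_pred=lm.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not an RMSE.
# Note: cuML's r2_score expects (y_true, y_pred); with the order used here
# the score is normalised by the variance of the predictions.
lm_r2 = r2_score(y_pred, y_test)
print("R^2 score for Linear Regression is ",lm_r2)

R^2 score for Linear Regression is  0.6346495151519775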
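The cell above scores the model with `r2_score`, while the Kaggle leaderboard uses RMSE, so the two numbers are not comparable. A minimal NumPy sketch of both metrics on made-up values (the helper names `rmse` and `r2` are hypothetical) shows the difference, and also that R^2 changes when the two arguments are swapped:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the unit of the target (USD here)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (unitless)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up fares and predictions; every prediction is off by exactly 1 USD.
y_true = np.array([5.0, 10.0, 15.0, 20.0])
y_pred = np.array([6.0, 9.0, 16.0, 19.0])

print(rmse(y_true, y_pred))  # 1.0
print(r2(y_true, y_pred))    # 1 - 4/125 = 0.968
print(r2(y_pred, y_true))    # 1 - 4/109, not the same value
```

RMSE is symmetric in its arguments, whereas R^2 is not, because the denominator uses the variance of whichever array is passed first.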


In [32]:
y_pred=lm.predict(test)
y_pred

0       11.041992
1       11.041992
2        4.717896
3       10.165771
4       16.538208
          ...    
9909    11.186523
9910    14.112183
9911    47.256714
9912    23.951904
9913     6.199097
Length: 9914, dtype: float32

In [33]:
gdf_submission = submission
gdf_submission['fare_amount']= y_pred

gdf_submission.head()

Unnamed: 0,fare_amount
0,11.041992
1,11.041992
2,4.717896
3,10.165771
4,16.538208


In [34]:
gdf_submission.to_csv(env_submission_path+'submission_LinearReg.csv',index=False)

# 5. Ridge Regression

In [35]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  39130
Number of records in validation data  9782
(39130, 6)
(9782, 6)
(39130,)
(9782,)


In [36]:
params_ridge = {
    "alpha": loguniform(1e-5, 1e0), # default 1.0
    "solver": ['eig', 'cd'], 
}
ridge = Ridge()
clf = RandomizedSearchCV(ridge, params_ridge, random_state=1, n_iter=100, cv=5, verbose=0, n_jobs=1)
best_model = clf.fit(X_train,y_train)

RuntimeError: ignored
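For context on the `loguniform(1e-5, 1e0)` prior used in the search: it draws values whose logarithm is uniformly distributed, so each decade between 1e-5 and 1 is sampled about equally often. A rough NumPy equivalent, with `sample_loguniform` as a hypothetical helper name, looks like this:

```python
import numpy as np


def sample_loguniform(low, high, size, rng):
    """Draw samples whose logarithm is uniform on [log(low), log(high))."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))


rng = np.random.default_rng(1)
alphas = sample_loguniform(1e-5, 1e0, size=100, rng=rng)

# About a fifth of the draws should land in each decade.
print(alphas.min(), alphas.max())
print(np.mean(alphas < 1e-4))
```

A plain uniform prior on the same interval would put roughly 90 % of the candidates between 0.1 and 1 and almost never try very small regularisation strengths.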

In [None]:
best_model.best_estimator_.get_params()

In [None]:
ridge_params = {
 'alpha': 0.240960447726532,
 'fit_intercept': True,
 'normalize': False,
 'solver': 'eig'
}

ridge = Ridge(**ridge_params)
result_ridge = ridge.fit(X_train,y_train)

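y_pred = result_ridge.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not an RMSE;
# cuML expects the arguments in the order (y_true, y_pred)
ridge_r2 = r2_score(y_test, y_pred)
print("R^2 score for Ridge Regression is ", ridge_r2)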

In [None]:
ridge_pred = result_ridge.predict(test)

In [None]:
gdf_submission = submission
gdf_submission['fare_amount']= ridge_pred

gdf_submission.head()

In [None]:
gdf_submission.to_csv(env_submission_path+'submission_RidReg.csv',index=False)

# 6. K-Nearest Neighbors Regression

https://github.com/rapidsai/cuml/blob/branch-0.13/notebooks/kneighbors_regressor_demo.ipynb


In [37]:
## params
n_neighbors=4

In [38]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  39130
Number of records in validation data  9782
(39130, 6)
(9782, 6)
(39130,)
(9782,)


In [39]:
%%time
## inspiration: https://www.kaggle.com/cdeotte/rapids-knn-30-seconds-0-938/notebook
knn_cuml = cumlKNR(n_neighbors=n_neighbors)
knn_cuml.fit(X_train, y_train)

cuml_result = knn_cuml.predict(test)

CPU times: user 21.9 ms, sys: 164 ms, total: 186 ms
Wall time: 186 ms


In [40]:
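y_pred = knn_cuml.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not an RMSE.
# Note: cuML's r2_score expects (y_true, y_pred); with the order used here
# the score is normalised by the variance of the predictions.
knn_cuml_r2 = cuml.metrics.regression.r2_score(y_pred, y_test)
print("R^2 score for K-Nearest Neighbor is ",knn_cuml_r2)

R^2 score for K-Nearest Neighbor is  0.6358001232147217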
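As a reminder of what a k-nearest-neighbours regressor computes, here is a brute-force NumPy sketch on toy data. It assumes Euclidean distance and a plain (unweighted) average over the neighbours; `knn_predict` is a hypothetical helper and says nothing about how cuML implements the search on the GPU.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, n_neighbors=4):
    """Predict each query point as the mean target of its nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dist2 = np.sum(diffs ** 2, axis=2)
    # Indices of the n_neighbors closest training points per query
    nearest = np.argsort(dist2, axis=1)[:, :n_neighbors]
    return y_train[nearest].mean(axis=1)


# Toy data: one feature (think "distance") and a target (think "fare").
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 30.0])

pred = knn_predict(X_toy, y_toy, np.array([[1.5]]), n_neighbors=4)
print(pred)  # [5.] (mean of 2, 4, 6 and 8; the far point is ignored)
```

Because the prediction depends on raw distances, features on very different scales (for example `year` versus `distance`) can dominate the neighbour search unless they are rescaled first.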


In [41]:
gdf_submission = submission
gdf_submission['fare_amount']= cuml_result

gdf_submission.head()

gdf_submission.to_csv(env_submission_path+'submission_KNearest.csv',index=False)

# Random Forest GPU


In [42]:
## params
n_estimators=10
n_jobs=-1

In [43]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  39130
Number of records in validation data  9782
(39130, 6)
(9782, 6)
(39130,)
(9782,)


In [44]:
%%time
rf_cuml = cuRF(n_estimators=n_estimators)
rf_cuml.fit(X_train, y_train)

cuRF_result = rf_cuml.predict(test)

CPU times: user 220 ms, sys: 104 ms, total: 324 ms
Wall time: 279 ms


In [45]:
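y_pred = rf_cuml.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not an RMSE.
# Note: cuML's r2_score expects (y_true, y_pred); with the order used here
# the score is normalised by the variance of the predictions.
rf_cuml_r2 = cuml.metrics.regression.r2_score(y_pred, y_test)
print("R^2 score for Random Forest is ",rf_cuml_r2)

R^2 score for Random Forest is  0.27523255348205566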


In [46]:
gdf_submission = submission
gdf_submission['fare_amount']= cuRF_result

gdf_submission.head()

gdf_submission.to_csv(env_submission_path+'submission_RandomForest.csv',index=False)

# XGBoost model

In [47]:
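params = {
    'max_depth': 7,
    'gamma' :0,
    'eta':.03, 
    'subsample': 1,
    'colsample_bytree': 0.9, 
    'objective':'reg:squarederror',  # current name of the deprecated 'reg:linear' objective
    'eval_metric':'rmse',
    'verbosity': 1  # replaces the deprecated 'silent' flag
}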

In [7]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

NameError: ignored

In [None]:
def XGBmodel(X_train,X_test,y_train,y_test,params):
    matrix_train = xgb.DMatrix(X_train,label=y_train)
    matrix_test = xgb.DMatrix(X_test,label=y_test)
    model=xgb.train(params=params,
                    dtrain=matrix_train,num_boost_round=5000, 
                    early_stopping_rounds=10,evals=[(matrix_test,'test')])
    return model

XGB_model = XGBmodel(X_train,X_test,y_train,y_test,params)
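The `early_stopping_rounds=10` argument means training stops once the validation RMSE has not improved for 10 consecutive rounds. The stopping rule itself is simple enough to sketch in plain Python; the function name and the list of scores below are made up for illustration and do not reflect XGBoost internals.

```python
def best_round_with_early_stopping(val_scores, patience=10):
    """Scan validation scores (lower is better) and stop after `patience`
    rounds without improvement. Returns (best_round, best_score)."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score


# Made-up validation RMSE per boosting round.
scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.2]
best = best_round_with_early_stopping(scores, patience=3)
print(best)  # (2, 3.5): the later 3.2 is never reached
```

The example also shows the trade-off: a small patience saves time but can stop before a later improvement, which is why the value is usually tuned together with the learning rate.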

In [1]:
xgb_prediction = XGB_model.predict(xgb.DMatrix(test), ntree_limit = XGB_model.best_ntree_limit).tolist()

NameError: ignored

In [None]:
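y_pred = XGB_model.predict(xgb.DMatrix(X_test))
# r2_score returns the coefficient of determination (R^2), not an RMSE;
# cuML expects the arguments in the order (y_true, y_pred)
XGB_r2 = cuml.metrics.regression.r2_score(y_test, y_pred)
print("R^2 score for XGBoost is ",XGB_r2)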

In [1]:
xgb_submission = submission
xgb_submission['fare_amount']= xgb_prediction

xgb_submission.head()

xgb_submission.to_csv(env_submission_path+'submission_XGB.csv',index=False)

NameError: ignored

# LGBM model

- https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

- https://www.kaggle.com/nicapotato/taxi-rides-time-analysis-and-oof-lgbm

- https://www.kaggle.com/dsaichand3/lgbm-gpu

- https://www.kaggle.com/aerdem4/rapids-svm-on-trends-neuroimaging


Compared CPU vs GPU:
- https://www.kaggle.com/ishivinal/sklearn-rapids-pandas


## Set up LGBM with GPU Support

How to set up the LGBM GPU beta:
- https://www.kaggle.com/vinhnguyen/gpu-acceleration-for-lightgbm


In [53]:
!rm -r /opt/conda/lib/python3.6/site-packages/lightgbm
!git clone --recursive https://github.com/Microsoft/LightGBM
!apt-get install -y -qq libboost-all-dev

rm: cannot remove '/opt/conda/lib/python3.6/site-packages/lightgbm': No such file or directory
Cloning into 'LightGBM'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 19367 (delta 13), reused 4 (delta 0), pack-reused 19320
Receiving objects: 100% (19367/19367), 15.65 MiB | 11.91 MiB/s, done.
Resolving deltas: 100% (14124/14124), done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Cloning into '/content/LightGBM/compute'...
remote: Enumerating objects: 21728, done.        
remote: Total 21728 (delta 0), reused 0 (delta 0), pack-reused 21728        
Receiving objects: 100% (21728/21728), 8.51 MiB | 1.49 MiB/s, done.
Resolving deltas: 100% (17565/17565), done.
Submodule path 'compute': checked out '36c89134d4013b2e5e45bc55656a18bd6141995a'


In [54]:
%%bash
cd LightGBM
rm -r build
mkdir build
cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j$(nproc)

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking

rm: cannot remove 'build': No such file or directory
CMakeFiles/lightgbm.dir/src/treelearner/gpu_tree_learner.cpp.o: In function `boost::compute::detail::program_binary_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)':
gpu_tree_learner.cpp:(.text._ZN5boost7compute6detail19program_binary_pathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb[_ZN5boost7compute6detail19program_binary_pathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb]+0x1e5): undefined reference to `boost::filesystem::detail::status(boost::filesystem::path const&, boost::system::error_code*)'
gpu_tree_learner.cpp:(.text._ZN5boost7compute6detail19program_binary_pathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb[_ZN5boost7compute6detail19program_binary_pathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb]+0x233): undefined reference to `boost::filesystem::detail::create_directories(boost::filesystem::path const&, boost::system::error_code*)

In [55]:
!cd LightGBM/python-package/;python3 setup.py install --precompile
!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
!rm -r LightGBM

running install
running build
running build_py
creating build
creating build/lib
creating build/lib/lightgbm
copying lightgbm/compat.py -> build/lib/lightgbm
copying lightgbm/callback.py -> build/lib/lightgbm
copying lightgbm/sklearn.py -> build/lib/lightgbm
copying lightgbm/plotting.py -> build/lib/lightgbm
copying lightgbm/__init__.py -> build/lib/lightgbm
copying lightgbm/basic.py -> build/lib/lightgbm
copying lightgbm/libpath.py -> build/lib/lightgbm
copying lightgbm/engine.py -> build/lib/lightgbm
running egg_info
creating lightgbm.egg-info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'build'
writing manifest file 'lig

In [59]:
import lightgbm as lgbm

OSError: ignored

In [None]:
## Inspiration from https://www.kaggle.com/dsaichand3/lgbm-gpu
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'nthread': 4,
        'num_leaves': 31,
        'learning_rate': 0.15,
        'max_depth': -1,
        'subsample': 0.8,
        'bagging_fraction' : 1,
        'max_bin' : 15,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        'scale_pos_weight':1,
        'zero_as_missing': True,
        'seed':0,
        'num_rounds':50000,
        'device': 'gpu',
        'gpu_platform_id': 0,
        'gpu_device_id': 0
    }

In [None]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
%%time
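# Assumption: 'day' is not among the engineered columns, so 'weekday' is used as the categorical feature here
train_set = lgbm.Dataset(X_train, y_train, silent=False, categorical_feature=['year','month','weekday'])
valid_set = lgbm.Dataset(X_test, y_test, silent=False, categorical_feature=['year','month','weekday'])
del X_train, y_train, X_test, y_test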
gc.collect()
model = lgbm.train(params, train_set = train_set, num_boost_round=10000, early_stopping_rounds=500, verbose_eval=500, valid_sets=valid_set)

In [None]:
dataset = pd.read_csv("/kaggle/input/new-york-city-taxi-fare-prediction/train.csv", nrows = 25000000)

In [None]:
LGBM_pred = model.predict(test, num_iteration = model.best_iteration)

In [None]:
gdf_submission = submission
gdf_submission['fare_amount']= LGBM_pred

gdf_submission.to_csv(env_submission_path+'submission_LGBM.csv',index=False)

gdf_submission.head()

# 7. Evaluation and Discussion