<a href="https://colab.research.google.com/github/AyHaski/BigDataAnalyticsProject/blob/master/Kopie_von_Rapids_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.
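
The same check can be scripted. The following is a minimal sketch, not part of the original notebook: `SUPPORTED_GPUS`, `is_supported_gpu`, and `query_gpu_names` are hypothetical helper names, and it assumes the `nvidia-smi` binary on the runtime accepts `--query-gpu=name --format=csv,noheader`.

```python
import subprocess

# GPU models named in the text above (hypothetical constant name)
SUPPORTED_GPUS = ("T4", "P4", "P100")

def is_supported_gpu(name):
    """Return True if the GPU name contains one of the supported models as a token."""
    tokens = name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_GPUS)

def query_gpu_names():
    """Ask nvidia-smi for the GPU names; return [] if the tool is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for gpu in query_gpu_names():
        status = "ok" if is_supported_gpu(gpu) else "not one of T4/P4/P100"
        print(gpu, "->", status)
```

Matching on whole tokens keeps a name such as `P40` from being mistaken for `P4`.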

In [None]:
!nvidia-smi

Sun Aug 30 09:49:16 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

#Setup:
The setup script:
1. Installs the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
1. Removes incompatible files
1. Installs the RAPIDS libraries
1. Sets the necessary environment variables
1. Copies the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions
1. If running v0.11 or higher, updates the pyarrow library to 0.15.x
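
The install cell that follows also splices the conda `site-packages` directory into `sys.path` directly in front of Colab's `dist-packages`. As an illustration only, the same splice can be sketched on a plain list. `insert_before` is a hypothetical helper and the example paths are placeholders; unlike the cell itself, the sketch does not raise when the anchor entry is missing.

```python
def insert_before(paths, anchor, new_entry):
    """Return a copy of `paths` with `new_entry` placed directly before `anchor`.

    If `new_entry` is already present, nothing changes. If `anchor` is missing,
    `new_entry` is appended instead of raising ValueError.
    """
    paths = list(paths)
    if new_entry in paths:
        return paths
    try:
        idx = paths.index(anchor)
    except ValueError:
        return paths + [new_entry]
    return paths[:idx] + [new_entry] + paths[idx:]

# Placeholder search path, standing in for sys.path
example = ['/env/python',
           '/usr/local/lib/python3.6/dist-packages',
           '/usr/lib/python3/dist-packages']
patched = insert_before(example,
                        '/usr/local/lib/python3.6/dist-packages',
                        '/usr/local/lib/python3.6/site-packages')
print(patched)
```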

In [1]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 180 (delta 3), reused 0 (delta 0), pack-reused 171
Receiving objects: 100% (180/180), 55.40 KiB | 318.00 KiB/s, done.
Resolving deltas: 100% (65/65), done.
PLEASE READ
********************************************************************************************************
Changes:
1. IMPORTANT CHANGES: RAPIDS on Colab will be pegged to 0.14 Stable until further notice.
2. Default stable version is now 0.14.  Nightly will redirect to 0.14.
3. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.14, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', or '!

#[cuDF](https://github.com/rapidsai/cudf)#

Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

_Note_: You must import nvstrings and nvcategory before cudf, else you'll get errors.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [3]:
!pip install dask_ml
!pip install dask_cuda

Collecting dask_ml
  Downloading dask_ml-1.6.0-py3-none-any.whl (140 kB)

In [4]:
import nvstrings
import numpy as np
import cudf, cuml
import dask_cudf
import io, requests
import math
import gc

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from cuml.linear_model import LinearRegression
from cuml.linear_model import LogisticRegression
from scipy.stats import uniform

from cuml.solvers import SGD as cumlSGD
from cuml.ensemble import RandomForestRegressor as cuRF
from cuml.ensemble import RandomForestClassifier
from cuml.neighbors import KNeighborsRegressor
from cuml import ForestInference
import xgboost as xgb
from cuml.svm import SVC

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from cuml.metrics.regression import r2_score


from cuml.metrics.accuracy import accuracy_score

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score as sk_acc

import dask_ml.model_selection as dcv
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster


In [5]:
cudf.set_allocator("managed")
base_path='/content/drive/My Drive/bigData'

traintypes = {'fare_amount': 'float32',
              'pickup_datetime':'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}

cols = list(traintypes.keys())
usecols = cols

In [6]:
%%time
gdf_test = cudf.read_csv(base_path +'/test.csv', usecols=cols, dtype=traintypes)
gdf_train = cudf.read_csv(base_path +'/train.csv', usecols=cols, dtype=traintypes)

CPU times: user 3.31 s, sys: 3.25 s, total: 6.56 s
Wall time: 1min 59s


In [None]:
gdf_train.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21 UTC,-73.844307,40.721321,-73.841614,40.712273,1
1,16.9,2010-01-05 16:52:16 UTC,-74.016045,40.711304,-73.979271,40.782005,1
2,5.7,2011-08-18 00:35:00 UTC,-73.982742,40.761269,-73.991234,40.750565,2
3,7.7,2012-04-21 04:30:42 UTC,-73.987137,40.733139,-73.99157,40.758095,1
4,5.3,2010-03-09 07:51:00 UTC,-73.968102,40.768009,-73.956665,40.783768,1


In [None]:
gdf_train.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,55423860.0,55423860.0,55423860.0,55423480.0,55423480.0,55423860.0
mean,11.34505,-72.50969,39.91979,-72.51121,39.92068,1.685075
std,20.71083,12.84888,9.642353,12.7822,9.633346,1.310116
min,-300.0,-3442.06,-3492.264,-3442.024,-3547.886,-127.0
25%,6.0,-73.99206,40.73493,-73.99139,40.73403,1.0
50%,8.5,-73.9818,40.75265,-73.98016,40.75315,1.0
75%,12.5,-73.96708,40.76713,-73.96368,40.7681,2.0
max,93963.36,3457.625,3408.79,3457.622,3537.133,51.0


In [7]:
gdf_train['pickup_datetime'] = gdf_train['pickup_datetime'].astype('datetime64[ns]')
gdf_test['pickup_datetime'] = gdf_test['pickup_datetime'].astype('datetime64[ns]')

In [8]:
# Extract integer features from pickup_datetime
gdf_train["hour"] = gdf_train.pickup_datetime.dt.hour
gdf_train["weekday"] = gdf_train.pickup_datetime.dt.weekday
gdf_train["month"] = gdf_train.pickup_datetime.dt.month
gdf_train["year"] = gdf_train.pickup_datetime.dt.year

gdf_test["hour"] = gdf_test.pickup_datetime.dt.hour
gdf_test["weekday"] = gdf_test.pickup_datetime.dt.weekday
gdf_test["month"] = gdf_test.pickup_datetime.dt.month
gdf_test["year"] = gdf_test.pickup_datetime.dt.year

In [None]:
gdf_train.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.841614,40.712273,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.979271,40.782005,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.991234,40.750565,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.99157,40.758095,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.956665,40.783768,1,7,1,3,2010


#Data Analysis

1. Look at missing values in the data
1. Look at the values in fare_amount and passenger_count (distribution graph)
1. Look at the coordinates
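
A minimal sketch of the first two checks, using NumPy on a made-up mini-sample. The arrays below are hypothetical; in the notebook the columns live in a cuDF DataFrame, and the corresponding cuDF calls are not shown here.

```python
import numpy as np

# Hypothetical mini-sample standing in for two columns of the training data
fare_amount = np.array([4.5, 16.9, np.nan, 7.7, -3.0, 5.3])
passenger_count = np.array([1, 1, 2, 1, 0, 6])

# Missing values: count NaN entries in a column
n_missing = int(np.isnan(fare_amount).sum())

# Value distribution: occurrences of each distinct passenger_count
values, counts = np.unique(passenger_count, return_counts=True)
passenger_distribution = dict(zip(values.tolist(), counts.tolist()))

# Coarse histogram of the non-missing fares
valid_fares = fare_amount[~np.isnan(fare_amount)]
hist, bin_edges = np.histogram(valid_fares, bins=4)

print("missing fares:", n_missing)
print("passenger_count distribution:", passenger_distribution)
print("fare histogram:", hist.tolist(), "edges:", bin_edges.round(2).tolist())
```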


#Data Cleaning


In [9]:
print("Shape of Training Data after dropping columns",gdf_train.shape)
print("Shape of Testing Data after dropping columns",gdf_test.shape)

Shape of Training Data after dropping columns (55423856, 11)
Shape of Testing Data after dropping columns (9914, 10)


In [10]:
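# nans_to_nulls() returns a new DataFrame; the result is only displayed here and gdf_train is left unchanged
gdf_train.nans_to_nulls()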

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.84161377,40.71227264,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.97927094,40.78200531,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.99123383,40.75056458,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.99156952,40.75809479,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.95666504,40.7837677,1,7,1,3,2010
...,...,...,...,...,...,...,...,...,...,...,...
55423851,14.0,2014-03-15 03:28:00,-74.005264,40.740028,-73.96327209,40.76255417,1,3,5,3,2014
55423852,4.2,2009-03-24 20:46:20,-73.957794,40.765533,-73.9516449,40.7739563,1,20,1,3,2009
55423853,14.1,2011-04-02 22:04:24,-73.970512,40.752323,-73.96054077,40.79734421,1,22,5,4,2011
55423854,28.9,2011-10-26 05:57:51,-73.980904,40.764629,-73.8706131,40.77396393,1,5,2,10,2011


In [11]:
test_1 = gdf_test
train_1 = gdf_train.dropna()
train_1

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.841614,40.712273,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.979271,40.782005,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.991234,40.750565,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.991570,40.758095,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.956665,40.783768,1,7,1,3,2010
...,...,...,...,...,...,...,...,...,...,...,...
55423851,14.0,2014-03-15 03:28:00,-74.005264,40.740028,-73.963272,40.762554,1,3,5,3,2014
55423852,4.2,2009-03-24 20:46:20,-73.957794,40.765533,-73.951645,40.773956,1,20,1,3,2009
55423853,14.1,2011-04-02 22:04:24,-73.970512,40.752323,-73.960541,40.797344,1,22,5,4,2011
55423854,28.9,2011-10-26 05:57:51,-73.980904,40.764629,-73.870613,40.773964,1,5,2,10,2011


In [12]:
train_1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
count,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0
mean,11.34501,-72.50973,39.91982,-72.51121,39.92068,1.685087,13.50978,3.041161,6.266239,2011.739
std,20.71087,12.84877,9.642324,12.7822,9.633346,1.310113,6.517377,1.948911,3.435531,1.860079
min,-300.0,-3442.06,-3492.264,-3442.024,-3547.886,-127.0,0.0,0.0,1.0,2009.0
25%,6.0,-73.99206,40.73493,-73.99139,40.73403,1.0,9.0,1.0,3.0,2010.0
50%,8.5,-73.9818,40.75265,-73.98016,40.75315,1.0,14.0,3.0,6.0,2012.0
75%,12.5,-73.96708,40.76713,-73.96368,40.7681,2.0,19.0,5.0,9.0,2013.0
max,93963.36,3457.625,3408.79,3457.622,3537.133,51.0,23.0,6.0,12.0,2015.0


In [13]:
# Negative (or zero) fare_amount values make no sense, so drop them
train_1 = train_1[train_1['fare_amount'] > 0]
# Some coordinates are extremely far away, so remove them
train_1 = train_1[(train_1['pickup_longitude'] < -70) & (train_1['pickup_longitude'] > -75)]
train_1 = train_1[(train_1['pickup_latitude'] > 38) & (train_1['pickup_latitude'] < 44)]
train_1 = train_1[(train_1['dropoff_longitude'] < -70) & (train_1['dropoff_longitude'] > -75)]
train_1 = train_1[(train_1['dropoff_latitude'] > 38) & (train_1['dropoff_latitude'] < 44)]
train_1 = train_1[(train_1['passenger_count'] > 0) & (train_1['passenger_count'] < 6)]
train_1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
count,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0,52922160.0
mean,11.31765,-73.97512,40.75096,-73.97427,40.7513,1.597439,13.51228,3.040926,6.268444,2011.711
std,21.03815,0.040327,0.032695,0.039561,0.035832,1.15447,6.518926,1.949067,3.435905,1.86554
min,0.01,-74.99804,38.03999,-74.99828,38.03333,1.0,0.0,0.0,1.0,2009.0
25%,6.0,-73.99227,40.73656,-73.99159,40.73557,1.0,9.0,1.0,3.0,2010.0
50%,8.5,-73.9821,40.75337,-73.98061,40.75385,1.0,14.0,3.0,6.0,2012.0
75%,12.5,-73.96831,40.76754,-73.96534,40.7684,2.0,19.0,5.0,9.0,2013.0
max,93963.36,-70.00039,43.99612,-70.00227,43.99612,5.0,23.0,6.0,12.0,2015.0


#Different distance calculations

## Separate absolute distances

In [None]:
train_1['longitude_distance']=(train_1['dropoff_longitude']  - train_1['pickup_longitude']).abs()
train_1['latitude_distance']=(train_1['dropoff_latitude'] - train_1['pickup_latitude']).abs()

test_1['longitude_distance']=(test_1['dropoff_longitude'] - test_1['pickup_longitude']).abs()
test_1['latitude_distance']=(test_1['dropoff_latitude'] - test_1['pickup_latitude']).abs()

## Distance in miles

In [14]:
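def distance(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles between points given in degrees."""
    p = 0.017453292519943295 # Pi/180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 12742 km is the Earth's diameter (2 * 6371 km); 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))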

In [15]:
train_1['distance_miles'] = distance(train_1['pickup_latitude'], train_1['pickup_longitude'], \
                                      train_1['dropoff_latitude'], train_1['dropoff_longitude'] )

In [16]:
test_1['distance_miles'] = distance(test_1['pickup_latitude'], test_1['pickup_longitude'], \
                                      test_1['dropoff_latitude'], test_1['dropoff_longitude'] )

In [17]:
train_1['distance_miles'].describe()

count    5.292216e+07
mean     1.870630e+00
std      2.652241e+00
min      0.000000e+00
25%      0.000000e+00
50%      1.366828e+00
75%      2.424852e+00
max      2.928418e+02
Name: distance_miles, dtype: float64
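
The summary above has a 25% quantile of exactly 0, and the value 1.366828 recurs in the `head()` outputs even for rows with different coordinates. One plausible explanation, assuming the arithmetic ran in float32 (the dtype the coordinate columns were loaded with), is cancellation in the cosine-based formula: for short trips `cos(x)` rounds to 1.0, so `0.5 - cos(x)/2` collapses to 0 or to a single rounding step. The sketch below reproduces the effect in NumPy for the first training row and compares it with the algebraically equivalent sine-squared form. `distance_cos` and `distance_sin` are names introduced here for illustration.

```python
import numpy as np

KM_TO_MILES = 0.6213712
EARTH_DIAMETER_KM = 12742
DEG_TO_RAD = 0.017453292519943295  # Pi/180

def distance_cos(lat1, lon1, lat2, lon2):
    """Cosine-based haversine, the same formula as `distance` above."""
    p = DEG_TO_RAD
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return KM_TO_MILES * EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

def distance_sin(lat1, lon1, lat2, lon2):
    """Sine-squared haversine; avoids subtracting two nearly equal numbers."""
    p = DEG_TO_RAD
    a = (np.sin((lat2 - lat1) * p / 2) ** 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * np.sin((lon2 - lon1) * p / 2) ** 2)
    return KM_TO_MILES * EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

# Pickup and dropoff of the first training row: lat1, lon1, lat2, lon2
coords = [40.721321, -73.844307, 40.712273, -73.841614]
as32 = [np.array([c], dtype=np.float32) for c in coords]
as64 = [np.array([c], dtype=np.float64) for c in coords]

cos32 = float(distance_cos(*as32)[0])
sin32 = float(distance_sin(*as32)[0])
sin64 = float(distance_sin(*as64)[0])

print("cosine form, float32:      ", cos32)
print("sine-squared form, float32:", sin32)
print("sine-squared form, float64:", sin64)
```

If this explanation applies, either casting the coordinates to float64 before the calculation or switching to the sine-squared form should remove the artificial zeros. Which one fits the GPU memory budget better is not evaluated here.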

# Dropping columns that are no longer needed


In [18]:
drop_columns = ['pickup_datetime','dropoff_latitude','pickup_latitude','dropoff_longitude','pickup_longitude']
train_1=train_1.drop(drop_columns,axis=1)
test_1=test_1.drop(drop_columns,axis=1)

In [19]:
train_1.head()

Unnamed: 0,fare_amount,passenger_count,hour,weekday,month,year,distance_miles
0,4.5,1,17,0,6,2009,0.0
1,16.9,1,16,1,1,2010,5.244397
2,5.7,2,0,3,8,2011,0.0
3,7.7,1,4,5,4,2012,1.932986
4,5.3,1,7,1,3,2010,1.366828


In [20]:
test_1.head()

Unnamed: 0,passenger_count,hour,weekday,month,year,distance_miles
0,1,13,1,1,2015,1.366828
1,1,13,1,1,2015,1.366828
2,1,11,5,10,2011,0.0
3,1,21,5,12,2012,1.366828
4,1,21,5,12,2012,3.226874


# Linear Regression

In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  42337728
Number of records in validation data  10584432
(42337728, 6)
(10584432, 6)
(42337728,)
(10584432,)


In [None]:
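lm = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")
lm.fit(X_train,y_train)
y_pred=lm.predict(X_test)
# r2_score returns the R^2 coefficient of determination, not an RMSE.
# cuML's signature is r2_score(y, y_hat); the recorded value was produced with the arguments in the order below.
lm_r2 = r2_score(y_pred, y_test)
print("R^2 score for Linear Regression is ",lm_r2)

R^2 score for Linear Regression is  0.2914084792137146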
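
The number above comes from `r2_score`, which is a coefficient of determination rather than a root-mean-squared error. The sketch below spells out both metrics in NumPy on made-up numbers; `rmse` and `r2` are illustrative helpers, not cuML functions. It also shows that R² depends on which argument is treated as the ground truth, while RMSE does not.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error, in the same units as the target (here: dollars)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, unitless, 1.0 is perfect."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up fares and predictions, for illustration only
y_true = np.array([4.5, 16.9, 5.7, 7.7, 5.3])
y_pred = np.array([6.0, 14.0, 6.5, 8.0, 6.0])

print("RMSE:               ", rmse(y_true, y_pred))
print("R^2 (truth first):  ", r2(y_true, y_pred))
print("R^2 (args swapped): ", r2(y_pred, y_true))
```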


In [None]:
y_pred=lm.predict(test_1)
y_pred

0        9.783966
1        9.783966
2        5.478973
3        9.694916
4       14.977570
          ...    
9909    10.014099
9910    12.425690
9911    39.885895
9912    20.667664
9913     5.894531
Length: 9914, dtype: float32

In [None]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,9.783966
1,2015-01-27 13:08:24.0000003,9.783966
2,2011-10-08 11:53:44.0000002,5.478973
3,2012-12-01 21:12:12.0000002,9.694916
4,2012-12-01 21:12:12.0000003,14.97757


In [None]:
gdf_submission.to_csv(base_path + '/logRegRapidSubmissionMilesDistance3.csv',index=False)

# Ridge Regression


In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  42337728
Number of records in validation data  10584432
(42337728, 6)
(10584432, 6)
(42337728,)
(10584432,)


In [None]:
from cuml.linear_model import Ridge
import cupy as cp

In [None]:
alpha = np.array([1e-5])
ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False,
              solver = "eig")
result_ridge = ridge.fit(X_train,y_train)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)
y_pred=result_ridge.predict(X_test)
# r2_score returns the R^2 coefficient of determination, not an RMSE
ridge_r2 = r2_score(y_pred, y_test)
print("R^2 score for Ridge Regression is ", ridge_r2)

TypeError: ignored

In [None]:
y_pred=result_ridge.predict(test_1)
y_pred

0        9.816284
1        9.816284
2        5.573151
3        9.722198
4       14.917328
          ...    
9909     9.971344
9910    12.322968
9911    39.341492
9912    20.449219
9913     5.913696
Length: 9914, dtype: float32

In [None]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.to_csv(base_path + '/ridgeRegSubmission.csv',index=False)
gdf_submission.head()

# Lasso Regression


In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  42337728
Number of records in validation data  10584432
(42337728, 6)
(10584432, 6)
(42337728,)
(10584432,)


In [None]:
from cuml.linear_model import Lasso

In [None]:
# Note: this cell reuses the Ridge estimator; the Lasso class imported above is not used here
alpha = np.array([1e-5])
ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False,
              solver = "eig")
result_ridge = ridge.fit(X_train,y_train)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)
y_pred=result_ridge.predict(X_test)
# r2_score returns the R^2 coefficient of determination, not an RMSE
ridge_r2 = r2_score(y_pred, y_test)
print("R^2 score for Ridge Regression is ", ridge_r2)

Coefficients:
0    0.029549
1    0.001449
2   -0.046847
3    0.060749
4    0.195509
5    2.793014
dtype: float32
Intercept:
-388.01446533203125
R^2 score for Ridge Regression is  0.30520182847976685


In [None]:
y_pred=result_ridge.predict(test_1)
y_pred

0        9.816284
1        9.816284
2        5.573151
3        9.722198
4       14.917328
          ...    
9909     9.971344
9910    12.322968
9911    39.341492
9912    20.449219
9913     5.913696
Length: 9914, dtype: float32

In [None]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.to_csv(base_path + '/ridgeRegSubmission.csv',index=False)
gdf_submission.head()

# Random Forest Regression Model

In [None]:
cu_rf_params = {
    'n_estimators': 100,
    'max_features': 'auto', #default = 0.3
    'max_depth': 16,
    'n_bins': 8
}

In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)


In [None]:
cu_rf = cuRF(**cu_rf_params)
cu_rf.fit(X_train, y_train)
y_pred=cu_rf.predict(X_test)

In [None]:
# r2_score returns the R^2 coefficient of determination, not an RMSE
rf_r2 = r2_score(y_pred, y_test)
print("R^2 score for RF Regression is ",rf_r2)

R^2 score for RF Regression is  0.265926718711853


In [None]:
y_pred=cu_rf.predict(test_1)
y_pred

0        9.787024
1        9.787024
2        5.867246
3        8.939384
4       15.635692
          ...    
9909     8.973536
9910    11.656147
9911    34.585255
9912    27.609804
9913     6.255931
Length: 9914, dtype: float32

In [None]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,9.787024
1,2015-01-27 13:08:24.0000003,9.787024
2,2011-10-08 11:53:44.0000002,5.867246
3,2012-12-01 21:12:12.0000002,8.939384
4,2012-12-01 21:12:12.0000003,15.635692


In [None]:
gdf_submission.to_csv(base_path + '/randomForestSubmissionDefaultParams1000est.csv',index=False)

# HPO Functions


In [23]:
N_FOLDS = 3
N_ITER = 25

def do_HPO(model, gridsearch_params, scorer, X, y, mode='gpu-grid', n_iter=10):
    """
        Perform HPO based on the mode specified
        
        mode: default gpu-grid. The possible options are:
        1. gpu-grid: Perform GPU based GridSearchCV
        2. gpu-random: Perform GPU based RandomizedSearchCV
        
        n_iter: number of parameter settings sampled (only used with gpu-random)
        
        Returns the best estimator and the results of the search
    """
    if mode == 'gpu-grid':
        print("gpu-grid selected")
        clf = dcv.GridSearchCV(model,
                               gridsearch_params,
                               cv=N_FOLDS,
                               scoring=scorer)
    elif mode == 'gpu-random':
        print("gpu-random selected")
        clf = dcv.RandomizedSearchCV(model,
                               gridsearch_params,
                               cv=N_FOLDS,
                               scoring=scorer,
                               n_iter=n_iter)

    else:
        print("Unknown Option, please choose one of [gpu-grid, gpu-random]")
        return None, None
    res = clf.fit(X, y)
    print("Best clf and score {} {}\n---\n".format(res.best_estimator_, res.best_score_))
    return res.best_estimator_, res

In [24]:
def accuracy_score_wrapper(y, y_hat): 
    """
        A wrapper function to convert labels to float32, 
        and pass it to accuracy_score.
        
        Params:
        - y: The y labels that need to be converted
        - y_hat: The predictions made by the model
    """
    y = y.astype("float32") # cuML RandomForest needs the y labels to be float32
    return accuracy_score(y, y_hat, convert_dtype=True)

accuracy_wrapper_scorer = make_scorer(accuracy_score_wrapper)
cuml_accuracy_scorer = make_scorer(accuracy_score, convert_dtype=True)
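
Accuracy counts exact matches between labels and predictions, which suits classifiers. The models tuned in the following sections predict a continuous fare, so a regression metric may be a better fit for the search. The sketch below is an assumption, not the notebook's method: `rmse`, `neg_rmse_scorer`, and `MeanRegressor` are hypothetical names, the scorer follows scikit-learn's callable convention `scorer(estimator, X, y)`, and it assumes the inputs can be converted with `np.asarray`. Whether the dask_ml search classes accept it unchanged with cuDF inputs is not verified here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two array-likes."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def neg_rmse_scorer(estimator, X, y):
    """Scorer with the signature (estimator, X, y).

    Returns the negative RMSE so that a larger score means a better model,
    which is the direction hyperparameter searches maximize.
    """
    return -rmse(y, estimator.predict(X))

class MeanRegressor:
    """Tiny stand-in estimator, used only to exercise the scorer."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

X_demo = np.arange(4).reshape(-1, 1)
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
demo_model = MeanRegressor().fit(X_demo, y_demo)
demo_score = neg_rmse_scorer(demo_model, X_demo, y_demo)
print("negative RMSE of the mean predictor:", demo_score)
```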

# Random Forest HPO


In [21]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [None]:
params_rf = {
    #"max_depth": np.arange(start=3, stop = 15, step = 2), # Default = 6
    #"max_features": [0.1, 0.50, 0.75, 'auto'], #default = 0.3
    "n_estimators": [100, 200, 500, 1000]
 }

mode = "gpu-random"
model_rf = cuRF()

X_cpu = X_train.to_pandas()
y_cpu = y_train.to_array()

X_test_cpu = X_test.to_pandas()
y_test_cpu = y_test.to_array()

res, results = do_HPO(model_rf,
                          params_rf,
                          cuml_accuracy_scorer,
                          X_train,
                          y_cpu,
                          mode=mode)
print("Searched over {} parameters".format(len(results.cv_results_['mean_test_score'])))

gpu-random selected




# XGBoost with HPO


The hyperparameter optimization is based on this notebook: https://github.com/rapidsai/cloud-ml-examples/blob/main/dask/notebooks/HPO_demo.ipynb

Without a cluster, it was not possible to run HPO, because it repeatedly ran into OutOfMemory exceptions.

In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [None]:
cluster = LocalCUDACluster(dashboard_address="127.0.0.1:8005")
client = Client(cluster)

client


+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| numpy   | 1.18.5 | 1.18.5    | 1.19.1  |
| tornado | 5.1.1  | 5.1.1     | 6.0.4   |
+---------+--------+-----------+---------+


Client: Scheduler: tcp://127.0.0.1:33437, Dashboard: http://127.0.0.1:8005/status
Cluster: Workers: 1, Cores: 1, Memory: 27.39 GB


In [None]:
# For xgb_model
model_gpu_xgb = xgb.XGBRegressor(tree_method='gpu_hist')

# More range 
params_xgb = {
    "max_depth": np.arange(start=3, stop = 15, step = 3), # Default = 6
    "alpha" : np.logspace(-3, -1, 5), # default = 0
    "learning_rate": [0.05, 0.1, 0.15], #default = 0.3
    "min_child_weight" : np.arange(start=2, stop=10, step=3), # default = 1
    "n_estimators": [100, 200, 1000]
}
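
To get a feel for the size of this search space, the sketch below counts the combinations in the grid and the resulting number of cross-validation fits. It is a back-of-the-envelope illustration: `n_folds` mirrors `N_FOLDS` from the HPO helper, `n_iter = 2` is the value passed to the search call in this section, and the final refit on the full data is not counted.

```python
import numpy as np

# Same search space as in the cell above
params_xgb = {
    "max_depth": np.arange(start=3, stop=15, step=3),
    "alpha": np.logspace(-3, -1, 5),
    "learning_rate": [0.05, 0.1, 0.15],
    "min_child_weight": np.arange(start=2, stop=10, step=3),
    "n_estimators": [100, 200, 1000],
}

n_folds = 3  # N_FOLDS in the HPO helper
n_iter = 2   # number of sampled settings used for the random search

# Number of distinct parameter combinations in the full grid
grid_size = int(np.prod([len(values) for values in params_xgb.values()]))

# Each candidate is fitted once per fold
full_grid_fits = grid_size * n_folds
random_search_fits = min(n_iter, grid_size) * n_folds

print("combinations in the grid:", grid_size)
print("fits for a full grid search:", full_grid_fits)
print("fits for the random search:", random_search_fits)
```

With only two sampled settings the random search touches a very small fraction of the grid, so its result is better read as a spot check than as an optimum.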

In [None]:
X_cpu = X_train.to_pandas()
y_cpu = y_train.to_array()

X_test_cpu = X_test.to_pandas()
y_test_cpu = y_test.to_array()

In [None]:
mode = "gpu-random"

res, results = do_HPO(model_gpu_xgb,
                                   params_xgb,
                                   cuml_accuracy_scorer,
                                   X_train,
                                   y_cpu,
                                   mode=mode,
                                   n_iter=2)
print("Searched over {} parameters".format(len(results.cv_results_['mean_test_score'])))




gpu-random selected




In [None]:
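best_model = res.fit(X_train, y_cpu)
y_pred = best_model.predict(X_test)
# Fares are continuous, so report R^2 instead of classification accuracy
score = r2_score(y_test_cpu.astype('float32'), y_pred.astype('float32'))
print("{} model R^2: {}".format(mode, score))

gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
# Predict on the competition test set; the validation predictions above do not match the submission length
gdf_submission['fare_amount']= best_model.predict(test_1)

gdf_submission.to_csv(base_path + '/xgbBoostWithRandomSearchCVSubmission.csv',index=False)
gdf_submission.head()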

# XGB without HPO

In [None]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_test, label=y_test)

In [None]:
# instantiate params
params = {}

# booster params
booster_params = {}

booster_params['tree_method'] = 'gpu_hist'
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'rmse'
learning_task_params['objective'] = 'reg:squarederror'
params.update(learning_task_params)

params_xgb = {
    "max_depth": 12, # Default = 6
    "alpha" : 0.001, # default = 0
    "learning_rate": 0.1, #default = 0.3
    "min_child_weight" : 2, # default = 1
    "n_estimators": 1000 # not used by xgb.train; the number of boosting rounds is set by num_round
}
params.update(params_xgb)

print(params)

{'tree_method': 'gpu_hist', 'eval_metric': 'rmse', 'objective': 'reg:squarederror', 'max_depth': 12, 'alpha': 0.001, 'learning_rate': 0.1, 'min_child_weight': 2, 'n_estimators': 1000}


In [None]:
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 100

In [None]:
bst = xgb.train(params, dtrain, num_round, evallist)

In [None]:
dtest = xgb.DMatrix(test_1)
y_pred_xgb = bst.predict(dtest)
print(y_pred_xgb)

[ 9.425025   9.425025   5.595852  ... 56.41775   23.252817   6.1432843]


In [None]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred_xgb

gdf_submission.to_csv(base_path + '/XGBBoostSubmissionW1000Rounds.csv',index=False)
gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,9.425025
1,2015-01-27 13:08:24.0000003,9.425025
2,2011-10-08 11:53:44.0000002,5.595852
3,2012-12-01 21:12:12.0000002,9.125943
4,2012-12-01 21:12:12.0000003,16.820873


# SVM Regression Model (SVC)

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib