<a href="https://colab.research.google.com/github/AyHaski/BigDataAnalyticsProject/blob/master/Kopie_von_Rapids_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [1]:
!nvidia-smi

Sun Aug 16 09:51:00 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
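
The same check can be scripted. This is only a sketch, not part of the original notebook: `is_supported_gpu` and `query_gpu_name` are hypothetical helper names, and the model list is simply the three GPUs named above. It assumes `nvidia-smi --query-gpu=name --format=csv,noheader` prints one GPU name per line.

```python
import subprocess

# GPU models named in the sanity check above
SUPPORTED_MODELS = ("T4", "P4", "P100")

def is_supported_gpu(gpu_name):
    """Return True if the reported GPU name contains one of the supported models."""
    tokens = gpu_name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_MODELS)

def query_gpu_name():
    """Ask nvidia-smi for the name of the first GPU (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return result.stdout.splitlines()[0].strip()

# The name is passed in as a string here, so the example runs without a GPU.
print(is_supported_gpu("Tesla P100-PCIE-16GB"))
print(is_supported_gpu("Tesla K80"))
```

On a Colab GPU runtime, `is_supported_gpu(query_gpu_name())` would combine the two steps.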

# Setup
The setup script:
1. Installs the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
1. Removes incompatible files
1. Installs the RAPIDS libraries
1. Sets the necessary environment variables
1. Copies the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions
1. If running v0.11 or higher, updates the pyarrow library to 0.15.x

In [1]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

# Put conda's site-packages directly ahead of Colab's dist-packages on the import path
dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
PLEASE READ
********************************************************************************************************
Changes:
1. Default stable version is now 0.14.  Nightly is now 0.15.  We have fixed the long conda install.  Hooray!
2. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.15, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh s'
                  '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.15, or '!bash rapidsai-csp-utils/colab/rapids-colab.sh nightly', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh n'
Enjoy using RAPIDS!  If you have any issues with or suggestions for RAPID
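
The `sys.path` splice in the install cell raises `ValueError` if the dist-packages entry is missing, and it adds a duplicate entry each time the cell is re-run. A more defensive variant might look like the sketch below; `insert_before` is a hypothetical helper, not something provided by the RAPIDS setup script.

```python
def insert_before(paths, marker, new_entry):
    """Return a copy of paths with new_entry placed directly before marker.

    If marker is missing, new_entry goes to the front instead of raising;
    if new_entry is already present, the list is returned unchanged.
    """
    if new_entry in paths:
        return list(paths)
    idx = paths.index(marker) if marker in paths else 0
    return paths[:idx] + [new_entry] + paths[idx:]

# Example on a plain list, so nothing on the real import path is touched
example = ["/content", "/usr/local/lib/python3.6/dist-packages", "/usr/lib/python3"]
print(insert_before(example,
                    "/usr/local/lib/python3.6/dist-packages",
                    "/usr/local/lib/python3.6/site-packages"))
```

In the notebook this would be applied as `sys.path = insert_before(sys.path, ...)`.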

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU-memory-resident DataFrame and prepare it for modelling.

Everything from CSV parsing to feature engineering and fitting the regression model is done on the GPU.

_Note_: You must import nvstrings and nvcategory before cudf; otherwise you'll get errors.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import nvstrings
import numpy as np
import cudf, cuml
import io, requests
import math

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from cuml.linear_model import LinearRegression
from cuml.metrics.regression import r2_score


In [3]:
cudf.set_allocator("managed")
base_path='/content/drive/My Drive/bigData'

traintypes = {'fare_amount': 'float32',
              'pickup_datetime':'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}

cols = list(traintypes.keys())
usecols = cols
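
Declaring narrow dtypes up front keeps the GPU memory footprint down. The sketch below is a back-of-the-envelope estimate only: `DTYPE_BYTES` and `fixed_width_bytes_per_row` are hypothetical names, the string `pickup_datetime` column is ignored, and the row count is the one `gdf_train.shape` reports in this notebook.

```python
# Bytes per value for the fixed-width dtypes used in traintypes
DTYPE_BYTES = {"float32": 4, "int8": 1}

traintypes = {'fare_amount': 'float32',
              'pickup_datetime': 'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}

def fixed_width_bytes_per_row(dtypes):
    """Sum the sizes of the fixed-width columns; string columns are skipped."""
    return sum(DTYPE_BYTES[t] for t in dtypes.values() if t in DTYPE_BYTES)

n_rows = 55423856  # training rows reported by gdf_train.shape
bytes_per_row = fixed_width_bytes_per_row(traintypes)
print("bytes per row:", bytes_per_row)
print("approx. GB for numeric columns:", bytes_per_row * n_rows / 1e9)
```

With 64-bit types for the same six numeric columns, each row would take 48 bytes instead.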

In [4]:
%%time
gdf_test = cudf.read_csv(base_path +'/test.csv', usecols=cols, dtype=traintypes)
gdf_train = cudf.read_csv(base_path +'/train.csv', usecols=cols, dtype=traintypes)

CPU times: user 3.07 s, sys: 3.19 s, total: 6.26 s
Wall time: 1min 53s


In [5]:
gdf_train.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21 UTC,-73.844307,40.721321,-73.841614,40.712273,1
1,16.9,2010-01-05 16:52:16 UTC,-74.016045,40.711304,-73.979271,40.782005,1
2,5.7,2011-08-18 00:35:00 UTC,-73.982742,40.761269,-73.991234,40.750565,2
3,7.7,2012-04-21 04:30:42 UTC,-73.987137,40.733139,-73.99157,40.758095,1
4,5.3,2010-03-09 07:51:00 UTC,-73.968102,40.768009,-73.956665,40.783768,1


In [6]:
gdf_train.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,55423860.0,55423860.0,55423860.0,55423480.0,55423480.0,55423860.0
mean,11.34505,-72.50969,39.91979,-72.51121,39.92068,1.685075
std,20.71083,12.84888,9.642353,12.7822,9.633346,1.310116
min,-300.0,-3442.06,-3492.264,-3442.024,-3547.886,-127.0
25%,6.0,-73.99206,40.73493,-73.99139,40.73403,1.0
50%,8.5,-73.9818,40.75265,-73.98016,40.75315,1.0
75%,12.5,-73.96708,40.76713,-73.96368,40.7681,2.0
max,93963.36,3457.625,3408.79,3457.622,3537.133,51.0


In [7]:
gdf_train['pickup_datetime'] = gdf_train['pickup_datetime'].astype('datetime64[ns]')
gdf_test['pickup_datetime'] = gdf_test['pickup_datetime'].astype('datetime64[ns]')

In [8]:
# Extract integer features (hour, weekday, month, year) from pickup_datetime
gdf_train["hour"] = gdf_train.pickup_datetime.dt.hour
gdf_train["weekday"] = gdf_train.pickup_datetime.dt.weekday
gdf_train["month"] = gdf_train.pickup_datetime.dt.month
gdf_train["year"] = gdf_train.pickup_datetime.dt.year

gdf_test["hour"] = gdf_test.pickup_datetime.dt.hour
gdf_test["weekday"] = gdf_test.pickup_datetime.dt.weekday
gdf_test["month"] = gdf_test.pickup_datetime.dt.month
gdf_test["year"] = gdf_test.pickup_datetime.dt.year

In [9]:
gdf_train.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.841614,40.712273,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.979271,40.782005,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.991234,40.750565,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.99157,40.758095,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.956665,40.783768,1,7,1,3,2010
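
To sanity-check the extracted columns, the same features can be computed for a single timestamp with the standard library. This is a CPU-side sketch under the assumption that cuDF's `dt.weekday` follows the same Monday-is-0 convention as Python's `datetime.weekday()`, which the rows above are consistent with; `datetime_features` is a hypothetical helper name.

```python
from datetime import datetime

def datetime_features(timestamp):
    """Return (hour, weekday, month, year) for a 'YYYY-MM-DD HH:MM:SS[ UTC]' string."""
    cleaned = timestamp.replace(" UTC", "")
    dt = datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")
    # weekday(): Monday is 0, Sunday is 6
    return dt.hour, dt.weekday(), dt.month, dt.year

# First row of the training data
print(datetime_features("2009-06-15 17:26:21 UTC"))  # (17, 0, 6, 2009)
```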


# Data Analysis



# Data Cleaning


In [10]:
print("Shape of Training Data after dropping columns",gdf_train.shape)
print("Shape of Testing Data after dropping columns",gdf_test.shape)

Shape of Training Data after dropping columns (55423856, 11)
Shape of Testing Data after dropping columns (9914, 10)


In [11]:
# nans_to_nulls() returns a new DataFrame; the result is only displayed here, not stored
gdf_train.nans_to_nulls()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.84161377,40.71227264,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.97927094,40.78200531,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.99123383,40.75056458,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.99156952,40.75809479,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.95666504,40.7837677,1,7,1,3,2010
...,...,...,...,...,...,...,...,...,...,...,...
55423851,14.0,2014-03-15 03:28:00,-74.005264,40.740028,-73.96327209,40.76255417,1,3,5,3,2014
55423852,4.2,2009-03-24 20:46:20,-73.957794,40.765533,-73.9516449,40.7739563,1,20,1,3,2009
55423853,14.1,2011-04-02 22:04:24,-73.970512,40.752323,-73.96054077,40.79734421,1,22,5,4,2011
55423854,28.9,2011-10-26 05:57:51,-73.980904,40.764629,-73.8706131,40.77396393,1,5,2,10,2011


In [12]:
test_1 = gdf_test
train_1 = gdf_train.dropna()
train_1

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
0,4.5,2009-06-15 17:26:21,-73.844307,40.721321,-73.841614,40.712273,1,17,0,6,2009
1,16.9,2010-01-05 16:52:16,-74.016045,40.711304,-73.979271,40.782005,1,16,1,1,2010
2,5.7,2011-08-18 00:35:00,-73.982742,40.761269,-73.991234,40.750565,2,0,3,8,2011
3,7.7,2012-04-21 04:30:42,-73.987137,40.733139,-73.991570,40.758095,1,4,5,4,2012
4,5.3,2010-03-09 07:51:00,-73.968102,40.768009,-73.956665,40.783768,1,7,1,3,2010
...,...,...,...,...,...,...,...,...,...,...,...
55423851,14.0,2014-03-15 03:28:00,-74.005264,40.740028,-73.963272,40.762554,1,3,5,3,2014
55423852,4.2,2009-03-24 20:46:20,-73.957794,40.765533,-73.951645,40.773956,1,20,1,3,2009
55423853,14.1,2011-04-02 22:04:24,-73.970512,40.752323,-73.960541,40.797344,1,22,5,4,2011
55423854,28.9,2011-10-26 05:57:51,-73.980904,40.764629,-73.870613,40.773964,1,5,2,10,2011


In [13]:
train_1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
count,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0,55423480.0
mean,11.34501,-72.50973,39.91982,-72.51121,39.92068,1.685087,13.50978,3.041161,6.266239,2011.739
std,20.71087,12.84877,9.642324,12.7822,9.633346,1.310113,6.517377,1.948911,3.435531,1.860079
min,-300.0,-3442.06,-3492.264,-3442.024,-3547.886,-127.0,0.0,0.0,1.0,2009.0
25%,6.0,-73.99206,40.73493,-73.99139,40.73403,1.0,9.0,1.0,3.0,2010.0
50%,8.5,-73.9818,40.75265,-73.98016,40.75315,1.0,14.0,3.0,6.0,2012.0
75%,12.5,-73.96708,40.76713,-73.96368,40.7681,2.0,19.0,5.0,9.0,2013.0
max,93963.36,3457.625,3408.79,3457.622,3537.133,51.0,23.0,6.0,12.0,2015.0


In [17]:
# A non-positive fare_amount makes no sense, so those rows are dropped
train_1 = train_1[train_1['fare_amount'] > 0]
# Some coordinates are extremely far away; those rows are removed
train_1 = train_1[(train_1['pickup_longitude'] < -72) & (train_1['pickup_longitude'] > -75)]
train_1 = train_1[(train_1['pickup_latitude'] > 39) & (train_1['pickup_latitude'] < 44)]
train_1 = train_1[(train_1['dropoff_longitude'] < -72) & (train_1['dropoff_longitude'] > -75)]
train_1 = train_1[(train_1['dropoff_latitude'] > 39) & (train_1['dropoff_latitude'] < 44)]
train_1 = train_1[(train_1['passenger_count'] > 0) & (train_1['passenger_count'] < 10)]
train_1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,weekday,month,year
count,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0,54071770.0
mean,11.33481,-73.97515,40.75095,-73.9743,40.7513,1.691091,13.51032,3.041092,6.269149,2011.738
std,20.86522,0.03922,0.0322,0.038449,0.035372,1.306907,6.516743,1.949117,3.436405,1.865446
min,0.01,-74.99804,39.0073,-74.99828,39.01156,1.0,0.0,0.0,1.0,2009.0
25%,6.0,-73.99227,40.73656,-73.99159,40.73557,1.0,9.0,1.0,3.0,2010.0
50%,8.5,-73.9821,40.75337,-73.98061,40.75386,1.0,14.0,3.0,6.0,2012.0
75%,12.5,-73.9683,40.76756,-73.96533,40.7684,2.0,19.0,5.0,9.0,2013.0
max,93963.36,-72.00594,43.98246,-72.008,43.90581,9.0,23.0,6.0,12.0,2015.0
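
The filters above amount to one rule per row. The sketch below restates that rule in plain Python for a single record, using the same bounds as the cell; `is_plausible_ride` is a hypothetical name, and the cuDF boolean masks remain the right tool for the full dataset.

```python
def is_plausible_ride(ride):
    """Apply the notebook's cleaning bounds to one ride given as a dict."""
    fare_ok = ride["fare_amount"] > 0
    lon_ok = all(-75 < ride[k] < -72 for k in ("pickup_longitude", "dropoff_longitude"))
    lat_ok = all(39 < ride[k] < 44 for k in ("pickup_latitude", "dropoff_latitude"))
    passengers_ok = 0 < ride["passenger_count"] < 10
    return fare_ok and lon_ok and lat_ok and passengers_ok

# First row of the training data
good = {"fare_amount": 4.5, "passenger_count": 1,
        "pickup_longitude": -73.844307, "pickup_latitude": 40.721321,
        "dropoff_longitude": -73.841614, "dropoff_latitude": 40.712273}
# Same ride with the minimum fare seen in describe()
bad = dict(good, fare_amount=-300.0)

print(is_plausible_ride(good))
print(is_plausible_ride(bad))
```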


In [35]:
R = 6373.0
p = 0.017453292519943295
train_1['longitude_distance']=(train_1['dropoff_longitude']  -train_1['pickup_longitude'])
train_1['latitude_distance']=(train_1['dropoff_latitude'] - train_1['pickup_latitude'])

test_1['longitude_distance']=(test_1['dropoff_longitude'] - test_1['pickup_longitude'])
test_1['latitude_distance']=(test_1['dropoff_latitude'] - test_1['pickup_latitude'])

In [36]:
import math

# Haversine distance in km. p converts degrees to radians; R is the Earth radius in km.
a_train = (np.sin(train_1['latitude_distance'] * p / 2)**2
           + np.cos(train_1['pickup_latitude'] * p) * np.cos(train_1['dropoff_latitude'] * p)
           * np.sin(train_1['longitude_distance'] * p / 2)**2)
# applymap returns a new Series, so its result must be assigned
c_train = a_train.applymap(lambda a: 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)))
train_1['distance'] = c_train * R

a_test = (np.sin(test_1['latitude_distance'] * p / 2)**2
          + np.cos(test_1['pickup_latitude'] * p) * np.cos(test_1['dropoff_latitude'] * p)
          * np.sin(test_1['longitude_distance'] * p / 2)**2)
c_test = a_test.applymap(lambda a: 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)))
test_1['distance'] = c_test * R
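
For reference, the haversine formula this cell is built around can be written for a single pair of points with the standard library. This is a CPU-side sketch for checking individual values, not the notebook's GPU code; `haversine_km` is a hypothetical name and the radius is the same 6373 km used for `R`. Note that the coordinates are converted from degrees to radians before any trigonometric function is applied.

```python
import math

EARTH_RADIUS_KM = 6373.0  # same value as R in the notebook

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Pickup and dropoff of the first training row
d = haversine_km(40.721321, -73.844307, 40.712273, -73.841614)
print(round(d, 3), "km")
```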


In [None]:
#a = 0.5 - np.cos(train_1['latitude_distance']*p)/2 + np.cos(train_1['pickup_latitude']*p) * np.cos(train_1['dropoff_latitude']*p) * (1-np.cos((train_1['longitude_distance'])*p))/2
#a.applymap( lambda a : 12742 * math.asin(math.sqrt(a)))
#train_1['distance'] = a

#a_test = 0.5 - np.cos(test_1['latitude_distance']*p)/2 + np.cos(test_1['pickup_latitude']*p) * np.cos(test_1['dropoff_latitude']*p) * (1-np.cos((test_1['longitude_distance'])*p))/2
#a_test.applymap( lambda a : 12742 * math.asin(math.sqrt(a)))
#test_1['distance'] = a_test

In [18]:
train_1['longitude_distance']=(train_1['dropoff_longitude']  - train_1['pickup_longitude']).abs()
train_1['latitude_distance']=(train_1['dropoff_latitude'] - train_1['pickup_latitude']).abs()

test_1['longitude_distance']=(test_1['dropoff_longitude'] - test_1['pickup_longitude']).abs()
test_1['latitude_distance']=(test_1['dropoff_latitude'] - test_1['pickup_latitude']).abs()

In [19]:
drop_columns = ['pickup_datetime','dropoff_latitude','pickup_latitude','dropoff_longitude','pickup_longitude']
train_1=train_1.drop(drop_columns,axis=1)
test_1=test_1.drop(drop_columns,axis=1)

In [20]:
train_1.head()

Unnamed: 0,fare_amount,passenger_count,hour,weekday,month,year,longitude_distance,latitude_distance
0,4.5,1,17,0,6,2009,0.002693,0.009048
1,16.9,1,16,1,1,2010,0.036774,0.070702
2,5.7,2,0,3,8,2011,0.008492,0.010704
3,7.7,1,4,5,4,2012,0.004433,0.024956
4,5.3,1,7,1,3,2010,0.011436,0.015759


In [21]:
test_1.head(20)

Unnamed: 0,passenger_count,hour,weekday,month,year,longitude_distance,latitude_distance
0,1,13,1,1,2015,0.00811,0.01997
1,1,13,1,1,2015,0.012024,0.019814
2,1,11,5,10,2011,0.002869,0.005119
3,1,21,5,12,2012,0.009277,0.016178
4,1,21,5,12,2012,0.022537,0.045345
5,1,21,5,12,2012,0.018204,0.025494
6,1,12,3,10,2011,0.01062,0.002304
7,1,12,3,10,2011,0.207802,0.112732
8,1,12,3,10,2011,0.018997,0.031731
9,1,15,1,2,2014,0.011108,0.005203


# Linear Regression

In [22]:
X=train_1.drop(['fare_amount'],axis=1)
y=train_1['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])

Number of records in training data  37850239
Number of records in validation data  16221530
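
The reported counts are consistent with the validation size being 30% of the 54071769 cleaned rows, rounded down. The sketch below illustrates a shuffled 70/30 split on row indices; it is an assumption about the general technique, not a description of how cuML implements `train_test_split`, and `train_test_split_indices` is a hypothetical name.

```python
import random

def train_test_split_indices(n_rows, test_size=0.3, seed=None):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)  # rounds down
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(10, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))
```

Passing a fixed `seed` makes the split reproducible between runs, which helps when comparing models.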


In [23]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(37850239, 7)
(16221530, 7)
(37850239,)
(16221530,)


In [24]:
lm = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")
lm.fit(X_train,y_train)
y_pred=lm.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not the RMSE
lm_r2 = r2_score(y_test, y_pred)
print("R^2 score for Linear Regression is ",lm_r2)

RuntimeError: ignored
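
Note that `r2_score` returns the coefficient of determination, which is a different quantity from the root mean squared error. The plain-Python sketch below shows both on a tiny example; `r2` and `rmse` are hypothetical helper names written here for illustration, not cuML functions.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 4.0]
print("R^2: ", r2(y_true, y_hat))   # 0.5
print("RMSE:", rmse(y_true, y_hat))
```

R^2 is unitless and at most 1, while RMSE is expressed in the units of `fare_amount`, so the two cannot be compared directly.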

In [25]:
y_pred=lm.predict(test_1)
y_pred

MemoryError: ignored

In [45]:
gdf_submission = cudf.read_csv(base_path +'/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,10.926086
1,2015-01-27 13:08:24.0000003,10.890076
2,2011-10-08 11:53:44.0000002,6.85376
3,2012-12-01 21:12:12.0000002,9.735809
4,2012-12-01 21:12:12.0000003,16.690033


In [46]:
gdf_submission.to_csv(base_path + '/logRegRapidSubmission4DifferentDistanceWLonLatCols.csv',index=False)
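
The submission file is a two-column CSV with `key` and `fare_amount` and no index column. The sketch below writes the same layout with the standard library, which can be handy for inspecting a few rows without a GPU; `write_submission` is a hypothetical helper and the two rows are taken from the output above.

```python
import csv
import io

def write_submission(keys, fares, out):
    """Write key,fare_amount rows to a file-like object, header first."""
    if len(keys) != len(fares):
        raise ValueError("keys and fares must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["key", "fare_amount"])
    for key, fare in zip(keys, fares):
        writer.writerow([key, fare])

buffer = io.StringIO()
write_submission(
    ["2015-01-27 13:08:24.0000002", "2015-01-27 13:08:24.0000003"],
    [10.926086, 10.890076],
    buffer,
)
print(buffer.getvalue())
```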

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib