In this section you will apply GPU-accelerated data analysis in practice using the RAPIDS
suite of libraries. You will work with the New York City Taxi Trip Duration dataset from
Kaggle and fit Gradient Boosting Machine (GBM) models for prediction, with particular
emphasis on comparing the speed and efficiency of GPU-accelerated computation with traditional CPU-based methods.

1. Comparative Data Processing
(a)  Perform data loading and preprocessing tasks first using pandas (CPU) and then replicate the
same tasks using cuDF (GPU). Document the time taken for each operation in both scenarios.
(b)  Conduct basic exploratory data analysis (EDA) with both CPU-based tools (e.g., matplotlib)
and GPU-accelerated tools, noting any differences in performance and responsiveness.
2. Feature Engineering and Selection
(a) Engage in feature engineering, creating new variables that could aid in predicting trip durations. Compare the execution time for these operations on CPU vs. GPU.
(b) Select relevant features for the model based on their correlation with the target variable,
assessing the speed of these operations on CPU and GPU.
3. Model Training and Evaluation
(a) Train a Gradient Boosting Machine (GBM) model on the dataset using scikit-learn (CPU)
and cuML (GPU). Record and compare the training times.
(b) Evaluate the accuracy of both models and document the time taken for predictions on the
test set using CPU and GPU.
4. Performance Analysis
(a) Compile and compare the execution times for tasks performed on CPU vs. GPU, creating a
detailed analysis of the observed performance differences.
(b) Reflect on the implications of these findings for data science workflows, particularly in terms
of efficiency and scalability.

# (a) Perform data loading and preprocessing tasks first using pandas (CPU) and then replicate the same tasks using cuDF (GPU). Document the time taken for each operation in both scenarios.
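
One way to keep the timing bookkeeping uniform across CPU and GPU runs is a small helper. The sketch below is an assumption and not part of the original notebook: `timed` and `cpu_records` are hypothetical names, and `time.perf_counter` is chosen because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(task, records):
    """Time the enclosed block and append one row (as a dict) to `records`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        records.append({
            "Task": task,
            "start_time": start,
            "end_time": end,
            "time_taken": end - start,
        })


cpu_records = []
with timed("Example task", cpu_records):
    total = sum(range(100_000))  # stand-in for a pandas or cuDF operation

print(cpu_records[0]["Task"], cpu_records[0]["time_taken"])
```

A list of such dicts can be passed to `pd.DataFrame(cpu_records)` to obtain a table with the columns `Task`, `start_time`, `end_time`, and `time_taken`.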

In [None]:
# !pip install \
#     --extra-index-url=https://pypi.nvidia.com \
#     cudf-cu12==24.2.* dask-cudf-cu12==24.2.* cuml-cu12==24.2.* \
#     cugraph-cu12==24.2.* cuspatial-cu12==24.2.* cuproj-cu12==24.2.* \
#     cuxfilter-cu12==24.2.* cucim-cu12==24.2.* pylibraft-cu12==24.2.* \
#     raft-dask-cu12==24.2.*

In [50]:
# Importing necessary libraries
import pandas as pd
import cudf
import cupy as cp

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

## Loading the Datasets

In [51]:
# create a dataframe to hold CPU Metrics
cpu_metrics = pd.DataFrame(columns=['Task', 'start_time', 'end_time', 'time_taken'])

# create a dataframe to hold GPU Metrics
gpu_metrics = pd.DataFrame(columns=['Task', 'start_time', 'end_time', 'time_taken'])

In [52]:
# Load the data using cuDF
start_time_gpu = time.time()
sample_submission_gpu = cudf.read_csv('sample_submission.csv')
test_gpu = cudf.read_csv('test.csv')
train_gpu = cudf.read_csv('train.csv')
end_time_gpu = time.time()

In [53]:
# Load the data using pandas
start_time_cpu = time.time()
sample_submission_cpu = pd.read_csv('sample_submission.csv')
test_cpu = pd.read_csv('test.csv')
train_cpu = pd.read_csv('train.csv')
end_time_cpu = time.time()

In [54]:
# Display the time taken for data loading using pandas
print("Time taken for data loading using pandas: ", end_time_cpu - start_time_cpu)
# Display the time taken for data loading using cuDF
print("Time taken for data loading using GPU: ", end_time_gpu - start_time_gpu)

Time taken for data loading using pandas:  0.8471214771270752
Time taken for data loading using GPU:  0.05130195617675781
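
A single wall-clock measurement like the one above can be skewed by one-time costs. The first cuDF call in a session typically also pays for GPU context initialization, and whichever library reads a file second may benefit from the operating system's file cache. A minimal sketch of a more robust measurement, assuming a hypothetical helper `best_of` that is not part of the original notebook, runs the operation once as a warm-up and then reports the fastest of several repeats:

```python
import time


def best_of(fn, repeats=3, warmup=1):
    """Return the fastest wall-clock time in seconds over `repeats` calls of `fn`."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


calls = []  # records every invocation so the call count can be checked
elapsed = best_of(lambda: calls.append(sum(range(10_000))), repeats=3, warmup=1)
print(f"best of 3: {elapsed:.6f} s")
```

In this notebook it could be called as `best_of(lambda: pd.read_csv('train.csv'))` and `best_of(lambda: cudf.read_csv('train.csv'))`.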


In [None]:
# Add the metrics to the dataframes
# (DataFrame.append was removed in pandas 2.0, so rows are added with .loc instead)
cpu_metrics.loc[len(cpu_metrics)] = ['Data Loading', start_time_cpu, end_time_cpu, end_time_cpu - start_time_cpu]
gpu_metrics.loc[len(gpu_metrics)] = ['Data Loading', start_time_gpu, end_time_gpu, end_time_gpu - start_time_gpu]

In [56]:
sample_submission_cpu.head(5)

Unnamed: 0,id,trip_duration
0,id3004672,959
1,id3505355,959
2,id1217141,959
3,id2150126,959
4,id1598245,959


In [57]:
test_cpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag
0,id3004672,1,2016-06-30 23:59:58,1.0,-73.988129,40.732029,-73.990173,40.75668,N
1,id3505355,1,2016-06-30 23:59:53,1.0,-73.964203,40.679993,-73.959808,40.655403,N
2,id1217141,1,2016-06-30 23:59:47,1.0,-73.997437,40.737583,-73.98616,40.729523,N
3,id2150126,2,2016-06-30 23:59:41,1.0,-73.95607,40.7719,-73.986427,40.730469,N
4,id1598245,1,2016-06-30 23:59:33,1.0,-73.970215,40.761475,-73.96151,40.75589,N


In [58]:
train_cpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2.0,2016-03-14 17:24:55,2016-03-14 17:32:30,1.0,-73.982155,40.767937,-73.96463,40.765602,N,455.0
1,id2377394,1.0,2016-06-12 00:43:35,2016-06-12 00:54:38,1.0,-73.980415,40.738564,-73.999481,40.731152,N,663.0
2,id3858529,2.0,2016-01-19 11:35:24,2016-01-19 12:10:48,1.0,-73.979027,40.763939,-74.005333,40.710087,N,2124.0
3,id3504673,2.0,2016-04-06 19:32:31,2016-04-06 19:39:40,1.0,-74.01004,40.719971,-74.012268,40.706718,N,429.0
4,id2181028,2.0,2016-03-26 13:30:55,2016-03-26 13:38:10,1.0,-73.973053,40.793209,-73.972923,40.78252,N,435.0


## Data Preprocessing


In [59]:
# Data Preprocessing using cuDF
time_start_gpu = time.time()
# drop rows with missing values
train_gpu.dropna(inplace=True)
test_gpu.dropna(inplace=True)

# drop records with duplicated IDs
train_gpu.drop_duplicates(subset='id', inplace=True)

# convert the 'pickup_datetime' column to datetime format
train_gpu['pickup_datetime'] = cudf.to_datetime(train_gpu['pickup_datetime'])

# drop the 'dropoff_datetime' column: it is not available in the test set, and the trip
# duration is already the difference between 'dropoff_datetime' and 'pickup_datetime'
train_gpu.drop('dropoff_datetime', axis=1, inplace=True)

test_gpu['pickup_datetime'] = cudf.to_datetime(test_gpu['pickup_datetime'])

# create new columns for the pickup month, day of the week, and hour of the day
train_gpu['pickup_month'] = train_gpu['pickup_datetime'].dt.month
train_gpu['pickup_day'] = train_gpu['pickup_datetime'].dt.dayofweek
train_gpu['pickup_hour'] = train_gpu['pickup_datetime'].dt.hour

test_gpu['pickup_month'] = test_gpu['pickup_datetime'].dt.month
test_gpu['pickup_day'] = test_gpu['pickup_datetime'].dt.dayofweek
test_gpu['pickup_hour'] = test_gpu['pickup_datetime'].dt.hour

end_time_gpu = time.time()

In [60]:
# Data Preprocessing using pandas
time_start_cpu = time.time()
# drop rows with missing values
train_cpu.dropna(inplace=True)
test_cpu.dropna(inplace=True)

# drop records with duplicated IDs
train_cpu.drop_duplicates(subset='id', inplace=True)

# convert the 'pickup_datetime' column to datetime format
train_cpu['pickup_datetime'] = pd.to_datetime(train_cpu['pickup_datetime'])

# drop the 'dropoff_datetime' column: it is not available in the test set, and the trip
# duration is already the difference between 'dropoff_datetime' and 'pickup_datetime'
train_cpu.drop('dropoff_datetime', axis=1, inplace=True)

test_cpu['pickup_datetime'] = pd.to_datetime(test_cpu['pickup_datetime'])

# create new columns for the pickup month, day of the week, and hour of the day
train_cpu['pickup_month'] = train_cpu['pickup_datetime'].dt.month
train_cpu['pickup_day'] = train_cpu['pickup_datetime'].dt.dayofweek
train_cpu['pickup_hour'] = train_cpu['pickup_datetime'].dt.hour

test_cpu['pickup_month'] = test_cpu['pickup_datetime'].dt.month
test_cpu['pickup_day'] = test_cpu['pickup_datetime'].dt.dayofweek
test_cpu['pickup_hour'] = test_cpu['pickup_datetime'].dt.hour

end_time_cpu = time.time()

In [61]:
# Display the time taken for data preprocessing using pandas
print("Time taken for data preprocessing using pandas: ", end_time_cpu - time_start_cpu)
# Display the time taken for data preprocessing using cuDF
print("Time taken for data preprocessing using GPU: ", end_time_gpu - time_start_gpu)

Time taken for data preprocessing using pandas:  0.24985980987548828
Time taken for data preprocessing using GPU:  0.06355476379394531
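
The two preprocessing cells above differ only in which library converts the datetime column, so the steps can be written once and reused for both backends. The sketch below is an assumption and not the notebook's own code: `preprocess`, `to_datetime`, and `drop_duplicate_ids` are hypothetical names, and the function returns a new frame instead of modifying its input in place. It is exercised here on a tiny made-up pandas frame; for a cuDF frame, `cudf.to_datetime` would be passed instead.

```python
import pandas as pd


def preprocess(df, to_datetime=pd.to_datetime, drop_duplicate_ids=False):
    """Clean a trips frame and add pickup month/day/hour columns.

    Pass `to_datetime=cudf.to_datetime` when `df` is a cuDF DataFrame.
    """
    df = df.dropna()
    if drop_duplicate_ids:
        df = df.drop_duplicates(subset="id")
    if "dropoff_datetime" in df.columns:
        df = df.drop(columns="dropoff_datetime")  # absent from the test set
    pickup = to_datetime(df["pickup_datetime"])
    return df.assign(
        pickup_datetime=pickup,
        pickup_month=pickup.dt.month,
        pickup_day=pickup.dt.dayofweek,  # Monday = 0
        pickup_hour=pickup.dt.hour,
    )


# Tiny made-up frame: one duplicated id and one row with a missing value.
raw = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-03-14 17:24:55",
                        "2016-06-12 00:43:35", None],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-03-14 17:32:30",
                         "2016-06-12 00:54:38", "2016-06-12 01:00:00"],
    "trip_duration": [455, 455, 663, 100],
})
clean = preprocess(raw, drop_duplicate_ids=True)
print(clean[["id", "pickup_month", "pickup_day", "pickup_hour"]])
```

Sharing one function keeps the CPU and GPU pipelines identical, so any timing difference reflects the library and not a divergence in the steps.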


In [None]:
# Add the metrics to the dataframes
# (DataFrame.append was removed in pandas 2.0, so rows are added with .loc instead)
cpu_metrics.loc[len(cpu_metrics)] = ['Data Preprocessing', time_start_cpu, end_time_cpu, end_time_cpu - time_start_cpu]
gpu_metrics.loc[len(gpu_metrics)] = ['Data Preprocessing', time_start_gpu, end_time_gpu, end_time_gpu - time_start_gpu]

In [63]:
train_cpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_month,pickup_day,pickup_hour
0,id2875421,2.0,2016-03-14 17:24:55,1.0,-73.982155,40.767937,-73.96463,40.765602,N,455.0,3,0,17
1,id2377394,1.0,2016-06-12 00:43:35,1.0,-73.980415,40.738564,-73.999481,40.731152,N,663.0,6,6,0
2,id3858529,2.0,2016-01-19 11:35:24,1.0,-73.979027,40.763939,-74.005333,40.710087,N,2124.0,1,1,11
3,id3504673,2.0,2016-04-06 19:32:31,1.0,-74.01004,40.719971,-74.012268,40.706718,N,429.0,4,2,19
4,id2181028,2.0,2016-03-26 13:30:55,1.0,-73.973053,40.793209,-73.972923,40.78252,N,435.0,3,5,13


In [64]:
test_cpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_month,pickup_day,pickup_hour
0,id3004672,1,2016-06-30 23:59:58,1.0,-73.988129,40.732029,-73.990173,40.75668,N,6,3,23
1,id3505355,1,2016-06-30 23:59:53,1.0,-73.964203,40.679993,-73.959808,40.655403,N,6,3,23
2,id1217141,1,2016-06-30 23:59:47,1.0,-73.997437,40.737583,-73.98616,40.729523,N,6,3,23
3,id2150126,2,2016-06-30 23:59:41,1.0,-73.95607,40.7719,-73.986427,40.730469,N,6,3,23
4,id1598245,1,2016-06-30 23:59:33,1.0,-73.970215,40.761475,-73.96151,40.75589,N,6,3,23


In [65]:
train_gpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_month,pickup_day,pickup_hour
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,N,455,3,0,17
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,6,0
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,1,11
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,N,429,4,2,19
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,N,435,3,5,13


In [66]:
test_gpu.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_month,pickup_day,pickup_hour
0,id3004672,1,2016-06-30 23:59:58,1,-73.988129,40.732029,-73.990173,40.75668,N,6,3,23
1,id3505355,1,2016-06-30 23:59:53,1,-73.964203,40.679993,-73.959808,40.655403,N,6,3,23
2,id1217141,1,2016-06-30 23:59:47,1,-73.997437,40.737583,-73.98616,40.729523,N,6,3,23
3,id2150126,2,2016-06-30 23:59:41,1,-73.95607,40.7719,-73.986427,40.730469,N,6,3,23
4,id1598245,1,2016-06-30 23:59:33,1,-73.970215,40.761475,-73.96151,40.75589,N,6,3,23


In [67]:
cpu_metrics.head(5)

Unnamed: 0,Task,start_time,end_time,time_taken
0,Data Loading,1713413000.0,1713413000.0,0.847121
1,Data Preprocessing,1713413000.0,1713413000.0,0.24986


In [68]:
gpu_metrics

Unnamed: 0,Task,start_time,end_time,time_taken
0,Data Loading,1713413000.0,1713413000.0,0.051302
1,Data Preprocessing,1713413000.0,1713413000.0,0.063555
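
The two metrics tables can be joined on `Task` to express the comparison as a speedup factor. The sketch below is illustrative and not part of the original notebook: the `cpu`, `gpu`, and `comparison` names are hypothetical, and the `time_taken` values are copied from the tables above so the block runs without the data files.

```python
import pandas as pd

# time_taken values (seconds) copied from the metrics tables above
cpu = pd.DataFrame({
    "Task": ["Data Loading", "Data Preprocessing"],
    "time_taken": [0.847121, 0.249860],
})
gpu = pd.DataFrame({
    "Task": ["Data Loading", "Data Preprocessing"],
    "time_taken": [0.051302, 0.063555],
})

# One row per task, with CPU and GPU timings side by side
comparison = cpu.merge(gpu, on="Task", suffixes=("_cpu", "_gpu"))
comparison["speedup"] = comparison["time_taken_cpu"] / comparison["time_taken_gpu"]
print(comparison.round(2))
```

For the run recorded here this gives speedups of roughly 16.5x for data loading and 3.9x for preprocessing. Each figure comes from a single run, so the values will vary with hardware and caching.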
