<h1 align=center>Accelerated Data Science Workflows with RAPIDS</h1>
<h2 align=center>No Show Predictive Model End-to-End</h2>

![RAPIDS logo](rapids-logo.png)

The [RAPIDS](https://rapids.ai/) suite of software libraries gives data scientists the freedom to execute end-to-end data science and analytics pipelines entirely on GPU accelerators. The software suite relies on NVIDIA CUDA primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces. RAPIDS enables faster and more iterative development for end-to-end data science workflows. This lab focuses on how to use RAPIDS to obtain these benefits.

## Introduction

In this lab we will use a synthetic data set of healthcare patients, and will attempt to predict whether a given patient is likely to be a no-show for, or significantly late to, a scheduled appointment. We will begin with an overview of the data before proceeding into a typical workflow of ingesting the data, manipulating it, and then training a model with it. At each stage we will demonstrate how to perform a given task on the GPU with RAPIDS, before asking you to refactor working CPU-only code to run faster on the GPU. As a final exercise, you will be asked to implement additional features and retrain the model in order to make better predictions.

## Generate the Data

Follow this link to [No Show Predictive Model Data Generator](No%20Show%20Predictive%20Model%20Data%20Generator.ipynb). When the notebook opens, use the menu to select **Cell -> Run All**. This will generate the synthetic data set we will be using in this lab.

After the data for this lab has been generated, it is stored in 2 csv files:

- **patient_data.csv** contains 16 attributes about medical patients. It includes personal data such as their age and sex, data about the time and type of their scheduled appointment, information about the day's weather, and their previous rate of being a no-show.
- **zipcode_data.csv** contains 2 attributes: the patient's zipcode, which can be used to join this data with that in *patient_data*, and whether or not they have access to public transportation.
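
To make the relationship between the two files concrete, here is a minimal plain-Python sketch of a left join on the zipcode. The rows are made up, and the `PUBLIC_TRANSPORT` field name is an assumption used only for illustration; the real column names come from the generated csv files.

```python
# Hypothetical rows standing in for the two csv files.
patients = [
    {"AGE": 34, "ZIPCODE": 10001},
    {"AGE": 71, "ZIPCODE": 10002},
    {"AGE": 52, "ZIPCODE": 99999},  # zipcode with no matching record
]
zipcodes = {10001: 1, 10002: 0}  # ZIPCODE -> PUBLIC_TRANSPORT (assumed name)


def left_join(patient_rows, transport_by_zip):
    """Attach the transport flag to each patient; None when no zipcode matches."""
    joined = []
    for row in patient_rows:
        merged = dict(row)
        merged["PUBLIC_TRANSPORT"] = transport_by_zip.get(row["ZIPCODE"])
        joined.append(merged)
    return joined


joined_rows = left_join(patients, zipcodes)
print(joined_rows)
```

A left join keeps every patient row, even when its zipcode has no match, which is why the third row carries `None`.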

## Prerequisites

This lab does not intend to teach how to do Data Science and assumes you have professional Data Science experience. This lab also assumes competency with the following programming tools and techniques, which will be used without explanation throughout the lab:

- The [Python 3 programming language](https://docs.python.org/)
- The [Pandas Data Analysis Library](https://pandas.pydata.org/)
- The [NumPy Library for Numerical Programming](http://www.numpy.org/)
- [One Hot Encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
- Machine learning model training with [XGBoost](https://xgboost.readthedocs.io/)

## Objectives

In this lab you will learn to perform end to end data science workflows on a GPU accelerator using RAPIDS. By the end of this lab you will be able to:

- Read data directly onto the GPU
- Manipulate data and extract features on the GPU
- Use GPU-enabled XGBoost to train a machine learning model

## Lab Environment

This lab uses web sockets to connect your browser to an interactive IPython notebook running on a cloud service provider VM backed by an NVIDIA GPU accelerator. Execute the cell immediately below, both to confirm that your websocket connection is not being blocked by a VPN or other security software on your computer, and to see information about the NVIDIA GPU accelerator you will be using during the lab.

In [None]:
!nvidia-smi

## Imports

Built in Python imports:

In [None]:
import os
import sys

Additional CPU imports:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, roc_auc_score

Imports for GPU accelerated data manipulation and model training:

In [None]:
import cudf
from cudf.dataframe import DataFrame

import xgboost

Use [magic cell](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib) to enable inline `matplotlib` interaction in the notebook:

In [None]:
%matplotlib inline

### Define a LabelEncoder and a train_test_split function
This is a preview of functionality that's coming in the next version of cuDF and cuML.

We'll be using these functions in the notebook.

In [None]:
import cudf
import nvcategory

from librmm_cffi import librmm
import numpy as np


def _enforce_str(y: cudf.Series) -> cudf.Series:
    if y.dtype != "object":
        return y.astype("str")
    return y


class Base(object):
    def __init__(self, *args, **kwargs):
        self._fitted = False

    def check_is_fitted(self):
        if not self._fitted:
            raise TypeError("Model must first be .fit()")


class LabelEncoder(Base):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cats: nvcategory.nvcategory = None
        self._dtype = None

    def fit(self, y: cudf.Series) -> "LabelEncoder":
        self._dtype = y.dtype
        y = _enforce_str(y)

        self._cats = nvcategory.from_strings(y.data)
        self._fitted = True
        return self

    def transform(self, y: cudf.Series) -> cudf.Series:
        self.check_is_fitted()
        y = _enforce_str(y)
        encoded = cudf.Series(
            nvcategory.from_strings(y.data)
            .set_keys(self._cats.keys())
            .values()
        )
        if -1 in encoded:
            raise KeyError("Attempted to encode unseen key")
        return encoded

    def fit_transform(self, y: cudf.Series) -> cudf.Series:
        self._dtype = y.dtype
        y = _enforce_str(y)
        self._cats = nvcategory.from_strings(y.data)
        self._fitted = True
        arr: librmm.device_array = librmm.device_array(
            y.data.size(), dtype=np.int32
        )
        self._cats.values(devptr=arr.device_ctypes_pointer.value)
        return cudf.Series(arr)

    def inverse_transform(self, y: cudf.Series):
        raise NotImplementedError

        
# Given a cudf string column, returns the unique values
def get_unique_strings(ds):
    c = nvcategory.from_strings(ds.data)
    return c
        
def cuml_train_test_split(X_gdf, y_gdf, test_size=0.20, random_state=42):
    # Identify shape and indices
    n_rows, n_columns = X_gdf.shape
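    train_size = 1 - test_size
    train_index = int(n_rows * train_size)

    # Shuffle data: give each row a position drawn from a seeded random
    # permutation (so `random_state` is honored), then sort by that position
    idx = np.random.RandomState(random_state).permutation(n_rows)
    X_gdf = X_gdf.set_index(idx)
    y_gdf = y_gdf.set_index(idx)
    
    X_gdf = X_gdf.sort_index()
    y_gdf = y_gdf.sort_index()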
    
    # Split into train & test data
    X_train, y_train = X_gdf[:train_index], y_gdf[:train_index]
    X_test, y_test = X_gdf[train_index:], y_gdf[train_index:]
    del X_gdf, y_gdf
       
    return X_train, X_test, y_train, y_test
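
As a rough mental model of what the `LabelEncoder` above does, here is a plain-Python sketch that runs without a GPU. It assumes the encoder assigns integer codes by sorted key order, which is our reading of how the category keys are stored; the `label_encode` helper and the `MEDICAID` value are illustrative, not part of the lab code.

```python
def label_encode(values):
    """Map each distinct string to an integer code, using sorted key order."""
    keys = sorted(set(values))
    code_of = {key: code for code, key in enumerate(keys)}
    return [code_of[v] for v in values], keys


codes, keys = label_encode(["MEDICAID", "OTHER", "MEDICARE", "OTHER"])
print(keys)   # the distinct values, sorted
print(codes)  # one integer per input value
```

Equal strings always receive the same code, so the encoded column can later be used for comparisons or one hot encoding.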

### Custom Timer

We will be using the following `Timer` for easy performance profiling throughout the lab.

In [None]:
from copy import copy
from time import time
from time import sleep

from timeit import default_timer

class Timer(object):
    def __init__(self):
        self._timer = default_timer
    
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

    def start(self):
        """Start the timer."""
        # Stored under a different name so the `start` method is not overwritten
        self.start_time = self._timer()

    def stop(self):
        """Stop the timer. Calculate the interval in seconds."""
        self.end_time = self._timer()
        self.interval = self.end_time - self.start_time

The custom timer can conveniently be used in the following ways:

In [None]:
big_num = 100000

t = Timer()
t.start()

for i in range(big_num):
    r = 1

t.stop()

print('Print time using t.start() and t.stop(): \t{:.6f}'.format(t.interval))

with Timer() as t:
    for i in range(big_num):
        r = 1

print('Print time using `with` statement: \t\t{:.6f}'.format(t.interval))

## Data Ingestion

RAPIDS enables reading data on disk directly to GPU memory where it can be explored and manipulated on massively parallel NVIDIA GPUs. In this section we will demonstrate common ways to read data directly into GPU memory, noting the speed up over loading data to the CPU, before moving on to exploring and manipulating data on the GPU.

### Ingesting Data to the CPU with Pandas

Here we store the paths to the 2 csv files containing our previously generated data into variables.

In [None]:
patient_data_path = 'patient_data.csv'
zipcode_data_path = 'zipcode_data.csv'

Now we use `pandas.read_csv` to load the data in the csv files into 2 Pandas DataFrames, `patient_pdf` and `zipcode_pdf`. Take note of the informational printouts, in particular the time it took to load the data into memory.

In [None]:
with Timer() as t:
    patient_pdf = pd.read_csv(patient_data_path)

pdf_read_time = t.interval
print('Time: {:.1f}s'.format(pdf_read_time))
print("Samples: {:.1f} million".format(patient_pdf.shape[0]/1E6))
print("Features:", patient_pdf.shape[1])
print("Dataset size: {:.1f} GB".format(sys.getsizeof(patient_pdf)/1E9))

In [None]:
with Timer() as t:
    zipcode_pdf = pd.read_csv(zipcode_data_path)

zipcode_pdf_read_time = t.interval
print('Time: {:.1f}s'.format(zipcode_pdf_read_time))
print("Samples: {:.1f} million".format(zipcode_pdf.shape[0]/1E6))
print("Features:", zipcode_pdf.shape[1])
print("Dataset size: {:.1f} GB".format(sys.getsizeof(zipcode_pdf)/1E9))

### Ingesting Data to the GPU with cuDF

RAPIDS provides a GPU accelerated DataFrame library, with a very similar API to Pandas, in [cuDF](https://github.com/rapidsai/cudf). cuDF (*cu* as in [CUDA](https://developer.nvidia.com/about-cuda), the platform enabling general purpose programming on NVIDIA GPU accelerators) is a DataFrame manipulation library which uses the columnar memory format standard of [Apache Arrow](https://arrow.apache.org/).

With cuDF we can load data directly to GPU memory just as we would to CPU memory with Pandas. The following uses `cudf.read_csv` to read data in a csv file to GPU memory, returning a **GPU** DataFrame, commonly referred to as a **`gdf`**.

In [None]:
with Timer() as t:
    patient_gdf = cudf.read_csv(patient_data_path)

gdf_read_time = t.interval
print('Time: {:.1f}s'.format(gdf_read_time))
print("Samples: {:.1f} million".format(patient_gdf.shape[0]/1E6))
print("Features:", patient_gdf.shape[1])
print("Dataset size: {:.1f} GB".format(sys.getsizeof(patient_gdf)/1E9))

Read DataFrame from csv speedup:

In [None]:
print("Read time on GPU was {:.2f}x faster than on CPU.".format(pdf_read_time / gdf_read_time))

In [None]:
print(patient_gdf)

In [None]:
patient_gdf.columns

### You can configure how much data to show in the summary view

In [None]:
from cudf.settings import set_options

format_options =  {
    'nrows': 50,
    'ncols': 12
}

with set_options(formatting=format_options):
    print(patient_gdf)

### Exercise: Load Zipcode Data to GPU

Load the zipcode data into GPU memory, creating a GPU DataFrame called `zipcode_gdf`.

- Recall that the path to the csv file containing the data is currently stored in the `zipcode_data_path` variable


#### Complete the below to proceed further

In [6]:
zipcode_gdf = #TO DO

In [None]:
print(zipcode_gdf) # Use this cell after you complete the exercise

## Data Exploration and Manipulation with cuDF

Now that we have used cuDF to create two GPU DataFrames, `patient_gdf` and `zipcode_gdf`, we can begin to explore and manipulate the data much as we would using a Pandas DataFrame. Much of the `cudf.DataFrame` API is identical to that of `pandas.DataFrame`, requiring no modification. If modifications are needed, please refer to the [cuDF API Reference](https://rapidsai.github.io/projects/cudf/en/latest/api.html) and the [RAPIDS Cheat Sheet](https://rapids.ai/assets/files/cheatsheet.pdf).

cuDF, along with the entire RAPIDS software suite, is open source, and undergoing very active development. For feature requests, bug reports, or to make contributions, please consider visiting the cuDF repo at [github.com/rapidsai/cudf](https://github.com/rapidsai/cudf).

Let's proceed to some basic data exploration and manipulation with cuDF.

## Simple Data Selection

Selecting a single column with cuDF is just like with Pandas:

In [None]:
print(patient_gdf['AGE'])

Integer slicing for row selection is also identical:

In [None]:
print(patient_gdf[2:4])

As is selecting by rows and labels using `loc`:

In [None]:
print(patient_gdf.loc[0:5, ['AGE']])

Boolean indexing is quite similar to Pandas:

In [None]:
select_columns = ['AGE', 'GENDER', 'INSURANCE']
ages_greater_than_ninety = patient_gdf[select_columns][patient_gdf.AGE > 90]
print(ages_greater_than_ninety)

### Exercise: Create a GPU DataFrame with a Subset of Data

Create a GPU DataFrame `age_of_no_shows_gdf` which contains the `AGE` and `NO_SHOW_RATE` columns and is a subselection of the `patient_gdf` DataFrame (our already existing patient data), limited to patients who have a `NO_SHOW_RATE` greater than `0.7`.

#### Complete the below to proceed further

In [None]:
age_of_no_shows_gdf = [] # TODO: populate this as a GPU DataFrame as described above.
print(age_of_no_shows_gdf)

## View Unique Values in a Column

Use `.unique()` for numeric columns

In [None]:
print(patient_gdf['DEPT_ID'].unique())

For string columns, use the `get_unique_strings` function we defined at the top of the notebook.
This is a temporary approach until `.unique()` supports string columns.

In [None]:
print(get_unique_strings(patient_gdf['INSURANCE']))

You'll notice that there are some empty `None` values in this column. 
So let's replace them with the value `OTHER` and look at the unique values again.

In [None]:
patient_gdf['INSURANCE'] = patient_gdf['INSURANCE'].fillna('OTHER')
print(get_unique_strings(patient_gdf['INSURANCE']))

## Converting between GPU DataFrame and Pandas DataFrame

Each GPU DataFrame can invoke the method `to_pandas` to return a Pandas DataFrame:

In [None]:
print(type(patient_gdf.to_pandas()))

You may wish to convert data to Pandas so that you can visualize it using tools like Matplotlib. As an example, we'll group the patient data by insurance, and plot it as a bar graph.

In [None]:
# First group the data by INSURANCE
insurance_count_gdf = patient_gdf[['INSURANCE', 'LABEL']].groupby('INSURANCE').count()
print(insurance_count_gdf)

In [None]:
# Reset index, so that we can have Insurance back as a column.
insurance_count_gdf = insurance_count_gdf.reset_index()

# Rename the LABEL column as COUNT for clarity, since it now stores counts by Insurance
insurance_count_gdf = insurance_count_gdf.rename({'LABEL': 'COUNT'})

print(insurance_count_gdf)

In [None]:
# Now, sort this grouped data by Insurance, just for convenience
insurance_count_gdf = insurance_count_gdf.sort_values(by='INSURANCE')
print(insurance_count_gdf)

In [None]:
# Convert to Pandas and visualize as a bar graph
insurance_count_gdf.to_pandas().plot.bar(x='INSURANCE', y='COUNT')

## Sorting

GPU DataFrames also have a `sort_values` method that works just like the Pandas equivalent, only much faster:

In [None]:
select_columns = ['AGE', 'GENDER', 'DISTANCE_FROM_CLINIC', 'LABEL']

In [None]:
with Timer() as t:
    gpu_sorted_ages = patient_gdf[select_columns].sort_values(by='AGE')

gdf_sort_time = t.interval
print('Time: {:.5f}s'.format(gdf_sort_time))

In [None]:
print(gpu_sorted_ages)

In [None]:
with Timer() as t:
    pdf_sorted_ages = patient_pdf[select_columns].sort_values(by='AGE')

pdf_sort_time = t.interval
print('Time: {:.5f}s'.format(pdf_sort_time))

In [None]:
print("Sort on GPU was {:.2f}x faster than on CPU.".format(pdf_sort_time / gdf_sort_time))

### Exercise: Write the 1000 Oldest Patients to CSV

Create a csv file called `oldest_patients_details.csv` with the `AGE`, `GENDER`, `ZIPCODE` and `NO_SHOW_RATE` of the 1000 oldest patients currently in `patient_gdf`, sorted in ascending order by their `NO_SHOW_RATE`.

#### Complete the below to proceed further

In [None]:
oldest_patients_details_path = 'oldest_patients_details.csv'

# TODO: Follow the instructions above to write the specified data to oldest_patients_details.csv

## Applying User Defined Functions to Columns

Just like with Pandas, you can perform operations on every value in a column like this:

In [None]:
patient_gdf['AGE_NEXT_YEAR'] = patient_gdf['AGE'] + 1
print(patient_gdf[['AGE', 'AGE_NEXT_YEAR']])

We can also apply functions to columns of data using `cudf.Series.applymap` which is analogous to `pandas.Series.apply`:

In [None]:
double = lambda x : x * 2

In [None]:
with Timer() as t:
    patient_gdf['DOUBLED_AGE'] = patient_gdf['AGE'].applymap(double)

gdf_apply_time = t.interval
print('Time: {:.5f}s'.format(gdf_apply_time))

In [None]:
print(patient_gdf[['AGE', 'DOUBLED_AGE']])

In [None]:
with Timer() as t:
    patient_pdf['DOUBLED_AGE'] = patient_pdf['AGE'].apply(double)

pdf_apply_time = t.interval
print('Time: {:.5f}s'.format(pdf_apply_time))

In [None]:
print("Apply function on GPU was {:.2f}x faster than on CPU.".format(pdf_apply_time / gdf_apply_time))

We don't want to keep those new columns around, so let's drop them.

In [None]:
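# Assign the result back, since `drop` returns a new DataFrame
patient_pdf = patient_pdf.drop(['DOUBLED_AGE'], axis=1)

patient_gdf = patient_gdf.drop('DOUBLED_AGE')
patient_gdf = patient_gdf.drop('AGE_NEXT_YEAR')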

## Applying User Defined Functions to Rows

We can also apply user defined functions to rows using `cudf.DataFrame.apply_rows`, which is similar to `pandas.DataFrame.apply(axis=1)`. When using `apply_rows` we:

1. Define a function with function parameters for a given row's input columns and output columns, as well as any keyword arguments we might wish to pass in
2. Utilize input column arguments in modifying or creating output column arguments
3. Invoke `apply_rows` by providing the user-defined function, a list of input columns, a dict of output columns with their types, and a dict of keyword arguments with their arguments.
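
The calling convention described in the steps above can be mimicked on the CPU with plain Python lists. This is only a sketch of the contract (input columns in, preallocated output columns written in place); `apply_rows_cpu` and `add_is_minor` are hypothetical helpers, not part of cuDF.

```python
def add_is_minor(AGE, IS_MINOR, cutoff):
    """Kernel-style function: read input columns, write the output column in place."""
    for i, age in enumerate(AGE):
        IS_MINOR[i] = 1 if age < cutoff else 0


def apply_rows_cpu(columns, func, incols, outcols, kwargs):
    """CPU stand-in for the calling convention: allocate outputs, then call func."""
    n_rows = len(columns[incols[0]])
    # One zero-filled output column per requested name and type
    outputs = {name: [dtype()] * n_rows for name, dtype in outcols.items()}
    inputs = {name: columns[name] for name in incols}
    func(**inputs, **outputs, **kwargs)
    return {**columns, **outputs}


table = {"AGE": [12, 40, 17]}
result = apply_rows_cpu(table, add_is_minor, incols=["AGE"],
                        outcols={"IS_MINOR": int}, kwargs={"cutoff": 18})
print(result["IS_MINOR"])
```

Note that the function's parameter names must match the input column names, the output column names, and the keyword argument names, because they are passed by name.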

The following example creates a new `IS_SENIOR` column for each row in `patient_gdf`, using existing data from the `AGE` and `INSURANCE` columns.

Since these user-defined functions don't yet support strings, we'll have to convert the `INSURANCE` column from strings to numeric IDs. This is called label encoding, and we'll look into it in more detail a little later. For now, you just need to know that label encoding maps each distinct insurance string to an integer. 

In [None]:
print(patient_gdf['INSURANCE'].head())

In [None]:
le = LabelEncoder()
patient_gdf['INSURANCE'] = le.fit_transform(patient_gdf['INSURANCE'])
print(patient_gdf['INSURANCE'].head())

In [None]:
def add_is_senior(AGE, INSURANCE, IS_SENIOR, kwarg1):
    for i, (age, insurance) in enumerate(zip(AGE, INSURANCE)):
        if (age >= 65) or (insurance == 1): # Insurance 1 == MEDICARE
            IS_SENIOR[i] = 1
        else:
            IS_SENIOR[i] = 0

In [None]:
add_is_senior_gdf = patient_gdf.apply_rows(add_is_senior, 
                                           incols=['AGE', 'INSURANCE'], 
                                           outcols=dict(IS_SENIOR=np.int64), 
                                           kwargs=dict(kwarg1=1))

In [None]:
print(add_is_senior_gdf[['AGE', 'INSURANCE', 'IS_SENIOR']])

### Exercise: Implement an AGE_BUCKET Column

Assume we know, based on our subject matter expertise, that there are certain natural age clusters and that we would like to use that knowledge by categorizing our patients into one of 6 age buckets, and create a new column, `AGE_BUCKET` to store which of the 6 buckets they are in. In order to do this we will apply the `age_bucket_gpu` function, provided below, to each row in `patient_gdf`:

In [None]:
def age_bucket_gpu(AGE, AGE_BUCKET, kwarg1):
    for i, age in enumerate(AGE):
        age_bucket = 0
        if (age<18):
            age_bucket = 0
        elif (age<30):
            age_bucket = 1
        elif (age<40):
            age_bucket = 2
        elif (age<50):
            age_bucket = 3
        elif (age<60):
            age_bucket = 4
        elif (age>=60):  # include age 60, which otherwise falls back to bucket 0
            age_bucket = 5

        AGE_BUCKET[i] = age_bucket
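
For checking the bucket boundaries on the CPU, the same mapping can be expressed as a lookup into a sorted list of lower edges. This sketch assumes that ages of 60 and above belong in the last bucket; `AGE_BOUNDARIES` and `age_bucket` are illustrative names.

```python
from bisect import bisect_right

AGE_BOUNDARIES = [18, 30, 40, 50, 60]  # lower edges of buckets 1..5


def age_bucket(age):
    """Return the bucket index 0..5 for an age."""
    # bisect_right counts how many lower edges are <= age
    return bisect_right(AGE_BOUNDARIES, age)


print([age_bucket(a) for a in (5, 18, 29, 45, 60, 90)])
```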

Use `apply_rows` below along with `age_bucket_gpu` to create this new column. The new column `AGE_BUCKET` should have the dtype `np.int64`.

In [None]:
# Convert age into buckets
with Timer() as t:
    patient_gdf = patient_gdf.apply_rows() # TODO: Pass the correct arguments into `apply_rows`

# We won't be performing this operation on `patient_pdf` until further into the notebook,
# but we will store the GPU time here to use for comparison later.
gdf_bucket_time = t.interval
print('Time: {:.5f}s'.format(gdf_bucket_time))

### Compare to CPU Performance

Once you've successfully completed the exercise run the following cells to execute the same operations on the CPU so that we can compare performance.

In [None]:
def age_bucket_cpu(row):
    age = row.AGE
    age_bucket = 0
    if (age<18):
        age_bucket = 0
    elif (age<30):
        age_bucket = 1
    elif (age<40):
        age_bucket = 2
    elif (age<50):
        age_bucket = 3
    elif (age<60):
        age_bucket = 4
    elif (age>=60):  # include age 60, which otherwise falls back to bucket 0
        age_bucket = 5

    return age_bucket

In [None]:
with Timer() as t:
    patient_pdf['AGE_BUCKET'] = patient_pdf.apply(age_bucket_cpu, axis=1)

pdf_bucket_time = t.interval
print('Time: {:.1f}s'.format(pdf_bucket_time))

In [None]:
print("Bucketing on GPU was {:.2f}x faster than on CPU.".format(pdf_bucket_time / gdf_bucket_time))

### Drop AGE Column

Now that you have successfully created the `AGE_BUCKET` column for each row in `patient_gdf`, execute the cell below to drop the `AGE` column which is no longer needed.

In [None]:
patient_gdf = patient_gdf.drop('AGE')

## Label and One Hot Encoding

We'll now look more closely at Label Encoding and One Hot Encoding. 

We'll use the LabelEncoder we defined at the very beginning, to encode string columns to integers. 

We'll then one hot encode this column using the `cudf.DataFrame.one_hot_encoding` api, which expects the following arguments:

- A source column
- A prefix for dummy column names
- A sequence of integer encoded category values
- The dtype for value outputs, which defaults to float64
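
To illustrate what these arguments produce, here is a plain-Python sketch of one hot encoding that runs without a GPU. The `one_hot_encode` helper is hypothetical, and the `<prefix>_<category>` column naming is an assumption about the output layout rather than a statement of the cuDF implementation.

```python
def one_hot_encode(codes, prefix, cats, dtype=float):
    """Return one dummy column per category value, named '<prefix>_<cat>'."""
    return {
        "{}_{}".format(prefix, cat): [dtype(code == cat) for code in codes]
        for cat in cats
    }


dummies = one_hot_encode([0, 2, 1, 2], prefix="INSURANCE", cats=[0, 1, 2], dtype=int)
for name, column in dummies.items():
    print(name, column)
```

Each row has exactly one `1` across the dummy columns, in the column matching its integer code.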

Here we use `cudf.DataFrame.one_hot_encoding` to one hot encode a DataFrame containing the names of 6 different scientists:

In [None]:
# Create a simple dataframe
scientists_dict = {
    1: 'Kepler',
    2: 'Maxwell',
    3: 'Pascal',
    4: 'Volta',
    5: 'Turing',
    6: 'Curie'
}

scientists_gdf = cudf.DataFrame({
    'NAMES': list(scientists_dict.values())
})
print(scientists_gdf)

In [None]:
# Label Encode the NAMES column 
le = LabelEncoder()
scientists_gdf['NAMES'] = le.fit_transform(scientists_gdf['NAMES'])
print(scientists_gdf)

In [None]:
# Display the unique encoded values 
scientists = scientists_gdf.NAMES.unique()
print(scientists)

In [None]:
# Finally, one hot encode this numeric column
scientists_gdf = scientists_gdf.one_hot_encoding('NAMES', 'NAMES', scientists, dtype='int32')
print(scientists_gdf)

### Exercise: Complete One Hot Encoding Function

The following function, `one_hot_encode_and_drop`, will be used extensively below to one hot encode our categorical data columns and then drop the unencoded columns. Complete the call to `gdf.one_hot_encoding` so that it works as expected. Look at how `one_hot_encode_and_drop` is used in `one_hot_encode_cat_columns` below to support your work.

#### Complete the below to proceed further

In [None]:
def one_hot_encode_and_drop(gdf, column_name, label_encode=False):   
    if label_encode is True:
        le = LabelEncoder()
        gdf[column_name] = le.fit_transform(gdf[column_name])

    cats = gdf[column_name].unique()
    
    gdf = gdf.one_hot_encoding() # TODO: Pass the correct arguments into `gdf.one_hot_encoding`    
    
    return gdf

## One Hot Encode Categorical Columns

In the cell below we define `one_hot_encode_cat_columns`, which uses both the `LabelEncoder` and the `one_hot_encode_and_drop` function you just completed to one hot encode all the categorical columns in `patient_gdf`.

In [None]:
# You do not need to modify this function. However, you need to complete the definition for
# `one_hot_encode_and_drop` above before it will work properly.

def one_hot_encode_cat_columns(gdf):
    for col in ['GENDER', 'VISIT_TYPE', 'DEPT_SPECIALTY', 'APPT_WEEKDAY']:
        gdf = one_hot_encode_and_drop(gdf, col, label_encode=True)

    # The following columns are already encoded as integers and don't need to be label encoded
    for col in ['AGE_BUCKET', 'INSURANCE', 'DEPT_ID']:
        gdf = one_hot_encode_and_drop(gdf, col, label_encode=False)

    return gdf

And finally before training we will use `prep_for_training` to do some final cleanup, as well as to split the data into test and training sets. 

We'll use the cuml_train_test_split function that we defined at the very beginning. Note that we need to convert the `cudf` object to `DMatrix` before we pass it to `XGBoost` for training.

In [None]:
def prep_for_training(patient_gdf):
    
    final_features = patient_gdf.columns.tolist()
    if 'LABEL' in final_features:
        final_features.remove('LABEL')
    if 'DAY' in final_features:
        final_features.remove('DAY')
    if 'MONTH' in final_features:
        final_features.remove('MONTH')

    X_train, X_test, y_train, y_test = cuml_train_test_split(patient_gdf[final_features], patient_gdf[['LABEL']])
    
    # Convert to DMatrices
    dtrain = xgboost.DMatrix(X_train, y_train)
    dtest = xgboost.DMatrix(X_test, y_test)
    del X_train, X_test, y_train

    return  dtrain, dtest, y_test.to_pandas()
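
The idea behind the shuffled train/test split used here can be sketched in plain Python. This is an illustration of the technique under our own assumptions (a single seeded permutation of row positions), not the code that runs on the GPU; `shuffled_split` is a hypothetical helper.

```python
import random


def shuffled_split(n_rows, test_size=0.20, seed=42):
    """Return (train_positions, test_positions) from one seeded permutation."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_size)
    n_train = n_rows - n_test
    return positions[:n_train], positions[n_train:]


train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))
```

Because every row position appears exactly once in the permutation, no row can land in both the training and the test set, and the fixed seed makes the split reproducible.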

## Back On the CPU

Before moving on to train our model, we provide here a couple of cells which will carry out, on the CPU, the operations just completed on the GPU so that we can compare performance below. For convenience, we have combined the one hot encoding and test/train splitting into one function for the CPU.

In [None]:
def one_hot_encode_cat_columns_cpu(pdf):
    # One Hot Encode categorical values
    pdf = pd.concat([pdf, pd.get_dummies(pdf.AGE_BUCKET, prefix="AGE_BUCKET")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.GENDER, prefix="GENDER")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.INSURANCE, prefix="INSURANCE")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.VISIT_TYPE, prefix="VISIT_TYPE")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.DEPT_SPECIALTY, prefix="DEPT_SPECIALTY")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.DEPT_ID, prefix="DEPT")], axis=1)
    pdf = pd.concat([pdf, pd.get_dummies(pdf.APPT_WEEKDAY, prefix="APPT_WEEKDAY")], axis=1)

    # Drop labels after One Hot Encoding
    pdf = pdf.drop(['AGE', 'GENDER', 'INSURANCE', 'VISIT_TYPE', 'DEPT_SPECIALTY', 'DEPT_ID', 'APPT_WEEKDAY'], axis=1)
    
    return pdf

In [None]:
def prep_for_training_cpu(pdf):
    # Create final features
    final_features = pdf.columns.tolist()
    if 'LABEL' in final_features:
        final_features.remove('LABEL')
    if 'DAY' in final_features:
        final_features.remove('DAY')
    if 'MONTH' in final_features:
        final_features.remove('MONTH')

    # Separate features and labels
    x_df = pdf[final_features]
    y_df = pdf['LABEL']

    # Split train and test data
    X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.20, random_state=42)

    # Convert train and test into xgboost format (DMatrix)
    dtrain = xgboost.DMatrix(data=X_train, label=y_train)
    dtest = xgboost.DMatrix(data=X_test, label=y_test)

    del pdf, x_df, y_df
    del X_train, X_test, y_train

    return dtrain, dtest, y_test

### Compare Data Prep Performance

First we will perform our data prep steps on the GPU:

In [None]:
with Timer() as t:
    merged_gdf = patient_gdf.merge(zipcode_gdf, how="left", on='ZIPCODE')
gdf_merge_time = t.interval
print('Merge Time: {:.1f}s'.format(gdf_merge_time))

with Timer() as t:
    encoded_gdf = one_hot_encode_cat_columns(merged_gdf)
gdf_one_hot_encode_time = t.interval
print('One Hot Encode Time: {:.1f}s'.format(gdf_one_hot_encode_time))

with Timer() as t:
    dtrain_gpu, dtest_gpu, y_test_gpu = prep_for_training(encoded_gdf)
gdf_dmatrix_time = t.interval
print('DMatrix Generation Time: {:.1f}s'.format(gdf_dmatrix_time))

And now on the CPU:

In [None]:
with Timer() as t:
    merged_pdf = pd.merge(patient_pdf, zipcode_pdf, how='left', on='ZIPCODE')
pdf_merge_time = t.interval
print('Merge Time: {:.1f}s'.format(pdf_merge_time))

with Timer() as t:
    encoded_pdf = one_hot_encode_cat_columns_cpu(merged_pdf)
pdf_one_hot_encode_time = t.interval
print('One Hot Encode Time: {:.1f}s'.format(pdf_one_hot_encode_time))

with Timer() as t:    
    dtrain_cpu, dtest_cpu, y_test_cpu = prep_for_training_cpu(encoded_pdf)
pdf_dmatrix_time = t.interval
print('DMatrix Generation Time: {:.1f}s'.format(pdf_dmatrix_time))

In [None]:
print("Merge time on GPU was {:.2f}x faster than on CPU.".format(pdf_merge_time / gdf_merge_time))
print("One Hot Encode time on GPU was {:.2f}x faster than on CPU.".format(pdf_one_hot_encode_time / gdf_one_hot_encode_time))

## Model Training

### XGBoost on the GPU

Using [XGBoost](https://xgboost.readthedocs.io/) to train models on the GPU is very easy and very similar to how you would use it on a CPU. 

Training with XGBoost on the GPU requires only two changes to the parameters: setting `tree_method` to `gpu_hist`, and providing the `n_gpus` parameter, which indicates how many GPUs we would like to use in training. Below we set `n_gpus` to `1`; a value of `-1` indicates that all available GPUs should be used. After training on both devices, compare the AUC scores and training times: you should see near-identical AUC with a speedup on the order of 10x, though the exact figure depends on your hardware.
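
One way to keep the CPU and GPU configurations in sync is to derive the GPU parameters from the CPU ones. The sketch below only manipulates dictionaries, so it runs without XGBoost installed; `to_gpu_params` is a hypothetical helper, and the `gpu_hist` and `n_gpus` names are those used in this lab, which may differ in other XGBoost releases.

```python
def to_gpu_params(cpu_params, n_gpus=1):
    """Copy the CPU parameters and override only the GPU-specific entries."""
    gpu_params = dict(cpu_params)
    gpu_params["tree_method"] = "gpu_hist"
    gpu_params["n_gpus"] = n_gpus
    return gpu_params


cpu_example = {"objective": "binary:logistic", "max_depth": 3, "tree_method": "hist"}
gpu_example = to_gpu_params(cpu_example)
print(gpu_example)
```

Deriving one dictionary from the other makes it harder for the two runs to drift apart in unrelated settings, which keeps the timing comparison fair.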

In [None]:
def train_gpu(dtrain):
    gpu_params = {
        'objective': 'binary:logistic',
        'n_gpus': 1,
        'booster':'gbtree',
        'nround': 15,
        'max_depth': 3,
        'alpha': 0.9,
        'eta': 0.1,
        'gamma': 0.1,
        'learning_rate': 0.5,
        'subsample': 1,
        'reg_lambda': 1,
        'scale_pos_weight': 2,
        'min_child_weight': 30,
        'tree_method': 'gpu_hist',
        'loss': 'ls',
        'max_features': 'auto',
        'criterion': 'friedman_mse',
        'grow_policy': 'lossguide',
    }
    
    return xgboost.train(gpu_params, dtrain=dtrain)

In [None]:
with Timer() as t:
    clf_gpu = train_gpu(dtrain_gpu)

gdf_train_time = t.interval
print('Time: {:.1f}s'.format(gdf_train_time))

In [None]:
y_pred_gpu = clf_gpu.predict(dtest_gpu)
auc_gpu = roc_auc_score(y_test_gpu, y_pred_gpu)
print("AUC: {:.3f}".format(auc_gpu))
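
For intuition about the AUC score printed above, it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score. The following plain-Python sketch computes it that way on a tiny made-up example; it is quadratic in the number of samples and meant for illustration only, not as a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

A score of `0.5` corresponds to random ranking and `1.0` to a perfect ranking of no-shows above other patients.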

### Training on the CPU for Comparison

In [None]:
def train_cpu(dtrain):
    cpu_params = {
        'objective': 'binary:logistic',
        'booster':'gbtree',
        'nround': 10,
        'max_depth': 3,
        'alpha': 0.9,
        'eta': 0.1,
        'gamma': 0.1,
        'learning_rate': 0.5,
        'subsample': 1,
        'reg_lambda': 1,
        'scale_pos_weight': 2,
        'min_child_weight': 30,
        'tree_method': 'hist',
        'loss': 'ls',
        'max_features': 'auto',
        'criterion': 'friedman_mse',
        'grow_policy': 'lossguide',
    }
    
    return xgboost.train(cpu_params, dtrain=dtrain)

In [None]:
with Timer() as t:
    clf_cpu = train_cpu(dtrain_cpu)

pdf_train_time = t.interval
print('Time: {:.1f}s'.format(pdf_train_time))

In [None]:
y_pred_cpu = clf_cpu.predict(dtest_cpu)
auc_cpu = roc_auc_score(y_test_cpu, y_pred_cpu)
print("AUC: {:.3f}".format(auc_cpu))

In [None]:
print("Train on GPU was {:.2f}x faster than on CPU.".format(pdf_train_time / gdf_train_time))

## Final Exercise

_Before beginning the final exercise, which will require you to make changes throughout the notebook, you may wish to download this notebook by choosing **File -> Download as -> Notebook** in the menu above._

As a final exercise you will implement a new feature in the hopes of improving the accuracy of your predictions. Use the `holiday_week_gpu` function defined below, and make any other changes in the notebook that are necessary, to create and populate a `HOLIDAY_WEEK` column which will indicate whether or not a given appointment falls on a week containing a holiday.

After you have implemented the feature, rerun the data prep, training (`train_gpu`), and evaluation cells to see if you've made any improvements.

In [None]:
def holiday_week_gpu(DAY, MONTH, HOLIDAY_WEEK, kwarg1):
    for i, (day, month) in enumerate(zip(DAY, MONTH)):
        holiday_week = 0
        if ((month == 5 and day > 24)
                or (month == 7 and day < 8)
                or (month == 9 and day < 8)
                or (month == 12 and day > 21)
                or (month == 1 and day < 3)):
            holiday_week = 1
        HOLIDAY_WEEK[i] = holiday_week

#### Complete the below

In [None]:
# Your work here

## Summary

Now that you have completed this lab, you should be able to:

- Read data directly onto the GPU
- Manipulate data and extract features on the GPU
- Use GPU-enabled XGBoost to train a machine learning model