# GPU-Based XGBoost Training
## In this notebook we will use Snowpark Container Services (SPCS) to run a notebook within Snowflake on a set of GPUs

### *Workflow*
- Inspect GPU resources available - for this exercise we will use four NVIDIA A10G GPUs
- Load in data from Snowflake table
- Set up data for modeling
- Train two XGBoost models - one trained with CPUs and one leveraging our GPU cluster
- Compare runtimes and results of our models


### *Key Takeaways*
- SPCS allows users to run notebook workloads that execute on containers, rather than virtual warehouses in Snowflake
- GPUs can greatly speed up model training jobs 🔥
- Bringing in third-party Python libraries offers the flexibility to leverage great contributions from the OSS ecosystem


### Note - To run `!pip install` commands successfully, make sure you have enabled the external access integration with PyPI
- Do so by clicking the drop-down on the 🟢 Active kernel settings button, clicking Edit Compute Settings, then turning on the PYPI_ACCESS_INTEGRATION radio button in the external access tab

In [None]:
!pip install seaborn

In [None]:
# Import python packages
import streamlit as st
import pandas as pd
import sys
import seaborn

# Snowpark ML
from snowflake.ml.modeling.xgboost import XGBRegressor, XGBClassifier
from snowflake.ml._internal.utils import identifier

# Snowpark session
from snowflake.snowpark import DataFrame
from snowflake.snowpark.functions import col
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session

In [None]:
import torch

# Get the list of GPUs
if torch.cuda.is_available():
    # Get the number of GPUs
    num_gpus = torch.cuda.device_count()

    print(f'{num_gpus} GPU Device(s) Found')
    # Print the list of GPUs
    for i in range(num_gpus):
        print("Name:", torch.cuda.get_device_name(i), "  Index:", i)
else:
    print("No GPU available")


In [None]:
#Load in data from Snowflake table into a Snowpark dataframe
table = "XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE"
df = session.table(table)
df.count(), len(df.columns)

In [None]:
#Note the maximum price - a $3B car must be quite a spectacle, but we don't want to use that for our model
df.select('PRICE').describe()

In [None]:
#Let's filter down to cars priced under $100k - note that we only filter out ~1% of our data here
df = df.filter(col('PRICE')<100000)
df.select('PRICE').describe()

In [None]:
#View data schema
list(df.schema)

In [None]:
#Drop some columns that won't be helpful for modeling
drop_cols = ["ID","URL", "REGION_URL", "IMAGE_URL", "DESCRIPTION", "VIN", "POSTING_DATE", 'COUNTY']
df = df.drop(drop_cols)

In [None]:
#Fill NULL values with "NA" for string columns and 0 for numerical columns
string_type = df.select('REGION').schema[0].datatype
#Use `field` as the loop variable so the imported `col` function is not shadowed
string_cols = df.select([field.name for field in df.schema if field.datatype == string_type]).columns
non_string_cols = df.drop(string_cols).columns

df = df.fillna("NA", subset=string_cols)
df = df.fillna(0, subset=non_string_cols)

In [None]:
#Use pandas to find the 1000 most frequent car models and recode every other model value as 'INFREQUENT' to avoid excessive dimensionality
df_pd = df.to_pandas()
top_n_models = df_pd.MODEL.value_counts().keys()[0:1000]
df_pd['MODEL'] = df_pd.MODEL.apply(lambda x: x if x in top_n_models else 'INFREQUENT')
df = session.create_dataframe(df_pd)
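
The bucketing step in the cell above can be illustrated without Snowflake or pandas. The following is a minimal sketch in plain Python, assuming a small made-up list of model names; `bucket_infrequent` is a hypothetical helper written for this illustration, not part of Snowpark or pandas.

```python
from collections import Counter

def bucket_infrequent(values, top_n, placeholder="INFREQUENT"):
    """Keep the top_n most frequent values and replace the rest with a placeholder."""
    # most_common orders values by descending count, similar to pandas value_counts()
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else placeholder for v in values]

# Made-up sample data for illustration only
models = ["f-150", "civic", "f-150", "camry", "civic", "f-150", "model s"]
bucketed = bucket_infrequent(models, top_n=2)
print(bucketed)
```

The retained values are stored in a set so that each membership check stays cheap even when 1,000 categories are kept.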

In [None]:
#Union the data with itself twice (each pass doubles the row count) to go from roughly 400k rows to roughly 1.7M rows. This lab's purpose is to test performance, so we want a reasonably large dataset!
for i in range(1,3):
    df = df.unionAll(df)

df.count()
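
Because each pass unions the DataFrame with itself, the row count doubles every time the loop body runs. The sketch below shows that arithmetic in plain Python, assuming a round starting count of 400,000 rows (the real table may differ slightly); `rows_after_self_unions` is a hypothetical helper used only for this illustration.

```python
def rows_after_self_unions(initial_rows, passes):
    """Each self-union doubles the row count, so the result is initial_rows * 2**passes."""
    rows = initial_rows
    for _ in range(passes):
        rows += rows  # a self-union appends a full copy of the current rows
    return rows

# The loop in the cell above runs its body twice
passes = 2
total_rows = rows_after_self_unions(400_000, passes)
print(total_rows)
```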

In [None]:
import snowflake.ml.modeling.preprocessing as snowml

OHE_COLS = string_cols
OHE_POST_COLS = [i+"_OHE" for i in OHE_COLS]


# Encode categoricals to numeric columns
snowml_ohe = snowml.OneHotEncoder(input_cols=OHE_COLS, output_cols = OHE_COLS, drop_input_cols=True)
transformed_df = snowml_ohe.fit(df).transform(df)
transformed_df.columns

In [None]:
#Rename columns to avoid issues with " characters later on

#Create dict replacing bad column names
#Note: the loop variable is named col_name so it does not shadow the imported `col` function
renaming_dict = {}
for n, col_name in enumerate(transformed_df.columns):
    double_quote_spot = col_name.find('"')
    if double_quote_spot==0:
        renaming_dict[col_name] = col_name[double_quote_spot+1:col_name.find("_")]+f"__{n}"
    else:
        renaming_dict[col_name] = col_name


#Create new df with renamed and sorted columns
df_renamed = transformed_df.rename(renaming_dict)
df_renamed = df_renamed.select(sorted(df_renamed.columns))
df_renamed.columns[0:20]
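
To make the renaming rule in the cell above concrete, here is a minimal sketch that applies the same logic to a plain Python list. The column names are hypothetical examples chosen for illustration, and `build_renaming_dict` is a helper defined only for this sketch.

```python
def build_renaming_dict(columns):
    """Mirror the renaming rule: names starting with a double quote become PREFIX__<position>."""
    renaming = {}
    for n, name in enumerate(columns):
        if name.startswith('"'):
            # Keep the text between the opening quote and the first underscore,
            # then append the column position so the new names stay unique
            renaming[name] = name[1:name.find("_")] + f"__{n}"
        else:
            renaming[name] = name
    return renaming

# Hypothetical column names for illustration only
sample_columns = ["PRICE", '"FUEL_gas"', '"FUEL_diesel"', "ODOMETER"]
renamed = build_renaming_dict(sample_columns)
print(renamed)
```

Note that the prefix is cut at the first underscore, so a source column whose own name contains an underscore keeps only its first word; the position suffix is what keeps the resulting names distinct.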

In [None]:
# Split the data into train and test sets (note this may take up to 3-4 minutes)
train, test = df_renamed.random_split(weights=[0.95, 0.05], seed=0)

## Model Training

### Now that our data is set up, we will train a CPU-based and a GPU-based Snowpark Optimized XGBoost model
#### The parameter that instructs our model to leverage GPUs is *tree_method*.
- When *tree_method* is set to *hist*, the model will not attempt to use GPUs

- When *tree_method* is set to *gpu_hist*, the model will leverage any available GPUs it finds

- Snowflake offers the ability to leverage multi-GPU training (i.e. using all four of our available A10G GPUs) for optimized performance
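
The choice between the two settings can be expressed as a small helper. This is a minimal sketch, not part of any library: `xgb_device_params` is a hypothetical function, and it assumes the GPU count comes from something like `torch.cuda.device_count()` as used earlier in this notebook. The parameter values themselves are the ones this notebook passes to `XGBRegressor`.

```python
def xgb_device_params(num_gpus):
    """Return the tree_method/predictor pair used in this notebook for a given GPU count."""
    if num_gpus > 0:
        # GPU-accelerated histogram algorithm
        return {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}
    # CPU histogram algorithm
    return {"tree_method": "hist", "predictor": "cpu_predictor"}

cpu_params = xgb_device_params(0)
gpu_params = xgb_device_params(4)
print(cpu_params)
print(gpu_params)
```

The returned dictionary could be unpacked into a model constructor with `**`, which keeps the CPU and GPU configurations identical apart from these two settings.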

In [None]:
#Train both a CPU and GPU based XGB Regressor - note that we are using n_estimators=1000 to intentionally make this a more compute intensive training job


cpu_snowpark_xgb = XGBRegressor(
    input_cols=train.drop("PRICE").columns,
    label_cols=train.select("PRICE").columns,
    output_cols="PREDICTED_PRICE",
    tree_method="hist",
    predictor= "cpu_predictor",
    n_estimators=1000
)



gpu_snowpark_xgb = XGBRegressor(
    input_cols=train.drop("PRICE").columns,
    label_cols=train.select("PRICE").columns,
    output_cols="PREDICTED_PRICE",
    tree_method="gpu_hist",
    predictor= "gpu_predictor",
    n_estimators=1000
)

In [None]:
#Clear cache to make sure we have as much free memory as possible for modeling

import gc

gc.collect()

torch.cuda.empty_cache()

## While the model is training, you can watch live resource utilization by hovering your mouse over the 🟢 Active button that controls the kernel settings for your notebook.
### Notice both the memory and CPU utilization for the CPU training job, and the GPU utilization for the GPU training job

In [None]:
import time
start_time = time.time()
cpu_snowpark_xgb.fit(train)
end_time = time.time()
print("TRAINING TIME:", end_time - start_time)

In [None]:
import time
start_time = time.time()
gpu_snowpark_xgb.fit(train)
end_time = time.time()
print("TRAINING TIME:", end_time - start_time)

## While results aren't entirely deterministic, you should have seen roughly a 3-4x speedup in model training when moving from CPU to GPU.
### In the logs from the two cells above, find the message *[RayXGBoost] Finished XGBoost training* and look at the end of that line to see the pure training time for each model
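
To compare the two wall-clock timings directly, the measurements can be collected in one place and divided. The sketch below is one possible way to do this, assuming short `time.sleep` calls as stand-ins for the two `fit` calls; `timed` is a hypothetical helper and the resulting numbers are not real training times.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("cpu", timings):
    time.sleep(0.02)  # stand-in for cpu_snowpark_xgb.fit(train)
with timed("gpu", timings):
    time.sleep(0.01)  # stand-in for gpu_snowpark_xgb.fit(train)

speedup = timings["cpu"] / timings["gpu"]
print(f"CPU: {timings['cpu']:.3f}s  GPU: {timings['gpu']:.3f}s  speedup: {speedup:.1f}x")
```

Wall-clock timings measured this way include data transfer and setup overhead, which is why the log line mentioned above is the better source for pure training time.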

In [None]:
#Compute predictions on test set for cpu model
import time
start_time = time.time()
cpu_test_preds = cpu_snowpark_xgb.predict(test)
end_time = time.time()
print("INFERENCE TIME:", end_time - start_time)

In [None]:
#Compute predictions on test set for gpu model

import time

start_time = time.time()
gpu_test_preds = gpu_snowpark_xgb.predict(test)
end_time = time.time()
print("INFERENCE TIME:", end_time - start_time)

## Finally, now that our models have been trained and predictions have been generated, we will carry out a few final steps:
- Compute performance metrics
- Visualize predicted vs. actual prices
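
For reference, the two metrics computed in the next cell can be written out by hand. This is a minimal sketch in plain Python using made-up prices; `r2_and_rmse` is a hypothetical helper that follows the standard definitions of R^2 and RMSE, and it is not the Snowpark ML implementation.

```python
import math

def r2_and_rmse(actual, predicted):
    """Compute R^2 and RMSE for two equal-length sequences of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: squared error of the predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared deviation of the actuals from their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

# Made-up prices for illustration only
actual = [10000.0, 20000.0, 30000.0]
predicted = [12000.0, 18000.0, 30000.0]
r2, rmse = r2_and_rmse(actual, predicted)
print(f"R^2: {r2:.3f}  RMSE: {rmse:.1f}")
```

RMSE is expressed in the same units as the target (dollars here), while R^2 is unitless, so the two are usually read together.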

In [None]:
import numpy as np
from snowflake.ml.modeling.metrics import r2_score, mean_squared_error
print('R^2 Score:', r2_score(df=gpu_test_preds, y_true_col_name= 'PRICE', y_pred_col_name='PREDICTED_PRICE'))
print('RMSE:', np.sqrt(mean_squared_error(df=gpu_test_preds, y_true_col_names= 'PRICE', y_pred_col_names='PREDICTED_PRICE')))

In [None]:
#In our visualization below we can see that outside of 0 (our filled NA value) there is a reasonably tight correlation between predicted and actual prices for cars 
import seaborn as sns

results_df = gpu_test_preds.select(['PRICE', 'PREDICTED_PRICE']).to_pandas()

sns.scatterplot(x=results_df.PRICE, y = results_df.PREDICTED_PRICE)