# Quickstart Guide for XGBoost Training Using Snowflake's ML Container Runtime and OSS XGBoost
## Introduction
This notebook provides a quickstart for training an XGBoost model using Snowflake's ML Container Runtime APIs and OSS XGBoost. We demonstrate:
1. Data ingestion from a Snowflake table.
2. Splitting the data into train and test sets (90% train, 10% test).
3. Training an XGBoost model using open-source (OSS) XGBoost.
4. Scaling out to multiple GPUs using Snowflake Container Runtime APIs for distributed training.
5. Making predictions with the trained models.

### Steps Covered:
- Load data from a Snowflake table.
- Split the data for training and testing.
- Train using both OSS and Snowflake Container Runtime APIs.
- Make predictions with each trained model.

### Step 1: Set Up Snowflake Session
Initialize a Snowflake session to perform operations within the environment.

In [None]:
# Initialize Snowflake session
from snowflake.snowpark.context import get_active_session
session = get_active_session()

### Step 2: Load Data from Snowflake Table
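We load data from the `CR_QUICKSTART.PUBLIC.VEHICLE` table, convert it to a pandas DataFrame, drop the datetime column `C2`, and split the remaining data into training and testing sets with `C6` as the target.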

In [None]:
# Load data from the Snowflake table
table_name = 'CR_QUICKSTART.PUBLIC.VEHICLE'
snowpark_df = session.table(table_name)

# Convert Snowpark DataFrame to Pandas DataFrame using DataConnector
from snowflake.ml.data.data_connector import DataConnector
pandas_df = DataConnector.from_dataframe(snowpark_df).to_pandas()

# Drop the 'C2' column (datetime column) from the dataset
pandas_df = pandas_df.drop(columns=['C2'])

# Split the data into features (X) and target (y). Assume C6 is the target column.
X = pandas_df.drop('C6', axis=1)
y = pandas_df['C6']

# Train-test split: 90% training and 10% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Print the sizes of the splits
print(f"Training set size: {X_train.shape[0]} rows")
print(f"Test set size: {X_test.shape[0]} rows")
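
As a quick sanity check on the printed sizes, the sketch below computes the row counts a 90/10 split should produce. It assumes scikit-learn's usual rounding for a fractional `test_size` (the test set is rounded up and the training set takes the remainder); the helper name `expected_split_sizes` is hypothetical and not part of any library.

```python
import math
from typing import Tuple


def expected_split_sizes(n_rows: int, test_size: float = 0.1) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up with ceil() and the training
    set receives the remaining rows.
    """
    if n_rows < 2:
        raise ValueError("need at least two rows to split")
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


# Illustrative row count only; the real table size is not stated in this notebook.
n_train, n_test = expected_split_sizes(1005, test_size=0.1)
print(f"Expected training rows: {n_train}, expected test rows: {n_test}")
```

In the notebook, the result of `expected_split_sizes(len(X))` can be compared against `(X_train.shape[0], X_test.shape[0])`.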

### Step 3: Train XGBoost Model Using OSS XGBoost
We first train the model using the open-source XGBoost library (OSS).

In [None]:
# Import OSS XGBoost
import xgboost as xgb

# Train an XGBoost regressor using the OSS approach
print('Training a sample XGBoost regressor using OSS solution')
oss_model = xgb.XGBRegressor(n_estimators=100, random_state=42)
oss_model.fit(X_train, y_train)


### Step 4: Make Predictions with the OSS Model
We make predictions using the trained OSS model and display a sample of the results.

In [None]:
# Make predictions on the test set using the OSS model
y_pred_oss = oss_model.predict(X_test)

# Print sample predictions
print(f'Sample predictions: {y_pred_oss[:10]}')
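
Printing a few predictions does not say how accurate the model is. A minimal sketch of scoring regression predictions follows, using only the standard library. The helper name `regression_metrics` is hypothetical, and the toy values are illustrative rather than taken from the `VEHICLE` table.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    # Signed error for each row: prediction minus ground truth
    errors = [float(p) - float(t) for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(float(t) for t in y_true) / n
    ss_tot = sum((float(t) - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when the target is constant
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Toy values for illustration only
demo = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(demo)
```

With the variables from the cells above, the call would be `regression_metrics(y_test, y_pred_oss)`.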

### Step 5: Train Using Snowflake's Container Runtime APIs
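We now use Snowflake's ML Container Runtime APIs to scale training across multiple GPUs. The training APIs allow easy configuration of resources, including GPU and memory allocation; the cell below sets the number of workers, the CPU cores per worker, and GPU usage through `XGBScalingConfig`.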

**Note:** By default, these APIs utilize all available resources, ensuring efficient use of hardware for model training.

In [None]:
# Import necessary classes for Snowflake Container Runtime training
from snowflake.ml.modeling.distributors.xgboost import XGBEstimator, XGBScalingConfig

# Set up the scaling configuration for multi-GPU usage
scaling_config = XGBScalingConfig(
    num_workers=-1,            # Use all available workers
    num_cpu_per_worker=-1,     # Use all available CPU cores per worker
    use_gpu=True               # Enable GPU for training
)

# Define the XGBEstimator for training
estimator = XGBEstimator(
    n_estimators=100,                     # Number of trees
    objective='reg:squarederror',         # Objective function for regression
    scaling_config=scaling_config         # Use GPU and multi-worker scaling
)

# Drop the 'C2' column from the Snowpark DataFrame for training
snowpark_df = snowpark_df.drop(['C2'])

# Define the label column and input columns
label_col = 'C6'
input_cols = [col for col in snowpark_df.columns if col != 'C6']

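# Split the Snowpark DataFrame into train (90%) and test (10%) sets.
# random_split is used instead of sample() followed by subtract(): subtract() is a
# set difference that also drops duplicate rows, and an unseeded sample() can
# return different rows each time the query is re-evaluated.
train_snowpark_df, test_snowpark_df = snowpark_df.random_split([0.9, 0.1], seed=42)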

# Drop the label column from the test data to match the input structure for prediction
test_snowpark_df = test_snowpark_df.drop([label_col])

# Create DataConnectors for the training and testing sets
train_data_connector = DataConnector.from_dataframe(train_snowpark_df)
test_data_connector = DataConnector.from_dataframe(test_snowpark_df)

# Train the model using Snowflake's Container Runtime API
estimator.fit(
    train_data_connector,   # Data for training
    input_cols=input_cols,  # Input features (all columns except 'C6')
    label_col=label_col     # Target column ('C6')
)

### Step 6: Make Predictions with the Snowflake Container Runtime Model
We make predictions using the model trained with Snowflake's Container Runtime and display a sample of the results.

In [None]:
# Make predictions on the test set using the Snowflake Container Runtime model
y_pred_snowflake = estimator.predict(test_data_connector)

# Display the first 10 predictions
print(f'Snowflake Container Runtime predictions: {y_pred_snowflake[:10]}')

## Conclusion
In this notebook, we demonstrated how to:
- Load data from a Snowflake table.
- Split the data into training and testing sets.
- Train an XGBoost model using both the OSS approach and Snowflake Container Runtime APIs.
- Make predictions with the trained models.

Snowflake's Container Runtime APIs provide an efficient way to leverage multi-GPU systems for distributed training, offering easy configuration for scaling resources. You can now adapt this notebook to your own datasets and models!