### Set up Snowpark Session

See [Configure Connections](https://docs.snowflake.com/developer-guide/snowflake-cli/connecting/configure-connections#define-connections)
to learn how to define Snowflake connections, including a default one, in a
`config.toml` file.
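
For orientation, a connection definition might look like the sketch below. The connection name `demo`, the user `DEMO_USER`, the `externalbrowser` authenticator, and the helper `write_sample_config` are placeholders chosen for illustration, not values taken from this notebook. The linked page remains the authoritative list of supported keys.

```python
from pathlib import Path

# All values below are placeholders; replace them with your own account details.
SAMPLE_CONFIG = '''\
default_connection_name = "demo"

[connections.demo]
account = "DEMO_ACCOUNT"
user = "DEMO_USER"
authenticator = "externalbrowser"
role = "DEMO_RL"
warehouse = "DEMO_WH"
database = "DEMO_DB"
schema = "DEMO_SCHEMA"
'''


def write_sample_config(path: Path) -> Path:
    """Write the sample config to `path`, refusing to overwrite an existing file."""
    if path.exists():
        raise FileExistsError(f"{path} already exists; not overwriting")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SAMPLE_CONFIG)
    path.chmod(0o600)  # the file may hold credentials, so restrict it to the owner
    return path
```

Calling `write_sample_config(Path.home() / ".snowflake" / "config.toml")` would place the file where the next cell expects it.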

In [1]:
from snowflake.snowpark import Session, Row

# Requires valid ~/.snowflake/config.toml file
session = Session.builder.getOrCreate()
print(session)

<snowflake.snowpark.session.Session: account="DEMO_ACCOUNT", role="DEMO_RL", database="DEMO_DB", schema="DEMO_SCHEMA", warehouse="DEMO_WH">


#### Set up Snowflake resources

In [2]:
# OPTIONAL: Uncomment below to select a database and schema to use
# session.use_database("temp")
# session.use_schema("public")

In [3]:
# Create compute pool if not exists
def create_compute_pool(name: str, instance_family: str, min_nodes: int = 1, max_nodes: int = 10):
    query = f"""
        CREATE COMPUTE POOL IF NOT EXISTS {name}
            MIN_NODES = {min_nodes}
            MAX_NODES = {max_nodes}
            INSTANCE_FAMILY = {instance_family}
    """
    return session.sql(query).collect()

compute_pool = "DEMO_POOL_CPU"
create_compute_pool(compute_pool, "CPU_X64_S", 1, 5)

[Row(status='DEMO_POOL_CPU already exists, statement succeeded.')]
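
Because `create_compute_pool` interpolates its arguments straight into SQL, it can help to build and inspect the statement before running it. The following is a minimal sketch of that idea, not part of the original notebook: `build_compute_pool_sql` is a hypothetical helper, and the identifier check is a deliberately conservative assumption that only accepts unquoted names.

```python
import re

# Conservative pattern for unquoted identifiers: a letter or underscore,
# followed by letters, digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def build_compute_pool_sql(
    name: str, instance_family: str, min_nodes: int = 1, max_nodes: int = 10
) -> str:
    """Return the CREATE COMPUTE POOL statement without executing it."""
    for value in (name, instance_family):
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"not a plain identifier: {value!r}")
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    return (
        f"CREATE COMPUTE POOL IF NOT EXISTS {name} "
        f"MIN_NODES = {min_nodes} MAX_NODES = {max_nodes} "
        f"INSTANCE_FAMILY = {instance_family}"
    )


print(build_compute_pool_sql("DEMO_POOL_CPU", "CPU_X64_S", 1, 5))
```

The returned string can then be passed to `session.sql(...).collect()` exactly as in the cell above.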

### Approach 1: Train with function

In [5]:
# Generate an arbitrary dataset
def generate_dataset_sql(db, schema, table_name, num_rows, num_cols) -> str:
    # Builds a CREATE TABLE ... AS SELECT statement with num_cols - 1 random
    # feature columns and a TARGET_1 column derived from FEATURE_1.
    sql_script = f"CREATE TABLE IF NOT EXISTS {db}.{schema}.{table_name} AS \n"
    sql_script += "SELECT \n"
    for i in range(1, num_cols):
        sql_script += f"uniform(0::FLOAT, 10::FLOAT, random()) AS FEATURE_{i}, \n"
    sql_script += "FEATURE_1 + FEATURE_1 AS TARGET_1 \n"
    sql_script += f"FROM TABLE(generator(rowcount=>({num_rows})));"
    return sql_script

num_rows = 1000 * 1000
num_cols = 100
table_name = "MULTINODE_CPU_TRAIN_DS"
session.sql(generate_dataset_sql(session.get_current_database(), session.get_current_schema(),
                                 table_name, num_rows, num_cols)).collect()
feature_list = [f'FEATURE_{num}' for num in range(1, num_cols)]
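
Note that `range(1, num_cols)` stops before `num_cols`, so `num_cols = 100` yields 99 feature columns (`FEATURE_1` through `FEATURE_99`) plus the target. A quick check in plain Python, where `feature_names` is a hypothetical helper that mirrors the list comprehension above:

```python
def feature_names(num_cols: int) -> list[str]:
    """Mirror the notebook's feature_list construction."""
    return [f"FEATURE_{num}" for num in range(1, num_cols)]


names = feature_names(100)
print(len(names), names[0], names[-1])  # 99 FEATURE_1 FEATURE_99
```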

In [None]:
from snowflake.ml.jobs import remote

@remote(compute_pool, stage_name="payload_stage", target_instances=3)
def xgb(table_name, input_cols, label_col):
    from snowflake.snowpark import Session
    from snowflake.ml.modeling.distributors.xgboost import XGBEstimator, XGBScalingConfig
    from snowflake.ml.data.data_connector import DataConnector

    session = Session.builder.getOrCreate()
    cpu_train_df = session.table(table_name)
    
    params = {
        "tree_method": "hist",
        "objective": "reg:pseudohubererror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 50,
        "max_leaves": 1000,
        "max_bin": 63,
    }
    scaling_config = XGBScalingConfig(use_gpu=False)
    estimator = XGBEstimator(
        n_estimators=100,
        params=params,
        scaling_config=scaling_config,
    )
    data_connector = DataConnector.from_dataframe(cpu_train_df)
    xgb_model = estimator.fit(
        data_connector, input_cols=input_cols, label_col=label_col
    )
    return xgb_model

# Function invocation returns a job handle (snowflake.ml.jobs.MLJob)
job = xgb(table_name, feature_list, "TARGET_1")

In [7]:
print(job.id)
print(job.status)

MLJOB_99440CD7_F620_468E_B52D_B2872C52BAFE
PENDING
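
`job.wait()` in the next cell blocks until the job finishes. If you would rather print progress while waiting, a polling loop along these lines is one option. This is a sketch under assumptions: `poll_status` is a hypothetical helper, it relies only on the `status` attribute shown above, and the set of terminal status names is an assumption that should be checked against the values your installed `snowflake-ml-python` version reports.

```python
import time

# Assumed terminal states; verify these against your snowflake-ml-python version.
TERMINAL_STATUSES = {"DONE", "FAILED", "CANCELLED", "INTERNAL_ERROR"}


def poll_status(job, interval_s: float = 5.0, timeout_s: float = 600.0) -> str:
    """Poll job.status until it reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = str(job.status)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} s")
        print(f"status: {status}")
        time.sleep(interval_s)
```

For example, `poll_status(job, interval_s=10)` would print the current status every ten seconds and return the final one.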


In [8]:
job.wait()
job.show_logs()


'micromamba' is running as a subprocess and can't modify the parent shell.
Thus you must initialize your shell before using activate and deactivate.

To initialize the current bash shell, run:
    $ eval "$(micromamba shell hook --shell bash)"
and then activate or deactivate with:
    $ micromamba activate
To automatically initialize all future (bash) shells, run:
    $ micromamba shell init --shell bash --root-prefix=~/micromamba
If your shell was already initialized, reinitialize your shell with:
    $ micromamba shell reinit --shell bash
Otherwise, this may be an issue. In the meantime you can run commands. See:
    $ micromamba run --help

Supported shells are {bash, zsh, csh, xonsh, cmd.exe, powershell, fish}.
Creating log directories...
 * Starting periodic command scheduler cron
   ...done.
2025-04-24 23:51:15,080 - INFO - Snowflake Connector for Python Version: 3.13.2, Python Version: 3.10.15, Platform: Linux-5.4.181-99.354.amzn2.x86_64-x86_64-with-glibc2.31
2025-04-24 23:51:1

In [9]:
import xgboost

# Retrieve trained model from job execution and use it for prediction
xgb_model = job.result()

# Predict on a sample of the dataset
# Note: This is only a demonstration; in practice you would predict on a held-out dataset
dataset = session.table(table_name).drop("TARGET_1").limit(10).to_pandas()
xgb_model.predict(xgboost.DMatrix(dataset))

configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing.



array([ 7.242013,  9.110336, 18.514343, 14.750366, 18.905142, 11.804218,
       17.774406, 17.400677,  7.676889, 14.249159], dtype=float32)