# Prep F1 Data for ML Model
Builds a preprocessing pipeline that transforms the F1 data in the `fct_results` table in Snowflake into a format suitable for machine learning models.

### Steps
1. Read in the data from the `fct_results` table in Snowflake (built by dbt).
2. Build pipeline to preprocess data for ML use:
   - Normalize numerical variables to range [0, 1].
   - One-hot encode categorical variables.
3. Save the pipeline as an artifact in a Snowflake Stage for use by downstream dbt Python models.
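
To make step 2 concrete, here is a minimal plain-Python sketch of what the two transforms compute. It is illustrative only: the helper names `fit_min_max`, `min_max_scale`, and `one_hot` are hypothetical, the sample values are made up, and returning `0.0` for a constant column is a choice of this sketch. The Snowpark ML classes used later in this notebook perform the equivalent operations on Snowflake tables.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)


def min_max_scale(value, lo, hi, clip=True):
    """Map value from [lo, hi] to [0, 1]; optionally clip out-of-range values."""
    if hi == lo:
        return 0.0  # constant column: avoid division by zero (sketch-only choice)
    scaled = (value - lo) / (hi - lo)
    if clip:
        scaled = min(1.0, max(0.0, scaled))
    return scaled


def one_hot(value, categories, prefix):
    """Return one 0.0/1.0 indicator column per known category."""
    return {f"{prefix}_{cat}": float(value == cat) for cat in categories}


# Made-up grid positions used to "fit" the scaler
grid_positions = [1, 4, 7, 13]
lo, hi = fit_min_max(grid_positions)
print([min_max_scale(g, lo, hi) for g in grid_positions])
print(min_max_scale(25, lo, hi))  # unseen out-of-range value is clipped to 1.0

# Made-up category list; each category becomes its own indicator column
constructors = ["FERRARI", "MCLAREN", "RED_BULL"]
print(one_hot("MCLAREN", constructors, prefix="CONSTRUCTOR_NAME_OHE"))
```

Fitting and transforming are kept separate on purpose: the minimum, maximum, and category list are learned once and then reused on new data, which is why the fitted pipeline is saved as an artifact in step 3.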

Notebooks are better suited than dbt Python models for pipeline development: they make it easy to step through the code, debug issues, and save the pipeline as an artifact in a Snowflake Stage.

### Imports

In [13]:
import os
import warnings
from pprint import pprint

import joblib
import snowflake.ml.modeling.preprocessing as snowml
from dotenv import load_dotenv
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION as SNOWPARK_VERSION

warnings.simplefilter("ignore")

### Constants

In [14]:
# Get environment variables from .env file. This file is not committed to git.
# It should contain the following variables:
#   SNOWFLAKE_ACCOUNT=<account_name>
#   SNOWFLAKE_ML_USER=<username>
#   SNOWFLAKE_ML_PASSWORD=<password>

load_dotenv()  # Load the variables above into the process environment

# Connection
ACCOUNT = os.getenv("SNOWFLAKE_ACCOUNT")  # From .env file
DATABASE = "FORMULA1"
WAREHOUSE = "TRANSFORMING"
ROLE = "TRANSFORMER"
DEV_SCHEMA = "DBT_GREG"  # Replace with dev schema storing feature data
USER = os.getenv("SNOWFLAKE_ML_USER")  # From .env file
PASSWORD = os.getenv("SNOWFLAKE_ML_PASSWORD")  # From .env file


# Snowflake Stage and Pipeline file
ML_SCHEMA = "ML"
ML_STAGE = "F1_STAGE"
PIPELINE_FILE = "f1_preprocess_pipeline.joblib"
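
Because `os.getenv` returns `None` for an unset variable, a missing `.env` entry would otherwise only surface later as a connection error. A minimal sketch of an early check follows; `missing_env_vars` and `REQUIRED_ENV_VARS` are hypothetical names introduced here, not part of the original notebook.

```python
import os

# The three variables the .env file is expected to provide
REQUIRED_ENV_VARS = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_ML_USER",
    "SNOWFLAKE_ML_PASSWORD",
)


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

In a notebook you may prefer to raise an exception instead of printing, so that later cells do not run with incomplete credentials.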

### Get Data from Snowflake

In [15]:
# Create a Snowflake session
session = Session.builder.configs(
    {
        "account": ACCOUNT,
        "database": DATABASE,
        "warehouse": WAREHOUSE,
        "role": ROLE,
        "schema": DEV_SCHEMA,
        "user": USER,
        "password": PASSWORD,
    }
).create()
session.sql_simplifier_enabled = True
snowflake_env = session.sql("select current_user(), current_version()").collect()

pprint("Connected to Snowflake with the following parameters:")
pprint(f"User: {snowflake_env[0][0]}")
pprint(f"Role: {session.get_current_role()}")
pprint(f"Database: {session.get_current_database()}")
pprint(f"Warehouse: {session.get_current_warehouse()}")
pprint(f"Schema: {session.get_current_schema()}")
pprint(f"Snowflake version: {snowflake_env[0][1]}")
pprint(
    f"Snowpark for Python version: {SNOWPARK_VERSION[0]}.{SNOWPARK_VERSION[1]}.{SNOWPARK_VERSION[2]}"
)

'Connected to Snowflake with the following parameters:'
'User: GREG_CLUNIES'
'Role: "TRANSFORMER"'
'Database: "FORMULA1"'
'Warehouse: "TRANSFORMING"'
'Schema: "DBT_GREG"'
'Snowflake version: 8.8.3'
'Snowpark for Python version: 1.12.1'


In [16]:
table_df = session.table("F1_FEATURES")
table_df.show()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"RESULT_ID"  |"RACE_ID"  |"RACE_YEAR"  |"CIRCUIT_ID"  |"CIRCUIT_NAME"         |"CIRCUIT_REF"  |"LOCATION"    |"COUNTRY"  |"LATITUDE"  |"LONGITUDE"  |"ALTITUDE"  |"TOTAL_PIT_STOPS_PER_RACE"  |"RACE_DATE"  |"RACE_TIME"  |"DRIVER_ID"  |"DRIVER"          |"DRIVER_NUMBER"  |"DRIVERS_AGE_YEARS" 

### Feature Engineering & Selection
ML models require features in a specific format to make predictions, and that format is often not human-readable. We can use Snowpark-optimized preprocessing functions to transform the data into the format required by the model.

If you're lucky, you already know which features to use for your model. Often you don't, and you will need techniques such as recursive feature elimination or feature importance analysis. A list of Snowpark-optimized functions can be found [here](https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/modeling#snowflake-ml-modeling-feature-selection).

For this demo, we will only show some simple feature engineering. We will explicitly select the features we want to use.
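
When features are selected explicitly, a misspelled or renamed column only shows up as a query error. The sketch below checks the requested names against the available ones first. `missing_columns` is a hypothetical helper and the column lists are made-up samples; in this notebook the available names could come from `table_df.columns`.

```python
def missing_columns(requested, available):
    """Return requested column names that are absent from the available ones."""
    available_set = set(available)
    return [col for col in requested if col not in available_set]


# Made-up column lists for illustration only
available = ["POSITION_LABEL", "RACE_YEAR", "GRID", "DRIVER"]
requested = ["POSITION_LABEL", "GRID", "DRIVER_CONFIDENCE"]
print(missing_columns(requested, available))
```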

In [17]:
target = ["POSITION_LABEL"]  # 0 = podium, 1 = points, 2 = no points
features = [
    "RACE_YEAR",
    "CIRCUIT_NAME",
    "GRID",
    "CONSTRUCTOR_NAME",
    "DRIVER",
    "DRIVERS_AGE_YEARS",
    "DRIVER_CONFIDENCE",
    "CONSTRUCTOR_RELIABILITY",
    "TOTAL_PIT_STOPS_PER_RACE",
]
features_df = table_df.select(target + features)
features_df.show()

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"POSITION_LABEL"  |"RACE_YEAR"  |"CIRCUIT_NAME"         |"GRID"  |"CONSTRUCTOR_NAME"  |"DRIVER"          |"DRIVERS_AGE_YEARS"  |"DRIVER_CONFIDENCE"  |"CONSTRUCTOR_RELIABILITY"  |"TOTAL_PIT_STOPS_PER_RACE"  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|0                 |2010         |BAHRAIN_GRAND_PRIX     |4       |MCLAREN             |LEWIS_HAMILTON    |25                   |0.911215             |0.855491                   |0                           |
|1                 |2010         |BAHRAIN_GRAND_PRIX     |1       |RED_BULL            |SEBASTIAN_VETTEL  |23                   |0.902326             |0.847025     

In [18]:
# Normalizing the numeric features
snowml_mms = snowml.MinMaxScaler(
    clip=True,
    input_cols=[
        "GRID",
        "DRIVER_CONFIDENCE",
        "CONSTRUCTOR_RELIABILITY",
        "TOTAL_PIT_STOPS_PER_RACE",
    ],
    output_cols=[
        "GRID_NORM",
        "DRIVER_CONFIDENCE_NORM",
        "CONSTRUCTOR_RELIABILITY_NORM",
        "TOTAL_PIT_STOPS_PER_RACE_NORM",
    ],
)
normalized_features_df = snowml_mms.fit(features_df).transform(features_df)
normalized_features_df.show()

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"GRID_NORM"           |"DRIVER_CONFIDENCE_NORM"  |"CONSTRUCTOR_RELIABILITY_NORM"  |"TOTAL_PIT_STOPS_PER_RACE_NORM"  |"POSITION_LABEL"  |"RACE_YEAR"  |"CIRCUIT_NAME"         |"GRID"  |"CONSTRUCTOR_NAME"  |"DRIVER"          |"DRIVERS_AGE_YEARS"  |"DRIVER_CONFIDENCE"  |"CONSTRUCTOR_RELIABILITY"  |"TOTAL_PIT_STOPS_PER_RACE"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|0.16666666666666666  

In [19]:
# One hot encoding the categorical features
snowml_ohe = snowml.OneHotEncoder(
    input_cols=["CIRCUIT_NAME", "CONSTRUCTOR_NAME", "DRIVER"],
    output_cols=["CIRCUIT_NAME_OHE", "CONSTRUCTOR_NAME_OHE", "DRIVER_OHE"],
)
ohe_features_df = snowml_ohe.fit(normalized_features_df).transform(
    normalized_features_df
)

# View one hot encoded features
ohe_features_df[
    ohe_features_df["CIRCUIT_NAME_OHE_70TH_ANNIVERSARY_GRAND_PRIX"] == 1.0
].show()


### Build Preprocessing Pipeline

Let's package the feature engineering into a pipeline and save it to a Snowflake stage so it can be called later.

In [20]:
NUMERIC_COLUMNS = [
    "GRID",
    "DRIVER_CONFIDENCE",
    "CONSTRUCTOR_RELIABILITY",
    "TOTAL_PIT_STOPS_PER_RACE",
]
NUMERIC_COLUMNS_NORM = [
    "GRID_NORM",
    "DRIVER_CONFIDENCE_NORM",
    "CONSTRUCTOR_RELIABILITY_NORM",
    "TOTAL_PIT_STOPS_PER_RACE_NORM",
]
CATEGORICAL_COLUMNS = [
    "CIRCUIT_NAME",
    "CONSTRUCTOR_NAME",
    "DRIVER",
]
CATEGORICAL_COLUMNS_OHE = [
    "CIRCUIT_NAME_OHE",
    "CONSTRUCTOR_NAME_OHE",
    "DRIVER_OHE",
]

In [21]:
preprocess_pipeline = Pipeline(
    steps=[
        (
            "MMS",
            snowml.MinMaxScaler(
                clip=True,
                input_cols=NUMERIC_COLUMNS,
                output_cols=NUMERIC_COLUMNS_NORM,
            ),
        ),
        (
            "OHE",
            snowml.OneHotEncoder(
                input_cols=CATEGORICAL_COLUMNS,
                output_cols=CATEGORICAL_COLUMNS_OHE,
            ),
        ),
    ]
)

In [22]:
transformed_features_df = preprocess_pipeline.fit(features_df).transform(features_df)
transformed_features_df.show()


In [23]:
joblib.dump(preprocess_pipeline, PIPELINE_FILE)

['f1_preprocess_pipeline.joblib']

### Save Preprocessing Pipeline to Snowflake Stage

In [24]:
# Create a stage in Snowflake and upload the preprocessing pipeline file.
# Note: `create or replace stage` recreates the stage, removing any files already in it.
session.sql(f"create schema if not exists {ML_SCHEMA}").collect()
session.sql(f"create or replace stage {ML_SCHEMA}.{ML_STAGE}").collect()
session.file.put(PIPELINE_FILE, f"@{ML_SCHEMA}.{ML_STAGE}", overwrite=True)

[PutResult(source='f1_preprocess_pipeline.joblib', target='f1_preprocess_pipeline.joblib.gz', source_size=28310, target_size=5264, source_compression='NONE', target_compression='GZIP', status='UPLOADED', message='')]
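
The `PutResult` above shows that the upload was gzip-compressed and stored as `f1_preprocess_pipeline.joblib.gz`. Downstream code therefore has to download that `.gz` file and decompress it before loading. The following is a minimal sketch of one way to do that, not the method used by the downstream dbt models: `staged_file_name` and `load_pipeline_from_stage` are hypothetical helpers, and the sketch assumes the loading environment has `joblib` and the Snowpark ML package installed so the pipeline classes can be unpickled.

```python
import gzip
import os
import tempfile


def staged_file_name(file_name, auto_compressed=True):
    """Name of the file on the stage; the upload above appended `.gz`."""
    return f"{file_name}.gz" if auto_compressed else file_name


def load_pipeline_from_stage(session, stage, file_name):
    """Download a gzip-compressed joblib file from a stage and load it."""
    import joblib  # third-party; imported here so the helper above stays stdlib-only

    staged_name = staged_file_name(file_name)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Download the staged file into a temporary local directory
        session.file.get(f"@{stage}/{staged_name}", tmp_dir)
        # Decompress on the fly and let joblib read from the open file object
        with gzip.open(os.path.join(tmp_dir, staged_name), "rb") as f:
            return joblib.load(f)


# Usage, assuming the `session` and constants defined earlier in this notebook:
# pipeline = load_pipeline_from_stage(session, f"{ML_SCHEMA}.{ML_STAGE}", PIPELINE_FILE)
# pipeline.transform(features_df).show()
```

Loading the fitted pipeline rather than refitting it keeps the learned scaling ranges and category lists identical between training and prediction.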