
# BigQuery ML (BQML) - Feature Engineering Functions

This notebook explores preparing data (preprocessing) for machine learning with BigQuery using functions that are part of BigQuery ML (BQML). 

When using BigQuery ML to train a model, you can prepare data in advance, at the point of input (`query_statement`) of the [`CREATE MODEL` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#create_model_syntax), or within the model for transportability by using the [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).

In each case, [operators](https://cloud.google.com/bigquery/docs/reference/standard-sql/operators), [conditional expressions](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions), [mathematical functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/mathematical_functions), [conversion functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions), [string functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions), [date functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions), [datetime functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions), [time functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/time_functions), and [timestamp functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) are all useful.  Of particular interest for machine learning are the [manual preprocessing functions](https://cloud.google.com/bigquery/docs/manual-preprocessing) covered in depth in this notebook.

**This content will accompany the blog post - TBD**

---

**Feature Engineering**

Feature engineering, or [preprocessing](https://cloud.google.com/bigquery/docs/preprocess-overview), is part of making data ready for machine learning.  BigQuery ML [manual feature preprocessing](https://cloud.google.com/bigquery/docs/manual-preprocessing) functions are available to make this process simple within BigQuery.  

Each `CREATE MODEL ...` statement will do [automatic feature preprocessing](https://cloud.google.com/bigquery/docs/auto-preprocessing) by default.  It is also possible to include manual feature preprocessing in the [`CREATE MODEL` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#create_model_syntax) as a [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform) where it will also become part of the model serving.  Many of these functions even accompany [exported models](https://cloud.google.com/bigquery/docs/exporting-models) and models [directly registered to Vertex AI Model Registry](https://cloud.google.com/bigquery/docs/create_vertex).

```SQL
CREATE MODEL {model name}
    TRANSFORM (
        ML.{function name}() OVER() AS {name},
        ...
    )
    OPTIONS (
        MODEL_TYPE = ...,
        {more options}
    )
    AS
        SELECT ...
        FROM ...
        WHERE ...
```
  
---

**Getting Started With BigQuery ML**
<p align="center" width="100%">
A great place to start exploring what model types are available and the functions to help create an ML workflow with each model type is this site:
    <center>
        <span style="font-size:xx-large;">
        <a href="https://cloud.google.com/bigquery/docs/e2e-journey">
            End-to-end user journey for each model
        </a>
        </span>
    </center>

Another great resource for getting started is the "What is BigQuery ML?" starting page, which includes a model selection guide.
    <center>
        <span style="font-size:xx-large;">
        <a href="https://cloud.google.com/bigquery/docs/bqml-introduction">
        What is BigQuery ML?
        </a>
        </span>
    </center>
</p>

---

**Prerequisites:**

None

**Services Used:**
- BigQuery

**Resources:**
- [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery/docs/bqml-introduction)
- [Overview of BQML methods and workflows](https://cloud.google.com/bigquery/docs/e2e-journey)
- [BigQuery](https://cloud.google.com/bigquery)
    - [Documentation:](https://cloud.google.com/bigquery/docs/query-overview)
    - [API:](https://cloud.google.com/bigquery/docs/reference/libraries-overview)
        - [Clients](https://cloud.google.com/bigquery/docs/reference/libraries)
            - [Python SDK:](https://github.com/googleapis/python-bigquery)
            - [Python Library Reference:](https://cloud.google.com/python/docs/reference/bigquery/latest)

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20(BQML)/BQML%20Feature%20Engineering%20-%20preprocessing%20functions.ipynb) and run the cells in this section.  Otherwise, skip this section.

In [475]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

Updated property [core/project].


In [473]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Environment Setup

inputs:

In [7]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

packages:

In [1]:
from google.cloud import bigquery

import pandas as pd
import numpy as np

clients:

In [10]:
bq = bigquery.Client(project=PROJECT_ID, location = 'US')

---
## Transform BigQuery columns into ML features with SQL

BigQuery ML [feature preprocessing functions](https://cloud.google.com/bigquery/docs/manual-preprocessing) are useful for converting BigQuery columns to ML features.  BigQuery ML includes a set of these for manual preprocessing, and each one is demonstrated individually here.

These functions can be used directly in BigQuery SQL or within the [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform) of the [`CREATE MODEL` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#create_model_syntax).  Using these inside of the `TRANSFORM` clause means they will also be automatically applied during model serving in BigQuery with the [`ML.PREDICT` function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict).  Many preprocessing statements can also accompany [exported models](https://cloud.google.com/bigquery/docs/exporting-models) and models [directly registered to Vertex AI Model Registry](https://cloud.google.com/bigquery/docs/create_vertex).

**NOTE:** Some of the functions require using calculations over all values in the column and make use of an empty `OVER()` clause.  See `ML.QUANTILE_BUCKETIZE`, `ML.MIN_MAX_SCALER`, `ML.STANDARD_SCALER` for examples. 

### General Functions

[General functions](https://cloud.google.com/bigquery/docs/manual-preprocessing#general_functions) for data cleanup of string or numerical expressions.  Currently, this includes the `ML.IMPUTER` function for imputing missing values.

#### ML.IMPUTER

Given a numerical or categorical (string) column, the function replaces `NULL` values with the value specified by the parameter `strategy`.
- `expression` is the numerical or categorical input
- `strategy` is a string value that specifies how to replace `NULL` values:
    - 'mean' uses the mean (only for numerical columns)
    - 'median' uses the median (only for numerical columns)
    - 'most_frequent' uses the mode
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
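
To make the three strategies concrete, here is a minimal pure-Python sketch of the same idea. It is not how BigQuery implements `ML.IMPUTER`; the helper name `impute` and its tie-breaking behavior are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with a statistic computed from the non-null values.

    Hypothetical helper for illustration; not part of any BigQuery client library.
    """
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        # Ties are broken by first appearance here, which is an assumption.
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


nums = [1, 1, 2, 3, 4, 5, None]
print(impute(nums, "mean"))           # the None is filled with 16 / 6
print(impute(nums, "most_frequent"))  # the None is filled with 1
print(impute(["a", "a", "b", None], "most_frequent"))  # the None is filled with 'a'
```

Note that the exact median of the six non-null values is 2.5, while the query result in this section reports 2.0, so BigQuery's median may be computed differently (possibly as an approximate quantile).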

In [12]:
query = f"""
    SELECT
        num_column,
        ML.IMPUTER(num_column, 'mean') OVER() AS num_imputed_mean,
        ML.IMPUTER(num_column, 'median') OVER() AS num_imputed_median,
        ML.IMPUTER(num_column, 'most_frequent') OVER() AS num_imputed_mode,
        string_column,
        ML.IMPUTER(string_column, 'most_frequent') OVER() AS string_imputed_mode,
    FROM
        UNNEST([1, 1, 2, 3, 4, 5, NULL]) AS num_column WITH OFFSET pos1,
        UNNEST(['a', 'a', 'b', 'c', 'd', 'e', NULL]) AS string_column WITH OFFSET pos2
    WHERE pos1 = pos2
    ORDER BY num_column
"""
print(query)


    SELECT
        num_column,
        ML.IMPUTER(num_column, 'mean') OVER() AS num_imputed_mean,
        ML.IMPUTER(num_column, 'median') OVER() AS num_imputed_median,
        ML.IMPUTER(num_column, 'most_frequent') OVER() AS num_imputed_mode,
        string_column,
        ML.IMPUTER(string_column, 'most_frequent') OVER() AS string_imputed_mode,
    FROM
        UNNEST([1, 1, 2, 3, 4, 5, NULL]) AS num_column WITH OFFSET pos1,
        UNNEST(['a', 'a', 'b', 'c', 'd', 'e', NULL]) AS string_column WITH OFFSET pos2
    WHERE pos1 = pos2
    ORDER BY num_column



In [13]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,num_column,num_imputed_mean,num_imputed_median,num_imputed_mode,string_column,string_imputed_mode
0,,2.666667,2.0,1.0,,a
1,1.0,1.0,1.0,1.0,a,a
2,1.0,1.0,1.0,1.0,a,a
3,2.0,2.0,2.0,2.0,b,b
4,3.0,3.0,3.0,3.0,c,c
5,4.0,4.0,4.0,4.0,d,d
6,5.0,5.0,5.0,5.0,e,e


### Categorical Functions

[Categorical functions](https://cloud.google.com/bigquery/docs/manual-preprocessing#categorical_functions) for categorizing data with string expressions.

#### ML.FEATURE_CROSS

Given a STRUCT of categorical features, this function returns a STRUCT of all combinations up to the degree passed in (default = 2).
- `struct_categorical_features` is a STRUCT of string values for categorical features column names to cross
- `degree` (optional) is the highest degree of feature combinations to create
    - in the range of [2, 4] with default = 2
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-feature-cross)
- Is not [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
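
The crossing logic can be sketched locally with `itertools.combinations`. This is an illustration under assumptions, not BigQuery's implementation: the helper name `feature_cross` is hypothetical, and joining names and values with `_` for combination sizes 2 through `degree` is inferred from the pairwise names the query in this section returns.

```python
from itertools import combinations


def feature_cross(row, degree=2):
    """Cross the string features in `row` (a dict) for combination sizes 2..degree.

    Hypothetical helper; the naming scheme is inferred from the notebook output.
    """
    names = list(row)
    crossed = {}
    for size in range(2, degree + 1):
        for combo in combinations(names, size):
            crossed["_".join(combo)] = "_".join(row[name] for name in combo)
    return crossed


row = {"input_column_1": "c", "input_column_2": "C", "input_column_3": "3"}
print(feature_cross(row))
# {'input_column_1_input_column_2': 'c_C',
#  'input_column_1_input_column_3': 'c_3',
#  'input_column_2_input_column_3': 'C_3'}
```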

In [12]:
query = f"""
    SELECT
        input_column_1, input_column_2, input_column_3,
        ML.FEATURE_CROSS(STRUCT(input_column_1, input_column_2, input_column_3)) AS feature_column
    FROM
        UNNEST(['a', 'b', 'c']) as input_column_1 WITH OFFSET pos1,
        UNNEST(['A', 'B', 'C']) AS input_column_2 WITH OFFSET pos2,
        UNNEST(['1', '2', '3']) AS input_column_3 WITH OFFSET pos3
    WHERE
        pos1 = pos2 AND pos2 = pos3
"""
print(query)


    SELECT
        input_column_1, input_column_2, input_column_3,
        ML.FEATURE_CROSS(STRUCT(input_column_1, input_column_2, input_column_3)) AS feature_column
    FROM
        UNNEST(['a', 'b', 'c']) as input_column_1 WITH OFFSET pos1,
        UNNEST(['A', 'B', 'C']) AS input_column_2 WITH OFFSET pos2,
        UNNEST(['1', '2', '3']) AS input_column_3 WITH OFFSET pos3
    WHERE
        pos1 = pos2 AND pos2 = pos3



In [13]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column_1,input_column_2,input_column_3,feature_column
0,a,A,1,"{'input_column_1_input_column_2': 'a_A', 'inpu..."
1,b,B,2,"{'input_column_1_input_column_2': 'b_B', 'inpu..."
2,c,C,3,"{'input_column_1_input_column_2': 'c_C', 'inpu..."


In [14]:
df['feature_column'].iloc[-1]

{'input_column_1_input_column_2': 'c_C',
 'input_column_1_input_column_3': 'c_3',
 'input_column_2_input_column_3': 'C_3'}

#### ML.HASH_BUCKETIZE

Given a column of string values, this function hashes each value into a new column.  If a bucket size greater than 0 is provided, the bucket number is the remainder of the hash divided by the bucket size.
- `string_expression` is the string for the categorical feature to bucketize
- `hash_bucket_size` is an integer for the number of buckets to create
    - if 0, strings are hashed without bucketizing
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-hash-bucketize)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
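
The hash-to-bucket step can be checked locally against the values this section's query prints. The sketch below does not compute the hash itself; it only maps an already computed signed 64-bit hash to a bucket. Treating the hash as an unsigned 64-bit integer before taking the remainder is an assumption inferred from the printed `hash_column` and `feature_column` values, not a documented rule, and `bucket_from_hash` is a hypothetical helper name.

```python
def bucket_from_hash(signed_hash, hash_bucket_size):
    """Map a signed 64-bit hash to a bucket in [0, hash_bucket_size).

    Hypothetical helper; the unsigned reinterpretation is an inferred assumption.
    """
    if hash_bucket_size == 0:
        return signed_hash  # no bucketizing: keep the raw hash
    unsigned = signed_hash % 2**64  # reinterpret as unsigned 64-bit
    return unsigned % hash_bucket_size


# Hash values copied from the hash_column output of the query in this section.
hashes = [
    -5528939962900187677, -6651148003232386794, -7016299626566550744,
    4470636696479570465, -3078673838733201075, -1522288349254903624,
    4940667224093463419, -2585402310428948559, -9189916281559197516,
]
print([bucket_from_hash(h, 3) for h in hashes])
# [0, 1, 1, 2, 1, 0, 2, 1, 1]
```

The printed buckets agree with the `feature_column` values for inputs `a` through `i`, which a plain signed remainder would not reproduce for the negative hashes.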

In [16]:
query = f"""
    SELECT
        input_column,
        ML.HASH_BUCKETIZE(input_column, 0) AS hash_column,
        ML.HASH_BUCKETIZE(input_column, 3) AS feature_column
    FROM
        UNNEST(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']) as input_column
"""
print(query)


    SELECT
        input_column,
        ML.HASH_BUCKETIZE(input_column, 0) AS hash_column,
        ML.HASH_BUCKETIZE(input_column, 3) AS feature_column
    FROM
        UNNEST(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']) as input_column



In [17]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,hash_column,feature_column
0,a,-5528939962900187677,0
1,b,-6651148003232386794,1
2,c,-7016299626566550744,1
3,d,4470636696479570465,2
4,e,-3078673838733201075,1
5,f,-1522288349254903624,0
6,g,4940667224093463419,2
7,h,-2585402310428948559,1
8,i,-9189916281559197516,1


#### ML.LABEL_ENCODER

Given a string column, the function encodes the values as integers in [0, n] that represent categories.  Any `NULL` or removed values are encoded as `0`.
- `string_expression` is the string for the categorical feature to encode
- `top_k` (optional) takes an integer value that specifies the limit on the number of categories to encode based on frequency
    - default is 32,000
    - max is 1 million
- `frequency_threshold` (optional) takes an integer value that specifies the minimum frequency for a category to be encoded. Categories below the threshold are encoded as 0.
    - default = 5
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-label-encoder)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
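
The interaction between `top_k` and `frequency_threshold` is easier to follow in a small local sketch. The helper `label_encode` is hypothetical, and two details are assumptions: categories are numbered from 1 in alphabetical order, and ties in frequency are broken alphabetically when applying `top_k`.

```python
from collections import Counter


def label_encode(values, top_k=32000, frequency_threshold=5):
    """Encode strings as integers; None and filtered-out categories become 0.

    Hypothetical helper sketching the parameter semantics described above.
    """
    counts = Counter(v for v in values if v is not None)
    # Keep only categories that are frequent enough.
    eligible = [c for c, n in counts.items() if n >= frequency_threshold]
    # Of those, keep the top_k most frequent (alphabetical tie-break is an assumption).
    kept = sorted(eligible, key=lambda c: (-counts[c], c))[:top_k]
    # Number the surviving categories from 1 in alphabetical order.
    vocab = {c: i for i, c in enumerate(sorted(kept), start=1)}
    return [vocab.get(v, 0) for v in values]


values = [None, "a", "b", "b", "c", "c", "c", "d", "d", "d", "d", "d"]
print(label_encode(values))        # only "d" reaches the default threshold of 5
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(label_encode(values, 3, 3))  # "c" -> 1 and "d" -> 2, everything else -> 0
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
```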

In [22]:
query = f"""
    SELECT
        input_column,
        ML.LABEL_ENCODER(input_column) OVER() AS labeled_all,
        ML.LABEL_ENCODER(input_column, 3) OVER() AS labeled_top3,
        ML.LABEL_ENCODER(input_column, 3, 3) OVER() AS labeled_top3_min3
    FROM
        UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']) AS input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.LABEL_ENCODER(input_column) OVER() AS labeled_all,
        ML.LABEL_ENCODER(input_column, 3) OVER() AS labeled_top3,
        ML.LABEL_ENCODER(input_column, 3, 3) OVER() AS labeled_top3_min3
    FROM
        UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']) AS input_column
    ORDER BY input_column



In [23]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,labeled_all,labeled_top3,labeled_top3_min3
0,,0,0,0
1,a,0,0,0
2,b,0,0,0
3,b,0,0,0
4,c,0,0,1
5,c,0,0,1
6,c,0,0,1
7,d,1,1,2
8,d,1,1,2
9,d,1,1,2


#### ML.MULTI_HOT_ENCODER

Given a column with arrays of strings, the function multi-hot encodes the values as integers in [0, n] that represent categories. This generates a separate feature for each unique element in the arrays.  Any `NULL` or removed values are encoded as `0`.
- `array_expression` is an ARRAY of strings to multi-hot encode
- `top_k` (optional) takes an integer value that specifies the limit on the number of categories to encode based on frequency
    - default is 32,000
    - max is 1 million
- `frequency_threshold` (optional) takes an integer value that specifies the minimum frequency for a category to be encoded
    - default = 5
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-multi-hot-encoder)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).

In [57]:
query = f"""
    SELECT
        input_column,
        ML.MULTI_HOT_ENCODER(input_column) OVER() AS labeled_all,
        ML.MULTI_HOT_ENCODER(input_column, 1, 2) OVER() AS labeled_top1_min2,
        ML.MULTI_HOT_ENCODER(input_column, 3, 1) OVER() AS labeled_top3_min1
    FROM
        (
            SELECT ['a', 'b', 'd', 'd', 'd', 'd', 'd'] as input_column
            UNION ALL
            SELECT ['a', 'c', 'd', 'd', 'd', 'd', 'd'] as input_column
        )
    ORDER BY input_column[OFFSET(0)]
"""
print(query)


    SELECT
        input_column,
        ML.MULTI_HOT_ENCODER(input_column) OVER() AS labeled_all,
        ML.MULTI_HOT_ENCODER(input_column, 1, 2) OVER() AS labeled_top1_min2,
        ML.MULTI_HOT_ENCODER(input_column, 3, 1) OVER() AS labeled_top3_min1
    FROM
        (
            SELECT ['a', 'b', 'd', 'd', 'd', 'd', 'd'] as input_column
            UNION ALL
            SELECT ['a', 'c', 'd', 'd', 'd', 'd', 'd'] as input_column
        )
    ORDER BY input_column[OFFSET(0)]



In [58]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,labeled_all,labeled_top1_min2,labeled_top3_min1
0,"[a, c, d, d, d, d, d]","[{'index': 0, 'value': 1.0}]","[{'index': 1, 'value': 1.0}, {'index': 0, 'val...","[{'index': 1, 'value': 1.0}, {'index': 0, 'val..."
1,"[a, b, d, d, d, d, d]","[{'index': 0, 'value': 1.0}]","[{'index': 1, 'value': 1.0}, {'index': 0, 'val...","[{'index': 1, 'value': 1.0}, {'index': 2, 'val..."


In [59]:
df['labeled_top1_min2'].iloc[0]

array([{'index': 1, 'value': 1.0}, {'index': 0, 'value': 1.0}],
      dtype=object)

In [60]:
df['labeled_top1_min2'].iloc[1]

array([{'index': 1, 'value': 1.0}, {'index': 0, 'value': 1.0}],
      dtype=object)

In [61]:
df['labeled_top3_min1'].iloc[0]

array([{'index': 1, 'value': 1.0}, {'index': 0, 'value': 1.0},
       {'index': 3, 'value': 1.0}], dtype=object)

In [62]:
df['labeled_top3_min1'].iloc[-1]

array([{'index': 1, 'value': 1.0}, {'index': 2, 'value': 1.0},
       {'index': 3, 'value': 1.0}], dtype=object)

#### ML.NGRAMS

Given an array of strings, this function returns an array of merged input strings for the n-gram sizes provided.
- `array_input` is an ARRAY of strings that represent tokens to be merged
- `range` is an array of integers for the sizes of n-gram to return.  A single integer (x) results in a range of [x, x].
- `separator` (optional) is a string value that specifies the separator for connecting adjacent tokens in the output.
    - default is whitespace ` `
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-ngrams)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
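
As a rough local equivalent, n-grams for a size range can be built by sliding over the token list. The helper name `ngrams` is hypothetical, and the output order (by start position, then by size) is chosen to match what this section's query returns rather than taken from documentation.

```python
def ngrams(tokens, size_range, separator=" "):
    """Return all n-grams of `tokens` with sizes in the inclusive `size_range`.

    Hypothetical helper for illustration only.
    """
    low, high = size_range
    merged = []
    for start in range(len(tokens)):
        for n in range(low, high + 1):
            if start + n <= len(tokens):
                merged.append(separator.join(tokens[start:start + n]))
    return merged


print(ngrams(["a", "b", "c", "d"], (2, 4)))
# ['a b', 'a b c', 'a b c d', 'b c', 'b c d', 'c d']
```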

In [63]:
query = f"""
    SELECT
        input_column,
        ML.NGRAMS(input_column, [2, 4]) AS ngram_column
    FROM
        (SELECT ['a', 'b', 'c', 'd'] as input_column)
"""
print(query)


    SELECT
        input_column,
        ML.NGRAMS(input_column, [2, 4]) AS ngram_column
    FROM
        (SELECT ['a', 'b', 'c', 'd'] as input_column)



In [64]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,ngram_column
0,"[a, b, c, d]","[a b, a b c, a b c d, b c, b c d, c d]"


In [65]:
df.iloc[-1]

input_column                              [a, b, c, d]
ngram_column    [a b, a b c, a b c d, b c, b c d, c d]
Name: 0, dtype: object

#### ML.ONE_HOT_ENCODER

Given a string column this function will one-hot encode the values in the column after sorting alphabetically.  Any `NULL` or dropped values will be encoded with `0`.
- `string_expression` is the string to encode
- `drop` (optional) takes values: 
    - 'none' (default) retains all values in `string_expression`
    - 'most_frequent' for dummy encoding, drop the most frequent category found in the `string_expression`
- `top_k` (optional) takes an integer value that specifies the limit on the number of categories to encode based on frequency
    - default is 32,000
    - max is 1 million
- `frequency_threshold` (optional) takes an integer value that specifies the minimum frequency for a category to be encoded
    - default = 5
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-one-hot-encoder)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
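
The effect of `drop`, `top_k`, and `frequency_threshold` together can be sketched locally. The helper `one_hot_encode` is hypothetical and returns one `(index, value)` pair per row, mirroring the single non-zero position of a one-hot vector. The alphabetical tie-break for `top_k` and the choice to report the dropped category with value `0.0` are assumptions based on the output this section's query produces.

```python
from collections import Counter


def one_hot_encode(values, drop="none", top_k=32000, frequency_threshold=5):
    """Return an (index, value) pair per input; index 0 is None or a removed category.

    Hypothetical helper sketching the parameter semantics described above.
    """
    counts = Counter(v for v in values if v is not None)
    eligible = [c for c, n in counts.items() if n >= frequency_threshold]
    # Most frequent first; alphabetical tie-break is an assumption.
    kept = sorted(eligible, key=lambda c: (-counts[c], c))[:top_k]
    vocab = {c: i for i, c in enumerate(sorted(kept), start=1)}
    dropped = kept[0] if (drop == "most_frequent" and kept) else None
    encoded = []
    for v in values:
        index = vocab.get(v, 0)
        # The dropped (reference) category keeps its index but carries value 0.0.
        value = 0.0 if (v is not None and v == dropped) else 1.0
        encoded.append((index, value))
    return encoded


values = [None, "a", "b", "b", "c", "c", "c", "d", "d", "d", "d"]
result = one_hot_encode(values, "most_frequent", 3, 1)
print(result[1], result[2], result[4], result[7])
# (0, 1.0) (1, 1.0) (2, 1.0) (3, 0.0)
```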

In [68]:
query = f"""
    SELECT
        input_column,
        ML.ONE_HOT_ENCODER(input_column) OVER() AS OHE_1,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 100, 1) OVER() AS OHE_2,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 3, 3) OVER() AS OHE_3,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 3, 1) OVER() AS OHE_4
    FROM
        UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd']) AS input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.ONE_HOT_ENCODER(input_column) OVER() AS OHE_1,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 100, 1) OVER() AS OHE_2,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 3, 3) OVER() AS OHE_3,
        ML.ONE_HOT_ENCODER(input_column, 'most_frequent', 3, 1) OVER() AS OHE_4
    FROM
        UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd']) AS input_column
    ORDER BY input_column



In [69]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,OHE_1,OHE_2,OHE_3,OHE_4
0,,"[{'index': 0, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]"
1,a,"[{'index': 0, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]"
2,b,"[{'index': 0, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
3,b,"[{'index': 0, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]","[{'index': 0, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
4,c,"[{'index': 0, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
5,c,"[{'index': 0, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
6,c,"[{'index': 0, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
7,d,"[{'index': 0, 'value': 1.0}]","[{'index': 4, 'value': 0.0}]","[{'index': 2, 'value': 0.0}]","[{'index': 3, 'value': 0.0}]"
8,d,"[{'index': 0, 'value': 1.0}]","[{'index': 4, 'value': 0.0}]","[{'index': 2, 'value': 0.0}]","[{'index': 3, 'value': 0.0}]"
9,d,"[{'index': 0, 'value': 1.0}]","[{'index': 4, 'value': 0.0}]","[{'index': 2, 'value': 0.0}]","[{'index': 3, 'value': 0.0}]"


### Numerical Functions

[Numerical functions](https://cloud.google.com/bigquery/docs/manual-preprocessing#numerical_functions) for regularizing data with numerical expressions.

#### ML.BUCKETIZE

Given a column of numerical values this function creates a new column with bucketed values based on a list of boundaries given as input.
- `numerical_expression` is the numerical expression to bucketize
- `array_split_points` is an array of numerical values that represent the points at which to split the `numerical_expression`
- `exclude_boundaries` (optional) is a BOOL that determines whether the upper and lower boundaries from `array_split_points` are used.
    - default = FALSE
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-bucketize)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
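
A split-point lookup like this can be sketched with `bisect`. The helper name `bucketize` is hypothetical. Two behaviors are assumptions inferred from the output of this section's query: a value equal to a split point falls into the higher bin, and `exclude_boundaries=True` removes the first and last split points.

```python
from bisect import bisect_right


def bucketize(x, split_points, exclude_boundaries=False):
    """Return the bin name for `x` given sorted split points.

    Hypothetical helper; boundary handling is inferred from the notebook output.
    """
    splits = sorted(split_points)
    if exclude_boundaries:
        splits = splits[1:-1]  # drop the lowest and highest split points
    return f"bin_{bisect_right(splits, x) + 1}"


print([bucketize(x, [2, 5, 7]) for x in (1, 2, 5, 7, 10)])
# ['bin_1', 'bin_2', 'bin_3', 'bin_4', 'bin_4']
print([bucketize(x, [2, 5, 7], True) for x in (1, 4, 5, 10)])
# ['bin_1', 'bin_1', 'bin_2', 'bin_2']
```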

In [74]:
query = f"""
    SELECT
        input_column,
        ML.BUCKETIZE(input_column, [2, 5, 7]) AS feature_column,
        ML.BUCKETIZE(input_column, [2, 5, 7], TRUE) AS feature_column_2
    FROM
        UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column
"""
print(query)


    SELECT
        input_column,
        ML.BUCKETIZE(input_column, [2, 5, 7]) AS feature_column,
        ML.BUCKETIZE(input_column, [2, 5, 7], TRUE) AS feature_column_2
    FROM
        UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column



In [75]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,feature_column,feature_column_2
0,1,bin_1,bin_1
1,2,bin_2,bin_1
2,3,bin_2,bin_1
3,4,bin_2,bin_1
4,5,bin_3,bin_2
5,6,bin_3,bin_2
6,7,bin_4,bin_2
7,8,bin_4,bin_2
8,9,bin_4,bin_2
9,10,bin_4,bin_2


#### ML.MAX_ABS_SCALER

Given a column of numerical values, this function scales the values to the range [-1, 1] by dividing by the maximum absolute value.  This does not shift the center or change the sparsity of the data.
- `numerical_expression` is the numerical expression to scale
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-max-abs-scaler)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
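
The scaling rule is simple enough to check by hand. The sketch below uses a hypothetical helper `max_abs_scale`; returning zeros for an all-zero column is an assumption made here to avoid a division by zero, not documented BigQuery behavior.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the column.

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]  # assumption: an all-zero column stays zero
    return [v / max_abs for v in values]


data = [0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]
scaled = max_abs_scale(data)
print(scaled[0], scaled[1], scaled[-1])
# 0.0 -0.1 1.0
```

Because zero maps to zero, sparse inputs stay sparse, which is the property the description above refers to.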

In [78]:
query = f"""
    SELECT
        input_column,
        ML.MAX_ABS_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) as input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.MAX_ABS_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) as input_column
    ORDER BY input_column



In [79]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,feature_column
0,-9,-0.9
1,-7,-0.7
2,-5,-0.5
3,-3,-0.3
4,-1,-0.1
5,0,0.0
6,2,0.2
7,4,0.4
8,6,0.6
9,8,0.8


#### ML.MIN_MAX_SCALER

Given a column of numerical values, this function scales the values to the range [0, 1] and caps any value outside that range at 0 or 1.  
- `numerical_expression` is the numerical expression to scale
- When used in the `TRANSFORM` clause of a `CREATE MODEL` statement, the scaling also applies to `ML.PREDICT`, where inputs outside the training range are capped at 0 or 1.
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-min-max-scaler)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
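
The capping behavior matters mostly at prediction time, when new inputs can fall outside the range seen in training. The sketch below separates the two steps with hypothetical helpers `fit_min_max` and `min_max_scale`; it assumes the training column has at least two distinct values, since equal minimum and maximum would divide by zero.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training column."""
    return min(train_values), max(train_values)


def min_max_scale(x, lo, hi):
    """Scale x into [0, 1] using the learned bounds, capping out-of-range inputs.

    Hypothetical helper; assumes hi > lo.
    """
    scaled = (x - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


lo, hi = fit_min_max(range(0, 11))
print([min_max_scale(x, lo, hi) for x in (0, 5, 10)])
# [0.0, 0.5, 1.0]
print(min_max_scale(-3, lo, hi), min_max_scale(14, lo, hi))
# 0.0 1.0
```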

In [80]:
query = f"""
    SELECT
        input_column,
        ML.MIN_MAX_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.MIN_MAX_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column
    ORDER BY input_column



In [81]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,feature_column
0,0,0.0
1,1,0.1
2,2,0.2
3,3,0.3
4,4,0.4
5,5,0.5
6,6,0.6
7,7,0.7
8,8,0.8
9,9,0.9


#### ML.NORMALIZER

Given a column of numerical arrays, this function normalizes each array to have unit norm under the given p-norm (parameter `p` has default = 2 and takes values 0, +inf, or >= 1).
- `array_expression` is an array of numerical expressions to normalize
- `p` (optional) is the p-norm for the normalization
    - default is 2
    - can be 0.0, >= 1 or +inf by using `CAST('+INF' AS FLOAT64)`
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-normalizer)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
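
For reference, p-norm normalization can be written in a few lines of plain Python. The helper `normalize` is hypothetical and covers only `p >= 1` and `+inf`; the `p = 0` case is deliberately left out because its definition is not spelled out here.

```python
import math


def normalize(vector, p=2.0):
    """Scale `vector` so that its p-norm equals 1 (p >= 1 or +inf only).

    Hypothetical helper for illustration only.
    """
    if p < 1:
        raise ValueError("this sketch covers only p >= 1 or +inf")
    if math.isinf(p):
        norm = max(abs(v) for v in vector)
    else:
        norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector]


v = [1, 2, 3, 4, 5]
print(normalize(v, math.inf))          # [0.2, 0.4, 0.6, 0.8, 1.0]
print(sum(normalize(v, 1)))            # the 1-norm of the result is approximately 1
print(sum(x * x for x in normalize(v)))  # the squared 2-norm is approximately 1
```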

In [82]:
query = f"""
    SELECT
        input_column,
        ML.NORMALIZER(input_column, CAST('+inf' as float64)) AS norm_inf,
        ML.NORMALIZER(input_column) AS norm_2,
        ML.NORMALIZER(input_column, 1) AS norm_1,
        ML.NORMALIZER(input_column, 0) AS norm_0
    FROM
        (SELECT [1, 2, 3, 4, 5] as input_column)
"""
print(query)


    SELECT
        input_column,
        ML.NORMALIZER(input_column, CAST('+inf' as float64)) AS norm_inf,
        ML.NORMALIZER(input_column) AS norm_2,
        ML.NORMALIZER(input_column, 1) AS norm_1,
        ML.NORMALIZER(input_column, 0) AS norm_0
    FROM
        (SELECT [1, 2, 3, 4, 5] as input_column)



In [83]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,norm_inf,norm_2,norm_1,norm_0
0,"[1, 2, 3, 4, 5]","[0.2, 0.4, 0.6, 0.8, 1.0]","[0.13483997249264842, 0.26967994498529685, 0.4...","[0.06666666666666667, 0.13333333333333333, 0.2...","[0.2, 0.4, 0.6, 0.8, 1.0]"


**Check Results** Use `np.linalg.norm` to verify that the correct norm was applied:

In [85]:
li = df['norm_inf'].iloc[-1]
print('Normalized For Unit Norm with Inf-norm:', li)
print('L^Infinity-Norm:', np.max(abs(li)))
print(np.linalg.norm(li, np.inf))

Normalized For Unit Norm with Inf-norm: [0.2 0.4 0.6 0.8 1. ]
L^Infinity-Norm: 1.0
1.0


In [86]:
l2 = df['norm_2'].iloc[-1]
print('Normalized For Unit Norm with 2-norm:', l2)
print('L^2-Norm:', np.sqrt(np.sum(l2**2)))
print(np.linalg.norm(l2, 2))

Normalized For Unit Norm with 2-norm: [0.13483997 0.26967994 0.40451992 0.53935989 0.67419986]
L^2-Norm: 1.0
1.0


In [87]:
l1 = df['norm_1'].iloc[-1]
print('Normalized For Unit Norm with 1-norm:', l1)
print('L^1-Norm:', np.sum(abs(l1)))
print(np.linalg.norm(l1, 1))

Normalized For Unit Norm with 1-norm: [0.06666667 0.13333333 0.2        0.26666667 0.33333333]
L^1-Norm: 1.0
1.0


In [90]:
l0 = df['norm_0'].iloc[-1]
print('Normalized For Unit Norm with 0-norm', l0)
#print(np.linalg.norm(l0, 0))

Normalized For Unit Norm with 0-norm [0.2 0.4 0.6 0.8 1. ]


#### ML.POLYNOMIAL_EXPAND

Given a STRUCT of numerical features, this function returns a STRUCT of polynomial combinations up to the degree passed in (default = 2).
- `struct_numerical_features` is a STRUCT containing numerical input features to expand
    - up to 10 input features without duplicates
- `degree` (optional) is an integer that specifies the highest degree of combinations
    - default is 2
    - range is [1, 4]
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-polynomial-expand)
- Is not [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).
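
The expansion can be reproduced locally with `itertools.combinations_with_replacement`. The helper `polynomial_expand` is hypothetical, and naming each term by joining the input names with `_` is inferred from the keys this section's query returns. Key order in the resulting dict may differ from BigQuery's STRUCT field order.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expand(row, degree=2):
    """Return all products of the numeric features in `row` up to `degree`.

    Hypothetical helper; term naming is inferred from the notebook output.
    """
    names = list(row)
    expanded = {}
    for size in range(1, degree + 1):
        for combo in combinations_with_replacement(names, size):
            expanded["_".join(combo)] = float(prod(row[name] for name in combo))
    return expanded


print(polynomial_expand({"input_column_1": 10, "input_column_2": 9}))
# {'input_column_1': 10.0, 'input_column_2': 9.0,
#  'input_column_1_input_column_1': 100.0,
#  'input_column_1_input_column_2': 90.0,
#  'input_column_2_input_column_2': 81.0}
```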

In [91]:
query = f"""
    SELECT
        input_column_1, input_column_2,
        ML.POLYNOMIAL_EXPAND(STRUCT(input_column_1, input_column_2)) AS feature_column
    FROM
        UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS input_column_1 WITH OFFSET pos1,
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) AS input_column_2 WITH OFFSET pos2
    WHERE
        pos1 = pos2
"""
print(query)


    SELECT
        input_column_1, input_column_2,
        ML.POLYNOMIAL_EXPAND(STRUCT(input_column_1, input_column_2)) AS feature_column
    FROM
        UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS input_column_1 WITH OFFSET pos1,
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) AS input_column_2 WITH OFFSET pos2
    WHERE
        pos1 = pos2



In [92]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column_1,input_column_2,feature_column
0,1,0,"{'input_column_1': 1.0, 'input_column_1_input_..."
1,2,1,"{'input_column_1': 2.0, 'input_column_1_input_..."
2,3,2,"{'input_column_1': 3.0, 'input_column_1_input_..."
3,4,3,"{'input_column_1': 4.0, 'input_column_1_input_..."
4,5,4,"{'input_column_1': 5.0, 'input_column_1_input_..."
5,6,5,"{'input_column_1': 6.0, 'input_column_1_input_..."
6,7,6,"{'input_column_1': 7.0, 'input_column_1_input_..."
7,8,7,"{'input_column_1': 8.0, 'input_column_1_input_..."
8,9,8,"{'input_column_1': 9.0, 'input_column_1_input_..."
9,10,9,"{'input_column_1': 10.0, 'input_column_1_input..."


In [93]:
df['feature_column'].iloc[-1]

{'input_column_1': 10.0,
 'input_column_1_input_column_1': 100.0,
 'input_column_1_input_column_2': 90.0,
 'input_column_2': 9.0,
 'input_column_2_input_column_2': 81.0}

In [94]:
pd.concat([df[['input_column_1', 'input_column_2']], df['feature_column'].apply(pd.Series)], axis=1)

Unnamed: 0,input_column_1,input_column_2,input_column_1.1,input_column_1_input_column_1,input_column_1_input_column_2,input_column_2.1,input_column_2_input_column_2
0,1,0,1.0,1.0,0.0,0.0,0.0
1,2,1,2.0,4.0,2.0,1.0,1.0
2,3,2,3.0,9.0,6.0,2.0,4.0
3,4,3,4.0,16.0,12.0,3.0,9.0
4,5,4,5.0,25.0,20.0,4.0,16.0
5,6,5,6.0,36.0,30.0,5.0,25.0
6,7,6,7.0,49.0,42.0,6.0,36.0
7,8,7,8.0,64.0,56.0,7.0,49.0
8,9,8,9.0,81.0,72.0,8.0,64.0
9,10,9,10.0,100.0,90.0,9.0,81.0
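
To make the expansion concrete, the following is a minimal Python sketch that reproduces the columns shown above for the last row. The helper name `polynomial_expand` is hypothetical, and the key naming (feature names joined with `_`) is an assumption inferred from the output above, not BigQuery's implementation.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expand(features, degree=2):
    """Return every product of the input features up to `degree`.

    Hypothetical helper: keys are the contributing feature names joined
    with "_", mirroring the names seen in the BigQuery output.
    """
    names = list(features)
    expanded = {}
    for d in range(1, degree + 1):
        # combinations_with_replacement yields each unordered term once,
        # e.g. (a, a), (a, b), (b, b) for degree 2.
        for combo in combinations_with_replacement(names, d):
            expanded["_".join(combo)] = float(prod(features[n] for n in combo))
    return expanded


row = polynomial_expand({"input_column_1": 10, "input_column_2": 9})
for name in sorted(row):
    print(name, row[name])
```

With two inputs and degree 2 this yields five terms: the two original features, their squares, and their product.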


#### ML.QUANTILE_BUCKETIZE

Given a column of numerical values, this function creates a new column of bucket labels, where each value is assigned to a quantile-based bucket and the number of buckets is set by the input.
- `numerical_expression` is the numerical expression to bucketize
- `num_buckets` is an integer that specifies the number of buckets to split the numerical expression into
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-quantile-bucketize)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).

In [101]:
query = f"""
    SELECT
        input_column,
        ML.QUANTILE_BUCKETIZE(input_column, 2) OVER() AS feature_column
    FROM
        UNNEST([1, 1, 1, 2, 2, 3, 3, 3]) as input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.QUANTILE_BUCKETIZE(input_column, 2) OVER() AS feature_column
    FROM
        UNNEST([1, 1, 1, 2, 2, 3, 3, 3]) as input_column
    ORDER BY input_column



In [102]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,feature_column
0,1,bin_1
1,1,bin_1
2,1,bin_1
3,2,bin_2
4,2,bin_2
5,3,bin_2
6,3,bin_2
7,3,bin_2


#### ML.ROBUST_SCALER

Given a column of numerical values, this function centers the values on the median and scales them by the quantile range:
- `numerical_expression` is the numerical expression to scale
- `quantile_range` (optional) is an array of two integers that specify the quantile range
    - default is [25, 75]
    - min is 0, max is 100
    - second element must be larger than the first element
- `with_median` (optional) is a BOOL value that specifies whether the data is centered by subtracting the median before scaling
    - default is TRUE
- `with_quantile_range` (optional) is a BOOL value that specifies if the data is scaled to the quantile range
    - default is TRUE
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-robust-scaler)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).

In [103]:
query = f"""
    SELECT
        input_column,
        ML.ROBUST_SCALER(input_column) OVER() AS feature_column_1,
        ML.ROBUST_SCALER(input_column, [25, 75], FALSE, TRUE) OVER() AS feature_column_2,
        ML.ROBUST_SCALER(input_column, [25, 75], TRUE, FALSE) OVER() AS feature_column_3
    FROM
        UNNEST([0, 25, 50, 75, 100]) as input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        ML.ROBUST_SCALER(input_column) OVER() AS feature_column_1,
        ML.ROBUST_SCALER(input_column, [25, 75], FALSE, TRUE) OVER() AS feature_column_2,
        ML.ROBUST_SCALER(input_column, [25, 75], TRUE, FALSE) OVER() AS feature_column_3
    FROM
        UNNEST([0, 25, 50, 75, 100]) as input_column
    ORDER BY input_column



In [104]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,feature_column_1,feature_column_2,feature_column_3
0,0,-1.0,0.0,-50.0
1,25,-0.5,0.5,-25.0
2,50,0.0,1.0,0.0
3,75,0.5,1.5,25.0
4,100,1.0,2.0,50.0
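
The three result columns can be checked with a short NumPy sketch. The helper name `robust_scaler` is hypothetical and uses exact percentiles from `np.percentile`; it illustrates the arithmetic rather than BigQuery's implementation, which may compute quantiles differently on larger data.

```python
import numpy as np


def robust_scaler(values, quantile_range=(25, 75),
                  with_median=True, with_quantile_range=True):
    """Center on the median and/or divide by the quantile range (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    scaled = values.copy()
    if with_median:
        # Center the data by subtracting the median of the original values.
        scaled = scaled - np.median(values)
    if with_quantile_range:
        # Scale by the width of the quantile range, e.g. Q75 - Q25.
        low, high = np.percentile(values, quantile_range)
        scaled = scaled / (high - low)
    return scaled


data = [0, 25, 50, 75, 100]
print(robust_scaler(data))                             # like feature_column_1
print(robust_scaler(data, with_median=False))          # like feature_column_2
print(robust_scaler(data, with_quantile_range=False))  # like feature_column_3
```

For this input the median is 50 and the quantile range is 75 - 25 = 50, which explains why the default scaling maps 0 to -1.0 and 100 to 1.0.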


#### ML.STANDARD_SCALER

Given a column of numerical values, this function standardizes the values by subtracting the `AVG` and dividing by the `STDDEV`, which produces the [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score).
- `numerical_expression` is the numerical expression to standardize
- When used in the `TRANSFORM` clause of a `CREATE MODEL` statement, the scaling is also applied by `ML.PREDICT`, using the same values for `AVG` and `STDDEV`.
- [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-standard-scaler)
- Is [exportable](https://cloud.google.com/bigquery/docs/exporting-models#export-transform-functions) when used in [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform).

In [105]:
query = f"""
    SELECT
        input_column,
        CAST((input_column - AVG(input_column) OVER()) / STDDEV(input_column) OVER() AS FLOAT64) AS manual_column,
        ML.STANDARD_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column
    ORDER BY input_column
"""
print(query)


    SELECT
        input_column,
        CAST((input_column - AVG(input_column) OVER()) / STDDEV(input_column) OVER() AS FLOAT64) AS manual_column,
        ML.STANDARD_SCALER(input_column) OVER() AS feature_column
    FROM
        UNNEST([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as input_column
    ORDER BY input_column



In [106]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,manual_column,feature_column
0,0,-1.507557,-1.507557
1,1,-1.206045,-1.206045
2,2,-0.904534,-0.904534
3,3,-0.603023,-0.603023
4,4,-0.301511,-0.301511
5,5,0.0,0.0
6,6,0.301511,0.301511
7,7,0.603023,0.603023
8,8,0.904534,0.904534
9,9,1.206045,1.206045
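
The same z-scores can be reproduced with NumPy as a quick sanity check. One detail matters here: `STDDEV` in BigQuery is the sample standard deviation, so the sketch passes `ddof=1`; NumPy's default (`ddof=0`) would give slightly different values.

```python
import numpy as np

# Same inputs as the query: the integers 0 through 10.
values = np.arange(11, dtype=float)

# z-score: subtract the mean, divide by the sample standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=1)
print(np.round(z_scores, 6))
```

The mean is 5 and the sample standard deviation is the square root of 11, so the first value is -5 / sqrt(11), matching the -1.507557 in both columns above.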


### Advanced Usage of Manual Feature Engineering

There may be situations where multiple feature preprocessing functions are needed together.  Here is an example that uses `ML.IMPUTER` with `ML.POLYNOMIAL_EXPAND`.

**NOTE:** An analytic function (one that uses `OVER()`) cannot be an argument of another analytic function; scalar functions, however, can be used as arguments.

This example combines three steps:
- uses `CAST` to convert the string values to `FLOAT64`
- imputes missing values in the column with `ML.IMPUTER`
- uses `ML.POLYNOMIAL_EXPAND` to create higher-order terms from the imputed column

In [107]:
query = f"""
SELECT
    input_column,
    ML.POLYNOMIAL_EXPAND(
        STRUCT(
            ML.IMPUTER(
                CAST(input_column AS FLOAT64),
                'mean'
            ) OVER() AS num_imputed_mean
        ),
        2
    ) AS imputed_expanded
FROM
    UNNEST(['1', '1', '2', '3', '4', '5', NULL]) AS input_column
"""
print(query)


SELECT
    input_column,
    ML.POLYNOMIAL_EXPAND(
        STRUCT(
            ML.IMPUTER(
                CAST(input_column AS FLOAT64),
                'mean'
            ) OVER() AS num_imputed_mean
        ),
        2
    ) AS imputed_expanded
FROM
    UNNEST(['1', '1', '2', '3', '4', '5', NULL]) AS input_column



In [108]:
df = bq.query(query = query).to_dataframe()
df

Unnamed: 0,input_column,imputed_expanded
0,1.0,"{'num_imputed_mean': 1.0, 'num_imputed_mean_nu..."
1,2.0,"{'num_imputed_mean': 2.0, 'num_imputed_mean_nu..."
2,3.0,"{'num_imputed_mean': 3.0, 'num_imputed_mean_nu..."
3,,"{'num_imputed_mean': 2.666666666666667, 'num_i..."
4,4.0,"{'num_imputed_mean': 4.0, 'num_imputed_mean_nu..."
5,1.0,"{'num_imputed_mean': 1.0, 'num_imputed_mean_nu..."
6,5.0,"{'num_imputed_mean': 5.0, 'num_imputed_mean_nu..."


In [109]:
pd.concat([df['input_column'], df['imputed_expanded'].apply(pd.Series)], axis=1)

Unnamed: 0,input_column,num_imputed_mean,num_imputed_mean_num_imputed_mean
0,1.0,1.0,1.0
1,2.0,2.0,4.0
2,3.0,3.0,9.0
3,,2.666667,7.111111
4,4.0,4.0,16.0
5,1.0,1.0,1.0
6,5.0,5.0,25.0
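
For comparison, the same cast, impute, and expand chain can be sketched in pandas. This is an illustrative local equivalent under simple assumptions (mean imputation and a single squared term), not what BigQuery executes; the column names are reused from the output above.

```python
import pandas as pd

# Same inputs as the query: numeric strings with one missing value.
raw = pd.Series(["1", "1", "2", "3", "4", "5", None], name="input_column")

# Step 1: cast strings to floats (None becomes NaN), like CAST(... AS FLOAT64).
numeric = pd.to_numeric(raw)

# Step 2: replace missing values with the mean of the observed values.
imputed = numeric.fillna(numeric.mean())

# Step 3: degree-2 expansion of a single feature is the feature and its square.
result = pd.DataFrame({
    "input_column": numeric,
    "num_imputed_mean": imputed,
    "num_imputed_mean_num_imputed_mean": imputed ** 2,
})
print(result)
```

The missing row receives the mean of the six observed values, 16 / 6 ≈ 2.666667, and its square ≈ 7.111111, in line with the BigQuery result.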
