In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train a multi-class classification model for ads-targeting
<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/ads_targetting/training-multi-class-classification-model-for-ads-targeting-usecase.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/ads_targetting/training-multi-class-classification-model-for-ads-targeting-usecase.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/ads_targetting/training-multi-class-classification-model-for-ads-targeting-usecase.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

This tutorial demonstrates how to build a machine learning model for an ads-targeting use case. Ads-targeting is an advertisement technique where chosen or tailor-made ads are shown to the customers based on their past behavior and preferences. Targeted ads are meant to reach specific customers based on demographics, psychographics, behavior, and other second-order activities that are learned usually through data collected from the customers.

*Note: If you are using a [Vertex AI Workbench managed notebooks](https://cloud.google.com/vertex-ai/docs/workbench/managed/create-instance) instance, use the `TensorFlow 2 (Local)` kernel. Some components of this notebook may not work in other notebook environments.*


### Objective

In this tutorial, you learn how to collect data from BigQuery, preprocess it, and train a multi-class classification model on an E-commerce dataset. 

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery

The steps performed include:

- Fetch the required data from BigQuery
- Preprocess the data
- Train a TensorFlow (>=2.4) classification model
- Evaluate the loss for the trained model
- Automate the notebook execution using the executor feature
- Save the model to a Cloud Storage path
- Clean up the created resources

## Dataset

This tutorial uses the `looker-private-demo.ecomm` dataset in BigQuery. The dataset consists of information about various advertisement campaigns, including the demographics of users who clicked on the ads and made purchases after seeing them. For this tutorial, the top three campaigns from the USA are selected from this dataset, and the information about users who made purchases is used to train a model with the campaigns as the classes. The goal is to see whether the advertisement and user data can identify which campaign is best suited to a given user.

The dataset can be accessed by pinning the `looker-private-demo` project in BigQuery. If you are using a Vertex AI Workbench managed notebooks instance, you can do this from the JupyterLab user interface instead of going to the BigQuery user interface. Vertex AI Workbench managed notebooks instances support browsing BigQuery datasets and tables through their BigQuery integration.

<img src="images/Bigquery_UI_new.PNG"></img>

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery
pricing](https://cloud.google.com/bigquery/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages


In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip3 install {USER_FLAG} --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]' tensorflow scikit-learn protobuf==3.20.1 -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

# Automatically restart the kernel after installs (skipped in automated tests)
if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. <a href="https://console.cloud.google.com/cloud-resource-manager" target="_blank">Select or create a Google Cloud project</a>. When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. <a href="https://cloud.google.com/billing/docs/how-to/modify-project" target="_blank">Make sure that billing is enabled for your project</a>.

1. <a href="https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com" target="_blank">Enable the Vertex AI API</a>.

1. If you are running this notebook locally, you will need to install the <a href="https://cloud.google.com/sdk" target="_blank">Cloud SDK</a>.

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The following regions are supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a UUID for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
import random
import string


# Generate a UUID of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if IS_COLAB:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.
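
Besides being globally unique, a bucket name has to follow the Cloud Storage naming rules. The following is a minimal sketch of a pre-check, assuming a hypothetical helper `is_valid_bucket_name` that is not part of this tutorial. It covers only a subset of the rules (3 to 63 characters; lowercase letters, digits, hyphens, underscores, and dots; starting and ending with a letter or digit; no `goog` prefix) and cannot check uniqueness.

```python
import re

# Hypothetical helper: checks only a subset of the Cloud Storage naming rules.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic length and character checks."""
    return bool(_BUCKET_RE.match(name)) and not name.startswith("goog")


print(is_valid_bucket_name("my-project-aip-ab12cd34"))  # True
print(is_valid_bucket_name("My_Bucket"))  # False (uppercase letters)
```

A check like this can catch an invalid name before `gsutil mb` is run, but the bucket creation call remains the authoritative test.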


In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

**Finally**, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

### Import libraries and define constants

In [None]:
import warnings

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

warnings.filterwarnings("ignore")

## Tutorial

### Fetch the data from BigQuery 


If you are using a ***Vertex AI Workbench managed notebooks instance***, the cell below that starts with "#@bigquery" is a SQL query cell. If you are using a Vertex AI Workbench user-managed notebooks instance or Colab, it is a Markdown cell.

#@bigquery

WITH
  traindata AS (
  SELECT
    b.* EXCEPT(ad_event_id,
      user_id),
    c.* EXCEPT(id),
    d.* EXCEPT(keyword_id,
      ad_id),
    a.amount,
    a.device_type,
    e.name
  FROM
    `looker-private-demo.ecomm.ad_events` a
  JOIN (
    SELECT
      ad_event_id,
      user_id,
      state,
      os,
      browser
    FROM
      `looker-private-demo.ecomm.events`
    WHERE
      event_type="Purchase"
      AND country="USA") b
  ON
    a.id = b.ad_event_id
  JOIN (
    SELECT
      id,
      gender,
      age
    FROM
      `looker-private-demo.ecomm.users`) c
  ON
    b.user_id = c.id
  JOIN (
    SELECT
      keyword_id,
      ad_id,
      cpc_bid_amount,
      bidding_strategy_type,
      quality_score,
      keyword_match_type
    FROM
      `looker-private-demo.ecomm.keywords`
    WHERE
      cpc_bid_amount <= 3000) d
  ON
    a.keyword_id = d.keyword_id
  JOIN (
    SELECT
      ad_id,
      name
    FROM
      `looker-private-demo.ecomm.ad_groups`) e
  ON
    d.ad_id = e.ad_id )
SELECT
  *
FROM
  traindata

If you are using a Vertex AI Workbench managed notebooks instance, once the results from BigQuery are displayed in the above cell, click the **Query and load as DataFrame** button and execute the generated code stub to fetch the data into the current notebook as a dataframe.

*Note: By default the data is loaded into a `df` variable, though this can be changed before executing the cell if required.*

In [None]:
# The following two lines are only necessary to run once.
# Comment out otherwise for speed-up.
from google.cloud.bigquery import Client

client = Client(project=PROJECT_ID)

In [None]:
query = """WITH traindata AS (
SELECT b.* except(ad_event_id, user_id), c.* except(id), d.* except(keyword_id, ad_id), a.amount, a.device_type, e.name
FROM `looker-private-demo.ecomm.ad_events` a
JOIN
(SELECT ad_event_id, user_id, state, os, browser from `looker-private-demo.ecomm.events` WHERE event_type="Purchase" AND country="USA") b
ON a.id = b.ad_event_id
JOIN
(SELECT id, gender, age FROM `looker-private-demo.ecomm.users`) c
ON b.user_id = c.id
JOIN
(SELECT keyword_id, ad_id, cpc_bid_amount, bidding_strategy_type, quality_score, keyword_match_type FROM `looker-private-demo.ecomm.keywords`
WHERE cpc_bid_amount <= 3000) d
ON a.keyword_id = d.keyword_id
JOIN
(SELECT ad_id, name FROM `looker-private-demo.ecomm.ad_groups`) e
ON d.ad_id = e.ad_id
)
SELECT * FROM traindata"""
job = client.query(query)
df = job.to_dataframe()

### Preprocess the data
Select the necessary columns from the E-commerce data and divide them based on their type (numerical/categorical).
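
Before splitting the columns by type, it can help to confirm that the query result actually contains the columns the later cells rely on. This is a minimal sketch on a made-up toy frame; `missing_columns` is a hypothetical helper, not something the tutorial defines.

```python
import pandas as pd


def missing_columns(frame: pd.DataFrame, expected: list) -> list:
    """Return the entries of `expected` that are absent from `frame`, in order."""
    return [col for col in expected if col not in frame.columns]


# Toy frame standing in for the query result (values are made up).
toy = pd.DataFrame({"state": ["CA"], "os": ["iOS"], "age": [31], "name": ["Active"]})

# Only a few of the tutorial's column names are listed here for brevity.
expected = ["state", "os", "browser", "age", "amount", "name"]
print(missing_columns(toy, expected))  # ['browser', 'amount']
```

In the notebook, `frame` would be `df` and `expected` would be the categorical, numerical, and target column names combined.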

In [None]:
target = "name"
categ_cols = [
    "state",
    "os",
    "browser",
    "gender",
    "bidding_strategy_type",
    "keyword_match_type",
    "device_type",
]
num_cols = ["age", "cpc_bid_amount", "quality_score", "amount"]

#### Select top three campaigns

From the current dataset, only the top three campaigns will be chosen to target the users. All the relevant information about the advertisement and the user who purchased an item after seeing the advertisement is available in the dataframe already. 
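
The campaign names are hard-coded in this tutorial. As a rough sketch, they could instead be derived from the data, assuming that "top" means the campaigns with the most purchase rows (the source does not state how the ranking was made). The toy frame and its counts are made up.

```python
import pandas as pd

# Toy stand-in for the `name` column of the query result (values are made up).
toy = pd.DataFrame(
    {
        "name": ["Active"] * 5
        + ["Tops & Tees"] * 4
        + ["Accessories"] * 3
        + ["Jeans"] * 2
        + ["Socks"]
    }
)

# Assumption: "top" means the campaigns with the most rows.
top_three = toy["name"].value_counts().nlargest(3).index.tolist()
print(top_three)  # ['Active', 'Tops & Tees', 'Accessories']

# Keep only rows from those campaigns and build an integer label per campaign.
filtered = toy[toy["name"].isin(top_three)]
label_map = {campaign: idx for idx, campaign in enumerate(top_three)}
```

Note that labels derived this way depend on the ranking order, so they need not match the fixed mapping used in this notebook.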

In [None]:
df = df[df["name"].isin(["Tops & Tees", "Active", "Accessories"])]

Encode the target variable.

In [None]:
df["name"] = df["name"].map({"Tops & Tees": 0, "Active": 1, "Accessories": 2})

#### One-hot encode the categorical variables

After one-hot encoding, the first level-column is dropped to avoid the [dummy-variable trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) scenario. This process is called *dummy-encoding*.
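
To illustrate the difference on a small scale, the sketch below encodes a made-up `os` column with and without `drop_first=True`. With full one-hot encoding the indicator columns of each row always sum to 1, so any one of them can be inferred from the others; dropping the first level removes that redundancy.

```python
import pandas as pd

# Toy categorical column (values are made up).
toy = pd.DataFrame({"os": ["android", "ios", "windows", "ios"]})

full = pd.get_dummies(toy["os"], prefix="os")
reduced = pd.get_dummies(toy["os"], prefix="os", drop_first=True)

print(list(full.columns))  # ['os_android', 'os_ios', 'os_windows']
print(list(reduced.columns))  # ['os_ios', 'os_windows']

# Every row of the full encoding has exactly one active indicator.
print(full.sum(axis=1).eq(1).all())  # True
```

A row with all zeros in the reduced encoding corresponds to the dropped level, here `android`.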

In [None]:
def encode_cols(data, col):
    # Create dummy variables for the given column and drop the first level.
    categ = pd.get_dummies(data[col], prefix=col, drop_first=True)
    # Adding the results to the master dataframe
    data = pd.concat([data, categ], axis=1)
    return data


# dummy-encode the categorical fields
for i in categ_cols:
    df = encode_cols(df, i)
    df.drop(columns=[i], inplace=True)

# check the data's shape
df.shape

#### Split the data into train and test
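
The split in this tutorial is purely random. As an optional variation that the tutorial itself does not use, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits, which can matter when the campaigns are imbalanced. The sketch below uses made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 60 / 30 / 10 rows for classes 0, 1, 2 (made up).
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=36, stratify=y
)

# Class counts in the 20-row test split follow the 60/30/10 proportions.
print(np.bincount(y_test))  # [12  6  2]
```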

In [None]:
X = df[[i for i in df.columns if i != target]].copy()
y = df[target].copy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=36
)
print(X_train.shape, X_test.shape)

#### Scale the data
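
`StandardScaler` subtracts the mean and divides by the standard deviation of each column. The statistics are learned from the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. A minimal NumPy sketch of the same arithmetic on made-up values:

```python
import numpy as np

# Toy single-feature splits (values are made up).
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[30.0], [50.0]])

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, because it is scaled with the training statistics.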

In [None]:
scaler = StandardScaler()
X_train.loc[:, num_cols] = scaler.fit_transform(X_train[num_cols])
X_test.loc[:, num_cols] = scaler.transform(X_test[num_cols])

### Train a TensorFlow model
#### Convert the target column to a one-hot encoded column
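
For intuition, the layout that one-hot encoding produces for integer labels can be sketched in plain NumPy. This is an illustration of the encoding, not the Keras implementation; the labels below are made up.

```python
import numpy as np

# Integer class labels for four rows (made up); three classes in total.
labels = np.array([0, 2, 1, 2])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
```

Each row contains a single 1 in the column of its class, which is the target shape that the `categorical_crossentropy` loss expects.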

In [None]:
y_train_categ = to_categorical(y_train)
y_test_categ = to_categorical(y_test)

#### Define hyperparameters for model training

*Note: Comment or remove the parameters from the following cell if they are provided already as an input parameter through the executor feature.*
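
Because the model-building loop indexes the per-layer lists by position, the lists need one entry per hidden layer. The sketch below is a hypothetical validation helper, `check_hyperparameters`, that is not part of the tutorial; it may be useful when the values arrive as executor input parameters.

```python
def check_hyperparameters(num_hidden_layers, num_neurons, activ_func):
    """Raise ValueError if the per-layer lists do not match the layer count."""
    if len(num_neurons) != num_hidden_layers:
        raise ValueError(
            f"expected {num_hidden_layers} neuron counts, got {len(num_neurons)}"
        )
    if len(activ_func) != num_hidden_layers:
        raise ValueError(
            f"expected {num_hidden_layers} activations, got {len(activ_func)}"
        )
    if any(units <= 0 for units in num_neurons):
        raise ValueError("every layer needs a positive number of neurons")


# The values used in this tutorial pass the check silently.
check_hyperparameters(3, [64, 128, 256], ["relu", "relu", "relu"])
```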

In [None]:
optimizer = "sgd"
num_hidden_layers = 3
num_neurons = [64, 128, 256]
activ_func = ["relu", "relu", "relu"]

#### Define the architecture and compile the model
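
As a cross-check for the output of `model.summary()`, the number of trainable parameters of a stack of `Dense` layers can be computed by hand: each layer has `inputs * units` weights plus `units` biases. The input width of 70 below is a made-up figure, since the real value depends on how many columns the dummy-encoding step produces.

```python
def dense_param_counts(input_dim, layer_units):
    """Return the trainable parameter count of each fully connected layer."""
    counts = []
    fan_in = input_dim
    for units in layer_units:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units
    return counts


# Hidden layers of 64, 128, 256 units plus the 3-unit softmax output.
# The input width of 70 is a hypothetical value.
counts = dense_param_counts(70, [64, 128, 256, 3])
print(counts)  # [4544, 8320, 33024, 771]
print(sum(counts))  # 46659
```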

In [None]:
model = Sequential()
# construct the neural network as per the defined parameters
for i in range(num_hidden_layers):
    if i == 0:
        # add the input layer
        model.add(
            Dense(
                num_neurons[i],
                activation=activ_func[i],
                input_shape=(X_train.shape[1],),
            )
        )
    else:
        # add the hidden layers
        model.add(Dense(num_neurons[i], activation=activ_func[i]))

# add the output layer
model.add(Dense(3, activation="softmax"))
# compile the model
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
model.summary()

#### Fit the model

In [None]:
history = model.fit(X_train, y_train_categ, epochs=50, verbose=1)

### Run the model on test data


#### Evaluate the model on test data.
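
The model in this tutorial is compiled without extra metrics, so `model.evaluate` reports only the loss. As a rough sketch, the categorical cross-entropy and the accuracy can be computed from predicted probabilities as shown below. The probabilities are made up; in the notebook they would come from `model.predict(X_test)`.

```python
import numpy as np

# Made-up softmax outputs for four test rows and their one-hot labels.
probs = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.6, 0.3],
        [0.3, 0.4, 0.3],
        [0.2, 0.2, 0.6],
    ]
)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)

# Categorical cross-entropy: mean negative log-probability of the true class.
loss = -np.mean(np.sum(y_true * np.log(probs), axis=1))

# Accuracy: share of rows where the most probable class is the true class.
accuracy = np.mean(probs.argmax(axis=1) == y_true.argmax(axis=1))

print(round(float(loss), 4), float(accuracy))
```

Here three of the four rows are classified correctly; the third row is the miss, and it also contributes the largest share of the loss.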

In [None]:
test_results = model.evaluate(X_test, y_test_categ, verbose=1)
print(f"Test results - Loss: {test_results}")

**Note: The executor feature is available only in Vertex AI Workbench managed notebooks.**

### Automating the execution of the notebook using executor in Vertex AI Workbench managed notebooks instance

If you are using a Vertex AI Workbench managed notebooks instance, the executor can help you run a notebook file from start to end, with your choice of environment, machine type, input parameters, and other characteristics. After you set up an execution, the notebook is executed as a job in Vertex AI custom training. You can monitor your jobs from the <b>Notebook Executor</b> pane in the menu on the left.

<img src="images/executor.png"></img>

The executor lets you choose the environment and machine type for automated runs, similar to Vertex AI training jobs, without switching to the training jobs UI. Besides the custom container that replicates the existing kernel by default, you can select pre-built environments such as TensorFlow Enterprise and PyTorch to run the notebook. You can also specify the required compute power by choosing from the list of available machine types, including GPUs.

### Scheduled runs on executor in Vertex AI Workbench managed notebooks instance

Vertex AI Workbench managed notebook runs can also be scheduled to recur with the executor. To do so, select <b>Schedule-based recurring executions</b> as the run type instead of <b>One-time execution</b>. You provide the frequency of the job and the time at which it runs when you create the execution.

<img src="images/executor_scheduled_runs2.png"></img>

### Parameterizing the variables

If you are using a Vertex AI Workbench managed notebooks instance, the executor lets you run a notebook with different sets of input parameters. If required, constants in the notebook can be treated as arguments to a function, and when you submit the execution, you can provide those constants as input parameters.

<img src="images/executor_input_parameters.png"></img>

The hyperparameters defined during the model training step can be passed as arguments while submitting an execution. However, the values defined in the notebook itself should be removed or commented out before submitting the execution. Otherwise, the input parameters would just be overwritten by the values in the notebook.
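
One way to keep default values in the notebook without overwriting supplied parameters is to look each name up before falling back to a default. This is a sketch only: it assumes that input parameters are defined as global variables before the hyperparameter cell runs, which the text above implies but does not spell out. The helper `param` and the `injected` dictionary are hypothetical.

```python
def param(name, default, namespace):
    """Return the value of `name` from `namespace` if defined, else `default`."""
    return namespace.get(name, default)


# Simulated namespace: pretend only `optimizer` was supplied as an input
# parameter. In a notebook, `globals()` would be passed instead.
injected = {"optimizer": "adam"}

optimizer = param("optimizer", "sgd", injected)
num_hidden_layers = param("num_hidden_layers", 3, injected)

print(optimizer, num_hidden_layers)  # adam 3
```

With this pattern the supplied value wins when present, and the notebook default applies otherwise.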

### Save the model to a Cloud Storage path

TensorFlow's `model.save()` method accepts Cloud Storage paths as well as local file paths when writing the model to disk. Make sure that the service account used to run this notebook has write permission on the specified Cloud Storage path.

In [None]:
GCS_PATH = BUCKET_URI + "/path-to-save/"
model.save(GCS_PATH)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the Cloud Storage bucket

delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI