# Training a multi-class classification model for an Ads-targeting use case

## Table Of Contents
* [Overview](#section-1)
* [Dataset](#section-2)
* [Objective](#section-3)
* [Costs](#section-4)
* [Tutorial](#section-5)
    - [Fetch the required data from BigQuery](#section-5)
    - [Preprocess the data](#section-6)
    - [Train a Tensorflow model](#section-7)
    - [Run the model on test data](#section-8)
    - [Automating the execution of the notebook using Executor](#section-9)
    - [Scheduled Runs on Executor](#section-10)
    - [Parametrizing the variables](#section-11)
* [Save the model to a GCS path](#section-12)
* [Clean Up](#section-13)


## Overview
<a name="section-1"></a>

This tutorial demonstrates building a machine-learning model for an Ads-targeting use case. Ads-targeting is an advertising technique in which selected or tailor-made ads are shown to customers based on their past behavior and preferences. Targeted ads are meant to reach specific customers based on demographics, psychographics, behavior and other second-order activities, which are usually learned from data collected from those customers.

## Dataset
<a name="section-2"></a>
This notebook uses the following BigQuery dataset: ```looker-private-demo.ecomm```. The dataset contains information about various advertisement campaigns, including the demographics of users who clicked on the ads and then made purchases. For this tutorial, the top 3 campaigns from the USA are selected, and the information about users who made purchases is used to train a model with the campaigns as the classes. The idea is to see whether the advertisement and user data can be used to identify which campaign best suits a given user.

The dataset can be accessed by pinning the ```looker-private-demo``` project in BigQuery. Instead of going to the BigQuery UI, this can be done from the current Jupyter environment (on a Vertex AI managed instance) itself. Vertex AI managed instances support browsing BigQuery datasets and tables through the **BigQuery In Notebooks** feature.

<img src="images/Bigquery_UI_new.PNG"></img>

## Objective
<a name="section-3"></a>
This notebook demonstrates collecting the required data from BigQuery, preprocessing it and training a multi-class classification model on an E-commerce dataset. The steps performed include the following:

- Fetch the required data from BigQuery
- Preprocess the data
- Train a TensorFlow (>=2.4) classification model
- Evaluate the loss for the trained model
- Automate the notebook execution using the Executor feature
- Save the model to a GCS path
- Clean up the created resources

## Costs
<a name="section-4"></a>
This tutorial uses billable components of Google Cloud:

* Vertex AI
* Bigquery
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Bigquery
pricing](https://cloud.google.com/bigquery/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

*Otherwise*, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
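
As a minimal sketch of how the timestamp can be appended to a resource name, the helper below builds a name from a base string and a `datetime`. The function name `timestamped_name` and the base name `ads-targeting-model` are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime


def timestamped_name(base: str, now: datetime) -> str:
    """Append a second-resolution timestamp to a resource name."""
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


# Hypothetical resource name with a fixed datetime, used only for illustration.
example = timestamped_name("ads-targeting-model", datetime(2022, 1, 31, 9, 5, 0))
print(example)  # ads-targeting-model-20220131090500
```

Because the timestamp has one-second resolution, two sessions started within the same second would still collide; that is usually acceptable for a tutorial.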

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**


In this tutorial, the Cloud Storage bucket is used to store the trained
model: after training and evaluation, the notebook writes the model
to a path in this bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "[your-region]"  # @param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

**Finally**, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

## Tutorial

### Fetch the required data from BigQuery
<a name="section-5"></a>

#@bigquery

WITH
  traindata AS (
  SELECT
    b.* EXCEPT(ad_event_id,
      user_id),
    c.* EXCEPT(id),
    d.* EXCEPT(keyword_id,
      ad_id),
    a.amount,
    a.device_type,
    e.name
  FROM
    `looker-private-demo.ecomm.ad_events` a
  JOIN (
    SELECT
      ad_event_id,
      user_id,
      state,
      os,
      browser
    FROM
      `looker-private-demo.ecomm.events`
    WHERE
      event_type="Purchase"
      AND country="USA") b
  ON
    a.id = b.ad_event_id
  JOIN (
    SELECT
      id,
      gender,
      age
    FROM
      `looker-private-demo.ecomm.users`) c
  ON
    b.user_id = c.id
  JOIN (
    SELECT
      keyword_id,
      ad_id,
      cpc_bid_amount,
      bidding_strategy_type,
      quality_score,
      keyword_match_type
    FROM
      `looker-private-demo.ecomm.keywords`
    WHERE
      cpc_bid_amount <= 3000) d
  ON
    a.keyword_id = d.keyword_id
  JOIN (
    SELECT
      ad_id,
      name
    FROM
      `looker-private-demo.ecomm.ad_groups`) e
  ON
    d.ad_id = e.ad_id )
SELECT
  *
FROM
  traindata

Once the results from BigQuery are displayed in the above cell, press the **Query and load as DataFrame** button and execute the generated code stub to fetch the data into the current notebook as a dataframe.

*Note: By default the data is loaded into the `df` variable; the name can be changed before executing the cell if required.*

In [None]:
# The following two lines are only necessary to run once.
# Comment out otherwise for speed-up.
from google.cloud.bigquery import Client

client = Client()

query = """WITH traindata AS (
SELECT b.* except(ad_event_id, user_id), c.* except(id), d.* except(keyword_id, ad_id), a.amount, a.device_type, e.name
FROM `looker-private-demo.ecomm.ad_events` a
JOIN
(SELECT ad_event_id, user_id, state, os, browser from `looker-private-demo.ecomm.events` WHERE event_type="Purchase" AND country="USA") b
ON a.id = b.ad_event_id
JOIN
(SELECT id, gender, age FROM `looker-private-demo.ecomm.users`) c
ON b.user_id = c.id
JOIN
(SELECT keyword_id, ad_id, cpc_bid_amount, bidding_strategy_type, quality_score, keyword_match_type FROM `looker-private-demo.ecomm.keywords`
WHERE cpc_bid_amount <= 3000) d
ON a.keyword_id = d.keyword_id
JOIN
(SELECT ad_id, name FROM `looker-private-demo.ecomm.ad_groups`) e
ON d.ad_id = e.ad_id
)
SELECT * FROM traindata"""
job = client.query(query)
df = job.to_dataframe()

### Preprocess the data
<a name="section-6"></a>

Import the required libraries.

In [None]:
import warnings

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

warnings.filterwarnings("ignore")

Select the necessary columns from the E-commerce data and divide them based on their type (numerical or categorical).

In [None]:
target = "name"
categ_cols = [
    "state",
    "os",
    "browser",
    "gender",
    "bidding_strategy_type",
    "keyword_match_type",
    "device_type",
]
num_cols = ["age", "cpc_bid_amount", "quality_score", "amount"]

From the current dataset, only the top 3 campaigns are chosen for targeting the users. All the relevant information about the advertisement and the user who purchased an item after seeing it is already available in the dataframe.

In [None]:
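# Copy the filtered rows so that the column assignments below operate on an
# independent dataframe rather than on a slice of the original one.
df = df[df["name"].isin(["Tops & Tees", "Active", "Accessories"])].copy()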

Encode the target variable.

In [None]:
df["name"] = df["name"].map({"Tops & Tees": 0, "Active": 1, "Accessories": 2})

One-hot encode the categorical variables. After one-hot encoding, the column for the first level is dropped to avoid the [dummy-variable trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). This process is called *dummy encoding*.

In [None]:
def encode_cols(data, col):
    # Create dummy variables for the given column and drop the first level.
    # dtype=float keeps the dummy columns numeric regardless of the pandas
    # version (newer versions default to bool).
    categ = pd.get_dummies(data[col], prefix=col, drop_first=True, dtype=float)
    # Add the dummy columns to the master dataframe
    data = pd.concat([data, categ], axis=1)
    return data


# dummy-encode the categorical fields
for i in categ_cols:
    df = encode_cols(df, i)
    df.drop(columns=[i], inplace=True)

# check the data's shape
df.shape
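
To make the effect of dropping the first level concrete, here is a small plain-Python sketch of dummy encoding. It is an illustration only and does not replace the pandas-based cell above; the function name `dummy_encode` and the sample browser values are hypothetical.

```python
def dummy_encode(values, prefix):
    """Return one dict of 0/1 indicators per value, omitting the first level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level (in sorted order) becomes the baseline
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# Made-up sample values, used only for illustration.
rows = dummy_encode(["Chrome", "Firefox", "Safari", "Chrome"], prefix="browser")
for row in rows:
    print(row)
```

A row whose indicators are all zero belongs to the dropped baseline level (`Chrome` here), so no information is lost by omitting that column.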

Split the data into train and test.

In [None]:
X = df[[i for i in df.columns if i != target]].copy()
y = df[target].copy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=36
)
print(X_train.shape, X_test.shape)

Scale the data.

In [None]:
scaler = StandardScaler()
X_train.loc[:, num_cols] = scaler.fit_transform(X_train[num_cols])
X_test.loc[:, num_cols] = scaler.transform(X_test[num_cols])
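
The scaler is fitted on the training split only and then reused on the test split, so that no statistics from the test data leak into training. The following plain-Python sketch shows the same idea for a single column; the helper names and the age values are hypothetical, and the population standard deviation is used to match the default behavior of `StandardScaler`.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Compute the mean and population standard deviation of the training data."""
    return mean(train_values), pstdev(train_values)


def apply_scaler(values, mu, sigma):
    """Standardize values using statistics computed elsewhere."""
    return [(v - mu) / sigma for v in values]


# Made-up ages, used only for illustration.
train_ages = [20, 30, 40, 50]
test_ages = [35, 60]

mu, sigma = fit_scaler(train_ages)  # statistics come from the training data only
scaled_train = apply_scaler(train_ages, mu, sigma)
scaled_test = apply_scaler(test_ages, mu, sigma)
print(scaled_train, scaled_test)
```

The scaled training values have zero mean, while the scaled test values generally do not, which is expected.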

### Train a Tensorflow model
<a name="section-7"></a>

Convert the target column to a categorical (one-hot encoded) column.

In [None]:
y_train_categ = to_categorical(y_train)
y_test_categ = to_categorical(y_test)

Define hyperparameters for model training. 

*Note: Comment or remove the parameters from the following cell if they are provided already as an input parameter through the Executor feature.*

In [None]:
optimizer = "sgd"
num_hidden_layers = 3
num_neurons = [64, 128, 256]
activ_func = ["relu", "relu", "relu"]
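
The model-building code indexes `num_neurons` and `activ_func` once per hidden layer, so the two lists need to match `num_hidden_layers`, especially when the values are supplied from outside the notebook. A minimal sketch of such a check is shown below; the function name `check_layer_config` is hypothetical.

```python
def check_layer_config(num_hidden_layers, num_neurons, activ_func):
    """Raise ValueError if a per-layer list does not match the layer count."""
    for name, values in (("num_neurons", num_neurons), ("activ_func", activ_func)):
        if len(values) != num_hidden_layers:
            raise ValueError(
                f"{name} has {len(values)} entries, expected {num_hidden_layers}"
            )


# The values defined in this notebook pass the check without raising.
check_layer_config(3, [64, 128, 256], ["relu", "relu", "relu"])
```

Without such a check, a list that is too short fails with an `IndexError` while the model is being built, and extra entries in a list that is too long are silently ignored.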

Define the architecture and compile the model.

In [None]:
model = Sequential()
# construct the neural network as per the defined parameters
for i in range(num_hidden_layers):
    if i == 0:
        # add the input layer
        model.add(
            Dense(
                num_neurons[i],
                activation=activ_func[i],
                input_shape=(X_train.shape[1],),
            )
        )
    else:
        # add the hidden layers
        model.add(Dense(num_neurons[i], activation=activ_func[i]))

# add the output layer
model.add(Dense(3, activation="softmax"))
# compile the model
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
model.summary()
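
For intuition about the output layer and the loss, the sketch below computes a softmax over three scores and the categorical cross-entropy for a single sample in plain Python. The scores are made up, and the functions are a simplified illustration rather than the Keras implementation (Keras, for instance, averages the loss over the samples in a batch).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


# Made-up scores for the three campaign classes.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class, so it approaches zero as that probability approaches one.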

Fit the model.

In [None]:
history = model.fit(X_train, y_train_categ, epochs=50, verbose=1)

### Run the model on test data
<a name="section-8"></a>

Evaluate the model on test data.

In [None]:
test_results = model.evaluate(X_test, y_test_categ, verbose=1)
print(f"Test results - Loss: {test_results}")
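
The evaluation above reports only the loss. If predicted campaigns are also of interest, the class probabilities (for example, the output of `model.predict(X_test)`) can be converted into labels by taking the most probable class per row. The sketch below does this in plain Python on made-up probabilities; `CLASS_NAMES` inverts the mapping used when encoding the target, and the helper names are hypothetical.

```python
CLASS_NAMES = {0: "Tops & Tees", 1: "Active", 2: "Accessories"}


def predict_labels(prob_rows):
    """Return the index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up probabilities and labels, used only for illustration.
sample_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
predicted = predict_labels(sample_probs)
print([CLASS_NAMES[i] for i in predicted])
print(accuracy([0, 2, 2], predicted))
```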

### Automating the execution of the notebook using Executor
<a name="section-9"></a>

Running the notebook from start to end has become more powerful with the new Executor feature. Clicking the Executor button brings up a form where you choose the environment, machine type, input parameters and so on. After the form is submitted, the notebook is executed as a job in Vertex AI custom training jobs. Running jobs can be monitored from the <b>Notebook Executor</b> pane in the menu on the left.

<img src="images/executor.png"></img>


Executor gives us the freedom to choose the environment and machine type while automating the runs, similar to Vertex AI training jobs, without switching to the training-jobs UI. Apart from the custom container that replicates the existing kernel by default, pre-built environments such as TensorFlow Enterprise and PyTorch can also be selected to run the notebook. Furthermore, the required compute power can be specified by choosing from the list of available machine types, including GPUs.

### Scheduled Runs on Executor
<a name="section-10"></a>

Runs can also be scheduled to recur with the Executor. To do so, select <b>Schedule-based recurring executions</b> as the run type instead of <b>One-time execution</b>. The frequency of the job and the time at which it should execute can be provided in the same form.


<img src="images/executor_scheduled_runs2.png"></img>

### Parametrizing the variables
<a name="section-11"></a>

Executor also makes it easy to run the notebook with different sets of input parameters. If required, constants in the notebook can be treated like arguments to a function: when submitting the Executor form, those constants can be supplied as input parameters.

<img src="images/executor_input_parameters.png"></img>

The hyperparameters defined during the model-training step can be passed as arguments when submitting this Executor form. The values defined in this notebook should be removed or commented out before submitting for execution; otherwise, they overwrite the input parameters in the job. The Executor feature thus also allows the notebook to be run as a training job with different parameter settings every time.
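
One possible alternative to commenting out the values is to apply the notebook's defaults only for names that have not been supplied. The sketch below assumes that Executor defines the input parameters as notebook variables before the hyperparameter cell runs, which should be verified for your environment. The names `DEFAULTS` and `resolve_params` are hypothetical.

```python
DEFAULTS = {
    "optimizer": "sgd",
    "num_hidden_layers": 3,
    "num_neurons": [64, 128, 256],
    "activ_func": ["relu", "relu", "relu"],
}


def resolve_params(namespace, defaults=DEFAULTS):
    """Return the defaults, overridden by same-named values found in namespace."""
    return {key: namespace.get(key, value) for key, value in defaults.items()}


# Simulate a run in which only the optimizer was supplied as an input parameter.
params = resolve_params({"optimizer": "adam"})
print(params)
```

Inside the notebook, `resolve_params(globals())` would pick up any variables that are already defined, and the model-building code could then read its settings from `params`.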

### Save the model to a GCS path
<a name="section-12"></a>

TensorFlow's *model.save()* method accepts GCS paths as well as local file paths when writing the model. Make sure that the service account used to run this notebook has write permission on the specified GCS path.

In [None]:
# BUCKET_NAME already starts with "gs://", so the prefix is not added again.
GCS_PATH = BUCKET_NAME + "/[path-to-save]/"
model.save(GCS_PATH)
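
Since `BUCKET_NAME` already carries the `gs://` prefix, it is easy to end up with a doubled prefix or doubled slashes when building the save path by string concatenation. The helper below is a minimal sketch of a safer way to join the pieces; the function name `gcs_path` and the bucket and folder names are hypothetical.

```python
def gcs_path(bucket, *parts):
    """Join a bucket name and path segments into a single gs:// URI."""
    bucket = bucket.strip("/")
    if not bucket.startswith("gs://"):
        bucket = "gs://" + bucket
    # Drop empty segments and surrounding slashes to avoid "//" in the path.
    segments = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([bucket] + segments)


# Hypothetical bucket and folder names, used only for illustration.
print(gcs_path("gs://my-bucket", "models/", "/ads-targeting"))
print(gcs_path("my-bucket", "models"))
```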

## Clean Up
<a name="section-13"></a>

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
! gsutil -m rm -r [gcs-folder-path-to-delete]