In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with BigQuery datasets

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/datasets/get_started_bq_datasets.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/datasets/get_started_bq_datasets.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/datasets/get_started_bq_datasets.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI in production, focusing on data management: getting started with BigQuery datasets.

Learn more about [BigQuery Datasets](https://cloud.google.com/bigquery/docs/datasets-intro) and [Vertex AI for BigQuery users](https://cloud.google.com/vertex-ai/docs/beginner/bqml).

### Objective

In this tutorial, you learn how to use `BigQuery` as a dataset for training with `Vertex AI`.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Datasets`
- `BigQuery Datasets`

The steps performed include:

- Create a Vertex AI `Dataset` resource from a `BigQuery` table -- compatible with `AutoML` training.
- Extract a copy of the dataset from `BigQuery` to a CSV file in Cloud Storage -- compatible with `AutoML` or custom training.
- Select rows from a `BigQuery` dataset into a `pandas` dataframe -- compatible with custom training.
- Select rows from a `BigQuery` dataset into a `tf.data.Dataset` -- compatible with custom training of `TensorFlow` models.
- Select rows from extracted CSV files into a `tf.data.Dataset` -- compatible with custom training of `TensorFlow` models.
- Create a `BigQuery` dataset from CSV files.
- Extract data from a `BigQuery` table into a `DMatrix` -- compatible with custom training of `XGBoost` models.

### Recommendations

When doing E2E MLOps on Google Cloud, the following are best practices for dealing with structured (tabular) data in BigQuery:

- For AutoML training:
  - Create a managed dataset with Vertex AI `TabularDataset`.
  - Use the BigQuery table as the input to the dataset.
  - Specify columns and column transformations when running the AutoML training pipeline job.


- For custom training:
  - For small datasets:
    - Extract the BigQuery table to a pandas dataframe.
    - Preprocess the data in the dataframe.
  - For large datasets:
    - TensorFlow model training:
      - Create a tf.data.Dataset generator from the BigQuery table.
      - Specify the columns for the custom training.
      - Preprocess the data either:
        - Within the generator (upstream)
        - Within the model (downstream)
    - XGBoost model training:
      - Use BigQuery ML built-in XGBoost training.
      - Alternatively, create a DMatrix generator from CSV files extracted from the BigQuery table.
    - PyTorch model training:
      - Extract the BigQuery table to a pandas dataframe.
      - Preprocess the data in the dataframe.
      - Create a DataLoader generator from the pandas dataframe.


- Alternatively:
  - Extract the BigQuery table to CSV files.
  - Preprocess the CSV files.
  - Create a tf.data.Dataset generator from the CSV files.

### Dataset

The dataset used for this tutorial is the GSOD dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). In this version of the dataset you consider the fields year, month and day to predict the value of mean daily temperature (mean_temp).

### Costs
This tutorial uses billable components of Google Cloud:

- Vertex AI
- Cloud Storage
- BigQuery

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing) and [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installations

Install the following packages to execute this notebook.

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-bigquery \
                                 tensorflow \
                                 tensorflow-io==0.18 \
                                 xgboost \
                                 numpy \
                                 pandas \
                                 pyarrow

### Colab only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

### Import libraries and define constants

In [None]:
import google.cloud.aiplatform as aiplatform
import pandas as pd
import xgboost as xgb
from google.cloud import bigquery

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and region.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)

### Create BigQuery client

Create the BigQuery client.

In [None]:
bqclient = bigquery.Client(project=PROJECT_ID)

#### Location of BigQuery training data

Set the variable `IMPORT_FILE` to the `bq://` URI of the data table in BigQuery and `BQ_TABLE` to the table ID.

In [None]:
IMPORT_FILE = "bq://bigquery-public-data.samples.gsod"
BQ_TABLE = "bigquery-public-data.samples.gsod"

### Create the Dataset

#### BigQuery input data

Next, create the `Dataset` resource using the `create` method for the `TabularDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the `Dataset` resource.
- `bq_source`: Import data items from a BigQuery table into the `Dataset` resource.
- `labels`: User defined metadata. In this example, you store the location of the Cloud Storage bucket containing the user defined data.

Learn more about [TabularDataset from BigQuery table](https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api#aiplatform_create_dataset_tabular_bigquery_sample-python).

In [None]:
dataset = aiplatform.TabularDataset.create(
    display_name="NOAA historical weather data",
    bq_source=IMPORT_FILE,
    labels={"user_metadata": BUCKET_URI[5:]},
)

label_column = "mean_temp"

print(dataset.resource_name)

### Copy the dataset to Cloud Storage

Next, you copy the BigQuery table to Cloud Storage as CSV files using the `bq extract` command. The wildcard in the destination URI lets BigQuery split a large table across multiple files.

Learn more about [BigQuery command line interface](https://cloud.google.com/bigquery/docs/reference/bq-cli-reference).

In [None]:
comps = BQ_TABLE.split(".")
BQ_PROJECT_DATASET_TABLE = comps[0] + ":" + comps[1] + "." + comps[2]

! bq --location=us extract --destination_format CSV $BQ_PROJECT_DATASET_TABLE $BUCKET_URI/mydata*.csv

IMPORT_FILES = ! gsutil ls $BUCKET_URI/mydata*.csv

print(IMPORT_FILES)

EXAMPLE_FILE = IMPORT_FILES[0]

! gsutil cat $EXAMPLE_FILE | head
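
The cell above turns the dotted table ID into the `project:dataset.table` form that the `bq` CLI accepts by splitting on `.`. As a minimal sketch, the same conversions can be wrapped in small helpers. The names `split_table_id`, `to_bq_cli_table`, and `to_bq_uri` are hypothetical and not part of any Google library.

```python
def split_table_id(table_id):
    """Split 'project.dataset.table' into its three components."""
    parts = table_id.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    return parts[0], parts[1], parts[2]


def to_bq_cli_table(table_id):
    """Return the 'project:dataset.table' form used by the bq CLI."""
    project, dataset, table = split_table_id(table_id)
    return f"{project}:{dataset}.{table}"


def to_bq_uri(table_id):
    """Return the 'bq://project.dataset.table' form used for IMPORT_FILE."""
    split_table_id(table_id)  # Validate the shape before building the URI.
    return f"bq://{table_id}"


print(to_bq_cli_table("bigquery-public-data.samples.gsod"))
print(to_bq_uri("bigquery-public-data.samples.gsod"))
```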

### Create the Dataset

#### CSV input data

Next, create the `Dataset` resource using the `create` method for the `TabularDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the `Dataset` resource.
- `gcs_source`: A list of one or more CSV files in Cloud Storage to import into the `Dataset` resource.
- `labels`: User defined metadata. In this example, you store the location of the Cloud Storage bucket containing the user defined data.

Learn more about [TabularDataset from CSV files](https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api#aiplatform_create_dataset_tabular_gcs_sample-python)

In [None]:
gcs_source = IMPORT_FILES

dataset = aiplatform.TabularDataset.create(
    display_name="NOAA historical weather data",
    gcs_source=gcs_source,
    labels={"user_metadata": BUCKET_URI[5:]},
)


label_column = "mean_temp"

print(dataset.resource_name)

### Create a view of the BigQuery dataset

Alternatively, you can create a logical view of a BigQuery table that exposes only a subset of the fields.

Learn more about [Creating BigQuery views](https://cloud.google.com/bigquery/docs/views).

In [None]:
# Set dataset name and view name in BigQuery
BQ_MY_DATASET = "[your-dataset-name]"
BQ_MY_TABLE = "[your-view-name]"

# Otherwise, use the default names
if (
    BQ_MY_DATASET == ""
    or BQ_MY_DATASET is None
    or BQ_MY_DATASET == "[your-dataset-name]"
):
    BQ_MY_DATASET = "mlops_dataset"

if BQ_MY_TABLE == "" or BQ_MY_TABLE is None or BQ_MY_TABLE == "[your-view-name]":
    BQ_MY_TABLE = "mlops_view"

In [None]:
# Create the resources
! bq --location=US mk -d \
$PROJECT_ID:$BQ_MY_DATASET

sql_script = f'''
CREATE OR REPLACE VIEW `{PROJECT_ID}.{BQ_MY_DATASET}.{BQ_MY_TABLE}`
AS SELECT station_number,year,month,day,mean_temp FROM `{BQ_TABLE}`
'''
print(sql_script)

query = bqclient.query(sql_script)
query.result()  # Wait for the view to be created and surface any errors.
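
The view definition above hard-codes its column list. As a minimal sketch, the statement could instead be composed from a list of column names. The `build_view_sql` helper and the `my-project` ID are hypothetical, used only for illustration.

```python
def build_view_sql(project_id, dataset_id, view_id, source_table, columns):
    """Compose a CREATE OR REPLACE VIEW statement that selects `columns`."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{view_id}`\n"
        f"AS SELECT {column_list} FROM `{source_table}`"
    )


sql = build_view_sql(
    "my-project",  # Hypothetical project ID.
    "mlops_dataset",
    "mlops_view",
    "bigquery-public-data.samples.gsod",
    ["station_number", "year", "month", "day", "mean_temp"],
)
print(sql)
```

The resulting string can be passed to `bqclient.query()` in the same way as `sql_script`.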

### Read the BigQuery dataset into a pandas dataframe

Next, you read a sample of the table into a pandas dataframe using the BigQuery `list_rows()` and `to_dataframe()` methods, as follows:

- `list_rows()`: Lists the rows of the specified table and returns a row iterator. Optionally specify:
 - `selected_fields`: Subset of fields (columns) to return.
 - `max_results`: The maximum number of rows to return, similar to a SQL `LIMIT` clause.


- `rows.to_dataframe()`: Consumes the row iterator and reads the data into a pandas dataframe.

Learn more about [Loading BigQuery table into a dataframe](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas)

In [None]:
# Download the table.
table = bigquery.TableReference.from_string(BQ_TABLE)

rows = bqclient.list_rows(
    table,
    max_results=500,
    selected_fields=[
        bigquery.SchemaField("station_number", "STRING"),
        bigquery.SchemaField("year", "INTEGER"),
        bigquery.SchemaField("month", "INTEGER"),
        bigquery.SchemaField("day", "INTEGER"),
        bigquery.SchemaField("mean_temp", "FLOAT"),
    ],
)

dataframe = rows.to_dataframe()
print(dataframe.head())

### Read the BigQuery dataset into a tf.data.Dataset

Next, you read a sample of the dataset into a tf.data.Dataset using the TensorFlow I/O `BigQueryClient` class and its `read_session()` method, with the following parameters:

- `parent`: Your project, in the form `projects/<project-id>`.
- `project_id`: The project ID of the BigQuery table.
- `dataset_id`: The ID of the BigQuery dataset.
- `table_id`: The ID of the table within the corresponding BigQuery dataset.
- `selected_fields`: Subset of fields (columns) to return.
- `output_types`: The output types of the corresponding fields.
- `requested_streams`: The number of parallel readers.

Learn more about [BigQuery TensorFlow reader](https://www.tensorflow.org/io/tutorials/bigquery).

Learn more about [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

In [None]:
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient

feature_names = "station_number,year,month,day".split(",")

target_name = "mean_temp"


def read_bigquery(project, dataset, table):
    tensorflow_io_bigquery_client = BigQueryClient()
    read_session = tensorflow_io_bigquery_client.read_session(
        parent="projects/" + PROJECT_ID,
        project_id=project,
        dataset_id=dataset,
        table_id=table,
        selected_fields=feature_names + [target_name],
        output_types=[dtypes.string] + [dtypes.int32] * 3 + [dtypes.float32],
        requested_streams=2,
    )

    dataset = read_session.parallel_read_rows()
    return dataset


PROJECT, DATASET, TABLE = IMPORT_FILE.split("/")[-1].split(".")
tf_dataset = read_bigquery(PROJECT, DATASET, TABLE)

print(tf_dataset.take(1))
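
Each element produced by the reader is a dictionary keyed by column name. This follows the TensorFlow I/O BigQuery tutorial linked above, so treat it as an assumption for your installed version. A minimal sketch of separating the features from the label is shown below on a plain dictionary with made-up values, so that it runs without BigQuery access. The helper name `split_features_and_label` is hypothetical.

```python
FEATURE_NAMES = ["station_number", "year", "month", "day"]
TARGET_NAME = "mean_temp"


def split_features_and_label(row):
    """Split one row (a mapping of column name to value) into (features, label)."""
    features = {name: row[name] for name in FEATURE_NAMES}
    label = row[TARGET_NAME]
    return features, label


# Made-up row, for illustration only.
example_row = {
    "station_number": "30750",
    "year": 1950,
    "month": 1,
    "day": 5,
    "mean_temp": 35.5,
}
features, label = split_features_and_label(example_row)
print(features)
print(label)
```

With the dataset from the cell above, the same function could be applied with `tf_dataset.map(split_features_and_label)`.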

### Read CSV files into a tf.data.Dataset

Alternatively, when your data is in CSV files, you can load the dataset into a tf.data.Dataset using `tf.data.experimental.CsvDataset`, with the following parameters:

- `filenames`: A list of one or more CSV files.
- `header`: Whether CSV file(s) contain a header.
- `select_cols`: A sorted list of column indices (not column names) to return.
- `record_defaults`: The output types of the corresponding fields.

Learn more about [tf.data CsvDataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset)

In [None]:
import tensorflow as tf

feature_names = "station_number,year,month,day".split(",")

target_name = "mean_temp"

# select_cols takes sorted integer column indices, not column names.
# Positions follow the column order of the extracted CSV files:
# station_number=0, year=2, month=3, day=4, mean_temp=5.
tf_dataset = tf.data.experimental.CsvDataset(
    filenames=IMPORT_FILES,
    header=True,
    select_cols=[0, 2, 3, 4, 5],
    record_defaults=[tf.string] + [tf.int32] * 3 + [tf.float32],
)

print(tf_dataset.take(1))
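
`tf.data.experimental.CsvDataset` documents `select_cols` as a sorted list of integer column indices, so column names have to be translated into positions first. A minimal sketch of that translation follows. The `column_indices` helper is hypothetical, and the header list assumes the extracted CSV files keep the column order of the source table.

```python
def column_indices(header, wanted):
    """Return the sorted positions of the `wanted` column names within `header`."""
    positions = {name: index for index, name in enumerate(header)}
    missing = [name for name in wanted if name not in positions]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    return sorted(positions[name] for name in wanted)


# First six columns of the extracted GSOD CSV, in file order (assumed).
header = ["station_number", "wban_number", "year", "month", "day", "mean_temp"]
select_cols = column_indices(
    header, ["station_number", "year", "month", "day", "mean_temp"]
)
print(select_cols)  # [0, 2, 3, 4, 5]
```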

### Create a BigQuery dataset from a pandas dataframe

You can create a BigQuery dataset from a pandas dataframe using the BigQuery `create_dataset()` and `load_table_from_dataframe()` methods, as follows:

- `create_dataset()`: Creates an empty BigQuery dataset, with the following parameters:
 - `dataset`: The `Dataset` built from a `DatasetReference` for the dataset_id -- e.g., samples.
- `load_table_from_dataframe()`: Loads the rows of a pandas dataframe into a table within the corresponding dataset, with the following parameters:
 - `dataframe`: The dataframe.
 - `destination`: The table to load into, given as a `TableReference` or a `project.dataset.table` string.
 - `job_config`: Specifications on how to load the dataframe data.

In [None]:
LOCATION = "us"

SCHEMA = [
    bigquery.SchemaField("station_number", "STRING"),
    bigquery.SchemaField("year", "INTEGER"),
    bigquery.SchemaField("month", "INTEGER"),
    bigquery.SchemaField("day", "INTEGER"),
    bigquery.SchemaField("mean_temp", "FLOAT"),
]


DATASET_ID = "samples"
TABLE_ID = "gsod"


def create_bigquery_dataset(dataset_id):
    dataset = bigquery.Dataset(
        bigquery.dataset.DatasetReference(PROJECT_ID, dataset_id)
    )
    dataset.location = "us"

    try:
        dataset = bqclient.create_dataset(dataset)  # API request
        return True
    except Exception as err:
        print(err)
        # Re-raise anything other than HTTP 409 (the dataset already exists).
        if getattr(err, "code", None) != 409:
            raise
    return False


def load_data_into_bigquery(dataframe, dataset_id, table_id):
    create_bigquery_dataset(dataset_id)

    job_config = bigquery.LoadJobConfig(
        # Specify a (partial) schema. All columns are always written to the
        # table. The schema is used to assist in data type definitions.
        schema=SCHEMA,
        # Optionally, set the write disposition. BigQuery appends loaded rows
        # to an existing table by default, but with WRITE_TRUNCATE write
        # disposition it replaces the table with the loaded data.
        write_disposition="WRITE_TRUNCATE",
    )

    NEW_BQ_TABLE = f"{PROJECT_ID}.{dataset_id}.{table_id}"

    job = bqclient.load_table_from_dataframe(
        dataframe, NEW_BQ_TABLE, job_config=job_config
    )  # Make an API request.
    job.result()  # Wait for the job to complete.

    table = bqclient.get_table(NEW_BQ_TABLE)  # Make an API request.
    print(
        "Loaded {} rows and {} columns to {}".format(
            table.num_rows, len(table.schema), NEW_BQ_TABLE
        )
    )


load_data_into_bigquery(dataframe, DATASET_ID, TABLE_ID)

### Create a BigQuery dataset from CSV files

You can create a BigQuery dataset from CSV files using the BigQuery `create_dataset()` and `load_table_from_uri()` methods, as follows:

- `create_dataset()`: Creates an empty BigQuery dataset, with the following parameters:
 - `dataset`: The `Dataset` built from a `DatasetReference` for the dataset_id -- e.g., samples.
- `load_table_from_uri()`: Loads one or more CSV files into a table within the corresponding dataset, with the following parameters:
 - `source_uris`: One or more Cloud Storage URIs of CSV files (passed as `url` in the following cell).
 - `destination`: The `TableReference` for the table.
 - `job_config`: Specifications on how to load the CSV data.

Learn more about [Importing CSV data into BigQuery](https://www.tensorflow.org/io/tutorials/bigquery#import_census_data_into_bigquery).

In [None]:
LOCATION = "us"

CSV_SCHEMA = [
    bigquery.SchemaField("station_number", "STRING"),
    bigquery.SchemaField("wban_number", "STRING"),
    bigquery.SchemaField("year", "INTEGER"),
    bigquery.SchemaField("month", "INTEGER"),
    bigquery.SchemaField("day", "INTEGER"),
    bigquery.SchemaField("mean_temp", "FLOAT"),
    bigquery.SchemaField("num_mean_temp_samples", "INTEGER"),
    bigquery.SchemaField("mean_dew_point", "FLOAT"),
    bigquery.SchemaField("num_mean_dew_point_samples", "INTEGER"),
    bigquery.SchemaField("mean_sealevel_pressure", "FLOAT"),
    bigquery.SchemaField("num_mean_sealevel_pressure_samples", "INTEGER"),
    bigquery.SchemaField("mean_station_pressure", "FLOAT"),
    bigquery.SchemaField("num_mean_station_pressure_samples", "INTEGER"),
    bigquery.SchemaField("mean_visibility", "FLOAT"),
    bigquery.SchemaField("num_mean_visibility_samples", "INTEGER"),
    bigquery.SchemaField("mean_wind_speed", "FLOAT"),
    bigquery.SchemaField("num_mean_wind_speed_samples", "INTEGER"),
    bigquery.SchemaField("max_sustained_wind_speed", "FLOAT"),
    bigquery.SchemaField("max_gust_wind_speed", "FLOAT"),
    bigquery.SchemaField("max_temperature", "FLOAT"),
    bigquery.SchemaField("max_temperature_explicit", "BOOLEAN"),
    bigquery.SchemaField("min_temperature", "FLOAT"),
    bigquery.SchemaField("min_temperature_explicit", "BOOLEAN"),
    bigquery.SchemaField("total_precipitation", "FLOAT"),
    bigquery.SchemaField("snow_depth", "FLOAT"),
    bigquery.SchemaField("fog", "BOOLEAN"),
    bigquery.SchemaField("rain", "BOOLEAN"),
    bigquery.SchemaField("snow", "BOOLEAN"),
    bigquery.SchemaField("hail", "BOOLEAN"),
    bigquery.SchemaField("thunder", "BOOLEAN"),
    bigquery.SchemaField("tornado", "BOOLEAN"),
]


DATASET_ID = "samples"
TABLE_ID = "gsod"


def load_data_into_bigquery(url, dataset_id, table_id):
    create_bigquery_dataset(dataset_id)
    table = bigquery.dataset.DatasetReference(PROJECT_ID, dataset_id).table(table_id)

    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.schema = CSV_SCHEMA
    job_config.skip_leading_rows = 1  # heading

    load_job = bqclient.load_table_from_uri(url, table, job_config=job_config)
    print("Starting job {}".format(load_job.job_id))

    load_job.result()  # Waits for table load to complete.
    print("Job finished.")

    destination_table = bqclient.get_table(table)
    print("Loaded {} rows.".format(destination_table.num_rows))


load_data_into_bigquery(IMPORT_FILES, DATASET_ID, TABLE_ID)

### Read BigQuery table into XGBoost DMatrix

Currently, there is no direct data feeding connector between BigQuery and the open source XGBoost. The BigQuery ML service has a built-in XGBoost training module.

Alternatively, you extract the data either as a pandas dataframe or as CSV files. The extracted data is then given as an input to a `DMatrix` object when training the model.

Learn more about [Getting started with built-in XGBoost](https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost-start).

### Read pandas table into XGBoost DMatrix

Next, you load the pandas dataframe into a `DMatrix` object. XGBoost does not support non-numeric inputs, so any categorical column needs to be one-hot encoded before loading the dataframe. In the following cell, `station_number` was read as a string and is converted to a numeric value instead.

In [None]:
dataframe["station_number"] = pd.to_numeric(dataframe["station_number"])
labels = dataframe["mean_temp"]
data = dataframe.drop(["mean_temp"], axis=1)

dtrain = xgb.DMatrix(data, label=labels)
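
One-hot encoding replaces a categorical column with one 0/1 indicator column per distinct category. The pure-Python sketch below only illustrates the idea on made-up values. The `one_hot` helper is hypothetical, and on a dataframe you would more likely reach for `pd.get_dummies`.

```python
def one_hot(values):
    """Map each value to a list of 0/1 indicators, one per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


# Made-up categorical values, for illustration only.
categories, encoded = one_hot(["rain", "fog", "rain", "snow"])
print(categories)  # ['fog', 'rain', 'snow']
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```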

### Read CSV files into XGBoost DMatrix

Currently, there is no Cloud Storage support in XGBoost. If you use CSV files for input, you need to download them locally.

In [None]:
! gsutil cp $EXAMPLE_FILE data.csv

# label_column is the zero-based position of mean_temp in the extracted CSV.
dtrain = xgb.DMatrix("data.csv?format=csv&label_column=5")
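
The `label_column` value in the `DMatrix` URI is a zero-based column position, so it has to match where the label sits in the CSV file. As a minimal sketch, the position can be looked up from the header row rather than counted by hand. The `dmatrix_csv_uri` helper is hypothetical, and the sample file contains made-up values.

```python
import csv
import os
import tempfile


def dmatrix_csv_uri(path, label_name):
    """Build an XGBoost CSV URI whose label_column points at `label_name`."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return f"{path}?format=csv&label_column={header.index(label_name)}"


# Demonstration on a tiny throwaway file with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "data.csv")
    with open(sample, "w", newline="") as f:
        f.write("station_number,wban_number,year,month,day,mean_temp\n")
        f.write("30750,99999,1950,1,5,35.5\n")
    uri = dmatrix_csv_uri(sample, "mean_temp")

print(uri.split("?", 1)[1])  # format=csv&label_column=5
```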

## Clean up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Vertex AI Dataset resource
- Cloud Storage Bucket
- BigQuery Dataset

Set `delete_storage` to _True_ to delete the storage resources used in this notebook.

In [None]:
import os

# Delete the dataset using the Vertex dataset object
dataset.delete()

# Delete the temporary BigQuery dataset
! bq rm -r -f $PROJECT_ID:$DATASET_ID

delete_storage = False
if delete_storage or os.getenv("IS_TESTING"):
    # Delete the created GCS bucket
    ! gsutil rm -r $BUCKET_URI
    # Delete the created BigQuery datasets
    ! bq rm -r -f $PROJECT_ID:$BQ_MY_DATASET