# Building a Fraud Detection model on Vertex AI with TensorFlow Enterprise and BigQuery

## Learning objectives

1. Analyze the data in BigQuery.
2. Ingest records from BigQuery.
3. Preprocess the data.
4. Build the model.
5. Train the model.
6. Evaluate the model.

## Introduction

In this notebook, you'll directly ingest a [BigQuery](https://cloud.google.com/bigquery/) dataset and train a fraud detection model with TensorFlow Enterprise on [Vertex AI](https://cloud.google.com/vertex-ai).

You'll also walk through all the steps of building a model and learn a bit about how to handle imbalanced classification problems.

Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/fraud_detection_with_tensorflow_bigquery.ipynb).

# Ingest records from BigQuery

## Step 1: Import Python packages

Run the cell below to import the Python packages.

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers

from tensorflow_io.bigquery import BigQueryClient

import functools

## Step 2: Define constants

Next, let's define some constants for use in the project. Change **GCP_PROJECT_ID** to the ID of the project you are using. Run each new cell as you create it.

In [2]:
GCP_PROJECT_ID = 'qwiklabs-gcp-00-b1e00ce17168' # Replace with your Project-ID
DATASET_GCP_PROJECT_ID = GCP_PROJECT_ID # A copy of the data is saved in the user project
DATASET_ID = 'tfe_codelab'
TRAIN_TABLE_ID = 'ulb_fraud_detection_train'
VAL_TABLE_ID = 'ulb_fraud_detection_val'
TEST_TABLE_ID = 'ulb_fraud_detection_test'

FEATURES = ['Time','V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24','V25','V26','V27','V28','Amount']
LABEL='Class'
DTYPES=[tf.float64] * len(FEATURES) + [tf.int64]

## Step 3: Define helper functions

Now, let's define a couple of functions. **read_session()** reads data from a BigQuery table. **extract_labels()** is a helper function that separates the label column from the rest, so that the dataset is in the (features, label) format expected by **model.fit()** later on.

In [3]:
client = BigQueryClient()

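def read_session(TABLE_ID):
    # Open a BigQuery read session for one table, selecting the feature
    # columns plus the label and declaring their dtypes.
    return client.read_session(
        "projects/" + GCP_PROJECT_ID, DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
        FEATURES + [LABEL], DTYPES, requested_streams=2
    )

def extract_labels(input_dict):
    # Split a row dictionary into (features, label), the structure model.fit() expects.
    features = dict(input_dict)
    label = tf.cast(features.pop(LABEL), tf.float64)
    return (features, label)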

2022-05-25 11:40:43.203881: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2022-05-25 11:40:43.206218: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2022-05-25 11:40:43.353322: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
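
To make the output format of **extract_labels()** concrete, here is a minimal plain-Python sketch of the same split, with no TensorFlow involved. The helper name `split_label` and the sample row are illustrative assumptions: the feature values are taken from the first training batch printed in this notebook, and the `Class` value of 0 is assumed.

```python
LABEL = "Class"

def split_label(row):
    """Plain-Python analogue of extract_labels(): return (features, label)."""
    features = dict(row)                 # copy so the input row is left untouched
    label = float(features.pop(LABEL))   # remove the label column and cast it to float
    return features, label

# Hypothetical single row; only three of the 30 feature columns are shown.
row = {"Time": 282.0, "V1": -0.35646619, "Amount": 0.0, "Class": 0}
features, label = split_label(row)

print(features)  # the feature columns without "Class"
print(label)
```

The returned pair mirrors the (features, label) structure that each element of the dataset has after the label column is separated.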


## Step 4: Ingest data

Finally, let's create each dataset and then print the first batch from the training dataset. Note that we have defined a **BATCH_SIZE** of 32. This is an important parameter that will impact the speed and accuracy of training.

In [4]:
BATCH_SIZE = 32

# TODO 1
# Create the datasets
raw_train_data = # Your code goes here
raw_val_data = # Your code goes here
raw_test_data = # Your code goes here

next(iter(raw_train_data)) # Print first batch

2022-05-25 11:40:45.756773: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-05-25 11:40:45.851131: E tensorflow/core/framework/dataset.cc:552] Unimplemented: Cannot compute input sources for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.
2022-05-25 11:40:45.851185: E tensorflow/core/framework/dataset.cc:556] Unimplemented: Cannot merge options for dataset of type IO>BigQueryDataset, because the dataset does not implement `InputDatasets`.


({'Amount': <tf.Tensor: shape=(32,), dtype=float64, numpy=
  array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])>,
  'Time': <tf.Tensor: shape=(32,), dtype=float64, numpy=
  array([ 282.,  380.,  430.,  711.,  804.,  820.,  912., 1193., 1443.,
         1444., 1444., 1840., 1888., 2291., 2371., 2519., 2549., 2741.,
         2812., 2827., 2870., 2891., 2989., 3003., 3046., 3105., 3186.,
         3204., 3224., 3275., 3330., 3371.])>,
  'V1': <tf.Tensor: shape=(32,), dtype=float64, numpy=
  array([-0.35646619, -1.29983679, -1.86025769, -0.43134934,  1.18169674,
         -0.93748131,  1.08300282,  1.13064571,  1.07669965,  1.04095773,
         -0.96040277, -4.10236268, -1.9271414 ,  1.22554941, -0.87883341,
         -0.70749537, -0.97164965,  1.33214151, -0.63340299, -0.70659816,
         -0.46060238, -1.35641041, -0.46917279, -0.94424503,  1.14151696,
         -1.04345577,  1.12683746,  1.08259865,
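
The printed batch is a dictionary that maps each column name to a tensor of 32 values. As a rough illustration of what batching does to row-oriented data, the plain-Python sketch below groups rows into column-wise batches. The function `batch_columns` and the sample rows are hypothetical and are not part of the lab's pipeline.

```python
def batch_columns(rows, batch_size):
    """Yield batches as {column: [values...]} dicts, batch_size rows at a time."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [r[key] for r in chunk] for key in chunk[0]}

# Hypothetical rows: 70 records with two columns each.
rows = [{"Time": float(i), "Amount": 0.0} for i in range(70)]
batches = list(batch_columns(rows, batch_size=32))

print(len(batches))                         # 3
print([len(b["Time"]) for b in batches])    # [32, 32, 6]
```

The final batch is smaller whenever the number of rows is not a multiple of the batch size.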

# Build the model

## Step 1: Preprocess data

Let's create feature columns for each feature in the dataset. In this particular dataset, all of the columns are of type **numeric_column**, but there are a number of other column types (e.g. categorical_column).

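You will also normalize the data so that the network converges faster. The mean of each feature has been precalculated for this purpose. Note that **norm_data()** below scales each value by 1/(2 * mean) rather than subtracting the mean, so a value equal to the feature mean maps to 0.5.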

In [5]:
MEANS = [94816.7387536405, 0.0011219465482001268, -0.0021445914636999603, -0.002317402958335562,
         -0.002525792169927835, -0.002136576923287782, -3.7586818983702984, 8.135919975738768E-4,
         -0.0015535579268265718, 0.001436137140461279, -0.0012193712736681508, -4.5364970422902533E-4,
         -4.6175444671576083E-4, 9.92177789685366E-4, 0.002366229151475428, 6.710217226762278E-4,
         0.0010325807119864225, 2.557260815835395E-4, -2.0804190062322664E-4, -5.057391100818653E-4,
         -3.452114767842334E-6, 1.0145936326270006E-4, 3.839214074518535E-4, 2.2061197469126577E-4,
         -1.5601580596677608E-4, -8.235017846415852E-4, -7.298316615408554E-4, -6.898459943652376E-5,
         4.724125688297753E-5, 88.73235686453587]

def norm_data(mean, data):
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

numeric_columns = []

for i, feature in enumerate(FEATURES):
    # TODO 2: Your code goes here

numeric_columns

[NumericColumn(key='Time', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e966050>, 94816.7387536405)),
 NumericColumn(key='V1', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e966050>, 0.0011219465482001268)),
 NumericColumn(key='V2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e966050>, -0.0021445914636999603)),
 NumericColumn(key='V3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e966050>, -0.002317402958335562)),
 NumericColumn(key='V4', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e966050>, -0.002525792169927835)),
 NumericColumn(key='V5', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function norm_data at 0x7f445e9660
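
The output above shows that each column's `normalizer_fn` is a `functools.partial` that binds **norm_data()** to that feature's mean. The sketch below reproduces that pattern in plain Python so the arithmetic is easy to inspect. `norm_value` is a hypothetical stand-in for **norm_data()** without the TensorFlow cast and reshape, and the two means are copied from **MEANS**.

```python
import functools

def norm_value(mean, value):
    """Plain-Python version of the scaling in norm_data(): value / (2 * mean)."""
    return value * 1 / (2 * mean)

# Means for 'Time' and 'Amount', copied from MEANS.
TIME_MEAN = 94816.7387536405
AMOUNT_MEAN = 88.73235686453587

# functools.partial fixes the first argument, leaving a one-argument function.
norm_time = functools.partial(norm_value, TIME_MEAN)
norm_amount = functools.partial(norm_value, AMOUNT_MEAN)

print(norm_time(TIME_MEAN))   # a value equal to the mean maps to 0.5
print(norm_time(282.0))       # first 'Time' value in the training batch
print(norm_amount(0.0))       # 0.0
```

Binding the mean with `functools.partial` gives each feature its own one-argument normalizer while sharing a single function definition.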

## Step 2: Build the model

Now we are ready to create a model. We will feed the columns we just created into the network. Then we will compile the model. We are including the Precision/Recall AUC metric, which is [useful for imbalanced datasets](http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf).

In [6]:
model = keras.Sequential([
    tf.keras.layers.DenseFeatures(numeric_columns),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(
    # TODO 3: Your code goes here
)
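
To see why accuracy alone can mislead on an imbalanced dataset, and why precision and recall are worth tracking, consider the plain-Python sketch below. The labels are hypothetical (998 legitimate transactions, 2 fraudulent), and the "model" simply never predicts fraud. The helper names are illustrative assumptions.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical data: 998 legitimate transactions and 2 fraudulent ones.
labels = [0] * 998 + [1] * 2
predictions = [0] * 1000   # a "model" that never predicts fraud

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = safe_ratio(tp + tn, len(labels))
precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)

print(accuracy, precision, recall)  # 0.998 0.0 0.0
```

The classifier scores 99.8% accuracy while catching no fraud at all, which is the kind of failure a precision/recall-based metric exposes.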

## Step 3: Train the model

There are a number of techniques to handle imbalanced data, including oversampling (generating new data in the minority class) and undersampling (reducing the data in the majority class).

For the purposes of this codelab, let's use a technique that overweights the loss when misclassifying the minority class. You'll specify a class_weight parameter when training and weight "1" (fraud) higher, since it is much less prevalent.
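
As a rough sketch of what class weighting does to the loss, the plain-Python example below scales the binary cross-entropy of a single example by the weight of its true class. The weights of 1 and 100 match the values used in this lab. The function `weighted_bce` and the predicted probabilities are illustrative assumptions, not Keras internals.

```python
import math

CLASS_WEIGHT = {0: 1, 1: 100}

def weighted_bce(label, prob, class_weight):
    """Binary cross-entropy for one example, scaled by the weight of its true class."""
    loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return class_weight[label] * loss

# Two equally confident mistakes: the true class gets probability 0.1 in both.
missed_fraud = weighted_bce(1, 0.1, CLASS_WEIGHT)   # fraud scored as 0.1
false_alarm = weighted_bce(0, 0.9, CLASS_WEIGHT)    # legitimate scored as 0.9

print(missed_fraud)
print(false_alarm)
print(missed_fraud / false_alarm)  # roughly 100
```

With these weights, missing a fraudulent transaction costs about 100 times as much as an equally confident false alarm, which pushes the model to pay attention to the rare class.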

You will use 3 epochs (passes through the data) in this lab so that training is quicker. In a real-world scenario, you'd want to train long enough that you stop seeing increases in accuracy on the validation set.
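
The idea of training until the validation metric stops improving can be sketched as a simple patience rule. The example below is plain Python with hypothetical per-epoch validation scores. The function name `early_stop` and the patience value are assumptions for illustration. In Keras, the `tf.keras.callbacks.EarlyStopping` callback serves a similar purpose.

```python
def early_stop(val_scores, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch   # no improvement for `patience` epochs
    return len(val_scores), best_epoch

# Hypothetical validation scores, one per epoch.
val_scores = [0.20, 0.31, 0.35, 0.34, 0.35, 0.33]

print(early_stop(val_scores, patience=2))  # (5, 3)
```

Here training would stop after epoch 5, because the score has not beaten the epoch 3 value for two consecutive epochs.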

In [7]:
CLASS_WEIGHT = {
    0: 1,
    1: 100
}
EPOCHS = 3

train_data = raw_train_data.shuffle(10000)
val_data = raw_val_data
test_data = raw_test_data

# Train the model using model.fit()
# TODO 4: Your code goes here

Epoch 1/3
Consider rewriting this model with the Functional API.

      1/Unknown - 12s 12s/step - loss: 8215.7061 - accuracy: 0.3750 - auc: 0.0000e+00

2022-05-25 11:40:59.290556: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 7102 of 10000
2022-05-25 11:40:59.326535: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:228] Shuffle buffer filled.


Epoch 2/3

Epoch 3/3

<keras.callbacks.History at 0x7f4460d7e3d0>

## Step 4: Evaluate the model

The evaluate() function can be applied to test data that the model has never seen to provide an objective assessment. Fortunately, we've set aside test data just for that!

In [8]:
# Evaluate the model
# TODO 5: Your code goes here



[11.758894920349121, 0.9978189468383789, 0.3504483997821808]
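
The list returned by evaluate() is easier to read once each value is paired with its metric name. The short sketch below assumes the order loss, accuracy, auc, which is the order the metrics appear in the training log above. The values are copied from the output.

```python
# Metric names in the order they appear in the training log (assumed order).
METRIC_NAMES = ["loss", "accuracy", "auc"]

# Values copied from the evaluate() output above.
results = [11.758894920349121, 0.9978189468383789, 0.3504483997821808]

report = dict(zip(METRIC_NAMES, results))
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the accuracy is close to 1.0 while the AUC value is about 0.35, a reminder that accuracy alone says little on a heavily imbalanced dataset.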

## Step 5: Exploration

In this lab, you've seen how to ingest a large dataset from BigQuery directly into a TensorFlow Keras model. You've also walked through all the steps of building a model. Finally, you learned a bit about how to handle imbalanced classification problems.

Feel free to keep playing around with different architectures and approaches to the imbalanced dataset, to see if you can improve the accuracy!