# Vertex AI MLOps Book - Chapter 6 - BigQuery ML - Credit Card Default Prediction

<table align="left">
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw..ipynb" target="_blank">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI icon">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Dataset - Google Cloud BQ Public Dataset - Credit Card Default

**Description** - Credit card client records containing demographic attributes, the credit limit, monthly repayment status, bill amounts, and payment amounts (in NT dollars, from 2005), plus a label indicating whether the client defaulted on the next month's payment.

**Location** - bigquery-public-data.ml_datasets.credit_card_default

**Objective** - Predict the probability that a client will default on their credit card payment next month

#### Install the required libraries

In [None]:
! pip3 install google-cloud-aiplatform

In [None]:
! pip3 install google-cloud-bigquery

In [None]:
! pip3 install db-dtypes

#### Uncomment the following cell if running notebook locally. Not required if running on Vertex AI Workbench

In [31]:
#!gcloud auth login
#!gcloud auth application-default login  # credentials used by the Python client libraries

#### Import the required libraries and create the BigQuery client

In [None]:
from google.cloud import bigquery
client = bigquery.Client()

## Set GCP Project ID

Set your GCP Project ID. Replace 'jsb-alto' with your project ID.

In [15]:
Project_ID = 'jsb-alto'

# Set the project id
! gcloud config set project {Project_ID}

Updated property [core/project].


## Look at the Credit Card data we will be using in this notebook

Let's first look at the fields of the dataset. The `description` column in the output below explains each field.

In [16]:
query = f"""
SELECT 
  table_name, column_name, data_type, description
FROM `bigquery-public-data.ml_datasets.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'credit_card_default'
"""

data_schema = client.query(query).to_dataframe()
data_schema

Unnamed: 0,table_name,column_name,data_type,description
0,credit_card_default,id,FLOAT64,Anonymized ID of each client
1,credit_card_default,limit_balance,FLOAT64,Amount of given credit in NT dollars (includes...
2,credit_card_default,sex,STRING,"Gender (1=male, 2=female)"
3,credit_card_default,education_level,STRING,"Education Level (1=graduate school, 2=universi..."
4,credit_card_default,marital_status,STRING,"Marital status (1=married, 2=single, 3=others)"
5,credit_card_default,age,FLOAT64,Age in years
6,credit_card_default,pay_0,FLOAT64,"Repayment status in September, 2005 (-1=pay du..."
7,credit_card_default,pay_2,FLOAT64,"Repayment status in August, 2005 (scale same a..."
8,credit_card_default,pay_3,FLOAT64,"Repayment status in July, 2005 (scale same as ..."
9,credit_card_default,pay_4,FLOAT64,"Repayment status in June, 2005 (scale same as ..."


## Let's look at a sample of the data

In [19]:
# Define the query
query = f"""
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.credit_card_default` 

  LIMIT 10
"""

# Run the query, and return a pandas DataFrame
data = client.query(query).to_dataframe()

# Preview the data
data

Unnamed: 0,id,limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt_5,bill_amt_6,pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6,default_payment_next_month,predicted_default_payment_next_month
0,27502.0,80000.0,1,6,1,54.0,0.0,0.0,0.0,0.0,...,26210.0,17643.0,2545.0,2208.0,1336.0,2232.0,542.0,348.0,1,"[{'tables': {'score': 0.8667634129524231, 'val..."
1,26879.0,200000.0,1,4,1,49.0,0.0,0.0,0.0,0.0,...,50235.0,48984.0,1689.0,2164.0,2500.0,3480.0,2500.0,3000.0,0,"[{'tables': {'score': 0.9351968765258789, 'val..."
2,18340.0,20000.0,2,6,2,22.0,0.0,0.0,0.0,0.0,...,500.0,0.0,4641.0,1019.0,900.0,0.0,1500.0,0.0,1,"[{'tables': {'score': 0.8572560548782349, 'val..."
3,13692.0,260000.0,2,4,2,33.0,0.0,0.0,0.0,0.0,...,30767.0,29890.0,5000.0,5000.0,1137.0,5000.0,1085.0,5000.0,0,"[{'tables': {'score': 0.9690881371498108, 'val..."
4,20405.0,150000.0,1,4,2,32.0,0.0,0.0,0.0,-1.0,...,143375.0,146411.0,4019.0,146896.0,157436.0,4600.0,4709.0,5600.0,0,"[{'tables': {'score': 0.9349926710128784, 'val..."
5,3882.0,300000.0,2,4,2,32.0,0.0,0.0,0.0,0.0,...,-450.0,700.0,15235.0,1491.0,1303.0,0.0,2000.0,1400.0,0,"[{'tables': {'score': 0.9530552625656128, 'val..."
6,7227.0,130000.0,1,1,1,45.0,0.0,0.0,0.0,0.0,...,63832.0,65099.0,2886.0,2908.0,2129.0,2354.0,2366.0,2291.0,0,"[{'tables': {'score': 0.9030028581619263, 'val..."
7,1379.0,200000.0,1,1,1,58.0,0.0,0.0,0.0,0.0,...,126921.0,129167.0,7822.0,4417.0,4446.0,4597.0,4677.0,4698.0,0,"[{'tables': {'score': 0.8636506199836731, 'val..."
8,29477.0,500000.0,1,1,1,39.0,0.0,0.0,0.0,0.0,...,137406.0,204975.0,54209.0,4607.0,4603.0,5224.0,207440.0,7509.0,0,"[{'tables': {'score': 0.9399265050888062, 'val..."
9,10643.0,230000.0,1,1,1,48.0,0.0,0.0,0.0,0.0,...,108101.0,110094.0,7000.0,6607.0,3773.0,4290.0,4164.0,2000.0,0,"[{'tables': {'score': 0.8917242884635925, 'val..."


### Create Training Dataset
Let's create the query that selects the fields we want to use as features in our classification model.
We exclude the 'id' field, which is just an identifier and carries no predictive signal, and the 'predicted_default_payment_next_month' field, which holds predictions already included in the raw data.

In [25]:

train_data_query = f"""
SELECT
  limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,
  pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6,default_payment_next_month
FROM
  `bigquery-public-data.ml_datasets.credit_card_default` 

LIMIT 2000
"""
# Run the query
data = client.query(train_data_query).to_dataframe()

# Preview the data
data

Unnamed: 0,limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6,default_payment_next_month
0,80000.0,1,6,1,54.0,0.0,0.0,0.0,0.0,26210.0,17643.0,2545.0,2208.0,1336.0,2232.0,542.0,348.0,1
1,200000.0,1,4,1,49.0,0.0,0.0,0.0,0.0,50235.0,48984.0,1689.0,2164.0,2500.0,3480.0,2500.0,3000.0,0
2,20000.0,2,6,2,22.0,0.0,0.0,0.0,0.0,500.0,0.0,4641.0,1019.0,900.0,0.0,1500.0,0.0,1
3,260000.0,2,4,2,33.0,0.0,0.0,0.0,0.0,30767.0,29890.0,5000.0,5000.0,1137.0,5000.0,1085.0,5000.0,0
4,150000.0,1,4,2,32.0,0.0,0.0,0.0,-1.0,143375.0,146411.0,4019.0,146896.0,157436.0,4600.0,4709.0,5600.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,180000.0,2,3,1,39.0,1.0,-2.0,-2.0,-1.0,300.0,150.0,0.0,0.0,300.0,0.0,0.0,645.0,0
1996,20000.0,2,2,2,24.0,3.0,2.0,3.0,2.0,20266.0,20511.0,2000.0,1000.0,2000.0,0.0,700.0,0.0,1
1997,10000.0,1,3,2,46.0,2.0,2.0,3.0,3.0,1050.0,1050.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1998,220000.0,2,1,1,41.0,7.0,6.0,5.0,4.0,225044.0,222356.0,0.0,0.0,0.0,0.0,0.0,6000.0,1
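
Because `LIMIT` without `ORDER BY` returns an arbitrary subset of rows, the training rows can differ from run to run. One possible alternative, sketched below, is to split on a hash of `id` so that a given row always lands in the same split. The helper name `build_split_query`, the 80/20 ratio, and the bucket count are illustrative assumptions and not part of the original notebook; `FARM_FINGERPRINT` is a BigQuery hash function. The sketch only builds the SQL text and does not call BigQuery.

```python
# Columns used as features throughout this notebook.
FEATURES = [
    "limit_balance", "sex", "education_level", "marital_status", "age",
    "pay_0", "pay_2", "pay_3", "pay_4", "bill_amt_5", "bill_amt_6",
    "pay_amt_1", "pay_amt_2", "pay_amt_3", "pay_amt_4", "pay_amt_5", "pay_amt_6",
]
LABEL = "default_payment_next_month"
TABLE = "bigquery-public-data.ml_datasets.credit_card_default"


def build_split_query(split: str, train_buckets: int = 8, total_buckets: int = 10) -> str:
    """Return a SELECT for the 'train' or 'eval' split (hypothetical helper)."""
    if split not in ("train", "eval"):
        raise ValueError("split must be 'train' or 'eval'")
    # Hash the id into one of total_buckets buckets; the same id always
    # maps to the same bucket, so the split is reproducible.
    bucket = f"MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), {total_buckets})"
    op = "<" if split == "train" else ">="
    columns = ", ".join(FEATURES + [LABEL])
    return (
        f"SELECT {columns}\n"
        f"FROM `{TABLE}`\n"
        f"WHERE {bucket} {op} {train_buckets}"
    )


train_split_query = build_split_query("train")
eval_split_query = build_split_query("eval")
print(train_split_query)
```

The resulting strings could be passed to `client.query(...)` in the same way as the other queries in this notebook.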


## Train a Classification Model

#### Let's first create the BigQuery dataset where the model will reside

In [23]:
# Define the query statement
query = f"""
CREATE SCHEMA IF NOT EXISTS `{Project_ID}.credit_default_dataset`
"""

# Create the dataset
dataset_status = client.query(query).to_dataframe()


### Create the BQML Model

It's time to train our classification model! We'll use logistic regression for this task.

Model Type = logistic regression

Model name = credit_default_classification_model

In [26]:
query = f"""
CREATE OR REPLACE MODEL `{Project_ID}.credit_default_dataset.credit_default_classification_model`

OPTIONS(model_type='logistic_reg') AS

SELECT
  limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6,
  default_payment_next_month AS label
FROM
  `bigquery-public-data.ml_datasets.credit_card_default` 

LIMIT 2000
"""

model_status = client.query(query).to_dataframe()
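
For intuition, a logistic regression model turns a weighted sum of the features into a probability with the sigmoid function. The sketch below illustrates that computation in plain Python. The weights, intercept, and feature values are made up for illustration and are not the coefficients BigQuery ML learned; BigQuery ML also applies its own feature preprocessing, which this sketch omits.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features: dict, weights: dict, intercept: float) -> float:
    """Probability of the positive class for one row."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical, hand-picked numbers for illustration only.
example_weights = {"pay_0": 0.8, "pay_2": 0.3}
example_row = {"pay_0": 2.0, "pay_2": 2.0}
probability = predict_probability(example_row, example_weights, intercept=-1.5)
print(round(probability, 3))
```

A weighted sum of zero maps to a probability of exactly 0.5, which is why 0.5 is the natural cut-off between the two classes.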




## Evaluate Model

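Now let's evaluate our model's performance. The query below scores the model on the full table, which includes the rows it was trained on, so the metrics are likely more optimistic than they would be on a true held-out validation set.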


In [30]:



query = f"""
SELECT
  *
FROM
  ml.EVALUATE(MODEL `{Project_ID}.credit_default_dataset.credit_default_classification_model`, (
SELECT
  limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6,
  default_payment_next_month AS label
FROM
  `bigquery-public-data.ml_datasets.credit_card_default` ))
"""

model_eval = client.query(query).to_dataframe()
model_eval



Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.705224,0.297638,0.822934,0.418605,0.459728,0.742173
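
The reported `f1_score` is the harmonic mean of precision and recall, which can be checked by hand from the values in the table above. Note the gap between the two: a recall of about 0.30 means that, at the threshold used for this evaluation, the model flags only around 30% of the actual defaulters, even though accuracy is about 0.82.

```python
# Metrics copied from the ML.EVALUATE output above.
precision = 0.705224
recall = 0.297638

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # prints 0.4186
```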


## Generate predictions based on the model

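Finally, let's use our trained model to generate predictions. For simplicity, the query below scores every row of the source table, including the training rows; in practice you would pass new, unseen data to `ML.PREDICT`.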

In [34]:
query = f"""
SELECT
  *
FROM
  ml.PREDICT(MODEL `{Project_ID}.credit_default_dataset.credit_default_classification_model`, (
SELECT
  id,limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,
  pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6
FROM
  `bigquery-public-data.ml_datasets.credit_card_default`
  ))
"""

predictions = client.query(query).to_dataframe()
predictions

Unnamed: 0,predicted_label,predicted_label_probs,id,limit_balance,sex,education_level,marital_status,age,pay_0,pay_2,pay_3,pay_4,bill_amt_5,bill_amt_6,pay_amt_1,pay_amt_2,pay_amt_3,pay_amt_4,pay_amt_5,pay_amt_6
0,1,"[{'label': '1', 'prob': 0.5503120625288049}, {...",27502.0,80000.0,1,6,1,54.0,0.0,0.0,0.0,0.0,26210.0,17643.0,2545.0,2208.0,1336.0,2232.0,542.0,348.0
1,0,"[{'label': '1', 'prob': 0.05439527492938705}, ...",26879.0,200000.0,1,4,1,49.0,0.0,0.0,0.0,0.0,50235.0,48984.0,1689.0,2164.0,2500.0,3480.0,2500.0,3000.0
2,1,"[{'label': '1', 'prob': 0.5229681649178964}, {...",18340.0,20000.0,2,6,2,22.0,0.0,0.0,0.0,0.0,500.0,0.0,4641.0,1019.0,900.0,0.0,1500.0,0.0
3,0,"[{'label': '1', 'prob': 0.04147357145078129}, ...",13692.0,260000.0,2,4,2,33.0,0.0,0.0,0.0,0.0,30767.0,29890.0,5000.0,5000.0,1137.0,5000.0,1085.0,5000.0
4,0,"[{'label': '1', 'prob': 0.016160656045624413},...",20405.0,150000.0,1,4,2,32.0,0.0,0.0,0.0,-1.0,143375.0,146411.0,4019.0,146896.0,157436.0,4600.0,4709.0,5600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2960,0,"[{'label': '1', 'prob': 0.026304152058040865},...",13325.0,80000.0,2,3,2,28.0,-1.0,-1.0,-1.0,-2.0,0.0,0.0,2800.0,0.0,0.0,0.0,0.0,0.0
2961,0,"[{'label': '1', 'prob': 0.04444552335907505}, ...",139.0,50000.0,2,3,1,51.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,300.0,5880.0,0.0,0.0,0.0,0.0
2962,0,"[{'label': '1', 'prob': 0.007967846079960766},...",26185.0,450000.0,2,2,1,38.0,-2.0,-2.0,-2.0,-2.0,390.0,390.0,390.0,780.0,390.0,390.0,390.0,390.0
2963,0,"[{'label': '1', 'prob': 0.014751321959732954},...",1900.0,50000.0,2,2,1,44.0,-2.0,-2.0,-2.0,-2.0,390.0,0.0,390.0,390.0,390.0,390.0,0.0,780.0
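
Each row's `predicted_label_probs` holds a list of label/probability pairs, as the output above shows. The sketch below pulls out the probability of default (label `'1'`) and applies a custom threshold. The helper name `default_probability`, the 0.3 threshold, and the sample rows are illustrative assumptions; the sample probabilities are made-up values and not rows from the table.

```python
def default_probability(label_probs, positive_label="1"):
    """Return the probability attached to the positive label, or None if absent."""
    for entry in label_probs:
        # Compare as strings so integer and string labels are both handled.
        if str(entry["label"]) == str(positive_label):
            return float(entry["prob"])
    return None


# Hypothetical rows shaped like the predicted_label_probs column.
sample_rows = [
    [{"label": "1", "prob": 0.55}, {"label": "0", "prob": 0.45}],
    [{"label": "1", "prob": 0.05}, {"label": "0", "prob": 0.95}],
    [{"label": "1", "prob": 0.35}, {"label": "0", "prob": 0.65}],
]

threshold = 0.3  # assumed value; a lower threshold trades precision for recall
probs = [default_probability(row) for row in sample_rows]
flags = [p is not None and p >= threshold for p in probs]
print(list(zip(probs, flags)))
```

With a real result, the same helper could likely be applied to the DataFrame column, for example `predictions["predicted_label_probs"].apply(default_probability)`.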


Congratulations on reaching the end of this chapter! We've successfully explored the dataset, prepared the data, trained a logistic regression model, evaluated it, and generated predictions, all using BigQuery ML.