Copyright 2026 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Dataplex Knowledge Engine Demo
## This notebook walks through how Dataplex Knowledge Engine self-heals and learns complex table relationships.

First set up all of the environment variables you'll use.

In [None]:
import os

# Set the environment variables
os.environ['PROJECT_ID'] = 'your-knowledge-engine-demo'
os.environ['BQ_DATASET_ID'] = 'knowledge_engine_demo'
os.environ['BQ_DATASET_LOCATION'] = 'US'

# Pull the variable back into Python
bq_dataset_id = os.environ.get('BQ_DATASET_ID')
project_id = os.environ.get('PROJECT_ID')
data_scan_id = 'knowledge-engine-data-scan'

Before you begin, [ensure that all of the Dataplex DataScan permissions are granted and the services are enabled](https://docs.cloud.google.com/bigquery/docs/data-insights#roles). You may also need additional permissions to [create datasets and tables](https://docs.cloud.google.com/bigquery/docs/datasets#before_you_begin) if you haven't already obtained them.

## Next, create a BigQuery dataset. If you already have one, you can skip this step.

In [None]:
# Use the bq command to create the dataset
# The ! prefix allows you to run shell commands in a Colab cell.
!bq mk --location=$BQ_DATASET_LOCATION $BQ_DATASET_ID

## You will use the BigQuery Python SDK to interact with BigQuery. This keeps the BigQuery steps simple, since the notebook uses both BigQuery and REST APIs.

In [None]:
from google.cloud import bigquery

# Initialize the BigQuery client
client = bigquery.Client()

Create three tables whose names are either not very descriptive or are legacy names. Don't add any other metadata.

In [None]:
sql_query = f"""
-- Create the central table
CREATE TABLE IF NOT EXISTS `{bq_dataset_id}.central` (
    platform_id STRING,
    last_login TIMESTAMP,
    level INT64,
    game_codes ARRAY<INT64>
);

-- Create the acquired_device table
CREATE TABLE IF NOT EXISTS `{bq_dataset_id}.acquired_device` (
    device_id STRING,
    gamer_tag STRING,
    infraction_level INT64,
    status STRING
);

-- Create the games table
CREATE TABLE IF NOT EXISTS `{bq_dataset_id}.games` (
    game_id STRING,
    title STRING,
    description STRING,
    price FLOAT64,
    status STRING
);
"""

# Execute the query
query_job = client.query(sql_query)

# Wait for the job to complete
query_job.result()

print(f"Tables created successfully in dataset: {bq_dataset_id}")

## Create your insight scan.

This notebook was created with BigQuery Studio notebooks in mind. If you run it elsewhere, make sure you have authenticated with Google Cloud.



First, create the function that creates the scan. This step isn't repeated in the lab, but others are, so each step is implemented as a function. Knowledge Engine is the intelligence layer that is activated when a Dataplex DataScan is configured with the `DATA_DOCUMENTATION` type.

In [None]:
import os
import requests
import google.auth
from google.auth.transport.requests import Request

def create_insight_scan(
        project_id,
        dataset_name='knowledge_engine_demo',
        scan_id='knowledge-engine-data-scan',
        location="us-central1"):
    """Creates an insights scan for a BigQuery dataset."""

    # 1. Get credentials and Refresh Token
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform']
    )
    auth_request = Request()
    credentials.refresh(auth_request)

    # 2. Construct the API URL
    url = f"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{location}/dataScans"
    params = {"dataScanId": scan_id}

    # 3. Define the JSON Payload
    payload = {
        "data": {
            "resource": f"//bigquery.googleapis.com/projects/{project_id}/datasets/{dataset_name}"
        },
        "executionSpec": {
            "trigger": {
                "onDemand": {}
            }
        },
        "type": "DATA_DOCUMENTATION",
        "dataDocumentationSpec": {}
    }

    # 4. Set headers with the Bearer Token
    headers = {
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json"
    }

    # 5. Execute the request
    response = requests.post(url, headers=headers, params=params, json=payload)

    if response.status_code == 200:
        print(f"Successfully created scan: {scan_id}")
        return response.json()
    else:
        print(f"Error {response.status_code}: {response.text}")
        response.raise_for_status()


Execute the scan creation function.

In [None]:
create_insight_scan(project_id, bq_dataset_id, data_scan_id)
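
If you re-run the cell above after the scan already exists, the create call is expected to fail rather than return 200. The sketch below shows one way to tell "already exists" apart from a real failure. It assumes the API follows the usual Google Cloud convention of returning HTTP 409 for `ALREADY_EXISTS`; `handle_create_response` is a hypothetical helper, not part of the notebook or of any library.

```python
def handle_create_response(status_code, body_text, scan_id):
    """Maps the HTTP status of a create call to 'created' or 'exists'.

    Hypothetical helper. Assumes HTTP 409 means a scan with this ID
    already exists; any other non-200 status is treated as a failure.
    """
    if status_code == 200:
        return "created"
    if status_code == 409:
        # Assumption: 409 corresponds to ALREADY_EXISTS, so the scan can be reused.
        return "exists"
    raise RuntimeError(
        f"Creating scan {scan_id} failed with {status_code}: {body_text}"
    )
```

Inside `create_insight_scan`, this could be called as `handle_create_response(response.status_code, response.text, scan_id)` in place of the status check, so that a second run of the notebook continues with the existing scan.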

Create the function to trigger the run of the scan.

In [None]:
def run_dataplex_data_scan(
        project_id,
        scan_id,
        location="us-central1"):
    """Triggers an execution of an existing insight scan."""

    # 1. Fetch Google Cloud Credentials
    # This automatically finds credentials in your environment (ADC)
    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform']
    )

    # 2. Refresh the token
    auth_request = Request()
    credentials.refresh(auth_request)

    # 3. Construct the 'Run' endpoint URL
    # Note: The :run is a custom method in the Google API
    url = f"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{location}/dataScans/{scan_id}:run"

    headers = {
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json"
    }

    # 4. Execute the POST request
    try:
        # The run endpoint doesn't require a body for on-demand triggers
        response = requests.post(url, headers=headers)

        if response.status_code == 200:
            result = response.json()
            job_id = result.get("job", {}).get("name", "Unknown Job ID")
            print(f"Successfully triggered scan: {scan_id}")
            print(f"Job ID: {job_id}")
            return result
        else:
            print(f"Error {response.status_code}: {response.text}")
            response.raise_for_status()

    except Exception as e:
        print(f"An error occurred while starting the scan: {e}")
        return None




Trigger the scan run.

In [None]:
run_dataplex_data_scan(project_id, data_scan_id)

You can also monitor the status of the job so that you aren't guessing when it's done. You won't execute it here, but you would use the [API to list](https://docs.cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans/list) the scans and [retrieve the status in the DataScan object](https://docs.cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans#DataScan).
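
As an optional illustration of that monitoring, here is a minimal polling sketch. It takes a different route from the list call mentioned above: it assumes the job resource name returned by the run call (the `job.name` field read in `run_dataplex_data_scan`) can be fetched directly and that the returned job carries a `state` field with values such as `RUNNING` or `SUCCEEDED`. `fetch_job_state` and `wait_for_job` are hypothetical helper names, and the polling interval and timeout are arbitrary.

```python
import json
import time
import urllib.request

# Assumption: these are the states after which a job no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def fetch_job_state(job_name, token):
    """Fetches a scan job by its full resource name and returns its state.

    job_name is assumed to look like
    projects/.../locations/.../dataScans/.../jobs/...
    """
    req = urllib.request.Request(
        f"https://dataplex.googleapis.com/v1/{job_name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("state", "STATE_UNSPECIFIED")

def wait_for_job(fetch_state, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Calls fetch_state() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing the state lookup in as a callable keeps the loop independent of how the token is obtained. In this notebook, `fetch_state` could be a lambda that calls `fetch_job_state` with the job name from the run response and a refreshed access token obtained the same way as in the functions above.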



Create a function to fetch the insight scan results.

In [None]:

import json
from datetime import datetime

def get_datascan_insights(project_id, scan_id, location="us-central1"):
    """Retrieves full details of a scan and saves them to a file."""

    # 1. Authentication
    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform']
    )
    credentials.refresh(Request())

    # 2. Construct URL with the 'FULL' view parameter
    url = f"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{location}/dataScans/{scan_id}"
    params = {"view": "FULL"}

    headers = {
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json"
    }

    # 3. Execute GET request
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        scan_data = response.json()

        # 4. Create a timestamped filename
        timestamp = int(datetime.now().timestamp())
        filename = f"get-insights-results-{timestamp}.json"

        # 5. Save results to a file
        with open(filename, 'w') as f:
            json.dump(scan_data, f, indent=2)

        print(f"Insights successfully saved to {filename}")
        return scan_data

    except Exception as e:
        print(f"Error retrieving insights: {e}")
        return None



Get the insights and save them to a variable.

In [None]:
insights = get_datascan_insights(project_id, data_scan_id)

Use the display function to pretty print the JSON.

In [None]:
from IPython.display import JSON

# Assuming 'insights' is the variable returned by your function
JSON(insights)

In [None]:
# Accessing the specific table insights path
table_insights = insights['dataDocumentationResult']['datasetResult']['tableResults']

# Pretty print the list of table insights
print(json.dumps(table_insights, indent=2))

You should get something along the lines of:

```
     {
        "sql": "WITH PlatformLogin AS (SELECT platform_id, last_login, ROW_NUMBER() OVER (PARTITION BY platform_id ORDER BY last_login) AS login_rank FROM `{your-project}.knowledge_engine_demo.central`), AvgInfraction AS (SELECT c.platform_id, AVG(a.infraction_level) AS avg_infraction_level FROM `{your-project}.knowledge_engine_demo.central` c JOIN `{your-project}.knowledge_engine_demo.acquired_device` a ON c.platform_id = a.gamer_tag GROUP BY 1) SELECT p.platform_id, p.last_login, a.avg_infraction_level FROM PlatformLogin p JOIN AvgInfraction a ON p.platform_id = a.platform_id ORDER BY p.platform_id, p.last_login;",
        "description": "What is the distribution of last login timestamps across different platform IDs, and how does it relate to the average infraction level for acquired devices associated with those platform IDs?"
      },

```

The problem is `ON c.platform_id = a.gamer_tag`. Although this is a reasonable, intuitive guess, it's incorrect: `gamer_tag` and `platform_id` are not related keys.

Dataplex Knowledge Engine can self-heal and discover relationships. You can give it context by running queries. Run each query several times, because a single query might be treated as an outlier; repetition helps ensure the pattern is recognized.

In [None]:
# Query 1: Analyze progression vs infractions

import time

sql_infraction_impact = f"""
### progression vs infractions
SELECT
    a.infraction_level,
    AVG(c.level) AS avg_user_level,
    COUNT(DISTINCT c.platform_id) AS user_count
FROM
    `{bq_dataset_id}.central` AS c
JOIN
    `{bq_dataset_id}.acquired_device` AS a ON c.platform_id = a.device_id
GROUP BY
    1
ORDER BY
    infraction_level DESC
"""

# Repeat the query so the join pattern registers as more than a one-off
for _ in range(10):
    df_infractions = client.query(sql_infraction_impact).to_dataframe()
    time.sleep(1)


In [None]:
# Query 2: Calculate game value per device status
sql_game_value = f"""
### game value per device status
SELECT
    a.status AS device_status,
    g.title,
    COUNT(c.platform_id) AS player_count,
    SUM(g.price) AS potential_revenue_impact
FROM
    `{bq_dataset_id}.central` AS c,
    UNNEST(c.game_codes) AS individual_game_code
JOIN
    `{bq_dataset_id}.games` AS g ON CAST(individual_game_code AS STRING) = g.game_id
JOIN
    `{bq_dataset_id}.acquired_device` AS a ON c.platform_id = a.device_id
GROUP BY
    1, 2
ORDER BY
    potential_revenue_impact DESC
"""

# Repeat the query so the join pattern registers as more than a one-off
for _ in range(10):
    df_revenue = client.query(sql_game_value).to_dataframe()
    time.sleep(1)

You can now trigger the scan again and retrieve the results. The 2-minute sleep is usually not necessary, but it's there as a safeguard.

In [None]:
run_dataplex_data_scan(project_id, data_scan_id)
time.sleep(120)
new_insights = get_datascan_insights(project_id, data_scan_id)

In [None]:
from IPython.display import JSON

# 'new_insights' holds the results retrieved after the second scan run
JSON(new_insights)

In the new scan results, you now have the correct join!


```
      "queries": [
        {
          "sql": "SELECT CORR(t1.level, t2.infraction_level) AS correlation FROM `{your-project}.knowledge_engine_demo.central` AS t1 INNER JOIN `{your-project}.knowledge_engine_demo.acquired_device` AS t2 ON t1.platform_id = t2.device_id;",
          "description": "Calculate the correlation between user level and infraction level, joining central and acquired_device tables on platform_id and device_id respectively, to understand if higher level players are less likely to have infractions."
        },
```
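
To compare the two scans without reading the JSON by eye, you could pull the join predicates out of every generated query. The sketch below is a rough check, not a SQL parser: it walks the result for any `"sql"` string and uses a regular expression that only recognizes simple `alias.column = alias.column` conditions after `ON`. `collect_sql` and `join_predicates` are hypothetical helper names.

```python
import re

# Matches simple predicates such as "ON t1.platform_id = t2.device_id".
JOIN_PREDICATE = re.compile(r"\bON\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)", re.IGNORECASE)

def collect_sql(node):
    """Recursively yields every string stored under a "sql" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "sql" and isinstance(value, str):
                yield value
            else:
                yield from collect_sql(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_sql(item)

def join_predicates(scan_result):
    """Returns the set of (left_column, right_column) pairs used in ON clauses."""
    found = set()
    for sql in collect_sql(scan_result):
        for left, right in JOIN_PREDICATE.findall(sql):
            # Drop the table alias and keep only the column name.
            found.add((left.split(".")[1].lower(), right.split(".")[1].lower()))
    return found
```

Calling `join_predicates(insights)` and `join_predicates(new_insights)` would then let you check whether the `platform_id` to `gamer_tag` pairing from the first scan has been replaced by `platform_id` to `device_id`.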

To clean up the dataset, execute the following.



> **DO NOT EXECUTE IT IF YOU DO NOT INTEND TO DROP THE DATASET!**



In [None]:
client.delete_dataset(
    REPLACE_THIS_WITH_YOUR_DATASET_BUT_THE_READ_ABOVE, delete_contents=True, not_found_ok=True
)