---
title: Automatically apply sensitivity tags in Data Catalog to Files, Databases and BigQuery Tables using Cloud Data Loss Prevention and Dataflow
description: Identify sensitive information in Database and BigQuery tables as well as files automatically, and apply sensitivity tags in Data Catalog.
author: anantdamle
tags: data governance, DLP, sensitivity categorization, data migration, JDBC, Avro, Parquet
---

Anant Damle | Solutions Architect | Google

Contributed by Google employees.

This tutorial demonstrates how to identify and apply data sensitivity tags for your data in Cloud Storage files, relational databases (such as MySQL and PostgreSQL), and BigQuery using Data Catalog with an automated Dataflow pipeline.

This pipeline uses Cloud Data Loss Prevention (Cloud DLP) to detect sensitive data like personally identifiable information (PII), followed by tagging the findings in Data Catalog.

The solution described in this document builds on the architecture of the file-based tokenizing solution described in the companion document. The primary difference is that the current document describes a solution that additionally creates a Data Catalog entry and tags for DLP findings, and adds the capability to inspect relational databases using JDBC connections.

This document is intended for a technical audience whose responsibilities include data security, data governance, data processing, or data analytics. This document assumes that you're familiar with data processing and data privacy, but you don't need to be an expert. It also assumes some familiarity with shell scripts and basic knowledge of Google Cloud.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • Cloud Data Loss Prevention
  • Dataflow
  • Compute Engine
  • Cloud Storage
  • Cloud SQL
  • BigQuery
  • Data Catalog

Use the pricing calculator to generate a cost estimate based on your projected usage.

Architecture

The solution described in this tutorial comprises a pipeline that extracts data from one of the supported sources, samples the records, and processes them through the Cloud Data Loss Prevention (Cloud DLP) API to identify sensitivity categories.

(Architecture diagram: sampling, Cloud DLP inspection, and Data Catalog tagging)

The solution uses a JDBC connection to access relational databases. When using BigQuery tables as a data source, the solution uses the BigQuery Storage API to improve load times.

The sample-and-identify pipeline outputs the following files to Cloud Storage:

  • The Avro schema (or an Avro-equivalent schema) of the source
  • The detected infoTypes for each of the input columns

This solution uses record flattening to handle nested and repeated fields in records.

Before you begin

For this tutorial, you need a Google Cloud project. To make cleanup easiest at the end of the tutorial, we recommend that you create a new project.

  1. Create a Google Cloud project.

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Open Cloud Shell.

    At the bottom of the Cloud Console, a Cloud Shell session opens and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  4. Enable APIs for Cloud Build, Compute Engine, Data Catalog, Dataflow, Cloud DLP, Cloud SQL, Secret Manager, and Cloud Storage:

    gcloud services enable \
    cloudbuild.googleapis.com \
    compute.googleapis.com \
    datacatalog.googleapis.com \
    dataflow.googleapis.com \
    dlp.googleapis.com \
    sql-component.googleapis.com \
    sqladmin.googleapis.com \
    secretmanager.googleapis.com \
    storage.googleapis.com
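
    You can optionally confirm that the services are enabled; a quick check using grep (adjust the pattern to the services you care about):

    gcloud services list --enabled --project="${PROJECT_ID}" | grep -E 'dlp|dataflow|datacatalog|sqladmin|secretmanager'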
    

Setting up your environment

  1. In Cloud Shell, clone the source repository and go to the directory for this tutorial:

    git clone https://github.com/GoogleCloudPlatform/auto-data-tokenize.git
    cd auto-data-tokenize/
    
  2. Use a text editor to modify the set_variables.sh file to set the required environment variables:

    # The Google Cloud project to use for this tutorial
    export PROJECT_ID="[YOUR_PROJECT_ID]"
    
    # The Compute Engine region to use for running Dataflow jobs and create a
    # temporary storage bucket
    export REGION_ID="[YOUR_COMPUTE_ENGINE_REGION]"
    
    # The Cloud Storage bucket to use as a temporary bucket for Dataflow
    export TEMP_GCS_BUCKET="[CLOUD_STORAGE_BUCKET_NAME]"
    
    # Name of the service account to use (not the email address)
    export DLP_RUNNER_SERVICE_ACCOUNT_NAME="[SERVICE_ACCOUNT_NAME_FOR_RUNNER]"
    
    # Fully Qualified Entry Group Id to use for creating/searching for Entries
    # in Data Catalog for non-BigQuery entries.
    export DATA_CATALOG_ENTRY_GROUP_ID="[NON_BIGQUERY_DATA_CATALOG_ENTRY_GROUP]"
    
    # The fully qualified Data Catalog Tag Template Id to use
    # for creating sensitivity tags in Data Catalog.
    export INSPECTION_TAG_TEMPLATE_ID="[DATA_CATALOG_TAG_TEMPLATE_NAME]"
    
  3. Run the script to set the environment variables:

    source set_variables.sh
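
    Optionally, echo a few of the variables as a quick sanity check (the set_variables.sh script may also derive additional variables, such as the service account email used later):

    echo "${PROJECT_ID} ${REGION_ID} ${TEMP_GCS_BUCKET} ${DLP_RUNNER_SERVICE_ACCOUNT_NAME}"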
    

Creating resources

The tutorial uses the following resources:

  • A service account to run Dataflow pipelines, enabling fine-grained access control
  • A Cloud Storage bucket for temporary data storage and test data
  • A Data Catalog Tag Template to attach sensitivity tags to entries
  • A MySQL on Cloud SQL instance as JDBC source

Create service accounts

We recommend that you run pipelines with fine-grained access control to improve access partitioning. If your project doesn't have a user-created service account, create one using the following instructions.

You can also create a service account in your browser by going to the Service accounts page in the Cloud Console.

  1. Create a service account to use as the user-managed controller service account for Dataflow:

    gcloud iam service-accounts create ${DLP_RUNNER_SERVICE_ACCOUNT_NAME} \
    --project="${PROJECT_ID}" \
    --description="Service Account for Sampling and Cataloging pipelines." \
    --display-name="Sampling and Cataloging pipelines"
    
  2. Create a custom role with required permissions for accessing Cloud DLP, Dataflow, Cloud SQL, and Data Catalog:

    export SAMPLING_CATALOGING_ROLE_NAME="sampling_cataloging_runner"
    
    gcloud iam roles create ${SAMPLING_CATALOGING_ROLE_NAME} \
    --project=${PROJECT_ID} \
    --file=tokenizing_runner_permissions.yaml
    
  3. Apply the custom role to the service account:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
    --role=projects/${PROJECT_ID}/roles/${SAMPLING_CATALOGING_ROLE_NAME}
    
  4. Assign the dataflow.worker role to allow the service account to run as a Dataflow worker:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
    --role=roles/dataflow.worker
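
    You can verify the roles granted to the service account with the standard IAM policy filtering pattern:

    gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
    --format="table(bindings.role)"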
    

Create the Cloud Storage bucket

Create a Cloud Storage bucket for storing test data and as the Dataflow staging location:

gsutil mb -p ${PROJECT_ID} -l ${REGION_ID} "gs://${TEMP_GCS_BUCKET}"

Create the inspection tag template

Create a Data Catalog tag template to allow tagging entries with sensitivity information extracted using Cloud DLP:

gcloud data-catalog tag-templates create ${INSPECTION_TAG_TEMPLATE_ID} \
--project="${PROJECT_ID}" \
--location="${REGION_ID}" \
--display-name="Auto DLP sensitive categories" \
--field=id=infoTypes,type=string,display-name="DLP InfoTypes",required=TRUE \
--field=id=inspectTimestamp,type=timestamp,display-name="Inspection run timestamp",required=TRUE
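
You can confirm that the tag template was created with a quick check:

gcloud data-catalog tag-templates describe "${INSPECTION_TAG_TEMPLATE_ID}" \
--project="${PROJECT_ID}" \
--location="${REGION_ID}"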

Create a MySQL on Cloud SQL instance

You will use a Cloud SQL instance as the source of your data. Create a MySQL on Cloud SQL instance and load it with sample data.

export SQL_INSTANCE="mysql-autodlp-instance"

gcloud sql instances create "${SQL_INSTANCE}" \
--project="${PROJECT_ID}" \
--region="${REGION_ID}" \
--database-version=MYSQL_5_7 \
--root-password=root1234

Note: It can take some time to spin up a new instance.
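
To check whether the instance is ready, poll its state; the instance is ready when the state is RUNNABLE:

gcloud sql instances describe "${SQL_INSTANCE}" \
--project="${PROJECT_ID}" \
--format="value(state)"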

Copy test data to Cloud SQL instance

You can use your own dataset or copy the included demonstration dataset (contacts5k.sql.gz).

  1. Copy the sample dataset to Cloud Storage for staging into Cloud SQL:

    gsutil cp contacts5k.sql.gz gs://${TEMP_GCS_BUCKET}
    
  2. Create a new database in the Cloud SQL instance:

    export DATABASE_ID="auto_dlp_test"
    
    gcloud sql databases create "${DATABASE_ID}" \
    --project="${PROJECT_ID}" \
    --instance="${SQL_INSTANCE}"
    
  3. Grant the Cloud SQL service account access to Cloud Storage so that it can import data:

    export SQL_SERVICE_ACCOUNT=$(gcloud sql instances describe "${SQL_INSTANCE}" --project="${PROJECT_ID}" --format="value(serviceAccountEmailAddress)")
    
    gsutil iam ch "serviceAccount:${SQL_SERVICE_ACCOUNT}:objectAdmin" gs://${TEMP_GCS_BUCKET}
    

    Learn more about Importing data into Cloud SQL.

  4. Load the data into a new table:

    gcloud sql import sql "${SQL_INSTANCE}" \
    "gs://${TEMP_GCS_BUCKET}/contacts5k.sql.gz" \
    --project="${PROJECT_ID}" \
    --database="${DATABASE_ID}"
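
    Optionally, verify the import by connecting to the database; the sample dataset creates a table named Contacts, and you will be prompted for the root password:

    # At the mysql prompt, run: SELECT COUNT(*) FROM Contacts;
    gcloud sql connect "${SQL_INSTANCE}" --user=root --database="${DATABASE_ID}"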
    

Compile modules

Build the executables for deploying the sample-and-identify and tokenize pipelines.

 ./gradlew clean buildNeeded shadowJar

Tip: To skip running the tests, you can add the -x test flag.
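
After the build completes, you can confirm that the shaded JAR (referenced by the --jar flag in the Flex Template build step below) exists:

ls -lh build/libs/autotokenize-all.jar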

Using the sample-and-identify pipeline

Run the sample-and-identify pipeline to identify sensitive columns in the data that you need to tokenize.

The pipeline extracts sampleSize records, flattens them, and identifies sensitive columns using Cloud DLP. Cloud DLP provides functionality to identify sensitive information types. Because the Cloud DLP identify method supports only flat tables, the pipeline flattens the Avro, Parquet, or BigQuery records, which can contain nested and repeated fields.

Create Data Catalog Entry Group

Data Catalog maintains a list of Entries that represent Google Cloud or other resources. Entries are kept under Entry Groups. An implicit Entry Group exists for BigQuery; for other types of resources, you need to create Entry Groups. Learn more about Data Catalog entries.

Create a new Entry Group to hold the Entry for your MySQL database:

export DATA_CATALOG_ENTRY_GROUP_ID="sql_databases"

gcloud data-catalog entry-groups create \
"${DATA_CATALOG_ENTRY_GROUP_ID}" \
--project="${PROJECT_ID}" \
--location="${REGION_ID}"

Note: Ensure that location is one of the Data Catalog regions.
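
You can confirm that the Entry Group was created:

gcloud data-catalog entry-groups describe "${DATA_CATALOG_ENTRY_GROUP_ID}" \
--project="${PROJECT_ID}" \
--location="${REGION_ID}"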

Create Dataflow Flex template

Dataflow templates allow you to use the Cloud Console, the gcloud command-line tool, or REST API calls to set up your pipelines on Google Cloud and run them. Classic templates are staged as execution graphs on Cloud Storage; Flex Templates bundle the pipeline as a container image in your project’s registry in Container Registry. This allows you to decouple building and running pipelines, as well as integrate with orchestration systems for daily execution. For more information, see Evaluating which template type to use in the Dataflow documentation.

Dataflow Flex templates make it possible to launch a Dataflow pipeline without having to compile code or access a development environment. Dataflow pipelines based on Flex templates can be started from Cloud Composer using DataflowStartFlexTemplateOperator.

  1. Define the location to store the template spec file containing all of the necessary information to run the job:

    export FLEX_TEMPLATE_PATH="gs://${TEMP_GCS_BUCKET}/dataflow/templates/sample-inspect-tag-pipeline.json"
    export FLEX_TEMPLATE_IMAGE="us.gcr.io/${PROJECT_ID}/dataflow/sample-inspect-tag-pipeline:latest"
    
  2. Build the Dataflow Flex template:

    gcloud dataflow flex-template build "${FLEX_TEMPLATE_PATH}" \
    --image-gcr-path="${FLEX_TEMPLATE_IMAGE}" \
    --service-account-email="${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
    --sdk-language="JAVA" \
    --flex-template-base-image=JAVA11 \
    --metadata-file="sample_identify_tag_pipeline_metadata.json" \
    --jar="build/libs/autotokenize-all.jar" \
    --env="FLEX_TEMPLATE_JAVA_MAIN_CLASS=\"com.google.cloud.solutions.autotokenize.pipeline.DlpInspectionPipeline\""
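
    After the build completes, you can inspect the generated template spec file, which is a small JSON document referencing the container image:

    gsutil cat "${FLEX_TEMPLATE_PATH}"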
    

Create a BigQuery table for inspection results

Create a BigQuery table to store the inspection results:

bq mk --dataset \
--location="${REGION_ID}" \
--project_id="${PROJECT_ID}" \
inspection_results

bq mk --table \
--project_id="${PROJECT_ID}" \
inspection_results.SensitivityInspectionResults  \
inspection_results_bigquery_schema.json

Run the sample-and-identify pipeline

The sampling and DLP identification pipeline performs the following steps:

  1. Extracts records from the provided source. Records are flattened if nested or repeated structures are identified.
  2. Samples the individual columns for the required number of samples, excluding null or empty values.
  3. Identifies sensitive infoTypes using Cloud DLP, by batching the samples into DLP-acceptable batch sizes.
  4. Writes the findings report to Cloud Storage for future reference.
  5. Exports to Data Catalog: when Tag Template and Entry Group information is provided, the pipeline also creates sensitivity tags on the appropriate columns of Data Catalog entries.
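
The launch command below reads the database password from Secret Manager through the jdbcPasswordSecretsKey parameter. If you haven't yet stored the password as a secret, the following is a minimal sketch, assuming a secret name of your choosing and the root1234 password set when the instance was created:

export SQL_PASSWORD_SECRET_NAME="mysql-autodlp-root-password"

# Store the MySQL root password as a new secret.
printf 'root1234' | gcloud secrets create "${SQL_PASSWORD_SECRET_NAME}" \
--project="${PROJECT_ID}" \
--replication-policy="automatic" \
--data-file=-

# Allow the pipeline service account to read the secret.
gcloud secrets add-iam-policy-binding "${SQL_PASSWORD_SECRET_NAME}" \
--project="${PROJECT_ID}" \
--member="serviceAccount:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
--role="roles/secretmanager.secretAccessor"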

Launch the sampling and DLP identification pipeline:

export CLOUD_SQL_JDBC_CONNECTION_URL="jdbc:mysql:///${DATABASE_ID}?cloudSqlInstance=${PROJECT_ID}%3A${REGION_ID}%3A${SQL_INSTANCE}&socketFactory=com.google.cloud.sql.mysql.SocketFactory"

gcloud dataflow flex-template run "sample-inspect-tag-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location "${FLEX_TEMPLATE_PATH}" \
--region "${REGION_ID}" \
--service-account-email "${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
--staging-location "gs://${TEMP_GCS_BUCKET}/staging" \
--worker-machine-type "n1-standard-1" \
--parameters sampleSize=2000 \
--parameters sourceType="JDBC_TABLE" \
--parameters inputPattern="Contacts" \
--parameters reportLocation="gs://${TEMP_GCS_BUCKET}/auto_dlp_report/" \
--parameters reportBigQueryTable="${PROJECT_ID}:inspection_results.SensitivityInspectionResults" \
--parameters jdbcConnectionUrl="${CLOUD_SQL_JDBC_CONNECTION_URL}" \
--parameters jdbcDriverClass="com.mysql.cj.jdbc.Driver" \
--parameters jdbcUserName="root" \
--parameters jdbcPasswordSecretsKey="projects/${PROJECT_ID}/secrets/${SQL_PASSWORD_SECRET_NAME}/versions/1" \
--parameters ^:^jdbcFilterClause="ROUND(RAND() * 10) IN (1,3)" \
--parameters dataCatalogEntryGroupId="projects/${PROJECT_ID}/locations/${REGION_ID}/entryGroups/${DATA_CATALOG_ENTRY_GROUP_ID}" \
--parameters dataCatalogInspectionTagTemplateId="projects/${PROJECT_ID}/locations/${REGION_ID}/tagTemplates/${INSPECTION_TAG_TEMPLATE_ID}"
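
You can monitor the job's progress from the Dataflow page in the Cloud Console, or list active jobs with gcloud:

gcloud dataflow jobs list \
--project="${PROJECT_ID}" \
--region="${REGION_ID}" \
--status=active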

The jdbcConnectionUrl parameter specifies the JDBC connection URL for the database. The details of the exact connection URL depend on your database vendor and hosting provider. Learn more about connecting using Cloud SQL connectors to understand how to connect to Cloud SQL-based relational databases.

Note:

  • You can choose one or more of the following reporting sinks:
    • reportLocation to store the report in a Cloud Storage bucket
    • reportBigQueryTable to store the report in a BigQuery table
    • dataCatalogEntryGroupId to create and tag the Entry in Data Catalog. Omit this parameter if the sourceType is BIGQUERY_TABLE.

The pipeline supports the following source types. Use the table to determine the right combination of the sourceType and inputPattern arguments.

| Data source | sourceType | inputPattern |
|---|---|---|
| Relational databases (using JDBC) | JDBC_TABLE | [TABLE_NAME] |
| Avro file in Cloud Storage | AVRO | gs://[LOCATION_OF_FILES] |
| Parquet file in Cloud Storage | PARQUET | gs://[LOCATION_OF_FILES] |
| BigQuery table | BIGQUERY_TABLE | [PROJECT_ID]:[DATASET].[TABLE] |
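
For example, to inspect a BigQuery table instead of the JDBC source, only the source-related parameters change. The following is a hedged sketch in which [DATASET] and [TABLE] are placeholders for your own table; the JDBC parameters are dropped, and dataCatalogEntryGroupId is omitted because BigQuery tables already have implicit Data Catalog entries:

gcloud dataflow flex-template run "sample-inspect-tag-bq-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location "${FLEX_TEMPLATE_PATH}" \
--region "${REGION_ID}" \
--service-account-email "${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}" \
--staging-location "gs://${TEMP_GCS_BUCKET}/staging" \
--worker-machine-type "n1-standard-1" \
--parameters sampleSize=2000 \
--parameters sourceType="BIGQUERY_TABLE" \
--parameters inputPattern="${PROJECT_ID}:[DATASET].[TABLE]" \
--parameters reportLocation="gs://${TEMP_GCS_BUCKET}/auto_dlp_report/" \
--parameters dataCatalogInspectionTagTemplateId="projects/${PROJECT_ID}/locations/${REGION_ID}/tagTemplates/${INSPECTION_TAG_TEMPLATE_ID}"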

The pipeline detects all of the standard infoTypes supported by Cloud DLP. You can provide additional custom infoTypes that you need by using the --observableInfoTypes parameter.

Sample-and-identify pipeline DAG

The Dataflow execution DAG (directed acyclic graph) looks like the following:

(Image: Dataflow execution DAG of the sample-and-identify pipeline)

Retrieve the report

The sample-and-identify pipeline outputs the Avro schema (or the Avro-converted schema for Parquet sources) of the files, and one file for each of the columns determined to contain sensitive information.

  1. Retrieve the report to your local machine:

    mkdir -p auto_dlp_report/ && rm -f auto_dlp_report/*.json
    gsutil -m cp "gs://${TEMP_GCS_BUCKET}/auto_dlp_report/*.json" auto_dlp_report/
    
  2. List all the column names that have been identified:

    cat auto_dlp_report/col-*.json | jq .columnName
    

    The output is similar to the following list:

    "$.topLevelRecord.contact_number"
    "$.topLevelRecord.person_name"
    
  3. View the details of an identified column with the cat command for the file:

    cat auto_dlp_report/col-topLevelRecord-contact_number-00000-of-00001.json
    

    The following is a snippet of the findings for the contact_number column:

    {
      "columnName": "$.topLevelRecord.contact_number",
      "infoTypes": [{
        "infoType": "PHONE_NUMBER",
        "count": "990"
      }]
    }

    • Don't be alarmed by the unusual columnName value. That's due to the implicit conversion of a database row to an Avro record.
    • The "count" value varies based on the randomly selected samples during execution.

Verify tags in Data Catalog

The sampling pipeline creates a new Entry and applies the sensitivity tags to the appropriate columns.

Retrieve the created Entry for the Contacts table:

gcloud data-catalog entries describe Contacts \
--entry-group=${DATA_CATALOG_ENTRY_GROUP_ID} \
--project="${PROJECT_ID}" \
--location="${REGION_ID}"

This shows the details of the table, including its schema.

Show all the tags attached to this Entry:

gcloud data-catalog tags list --entry=Contacts --entry-group=${DATA_CATALOG_ENTRY_GROUP_ID} \
--project="${PROJECT_ID}" \
--location="${REGION_ID}"

Verify that sensitivity tags are present on the following columns:

  • contact_number
  • person_name

The tag details look like the following:

---
column: contact_number
fields:
  infoTypes:
    displayName: DLP InfoTypes
    stringValue: '[PHONE_NUMBER]'
  inspectTimestamp:
    displayName: Inspection run timestamp
    timestampValue: '2021-05-20T16:34:29.596Z'
name: projects/auto-dlp/locations/asia-southeast1/entryGroups/sql_databases/entries/Contacts/tags/CbS0CtGSpZyJ
template: projects/auto-dlp/locations/asia-southeast1/tagTemplates/auto_dlp_inspection
templateDisplayName: Auto DLP sensitive categories
---
column: person_name
fields:
  infoTypes:
    displayName: DLP InfoTypes
    stringValue: '[DATE]'
  inspectTimestamp:
    displayName: Inspection run timestamp
    timestampValue: '2021-05-20T16:34:29.594Z'
name: projects/auto-dlp/locations/asia-southeast1/entryGroups/sql_databases/entries/Contacts/tags/Cds1aiO8R0pT
template: projects/auto-dlp/locations/asia-southeast1/tagTemplates/auto_dlp_inspection
templateDisplayName: Auto DLP sensitive categories

Verify in BigQuery

The pipeline appends the aggregated findings to the provided BigQuery table. Run the following query to check the results:

bq query \
--location="${REGION_ID}" \
--project_id="${PROJECT_ID}" \
--use_legacy_sql=false \
'SELECT
   input_pattern AS table_name,
   ColumnReport.column_name AS column_name,
   ColumnReport.info_types AS info_types
 FROM
   `inspection_results.SensitivityInspectionResults`,
   UNNEST(column_report) ColumnReport;'

This prints the inspection results retrieved from the BigQuery table:

+------------+---------------------------------+----------------------------------------------+
| table_name |           column_name           |                  info_types                  |
+------------+---------------------------------+----------------------------------------------+
| Contacts   | $.topLevelRecord.person_name    |           [{"info_type":"DATE","count":"1"}] |
| Contacts   | $.topLevelRecord.contact_number | [{"info_type":"PHONE_NUMBER","count":"990"}] |
+------------+---------------------------------+----------------------------------------------+

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the project:

  1. In the Cloud Console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next