updated read me with cloud build
santhh committed Dec 9, 2019
1 parent c3b0fc8 commit 074c737
Showing 4 changed files with 77 additions and 6 deletions.
79 changes: 76 additions & 3 deletions README.md

### DLP Inspection for Data Stored in AWS S3 and GCS Buckets

This PoC can be used to inspect large-scale datasets (CSV, TXT) stored in AWS S3 and GCS buckets. It uses the Dataflow S3 connector and invokes the DLP Inspect API to inspect data based on the configuration specified in a DLP inspect template, then stores the results in BigQuery.
### How it works
1. Build and run the pipeline in a GCP project using Dataflow. Please ensure you have enabled the DLP and Dataflow APIs before executing (a sample gcloud command for enabling them is shown after the build and run commands below).

```
gradle build -DmainClass=com.google.swarm.tokenization.S3Import -x test
To Run:
gradle run -DmainClass=com.google.swarm.tokenization.S3Import -Pargs=" --streaming --project=<id> --runner=DataflowRunner --awsAccessKey=<key> --awsSecretKey=<key> --s3BucketUrl=s3://<bucket>/*.* --gcsBucketUrl=gs://<bucket>/*.* --inspectTemplateName=projects/<id>/inspectTemplates/<template_id> --awsRegion=ca-central-1 --numWorkers=50 --workerMachineType=n1-highmem-16 --maxNumWorkers=50 --autoscalingAlgorithm=NONE --enableStreamingEngine --tempLocation=gs://<bucket>/temp --dataSetId=<dataset_id> --s3ThreadPoolSize=1000 --maxConnections=1000000 --socketTimeout=100 --connectionTimeout=100"
```
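
If the DLP and Dataflow APIs are not yet enabled in the project (see step 1 above), one way to enable them is with gcloud. This is a generic sketch, not a script from this repository:

```
# Enable the Cloud DLP and Dataflow APIs in the active project
gcloud services enable dlp.googleapis.com dataflow.googleapis.com
```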

### Run Using Cloud Build
1. Modify the inspect config file (gcs-s3-inspect-config.json) to add or update the info types you would like to use for the scan. The snippet below is a sample config file used for the demo.

```
{
  "inspectTemplate": {
    "displayName": "DLP Inspection Config For Demo",
    "description": "DLP Config To Inspect GCS and S3 Bucket",
    "inspectConfig": {
      "infoTypes": [
        {
          "name": "EMAIL_ADDRESS"
        },
        {
          "name": "CREDIT_CARD_NUMBER"
        },
        {
          "name": "PHONE_NUMBER"
        },
        {
          "name": "US_SOCIAL_SECURITY_NUMBER"
        },
        {
          "name": "IP_ADDRESS"
        }
      ],
      "minLikelihood": "POSSIBLE",
      "customInfoTypes": [
        {
          "infoType": {
            "name": "ONLINE_USER_ID"
          },
          "regex": {
            "pattern": "\\b:\\d{16}"
          }
        }
      ]
    }
  }
}
```
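
The file above follows the shape of a DLP `InspectTemplate` create request, so it can also be used to create the inspect template manually instead of through the build. Below is a minimal sketch against the DLP v2 REST API, assuming `<project_id>` is your project; the template name in the response is what you would pass as `--inspectTemplateName`:

```
# Create the inspect template from the sample config file (placeholder project ID)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @gcs-s3-inspect-config.json \
  "https://dlp.googleapis.com/v2/projects/<project_id>/inspectTemplates"
```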

2. Export Required Parameters

```
export AWS_ACCESS_KEY=<aws_access_key>
export AWS_SECRET_KEY=<aws_secret_key>
export S3_BUCKET_URL=<s3_bucket_url>
export GCS_BUCKET_URL=<gcs_bucket_url>
export AWS_REGION=<aws_region>
export BQ_DATASET=<bq_dataset>
```
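
For reference, here is a filled-in example with illustrative values only (the access keys are the standard AWS documentation samples, and the bucket names and dataset are made up). Note the wildcard suffix on both bucket URLs, matching the `gradle run` example above:

```
export AWS_ACCESS_KEY=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export S3_BUCKET_URL="s3://my-s3-bucket/*.*"
export GCS_BUCKET_URL="gs://my-gcs-bucket/*.*"
export AWS_REGION=ca-central-1
export BQ_DATASET=dlp_inspection_results
```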

3. Run the Cloud Build command

```
gcloud builds submit . --config dlp-demo-s3-gcs-inspect.yaml \
  --substitutions _AWS_ACCESS_KEY=$AWS_ACCESS_KEY,\
_API_KEY=$(gcloud auth print-access-token),\
_AWS_SECRET_KEY=$AWS_SECRET_KEY,\
_S3_BUCKET_URL=$S3_BUCKET_URL,\
_GCS_BUCKET_URL=$GCS_BUCKET_URL,\
_AWS_REGION=$AWS_REGION,\
_BQ_DATASET=$BQ_DATASET
```
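
Once the build is submitted, its progress can be followed with standard `gcloud builds` commands (not specific to this repository):

```
# List recent builds, then stream the logs of a specific build
gcloud builds list --limit=5
gcloud builds log <build_id> --stream
```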

### Testing Configuration
This PoC was built to process large-scale data by scaling the number of workers horizontally. During our test run, we successfully inspected 1.3 TB of data in less than 10 minutes. It's recommended to use n1-highmem-16 machines, as they allow Dataflow to reserve more JVM heap memory.

The configurations below are common to both setups:


### Some Screenshots from the PoC run
#### Dataflow Job
![Dataflow_DAG](diagrams/s3_1.png)
![Dataflow_DAG](diagrams/s3_2.png)

#### S3 Bucket

4 changes: 1 addition & 3 deletions create-df-template.sh
PARAMETERS_CONFIG='{
"connectionTimeout":"100",
"tempLocation":"'$TEMP_LOCATION'",
"awsRegion":"'$AWS_REGION'",
"dataSetId":"'$BQ_DATASET'",
"dataSetId":"'$BQ_DATASET'",
}
}'
DF_API_ROOT_URL="https://dataflow.googleapis.com"
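
For context, `PARAMETERS_CONFIG` is the JSON body that a script like this typically POSTs to the Dataflow templates launch endpoint under `DF_API_ROOT_URL`. The sketch below is illustrative only; the project ID, template path, and exact request shape used by create-df-template.sh may differ:

```
# Illustrative launch call; values in angle brackets are placeholders
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "$PARAMETERS_CONFIG" \
  "${DF_API_ROOT_URL}/v1b3/projects/<project_id>/templates:launch?gcsPath=gs://<bucket>/<template_path>"
```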
Binary file added diagrams/s3_1.png
Binary file added diagrams/s3_2.png
