
Deploy stuck before execution on the Cloud Dataflow Service (should write to log) #301

Closed
rantibi opened this issue Jun 11, 2016 · 2 comments

Comments


rantibi commented Jun 11, 2016

I am trying to run my pipeline on the Cloud Dataflow Service using the following command:

java -jar my-jar.jar --runner=BlockingDataflowPipelineRunner --project=bigquery-eval --stagingLocation=gs://bucket-staging --startDate=2016/05/01/00 --endDate=2016/06/01/00  --numWorkers=50 --dataflowJobFile=out.json
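(For reference, --startDate and --endDate are not built-in Dataflow flags; I register them through a custom PipelineOptions interface, roughly like this simplified sketch. The interface name and descriptions are illustrative, not my real code:)

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Illustrative sketch of the custom flags; the real interface differs.
public interface ReportOptions extends DataflowPipelineOptions {
  @Description("Start of the date range, e.g. 2016/05/01/00")
  String getStartDate();
  void setStartDate(String value);

  @Description("End of the date range (exclusive), e.g. 2016/06/01/00")
  String getEndDate();
  void setEndDate(String value);
}

// Parsed in main():
// ReportOptions options =
//     PipelineOptionsFactory.fromArgs(args).withValidation().as(ReportOptions.class);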

It prints the following lines and then gets stuck (for more than 45 minutes) with no further logging:

Jun 11, 2016 8:00:31 PM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 1 files. Enable logging at DEBUG level to see which files will be staged.
Jun 11, 2016 8:00:34 PM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Jun 11, 2016 8:00:34 PM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements
INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Jun 11, 2016 8:00:34 PM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements
INFO: Uploading PipelineOptions.filesToStage complete: 0 files newly uploaded, 1 files cached

When I select a smaller date range (for example, 1 hour), my implementation creates fewer PCollections, and the job starts running after about 1 minute of no logs.
I have no idea what is happening, and in any case I think it should write something to the logs.
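To give a sense of the scaling: the pipeline builds one branch per hour in the requested range, so a month is roughly 744 branches while one hour is a single branch. A simplified sketch of the shape (my real transforms and bucket layout differ):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;
import java.util.ArrayList;
import java.util.List;
import org.joda.time.DateTime;

// One read branch per hour in [start, end); a month-long range yields
// ~744 branches, which inflates the serialized job graph accordingly.
static PCollection<String> readRange(Pipeline p, DateTime start, DateTime end) {
  List<PCollection<String>> hourly = new ArrayList<>();
  for (DateTime hour = start; hour.isBefore(end); hour = hour.plusHours(1)) {
    String path = "gs://bucket/" + hour.toString("yyyy/MM/dd/HH") + "/*"; // illustrative layout
    hourly.add(p.apply("Read-" + hour.toString("yyyyMMddHH"), TextIO.Read.from(path)));
  }
  return PCollectionList.of(hourly).apply(Flatten.<String>pCollections());
}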


rantibi commented Jun 11, 2016

Only after 49 minutes did I get this error:

Dataflow SDK version: 1.6.0
Jun 11, 2016 8:49:06 PM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run
INFO: Printed workflow specification to outtt1.json
Jun 11, 2016 8:49:11 PM com.google.cloud.dataflow.sdk.util.RetryHttpRequestInitializer$LoggingHttpBackoffUnsuccessfulResponseHandler handleResponse
WARNING: Request failed with code 400, will NOT retry: https://dataflow.googleapis.com/v1b3/projects/bigquery-eval/jobs
Exception in thread "main" java.lang.RuntimeException: Failed to create a workflow job: The size of the serialized JSON representation of the pipeline exceeds the allowable limit. For more information, please check the FAQ link below:
https://cloud.google.com/dataflow/faq
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:642)
    at com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner.run(BlockingDataflowPipelineRunner.java:95)
    at com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner.run(BlockingDataflowPipelineRunner.java:56)
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:180)
    at com.juno.bi.dataflow.reports.ProviderActivityReport.main(ProviderActivityReport.java:85)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request contains an invalid argument.",
    "reason" : "badRequest"
  } ],
  "message" : "Request contains an invalid argument.",
  "status" : "INVALID_ARGUMENT"
}
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:629)
    ... 4 more

I think it's a problem that no log was written for 49 minutes...
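A possible workaround, assuming the per-hour inputs are files on GCS, is to collapse the per-hour reads into a single glob read so the serialized graph stays small regardless of the range; a rough sketch with an illustrative path:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// One glob read instead of one read per hour keeps the job graph (and
// its serialized JSON) roughly constant in size; any per-hour filtering
// then happens inside the pipeline instead of in the graph structure.
static PCollection<String> readMonth(Pipeline p) {
  return p.apply(TextIO.Read.from("gs://bucket/2016/05/*/*/part-*")); // illustrative glob
}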

rantibi changed the title from "Deploy stuck before execution on the Cloud Dataflow Service" to "Deploy stuck before execution on the Cloud Dataflow Service (should write to log)" on Jun 11, 2016

dhalperi commented Apr 10, 2017

I think this issue has been addressed in version 2 of the Dataflow SDK for Java, which is based on Apache Beam.
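For anyone migrating: a minimal sketch of the equivalent setup with the Beam-based 2.x SDK. BlockingDataflowPipelineRunner no longer exists there; blocking behavior comes from run().waitUntilFinish().

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MigrationSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline ...

    // Equivalent of the old BlockingDataflowPipelineRunner: block until done.
    p.run().waitUntilFinish();
  }
}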
