
Dataframe created from Bigquery table in spark returns no value; however schema of the table is returned. #216

Closed
raghavendra-gupta opened this issue Jul 23, 2020 · 14 comments
@raghavendra-gupta

raghavendra-gupta commented Jul 23, 2020

Environment:

Win7
Spark (2.4.3)
SBT
Scala (2.11.12)
IntelliJ
BigQuery Connector ( "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.17.0")

build.sbt file -

name := "myTestGCPProject"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.16.1"

Code that fails:

val spark = SparkSession.builder.appName("my first app")
  .config("spark.master", "local")
  .getOrCreate()

val myDF = spark.read.format("bigquery")
  .option("credentialsFile", "src\\main\\resources\\gcloud-rkg-cred.json")
  .load("decoded-tribute-279515:gcp_test_db.emp")

val newDF = myDF.select("empid", "empname", "salary")

myDF.printSchema()
newDF.printSchema()
newDF.show()

printSchema() returns the expected result for both dataframes, but printing the table's data (newDF.show()) fails.

Logs -

20/07/23 11:17:43 INFO ComputeEngineCredentials: Failed to detect whether we are running on Google Compute Engine.
    root
     |-- empid: long (nullable = false)
     |-- empname: string (nullable = false)
     |-- location: string (nullable = false)
     |-- salary: long (nullable = false)
    root
     |-- empid: long (nullable = false)
     |-- empname: string (nullable = false)
     |-- salary: long (nullable = false)
    20/07/23 11:17:51 INFO DirectBigQueryRelation: Querying table decoded-tribute-279515.gcp_test_db.emp, parameters sent from Spark: requiredColumns=[empid,empname,salary], filters=[]
    20/07/23 11:17:51 INFO DirectBigQueryRelation: Going to read from decoded-tribute-279515.gcp_test_db.emp columns=[empid, empname, salary], filter=''
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 49
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 79
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 71
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 69
    20/07/23 11:17:52 INFO BlockManagerInfo: Removed broadcast_5_piece0 on raghav-VAIO:49977 in memory (size: 6.5 KB, free: 639.2 MB)
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 88
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 83
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 58
    20/07/23 11:17:52 INFO BlockManagerInfo: Removed broadcast_4_piece0 on raghav-VAIO:49977 in memory (size: 20.8 KB, free: 639.2 MB)
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 14
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 62
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 87
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 76
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 8
    20/07/23 11:17:52 INFO BlockManagerInfo: Removed broadcast_3_piece0 on raghav-VAIO:49977 in memory (size: 7.2 KB, free: 639.3 MB)
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 9
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 10
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 72
    20/07/23 11:17:52 INFO BlockManagerInfo: Removed broadcast_1_piece0 on raghav-VAIO:49977 in memory (size: 4.5 KB, free: 639.3 MB)
    20/07/23 11:17:52 INFO ContextCleaner: Cleaned accumulator 42
    [error] (run-main-0) com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnknownException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    [error] com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnknownException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:47)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1083)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
    [error]         at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:760)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:545)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:515)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    [error]         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    [error]         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    [error]         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    [error]         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    [error]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    [error]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    [error]         at java.lang.Thread.run(Thread.java:748)
    [error] Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:515)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    [error]         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    [error]         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    [error]         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    [error]         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    [error]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    [error]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    [error]         at java.lang.Thread.run(Thread.java:748)
    [error] Caused by: com.google.cloud.spark.bigquery.repackaged.io.netty.channel.ChannelPipelineException: com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.ProtocolNegotiators$ClientTlsHandler.handlerAdded() has thrown an exception; removed.
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:624)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:572)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:515)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.ProtocolNegotiators$ProtocolNegotiationHandler.fireProtocolNegotiationEvent(ProtocolNegotiators.java:767)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.ProtocolNegotiators$WaitUntilActiveHandler.channelActive(ProtocolNegotiators.java:676)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1398)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:895)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:305)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    [error]         at java.lang.Thread.run(Thread.java:748)
    [error] Caused by: java.lang.RuntimeException: ALPN unsupported. Is your classpath configured correctly? For Conscrypt, add the appropriate Conscrypt JAR to classpath and set the security provider. For Jetty-ALPN, see http://www.eclipse.org/jetty/documentation/current/alpn-chapter.html#alpn-starting
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.handler.ssl.JdkAlpnApplicationProtocolNegotiator$FailureWrapper.wrapSslEngine(JdkAlpnApplicationProtocolNegotiator.java:122)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.handler.ssl.JdkSslContext.configureAndWrapEngine(JdkSslContext.java:360)
    [error]         at com.google.cloud.spark.bigquery.repackaged.io.netty.handler.ssl.JdkSslContext.newEngine(JdkSslContext.java:335)
@raghavendra-gupta changed the title from the original error message to "Dataframe created from Bigquery table in spark returns no value; however schema of the table is returned." Jul 23, 2020
@davidrabinowitz davidrabinowitz self-assigned this Jul 23, 2020
@davidrabinowitz
Member

Can you please see if you can reproduce this on Spark 2.4.5?

@raghavendra-gupta
Author

Let me reproduce it..

@davidrabinowitz
Member

Also, can you try to reproduce it on a non-Windows machine?

@raghavendra-gupta
Author

sure..

@raghavendra-gupta
Author

Reproduced it on Windows 7 with Spark 2.4.5; I see the same error thrown. Logs attached. Working on a non-Windows machine next.
logs with spark 2.4.5 - windows7.txt.zip

@davidrabinowitz
Member

It seems that Conscrypt fails to load its native library, and therefore the netty SSL handler fails to initialize. Since I don't see this issue on Mac or Linux, I'd appreciate it if you can find a non-Windows machine - I think a GCE instance or perhaps even Docker will do.
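Following up on the Docker suggestion: a rough sketch of running the sbt project inside a Linux container (the image name/tag and mount paths here are illustrative assumptions, not something verified in this thread):

```shell
# Run the sbt project in a Linux container with JDK 8.
# 'hseeberger/scala-sbt' is a commonly used community image; any image
# bundling JDK 8 and sbt will do. Paths are placeholders.
docker run --rm -it \
  -v "$PWD":/workspace \
  -w /workspace \
  hseeberger/scala-sbt:8u222_1.3.5_2.11.12 \
  sbt run
```

The credentials JSON referenced by the credentialsFile option must of course be reachable inside the container, e.g. by keeping it under the mounted project directory.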

@raghavendra-gupta
Author

Working on running this on Oracle VM + Ubuntu 18; hope it works there. Will keep you posted.

@raghavendra-gupta
Author

raghavendra-gupta commented Jul 23, 2020

I tried with Ubuntu 18.04 (without any VM) and the same configuration as before... AND IT WORKED FINE.

But with OpenJDK 14.0.2 it threw a different error. Attaching the logs for both Spark versions with OpenJDK 14.0.2:

ubuntu_spark2.4.3_bqconnector0.17.0.gz
ubuntu_spark2.4.5_bqconnector0.17.0.gz

@raghavendra-gupta
Author

This was the last line in the logs in the success scenario with Oracle JDK 8:

[warn] Thread[grpc-nio-worker-ELG-1-1,5,run-main-group-0] loading com.google.cloud.spark.bigquery.repackaged.io.netty.util.concurrent.DefaultPromise$1 after test or run has completed. This is a likely resource leak

This could be useful when locating the issue in the Windows environment and with OpenJDK 14.0.2.
Thanks.

@davidrabinowitz
Member

First of all, I'm glad to see that the original ALPN issue does not exist there; it may be a Windows-dependent issue. A PR has already been submitted, so I'll try to have a bug fix ready soon.

Regarding the error in the provided log - it is a netty issue on Java 9+. Can you please add the io.netty.tryReflectionSetAccessible=true system property to the driver and executors (using extraJavaOptions)?
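For reference, a sketch of passing that system property via spark-submit (the --conf keys are standard Spark properties; the application class and jar names are placeholders):

```shell
# Pass the netty flag to both the driver and the executors.
# com.example.MyApp and my-app.jar are hypothetical names.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --class com.example.MyApp \
  my-app.jar
```

When running through sbt run rather than spark-submit, the equivalent would be forking the JVM (fork := true) and adding -Dio.netty.tryReflectionSetAccessible=true to javaOptions in build.sbt.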

@davidrabinowitz
Member

@raghavendra-gupta Has the netty flag helped?

@davidrabinowitz
Member

Fixed in 0.17.1

@TiGaI

TiGaI commented Nov 16, 2020

(screenshot attached)

Got the same problem again on Windows 10 while trying to show any dataframe.

@davidrabinowitz
Member

@TiGaI Can you please open a new issue with all the details - versions of Scala, Java, and Spark, and sample code (Python/Scala/Java)? Do you build the app, and if so, how?
