
Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue #50

Closed
manojpatil05 opened this issue Apr 6, 2017 · 1 comment

Comments

@manojpatil05

I am trying to TRUNCATE and LOAD data into a Google Cloud BigQuery table using Apache Spark. Though this is achievable with the help of IndirectBigQueryOutputFormat, as stated by Dennis Hou in #43, I ran into a serious performance issue.

Below are some of the code samples/configurations used:

```scala
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_, _]].getName)
BigQueryOutputConfiguration.configure(
  conf,
  projectId,
  outputDatasetId,
  outputTableId,
  outputSchema,
  "gs://spark3/temp",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])
```
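For completeness, these are roughly the imports the snippet above assumes (exact package paths may differ between connector versions, so treat this as a sketch):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, BigQueryFileFormat}
import com.google.cloud.hadoop.io.bigquery.output.{BigQueryOutputConfiguration, IndirectBigQueryOutputFormat}
import com.google.gson.JsonObject
```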

After doing some transformations on the input data, I have `append` as a `Dataset[Row]`, and I convert it to an `RDD[(Long, String, Long)]` using:

```scala
val Final_Stmt = append.as[(Long, String, Long)].rdd
```

Now it's time to load the data into the BigQuery table, so I used:

```scala
Final_Stmt.map(pair => (null, convertToJson(pair))).saveAsNewAPIHadoopDataset(conf)
```

where the definition of `convertToJson` is as below:

```scala
def convertToJson(pair: (Long, String, Long)): JsonObject = {
  val id = pair._1
  val name = pair._2
  val score = pair._3
  val jsonObject = new JsonObject()
  jsonObject.addProperty("id", id)
  jsonObject.addProperty("name", name)
  jsonObject.addProperty("score", score)
  jsonObject
}
```
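For example (the row values here are made up, just to show the JSON shape produced):

```scala
// Hypothetical row; Gson's JsonObject.toString emits compact JSON.
println(convertToJson((1L, "alice", 42L)))
// prints: {"id":1,"name":"alice","score":42}
```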

If I comment out `saveAsNewAPIHadoopDataset(conf)` and only print `Final_Stmt`'s map, the job runs in 2.44 minutes for data of 8 rows and 3 columns. But if I run it with `saveAsNewAPIHadoopDataset(conf)`, which loads the data into the BigQuery table, it takes about 41 minutes.

Looking at the output window, TaskSetManager creates about 200 tasks, and for each task the following gets executed:

```
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
INFO com.google.cloud.hadoop.io.bigquery.output.ForwardingBigQueryFileOutputFormat: Delegating functionality to 'TextOutputFormat'
```

Each task takes a long time to execute.
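As a quick sanity check (my own sketch): the task count should match the partition count of the RDD being written, and after a shuffle a Dataset's RDD defaults to `spark.sql.shuffle.partitions`, which is 200:

```scala
// The ~200 tasks correspond to the RDD's partitions; after a shuffle,
// Datasets default to spark.sql.shuffle.partitions = 200.
println(s"partitions = ${Final_Stmt.getNumPartitions}")
```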

How can I improve the performance of this job, or is there something missing on my end?

@manojpatil05 manojpatil05 changed the title Load data into BigQuery using Spark performance issue Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue Apr 6, 2017
@manojpatil05 (Author)

Reducing the number of partitions via `spark.sql.shuffle.partitions` from the default 200 to 5 has solved this issue, but it is not advisable for large data.
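A less invasive variant of the same idea (a sketch on my side, not tested at scale): instead of lowering `spark.sql.shuffle.partitions` globally, coalesce only the RDD being written, sizing the partition count to the data volume:

```scala
// numPartitions is a hypothetical knob: 1 is plenty for 8 rows;
// raise it for larger data so the write to GCS stays parallel.
val numPartitions = 1
Final_Stmt
  .map(pair => (null, convertToJson(pair)))
  .coalesce(numPartitions) // narrow dependency, avoids an extra shuffle
  .saveAsNewAPIHadoopDataset(conf)
```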
