
Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue #50

Closed
manojpatil05 opened this issue Apr 6, 2017 · 1 comment

Comments

@manojpatil05

I am trying to TRUNCATE and LOAD data into a Google Cloud BigQuery table using Apache Spark. Though this is achievable with the help of IndirectBigQueryOutputFormat, as stated by Dennis Hou in #43, I ran into a serious performance issue.

Below are some of the code samples/configurations used:

```scala
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_, _]].getName)
BigQueryOutputConfiguration.configure(
  conf,
  projectId,
  outputDatasetId,
  outputTableId,
  outputSchema,
  "gs://spark3/temp",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])
```
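For completeness, these are roughly the imports the snippet above assumes (exact package paths may differ between connector versions, so treat this as a sketch):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, BigQueryFileFormat}
import com.google.cloud.hadoop.io.bigquery.output.{BigQueryOutputConfiguration, IndirectBigQueryOutputFormat}
import com.google.gson.JsonObject
```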

After doing some transformations on the input data, I have `append` as a `Dataset[Row]`, and I convert it to an `RDD[(Long, String, Long)]` using:

```scala
val Final_Stmt = append.as[(Long, String, Long)].rdd
```

Now it's time to load the data into the BigQuery table, so I used:

```scala
Final_Stmt.map(pair => (null, convertToJson(pair))).saveAsNewAPIHadoopDataset(conf)
```

where the definition of `convertToJson` is as below:

```scala
def convertToJson(pair: (Long, String, Long)): JsonObject = {
  val id = pair._1
  val name = pair._2
  val score = pair._3
  val jsonObject = new JsonObject()
  jsonObject.addProperty("id", id)
  jsonObject.addProperty("name", name)
  jsonObject.addProperty("score", score)
  jsonObject
}
```
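For example (the row values here are made up, just to show the JSON shape produced):

```scala
// Hypothetical row; Gson's JsonObject.toString emits compact JSON.
println(convertToJson((1L, "alice", 42L)))
// prints: {"id":1,"name":"alice","score":42}
```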

If I comment out `saveAsNewAPIHadoopDataset(conf)` and only print `Final_Stmt`'s map, the job runs in 2.44 minutes for data of 8 rows and 3 columns. But if I run it with `saveAsNewAPIHadoopDataset(conf)`, which loads the data into the BigQuery table, it takes about 41 minutes.

Looking at the output window, TaskSetManager creates about 200 tasks, and for each task the following gets executed:

```
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
INFO com.google.cloud.hadoop.io.bigquery.output.ForwardingBigQueryFileOutputFormat: Delegating functionality to 'TextOutputFormat'
```

Each task takes a long time to execute.
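As a quick sanity check (my own sketch): the task count should match the partition count of the RDD being written, and after a shuffle a Dataset's RDD defaults to `spark.sql.shuffle.partitions`, which is 200:

```scala
// The ~200 tasks correspond to the RDD's partitions; after a shuffle,
// Datasets default to spark.sql.shuffle.partitions = 200.
println(s"partitions = ${Final_Stmt.getNumPartitions}")
```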

How can I improve the performance of this job, or is there something missing on my end?

@manojpatil05 manojpatil05 changed the title Load data into BigQuery using Spark performance issue Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue Apr 6, 2017
@manojpatil05 (Author)

Reducing the number of partitions via `spark.sql.shuffle.partitions` from the default 200 to 5 has solved this issue, but it is not advisable for large data.
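A less invasive variant of the same idea (a sketch on my side, not tested at scale): instead of lowering `spark.sql.shuffle.partitions` globally, coalesce only the RDD being written, sizing the partition count to the data volume:

```scala
// numPartitions is a hypothetical knob: 1 is plenty for 8 rows;
// raise it for larger data so the write to GCS stays parallel.
val numPartitions = 1
Final_Stmt
  .map(pair => (null, convertToJson(pair)))
  .coalesce(numPartitions) // narrow dependency, avoids an extra shuffle
  .saveAsNewAPIHadoopDataset(conf)
```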
