I am trying to TRUNCATE and LOAD data into a Google Cloud BigQuery table using Apache Spark. Although this is achievable with the help of `IndirectBigQueryOutputFormat`, as stated by Dennis Hou in #43, I ran into a serious performance issue.
Below are the code samples/configurations used:

```scala
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_, _]].getName)
BigQueryOutputConfiguration.configure(
  conf, projectId, outputDatasetId, outputTableId, outputSchema,
  "gs://spark3/temp", BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])
```
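For context, the write disposition controls how the load job treats an existing table. A hedged summary, based on standard BigQuery load-job semantics:

```scala
// BigQuery write dispositions (standard load-job semantics):
//   WRITE_TRUNCATE – replace the table's existing data with the new load
//   WRITE_APPEND   – append the new data to whatever is already there
//   WRITE_EMPTY    – fail the job if the table already contains data
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
```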
After doing some transformations on the input data, I have `append` as a `Dataset[Row]`, and I convert it to an `RDD[(Long, String, Long)]` using:

```scala
val Final_Stmt = append.as[(Long, String, Long)].rdd
```
Now it's time to load the data into the BigQuery table, so I use:

```scala
Final_Stmt.map(pair => (null, convertToJson(pair))).saveAsNewAPIHadoopDataset(conf)
```

where `convertToJson` is defined as:

```scala
def convertToJson(pair: (Long, String, Long)): JsonObject = {
  val (id, name, score) = pair
  val jsonObject = new JsonObject()
  jsonObject.addProperty("id", id)
  jsonObject.addProperty("name", name)
  jsonObject.addProperty("score", score)
  jsonObject
}
```
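With the indirect output format, each `JsonObject` ends up serialized as one newline-delimited JSON line in the temporary GCS path before the load job picks it up. A minimal, Gson-free sketch of the equivalent string (the `toJsonLine` name is hypothetical, for illustration only; it does not escape special characters in `name`):

```scala
// Hypothetical stand-in for convertToJson: renders one (id, name, score)
// tuple as the newline-delimited JSON line that eventually lands in
// the temporary GCS path (gs://spark3/temp).
def toJsonLine(pair: (Long, String, Long)): String = {
  val (id, name, score) = pair
  s"""{"id":$id,"name":"$name","score":$score}"""
}

println(toJsonLine((1L, "alice", 42L)))  // prints {"id":1,"name":"alice","score":42}
```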
If I comment out `saveAsNewAPIHadoopDataset(conf)` and only print `Final_Stmt`'s map, the job runs in about 2.44 minutes for data of 8 rows and 3 columns. But if I run it with `saveAsNewAPIHadoopDataset(conf)`, which loads the data into the BigQuery table, it takes nearly 41 minutes.
Looking at the output window, `TaskSetManager` creates about 200 tasks, and for each task the following gets logged:

```
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
INFO com.google.cloud.hadoop.io.bigquery.output.ForwardingBigQueryFileOutputFormat: Delegating functionality to 'TextOutputFormat'
```

Each task takes a long time to execute.
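One possible explanation, offered here only as an assumption: the ~200 tasks match Spark's default shuffle partition count (`spark.sql.shuffle.partitions = 200`), so for 8 rows almost every task is empty yet still pays the per-task BigQuery/GCS setup cost shown in the log. If that is the cause, collapsing to a handful of partitions before the save might help; a sketch, not a tested fix:

```scala
// Assumption: the ~200 tasks come from the default shuffle partition
// count. With only 8 rows, coalescing to one partition means the
// BigQuery credential/output-format setup happens once, not ~200 times.
Final_Stmt
  .map(pair => (null, convertToJson(pair)))
  .coalesce(1)  // one output task instead of ~200 mostly-empty ones
  .saveAsNewAPIHadoopDataset(conf)
```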
How can I improve the performance of this job, or is there something missing on my end?
manojpatil05 changed the title from "Load data into BigQuery using Spark performance issue" to "Trying to Load data into BigQuery using IndirectBigQueryOutputFormat/saveAsNewAPIHadoopDataset(conf) performance issue" on Apr 6, 2017.