SPARK_SQL_project with DataProc

Spark SQL project that converts NYSE CSV files to Parquet, partitioned by trademonth (YYYYMM). The target ("golden") table is a Delta table, created with Delta Lake in combination with Spark SQL. The project also illustrates the shortcomings of Spark SQL compared with Hive SQL.
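To make the partitioning scheme concrete, here is a minimal plain-Python sketch of how a trademonth key maps rows to Hive-style partition directories (`trademonth=YYYYMM`), which is the layout Spark produces when writing with a partition column. It is not the project's Spark job. The column names (`ticker`, `tradedate`, `close`), the `YYYYMMDD` date format, and the sample rows are assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def trademonth(tradedate: str) -> str:
    """Return the YYYYMM partition key for a trade date.

    Assumes the date is formatted as YYYYMMDD (hypothetical format).
    """
    return datetime.strptime(tradedate, "%Y%m%d").strftime("%Y%m")


def partition_rows(csv_text: str) -> dict:
    """Group CSV rows by Hive-style partition directory name."""
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = f"trademonth={trademonth(row['tradedate'])}"
        partitions[key].append(row)
    return dict(partitions)


# Made-up sample rows; the real NYSE file layout may differ.
SAMPLE = """ticker,tradedate,close
AAA,20160104,10.5
BBB,20160105,20.0
AAA,20160201,11.0
"""

if __name__ == "__main__":
    # Print each partition directory and the number of rows it would hold.
    for directory, rows in sorted(partition_rows(SAMPLE).items()):
        print(directory, len(rows))
```

Each distinct trademonth value becomes one directory, so a query that filters on a single month only needs to read that directory's files.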

Steps for loading data from the HDFS golden location into a GCS bucket, from which it is consumed by BigQuery (GBQ)

1. Create a separate GCS bucket.
2. Create the BigQuery dataset and the corresponding table.

3. From the cluster command prompt, copy the golden data from HDFS to the bucket: hadoop distcp hdfs://cluster-2fef-sovik-m/user/hive/warehouse/retail_db.db/orders gs://sovik-big-query-bucket-1/orders

4. Load the Parquet files into BigQuery: bq load --autodetect --source_format=PARQUET orders.orders_v2 gs://sovik-big-query-bucket-1/orders/orders/part-*
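The steps above can be sketched as a small Python wrapper that assembles the commands and prints them (dry run by default). The `hadoop distcp` and `bq load` invocations are taken from this README. The `gsutil mb` and `bq mk --dataset` commands for steps 1 and 2 are assumptions about how the bucket and dataset might be created, and the function names are hypothetical.

```python
import shlex
import subprocess

BUCKET = "gs://sovik-big-query-bucket-1"
HDFS_SRC = "hdfs://cluster-2fef-sovik-m/user/hive/warehouse/retail_db.db/orders"


def build_commands(bucket=BUCKET, hdfs_src=HDFS_SRC,
                   dataset="orders", table="orders_v2"):
    """Return the commands for each step as argument lists."""
    return [
        # Step 1 (assumed command): create the bucket.
        ["gsutil", "mb", bucket],
        # Step 2 (assumed command): create the BigQuery dataset.
        ["bq", "mk", "--dataset", dataset],
        # Copy the golden data from HDFS to the bucket (from the README).
        ["hadoop", "distcp", hdfs_src, f"{bucket}/orders"],
        # Load the Parquet part files into BigQuery (from the README).
        ["bq", "load", "--autodetect", "--source_format=PARQUET",
         f"{dataset}.{table}", f"{bucket}/orders/orders/part-*"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands())
```

Because the commands are passed as argument lists rather than through a shell, the `part-*` wildcard reaches `bq` unchanged and is matched against the objects in the bucket rather than against local files.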
