Sovik89/SPARK_SQL_project

SPARK_SQL_project with DataProc



A Spark SQL project that converts NYSE data from CSV files to Parquet files, partitioned by trademonth (YYYYmm). The target table (the golden data) is created as a Delta table, using Delta Lake together with Spark SQL. The project also illustrates the shortcomings of Spark SQL compared with Hive SQL.
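As a rough illustration of the trademonth (YYYYmm) partition key, the sketch below derives the key from a trade date and builds the Hive-style `trademonth=...` directory name that Spark uses for partitioned output. It is plain Python, not the project's Spark SQL code. The `tradedate` column name, the `yyyyMMdd` input format, the base path, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def trademonth(trade_date: str, fmt: str = "%Y%m%d") -> str:
    """Return the YYYYmm partition key for a trade date string.

    The default input format (yyyyMMdd) is an assumption; adjust `fmt`
    to match the actual CSV files.
    """
    return datetime.strptime(trade_date, fmt).strftime("%Y%m")


def partition_path(base: str, trade_date: str) -> str:
    """Build the Hive-style partition directory for a trade date."""
    return f"{base.rstrip('/')}/trademonth={trademonth(trade_date)}"


def group_by_trademonth(rows):
    """Group rows (dicts with a hypothetical 'tradedate' key) by partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[trademonth(row["tradedate"])].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        {"ticker": "AAA", "tradedate": "20170103"},
        {"ticker": "BBB", "tradedate": "20170215"},
    ]
    for key, rows in group_by_trademonth(sample).items():
        print(key, len(rows), partition_path("/tmp/nyse_parquet", rows[0]["tradedate"]))
```

Partitioning by month rather than by day keeps the number of directories small while still letting queries that filter on trademonth skip unrelated files.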

Steps for loading data from the HDFS golden location into a GCS bucket, from which it is consumed by BigQuery (GBQ)


1. Create a separate GCS bucket.
2. Create the BigQuery dataset and the corresponding table.
  1. From the cluster command prompt, copy the data from HDFS to the bucket: hadoop distcp hdfs://cluster-2fef-sovik-m/user/hive/warehouse/retail_db.db/orders gs://sovik-big-query-bucket-1/orders

  2. Load the copied Parquet files into BigQuery: bq load --autodetect --source_format=PARQUET orders.orders_v2 gs://sovik-big-query-bucket-1/orders/orders/part-*
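The copy and load commands in the steps above could be scripted along these lines. This is a minimal sketch, not part of the repository. The helper names (`distcp_command`, `bq_load_command`, `run`) are hypothetical, and the only flags used are the ones that appear in the commands above. By default it only prints the commands, so it can be tried without a cluster.

```python
import shlex
import subprocess


def distcp_command(hdfs_src: str, gcs_dest: str) -> list:
    """Argument list for copying an HDFS directory to a GCS location."""
    return ["hadoop", "distcp", hdfs_src, gcs_dest]


def bq_load_command(table: str, gcs_glob: str) -> list:
    """Argument list for loading Parquet files from GCS into a BigQuery table."""
    return ["bq", "load", "--autodetect", "--source_format=PARQUET", table, gcs_glob]


def run(cmd: list, dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        # No shell is involved, so the part-* wildcard reaches bq unexpanded.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(distcp_command(
        "hdfs://cluster-2fef-sovik-m/user/hive/warehouse/retail_db.db/orders",
        "gs://sovik-big-query-bucket-1/orders",
    ))
    run(bq_load_command(
        "orders.orders_v2",
        "gs://sovik-big-query-bucket-1/orders/orders/part-*",
    ))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with the wildcard in the GCS path.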

About

NYSE data from csv to delta table using deltalake
