### Solution 05: Build gold tables and a semantic layer for Power BI reporting

In this scenario, the data engineer performs transformation and aggregation tasks to prepare the data for downstream analysis.

**NOTE:** <mark>You can use Scala, PySpark, or Spark SQL, or any combination of them.</mark>

**Goal:** Optimize tables for analytics, which is a key requirement

**Actions:**
- Turn on the V-Order feature for the Spark session

**Success Criteria:**
- The feature is enabled (`spark.sql.parquet.vorder.default` returns `'true'`)

In [4]:
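# Check the current value, enable V-Order for this session, then read the value back.
# Only the last expression's value is shown as the cell output.
spark.conf.get('spark.sql.parquet.vorder.default')
spark.conf.set('spark.sql.parquet.vorder.default', 'true')
spark.conf.get('spark.sql.parquet.vorder.default')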


'true'
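
The three calls above follow a read, set, read-back pattern. A minimal sketch of that pattern as a reusable helper is shown here, assuming only that the session object exposes `conf.get` and `conf.set` the way a Spark session does. The name `ensure_conf` is hypothetical and not part of the workshop material.

```python
def ensure_conf(spark, key, value):
    """Set a session config key and return the (previous, current) values.

    Hypothetical helper; `spark` is assumed to be the notebook's active session.
    """
    # Passing a default of None avoids an exception when the key was never set.
    previous = spark.conf.get(key, None)
    spark.conf.set(key, value)
    # Read the value back so a silently ignored setting is caught early.
    current = spark.conf.get(key)
    if current != value:
        raise RuntimeError(f"{key} is {current!r}, expected {value!r}")
    return previous, current
```

In a notebook cell this could be called as `ensure_conf(spark, 'spark.sql.parquet.vorder.default', 'true')`, which returns the old and new values together instead of relying on the cell's last expression.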

**Goal:** Union the greenYYYYMM_discounts tables

**Actions:**
- Read the greenYYYYMM_discounts tables
- Union the tables
- Write the output to lakehouse: **goldcurated**, table: **green_discounts**

**Success Criteria:**
- Table **goldcurated.green_discounts** exists
    - Contains 1536873 records

In [5]:
# Read the green202301_discounts table.
# header and inferSchema apply to CSV sources, so they are dropped for a lakehouse table.
df2023 = spark.read.load("Tables/green202301_discounts")
df2023_count = df2023.count()
print(df2023_count)


59902


In [6]:
# Read the green201501_discounts table.
# header and inferSchema apply to CSV sources, so they are dropped for a lakehouse table.
df2015 = spark.read.load("Tables/green201501_discounts")
df2015_count = df2015.count()
print(df2015_count)


1476971


In [7]:
# Union both tables. union() matches columns by position, so both tables must share the same column order.
df_combined = df2023.union(df2015)
df_combined_count = df_combined.count()

# Write the data to the destination table
df_combined.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("goldcurated.green_discounts")
print(f"Written {df_combined_count} records")

# Display sample data
display(df_combined)


Written 1536873 records
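
The solution reads two monthly tables by hand. With more `greenYYYYMM_discounts` tables, the read-and-union step could be folded over a list of periods. The sketch below assumes every table lives under `Tables/` with the same naming pattern and the same set of columns. The helper name `union_discount_tables` is hypothetical, and it swaps in `unionByName` (matching by column name) for the positional `union` used in the solution.

```python
from functools import reduce


def union_discount_tables(spark, periods):
    """Read each green<YYYYMM>_discounts table and union them by column name.

    `periods` is a list of "YYYYMM" strings, for example ["202301", "201501"].
    Hypothetical helper; `spark` is assumed to be the notebook's active session.
    """
    if not periods:
        raise ValueError("at least one period is required")
    # One DataFrame per monthly table, read with the same path pattern as above.
    frames = [spark.read.load(f"Tables/green{p}_discounts") for p in periods]
    # unionByName matches columns by name, so a different column order in one
    # monthly table does not silently misalign values.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

For the two tables in this exercise, `df_combined = union_discount_tables(spark, ["202301", "201501"])` would be the equivalent call. The expected total is the sum of the two counts printed earlier: 59902 + 1476971 = 1536873.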



**Goal:** Calculate average metrics

**Actions:**
- Aggregate the data with the AVG function, grouped by pickupDate, year, month, dayofmonth, weekDayName, timeBins, VendorID, PULocationID
    - Metrics (rounded to 2 decimal places): passenger_count, trip_distance, trip_duration, total_amount, tip_amount, fare_amount, discount
- Write the output to lakehouse: **goldcurated**, table: **green_avg_metrics**

**Success Criteria:**
- Table **goldcurated.green_avg_metrics** exists
    - Contains 40727 records


In [8]:
# Note: importing round from pyspark.sql.functions shadows Python's built-in round() in this session
from pyspark.sql.functions import avg, month, round, year

# Calculate average metrics per pickup date, time bin, vendor, and pickup location
avg_metrics = (
    df_combined
    .groupBy(
        "pickupDate",
        year("pickupDate").alias("year"),
        month("pickupDate").alias("month"),
        "dayofmonth",
        "weekDayName",
        "timeBins",
        "VendorID",
        "PULocationID"
    )
    .agg(
        round(avg("passenger_count"), 2).alias("average_passenger_count"),
        round(avg("trip_distance"), 2).alias("average_trip_distance"),
        round(avg("trip_duration"), 2).alias("average_trip_duration"),
        round(avg("total_amount"), 2).alias("average_total_amount"),
        round(avg("tip_amount"), 2).alias("average_tip_amount"),
        round(avg("fare_amount"), 2).alias("average_fare"),
        round(avg("discount"), 2).alias("average_discount")
    )
    .orderBy("pickupDate")
)

# Save the results to a new Delta table
avg_metrics.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable("goldcurated.green_avg_metrics")

df_count_average = avg_metrics.count()

print(f"Written {df_count_average} records")

# Display the aggregated data
display(avg_metrics)



Written 40727 records
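
The printed counts come from the in-memory DataFrames, not from the saved tables. One way to check the success criteria against what was actually written is to read each table back and compare its row count. The sketch below assumes the tables can be resolved by the same `lakehouse.table` names used in `saveAsTable`. The helper name `check_row_count` is hypothetical.

```python
def check_row_count(spark, table_name, expected):
    """Read a saved table back and compare its row count with the expected value.

    Hypothetical helper; `spark` is assumed to be the notebook's active session.
    """
    actual = spark.table(table_name).count()
    if actual != expected:
        raise ValueError(
            f"{table_name} contains {actual} records, expected {expected}"
        )
    return actual
```

For this solution the calls would be `check_row_count(spark, "goldcurated.green_discounts", 1536873)` and `check_row_count(spark, "goldcurated.green_avg_metrics", 40727)`.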


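
Since the exercise also allows Spark SQL, the same aggregation could be expressed as a query. The sketch below only builds the SQL text from the grouping keys and metric aliases used in the PySpark cell. It assumes the source is the `goldcurated.green_discounts` table written earlier and has not been run against the workshop data. The function name `build_avg_metrics_sql` is hypothetical.

```python
# (expression, output name) pairs for the grouping keys, mirroring the PySpark cell.
GROUP_KEYS = [
    ("pickupDate", "pickupDate"),
    ("year(pickupDate)", "year"),
    ("month(pickupDate)", "month"),
    ("dayofmonth", "dayofmonth"),
    ("weekDayName", "weekDayName"),
    ("timeBins", "timeBins"),
    ("VendorID", "VendorID"),
    ("PULocationID", "PULocationID"),
]

# Source column -> output column, using the same aliases as the PySpark cell.
METRICS = {
    "passenger_count": "average_passenger_count",
    "trip_distance": "average_trip_distance",
    "trip_duration": "average_trip_duration",
    "total_amount": "average_total_amount",
    "tip_amount": "average_tip_amount",
    "fare_amount": "average_fare",
    "discount": "average_discount",
}


def build_avg_metrics_sql(source_table="goldcurated.green_discounts"):
    """Return a Spark SQL SELECT that averages each metric, rounded to 2 decimals."""
    # Derived keys get a backtick-quoted alias; plain columns are selected as-is.
    select_keys = [
        expr if expr == name else f"{expr} AS `{name}`"
        for expr, name in GROUP_KEYS
    ]
    select_metrics = [
        f"ROUND(AVG({column}), 2) AS {alias}" for column, alias in METRICS.items()
    ]
    select_list = ",\n    ".join(select_keys + select_metrics)
    # GROUP BY uses the expressions themselves, not their aliases.
    group_by = ", ".join(expr for expr, _ in GROUP_KEYS)
    return f"SELECT\n    {select_list}\nFROM {source_table}\nGROUP BY {group_by}"
```

The resulting string could be passed to `spark.sql(build_avg_metrics_sql())`, and the returned DataFrame written with the same `saveAsTable` call as above. Keeping the metric-to-alias mapping in one dictionary makes it harder for a column and its alias to drift apart.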