# Configuring Spark with Azure Blob Storage

## Prerequisites
* Azure subscription
* Installed Azure CLI
* `hadoop-azure` module installed in the cluster's Hadoop installation (`%HADOOP_HOME%`)
  * https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure
  * All of the module's dependencies must be installed as well

In [1]:
import findspark
findspark.init()

import pyspark  
from pyspark.sql import SparkSession
import os
import pyspark.sql.functions as F

## Configuration Files

Add the following options when building the `SparkSession`:
```
    .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")\
    .config("fs.azure.account.key.<Storage account name>.blob.core.windows.net","<Storage account key>")\
```
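The two settings above differ only in the storage account name and key, so they can be assembled in one place. The following is a minimal sketch, not part of the original notebook: `azure_blob_conf` is a hypothetical helper name, and `mystorageaccount` is a placeholder account name.

```python
def azure_blob_conf(account_name: str, account_key: str) -> dict:
    """Return the two settings shown above as a plain dict (hypothetical helper)."""
    if not account_name or not account_key:
        raise ValueError("account_name and account_key must be non-empty")
    return {
        # File system implementation, as configured in this notebook
        "fs.azure": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
        # Access key, scoped to a single storage account
        f"fs.azure.account.key.{account_name}.blob.core.windows.net": account_key,
    }


# Placeholder values for illustration only
conf = azure_blob_conf("mystorageaccount", "<Storage account key>")
for key, value in conf.items():
    print(key, "=", value)
```

Each pair could then be applied to the builder with `.config(key, value)` before calling `.getOrCreate()`.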

In [3]:
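import json

# Load the storage account name and key from a local JSON file;
# the context manager closes the file handle once it has been read
with open("../credentials/credentials.json") as f:
    credentials = json.load(f)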
spark = SparkSession\
    .builder\
    .master("local[*]")\
    .appName('HelloWorld')\
    .config("spark.driver.memory", "6G") \
    .config("spark.executor.memory", "6G") \
    .config("spark.driver.maxResultSize", "6G") \
    .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")\
    .config(f'fs.azure.account.key.{credentials["storage_account_name"]}.blob.core.windows.net',credentials["storage_account_key"])\
    .getOrCreate()
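The cell above expects `credentials.json` to contain the keys `storage_account_name` and `storage_account_key`. If either is missing, the failure only surfaces later as a `KeyError` inside the builder chain. The sketch below checks the file up front. `load_credentials` and `REQUIRED_KEYS` are hypothetical names, and the demo writes a throwaway file with made-up values so that it runs without a real account.

```python
import json
import os
import tempfile

# Keys the notebook reads from the credentials file
REQUIRED_KEYS = ("storage_account_name", "storage_account_key")


def load_credentials(path):
    """Load the credentials file and fail early if a required key is missing or empty."""
    with open(path) as f:
        creds = json.load(f)
    missing = [k for k in REQUIRED_KEYS if not creds.get(k)]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")
    return creds


# Demo with a temporary file and made-up values (not a real account)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials.json")
    with open(demo_path, "w") as f:
        json.dump(
            {"storage_account_name": "demoaccount", "storage_account_key": "demo-key"},
            f,
        )
    demo_creds = load_credentials(demo_path)

print(demo_creds["storage_account_name"])
```

Keeping the credentials file outside version control avoids committing the account key by accident.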



### Sample Code
Copied from 06_sql.ipynb

In [5]:
df_yellow = spark.read.parquet("../resources/datasets/yellow/*/*")
df_green = spark.read.parquet("../resources/datasets/green/*/*")

df_green = df_green\
    .withColumnRenamed("lpep_pickup_datetime", "pickup_datetime") \
    .withColumnRenamed("lpep_dropoff_datetime", "dropoff_datetime")

# Rename the yellow-taxi datetime columns the same way
df_yellow = df_yellow\
    .withColumnRenamed("tpep_pickup_datetime", "pickup_datetime") \
    .withColumnRenamed("tpep_dropoff_datetime", "dropoff_datetime")

common_columns = set(df_yellow.columns) & set(df_green.columns) # intersection of columns using & operator
print("Common columns: ", common_columns)
print("Not in both: ", set(df_green.columns).symmetric_difference(set(df_yellow.columns))) # symmetric difference of columns using symmetric_difference() method

df_yellow = df_yellow.select(*common_columns).withColumn("service_type", F.lit("yellow"))
df_green = df_green.select(*common_columns).withColumn("service_type", F.lit("green"))

df_trips_data = df_green.unionAll(df_yellow)

Common columns:  {'mta_tax', 'fare_amount', 'total_amount', 'VendorID', 'DOLocationID', 'tolls_amount', 'RatecodeID', 'tip_amount', 'store_and_fwd_flag', 'payment_type', 'dropoff_datetime', 'extra', 'PULocationID', 'pickup_datetime', 'congestion_surcharge', 'improvement_surcharge', 'passenger_count', 'trip_distance'}
Not in both:  {'airport_fee', 'trip_type', 'ehail_fee'}
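Because `common_columns` is a Python set, the column order in the output above is arbitrary and can change between runs. The sketch below computes the same intersection while keeping the order of the first list. It uses plain lists, so it runs without Spark. The two column lists are shortened, illustrative subsets rather than the full schemas.

```python
# Shortened, illustrative column lists (not the full schemas)
yellow_columns = ["VendorID", "pickup_datetime", "dropoff_datetime", "fare_amount", "airport_fee"]
green_columns = ["VendorID", "pickup_datetime", "dropoff_datetime", "ehail_fee", "fare_amount", "trip_type"]

green_set = set(green_columns)

# Intersection that preserves the order of yellow_columns
common_ordered = [c for c in yellow_columns if c in green_set]

# Columns present in only one of the two lists, sorted for stable output
not_in_both = sorted(set(yellow_columns) ^ green_set)

print("Common columns: ", common_ordered)
print("Not in both: ", not_in_both)
```

Passing an ordered list such as `common_ordered` to `select(*common_ordered)` would give both DataFrames the same, reproducible column order before the union.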


### Uploading and downloading files

In [6]:
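# Uploading: write the combined DataFrame to the container as Parquet
# (the CSV-only "header" option is dropped; Parquet stores the schema itself)
df_trips_data.write.parquet(f"wasbs://zoomcampcontainer@{credentials['storage_account_name']}.blob.core.windows.net/tripsdata")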
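The upload and download cells both use paths of the form `wasbs://<container>@<account>.blob.core.windows.net/<path>`. A small helper can build that string in one place. This is a sketch: `wasbs_url` is a hypothetical name, and `mystorageaccount` is a placeholder account name.

```python
def wasbs_url(container: str, account_name: str, path: str = "") -> str:
    """Build a wasbs:// URL for a blob path inside a container (hypothetical helper)."""
    if not container or not account_name:
        raise ValueError("container and account_name must be non-empty")
    # Drop leading slashes so the result never contains a double slash after the host
    path = path.lstrip("/")
    return f"wasbs://{container}@{account_name}.blob.core.windows.net/{path}"


# Placeholder account name for illustration only
print(wasbs_url("zoomcampcontainer", "mystorageaccount", "tripsdata"))
```

The returned string can be passed directly to `spark.read.parquet(...)` or `df.write.parquet(...)`.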

In [7]:
# Downloading: read a Parquet file back from the same container
fhv_trips = spark.read.parquet(f"wasbs://zoomcampcontainer@{credentials['storage_account_name']}.blob.core.windows.net/fhv_tripdata_2020.parquet")
fhv_trips.show(10)

+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|Affiliated_base_number|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|              B00001|2020-01-01 00:30:00|2020-01-01 01:44:00|       264.0|       264.0|   null|                B00001|
|              B00001|2020-01-01 00:30:00|2020-01-01 00:47:00|       264.0|       264.0|   null|                B00001|
|              B00009|2020-01-01 00:48:00|2020-01-01 01:19:00|       264.0|       264.0|   null|                B00009|
|              B00009|2020-01-01 00:34:00|2020-01-01 00:43:00|       264.0|       264.0|   null|                B00009|
|              B00009|2020-01-01 00:23:00|2020-01-01 00:32:00|       264.0|       264.0|   null|                B00009|
|              B00009|2020-01-01 00:52:0