# Access data on Azure Storage Blob (WASB) with Synapse Spark

You can access data on Azure Storage Blob (WASB) from Synapse Spark using a URL of the following form:

    wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>

This notebook provides examples of how to read data from WASB into a Spark context and how to write the output of Spark jobs directly into a WASB location.
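As a minimal sketch of the URL pattern above, the following builds a WASB URL from its parts. The object `WasbUrl` and its `build` method are hypothetical names introduced for illustration only; they are not part of Synapse or Spark.

```scala
// Hypothetical helper (an assumption for illustration, not a Synapse or Spark API).
object WasbUrl {
  // Builds wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>
  def build(container: String, account: String, path: String, secure: Boolean = true): String = {
    require(container.nonEmpty, "container name must not be empty")
    require(account.nonEmpty, "storage account name must not be empty")
    // "wasbs" is the TLS-secured variant of the "wasb" scheme.
    val scheme = if (secure) "wasbs" else "wasb"
    // Drop leading slashes so exactly one "/" separates host and path.
    val cleanPath = path.dropWhile(_ == '/')
    s"$scheme://$container@$account.blob.core.windows.net/$cleanPath"
  }
}
```

For example, `WasbUrl.build("holidaydatacontainer", "azureopendatastorage", "Processed")` yields the same path that the first cell of this notebook constructs by hand.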

## Load sample data

Let's first load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) dataset from Azure Open Datasets as a sample.

In [3]:
// set blob storage account connection for open dataset

val hol_blob_account_name = "azureopendatastorage"
val hol_blob_container_name = "holidaydatacontainer"
val hol_blob_relative_path = "Processed"
val hol_blob_sas_token = ""

val hol_wasbs_path = f"wasbs://$hol_blob_container_name@$hol_blob_account_name.blob.core.windows.net/$hol_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$hol_blob_container_name.$hol_blob_account_name.blob.core.windows.net",hol_blob_sas_token)

hol_blob_account_name: String = azureopendatastorage
hol_blob_container_name: String = holidaydatacontainer
hol_blob_relative_path: String = Processed
hol_blob_sas_token: String = ""
hol_wasbs_path: String = wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed
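The cell above stores the SAS token under a configuration key that embeds both the container and the storage account name. The sketch below isolates that key pattern in a small function; `SasConf` and `key` are hypothetical names used only for illustration.

```scala
// Hypothetical helper (an assumption for illustration): builds the configuration
// key used in this notebook to register a SAS token for one container.
object SasConf {
  def key(container: String, account: String): String =
    s"fs.azure.sas.$container.$account.blob.core.windows.net"
}
```

In a notebook cell this could be used as `spark.conf.set(SasConf.key(hol_blob_container_name, hol_blob_account_name), hol_blob_sas_token)`.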

In [4]:
// load the sample data as a Spark DataFrame
val hol_df = spark.read.parquet(hol_wasbs_path) 
hol_df.show(5, truncate = false)

hol_df: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName               |normalizeHolidayName      |isPaidTimeOff|countryRegionCode|date               |
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|Argentina      |Año Nuevo [New Year's Day]|Año Nuevo [New Year's Day]|null         |AR               |1970-01-01 00:00:00|
|Australia      |New Year's Day            |New Year's Day            |null         |AU               |1970-01-01 00:00:00|
|Austria        |Neujahr                   |Neujahr                   |null         |AT               |1970-01-01 00:00:00|
|Belgium        |Nieuwjaarsdag             |Nieuwjaarsdag             |null         |BE               |1970-01-01 00:00:00|
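|Brazil         |Ano novo                  |Ano novo                  |null         |BR               |1970-01-01 00:00:00|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
only showing top 5 rows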

## Write data to Azure Storage Blob

We are going to write the Spark DataFrame to your Azure Blob Storage (WASB) path using a **shared access signature (SAS)**. Go to the [Azure Portal](https://portal.azure.com/), open your Azure storage blob, select **shared access signature** in the **settings**, and generate your SAS token. Make sure the token allows container-level read and write permissions. Then fill in the access information for your Azure storage blob in the cell below.


In [5]:
// set your blob storage account connection

val blob_account_name = "" // replace with your storage account name
val blob_container_name = "" // replace with your container name
val blob_relative_path = "" // replace with your relative folder path (end it with "/")
val blob_sas_token = "" // replace with your SAS token

val wasbs_path = f"wasbs://$blob_container_name@$blob_account_name.blob.core.windows.net/$blob_relative_path"
spark.conf.set(f"fs.azure.sas.$blob_container_name.$blob_account_name.blob.core.windows.net",blob_sas_token)


blob_account_name: String = samplenbblob
blob_container_name: String = data
blob_relative_path: String = samplenb/
blob_sas_token: String = ?sv=2019-02-02&ss=b&srt=sco&sp=rwdlac&se=2021-03-23T17:05:16Z&st=2020-03-24T09:05:16Z&spr=https,http&sig=drtIrL68s07nPW0Q9WEb5XFL6y5Eb7%2BOpmpxGyAHLaw%3D
wasbs_path: String = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/
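A SAS token carries its own expiry time in the `se` field, as the sample output above shows, and an expired token is a common reason for write failures. The following sketch reads that field before the token is used. `SasToken` and its methods are hypothetical names, and the code assumes the expiry is written as a full UTC timestamp such as `2021-03-23T17:05:16Z`; other date formats simply yield `None`.

```scala
import java.net.URLDecoder
import java.time.Instant
import scala.util.Try

// Hypothetical helper (an assumption for illustration, not an Azure or Spark API).
object SasToken {
  // Splits a SAS query string such as "?sv=...&se=...&sig=..." into key/value pairs.
  def params(token: String): Map[String, String] =
    token.stripPrefix("?").split("&").toSeq
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

  // Returns the expiry instant from the "se" field, if present and parseable.
  def expiry(token: String): Option[Instant] =
    params(token).get("se")
      .map(v => URLDecoder.decode(v, "UTF-8")) // the value may be URL-encoded
      .flatMap(v => Try(Instant.parse(v)).toOption)

  // True only when an expiry was found and it lies before `now`.
  def isExpired(token: String, now: Instant = Instant.now()): Boolean =
    expiry(token).exists(_.isBefore(now))
}
```

A cell could call `SasToken.isExpired(blob_sas_token)` before `spark.conf.set(...)` and stop early with a clear message instead of failing later during the write.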

### Save a dataframe as Parquet, JSON or CSV
If you have a dataframe, you can save it as Parquet, JSON, or CSV with the .write.parquet(), .write.json(), and .write.csv() methods, respectively.

Dataframes can be saved in any format, regardless of the input format.


In [6]:
// set the path for the output file

val parquet_path = wasbs_path + "holiday.parquet"
val json_path = wasbs_path + "holiday.json"
val csv_path = wasbs_path + "holiday.csv"

parquet_path: String = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.parquet
json_path: String = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.json
csv_path: String = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.csv
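The plain string concatenation above only produces valid paths when the base path ends with `/`, as `samplenb/` does in the sample output. A minimal sketch of a join that tolerates a missing or extra slash follows; `BlobPath.join` is a hypothetical name introduced here.

```scala
// Hypothetical helper (an assumption for illustration).
object BlobPath {
  // Joins a base location and a file name with exactly one "/" between them,
  // trimming a single trailing slash from the base and a single leading slash from the name.
  def join(base: String, name: String): String =
    base.stripSuffix("/") + "/" + name.stripPrefix("/")
}
```

With this, `BlobPath.join(wasbs_path, "holiday.parquet")` gives the same result whether or not the relative path was entered with a trailing slash.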

In [7]:
import org.apache.spark.sql.SaveMode

hol_df.write.mode(SaveMode.Overwrite).parquet(parquet_path)
hol_df.write.mode(SaveMode.Overwrite).json(json_path)
hol_df.write.mode(SaveMode.Overwrite).option("header", "true").csv(csv_path)

import org.apache.spark.sql.SaveMode

### Save a dataframe as text files
If you have a dataframe that you want to save as a text file, you must first convert it to an RDD and then save that RDD as a text file.


In [8]:
// Define the text file path and convert the Spark DataFrame into an RDD
val text_path = wasbs_path + "holiday.txt"
val hol_RDD = hol_df.rdd

text_path: String = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.txt
hol_RDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[18] at rdd at <console>:30

If you have an RDD, you can save it as a text file as follows:


In [10]:
// Save RDD as text file
hol_RDD.saveAsTextFile(text_path)

## Read data from Azure Storage Blob


### Create a dataframe from Parquet files


In [12]:
val df_parquet = spark.read.parquet(parquet_path)

df_parquet: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]

### Create a dataframe from JSON files


In [13]:
val df_json = spark.read.json(json_path)

df_json: org.apache.spark.sql.DataFrame = [countryOrRegion: string, countryRegionCode: string ... 4 more fields]

### Create a dataframe from CSV files


In [15]:
val df_csv = spark.read.option("header", "true").csv(csv_path)

df_csv: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]

### Create an RDD from a text file


In [16]:
val text = sc.textFile(text_path)

text: org.apache.spark.rdd.RDD[String] = wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.txt MapPartitionsRDD[36] at textFile at <console>:32

In [17]:
text.take(5).foreach(println)

[Argentina,Año Nuevo [New Year's Day],Año Nuevo [New Year's Day],null,AR,1970-01-01 00:00:00.0]
[Australia,New Year's Day,New Year's Day,null,AU,1970-01-01 00:00:00.0]
[Austria,Neujahr,Neujahr,null,AT,1970-01-01 00:00:00.0]
[Belgium,Nieuwjaarsdag,Nieuwjaarsdag,null,BE,1970-01-01 00:00:00.0]
[Brazil,Ano novo,Ano novo,null,BR,1970-01-01 00:00:00.0]
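Each line read back is the bracketed, comma-separated string form of a row, not a structured record. As a rough sketch, the lines printed above could be split into fields as shown below. `RowText.parse` is a hypothetical helper, and it assumes that no field value contains a comma, which holds for the five sample lines but is not guaranteed for the whole dataset.

```scala
// Hypothetical helper (an assumption for illustration): parses the bracketed
// text form shown above. Values that contain commas would be split incorrectly.
object RowText {
  def parse(line: String): Vector[Option[String]] =
    line.stripPrefix("[").stripSuffix("]") // remove only the outer brackets
      .split(",", -1).toVector             // -1 keeps trailing empty fields
      .map(f => if (f == "null") None else Some(f)) // the text "null" marks a missing value
}
```

In the notebook, `text.map(RowText.parse)` would give an RDD of field vectors. For structured round trips, the Parquet, JSON, and CSV formats shown earlier avoid this kind of manual parsing.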