# Access data on Azure Storage Blob (WASB) with Synapse Spark

You can access data on Azure Storage Blob (WASB) with Synapse Spark via the following URL:

    wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>

This notebook provides examples of how to read data from WASB into a Spark context and how to write the output of Spark jobs directly into a WASB location.
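
As a minimal sketch of the URL pattern above, the helper below assembles a WASB URL from its parts. The function name `build_wasb_url` and its `secure` flag are hypothetical, introduced only for illustration; they are not part of Synapse or Spark.

```python
def build_wasb_url(container_name, storage_account_name, path="", secure=True):
    """Assemble a WASB URL (hypothetical helper, for illustration only)."""
    # 'wasbs' is the TLS-secured scheme; 'wasb' is the unencrypted variant.
    scheme = "wasbs" if secure else "wasb"
    # Strip leading slashes so the path joins cleanly after the host name.
    return "%s://%s@%s.blob.core.windows.net/%s" % (
        scheme, container_name, storage_account_name, path.lstrip("/"))


example_url = build_wasb_url("data", "samplenbblob", "samplenb/")
print(example_url)
```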

## Load sample data

Let's first load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) from the last 6 months from Azure Open Datasets as a sample.

In [3]:
from azureml.opendatasets import PublicHolidays

from datetime import datetime
from dateutil.relativedelta import relativedelta


# Query window: the six months ending today
end_date = datetime.today()
start_date = end_date - relativedelta(months=6)
hol = PublicHolidays(start_date=start_date, end_date=end_date)
hol_df = hol.to_spark_dataframe()

In [4]:
# Display 5 rows
hol_df.show(5, truncate = False)

+---------------+-------------------------+-------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName              |normalizeHolidayName     |isPaidTimeOff|countryRegionCode|date               |
+---------------+-------------------------+-------------------------+-------------+-----------------+-------------------+
|Czech          |Den české státnosti      |Den české státnosti      |null         |CZ               |2019-09-28 00:00:00|
|Norway         |Søndag                   |Søndag                   |null         |NO               |2019-09-29 00:00:00|
|Sweden         |Söndag                   |Söndag                   |null         |SE               |2019-09-29 00:00:00|
|India          |Gandhi Jayanti           |Gandhi Jayanti           |true         |IN               |2019-10-02 00:00:00|
|Germany        |Tag der Deutschen Einheit|Tag der Deutschen Einheit|null         |DE               |2019-10-03 00:00:00|
+---------------+-------------------------+-------------------------+-------------+-----------------+-------------------+

## Write data to Azure Storage Blob

Synapse uses a **shared access signature (SAS)** to access Azure Blob Storage. To avoid exposing SAS keys in the code, we recommend creating a linked service in your Synapse workspace for the Azure Blob Storage account you want to access.

Follow these steps to add a new linked service for an Azure Blob Storage account:

1. Open the [Azure Synapse Studio](https://web.azuresynapse.net/).
2. Select **Manage** from the left panel, then select **Linked services** under **External connections**.
3. Search for **Azure Blob Storage** in the **New linked service** panel on the right.
4. Select **Continue**.
5. Select the Azure Blob Storage account to access and configure the linked service name. We suggest using **Account key** as the **Authentication method**.
6. Select **Test connection** to validate that the settings are correct.
7. Select **Create**, then select **Publish all** to save your changes.

You can access data on Azure Blob Storage with Synapse Spark via the following URL:

```wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/```

Make sure to allow container-level read and write permissions. Fill in the access information for your Azure Blob Storage account in the cell below.


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Azure storage access info
blob_account_name = 'Your storage account name' # replace with your storage account name
blob_container_name = 'Your container name' # replace with your container name
blob_relative_path = 'Your relative path' # replace with your relative folder path, ending with '/'
linked_service_name = 'Your linked service name' # replace with your linked service name

# Retrieve the SAS token through the linked service so that it never appears in the notebook
blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)

In [6]:
# Allow Spark to access Blob storage remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

Remote blob path: wasbs://data@samplenbblob.blob.core.windows.net/samplenb/
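
The configuration key passed to `spark.conf.set` above is scoped to one container in one storage account. The sketch below isolates how that key is assembled; `sas_conf_key` is a hypothetical helper name, not a Spark or Synapse API.

```python
def sas_conf_key(container_name, storage_account_name):
    """Build the per-container SAS configuration key (hypothetical helper)."""
    return "fs.azure.sas.%s.%s.blob.core.windows.net" % (
        container_name, storage_account_name)


key = sas_conf_key("data", "samplenbblob")
print(key)
# In a Synapse notebook, the key would be used as in the cell above:
# spark.conf.set(key, blob_sas_token)
```

Because the key names both the container and the account, a token registered this way applies only to that container; a second container would need its own entry.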

### Save a dataframe as Parquet, JSON or CSV
If you have a dataframe, you can save it as Parquet, JSON, or CSV with the `.write.parquet()`, `.write.json()`, and `.write.csv()` methods respectively.

Dataframes can be saved in any format, regardless of the input format.


In [7]:
parquet_path = wasbs_path + 'holiday.parquet'
json_path = wasbs_path + 'holiday.json'
csv_path = wasbs_path + 'holiday.csv'
print('parquet file path: ' + parquet_path)
print('json file path: ' + json_path)
print('csv file path: ' + csv_path)

parquet file path: wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.parquet
json file path: wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.json
csv file path: wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.csv
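
The paths above are built by plain string concatenation, so they depend on `blob_relative_path` ending with `/`. As a sketch, a small helper (the name `join_blob_path` is hypothetical) can make the join independent of trailing slashes:

```python
def join_blob_path(base_path, file_name):
    """Join a base WASB path and a file name with exactly one '/' between them."""
    return base_path.rstrip("/") + "/" + file_name.lstrip("/")


base = "wasbs://data@samplenbblob.blob.core.windows.net/samplenb"
without_slash = join_blob_path(base, "holiday.parquet")
with_slash = join_blob_path(base + "/", "holiday.parquet")
print(without_slash)
print(with_slash)
```

Both calls produce the same path, whether or not the base path carries a trailing slash.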

In [8]:
hol_df.write.parquet(parquet_path, mode = 'overwrite')
hol_df.write.json(json_path, mode = 'overwrite')
hol_df.write.csv(csv_path, mode = 'overwrite', header = 'true')

### Save a dataframe as text files
If you have a dataframe that you want to save as a text file, first convert it to an RDD, then save that RDD as a text file.


In [9]:
# Define the text file path
text_path = wasbs_path + 'holiday.txt'
print('text file path: ' + text_path)

text file path: wasbs://data@samplenbblob.blob.core.windows.net/samplenb/holiday.txt

In [10]:
# Convert the Spark dataframe into an RDD
hol_RDD = hol_df.rdd
type(hol_RDD)

<class 'pyspark.rdd.RDD'>

If you have an RDD, you can save it as a text file as follows:


In [12]:
# Save RDD as text file
hol_RDD.saveAsTextFile(text_path)
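
`saveAsTextFile` writes the string representation of each RDD element, so the rows of a dataframe are saved in their `Row(...)` form. If delimited lines are preferred, one option is to map each row to a string first. The sketch below assumes a tab delimiter and uses a hypothetical helper, `row_to_line`; it is demonstrated on a plain tuple because a Spark `Row` is a tuple subclass.

```python
def row_to_line(row, delimiter="\t"):
    """Render one row as a delimited line; None becomes an empty field."""
    return delimiter.join("" if value is None else str(value) for value in row)


sample_row = ("India", "Gandhi Jayanti", "Gandhi Jayanti", True, "IN", "2019-10-02 00:00:00")
line = row_to_line(sample_row)
print(line)
# In a Synapse notebook this could be applied before saving, for example:
# hol_df.rdd.map(row_to_line).saveAsTextFile(text_path)
```

Note that, by default, `saveAsTextFile` raises an error if the output path already exists, unlike the `mode = 'overwrite'` writes used for Parquet, JSON, and CSV.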

## Read data from Azure Storage Blob


### Create a dataframe from parquet files


In [13]:
df_parquet = spark.read.parquet(parquet_path)

### Create a dataframe from JSON files


In [14]:
df_json = spark.read.json(json_path)

### Create a dataframe from CSV files


In [15]:
df_csv = spark.read.csv(csv_path, header = 'true')

### Create an RDD from text file


In [16]:
text = sc.textFile(text_path)