# Access data on Azure Data Lake Storage Gen2 (ADLS Gen2) with Synapse Spark

Azure Data Lake Storage Gen2 (ADLS Gen2) is used as the storage account associated with a Synapse workspace. A Synapse workspace can have a default ADLS Gen2 storage account and additional linked storage accounts.

You can access data on ADLS Gen2 with Synapse Spark via the following URL:
    
    abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

This notebook provides examples of how to read data from an ADLS Gen2 account into a Spark context and how to write the output of Spark jobs directly into an ADLS Gen2 location.
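
As a small illustration of the URL format above, the following sketch assembles an `abfss://` URI from its three parts. The helper name `abfssUri` and the sample container, account, and path values are hypothetical; they are not part of Synapse or Spark.

```scala
// Hypothetical helper: assemble an ADLS Gen2 URI from its parts.
def abfssUri(container: String, account: String, path: String): String = {
  // Drop leading slashes so the result never contains "//" after the host.
  val cleanPath = path.dropWhile(_ == '/')
  s"abfss://$container@$account.dfs.core.windows.net/$cleanPath"
}

// Placeholder names; replace them with your own container and account.
val exampleUri = abfssUri("mycontainer", "mystorageaccount", "/data/holidays")
println(exampleUri)
// abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/holidays
```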

## Pre-requisites
Synapse uses Azure Active Directory (AAD) pass-through to access any ADLS Gen2 account (or folder) on which you have been assigned the **Storage Blob Data Contributor** role. No credentials or access tokens are required.

## Load sample data

Let's first load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) dataset from Azure Open Datasets as a sample.

In [5]:
// set blob storage account connection for open dataset

val hol_blob_account_name = "azureopendatastorage"
val hol_blob_container_name = "holidaydatacontainer"
val hol_blob_relative_path = "Processed"
val hol_blob_sas_token = ""

val hol_wasbs_path = f"wasbs://$hol_blob_container_name@$hol_blob_account_name.blob.core.windows.net/$hol_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$hol_blob_container_name.$hol_blob_account_name.blob.core.windows.net",hol_blob_sas_token)

hol_blob_account_name: String = azureopendatastorage
hol_blob_container_name: String = holidaydatacontainer
hol_blob_relative_path: String = Processed
hol_blob_sas_token: String = ""
hol_wasbs_path: String = wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed

In [6]:
// load the sample data as a Spark DataFrame
val hol_df = spark.read.parquet(hol_wasbs_path) 
hol_df.show(5, truncate = false)

hol_df: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName               |normalizeHolidayName      |isPaidTimeOff|countryRegionCode|date               |
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+
|Argentina      |Año Nuevo [New Year's Day]|Año Nuevo [New Year's Day]|null         |AR               |1970-01-01 00:00:00|
|Australia      |New Year's Day            |New Year's Day            |null         |AU               |1970-01-01 00:00:00|
|Austria        |Neujahr                   |Neujahr                   |null         |AT               |1970-01-01 00:00:00|
|Belgium        |Nieuwjaarsdag             |Nieuwjaarsdag             |null         |BE               |1970-01-01 00:00:00|
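|Brazil         |Ano novo                  |Ano novo                  |null         |BR               |1970-01-01 00:00:00|
+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+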

## Write data to the default ADLS Gen2 storage

We are going to write the Spark DataFrame to your default ADLS Gen2 storage account.


In [7]:
// set your storage account connection

val account_name = "" // replace with your storage account name
val container_name = "" // replace with your container name
val relative_path = "" // replace with your relative folder path, ending with "/"

val adls_path = f"abfss://$container_name@$account_name.dfs.core.windows.net/$relative_path"

account_name: String = ltianwestus2gen2
container_name: String = mydefault
relative_path: String = samplenb/
adls_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/

### Save a dataframe as Parquet, JSON or CSV
If you have a dataframe, you can save it as Parquet, JSON, or CSV with the .write.parquet(), .write.json() and .write.csv() methods respectively.

Dataframes can be saved in any format, regardless of the input format.
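
The format-specific methods are shortcuts for the generic `format(...).save(...)` form of the DataFrame writer. The sketch below shows that form under some assumptions: it uses a tiny in-memory DataFrame and a local temporary directory instead of ADLS Gen2 so that it can run outside Synapse, and the names `demoDf` and `outDir` as well as the sample rows are illustrative only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// In a Synapse notebook, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("write-formats-sketch").getOrCreate()
import spark.implicits._

// Tiny stand-in for hol_df (assumed sample rows).
val demoDf = Seq(("Austria", "Neujahr"), ("Belgium", "Nieuwjaarsdag"))
  .toDF("countryOrRegion", "holidayName")

// Local temp directory standing in for an abfss:// location.
val outDir = Files.createTempDirectory("holiday-demo").toString

// Same DataFrame, three output formats, each selected by name.
for (fmt <- Seq("parquet", "json", "csv")) {
  demoDf.write.format(fmt).mode(SaveMode.Overwrite).save(s"$outDir/holiday.$fmt")
}

// Read one format back to confirm the round trip.
val roundTrip = spark.read.parquet(s"$outDir/holiday.parquet")
println(roundTrip.count())
```

To target ADLS Gen2 instead, the same calls should work with an `abfss://` path in place of `outDir`.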


In [9]:
// set the path for the output file

val parquet_path = adls_path + "holiday.parquet"
val json_path = adls_path + "holiday.json"
val csv_path = adls_path + "holiday.csv"

parquet_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.parquet
json_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.json
csv_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.csv
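
Note that plain string concatenation such as `adls_path + "holiday.parquet"` only yields a valid path when `relative_path` ends with `/`, as it does in the output above (`samplenb/`). A minimal sketch of a helper that does not depend on the trailing slash follows; `joinPath` is a hypothetical name, not a Spark or Synapse function.

```scala
// Hypothetical helper: join a base path and a file name with exactly one "/".
def joinPath(base: String, name: String): String =
  base.stripSuffix("/") + "/" + name.stripPrefix("/")

val base = "abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb"
val withSlash = joinPath(base + "/", "holiday.parquet")
val withoutSlash = joinPath(base, "holiday.parquet")

// Both forms produce the same path.
println(withSlash)
println(withSlash == withoutSlash)
```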

In [10]:
import org.apache.spark.sql.SaveMode

hol_df.write.mode(SaveMode.Overwrite).parquet(parquet_path)
hol_df.write.mode(SaveMode.Overwrite).json(json_path)
hol_df.write.mode(SaveMode.Overwrite).option("header", "true").csv(csv_path)

import org.apache.spark.sql.SaveMode

### Save a dataframe as text files
If you have a dataframe that you want to save as a text file, first convert it to an RDD and then save that RDD as a text file.


In [12]:
// Define the text file path and convert the Spark DataFrame into an RDD
val text_path = adls_path + "holiday.txt"
val hol_RDD = hol_df.rdd

text_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.txt
hol_RDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[24] at rdd at <console>:30

In [14]:
// Save RDD as text file
hol_RDD.saveAsTextFile(text_path)
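
Saving the RDD of `Row` objects directly writes each row in its `toString` form, with surrounding square brackets. If plain delimited lines are preferred, one option is to format each row before saving. The sketch below assumes a local Spark session and a temporary directory in place of ADLS Gen2; `demoDf`, `lines`, and `textOut` are illustrative names, and the tab delimiter is an arbitrary choice.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In a Synapse notebook, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("text-save-sketch").getOrCreate()
import spark.implicits._

// Tiny stand-in for hol_df (assumed sample rows).
val demoDf = Seq(("Austria", "Neujahr", "AT"), ("Belgium", "Nieuwjaarsdag", "BE"))
  .toDF("countryOrRegion", "holidayName", "countryRegionCode")

// Row.mkString joins the column values without the surrounding brackets.
val lines = demoDf.rdd.map(row => row.mkString("\t"))

// saveAsTextFile fails if the target already exists, so use a fresh location.
val textOut = Files.createTempDirectory("holiday-text").resolve("holiday.txt").toString
lines.saveAsTextFile(textOut)

// Read the part files back; sort because partition order is not guaranteed.
val readBack = spark.sparkContext.textFile(textOut).collect().sorted
readBack.foreach(println)
```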

## Read data from the default ADLS Gen2 storage


### Create a dataframe from Parquet files


In [15]:
val df_parquet = spark.read.parquet(parquet_path)

df_parquet: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]

### Create a dataframe from JSON files


In [16]:
val df_json = spark.read.json(json_path)

df_json: org.apache.spark.sql.DataFrame = [countryOrRegion: string, countryRegionCode: string ... 4 more fields]
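
The schema printed above lists `countryRegionCode` directly after `countryOrRegion`, which suggests that the column order after a JSON read differs from the original DataFrame: schema inference on JSON returns fields sorted by name. A minimal sketch of restoring the original order with `select` follows, assuming a local Spark session and a temporary directory; `original`, `fromJson`, and `reordered` are illustrative names.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In a Synapse notebook, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-order-sketch").getOrCreate()
import spark.implicits._

// Columns deliberately not in alphabetical order (assumed sample row).
val original = Seq(("Neujahr", "Austria", "AT"))
  .toDF("holidayName", "countryOrRegion", "countryRegionCode")

val jsonDir = Files.createTempDirectory("holiday-json").resolve("demo.json").toString
original.write.json(jsonDir)

// Inferred JSON schema: fields come back sorted by name.
val fromJson = spark.read.json(jsonDir)

// Re-select with the original column list to restore the order.
val reordered = fromJson.select(original.columns.map(col): _*)

println(fromJson.columns.mkString(", "))
println(reordered.columns.mkString(", "))
```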

### Create a dataframe from CSV files


In [17]:
val df_csv = spark.read.option("header", "true").csv(csv_path)

df_csv: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
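
CSV files carry no type information, so with only the `header` option set Spark reads every column as a string. The sketch below contrasts that default with the `inferSchema` option. It assumes a local Spark session and a temporary directory, and the `holidayCount` column is made up for the demonstration; it is not part of the public holidays dataset.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

// In a Synapse notebook, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema-sketch").getOrCreate()
import spark.implicits._

// Write a tiny CSV with one numeric column to a local temp directory.
val csvDir = Files.createTempDirectory("holiday-csv").resolve("demo.csv").toString
Seq(("Austria", 1), ("Belgium", 2))
  .toDF("countryOrRegion", "holidayCount")
  .write.option("header", "true").csv(csvDir)

// Default behavior: every column is read as a string.
val asStrings = spark.read.option("header", "true").csv(csvDir)

// With inferSchema, Spark scans the data to choose column types.
val inferred = spark.read.option("header", "true").option("inferSchema", "true").csv(csvDir)

asStrings.printSchema()
inferred.printSchema()
```

Inference requires an extra pass over the data, so for large files supplying an explicit schema through `spark.read.schema(...)` may be the better choice.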

### Create an RDD from a text file


In [18]:
val text = sc.textFile(text_path)

text: org.apache.spark.rdd.RDD[String] = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.txt MapPartitionsRDD[33] at textFile at <console>:32

In [19]:
text.take(5).foreach(println)

[Argentina,Año Nuevo [New Year's Day],Año Nuevo [New Year's Day],null,AR,1970-01-01 00:00:00.0]
[Australia,New Year's Day,New Year's Day,null,AU,1970-01-01 00:00:00.0]
[Austria,Neujahr,Neujahr,null,AT,1970-01-01 00:00:00.0]
[Belgium,Nieuwjaarsdag,Nieuwjaarsdag,null,BE,1970-01-01 00:00:00.0]
[Brazil,Ano novo,Ano novo,null,BR,1970-01-01 00:00:00.0]
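
Each line above is the `toString` form of a `Row`, so the RDD holds plain strings rather than structured records. A minimal sketch of splitting such a line back into fields follows. `parseRowLine` is a hypothetical helper, and it assumes that no field value contains a comma, which holds for the five sample lines but may not hold for the full dataset.

```scala
// Hypothetical parser for lines written from Row.toString, e.g. "[a,b,c]".
// Assumption: no field contains a comma; real data may need a sturdier format.
def parseRowLine(line: String): Array[String] =
  line.stripPrefix("[").stripSuffix("]").split(",", -1)

val sample = "[Argentina,Año Nuevo [New Year's Day],Año Nuevo [New Year's Day],null,AR,1970-01-01 00:00:00.0]"
val fields = parseRowLine(sample)

println(fields.length) // 6
println(fields(1))     // Año Nuevo [New Year's Day]
```

Applied to the RDD from the previous cell, this would be `text.map(parseRowLine)`.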