# Working with data stored in Azure Data Lake Store

Azure Data Lake Store (ADLS) can be used as the storage account associated with an HDInsight Spark cluster. An HDInsight cluster can have a default storage account and additional storage accounts. The fully qualified URL to access the cluster storage in ADLS is:

    adl://<adls-name>.azuredatalakestore.net/clusters/<cluster-name>
    
The URL to access only the default storage is:

    adl:///<path>
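
The two URL forms can be assembled with a small helper. This is a minimal sketch, not part of the original notebook: `adl_url` and its parameters are hypothetical names, and it assumes the fully qualified form places the store host directly after `adl://`.

```python
def adl_url(path, store_name=None):
    """Build an ADLS URL for a path (hypothetical helper, for illustration)."""
    clean_path = path.lstrip('/')
    if store_name is None:
        # Short form: resolved against the cluster's default storage.
        return 'adl:///' + clean_path
    # Fully qualified form: the store host follows the scheme.
    return 'adl://{0}.azuredatalakestore.net/{1}'.format(store_name, clean_path)


# 'mystore' and 'mycluster' are placeholder names.
default_url = adl_url('example/data/people.json')
full_url = adl_url('/clusters/mycluster/example/data/people.json', store_name='mystore')
```

Either string can be passed to the read and write calls used in this notebook.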

This notebook provides examples of how to read data from ADLS into a Spark context and then perform operations on that data. The notebook also provides examples of how to write the output of Spark jobs directly into an ADLS location.


-------
## Read data from ADLS into Spark

The examples below read from the default storage account associated with your Spark cluster, so the URL used in the examples is `adl:///<path>`. However, you can also read from an additional storage account (for example, a WASB account) with the following syntax:

    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
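
As a rough illustration of the syntax above, the sketch below fills in the WASB pattern. The function name `wasb_url` and the sample container and account names are assumptions made for this example only.

```python
def wasb_url(container, account, path, secure=True):
    """Build a wasb:// or wasbs:// URL (hypothetical helper, for illustration)."""
    # The optional 's' in wasb[s] selects the encrypted (TLS) variant.
    scheme = 'wasbs' if secure else 'wasb'
    return '{0}://{1}@{2}.blob.core.windows.net/{3}'.format(
        scheme, container, account, path.lstrip('/'))


# 'mycontainer' and 'myaccount' are placeholder names.
secure_url = wasb_url('mycontainer', 'myaccount', 'example/data/people.json')
plain_url = wasb_url('mycontainer', 'myaccount', '/example/data/people.json', secure=False)
```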

----------
## Notebook setup

When you use PySpark kernel notebooks on HDInsight, you do not need to create a SparkContext or a SparkSession. A SparkSession, which holds the SparkContext, is created for you automatically when you run the first code cell, and the progress is printed as it starts. The session is available under the following variable name:
- SparkSession (`spark`)

To run the cells below, place the cursor in the cell and then press **SHIFT + ENTER**.

### Create an RDD of strings

In [1]:
# textLines is an RDD of strings
textLines = spark.sparkContext.textFile('adl:///example/data/gutenberg/ulysses.txt')

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | application_1485637644503_0313 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


### Create an RDD of key-value pairs

In [None]:
# seqFile is an RDD of key-value pairs
seqFile = spark.sparkContext.sequenceFile('adl:///example/data/people.seq')

### Create a dataframe from parquet files

Create a dataframe from an input parquet file. For more information about parquet files, see [Parquet Files](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#parquet-files) in the Spark SQL programming guide.

In [4]:
# parquetFile is a dataframe that matches the schema of the input parquet file
parquetFile = spark.read.parquet('adl:///example/data/people.parquet')

### Create a dataframe from a JSON document

Create a dataframe that matches the schema of the input JSON document.

In [None]:
# jsonFile is a dataframe that matches the schema of the input JSON file
jsonFile = spark.read.json('adl:///example/data/people.json')

### Create a dataframe from CSV files

Create a dataframe from a CSV file with headers. Spark can automatically infer its schema.

In [None]:
# csvFile is a dataframe that matches the schema of the input CSV file
csvFile = spark.read.csv('adl:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv', header=True, inferSchema=True)

------
## Write data from Spark to ADLS in different formats

The examples below show you how to write output data from Spark directly into the storage accounts associated with your Spark cluster. If you are writing to the default storage account, you can provide the output path like this:

    adl:///<path>

If you are writing to additional storage accounts associated with the cluster, you must provide the output path like this:

    wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>

### Save an RDD as text files

If you have an RDD, you can save it as text files as follows. The output path becomes a directory that holds one part file per partition.

In [2]:
# Save the textLines RDD as text files under the given output path
textLines.saveAsTextFile('adl:///example/data/gutenberg/ulysses2py.txt')

### Save a dataframe as text files

If you have a dataframe that you want to save as a text file, you must first convert it to an RDD and then save that RDD as a text file.

In [5]:
# parquetRDD is the dataframe converted into an RDD of Row objects,
# which can then be saved as text files using RDD methods
parquetRDD = parquetFile.rdd
parquetRDD.saveAsTextFile('adl:///example/data/peoplepy.txt')
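
When an RDD of rows is saved this way, each element is written using its string form, so the lines are expected to look like `Row(...)` rather than delimited values. If plain delimited lines are preferred, one option is to map each row to a string first. The sketch below is only an illustration: `row_to_line` is a hypothetical helper, and it does no quoting or escaping of the separator.

```python
def row_to_line(row, sep=','):
    """Join the fields of one row into a single delimited line.

    Works on any tuple-like row; None values become empty fields.
    """
    return sep.join('' if value is None else str(value) for value in row)


# Plain tuples stand in for dataframe rows here.
comma_line = row_to_line(('Alice', 30))
tab_line = row_to_line(('Bob', None), sep='\t')

# On the cluster this could be applied before saving (not executed here;
# the output path is a placeholder):
# parquetFile.rdd.map(row_to_line).saveAsTextFile('adl:///example/data/people_lines')
```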

### Save a dataframe as Parquet, JSON or CSV

If you have a dataframe, you can save it as Parquet, JSON, or CSV with the `.write.parquet()`, `.write.json()`, and `.write.csv()` methods respectively.

Dataframes can be saved in any format, regardless of the input format.

In [None]:
parquetFile.write.json('adl:///example/data/people3py.json')
csvFile.write.parquet('adl:///example/data/people3py.parquet')
jsonFile.write.csv('adl:///example/data/people3py.csv')

If you have an RDD and want to save it as a parquet or JSON file, you must first convert it to a dataframe. See [Interoperating with RDDs](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#interoperating-with-rdds) for more information.
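
A minimal sketch of that conversion is shown here, assuming the RDD holds tuples and the column names are known in advance. The function name `save_rdd_as_parquet` and its parameters are hypothetical. It relies on `createDataFrame` accepting an RDD together with a list of column names, with column types inferred from the data.

```python
def save_rdd_as_parquet(spark, rdd, columns, path):
    """Convert an RDD of tuples to a dataframe and write it as parquet.

    spark   -- the SparkSession provided by the notebook
    rdd     -- an RDD whose elements are tuples of equal length
    columns -- list of column names, one per tuple field
    path    -- output location, for example an adl:/// path
    """
    df = spark.createDataFrame(rdd, columns)
    df.write.parquet(path)
    return df


# Usage on the cluster (not executed here; the data and path are placeholders):
# pairs = spark.sparkContext.parallelize([('Alice', 30), ('Bob', 25)])
# save_rdd_as_parquet(spark, pairs, ['name', 'age'], 'adl:///example/data/people_from_rdd.parquet')
```

Writing JSON instead only requires replacing `df.write.parquet(path)` with `df.write.json(path)`.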

### Save an RDD of key-value pairs as a sequence file

In [None]:
# saveAsSequenceFile raises a runtime error if the RDD is not made up of key-value pairs
seqFile.saveAsSequenceFile('adl:///example/data/people2py.seq')
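
To catch the runtime error mentioned above before launching the job, a small sample of the RDD can be inspected first. This is a sketch under the assumption that key-value pairs are represented as 2-item tuples or lists. `looks_like_pairs` is a hypothetical helper, and a passing sample does not guarantee that every element in the RDD is a pair.

```python
def looks_like_pairs(sample):
    """Return True if every sampled element is a 2-item tuple or list."""
    return all(
        isinstance(item, (tuple, list)) and len(item) == 2
        for item in sample
    )


# Plain lists stand in for the result of rdd.take(n).
pairs_ok = looks_like_pairs([('alice', 1), ('bob', 2)])
lines_ok = looks_like_pairs(['first line', 'second line'])

# On the cluster (not executed here):
# if looks_like_pairs(seqFile.take(10)):
#     seqFile.saveAsSequenceFile('adl:///example/data/people2py.seq')
```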