# Introduction to DataFrames

This hands-on lab offers an introduction to Azure Databricks (ADB), with a focus on DataFrames. You will learn best practices for data preparation with ADB by building a data preprocessing pipeline with essential transformations and estimators on DataFrames.

Learning Goals:
* Mount your data in Azure Blob storage.
* Understand the relationship between DataFrames, Datasets, and RDDs.
* Use *actions*, such as `show()`, `display()`, and `count()`.
* Use *transformations*, such as `limit()`, `select()`, and `drop()`.
* Understand the difference between *actions* and *transformations*.
* Know how to find documentation on Spark.
* Do some basic data cleansing and description.
* Select and drop columns of your data.
* Convert between SQL and DataFrames.

## Mount Azure Blob storage in Azure Databricks

For this and most other labs, we stored the data in [Azure Blob storage](https://azure.microsoft.com/en-au/services/storage/blobs/).

There are two things you can take away from how we mount the data:
1. The next cell demonstrates how to run another notebook from this notebook.  This can be very useful for creating [Notebook Workflows](https://docs.databricks.com/user-guide/notebooks/notebook-workflows.html). Here this other notebook is stored in the subdirectory `includes` in the parent directory to the present notebook.
1. You can learn how to configure `Shared Access Signatures` (SAS) to provide secure access to data stored in Azure Blob storage. See the Databricks [documentation](https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs) for more details.

If you are curious, look at the contents of the notebook `mnt_blob` to see what happens there.

In [None]:
%run "../includes/mnt_blob"

## The Data Source

The data for this notebook are stored in a compressed [parquet](https://en.wikipedia.org/wiki/Apache_Parquet) "file" called **pagecounts** (~23 MB file from Wikipedia):
* One hour of page counts from the English Wikimedia projects captured August 5, 2016, at 12:00 PM UTC.
* Size on Disk: ~23 MB
* Type: Compressed Parquet File
* More Info: <a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">Page view statistics for Wikimedia projects</a>


We can use the cell magic `%fs` to investigate the folder/file structure of the parquet file.  In general `%fs` allows you to use *dbutils* filesystem commands. For more information, see [Access DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#dbfs-dbutils) with dbutils.

There are other magic commands for [mixing languages](https://docs.databricks.com/user-guide/notebooks/notebook-use.html#mix-languages) within a notebook.

In [None]:
%fs ls /mnt/data/wikipedia/pagecounts/staging_parquet_en_only_clean/

## Create a DataFrame

Let's read the Parquet file into a `DataFrame`.

* We'll start with the `spark` object, an instance of the [SparkSession](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html) class and the entry point to Spark applications.
* From there we can access the `read` object, which gives us an instance of the class [DataFrameReader](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html). This class offers different methods for different types of data, such as `csv`, `json`, and `parquet`, which we use here.

Closely look at the output of running this cell of code. Can you see the type of object that is returned? What columns are in the data, and what are the data types of each column?

In [None]:
parquetDir = "/mnt/data/wikipedia/pagecounts/staging_parquet_en_only_clean/"

pagecountsEnAllDF = (spark  # Our SparkSession & Entry Point
  .read                     # Our DataFrameReader
  .parquet(parquetDir)      # Returns an instance of DataFrame
)

## Actions and Transformations

At first sight, the distinction between actions and transformations might be confusing.

[Transformations](https://databricks.com/glossary/what-are-transformations) instruct Spark how you would like to modify the DataFrame you have into the one that you want.  The key thing to understand about transformations is that they don't actually do the transformation at the time they're specified. They describe a transformation that will be done. Another thing to understand is that you can "pile up" transformations, one after the other.

*Actions* are operations that are executed immediately. Actions are often taken after a transformation, or sequences of transformations, to show the results of the transformations.
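
To make the distinction concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Spark is implemented: the `LazyFrame` class and its `jobs_run` counter are hypothetical names invented for this illustration. Transformations only append a step to a plan and return a new object; an action runs the whole plan.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    jobs_run = 0  # how many times an action has executed a plan

    def __init__(self, rows, plan=()):
        self._rows = rows  # the "source" data
        self._plan = plan  # pending transformation steps, in order

    # Transformations: return a NEW LazyFrame and do no work yet.
    def select(self, *cols):
        def step(rows):
            return [{c: row[c] for c in cols} for row in rows]
        return LazyFrame(self._rows, self._plan + (step,))

    def limit(self, n):
        def step(rows):
            return rows[:n]
        return LazyFrame(self._rows, self._plan + (step,))

    # Actions: execute the recorded plan and return a value.
    def collect(self):
        LazyFrame.jobs_run += 1
        rows = list(self._rows)
        for step in self._plan:
            rows = step(rows)
        return rows

    def count(self):
        return len(self.collect())


source = [
    {"project": "en", "article": "Apache_Spark", "requests": 12},
    {"project": "en", "article": "Kevin_Bacon", "requests": 7},
    {"project": "en.d", "article": "spark", "requests": 1},
]

df = LazyFrame(source)
small = df.select("project", "article").limit(2)  # two transformations piled up
jobs_before_action = LazyFrame.jobs_run           # nothing has been executed yet
n = small.count()                                 # the action runs the plan
print(jobs_before_action, LazyFrame.jobs_run, n)
```

Note that `df` itself is never modified: each transformation hands back a new object that carries a longer plan.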

Let's meet a couple of actions first.

### count()

`count()` will trigger a job to process the request and return a value.

Let's use `count()` to count all records in our `DataFrame`.

In [None]:
total = pagecountsEnAllDF.count()

print("Record Count: {0:,}".format( total ))

That tells us that there are around 2 million rows in the `DataFrame`.

## cache()

Before we take a closer look at the contents of the `DataFrame`, let us introduce a technique that speeds up processing.

The ability to cache data is one technique for achieving better performance with Apache Spark. 

This is because every action requires Spark to read the data from its source (Azure Blob, Amazon S3, HDFS, etc.) but caching moves that data into the memory of the local executor for "instant" access.
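
The effect can be sketched with a toy model in plain Python. The `Source` and `ToyFrame` classes are hypothetical and exist only for this illustration; they mimic the behavior described above (every action re-reads the source unless the data has been cached) and say nothing about Spark's internals.

```python
class Source:
    """Pretend remote storage that counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._rows)


class ToyFrame:
    """Toy stand-in for a DataFrame with a lazy cache() marker."""

    def __init__(self, source):
        self._source = source
        self._cache_requested = False
        self._cached_rows = None

    def cache(self):
        self._cache_requested = True  # only a marker; no data is read yet
        return self

    def unpersist(self):
        self._cache_requested = False
        self._cached_rows = None
        return self

    def count(self):
        if self._cached_rows is not None:
            return len(self._cached_rows)  # served from memory
        rows = self._source.read()         # go back to the source
        if self._cache_requested:
            self._cached_rows = rows       # the first action fills the cache
        return len(rows)


src = Source([("en", "Apache_Spark", 12), ("en", "Kevin_Bacon", 7)])
df = ToyFrame(src)

df.count()
df.count()
reads_without_cache = src.reads    # every action re-read the source

df.cache()
reads_after_marking = src.reads    # marking as cached reads nothing

df.count()
df.count()
reads_with_cache = src.reads       # only the first count hit the source

df.unpersist()
df.count()
reads_after_unpersist = src.reads  # back to reading the source
print(reads_without_cache, reads_after_marking, reads_with_cache, reads_after_unpersist)
```

In this model the first action after `cache()` still pays for the read and also stores the rows. That is why a cache is usually materialized with an action such as `count()`.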

>`cache()` is shorthand for `persist()` with the default storage level. Called that way, both achieve identical results.

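Let's demonstrate this by running the action `count` twice, but including a call to `cache` on the `DataFrame` during the first run. Because `cache` only marks the `DataFrame` for caching, the first `count` both reads the data and fills the cache. This should allow us to execute the `count` action much faster the second time.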

In [None]:
(pagecountsEnAllDF
  .cache()         # Mark the DataFrame as cached
  .count()         # Materialize the cache
) 

If you re-run that command, it should take significantly less time.

In [None]:
pagecountsEnAllDF.count()

As a quick side note, you can remove a cache by calling the `DataFrame`'s `unpersist()` method, but it is not necessary.

In [None]:
pagecountsEnAllDF.unpersist()

Now the `count` action should take longer.

> Note that it does not take as long as the very first run, because this time we are not also caching the `DataFrame` in the process.

In [None]:
pagecountsEnAllDF.count()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Our Data

Let's continue by taking a look at the type of data we have. 

We can do this with the `printSchema()` command:

In [None]:
pagecountsEnAllDF.printSchema()

We can now see that we have four columns of data:
* **project** (*string*): The name of the Wikipedia project. This will include values such as:
  * **en**: The English version of Wikipedia.
  * **fr**: The French version of Wikipedia.
  * **en.d**: The English version of Wiktionary.
  * **fr.b**: The French version of Wikibooks.
  * **de.n**: The German version of Wikinews.
* **article** (*string*): The name of the article in the corresponding project. This will include values such as:
  * <a href="https://en.wikipedia.org/wiki/Apache_Spark" target="_blank">Apache_Spark</a>
  * <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei_Zaharia</a>
  * <a href="https://en.wikipedia.org/wiki/Kevin_Bacon" target="_blank">Kevin_Bacon</a>
* **requests** (*integer*): The number of requests (clicks) the article has received in the hour this data represents.
* **bytes_served** (*long*): The total number of bytes delivered for the requested article.
  * **Note:** In our copy of the data, this value is zero for all records and consequently is of no value to us.

## Spark API Documentation


Try to find the documentation for `printSchema()`. Hint: there are two ways to find the documentation on this method:
- Go to the online Spark API documentation for [printSchema](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=printSchema#pyspark.sql.DataFrame.printSchema).
- Create a new cell below with the code `help(pagecountsEnAllDF.printSchema)` and execute the cell.

You have already seen one command available to the `DataFrame` class, namely `DataFrame.printSchema()`.
  
Let's take a look at the API to see what other operations we have available.

### **Spark API Home Page**
0. Open a new browser tab
0. Google for **Spark API Latest** or **Spark API _x.x.x_** for a specific version.
0. Select **Spark API Documentation - Spark _x.x.x_ Documentation - Apache Spark** 

Other Documentation:
* Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming...
* Deployment Guides for Spark Standalone, Mesos, Yarn...
* Configuration, Monitoring, Tuning, Security...

Here are some shortcuts:
  * <a href="https://spark.apache.org/docs/latest/api.html" target="_blank">Spark API Documentation - Latest</a>
  * <a href="https://spark.apache.org/docs/2.1.1/api.html" target="_blank">Spark API Documentation - 2.1.1</a>
  * <a href="https://spark.apache.org/docs/2.1.0/api.html" target="_blank">Spark API Documentation - 2.1.0</a>
  * <a href="https://spark.apache.org/docs/2.0.2/api.html" target="_blank">Spark API Documentation - 2.0.2</a>
  * <a href="https://spark.apache.org/docs/1.6.3/api.html" target="_blank">Spark API Documentation - 1.6.3</a>

Naturally, which set of documentation you will use depends on which language you will use.

### Spark API (Python)

0. Select **Spark Python API (Sphinx)**.
0. Look up the documentation for `pyspark.sql.DataFrame`.
  0. In the lower-left-hand-corner type **DataFrame** into the search field.
  0. Hit **[Enter]**.
  0. The search results should appear in the right-hand pane.
  0. Click on **pyspark.sql.DataFrame (Python class, in pyspark.sql module)**
  0. The documentation should open in the right-hand pane.

### Spark API (Scala)

0. Select **Spark Scala API (Scaladoc)**.
0. Look up the documentation for `org.apache.spark.sql.DataFrame`.
  0. In the upper-left-hand-corner type **DataFrame** into the search field.
  0. The search will execute automatically.
  0. In the class/package list, click on **DataFrame**.
  0. The documentation should open in the right-hand pane.
  
This isn't going to work, but why?

### Spark API (Scala), Try #2

Look up the documentation for `org.apache.spark.sql.Dataset`.
  0. In the upper-left-hand-corner type **Dataset** into the search field.
  0. The search will execute automatically.
  0. In the class/package list, click on **Dataset**.
  0. The documentation should open in the right-hand pane.

Now that we have found the proper documentation, we can take a quick peek at the function `printSchema()`.

Nothing special here.

If you look at the API docs, `printSchema(..)` is described like this:
> Prints the schema to the console in a nice tree format.

## show(..)

What we want to look for next is a function that will allow us to print the data to the console.

In the API docs for `DataFrame`/`Dataset` find the docs for the `show(..)` command(s).

In the case of Python, we have one method with two optional parameters.<br/>
In the case of Scala, we have several overloaded methods.<br/>

In either case, the `show(..)` method effectively has two optional parameters:
* **n**: The number of records to print to the console, the default being 20.
* **truncate**: If true, columns wider than 20 characters will be truncated, where the default is true.
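
As a rough illustration of what these two parameters control, here is a plain-Python sketch. The functions `format_rows` and `show` below are hypothetical helpers written for this example, not Spark's implementation. The 20-character cut-off comes from the description above; ending a shortened value with `...` is an assumption of this sketch.

```python
def format_rows(rows, n=20, truncate=True):
    """Return up to n rows as text lines, shortening long cells if requested."""
    lines = []
    for row in rows[:n]:
        cells = []
        for value in row:
            text = str(value)
            if truncate and len(text) > 20:
                text = text[:17] + "..."  # keep 20 characters in total
            cells.append(text)
        lines.append(" | ".join(cells))
    return lines


def show(rows, n=20, truncate=True):
    """Print the formatted rows; like the real method, it returns nothing."""
    print("\n".join(format_rows(rows, n, truncate)))


rows = [
    ("en", "Apache_Spark", 12),
    ("en", "List_of_accolades_received_by_a_very_long_title", 3),
    ("en.d", "spark", 1),
]

show(rows, n=2)                  # first two rows, long title shortened
show(rows, n=2, truncate=False)  # first two rows, full title
```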

Let's take a look at the data in our `DataFrame` with the `show()` command:

In [None]:
pagecountsEnAllDF.show()

In the cell above, change the parameters of the show command to:
* print only the first five records
* disable truncation
* print only the first ten records and disable truncation

**Note:** The function `show(..)` is an **action** which triggers a job.

## display(..)

The `show(..)` command is part of the core Spark API and simply prints the results to the console.

Our notebooks have a slightly more elegant alternative.

Instead of calling `show(..)` on an existing `DataFrame` we can instead pass our `DataFrame` to the `display(..)` command:

In [None]:
display(pagecountsEnAllDF)

### show(..) vs display(..)
* `show(..)` is part of core Spark - `display(..)` is specific to our notebooks.
* `show(..)` is ugly - `display(..)` is pretty.
* `show(..)` has parameters for truncating both columns and rows - `display(..)` does not.
* `show(..)` is a function of the `DataFrame`/`Dataset` class - `display(..)` works with a number of different objects.
* `display(..)` is more powerful - with it, you can...
  * Download the results as CSV
  * Render line charts, bar charts, and other graphs, maps, and more.
  * See up to 1000 records at a time.
  
For the most part, the difference between the two is going to come down to preference.

Like `DataFrame.show(..)`, `display(..)` is an **action** which triggers a job.

## Transformations

Both `show(..)` and `display(..)` are **actions** that trigger jobs (though in slightly different ways).

If you recall, `show(..)` has a parameter to control how many records are printed, but `display(..)` does not.

We can address that difference with our first *transformation*, `limit(..)`.

### limit(..)

If you look at the API docs, `limit(..)` is described like this:
> Returns a new Dataset by taking the first n rows...

`show(..)`, like many actions, does not return anything. 

On the other hand, transformations like `limit(..)` return a **new** `DataFrame`:

In [None]:
limitedDF = pagecountsEnAllDF.limit(5) # "limit" the number of records to the first 5

Notice how "nothing" happened - that is, no job was triggered.

This is because we are simply defining the second step in our transformations.
  1. Read in the parquet file (represented by **pagecountsEnAllDF**).
  1. Limit those records to just the first 5 (represented by **limitedDF**).

It's not until we induce an action that a job is triggered and the data is processed.

We can do this with either the `show(..)` or the `display(..)` actions.

For example, we can ask `show` for up to 100 rows of the DataFrame `limitedDF`, which has only 5 rows.

In [None]:
limitedDF.show(100, False) #show up to 100 records and don't truncate the columns

We can use `display` to achieve a similar, but prettier result.

In [None]:
display(limitedDF) # defaults to the first 1000 records

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) select(..)

Let's say, for the sake of argument, that we don't want to look at all the data:

In [None]:
pagecountsEnAllDF.printSchema()

For example, it was asserted above that **bytes_served** had nothing but zeros in it and consequently is of no value to us.

If that is the case, we can disregard it by selecting only the three columns that we want:

In [None]:
# Transform the data by selecting only three columns
onlyThreeDF = (pagecountsEnAllDF
  .select("project", "article", "requests") # Our 2nd transformation (4 >> 3 columns)
)
# Now let's take a look at what the schema looks like
onlyThreeDF.printSchema()

Again, notice how the call to `select(..)` does not trigger a job.

That's because `select(..)` is a transformation. It's just one more step in a long list of transformations.

Let's go ahead and invoke the action `show(..)` and take a look at the result.

In [None]:
# And lastly, show the first five records which should exclude the bytes_served column.
onlyThreeDF.show(5, False)

The `select(..)` command is one of the most powerful and most commonly used transformations. 

We will see plenty of other examples of its usage as we progress.

If you look at the API docs, `select(..)` is described like this:
> Returns a new Dataset by computing the given Column expression for each element.

The "Column expression" referred to there is where the true power of this operation shows up. Again, we will go deeper on these later.

Just like `limit(..)`, `select(..)` 
* does not trigger a job
* returns a new `DataFrame`
* simply defines the next transformation in a sequence of transformations.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) drop(..)

As a side note, you will quickly discover that there are many ways to accomplish the same task.

Take the transformation `drop(..)` for example - instead of selecting everything we wanted, `drop(..)` allows us to specify the columns we don't want.

If you look at the API docs, `drop(..)` is described like this:
> Returns a new Dataset with a column dropped.

And we can see that we can produce the same result as the last exercise this way:

In [None]:
# Transform the data by dropping the one column we don't want
droppedDF = (pagecountsEnAllDF
  .drop("bytes_served") # Our second transformation after the initial read (4 columns down to 3)
)
# Now let's take a look at what the schema looks like
droppedDF.printSchema()

Again, `drop(..)` is just one more transformation - that is, no job is triggered.

In [None]:
# And lastly, show the first five records which should exclude the bytes_served column.
droppedDF.show(5, False)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) distinct() & dropDuplicates()

These two transformations do the same thing. In fact, they are aliases for one another.
* You can see this by looking at the source code for these two methods
* ```def distinct(): Dataset[T] = dropDuplicates()```
* See <a href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala" target="_blank">Dataset.scala</a>

The difference between them has everything to do with the programmer and their perspective.
* The name **distinct** will resonate with developers, analysts, and DB admins with a background in SQL.
* The name **dropDuplicates** will resonate with developers that have a background or experience in functional programming.

As you become more familiar with the various APIs, you will see this pattern reassert itself.

The designers of the API are trying to make the API as approachable as possible for multiple target audiences.

If you look at the API docs, both `distinct(..)` and `dropDuplicates(..)` are described like this:
> Returns a new Dataset that contains only the unique rows from this Dataset....

With this transformation, we can now tackle our first business question:

### How many different English Wikimedia projects saw traffic during that hour?

If you recall, our original `DataFrame` has this schema:

In [None]:
pagecountsEnAllDF.printSchema()

The transformation `distinct()` is applied to the row as a whole - data in the **project**, **article**, **requests**, and **bytes_served** columns will all affect this evaluation.

To get the distinct list of projects, and only projects, we need to reduce the number of columns to just the one column, **project**. 

We can do this with the `select(..)` transformation and then we can introduce the `distinct()` transformation.

In [None]:
distinctDF = (pagecountsEnAllDF     # Our original DataFrame from spark.read.parquet(..)
  .select("project")                # Drop all columns except the "project" column
  .distinct()                       # Reduce the set of all records to just the distinct values.
)

Just to reinforce, we have three transformations:
0. Read the data (now represented by `pagecountsEnAllDF`)
0. Select just the one column
0. Reduce the records to a distinct set

No job is triggered until we perform an action like `show(..)`:

In [None]:
# There will not be more than 100 projects
distinctDF.show(100, False)               

You can count those if you like.

But, it would be easier to ask the `DataFrame` for the `count()`:

In [None]:
total = distinctDF.count()     
print("Distinct Projects: {0:,}".format( total ))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) dropDuplicates(columns...)

The method `dropDuplicates(..)` has a second variant that accepts one or more columns.
* The distinction is not performed across the entire record unlike `distinct()` or even `dropDuplicates()`.
* The distinction is based only on the specified columns.
* This allows us to keep all the original columns in our `DataFrame`.
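
The difference between the two variants can be sketched in plain Python. The helper `drop_duplicates` below is a hypothetical model of the semantics described above, not Spark code; in PySpark the column-based variant is called as `dropDuplicates(["project"])`. Keeping the first row seen for each key is an assumption of this sketch.

```python
def drop_duplicates(rows, subset=None):
    """Keep the first row seen for each distinct key.

    subset=None compares whole rows (like distinct()); otherwise only the
    named columns are compared, but every column is kept in the result.
    """
    seen = set()
    result = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"project": "en", "article": "Apache_Spark", "requests": 12},
    {"project": "en", "article": "Kevin_Bacon", "requests": 7},
    {"project": "en", "article": "Kevin_Bacon", "requests": 7},  # exact duplicate
    {"project": "en.d", "article": "spark", "requests": 1},
]

whole_row = drop_duplicates(rows)                # removes only the exact duplicate
by_project = drop_duplicates(rows, ["project"])  # one full row per project
print(len(whole_row), len(by_project))
```

Each row in `by_project` still carries all three columns, which is the practical advantage over selecting a single column and then calling `distinct()`.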

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Recap

Our code is spread out over many cells which can make this a little hard to follow.

Let's take a look at the same code in a single cell.

In [None]:
parquetDir = "/mnt/data/wikipedia/pagecounts/staging_parquet_en_only_clean/"

pagecountsEnAllDF = (spark       # Our SparkSession & Entry Point
  .read                          # Our DataFrameReader
  .parquet(parquetDir)           # Returns an instance of DataFrame
)
(pagecountsEnAllDF               # Only if we are running multiple queries
  .cache()                       # mark the DataFrame as cachable
  .count()                       # materialize the cache
)
distinctDF = (pagecountsEnAllDF  # Our original DataFrame from spark.read.parquet(..)
  .select("project")             # Drop all columns except the "project" column
  .distinct()                    # Reduce the set of all records to just the distinct values.
)
total = distinctDF.count()     
print("Distinct Projects: {0:,}".format( total ))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DataFrames vs SQL & Temporary Views


This might also be a good time to read up on the history and difference between [RDDs, DataFrames, and Datasets](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html). 

The `DataFrame` API is built upon a SQL engine.

As such we can "convert" a `DataFrame` into a temporary view (or table) and then use it in "standard" SQL.

Let's start by creating a temporary view from a previous `DataFrame`.

In [None]:
pagecountsEnAllDF.createOrReplaceTempView("pagecounts")

Now that we have a temporary view (or table) we can start expressing our queries and transformations in SQL:

In [None]:
%sql

SELECT * FROM pagecounts

And we can just as easily express in SQL the distinct list of projects, and just because we can, we'll sort that list:

In [None]:
%sql

SELECT DISTINCT project FROM pagecounts ORDER BY project

And converting from SQL back to a `DataFrame` is just as easy:

In [None]:
tableDF = spark.sql("SELECT DISTINCT project FROM pagecounts ORDER BY project")
display(tableDF)
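
To see outside of Spark that the SQL form and the programmatic form express the same query, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine (an assumption made for this illustration; it is not Spark SQL), with a handful of made-up rows in a table that reuses the name `pagecounts`.

```python
import sqlite3

# Made-up sample rows: (project, article, requests)
rows = [
    ("en", "Apache_Spark", 12),
    ("en.d", "spark", 1),
    ("en", "Kevin_Bacon", 7),
    ("en.b", "Cookbook", 2),
]

# Programmatic route: select one column, keep distinct values, sort them.
projects_api = sorted({project for project, _article, _requests in rows})

# SQL route: load the same rows into an in-memory table and query it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pagecounts (project TEXT, article TEXT, requests INTEGER)")
conn.executemany("INSERT INTO pagecounts VALUES (?, ?, ?)", rows)
projects_sql = [
    row[0]
    for row in conn.execute("SELECT DISTINCT project FROM pagecounts ORDER BY project")
]
conn.close()

print(projects_api == projects_sql, projects_sql)
```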

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>