<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Reading & Writing Data and Tables Lab

## Learning Objectives
**In this lab, you will:**
- Use DataFrames to:
  - Read data from Parquet format
  - Define and use schemas while loading data
  - Write data to Parquet
  - Save data to tables
- Register views
- Read data from tables and views back to DataFrames
- (**OPTIONAL**) Explore differences between managed and unmanaged tables

**Resources**
* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html)
* [Spark Load and Save Functions](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html)
* [Databases and Tables - Databricks Docs](https://docs.databricks.com/user-guide/tables.html)
* [Managed and Unmanaged Tables](https://docs.databricks.com/user-guide/tables.html#managed-and-unmanaged-tables)
* [Creating a Table with the UI](https://docs.databricks.com/user-guide/tables.html#create-a-table-using-the-ui)
* [Create a Local Table](https://docs.databricks.com/user-guide/tables.html#create-a-local-table)
* [Saving to Persistent Tables](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables)
* [DataFrame Reader Docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader)
* [DataFrame Writer Docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter)

### Getting Started

Run the following cell to configure our "classroom."

In [4]:
%run ./Includes/Classroom-Setup

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read Parquet into a DataFrame

Load the data in the `sourcePath` variable into a DataFrame named `tempDF`. No options need to be set beyond specifying the format.

In [6]:
# TODO

sourcePath = "/mnt/training/weather/StationData/stationData.parquet/"

# tempDF = <FILL_IN>

tempDF = (spark.read
  .format("parquet")
  .load(sourcePath)
)

### Review and Define the Schema

Note that a single job was triggered. Parquet files record the schema for each column, but Spark must still peek at the file to read this information (hence the job).

To avoid triggering this job, a schema can be passed as an argument. Define the schema here using SQL DDL or Spark types and fields (as demonstrated in the previous lesson).

In [8]:
# TODO

  # Import types and define schema OR use SQL DDL.
tempDF.printSchema()



In [9]:
# ANSWER

from pyspark.sql.types import StructType, StructField, StringType, FloatType, DateType

# Spark API
schema = StructType([
  StructField("NAME", StringType()),
  StructField("STATION", StringType()),
  StructField("LATITUDE", FloatType()),
  StructField("LONGITUDE", FloatType()),
  StructField("ELEVATION", FloatType()),
  StructField("DATE", DateType()),
  StructField("UNIT", StringType()),
  StructField("TAVG", FloatType())
])

# DDL
schemaDDL = "NAME STRING, STATION STRING, LATITUDE FLOAT, LONGITUDE FLOAT, ELEVATION FLOAT, DATE DATE, UNIT STRING, TAVG FLOAT"
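
The two definitions above should describe the same columns in the same order. As a quick sanity check, the following sketch parses the DDL string in plain Python, without Spark. It is not part of the lab solution: `parse_ddl` is a hypothetical helper that only handles the flat `COL TYPE, COL TYPE` form used here (no nested types, `NOT NULL`, or comments).

```python
def parse_ddl(ddl):
    """Split a flat 'COL TYPE, COL TYPE' string into (column, type) pairs."""
    pairs = []
    for field in ddl.split(","):
        name, data_type = field.split()
        pairs.append((name, data_type.upper()))
    return pairs


schemaDDL = "NAME STRING, STATION STRING, LATITUDE FLOAT, LONGITUDE FLOAT, ELEVATION FLOAT, DATE DATE, UNIT STRING, TAVG FLOAT"

ddl_columns = parse_ddl(schemaDDL)

# Column names and types as listed in the "Spark API" schema above.
expected = [
    ("NAME", "STRING"), ("STATION", "STRING"), ("LATITUDE", "FLOAT"),
    ("LONGITUDE", "FLOAT"), ("ELEVATION", "FLOAT"), ("DATE", "DATE"),
    ("UNIT", "STRING"), ("TAVG", "FLOAT"),
]

ddl_matches = ddl_columns == expected
print(ddl_matches)
```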

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Load Data with Defined Schema

No job will be triggered when a schema definition is provided. Load the data from `sourcePath` into a DataFrame named `weatherDF`.

In [11]:
# TODO

# ANSWER

weatherDF = (spark.read
  .format("parquet")
  .schema(schema)
  .load(sourcePath)
)

## Save an Unmanaged Table
The DataFrame method `saveAsTable` registers the data currently referenced in the DataFrame to the metastore and saves a copy of the data.

If a `"path"` is not provided, Spark will create a managed table, meaning that both the metadata AND the data are copied and stored in the DBFS root storage associated with the workspace. This means that dropping or modifying the table will modify the data in DBFS. An unmanaged table decouples the data from the metadata, so the table can easily be renamed or removed from the workspace without deleting or migrating the underlying data.

Save `weatherDF` as an unmanaged table. Use the `tablePath` provided with the `"path"` option. Set the mode to `"overwrite"` (which will drop the table if it currently exists). Pass the table name `"weather"` to the `saveAsTable` method.

In [13]:
# TODO

tablePath = f"{userhome}/weather"

(weatherDF
  .write
  .option("path", tablePath)
  .mode("overwrite")
  .saveAsTable("weather"))

### Query Table
This table contains the same data as the `weatherDF`. Remember that tables **persist between sessions** and (by default) are **available to all users in the workspace**.

Let's preview our data.

In [15]:
%sql

SELECT * FROM weather

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG
"HAYWARD AIR TERMINAL, CA US",USW00093228,37.6542,-122.115,13.1,2018-05-27,F,61.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",USW00012960,29.98,-95.36,29.0,2018-05-25,F,80.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7
"WOODACRE CALIFORNIA, CA US",USR0000CWOO,37.9906,-122.6447,426.7,2018-05-26,F,53.0
"BRIONES CALIFORNIA, CA US",USR0000CBRI,37.9442,-122.1178,442.0,2018-04-08,F,53.0


## Overview of the Data

The data include multiple entries from a selection of weather stations, including average temperatures recorded in either Fahrenheit or Celsius. The table has the following schema:

|ColumnName  | DataType| Description|
|------------|---------|------------|
|NAME        |string   | Station name |
|STATION     |string   | Unique ID |
|LATITUDE    |float    | Latitude |
|LONGITUDE   |float    | Longitude |
|ELEVATION   |float    | Elevation |
|DATE        |date     | YYYY-MM-DD |
|UNIT        |string   | Temperature units |
|TAVG        |float    | Average temperature |

## Creating a Temp View with SQL
It's easy to register temp views using SQL queries. Temp views essentially allow a set of SQL transformations against a dataset to be given a name. This can be helpful when building up complex logic, or when an intermediate state will be used multiple times in later queries. Note that no job is triggered on view definition.

Use SQL to create a temp view named `station_counts` that returns the count of average temperatures recorded in both F and C for each station, ordered by station name.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> `COUNT` will create the column name `count(1)` by default. Alias it to a descriptive column name so that it does not need escaping in further SQL queries. (Parentheses are also not valid in column names when writing to Parquet format.)

In [18]:
%sql
-- TODO

CREATE OR REPLACE TEMP VIEW station_counts
AS (SELECT NAME, UNIT, COUNT(*) counts
  FROM weather
  GROUP BY NAME, UNIT
  ORDER BY NAME)

This aggregate view is small enough for manual examination. `SELECT *` will return all records as an interactive tabular view. Make a mental note of whether any stations report temperature in both units, and of the approximate number of records for each station.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> While no job was triggered when defining the view, a job is triggered _each time_ a query is executed against the view.
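
A view stores a query rather than its results, which is why each query against it does fresh work. The minimal sketch below shows this using Python's built-in `sqlite3` module as a stand-in for Spark SQL, purely so that it can run outside Databricks. The sample rows, the extra inserted row, and the `conn` variable are illustrative assumptions, and SQLite's execution model differs from Spark's (there are no jobs).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (NAME TEXT, UNIT TEXT, TAVG REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("BIG ROCK CALIFORNIA, CA US", "C", 11.7),
        ("BIG ROCK CALIFORNIA, CA US", "C", 11.1),
        ("BRIONES CALIFORNIA, CA US", "F", 53.0),
    ],
)

# Defining the view only stores the query text; no aggregation runs here.
conn.execute(
    """
    CREATE TEMP VIEW station_counts AS
    SELECT NAME, UNIT, COUNT(*) AS counts
    FROM weather
    GROUP BY NAME, UNIT
    """
)

query = "SELECT * FROM station_counts ORDER BY NAME"
before = conn.execute(query).fetchall()

# Hypothetical extra reading, added only to show that the view is re-evaluated.
conn.execute("INSERT INTO weather VALUES ('BRIONES CALIFORNIA, CA US', 'F', 55.0)")
after = conn.execute(query).fetchall()

conn.close()
print(before)
print(after)
```

The second result reflects the newly inserted row even though the view was never redefined, because the aggregation is recomputed on every query.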

In [20]:
%sql

SELECT * FROM station_counts

NAME,UNIT,counts
"BARNABY CALIFORNIA, CA US",C,151
"BIG ROCK CALIFORNIA, CA US",C,151
"BLACK DIAMOND CALIFORNIA, CA US",C,151
"BRIONES CALIFORNIA, CA US",F,151
"CONCORD BUCHANAN FIELD, CA US",F,149
"HAYWARD AIR TERMINAL, CA US",F,149
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",F,150
"HOUSTON WILLIAM P HOBBY AIRPORT, TX US",C,150
"LAS TRAMPAS CALIFORNIA, CA US",C,151
"LOS PRIETOS CALIFORNIA, CA US",F,151
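
The manual check suggested above can also be done mechanically. This sketch works in plain Python on the ten preview rows shown in the output; the full view contains more rows, so the result describes only this preview. The variable names are illustrative.

```python
from collections import defaultdict

# (NAME, UNIT, counts) rows copied from the preview output above.
preview_rows = [
    ("BARNABY CALIFORNIA, CA US", "C", 151),
    ("BIG ROCK CALIFORNIA, CA US", "C", 151),
    ("BLACK DIAMOND CALIFORNIA, CA US", "C", 151),
    ("BRIONES CALIFORNIA, CA US", "F", 151),
    ("CONCORD BUCHANAN FIELD, CA US", "F", 149),
    ("HAYWARD AIR TERMINAL, CA US", "F", 149),
    ("HOUSTON INTERCONTINENTAL AIRPORT, TX US", "F", 150),
    ("HOUSTON WILLIAM P HOBBY AIRPORT, TX US", "C", 150),
    ("LAS TRAMPAS CALIFORNIA, CA US", "C", 151),
    ("LOS PRIETOS CALIFORNIA, CA US", "F", 151),
]

# Collect the set of units reported by each station.
units_by_station = defaultdict(set)
for name, unit, _count in preview_rows:
    units_by_station[name].add(unit)

# Stations that appear with more than one unit.
both_units = sorted(n for n, units in units_by_station.items() if len(units) > 1)

# Smallest and largest record count per station/unit pair.
counts = [count for _name, _unit, count in preview_rows]
count_range = (min(counts), max(counts))

print(both_units)
print(count_range)
```

In this preview no station reports in both units, and each station has between 149 and 151 records.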


## Define a DataFrame from a View
Transforming a table or view back to a DataFrame is simple, and will not trigger a job. The metadata associated with the table is just reassigned to the DataFrame, so the same access permissions and schema are now accessible through both the DataFrame and SQL APIs.

Use `spark.table()` to create a DataFrame named `countsDF` from the view `"station_counts"`.

In [22]:
# TODO

countsDF = spark.table("station_counts")

## Write to Parquet
Writing this DataFrame back to disk will persist the computed aggregates for later.

Save `countsDF` to the provided `countsPath` in Parquet format.

In [24]:
# TODO

countsPath = f"{userhome}/stationCounts"

(countsDF.write
  .format("parquet")
  .mode("overwrite")
  .save(countsPath))

## Synopsis

In this lab we:
* Read Parquet files to DataFrames both with and without defining a schema
* Saved an unmanaged table (a managed table is saved in the optional section below)
* Created an aggregate temp view of our data
* Created a DataFrame from that temp view
* Wrote the new DataFrame back to Parquet files

## **OPTIONAL**: Exploring Managed and Unmanaged Tables

In this section, we'll explore Databricks' default behavior for managed vs. unmanaged tables. The differences in syntax for defining these are small, but the performance, cost, and security implications can be significant. **In almost all use cases, UNmanaged tables are preferred.**

## Save a Managed Table
To explore the differences between managed and unmanaged tables, save the DataFrame `weatherDF` from earlier in the lesson without the `"path"` option. Use the table name `"weather_managed"`.

In [28]:
# TODO

<FILL_IN> # Reuse the code block above without the "path" option.

## Review Spark Catalog

Note the `tableType` field for our tables and views:
- The unmanaged table `weather` is `EXTERNAL`
- The managed table `weather_managed` is `MANAGED`
- The temp view `station_counts` is `TEMPORARY`

In [30]:
spark.catalog.listTables()

Using SQL `SHOW TABLES` provides most of the same information, but does not indicate whether or not a table is managed.

In [32]:
%sql

SHOW TABLES

## Examine Table Details
Use the SQL command `DESCRIBE EXTENDED table_name` to examine the two weather tables.

In [34]:
%sql

DESCRIBE EXTENDED weather

In [35]:
%sql

DESCRIBE EXTENDED weather_managed

Run the following cell to assign the `managedTablePath` variable and confirm that both paths match the information printed above.

In [37]:
managedTablePath = f"dbfs:/user/hive/warehouse/{spark.catalog.currentDatabase()}.db/weather_managed"

print(f"""The weather table is saved at: 

    {tablePath}

The weather_managed table is saved at:

    {managedTablePath}""")

The same DataFrame was used to create both of these directories, and therefore the content is identical (noting that file names are unique hashes).

In [39]:
dbutils.fs.ls(tablePath)

In [40]:
dbutils.fs.ls(managedTablePath)

### Check Directory Contents after Dropping Tables
Now drop both tables and again list the contents of these directories.

In [42]:
%sql

DROP TABLE weather;
DROP TABLE weather_managed;

In [43]:
dbutils.fs.ls(tablePath)

In [44]:
dbutils.fs.ls(managedTablePath)

**This highlights the main differences between managed and unmanaged tables.** The files associated with managed tables will always be stored in this default location on the root DBFS storage linked to the workspace, and will be deleted when a table is dropped.

Files for unmanaged tables will be persisted in the `path` provided at table creation, preventing users from inadvertently deleting underlying files. **Unmanaged tables can easily be migrated to other databases or renamed, but these operations with managed tables will require rewriting ALL underlying files.**
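
The drop behavior observed above can be summarized with a toy model. The sketch below is plain Python and does not call Spark or DBFS. `ToyMetastore`, its methods, the sample rows, and the temporary directories are all hypothetical stand-ins. It shows only one thing: dropping a managed table removes its files, while dropping an unmanaged table removes just the catalog entry.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Hypothetical stand-in for a metastore; not a Spark API."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = Path(warehouse_dir)
        self.tables = {}  # table name -> (data directory, is_managed)

    def save_as_table(self, name, rows, path=None):
        # No explicit path: a managed table stored under the warehouse directory.
        managed = path is None
        data_dir = self.warehouse_dir / name if managed else Path(path)
        data_dir.mkdir(parents=True, exist_ok=True)
        (data_dir / "part-0000.txt").write_text("\n".join(rows))
        self.tables[name] = (data_dir, managed)

    def drop_table(self, name):
        data_dir, managed = self.tables.pop(name)
        if managed:
            # Managed: the data is deleted together with the metadata.
            shutil.rmtree(data_dir)
        return data_dir


root = Path(tempfile.mkdtemp())
metastore = ToyMetastore(root / "warehouse")
rows = ["USW00093228,61.0", "USR0000CBIR,11.7"]  # illustrative content only

metastore.save_as_table("weather", rows, path=root / "userhome" / "weather")
metastore.save_as_table("weather_managed", rows)

unmanaged_dir = metastore.drop_table("weather")
managed_dir = metastore.drop_table("weather_managed")

unmanaged_files_kept = unmanaged_dir.exists()
managed_files_kept = managed_dir.exists()
print(unmanaged_files_kept, managed_files_kept)

shutil.rmtree(root)  # remove the temporary directories
```

After both drops, the unmanaged table's directory still exists and the managed table's directory is gone, mirroring the two `dbutils.fs.ls` calls in the lab.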

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>