
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Accessing Data

Apache Spark&trade; and Databricks&reg; have numerous ways to access your data.

## In this lesson you
* Create a table from an existing file
* Create a table by uploading a data file from your local machine
* Mount an Azure Blob to DBFS
* Create tables for Databricks data sets to use throughout the course

## Audience
* Primary Audience: Data Analysts
* Additional Audiences: Data Engineers and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.3**
* Familiarity with <a href="https://www.w3schools.com/sql/" target="_blank">ANSI SQL</a> is required

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/vpxrf5e9ww?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/vpxrf5e9ww?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Create a table from an existing file

The <a href="https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html" target="_blank">Databricks File System</a> (DBFS) is the built-in, Azure-blob-backed alternative to the <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" target="_blank">Hadoop Distributed File System</a> (HDFS).

Creating a table from an existing file in DBFS allows you to access the file as if it were a Spark table. It does **not** copy any data.

<iframe  
src="//fast.wistia.net/embed/iframe/xi9n55mdwy?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/xi9n55mdwy?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

The example below creates a table from the **ip-geocode.parquet** file (if it doesn't exist).

For Parquet files, you need to specify only one option: the path to the file.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> A Parquet "file" is actually a collection of files stored in a single directory.  The Parquet format offers features making it the ideal choice for storing "big data" on distributed file systems. For more information, see <a href="https://parquet.apache.org/" target="_blank">Apache Parquet</a>.

You can create a table from an existing DBFS file with a simple SQL `CREATE TABLE` statement. If you don't select a database, the database called "default" is used. Here, we'll use a database called "databricks".

In [9]:
%sql
CREATE DATABASE IF NOT EXISTS databricks;

USE databricks;

CREATE TABLE IF NOT EXISTS IPGeocode
  USING parquet
  OPTIONS (
    path "dbfs:/mnt/training/ip-geocode.parquet"
  )

Now the table has been defined. You can see it in Databricks.
0. Click the **Data** icon on the left sidebar<br/>
<div><img src="https://files.training.databricks.com/images/eLearning/data-tab.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px"/></div>
0. Select the database **databricks**
0. Select the table **ipgeocode**
<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Right-click and open in a new tab, so you don't lose your place in this notebook.

<img src="https://files.training.databricks.com/images/eLearning/SQL-MSFT/create-table-1-databricks-sql.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; width: auto; height: auto; max-height: 383px"/>

You see the schema of the table, along with a sample of its data.

<img src="https://files.training.databricks.com/images/eLearning/SQL-MSFT/db-table-example-1.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px"/>

### Using A Personal Database

Any tables you create or drop at this point are created or dropped in the **`databricks`** database.

However, every user of this system, if running this same code, will be altering the same tables. 

In cases such as this one, it is often better to use a "personal" database.

For this reason, we will switch back to your personal database now.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We need to use the Spark programming API here only because we are unable to parameterize a **`%sql`** cell with the database we set up for you (as represented by **`databaseName`**).

In [13]:
# Programmatically execute a SQL command similar to the one above
spark.sql(f"USE {databaseName}")
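
The same f-string approach extends to whole statements. Below is a minimal sketch, assuming a hypothetical helper named `create_table_sql` and a placeholder database name `my_database` (neither is part of the course material). It only builds the statement text, so it runs without a cluster. It does no escaping, so it should only be fed trusted values.

```python
def create_table_sql(database, table, fmt, path, options=None):
    """Build a CREATE TABLE ... USING statement as a string (hypothetical helper)."""
    # The path is always an option; any extra options are appended after it.
    opts = {"path": path, **(options or {})}
    opt_lines = ",\n".join(f'    {key} "{value}"' for key, value in opts.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table}\n"
        f"  USING {fmt}\n"
        f"  OPTIONS (\n{opt_lines}\n  )"
    )

# "my_database" is a placeholder; in this lesson it would be databaseName.
statement = create_table_sql(
    "my_database", "IPGeocode", "parquet", "dbfs:/mnt/training/ip-geocode.parquet"
)
print(statement)
# In a Databricks notebook you could then run: spark.sql(statement)
```

Qualifying the table name with the database avoids depending on whichever database the session currently uses.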

### File formats other than Parquet

You can also create a table from other file formats. 

One common format is CSV (comma-separated-values) for which you can specify:
* The file's delimiter (the default is "**,**")
* Whether the file has a header (the default is **false**)
* Whether to infer the schema (the default is **false**)
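
To make the three options concrete, here is a rough sketch that mimics them with Python's standard `csv` module rather than Spark. The function `read_rows`, the helper `infer`, and the sample data are all made up for illustration. Spark infers one type per column, while this sketch converts value by value, so treat it only as an approximation of the behavior.

```python
import csv
import io

def infer(value):
    """Try int, then float, else keep the string (a rough stand-in for schema inference)."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def read_rows(text, delimiter=",", header=False, infer_schema=False):
    """Parse CSV text using the same defaults as the options listed above."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if header:
        names, rows = rows[0], rows[1:]
    else:
        # Without a header, Spark names columns _c0, _c1, ...
        names = [f"_c{i}" for i in range(len(rows[0]))]
    if infer_schema:
        rows = [[infer(v) for v in row] for row in rows]
    return [dict(zip(names, row)) for row in rows]

sample = "id,temp\n1,0.34\n2,0.36\n"  # made-up data

print(read_rows(sample)[0])                                  # {'_c0': 'id', '_c1': 'temp'}
print(read_rows(sample, header=True, infer_schema=True)[0])  # {'id': 1, 'temp': 0.34}
```

With the defaults, the header line is treated as data and every value stays a string, which is why the CSV examples in this lesson set both `header` and `inferSchema`.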

<iframe  
src="//fast.wistia.net/embed/iframe/6bcdrg5ci4?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/6bcdrg5ci4?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

In order to know which options to use, look at the first couple of lines of the file.

Take a look at the head of the file **/mnt/training/bikeSharing/data-001/day.csv**.

In [17]:
%fs head /mnt/training/bikeSharing/data-001/day.csv --maxBytes=492

Spark can create a table from that CSV file, as well.

As you can see above:
* There is a header
* The file is comma separated (the default)
* No schema is declared in the file, so let Spark infer it

In [19]:
%sql
CREATE TABLE IF NOT EXISTS BikeSharingDay
  USING csv
  OPTIONS (
    path "/mnt/training/bikeSharing/data-001/day.csv",
    inferSchema "true",
    header "true"
  )
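
The options used above were chosen by looking at the first lines of the file. As a rough cross-check, Python's `csv.Sniffer` can guess the delimiter and whether a header is present from a small text sample. This is a sketch, not what Spark does: `suggest_csv_options` is a hypothetical helper, the sample is made-up data shaped like a small CSV file, and the header detection is only a heuristic.

```python
import csv

def suggest_csv_options(sample_text):
    """Guess the delimiter and header options from a few lines of CSV text."""
    sniffer = csv.Sniffer()
    # Restrict the candidates to common delimiters to keep the guess stable.
    dialect = sniffer.sniff(sample_text, delimiters=",;\t|")
    return {
        "delimiter": dialect.delimiter,
        "header": "true" if sniffer.has_header(sample_text) else "false",
    }

# Illustrative sample, not copied from the course data set.
sample = (
    "instant,dteday,season,temp\n"
    "1,2011-01-01,1,0.344167\n"
    "2,2011-01-02,1,0.363478\n"
    "3,2011-01-03,1,0.196364\n"
)

options = suggest_csv_options(sample)
print(options)
```

Always confirm the guess by eye, as this lesson does with `%fs head`, before relying on it.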

Now the table is defined: view its contents with a simple select statement.

In [21]:
%sql
SELECT * FROM BikeSharingDay

Next, drop the table.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This does not delete the file from which the table was created.  Rather, it simply removes the table definition from Spark.

In [23]:
%sql
DROP TABLE BikeSharingDay

### Upload a local file as a table

The last two examples use files already loaded on the "server."

Databricks also supports creating tables by uploading files. 

Next, download the following file to your local machine: <a href="https://dbtrainwestus.blob.core.windows.net/training/dataframes/state-income.csv?sp=rl&st=2018-08-23T21:08:25Z&se=2024-08-24T21:08:00Z&sv=2017-11-09&sig=7fD9Zc5OZ9AOBdstZGyNrbvX%2FmNUiBYBbPtbtVrmiUY%3D&sr=b">state-income.csv</a>

<iframe  
src="//fast.wistia.net/embed/iframe/3vo1bm6ak0?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/3vo1bm6ak0?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>


1. Select **Data** from the sidebar, and click the **junk** database
2. Select the **+** icon to create a new table

<img src="https://files.training.databricks.com/images/eLearning/create-table-1-junk-db.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; width: auto; height: auto; max-height: 383px"/>

<br>
1. Select **Upload File**
2. Click **Browse** and select the **state-income.csv** file from your machine, or drag-and-drop the file to initiate the upload

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-table-2.png" style="border: 1px solid #aaa; border-radius: 5px 5px 5px 5px; width: auto; height: auto; max-height: 300px  "/>

Once the file is uploaded, create the actual table:

1. Click the **Create Table with UI** button  
<br>
<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-table-3.png" style="border: 1px solid #aaa; border-radius: 5px 5px 5px 5px; width: auto; height: auto; max-height: 500px  "/>
<br>
2. In the drop-down dialog, select a cluster
3. Click the **Preview Table** button  
<br>
<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-table-4.png" style="border: 1px solid #aaa; border-radius: 5px 5px 5px 5px; width: auto; height: auto; max-height: 200px  "/>
4. Another dialog will drop down. Choose the **junk** database
5. Select the **First row is header** checkbox
6. Click the **Create Table** button
<br>
<img src="https://files.training.databricks.com/images/eLearning/create-table-5.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px"/>

Once Databricks finishes processing the file, you'll see another table preview.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Databricks tries to choose a table name that doesn't clash with tables created by other users. However, a name clash is still possible. If the table already exists, you'll see an error like the following:

<img src="https://files.training.databricks.com/images/eLearning/create-table-6.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px; padding: 10px"/>

If that happens, just type in a different table name, and try again.

Next, drop the table to ensure other users don't have a name conflict when uploading their tables.

In [30]:
%sql
DROP TABLE IF EXISTS state_income

### How to Mount an Azure Blob to DBFS

Microsoft Azure provides cloud file storage in the form of the Blob Store.  Files are stored in "blobs."
If you have an Azure account, create a blob, store data files in that blob, and mount the blob as a DBFS directory. 

Once the blob is mounted as a DBFS directory, access it without exposing your Azure Blob Store keys.

<iframe  
src="//fast.wistia.net/embed/iframe/zof0hhe8pc?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/zof0hhe8pc?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Take a look at the blobs already mounted to your DBFS:

In [34]:
%fs mounts
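
The `%fs mounts` magic is shorthand for `dbutils.fs.mounts()`, which returns entries carrying a `mountPoint` and a `source`. The sketch below shows one way to check whether a mount point is already in use before mounting. `is_mounted` is a hypothetical helper, and the `MountInfo` tuple with its values is a made-up stand-in so the example runs outside Databricks.

```python
from collections import namedtuple

def is_mounted(mounts, mount_point):
    """Return True if any entry in `mounts` uses `mount_point`."""
    return any(m.mountPoint == mount_point for m in mounts)

# Stand-in for the objects returned by dbutils.fs.mounts(); values are made up.
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
sample_mounts = [
    MountInfo("/mnt/training", "wasbs://training@example.blob.core.windows.net/"),
]

print(is_mounted(sample_mounts, "/mnt/training"))       # True
print(is_mounted(sample_mounts, "/mnt/temp-training"))  # False
# In a notebook: is_mounted(dbutils.fs.mounts(), "/mnt/temp-training")
```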

Mount a Databricks Azure blob (using read-only access and secret key pair), access one of the files in the blob as a DBFS path, then unmount the blob.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> The mount point **must** start with `/mnt/`.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If the directory was already mounted, you would receive the following error:

> Directory already mounted: /mnt/temp-training

In this case, use a different mount point such as `temp-training-2`, and ensure you update all three references below.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> The next cell is in Python!

In [36]:
sasURL = "https://dbtraineastus2.blob.core.windows.net/?sv=2017-07-29&ss=b&srt=sco&sp=rl&se=2023-04-19T06:32:30Z&st=2018-04-18T22:32:30Z&spr=https&sig=BB%2FQzc0XHAH%2FarDQhKcpu49feb7llv3ZjnfViuI9IWo%3D"
sasKey = sasURL[sasURL.index('?'):]  # the SAS token is everything from the '?' onward
storageAccount = "dbtraineastus2"
containerName = "training"
mountPoint = "/mnt/temp-training"

dbutils.fs.mount(
  source = f"wasbs://{containerName}@{storageAccount}.blob.core.windows.net/",
  mount_point = mountPoint,
  extra_configs = {f"fs.azure.sas.{containerName}.{storageAccount}.blob.core.windows.net": sasKey}
)
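
The mount call above derives three values from the SAS URL: the source, the configuration key, and the SAS token. The sketch below shows one way to compute them with `urllib.parse` instead of string slicing. `mount_arguments` is a hypothetical helper, and the URL uses a made-up account name and a placeholder signature.

```python
from urllib.parse import urlsplit

def mount_arguments(sas_url, container_name, mount_point):
    """Derive the keyword arguments for dbutils.fs.mount from a SAS URL."""
    parts = urlsplit(sas_url)
    host = parts.netloc  # e.g. <account>.blob.core.windows.net
    return {
        "source": f"wasbs://{container_name}@{host}/",
        "mount_point": mount_point,
        # The SAS token keeps its leading '?', as in the cell above.
        "extra_configs": {f"fs.azure.sas.{container_name}.{host}": "?" + parts.query},
    }

# Placeholder URL: the account name and signature are made up.
example_url = (
    "https://exampleaccount.blob.core.windows.net/"
    "?sv=2017-07-29&sp=rl&sig=PLACEHOLDER"
)
args = mount_arguments(example_url, "training", "/mnt/temp-training")
print(args["source"])  # wasbs://training@exampleaccount.blob.core.windows.net/
# In a notebook: dbutils.fs.mount(**args)
```

Keeping the three values in one place means a changed mount point or container only has to be updated once.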


### Creating a Shared Access Signature (SAS) URL
Azure provides you with a secure way to create and share access keys for your Azure Blob Store without compromising your account keys.

More details are provided <a href="http://docs.microsoft.com/en-us/azure/storage/common/storage-dotnet-shared-access-signature-part-1" target="_blank"> in this document</a>.

This allows access to your Azure Blob Store data directly from the Databricks File System (DBFS).

As shown in the screen shot, in the Azure Portal, go to the storage account containing the blob to be mounted. Then:

1. Select Shared access signature from the menu.
2. Click the Generate SAS button.
3. Copy the entire Blob service SAS URL to the clipboard.
4. Use the URL in the mount operation, as shown above.

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-sas-keys.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px; padding: 10px"/>
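
A SAS URL carries its own permissions (`sp`) and expiry time (`se`) as query parameters, so it is worth checking them before mounting. The sketch below assumes the expiry uses the `YYYY-MM-DDTHH:MM:SSZ` form seen in this lesson's URL. `sas_summary` is a hypothetical helper, and the example URL uses a made-up account name and a placeholder signature.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

def sas_summary(sas_url, now=None):
    """Report the permissions and expiry encoded in a SAS URL."""
    params = parse_qs(urlsplit(sas_url).query)
    # Assumes the full timestamp form; other ISO 8601 variants are not handled.
    expiry = datetime.strptime(params["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return {
        "permissions": params["sp"][0],  # e.g. "rl" = read and list
        "expires": expiry,
        "expired": now >= expiry,
    }

# Placeholder URL: the account name and signature are made up.
example_url = (
    "https://exampleaccount.blob.core.windows.net/"
    "?sv=2017-07-29&ss=b&srt=sco&sp=rl&se=2023-04-19T06:32:30Z"
    "&st=2018-04-18T22:32:30Z&spr=https&sig=PLACEHOLDER"
)
summary = sas_summary(example_url, now=datetime(2023, 1, 1, tzinfo=timezone.utc))
print(summary["permissions"], summary["expired"])  # rl False
```

An expired token is a common reason for a mount that succeeds but then fails when the directory is listed.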

List the contents of the directory you just mounted:

In [39]:
%fs ls /mnt/temp-training

Take a peek at the head of the file `auto-mpg.csv`:

In [41]:
%fs head /mnt/temp-training/auto-mpg.csv

Now that you are done, unmount the directory.

In [43]:
# %fs unmount /mnt/temp-training

## Summary

Databricks allows you to:
  * Create tables from existing data
  * Create tables from uploaded files
  * Mount your own Azure blobs

## Review Questions
**Q:** What is Azure Blob Store?  
**A:** Azure Blob Storage stores anywhere from hundreds to billions of objects of unstructured data, such as images, videos, audio, and documents, easily and cost-effectively.

**Q:** What is DBFS?  
**A:** DBFS stands for Databricks File System.  DBFS provides for the cloud what the Hadoop Distributed File System (HDFS) provides for local Spark deployments.  DBFS is backed by Azure Blob Store and makes it easy to access files by name.

**Q:** Which is more efficient to query, a parquet file or a CSV file?  
**A:** Parquet is a highly optimized binary format for storing tables, so reading it incurs less overhead than parsing a CSV file.  Parquet is the big data analogue to CSV: it is optimized, distributed, and more fault tolerant than CSV files.

**Q:** How can you create a new table?  
**A:** Create new tables by either:
* Uploading a new file using the Data tab on the left.
* Defining a table over an existing file in DBFS with a `CREATE TABLE` statement.

**Q:** What is the SQL syntax for defining a table in Spark from an existing parquet file in DBFS?  
**A:**
```
CREATE TABLE IF NOT EXISTS IPGeocode
  USING parquet
  OPTIONS (
    path "dbfs:/mnt/training/ip-geocode.parquet"
  )
```

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [47]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Querying JSON & Hierarchical Data with SQL]($./SSQL 05 - Querying JSON).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html" target="_blank">The Databricks DBFS File System</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>