
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Basic Queries with Spark SQL

Run the following queries to start working with Spark SQL. As you work, notice that Spark SQL syntax and patterns are largely the same as the SQL you would use in other modern database systems.

## Getting Started

When you work with Databricks as part of an organization, it is likely that your workspace will be set up for you. In other words, you will be connected to various data stores and able to pull current data into your notebooks for analysis. In this course, you will use data provided by Databricks. The cell below runs a file that connects this workspace to data storage. You must run the cell below at the start of any new session. There's no need to worry about the output of this cell unless you get an error. It is simply preparing your workspace to be used with this notebook.

In [0]:
%run ../Includes/Classroom-Setup

## Create table

We are going to be working with different files (and file formats) throughout this course. The first thing we need to do, in order to access the data through our SQL interface, is create a **table** from that data. 

A [Databricks table](https://docs.databricks.com/data/tables.html) is a collection of structured data. We will use Spark SQL to query tables. This table contains 10 million fictitious records that hold facts about people, like first and last names, date of birth, salary, etc. We're using the [Parquet](https://databricks.com/glossary/what-is-parquet) file format, which is commonly used in many big data workloads. We will talk more about various file formats later in this course. Run the code below to access the table we'll use for the first part of this lesson.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We will embed links to documentation about Databricks, Spark, and Spark SQL throughout this course. It's important to be able to read and access documentation for any system. We encourage you to get comfortable with the docs and learn more by reading through the provided links.

In [0]:
%sql
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true");

## Querying tables
In the first part of this lesson, we'll be using a table that has been defined for you, `People10M`. This table contains 10 million fictitious records.

We start with a simple `SELECT` statement to get a view of the data.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For now, you can think of a **table** much as you would a spreadsheet with named columns. The actual data is a file in an object store. When we define a table like the one in this exercise, it becomes available to anyone who has access to this Databricks workspace. We will view and work with this table programmatically, but you can also preview it using the `Data` tab in the sidebar on the left side of the screen.

In [0]:
%sql
SELECT * FROM People10M;

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
1,Pennie,Carry,Hirschmann,F,1955-07-02T04:00:00.000+0000,981-43-9345,56172
2,An,Amira,Cowper,F,1992-02-08T05:00:00.000+0000,978-97-8086,40203
3,Quyen,Marlen,Dome,F,1970-10-11T04:00:00.000+0000,957-57-8246,53417
4,Coralie,Antonina,Marshal,F,1990-04-11T04:00:00.000+0000,963-39-4885,94727
5,Terrie,Wava,Bonar,F,1980-01-16T05:00:00.000+0000,964-49-8051,79908
6,Chassidy,Concepcion,Bourthouloume,F,1990-11-24T05:00:00.000+0000,954-59-9172,64652
7,Geri,Tambra,Mosby,F,1970-12-19T05:00:00.000+0000,968-16-4020,38195
8,Patria,Nancy,Arstall,F,1985-01-02T05:00:00.000+0000,984-76-3770,102053
9,Terese,Alfredia,Tocque,F,1967-11-17T05:00:00.000+0000,967-48-7309,91294
10,Wava,Lyndsey,Jeandon,F,1963-12-30T05:00:00.000+0000,997-82-2946,56521


In [0]:
%sql
SELECT firstName FROM People10M;

firstName
Pennie
An
Quyen
Coralie
Terrie
Chassidy
Geri
Patria
Terese
Wava


We can view the schema for this table by using the `DESCRIBE` command.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The **schema** is a list that defines the columns in a table and the datatypes within those columns.

In [0]:
%sql
DESCRIBE People10M;

col_name,data_type,comment
id,int,
firstName,string,
middleName,string,
lastName,string,
gender,string,
birthDate,timestamp,
ssn,string,
salary,int,


## Displaying query results

Any query that starts with a `SELECT` statement automatically displays the results below. We can use a `WHERE` clause to limit the results to those that meet a given condition or set of conditions. 

For the next query, we limit the result columns to `firstName`, `middleName`, `lastName`, and `birthDate`. We use a `WHERE` clause at the end to limit the result set to people whose birth year is greater than 1990 (that is, born in 1991 or later) and whose gender is listed as `F`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Since `birthDate` is a timestamp type, we can extract the year of birth using the function `YEAR()`.

In [0]:
%sql
SELECT
  firstName,
  middleName,
  lastName,
  birthDate
FROM
  People10M
WHERE
  year(birthDate) > 1990
  AND gender = 'F'

firstName,middleName,lastName,birthDate
An,Amira,Cowper,1992-02-08T05:00:00.000+0000
Caroyln,Mamie,Cardon,1994-05-15T04:00:00.000+0000
Yesenia,Eileen,Goldring,1997-07-09T04:00:00.000+0000
Hedwig,Dulcie,Pendleberry,1998-12-02T05:00:00.000+0000
Kala,Violeta,Lyfe,1994-06-23T04:00:00.000+0000
Gussie,India,McKeeman,1991-11-15T05:00:00.000+0000
Pansy,Suzie,Shrieves,1991-05-24T04:00:00.000+0000
Chung,Dian,Dautry,1998-01-12T05:00:00.000+0000
Erica,Louvenia,O'Drought,1991-03-08T05:00:00.000+0000
Katelyn,Merrie,Pocklington,1994-01-16T05:00:00.000+0000
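
The effect of combining the two `WHERE` conditions can be checked on a handful of rows. The following is a minimal sketch, not part of the course materials: it assumes Python's built-in `sqlite3` module as a local stand-in for Spark SQL, uses a hypothetical `People` table with three rows copied from the preview above plus one made-up row, and replaces `year()` with SQLite's `strftime('%Y', ...)` because SQLite has no `YEAR()` function.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (firstName TEXT, gender TEXT, birthDate TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?)",
    [
        ("Pennie", "F", "1955-07-02"),
        ("An", "F", "1992-02-08"),
        ("Coralie", "F", "1990-04-11"),  # born in 1990, so "> 1990" excludes her
        ("Sam", "M", "1995-03-01"),      # hypothetical row, not in the dataset
    ],
)

# strftime('%Y', ...) plays the role that year() plays in Spark SQL.
rows = conn.execute(
    """
    SELECT firstName
    FROM People
    WHERE CAST(strftime('%Y', birthDate) AS INTEGER) > 1990
      AND gender = 'F'
    """
).fetchall()
filtered_names = [name for (name,) in rows]
print(filtered_names)
```

Only `An` satisfies both conditions: `Coralie` is excluded because 1990 is not greater than 1990, and the made-up `Sam` row is excluded by the gender condition.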


## Math

Spark SQL includes many <a href="https://spark.apache.org/docs/latest/api/sql/" target="_blank">built-in functions</a> that are also used in standard SQL. We can use them to create new columns based on a rule. In this case, we use a simple arithmetic expression to calculate 20% of a person's listed salary. We use the keyword `AS` to name the new column `savings`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Many financial planning experts agree that 20% of a person's income should go into savings.

In [0]:
%sql
SELECT
  firstName,
  lastName,
  salary,
  salary * 0.2 AS savings
FROM
  People10M

firstName,lastName,salary,savings
Pennie,Hirschmann,56172,11234.4
An,Cowper,40203,8040.6
Quyen,Dome,53417,10683.4
Coralie,Marshal,94727,18945.4
Terrie,Bonar,79908,15981.6
Chassidy,Bourthouloume,64652,12930.4
Geri,Mosby,38195,7639.0
Patria,Arstall,102053,20410.6
Terese,Tocque,91294,18258.8
Wava,Jeandon,56521,11304.2


## Temporary Views
So far, you've been working with Spark SQL by querying a table that we defined for you. In the following exercises, we will work with **temporary views**. Temporary views are useful for data exploration. A temporary view gives you a name to query from SQL, but unlike a table, it does not carry over when you restart the cluster or switch to a new notebook. Also, temporary views will not show up in the `Data` tab.

In the cell below, we create a temporary view that holds all the information from our last query, plus a new column, `birthYear`.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW PeopleSavings AS
SELECT
  firstName,
  lastName,
  year(birthDate) as birthYear,
  salary,
  salary * 0.2 AS savings
FROM
  People10M;

## Where are the results?!

When you create a temporary view, the "OK" at the bottom indicates that your command ran successfully, but the view itself does not automatically appear. To see the records in the temporary view, you can run a query on it.

In [0]:
%sql
SELECT * FROM PeopleSavings;

firstName,lastName,birthYear,salary,savings
Pennie,Hirschmann,1955,56172,11234.4
An,Cowper,1992,40203,8040.6
Quyen,Dome,1970,53417,10683.4
Coralie,Marshal,1990,94727,18945.4
Terrie,Bonar,1980,79908,15981.6
Chassidy,Bourthouloume,1990,64652,12930.4
Geri,Mosby,1970,38195,7639.0
Patria,Arstall,1985,102053,20410.6
Terese,Tocque,1967,91294,18258.8
Wava,Jeandon,1963,56521,11304.2
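
The session-scoped nature of temporary views can be illustrated outside Databricks. This is a rough analogy only, sketched under stated assumptions: it uses Python's built-in `sqlite3` module instead of Spark, treats two database connections as two "sessions", and uses a hypothetical file name (`demo.db`) and a one-row `People` table. SQLite's `CREATE TEMP VIEW` stands in for `CREATE OR REPLACE TEMPORARY VIEW`.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "demo.db")  # hypothetical file name

    # "Session A" creates a regular table and a temporary view over it.
    session_a = sqlite3.connect(db_path)
    session_a.execute("CREATE TABLE People (firstName TEXT, salary INTEGER)")
    session_a.execute("INSERT INTO People VALUES ('Pennie', 56172)")
    session_a.commit()
    session_a.execute(
        "CREATE TEMP VIEW PeopleSavings AS "
        "SELECT firstName, salary, salary * 0.2 AS savings FROM People"
    )
    view_rows = session_a.execute("SELECT * FROM PeopleSavings").fetchall()

    # "Session B" sees the table, but not session A's temporary view.
    session_b = sqlite3.connect(db_path)
    table_row_count = session_b.execute("SELECT count(*) FROM People").fetchone()[0]
    try:
        session_b.execute("SELECT * FROM PeopleSavings")
        view_visible_in_b = True
    except sqlite3.OperationalError:
        view_visible_in_b = False

    session_a.close()
    session_b.close()

print(len(view_rows), table_row_count, view_visible_in_b)
```

The table persists for every connection, while the temporary view exists only in the connection that created it, which loosely mirrors how a temporary view in this course does not carry over to a new notebook.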


## Query Views
For the most part, you can query a view exactly as you would query a table. The query below uses the built-in function `AVG()` to calculate `avgSalary` **grouped by** `birthYear`. `AVG()` is an aggregate function, which means it performs a calculation on a set of values. When you select a non-aggregated column such as `birthYear` alongside an aggregate, you must include a `GROUP BY` clause to identify the groups of values you want to summarize.

The final clause, `ORDER BY`, declares the column that will control the order in which the rows appear, and the keyword `DESC` means they will appear in descending order. 

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We use a `ROUND()` function around the `AVG()` to round to the nearest cent.

In [0]:
%sql
SELECT
  birthYear,
  ROUND(AVG(salary), 2) AS avgSalary
FROM
  peopleSavings
GROUP BY
  birthYear
ORDER BY
  avgSalary DESC

birthYear,avgSalary
2000,72741.39
1987,72725.18
1963,72722.43
1951,72704.04
1964,72693.98
1996,72693.55
1965,72675.08
1993,72673.54
1976,72669.25
1973,72668.67
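
To see how `GROUP BY` changes what an aggregate summarizes, it can help to run the same pattern on a few rows. The following is a small sketch that assumes Python's built-in `sqlite3` module as a local stand-in for Spark SQL; the `PeopleSavings` table here is a hypothetical two-column copy holding five `birthYear` and `salary` pairs taken from the preview earlier in this lesson.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PeopleSavings (birthYear INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO PeopleSavings VALUES (?, ?)",
    [(1990, 94727), (1990, 64652), (1970, 53417), (1970, 38195), (1985, 102053)],
)

# One average per birth year, highest average first.
avg_by_year = conn.execute(
    """
    SELECT birthYear, ROUND(AVG(salary), 2) AS avgSalary
    FROM PeopleSavings
    GROUP BY birthYear
    ORDER BY avgSalary DESC
    """
).fetchall()

# Without GROUP BY, the aggregate collapses the whole table into one row.
overall_avg = conn.execute(
    "SELECT ROUND(AVG(salary), 2) FROM PeopleSavings"
).fetchone()[0]

print(avg_by_year)
print(overall_avg)
```

With these rows, 1985 sorts first (102053.0), followed by 1990 (79689.5) and 1970 (45806.0), while the query without `GROUP BY` returns a single average for all five rows.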


## Define a new table

Now we will show you how to create a table using Parquet. <a href="https://databricks.com/glossary/what-is-parquet#:~:text=Parquet%20is%20an%20open%20source,like%20CSV%20or%20TSV%20files" target="_blank">Parquet</a> is an open-source, column-based file format. Apache Spark supports many different file formats; you can specify how you want your table to be written with the `USING` keyword. 


For now, we will focus on the commands we will use to create a new table. 

This data contains information about the relative popularity of first names in the United States by year from 1880 to 2016.

`Line 1`: Tables must have unique names. By including the `DROP TABLE IF EXISTS` command, we are ensuring that the next line (`CREATE TABLE`) can run successfully even if this table has already been created. The semicolon at the end of the line allows us to run another command in the same cell.

`Line 2`: Creates a table named `ssaNames`, defines the data source type (`parquet`), and indicates that there are some optional parameters to follow.

`Line 3`: Identifies the path to the file in object storage.

`Line 4`: Sets the `header` option, which indicates that the first line of the file should be treated as a header. This option matters for text formats such as CSV; Parquet files store column names and types in the file itself.

In [0]:
%sql
DROP TABLE IF EXISTS ssaNames;
CREATE TABLE ssaNames USING parquet OPTIONS (
  path "/mnt/training/ssn/names.parquet",
  header "true"
)
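
The reason for pairing `DROP TABLE IF EXISTS` with `CREATE TABLE` can be demonstrated with a rerunnable helper. This is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module rather than Spark, the helper name `recreate_table` is hypothetical, and the column list is invented because SQLite tables need explicit columns instead of reading a schema from a Parquet file.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")


def recreate_table(connection):
    """Drop the table if it exists, then create it again (safe to rerun)."""
    connection.execute("DROP TABLE IF EXISTS ssaNames")
    # Hypothetical columns; the real table gets its schema from the Parquet file.
    connection.execute("CREATE TABLE ssaNames (firstName TEXT, year INTEGER)")


recreate_table(conn)
recreate_table(conn)  # the second run succeeds because of DROP TABLE IF EXISTS

# Without the DROP, a second CREATE TABLE fails because the name is taken.
try:
    conn.execute("CREATE TABLE ssaNames (firstName TEXT, year INTEGER)")
    second_create_failed = False
except sqlite3.OperationalError:
    second_create_failed = True

print(second_create_failed)
```

Running the drop-then-create pair repeatedly is harmless, which is why the cell above can be executed more than once in the same workspace.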

## Preview the data
Run the cell below to preview the data. Notice that the `LIMIT` keyword restricts the number of returned rows to the specified limit.

In [0]:
%sql
SELECT
  *
FROM
  ssaNames
LIMIT
  5;

firstName,gender,total,year
Jennifer,F,54336,1983
Jessica,F,45278,1983
Amanda,F,33752,1983
Ashley,F,33292,1983
Sarah,F,27228,1983


## Joining two tables

We can combine these tables to get a sense of how the data may be related. For example, you may wonder:
> How many popular first names appear in our generated `People10M` dataset?

We will use a join to help answer this question. We will perform the join in a series of steps.

## Count distinct values

First, we query each table to count the distinct values in a field. Run the commands below to see how many distinct names are in each of our tables.

In [0]:
%sql
SELECT count(DISTINCT firstName)
FROM SSANames;

count(DISTINCT firstName)
93889


In [0]:
%sql
SELECT count(DISTINCT firstName) 
FROM People10M;

count(DISTINCT firstName)
5113
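
The difference between counting rows and counting distinct values is easy to verify on a tiny table. The sketch below assumes Python's built-in `sqlite3` module as a local stand-in for Spark SQL and a hypothetical, reduced `ssaNames` table with only `firstName` and `year`; the 1984 row is made up to create a repeated name.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ssaNames (firstName TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO ssaNames VALUES (?, ?)",
    [
        ("Jennifer", 1983),
        ("Jessica", 1983),
        ("Jennifer", 1984),  # made-up row so that one name repeats
    ],
)

# count(*) counts every row; count(DISTINCT ...) counts each name once.
total_rows, distinct_names = conn.execute(
    "SELECT count(*), count(DISTINCT firstName) FROM ssaNames"
).fetchone()

print(total_rows, distinct_names)
```

Here the table has three rows but only two distinct first names, because `Jennifer` appears in two different years.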


## Create temporary views
Next, we create two temporary views so that the actual join will be easy to read/write.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW SSADistinctNames AS 
  SELECT DISTINCT firstName AS ssaFirstName 
  FROM SSANames;

CREATE OR REPLACE TEMPORARY VIEW PeopleDistinctNames AS 
  SELECT DISTINCT firstName 
  FROM People10M

## Perform join
Now, we can use the view names to **join** the two data sets. If you are new to using SQL, you may want to learn more about the different types of joins you can perform. This [Wikipedia article](https://en.wikipedia.org/wiki/Join_(SQL)) offers complete explanations, with pictures and sample SQL code.

The join shown here is an `INNER` join, which is the default join type. That means the results will contain the intersection of the two sets, and any names that are not in both sets will not appear. Because `INNER` is the default, we did not specify the join type.

In [0]:
%sql
SELECT firstName 
FROM PeopleDistinctNames 
JOIN SSADistinctNames ON firstName = ssaFirstName

firstName
Susanna
Julianne
Lashanda
Kiana
Tyler
Sandi
Faye
Alayna
Britta
Melaine


## How many names?

To answer the question posed previously, we can perform this join and include a count of the number of records in the result.

In [0]:
%sql
SELECT count(*) 
FROM PeopleDistinctNames 
JOIN SSADistinctNames ON firstName = ssaFirstName;

count(1)
5096
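
Comparing the counts above, 5,113 distinct names in `People10M` and 5,096 in the join result, suggests that 17 names have no match in the SSA data. The following sketch shows one way such unmatched names could be listed. It assumes Python's built-in `sqlite3` module as a local stand-in for Spark SQL, and the two small tables are toy data: a few names borrowed from the outputs above plus a made-up name (`Zzyzx`) that has no match.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PeopleDistinctNames (firstName TEXT);
    CREATE TABLE SSADistinctNames (ssaFirstName TEXT);
    -- Toy data: 'Zzyzx' is a made-up name with no SSA match.
    INSERT INTO PeopleDistinctNames VALUES ('Tyler'), ('Faye'), ('Kiana'), ('Zzyzx');
    INSERT INTO SSADistinctNames VALUES
      ('Tyler'), ('Faye'), ('Kiana'), ('Jennifer'), ('Jessica');
    """
)

# Inner join: only names present in both tables are counted.
matched_count = conn.execute(
    "SELECT count(*) FROM PeopleDistinctNames "
    "JOIN SSADistinctNames ON firstName = ssaFirstName"
).fetchone()[0]

# Left join: keep every people name, then keep only rows with no SSA match.
unmatched_names = [
    name
    for (name,) in conn.execute(
        "SELECT firstName FROM PeopleDistinctNames "
        "LEFT JOIN SSADistinctNames ON firstName = ssaFirstName "
        "WHERE ssaFirstName IS NULL"
    )
]

print(matched_count, unmatched_names)
```

In this toy data, three names match and the left join surfaces the single unmatched name. The same `LEFT JOIN ... WHERE ... IS NULL` pattern is standard SQL, so it should carry over to the temporary views defined in this lesson.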


In [0]:
%run ../Includes/Classroom-Cleanup


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>