
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Data Visualization with Databricks

In this lesson, you'll learn how to create and share data visualizations in Databricks. 

By the end of the lesson, you will be able to: 
* Create a table with a specified schema
* Cast a column as a timestamp and extract day, month, or year
* Use in-notebook visualizations to see your data 

Step 1: Read through and run all the cells in this notebook. <br>
Step 2: View the corresponding video to see instructions for visualizing and sharing data.

## Getting Started

Run the following cell to connect your workspace to the appropriate data source.

In [0]:
%run ../Includes/Classroom-Setup

## Create a Table

In the previous lesson, we created a table for you to start querying. In this lesson, you will create the table by reading directly from the data source and specifying a **schema**. A schema describes the structure of your data. It contains column names and the type of data in each column. All tables must have an associated schema; if you do not explicitly define one, Spark may be able to infer it.  

In the cell below, we define the schema as we create the table. This data has the following schema: 

| Column Name | Type |
| ----------- | ---- |
| userId | INT |
| movieId | INT |
| rating | FLOAT |
| timeRecorded | INT |

Notice that the schema is defined right after the `CREATE TABLE` statement, with the name of each column followed by that column's data type. The whole group of columns is surrounded by parentheses, and the individual columns are separated by commas.

In [0]:
%sql
DROP TABLE IF EXISTS movieRatings;
CREATE TABLE movieRatings (
  userId INT,
  movieId INT,
  rating FLOAT,
  timeRecorded INT
) USING csv OPTIONS (
  PATH "/mnt/training/movies/20m/ratings.csv",
  header "true"
);
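To see what an explicit schema does without Spark, here is a minimal sketch in plain Python. The `SCHEMA` mapping and the `apply_schema` helper are hypothetical names introduced only for illustration; they are not part of Spark or Databricks. The sample rows are copied from the preview output in this notebook and use the table's column names as the header. Each column name is paired with a converter, much as each column is paired with a type in the `CREATE TABLE` statement.

```python
import csv
import io

# Hypothetical schema: column name -> Python converter, mirroring the
# column/type pairs in the CREATE TABLE statement (INT, INT, FLOAT, INT).
SCHEMA = {
    "userId": int,
    "movieId": int,
    "rating": float,
    "timeRecorded": int,
}

# A few rows in the same shape as the table, header included.
SAMPLE_CSV = """userId,movieId,rating,timeRecorded
1,2,3.5,1112486027
1,29,3.5,1112484676
1,151,4.0,1094785734
"""


def apply_schema(text, schema):
    """Parse CSV text and convert every field with its schema converter."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: convert(row[name]) for name, convert in schema.items()}
        for row in reader
    ]


rows = apply_schema(SAMPLE_CSV, SCHEMA)
print(rows[0])
```

Without the schema, every field would stay a string, which is roughly the situation a reader faces when no types are declared and none are inferred.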

## Preview the data

This table contains a little more than 20 million records of movie ratings submitted by users. Note that `timeRecorded` is stored as an integer: the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC).

In [0]:
%sql
SELECT
  *
FROM
  movieRatings;

userId,movieId,rating,timeRecorded
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940
1,260,4.0,1112484826


## Cast as timestamp

We use the `CAST()` function to convert the integer in `timeRecorded` to a timestamp, which displays as a human-readable date and time.

In [0]:
%sql
SELECT
  rating,
  CAST(timeRecorded as timestamp)
FROM
  movieRatings;

rating,timeRecorded
3.5,2005-04-02T23:53:47.000+0000
3.5,2005-04-02T23:31:16.000+0000
3.5,2005-04-02T23:33:39.000+0000
3.5,2005-04-02T23:32:07.000+0000
3.5,2005-04-02T23:29:40.000+0000
3.5,2004-09-10T03:09:00.000+0000
4.0,2004-09-10T03:08:54.000+0000
4.0,2005-04-02T23:46:13.000+0000
4.0,2005-04-02T23:35:40.000+0000
4.0,2005-04-02T23:33:46.000+0000
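The same conversion can be checked outside of Spark. The sketch below assumes, as the output above suggests, that the integer counts seconds since 1970-01-01 00:00:00 UTC. The helper name `epoch_to_utc` is hypothetical and used only for illustration.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Interpret an integer as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# First timeRecorded value from the preview output.
first = epoch_to_utc(1112486027)
print(first.isoformat())                   # 2005-04-02T23:53:47+00:00
print(first.year, first.month, first.day)  # 2005 4 2
```

Once the value is a timestamp, extracting the day, month, or year is a matter of reading the corresponding field.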


## Create temporary view

We will create a temporary view so that we can easily refer to the data we want to include in our visualization. For this data, we can investigate whether there are any patterns in the ratings when grouped by month. To do that, we use the `AVG()` function to calculate the average rating and the `ROUND()` function to round it to 3 decimal places. We then extract the month from the `timeRecorded` column after casting it as a timestamp. Because the `GROUP BY` clause groups rows by month, `AVG()` is calculated once per calendar month, combining that month's ratings from every year in the data.

In [0]:
%sql
CREATE
OR REPLACE TEMPORARY VIEW ratingsByMonth AS
SELECT
  ROUND(AVG(rating), 3) AS avgRating,
  month(CAST(timeRecorded as timestamp)) AS month
FROM
  movieRatings
GROUP BY
  month;
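As a rough illustration of what this view computes, the sketch below repeats the same steps in plain Python on the ten rows shown in the preview output: convert each integer to a UTC timestamp, take the month, group the ratings by month, then average and round to 3 decimal places. The function name `ratings_by_month` is hypothetical, and the result covers only those ten sample rows, not the full table.

```python
from collections import defaultdict
from datetime import datetime, timezone

# (rating, timeRecorded) pairs copied from the preview output.
SAMPLE = [
    (3.5, 1112486027), (3.5, 1112484676), (3.5, 1112484819),
    (3.5, 1112484727), (3.5, 1112484580), (3.5, 1094785740),
    (4.0, 1094785734), (4.0, 1112485573), (4.0, 1112484940),
    (4.0, 1112484826),
]


def ratings_by_month(pairs):
    """Group ratings by calendar month (UTC) and average each group."""
    groups = defaultdict(list)
    for rating, seconds in pairs:
        month = datetime.fromtimestamp(seconds, tz=timezone.utc).month
        groups[month].append(rating)
    # Round each average to 3 decimal places, like ROUND(AVG(rating), 3).
    return {m: round(sum(v) / len(v), 3) for m, v in groups.items()}


print(ratings_by_month(SAMPLE))  # {4: 3.688, 9: 3.75}
```

Note that the month alone is the grouping key, so ratings from April of different years would land in the same group.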

## Visualize the data

Run the next cell to see the view we just defined, ordered by `avgRating` from least to greatest. The results will appear as a table. The corresponding video shows you how to display this table as a chart.

In [0]:
%sql
SELECT
  *
FROM
  ratingsByMonth
ORDER BY
  avgRating;

avgRating,month
3.496,3
3.505,8
3.514,6
3.515,5
3.517,7
3.52,1
3.524,2
3.528,9
3.528,4
3.54,12
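To get a feel for why a chart helps here, the sketch below draws a crude text bar chart from the rows shown above. It is only an illustration in plain Python, not how Databricks renders its built-in charts, and `text_bars` is a hypothetical helper. The bars are scaled between the smallest and largest average, which exaggerates small differences.

```python
# (avgRating, month) rows copied from the query output above.
ROWS = [
    (3.496, 3), (3.505, 8), (3.514, 6), (3.515, 5), (3.517, 7),
    (3.52, 1), (3.524, 2), (3.528, 9), (3.528, 4), (3.54, 12),
]


def text_bars(rows, width=40):
    """Return one line per month, in calendar order, with a scaled bar."""
    low = min(avg for avg, _ in rows)
    high = max(avg for avg, _ in rows)
    span = (high - low) or 1.0  # avoid dividing by zero if all values match
    lines = []
    for avg, month in sorted(rows, key=lambda r: r[1]):
        # The lowest average gets 1 character, the highest gets `width`.
        length = 1 + round((avg - low) / span * (width - 1))
        lines.append(f"{month:>2} {avg:.3f} {'#' * length}")
    return lines


for line in text_bars(ROWS):
    print(line)
```

Sorting by month instead of by `avgRating` makes any seasonal pattern easier to spot. Keep in mind that all of these averages fall within about 0.05 of each other, so a chart whose axis starts at zero would show nearly equal bars.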


In [0]:
%run ../Includes/Classroom-Cleanup

#### Citation
Access the original data and background information here:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>