
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab 2 - Data Munging
## Module 5 Assignment

In this exercise, you will be working with mock data meant to replicate data from an ecommerce mattress seller. Broadly, your work is to clean up and present this data so that it can be used to target geographic areas.  Work through the tasks below and answer the challenge to produce the required report. 

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you will:

* Work with hierarchical data
* Use common table expressions to display data
* Create new tables based on existing tables
* Manage working with null values and timestamps

As you work through the following tasks, you will be prompted to enter selected answers in Coursera. Find the quiz associated with this lab to enter your answers. 

Run the cell below to prepare this workspace for the lab.

In [0]:
%run ../Includes/Classroom-Setup

## Exercise 1: Create a table
**Summary:** Create a new table named `eventsRaw` 

Use this path to access the data: `/mnt/training/ecommerce/events/events.parquet`

Steps to complete: 
* Make sure this notebook is idempotent by first dropping the table named `eventsRaw`, if it exists already
* Use the provided path to read in the data
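
The idempotency step above can be sketched outside Spark. This is a minimal illustration only: it uses Python's built-in `sqlite3` as a stand-in for Spark SQL, and the two-column table and its single row are hypothetical, not the real `eventsRaw` schema.

```python
import sqlite3

def build_events_table(conn):
    """Drop and recreate the table so repeated runs leave the same state."""
    conn.execute("DROP TABLE IF EXISTS eventsRaw")
    # Hypothetical toy schema and row, for illustration only
    conn.execute("CREATE TABLE eventsRaw (user_id TEXT, event_name TEXT)")
    conn.execute("INSERT INTO eventsRaw VALUES ('UA1', 'main')")

conn = sqlite3.connect(":memory:")
build_events_table(conn)
build_events_table(conn)  # a second run neither fails nor duplicates rows
row_count = conn.execute("SELECT COUNT(*) FROM eventsRaw").fetchone()[0]
print(row_count)
```

Without the `DROP TABLE IF EXISTS` line, the second call would fail because the table already exists.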

In [0]:
%sql
DROP TABLE IF EXISTS eventsRaw;
CREATE TABLE eventsRaw
USING parquet
OPTIONS (
PATH "/mnt/training/ecommerce/events/events.parquet")

## Exercise 2: Understand the schema and metadata

**Summary:** Run a command to display this table's schema and other detailed table information

Notice that this table includes `ArrayType` and `StructType` data

Steps to complete: 
* Run a single command to display the table information
* **Answer the corresponding question in Coursera, in the quiz for this module, regarding the location of this table**

In [0]:
%sql
DESCRIBE EXTENDED eventsRaw;

col_name,data_type,comment
device,string,
ecommerce,struct,
event_name,string,
event_previous_timestamp,bigint,
event_timestamp,bigint,
geo,struct,
items,array,
traffic_source,string,
user_first_touch_timestamp,bigint,
user_id,string,


## Exercise 3: Sample the table

**Summary:** Sample this table to get a closer look at the data

Steps to complete: 
* Sample the table to display up to 1 percent of the records

In [0]:
%sql
SELECT * FROM eventsRaw
TABLESAMPLE (1 PERCENT)
LIMIT 1


device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
iOS,"List(null, null, null)",main,,1593876498259824,"List(San Antonio, TX)",List(),facebook,1593876498259824,UA000000107357951


## Exercise 4: Create a new table

**Summary:** Create a table `purchaseEvents` that includes event data _with_ purchases that has the following schema: 

| ColumnName      | DataType| 
|-----------------|---------|
|purchases        |double   |
|previousEventDate|date     |
|eventDate        |date     |
|city             |string   |
|state            |string   |
|userId           |string   |


<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> The timestamps in this table are meant to match those used in Google Analytics, which measures time to the microsecond. To convert to Unix time in seconds, you must divide these values by 1000000 (1e6) before casting to a timestamp. 
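
The effect of the divisor can be checked with a short Python sketch. It takes the `event_timestamp` value from the row sampled in Exercise 3 and converts it twice, once dividing by 1,000,000 (`1e6`) and once by `10e6`, which is ten million. The helper name `micros_to_utc` is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def micros_to_utc(micros, divisor=1_000_000):
    """Convert an epoch value in microseconds to a UTC datetime."""
    return EPOCH + timedelta(seconds=micros / divisor)

sample = 1593876498259824  # event_timestamp of the row sampled in Exercise 3

correct = micros_to_utc(sample)            # divide by 1e6 (1,000,000)
off_by_ten = micros_to_utc(sample, 10e6)   # 10e6 is 10,000,000

print(correct.date())     # 2020-07-04
print(off_by_ten.date())  # 1975-01-19
```

A divisor that is ten times too large still yields a valid timestamp, so the error only becomes visible when the resulting dates are inspected.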

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Access values from StructType objects using dot notation

Steps to complete: 
* Make sure this notebook is idempotent by first dropping the table, if it exists
* Create a table based on the existing table
* Use a common table expression to manipulate your data before writing the `SELECT` statement that will define your table _(Recommended)_
* Do not include records where the `purchase_revenue_in_usd` is `NULL`
* Sort the table so that the city and state with the greatest total purchase is listed first
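
The steps above (a common table expression, a `NULL` filter, and a descending sort) can be sketched on toy data. This is an assumption-laden illustration: it runs on SQLite through Python's `sqlite3` rather than Spark SQL, the source table is flattened instead of using structs, and the CTE name `purchasesOnly` and the three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened stand-in for eventsRaw (the real table nests these fields)
conn.execute("CREATE TABLE eventsRaw (revenue REAL, city TEXT, state TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO eventsRaw VALUES (?, ?, ?, ?)",
    [
        (None, "San Antonio", "TX", "UA1"),  # no purchase, must be excluded
        (850.5, "Tacoma", "WA", "UA2"),
        (1525.5, "Redding", "CA", "UA3"),
    ],
)

conn.execute("DROP TABLE IF EXISTS purchaseEvents")
conn.execute(
    """
    CREATE TABLE purchaseEvents AS
    WITH purchasesOnly AS (
        SELECT revenue AS purchases, city, state, user_id AS userId
        FROM eventsRaw
        WHERE revenue IS NOT NULL      -- drop events without a purchase
    )
    SELECT * FROM purchasesOnly
    """
)

# Sort at query time so the largest purchase is listed first
rows = conn.execute(
    "SELECT purchases, city FROM purchaseEvents ORDER BY purchases DESC"
).fetchall()
print(rows)
```

Note the `IS NOT NULL` predicate: a comparison such as `= NULL` evaluates to `NULL` for every row, so it never selects anything.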

In [0]:
%sql
DROP TABLE IF EXISTS purchaseEvents;
CREATE TABLE purchaseEvents
USING parquet
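WITH ExplodeSource
AS 
  (
  SELECT 
    CAST(ecommerce.purchase_revenue_in_usd AS DOUBLE) AS purchases,
    -- Note: 10e6 is 10,000,000, ten times the 1,000,000 (1e6) microseconds per second,
    -- so the timestamps stored here (and shown in the output below) fall in 1975.
    CAST(event_timestamp/10e6 AS timestamp) AS eventDate,
    CAST(event_previous_timestamp/10e6 AS timestamp) AS previousEventDate,
    geo.city AS city,
    geo.state AS state,
    user_id AS userId
  FROM 
    eventsRaw
  -- Keep only events that recorded a purchase
  WHERE ecommerce.purchase_revenue_in_usd IS NOT NULL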
  )
SELECT
  purchases, 
  eventDate,
  previousEventDate,
  city,
  state,
  userId
FROM
  ExplodeSource
ORDER BY purchases DESC

In [0]:
%sql
SELECT * FROM purchaseEvents

purchases,eventDate,previousEventDate,city,state,userId
921.6,1975-01-19T13:55:33.395+0000,1975-01-19T13:55:33.177+0000,Jackson,MS,UA000000106026086
921.6,1975-01-19T10:50:08.371+0000,1975-01-19T10:50:00.516+0000,Anaheim,CA,UA000000105988205
921.6,1975-01-19T08:09:51.224+0000,1975-01-19T08:08:19.269+0000,Phoenix,AZ,UA000000105469152
921.6,1975-01-19T07:58:18.606+0000,1975-01-19T07:57:17.998+0000,Humble,TX,UA000000105541469
921.6,1975-01-18T14:12:07.925+0000,1975-01-18T14:12:07.725+0000,Elgin,IL,UA000000103622157
921.6,1975-01-18T20:14:01.109+0000,1975-01-18T20:13:16.891+0000,Palm Desert,CA,UA000000103592339
921.6,1975-01-19T00:31:51.832+0000,1975-01-19T00:30:23.613+0000,Brigham City,UT,UA000000104436573
921.6,1975-01-18T23:57:34.542+0000,1975-01-18T23:55:48.474+0000,Idaho Falls,ID,UA000000104410746
921.6,1975-01-19T02:27:39.255+0000,1975-01-19T02:26:46.138+0000,Magnolia,AR,UA000000104663815
921.6,1975-01-19T02:34:07.220+0000,1975-01-19T02:32:14.952+0000,New York,NY,UA000000104847413


## Exercise 5: Count the records

**Summary:** Count all the records in your new table. 

Steps to complete:
* Write a `SELECT` statement that counts the records in `purchaseEvents`
* **Answer the corresponding quiz question in Coursera**

In [0]:
%sql
SELECT COUNT(*) FROM purchaseEvents

count(1)
180678


## Exercise 6: Find the location with the top purchase
**Summary:** Write a query to produce the city and state where the top purchase amount originated. 

Steps to complete: 
* Write a query, sorted by `purchases`, that shows the city and state of the top purchase
* **Answer the corresponding quiz question in Coursera**

In [0]:
%sql
SELECT MAX(purchases), city, state
FROM purchaseEvents
GROUP BY city, state
ORDER BY MAX(purchases) DESC
LIMIT 3


max(purchases),city,state
5830.0,Amarillo,TX
5485.0,Tampa,FL
5289.0,Buffalo,NY


## Challenge: Produce reports

**Summary:** Use the `purchaseEvents` table to produce queries that explore purchase patterns in the table. Add visualizations to a dashboard to produce one comprehensive customer report.  

Steps to complete: 
* Create visualizations to report on: 
  * total purchases by day of week
  * average purchases by date of purchase
  * total purchases by state
  * Any other patterns you can find in the data
* Join your table with the data at the path listed below to get a list of customers with confirmed email addresses
* **Answer the corresponding quiz question in Coursera**

#### Total purchases by day of week

In [0]:
%sql
SELECT
  date_format(eventDate, "E") day,
  ROUND(SUM(purchases),2) totalPurchases
FROM
  purchaseEvents
  GROUP BY day
  ORDER BY totalPurchases DESC

day,totalPurchases
Sat,96895617.0
Sun,90847428.3


The plot above shows only two days of the week, Saturday and Sunday. Saturday accounts for about 52% of total purchases and Sunday for about 48%.
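
The shares can be recomputed from the two totals in the output above. A small Python check, using only those two figures:

```python
# Totals copied from the query output above
totals = {"Sat": 96895617.0, "Sun": 90847428.3}

grand_total = sum(totals.values())
shares = {day: round(100 * amount / grand_total, 1) for day, amount in totals.items()}
print(shares)  # {'Sat': 51.6, 'Sun': 48.4}
```

The gap between the two days is therefore a little over 3 percentage points.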




#### Average purchases by date of purchase

In [0]:
%sql
SELECT
  date_format(eventDate, "dd-MM-yyyy") day,
  ROUND(AVG(purchases),2) avgPurchases
FROM
  purchaseEvents
  GROUP BY day
  ORDER BY avgPurchases DESC

day,avgPurchases
19-01-1975,1039.43
18-01-1975,1038.79


On average, the two days (18 and 19 January 1975) are nearly identical: the average purchase was 1038.79 on the 18th and 1039.43 on the 19th.

#### Total purchases by state

In [0]:
%sql
SELECT
  state,
  ROUND(SUM(purchases),2) totalStatePurchases
FROM
  purchaseEvents
  GROUP BY state
  ORDER BY totalStatePurchases DESC

state,totalStatePurchases
CA,34065790.9
TX,21428846.1
NY,11013716.0
FL,10234381.3
OH,7029893.5
IL,7015217.5
AZ,5294428.1
WA,5175375.3
MI,4929963.4
MN,4640197.7


The plot above shows that the top three states are California (34M), Texas (21.4M) and New York (11M).
The bottom three are Delaware, Vermont and, in last place, Alaska.

#### Any other patterns you can find in the data

In [0]:
%sql
SELECT COUNT(DISTINCT userId) numUniqUsers, ROUND(SUM(purchases),2) totalStatePurchases, state
FROM
  purchaseEvents
  GROUP BY state
  ORDER BY totalStatePurchases DESC

numUniqUsers,totalStatePurchases,state
32812,34065790.9,CA
20766,21428846.1,TX
10530,11013716.0,NY
9903,10234381.3,FL
6767,7029893.5,OH
6728,7015217.5,IL
5135,5294428.1,AZ
5003,5175375.3,WA
4747,4929963.4,MI
4447,4640197.7,MN


The table above shows a strong correlation between the number of users in a state and that state's total purchases: the more users, the larger the total.
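
As a rough check of that claim, the Pearson correlation can be computed over the ten rows displayed above. This sketch covers only those ten states, not the full result set, and the helper `pearson` is written out by hand for illustration.

```python
from math import sqrt

# (user count, totalStatePurchases) pairs copied from the ten rows shown above
rows = [
    (32812, 34065790.9), (20766, 21428846.1), (10530, 11013716.0),
    (9903, 10234381.3), (6767, 7029893.5), (6728, 7015217.5),
    (5135, 5294428.1), (5003, 5175375.3), (4747, 4929963.4),
    (4447, 4640197.7),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

users = [u for u, _ in rows]
totals = [t for _, t in rows]
r = pearson(users, totals)
print(round(r, 4))
```

A coefficient close to 1 is consistent with an almost perfectly linear relationship between the two columns.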

The following query and plot break this down per user:
#### Total state purchases per user in each state

In [0]:
%sql
SELECT 
  ROUND(SUM(purchases)/COUNT (userId),2) purchasesPerUser,
  COUNT (userId) numUsers, 
  ROUND(SUM(purchases),2) totalStatePurchases, 
  state
FROM
  purchaseEvents
  GROUP BY state
  ORDER BY totalStatePurchases DESC

purchasesPerUser,numUsers,totalStatePurchases,state
1038.21,32812,34065790.9,CA
1031.92,20766,21428846.1,TX
1045.94,10530,11013716.0,NY
1033.46,9903,10234381.3,FL
1038.85,6767,7029893.5,OH
1042.69,6728,7015217.5,IL
1031.05,5135,5294428.1,AZ
1034.45,5003,5175375.3,WA
1038.54,4747,4929963.4,MI
1043.44,4447,4640197.7,MN


This plot shows that average purchases per user are roughly the same in every state (about 1,031 to 1,046 among the states shown).
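
To put a number on how similar the per-user averages are, the spread between the lowest and highest values can be computed. The list below holds only the ten `purchasesPerUser` values displayed above, so it is a sketch rather than a statement about all states.

```python
# purchasesPerUser values copied from the ten rows shown above
per_user = [1038.21, 1031.92, 1045.94, 1033.46, 1038.85,
            1042.69, 1031.05, 1034.45, 1038.54, 1043.44]

lowest, highest = min(per_user), max(per_user)
spread_pct = 100 * (highest - lowest) / lowest  # relative gap, in percent
print(round(spread_pct, 2))
```

The highest of these averages exceeds the lowest by less than 2%.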

#### Join your table with the data at the path listed below to get a list of customers with confirmed email addresses

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Access the data that holds user email addresses. You can read the data from this path: `/mnt/training/ecommerce/users/users.parquet`

#### First, read the data from the source path and create the table

In [0]:
%sql
DROP TABLE IF EXISTS emailAddresses;
CREATE TABLE emailAddresses
USING parquet
OPTIONS (
PATH "/mnt/training/ecommerce/users/users.parquet")

#### Let's see what kind of data we have in the new table:

In [0]:
%sql
DESCRIBE EXTENDED emailAddresses;

col_name,data_type,comment
user_id,string,
user_first_touch_timestamp,bigint,
email,string,
,,
# Detailed Table Information,,
Database,default,
Table,emailaddresses,
Owner,root,
Created Time,Sat Dec 05 15:04:34 UTC 2020,
Last Access,UNKNOWN,


Let's check the sizes of the two tables.
##### Number of non-null emails in `emailAddresses`:

In [0]:
%sql
SELECT COUNT(email) 
FROM  emailAddresses
WHERE email IS NOT null;

count(email)
782749


##### Count of user IDs in the `emailAddresses` table

In [0]:
%sql
SELECT  
COUNT(user_id )
FROM  emailAddresses;


count(user_id)
5025947


##### Count of user IDs in the `purchaseEvents` table

In [0]:
%sql
SELECT COUNT(userId)
FROM purchaseEvents

count(userId)
180678


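The tables differ in length (`emailAddresses` holds far more users than `purchaseEvents`), so we use a RIGHT JOIN, which keeps every row of `purchaseEvents`.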
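
The effect of that join choice can be sketched on toy data. This is an illustration under stated assumptions: it uses SQLite through Python's `sqlite3`, the rows are hypothetical, and the join is written as a `LEFT JOIN` with `purchaseEvents` on the left, which is equivalent to a `RIGHT JOIN` with it on the right (older SQLite versions lack `RIGHT JOIN`). An inner join with an email filter is shown alongside for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emailAddresses (user_id TEXT, email TEXT)")
conn.execute("CREATE TABLE purchaseEvents (userId TEXT, purchases REAL)")
# Hypothetical rows: UA2 has no email on file, UA3 never purchased
conn.executemany(
    "INSERT INTO emailAddresses VALUES (?, ?)",
    [("UA1", "a@example.com"), ("UA2", None), ("UA3", "c@example.com")],
)
conn.executemany(
    "INSERT INTO purchaseEvents VALUES (?, ?)",
    [("UA1", 850.5), ("UA2", 940.5)],
)

# Keeps every purchase row, even when the email is missing
keep_all = conn.execute(
    """
    SELECT p.userId, e.email
    FROM purchaseEvents p
    LEFT JOIN emailAddresses e ON p.userId = e.user_id
    ORDER BY p.userId
    """
).fetchall()

# Keeps only purchasers that have an email on file
confirmed = conn.execute(
    """
    SELECT p.userId, e.email
    FROM purchaseEvents p
    JOIN emailAddresses e ON p.userId = e.user_id
    WHERE e.email IS NOT NULL
    ORDER BY p.userId
    """
).fetchall()

print(keep_all)
print(confirmed)
```

The two results differ only when some purchasers lack an email, so counting distinct emails after the join is a useful check.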


Now we join the two tables on the user ID to add the email column.

In [0]:
%sql
DROP TABLE IF EXISTS withEmails;
CREATE TABLE withEmails
AS
  SELECT 
  eventDate,
  previousEventDate,
  city,
  state,
  purchases, 
  userId,
  email
  FROM emailAddresses 
  RIGHT JOIN purchaseEvents 
  ON purchaseEvents.userId = emailAddresses.user_id
  
  

##### Final check: as shown below, the email column is populated correctly

In [0]:
%sql
SELECT * FROM withEmails 

eventDate,previousEventDate,city,state,purchases,userId,email
1975-01-18T06:46:46.619+0000,1975-01-18T06:44:37.492+0000,Bloomington,IN,1525.5,UA000000102361456,dramos@davis.info
1975-01-18T08:14:09.079+0000,1975-01-18T08:13:57.676+0000,Torrance,CA,940.5,UA000000102367094,ztran@hotmail.com
1975-01-18T07:59:58.169+0000,1975-01-18T07:59:19.763+0000,Lewisville,TX,1525.5,UA000000102368506,ncraig@byrd.net
1975-01-18T07:20:38.609+0000,1975-01-18T07:18:41.030+0000,Redding,CA,1075.5,UA000000102369818,justinpeterson@wang.net
1975-01-18T06:13:22.356+0000,1975-01-18T06:07:56.466+0000,Beaumont,TX,1075.5,UA000000102370013,lopezjulia@yahoo.com
1975-01-18T05:37:54.151+0000,1975-01-18T05:35:53.642+0000,Florida City,FL,1075.5,UA000000102376220,nnelson@rodriguez.org
1975-01-18T06:15:46.592+0000,1975-01-18T06:15:18.701+0000,Tacoma,WA,940.5,UA000000102380918,moniquemacdonald@bradley-manning.com
1975-01-18T06:18:33.656+0000,1975-01-18T06:18:14.439+0000,Rochester Hills,MI,107.1,UA000000102382462,lisa01@frank.biz
1975-01-18T05:28:52.648+0000,1975-01-18T05:28:47.547+0000,Hermosa Beach,CA,850.5,UA000000102385149,zachary30@holland.com
1975-01-18T09:23:11.788+0000,1975-01-18T09:22:02.811+0000,Waxahachie,TX,850.5,UA000000102387080,angelabarker@gmail.com


##### Are the emails and user IDs unique?

In [0]:
%sql
select count(distinct email), count(distinct userId) from withEmails

count(DISTINCT email),count(DISTINCT userId)
180678,180678


In [0]:
%run ../Includes/Classroom-Cleanup


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>