# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [1]:
# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, count, when, col, desc, udf, sort_array, asc, avg
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType

import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [3]:
spark = SparkSession \
    .builder \
    .appName("Wrangling data using Spark-SQL") \
    .getOrCreate()

In [4]:
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [5]:
user_log.createOrReplaceTempView("log_table")

In [6]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



# Question 1

Which page did user id "" (empty string) NOT visit?

In [7]:
# Select the distinct pages for the blank user (t1) and the distinct pages for all users (t2)
# Right join t1 to t2 and keep the unmatched rows to find pages the blank user did not visit

spark.sql("""
    WITH t1 AS (
        SELECT DISTINCT page
        FROM log_table
        WHERE userId = ''
    ),
    t2 AS (
        SELECT DISTINCT page
        FROM log_table
    )
    
    SELECT *
    FROM t1
    RIGHT JOIN t2
    ON t1.page = t2.page
    WHERE t1.page IS NULL
""").show()

+----+----------------+
|page|            page|
+----+----------------+
|null|Submit Downgrade|
|null|       Downgrade|
|null|          Logout|
|null|   Save Settings|
|null|        Settings|
|null|        NextSong|
|null|         Upgrade|
|null|           Error|
|null|  Submit Upgrade|
+----+----------------+
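
The right join followed by `WHERE t1.page IS NULL` works as an anti-join: it keeps the pages in the full list that have no match among the blank user's pages. The sketch below reproduces that logic on a made-up six-row table, using Python's built-in `sqlite3` module as a stand-in for Spark so that it runs without a Spark session. The rows are illustrative assumptions, and the join is written as a `LEFT JOIN` from `t2`, which is equivalent to the `RIGHT JOIN` from `t1` used above.

```python
import sqlite3

# Toy stand-in for log_table; these rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO log_table VALUES (?, ?)",
    [
        ("", "Home"),
        ("", "Login"),
        ("", "Home"),
        ("10", "Home"),
        ("10", "NextSong"),
        ("11", "Logout"),
    ],
)

# Anti-join: keep pages from the full list (t2) that have no match
# among the blank user's pages (t1).
query = """
    WITH t1 AS (
        SELECT DISTINCT page FROM log_table WHERE userId = ''
    ),
    t2 AS (
        SELECT DISTINCT page FROM log_table
    )
    SELECT t2.page
    FROM t2
    LEFT JOIN t1 ON t1.page = t2.page
    WHERE t1.page IS NULL
    ORDER BY t2.page
"""
not_visited = [row[0] for row in conn.execute(query)]
print(not_visited)
```

Selecting only `t2.page` avoids the all-null first column seen in the output above. In Spark SQL the same filter can also be written with `LEFT ANTI JOIN`, which returns only the columns of the left-hand table.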



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

Both Spark SQL and Spark Data Frames are part of the Spark SQL library. Hence, they both use the Spark SQL Catalyst Optimizer to optimize queries.

You might prefer SQL over data frames because the syntax is clearer, especially for teams already experienced in SQL.

Spark data frames give you more control. You can break down your queries into smaller steps, which can make debugging easier. You can also cache intermediate results or repartition intermediate results.

# Question 3

How many female users do we have in the data set?

In [8]:
# Count distinct userIds so each user is counted once rather than once per log row

spark.sql("""
    SELECT COUNT(DISTINCT userId) AS female_users
    FROM log_table
    WHERE gender = 'F' 
""").show()

+------------+
|female_users|
+------------+
|         462|
+------------+



# Question 4

How many songs were played from the most played artist?

In [9]:
# COUNT(artist) skips rows where artist is null, so only rows that name an artist are counted

spark.sql("""
    SELECT artist, COUNT(artist) as plays
    FROM log_table
    GROUP BY artist
    ORDER BY plays DESC
    LIMIT 1
""").show()

+--------+-----+
|  artist|plays|
+--------+-----+
|Coldplay|   83|
+--------+-----+
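
The choice of `COUNT(artist)` rather than `COUNT(*)` matters here. The sketch below assumes that rows for pages other than song plays carry a null artist (the schema marks `artist` as nullable, but the toy rows are made up) and uses Python's built-in `sqlite3` module as a stand-in for Spark. `top_group` is a hypothetical helper introduced only for this comparison.

```python
import sqlite3

# Toy stand-in for log_table; rows are made up, with a null artist
# on every row that is not a song play (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (artist TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO log_table VALUES (?, ?)",
    [
        ("Coldplay", "NextSong"),
        ("Coldplay", "NextSong"),
        ("Muse", "NextSong"),
        (None, "Home"),
        (None, "Home"),
        (None, "Logout"),
    ],
)


def top_group(count_expr):
    """Return the top (artist, plays) row for the given count expression."""
    query = f"""
        SELECT artist, {count_expr} AS plays
        FROM log_table
        GROUP BY artist
        ORDER BY plays DESC
        LIMIT 1
    """
    return conn.execute(query).fetchone()


by_artist = top_group("COUNT(artist)")  # nulls are not counted
by_rows = top_group("COUNT(*)")         # every row in the group is counted
print(by_artist, by_rows)
```

With `COUNT(*)`, the group of null artists can outrank every real artist, so an explicit `WHERE artist IS NOT NULL` would be needed to get the same answer.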



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [10]:
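# t1: keep only song plays and home visits, flagging home visits with is_home = 1
# t2: running sum of is_home per user, newest event first, so every song
#     played between two home visits shares the same period number
# t3: count the songs in each (userId, period) group
# The final query returns the unrounded average; round it to the closest integer.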

spark.sql("""
    WITH t1 AS (
        SELECT userId, page, ts,
            CASE 
            WHEN page = 'Home' THEN 1
            ELSE 0
            END AS is_home
        FROM log_table
        WHERE page = 'NextSong' 
        OR page = 'Home'         
    ),
    t2 AS (
        SELECT *, 
            SUM(is_home) OVER
            (PARTITION BY userId
             ORDER BY ts DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
        FROM t1
    ),
    t3 AS (
        SELECT COUNT(*) AS count_results
        FROM t2
        GROUP BY userId, period, page
        HAVING page = 'NextSong'
    )
    
    SELECT AVG(count_results)
    FROM t3
""").show()

+------------------+
|avg(count_results)|
+------------------+
| 6.898347107438017|
+------------------+
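
To see how the running sum splits each user's history into periods, the window logic can be traced by hand. The following is a minimal pure-Python sketch of the same steps, not the Spark execution itself; the `events` list is made-up data and the variable names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical events: (userId, page, ts). Made up for illustration.
events = [
    ("10", "Home", 1),
    ("10", "NextSong", 2),
    ("10", "NextSong", 3),
    ("10", "Home", 4),
    ("10", "NextSong", 5),
    ("10", "NextSong", 6),
    ("10", "NextSong", 7),
    ("11", "NextSong", 1),
    ("11", "Home", 2),
]

# t1: keep only song plays and home visits, grouped per user
# (the grouping plays the role of PARTITION BY userId).
by_user = defaultdict(list)
for user_id, page, ts in events:
    if page in ("Home", "NextSong"):
        by_user[user_id].append((ts, page))

# t2 and t3: walk each user's events newest first (ORDER BY ts DESC),
# bump the period at every home visit (running SUM(is_home)),
# and count the songs that fall into each (userId, period) group.
song_counts = defaultdict(int)
for user_id, rows in by_user.items():
    period = 0
    for ts, page in sorted(rows, reverse=True):
        if page == "Home":
            period += 1
        else:
            song_counts[(user_id, period)] += 1

# Final SELECT: average the per-period song counts.
average = sum(song_counts.values()) / len(song_counts)
print(dict(song_counts), average)
```

Period 0 holds the songs played after a user's most recent home visit. Periods that contain no songs, such as two home visits in a row, never appear in `song_counts`, which mirrors how the SQL query only averages over groups where `page = 'NextSong'`.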

