# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL Quiz") \
    .getOrCreate()

path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)
user_log.createOrReplaceTempView("user_log_table")

# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [2]:
spark.version

'2.4.3'

In [3]:
spark.sql('''
          SELECT DISTINCT page
          FROM user_log_table
          WHERE page NOT IN (
              -- pages visited by the empty-string user; the id is '' (not the literal text '""')
              SELECT DISTINCT page
              FROM user_log_table
              WHERE userId == '')
          '''
          ).show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [4]:
spark.sql('''
          SELECT COUNT(DISTINCT userId) AS num_female
          FROM user_log_table 
          WHERE gender == 'F'
          '''
          ).show()

+----------+
|num_female|
+----------+
|       462|
+----------+



# Question 4

How many songs were played by the most played artist?

In [5]:
spark.sql('''
          SELECT artist, COUNT(artist) AS plays
          FROM user_log_table
          GROUP BY artist
          -- ORDER BY sorts globally; SORT BY only sorts within each partition
          ORDER BY plays DESC
          LIMIT 1
          '''
          ).show()

+--------+-----+
|  artist|plays|
+--------+-----+
|Coldplay|   83|
+--------+-----+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [6]:
# TODO: write your code to answer question 5
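
One possible approach to question 5, offered as a sketch rather than the definitive solution: keep only the `Home` and `NextSong` events, use a running sum of `Home` visits per user (ordered by `ts`) to label each stretch between home-page visits, count the songs in each stretch, and average those counts. The query below is written for `spark.sql` against `user_log_table`. To keep the example runnable without a Spark session, it is executed here against SQLite from Python's standard library as a stand-in (an assumption made only for illustration), on a small hypothetical log that is not taken from the real dataset.

```python
import sqlite3

# Query text intended for spark.sql(...); table and column names follow the notebook.
QUERY = '''
SELECT AVG(songs_in_period) AS avg_songs
FROM (
    SELECT userId, period, COUNT(*) AS songs_in_period
    FROM (
        SELECT userId, page,
               -- running count of Home visits, walking each user's log backwards in time
               SUM(CASE WHEN page = 'Home' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY userId ORDER BY ts DESC
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
        FROM user_log_table
        WHERE page IN ('Home', 'NextSong')
    ) AS flagged
    WHERE page = 'NextSong'
    GROUP BY userId, period
) AS per_period
'''

# Hypothetical toy log (userId, ts, page), invented for this sketch.
toy_rows = [
    ("1", 1, "Home"), ("1", 2, "NextSong"), ("1", 3, "NextSong"),
    ("1", 4, "Home"), ("1", 5, "NextSong"), ("1", 6, "Home"),
    ("2", 1, "Home"), ("2", 2, "NextSong"), ("2", 3, "Settings"),
    ("2", 4, "NextSong"), ("2", 5, "NextSong"), ("2", 6, "Home"),
]

# SQLite is only a local stand-in for Spark here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO user_log_table VALUES (?, ?, ?)", toy_rows)
avg_songs = conn.execute(QUERY).fetchone()[0]
conn.close()

# Stretches in the toy log contain 2, 1, and 3 songs, so the average is 2.
print(round(avg_songs))
```

In the notebook itself, the same query string could be run with `spark.sql(QUERY).show()`; the toy result above says nothing about the answer on the real log.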