# Answer Key to the Spark SQL Quiz

Helpful resources:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html

In [1]:
import findspark
import os

findspark.init(os.environ['SPARK_HOME'])

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Spark SQL Quiz').getOrCreate()

In [4]:
user_log = spark.read.json('data/sparkify_log_small.json')

In [5]:
user_log.createOrReplaceTempView('user_log_sql')

In [6]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [14]:
spark.sql(
    """SELECT DISTINCT page
    FROM user_log_sql
    WHERE page NOT IN (SELECT DISTINCT page FROM user_log_sql WHERE userId = '')
    """
).show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
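
The query above is a set difference: all pages minus the pages seen by the empty-string user. As a rough sketch of the same idea, the `EXCEPT` operator (which Spark SQL also supports) states it directly. The example uses Python's built-in `sqlite3` as a stand-in engine so it runs without Spark; the `log` table and its five rows are made up for illustration.

```python
import sqlite3

# Hypothetical toy log, not the quiz data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [("", "Home"), ("", "Login"), ("7", "Home"), ("7", "NextSong"), ("8", "Logout")],
)

# All pages, minus the pages visited by the empty-string user.
rows = conn.execute(
    """SELECT DISTINCT page FROM log
    EXCEPT
    SELECT DISTINCT page FROM log WHERE userId = ''
    """
).fetchall()

not_visited = sorted(page for (page,) in rows)
print(not_visited)
```

Unlike `NOT IN`, `EXCEPT` does not return an empty result when the subquery happens to contain a NULL, which can matter on messier data.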



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

Both Spark SQL and Spark data frames are part of the Spark SQL library, so both use the Catalyst optimizer to optimize queries.

You might prefer SQL over data frames because the syntax is clearer, especially for teams already experienced in SQL.

Spark data frames give you more control. You can break your queries down into smaller steps, which can make debugging easier. You can also [cache](https://unraveldata.com/to-cache-or-not-to-cache/) or [repartition](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) intermediate results.

# Question 3

How many female users do we have in the data set?

In [16]:
spark.sql(
    """SELECT DISTINCT userId
    FROM user_log_sql
    WHERE gender = 'F'
    """
).count()

462

# Question 4

How many distinct songs were played from the most played artist?

In [21]:
spark.sql(
    """SELECT DISTINCT song
    FROM user_log_sql
    WHERE artist = (
        SELECT artist
        FROM user_log_sql
        WHERE artist IS NOT NULL
        GROUP BY artist
        ORDER BY COUNT(*) DESC
        LIMIT 1
    )
    """
).count()

24
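
The subquery has to drop rows with a missing artist before ranking, otherwise the NULL group (presumably all the non-song events) could come out on top. The sketch below uses Python's built-in `sqlite3` as a stand-in engine to compare three filters on a made-up `plays` table. Comparing against the string `'null'` removes NULL rows only as a side effect of SQL's three-valued logic, while `IS NOT NULL` states the intent directly. The table, its rows, and the `top_artist` helper are hypothetical.

```python
import sqlite3

# Hypothetical toy table: three plays have no artist at all (SQL NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (artist TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?)",
    [("A",), ("A",), (None,), (None,), (None,), ("B",)],
)

def top_artist(where_clause):
    """Return the most frequent artist under the given WHERE clause."""
    query = f"""SELECT artist FROM plays
    {where_clause}
    GROUP BY artist
    ORDER BY COUNT(*) DESC
    LIMIT 1
    """
    return conn.execute(query).fetchone()[0]

unfiltered = top_artist("")                            # NULL group wins
string_compare = top_artist("WHERE artist <> 'null'")  # NULL <> 'null' is NULL, so the row is dropped
explicit = top_artist("WHERE artist IS NOT NULL")      # same rows kept, intent is explicit
print(unfiltered, string_compare, explicit)
```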

# Question 5 (challenge)

On average, how many songs do users listen to between visits to our home page? Round your answer to the nearest integer.



In [36]:
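# x: keep only song plays and home visits, flagging each home visit with is_home = 1
# y: a running sum of is_home per user labels every row with the number of home visits so far
# z: count the songs in each (userId, session) period; the outer query averages those counts
spark.sql(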
    """SELECT ROUND(AVG(z.num_songs)) AS avg_interval_songs
    FROM 
    (
        SELECT y.userId, y.session, COUNT(*) AS num_songs
        FROM
        (
            SELECT *, SUM(x.is_home) OVER (PARTITION BY x.userId 
                                           ORDER BY x.ts 
                                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
            FROM
            (
                SELECT userId, ts, page, song, IF(page='Home', 1, 0) AS is_home
                FROM user_log_sql
                WHERE userId <> '' AND page IN ('NextSong', 'Home')
                ORDER BY userId, ts
            ) AS x
        ) AS y
        WHERE y.page <> 'Home'
        GROUP BY y.userId, y.session
    ) AS z
    """
).show()

+------------------+
|avg_interval_songs|
+------------------+
|               7.0|
+------------------+
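
The nested query can be hard to follow at first. As a minimal sketch, the same logic can be replayed in plain Python on a few made-up events: sort by user and timestamp, keep a running count of home visits per user, and use that count as the period label for each song. The `events` list and the `songs_per_period` helper are illustrative only and are not part of the quiz data.

```python
from collections import defaultdict

# Hypothetical toy events: (userId, ts, page).
events = [
    ("u1", 1, "Home"),
    ("u1", 2, "NextSong"),
    ("u1", 3, "NextSong"),
    ("u1", 4, "Home"),
    ("u1", 5, "NextSong"),
    ("u2", 1, "NextSong"),
    ("u2", 2, "Home"),
    ("u2", 3, "NextSong"),
    ("u2", 4, "NextSong"),
    ("u2", 5, "NextSong"),
]

def songs_per_period(events):
    """Count songs per (userId, number of home visits seen so far)."""
    home_visits = defaultdict(int)  # running SUM(is_home) for each user
    songs = defaultdict(int)        # (userId, period) -> number of songs
    # Mirrors PARTITION BY userId ORDER BY ts in the window function.
    for user_id, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        if page == "Home":
            home_visits[user_id] += 1
        elif page == "NextSong":
            songs[(user_id, home_visits[user_id])] += 1
    return dict(songs)

periods = songs_per_period(events)
average = sum(periods.values()) / len(periods)
print(periods, average)
```

Songs played before a user's first home visit land in period 0, just as they do in the SQL version, and periods with no songs at all never show up in the average.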



Which 5 users listen to the most songs on average between visits to the home page?

In [42]:
spark.sql(
    """SELECT z.userId, AVG(z.num_songs) AS avg_interval_songs
    FROM
    (
        SELECT y.userId, y.session, COUNT(*) AS num_songs
        FROM
        (
            SELECT *, SUM(x.is_home) OVER (PARTITION BY x.userId
                                           ORDER BY x.ts
                                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
            FROM
            (
                SELECT userId, ts, page, song, IF(page='Home', 1, 0) AS is_home
                FROM user_log_sql
                WHERE userId <> '' AND page IN ('NextSong', 'Home')
                ORDER BY userId, ts
            ) AS x
        ) AS y
        WHERE y.page <> 'Home'
        GROUP BY y.userId, y.session
    ) AS z
    GROUP BY z.userId
    ORDER BY avg_interval_songs DESC
    LIMIT 5
    """
).show()

+------+------------------+
|userId|avg_interval_songs|
+------+------------------+
|  1579|              60.0|
|   462|              58.5|
|  2867|              56.0|
|  2002|              49.0|
|   445|              49.0|
+------+------------------+

