# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark DataFrames.

In [None]:
from pyspark.sql.types import IntegerType  # return type of the is_home UDF in Question 5

from src.config import settings
from src.spark_lakehouse import get_spark_session

# Initialize Spark session
spark = get_spark_session(app_name="ecommerce_data_analysis")

# Load data into Spark DataFrame
df = spark.read.json(settings.SPARK_CLUSTER_DATA_DIR + "sparkify_log_small.json")

# Register the DataFrame as a temporary view so it can be queried with spark.sql
df.createOrReplaceTempView("sparkify_log_table")

# Question 1

Which pages did the user with an empty-string user ID (`""`) NOT visit?

In [None]:
spark.sql(
    """
    SELECT DISTINCT page FROM sparkify_log_table
    """
).show()

In [None]:
spark.sql(
    """
    SELECT DISTINCT page
    FROM sparkify_log_table
    WHERE userId = ''
    """
).show()
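
The two result sets above still have to be compared by eye. One way to get the answer in a single statement is a set difference with `EXCEPT`. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module and a handful of invented rows as a stand-in for `sparkify_log_table`, so it runs without a Spark cluster. Spark SQL also supports `EXCEPT`, so the same query text should work when passed to `spark.sql`, but that has not been checked against the real dataset here.

```python
import sqlite3

# Toy stand-in for sparkify_log_table; these rows are invented for illustration.
rows = [
    ("", "Home"),
    ("", "Login"),
    ("", "About"),
    ("1046", "Home"),
    ("1046", "NextSong"),
    ("1000", "Settings"),
    ("1000", "NextSong"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sparkify_log_table (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO sparkify_log_table VALUES (?, ?)", rows)

# All pages in the log, minus the pages the empty-string user visited.
query = """
    SELECT DISTINCT page FROM sparkify_log_table
    EXCEPT
    SELECT DISTINCT page FROM sparkify_log_table WHERE userId = ''
"""
pages_not_visited = sorted(row[0] for row in conn.execute(query))
print(pages_not_visited)
```

With the toy rows, only the pages that never appear next to the empty user ID remain in the result.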

# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [None]:
spark.sql("""
    SELECT COUNT(DISTINCT userId) AS num_females
    FROM sparkify_log_table
    WHERE gender = 'F'
"""
).show()

# Question 4

How many songs were played from the most played artist?

In [None]:
spark.sql("""
    -- Count plays per artist; the top row is the most played artist
    SELECT artist, COUNT(page) AS play_count
    FROM sparkify_log_table
    WHERE page = 'NextSong'
    GROUP BY artist
    ORDER BY play_count DESC
    LIMIT 1
"""
).show()

# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [None]:
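# Register a SQL-callable UDF that returns 1 for Home page visits and 0 otherwise
spark.udf.register(
    "is_home", lambda x: int(x == "Home"), IntegerType()
)

spark.sql("""
    -- Running count of Home visits per user: each period value marks the
    -- stretch of events from one Home visit up to the next
    WITH user_periods AS (
        SELECT userId, page, ts,
               SUM(is_home(page)) OVER (PARTITION BY userId ORDER BY ts) AS period
        FROM sparkify_log_table
        WHERE userId != '' AND page IN ('Home', 'NextSong')
    ),
    -- Number of songs played within each (user, period) pair
    period_song_counts AS (
        SELECT userId, period, COUNT(page) AS song_count
        FROM user_periods
        WHERE page = 'NextSong'
        GROUP BY userId, period
    )
    SELECT AVG(song_count) AS avg_songs_per_period,
           ROUND(AVG(song_count)) AS rounded_avg
    FROM period_song_counts
    """
).show()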
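
The windowed `SUM(is_home(page))` is the step that is easiest to misread, so here is a minimal plain-Python trace of the same idea for a single user. The events are invented for illustration and are assumed to be sorted by `ts` already. The running count of Home visits plays the role of the `period` column, and counting `NextSong` rows per period mirrors the `GROUP BY userId, period` step.

```python
from collections import Counter
from itertools import accumulate

# Invented (ts, page) events for one user, already sorted by ts.
events = [
    (1, "NextSong"),
    (2, "Home"),
    (3, "NextSong"),
    (4, "NextSong"),
    (5, "Home"),
    (6, "NextSong"),
    (7, "NextSong"),
    (8, "NextSong"),
]

# Running count of Home visits: the value the windowed SUM assigns as "period".
periods = list(accumulate(int(page == "Home") for _, page in events))

# Count NextSong rows per period, like GROUP BY userId, period.
songs_per_period = Counter(
    period for period, (_, page) in zip(periods, events) if page == "NextSong"
)

# Average over the periods that contain at least one song.
average = sum(songs_per_period.values()) / len(songs_per_period)

print(periods)
print(dict(songs_per_period))
print(average)
```

Periods with no songs at all (two Home visits in a row) never appear in the grouped counts, so they do not pull the average down. The SQL query behaves the same way because it filters to `NextSong` rows before grouping.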