# Dashboard for analytic queries against the Sparkify Data Lake

Here are examples of possible analytic queries. You can write your own queries using Spark SQL syntax. The table structure is described in the README.

In [2]:
import configparser
import os
from pyspark.sql import SparkSession

First, set up the S3 credentials; then we can answer some analytical questions.
> The `S3` section of the configuration file `dl.cfg` must contain credentials for a user with full access rights to S3.
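
As a minimal sketch of what the `S3` section is expected to look like, the snippet below parses an in-memory stand-in for `dl.cfg` and checks that both options read by this notebook are present. The key values are hypothetical placeholders, and the early `KeyError` check is an illustrative addition, not part of the original notebook.

```python
import configparser

# Hypothetical placeholder values; replace them with real credentials in dl.cfg.
sample_cfg = """
[S3]
AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = YOUR_SECRET_ACCESS_KEY
"""

config = configparser.ConfigParser()
config.read_string(sample_cfg)  # the notebook itself uses config.read('dl.cfg')

# Fail early with a clear message if an expected option is missing.
required = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
missing = [key for key in required if not config.has_option('S3', key)]
if missing:
    raise KeyError(f"dl.cfg is missing S3 options: {missing}")
```

Checking the options up front gives a clearer error than a failed S3 read later in the notebook.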

In [3]:
config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID'] = config.get('S3', 'AWS_ACCESS_KEY_ID')
os.environ['AWS_SECRET_ACCESS_KEY'] = config.get('S3', 'AWS_SECRET_ACCESS_KEY')

In [2]:
spark = SparkSession \
    .builder \
    .appName('Sparkify Analytics') \
    .getOrCreate()

In [12]:
# Register each Parquet table as a temporary view so it can be queried with Spark SQL
spark.read.parquet('tables/songs/songs.parquet').createOrReplaceTempView('songs')
spark.read.parquet('tables/artists/artists.parquet').createOrReplaceTempView('artists')
spark.read.parquet('tables/users/users.parquet').createOrReplaceTempView('users')
spark.read.parquet('tables/time/time.parquet').createOrReplaceTempView('time')
spark.read.parquet('tables/songplays/songplays.parquet').createOrReplaceTempView('songplays')

## Find top 10 most popular songs

The company wants to publish top song charts. Find the 10 songs that users listened to most often. Print `song` (the name of the song), `artist` and `play_count` (how many times users listened to the song).

_(*): The output may contain only a single row because of the lack of real data. This is just an example of a possible query._

In [13]:
spark.sql("""
    SELECT s.title as song
        , a.name as artist
        , COUNT(*) as play_count
    FROM songplays sp
    INNER JOIN songs s ON s.song_id = sp.song_id
    LEFT JOIN artists a ON a.artist_id = sp.artist_id
    GROUP BY s.title, a.name
    ORDER BY play_count DESC
    LIMIT 10
""").show()

+--------------+------+----------+
|          song|artist|play_count|
+--------------+------+----------+
|Setanta matins| Elena|         1|
+--------------+------+----------+



## Weekly statistics

Build a report for each year, month and week showing how many songs were played and how many unique users used the Sparkify service. The report should contain the following fields: `year`, `month`, `week`, `song_count` (how many songs were played) and `user_count` (unique users who used the service at least once in that period).

_(*): The output may contain only a single row because of the lack of real data. This is just an example of a possible query._

In [14]:
spark.sql("""
    SELECT t.year
        , t.month
        , t.week
        , COUNT(*) as song_count
        , COUNT(DISTINCT sp.user_id) as user_count
    FROM songplays sp
    INNER JOIN time t ON t.start_time = sp.start_time
    GROUP BY t.year, t.month, t.week
    ORDER BY t.year ASC, t.month ASC, t.week ASC
""").show()

+----+-----+----+----------+----------+
|year|month|week|song_count|user_count|
+----+-----+----+----------+----------+
|2018|   11|  47|         1|         1|
+----+-----+----+----------+----------+
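
To make the difference between `COUNT(*)` and `COUNT(DISTINCT sp.user_id)` concrete, here is a plain-Python sketch of the same grouping on a few hypothetical songplay rows. The sample data is invented for illustration, and using the ISO week number is an assumption about how the `week` column was derived.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical songplay rows: (start_time, user_id).
plays = [
    (datetime(2018, 11, 19, 9, 30), 'u1'),
    (datetime(2018, 11, 20, 14, 0), 'u1'),
    (datetime(2018, 11, 21, 8, 15), 'u2'),
    (datetime(2018, 11, 27, 20, 45), 'u1'),
]

song_count = defaultdict(int)  # plays per (year, month, week), like COUNT(*)
users = defaultdict(set)       # distinct users per group, like COUNT(DISTINCT user_id)

for start_time, user_id in plays:
    week = start_time.isocalendar()[1]  # ISO week number (assumed definition of `week`)
    key = (start_time.year, start_time.month, week)
    song_count[key] += 1
    users[key].add(user_id)

# One row per group, ordered by year, month and week as in the query.
report = [(*key, song_count[key], len(users[key])) for key in sorted(song_count)]
for row in report:
    print(row)
```

In week 47 the user `u1` plays two songs, so `song_count` is 3 while `user_count` is only 2.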

