# Example Queries of the Sparkify Database

Import: make sure to restart kernal at the end of this exercise.

In [1]:
%load_ext sql

Connect to the SparkifyDB

In [2]:
%sql postgresql://student:student@127.0.0.1/sparkifydb

'Connected: student@sparkifydb'

## 1. Fact Table `songplays` - Preview

At Sparkify App Event log level. Incoming raw data is 1 JSON file per day (that contains multiple events for that day). This DB store one event per row.

Commentary: Notice the `None` values for `artist_id` and `session_id`. Below are some explanations:

* Possibility 1: If Sparkify plays only songs that are in the Music database, then these two columns should never be `None` (or `null`). And the only reason that this is the case is due to the way created our sample datasets (via the Eventsim log simulator).
* Possibility 2: If Sparkify plays songs that can be outside the scope of the Music database, then these two columns may legitimately contain `None` (or `null`).
* Possiblity 3: It could be a combination of possibilities 1 and 2 above.

In [3]:
%sql SELECT * FROM songplays LIMIT 5;

 * postgresql://student:***@127.0.0.1/sparkifydb
5 rows affected.


songplay_id,start_time,user_id,level,song_id,artist_id,session_id,location,user_agent
1,2018-11-30 00:22:07,91,free,,,829,"Dallas-Fort Worth-Arlington, TX",Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)
2,2018-11-30 01:08:41,73,paid,,,1049,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"""
3,2018-11-30 01:12:48,73,paid,,,1049,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"""
4,2018-11-30 01:17:05,73,paid,,,1049,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"""
5,2018-11-30 01:20:56,73,paid,,,1049,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"""


## 2. Dimension Table `users` - Preview

Unique by Sparkify App user (`user_id`)

Commentary: the `gender` and price plan `level` appears to be good categorical columns for aggregation analysis.

In [4]:
%sql SELECT * FROM users LIMIT 5;

 * postgresql://student:***@127.0.0.1/sparkifydb
5 rows affected.


user_id,first_name,last_name,gender,level
57,Katherine,Gay,F,free
86,Aiden,Hess,M,free
74,Braden,Parker,M,free
30,Avery,Watkins,F,paid
78,Chloe,Roth,F,free


## 3. Dimension Table `songs` - Preview

Unique by Music Databse `song_id`

Commentary: notice we have `0` value in `year`. We need to check the raw file to check what the underlying value is. It is likely we will need to fill these `0` values as `null` instead. Alternatively, we'd need to document that `0` year represents `null`. 

In [5]:
%sql SELECT * FROM songs LIMIT 5;

 * postgresql://student:***@127.0.0.1/sparkifydb
5 rows affected.


song_id,title,artist_id,year,duration
SOMZWCG12A8C13C480,I Didn't Mean To,ARD7TVE1187B99BFB1,0,218.93179
SOUDSGM12AC9618304,Insatiable (Instrumental Version),ARNTLGG11E2835DDB9,0,266.39628
SOIAZJW12AB01853F1,Pink World,AR8ZCNI1187B9A069B,1984,269.81832
SOHKNRJ12A6701D1F8,Drop of Rain,AR10USD1187B99F3F1,0,189.57016
SOCIWDW12A8C13D406,Soul Deep,ARMJAGH1187FB546F3,1969,148.03546


## 4. Dimension Table `artists` - Preview

Unique by Music Databse `artist_id`

Commentary: the definition of artist location / latitude / longtitude isn't very clear. Is it the home address of the artist? the latest studio that the artist releases albums? Not much luck looking at [the Million Song Dataset website](http://millionsongdataset.com/pages/getting-dataset/#subset)

In [6]:
%sql SELECT * FROM artists LIMIT 5;

 * postgresql://student:***@127.0.0.1/sparkifydb
5 rows affected.


artist_id,name,location,latitude,longitude
ARD7TVE1187B99BFB1,Casual,California - LA,,
ARNTLGG11E2835DDB9,Clp,,,
AR8ZCNI1187B9A069B,Planet P Project,,,
AR10USD1187B99F3F1,Tweeterfriendly Music,"Burlington, Ontario, Canada",,
ARMJAGH1187FB546F3,The Box Tops,"Memphis, TN",35.14968,-90.04892


## 5. Dimension Table `time` - Preview

Unique by Sparkify App event log `start_time`. It contains attributes for that datatime field. Useful for condicting date / time profile analysis.

In [7]:
%sql SELECT * FROM time LIMIT 5;

 * postgresql://student:***@127.0.0.1/sparkifydb
5 rows affected.


start_time,hour,day,week,month,year,weekday
2018-11-30 00:22:07,0,30,48,11,2018,4
2018-11-30 01:08:41,1,30,48,11,2018,4
2018-11-30 01:12:48,1,30,48,11,2018,4
2018-11-30 01:17:05,1,30,48,11,2018,4
2018-11-30 01:20:56,1,30,48,11,2018,4


# Example Analysis

## Event Count by Date

Observation: sample data covers Nov 2018. We can visually see dips on certain days

In [8]:
%sql SELECT DATE(start_time) AS date, COUNT(*) AS event_count FROM songplays GROUP BY DATE(start_time) ORDER BY DATE(start_time) DESC;

 * postgresql://student:***@127.0.0.1/sparkifydb
30 rows affected.


date,event_count
2018-11-30,330
2018-11-29,319
2018-11-28,363
2018-11-27,256
2018-11-26,216
2018-11-25,41
2018-11-24,314
2018-11-23,239
2018-11-22,82
2018-11-21,437


## Event Count by Hour

What hour of the day tends to be more or less busy?

In [9]:
%sql SELECT time.hour, COUNT(*) FROM songplays JOIN time ON songplays.start_time = time.start_time GROUP BY time.hour ORDER BY time.hour;

 * postgresql://student:***@127.0.0.1/sparkifydb
24 rows affected.


hour,count
0,155
1,154
2,117
3,109
4,136
5,162
6,183
7,179
8,207
9,270


## Which user_agent (device platform) has the most activities during Nov 2018?

In [10]:
%sql SELECT user_agent, COUNT(*) AS events, COUNT(DISTINCT user_id) AS distinct_users,  COUNT(DISTINCT session_id) AS distinct_sessions FROM songplays GROUP BY user_agent ORDER BY COUNT(*) DESC;

 * postgresql://student:***@127.0.0.1/sparkifydb
40 rows affected.


user_agent,events,distinct_users,distinct_sessions
"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36""",971,6,75
"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2""",708,7,54
Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0,696,2,45
"""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36""",579,2,67
"""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36""",573,2,24
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0,443,5,31
"""Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36""",428,5,62
"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36""",419,6,62
"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.77.4 (KHTML, like Gecko) Version/7.0.5 Safari/537.77.4""",320,7,65
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0,310,9,52


## Join all tables

This query creates a denormalised view at log event level

The initial composite primary key that defines one event is defined as the followings (which could be improved / revised!)

`start_time`, `user_id`, `session_id`, `songplay_id`

Note: due to the 3 possibilities as discussed previously (under the "Fact Table songplays - Preview" section), there is currently only one record that cut across all 5 SparkifyDB table.

In [11]:
%sql \
SELECT \
    songplays.start_time AS event_start_time \
    ,songplays.songplay_id AS songplay_id \
    ,time.year AS event_year \
    ,time.month AS event_month \
    ,time.day AS event_day \
    ,time.hour AS event_hour \
    ,time.week AS event_week \
    ,time.week AS event_weekday \
    ,users.user_id AS user_id \
    ,users.first_name AS user_first_name \
    ,users.last_name AS user_last_name \
    ,users.gender AS user_gender \
    ,users.level AS user_level \
    ,songs.song_id AS song_id \
    ,songs.title AS song_title \
    ,songs.year AS song_release_year \
    ,songs.duration AS song_duration \
    ,artists.artist_id AS artist_id \
    ,artists.location AS artist_location \
    ,artists.latitude AS artist_latitude \
    ,artists.longitude AS artist_longitude \
FROM songplays \
INNER JOIN users \
    ON users.user_id = songplays.user_id \
INNER JOIN songs \
    ON songs.song_id = songplays.song_id \
INNER JOIN artists \
    ON songplays.artist_id = artists.artist_id \
INNER JOIN time \
    ON songplays.start_time = time.start_time \
;

 * postgresql://student:***@127.0.0.1/sparkifydb
1 rows affected.


event_start_time,songplay_id,event_year,event_month,event_day,event_hour,event_week,event_weekday,user_id,user_first_name,user_last_name,user_gender,user_level,song_id,song_title,song_release_year,song_duration,artist_id,artist_location,artist_latitude,artist_longitude
2018-11-21 21:56:47,4108,2018,11,21,21,47,47,15,Lily,Koch,F,paid,SOZCTXZ12AB0182364,Setanta matins,0,269.58322,AR5KOSW1187FB35FF4,Dubai UAE,49.80388,15.47491


# Conclusion

We are able to create a denormalise table from the SparkifyDB. Only one record cut across all normalised table. Initial hypothesis could be one of the 3 possibilities:

* Possibility 1: If Sparkify plays only songs that are in the Music database, then these two columns should never be `None` (or `null`). And the only reason that this is the case is due to the way created our sample datasets (via the Eventsim log simulator).
* Possibility 2: If Sparkify plays songs that can be outside the scope of the Music database, then these two columns may legitimately contain `None` (or `null`).
* Possiblity 3: It could be a combination of possibilities 1 and 2 above.