Sparkify means FRESHNESS, ENERGY and JOY - this is what we are!
As Sparkify we have one goal: to strengthen every moment of your life!
How do we do this? By accompanying you with the best-fitting music. Whether you are happy or sad, moody or angry, or you just need to focus on your task - be yourself, and we will take care of the noise!
We decided to start the Out Of Darkness project to gain better insight into our data.
So far we have been happy to see billions of dollars flowing from subscribers to our account every month, but recently the flow seems to be decreasing. We have to check what is going on, but so far we have not been able to analyse the data because of the way it is stored.
The aim of the project is to create a database which:
- is easily accessible,
- has an easy-to-understand structure,
- can be analysed with standard SQL queries,
so the analytics team can understand what songs users are listening to.
Our data is stored as two sets of JSON files:
Log_data files store all the information we have about users and their sessions, including the user's name and location, level of access, song and artist name, the timestamp when the song was played, etc. The fields available in every log_data file are:
- artist
- auth
- firstName
- gender
- itemInSession
- lastName
- length
- level
- location
- method
- page
- registration
- sessionId
- song
- status
- ts
- userAgent
- userId
The log_data files are partitioned by year and month, with a separate folder for each partition. For example, below we have pasted filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
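To get a feel for the data, a single log_data file can be previewed in a few lines of Python. This is a minimal sketch, assuming the files are newline-delimited JSON (one event object per line); the column selection is just an illustration.

```python
# Preview one log_data file; lines=True treats each line as a JSON object.
import pandas as pd

log_df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
print(log_df[["userId", "song", "artist", "ts"]].head())
```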
Song_data files provide information about every song available in our service, along with some info about the artist. The following fields are available for each song:
- artist_id
- artist_latitude
- artist_location
- artist_longitude
- artist_name
- duration
- num_songs
- song_id
- title
- year
Each JSON file in the song_data dataset stores info about one song. The song_data files are partitioned by the first three letters of each song's track ID. For example, below we have pasted filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
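Because of this partitioning the files are spread over a deep folder tree, so it is handy to collect them recursively. A minimal sketch, assuming each file holds a single JSON object on one line:

```python
# Collect every song_data file, whatever the partition depth.
import glob
import pandas as pd

song_files = glob.glob("song_data/**/*.json", recursive=True)

# Each file describes exactly one song, so this yields a one-row DataFrame.
song_df = pd.read_json(song_files[0], lines=True)
print(song_df[["song_id", "title", "artist_id", "duration"]])
```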
- Lucidchart to build ERD diagrams. A functional and reliable tool to model our database first, before we start coding anything. Lucidchart diagrams are not included in the project files.
- PostgreSQL as an RDBMS. We decided to trust this well-known and battle-tested relational database management system. It is open-source and free, which is a huge advantage over the competitors!
- Python as a tool to build ETL processes. It's a very flexible programming language with hundreds of useful libraries, including a few that handle JSON files and make working with PostgreSQL easy. Last but not least, we already have a few ~~cheap~~ skilled Python developers in our company!
- Power BI as a Business Intelligence tool. We have seen a few dashboards built with this tool on the Internet; they look just amazing and we want the same! But first things first - let's start with the database...
The database consists of 5 tables organized into a star schema.
If you want to read more about star schema, please follow this link.
| Table name | Fields | Data source |
| --- | --- | --- |
| songplays | songplay_id, start_time, user_id, level, session_id, location, user_agent, song_id, artist_id | log_data, song_data |
| users | user_id, first_name, last_name, gender, level | log_data |
| songs | song_id, title, artist_id, year, duration | song_data |
| artists | artist_id, name, location, latitude, longitude | song_data |
| time | start_time, hour, day, week, month, year, weekday | log_data |
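These field lists translate directly into CREATE TABLE statements kept as strings in sql_queries.py. Below is a minimal sketch for the users dimension; the column types are assumptions based on the data, not a copy of the project file.

```python
# sql_queries.py (sketch) - DDL for the users dimension table.
# Types are assumed: the id is an int, everything else fits in varchar.
user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    int PRIMARY KEY,
    first_name varchar,
    last_name  varchar,
    gender     varchar,
    level      varchar
);
"""
```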
- README.md - you probably found this file already and you know what is inside...
- Launcher.ipynb - use it to run other scripts to create the database and run ETL process.
- create_tables.py - contains the code that creates the database with the appropriate tables
- etl.py - the ETL process that reads data from the JSON files, transforms it and loads it into the database tables
- sql_queries.py - SQL queries used by create_tables.py and etl.py.
- test.ipynb - a Jupyter Notebook you can use to test whether your database was created and loaded with data successfully
- etl.ipynb - a Jupyter Notebook used during the development of etl.py. It loads the database tables with just a few rows. It's no longer used, but can be useful for further ETL development.
To create the database and load it with data, please use the Launcher.ipynb file.
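If you prefer to run the steps by hand, a launcher cell only needs to execute the two scripts in order. A minimal sketch of what such a cell might contain (the exact contents of Launcher.ipynb may differ), using IPython's %run magic:

```python
# Run from a Jupyter cell: create the database and tables, then load them.
%run create_tables.py
%run etl.py
```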
Please find below some SQL queries which you may find useful when starting to analyse the data.
-- All songplays with full user, song, artist and time details:
SELECT *
FROM songplays sp
INNER JOIN users u ON sp.user_id = u.user_id
INNER JOIN songs s ON sp.song_id = s.song_id
INNER JOIN artists a ON sp.artist_id = a.artist_id
INNER JOIN time t ON sp.start_time = t.start_time;
-- Total duration of songs played by each user in a selected month:
SELECT u.user_id, sum(s.duration) as total_songs_duration
FROM songplays sp
INNER JOIN users u ON sp.user_id = u.user_id
INNER JOIN songs s ON sp.song_id = s.song_id
INNER JOIN time t ON sp.start_time = t.start_time
WHERE t.year = (selected year)
AND t.month = (selected month)
GROUP BY u.user_id;
-- Number of songs by a given artist in our catalogue:
SELECT a.name, count(s.song_id) as total_songs
FROM songs s
INNER JOIN artists a ON s.artist_id = a.artist_id
WHERE a.name = (artist's name)
GROUP BY a.name;
-- Songs that have never been played:
SELECT s.title, a.name
FROM songs s
INNER JOIN artists a ON s.artist_id = a.artist_id
LEFT JOIN songplays sp ON s.song_id = sp.song_id
WHERE sp.song_id IS NULL;
We could rename the tables by adding prefixes to make it clear which of them are facts and which are dimensions:
songplays -> fact_songplays
users -> dim_users
songs -> dim_songs
artists -> dim_artists
time -> dim_time
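A sketch of how such a rename could be scripted in the style of sql_queries.py; the rename_table_queries list is hypothetical and not part of the current project files:

```python
# sql_queries.py (hypothetical addition) - add fact_/dim_ prefixes to tables.
rename_table_queries = [
    "ALTER TABLE songplays RENAME TO fact_songplays;",
    "ALTER TABLE users RENAME TO dim_users;",
    "ALTER TABLE songs RENAME TO dim_songs;",
    "ALTER TABLE artists RENAME TO dim_artists;",
    "ALTER TABLE time RENAME TO dim_time;",
]
```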
When we work with a complete dataset (not just a subset), we could add REFERENCES clauses to the foreign key definitions in the create_table part of sql_queries.py. E.g. for the songplays CREATE TABLE script, we could add the REFERENCES clauses shown below:
CREATE TABLE IF NOT EXISTS songplays (
songplay_id SERIAL PRIMARY KEY,
start_time timestamp REFERENCES time(start_time),
user_id int REFERENCES users(user_id),
level varchar,
song_id varchar REFERENCES songs(song_id),
artist_id varchar REFERENCES artists(artist_id),
session_id int,
location varchar,
user_agent varchar
);
This would provide an additional quality check for these keys. We would just have to make sure that the dimension tables are loaded first in the ETL process, as all keys must already exist when we load the songplays fact table.
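A minimal sketch of what that ordering could look like in etl.py's main(); the process_* helpers and the connection string are hypothetical placeholders, not verbatim project code:

```python
import psycopg2

def main():
    # Placeholder connection string - adjust to your environment.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # process_data / process_song_file / process_log_file are hypothetical
    # helpers standing in for the real ETL functions.
    # song_data first: fills the songs and artists dimensions.
    process_data(cur, conn, filepath="data/song_data", func=process_song_file)
    # log_data second: fills time and users, then the songplays fact table,
    # so every referenced key already exists when songplays rows are inserted.
    process_data(cur, conn, filepath="data/log_data", func=process_log_file)

    conn.close()
```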