## Our objective in a nutshell
We have some data scattered around in JSON files. We want to reorganise and store it in a relational database.

## What is ETL?
ETL stands for Extract, Transform, and Load. In essence:
* We have a data source (in our case, the JSON files) from which we want to *extract* the data **(*Extract*)**
* We then preprocess the data, for example by doing calculations or changing the format of some fields **(*Transform*)**
* Finally, we load the data into some destination (in our case, the PostgreSQL database) **(*Load*)**

<center> <img src = "../images/ETL-JSON-PG.jpg" width = 500></center>


## How will we do it?
Using Python, we can:
* Read the JSON files (Extract)
* Do our preprocessing (Transform)
* Write to the database (Load), either through a database driver directly, such as `psycopg2`, or through an object-relational mapper (ORM), such as SQLAlchemy, sitting on top of a driver
<center> <img src = "../images/ETL-with-python.jpg" width = 50%></center>
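
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline built in this notebook: the function names `extract`, `transform` and `load`, the `songs` table and the two sample records are made up for the example, and the standard-library `sqlite3` module stands in for PostgreSQL so that the sketch runs without a database server.

```python
import json
import sqlite3

# Two made-up records in JSON-lines form (one JSON object per line).
RAW = '{"title": " Song A ", "duration": 200.4}\n{"title": "Song B", "duration": 95.0}\n'


def extract(text):
    """Extract: parse each non-empty line as one JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def transform(records):
    """Transform: tidy the title and round the duration to whole seconds."""
    return [(r["title"].strip(), round(r["duration"])) for r in records]


def load(conn, rows):
    """Load: insert the rows through a DB-API cursor.

    sqlite3 uses "?" placeholders; psycopg2 uses "%s" instead.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS songs (title TEXT, duration_s INTEGER)")
    cur.executemany("INSERT INTO songs VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
load(conn, transform(extract(RAW)))

cur = conn.cursor()
cur.execute("SELECT title, duration_s FROM songs ORDER BY title")
loaded = cur.fetchall()
print(loaded)
```

Keeping the three steps in separate functions means the load step can later be pointed at a real PostgreSQL connection without touching the extract and transform code.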


## What's our data about?
Our data simulates a music streaming app. The JSON files are split into two directories:
* One directory, `song_data`, holds JSON files about our songs, such as the song title, the artist, ...
* The other directory, `log_data`, holds JSON files about which songs were played, by whom, and at what time.

This can be shown in the following image:
<center><img src = "../images/json-data-content.jpg" width = 50%></center>


## Relational Database Schema
Many schemas can satisfy these requirements. In this project we will settle on a star schema, which keeps the number of joins per query low, at the cost of giving up some normalization and accepting the possibility of update anomalies.
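
The trade-off described above can be made concrete with a small sketch. The tables below are hypothetical, cut-down versions chosen only for illustration (they are not the project's actual column layout), the song and artist values are taken from the song data used in this notebook, and `sqlite3` stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
cur = conn.cursor()

# Hypothetical, cut-down tables: two dimension tables and one fact table.
cur.executescript("""
CREATE TABLE artist (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE song (song_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT);
CREATE TABLE songplay (
    songplay_id INTEGER PRIMARY KEY,
    song_id TEXT,
    artist_id TEXT  -- repeated here so the artist is one join away
);
""")
cur.execute("INSERT INTO artist VALUES (?, ?)",
            ("ARMJAGH1187FB546F3", "The Box Tops"))
cur.execute("INSERT INTO song VALUES (?, ?, ?)",
            ("SOCIWDW12A8C13D406", "Soul Deep", "ARMJAGH1187FB546F3"))
# A made-up play event for that song.
cur.execute("INSERT INTO songplay VALUES (?, ?, ?)",
            (1, "SOCIWDW12A8C13D406", "ARMJAGH1187FB546F3"))

# Every dimension is reached with a single join from the fact table.
cur.execute("""
SELECT song.title, artist.name
FROM songplay
JOIN song ON song.song_id = songplay.song_id
JOIN artist ON artist.artist_id = songplay.artist_id
""")
plays = cur.fetchall()
print(plays)
```

In this sketch `songplay` repeats `artist_id`, even though the song already determines its artist. That redundancy saves a join, but a change to a song's artist would have to be applied in two places, which is the kind of update anomaly a denormalized layout accepts.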

Designing the schema is outside the scope of this project. If you want to learn more, study database normalization and denormalization, and the pros and cons of different approaches to schema design, such as the star schema and the snowflake schema.

So, our tables will look like this:
<center><img src = "../images/schema.png" width = 50%> </center>

Where,
* `user` table holds info about the user `(Dimension Table)`
* `song` table holds info about the songs `(Dimension Table)`
* `artist` table holds info about the artist `(Dimension Table)`
* `time` table just expands info about the timestamp (which hour, day, week, month and year it belongs to) `(Dimension Table)`
* `songplay` table logs information whenever a song is played, so it's a `Fact Table`
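
As an illustration of what "expanding" a timestamp means for the `time` table, here is a minimal sketch using only the standard library. It assumes the timestamp is a Unix time in milliseconds and interprets it in UTC; both are assumptions, as are the function name `expand_timestamp` and the example value.

```python
from datetime import datetime, timezone


def expand_timestamp(ts_ms):
    """Break a millisecond Unix timestamp into hour, day, week, month and year."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "timestamp": ts_ms,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
    }


row = expand_timestamp(1541106106796)  # made-up example value
print(row)
```

Storing these derived fields once per timestamp lets queries filter or group by hour, week or month without recomputing them each time.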

## Which data from which source
* The JSON files concerned with `song data` will be used to fill the `song` and `artist` tables
* The JSON files concerned with `log data` will be used to fill the `user`, `time` and `songplay` tables
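
To illustrate the first point, the sketch below parses one song record and splits it into a row for the `song` table and a row for the `artist` table. The field names and values come from the song DataFrame shown further down in this notebook; which field goes to which table is an assumption based on the field names, and `SONG_COLUMNS`, `ARTIST_COLUMNS` and `split_record` are names made up for the example.

```python
import json

# One song record, written as a single JSON line.
LINE = (
    '{"num_songs": 1, "artist_id": "ARMJAGH1187FB546F3", '
    '"artist_latitude": 35.14968, "artist_longitude": -90.04892, '
    '"artist_location": "Memphis, TN", "artist_name": "The Box Tops", '
    '"song_id": "SOCIWDW12A8C13D406", "title": "Soul Deep", '
    '"duration": 148.03546, "year": 1969}'
)

# Assumed column split between the two dimension tables.
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]
ARTIST_COLUMNS = ["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]


def split_record(record):
    """Return (song_row, artist_row) tuples built from one song record."""
    song_row = tuple(record[c] for c in SONG_COLUMNS)
    artist_row = tuple(record[c] for c in ARTIST_COLUMNS)
    return song_row, artist_row


song_row, artist_row = split_record(json.loads(LINE))
print(song_row)
print(artist_row)
```

Both rows carry `artist_id`, which is what would let the `song` table refer back to the `artist` table.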

In [1]:
import os
import glob
import psycopg2
import pandas as pd
import json

In [2]:
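def get_files(filepath):
    """Return the absolute paths of all .json files under filepath, searched recursively."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        # Collect the JSON files that sit directly in this directory
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))

    return all_files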
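
For comparison, a similar recursive search can be written with `pathlib`. This is a sketch of an alternative, not the function used in the rest of the notebook: the name `get_files_pathlib` is made up, the result is sorted for reproducibility, and edge cases such as hidden files and symlinks may be handled differently from `get_files`.

```python
import tempfile
from pathlib import Path


def get_files_pathlib(filepath):
    """Return the absolute paths of all *.json files under filepath, sorted."""
    return sorted(str(p.resolve()) for p in Path(filepath).rglob("*.json"))


# Quick check against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "A" / "B"
    nested.mkdir(parents=True)
    (nested / "one.json").write_text("{}")
    (Path(tmp) / "two.json").write_text("{}")
    (Path(tmp) / "notes.txt").write_text("ignored")
    found = get_files_pathlib(tmp)
    print([Path(f).name for f in found])
```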

In [3]:
song_files = get_files("../data/song_data/")
# lines=True reads each file as JSON lines; build all frames first, then concatenate once
df_songs = pd.concat([pd.read_json(path, lines=True) for path in song_files])

In [5]:
df_songs.num_songs.unique()

array([1], dtype=int64)

In [6]:
df_songs

| | num_songs | artist_id | artist_latitude | artist_longitude | artist_location | artist_name | song_id | title | duration | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ARD7TVE1187B99BFB1 | | | California - LA | Casual | SOMZWCG12A8C13C480 | I Didn't Mean To | 218.93179 | 0 |
| 0 | 1 | ARMJAGH1187FB546F3 | 35.14968 | -90.04892 | Memphis, TN | The Box Tops | SOCIWDW12A8C13D406 | Soul Deep | 148.03546 | 1969 |
| 0 | 1 | ARKRRTF1187B9984DA | | | | Sonora Santanera | SOXVLOJ12AB0189215 | Amor De Cabaret | 177.47546 | 0 |
| 0 | 1 | AR7G5I41187FB4CE6C | | | London, England | Adam Ant | SONHOTT12A8C13493C | Something Girls | 233.40363 | 1982 |
| 0 | 1 | ARXR32B1187FB57099 | | | | Gob | SOFSOCN12A8C143F5D | Face the Ashes | 209.60608 | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | 1 | AR8IEZO1187B99055E | | | | Marc Shaiman | SOINLJW12A8C13314C | City Slickers | 149.86404 | 2008 |
| 0 | 1 | AR558FS1187FB45658 | | | | 40 Grit | SOGDBUF12A8C140FAA | Intro | 75.67628 | 2003 |
| 0 | 1 | ARVBRGZ1187FB4675A | | | | Gwen Stefani | SORRZGD12A6310DBC3 | Harajuku Girls | 290.55955 | 2004 |
| 0 | 1 | ARWB3G61187FB49404 | | | Hamilton, Ohio | Steve Morse | SODAUVL12A8C13D184 | Prognosis | 363.85914 | 2000 |
