### Welcome to the main notebook of this project

In this notebook I walk through each step of the project and briefly explain my thought process where necessary. The steps are:
- Connect to the database & create the tables
- Create the data to insert in the tables
- Insert the data
- Example queries
- Close the cursor and connection

Please note that this notebook replaces the original etl.ipynb and test.ipynb. 

In [1]:
from pathlib import Path
import psycopg2
from src.data_utils import create_database
from src.preprocessing import LogPreProcess, SongPreProcess, SongPlaysPreProcess
from src.sql_queries import create_table_queries, drop_table_queries, insert_table_queries

#### Connect to the database and create the tables

In [26]:
cur, conn = create_database()

In [27]:
for query in drop_table_queries:
    cur.execute(query)

In [28]:
for query in create_table_queries:
    cur.execute(query)

#### Create the data to insert in the tables

To make the project more realistic, I treated the files as if I had never seen them before and as if there were many more of them. This approach led to the DataValidation class and its accompanying assertions.

As you might notice, the songplays table surprised me at the end and forced me to create a few workarounds.
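The DataValidation class itself lives in `src` and is not shown in this notebook. As a rough illustration of what assertion-based checks on unseen files can look like, here is a minimal sketch. The function names `collect_json_files` and `validate_records` and the required keys are hypothetical and are not taken from the project code.

```python
import json
import tempfile
from pathlib import Path


def collect_json_files(root: Path) -> list[Path]:
    """Return every .json file below `root`, failing loudly if there are none."""
    files = sorted(root.rglob("*.json"))
    assert files, f"{root} contains no .json files"
    print(f"{root} contains {len(files)} .json files.")
    return files


def validate_records(records: list[dict], required_keys: set[str]) -> None:
    """Assert that every record carries all required keys (hypothetical check)."""
    for position, record in enumerate(records):
        missing = required_keys - set(record)
        assert not missing, f"record {position} is missing {sorted(missing)}"


# Demo on a throwaway directory, so the sketch runs without the project data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "a" / "song.json"
    sample.parent.mkdir(parents=True)
    sample.write_text(json.dumps({"song_id": "S1", "title": "Demo"}))

    found = collect_json_files(Path(tmp))
    records = [json.loads(path.read_text()) for path in found]
    validate_records(records, required_keys={"song_id", "title"})
```

Failing early with an assertion keeps a malformed file from reaching the insert step, which matters more once the number of files grows.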

In [29]:
song_path_list = Path('..') / 'data' / 'song_data'
songpp_instance = SongPreProcess(file_path=song_path_list)

artists_data, songs_data = songpp_instance.data_pipeline()

../data/song_data contains 71 .json files.


In [30]:
log_path_list = Path('..') / 'data' / 'log_data'
logpp_instance = LogPreProcess(file_path=log_path_list)

songsplays_help_df, time_data, users_data = logpp_instance.data_pipeline()

../data/log_data contains 30 .json files.


In [31]:
songplays_instance = SongPlaysPreProcess(artists_data, songs_data, songsplays_help_df)
songplays_data = songplays_instance.data_pipeline()

#### Insert the data into Postgres

In [32]:
data_list = [songplays_data, users_data, songs_data, artists_data, time_data]

In [33]:
for idx, (data, query) in enumerate(zip(data_list, insert_table_queries)):
    for row in data:
        try:
            cur.execute(query, row) 
        except psycopg2.Error as error:
            print(f"psycopg2 error @ table {idx} row {row}: {error} NOTE: this row will not be inserted.")
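The loop above relies on `data_list` and `insert_table_queries` being in the same order. One possible alternative, sketched below, keys each query and its rows by table name and records the rows that fail. To keep the sketch runnable without a Postgres server, it uses an in-memory SQLite database as a stand-in, so it catches `sqlite3.Error` and uses `?` placeholders where psycopg2 would use `psycopg2.Error` and `%s`. The `inserts` mapping and the `users` schema are made up for the example.

```python
import sqlite3

# Stand-in for the Postgres connection: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, level TEXT NOT NULL)")

# Hypothetical mapping: each table name carries its own query and rows,
# so the pairing no longer depends on two lists sharing the same order.
inserts = {
    "users": (
        "INSERT INTO users (user_id, level) VALUES (?, ?)",
        [(1, "free"), (2, "paid"), (2, "paid"), (3, None)],
    ),
}

failed = {}
for table, (query, rows) in inserts.items():
    for row in rows:
        try:
            cur.execute(query, row)
        except sqlite3.Error as error:
            # Record the failing row and continue with the next one.
            failed.setdefault(table, []).append((row, str(error)))

conn.commit()
inserted = cur.execute("SELECT count(*) FROM users").fetchone()[0]
print(f"inserted {inserted} rows, skipped {len(failed.get('users', []))}")
cur.close()
conn.close()
```

One difference to keep in mind when moving back to psycopg2: in Postgres, a failed statement inside an open transaction blocks every later statement until a rollback, unless the connection runs in autocommit mode. Whether `create_database` enables autocommit is not visible in this notebook.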

#### Example queries

In [34]:
activity_per_weekday = """
SELECT time.weekday
,    count(*) as n_songs_played
,    count(distinct sp.user_id) as n_unique_users
,    count(*) / count(distinct sp.user_id) as songs_per_user
,    count(*) / sum(count(*)) over () as perc_total_songs_played

FROM
    songplays as sp
    inner join time on time.start_time = sp.start_time
    
GROUP BY
    time.weekday
"""

In [35]:
try: 
    cur.execute(activity_per_weekday)    
except psycopg2.Error as e: 
    print(f"Error while executing the query: {e}")
else:
    row = cur.fetchone()
    while row:
        print(row)
        row = cur.fetchone()

(0, 1014, 59, 17, Decimal('0.14868035190615835777'))
(1, 1071, 57, 18, Decimal('0.15703812316715542522'))
(2, 1364, 60, 22, Decimal('0.20000000000000000000'))
(3, 1052, 56, 18, Decimal('0.15425219941348973607'))
(4, 1295, 63, 20, Decimal('0.18988269794721407625'))
(5, 628, 45, 13, Decimal('0.09208211143695014663'))
(6, 396, 39, 10, Decimal('0.05806451612903225806'))
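Note that `songs_per_user` comes back as a whole number while the percentage column is a `Decimal`. In Postgres, `count(*)` returns a `bigint`, and dividing one `bigint` by another truncates the result, whereas `sum(count(*)) over ()` returns a `numeric`. The short check below reproduces the arithmetic for weekday 0 from the output above. Casting one operand in the query (for example `count(*)::numeric`) would be one way to keep the fraction, although that variant has not been run here.

```python
from decimal import Decimal

# Values copied from the weekday output above.
songs_per_weekday = [1014, 1071, 1364, 1052, 1295, 628, 396]
n_songs_played, n_unique_users = 1014, 59  # weekday 0

# Integer division, which is what bigint / bigint does in Postgres.
truncated = n_songs_played // n_unique_users

# The same ratio without truncation.
exact = Decimal(n_songs_played) / Decimal(n_unique_users)

# Share of all plays, matching the perc_total_songs_played column.
total = sum(songs_per_weekday)
share = Decimal(n_songs_played) / Decimal(total)

print(truncated, round(exact, 2), round(share, 4))
```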


In [36]:
activity_per_level = """
SELECT level
,    count(*) as n_songs_played
,    count(distinct user_id) as n_unique_users
,    count(*) / count(distinct user_id) as songs_per_user
,    count(*) / sum(count(*)) over () as perc_level

FROM
    songplays as sp
    
GROUP BY
    level
"""

In [37]:
try: 
    cur.execute(activity_per_level)    
except psycopg2.Error as e: 
    print(f"Error while executing the query: {e}")
else:
    row = cur.fetchone()
    while row:
        print(row)
        row = cur.fetchone()

('free', 1229, 82, 14, Decimal('0.18020527859237536657'))
('paid', 5591, 22, 254, Decimal('0.81979472140762463343'))


In [38]:
activity_per_gender = """
SELECT users.gender
,    count(*) as n_songs_played
,    count(distinct sp.user_id) as n_unique_users
,    count(*) / sum(count(*)) over () as perc_songs_played

FROM
    songplays as sp
    inner join users on users.user_id = sp.user_id
    
GROUP BY
    users.gender
"""

In [39]:
try: 
    cur.execute(activity_per_gender)    
except psycopg2.Error as e: 
    print(f"Error while executing the query: {e}")
else:
    row = cur.fetchone()
    while row:
        print(row)
        row = cur.fetchone()

('F', 4887, 55, Decimal('0.71656891495601173021'))
('M', 1933, 41, Decimal('0.28343108504398826979'))
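The three query cells above repeat the same execute-and-fetch loop. A small helper could remove that duplication. The sketch below is one possible shape: the name `fetch_rows` is hypothetical, and the demo again uses an in-memory SQLite database as a stand-in for Postgres. With psycopg2 you would pass `psycopg2.Error` as `error_type`.

```python
import sqlite3


def fetch_rows(cur, query, error_type=Exception):
    """Execute `query` and return all rows, or an empty list on a database error."""
    try:
        cur.execute(query)
    except error_type as error:
        print(f"Error while executing the query: {error}")
        return []
    # fetchall() is part of the Python DB-API, so it works for both drivers.
    return cur.fetchall()


# Demo with made-up rows in a stand-in database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE songplays (user_id INTEGER, level TEXT)")
cur.executemany(
    "INSERT INTO songplays VALUES (?, ?)",
    [(1, "free"), (2, "paid"), (2, "paid")],
)

rows = fetch_rows(
    cur,
    "SELECT level, count(*) FROM songplays GROUP BY level ORDER BY level",
    sqlite3.Error,
)
bad = fetch_rows(cur, "SELECT * FROM missing_table", sqlite3.Error)

for row in rows:
    print(row)

cur.close()
conn.close()
```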


#### Close the cursor and connection

In [25]:
cur.close()
conn.close()
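If a cell fails before this point, the cursor and connection stay open. One way to guarantee cleanup is `contextlib.closing`, sketched here with an in-memory SQLite connection as a stand-in for the Postgres one. With psycopg2, `with conn:` on its own ends the transaction but does not close the connection, which is why the sketch wraps the objects in `closing` instead.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, even if the body raises.
with closing(sqlite3.connect(":memory:")) as conn, closing(conn.cursor()) as cur:
    cur.execute("SELECT 1")
    value = cur.fetchone()[0]

# Confirm that the connection is really closed after the block.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```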

#### Next Steps
Next, you can run the two scripts in /scripts.