In [1]:
import os
import glob
import psycopg2
import pandas as pd
import numpy as np
import json
import sql_queries
from typing import List

Connecting to the database

In [2]:
with open("../config.json") as f:
    config = json.load(f)
    
conn = psycopg2.connect(host = config["host"],
                        dbname = config["dbname"],
                        user = config["username"],
                        password = config["password"])

conn.set_session(autocommit=True)
cur = conn.cursor()

Since the JSON files are spread across several subdirectories, we can write a function that collects every JSON file under a parent directory.

In [3]:
def get_files(filepath:str) -> List[str]:
    """returns all json files in the directory tree under the filepath

    Arguments:
        filepath -- the root path

    Returns:
        a list of json filepaths under the root path
    """
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        # glob does the filtering, so the names yielded by os.walk are unused
        for f in glob.glob(os.path.join(root, '*.json')):
            all_files.append(os.path.abspath(f))
    
    return all_files
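
As an aside, the same traversal can be sketched with `pathlib`. This is only an alternative formulation, not the function used in the rest of the notebook. The name `get_files_pathlib` is made up for illustration, and sorting the result is an added assumption so that the order does not depend on the file system. Note that `Path.resolve()` also follows symlinks, which `os.path.abspath` does not.

```python
from pathlib import Path
from typing import List


def get_files_pathlib(filepath: str) -> List[str]:
    """Return the absolute paths of all .json files under filepath, sorted.

    Hypothetical alternative to get_files; not part of the original notebook.
    """
    root = Path(filepath)
    # rglob searches the root directory and every directory below it
    return sorted(str(path.resolve()) for path in root.rglob("*.json"))
```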

# Part 1: The Songs Files
Using the files under the songs directory, we can fill the
* `song` table
* `artist` table

Once the foreign key constraints are added (in the script), the artist data has to be inserted first, because the song table references it.

This is an exploratory phase, so adding the constraints this early could introduce unnecessary complications.
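
The ordering requirement can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, and the two-table schema is a simplified assumption, not the schema defined in `sql_queries`. With the foreign key in place, inserting the song before its artist is rejected, while inserting the artist first succeeds.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Simplified, hypothetical schema: only the columns needed to show the dependency.
conn.execute("CREATE TABLE artist (artist_id TEXT PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE song ("
    "song_id TEXT PRIMARY KEY, title TEXT, "
    "artist_id TEXT REFERENCES artist (artist_id))"
)

song = ("SOMZWCG12A8C13C480", "I Didn't Mean To", "ARD7TVE1187B99BFB1")

# Inserting the song first violates the foreign key, because the artist is missing.
try:
    conn.execute("INSERT INTO song VALUES (?, ?, ?)", song)
    song_first_ok = True
except sqlite3.IntegrityError:
    song_first_ok = False

# Inserting the artist first makes the same song insert valid.
conn.execute("INSERT INTO artist VALUES (?, ?)", ("ARD7TVE1187B99BFB1", "Casual"))
conn.execute("INSERT INTO song VALUES (?, ?, ?)", song)
song_count = conn.execute("SELECT COUNT(*) FROM song").fetchone()[0]
```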

In [4]:
song_files = get_files("../data/song_data/")

In [5]:
df_songs_artists = pd.read_json(song_files[0], lines = True)    
df_songs_artists.head()

Unnamed: 0,num_songs,artist_id,artist_latitude,artist_longitude,artist_location,artist_name,song_id,title,duration,year
0,1,ARD7TVE1187B99BFB1,,,California - LA,Casual,SOMZWCG12A8C13C480,I Didn't Mean To,218.93179,0


### 1.1 Trying to insert into the Songs table
For the songs table, we are only concerned with the following columns:
* `song_id`
* `title`
* `artist_id`
* `duration`
* `year`



In [6]:
song_columns = ["song_id", "title", "artist_id", "duration", "year"]
df_songs = df_songs_artists[song_columns].copy()
df_songs

Unnamed: 0,song_id,title,artist_id,duration,year
0,SOMZWCG12A8C13C480,I Didn't Mean To,ARD7TVE1187B99BFB1,218.93179,0


In [7]:
songs_data = list(df_songs.iloc[0].values)
songs_data

['SOMZWCG12A8C13C480', "I Didn't Mean To", 'ARD7TVE1187B99BFB1', 218.93179, 0]

Now, let's try inserting it into the database

In [8]:
cur.execute(sql_queries.song_table_insert, songs_data)

ProgrammingError: can't adapt type 'numpy.int64'

That's a problem with NumPy data types: psycopg2 doesn't know how to adapt them. We can handle that as described in this [Stack Overflow post](https://stackoverflow.com/questions/50626058/psycopg2-cant-adapt-type-numpy-int64)

In [None]:
import numpy as np
from psycopg2.extensions import register_adapter
def adapt_np_int(np_int:np.int64):
    _INT = psycopg2.extensions.Int
    return _INT(np_int)
    
register_adapter(np.int64, adapt_np_int)

This way, psycopg2 knows how to adapt `np.int64` values as integers.

We can try our insert again

In [None]:
cur.execute(sql_queries.song_table_insert, songs_data)

Now for a sanity check

In [None]:
cur.execute("SELECT * FROM song")
res = cur.fetchall()
print(res)

[('SOMZWCG12A8C13C480', "I Didn't Mean To", 'ARD7TVE1187B99BFB1', 219, 0.0)]


### 1.2 Trying to insert into the artist table

This time, we are concerned with the following columns:
* `artist_id`
* `artist_name`
* `location`
* `longitude`
* `latitude`

In [None]:
columns = ["artist_id", "artist_name", "artist_location", "artist_latitude", "artist_longitude"]
df_artists = df_songs_artists[columns].copy()
df_artists

Unnamed: 0,artist_id,artist_name,artist_location,artist_latitude,artist_longitude
0,ARD7TVE1187B99BFB1,Casual,California - LA,,


It's important to note that the `NaN` here is an `np.float64` value, so inserting it into the database as-is will not produce a `NULL`.

We can handle this via an adapter, as we did with `np.int64`.

In [None]:
def adapt_np_float_and_nans(value):
    _FLOAT = psycopg2.extensions.Float
    _NULL = psycopg2.extensions.AsIs("NULL")
    if np.isnan(value):
        return _NULL
    return _FLOAT(value)
    
register_adapter(np.float64, adapt_np_float_and_nans)
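
A possible alternative to the adapter, sketched under the assumption that the row can be cleaned before it reaches `cur.execute`, is to replace `NaN` with `None` in plain Python, since psycopg2 sends `None` as `NULL`. The helper name `nan_to_none` is hypothetical. The `isinstance(..., float)` check also matches `np.float64`, which is a subclass of Python's `float`. This covers only the `NaN` handling; it does not replace the `np.int64` adapter above.

```python
import math
from typing import Any, List, Sequence


def nan_to_none(values: Sequence[Any]) -> List[Any]:
    """Return a copy of values in which every float NaN is replaced by None.

    Hypothetical helper; not part of the original notebook.
    """
    return [
        None if isinstance(value, float) and math.isnan(value) else value
        for value in values
    ]


# Same shape as the artist row above, with NaN for the missing coordinates.
row = ["ARD7TVE1187B99BFB1", "Casual", "California - LA", float("nan"), float("nan")]
cleaned = nan_to_none(row)
```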

In [None]:
artist_data = list(df_artists.iloc[0].values)
cur.execute(sql_queries.artist_table_insert, artist_data)

Sanity Check

In [None]:
cur.execute("SELECT * FROM artist")
res = cur.fetchall()
print(res)

[('ARD7TVE1187B99BFB1', 'Casual', 'California - LA', None, None)]


# Part 2: The Log data

From the data under the `log_data` directory, we can fill in
* `user` table
* `time` table
* `songplay` table

In [10]:
log_files = get_files("../data/log_data")
df_logs = pd.read_json(log_files[0], lines = True)
df_logs.head(2)

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919166796,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540344794796,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8


### 2.1 The users Table

We should note that user data can change. It makes sense to let users switch between the free and premium levels. As for the other fields, some platforms allow them to change and others do not, so it's up to us to decide. I will allow users to change their other information as well.

As such, we need to be careful about how, and in what order, the data is inserted.
We can either
* read all the logs into a single dataframe and sort it by timestamp before inserting the users,
* or process the logs file by file, making sure we visit the files in chronological order. Within each file, we still have to sort by timestamp.
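
The effect of sorting by timestamp can be sketched in plain Python. The rows below are made up for illustration (only the first one mirrors a record shown above), and `latest_user_records` is a hypothetical helper, not something the notebook defines. Visiting the rows in ascending `ts` order means the most recent record for each `userId` is the one that survives, which is the behaviour we want when a user changes level.

```python
from typing import Any, Dict, Iterable, List


def latest_user_records(log_rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep, for each userId, the row with the greatest ts (hypothetical helper)."""
    latest: Dict[Any, Dict[str, Any]] = {}
    # Rows are visited in ascending ts order, so later rows overwrite earlier ones.
    for row in sorted(log_rows, key=lambda r: r["ts"]):
        latest[row["userId"]] = row
    return list(latest.values())


# Made-up rows: user 39 upgrades from free to paid at a later timestamp.
rows = [
    {"userId": 39, "firstName": "Walter", "level": "paid", "ts": 1541105990796},
    {"userId": 8, "firstName": "Kaylee", "level": "free", "ts": 1541106106796},
    {"userId": 39, "firstName": "Walter", "level": "free", "ts": 1541105830796},
]
users = latest_user_records(rows)
```

The same idea carries over to the database if each insert updates the existing row on a key conflict, as long as the rows arrive in chronological order.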

For now, in this notebook, we will just insert a single record and do a sanity check.