### Test and explore

In this notebook I test locally developed code on the actual server. Creating the tables and inserting the data follows a recurring pattern, which more generic functions can take advantage of. In general, the process flow is as follows:
- Create a table
- Import the raw data
- Validate and prepare the data for insertion
- Insert the data
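
The four steps above could be wrapped in one generic helper. The sketch below is only an illustration: `load_table` and its parameters are hypothetical names (not the functions in `data_utils`), and the standard-library `sqlite3` module is used as a local stand-in for the Postgres connection so the example runs anywhere. With `psycopg2` the placeholders would be `%s` instead of `?`.

```python
import sqlite3
from typing import Callable, Iterable, Sequence


def load_table(cur, create_sql: str, insert_sql: str,
               raw_rows: Iterable[Sequence],
               validate: Callable[[Sequence], bool]) -> int:
    """Create a table, keep only the valid rows and insert them.

    Hypothetical helper; returns the number of inserted rows.
    """
    cur.execute(create_sql)                              # create the table
    rows = [tuple(r) for r in raw_rows if validate(r)]   # import and validate
    cur.executemany(insert_sql, rows)                    # insert the data
    return len(rows)


# Local stand-in for the Postgres connection (assumption, illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up rows: the second one has no primary key and should be dropped.
raw = [("AR1", "Artist one"), (None, "No id"), ("AR2", "Artist two")]
inserted = load_table(
    cur,
    create_sql="CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT NOT NULL)",
    insert_sql="INSERT INTO artists (artist_id, name) VALUES (?, ?)",
    raw_rows=raw,
    validate=lambda row: row[0] is not None,
)
print(inserted)
```

Passing the validation step in as a callable keeps the helper independent of any one table, which is the point of making the functions more generic.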

**2020-12-09**
- Next steps: create the artists table from the song_path_list -> should not take long
- Thereafter: explore the logging data, try to leverage the already existing functions -> perhaps one will be songs_func and other log_func
- Add a try statement around reading the JSON file; that is a typical pitfall
- ..
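
A minimal sketch of the try statement around reading a JSON file, assuming each file holds a single JSON object. `read_json_file` is a hypothetical name, and printing the error is just a placeholder for whatever reporting is preferred.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_json_file(path: Path) -> Optional[dict]:
    """Return the parsed JSON content, or None if the file cannot be read or parsed."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, ValueError) as exc:
        # OSError covers missing or unreadable files; json.JSONDecodeError
        # and UnicodeDecodeError are both subclasses of ValueError.
        print(f"Skipping {path}: {exc}")
        return None


# Small demonstration with temporary files (made-up content).
tmp_dir = Path(tempfile.mkdtemp())
good_file = tmp_dir / "good.json"
bad_file = tmp_dir / "bad.json"
good_file.write_text('{"song_id": "S1", "title": "Example"}', encoding="utf-8")
bad_file.write_text('{"song_id": ', encoding="utf-8")  # truncated on purpose

good = read_json_file(good_file)
bad = read_json_file(bad_file)
missing = read_json_file(tmp_dir / "does_not_exist.json")
```

Returning `None` lets the caller skip a broken file and continue with the rest of the path list instead of aborting the whole import.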

**2020-12-10**
- Song data fixed
- Move on to the log file -> start with users and time before you move on to songplays, since that table depends on the other tables
- Start logging: it is a really important skill to have, plus you are going to use it in SAMR, ADD, Aanbodsmodel, etc. Especially for these types of tasks you cannot rely only on the command-line print statements. TO LOG OR NOT TO LOG -> LOG :)
- First: fix all tables, then https://calmcode.io/logging/introduction.html
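
As a starting point for the logging item above, a minimal sketch with the standard-library `logging` module. `get_logger`, the logger name `etl_demo` and the messages are made up for illustration; the in-memory buffer is only there so the output can be inspected in the notebook.

```python
import io
import logging


def get_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes timestamped messages to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = get_logger("etl_demo", stream=buffer)
log.info("Inserted %d rows into %s", 3, "artists")
log.warning("Skipped %d rows with a missing primary key", 1)
print(buffer.getvalue())
```

Unlike print statements, the level and timestamp make it possible to filter messages afterwards and to see when each step ran.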

**2020-12-12**
- Your insert statements should include checks: what if we want to insert a row whose primary key is already taken? Currently we check all the data at once and validate in Pandas, but in real life we would continuously update these tables; hence, we need robust SQL insert statements
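
One way to make an insert robust against an existing primary key is the `ON CONFLICT` clause, which Postgres supports on `INSERT`. The sketch below is an assumption-laden illustration, not the statement in `sql_queries`: it uses the standard-library `sqlite3` module (SQLite 3.24 or newer accepts the same clause) as a local stand-in, a made-up two-column `artists` table, and `?` placeholders where `psycopg2` would use `%s`.

```python
import sqlite3

# Local stand-in for the Postgres connection (assumption, illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT NOT NULL)")

# Skip the row instead of raising an error when the primary key already exists.
insert_artist = """
    INSERT INTO artists (artist_id, name)
    VALUES (?, ?)
    ON CONFLICT (artist_id) DO NOTHING
"""

# Made-up rows: the third one reuses the primary key of the first.
rows = [("AR1", "First"), ("AR2", "Second"), ("AR1", "Duplicate key")]
for row in rows:
    cur.execute(insert_artist, row)

stored = cur.execute(
    "SELECT artist_id, name FROM artists ORDER BY artist_id"
).fetchall()
print(stored)
```

With this clause the first row for a key wins and later duplicates are silently ignored, so the Pandas deduplication is no longer the only safeguard.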

In [2]:
%load_ext autoreload
%autoreload 2

In [1]:
from create_tables import create_database
from data_checks import artist_dtype_dict, songs_dtype_dict, users_dtype_dict
from data_utils import create_log_insert_list, create_song_insert_list, create_path_list
from pathlib import Path
import psycopg2
from psycopg2 import extras
from sql_queries import (drop_all_tables, artist_table_create, song_table_create,
                         artist_table_insert, song_table_insert, user_table_create)

### Create the sparkify database

In [3]:
cur, conn = create_database()

In [4]:
cur.execute(drop_all_tables)

## Song data

#### Create the tables

In [5]:
cur.execute(artist_table_create)
cur.execute(song_table_create)

#### Import the data

In [6]:
data_path = Path('.') / 'data'

In [None]:
song_path_list = create_path_list(data_path / 'song_data')

#### Validate and prepare the data for insertion
- Use the deduplicated artist table to check your SQL insert statements

In [None]:
artist_columns = ['artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']
song_columns = ['song_id', 'title', 'artist_id', 'year', 'duration']

In [None]:
artist_table_data = create_song_insert_list(file_path_list=song_path_list,
                                            insert_columns=artist_columns,
                                            primary_key='artist_id',
                                            dtype_dict=artist_dtype_dict,
                                            not_nullable_columns=['artist_id', 'artist_name'])

In [None]:
song_table_data = create_song_insert_list(file_path_list=song_path_list,
                                          insert_columns=song_columns,
                                          primary_key='song_id',
                                          dtype_dict=songs_dtype_dict)

#### Insert the data

In [None]:
try:
    extras.execute_values(cur, artist_table_insert.as_string(cur), artist_table_data)
except psycopg2.Error as e:
    print("Error: Inserting Rows")
    print(e)

In [None]:
try:
    extras.execute_values(cur, song_table_insert.as_string(cur), song_table_data)
except psycopg2.Error as e:
    print("Error: Inserting Rows")
    print(e)

#### Explore the data

In [None]:
try:
    cur.execute("SELECT * FROM artists WHERE latitude IS NULL")
except psycopg2.Error as e:
    print("Error: select *")
    print(e)
else:
    # Only fetch when the query succeeded; a psycopg2 cursor is iterable.
    for row in cur:
        print(row)

In [None]:
try:
    cur.execute("SELECT * FROM songs")
except psycopg2.Error as e:
    print("Error: select *")
    print(e)
else:
    # Only fetch when the query succeeded; a psycopg2 cursor is iterable.
    for row in cur:
        print(row)

## Log data
- Check all the code step by step: things go wrong with certain keys, which has everything to do with lower/upper case column names
- If non nullable:
      non_nullable.lower() 
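
Postgres folds unquoted identifiers to lower case, so a log key such as `userId` ends up as `userid` in the table unless it is quoted. A minimal sketch of comparing keys case-insensitively; `lower_keys`, `missing_required` and the record are hypothetical, not the code in `data_utils`.

```python
def lower_keys(record: dict) -> dict:
    """Return a copy of the record with all keys lower-cased."""
    return {key.lower(): value for key, value in record.items()}


def missing_required(record: dict, not_nullable_columns: list) -> list:
    """List the not-nullable columns that are absent or None, ignoring case."""
    normalised = lower_keys(record)
    return [
        col for col in not_nullable_columns
        if normalised.get(col.lower()) is None
    ]


# Made-up record in the camelCase style of the log columns.
record = {"userId": "26", "firstName": "Ryan", "level": None}
problems = missing_required(record, ["userId", "level"])
print(problems)
```

Lower-casing both sides in one place avoids having to remember which of the two spellings each list of column names uses.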

In [7]:
log_path_list = create_path_list(data_path / 'log_data')

data/log_data contains 30 .json files.


In [8]:
users_columns = ['userId', 'firstName', 'lastName', 'gender', 'level']

In [9]:
users_table_data = create_log_insert_list(file_path_list=log_path_list,
                                         insert_columns=users_columns,
                                         primary_keys=['userId', 'gender', 'level'],
                                         dtype_dict=users_dtype_dict,
                                         not_nullable_columns=['userId', 'level'])

There were 7665 duplicate primary keys removed from the insert dataframe


##### Insert the data into Postgres row by row, in chronological order, so you can update the table accordingly. Apply the same insert method to the songs tables.
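
A sketch of what a chronological, row-by-row insert with updates could look like, assuming the goal is that the most recent `level` of a user wins. Everything here is illustrative: the `users` table has only two made-up columns, the events are invented, and the standard-library `sqlite3` module (SQLite 3.24 or newer) stands in for Postgres, which offers the same `ON CONFLICT ... DO UPDATE` clause with `%s` placeholders in `psycopg2`.

```python
import sqlite3

# Local stand-in for the Postgres connection (assumption, illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, level TEXT NOT NULL)")

# Insert a new user, or overwrite the level when the user already exists.
upsert_user = """
    INSERT INTO users (user_id, level)
    VALUES (?, ?)
    ON CONFLICT (user_id) DO UPDATE SET level = excluded.level
"""

# Made-up (timestamp, user_id, level) events, deliberately out of order.
events = [(3, "26", "paid"), (1, "26", "free"), (2, "8", "free")]

# Sorting the tuples orders them by timestamp, so the oldest event is applied first.
for _ts, user_id, level in sorted(events):
    cur.execute(upsert_user, (user_id, level))

users = dict(cur.execute("SELECT user_id, level FROM users").fetchall())
print(users)
```

Because the rows are applied oldest first, the value left in the table is the one from the latest event, without removing duplicates beforehand.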

### Time table

#### Close the cursor and connection

In [11]:
cur.close()
conn.close()