# Database / SQL Final Project

This project will involve three related CSV files.
  * [play_list_music.csv](./play_list_music.csv)
  * [play_list_track_customers.csv](./play_list_track_customers.csv)
  * [play_list_track_buy.csv](./play_list_track_buy.csv)


Your task for this project is to build a PostgreSQL database from these files, then perform some analytics.
This project should be broken down into the following tasks:
  1. Download and inspect the files.
  1. Design a database that is **properly normalized**.
  1. Implement your database design.
  1. Load data from files into database.
  1. Write some basic queries.

All your code should be implemented in this notebook.
The rest of the notebook is partitioned into markdown cells and code execution cells.

In the cells below, connect to your database.
Remember to update the SSO to your pawprint.

In [None]:
import pandas as pd
import getpass
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs
# Magic adapters for the Numpy Fun of Pandas
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)
mypasswd = getpass.getpass()
username = 'jch5x8'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

In [1]:
# Then connects to the DB
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# -------------- Add Content Below

# SQLAlchemy Connection Parameters
postgres_db = {'drivername': 'postgresql',  # SQLAlchemy 1.4+ no longer accepts the 'postgres' alias
               'username': username,
               'password': mypasswd,
               'host': host,
               'database' :database}

# SQLAlchemy Engine
engine = create_engine(URL(**postgres_db), echo = True)

# init connection variable
connection = None
# using a try-except
try:
    connection = engine.connect()
except Exception as err:
    print("An error occurred trying to connect: {}".format(err))
    
# -------------- Add Content Above
del mypasswd


# Design a database that is _properly normalized_.

Note: You can expect up to approximately ten (10) tables to be derived from the three CSV files.

There is no implementation cell; the output should be an ERD or sketch.

Visit the course Canvas Site for Normalization videos. 

# Implement your database design.

Use the cells below to add your `CREATE TABLE` statements.
Add extra cells as necessary.

In [None]:
DROP TABLE IF EXISTS jch5x8.invoice_tracks, jch5x8.invoice_customers, jch5x8.playlist, jch5x8.track, jch5x8.album, jch5x8.customer, jch5x8.contact, jch5x8.city;

DROP TABLE IF EXISTS jch5x8.tracks, jch5x8.album, jch5x8.customer, jch5x8.contact, jch5x8.city;

CREATE TABLE IF NOT EXISTS jch5x8.city(city_id INT PRIMARY KEY, city VARCHAR(50), locale VARCHAR(50), country VARCHAR(50));
CREATE TABLE IF NOT EXISTS jch5x8.contact(contact_id INT PRIMARY KEY, address VARCHAR(150), city_id INT REFERENCES jch5x8.city ON DELETE SET NULL ON UPDATE CASCADE, postal_code VARCHAR(50), phone VARCHAR(50), fax VARCHAR(50), email VARCHAR(50));
CREATE TABLE IF NOT EXISTS jch5x8.customer(customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), company VARCHAR(50), contact_id INT REFERENCES jch5x8.contact ON DELETE SET NULL ON UPDATE CASCADE);
CREATE TABLE IF NOT EXISTS jch5x8.album(album_id INT PRIMARY KEY, album VARCHAR(150), artist VARCHAR(150), genre VARCHAR(50));
CREATE TABLE IF NOT EXISTS jch5x8.track(track_id INT PRIMARY KEY, album_id INT REFERENCES jch5x8.album ON DELETE SET NULL ON UPDATE CASCADE, song VARCHAR(150), media_type VARCHAR(50), bytes INT);
CREATE TABLE IF NOT EXISTS jch5x8.playlist(track_playlist_id INT PRIMARY KEY, playlist VARCHAR(150), track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE);
CREATE TABLE IF NOT EXISTS jch5x8.invoice_customers(invoice_id INT PRIMARY KEY, customer_id INT REFERENCES jch5x8.customer ON DELETE SET NULL ON UPDATE CASCADE);
CREATE TABLE IF NOT EXISTS jch5x8.invoice_tracks(it_id INT PRIMARY KEY, invoice_id INT REFERENCES jch5x8.invoice_customers ON DELETE SET NULL ON UPDATE CASCADE, track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE, unit_price REAL);




CREATE TABLE IF NOT EXISTS jch5x8.city(city_id SERIAL PRIMARY KEY, city VARCHAR(50), locale VARCHAR(50), country VARCHAR(50));
CREATE TABLE IF NOT EXISTS jch5x8.contact(contact_id SERIAL PRIMARY KEY, address VARCHAR(150), city_id INT REFERENCES jch5x8.city ON DELETE SET NULL ON UPDATE CASCADE, postal_code VARCHAR(50), phone VARCHAR(50), fax VARCHAR(50), email VARCHAR(50), city VARCHAR(150), customer INT);
CREATE TABLE IF NOT EXISTS jch5x8.customer(customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), company VARCHAR(50), contact_id INT REFERENCES jch5x8.contact ON DELETE SET NULL ON UPDATE CASCADE);
CREATE TABLE IF NOT EXISTS jch5x8.album(album_id SERIAL PRIMARY KEY, album VARCHAR(150), artist VARCHAR(150), genre VARCHAR(50), track INT);
CREATE TABLE IF NOT EXISTS jch5x8.track(track_id INT, album_id INT REFERENCES jch5x8.album ON DELETE SET NULL ON UPDATE CASCADE, song VARCHAR(150), playlist VARCHAR(150), media_type VARCHAR(50), bytes INT, PRIMARY KEY(track_id, playlist));
CREATE TABLE IF NOT EXISTS jch5x8.invoice(invoice_id INT PRIMARY KEY, customer_id INT REFERENCES jch5x8.customer ON DELETE SET NULL ON UPDATE CASCADE, unit_price REAL);
CREATE TABLE IF NOT EXISTS jch5x8.invoice_track(invoice_id INT REFERENCES jch5x8.invoice ON DELETE SET NULL ON UPDATE CASCADE, track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE, PRIMARY KEY(invoice_id, track_id));




DROP TABLE IF EXISTS jch5x8.invoice_track, jch5x8.invoice, jch5x8.track, jch5x8.album, jch5x8.customer, jch5x8.contact, jch5x8.city;

--CREATE TABLE IF NOT EXISTS jch5x8.invoice_track(invoice_id INT REFERENCES jch5x8.invoice ON DELETE SET NULL ON UPDATE CASCADE, track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE, PRIMARY KEY());


In [None]:
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.city(city_id SERIAL PRIMARY KEY, city VARCHAR(50), locale VARCHAR(50), country VARCHAR(50))")
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.contact(contact_id SERIAL PRIMARY KEY, address VARCHAR(150), city_id INT REFERENCES jch5x8.city ON DELETE SET NULL ON UPDATE CASCADE, postal_code VARCHAR(50), phone VARCHAR(50), fax VARCHAR(50), email VARCHAR(50), city VARCHAR(150), customer_id INT)")
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.customer(customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), company VARCHAR(50), contact_id INT REFERENCES jch5x8.contact ON DELETE SET NULL ON UPDATE CASCADE)")
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.album(album_id SERIAL PRIMARY KEY, album VARCHAR(150), artist VARCHAR(150), genre VARCHAR(50), track_id INT)")
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.track(track_id SERIAL PRIMARY KEY, id INT, album_id INT REFERENCES jch5x8.album ON DELETE SET NULL ON UPDATE CASCADE, song VARCHAR(150), playlist VARCHAR(150), media_type VARCHAR(50), bytes INT)")
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.invoice(invoice_id INT, track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE, customer_id INT REFERENCES jch5x8.customer ON DELETE SET NULL ON UPDATE CASCADE, unit_price REAL, PRIMARY KEY(invoice_id, track_id))")



# init connection variable
connection = None
# using a try-except
try:
    connection = engine.connect()
except Exception as err:
    print("An error occurred trying to connect: {}".format(err))
    
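# Create the table. unit_price is a plain REAL column (a price cannot be a foreign key),
# and because jch5x8.invoice has the composite primary key (invoice_id, track_id),
# the reference to it must name both columns.
result = connection.execute("CREATE TABLE IF NOT EXISTS jch5x8.invoice_track(invoice_id INT, track_id INT REFERENCES jch5x8.track ON DELETE SET NULL ON UPDATE CASCADE, unit_price REAL, FOREIGN KEY (invoice_id, track_id) REFERENCES jch5x8.invoice (invoice_id, track_id) ON DELETE SET NULL ON UPDATE CASCADE)")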

connection.close()

In [None]:
col_list = ["playlist_id", "track_id", "customer_id", "unit_price"]
playlist = pd.read_csv("playlist.csv", usecols = col_list)
playlist.head()

playlist.to_sql('playlist',       # Table to load to
            engine,               # Engine created above
            schema = username,    # Schema where table lives
            if_exists = 'append', # If table found, add data
            index = False,        # Ignore data frame row index
            chunksize = 50        # Load 50 records from data frame at a time
        )

In [None]:
track['unit_price'] = track['unit_price'].fillna(0)

In [None]:
col_list = ["id", "song", "playlist", "media_type", "bytes"]
track = pd.read_csv("track.csv", usecols = col_list)
# col_list2 = ["track_id", "unit_price"]
# price = pd.read_csv("invoice.csv", usecols = col_list2)
# track = pd.merge(track, price, on="track_id", how="left")
# track = track.drop('track_id', 1)
# track['unit_price'] = track['unit_price'].fillna(0)

In [None]:
col_list = ["it_id", "track_id", "customer_id", "unit_price"]
invoice_tracks = pd.read_csv("invoice_tracks.csv", usecols = col_list)
#invoice_tracks.head()

invoice_tracks.to_sql('invoice_tracks',         # Table to load to
            engine,               # Engine created above
            schema = username,    # Schema where table lives
            if_exists = 'append', # If table found, add data
            index = False,        # Ignore data frame row index
            chunksize = 50        # Load 50 records from data frame at a time
        )

# Load data from files into database.

### Use Excel or Pandas to carve the provided CSV files above into the **set of appropriate files** you need to load into your database.
   1. Example: Save File As *new_csv_name.csv*
   1. Remove unneeded columns
   1. Remove duplicate rows
   1. Save the file, then navigate to the JupyterHub folder view (your first JupyterHub tab)
   1. Upload the file
   1. Load the CSV into your database using Python.
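
The carving steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inline CSV text, its column names (`track_id`, `song`, `album`, `artist`), and the helper `write_csv` are made up for the example and are not the contents of the provided files. It uses the standard `csv` module so it runs anywhere; in the notebook, pandas column selection plus `drop_duplicates()` would play the same role.

```python
import csv
import io

# Toy stand-in for one of the provided CSV files; the column names are assumptions.
flat_csv = io.StringIO(
    "track_id,song,album,artist\n"
    "1,Song A,Album X,Artist P\n"
    "2,Song B,Album X,Artist P\n"
    "3,Song C,Album Y,Artist Q\n"
)

rows = list(csv.DictReader(flat_csv))

# Assign one surrogate key per distinct (album, artist) pair, in first-seen order.
album_ids = {}
for row in rows:
    key = (row["album"], row["artist"])
    if key not in album_ids:
        album_ids[key] = len(album_ids) + 1

# Parent table: one row per album, duplicates removed.
albums = [
    {"album_id": album_id, "album": album, "artist": artist}
    for (album, artist), album_id in album_ids.items()
]

# Child table: tracks keep only their own columns plus the foreign key.
tracks = [
    {
        "track_id": int(row["track_id"]),
        "song": row["song"],
        "album_id": album_ids[(row["album"], row["artist"])],
    }
    for row in rows
]


def write_csv(records, fieldnames):
    """Serialize records to CSV text; in the notebook this would be a file on disk."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


album_csv = write_csv(albums, ["album_id", "album", "artist"])
track_csv = write_csv(tracks, ["track_id", "song", "album_id"])
print(album_csv)
print(track_csv)
```

Each carved file then maps onto exactly one table, and the repeated album and artist text is stored once and referenced by `album_id`.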
     




In [None]:
col_list = ["song", "media_type", "bytes", "track"]
track = pd.read_csv("songs.csv", usecols = col_list)
# col_list2 = ["track_id", "unit_price"]
# price = pd.read_csv("invoice.csv", usecols = col_list2)
# track = pd.merge(track, price, on="track_id", how="left")
# track = track.drop('track_id', 1)
# track['unit_price'] = track['unit_price'].fillna(0)
#track.head()

track.to_sql('track',           # Table to load to
            engine,               # Engine created above
            schema = username,    # Schema where table lives
            if_exists = 'append', # If table found, add data
            index = False,        # Ignore data frame row index
            chunksize = 50        # Load 50 records from data frame at a time
        )

## Once Loaded
  * Write SQL to show the `COUNT(*)` from each table loaded.
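
One possible shape for the row-count check is a loop over a fixed list of table names. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database with two toy tables so that it runs without access to the course server. Against the PostgreSQL database, the same `SELECT COUNT(*)` statements would be run with schema-qualified names such as `jch5x8.album`.

```python
import sqlite3

# In-memory SQLite stands in for the course database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE album(album_id INTEGER PRIMARY KEY, album TEXT);
    CREATE TABLE track(track_id INTEGER PRIMARY KEY,
                       album_id INTEGER REFERENCES album,
                       song TEXT);
    INSERT INTO album VALUES (1, 'Album X'), (2, 'Album Y');
    INSERT INTO track VALUES (1, 1, 'Song A'), (2, 1, 'Song B'), (3, 2, 'Song C');
    """
)

# Table names cannot be bound as query parameters, so they come from a fixed,
# hand-written list (never from user input).
tables = ["album", "track"]

counts = {}
for table in tables:
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    counts[table] = count

for table, count in counts.items():
    print(f"{table}: {count}")

conn.close()
```

Comparing each count with the number of rows in the corresponding carved CSV is a quick way to confirm that a load neither dropped nor duplicated records.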

In [None]:
# # CLI
# UPDATE contact 
# SET city_id = ci.city_id
# FROM city ci
# WHERE contact.city = ci.city;

# UPDATE customer 
# SET contact_id = c.contact_id
# FROM contact c
# WHERE customer.customer_id = c.customer_id;

# UPDATE track
# SET album_id = a.album_id
# FROM album a
# WHERE track.track = a.track;

# ALTER TABLE contact
# DROP COLUMN city;

# ALTER TABLE album
# DROP COLUMN track;

# ALTER TABLE track
# DROP COLUMN track;

# Write some basic queries.


## List each artist and the average bytes per song.

In [None]:
# Create SQL query
SQL = "SELECT artist, AVG(bytes)::REAL AS \"Avg Bytes per Song\" "
SQL += "FROM jch5x8.album a JOIN jch5x8.track t "
SQL += "USING (album_id) " 
SQL += "GROUP BY artist;"

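# Run the query on a DB-API (psycopg2) connection borrowed from the engine;
# the SQLAlchemy Connection created earlier has no cursor() method.
connection = engine.raw_connection()
try:
    with connection.cursor() as cursor:
        cursor.execute(SQL)
        results = cursor.fetchall()
finally:
    connection.close()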

print("\nArtist, Avg. Bytes per Song")
# Display results
for row in results:
    print(row)

## List the average number of tracks per album for each artist.

In [None]:
# Create SQL query
SQL = "SELECT al.artist, AVG(album_count.tracks)::REAL AS \"Avg Tracks per Album\" "
SQL += "FROM ( "
SQL += "SELECT DISTINCT a.album_id, "
SQL += "COUNT(t.track_id) OVER (PARTITION BY a.album_id) as tracks "  # partition by id: album names may repeat
SQL += "FROM jch5x8.track t, jch5x8.album a "
SQL += "WHERE t.album_id = a.album_id) AS album_count, "
SQL += "jch5x8.album al "
SQL += "WHERE album_count.album_id = al.album_id "
SQL += "GROUP BY al.artist;"

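# Run the query on a DB-API (psycopg2) connection borrowed from the engine;
# the SQLAlchemy Connection created earlier has no cursor() method.
connection = engine.raw_connection()
try:
    with connection.cursor() as cursor:
        cursor.execute(SQL)
        results = cursor.fetchall()
finally:
    connection.close()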

print("\nArtist, Avg. Tracks per Album")
# Display results
for row in results:
    print(row)

## List the top five customers in terms of tracks purchased.

In [None]:
# Create SQL query
# I originally just counted the tracks purchased, since that seems to be what is asked.
# Every customer, however, turned out to have purchased exactly 38 tracks. I double-checked
# this in Excel in case my data carpentry or my query was wrong, and Excel confirmed 38.
# Since "purchased" was mentioned, I rank the top five by how much each customer spent.
SQL = "SELECT c.first_name, c.last_name, track_count.tracks, track_count.total_spent "
SQL += "FROM jch5x8.customer c, ( "
SQL += "SELECT ic.customer_id, COUNT(it.track_id) as tracks "
SQL += ", SUM(it.unit_price) as total_spent "
SQL += "FROM jch5x8.invoice_customers ic, jch5x8.invoice_tracks it "
SQL += "WHERE ic.invoice_id = it.invoice_id "
SQL += "GROUP BY ic.customer_id "
SQL += "ORDER BY total_spent DESC "
SQL += "LIMIT 5) AS track_count "
SQL += "WHERE c.customer_id = track_count.customer_id "
SQL += "ORDER BY track_count.total_spent DESC;"

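# Run the query on a DB-API (psycopg2) connection borrowed from the engine;
# the SQLAlchemy Connection created earlier has no cursor() method.
connection = engine.raw_connection()
try:
    with connection.cursor() as cursor:
        cursor.execute(SQL)
        results = cursor.fetchall()
finally:
    connection.close()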

print("\nFirst Name, Last Name, Tracks Purchased, Total Spent")
# Display results
for row in results:
    print(row)

## List the top genre preference per customer.

In [None]:
# Create SQL query


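# Run the query on a DB-API (psycopg2) connection borrowed from the engine;
# the SQLAlchemy Connection created earlier has no cursor() method.
connection = engine.raw_connection()
try:
    with connection.cursor() as cursor:
        cursor.execute(SQL)
        results = cursor.fetchall()
finally:
    connection.close()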

print("\nFirst Name, Last Name, Preferred Genre")
# Display results
for row in results:
    print(row)
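
The cell above leaves the SQL string for this question unwritten. One way the query could look is sketched below, assuming the `customer`, `album`, `track`, `invoice_customers`, and `invoice_tracks` layout from the first set of `CREATE TABLE` statements. The toy rows and the in-memory SQLite database are invented purely so the example runs on its own; on the course server the table names would carry the `jch5x8.` schema prefix. "Top genre" is taken here to mean the genre with the most purchased tracks, which is an interpretation rather than something the assignment states.

```python
import sqlite3

# In-memory SQLite with made-up rows, standing in for the PostgreSQL schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer(customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE album(album_id INTEGER PRIMARY KEY, album TEXT, genre TEXT);
    CREATE TABLE track(track_id INTEGER PRIMARY KEY, album_id INTEGER REFERENCES album);
    CREATE TABLE invoice_customers(invoice_id INTEGER PRIMARY KEY,
                                   customer_id INTEGER REFERENCES customer);
    CREATE TABLE invoice_tracks(it_id INTEGER PRIMARY KEY,
                                invoice_id INTEGER REFERENCES invoice_customers,
                                track_id INTEGER REFERENCES track);
    INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Kay');
    INSERT INTO album VALUES (1, 'Album X', 'Rock'), (2, 'Album Y', 'Jazz');
    INSERT INTO track VALUES (1, 1), (2, 1), (3, 2), (4, 2);
    INSERT INTO invoice_customers VALUES (1, 1), (2, 2);
    INSERT INTO invoice_tracks VALUES
        (1, 1, 1), (2, 1, 2), (3, 1, 3),   -- Ann: two Rock tracks, one Jazz track
        (4, 2, 3), (5, 2, 4), (6, 2, 1);   -- Bob: two Jazz tracks, one Rock track
    """
)

# Step 1 (genre_counts): purchases per customer and genre.
# Step 2 (best): the highest purchase count each customer reached.
# Step 3: keep the genre rows that match that maximum; ties yield several rows.
SQL = """
WITH genre_counts AS (
    SELECT ic.customer_id, a.genre, COUNT(*) AS purchases
    FROM invoice_customers ic
    JOIN invoice_tracks it ON it.invoice_id = ic.invoice_id
    JOIN track t ON t.track_id = it.track_id
    JOIN album a ON a.album_id = t.album_id
    GROUP BY ic.customer_id, a.genre
)
SELECT c.first_name, c.last_name, gc.genre
FROM genre_counts gc
JOIN (
    SELECT customer_id, MAX(purchases) AS top_purchases
    FROM genre_counts
    GROUP BY customer_id
) best
  ON best.customer_id = gc.customer_id
 AND best.top_purchases = gc.purchases
JOIN customer c ON c.customer_id = gc.customer_id
ORDER BY c.last_name, c.first_name, gc.genre
"""

results = conn.execute(SQL).fetchall()
conn.close()

print("\nFirst Name, Last Name, Preferred Genre")
for row in results:
    print(row)
```

A window function such as `RANK() OVER (PARTITION BY customer_id ORDER BY purchases DESC)` would be an equally reasonable way to pick the top row per customer; the join on `MAX(purchases)` is used here only because it needs nothing beyond plain aggregates.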

# Save your notebook, then `File > Close and Halt`