# Database /  Final Project

There will be a SQL project and then some conceptual questions. Both parts should be completed.

## SQL Project
Your task for this project is to build a PostGIS database using the CSV files below, then perform some analytics.

This project will involve three related CSV files.
  * [play_list_music.csv](./play_list_music.csv)
  * [play_list_track_customers.csv](./play_list_track_customers.csv)
  * [play_list_track_buy.csv](./play_list_track_buy.csv)
  
This project should be broken down into the following tasks:
  1. Download and inspect the files.
  1. Design a database that is **properly normalized**.
  1. Implement your database design.
  1. Load data from files into database.
  1. Write some basic queries.



All your code should be implemented in this notebook.
Below, the notebook is partitioned into markdown and code execution cells.

## Task 1: Design a database that is _properly normalized_.

Note: You can expect approximately XX tables to be derived from XX CSV files.

There is no implementation cell, the deliverable is the ERD.

Upload it to the same directory as this notebook and fill in the name of the file here :   ERD_database.

![erd](ERD_database.png)

## Task 2: Connect to the database

In [11]:
import getpass
import psycopg2
import numpy as np
import pandas as pd
from psycopg2.extensions import adapt, register_adapter, AsIs

# This collects a masked password from the user
mypasswd = getpass.getpass()

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'dlfy6', 
                              host = 'dbase.dsa.missouri.edu',
                              password = mypasswd)

# Then remove the password from computer memory
del mypasswd

········


In [12]:
cursor = connection.cursor()

## Task 3: Implement your database design.
Use the cells below to add your CREATE TABLE statements. Add extra cells as necessary

In [18]:
create_new_customers = """
DROP TABLE IF EXISTS dlfy6.new_customers;
CREATE TABLE dlfy6.new_customers (
    customerId INT,
    firstName varchar(100), 
    lastName varchar(100),
    company varchar(100),
    address varchar(100),
    city varchar(100),
    state varchar(100),
    country varchar(100),
    postalCode varchar(100),
    phone varchar(100),
    fax varchar(100),
    email varchar(100),
    
    PRIMARY KEY (customerId)
);"""

create_new_songs = """
DROP TABLE IF EXISTS dlfy6.new_songs;
CREATE TABLE dlfy6.new_songs (
    id INT,
    artist varchar(200), 
    album varchar(200),
    song varchar(200),
    playlist varchar(100),
    media_type varchar(100),
    genre varchar(100),
    bytes BIGINT,
    
    PRIMARY KEY (id)
);"""

create_new_invoice_by_customer = """
DROP TABLE IF EXISTS dlfy6.new_invoice_by_customer;
CREATE TABLE dlfy6.new_invoice_by_customer (
    invoiceId INT,
    customerId INT, 
    billingAddress varchar(100),
    billingCity varchar(100),
    
    PRIMARY KEY (invoiceId),
    FOREIGN KEY (customerId)
        REFERENCES dlfy6.new_customers(customerId)
);"""

create_new_invoice_by_trackid = """
DROP TABLE IF EXISTS dlfy6.new_invoice_by_trackid;
CREATE TABLE dlfy6.new_invoice_by_trackid (
    invoiceId INT,
    trackId INT, 
    unitPrice NUMERIC,
    
    FOREIGN KEY (invoiceId)
        REFERENCES dlfy6.new_invoice_by_customer(invoiceId),
    FOREIGN KEY (trackId)
        REFERENCES dlfy6.new_songs(id)
);"""



In [14]:
# CREATE new_customers table

with connection, connection.cursor() as cursor:
    cursor.execute(create_new_customers)

    


In [15]:
# CREATE new_songs table

with connection, connection.cursor() as cursor:
    cursor.execute(create_new_songs)

    


In [16]:
# CREATE new_invoice_by_customer table

with connection, connection.cursor() as cursor:
    cursor.execute(create_new_invoice_by_customer)

    


In [19]:
# CREATE new_invoice_by_trackid table

with connection, connection.cursor() as cursor:
    cursor.execute(create_new_invoice_by_trackid)

    


## Task 4: Data Cleaning and Grooming
Use Excel to carve the provided CSV files above into the set of files you need to load into your database. You can do the data cleaning in Python instead if you prefer, although most people will probably find it easier in Excel.
This step could include removing unneeded columns, removing duplicate rows, and more.

   1. Example: Save File As *new_csv_name.csv*
   1. Groom data
   1. Save File
   1. Navigate in JupyterHub folder view (your first JupyterHub tab)
   1. Upload file
 
There is no implementation cell, the deliverable is the uploaded files.
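
If you take the Python route instead of Excel, the grooming steps can be sketched with pandas. This is only a minimal sketch under stated assumptions: the small in-memory `raw_csv` text and its column names are hypothetical stand-ins for one of the provided CSV files (the real layout is not reproduced here), and `new_customers.csv` is just an example output name.

```python
import io
import pandas as pd

# Hypothetical stand-in for a raw purchase file: one row per purchased track,
# so the customer details repeat on every row.
raw_csv = io.StringIO(
    "InvoiceId,CustomerId,FirstName,LastName,TrackId,UnitPrice\n"
    "1,2,Leonie,Köhler,2,0.99\n"
    "1,2,Leonie,Köhler,4,0.99\n"
    "2,4,Bjørn,Hansen,6,0.99\n"
    "2,4,Bjørn,Hansen,6,0.99\n"  # exact duplicate row (assumed to be an error)
)
raw = pd.read_csv(raw_csv)

# Step 1: remove exact duplicate rows.
raw = raw.drop_duplicates()

# Step 2: one row per customer (customer attributes depend only on CustomerId).
customers = (
    raw[["CustomerId", "FirstName", "LastName"]]
    .drop_duplicates(subset="CustomerId")
    .set_index("CustomerId")
)

# Step 3: one row per invoice.
invoices = raw[["InvoiceId", "CustomerId"]].drop_duplicates().set_index("InvoiceId")

# Step 4: one row per invoice line (which track was bought on which invoice).
invoice_lines = raw[["InvoiceId", "TrackId", "UnitPrice"]].reset_index(drop=True)

# customers.to_csv("new_customers.csv")  # uncomment to write the groomed file
```

Each resulting frame keeps only the columns that depend on its own key, which is the same carving you would otherwise do by hand in a spreadsheet.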

## Task 5: Load the data from the files into the database.
Load at least 3 tables. See Task 7 before you choose which tables to load, because you will be querying them.
   1. A table with no foreign keys
   1. A table with at least one foreign key
   1. Another table. 


In [20]:
# load new_customers.csv file

new_customers = pd.read_csv("new_customers.csv",index_col='CustomerId',encoding = 'utf8')
new_customers.head()

| CustomerId | FirstName | LastName | Company | Address | City | State | Country | PostalCode | Phone | Fax | Email |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br |
| 2 | Leonie | Köhler |  | Theodor-Heuss-Straße 34 | Stuttgart |  | Germany | 70174 | +49 0711 2842222 |  | leonekohler@surfeu.de |
| 3 | François | Tremblay |  | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 |  | ftremblay@gmail.com |
| 4 | Bjørn | Hansen |  | Ullevålsveien 14 | Oslo |  | Norway | 171 | +47 22 44 22 22 |  | bjorn.hansen@yahoo.no |
| 5 | František | Wichterlová | JetBrains s.r.o. | Klanova 9/506 | Prague |  | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | frantisekw@jetbrains.com |


In [22]:
# load new_customers dataframe to database table
cursor = connection.cursor()

new_customers = new_customers.where(pd.notnull(new_customers), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# name=None yields plain tuples, which psycopg2 binds positionally
for row in new_customers.itertuples(index=True, name=None):
    #print(row)
    cursor.execute('INSERT INTO dlfy6.new_customers VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)',row)

   
# Save (commit) the changes
connection.commit()

In [23]:
# load new_songs.csv file

new_songs = pd.read_csv("new_songs.csv",index_col='id',encoding = 'utf8')
new_songs.head()

| id | artist | album | song | playlist | media_type | genre | Bytes |
|---|---|---|---|---|---|---|---|
| 1 | AC/DC | For Those About To Rock We Salute You | For Those About To Rock (We Salute You) | Music | MPEG audio file | Rock | 11170334 |
| 6 | AC/DC | For Those About To Rock We Salute You | Put The Finger On You | Music | MPEG audio file | Rock | 6713451 |
| 7 | AC/DC | For Those About To Rock We Salute You | Let's Get It Up | Music | MPEG audio file | Rock | 7636561 |
| 8 | AC/DC | For Those About To Rock We Salute You | Inject The Venom | Music | MPEG audio file | Rock | 6852860 |
| 9 | AC/DC | For Those About To Rock We Salute You | Snowballed | Music | MPEG audio file | Rock | 6599424 |


In [24]:
# load new_songs dataframe to database table
new_songs = new_songs.where(pd.notnull(new_songs), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# name=None yields plain tuples, which psycopg2 binds positionally
for row in new_songs.itertuples(index=True, name=None):
    #print(row)
    cursor.execute('INSERT INTO dlfy6.new_songs VALUES(%s,%s,%s,%s,%s,%s,%s,%s)',row)

   
# Save (commit) the changes
connection.commit()

In [25]:
# load new_invoice_by_customer.csv file
new_invoice_by_customer = pd.read_csv("new_invoice_by_customer.csv",index_col='InvoiceId',encoding = 'utf8')
new_invoice_by_customer.head()

| InvoiceId | CustomerId | BillingAddress | BillingCity |
|---|---|---|---|
| 1 | 2 | Theodor-Heuss-Straße 34 | Stuttgart |
| 2 | 4 | Ullevålsveien 14 | Oslo |
| 3 | 8 | Grétrystraat 63 | Brussels |
| 4 | 14 | 8210 111 ST NW | Edmonton |
| 5 | 23 | 69 Salem Street | Boston |


In [26]:
# load new_invoice_by_customer dataframe to database table
new_invoice_by_customer = new_invoice_by_customer.where(pd.notnull(new_invoice_by_customer), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# name=None yields plain tuples, which psycopg2 binds positionally
for row in new_invoice_by_customer.itertuples(index=True, name=None):
    #print(row)
    cursor.execute('INSERT INTO dlfy6.new_invoice_by_customer VALUES(%s,%s,%s,%s)',row)

   
# Save (commit) the changes
connection.commit()

In [27]:
# load new_invoice_by_trackid.csv file
new_invoice_by_trackid = pd.read_csv("new_invoice_by_trackid.csv",encoding = 'utf8')
new_invoice_by_trackid.head()

|  | InvoiceId | trackid | UnitPrice |
|---|---|---|---|
| 0 | 1 | 2 | 0.99 |
| 1 | 1 | 4 | 0.99 |
| 2 | 2 | 6 | 0.99 |
| 3 | 2 | 8 | 0.99 |
| 4 | 2 | 10 | 0.99 |


In [28]:
# load new_invoice_by_trackid dataframe to database table
new_invoice_by_trackid = new_invoice_by_trackid.where(pd.notnull(new_invoice_by_trackid), None)

# name=None yields plain tuples, which psycopg2 binds positionally
for row in new_invoice_by_trackid.itertuples(index=False, name=None):
    #print(row)
    cursor.execute('INSERT INTO dlfy6.new_invoice_by_trackid VALUES(%s,%s,%s)',row)

   
# Save (commit) the changes
connection.commit()
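
The row-by-row loading cells above can also be sketched as one batched call per table. The helper names `rows_for_insert` and `bulk_insert` below are hypothetical; `execute_values` is psycopg2's batching helper from `psycopg2.extras`, and the sample rows are made up to mirror the shape of `new_customers`. Treat this as a sketch, assuming the table name passed in is a trusted, hard-coded string.

```python
import math


def rows_for_insert(records):
    """Turn an iterable of row tuples into DB-ready tuples, mapping NaN to None."""
    def clean(value):
        # pandas represents missing values as float NaN; the database expects NULL.
        if isinstance(value, float) and math.isnan(value):
            return None
        return value
    return [tuple(clean(v) for v in row) for row in records]


def bulk_insert(cursor, table, rows):
    """Insert all rows with one call; `table` must be a trusted, hard-coded name."""
    # Imported here so rows_for_insert can be tried without a database driver.
    from psycopg2.extras import execute_values
    execute_values(cursor, f"INSERT INTO {table} VALUES %s", rows)


# Made-up sample shaped like (id, first name, company); the second company is missing.
sample = [(1, "Luís", "Embraer"), (2, "Leonie", float("nan"))]
rows = rows_for_insert(sample)
print(rows)
```

In this notebook the call would look like `bulk_insert(cursor, "dlfy6.new_customers", rows_for_insert(new_customers.itertuples(index=True, name=None)))`, followed by `connection.commit()`.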

## Task 6: Write count statements to show the data has been loaded.
Write SQL to show the `COUNT(*)` from each table loaded.

In [37]:
SQL ="""
SELECT count(*)
FROM dlfy6.new_songs;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    number_of_songs = cursor.fetchall()
print("Total rows of dlfy6.new_songs are {}".format(number_of_songs[0][0]))    


Total rows of dlfy6.new_songs are 3503


In [40]:
SQL ="""
SELECT count(*)
FROM dlfy6.new_customers;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    number_of_customers = cursor.fetchall()
print("Total rows of dlfy6.new_customers are {}".format(number_of_customers[0][0]))      


Total rows of dlfy6.new_customers are 59


In [41]:
SQL ="""
SELECT count(*)
FROM dlfy6.new_invoice_by_customer;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    number_of_invoices = cursor.fetchall()
print("Total rows of dlfy6.new_invoice_by_customer are {}".format(number_of_invoices[0][0]))     


Total rows of dlfy6.new_invoice_by_customer are 412


In [42]:
SQL ="""
SELECT count(*)
FROM dlfy6.new_invoice_by_trackid;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    number_of_records = cursor.fetchall()
print("Total rows of dlfy6.new_invoice_by_trackid are {}".format(number_of_records[0][0]))  


Total rows of dlfy6.new_invoice_by_trackid are 2240
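
The four count cells repeat the same pattern, so they can be folded into one loop. The sketch below assumes only a DB-API cursor; `count_rows` is a hypothetical helper name, and an in-memory `sqlite3` database with two made-up tables is swapped in for the course PostgreSQL server so the example runs anywhere.

```python
import sqlite3


def count_rows(cursor, tables):
    """Return {table_name: row_count} for a fixed list of trusted table names."""
    counts = {}
    for table in tables:
        # Table names cannot be bound as query parameters,
        # so only hard-coded names should ever be passed in.
        cursor.execute(f"SELECT count(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    return counts


# Stand-in database for demonstration only; not the course server.
demo = sqlite3.connect(":memory:")
demo_cursor = demo.cursor()
demo_cursor.execute("CREATE TABLE new_songs (id INTEGER PRIMARY KEY, song TEXT)")
demo_cursor.executemany(
    "INSERT INTO new_songs VALUES (?, ?)",
    [(1, "Snowballed"), (2, "Inject The Venom")],
)
demo_cursor.execute("CREATE TABLE new_customers (customerId INTEGER PRIMARY KEY)")

counts = count_rows(demo_cursor, ["new_songs", "new_customers"])
for table, n in counts.items():
    print(f"Total rows of {table} are {n}")
```

Against the notebook's connection, the same helper would be called with schema-qualified names such as `"dlfy6.new_songs"` inside a `with connection, connection.cursor() as cursor:` block.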


## Task 7: Write some basic queries for the data you have loaded.
Please state the question that your query is trying to answer as a comment at the top of the code box for that query.
1. a single table query
1. a query that joins two tables
1. a query that performs some aggregate function.

In [45]:
#  How many artists do we have in the database?
#  A single table query

SQL ="""

SELECT count(DISTINCT artist)
FROM dlfy6.new_songs;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    data = cursor.fetchall()
    
print("Total artists are {}".format(data[0][0]))  

Total artists are 204


In [49]:
# Each artist and their average bytes per song (from high to low)
# A query that performs an aggregate function.

SQL ="""

SELECT artist, AVG(bytes) as Average_bytes
FROM dlfy6.new_songs
GROUP BY artist
ORDER BY Average_bytes DESC;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    data = cursor.fetchall()
    data = pd.DataFrame(data,columns =['Artist','Average_bytes'])
    
data

|  | Artist | Average_bytes |
|---|---|---|
| 0 | Battlestar Galactica (Classic) | 536359243.75000000 |
| 1 | Battlestar Galactica | 527533346.40000000 |
| 2 | Heroes | 512231374.21739130 |
| 3 | Aquaman | 492670102.00000000 |
| 4 | Lost | 426654344.65217391 |
| 5 | The Office | 282548216.96226415 |
| 6 | Terry Bozzio, Tony Levin & Steve Stevens | 19041254.000000000000 |
| 7 | Mela Tenenbaum, Pro Musica Prague & Richard Kapp | 16454937.000000000000 |
| 8 | Santana | 15619142.259259259259 |
| 9 | Dennis Chambers | 12653978.555555555556 |


In [63]:
# Top 10 albums by revenue (from high to low)
# A query that joins two tables

SQL ="""

SELECT s.album, SUM(t.unitPrice) as revenue
FROM dlfy6.new_songs s
JOIN dlfy6.new_invoice_by_trackid t
ON s.id = t.trackId
GROUP BY s.album
ORDER BY revenue DESC
LIMIT 10;
"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    data = cursor.fetchall()
    data = pd.DataFrame(data,columns =['Album','Revenue'])
    
data

|  | Album | Revenue |
|---|---|---|
| 0 | Battlestar Galactica (Classic), Season 1 | 35.82 |
| 1 | The Office, Season 3 | 31.84 |
| 2 | Minha Historia | 26.73 |
| 3 | Heroes, Season 1 | 25.87 |
| 4 | Lost, Season 2 | 25.87 |
| 5 | Greatest Hits | 25.74 |
| 6 | Unplugged | 24.75 |
| 7 | Battlestar Galactica, Season 3 | 23.88 |
| 8 | Lost, Season 3 | 21.89 |
| 9 | Acústico | 21.78 |


## Conceptual Questions
Answer the following questions completely and to the best of your ability. Answers should be in your own words. 

## Question 1: Describe a use case for a graph database. What kind of problems would it help answer, and what types of data might you have?

The use case I would like to describe is fraud detection, specifically first-party bank fraud.  
First-party fraud involves fraudsters who apply for credit cards, loans, overdrafts and unsecured banking credit lines with no intention of paying them back. U.S. banks lose tens of billions of dollars every year to first-party fraud. It is very difficult to detect because fraudsters behave very much like legitimate customers until the moment they “bust out”, cleaning out all their accounts and disappearing.

A graph database fits here because we want to catch the fraud during, or even before, the bust-out. According to computerworlduk.com, graph databases are uniquely positioned to spot the connections between large data sets and identify patterns, a useful trait when it comes to spotting complex, modern fraud techniques. Query languages such as Cypher provide simple semantics for detecting rings in the graph and navigating connections in memory, which helps banks catch fraud in real time.

The data might include account holder names, addresses, phone numbers, social security numbers, bank accounts, credit cards, unsecured loans, and so on. The graph database connects these entities to one another. To prevent fraud in real time, banks run entity link analysis queries against the graph at key stages such as:
1. when the account is created,
2. during an investigation,
3. as soon as a credit balance threshold is hit, or
4. when a check bounces.

Banks often frame this work with Gartner's layered fraud prevention approach, which has five layers: endpoint-centric, navigation-centric, account-centric, cross-channel, and entity link analysis. With a traditional relational database, the same analysis needs a set of tables and columns plus complex, join-heavy queries that are expensive to run, and performance degrades sharply as the data and the number of hops between entities grow. A graph database avoids much of this cost because relationships are stored directly and can typically be traversed without repeated joins.
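
To make the idea of entity link analysis concrete, here is a minimal pure-Python sketch of ring detection: accounts that share an identifier are linked, and connected groups of accounts are reported. It is not Cypher and not how a graph database is implemented; the `accounts` data and the `find_rings` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical account records; every value here is made up.
accounts = {
    "acct_1": {"phone": "555-0101", "address": "12 Oak St"},
    "acct_2": {"phone": "555-0101", "address": "98 Elm St"},
    "acct_3": {"phone": "555-0199", "address": "98 Elm St"},
    "acct_4": {"phone": "555-0142", "address": "7 Pine Rd"},
}


def find_rings(accounts):
    """Group accounts linked, directly or indirectly, by a shared identifier."""
    # Index: (field, value) -> accounts that use that identifier.
    by_identifier = defaultdict(set)
    for account, fields in accounts.items():
        for field, value in fields.items():
            by_identifier[(field, value)].add(account)

    # Account-to-account edges via shared identifiers.
    neighbours = defaultdict(set)
    for members in by_identifier.values():
        for account in members:
            neighbours[account] |= members - {account}

    # Depth-first search to collect connected components.
    seen, rings = set(), []
    for start in accounts:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) > 1:  # a single unlinked account is not a ring
            rings.append(sorted(component))
    return rings


rings = find_rings(accounts)
print(rings)
```

Here `acct_1` and `acct_2` share a phone number and `acct_2` and `acct_3` share an address, so all three land in one group even though `acct_1` and `acct_3` have nothing in common directly. That indirect link is the kind of pattern a graph traversal surfaces naturally.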

## Question 2: What are the differences between MongoDB and a relational DB? When would you use one versus the other?

### Data Structure:

- In MySQL (a relational DB), we pre-define the database schema based on requirements and set up rules to govern the relationships between fields in tables. Schema changes on large tables can require taking the database offline or can significantly reduce application performance.

- In MongoDB, there is no need to declare the structure of documents - no schema definition required. If a new field needs to be added to a document, then the field can be created without affecting all other documents in the collection, without updating a central system catalog, and without taking the system offline. 


### The Scalability

- SQL databases are typically scaled vertically, which means that we handle more load on a single server by adding resources such as CPU, RAM or SSD.

- MongoDB, like most NoSQL databases, is horizontally scalable. This means that we handle more traffic by sharding, that is, by spreading the data across additional servers.
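
The idea of routing data by a shard key can be sketched in a few lines. This is a toy illustration of hashed sharding, not how MongoDB implements it internally; the server names in `SHARDS` and the `shard_for` helper are hypothetical.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical server names


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the shard key, so the same key always maps to the same server."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Route ten made-up customer ids to shards.
placement = {customer_id: shard_for(customer_id) for customer_id in range(1, 11)}
for customer_id, shard in placement.items():
    print(customer_id, "->", shard)
```

Each server then stores and serves only its own slice of the data, which is what lets capacity grow by adding machines rather than by upgrading one.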


### Risk:

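- In MySQL, queries built by concatenating user input into SQL strings are exposed to SQL injection attacks.

- MongoDB, on the other hand, builds queries as structured documents (objects) rather than strings. This design reduces the risk of classic string-based injection, although it does not remove injection risk entirely if untrusted input is placed directly into a query document.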
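
Whichever database is used, the usual defence is to pass user input as bound parameters instead of building query text by hand, as the `%s` placeholders in the loading cells earlier in this notebook do. The sketch below shows the difference using an in-memory `sqlite3` database as a stand-in (its placeholder is `?` rather than psycopg2's `%s`); the table and the input string are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)

# Made-up hostile input: the embedded quote tries to rewrite the WHERE clause.
malicious = "nobody@example.com' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so the condition becomes always true.
unsafe_sql = f"SELECT id FROM customers WHERE email = '{malicious}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is bound as a parameter and is only ever treated as a value.
safe_rows = conn.execute(
    "SELECT id FROM customers WHERE email = ?", (malicious,)
).fetchall()

print("unsafe query matched", len(unsafe_rows), "rows")
print("parameterized query matched", len(safe_rows), "rows")
```

The concatenated query returns every customer, while the parameterized one matches nothing because no email equals the hostile string.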




### So Which Database Is Right For Your Business?
- MySQL is a strong choice for any business that benefits from a pre-defined structure and fixed schemas, depends heavily on multi-row transactions, frequently updates and modifies large volumes of records, and has relatively small datasets. Examples include accounting systems and inventory-monitoring systems.

- MongoDB, on the other hand, is a good choice for businesses with rapid growth or with data that has no clear schema definition. We can use MongoDB if we cannot define a schema for our database or if the schema keeps changing, as is often the case with mobile apps, real-time analytics, content management systems, etc. It also fits when the database is expected to grow large, when the data is location-based, or when high availability is required in an unstable environment.





## Question 3: What makes Spark different from MongoDB and a relational DB?

- Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL (Spark SQL), machine learning (MLlib), graph computation (GraphX), and stream processing (Spark Streaming), which can be used together in an application. Programming languages supported by Spark include Java, Python, Scala, and R. Spark is often credited with speeding up application development and making applications more portable and extensible, and the project has claimed speedups of up to 100x over Hadoop MapReduce for some in-memory workloads.
- Apache Spark performs parallel computation on big data, including through SQL queries.
- Spark tries to keep data in memory, which makes it faster.
- Spark is often used with distributed data stores, with popular NoSQL databases such as MongoDB, and with distributed messaging stores.
- Spark supports MapReduce-style processing with map and reduce operations.
- Spark runs on clusters. The SparkContext connects to a cluster manager, which allocates resources across applications. Spark then acquires executors on worker nodes, sends the application code to them, and finally sends them tasks to run.

Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. RDDs have two types of operations:
1. Transformations are operations (such as map, filter, join, etc.) that are performed on an RDD and yield a new RDD.
2. Actions are operations (such as reduce, count, first, etc.) that return a value after running a computation on an RDD.
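
The split between transformations and actions can be illustrated with a toy, single-machine class. `ToyRDD` below is a hypothetical stand-in written in plain Python; it is not Spark's API or implementation and does nothing distributed. It only shows that transformations build a new lazy dataset, while actions trigger the computation and return a value.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy single-machine stand-in for an RDD; NOT Spark's API."""

    def __init__(self, source):
        self._source = source  # callable that returns a fresh iterator

    @classmethod
    def from_list(cls, items):
        data = tuple(items)  # immutable snapshot of the input
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole pipeline and return a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return fold(fn, self._source())


calls = []


def square(x):
    calls.append(x)  # record when the work actually happens
    return x * x


numbers = ToyRDD.from_list(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
work_before_action = len(calls)  # still 0: only transformations so far
total = evens_squared.reduce(lambda a, b: a + b)  # action: 4 + 16
print(work_before_action, total)
```

No squaring happens until `reduce` is called, which mirrors how chained transformations stay lazy until an action needs a result.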




MongoDB and relational databases, on the other hand, are database systems whose main job is to store data, whereas Spark is an engine for processing it.


# SAVE YOUR NOTEBOOK