# Database Design and Load Exercise

### Steps
 1. Analyze - Completed last module
 2. Design - Completed last module
 3. Data Carpentry
 4. Data Loading
 5. Analytical Queries

# Smart Stores ERD
Last module, everyone built an ERD based on the entities and attributes below.



### orders :
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior_order`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

### products :
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

### aisles :
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

### departments :
* `department_id`: department identifier
* `department`: the name of the department

### order_products :
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

#### where SET is one of the three following evaluation sets (eval_set in orders):

* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)

# ERD Diagram

For this assignment, everyone should use the ERD below, which is based on the requirements above. It is fine if your diagram from last week's exercise was a little different, as long as it captured all of the requirements and relationships.

![erd](../images/db_design.png)

### M3:E1:Q1 - Create the tables in the database
 1. Convert the entities and attributes into a database schema for Postgres.
 1. Remember to prefix table names with your database id, e.g., _SSO_.
    * Example: `CREATE TABLE SSO.orders ... `
    
**Remember to specify your Primary Keys and Foreign Keys for each table!** 

In [1]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

········


In [2]:
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'garwoode', 
                              host = 'dbase.dsa.missouri.edu',
                              password = mypasswd)

In [3]:
# Then drop the reference to the password so it is no longer available in the notebook
del mypasswd

In [None]:
CREATE_TABLES = """
DROP TABLE IF EXISTS garwoode.orders cascade;
CREATE TABLE garwoode.orders(
   order_id INT PRIMARY KEY,
   user_id INT,
   eval_set VARCHAR(5),
   order_number INT,
   order_dow INT,
   order_hour_of_day INT,
   days_since_prior_order INT
);


DROP TABLE IF EXISTS garwoode.departments cascade;
CREATE TABLE garwoode.departments(
   department_id INT PRIMARY KEY,
   department VARCHAR(255)
);

DROP TABLE IF EXISTS garwoode.aisles cascade;
CREATE TABLE garwoode.aisles(
   aisle_id INT PRIMARY KEY,
   aisle VARCHAR(255)
);

DROP TABLE IF EXISTS garwoode.products cascade;
CREATE TABLE garwoode.products(
   product_id INT PRIMARY KEY,
   product_name VARCHAR(255),
   aisle_id INT,
   department_id INT,
   FOREIGN KEY (aisle_id)
       REFERENCES garwoode.aisles(aisle_id),
   FOREIGN KEY (department_id)
       REFERENCES garwoode.departments(department_id)
);

DROP TABLE IF EXISTS garwoode.order_products cascade;
CREATE TABLE garwoode.order_products(
   order_id INT, 
   product_id INT, 
   add_to_cart_order INT,
   reordered INT,
   PRIMARY KEY (order_id,product_id),
   FOREIGN KEY (order_id)
       REFERENCES garwoode.orders(order_id),
   FOREIGN KEY (product_id)
       REFERENCES garwoode.products(product_id)
);

"""
with connection, connection.cursor() as cursor:
    cursor.execute(CREATE_TABLES)
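
Because `products` references `aisles` and `departments`, and `order_products` references `orders` and `products`, the parent tables have to be populated before their children. The sketch below illustrates that behavior with Python's built-in `sqlite3` module as a stand-in for Postgres. That substitution is an assumption made only so the example runs without a database server, the helper `try_insert_product` is hypothetical, and the aisle name is a toy value rather than the real one.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE aisles(
   aisle_id INT PRIMARY KEY,
   aisle VARCHAR(255)
);
CREATE TABLE products(
   product_id INT PRIMARY KEY,
   product_name VARCHAR(255),
   aisle_id INT,
   FOREIGN KEY (aisle_id) REFERENCES aisles(aisle_id)
);
""")

def try_insert_product(row):
    """Return True if the product row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO products VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Child row before its parent exists: rejected by the foreign key.
loaded_before_parent = try_insert_product((1, "Chocolate Sandwich Cookies", 61))

# Load the parent aisle first (toy name), then the same child row is accepted.
with conn:
    conn.execute("INSERT INTO aisles VALUES (?, ?)", (61, "toy aisle name"))
loaded_after_parent = try_insert_product((1, "Chocolate Sandwich Cookies", 61))

print(loaded_before_parent, loaded_after_parent)
```

This is why the load cells further down fill `aisles` and `departments` before `products`, and `orders` and `products` before `order_products`.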

### M3:E1:Q2 - Load data from the following files:

## `/dsa/data/all_datasets/instacart/orders.csv`
 * 3421084 Rows
 * File Preview 
```
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
```
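
The preview shows two details that matter when loading: the first order for a user has an empty `days_since_prior_order`, and the other values are written as floats such as `15.0` even though the column is declared `INT`. As a minimal sketch, assuming the standard-library `csv` module instead of pandas and a hypothetical helper named `to_int_or_none`, the preview rows can be converted into insert-ready tuples like this:

```python
import csv
import io

# The preview rows from orders.csv, inlined so the example is self-contained.
PREVIEW = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
"""

def to_int_or_none(text):
    """Convert '' to None (SQL NULL) and numeric text such as '15.0' or '08' to int."""
    return None if text == "" else int(float(text))

rows = []
for rec in csv.DictReader(io.StringIO(PREVIEW)):
    rows.append((
        int(rec["order_id"]),
        int(rec["user_id"]),
        rec["eval_set"],
        int(rec["order_number"]),
        int(rec["order_dow"]),
        to_int_or_none(rec["order_hour_of_day"]),
        to_int_or_none(rec["days_since_prior_order"]),
    ))

for row in rows:
    print(row)
```

A `None` in the last position is what a database driver such as psycopg2 sends as `NULL`.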

## `/dsa/data/all_datasets/instacart/products.csv`
 * 49689 Rows
 * File Preview 
```
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
```

## `/dsa/data/all_datasets/instacart/aisles.csv`
 * 135 Rows
 * File Preview 
```
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
```

## `/dsa/data/all_datasets/instacart/departments.csv`
 * 22 Rows
 * File Preview 
```
department_id,department
1,frozen
2,other
3,bakery
4,produce
```

## `/dsa/data/all_datasets/instacart/order_products.csv`
 * 1384618 Rows
 * File Preview 
```
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
```
     

## In each designated cell, load the data using Python



### M3:E1:Q2a - Orders 

In [None]:
import numpy as np
import pandas as pd
from psycopg2.extensions import register_adapter, AsIs

orders_file = '/dsa/data/all_datasets/instacart/orders.csv'
orders = pd.read_csv(orders_file)
orders.head()

# Let psycopg2 pass NumPy numeric types through as SQL literals
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)
# Change NaN to None so missing values are inserted as NULL.
# Casting to object first keeps pandas from coercing None back to NaN.
orders = orders.astype(object).where(pd.notnull(orders), None)

INSERT_SQL = 'INSERT INTO garwoode.orders '
INSERT_SQL += ' (order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order ) VALUES '
# the %s are placeholders for a parameterized query
INSERT_SQL += '(%s,%s,%s,%s,%s,%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    for row in orders.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        cursor.execute(INSERT_SQL,row)










### M3:E1:Q2c - Aisles

In [None]:
import pandas as pd
aisles_file = '/dsa/data/all_datasets/instacart/aisles.csv'
aisles = pd.read_csv(aisles_file)
aisles.head()

import numpy as np
# Magic adapters for the Numpy Fun of Pandas
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)
#aisles = aisles.where(pd.notnull(aisles), None)

INSERT_SQL = 'INSERT INTO garwoode.aisles '
INSERT_SQL += ' (aisle_id,aisle ) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in aisles.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        cursor.execute(INSERT_SQL,row)
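
The comments in the cell above note that placeholders protect against SQL injection. A small sketch of what that means, assuming Python's built-in `sqlite3` as a stand-in for Postgres (note that `sqlite3` spells the placeholder `?` where psycopg2 uses `%s`) and a made-up hostile aisle name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aisles(aisle_id INT PRIMARY KEY, aisle VARCHAR(255))")

# A hostile value that would break out of a naively concatenated SQL string.
hostile_name = "x'); DROP TABLE aisles; --"

# With placeholders the driver treats the value purely as data,
# so the quote and the DROP TABLE text are stored, not executed.
with conn:
    conn.execute("INSERT INTO aisles (aisle_id, aisle) VALUES (?, ?)", (1, hostile_name))

stored = conn.execute("SELECT aisle FROM aisles WHERE aisle_id = ?", (1,)).fetchone()[0]
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'aisles'"
).fetchone()[0]
print(stored == hostile_name, table_count)
```

The value round-trips unchanged and the table still exists, which is the property the `%s` placeholders provide in the psycopg2 cells.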












### M3:E1:Q2d - Departments

In [None]:
import pandas as pd
departments_file = '/dsa/data/all_datasets/instacart/departments.csv'
departments = pd.read_csv(departments_file)
departments.head()

print(list(departments))
s = ''
for i in list(departments):
    s += '%s,'
print(s)

import numpy as np
# Magic adapters for the Numpy Fun of Pandas
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)
departments = departments.where(pd.notnull(departments), None)

INSERT_SQL = 'INSERT INTO garwoode.departments '
INSERT_SQL += ' (department_id,department ) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in departments.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        cursor.execute(INSERT_SQL,row)
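
The loop near the top of the cell above builds the placeholder string one `%s,` at a time, which leaves a trailing comma. A possible alternative, sketched here with a hypothetical helper named `build_insert`, joins the placeholders instead. The table and column names are interpolated directly into the SQL text, so this assumes they come from trusted code and never from user input.

```python
def build_insert(table, columns):
    """Build a parameterized INSERT statement with one %s per column."""
    placeholders = ",".join(["%s"] * len(columns))
    column_list = ",".join(columns)
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"

sql = build_insert("garwoode.departments", ["department_id", "department"])
print(sql)
```

With a pandas DataFrame, `list(departments)` could be passed as `columns`, since it yields the column names in file order.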












### M3:E1:Q2b - Products







In [None]:

import pandas as pd
products_file = '/dsa/data/all_datasets/instacart/products.csv'
products = pd.read_csv(products_file)
products.head()


import numpy as np
# Magic adapters for the Numpy Fun of Pandas
#register_adapter(np.int64,AsIs)
#register_adapter(np.float64,AsIs)
products = products.where(pd.notnull(products), None)

INSERT_SQL = 'INSERT INTO garwoode.products '
INSERT_SQL += ' (product_id,product_name,aisle_id,department_id ) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s,%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in products.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        cursor.execute(INSERT_SQL,row)









In [None]:
import numpy as np
import pandas as pd
from psycopg2.extensions import register_adapter, AsIs

df1 = pd.read_csv('/dsa/data/all_datasets/instacart/products.csv', sep=',', encoding='utf-8')

# The connection belongs to psycopg2, so register psycopg2 adapters (not sqlite3 ones)
register_adapter(np.int64, AsIs)
register_adapter(np.float64, AsIs)

# NOTE: garwoode.products2 is not created by CREATE_TABLES above;
# it is assumed to already exist with the same columns as products.
INSERT_SQL = 'INSERT INTO garwoode.products2 '
INSERT_SQL += ' (product_id,product_name,aisle_id,department_id ) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s,%s,%s)'

# name=None (not the string 'None') yields plain tuples
# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    for row in df1.itertuples(index=False, name=None):
        cursor.execute(INSERT_SQL,row)

### M3:E1:Q2e - Order Products

In [None]:
import pandas as pd
order_products_file = '/dsa/data/all_datasets/instacart/order_products.csv'
order_products = pd.read_csv(order_products_file)
order_products.head()

import numpy as np
# Magic adapters for the Numpy Fun of Pandas
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)
order_products = order_products.where(pd.notnull(order_products), None)

INSERT_SQL = 'INSERT INTO garwoode.order_products '
INSERT_SQL += ' (order_id,product_id,add_to_cart_order,reordered ) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s, %s, %s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in order_products.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        cursor.execute(INSERT_SQL,row)











--- 

### M3:E1:Q3

In each of the cells below, use Python to pull the data out of the database. 

#### M3:E1:Q3a - Find the top 10 products, based on number of orders.
Display in a table!
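
One way to approach a "top N by number of orders" question and show the result as a table is sketched below. It assumes Python's built-in `sqlite3` as a stand-in for Postgres and uses a few hypothetical toy rows rather than the Instacart data, so the names and counts are illustrative only. The same `SELECT` shape and the `cursor.description` lookup should carry over to a psycopg2 cursor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products(product_id INT PRIMARY KEY, product_name VARCHAR(255));
CREATE TABLE order_products(order_id INT, product_id INT, PRIMARY KEY (order_id, product_id));
""")
# Hypothetical toy rows, not taken from the Instacart files.
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "apples"), (2, "bread"), (3, "milk")])
conn.executemany("INSERT INTO order_products VALUES (?, ?)",
                 [(10, 1), (10, 2), (11, 1), (12, 1), (12, 3), (13, 3)])

TOP_PRODUCTS = """
SELECT p.product_id, p.product_name, COUNT(*) AS num_orders
FROM products p
JOIN order_products op USING (product_id)
GROUP BY p.product_id, p.product_name
ORDER BY num_orders DESC, p.product_id
LIMIT 10
"""
cursor = conn.execute(TOP_PRODUCTS)
headers = [col[0] for col in cursor.description]  # column names of the result
rows = cursor.fetchall()

# Print a simple fixed-width table: header line, then one line per row.
print(f"{headers[0]:>10}  {headers[1]:<12}  {headers[2]:>10}")
for product_id, product_name, num_orders in rows:
    print(f"{product_id:>10}  {product_name:<12}  {num_orders:>10}")
```

Selecting the product name and the count alongside the id makes the table readable on its own, and the secondary sort on `product_id` keeps ties in a stable order.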

In [9]:



SELECT = '(SELECT p.product_id '
SELECT += 'FROM Products p '
SELECT += 'JOIN Order_Products op USING(product_id) '
SELECT += 'JOIN Orders o USING(order_id) '
SELECT += 'GROUP BY p.product_id '
SELECT += 'ORDER BY COUNT (*) DESC '
SELECT += 'LIMIT 10)'


with connection, connection.cursor() as cursor:
    cursor.execute(SELECT)
    # Keep the rows: fetchall() drains the cursor, so it can only be read once
    data = cursor.fetchall()

print(data)

[(24852,), (13176,), (21137,), (21903,), (47626,), (47766,), (47209,), (16797,), (26209,), (27966,)]


#### M3:E1:Q3b - Display how many products there are in each department
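
A possible shape for this query is a `GROUP BY` on the department joined to its products. The sketch below assumes Python's built-in `sqlite3` as a stand-in for Postgres. The department names come from the `departments.csv` preview, but the products are hypothetical toy rows, so the counts are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments(department_id INT PRIMARY KEY, department VARCHAR(255));
CREATE TABLE products(
   product_id INT PRIMARY KEY,
   product_name VARCHAR(255),
   department_id INT,
   FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
""")
# Department names are from the file preview; the products are hypothetical.
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "frozen"), (3, "bakery"), (4, "produce")])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "toy peas", 1), (2, "toy bagel", 3),
                  (3, "toy roll", 3), (4, "toy loaf", 3)])

PRODUCTS_PER_DEPARTMENT = """
SELECT d.department, COUNT(p.product_id) AS num_products
FROM departments d
LEFT JOIN products p ON p.department_id = d.department_id
GROUP BY d.department_id, d.department
ORDER BY num_products DESC, d.department
"""
counts = conn.execute(PRODUCTS_PER_DEPARTMENT).fetchall()
for department, num_products in counts:
    print(f"{department:<10} {num_products:>5}")
```

The `LEFT JOIN` with `COUNT(p.product_id)` keeps departments that have no products and reports them as 0; an inner join would drop them from the result.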

# Save your notebook, then `File > Close and Halt`

---