# Database Design and Load Exercise

### Steps
 1. Analyze - Completed last module
 2. Design - Completed last module
 3. Data Carpentry
 4. Data Loading
 5. Analytical Queries

# Smart Stores ERD
In the last module, everyone was asked to build an ERD based on the entities and attributes below.



### orders :
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

### products :
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

### aisles :
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

### departments :
* `department_id`: department identifier
* `department`: the name of the department

### order_products :
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

#### where SET is one of the following three evaluation sets (eval_set in orders):

* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)

# ERD Diagram

For this assignment, everyone should use the ERD below, which is based on the requirements above. It is fine if your diagram from last week's exercise differed slightly, as long as it captured all of the requirements and relationships.

![erd](../images/db_design.png)

### Activity 1 - Create the tables in the database
 1. Convert the Entities and attributes into a Database schema for Postgres
 1. Remember to prefix table names with your database id, e.g., _SSO_.
    * Example: `CREATE TABLE SSO.Order ... `
    
**Remember to specify your Primary Keys and Foreign Keys for each table!** 
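
To see why the keys matter, here is a minimal sketch of how primary and foreign key constraints reject bad rows. It rests on a few assumptions: an in-memory SQLite database stands in for Postgres so the example runs anywhere, the schema is trimmed to two tables, and the product with id 900001 is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE aisles (
    aisle_id INTEGER PRIMARY KEY,
    aisle    TEXT
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    aisle_id     INTEGER REFERENCES aisles (aisle_id)
);
INSERT INTO aisles VALUES (1, 'prepared soups salads');
""")


def try_insert(sql, params):
    """Run one INSERT and report whether a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


product_sql = "INSERT INTO products VALUES (?, ?, ?)"
results = [
    # Aisle 61 has not been loaded, so the foreign key check fails.
    try_insert(product_sql, (1, "Chocolate Sandwich Cookies", 61)),
    # Hypothetical product that points at an aisle that exists.
    try_insert(product_sql, (900001, "Hypothetical Soup", 1)),
    # Same product_id again, so the primary key check fails.
    try_insert(product_sql, (900001, "Hypothetical Soup Again", 1)),
]
for line in results:
    print(line)
```

The same idea carries over to Postgres: a child row can only be inserted once its parent row exists, which is worth keeping in mind when choosing the order in which to load the files.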

### Activity 2 - Load data from the following files:

## `/dsa/data/all_datasets/instacart/orders.csv`
 * 3421084 Rows
 * File Preview 
```
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
```
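
The preview hints at two data carpentry details: `order_hour_of_day` is zero-padded (`08`), and `days_since_prior_order` is empty for a user's first order and otherwise written as a decimal (`15.0`). The sketch below is one possible way to turn such records into typed values, using only the standard library; the helper name `clean_order` is an assumption for illustration, not part of the assignment.

```python
import csv
import io

# First rows of orders.csv, copied from the preview above.
PREVIEW = """\
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
"""


def clean_order(record):
    """Convert one CSV record (all strings) into typed Python values."""
    days = record["days_since_prior_order"]
    return (
        int(record["order_id"]),
        int(record["user_id"]),
        record["eval_set"],
        int(record["order_number"]),
        int(record["order_dow"]),
        int(record["order_hour_of_day"]),    # "08" becomes 8
        int(float(days)) if days else None,  # "" becomes None, "15.0" becomes 15
    )


rows = [clean_order(r) for r in csv.DictReader(io.StringIO(PREVIEW))]
for row in rows:
    print(row)
```

A `None` in the last position is what a database driver such as psycopg2 sends as SQL `NULL`.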

## `/dsa/data/all_datasets/instacart/products.csv`
 * 49689 Rows
 * File Preview 
```
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
```

## `/dsa/data/all_datasets/instacart/aisles.csv`
 * 135 Rows
 * File Preview 
```
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
```

## `/dsa/data/all_datasets/instacart/departments.csv`
 * 22 Rows
 * File Preview 
```
department_id,department
1,frozen
2,other
3,bakery
4,produce
```

## `/dsa/data/all_datasets/instacart/order_products.csv`
 * 1384618 Rows
 * File Preview 
```
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
```
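
The row counts listed for each file give a quick sanity check once the tables have been loaded. The following is a rough sketch of such a check; `count_mismatches` and `EXPECTED_ROWS` are hypothetical names, and the demo at the bottom uses an in-memory SQLite table as a stand-in so the block runs without the course database.

```python
import sqlite3

# Row counts taken from the file listings above.
EXPECTED_ROWS = {
    "orders": 3421084,
    "products": 49689,
    "aisles": 135,
    "departments": 22,
    "order_products": 1384618,
}


def count_mismatches(cursor, schema, expected=EXPECTED_ROWS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    mismatches = {}
    for table, want in expected.items():
        # Identifiers cannot be bound as query parameters, so pass trusted names only.
        name = f"{schema}.{table}" if schema else table
        cursor.execute(f"SELECT COUNT(*) FROM {name}")
        actual = cursor.fetchone()[0]
        if actual != want:
            mismatches[table] = (want, actual)
    return mismatches


# Demo with a stand-in database holding only the four previewed departments.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department TEXT)")
cur.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "frozen"), (2, "other"), (3, "bakery"), (4, "produce")],
)
partial = count_mismatches(cur, None, {"departments": 22})
print(partial)
```

With a psycopg2 cursor, the call would look like `count_mismatches(cursor, 'jch5x8')`, and an empty dictionary means every table matches its file.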
     

In [None]:
DROP TABLE IF EXISTS jch5x8.order_products, jch5x8.products, jch5x8.departments, jch5x8.aisles, jch5x8.orders;


CREATE TABLE jch5x8.orders (
 order_id INT PRIMARY KEY, 
 user_id INT NOT NULL, 
 eval_set VARCHAR(50), 
 order_number INT NOT NULL, 
 order_dow VARCHAR(50), 
 order_hour_of_day INT, 
 days_since_prior_order NUMERIC  -- the CSV stores values such as 15.0; empty for a user's first order
);

CREATE TABLE jch5x8.aisles (
 aisle_id INT PRIMARY KEY, 
 aisle VARCHAR(50)
);

CREATE TABLE jch5x8.departments (
 department_id INT PRIMARY KEY, 
 department VARCHAR(50)
);

CREATE TABLE jch5x8.products (
 product_id INT PRIMARY KEY, 
 product_name VARCHAR(250) NOT NULL,
 aisle_id INT REFERENCES jch5x8.aisles,
 department_id INT REFERENCES jch5x8.departments
);

CREATE TABLE jch5x8.order_products (
 order_id INT REFERENCES jch5x8.orders, 
 product_id INT REFERENCES jch5x8.products, 
 add_to_cart_order INT, 
 reordered INT,
 PRIMARY KEY (order_id, product_id)
);


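-- Server-side bulk load of orders. The Python cell below inserts the same rows, so run only one of the two.
-- COPY ... FROM a server file requires superuser or pg_read_server_files membership.
-- COPY parses the CSV text directly, so values such as 15.0 in days_since_prior_order
-- need a column type that accepts decimals (for example NUMERIC).
CREATE SEQUENCE jch5x8.order_number MINVALUE 0 START 0;  -- START 0 needs MINVALUE 0

BEGIN;
COPY jch5x8.orders FROM '/dsa/data/all_datasets/instacart/orders.csv' WITH (FORMAT csv, HEADER true);
SELECT setval('jch5x8.order_number', max(order_id)) FROM jch5x8.orders;
COMMIT;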

## In each designated cell, load the data using Python
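
All five load cells follow the same pattern: build a parameterized `INSERT`, then send the rows. Below is a minimal sketch of that pattern factored into reusable helpers. The names `build_insert_sql`, `batches`, and `load_rows` are assumptions for illustration, and `RecordingCursor` is a stand-in for a real cursor so the block runs without a database connection.

```python
from itertools import islice


def build_insert_sql(table, columns):
    """Build a parameterized INSERT with one %s placeholder per column."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_rows(cursor, table, columns, rows, batch_size=10_000):
    """Insert rows in batches through a DB-API cursor; return the row count."""
    sql = build_insert_sql(table, columns)
    total = 0
    for batch in batches(rows, batch_size):
        cursor.executemany(sql, batch)
        total += len(batch)
    return total


class RecordingCursor:
    """Stand-in for a DB-API cursor that only records what it is asked to run."""

    def __init__(self):
        self.calls = []

    def executemany(self, sql, seq_of_params):
        self.calls.append((sql, list(seq_of_params)))


stub = RecordingCursor()
sample = [(1, "prepared soups salads"), (2, "specialty cheeses"), (3, "energy granola bars")]
n = load_rows(stub, "jch5x8.aisles", ["aisle_id", "aisle"], sample, batch_size=2)
print(n, len(stub.calls))
print(stub.calls[0][0])
```

With a real connection, the call could look like `load_rows(cursor, 'jch5x8.orders', list(orders), orders.itertuples(index=False, name=None))`. Batching here mainly keeps memory use and progress reporting manageable; psycopg2's `executemany` still sends one statement per row, so `psycopg2.extras.execute_values` may be worth a look if load time becomes a problem.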



### Activity 1 - Orders

In [None]:
import pandas as pd
import getpass
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs
# Let psycopg2 pass NumPy integer and float values through as plain SQL literals
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# Collects masked password from user
mypasswd = getpass.getpass()

# Connects to DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'jch5x8', 
                              host = 'pgsql.dsa.lan',
                              password = mypasswd)

cursor = connection.cursor()

# Remove password from memory
del mypasswd

In [None]:
# Load in orders data
orders_file = '/dsa/data/all_datasets/instacart/orders.csv'
orders = pd.read_csv(orders_file)
#orders.head().transpose()

# # Printing column names and placeholders
# print(list(orders))
# s = ''
# for i in list(orders):
#     s += '%s,'
# print(s)

# Convert NaN to None so psycopg2 inserts SQL NULL.
# Casting to object first keeps pandas from coercing None back to NaN in numeric columns.
orders = orders.astype(object).where(pd.notnull(orders), None)

# Creates INSERT SQL statement
INSERT_SQL = 'INSERT INTO jch5x8.orders'
INSERT_SQL += ' (order_id, user_id, eval_set, order_number, order_dow,'
INSERT_SQL += ' order_hour_of_day, days_since_prior_order) VALUES '

# Creates parameterized string for SQL. %s is a placeholder
INSERT_SQL += '(%s,%s,%s,%s,%s,%s,%s)'
# print(INSERT_SQL)


with connection, connection.cursor() as cursor:
    for row in orders.itertuples(index = False, name = None):  # pull each row as a tuple

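        # Each row is a plain, unnamed tuple. Uncomment to inspect rows while debugging;
        # printing all 3,421,084 rows would flood the notebook output.
        # print(row)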
        
        # Insert the row
        cursor.execute(INSERT_SQL,row)

### Activity 2 - Products

In [None]:
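# Load in products data
# products references aisles and departments, so run the Aisles and Departments cells before this one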
products_file = '/dsa/data/all_datasets/instacart/products.csv'
products = pd.read_csv(products_file)
#products.head().transpose()

# # Printing column names and placeholders
# print(list(products))
# s = ''
# for i in list(products):
#     s += '%s,'
# print(s)

# Convert NaN to None so psycopg2 inserts SQL NULL.
# Casting to object first keeps pandas from coercing None back to NaN in numeric columns.
products = products.astype(object).where(pd.notnull(products), None)

# Creates INSERT SQL statement
INSERT_SQL = 'INSERT INTO jch5x8.products'
INSERT_SQL += ' (product_id, product_name, aisle_id, department_id) VALUES '

# Creates parameterized string for SQL. %s is a placeholder
INSERT_SQL += '(%s,%s,%s,%s)'
# print(INSERT_SQL)


with connection, connection.cursor() as cursor:
    for row in products.itertuples(index = False, name = None):  # pull each row as a tuple

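        # Each row is a plain, unnamed tuple. Uncomment to inspect rows while debugging;
        # printing all 49,689 rows would clutter the notebook output.
        # print(row)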
        
        # Insert row
        cursor.execute(INSERT_SQL,row)

### Activity 3 - Aisles

In [None]:
# Load in aisles data
aisles_file = '/dsa/data/all_datasets/instacart/aisles.csv'
aisles = pd.read_csv(aisles_file)
#aisles.head().transpose()

# # Printing column names and placeholders
# print(list(aisles))
# s = ''
# for i in list(aisles):
#     s += '%s,'
# print(s)

# Convert NaN to None so psycopg2 inserts SQL NULL.
# Casting to object first keeps pandas from coercing None back to NaN in numeric columns.
aisles = aisles.astype(object).where(pd.notnull(aisles), None)

# Creates INSERT SQL statement
INSERT_SQL = 'INSERT INTO jch5x8.aisles'
INSERT_SQL += ' (aisle_id, aisle) VALUES '

# Creates parameterized string for SQL. %s is a placeholder
INSERT_SQL += '(%s,%s)'
# print(INSERT_SQL)


with connection, connection.cursor() as cursor:
    for row in aisles.itertuples(index = False, name = None):  # pull each row as a tuple
        # This is an un-indexed, un-named Tuple
        print(row) 
        
        # Insert row
        cursor.execute(INSERT_SQL,row)

### Activity 4 - Departments

In [None]:
# Load in departments data
departments_file = '/dsa/data/all_datasets/instacart/departments.csv'
departments = pd.read_csv(departments_file)
#departments.head().transpose()

# # Printing column names and placeholders
# print(list(departments))
# s = ''
# for i in list(departments):
#     s += '%s,'
# print(s)

# Convert NaN to None so psycopg2 inserts SQL NULL.
# Casting to object first keeps pandas from coercing None back to NaN in numeric columns.
departments = departments.astype(object).where(pd.notnull(departments), None)

# Creates INSERT SQL statement
INSERT_SQL = 'INSERT INTO jch5x8.departments'
INSERT_SQL += ' (department_id, department) VALUES '

# Creates parameterized string for SQL. %s is a placeholder
INSERT_SQL += '(%s,%s)'
# print(INSERT_SQL)


with connection, connection.cursor() as cursor:
    for row in departments.itertuples(index = False, name = None):  # pull each row as a tuple
        # This is an un-indexed, un-named Tuple
        print(row) 
        
        # Insert row
        cursor.execute(INSERT_SQL,row)

### Activity 5 - Order Products

In [None]:
# Load in order_products data
order_products_file = '/dsa/data/all_datasets/instacart/order_products.csv'
order_products = pd.read_csv(order_products_file)
#order_products.head().transpose()

# # Printing column names and placeholders
# print(list(order_products))
# s = ''
# for i in list(order_products):
#     s += '%s,'
# print(s)

# Convert NaN to None so psycopg2 inserts SQL NULL.
# Casting to object first keeps pandas from coercing None back to NaN in numeric columns.
order_products = order_products.astype(object).where(pd.notnull(order_products), None)

# Creates INSERT SQL statement
INSERT_SQL = 'INSERT INTO jch5x8.order_products'
INSERT_SQL += ' (order_id, product_id, add_to_cart_order, reordered) VALUES '

# Creates parameterized string for SQL. %s is a placeholder
INSERT_SQL += '(%s,%s,%s,%s)'
# print(INSERT_SQL)


with connection, connection.cursor() as cursor:
    for row in order_products.itertuples(index = False, name = None):  # pull each row as a tuple

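        # Each row is a plain, unnamed tuple. Uncomment to inspect rows while debugging;
        # printing all 1,384,618 rows would flood the notebook output.
        # print(row)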
        
        # Insert the row
        cursor.execute(INSERT_SQL,row)

# Save your notebook, then `File > Close and Halt`

---