# Database Design and Load Exercise

### Steps
 1. Analyze - Completed last module
 2. Design - Completed last module
 3. Data Carpentry
 4. Data Loading
 5. Analytical Queries

# Smart Stores ERD
In the last module, everyone built an ERD based on the entities and attributes below.



### orders :
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

### products :
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

### aisles :
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

### departments :
* `department_id`: department identifier
* `department`: the name of the department

### order_products :
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

#### where SET is one of the following three evaluation sets (`eval_set` in orders):

* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)

# ERD Diagram

For this assignment, everyone should use the ERD below, which is based on the requirements above. It is OK if your diagram from last week's exercise was slightly different, as long as it captured all of the requirements and relationships.

![erd](../images/db_design.png)

### M3:E1:Q1 - Create the tables in the database
 1. Convert the entities and attributes into a database schema for Postgres.
 1. Remember to prefix table names with your database id, e.g., _SSO_.
    * Example: `CREATE TABLE SSO.orders ... `
    
**Remember to specify your Primary Keys and Foreign Keys for each table!** 

In [None]:
CREATE TABLE dlfy6.orders (
    order_id INT PRIMARY KEY, 
    user_id INT, 
    eval_set VARCHAR(5), 
    order_number INT,
    order_dow INT,
    order_hour_of_day INT,
    days_since_prior INT
);


CREATE TABLE dlfy6.aisles (
    aisle_id INT PRIMARY KEY,
    aisle VARCHAR(100)
);


CREATE TABLE dlfy6.departments (
    department_id INT PRIMARY KEY,
    department VARCHAR(50)
);




CREATE TABLE dlfy6.products (
    product_id INT,
    product_name VARCHAR(250),
    aisle_id INT,
    department_id INT, 
    PRIMARY KEY (product_id),
    FOREIGN KEY (aisle_id)
        REFERENCES dlfy6.aisles(aisle_id),
    FOREIGN KEY (department_id)
        REFERENCES dlfy6.departments(department_id)
);


CREATE TABLE dlfy6.order_products (
    order_id INT,
    product_id INT,
    add_to_cart_order INT,
    reordered INT,    
    PRIMARY KEY (order_id,product_id),
    FOREIGN KEY (order_id)
        REFERENCES dlfy6.orders(order_id),
    FOREIGN KEY (product_id)
        REFERENCES dlfy6.products(product_id)
);





In [1]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

········


In [2]:
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'dlfy6', 
                              host = 'dbase.dsa.missouri.edu',
                              password = mypasswd)

In [12]:
cursor = connection.cursor()

In [3]:
# Then remove the password from computer memory
del mypasswd

### M3:E1:Q2 - Load data from the following files:

## `/dsa/data/all_datasets/instacart/orders.csv`
 * 3421084 Rows
 * File Preview 
```
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
```

## `/dsa/data/all_datasets/instacart/products.csv`
 * 49689 Rows
 * File Preview 
```
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
```

## `/dsa/data/all_datasets/instacart/aisles.csv`
 * 135 Rows
 * File Preview 
```
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
```

## `/dsa/data/all_datasets/instacart/departments.csv`
 * 22 Rows
 * File Preview 
```
department_id,department
1,frozen
2,other
3,bakery
4,produce
```

## `/dsa/data/all_datasets/instacart/order_products.csv`
 * 1384618 Rows
 * File Preview 
```
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
```
     

## In each designated cell, load the data using Python



### M3:E1:Q2a - Orders 

In [6]:
import pandas as pd
df_orders = pd.read_csv('/dsa/data/all_datasets/instacart/orders.csv')
df_orders.head()



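|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |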
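
The first order has no `days_since_prior_order` value, which matches the NA described for `order_number` = 1. As a minimal sketch using only the standard library and the first rows of the `orders.csv` preview, the empty field can be turned into `None`, which a database driver sends as SQL `NULL`. The helper name `parse_order_row` is hypothetical and not part of the exercise.

```python
import csv
import io

# Header and first three data rows from the orders.csv preview.
PREVIEW = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
"""


def parse_order_row(record):
    """Convert one csv.DictReader record into a tuple of typed values.

    Hypothetical helper: an empty days_since_prior_order becomes None.
    """
    raw_days = record["days_since_prior_order"]
    days = int(float(raw_days)) if raw_days else None
    return (
        int(record["order_id"]),
        int(record["user_id"]),
        record["eval_set"],
        int(record["order_number"]),
        int(record["order_dow"]),
        int(record["order_hour_of_day"]),  # "08" parses as 8
        days,
    )


rows = [parse_order_row(r) for r in csv.DictReader(io.StringIO(PREVIEW))]
for row in rows:
    print(row)
```

Each tuple has seven values in the same order as the columns of the `orders` table, so it could be passed directly as the parameters of a seven-placeholder `INSERT`.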


In [None]:
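import numpy as np

# Cast to object first so missing values can be stored as None (SQL NULL);
# in recent pandas versions a float column would otherwise keep NaN.
df_orders = df_orders.astype(object).where(pd.notnull(df_orders), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# name=None yields plain tuples, which psycopg2 accepts as query parameters
for row in df_orders.itertuples(index=False, name=None):
    cursor.execute('INSERT INTO dlfy6.orders VALUES(%s,%s,%s,%s,%s,%s,%s)',row)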

   
# Save (commit) the changes
connection.commit()
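
The loop above sends one `INSERT` per row for roughly 3.4 million rows. One possible variation, sketched here under assumptions, is to group the rows into fixed-size batches and hand each batch to the driver in one call. The sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the Postgres server, so it uses `?` placeholders where psycopg2 uses `%s`. The helper `batched` and the batch size of 2 are hypothetical; the sample rows are the five rows shown by `df_orders.head()`.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable (hypothetical helper)."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in database: in-memory SQLite, not the course Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           user_id INTEGER,
           eval_set TEXT,
           order_number INTEGER,
           order_dow INTEGER,
           order_hour_of_day INTEGER,
           days_since_prior INTEGER
       )"""
)

# The five rows shown by df_orders.head(), with the missing value as None.
sample_rows = [
    (2539329, 1, "prior", 1, 2, 8, None),
    (2398795, 1, "prior", 2, 3, 7, 15),
    (473747, 1, "prior", 3, 3, 12, 21),
    (2254736, 1, "prior", 4, 4, 7, 29),
    (431534, 1, "prior", 5, 4, 15, 28),
]

batch_sizes = []
for batch in batched(sample_rows, 2):
    # SQLite placeholder style is "?"; psycopg2 would use "%s" here.
    conn.executemany("INSERT INTO orders VALUES (?,?,?,?,?,?,?)", batch)
    batch_sizes.append(len(batch))
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(batch_sizes, loaded)
```

With psycopg2, `psycopg2.extras.execute_values` is one function that accepts a sequence of rows like these batches. Whether batching is faster in practice depends on the driver and the server, so it is worth timing on a small slice first.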


### M3:E1:Q2b - Products







In [19]:
df_products = pd.read_csv('/dsa/data/all_datasets/instacart/products.csv')
df_products.head()


|   | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [21]:
df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB
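
`df_products.info()` reports 49688 entries, while the file listing gives 49689 rows. The difference of one is consistent with the listed count including the header line, and the same pattern appears for `order_products.csv` (1384618 listed, 1384617 loaded). A small sketch of counting data rows separately from the header, using the `products.csv` preview as input; the helper name `count_data_rows` is hypothetical.

```python
import csv
import io

# Header and data rows from the products.csv preview.
PREVIEW = """product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
"""


def count_data_rows(handle):
    """Count CSV records after the header line (hypothetical helper)."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header record
    return sum(1 for _ in reader)


total_lines = len(PREVIEW.splitlines())
data_rows = count_data_rows(io.StringIO(PREVIEW))
print(total_lines, data_rows)
```

Comparing a count like `data_rows` with `SELECT COUNT(*)` on the loaded table is a quick check that no rows were dropped during the load.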


In [23]:
# Load departments and aisles before running this cell, because products references both.
df_products = df_products.where(pd.notnull(df_products), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# If an earlier statement failed, call connection.rollback() once before this loop;
# issuing a rollback inside the loop is not needed.
for row in df_products.itertuples(index=False, name=None):
    cursor.execute('INSERT INTO dlfy6.products VALUES(%s,%s,%s,%s)',row)

# Save (commit) the changes
connection.commit()
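
The load order matters because `products` has foreign keys to `aisles` and `departments`: a product row is rejected while the rows it points to are missing. The sketch below illustrates this with Python's built-in `sqlite3` and an in-memory database as a stand-in for Postgres, so the table names carry no schema prefix and the column types are SQLite's. The aisle and department names are placeholders, not values taken from the data files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE aisles (aisle_id INTEGER PRIMARY KEY, aisle TEXT);
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department TEXT);
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        aisle_id INTEGER REFERENCES aisles(aisle_id),
        department_id INTEGER REFERENCES departments(department_id)
    );
    """
)

# First data row of the products.csv preview.
product = (1, "Chocolate Sandwich Cookies", 61, 19)

# Attempt 1: insert the child row before its parent rows exist.
try:
    conn.execute("INSERT INTO products VALUES (?,?,?,?)", product)
    child_first_ok = True
except sqlite3.IntegrityError:
    child_first_ok = False
    conn.rollback()

# Attempt 2: insert the parent rows first, then the child row.
conn.execute("INSERT INTO aisles VALUES (?, ?)", (61, "placeholder aisle"))
conn.execute("INSERT INTO departments VALUES (?, ?)", (19, "placeholder department"))
conn.execute("INSERT INTO products VALUES (?,?,?,?)", product)
conn.commit()

product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(child_first_ok, product_count)
```

The same reasoning applies to `order_products`, which references both `orders` and `products` and therefore has to be loaded last.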


### M3:E1:Q2c - Aisles

In [9]:
df_aisles = pd.read_csv('/dsa/data/all_datasets/instacart/aisles.csv')
df_aisles.head()


|   | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |
| 3 | 4 | instant foods |
| 4 | 5 | marinades meat preparation |


In [13]:
df_aisles = df_aisles.where(pd.notnull(df_aisles), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

# name=None yields plain tuples, which psycopg2 accepts as query parameters
for row in df_aisles.itertuples(index=False, name=None):
    cursor.execute('INSERT INTO dlfy6.aisles VALUES (%s,%s);',row)

# Save (commit) the changes
connection.commit()


### M3:E1:Q2d - Departments

In [18]:
df_departments = pd.read_csv('/dsa/data/all_datasets/instacart/departments.csv')
df_departments.head()


|   | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |


In [16]:
df_departments = df_departments.where(pd.notnull(df_departments), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

for row in df_departments.itertuples(index=False, name=None):
    #print(row)
    cursor.execute('INSERT INTO dlfy6.departments VALUES(%s,%s);',row)

# Save (commit) the changes
connection.commit()


### M3:E1:Q2e - Order Products

In [24]:
df_order_products = pd.read_csv('/dsa/data/all_datasets/instacart/order_products.csv')
df_order_products.head()


|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |


In [26]:
df_order_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             1384617 non-null int64
product_id           1384617 non-null int64
add_to_cart_order    1384617 non-null int64
reordered            1384617 non-null int64
dtypes: int64(4)
memory usage: 42.3 MB


In [27]:
df_order_products = df_order_products.where(pd.notnull(df_order_products), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

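# Load orders and products before this cell, because order_products references both.
# If an earlier statement failed, call connection.rollback() once before this loop.
for row in df_order_products.itertuples(index=False, name=None):
    cursor.execute('INSERT INTO dlfy6.order_products VALUES(%s,%s,%s,%s);',row)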

# Save (commit) the changes
connection.commit()


--- 

### M3:E1:Q3

In each of the cells below, use Python to pull the data out of the database. 

#### M3:E1:Q3a - Find the top 10 products, based on number of orders.
Display in a table!

In [41]:
SQL ="""

SELECT p.product_name, count(op.order_id) AS number_orders
FROM dlfy6.products p 
JOIN dlfy6.order_products op 
USING (product_id) 

GROUP BY op.product_id, p.product_name 
ORDER BY number_orders DESC 
LIMIT 10 ;

"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    data = cursor.fetchall()
    


In [42]:
data =pd.DataFrame(data,columns=['Top10products','number_orders'])
data

|   | Top10products | number_orders |
|---|---|---|
| 0 | Banana | 18726 |
| 1 | Bag of Organic Bananas | 15480 |
| 2 | Organic Strawberries | 10894 |
| 3 | Organic Baby Spinach | 9784 |
| 4 | Large Lemon | 8135 |
| 5 | Organic Avocado | 7409 |
| 6 | Organic Hass Avocado | 7293 |
| 7 | Strawberries | 6494 |
| 8 | Limes | 6033 |
| 9 | Organic Raspberries | 5546 |
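
To check the shape of a top-N query without touching the full tables, the same join, grouping, ordering, and limit can be run on a few rows. The sketch below is an illustration under assumptions: it uses Python's built-in `sqlite3` in memory instead of Postgres, only the columns the query needs, and a handful of made-up order rows (the product ids and order ids are hypothetical; only the product names appear in the real data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_products (
        order_id INTEGER,
        product_id INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
    """
)

# Made-up toy rows: Banana is in 3 orders, Limes in 2, Strawberries in 1.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Banana"), (2, "Limes"), (3, "Strawberries")],
)
conn.executemany(
    "INSERT INTO order_products VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 1), (10, 2), (11, 2), (12, 3)],
)

TOP_N = """
SELECT p.product_name, COUNT(op.order_id) AS number_orders
FROM products p
JOIN order_products op USING (product_id)
GROUP BY product_id, p.product_name
ORDER BY number_orders DESC
LIMIT ?;
"""

top_two = conn.execute(TOP_N, (2,)).fetchall()
print(top_two)
```

Because the counts in the toy data are known in advance, a wrong join or a missing `GROUP BY` column shows up immediately as an unexpected result.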


#### M3:E1:Q3b - Display how many products there are in each department

In [39]:
SQL ="""
SELECT d.department, count(*) 
FROM dlfy6.products p 
JOIN dlfy6.departments d
USING (department_id)
GROUP BY d.department_id;

"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    data1 = cursor.fetchall()
    

In [40]:
data1 =pd.DataFrame(data1,columns=['Departments','number_products'])
data1

|   | Departments | number_products |
|---|---|---|
| 0 | deli | 1322 |
| 1 | personal care | 6563 |
| 2 | household | 3085 |
| 3 | meat seafood | 907 |
| 4 | bulk | 38 |
| 5 | babies | 1081 |
| 6 | other | 548 |
| 7 | canned goods | 2092 |
| 8 | pantry | 5371 |
| 9 | missing | 1258 |


# Save your notebook, then `File > Close and Halt`

---