<a href="https://colab.research.google.com/github/brook-miller/mbai-417-data/blob/main/data-models-databases/in-class/lab_instacart_with_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 1. Instacart Lab Notes



1.   Check out the [code](https://github.com/brook-miller/mbai-417-data/blob/main/data-models-databases/in-class/lab-instacart.ipynb) we reviewed in class, clear the output, and run the cells yourself.
2.   Use the `limit` clause in SQL statements to avoid downloading millions of rows to your notebook environment.
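
A minimal sketch of why the `limit` clause matters. It uses an in-memory SQLite table as a stand-in for the class Redshift database, and the `order_products` rows are synthetic, so treat it as an illustration rather than the lab's actual data:

```python
import sqlite3

# In-memory SQLite stands in for Redshift; the table contents are synthetic.
conn = sqlite3.connect(":memory:")
conn.execute("create table order_products (order_id integer, product_id integer)")
conn.executemany(
    "insert into order_products values (?, ?)",
    [(i // 5, i % 50) for i in range(10_000)],
)

# Without a limit, every row is transferred to the client.
all_rows = conn.execute("select * from order_products").fetchall()

# With a limit, the database returns only a small preview.
preview = conn.execute("select * from order_products limit 10").fetchall()

print(len(all_rows), len(preview))
```

The same idea applies to the `%%sql` cells in this lab: the limit is applied by the database, so only the preview rows travel to the notebook.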



# Setting up the environment

In [1]:
#@title installs for sqlalchemy and sqlmagic
!pip install sqlalchemy-redshift --quiet
!pip install redshift_connector --quiet
!pip install ipython-sql --quiet


In [2]:
#@title standard imports - we'll use in most EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

from datetime import datetime, timedelta
from dateutil.parser import parse
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [3]:
#@title setting up sql connection and sql magic, unique to this lab

import getpass
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm

connect_to_db = URL.create(
    drivername='redshift+redshift_connector',
    host='mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com',
    port=5439,
    database='dev',
    username='ro_user',          # username should not be hard coded either
    password=getpass.getpass(),  # please don't put passwords into code
)

engine = sa.create_engine(connect_to_db)
%reload_ext sql
%sql $connect_to_db

··········


'Connected: ro_user@dev'

# Lab questions to answer

## Create a query to select all of the aisles


In [None]:
%%sql
select * from aisles
limit 10

 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
5,marinades meat preparation
6,other
7,packaged meat
8,bakery desserts
9,pasta sauce
10,kitchen supplies


## Which aisle has the most products?


In [9]:
#@title join the aisles table to the products table and count the number of products in each aisle

%%sql
select aisle, count(*) from aisles a
join products p on p.aisle_id = a.aisle_id
group by aisle
order by count(*) desc
limit 10


 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


aisle,count
missing,1258
candy chocolate,1246
ice cream ice,1091
vitamins supplements,1038
yogurt,1026
chips pretzels,989
tea,894
packaged cheese,891
frozen meals,880
cookies cakes,874
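
The join-then-count pattern used above can be tried offline. This is a minimal sketch on synthetic data, with SQLite standing in for Redshift; the table and column names mirror the lab schema, but the rows are made up:

```python
import sqlite3

# Synthetic stand-in for the lab's aisles and products tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table aisles (aisle_id integer primary key, aisle text);
create table products (product_id integer primary key, product_name text, aisle_id integer);
insert into aisles values (1, 'yogurt'), (2, 'tea'), (3, 'bread');
insert into products values
    (10, 'Greek Yogurt', 1), (11, 'Vanilla Yogurt', 1), (12, 'Kefir', 1),
    (20, 'Green Tea', 2), (21, 'Chai', 2),
    (30, 'Sourdough', 3);
""")

# Each product row matches exactly one aisle, so count(*) per aisle
# is the number of products in that aisle.
rows = conn.execute("""
    select a.aisle, count(*) as product_count
    from aisles a
    join products p on p.aisle_id = a.aisle_id
    group by a.aisle
    order by product_count desc
""").fetchall()

print(rows)  # [('yogurt', 3), ('tea', 2), ('bread', 1)]
```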


## Which aisle has the most products ordered?

In [13]:
#@title add an additional join to the previous query so that we can count # of products ordered
%%sql

select aisle, count(*) from aisles a
join products p on p.aisle_id = a.aisle_id
join order_products op on op.product_id = p.product_id
group by aisle
order by count(*) desc
limit 10

 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


aisle,count
fresh fruits,3642188
fresh vegetables,3418021
packaged vegetables fruits,1765313
yogurt,1452343
packaged cheese,979763
milk,891015
water seltzer sparkling water,841533
chips pretzels,722470
soy lactosefree,638253
bread,584834
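
Adding the join to `order_products` changes what `count(*)` measures: each product row is repeated once per order line, so the count becomes "times ordered" rather than "number of products". A minimal sketch on synthetic data (SQLite as a stand-in for Redshift, rows invented for illustration) makes the difference visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table aisles (aisle_id integer primary key, aisle text);
create table products (product_id integer primary key, product_name text, aisle_id integer);
create table order_products (order_id integer, product_id integer);
insert into aisles values (1, 'yogurt'), (2, 'tea');
insert into products values
    (10, 'Greek Yogurt', 1), (20, 'Green Tea', 2), (21, 'Chai', 2);
insert into order_products values (1, 10), (2, 10), (3, 10), (1, 20);
""")

# count(*) counts order lines; count(distinct ...) counts products
# that were ordered at least once.
rows = conn.execute("""
    select a.aisle,
           count(*) as times_ordered,
           count(distinct p.product_id) as distinct_products
    from aisles a
    join products p on p.aisle_id = a.aisle_id
    join order_products op on op.product_id = p.product_id
    group by a.aisle
    order by times_ordered desc
""").fetchall()

print(rows)  # [('yogurt', 3, 1), ('tea', 1, 1)]
```

In this toy data the tea aisle has two products, but Chai was never ordered, so the inner join drops it and only one distinct product remains.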


## What are the most frequently ordered products with names that contain "Diapers"?



In [11]:
#@title recognize this as the same problem from the brief demo I did in class
#@markdown [1-instacart](https://github.com/brook-miller/mbai-417-data/blob/main/data-models-databases/in-class/1-instacart.ipynb) I copied the query to here and changed pizza to diapers
#@markdown <br/><br/>joining products (product names) to order_products, which maps products to orders

%%sql
select p.product_name as product, count(*) as product_count from order_products op
join products p on p.product_id = op.product_id
where p.product_name ilike('%diapers%')
group by product
order by product_count desc
limit 10

 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


product,product_count
Free & Clear Diapers,869
Free & Clear Size 4 Baby Diapers,398
Size 5 Cruisers Diapers Super Pack,324
Honest Diapers Size 5,315
Baby Diapers Size 2,300
Diapers Cruisers Size 4 Super Pack,284
Baby Dry Diapers Jumbo Pack Size 4,263
"Swaddlers Diapers Super Pack, Size 3",259
Honest Diapers Size 4,256
Giraffes Diapers Size 4 L,252
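
A small sketch of what a case-insensitive `'%diapers%'` pattern does and does not match. SQLite has no `ilike`, so this stand-in lowercases the column and uses `like` instead; the product names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table products (product_id integer primary key, product_name text)")
conn.executemany(
    "insert into products values (?, ?)",
    [
        (1, "Free & Clear Diapers"),
        (2, "DIAPERS Size 4"),
        (3, "Diaper Rash Cream"),
        (4, "Baby Wipes"),
    ],
)

# lower(...) like '%diapers%' approximates ilike '%diapers%':
# any casing matches, but the literal substring "diapers" is required.
matches = [
    name
    for (name,) in conn.execute(
        "select product_name from products "
        "where lower(product_name) like '%diapers%' "
        "order by product_name"
    )
]

print(matches)  # ['DIAPERS Size 4', 'Free & Clear Diapers']
```

Note that "Diaper Rash Cream" is not matched because the pattern requires the plural "diapers".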


## Create a query to determine which products are most often purchased in orders with diapers (not including diapers products)

In [23]:
#@title products purchased with diapers
#@markdown break down the problem:
#@markdown 1. we need to find orders that have at least 1 diapers product
#@markdown 2. from there, do a product count of products ordered, eliminating the products that contain diapers
#@markdown 3. lines 13,14 and 17,18 count the top products overall; line 16 uses a join to the CTE (common table expression) to restrict the orders to only those that have diapers
%%sql

with diaperorders as (
    select distinct op.order_id from order_products op
    join products p on p.product_id = op.product_id
    where p.product_name ilike('%diapers%')
)
select p.product_name as product_name, min(p.product_id) as product_id, count(op.order_id) as order_count from order_products op
join products p on op.product_id = p.product_id
join diaperorders dord on dord.order_id = op.order_id
where NOT (p.product_name ilike('%diapers%'))
group by p.product_name
order by order_count desc
limit 20

 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


product_name,product_id,order_count
Banana,24852,1143
Bag of Organic Bananas,13176,1140
Organic Hass Avocado,47209,613
Organic Strawberries,21137,575
Strawberries,16797,502
Organic Raspberries,27966,440
Organic Whole Milk,27845,406
Free & Clear Unscented Baby Wipes,44471,379
Organic Blueberries,39275,374
Organic Baby Spinach,21903,358


In [22]:
#@title products purchased with diapers alternative (equivalent)
#@markdown we don't necessarily need the product_id here so I've eliminated it for clarity
#@markdown lines 12-14, 17-19 select and group by the most popular products
#@markdown <br/><br/>note that the 8th product down is baby wipes!
%%sql

with diaperorders as (
    select distinct op.order_id from order_products op
    join products p on p.product_id = op.product_id
    where p.product_name ilike('%diapers%')
)
select p.product_name as product_name, count(op.order_id) as order_count 
from order_products op
    join products p on op.product_id = p.product_id
where NOT (p.product_name ilike('%diapers%')) 
      and op.order_id in (select order_id from diaperorders)
group by p.product_name
order by order_count desc
limit 20

 * redshift+redshift_connector://ro_user:***@mbai417-redshift.cuvtmrb8eogw.us-west-2.redshift.amazonaws.com:5439/dev
Done.


product_name,order_count
Banana,1143
Bag of Organic Bananas,1140
Organic Hass Avocado,613
Organic Strawberries,575
Strawberries,502
Organic Raspberries,440
Organic Whole Milk,406
Free & Clear Unscented Baby Wipes,379
Organic Blueberries,374
Organic Baby Spinach,358
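
The two queries above (join to the CTE versus `in` with a subquery) can be checked against each other on a small example. This is a sketch on synthetic data, with SQLite standing in for Redshift and `lower(...) like` standing in for `ilike`; the orders and products are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table products (product_id integer primary key, product_name text);
create table order_products (order_id integer, product_id integer);
insert into products values
    (1, 'Baby Diapers Size 2'), (2, 'Banana'), (3, 'Baby Wipes'),
    (4, 'Honest Diapers Size 5'), (5, 'Bread');
insert into order_products values
    (100, 1), (100, 2), (100, 3),
    (101, 4), (101, 2),
    (102, 2), (102, 5),
    (103, 1), (103, 4), (103, 3),
    (104, 1), (104, 2);
""")

# Shared CTE: one row per order that contains at least one diapers product.
cte = """
with diaperorders as (
    select distinct op.order_id
    from order_products op
    join products p on p.product_id = op.product_id
    where lower(p.product_name) like '%diapers%'
)
"""

# Form 1: restrict orders by joining to the CTE.
join_form = cte + """
select p.product_name, count(op.order_id) as order_count
from order_products op
join products p on p.product_id = op.product_id
join diaperorders d on d.order_id = op.order_id
where not (lower(p.product_name) like '%diapers%')
group by p.product_name
order by order_count desc, p.product_name
"""

# Form 2: restrict orders with an IN subquery against the CTE.
in_form = cte + """
select p.product_name, count(op.order_id) as order_count
from order_products op
join products p on p.product_id = op.product_id
where not (lower(p.product_name) like '%diapers%')
  and op.order_id in (select order_id from diaperorders)
group by p.product_name
order by order_count desc, p.product_name
"""

join_rows = conn.execute(join_form).fetchall()
in_rows = conn.execute(in_form).fetchall()

print(join_rows)             # [('Banana', 3), ('Baby Wipes', 2)]
print(join_rows == in_rows)  # True
```

The `distinct` in the CTE is what keeps the join form from double counting: order 103 contains two diapers products, and without `distinct` its other items would be counted twice.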


# Advanced (if you finish the previous quickly)

## Do organic products tend to go in the same orders? Use bananas and strawberries to develop a hypothesis. What steps would you take to validate this hypothesis across all products / orders? How would you modify the data to make this analysis easier to do?
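
One possible starting point, offered as a sketch rather than the expected answer: reduce each order to a row of 0/1 flags, then compare how often organic strawberries appear in orders with organic bananas versus orders with conventional bananas. The orders below are synthetic and say nothing about the real Instacart data; SQLite stands in for Redshift, and the choice of metric (a simple conditional share) is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table products (product_id integer primary key, product_name text);
create table order_products (order_id integer, product_id integer);
insert into products values
    (1, 'Bag of Organic Bananas'), (2, 'Banana'),
    (3, 'Organic Strawberries'), (4, 'Strawberries');
insert into order_products values
    (1, 1), (1, 3),
    (2, 1), (2, 3),
    (3, 1), (3, 4),
    (4, 2), (4, 4),
    (5, 2), (5, 4),
    (6, 2), (6, 3),
    (7, 2),
    (8, 1);
""")

# One row per order with 0/1 flags, then the share of organic-strawberry
# orders within each banana group.
share_given_organic, share_given_regular = conn.execute("""
    with flags as (
        select op.order_id,
               max(case when p.product_name = 'Bag of Organic Bananas' then 1 else 0 end) as organic_banana,
               max(case when p.product_name = 'Banana' then 1 else 0 end) as regular_banana,
               max(case when p.product_name = 'Organic Strawberries' then 1 else 0 end) as organic_strawberry
        from order_products op
        join products p on p.product_id = op.product_id
        group by op.order_id
    )
    select sum(organic_banana * organic_strawberry) * 1.0 / sum(organic_banana),
           sum(regular_banana * organic_strawberry) * 1.0 / sum(regular_banana)
    from flags
""").fetchone()

print(share_given_organic, share_given_regular)  # 0.5 0.25
```

The per-order flag table hints at one way to make the analysis easier: a precomputed organic indicator on `products` (a hypothetical column, not part of the lab schema) would let the same comparison run across all products instead of hand-picked names.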
