## Kyle Demers

### Final Project, due Dec 20 by midnight

**Instructions:**

Do not include any code pertaining to creating tables, loading data, or intermediate results where you are testing things (you can do that in a different notebook).

There may be cases where it will be beneficial (and highly recommended) to create some views - some of the questions build on each other, so you can use views to avoid duplicating the same code/logic over and over again.  In those cases, please do show the code you used to create the view to help us determine partial credit if something goes wrong!

For queries that ask about things like "the most recent month", have this determined by the SQL, i.e., do not hardcode things like '2020-10' into your queries.

In [2]:
import pandas as pd
import sqlite3
import numpy as np
sqlite3.register_adapter(np.int64, lambda val: int(val)) #fix from piazza to turn np ints into normal ints
conn = sqlite3.connect('store.db')
curs = conn.cursor()

---
1) What are the rules of Tidy Data?  Which normal form does Tidy Data most closely approximate?  What is the primary motivation for normalizing our data like this, i.e.,  What problems does it aim to prevent?


Rules of Tidy Data
1) Each variable forms a column
2) Each observation forms a row
3) Each type of observational unit forms a table

Tidy data most closely approximates **3rd Normal Form**.

The primary motivation is to prevent redundancy. When the same fact is stored in more than one place, the copies can drift out of sync, which leads to inconsistencies and to insertion, update, and deletion anomalies. Normalizing the data keeps each fact in exactly one place so that these anomalies cannot occur.
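
To make the anomaly concern concrete, here is a minimal sketch using an in-memory SQLite database. The table names (`flat_orders`, `cust`, `orders`) and the sample customer are hypothetical and are not part of the store database.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# Unnormalized: the customer's city is repeated on every order row.
demo.executescript("""
CREATE TABLE flat_orders (order_id INTEGER, cust_name TEXT, city TEXT);
INSERT INTO flat_orders VALUES (1, 'Ann', 'Boston'), (2, 'Ann', 'Boston');
-- A partial update leaves two different cities for the same customer.
UPDATE flat_orders SET city = 'Salem' WHERE order_id = 1;
""")
flat_cities = demo.execute(
    "SELECT COUNT(DISTINCT city) FROM flat_orders WHERE cust_name = 'Ann'"
).fetchone()[0]

# Normalized: the city is stored once, so an update cannot be partial.
demo.executescript("""
CREATE TABLE cust (cust_id INTEGER PRIMARY KEY, cust_name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER REFERENCES cust);
INSERT INTO cust VALUES (1, 'Ann', 'Boston');
INSERT INTO orders VALUES (1, 1), (2, 1);
UPDATE cust SET city = 'Salem' WHERE cust_id = 1;
""")
norm_cities = demo.execute("""
    SELECT COUNT(DISTINCT city)
    FROM orders JOIN cust USING (cust_id)
""").fetchone()[0]

print(flat_cities, norm_cities)  # conflicting cities in the flat table vs. one consistent city
```

In the flat table the same customer ends up with two cities after one careless update; in the normalized layout there is only one place to change.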

---
2) Load the remaining sales files you've been provided into your store database. Once everything is loaded, you should have data for every month from January 2019 to October 2021.

Assuming your database and code are set up properly, you should encounter a problem loading one or more of the files.

This is not the type of error you need to correct programmatically - you'll want to open the file and correct the issue manually.  I have set things up in such a way that once you find it, it should be clear what to do. (Also, avoid Excel.  Use a basic text editor.  Excel might try to reformat things like zip codes which will lead to other problems). 

If you are unsure what to do when you find it, you can ask me - but before doing so, make sure you can tell me which file(s) and which row(s) are causing the issue.

**When you have found the problem(s), indicate the file name(s), row number(s), and paste in the data from that row(s) below.**

Problem in SALES_202102.csv. Somebody tried to order 3333333333 Levels on row 42.

|first|last|addr|city|state|zip|date|prod_id|prod_desc|unit_price|qty|
|-----|---|-----|----|-----|---|----|------|-----------|---------|----|
|Rieekan|Mccarthy|8944 Canterbury Drive|Dunseith|ND|58329|2021-01-03|317|Level|16.99|3333333333|

Problem in Sales_202008.csv. Somebody tried to order 11111111 Ladders on row 40.

|first|last|addr|city|state|zip|date|prod_id|prod_desc|unit_price|qty|
|-----|---|-----|----|-----|---|----|------|-----------|---------|----|
|Snap|Pollard|4987 Sycamore Drive|Mc Louth|KS|66054|2020-08-04|327|Ladder|80.0|11111111|

For grading: if the tables above do not render correctly, here are the raw rows:

Rieekan, Mccarthy, 8944 Canterbury Drive, Dunseith, ND, 58329, 2021-01-03, 317, Level, 16.99, 3333333333

Snap, Pollard, 4987 Sycamore Drive, Mc Louth, KS, 66054, 2020-08-04, 327, Ladder, 80.0, 11111111
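
A quick scan can narrow down where to look before opening a file in a text editor. The sketch below is illustrative only: it reads an inline sample instead of the real files, the first data row is made up, the cutoff `MAX_PLAUSIBLE_QTY` is an arbitrary assumption, and `suspicious_rows` is a hypothetical helper. Every field is read as text, so zip codes keep their leading zeros.

```python
import csv
import io

# Stand-in for one sales file; real use would open the CSV path instead.
# The first data row is invented; the second is the row reported above.
sample = io.StringIO(
    "first,last,addr,city,state,zip,date,prod_id,prod_desc,unit_price,qty\n"
    "Ann,Example,1 Main Street,Boston,MA,02108,2021-02-01,317,Level,16.99,2\n"
    "Rieekan,Mccarthy,8944 Canterbury Drive,Dunseith,ND,58329,2021-01-03,"
    "317,Level,16.99,3333333333\n"
)

MAX_PLAUSIBLE_QTY = 1000  # arbitrary cutoff chosen for illustration


def suspicious_rows(handle, limit=MAX_PLAUSIBLE_QTY):
    """Return (file_line_number, row) pairs whose qty exceeds the limit."""
    reader = csv.DictReader(handle)
    # The header is line 1, so the first data row is line 2.
    return [(n, row) for n, row in enumerate(reader, start=2)
            if int(row["qty"]) > limit]


flagged = suspicious_rows(sample)
for line_no, row in flagged:
    print(line_no, row["first"], row["last"], row["qty"])
```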

---
3) Generate a summary, by month and year of how our store is performing.

Have your query return the following:
 - year
 - month
 - Sales: total sales for the month (i.e., sum of qty * unit price)
 - NumOrders: number of orders placed for the month
 - NumCust: number of _distinct_ customers who made a purchase (i.e. only count the customer at most once per month)
 - OrdersPerCust: average number of orders per customer (i.e. NumOrders/NumCust)
 - SalesPerCust: average sales per customer (i.e. Sales/NumCust)
 - SalesPerOrder: average sales per order (i.e. Sales/NumOrders)

The results should be sorted by year and month, in ascending order.

_Hint: Watch out for integer division!_
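
As a small illustration of the hint, the sketch below shows how SQLite truncates a quotient when both operands are integers, and two common ways around it. The counts 91 and 85 are just illustrative values.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# Integer / integer truncates; casting or multiplying by 1.0 forces a real result.
int_div, real_div, mult_div = demo.execute(
    "SELECT 91 / 85, CAST(91 AS REAL) / 85, 1.0 * 91 / 85"
).fetchone()

print(int_div, real_div, mult_div)
```

Only one operand needs to be a real number for the division to keep its fractional part.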

In [67]:
pd.read_sql('''
WITH orders AS
(SELECT year, month, count(order_id) as orders, count(DISTINCT (Cust_id)) as customers
FROM tOrder
GROUP BY year, month
),
totalsales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales]
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month
)
SELECT year, month, [Sales], orders, customers,
    CAST(orders as Float)/customers as [orders per customer],
    cast(sales as Float)/customers as [sales per customer],
    cast(sales as Float)/orders as [Sales per Order]
FROM totalsales
JOIN orders
USING (year,month)
-- one row per (year, month) already; sort explicitly as the question requires
ORDER BY year, month
;''',conn)

Unnamed: 0,year,month,Sales,orders,customers,orders per customer,sales per customer,Sales per Order
0,2019,1,68464.61,91,85,1.070588,805.466,752.358352
1,2019,2,55560.32,80,73,1.09589,761.100274,694.504
2,2019,3,19191.68,51,49,1.040816,391.666939,376.307451
3,2019,4,20912.07,48,46,1.043478,454.610217,435.668125
4,2019,5,11973.34,50,46,1.086957,260.29,239.4668
5,2019,6,13737.3,43,41,1.04878,335.056098,319.472093
6,2019,7,22095.05,45,41,1.097561,538.903659,491.001111
7,2019,8,15675.05,51,49,1.040816,319.89898,307.353922
8,2019,9,9360.38,40,39,1.025641,240.009744,234.0095
9,2019,10,48411.35,58,51,1.137255,949.242157,834.678448


---
4) In which month did we have the lowest total sales?

The answer can be confirmed with the previous result, but make sure to write a fresh query for this one, i.e. don't just extract it from the dataframe above!

Return one record with:
- year
- month
- sales (sum of qty * unit_price)

In [75]:
pd.read_sql('''
With totalsales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales]
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month
)
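-- SQLite-specific: with a single min(), the bare year and month come from the row holding the minimum
SELECT year,month, min(Sales) as Sales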
FROM totalsales
;''',conn)

Unnamed: 0,year,month,Sales
0,2019,9,9360.38
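
The query above relies on a SQLite-specific rule: when a query has a single `min()` or `max()` aggregate, the other ("bare") columns are taken from the row that holds that extreme value. The sketch below compares that with a more portable `ORDER BY ... LIMIT 1` form. The `monthly` table is a hypothetical stand-in loaded with three of the monthly totals from question 3.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE monthly (year INTEGER, month INTEGER, sales REAL);
INSERT INTO monthly VALUES
    (2019, 8, 15675.05), (2019, 9, 9360.38), (2019, 10, 48411.35);
""")

# SQLite-specific: bare columns follow the row that holds the MIN.
bare = demo.execute("SELECT year, month, MIN(sales) FROM monthly").fetchone()

# Portable alternative: sort ascending and keep the first row.
portable = demo.execute(
    "SELECT year, month, sales FROM monthly ORDER BY sales LIMIT 1"
).fetchone()

print(bare, portable)
```

Both forms return the same row in SQLite; the second one would also behave the same way in other database engines.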


---

5. In the month determined from the previous question, generate a list of our total sales by state.  Make sure that all states are included, even if they have no sales (50 states + PR and DC = 52 total records).

Return:

- The two-letter state abbreviation
- Total sales for the month in question

Order the results so states with no sales are on top.

In [139]:
pd.read_sql('''
WITH SalesState AS
(SELECT *, qty*unit_price as sales
FROM tOrderDetail
JOIN tProd 
USING(prod_id)
JOIN tOrder 
USING(order_id)
JOIN tCust
USING(cust_id)
JOIN tZip 
USING(Zip)
),
totalsales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales]
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month
),
minMonth AS
(SELECT year,month, min(Sales) as Sales
FROM totalsales
),
Salesmonth AS
(SELECT st,year, month, sum(Salesstate.sales) AS sales
FROM SalesState
JOIN minMonth
USING (month, year)
GROUP BY st
)
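SELECT st,IFNULL(sales,0)
FROM tState
-- tState already lists every state, so a LEFT JOIN is enough (FULL JOIN needs SQLite 3.39 or newer)
LEFT JOIN Salesmonth
USING(st)
-- NULL sales sort first in SQLite, which puts the no-sales states on top
ORDER BY SALES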
;''',conn)

Unnamed: 0,st,"IFNULL(sales,0)"
0,AK,0.0
1,AR,0.0
2,AZ,0.0
3,CO,0.0
4,CT,0.0
5,DC,0.0
6,DE,0.0
7,MA,0.0
8,ME,0.0
9,MI,0.0
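
The key step in this question is keeping states that have no matching sales rows. A minimal sketch of that pattern follows, using hypothetical tables `states` and `month_sales` with made-up values: joining outward from the complete list and replacing the missing totals with 0.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE states (st TEXT PRIMARY KEY);
CREATE TABLE month_sales (st TEXT, sales REAL);
INSERT INTO states VALUES ('AK'), ('NY'), ('TX');
INSERT INTO month_sales VALUES ('NY', 120.50), ('TX', 75.00);
""")

rows = demo.execute("""
    SELECT st, IFNULL(sales, 0) AS sales
    FROM states
    LEFT JOIN month_sales USING (st)   -- keeps every row of states
    ORDER BY sales, st
""").fetchall()

print(rows)
```

An inner join would silently drop `AK` here; the left join keeps it and `IFNULL` turns its missing total into 0.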


---

6. For the list of states above that had $0 sales, generate a list of all the customers in those states, along with how much they have bought from us since then.

Return:
- customer id
- name, address, city, state (abbreviation is fine), and zip code
- the customer's total sales from all months after the month from question 4

Order the results with the largest sales totals on top.

In [168]:
pd.read_sql('''
WITH SalesState AS
(SELECT *, qty*unit_price as sales
FROM tOrderDetail
JOIN tProd 
USING(prod_id)
JOIN tOrder 
USING(order_id)
JOIN tCust
USING(cust_id)
JOIN tZip 
USING(Zip)
),
totalsales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales]
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month
),
minMonth AS
(SELECT year,month, min(Sales) as Sales
FROM totalsales
),
Salesmonth AS
(SELECT st,year, month, sum(Salesstate.sales) AS sales
FROM SalesState
JOIN minMonth
USING (month, year)
GROUP BY st
),
StateSale AS
(SELECT st, IFNULL (sales,0) as sales
FROM tState
LEFT JOIN Salesmonth
USING(st)
),
NoSales as
(SELECT *
FROM StateSale
WHERE sales = 0
),
postMin AS
(SELECT * 
FROM SalesState
WHERE year > (SELECT year FROM minMonth) 

UNION 

SELECT *
FROM SalesState
WHERE month > (SELECT month FROM minMonth)
    AND year = (SELECT year FROM minMonth)
)
SELECT cust_id,first,last,city,st,zip,addr,sum(postMin.sales) as Sales
FROM postMin
JOIN NoSales
USING(st)
GROUP BY cust_id
ORDER BY Sales DESC
;''',conn)

Unnamed: 0,cust_id,first,last,city,st,zip,addr,Sales
0,17,Rieekan,Gordon,Fort Smith,AR,72916,7641 Park Avenue,21368.34
1,141,Gold Leader,Zhang,Washington,DC,20202,9744 Park Avenue,19391.43
2,88,Gold Leader,Elliott,Victor,WV,25938,9326 Sycamore Street,16796.02
3,173,Plo Koon,Bass,Warner,NH,03278,7380 Heather Lane,16658.30
4,104,Rabe,Vincent,Pattonville,TX,75468,8972 Park Avenue,16273.48
...,...,...,...,...,...,...,...,...
113,186,Lieutenant Mitaka,Rivera,Neshanic Station,NJ,08853,7871 Valley Road,1512.37
114,38,Padme,Rivera,Bard,NM,88411,5882 Inverness Drive,1485.63
115,147,Bib Fortuna,Elliott,Gig Harbor,WA,98335,5364 Heather Lane,1469.19
116,306,Darth Maul,Rodriguez,Boothbay,ME,04537,4333 Front Street,443.88


---

7) Get a list of customers who did not purchase anything in the most recent month of data, along with their average sales for all months prior.

Return:

- customer id
- name, address, zip, city, st (abbreviation is fine)
- total sales for most recent month (to confirm they are all 0)
- average sales for all months prior

Order the results with the largest average monthly sales on top.

In [4]:
pd.read_sql('''
WITH LargestYear AS
(SELECT MAX(Year) as Year FROM tOrder),
LargestMonth AS
(SELECT MAX(Month) as month
FROM tOrder 
WHERE year = (SELECT year FROM LargestYear)),
BuyingCusts AS
(SELECT *
FROM tOrder 
WHERE year = (SELECT Year FROM LargestYear)
    AND month = (SELECT month FROM LargestMonth)),
NotBuyingCusts AS
(SELECT tcust.cust_id, first, last, addr, zip
FROM tCust
LEFT JOIN BuyingCusts
ON tCust.cust_id = BuyingCusts.cust_id
WHERE BuyingCusts.cust_id IS NULL),
totalsales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales], cust_id
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month,cust_id),
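-- divide by the number of months before the most recent one instead of hardcoding 33
AverageSales AS
(SELECT sum(sales)/((SELECT count(*) FROM (SELECT DISTINCT year, month FROM tOrder)) - 1) as avgsales,cust_id 
FROM totalsales
JOIN Notbuyingcusts
USING(cust_id)
GROUP BY (cust_id)),
-- SalesLast was not defined in the original cell; this definition is an assumption:
-- per-customer sales in the most recent month
SalesLast AS
(SELECT cust_id, sum(qty*unit_price) as sales
FROM BuyingCusts
JOIN tOrderDetail
USING (order_id)
JOIN tProd
USING (prod_id)
GROUP BY cust_id)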
SELECT cust_id,first,last,addr,zip,st,city,IFNULL(sales,0) as Sales, avgsales
FROM NotBuyingCusts
LEFT JOIN SalesLast
using(cust_id)
JOIN tZip 
USING(zip)
JOIN AverageSales
USING(cust_id)
ORDER BY avgsales DESC
;''',conn)

Unnamed: 0,cust_id,first,last,addr,zip,st,city,Sales,avgsales
0,246,Unkar Plutt,Woodward,5772 4th Street,51650,IA,Riverton,0,590.661212
1,256,Captain Antilles,Walker,8516 Pheasant Run,20005,DC,Washington,0,561.206667
2,58,Unkar Plutt,Schmidt,9546 Brookside Drive,13623,NY,Chippewa Bay,0,533.290000
3,59,Jobal,Mitchell,1198 West Avenue,38629,MS,Falkner,0,519.972424
4,88,Gold Leader,Elliott,9326 Sycamore Street,25938,WV,Victor,0,509.008182
...,...,...,...,...,...,...,...,...,...
178,264,Mace Windu,Greene,9572 9th Street,61957,IL,Windsor,0,43.382121
179,303,Darth Maul,Walters,7851 Magnolia Court,58046,ND,Hope,0,39.101212
180,310,Bala-Tik,Zhang,1559 Lake Avenue,93673,CA,Traver,0,36.436364
181,309,Clone Commander Cody,Benson,4836 Front Street,28390,NC,Spring Lake,0,25.335152
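
Both the most recent month and the number of months before it can be derived in SQL rather than typed in. The sketch below is one way to do that, assuming a table with `year` and `month` columns like `tOrder`; the `orders` table and its rows are hypothetical, and the count only includes months that have at least one order.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
INSERT INTO orders (year, month) VALUES
    (2020, 11), (2020, 12), (2020, 12), (2021, 1), (2021, 2);
""")

latest_year, latest_month, prior_months = demo.execute("""
    WITH months AS (SELECT DISTINCT year, month FROM orders),
    latest AS (
        SELECT year, month FROM months
        ORDER BY year DESC, month DESC
        LIMIT 1
    )
    SELECT latest.year, latest.month,
           (SELECT COUNT(*) FROM months) - 1   -- months before the latest one
    FROM latest
""").fetchone()

print(latest_year, latest_month, prior_months)
```

Sorting on year and then month picks the latest month correctly across a year boundary, which a plain `MAX(month)` over the whole table would not.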


---
8) Are there any products we haven't sold at least 1 of each month?

If so, return:
 
- product id, name, and unit price
- years and months that had no sales

Order the results with the products with the largest unit price on top.

In [5]:
pd.read_sql('''
WITH Months AS
(SELECT DISTINCT year, month
FROM tOrder),
allcombos AS
(SELECT prod_id, prod_desc,month,year,unit_price
FROM tProd
-- no ON clause: intentionally pairs every product with every month
JOIN Months),
allorders AS
(SELECT prod_id,sum(qty) as qty,year,month
FROM tOrderDetail
JOIN tOrder
Using(order_id)
Group BY prod_id,month,year)
SELECT qty,year,month,prod_id, prod_desc, unit_price
FROM allcombos
left join allorders
USING(prod_id,month,year)
WHERE qty IS NULL
ORDER BY unit_price DESC
;''',conn)

Unnamed: 0,qty,year,month,prod_id,prod_desc,unit_price
0,,2019,8,329,Chainsaw,499.99
1,,2019,9,329,Chainsaw,499.99
2,,2019,4,328,Workbench,300.0
3,,2019,9,325,Toolbox,50.0
4,,2019,7,325,Toolbox,50.0
5,,2019,7,321,Axe,27.99
6,,2019,4,318,Hacksaw,19.99
7,,2019,9,315,Pliers,15.99
8,,2019,6,314,Mallet,12.0
9,,2019,6,313,Wrench,11.0
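
The approach above has two steps: build every product and month combination, then keep the combinations with no matching sales. A compact sketch of the same idea on toy data follows; the tables `prod` and `sold` and their rows are hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE prod (prod_id INTEGER PRIMARY KEY, prod_desc TEXT);
CREATE TABLE sold (prod_id INTEGER, year INTEGER, month INTEGER, qty INTEGER);
INSERT INTO prod VALUES (1, 'Axe'), (2, 'Drill');
INSERT INTO sold VALUES (1, 2019, 1, 3), (2, 2019, 1, 1), (2, 2019, 2, 4);
""")

missing = demo.execute("""
    WITH months AS (SELECT DISTINCT year, month FROM sold),
    combos AS (
        SELECT prod_id, prod_desc, year, month
        FROM prod CROSS JOIN months          -- every product x month pair
    )
    SELECT prod_id, prod_desc, year, month
    FROM combos
    LEFT JOIN sold USING (prod_id, year, month)
    WHERE qty IS NULL                        -- keep pairs with no sale
    ORDER BY prod_id, year, month
""").fetchall()

print(missing)
```

Here the Axe sold only in January 2019, so the single missing combination is the Axe in February 2019.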


---
9) What are our top 5 selling products (in terms of total dollars sold)?

Return:

- product id, name, unit_price
- total sales (i.e. sum of qty * unit price)

Order the results by total sales, descending

In [7]:
pd.read_sql('''
SELECT sum(qty*Unit_price) as [Sales], prod_id, prod_desc
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY prod_id
ORDER BY Sales DESC
LIMIT 5
;''',conn)

Unnamed: 0,Sales,prod_id,prod_desc
0,943481.13,329,Chainsaw
1,633300.0,328,Workbench
2,154000.0,327,Ladder
3,131652.0,326,Drill
4,100850.0,325,Toolbox


---

10) What month did we have our highest sales, and what was our top selling product that month? (All in terms of dollars).

Return:

- The month and year
- The total sales for that month
- The top selling product that month (product id, and name)
- Total sales (sum of qty * unit price) for that product that month
- Total units (total qty) for that product that month

In [138]:
pd.read_sql('''
WITH monthlySales AS
(SELECT year,month, sum(qty*Unit_price) as [Sales]
FROM tOrder
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING (Prod_id)
GROUP BY year, month),
MostSalesMonth AS
(SELECT MAX(Sales) as Sales, Year, Month
FROM monthlySales),
LargestMonthSales AS
(SELECT *, sum(qty*Unit_price) as [Sales], sum(qty) as totalSold
FROM tOrder 
JOIN tOrderDetail 
USING (order_id)
JOIN tProd
USING(prod_id)
WHERE month = (SELECT month FROM mostSalesMonth)
    AND year = (SELECT year FROM mostSalesMonth)
GROUP BY prod_id)
SELECT month,year,sum(Sales) as total_sales_in_month,max(sales) as total_sales_for_product, prod_id, prod_desc, totalSold
FROM LargestMonthSales
;''',conn)

Unnamed: 0,month,year,total_sales_in_month,total_sales_for_product,prod_id,prod_desc,totalSold
0,10,2021,251768.12,86998.26,329,Chainsaw,174
