# Sub-Queries, Temporary Tables and WITH statements

We connect to a MySQL server (managed through MySQL Workbench) and analyse the parch-and-posey database. This notebook contains the practical exercises for the Udacity course SQL for Data Analysis.

In [None]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

**Next, we create a connection to the parch-and-posey database in MySQL Workbench**

In [None]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='parch_and_posey',
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

In [None]:
# Let's see the tables in the database

# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

**Defining a method that converts the result of a query to a dataframe**

In [None]:
def query_to_df(query):
    st = time.time()
    # Every query must end with a semi-colon
    # (an explicit check, because assert statements are skipped under python -O)
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query
    cursor.execute(query)
    # cursor.description holds one tuple per column; item 0 is the column name
    columns = [col[0] for col in cursor.description]

    # Create a DataFrame from all result rows in one step
    # (an empty result gives an empty DataFrame that still has the column names)
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [None]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

In [None]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

In [None]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

In [None]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

In [None]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

## SubQueries:

Subqueries, also known as **inner queries** or **nested queries**, are tools for performing operations in multiple steps.

### More Subquery Tips:

* The original (inner) query goes in the `FROM` clause of the outer query.
* An `*` is used in the outer `SELECT` statement to pull all data from the original query.
* You must use an alias for the table you nest within the outer query.
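
The tips above can be tried without a MySQL server. The sketch below is an illustration only: it assumes Python's built-in `sqlite3` module and a tiny made-up `web_events` table, neither of which is part of this notebook's MySQL setup. It shows an inner query placed in `FROM`, given the alias `sub`, and read by the outer query.

```python
import sqlite3

# In-memory SQLite database with a toy web_events table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, occurred_at TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?, ?)",
    [
        (1, "2016-01-01 10:00:00", "direct"),
        (2, "2016-01-01 11:00:00", "direct"),
        (3, "2016-01-01 12:00:00", "facebook"),
        (4, "2016-01-02 09:00:00", "direct"),
    ],
)

# The inner query counts events per day and channel; the outer query
# averages those daily counts per channel. The alias `sub` names the
# derived table that the outer query reads from.
query = """
SELECT channel, AVG(num_events) AS avg_num_events
FROM (SELECT DATE(occurred_at) AS day, channel, COUNT(*) AS num_events
      FROM web_events
      GROUP BY day, channel) AS sub
GROUP BY channel
ORDER BY channel;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

SQLite happens to accept a derived table without an alias, but MySQL rejects it, so keeping the alias makes the query portable.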

**EX 1: Find the number of events that occur per day per channel (this query becomes the subquery in EX 2)**

In [None]:
query_to_df(
"SELECT DATE(occurred_at) date, channel, COUNT(*) num_events \
FROM web_events GROUP BY 1, 2 ORDER BY 1;"
)

**EX 2: Use a subquery to find the average number of events per day for each channel**

In [None]:
query_to_df(
"SELECT channel, AVG(num_events) avg_num_events FROM \
(SELECT DATE(occurred_at) date, channel, COUNT(*) num_events \
FROM web_events GROUP BY 1, 2 ORDER BY 1) as sub GROUP BY 1 ORDER BY 2 DESC;"
)

**EX 3: On which day-channel pair did the most events occur?**

In [None]:
query_to_df(
"SELECT DATE(occurred_at) date, channel, COUNT(*) num_events \
FROM web_events GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 5;"
)

## Subquery Formatting

When writing subqueries, it is easy for a query to look incredibly complex. Consistent formatting helps your reader, who is often just yourself at a future date, understand the code.

The important thing to remember when using subqueries is to give the reader an easy way to see which parts of the query will be executed together. Most people do this by indenting the subquery in some way.

### Well Formatted Query:
```
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2
      ORDER BY 3 DESC) sub;
```

Note that `DATE_TRUNC` is PostgreSQL syntax; in MySQL, which this notebook uses, the closest equivalent for truncating to the day is `DATE(occurred_at)`.

Additionally, if a `GROUP BY`, `ORDER BY`, `WHERE`, `HAVING`, or any other clause follows the subquery, we indent it at the same level as the outer query.

The query below is similar to the one above, but it applies additional clauses to the outer query: the `GROUP BY` and `ORDER BY` that act on the output are not indented, while the inner query's `GROUP BY` and `ORDER BY` are indented to match the inner table.
```
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2
      ORDER BY 3 DESC) sub
GROUP BY day, channel, events
ORDER BY 2 DESC;
```
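
The same formatting can be kept when a query is written inside Python. The sketch below is one possible approach, not the notebook's method: it assumes a triple-quoted string combined with the standard-library `textwrap.dedent`, so the indentation of the subquery survives without trailing backslashes.

```python
import textwrap

# Triple-quoted string: line breaks and indentation are preserved, so the
# subquery stays visually nested. dedent() strips the common left margin.
query = textwrap.dedent("""\
    SELECT channel, AVG(num_events) AS avg_num_events
    FROM (SELECT DATE(occurred_at) AS day,
                 channel, COUNT(*) AS num_events
          FROM web_events
          GROUP BY 1, 2) sub
    GROUP BY 1
    ORDER BY 2 DESC;""")

print(query)
```

Because the string ends with `;` and has no trailing newline, it also passes the semi-colon check in `query_to_df`.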

### MORE on Subquery:

A subquery can be used in several places within a query: anywhere we might use a table name, a column name, or an individual value.
* They are especially useful in conditional logic, in conjunction with `WHERE` and `JOIN` clauses, or in the `WHEN` portion of a `CASE` statement.
* Most conditional logic works only with subqueries that return a one-cell result. `IN` is the type of conditional logic covered here that works when the inner query returns multiple results.

### Expert Tip
Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or a set of values in the `IN` case) rather than as a table.

Also, a comparison such as `=` works against a single value. If the subquery returns an entire column, `IN` must be used for the comparison. If it returns an entire table, then we must give the table an alias and perform additional logic on the whole table.
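
To make the distinction concrete, here is a minimal sketch that assumes Python's built-in `sqlite3` module and a made-up four-row `orders` table (not the real parch-and-posey data). The first query compares against a one-cell subquery with `=`, the second uses `IN` against a subquery that returns several rows, and neither subquery carries an alias.

```python
import sqlite3

# Toy orders table (hypothetical data) in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 50), (2, 10, 70), (3, 20, 70), (4, 30, 20)],
)

# Scalar subquery: returns exactly one cell, so `=` is valid. No alias.
scalar_query = """
SELECT id FROM orders
WHERE total = (SELECT MAX(total) FROM orders)
ORDER BY id;
"""
largest_orders = [row[0] for row in conn.execute(scalar_query)]

# Column subquery: returns several rows, so `IN` is used. Still no alias.
in_query = """
SELECT id FROM orders
WHERE account_id IN (SELECT account_id FROM orders WHERE total >= 70)
ORDER BY id;
"""
orders_of_big_buyers = [row[0] for row in conn.execute(in_query)]

print(largest_orders, orders_of_big_buyers)
conn.close()
```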

**EXERCISE**:

You should write your solution as a subquery or subqueries, not by finding one solution and copying the output. The importance of this is that it allows your query to be dynamic in answering the question - even if the data changes, you still arrive at the right answer.
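
The point about dynamic queries can be illustrated with a small sketch. It assumes Python's built-in `sqlite3` module, a made-up `orders` table, and SQLite's `strftime` in place of MySQL's `EXTRACT(YEAR_MONTH FROM ...)`. The same query is run before and after an earlier order is inserted, and the subquery picks up the new first month without any edit to the query text.

```python
import sqlite3

# Toy orders table (hypothetical data) in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2013-12-04 04:22:44"),
        (2, "2013-12-20 10:00:00"),
        (3, "2014-01-05 09:30:00"),
    ],
)

# The first month is computed by the subquery, not typed in by hand.
first_month_query = """
SELECT id FROM orders
WHERE strftime('%Y-%m', occurred_at) =
      (SELECT strftime('%Y-%m', MIN(occurred_at)) FROM orders)
ORDER BY id;
"""
before = [row[0] for row in conn.execute(first_month_query)]

# An even earlier order arrives; the same query adapts on its own.
conn.execute("INSERT INTO orders VALUES (4, '2013-11-30 08:00:00')")
after = [row[0] for row in conn.execute(first_month_query)]

print(before, after)
conn.close()
```

A query with the month hard-coded as `'2013-12'` would have kept returning orders 1 and 2 after the insert, which is no longer the right answer.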

**1. Use subqueries to return only the orders that took place in the same month and year as the first ever order**

In [None]:
query_to_df(
"SELECT * FROM orders WHERE EXTRACT(YEAR_MONTH FROM occurred_at) = \
(SELECT EXTRACT(YEAR_MONTH FROM MIN(occurred_at)) first_order_month FROM orders)\
ORDER BY occurred_at;"
)

**2. From the above result, use Sub-queries to return the average qty for each type of paper sold during this same first month as well as the total sales in USD for this month.**

In [None]:
query_to_df(
"SELECT EXTRACT(YEAR_MONTH FROM occurred_at) Date, \
AVG(standard_qty), AVG(gloss_qty), AVG(poster_qty), SUM(total_amt_usd) FROM \
(SELECT * FROM orders WHERE EXTRACT(YEAR_MONTH FROM occurred_at) = \
(SELECT EXTRACT(YEAR_MONTH FROM MIN(occurred_at)) FROM orders)) as sub GROUP BY 1;"
)

**3. Use subqueries to Provide the name of the sales_rep in each region with the largest amount of total_amt_usd sales.**

In [None]:
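query_to_df(
"SELECT t.sales_rep, t.region, t.total_amt_usd FROM \
(SELECT s.name sales_rep, r.name region, SUM(o.total_amt_usd) total_amt_usd \
FROM sales_reps s JOIN region r ON s.region_id=r.id JOIN accounts a ON \
a.sales_rep_id=s.id JOIN orders o ON o.account_id=a.id GROUP BY 1, 2) AS t JOIN \
(SELECT region, MAX(total_amt_usd) max_amt FROM (SELECT r.name region, SUM(o.total_amt_usd) total_amt_usd \
FROM sales_reps s JOIN region r ON s.region_id=r.id JOIN accounts a ON a.sales_rep_id=s.id \
JOIN orders o ON o.account_id=a.id GROUP BY s.name, r.name) AS per_rep GROUP BY 1) AS m \
ON t.region=m.region AND t.total_amt_usd=m.max_amt ORDER BY 3 DESC;"
)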

**4. For the region with the largest (sum) of sales total_amt_usd, how many total (count) orders were placed?**

In [None]:
query_to_df(
'SELECT r.name region, COUNT(o.id) total_orders FROM region r JOIN sales_reps s \
ON s.region_id=r.id JOIN accounts a ON a.sales_rep_id=s.id JOIN orders o ON o.account_id=a.id \
WHERE r.name=(SELECT region FROM \
(SELECT r.name region, SUM(o.total_amt_usd) total_amt_usd FROM region r JOIN sales_reps s ON s.region_id=r.id \
JOIN accounts a ON a.sales_rep_id=s.id JOIN orders o ON o.account_id=a.id GROUP BY 1 ORDER BY 2 DESC LIMIT 1) AS sub) GROUP BY 1;'
)

**5. How many accounts had more total purchases than the account name which has bought the most standard_qty paper throughout their lifetime as a customer?**

In [None]:
query_to_df(
"SELECT COUNT(*) count FROM \
(SELECT a.name acct_name FROM accounts a JOIN orders o ON a.id=o.account_id GROUP BY a.name HAVING \
SUM(o.total) > (SELECT total_qty FROM \
(SELECT a.name acct_name, SUM(o.standard_qty) total_std_qty, SUM(o.total) total_qty FROM accounts a \
JOIN orders o ON a.id=o.account_id GROUP BY 1 ORDER BY 2 DESC LIMIT 1) AS inner_tab)) outer_tab;"
)

**6. For the customer that spent the most (in total over their lifetime as a customer) total_amt_usd, how many web_events did they have for each channel?**

In [None]:
query_to_df(
"SELECT a.name acct_name, w.channel channel, COUNT(w.id) web_events FROM accounts a JOIN web_events w ON \
w.account_id=a.id WHERE a.name = \
(SELECT name FROM \
(SELECT a.name, SUM(o.total_amt_usd) total_sum_usd FROM accounts a JOIN orders o ON a.id=o.account_id \
GROUP BY 1 ORDER BY 2 DESC LIMIT 1) AS sub) GROUP BY 1, 2;"
)

**7. What is the lifetime average amount spent in terms of total_amt_usd for the top 10 total spending accounts?**

In [None]:
query_to_df(
"SELECT AVG(total_amt_usd) AVG_lifetime_spend_top_10_accounts FROM \
(SELECT a.name acct_name, SUM(o.total_amt_usd) total_amt_usd FROM accounts a JOIN orders o \
ON a.id=o.account_id GROUP BY 1 ORDER BY 2 DESC LIMIT 10) as avg;"
)

**8. What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the average of all orders?**

In [None]:
# First find the general lifetime average for all orders

query_to_df(
"SELECT (SUM(orders.total_amt_usd) / COUNT(orders.total)) avg_general_order FROM orders;"
)

In [None]:
# Then find the accounts whose lifetime average is > the general average

query_to_df(
"SELECT DISTINCT a.name acct_name, (SUM(o.total_amt_usd) / COUNT(o.total)) above_avg_avg \
FROM accounts a JOIN orders o ON o.account_id=a.id \
GROUP BY 1 HAVING (SUM(o.total_amt_usd) / COUNT(o.total)) \
> (SELECT (SUM(orders.total_amt_usd) / COUNT(orders.total)) avg_general_order FROM orders);"
)

In [None]:
# Now find the lifetime average for all these companies together

query_to_df(
"SELECT AVG(above_avg_avg) above_avg_spenders_lifetime_avg FROM \
(SELECT DISTINCT a.name acct_name, (SUM(o.total_amt_usd) / COUNT(o.total)) above_avg_avg \
FROM accounts a JOIN orders o ON o.account_id=a.id \
GROUP BY 1 HAVING (SUM(o.total_amt_usd) / COUNT(o.total)) \
> (SELECT (SUM(orders.total_amt_usd) / COUNT(orders.total)) avg_general_order FROM orders)) AS avg;"
)

## WITH Statement:

The `WITH` statement is often called a **Common Table Expression** or **CTE**. These expressions serve the same purpose as subqueries, but they are more common in practice because they tend to be easier for a future reader to follow, whereas subqueries can make queries lengthy and difficult to read. MySQL supports `WITH` from version 8.0 onward.

* **`Common Table Expressions (CTEs)`** can help break a query into separate components so that the query logic is more easily readable. 
* We can read the Subquery logic on its own and then read the final query logic easily as well.
* We can theoretically write as many CTEs as we want.
* We need to define any CTEs at the beginning of the query in order to use them in our final query at the bottom.
* Each CTE gets an alias, just like a subquery.
* When creating multiple tables using `WITH`, we add a comma after every table except the last one before the final query.

We can create an additional table to pull from in the following way:
```
WITH table1 AS (
          SELECT *
          FROM web_events),

     table2 AS (
          SELECT *
          FROM accounts)


SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;
```

You can add more tables to the `WITH` statement in the same way.
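
As a quick check that a CTE and a subquery express the same logic, the sketch below runs both forms of one query and compares the results. It assumes Python's built-in `sqlite3` module and a made-up `orders` table; the CTE name `top_accounts` is hypothetical and chosen only for this illustration.

```python
import sqlite3

# Toy orders table (hypothetical data) in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 300.0), (2, 50.0), (3, 400.0), (3, 600.0)],
)

# Subquery form: the derived table is nested inside FROM.
subquery_version = """
SELECT AVG(total_spent)
FROM (SELECT account_id, SUM(total_amt_usd) AS total_spent
      FROM orders
      GROUP BY account_id
      ORDER BY total_spent DESC
      LIMIT 2) AS sub;
"""

# CTE form: the same derived table is defined first, then read by name.
cte_version = """
WITH top_accounts AS (
     SELECT account_id, SUM(total_amt_usd) AS total_spent
     FROM orders
     GROUP BY account_id
     ORDER BY total_spent DESC
     LIMIT 2)
SELECT AVG(total_spent)
FROM top_accounts;
"""

sub_result = conn.execute(subquery_version).fetchone()[0]
cte_result = conn.execute(cte_version).fetchone()[0]
print(sub_result, cte_result)
conn.close()
```

Both forms average the lifetime spend of the top two accounts; the CTE simply moves the inner query to the top, where it can be read on its own.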


### Quiz WITH:

Essentially a `WITH` statement performs the same task as a Subquery. Therefore, you can write any of the queries we worked with in the previous exercises on Subqueries, using a `WITH`.

**1. Provide the name of the sales_rep in each region with the largest amount of total_amt_usd sales.**


In [None]:
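query_to_df(
"WITH \
table1 AS (SELECT s.name Sales_rep, r.name Region, SUM(o.total_amt_usd) Total_amt_usd FROM sales_reps s \
JOIN accounts a ON s.id=a.sales_rep_id JOIN region r on s.region_id=r.id JOIN orders o ON \
o.account_id=a.id GROUP BY 2, 1), \
\
table2 AS (SELECT Region, MAX(Total_amt_usd) Max_amt FROM table1 GROUP BY 1) \
\
SELECT t1.Sales_rep, t1.Region, t1.Total_amt_usd FROM table1 t1 JOIN table2 t2 \
ON t1.Region=t2.Region AND t1.Total_amt_usd=t2.Max_amt ORDER BY 3 DESC;"
)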

**2. For the region with the largest sales total_amt_usd, how many total orders were placed?**

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT r.name Region, SUM(o.total_amt_usd) Total_sales, COUNT(o.id) Total_orders \
FROM region r JOIN sales_reps s ON s.region_id=r.id JOIN accounts a ON a.sales_rep_id=s.id \
JOIN orders o ON o.account_id=a.id GROUP BY 1 ORDER BY 2 DESC LIMIT 1) \
\
SELECT Region, Total_orders FROM table1;"
)

**3. How many accounts had more total purchases than the account name which has bought the most standard_qty paper throughout their lifetime as a customer?**

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT a.name acct_name, SUM(o.standard_qty) total_std_qty FROM accounts a JOIN orders o \
ON o.account_id=a.id GROUP BY 1 ORDER BY 2 DESC LIMIT 1), \
\
table2 AS (SELECT acct_name FROM table1), \
\
table3 AS (SELECT SUM(o.total) total FROM accounts a JOIN orders o ON \
a.id=o.account_id WHERE a.name=(select * from table2)), \
\
table4 AS (SELECT a.name acct_name FROM accounts a JOIN orders o ON \
a.id=o.account_id GROUP BY a.name HAVING SUM(o.total)>(select * FROM table3)) \
\
SELECT COUNT(*) count FROM table4;"
)

**4. For the customer that spent the most (in total over their lifetime as a customer) total_amt_usd, how many web_events did they have for each channel?**

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT a.id acct_id, SUM(o.total_amt_usd) total_usd FROM accounts a JOIN orders o ON \
a.id=o.account_id GROUP BY 1 ORDER BY 2 DESC LIMIT 1), \
\
table2 AS (SELECT a.name acct_name, w.account_id acct_id, w.channel channel, COUNT(w.occurred_at) web_events FROM \
web_events w JOIN accounts a ON a.id=w.account_id WHERE w.account_id=(SELECT acct_id FROM table1) GROUP BY 1,2,3 \
ORDER BY 4 DESC) \
\
SELECT * FROM table2;"
)

**5. What is the lifetime average amount spent in terms of total_amt_usd for the top 10 total spending accounts?**

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT a.name acct_name, SUM(o.total_amt_usd) total_spent FROM accounts a JOIN orders o \
ON a.id=o.account_id GROUP BY 1 ORDER BY 2 DESC LIMIT 10) \
\
SELECT AVG(total_spent) FROM table1;"
)

**6. What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the average of all orders?**

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT (SUM(orders.total_amt_usd) / COUNT(orders.total)) avg_order FROM orders), \
\
table2 AS (SELECT a.name acct_name, (SUM(o.total_amt_usd) / COUNT(o.total)) above_avg FROM \
accounts a JOIN orders o ON a.id=o.account_id GROUP BY 1 HAVING \
(SUM(o.total_amt_usd) / COUNT(o.total))>(SELECT * FROM table1)) \
\
SELECT AVG(above_avg) FROM table2;"
)

In [None]:
# Change True to False to keep the connection open at the end of the notebook

if True and connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Terminating MySQL Connection: {record} Database')