# DATE_TIMES and CASE Statements

In this notebook we connect to a MySQL server (managed through MySQL Workbench) and analyze the parch-and-posey database. The notebook contains the practical exercises for the Udacity course SQL for Data Analysis.

In [None]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

**Next, we create a connection to the parch-and-posey database on the local MySQL server**

In [None]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='parch_and_posey',
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

In [None]:
# Let's list the tables in the database with the SHOW TABLES command

cursor.execute('show tables')
out = cursor.fetchall()
out

**Defining a method that converts the result of a query to a dataframe**
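
The conversion rests on `cursor.description`, which in Python DB-API drivers holds one tuple per result column with the column name in the first position. Here is a minimal sketch of that idea. It assumes the standard-library `sqlite3` module as a stand-in for the MySQL connection so that it runs without a server, and the one-row query is made up for illustration:

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL connection.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("SELECT 1 AS id, 'Walmart' AS name")

# The first item of each description tuple is the column name.
columns = [col[0] for col in cursor.description]

# Pair every row with the column names to get one dict per row.
records = [dict(zip(columns, row)) for row in cursor.fetchall()]

print(columns)
print(records)
conn.close()
```

A list of column names plus a list of rows is all pandas needs to build a DataFrame.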

In [None]:
def query_to_df(query):
    """Run a SQL query on the open cursor and return the result as a DataFrame."""
    st = time.time()
    # Every query must end with a semi-colon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Run the query and read the column names from the cursor metadata
    cursor.execute(query)
    columns = [column[0] for column in cursor.description]

    # Create a DataFrame from all rows in one step
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [None]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

In [None]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

In [None]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

In [None]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

In [None]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

## Date Functions

Grouping by a date column is not usually very useful in SQL, as these columns tend to hold transaction data down to the second. Keeping date information at such a granular level is both a blessing and a curse: it gives really precise information (a blessing), but it makes grouping rows together directly difficult (a curse).

Luckily, there are a number of built-in SQL functions aimed at making dates easier to work with.<br>
**[MySQL Date-Time Functions](https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_extract)**
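
To see why truncating the timestamp matters, here is a small sketch. It assumes the standard-library `sqlite3` module as a stand-in for MySQL, so SQLite's `strftime('%Y', ...)` plays the role of MySQL's `EXTRACT(YEAR FROM ...)`, and the three rows are made-up toy data rather than values from the real `orders` table:

```python
import sqlite3

# Toy stand-in for the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (occurred_at TEXT, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2015-10-06 17:31:14", 100.0),
        ("2015-11-05 03:34:33", 200.0),
        ("2016-01-02 00:55:03", 50.0),
    ],
)

# Grouping on the raw timestamp gives one group per distinct second.
by_timestamp = conn.execute(
    "SELECT occurred_at, SUM(total_amt_usd) FROM orders GROUP BY occurred_at"
).fetchall()

# Truncating to the year first collapses the rows into one group per year.
by_year = conn.execute(
    "SELECT strftime('%Y', occurred_at) AS order_year, SUM(total_amt_usd) "
    "FROM orders GROUP BY order_year ORDER BY order_year"
).fetchall()

print(len(by_timestamp))  # as many groups as rows
print(by_year)
conn.close()
```

With the raw timestamp every row ends up in its own group, while the yearly version sums the two 2015 orders together.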

#### Questions: Working With DATEs

**A. Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?**

In [None]:
query_to_df(
"SELECT EXTRACT(YEAR FROM o.occurred_at) order_year, SUM(o.total_amt_usd) total_spent FROM orders o \
GROUP BY 1 ORDER BY 2 DESC;"
)

**Let's plot the result of the last query.**

In [None]:
x=query_to_df(
    "SELECT EXTRACT(YEAR FROM occurred_at) order_year, SUM(total_amt_usd) \
    total_spent FROM orders GROUP BY 1 ORDER BY 2 DESC;"
)
x['total_spent'] = x['total_spent'].astype('float64')
x.set_index('order_year').plot.bar()
plt.title("Yearly Total Spent")
plt.show()

**B. Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset?**

In [None]:
query_to_df(
"SELECT EXTRACT(MONTH FROM occurred_at) sales_monthly, SUM(total_amt_usd) total \
FROM orders GROUP BY 1 ORDER BY 2 DESC;"
)

In [None]:
x=query_to_df(
"SELECT EXTRACT(MONTH FROM occurred_at) sales_monthly, SUM(total_amt_usd) total \
FROM orders GROUP BY 1 ORDER BY 2 DESC;"
)
# convert total from obj to float 
x.total = x.total.astype('float64')
# set index to monthly sales
x.set_index('sales_monthly').plot.bar(color='y')
# add title and plot
plt.title('Monthly Sales')
plt.show()

**C. Which year did Parch & Posey have the greatest sales in terms of total number of orders? Are all years evenly represented by the dataset?**

In [None]:
query_to_df(
"SELECT EXTRACT(YEAR FROM occurred_at) order_year, COUNT(*) total_orders FROM orders \
GROUP BY 1 ORDER BY 2 DESC;"
)

In [None]:
x=query_to_df(
"SELECT EXTRACT(YEAR FROM occurred_at) order_year, COUNT(*) total_orders FROM orders \
GROUP BY 1 ORDER BY 2 DESC;"
)
# convert total from obj to float 
x.total_orders = x.total_orders.astype('float64')
# set index to order_year
x.set_index('order_year').plot.bar(color='r')
# add title and plot
plt.title('Yearly Orders')
plt.show()

**D. Which month did Parch & Posey have the greatest sales in terms of total number of orders? Are all months evenly represented by the dataset?**

In [None]:
query_to_df(
"SELECT EXTRACT(MONTH FROM occurred_at) order_month, COUNT(*) monthly_orders FROM orders \
GROUP BY 1 ORDER BY 2 DESC;"
)

In [None]:
x=query_to_df(
"SELECT EXTRACT(MONTH FROM occurred_at) order_month, COUNT(*) monthly_orders FROM orders \
GROUP BY 1 ORDER BY 2 DESC;"
)
# convert total from obj to float 
x.monthly_orders = x.monthly_orders.astype('float64')
# set index to monthly orders
x.set_index('order_month').plot.bar(color='pink')
# add title and plot
plt.title('Monthly Orders')
plt.show()

**E. In which month of which year did Walmart spend the most on gloss paper in terms of dollars?**

In [None]:
query_to_df(
"SELECT EXTRACT(YEAR_MONTH FROM o.occurred_at) yr_month, SUM(o.gloss_amt_usd) gloss_usd_amt, \
a.name acct_name FROM orders o JOIN accounts a ON a.id=o.account_id WHERE a.name LIKE '%Walmart' \
GROUP BY 1, 3 ORDER BY 2 DESC LIMIT 1;"
)

In [None]:
# Same Query using HAVING

query_to_df(
    "SELECT a.name acct_name, EXTRACT(YEAR_MONTH FROM o.occurred_at) year_mnth, \
    SUM(o.gloss_amt_usd) gloss_spent FROM accounts a JOIN orders o ON a.id=o.account_id \
    GROUP BY 1, 2 HAVING acct_name LIKE '%Walmart' ORDER BY 3 DESC LIMIT 1;"
)

## Case Statements

Case statements are SQL's way of handling If-Then logic.

We can create derived columns using `CASE` statements to answer interesting questions about the data.
* The `CASE` keyword is followed by at least one pair of `WHEN` and `THEN` clauses, which are SQL's equivalent of If and Then.
* The `CASE` statement must finish with the word `END`.
* We can define a `CASE` statement with as many `WHEN`/`THEN` pairs as we like.
* The `WHEN` conditions are evaluated in the order they are written, one after another, and the first one that is true decides the result.
* It is best to write `WHEN` conditions that don't overlap.
* We can add `AND` and `OR` to create finer conditions in the `WHEN` clauses.
* Combined with `GROUP BY`, a `CASE` statement lets us count several different conditions in one query, unlike the `WHERE` clause, which lets us count only one condition at a time. For example:
```
SELECT CASE WHEN total > 500 THEN "Over-500" ELSE "500-or-under" 
END AS total_group, COUNT(*) as order_count FROM orders GROUP BY 1
```
* Finally we can combine `CASE` statements with aggregations to produce enhanced results.
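
The evaluation order of the `WHEN` conditions can be checked with a small sketch. It assumes the standard-library `sqlite3` module as a stand-in for MySQL, and the helper `label_orders`, the labels, and the three rows are hypothetical and exist only for this illustration:

```python
import sqlite3

# Toy stand-in for the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50), (2, 700), (3, 2500)])


def label_orders(case_sql):
    """Return (id, label) pairs for the given CASE expression."""
    query = f"SELECT id, {case_sql} AS total_group FROM orders ORDER BY id"
    return conn.execute(query).fetchall()


# Most specific condition first: each row gets the intended label.
specific_first = label_orders(
    "CASE WHEN total >= 2000 THEN 'At Least 2000' "
    "WHEN total >= 500 THEN 'At Least 500' ELSE 'Under 500' END"
)

# Broad condition first: it also matches 2500, so the second WHEN never fires.
broad_first = label_orders(
    "CASE WHEN total >= 500 THEN 'At Least 500' "
    "WHEN total >= 2000 THEN 'At Least 2000' ELSE 'Under 500' END"
)

print(specific_first)
print(broad_first)
conn.close()
```

Because the first true condition wins, overlapping conditions make the result depend on their order, which is why conditions that don't overlap are easier to reason about.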

In [None]:
query_to_df(
'SELECT CASE WHEN total > 500 THEN "Over-500" ELSE "500-or-under" \
END AS total_group, COUNT(*) as order_count FROM orders GROUP BY 1;'
)

### CASE - Expert Tip
* In these exercises the `CASE` statement goes in the `SELECT` clause, although SQL also accepts it in other clauses such as `WHERE` and `ORDER BY`.

* `CASE` must include the following components: `WHEN`, `THEN`, and `END`. `ELSE` is an optional component to catch cases that didn’t meet any of the other previous CASE conditions.

* You can write any conditional expression between `WHEN` and `THEN`, using the same conditional operators as in a `WHERE` clause. This includes stringing together multiple conditions using `AND` and `OR`.

* You can include multiple `WHEN` statements, as well as an `ELSE` statement again, to deal with any unaddressed conditions.
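
As a sketch of what the optional `ELSE` does, the query below runs the same condition with and without it. It assumes the standard-library `sqlite3` module as a stand-in for MySQL, and the two rows are made up. When no `WHEN` matches and there is no `ELSE`, the result is `NULL`, which Python shows as `None`:

```python
import sqlite3

# Toy stand-in for the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(100,), (900,)])

rows = conn.execute(
    """
    SELECT total,
           CASE WHEN total > 500 THEN 'Over-500' END AS no_else,
           CASE WHEN total > 500 THEN 'Over-500'
                ELSE '500-or-under' END AS with_else
    FROM orders
    ORDER BY total
    """
).fetchall()

print(rows)
conn.close()
```

Adding an `ELSE` is therefore the simplest way to avoid unexpected `NULL` labels for rows that no condition covers.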

### Example
In a quiz question in the previous Basic SQL lesson, you saw this question:

Create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields. NOTE - you will be thrown an error with the correct solution to this question. This is for a division by zero. You will learn how to get a solution without an error to this query when you learn about CASE statements in a later section.

Let's see how we can use the CASE statement to get around this error.

```
SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;
```
Above is the earlier solution. Let's use `CASE` below:
```
SELECT id, account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                            ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders
LIMIT 10;
```
Now the first part of the statement catches any zero or missing quantities that were causing the division-by-zero error, and the `ELSE` branch computes the division for all other rows.
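
A small runnable check of this guard, assuming the standard-library `sqlite3` module as a stand-in for MySQL and three made-up rows. Database engines differ in how they treat an unguarded division by zero (SQLite returns `NULL` rather than raising an error), but the guarded query yields 0 for those rows either way:

```python
import sqlite3

# Toy stand-in for the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, standard_qty INTEGER, standard_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 50.0), (2, 0, 0.0), (3, None, None)],
)

# The first WHEN handles zero and missing quantities before any division runs.
rows = conn.execute(
    """
    SELECT id,
           CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                ELSE standard_amt_usd / standard_qty END AS unit_price
    FROM orders
    ORDER BY id
    """
).fetchall()

print(rows)
conn.close()
```

Only the first row has a usable quantity, so it is the only one whose unit price comes from the division.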

In [None]:
query_to_df(
    'SELECT CASE WHEN standard_qty=0 OR standard_qty IS NULL THEN 0 \
    ELSE standard_amt_usd/standard_qty END AS unit_price FROM orders LIMIT 1;'
)

## Case and Aggregations

Combining `CASE` with aggregate functions lets us split data into separate columns, one per condition. This can be useful depending on what you want to do, but this level of separation is often easier to achieve in another programming language than in SQL.
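
As a sketch of combining `CASE` with an aggregate, the query below counts orders above and below a threshold in two separate columns per account. It assumes the standard-library `sqlite3` module as a stand-in for MySQL, and the rows and column aliases are made up for illustration:

```python
import sqlite3

# Toy stand-in for the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1001, 100), (1001, 900), (1001, 700), (1002, 50)],
)

# Each SUM adds 1 for rows that meet its condition, giving one column per condition.
rows = conn.execute(
    """
    SELECT account_id,
           SUM(CASE WHEN total > 500 THEN 1 ELSE 0 END) AS over_500,
           SUM(CASE WHEN total <= 500 THEN 1 ELSE 0 END) AS at_most_500
    FROM orders
    GROUP BY account_id
    ORDER BY account_id
    """
).fetchall()

print(rows)
conn.close()
```

Each account gets a single row, with its order counts spread across the two condition columns.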

### Questions: CASE

**A. Write a query to display for each order, the account ID, total amount of the order, and the level of the order - `Large` or `Small` - depending on if the order is `$3000` or more, or smaller than `$3000`.**

In [None]:
query_to_df(
"SELECT account_id acct_id, total_amt_usd total_sum, CASE WHEN total_amt_usd>=3000 THEN 'Large' \
ELSE 'Small' END AS level FROM orders;"
)

**B. Write a query to display the number of orders in each of three categories, based on the total number of items in each order. The three categories are: 'At Least 2000', 'Between 1000 and 2000' and 'Less than 1000'.**

In [None]:
query_to_df(
"SELECT COUNT(*) tot_order_qty, CASE WHEN total<1000 THEN 'Less than 1000' WHEN total>=2000 THEN 'At Least 2000' \
ELSE 'Between 1000 and 2000' END AS category FROM orders GROUP BY 2;"
)

**C. We would like to understand 3 different levels of customers based on the amount associated with their purchases. The top level includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second level is between 200,000 and 100,000 usd. The lowest level is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.**

In [None]:
query_to_df(
"SELECT acct_name, total_sales_usd, CASE WHEN total_sales_usd>200000 THEN 'Top-Level' WHEN total_sales_usd<100000 \
THEN 'Low-Level' ELSE 'Mid-Level' END AS level FROM (SELECT a.name acct_name, SUM(o.total_amt_usd) total_sales_usd \
FROM orders o JOIN accounts a ON a.id=o.account_id GROUP BY 1) AS T1 ORDER BY 2 DESC;"
)

**D. We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.**

In [None]:
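query_to_df(
"SELECT a.name acct_name, SUM(o.total_amt_usd) total_spent, CASE WHEN SUM(o.total_amt_usd)>200000 THEN 'Top-Level' \
WHEN SUM(o.total_amt_usd)<100000 THEN 'Low-Level' ELSE 'Mid-Level' END AS level FROM orders o \
JOIN accounts a ON a.id=o.account_id WHERE o.occurred_at>='2016-01-01' AND o.occurred_at<'2018-01-01' \
GROUP BY 1 ORDER BY 2 DESC;"
)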

**E. We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top sales people first in your final table.**

In [None]:
query_to_df(
"SELECT *, CASE WHEN total_orders>200 THEN 'Top-Performer' ELSE 'Low-Performer' END as category FROM \
(SELECT DISTINCT s.name sales_rep, COUNT(o.id) total_orders FROM sales_reps s JOIN accounts a ON a.sales_rep_id=s.id \
JOIN orders o ON o.account_id=a.id GROUP BY 1) AS T1 ORDER BY 2 DESC;"
)

**F. The previous didn't account for the middle, nor the dollar amount associated with the sales. Management decides they want to see these characteristics represented as well. We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders or more than 750000 in total sales. The middle group has any rep with more than 150 orders or 500000 in sales. Create a table with the sales rep name, the total number of orders, total sales across all orders, and a column with top, middle, or low depending on this criteria. Place the top sales people based on dollar amount of sales first in your final table. You might see a few upset sales people by this criteria!**
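
Before writing the SQL, the three tiers can be sketched as a plain Python function. The name `rep_tier` and the sample numbers are hypothetical. The point is that the checks run in order, so a rep is "low" only when neither the top nor the middle rule applies:

```python
def rep_tier(total_orders: int, total_sales: float) -> str:
    """Mirror the top/middle/low rule from question F (hypothetical helper)."""
    # Top: more than 200 orders OR more than 750000 in total sales.
    if total_orders > 200 or total_sales > 750_000:
        return "top"
    # Middle: more than 150 orders OR more than 500000 in total sales.
    if total_orders > 150 or total_sales > 500_000:
        return "middle"
    # Everyone else fails both middle conditions.
    return "low"


print(rep_tier(210, 100_000))  # many orders, modest sales
print(rep_tier(160, 400_000))  # enough orders for the middle tier
print(rep_tier(100, 300_000))  # below both middle thresholds
```

A `CASE` statement follows the same pattern: the top condition first, the middle condition second, and `ELSE` for the rest.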

In [None]:
query_to_df(
"SELECT *, CASE WHEN total_orders>200 OR total_sales>750000 THEN 'Top-Performer' WHEN total_orders>150 OR \
total_sales>500000 THEN 'Mid-Performer' ELSE 'Low-Performer' END AS category FROM \
(SELECT DISTINCT s.name sales_rep, COUNT(o.id) total_orders, SUM(o.total_amt_usd) total_sales FROM sales_reps s \
JOIN accounts a ON a.sales_rep_id=s.id JOIN orders o ON o.account_id=a.id GROUP BY 1) AS T1 \
ORDER BY total_sales DESC, total_orders DESC;"
)

In [None]:
# Close the cursor and terminate the connection at the end of the notebook

if True and connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Terminating Connection to MySQL Database: {record}')