<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/SQL_for_Data_Science/blob/main/sql_for_data_analysis3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **SQL AGGREGATIONS**

We connect to a MySQL server (managed through MySQL Workbench) and analyse the parch-and-posey database. This notebook contains the practical exercises for the course **SQL for Data Analysis** at Udacity.

In [None]:
# Install mySQL connector

!pip install mysql-connector-python

In [None]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

**Next, we create a connection to the parch-and-posey database on the MySQL server**

In [None]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='parch_and_posey',
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

In [None]:
# Let's see the tables in the parch-and-posey DB by running the SHOW TABLES command

cursor.execute('show tables')
out = cursor.fetchall()
out

Let's see the first 3 rows of each table in the parch-and-posey database.

First, we define a function that converts a SELECT query into a DataFrame.

In [None]:
def query_to_df(query):
    """Run a SELECT query on the open cursor and return the result as a DataFrame."""
    st = time.time()
    # Require every query to end with a semi-colon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query
    cursor.execute(query)
    # cursor.description holds one tuple per column; its first item is the column name
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # Create a DataFrame from all results in one step
    # (an empty result gives an empty DataFrame instead of None)
    df = pd.DataFrame(rows, columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [None]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

In [None]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

In [None]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

In [None]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

In [None]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

**In essence, row-level data is useful for initial exploratory data analysis, when we're trying to get a feel for the data. But as we search for answers, aggregate data, which is usually computed down columns, becomes more useful.**

## Nulls:

NULL is a marker that indicates where no data exists in SQL; it is not a datatype or a value. Aggregation functions generally ignore NULLs.

* Notice that NULLs are different from zero: they are cells where data does not exist.

* When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data, and a comparison such as `column = NULL` never evaluates to true.

**NULLs - Expert Tip**
There are two common ways in which you are likely to encounter NULLs:

* NULLs frequently occur when performing a LEFT or RIGHT JOIN. As you saw in the last lesson, when some rows in the left table of a left join are not matched with rows in the right table, those rows will contain some NULL values in the result set.

* NULLs can also occur from simply missing data in our database.
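
As a minimal sketch of these points, the cell below uses Python's built-in `sqlite3` module with two tiny made-up tables rather than the real MySQL connection. The table contents are hypothetical, and SQLite stands in for MySQL only so that the example runs on its own. It shows that `IS NULL` finds a missing value, that `= NULL` matches nothing, and that a `LEFT JOIN` fills unmatched rows with NULLs.

```python
import sqlite3

# Toy tables (hypothetical data, not the real parch_and_posey rows)
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, name TEXT, primary_poc TEXT);
    CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES
        (1, 'Acme', 'Ann'), (2, 'Globex', NULL), (3, 'Initech', 'Bob');
    INSERT INTO orders VALUES (10, 1, 100), (11, 1, 50), (12, 3, 70);
""")

# IS NULL finds the account whose primary_poc is missing
is_null_rows = conn.execute(
    "SELECT name FROM accounts WHERE primary_poc IS NULL;").fetchall()

# '= NULL' never evaluates to true, so no rows come back
eq_null_rows = conn.execute(
    "SELECT name FROM accounts WHERE primary_poc = NULL;").fetchall()

# LEFT JOIN: Globex has no orders, so its order id is NULL (None in Python)
left_join_rows = conn.execute(
    "SELECT a.name, o.id FROM accounts a LEFT JOIN orders o "
    "ON a.id = o.account_id ORDER BY a.id, o.id;").fetchall()

print(is_null_rows)
print(eq_null_rows)
print(left_join_rows)
conn.close()
```

The same `WHERE ... IS NULL` pattern can be passed to `query_to_df` against the real tables.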

**COUNT the Number of Rows in each Table**

Try your hand at finding the number of rows in each table.

In [None]:
for table in ['orders','accounts','web_events','region','sales_reps']:
    query = f'SELECT COUNT(*) AS row_count FROM {table};'
    ans = query_to_df(query)
    print(f'Table {table}:')
    print(ans)
    print()

### COUNT:

* Note that unlike other aggregations, `COUNT` can be used on columns of non-numerical values. The same is true of the `MIN` and `MAX` functions.

* Notice that `COUNT(column)` does not count rows where that column is `NULL`, whereas `COUNT(*)` counts every row. Comparing the two is therefore a quick way to find out which columns have missing data.
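
To illustrate the difference between counting rows and counting non-NULL values, here is a small sketch on a made-up table. It uses Python's built-in `sqlite3` so it runs without the MySQL server; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, name TEXT, primary_poc TEXT);
    INSERT INTO accounts VALUES
        (1, 'Acme', 'Ann'), (2, 'Globex', NULL), (3, 'Initech', 'Bob');
""")

# COUNT(*) counts rows; COUNT(column) counts only the non-NULL values in that column
all_rows, named_rows, with_poc = conn.execute(
    "SELECT COUNT(*), COUNT(name), COUNT(primary_poc) FROM accounts;").fetchone()

# The gap between the two counts is the number of missing primary_poc values
missing_poc = all_rows - with_poc
print(all_rows, named_rows, with_poc, missing_poc)
conn.close()
```

Note that `COUNT` works on the text column `name` just as it would on a numeric column.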

### SUM:

* Unlike `COUNT`, you can only use `SUM` on numeric columns. Like `COUNT`, `SUM` ignores NULL values, as do the other aggregation functions covered below.

### Aggregation Reminder:

An important thing to remember: aggregators only aggregate vertically, over the values of a column. If you want to perform a calculation across the columns of a single row, you would do this with simple arithmetic.
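
A quick sketch of the two directions, using a made-up `orders` table in Python's built-in `sqlite3` (hypothetical values, chosen only to make the arithmetic easy to follow): `SUM` collapses a column into one value, while `+` combines columns within each row.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (id INTEGER, standard_amt_usd REAL, gloss_amt_usd REAL);
    INSERT INTO orders VALUES (1, 10.0, 5.0), (2, 20.0, 0.0), (3, 30.0, 15.0);
""")

# Vertical: one value for the whole column
column_total = conn.execute(
    "SELECT SUM(standard_amt_usd) FROM orders;").fetchone()[0]

# Horizontal: one value per row, computed with plain arithmetic
per_order = conn.execute(
    "SELECT id, standard_amt_usd + gloss_amt_usd FROM orders ORDER BY id;").fetchall()

print(column_total)
print(per_order)
conn.close()
```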

### Aggregation Question

Find the solution for each of the following questions. A worked answer follows each question.

#### Q1: Find the total amount of poster_qty paper ordered in the orders table.

In [None]:
query = 'SELECT SUM(poster_qty) FROM orders;'
query_to_df(query)

#### Q2: Find the total amount of standard_qty paper ordered in the orders table.

In [None]:
query = 'SELECT SUM(standard_qty) FROM orders;'
query_to_df(query)

#### Q3. Find the total dollar amount of sales using the total_amt_usd in the orders table.

In [None]:
query_to_df('SELECT SUM(total_amt_usd) FROM orders;')

#### Q4. Find the total amount spent on standard_amt_usd and gloss_amt_usd paper for each order in the orders table. This should give a dollar amount for each order in the table.

In [None]:
query_to_df(
    'SELECT id, (standard_amt_usd + gloss_amt_usd) tot_amt_usd FROM orders;'
)

#### Q5. Find the standard_amt_usd per unit of standard_qty paper. Your solution should use both an aggregation and a mathematical operator.

In [None]:
query_to_df(
    'SELECT (SUM(standard_amt_usd) / SUM(standard_qty)) \
    standard_unit_usd FROM orders;'
)

### Min and Max

Notice that `MIN` and `MAX` are aggregators that again ignore `NULL` values.

#### Expert Tip
Functionally, MIN and MAX are similar to COUNT in that they can be used on non-numerical columns. Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite—it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”
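
Here is a small sketch of `MIN` and `MAX` on non-numerical columns, run on a made-up `web_events` table with Python's built-in `sqlite3`. The rows are hypothetical. One assumption to note: SQLite stores these timestamps as ISO-formatted text, which happens to sort in chronological order, whereas MySQL would use a real date type.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE web_events (channel TEXT, occurred_at TEXT);
    INSERT INTO web_events VALUES
        ('organic', '2016-03-01 10:00:00'),
        ('adwords', '2015-12-25 08:30:00'),
        ('twitter', '2016-07-04 12:00:00'),
        (NULL, NULL);
""")

# MIN/MAX on text give the alphabetically first/last value;
# on timestamps they give the earliest/latest. The NULL row is ignored.
first_channel, last_channel, earliest, latest = conn.execute(
    "SELECT MIN(channel), MAX(channel), MIN(occurred_at), MAX(occurred_at) "
    "FROM web_events;").fetchone()

print(first_channel, last_channel)
print(earliest, latest)
conn.close()
```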

### AVG:

Similar to other software, `AVG` returns the mean of the data, that is, the sum of all of the values in the column divided by the number of values in the column. This aggregate function again ignores `NULL` values in both the numerator and the denominator.

If you want to count NULLs as zero, you will need to divide `SUM(column)` by `COUNT(*)` instead. However, this is probably not a good idea if the NULL values truly just represent unknown values for a cell.
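
The sketch below compares the two ways of averaging on a made-up column that contains NULLs, again using Python's built-in `sqlite3` with hypothetical rows. The `* 1.0` is only there because SQLite would otherwise perform integer division; it is an artifact of this stand-in, not part of the technique.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (id INTEGER, poster_qty INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 20), (4, NULL);
""")

avg_ignoring_nulls, avg_nulls_as_zero, avg_coalesce = conn.execute(
    "SELECT AVG(poster_qty), "                   # NULLs dropped: (10 + 20) / 2
    "       SUM(poster_qty) * 1.0 / COUNT(*), "  # NULLs as zero: (10 + 20) / 4
    "       AVG(COALESCE(poster_qty, 0)) "       # same idea, written with COALESCE
    "FROM orders;").fetchone()

print(avg_ignoring_nulls, avg_nulls_as_zero, avg_coalesce)
conn.close()
```

Which figure is appropriate depends on whether a NULL here means "no posters ordered" or "quantity unknown".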

#### MEDIAN - Expert Tip

One quick note: a median might be a more appropriate measure of center for this data, but finding the median is pretty difficult using SQL alone. It is difficult enough that finding a median is occasionally asked as an interview question.

### Questions: MIN, MAX, & AVERAGE
Answer the following questions.

#### 1. When was the earliest order ever placed? You only need to return the date.

In [None]:
query_to_df(
    'SELECT MIN(occurred_at) earliest_order FROM orders;'
)

#### 2. Try performing the same query as in question 1 without using an aggregation function.

In [None]:
query_to_df(
    'SELECT occurred_at earliest_order FROM orders ORDER BY earliest_order LIMIT 1;'
)

#### 3. When did the most recent (latest) web_event occur?

In [None]:
query_to_df(
    'SELECT MAX(occurred_at) latest_event FROM web_events;'
)

#### 4. Try to perform the result of the previous query without using an aggregation function.

In [None]:
query_to_df(
    'SELECT occurred_at FROM web_events ORDER BY occurred_at DESC LIMIT 1;'
)

#### 5. Find the mean (AVERAGE) amount spent per order on each paper type, as well as the mean amount of each paper type purchased per order. Your final answer should have 6 values - one for each paper type for the average number of sales, as well as the average amount.

In [None]:
# AVG over all orders gives the per-order mean of each quantity and each amount
query_to_df(
    'SELECT AVG(standard_qty) avg_standard_qty, \
    AVG(gloss_qty) avg_gloss_qty, \
    AVG(poster_qty) avg_poster_qty, \
    AVG(standard_amt_usd) avg_standard_usd, \
    AVG(gloss_amt_usd) avg_gloss_usd, \
    AVG(poster_amt_usd) avg_poster_usd \
    FROM orders;'
)

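#### 6: You might be interested in how to calculate the MEDIAN. Though this is more advanced than what we have covered so far, try finding the MEDIAN total_amt_usd spent on all orders.

In [None]:
# The inner query keeps the 3457 smallest totals; the outer query returns the two
# largest of those, i.e. the 3456th and 3457th values. Their mean is the median,
# assuming orders has 6912 rows (check the row count computed earlier).
query_to_df(
    'SELECT * FROM \
    (SELECT total_amt_usd FROM orders ORDER BY total_amt_usd LIMIT 3457) \
    AS tot_amt ORDER BY total_amt_usd DESC LIMIT 2;'
)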
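
The hard-coded `LIMIT 3457` only fits one particular row count. As a hedged sketch of a more general approach, the cell below derives the middle position(s) from `COUNT(*)` and checks the result against Python's `statistics.median`. It runs on a made-up table in Python's built-in `sqlite3`; the values are hypothetical, and the `?` placeholders are SQLite's parameter style (mysql-connector uses `%s`).

```python
import sqlite3
import statistics

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (total_amt_usd REAL);
    INSERT INTO orders VALUES (100.0), (300.0), (200.0), (1000.0);
""")

n = conn.execute("SELECT COUNT(*) FROM orders;").fetchone()[0]

# Even n: take the two middle rows; odd n: take the single middle row
limit = 2 - n % 2
offset = (n - 1) // 2
middle = conn.execute(
    "SELECT total_amt_usd FROM orders ORDER BY total_amt_usd LIMIT ? OFFSET ?;",
    (limit, offset)).fetchall()
sql_median = sum(value for (value,) in middle) / len(middle)

# Cross-check by pulling every value into Python
all_values = [value for (value,) in conn.execute("SELECT total_amt_usd FROM orders;")]
python_median = statistics.median(all_values)

print(sql_median, python_median)
conn.close()
```

With the real data, an alternative is to load `total_amt_usd` through `query_to_df` and call `.median()` on the resulting column.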

## GROUP BY:

* `GROUP BY` can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.


* Any column in the `SELECT` statement that is not within an aggregator must be in the `GROUP BY` clause.


* The `GROUP BY` always goes between `WHERE` and `ORDER BY`.


* `ORDER BY` works like SORT in spreadsheet software.
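
A minimal sketch of how these clauses fit together, on a made-up `orders` table in Python's built-in `sqlite3` (hypothetical rows): `WHERE` filters rows first, `GROUP BY` forms one group per account, and `ORDER BY` sorts the aggregated result.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL);
    INSERT INTO orders VALUES
        (1, 100.0), (1, 50.0), (2, 500.0), (2, 20.0), (3, 5.0);
""")

# account_id is not inside an aggregator, so it must appear in GROUP BY
totals = conn.execute(
    "SELECT account_id, SUM(total_amt_usd) AS total_usd "
    "FROM orders "
    "WHERE total_amt_usd > 10 "    # drops account 3's only order before grouping
    "GROUP BY account_id "
    "ORDER BY total_usd DESC;").fetchall()

print(totals)
conn.close()
```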

### GROUP BY - Expert Tip:

SQL evaluates the aggregations before the `LIMIT` clause. If you don’t `group by` any columns, you’ll get a 1-row result—no problem there. If you `group by` a column with enough unique values that it exceeds the `LIMIT` number, the aggregates will be calculated, and then some rows will simply be omitted from the results.

This is actually a nice way to do things because you know you're going to get the correct aggregates. If SQL cut the table down to 100 rows and then performed the aggregations, your results would be substantially different. So the default order, with `GROUP BY` evaluated before `LIMIT` (which usually comes last), is what we want.
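
To make the difference concrete, here is a sketch on a made-up `orders` table in Python's built-in `sqlite3` (hypothetical rows). The first query aggregates and then limits; the second forces the opposite order with a subquery, which silently drops one of account 2's orders.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL);
    INSERT INTO orders VALUES
        (1, 1, 100.0), (2, 1, 50.0), (3, 2, 500.0), (4, 2, 20.0), (5, 3, 5.0);
""")

# Aggregate first, then LIMIT: the totals that are shown are complete
aggregate_then_limit = conn.execute(
    "SELECT account_id, SUM(total_amt_usd) FROM orders "
    "GROUP BY account_id ORDER BY account_id LIMIT 2;").fetchall()

# LIMIT first (inside a subquery), then aggregate: order 4 never reaches the SUM
limit_then_aggregate = conn.execute(
    "SELECT account_id, SUM(total_amt_usd) FROM "
    "(SELECT * FROM orders ORDER BY id LIMIT 3) AS first_rows "
    "GROUP BY account_id ORDER BY account_id;").fetchall()

print(aggregate_then_limit)
print(limit_then_aggregate)
conn.close()
```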

## GROUP BY QUIZ:

Now that we've been introduced to `JOINs`, `GROUP BY`, and aggregate functions, the real power of SQL starts to come to life. Try some of the below to put your skills to the test!

It can be difficult to recognize when an aggregate is the easiest tool and when another SQL feature is. Try some of the questions below to see if you can find the easiest solution.

## Q1

Which account (by name) placed the earliest order? Your solution should have the account name and the date of the order.

In [None]:
query_to_df(
    'SELECT a.name acct_name, o.occurred_at date from accounts a JOIN \
    orders o ON a.id = o.account_id ORDER BY date LIMIT 1;'
)

## Q2

Find the total sales in usd for each account. You should include two columns - the total sales for each company's orders in usd and the company name.

In [None]:
query_to_df(
    'SELECT SUM(o.total_amt_usd) total_sales_usd, a.name acct_name FROM orders o \
    JOIN accounts a ON o.account_id = a.id GROUP BY acct_name;'
)

## Q3

Via what channel did the most recent (latest) web_event occur, which account was associated with this web_event? Your query should return only three values - the date, channel, and account name.

In [None]:
query_to_df(
    'SELECT w.occurred_at date, w.channel channel, a.name acct_name FROM \
    web_events w JOIN accounts a ON w.account_id = a.id ORDER BY date DESC LIMIT 1;'
)

## Q4

Find the total number of times each type of channel from the web_events was used. Your final table should have two columns - the channel and the number of times the channel was used.

In [None]:
query_to_df(
    'SELECT w.channel channel, COUNT(w.channel) count FROM web_events w GROUP BY \
    channel;'
)

In [None]:
# Aggregating with DISTINCT: the result is the same, because GROUP BY already
# returns one row per channel

query_to_df(
    'SELECT DISTINCT w.channel channel, COUNT(w.channel) count FROM web_events w \
    GROUP BY channel;'
)

## Q5
Who was the primary contact associated with the earliest web_event?

In [None]:
query_to_df(
    'SELECT a.primary_poc FROM accounts a JOIN web_events w ON a.id = \
    w.account_id ORDER BY w.occurred_at LIMIT 1;'
)

## Q6

What was the smallest order placed by each account in terms of total usd. Provide only two columns - the account name and the total usd. Order from smallest dollar amounts to largest.


In [None]:
query_to_df(
    'SELECT a.name acct_name, MIN(o.total_amt_usd) min_order_usd FROM accounts \
     a JOIN orders o ON a.id = o.account_id GROUP BY acct_name ORDER BY \
     min_order_usd;'
)

## Q7
Find the number of sales reps in each region. Your final table should have two columns - the region and the number of sales_reps. Order from fewest reps to most reps.

In [None]:
query_to_df(
    'SELECT r.name region, COUNT(s.name) sales_reps_count FROM region r JOIN \
    sales_reps s ON r.id = s.region_id GROUP BY region ORDER BY sales_reps_count;'
)

Let's reconfirm the distinct channels in web_events.

In [None]:
query_to_df(
    'SELECT DISTINCT w.channel distinct_channels FROM web_events w ORDER BY \
    distinct_channels;'
)

### **GROUP BY PART 2**

* We can `GROUP BY` multiple columns at once. This is often useful to aggregate across a number of different segments.

* The order of columns listed in the `ORDER BY` clause does make a difference: rows are sorted by the first column listed, then ties are broken by the next column, from left to right. The order of columns in the `GROUP BY` clause makes no difference.

**GROUP BY - Expert Tips**

* The order of column names in your `GROUP BY` clause doesn’t matter—the results will be the same regardless. If we run the same query and reverse the order in the `GROUP BY` clause, you can see we get the same results.


* As with `ORDER BY`, we can substitute numbers for column names in the `GROUP BY` clause. It’s generally recommended to do this only when you’re grouping many columns, or if something else is causing the text in the `GROUP BY` clause to be excessively long.


* A reminder here that any column that is not within an aggregation must show up in your `GROUP BY` statement. If you forget, you will likely get an error. However, in the off chance that your query does work, you might not like the results!
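
The sketch below checks these tips on a made-up `web_events` table, using Python's built-in `sqlite3` with hypothetical rows. It groups by two columns in both orders and by position numbers, then shows that swapping the columns in `ORDER BY` does change the output.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE web_events (account_id INTEGER, channel TEXT);
    INSERT INTO web_events VALUES
        (1, 'direct'), (1, 'direct'), (1, 'facebook'),
        (2, 'direct'), (2, 'facebook'), (2, 'facebook');
""")

base = "SELECT account_id, channel, COUNT(*) FROM web_events "

# Same groups whichever way the GROUP BY columns are listed
by_account_channel = conn.execute(
    base + "GROUP BY account_id, channel ORDER BY account_id, channel;").fetchall()
by_channel_account = conn.execute(
    base + "GROUP BY channel, account_id ORDER BY account_id, channel;").fetchall()

# Numbers refer to positions in the SELECT list
by_position = conn.execute(base + "GROUP BY 1, 2 ORDER BY 1, 2;").fetchall()

# Swapping the ORDER BY columns changes the row order
sorted_by_channel_first = conn.execute(
    base + "GROUP BY account_id, channel ORDER BY channel, account_id;").fetchall()

print(by_account_channel)
print(sorted_by_channel_first)
conn.close()
```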

## GROUP BY Part II

### Q1
For each account, determine the average amount of each type of paper they purchased across their orders. Your result should have four columns - one for the account name and one for the average quantity purchased for each of the paper types for each account.

In [None]:
query_to_df(
    'SELECT a.name acct_name, AVG(o.standard_qty) ave_standard_qty, AVG(o.poster_qty) \
    ave_poster_qty, AVG(o.gloss_qty) ave_gloss_qty FROM accounts a JOIN orders o ON a.id \
    = o.account_id GROUP BY acct_name;'
)

### Q2
For each account, determine the average amount spent per order on each paper type. Your result should have four columns - one for the account name and one for the average amount spent on each paper type.

In [None]:
query_to_df(
    'SELECT a.name acct_name, AVG(o.standard_amt_usd) ave_standard_usd, AVG(o.poster_amt_usd) \
    ave_poster_usd, AVG(o.gloss_amt_usd) ave_gloss_usd FROM accounts a JOIN orders o ON a.id \
    = o.account_id GROUP BY acct_name;'
)

### Q3
Determine the number of times a particular channel was used in the web_events table for each sales rep. Your final table should have three columns - the name of the sales rep, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.

In [None]:
query_to_df(
    'SELECT s.name sales_rep, w.channel channels, COUNT(w.channel) count FROM \
    sales_reps s JOIN accounts a ON s.id = a.sales_rep_id JOIN web_events w ON \
    w.account_id = a.id GROUP BY sales_rep, channels ORDER BY count DESC;'
)

In [None]:
# Aggregating with DISTINCT: the result is the same, because GROUP BY already
# returns one row per (sales_rep, channel) pair

query_to_df(
    'SELECT DISTINCT s.name sales_rep, w.channel channels, COUNT(w.channel) count FROM \
    sales_reps s JOIN accounts a ON s.id = a.sales_rep_id JOIN web_events w ON \
    w.account_id = a.id GROUP BY sales_rep, channels ORDER BY count DESC;'
)

### Q4
Determine the number of times a particular channel was used in the web_events table for each region. Your final table should have three columns - the region name, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.


In [None]:
query_to_df(
    'SELECT r.name region, w.channel channels, COUNT(w.channel) count FROM \
    region r JOIN sales_reps s ON r.id = s.region_id JOIN accounts a ON s.id = \
    a.sales_rep_id JOIN web_events w ON w.account_id = a.id GROUP BY region, \
    channels ORDER BY count DESC;'
)

### **Distinct**

* `DISTINCT` is always used in `SELECT` statements, and it provides the unique rows for all columns written in the `SELECT` statement. Therefore, you only use `DISTINCT` once in any particular `SELECT` statement.

* You could write:
```
SELECT DISTINCT column1, column2, column3
FROM table1;
```
which would return the unique (or DISTINCT) rows across all three columns.

* You could not write:
```
SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;
```
* You can think of DISTINCT the same way you might think of the statement "unique".
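
A short sketch of these rules on a made-up `web_events` table, using Python's built-in `sqlite3` with hypothetical rows. It shows `DISTINCT` applied across two columns, across one column, and what happens when `DISTINCT` is repeated. SQLite rejects the repeated form with an `OperationalError`; other engines report their own syntax error.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE web_events (account_id INTEGER, channel TEXT);
    INSERT INTO web_events VALUES
        (1, 'direct'), (1, 'direct'), (1, 'facebook'), (2, 'direct');
""")

# Unique (account_id, channel) combinations: the duplicate (1, 'direct') collapses
distinct_pairs = conn.execute(
    "SELECT DISTINCT account_id, channel FROM web_events "
    "ORDER BY account_id, channel;").fetchall()

# Unique values of a single column
distinct_channels = conn.execute(
    "SELECT DISTINCT channel FROM web_events ORDER BY channel;").fetchall()

# Repeating DISTINCT inside one SELECT list is a syntax error
try:
    conn.execute("SELECT DISTINCT account_id, DISTINCT channel FROM web_events;")
    repeated_distinct_ok = True
except sqlite3.OperationalError:
    repeated_distinct_ok = False

print(distinct_pairs)
print(distinct_channels)
print(repeated_distinct_ok)
conn.close()
```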


**DISTINCT - Expert Tip**

It’s worth noting that using `DISTINCT`, particularly in aggregations, can slow your queries down quite a bit.

### Q1 Distinct

Use DISTINCT to test if there are any accounts associated with more than one region.

In [None]:
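# COUNT(DISTINCT ...) counts each region once per account;
# a value above 1 would mean the account is tied to more than one region
query_to_df(
    'SELECT a.name acct_name, COUNT(DISTINCT r.name) region_count FROM \
    accounts a JOIN sales_reps s ON a.sales_rep_id = s.id JOIN region r ON \
    s.region_id = r.id GROUP BY acct_name ORDER BY region_count DESC;'
)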

### Q2
Have any sales reps worked on more than one account? Answer using Distinct

In [None]:
# COUNT(DISTINCT ...) counts each account once per sales rep
query_to_df(
    'SELECT s.name sales_rep, COUNT(DISTINCT a.id) num_accounts \
    FROM sales_reps s JOIN accounts a ON s.id = a.sales_rep_id GROUP BY sales_rep \
    ORDER BY num_accounts DESC;'
)

## **Having**

**HAVING - Expert Tip**

HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a `WHERE` on an element of your query that was created by an aggregate, you need to use `HAVING` instead.

## **WHERE vs. HAVING**

1. `WHERE` subsets the returned data based on a logical condition
2. `WHERE` appears after the `FROM`, `JOIN` and `ON` clauses but before the `GROUP BY`
3. `HAVING` appears after the `GROUP BY` clause but before the `ORDER BY`.
4. `HAVING` is like `WHERE` but it works on logical statements involving aggregations.  
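
The four points above can be sketched on a made-up `web_events` table, using Python's built-in `sqlite3` with hypothetical rows. `WHERE` filters individual rows before grouping, `HAVING` filters the groups afterwards, and putting an aggregate inside `WHERE` is rejected (SQLite raises an `OperationalError`; MySQL reports its own error).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE web_events (account_id INTEGER, channel TEXT);
    INSERT INTO web_events VALUES
        (1, 'facebook'), (1, 'facebook'), (1, 'facebook'), (1, 'direct'),
        (2, 'facebook'), (2, 'direct'), (2, 'direct'), (2, 'direct'), (2, 'direct');
""")

# WHERE keeps only facebook rows; HAVING keeps only groups with more than 2 of them
heavy_facebook_users = conn.execute(
    "SELECT account_id, COUNT(*) AS uses FROM web_events "
    "WHERE channel = 'facebook' "
    "GROUP BY account_id "
    "HAVING COUNT(*) > 2 "
    "ORDER BY uses DESC;").fetchall()

# An aggregate cannot be used in WHERE, because WHERE runs before the groups exist
try:
    conn.execute(
        "SELECT account_id FROM web_events WHERE COUNT(*) > 2 GROUP BY account_id;")
    aggregate_in_where_ok = True
except sqlite3.OperationalError:
    aggregate_in_where_ok = False

print(heavy_facebook_users)
print(aggregate_in_where_ok)
conn.close()
```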

### Q

How many of the sales reps have more than 5 accounts that they manage?

In [None]:
query_to_df(
    'SELECT COUNT(*) num_reps FROM\
    (SELECT DISTINCT s.name sales_rep, COUNT(a.name) count FROM sales_reps s JOIN \
    accounts a on s.id = a.sales_rep_id GROUP BY sales_rep HAVING count > 5 \
    ORDER BY count) AS t1;'
)

### Q

How many accounts have more than 20 orders?

In [None]:
query_to_df(
    'SELECT COUNT(*) num_accts FROM \
    (SELECT DISTINCT a.name acct_name, COUNT(o.account_id) orders FROM accounts a JOIN \
    orders o ON a.id = o.account_id GROUP BY acct_name HAVING orders > 20 \
    ORDER BY orders) AS t1;'
)

### Q
Which account has the most orders?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, COUNT(o.account_id) orders FROM accounts a \
    JOIN orders o ON a.id = o.account_id GROUP BY acct_name ORDER BY orders DESC\
    LIMIT 1;'
)

### Q
How many accounts spent more than 30,000 usd total across all orders?

In [None]:
query_to_df(
    'SELECT COUNT(*) total_accts_over_30k FROM \
    (SELECT DISTINCT a.name acct_name, SUM(o.total_amt_usd) sum_total FROM accounts \
    a JOIN orders o on a.id=o.account_id GROUP BY acct_name HAVING sum_total > \
    30000 ORDER BY 2) AS t1;'
)

### Q
Which accounts spent less than 1,000 usd total across all orders?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, SUM(o.total_amt_usd) total_spent FROM \
    accounts a JOIN orders o ON a.id=o.account_id GROUP BY acct_name HAVING \
    total_spent < 1000 ORDER BY total_spent DESC;'
)

### Q
Which account has spent the most with us?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, SUM(o.total_amt_usd) max_total_spent FROM \
    accounts a JOIN orders o ON a.id=o.account_id GROUP BY acct_name ORDER BY \
    max_total_spent DESC LIMIT 1;'
)

### Q
Which account has spent the least with us?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, SUM(o.total_amt_usd) min_total_spent FROM \
    accounts a JOIN orders o ON a.id=o.account_id GROUP BY acct_name ORDER BY \
    min_total_spent LIMIT 1;'
)

### Q
Which accounts used facebook as a channel to contact customers more than 6 times?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, w.channel channels, COUNT(w.channel) count \
    FROM accounts a JOIN web_events w ON a.id=w.account_id WHERE w.channel LIKE \
    "%facebook%" GROUP BY acct_name, channels HAVING count > 6 ORDER BY count;'
)

In [None]:
# Query can be written with only HAVING like so...

query_to_df(
    'SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel FROM accounts a \
    JOIN web_events w ON a.id = w.account_id GROUP BY a.id, a.name, w.channel \
    HAVING COUNT(*) > 6 AND w.channel LIKE "%facebook%" ORDER BY use_of_channel;'
)

### Q
Which account used facebook most as a channel?

In [None]:
query_to_df(
    'SELECT DISTINCT a.name acct_name, w.channel channels, COUNT(w.channel) count \
    FROM accounts a JOIN web_events w ON a.id=w.account_id WHERE w.channel LIKE \
    "%facebook%" GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 1;'
)

### Q
Which channel was most frequently used by most accounts?

In [None]:
query_to_df(
    'SELECT a.name acct_name, w.channel channels, COUNT(w.channel) count \
    FROM accounts a JOIN web_events w ON a.id=w.account_id GROUP BY acct_name, \
    channels ORDER BY count DESC LIMIT 10;'
)

In [None]:
# End the connection after running notebook

if connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Closing MySQL Connection to {record} Database')