<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/SQL_for_Data_Science/blob/main/sql_for_data_analysis3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **SQL AGGREGATIONS**

We connect to Google CloudSQL and analyze the Parch and Posey database.<br>

Thanks to this [article](https://towardsdatascience.com/sql-on-the-cloud-with-python-c08a30807661) for making the connection process clearer.

To download the parch-and-posey.sql file, for example to load it into your own database, use this [link](https://storage.googleapis.com/kaggle1980/parch.sql) to the updated file in Cloud Storage.

In [1]:
# Next mount gdrive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
# set working directory to Udacity
%cd /content/gdrive/MyDrive/Colab_Notebooks/Udacity

/content/gdrive/MyDrive/Colab_Notebooks/Udacity


In [3]:
%ls

 aws_machine_learning_foundations/   linear_algebra_refresher.ipynb
 client-cert.pem                    'linear-example-data (1).xlsx'
 client-key.pem                      Problem_Solving_w_Advanced_Analytics.ipynb
 computer_vision/                    server-ca.pem
 intro_to_algorithm.ipynb            statistics/
 intro_to_artificial_intelligence/   time_series_forecasting.ipynb
 intro_to_data_analysis/             Udac_Prog_Foundations_Python/
 intro_to_machine_learning.ipynb     version_control_with_git/


In [4]:
# Install mySQL connector
!pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading https://files.pythonhosted.org/packages/cc/ec/102bf59d0cdeb3b8fc82d6669bf96d57d133e44811ff57ad5e941bd8588d/mysql_connector_python-8.0.23-cp36-cp36m-manylinux1_x86_64.whl (18.0MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-8.0.23


In [5]:
import mysql.connector
from mysql.connector.constants import ClientFlag
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
from tqdm import tqdm
print('Imported!')

Imported!


In [6]:
config = {
    'user': 'root',
    'password': 'root',
    'host': '35.226.26.66',
    'client_flags': [ClientFlag.SSL],
    'ssl_ca': 'server-ca.pem',
    'ssl_cert': 'client-cert.pem',
    'ssl_key': 'client-key.pem'
}

# now we establish our connection
try:
    cnxn = mysql.connector.connect(**config)
    print('Connection to CloudSQL Instance Successful!')
except Exception as e:
    print(e)

Connection to CloudSQL Instance Successful!


In [7]:
config

{'client_flags': [2048],
 'host': '35.226.26.66',
 'password': 'root',
 'ssl_ca': 'server-ca.pem',
 'ssl_cert': 'client-cert.pem',
 'ssl_key': 'client-key.pem',
 'user': 'root'}

Now we connect to `parch_and_posey_db` by adding a `database` key with the value `parch_and_posey_db` to our config dictionary and connecting just like we did before:

In [8]:
config['database'] = 'parch_and_posey_db'  # add new database to config dict
cnxn = mysql.connector.connect(**config)
cursor = cnxn.cursor()

Let's list the tables in the Parch and Posey database, then look at the first 3 rows of each one.

In [9]:
# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

[('accounts',), ('orders',), ('region',), ('sales_reps',), ('web_events',)]

We define a function that runs a SELECT query and returns the result as a DataFrame.

In [10]:
def query_to_df(query):
    """Run a query on the open cursor and return the result as a DataFrame."""
    st = time.time()
    # Require every query to end with a semi-colon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query; the first item of each description entry is the column name
    cursor.execute(query)
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # Build the DataFrame in one step; concatenating one row at a time
    # gets very slow on large results. Return None when no rows come back.
    df = pd.DataFrame(rows, columns=columns) if rows else None
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [11]:
# 1. For the accounts table
query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.05814862251281738 secs!


Unnamed: 0,id,name,website,lats,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.169156,-73.849374,Sung Shields,321510
2,1021,Apple,www.apple.com,42.290495,-76.084009,Jodee Lupo,321520


In [12]:
# 2. For the orders table
query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.04595661163330078 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [13]:
# 3. For the sales_reps table
query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.03916454315185547 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


In [14]:
# 4. For the web_events table
query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.04044938087463379 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


In [15]:
# 5. For the region table
query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.037413597106933594 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


In [16]:
region = query_to_df(query)
region.head()

Query ran for 0.036028146743774414 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


**In essence, row-level data is useful for initial exploratory data analysis, when we're trying to get a feel for the data. But as we search for answers, aggregate data, usually computed down columns, becomes more useful.**

## Nulls:

NULL marks a cell where no data exists in SQL. It is not a value or a data type. Aggregation functions generally ignore NULLs.

* Notice that a NULL is different from a zero. NULLs are cells where data does not exist.

* When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't use =, because NULL isn't considered a value in SQL, so a comparison such as `column = NULL` never evaluates to true. Rather, NULL is a property of the data.

**NULLs - Expert Tip**

There are two common ways in which you are likely to encounter NULLs:

* NULLs frequently occur when performing a LEFT or RIGHT JOIN. As you saw in the last lesson, when some rows in the left table of a left join are not matched with rows in the right table, those rows will contain some NULL values in the result set.

* NULLs can also occur from simply missing data in our database.
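The points above can be tried out without the CloudSQL instance. The following is a minimal sketch that uses Python's built-in `sqlite3` module and a few made-up rows (the table contents are hypothetical, not the real Parch and Posey data). It shows that `= NULL` matches nothing, that `IS NULL` finds the missing values, and that a LEFT JOIN produces NULLs for unmatched rows.

```python
import sqlite3

# Toy in-memory tables (hypothetical rows, not the real Parch and Posey data)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE accounts (id INTEGER, name TEXT, primary_poc TEXT)')
cur.execute('CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER)')
cur.executemany('INSERT INTO accounts VALUES (?, ?, ?)',
                [(1, 'Acme', 'Ann'), (2, 'Globex', None), (3, 'Initech', 'Bob')])
cur.executemany('INSERT INTO orders VALUES (?, ?, ?)',
                [(10, 1, 169), (11, 1, 288)])

# "= NULL" matches no rows, because a comparison with NULL is never true
eq_null = cur.execute(
    'SELECT name FROM accounts WHERE primary_poc = NULL;').fetchall()

# IS NULL finds the rows where the value is missing
is_null = cur.execute(
    'SELECT name FROM accounts WHERE primary_poc IS NULL;').fetchall()

# A LEFT JOIN fills the right-hand columns with NULL for unmatched accounts
left_join = cur.execute(
    'SELECT a.name, o.id FROM accounts a '
    'LEFT JOIN orders o ON o.account_id = a.id '
    'WHERE o.id IS NULL ORDER BY a.name;').fetchall()

print(eq_null)
print(is_null)
print(left_join)
conn.close()
```

NULL comes back to Python as `None`, which is also how it appears in a DataFrame built from the query result.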

**COUNT the Number of Rows in each Table**

Try your hand at finding the number of rows in each table.

In [29]:
for table in ['orders','accounts','web_events','region','sales_reps']:
    query = f'SELECT COUNT(*) AS row_count FROM {table};'
    ans = query_to_df(query)
    print(f'Table {table}:')
    print(ans)
    print()

Query ran for 0.03824758529663086 secs!
Table orders:
   row_count
0       6912

Query ran for 0.03505396842956543 secs!
Table accounts:
   row_count
0        351

Query ran for 0.03638911247253418 secs!
Table web_events:
   row_count
0       9073

Query ran for 0.03323507308959961 secs!
Table region:
   row_count
0          4

Query ran for 0.03325486183166504 secs!
Table sales_reps:
   row_count
0         50



### COUNT:

* Note that unlike `SUM` and `AVG`, `COUNT` can be used on columns of non-numerical values. The same is true of `MIN` and `MAX`.

* Notice that `COUNT(column)` skips rows where that column is `NULL`, while `COUNT(*)` counts every row. Comparing the two is a quick way to identify columns that have missing data.
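As a small illustration of how `COUNT` treats NULLs, here is a sketch on a made-up table using the built-in `sqlite3` module. The rows are hypothetical and only serve to show the difference between counting rows and counting non-NULL values in one column.

```python
import sqlite3

# Toy in-memory table (hypothetical rows)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE accounts (id INTEGER, name TEXT, primary_poc TEXT)')
cur.executemany('INSERT INTO accounts VALUES (?, ?, ?)',
                [(1, 'Acme', 'Ann'), (2, 'Globex', None), (3, 'Initech', None)])

# COUNT(*) counts rows; COUNT(primary_poc) counts only non-NULL values
total_rows, poc_rows = cur.execute(
    'SELECT COUNT(*), COUNT(primary_poc) FROM accounts;').fetchone()

# The difference is the number of rows with a missing primary_poc
missing_poc = total_rows - poc_rows
print(total_rows, poc_rows, missing_poc)
conn.close()
```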

### SUM:

* Unlike `COUNT`, you can only use `SUM` on numeric columns. However, `SUM` will ignore NULL values, as do the other aggregation functions you will see in the upcoming lessons.

### Aggregation Reminder:

An important thing to remember: aggregators only aggregate vertically, over the values of a column. If you want to perform a calculation across the columns of a single row, you would do this with simple arithmetic.
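The difference between the two directions can be sketched on a toy table (hypothetical rows, built-in `sqlite3` module). `SUM` collapses a column into one number and skips NULLs, while `+` works within each row and returns NULL when any operand is NULL.

```python
import sqlite3

# Toy in-memory table (hypothetical rows)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE orders (id INTEGER, standard_qty INTEGER, gloss_qty INTEGER)')
cur.executemany('INSERT INTO orders VALUES (?, ?, ?)',
                [(1, 100, 20), (2, 50, None), (3, 10, 5)])

# Vertical: one total per column, NULLs are ignored
column_totals = cur.execute(
    'SELECT SUM(standard_qty), SUM(gloss_qty) FROM orders;').fetchone()

# Horizontal: one value per row, NULL + number is NULL
row_totals = cur.execute(
    'SELECT id, standard_qty + gloss_qty FROM orders ORDER BY id;').fetchall()

print(column_totals)
print(row_totals)
conn.close()
```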

## Aggregation Questions

Find the solution for each of the following questions. If you get stuck or want to check your answers, you can find the answers at the top of the next concept.

#### Q1: Find the total amount of poster_qty paper ordered in the orders table.

In [18]:
query = 'SELECT SUM(poster_qty) FROM orders;'
query_to_df(query)

Query ran for 0.05653810501098633 secs!


Unnamed: 0,SUM(poster_qty)
0,723646


#### Q2: Find the total amount of standard_qty paper ordered in the orders table.

In [19]:
query = 'SELECT SUM(standard_qty) FROM orders;'
query_to_df(query)

Query ran for 0.03706026077270508 secs!


Unnamed: 0,SUM(standard_qty)
0,1938346


#### Q4. Find the total dollar amount of sales using the total_amt_usd in the orders table.

In [20]:
query_to_df('SELECT SUM(total_amt_usd) FROM orders;')

Query ran for 0.03670692443847656 secs!


Unnamed: 0,SUM(total_amt_usd)
0,23141511.82


#### Q5. Find the total amount spent on standard_amt_usd and gloss_amt_usd paper for each order in the orders table. This should give a dollar amount for each order in the table.

In [21]:
query_to_df(
    'SELECT id, (standard_amt_usd + gloss_amt_usd) tot_amt_usd FROM orders;'
)

Query ran for 7.3920910358428955 secs!


Unnamed: 0,id,tot_amt_usd
0,1,778.55
1,2,1255.19
2,3,776.18
3,4,958.24
4,5,756.13
...,...,...
6907,6908,1545.40
6908,6909,706.54
6909,6910,783.90
6910,6911,816.20


#### Q6. Find the standard_amt_usd per unit of standard_qty paper. Your solution should use both an aggregation and a mathematical operator.

In [22]:
query_to_df(
 'SELECT (SUM(standard_amt_usd)  / SUM(standard_qty)) \
 standard_unit_usd FROM orders;'   
)

Query ran for 0.039717674255371094 secs!


Unnamed: 0,standard_unit_usd
0,4.99


### Min and Max

Notice that `MIN` and `MAX` are aggregators that again ignore `NULL` values.

#### Expert Tip
Functionally, MIN and MAX are similar to COUNT in that they can be used on non-numerical columns. Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite—it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”
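To see `MIN` and `MAX` on non-numerical columns, here is a minimal sketch on made-up rows with the built-in `sqlite3` module. One assumption to note: SQLite stores these timestamps as text, and they sort chronologically only because they are written in `YYYY-MM-DD HH:MM:SS` form. MySQL has a real datetime type, so this detail does not apply there.

```python
import sqlite3

# Toy in-memory table (hypothetical rows); timestamps are ISO-formatted text
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE events (name TEXT, occurred_at TEXT)')
cur.executemany('INSERT INTO events VALUES (?, ?)',
                [('Banana', '2015-10-06 17:31:14'),
                 ('Apple', '2017-01-01 23:51:09'),
                 ('Cherry', '2013-12-04 04:22:44'),
                 (None, '2016-05-01 08:00:00')])

# On text, MIN/MAX return the alphabetically first/last value; NULL is skipped
first_name, last_name = cur.execute(
    'SELECT MIN(name), MAX(name) FROM events;').fetchone()

# On dates, MIN/MAX return the earliest/latest value
earliest, latest = cur.execute(
    'SELECT MIN(occurred_at), MAX(occurred_at) FROM events;').fetchone()

print(first_name, last_name)
print(earliest, latest)
conn.close()
```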

### AVG:

As in other software, `AVG` returns the mean of the data, that is, the sum of all of the values in the column divided by the number of values in the column. This aggregate function again ignores the `NULL` values in both the numerator and the denominator.

If you want to count NULLs as zero, you will need to use SUM and COUNT. However, this is probably not a good idea if the NULL values truly just represent unknown values for a cell.
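The two ways of averaging can be compared on a toy column (hypothetical rows, built-in `sqlite3` module). The sketch shows `AVG` skipping the NULL, and two ways of treating it as zero instead: `SUM` divided by `COUNT(*)`, and `AVG` over `COALESCE(column, 0)`. The `* 1.0` is there because SQLite would otherwise do integer division on integer columns.

```python
import sqlite3

# Toy in-memory table (hypothetical rows) with one missing quantity
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, poster_qty INTEGER)')
cur.executemany('INSERT INTO orders VALUES (?, ?)',
                [(1, 10), (2, 20), (3, None)])

avg_ignoring_nulls, avg_sum_count, avg_coalesce = cur.execute(
    'SELECT AVG(poster_qty), '                    # (10 + 20) / 2
    'SUM(poster_qty) * 1.0 / COUNT(*), '          # (10 + 20) / 3
    'AVG(COALESCE(poster_qty, 0)) '               # (10 + 20 + 0) / 3
    'FROM orders;').fetchone()

print(avg_ignoring_nulls, avg_sum_count, avg_coalesce)
conn.close()
```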

#### MEDIAN - Expert Tip

A median might be a more appropriate measure of center for this data, but finding the median is fairly difficult using SQL alone. It is difficult enough that finding a median is occasionally asked as an interview question.

## Questions: MIN, MAX, & AVERAGE
Answer the following questions.

#### 1. When was the earliest order ever placed? You only need to return the date.

In [23]:
query_to_df(
    'SELECT MIN(occurred_at) earliest_order FROM orders;'
)

Query ran for 0.03742265701293945 secs!


Unnamed: 0,earliest_order
0,2013-12-04 04:22:44


#### 2. Try performing the same query as in question 1 without using an aggregation function.

In [24]:
query_to_df(
    'SELECT occurred_at earliest_order FROM orders ORDER BY occurred_at LIMIT 1;'
)

Query ran for 0.03635287284851074 secs!


Unnamed: 0,earliest_order
0,2013-12-04 04:22:44


#### 3. When did the most recent (latest) web_event occur?

In [25]:
query_to_df(
    'SELECT MAX(occurred_at) latest_event FROM web_events;'
)

Query ran for 0.0563657283782959 secs!


Unnamed: 0,latest_event
0,2017-01-01 23:51:09


#### 4. Try to perform the result of the previous query without using an aggregation function.

In [26]:
query_to_df(
    'SELECT occurred_at FROM web_events ORDER BY occurred_at DESC LIMIT 1;'
)

Query ran for 0.037081003189086914 secs!


Unnamed: 0,occurred_at
0,2017-01-01 23:51:09


#### 5. Find the mean (AVERAGE) amount spent per order on each paper type, as well as the mean amount of each paper type purchased per order. Your final answer should have 6 values - one for each paper type for the average number of sales, as well as the average amount.

In [27]:
query_to_df(
    'SELECT SUM(standard_amt_usd) / SUM(standard_qty) avg_standard_usd, \
    SUM(total) / SUM(standard_qty) avg_standard_qty, \
    SUM(gloss_amt_usd) / SUM(gloss_qty) avg_gloss_usd, \
    SUM(total) / SUM(gloss_qty) avg_gloss_qty, \
    SUM(poster_amt_usd) / SUM(poster_qty) avg_poster_usd, \
    SUM(total) / SUM(poster_qty) avg_poster_qty\
    FROM orders;'
)

Query ran for 0.041762590408325195 secs!


Unnamed: 0,avg_standard_usd,avg_standard_qty,avg_gloss_usd,avg_gloss_qty,avg_poster_usd,avg_poster_qty
0,4.99,1.8963,7.49,3.6258,8.12,5.0795
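Note that the query above divides one column total by another, so `avg_standard_usd` (4.99) is a price per unit of paper rather than a mean per order, and the `*_qty` columns are ratios of `total` to each paper type. If the question is read as asking for per-order means, applying `AVG` to each of the six columns is one way to get them. The following is a minimal sketch on two made-up orders using the built-in `sqlite3` module; the numbers are hypothetical and are not results from the real database.

```python
import sqlite3

# Toy in-memory table (hypothetical rows, not the real orders table)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE orders (id INTEGER, standard_qty INTEGER, gloss_qty INTEGER, '
    'poster_qty INTEGER, standard_amt_usd REAL, gloss_amt_usd REAL, '
    'poster_amt_usd REAL)')
cur.executemany('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)',
                [(1, 100, 20, 10, 499.0, 150.0, 81.0),
                 (2, 200, 40, 30, 998.0, 300.0, 243.0)])

# Mean quantity and mean amount per order, one value per paper type
per_order = cur.execute(
    'SELECT AVG(standard_qty), AVG(gloss_qty), AVG(poster_qty), '
    'AVG(standard_amt_usd), AVG(gloss_amt_usd), AVG(poster_amt_usd) '
    'FROM orders;').fetchone()

# For contrast: a ratio of sums gives the price per unit, not a per-order mean
per_unit = cur.execute(
    'SELECT SUM(standard_amt_usd) / SUM(standard_qty) FROM orders;'
).fetchone()[0]

print(per_order)
print(per_unit)
conn.close()
```

The same `SELECT AVG(...)` statement, ending in a semicolon, could be passed to `query_to_df` to run it against the real `orders` table.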


#### 6: After the video, you might be interested in how to calculate the MEDIAN. Though this is more advanced than what we have covered so far, try finding the MEDIAN total_amt_usd spent on all orders.

In [28]:
query_to_df(
    'SELECT * FROM \
    (SELECT total_amt_usd FROM orders ORDER BY total_amt_usd LIMIT 3457) \
    AS tot_amt ORDER BY total_amt_usd DESC LIMIT 2;'
)

Query ran for 0.039069414138793945 secs!


Unnamed: 0,total_amt_usd
0,2483.16
1,2482.55
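The orders table has 6912 rows, an even number, so the median is the mean of the 3456th and 3457th values in sorted order. Those are the two rows returned above: (2482.55 + 2483.16) / 2 = 2482.855. The hard-coded `LIMIT 3457` only works for this row count. A more general approach, sketched below on made-up rows with the built-in `sqlite3` module, reads the row count first and then picks the one or two middle rows with `LIMIT` and `OFFSET`. The table contents are hypothetical.

```python
import sqlite3
from statistics import median

# Toy in-memory table (hypothetical rows)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, total_amt_usd REAL)')
toy_rows = [(1, 40.0), (2, 10.0), (3, 30.0), (4, 20.0)]
cur.executemany('INSERT INTO orders VALUES (?, ?)', toy_rows)

# Number of rows decides whether there are one or two middle values
n = cur.execute('SELECT COUNT(*) FROM orders;').fetchone()[0]
how_many = 2 if n % 2 == 0 else 1
offset = (n - 1) // 2  # zero-based position of the first middle row

# Fetch only the middle row(s) from the sorted column
middle = cur.execute(
    'SELECT total_amt_usd FROM orders ORDER BY total_amt_usd '
    'LIMIT ? OFFSET ?;', (how_many, offset)).fetchall()

# The median is the mean of the middle value(s)
sql_median = sum(row[0] for row in middle) / len(middle)

# Cross-check against Python's own median
py_median = median(amount for _, amount in toy_rows)
print(sql_median, py_median)
conn.close()
```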
