# ADVANCED SQL 3: Union and Performance Tuning

We connect to a MySQL server (managed with MySQL Workbench) and analyze the parch-and-posey database. This notebook contains the practical exercises for the Udacity course SQL for Data Analysis.

In [1]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

Done!


In [2]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

db_name = 'parch_and_posey'
try:
    connection = mysql.connector.connect(host='localhost',
                                         database=db_name,
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

Enter UserName:danam
Enter Password:········
Connected to MySQL Server version  8.0.24
You're connected to database:  ('parch_and_posey',)


In [3]:
def query_to_df(query):
    """Run a SQL query on the open cursor and return the rows as a DataFrame."""
    st = time.time()
    # Every query must end with a semicolon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # Display at most 30 rows; longer results are shown truncated
    pd.set_option('display.max_rows', 30)

    # Run the query and read the column names from the cursor metadata
    cursor.execute(query)
    columns = [column[0] for column in cursor.description]
    rows = cursor.fetchall()

    # Build the DataFrame in one step (None when the query returns no rows)
    df = pd.DataFrame(rows, columns=columns) if rows else None
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [4]:
# Let's see the tables in Parch-and-Posey database

query_to_df(
'SHOW TABLES;'
)

Query ran for 0.015661239624023438 secs!


Unnamed: 0,Tables_in_parch_and_posey
0,accounts
1,orders
2,region
3,sales_reps
4,web_events


In [5]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.004955291748046875 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.23849561,-75.10329704,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.1691563,-73.84937379,Sung Shields,321510
2,1021,Apple,www.apple.com,42.29049481,-76.08400942,Jodee Lupo,321520


In [6]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.015618562698364258 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [7]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.0 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


In [8]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.0 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


In [9]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.0 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


## UNION:

While JOINs allow us to stack tables or their columns side-by-side horizontally, UNIONs allow us to stack two tables vertically, one atop the other.

### Appending Data via UNION

**SQL's two strict rules for appending data:**

* Both tables must have the same number of columns.
* Those columns must have the same data types in the same order as the first table.

A common misconception is that column names have to be the same. Column names, in fact, don't need to be the same to append two tables but you will find that they typically are.
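
The column-count and column-name points above can be checked with a small sketch. It is an illustration only: it assumes Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL connection, and the tables `meal_a` and `meal_b` are hypothetical. SQLite is lenient about data types, so the sketch only exercises the column-count rule and the column-name behaviour.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Two hypothetical tables: different column names, same number of columns.
cur.execute("CREATE TABLE meal_a (ingredient TEXT, aisle INTEGER)")
cur.execute("CREATE TABLE meal_b (item TEXT, shelf INTEGER)")
cur.executemany("INSERT INTO meal_a VALUES (?, ?)", [('rice', 1), ('beans', 2)])
cur.executemany("INSERT INTO meal_b VALUES (?, ?)", [('salt', 3)])

# Column names differ, yet the append works; the names come from the first SELECT.
cur.execute(
    "SELECT ingredient, aisle FROM meal_a UNION ALL SELECT item, shelf FROM meal_b"
)
appended_rows = cur.fetchall()
result_columns = [col[0] for col in cur.description]

# A different number of columns breaks the first rule and raises an error.
try:
    cur.execute(
        "SELECT ingredient, aisle FROM meal_a UNION ALL SELECT item FROM meal_b"
    )
    column_count_error = None
except sqlite3.OperationalError as exc:
    column_count_error = str(exc)

print(result_columns)
print(appended_rows)
print(column_count_error)
conn.close()
```

The appended result keeps the column names of the first SELECT, which is why differing names in the second table do not matter.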

### UNION Use Case

* The UNION operator is used to combine the result sets of 2 or more SELECT statements. It removes duplicate rows between the various SELECT statements.
* Each SELECT statement within the UNION must have the same number of fields in the result sets with similar data types.

Typically, the use case for leveraging the UNION command in SQL is when a user wants to pull together distinct values of specified columns that are spread across multiple tables. For example, a chef wants to pull together the ingredients and respective aisle across three separate meals that are maintained in different tables.

### Details of UNION

* There must be the same number of expressions in both SELECT statements.
* The corresponding expressions must have the same data type in the SELECT statements. For example: expression1 must be the same data type in both the first and second SELECT statement.

**Expert Tip**

* UNION removes duplicate rows.
* UNION ALL does not remove duplicate rows.
* We'd likely use UNION ALL far more often than UNION in data analysis.

**[LINK](https://www.techonthenet.com/sql/union.php)**
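
The difference between UNION and UNION ALL can be seen on a tiny example. This is a minimal sketch under assumptions: it uses Python's built-in `sqlite3` module with an in-memory database rather than the MySQL connection, and the tables `t1` and `t2` are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Two hypothetical tables that share one value ('Apple').
cur.execute("CREATE TABLE t1 (name TEXT)")
cur.execute("CREATE TABLE t2 (name TEXT)")
cur.executemany("INSERT INTO t1 VALUES (?)", [('Walmart',), ('Apple',)])
cur.executemany("INSERT INTO t2 VALUES (?)", [('Apple',), ('Disney',)])

# UNION keeps one copy of each distinct row.
union_rows = cur.execute(
    "SELECT name FROM t1 UNION SELECT name FROM t2"
).fetchall()

# UNION ALL keeps every row, including the repeated 'Apple'.
union_all_rows = cur.execute(
    "SELECT name FROM t1 UNION ALL SELECT name FROM t2"
).fetchall()

print(len(union_rows))      # 3 distinct names
print(len(union_all_rows))  # 4 rows, 'Apple' appears twice
conn.close()
```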

**QUIZ Appending Data via UNION**

Write a query that uses UNION ALL on two instances (and selecting all columns) of the accounts table. <br>Then inspect the results and answer the subsequent quiz.

The first part of the query...

In [10]:
query_to_df(
"SELECT * FROM accounts WHERE name < primary_poc;"
)

Query ran for 0.3004770278930664 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1011,Exxon Mobil,www.exxonmobil.com,41.16915630,-73.84937379,Sung Shields,321510
1,1021,Apple,www.apple.com,42.29049481,-76.08400942,Jodee Lupo,321520
2,1031,Berkshire Hathaway,www.berkshirehathaway.com,40.94902131,-75.76389759,Serafina Banda,321530
3,1081,Ford Motor,www.ford.com,41.11394200,-75.85422452,Kym Hagerman,321580
4,1091,AT&T,www.att.com,42.49746270,-74.90271225,Jamel Mosqueda,321590
...,...,...,...,...,...,...,...
168,4381,BorgWarner,www.borgwarner.com,36.16166860,-115.15042141,Jamel Mosqueda,321970
169,4401,Ball,www.ball.com,36.15614254,-115.13748599,Tuan Trainer,321990
170,4421,Eversource Energy,www.eversource.com,45.50830191,-122.66577523,Paige Bartos,321980
171,4441,Masco,www.masco.com,45.54969277,-122.64617499,Terrilyn Kesler,321980


The second part of the query...

In [11]:
query_to_df(
"SELECT * FROM accounts WHERE id % 3 = 0;"
)

Query ran for 0.21996235847473145 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1011,Exxon Mobil,www.exxonmobil.com,41.16915630,-73.84937379,Sung Shields,321510
1,1041,McKesson,www.mckesson.com,42.21709326,-75.28499823,Angeles Crusoe,321540
2,1071,General Motors,www.gm.com,40.80551762,-76.71018140,Barrie Omeara,321570
3,1101,General Electric,www.ge.com,41.16971210,-77.29713174,Parker Hoggan,321600
4,1131,Chevron,www.chevron.com,42.61194130,-76.36123105,Paige Bartos,321630
...,...,...,...,...,...,...,...
112,4371,Mohawk Industries,www.mohawkind.com,36.18032753,-115.13596405,Kym Hagerman,321910
113,4401,Ball,www.ball.com,36.15614254,-115.13748599,Tuan Trainer,321990
114,4431,Franklin Resources,www.franklinresources.com,45.53389437,-122.68221497,Dominique Favela,321970
115,4461,KKR,www.kkr.com,45.54535285,-122.65524711,Buffy Azure,321970


Combining both parts with UNION ALL...

In [12]:
query_to_df(
"SELECT * FROM accounts WHERE name < primary_poc \
UNION ALL \
SELECT * FROM accounts WHERE id % 3 = 0;"
)

Query ran for 0.5679306983947754 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1011,Exxon Mobil,www.exxonmobil.com,41.16915630,-73.84937379,Sung Shields,321510
1,1021,Apple,www.apple.com,42.29049481,-76.08400942,Jodee Lupo,321520
2,1031,Berkshire Hathaway,www.berkshirehathaway.com,40.94902131,-75.76389759,Serafina Banda,321530
3,1081,Ford Motor,www.ford.com,41.11394200,-75.85422452,Kym Hagerman,321580
4,1091,AT&T,www.att.com,42.49746270,-74.90271225,Jamel Mosqueda,321590
...,...,...,...,...,...,...,...
285,4371,Mohawk Industries,www.mohawkind.com,36.18032753,-115.13596405,Kym Hagerman,321910
286,4401,Ball,www.ball.com,36.15614254,-115.13748599,Tuan Trainer,321990
287,4431,Franklin Resources,www.franklinresources.com,45.53389437,-122.68221497,Dominique Favela,321970
288,4461,KKR,www.kkr.com,45.54535285,-122.65524711,Buffy Azure,321970


**QUERY 2**

In [13]:
query_to_df(
"SELECT * FROM accounts WHERE name='Walmart' \
UNION ALL \
SELECT * FROM accounts WHERE name='Disney';"
)

Query ran for 0.006983757019042969 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.23849561,-75.10329704,Tamara Tuma,321500
1,1521,Disney,www.disney.com,41.87879976,-74.81102607,Timika Mistretta,321600


The result of the UNION ALL query above can be obtained more simply with a single `OR` condition...

In [14]:
query_to_df(
"SELECT * FROM accounts WHERE name='Walmart' OR name='Disney';"
)

Query ran for 0.01495981216430664 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.23849561,-75.10329704,Tamara Tuma,321500
1,1521,Disney,www.disney.com,41.87879976,-74.81102607,Timika Mistretta,321600
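
The equivalence between UNION ALL and `OR` holds here because the two filters select different rows. When the filters overlap, UNION ALL returns the shared rows twice while `OR` returns them once. The sketch below illustrates this under assumptions: it uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL, a three-row stand-in for the accounts table, and a hypothetical helper `count_rows`.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# A three-row stand-in for the accounts table (only id and name).
cur.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1001, 'Walmart'), (1011, 'Exxon Mobil'), (1521, 'Disney')],
)

def count_rows(sql):
    """Hypothetical helper: number of rows a query returns."""
    return len(cur.execute(sql).fetchall())

# Non-overlapping filters: UNION ALL and OR return the same rows.
disjoint_union = count_rows(
    "SELECT * FROM accounts WHERE name='Walmart' "
    "UNION ALL SELECT * FROM accounts WHERE name='Disney'"
)
disjoint_or = count_rows(
    "SELECT * FROM accounts WHERE name='Walmart' OR name='Disney'"
)

# Overlapping filters: Exxon Mobil matches both, so UNION ALL repeats it.
overlap_union = count_rows(
    "SELECT * FROM accounts WHERE id % 3 = 0 "
    "UNION ALL SELECT * FROM accounts WHERE name LIKE 'E%'"
)
overlap_or = count_rows(
    "SELECT * FROM accounts WHERE id % 3 = 0 OR name LIKE 'E%'"
)

print(disjoint_union, disjoint_or)  # 2 2
print(overlap_union, overlap_or)    # 3 2
conn.close()
```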


### Performing Operations on a Combined Dataset

Write a query that performs a `UNION ALL` on all rows and all columns of the accounts table. Wrap it in a **Common Table Expression (CTE)**, that is, a `WITH` clause, called _double_accounts_, and then `COUNT` the number of times each account name appears in _double_accounts_.

In [15]:
query_to_df(
"WITH \
double_accounts AS (SELECT * FROM accounts UNION ALL SELECT * FROM accounts) \
SELECT name acct_name, COUNT(*) count FROM double_accounts GROUP BY name;"
)

Query ran for 0.3335433006286621 secs!


Unnamed: 0,acct_name,count
0,Walmart,2
1,Exxon Mobil,2
2,Apple,2
3,Berkshire Hathaway,2
4,McKesson,2
...,...,...
346,KKR,2
347,Oneok,2
348,Newmont Mining,2
349,PPL,2


## SQL Query Performance Tuning

One way to make a query run faster is to reduce the number of calculations that need to be performed. Some of the high-level things that will affect the number of calculations a given query will make include:

* Table size
* Joins
* Aggregations

Query runtime is also dependent on some things that you can’t really control related to the database itself:

* Other users running queries concurrently on the database
* Database software and optimization (e.g., Postgres is optimized differently than Redshift)

### Factors Under Our Control:

* Filtering the data down to only the observations we need can dramatically improve query speed. For example, if we have time-series data, limiting the query to a small time span allows it to run faster.
* Keep in mind that we can always perform EDA on a subset of the data, refine the work into a final query, then remove the limitation and run the query on the entire dataset.
* The point above is why many SQL editors automatically append a `LIMIT` to queries by default.
* It's better to reduce table sizes before joining them, using simple pre-aggregation. Make sure the aggregation logic and the join logic in the query are both correct, so that the results stay correct.
* We can add `EXPLAIN` at the beginning of any working query to get a rough estimate of how expensive it will be to run. This returns a query plan that shows the order in which the query will be executed. It is most useful when we run `EXPLAIN` on a query, modify the query, and run `EXPLAIN` again to check whether the estimated cost and running time went down.
* Sub-queries can be particularly useful in improving the runtime of queries. We can pre-aggregate results in sub-queries, then finish the work in the main query using these pre-aggregated results.
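
The run, modify, and re-run workflow for `EXPLAIN` can be sketched as follows. This is an illustration under assumptions, not the notebook's MySQL setup: it uses Python's built-in `sqlite3` module, where the closest equivalent is `EXPLAIN QUERY PLAN`, and its output format differs from MySQL's `EXPLAIN`. The `orders` table is a simplified stand-in, and the index name `idx_orders_account` and the helper `plan_details` are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Simplified stand-in for the orders table, filled with 1000 synthetic rows.
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER)"
)
cur.executemany(
    "INSERT INTO orders (account_id, total) VALUES (?, ?)",
    [(1001 + 10 * (i % 50), i) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE account_id = 1001"

def plan_details(sql):
    """Hypothetical helper: the readable detail text of each plan step."""
    return [row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql)]

# Plan before any change: the whole table is scanned.
plan_before = plan_details(query)

# Modify the setup (here: add an index), then ask for the plan again.
cur.execute("CREATE INDEX idx_orders_account ON orders (account_id)")
plan_after = plan_details(query)

print(plan_before)
print(plan_after)
conn.close()
```

Comparing the two plans shows whether the change moved the query from a full table scan to an index lookup, which is the kind of before-and-after check the `EXPLAIN` tip describes.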

### Expert Tip

If you'd like to understand this a little better, you can do some extra research on Cartesian products. It's also worth noting that the FULL JOIN and COUNT from the course example actually run pretty fast; it's the COUNT(DISTINCT) that takes forever.
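
The effect of a Cartesian product, and how pre-aggregating in sub-queries avoids it, can be sketched on a few rows. This is a minimal illustration under assumptions: it uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL, an inner `JOIN` instead of a FULL JOIN, and simplified stand-ins for the orders and web_events tables.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER)")
cur.execute("CREATE TABLE web_events (id INTEGER PRIMARY KEY, account_id INTEGER)")

# Account 1001: 3 orders and 4 web events; account 1011: 2 orders and 1 event.
cur.executemany("INSERT INTO orders (account_id) VALUES (?)",
                [(1001,)] * 3 + [(1011,)] * 2)
cur.executemany("INSERT INTO web_events (account_id) VALUES (?)",
                [(1001,)] * 4 + [(1011,)] * 1)

# Joining first pairs every order with every event of the same account.
joined_rows = cur.execute(
    "SELECT COUNT(*) FROM orders o JOIN web_events w ON o.account_id = w.account_id"
).fetchone()[0]

# Join first, then COUNT(DISTINCT ...) to undo the row multiplication.
naive = cur.execute("""
    SELECT o.account_id, COUNT(DISTINCT o.id), COUNT(DISTINCT w.id)
    FROM orders o JOIN web_events w ON o.account_id = w.account_id
    GROUP BY o.account_id ORDER BY o.account_id
""").fetchall()

# Pre-aggregate each table in a sub-query, then join one row per account.
pre_aggregated = cur.execute("""
    SELECT o.account_id, o.n_orders, w.n_events
    FROM (SELECT account_id, COUNT(*) AS n_orders FROM orders GROUP BY account_id) o
    JOIN (SELECT account_id, COUNT(*) AS n_events FROM web_events GROUP BY account_id) w
      ON o.account_id = w.account_id
    ORDER BY o.account_id
""").fetchall()

print(joined_rows)     # 14 joined rows from only 5 orders and 5 events
print(naive)
print(pre_aggregated)
conn.close()
```

Both queries return the same counts per account, but the pre-aggregated version joins one row per account instead of every order-event pair, so it has far fewer rows to process on large tables.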

In [17]:
# Change False to True below and run cell to terminate connection

if True and connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Connection Terminated: {record} Database.')

Connection Terminated: ('parch_and_posey',) Database.
