<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/SQL_for_Data_Science/blob/main/sql_for_data_analysis2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b><h1>SQL Joins...</h1></b>

We connect to Google CloudSQL and analyze the Parch and Posey database.<br>

Thanks to this [article](https://towardsdatascience.com/sql-on-the-cloud-with-python-c08a30807661) for making the connection process clearer.

To download the parch-and-posey.sql file (for example, to load it into your own database), use this [link](https://storage.googleapis.com/kaggle1980/parch.sql) to the updated file in cloud storage.

In [1]:
# Next mount gdrive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
# set working directory to Udacity
%cd /content/gdrive/MyDrive/Colab_Notebooks/Udacity

/content/gdrive/MyDrive/Colab_Notebooks/Udacity


In [3]:
%ls

 aws_machine_learning_foundations/   linear_algebra_refresher.ipynb
 client-cert.pem                    'linear-example-data (1).xlsx'
 client-key.pem                      Problem_Solving_w_Advanced_Analytics.ipynb
 computer_vision/                    server-ca.pem
 intro_to_algorithm.ipynb            statistics/
 intro_to_artificial_intelligence/   time_series_forecasting.ipynb
 intro_to_data_analysis/             Udac_Prog_Foundations_Python/
 intro_to_machine_learning.ipynb     version_control_with_git/


In [4]:
!pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading https://files.pythonhosted.org/packages/cc/ec/102bf59d0cdeb3b8fc82d6669bf96d57d133e44811ff57ad5e941bd8588d/mysql_connector_python-8.0.23-cp36-cp36m-manylinux1_x86_64.whl (18.0MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-8.0.23


In [5]:
import mysql.connector
from mysql.connector.constants import ClientFlag
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time

In [6]:
config = {
    'user': 'root',
    'password': 'root',
    'host': '35.226.26.66',
    'client_flags': [ClientFlag.SSL],
    'ssl_ca': 'server-ca.pem',
    'ssl_cert': 'client-cert.pem',
    'ssl_key': 'client-key.pem'
}

# now we establish our connection
try:
    cnxn = mysql.connector.connect(**config)
    print('Connection to CloudSQL Instance Successful!')
except Exception as e:
    print(e)

Connection to CloudSQL Instance Successful!


In [7]:
config

{'client_flags': [2048],
 'host': '35.226.26.66',
 'password': 'root',
 'ssl_ca': 'server-ca.pem',
 'ssl_cert': 'client-cert.pem',
 'ssl_key': 'client-key.pem',
 'user': 'root'}

Now we connect to `parch_and_posey_db` by adding a `database` entry with the value `parch_and_posey_db` to our config dictionary and connecting just like we did before:

In [8]:
config['database'] = 'parch_and_posey_db'  # add new database to config dict
cnxn = mysql.connector.connect(**config)
cursor = cnxn.cursor()

Let's look at the first 3 rows of each table in the Parch and Posey database.

In [9]:
# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

[('accounts',), ('orders',), ('region',), ('sales_reps',), ('web_events',)]

Next, we define a helper function that runs a `SELECT` query and returns the result as a DataFrame.

In [10]:
def query_to_df(query):
    st = time.time()
    # Assert Every Query ends with a semi-colon
    try:
        assert query.endswith(';')
    except AssertionError:
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20) 
    df = None

    # Process the query
    cursor.execute(query)
    columns = cursor.description
    result = []
    for value in cursor.fetchall():
        tmp = {}
        for (index,column) in enumerate(value):
            tmp[columns[index][0]] = [column]
        result.append(tmp)

    # Create a DataFrame from all results
    for ind, data in enumerate(result):
        if ind >= 1:
            x = pd.DataFrame(data)
            df = pd.concat([df, x], ignore_index=True)
        else:
            df = pd.DataFrame(data)
    print(f'Query ran for {time.time()-st} secs!')
    return df
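The helper above builds one single-row DataFrame per result row and concatenates them one at a time, which may contribute to the long run times of the larger joins in this notebook (network transfer from CloudSQL is another likely factor). A minimal sketch of collecting the column names and rows in one pass is shown here. It assumes any DB-API cursor; the in-memory `sqlite3` database, the `rows_to_records` helper, and the toy rows are stand-ins for illustration, not part of the original notebook.

```python
import sqlite3


def rows_to_records(cursor):
    """Return (column_names, rows) from a cursor that has just run a SELECT."""
    names = [col[0] for col in cursor.description]  # first field is the column name
    rows = cursor.fetchall()                        # list of tuples, one per row
    return names, rows


# Stand-in database (assumption: the real notebook uses the CloudSQL cursor).
demo = sqlite3.connect(':memory:')
demo_cur = demo.cursor()
demo_cur.execute('CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT)')
demo_cur.executemany('INSERT INTO region VALUES (?, ?)',
                     [(1, 'Northeast'), (2, 'Midwest')])

demo_cur.execute('SELECT * FROM region ORDER BY id;')
names, rows = rows_to_records(demo_cur)
print(names)
print(rows)
```

With pandas available, `pd.DataFrame(rows, columns=names)` would turn these records into a DataFrame in a single call instead of one `pd.concat` per row.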

In [11]:
# 1. For the accounts table
query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.030434846878051758 secs!


Unnamed: 0,id,name,website,lats,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.169156,-73.849374,Sung Shields,321510
2,1021,Apple,www.apple.com,42.290495,-76.084009,Jodee Lupo,321520


In [12]:
# 2. For the orders table
query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.018934965133666992 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [13]:
# 3. For the region table
query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.010549545288085938 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


In [14]:
# 4. For the web_events table
query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.014106273651123047 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


In [15]:
# 5. For the sales_reps table
query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.01857614517211914 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


<h3>Overview</h3>

Writing Joins is the real strength and magic of SQL. Joins are used to read data from multiple tables to power your analysis.

<h3>Database Normalization</h3>

When creating a database, it is really important to think about how data will be stored. This is known as normalization, and it is a huge part of most SQL classes. If you are in charge of setting up a new database, it is important to have a thorough understanding of database normalization.

There are essentially three ideas that are aimed at database normalization:

* Are the tables storing logical groupings of the data?
* Can I make changes in a single location, rather than in many tables for the same information?
* Can I access and manipulate data quickly and efficiently?

This is discussed in detail [here](https://www.itprotoday.com/sql-server/sql-design-why-you-need-database-normalization).

<h3><b>Joins</b></h3>

The whole purpose of `JOIN` statements is to allow us to pull data from more than one table at a time. This is both simple and powerful.

Along with the `JOIN` clause, we add the `ON` clause to our toolkit.

The `ON` clause specifies the `JOIN` condition: a logical statement that determines how rows from the table in `FROM` are matched with rows from the table in `JOIN`.

<h3>Join Statement Analysis</h3>

```
SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
```
The `SELECT` clause indicates which column(s) of data you'd like to see in the output (for example, `orders.*` gives us all the columns of the orders table). The `FROM` clause indicates the first table from which we're pulling data, and the `JOIN` clause indicates the second table. The `ON` clause specifies the columns on which you'd like to merge the two tables.

In [16]:
query = 'SELECT orders.* FROM orders JOIN accounts \
        ON orders.account_id = accounts.id;'

query_to_df(query)

Query ran for 24.39840054512024 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.10,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.00,776.18
3,4,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.00,958.24
4,5,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49
...,...,...,...,...,...,...,...,...,...,...,...
6907,6908,4501,2016-06-29 04:03:39,11,199,59,269,54.89,1490.51,479.08,2024.48
6908,6909,4501,2016-07-29 19:58:32,5,91,96,192,24.95,681.59,779.52,1486.06
6909,6910,4501,2016-08-27 00:58:11,16,94,82,192,79.84,704.06,665.84,1449.74
6910,6911,4501,2016-11-22 06:52:22,63,67,81,211,314.37,501.83,657.72,1473.92


**What to Notice**

We are able to pull data from two tables:

* orders
* accounts

Above, we are only pulling data from the orders table since in the `SELECT` statement we only reference columns from the orders table.

The `ON` statement holds the two columns that get linked across the two tables. 

**Additional Information**

If we wanted to pull only individual columns from either the orders or accounts table, we can do this with the exact same `FROM` and `ON` clauses. However, in the `SELECT` clause, you will need to specify both the table and the column:

The table name is always before the period.<br>
The column you want from that table is always after the period.

For example, if we want to pull only the account name and the dates on which that account placed an order, but none of the other columns, we can do this with the following query:

```
SELECT accounts.name, orders.occurred_at
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
```

In [17]:
query = 'SELECT accounts.name, orders.occurred_at FROM orders JOIN accounts ON \
        orders.account_id = accounts.id;'
query_to_df(query)

Query ran for 12.180903911590576 secs!


Unnamed: 0,name,occurred_at
0,Walmart,2015-10-06 17:31:14
1,Walmart,2015-11-05 03:34:33
2,Walmart,2015-12-04 04:21:55
3,Walmart,2016-01-02 01:18:24
4,Walmart,2016-02-01 19:27:27
...,...,...
6907,SpartanNash,2016-06-29 04:03:39
6908,SpartanNash,2016-07-29 19:58:32
6909,SpartanNash,2016-08-27 00:58:11
6910,SpartanNash,2016-11-22 06:52:22


This query only pulls two columns, not all the information in these two tables. Alternatively, the below query pulls all the columns from both the accounts and orders table.

```
SELECT *
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
```

In [18]:
query = 'SELECT * FROM orders JOIN accounts ON orders.account_id = accounts.id;'
query_to_df(query)

Query ran for 49.77683067321777 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd,name,website,lats,longs,primary_poc,sales_rep_id
0,1001,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
1,1001,1001,2015-11-05 03:34:33,190,41,57,288,948.10,307.09,462.84,1718.03,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
2,1001,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.00,776.18,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
3,1001,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.00,958.24,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
4,1001,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6907,4501,4501,2016-06-29 04:03:39,11,199,59,269,54.89,1490.51,479.08,2024.48,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970
6908,4501,4501,2016-07-29 19:58:32,5,91,96,192,24.95,681.59,779.52,1486.06,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970
6909,4501,4501,2016-08-27 00:58:11,16,94,82,192,79.84,704.06,665.84,1449.74,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970
6910,4501,4501,2016-11-22 06:52:22,63,67,81,211,314.37,501.83,657.72,1473.92,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970


**Quiz Questions**

1. Try pulling all the data from the accounts table, and all the data from the orders table.

2. Try pulling standard_qty, gloss_qty, and poster_qty from the orders table, and the website and the primary_poc from the accounts table.

In [19]:
# Try pulling all the data from the accounts table, and all the data from the orders table.

query = 'SELECT * FROM accounts JOIN orders ON accounts.id = orders.account_id;'
query_to_df(query)

Query ran for 50.453455448150635 secs!


Unnamed: 0,id,name,website,lats,longs,primary_poc,sales_rep_id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500,1001,2015-11-05 03:34:33,190,41,57,288,948.10,307.09,462.84,1718.03
2,3,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.00,776.18
3,4,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.00,958.24
4,5,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6907,6908,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970,4501,2016-06-29 04:03:39,11,199,59,269,54.89,1490.51,479.08,2024.48
6908,6909,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970,4501,2016-07-29 19:58:32,5,91,96,192,24.95,681.59,779.52,1486.06
6909,6910,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970,4501,2016-08-27 00:58:11,16,94,82,192,79.84,704.06,665.84,1449.74
6910,6911,SpartanNash,www.spartannash.com,45.555651,-122.657145,Jewell Likes,321970,4501,2016-11-22 06:52:22,63,67,81,211,314.37,501.83,657.72,1473.92


Another way to select all columns from the two above tables is...

```
query = """SELECT orders.*, accounts.*
FROM accounts
JOIN orders
ON accounts.id = orders.account_id;"""

query_to_df(query)
```

Notice this result is the same as if you switched the tables in the `FROM` and `JOIN`. <br>Additionally, which side of the `=` a column is listed doesn't matter.

Personally, I think it makes sense to keep it uniform... <br>Meaning make the table at the left side of the `=` be the first table selected, while that at the right side be the second table.

In [20]:
# Try pulling standard_qty, gloss_qty, and poster_qty from the orders table, 
# and the website and the primary_poc from the accounts table.

query = 'SELECT orders.standard_qty, orders.gloss_qty, orders.poster_qty, \
        accounts.website, accounts.primary_poc FROM orders JOIN accounts ON \
        orders.account_id = accounts.id;'

query_to_df(query)

Query ran for 12.47115707397461 secs!


Unnamed: 0,standard_qty,gloss_qty,poster_qty,website,primary_poc
0,123,22,24,www.walmart.com,Tamara Tuma
1,190,41,57,www.walmart.com,Tamara Tuma
2,85,47,0,www.walmart.com,Tamara Tuma
3,144,32,0,www.walmart.com,Tamara Tuma
4,108,29,28,www.walmart.com,Tamara Tuma
...,...,...,...,...,...
6907,11,199,59,www.spartannash.com,Jewell Likes
6908,5,91,96,www.spartannash.com,Jewell Likes
6909,16,94,82,www.spartannash.com,Jewell Likes
6910,63,67,81,www.spartannash.com,Jewell Likes


<h3>Entity Relationship Diagrams:</h3>

From the last lesson, you might remember that an entity relationship diagram (ERD) is a common way to view data in a database. It is also a key element to understanding how we can pull data from multiple tables.

It is helpful to keep the ERD for Parch & Posey handy:

<img src='https://video.udacity-data.com/topher/2017/October/59e946e7_erd/erd.png'>

**Tables & Columns**

In the Parch & Posey database there are 5 tables:

* web_events
* accounts
* orders
* sales_reps
* region

You will notice some of the columns in the tables have PK or FK next to the column name, while other columns don't have a label at all.

If you look a little closer, you might notice that the PK is associated with the first column in every table. The PK here stands for primary key. A primary key exists in every table, and it is a column that has a unique value for every row.

If you look at the first few rows of any of the tables in our database, you will notice that this first, PK, column is always unique. For this database it is always called id, but that is not true of all databases.

<h4>Keys</h4>

**Primary Key (PK):**

A primary key is a unique column in a particular table. This is the first column in each of our tables. Here, those columns are all called id, but that doesn't necessarily have to be the name. It is common that the primary key is the first column in our tables in most databases.

**Foreign Key (FK):**

A foreign key is a column in one table that is a primary key in a different table. We can see in the Parch & Posey ERD that the foreign keys are:

* region_id
* account_id
* sales_rep_id

Each of these is linked to the primary key of another table. An example is shown in the image below:<br>**Note that a table can have multiple foreign keys, but only one primary key.**

<img src='https://video.udacity-data.com/topher/2017/August/598d2378_screen-shot-2017-08-10-at-8.23.48-pm/screen-shot-2017-08-10-at-8.23.48-pm.png'>

<h4>Primary - Foreign Key Link</h4>

In the above image you can see that:

* The `region_id` is the foreign key.
* The `region_id` is linked to `id` - this is the **primary-foreign key link** that connects these two tables.
* The crow's foot shows that the FK can actually appear in many rows in the sales_reps table.
* While the single line tells us that the PK id appears only once for each row in the region table.
* If you look through the rest of the database, you will notice this is always the case for a primary-foreign key relationship. In the next concept, you can make sure you have this down!
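The primary-foreign key link described above can be sketched with a small in-memory database. This is an illustration only: `sqlite3` stands in for the CloudSQL instance, and the ids and the third sales rep's placement are hypothetical toy rows. It shows that a duplicate primary key is rejected, while the same `region_id` foreign key can appear in many `sales_reps` rows.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# region.id is the primary key; sales_reps.region_id is the foreign key.
cur.execute('CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE sales_reps ('
            'id INTEGER PRIMARY KEY, name TEXT, '
            'region_id INTEGER REFERENCES region(id))')

cur.executemany('INSERT INTO region VALUES (?, ?)',
                [(1, 'Northeast'), (2, 'Midwest')])
# Toy rows: two reps share region_id 1 (the "many" side of the crow's foot).
cur.executemany('INSERT INTO sales_reps VALUES (?, ?, ?)',
                [(1, 'Samuel Racine', 1),
                 (2, 'Eugena Esser', 1),
                 (3, 'Chau Rowles', 2)])

# A second region row with id 1 violates the primary key's uniqueness.
try:
    cur.execute("INSERT INTO region VALUES (1, 'Duplicate')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# Joining on the PK-FK link attaches the region name to every rep.
cur.execute('SELECT s.name, r.name FROM sales_reps s '
            'JOIN region r ON s.region_id = r.id ORDER BY s.id;')
rep_regions = cur.fetchall()
print(duplicate_allowed)
print(rep_regions)
```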



<h4>JOIN Revisited</h4>

Let's look back at the first JOIN we wrote.

```
SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;
```

Here is the ERD for these two tables:


<img src='https://video.udacity-data.com/topher/2017/August/598dfda7_screen-shot-2017-08-11-at-11.54.30-am/screen-shot-2017-08-11-at-11.54.30-am.png'>

**Notice**

Notice our SQL query has the two tables we would like to join - one in the `FROM` and the other in the `JOIN`. Then in the `ON`, we will ALWAYS have the PK equal to the FK.

The way we join any two tables is in this way: linking the PK and FK (generally in an `ON` clause).

<h4>JOIN More than Two Tables</h4>

This same logic can actually assist in joining more than two tables together. Look at the three tables below.

<img src='https://video.udacity-data.com/topher/2017/August/598e2e15_screen-shot-2017-08-11-at-3.21.34-pm/screen-shot-2017-08-11-at-3.21.34-pm.png'>

**The Code**

If we wanted to join all three of these tables, we could use the same logic. The code below pulls all of the data from all of the joined tables.

```
SELECT *
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id;
```

Alternatively, we can create a `SELECT` statement that could pull specific columns from any of the three tables. <br>Again, our `JOIN` holds a table, and `ON` is a link for our PK to equal the FK.

To pull specific columns, the `SELECT` statement will need to specify the table that you are wishing to pull the column from, as well as the column name.
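As a rough sketch of pulling specific columns from three joined tables, the example below uses an in-memory `sqlite3` database with a few toy rows (an assumption for illustration; the notebook itself queries CloudSQL). Note that because both `web_events` and `orders` link to `accounts`, every event of an account is paired with every order of that account.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Toy versions of the three tables, reduced to the columns needed here.
cur.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE web_events (id INTEGER PRIMARY KEY, account_id INTEGER, channel TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
INSERT INTO accounts VALUES (1001, 'Walmart');
INSERT INTO web_events VALUES (1, 1001, 'direct'), (2, 1001, 'banner');
INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288);
""")

# Each selected column is written as table.column.
cur.execute("""
SELECT accounts.name, web_events.channel, orders.total
FROM web_events
JOIN accounts ON web_events.account_id = accounts.id
JOIN orders ON accounts.id = orders.account_id
ORDER BY web_events.id, orders.id;
""")
three_table_rows = cur.fetchall()
for row in three_table_rows:
    print(row)
```

With 2 events and 2 orders for the single toy account, the join returns 2 x 2 = 4 rows.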

<h3>Alias:</h3>

When we `JOIN` tables together, it is nice to give each table an alias. Frequently an alias is just the first letter of the table name. You actually saw something similar for column names in the Arithmetic Operators concept.

Example:

```
FROM tablename AS t1
JOIN tablename2 AS t2
```

Frequently, you might also see aliases written without the `AS` keyword. The table aliases above, as well as a column alias such as `col1 + col2 AS total`, could be written in the following way instead, and they would still produce the exact same results:

```
FROM tablename t1
JOIN tablename2 t2
```

and

```
SELECT col1 + col2 total, col3
```



<h3>Aliases for Columns in Resulting Table</h3>

While aliasing tables is the most common use case, an alias can also be given to a selected column so that the resulting table shows a more readable name.

Example:

```
SELECT t1.column1 aliasname, t2.column2 aliasname2
FROM tablename AS t1
JOIN tablename2 AS t2
```

The alias names will be what shows up as the column headers in the returned table, instead of t1.column1 and t2.column2:

```
aliasname	aliasname2
example row	example row
example row	example row
```
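To see table and column aliases in action, here is a small sketch that runs an aliased query and reads the resulting column names from the cursor. It uses an in-memory `sqlite3` database with hypothetical toy rows as a stand-in for the real connection.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript("""
CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_reps (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
INSERT INTO region VALUES (1, 'Northeast');
INSERT INTO sales_reps VALUES (1, 'Samuel Racine', 1);
""")

# Table aliases: r (without AS) and s (with AS).
# Column aliases: region (without AS) and sales_rep (with AS).
cur.execute("""
SELECT r.name region, s.name AS sales_rep
FROM region r
JOIN sales_reps AS s ON r.id = s.region_id;
""")
alias_names = [col[0] for col in cur.description]
alias_rows = cur.fetchall()
print(alias_names)
print(alias_rows)
```

The column headers come back as the aliases, not as `r.name` and `s.name`.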

<h4>Questions</h4>

1. Provide a table for all web_events associated with account name of Walmart. There should be three columns. Be sure to include the primary_poc, time of the event, and the channel for each event. Additionally, you might choose to add a fourth column to assure only Walmart events were chosen.

2. Provide a table that provides the region for each sales_rep along with their associated accounts. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

3. Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. Your final table should have 3 columns: region name, account name, and unit price. A few accounts have 0 for total, so I divided by (total + 0.01) to assure not dividing by zero.
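The unit-price arithmetic in question 3 can be sketched on two toy orders, one of which has a `total` of 0. The in-memory `sqlite3` table and the second row are hypothetical, used only to show how the `+ 0.01` offset behaves.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders ('
            'id INTEGER PRIMARY KEY, total INTEGER, total_amt_usd REAL)')
# Row 1 mirrors the first order shown earlier; row 2 is a made-up empty order.
cur.executemany('INSERT INTO orders VALUES (?, ?, ?)',
                [(1, 169, 973.43), (2, 0, 0.0)])

# Adding 0.01 keeps the denominator non-zero even when total is 0.
cur.execute('SELECT id, total_amt_usd / (total + 0.01) AS unit_price '
            'FROM orders ORDER BY id;')
unit_prices = cur.fetchall()
for order_id, unit_price in unit_prices:
    print(order_id, round(unit_price, 4))
```

The offset shifts every unit price slightly downward (973.43 / 169.01 instead of 973.43 / 169), which is usually acceptable for exploratory work but worth keeping in mind.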

In [21]:
# 1. For the sales_reps table
query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.013912200927734375 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


In [22]:
# 1. For the orders table
query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.019616365432739258 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [23]:
# 2. For the accounts table
query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.01897907257080078 secs!


Unnamed: 0,id,name,website,lats,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.169156,-73.849374,Sung Shields,321510
2,1021,Apple,www.apple.com,42.290495,-76.084009,Jodee Lupo,321520


In [24]:
# 1. For the region table
query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.008010387420654297 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


In [25]:
# 1. For the web_events table
query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.012680530548095703 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


Provide a table for all web_events associated with account name of Walmart. There should be three columns. Be sure to include the primary_poc, time of the event, and the channel for each event. Additionally, you might choose to add a fourth column to assure only Walmart events were chosen.

In [26]:
query = 'SELECT accounts.primary_poc, web_events.occurred_at, web_events.channel,\
        accounts.name FROM accounts JOIN web_events ON \
        accounts.id = web_events.account_id WHERE accounts.name LIKE "Walmart%";'
query_to_df(query)

Query ran for 0.12111759185791016 secs!


Unnamed: 0,primary_poc,occurred_at,channel,name
0,Tamara Tuma,2015-10-06 17:13:58,direct,Walmart
1,Tamara Tuma,2015-11-05 03:08:26,direct,Walmart
2,Tamara Tuma,2015-12-04 03:57:24,direct,Walmart
3,Tamara Tuma,2016-01-02 00:55:03,direct,Walmart
4,Tamara Tuma,2016-02-01 19:02:33,direct,Walmart
...,...,...,...,...
34,Tamara Tuma,2016-07-26 11:29:09,direct,Walmart
35,Tamara Tuma,2016-07-26 21:08:52,banner,Walmart
36,Tamara Tuma,2016-08-12 09:31:22,organic,Walmart
37,Tamara Tuma,2016-09-01 18:33:56,direct,Walmart


Provide a table that provides the region for each sales_rep along with their associated accounts. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

In [27]:
query = """
SELECT r.name region, s.name sales_rep, a.name account
FROM region r JOIN sales_reps s ON r.id = s.region_id 
JOIN accounts a ON a.sales_rep_id = s.id ORDER BY a.name;
"""

query_to_df(query)

'ERROR: Query Must End with ;'
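The error above is raised by the `query.endswith(';')` check in `query_to_df`: a triple-quoted string that closes on its own line ends with a newline character, not with the semicolon. One possible fix, sketched below, is to strip surrounding whitespace before checking (or before passing the query to the helper). The short query here is only an illustration.

```python
# A triple-quoted query whose closing quotes sit on their own line.
query = """
SELECT r.name region
FROM region r;
"""

raw_ok = query.endswith(';')               # last character is '\n'
stripped_ok = query.strip().endswith(';')  # whitespace removed first

print(repr(query[-2:]))
print(raw_ok, stripped_ok)
```

Calling `query_to_df(query.strip())` would therefore pass the check for any triple-quoted query written this way.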

Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. Your final table should have 3 columns: region name, account name, and unit price. A few accounts have 0 for total, so I divided by (total + 0.01) to assure not dividing by zero.

In [28]:
query = """
SELECT r.name region_name, a.name acct_name, 
(o.total_amt_usd / (o.total+0.01)) unit_price 
FROM region r JOIN sales_reps s ON r.id = s.region_id
JOIN accounts a ON a.sales_rep_id = s.id
JOIN orders o ON o.account_id = a.id;
"""

query_to_df(query)

'ERROR: Query Must End with ;'

<h3>Inner, Left, Right, Outer Joins</h3>

**JOINs**

The `INNER JOIN`, which we have been using by writing just `JOIN`, returns only rows that match in both tables.

For the left and right joins, if there is no matching information in the JOINed table, then you will have columns with empty cells. These empty cells hold NULL, a special marker for missing data. You will learn about NULLs in detail in the next lesson, but for now you can consider any cell without data as NULL.

<h3>JOIN Check In</h3>

**INNER JOINs** 

Notice every JOIN we have done up to this point has been an `INNER JOIN`. That is, we have always pulled rows only if they exist as a match across two tables.

Our new JOINs allow us to pull rows that might only exist in one of the two tables. This introduces NULL values, which will be discussed in detail in the next lesson.

**Quick Note**

You might see the SQL syntax of `LEFT OUTER JOIN` or `RIGHT OUTER JOIN`.

These are exactly the same commands as `LEFT JOIN` and `RIGHT JOIN`.

**OUTER JOINS**

The last type of join is a `full outer join`. This will return the `inner join` result set, as well as any unmatched rows from either of the two tables being joined.

Again this returns rows that do not match one another from the two tables. The use cases for a `full outer join` are very rare.

You can see examples of outer joins at the link [here](http://www.w3resource.com/sql/joins/perform-a-full-outer-join.php) and a description of the rare use cases here. We will not spend time on these given the few instances you might need to use them.

Similar to the above, you might see the language `FULL OUTER JOIN`, which is the same as `OUTER JOIN`.

<h3><b>Facts</b></h3>

1. A `LEFT JOIN` and `RIGHT JOIN` do the same thing if we change the tables that are in the `FROM` and `JOIN` statements.

2. A `LEFT JOIN` will at least return all the rows that are in an `INNER JOIN`.

3. `JOIN` and `INNER JOIN` are the same.

4. A `LEFT OUTER JOIN` is the same as `LEFT JOIN`.
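The facts above can be checked on a small example. The sketch below uses an in-memory `sqlite3` database as a stand-in; the toy data assumes, purely for illustration, that one account has no orders. It compares `JOIN` with `LEFT JOIN` and shows the unmatched row coming back with a missing value (NULL in SQL, `None` in Python).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
INSERT INTO accounts VALUES (1001, 'Walmart'), (1011, 'Exxon Mobil');
INSERT INTO orders VALUES (1, 1001, 169);
""")

# INNER JOIN: only accounts that have a matching order.
cur.execute('SELECT a.name, o.total FROM accounts a '
            'JOIN orders o ON o.account_id = a.id ORDER BY a.id;')
inner_rows = cur.fetchall()

# LEFT JOIN: every account, with NULL where no order matches.
cur.execute('SELECT a.name, o.total FROM accounts a '
            'LEFT JOIN orders o ON o.account_id = a.id ORDER BY a.id;')
left_rows = cur.fetchall()

print(inner_rows)
print(left_rows)
# Fact 2: every inner-join row also appears in the left-join result.
print(all(row in left_rows for row in inner_rows))
```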



<b><h3>Tip:</h3></b>

If you have two or more columns in your `SELECT` that share the same column name, such as `accounts.name` and `sales_reps.name`, you will need to alias them. Otherwise only one of the columns will be shown. You can alias them like `accounts.name AS AccountName, sales_reps.name AS SalesRepName`.
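In this notebook, the reason one column disappears is that `query_to_df` stores each row in a dictionary keyed by column name, so two columns both called `name` collide. A minimal sketch of that collision, and of how aliases avoid it, is shown below; the in-memory `sqlite3` database and its toy rows are stand-ins for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales_reps (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, sales_rep_id INTEGER);
INSERT INTO sales_reps VALUES (321500, 'Samuel Racine');
INSERT INTO accounts VALUES (1001, 'Walmart', 321500);
""")

# Without aliases both result columns are reported as "name".
cur.execute('SELECT a.name, s.name FROM accounts a '
            'JOIN sales_reps s ON a.sales_rep_id = s.id;')
plain_names = [col[0] for col in cur.description]
plain_dict = dict(zip(plain_names, cur.fetchone()))  # later value overwrites earlier

# With aliases each column keeps its own key.
cur.execute('SELECT a.name AS AccountName, s.name AS SalesRepName '
            'FROM accounts a JOIN sales_reps s ON a.sales_rep_id = s.id;')
alias_names = [col[0] for col in cur.description]
alias_dict = dict(zip(alias_names, cur.fetchone()))

print(plain_dict)
print(alias_dict)
```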

<h3>Q1</h3>

Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

In [29]:
query = 'SELECT r.name region, s.name sales_rep, a.name acct \
        FROM region r JOIN sales_reps s ON r.id = s.region_id \
        AND r.name LIKE "Midwest%" JOIN accounts a on \
        a.sales_rep_id = s.id ORDER BY a.name;'

query_to_df(query)

Query ran for 0.07679629325866699 secs!


Unnamed: 0,region,sales_rep,acct
0,Midwest,Chau Rowles,Abbott Laboratories
1,Midwest,Julie Starr,AbbVie
2,Midwest,Cliff Meints,Aflac
3,Midwest,Chau Rowles,Alcoa
4,Midwest,Charles Bidwell,Altria Group
...,...,...,...
43,Midwest,Cliff Meints,Union Pacific
44,Midwest,Kathleen Lalonde,US Foods Holding
45,Midwest,Julie Starr,USAA
46,Midwest,Charles Bidwell,Whirlpool


<h3>Q2</h3>

Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a first name starting with S and in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

In [30]:
query = 'SELECT r.name region_, s.name sales_rep, a.name acct FROM region r \
         JOIN sales_reps s ON r.id = s.region_id AND s.name LIKE "S%" AND \
         r.name LIKE "Midwest%" JOIN accounts a ON a.sales_rep_id = s.id \
         ORDER BY a.name;'

query_to_df(query)

Query ran for 0.016788721084594727 secs!


Unnamed: 0,region_,sales_rep,acct
0,Midwest,Sherlene Wetherington,Community Health Systems
1,Midwest,Sherlene Wetherington,Progressive
2,Midwest,Sherlene Wetherington,Rite Aid
3,Midwest,Sherlene Wetherington,Time Warner Cable
4,Midwest,Sherlene Wetherington,U.S. Bancorp


<h3>Q3</h3>

Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a last name starting with K and in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

In [31]:
query = 'SELECT r.name region, s.name sales_rep, a.name acct FROM region r \
         JOIN sales_reps s ON r.id = s.region_id AND s.name LIKE "% K%" AND \
         r.name LIKE "Midwest" JOIN accounts a ON a.sales_rep_id = s.id \
        ORDER BY a.name;'

query_to_df(query)

Query ran for 0.026633024215698242 secs!


Unnamed: 0,region,sales_rep,acct
0,Midwest,Delilah Krum,Amgen
1,Midwest,Delilah Krum,AutoNation
2,Midwest,Delilah Krum,Capital One Financial
3,Midwest,Delilah Krum,Cummins
4,Midwest,Carletta Kosinski,Danaher
5,Midwest,Carletta Kosinski,Dollar General
6,Midwest,Delilah Krum,Hartford Financial Services Group
7,Midwest,Carletta Kosinski,International Paper
8,Midwest,Delilah Krum,Kimberly-Clark
9,Midwest,Carletta Kosinski,McDonald's


<h3>Q4</h3>

Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100. Your final table should have 3 columns: region name, account name, and unit price. To avoid a division by zero error, it helps to add .01 to the denominator: total_amt_usd/(total+0.01).

In [32]:
query = 'SELECT r.name region, a.name acct, (o.total_amt_usd / (o.total+0.001)) \
        unit_price FROM region r JOIN sales_reps s ON r.id = s.region_id JOIN \
        accounts a ON a.sales_rep_id = s.id JOIN orders o ON a.id = o.account_id \
        AND o.standard_qty > 100;'
query_to_df(query)

Query ran for 7.537501573562622 secs!


Unnamed: 0,region,acct,unit_price
0,Northeast,Walmart,5.759907
1,Northeast,Walmart,5.965361
2,Northeast,Walmart,5.444515
3,Northeast,Walmart,5.960509
4,Northeast,Walmart,6.169040
...,...,...,...
4504,West,KKR,7.467717
4505,West,KKR,7.265457
4506,West,KKR,7.082928
4507,West,PPL,6.565052
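The exercise suggests adding .01 to the denominator, and the query above adds 0.001. Either constant avoids dividing by zero, at the cost of a slightly understated unit price. A common alternative is `NULLIF(total, 0)`, which leaves the price exact and returns NULL when the total is zero. The sketch below compares the two on invented rows, using in-memory SQLite as a stand-in for the real `orders` table.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER, total_amt_usd REAL)"
)
# Two invented orders: one normal, one with a zero total.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, 500.0), (2, 0, 0.0)],
)

rows = conn.execute(
    "SELECT id, "
    "       total_amt_usd / (total + 0.01)   AS padded_price, "
    "       total_amt_usd / NULLIF(total, 0) AS guarded_price "
    "FROM orders ORDER BY id"
).fetchall()

for order_id, padded_price, guarded_price in rows:
    # guarded_price is None (SQL NULL) when total is zero
    print(order_id, padded_price, guarded_price)
```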


<h3>Q5</h3>

Provide the name of the region for every order, as well as the account name and the unit price paid for the order (total_amt_usd/total). Only include results where the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region name, account name, and unit price. Sort by the smallest unit price first. To avoid a division-by-zero error, it helps to add .01 to the denominator: total_amt_usd/(total+0.01).

In [33]:
query = 'SELECT r.name region_name, a.name acct_name, (o.total_amt_usd/(total+0.001)) \
         unit_price FROM region r JOIN sales_reps s ON r.id = s.region_id JOIN accounts a ON \
         s.id = a.sales_rep_id JOIN orders o ON o.account_id = a.id WHERE \
         o.standard_qty > 100 AND o.poster_qty > 50 ORDER BY unit_price;'

query_to_df(query)

Query ran for 1.406799554824829 secs!


Unnamed: 0,region_name,acct_name,unit_price
0,Northeast,State Farm Insurance Cos.,5.119285
1,Southeast,DISH Network,5.231842
2,Northeast,Travelers Cos.,5.235187
3,Northeast,Best Buy,5.260447
4,West,Stanley Black & Decker,5.266474
...,...,...,...
830,West,Fidelity National Financial,7.992809
831,Northeast,CHS,8.018857
832,West,Pacific Life,8.063025
833,West,Mosaic,8.066336


### Q6

Provide the name of the region for every order, as well as the account name and the unit price paid for the order (total_amt_usd/total). Only include results where the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region name, account name, and unit price. Sort by the largest unit price first. To avoid a division-by-zero error, it helps to add .01 to the denominator: total_amt_usd/(total+0.01).

In [34]:
query = 'SELECT r.name region_name, a.name acct_name, (o.total_amt_usd/(o.total+0.001)) \
        unit_price FROM region r JOIN sales_reps s ON r.id = s.region_id JOIN accounts a \
        ON a.sales_rep_id = s.id JOIN orders o ON o.account_id = a.id WHERE \
        o.standard_qty > 100 AND o.poster_qty > 50 ORDER BY unit_price DESC;'

query_to_df(query)

Query ran for 1.3983879089355469 secs!


Unnamed: 0,region_name,acct_name,unit_price
0,Northeast,IBM,8.089913
1,West,Mosaic,8.066336
2,West,Pacific Life,8.063025
3,Northeast,CHS,8.018857
4,West,Fidelity National Financial,7.992809
...,...,...,...
830,West,Stanley Black & Decker,5.266474
831,Northeast,Best Buy,5.260447
832,Northeast,Travelers Cos.,5.235187
833,Southeast,DISH Network,5.231842


### Q7

What are the different channels used by account id 1001? Your final table should have only 2 columns: account name and the different channels. You can try SELECT DISTINCT to narrow down the results to only the unique values.

In [35]:
query = 'SELECT DISTINCT a.name acct_name, w.channel channels FROM accounts a \
        JOIN web_events w ON w.account_id = a.id WHERE a.id = 1001;'

query_to_df(query)

Query ran for 0.017510652542114258 secs!


Unnamed: 0,acct_name,channels
0,Walmart,direct
1,Walmart,facebook
2,Walmart,organic
3,Walmart,adwords
4,Walmart,twitter
5,Walmart,banner


### Q8

Find all the orders that occurred in 2015. Your final table should have 4 columns: occurred_at, account name, order total, and order total_amt_usd.

In [36]:
query = 'SELECT o.occurred_at occurred_at, a.name acct_name, o.total total_qty, \
         o.total_amt_usd total_usd FROM orders o JOIN accounts a ON a.id = \
         o.account_id WHERE o.occurred_at >= "2015-01-01" AND \
         o.occurred_at < "2016-01-01" ORDER BY o.occurred_at DESC;'

query_to_df(query)

Query ran for 4.1171722412109375 secs!


Unnamed: 0,occurred_at,acct_name,total_qty,total_usd
0,2015-12-31 23:21:15,Thermo Fisher Scientific,61,446.97
1,2015-12-31 23:15:35,Thermo Fisher Scientific,635,3246.90
2,2015-12-31 20:44:28,Coca-Cola,528,2693.54
3,2015-12-31 15:12:41,Computer Sciences,164,875.25
4,2015-12-31 15:11:15,Cameron International,513,2626.82
...,...,...,...,...
1720,2015-01-01 14:42:53,Travelers Cos.,195,1233.37
1721,2015-01-01 14:40:53,Travelers Cos.,1320,8770.30
1722,2015-01-01 11:17:47,New York Life Insurance,517,2644.89
1723,2015-01-01 05:53:44,FirstEnergy,529,2737.29
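A year filter can be written as a half-open date range: on or after 1 January 2015 and strictly before 1 January 2016. This compares the timestamp directly and does not depend on how it is rendered as text. Here is a minimal sketch on four invented timestamps, using in-memory SQLite (where timestamps are stored as ISO-formatted text) as a stand-in for the real `orders` table.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, occurred_at TEXT)")
# Invented timestamps that straddle both ends of 2015.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2014-12-31 23:59:59"),
        (2, "2015-01-01 05:53:44"),
        (3, "2015-12-31 23:21:15"),
        (4, "2016-01-01 00:00:00"),
    ],
)

# Half-open range: includes all of 2015, excludes the first instant of 2016.
ids_2015 = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM orders "
        "WHERE occurred_at >= '2015-01-01' AND occurred_at < '2016-01-01' "
        "ORDER BY occurred_at DESC"
    )
]
print(ids_2015)
```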


### Q9

What are the different channels used by account id 1001? Sort by how often each channel was used, most frequent first. Your query should return 4 columns: account name, account id, channel, and count.

In [37]:
query = 'SELECT a.name acct_name, a.id acct_id, w.channel channels, \
         COUNT(w.channel) count FROM web_events w JOIN accounts a ON a.id = \
         w.account_id WHERE w.account_id = 1001 GROUP BY a.name, a.id, \
         w.channel ORDER BY count DESC;'

query_to_df(query)

Query ran for 0.024506330490112305 secs!


Unnamed: 0,acct_name,acct_id,channels,count
0,Walmart,1001,direct,22
1,Walmart,1001,organic,6
2,Walmart,1001,adwords,5
3,Walmart,1001,banner,3
4,Walmart,1001,facebook,2
5,Walmart,1001,twitter,1
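Q7 and Q9 ask related questions: `SELECT DISTINCT` lists each channel once, while `GROUP BY` with `COUNT` lists each channel once and also says how often it occurs. Because `GROUP BY` already returns one row per group, it needs no extra `DISTINCT`. The sketch below shows both on a handful of invented events, using in-memory SQLite as a stand-in for the real `web_events` table.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_events ("
    "id INTEGER PRIMARY KEY, account_id INTEGER, channel TEXT)"
)
# Invented events: six for account 1001 and one for another account.
events = [
    (1001, "direct"), (1001, "direct"), (1001, "direct"),
    (1001, "organic"), (1001, "organic"),
    (1001, "twitter"),
    (1011, "direct"),
]
conn.executemany(
    "INSERT INTO web_events (account_id, channel) VALUES (?, ?)", events
)

# DISTINCT: which channels did the account use?
distinct_channels = conn.execute(
    "SELECT DISTINCT channel FROM web_events "
    "WHERE account_id = 1001 ORDER BY channel"
).fetchall()

# GROUP BY + COUNT: one row per channel, plus how often it was used.
channel_counts = conn.execute(
    "SELECT channel, COUNT(*) AS n FROM web_events "
    "WHERE account_id = 1001 GROUP BY channel ORDER BY n DESC"
).fetchall()

print(distinct_channels)
print(channel_counts)
```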


<h2>Recap</h2>

### Primary and Foreign Keys

You learned that a key element of JOINing tables in a database is the relationship between primary and foreign keys:

* **primary keys** - are unique for every row in a table. They are generally the first column of a table (like the id column in every table of the Parch & Posey database).

* **foreign keys** - are columns that hold the primary key of another table. The same value can appear on many rows, so foreign keys need not be unique (for example, several orders can share one account_id).

Choosing how data is laid out across the tables of a database is very important, but it is not usually the job of a data analyst. This process is known as Database Normalization.
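The primary and foreign key rules can be tried out on a toy schema. This is a minimal sketch, assuming in-memory SQLite as a stand-in for the CloudSQL database. The two tables are trimmed versions of `region` and `sales_reps`, and all rows are invented.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.execute("CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sales_reps ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "region_id INTEGER REFERENCES region(id))"
)
conn.execute("INSERT INTO region VALUES (1, 'Midwest')")
# The same region_id appears on two rows: foreign keys need not be unique.
conn.executemany(
    "INSERT INTO sales_reps VALUES (?, ?, ?)",
    [(10, "Rep A", 1), (11, "Rep B", 1)],
)

# Reusing an existing primary key value is rejected.
try:
    conn.execute("INSERT INTO region VALUES (1, 'Duplicate')")
    duplicate_pk_rejected = False
except sqlite3.IntegrityError:
    duplicate_pk_rejected = True

# A foreign key that points at a missing region is rejected as well.
try:
    conn.execute("INSERT INTO sales_reps VALUES (12, 'Rep C', 99)")
    orphan_fk_rejected = False
except sqlite3.IntegrityError:
    orphan_fk_rejected = True

reps_in_region_1 = conn.execute(
    "SELECT COUNT(*) FROM sales_reps WHERE region_id = 1"
).fetchone()[0]
print(duplicate_pk_rejected, orphan_fk_rejected, reps_in_region_1)
```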

### JOINs

In this lesson, you learned how to combine data from multiple tables using JOINs. The three JOIN statements you are most likely to use are:

* **JOIN** - an INNER JOIN that only pulls rows that have a match in both tables.
* **LEFT JOIN** - pulls all the rows that match in both tables, as well as all of the rows from the table in the FROM, even if they have no match in the table in the JOIN. The unmatched columns come back as NULL.
* **RIGHT JOIN** - pulls all the rows that match in both tables, as well as all of the rows from the table in the JOIN, even if they have no match in the table in the FROM. The unmatched columns come back as NULL.
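The difference between these JOINs shows up as soon as one table has rows with no partner in the other. This is a minimal sketch, assuming in-memory SQLite as a stand-in for the CloudSQL database, with invented rows. It avoids the `RIGHT JOIN` keyword, which older SQLite versions lack, and gets the same rows by swapping the tables in a LEFT JOIN.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER)"
)
# Invented data: 'Acct B' has no orders, and order 101 has no matching account.
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, "Acct A"), (2, "Acct B")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(100, 1, 50), (101, 3, 70)])

# INNER JOIN: only rows that match in both tables.
inner_rows = conn.execute(
    "SELECT a.name, o.total FROM accounts a "
    "JOIN orders o ON o.account_id = a.id ORDER BY a.id"
).fetchall()

# LEFT JOIN: every account, with NULL (None) where there is no order.
left_rows = conn.execute(
    "SELECT a.name, o.total FROM accounts a "
    "LEFT JOIN orders o ON o.account_id = a.id ORDER BY a.id"
).fetchall()

# Same rows as 'accounts a RIGHT JOIN orders o': every order is kept.
right_equivalent_rows = conn.execute(
    "SELECT a.name, o.total FROM orders o "
    "LEFT JOIN accounts a ON o.account_id = a.id ORDER BY o.id"
).fetchall()

print(inner_rows)
print(left_rows)
print(right_equivalent_rows)
```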

There are a few more advanced operations that we did not cover here, and they are used in very specific cases: [UNION and UNION ALL](https://www.w3schools.com/sql/sql_union.asp) (which stack the rows of two queries rather than joining tables), [CROSS JOIN](http://www.w3resource.com/sql/joins/cross-join.php), and the tricky [SELF JOIN](https://www.w3schools.com/sql/sql_join_self.asp). They are beyond the scope of this course, but it is useful to know that they exist.

### Alias

You learned that you can alias tables and columns, with or without the AS keyword. Aliases reduce the number of characters you need to write, while still letting you give columns headings that describe the data in your table.
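As a quick check that AS is optional, the sketch below runs the same query both ways and compares the column headings that come back. It assumes in-memory SQLite as a stand-in for the CloudSQL database, with a one-row `accounts` table.

```python
import sqlite3

# In-memory SQLite stands in for the notebook's MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO accounts VALUES (1001, 'Walmart')")

# Same query, with and without the AS keyword for table and column aliases.
with_as = conn.execute("SELECT a.name AS acct_name FROM accounts AS a")
without_as = conn.execute("SELECT a.name acct_name FROM accounts a")

# cursor.description holds one tuple per column; its first item is the heading.
cols_with_as = [col[0] for col in with_as.description]
cols_without_as = [col[0] for col in without_as.description]
rows_with_as = with_as.fetchall()
rows_without_as = without_as.fetchall()

print(cols_with_as, cols_without_as)
print(rows_with_as == rows_without_as)
```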

### Looking Ahead
The next lesson covers aggregating data. You have already learned a ton, but SQL might still feel a bit disconnected from statistics and from Excel-like platforms. Aggregations let you write more complex queries, which help answer questions like:

* Which channel generated more revenue?
* Which account had an order with the most items?
* Which sales_rep had the most orders, or the fewest? How many orders did they have?
