# Nested Queries, Type II Subquery

Nested queries are **subqueries**: queries embedded inside a larger query.

**Conceptual Type I / II Subquery**
![Subquery](../images/subquery-syntax.gif)


**Recall**: A Type I subquery is executed once, before the outer query, and its result is then used to evaluate the rows of the outer query.
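As a reminder of what that looks like, here is a minimal sketch of a Type I subquery. It assumes an in-memory SQLite database with a few rows copied from the survey tables, standing in for the course PostgreSQL server; the variable name `type_one_rows` is only illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in for the course PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id TEXT, personal TEXT, family TEXT);
    CREATE TABLE Survey (taken INTEGER, person TEXT, quant TEXT, reading REAL);
    INSERT INTO Person VALUES
        ('dyer', 'William', 'Dyer'),
        ('pb', 'Frank', 'Pabodie'),
        ('danforth', 'Frank', 'Danforth');
    INSERT INTO Survey VALUES
        (619, 'dyer', 'rad', 9.82),
        (734, 'pb', 'rad', 8.41),
        (735, NULL, 'sal', 0.06);
""")

# Type I: the inner query mentions no column of the outer query,
# so it can be evaluated once, on its own, before the outer query.
type_one_rows = conn.execute("""
    SELECT p.personal, p.family
    FROM Person p
    WHERE p.id IN (SELECT s.person FROM Survey s)
    ORDER BY p.family
""").fetchall()
print(type_one_rows)
conn.close()
```

The inner `SELECT s.person FROM Survey s` would run unchanged if pasted into its own cell, which is a quick way to recognize a Type I subquery.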



# Type II - Correlated Subquery, or the nested loops of SQL

A correlated subquery is a query in which the inner query depends on values from the current row of the outer query, so the inner query cannot be evaluated on its own.



A Type II subquery references one or more columns in the outer query.  

Conceptually, the Type II subquery **executes once for EACH row in the outer query**. 
This is why it is called _correlated_: each row of the outer query supplies values for the execution of the inner query.
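The nested-loops picture can be sketched in plain Python. This is only a conceptual model, not how a database engine is required to run the query; the lists hold ids from the survey tables, and `inner_query` and `inner_runs` are hypothetical names introduced for illustration.

```python
# Ids from the Person table and a sample of Survey.person values.
person_ids = ["dyer", "pb", "lake", "roe", "danforth"]
survey_person = ["dyer", "dyer", "pb", "lake", None, "roe"]

def inner_query(outer_id):
    """Stand-in for: SELECT 'x' FROM Survey s WHERE s.person = <outer id>."""
    # A NULL (None) person never equals any id, as in SQL.
    return ["x" for sp in survey_person if sp is not None and sp == outer_id]

inner_runs = 0
for pid in person_ids:            # outer loop: one pass per outer row
    matches = inner_query(pid)    # inner loop: re-run with this row's value
    inner_runs += 1
    print(pid, len(matches))
```

The inner function is called once per outer row, each time with a different value, which is the correlation.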

Type II subqueries are used for **difference** problems: 
 * What data in the outer query does NOT exist in the subquery?
 * The conceptual opposite of a JOIN, where a join links a row in Table A to one or more rows in Table B.

## Example: Survey Database (again)

The survey database comes from the open-source Software Carpentry lesson: http://swcarpentry.github.io/sql-novice-survey/


<div class="row">
<div class="col-md-6">

<p><strong>Person</strong>: people who took readings.</p>

<table>
<thead>
<tr>
<th>id</th>
<th>personal</th>
<th>family</th>
</tr>
</thead>
<tbody>
<tr>
<td>dyer</td>
<td>William</td>
<td>Dyer</td>
</tr>
<tr>
<td>pb</td>
<td>Frank</td>
<td>Pabodie</td>
</tr>
<tr>
<td>lake</td>
<td>Anderson</td>
<td>Lake</td>
</tr>
<tr>
<td>roe</td>
<td>Valentina</td>
<td>Roerich</td>
</tr>
<tr>
<td>danforth</td>
<td>Frank</td>
<td>Danforth</td>
</tr>
</tbody>
</table>

<p><strong>Site</strong>: locations where readings were taken.</p>

<table>
<thead>
<tr>
<th>name</th>
<th>lat</th>
<th>long</th>
</tr>
</thead>
<tbody>
<tr>
<td>DR-1</td>
<td>-49.85</td>
<td>-128.57</td>
</tr>
<tr>
<td>DR-3</td>
<td>-47.15</td>
<td>-126.72</td>
</tr>
<tr>
<td>MSK-4</td>
<td>-48.87</td>
<td>-123.4</td>
</tr>
</tbody>
</table>

<p><strong>Visited</strong>: when readings were taken at specific sites.</p>

<table>
<thead>
<tr>
<th>id</th>
<th>site</th>
<th>dated</th>
</tr>
</thead>
<tbody>
<tr>
<td>619</td>
<td>DR-1</td>
<td>1927-02-08</td>
</tr>
<tr>
<td>622</td>
<td>DR-1</td>
<td>1927-02-10</td>
</tr>
<tr>
<td>734</td>
<td>DR-3</td>
<td>1930-01-07</td>
</tr>
<tr>
<td>735</td>
<td>DR-3</td>
<td>1930-01-12</td>
</tr>
<tr>
<td>751</td>
<td>DR-3</td>
<td>1930-02-26</td>
</tr>
<tr>
<td>752</td>
<td>DR-3</td>
<td>-null-</td>
</tr>
<tr>
<td>837</td>
<td>MSK-4</td>
<td>1932-01-14</td>
</tr>
<tr>
<td>844</td>
<td>DR-1</td>
<td>1932-03-22</td>
</tr>
</tbody>
</table>

</div>
<div class="col-md-6">

<p><strong>Survey</strong>: the actual readings.</p>

<table>
<thead>
<tr>
<th>taken</th>
<th>person</th>
<th>quant</th>
<th>reading</th>
</tr>
</thead>
<tbody>
<tr>
<td>619</td>
<td>dyer</td>
<td>rad</td>
<td>9.82</td>
</tr>
<tr>
<td>619</td>
<td>dyer</td>
<td>sal</td>
<td>0.13</td>
</tr>
<tr>
<td>622</td>
<td>dyer</td>
<td>rad</td>
<td>7.8</td>
</tr>
<tr>
<td>622</td>
<td>dyer</td>
<td>sal</td>
<td>0.09</td>
</tr>
<tr>
<td>734</td>
<td>pb</td>
<td>rad</td>
<td>8.41</td>
</tr>
<tr>
<td>734</td>
<td>lake</td>
<td>sal</td>
<td>0.05</td>
</tr>
<tr>
<td>734</td>
<td>pb</td>
<td>temp</td>
<td>-21.5</td>
</tr>
<tr>
<td>735</td>
<td>pb</td>
<td>rad</td>
<td>7.22</td>
</tr>
<tr>
<td>735</td>
<td>-null-</td>
<td>sal</td>
<td>0.06</td>
</tr>
<tr>
<td>735</td>
<td>-null-</td>
<td>temp</td>
<td>-26.0</td>
</tr>
<tr>
<td>751</td>
<td>pb</td>
<td>rad</td>
<td>4.35</td>
</tr>
<tr>
<td>751</td>
<td>pb</td>
<td>temp</td>
<td>-18.5</td>
</tr>
<tr>
<td>751</td>
<td>lake</td>
<td>sal</td>
<td>0.1</td>
</tr>
<tr>
<td>752</td>
<td>lake</td>
<td>rad</td>
<td>2.19</td>
</tr>
<tr>
<td>752</td>
<td>lake</td>
<td>sal</td>
<td>0.09</td>
</tr>
<tr>
<td>752</td>
<td>lake</td>
<td>temp</td>
<td>-16.0</td>
</tr>
<tr>
<td>752</td>
<td>roe</td>
<td>sal</td>
<td>41.6</td>
</tr>
<tr>
<td>837</td>
<td>lake</td>
<td>rad</td>
<td>1.46</td>
</tr>
<tr>
<td>837</td>
<td>lake</td>
<td>sal</td>
<td>0.21</td>
</tr>
<tr>
<td>837</td>
<td>roe</td>
<td>sal</td>
<td>22.5</td>
</tr>
<tr>
<td>844</td>
<td>roe</td>
<td>rad</td>
<td>11.25</td>
</tr>
</tbody>
</table>

</div>
</div>


In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_ro

'Connected: dsa_ro_user@dsa_ro'

## Use-Case

Imagine we want to ask the database which `Person`s have **NO** entries in the `Survey` table.

First, let's look at the opposite question: which `Person`s have entries in the `Survey` table?
This is a basic `JOIN`.

Let's look at two ways to write this query.

In [2]:
%%sql
SELECT DISTINCT p.personal, p.family
FROM Person p 
JOIN Survey s ON (p.id=s.person)

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
4 rows affected.


personal,family
William,Dyer
Valentina,Roerich
Anderson,Lake
Frank,Pabodie


In [3]:
%%sql
SELECT DISTINCT p.personal, p.family
FROM Person p, Survey s 
WHERE p.id=s.person

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
4 rows affected.


personal,family
William,Dyer
Valentina,Roerich
Anderson,Lake
Frank,Pabodie


Now let's return to our original question: which `Person`s have **NO** entries in the `Survey` table?  
This is essentially `Person - (Person JOIN Survey)`.

We expect the answer to be: Frank Danforth
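The difference can be checked by hand with Python sets before writing any SQL. This is a rough sketch using the ids from the tables above; the variable names are illustrative only.

```python
# Ids in Person, and the distinct values of Survey.person (None models NULL).
person_ids = {"dyer", "pb", "lake", "roe", "danforth"}
survey_persons = {"dyer", "pb", "lake", "roe", None}

# Person JOIN Survey keeps only ids that appear on both sides.
joined = person_ids & survey_persons

# Person - (Person JOIN Survey)
missing = person_ids - joined
print(missing)  # {'danforth'}
```

The id `danforth` belongs to Frank Danforth, matching the expected answer.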


In [4]:
%%sql
SELECT DISTINCT p.personal, p.family
FROM Person p 
WHERE NOT EXISTS (
    SELECT 'x' FROM Survey s WHERE p.id=s.person
)


 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


personal,family
Frank,Danforth


#### Correlated Subquery Execution
Reading the query as written: for each row of the outer query (over `Person`), `Person.id` is checked against the `Survey.person` data.
If a matching row is found, the subquery returns `'x'`, and **`NOT EXISTS`** evaluates to false, so that person is filtered out.
If no row is found, **`NOT EXISTS`** evaluates to true and the person is kept.
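To make the per-row truth value visible, the sketch below moves `NOT EXISTS` into the select list. It assumes an in-memory SQLite database holding the five `Person` rows and a sample of the `Survey` rows, not the course PostgreSQL server; SQLite reports the boolean as `1` or `0`, and the column alias `kept` is only illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in with a sample of the survey data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id TEXT, personal TEXT, family TEXT);
    CREATE TABLE Survey (taken INTEGER, person TEXT, quant TEXT, reading REAL);
    INSERT INTO Person VALUES
        ('dyer', 'William', 'Dyer'), ('pb', 'Frank', 'Pabodie'),
        ('lake', 'Anderson', 'Lake'), ('roe', 'Valentina', 'Roerich'),
        ('danforth', 'Frank', 'Danforth');
    INSERT INTO Survey VALUES
        (619, 'dyer', 'rad', 9.82), (734, 'pb', 'rad', 8.41),
        (734, 'lake', 'sal', 0.05), (735, NULL, 'sal', 0.06),
        (752, 'roe', 'sal', 41.6);
""")

# One output row per Person row: 1 means NOT EXISTS is true (row is kept).
per_row = conn.execute("""
    SELECT p.id,
           NOT EXISTS (SELECT 'x' FROM Survey s WHERE p.id = s.person) AS kept
    FROM Person p
    ORDER BY p.id
""").fetchall()
for row in per_row:
    print(row)
conn.close()
```

Only `danforth` gets `1`, so only that row would pass the `WHERE NOT EXISTS (...)` filter.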

In [5]:
%%sql
EXPLAIN 
SELECT DISTINCT p.personal, p.family
FROM Person p 
WHERE NOT EXISTS (
    SELECT 'x' FROM Survey s WHERE p.id=s.person
)

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
7 rows affected.


QUERY PLAN
HashAggregate (cost=58.58..60.58 rows=200 width=64)
"Group Key: p.personal, p.family"
-> Hash Anti Join (cost=28.23..56.95 rows=325 width=64)
Hash Cond: (p.id = s.person)
-> Seq Scan on person p (cost=0.00..16.50 rows=650 width=96)
-> Hash (cost=18.10..18.10 rows=810 width=32)
-> Seq Scan on survey s (cost=0.00..18.10 rows=810 width=32)


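We can see in this plan that the `Person` table is scanned, and each row is then probed against a hash table built from `Survey.person` (the `Hash Anti Join`). The result is exactly what the **correlated** subquery describes, but the planner has rewritten the per-row subquery as an anti join, so `Survey` itself is scanned only once.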
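An anti join can also be spelled out by hand as a `LEFT JOIN` that keeps only the unmatched rows. The sketch below compares that form with `NOT EXISTS` on an in-memory SQLite database (an assumption, with a sample of the survey data); it shows that the two return the same rows here, and says nothing about which plan PostgreSQL would choose for each.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in with a sample of the survey data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id TEXT, personal TEXT, family TEXT);
    CREATE TABLE Survey (taken INTEGER, person TEXT, quant TEXT, reading REAL);
    INSERT INTO Person VALUES
        ('dyer', 'William', 'Dyer'), ('pb', 'Frank', 'Pabodie'),
        ('lake', 'Anderson', 'Lake'), ('roe', 'Valentina', 'Roerich'),
        ('danforth', 'Frank', 'Danforth');
    INSERT INTO Survey VALUES
        (619, 'dyer', 'rad', 9.82), (734, 'pb', 'rad', 8.41),
        (734, 'lake', 'sal', 0.05), (735, NULL, 'sal', 0.06),
        (752, 'roe', 'sal', 41.6);
""")

# Correlated form, as in the notebook cell above.
not_exists = conn.execute("""
    SELECT DISTINCT p.personal, p.family
    FROM Person p
    WHERE NOT EXISTS (SELECT 'x' FROM Survey s WHERE p.id = s.person)
""").fetchall()

# Hand-written anti join: unmatched Person rows have NULLs in every s column.
anti_join = conn.execute("""
    SELECT DISTINCT p.personal, p.family
    FROM Person p
    LEFT JOIN Survey s ON p.id = s.person
    WHERE s.person IS NULL
""").fetchall()

print(not_exists, anti_join)
conn.close()
```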

# Save your Notebook, then `File > Close and Halt`