<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/061__Subqueries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 1/5: SQL FUNDAMENTALS

# MISSION 4: Subqueries

*Learn how to write complex, nested queries using subqueries.*

In this mission, we explore how subqueries enable us to write more complex and dynamic queries.

## 1. Writing More Complex Queries

In [1]:
# Import functions from Google modules into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Insert file id from Google Drive shareable link:
# https://drive.google.com/file/d/1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds/view?usp=sharing
id = '1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds'

In [3]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('jobs.db')

In [4]:
# Import SQLite3 and pandas library
import sqlite3
import pandas as pd

In [5]:
%%capture
%load_ext sql
%sql sqlite:///jobs.db

In [6]:
%%sql

  SELECT * 
  FROM recent_grads
  LIMIT 5

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972


What we have learned so far enables us to answer questions with only one source of uncertainty. Often, though, we want to answer questions that have two or more levels of unknowns. For example:

- Which rows are above the average for the `ShareWomen` column?

To determine which majors are above average for the `ShareWomen` column, we need to:

1. Determine the *average* value for the `ShareWomen` column.
2. Select and filter the rows that are greater than the average value.

Blindly trying the query below, in the hope that it works, fails.



In [8]:
%%sql

SELECT *
  FROM recent_grads
 WHERE ShareWomen > AVG(ShareWomen);

 * sqlite:///jobs.db
(sqlite3.OperationalError) misuse of aggregate function AVG()
[SQL: SELECT *
  FROM recent_grads
 WHERE ShareWomen > AVG(ShareWomen);]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


We get an error because an aggregate function such as `AVG()` cannot be used directly in a `WHERE` clause. We can, however, first find `AVG(ShareWomen)`, copy the value manually, and substitute it into the query above.



In [9]:
%%sql

SELECT AVG(ShareWomen)
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


AVG(ShareWomen)
0.5225502029537575


Let's now answer the initial question.

**Instructions:**

Write a query with the following features:
- Displays, in order, the columns `Major` and `ShareWomen`.
- Displays all rows whose share of women is greater than average.
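
The two-step approach described above can be sketched in Python with the standard `sqlite3` module. This is only an illustration: the in-memory table and its four rows (`MAJOR A` to `MAJOR D`) are invented stand-ins, not the real `jobs.db` data.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: four invented rows, not the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.4), ("MAJOR C", 0.6), ("MAJOR D", 0.8)],
)

# Step 1: compute the average on its own.
(avg_share,) = conn.execute(
    "SELECT AVG(ShareWomen) FROM recent_grads"
).fetchone()

# Step 2: feed that value into a second query as a bound parameter.
above_avg = conn.execute(
    "SELECT Major, ShareWomen FROM recent_grads "
    "WHERE ShareWomen > ? ORDER BY Major",
    (avg_share,),
).fetchall()

print(avg_share)
print(above_avg)
```

Note that the average has to travel out of the database and back in again, which is exactly the manual step a single query would avoid.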

## 2. Subqueries

Our technique in the last screen lacks flexibility and requires an extra manual step. Fortunately, SQL provides ways to answer questions like the one above with a single query.

How do we make the computed average value, `0.5225502029537575`, dynamic?

Let's introduce the SQL way to solve this problem — subqueries. A subquery is a query nested within another query. Here's a template for a SQL statement where the subquery resides in the `WHERE` clause:
```
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (subquery that returns the average value for ShareWomen)
```

The subquery is run first and returns the average value for the `ShareWomen` column (which happens to be `0.5225502029537575`). Based on the result of the subquery, SQL will replace the subquery with this value dynamically.

SQL will ignore the column name (`AVG(ShareWomen)`); it's smart enough to just use the actual row value. Here's a diagram that makes the flow clearer:
![alt text](https://dq-content.s3.amazonaws.com/255/subquery_flow.svg)

The query that replaces the placeholder `subquery` needs to be a full query (like the ones we've written thus far).

In addition, for this particular example, the inner query should return a table with only a single row and a single column because of where it fits in the outer query (`... WHERE ShareWomen > ?`). If you instead try to return a table with multiple columns, an error will be returned.
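
As a minimal sketch of the template above, the following Python snippet runs a subquery in the `WHERE` clause against a small in-memory table. The table contents are invented for illustration and do not come from `jobs.db`.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: four invented rows, not the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.4), ("MAJOR C", 0.6), ("MAJOR D", 0.8)],
)

# The inner query yields one row and one column (the average),
# which the outer query then compares against each row's ShareWomen.
query = """
SELECT Major, ShareWomen
  FROM recent_grads
 WHERE ShareWomen > (SELECT AVG(ShareWomen)
                       FROM recent_grads
                    )
 ORDER BY Major;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

If rows are added to or removed from the table, the same query keeps working because the average is recomputed each time it runs.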

Lastly, a subquery must always be contained within parentheses `()`. We get an error if we do not include them in the query above:
```
(sqlite3.OperationalError) near "SELECT": syntax error
[SQL: SELECT *   FROM recent_grads  WHERE ShareWomen > SELECT AVG(Sharewomen) FROM recent_grads;]
(Background on this error at: http://sqlalche.me/e/e3q8)
```
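
The missing-parentheses error can be reproduced outside the notebook with a short sketch like the one below. It assumes an empty in-memory table with the same column names; the statement fails at parse time, so no data is needed.

```python
import sqlite3

# Empty hypothetical table; the query fails before any rows are read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")

# Same query as above, but the subquery is NOT wrapped in parentheses.
bad_query = """
SELECT *
  FROM recent_grads
 WHERE ShareWomen > SELECT AVG(ShareWomen) FROM recent_grads
"""

try:
    conn.execute(bad_query)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)
```
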
Let's practice this technique.


**Instructions:**

Write a query with the following features:
- Displays, in order, the columns `Major` and `Unemployment_rate`.
- Displays all rows whose unemployment rate is below average.

## 3. Subquery in SELECT

In the last screen, we wrote SQL statements that used a subquery to express dynamic filter criteria in the `WHERE` clause. Specifically, we were interested in rows that were above or below the average value in a specific column.

What if we wanted to understand the *proportion* of majors that are above the average for a given column? We'd need to divide the number of rows that meet the filter criteria by the *total* number of rows in the table.

Let's focus on the query from the last screen:

```
SELECT Major, ShareWomen
  FROM recent_grads
 WHERE ShareWomen > (SELECT AVG(ShareWomen)
                       FROM recent_grads
                    );
```

Using the `COUNT` aggregate function, we can return the number of rows the result set contains:

In [10]:
%%sql

SELECT COUNT(*)
  FROM recent_grads
 WHERE ShareWomen > (SELECT AVG(ShareWomen)
                     FROM recent_grads
                    );

 * sqlite:///jobs.db
Done.


COUNT(*)
91


To return the proportion, we need to divide this value by the total number of rows in `recent_grads`. We can count the number of rows in the table, save the result and then divide by it. The challenge, however, is to do this dynamically!

To calculate the number of total rows in `recent_grads` and be able to use it in another SQL statement, we can use a subquery in the `SELECT` clause:



In [11]:
%%sql

SELECT COUNT(*),
       (SELECT COUNT(*)
          FROM recent_grads
       )
  FROM recent_grads
 WHERE ShareWomen > (SELECT AVG(ShareWomen)
                       FROM recent_grads
                    );

 * sqlite:///jobs.db
Done.


COUNT(*),(SELECT COUNT(*)  FROM recent_grads  )
91,173


**Instructions:**

Write a SQL statement that computes the proportion (as a float value) of rows that contain above-average values for the `ShareWomen` column.

The results should only return the proportion, aliased as `proportion_abv_avg`, like this:

|proportion_abv_avg|
|---|
|proportion goes here|
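
One possible way to combine the two counts is sketched below in Python, again on an invented five-row table rather than the real data. The `CAST` is there because SQLite performs integer division when both operands are integers, which would truncate the proportion to `0`.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: five invented rows, not the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.1), ("MAJOR B", 0.2), ("MAJOR C", 0.3),
     ("MAJOR D", 0.7), ("MAJOR E", 0.9)],
)

# Filtered count divided by the total count from a subquery in SELECT.
# CAST(... AS FLOAT) forces floating-point division.
float_query = """
SELECT CAST(COUNT(*) AS FLOAT) / (SELECT COUNT(*)
                                    FROM recent_grads
                                 ) AS proportion_abv_avg
  FROM recent_grads
 WHERE ShareWomen > (SELECT AVG(ShareWomen)
                       FROM recent_grads
                    );
"""
(proportion,) = conn.execute(float_query).fetchone()

# Same computation without the cast: integer division truncates the result.
int_query = float_query.replace("CAST(COUNT(*) AS FLOAT)", "COUNT(*)")
(truncated,) = conn.execute(int_query).fetchone()

print(proportion)
print(truncated)
```

With these invented rows, two of the five values lie above the average, so the cast version gives two fifths while the uncast version collapses to zero.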

## 4. The IN Operator

We've been using operators like `<` and `>`. [In the SQLite documentation](https://sqlite.org/lang_expr.html), we see all of the following operators:
![alt text](https://s3.amazonaws.com/dq-content/255/sqlite_operators.png)

Using the `IN` operator, we can specify a list of values that we want to match (by equality) against in the `WHERE` clause. The following query returns the rows where `Major_category` equals either `Business` or `Engineering`:



In [12]:
%%sql

SELECT Major, Major_category
  FROM recent_grads
 WHERE Major_category IN ('Business', 'Engineering')
 ;

 * sqlite:///jobs.db
Done.


Major,Major_category
PETROLEUM ENGINEERING,Engineering
MINING AND MINERAL ENGINEERING,Engineering
METALLURGICAL ENGINEERING,Engineering
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
CHEMICAL ENGINEERING,Engineering
NUCLEAR ENGINEERING,Engineering
ACTUARIAL SCIENCE,Business
MECHANICAL ENGINEERING,Engineering
ELECTRICAL ENGINEERING,Engineering
COMPUTER ENGINEERING,Engineering


**Instructions:**

Let's practice using `IN`. Write a query that displays:

- In order, `Major_category` and `Major`
- All rows in any of the following categories:
  - `Business`
  - `Humanities & Liberal Arts`
  - `Education`
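
To illustrate what "match by equality" means, the sketch below runs an `IN` filter and the equivalent chain of `OR` comparisons on an invented table and checks that both return the same rows. The majors and their categories here are made up for the example.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: invented majors, not the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("MAJOR A", "Business"),
        ("MAJOR B", "Engineering"),
        ("MAJOR C", "Education"),
        ("MAJOR D", "Arts"),
        ("MAJOR E", "Business"),
    ],
)

# Filter with IN and a literal list of values.
in_rows = conn.execute("""
    SELECT Major_category, Major
      FROM recent_grads
     WHERE Major_category IN ('Business', 'Education')
     ORDER BY Major_category, Major
""").fetchall()

# The same filter spelled out as equality tests joined by OR.
or_rows = conn.execute("""
    SELECT Major_category, Major
      FROM recent_grads
     WHERE Major_category = 'Business'
        OR Major_category = 'Education'
     ORDER BY Major_category, Major
""").fetchall()

print(in_rows)
print(in_rows == or_rows)
```
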

## 5. Returning Multiple Results in Subqueries

In the last exercise we displayed the majors in the categories `Business`, `Humanities & Liberal Arts`, and `Education`.

These happen to be the top three categories with respect to the total number of graduates, as evidenced by the query below.

In [13]:
%%sql

SELECT Major_category, SUM(TOTAL)
  FROM recent_grads
 GROUP BY Major_category
 ORDER BY SUM(TOTAL) DESC;

 * sqlite:///jobs.db
Done.


Major_category,SUM(TOTAL)
Business,1302376
Humanities & Liberal Arts,713468
Education,559129
Engineering,537583
Social Science,529966
Psychology & Social Work,481007
Health,463230
Biology & Life Science,453862
Communications & Journalism,392601
Arts,357130


Again, what if we wanted to find the same list of majors as in the previous screen dynamically (i.e. without first determining the most popular categories and then hard-coding the values into the query)?

Subqueries can also help us with this. Instead of returning just one value, we can make the query return a list of values (disguised as a single column).

In the following exercise you will display the same results as in the previous screen, only this time without manually indicating what the categories are.

You can use the following boilerplate code in this screen's exercise.

```
SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SUBQUERY_GOES_HERE);
```

**Instructions:**

In the code displayed above, replace `SUBQUERY_GOES_HERE` with a subquery so that the query displays the `Major_category` and `Major` columns for the rows where `Major_category` is one of the three categories with the highest group-level sums of the `Total` column.
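
A subquery that feeds a list of values to `IN` could look like the sketch below. It is run on an invented table with made-up totals, so the resulting top three categories are those of the toy data, not of `jobs.db`.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: invented majors and totals.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Business", 500),
        ("MAJOR B", "Business", 300),
        ("MAJOR C", "Engineering", 400),
        ("MAJOR D", "Education", 350),
        ("MAJOR E", "Education", 250),
        ("MAJOR F", "Arts", 100),
        ("MAJOR G", "Health", 450),
    ],
)

# The inner query returns a single column holding the three categories
# with the largest summed Total; the outer query filters on that list.
query = """
SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           GROUP BY Major_category
                           ORDER BY SUM(Total) DESC
                           LIMIT 3
                         )
 ORDER BY Major_category, Major;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In the toy data the category sums are Business 800, Education 600, Health 450, Engineering 400 and Arts 100, so only majors from the first three categories are returned.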



## 6. Building Complex Subqueries

In the last few screens, we nested subqueries in the `WHERE` and the `SELECT` clauses that were evaluated before the outer query was.

We can actually nest subqueries within subqueries many times, but this makes our SQL code more complex and harder to debug. In the next course, we'll explore other techniques of composing SQL statements that make nested logic easier.

When you have a SQL statement you want to write that will end up using many subqueries, it can be overwhelming at first to know how to start.

In general, you want to start with the inner queries first and work your way outwards. Let's say we're interested in understanding the ratio of the `Sample_size` column to the `Total` column. You can read the [dataset documentation](https://github.com/fivethirtyeight/data/tree/master/college-majors) if you need a reminder for what these columns represent.

Specifically, let's say we're interested in:

- Computing this ratio for every major.
- Understanding which majors are above the average for this ratio.
- Understanding how many majors are above the average for this ratio.

We'll start by writing a query that computes the ratio for every major and then the average of all of these ratios.


**Instructions:**

Write a query that returns the average ratio between `Sample_size` and `Total` for all of the majors.

Here's the format your query should return:

|avg_ratio|
|---|
|average ratio value goes here|
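
The inner query for this step might be sketched as follows, assuming that `Sample_size` and `Total` are stored as integers (as the sample output earlier in this mission suggests). The table below is an invented four-row stand-in, and the cast is needed because SQLite truncates integer division.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: invented sample sizes and totals.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Sample_size INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("MAJOR A", 10, 100), ("MAJOR B", 30, 100),
     ("MAJOR C", 50, 200), ("MAJOR D", 5, 100)],
)

# Compute the per-row ratio as a float, then average the ratios.
(avg_ratio,) = conn.execute("""
    SELECT AVG(CAST(Sample_size AS FLOAT) / Total) AS avg_ratio
      FROM recent_grads
""").fetchone()

# Without the cast every per-row ratio truncates to 0 before averaging.
(avg_ratio_int,) = conn.execute("""
    SELECT AVG(Sample_size / Total) AS avg_ratio
      FROM recent_grads
""").fetchone()

print(avg_ratio)
print(avg_ratio_int)
```
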



## 7. Practice Integrating A Subquery With The Outer Query

Now that we have a subquery that calculates the average ratio (of `Sample_size` to `Total`), we can return the rows that exceed this average.

**Instructions:**

Write a query with the following features:
- Selects the `Major`, `Major_category`, and the computed `ratio` columns
- Filters to just the rows where `ratio` is greater than `avg_ratio`
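
Working from the inside out, the average-ratio query can be dropped into the `WHERE` clause of an outer query. The sketch below does this on an invented table; it repeats the ratio expression in `WHERE` instead of referring to the `ratio` alias, which is a portability choice made here rather than a requirement stated in this mission.

```python
import sqlite3

# Hypothetical stand-in for recent_grads: invented rows, not the real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, Sample_size INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("MAJOR A", "Arts", 10, 100),
        ("MAJOR B", "Engineering", 30, 100),
        ("MAJOR C", "Business", 50, 200),
        ("MAJOR D", "Health", 5, 100),
    ],
)

# Outer query: per-row ratio. Inner query: average of those ratios.
query = """
SELECT Major,
       Major_category,
       CAST(Sample_size AS FLOAT) / Total AS ratio
  FROM recent_grads
 WHERE CAST(Sample_size AS FLOAT) / Total >
       (SELECT AVG(CAST(Sample_size AS FLOAT) / Total) AS avg_ratio
          FROM recent_grads
       )
 ORDER BY Major;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The toy ratios are 0.1, 0.3, 0.25 and 0.05 with an average of 0.175, so only `MAJOR B` and `MAJOR C` pass the filter. Wrapping the same statement in `SELECT COUNT(*)` would answer how many majors are above the average.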


## 8. Next Steps

In this mission, we explored how subqueries enable us to write more complex and dynamic queries.

In this course, we learned:

- What a database and a table in a database are.
- What a query is and what some of its components are.
- How to query a database using SQL.
- How to compute summary statistics by group and for the whole table.
- About a mental model to think about the order of execution in a query.



---

