<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/060__Group_Summary_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 1/5: SQL FUNDAMENTALS

# MISSION 3: Group Summary Statistics

Learn how to compute statistics across groups.

## 1. Introduction

In the last mission, we computed summary statistics across columns with SQL. In many cases, though, we want to drill down even more and compute summary statistics per group to answer questions like:

- What is the total number of graduates by major category?
- Which major category pays best?
- How does the sample size vary by major category?

In this mission, we'll answer questions like these by learning how to calculate summary statistics by groups.

We'll continue working with the `recent_grads` table of `jobs.db`. Recall that each row represents a single college major, and contains information about post-graduation employment of students who studied the major. Here are some descriptions for just a few of the 21 total columns:

- `Rank` - The major's rank by median earnings
- `Major` - The name of the major
- `Major_category` - The broader category the major belongs to
- `Total` - The total number of people who graduated with the major
- `Sample_size` - The size of the sample
- `Men` - The number of male graduates
- `Women` - The number of female graduates
- `ShareWomen` - Women as a proportion of the total number of graduates (a number ranging from `0` to `1`)
- `Employed` - The number of employed graduates

Let's load the database file and display the first few rows and columns in the dataset:



In [1]:
# Import functions from Google modules into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Insert file id from Google Drive shareable link:
# https://drive.google.com/file/d/1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds/view?usp=sharing
id = '1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds'

In [3]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('jobs.db')

In [4]:
# Import SQLite3 and pandas library
import sqlite3
import pandas as pd

In [5]:
%%capture
%load_ext sql
%sql sqlite:///jobs.db

In [6]:
%%sql

  SELECT * 
  FROM recent_grads
  LIMIT 5

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972


Before we dive into answering questions like the ones above, we'll learn how to code [if/then logic](https://en.wikipedia.org/wiki/Conditional_(computer_programming)) in SQL.

Let's get started!

## 2. If/Then in SQL

It is very common to want to create new columns according to rules based on other columns. For example, to quickly identify the top 10 majors (as per the `Rank` column), we may want to create a column that indicates whether the value of `Rank` in each row is greater than 10 or not.

![alt text](https://dq-content.s3.amazonaws.com/254/2.1-rank.gif)



We can do such things with the `CASE` clause in SQL. We'll program the following logic:

- If the row corresponds to a top 10 major, assign the value `Top 10`.
- If the row corresponds to a top 20 major that isn't a top 10 major, assign the value `Top 20`.
- Otherwise let it be a missing value.

We can do this with the following syntax:
```
SELECT CASE
       WHEN rank <= 10 THEN 'Top 10'
       WHEN rank <= 20 THEN 'Top 20'
       ELSE NULL
       END AS rank_category
  FROM recent_grads;
```

A more general syntax looks like this:
```
CASE
WHEN <condition_1> THEN <value_1>
WHEN <condition_2> THEN <value_2>
ELSE <value_3>
END AS <new_column_name>
```

Let's look at what the results look like. We'll look at a few select rows of the result of the following query:



In [7]:
%%sql

SELECT Major, Rank,
       CASE
       WHEN rank <= 10 THEN 'Top 10'
       WHEN rank <= 20 THEN 'Top 20'
       ELSE NULL
       END AS rank_category
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


Major,Rank,rank_category
PETROLEUM ENGINEERING,1,Top 10
MINING AND MINERAL ENGINEERING,2,Top 10
METALLURGICAL ENGINEERING,3,Top 10
NAVAL ARCHITECTURE AND MARINE ENGINEERING,4,Top 10
CHEMICAL ENGINEERING,5,Top 10
NUCLEAR ENGINEERING,6,Top 10
ACTUARIAL SCIENCE,7,Top 10
ASTRONOMY AND ASTROPHYSICS,8,Top 10
MECHANICAL ENGINEERING,9,Top 10
ELECTRICAL ENGINEERING,10,Top 10


We can see that our logic was coded correctly. We'll dissect this clause in the next screen.

In this exercise we'll categorize the `Sample_size` column into three categories: `Small`, `Medium`, and `Large`.

The size of a sample is extremely important for statistical analysis; you can learn more about this in the statistics courses. Now let's practice!





**Instructions:**

Write a SQL query that displays, with the alias `Sample_category`, the column whose values are given by the following rules:

- `Small` if `Sample_size` is smaller than 200
- `Medium` if `Sample_size` is equal to or higher than 200, and smaller than 1000
- `Large` if `Sample_size` is equal to or higher than 1000


## 3. Dissecting CASE

Let's recall the syntax we saw in the previous screen.

```
CASE
WHEN <condition_1> THEN <value_1>
WHEN <condition_2> THEN <value_2>
ELSE <value_3>
END AS <new_column_name>
```

Note the following:

- It starts with `CASE` to indicate that we'll be creating values by cases.
- It ends with `END` to indicate where the `CASE` clause terminates.
- Each explicit case is signaled by the reserved word `WHEN`.
- The value for each case is given after the reserved word `THEN`.
- There is a fallback value given by the reserved word `ELSE`.


Here are some very important observations:

- Whatever you can use in `WHERE` for filtering, you can also use in place of the conditions above.
- There can be one or more `WHEN` lines. Our example used two, but it works just as well with one or, for instance, ten.
- The `ELSE` line is optional — without it, rows that don't match any `WHEN` will be set with a missing value (`NULL`).
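
As a minimal sketch of the last two observations, the snippet below uses Python's built-in `sqlite3` module with a small in-memory table. The table name `demo` and its three rows are made up for illustration and are not part of `jobs.db`.

```python
import sqlite3

# Hypothetical in-memory table; not part of jobs.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Major TEXT, Rank INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("A", 5), ("B", 15), ("C", 25)],
)

# There is no ELSE line, so a row that matches no WHEN gets NULL
# (returned as None in Python). WHEN lines are checked top to bottom,
# which is why rank 5 is labelled 'Top 10' although it is also <= 20.
rows = conn.execute(
    """
    SELECT Major,
           CASE
           WHEN Rank <= 10 THEN 'Top 10'
           WHEN Rank <= 20 THEN 'Top 20'
           END AS rank_category
      FROM demo
     ORDER BY Rank
    """
).fetchall()

print(rows)
conn.close()
```

The third row falls through both conditions, so its `rank_category` comes back as `None`.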

Let's reinforce our knowledge with a very similar exercise.



**Instructions:**

Write a SQL query that displays, in order, `Major`, `Sample_size` and a column named `Sample_category`, whose values are defined by the following rules:

- `Small` if `Sample_size` is smaller than 200
- `Medium` if `Sample_size` is equal to or higher than 200, and smaller than 1000
- `Large` if `Sample_size` is equal to or higher than 1000

## 4. Calculating Group-Level Summary Statistics

Let's go back to calculating summary statistics by group. By group we mean any of the unique values in a column. Typically group statistics are calculated only for columns in which the values represent categories (as opposed to measurements of some sort — like age, currency, and so on).

The `GROUP BY` SQL statement allows us to compute summary statistics by group. When we use this statement, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use the `DISTINCT` statement), and then does the calculations for them.
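
To make the link between `GROUP BY` and `DISTINCT` concrete, here is a small sketch using Python's built-in `sqlite3` module. The table `grads` and its five rows are hypothetical stand-ins for `recent_grads`, chosen so the sums are easy to check by hand.

```python
import sqlite3

# Hypothetical in-memory data; not the real recent_grads table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grads (Major TEXT, Major_category TEXT, Employed INTEGER)"
)
conn.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [
        ("Major A", "Engineering", 100),
        ("Major B", "Engineering", 250),
        ("Major C", "Arts", 40),
        ("Major D", "Arts", 60),
        ("Major E", "Health", 500),
    ],
)

# The unique values of the grouping column...
distinct_values = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Major_category FROM grads ORDER BY Major_category"
    )
]

# ...are exactly the groups that GROUP BY produces, one output row each.
grouped = conn.execute(
    """
    SELECT Major_category, COUNT(*), SUM(Employed)
      FROM grads
     GROUP BY Major_category
     ORDER BY Major_category
    """
).fetchall()

print(distinct_values)
print(grouped)
conn.close()
```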

To illustrate, we can find the total number of employed graduates in each major category with the query below. Here's a truncated view of the output:

In [8]:
%%sql

SELECT SUM(Employed) 
  FROM recent_grads 
 GROUP BY Major_category
 LIMIT 3;

 * sqlite:///jobs.db
Done.


SUM(Employed)
66943
288114
302797


The output shows the sum of the `Employed` column for each `Major_category`. Unfortunately, it doesn't indicate which major category each row refers to. We can fix this by including the `Major_category` column in our query:

In [9]:
%%sql

SELECT Major_category, SUM(Employed) 
  FROM recent_grads 
 GROUP BY Major_category;

 * sqlite:///jobs.db
Done.


Major_category,SUM(Employed)
Agriculture & Natural Resources,66943
Arts,288114
Biology & Life Science,302797
Business,1088742
Communications & Journalism,330660
Computers & Mathematics,237894
Education,479839
Engineering,420372
Health,372147
Humanities & Liberal Arts,544118


One of the questions we put forth at the beginning of the mission was: What is the total number of graduates by major category? Let's answer it.

**Instructions:**

Write a SQL query that, for each major category, displays:

- The major category
- The total number of graduates in each major category, with the alias `Total_graduates`

## 5. GROUP BY Visual Breakdown

Let's recall the query from the example in the last screen.

In [10]:
%%sql

SELECT Major_category, SUM(Employed) 
  FROM recent_grads 
 GROUP BY Major_category
 LIMIT 3;

 * sqlite:///jobs.db
Done.


Major_category,SUM(Employed)
Agriculture & Natural Resources,66943
Arts,288114
Biology & Life Science,302797


Here's how the query works. The `GROUP BY` statement splits the `Major_category` column into groups (with one group for each unique major category), then calculates the sum for each group. The following diagram shows how `GROUP BY` splits the data (the diagram uses a small sample from the `recent_grads` table):

![alt text](https://dq-content.s3.amazonaws.com/254/5.1-group.gif)

For each group, the `GROUP BY` statement queries each column, and runs all of the aggregation functions we include in the query after the `SELECT` statement:

![alt text](https://dq-content.s3.amazonaws.com/254/5.2-sum.gif)

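If a plain (non-aggregated) column is selected, SQLite takes its value from one row in the group. In the diagram that is the last row, but which row gets used is not guaranteed, and many other databases reject such a query unless the column also appears in `GROUP BY`. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.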

The query in the diagram will give us the following result:


Employed|	Major_category	|SUM(Employed)
---| ---|---
1290	|Agriculture|	4439
36165	|Arts|	39075
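
The split-then-aggregate idea can also be sketched by hand in plain Python and compared with what SQLite returns. The four rows below are hypothetical and are not the sample used in the diagram.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (Major_category, Employed) rows for illustration only.
rows = [("Agriculture", 10), ("Arts", 7), ("Agriculture", 5), ("Arts", 3)]

# Step 1 (split): collect the Employed values of each category.
groups = defaultdict(list)
for category, employed in rows:
    groups[category].append(employed)

# Step 2 (aggregate): apply SUM to every group.
manual = sorted((category, sum(values)) for category, values in groups.items())

# The same computation expressed with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (Major_category TEXT, Employed INTEGER)")
conn.executemany("INSERT INTO sample VALUES (?, ?)", rows)
from_sql = conn.execute(
    """
    SELECT Major_category, SUM(Employed)
      FROM sample
     GROUP BY Major_category
     ORDER BY Major_category
    """
).fetchall()
conn.close()

print(manual)
print(from_sql)
```

Both approaches produce one row per category with the same totals.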

Now let's try using a different aggregate function.

**Instructions:**

Write a SQL query that, for each major category, displays:
- The major category
- The average share of women with the alias `Average_women`.

## 6. Multiple Summary Statistics by Group

Just like we were able to compute multiple summary statistics for the whole table in the last mission, we can also compute multiple summary statistics by groups.

Continuing from our example, here's a query that for each major category finds:

- The total number of employed graduates
- The average number of employed graduates
- The maximum number of employed graduates in a major
- The minimum number of employed graduates in a major

In [11]:
%%sql

SELECT Major_category,
       SUM(Employed), AVG(Employed), MAX(Employed), MIN(Employed)
  FROM recent_grads
 GROUP BY Major_category
 LIMIT 3;

 * sqlite:///jobs.db
Done.


Major_category,SUM(Employed),AVG(Employed),MAX(Employed),MIN(Employed)
Agriculture & Natural Resources,66943,6694.3,17112,613
Arts,288114,36014.25,83483,2914
Biology & Life Science,302797,21628.35714285714,182295,1010


We can also make calculations with the resulting columns, much like you learned before. Here's a query that computes the difference between the maximum and the minimum number of employed graduates:

In [12]:
%%sql

SELECT Major_category,
       SUM(Employed), AVG(Employed), MAX(Employed) - MIN(Employed) as range_employed
  FROM recent_grads
 GROUP BY Major_category;

 * sqlite:///jobs.db
Done.


Major_category,SUM(Employed),AVG(Employed),range_employed
Agriculture & Natural Resources,66943,6694.3,16499
Arts,288114,36014.25,80569
Biology & Life Science,302797,21628.35714285714,181285
Business,1088742,83749.38461538461,273322
Communications & Journalism,330660,82665.0,134954
Computers & Mathematics,237894,21626.727272727272,101528
Education,479839,29989.9375,148636
Engineering,420372,14495.586206896553,75838
Health,372147,31012.25,173851
Humanities & Liberal Arts,544118,36274.53333333333,146393


Note that we were able to use an alias for the result of the calculation without an issue. Aliases are used in the same way that you learned before.

Now that we have a better understanding of the `GROUP BY` statement, let's practice using it by computing summary statistics by group for the `recent_grads` table.

**Instructions:**

Write a query that, for each major category, displays in order:
- The major category
- The total number of women with the alias `Total_women`
- The average percentage of women with the alias `Mean_women`
- The result of multiplying the total number of graduates by the average percentage of women with the alias `Estimate_women`.

## 7. Multiple Group Columns

So far we've only grouped by one column (`Major_category`), but it's also possible to group by multiple columns.

We'll exemplify with the following table called `fruit`. This table is not in the database that we're using here, so you won't be able to experiment with it.


fruit|	color|	sourness|	weight|	price|
---|---|---|---|---
Orange|	Orange|	Sour|	131|	4
Banana|	Yellow|	Sweet|	150|	4
Papaya|	Orange|	Sweet|	450|	2
Lime|	Green|	Sour|	44|	1
Strawberry|	Red|	Sweet|	12|	3
Grapefruit|	Orange|	Sour|	236|	1
Cherry|	Red|	Sour|	12|	3
Red apple|	Red|	Sweet|	180|	5
Green apple|	Green|	Sour|	170|	2

We can see that both `color` and `sourness` define groups.

We've run queries like the one below, which gives the mean weight and price.

```
SELECT color,
       AVG(weight),
       AVG(price)
  FROM fruit
 GROUP BY color;
```


color	|AVG(weight)|	AVG(price)
---|---|---
Green	|107.0|	1.5
Orange	|272.333333|	2.333333
Red	|68.0|	3.666667
Yellow|	150.0|	4.0

What if we want the mean weight and price for the group of sweet red fruits? Or more generally, the mean weight and price for all the combinations of groups of color and sourness? We can answer such questions with `GROUP BY` as well. Here's an example:

```
SELECT color, sourness,
       AVG(weight), AVG(price), MIN(weight), MIN(price)
  FROM fruit
 GROUP BY color, sourness;
```


color|	sourness	|AVG(weight)	|AVG(price)|	MIN(weight)|	MIN(price)
---|---|---|---|---|---|
Green|	Sour|	107.0|	1.5|	44|	1
Orange|	Sour|	183.5|	2.5|	131	|1
Orange|	Sweet|	450.0|	2.0|	450|	2
Red|	Sour|	12.0|	3.0|	12|	3
Red|	Sweet|	96.0	|4.0|	12|	3
Yellow|	Sweet|	150.0|	4.0|	150|	4


We can easily confirm the minimum values by spot-checking.
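
Although `fruit` isn't in `jobs.db`, it is small enough to rebuild in memory. The sketch below, using Python's built-in `sqlite3` module, loads the nine rows from the table above and runs the two-column `GROUP BY`. The column types and the added `ORDER BY` (which makes the row order predictable) are assumptions of this sketch.

```python
import sqlite3

# Rebuild the fruit table from the text in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fruit (
        fruit TEXT, color TEXT, sourness TEXT, weight INTEGER, price INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?, ?, ?, ?)",
    [
        ("Orange", "Orange", "Sour", 131, 4),
        ("Banana", "Yellow", "Sweet", 150, 4),
        ("Papaya", "Orange", "Sweet", 450, 2),
        ("Lime", "Green", "Sour", 44, 1),
        ("Strawberry", "Red", "Sweet", 12, 3),
        ("Grapefruit", "Orange", "Sour", 236, 1),
        ("Cherry", "Red", "Sour", 12, 3),
        ("Red apple", "Red", "Sweet", 180, 5),
        ("Green apple", "Green", "Sour", 170, 2),
    ],
)

# One output row per (color, sourness) combination present in the data.
by_color_and_sourness = conn.execute(
    """
    SELECT color, sourness,
           AVG(weight), AVG(price), MIN(weight), MIN(price)
      FROM fruit
     GROUP BY color, sourness
     ORDER BY color, sourness
    """
).fetchall()
conn.close()

for row in by_color_and_sourness:
    print(row)
```

Only combinations that actually occur appear in the result: there are no sweet green fruits in the table, so there is no `Green`/`Sweet` row.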

Earlier in the mission we created a column whose values are given by the following rules:

- `Small` if `Sample_size` is smaller than 200
- `Medium` if `Sample_size` is equal to or higher than 200, and smaller than 1000
- `Large` if `Sample_size` is equal to or higher than 1000

*We duplicated the `recent_grads` table and added that column to a new table. It is called `new_grads` and we'll use it for the remainder of this mission.*
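
The mission doesn't show how `new_grads` was built. One plausible way, sketched here as an assumption, is `CREATE TABLE ... AS SELECT` combined with the `CASE` logic from earlier. The example runs against a three-row, in-memory stand-in for `recent_grads` with hypothetical majors; to try it on the real data you would connect to `jobs.db` instead.

```python
import sqlite3

# Tiny hypothetical stand-in for recent_grads (only two of its columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Sample_size INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("Major A", 36), ("Major B", 200), ("Major C", 1000)],
)

# Copy every column and append the derived Sample_category column.
conn.execute(
    """
    CREATE TABLE new_grads AS
    SELECT *,
           CASE
           WHEN Sample_size < 200 THEN 'Small'
           WHEN Sample_size < 1000 THEN 'Medium'
           ELSE 'Large'
           END AS Sample_category
      FROM recent_grads
    """
)

categories = conn.execute(
    "SELECT Major, Sample_category FROM new_grads ORDER BY Sample_size"
).fetchall()
conn.close()

print(categories)
```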


**Instructions:**

Write a query that, for each combination of `Major_category` and `Sample_category`, displays in order:

- The major category
- The sample category
- The average percentage of women with the alias `Mean_women`
- The sum total of graduates with the alias `Total_graduates`



## 8. Querying Virtual Columns With the HAVING Statement

Sometimes we want to select a subset of rows after performing a `GROUP BY` query. On the last screen, for instance, we may have wanted to select only those rows where `Mean_women` is greater than `.8`.

We can't use the `WHERE` clause to do this:
```
SELECT Major_category, Sample_category,
       AVG(ShareWomen) AS Mean_women,
       SUM(Total) AS Total_graduates
  FROM new_grads
 GROUP BY Major_category, Sample_category
 WHERE Mean_women > 0.8;
 ```

 ```

(sqlite3.OperationalError) near "WHERE": syntax error
[SQL: SELECT Major_category, Sample_category,        AVG(ShareWomen) AS Mean_women,        SUM(Total) AS Total_graduates   FROM new_grads  GROUP BY Major_category, Sample_category  WHERE Mean_women > 0.8  ;]
(Background on this error at: http://sqlalche.me/e/e3q8)

```

We got an error! We'll understand why `WHERE` doesn't work here in the next screen. For now, let's learn how to subset the data.

When we want to filter on a column generated by a `GROUP BY` query, we can use the `HAVING` statement. Here's an example:

```
SELECT Major_category, Sample_category,
       AVG(ShareWomen) AS Mean_women,
       SUM(Total) AS Total_graduates
  FROM new_grads
 GROUP BY Major_category, Sample_category
HAVING Mean_women > 0.8;
```

Note that we used the same column name in the `HAVING` statement that we originally specified with the `AS` keyword. The statement above will result in the following output:

Major_category|	Sample_category|	Mean_women|	Total_graduates
---|---|---|---|
Education|	Large	|0.923745479|	170862
Health|	Large|	0.896018988|	209394
Psychology & Social Work|	Medium|	0.81070414753552|	53552

Note that the results only include categories where `Mean_women` is greater than `.8`. That's because the `HAVING` statement filters out the other rows.
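
The contrast between the two clauses can be reproduced on a small scale. The sketch below uses Python's built-in `sqlite3` module and a hypothetical four-row table named `grads` (not `new_grads`), with values chosen so the averages are exact.

```python
import sqlite3

# Hypothetical data; not the real new_grads table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major_category TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [
        ("Education", 1.0),
        ("Education", 0.75),
        ("Engineering", 0.25),
        ("Engineering", 0.25),
    ],
)

# WHERE placed after GROUP BY is rejected before the query even runs.
try:
    conn.execute(
        """
        SELECT Major_category, AVG(ShareWomen) AS Mean_women
          FROM grads
         GROUP BY Major_category
         WHERE Mean_women > 0.8
        """
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# HAVING filters the grouped rows instead.
having_rows = conn.execute(
    """
    SELECT Major_category, AVG(ShareWomen) AS Mean_women
      FROM grads
     GROUP BY Major_category
    HAVING Mean_women > 0.8
    """
).fetchall()
conn.close()

print(where_error)
print(having_rows)
```

SQLite accepts the alias `Mean_women` inside `HAVING`. Some other databases do not, in which case the aggregate is repeated: `HAVING AVG(ShareWomen) > 0.8`.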





**Instructions:**

Remember to use the `new_grads` table.

- Find all of the major categories where the share of graduates with low-wage jobs is greater than `.1`.
  - Use the `SELECT` statement to select `Major_category` and `AVG(Low_wage_jobs)` / `AVG(Total)` as `Share_low_wage`
  - Use the `GROUP BY` statement to group the query by the `Major_category` column.
  - Use the `HAVING` statement to restrict the selection to rows where `Share_low_wage` is greater than `.1`.

## 9. Order of Execution

In the previous mission we learned that when executing a SQL query, the computer runs the clauses in this order:

1. `FROM`
2. `WHERE`
3. `SELECT`
4. `ORDER BY`
5. `LIMIT`

We've now learned a couple more clauses: `GROUP BY` and `HAVING`. We can expand our mental model of the general structure of a query:

```
SELECT column(s)
  FROM some_table
 WHERE some_condition
 GROUP BY column(s)
HAVING some_condition
 ORDER BY column(s)
 LIMIT some_limit;

```

And the order by which SQL runs this is:

1. `FROM`
2. `WHERE`
3. `GROUP BY`
4. `HAVING`
5. `SELECT`
6. `ORDER BY`
7. `LIMIT`

We still haven't looked at using `GROUP BY` and `ORDER BY` simultaneously; we'll see an example later.

Note, however, that `ORDER BY` is executed after `GROUP BY`. One of the main goals of ordering results is having functional output, so it's not unreasonable to think that it should be one of the last clauses to run.
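
To see every clause at work in one query, here is a sketch using Python's built-in `sqlite3` module. The table `grads`, its rows, and the thresholds are hypothetical, chosen so that each step visibly changes the result.

```python
import sqlite3

# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major_category TEXT, Employed INTEGER)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [
        ("Arts", 50), ("Arts", 300),
        ("Business", 400), ("Business", 500),
        ("Health", 150), ("Health", 20),
        ("Education", 120),
    ],
)

result = conn.execute(
    """
    SELECT Major_category, SUM(Employed) AS total_employed
      FROM grads
     WHERE Employed >= 100        -- removes individual rows before grouping
     GROUP BY Major_category      -- one group per remaining category
    HAVING total_employed > 130   -- removes whole groups after aggregation
     ORDER BY total_employed DESC -- sorts the surviving groups
     LIMIT 2                      -- keeps the first two sorted rows
    """
).fetchall()
conn.close()

print(result)
```

`WHERE` drops the rows with 50 and 20 employed, so `Arts` sums to 300 rather than 350. `HAVING` then drops `Education` (120), and `LIMIT` drops `Health` (150) after the sort.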



## 10. Rounding Results With the ROUND() Function

On a previous screen, the percentages in our results were very long and hard to read (e.g., `0.16833085991095678`). We can use the SQL `ROUND` function in our query to round them. Here's an example of what this looks like:
```
SELECT Major_category,
       ROUND(ShareWomen, 2) AS rounded_share_women 
  FROM new_grads;
```

The query will round the `ShareWomen` column to two decimal places. Here's a truncated view of the results:

Major_category|	rounded_share_women
---|---
Engineering|	0.12
Engineering	|0.1

By passing different values into the `ROUND` function, such as `ROUND(ShareWomen, 3)`, we can round to different decimal places.
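
A quick way to experiment with `ROUND` without touching the database file is to call it on a literal value. This sketch uses Python's built-in `sqlite3` module and the `ShareWomen` value of the first row shown at the start of the mission.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ShareWomen of PETROLEUM ENGINEERING, taken from the first output table.
share = 0.120564344

# The second argument is the number of decimal places; when it is
# omitted, SQLite rounds to zero decimal places.
two, three, zero = conn.execute(
    "SELECT ROUND(?, 2), ROUND(?, 3), ROUND(?)",
    (share, share, share),
).fetchone()
conn.close()

print(two, three, zero)
```

Note that `ROUND` returns a floating-point number even when rounding to zero decimal places.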

**Instructions:**

Write a query with the following features:
- Displays, in this order:
  - `ShareWomen` rounded to 4 decimal places with the alias `Rounded_women`
  - `Major_category`
- Displays only the first ten results resulting from the above

## 11. Nesting Functions

On a previous screen, we ran the following query:
```
SELECT Major_category, Sample_category,
       AVG(ShareWomen) AS Mean_women,
       SUM(Total) AS Total_graduates
  FROM new_grads
 GROUP BY Major_category, Sample_category
HAVING Mean_women > 0.8;
```
This query returned very long fractional values for `Mean_women`. We can update our query with the `ROUND` function to round the results to three decimal places:



In [13]:
%%sql

SELECT Major_category, Sample_category,
       ROUND(AVG(ShareWomen), 3) AS Mean_women,
       SUM(Total) AS Total_graduates
  FROM new_grads
 GROUP BY Major_category, Sample_category
HAVING Mean_women > 0.8;

 * sqlite:///jobs.db
(sqlite3.OperationalError) no such table: new_grads
[SQL: SELECT Major_category, Sample_category,
       ROUND(AVG(ShareWomen), 3) AS Mean_women,
       SUM(Total) AS Total_graduates
  FROM new_grads
 GROUP BY Major_category, Sample_category
HAVING Mean_women > 0.8;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
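
The cell above fails because this copy of `jobs.db` has no `new_grads` table. The nesting itself can still be tried on a small in-memory table. In the sketch below, which uses Python's built-in `sqlite3` module, the table `grads` and its values are hypothetical.

```python
import sqlite3

# Hypothetical data; not the real new_grads table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major_category TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [
        ("Education", 0.81), ("Education", 0.92), ("Education", 0.93),
        ("Engineering", 0.1), ("Engineering", 0.2),
    ],
)

# AVG runs first for each group; ROUND is then applied to its result.
nested = conn.execute(
    """
    SELECT Major_category,
           AVG(ShareWomen) AS Raw_mean,
           ROUND(AVG(ShareWomen), 3) AS Mean_women
      FROM grads
     GROUP BY Major_category
    HAVING Mean_women > 0.8
    """
).fetchall()
conn.close()

print(nested)
```

Only `Education` passes the `HAVING` filter, and its rounded mean is much easier to read than the raw one.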


## 12. Casting

## 13. Next Steps