# 1. Introduction

In the "Summary Statistics" lesson, we computed summary statistics across columns using SQL. Often, however, we will want even more information, which we can get by computing summary statistics per group to answer questions like the following:

* What is the total number of graduates by major category?
* What major category pays the most?
* How does the sample size vary by major category?

In this lesson, we'll answer questions like these by learning how to calculate summary statistics for groups.

We'll continue working with the recent_grads table of jobs.db. Recall that each row represents a single college major, and it contains information about post-graduation employment of students who studied the major. Here are some descriptions for just a few of the 21 total columns:

* Rank — the major's rank according to median earnings.
* Major — the name of the major.
* Major_category — the broader category to which the major belongs.
* Total — the total number of people who graduated with the major.
* Sample_size — the size of the sample.
* Men — the number of graduates who are men.
* Women — the number of graduates who are women.
* ShareWomen — women as a proportion of the total number of graduates (a number ranging from 0 to 1).
* Employed — the number of employed graduates.

Here are the first few rows and columns in the dataset:

    Rank	Major	Major_category	Total	Sample_size	Men	Women	ShareWomen	Employed
    1	PETROLEUM ENGINEERING	Engineering	2339	36	2057	282	0.120564	1976
    2	MINING AND MINERAL ENGINEERING	Engineering	756	7	679	77	0.101852	640
    3	METALLURGICAL ENGINEERING	Engineering	856	3	725	131	0.153037	648
    4	NAVAL ARCHITECTURE AND MARINE ENGINEERING	Engineering	1258	16	1123	135	0.107313	758
    5	CHEMICAL ENGINEERING	Engineering	32260	289	21239	11021	0.341631	25694

Let's begin by learning how to code if/then logic in SQL.

# 2. If/Then in SQL

In [1]:
%load_ext sql
%sql sqlite://

%sql sqlite:////home/mohammeds/datasets/jobs.db

In [36]:
%%sql

SELECT CASE
        WHEN Sample_size<200 THEN 'Small'
        WHEN Sample_size<1000 THEN 'Medium'
        WHEN Sample_size>=1000 THEN 'Large'
        END AS Sample_category
    FROM recent_grads
LIMIT 20;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Sample_category
Small
Small
Small
Small
Medium
Small
Small
Small
Large
Medium


# 3. Dissecting CASE

Here is the general syntax of the CASE expression we used in the previous section, this time with an ELSE line:

    CASE
    WHEN <condition_1> THEN <value_1>
    WHEN <condition_2> THEN <value_2>
    ELSE <value_3>
    END AS <new_column_name>

Note the following:

* It starts with CASE to indicate that we'll be creating values by cases.
* It ends with END to indicate where the CASE clause terminates.
* The reserved word WHEN signals each explicit case.
* The value for each case follows the reserved word THEN.
* There is a fallback value indicated by the reserved word ELSE.

Here are some important observations:

* Anything you can use in WHERE for filtering, you can also use in place of the conditions above.
* There can be one or more WHEN lines. We demonstrated this with three, but it works with any number.
* The ELSE line is optional — without it, rows that don't match any WHEN will get a missing value (NULL).
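
To see the last point in action, here is a minimal sketch using Python's built-in sqlite3 module. The table name grads and its three rows are made up for illustration (they are not the real recent_grads data). The CASE has no ELSE, so the row that matches neither WHEN comes back as NULL, which Python shows as None.

```python
import sqlite3

# In-memory database with a tiny, made-up stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major TEXT, Sample_size INTEGER)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 36), ("B", 289), ("C", 1029)],
)

# No ELSE line: a row that matches neither WHEN gets NULL (None in Python).
rows = conn.execute(
    """
    SELECT Major,
           CASE
               WHEN Sample_size < 200 THEN 'Small'
               WHEN Sample_size < 1000 THEN 'Medium'
           END AS Sample_category
      FROM grads
     ORDER BY Sample_size
    """
).fetchall()

print(rows)
```

Adding `ELSE 'Large'` before END would replace the None with 'Large'.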

In [3]:
%%sql

SELECT Major, Sample_size,
        CASE
        WHEN Sample_size<200 THEN 'Small'
        WHEN Sample_size<1000 THEN 'Medium'
        ELSE 'Large'
        END AS Sample_category
    FROM recent_grads
LIMIT 20;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major,Sample_size,Sample_category
PETROLEUM ENGINEERING,36,Small
MINING AND MINERAL ENGINEERING,7,Small
METALLURGICAL ENGINEERING,3,Small
NAVAL ARCHITECTURE AND MARINE ENGINEERING,16,Small
CHEMICAL ENGINEERING,289,Medium
NUCLEAR ENGINEERING,17,Small
ACTUARIAL SCIENCE,51,Small
ASTRONOMY AND ASTROPHYSICS,10,Small
MECHANICAL ENGINEERING,1029,Large
ELECTRICAL ENGINEERING,631,Medium


# 4. Calculating Group-Level Summary Statistics

In [4]:
%%sql

SELECT Major_category, SUM(Total) AS Total_graduates
    FROM recent_grads
GROUP BY Major_category;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Total_graduates
Agriculture & Natural Resources,79981
Arts,357130
Biology & Life Science,453862
Business,1302376
Communications & Journalism,392601
Computers & Mathematics,299008
Education,559129
Engineering,537583
Health,463230
Humanities & Liberal Arts,713468


# 5. GROUP BY Visual Breakdown

Here's how the query works. The GROUP BY clause divides the rows of the table into groups, one for each unique value in the Major_category column, and the aggregation function then calculates the sum for each group. The following diagram shows how GROUP BY divides the data (the diagram uses a small sample from the recent_grads table):

![](https://dq-content.s3.amazonaws.com/254/5.1-group.gif)

For each group, the SQL engine then evaluates every aggregation function we list after SELECT:

![](https://dq-content.s3.amazonaws.com/254/5.2-sum.gif)

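For a selected column that is neither aggregated nor listed in GROUP BY, SQLite returns the value from one arbitrarily chosen row of the group (in practice, often the last one it reads); many other SQL dialects reject such a query. For each aggregation function in the SELECT clause, the engine computes the value across all rows of the group.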
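
The grouping step can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3 module with a made-up three-row table named grads (the column names mirror recent_grads, but the values are invented). It shows that each group collapses to one output row, with every aggregate computed across that group's rows.

```python
import sqlite3

# Tiny, invented sample: two Engineering majors and one Business major.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grads ("
    "Major TEXT, Major_category TEXT, Total INTEGER, ShareWomen REAL)"
)
conn.executemany(
    "INSERT INTO grads VALUES (?, ?, ?, ?)",
    [
        ("M1", "Engineering", 100, 0.25),
        ("M2", "Engineering", 300, 0.75),
        ("M3", "Business", 200, 0.5),
    ],
)

# One output row per distinct Major_category; COUNT, SUM and AVG are
# each evaluated over the rows that belong to that group.
summary = conn.execute(
    """
    SELECT Major_category, COUNT(*), SUM(Total), AVG(ShareWomen)
      FROM grads
     GROUP BY Major_category
     ORDER BY Major_category
    """
).fetchall()

for row in summary:
    print(row)
```

The ORDER BY is there only to make the row order predictable; without it, SQL does not guarantee the order in which groups are returned.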

In [5]:
%%sql

SELECT Major_category, AVG(ShareWomen) AS Average_women
    FROM recent_grads
GROUP BY Major_category;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Average_women
Agriculture & Natural Resources,0.6179384232
Arts,0.56185119575
Biology & Life Science,0.584518475857143
Business,0.4050631853076923
Communications & Journalism,0.64383484025
Computers & Mathematics,0.5127519954545455
Education,0.6749855163125
Engineering,0.2571578951034483
Health,0.6168565694166667
Humanities & Liberal Arts,0.6761934042


# 6. Multiple Summary Statistics by Group

In addition to computing multiple summary statistics for the whole table, as we did in the "Summary Statistics" lesson, we can also compute multiple summary statistics by groups.

Working from the same example, here's a query that for each major category finds the following:

* The total number of employed graduates
* The average number of employed graduates
* The maximum number of employed graduates in a major
* The minimum number of employed graduates in a major

    SELECT Major_category,
           SUM(Employed), AVG(Employed), MAX(Employed), MIN(Employed)
      FROM recent_grads
     GROUP BY Major_category;

In [6]:
%%sql

SELECT Major_category,
        SUM(Women) AS Total_women,
        AVG(ShareWomen) AS Mean_women,
        SUM(Total)*AVG(ShareWomen) AS Estimate_women
    FROM recent_grads
GROUP BY Major_category;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Total_women,Mean_women,Estimate_women
Agriculture & Natural Resources,249812,0.6179384232,49423.3330259592
Arts,140469,0.56185119575,200653.91753819748
Biology & Life Science,578132,0.584518475857143,265290.7244894746
Business,110367,0.4050631853076923,527544.5710282911
Communications & Journalism,98278,0.64383484025,252770.20211699023
Computers & Mathematics,62599,0.5127519954545455,153316.94865687276
Education,612958,0.6749855163125,377403.9767502918
Engineering,118051,0.2571578951034483,138243.71272339704
Health,312026,0.6168565694166667,285746.4686508825
Humanities & Liberal Arts,349636,0.6761934042,482442.3557077656


# 7. Multiple Group Columns

In [7]:
%%sql

CREATE TABLE new_grads
            AS SELECT *
            FROM recent_grads;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
(sqlite3.OperationalError) table new_grads already exists
[SQL: CREATE TABLE new_grads AS SELECT *
            FROM recent_grads;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


In [12]:
%%sql

ALTER TABLE new_grads
        ADD Sample_category;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
(sqlite3.OperationalError) duplicate column name: Sample_category
[SQL: ALTER TABLE new_grads ADD Sample_category;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


In [38]:
%%sql

UPDATE new_grads
    SET Sample_category = CASE
        WHEN Sample_size<200 THEN 'Small'
        WHEN Sample_size<1000 THEN 'Medium'
        WHEN Sample_size>=1000 THEN 'Large'
        END;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
173 rows affected.


[]

In [39]:
%%sql

SELECT Sample_size, Sample_category FROM new_grads LIMIT 20;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Sample_size,Sample_category
36,Small
7,Small
3,Small
16,Small
289,Medium
17,Small
51,Small
10,Small
1029,Large
631,Medium


In [43]:
%%sql

SELECT Major_category, 
    Sample_category,
    AVG(ShareWomen) AS Mean_women,
    SUM(Total) AS Total_graduates
    FROM new_grads
GROUP BY Major_category, Sample_category;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Sample_category,Mean_women,Total_graduates
Agriculture & Natural Resources,Medium,0.75257011,35813
Agriculture & Natural Resources,Small,0.5842805015,44168
Arts,Large,0.3743556229999999,103480
Arts,Medium,0.6070284135,217083
Arts,Small,0.5641134296666667,36567
Biology & Life Science,Large,0.601858152,280709
Biology & Life Science,Medium,0.584556133,25965
Biology & Life Science,Small,0.5830703647500001,147188
Business,Large,0.3981651196,1142867
Business,Medium,0.4335738126,130698


# 8. Querying Virtual Columns With the HAVING Statement

In [47]:
%%sql

SELECT Major_category,
    AVG(Low_wage_jobs)/AVG(Total) AS Share_low_wage
    FROM new_grads
GROUP BY Major_category
HAVING Share_low_wage>0.1;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Share_low_wage
Arts,0.1683308599109567
Communications & Journalism,0.1263241815481876
Humanities & Liberal Arts,0.1320872134419483
Industrial Arts & Consumer Services,0.1157133407603397
Law & Public Policy,0.1156850374357227
Psychology & Social Work,0.1169338491955418
Social Science,0.1022329734360317


# 9. Order of Execution

In the "Summary Statistics" lesson, we learned that when executing a SQL query, the computer runs the clauses in this order:

* FROM
* WHERE
* SELECT
* ORDER BY
* LIMIT

Now we know two more clauses: GROUP BY and HAVING. We can expand our mental model of the general structure of a query, whose clauses are written in this order: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.

The order in which SQL runs them is as follows:

* FROM
* WHERE
* GROUP BY
* HAVING
* SELECT
* ORDER BY
* LIMIT

We still haven't looked at using GROUP BY and ORDER BY simultaneously — we'll see an example later.

Note, however, that ORDER BY executes after GROUP BY. Ordering is mostly about presenting the final results, so it makes sense that it is one of the last clauses to run.
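
A query that uses most of these clauses at once makes the order easier to follow. This is a sketch using Python's built-in sqlite3 module; the grads table and its six rows are invented for illustration. WHERE removes individual rows before grouping, HAVING removes whole groups after aggregation, and ORDER BY and LIMIT act on what is left.

```python
import sqlite3

# Invented sample data: three categories with two majors each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Major_category TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [
        ("Engineering", 100),
        ("Engineering", 900),
        ("Business", 50),
        ("Business", 400),
        ("Arts", 150),
        ("Arts", 20),
    ],
)

rows = conn.execute(
    """
    SELECT Major_category, SUM(Total) AS Total_graduates
      FROM grads
     WHERE Total >= 100              -- row filter, applied before grouping
     GROUP BY Major_category
    HAVING SUM(Total) > 200          -- group filter, applied after aggregation
     ORDER BY Total_graduates DESC   -- sorts the surviving groups
     LIMIT 2
    """
).fetchall()

print(rows)
```

The Business row with 50 graduates is dropped by WHERE, so the Business sum is 400 rather than 450. Arts survives WHERE with a single row (150) but is then removed by HAVING.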

# 10. Rounding Results With the ROUND() Function

In [51]:
%%sql

SELECT ROUND(ShareWomen,4) AS Rounded_women,
    Major_category
    FROM new_grads
LIMIT 10;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Rounded_women,Major_category
0.1206,Engineering
0.1019,Engineering
0.153,Engineering
0.1073,Engineering
0.3416,Engineering
0.145,Engineering
0.5357,Business
0.4414,Physical Sciences
0.1398,Engineering
0.4378,Engineering


# 11. Nesting Functions

In [52]:
%%sql

SELECT Major_category,
    ROUND(AVG(College_jobs)/AVG(Total), 3) AS Share_degree_jobs
    FROM new_grads
GROUP BY Major_category
HAVING Share_degree_jobs<0.3;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,Share_degree_jobs
Agriculture & Natural Resources,0.248
Arts,0.265
Business,0.114
Communications & Journalism,0.22
Humanities & Liberal Arts,0.27
Industrial Arts & Consumer Services,0.249
Law & Public Policy,0.163
Social Science,0.215


# 12. Casting

If we try to divide two integer columns (Women and Total, for instance), SQLite (like several other SQL dialects, such as PostgreSQL and SQL Server) will perform integer division: it discards the fractional part and returns an integer.

Every time Women is smaller than Total, the result is 0, and when Women is larger, the result is 1. The query truncates the ratio rather than keeping its decimal part.

To get a float value, we can use the CAST() function to transform the columns into a Float type.
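
The difference is easy to check with Python's built-in sqlite3 module. In this sketch the one-row table grads is a stand-in created for illustration; its Women and Total values (282 and 2339) are taken from the PETROLEUM ENGINEERING row shown in the introduction.

```python
import sqlite3

# One-row stand-in table with integer columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads (Women INTEGER, Total INTEGER)")
conn.execute("INSERT INTO grads VALUES (282, 2339)")

# Integer / integer discards the fractional part;
# casting the operands to FLOAT keeps it.
int_ratio, float_ratio = conn.execute(
    """
    SELECT Women / Total,
           CAST(Women AS FLOAT) / CAST(Total AS FLOAT)
      FROM grads
    """
).fetchone()

print(int_ratio)
print(round(float_ratio, 6))
```

The first value is the integer 0, while the second matches the ShareWomen value listed for that major (about 0.120564).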

In [56]:
%%sql

SELECT Major_category, CAST(SUM(Women) AS FLOAT)/CAST(SUM(Total) AS FLOAT) AS SW
    FROM new_grads
GROUP BY Major_category
ORDER BY SW;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Major_category,SW
Law & Public Policy,0.0305850692602745
Business,0.0847428085284126
Industrial Arts & Consumer Services,0.1602492689040523
Computers & Mathematics,0.2093556025256849
Engineering,0.2195958577559186
Communications & Journalism,0.2503253939750535
Arts,0.3933273597849522
Humanities & Liberal Arts,0.490051410855147
Health,0.6735876346523325
Interdisciplinary,0.8009108653220559


# 13. Next Steps

In this lesson, we covered the CASE, GROUP BY, and HAVING clauses. We can use these to quickly calculate powerful summary statistics in SQL and program if/then logic. In the next few lessons, we'll learn more about working with SQL tables, including how to insert and modify data.